[jira] [Created] (ARROW-6143) [Java] Unify the copyFrom and copyFromSafe methods for all vectors

2019-08-05 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6143:
---

 Summary: [Java] Unify the copyFrom and copyFromSafe methods for 
all vectors
 Key: ARROW-6143
 URL: https://issues.apache.org/jira/browse/ARROW-6143
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Some vectors have their own implementations of copyFrom and copyFromSafe 
methods. 

Since we have extracted the copyFrom and copyFromSafe methods to the base 
interface (see ARROW-6021), we want all vectors' implementations to override 
the methods from the super interface.

This will provide a unified way of copying data elements. 
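For illustration, here is a hedged sketch (in Python, with invented names, not the actual Arrow Java API) of the copyFrom/copyFromSafe contract: the plain variant assumes the destination index is already within capacity, while the Safe variant grows the vector first.

```python
# Hedged sketch (hypothetical names) of the copyFrom / copyFromSafe
# distinction: copy_from assumes the destination index is in bounds,
# while copy_from_safe expands capacity first, as the *Safe variants
# conventionally do in Arrow Java.
class Vector:
    def __init__(self, capacity=4):
        self.data = [None] * capacity

    def copy_from(self, from_index, this_index, src):
        # No bounds handling: caller guarantees this_index < capacity.
        self.data[this_index] = src.data[from_index]

    def copy_from_safe(self, from_index, this_index, src):
        # Double capacity until the destination index fits, then copy.
        while this_index >= len(self.data):
            self.data.extend([None] * len(self.data))
        self.copy_from(from_index, this_index, src)

src = Vector()
src.data[0] = 42
dst = Vector(capacity=2)
dst.copy_from_safe(0, 5, src)   # index 5 is beyond capacity 2; grows
print(dst.data[5])  # 42
```

Pulling both methods into the base interface means callers can rely on this contract for every vector type.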



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6142) [R] Install instructions on linux could be clearer

2019-08-05 Thread Karl Dunkle Werner (JIRA)
Karl Dunkle Werner created ARROW-6142:
-

 Summary: [R] Install instructions on linux could be clearer
 Key: ARROW-6142
 URL: https://issues.apache.org/jira/browse/ARROW-6142
 Project: Apache Arrow
  Issue Type: Wish
  Components: R
Affects Versions: 0.14.1
 Environment: Ubuntu 19.04
Reporter: Karl Dunkle Werner
 Fix For: 0.15.0


Installing R packages on Linux is almost always from source, which means Arrow 
needs some system dependencies. The existing help message (from 
arrow::install_arrow()) is very helpful in pointing that out, but it's still a 
heavy lift for users who install R packages from source but don't plan to 
develop Arrow itself.

Here are a couple of things that could make things slightly smoother:
 1. I would be very grateful if the install_arrow() message or installation page
told me which libraries are essential to make the R package work.
 2. install_arrow() refers to a PPA. Previously I've only seen PPAs hosted on
launchpad.net, so the bintray URL threw me. Changing it to "bintray.com PPA"
instead of just "PPA" would have caused me less confusion. (Others may differ.)
 3. A snap package would be easier than adding a new apt source, but I
understand that building for snap would be more packaging work and would only
benefit Ubuntu users.

 

Thanks for making R bindings, and congratulations on the CRAN release!





[GitHub] [arrow-site] nealrichardson commented on a change in pull request #7: ARROW-6139: [Documentation][R] Build R docs (pkgdown) site and add to arrow-site

2019-08-05 Thread GitBox
nealrichardson commented on a change in pull request #7: ARROW-6139: 
[Documentation][R] Build R docs (pkgdown) site and add to arrow-site
URL: https://github.com/apache/arrow-site/pull/7#discussion_r310816056
 
 

 ##
 File path: assets/.sprockets-manifest-af5b3e33562477ba4f207e3ea1e798c0.json
 ##
 @@ -0,0 +1 @@
+{}
 
 Review comment:
   This is autogenerated by jekyll. There's some sorcery in the 
Jekyll/bootstrap asset rendering that I intend to clean up once 
https://lists.apache.org/thread.html/73e8a7d2ee667e83371e8bb99c13338144c8af5444f55918994713be@%3Cdev.arrow.apache.org%3E
 is resolved.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [arrow-site] kou commented on a change in pull request #7: ARROW-6139: [Documentation][R] Build R docs (pkgdown) site and add to arrow-site

2019-08-05 Thread GitBox
kou commented on a change in pull request #7: ARROW-6139: [Documentation][R] 
Build R docs (pkgdown) site and add to arrow-site
URL: https://github.com/apache/arrow-site/pull/7#discussion_r310807654
 
 

 ##
 File path: assets/.sprockets-manifest-af5b3e33562477ba4f207e3ea1e798c0.json
 ##
 @@ -0,0 +1 @@
+{}
 
 Review comment:
   Should we remove this?




Re: Proposal to move website source to arrow-site, add automatic builds

2019-08-05 Thread Joris Van den Bossche
This sounds like a good proposal to me (at least for the moment, while we have
separate docs and a main site).
I agree that documentation should indeed stay with the code, since you want to
update both together in PRs. But the website is something you typically update
separately, and you may also want to update it independently of code releases.
And certainly if this proposal makes it easier to work on the site, all the
better.

Joris

On Mon, 5 Aug 2019 at 20:30, Wes McKinney wrote:

> Let's wait a little while to collect any additional opinions about this.
>
> There's pretty good evidence from other Apache projects that this
> isn't too bad of an idea
>
> Apache Calcite: https://github.com/apache/calcite-site
> Apache Kafka: https://github.com/apache/kafka-site
> Apache Spark: https://github.com/apache/spark-website
>
> The Apache projects I've seen where the same repository is used for
> $FOO.apache.org tend to be ones where the documentation _is_ the
> website. I think we would need to commission a significant web design
> overhaul to be able to make our documentation page adequate as the
> landing point for visitors to https://arrow.apache.org.
>
> On Sat, Aug 3, 2019 at 3:46 PM Neal Richardson
>  wrote:
> >
> > Given the status quo, it would be difficult for this to make the Arrow
> > website less maintained. In fact, arrow-site is currently missing the
> > most recent two patches that modified the site directory in
> > apache/arrow. Having multiple manual deploy steps increases the
> > likelihood that the website stays stale.
> >
> > As someone who has been working on the arrow site lately, this
> > proposal makes it easier for me to make changes to the website because
> > I can automatically deploy my changes to a test site, and that lets
> > others in the community, who perhaps don't touch the website much,
> > verify that they're good.
> >
> > I agree that the documentation situation needs attention, but as I
> > said initially, that's orthogonal to this static site generation. I'd
> > like to work on that next, and I think these changes will make it
> > easier to do. I would not propose moving doc generation out of
> > apache/arrow--that belongs with the code.
> >
> > Neal
> >
> > On Sat, Aug 3, 2019 at 9:49 AM Wes McKinney  wrote:
> > >
> > > I think that the project website and the project documentation are
> > > currently distinct entities. The current Jekyll website is independent
> > > from the Sphinx documentation project aside from a link to the
> > > documentation from the website.
> > >
> > > I am guessing that we would want to maintain some amount of separation
> > > between the main site at arrow.apache.org and the code / format
> > > documentation, at minimum because we may want to make documentation
> > > available for multiple versions of the project (this has already been
> > > cited as an issue -- when we release, we're overwriting the previous
> > > version of the docs)
> > >
> > > On Sat, Aug 3, 2019 at 11:33 AM Antoine Pitrou 
> wrote:
> > > >
> > > >
> > > > I am concerned with this.  What happens if we happen to move part of
> the
> > > > current site to e.g. the Sphinx docs in the Arrow repository (we
> already
> > > > did that, so it's not theoretical)?
> > > >
> > > > More generally, I also think that any move towards separating website
> > > > and code repo more will lead to an even less maintained website.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > On 02/08/2019 at 22:39, Wes McKinney wrote:
> > > > > hi Neal,
> > > > >
> > > > > In general the improvements to the site sound good, and I agree
> with
> > > > > moving the site into the apache/arrow-site repository.
> > > > >
> > > > > It sounds like a committer will have to volunteer a PAT for the
> Travis
> > > > > CI settings in
> > > > >
> > > > > https://travis-ci.org/apache/arrow-site/settings
> > > > >
> > > > > Even though you can't get at such an environment variable there
> after
> > > > > it's set, it could still technically be compromised. Personally I
> > > > > wouldn't be comfortable having a token with "repo" scope out
> there. We
> > > > > might need to think about this some more -- the general idea of
> making
> > > > > it easier to deploy the website I'm totally on board with
> > > > >
> > > > > - Wes
> > > > >
> > > > >
> > > > > On Fri, Aug 2, 2019 at 1:35 PM Neal Richardson
> > > > >  wrote:
> > > > >>
> > > > >> Hi all,
> > > > >> https://issues.apache.org/jira/browse/ARROW-5746 requested to
> move the
> > > > >> source for https://arrow.apache.org out of `apache/arrow` due to
> the
> > > > >> growing number of binary files (mostly images) there.
> > > > >>
> > > > >> https://issues.apache.org/jira/browse/ARROW-4473 requested
> > > > >> improvements to the ability to make a test deploy of the website
> and
> > > > >> noted challenges/bugs in trying to do this when the site
> `baseurl` is
> > > > >> a subdirectory.
> > > > >>
> > > > >> On my fork of `arrow-site` [1] I have a 

[jira] [Created] (ARROW-6141) [C++] Enable memory-mapping a file region that is offset from the beginning of the file

2019-08-05 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6141:
---

 Summary: [C++] Enable memory-mapping a file region that is offset 
from the beginning of the file
 Key: ARROW-6141
 URL: https://issues.apache.org/jira/browse/ARROW-6141
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Currently {{MemoryMappedFile}} only allows for the entire file to be mapped. 
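For context, the capability being requested can be sketched with Python's standard-library mmap (an illustration only, not Arrow's C++ {{MemoryMappedFile}} API); note that the OS requires the mapping offset to be a multiple of the allocation granularity.

```python
import mmap
import os
import tempfile

# Illustrative sketch of mapping only a region of a file that starts
# at a nonzero offset. mmap requires the offset to be a multiple of
# the platform allocation granularity.
gran = mmap.ALLOCATIONGRANULARITY

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\x00" * gran)       # leading region we do not map
    f.write(b"arrow-region")      # the region we want mapped

with open(path, "rb") as f:
    region = mmap.mmap(f.fileno(), length=12, offset=gran,
                       access=mmap.ACCESS_READ)
    payload = bytes(region[:12])
    region.close()
os.remove(path)
print(payload)  # b'arrow-region'
```

An Arrow-level API would presumably wrap the same underlying mmap call, exposing the offset and length as parameters.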





[C++] Organizing JIRA issues and work related to the "Datasets" C++ subproject

2019-08-05 Thread Wes McKinney
hi all,

I just put together a document to help with creating and organizing
JIRA issues related to the Datasets project that we've been discussing
over the last 6 months

https://docs.google.com/document/d/1QOuz_6rIUskM0Dcxk5NwP8KhKn_qK6o_rFV3fbHQ_AM/edit?usp=sharing

I've left out work relating to expanding filesystem support, such as
S3, GCS, and Azure -- since we have a general purpose filesystem API
now, the initial Datasets implementation work need not be coupled to
implementing new filesystems (though some optimizations or options may
be required to improve performance for systems like S3, whose performance
characteristics differ greatly from local disk).

One concrete goal of this is to port Parquet-specific Dataset logic in
pyarrow/parquet.py into C++ so that we can have feature parity around
this in Python, R, and Ruby. Similarly, we wish to make this logic not
Parquet-specific so we can also deal with JSON, CSV, ORC, and later
Avro files.
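As a sketch of one piece of the Python logic in question (hive-style partition discovery; the function and its exact behavior are invented here for illustration and are not pyarrow's API):

```python
import os

# Hypothetical sketch of the hive-style partition discovery that
# pyarrow/parquet.py performs in Python and that is proposed to move
# into C++: walk a directory tree and parse key=value path segments
# into partition-column values for each data file.
def discover_partitions(root):
    files = []
    for dirpath, _, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        keys = {}
        if rel != ".":
            for seg in rel.split(os.sep):
                key, sep, value = seg.partition("=")
                if sep:
                    keys[key] = value
        for name in sorted(filenames):
            if name.endswith(".parquet"):
                files.append((os.path.join(dirpath, name), dict(keys)))
    return files
```

A C++ implementation of this discovery step is what would let Python, R, and Ruby share one behavior, and generalizing the file-reading side is what decouples it from Parquet.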

I know there are a number of people interested in this project, so I
don't want to get in anyone's way. I'm tied up with other work this
month at least so I likely won't be able to write any patches for this
until September at the earliest. I'll be glad to give edit access to
anyone who finds this document helpful and wants to add to it (e.g.
JIRA links).

Thanks,
Wes


Re: Proposal to move website source to arrow-site, add automatic builds

2019-08-05 Thread Wes McKinney
Let's wait a little while to collect any additional opinions about this.

There's pretty good evidence from other Apache projects that this
isn't too bad of an idea

Apache Calcite: https://github.com/apache/calcite-site
Apache Kafka: https://github.com/apache/kafka-site
Apache Spark: https://github.com/apache/spark-website

The Apache projects I've seen where the same repository is used for
$FOO.apache.org tend to be ones where the documentation _is_ the
website. I think we would need to commission a significant web design
overhaul to be able to make our documentation page adequate as the
landing point for visitors to https://arrow.apache.org.

On Sat, Aug 3, 2019 at 3:46 PM Neal Richardson
 wrote:
>
> Given the status quo, it would be difficult for this to make the Arrow
> website less maintained. In fact, arrow-site is currently missing the
> most recent two patches that modified the site directory in
> apache/arrow. Having multiple manual deploy steps increases the
> likelihood that the website stays stale.
>
> As someone who has been working on the arrow site lately, this
> proposal makes it easier for me to make changes to the website because
> I can automatically deploy my changes to a test site, and that lets
> others in the community, who perhaps don't touch the website much,
> verify that they're good.
>
> I agree that the documentation situation needs attention, but as I
> said initially, that's orthogonal to this static site generation. I'd
> like to work on that next, and I think these changes will make it
> easier to do. I would not propose moving doc generation out of
> apache/arrow--that belongs with the code.
>
> Neal
>
> On Sat, Aug 3, 2019 at 9:49 AM Wes McKinney  wrote:
> >
> > I think that the project website and the project documentation are
> > currently distinct entities. The current Jekyll website is independent
> > from the Sphinx documentation project aside from a link to the
> > documentation from the website.
> >
> > I am guessing that we would want to maintain some amount of separation
> > between the main site at arrow.apache.org and the code / format
> > documentation, at minimum because we may want to make documentation
> > available for multiple versions of the project (this has already been
> > cited as an issue -- when we release, we're overwriting the previous
> > version of the docs)
> >
> > On Sat, Aug 3, 2019 at 11:33 AM Antoine Pitrou  wrote:
> > >
> > >
> > > I am concerned with this.  What happens if we happen to move part of the
> > > current site to e.g. the Sphinx docs in the Arrow repository (we already
> > > did that, so it's not theoretical)?
> > >
> > > More generally, I also think that any move towards separating website
> > > and code repo more will lead to an even less maintained website.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On 02/08/2019 at 22:39, Wes McKinney wrote:
> > > > hi Neal,
> > > >
> > > > In general the improvements to the site sound good, and I agree with
> > > > moving the site into the apache/arrow-site repository.
> > > >
> > > > It sounds like a committer will have to volunteer a PAT for the Travis
> > > > CI settings in
> > > >
> > > > https://travis-ci.org/apache/arrow-site/settings
> > > >
> > > > Even though you can't get at such an environment variable there after
> > > > it's set, it could still technically be compromised. Personally I
> > > > wouldn't be comfortable having a token with "repo" scope out there. We
> > > > might need to think about this some more -- the general idea of making
> > > > it easier to deploy the website I'm totally on board with
> > > >
> > > > - Wes
> > > >
> > > >
> > > > On Fri, Aug 2, 2019 at 1:35 PM Neal Richardson
> > > >  wrote:
> > > >>
> > > >> Hi all,
> > > >> https://issues.apache.org/jira/browse/ARROW-5746 requested to move the
> > > >> source for https://arrow.apache.org out of `apache/arrow` due to the
> > > >> growing number of binary files (mostly images) there.
> > > >>
> > > >> https://issues.apache.org/jira/browse/ARROW-4473 requested
> > > >> improvements to the ability to make a test deploy of the website and
> > > >> noted challenges/bugs in trying to do this when the site `baseurl` is
> > > >> a subdirectory.
> > > >>
> > > >> On my fork of `arrow-site` [1] I have a solution to both. I created a
> > > >> `master` branch and copied the contents of the `site/` directory in
> > > >> `apache/arrow` to that, using `git filter-branch --prune-empty
> > > >> --subdirectory-filter site master` to preserve the commit history [2].
> > > >> Then I added a build script [3] that gets executed by Travis-CI [4].
> > > >>
> > > >> The script builds the Jekyll site and pushes it to a branch that gets
> > > >> published. On `apache/arrow-site`, commits to the `master` branch
> > > >> trigger a build of the Jekyll site and push the result to the
> > > >> `asf-site` branch. On forks, commits to `master` build the site and
> > > >> publish to the `gh-pages` branch, which can 

[GitHub] [arrow-site] wesm closed pull request #8: ARROW-4473: [Website] Support test site deployment

2019-08-05 Thread GitBox
wesm closed pull request #8: ARROW-4473: [Website] Support test site deployment
URL: https://github.com/apache/arrow-site/pull/8
 
 
   




[GitHub] [arrow-site] fsaintjacques commented on a change in pull request #8: ARROW-4473: [Website] Support test site deployment

2019-08-05 Thread GitBox
fsaintjacques commented on a change in pull request #8: ARROW-4473: [Website] 
Support test site deployment
URL: https://github.com/apache/arrow-site/pull/8#discussion_r310730531
 
 

 ##
 File path: build-and-deploy.sh
 ##
 @@ -0,0 +1,51 @@
+#!/bin/bash
+set -ev
+
+if [ "${TRAVIS_BRANCH}" = "master" ] && [ "${TRAVIS_PULL_REQUEST}" = "false" ]; then
+
+if [ -z "${GITHUB_PAT}" ]; then
+# Don't build because we can't publish
+echo "To publish the site, you must set a GITHUB_PAT at"
+echo "https://travis-ci.org/${TRAVIS_REPO_SLUG}/settings"
+exit 1
+fi
+
+# Set git config so that the author of the deployed site commit is the same
+# as the author of the commit we're building
+export AUTHOR_EMAIL=$(git log -1 --pretty=format:%ae)
+export AUTHOR_NAME=$(git log -1 --pretty=format:%an)
+git config --global user.email "${AUTHOR_EMAIL}"
+git config --global user.name "${AUTHOR_NAME}"
+
+if [ "${TRAVIS_REPO_SLUG}" = "apache/arrow-site" ]; then
+# Production
+export TARGET_BRANCH=asf-site
+export BASE_URL=https://arrow.apache.org
+else
+# On a fork, so we'll deploy to GitHub Pages
+export TARGET_BRANCH=gh-pages
+# You could supply an alternate BASE_URL, but that's not necessary
+# because we can infer it based on GitHub Pages conventions
+if [ -z "${BASE_URL}" ]; then
+export BASE_URL="https://"$(echo $TRAVIS_REPO_SLUG | sed 's@/@.github.io/@')
+fi
+fi
+# Set the site.baseurl
+perl -pe 's@^baseurl.*@baseurl: '"${BASE_URL}"'@' -i _config.yml
+
+# Build
+gem install jekyll bundler
+bundle install
+JEKYLL_ENV=production bundle exec jekyll build
+
+# Publish
+git clone -b ${TARGET_BRANCH} https://${GITHUB_PAT}@github.com/$TRAVIS_REPO_SLUG.git OUTPUT
+rsync -r build/ OUTPUT/
+cd OUTPUT
+
+git add .
 
 Review comment:
   Because it swallows the return code with `true`; normally such an error would
bubble up. That makes it easier to chain/re-use the script.




[GitHub] [arrow-site] nealrichardson commented on a change in pull request #8: ARROW-4473: [Website] Support test site deployment

2019-08-05 Thread GitBox
nealrichardson commented on a change in pull request #8: ARROW-4473: [Website] 
Support test site deployment
URL: https://github.com/apache/arrow-site/pull/8#discussion_r310729936
 
 

 ##
 File path: build-and-deploy.sh
 ##
 @@ -0,0 +1,51 @@
+#!/bin/bash
+set -ev
+
+if [ "${TRAVIS_BRANCH}" = "master" ] && [ "${TRAVIS_PULL_REQUEST}" = "false" ]; then
+
+if [ -z "${GITHUB_PAT}" ]; then
+# Don't build because we can't publish
+echo "To publish the site, you must set a GITHUB_PAT at"
+echo "https://travis-ci.org/${TRAVIS_REPO_SLUG}/settings"
+exit 1
+fi
+
+# Set git config so that the author of the deployed site commit is the same
+# as the author of the commit we're building
+export AUTHOR_EMAIL=$(git log -1 --pretty=format:%ae)
+export AUTHOR_NAME=$(git log -1 --pretty=format:%an)
+git config --global user.email "${AUTHOR_EMAIL}"
+git config --global user.name "${AUTHOR_NAME}"
+
+if [ "${TRAVIS_REPO_SLUG}" = "apache/arrow-site" ]; then
+# Production
+export TARGET_BRANCH=asf-site
+export BASE_URL=https://arrow.apache.org
+else
+# On a fork, so we'll deploy to GitHub Pages
+export TARGET_BRANCH=gh-pages
+# You could supply an alternate BASE_URL, but that's not necessary
+# because we can infer it based on GitHub Pages conventions
+if [ -z "${BASE_URL}" ]; then
+export BASE_URL="https://"$(echo $TRAVIS_REPO_SLUG | sed 's@/@.github.io/@')
+fi
+fi
+# Set the site.baseurl
+perl -pe 's@^baseurl.*@baseurl: '"${BASE_URL}"'@' -i _config.yml
+
+# Build
+gem install jekyll bundler
+bundle install
+JEKYLL_ENV=production bundle exec jekyll build
+
+# Publish
+git clone -b ${TARGET_BRANCH} https://${GITHUB_PAT}@github.com/$TRAVIS_REPO_SLUG.git OUTPUT
+rsync -r build/ OUTPUT/
+cd OUTPUT
+
+git add .
 
 Review comment:
   What is the race problem? If `git push` fails because GitHub is down, then 
the build will complete without pushing. That seems reasonable under the 
circumstances.




[jira] [Created] (ARROW-6140) [C++][Parquet] Support direct dictionary decoding of types other than BYTE_ARRAY

2019-08-05 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6140:
---

 Summary: [C++][Parquet] Support direct dictionary decoding of 
types other than BYTE_ARRAY
 Key: ARROW-6140
 URL: https://issues.apache.org/jira/browse/ARROW-6140
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


This is follow up work to ARROW-3772, ARROW-3325, and other patches. See 
discussion in 

https://github.com/apache/arrow/pull/4999#discussion_r310634479





[GitHub] [arrow-site] fsaintjacques commented on a change in pull request #8: ARROW-4473: [Website] Support test site deployment

2019-08-05 Thread GitBox
fsaintjacques commented on a change in pull request #8: ARROW-4473: [Website] 
Support test site deployment
URL: https://github.com/apache/arrow-site/pull/8#discussion_r310724697
 
 

 ##
 File path: build-and-deploy.sh
 ##
 @@ -0,0 +1,51 @@
+#!/bin/bash
+set -ev
+
+if [ "${TRAVIS_BRANCH}" = "master" ] && [ "${TRAVIS_PULL_REQUEST}" = "false" ]; then
+
+if [ -z "${GITHUB_PAT}" ]; then
+# Don't build because we can't publish
+echo "To publish the site, you must set a GITHUB_PAT at"
+echo "https://travis-ci.org/${TRAVIS_REPO_SLUG}/settings"
+exit 1
+fi
+
+# Set git config so that the author of the deployed site commit is the same
+# as the author of the commit we're building
+export AUTHOR_EMAIL=$(git log -1 --pretty=format:%ae)
+export AUTHOR_NAME=$(git log -1 --pretty=format:%an)
+git config --global user.email "${AUTHOR_EMAIL}"
+git config --global user.name "${AUTHOR_NAME}"
+
+if [ "${TRAVIS_REPO_SLUG}" = "apache/arrow-site" ]; then
+# Production
+export TARGET_BRANCH=asf-site
+export BASE_URL=https://arrow.apache.org
+else
+# On a fork, so we'll deploy to GitHub Pages
+export TARGET_BRANCH=gh-pages
+# You could supply an alternate BASE_URL, but that's not necessary
+# because we can infer it based on GitHub Pages conventions
+if [ -z "${BASE_URL}" ]; then
+export BASE_URL="https://"$(echo $TRAVIS_REPO_SLUG | sed 's@/@.github.io/@')
+fi
+fi
+# Set the site.baseurl
+perl -pe 's@^baseurl.*@baseurl: '"${BASE_URL}"'@' -i _config.yml
+
+# Build
+gem install jekyll bundler
+bundle install
+JEKYLL_ENV=production bundle exec jekyll build
+
+# Publish
+git clone -b ${TARGET_BRANCH} https://${GITHUB_PAT}@github.com/$TRAVIS_REPO_SLUG.git OUTPUT
+rsync -r build/ OUTPUT/
+cd OUTPUT
+
+git add .
 
 Review comment:
   Wrap the git stuff in a block guarded by a check with `git status 
--porcelain | wc -l`, otherwise there's a race if git push fails for a real 
reason, e.g. github being down. 




[GitHub] [arrow-site] fsaintjacques commented on a change in pull request #8: ARROW-4473: [Website] Support test site deployment

2019-08-05 Thread GitBox
fsaintjacques commented on a change in pull request #8: ARROW-4473: [Website] 
Support test site deployment
URL: https://github.com/apache/arrow-site/pull/8#discussion_r310727786
 
 

 ##
 File path: build-and-deploy.sh
 ##
 @@ -0,0 +1,51 @@
+#!/bin/bash
+set -ev
+
+if [ "${TRAVIS_BRANCH}" = "master" ] && [ "${TRAVIS_PULL_REQUEST}" = "false" ]; then
+
+if [ -z "${GITHUB_PAT}" ]; then
+# Don't build because we can't publish
+echo "To publish the site, you must set a GITHUB_PAT at"
+echo "https://travis-ci.org/${TRAVIS_REPO_SLUG}/settings"
+exit 1
+fi
+
+# Set git config so that the author of the deployed site commit is the same
+# as the author of the commit we're building
+export AUTHOR_EMAIL=$(git log -1 --pretty=format:%ae)
+export AUTHOR_NAME=$(git log -1 --pretty=format:%an)
+git config --global user.email "${AUTHOR_EMAIL}"
+git config --global user.name "${AUTHOR_NAME}"
+
+if [ "${TRAVIS_REPO_SLUG}" = "apache/arrow-site" ]; then
+# Production
+export TARGET_BRANCH=asf-site
+export BASE_URL=https://arrow.apache.org
+else
+# On a fork, so we'll deploy to GitHub Pages
+export TARGET_BRANCH=gh-pages
+# You could supply an alternate BASE_URL, but that's not necessary
+# because we can infer it based on GitHub Pages conventions
+if [ -z "${BASE_URL}" ]; then
+export BASE_URL="https://"$(echo $TRAVIS_REPO_SLUG | sed 's@/@.github.io/@')
+fi
+fi
+# Set the site.baseurl
+perl -pe 's@^baseurl.*@baseurl: '"${BASE_URL}"'@' -i _config.yml
 
 Review comment:
   Why not sed?
   
   ```
   01:arrow-site/ (master) $ grep -E '^baseurl' _config.yml 
   baseurl:
   01:arrow-site/ (master) $ sed -i -e's#^baseurl:.*#baseurl: HI#' _config.yml 
   01:arrow-site/ (master✗) $ grep -E '^baseurl' _config.yml 
   baseurl: HI
   01:arrow-site/ (master✗) $ sed -i -e's#^baseurl:.*#baseurl: NOPENOPENOPE#' 
_config.yml 
   01:arrow-site/ (master✗) $ grep -E '^baseurl' _config.yml 
   baseurl: NOPENOPENOPE
   ```




[GitHub] [arrow-site] nealrichardson opened a new pull request #8: ARROW-4473: [Website] Support test site deployment

2019-08-05 Thread GitBox
nealrichardson opened a new pull request #8: ARROW-4473: [Website] Support test 
site deployment
URL: https://github.com/apache/arrow-site/pull/8
 
 
   See 
https://lists.apache.org/thread.html/73e8a7d2ee667e83371e8bb99c13338144c8af5444f55918994713be@%3Cdev.arrow.apache.org%3E




[GitHub] [arrow-site] nealrichardson commented on issue #7: ARROW-6139: [Documentation][R] Build R docs (pkgdown) site and add to arrow-site

2019-08-05 Thread GitBox
nealrichardson commented on issue #7: ARROW-6139: [Documentation][R] Build R 
docs (pkgdown) site and add to arrow-site
URL: https://github.com/apache/arrow-site/pull/7#issuecomment-518334974
 
 
   > Currently there are no links to and from the main arrow.apache.org/docs/ 
site, but I'd like to take that up in a followup issue because there's more 
cleanup to do there than just adding a link.




[GitHub] [arrow-site] fsaintjacques commented on issue #7: ARROW-6139: [Documentation][R] Build R docs (pkgdown) site and add to arrow-site

2019-08-05 Thread GitBox
fsaintjacques commented on issue #7: ARROW-6139: [Documentation][R] Build R 
docs (pkgdown) site and add to arrow-site
URL: https://github.com/apache/arrow-site/pull/7#issuecomment-518332661
 
 
   I don't see the R documentation in the documentation drop-down menu.




[GitHub] [arrow-site] nealrichardson commented on issue #7: ARROW-6139: [Documentation][R] Build R docs (pkgdown) site and add to arrow-site

2019-08-05 Thread GitBox
nealrichardson commented on issue #7: ARROW-6139: [Documentation][R] Build R 
docs (pkgdown) site and add to arrow-site
URL: https://github.com/apache/arrow-site/pull/7#issuecomment-518327887
 
 
   Full site preview at https://enpiar.com/arrow-site/ and R docs at 
https://enpiar.com/arrow-site/docs/r/




[GitHub] [arrow-site] nealrichardson opened a new pull request #7: ARROW-6139: [Documentation][R] Build R docs (pkgdown) site and add to arrow-site

2019-08-05 Thread GitBox
nealrichardson opened a new pull request #7: ARROW-6139: [Documentation][R] 
Build R docs (pkgdown) site and add to arrow-site
URL: https://github.com/apache/arrow-site/pull/7
 
 
   This renders the site that was configured in 
https://issues.apache.org/jira/browse/ARROW-5452, using the 0.14.1 R CRAN 
(so-called) "release" branch. Currently there are no links to and from the main 
arrow.apache.org/docs/ site, but I'd like to take that up in a followup issue 
because there's more cleanup to do there than just adding a link.
   
   This patch also includes the last two patches to the website source in 
apache/arrow, which have not yet been published, and it has some relative URL 
fixes from my fork of the site source.




[jira] [Created] (ARROW-6139) [Documentation][R] Build R docs (pkgdown) site and add to arrow-site

2019-08-05 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-6139:
--

 Summary: [Documentation][R] Build R docs (pkgdown) site and add to 
arrow-site
 Key: ARROW-6139
 URL: https://issues.apache.org/jira/browse/ARROW-6139
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, R, Website
Reporter: Neal Richardson
Assignee: Neal Richardson


Now that the R package is up on CRAN, we should publish the documentation site. 
We should get this up before we publish the blog post (ARROW-6041) so that we 
can link to it in the post.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6138) [C++] Add a basic (single RecordBatch) implementation of Dataset

2019-08-05 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-6138:


 Summary: [C++] Add a basic (single RecordBatch) implementation of 
Dataset
 Key: ARROW-6138
 URL: https://issues.apache.org/jira/browse/ARROW-6138
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


The simplest case for a Dataset is one which views a single RecordBatch. This 
would be a usefully trivial test implementation and could yield some hints 
about API ergonomics.
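As a toy illustration in Python (standing in for the eventual C++ API, with a plain dict in place of a real RecordBatch), such a trivial Dataset might look like:

```python
class SingleBatchDataset:
    # Toy model of the simplest Dataset: it views exactly one batch,
    # so a scan yields that batch and nothing else.
    def __init__(self, batch):
        self._batch = batch

    def scan(self):
        yield self._batch

ds = SingleBatchDataset({"x": [1, 2, 3]})
batches = list(ds.scan())
print(len(batches))  # 1
```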





[jira] [Created] (ARROW-6137) [C++][Gandiva] Change output format of castVARCHAR(timestamp) in Gandiva

2019-08-05 Thread Prudhvi Porandla (JIRA)
Prudhvi Porandla created ARROW-6137:
---

 Summary: [C++][Gandiva] Change output format of 
castVARCHAR(timestamp) in Gandiva
 Key: ARROW-6137
 URL: https://issues.apache.org/jira/browse/ARROW-6137
 Project: Apache Arrow
  Issue Type: Task
  Components: C++ - Gandiva
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla


Format the timestamp as yyyy-MM-dd hh:mm:ss.sss.
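For illustration, a Python sketch of the target format (the helper name and the millisecond-truncation detail are my assumptions; the real implementation is Gandiva C++ codegen):

```python
from datetime import datetime

def cast_varchar_from_timestamp(ts: datetime) -> str:
    # yyyy-MM-dd hh:mm:ss.sss: date, time, and exactly three fractional digits.
    millis = ts.microsecond // 1000
    return ts.strftime("%Y-%m-%d %H:%M:%S.") + f"{millis:03d}"

example = cast_varchar_from_timestamp(datetime(2019, 8, 5, 12, 30, 45, 123456))
print(example)  # 2019-08-05 12:30:45.123
```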





Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-08-05 Thread Wes McKinney
I'm going to review Micah's Format PR and propose a vote. I would like
to take on the C++ implementation of this in the near future.

On Mon, Jul 29, 2019 at 11:34 AM Antoine Pitrou  wrote:
>
>
> Le 29/07/2019 à 18:32, Wes McKinney a écrit :
> >
> > We have a version number in the metadata already
> > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22. I
> > don't think we need to add other signaling beyond the 0xFFFFFFFF
> > (stream continues / non-null message) / 0x00000000 (stream end / null
> > message) marker at the beginning of the IPC payload
> >
> > Unless you envision a situation where the metadata cannot be obtained.
> > I would hope that this can never occur.
>
> Fair enough.  I hope so as well. :-)
>
> Regards
>
> Antoine.
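For readers following along, a minimal Python sketch of the message framing as I read the proposal (the padding and 8-byte alignment rules of the actual spec are omitted, and the exact end-of-stream encoding is an assumption):

```python
import struct

CONTINUATION = 0xFFFFFFFF  # "stream continues / non-null message"

def frame_message(metadata: bytes) -> bytes:
    # Continuation marker, then little-endian metadata length, then the
    # flatbuffer metadata itself.
    return struct.pack("<II", CONTINUATION, len(metadata)) + metadata

def read_frame(buf: bytes):
    # Returns the metadata bytes, or None at end of stream.
    marker, length = struct.unpack_from("<II", buf, 0)
    if marker != CONTINUATION:
        return None  # stream end / null message
    return buf[8:8 + length]
```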


Re: [C++][Python] Direct Arrow DictionaryArray reads from Parquet files

2019-08-05 Thread Wes McKinney
hi Hatem -- I was planning to look at the round-trip question here
early this week, since I have all the code fresh in my mind let me
have a look and I'll report back.

Accurately preserving the original dictionary in a round trip is tricky
because the low-level column writer doesn't expose any detail about
what it's doing with each batch of data. At minimum, if we
automatically set "read_dictionary" based on what the original schema
was, that gets us part of the way there.

- Wes

On Mon, Aug 5, 2019 at 5:34 AM Hatem Helal  wrote:
>
> Thanks for sharing this very illustrative benchmark.  Really nice to see the 
> huge benefit for languages that have a type for modelling categorical data.
>
> I'm interested in whether we can make the parquet/arrow integration 
> automatically handle the round-trip for Arrow DictionaryArrays.  We've had 
> this requested from users of the MATLAB-Parquet integration.  We've suggested 
> workarounds for those users but as your benchmark shows, you need to have 
> enough memory to store the "dense" representation.  I think this could be 
> solved by writing metadata with the Arrow data type.  An added benefit of 
> doing this at the Arrow-level is that any language that uses the C++ 
> parquet/arrow integration could round-trip DictionaryArrays.
>
> I'm not currently sure how all the pieces would fit together but let me know 
> if there is interest and I'm happy to flesh this out as a PR.
>
>
> On 8/2/19, 4:55 PM, "Wes McKinney"  wrote:
>
> I've been working (with Hatem Helal's assistance!) the last few months
> to put the pieces in place to enable reading BYTE_ARRAY columns in
> Parquet files directly to Arrow DictionaryArray. As context, it's not
> uncommon for a Parquet file to occupy ~100x less (even greater
> compression factor) space on disk than fully-decoded in memory when
> there are a lot of common strings. Users get frustrated sometimes when
> they read a "small" Parquet file and have memory use problems.
>
> I made a benchmark to exhibit an example "worst case scenario"
>
> https://gist.github.com/wesm/450d85e52844aee685c0680111cbb1d7
>
> In this example, we have a table with a single column containing 10
> million values drawn from a dictionary of 1000 values that's about 50
> kilobytes in size. Written to Parquet, the file is a little over 1
> megabyte due to Parquet's layers of compression. But read naively to
> Arrow BinaryArray, about 500MB of memory is taken up (10M values * 54
> bytes per value). With the new decoding machinery, we can skip the
> dense decoding of the binary data and append the Parquet file's
> internal dictionary indices directly into an arrow::DictionaryBuilder,
> yielding a DictionaryArray at the end. The end result uses less than
> 10% as much memory (about 40MB compared with 500MB) and is almost 20x
> faster to decode.
>
> The PR making this available finally in Python is here:
> https://github.com/apache/arrow/pull/4999
>
> Complex, multi-layered projects like this can be a little bit
> inscrutable when discussed strictly at a code/technical level, but I
> hope this helps show that employing dictionary encoding can have a lot
> of user impact both in memory use and performance.
>
> - Wes
>
>
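The arithmetic in the message above can be checked in plain Python (the row and size figures are the ones quoted in the thread; the int32-per-index assumption for the dictionary representation is mine):

```python
n_values = 10_000_000   # rows in the column
n_dict = 1_000          # distinct strings in the dictionary
avg_bytes = 54          # average bytes per string value

# Fully decoded BinaryArray: every value stored densely.
dense_mb = n_values * avg_bytes / 1e6

# DictionaryArray: one copy of each distinct string plus an index per row.
dict_mb = (n_dict * avg_bytes + n_values * 4) / 1e6

print(f"dense:      ~{dense_mb:.0f} MB")   # ~540 MB
print(f"dictionary: ~{dict_mb:.0f} MB")    # ~40 MB
```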


[jira] [Created] (ARROW-6136) [FlightRPC][Java] Don't double-close response stream

2019-08-05 Thread lidavidm (JIRA)
lidavidm created ARROW-6136:
---

 Summary: [FlightRPC][Java] Don't double-close response stream
 Key: ARROW-6136
 URL: https://issues.apache.org/jira/browse/ARROW-6136
 Project: Apache Arrow
  Issue Type: Bug
  Components: FlightRPC, Java
Affects Versions: 0.14.1
Reporter: lidavidm
Assignee: lidavidm
 Fix For: 0.15.0


DoPut in Java double-closes the metadata response stream: if the service 
implementation sends an error down that channel, the Flight implementation will 
unconditionally try to complete the stream, violating the gRPC semantics 
(either an error or a completion may be sent, never both).
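A toy Python model of the rule being violated (names are illustrative, not the actual Flight or gRPC API):

```python
class ResponseObserver:
    # gRPC semantics: a stream is terminated by EITHER an error OR a
    # completion, and exactly once.
    def __init__(self):
        self.terminated = False

    def on_error(self, exc):
        if self.terminated:
            raise RuntimeError("stream already terminated")
        self.terminated = True

    def on_completed(self):
        if self.terminated:
            # This is the double-close the issue describes: completing a
            # stream the service already terminated with an error.
            raise RuntimeError("stream already terminated")
        self.terminated = True
```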





Re: [C++][Python] Direct Arrow DictionaryArray reads from Parquet files

2019-08-05 Thread Hatem Helal
Thanks for sharing this very illustrative benchmark.  Really nice to see the 
huge benefit for languages that have a type for modelling categorical data.

I'm interested in whether we can make the parquet/arrow integration 
automatically handle the round-trip for Arrow DictionaryArrays.  We've had this 
requested from users of the MATLAB-Parquet integration.  We've suggested 
workarounds for those users but as your benchmark shows, you need to have 
enough memory to store the "dense" representation.  I think this could be 
solved by writing metadata with the Arrow data type.  An added benefit of doing 
this at the Arrow-level is that any language that uses the C++ parquet/arrow 
integration could round-trip DictionaryArrays.

I'm not currently sure how all the pieces would fit together but let me know if 
there is interest and I'm happy to flesh this out as a PR.


On 8/2/19, 4:55 PM, "Wes McKinney"  wrote:

I've been working (with Hatem Helal's assistance!) the last few months
to put the pieces in place to enable reading BYTE_ARRAY columns in
Parquet files directly to Arrow DictionaryArray. As context, it's not
uncommon for a Parquet file to occupy ~100x less (even greater
compression factor) space on disk than fully-decoded in memory when
there are a lot of common strings. Users get frustrated sometimes when
they read a "small" Parquet file and have memory use problems.

I made a benchmark to exhibit an example "worst case scenario"

https://gist.github.com/wesm/450d85e52844aee685c0680111cbb1d7

In this example, we have a table with a single column containing 10
million values drawn from a dictionary of 1000 values that's about 50
kilobytes in size. Written to Parquet, the file is a little over 1
megabyte due to Parquet's layers of compression. But read naively to
Arrow BinaryArray, about 500MB of memory is taken up (10M values * 54
bytes per value). With the new decoding machinery, we can skip the
dense decoding of the binary data and append the Parquet file's
internal dictionary indices directly into an arrow::DictionaryBuilder,
yielding a DictionaryArray at the end. The end result uses less than
10% as much memory (about 40MB compared with 500MB) and is almost 20x
faster to decode.

The PR making this available finally in Python is here:
https://github.com/apache/arrow/pull/4999

Complex, multi-layered projects like this can be a little bit
inscrutable when discussed strictly at a code/technical level, but I
hope this helps show that employing dictionary encoding can have a lot
of user impact both in memory use and performance.

- Wes




[jira] [Created] (ARROW-6135) [C++] KeyValueMetadata::Equals should not be order-sensitive

2019-08-05 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-6135:
-

 Summary: [C++] KeyValueMetadata::Equals should not be 
order-sensitive
 Key: ARROW-6135
 URL: https://issues.apache.org/jira/browse/ARROW-6135
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.14.1
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


Currently, two KeyValueMetadata instances with the same key/value pairs but in 
a different order compare unequal.
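A sketch of the intended order-insensitive comparison in Python (I model the pairs as a multiset, on the assumption that Arrow metadata permits duplicate keys):

```python
from collections import Counter

def kv_equals_unordered(kv1, kv2):
    # Compare key/value pairs ignoring order; Counter keeps multiplicity,
    # so repeated identical pairs still have to match on both sides.
    return Counter(kv1) == Counter(kv2)

a = [("pandas", "0.25"), ("origin", "parquet")]
b = [("origin", "parquet"), ("pandas", "0.25")]
print(kv_equals_unordered(a, b))  # True
```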





[jira] [Created] (ARROW-6134) [C++][Gandiva] Add concat function in Gandiva

2019-08-05 Thread Prudhvi Porandla (JIRA)
Prudhvi Porandla created ARROW-6134:
---

 Summary: [C++][Gandiva] Add concat function in Gandiva
 Key: ARROW-6134
 URL: https://issues.apache.org/jira/browse/ARROW-6134
 Project: Apache Arrow
  Issue Type: Task
  Components: C++ - Gandiva
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla


* remove concat alias for concatOperator
 * add concat(utf8, utf8) function
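A Python sketch of the distinction as I understand it (the null-handling of the new concat is my assumption, based on SQL engines that keep both functions; verify against the Gandiva implementation):

```python
def concat_operator(a, b):
    # concatOperator (the `||` semantics): result is null if either side is null.
    return None if a is None or b is None else a + b

def concat(a, b):
    # concat(utf8, utf8): assumed here to treat null inputs as empty strings.
    return (a or "") + (b or "")

print(concat_operator("foo", None))  # None
print(concat("foo", None))           # foo
```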





[jira] [Created] (ARROW-6133) Schema Missing Exception in ArrowStreamReader

2019-08-05 Thread Boris V.Kuznetsov (JIRA)
Boris V.Kuznetsov created ARROW-6133:


 Summary: Schema Missing Exception in ArrowStreamReader
 Key: ARROW-6133
 URL: https://issues.apache.org/jira/browse/ARROW-6133
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 0.14.1
Reporter: Boris V.Kuznetsov


Hello

My colleague and I are trying to pass Arrow data through Kafka. He uses 
PyArrow; I'm using the Java API from Scala.

Here's the Transmitter code:

{code}
import pyarrow as pa

def record_batch_to_bytes(df):
    batch = pa.RecordBatch.from_pandas(df)
    ser_ = pa.serialize(batch)
    return bytes(ser_.to_buffer())
{code}

 

My colleague is able to read this stream with the Python API:

{code}
def bytes_to_batch_record(bytes_):
    batch = pa.deserialize(bytes_)
    print(batch.schema)
{code}

On the Receiver side, I use the following from Java API:

 
{code}
def deserialize(din: Chunk[BArr]): Chunk[ArrowStreamReader] =
  for {
    arr <- din
    stream = new ByteArrayInputStream(arr)
  } yield new ArrowStreamReader(stream, allocator)

reader = deserialize(arr)
schema = reader.map(r => r.getVectorSchemaRoot.getSchema)
empty  = reader.map(r => r.loadNextBatch)
{code}
 

This fails with an exception on both the {{schema}} and {{empty}} lines of the last snippet:

{code}
Fiber failed.
An unchecked error was produced.
java.io.IOException: Unexpected end of input. Missing schema.
    at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:135)
    at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:178)
    at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:169)
    at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:62)
    at nettest.ArrowSpec.$anonfun$testConsumeArrow$7(Arrow.scala:96)
    at zio.Chunk$Arr.map(Chunk.scala:722)
{code}

 

The full Scala code is 
[here|https://github.com/Clover-Group/zio-tsp/blob/46e34c7c060bf4061067922077bbe05ea4b9f301/src/test/scala/Arrow.scala#L95]

 

How do I resolve this? We are both using Arrow 0.14.1, and my colleague has no 
issues with the PyArrow API.

Thank you!





[jira] [Created] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

2019-08-05 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6132:


 Summary: [Python] ListArray.from_arrays does not check validity of 
input arrays
 Key: ARROW-6132
 URL: https://issues.apache.org/jira/browse/ARROW-6132
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.

When creating a ListArray from offsets and values in Python, there is no 
validation that the offsets start with 0 and end with the length of the values 
array. (But is that required? The docs seem to indicate so: 
https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
states "The first value in the offsets array is 0, and the last element is the 
length of the values array.")

The array you get "seems" OK (judging by the repr), but on conversion to Python 
or flattening, things go wrong:

{code}
In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 

In [62]: a
Out[62]: 

[
  [
1,
2
  ],
  [
3,
4
  ]
]

In [63]: a.flatten()
Out[63]: 

[
  0,   # <--- includes the 0
  1,
  2,
  3,
  4
]

In [64]: a.to_pylist()
Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <-- includes garbage elements
{code}


Calling {{validate}} manually correctly raises:

{code}
In [65]: a.validate()
...
ArrowInvalid: Final offset invariant not equal to values length: 10!=5
{code}

In C++ the main constructors are not safe: as the caller, you need to ensure 
that the data is correct, or call a safe (slower) constructor. But do we want 
to use the unsafe / fast constructors without validation as the default in 
Python as well? Or should we call {{validate}} here?

A quick search seems to indicate that {{pa.Array.from_buffers}} does validation, 
but the other {{from_arrays}} methods don't seem to do this explicitly. 
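For reference, the invariant that {{validate}} enforces can be sketched in a few lines of Python (a simplified model, not the actual Arrow C++ check):

```python
def check_list_offsets(offsets, values_len):
    # Invariants for ListArray offsets: start at 0, be non-decreasing,
    # and end at the length of the values array.
    if offsets[0] != 0:
        raise ValueError("first offset must be 0")
    if any(a > b for a, b in zip(offsets, offsets[1:])):
        raise ValueError("offsets must be non-decreasing")
    if offsets[-1] != values_len:
        raise ValueError(
            f"Final offset invariant not equal to values length: "
            f"{offsets[-1]}!={values_len}")

check_list_offsets([0, 2, 5], 5)  # passes silently
```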


