Re: [RESULT][VOTE] Release Apache Arrow 0.14.0 - RC0

2019-07-05 Thread Micah Kornfield
>
> * All pull requests need to rebase on master by
> "Rebasing the master branch on local release branch"

Since it doesn't look like it's been claimed, I'll do it.

On Thu, Jul 4, 2019 at 12:46 AM Sutou Kouhei  wrote:

> Hi,
>
> I need your help!
> Could Rust developers look at the "Failed:" section?
> Could someone take over the tasks in the "Need help:" section?
>
> Failed:
>
>   * Updating Rust packages
>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingRustpackages
>
> * We need the following patch:
>
> 
> diff --git a/dev/release/post-07-rust.sh b/dev/release/post-07-rust.sh
> index a2f6e2988..c632fa793 100755
> --- a/dev/release/post-07-rust.sh
> +++ b/dev/release/post-07-rust.sh
> @@ -53,6 +53,12 @@ curl \
>  rm -rf ${archive_name}
>  tar xf ${tar_gz}
>  modules=()
> +  sed \
> +-i \
> +-E \
> +-e 's/^arrow = "([^"]*)"/arrow = { version = "\1", path = "..\/arrow" }/g' \
> +-e 's/^parquet = "([^"]*)"/parquet = { version = "\1", path = "..\/parquet" }/g' \
> +${archive_name}/rust/*/Cargo.toml
>  for cargo_toml in ${archive_name}/rust/*/Cargo.toml; do
>module_dir=$(dirname ${cargo_toml})
>pushd ${module_dir}
> 
>
> * "INSTALL_RUST=yes dev/release/post-07-rust.sh 0.14.0"
>   still fails with the above patch applied:
>
> 
>Packaging arrow v0.14.0
> (/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/arrow)
>Verifying arrow v0.14.0
> (/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/arrow)
> error: failed to verify package tarball
>
> Caused by:
>   failed to parse manifest at
> `/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/target/package/arrow-0.14.0/Cargo.toml`
>
> Caused by:
>   can't find `array_from_vec` bench, specify bench.path
> 
>
> * How to solve this?
>
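A hypothetical first diagnostic (the tarball name and layout here are assumptions): check whether the bench sources declared in rust/arrow/Cargo.toml actually made it into the source archive that cargo packages from. If they did not, one likely resolution is to either ship them in the source release or adjust/remove the [[bench]] entries before `cargo package` runs.

```
import tarfile

# Assumed tarball name/location; adjust to wherever the release script
# downloads the source archive.
with tarfile.open("apache-arrow-0.14.0.tar.gz") as tar:
    benches = [name for name in tar.getnames()
               if "/rust/arrow/benches/" in name]

print(benches if benches else "no rust/arrow/benches/ files in the tarball")
```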
> Done:
>
>   * Rebasing the master branch on local release branch
>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Rebasingthemasterbranchonlocalreleasebranch
>
>   * Marking the released version as "RELEASED" on JIRA
>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Markingthereleasedversionas%22RELEASED%22onJIRA
>
>   * Starting the new version on JIRA
>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-StartingthenewversiononJIRA
>
>   * Partially: Updating the Arrow website
>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingtheArrowwebsite
>
> * Release note has been added.
> * No blog post.
> * Not uploaded to the website yet.
>
>   * Uploading source release artifacts to SVN
>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UploadingsourcereleaseartifactstoSVN
>
>   * Uploading binary release artifacts to Bintray
>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UploadingbinaryreleaseartifactstoBintray
>
>   * Partially: Announcing release
>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Announcingrelease
>
> * Added release date.
> * Release announcement not yet sent to announce@ and dev@.
>
>   * Partially: Updating C++ and Python packages
>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingC++andPythonpackages
>
> * Uploaded to PyPI.
>   * Wrote an upload shell script but have not created a pull request yet.
> * Conda packages not updated yet
>
>   * Updating Java Maven artifacts in Maven central
>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingJavaMavenartifactsinMavencentral
>
>   * Updating Ruby packages
>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingRubypackages
>
>   * Updating JavaScript packages
>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingJavaScriptpackages
>
>   * Updating .NET NuGet packages
>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Updating.NETNuGetpackages
>
>   * Removing source artifacts for RC
>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-RemovingsourceartifactsforRC
>
> Need help:
>
>   * All pull requests need to rebase on master by
> "Rebasing the master branch on local release branch"
>
>   * Blog post
>
>   * Update website
>
>   * Announcing release to announce@ and dev@
>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Announcingrelease
>
>   * Updating website with new API documentation
>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Updatingwebsitew

Re: [RESULT][VOTE] Release Apache Arrow 0.14.0 - RC0

2019-07-05 Thread Micah Kornfield
Actually, can someone clarify whether the correct approach here is to clone
@Kou's repo and use his RC0 branch to do the rebase?

e.g. run:

"./dev/release/post-00-rebase.sh apache-arrow-0.14.0-rc0"?


Thanks,

Micah

On Fri, Jul 5, 2019 at 12:38 AM Micah Kornfield 
wrote:

> * All pull requests need to rebase on master by
>> "Rebasing the master branch on local release branch"
>
> Since it doesn't look like its been claimed i'll do it.
>
> On Thu, Jul 4, 2019 at 12:46 AM Sutou Kouhei  wrote:
>
>> Hi,
>>
>> I need your help!
>> Could Rust developers see "Failed:" section?
>> Could someone take over tasks in "Need helped:" section?
>>
>> Failed:
>>
>>   * Updating Rust packages
>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingRustpackages
>>
>> * We need the following patch:
>>
>> 
>> diff --git a/dev/release/post-07-rust.sh b/dev/release/post-07-rust.sh
>> index a2f6e2988..c632fa793 100755
>> --- a/dev/release/post-07-rust.sh
>> +++ b/dev/release/post-07-rust.sh
>> @@ -53,6 +53,12 @@ curl \
>>  rm -rf ${archive_name}
>>  tar xf ${tar_gz}
>>  modules=()
>> +  sed \
>> +-i \
>> +-E \
>> +-e 's/^arrow = "([^"]*)"/arrow = { version = "\1", path =
>> "..\/arrow" }/g' \
>> +-e 's/^parquet = "([^"]*)"/parquet = { version = "\1", path =
>> "..\/parquet" }/g' \
>> +${archive_name}/rust/*/Cargo.toml
>>  for cargo_toml in ${archive_name}/rust/*/Cargo.toml; do
>>module_dir=$(dirname ${cargo_toml})
>>pushd ${module_dir}
>> 
>>
>> * "INSTALL_RUST=yes dev/release/post-07-rust.sh 0.14.0"
>>   is failed with the above patch:
>>
>> 
>>Packaging arrow v0.14.0
>> (/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/arrow)
>>Verifying arrow v0.14.0
>> (/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/arrow)
>> error: failed to verify package tarball
>>
>> Caused by:
>>   failed to parse manifest at
>> `/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/target/package/arrow-0.14.0/Cargo.toml`
>>
>> Caused by:
>>   can't find `array_from_vec` bench, specify bench.path
>> 
>>
>> * How to solve this?
>>
>> Done:
>>
>>   * Rebasing the master branch on local release branch
>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Rebasingthemasterbranchonlocalreleasebranch
>>
>>   * Marking the released version as "RELEASED" on JIRA
>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Markingthereleasedversionas%22RELEASED%22onJIRA
>>
>>   * Starting the new version on JIRA
>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-StartingthenewversiononJIRA
>>
>>   * Partially: Updating the Arrow website
>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingtheArrowwebsite
>>
>> * Release note has been added.
>> * No blog post.
>> * Not upload to website yet.
>>
>>   * Uploading source release artifacts to SVN
>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UploadingsourcereleaseartifactstoSVN
>>
>>   * Uploading binary release artifacts to Bintray
>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UploadingbinaryreleaseartifactstoBintray
>>
>>   * Partially: Announcing release
>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Announcingrelease
>>
>> * Added release date.
>> * Not send release announce to announce@ and dev@ yet.
>>
>>   * Partially: Updating C++ and Python packages
>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingC++andPythonpackages
>>
>> * Uploaded to PyPI.
>>   * Wrote upload shell script but not create pull request yet.
>> * Not update conda packages yet
>>
>>   * Updating Java Maven artifacts in Maven central
>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingJavaMavenartifactsinMavencentral
>>
>>   * Updating Ruby packages
>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingRubypackages
>>
>>   * Updating JavaScript packages
>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingJavaScriptpackages
>>
>>   * Updating .NET NuGet packages
>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Updating.NETNuGetpackages
>>
>>   * Removing source artifacts for RC
>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-RemovingsourceartifactsforRC
>>
>> Need help:
>>
>>   * All pull requests need to rebase on master by
>> "Rebasing the master branch on 

Re: linking 3rd party cython modules against pyarrow fails since 0.14.0

2019-07-05 Thread Weston Steimel
Hello,

I wonder if that may be due to the work done to reduce the wheel
size in https://issues.apache.org/jira/browse/ARROW-5082?

On Thu, Jul 4, 2019 at 10:06 PM Stestagg  wrote:

> 1) pip install pyarrow==0.14.0
> 2) All the pyarrow files are there, including, for example, libarrow.so.14, but
> not libarrow.so (hence the linker error)
>
> Reproducible on Python 3.7.2 on linux mint 19.1 and debian docker:
>
> Example dockerfile:
> ```
> FROM debian:unstable-slim
>
> RUN apt-get update && apt-get upgrade -y
> RUN apt-get install -y python3 python3-dev
> RUN apt-get install -y python3-pip
>
> RUN python3 -m pip install --upgrade pip
> RUN pip3 install Cython pyarrow
> COPY setup.py /root
> COPY test.pyx /root
> WORKDIR /root
> RUN python3 setup.py build_ext --inplace
> ```
>
> Where setup.py and test.pyx are the files listed above, with an added call
> to numpy.get_include().
>
> Appending ' ==0.13.0' to the 'RUN pip3 install...' line above results in
> the docker image building successfully.
>
> Steve
>
> On Thu, Jul 4, 2019 at 10:37 PM Antoine Pitrou  wrote:
>
> >
> > Hi,
> >
> > 1) How did you install PyArrow?
> >
> > 2) What does /usr/local/lib/python3.7/dist-packages/pyarrow contain?
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 04/07/2019 à 22:10, Stestagg a écrit :
> > > Hi
> > >
> > > I've got a cython module that links against PyArrow, using
> > > 'pyarrow.get_libraries()' and the associated methods.
> > >
> > > Builds on Windows and Linux are consistently failing against 0.14, but
> > > working on 0.12 to 0.13.
> > >
> > > Linux gives:
> > > x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions
> > > -Wl,-z,relro -Wl,-z,relro -g -fstack-protector-strong -Wformat
> > > -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2
> > > build/temp.linux-x86_64-3.7/test.o
> > > -L/usr/local/lib/python3.7/dist-packages/pyarrow -larrow -larrow_python
> > -o
> > > /home/dduser/att/arrowtest.cpython-37m-x86_64-linux-gnu.so
> > > /usr/bin/ld: cannot find -larrow
> > > /usr/bin/ld: cannot find -larrow_python
> > > collect2: error: ld returned 1 exit status
> > > error: command 'x86_64-linux-gnu-g++' failed with exit status 1
> > >
> > > The Windows build is more funky, but I'm still investigating.
> > >
> > > A minimal example is:
> > >
> > > setup.py:
> > >
> > > import pyarrow
> > > from Cython.Build import cythonize
> > > from distutils.command.build_clib import build_clib
> > > from distutils.core import setup, Extension
> > >
> > >
> > > OPTIONS = {
> > > 'sources': ["test.pyx"],
> > > 'language': "c++",
> > > 'include_dirs':  [pyarrow.get_include()],
> > > 'libraries': pyarrow.get_libraries(),
> > > 'library_dirs': pyarrow.get_library_dirs()
> > > }
> > >
> > > setup(
> > > name='arrowtest',
> > > ext_modules = cythonize(Extension("arrowtest",**OPTIONS)),
> > > cmdclass = {'build_clib': build_clib},
> > > version="0.1",
> > > )
> > >
> > > test.pyx:
> > >
> > > import pyarrow as pa
> > > cimport pyarrow.lib as pa
> > >
> > > Thanks
> > >
> > > Steve
> > >
> >
>


Re: [RESULT][VOTE] Release Apache Arrow 0.14.0 - RC0

2019-07-05 Thread Krisztián Szűcs
Hey Micah,

Kou has already rebased the master branch of apache/arrow. So if you
want to rebase PRs, then you should rebase on top of apache/arrow@master.

On Fri, Jul 5, 2019 at 10:01 AM Micah Kornfield 
wrote:

> Actually, can someone clarify is the correct approach here to clone the
> @Kou's repo and use his RC0 branch to do the rebase?
>
> e.g. run:
>
> "./dev/release/post-00-rebase.sh apache-arrow-0.14.0-rc0"?
>
>
> Thanks,
>
> Micah
>
> On Fri, Jul 5, 2019 at 12:38 AM Micah Kornfield 
> wrote:
>
> > * All pull requests need to rebase on master by
> >> "Rebasing the master branch on local release branch"
> >
> > Since it doesn't look like its been claimed i'll do it.
> >
> > On Thu, Jul 4, 2019 at 12:46 AM Sutou Kouhei  wrote:
> >
> >> Hi,
> >>
> >> I need your help!
> >> Could Rust developers see "Failed:" section?
> >> Could someone take over tasks in "Need helped:" section?
> >>
> >> Failed:
> >>
> >>   * Updating Rust packages
> >>
> >>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingRustpackages
> >>
> >> * We need the following patch:
> >>
> >> 
> >> diff --git a/dev/release/post-07-rust.sh b/dev/release/post-07-rust.sh
> >> index a2f6e2988..c632fa793 100755
> >> --- a/dev/release/post-07-rust.sh
> >> +++ b/dev/release/post-07-rust.sh
> >> @@ -53,6 +53,12 @@ curl \
> >>  rm -rf ${archive_name}
> >>  tar xf ${tar_gz}
> >>  modules=()
> >> +  sed \
> >> +-i \
> >> +-E \
> >> +-e 's/^arrow = "([^"]*)"/arrow = { version = "\1", path =
> >> "..\/arrow" }/g' \
> >> +-e 's/^parquet = "([^"]*)"/parquet = { version = "\1", path =
> >> "..\/parquet" }/g' \
> >> +${archive_name}/rust/*/Cargo.toml
> >>  for cargo_toml in ${archive_name}/rust/*/Cargo.toml; do
> >>module_dir=$(dirname ${cargo_toml})
> >>pushd ${module_dir}
> >> 
> >>
> >> * "INSTALL_RUST=yes dev/release/post-07-rust.sh 0.14.0"
> >>   is failed with the above patch:
> >>
> >> 
> >>Packaging arrow v0.14.0
> >> (/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/arrow)
> >>Verifying arrow v0.14.0
> >> (/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/arrow)
> >> error: failed to verify package tarball
> >>
> >> Caused by:
> >>   failed to parse manifest at
> >>
> `/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/target/package/arrow-0.14.0/Cargo.toml`
> >>
> >> Caused by:
> >>   can't find `array_from_vec` bench, specify bench.path
> >> 
> >>
> >> * How to solve this?
> >>
> >> Done:
> >>
> >>   * Rebasing the master branch on local release branch
> >>
> >>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Rebasingthemasterbranchonlocalreleasebranch
> >>
> >>   * Marking the released version as "RELEASED" on JIRA
> >>
> >>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Markingthereleasedversionas%22RELEASED%22onJIRA
> >>
> >>   * Starting the new version on JIRA
> >>
> >>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-StartingthenewversiononJIRA
> >>
> >>   * Partially: Updating the Arrow website
> >>
> >>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingtheArrowwebsite
> >>
> >> * Release note has been added.
> >> * No blog post.
> >> * Not upload to website yet.
> >>
> >>   * Uploading source release artifacts to SVN
> >>
> >>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UploadingsourcereleaseartifactstoSVN
> >>
> >>   * Uploading binary release artifacts to Bintray
> >>
> >>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UploadingbinaryreleaseartifactstoBintray
> >>
> >>   * Partially: Announcing release
> >>
> >>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Announcingrelease
> >>
> >> * Added release date.
> >> * Not send release announce to announce@ and dev@ yet.
> >>
> >>   * Partially: Updating C++ and Python packages
> >>
> >>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingC++andPythonpackages
> >>
> >> * Uploaded to PyPI.
> >>   * Wrote upload shell script but not create pull request yet.
> >> * Not update conda packages yet
> >>
> >>   * Updating Java Maven artifacts in Maven central
> >>
> >>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingJavaMavenartifactsinMavencentral
> >>
> >>   * Updating Ruby packages
> >>
> >>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingRubypackages
> >>
> >>   * Updating JavaScript packages
> >>
> >>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#Releas

Re: [RESULT][VOTE] Release Apache Arrow 0.14.0 - RC0

2019-07-05 Thread Sutou Kouhei
Hi Micah,

Thanks for helping this.

Sorry for my bad description of the task.

> e.g. run:
> 
> "./dev/release/post-00-rebase.sh apache-arrow-0.14.0-rc0"?

I've already done this:

>>> Done:
>>>
>>>   * Rebasing the master branch on local release branch
>>>
>>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Rebasingthemasterbranchonlocalreleasebranch

What I meant is that we need to rebase all open pull requests
onto master. For example,
https://github.com/apache/arrow/pull/4739 needs to be
rebased:

  git clone -b decimal_benchmark g...@github.com:emkornfielda/arrow.git
  cd arrow
  git remote add upstream g...@github.com:apache/arrow.git
  git fetch --all --prune --tags --force
  git rebase upstream/master
  git push --force


Thanks,
--
kou

In 
  "Re: [RESULT][VOTE] Release Apache Arrow 0.14.0 - RC0" on Fri, 5 Jul 2019 
01:01:18 -0700,
  Micah Kornfield  wrote:

> Actually, can someone clarify is the correct approach here to clone the
> @Kou's repo and use his RC0 branch to do the rebase?
> 
> e.g. run:
> 
> "./dev/release/post-00-rebase.sh apache-arrow-0.14.0-rc0"?
> 
> 
> Thanks,
> 
> Micah
> 
> On Fri, Jul 5, 2019 at 12:38 AM Micah Kornfield 
> wrote:
> 
>> * All pull requests need to rebase on master by
>>> "Rebasing the master branch on local release branch"
>>
>> Since it doesn't look like its been claimed i'll do it.
>>
>> On Thu, Jul 4, 2019 at 12:46 AM Sutou Kouhei  wrote:
>>
>>> Hi,
>>>
>>> I need your help!
>>> Could Rust developers see "Failed:" section?
>>> Could someone take over tasks in "Need helped:" section?
>>>
>>> Failed:
>>>
>>>   * Updating Rust packages
>>>
>>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingRustpackages
>>>
>>> * We need the following patch:
>>>
>>> 
>>> diff --git a/dev/release/post-07-rust.sh b/dev/release/post-07-rust.sh
>>> index a2f6e2988..c632fa793 100755
>>> --- a/dev/release/post-07-rust.sh
>>> +++ b/dev/release/post-07-rust.sh
>>> @@ -53,6 +53,12 @@ curl \
>>>  rm -rf ${archive_name}
>>>  tar xf ${tar_gz}
>>>  modules=()
>>> +  sed \
>>> +-i \
>>> +-E \
>>> +-e 's/^arrow = "([^"]*)"/arrow = { version = "\1", path =
>>> "..\/arrow" }/g' \
>>> +-e 's/^parquet = "([^"]*)"/parquet = { version = "\1", path =
>>> "..\/parquet" }/g' \
>>> +${archive_name}/rust/*/Cargo.toml
>>>  for cargo_toml in ${archive_name}/rust/*/Cargo.toml; do
>>>module_dir=$(dirname ${cargo_toml})
>>>pushd ${module_dir}
>>> 
>>>
>>> * "INSTALL_RUST=yes dev/release/post-07-rust.sh 0.14.0"
>>>   is failed with the above patch:
>>>
>>> 
>>>Packaging arrow v0.14.0
>>> (/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/arrow)
>>>Verifying arrow v0.14.0
>>> (/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/arrow)
>>> error: failed to verify package tarball
>>>
>>> Caused by:
>>>   failed to parse manifest at
>>> `/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/target/package/arrow-0.14.0/Cargo.toml`
>>>
>>> Caused by:
>>>   can't find `array_from_vec` bench, specify bench.path
>>> 
>>>
>>> * How to solve this?
>>>
>>> Done:
>>>
>>>   * Rebasing the master branch on local release branch
>>>
>>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Rebasingthemasterbranchonlocalreleasebranch
>>>
>>>   * Marking the released version as "RELEASED" on JIRA
>>>
>>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Markingthereleasedversionas%22RELEASED%22onJIRA
>>>
>>>   * Starting the new version on JIRA
>>>
>>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-StartingthenewversiononJIRA
>>>
>>>   * Partially: Updating the Arrow website
>>>
>>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingtheArrowwebsite
>>>
>>> * Release note has been added.
>>> * No blog post.
>>> * Not upload to website yet.
>>>
>>>   * Uploading source release artifacts to SVN
>>>
>>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UploadingsourcereleaseartifactstoSVN
>>>
>>>   * Uploading binary release artifacts to Bintray
>>>
>>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UploadingbinaryreleaseartifactstoBintray
>>>
>>>   * Partially: Announcing release
>>>
>>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Announcingrelease
>>>
>>> * Added release date.
>>> * Not send release announce to announce@ and dev@ yet.
>>>
>>>   * Partially: Updating C++ and Python packages
>>>
>>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingC++andPythonpackages
>>>
>>> * Uploaded to PyPI.
>>>   * Wro

Re: [RESULT][VOTE] Release Apache Arrow 0.14.0 - RC0

2019-07-05 Thread Micah Kornfield
Thanks.  Is there a script to do this or is it typically just done by hand?

On Fri, Jul 5, 2019 at 1:12 AM Sutou Kouhei  wrote:

> Hi Micah,
>
> Thanks for helping this.
>
> Sorry for my bad description of the task.
>
> > e.g. run:
> >
> > "./dev/release/post-00-rebase.sh apache-arrow-0.14.0-rc0"?
>
> I've already done this:
>
> >>> Done:
> >>>
> >>>   * Rebasing the master branch on local release branch
> >>>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Rebasingthemasterbranchonlocalreleasebranch
>
> I want to say that we need to rebase all open pull requests
> onto master. For example,
> https://github.com/apache/arrow/pull/4739 is needed to be
> rebased:
>
>   git clone --checkout decimal_benchmark g...@github.com:
> emkornfielda/arrow.git
>   cd arrow
>   git remote add upstream g...@github.com:apache/arrow.git
>   git fetch --all --prune --tags --force
>   git rebase upstream/master
>   git push --force
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [RESULT][VOTE] Release Apache Arrow 0.14.0 - RC0" on Fri, 5 Jul
> 2019 01:01:18 -0700,
>   Micah Kornfield  wrote:
>
> > Actually, can someone clarify is the correct approach here to clone the
> > @Kou's repo and use his RC0 branch to do the rebase?
> >
> > e.g. run:
> >
> > "./dev/release/post-00-rebase.sh apache-arrow-0.14.0-rc0"?
> >
> >
> > Thanks,
> >
> > Micah
> >
> > On Fri, Jul 5, 2019 at 12:38 AM Micah Kornfield 
> > wrote:
> >
> >> * All pull requests need to rebase on master by
> >>> "Rebasing the master branch on local release branch"
> >>
> >> Since it doesn't look like its been claimed i'll do it.
> >>
> >> On Thu, Jul 4, 2019 at 12:46 AM Sutou Kouhei 
> wrote:
> >>
> >>> Hi,
> >>>
> >>> I need your help!
> >>> Could Rust developers see "Failed:" section?
> >>> Could someone take over tasks in "Need helped:" section?
> >>>
> >>> Failed:
> >>>
> >>>   * Updating Rust packages
> >>>
> >>>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingRustpackages
> >>>
> >>> * We need the following patch:
> >>>
> >>> 
> >>> diff --git a/dev/release/post-07-rust.sh b/dev/release/post-07-rust.sh
> >>> index a2f6e2988..c632fa793 100755
> >>> --- a/dev/release/post-07-rust.sh
> >>> +++ b/dev/release/post-07-rust.sh
> >>> @@ -53,6 +53,12 @@ curl \
> >>>  rm -rf ${archive_name}
> >>>  tar xf ${tar_gz}
> >>>  modules=()
> >>> +  sed \
> >>> +-i \
> >>> +-E \
> >>> +-e 's/^arrow = "([^"]*)"/arrow = { version = "\1", path =
> >>> "..\/arrow" }/g' \
> >>> +-e 's/^parquet = "([^"]*)"/parquet = { version = "\1", path =
> >>> "..\/parquet" }/g' \
> >>> +${archive_name}/rust/*/Cargo.toml
> >>>  for cargo_toml in ${archive_name}/rust/*/Cargo.toml; do
> >>>module_dir=$(dirname ${cargo_toml})
> >>>pushd ${module_dir}
> >>> 
> >>>
> >>> * "INSTALL_RUST=yes dev/release/post-07-rust.sh 0.14.0"
> >>>   is failed with the above patch:
> >>>
> >>> 
> >>>Packaging arrow v0.14.0
> >>> (/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/arrow)
> >>>Verifying arrow v0.14.0
> >>> (/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/arrow)
> >>> error: failed to verify package tarball
> >>>
> >>> Caused by:
> >>>   failed to parse manifest at
> >>>
> `/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/target/package/arrow-0.14.0/Cargo.toml`
> >>>
> >>> Caused by:
> >>>   can't find `array_from_vec` bench, specify bench.path
> >>> 
> >>>
> >>> * How to solve this?
> >>>
> >>> Done:
> >>>
> >>>   * Rebasing the master branch on local release branch
> >>>
> >>>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Rebasingthemasterbranchonlocalreleasebranch
> >>>
> >>>   * Marking the released version as "RELEASED" on JIRA
> >>>
> >>>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Markingthereleasedversionas%22RELEASED%22onJIRA
> >>>
> >>>   * Starting the new version on JIRA
> >>>
> >>>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-StartingthenewversiononJIRA
> >>>
> >>>   * Partially: Updating the Arrow website
> >>>
> >>>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingtheArrowwebsite
> >>>
> >>> * Release note has been added.
> >>> * No blog post.
> >>> * Not upload to website yet.
> >>>
> >>>   * Uploading source release artifacts to SVN
> >>>
> >>>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UploadingsourcereleaseartifactstoSVN
> >>>
> >>>   * Uploading binary release artifacts to Bintray
> >>>
> >>>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UploadingbinaryreleaseartifactstoBintray
> >>>
> >>>   * Partially: Announcing release
> >>>
> >>>
> https://cwiki.apache.org

Re: [RESULT][VOTE] Release Apache Arrow 0.14.0 - RC0

2019-07-05 Thread Sutou Kouhei
We did this by hand in past releases.

It may be better to have a script for this.

In 
  "Re: [RESULT][VOTE] Release Apache Arrow 0.14.0 - RC0" on Fri, 5 Jul 2019 
01:16:42 -0700,
  Micah Kornfield  wrote:

> Thanks.  Is there a script to do this or is it typically just done by hand?
> 
> On Fri, Jul 5, 2019 at 1:12 AM Sutou Kouhei  wrote:
> 
>> Hi Micah,
>>
>> Thanks for helping this.
>>
>> Sorry for my bad description of the task.
>>
>> > e.g. run:
>> >
>> > "./dev/release/post-00-rebase.sh apache-arrow-0.14.0-rc0"?
>>
>> I've already done this:
>>
>> >>> Done:
>> >>>
>> >>>   * Rebasing the master branch on local release branch
>> >>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Rebasingthemasterbranchonlocalreleasebranch
>>
>> I want to say that we need to rebase all open pull requests
>> onto master. For example,
>> https://github.com/apache/arrow/pull/4739 is needed to be
>> rebased:
>>
>>   git clone --checkout decimal_benchmark g...@github.com:
>> emkornfielda/arrow.git
>>   cd arrow
>>   git remote add upstream g...@github.com:apache/arrow.git
>>   git fetch --all --prune --tags --force
>>   git rebase upstream/master
>>   git push --force
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In 
>>   "Re: [RESULT][VOTE] Release Apache Arrow 0.14.0 - RC0" on Fri, 5 Jul
>> 2019 01:01:18 -0700,
>>   Micah Kornfield  wrote:
>>
>> > Actually, can someone clarify is the correct approach here to clone the
>> > @Kou's repo and use his RC0 branch to do the rebase?
>> >
>> > e.g. run:
>> >
>> > "./dev/release/post-00-rebase.sh apache-arrow-0.14.0-rc0"?
>> >
>> >
>> > Thanks,
>> >
>> > Micah
>> >
>> > On Fri, Jul 5, 2019 at 12:38 AM Micah Kornfield 
>> > wrote:
>> >
>> >> * All pull requests need to rebase on master by
>> >>> "Rebasing the master branch on local release branch"
>> >>
>> >> Since it doesn't look like its been claimed i'll do it.
>> >>
>> >> On Thu, Jul 4, 2019 at 12:46 AM Sutou Kouhei 
>> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> I need your help!
>> >>> Could Rust developers see "Failed:" section?
>> >>> Could someone take over tasks in "Need helped:" section?
>> >>>
>> >>> Failed:
>> >>>
>> >>>   * Updating Rust packages
>> >>>
>> >>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingRustpackages
>> >>>
>> >>> * We need the following patch:
>> >>>
>> >>> 
>> >>> diff --git a/dev/release/post-07-rust.sh b/dev/release/post-07-rust.sh
>> >>> index a2f6e2988..c632fa793 100755
>> >>> --- a/dev/release/post-07-rust.sh
>> >>> +++ b/dev/release/post-07-rust.sh
>> >>> @@ -53,6 +53,12 @@ curl \
>> >>>  rm -rf ${archive_name}
>> >>>  tar xf ${tar_gz}
>> >>>  modules=()
>> >>> +  sed \
>> >>> +-i \
>> >>> +-E \
>> >>> +-e 's/^arrow = "([^"]*)"/arrow = { version = "\1", path =
>> >>> "..\/arrow" }/g' \
>> >>> +-e 's/^parquet = "([^"]*)"/parquet = { version = "\1", path =
>> >>> "..\/parquet" }/g' \
>> >>> +${archive_name}/rust/*/Cargo.toml
>> >>>  for cargo_toml in ${archive_name}/rust/*/Cargo.toml; do
>> >>>module_dir=$(dirname ${cargo_toml})
>> >>>pushd ${module_dir}
>> >>> 
>> >>>
>> >>> * "INSTALL_RUST=yes dev/release/post-07-rust.sh 0.14.0"
>> >>>   is failed with the above patch:
>> >>>
>> >>> 
>> >>>Packaging arrow v0.14.0
>> >>> (/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/arrow)
>> >>>Verifying arrow v0.14.0
>> >>> (/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/arrow)
>> >>> error: failed to verify package tarball
>> >>>
>> >>> Caused by:
>> >>>   failed to parse manifest at
>> >>>
>> `/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/target/package/arrow-0.14.0/Cargo.toml`
>> >>>
>> >>> Caused by:
>> >>>   can't find `array_from_vec` bench, specify bench.path
>> >>> 
>> >>>
>> >>> * How to solve this?
>> >>>
>> >>> Done:
>> >>>
>> >>>   * Rebasing the master branch on local release branch
>> >>>
>> >>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Rebasingthemasterbranchonlocalreleasebranch
>> >>>
>> >>>   * Marking the released version as "RELEASED" on JIRA
>> >>>
>> >>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Markingthereleasedversionas%22RELEASED%22onJIRA
>> >>>
>> >>>   * Starting the new version on JIRA
>> >>>
>> >>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-StartingthenewversiononJIRA
>> >>>
>> >>>   * Partially: Updating the Arrow website
>> >>>
>> >>>
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingtheArrowwebsite
>> >>>
>> >>> * Release note has been added.
>> >>> * No blog post.
>> >>> * Not upload to website yet.
>> >>>
>> >>>   * Uploading source release artifacts to SVN
>> >>>
>> >>>
>> https://cwiki.apache.org/confluence/display/ARROW/Relea

Re: [RESULT][VOTE] Release Apache Arrow 0.14.0 - RC0

2019-07-05 Thread Krisztián Szűcs
I prefer to use hub [1] to check out a PR:

hub pr checkout <pr-number>
git rebase upstream/master
git push -f

[1]: https://github.com/github/hub
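As a rough sketch only (not the script discussed later in this thread), the same hub command can be looped over every open PR. This assumes hub is installed, an `upstream` remote points at apache/arrow, fewer than 100 PRs are open (no pagination), and you have permission to push to the contributors' branches:

```
import json
import subprocess
import urllib.request

url = "https://api.github.com/repos/apache/arrow/pulls?state=open&per_page=100"
with urllib.request.urlopen(url) as resp:
    pulls = json.load(resp)

subprocess.run(["git", "fetch", "upstream"], check=True)
for pr in pulls:
    subprocess.run(["hub", "pr", "checkout", str(pr["number"])], check=True)
    if subprocess.run(["git", "rebase", "upstream/master"]).returncode == 0:
        subprocess.run(["git", "push", "--force"], check=True)
    else:
        # Leave PRs that do not rebase cleanly for their authors.
        subprocess.run(["git", "rebase", "--abort"], check=True)
        print("needs manual rebase:", pr["html_url"])
```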


On Fri, Jul 5, 2019 at 10:22 AM Sutou Kouhei  wrote:

> We did this by hand in the past releases.
>
> It may be better that we have a script to do this.
>
> In 
>   "Re: [RESULT][VOTE] Release Apache Arrow 0.14.0 - RC0" on Fri, 5 Jul
> 2019 01:16:42 -0700,
>   Micah Kornfield  wrote:
>
> > Thanks.  Is there a script to do this or is it typically just done by
> hand?
> >
> > On Fri, Jul 5, 2019 at 1:12 AM Sutou Kouhei  wrote:
> >
> >> Hi Micah,
> >>
> >> Thanks for helping this.
> >>
> >> Sorry for my bad description of the task.
> >>
> >> > e.g. run:
> >> >
> >> > "./dev/release/post-00-rebase.sh apache-arrow-0.14.0-rc0"?
> >>
> >> I've already done this:
> >>
> >> >>> Done:
> >> >>>
> >> >>>   * Rebasing the master branch on local release branch
> >> >>>
> >>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Rebasingthemasterbranchonlocalreleasebranch
> >>
> >> I want to say that we need to rebase all open pull requests
> >> onto master. For example,
> >> https://github.com/apache/arrow/pull/4739 is needed to be
> >> rebased:
> >>
> >>   git clone --checkout decimal_benchmark g...@github.com:
> >> emkornfielda/arrow.git
> >>   cd arrow
> >>   git remote add upstream g...@github.com:apache/arrow.git
> >>   git fetch --all --prune --tags --force
> >>   git rebase upstream/master
> >>   git push --force
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In 
> >>   "Re: [RESULT][VOTE] Release Apache Arrow 0.14.0 - RC0" on Fri, 5 Jul
> >> 2019 01:01:18 -0700,
> >>   Micah Kornfield  wrote:
> >>
> >> > Actually, can someone clarify is the correct approach here to clone
> the
> >> > @Kou's repo and use his RC0 branch to do the rebase?
> >> >
> >> > e.g. run:
> >> >
> >> > "./dev/release/post-00-rebase.sh apache-arrow-0.14.0-rc0"?
> >> >
> >> >
> >> > Thanks,
> >> >
> >> > Micah
> >> >
> >> > On Fri, Jul 5, 2019 at 12:38 AM Micah Kornfield <
> emkornfi...@gmail.com>
> >> > wrote:
> >> >
> >> >> * All pull requests need to rebase on master by
> >> >>> "Rebasing the master branch on local release branch"
> >> >>
> >> >> Since it doesn't look like its been claimed i'll do it.
> >> >>
> >> >> On Thu, Jul 4, 2019 at 12:46 AM Sutou Kouhei 
> >> wrote:
> >> >>
> >> >>> Hi,
> >> >>>
> >> >>> I need your help!
> >> >>> Could Rust developers see "Failed:" section?
> >> >>> Could someone take over tasks in "Need helped:" section?
> >> >>>
> >> >>> Failed:
> >> >>>
> >> >>>   * Updating Rust packages
> >> >>>
> >> >>>
> >>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingRustpackages
> >> >>>
> >> >>> * We need the following patch:
> >> >>>
> >> >>> 
> >> >>> diff --git a/dev/release/post-07-rust.sh
> b/dev/release/post-07-rust.sh
> >> >>> index a2f6e2988..c632fa793 100755
> >> >>> --- a/dev/release/post-07-rust.sh
> >> >>> +++ b/dev/release/post-07-rust.sh
> >> >>> @@ -53,6 +53,12 @@ curl \
> >> >>>  rm -rf ${archive_name}
> >> >>>  tar xf ${tar_gz}
> >> >>>  modules=()
> >> >>> +  sed \
> >> >>> +-i \
> >> >>> +-E \
> >> >>> +-e 's/^arrow = "([^"]*)"/arrow = { version = "\1", path =
> >> >>> "..\/arrow" }/g' \
> >> >>> +-e 's/^parquet = "([^"]*)"/parquet = { version = "\1", path =
> >> >>> "..\/parquet" }/g' \
> >> >>> +${archive_name}/rust/*/Cargo.toml
> >> >>>  for cargo_toml in ${archive_name}/rust/*/Cargo.toml; do
> >> >>>module_dir=$(dirname ${cargo_toml})
> >> >>>pushd ${module_dir}
> >> >>> 
> >> >>>
> >> >>> * "INSTALL_RUST=yes dev/release/post-07-rust.sh 0.14.0"
> >> >>>   is failed with the above patch:
> >> >>>
> >> >>> 
> >> >>>Packaging arrow v0.14.0
> >> >>> (/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/arrow)
> >> >>>Verifying arrow v0.14.0
> >> >>> (/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/arrow)
> >> >>> error: failed to verify package tarball
> >> >>>
> >> >>> Caused by:
> >> >>>   failed to parse manifest at
> >> >>>
> >>
> `/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/target/package/arrow-0.14.0/Cargo.toml`
> >> >>>
> >> >>> Caused by:
> >> >>>   can't find `array_from_vec` bench, specify bench.path
> >> >>> 
> >> >>>
> >> >>> * How to solve this?
> >> >>>
> >> >>> Done:
> >> >>>
> >> >>>   * Rebasing the master branch on local release branch
> >> >>>
> >> >>>
> >>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Rebasingthemasterbranchonlocalreleasebranch
> >> >>>
> >> >>>   * Marking the released version as "RELEASED" on JIRA
> >> >>>
> >> >>>
> >>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Markingthereleasedversionas%22RELEASED%22onJIRA
> >> >>>
> >> >>>   * Starting the new version on JIRA
> >> >>>
> >> >>>
> >>
> https://cwiki.apache.org/confluence/d

Re: linking 3rd party cython modules against pyarrow fails since 0.14.0

2019-07-05 Thread Antoine Pitrou


That's quite likely indeed.

What is a bit worrying is that this should have been caught by our unit tests.

Regards

Antoine.
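If that turns out to be the cause, a possible local workaround (a sketch only, not a fix for the wheels themselves) is to recreate the unversioned .so names the linker looks for, next to the versioned libraries the wheel ships:

```
import glob
import os
import pyarrow

for lib_dir in pyarrow.get_library_dirs():
    for versioned in glob.glob(os.path.join(lib_dir, "lib*.so.*")):
        # e.g. .../pyarrow/libarrow.so.14 -> .../pyarrow/libarrow.so
        unversioned = versioned[:versioned.index(".so.") + len(".so")]
        if not os.path.exists(unversioned):
            os.symlink(os.path.basename(versioned), unversioned)
```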



Le 05/07/2019 à 10:02, Weston Steimel a écrit :
> Hello,
> 
> I wonder if perhaps that may be due to the work done for reducing the wheel
> size in https://issues.apache.org/jira/browse/ARROW-5082?
> 
> On Thu, Jul 4, 2019 at 10:06 PM Stestagg  wrote:
> 
>> 1) pip install pyarrow==0.14.0
>> 2) All the pyarrow files including, for example libarrow.so.14, but not
>> libarrow.so (hence the linker error)
>>
>> Reproducible on Python 3.7.2 on linux mint 19.1 and debian docker:
>>
>> Example dockerfile:
>> ```
>> FROM debian:unstable-slim
>>
>> RUN apt-get update && apt-get upgrade -y
>> RUN apt-get install -y python3 python3-dev
>> RUN apt-get install -y python3-pip
>>
>> RUN python3 -m pip install --upgrade pip
>> RUN pip3 install Cython pyarrow
>> COPY setup.py /root
>> COPY test.pyx /root
>> WORKDIR /root
>> RUN python3 setup.py build_ext --inplace
>> ```
>>
>> Where setup.py and test.pyx are the files listed above, with an added call
>> to numpy.get_include().
>>
>> Appending ' ==0.13.0' to the 'RUN pip3 install...' line above results in
>> the docker image building
>>
>> Steve
>>
>> On Thu, Jul 4, 2019 at 10:37 PM Antoine Pitrou  wrote:
>>
>>>
>>> Hi,
>>>
>>> 1) How did you install PyArrow?
>>>
>>> 2) What does /usr/local/lib/python3.7/dist-packages/pyarrow contain?
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>> Le 04/07/2019 à 22:10, Stestagg a écrit :
 Hi

 I've got a cython module that links against PyArrow, using the
 'pyarrow.get_libraries()' associated methods.

 Builds on Windows and Linux are consistently failing against 0.14, but
 working on 0.12 to 0.13.

 Linux gives:
 x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions
 -Wl,-z,relro -Wl,-z,relro -g -fstack-protector-strong -Wformat
 -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2
 build/temp.linux-x86_64-3.7/test.o
 -L/usr/local/lib/python3.7/dist-packages/pyarrow -larrow -larrow_python
>>> -o
 /home/dduser/att/arrowtest.cpython-37m-x86_64-linux-gnu.so
 /usr/bin/ld: cannot find -larrow
 /usr/bin/ld: cannot find -larrow_python
 collect2: error: ld returned 1 exit status
 error: command 'x86_64-linux-gnu-g++' failed with exit status 1

 The windows build is more funky, but I'm still investigating.

 A minimal example is:

 setup.py:

 import pyarrow
 from Cython.Build import cythonize
 from distutils.command.build_clib import build_clib
 from distutils.core import setup, Extension


 OPTIONS = {
 'sources': ["test.pyx"],
 'language': "c++",
 'include_dirs':  [pyarrow.get_include()],
 'libraries': pyarrow.get_libraries(),
 'library_dirs': pyarrow.get_library_dirs()
 }

 setup(
 name='arrowtest',
 ext_modules = cythonize(Extension("arrowtest",**OPTIONS)),
 cmdclass = {'build_clib': build_clib},
 version="0.1",
 )

 test.pyx:

 import pyarrow as pa
 cimport pyarrow.lib as pa

 Thanks

 Steve

>>>
>>
> 


Re: [RESULT][VOTE] Release Apache Arrow 0.14.0 - RC0

2019-07-05 Thread Micah Kornfield
OK, I wrote a quick script (I'll clean it up and send it out as a PR tomorrow)
and rebased everything that could be rebased cleanly.  What do we generally
do about PRs that don't rebase cleanly?

Thanks,
Micah

On Fri, Jul 5, 2019 at 1:29 AM Krisztián Szűcs 
wrote:

> I prefer to use hub [1] to checkout a PR:
>
> hub pr checkout  
> git rebase upstream/master
> git push -f
>
> [1]: https://github.com/github/hub
>
>
> On Fri, Jul 5, 2019 at 10:22 AM Sutou Kouhei  wrote:
>
> > We did this by hand in the past releases.
> >
> > It may be better that we have a script to do this.
> >
> > In 
> >   "Re: [RESULT][VOTE] Release Apache Arrow 0.14.0 - RC0" on Fri, 5 Jul
> > 2019 01:16:42 -0700,
> >   Micah Kornfield  wrote:
> >
> > > Thanks.  Is there a script to do this or is it typically just done by
> > hand?
> > >
> > > On Fri, Jul 5, 2019 at 1:12 AM Sutou Kouhei 
> wrote:
> > >
> > >> Hi Micah,
> > >>
> > >> Thanks for helping this.
> > >>
> > >> Sorry for my bad description of the task.
> > >>
> > >> > e.g. run:
> > >> >
> > >> > "./dev/release/post-00-rebase.sh apache-arrow-0.14.0-rc0"?
> > >>
> > >> I've already done this:
> > >>
> > >> >>> Done:
> > >> >>>
> > >> >>>   * Rebasing the master branch on local release branch
> > >> >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Rebasingthemasterbranchonlocalreleasebranch
> > >>
> > >> I want to say that we need to rebase all open pull requests
> > >> onto master. For example,
> > >> https://github.com/apache/arrow/pull/4739 is needed to be
> > >> rebased:
> > >>
> > >>   git clone --checkout decimal_benchmark g...@github.com:
> > >> emkornfielda/arrow.git
> > >>   cd arrow
> > >>   git remote add upstream g...@github.com:apache/arrow.git
> > >>   git fetch --all --prune --tags --force
> > >>   git rebase upstream/master
> > >>   git push --force
> > >>
> > >>
> > >> Thanks,
> > >> --
> > >> kou
> > >>
> > >> In <
> cak7z5t96vuwqjv4vhj2ijicpyexuqpjsgopcsgxprm4msi8...@mail.gmail.com>
> > >>   "Re: [RESULT][VOTE] Release Apache Arrow 0.14.0 - RC0" on Fri, 5 Jul
> > >> 2019 01:01:18 -0700,
> > >>   Micah Kornfield  wrote:
> > >>
> > >> > Actually, can someone clarify is the correct approach here to clone
> > the
> > >> > @Kou's repo and use his RC0 branch to do the rebase?
> > >> >
> > >> > e.g. run:
> > >> >
> > >> > "./dev/release/post-00-rebase.sh apache-arrow-0.14.0-rc0"?
> > >> >
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Micah
> > >> >
> > >> > On Fri, Jul 5, 2019 at 12:38 AM Micah Kornfield <
> > emkornfi...@gmail.com>
> > >> > wrote:
> > >> >
> > >> >> * All pull requests need to rebase on master by
> > >> >>> "Rebasing the master branch on local release branch"
> > >> >>
> > >> >> Since it doesn't look like its been claimed i'll do it.
> > >> >>
> > >> >> On Thu, Jul 4, 2019 at 12:46 AM Sutou Kouhei 
> > >> wrote:
> > >> >>
> > >> >>> Hi,
> > >> >>>
> > >> >>> I need your help!
> > >> >>> Could Rust developers see "Failed:" section?
> > >> >>> Could someone take over tasks in "Need helped:" section?
> > >> >>>
> > >> >>> Failed:
> > >> >>>
> > >> >>>   * Updating Rust packages
> > >> >>>
> > >> >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingRustpackages
> > >> >>>
> > >> >>> * We need the following patch:
> > >> >>>
> > >> >>> 
> > >> >>> diff --git a/dev/release/post-07-rust.sh
> > b/dev/release/post-07-rust.sh
> > >> >>> index a2f6e2988..c632fa793 100755
> > >> >>> --- a/dev/release/post-07-rust.sh
> > >> >>> +++ b/dev/release/post-07-rust.sh
> > >> >>> @@ -53,6 +53,12 @@ curl \
> > >> >>>  rm -rf ${archive_name}
> > >> >>>  tar xf ${tar_gz}
> > >> >>>  modules=()
> > >> >>> +  sed \
> > >> >>> +-i \
> > >> >>> +-E \
> > >> >>> +-e 's/^arrow = "([^"]*)"/arrow = { version = "\1", path =
> > >> >>> "..\/arrow" }/g' \
> > >> >>> +-e 's/^parquet = "([^"]*)"/parquet = { version = "\1", path =
> > >> >>> "..\/parquet" }/g' \
> > >> >>> +${archive_name}/rust/*/Cargo.toml
> > >> >>>  for cargo_toml in ${archive_name}/rust/*/Cargo.toml; do
> > >> >>>module_dir=$(dirname ${cargo_toml})
> > >> >>>pushd ${module_dir}
> > >> >>> 
> > >> >>>
> > >> >>> * "INSTALL_RUST=yes dev/release/post-07-rust.sh 0.14.0"
> > >> >>>   is failed with the above patch:
> > >> >>>
> > >> >>> 
> > >> >>>Packaging arrow v0.14.0
> > >> >>> (/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/arrow)
> > >> >>>Verifying arrow v0.14.0
> > >> >>> (/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/arrow)
> > >> >>> error: failed to verify package tarball
> > >> >>>
> > >> >>> Caused by:
> > >> >>>   failed to parse manifest at
> > >> >>>
> > >>
> >
> `/home/kou/work/cpp/arrow.kou/apache-arrow-0.14.0/rust/target/package/arrow-0.14.0/Cargo.toml`
> > >> >>>
> > >> >>> Caused by:
> > >> >>>   can't find `array_from_vec` bench, specify bench.path
> > >> >>> 
> > >> >>>
> > >> >>> * Ho

[jira] [Created] (ARROW-5861) [Java] Initial implement to convert Avro record with primitive types

2019-07-05 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5861:
-

 Summary: [Java] Initial implement to convert Avro record with 
primitive types
 Key: ARROW-5861
 URL: https://issues.apache.org/jira/browse/ARROW-5861
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5862) [Java] Provide dictionary builder

2019-07-05 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5862:
---

 Summary: [Java] Provide dictionary builder
 Key: ARROW-5862
 URL: https://issues.apache.org/jira/browse/ARROW-5862
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


The dictionary builder serves the following scenario, which is frequently 
encountered in practice when dictionary encoding is involved: the dictionary 
values are not known a priori, so they are determined dynamically as new data 
arrive continually.

In particular, when a new value arrives, it is checked against the dictionary. 
If it is already present, it is simply ignored; otherwise, it is added to the 
dictionary.

When all values have been evaluated, the dictionary can be considered complete, 
so encoding can start afterward.

The code snippet using a dictionary builder should be like this:

{{DictionaryBuilder dictionaryBuilder = ...}}
{{dictionaryBuilder.startBuild();}}
{{...}}
{{dictionaryBuilder.addValue(newValue);}}
{{...}}
{{dictionaryBuilder.endBuild();}}
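As a language-agnostic illustration of the scenario described above (not the proposed Java API), the incremental build-then-encode flow looks like this:

```
# Made-up input stream of values arriving over time.
values = ["apple", "pear", "apple", "plum", "pear"]

dictionary = {}                 # value -> dictionary index
for v in values:                # phase 1: grow the dictionary as values arrive
    if v not in dictionary:
        dictionary[v] = len(dictionary)

indices = [dictionary[v] for v in values]   # phase 2: encode once complete
print(dictionary)               # {'apple': 0, 'pear': 1, 'plum': 2}
print(indices)                  # [0, 1, 0, 2, 1]
```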



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5863) Segmentation Fault via pytest-runner

2019-07-05 Thread Josh Bode (JIRA)
Josh Bode created ARROW-5863:


 Summary: Segmentation Fault via pytest-runner
 Key: ARROW-5863
 URL: https://issues.apache.org/jira/browse/ARROW-5863
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.0
 Environment: $ uname -a
Linux aleph 5.1.15-arch1-1-ARCH #1 SMP PREEMPT Tue Jun 25 04:49:39 UTC 2019 
x86_64 GNU/Linux

$ python --version
Python 3.7.3

$ pip freeze | grep -P "(pyarrow|pytest)"
pyarrow==0.14.0
pytest==5.0.0
pytest-benchmark==3.2.2
pytest-cov==2.7.1
pytest-env==0.6.2
pytest-forked==1.0.2
pytest-html==1.21.1
pytest-metadata==1.8.0
pytest-mock==1.10.4
pytest-runner==5.1
pytest-sugar==0.9.2
pytest-xdist==1.29.0
Reporter: Josh Bode
 Attachments: pyarrow-issue.tar.bz2

When running {{pytest}} on projects using {{pyarrow==0.14.0}} on Linux, I am 
getting segmentation faults, but interestingly _only_ when run via 
{{pytest-runner}}

This works (i.e. {{pytest}} directly):
{code:java}
$ pytest

Test session starts (platform: linux, Python 3.7.3, pytest 5.0.0, pytest-sugar 
0.9.2)
benchmark: 3.2.2 (defaults: timer=time.perf_counter disable_gc=False 
min_rounds=5 min_time=0.05 max_time=1.0 calibration_precision=10 
warmup=False warmup_iterations=10)
rootdir: /home/josh/scratch/pyarrow-issue
plugins: sugar-0.9.2, Flask-Dance-2.2.0, env-0.6.2, mock-1.10.4, xdist-1.29.0, 
requests-mock-1.6.0, forked-1.0.2, dash-1.0.0, cov-2.7.1, html-1.21.1, 
benchmark-3.2.2, metadata-1.8.0
collecting ...
tests/test_pyarrow.py ✓ 100% ██

Results (0.09s):
1 passed{code}
However, this does not work, ending in a segmentation fault, even though the 
tests pass:
{code:java}
$ python setup.py pytest

running pytest
running egg_info
writing pyarrow_issue.egg-info/PKG-INFO
writing dependency_links to pyarrow_issue.egg-info/dependency_links.txt
writing requirements to pyarrow_issue.egg-info/requires.txt
writing top-level names to pyarrow_issue.egg-info/top_level.txt
reading manifest file 'pyarrow_issue.egg-info/SOURCES.txt'
writing manifest file 'pyarrow_issue.egg-info/SOURCES.txt'
running build_ext

Test session starts (platform: linux, Python 3.7.3, pytest 5.0.0, pytest-sugar 
0.9.2)
benchmark: 3.2.2 (defaults: timer=time.perf_counter disable_gc=False 
min_rounds=5 min_time=0.05 max_time=1.0 calibration_precision=10 
warmup=False warmup_iterations=10)
rootdir: /home/josh/scratch/pyarrow-issue
plugins: sugar-0.9.2, Flask-Dance-2.2.0, env-0.6.2, mock-1.10.4, xdist-1.29.0, 
requests-mock-1.6.0, forked-1.0.2, dash-1.0.0, cov-2.7.1, html-1.21.1, 
benchmark-3.2.2, metadata-1.8.0
collecting ...
tests/test_pyarrow.py ✓ 100% ██

Results (0.07s):
1 passed
zsh: segmentation fault (core dumped) python setup.py pytest{code}
I have observed this behaviour on my machine running natively, and also via 
docker.
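One way to get more detail on where the crash happens (a suggestion only, not part of the reproduction project) is to enable faulthandler for the test session, for example via a conftest.py:

```
# conftest.py -- hypothetical addition to the reproduction project: dump a
# Python-level traceback to stderr if the interpreter receives a fatal signal
# such as SIGSEGV, which shows whether the crash happens during teardown.
import faulthandler

faulthandler.enable()
```

Running the failing command as `python3 -X faulthandler setup.py pytest` should have the same effect without modifying the project.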

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [CI] Ursabot Java builders

2019-07-05 Thread Krisztián Szűcs
Thanks to Sebastien we now have two Go builders [1]
and I've just added a Rust builder [2].

Go build takes ~15 seconds.
Rust build takes ~3 minutes.

[1]: https://github.com/ursa-labs/ursabot/pull/125
[2]: https://github.com/ursa-labs/ursabot/pull/126

On Thu, Jul 4, 2019 at 7:41 PM Krisztián Szűcs 
wrote:

> Hi,
>
> I've added simple Java builders for JDK 8 and 11.
> The required change was quite small [1], but a build
> takes a bit more than 2 minutes [2].
> Adding Go and Rust builders should be similarly easy.
>
> Regard, Krisztian
>
> [1]:
> https://github.com/ursa-labs/ursabot/commit/476a8b07fd81f6d7664dc5e115e079d014e54436
> [2]: https://ci.ursalabs.org/#/builders/89/builds/2
>


[jira] [Created] (ARROW-5864) [Python] simplify cython wrapping of Result

2019-07-05 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5864:


 Summary: [Python] simplify cython wrapping of Result
 Key: ARROW-5864
 URL: https://issues.apache.org/jira/browse/ARROW-5864
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


See answer in https://github.com/cython/cython/issues/3018



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-07-05 Thread John Muehlhausen
So far it seems as if pyarrow is completely ignoring the RecordBatch.length
field.  More info to follow...
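Whatever pyarrow does with the IPC-level length field, a consumer that ends up with an over-allocated batch in memory can already restrict itself to the utilized prefix with a zero-copy slice. A minimal pyarrow sketch (the 5-row batch and the utilized count of 3 are made-up values):

```
import pyarrow as pa

# Hypothetical pre-allocated batch: capacity for 5 rows, only 3 utilized.
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3, 0, 0]), pa.array(["a", "b", "c", "", ""])],
    names=["x", "y"],
)

utilized = 3                        # assumed to arrive out of band
view = batch.slice(0, utilized)     # zero-copy view over the populated rows
assert view.num_rows == utilized
```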

On Tue, Jul 2, 2019 at 3:02 PM John Muehlhausen  wrote:

> Crikey! I'll do some testing around that and suggest some test cases to
> ensure it continues to work, assuming that it does.
>
> -John
>
> On Tue, Jul 2, 2019 at 2:41 PM Wes McKinney  wrote:
>
>> Thanks for the attachment, it's helpful.
>>
>> On Tue, Jul 2, 2019 at 1:40 PM John Muehlhausen  wrote:
>> >
>> > Attachments referred to in previous two messages:
>> >
>> https://www.dropbox.com/sh/6ycfuivrx70q2jx/AAAt-RDaZWmQ2VqlM-0s6TqWa?dl=0
>> >
>> > On Tue, Jul 2, 2019 at 1:14 PM John Muehlhausen  wrote:
>> >
>> > > Thanks, Wes, for the thoughtful reply.  I really appreciate the
>> > > engagement.  In order to clarify things a bit, I am attaching a
>> graphic of
>> > > how our application will take record-wise (row-oriented) data from an
>> event
>> > > source and incrementally populate a pre-allocated Arrow-compatible
>> buffer,
>> > > including for variable-length fields.  (Obviously at this stage I am
>> not
>> > > using the reference implementation Arrow code, although that would be
>> a
>> > > goal to contribute that back to the project.)
>> > >
>> > > For sake of simplicity these are non-nullable fields.  As a result a
>> > > reader of "y" that has no knowledge of the "utilized" metadata would
>> get a
>> > > long string (zeros, spaces, uninitialized, or whatever we decide for
>> the
>> > > pre-allocation model) for the record just beyond the last utilized
>> record.
>> > >
>> > > I don't see any "big O"-analysis problems with this approach.  The
>> > > space/time tradeoff is that we have to guess how much room to
>> allocate for
>> > > variable-length fields.  We will probably almost always be wrong.
>> This
>> > > ends up in "wasted" space.  However, we can do calculations based on
>> these
>> > > partially filled batches that take full advantage of the columnar
>> layout.
>> > >  (Here I've shown the case where we had too little variable-length
>> buffer
>> > > set aside, resulting in "wasted" rows.  The flip side is that rows
>> achieve
>> > > full [1] utilization but there is wasted variable-length buffer if we
>> guess
>> > > incorrectly in the other direction.)
>> > >
>> > > I proposed a few things that are "nice to have" but really what I'm
>> eyeing
>> > > is the ability for a reader-- any reader (e.g. pyarrow)-- to see that
>> some
>> > > of the rows in a RecordBatch are not to be read, based on the new
>> > > "utilized" (or whatever name) metadata.  That single tweak to the
>> > > metadata-- and readers honoring it-- is the core of the proposal.
>> > >  (Proposal 4.)  This would indicate that the attached example (or
>> something
>> > > similar) is the blessed approach for those seeking to accumulate
>> events and
>> > > process them while still expecting more data, with the heavier-weight
>> task
>> > > of creating a new pre-allocated batch being a rare occurrence.
>> > >
>>
>> So the "length" field in RecordBatch is already the utilized number of
>> rows. The body buffers can certainly have excess unused space. So your
>> application can mutate Flatbuffer "length" field in-place as new
>> records are filled in.
>>
>> > > Notice that the mutability is only in the sense of "appending."  The
>> > > current doctrine of total immutability would be revised to refer to
>> the
>> > > immutability of only the already-populated rows.
>> > >
>> > > It gives folks an option other than choosing the lesser of two evils:
>> on
>> > > the one hand, length 1 RecordBatches that don't result in a stream
>> that is
>> > > computationally efficient.  On the other hand, adding artificial
>> latency by
>> > > accumulating events before "freezing" a larger batch and only then
>> making
>> > > it available to computation.
>> > >
>> > > -John
>> > >
>> > > On Tue, Jul 2, 2019 at 12:21 PM Wes McKinney 
>> wrote:
>> > >
>> > >> hi John,
>> > >>
>> > >> On Tue, Jul 2, 2019 at 11:23 AM John Muehlhausen 
>> wrote:
>> > >> >
>> > >> > During my time building financial analytics and trading systems (23
>> > >> years!), both the "batch processing" and "stream processing"
>> paradigms have
>> > >> been extensively used by myself and by colleagues.
>> > >> >
>> > >> > Unfortunately, the tools used in these paradigms have not
>> successfully
>> > >> overlapped.  For example, an analyst might use a Python notebook with
>> > >> pandas to do some batch analysis.  Then, for acceptable latency and
>> > >> throughput, a C++ programmer must implement the same schemas and
>> processing
>> > >> logic in order to analyze real-time data for real-time decision
>> support.
>> > >> (Time horizons often being sub-second or even sub-millisecond for an
>> > >> acceptable reaction to an event.  The most aggressive software-based
>> > >> systems, leaving custom hardware aside other than things like
>> kernel-bypass
>> > >> NICs, target 10s of microseconds for a full round trip

Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-07-05 Thread John Muehlhausen
This seems to help... still testing it though.

  Status GetFieldMetadata(int field_index, ArrayData* out) {
    auto nodes = metadata_->nodes();
    // pop off a field
    if (field_index >= static_cast<int>(nodes->size())) {
      return Status::Invalid("Ran out of field metadata, likely malformed");
    }
    const flatbuf::FieldNode* node = nodes->Get(field_index);

    // out->length = node->length();
    out->length = metadata_->length();  // use the RecordBatch length, not the FieldNode length
    out->null_count = node->null_count();
    out->offset = 0;
    return Status::OK();
  }
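
For the writer side, the in-place mutation Wes mentioned could look roughly
like the sketch below.  This assumes the FlatBuffers code was generated with
flatc --gen-mutable (so that mutate_*() is available); the header name and
function name are placeholders, not anything in the Arrow codebase:

  #include <cstdint>

  #include "flatbuffers/flatbuffers.h"
  #include "Message_generated.h"  // placeholder: generated from Arrow's Message.fbs

  namespace flatbuf = org::apache::arrow::flatbuf;

  // Bump the utilized-row count of an already-written RecordBatch message in
  // place.  `metadata` points at the start of the flatbuffer Message data
  // (i.e. just past the IPC length prefix).
  bool SetUtilizedLength(uint8_t* metadata, int64_t new_length) {
    auto* message = flatbuffers::GetMutableRoot<flatbuf::Message>(metadata);
    auto* batch =
        const_cast<flatbuf::RecordBatch*>(message->header_as_RecordBatch());
    if (batch == nullptr) {
      return false;  // not a RecordBatch message
    }
    // mutate_length() returns false if the field was elided as a default value.
    return batch->mutate_length(new_length);
  }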

On Fri, Jul 5, 2019 at 10:24 AM John Muehlhausen  wrote:

> So far it seems as if pyarrow is completely ignoring the
> RecordBatch.length field.  More info to follow...
>
> On Tue, Jul 2, 2019 at 3:02 PM John Muehlhausen  wrote:
>
>> Crikey! I'll do some testing around that and suggest some test cases to
>> ensure it continues to work, assuming that it does.
>>
>> -John
>>
>> On Tue, Jul 2, 2019 at 2:41 PM Wes McKinney  wrote:
>>
>>> Thanks for the attachment, it's helpful.
>>>
>>> On Tue, Jul 2, 2019 at 1:40 PM John Muehlhausen  wrote:
>>> >
>>> > Attachments referred to in previous two messages:
>>> >
>>> https://www.dropbox.com/sh/6ycfuivrx70q2jx/AAAt-RDaZWmQ2VqlM-0s6TqWa?dl=0
>>> >
>>> > On Tue, Jul 2, 2019 at 1:14 PM John Muehlhausen  wrote:
>>> >
>>> > > Thanks, Wes, for the thoughtful reply.  I really appreciate the
>>> > > engagement.  In order to clarify things a bit, I am attaching a
>>> graphic of
>>> > > how our application will take record-wise (row-oriented) data from
>>> an event
>>> > > source and incrementally populate a pre-allocated Arrow-compatible
>>> buffer,
>>> > > including for variable-length fields.  (Obviously at this stage I am
>>> not
>>> > > using the reference implementation Arrow code, although that would
>>> be a
>>> > > goal to contribute that back to the project.)
>>> > >
>>> > > For sake of simplicity these are non-nullable fields.  As a result a
>>> > > reader of "y" that has no knowledge of the "utilized" metadata would
>>> get a
>>> > > long string (zeros, spaces, uninitialized, or whatever we decide for
>>> the
>>> > > pre-allocation model) for the record just beyond the last utilized
>>> record.
>>> > >
>>> > > I don't see any "big O"-analysis problems with this approach.  The
>>> > > space/time tradeoff is that we have to guess how much room to
>>> allocate for
>>> > > variable-length fields.  We will probably almost always be wrong.
>>> This
>>> > > ends up in "wasted" space.  However, we can do calculations based on
>>> these
>>> > > partially filled batches that take full advantage of the columnar
>>> layout.
>>> > >  (Here I've shown the case where we had too little variable-length
>>> buffer
>>> > > set aside, resulting in "wasted" rows.  The flip side is that rows
>>> achieve
>>> > > full [1] utilization but there is wasted variable-length buffer if
>>> we guess
>>> > > incorrectly in the other direction.)
>>> > >
>>> > > I proposed a few things that are "nice to have" but really what I'm
>>> eyeing
>>> > > is the ability for a reader-- any reader (e.g. pyarrow)-- to see
>>> that some
>>> > > of the rows in a RecordBatch are not to be read, based on the new
>>> > > "utilized" (or whatever name) metadata.  That single tweak to the
>>> > > metadata-- and readers honoring it-- is the core of the proposal.
>>> > >  (Proposal 4.)  This would indicate that the attached example (or
>>> something
>>> > > similar) is the blessed approach for those seeking to accumulate
>>> events and
>>> > > process them while still expecting more data, with the
>>> heavier-weight task
>>> > > of creating a new pre-allocated batch being a rare occurrence.
>>> > >
>>>
>>> So the "length" field in RecordBatch is already the utilized number of
>>> rows. The body buffers can certainly have excess unused space. So your
>>> application can mutate Flatbuffer "length" field in-place as new
>>> records are filled in.
>>>
>>> > > Notice that the mutability is only in the sense of "appending."  The
>>> > > current doctrine of total immutability would be revised to refer to
>>> the
>>> > > immutability of only the already-populated rows.
>>> > >
>>> > > It gives folks an option other than choosing the lesser of two
>>> evils: on
>>> > > the one hand, length 1 RecordBatches that don't result in a stream
>>> that is
>>> > > computationally efficient.  On the other hand, adding artificial
>>> latency by
>>> > > accumulating events before "freezing" a larger batch and only then
>>> making
>>> > > it available to computation.
>>> > >
>>> > > -John
>>> > >
>>> > > On Tue, Jul 2, 2019 at 12:21 PM Wes McKinney 
>>> wrote:
>>> > >
>>> > >> hi John,
>>> > >>
>>> > >> On Tue, Jul 2, 2019 at 11:23 AM John Muehlhausen 
>>> wrote:
>>> > >> >
>>> > >> > During my time building financial analytics and trading systems
>>> (23
>>> > >> years!), both the "batch processing" and "stream processing"
>>> paradigms have
>>> > >> been extensively used by myself and by colleagues.
>>> > >

flatbuffers vectors and --gen-object-api

2019-07-05 Thread John Muehlhausen
It seems as if Arrow expects some vectors to be empty rather than null.
(Examples: Footer.dictionaries, Field.children)

Anyone using --gen-object-api with flatc will get code that writes null
when (e.g.) _o->children.size() is zero in CreateField().

I may be missing something but I don't see a way to change this behavior in
flatc.

I understand that the object API is not as performant, but wanted to toss
out the question:

Do we want to tolerate null vectors as well as empty vectors so that other
writer implementations have this option?  E.g. if they choose to use
--gen-object-api?
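
A reader-side workaround would be to treat a missing (null) vector the same
as an empty one.  A minimal sketch, assuming the C++ classes flatc generates
from Arrow's Schema.fbs (the header name is a placeholder):

  #include <cstdint>

  #include "Schema_generated.h"  // placeholder: generated from Arrow's Schema.fbs

  namespace flatbuf = org::apache::arrow::flatbuf;

  // Treat an elided (null) children vector the same as an empty one.
  int64_t NumChildren(const flatbuf::Field* field) {
    auto* children = field->children();  // may be nullptr if the writer omitted it
    return children == nullptr ? 0 : static_cast<int64_t>(children->size());
  }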

-John


[jira] [Created] (ARROW-5865) [Release] Helper script for rebasing open pull requests on master

2019-07-05 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-5865:
--

 Summary: [Release] Helper script for rebasing open pull requests 
on master
 Key: ARROW-5865
 URL: https://issues.apache.org/jira/browse/ARROW-5865
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Micah Kornfield
Assignee: Micah Kornfield


Create a script so we don't have to manually rebase all open pull requests off 
of master.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-05 Thread Micah Kornfield
Hi Arrow-dev,

I’d like to make a straw-man proposal to cover some features that I think
would be useful to Arrow, and that I would like to make a proof-of-concept
implementation for in Java and C++.  In particular, the proposal covers
allowing for smaller data sizes via compression and encoding [1][2][8],
data integrity [3] and avoiding unnecessary data transfer [4][5].

I’ve put together a PR  [6] that has proposed changes to the flatbuffer
metadata to support the new features.  The PR introduces:

  * A new “SparseRecordBatch” that can support one of multiple possible
    encodings (both dense and sparse), compression and column elision.
  * A “Digest” message type to support optional data integrity.

Going into more details on the specific features in the PR:

  1. Sparse encodings for arrays and buffers.  The guiding principles behind
     the suggested encodings are to support encodings that can be exploited by
     compute engines for more efficient computation (I don’t think parquet
     style bit-packing belongs in Arrow).  While the encodings don’t maintain
     O(1) data element access, they support sublinear, O(log(N)), element
     access.  The suggested encodings are (a small illustrative sketch follows
     this list):
     a. Array encodings:
        i.  Add a run-length encoding scheme to efficiently represent repeated
            values (the actual scheme encodes run ends instead of length to
            preserve sub-linear random access).
        ii. Add a “packed” sparse representation (null values don’t take up
            space in value buffers).
     b. Buffer encodings:
        i.  Add frame of reference integer encoding [7] (this allows for lower
            bit-width encoding of integer types by subtracting a “reference”
            value from all values in the buffer).
        ii. Add a sparse integer set encoding.  This encoding allows more
            efficient encoding of validity bit-masks for cases when all values
            are either null or not null.
  2. Data compression.  Similar to encodings but compression is solely for
     reduction of data at rest/on the wire.  The proposal is to allow
     compression of individual buffers.  Right now zstd is proposed, but I
     don’t feel strongly on the specific technologies here.
  3. Column Elision.  For some use-cases, like structured logging, the
     overhead of including array metadata for columns with no data present
     represents non-negligible overhead.  The proposal provides a mechanism
     for omitting meta-data for such arrays.
  4. Data Integrity.  While the arrow file format isn’t meant for archiving
     data, I think it is important to allow for optional native data integrity
     checks in the format.  To this end, I proposed a new “Digest” message
     type that can be added after other messages to record a digest/hash of
     the preceding data.  I suggested xxhash, but I don’t have a strong
     opinion here, as long as there is some minimal support that can
     potentially be expanded later.
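
To make those encodings a bit more concrete, here is a tiny illustrative
sketch of run-end and frame-of-reference access (purely an example with
invented names, not the proposed metadata or any existing implementation):

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Run-end encoding: values[i] covers the logical rows
  // [run_ends[i-1], run_ends[i]); run_ends.back() is the logical length.
  struct RunEndEncoded {
    std::vector<int64_t> run_ends;  // strictly increasing
    std::vector<int32_t> values;

    // O(log N) random access via binary search over the run ends.
    int32_t ValueAt(int64_t row) const {
      auto it = std::upper_bound(run_ends.begin(), run_ends.end(), row);
      return values[static_cast<size_t>(it - run_ends.begin())];
    }
  };

  // Frame-of-reference encoding: store a "reference" value once and keep only
  // small deltas, which fit in a narrower integer type.
  struct FrameOfReference {
    int64_t reference;             // e.g. the minimum value in the buffer
    std::vector<uint16_t> deltas;  // width chosen from the value range

    int64_t ValueAt(size_t i) const { return reference + deltas[i]; }
  };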


In the proposal I chose to use Tables and Unions everywhere for flexibility
but in all likelihood some could be replaced by enums.

My initial plan would be to solely focus on an IPC mechanism that can send
a SparseRecordBatch and immediately translate it to a normal RecordBatch in
both Java and C++.

As a practical matter the proposal represents a lot of work to get an MVP
working in time for the 1.0.0 release (provided the changes are accepted by
the community), so I'd greatly appreciate it if anyone wants to collaborate
on this.

If it is easier I’m happy to start a separate thread per feature if people
feel like it would make the conversation easier.  I can also create a
Google Doc for direct comments if that is preferred.

Thanks,

Micah



P.S. In the interest of full disclosure, these ideas evolved in
collaboration with Brian Hulette and other colleagues at Google who are
interested in making use of Arrow in both internal and external projects.

[1] https://issues.apache.org/jira/browse/ARROW-300

[2]  https://issues.apache.org/jira/browse/ARROW-5224

[3]
https://lists.apache.org/thread.html/36ab9c2b8b5d9f04493b3f9ea3b63c3ca3bc0f90743aa726b7a3199b@%3Cdev.arrow.apache.org%3E

[4]
https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E

[5]
https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16244812#comment-16244812

[6] https://github.com/apache/arrow/pull/4815

[7]
https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/

[8] https://issues.apache.org/jira/browse/ARROW-5821


Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-05 Thread Jacques Nadeau
Hey Micah, your formatting seems to be messed up on this mail. Some kind
of copy/paste error?

On Fri, Jul 5, 2019 at 11:54 AM Micah Kornfield 
wrote:

> Hi Arrow-dev,
>
> I’d like to make a straw-man proposal to cover some features that I think
> would be useful to Arrow, and that I would like to make a proof-of-concept
> implementation for in Java and C++.  In particular, the proposal covers
> allowing for smaller data sizes via compression and encoding [1][2][8],
> data integrity [3] and avoiding unnecessary data transfer [4][5].
>
> I’ve put together a PR  [6] that has proposed changes to the flatbuffer
> metadata to support the new features.  The PR introduces:
>
>-
>
>A new “SparseRecordBatch” that can support one of multiple possible
>encodings (both dense and sparse), compression and column elision.
>-
>
>A “Digest” message type to support optional data integrity.
>
>
> Going into more details on the specific features in the PR:
>
>1.
>
>Sparse encodings for arrays and buffers.  The guiding principles behind
>the suggested encodings are to support encodings that can be exploited
> by
>compute engines for more efficient computation (I don’t think parquet
> style
>bit-packing belongs in Arrow).  While the encodings don’t maintain O(1)
>data element access, they support sublinear, O(log(N)), element access.
> The
>suggested encodings are:
>1.
>
>   Array encodings:
>   1.
>
>  Add a run-length encoding scheme to efficiently represent repeated
>  values (the actual scheme encodes run ends instead of length
> to preserve
>  sub-linear random access).
>  2.
>
>  Add a “packed” sparse representation (null values don’t take up
>  space in value buffers)
>  2.
>
>   Buffer encodings:
>   1.
>
>  Add frame of reference integer encoding [7] (this allows for lower
>  bit-width encoding of integer types by subtracting a
> “reference” value from
>  all values in the buffer).
>  2.
>
>  Add a sparse integer set encoding.  This encoding allows more
>  efficient encoding of validity bit-masks for cases when all
> values are
>  either null or not null.
>  2.
>
>Data compression.  Similar to encodings but compression is solely for
>reduction of data at rest/on the wire.  The proposal is to allow
>compression of individual buffers. Right now zstd is proposed, but I
> don’t
>feel strongly on the specific technologies here.
>3.
>
>Column Elision.  For some use-cases, like structured logging, the
>overhead of including array metadata for columns with no data present
>represents non-negligible overhead.   The proposal provides a mechanism
> for
>omitting meta-data for such arrays.
>4.
>
>Data Integrity.  While the arrow file format isn’t meant for archiving
>data, I think it is important to allow for optional native data
> integrity
>checks in the format.  To this end, I proposed a new “Digest” message
> type
>that can be added after other messages to record a digest/hash of the
>preceding data. I suggested xxhash, but I don’t have a strong opinion
> here,
>as long as there is some minimal support that can potentially be
> expanded
>later.
>
>
> In the proposal I chose to use Tables and Unions everywhere for flexibility
> but in all likelihood some could be replaced by enums.
>
> My initial plan would be to solely focus on an IPC mechanism that can send
> a SparseRecordBatch and immediately translate it to a normal RecordBatch in
> both Java and C++.
>
> As a practical matter the proposal represents a lot of work to get an MVP
> working in time for 1.0.0 release (provided they are accepted by the
> community), so I'd greatly appreciate if anyone wants to collaborate on
> this.
>
> If it is easier I’m happy to start a separate thread for feature if people
> feel like it would make the conversation easier.  I can also create a
> Google Doc for direct comments if that is preferred.
>
> Thanks,
>
> Micah
>
>
>
> P.S. In the interest of full disclosure, these ideas evolved in
> collaboration with Brian Hulette and other colleagues at Google who are
> interested in making use of Arrow in both internal and external projects.
>
> [1] https://issues.apache.org/jira/browse/ARROW-300
>
> [2]  https://issues.apache.org/jira/browse/ARROW-5224
>
> [3]
>
> https://lists.apache.org/thread.html/36ab9c2b8b5d9f04493b3f9ea3b63c3ca3bc0f90743aa726b7a3199b@%3Cdev.arrow.apache.org%3E
>
> [4]
>
> https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
>
> [5]
>
> https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16244812#comment-16244812
>
> [6] https://github.com/apache/arrow/pull/4815
>
> [7]
>
> https://lemire.me/blog/2012/02/08/effective-c

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-05 Thread Jacques Nadeau
Initial thought: I don't think most of this should be targeted for 1.0. It
is a lot of change/enhancement and seems like it would likely substantially
delay 1.0. The one piece that seems least disruptive would be basic
on-the-wire compression. You suggested that this be done on the buffer level
but it seems like that may be too narrow depending on batch size. What is the
thinking here about tradeoffs around message versus batch? When pipelining,
we target relatively small batches, typically 256 KB-1 MB. Sometimes we
might go up to 10 MB but that is a pretty rare use case.

On Fri, Jul 5, 2019 at 12:32 PM Jacques Nadeau  wrote:

> Hey Micah, your formatting seems to be messed up on this mail. Some kind
> of copy/paste error?
>
> On Fri, Jul 5, 2019 at 11:54 AM Micah Kornfield 
> wrote:
>
>> Hi Arrow-dev,
>>
>> I’d like to make a straw-man proposal to cover some features that I think
>> would be useful to Arrow, and that I would like to make a proof-of-concept
>> implementation for in Java and C++.  In particular, the proposal covers
>> allowing for smaller data sizes via compression and encoding [1][2][8],
>> data integrity [3] and avoiding unnecessary data transfer [4][5].
>>
>> I’ve put together a PR  [6] that has proposed changes to the flatbuffer
>> metadata to support the new features.  The PR introduces:
>>
>>-
>>
>>A new “SparseRecordBatch” that can support one of multiple possible
>>encodings (both dense and sparse), compression and column elision.
>>-
>>
>>A “Digest” message type to support optional data integrity.
>>
>>
>> Going into more details on the specific features in the PR:
>>
>>1.
>>
>>Sparse encodings for arrays and buffers.  The guiding principles behind
>>the suggested encodings are to support encodings that can be exploited
>> by
>>compute engines for more efficient computation (I don’t think parquet
>> style
>>bit-packing belongs in Arrow).  While the encodings don’t maintain O(1)
>>data element access, they support sublinear, O(log(N)), element
>> access. The
>>suggested encodings are:
>>1.
>>
>>   Array encodings:
>>   1.
>>
>>  Add a run-length encoding scheme to efficiently represent
>> repeated
>>  values (the actual scheme encodes run ends instead of length
>> to preserve
>>  sub-linear random access).
>>  2.
>>
>>  Add a “packed” sparse representation (null values don’t take up
>>  space in value buffers)
>>  2.
>>
>>   Buffer encodings:
>>   1.
>>
>>  Add frame of reference integer encoding [7] (this allows for
>> lower
>>  bit-width encoding of integer types by subtracting a
>> “reference” value from
>>  all values in the buffer).
>>  2.
>>
>>  Add a sparse integer set encoding.  This encoding allows more
>>  efficient encoding of validity bit-masks for cases when all
>> values are
>>  either null or not null.
>>  2.
>>
>>Data compression.  Similar to encodings but compression is solely for
>>reduction of data at rest/on the wire.  The proposal is to allow
>>compression of individual buffers. Right now zstd is proposed, but I
>> don’t
>>feel strongly on the specific technologies here.
>>3.
>>
>>Column Elision.  For some use-cases, like structured logging, the
>>overhead of including array metadata for columns with no data present
>>represents non-negligible overhead.   The proposal provides a
>> mechanism for
>>omitting meta-data for such arrays.
>>4.
>>
>>Data Integrity.  While the arrow file format isn’t meant for archiving
>>data, I think it is important to allow for optional native data
>> integrity
>>checks in the format.  To this end, I proposed a new “Digest” message
>> type
>>that can be added after other messages to record a digest/hash of the
>>preceding data. I suggested xxhash, but I don’t have a strong opinion
>> here,
>>as long as there is some minimal support that can potentially be
>> expanded
>>later.
>>
>>
>> In the proposal I chose to use Tables and Unions everywhere for
>> flexibility
>> but in all likelihood some could be replaced by enums.
>>
>> My initial plan would be to solely focus on an IPC mechanism that can send
>> a SparseRecordBatch and immediately translate it to a normal RecordBatch
>> in
>> both Java and C++.
>>
>> As a practical matter the proposal represents a lot of work to get an MVP
>> working in time for 1.0.0 release (provided they are accepted by the
>> community), so I'd greatly appreciate if anyone wants to collaborate on
>> this.
>>
>> If it is easier I’m happy to start a separate thread for feature if people
>> feel like it would make the conversation easier.  I can also create a
>> Google Doc for direct comments if that is preferred.
>>
>> Thanks,
>>
>> Micah
>>
>>
>>
>> P.S. In the interest of full disclosure, these ideas evolved in
>> collaboration with Brian Hulette and other

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-05 Thread Micah Kornfield
Strange, I've pasted the contents into a google document at [1]

[1]
https://docs.google.com/document/d/1uJzWh63Iqk7FRbElHPhHrsmlfe0NIJ6M8-0kejPmwIw/edit


On Fri, Jul 5, 2019 at 12:32 PM Jacques Nadeau  wrote:

> Hey Micah, your formatting seems to be messed up on this mail. Some kind
> of copy/paste error?
>
> On Fri, Jul 5, 2019 at 11:54 AM Micah Kornfield 
> wrote:
>
> > Hi Arrow-dev,
> >
> > I’d like to make a straw-man proposal to cover some features that I think
> > would be useful to Arrow, and that I would like to make a
> proof-of-concept
> > implementation for in Java and C++.  In particular, the proposal covers
> > allowing for smaller data sizes via compression and encoding [1][2][8],
> > data integrity [3] and avoiding unnecessary data transfer [4][5].
> >
> > I’ve put together a PR  [6] that has proposed changes to the flatbuffer
> > metadata to support the new features.  The PR introduces:
> >
> >-
> >
> >A new “SparseRecordBatch” that can support one of multiple possible
> >encodings (both dense and sparse), compression and column elision.
> >-
> >
> >A “Digest” message type to support optional data integrity.
> >
> >
> > Going into more details on the specific features in the PR:
> >
> >1.
> >
> >Sparse encodings for arrays and buffers.  The guiding principles
> behind
> >the suggested encodings are to support encodings that can be exploited
> > by
> >compute engines for more efficient computation (I don’t think parquet
> > style
> >bit-packing belongs in Arrow).  While the encodings don’t maintain
> O(1)
> >data element access, they support sublinear, O(log(N)), element
> access.
> > The
> >suggested encodings are:
> >1.
> >
> >   Array encodings:
> >   1.
> >
> >  Add a run-length encoding scheme to efficiently represent
> repeated
> >  values (the actual scheme encodes run ends instead of length
> > to preserve
> >  sub-linear random access).
> >  2.
> >
> >  Add a “packed” sparse representation (null values don’t take up
> >  space in value buffers)
> >  2.
> >
> >   Buffer encodings:
> >   1.
> >
> >  Add frame of reference integer encoding [7] (this allows for
> lower
> >  bit-width encoding of integer types by subtracting a
> > “reference” value from
> >  all values in the buffer).
> >  2.
> >
> >  Add a sparse integer set encoding.  This encoding allows more
> >  efficient encoding of validity bit-masks for cases when all
> > values are
> >  either null or not null.
> >  2.
> >
> >Data compression.  Similar to encodings but compression is solely for
> >reduction of data at rest/on the wire.  The proposal is to allow
> >compression of individual buffers. Right now zstd is proposed, but I
> > don’t
> >feel strongly on the specific technologies here.
> >3.
> >
> >Column Elision.  For some use-cases, like structured logging, the
> >overhead of including array metadata for columns with no data present
> >represents non-negligible overhead.   The proposal provides a
> mechanism
> > for
> >omitting meta-data for such arrays.
> >4.
> >
> >Data Integrity.  While the arrow file format isn’t meant for archiving
> >data, I think it is important to allow for optional native data
> > integrity
> >checks in the format.  To this end, I proposed a new “Digest” message
> > type
> >that can be added after other messages to record a digest/hash of the
> >preceding data. I suggested xxhash, but I don’t have a strong opinion
> > here,
> >as long as there is some minimal support that can potentially be
> > expanded
> >later.
> >
> >
> > In the proposal I chose to use Tables and Unions everywhere for
> flexibility
> > but in all likelihood some could be replaced by enums.
> >
> > My initial plan would be to solely focus on an IPC mechanism that can
> send
> > a SparseRecordBatch and immediately translate it to a normal RecordBatch
> in
> > both Java and C++.
> >
> > As a practical matter the proposal represents a lot of work to get an MVP
> > working in time for 1.0.0 release (provided they are accepted by the
> > community), so I'd greatly appreciate if anyone wants to collaborate on
> > this.
> >
> > If it is easier I’m happy to start a separate thread for feature if
> people
> > feel like it would make the conversation easier.  I can also create a
> > Google Doc for direct comments if that is preferred.
> >
> > Thanks,
> >
> > Micah
> >
> >
> >
> > P.S. In the interest of full disclosure, these ideas evolved in
> > collaboration with Brian Hulette and other colleagues at Google who are
> > interested in making use of Arrow in both internal and external projects.
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-300
> >
> > [2]  https://issues.apache.org/jira/browse/ARROW-5224
> >
> > [3]
> >
> >
> https://lists.apache.org/thread.html/36ab9c2b

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-05 Thread Micah Kornfield
Hi Jacques,
Thanks for the quick response.

> I don't think most of this should be targeted for 1.0. It is a lot of
> change/enhancement and seems like it would likely substantially delay 1.0.


I agree it shouldn't block 1.0.  I think time-based releases are working
well for the community.  But if the features are implemented in Java and
C++ with integration tests between the two in time for 1.0, should we
explicitly rule it out?  If not for 1.0, would the subsequent release make
sense?

> You suggested that this be done on the buffer level but it seems like that
> may be too narrow depending on batch size? What is the thinking here about
> tradeoffs around message versus batch.


Two reasons for this proposal:
- I'm not sure if there is much value add at the batch level vs simply
compressing the whole transport channel.  It could be that for small batch
sizes compression mostly goes unused.  But if it is seen as valuable we could
certainly incorporate a batch-level aspect as well.
- At the buffer level you can potentially use more specialized compression
techniques that don't require larger sized data to be effective.  For
example, there is a JIRA open to consider using PFOR [1] which, if I
understand correctly, starts being effective once you have ~128 integers.
(A rough per-buffer sketch follows below.)
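
To make the buffer-level option concrete, compressing a single buffer
independently could look roughly like this (a sketch only, using zstd's
simple one-shot API; this is not the proposed implementation and the function
name is invented):

  #include <cstdint>
  #include <stdexcept>
  #include <vector>

  #include <zstd.h>

  // Compress one Arrow buffer on its own, as the straw-man proposal suggests.
  std::vector<uint8_t> CompressBuffer(const uint8_t* data, size_t size,
                                      int level = 3) {
    std::vector<uint8_t> out(ZSTD_compressBound(size));
    size_t written = ZSTD_compress(out.data(), out.size(), data, size, level);
    if (ZSTD_isError(written)) {
      throw std::runtime_error(ZSTD_getErrorName(written));
    }
    out.resize(written);
    return out;
  }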

Thanks,
Micah

[1] https://github.com/lemire/FastPFor




On Fri, Jul 5, 2019 at 12:38 PM Jacques Nadeau  wrote:

> Initial thought: I don't think most of this should be targeted for 1.0. It
> is a lot of change/enhancement and seems like it would likely substantially
> delay 1.0. The one piece that seems least disruptive would be basic on the
> wire compression. You suggested that this be done on the buffer level but
> it seems like that maybe too narrow depending on batch size? What is the
> thinking here about tradeoffs around message versus batch. When pipelining,
> we target relatively small batches typically of 256k-1mb. Sometimes we
> might go up to 10mb but that is a pretty rare use case.
>
> On Fri, Jul 5, 2019 at 12:32 PM Jacques Nadeau  wrote:
>
>> Hey Micah, your formatting seems to be messed up on this mail. Some
>> kind of copy/paste error?
>>
>> On Fri, Jul 5, 2019 at 11:54 AM Micah Kornfield 
>> wrote:
>>
>>> Hi Arrow-dev,
>>>
>>> I’d like to make a straw-man proposal to cover some features that I think
>>> would be useful to Arrow, and that I would like to make a
>>> proof-of-concept
>>> implementation for in Java and C++.  In particular, the proposal covers
>>> allowing for smaller data sizes via compression and encoding [1][2][8],
>>> data integrity [3] and avoiding unnecessary data transfer [4][5].
>>>
>>> I’ve put together a PR  [6] that has proposed changes to the flatbuffer
>>> metadata to support the new features.  The PR introduces:
>>>
>>>-
>>>
>>>A new “SparseRecordBatch” that can support one of multiple possible
>>>encodings (both dense and sparse), compression and column elision.
>>>-
>>>
>>>A “Digest” message type to support optional data integrity.
>>>
>>>
>>> Going into more details on the specific features in the PR:
>>>
>>>1.
>>>
>>>Sparse encodings for arrays and buffers.  The guiding principles
>>> behind
>>>the suggested encodings are to support encodings that can be
>>> exploited by
>>>compute engines for more efficient computation (I don’t think parquet
>>> style
>>>bit-packing belongs in Arrow).  While the encodings don’t maintain
>>> O(1)
>>>data element access, they support sublinear, O(log(N)), element
>>> access. The
>>>suggested encodings are:
>>>1.
>>>
>>>   Array encodings:
>>>   1.
>>>
>>>  Add a run-length encoding scheme to efficiently represent
>>> repeated
>>>  values (the actual scheme encodes run ends instead of length
>>> to preserve
>>>  sub-linear random access).
>>>  2.
>>>
>>>  Add a “packed” sparse representation (null values don’t take up
>>>  space in value buffers)
>>>  2.
>>>
>>>   Buffer encodings:
>>>   1.
>>>
>>>  Add frame of reference integer encoding [7] (this allows for
>>> lower
>>>  bit-width encoding of integer types by subtracting a
>>> “reference” value from
>>>  all values in the buffer).
>>>  2.
>>>
>>>  Add a sparse integer set encoding.  This encoding allows more
>>>  efficient encoding of validity bit-masks for cases when all
>>> values are
>>>  either null or not null.
>>>  2.
>>>
>>>Data compression.  Similar to encodings but compression is solely for
>>>reduction of data at rest/on the wire.  The proposal is to allow
>>>compression of individual buffers. Right now zstd is proposed, but I
>>> don’t
>>>feel strongly on the specific technologies here.
>>>3.
>>>
>>>Column Elision.  For some use-cases, like structured logging, the
>>>overhead of including array metadata for columns with no data present
>>>represents non-negligible overhead.   The proposal

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-05 Thread Jacques Nadeau
One question and a random thought:

What is the driving force for transport compression? Are you seeing that as
a major bottleneck in particular circumstances? (I'm not disagreeing, just
want to clearly define the particular problem you're worried about.)

Random thought: what do you think of defining this at the transport level
rather than the record batch level? (e.g. in Arrow Flight). This is one way
to avoid extending the core record batch concept with something that isn't
related to processing (at least in your initial proposal).


Re: [Discuss][Java] Make the semantics of lastSet consistent

2019-07-05 Thread Jacques Nadeau
Ravindra, Praveen and Prudhvi, can you confirm the ramifications of this
change and what impact this inconsistency has had downstream?

On Thu, Jul 4, 2019 at 7:32 PM Fan Liya  wrote:

> There are two lastSet member variables in the code. One is in
> BaseVariableWidthVector and the other is in ListVector. In
> BaseVariableWidthVector, the lastSet refers to the last index that is
> actually set, while in ListVector, the lastSet refers to the next index
> that will be set. So there is an inconsistency.
>
>
> According to the name, lastSet should refer to the last index that is
> actually set. So the semantics in ListVector should be revised. However,
> the setLastSet and getLastSet methods in ListVector have been made public,
> so they cannot be changed freely.
>
>
> My initial idea is that we first change the internal semantics of
> ListVector, leaving the external semantics (setLastSet and getLastSet
> methods) unchanged. Meanwhile, we make the setLastSet & getLastSet methods
> deprecated. Changing the external semantics will be performed later as a
> long process.
>
>
> Would you please give some comments? Do you have some other ideas?
>
>
> Thank you in advance.
>
>
> Liya Fan
>


[jira] [Created] (ARROW-5866) [C++] Remove duplicate library in cpp/Brewfile

2019-07-05 Thread Yosuke Shiro (JIRA)
Yosuke Shiro created ARROW-5866:
---

 Summary: [C++] Remove duplicate library in cpp/Brewfile
 Key: ARROW-5866
 URL: https://issues.apache.org/jira/browse/ARROW-5866
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yosuke Shiro
Assignee: Yosuke Shiro






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-05 Thread Micah Kornfield
Hi Jacques,
I think our e-mails might have crossed, so I'm consolidating my responses
from the previous e-mail as well.

> I don't think most of this should be targeted for 1.0. It is a lot of
> change/enhancement and seems like it would likely substantially delay 1.0.


I agree it shouldn't block 1.0.  I think time-based releases are working
well for the community.  But if the features are implemented in Java and
C++ with integration tests in time for 1.0, should we explicitly rule it
out?  If not for 1.0, would the subsequent release make sense?

> What is the driving force for transport compression? Are you seeing that as
> a major bottleneck in particular circumstances? (I'm not disagreeing, just
> want to clearly define the particular problem you're worried about.)


I've been working on a 20% project where we appear to be I/O bound for
transporting record batches.  Also, I believe Ji Liu (tianchen92) has been
seeing some of the same bottlenecks with the query engine they are working
on.  Trading off some CPU here would allow us to lower the overall latency
in the system.

> You suggested that this be done on the buffer level but it seems like that
> may be too narrow depending on batch size? What is the thinking here about
> tradeoffs around message versus batch.


Two reasons for this proposal:
- I'm not sure if there is much value add at the batch level vs simply
compressing the whole transport channel.  It could be that for small batch
sizes compression mostly goes unused.  But if it is seen as valuable we could
certainly incorporate a batch-level aspect as well.
- At the buffer level you can use more specialized compression techniques
that don't require larger sized data to be effective.  For example, there is
a JIRA open to consider using PFOR [1] which, if I understand correctly,
starts being effective once you have ~128 integers.

> Random thought: what do you think of defining this at the transport level
> rather than the record batch level? (e.g. in Arrow Flight). This is one way
> to avoid extending the core record batch concept with something that isn't
> related to processing (at least in your initial proposal)


Per above, this seems like a reasonable approach to me if we want to hold
off on buffer-level compression.  Another use-case for buffer/record-batch
level compression would be the Feather file format, for decompressing only a
subset of columns/rows.  If this use-case isn't compelling, I'd be happy to
hold off adding compression to sparse batches until we have benchmarks
showing the trade-off between channel-level and buffer-level compression.

If we implement buffer-level encodings we should also see a decent space win
even without compression.

Thanks,
Micah

[1] https://github.com/lemire/FastPFor

On Fri, Jul 5, 2019 at 1:48 PM Jacques Nadeau  wrote:

> One question and a random thought:
>
> What is the driving force for transport compression? Are you seeing that
> as a major bottleneck in particular circumstances? (I'm not disagreeing,
> just want to clearly define the particular problem you're worried about.)
>
> Random thought: what do you think of defining this at the transport level
> rather than the record batch level? (e.g. in Arrow Flight). This is one way
> to avoid extending the core record batch concept with something that isn't
> related to processing (at least in your initial proposal).
>