Re: pyarrow kafka support

2020-07-21 Thread Micah Kornfield
Nothing exists in Arrow core to do this. You will need to decide manually
how to batch and serialize data into and out of Kafka.  The recent
discussion [1] on user@ about transferring data into and out of Redis provides
some pointers on how to do this.

Note that *fetch_pandas_all* is something that Snowflake developed on
their own (I imagine on top of the existing libraries).

[1]
https://lists.apache.org/x/thread.html/r949d6e477a0e4ed1807a5b305a3b79d045fa296cfbd1b66313463cc6@%3Cuser.arrow.apache.org%3E



On Tue, Jul 21, 2020 at 11:51 AM Mehul Batra  wrote:

> Hi Arrow Community,
>
>
>
> Do we have any API to ingest and process Apache Kafka data quickly using
> pyarrow/Python, just like we have *fetch_pandas_all* to ingest and
> process Snowflake data quickly?
>
>
>
> Thanks,
>
> Mehul Batra
>
>
>
>
>


Re: Java Arrow to C++ Arrow and vice versa

2020-07-21 Thread Micah Kornfield
>
> Was there any particular reason for not writing Java Arrow as a JNI binding
> for CPP Arrow?


The Java code base originated from Apache Drill and was the first
implementation of Arrow. There is value in having a pure Java
implementation separate from any C++ code base (JNI cannot be used in all
contexts).

On Tue, Jul 21, 2020 at 7:48 PM Ji Liu  wrote:

> Hi Chathura,
>
>
> https://lists.apache.org/thread.html/5bf70a6f1a3fa3e543a92b3217e64465a3b761ca307e8114550f9d8b@%3Cdev.arrow.apache.org%3E
> has
> the relevant pointers.
>
>
> Thanks,
> Ji Liu
>
>
>
> Chathura Widanage  于2020年7月22日周三 上午3:03写道:
>
> > Hi all,
> >
> > Was there any particular reason for not writing Java Arrow as a JNI
> binding
> > for CPP Arrow?
> >
> > What is the most straightforward and efficient way to convert a java
> arrow
> > schema/table to a JNI backed C++ arrow schema/table?
> >
> > Regards,
> > Chathura
> >
>


Re: Java Arrow to C++ Arrow and vice versa

2020-07-21 Thread Ji Liu
Hi Chathura,

https://lists.apache.org/thread.html/5bf70a6f1a3fa3e543a92b3217e64465a3b761ca307e8114550f9d8b@%3Cdev.arrow.apache.org%3E
has
the relevant pointers.


Thanks,
Ji Liu



Chathura Widanage  于2020年7月22日周三 上午3:03写道:

> Hi all,
>
> Was there any particular reason for not writing Java Arrow as a JNI binding
> for CPP Arrow?
>
> What is the most straightforward and efficient way to convert a java arrow
> schema/table to a JNI backed C++ arrow schema/table?
>
> Regards,
> Chathura
>


Re: [VOTE] Release Apache Arrow 1.0.0 - RC2

2020-07-21 Thread Sutou Kouhei
Hi,

+1 (binding)

I ran the following on Debian GNU/Linux sid:

  * INSTALL_NODE=0 \
  JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
  CUDA_TOOLKIT_ROOT=/usr \
  ARROW_CMAKE_OPTIONS="-DgRPC_SOURCE=BUNDLED -DBoost_NO_BOOST_CMAKE=ON" \
dev/release/verify-release-candidate.sh source 1.0.0 2
  * dev/release/verify-release-candidate.sh binaries 1.0.0 2
  * dev/release/verify-release-candidate.sh wheels 1.0.0 2

with:

  * gcc version 9.3.0 (Debian 9.3.0-15)
  * openjdk version "1.8.0_252"
  * Node.js v12.18.1
  * nvidia-cuda-dev 10.1.243-6+b1

Notes:

  * JavaScript tests fail without INSTALL_NODE=0 (with
    Node.js 14).

  * The Python 3.8 wheel's tests fail.


Thanks,
--
kou


In 
  "[VOTE] Release Apache Arrow 1.0.0 - RC2" on Tue, 21 Jul 2020 04:07:39 +0200,
  Krisztián Szűcs  wrote:

> Hi,
> 
> I would like to propose the following release candidate (RC2) of Apache
> Arrow version 1.0.0. This is a release consisting of 838
> resolved JIRA issues[1].
> 
> This release candidate is based on commit:
> b0d623957db820de4f1ff0a5ebd3e888194a48f0 [2]
> 
> The source release rc2 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7].
> The changelog is located at [8].
> 
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [9] for how to validate a release candidate.
> 
> The vote will be open for at least 72 hours.
> 
> [ ] +1 Release this as Apache Arrow 1.0.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow 1.0.0 because...
> 
> [1]: 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%201.0.0
> [2]: 
> https://github.com/apache/arrow/tree/b0d623957db820de4f1ff0a5ebd3e888194a48f0
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-1.0.0-rc2
> [4]: https://bintray.com/apache/arrow/centos-rc/1.0.0-rc2
> [5]: https://bintray.com/apache/arrow/debian-rc/1.0.0-rc2
> [6]: https://bintray.com/apache/arrow/python-rc/1.0.0-rc2
> [7]: https://bintray.com/apache/arrow/ubuntu-rc/1.0.0-rc2
> [8]: 
> https://github.com/apache/arrow/blob/b0d623957db820de4f1ff0a5ebd3e888194a48f0/CHANGELOG.md
> [9]: 
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates


Dummy Scalar in Filter and pass the scalar through Evaluate.

2020-07-21 Thread Gopinath Jaganmohan
Hi,

I would like to compile once and run many times using a Gandiva Filter, where only 
the scalar changes for each run. Currently I have to recreate the entire filter 
before each Evaluate. If there were a way to pass the scalar to Evaluate, it would 
drastically reduce compile time and allow reusing the same compiled code.

Our real use case is to implement group by.

Thanks
Gopinath Jaganmohan
CTO | ConverSight.ai
gopina...@conversight.ai | 812.371.3300 | www.conversight.ai




Java Arrow to C++ Arrow and vice versa

2020-07-21 Thread Chathura Widanage
Hi all,

Was there any particular reason for not writing Java Arrow as a JNI binding
for CPP Arrow?

What is the most straightforward and efficient way to convert a Java Arrow
schema/table to a JNI-backed C++ Arrow schema/table?

Regards,
Chathura


pyarrow kafka support

2020-07-21 Thread Mehul Batra
Hi Arrow Community,

Do we have any API to ingest and process Apache Kafka data quickly using 
pyarrow/Python, just like we have fetch_pandas_all to ingest and process 
Snowflake data quickly?

Thanks,
Mehul Batra





Introducing Cylon

2020-07-21 Thread Niranda Perera
Hi all,

We would like to introduce Cylon to the Arrow community. It is an
open-source, lean distributed data processing library that uses the Arrow data
format underneath. It is developed in C++ with bindings to Java and
Python. It has an in-memory Table API that integrates with the PyArrow Table
API. Cylon enables distributed data operations (e.g. join (all variants),
union, intersection, difference). It can be imported as a library into
existing applications or operate as a standalone framework. At the moment
it uses OpenMPI to distribute and communicate. It is released under the
Apache License.

We are developing a distributed dataframe API on top of the Cylon Table API.
It would be similar to the Dask/Modin dataframes. Our initial experiments
show promising performance. Cylon's language bindings are also very
lightweight. We just had the very first release of Cylon, and we would like to
hear from the Arrow community. Any comments and ideas are most appreciated!

Web visit - https://cylondata.org/  
Github - https://github.com/cylondata/cylon
Paper - https://arxiv.org/abs/2007.09589

Best
-- 
Niranda Perera
@n1r44 
+1 812 558 8884 / +94 71 554 8430
https://www.linkedin.com/in/niranda


Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization

2020-07-21 Thread Robert Nishihara
Hi all,

Regarding Plasma, you're right we should have started this conversation
earlier! The way it's being developed in Ray currently isn't useful as a
standalone project. We realized that tighter integration with Ray's object
lifetime tracking could be important, and removing IPCs and making it a
separate thread in the same process as our scheduler could make a big
difference for performance. Some of these optimizations wouldn't be easy
without a tight integration, so there are some trade-offs here.

Regarding the Python serialization format, I agree with Antoine that it
should be deprecated. We began developing it before pickle 5, but now that
pickle 5 has taken off, it makes less sense (it's useful in its own right,
but at the end of the day, we were interested in it as a way to serialize
arbitrary Python objects).

-Robert
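The pickle protocol 5 mechanism Robert mentions can be sketched with the standard library alone (Python 3.8+). This is a minimal illustration of the out-of-band buffer idea, not the Ray or pyarrow implementation: a large buffer is handed to a callback instead of being copied into the pickle stream, and reattached at load time.

```python
# Sketch of pickle protocol 5 out-of-band buffers, the mechanism that
# replaces pyarrow.serialize: large buffers travel outside the pickle
# stream, so they can be moved or shared without an extra copy.
import pickle

payload = bytearray(b"x" * 1024)

buffers = []
blob = pickle.dumps(pickle.PickleBuffer(payload), protocol=5,
                    buffer_callback=buffers.append)

# `buffers` now holds a zero-copy view of `payload`; ship it alongside
# `blob` (shared memory, sockets, ...) and reattach it when loading.
restored = pickle.loads(blob, buffers=buffers)
assert bytes(restored) == bytes(payload)
```

Objects like NumPy arrays and Arrow buffers opt into this by exposing `PickleBuffer` views in their `__reduce_ex__`, which is what makes pickle 5 a general replacement for a custom serialization format.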

On Sun, Jul 12, 2020 at 5:26 PM Wes McKinney  wrote:

> I'll add deprecation warnings to the pyarrow.serialize functions in
> question, it will be pretty simple.
>
> On Sun, Jul 12, 2020, 6:34 PM Neal Richardson  >
> wrote:
>
> > This seems like something to investigate after the 1.0 release.
> >
> > Neal
> >
> > On Sun, Jul 12, 2020 at 11:53 AM Antoine Pitrou 
> > wrote:
> >
> > >
> > > I'd certainly like to deprecate our custom Python serialization format,
> > > and using pickle protocol 5 instead is a very good idea.
> > >
> > > We can probably keep it in 1.0 while raising a FutureWarning.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 12/07/2020 à 19:22, Wes McKinney a écrit :
> > > > It appears that the Ray developers have decided to fork Plasma and
> > > > decouple from the Arrow codebase:
> > > >
> > > > https://github.com/ray-project/ray/pull/9154
> > > >
> > > > This is a disappointing development to occur without any discussion
> on
> > > > this mailing list but given the lack of development activity on
> Plasma
> > > > I would like to see how others in the community would like to
> proceed.
> > > >
> > > > It appears additionally that the Union-based serialization format
> > > > implemented by arrow/python/serialize.h and the pyarrow/serialize.py
> > > > has been dropped in favor of pickle5. If there is not value in
> > > > maintaining this code then it would probably be preferable for us to
> > > > remove this from the codebase.
> > > >
> > > > Thanks,
> > > > Wes
> > > >
> > >
> >
>


Re: [VOTE] Release Apache Arrow 1.0.0 - RC2

2020-07-21 Thread Antoine Pitrou


+1 (binding)

I tested the sources on Ubuntu 18.04, with CUDA enabled and
TEST_INTEGRATION_JS=0 TEST_JS=0.

Regards

Antoine.


Le 21/07/2020 à 04:07, Krisztián Szűcs a écrit :
> Hi,
> 
> I would like to propose the following release candidate (RC2) of Apache
> Arrow version 1.0.0. This is a release consisting of 838
> resolved JIRA issues[1].
> 
> [...]
> 


Re: [VOTE] Release Apache Arrow 1.0.0 - RC2

2020-07-21 Thread Ryan Murray
+0 (non-binding)


I verified source, release, binaries, and integration tests for Python, C++,
and Java. All went fine except for one failed C++ Gandiva test: [  FAILED  ]
TestProjector.TestDateTime


Not sure if this is known or expected?


On Tue, Jul 21, 2020 at 1:32 PM Andy Grove  wrote:

> +1 (binding) on testing the Rust implementation only.
>
> I did notice that the release script is not updating all the versions
> correctly and I filed a JIRA [1].
>
> This shouldn't prevent the release though since this one version number can
> be updated manually when we publish the crates.
>
> [1] https://issues.apache.org/jira/browse/ARROW-9537
>
> On Mon, Jul 20, 2020 at 8:08 PM Krisztián Szűcs  >
> wrote:
>
> > Hi,
> >
> > I would like to propose the following release candidate (RC2) of Apache
> > Arrow version 1.0.0. This is a release consisting of 838
> > resolved JIRA issues[1].
> >
> > [...]
> >
>


Re: 1.0 release announcement blog post: help needed

2020-07-21 Thread Andy Grove
I created a PR to add Rust notes.

https://github.com/nealrichardson/arrow-site/pull/5

On Mon, Jul 20, 2020 at 3:35 PM Andy Grove  wrote:

> I'll put something together for Rust today.
>
> On Mon, Jul 20, 2020 at 3:27 PM Sutou Kouhei  wrote:
>
>> Hi,
>>
>> Sorry. I've filled the Ruby part.
>> Thanks for notifying this.
>>
>> --
>> kou
>>
>> In 
>>   "Re: 1.0 release announcement blog post: help needed" on Mon, 20 Jul
>> 2020 13:22:23 -0700,
>>   Neal Richardson  wrote:
>>
>> > Thanks to all who have contributed. We are still missing notes for Ruby
>> and
>> > Rust. Looks like we have a couple more days to finish this up, but
>> please
>> > do submit some notes for those languages.
>> >
>> > Thanks,
>> > Neal
>> >
>> > On Thu, Jul 16, 2020 at 3:40 PM Neal Richardson <
>> neal.p.richard...@gmail.com>
>> > wrote:
>> >
>> >> Hi all,
>> >> In anticipation of our 1.0 release, I've started a draft of a blog post
>> >> announcement:
>> >>
>> https://github.com/apache/arrow-site/pull/63/files#diff-e1757e061ab8aaf0dde183d74980f588
>> >>
>> >> For those who maintain the various languages/libraries within the
>> project,
>> >> please fill in the corresponding sections. Note that the audience is
>> the
>> >> broader user community, not Arrow developers, so please write clearly
>> using
>> >> terms they will understand and care about.
>> >>
>> >> You can either push commits to my branch, or you can use the GitHub PR
>> >> "suggestion" feature to add content in the browser.
>> >>
>> >> Note that this draft is on the branch where I've been working on the
>> Arrow
>> >> website redesign. If you haven't already, you can also take this
>> >> opportunity to review and provide feedback on that. Preview site is
>> >> deployed to https://enpiar.com/arrow-site/.
>> >>
>> >> Thanks,
>> >> Neal
>> >>
>>
>


Re: [VOTE] Release Apache Arrow 1.0.0 - RC2

2020-07-21 Thread Andy Grove
+1 (binding) on testing the Rust implementation only.

I did notice that the release script is not updating all the versions
correctly and I filed a JIRA [1].

This shouldn't prevent the release though since this one version number can
be updated manually when we publish the crates.

[1] https://issues.apache.org/jira/browse/ARROW-9537

On Mon, Jul 20, 2020 at 8:08 PM Krisztián Szűcs 
wrote:

> Hi,
>
> I would like to propose the following release candidate (RC2) of Apache
> Arrow version 1.0.0. This is a release consisting of 838
> resolved JIRA issues[1].
>
> [...]
>


[NIGHTLY] Arrow Build Report for Job nightly-2020-07-21-0

2020-07-21 Thread Crossbow


Arrow Build Report for Job nightly-2020-07-21-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0

Failed Tasks:
- conda-win-vs2017-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-azure-conda-win-vs2017-py36
- conda-win-vs2017-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-azure-conda-win-vs2017-py37
- conda-win-vs2017-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-azure-conda-win-vs2017-py38
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-github-test-conda-cpp
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-github-test-conda-python-3.7-hdfs-2.9.2

Pending Tasks:
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-github-test-conda-python-3.7-turbodbc-master
- test-r-linux-as-cran:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-github-test-r-linux-as-cran

Succeeded Tasks:
- centos-6-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-github-centos-6-amd64
- centos-7-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-travis-centos-7-aarch64
- centos-7-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-github-centos-7-amd64
- centos-8-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-travis-centos-8-aarch64
- centos-8-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-github-centos-8-amd64
- conda-clean:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-azure-conda-clean
- conda-linux-gcc-py36-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-azure-conda-linux-gcc-py36-cpu
- conda-linux-gcc-py36-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-azure-conda-linux-gcc-py37-cpu
- conda-linux-gcc-py37-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-azure-conda-linux-gcc-py38-cuda
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-azure-conda-osx-clang-py38
- debian-buster-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-github-debian-buster-amd64
- debian-buster-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-travis-debian-buster-arm64
- debian-stretch-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-github-debian-stretch-amd64
- debian-stretch-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-travis-debian-stretch-arm64
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-travis-gandiva-jar-osx
- gandiva-jar-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-travis-gandiva-jar-xenial
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-travis-homebrew-cpp
- homebrew-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-travis-homebrew-r-autobrew
- nuget:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-github-nuget
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-github-test-conda-cpp-valgrind
- test-conda-python-3.6-pandas-0.23:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-github-test-conda-python-3.6-pandas-0.23
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-github-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-git

Re: [DISCUSS] Using direct memory size as a limit of populated off-heap buffers in Java

2020-07-21 Thread Hongze Zhang
Thanks for the inputs Micah.



So the question is whether we need to use Bits.java or not. If Netty is 
considered to be optional, maybe it's more acceptable to just use 
Bits.java, since the Dataset module is built-in? This way we can treat all built-in 
off-heap memory allocation as direct memory allocation.


Hongze


At 2020-07-21 11:48:36, "Micah Kornfield"  wrote:
>I don't have deep expertise here, but I think we should not choose one of
>the options with Netty.  There has been a decent amount of work to decouple
>Arrow from any hard Netty dependencies.
>
>On Mon, Jul 20, 2020 at 3:52 AM Hongze Zhang  wrote:
>
>> Hi,
>>
>> I want to discuss a bit about the discussion[1] in the pending PR[2] for
>> Java Dataset (it's no longer "Datasets" I guess?) API.
>>
>>
>> - Background:
>>
>> We are transferring C++ Arrow buffers to Java side BufferAllocators. We
>> should decide whether to use -XX:MaxDirectMemorySize as a limit of these
>> buffers. If yes, what would be the desired solution?
>>
>> - Possible alternative solutions so far:
>>
>> 1. Reserve from Bits.java from the Java side
>>
>> Pros: Shares the memory counter with JVM direct byte buffers; no JNI overhead;
>> less code
>> Cons: More invocations (each buffer a call to Bits#reserveMemory)
>>
>> 2. Reserve from Bits.java from the C++ side
>>
>> Pros: Shares the memory counter with JVM direct byte buffers; fewer invocations
>> (e.g. if using jemalloc, we can somehow perform one call per underlying
>> chunk)
>> Cons: JNI overhead; more code
>>
>> 3. Reserve from Netty's PlatformDependent.java from the Java side
>>
>> Pros: Shares the memory counter with Netty-based buffers; no JNI overhead; less
>> code
>> Cons: More invocations
>>
>> 4. Reserve from Netty's PlatformDependent.java from the C++ side
>>
>> Pros: Shares the memory counter with Netty-based buffers; fewer invocations
>> Cons: JNI overhead; more code
>>
>> 5. Not to implement any of the above; respect the BufferAllocator's limit
>> only.
>>
>>
>> So far I prefer 5, i.e. not to use any of the solutions. I am not sure if
>> "direct memory" is a good indicator for these off-heap buffers, because we
>> would finally have to decide to share a counter with either JVM direct byte
>> buffers or Netty-based buffers. As far as I can tell, a complete
>> solution would ideally either have a global counter for all types of
>> off-heap buffers, or give each type an individual counter.
>>
>> So do you have any thoughts or suggestions on this topic? It would be
>> great if we could reach a conclusion soon, as the PR has been blocked for
>> some time. Thanks in advance :)
>>
>>
>> Best,
>> Hongze
>>
>> [1] https://github.com/apache/arrow/pull/7030#issuecomment-657096664
>> [2] https://github.com/apache/arrow/pull/7030
>>


[DISCUSS] Execute dataset scan tasks in distributed system

2020-07-21 Thread Hongze Zhang
Hi all,

Has anyone ever tried using the Arrow Dataset API in a distributed system? E.g. 
create scan tasks on machine 1, then send and execute these tasks on machines 
2, 3, and 4.

So far I think a possible workaround is to:

1. Create Dataset on machine 1;
2. Call Scan(), collect all scan tasks from scan task iterator;
3. Say we have 5 tasks numbered 1, 2, 3, 4, 5, and we decide to run 
tasks 1, 2 on machine 2 and tasks 3, 4, 5 on machine 3;
4. Send the target task numbers to machines 2 and 3 respectively;
5. Create a Dataset with the same configuration on machines 2 and 3, and call 
Scan() to create the 5 tasks on each machine;
6. On machine 2, run tasks 1, 2; skip 3, 4, 5;
7. On machine 3, skip tasks 1, 2; run 3, 4, 5.
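The task-skipping idea in the steps above can be sketched with the standard library alone. This is a hypothetical illustration, not Dataset API code: it assumes Scan() yields tasks in a deterministic order on every machine, so each worker can enumerate the full list and keep only its assigned numbers.

```python
# Stdlib-only sketch of the workaround: every machine enumerates the
# same ordered task list (assuming Dataset::Scan() is deterministic)
# and executes only the task numbers assigned to it, skipping the rest.
def run_assigned(all_tasks, assigned_numbers):
    """Return the tasks this machine should run, skipping the others."""
    return [t for number, t in enumerate(all_tasks, start=1)
            if number in assigned_numbers]

tasks = ["scan-task-1", "scan-task-2", "scan-task-3",
         "scan-task-4", "scan-task-5"]
# As in the steps above: machine 2 runs tasks 1-2, machine 3 runs 3-5.
machine2 = run_assigned(tasks, {1, 2})
machine3 = run_assigned(tasks, {3, 4, 5})
```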

This should work correctly only if we assume that the method `Dataset::Scan()` 
returns exactly the same task iterator on different machines. And I'm not sure 
whether unnecessary overhead is introduced in the process; after all, we'll run 
the scan method N times for N machines.

A better solution I can think of is to make scan tasks serializable so we 
could distribute them directly to machines. Currently they don't seem to be 
designed that way, since we allow contextual state to be used to create the 
tasks, e.g. the open readers in ParquetScanTask[1]. At the same time, a 
built-in ser/de mechanism would be needed. Either way, a bunch of work has to be done.

So far I am not sure which way is more reasonable, or whether there is a better 
one than both. If you have any thoughts, please let me know.

Best,
Hongze

[1] 
https://github.com/apache/arrow/blob/c09a82a388e79ddf4377f44ecfe515604f147270/cpp/src/arrow/dataset/file_parquet.cc#L52-L74