[jira] [Created] (ARROW-3481) [Java] Fix Java building failure when using Maven 3.5.4

2018-10-09 Thread Yuhong Guo (JIRA)
Yuhong Guo created ARROW-3481:
-

 Summary: [Java] Fix Java building failure when using Maven 3.5.4
 Key: ARROW-3481
 URL: https://issues.apache.org/jira/browse/ARROW-3481
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Yuhong Guo






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3480) [Website] Install document for Ubuntu is broken

2018-10-09 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-3480:
---

 Summary: [Website] Install document for Ubuntu is broken
 Key: ARROW-3480
 URL: https://issues.apache.org/jira/browse/ARROW-3480
 Project: Apache Arrow
  Issue Type: Bug
  Components: Website
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


[https://lists.apache.org/thread.html/11f0aee1ebde1a011816b84dcd3dca4f7bf14dd397b7531451870f29@%3Cuser.arrow.apache.org%3E]

{quote}
The instructions found here https://arrow.apache.org/install/ don't work.
The /etc/apt/sources.list.d/red-data-tools.list file points to the 'main'
component. The 'main' component only exists for Debian; for Ubuntu it
should be 'universe'.
Seen here: https://packages.red-data-tools.org/ubuntu/dists/bionic/universe
{quote}





[jira] [Created] (ARROW-3479) [R] Support to write record_batch as stream

2018-10-09 Thread Javier Luraschi (JIRA)
Javier Luraschi created ARROW-3479:
--

 Summary: [R] Support to write record_batch as stream
 Key: ARROW-3479
 URL: https://issues.apache.org/jira/browse/ARROW-3479
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Javier Luraschi


Currently, one can only export a record batch to a file:
{code:java}
record <- arrow::record_batch(data.frame(a = c(1,2,3)))
record$to_file()
{code}
But to improve performance in Spark's R bindings through sparklyr, it would 
help to support streams returning R raw vectors, as follows:
{code:java}
record <- arrow::record_batch(data.frame(a = c(1,2,3)))
record$to_stream()
{code}





[jira] [Created] (ARROW-3478) [C++] API add value / null mask buffer accessor to ArrayData

2018-10-09 Thread Wolf Vollprecht (JIRA)
Wolf Vollprecht created ARROW-3478:
--

 Summary: [C++] API add value / null mask buffer accessor to 
ArrayData
 Key: ARROW-3478
 URL: https://issues.apache.org/jira/browse/ARROW-3478
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wolf Vollprecht


Currently, the ArrayData struct has the `std::vector<std::shared_ptr<Buffer>> buffers` member.

The buffers will (from reading the code) either be [null_mask, data] or [data] 
(if the null mask does not exist).

I'm not sure there is an easy way to get the null mask reliably at the 
moment. If I understand correctly, the way to do it right now is to check 
whether the vector has one or two elements, and then use `buffers[0]` as the 
null mask and `buffers[1]` as the values.

I also did not find information about this in the spec, so I am not sure 
whether I can rely on this behavior in future versions of the library.

I am wondering whether adding an explicit API for this would make things more 
reliable.

 

For example, two more interface functions:
 * `std::shared_ptr<Buffer> mask()`
 * `std::shared_ptr<Buffer> values()`

would make it easy for me to rely on the interface to "do the right thing".
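For illustration, the length-based dispatch that such accessors would have to encapsulate can be sketched in plain Python. This mirrors the convention described in the message, not Arrow's actual C++ API; the helper name is made up:

```python
def split_buffers(buffers):
    """Split an ArrayData-style buffer list into (null_mask, values).

    Two buffers mean [null_mask, values]; a single buffer means
    [values] with no null mask present.
    """
    if len(buffers) == 2:
        return buffers[0], buffers[1]
    if len(buffers) == 1:
        return None, buffers[0]
    raise ValueError("unexpected buffer count: %d" % len(buffers))

# With a null mask present:
mask, values = split_buffers([b"\x05", b"values"])
# Without a null mask:
no_mask, values2 = split_buffers([b"values"])
```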

 

Or am I missing something?





[jira] [Created] (ARROW-3477) Testsuite fails on 32 bit arch

2018-10-09 Thread Dmitry Kalinkin (JIRA)
Dmitry Kalinkin created ARROW-3477:
--

 Summary: Testsuite fails on 32 bit arch
 Key: ARROW-3477
 URL: https://issues.apache.org/jira/browse/ARROW-3477
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.11.0
Reporter: Dmitry Kalinkin
 Attachments: arrow_0.10.0_i686_test_fail.log, 
arrow_0.11.0_i686_test_fail.log

While investigating PARQUET-1438, we discovered a regression in the arrow-cpp 
test suite between versions 0.10.0 and 0.11.0 when running as a 32-bit 
executable. There used to be just a single failing test:

* array-test

Starting with 0.11.0, four tests fail:

* array-test
* buffer-test
* bit-util-test
* rle-encoding-test

(This list does not include the parquet-* tests.)

I bisected and found that the three newly failing tests broke in 
479c011a6ac7a8f1e6d77ecf651a4b2be9e5eec0





[jira] [Created] (ARROW-3476) [Java] mvn test in memory fails on a big-endian platform

2018-10-09 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created ARROW-3476:
---

 Summary: [Java] mvn test in memory fails on a big-endian platform
 Key: ARROW-3476
 URL: https://issues.apache.org/jira/browse/ARROW-3476
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Kazuaki Ishizaki


On a big-endian platform, {{mvn test}} in the {{java/memory}} module fails due 
to an assertion.
In the {{TestEndianess.testLittleEndian}} test, the assertion occurs during 
allocation of a {{RootAllocator}}.

{code}
$ uname -a
Linux ppc64be.novalocal 4.5.7-300.fc24.ppc64 #1 SMP Fri Jun 10 20:29:32 UTC 
2016 ppc64 ppc64 ppc64 GNU/Linux
$ arch  
ppc64
$ cd java/memory
$ mvn test
[INFO] Scanning for projects...
[INFO] 
[INFO] 
[INFO] Building Arrow Memory 0.12.0-SNAPSHOT
[INFO] 
[INFO] 
...
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.082 s 
- in org.apache.arrow.memory.TestAccountant
[INFO] Running org.apache.arrow.memory.TestLowCostIdentityHashMap
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 s 
- in org.apache.arrow.memory.TestLowCostIdentityHashMap
[INFO] Running org.apache.arrow.memory.TestBaseAllocator
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.746 s 
<<< FAILURE! - in org.apache.arrow.memory.TestEndianess
[ERROR] testLittleEndian(org.apache.arrow.memory.TestEndianess)  Time elapsed: 
0.313 s  <<< ERROR!
java.lang.ExceptionInInitializerError
at 
org.apache.arrow.memory.TestEndianess.testLittleEndian(TestEndianess.java:31)
Caused by: java.lang.IllegalStateException: Arrow only runs on LittleEndian 
systems.
at 
org.apache.arrow.memory.TestEndianess.testLittleEndian(TestEndianess.java:31)

[ERROR] Tests run: 22, Failures: 0, Errors: 21, Skipped: 1, Time elapsed: 0.055 
s <<< FAILURE! - in org.apache.arrow.memory.TestBaseAllocator
...
{code}
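The failing assertion reflects Arrow's requirement that the host be little-endian. A minimal sketch of such a platform check, written in Python for illustration rather than mirroring the actual Java implementation:

```python
import struct
import sys

def is_little_endian():
    """Return True when the native byte order is little-endian:
    packing an int with native order then matches the explicit
    little-endian encoding."""
    return struct.pack("=I", 1) == struct.pack("<I", 1)

# An Arrow-style guard would raise on a big-endian host such as
# the ppc64 machine in the log above:
if not is_little_endian():
    raise RuntimeError("Arrow only runs on LittleEndian systems.")
```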





Re: Nightly tests for Arrow

2018-10-09 Thread Krisztián Szűcs
On Tue, Oct 9, 2018 at 6:02 PM Antoine Pitrou  wrote:

>
> Le 09/10/2018 à 17:54, Wes McKinney a écrit :
> > hi folks,
> >
> > After the packaging automation work for 0.10 was completed, we have
> > stalled out a bit on one of the objectives of this framework, which is
> > to allow contributors to define and add new tasks that can be run on
> > demand or as part of a nightly job.
> >
> > So we have some problems to solve:
> >
> > * How to define a task we wish to validate (like building the API
> > documentation, or building Arrow with some particular build
> > parameters) as a new Crossbow task -- document this well so that
> > people have some instructions to follow
>
Crossbow indeed lacks documentation in that area. Defining a task requires
a CI configuration, per-platform commands, and a section in tasks.yml.
However, I think this is not as straightforward as it should be - ideally
it would be as simple as creating a bash/batch script - we still need to
define configuration-management details (which makes user-friendliness
harder to achieve).

> > * How to add a task to some kind of a nightly build manifest
>
> * Where to schedule and run the nightly jobs
>
Currently, nightly builds are submitted by this Travis script:
https://github.com/kszucs/crossbow/blob/trigger-nightly-builds/.travis.yml
We can have an arbitrary number of branches to trigger custom jobs, but that
requires manual Travis setup - and the ergonomics are still not satisfying.

> > * Reporting nightly build failures to the mailing list
>
I regularly check the nightly builds, which occasionally fail, mostly with
transient failures.
For example, the last conda nightlies failed because conda-build has some
issues with libarchive - during the feedstock updates I couldn't even
rerender them locally.
By the way, to send the errors to the mailing list, we need to set the
CROSSBOW_EMAIL environment variable:
https://github.com/apache/arrow/blob/master/dev/tasks/crossbow.py#L475
(We might want to use a centralized Crossbow repository with proper
permissions, though.)

> >
> > In terms of scalability requirements, this needs to accommodate 50-100
> tasks.
>
The current tasks.yml contains a lot of duplication, which bothers me, but
it provides more flexibility than having another "matrix" definition and
implementation. I don't have a user-friendly solution for that yet.
Parallelization is another question: a single Crossbow repo can run ~5
Travis jobs and a single AppVeyor job simultaneously, but we can improve
that by introducing more CI services, e.g. Pipelines and/or CircleCI.

CI-service agnostic?
Ideally, we should abstract away the CI service (the worker itself), which
is where we do the configuration management right now; see the .yml files:
https://github.com/apache/arrow/tree/master/dev/tasks/conda-recipes
But then we would need to create another custom (I hope not YAML) "dialect"
to define build requirements (e.g. node, python, ruby, clang, etc.). It's
quite hard to design an easy and flexible interface for that.

> >
> > This won't be the last time we need to do some infrastructure work to
> > scale our testing process, but this will help with testing things that
> > we want to make sure work but without having to increase the size of
> > our CI matrix.
>
> One question which came to my mind is how to develop, debug and maintain
> the nightly tasks without waiting for the nightly Travis run for
> validation.  It doesn't seem easy to trigger a "nightly" build from the
> Travis UI.
>
Good point! Triggering is not the actual issue; the evaluation of the
outcome is. We could submit builds if the PR touches e.g. the task
definitions, but we cannot really wait for the results, so triggering
builds could be useless.

Actually, this could be solved by the GitHub integration bot Wes mentioned,
with manual triggering and approval.

>
> Regards
>
> Antoine.
>
All in all, I feel usability is crucial here. A couple of examples of what
a straightforward task definition should look like would be handy. Handling
and defining task dependencies is another open question (I'm experimenting
with a prototype, though).

Regards, Krisztian


RE: Apache Arrow .NET implementation

2018-10-09 Thread Christopher S. Hutchinson
I planned on contributing long term in an open-source capacity separate from my 
employer, but I imagine spending professional time on it as our product evolves. 
I don't have an Apache ID at the moment; I also forgot to include that in my 
ICLA, although in my submission to the secretary I specified the ID(s) I prefer.

-Original Message-
From: Wes McKinney [mailto:wesmck...@gmail.com] 
Sent: Tuesday, October 9, 2018 12:04 PM
To: dev@arrow.apache.org
Subject: Re: Apache Arrow .NET implementation

hi Christopher,

This is great to hear. Do you and your colleagues have plans to continue 
developing it if the donation is accepted into Apache Arrow?

Thanks,
Wes
On Tue, Oct 9, 2018 at 12:01 PM Christopher S. Hutchinson 
 wrote:
>
> I am writing to announce that my employer (Feyen Zylstra LLC) has developed 
> an implementation of Apache Arrow targeting .NET Standard. We are donating 
> this implementation to the Apache Software Foundation. The source code is 
> available for review on GitHub:
>
> https://github.com/feyenzylstra/apache-arrow
>
> Please let me know if you have any questions. Feel free to leave issues on 
> GitHub, and happy voting!
>


Re: Nightly tests for Arrow

2018-10-09 Thread Wes McKinney
On Tue, Oct 9, 2018 at 12:02 PM Antoine Pitrou  wrote:
>
> One question which came to my mind is how to develop, debug and maintain
> the nightly tasks without waiting for the nightly Travis run for
> validation.  It doesn't seem easy to trigger a "nightly" build from the
> Travis UI.

I think we should develop a bot that we can ask to run tasks having a
particular name or matching a wildcard. e.g. something like

@arrow-test-bot validate conda

>
> Regards
>
> Antoine.


Re: Apache Arrow .NET implementation

2018-10-09 Thread Wes McKinney
hi Christopher,

This is great to hear. Do you and your colleagues have plans to
continue developing it if the donation is accepted into Apache Arrow?

Thanks,
Wes
On Tue, Oct 9, 2018 at 12:01 PM Christopher S. Hutchinson
 wrote:
>
> I am writing to announce that my employer (Feyen Zylstra LLC) has developed 
> an implementation of Apache Arrow targeting .NET Standard. We are donating 
> this implementation to the Apache Software Foundation. The source code is 
> available for review on GitHub:
>
> https://github.com/feyenzylstra/apache-arrow
>
> Please let me know if you have any questions. Feel free to leave issues on 
> GitHub, and happy voting!
>


Apache Arrow .NET implementation

2018-10-09 Thread Christopher S. Hutchinson
I am writing to announce that my employer (Feyen Zylstra LLC) has developed an 
implementation of Apache Arrow targeting .NET Standard. We are donating this 
implementation to the Apache Software Foundation. The source code is available 
for review on GitHub:

https://github.com/feyenzylstra/apache-arrow

Please let me know if you have any questions. Feel free to leave issues on 
GitHub, and happy voting!



Re: Nightly tests for Arrow

2018-10-09 Thread Antoine Pitrou


Le 09/10/2018 à 17:54, Wes McKinney a écrit :
> hi folks,
> 
> After the packaging automation work for 0.10 was completed, we have
> stalled out a bit on one of the objectives of this framework, which is
> to allow contributors to define and add new tasks that can be run on
> demand or as part of a nightly job.
> 
> So we have some problems to solve:
> 
> * How to define a task we wish to validate (like building the API
> documentation, or building Arrow with some particular build
> parameters) as a new Crossbow task -- document this well so that
> people have some instructions to follow
> * How to add a task to some kind of a nightly build manifest
> * Where to schedule and run the nightly jobs
> * Reporting nightly build failures to the mailing list
> 
> In terms of scalability requirements, this needs to accommodate 50-100 tasks.
> 
> This won't be the last time we need to do some infrastructure work to
> scale our testing process, but this will help with testing things that
> we want to make sure work but without having to increase the size of
> our CI matrix.

One question which came to my mind is how to develop, debug and maintain
the nightly tasks without waiting for the nightly Travis run for
validation.  It doesn't seem easy to trigger a "nightly" build from the
Travis UI.

Regards

Antoine.


Nightly tests for Arrow

2018-10-09 Thread Wes McKinney
hi folks,

After the packaging automation work for 0.10 was completed, we have
stalled out a bit on one of the objectives of this framework, which is
to allow contributors to define and add new tasks that can be run on
demand or as part of a nightly job.

So we have some problems to solve:

* How to define a task we wish to validate (like building the API
documentation, or building Arrow with some particular build
parameters) as a new Crossbow task -- document this well so that
people have some instructions to follow
* How to add a task to some kind of a nightly build manifest
* Where to schedule and run the nightly jobs
* Reporting nightly build failures to the mailing list

In terms of scalability requirements, this needs to accommodate 50-100 tasks.

This won't be the last time we need to do some infrastructure work to
scale our testing process, but this will help with testing things that
we want to make sure work but without having to increase the size of
our CI matrix.

Thoughts about how to proceed?

Thanks
Wes


[jira] [Created] (ARROW-3475) C++ Int64Builder.Finish(NumericArray)

2018-10-09 Thread Wolf Vollprecht (JIRA)
Wolf Vollprecht created ARROW-3475:
--

 Summary: C++ Int64Builder.Finish(NumericArray)
 Key: ARROW-3475
 URL: https://issues.apache.org/jira/browse/ARROW-3475
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wolf Vollprecht


I was intuitively thinking that the following code would work:

{code}
Status s;
Int64Builder builder;
s = builder.Append(1);
s = builder.Append(2);

std::shared_ptr<NumericArray<Int64Type>> array;
builder.Finish(&array);
{code}



However, it does not seem to work, as the Finish operation is not overloaded in 
the Int64 builder (or the numeric builder).

Would it make sense to add this interface?





[jira] [Created] (ARROW-3474) [Glib] Extend gparquet API with get_schema and read_column

2018-10-09 Thread Benoit Rostykus (JIRA)
Benoit Rostykus created ARROW-3474:
--

 Summary: [Glib] Extend gparquet API with get_schema and read_column
 Key: ARROW-3474
 URL: https://issues.apache.org/jira/browse/ARROW-3474
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Affects Versions: 0.11.0
Reporter: Benoit Rostykus


So that we can read individual columns without loading the whole Parquet file 
into memory, we need to surface the getSchema and ReadColumn functions of 
parquet-cpp in the Parquet GLib API.





Re: [RESULT][VOTE] Release Apache Arrow 0.11.0 (RC1)

2018-10-09 Thread Kouhei Sutou
Hi,

> I clicked "Release" button at
> https://repository.apache.org/#stagingRepositories but
> https://search.maven.org/search?q=g:org.apache.arrow%20AND%20v:0.11.0
> shows nothing.

The 0.11.0 packages are there now.
I should have waited...

In <20181009.001150.1130060421909409464@clear-code.com>
  "Re: [RESULT][VOTE] Release Apache Arrow 0.11.0 (RC1)" on Tue, 09 Oct 2018 
00:11:50 +0900 (JST),
  Kouhei Sutou  wrote:

> Hi,
> 
> One problem for Java packages:
> 
> I clicked "Release" button at
> https://repository.apache.org/#stagingRepositories but
> https://search.maven.org/search?q=g:org.apache.arrow%20AND%20v:0.11.0
> shows nothing.
> 
> Can you help this?
> 
> 
> Here are the remaining tasks:
> 
>   * Updating the Arrow website
> 
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingtheArrowwebsite
> 
> The only remaining subtask is the blog post. Wes is working
> on this:
>   https://github.com/apache/arrow/pull/2724
> 
> Can contributors for 0.11.0 confirm this?
> 
> (I'll confirm this tomorrow.)
> 
>   * Updating website with new API documentation
> 
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingwebsitewithnewAPIdocumentation
> 
> Krisztián is working on this:
>   https://github.com/apache/arrow/pull/2723
> 
>   * Updating conda packages
> 
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Updatingcondapackages
> 
> Krisztián is working on this.
> 
>   * Announcing release
> 
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Announcingrelease
> 
> I'll do this after the above blog post is published.
> 
> Other tasks in "Post-release tasks"
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Post-releasetasks
> have been done.
> 
> 
> Thanks,
> --
> kou
> 
> In <20181008.194105.48515664531298@clear-code.com>
>   "[RESULT][VOTE] Release Apache Arrow 0.11.0 (RC1)" on Mon, 08 Oct 2018 
> 19:41:05 +0900 (JST),
>   Kouhei Sutou  wrote:
> 
>> With 3 binding +1 votes, 1 non-binding +1 and no other
>> votes, the vote passes. Thanks all!
>> 
>> I'll start "Post-release tasks".
>>   
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Post-releasetasks
>> 
>> Wes will write a blog post.
>> 
>> Krisztián will create the conda-forge PRs.
>> 
>> Any other help is also welcome!
>> 
>> 
>> Thanks,
>> --
>> kou


[ANNOUNCE] Apache Arrow 0.11.0 released

2018-10-09 Thread Kouhei Sutou
The Apache Arrow community is pleased to announce the 0.11.0 release. It
includes 288 resolved issues ([1]) since the 0.10.0 release.

The release is available now from our website and [2]:
  https://arrow.apache.org/install/

Read about what's new in the release
  https://arrow.apache.org/blog/2018/10/09/0.11.0-release/

Changelog
  https://arrow.apache.org/release/0.11.0.html

What is Apache Arrow?
-

Apache Arrow is a columnar in-memory analytics layer designed to accelerate big
data. It houses a set of canonical in-memory representations of flat and
hierarchical data along with multiple language bindings for structure
manipulation. It also provides low-overhead streaming and batch messaging,
zero-copy interprocess communication (IPC), and vectorized in-memory analytics
libraries.

Please report any feedback to the mailing lists ([3]).

Regards,
The Apache Arrow community

[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20fixVersion%20%3D%200.11.0%20ORDER%20BY%20priority%20DESC
[2]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.11.0/
[3]: https://lists.apache.org/list.html?dev@arrow.apache.org


[jira] [Created] (ARROW-3473) [Format] Update Layout.md document to clarify use of 64-bit array lengths

2018-10-09 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3473:
---

 Summary: [Format] Update Layout.md document to clarify use of 
64-bit array lengths
 Key: ARROW-3473
 URL: https://issues.apache.org/jira/browse/ARROW-3473
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format
Reporter: Wes McKinney
 Fix For: 0.12.0


See https://github.com/apache/arrow/issues/2733. While 64-bit lengths are 
permitted, it is recommended to limit array sizes to 32-bit lengths or less. I 
will update the document accordingly.





Re: [JAVA] Arrow performance measurement

2018-10-09 Thread Animesh Trivedi
Hi Wes and all,

Here is another round of updates:

Quick recap - previously we established that for 1kB binary blobs, Arrow
can deliver > 160 Gbps performance from in-memory buffers.

In this round I looked at the performance of materializing "integers". In
my benchmarks, I found that with careful optimization and code rewriting we
can push the performance of integer reading from 5.42 Gbps/core to 13.61
Gbps/core (~2.5x). The peak performance with 16 cores scales up to 110+
Gbps. The key things to do are:

1) Disable memory-access checks in Arrow and Netty buffers. This gave a
significant performance boost. However, for such an important performance
flag, it is very poorly documented
("drill.enable_unsafe_memory_access=true").

2) Materialize values from Validity and Value direct buffers instead of
calling getInt() function on the IntVector. This is implemented as a new
Unsafe reader type (
https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L31
)

3) Optimize the bitmap operation that checks whether a bit is set (
https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L23
)

A detailed write up of these steps is available here:
https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-09-arrow-int.md

I have 2 follow-up questions:

1) Regarding the `isSet` function, why does it have to calculate the number
of bits set? (
https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L797).
Wouldn't just checking whether the result of the AND operation is zero or
not be sufficient? Like what I did:
https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L28
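The two checks are indeed equivalent for a single-bit mask: the popcount of `value & (1 << bit)` is 1 exactly when the AND result is non-zero. A quick sketch of the equivalence in plain Python, mirroring the logic rather than the actual Java code:

```python
def is_set_popcount(validity_byte, bit_index):
    # BaseFixedWidthVector-style check: count the set bits of the
    # masked value (always 0 or 1, since the mask has one bit).
    return bin(validity_byte & (1 << bit_index)).count("1")

def is_set_nonzero(validity_byte, bit_index):
    # The cheaper check from the benchmark: a non-zero AND result
    # means the bit is set.
    return 1 if validity_byte & (1 << bit_index) else 0

# The two agree on every byte/bit combination:
assert all(
    is_set_popcount(b, i) == is_set_nonzero(b, i)
    for b in range(256)
    for i in range(8)
)
```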


2) What is the reason behind the bitmap-generation optimization here:
https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java#L179
? At the point when this function is called, the bitmap vector has already
been read from storage and contains the right values (all null, all set, or
anything in between). Generating this mask for the special cases where the
values are all null or all set (which was the case in my benchmark) can be
slower than just returning what was read from storage.

Collectively, optimizing these two bitmap operations gives more than 1 Gbps
of gain in my benchmarking code.

Cheers,
--
Animesh


On Thu, Oct 4, 2018 at 12:52 PM Wes McKinney  wrote:

> See e.g.
>
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc#L222
>
>
> On Thu, Oct 4, 2018 at 6:48 AM Animesh Trivedi
>  wrote:
> >
> > Primarily, write the same microbenchmark as I have in Java, in C++, for
> > table reading and value materialization. So just an example of equivalent
> > ArrowFileReader code in C++. Unit tests are a good starting point,
> > thanks for the tip :)
> >
> > On Thu, Oct 4, 2018 at 12:39 PM Wes McKinney 
> wrote:
> >
> > > > 3. Are there examples of Arrow in C++ read/write code that I can
> have a
> > > look?
> > >
> > > What kind of code are you looking for? I would direct you to relevant
> > > unit tests that exhibit certain functionality, but it depends on what
> > > you are trying to do
> > > On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi
> > >  wrote:
> > > >
> > > > Hi all - quick update on the performance investigation:
> > > >
> > > > - I spent some time looking at the performance profile for a binary
> > > > blob column (1024 bytes of byte[]) and found a few favorable settings
> > > > for delivering up to 168 Gbps from the in-memory reading benchmark on
> > > > 16 cores. These settings (NUMA, JVM settings, Arrow holder API, batch
> > > > size, etc.) are documented here:
> > > > https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
> > > > - these settings also helped to improve the last number reported
> > > > (but not by much) for the in-memory TPC-DS store_sales table, from
> > > > ~39 Gbps up to ~45-47 Gbps (note: this number is just the in-memory
> > > > benchmark, i.e., w/o any networking or storage links)
> > > >
> > > > A few follow-up questions:
> > > > 1. Arrow reads a batch-size worth of data in one go. Are there any
> > > > recommended batch sizes? In my investigation, small batch sizes help
> > > > with a better cache profile but increase the number of instructions
> > > > required (more looping). Larger ones do the opposite. Somehow
> > > > ~10MB/thread seems to be the best-performing configuration, which is
> > > > also a bit counterintuitive, as for 16 threads this leads to a 160 MB
> > > > memory footprint. Maybe this is also tied to the memory-management
> > > > logic, which is my next question.
> > > > 2. Arrow uses Netty's memory manager. (i) What are

[jira] [Created] (ARROW-3472) remove gandiva helpers library

2018-10-09 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-3472:
-

 Summary: remove gandiva helpers library
 Key: ARROW-3472
 URL: https://issues.apache.org/jira/browse/ARROW-3472
 Project: Apache Arrow
  Issue Type: Task
  Components: Gandiva
Reporter: Pindikura Ravindra
Assignee: Pindikura Ravindra


Gandiva has two native libraries, libgandiva.so and libgandiva_helpers.so. 
The helpers library is mostly a duplicate and was added to work around 
unresolved symbols with Java/JNI. But this is a hack and needs to be cleaned 
up.





[jira] [Created] (ARROW-3471) [Gandiva] Investigate caching isomorphic expressions

2018-10-09 Thread Praveen Kumar Desabandu (JIRA)
Praveen Kumar Desabandu created ARROW-3471:
--

 Summary: [Gandiva] Investigate caching isomorphic expressions
 Key: ARROW-3471
 URL: https://issues.apache.org/jira/browse/ARROW-3471
 Project: Apache Arrow
  Issue Type: Task
Reporter: Praveen Kumar Desabandu
 Fix For: 0.12.0


Two expressions, say add(a+b) and add(c+d), could potentially be reused if 
the only things differing are the names.

Test E2E.





[jira] [Created] (ARROW-3470) [C++] Row-wise conversion tutorial has fallen out of date

2018-10-09 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3470:
---

 Summary: [C++] Row-wise conversion tutorial has fallen out of date
 Key: ARROW-3470
 URL: https://issues.apache.org/jira/browse/ARROW-3470
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.12.0


As reported on user@ list


