Re: [Discuss][Format] Zero size record batches

2019-05-20 Thread Ravindra Pindikura
On Tue, May 21, 2019 at 10:35 AM Micah Kornfield 
wrote:

> Today, the format docs are ambiguous on whether zero-sized batches are
> supported.  Wes opened a PR [1] for empty record batches that shows C++
> handles them, but Java and JavaScript fail to handle them.
>
>
> I'd like to propose:
> 1.  Make it explicit in the format docs that zero-size record batches are
> supported.
> 2.  Update the Java and JavaScript implementations to work with them (I can put
> the Java work on my backlog, but would need a volunteer for JS), and any
> other implementations that don't currently handle them.
>
> Thoughts?
>

Will need to add a test case for Gandiva also - and fix any bugs it shows
up. But I agree we should support zero-sized batches.



> Thanks,
> Micah
>


-- 
Thanks and regards,
Ravindra.


Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

2019-05-20 Thread Micah Kornfield
Hi Wes,
It looks like comments are turned off on the doc - is this intentional?

Thanks,
Micah

On Mon, May 20, 2019 at 3:49 PM Wes McKinney  wrote:

> hi folks,
>
> I'm interested in starting to build a so-called "data frame" interface
> as a moderately opinionated, higher-level usability layer for
> interacting with Arrow-based chunked in-memory data. I've had numerous
> discussions (mostly in-person) over the last few years about this and
> it feels to me that if we don't build something like this in Apache
> Arrow that we could end up with several third party efforts without
> much community discussion or collaboration, which would be sad.
>
> Another anti-pattern that is occurring is that users are loading data
> into Arrow, converting to a library like pandas in order to do some
> simple in-memory data manipulations, then converting back to Arrow.
> This is not the intended long term mode of operation.
>
> I wrote in significantly more detail (~7-8 pages) about the context
> and motivation for this project:
>
>
> https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing
>
> Note that this would be a parallel effort to go alongside the
> previously-discussed "Query Engine" project, and the two things are
> intended to work together. Since we are creating computational
> kernels, this would also provide some immediacy in being able to
> invoke kernels easily on large in-memory datasets without having to
> wait for a more full-fledged query engine system to be developed
>
> The details with these kinds of projects can be bedeviling so my
> approach would be to begin to lay down the core abstractions and basic
> APIs and use the project to drive the agenda for kernel development
> (which can also be used in the context of a query engine runtime).
> From my past experience designing pandas and some other in-memory
> analytics projects, I have some idea of the kinds of mistakes or
> design patterns I would like to _avoid_ in this effort, but others may
> have some experiences they can offer to inform the design approach as
> well.
>
> Looking forward to comments and discussion.
>
> - Wes
>


[Discuss][Format] Zero size record batches

2019-05-20 Thread Micah Kornfield
Today, the format docs are ambiguous on whether zero-sized batches are
supported.  Wes opened a PR [1] for empty record batches that shows C++
handles them, but Java and JavaScript fail to handle them.


I'd like to propose:
1.  Make it explicit in the format docs that zero-size record batches are
supported.
2.  Update the Java and JavaScript implementations to work with them (I can put
the Java work on my backlog, but would need a volunteer for JS), and any
other implementations that don't currently handle them.

Thoughts?

Thanks,
Micah
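For concreteness, a minimal sketch of building and validating a zero-row record
batch with the C++ API (names follow the current C++ library and may differ
slightly between versions):

#include <memory>
#include <arrow/api.h>

// Build a record batch with a one-column schema but zero rows.
arrow::Status MakeEmptyBatch(std::shared_ptr<arrow::RecordBatch>* out) {
  auto schema = arrow::schema({arrow::field("x", arrow::int32())});

  // Finishing a builder with nothing appended yields a zero-length array.
  arrow::Int32Builder builder;
  std::shared_ptr<arrow::Array> values;
  ARROW_RETURN_NOT_OK(builder.Finish(&values));

  *out = arrow::RecordBatch::Make(schema, /*num_rows=*/0, {values});
  return (*out)->Validate();
}

An implementation that supports zero-sized batches should be able to write such
a batch through the IPC stream writer and read it back without error.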


Re: [DISCUSS][C++] Unaligned memory accesses (undefined behavior)

2019-05-20 Thread Micah Kornfield
Created https://jira.apache.org/jira/browse/ARROW-5380 to track fixing
unaligned accesses and turning on the corresponding warnings in UBSan.
https://jira.apache.org/jira/browse/ARROW-5365 tracks turning on ASAN and
UBSAN in CI.

Thanks,
Micah



On Fri, May 17, 2019 at 1:48 PM Antoine Pitrou  wrote:

>
> Le 17/05/2019 à 21:22, Micah Kornfield a écrit :
> > I recently ran UBSan over parquet and arrow code bases and there are
> quite
> > a few unaligned pointer warnings (we do reinterpret casts on integer
> types
> > without checking alignment).  Most of them are in Arrow itself, which
> > parquet calls into.
> >
> > Is this something the community would like to fix?
> >
> > I imagine adding a helper method something like:
> >
> > template <typename T>
> > T LoadUnaligned(T* pointer_to_t) {
> >   T aligned;
> >   memcpy(&aligned, pointer_to_t, sizeof(T));
> >   return aligned;
> > }
> >
> > I believe clang/GCC/MSVC should be good enough to recognize that an
> > unaligned load can replace the memcpy and inline it.  But hopefully
> archery
> > will be able to catch any performance regressions if this isn't the case.
>
> +1 from me.  We may need a similar helper for unaligned stores.
>
> By the way, running ASan and UBSan is something that we should ideally
> do in CI, especially now that our Valgrind runs are disabled.
>
> Regards
>
> Antoine.
>
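A sketch of the companion store helper mentioned above (an assumption about what
such a helper might look like, not an existing API):

#include <cstring>

// Mirror of LoadUnaligned: memcpy has no alignment requirement, and compilers
// typically lower a fixed-size memcpy to a single (unaligned) store where the
// architecture allows it.
template <typename T>
void StoreUnaligned(T* destination, const T& value) {
  std::memcpy(destination, &value, sizeof(T));
}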


[jira] [Created] (ARROW-5380) [C++] Fix and enable UBSan for unaligned accesses.

2019-05-20 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-5380:
--

 Summary: [C++] Fix and enable UBSan for unaligned accesses.
 Key: ARROW-5380
 URL: https://issues.apache.org/jira/browse/ARROW-5380
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Micah Kornfield
Assignee: Micah Kornfield


Currently the unaligned-access check in UBSan is turned off.  We should
introduce a method that safely loads unaligned data, use it to fix the
existing UBSan errors, and then turn the check back on.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[Discuss][Format][Java] Finalizing Union Types

2019-05-20 Thread Micah Kornfield
In the past [1] there hasn't been agreement on the final requirements for
union types.

Briefly the two approaches that are currently advocated:
1.  Limit unions to only contain one field of each individual type (e.g.
you can't have two separate int32 fields).  Java takes this approach.
2.  Generalized unions (unions can have any number of fields with the same
type).  C++ takes this approach.

There was a prior PR [2] that stalled in trying to take this approach with
Java.  For writing vectors it seemed to be slower on a benchmark.

My proposal:  We should pursue option 2 (the general approach).  There are
already data interchange formats that support it, and it would be nice to have
a data model that makes the translation to and from Arrow schemas easy:
1.  Avro seems to support it [3] (with the exception of complex types).
2.  Protobufs loosely support it [4] via oneof.

In order to address the issues in [2], I propose making the following
changes/additions to the Java implementation:
1.  Keep the default write path untouched with the existing class.
2.  Add a new sparse union class that implements the same interface; it can
be used on the read path, and by clients that opt in (via direct
construction).
3.  Add a dense union class (I don't believe Java has one).

I'm still ramping up the Java code base, so I'd like other Java
contributors to chime in to see if this plan sounds feasible and acceptable.

Any other thoughts on Unions?

Thanks,
Micah

[1]
https://lists.apache.org/thread.html/82ec2049fc3c29de232c9c6962aaee9ec022d581cecb6cf0eb6a8f36@%3Cdev.arrow.apache.org%3E
[2] https://github.com/apache/arrow/pull/987#issuecomment-493231493
[3] https://github.com/apache/arrow/pull/987#issuecomment-493231493
[4] https://developers.google.com/protocol-buffers/docs/proto#oneof
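To make option 2 concrete, here is a rough C++ sketch of a dense union whose two
children are both int32 - something option 1 would disallow. It assumes the
UnionArray::MakeDense factory as it exists in the C++ library today; the exact
signature may differ between versions:

#include <memory>
#include <vector>
#include <arrow/api.h>

// Logical union values: [f0=1, f1=10, f0=2], where f0 and f1 are both int32.
arrow::Status BuildTwoInt32Union(std::shared_ptr<arrow::Array>* out) {
  arrow::Int32Builder child0_builder, child1_builder;
  ARROW_RETURN_NOT_OK(child0_builder.AppendValues({1, 2}));
  ARROW_RETURN_NOT_OK(child1_builder.AppendValues({10}));
  std::shared_ptr<arrow::Array> child0, child1;
  ARROW_RETURN_NOT_OK(child0_builder.Finish(&child0));
  ARROW_RETURN_NOT_OK(child1_builder.Finish(&child1));

  // type_ids says which child each slot draws from; value_offsets indexes into
  // that child.  A sparse union would omit value_offsets and require every
  // child to be as long as the union itself.
  arrow::Int8Builder type_ids_builder;
  arrow::Int32Builder offsets_builder;
  ARROW_RETURN_NOT_OK(type_ids_builder.AppendValues({0, 1, 0}));
  ARROW_RETURN_NOT_OK(offsets_builder.AppendValues({0, 0, 1}));
  std::shared_ptr<arrow::Array> type_ids, offsets;
  ARROW_RETURN_NOT_OK(type_ids_builder.Finish(&type_ids));
  ARROW_RETURN_NOT_OK(offsets_builder.Finish(&offsets));

  return arrow::UnionArray::MakeDense(*type_ids, *offsets, {child0, child1}, out);
}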


Re: ARROW-4714: Providing JNI interface to Read ORC file via Arrow C++

2019-05-20 Thread Yurui Zhou
Hi Micah:

Thanks for the response. According to our benchmark, cpp-orc is on average
1% to 10% slower than java-orc, while the on-heap to off-heap memory
conversion overhead can easily outweigh such a performance difference.
We are also currently working on some performance improvement patches for
cpp-orc to make sure it achieves at least the same performance as java-orc.

Thanks
Yurui
On 20 May 2019, 9:22 PM +0800, Micah Kornfield , wrote:
> Hi Yurui,
> This is cool, I will try to leave some comments tonight.
>
> Reading the JIRA, it references the conversion from on-heap to off-heap
> memory as being the performance issue. Now that Arrow Java can point at
> arbitrary memory do you know the performance delta between java-orc and
> cpp-orc? (I'm wondering if we should do something similar for parquet-cpp)
>
> Thanks,
> Micah
>
> On Monday, May 20, 2019, Yurui Zhou  wrote:
>
> > Hi Guys:
> >
> > I just created a PR with WIP changes adding a JNI interface for
> > reading ORC files.
> >
> > All the major changes have been done and I would like some early feedback
> > from the community.
> >
> > Feel free to take a look and leave your feedback.
> > https://github.com/apache/arrow/pull/4348
> >
> > Some cleanup and unit tests will be added in follow-up iterations.
> >
> > Thanks
> > Yurui
> >
> >


[DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

2019-05-20 Thread Wes McKinney
hi folks,

I'm interested in starting to build a so-called "data frame" interface
as a moderately opinionated, higher-level usability layer for
interacting with Arrow-based chunked in-memory data. I've had numerous
discussions (mostly in-person) over the last few years about this and
it feels to me that if we don't build something like this in Apache
Arrow that we could end up with several third party efforts without
much community discussion or collaboration, which would be sad.

Another anti-pattern that is occurring is that users are loading data
into Arrow, converting to a library like pandas in order to do some
simple in-memory data manipulations, then converting back to Arrow.
This is not the intended long term mode of operation.

I wrote in significantly more detail (~7-8 pages) about the context
and motivation for this project:

https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing

Note that this would be a parallel effort to go alongside the
previously-discussed "Query Engine" project, and the two things are
intended to work together. Since we are creating computational
kernels, this would also provide some immediacy in being able to
invoke kernels easily on large in-memory datasets without having to
wait for a more full-fledged query engine system to be developed.

The details with these kinds of projects can be bedeviling so my
approach would be to begin to lay down the core abstractions and basic
APIs and use the project to drive the agenda for kernel development
(which can also be used in the context of a query engine runtime).
From my past experience designing pandas and some other in-memory
analytics projects, I have some idea of the kinds of mistakes or
design patterns I would like to _avoid_ in this effort, but others may
have some experiences they can offer to inform the design approach as
well.

Looking forward to comments and discussion.

- Wes


[jira] [Created] (ARROW-5379) [Python] support pandas' nullable Integer type in from_pandas

2019-05-20 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5379:


 Summary: [Python] support pandas' nullable Integer type in 
from_pandas
 Key: ARROW-5379
 URL: https://issues.apache.org/jira/browse/ARROW-5379
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


From https://github.com/apache/arrow/issues/4168. We should add support for
pandas' nullable Integer extension dtypes, as those could map nicely to Arrow's
integer types.

Ideally this happens in a generic way though, and not specific to this
extension type; this is discussed in ARROW-5271.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Discuss] [Python] protocol for conversion to pyarrow Array

2019-05-20 Thread Joris Van den Bossche
Hi Wes,

That indeed seems like a good fit for the pandas ExtensionArray <-> Arrow
conversion.
I will look into it starting this week.

Joris

On Fri, May 17, 2019 at 00:28, Wes McKinney  wrote:

> hi Joris,
>
> Somewhat related to this, I want to also point out that we have C++
> extension types [1]. As part of this, it would also be good to define
> and document a public API for users to create ExtensionArray
> subclasses that can be serialized and deserialized using this
> machinery.
>
> As a motivating example, suppose that a Java application has a special
> data type that can be serialized as a Binary value in Arrow, and we
> want to be able to receive this special object as a pandas
> ExtensionArray column, unboxing it into a Python user space type.
>
> The ExtensionType can be implemented in Java, and then on the Python
> side the implementation can occur either in C++ or Python. An API will
> need to be defined for serializer functions for the pandas
> ExtensionArray to map the pandas-space type onto the Arrow-space
> type. Does this seem like a project you might be able to help drive
> forward? As a matter of sequencing, we do not yet have the capability
> to interact with C++ ExtensionType in Python, so we might need to
> first create callback machinery to enable Arrow extension types to be
> defined in Python (that call into the C++ ExtensionType registry)
>
> - Wes
>
> [1]:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/extension_type-test.cc
>
> On Fri, May 10, 2019 at 2:11 AM Joris Van den Bossche
>  wrote:
> >
> > On Thu, May 9, 2019 at 21:38, Uwe L. Korn  wrote:
> >
> > > +1 to the idea of adding a protocol to let other objects define their
> way
> > > to Arrow structures. For pandas.Series I would expect that they return
> an
> > > Arrow Column.
> > >
> > > For the Arrow->pandas conversion I have a bit mixed feelings. In the
> > > normal Fletcher case I would expect that we don't convert anything as
> we
> > > represent anything from Arrow with it.
> >
> >
> > Yes, you don't want to convert anything (apart from wrapping the arrow
> > array into a FletcherArray). But how does Table.to_pandas know that?
> > Maybe it doesn't need to know that. And then you might write a function
> in
> > fletcher to convert a pyarrow Table to a pandas DataFrame with
> > fletcher-backed columns. But if you want to have this roundtrip
> > automatically, without the need that each project that defines an
> > ExtensionArray and wants to interact with arrow (eg in GeoPandas as well)
> > needs to have his own "arrow-table-to-pandas-dataframe" converter,
> pyarrow
> > needs to have some notion of how to convert back to a pandas
> ExtensionArray.
> >
> >
> > > For the case where we want to restore the exact pandas DataFrame we had
> > > before this will become a bit more complicated as we either would need
> to
> > > have all third-party libraries to support Arrow via a hook as proposed
> or
> > > we also define some kind of other protocol on the pandas side to
> > > reconstruct ExtensionArrays from Arrow data.
> > >
> >
> > That last one is basically what I proposed in
> >
> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
> >
> > Thanks Antoine and Uwe for the discussion!
> >
> > Joris
>
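For reference, a user-defined C++ ExtensionType along the lines of [1] looks
roughly like the sketch below. The "uuid" type is purely illustrative, and the
exact virtual method signatures may differ between Arrow versions:

#include <memory>
#include <string>
#include <arrow/api.h>
#include <arrow/extension_type.h>

// Hypothetical extension type: a UUID stored as 16 fixed-size binary bytes.
class UuidType : public arrow::ExtensionType {
 public:
  UuidType() : arrow::ExtensionType(arrow::fixed_size_binary(16)) {}

  std::string extension_name() const override { return "uuid"; }

  bool ExtensionEquals(const arrow::ExtensionType& other) const override {
    return other.extension_name() == extension_name();
  }

  std::shared_ptr<arrow::Array> MakeArray(
      std::shared_ptr<arrow::ArrayData> data) const override {
    return std::make_shared<arrow::ExtensionArray>(data);
  }

  arrow::Status Deserialize(std::shared_ptr<arrow::DataType> storage_type,
                            const std::string& serialized,
                            std::shared_ptr<arrow::DataType>* out) const override {
    *out = std::make_shared<UuidType>();
    return arrow::Status::OK();
  }

  // No extra metadata beyond the storage type.
  std::string Serialize() const override { return ""; }
};

The remaining work described above is the machinery for registering such types
from Python and mapping them onto pandas ExtensionArrays.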


Re: [pyarrow] Parquet page header size limit

2019-05-20 Thread Wes McKinney
Those instructions are a bit out of date after the monorepo merge, see

https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst#apache-parquet-development

On Mon, May 20, 2019 at 8:33 AM Micah Kornfield  wrote:
>
> Hi Shyam,
> https://github.com/apache/parquet-testing contains stand alone test files.
>
>
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/bloom_filter-test.cc
> is an example of how this is used (search for get_data_dir).
>
>
> https://github.com/apache/parquet-cpp/blob/master/README.md#testing
> describes how to setup your environment to use it.
>
> Thanks,
> Micah
>
>
>
> On Monday, May 20, 2019, shyam narayan singh  wrote:
>
> > Hi Wes
> >
> > Sorry, this fell off my radar. I went ahead and dug into the problem and filed
> > the issue . We can track
> > the error message as part of a different bug?
> >
> > Now, I have a parquet file that can be read by java reader but not pyarrow.
> > I have the fix for the issue but I do not know how to add a test case.
> > Reason being, the test cases generate the files and then test the readers.
> > Is there a way to add an existing parquet file as a test case to the
> > current set of tests?
> >
> > Regards
> > Shyam
> >
> > Regards
> > Shyam
> >
> > On Tue, Apr 23, 2019 at 9:20 AM Wes McKinney  wrote:
> >
> > > hi Shyam,
> > >
> > > Well "Invalid data. Deserializing page header failed." is not a very
> > > good error message. Can you open a JIRA issue and provide a way to
> > > reproduce the problem (e.g. code to generate a file, or a sample
> > > file)? From what you say it seems to be an atypical usage of Parquet,
> > > but there might be a configurable option we can add to help. IIRC the
> > > large header limit is there to prevent runaway behavior in malformed
> > > Parquet files. I believe we used other Parquet implementations to
> > > guide the choice
> > >
> > > Thanks
> > >
> > > On Wed, Apr 17, 2019 at 6:09 AM shyam narayan singh
> > >  wrote:
> > > >
> > > > My mistake. The max is 16MB.
> > > >
> > > > So, if deserialisation fails, we keep trying until we hit the max, that
> > > > works but not efficient. Looks like the custom page header is not
> > > > deserialisable. Will keep digging.
> > > >
> > > > Thanks
> > > > Shyam
> > > >
> > > > On Wed, Apr 17, 2019 at 11:56 AM shyam narayan singh <
> > > > shyambits2...@gmail.com> wrote:
> > > >
> > > > > Hi
> > > > >
> > > > > While reading a custom parquet file that has extra information
> > embedded
> > > > > (some custom stats), pyarrow is failing to read it.
> > > > >
> > > > >
> > > > > Traceback (most recent call last):
> > > > >
> > > > >   File "/tmp/pytest.py", line 19, in 
> > > > >
> > > > > table = dataset.read()
> > > > >
> > > > >   File "/usr/local/lib/python3.7/site-packages/pyarrow/parquet.py",
> > > line
> > > > > 214, in read
> > > > >
> > > > > use_threads=use_threads)
> > > > >
> > > > >   File "pyarrow/_parquet.pyx", line 737, in
> > > > > pyarrow._parquet.ParquetReader.read_all
> > > > >
> > > > >   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> > > > >
> > > > > pyarrow.lib.ArrowIOError: Couldn't deserialize thrift:
> > > TProtocolException:
> > > > > Invalid data
> > > > >
> > > > > Deserializing page header failed.
> > > > >
> > > > >
> > > > >
> > > > > Looking at the code, I realised that SerializedPageReader throws
> > > exception
> > > > > if the page header size goes beyond 16k (default max). There is a
> > > setter
> > > > > method for the max page header size that is used only in tests.
> > > > >
> > > > >
> > > > > Is there a way to get around the problem?
> > > > >
> > > > >
> > > > > Regards
> > > > >
> > > > > Shyam
> > > > >
> > >
> >


[jira] [Created] (ARROW-5378) [C++] Add local FileSystem implementation

2019-05-20 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5378:
-

 Summary: [C++] Add local FileSystem implementation
 Key: ARROW-5378
 URL: https://issues.apache.org/jira/browse/ARROW-5378
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 0.14.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [pyarrow] Parquet page header size limit

2019-05-20 Thread Micah Kornfield
Hi Shyam,
https://github.com/apache/parquet-testing contains stand alone test files.


https://github.com/apache/arrow/blob/master/cpp/src/parquet/bloom_filter-test.cc
is an example of how this is used (search for get_data_dir).


https://github.com/apache/parquet-cpp/blob/master/README.md#testing
describes how to setup your environment to use it.

Thanks,
Micah



On Monday, May 20, 2019, shyam narayan singh  wrote:

> Hi Wes
>
> Sorry, this fell off my radar. I went ahead and dug into the problem and filed
> the issue . We can track
> the error message as part of a different bug?
>
> Now, I have a parquet file that can be read by java reader but not pyarrow.
> I have the fix for the issue but I do not know how to add a test case.
> Reason being, the test cases generate the files and then test the readers.
> Is there a way to add an existing parquet file as a test case to the
> current set of tests?
>
> Regards
> Shyam
>
> Regards
> Shyam
>
> On Tue, Apr 23, 2019 at 9:20 AM Wes McKinney  wrote:
>
> > hi Shyam,
> >
> > Well "Invalid data. Deserializing page header failed." is not a very
> > good error message. Can you open a JIRA issue and provide a way to
> > reproduce the problem (e.g. code to generate a file, or a sample
> > file)? From what you say it seems to be an atypical usage of Parquet,
> > but there might be a configurable option we can add to help. IIRC the
> > large header limit is there to prevent runaway behavior in malformed
> > Parquet files. I believe we used other Parquet implementations to
> > guide the choice
> >
> > Thanks
> >
> > On Wed, Apr 17, 2019 at 6:09 AM shyam narayan singh
> >  wrote:
> > >
> > > My mistake. The max is 16MB.
> > >
> > > So, if deserialisation fails, we keep trying until we hit the max, that
> > > works but not efficient. Looks like the custom page header is not
> > > deserialisable. Will keep digging.
> > >
> > > Thanks
> > > Shyam
> > >
> > > On Wed, Apr 17, 2019 at 11:56 AM shyam narayan singh <
> > > shyambits2...@gmail.com> wrote:
> > >
> > > > Hi
> > > >
> > > > While reading a custom parquet file that has extra information
> embedded
> > > > (some custom stats), pyarrow is failing to read it.
> > > >
> > > >
> > > > Traceback (most recent call last):
> > > >
> > > >   File "/tmp/pytest.py", line 19, in 
> > > >
> > > > table = dataset.read()
> > > >
> > > >   File "/usr/local/lib/python3.7/site-packages/pyarrow/parquet.py",
> > line
> > > > 214, in read
> > > >
> > > > use_threads=use_threads)
> > > >
> > > >   File "pyarrow/_parquet.pyx", line 737, in
> > > > pyarrow._parquet.ParquetReader.read_all
> > > >
> > > >   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> > > >
> > > > pyarrow.lib.ArrowIOError: Couldn't deserialize thrift:
> > TProtocolException:
> > > > Invalid data
> > > >
> > > > Deserializing page header failed.
> > > >
> > > >
> > > >
> > > > Looking at the code, I realised that SerializedPageReader throws
> > exception
> > > > if the page header size goes beyond 16k (default max). There is a
> > setter
> > > > method for the max page header size that is used only in tests.
> > > >
> > > >
> > > > Is there a way to get around the problem?
> > > >
> > > >
> > > > Regards
> > > >
> > > > Shyam
> > > >
> >
>
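Regarding the test-data setup described above: the parquet-cpp tests locate
checked-in files through the PARQUET_TEST_DATA environment variable (this is
essentially what the get_data_dir() helper does). A self-contained sketch of
the mechanism, with a hypothetical helper name:

#include <cstdlib>
#include <stdexcept>
#include <string>

// Resolve a file checked into the apache/parquet-testing data directory.
std::string GetTestFilePath(const std::string& file_name) {
  const char* data_dir = std::getenv("PARQUET_TEST_DATA");
  if (data_dir == nullptr) {
    throw std::runtime_error("PARQUET_TEST_DATA environment variable not set");
  }
  return std::string(data_dir) + "/" + file_name;
}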


[jira] [Created] (ARROW-5377) [C++] Develop interface for writing a RecordBatch IPC stream into pre-allocated space (e.g. memory map) that avoids unnecessary serialization

2019-05-20 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5377:
---

 Summary: [C++] Develop interface for writing a RecordBatch IPC 
stream into pre-allocated space (e.g. memory map) that avoids unnecessary 
serialization
 Key: ARROW-5377
 URL: https://issues.apache.org/jira/browse/ARROW-5377
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


As discussed in recent mailing list thread

https://lists.apache.org/thread.html/b756209052fecb8c28a5eb37db7aecb82a5f5351fa79a9d86f0dba3e@%3Cuser.arrow.apache.org%3E

The only viable process at the moment for getting an accurate report of stream 
size is to write a simulated stream using {{MockOutputStream}}. This is 
suboptimal for a couple of reasons:

* Flatbuffers metadata must be created twice
* Record batch disassembly into IpcPayload must be performed twice

It seems like an interface with a very constrained public API could be provided 
to deconstruct a sequence of RecordBatches and report the size of the produced 
IPC stream (based on metadata sizes, and padding), and then this deconstructed 
set of IPC payloads can be written out to a stream (e.g. using 
{{FixedSizeBufferWriter}})
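A sketch of the current two-pass workaround; class and factory names follow the
C++ API at the time of writing and may differ in other versions:

{code}
#include <memory>
#include <arrow/api.h>
#include <arrow/io/memory.h>
#include <arrow/ipc/writer.h>

// Pass 1: write the stream into a MockOutputStream, which only counts bytes.
// The measured size can then be used to pre-allocate space (e.g. a memory map)
// before writing the real stream through a FixedSizeBufferWriter.
arrow::Status MeasureStreamSize(const std::shared_ptr<arrow::RecordBatch>& batch,
                                int64_t* size) {
  arrow::io::MockOutputStream sink;
  std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
  ARROW_RETURN_NOT_OK(arrow::ipc::RecordBatchStreamWriter::Open(
      &sink, batch->schema(), &writer));
  ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
  ARROW_RETURN_NOT_OK(writer->Close());
  *size = sink.GetExtentBytesWritten();
  return arrow::Status::OK();
}
{code}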



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5376) [C++] Compile failure on gcc 5.4.0

2019-05-20 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5376:
-

 Summary: [C++] Compile failure on gcc 5.4.0
 Key: ARROW-5376
 URL: https://issues.apache.org/jira/browse/ARROW-5376
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


{code}
In file included from ../src/arrow/filesystem/test-util.h:22:0,
 from ../src/arrow/filesystem/test-util.cc:26:
../src/arrow/filesystem/filesystem.h:58:1: error: type attributes ignored after 
type is already defined [-Werror=attributes]
 };
 ^
{code}

This is a bug in gcc 5.4.0: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43407



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: ARROW-4714: Providing JNI interface to Read ORC file via Arrow C++

2019-05-20 Thread Micah Kornfield
Hi Yurui,
This is cool, I will try to leave some comments tonight.

Reading the JIRA, it references the conversion from on-heap to off-heap
memory as being the performance issue.  Now that Arrow Java can point at
arbitrary memory, do you know the performance delta between java-orc and
cpp-orc?  (I'm wondering if we should do something similar for parquet-cpp.)

Thanks,
Micah

On Monday, May 20, 2019, Yurui Zhou  wrote:

> Hi Guys:
>
> I just created a PR with WIP changes adding a JNI interface for
> reading ORC files.
>
> All the major changes have been done and I would like some early feedback
> from the community.
>
> Feel free to take a look and leave your feedback.
> https://github.com/apache/arrow/pull/4348
>
> Some cleanup and unit tests will be added in follow-up iterations.
>
> Thanks
> Yurui
>
>


Re: [pyarrow] Parquet page header size limit

2019-05-20 Thread shyam narayan singh
Hi Wes

Sorry, this fell off my radar. I went ahead and dug into the problem and filed
the issue . We can track
the error message as part of a different bug?

Now I have a parquet file that can be read by the Java reader but not by pyarrow.
I have the fix for the issue, but I do not know how to add a test case,
because the test cases generate the files and then test the readers.
Is there a way to add an existing parquet file as a test case to the
current set of tests?

Regards
Shyam

Regards
Shyam

On Tue, Apr 23, 2019 at 9:20 AM Wes McKinney  wrote:

> hi Shyam,
>
> Well "Invalid data. Deserializing page header failed." is not a very
> good error message. Can you open a JIRA issue and provide a way to
> reproduce the problem (e.g. code to generate a file, or a sample
> file)? From what you say it seems to be an atypical usage of Parquet,
> but there might be a configurable option we can add to help. IIRC the
> large header limit is there to prevent runaway behavior in malformed
> Parquet files. I believe we used other Parquet implementations to
> guide the choice
>
> Thanks
>
> On Wed, Apr 17, 2019 at 6:09 AM shyam narayan singh
>  wrote:
> >
> > My mistake. The max is 16MB.
> >
> > So, if deserialisation fails, we keep trying until we hit the max, that
> > works but not efficient. Looks like the custom page header is not
> > deserialisable. Will keep digging.
> >
> > Thanks
> > Shyam
> >
> > On Wed, Apr 17, 2019 at 11:56 AM shyam narayan singh <
> > shyambits2...@gmail.com> wrote:
> >
> > > Hi
> > >
> > > While reading a custom parquet file that has extra information embedded
> > > (some custom stats), pyarrow is failing to read it.
> > >
> > >
> > > Traceback (most recent call last):
> > >
> > >   File "/tmp/pytest.py", line 19, in 
> > >
> > > table = dataset.read()
> > >
> > >   File "/usr/local/lib/python3.7/site-packages/pyarrow/parquet.py",
> line
> > > 214, in read
> > >
> > > use_threads=use_threads)
> > >
> > >   File "pyarrow/_parquet.pyx", line 737, in
> > > pyarrow._parquet.ParquetReader.read_all
> > >
> > >   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> > >
> > > pyarrow.lib.ArrowIOError: Couldn't deserialize thrift:
> TProtocolException:
> > > Invalid data
> > >
> > > Deserializing page header failed.
> > >
> > >
> > >
> > > Looking at the code, I realised that SerializedPageReader throws
> exception
> > > if the page header size goes beyond 16k (default max). There is a
> setter
> > > method for the max page header size that is used only in tests.
> > >
> > >
> > > Is there a way to get around the problem?
> > >
> > >
> > > Regards
> > >
> > > Shyam
> > >
>


[jira] [Created] (ARROW-5375) [C++] Try to move <sstream> out of public headers

2019-05-20 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5375:
-

 Summary: [C++] Try to move <sstream> out of public headers
 Key: ARROW-5375
 URL: https://issues.apache.org/jira/browse/ARROW-5375
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++
Reporter: Antoine Pitrou


Followup to ARROW-5102: to try and reduce compile times, move inclusions
of {{sstream}} (and other costly headers) out of Arrow public headers such as
{{status.h}}.
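A sketch of the general pattern (not the actual Arrow code): declare in the
header, and keep the costly include in the implementation file only.

{code}
// status.h -- the public header only needs <string>.
#include <string>

class Status {
 public:
  std::string ToString() const;  // declared here, defined in status.cc
};

// status.cc -- <sstream> is included only in the implementation file.
#include <sstream>

std::string Status::ToString() const {
  std::ostringstream ss;
  ss << "OK";
  return ss.str();
}
{code}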



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5374) [Python] pa.read_record_batch() doesn't work

2019-05-20 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5374:
-

 Summary: [Python] pa.read_record_batch() doesn't work
 Key: ARROW-5374
 URL: https://issues.apache.org/jira/browse/ARROW-5374
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Antoine Pitrou


{code:python}
>>> batch = pa.RecordBatch.from_arrays([pa.array([b"foo"], type=pa.utf8())],
...                                    names=['strs'])
>>> stream = pa.BufferOutputStream()
>>> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
>>> writer.write_batch(batch)
>>> writer.close()
>>> buf = stream.getvalue()
>>> pa.read_record_batch(buf, batch.schema)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    pa.read_record_batch(buf, batch.schema)
  File "pyarrow/ipc.pxi", line 583, in pyarrow.lib.read_record_batch
    check_status(ReadRecordBatch(deref(message.message.get()),
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
    raise ArrowIOError(message)
ArrowIOError: Expected IPC message of type schema got record batch

{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


ARROW-4714: Providing JNI interface to Read ORC file via Arrow C++

2019-05-20 Thread Yurui Zhou
Hi Guys:

I just created a PR with WIP changes adding a JNI interface for reading ORC
files.

All the major changes have been done and I would like some early feedback from
the community.

Feel free to take a look and leave your feedback.
https://github.com/apache/arrow/pull/4348

Some cleanup and unit tests will be added in follow-up iterations.

Thanks
Yurui