Re: [DISCUSS] PR Backlog reduction

2019-05-21 Thread Micah Kornfield
I agree on hand curation for now.

I'll try to set up a sign-up spreadsheet for shepherding old PRs, and once
that's done assign reviewers/ping old PRs.  I expect to have something to
share by the weekend.

On Tuesday, May 21, 2019, Wes McKinney  wrote:

> I think maintainers or contributors should be responsible for closing
> PRs; it also helps with backlog curation (sometimes when a stale PR is
> closed, the JIRA may also be closed if it's a Won't Fix).
>
> On Tue, May 21, 2019 at 1:12 PM Antoine Pitrou  wrote:
> >
> >
> >
> > Le 21/05/2019 à 20:02, Neal Richardson a écrit :
> > > Automatically close stale PRs? https://github.com/probot/stale
> >
> > That doesn't sound like a good idea to me.
> >
> > Regards
> >
> > Antoine.
>


Re: Should EOS be mandatory for IPC File format?

2019-05-21 Thread Micah Kornfield
This seems like a reasonable change.  Is there any reason that we shouldn't
always append EOS?

On Tuesday, May 21, 2019, John Muehlhausen  wrote:

> Wes,
>
> Check out reader.cpp.  It seg faults when it gets to the next
> message-that-is-not-a-message... it is a footer.  But I have no way to
> know this in reader.cpp because I'm piping the File in via stdin.
>
> In seeker.cpp I seek to the end and figure out where the footer is (this
> is a py-arrow-written file) and indeed it is at the offset where my
> "streamed File" reader bombed out.  If EOS were mandatory at this location
> it would have been fine... I would have said "oh, time for the footer!"
>
> Basically what I'm saying is that we can't assume that File won't be
> processed as a stream.  In an actual non-file stream it is either EOS or
> end-of-stream.  But with a file-as-stream there is more data and we have to
> know it isn't the stream anymore.
>
> Otherwise we've locked the File use-cases into those where the File isn't
> streamed -- i.e. is seekable.  See what I'm saying?  For reader.cpp to have
> been functional it would have had to read the entire File into a buffer
> before parsing, since it could not seek().  This could be easily avoided
> with a mandatory EOS in the File format.  Basically:
>
> <stream messages: schema, dictionaries, record batches>
> **<EOS>**
> <footer>
>
> -John
>
> On Tue, May 21, 2019 at 4:44 PM Wes McKinney  wrote:
>
>> hi John,
>>
>> I'm not sure I follow. The EOS you're referring to is part of the
>> streaming format. It's designed to be readable using an InputStream
>> interface that does not support seeking at all. You can see the core
>> logic where messages are popped off the InputStream here
>>
>> https://github.com/apache/arrow/blob/6f80ea4928f0d26ca175002f2e9f511962c8b012/cpp/src/arrow/ipc/message.cc#L281
>>
>> If the end of the byte stream is reached, or EOS (0) is encountered,
>> then the stream reader stops iteration.
>>
>> - Wes
>>
>> On Tue, May 21, 2019 at 4:34 PM John Muehlhausen  wrote:
>> >
>> > https://arrow.apache.org/docs/format/IPC.html#file-format
>> >
>> > 
>> >
>> > If this stream marker is optional in the file format, doesn't this
>> prevent
>> > someone from reading the file without being able to seek() it, e.g. if
>> it
>> > is "piped in" to a program?  Or otherwise they'll have to stream in the
>> > entire thing before they can start parsing?
>> >
>> > Any reason it can't be mandatory for a File?
>> >
>> > -John
>>
>


Re: Should EOS be mandatory for IPC File format?

2019-05-21 Thread John Muehlhausen
Wes,

Check out reader.cpp.  It seg faults when it gets to the next
message-that-is-not-a-message... it is a footer.  But I have no way to know
this in reader.cpp because I'm piping the File in via stdin.

In seeker.cpp I seek to the end and figure out where the footer is (this is
a py-arrow-written file) and indeed it is at the offset where my "streamed
File" reader bombed out.  If EOS were mandatory at this location it would
have been fine... I would have said "oh, time for the footer!"

Basically what I'm saying is that we can't assume that File won't be
processed as a stream.  In an actual non-file stream it is either EOS or
end-of-stream.  But with a file-as-stream there is more data and we have to
know it isn't the stream anymore.

Otherwise we've locked the File use-cases into those where the File isn't
streamed -- i.e. is seekable.  See what I'm saying?  For reader.cpp to have
been functional it would have had to read the entire File into a buffer
before parsing, since it could not seek().  This could be easily avoided
with a mandatory EOS in the File format.  Basically:

<stream messages: schema, dictionaries, record batches>
**<EOS>**
<footer>

-John

On Tue, May 21, 2019 at 4:44 PM Wes McKinney  wrote:

> hi John,
>
> I'm not sure I follow. The EOS you're referring to is part of the
> streaming format. It's designed to be readable using an InputStream
> interface that does not support seeking at all. You can see the core
> logic where messages are popped off the InputStream here
>
>
> https://github.com/apache/arrow/blob/6f80ea4928f0d26ca175002f2e9f511962c8b012/cpp/src/arrow/ipc/message.cc#L281
>
> If the end of the byte stream is reached, or EOS (0) is encountered,
> then the stream reader stops iteration.
>
> - Wes
>
> On Tue, May 21, 2019 at 4:34 PM John Muehlhausen  wrote:
> >
> > https://arrow.apache.org/docs/format/IPC.html#file-format
> >
> > 
> >
> > If this stream marker is optional in the file format, doesn't this
> prevent
> > someone from reading the file without being able to seek() it, e.g. if it
> > is "piped in" to a program?  Or otherwise they'll have to stream in the
> > entire thing before they can start parsing?
> >
> > Any reason it can't be mandatory for a File?
> >
> > -John
>
import pyarrow as pa

# Build a one-column record batch containing a null value
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, None], type=pa.int32())],
    ['field1'])

# Write it with the File (random-access) format
with open('/tmp/test.batch', 'wb') as sink:
    writer = pa.RecordBatchFileWriter(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()

# Reading it back with open_file requires a seekable source
df = pa.ipc.open_file('/tmp/test.batch').read_pandas()
print(df)
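
A rough Python sketch of the failure mode reader.cpp runs into, assuming current
pyarrow stream APIs (pa.ipc.open_stream accepting a plain file object): skip the
leading magic bytes and hand the rest of the File to the stream reader, which is
effectively what a stdin-based reader does.

import pyarrow as pa

with open('/tmp/test.batch', 'rb') as f:
    f.read(8)  # skip the leading "ARROW1" magic and padding
    reader = pa.ipc.open_stream(f)  # parse the remainder as the streaming format
    try:
        for b in reader:
            print(b.num_rows)
    except Exception as exc:
        # If the writer did not append an EOS marker, the bytes after the last
        # record batch are the footer, and a non-seekable reader has no way to
        # tell that the sequence of messages has ended.
        print('stopped at the footer:', exc)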



[jira] [Created] (ARROW-5392) [C++][CI][MinGW] Disable static library build on AppVeyor

2019-05-21 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-5392:
---

 Summary: [C++][CI][MinGW] Disable static library build on AppVeyor
 Key: ARROW-5392
 URL: https://issues.apache.org/jira/browse/ARROW-5392
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








Re: Should EOS be mandatory for IPC File format?

2019-05-21 Thread Wes McKinney
hi John,

I'm not sure I follow. The EOS you're referring to is part of the
streaming format. It's designed to be readable using an InputStream
interface that does not support seeking at all. You can see the core
logic where messages are popped off the InputStream here

https://github.com/apache/arrow/blob/6f80ea4928f0d26ca175002f2e9f511962c8b012/cpp/src/arrow/ipc/message.cc#L281

If the end of the byte stream is reached, or EOS (0) is encountered,
then the stream reader stops iteration.

- Wes

On Tue, May 21, 2019 at 4:34 PM John Muehlhausen  wrote:
>
> https://arrow.apache.org/docs/format/IPC.html#file-format
>
> 
>
> If this stream marker is optional in the file format, doesn't this prevent
> someone from reading the file without being able to seek() it, e.g. if it
> is "piped in" to a program?  Or otherwise they'll have to stream in the
> entire thing before they can start parsing?
>
> Any reason it can't be mandatory for a File?
>
> -John
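
For reference, a small sketch of the streaming-format behavior Wes describes,
assuming current pyarrow APIs: the stream writer terminates the byte stream with
an EOS marker, which is what lets a reader with no seek support stop iteration
cleanly.

import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2], type=pa.int32())], ['f0'])
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()  # appends the end-of-stream marker

data = sink.getvalue().to_pybytes()
print(data[-4:])  # the EOS marker ends the stream; its final four bytes are zero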


Should EOS be mandatory for IPC File format?

2019-05-21 Thread John Muehlhausen
https://arrow.apache.org/docs/format/IPC.html#file-format



If this stream marker is optional in the file format, doesn't this prevent
someone from reading the file without being able to seek() it, e.g. if it
is "piped in" to a program?  Or otherwise they'll have to stream in the
entire thing before they can start parsing?

Any reason it can't be mandatory for a File?

-John


[jira] [Created] (ARROW-5391) [Format] Move "Buffer" from Schema.fbs to Message.fbs?

2019-05-21 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5391:
---

 Summary: [Format] Move "Buffer" from Schema.fbs to Message.fbs?
 Key: ARROW-5391
 URL: https://issues.apache.org/jira/browse/ARROW-5391
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format
Reporter: Wes McKinney
 Fix For: 0.14.0


I'm not sure why it is there, but I believe it should be moved. With the way 
that we compile, I am not sure that this will cause any issues, but I wanted to 
at least note the issue so we can investigate.





Re: [DISCUSS] PR Backlog reduction

2019-05-21 Thread Wes McKinney
I think maintainers or contributors should be responsible for closing
PRs; it also helps with backlog curation (sometimes when a stale PR is
closed, the JIRA may also be closed if it's a Won't Fix).

On Tue, May 21, 2019 at 1:12 PM Antoine Pitrou  wrote:
>
>
>
> Le 21/05/2019 à 20:02, Neal Richardson a écrit :
> > Automatically close stale PRs? https://github.com/probot/stale
>
> That doesn't sound like a good idea to me.
>
> Regards
>
> Antoine.


Re: [Discuss][Format] Zero size record batches

2019-05-21 Thread Wes McKinney
https://github.com/apache/arrow/pull/3871

On Tue, May 21, 2019 at 1:26 PM Paul Taylor  wrote:
>
> I'd be happy to PR a fix for JS today if someone can link me to Wes's PR.
>
> On 5/21/19 11:02 AM, Wes McKinney wrote:
> > I agree also. As a practical use case, the results of a request made
> > with Arrow Flight might yield an empty result set. I'm not sure if
> > this needs to be formally noted in the specification documents but it
> > might not hurt.
> >
> > If someone can fix the Java implementation we could enable the
> > integration test (minus JavaScript for now) in my PR
> >
> > On Tue, May 21, 2019 at 12:47 AM Ravindra Pindikura  
> > wrote:
> >> On Tue, May 21, 2019 at 10:35 AM Micah Kornfield 
> >> wrote:
> >>
> >>> Today, the format docs are ambiguous on whether zero sized batches are
> >>> supported.  Wes opened a PR [1] for empty record batches that shows C++
> >>> handles them but Java and javascript fail to handle them.
> >>>
> >>>
> >>> I'd like to propose:
> >>> 1.  Make it explicit in the format docs, that 0 size record batches are
> >>> supported
> >>> 2.  Update Java and javascript implementations to work with them (I can 
> >>> put
> >>> the Java work on my backlog, but would need a volunteer for JS).  And any
> >>> other implementations that don't currently handle them.
> >>>
> >>> Thoughts?
> >>>
> >> Will need to add a test case for gandiva also - and fix if it shows up any
> >> bugs. But I agree we should support zero-sized batches.
> >>
> >>
> >>
> >>> Thanks,
> >>> Micah
> >>>
> >>
> >> --
> >> Thanks and regards,
> >> Ravindra.


Re: [Discuss][Format] Zero size record batches

2019-05-21 Thread Paul Taylor

I'd be happy to PR a fix for JS today if someone can link me to Wes's PR.

On 5/21/19 11:02 AM, Wes McKinney wrote:

I agree also. As a practical use case, the results of a request made
with Arrow Flight might yield an empty result set. I'm not sure if
this needs to be formally noted in the specification documents but it
might not hurt.

If someone can fix the Java implementation we could enable the
integration test (minus JavaScript for now) in my PR

On Tue, May 21, 2019 at 12:47 AM Ravindra Pindikura  wrote:

On Tue, May 21, 2019 at 10:35 AM Micah Kornfield 
wrote:


Today, the format docs are ambiguous on whether zero sized batches are
supported.  Wes opened a PR [1] for empty record batches that shows C++
handles them but Java and javascript fail to handle them.


I'd like to propose:
1.  Make it explicit in the format docs, that 0 size record batches are
supported
2.  Update Java and javascript implementations to work with them (I can put
the Java work on my backlog, but would need a volunteer for JS).  And any
other implementations that don't currently handle them.

Thoughts?


Will need to add a test case for gandiva also - and fix if it shows up any
bugs. But I agree we should support zero-sized batches.




Thanks,
Micah



--
Thanks and regards,
Ravindra.


Re: [DISCUSS] PR Backlog reduction

2019-05-21 Thread Antoine Pitrou



Le 21/05/2019 à 20:02, Neal Richardson a écrit :
> Automatically close stale PRs? https://github.com/probot/stale

That doesn't sound like a good idea to me.

Regards

Antoine.


Re: [DISCUSS] PR Backlog reduction

2019-05-21 Thread Neal Richardson
Automatically close stale PRs? https://github.com/probot/stale

On Tue, May 21, 2019 at 11:00 AM Wes McKinney  wrote:

> Any other thoughts about process to manage the backlog?
>
> On Thu, May 16, 2019 at 2:58 PM Wes McKinney  wrote:
> >
> > hi Micah,
> >
> > This sounds like a reasonable proposal, and I agree in particular for
> > regular contributors that it makes sense to close PRs that are not
> > close to being in merge-readiness to thin the noise of the patch queue
> >
> > We have some short-term issues such as various reviewers being busy
> > lately (e.g. I was on vacation in April, then heads down working on
> > ARROW-3144) but I agree that there are some structural issues with how
> > we're organizing code review efforts.
> >
> > Note that Apache Spark, with ~500 open PRs, created this dashboard
> > application to help manage the insanity
> >
> > https://spark-prs.appspot.com/
> >
> > Ultimately (in the next few years as the number of active contributors
> > grows) I expect that we'll have to do something similar.
> >
> > - Wes
> >
> > On Thu, May 16, 2019 at 2:34 PM Micah Kornfield 
> wrote:
> > >
> > > Our backlog of open PRs is slowly creeping up.  This isn't great
> because it
> > > allows contributions to slip through the cracks (which in turn possibly
> > > turns off new contributors).  Perusing PRs I think things roughly fall
> into
> > > the following categories.
> > >
> > >
> > > 1.  PRs are work in progress that never got completed but were left
> open
> > > (mostly by regular arrow contributors).
> > >
> > > 2.  PR stalled because changes were requested and the PR author never
> > > responded.
> > >
> > > 3.  PR stalled due to lack of consensus on approach/design.
> > >
> > > 4.  PR is blocked on some external dependency (mostly these are PRs by
> > > regular arrow contributor).
> > >
> > >
> > > A straw-man proposal for handling these:
> > >
> > > 1.  Regular arrow contributors, please close the PR if it isn't close
> to
> > > being ready and you aren't actively working on it.
> > >
> > > 2.  I think we should start assigning reviewers who will have the
> > > responsibility of:
> > >
> > >a.  Pinging contributor and working through the review with them.
> > >
> > >b.  Closing out the PR in some form if there hasn't been activity
> in a
> > > 30 day period (either merging as is, making the necessary changes or
> > > closing the PR, and removing the tag from JIRA).
> > >
> > > 3.  Same as 2, but bring the discussion to the mailing list and try to
> have
> > > a formal vote if necessary.
> > >
> > > 4.  Same as 2, but tag the PR as blocked and the time window expands.
> > >
> > >
> > > The question comes up with how to manage assignment of PRs to
> reviewers.  I
> > > am happy to try to triage any PRs older than a week (assuming some PRs
> will
> > > be closed quickly with the current ad-hoc process) and load balance
> between
> > > volunteers (it would be great to have a doc someplace where people can
> > > express their available bandwidth and which languages they feel
> comfortable
> > > with).
> > >
> > >
> > > Thoughts/other proposals?
> > >
> > >
> > > Thanks,
> > >
> > > Micah
> > >
> > >
> > >
> > > P.S. A very rough analysis of PR tags gives the following counts.
> > >
> > >   29 C++
> > >
> > >   17 Python
> > >
> > >8 Rust
> > >
> > >7 WIP
> > >
> > >7 Plasma
> > >
> > >7 Java
> > >
> > >5 R
> > >
> > >4 Go
> > >
> > >4 Flight
>


Re: [Discuss][Format] Zero size record batches

2019-05-21 Thread Wes McKinney
I agree also. As a practical use case, the results of a request made
with Arrow Flight might yield an empty result set. I'm not sure if
this needs to be formally noted in the specification documents but it
might not hurt.

If someone can fix the Java implementation we could enable the
integration test (minus JavaScript for now) in my PR

On Tue, May 21, 2019 at 12:47 AM Ravindra Pindikura  wrote:
>
> On Tue, May 21, 2019 at 10:35 AM Micah Kornfield 
> wrote:
>
> > Today, the format docs are ambiguous on whether zero sized batches are
> > supported.  Wes opened a PR [1] for empty record batches that shows C++
> > handles them but Java and javascript fail to handle them.
> >
> >
> > I'd like to propose:
> > 1.  Make it explicit in the format docs, that 0 size record batches are
> > supported
> > 2.  Update Java and javascript implementations to work with them (I can put
> > the Java work on my backlog, but would need a volunteer for JS).  And any
> > other implementations that don't currently handle them.
> >
> > Thoughts?
> >
>
> Will need to add a test case for gandiva also - and fix if it shows up any
> bugs. But I agree we should support zero-sized batches.
>
>
>
> > Thanks,
> > Micah
> >
>
>
> --
> Thanks and regards,
> Ravindra.
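
As a quick illustration of the zero-row case under discussion, a minimal pyarrow
sketch (current APIs assumed) that round-trips an empty record batch through the
streaming format:

import pyarrow as pa

# A record batch with a real schema but zero rows
empty = pa.RecordBatch.from_arrays([pa.array([], type=pa.int32())], ['f0'])

sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, empty.schema)
writer.write_batch(empty)  # implementations should accept a 0-row batch
writer.close()

reader = pa.ipc.open_stream(sink.getvalue())
for b in reader:
    print(b.num_rows)  # expected: 0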


Re: [DISCUSS] PR Backlog reduction

2019-05-21 Thread Wes McKinney
Any other thoughts about process to manage the backlog?

On Thu, May 16, 2019 at 2:58 PM Wes McKinney  wrote:
>
> hi Micah,
>
> This sounds like a reasonable proposal, and I agree in particular for
> regular contributors that it makes sense to close PRs that are not
> close to being in merge-readiness to thin the noise of the patch queue
>
> We have some short-term issues such as various reviewers being busy
> lately (e.g. I was on vacation in April, then heads down working on
> ARROW-3144) but I agree that there are some structural issues with how
> we're organizing code review efforts.
>
> Note that Apache Spark, with ~500 open PRs, created this dashboard
> application to help manage the insanity
>
> https://spark-prs.appspot.com/
>
> Ultimately (in the next few years as the number of active contributors
> grows) I expect that we'll have to do something similar.
>
> - Wes
>
> On Thu, May 16, 2019 at 2:34 PM Micah Kornfield  wrote:
> >
> > Our backlog of open PRs is slowly creeping up.  This isn't great because it
> > allows contributions to slip through the cracks (which in turn possibly
> > turns off new contributors).  Perusing PRs I think things roughly fall into
> > the following categories.
> >
> >
> > 1.  PRs are work in progress that never got completed but were left open
> > (mostly by regular arrow contributors).
> >
> > 2.  PR stalled because changes were requested and the PR author never
> > responded.
> >
> > 3.  PR stalled due to lack of consensus on approach/design.
> >
> > 4.  PR is blocked on some external dependency (mostly these are PRs by
> > regular arrow contributor).
> >
> >
> > A straw-man proposal for handling these:
> >
> > 1.  Regular arrow contributors, please close the PR if it isn't close to
> > being ready and you aren't actively working on it.
> >
> > 2.  I think we should start assigning reviewers who will have the
> > responsibility of:
> >
> >a.  Pinging contributor and working through the review with them.
> >
> >b.  Closing out the PR in some form if there hasn't been activity in a
> > 30 day period (either merging as is, making the necessary changes or
> > closing the PR, and removing the tag from JIRA).
> >
> > 3.  Same as 2, but bring the discussion to the mailing list and try to have
> > a formal vote if necessary.
> >
> > 4.  Same as 2, but tag the PR as blocked and the time window expands.
> >
> >
> > The question comes up with how to manage assignment of PRs to reviewers.  I
> > am happy to try to triage any PRs older than a week (assuming some PRs will
> > be closed quickly with the current ad-hoc process) and load balance between
> > volunteers (it would be great to have a doc someplace where people can
> > express their available bandwidth and which languages they feel comfortable
> > with).
> >
> >
> > Thoughts/other proposals?
> >
> >
> > Thanks,
> >
> > Micah
> >
> >
> >
> > P.S. A very rough analysis of PR tags gives the following counts.
> >
> >   29 C++
> >
> >   17 Python
> >
> >8 Rust
> >
> >7 WIP
> >
> >7 Plasma
> >
> >7 Java
> >
> >5 R
> >
> >4 Go
> >
> >4 Flight


[jira] [Created] (ARROW-5390) [CI] Job time limit exceeded on Travis

2019-05-21 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5390:
-

 Summary: [CI] Job time limit exceeded on Travis
 Key: ARROW-5390
 URL: https://issues.apache.org/jira/browse/ARROW-5390
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, Python
Reporter: Antoine Pitrou


We now frequently hit the 50-minute job time limit on Travis-CI on the "Python 
2.7 and 3.6 unit tests w/ Valgrind, conda-forge toolchain, coverage" job.

e.g. https://travis-ci.org/pitrou/arrow/jobs/535373888

Hopefully we can soon ditch Python 2.7, which would allow saving a bit of time.





Re: Metadata for partitioned datasets in pyarrow.parquet

2019-05-21 Thread Richard Zamora
Thank you for the responses Wes and Joris!  These summaries are very helpful to 
me.

I decided to look into ARROW-5349 to get my feet wet, and just submitted a WIP 
PR (https://github.com/apache/arrow/pull/4361).  If you get a chance, please 
take a look and provide feedback.  I have limited experience with C++ 
development, so feel free to be harsh.

Once the metadata can include file-path information, I will be happy to help 
with ARROW-5138 as well.

Best,
Rick

On 5/16/19, 2:48 PM, "Joris Van den Bossche"  
wrote:

Missed the email of Wes, but yeah, I think we basically said the same.

Answer to another question you raised in the notebook:

> [about writing a _common_metadata file] ... uses the schema object for
> the 0th partition. This actually means that not *all* information in
> _common_metadata will be true for the entire dataset. More specifically,
> the "index_columns" [in the pandas_metadata] its "start" and "stop"
> values will correspond to the 0th partition, rather than the global 
dataset.
>
That's indeed a problem with storing the index information not as a column.
We have seen some other related issues about this, such as ARROW-5138 (when
reading a single row group of a parquet file).
In those cases, I think the only solution is to ignore this part of the
metadata. But, specifically for dask, I think the idea actually is to not
write the index at all (based on discussion in
https://github.com/dask/dask/pull/4336), so then you would not have this
problem.

However, note that writing the _common_metadata file like that from the
schema of the first partition might not be fully correct: it might have the
correct schema, but it will not have the correct dataset size (eg number of
row groups). Although I am not sure what the "common practice" is on this
aspect of _common_metadata file.

Joris



Op do 16 mei 2019 om 20:50 schreef Joris Van den Bossche <
jorisvandenboss...@gmail.com>:

> Hi Rick,
>
> Thanks for exploring this!
>
> I am still quite new to Parquet myself, so the following might not be
> fully correct, but based on my current understanding, to enable projects
> like dask to write the different pieces of a Parquet dataset using 
pyarrow,
> we need the following functionalities:
>
> - Write a single Parquet file (for one pieces / partition) and get the
> metadata of that file
> -> Writing is already long possible and ARROW-5258 (GH4236) enabled
> getting the metadata
> - Update and combine this list of metadata objects
> -> Dask needs a way to update the metadata (eg the exact file path
> where they put it inside the partitioned dataset): I opened ARROW-5349
> for this.
> -> We need to combine the metadata, discussed in ARROW-1983
> - Write a metadata object (for both the _metadata and _common_metadata
> files)
> -> Also discussed in ARROW-1983. The Python interface could also
> combine (step above) and write together.
>
> But it would be good if some people more familiar with Parquet could chime
> in here.
>
> Best,
> Joris
>
> Op do 16 mei 2019 om 16:37 schreef Richard Zamora :
>
>> Note that I was asked to post here after making a similar comment on
>> GitHub (https://github.com/apache/arrow/pull/4236)…
>>
>> I am hoping to help improve the use of pyarrow.parquet within dask (
>> https://github.com/dask/dask). To this end, I put together a simple
>> notebook to explore how pyarrow.parquet can be used to read/write a
>> partitioned dataset without dask (see:
>> 
https://github.com/rjzamora/notebooks/blob/master/pandas_pyarrow_simple.ipynb).
>> If you search for "Assuming that a single-file metadata solution is
>> currently missing" in that notebook, you will see where I am unsure of 
the
>> best way to write/read metadata to/from a centralized location using
>> pyarrow.parquet.
>>
>> I believe that it would be best for dask to have a way to read/write a
>> single metadata file for a partitioned dataset using pyarrow (perhaps a
>> ‘_metadata’ file?).   Am I correct to assume that: (1) this functionality
>> is missing in pyarrow, and (2) this  approach is the best way to process 
a
>> partitioned dataset in parallel?
>>
>> Best,
>> Rick
>>
>> --
>> Richard J. Zamora
>> NVIDIA
>>
>>
>>
>>
>> 
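
A sketch of the write-pieces / collect-metadata / write-_metadata workflow Joris
outlines above, using the pyarrow.parquet names that later landed for this
(metadata_collector, write_metadata); the exact arguments are assumptions rather
than the API available at the time of this thread.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'x': [1.0, 2.0, 3.0]})

# Write a piece of the dataset and collect its FileMetaData
collector = []
pq.write_to_dataset(table, 'dataset_root', metadata_collector=collector)

# ARROW-5349: each collected FileMetaData carries the piece's file path
# relative to the dataset root (FileMetaData.set_file_path)

# ARROW-1983: combine the per-piece metadata into a single _metadata file,
# plus a schema-only _common_metadata file
pq.write_metadata(table.schema, 'dataset_root/_common_metadata')
pq.write_metadata(table.schema, 'dataset_root/_metadata',
                  metadata_collector=collector)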

[jira] [Created] (ARROW-5389) [C++] Add an internal temporary directory API

2019-05-21 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5389:
-

 Summary: [C++] Add an internal temporary directory API
 Key: ARROW-5389
 URL: https://issues.apache.org/jira/browse/ARROW-5389
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 0.14.0


This is needed to easily write tests involving filesystem operations.





[jira] [Created] (ARROW-5388) [Go] use arrow.TypeEqual in array.NewChunked

2019-05-21 Thread Sebastien Binet (JIRA)
Sebastien Binet created ARROW-5388:
--

 Summary: [Go] use arrow.TypeEqual in array.NewChunked
 Key: ARROW-5388
 URL: https://issues.apache.org/jira/browse/ARROW-5388
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Sebastien Binet








Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

2019-05-21 Thread Wes McKinney
On Tue, May 21, 2019, 8:43 AM Antoine Pitrou  wrote:

>
> Le 21/05/2019 à 13:42, Wes McKinney a écrit :
> > hi Antoine,
> >
> > On Tue, May 21, 2019 at 5:48 AM Antoine Pitrou 
> wrote:
> >>
> >>
> >> Hi Wes,
> >>
> >> How does copy-on-write play together with memory-mapped data?  It seems
> >> that, depending on whether the memory map has several concurrent users
> >> (a condition which may be timing-dependent), we will either persist
> >> changes on disk or make them ephemeral in memory.  That doesn't sound
> >> very user-friendly, IMHO.
> >
> > With memory-mapping, any Buffer is sliced from the parent MemoryMap
> > [1] so mutating the data on disk using this interface wouldn't be
> > possible with the way that I've framed it.
>
> Hmm... I always forget that SliceBuffer returns a read-only view.
>

The more important issue is that parent_ is non-null. The idea is that no
mutation is allowed if we reason that another Buffer object has access to
the address space of interest. I think this style of copy-on-write is a
reasonable compromise that prevents most kinds of defensive copying.
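
To make that concrete, a toy sketch in plain Python (not the Arrow C++ code) of
the same copy-on-write rule: a buffer that has a parent, i.e. is a view over
someone else's memory, is never written in place; a private copy is made first.

class CowBuffer:
    def __init__(self, data, parent=None):
        self.data = data        # bytes-like payload
        self.parent = parent    # non-None when this buffer is a view of another

    def slice(self, offset, length):
        # a slice shares memory with its parent, so it records the parent
        return CowBuffer(memoryview(self.data)[offset:offset + length], parent=self)

    def ensure_mutable(self):
        if self.parent is not None:
            # another buffer can see this address range: copy before writing
            return CowBuffer(bytearray(self.data))
        return self

buf = CowBuffer(bytearray(b'arrow data'))
view = buf.slice(0, 5)
owned = view.ensure_mutable()  # a copy; the parent's bytes stay intact
owned.data[0:5] = b'ARROW'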


> Regards
>
> Antoine.
>


Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

2019-05-21 Thread Antoine Pitrou


Le 21/05/2019 à 13:42, Wes McKinney a écrit :
> hi Antoine,
> 
> On Tue, May 21, 2019 at 5:48 AM Antoine Pitrou  wrote:
>>
>>
>> Hi Wes,
>>
>> How does copy-on-write play together with memory-mapped data?  It seems
>> that, depending on whether the memory map has several concurrent users
>> (a condition which may be timing-dependent), we will either persist
>> changes on disk or make them ephemeral in memory.  That doesn't sound
>> very user-friendly, IMHO.
> 
> With memory-mapping, any Buffer is sliced from the parent MemoryMap
> [1] so mutating the data on disk using this interface wouldn't be
> possible with the way that I've framed it.

Hmm... I always forget that SliceBuffer returns a read-only view.

Regards

Antoine.


[jira] [Created] (ARROW-5387) [Go] properly handle sub-slice of List

2019-05-21 Thread Sebastien Binet (JIRA)
Sebastien Binet created ARROW-5387:
--

 Summary: [Go] properly handle sub-slice of List
 Key: ARROW-5387
 URL: https://issues.apache.org/jira/browse/ARROW-5387
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Sebastien Binet


consider an `array.List` with the following content:

`[[0 1 2] (null) [3 4 5 6]]`

 

sub-slicing it with `array.NewSlice(arr, 1, 3)`, we get:

`[(null) []]` instead of `[(null) [3 4 5 6]]`

 





Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

2019-05-21 Thread Wes McKinney
hi Antoine,

On Tue, May 21, 2019 at 5:48 AM Antoine Pitrou  wrote:
>
>
> Hi Wes,
>
> How does copy-on-write play together with memory-mapped data?  It seems
> that, depending on whether the memory map has several concurrent users
> (a condition which may be timing-dependent), we will either persist
> changes on disk or make them ephemeral in memory.  That doesn't sound
> very user-friendly, IMHO.

With memory-mapping, any Buffer is sliced from the parent MemoryMap
[1] so mutating the data on disk using this interface wouldn't be
possible with the way that I've framed it.

Note that memory-mapping at all is already significantly advanced over
what most people in the world are using every day. You won't find
examples of memory-mapping with pandas in my book, for example,
because it's not possible. So if you memory-map, perform some
analytics on the mapped data (causing results to be materialized in
memory), then write out the results to a new file (or set of files),
that would be an innovation for most users.

[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L353

>
> Regards
>
> Antoine.
>
>
> Le 21/05/2019 à 00:39, Wes McKinney a écrit :
> > hi folks,
> >
> > I'm interested in starting to build a so-called "data frame" interface
> > as a moderately opinionated, higher-level usability layer for
> > interacting with Arrow-based chunked in-memory data. I've had numerous
> > discussions (mostly in-person) over the last few years about this and
> > it feels to me that if we don't build something like this in Apache
> > Arrow that we could end up with several third party efforts without
> > much community discussion or collaboration, which would be sad.
> >
> > Another anti-pattern that is occurring is that users are loading data
> > into Arrow, converting to a library like pandas in order to do some
> > simple in-memory data manipulations, then converting back to Arrow.
> > This is not the intended long term mode of operation.
> >
> > I wrote in significantly more detail (~7-8 pages) about the context
> > and motivation for this project:
> >
> > https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing
> >
> > Note that this would be a parallel effort to go alongside the
> > previously-discussed "Query Engine" project, and the two things are
> > intended to work together. Since we are creating computational
> > kernels, this would also provide some immediacy in being able to
> > invoke kernels easily on large in-memory datasets without having to
> > wait for a more full-fledged query engine system to be developed
> >
> > The details with these kinds of projects can be bedeviling so my
> > approach would be to begin to lay down the core abstractions and basic
> > APIs and use the project to drive the agenda for kernel development
> > (which can also be used in the context of a query engine runtime).
> > From my past experience designing pandas and some other in-memory
> > analytics projects, I have some idea of the kinds of mistakes or
> > design patterns I would like to _avoid_ in this effort, but others may
> > have some experiences they can offer to inform the design approach as
> > well.
> >
> > Looking forward to comments and discussion.
> >
> > - Wes
> >


Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

2019-05-21 Thread Wes McKinney
Comments are on now, sorry about that.

On Tue, May 21, 2019, 1:06 AM Micah Kornfield  wrote:

> Hi Wes,
> It looks like comments are turned off on the doc, this intentional?
>
> Thanks,
> Micah
>
> On Mon, May 20, 2019 at 3:49 PM Wes McKinney  wrote:
>
> > hi folks,
> >
> > I'm interested in starting to build a so-called "data frame" interface
> > as a moderately opinionated, higher-level usability layer for
> > interacting with Arrow-based chunked in-memory data. I've had numerous
> > discussions (mostly in-person) over the last few years about this and
> > it feels to me that if we don't build something like this in Apache
> > Arrow that we could end up with several third party efforts without
> > much community discussion or collaboration, which would be sad.
> >
> > Another anti-pattern that is occurring is that users are loading data
> > into Arrow, converting to a library like pandas in order to do some
> > simple in-memory data manipulations, then converting back to Arrow.
> > This is not the intended long term mode of operation.
> >
> > I wrote in significantly more detail (~7-8 pages) about the context
> > and motivation for this project:
> >
> >
> >
> https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing
> >
> > Note that this would be a parallel effort to go alongside the
> > previously-discussed "Query Engine" project, and the two things are
> > intended to work together. Since we are creating computational
> > kernels, this would also provide some immediacy in being able to
> > invoke kernels easily on large in-memory datasets without having to
> > wait for a more full-fledged query engine system to be developed
> >
> > The details with these kinds of projects can be bedeviling so my
> > approach would be to begin to lay down the core abstractions and basic
> > APIs and use the project to drive the agenda for kernel development
> > (which can also be used in the context of a query engine runtime).
> > From my past experience designing pandas and some other in-memory
> > analytics projects, I have some idea of the kinds of mistakes or
> > design patterns I would like to _avoid_ in this effort, but others may
> > have some experiences they can offer to inform the design approach as
> > well.
> >
> > Looking forward to comments and discussion.
> >
> > - Wes
> >
>


Re: [pyarrow] Parquet page header size limit

2019-05-21 Thread shyam narayan singh
Hi

I have submitted parent PR  and
the submodule PR .

Regards
Shyam

On Tue, May 21, 2019 at 12:09 PM shyam narayan singh <
shyambits2...@gmail.com> wrote:

> Thanks Micah and Wes. Will try to submit a PR in a day or two.
>
> Regards
> Shyam
>
> On Mon, May 20, 2019 at 10:46 PM Wes McKinney  wrote:
>
>> Those instructions are a bit out of date after the monorepo merge, see
>>
>>
>> https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst#apache-parquet-development
>>
>> On Mon, May 20, 2019 at 8:33 AM Micah Kornfield 
>> wrote:
>> >
>> > Hi Shyam,
>> > https://github.com/apache/parquet-testing contains stand alone test
>> files.
>> >
>> >
>> >
>> https://github.com/apache/arrow/blob/master/cpp/src/parquet/bloom_filter-test.cc
>> > is an example of how this is used (search for get_data_dir).
>> >
>> >
>> > https://github.com/apache/parquet-cpp/blob/master/README.md#testing
>> > describes how to setup your environment to use it.
>> >
>> > Thanks,
>> > Micah
>> >
>> >
>> >
>> > n Monday, May 20, 2019, shyam narayan singh 
>> wrote:
>> >
>> > > Hi Wes
>> > >
>> > > Sorry, this slipped off my radar. I went ahead and dug into the problem
>> > > and filed the issue . We can track the error message as part of a
>> > > different bug?
>> > >
>> > > Now, I have a parquet file that can be read by java reader but not
>> pyarrow.
>> > > I have the fix for the issue but I do not know how to add a test case.
>> > > Reason being, the test cases generate the files and then test the
>> readers.
>> > > Is there a way to add an existing parquet file as a test case to the
>> > > current set of tests?
>> > >
>> > > Regards
>> > > Shyam
>> > >
>> > > Regards
>> > > Shyam
>> > >
>> > > On Tue, Apr 23, 2019 at 9:20 AM Wes McKinney 
>> wrote:
>> > >
>> > > > hi Shyam,
>> > > >
>> > > > Well "Invalid data. Deserializing page header failed." is not a very
>> > > > good error message. Can you open a JIRA issue and provide a way to
>> > > > reproduce the problem (e.g. code to generate a file, or a sample
>> > > > file)? From what you say it seems to be an atypical usage of
>> Parquet,
>> > > > but there might be a configurable option we can add to help. IIRC
>> the
>> > > > large header limit is there to prevent runaway behavior in malformed
>> > > > Parquet files. I believe we used other Parquet implementations to
>> > > > guide the choice
>> > > >
>> > > > Thanks
>> > > >
>> > > > On Wed, Apr 17, 2019 at 6:09 AM shyam narayan singh
>> > > >  wrote:
>> > > > >
>> > > > > My mistake. The max is 16MB.
>> > > > >
>> > > > > So, if deserialisation fails, we keep trying until we hit the max;
>> > > > > that works but is not efficient. Looks like the custom page header is
>> > > > > not deserialisable. Will keep digging.
>> > > > >
>> > > > > Thanks
>> > > > > Shyam
>> > > > >
>> > > > > On Wed, Apr 17, 2019 at 11:56 AM shyam narayan singh <
>> > > > > shyambits2...@gmail.com> wrote:
>> > > > >
>> > > > > > Hi
>> > > > > >
>> > > > > > While reading a custom parquet file that has extra information
>> > > embedded
>> > > > > > (some custom stats), pyarrow is failing to read it.
>> > > > > >
>> > > > > >
>> > > > > > Traceback (most recent call last):
>> > > > > >
>> > > > > >   File "/tmp/pytest.py", line 19, in 
>> > > > > >
>> > > > > > table = dataset.read()
>> > > > > >
>> > > > > >   File
>> "/usr/local/lib/python3.7/site-packages/pyarrow/parquet.py",
>> > > > line
>> > > > > > 214, in read
>> > > > > >
>> > > > > > use_threads=use_threads)
>> > > > > >
>> > > > > >   File "pyarrow/_parquet.pyx", line 737, in
>> > > > > > pyarrow._parquet.ParquetReader.read_all
>> > > > > >
>> > > > > >   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
>> > > > > >
>> > > > > > pyarrow.lib.ArrowIOError: Couldn't deserialize thrift:
>> > > > TProtocolException:
>> > > > > > Invalid data
>> > > > > >
>> > > > > > Deserializing page header failed.
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > Looking at the code, I realised that SerializedPageReader throws
>> > > > exception
>> > > > > > if the page header size goes beyond 16k (default max). There is
>> a
>> > > > setter
>> > > > > > method for the max page header size that is used only in tests.
>> > > > > >
>> > > > > >
>> > > > > > Is there a way to get around the problem?
>> > > > > >
>> > > > > >
>> > > > > > Regards
>> > > > > >
>> > > > > > Shyam
>> > > > > >
>> > > >
>> > >
>>
>


Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

2019-05-21 Thread Antoine Pitrou


Hi Wes,

How does copy-on-write play together with memory-mapped data?  It seems
that, depending on whether the memory map has several concurrent users
(a condition which may be timing-dependent), we will either persist
changes on disk or make them ephemeral in memory.  That doesn't sound
very user-friendly, IMHO.

Regards

Antoine.


Le 21/05/2019 à 00:39, Wes McKinney a écrit :
> hi folks,
> 
> I'm interested in starting to build a so-called "data frame" interface
> as a moderately opinionated, higher-level usability layer for
> interacting with Arrow-based chunked in-memory data. I've had numerous
> discussions (mostly in-person) over the last few years about this and
> it feels to me that if we don't build something like this in Apache
> Arrow that we could end up with several third party efforts without
> much community discussion or collaboration, which would be sad.
> 
> Another anti-pattern that is occurring is that users are loading data
> into Arrow, converting to a library like pandas in order to do some
> simple in-memory data manipulations, then converting back to Arrow.
> This is not the intended long term mode of operation.
> 
> I wrote in significantly more detail (~7-8 pages) about the context
> and motivation for this project:
> 
> https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing
> 
> Note that this would be a parallel effort to go alongside the
> previously-discussed "Query Engine" project, and the two things are
> intended to work together. Since we are creating computational
> kernels, this would also provide some immediacy in being able to
> invoke kernels easily on large in-memory datasets without having to
> wait for a more full-fledged query engine system to be developed
> 
> The details with these kinds of projects can be bedeviling so my
> approach would be to begin to lay down the core abstractions and basic
> APIs and use the project to drive the agenda for kernel development
> (which can also be used in the context of a query engine runtime).
> From my past experience designing pandas and some other in-memory
> analytics projects, I have some idea of the kinds of mistakes or
> design patterns I would like to _avoid_ in this effort, but others may
> have some experiences they can offer to inform the design approach as
> well.
> 
> Looking forward to comments and discussion.
> 
> - Wes
> 


[jira] [Created] (ARROW-5386) Making the rounding behavior of the buffer capacity configurable

2019-05-21 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5386:
---

 Summary: Making the rounding behavior of the buffer capacity 
configurable
 Key: ARROW-5386
 URL: https://issues.apache.org/jira/browse/ARROW-5386
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


In our scenario, the following code snippet is frequent in our code base:

int requestSize = ...;
if (requestSize <= allocator.getLimit() - allocator.getAllocatedMemory()) {
 ArrowBuf buffer = allocator.buffer(requestSize);
}

However, it often causes OutOfMemoryException, due to Arrow's rounding behavior.

For example, we have only 12 MB memory left, and we request a buffer with size 
10 MB. Apparently, there is sufficient memory to meet the request. However, 
the rounding behavior rounds the request size from 10 MB to 16 MB, and there is 
no 16 MB memory, so an OutOfMemoryException will be thrown.

We propose two ways to solve this problem:

1. We provide a rounding option as an argument to the BaseAllocator#buf method. 
There are two possible values for the rounding option: rounding up and rounding 
down. In the above scenario, the rounding down option can solve the problem.

2. We add a method to the allocator:

int getRoundedSize(final int size, BaseAllocator.AllocationRoundingOption 
roundingOption)

This method will give the rounding buffer size, given the initial request size. 
With this method, the user can freely adjust their request size to avoid OOM.

 





[jira] [Created] (ARROW-5385) [Go] implement EXTENSION datatype

2019-05-21 Thread Sebastien Binet (JIRA)
Sebastien Binet created ARROW-5385:
--

 Summary: [Go] implement EXTENSION datatype
 Key: ARROW-5385
 URL: https://issues.apache.org/jira/browse/ARROW-5385
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Go
Reporter: Sebastien Binet








[jira] [Created] (ARROW-5384) [Go] add FixedSizeList array

2019-05-21 Thread Sebastien Binet (JIRA)
Sebastien Binet created ARROW-5384:
--

 Summary: [Go] add FixedSizeList array
 Key: ARROW-5384
 URL: https://issues.apache.org/jira/browse/ARROW-5384
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Go
Reporter: Sebastien Binet








[jira] [Created] (ARROW-5383) [Go] update IPC flatbuf (new Duration type)

2019-05-21 Thread Sebastien Binet (JIRA)
Sebastien Binet created ARROW-5383:
--

 Summary: [Go] update IPC flatbuf (new Duration type)
 Key: ARROW-5383
 URL: https://issues.apache.org/jira/browse/ARROW-5383
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Sebastien Binet








[jira] [Created] (ARROW-5382) SSE on ARM NEON

2019-05-21 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-5382:
--

 Summary: SSE on ARM NEON
 Key: ARROW-5382
 URL: https://issues.apache.org/jira/browse/ARROW-5382
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Krisztian Szucs


Arrow doesn't yet support SIMD on ARM architectures. SSE on ARM can be 
complicated, but there are a couple of libraries we could depend on, namely:
- https://github.com/nemequ/simde (MIT)
- 
https://github.com/catboost/catboost/tree/ee47f9aa399833cb04bfeec5fe9f3e3792d428e4/library/sse
 (Apache)





[jira] [Created] (ARROW-5381) Crash at arrow::internal::CountSetBits

2019-05-21 Thread Tham (JIRA)
Tham created ARROW-5381:
---

 Summary: Crash at arrow::internal::CountSetBits
 Key: ARROW-5381
 URL: https://issues.apache.org/jira/browse/ARROW-5381
 Project: Apache Arrow
  Issue Type: Bug
 Environment: Operating System: Windows 7 Professional 64-bit (6.1, 
Build 7601) Service Pack 1(7601.win7sp1_ldr_escrow.181110-1429)
Language: English (Regional Setting: English)
System Manufacturer: SAMSUNG ELECTRONICS CO., LTD.
System Model: RV420/RV520/RV720/E3530/S3530/E3420/E3520
BIOS: Phoenix SecureCore-Tiano(tm) NB Version 2.1 05PQ
Processor: Intel(R) Pentium(R) CPU B950 @ 2.10GHz (2 CPUs), ~2.1GHz
Memory: 2048MB RAM
Available OS Memory: 1962MB RAM
  Page File: 1517MB used, 2405MB available
Windows Dir: C:\Windows
DirectX Version: DirectX 11

Reporter: Tham


I've got a lot of crash dumps from a customer's Windows machine. The stack trace 
shows that it crashed at arrow::internal::CountSetBits.

 
{code:java}
STACK_TEXT:  
00c9`5354a4c0 7ff7`2f2830fd : 00c9`544841c0 ` 
`1e00 ` : 
CortexService!arrow::internal::CountSetBits+0x16d
00c9`5354a550 7ff7`2f2834b7 : 00c9`5337c930 ` 
` ` : 
CortexService!arrow::ArrayData::GetNullCount+0x8d
00c9`5354a580 7ff7`2f13df55 : 00c9`54476080 00c9`5354a5d8 
` ` : 
CortexService!arrow::Array::null_count+0x37
00c9`5354a5b0 7ff7`2f13fb68 : 00c9`5354ab40 00c9`5354a6f8 
00c9`54476080 ` : CortexService!parquet::arrow::`anonymous 
namespace'::LevelBuilder::Visit >+0xa5
00c9`5354a640 7ff7`2f12fa34 : 00c9`5354a6f8 00c9`54476080 
00c9`5354ab40 ` : 
CortexService!arrow::VisitArrayInline+0x298
00c9`5354a680 7ff7`2f14bf03 : 00c9`5354ab40 00c9`5354a6f8 
00c9`54476080 ` : CortexService!parquet::arrow::`anonymous 
namespace'::LevelBuilder::VisitInline+0x44
00c9`5354a6c0 7ff7`2f12fe2a : 00c9`5354ab40 00c9`5354ae18 
00c9`54476080 00c9`5354b208 : CortexService!parquet::arrow::`anonymous 
namespace'::LevelBuilder::GenerateLevels+0x93
00c9`5354aa00 7ff7`2f14de56 : 00c9`5354b1f8 00c9`5354afc8 
00c9`54476080 `1e00 : CortexService!parquet::arrow::`anonymous 
namespace'::ArrowColumnWriter::Write+0x25a
00c9`5354af20 7ff7`2f14e66b : 00c9`5354b1f8 00c9`5354b238 
00c9`54445c20 ` : CortexService!parquet::arrow::`anonymous 
namespace'::ArrowColumnWriter::Write+0x2a6
00c9`5354b040 7ff7`2f12f137 : 00c9`544041f0 00c9`5354b4d8 
00c9`5354b4a8 ` : 
CortexService!parquet::arrow::FileWriter::Impl::WriteColumnChunk+0x70b
00c9`5354b400 7ff7`2f14b4d5 : 00c9`54431180 00c9`5354b4d8 
00c9`5354b4a8 ` : 
CortexService!parquet::arrow::FileWriter::WriteColumnChunk+0x67
00c9`5354b450 7ff7`2f12eef1 : 00c9`5354b5d8 00c9`5354b648 
` `1e00 : 
CortexService!::operator()+0x195
00c9`5354b530 7ff7`2eb8e31e : 00c9`54431180 00c9`5354b760 
00c9`54442fb0 `1e00 : 
CortexService!parquet::arrow::FileWriter::WriteTable+0x521
00c9`5354b730 7ff7`2eb58ac5 : 00c9`5307bd88 00c9`54442fb0 
` ` : 
CortexService!Cortex::Storage::ParquetStreamWriter::writeRowGroup+0xfe
00c9`5354b860 7ff7`2eafdce6 : 00c9`5307bd80 00c9`5354ba08 
00c9`5354b9e0 00c9`5354b9d8 : 
CortexService!Cortex::Storage::ParquetFileWriter::writeRowGroup+0x545
00c9`5354b9a0 7ff7`2eaf8bae : 00c9`53275600 00c9`53077220 
`fffe ` : 
CortexService!Cortex::Storage::DataStreamWriteWorker::onNewData+0x1a6
{code}
{code:java}
FAILED_INSTRUCTION_ADDRESS: 
CortexService!arrow::internal::CountSetBits+16d 
[c:\jenkins\workspace\cortexv2-dev-win64-service\src\thirdparty\arrow\cpp\src\arrow\util\bit-util.cc
 @ 99]
7ff7`2f3a4e4d f3480fb800  popcnt  rax,qword ptr [rax]

FOLLOWUP_IP: 
CortexService!arrow::internal::CountSetBits+16d 
[c:\jenkins\workspace\cortexv2-dev-win64-service\src\thirdparty\arrow\cpp\src\arrow\util\bit-util.cc
 @ 99]
7ff7`2f3a4e4d f3480fb800  popcnt  rax,qword ptr [rax]

FAULTING_SOURCE_LINE:  
c:\jenkins\workspace\cortexv2-dev-win64-service\src\thirdparty\arrow\cpp\src\arrow\util\bit-util.cc

FAULTING_SOURCE_FILE:  
c:\jenkins\workspace\cortexv2-dev-win64-service\src\thirdparty\arrow\cpp\src\arrow\util\bit-util.cc

FAULTING_SOURCE_LINE_NUMBER:  99

SYMBOL_STACK_INDEX:  0

SYMBOL_NAME:  cortexservice!arrow::internal::CountSetBits+16d
{code}
{code:java}
ERROR_CODE: (NTSTATUS) 0xc01d - {EXCEPTION}  Illegal Instruction  An 
attempt was made to execute an illegal instruction.

EXCEPTION_CODE: (NTSTATUS) 0xc01d - {EXCEPTION}  I