GSoC 2019 with a bit of Apache Arrow

2019-03-05 Thread Sebastien Binet
Hi there, Just to let you know CERN has been accepted as a GSoC organization this year. As such, I have submitted a proposal that's loosely connected to Apache Arrow (and Go.) Here's the proposal: https://hepsoftwarefoundation.org/gsoc/2019/proposal_GoHEPgroot.html It's mostly about *using* Arr

Re: [Discuss][Format] Checksum/Hash signature for data

2019-03-05 Thread Micah Kornfield
Doing some light research, it looks like xxhash has better cross-platform support and is faster than a vanilla implementation of crc32 [1]. However, crc32c (a slightly different crc32 algorithm) is hardware accelerated on newer (circa 2016) Intel CPUs [2] and is potentially faster. [1] https://cyan4973.

Re: [Discuss][Format] Checksum/Hash signature for data

2019-03-05 Thread Micah Kornfield
Thanks Philipp, Yeah, I probably shouldn't have said SHA1 either :) I'm not too concerned with a particular hash/checksum implementation. It would be good to have at least 1 or 2 well supported ones, and a migration path to support more if necessary without breaking file/streaming formats for

Re: [Discuss][Format] Checksum/Hash signature for data

2019-03-05 Thread Philipp Moritz
Hey Micah, in plasma, we are using xxhash to compute a hash/checksum [1] (it is computed in parallel using multiple threads) and have good experience with it -- all data in Ray is checksummed this way. Initially there were problems with uninitialized bits in the arrow representation, but that has

Re: [Discuss][Format] Checksum/Hash signature for data

2019-03-05 Thread Philipp Moritz
(I meant to say SHA256 instead of SHA1) On Tue, Mar 5, 2019 at 9:45 PM Philipp Moritz wrote: > Hey Micah, > > in plasma, we are using xxhash to compute a hash/checksum [1] (it is > computed in parallel using multiple threads) and have good experience with > it -- all data in Ray is checksummed t

[Discuss][Format] Checksum/Hash signature for data

2019-03-05 Thread Micah Kornfield
Hi Arrow Dev, As we expand the use-cases for Arrow to move it more across system boundaries (Flight) and make it live longer (e.g. in the file format), it seems to make sense to build in a mechanism for data integrity verification (e.g. a checksum like CRC32 or in some cases a cryptographic hash li
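
To make the idea in this message concrete, here is a minimal sketch of integrity verification over a sequence of byte buffers, using only the Python standard library. It is not an Arrow API and not what the thread settled on: the function names `checksum_buffers` and `verify` are hypothetical, and CRC32 and SHA-256 stand in for whichever algorithms a format would actually adopt.

```python
import hashlib
import zlib


def checksum_buffers(buffers):
    """Return (crc32, sha256_hex) computed over the buffers, in order.

    Hypothetical helper for illustration; `buffers` is any iterable of
    bytes-like objects standing in for the buffers of a message.
    """
    crc = 0
    sha = hashlib.sha256()
    for buf in buffers:
        # zlib.crc32 accepts a running value, so the CRC continues
        # across buffers as if they were one contiguous byte string.
        crc = zlib.crc32(buf, crc)
        sha.update(buf)
    return crc & 0xFFFFFFFF, sha.hexdigest()


def verify(buffers, expected_crc):
    """Return True if the recomputed CRC32 matches the stored one."""
    return checksum_buffers(buffers)[0] == expected_crc


if __name__ == "__main__":
    data = [b"ab", b"cd"]
    crc, digest = checksum_buffers(data)
    print(verify(data, crc))            # unchanged data
    print(verify([b"ab", b"cX"], crc))  # one byte altered
```

A writer would store the checksum alongside the data and a reader would recompute it on receipt; CRC32 catches accidental corruption cheaply, while a cryptographic hash such as SHA-256 is the usual choice when tampering is a concern.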

[jira] [Created] (ARROW-4784) [C++][CI] Re-enable flaky mingw tests.

2019-03-05 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-4784: -- Summary: [C++][CI] Re-enable flaky mingw tests. Key: ARROW-4784 URL: https://issues.apache.org/jira/browse/ARROW-4784 Project: Apache Arrow Issue Type: B

[jira] [Created] (ARROW-4783) [C++][CI] Mingw32 builds sometimes timeout

2019-03-05 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-4783: -- Summary: [C++][CI] Mingw32 builds sometimes timeout Key: ARROW-4783 URL: https://issues.apache.org/jira/browse/ARROW-4783 Project: Apache Arrow Issue Typ

[jira] [Created] (ARROW-4782) [C++] Prototype scalar and array expression types for developing deferred operator algebra

2019-03-05 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4782: --- Summary: [C++] Prototype scalar and array expression types for developing deferred operator algebra Key: ARROW-4782 URL: https://issues.apache.org/jira/browse/ARROW-4782

[jira] [Created] (ARROW-4781) [JS] Ensure empty data initializes empty typed arrays

2019-03-05 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4781: -- Summary: [JS] Ensure empty data initializes empty typed arrays Key: ARROW-4781 URL: https://issues.apache.org/jira/browse/ARROW-4781 Project: Apache Arrow Issue

[jira] [Created] (ARROW-4780) [JS] Package sourcemap files, update default package JS version

2019-03-05 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4780: -- Summary: [JS] Package sourcemap files, update default package JS version Key: ARROW-4780 URL: https://issues.apache.org/jira/browse/ARROW-4780 Project: Apache Arrow

[jira] [Created] (ARROW-4779) [CI] AppVeyor link failure

2019-03-05 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-4779: - Summary: [CI] AppVeyor link failure Key: ARROW-4779 URL: https://issues.apache.org/jira/browse/ARROW-4779 Project: Apache Arrow Issue Type:

Re: Depending on non-released Apache projects (C++ Avro)

2019-03-05 Thread Wes McKinney
I am OK with that, but if we find ourselves making compromises that affect performance or memory efficiency (where possibly invasive refactoring may be required) perhaps we should reconsider option #3. On Tue, Mar 5, 2019 at 11:29 AM Uwe L. Korn wrote: > > I'm leaning a bit towards 1) but I would

[jira] [Created] (ARROW-4778) [C++/Python] manylinux1: Update Thrift to 0.12.0

2019-03-05 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4778: -- Summary: [C++/Python] manylinux1: Update Thrift to 0.12.0 Key: ARROW-4778 URL: https://issues.apache.org/jira/browse/ARROW-4778 Project: Apache Arrow Issue Type:

[jira] [Created] (ARROW-4777) [C++/Python] manylinux1: Update lz4 to 1.8.3

2019-03-05 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4777: -- Summary: [C++/Python] manylinux1: Update lz4 to 1.8.3 Key: ARROW-4777 URL: https://issues.apache.org/jira/browse/ARROW-4777 Project: Apache Arrow Issue Type: Imp

Re: [Discuss][C++] Hashing floating point numbers

2019-03-05 Thread Wes McKinney
+1 from me. Thanks for driving this discussion so we have the rationale documented On Tue, Mar 5, 2019 at 12:16 AM Micah Kornfield wrote: > > OK to summarize my understanding of the thoughts expressed: > 1. People really shouldn't be trying to do things like grouping and > joining on double valu

Re: Depending on non-released Apache projects (C++ Avro)

2019-03-05 Thread Uwe L. Korn
I'm leaning a bit towards 1) but I would love to get some input from the Avro community as 1) depends also on their side as we will submit some patches upstream that need to be reviewed and someday also released. Are AVRO committers subscribed here or should we reach out to them on their ML? Gi

Re: Depending on non-released Apache projects (C++ Avro)

2019-03-05 Thread Wes McKinney
I'd be +0.5 in favor of forking in this particular case. Since Avro is not vectorized (unlike Parquet and ORC) I suspect it may be more difficult to get the best performance using a general purpose API versus one that is more specialized to producing Arrow record batches. Given that has been relati

[jira] [Created] (ARROW-4776) [C++] DictionaryBuilder should support bootstrapping from an existing dict type

2019-03-05 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-4776: - Summary: [C++] DictionaryBuilder should support bootstrapping from an existing dict type Key: ARROW-4776 URL: https://issues.apache.org/jira/browse/ARROW-4776

[jira] [Created] (ARROW-4775) [Website] Site navbar cannot be expanded

2019-03-05 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4775: --- Summary: [Website] Site navbar cannot be expanded Key: ARROW-4775 URL: https://issues.apache.org/jira/browse/ARROW-4775 Project: Apache Arrow Issue Type: Bug

[jira] [Created] (ARROW-4774) Python crash writing nested array to parquet

2019-03-05 Thread Stephen Gallagher (JIRA)
Stephen Gallagher created ARROW-4774: Summary: Python crash writing nested array to parquet Key: ARROW-4774 URL: https://issues.apache.org/jira/browse/ARROW-4774 Project: Apache Arrow Iss