Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)
Hi Antoine, I think Liya Fan raised some good points in his reply but I'd like to answer your questions directly. > So the question is whether this really needs to be in the in-memory > format, i.e. is it desired to operate directly on this compressed > format, or is it solely for transport? I tried to separate the two concepts into Encodings (things Arrow can operate directly on) and Compression (solely for transport). While there is some overlap I think the two features can be considered separately. For each encoding there is additional implementation complexity to properly exploit it. However, the benefit for some workloads can be large [1][2]. > If the latter, I wonder why Parquet cannot simply be used instead of > reinventing something similar but different. This is a reasonable point. However, there is a continuum here between file size and read/write times. Parquet will likely always be the smallest with the largest times to convert to and from Arrow. An uncompressed Feather/Arrow file will likely always take the most space but will have much faster conversion times. The question is whether a buffer-level or some other sub-file-level compression scheme provides enough value compared with compressing the entire Feather file. This is somewhat hand-wavy, but if we feel we might want to investigate this further I can write some benchmarks to quantify the differences. Cheers, Micah [1] http://db.csail.mit.edu/projects/cstore/abadicidr07.pdf [2] http://db.csail.mit.edu/projects/cstore/abadisigmod06.pdf On Fri, Jul 12, 2019 at 2:24 AM Antoine Pitrou wrote: > > Le 12/07/2019 à 10:08, Micah Kornfield a écrit : > > OK, I've created a separate thread for data integrity/digests [1], and > > retitled this thread to continue the discussion on compression and > > encodings. As a reminder the PR for the format additions [2] suggested a > > new SparseRecordBatch that would allow for the following features: > > 1. Different data encodings at the Array (e.g. 
RLE) and Buffer levels > > (e.g. narrower bit-width integers) > > 2. Compression at the buffer level > > 3. Eliding all metadata and data for empty columns. > > So the question is whether this really needs to be in the in-memory > format, i.e. is it desired to operate directly on this compressed > format, or is it solely for transport? > > If the latter, I wonder why Parquet cannot simply be used instead of > reinventing something similar but different. > > Regards > > Antoine. >
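[Editorial note: for readers unfamiliar with the RLE encoding discussed above, the idea can be sketched in a few lines of plain Python. This is a conceptual sketch only, not Arrow's implementation or proposed API.]

```python
def rle_encode(values):
    """Collapse consecutive repeated values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Expand (value, run_length) pairs back to the original values."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

data = [7, 7, 7, 0, 0, 7]
encoded = rle_encode(data)            # [(7, 3), (0, 2), (7, 1)]
assert rle_decode(encoded) == data
```

The implementation-complexity point above is that operating *directly* on the `(value, run_length)` pairs (e.g. summing or filtering without decoding) is where the large workload benefits in [1][2] come from, and that is what each kernel would have to implement per encoding.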
Re: IPC Tensor + Indices
Hi Razvan, I'm not sure about plans around tensors. However, depending on how you are trying to transfer the data and consume it, you might consider using an extension type [1]. For the physical representation you could model it as something like: { RowLabel : Date32/64 ColumnLabels : FixedSizeList (dictionary encoded) Data : FixedSize } which would be more compact than making N individual columns if N is large. You would have to handle the mapping from column label to index at the application level though. Hope this helps. -Micah [1] https://github.com/apache/arrow/blob/6fb850cf57fd6227573cca6d43a46e1d5d2b0a66/docs/source/format/Metadata.rst#extension-types On Fri, Jul 12, 2019 at 1:53 PM Razvan Chitu wrote: > Sure. I'd like to bundle an M x N shaped tensor along with the M row labels > (dates) and N column labels (string identifiers) in one response. > > Razvan > > On Fri, Jul 12, 2019, 6:53 PM Wes McKinney wrote: > > > hi Razvan -- can you clarify what "together with a row and a column > > index" means? > > > > On Fri, Jul 12, 2019 at 11:17 AM Razvan Chitu > > wrote: > > > > > > Hi, > > > > > > Does the IPC format currently support streaming a tensor together with > a > > > row and a column index? If not, are there any plans for this to be > > > supported? It'd be quite useful for matrices that could have 10s of > > > thousands of either rows, columns or both. For my use case I am > currently > > > representing matrices as record batches, but performance is not that > > great > > > when there are many columns and few rows. > > > > > > Thanks, > > > Razvan > > >
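[Editorial note: the application-level "column label to index" mapping Micah mentions can be as simple as a dictionary built once per response. A hypothetical plain-Python sketch, with made-up labels and values, not an Arrow API:]

```python
# N column labels (string identifiers), transferred once with the response
column_labels = ["AAPL", "MSFT", "GOOG"]
label_to_index = {label: i for i, label in enumerate(column_labels)}

# one logical row of the modelled layout: a row label (date) plus a
# fixed-size list of N data values in column-label order
row = {"RowLabel": "2019-07-12", "Data": [101.2, 55.7, 1130.0]}

# resolve a cell by (row, column label) at the application level
value = row["Data"][label_to_index["MSFT"]]
assert value == 55.7
```

This keeps each record batch at a handful of columns regardless of N, which is the compactness win over N individual columns.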
[jira] [Created] (ARROW-5943) [GLib][Gandiva] Add support for function aliases
Sutou Kouhei created ARROW-5943: --- Summary: [GLib][Gandiva] Add support for function aliases Key: ARROW-5943 URL: https://issues.apache.org/jira/browse/ARROW-5943 Project: Apache Arrow Issue Type: Improvement Components: GLib Reporter: Sutou Kouhei Assignee: Sutou Kouhei Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5942) [JS] Implement Tensor Type
Todd Hay created ARROW-5942: --- Summary: [JS] Implement Tensor Type Key: ARROW-5942 URL: https://issues.apache.org/jira/browse/ARROW-5942 Project: Apache Arrow Issue Type: New Feature Components: JavaScript Reporter: Todd Hay Implement a generic N-dimensional tensor type for JavaScript -- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems
Hi, I've created pull requests that were used to release 0.14.0: ARROW-5937: [Release] Stop parallel binary upload https://github.com/apache/arrow/pull/4868 ARROW-5938: [Release] Create branch for adding release note automatically https://github.com/apache/arrow/pull/4869 ARROW-5939: [Release] Add support for generating vote email template separately https://github.com/apache/arrow/pull/4870 ARROW-5940: [Release] Add support for re-uploading sign/checksum for binary artifacts https://github.com/apache/arrow/pull/4871 ARROW-5941: [Release] Avoid re-uploading already uploaded binary artifacts https://github.com/apache/arrow/pull/4872 (This will be conflicted with https://github.com/apache/arrow/pull/4868 .) They will be useful to release 0.14.1. Thanks, -- kou In "Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems" on Fri, 12 Jul 2019 13:27:41 -0500, Wes McKinney wrote: > I updated https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014 > to include all the cited patches, as well as the Parquet forward > compatibility fix. > > I'm waiting on CI to be able to pass ARROW-5921 (fuzzing-discovered > IPC crash) and the ARROW-5889 (Parquet backwards compatibility with > 0.13) needs to be rebased > > https://github.com/apache/arrow/pull/4856 > > I think those are the last 2 patches that should go into the branch > unless something else comes up. Once those land I'll update the > commands and then push up the patch release branch (hopefully > everything will cherry pick cleanly) > > On Fri, Jul 12, 2019 at 12:34 PM Francois Saint-Jacques > wrote: >> >> There's also ARROW-5921 (I tagged it 0.14.1) if it passes travis. This >> one fixes a segfault found via fuzzing. 
>> >> François >> >> On Fri, Jul 12, 2019 at 6:54 AM Krisztián Szűcs >> wrote: >> > >> > PRs touching the wheel packaging scripts: >> > - https://github.com/apache/arrow/pull/4828 (lz4) >> > - https://github.com/apache/arrow/pull/4833 (uriparser - only if >> > https://github.com/apache/arrow/commit/88fcb096c4f24861bc7f8181cba1ad8be0e4048a >> > is cherry picked as well) >> > - https://github.com/apache/arrow/pull/4834 (zlib) >> > >> > On Fri, Jul 12, 2019 at 11:49 AM Hatem Helal wrote: >> > >> > > Thanks François, I closed PARQUET-1623 this morning. It would be nice to >> > > include the PR in the patch release: >> > > >> > > https://github.com/apache/arrow/pull/4857 >> > > >> > > This bug has been around for a few releases but I think it should be a >> > > low >> > > risk change to include. >> > > >> > > Hatem >> > > >> > > >> > > On 7/12/19, 2:27 AM, "Francois Saint-Jacques" >> > > wrote: >> > > >> > > I just merged PARQUET-1623, I think it's worth inserting since it >> > > fixes an invalid memory write. Note that I couldn't resolve/close the >> > > parquet issue, do I have to be contributor to the project? >> > > >> > > François >> > > >> > > On Thu, Jul 11, 2019 at 6:10 PM Wes McKinney >> > > wrote: >> > > > >> > > > I just merged Eric's 2nd patch ARROW-5908 and I went through all >> > > the >> > > > patches since the release commit and have come up with the >> > > following >> > > > list of 32 fix-only patches to pick into a maintenance branch: >> > > > >> > > > https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014 >> > > > >> > > > Note there's still unresolved Parquet forward/backward >> > > compatibility >> > > > issues in C++ that we haven't merged patches for yet, so that is >> > > > pending. >> > > > >> > > > Are there any other patches / JIRA issues people would like to see >> > > > resolved in a patch release? 
>> > > > >> > > > Thanks >> > > > >> > > > On Thu, Jul 11, 2019 at 3:03 PM Wes McKinney >> > > wrote: >> > > > > >> > > > > Eric -- you are free to set the Fix Version prior to the patch >> > > being merged >> > > > > >> > > > > On Thu, Jul 11, 2019 at 3:01 PM Eric Erhardt >> > > > > wrote: >> > > > > > >> > > > > > The two C# fixes I'd like in the 0.14.1 release are: >> > > > > > >> > > > > > https://issues.apache.org/jira/browse/ARROW-5887 - already >> > > marked with 0.14.1 fix version. >> > > > > > https://issues.apache.org/jira/browse/ARROW-5908 - hasn't been >> > > resolved yet. The PR https://github.com/apache/arrow/pull/4851 has one >> > > approver and the Rust failure doesn't appear to be caused by my change. >> > > > > > >> > > > > > I assume I shouldn't mark ARROW-5908 with a 0.14.1 fix version >> > > until the PR has been merged. >> > > > > > >> > > > > > -Original Message- >> > > > > > From: Neal Richardson >> > > > > > Sent: Thursday, July 11, 2019 11:59 AM >> > > > > > To: dev@arrow.apache.org >> > > > > > Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python >> > > package problems, Parquet forward compatibility problems >> > > > > > >> >
[jira] [Created] (ARROW-5941) [Release] Avoid re-uploading already uploaded binary artifacts
Sutou Kouhei created ARROW-5941: --- Summary: [Release] Avoid re-uploading already uploaded binary artifacts Key: ARROW-5941 URL: https://issues.apache.org/jira/browse/ARROW-5941 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Sutou Kouhei Assignee: Sutou Kouhei Fix For: 1.0.0, 0.14.1 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5940) [Release] Add support for re-uploading sign/checksum for binary artifacts
Sutou Kouhei created ARROW-5940: --- Summary: [Release] Add support for re-uploading sign/checksum for binary artifacts Key: ARROW-5940 URL: https://issues.apache.org/jira/browse/ARROW-5940 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Sutou Kouhei Assignee: Sutou Kouhei Fix For: 1.0.0, 0.14.1 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5939) [Release] Add support for generating vote email template separately
Sutou Kouhei created ARROW-5939: --- Summary: [Release] Add support for generating vote email template separately Key: ARROW-5939 URL: https://issues.apache.org/jira/browse/ARROW-5939 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Sutou Kouhei Assignee: Sutou Kouhei Fix For: 1.0.0, 0.14.1 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
RE: [DISCUSS][C++][Proposal] Threading engine for Arrow
Hi, folks. We were discussing improvements for the threading engine back in May and agreed to implement benchmarks (sorry, I've lost the original mail thread; here is the link: https://lists.apache.org/thread.html/c690253d0bde643a5b644af70ec1511c6e510ebc86cc970aa8d5252e@%3Cdev.arrow.apache.org%3E ). Here is an update on what's going on with this effort. We've implemented a rough prototype for group_by, aggregate, and transform execution nodes on top of Arrow (along with studying the whole data analytics domain along the way :-) ) and made them parallel, as you can see in this repository: https://github.com/anton-malakhov/nyc_taxi The result is that all these execution nodes scale well enough and run under 100 milliseconds on my 2 x Xeon E5-2650 v4 @ 2.20GHz with 128GB RAM, while the CSV reader takes several seconds to complete even when reading from an in-memory file (8GB); thus it is not I/O bound yet, even with good consumer-grade SSDs. My focus recently has therefore been on optimization of the CSV parser, where I have achieved a 50% improvement by substituting all the small object allocations with the TBB scalable allocator and using a TBB-based memory pool instead of the default one, with pre-allocated huge (2MB) memory pages (echo 3 > /proc/sys/vm/nr_hugepages). I have not yet found a way to do both of these tricks with jemalloc, so please try to beat or meet my times without the TBB allocator. I also see other hotspots and opportunities for optimization; for example, memset is heavily used while resizing buffers (why and why?), and the column builder thrashes caches by not using streaming stores. I used TBB directly to make the execution nodes parallel; however, I have also implemented a simple TBB-based ThreadPool and TaskGroup, as you can see in this PR: https://github.com/aregm/arrow/pull/6 I see consistent improvement (up to 1200%!) on the BM_ThreadedTaskGroup and BM_ThreadPoolSpawn microbenchmarks; however, applying it to the real-world task of the CSV reader, I don't see any improvements yet. 
Or even worse: while reading the file, TBB wastes some cycles spinning, probably because of the read-ahead thread, which oversubscribes the machine. Arrow's threading interacts better with the OS scheduler and thus shows better performance. So, this simple approach to TBB without a deeper redesign didn't help. I'll be looking into applying more sophisticated NUMA- and locality-aware tricks as I'll be cleaning up paths for the data streams in the parser. Though, I'll take some time off before returning to this effort. See you in September! Regards, // Anton
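[Editorial note: the ThreadPool/TaskGroup abstraction Anton refers to can be sketched in plain Python with the standard library. This is a conceptual illustration only; the actual prototype in the PR above is TBB-based C++.]

```python
from concurrent.futures import ThreadPoolExecutor

class TaskGroup:
    """Spawn tasks onto a shared pool and wait for all of them together."""
    def __init__(self, pool):
        self.pool = pool
        self.futures = []

    def spawn(self, fn, *args):
        self.futures.append(self.pool.submit(fn, *args))

    def wait(self):
        # results are returned in spawn order
        return [f.result() for f in self.futures]

with ThreadPoolExecutor(max_workers=4) as pool:
    group = TaskGroup(pool)
    for chunk in ([1, 2], [3, 4], [5, 6]):
        group.spawn(sum, chunk)   # e.g. parse/aggregate one chunk per task
    results = group.wait()

assert results == [3, 7, 11]
```

The design point in the thread is that the *pool* is a long-lived shared resource while a *task group* is a short-lived unit of fork/join parallelism, so a CSV reader can spawn one task per chunk and join them without managing threads itself.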
[jira] [Created] (ARROW-5938) [Release] Create branch for adding release note automatically
Sutou Kouhei created ARROW-5938: --- Summary: [Release] Create branch for adding release note automatically Key: ARROW-5938 URL: https://issues.apache.org/jira/browse/ARROW-5938 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Sutou Kouhei Assignee: Sutou Kouhei Fix For: 1.0.0, 0.14.1 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5937) [Release] Stop parallel binary upload
Sutou Kouhei created ARROW-5937: --- Summary: [Release] Stop parallel binary upload Key: ARROW-5937 URL: https://issues.apache.org/jira/browse/ARROW-5937 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Sutou Kouhei Assignee: Sutou Kouhei Fix For: 1.0.0, 0.14.1 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: IPC Tensor + Indices
Sure. I'd like to bundle an M x N shaped tensor along with the M row labels (dates) and N column labels (string identifiers) in one response. Razvan On Fri, Jul 12, 2019, 6:53 PM Wes McKinney wrote: > hi Razvan -- can you clarify what "together with a row and a column > index" means? > > On Fri, Jul 12, 2019 at 11:17 AM Razvan Chitu > wrote: > > > > Hi, > > > > Does the IPC format currently support streaming a tensor together with a > > row and a column index? If not, are there any plans for this to be > > supported? It'd be quite useful for matrices that could have 10s of > > thousands of either rows, columns or both. For my use case I am currently > > representing matrices as record batches, but performance is not that > great > > when there are many columns and few rows. > > > > Thanks, > > Razvan >
[jira] [Created] (ARROW-5936) [C++] [FlightRPC] user_metadata is not present in fields read from flight
Benjamin Kietzman created ARROW-5936: Summary: [C++] [FlightRPC] user_metadata is not present in fields read from flight Key: ARROW-5936 URL: https://issues.apache.org/jira/browse/ARROW-5936 Project: Apache Arrow Issue Type: Bug Components: C++, FlightRPC Reporter: Benjamin Kietzman Should this go in the arrow::Field::metadata property somewhere? Does user_metadata round trip through some other channel? https://github.com/apache/arrow/pull/4841#discussion_r302623241 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5935) [C++] ArrayBuilders with mutable type are not robustly supported
Benjamin Kietzman created ARROW-5935: Summary: [C++] ArrayBuilders with mutable type are not robustly supported Key: ARROW-5935 URL: https://issues.apache.org/jira/browse/ARROW-5935 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Benjamin Kietzman Assignee: Benjamin Kietzman (Dense|Sparse)UnionBuilder, DictionaryBuilder, Adaptive(U)IntBuilders, and any nested builder which contains one of those may Finish to an array whose type disagrees with what was passed to MakeBuilder. This is not well documented or supported; ListBuilder checks if its child has changed type but StructBuilder does not. Furthermore, ListBuilder's check does not catch modifications to a DictionaryBuilder's type and results in an invalid array on Finish: https://github.com/apache/arrow/blob/1bcfbe1/cpp/src/arrow/array-dict-test.cc#L951-L994 Let's add to the ArrayBuilder contract: the type property is null iff that builder's type is indeterminate until Finish() is called. Then all nested builders can check this on their children at construction and bubble type mutability correctly -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5934) [Python] Bundle arrow's LICENSE with the wheels
Krisztian Szucs created ARROW-5934: -- Summary: [Python] Bundle arrow's LICENSE with the wheels Key: ARROW-5934 URL: https://issues.apache.org/jira/browse/ARROW-5934 Project: Apache Arrow Issue Type: Task Components: Python Reporter: Krisztian Szucs Fix For: 1.0.0, 0.14.1 Guide to bundle LICENSE files with the wheels: https://wheel.readthedocs.io/en/stable/user_guide.html#including-license-files-in-the-generated-wheel-file We also need to ensure, that all thirdparty dependencies' license are attached to it, especially because we're statically linking multiple 3rdparty dependencies, and for example uriparser is missing from the LICENSE file. cc [~wesmckinn] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [Discuss] Are Union.typeIds worth keeping?
Thanks all, this is helpful and I've added https://issues.apache.org/jira/browse/ARROW-5933 to improve the documentation for future developers. On Wed, Jul 10, 2019 at 11:09 PM Jacques Nadeau wrote: > I was also supportive of this pattern. We definitely have used it before to > optimize in certain cases. > > On Wed, Jul 10, 2019, 2:40 PM Wes McKinney wrote: > > > On Wed, Jul 10, 2019 at 3:57 PM Ben Kietzman > > wrote: > > > > > > In this scenario option A (include child arrays for each child type, > even > > > if that type is not observed) seems like the clearly correct choice to > > me. > > > It yields a more intuitive layout for the union array and incurs no > > runtime > > > overhead (since the absent children are empty/null arrays). > > > > I am not sure this is right. The child arrays still occupy memory in > > the Sparse Union case (where all child arrays have the same length). > > In order to satisfy the requirements of the IPC protocol, the child > > arrays need to be of the same type as the types in the union. In the > > Dense Union case, the not-present children will have length 0. > > > > > > > > > why not allow them to be flexible in this regard? > > > > > > I would say that if code doesn't add anything except cognitive overhead > > > then it's worthwhile to remove it. > > > > The cognitive overhead comes for the Arrow library implementer -- > > users of the libraries aren't required to deal with this detail > > necessarily. The type ids are optional, after all. Even if it is > > removed, you still have ids, so whether it's > > > > type 0, id=0 > > type 1, id=1 > > type 2, id=2 > > > > or > > > > type 0, id=3 > > type 1, id=7 > > type 2, id=10 > > > > the difference is in the second case, you have to look up the code > > corresponding to each type rather than assuming that the type's > > position and its code are the same. 
> > > > In processing, branching should occur at the Type level, so a function > > to process a child looks like > > > > ProcessChild(child, child_id, ...) > > > > In either case you have to match a child with its id that appears in the > > data. > > > > Anyway, since Julien and I are responsible for introducing this > > concept in the early stages of the project I'm interested to hear more > > from others. Note that this doesn't serve to resolve the > > Union-of-Nested-Types problem that has prevented the development of > > integration tests between Java and C++. > > > > > > > > On Wed, Jul 10, 2019 at 2:51 PM Wes McKinney > > wrote: > > > > > > > hi Ben, > > > > > > > > Some applications use static type ids for various data types. Let's > > > > consider one possibility: > > > > > > > > BOOLEAN: 0 > > > > INT32: 1 > > > > DOUBLE: 2 > > > > STRING (UTF8): 3 > > > > > > > > If you were parsing JSON and constructing unions while parsing, you > > > > might encounter some types, but not all. So if we _don't_ have the > > > > option of having type ids in the metadata then we are left with some > > > > unsatisfactory options: > > > > > > > > A: Include all types in the resulting union, even if they are > > unobserved, > > > > or > > > > B: Assign type id dynamically to types when they are observed > > > > > > > > Option B: is potentially bad because it does not parallelize across > > > > threads or nodes. > > > > > > > > So I do think the feature is useful. It does make the implementations > > > > of unions more complex, though, so it does not come without cost. But > > > > unions are already the most complex tool we have in our nested data > > > > toolbox, so why not allow them to be flexible in this regard? > > > > > > > > In any case I'm -0 on making changes, but would be interested in > > > > feedback of others if there is strong sentiment about deprecating the > > > > feature. 
> > > > > > > > - Wes > > > > > > > > On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman < > ben.kietz...@rstudio.com > > > > > > > wrote: > > > > > > > > > > The Union.typeIds property is confusing and its utility is unclear. > > I'd > > > > > like to remove it (or at least document it better). Unless anyone > > knows a > > > > > real advantage for keeping it I plan to assemble a PR to drop it > > from the > > > > > format and the C++ implementation. > > > > > > > > > > ARROW-257 ( resolved by pull request > > > > > https://github.com/apache/arrow/pull/143 ) extended Unions with an > > > > optional > > > > > typeIds property (in the C++ implementation, this is > > > > > UnionType::type_codes). Prior to that pull request each element > > (int8) in > > > > > the type_ids (second) buffer of a union array was the index of a > > child > > > > > array. Thus a type_ids buffer beginning with 5 indicated that the > > union > > > > > array began with a value from child_data[5]. After that change to > > > > interpret > > > > > a type_id of 5 one must look through the typeIds property and the > > index > > > > at > > > > > which a 5 is found is the index of the corresponding child array. > > > > > > > > > > The
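[Editorial note: the indirection Wes describes — resolving a child by a non-contiguous type id rather than assuming position == id — can be sketched in plain Python. A conceptual sketch only, not Arrow's implementation; the child names are made up.]

```python
# children of a hypothetical union, stored by position
children = ["bool_child", "int32_child", "double_child"]

# with an explicit typeIds property, child position != type id,
# so implementations build a lookup table once from the metadata
type_ids = [3, 7, 10]                       # declared in the union metadata
id_to_child = {tid: pos for pos, tid in enumerate(type_ids)}

# the type_ids buffer of a union array stores the ids, not positions
type_ids_buffer = [7, 3, 10, 7]
resolved = [children[id_to_child[t]] for t in type_ids_buffer]
assert resolved == ["int32_child", "bool_child", "double_child", "int32_child"]
```

Without the typeIds property the lookup table collapses to the identity mapping (id 0 -> child 0, id 1 -> child 1, ...), which is exactly the simplification Ben is proposing and the flexibility Wes wants to preserve.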
[jira] [Created] (ARROW-5933) [C++] [Documentation] add discussion of Union.typeIds to Layout.rst
Benjamin Kietzman created ARROW-5933: Summary: [C++] [Documentation] add discussion of Union.typeIds to Layout.rst Key: ARROW-5933 URL: https://issues.apache.org/jira/browse/ARROW-5933 Project: Apache Arrow Issue Type: Bug Components: C++, Documentation Reporter: Benjamin Kietzman Assignee: Benjamin Kietzman Union.typeIds is poorly documented and the corresponding property in UnionType is confusingly named type_codes. In particular, Layout.rst doesn't include an explanation of Union.typeIds and implies that an element of a union array's type_ids buffer is always the index of a child array. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems
Thanks for collecting them! We should also run the packaging tasks on them before cutting RC0. On Fri, Jul 12, 2019 at 8:28 PM Wes McKinney wrote: > I updated https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014 > to include all the cited patches, as well as the Parquet forward > compatibility fix. > > I'm waiting on CI to be able to pass ARROW-5921 (fuzzing-discovered > IPC crash) and the ARROW-5889 (Parquet backwards compatibility with > 0.13) needs to be rebased > > https://github.com/apache/arrow/pull/4856 > > I think those are the last 2 patches that should go into the branch > unless something else comes up. Once those land I'll update the > commands and then push up the patch release branch (hopefully > everything will cherry pick cleanly) > > On Fri, Jul 12, 2019 at 12:34 PM Francois Saint-Jacques > wrote: > > > > There's also ARROW-5921 (I tagged it 0.14.1) if it passes travis. This > > one fixes a segfault found via fuzzing. > > > > François > > > > On Fri, Jul 12, 2019 at 6:54 AM Krisztián Szűcs > > wrote: > > > > > > PRs touching the wheel packaging scripts: > > > - https://github.com/apache/arrow/pull/4828 (lz4) > > > - https://github.com/apache/arrow/pull/4833 (uriparser - only if > > > > https://github.com/apache/arrow/commit/88fcb096c4f24861bc7f8181cba1ad8be0e4048a > > > is cherry picked as well) > > > - https://github.com/apache/arrow/pull/4834 (zlib) > > > > > > On Fri, Jul 12, 2019 at 11:49 AM Hatem Helal > wrote: > > > > > > > Thanks François, I closed PARQUET-1623 this morning. It would be > nice to > > > > include the PR in the patch release: > > > > > > > > https://github.com/apache/arrow/pull/4857 > > > > > > > > This bug has been around for a few releases but I think it should be > a low > > > > risk change to include. 
> > > > > > > > Hatem > > > > > > > > > > > > On 7/12/19, 2:27 AM, "Francois Saint-Jacques" < > fsaintjacq...@gmail.com> > > > > wrote: > > > > > > > > I just merged PARQUET-1623, I think it's worth inserting since it > > > > fixes an invalid memory write. Note that I couldn't > resolve/close the > > > > parquet issue, do I have to be contributor to the project? > > > > > > > > François > > > > > > > > On Thu, Jul 11, 2019 at 6:10 PM Wes McKinney < > wesmck...@gmail.com> > > > > wrote: > > > > > > > > > > I just merged Eric's 2nd patch ARROW-5908 and I went through > all the > > > > > patches since the release commit and have come up with the > following > > > > > list of 32 fix-only patches to pick into a maintenance branch: > > > > > > > > > > https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014 > > > > > > > > > > Note there's still unresolved Parquet forward/backward > compatibility > > > > > issues in C++ that we haven't merged patches for yet, so that > is > > > > > pending. > > > > > > > > > > Are there any other patches / JIRA issues people would like to > see > > > > > resolved in a patch release? > > > > > > > > > > Thanks > > > > > > > > > > On Thu, Jul 11, 2019 at 3:03 PM Wes McKinney < > wesmck...@gmail.com> > > > > wrote: > > > > > > > > > > > > Eric -- you are free to set the Fix Version prior to the > patch > > > > being merged > > > > > > > > > > > > On Thu, Jul 11, 2019 at 3:01 PM Eric Erhardt > > > > > > wrote: > > > > > > > > > > > > > > The two C# fixes I'd like in the 0.14.1 release are: > > > > > > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-5887 - already > > > > marked with 0.14.1 fix version. > > > > > > > https://issues.apache.org/jira/browse/ARROW-5908 - hasn't > been > > > > resolved yet. The PR https://github.com/apache/arrow/pull/4851 has > one > > > > approver and the Rust failure doesn't appear to be caused by my > change. 
> > > > > > > > > > > > > > I assume I shouldn't mark ARROW-5908 with a 0.14.1 fix > version > > > > until the PR has been merged. > > > > > > > > > > > > > > -----Original Message----- > > > > > > > From: Neal Richardson > > > > > > > Sent: Thursday, July 11, 2019 11:59 AM > > > > > > > To: dev@arrow.apache.org > > > > > > > Subject: Re: [DISCUSS] Need for 0.14.1 release due to > Python > > > > package problems, Parquet forward compatibility problems > > > > > > > > > > > > > > I just moved > > > > > https://issues.apache.org/jira/browse/ARROW-5850 > > > > from 1.0.0 to 0.14.1. > > > > > > > > > > > > > > On Thu, Jul 11, 2019 at 8:12 AM Wes McKinney < > > > > wesmck...@gmail.com> wrote: > > > > > > > > > > > > > > > To limit uncertainty, I'm going to start preparing a > 0.14.1 > > > > patch > > > > > > > > release branch.
Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems
I updated https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014 to include all the cited patches, as well as the Parquet forward compatibility fix. I'm waiting on CI to be able to pass ARROW-5921 (fuzzing-discovered IPC crash) and the ARROW-5889 (Parquet backwards compatibility with 0.13) needs to be rebased https://github.com/apache/arrow/pull/4856 I think those are the last 2 patches that should go into the branch unless something else comes up. Once those land I'll update the commands and then push up the patch release branch (hopefully everything will cherry pick cleanly) On Fri, Jul 12, 2019 at 12:34 PM Francois Saint-Jacques wrote: > > There's also ARROW-5921 (I tagged it 0.14.1) if it passes travis. This > one fixes a segfault found via fuzzing. > > François > > On Fri, Jul 12, 2019 at 6:54 AM Krisztián Szűcs > wrote: > > > > PRs touching the wheel packaging scripts: > > - https://github.com/apache/arrow/pull/4828 (lz4) > > - https://github.com/apache/arrow/pull/4833 (uriparser - only if > > https://github.com/apache/arrow/commit/88fcb096c4f24861bc7f8181cba1ad8be0e4048a > > is cherry picked as well) > > - https://github.com/apache/arrow/pull/4834 (zlib) > > > > On Fri, Jul 12, 2019 at 11:49 AM Hatem Helal wrote: > > > > > Thanks François, I closed PARQUET-1623 this morning. It would be nice to > > > include the PR in the patch release: > > > > > > https://github.com/apache/arrow/pull/4857 > > > > > > This bug has been around for a few releases but I think it should be a low > > > risk change to include. > > > > > > Hatem > > > > > > > > > On 7/12/19, 2:27 AM, "Francois Saint-Jacques" > > > wrote: > > > > > > I just merged PARQUET-1623, I think it's worth inserting since it > > > fixes an invalid memory write. Note that I couldn't resolve/close the > > > parquet issue, do I have to be contributor to the project? 
> > > > > > François > > > > > > On Thu, Jul 11, 2019 at 6:10 PM Wes McKinney > > > wrote: > > > > > > > > I just merged Eric's 2nd patch ARROW-5908 and I went through all the > > > > patches since the release commit and have come up with the following > > > > list of 32 fix-only patches to pick into a maintenance branch: > > > > > > > > https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014 > > > > > > > > Note there's still unresolved Parquet forward/backward compatibility > > > > issues in C++ that we haven't merged patches for yet, so that is > > > > pending. > > > > > > > > Are there any other patches / JIRA issues people would like to see > > > > resolved in a patch release? > > > > > > > > Thanks > > > > > > > > On Thu, Jul 11, 2019 at 3:03 PM Wes McKinney > > > wrote: > > > > > > > > > > Eric -- you are free to set the Fix Version prior to the patch > > > being merged > > > > > > > > > > On Thu, Jul 11, 2019 at 3:01 PM Eric Erhardt > > > > > wrote: > > > > > > > > > > > > The two C# fixes I'd like in the 0.14.1 release are: > > > > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-5887 - already > > > marked with 0.14.1 fix version. > > > > > > https://issues.apache.org/jira/browse/ARROW-5908 - hasn't been > > > resolved yet. The PR https://github.com/apache/arrow/pull/4851 has one > > > approver and the Rust failure doesn't appear to be caused by my change. > > > > > > > > > > > > I assume I shouldn't mark ARROW-5908 with a 0.14.1 fix version > > > until the PR has been merged. 
> > > > > > > > > > > > -Original Message- > > > > > > From: Neal Richardson > > > > > > Sent: Thursday, July 11, 2019 11:59 AM > > > > > > To: dev@arrow.apache.org > > > > > > Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python > > > package problems, Parquet forward compatibility problems > > > > > > > > > > > > I just moved > > > https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FARROW-5850data=02%7C01%7CEric.Erhardt%40microsoft.com%7C244c0dd319dd4ea18a5508d7062125de%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636984611747771373sdata=B6xfFcBu4Iz0jJE5tUXkKvoJx36kMCS4UJCdTV7jqGA%3Dreserved=0 > > > from 1.0.0 to 0.14.1. > > > > > > > > > > > > On Thu, Jul 11, 2019 at 8:12 AM Wes McKinney < > > > wesmck...@gmail.com> wrote: > > > > > > > > > > > > > To limit uncertainty, I'm going to start preparing a 0.14.1 > > > patch > > > > > > > release branch. I will update the list with the patches that > > > are being > > > > > > > cherry-picked. If other folks could give me a list of other > > > PRs that > > > > > > > need to be backported I will add them to the list. Any JIRA > > > that needs > > > > > > > to be included should have the "0.14.1" fix version added so > > > we can > > > > > > > keep track > > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:48 PM Joris Van den Bossche > > > > > > >
[jira] [Created] (ARROW-5932) undefined reference to `__cxa_init_primary_exception@CXXABI_1.3.11'
Cong Ding created ARROW-5932: Summary: undefined reference to `__cxa_init_primary_exception@CXXABI_1.3.11' Key: ARROW-5932 URL: https://issues.apache.org/jira/browse/ARROW-5932 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.14.0 Environment: Linux Mint 19.1 Tessa g++-6 Reporter: Cong Ding I was installing Apache Arrow on my Linux Mint 19.1 Tessa server. I followed the instructions on the official Arrow website (using the Ubuntu 18.04 method). However, when I was trying to compile the examples, the g++ compiler threw out some errors. I have updated g++ to g++-6, updated my libstdc++ library, and used the -lstdc++ flag, but it still didn't work. {code:java} // code placeholder g++-6 -std=c++11 -larrow -lparquet main.cpp -lstdc++ {code} The error message: /usr/lib/x86_64-linux-gnu/libarrow.so: undefined reference to `__cxa_init_primary_exception@CXXABI_1.3.11' /usr/lib/x86_64-linux-gnu/libarrow.so: undefined reference to `std::__exception_ptr::exception_ptr::exception_ptr(void*)@CXXABI_1.3.11' collect2: error: ld returned 1 exit status. I do not know what to do at this moment. Can anyone help me? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
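For readers hitting the same link error: the underlying question is which CXXABI version nodes the installed libstdc++ actually exports (CXXABI_1.3.11 was introduced with GCC 7's libstdc++, so an older runtime cannot satisfy it). A rough diagnostic sketch follows; the helper name and library path are illustrative, not taken from the bug report:

```shell
# Hypothetical helper: list the CXXABI version nodes a shared library exports.
# If CXXABI_1.3.11 is absent, the runtime libstdc++ predates GCC 7 and is too
# old for the prebuilt libarrow.so.
check_cxxabi() {
    strings "$1" | grep -o 'CXXABI_[0-9.]*' | sort -u
}

# Typical location on Ubuntu/Mint (an assumption -- adjust for your system):
# check_cxxabi /usr/lib/x86_64-linux-gnu/libstdc++.so.6
```

Note that installing g++-6 does not help here; it is the runtime libstdc++.so.6 on the link path that must carry the newer version node.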
[jira] [Created] (ARROW-5931) [C++] Extend extension types facility to provide for serialization and deserialization in IPC roundtrips
Wes McKinney created ARROW-5931: --- Summary: [C++] Extend extension types facility to provide for serialization and deserialization in IPC roundtrips Key: ARROW-5931 URL: https://issues.apache.org/jira/browse/ARROW-5931 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 A use case here is when an array needs to reference some external data. For example, suppose that we wanted to implement an array that references a sequence of Python objects as {{PyObject**}}. Obviously, a {{PyObject*}} must be managed by the Python interpreter. For a vector of some {{T*}} to be sent through the IPC machinery, it must be embedded in some Arrow type on the wire. For example, the memory-resident version of {{PyObject**}} might be 8 bytes per value (1 pointer per value), while for serialization to the binary IPC protocol such {{PyObject*}} values must be serialized into an Arrow Binary type. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
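The distinction described above can be sketched in plain Python (illustrative only, not Arrow's actual API): a vector of process-local pointers is meaningless on the wire, so the serialized form has to carry the objects themselves as a variable-length binary column.

```python
import pickle

# In-memory form: 8-byte pointers (here, CPython object ids), which are
# only meaningful inside this process's address space.
objs = ["foo", 42, (1, 2)]
pointer_vector = [id(o) for o in objs]

# Wire form: one binary cell per object, holding the serialized object.
# (pickle stands in for whatever serialization the extension type defines.)
binary_column = [pickle.dumps(o) for o in objs]

# The receiving process reconstructs the objects from the binary column;
# the original pointer values are never transmitted.
restored = [pickle.loads(b) for b in binary_column]
assert restored == objs
```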
Re: IPC Tensor + Indices
hi Razvan -- can you clarify what "together with a row and a column index" means? On Fri, Jul 12, 2019 at 11:17 AM Razvan Chitu wrote: > > Hi, > > Does the IPC format currently support streaming a tensor together with a > row and a column index? If not, are there any plans for this to be > supported? It'd be quite a useful for matrices that could have 10s of > thousands of either rows, columns or both. For my use case I am currently > representing matrices as record batches, but performance is not that great > when there are many columns and few rows. > > Thanks, > Razvan
Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems
There's also ARROW-5921 (I tagged it 0.14.1) if it passes travis. This one fixes a segfault found via fuzzing. François On Fri, Jul 12, 2019 at 6:54 AM Krisztián Szűcs wrote: > > PRs touching the wheel packaging scripts: > - https://github.com/apache/arrow/pull/4828 (lz4) > - https://github.com/apache/arrow/pull/4833 (uriparser - only if > https://github.com/apache/arrow/commit/88fcb096c4f24861bc7f8181cba1ad8be0e4048a > is cherry picked as well) > - https://github.com/apache/arrow/pull/4834 (zlib) > > On Fri, Jul 12, 2019 at 11:49 AM Hatem Helal wrote: > > > Thanks François, I closed PARQUET-1623 this morning. It would be nice to > > include the PR in the patch release: > > > > https://github.com/apache/arrow/pull/4857 > > > > This bug has been around for a few releases but I think it should be a low > > risk change to include. > > > > Hatem > > > > > > On 7/12/19, 2:27 AM, "Francois Saint-Jacques" > > wrote: > > > > I just merged PARQUET-1623, I think it's worth inserting since it > > fixes an invalid memory write. Note that I couldn't resolve/close the > > parquet issue, do I have to be contributor to the project? > > > > François > > > > On Thu, Jul 11, 2019 at 6:10 PM Wes McKinney > > wrote: > > > > > > I just merged Eric's 2nd patch ARROW-5908 and I went through all the > > > patches since the release commit and have come up with the following > > > list of 32 fix-only patches to pick into a maintenance branch: > > > > > > https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014 > > > > > > Note there's still unresolved Parquet forward/backward compatibility > > > issues in C++ that we haven't merged patches for yet, so that is > > > pending. > > > > > > Are there any other patches / JIRA issues people would like to see > > > resolved in a patch release? 
> > > > > > Thanks > > > > > > On Thu, Jul 11, 2019 at 3:03 PM Wes McKinney > > wrote: > > > > > > > > Eric -- you are free to set the Fix Version prior to the patch > > being merged > > > > > > > > On Thu, Jul 11, 2019 at 3:01 PM Eric Erhardt > > > > wrote: > > > > > > > > > > The two C# fixes I'd like in the 0.14.1 release are: > > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-5887 - already > > marked with 0.14.1 fix version. > > > > > https://issues.apache.org/jira/browse/ARROW-5908 - hasn't been > > resolved yet. The PR https://github.com/apache/arrow/pull/4851 has one > > approver and the Rust failure doesn't appear to be caused by my change. > > > > > > > > > > I assume I shouldn't mark ARROW-5908 with a 0.14.1 fix version > > until the PR has been merged. > > > > > > > > > > -Original Message- > > > > > From: Neal Richardson > > > > > Sent: Thursday, July 11, 2019 11:59 AM > > > > > To: dev@arrow.apache.org > > > > > Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python > > package problems, Parquet forward compatibility problems > > > > > > > > > > I just moved > > https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FARROW-5850data=02%7C01%7CEric.Erhardt%40microsoft.com%7C244c0dd319dd4ea18a5508d7062125de%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636984611747771373sdata=B6xfFcBu4Iz0jJE5tUXkKvoJx36kMCS4UJCdTV7jqGA%3Dreserved=0 > > from 1.0.0 to 0.14.1. > > > > > > > > > > On Thu, Jul 11, 2019 at 8:12 AM Wes McKinney < > > wesmck...@gmail.com> wrote: > > > > > > > > > > > To limit uncertainty, I'm going to start preparing a 0.14.1 > > patch > > > > > > release branch. I will update the list with the patches that > > are being > > > > > > cherry-picked. If other folks could give me a list of other > > PRs that > > > > > > need to be backported I will add them to the list. 
Any JIRA > > that needs > > > > > > to be included should have the "0.14.1" fix version added so > > we can > > > > > > keep track > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:48 PM Joris Van den Bossche > > > > > > wrote: > > > > > > > > > > > > > > I personally prefer 0.14.1 over 0.15.0. I think that is > > clearer in > > > > > > > communication, as we are fixing regressions of the 0.14.0 > > release. > > > > > > > > > > > > > > (but I haven't been involved much in releases, so certainly > > no > > > > > > > strong > > > > > > > opinion) > > > > > > > > > > > > > > Joris > > > > > > > > > > > > > > > > > > > > > Op wo 10 jul. 2019 om 15:07 schreef Wes McKinney < > > wesmck...@gmail.com>: > > > > > > > > > > > > > > > hi folks, > > > > > > > > > > > > > > > > Are there any opinions / strong feelings about the two > > options: > > > > > > > > > > > > > > > > * Prepare patch 0.14.1 release from a maintenance branch > > > > > > > > * Release 0.15.0 out of master > > > > > > > > > > >
[jira] [Created] (ARROW-5930) [FlightRPC] [Python] Flight CI tests are failing
lidavidm created ARROW-5930: --- Summary: [FlightRPC] [Python] Flight CI tests are failing Key: ARROW-5930 URL: https://issues.apache.org/jira/browse/ARROW-5930 Project: Apache Arrow Issue Type: Bug Components: FlightRPC, Python Affects Versions: 0.14.0 Reporter: lidavidm Flight tests segfault on Travis: [https://travis-ci.org/apache/arrow/jobs/557690959] The relevant part is: {noformat} Fatal Python error: Aborted Thread 0x7fcf009fe700 (most recent call first): File "/home/travis/build/apache/arrow/python/pyarrow/tests/test_flight.py", line 386 in _server_thread File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/threading.py", line 864 in run File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/threading.py", line 916 in _bootstrap_inner File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/threading.py", line 884 in _bootstrap Current thread 0x7fcf1f9fa700 (most recent call first): File "/home/travis/build/apache/arrow/python/pyarrow/tests/test_flight.py", line 411 in flight_server File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/contextlib.py", line 99 in __exit__ File "/home/travis/build/apache/arrow/python/pyarrow/tests/test_flight.py", line 670 in test_tls_do_get File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/python.py", line 165 in pytest_pyfunc_call File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/manager.py", line 81 in File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/manager.py", line 87 in _hookexec File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/hooks.py", line 289 in __call__ File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/python.py", line 1451 in 
runtest File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/runner.py", line 117 in pytest_runtest_call File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/manager.py", line 81 in File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/manager.py", line 87 in _hookexec File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/hooks.py", line 289 in __call__ File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/runner.py", line 192 in File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/runner.py", line 220 in from_call File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/runner.py", line 192 in call_runtest_hook File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/runner.py", line 167 in call_and_report File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/runner.py", line 87 in runtestprotocol File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/runner.py", line 72 in pytest_runtest_protocol File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/manager.py", line 81 in File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/manager.py", line 87 in _hookexec File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/hooks.py", line 289 in __call__ File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/main.py", line 278 in 
pytest_runtestloop File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/manager.py", line 81 in File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/manager.py", line 87 in _hookexec File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/hooks.py", line 289 in __call__ File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/main.py", line 257 in _main File "/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/main.py",
IPC Tensor + Indices
Hi, Does the IPC format currently support streaming a tensor together with a row and a column index? If not, are there any plans for this to be supported? It'd be quite useful for matrices that could have tens of thousands of rows, columns, or both. For my use case I am currently representing matrices as record batches, but performance is not that great when there are many columns and few rows. Thanks, Razvan
Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors
hi Liya -- yes, it seems reasonable to defer the conversion from your pointer-based extension representation to a proper VarCharVector until you need to send over IPC. Note that there is no mechanism yet in Java with extension types to cause a conversion to take place when the IPC step is reached. I just opened https://issues.apache.org/jira/browse/ARROW-5929 to try to explain this issue. Let me know if it is not clear. I'm interested in experimenting with the same thing in C++. We would have an ExtensionArray in C++ whose values are string_view referencing external memory, for example. - Wes On Thu, Jul 11, 2019 at 10:16 PM Fan Liya wrote: > > @Wes McKinney, > > Thanks a lot for the brainstorming. I think your ideas are reasonable and > feasible. > About IPC, my idea is that we can send the vector as a PointerStringVector, > and receive it as a VarCharVector, so that the overhead of memory > compaction can be hidden. > What do you think? > > Best, > Liya Fan > > On Fri, Jul 12, 2019 at 11:07 AM Fan Liya wrote: > > > @Uwe L. Korn > > > > Thanks a lot for the suggestion. I think this is exactly what we are doing > > right now. > > > > Best, > > Liya Fan > > > > On Thu, Jul 11, 2019 at 9:44 PM Wes McKinney wrote: > > > >> hi Liya -- have you thought about implementing this as an > >> ExtensionType / ExtensionVector? You actually can already do this, so > >> if this helps you reference strings stored in some external memory > >> then that seems reasonable. Such a PointerStringVector could have a > >> method that converts it into the Arrow varbinary columnar > >> representation. > >> > >> You wouldn't be able to put such an object into the IPC binary > >> protocol, though. If that's a requirement (being able to use the IPC > >> protocol) for this kind of data, before going any further in the > >> discussion I would suggest that you work out exactly how such data > >> would be moved from one process address space to another (using > >> Buffers). 
> >> > >> - Wes > >> > >> On Thu, Jul 11, 2019 at 7:35 AM Uwe L. Korn wrote: > >> > > >> > Hello Liya Fan, > >> > > >> > here your best approach is to copy into the Arrow format as you can > >> then use this as the basis for working with the Arrow-native representation > >> as well as your internal representation. You will have to use two different > >> offset vector as those two will always differ but in the case of your > >> internal representation, you don't have the requirement of consecutive data > >> as Arrow has but you can still work with the strings just as before even > >> when stored consecutively. > >> > > >> > Uwe > >> > > >> > On Thu, Jul 11, 2019, at 2:24 PM, Fan Liya wrote: > >> > > Hi Korn, > >> > > > >> > > Thanks a lot for your comments. > >> > > > >> > > In my opinion, your comments make sense to me. Allowing > >> non-consecutive > >> > > memory segments will break some good design choices of Arrow. > >> > > However, there are wide-spread user requirements for non-consecutive > >> memory > >> > > segments. I am wondering how can we help such users. What advice we > >> can > >> > > give to them? > >> > > > >> > > Memory copy/move can be a solution, but is there a better solution? > >> > > Is there a third alternative? Can we virtualize the non-consecutive > >> memory > >> > > segments into a consecutive one? (Although performance overhead is > >> > > unavoidable.) > >> > > > >> > > What do you think? Let's brain-storm it. > >> > > > >> > > Best, > >> > > Liya Fan > >> > > > >> > > > >> > > On Thu, Jul 11, 2019 at 8:05 PM Uwe L. Korn wrote: > >> > > > >> > > > Hello Liya, > >> > > > > >> > > > I'm quite -1 on this type as Arrow is about efficient columnar > >> structures. > >> > > > We have opened the standard also to matrix-like types but always > >> keep the > >> > > > constraint of consecutive memory. 
Now also adding types where > >> memory is no > >> > > > longer consecutive but spread in the heap will make the scope of the > >> > > > project much wider (It seems that we then just turn into a general > >> > > > serialization framework). > >> > > > > >> > > > One of the ideas of a common standard is that some need to make > >> > > > compromises. I think in this case it is a necessary compromise to > >> not allow > >> > > > all kind of string representations. > >> > > > > >> > > > Uwe > >> > > > > >> > > > On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote: > >> > > > > Hi all, > >> > > > > > >> > > > > > >> > > > > We are thinking of providing varchar/varbinary vectors with a > >> different > >> > > > > memory layout which exists in a wide range of systems. The memory > >> layout > >> > > > is > >> > > > > different from that of VarCharVector in the following ways: > >> > > > > > >> > > > > > >> > > > >1. > >> > > > > > >> > > > >Instead of storing (start offset, end offset), the new layout > >> stores > >> > > > >(start offset, length) > >> > > > >2. > >> > > > > > >> > > > >The content of varchars may not be in a
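The layout difference driving this thread can be sketched in plain Python (illustrative, not the Java API): Arrow's VarCharVector keeps a single offsets array of length n+1 over one consecutive data buffer, while the proposed layout stores an (offset, length) pair per value, so values need not be adjacent or even ordered within the buffer.

```python
# Arrow-style layout: one consecutive buffer plus n+1 offsets;
# value i is buf[offsets[i]:offsets[i+1]].
buf = b"applebananacherry"
offsets = [0, 5, 11, 17]
arrow_values = [buf[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]

# Proposed layout: an (offset, length) pair per value. The same three
# strings can live in the buffer in any order, with possible gaps.
alt_buf = b"bananacherryapple"
pairs = [(12, 5), (0, 6), (6, 6)]  # points at "apple", "banana", "cherry"
alt_values = [alt_buf[off:off + ln] for off, ln in pairs]

assert arrow_values == alt_values == [b"apple", b"banana", b"cherry"]
```

This is what makes the (offset, length) form attractive for zero-copy views over external memory, and also why it breaks Arrow's consecutive-buffer assumption until a compaction pass rewrites it into the offsets form.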
[jira] [Created] (ARROW-5929) [Java] Define API for ExtensionVector whose data must be serialized prior to being sent via IPC
Wes McKinney created ARROW-5929: --- Summary: [Java] Define API for ExtensionVector whose data must be serialized prior to being sent via IPC Key: ARROW-5929 URL: https://issues.apache.org/jira/browse/ARROW-5929 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Wes McKinney As being discussed on the mailing list, a possible use case for ExtensionVector involves having the Arrow buffers contain pointer-type values referring to memory outside of the Arrow memory heap. In IPC, such vectors would need to be serialized to a wholly Arrow-resident form, such as a VarBinaryVector. We do not have an API to allow for this, so this JIRA proposes to add new functions that can indicate to the IPC layer that an ExtensionVector requires additional serialization to a native Arrow type (in such case, the extension type metadata would be discarded) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5928) [JS] Test fuzzer inputs
Wes McKinney created ARROW-5928: --- Summary: [JS] Test fuzzer inputs Key: ARROW-5928 URL: https://issues.apache.org/jira/browse/ARROW-5928 Project: Apache Arrow Issue Type: Improvement Components: JavaScript Reporter: Wes McKinney Fix For: 1.0.0 We are developing a fuzzer-based corpus of malformed IPC inputs https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc The JavaScript implementation should also test against these to verify that the correct kind of exception is raised -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5927) [Go] Test fuzzer inputs
Wes McKinney created ARROW-5927: --- Summary: [Go] Test fuzzer inputs Key: ARROW-5927 URL: https://issues.apache.org/jira/browse/ARROW-5927 Project: Apache Arrow Issue Type: Improvement Components: Go Reporter: Wes McKinney Fix For: 1.0.0 We are developing a fuzzer-based corpus of malformed IPC inputs https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc The Go implementation should also test against these to verify that the correct kind of exception is raised -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5926) [Java] Test fuzzer inputs
Wes McKinney created ARROW-5926: --- Summary: [Java] Test fuzzer inputs Key: ARROW-5926 URL: https://issues.apache.org/jira/browse/ARROW-5926 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Wes McKinney Fix For: 1.0.0 We are developing a fuzzer-based corpus of malformed IPC inputs https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc The Java implementation should also test against these to verify that the correct kind of exception is raised -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5925) [Gandiva][C++] cast decimal to int should round up
Pindikura Ravindra created ARROW-5925: - Summary: [Gandiva][C++] cast decimal to int should round up Key: ARROW-5925 URL: https://issues.apache.org/jira/browse/ARROW-5925 Project: Apache Arrow Issue Type: Bug Reporter: Pindikura Ravindra Assignee: Pindikura Ravindra -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5924) [C++][Plasma] It is not convenient to release a GPU object
shengjun.li created ARROW-5924: -- Summary: [C++][Plasma] It is not convenient to release a GPU object Key: ARROW-5924 URL: https://issues.apache.org/jira/browse/ARROW-5924 Project: Apache Arrow Issue Type: Improvement Components: C++ - Plasma Affects Versions: 0.14.0 Reporter: shengjun.li Fix For: 0.14.1 cmake_modules/DefineOptions.cmake define_option(ARROW_CUDA "Build the Arrow CUDA extensions (requires CUDA toolkit)" ON) define_option(ARROW_PLASMA "Build the plasma object store along with Arrow" ON) The correct sequence is as follows: (1) plasma_client.Create(object_id, size, nullptr, 0, , 1); // where device_num > 0 (2) plasma_client.Seal(object_id); (3) buff = nullptr; (4) plasma_client.Release(object_id); (5) plasma_client.Delete(object_id); buff must be set to nullptr (step 3) just before releasing the object (step 4), because CloseIpcBuffer is called in its destructor (class CudaBuffer). If a user does not do that promptly, CloseIpcBuffer will be blocked. Then, the following error may occur when another object is created: IOError: Cuda Driver API call in /home/zilliz/arrow/cpp/src/arrow/gpu/cuda_context.cc at line 156 failed with code 208: cuIpcOpenMemHandle(, *handle, CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS) (nil) To prevent the risk, we can call CloseIpcBuffer manually. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5923) [C++] Fix int96 comment
Francois Saint-Jacques created ARROW-5923: - Summary: [C++] Fix int96 comment Key: ARROW-5923 URL: https://issues.apache.org/jira/browse/ARROW-5923 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Francois Saint-Jacques Assignee: Micah Kornfield -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5922) Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow's hdfs API
Saurabh Bajaj created ARROW-5922: Summary: Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow's hdfs API Key: ARROW-5922 URL: https://issues.apache.org/jira/browse/ARROW-5922 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.14.0 Environment: Unix Reporter: Saurabh Bajaj Fix For: 0.14.0 Here's what I'm trying: {code:python} import pyarrow as pa conf = {"hadoop.security.authentication": "kerberos"} fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf) {code} However, when I submit this job to the cluster using Dask-YARN, I get the following error: {code} File "test/run.py", line 3 fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf) File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 211, in connect File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 38, in __init__ File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status pyarrow.lib.ArrowIOError: HDFS connection failed {code} I also tried setting host (to a name node) and port (=8020), however I run into the same error. Since the error is not descriptive, I'm not sure which setting needs to be altered. Any clues anyone? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5921) [C++][Fuzzing] Missing nullptr checks in IPC
Marco Neumann created ARROW-5921: Summary: [C++][Fuzzing] Missing nullptr checks in IPC Key: ARROW-5921 URL: https://issues.apache.org/jira/browse/ARROW-5921 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.14.0 Reporter: Marco Neumann Assignee: Marco Neumann Attachments: crash-09f72ba2a52b80366ab676364abec850fc668168, crash-607e9caa76863a97f2694a769a1ae2fb83c55e02, crash-cb8cedb6ff8a6f164210c497d91069812ef5d6f8, crash-f37e71777ad0324b55b99224f2c7ffb0107bdfa2, crash-fd237566879dc60fff4d956d5fe3533d74a367f3 {{arrow-ipc-fuzzing-test}} found the attached attached crashes. Reproduce with {code} arrow-ipc-fuzzing-test crash-xxx {code} The attached crashes have all distinct sources and are all related with missing nullptr checks. I have a fix basically ready. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems
PRs touching the wheel packaging scripts: - https://github.com/apache/arrow/pull/4828 (lz4) - https://github.com/apache/arrow/pull/4833 (uriparser - only if https://github.com/apache/arrow/commit/88fcb096c4f24861bc7f8181cba1ad8be0e4048a is cherry picked as well) - https://github.com/apache/arrow/pull/4834 (zlib) On Fri, Jul 12, 2019 at 11:49 AM Hatem Helal wrote: > Thanks François, I closed PARQUET-1623 this morning. It would be nice to > include the PR in the patch release: > > https://github.com/apache/arrow/pull/4857 > > This bug has been around for a few releases but I think it should be a low > risk change to include. > > Hatem > > > On 7/12/19, 2:27 AM, "Francois Saint-Jacques" > wrote: > > I just merged PARQUET-1623, I think it's worth inserting since it > fixes an invalid memory write. Note that I couldn't resolve/close the > parquet issue, do I have to be contributor to the project? > > François > > On Thu, Jul 11, 2019 at 6:10 PM Wes McKinney > wrote: > > > > I just merged Eric's 2nd patch ARROW-5908 and I went through all the > > patches since the release commit and have come up with the following > > list of 32 fix-only patches to pick into a maintenance branch: > > > > https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014 > > > > Note there's still unresolved Parquet forward/backward compatibility > > issues in C++ that we haven't merged patches for yet, so that is > > pending. > > > > Are there any other patches / JIRA issues people would like to see > > resolved in a patch release? > > > > Thanks > > > > On Thu, Jul 11, 2019 at 3:03 PM Wes McKinney > wrote: > > > > > > Eric -- you are free to set the Fix Version prior to the patch > being merged > > > > > > On Thu, Jul 11, 2019 at 3:01 PM Eric Erhardt > > > wrote: > > > > > > > > The two C# fixes I'd like in the 0.14.1 release are: > > > > > > > > https://issues.apache.org/jira/browse/ARROW-5887 - already > marked with 0.14.1 fix version. 
> > > > https://issues.apache.org/jira/browse/ARROW-5908 - hasn't been > resolved yet. The PR https://github.com/apache/arrow/pull/4851 has one > approver and the Rust failure doesn't appear to be caused by my change. > > > > > > > > I assume I shouldn't mark ARROW-5908 with a 0.14.1 fix version > until the PR has been merged. > > > > > > > > -Original Message- > > > > From: Neal Richardson > > > > Sent: Thursday, July 11, 2019 11:59 AM > > > > To: dev@arrow.apache.org > > > > Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python > package problems, Parquet forward compatibility problems > > > > > > > > I just moved > https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FARROW-5850data=02%7C01%7CEric.Erhardt%40microsoft.com%7C244c0dd319dd4ea18a5508d7062125de%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636984611747771373sdata=B6xfFcBu4Iz0jJE5tUXkKvoJx36kMCS4UJCdTV7jqGA%3Dreserved=0 > from 1.0.0 to 0.14.1. > > > > > > > > On Thu, Jul 11, 2019 at 8:12 AM Wes McKinney < > wesmck...@gmail.com> wrote: > > > > > > > > > To limit uncertainty, I'm going to start preparing a 0.14.1 > patch > > > > > release branch. I will update the list with the patches that > are being > > > > > cherry-picked. If other folks could give me a list of other > PRs that > > > > > need to be backported I will add them to the list. Any JIRA > that needs > > > > > to be included should have the "0.14.1" fix version added so > we can > > > > > keep track > > > > > > > > > > On Wed, Jul 10, 2019 at 9:48 PM Joris Van den Bossche > > > > > wrote: > > > > > > > > > > > > I personally prefer 0.14.1 over 0.15.0. I think that is > clearer in > > > > > > communication, as we are fixing regressions of the 0.14.0 > release. > > > > > > > > > > > > (but I haven't been involved much in releases, so certainly > no > > > > > > strong > > > > > > opinion) > > > > > > > > > > > > Joris > > > > > > > > > > > > > > > > > > Op wo 10 jul. 
2019 om 15:07 schreef Wes McKinney < > wesmck...@gmail.com>: > > > > > > > > > > > > > hi folks, > > > > > > > > > > > > > > Are there any opinions / strong feelings about the two > options: > > > > > > > > > > > > > > * Prepare patch 0.14.1 release from a maintenance branch > > > > > > > * Release 0.15.0 out of master > > > > > > > > > > > > > > Aside from the Parquet forward compatibility issues we're > still > > > > > > > discussing, and Eric's C# patch PR 4836, are there any > other > > > > > > > issues that need to be fixed before we go down one of > these paths? > > > > > > > > > > > > > > Would anyone like to help with release management? I can > do so if > > > > > > > necessary, but I've already done a lot of release >
Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)
@Antoine Pitrou, Good question. I think the answer depends on the concrete encoding scheme. For some encoding schemes, it is not a good idea to use them for in-memory data compression. For others, it is beneficial to operate directly on the compressed data. For example, it is beneficial to directly work on RLE data, with better locality and fewer cache misses. Best, Liya Fan On Fri, Jul 12, 2019 at 5:24 PM Antoine Pitrou wrote: > > Le 12/07/2019 à 10:08, Micah Kornfield a écrit : > > OK, I've created a separate thread for data integrity/digests [1], and > > retitled this thread to continue the discussion on compression and > > encodings. As a reminder the PR for the format additions [2] suggested a > > new SparseRecordBatch that would allow for the following features: > > 1. Different data encodings at the Array (e.g. RLE) and Buffer levels > > (e.g. narrower bit-width integers) > > 2. Compression at the buffer level > > 3. Eliding all metadata and data for empty columns. > > So the question is whether this really needs to be in the in-memory > format, i.e. is it desired to operate directly on this compressed > format, or is it solely for transport? > > If the latter, I wonder why Parquet cannot simply be used instead of > reinventing something similar but different. > > Regards > > Antoine. >
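Liya Fan's point about RLE can be made concrete. The run-ends variant mentioned earlier in the thread keeps sub-linear random access because a binary search over the run ends locates the run containing any logical index. A minimal Python sketch (illustrative only, not the proposed Arrow layout):

```python
import bisect

def rle_encode(values):
    """Encode a sequence as (run_ends, run_values); run_ends[i] is the
    logical index one past the end of run i."""
    run_ends, run_values = [], []
    for v in values:
        if run_values and run_values[-1] == v:
            run_ends[-1] += 1          # extend the current run
        else:
            run_values.append(v)       # start a new run
            run_ends.append((run_ends[-1] if run_ends else 0) + 1)
    return run_ends, run_values

def rle_get(run_ends, run_values, i):
    """O(log n_runs) random access: binary-search for the run holding i."""
    return run_values[bisect.bisect_right(run_ends, i)]

ends, vals = rle_encode([7, 7, 7, 2, 2, 9])
# ends == [3, 5, 6], vals == [7, 2, 9]
```

Because the runs are stored once, scans over long runs also touch far less memory, which is the locality/cache-miss benefit described above.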
[jira] [Created] (ARROW-5920) [Java] Support sort & compare for all variable width vectors
Liya Fan created ARROW-5920: --- Summary: [Java] Support sort & compare for all variable width vectors Key: ARROW-5920 URL: https://issues.apache.org/jira/browse/ARROW-5920 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan All variable-width vectors can reuse the same comparator for sorting & searching. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
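The reason one comparator can serve every variable-width vector is that they all share the same (offsets buffer, data buffer) layout, so elements can be compared byte-wise without interpreting them. A hypothetical Python sketch of the idea (the Java implementation would operate on ArrowBuf memory instead):

```python
def compare_ranges(data, offsets, i, j):
    """Lexicographic byte comparison of elements i and j of a
    variable-width vector stored as a data buffer plus offsets."""
    a = data[offsets[i]:offsets[i + 1]]
    b = data[offsets[j]:offsets[j + 1]]
    return (a > b) - (a < b)   # -1, 0, or 1, like a classic comparator

# Three elements: "apple", "apple", "banana"
data = b"appleapplebanana"
offsets = [0, 5, 10, 16]
```

The same routine works for VarCharVector, VarBinaryVector, and so on, since only the raw bytes are compared.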
Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems
Thanks François, I closed PARQUET-1623 this morning. It would be nice to include the PR in the patch release: https://github.com/apache/arrow/pull/4857 This bug has been around for a few releases but I think it should be a low-risk change to include. Hatem On 7/12/19, 2:27 AM, "Francois Saint-Jacques" wrote: I just merged PARQUET-1623, I think it's worth including since it fixes an invalid memory write. Note that I couldn't resolve/close the Parquet issue; do I have to be a contributor to the project? François On Thu, Jul 11, 2019 at 6:10 PM Wes McKinney wrote: > > I just merged Eric's 2nd patch ARROW-5908 and I went through all the > patches since the release commit and have come up with the following > list of 32 fix-only patches to pick into a maintenance branch: > > https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014 > > Note there's still unresolved Parquet forward/backward compatibility > issues in C++ that we haven't merged patches for yet, so that is > pending. > > Are there any other patches / JIRA issues people would like to see > resolved in a patch release? > > Thanks > > On Thu, Jul 11, 2019 at 3:03 PM Wes McKinney wrote: > > > > Eric -- you are free to set the Fix Version prior to the patch being merged > > > > On Thu, Jul 11, 2019 at 3:01 PM Eric Erhardt > > wrote: > > > > > > The two C# fixes I'd like in the 0.14.1 release are: > > > > > > https://issues.apache.org/jira/browse/ARROW-5887 - already marked with 0.14.1 fix version. > > > https://issues.apache.org/jira/browse/ARROW-5908 - hasn't been resolved yet. The PR https://github.com/apache/arrow/pull/4851 has one approver and the Rust failure doesn't appear to be caused by my change. > > > > > > I assume I shouldn't mark ARROW-5908 with a 0.14.1 fix version until the PR has been merged.
> > > > > > -----Original Message----- > > > From: Neal Richardson > > > Sent: Thursday, July 11, 2019 11:59 AM > > > To: dev@arrow.apache.org > > > Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems > > > > > > I just moved https://issues.apache.org/jira/browse/ARROW-5850 from 1.0.0 to 0.14.1. > > > > > > On Thu, Jul 11, 2019 at 8:12 AM Wes McKinney wrote: > > > > > > > To limit uncertainty, I'm going to start preparing a 0.14.1 patch > > > > release branch. I will update the list with the patches that are being > > > > cherry-picked. If other folks could give me a list of other PRs that > > > > need to be backported I will add them to the list. Any JIRA that needs > > > > to be included should have the "0.14.1" fix version added so we can > > > > keep track > > > > > > > > On Wed, Jul 10, 2019 at 9:48 PM Joris Van den Bossche > > > > wrote: > > > > > > > > > > I personally prefer 0.14.1 over 0.15.0. I think that is clearer in > > > > > communication, as we are fixing regressions of the 0.14.0 release. > > > > > > > > > > (but I haven't been involved much in releases, so certainly no > > > > > strong > > > > > opinion) > > > > > > > > > > Joris > > > > > > > > > > > > > > > Op wo 10 jul.
2019 om 15:07 schreef Wes McKinney : > > > > > > > > > > > hi folks, > > > > > > > > > > > > Are there any opinions / strong feelings about the two options: > > > > > > > > > > > > * Prepare patch 0.14.1 release from a maintenance branch > > > > > > * Release 0.15.0 out of master > > > > > > > > > > > > Aside from the Parquet forward compatibility issues we're still > > > > > > discussing, and Eric's C# patch PR 4836, are there any other > > > > > > issues that need to be fixed before we go down one of these paths? > > > > > > > > > > > > Would anyone like to help with release management? I can do so if > > > > > > necessary, but I've already done a lot of release management :) > > > > > > > > > > > > - Wes > > > > > > > > > > > > On Tue, Jul 9, 2019 at 4:13 PM Wes McKinney > > > > wrote: > > > > > > > > > > > > > > Hi Eric -- of course! > > > > > > > > > > > > > > On Tue, Jul 9, 2019, 4:03 PM Eric Erhardt < > > > > eric.erha...@microsoft.com.invalid> > > > > > > wrote: > > > > > > >> > > > > > > >> Can we propose getting changes other than Python or Parquet > > > > > > >> related > > > > > > into this release? > > > > > > >> > > > > > > >> For example, I found a critical issue in the C# implementation > >
Re: [Python] Wheel questions
Le 12/07/2019 à 11:39, Uwe L. Korn a écrit : > Actually the most pragmatic way I have thought of yet would be to use conda > and build all our dependencies. Instead of using the compilers that defaults and > conda-forge use, we should build the dependencies in the manylinux image > and then upload them to a custom channel. This should also make the > maintenance of the arrow-manylinux docker container easy as this won't require > you then to do a full recompile of LLVM just because you changed something in > a preceding step. That sounds cumbersome though. Each upgrade or modification in the building of those libraries needs changing and updating some conda packages somewhere... So we would be trading one inconvenience against another. Note I recently moved llvm and clang compilation up in the Dockerfile, so most changes can now be done without recompiling them. Regards Antoine.
[jira] [Created] (ARROW-5919) [R] Add nightly tests for building r-arrow with dependencies from conda-forge
Uwe L. Korn created ARROW-5919: -- Summary: [R] Add nightly tests for building r-arrow with dependencies from conda-forge Key: ARROW-5919 URL: https://issues.apache.org/jira/browse/ARROW-5919 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, R Reporter: Uwe L. Korn Assignee: Uwe L. Korn Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [Python] Wheel questions
Hello, On Thu, Jul 11, 2019, at 9:51 PM, Wes McKinney wrote: > On Thu, Jul 11, 2019 at 11:26 AM Antoine Pitrou wrote: > > > > > > Le 11/07/2019 à 17:52, Krisztián Szűcs a écrit : > > > Hi All, > > > > > > I have a couple of questions about the wheel packaging: > > > - why do we build an arrow namespaced boost on linux and osx, could we > > > link > > > statically like with the windows wheels? > > > > No idea. Boost shouldn't leak in the public APIs, so theoretically a > > static build would be fine... Static linkage is fine as long as we don't expose any Boost symbols. We had that historically in the Decimals. If this is gone, we can switch to static linkage. > > > - do we explicitly say somewhere in the linux wheels to link the 3rdparty > > > dependencies statically or just implicitly, by removing (or not building) > > > the shared libs for the 3rdparty dependencies? > > > > It's implicit by removing the shared libs (or not building them). > > Some time ago the compression libs were always linked statically by > > default, but it was changed to dynamic along the time, probably to > > please system packagers. > > I think only the libz shared library is being bundled, for security reasons Ah, yes. This was why we made the dynamic linkage! Can you add a comment the next time you touch the build scripts? > > > - couldn't we use the 3rdparty toolchain to build the smaller 3rdparty > > > dependencies for the linux wheels instead of building them manually in the > > > manylinux docker image - it'd be easier to say _SOURCE=BUNDLED > > > > I don't think so. The conda-forge and Anaconda packages use a different > > build chain (different compiler, different libstdc++ version) and may > > not be usable directly on manylinux-compliant systems. > > I think you may misunderstand. Krisztian is suggesting building the > dependencies through the ExternalProject mechanism during "docker run" > on the image rather than caching pre-built versions in the Docker > image.
> > For small dependencies, I don't see why we couldn't use the BUNDLED > approach. This might spare us having to maintain some of the build > scripts. It will strictly increase build times, though -- I think the > reason that everything is cached now is to save on build times (which > have historically been quite long) Actually the most pragmatic way I have thought of yet would be to use conda and build all our dependencies. Instead of using the compilers that defaults and conda-forge use, we should build the dependencies in the manylinux image and then upload them to a custom channel. This should also make the maintenance of the arrow-manylinux docker container easy as this won't require you then to do a full recompile of LLVM just because you changed something in a preceding step. Uwe
Re: [DISCUSS][FORMAT] Data Integrity
Le 12/07/2019 à 09:56, Micah Kornfield a écrit : > Per Antoine's recommendation. I'm splitting off the discussion about data > integrity from the previous e-mail thread about the format additions [1]. > To re-cap I made a proposal including data integrity [2] by adding a new > message type to the > > From the previous thread the main question was at what level to apply > digests to Arrow data (Message level, array, buffer or potentially some > hybrid). > > Some trade-offs I've thought of for each approach: > * Message level > + Simplest implementation and can be applied across all messages with > pretty much the same code. > + Smallest amount of additional data (each digest will likely be 8-64 bytes) > - It lacks granularity to recover partial data from a record batch if there > is corruption. Also: - Will only apply to transmission errors using the IPC mechanism, not other kinds of errors that may occur > Array level: > + Allows for reading non-corrupted columns > + Allows for potentially more complicated use-cases like having different > compute engines "collaborate" and sign each array they computed to > establish a "chain-of-trust" > - Adds some implementation complexity. Will need different schemes for > message types other than RecordBatch and for message metadata. We also > need to determine digest boundaries (would a complex column be consumed > entirely or would child arrays be separate). Also: - Need to compute a new checksum when slicing an array? > Buffer level: > More or less same issues as array but with the following other factors: > - The most amount of additional data It's not clear that's much of a problem (currently?), especially if checksumming is optional. Arrow isn't well-suited for use cases with many tiny buffers... > - It's not clear if there is a benefit of detecting if a single buffer is > corrupted if it means we can't accurately decode the array. Also: + decorrelated from logical interpretation of buffer, e.g.
slicing I think the possibility of a hybrid scheme should be discussed as well. For example, compute physical checksums at the buffer level, then devise a lightweight formula for the checksum of an array based on those physical checksums. And a formula for an IPC message's checksum based on its type (schema, record batch, dictionary...). Regards Antoine.
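The hybrid scheme Antoine describes could be sketched roughly as follows. This is purely illustrative (CRC32 and the little-endian combination formula are assumptions; the actual digest algorithm and how digests compose are exactly the open questions in this thread): each buffer gets a physical checksum, and the array-level digest is derived cheaply from those per-buffer digests rather than from re-reading the data.

```python
import zlib

def buffer_checksum(buf: bytes) -> int:
    # Physical checksum over a buffer's raw bytes.
    return zlib.crc32(buf)

def array_checksum(buffer_checksums) -> int:
    # Lightweight combination: checksum over the concatenated per-buffer
    # digests, so the array digest is recomputable from parts without
    # touching the underlying data again.
    combined = b"".join(c.to_bytes(4, "little") for c in buffer_checksums)
    return zlib.crc32(combined)

# A toy int64 array: validity bitmap buffer + values buffer.
validity = bytes([0b00001111])
values = (1234).to_bytes(8, "little")
digest = array_checksum([buffer_checksum(validity), buffer_checksum(values)])
```

A message-level digest could then be composed from array digests the same way, giving the per-column recoverability of the array approach while keeping the message-level check cheap.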
Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)
Le 12/07/2019 à 10:08, Micah Kornfield a écrit : > OK, I've created a separate thread for data integrity/digests [1], and > retitled this thread to continue the discussion on compression and > encodings. As a reminder the PR for the format additions [2] suggested a > new SparseRecordBatch that would allow for the following features: > 1. Different data encodings at the Array (e.g. RLE) and Buffer levels > (e.g. narrower bit-width integers) > 2. Compression at the buffer level > 3. Eliding all metadata and data for empty columns. So the question is whether this really needs to be in the in-memory format, i.e. is it desired to operate directly on this compressed format, or is it solely for transport? If the latter, I wonder why Parquet cannot simply be used instead of reinventing something similar but different. Regards Antoine.
[jira] [Created] (ARROW-5918) [Java] Revise the BaseIntVector interface
Liya Fan created ARROW-5918: --- Summary: [Java] Revise the BaseIntVector interface Key: ARROW-5918 URL: https://issues.apache.org/jira/browse/ARROW-5918 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan 1. The set method should not use long as a parameter. It is hardly the case that there are more than 2^32 distinct values in a dictionary. If it really happens, maybe it means we should not have used a dictionary in the first place. 2. In addition to the get method, there should also be a set method. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5917) [Java] Redesign the dictionary encoder
Liya Fan created ARROW-5917: --- Summary: [Java] Redesign the dictionary encoder Key: ARROW-5917 URL: https://issues.apache.org/jira/browse/ARROW-5917 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan The current dictionary encoder implementation (org.apache.arrow.vector.dictionary.DictionaryEncoder) has heavy performance overhead, which prevents it from being useful in practice: # There are repeated conversions between Java objects and bytes (e.g. vector.getObject(i)). # Unnecessary memory copy (the vector data must be copied to the hash table). # The hash table cannot be reused for encoding multiple vectors (other data structure & results cannot be reused either). # The output vector should not be created/managed by the encoder (just like in the out-of-place sorter) # The hash table requires that the hashCode & equals methods be implemented appropriately, but this is not guaranteed. We plan to implement a new one in the algorithm module, and gradually deprecate the current one. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
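The reuse ideas in ARROW-5917 might look roughly like the following sketch (purely illustrative and in Python; the real implementation would be in Java, operate on vector memory rather than objects, and avoid the per-value hashing shown here). The key points it demonstrates: the value-to-index table outlives a single encode call so it can serve multiple vectors, and the caller owns the output rather than the encoder creating it.

```python
class DictionaryEncoder:
    """Sketch of a reusable encoder: the value->index table survives
    across calls so multiple vectors can share one dictionary."""
    def __init__(self):
        self.table = {}          # value -> dictionary index
        self.dictionary = []     # index -> value

    def encode_into(self, values, out):
        # The caller creates and manages `out`, mirroring point 4 in the
        # ticket (the encoder should not own the output vector).
        for v in values:
            idx = self.table.get(v)
            if idx is None:
                idx = len(self.dictionary)
                self.table[v] = idx
                self.dictionary.append(v)
            out.append(idx)
        return out

enc = DictionaryEncoder()
batch1 = enc.encode_into(["a", "b", "a"], [])
batch2 = enc.encode_into(["b", "c"], [])   # reuses the table: "b" -> 1
```

Here batch2 maps "b" to the index assigned during the first call, which is the cross-vector reuse the ticket asks for.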
[DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)
OK, I've created a separate thread for data integrity/digests [1], and retitled this thread to continue the discussion on compression and encodings. As a reminder the PR for the format additions [2] suggested a new SparseRecordBatch that would allow for the following features: 1. Different data encodings at the Array (e.g. RLE) and Buffer levels (e.g. narrower bit-width integers) 2. Compression at the buffer level 3. Eliding all metadata and data for empty columns. To recap my understanding of the highlights of the discussion so far: Encodings: There are some concerns over efficiency of some of the encodings in different scenarios. * Eliding null values makes many algorithms less efficient * Joins might become harder with these encodings. * Also the additional code complexity came up on the Arrow sync call. Compression: - Buffer level compression might be too small a granularity for data compression. - General purpose compression at this level might not add much value, so it might be better to keep it at the transport level. Alternative designs: * Put buffer level compression in specific transports (e.g. flight) * Try to use the extension mechanism to support different encodings Thanks, Micah [1] https://lists.apache.org/thread.html/23c95508dcba432caa73253062520157346fad82fce9943ba6f681dd@%3Cdev.arrow.apache.org%3E [2] https://github.com/apache/arrow/pull/4815 On Fri, Jul 12, 2019 at 12:15 AM Antoine Pitrou wrote: > > I think it would be worthwhile to split the discussion into two separate > threads. One thread for compression & encodings (which are related or > even the same topic), one thread for data integrity. > > Regards > > Antoine. > > > Le 08/07/2019 à 07:22, Micah Kornfield a écrit : > > > > - Compression: > >* Use parquet for random access to data elements. > >- This is one option, the main downside I see to this is > generally > > higher encoding/decoding costs.
Per below, I think it is reasonable to > > wait until we have more data to add compression into the spec. > >* Have the transport layer do buffer specific compression: > > - I'm not a fan of this approach. One nice thing about the > current > > communication protocols is once you strip away "framing" data all the > byte > > streams are equivalent. I think the simplicity that follows in code from > > this is a nice feature. > > > > > > *Computational efficiency of array encodings:* > > > >> How does "more efficient computation" play out for operations such as > >> hash or join? > > > > You would still need to likely materialize rows in most cases. In some > > "join" cases the sparse encoding of the null bitmap buffer could be a win > > because it serves as an index to non-null values. > > > > I think I should clarify that these encodings aren't always a win > depending > > on workload/data shape, but can have a large impact when used > appropriately > > (especially at the "Expression evaluation stage"). Also, any wins don't > > come for free, to exploit encodings properly will add some level of > > complication to existing computation code. > > > > On a packed sparse array representation: > > > >> This would be fine for simple SIMD aggregations like count/avg/mean, but > >> compacting null slots complicates more advanced parallel routines that > >> execute independently and rely on indices aligning with an element's > >> logical position. > > > > > > The main use-case I had in mind here was for scenarios like loading data > > directly from Parquet (i.e. nulls are already elided) doing some computation > and > > then potentially translating to a dense representation. Similarly it > > appears others have had advantages in some contexts for saving time at > > shuffle [1]. In many cases there is an overlap with RLE, so I'd be open > to > > removing this from the proposal.
> > > > > > *On buffer encodings:* > > To paraphrase, the main concern here seems to be it is similar to > metadata > > that was already removed [2]. > > > > A few points on this: > > 1. There was a typo in the original e-mail on sparse-integer set > encoding > > where it said "all" values are either null or not null. This should have > > read "most" values. The elision of buffers is a separate feature. > > 2. I believe these are different from the previous metadata because this > > isn't repetitive information. It provides new information about the > > contents of buffers not available anywhere else. > > 3. The proposal is to create a new message type for this feature so > it > > wouldn't be bringing back the old code and hopefully would have minimal > > impact on already existing IPC code. > > > > > > *On Compression:* > > So far my take is the consensus is that this can probably be applied at > the > > transport level without being in the spec directly. There might be value > > in more specific types of compression at the buffer level, but we should > > benchmark them first. > > > > *Data Integrity/Digest:* > >>
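One concrete reading of the "narrower bit-width integers" buffer-level encoding proposed in this thread is bit-width narrowing: store an int64 column's values in the smallest byte width that holds them all. The sketch below is hypothetical (the proposal does not fix a wire format), using standard 1/2/4/8-byte little-endian signed widths for illustration:

```python
def narrow_width(values):
    """Pick the smallest standard byte width (1/2/4/8) that can hold
    every value, and pack the buffer accordingly (signed, little-endian)."""
    for width in (1, 2, 4, 8):
        lo, hi = -(1 << (8 * width - 1)), (1 << (8 * width - 1)) - 1
        if all(lo <= v <= hi for v in values):
            break
    buf = b"".join(v.to_bytes(width, "little", signed=True) for v in values)
    return width, buf

def widen(width, buf):
    """Decode back to Python ints (the reader's view of the logical int64)."""
    return [int.from_bytes(buf[i:i + width], "little", signed=True)
            for i in range(0, len(buf), width)]

width, buf = narrow_width([3, -7, 120])   # every value fits in one byte
```

For a logically-int64 column of small values this is an 8x reduction in buffer size, and unlike general-purpose compression the narrowed buffer can still be randomly accessed by index.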
Re: [Discuss] Format additions to Arrow for sparse data and data integrity
I think it would be worthwhile to split the discussion into two separate threads. One thread for compression & encodings (which are related or even the same topic), one thread for data integrity. Regards Antoine. Le 08/07/2019 à 07:22, Micah Kornfield a écrit : > > - Compression: >* Use parquet for random access to data elements. >- This is one option, the main downside I see to this is generally > higher encoding/decoding costs. Per below, I think it is reasonable to > wait until we have more data to add compression into the spec. >* Have the transport layer do buffer specific compression: > - I'm not a fan of this approach. One nice thing about the current > communication protocols is once you strip away "framing" data all the byte > streams are equivalent. I think the simplicity that follows in code from > this is a nice feature. > > > *Computational efficiency of array encodings:* > >> How does "more efficient computation" play out for operations such as >> hash or join? > > You would still need to likely materialize rows in most cases. In some > "join" cases the sparse encoding of the null bitmap buffer could be a win > because it serves as an index to non-null values. > > I think I should clarify that these encodings aren't always a win depending > on workload/data shape, but can have a large impact when used appropriately > (especially at the "Expression evaluation stage"). Also, any wins don't > come for free, to exploit encodings properly will add some level of > complication to existing computation code. > > On a packed sparse array representation: > >> This would be fine for simple SIMD aggregations like count/avg/mean, but >> compacting null slots complicates more advanced parallel routines that >> execute independently and rely on indices aligning with an element's >> logical position. > > > The main use-case I had in mind here was for scenarios like loading data > directly from Parquet (i.e.
nulls are already elided) doing some computation and > then potentially translating to a dense representation. Similarly it > appears others have had advantages in some contexts for saving time at > shuffle [1]. In many cases there is an overlap with RLE, so I'd be open to > removing this from the proposal. > > > *On buffer encodings:* > To paraphrase, the main concern here seems to be it is similar to metadata > that was already removed [2]. > > A few points on this: > 1. There was a typo in the original e-mail on sparse-integer set encoding > where it said "all" values are either null or not null. This should have > read "most" values. The elision of buffers is a separate feature. > 2. I believe these are different from the previous metadata because this > isn't repetitive information. It provides new information about the > contents of buffers not available anywhere else. > 3. The proposal is to create a new message type for this feature so it > wouldn't be bringing back the old code and hopefully would have minimal > impact on already existing IPC code. > > > *On Compression:* > So far my take is the consensus is that this can probably be applied at the > transport level without being in the spec directly. There might be value > in more specific types of compression at the buffer level, but we should > benchmark them first. > > *Data Integrity/Digest:* > >> one question is whether this occurs at the table level, column level, >> sequential array level, etc. > > This is a good question, it seemed like the batch level was easiest and > that is why I proposed it, but I'd be open to other options. One nice > thing about the batch level is that it works for all other message types > out of the box (i.e. we can ensure the schema has been transmitted > faithfully).
> > Cheers, > Micah > > [1] https://issues.apache.org/jira/browse/ARROW-5821 > [2] https://github.com/apache/arrow/pull/1297/files > [3] https://jira.apache.org/jira/browse/ARROW-300 > > > On Sat, Jul 6, 2019 at 11:17 AM Paul Taylor > wrote: > >> Hi Micah, >> >> Similar to Jacques I'm not disagreeing, but wondering if they belong in >> Arrow vs. can be done externally. I'm mostly interested in changes that >> might impact SIMD processing, considering Arrow's already made conscious >> design decisions to trade memory for speed. Apologies in advance if I've >> misunderstood any of the proposals. >> >>> a. Add a run-length encoding scheme to efficiently represent repeated >>> values (the actual scheme encodes run ends instead of length to preserve >>> sub-linear random access). >> Couldn't one do RLE at the buffer level via a custom >> FixedSizeBinary/Binary/Utf8 encoding? Perhaps as a new ExtensionType? >> >>> b. Add a “packed” sparse representation (null values don’t take up >>> space in value buffers) >> This would be fine for simple SIMD aggregations like count/avg/mean, but >> compacting null slots complicates more advanced parallel routines that >> execute independently
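For reference, the "packed" sparse representation being debated above can be sketched like this (illustrative only; the proposal's actual buffer layout may differ). It also shows why Micah notes that the sparse encoding "serves as an index to non-null values": the positions list is exactly that index. Paul's SIMD concern is equally visible here, since packed slot k no longer sits at logical position k.

```python
def pack_sparse(values):
    """Split a nullable column into (positions, packed): packed holds only
    the non-null values, positions their logical indices."""
    positions, packed = [], []
    for i, v in enumerate(values):
        if v is not None:
            positions.append(i)
            packed.append(v)
    return positions, packed

def unpack_sparse(positions, packed, length):
    """Rebuild the dense, null-padded representation."""
    out = [None] * length
    for i, v in zip(positions, packed):
        out[i] = v
    return out

pos, vals = pack_sparse([None, 5, None, 8])
# pos == [1, 3], vals == [5, 8]
```

Aggregations like count or sum can run over `vals` alone, while routines that rely on logical positions must first translate back to the dense form, which is the trade-off discussed in the quoted exchange.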