[jira] [Created] (ARROW-7845) [C++] reading list from parquet files
Mikhail Filimonov created ARROW-7845: Summary: [C++] reading list from parquet files Key: ARROW-7845 URL: https://issues.apache.org/jira/browse/ARROW-7845 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.15.1 Reporter: Mikhail Filimonov

Currently, the Parquet reader delivered with Apache Arrow C++ does not support Parquet lists. See the related issue at https://issues.apache.org/jira/browse/PARQUET-834

-- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Arrow doesn't have a MapType
Thanks Wes. I was using 0.14 before. BTW, it seems the doc for data types wasn't fully updated. I'll submit a PR for this.

On Thu, Feb 13, 2020 at 12:28 AM Wes McKinney wrote:
> It was added between 0.15.0 and 0.16.0. Any feedback from using it
> would be welcome
>
> https://github.com/apache/arrow/commit/e0c1ffe9c38d1759f1b5311f95864b0e2a406c51
>
> On Wed, Feb 12, 2020 at 5:12 AM Shawn Yang wrote:
> >
> > Thanks François, I didn't find it in pyarrow. I'll check again.
> >
> > On Fri, Feb 7, 2020 at 9:18 PM Francois Saint-Jacques <fsaintjacq...@gmail.com> wrote:
> > > Arrow does have a Map type [1][2][3]. It is represented as a list of pairs.
> > >
> > > François
> > >
> > > [1] https://github.com/apache/arrow/blob/762202418541e843923b8cae640d15b4952a0af6/format/Schema.fbs#L60-L87
> > > [2] https://github.com/apache/arrow/blob/762202418541e843923b8cae640d15b4952a0af6/cpp/src/arrow/type.h#L691-L719
> > > [3] https://github.com/apache/arrow/blob/762202418541e843923b8cae640d15b4952a0af6/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java#L36-L47
> > >
> > > On Fri, Feb 7, 2020 at 3:55 AM Shawn Yang wrote:
> > > > Hi guys,
> > > > I'm writing a cross-language row-oriented serialization framework, mainly for Java/Python for now. I defined many data types plus schema and field classes, such as Byte, short, int, long, double, float, map, array, struct. But then I found that using the Arrow schema is a better choice, since my framework needs to support conversion between my row format and the Arrow columnar format. If I do it all by myself, I need to support schema conversion and schema serialization, which is not necessary if I use the Arrow schema.
> > > >
> > > > But I find that Arrow doesn't have a map data type, which is exactly what I need. I know I can use a struct to mock it or an ExtensionType for it, but it's not very convenient. So I want to know whether the Map type will be supported by Arrow?
> > > >
> > > > Thanks. Regards
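As François notes above, Arrow's Map type is physically "a list of pairs": each map cell is a list of key/value structs, so a map column is a list-of-structs column with two child fields. A minimal pure-Python sketch of that layout (the helper names are illustrative, not pyarrow API):

```python
# Sketch of Arrow's Map representation: a dict becomes a list of
# {"key": ..., "value": ...} entries, mirroring the list-of-structs
# physical layout. Helper names here are hypothetical.

def map_to_entries(mapping):
    """Flatten a dict into the list-of-pairs form Arrow stores."""
    return [{"key": k, "value": v} for k, v in mapping.items()]

def entries_to_map(entries):
    """Rebuild a dict from the list-of-pairs representation."""
    return {e["key"]: e["value"] for e in entries}

column = [{"a": 1, "b": 2}, {}, {"c": 3}]      # logical map column
stored = [map_to_entries(m) for m in column]   # physical layout
assert [entries_to_map(e) for e in stored] == column
```

This is why a struct-based workaround is possible at all: the Map type is essentially a blessed naming convention over that same list-of-structs encoding.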
Re: [VOTE] Adopt Arrow in-process C Data Interface specification
+1 (binding) On Tue, Feb 11, 2020 at 4:29 PM Antoine Pitrou wrote: > > > Ah, you're right, it's PR 6040: > https://github.com/apache/arrow/pull/6040 > > Similarly, the C++ implementation is at PR 6026: > https://github.com/apache/arrow/pull/6026 > > Regards > > Antoine. > > > Le 11/02/2020 à 23:17, Wes McKinney a écrit : > > hi Antoine, PR 5442 seems to no longer be the right one. Which open PR > > contains the specification now? > > > > On Tue, Feb 11, 2020 at 1:06 PM Antoine Pitrou wrote: > >> > >> > >> Hello, > >> > >> We have been discussing the creation of a minimalist C-based data > >> interface for applications to exchange Arrow columnar data structures > >> with each other. Some notable features of this interface include: > >> > >> * A small amount of header-only C code can be copied independently into > >> third-party libraries and downstream applications, no dependencies are > >> needed even on Arrow C++ itself (notably, it is not required to use > >> Flatbuffers, though there are trade-offs resulting from this). > >> > >> * Low development investment (in other words: limited-scope use cases > >> can be accomplished with little code), so as to enable C or C++ > >> libraries to export Arrow columnar data with minimal code. > >> > >> * Data lifetime management hooks so as to properly handle non-trivial > >> data sharing (for example passing Arrow columnar data to an async > >> processing consumer). > >> > >> This "C Data Interface" serves different use cases from the > >> language-independent IPC protocol and trades away a number of features > >> in the interest of minimalism / simplicity. It is not a replacement for > >> the IPC protocol and will only be used to interchange in-process data at > >> C or C++ call sites. 
> >> > >> The PR providing the specification is here: > >> https://github.com/apache/arrow/pull/5442 > >> > >> In particular, you can read the spec document here: > >> https://github.com/pitrou/arrow/blob/doc-c-data-interface2/docs/source/format/CDataInterface.rst > >> > >> A fairly comprehensive C++ implementation of this demonstrating its > >> use is found here: > >> https://github.com/apache/arrow/pull/5608 > >> > >> (note that other applications implementing the interface may choose to > >> only support a few features and thus have far less code to write) > >> > >> Please vote to adopt the SPECIFICATION (GitHub PR #5442). > >> > >> This vote will be open for at least 72 hours > >> > >> [ ] +1 Adopt C Data Interface specification > >> [ ] +0 > >> [ ] -1 Do not adopt because... > >> > >> Thank you > >> > >> Regards > >> > >> Antoine. > >> > >> > >> (PS: yes, this is in large part a copy/paste of Wes's previous vote > >> email :-))
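For readers skimming the thread: the header-only interface described above centers on two small C structs. Below is a ctypes transcription of the ArrowSchema struct as laid out in the linked spec document, to show how little machinery is involved. This is an illustrative sketch only; the PR was still under review at vote time, so field details may differ from the finally adopted version.

```python
import ctypes

# ctypes sketch of the C Data Interface's ArrowSchema struct, transcribed
# from the spec document under vote. Treat field order/types as
# illustrative, not normative.

class ArrowSchema(ctypes.Structure):
    pass

ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),     # type description string, e.g. b"i"
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    # release callback: the consumer invokes this to free producer resources,
    # which is the "data lifetime management hook" mentioned above
    ("release", ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowSchema))),
    ("private_data", ctypes.c_void_p),
]

s = ArrowSchema(format=b"i", name=b"ints", n_children=0)
assert s.format == b"i" and s.n_children == 0
```

The self-referential `children`/`dictionary` pointers are what let nested types be described without Flatbuffers or any Arrow C++ dependency.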
Re: PR Dashboard for Java?
Works now, thanks! I added a page for Java open PRs https://cwiki.apache.org/confluence/display/ARROW/Java+Open+Patches On Tue, Feb 11, 2020 at 12:08 PM Wes McKinney wrote: > Weird. Try now > > On Tue, Feb 11, 2020 at 1:03 PM Bryan Cutler wrote: > > > > Wes, it doesn't seem to have worked. Could you double check the > privileges > > for me (cutlerb)? I'd also like to add something to the verify release > > candidate page. It's weird, I made an edit before on another page a while > > ago, not sure what happened. Thanks! > > > > On Mon, Jan 27, 2020 at 2:23 PM Wes McKinney > wrote: > > > > > Bryan -- I just gave you (cutlerb) Confluence edit privileges. These > > > have to be explicitly managed on a per-user basis to avoid spam > > > problems > > > > > > On Mon, Jan 27, 2020 at 4:12 PM Bryan Cutler > wrote: > > > > > > > > Thanks Neal, but it doesn't look like I have confluence privileges. > > > That's > > > > fine though, the github interface is easy enough. > > > > > > > > On Mon, Jan 27, 2020 at 11:59 AM Neal Richardson < > > > > neal.p.richard...@gmail.com> wrote: > > > > > > > > > If you have confluence privileges, duplicate a page like > > > > > > https://cwiki.apache.org/confluence/display/ARROW/Ruby+JIRA+Dashboard > > > and > > > > > then edit the Jira query (something like status in open/in > > > > > progress/reopened, labels = pull-request-available, component = > java, > > > > > project = ARROW) if you want to make it Java issues that have pull > > > requests > > > > > open. > > > > > > > > > > Or you could bookmark > > > > > > > > > > > > > > https://github.com/apache/arrow/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+%22%5BJava%5D%22 > > > > > or https://github.com/apache/arrow/labels/lang-java > > > > > > > > > > Neal > > > > > > > > > > On Mon, Jan 27, 2020 at 11:26 AM Bryan Cutler > > > wrote: > > > > > > > > > > > I saw on Confluence that other Arrow components have PR > dashboards, > > > but I > > > > > > don't see one for Java? 
> > > > > > I think it would be helpful; is it difficult to add one for Java? I'm happy to do it if someone could point me in the right direction. Thanks!
> > > > > >
> > > > > > Bryan
[jira] [Created] (ARROW-7844) [R] Parquet list column test is flaky
Neal Richardson created ARROW-7844: -- Summary: [R] Parquet list column test is flaky Key: ARROW-7844 URL: https://issues.apache.org/jira/browse/ARROW-7844 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Neal Richardson Assignee: Francois Saint-Jacques See [https://travis-ci.org/ursa-labs/arrow-r-nightly/jobs/649649349#L373-L375] for an example on public CI. I was seeing this locally this week but figured I'd screwed up my env somehow. {code} ── 1. Failure: Lists are preserved when writing/reading from Parquet (@test-parq `object` not equivalent to `expected`. Component "num": Component 1: target is numeric, current is character {code} It's not always the same column in the data.frame that is affected. Also strange that it's only one column. You'd think that if it were transposing the order somehow, you'd get two that were swapped. The test itself is straightforward (https://github.com/apache/arrow/blob/master/r/tests/testthat/test-parquet.R#L124-L137) so this is somewhat troubling.
Re: [ARROW-3329] Re: Decimal casting or scaling
On Wed, Feb 12, 2020 at 2:37 PM Jacek Pliszka wrote: > > Actually these options still make some sense - but not as much as before. > > The use case: unit conversion > > Data about prices exported from sql in Decimal(38,10) which uses 128 > bit but the numbers are actually prices which expressed in cents fit > perfectly in uint32 > > Having scaling would reduce bandwidth/disk usage by factor of 4. You'd need to implement a separate function for this since you're changing the semantics of the cast. I don't think it makes sense to convert from 123.45 (decimal) to 12345 (uint32) in Cast > What would be the best approach to such use case? > > Would decimal_scale CastOption be OK or should it rather be compute > 'multiply' kernel ? > > BR, > > Jacek > > > śr., 12 lut 2020 o 19:32 Jacek Pliszka napisał(a): > > > > OK, then what I proposed does not make sense and I can just copy the > > solution you pointed out. > > > > Thank you, > > > > Jacek > > > > śr., 12 lut 2020 o 19:27 Wes McKinney napisał(a): > > > > > > On Wed, Feb 12, 2020 at 12:09 PM Jacek Pliszka > > > wrote: > > > > > > > > Hi! > > > > > > > > ARROW-3329 - we can discuss there. > > > > > > > > > It seems like it makes sense to implement both lossless safe casts > > > > > (when all zeros after the decimal point) and lossy casts (fractional > > > > > part discarded) from decimal to integer, do I have that right? > > > > > > > > Yes, though if I understood your examples are the same case - in both > > > > cases fractional part is discarded - just it is all 0s in the first > > > > case. > > > > > > > > The key question is whether CastFunctor in cast.cc has access to scale > > > > of the decimal? If yes how? > > > > > > Yes, it's in the type of the input array. 
> > > Here's a kernel implementation that uses the TimestampType metadata of the input:
> > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.cc#L521
> > >
> > > > If not - these are the options I've come up with:
> > > >
> > > > Let's assume the Decimal128Type value is n.
> > > >
> > > > Then I expect that the base call .cast('int64') will return overflow for n beyond int64 values, the value otherwise.
> > > >
> > > > Option 1:
> > > > .cast('int64', decimal_scale=s) would calculate n/10**s and return overflow if it is beyond int64, the value otherwise.
> > > >
> > > > Option 2:
> > > > .cast('int64', bytes_group=0) would return n & 0x
> > > > .cast('int64', bytes_group=1) would return (n >> 64) & 0x
> > > > .cast('int64') would have default value bytes_group=0
> > > >
> > > > Option 3:
> > > > cast has no CastOptions but we add a multiply compute kernel and have something like this instead:
> > > > .compute('multiply', 10**-s).cast('int64')
> > > >
> > > > BR,
> > > > Jacek
Re: [C++][Parquet] Is arrow::parquet::FileWriter::WriteColumnChunk intended to be public?
Having them be public was the intention, but it seems that column-wise writing is not yet fully baked. I think it would be OK to make these methods private until they can be appropriately tested. On Sat, Feb 8, 2020 at 10:49 PM Micah Kornfield wrote: > > I'm asking because it doesn't seem to do validation that the schemas are > equivalent between the array being written and the original schema? > > It also appears to only be used in unit tests. > > Thanks, > Micah
[jira] [Created] (ARROW-7843) [Ruby] MSYS2 packages needed for arrow-gandiva arrow-cuda
Dominic Sisneros created ARROW-7843: --- Summary: [Ruby] MSYS2 packages needed for arrow-gandiva arrow-cuda Key: ARROW-7843 URL: https://issues.apache.org/jira/browse/ARROW-7843 Project: Apache Arrow Issue Type: Bug Components: Ruby Affects Versions: 0.16.0 Environment: windows with rubyinstaller Reporter: Dominic Sisneros

require "gandiva"
table = Arrow::Table.new(:field1 => Arrow::Int32Array.new([1, 2, 3, 4]),
                         :field2 => Arrow::Int32Array.new([11, 13, 15, 17]))
schema = table.schema
expression1 = schema.build_expression do |record|
  record.field1 + record.field2
end
expression2 = schema.build_expression do |record, context|
  context.if(record.field1 > record.field2)
         .then(record.field1 / record.field2)
         .else(record.field1)
end
projector = Gandiva::Projector.new(schema, [expression1, expression2])
table.each_record_batch do |record_batch|
  outputs = projector.evaluate(record_batch)
  puts outputs.collect(&:values)
end

C:\Users\Dominic E Sisneros\source\repos\ruby\try_arrow>ruby gandiva_test2.rb
Traceback (most recent call last):
	2: from gandiva_test2.rb:1:in `'
	1: from c:/Ruby27-x64/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:92:in `require'
c:/Ruby27-x64/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:92:in `require': cannot load such file -- gandiva (LoadError)
	9: from gandiva_test2.rb:1:in `'
	8: from c:/Ruby27-x64/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:156:in `require'
	7: from c:/Ruby27-x64/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:168:in `rescue in require'
	6: from c:/Ruby27-x64/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:168:in `require'
	5: from c:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/red-gandiva-0.16.0/lib/gandiva.rb:24:in `'
	4: from c:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/red-gandiva-0.16.0/lib/gandiva.rb:28:in `'
	3: from c:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/red-gandiva-0.16.0/lib/gandiva/loader.rb:22:in `load'
	2: from c:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/gobject-introspection-3.4.1/lib/gobject-introspection/loader.rb:25:in `load'
	1: from c:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/gobject-introspection-3.4.1/lib/gobject-introspection/loader.rb:37:in `load'
c:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/gobject-introspection-3.4.1/lib/gobject-introspection/loader.rb:37:in `require': Typelib file for namespace 'Gandiva' (any version) not found (GObjectIntrospection::RepositoryError::TypelibNotFound)
Re: [ARROW-3329] Re: Decimal casting or scaling
Actually these options still make some sense - but not as much as before. The use case: unit conversion Data about prices exported from sql in Decimal(38,10) which uses 128 bit but the numbers are actually prices which expressed in cents fit perfectly in uint32 Having scaling would reduce bandwidth/disk usage by factor of 4. What would be the best approach to such use case? Would decimal_scale CastOption be OK or should it rather be compute 'multiply' kernel ? BR, Jacek śr., 12 lut 2020 o 19:32 Jacek Pliszka napisał(a): > > OK, then what I proposed does not make sense and I can just copy the > solution you pointed out. > > Thank you, > > Jacek > > śr., 12 lut 2020 o 19:27 Wes McKinney napisał(a): > > > > On Wed, Feb 12, 2020 at 12:09 PM Jacek Pliszka > > wrote: > > > > > > Hi! > > > > > > ARROW-3329 - we can discuss there. > > > > > > > It seems like it makes sense to implement both lossless safe casts > > > > (when all zeros after the decimal point) and lossy casts (fractional > > > > part discarded) from decimal to integer, do I have that right? > > > > > > Yes, though if I understood your examples are the same case - in both > > > cases fractional part is discarded - just it is all 0s in the first > > > case. > > > > > > The key question is whether CastFunctor in cast.cc has access to scale > > > of the decimal? If yes how? > > > > Yes, it's in the type of the input array. 
> > Here's a kernel implementation that uses the TimestampType metadata of the input:
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.cc#L521
> >
> > > If not - these are the options I've come up with:
> > >
> > > Let's assume the Decimal128Type value is n.
> > >
> > > Then I expect that the base call .cast('int64') will return overflow for n beyond int64 values, the value otherwise.
> > >
> > > Option 1:
> > > .cast('int64', decimal_scale=s) would calculate n/10**s and return overflow if it is beyond int64, the value otherwise.
> > >
> > > Option 2:
> > > .cast('int64', bytes_group=0) would return n & 0x
> > > .cast('int64', bytes_group=1) would return (n >> 64) & 0x
> > > .cast('int64') would have default value bytes_group=0
> > >
> > > Option 3:
> > > cast has no CastOptions but we add a multiply compute kernel and have something like this instead:
> > > .compute('multiply', 10**-s).cast('int64')
> > >
> > > BR,
> > > Jacek
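To make the bandwidth argument above concrete, here is the unit-conversion step in plain Python (an illustrative sketch, not an Arrow kernel): a Decimal(38,10) value occupies 16 bytes per element, while a price rescaled to whole cents fits a 4-byte uint32.

```python
# Sketch of the proposed "scale while casting" semantics for prices
# exported as Decimal(38, 10). Arrow stores a decimal as an unscaled
# integer, i.e. value * 10**scale; helper name is hypothetical.

SCALE = 10  # Decimal(38, 10): ten digits after the decimal point

def price_to_cents(unscaled: int) -> int:
    """Rescale an unscaled decimal integer to whole cents (uint32 range)."""
    cents = unscaled // 10 ** (SCALE - 2)   # keep two fractional digits
    if not 0 <= cents < 2 ** 32:
        raise OverflowError("price does not fit in uint32 cents")
    return cents

# $123.45 stored with scale 10 -> unscaled integer 1234500000000
assert price_to_cents(1234500000000) == 12345
# decimal128 is 16 bytes per value, uint32 is 4: the factor-of-4 saving
assert 16 // 4 == 4
```

As Wes notes, this changes the semantics of the values (123.45 becomes 12345), which is why it fits better as a multiply kernel plus a cast than as a Cast option.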
[jira] [Created] (ARROW-7842) [Rust] [Parquet] Implement array reader for list type
Morgan Cassels created ARROW-7842: - Summary: [Rust] [Parquet] Implement array reader for list type Key: ARROW-7842 URL: https://issues.apache.org/jira/browse/ARROW-7842 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Morgan Cassels Currently the array reader does not support list or map types. The initial PR implementing the array reader (https://issues.apache.org/jira/browse/ARROW-4218) says that list and map support will come later. Is it known when support for list types might be implemented?
Re: [ARROW-3329] Re: Decimal casting or scaling
OK, then what I proposed does not make sense and I can just copy the solution you pointed out. Thank you, Jacek śr., 12 lut 2020 o 19:27 Wes McKinney napisał(a): > > On Wed, Feb 12, 2020 at 12:09 PM Jacek Pliszka > wrote: > > > > Hi! > > > > ARROW-3329 - we can discuss there. > > > > > It seems like it makes sense to implement both lossless safe casts > > > (when all zeros after the decimal point) and lossy casts (fractional > > > part discarded) from decimal to integer, do I have that right? > > > > Yes, though if I understood your examples are the same case - in both > > cases fractional part is discarded - just it is all 0s in the first > > case. > > > > The key question is whether CastFunctor in cast.cc has access to scale > > of the decimal? If yes how? > > Yes, it's in the type of the input array. Here's a kernel > implementation that uses the TimestampType metadata of the input > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.cc#L521 > > > > > If not - these are the options I've came up with: > > > > Let's assume Decimal128Type value is n > > > > Then I expect that base call > > .cast('int64') will return overflow for n beyond int64 values, value > > otherwise > > > > Option 1: > > > > .cast('int64', decimal_scale=s) would calculate n/10**s and return > > overflow if it is beyond int64, value otherwise > > > > Option 2: > > > > .cast('int64', bytes_group=0) would return n & 0x > > .cast('int64', bytes_group=1) would return (n >> 64) & 0x > > .cast('int64') would have default value bytes_group=0 > > > > Option 3: > > > > cast has no CastOptions but we add multiply compute kernel and have > > something like this instead: > > > > .compute('multiply', 10**-s).cast('int64') > > > > BR, > > > > Jacek
Re: [ARROW-3329] Re: Decimal casting or scaling
On Wed, Feb 12, 2020 at 12:09 PM Jacek Pliszka wrote: > > Hi! > > ARROW-3329 - we can discuss there. > > > It seems like it makes sense to implement both lossless safe casts > > (when all zeros after the decimal point) and lossy casts (fractional > > part discarded) from decimal to integer, do I have that right? > > Yes, though if I understood your examples are the same case - in both > cases fractional part is discarded - just it is all 0s in the first > case. > > The key question is whether CastFunctor in cast.cc has access to scale > of the decimal? If yes how? Yes, it's in the type of the input array. Here's a kernel implementation that uses the TimestampType metadata of the input https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.cc#L521 > > If not - these are the options I've came up with: > > Let's assume Decimal128Type value is n > > Then I expect that base call > .cast('int64') will return overflow for n beyond int64 values, value > otherwise > > Option 1: > > .cast('int64', decimal_scale=s) would calculate n/10**s and return > overflow if it is beyond int64, value otherwise > > Option 2: > > .cast('int64', bytes_group=0) would return n & 0x > .cast('int64', bytes_group=1) would return (n >> 64) & 0x > .cast('int64') would have default value bytes_group=0 > > Option 3: > > cast has no CastOptions but we add multiply compute kernel and have > something like this instead: > > .compute('multiply', 10**-s).cast('int64') > > BR, > > Jacek
[jira] [Created] (ARROW-7841) pyarrow release 0.16.0 breaks `libhdfs.so` loading mechanism
Jack Fan created ARROW-7841: --- Summary: pyarrow release 0.16.0 breaks `libhdfs.so` loading mechanism Key: ARROW-7841 URL: https://issues.apache.org/jira/browse/ARROW-7841 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.16.0 Reporter: Jack Fan Fix For: 0.15.1

I have my env variables set up correctly according to the pyarrow README

{code:java}
$ ls $HADOOP_HOME/lib/native
libhadoop.a libhadooppipes.a libhadoop.so libhadoop.so.1.0.0 libhadooputils.a libhdfs.a libhdfs.so libhdfs.so.0.0.0
{code}

Use the following script to reproduce

{code:java}
import pyarrow
pyarrow.hdfs.connect('hdfs://localhost')
{code}

With pyarrow version 0.15.1 this works fine. However, version 0.16.0 gives the following error:

{code:java}
Traceback (most recent call last):
  File "", line 2, in
  File "/home/jackwindows/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 215, in connect
    extra_conf=extra_conf)
  File "/home/jackwindows/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 40, in __init__
    self._connect(host, port, user, kerb_ticket, driver, extra_conf)
  File "pyarrow/io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
IOError: Unable to load libhdfs: /opt/hadoop/latest/libhdfs.so: cannot open shared object file: No such file or directory
{code}
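One workaround worth trying while the regression is investigated: point pyarrow directly at the library directory via the ARROW_LIBHDFS_DIR environment variable before connecting. Whether this bypasses the 0.16.0 search-order change is an assumption on my part, not a verified fix.

```python
import os

# Point pyarrow's HDFS loader straight at the directory that actually
# contains libhdfs.so. ARROW_LIBHDFS_DIR is an env var pyarrow consults;
# treating it as a workaround for this regression is an assumption.
libhdfs_dir = os.path.join(
    os.environ.get("HADOOP_HOME", "/opt/hadoop/latest"), "lib", "native"
)
os.environ["ARROW_LIBHDFS_DIR"] = libhdfs_dir

# Then retry the reproducer from the report:
# import pyarrow
# fs = pyarrow.hdfs.connect('hdfs://localhost')
```

Note the error message shows 0.16.0 probing `/opt/hadoop/latest/libhdfs.so` rather than `$HADOOP_HOME/lib/native/libhdfs.so`, which is consistent with a changed search path rather than a missing library.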
[ARROW-3329] Re: Decimal casting or scaling
Hi!

ARROW-3329 - we can discuss there.

> It seems like it makes sense to implement both lossless safe casts
> (when all zeros after the decimal point) and lossy casts (fractional
> part discarded) from decimal to integer, do I have that right?

Yes, though if I understood correctly, your examples are the same case - in both cases the fractional part is discarded - it is just all 0s in the first case.

The key question is whether CastFunctor in cast.cc has access to the scale of the decimal, and if yes, how?

If not - these are the options I've come up with:

Let's assume the Decimal128Type value is n.

Then I expect that the base call .cast('int64') will return overflow for n beyond int64 values, the value otherwise.

Option 1:

.cast('int64', decimal_scale=s) would calculate n/10**s and return overflow if it is beyond int64, the value otherwise.

Option 2:

.cast('int64', bytes_group=0) would return n & 0x
.cast('int64', bytes_group=1) would return (n >> 64) & 0x
.cast('int64') would have default value bytes_group=0

Option 3:

cast has no CastOptions but we add a multiply compute kernel and have something like this instead:

.compute('multiply', 10**-s).cast('int64')

BR,

Jacek
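A plain-Python sketch of what Options 1 and 2 would compute on the unscaled 128-bit integer n (the option names mirror the proposal above but are hypothetical, not implemented pyarrow options; the 64-bit mask is written out explicitly here, where the quoted mail shows it truncated to "0x"):

```python
# Semantics of the two proposed CastOptions, operating on the unscaled
# integer n of a Decimal128(38, s) value. Illustrative only.

INT64_MIN, INT64_MAX = -2 ** 63, 2 ** 63 - 1
MASK64 = (1 << 64) - 1  # 64-bit mask

def cast_decimal_scale(n: int, s: int) -> int:
    """Option 1: divide out the scale, overflow-check against int64."""
    scaled = n // 10 ** s
    if not INT64_MIN <= scaled <= INT64_MAX:
        raise OverflowError("scaled value does not fit in int64")
    return scaled

def cast_bytes_group(n: int, group: int = 0) -> int:
    """Option 2: pick the low (group=0) or high (group=1) 64-bit half."""
    return (n >> (64 * group)) & MASK64

n = 12345 * 10 ** 10  # 123.45 stored with scale 10
assert cast_decimal_scale(n, 10) == 12345
assert cast_bytes_group(n, 0) == n  # fits entirely in the low 64 bits
```

Option 3 is the same arithmetic as Option 1, just factored into a standalone multiply kernel followed by an ordinary cast.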
Re: Decimal casting or scaling
hi Jacek,

What is the JIRA issue for this change? In the interest of organizing the discussion, it may make sense to move some of this to that issue.

There are no casts implemented for DecimalType at all in [1], either to decimal or from decimal to anything else. It seems like it makes sense to implement both lossless safe casts (when all zeros after the decimal point) and lossy casts (fractional part discarded) from decimal to integer, do I have that right?

I don't understand your other questions very well, so perhaps you can provide some illustration of what values are to be expected when calling ".cast('int64')" on a Decimal128Type array.

Thanks
Wes

[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.cc

On Wed, Feb 12, 2020 at 5:36 AM Jacek Pliszka wrote:
>
> Hi!
>
> I am interested in having a cast from Decimal to Int in pyarrow.
>
> I have a couple of ideas but I am a newbie so I might be wrong:
>
> Do I understand correctly that the problem lies in the fact that
> CastFunctor knows nothing about the decimal scale?
>
> Were there any ideas how to handle this properly?
>
> My ideas are not that great but maybe one of them would be OK:
>
> 1. We can pass 'numeric_scale_shift' or 'decimal_scale_shift' in CastOptions.
> Then, while casting, the numbers would be scaled properly.
>
> 2. Pass a byte group selector in CastOptions, i.e. when casting from
> N*M bytes to N bytes we can pick any of the M groups.
>
> 3. Do not modify cast but add a scale/multiply compute kernel so we can
> apply it explicitly prior to casting.
>
> What do you think? I like 1 most, 2 least.
>
> Would any of these solutions be accepted into the code?
>
> I do not want to work on something that would be rejected immediately...
>
> Thanks for any input provided,
>
> Jacek
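The lossless/lossy distinction drawn above can be sketched with the stdlib decimal module. This is a plain-Python illustration, not an Arrow kernel; the `safe` flag is a hypothetical name modeled on pyarrow's cast convention.

```python
from decimal import Decimal

# Safe (lossless) vs unsafe (lossy) decimal -> integer conversion:
# safe=True succeeds only when everything after the decimal point is zero;
# safe=False discards the fractional part.

def decimal_to_int(value: Decimal, safe: bool = True) -> int:
    integral = int(value)  # int() truncates toward zero
    if safe and value != integral:
        raise ValueError(f"lossy cast: {value} has a fractional part")
    return integral

assert decimal_to_int(Decimal("123.000")) == 123             # lossless, allowed
assert decimal_to_int(Decimal("123.45"), safe=False) == 123  # fraction discarded
```

Both branches answer Wes's question the same way: the only difference is whether a nonzero fractional part is an error or silently dropped.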
Re: Arrow doesn't have a MapType
It was added between 0.15.0 and 0.16.0. Any feedback from using it would be welcome.

https://github.com/apache/arrow/commit/e0c1ffe9c38d1759f1b5311f95864b0e2a406c51

On Wed, Feb 12, 2020 at 5:12 AM Shawn Yang wrote:
>
> Thanks François, I didn't find it in pyarrow. I'll check again.
>
> On Fri, Feb 7, 2020 at 9:18 PM Francois Saint-Jacques <fsaintjacq...@gmail.com> wrote:
> > Arrow does have a Map type [1][2][3]. It is represented as a list of pairs.
> >
> > François
> >
> > [1] https://github.com/apache/arrow/blob/762202418541e843923b8cae640d15b4952a0af6/format/Schema.fbs#L60-L87
> > [2] https://github.com/apache/arrow/blob/762202418541e843923b8cae640d15b4952a0af6/cpp/src/arrow/type.h#L691-L719
> > [3] https://github.com/apache/arrow/blob/762202418541e843923b8cae640d15b4952a0af6/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java#L36-L47
> >
> > On Fri, Feb 7, 2020 at 3:55 AM Shawn Yang wrote:
> > > Hi guys,
> > > I'm writing a cross-language row-oriented serialization framework, mainly for Java/Python for now. I defined many data types plus schema and field classes, such as Byte, short, int, long, double, float, map, array, struct. But then I found that using the Arrow schema is a better choice, since my framework needs to support conversion between my row format and the Arrow columnar format. If I do it all by myself, I need to support schema conversion and schema serialization, which is not necessary if I use the Arrow schema.
> > >
> > > But I find that Arrow doesn't have a map data type, which is exactly what I need. I know I can use a struct to mock it or an ExtensionType for it, but it's not very convenient. So I want to know whether the Map type will be supported by Arrow?
> > >
> > > Thanks. Regards
[jira] [Created] (ARROW-7840) [Java] [Integration] Java executables fail
Antoine Pitrou created ARROW-7840: - Summary: [Java] [Integration] Java executables fail Key: ARROW-7840 URL: https://issues.apache.org/jira/browse/ARROW-7840 Project: Apache Arrow Issue Type: Bug Components: Integration, Java Reporter: Antoine Pitrou Fix For: 1.0.0 When trying to run integration tests using {{docker-compose run conda-integration}}, I always get failures during the Java tests: {code} RuntimeError: Command failed: ['java', '-Dio.netty.tryReflectionSetAccessible=true', '-cp', '/arrow/java/tools/target/arrow-tools-1.0.0-SNAPSHOT-jar-with-dependencies.jar', 'org.apache.arrow.tools.StreamToFile', '/tmp/tmpqbkrmpo1/e75ed336_simple.producer_file_as_stream', '/tmp/tmpqbkrmpo1/e75ed336_simple.consumer_stream_as_file'] With output: -- 15:57:01.194 [main] DEBUG io.netty.util.internal.logging.InternalLoggerFactory - Using SLF4J as the default logging framework 15:57:01.196 [main] DEBUG io.netty.util.ResourceLeakDetector - -Dio.netty.leakDetection.level: simple 15:57:01.196 [main] DEBUG io.netty.util.ResourceLeakDetector - -Dio.netty.leakDetection.targetRecords: 4 15:57:01.208 [main] DEBUG io.netty.util.internal.PlatformDependent0 - -Dio.netty.noUnsafe: false 15:57:01.209 [main] DEBUG io.netty.util.internal.PlatformDependent0 - Java version: 8 15:57:01.210 [main] DEBUG io.netty.util.internal.PlatformDependent0 - sun.misc.Unsafe.theUnsafe: available 15:57:01.210 [main] DEBUG io.netty.util.internal.PlatformDependent0 - sun.misc.Unsafe.copyMemory: available 15:57:01.210 [main] DEBUG io.netty.util.internal.PlatformDependent0 - java.nio.Buffer.address: available 15:57:01.210 [main] DEBUG io.netty.util.internal.PlatformDependent0 - direct buffer constructor: available 15:57:01.211 [main] DEBUG io.netty.util.internal.PlatformDependent0 - java.nio.Bits.unaligned: available, true 15:57:01.211 [main] DEBUG io.netty.util.internal.PlatformDependent0 - jdk.internal.misc.Unsafe.allocateUninitializedArray(int): unavailable prior to Java9 15:57:01.211 [main] DEBUG 
io.netty.util.internal.PlatformDependent0 - java.nio.DirectByteBuffer.(long, int): available 15:57:01.211 [main] DEBUG io.netty.util.internal.PlatformDependent - sun.misc.Unsafe: available 15:57:01.211 [main] DEBUG io.netty.util.internal.PlatformDependent - -Dio.netty.tmpdir: /tmp (java.io.tmpdir) 15:57:01.211 [main] DEBUG io.netty.util.internal.PlatformDependent - -Dio.netty.bitMode: 64 (sun.arch.data.model) 15:57:01.212 [main] DEBUG io.netty.util.internal.PlatformDependent - -Dio.netty.noPreferDirect: false 15:57:01.212 [main] DEBUG io.netty.util.internal.PlatformDependent - -Dio.netty.maxDirectMemory: 11252269056 bytes 15:57:01.212 [main] DEBUG io.netty.util.internal.PlatformDependent - -Dio.netty.uninitializedArrayAllocationThreshold: -1 15:57:01.213 [main] DEBUG io.netty.util.internal.CleanerJava6 - java.nio.ByteBuffer.cleaner(): available 15:57:01.213 [main] DEBUG io.netty.buffer.PooledByteBufAllocator - -Dio.netty.allocator.numHeapArenas: 48 15:57:01.213 [main] DEBUG io.netty.buffer.PooledByteBufAllocator - -Dio.netty.allocator.numDirectArenas: 48 15:57:01.213 [main] DEBUG io.netty.buffer.PooledByteBufAllocator - -Dio.netty.allocator.pageSize: 8192 15:57:01.213 [main] DEBUG io.netty.buffer.PooledByteBufAllocator - -Dio.netty.allocator.maxOrder: 11 15:57:01.213 [main] DEBUG io.netty.buffer.PooledByteBufAllocator - -Dio.netty.allocator.chunkSize: 16777216 15:57:01.213 [main] DEBUG io.netty.buffer.PooledByteBufAllocator - -Dio.netty.allocator.tinyCacheSize: 512 15:57:01.213 [main] DEBUG io.netty.buffer.PooledByteBufAllocator - -Dio.netty.allocator.smallCacheSize: 256 15:57:01.213 [main] DEBUG io.netty.buffer.PooledByteBufAllocator - -Dio.netty.allocator.normalCacheSize: 64 15:57:01.213 [main] DEBUG io.netty.buffer.PooledByteBufAllocator - -Dio.netty.allocator.maxCachedBufferCapacity: 32768 15:57:01.213 [main] DEBUG io.netty.buffer.PooledByteBufAllocator - -Dio.netty.allocator.cacheTrimInterval: 8192 15:57:01.213 [main] DEBUG 
io.netty.buffer.PooledByteBufAllocator - -Dio.netty.allocator.useCacheForAllThreads: true 15:57:01.216 [main] DEBUG io.netty.util.internal.InternalThreadLocalMap - -Dio.netty.threadLocalMap.stringBuilder.initialSize: 1024 15:57:01.216 [main] DEBUG io.netty.util.internal.InternalThreadLocalMap - -Dio.netty.threadLocalMap.stringBuilder.maxSize: 4096 15:57:01.228 [main] DEBUG io.netty.buffer.AbstractByteBuf - -Dio.netty.buffer.bytebuf.checkAccessible: true 15:57:01.228 [main] DEBUG io.netty.util.ResourceLeakDetectorFactory - Loaded default ResourceLeakDetector: io.netty.util.ResourceLeakDetector@71bc1ae4 15:57:01.242 [main] DEBUG org.apache.arrow.vector.ipc.ReadChannel - Reading buffer with size: 4 15:57:01.242 [main] DEBUG org.apache.arrow.vector.ipc.ReadChannel - Reading buffer with size: 4 15:57:01.242 [main] DEBUG
[NIGHTLY] Arrow Build Report for Job nightly-2020-02-12-0
Arrow Build Report for Job nightly-2020-02-12-0

All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0

Failed Tasks:
- test-conda-python-3.7-turbodbc-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-circle-test-conda-python-3.7-turbodbc-master
- wheel-osx-cp27m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-travis-wheel-osx-cp27m

Succeeded Tasks:
- centos-6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-azure-centos-6
- centos-7:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-azure-centos-7
- centos-8:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-azure-centos-8
- conda-linux-gcc-py27:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-azure-conda-linux-gcc-py27
- conda-linux-gcc-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py27:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-azure-conda-osx-clang-py27
- conda-osx-clang-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-azure-conda-win-vs2015-py38
- debian-buster:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-azure-debian-buster
- debian-stretch:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-azure-debian-stretch
- gandiva-jar-osx:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-travis-gandiva-jar-osx
- gandiva-jar-trusty:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-travis-gandiva-jar-trusty
- homebrew-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-travis-homebrew-cpp
- macos-r-autobrew:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-travis-macos-r-autobrew
- test-conda-cpp-valgrind:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-circle-test-conda-cpp-valgrind
- test-conda-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-circle-test-conda-cpp
- test-conda-python-2.7-pandas-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-circle-test-conda-python-2.7-pandas-latest
- test-conda-python-2.7:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-circle-test-conda-python-2.7
- test-conda-python-3.6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-pandas-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-circle-test-conda-python-3.7-pandas-latest
- test-conda-python-3.7-pandas-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-circle-test-conda-python-3.7-pandas-master
- test-conda-python-3.7-spark-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-circle-test-conda-python-3.7-spark-master
- test-conda-python-3.7:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-12-0-circle-test-conda-python-3.7
- test-conda-python-3.8-dask-master:
  URL:
Decimal casting or scaling
Hi! I am interested in having a cast from Decimal to Int in pyarrow. I have a couple of ideas, but I am a newbie so I might be wrong. Do I understand correctly that the problem lies in the fact that CastFunctor knows nothing about the decimal scale? Were there any ideas about how to handle this properly? My ideas are not that great, but maybe one of them would be OK:

1. Pass a 'numeric_scale_shift' or 'decimal_scale_shift' in CastOptions; while casting, the numbers would then be scaled properly.
2. Pass a byte group selector in CastOptions, i.e. when casting from N*M bytes to N bytes we can pick any of the M groups.
3. Do not modify the cast, but add a scale/multiply compute kernel so we can apply it explicitly prior to casting.

What do you think? I like 1 the most and 2 the least. Would any of these solutions be accepted into the code? I do not want to work on something that would be rejected immediately...

Thanks for any input provided,
Jacek
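To make the scale-shift idea in option 1 concrete, here is a plain-Python sketch. The `scale_shift` parameter and the helper function below are hypothetical illustrations of what such a CastOptions field could mean, not an existing pyarrow API:

```python
from decimal import Decimal

def cast_decimal_to_int(values, scale_shift=0):
    """Hypothetical sketch of option 1: multiply each decimal value by
    10**scale_shift before truncating to an integer, so a cast from e.g.
    decimal128(5, 2) can preserve the fractional digits as integer units."""
    factor = Decimal(10) ** scale_shift
    return [int(v * factor) for v in values]

prices = [Decimal("12.34"), Decimal("0.05")]
cast_decimal_to_int(prices)                 # plain truncation -> [12, 0]
cast_decimal_to_int(prices, scale_shift=2)  # shift by the scale -> [1234, 5]
```

With `scale_shift` equal to the column's decimal scale, the cast is lossless; with the default of 0 it simply truncates, which is what makes the option worth exposing to the caller.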
Re: Arrow doesn't have a MapType
Thanks François, I didn't find it in pyarrow. I'll check again.

On Fri, Feb 7, 2020 at 9:18 PM Francois Saint-Jacques <fsaintjacq...@gmail.com> wrote:
> Arrow does have a Map type [1][2][3]. It is represented as a list of pairs.
>
> François
>
> [1] https://github.com/apache/arrow/blob/762202418541e843923b8cae640d15b4952a0af6/format/Schema.fbs#L60-L87
> [2] https://github.com/apache/arrow/blob/762202418541e843923b8cae640d15b4952a0af6/cpp/src/arrow/type.h#L691-L719
> [3] https://github.com/apache/arrow/blob/762202418541e843923b8cae640d15b4952a0af6/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java#L36-L47
>
> On Fri, Feb 7, 2020 at 3:55 AM Shawn Yang wrote:
> >
> > Hi guys,
> > I'm writing a cross-language row-oriented serialization framework, mainly
> > for Java/Python for now. I defined many data types plus schema and field
> > abstractions, such as byte, short, int, long, double, float, map, array,
> > and struct. But then I found that using the Arrow schema is a better
> > choice, since my framework needs to support conversion between my row
> > format and the Arrow columnar format. If I do it all by myself, I need to
> > support schema conversion and schema serialization, which is not necessary
> > if I use the Arrow schema.
> >
> > But I find that Arrow doesn't have a map data type, which is exactly what
> > I need. I know I can mock it with a struct or use an ExtensionType for it,
> > but that's not very convenient. So I want to know whether the Map type
> > will be supported by Arrow?
> >
> > Thanks. Regards
>
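As the Schema.fbs reference in that reply describes, Arrow's Map type is physically a list of key/value structs: one flat array of keys, one flat array of items, and per-row list offsets. A minimal pure-Python sketch of that layout (the `offsets`/`keys`/`items` names are illustrative, not Arrow API):

```python
# A map<string, int32> column with two rows, {"a": 1, "b": 2} and {"c": 3},
# stored the way Arrow stores it: as list<struct<key, value>>.
offsets = [0, 2, 3]     # row i spans flat entries offsets[i]:offsets[i+1]
keys = ["a", "b", "c"]  # flat child array of keys
items = [1, 2, 3]       # flat child array of values

def map_row(i):
    """Reassemble row i of the map column as a list of (key, value) pairs."""
    lo, hi = offsets[i], offsets[i + 1]
    return list(zip(keys[lo:hi], items[lo:hi]))

map_row(0)  # [('a', 1), ('b', 2)]
map_row(1)  # [('c', 3)]
```

This is why "a list of pairs" is enough: the map adds semantics (keys, optional key ordering) on top of an ordinary list-of-struct layout, so no new physical buffer type is needed.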
[jira] [Created] (ARROW-7839) [Python][Dataset] Add IPC format to python bindings
Joris Van den Bossche created ARROW-7839: Summary: [Python][Dataset] Add IPC format to python bindings Key: ARROW-7839 URL: https://issues.apache.org/jira/browse/ARROW-7839 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche The C++ / R side was done in ARROW-7415; we should add bindings for it in Python as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7838) [C++] Installed plasma-store-server fails finding Boost
Antoine Pitrou created ARROW-7838: Summary: [C++] Installed plasma-store-server fails finding Boost Key: ARROW-7838 URL: https://issues.apache.org/jira/browse/ARROW-7838 Project: Apache Arrow Issue Type: Bug Components: C++, C++ - Plasma Reporter: Antoine Pitrou

In my build directory I have:
{code}
$ ldd build-test/debug/plasma-store-server
	linux-vdso.so.1 (0x7ffc0001f000)
	libplasma.so.100 => /home/antoine/arrow/dev/cpp/build-test/debug/libplasma.so.100 (0x7efbff629000)
	libarrow_cuda.so.100 => /home/antoine/arrow/dev/cpp/build-test/debug/libarrow_cuda.so.100 (0x7efbff58d000)
	libarrow.so.100 => /home/antoine/arrow/dev/cpp/build-test/debug/libarrow.so.100 (0x7efbfcbae000)
	libssl.so.1.1 => /home/antoine/miniconda3/envs/pyarrow/lib/libssl.so.1.1 (0x7efbfcb1e000)
	libcrypto.so.1.1 => /home/antoine/miniconda3/envs/pyarrow/lib/libcrypto.so.1.1 (0x7efbfc87)
	libaws-cpp-sdk-config.so => /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-config.so (0x7efbfc6be000)
	libaws-cpp-sdk-transfer.so => /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-transfer.so (0x7efbff557000)
	libaws-cpp-sdk-s3.so => /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-s3.so (0x7efbfc478000)
	libaws-cpp-sdk-core.so => /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-core.so (0x7efbfc37b000)
	libaws-c-event-stream.so.0unstable => /home/antoine/miniconda3/envs/pyarrow/lib/libaws-c-event-stream.so.0unstable (0x7efbff54e000)
	libaws-c-common.so.0unstable => /home/antoine/miniconda3/envs/pyarrow/lib/libaws-c-common.so.0unstable (0x7efbff52d000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7efbfbfa2000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7efbfbd83000)
	libaws-checksums.so => /home/antoine/miniconda3/envs/pyarrow/lib/libaws-checksums.so (0x7efbff51d000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7efbfbb7b000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7efbfb977000)
	libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x7efbfadd7000)
	libstdc++.so.6 => /home/antoine/miniconda3/envs/pyarrow/lib/libstdc++.so.6 (0x7efbfac63000)
	libgcc_s.so.1 => /home/antoine/miniconda3/envs/pyarrow/lib/libgcc_s.so.1 (0x7efbff507000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7efbfa872000)
	/lib64/ld-linux-x86-64.so.2 (0x7efbff4d7000)
	libbz2.so.1.0 => /home/antoine/miniconda3/envs/pyarrow/lib/libbz2.so.1.0 (0x7efbfa85e000)
	liblz4.so.1 => /home/antoine/miniconda3/envs/pyarrow/lib/liblz4.so.1 (0x7efbfa829000)
	libsnappy.so.1 => /home/antoine/miniconda3/envs/pyarrow/lib/libsnappy.so.1 (0x7efbfa81e000)
	libz.so.1 => /home/antoine/miniconda3/envs/pyarrow/lib/libz.so.1 (0x7efbfa804000)
	libzstd.so.1 => /home/antoine/miniconda3/envs/pyarrow/lib/libzstd.so.1 (0x7efbfa748000)
	libboost_filesystem.so.1.68.0 => /home/antoine/miniconda3/envs/pyarrow/lib/libboost_filesystem.so.1.68.0 (0x7efbfa72a000)
	libboost_system.so.1.68.0 => /home/antoine/miniconda3/envs/pyarrow/lib/libboost_system.so.1.68.0 (0x7efbff4fe000)
	libcurl.so.4 => /home/antoine/miniconda3/envs/pyarrow/lib/./libcurl.so.4 (0x7efbfa6a4000)
	libnvidia-fatbinaryloader.so.390.116 => /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.390.116 (0x7efbfa456000)
	libssh2.so.1 => /home/antoine/miniconda3/envs/pyarrow/lib/././libssh2.so.1 (0x7efbfa423000)
	libgssapi_krb5.so.2 => /home/antoine/miniconda3/envs/pyarrow/lib/././libgssapi_krb5.so.2 (0x7efbfa3d4000)
	libkrb5.so.3 => /home/antoine/miniconda3/envs/pyarrow/lib/././libkrb5.so.3 (0x7efbfa2fd000)
	libk5crypto.so.3 => /home/antoine/miniconda3/envs/pyarrow/lib/././libk5crypto.so.3 (0x7efbfa2de000)
	libcom_err.so.3 => /home/antoine/miniconda3/envs/pyarrow/lib/././libcom_err.so.3 (0x7efbfa2d6000)
	libkrb5support.so.0 => /home/antoine/miniconda3/envs/pyarrow/lib/./././libkrb5support.so.0 (0x7efbfa2c8000)
	libresolv.so.2 => /lib/x86_64-linux-gnu/libresolv.so.2 (0x7efbfa0ad000)
{code}
However, once installed it seems the Boost resolution fails:
{code}
$ ldd /home/antoine/miniconda3/envs/pyarrow/bin/plasma-store-server
	linux-vdso.so.1 (0x7ffc0001f000)
	libplasma.so.100 => /home/antoine/miniconda3/envs/pyarrow/lib/libplasma.so.100 (0x7efbff629000)
	libarrow_cuda.so.100 => /home/antoine/miniconda3/envs/pyarrow/lib/libarrow_cuda.so.100 (0x7efbff58d000)
	libarrow.so.100 => /home/antoine/miniconda3/envs/pyarrow/lib/libarrow.so.100 (0x7efbfcbae000)