[jira] [Resolved] (ARROW-6089) [Rust] [DataFusion] Implement parallel execution for selection
[ https://issues.apache.org/jira/browse/ARROW-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-6089. --- Fix Version/s: (was: 1.0.0) 0.15.0 Resolution: Fixed Issue resolved by pull request 5320 [https://github.com/apache/arrow/pull/5320] > [Rust] [DataFusion] Implement parallel execution for selection > -- > > Key: ARROW-6089 > URL: https://issues.apache.org/jira/browse/ARROW-6089 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Implement physical plan for selection operator. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6683) [Python] Add unit tests that validate cross-compatibility with pyarrow.parquet when fastparquet is installed
Wes McKinney created ARROW-6683: --- Summary: [Python] Add unit tests that validate cross-compatibility with pyarrow.parquet when fastparquet is installed Key: ARROW-6683 URL: https://issues.apache.org/jira/browse/ARROW-6683 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 1.0.0 This will help prevent such issues as ARROW-6678 from recurring -- This message was sent by Atlassian Jira (v8.3.4#803005)
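A minimal sketch of what such a cross-compatibility test could look like: write a file with pyarrow.parquet and confirm fastparquet can open it, skipping when either library is absent. The library names are real; the helper names and test layout are assumptions, not the patch that eventually closed this issue.

```python
import importlib.util
import os
import tempfile

def have(mod):
    """True if `mod` can be imported in this environment."""
    return importlib.util.find_spec(mod) is not None

def roundtrip_pyarrow_to_fastparquet(path):
    # Hypothetical test body: pyarrow writes, fastparquet parses the footer.
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    import fastparquet as fp
    df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": ["x", "y", "z"]})
    pq.write_table(pa.table(df), path)
    fp.ParquetFile(path)  # raises if pyarrow's footer metadata is unreadable
    return True

if have("pandas") and have("pyarrow") and have("fastparquet"):
    with tempfile.TemporaryDirectory() as d:
        assert roundtrip_pyarrow_to_fastparquet(os.path.join(d, "t.parquet"))
```

Running this guard unconditionally in CI (rather than skipping) would have surfaced the ARROW-6678 regression before release.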
[jira] [Commented] (ARROW-6613) [C++] Remove dependency on boost::filesystem
[ https://issues.apache.org/jira/browse/ARROW-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937407#comment-16937407 ] Wes McKinney commented on ARROW-6613: - I would suggest adding an {{ARROW_FILESYSTEM}} option, setting it to off by default, then handling the remaining usages of Boost in the core library-only build. I think it's fine to have boost::filesystem in src/arrow/filesystem. People who are making use of this functionality are probably more OK with accepting additional build dependencies. I would mainly like this stuff to be out of the way of people who are only using Array / ArrayBuilder / RecordBatch and IPC read/write tools > [C++] Remove dependency on boost::filesystem > > > Key: ARROW-6613 > URL: https://issues.apache.org/jira/browse/ARROW-6613 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > See ARROW-2196 for details. > boost::filesystem should not be required for base functionality at least > (including filesystems, probably). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6678) [C++] Regression in Parquet file compatibility introduced by ARROW-3246
[ https://issues.apache.org/jira/browse/ARROW-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6678. Resolution: Fixed Issue resolved by pull request 5493 [https://github.com/apache/arrow/pull/5493] > [C++] Regression in Parquet file compatibility introduced by ARROW-3246 > --- > > Key: ARROW-6678 > URL: https://issues.apache.org/jira/browse/ARROW-6678 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Blocker > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 2h 40m > Remaining Estimate: 0h > > I randomly discovered that this script fails after applying the patch for > ARROW-3246 > https://github.com/apache/arrow/commit/2ba0566b29312e84fafc987fd8dc9664748be96a > {code} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > import fastparquet as fp > df = pd.util.testing.makeDataFrame() > pq.write_table(pa.table(df), 'test.parquet') > fp.ParquetFile('test.parquet') > {code} > with > {code} > Traceback (most recent call last): > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 110, in __init__ > with open_with(fn2, 'rb') as f: > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/util.py", > line 38, in default_open > return open(f, mode) > NotADirectoryError: [Errno 20] Not a directory: 'test.parquet/_metadata' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "test.py", line 10, in > fp.ParquetFile('test.parquet') > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 116, in __init__ > self._parse_header(f, verify) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 135, in _parse_header > fmd = read_thrift(f, parquet_thrift.FileMetaData) > File > 
"/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/thrift_structures.py", > line 25, in read_thrift > obj.read(pin) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py", > line 1929, in read > iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec]) > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 0: > invalid start byte > {code} > I don't recall making any metadata-related changes but I'm going to review > the patch to see if I can narrow down where the problem is to see whether > it's a bug with Arrow/parquet-cpp or with the third party library -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6678) [C++] Regression in Parquet file compatibility introduced by ARROW-3246
[ https://issues.apache.org/jira/browse/ARROW-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield reassigned ARROW-6678: -- Assignee: Wes McKinney > [C++] Regression in Parquet file compatibility introduced by ARROW-3246 > --- > > Key: ARROW-6678 > URL: https://issues.apache.org/jira/browse/ARROW-6678 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Blocker > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 2.5h > Remaining Estimate: 0h > > I randomly discovered that this script fails after applying the patch for > ARROW-3246 > https://github.com/apache/arrow/commit/2ba0566b29312e84fafc987fd8dc9664748be96a > {code} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > import fastparquet as fp > df = pd.util.testing.makeDataFrame() > pq.write_table(pa.table(df), 'test.parquet') > fp.ParquetFile('test.parquet') > {code} > with > {code} > Traceback (most recent call last): > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 110, in __init__ > with open_with(fn2, 'rb') as f: > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/util.py", > line 38, in default_open > return open(f, mode) > NotADirectoryError: [Errno 20] Not a directory: 'test.parquet/_metadata' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "test.py", line 10, in > fp.ParquetFile('test.parquet') > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 116, in __init__ > self._parse_header(f, verify) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 135, in _parse_header > fmd = read_thrift(f, parquet_thrift.FileMetaData) > File > 
"/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/thrift_structures.py", > line 25, in read_thrift > obj.read(pin) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py", > line 1929, in read > iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec]) > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 0: > invalid start byte > {code} > I don't recall making any metadata-related changes but I'm going to review > the patch to see if I can narrow down where the problem is to see whether > it's a bug with Arrow/parquet-cpp or with the third party library -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6681) [C# -> R] - Record Batches in reverse order?
[ https://issues.apache.org/jira/browse/ARROW-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-6681: - Issue Type: Bug (was: New Feature) > [C# -> R] - Record Batches in reverse order? > > > Key: ARROW-6681 > URL: https://issues.apache.org/jira/browse/ARROW-6681 > Project: Apache Arrow > Issue Type: Bug > Components: C#, R >Affects Versions: 0.14.1 >Reporter: Anthony Abate >Priority: Minor > > Are 'RecordBatches' in C# being written in reverse order? > I made a simple test which creates a single row per record batch of 0 to 99 > and attempted to read this in R. To my surprise batch(0) in R had the value > 99, not 0. > This may not seem like a big deal; however, when dealing with 'huge' files, > it's more efficient to use Record Batches / index lookup than attempting to > load the entire file into memory. > Having the order consistent across the different language APIs only makes > sense - for now I can work around this by reversing the order before > writing. > > https://github.com/apache/arrow/issues/5475 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6682) Arrow Hangs on Large Files (10-12gb)
Anthony Abate created ARROW-6682: Summary: Arrow Hangs on Large Files (10-12gb) Key: ARROW-6682 URL: https://issues.apache.org/jira/browse/ARROW-6682 Project: Apache Arrow Issue Type: Bug Components: C++, R Affects Versions: 0.14.1 Reporter: Anthony Abate I get random hangs on arrow_read in R (Windows) when using a very large file (10-12gb). I have memory dumps - all threads seem to be in wait handles. Are there debug symbols somewhere? Is there a way to get the C++ code to produce diagnostic logging from R? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6681) [C# -> R] - Record Batches in reverse order?
[ https://issues.apache.org/jira/browse/ARROW-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937373#comment-16937373 ] Anthony Abate commented on ARROW-6681: -- sample code in github issue > [C# -> R] - Record Batches in reverse order? > > > Key: ARROW-6681 > URL: https://issues.apache.org/jira/browse/ARROW-6681 > Project: Apache Arrow > Issue Type: New Feature > Components: C#, R >Affects Versions: 0.14.1 >Reporter: Anthony Abate >Priority: Minor > > Are 'RecordBatches' in C# being written in reverse order? > I made a simple test which creates a single row per record batch of 0 to 99 > and attempted to read this in R. To my surprise batch(0) in R had the value > 99, not 0. > This may not seem like a big deal; however, when dealing with 'huge' files, > it's more efficient to use Record Batches / index lookup than attempting to > load the entire file into memory. > Having the order consistent across the different language APIs only makes > sense - for now I can work around this by reversing the order before > writing. > > https://github.com/apache/arrow/issues/5475 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6681) [C# -> R] - Record Batches in reverse order?
[ https://issues.apache.org/jira/browse/ARROW-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-6681: - Component/s: R > [C# -> R] - Record Batches in reverse order? > > > Key: ARROW-6681 > URL: https://issues.apache.org/jira/browse/ARROW-6681 > Project: Apache Arrow > Issue Type: New Feature > Components: C#, R >Affects Versions: 0.14.1 >Reporter: Anthony Abate >Priority: Minor > > Are 'RecordBatches' in C# being written in reverse order? > I made a simple test which creates a single row per record batch of 0 to 99 > and attempted to read this in R. To my surprise batch(0) in R had the value > 99, not 0. > This may not seem like a big deal; however, when dealing with 'huge' files, > it's more efficient to use Record Batches / index lookup than attempting to > load the entire file into memory. > Having the order consistent across the different language APIs only makes > sense - for now I can work around this by reversing the order before > writing. > > https://github.com/apache/arrow/issues/5475 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6681) [C# -> R] - Record Batches in reverse order?
Anthony Abate created ARROW-6681: Summary: [C# -> R] - Record Batches in reverse order? Key: ARROW-6681 URL: https://issues.apache.org/jira/browse/ARROW-6681 Project: Apache Arrow Issue Type: New Feature Components: C# Affects Versions: 0.14.1 Reporter: Anthony Abate Are 'RecordBatches' in C# being written in reverse order? I made a simple test which creates a single row per record batch of 0 to 99 and attempted to read this in R. To my surprise batch(0) in R had the value 99, not 0. This may not seem like a big deal; however, when dealing with 'huge' files, it's more efficient to use Record Batches / index lookup than attempting to load the entire file into memory. Having the order consistent across the different language APIs only makes sense - for now I can work around this by reversing the order before writing. https://github.com/apache/arrow/issues/5475 -- This message was sent by Atlassian Jira (v8.3.4#803005)
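The ordering question above can be checked from Python with a small IPC roundtrip: write one single-row batch per value, read the file back, and compare the order. This is an assumed repro harness (the `pa.ipc.new_file` / `pa.ipc.open_file` calls require a reasonably recent pyarrow), not the reporter's original C# code.

```python
import importlib.util

def batch_order_roundtrip(values):
    """Write one single-row record batch per value; return values as re-read."""
    import pyarrow as pa
    schema = pa.schema([("x", pa.int64())])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_file(sink, schema) as writer:
        for v in values:  # one batch per value, mirroring the 0..99 repro
            batch = pa.RecordBatch.from_arrays(
                [pa.array([v], type=pa.int64())], names=["x"])
            writer.write_batch(batch)
    reader = pa.ipc.open_file(sink.getvalue())
    return [reader.get_batch(i).column(0)[0].as_py()
            for i in range(reader.num_record_batches)]

if importlib.util.find_spec("pyarrow") is not None:
    # Batches should come back in exactly the order they were written.
    assert batch_order_roundtrip(list(range(100))) == list(range(100))
```

The Arrow file format records batch locations in a footer, so readers on every language binding should observe the write order; a mismatch points at the writer.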
[jira] [Updated] (ARROW-6667) [Python] Avoid Reference Cycles in pyarrow.parquet
[ https://issues.apache.org/jira/browse/ARROW-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6667: Fix Version/s: 0.15.0 > [Python] Avoid Reference Cycles in pyarrow.parquet > -- > > Key: ARROW-6667 > URL: https://issues.apache.org/jira/browse/ARROW-6667 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Aaron Opfer >Priority: Minor > Labels: pull-request-available > Fix For: 0.15.0 > > Attachments: cycle1_build_nested_path.PNG, cycle2_open_dataset.PNG > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Reference cycles appear in two places inside pyarrow.parquet which causes > these objects to have much longer lifetimes than necessary: > > {{_build_nested_path}} has a reference cycle because the closured function > refers to the parent cell which also refers to the closured function again > (objgraph shown in attachment) > {{open_dataset_file}} is partialed with self inside the {{ParquetFile}} class > (objgraph shown in attachment). -- This message was sent by Atlassian Jira (v8.3.4#803005)
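The second cycle pattern described above (a method partialed with `self` and stored back on the instance) can be reproduced in a few lines of plain Python; the class and attribute names here are illustrative stand-ins, not pyarrow's actual code.

```python
import functools
import gc

class Holder:
    """Stand-in for a class that stores functools.partial(method, self):
    partial -> self -> __dict__ -> partial forms a reference cycle."""
    def __init__(self):
        self.opener = functools.partial(Holder.describe, self)

    def describe(self):
        return "opened"

def cycle_needs_gc():
    gc.collect()         # drain any pre-existing garbage first
    obj = Holder()
    del obj              # refcounting alone cannot free the cycle
    return gc.collect()  # the cycle collector reports unreachable objects

collected = cycle_needs_gc()  # > 0: the object lived until a full GC pass
```

This is why such objects have "much longer lifetimes than necessary": they wait for the cyclic garbage collector instead of being freed immediately by refcounting. A common fix is to hold `self` through a `weakref` or to use a module-level function that takes the instance as an argument.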
[jira] [Resolved] (ARROW-6158) [Python] possible to create StructArray with type that conflicts with child array's types
[ https://issues.apache.org/jira/browse/ARROW-6158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-6158. - Fix Version/s: (was: 1.0.0) 0.15.0 Resolution: Fixed Issue resolved by pull request 5488 [https://github.com/apache/arrow/pull/5488] > [Python] possible to create StructArray with type that conflicts with child > array's types > - > > Key: ARROW-6158 > URL: https://issues.apache.org/jira/browse/ARROW-6158 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Using the Python interface as an example. This creates a {{StructArray}} where > the field types don't match the child array types: > {code} > a = pa.array([1, 2, 3], type=pa.int64()) > b = pa.array(['a', 'b', 'c'], type=pa.string()) > inconsistent_fields = [pa.field('a', pa.int32()), pa.field('b', pa.float64())] > a = pa.StructArray.from_arrays([a, b], fields=inconsistent_fields) > {code} > The above works fine. I didn't find anything that errors (e.g. conversion to > pandas, slicing), also validation passes, but the type actually has the > inconsistent child types: > {code} > In [2]: a > Out[2]: > > -- is_valid: all not null > -- child 0 type: int64 > [ > 1, > 2, > 3 > ] > -- child 1 type: string > [ > "a", > "b", > "c" > ] > In [3]: a.type > Out[3]: StructType(struct) > In [4]: a.to_pandas() > Out[4]: > array([{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3, 'b': 'c'}], > dtype=object) > In [5]: a.validate() > {code} > Shouldn't this be disallowed somehow? (it could be checked in the Python > {{from_arrays}} method, but maybe also in {{StructArray::Make}} which already > checks for the number of fields vs arrays and a consistent array length). > Similar to the discussion in ARROW-6132, I would also expect > {{ValidateArray}} to catch this. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6680) [Python] Add Array ctor microbenchmarks
Wes McKinney created ARROW-6680: --- Summary: [Python] Add Array ctor microbenchmarks Key: ARROW-6680 URL: https://issues.apache.org/jira/browse/ARROW-6680 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 1.0.0 Since more unavoidable validation is being added in e.g. https://github.com/apache/arrow/pull/5488 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6674) [Python] Fix or ignore the test warnings
[ https://issues.apache.org/jira/browse/ARROW-6674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-6674. - Fix Version/s: 0.15.0 Resolution: Fixed Issue resolved by pull request 5489 [https://github.com/apache/arrow/pull/5489] > [Python] Fix or ignore the test warnings > > > Key: ARROW-6674 > URL: https://issues.apache.org/jira/browse/ARROW-6674 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Minor > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Currently when running the python tests, we have a bunch of warnings. Some of > them can be ignored, some of them can be fixed. But it is better to do that > explicitly, so that new warnings (which can be potentially important to see) > get more attention. -- This message was sent by Atlassian Jira (v8.3.4#803005)
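The "ignore some, surface the rest explicitly" policy described in this issue can be sketched with the stdlib `warnings` module: escalate every warning to an error, then opt out the known, deliberately ignored ones. The module name in the ignore filter is hypothetical, and real test suites would typically express the same filters in pytest configuration instead.

```python
import warnings

def configure_warnings():
    """Fail loudly on any new warning; ignore only what is listed."""
    warnings.resetwarnings()
    warnings.simplefilter("error")          # every warning becomes an error
    warnings.filterwarnings(                # known, deliberately ignored
        "ignore", category=DeprecationWarning, module="some_thirdparty")

def emits_deprecation():
    warnings.warn("old API", DeprecationWarning)

configure_warnings()
try:
    emits_deprecation()          # not from "some_thirdparty", so it escalates
    escalated = False
except DeprecationWarning:
    escalated = True
```

With this setup, a newly introduced warning fails the test run instead of scrolling past unnoticed, which is the attention-forcing behavior the issue asks for.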
[jira] [Comment Edited] (ARROW-6575) [JS] decimal toString does not support negative values
[ https://issues.apache.org/jira/browse/ARROW-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934909#comment-16934909 ] Paul Taylor edited comment on ARROW-6575 at 9/25/19 2:24 AM: - [~zad] Yeah I couldn't figure out how to propagate the sign bit through the decimal conversion. I'd be happy to review a PR if you know the right way to do it. was (Author: paul.e.taylor): Yeah, I couldn't figure out how to propagate the sign bit through the decimal conversion. I'd be happy to review a PR if you know the right way to do it. > [JS] decimal toString does not support negative values > -- > > Key: ARROW-6575 > URL: https://issues.apache.org/jira/browse/ARROW-6575 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Affects Versions: 0.14.1 >Reporter: Andong Zhan >Priority: Critical > > The main description is here: [https://github.com/apache/arrow/issues/5397] > Also, I have a simple test case (slightly changed generate-test-data.js and > generated-data-validators): > {code:java} > export const decimal = (length = 2, nullCount = length * 0.2 | 0, scale = 0, > precision = 38) => vectorGenerator.visit(new Decimal(scale, precision), > length, nullCount); > function fillDecimal(length: number) { > // const BPE = Uint32Array.BYTES_PER_ELEMENT; // 4 > const array = new Uint32Array(length); > // const max = (2 ** (8 * BPE)) - 1; > // for (let i = -1; ++i < length; array[i] = rand() * max * (rand() > 0.5 > ? -1 : 1)); > array[0] = 0; > array[1] = 1286889712; > array[2] = 2218195178; > array[3] = 4282345521; > array[4] = 0; > array[5] = 16004768; > array[6] = 3587851993; > array[7] = 126217744; > return array; > } > {code} > and the expected value should be > {code:java} > expect(vector.get(0).toString()).toBe('-1'); > expect(vector.get(1).toString()).toBe('1'); > {code} > However, the actual first value is 339282366920938463463374607431768211456 > which is wrong! The second value is correct by the way. 
> I believe the bug is in the function called > function decimalToString>(a: T) because it cannot > return a negative value at all. > [arrow/js/src/util/bn.ts|https://github.com/apache/arrow/blob/d54425de19b7dbb2764a40355d76d1c785cf64ec/js/src/util/bn.ts#L99] > Line 99 -- This message was sent by Atlassian Jira (v8.3.4#803005)
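The root cause described above boils down to reading a two's-complement value as if it were unsigned. A Python sketch (not the JS fix) of decoding Arrow's Decimal128 storage, four little-endian 32-bit words, with and without a sign:

```python
import struct

def words_to_int(words, signed):
    """Decode four little-endian uint32 words as one 128-bit integer."""
    raw = struct.pack("<4I", *words)          # 16 bytes, little-endian
    return int.from_bytes(raw, "little", signed=signed)

# All-ones storage reads as 2**128 - 1 unsigned, but as -1 in two's
# complement; a correct decimalToString must take the signed view.
assert words_to_int([0xFFFFFFFF] * 4, signed=False) == 2**128 - 1
assert words_to_int([0xFFFFFFFF] * 4, signed=True) == -1
```

This also explains the symptom in the report: an unsigned read of a small negative decimal lands near 2**128 (e.g. a value like 339282366920938463463374607431768211456) instead of producing a minus sign.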
[jira] [Updated] (ARROW-6679) [RELEASE] autobrew license in LICENSE.txt is not acceptable
[ https://issues.apache.org/jira/browse/ARROW-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6679: -- Labels: pull-request-available (was: ) > [RELEASE] autobrew license in LICENSE.txt is not acceptable > --- > > Key: ARROW-6679 > URL: https://issues.apache.org/jira/browse/ARROW-6679 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Wes McKinney >Priority: Blocker > Labels: pull-request-available > Fix For: 0.15.0 > > > {code} > This project includes code from the autobrew project. > * r/tools/autobrew and dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb > are based on code from the autobrew project. > Copyright: Copyright (c) 2017 - 2019, Jeroen Ooms. > All rights reserved. > Homepage: https://github.com/jeroen/autobrew > {code} > This code needs to be made available under a Category A license > https://apache.org/legal/resolved.html#category-a -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6086) [Rust] [DataFusion] Implement parallel execution for parquet scan
[ https://issues.apache.org/jira/browse/ARROW-6086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-6086: -- Fix Version/s: (was: 1.0.0) 0.15.0 > [Rust] [DataFusion] Implement parallel execution for parquet scan > - > > Key: ARROW-6086 > URL: https://issues.apache.org/jira/browse/ARROW-6086 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6649) [R] print() methods for Table, RecordBatch, etc.
[ https://issues.apache.org/jira/browse/ARROW-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-6649. Fix Version/s: (was: 1.0.0) 0.15.0 Resolution: Fixed Issue resolved by pull request 5492 [https://github.com/apache/arrow/pull/5492] > [R] print() methods for Table, RecordBatch, etc. > > > Key: ARROW-6649 > URL: https://issues.apache.org/jira/browse/ARROW-6649 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Inspired by tibble: show schema, head of data, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6086) [Rust] [DataFusion] Implement parallel execution for parquet scan
[ https://issues.apache.org/jira/browse/ARROW-6086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6086: -- Labels: pull-request-available (was: ) > [Rust] [DataFusion] Implement parallel execution for parquet scan > - > > Key: ARROW-6086 > URL: https://issues.apache.org/jira/browse/ARROW-6086 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6679) [RELEASE] autobrew license in LICENSE.txt is not acceptable
[ https://issues.apache.org/jira/browse/ARROW-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937325#comment-16937325 ] Wes McKinney commented on ARROW-6679: - Right, either the file needs an appropriate license applied or it needs to be removed. > [RELEASE] autobrew license in LICENSE.txt is not acceptable > --- > > Key: ARROW-6679 > URL: https://issues.apache.org/jira/browse/ARROW-6679 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Wes McKinney >Priority: Blocker > Fix For: 0.15.0 > > > {code} > This project includes code from the autobrew project. > * r/tools/autobrew and dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb > are based on code from the autobrew project. > Copyright: Copyright (c) 2017 - 2019, Jeroen Ooms. > All rights reserved. > Homepage: https://github.com/jeroen/autobrew > {code} > This code needs to be made available under a Category A license > https://apache.org/legal/resolved.html#category-a -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6086) [Rust] [DataFusion] Implement parallel execution for parquet scan
[ https://issues.apache.org/jira/browse/ARROW-6086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-6086: -- Fix Version/s: (was: 0.15.0) 1.0.0 > [Rust] [DataFusion] Implement parallel execution for parquet scan > - > > Key: ARROW-6086 > URL: https://issues.apache.org/jira/browse/ARROW-6086 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (ARROW-6086) [Rust] [DataFusion] Implement parallel execution for parquet scan
[ https://issues.apache.org/jira/browse/ARROW-6086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove reopened ARROW-6086: --- This was not actually fully implemented and needs further work. > [Rust] [DataFusion] Implement parallel execution for parquet scan > - > > Key: ARROW-6086 > URL: https://issues.apache.org/jira/browse/ARROW-6086 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 0.15.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6679) [RELEASE] autobrew license in LICENSE.txt is not acceptable
[ https://issues.apache.org/jira/browse/ARROW-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937274#comment-16937274 ] Neal Richardson commented on ARROW-6679: Sorry, I thought this was dealt with adequately in https://github.com/apache/arrow/pull/5095 (see discussion). What are the options for resolution? Jeroen adds a license file to https://github.com/jeroen/autobrew, or we remove the file? > [RELEASE] autobrew license in LICENSE.txt is not acceptable > --- > > Key: ARROW-6679 > URL: https://issues.apache.org/jira/browse/ARROW-6679 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Wes McKinney >Priority: Blocker > Fix For: 0.15.0 > > > {code} > This project includes code from the autobrew project. > * r/tools/autobrew and dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb > are based on code from the autobrew project. > Copyright: Copyright (c) 2017 - 2019, Jeroen Ooms. > All rights reserved. > Homepage: https://github.com/jeroen/autobrew > {code} > This code needs to be made available under a Category A license > https://apache.org/legal/resolved.html#category-a -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6679) [RELEASE] autobrew license in LICENSE.txt is not acceptable
[ https://issues.apache.org/jira/browse/ARROW-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937264#comment-16937264 ] Wes McKinney commented on ARROW-6679: - cc [~jeroenooms] > [RELEASE] autobrew license in LICENSE.txt is not acceptable > --- > > Key: ARROW-6679 > URL: https://issues.apache.org/jira/browse/ARROW-6679 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Wes McKinney >Priority: Blocker > Fix For: 0.15.0 > > > {code} > This project includes code from the autobrew project. > * r/tools/autobrew and dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb > are based on code from the autobrew project. > Copyright: Copyright (c) 2017 - 2019, Jeroen Ooms. > All rights reserved. > Homepage: https://github.com/jeroen/autobrew > {code} > This code needs to be made available under a Category A license > https://apache.org/legal/resolved.html#category-a -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6678) [C++] Regression in Parquet file compatibility introduced by ARROW-3246
[ https://issues.apache.org/jira/browse/ARROW-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6678: -- Labels: pull-request-available (was: ) > [C++] Regression in Parquet file compatibility introduced by ARROW-3246 > --- > > Key: ARROW-6678 > URL: https://issues.apache.org/jira/browse/ARROW-6678 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Blocker > Labels: pull-request-available > Fix For: 0.15.0 > > > I randomly discovered that this script fails after applying the patch for > ARROW-3246 > https://github.com/apache/arrow/commit/2ba0566b29312e84fafc987fd8dc9664748be96a > {code} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > import fastparquet as fp > df = pd.util.testing.makeDataFrame() > pq.write_table(pa.table(df), 'test.parquet') > fp.ParquetFile('test.parquet') > {code} > with > {code} > Traceback (most recent call last): > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 110, in __init__ > with open_with(fn2, 'rb') as f: > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/util.py", > line 38, in default_open > return open(f, mode) > NotADirectoryError: [Errno 20] Not a directory: 'test.parquet/_metadata' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "test.py", line 10, in > fp.ParquetFile('test.parquet') > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 116, in __init__ > self._parse_header(f, verify) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 135, in _parse_header > fmd = read_thrift(f, parquet_thrift.FileMetaData) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/thrift_structures.py", > line 25, in read_thrift > obj.read(pin) > 
File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py", > line 1929, in read > iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec]) > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 0: > invalid start byte > {code} > I don't recall making any metadata-related changes but I'm going to review > the patch to see if I can narrow down where the problem is to see whether > it's a bug with Arrow/parquet-cpp or with the third party library -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6679) [RELEASE] autobrew license in LICENSE.txt is not acceptable
Wes McKinney created ARROW-6679: --- Summary: [RELEASE] autobrew license in LICENSE.txt is not acceptable Key: ARROW-6679 URL: https://issues.apache.org/jira/browse/ARROW-6679 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Wes McKinney Fix For: 0.15.0 {code} This project includes code from the autobrew project. * r/tools/autobrew and dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb are based on code from the autobrew project. Copyright: Copyright (c) 2017 - 2019, Jeroen Ooms. All rights reserved. Homepage: https://github.com/jeroen/autobrew {code} This code needs to be made available under a Category A license https://apache.org/legal/resolved.html#category-a -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6678) [C++] Regression in Parquet file compatibility introduced by ARROW-3246
[ https://issues.apache.org/jira/browse/ARROW-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937250#comment-16937250 ] Wes McKinney commented on ARROW-6678: - The problem is that the serialized schema needs to be base64 encoded because Thrift string types must be UTF-8. https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift#L593 Working on a patch > [C++] Regression in Parquet file compatibility introduced by ARROW-3246 > --- > > Key: ARROW-6678 > URL: https://issues.apache.org/jira/browse/ARROW-6678 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Blocker > Fix For: 0.15.0 > > > I randomly discovered that this script fails after applying the patch for > ARROW-3246 > https://github.com/apache/arrow/commit/2ba0566b29312e84fafc987fd8dc9664748be96a > {code} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > import fastparquet as fp > df = pd.util.testing.makeDataFrame() > pq.write_table(pa.table(df), 'test.parquet') > fp.ParquetFile('test.parquet') > {code} > with > {code} > Traceback (most recent call last): > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 110, in __init__ > with open_with(fn2, 'rb') as f: > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/util.py", > line 38, in default_open > return open(f, mode) > NotADirectoryError: [Errno 20] Not a directory: 'test.parquet/_metadata' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "test.py", line 10, in > fp.ParquetFile('test.parquet') > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 116, in __init__ > self._parse_header(f, verify) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 135, in _parse_header > fmd = read_thrift(f, 
parquet_thrift.FileMetaData) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/thrift_structures.py", > line 25, in read_thrift > obj.read(pin) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py", > line 1929, in read > iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec]) > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 0: > invalid start byte > {code} > I don't recall making any metadata-related changes but I'm going to review > the patch to see if I can narrow down where the problem is to see whether > it's a bug with Arrow/parquet-cpp or with the third party library -- This message was sent by Atlassian Jira (v8.3.4#803005)
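The fix Wes describes — base64-encoding the serialized schema because Thrift {{string}} fields must be valid UTF-8 — can be illustrated with a small, pure-stdlib sketch. The byte string below is a made-up stand-in for the binary "ARROW:schema" flatbuffer payload, not a real serialized schema:

```python
import base64

# Hypothetical stand-in for a serialized Arrow schema: arbitrary binary
# bytes, like the flatbuffer stored under the "ARROW:schema" metadata key.
raw_schema_bytes = b"\xb4\x01\x00\x00schema-flatbuffer-payload"

# Storing the raw bytes in a Thrift `string` field fails, because Thrift
# strings must be valid UTF-8 and 0xb4 is not a valid UTF-8 start byte —
# exactly the UnicodeDecodeError fastparquet hits in the traceback above.
try:
    raw_schema_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Base64-encoding first yields pure ASCII, which is always valid UTF-8,
# so the metadata value survives a round trip through Thrift readers.
encoded = base64.b64encode(raw_schema_bytes).decode("ascii")
decoded = base64.b64decode(encoded)
assert not utf8_ok
assert decoded == raw_schema_bytes
```

This is why the patch encodes the schema before writing it into the Parquet key-value metadata and decodes it again on read.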
[jira] [Commented] (ARROW-6678) [C++] Regression in Parquet file compatibility introduced by ARROW-3246
[ https://issues.apache.org/jira/browse/ARROW-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937247#comment-16937247 ] Wes McKinney commented on ARROW-6678: - Luckily I did not have to search long. This is caused by the code that adds the "ARROW:schema" metadata field that was added here https://github.com/apache/arrow/commit/2ba0566b29312e84fafc987fd8dc9664748be96a#diff-806bd9c3d77823ae1bff914269e7db02R592 Investigating further > [C++] Regression in Parquet file compatibility introduced by ARROW-3246 > --- > > Key: ARROW-6678 > URL: https://issues.apache.org/jira/browse/ARROW-6678 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Blocker > Fix For: 0.15.0 > > > I randomly discovered that this script fails after applying the patch for > ARROW-3246 > https://github.com/apache/arrow/commit/2ba0566b29312e84fafc987fd8dc9664748be96a > {code} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > import fastparquet as fp > df = pd.util.testing.makeDataFrame() > pq.write_table(pa.table(df), 'test.parquet') > fp.ParquetFile('test.parquet') > {code} > with > {code} > Traceback (most recent call last): > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 110, in __init__ > with open_with(fn2, 'rb') as f: > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/util.py", > line 38, in default_open > return open(f, mode) > NotADirectoryError: [Errno 20] Not a directory: 'test.parquet/_metadata' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "test.py", line 10, in > fp.ParquetFile('test.parquet') > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 116, in __init__ > self._parse_header(f, verify) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > 
line 135, in _parse_header > fmd = read_thrift(f, parquet_thrift.FileMetaData) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/thrift_structures.py", > line 25, in read_thrift > obj.read(pin) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py", > line 1929, in read > iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec]) > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 0: > invalid start byte > {code} > I don't recall making any metadata-related changes but I'm going to review > the patch to see if I can narrow down where the problem is to see whether > it's a bug with Arrow/parquet-cpp or with the third party library -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6678) [C++] Regression in Parquet file compatibility introduced by ARROW-3246
Wes McKinney created ARROW-6678: --- Summary: [C++] Regression in Parquet file compatibility introduced by ARROW-3246 Key: ARROW-6678 URL: https://issues.apache.org/jira/browse/ARROW-6678 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 0.15.0 I randomly discovered that this script fails after applying the patch for ARROW-3246 https://github.com/apache/arrow/commit/2ba0566b29312e84fafc987fd8dc9664748be96a {code} import pandas as pd import pyarrow as pa import pyarrow.parquet as pq import fastparquet as fp df = pd.util.testing.makeDataFrame() pq.write_table(pa.table(df), 'test.parquet') fp.ParquetFile('test.parquet') {code} with {code} Traceback (most recent call last): File "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", line 110, in __init__ with open_with(fn2, 'rb') as f: File "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/util.py", line 38, in default_open return open(f, mode) NotADirectoryError: [Errno 20] Not a directory: 'test.parquet/_metadata' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "test.py", line 10, in fp.ParquetFile('test.parquet') File "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", line 116, in __init__ self._parse_header(f, verify) File "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", line 135, in _parse_header fmd = read_thrift(f, parquet_thrift.FileMetaData) File "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/thrift_structures.py", line 25, in read_thrift obj.read(pin) File "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py", line 1929, in read iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec]) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 0: invalid start byte {code} I don't 
recall making any metadata-related changes but I'm going to review the patch to see if I can narrow down where the problem is to see whether it's a bug with Arrow/parquet-cpp or with the third party library -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6629) [Doc][C++] Document the FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-6629. Fix Version/s: (was: 1.0.0) 0.15.0 Resolution: Fixed Issue resolved by pull request 5487 [https://github.com/apache/arrow/pull/5487] > [Doc][C++] Document the FileSystem API > -- > > Key: ARROW-6629 > URL: https://issues.apache.org/jira/browse/ARROW-6629 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Neal Richardson >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > In ARROW-6622, I was looking for a place in the docs to add about path > normalization, and I couldn't find filesystem docs at all. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-5379) [Python] support pandas' nullable Integer type in from_pandas
[ https://issues.apache.org/jira/browse/ARROW-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937175#comment-16937175 ] Daniel Nugent edited comment on ARROW-5379 at 9/24/19 8:30 PM: --- Is this actually something that would be appropriate to implement with extension types? It just requires that the mask parameter of pa.Array actually be used with Pandas integer columns, right? was (Author: nugend): Is this actually something that would be appropriate to implement with extension types? It just requires that the mask actually be used with Pandas integer columns. > [Python] support pandas' nullable Integer type in from_pandas > - > > Key: ARROW-5379 > URL: https://issues.apache.org/jira/browse/ARROW-5379 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > > From https://github.com/apache/arrow/issues/4168. We should add support for > pandas' nullable Integer extension dtypes, as those could map nicely to > Arrow's integer types. > Ideally this happens in a generic way though, and not specific for this > extension type, which is discussed in ARROW-5271 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-5379) [Python] support pandas' nullable Integer type in from_pandas
[ https://issues.apache.org/jira/browse/ARROW-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937175#comment-16937175 ] Daniel Nugent commented on ARROW-5379: -- Is this actually something that would be appropriate to implement with extension types? It just requires that the mask actually be used with Pandas integer columns. > [Python] support pandas' nullable Integer type in from_pandas > - > > Key: ARROW-5379 > URL: https://issues.apache.org/jira/browse/ARROW-5379 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > > From https://github.com/apache/arrow/issues/4168. We should add support for > pandas' nullable Integer extension dtypes, as those could map nicely to > Arrow's integer types. > Ideally this happens in a generic way though, and not specific for this > extension type, which is discussed in ARROW-5271 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6649) [R] print() methods for Table, RecordBatch, etc.
[ https://issues.apache.org/jira/browse/ARROW-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6649: -- Labels: pull-request-available (was: ) > [R] print() methods for Table, RecordBatch, etc. > > > Key: ARROW-6649 > URL: https://issues.apache.org/jira/browse/ARROW-6649 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Inspired by tibble: show schema, head of data, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6649) [R] print() methods for Table, RecordBatch, etc.
[ https://issues.apache.org/jira/browse/ARROW-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-6649: -- Assignee: Neal Richardson > [R] print() methods for Table, RecordBatch, etc. > > > Key: ARROW-6649 > URL: https://issues.apache.org/jira/browse/ARROW-6649 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Inspired by tibble: show schema, head of data, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-3850) [Python] Support MapType and StructType for enhanced PySpark integration
[ https://issues.apache.org/jira/browse/ARROW-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937159#comment-16937159 ] Bryan Cutler commented on ARROW-3850: - Now that SPARK-23836 is merged, a scalar Pandas UDF can return a StructType that will accept a pandas.DataFrame. By nested structs, I mean a column of StructType that has a child that is a StructType. Spark does not currently support this as an input column, or return type from Pandas UDFs. > [Python] Support MapType and StructType for enhanced PySpark integration > > > Key: ARROW-3850 > URL: https://issues.apache.org/jira/browse/ARROW-3850 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Affects Versions: 0.11.1 >Reporter: Florian Wilhelm >Priority: Major > Fix For: 1.0.0 > > > It would be great to support MapType and (nested) StructType in Arrow so that > PySpark can make use of it. > > Quite often, as in my use case, complex types are also saved in Hive table > cells. Currently it's not possible to use the new > {{[pandas_udf|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode#pyspark.sql.functions.pandas_udf]}} > decorator which internally uses Arrow to generate a UDF for columns with > complex types. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6677) [FlightRPC][C++] Document using Flight in C++
[ https://issues.apache.org/jira/browse/ARROW-6677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6677: -- Labels: pull-request-available (was: ) > [FlightRPC][C++] Document using Flight in C++ > - > > Key: ARROW-6677 > URL: https://issues.apache.org/jira/browse/ARROW-6677 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation, FlightRPC >Reporter: lidavidm >Assignee: lidavidm >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Similarly to ARROW-6390 for Python, we should have C++ documentation for > Flight. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6677) [FlightRPC][C++] Document using Flight in C++
lidavidm created ARROW-6677: --- Summary: [FlightRPC][C++] Document using Flight in C++ Key: ARROW-6677 URL: https://issues.apache.org/jira/browse/ARROW-6677 Project: Apache Arrow Issue Type: Bug Components: Documentation, FlightRPC Reporter: lidavidm Assignee: lidavidm Fix For: 1.0.0 Similarly to ARROW-6390 for Python, we should have C++ documentation for Flight. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6187) [C++] fallback to storage type when writing ExtensionType to Parquet
[ https://issues.apache.org/jira/browse/ARROW-6187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-6187. --- Fix Version/s: (was: 1.0.0) 0.15.0 Resolution: Fixed Issue resolved by pull request 5436 [https://github.com/apache/arrow/pull/5436] > [C++] fallback to storage type when writing ExtensionType to Parquet > > > Key: ARROW-6187 > URL: https://issues.apache.org/jira/browse/ARROW-6187 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: parquet, pull-request-available > Fix For: 0.15.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Writing a table that contains an ExtensionType array to a parquet file is not > yet implemented. It currently raises "ArrowNotImplementedError: Unhandled > type for Arrow to Parquet schema conversion: > extension" (for a PyExtensionType in this case). > I think minimal support can consist of writing the storage type / array. > We also might want to save the extension name and metadata in the parquet > FileMetadata. > Later on, this could potentially be used to restore the extension type > when reading. This is related to other issues that need to save the arrow > schema (categorical: ARROW-5480, time zones: ARROW-5888). Only in this case, > we probably want to store the serialised type in addition to the schema > (which only has the extension type's name). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6613) [C++] Remove dependency on boost::filesystem
[ https://issues.apache.org/jira/browse/ARROW-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937028#comment-16937028 ] Antoine Pitrou commented on ARROW-6613: --- I've started looking into this. There's some non-trivial Windows-specific code in boost::filesystem to handle reparse points and symlinks. It feels a bit counter-productive to copy/paste it without knowing exactly what it does. > [C++] Remove dependency on boost::filesystem > > > Key: ARROW-6613 > URL: https://issues.apache.org/jira/browse/ARROW-6613 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > See ARROW-2196 for details. > boost::filesystem should not be required for base functionality at least > (including filesystems, probably). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6675) [JS] Add scanReverse function
[ https://issues.apache.org/jira/browse/ARROW-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6675: --- Component/s: JavaScript > [JS] Add scanReverse function > - > > Key: ARROW-6675 > URL: https://issues.apache.org/jira/browse/ARROW-6675 > Project: Apache Arrow > Issue Type: New Feature > Components: JavaScript >Reporter: Malcolm MacLachlan >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > * Add scanReverse function to dataFrame and filteredDataframe > * Update tests -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6675) [JS] Add scanReverse function
[ https://issues.apache.org/jira/browse/ARROW-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6675: --- Summary: [JS] Add scanReverse function (was: Add scanReverse function) > [JS] Add scanReverse function > - > > Key: ARROW-6675 > URL: https://issues.apache.org/jira/browse/ARROW-6675 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Malcolm MacLachlan >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > * Add scanReverse function to dataFrame and filteredDataframe > * Update tests -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6676) [C++] [Parquet] Refactor encoding/decoding APIs for clarity
Benjamin Kietzman created ARROW-6676: Summary: [C++] [Parquet] Refactor encoding/decoding APIs for clarity Key: ARROW-6676 URL: https://issues.apache.org/jira/browse/ARROW-6676 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Benjamin Kietzman Assignee: Benjamin Kietzman {{encoding.h}} and {{encoding.cc}} are difficult to read and rewrite. I think there are also lost opportunities for more generic implementations. Simplify/winnow the interfaces while keeping an eye on the benchmarks for performance regressions -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6675) Add scanReverse function
[ https://issues.apache.org/jira/browse/ARROW-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936980#comment-16936980 ] Malcolm MacLachlan commented on ARROW-6675: --- [https://github.com/apache/arrow/pull/5480] > Add scanReverse function > > > Key: ARROW-6675 > URL: https://issues.apache.org/jira/browse/ARROW-6675 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Malcolm MacLachlan >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > * Add scanReverse function to dataFrame and filteredDataframe > * Update tests -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6675) Add scanReverse function
[ https://issues.apache.org/jira/browse/ARROW-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6675: -- Labels: pull-request-available (was: ) > Add scanReverse function > > > Key: ARROW-6675 > URL: https://issues.apache.org/jira/browse/ARROW-6675 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Malcolm MacLachlan >Priority: Minor > Labels: pull-request-available > > * Add scanReverse function to dataFrame and filteredDataframe > * Update tests -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6675) Add scanReverse function
Malcolm MacLachlan created ARROW-6675: - Summary: Add scanReverse function Key: ARROW-6675 URL: https://issues.apache.org/jira/browse/ARROW-6675 Project: Apache Arrow Issue Type: New Feature Reporter: Malcolm MacLachlan * Add scanReverse function to dataFrame and filteredDataframe * Update tests -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6674) [Python] Fix or ignore the test warnings
[ https://issues.apache.org/jira/browse/ARROW-6674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6674: -- Labels: pull-request-available (was: ) > [Python] Fix or ignore the test warnings > > > Key: ARROW-6674 > URL: https://issues.apache.org/jira/browse/ARROW-6674 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Minor > Labels: pull-request-available > > Currently when running the python tests, we have a bunch of warnings. Some of > them can be ignored, some of them can be fixed. But it is better to do that > explicitly, so that new warnings (which can be potentially important to see) > get more attention. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6158) [Python] possible to create StructArray with type that conflicts with child array's types
[ https://issues.apache.org/jira/browse/ARROW-6158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-6158: Assignee: Joris Van den Bossche > [Python] possible to create StructArray with type that conflicts with child > array's types > - > > Key: ARROW-6158 > URL: https://issues.apache.org/jira/browse/ARROW-6158 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Using the Python interface as example. This creates a {{StructArray}} where > the field types don't match the child array types: > {code} > a = pa.array([1, 2, 3], type=pa.int64()) > b = pa.array(['a', 'b', 'c'], type=pa.string()) > inconsistent_fields = [pa.field('a', pa.int32()), pa.field('b', pa.float64())] > a = pa.StructArray.from_arrays([a, b], fields=inconsistent_fields) > {code} > The above works fine. I didn't find anything that errors (eg conversion to > pandas, slicing), also validation passes, but the type actually has the > inconsistent child types: > {code} > In [2]: a > Out[2]: > > -- is_valid: all not null > -- child 0 type: int64 > [ > 1, > 2, > 3 > ] > -- child 1 type: string > [ > "a", > "b", > "c" > ] > In [3]: a.type > Out[3]: StructType(struct) > In [4]: a.to_pandas() > Out[4]: > array([{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3, 'b': 'c'}], > dtype=object) > In [5]: a.validate() > {code} > Shouldn't this be disallowed somehow? (it could be checked in the Python > {{from_arrays}} method, but maybe also in {{StructArray::Make}} which already > checks for the number of fields vs arrays and a consistent array length). > Similarly to discussion in ARROW-6132, I would also expect that > {{ValidateArray}} catches this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6674) [Python] Fix or ignore the test warnings
[ https://issues.apache.org/jira/browse/ARROW-6674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-6674: Assignee: Joris Van den Bossche > [Python] Fix or ignore the test warnings > > > Key: ARROW-6674 > URL: https://issues.apache.org/jira/browse/ARROW-6674 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Minor > > Currently when running the python tests, we have a bunch of warnings. Some of > them can be ignored, some of them can be fixed. But it is better to do that > explicitly, so that new warnings (which can be potentially important to see) > get more attention. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6674) [Python] Fix or ignore the test warnings
Joris Van den Bossche created ARROW-6674: Summary: [Python] Fix or ignore the test warnings Key: ARROW-6674 URL: https://issues.apache.org/jira/browse/ARROW-6674 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Currently when running the python tests, we have a bunch of warnings. Some of them can be ignored, some of them can be fixed. But it is better to do that explicitly, so that new warnings (which can be potentially important to see) get more attention. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6669) [Rust] [DataFusion] Implement physical expression for binary expressions
[ https://issues.apache.org/jira/browse/ARROW-6669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paddy Horan resolved ARROW-6669. Fix Version/s: 0.15.0 Resolution: Fixed Issue resolved by pull request 5478 [https://github.com/apache/arrow/pull/5478] > [Rust] [DataFusion] Implement physical expression for binary expressions > > > Key: ARROW-6669 > URL: https://issues.apache.org/jira/browse/ARROW-6669 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Implement comparison operators (<, <=, >, >=, =, !=) as well as binary > operators AND and OR. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6673) [Python] Consider separating libarrow.pxd into multiple definition files
Krisztian Szucs created ARROW-6673: -- Summary: [Python] Consider separating libarrow.pxd into multiple definition files Key: ARROW-6673 URL: https://issues.apache.org/jira/browse/ARROW-6673 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Krisztian Szucs See discussion https://github.com/apache/arrow/pull/5423#discussion_r327522836 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6613) [C++] Remove dependency on boost::filesystem
[ https://issues.apache.org/jira/browse/ARROW-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936855#comment-16936855 ] Wes McKinney commented on ARROW-6613: - I think we should remove Boost as a dependency of the _core_ build. Which may mean not building certain modules (like the code in src/arrow/filesystem) by default. I think that should make things easier > [C++] Remove dependency on boost::filesystem > > > Key: ARROW-6613 > URL: https://issues.apache.org/jira/browse/ARROW-6613 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > See ARROW-2196 for details. > boost::filesystem should not be required for base functionality at least > (including filesystems, probably). -- This message was sent by Atlassian Jira (v8.3.4#803005)
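The "core build" split Wes describes would amount to a CMake toggle that leaves src/arrow/filesystem, and with it the Boost dependency, out of the build unless explicitly requested. A rough sketch only, not the project's actual CMakeLists.txt; the {{ARROW_FILESYSTEM}} option name follows the suggestion elsewhere in this thread, and everything else is assumption:

```cmake
# Hypothetical sketch: gate the filesystem layer behind an option so the
# core Array / RecordBatch / IPC build does not pull in boost::filesystem.
option(ARROW_FILESYSTEM "Build the arrow filesystem layer" OFF)

if(ARROW_FILESYSTEM)
  # Only this module needs Boost in such a layout.
  find_package(Boost REQUIRED COMPONENTS filesystem system)
  add_subdirectory(src/arrow/filesystem)
  target_link_libraries(arrow_filesystem PRIVATE Boost::filesystem)
endif()
```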
[jira] [Resolved] (ARROW-3777) [C++] Implement a mock "high latency" filesystem
[ https://issues.apache.org/jira/browse/ARROW-3777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-3777. --- Fix Version/s: (was: 1.0.0) 0.15.0 Resolution: Fixed Issue resolved by pull request 5439 [https://github.com/apache/arrow/pull/5439] > [C++] Implement a mock "high latency" filesystem > > > Key: ARROW-3777 > URL: https://issues.apache.org/jira/browse/ARROW-3777 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > Some of our tools don't perform well out of the box for filesystems with high > latency reads, like cloud blob stores. In such cases, it may be better to use > buffered reads with a larger read ahead window. Having a mock filesystem to > introduce latency into reads will help with testing / developing APIs for this -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6613) [C++] Remove dependency on boost::filesystem
[ https://issues.apache.org/jira/browse/ARROW-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936833#comment-16936833 ] Antoine Pitrou commented on ARROW-6613: --- [~wesm] We should perhaps discuss whether it's really useful to have a bare-bones no-Boost build. We will probably end up copy-pasting some boost code along the way. > [C++] Remove dependency on boost::filesystem > > > Key: ARROW-6613 > URL: https://issues.apache.org/jira/browse/ARROW-6613 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > See ARROW-2196 for details. > boost::filesystem should not be required for base functionality at least > (including filesystems, probably). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6213) [C++] tests fail for AVX512
[ https://issues.apache.org/jira/browse/ARROW-6213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936829#comment-16936829 ] Wes McKinney commented on ARROW-6213: - It does. I'll create an account, add your SSH keys (https://github.com/pitrou.keys) and send you the connection information privately > [C++] tests fail for AVX512 > --- > > Key: ARROW-6213 > URL: https://issues.apache.org/jira/browse/ARROW-6213 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.14.1 > Environment: CentOS 7.6.1810, Intel Xeon Processor (Skylake, IBRS) > avx512 >Reporter: Charles Coulombe >Priority: Minor > Fix For: 2.0.0 > > Attachments: arrow-0.14.1-c++-failed-tests-cmake-conf.txt, > arrow-0.14.1-c++-failed-tests.txt > > > When building libraries for avx512 with GCC 7.3.0, two C++ tests fails. > {noformat} > The following tests FAILED: > 28 - arrow-compute-compare-test (Failed) > 30 - arrow-compute-filter-test (Failed) > Errors while running CTest{noformat} > while for avx2 they passes. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6158) [Python] possible to create StructArray with type that conflicts with child array's types
[ https://issues.apache.org/jira/browse/ARROW-6158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6158: -- Labels: pull-request-available (was: ) > [Python] possible to create StructArray with type that conflicts with child > array's types > - > > Key: ARROW-6158 > URL: https://issues.apache.org/jira/browse/ARROW-6158 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Using the Python interface as an example. This creates a {{StructArray}} where > the field types don't match the child array types: > {code} > a = pa.array([1, 2, 3], type=pa.int64()) > b = pa.array(['a', 'b', 'c'], type=pa.string()) > inconsistent_fields = [pa.field('a', pa.int32()), pa.field('b', pa.float64())] > a = pa.StructArray.from_arrays([a, b], fields=inconsistent_fields) > {code} > The above works fine. I didn't find anything that errors (eg conversion to > pandas, slicing), also validation passes, but the type actually has the > inconsistent child types: > {code} > In [2]: a > Out[2]: > > -- is_valid: all not null > -- child 0 type: int64 > [ > 1, > 2, > 3 > ] > -- child 1 type: string > [ > "a", > "b", > "c" > ] > In [3]: a.type > Out[3]: StructType(struct) > In [4]: a.to_pandas() > Out[4]: > array([{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3, 'b': 'c'}], > dtype=object) > In [5]: a.validate() > {code} > Shouldn't this be disallowed somehow? (it could be checked in the Python > {{from_arrays}} method, but maybe also in {{StructArray::Make}} which already > checks for the number of fields vs arrays and a consistent array length). > Similarly to the discussion in ARROW-6132, I would also expect that > {{ValidateArray}} catches this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6613) [C++] Remove dependency on boost::filesystem
[ https://issues.apache.org/jira/browse/ARROW-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936802#comment-16936802 ] Antoine Pitrou commented on ARROW-6613: --- boost::filesystem is used in conjunction with boost::process for testing Flight and S3FS. I don't think it's reasonable to reimplement boost::process. Still, we can try to make boost::filesystem unnecessary if tests are not built. > [C++] Remove dependency on boost::filesystem > > > Key: ARROW-6613 > URL: https://issues.apache.org/jira/browse/ARROW-6613 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > See ARROW-2196 for details. > boost::filesystem should not be required for base functionality at least > (including filesystems, probably). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6671) [C++] Sparse tensor naming
[ https://issues.apache.org/jira/browse/ARROW-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936786#comment-16936786 ] Wes McKinney commented on ARROW-6671: - I am OK with having the type be called "matrix". Good to make things as clear and consistent as possible > [C++] Sparse tensor naming > -- > > Key: ARROW-6671 > URL: https://issues.apache.org/jira/browse/ARROW-6671 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Assignee: Kenta Murata >Priority: Minor > > Currently there's {{SparseCOOIndex}} and {{SparseCSRIndex}}, but also > {{SparseTensorCOO}} and {{SparseTensorCSR}}. > For consistency, it would be nice to rename the latter {{SparseCOOTensor}} > and {{SparseCSRTensor}}. > Also, it's not obvious the {{SparseMatrixCSR}} alias is useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Issue Comment Deleted] (ARROW-6671) [C++] Sparse tensor naming
[ https://issues.apache.org/jira/browse/ARROW-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6671: Comment: was deleted (was: I don't know. We don't use the term "matrix" currently in Arrow. cc [~wesm]) > [C++] Sparse tensor naming > -- > > Key: ARROW-6671 > URL: https://issues.apache.org/jira/browse/ARROW-6671 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Assignee: Kenta Murata >Priority: Minor > > Currently there's {{SparseCOOIndex}} and {{SparseCSRIndex}}, but also > {{SparseTensorCOO}} and {{SparseTensorCSR}}. > For consistency, it would be nice to rename the latter {{SparseCOOTensor}} > and {{SparseCSRTensor}}. > Also, it's not obvious the {{SparseMatrixCSR}} alias is useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6671) [C++] Sparse tensor naming
[ https://issues.apache.org/jira/browse/ARROW-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936779#comment-16936779 ] Rok Mihevc commented on ARROW-6671: --- SparseCSRMatrix might be more misleading as it 'doesn't look like' a Tensor type. I think that is potentially more confusing than it being limited to 2D. +1 for the consistent naming > [C++] Sparse tensor naming > -- > > Key: ARROW-6671 > URL: https://issues.apache.org/jira/browse/ARROW-6671 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Assignee: Kenta Murata >Priority: Minor > > Currently there's {{SparseCOOIndex}} and {{SparseCSRIndex}}, but also > {{SparseTensorCOO}} and {{SparseTensorCSR}}. > For consistency, it would be nice to rename the latter {{SparseCOOTensor}} > and {{SparseCSRTensor}}. > Also, it's not obvious the {{SparseMatrixCSR}} alias is useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6629) [Doc][C++] Document the FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6629: -- Labels: pull-request-available (was: ) > [Doc][C++] Document the FileSystem API > -- > > Key: ARROW-6629 > URL: https://issues.apache.org/jira/browse/ARROW-6629 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Neal Richardson >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > In ARROW-6622, I was looking for a place in the docs to add about path > normalization, and I couldn't find filesystem docs at all. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6671) [C++] Sparse tensor naming
[ https://issues.apache.org/jira/browse/ARROW-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936727#comment-16936727 ] Antoine Pitrou commented on ARROW-6671: --- I don't know. We don't use the term "matrix" currently in Arrow. cc [~wesm] > [C++] Sparse tensor naming > -- > > Key: ARROW-6671 > URL: https://issues.apache.org/jira/browse/ARROW-6671 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Assignee: Kenta Murata >Priority: Minor > > Currently there's {{SparseCOOIndex}} and {{SparseCSRIndex}}, but also > {{SparseTensorCOO}} and {{SparseTensorCSR}}. > For consistency, it would be nice to rename the latter {{SparseCOOTensor}} > and {{SparseCSRTensor}}. > Also, it's not obvious the {{SparseMatrixCSR}} alias is useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6671) [C++] Sparse tensor naming
[ https://issues.apache.org/jira/browse/ARROW-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936724#comment-16936724 ] Kenta Murata commented on ARROW-6671: - Indeed. I want to make their names consistent, so I'll make a pull request tomorrow. [~apitrou] How about employing SparseCSRMatrix rather than SparseCSRTensor, because it cannot represent a tensor with more than 2 dimensions? > [C++] Sparse tensor naming > -- > > Key: ARROW-6671 > URL: https://issues.apache.org/jira/browse/ARROW-6671 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Assignee: Kenta Murata >Priority: Minor > > Currently there's {{SparseCOOIndex}} and {{SparseCSRIndex}}, but also > {{SparseTensorCOO}} and {{SparseTensorCSR}}. > For consistency, it would be nice to rename the latter {{SparseCOOTensor}} > and {{SparseCSRTensor}}. > Also, it's not obvious the {{SparseMatrixCSR}} alias is useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6671) [C++] Sparse tensor naming
[ https://issues.apache.org/jira/browse/ARROW-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenta Murata reassigned ARROW-6671: --- Assignee: Kenta Murata > [C++] Sparse tensor naming > -- > > Key: ARROW-6671 > URL: https://issues.apache.org/jira/browse/ARROW-6671 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Assignee: Kenta Murata >Priority: Minor > > Currently there's {{SparseCOOIndex}} and {{SparseCSRIndex}}, but also > {{SparseTensorCOO}} and {{SparseTensorCSR}}. > For consistency, it would be nice to rename the latter {{SparseCOOTensor}} > and {{SparseCSRTensor}}. > Also, it's not obvious the {{SparseMatrixCSR}} alias is useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6672) [Java] Extract a common interface for dictionary builders
[ https://issues.apache.org/jira/browse/ARROW-6672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6672: -- Labels: pull-request-available (was: ) > [Java] Extract a common interface for dictionary builders > - > > Key: ARROW-6672 > URL: https://issues.apache.org/jira/browse/ARROW-6672 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Minor > Labels: pull-request-available > > We need a common interface for dictionary builders to support more > sophisticated scenarios, like collecting dictionary statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6672) [Java] Extract a common interface for dictionary builders
Liya Fan created ARROW-6672: --- Summary: [Java] Extract a common interface for dictionary builders Key: ARROW-6672 URL: https://issues.apache.org/jira/browse/ARROW-6672 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan We need a common interface for dictionary builders to support more sophisticated scenarios, like collecting dictionary statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6671) [C++] Sparse tensor naming
[ https://issues.apache.org/jira/browse/ARROW-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936696#comment-16936696 ] Antoine Pitrou commented on ARROW-6671: --- cc [~mrkn] > [C++] Sparse tensor naming > -- > > Key: ARROW-6671 > URL: https://issues.apache.org/jira/browse/ARROW-6671 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Priority: Minor > > Currently there's {{SparseCOOIndex}} and {{SparseCSRIndex}}, but also > {{SparseTensorCOO}} and {{SparseTensorCSR}}. > For consistency, it would be nice to rename the latter {{SparseCOOTensor}} > and {{SparseCSRTensor}}. > Also, it's not obvious the {{SparseMatrixCSR}} alias is useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6671) [C++] Sparse tensor naming
Antoine Pitrou created ARROW-6671: - Summary: [C++] Sparse tensor naming Key: ARROW-6671 URL: https://issues.apache.org/jira/browse/ARROW-6671 Project: Apache Arrow Issue Type: Wish Components: C++ Reporter: Antoine Pitrou Currently there's {{SparseCOOIndex}} and {{SparseCSRIndex}}, but also {{SparseTensorCOO}} and {{SparseTensorCSR}}. For consistency, it would be nice to rename the latter {{SparseCOOTensor}} and {{SparseCSRTensor}}. Also, it's not obvious the {{SparseMatrixCSR}} alias is useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6629) [Doc][C++] Document the FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-6629: - Assignee: Antoine Pitrou > [Doc][C++] Document the FileSystem API > -- > > Key: ARROW-6629 > URL: https://issues.apache.org/jira/browse/ARROW-6629 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Neal Richardson >Assignee: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > In ARROW-6622, I was looking for a place in the docs to add about path > normalization, and I couldn't find filesystem docs at all. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6472) [Java] ValueVector#accept may have a potential cast exception
[ https://issues.apache.org/jira/browse/ARROW-6472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6472: -- Labels: pull-request-available (was: ) > [Java] ValueVector#accept may have a potential cast exception > -- > > Key: ARROW-6472 > URL: https://issues.apache.org/jira/browse/ARROW-6472 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Major > Labels: pull-request-available > > Per discussion > [https://github.com/apache/arrow/pull/5195#issuecomment-528425302] > We may use the API this way: > {code:java} > RangeEqualsVisitor visitor = new RangeEqualsVisitor(vector1, vector2); > vector3.accept(visitor, range){code} > if vector1/vector2 are, say, {{StructVector}}s and vector3 is an {{IntVector}} > - things can go bad. We'll use {{compareBaseFixedWidthVectors()}} and do > wrong type-casts for vector1/vector2. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-4930) [Python] Remove LIBDIR assumptions in Python build
[ https://issues.apache.org/jira/browse/ARROW-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936581#comment-16936581 ] Antoine Pitrou commented on ARROW-4930: --- I'm afraid I can't really help constructively :-/ My CMake-fu is quite weak. I'll cc [~kou], who's much more knowledgeable in the area. > [Python] Remove LIBDIR assumptions in Python build > -- > > Key: ARROW-4930 > URL: https://issues.apache.org/jira/browse/ARROW-4930 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.12.1 >Reporter: Suvayu Ali >Priority: Minor > Labels: setup.py > Fix For: 2.0.0 > > Attachments: FindArrow.cmake.patch, FindParquet.cmake.patch > > > This is in reference to (4) in > [this|http://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3C0AF328A1-ED2A-457F-B72D-3B49C8614850%40xhochy.com%3E] > mailing list discussion. > Certain sections of setup.py assume a specific location of the C++ libraries. > Removing this hard assumption will simplify PyArrow builds significantly. As > far as I could tell these assumptions are made in the > {{build_ext._run_cmake()}} method (wherever bundling of C++ libraries is > handled). > # The first occurrence is before invoking cmake (see line 237). > # The second occurrence is when the C++ libraries are moved from their build > directory to the Python tree (see line 347). The actual implementation is in > the function {{_move_shared_libs_unix(..)}} (see line 468). > Hope this helps. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-5845) [Java] Implement converter between Arrow record batches and Avro records
[ https://issues.apache.org/jira/browse/ARROW-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ji Liu updated ARROW-5845: -- Priority: Major (was: Minor) > [Java] Implement converter between Arrow record batches and Avro records > > > Key: ARROW-5845 > URL: https://issues.apache.org/jira/browse/ARROW-5845 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Major > Fix For: 1.0.0 > > > It would be useful for applications which need to convert Avro data to Arrow > data. > This is an adapter which converts data with an existing API (like the JDBC adapter) > rather than a native reader (like orc). > We implement this function through the Avro java project, receiving params like > Decoder/Schema/DatumReader of Avro and returning a VectorSchemaRoot. For each data > type we have a consumer class as below to get Avro data and write it into a > vector to avoid boxing/unboxing (e.g. GenericRecord#get returns Object) > {code:java} > public class AvroIntConsumer implements Consumer { > private final IntWriter writer; > public AvroIntConsumer(IntVector vector) { > this.writer = new IntWriterImpl(vector); > } > @Override > public void consume(Decoder decoder) throws IOException { > writer.writeInt(decoder.readInt()); > writer.setPosition(writer.getPosition() + 1); > } > } > {code} > We intend to support primitive and complex types (null values represented > via a union type with a null type); size limits and field selection could be > optional for users. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6592) [Java] Add support for skipping decoding of columns/field in Avro converter
[ https://issues.apache.org/jira/browse/ARROW-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6592: -- Labels: avro pull-request-available (was: avro) > [Java] Add support for skipping decoding of columns/field in Avro converter > --- > > Key: ARROW-6592 > URL: https://issues.apache.org/jira/browse/ARROW-6592 > Project: Apache Arrow > Issue Type: Sub-task > Components: Java >Reporter: Micah Kornfield >Assignee: Ji Liu >Priority: Major > Labels: avro, pull-request-available > > Users should be able to pass in a set of fields they wish to decode from Avro, > and the converter should avoid creating Vectors in the returned > ArrowSchemaRoot. This would ideally support nested columns, so if there was: > > Struct A { > int B; > int C; > } > > The user could choose to only read A.B or A.C or both. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6601) [Java] Improve JDBC adapter performance & add benchmark
[ https://issues.apache.org/jira/browse/ARROW-6601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6601. Fix Version/s: 0.15.0 Resolution: Fixed Issue resolved by pull request 5472 [https://github.com/apache/arrow/pull/5472] > [Java] Improve JDBC adapter performance & add benchmark > --- > > Key: ARROW-6601 > URL: https://issues.apache.org/jira/browse/ARROW-6601 > Project: Apache Arrow > Issue Type: Task > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Critical > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Add a performance test as well to get a baseline number, to avoid performance > regression when we change related code. > -- This message was sent by Atlassian Jira (v8.3.4#803005)