[jira] [Resolved] (ARROW-6089) [Rust] [DataFusion] Implement parallel execution for selection
[ https://issues.apache.org/jira/browse/ARROW-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-6089. --- Fix Version/s: (was: 1.0.0) 0.15.0 Resolution: Fixed Issue resolved by pull request 5320 [https://github.com/apache/arrow/pull/5320] > [Rust] [DataFusion] Implement parallel execution for selection > -- > > Key: ARROW-6089 > URL: https://issues.apache.org/jira/browse/ARROW-6089 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Implement physical plan for selection operator. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6683) [Python] Add unit tests that validate cross-compatibility with pyarrow.parquet when fastparquet is installed
Wes McKinney created ARROW-6683: --- Summary: [Python] Add unit tests that validate cross-compatibility with pyarrow.parquet when fastparquet is installed Key: ARROW-6683 URL: https://issues.apache.org/jira/browse/ARROW-6683 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 1.0.0 This will help prevent such issues as ARROW-6678 from recurring -- This message was sent by Atlassian Jira (v8.3.4#803005)
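A minimal sketch of what such a cross-compatibility test could look like: write a file with pyarrow.parquet and confirm fastparquet can open it, skipping when either library is absent. The library names are real; the helper names and test layout are assumptions, not the patch that eventually closed this issue.

```python
import importlib.util
import os
import tempfile

def have(mod):
    """True if `mod` can be imported in this environment."""
    return importlib.util.find_spec(mod) is not None

def roundtrip_pyarrow_to_fastparquet(path):
    # Hypothetical test body: pyarrow writes, fastparquet parses the footer.
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    import fastparquet as fp
    df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": ["x", "y", "z"]})
    pq.write_table(pa.table(df), path)
    fp.ParquetFile(path)  # raises if pyarrow's footer metadata is unreadable
    return True

if have("pandas") and have("pyarrow") and have("fastparquet"):
    with tempfile.TemporaryDirectory() as d:
        assert roundtrip_pyarrow_to_fastparquet(os.path.join(d, "t.parquet"))
```

Running this guard unconditionally in CI (rather than skipping) would have surfaced the ARROW-6678 regression before release.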
[jira] [Commented] (ARROW-6613) [C++] Remove dependency on boost::filesystem
[ https://issues.apache.org/jira/browse/ARROW-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937407#comment-16937407 ] Wes McKinney commented on ARROW-6613: - I would suggest adding an {{ARROW_FILESYSTEM}} option, setting it to off by default, then handling the remaining usages of Boost in the core library-only build. I think it's fine to have boost::filesystem in src/arrow/filesystem. People who are making use of this functionality are probably more OK with accepting additional build dependencies. I would mainly like this stuff to be out of the way of people who are only using Array / ArrayBuilder / RecordBatch and IPC read/write tools > [C++] Remove dependency on boost::filesystem > > > Key: ARROW-6613 > URL: https://issues.apache.org/jira/browse/ARROW-6613 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > See ARROW-2196 for details. > boost::filesystem should not be required for base functionality at least > (including filesystems, probably). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6678) [C++] Regression in Parquet file compatibility introduced by ARROW-3246
[ https://issues.apache.org/jira/browse/ARROW-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6678. Resolution: Fixed Issue resolved by pull request 5493 [https://github.com/apache/arrow/pull/5493] > [C++] Regression in Parquet file compatibility introduced by ARROW-3246 > --- > > Key: ARROW-6678 > URL: https://issues.apache.org/jira/browse/ARROW-6678 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Blocker > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 2h 40m > Remaining Estimate: 0h > > I randomly discovered that this script fails after applying the patch for > ARROW-3246 > https://github.com/apache/arrow/commit/2ba0566b29312e84fafc987fd8dc9664748be96a > {code} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > import fastparquet as fp > df = pd.util.testing.makeDataFrame() > pq.write_table(pa.table(df), 'test.parquet') > fp.ParquetFile('test.parquet') > {code} > with > {code} > Traceback (most recent call last): > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 110, in __init__ > with open_with(fn2, 'rb') as f: > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/util.py", > line 38, in default_open > return open(f, mode) > NotADirectoryError: [Errno 20] Not a directory: 'test.parquet/_metadata' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "test.py", line 10, in > fp.ParquetFile('test.parquet') > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 116, in __init__ > self._parse_header(f, verify) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 135, in _parse_header > fmd = read_thrift(f, parquet_thrift.FileMetaData) > File > 
"/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/thrift_structures.py", > line 25, in read_thrift > obj.read(pin) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py", > line 1929, in read > iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec]) > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 0: > invalid start byte > {code} > I don't recall making any metadata-related changes but I'm going to review > the patch to see if I can narrow down where the problem is to see whether > it's a bug with Arrow/parquet-cpp or with the third party library -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6678) [C++] Regression in Parquet file compatibility introduced by ARROW-3246
[ https://issues.apache.org/jira/browse/ARROW-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield reassigned ARROW-6678: -- Assignee: Wes McKinney > [C++] Regression in Parquet file compatibility introduced by ARROW-3246 > --- > > Key: ARROW-6678 > URL: https://issues.apache.org/jira/browse/ARROW-6678 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Blocker > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 2.5h > Remaining Estimate: 0h > > I randomly discovered that this script fails after applying the patch for > ARROW-3246 > https://github.com/apache/arrow/commit/2ba0566b29312e84fafc987fd8dc9664748be96a > {code} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > import fastparquet as fp > df = pd.util.testing.makeDataFrame() > pq.write_table(pa.table(df), 'test.parquet') > fp.ParquetFile('test.parquet') > {code} > with > {code} > Traceback (most recent call last): > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 110, in __init__ > with open_with(fn2, 'rb') as f: > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/util.py", > line 38, in default_open > return open(f, mode) > NotADirectoryError: [Errno 20] Not a directory: 'test.parquet/_metadata' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "test.py", line 10, in > fp.ParquetFile('test.parquet') > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 116, in __init__ > self._parse_header(f, verify) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 135, in _parse_header > fmd = read_thrift(f, parquet_thrift.FileMetaData) > File > 
"/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/thrift_structures.py", > line 25, in read_thrift > obj.read(pin) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py", > line 1929, in read > iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec]) > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 0: > invalid start byte > {code} > I don't recall making any metadata-related changes but I'm going to review > the patch to see if I can narrow down where the problem is to see whether > it's a bug with Arrow/parquet-cpp or with the third party library -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6681) [C# -> R] - Record Batches in reverse order?
[ https://issues.apache.org/jira/browse/ARROW-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-6681: - Issue Type: Bug (was: New Feature) > [C# -> R] - Record Batches in reverse order? > > > Key: ARROW-6681 > URL: https://issues.apache.org/jira/browse/ARROW-6681 > Project: Apache Arrow > Issue Type: Bug > Components: C#, R >Affects Versions: 0.14.1 >Reporter: Anthony Abate >Priority: Minor > > Are 'RecordBatches' in C# being written in reverse order? > I made a simple test which creates a single row per record batch of 0 to 99 > and attempted to read this in R. To my surprise batch(0) in R had the value > 99, not 0. > This may not seem like a big deal; however, when dealing with 'huge' files, > it's more efficient to use Record Batches / index lookup than attempting to > load the entire file into memory. > Having the order consistent across the different language APIs only makes > sense - for now I can work around this by reversing the order before > writing. > > https://github.com/apache/arrow/issues/5475 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6682) Arrow Hangs on Large Files (10-12gb)
Anthony Abate created ARROW-6682: Summary: Arrow Hangs on Large Files (10-12gb) Key: ARROW-6682 URL: https://issues.apache.org/jira/browse/ARROW-6682 Project: Apache Arrow Issue Type: Bug Components: C++, R Affects Versions: 0.14.1 Reporter: Anthony Abate I get random hangs on arrow_read in R (Windows) when using a very large file (10-12gb). I have memory dumps - all threads seem to be in wait handles. Are there debug symbols somewhere? Is there a way to get the C++ code to produce diagnostic logging from R? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6681) [C# -> R] - Record Batches in reverse order?
[ https://issues.apache.org/jira/browse/ARROW-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937373#comment-16937373 ] Anthony Abate commented on ARROW-6681: -- sample code in github issue > [C# -> R] - Record Batches in reverse order? > > > Key: ARROW-6681 > URL: https://issues.apache.org/jira/browse/ARROW-6681 > Project: Apache Arrow > Issue Type: New Feature > Components: C#, R >Affects Versions: 0.14.1 >Reporter: Anthony Abate >Priority: Minor > > Are 'RecordBatches' in C# being written in reverse order? > I made a simple test which creates a single row per record batch of 0 to 99 > and attempted to read this in R. To my surprise batch(0) in R had the value > 99, not 0. > This may not seem like a big deal; however, when dealing with 'huge' files, > it's more efficient to use Record Batches / index lookup than attempting to > load the entire file into memory. > Having the order consistent across the different language APIs only makes > sense - for now I can work around this by reversing the order before > writing. > > https://github.com/apache/arrow/issues/5475 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6681) [C# -> R] - Record Batches in reverse order?
[ https://issues.apache.org/jira/browse/ARROW-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-6681: - Component/s: R > [C# -> R] - Record Batches in reverse order? > > > Key: ARROW-6681 > URL: https://issues.apache.org/jira/browse/ARROW-6681 > Project: Apache Arrow > Issue Type: New Feature > Components: C#, R >Affects Versions: 0.14.1 >Reporter: Anthony Abate >Priority: Minor > > Are 'RecordBatches' in C# being written in reverse order? > I made a simple test which creates a single row per record batch of 0 to 99 > and attempted to read this in R. To my surprise batch(0) in R had the value > 99, not 0. > This may not seem like a big deal; however, when dealing with 'huge' files, > it's more efficient to use Record Batches / index lookup than attempting to > load the entire file into memory. > Having the order consistent across the different language APIs only makes > sense - for now I can work around this by reversing the order before > writing. > > https://github.com/apache/arrow/issues/5475 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6681) [C# -> R] - Record Batches in reverse order?
Anthony Abate created ARROW-6681: Summary: [C# -> R] - Record Batches in reverse order? Key: ARROW-6681 URL: https://issues.apache.org/jira/browse/ARROW-6681 Project: Apache Arrow Issue Type: New Feature Components: C# Affects Versions: 0.14.1 Reporter: Anthony Abate Are 'RecordBatches' in C# being written in reverse order? I made a simple test which creates a single row per record batch of 0 to 99 and attempted to read this in R. To my surprise batch(0) in R had the value 99, not 0. This may not seem like a big deal; however, when dealing with 'huge' files, it's more efficient to use Record Batches / index lookup than attempting to load the entire file into memory. Having the order consistent across the different language APIs only makes sense - for now I can work around this by reversing the order before writing. https://github.com/apache/arrow/issues/5475 -- This message was sent by Atlassian Jira (v8.3.4#803005)
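The ordering question above can be checked from Python with a small IPC roundtrip: write one single-row batch per value, read the file back, and compare the order. This is an assumed repro harness (the `pa.ipc.new_file` / `pa.ipc.open_file` calls require a reasonably recent pyarrow), not the reporter's original C# code.

```python
import importlib.util

def batch_order_roundtrip(values):
    """Write one single-row record batch per value; return values as re-read."""
    import pyarrow as pa
    schema = pa.schema([("x", pa.int64())])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_file(sink, schema) as writer:
        for v in values:  # one batch per value, mirroring the 0..99 repro
            batch = pa.RecordBatch.from_arrays(
                [pa.array([v], type=pa.int64())], names=["x"])
            writer.write_batch(batch)
    reader = pa.ipc.open_file(sink.getvalue())
    return [reader.get_batch(i).column(0)[0].as_py()
            for i in range(reader.num_record_batches)]

if importlib.util.find_spec("pyarrow") is not None:
    # Batches should come back in exactly the order they were written.
    assert batch_order_roundtrip(list(range(100))) == list(range(100))
```

The Arrow file format records batch locations in a footer, so readers on every language binding should observe the write order; a mismatch points at the writer.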
[jira] [Updated] (ARROW-6667) [Python] Avoid Reference Cycles in pyarrow.parquet
[ https://issues.apache.org/jira/browse/ARROW-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6667: Fix Version/s: 0.15.0 > [Python] Avoid Reference Cycles in pyarrow.parquet > -- > > Key: ARROW-6667 > URL: https://issues.apache.org/jira/browse/ARROW-6667 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Aaron Opfer >Priority: Minor > Labels: pull-request-available > Fix For: 0.15.0 > > Attachments: cycle1_build_nested_path.PNG, cycle2_open_dataset.PNG > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Reference cycles appear in two places inside pyarrow.parquet which causes > these objects to have much longer lifetimes than necessary: > > {{_build_nested_path}} has a reference cycle because the closured function > refers to the parent cell which also refers to the closured function again > (objgraph shown in attachment) > {{open_dataset_file}} is partialed with self inside the {{ParquetFile}} class > (objgraph shown in attachment). -- This message was sent by Atlassian Jira (v8.3.4#803005)
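The second cycle pattern described above (a method partialed with `self` and stored back on the instance) can be reproduced in a few lines of plain Python; the class and attribute names here are illustrative stand-ins, not pyarrow's actual code.

```python
import functools
import gc

class Holder:
    """Stand-in for a class that stores functools.partial(method, self):
    partial -> self -> __dict__ -> partial forms a reference cycle."""
    def __init__(self):
        self.opener = functools.partial(Holder.describe, self)

    def describe(self):
        return "opened"

def cycle_needs_gc():
    gc.collect()         # drain any pre-existing garbage first
    obj = Holder()
    del obj              # refcounting alone cannot free the cycle
    return gc.collect()  # the cycle collector reports unreachable objects

collected = cycle_needs_gc()  # > 0: the object lived until a full GC pass
```

This is why such objects have "much longer lifetimes than necessary": they wait for the cyclic garbage collector instead of being freed immediately by refcounting. A common fix is to hold `self` through a `weakref` or to use a module-level function that takes the instance as an argument.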
[jira] [Resolved] (ARROW-6158) [Python] possible to create StructArray with type that conflicts with child array's types
[ https://issues.apache.org/jira/browse/ARROW-6158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-6158. - Fix Version/s: (was: 1.0.0) 0.15.0 Resolution: Fixed Issue resolved by pull request 5488 [https://github.com/apache/arrow/pull/5488] > [Python] possible to create StructArray with type that conflicts with child > array's types > - > > Key: ARROW-6158 > URL: https://issues.apache.org/jira/browse/ARROW-6158 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Using the Python interface as an example. This creates a {{StructArray}} where > the field types don't match the child array types: > {code} > a = pa.array([1, 2, 3], type=pa.int64()) > b = pa.array(['a', 'b', 'c'], type=pa.string()) > inconsistent_fields = [pa.field('a', pa.int32()), pa.field('b', pa.float64())] > a = pa.StructArray.from_arrays([a, b], fields=inconsistent_fields) > {code} > The above works fine. I didn't find anything that errors (e.g. conversion to > pandas, slicing), also validation passes, but the type actually has the > inconsistent child types: > {code} > In [2]: a > Out[2]: > > -- is_valid: all not null > -- child 0 type: int64 > [ > 1, > 2, > 3 > ] > -- child 1 type: string > [ > "a", > "b", > "c" > ] > In [3]: a.type > Out[3]: StructType(struct) > In [4]: a.to_pandas() > Out[4]: > array([{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3, 'b': 'c'}], > dtype=object) > In [5]: a.validate() > {code} > Shouldn't this be disallowed somehow? (it could be checked in the Python > {{from_arrays}} method, but maybe also in {{StructArray::Make}} which already > checks for the number of fields vs arrays and a consistent array length). > Similar to the discussion in ARROW-6132, I would also expect > {{ValidateArray}} to catch this. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6680) [Python] Add Array ctor microbenchmarks
Wes McKinney created ARROW-6680: --- Summary: [Python] Add Array ctor microbenchmarks Key: ARROW-6680 URL: https://issues.apache.org/jira/browse/ARROW-6680 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 1.0.0 Since more unavoidable validation is being added in e.g. https://github.com/apache/arrow/pull/5488 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6674) [Python] Fix or ignore the test warnings
[ https://issues.apache.org/jira/browse/ARROW-6674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-6674. - Fix Version/s: 0.15.0 Resolution: Fixed Issue resolved by pull request 5489 [https://github.com/apache/arrow/pull/5489] > [Python] Fix or ignore the test warnings > > > Key: ARROW-6674 > URL: https://issues.apache.org/jira/browse/ARROW-6674 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Minor > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Currently when running the python tests, we have a bunch of warnings. Some of > them can be ignored, some of them can be fixed. But it is better to do that > explicitly, so that new warnings (which can be potentially important to see) > get more attention. -- This message was sent by Atlassian Jira (v8.3.4#803005)
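The "ignore some, surface the rest explicitly" policy described in this issue can be sketched with the stdlib `warnings` module: escalate every warning to an error, then opt out the known, deliberately ignored ones. The module name in the ignore filter is hypothetical, and real test suites would typically express the same filters in pytest configuration instead.

```python
import warnings

def configure_warnings():
    """Fail loudly on any new warning; ignore only what is listed."""
    warnings.resetwarnings()
    warnings.simplefilter("error")          # every warning becomes an error
    warnings.filterwarnings(                # known, deliberately ignored
        "ignore", category=DeprecationWarning, module="some_thirdparty")

def emits_deprecation():
    warnings.warn("old API", DeprecationWarning)

configure_warnings()
try:
    emits_deprecation()          # not from "some_thirdparty", so it escalates
    escalated = False
except DeprecationWarning:
    escalated = True
```

With this setup, a newly introduced warning fails the test run instead of scrolling past unnoticed, which is the attention-forcing behavior the issue asks for.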
[jira] [Comment Edited] (ARROW-6575) [JS] decimal toString does not support negative values
[ https://issues.apache.org/jira/browse/ARROW-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934909#comment-16934909 ] Paul Taylor edited comment on ARROW-6575 at 9/25/19 2:24 AM: - [~zad] Yeah I couldn't figure out how to propagate the sign bit through the decimal conversion. I'd be happy to review a PR if you know the right way to do it. was (Author: paul.e.taylor): Yeah, I couldn't figure out how to propagate the sign bit through the decimal conversion. I'd be happy to review a PR if you know the right way to do it. > [JS] decimal toString does not support negative values > -- > > Key: ARROW-6575 > URL: https://issues.apache.org/jira/browse/ARROW-6575 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Affects Versions: 0.14.1 >Reporter: Andong Zhan >Priority: Critical > > The main description is here: [https://github.com/apache/arrow/issues/5397] > Also, I have a simple test case (slightly changed generate-test-data.js and > generated-data-validators): > {code:java} > export const decimal = (length = 2, nullCount = length * 0.2 | 0, scale = 0, > precision = 38) => vectorGenerator.visit(new Decimal(scale, precision), > length, nullCount); > function fillDecimal(length: number) { > // const BPE = Uint32Array.BYTES_PER_ELEMENT; // 4 > const array = new Uint32Array(length); > // const max = (2 ** (8 * BPE)) - 1; > // for (let i = -1; ++i < length; array[i] = rand() * max * (rand() > 0.5 > ? -1 : 1)); > array[0] = 0; > array[1] = 1286889712; > array[2] = 2218195178; > array[3] = 4282345521; > array[4] = 0; > array[5] = 16004768; > array[6] = 3587851993; > array[7] = 126217744; > return array; > } > {code} > and the expected value should be > {code:java} > expect(vector.get(0).toString()).toBe('-1'); > expect(vector.get(1).toString()).toBe('1'); > {code} > However, the actual first value is 339282366920938463463374607431768211456 > which is wrong! The second value is correct by the way. 
> I believe the bug is in the function called > function decimalToString>(a: T) because it cannot > return a negative value at all. > [arrow/js/src/util/bn.ts|https://github.com/apache/arrow/blob/d54425de19b7dbb2764a40355d76d1c785cf64ec/js/src/util/bn.ts#L99] > Line 99 -- This message was sent by Atlassian Jira (v8.3.4#803005)
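The root cause described above boils down to reading a two's-complement value as if it were unsigned. A Python sketch (not the JS fix) of decoding Arrow's Decimal128 storage, four little-endian 32-bit words, with and without a sign:

```python
import struct

def words_to_int(words, signed):
    """Decode four little-endian uint32 words as one 128-bit integer."""
    raw = struct.pack("<4I", *words)          # 16 bytes, little-endian
    return int.from_bytes(raw, "little", signed=signed)

# All-ones storage reads as 2**128 - 1 unsigned, but as -1 in two's
# complement; a correct decimalToString must take the signed view.
assert words_to_int([0xFFFFFFFF] * 4, signed=False) == 2**128 - 1
assert words_to_int([0xFFFFFFFF] * 4, signed=True) == -1
```

This also explains the symptom in the report: an unsigned read of a small negative decimal lands near 2**128 (e.g. a value like 339282366920938463463374607431768211456) instead of producing a minus sign.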
[jira] [Updated] (ARROW-6679) [RELEASE] autobrew license in LICENSE.txt is not acceptable
[ https://issues.apache.org/jira/browse/ARROW-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6679: -- Labels: pull-request-available (was: ) > [RELEASE] autobrew license in LICENSE.txt is not acceptable > --- > > Key: ARROW-6679 > URL: https://issues.apache.org/jira/browse/ARROW-6679 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Wes McKinney >Priority: Blocker > Labels: pull-request-available > Fix For: 0.15.0 > > > {code} > This project includes code from the autobrew project. > * r/tools/autobrew and dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb > are based on code from the autobrew project. > Copyright: Copyright (c) 2017 - 2019, Jeroen Ooms. > All rights reserved. > Homepage: https://github.com/jeroen/autobrew > {code} > This code needs to be made available under a Category A license > https://apache.org/legal/resolved.html#category-a -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6086) [Rust] [DataFusion] Implement parallel execution for parquet scan
[ https://issues.apache.org/jira/browse/ARROW-6086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-6086: -- Fix Version/s: (was: 1.0.0) 0.15.0 > [Rust] [DataFusion] Implement parallel execution for parquet scan > - > > Key: ARROW-6086 > URL: https://issues.apache.org/jira/browse/ARROW-6086 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6649) [R] print() methods for Table, RecordBatch, etc.
[ https://issues.apache.org/jira/browse/ARROW-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-6649. Fix Version/s: (was: 1.0.0) 0.15.0 Resolution: Fixed Issue resolved by pull request 5492 [https://github.com/apache/arrow/pull/5492] > [R] print() methods for Table, RecordBatch, etc. > > > Key: ARROW-6649 > URL: https://issues.apache.org/jira/browse/ARROW-6649 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Inspired by tibble: show schema, head of data, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6086) [Rust] [DataFusion] Implement parallel execution for parquet scan
[ https://issues.apache.org/jira/browse/ARROW-6086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6086: -- Labels: pull-request-available (was: ) > [Rust] [DataFusion] Implement parallel execution for parquet scan > - > > Key: ARROW-6086 > URL: https://issues.apache.org/jira/browse/ARROW-6086 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6679) [RELEASE] autobrew license in LICENSE.txt is not acceptable
[ https://issues.apache.org/jira/browse/ARROW-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937325#comment-16937325 ] Wes McKinney commented on ARROW-6679: - Right, either the file needs an appropriate license applied or it needs to be removed. > [RELEASE] autobrew license in LICENSE.txt is not acceptable > --- > > Key: ARROW-6679 > URL: https://issues.apache.org/jira/browse/ARROW-6679 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Wes McKinney >Priority: Blocker > Fix For: 0.15.0 > > > {code} > This project includes code from the autobrew project. > * r/tools/autobrew and dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb > are based on code from the autobrew project. > Copyright: Copyright (c) 2017 - 2019, Jeroen Ooms. > All rights reserved. > Homepage: https://github.com/jeroen/autobrew > {code} > This code needs to be made available under a Category A license > https://apache.org/legal/resolved.html#category-a -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6086) [Rust] [DataFusion] Implement parallel execution for parquet scan
[ https://issues.apache.org/jira/browse/ARROW-6086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-6086: -- Fix Version/s: (was: 0.15.0) 1.0.0 > [Rust] [DataFusion] Implement parallel execution for parquet scan > - > > Key: ARROW-6086 > URL: https://issues.apache.org/jira/browse/ARROW-6086 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (ARROW-6086) [Rust] [DataFusion] Implement parallel execution for parquet scan
[ https://issues.apache.org/jira/browse/ARROW-6086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove reopened ARROW-6086: --- This was not actually fully implemented and needs further work. > [Rust] [DataFusion] Implement parallel execution for parquet scan > - > > Key: ARROW-6086 > URL: https://issues.apache.org/jira/browse/ARROW-6086 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 0.15.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6679) [RELEASE] autobrew license in LICENSE.txt is not acceptable
[ https://issues.apache.org/jira/browse/ARROW-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937274#comment-16937274 ] Neal Richardson commented on ARROW-6679: Sorry, I thought this was dealt with adequately in https://github.com/apache/arrow/pull/5095 (see discussion). What are the options for resolution? Jeroen adds a license file to https://github.com/jeroen/autobrew, or we remove the file? > [RELEASE] autobrew license in LICENSE.txt is not acceptable > --- > > Key: ARROW-6679 > URL: https://issues.apache.org/jira/browse/ARROW-6679 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Wes McKinney >Priority: Blocker > Fix For: 0.15.0 > > > {code} > This project includes code from the autobrew project. > * r/tools/autobrew and dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb > are based on code from the autobrew project. > Copyright: Copyright (c) 2017 - 2019, Jeroen Ooms. > All rights reserved. > Homepage: https://github.com/jeroen/autobrew > {code} > This code needs to be made available under a Category A license > https://apache.org/legal/resolved.html#category-a -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6679) [RELEASE] autobrew license in LICENSE.txt is not acceptable
[ https://issues.apache.org/jira/browse/ARROW-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937264#comment-16937264 ] Wes McKinney commented on ARROW-6679: - cc [~jeroenooms] > [RELEASE] autobrew license in LICENSE.txt is not acceptable > --- > > Key: ARROW-6679 > URL: https://issues.apache.org/jira/browse/ARROW-6679 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Wes McKinney >Priority: Blocker > Fix For: 0.15.0 > > > {code} > This project includes code from the autobrew project. > * r/tools/autobrew and dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb > are based on code from the autobrew project. > Copyright: Copyright (c) 2017 - 2019, Jeroen Ooms. > All rights reserved. > Homepage: https://github.com/jeroen/autobrew > {code} > This code needs to be made available under a Category A license > https://apache.org/legal/resolved.html#category-a -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6678) [C++] Regression in Parquet file compatibility introduced by ARROW-3246
[ https://issues.apache.org/jira/browse/ARROW-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6678: -- Labels: pull-request-available (was: ) > [C++] Regression in Parquet file compatibility introduced by ARROW-3246 > --- > > Key: ARROW-6678 > URL: https://issues.apache.org/jira/browse/ARROW-6678 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Blocker > Labels: pull-request-available > Fix For: 0.15.0 > > > I randomly discovered that this script fails after applying the patch for > ARROW-3246 > https://github.com/apache/arrow/commit/2ba0566b29312e84fafc987fd8dc9664748be96a > {code} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > import fastparquet as fp > df = pd.util.testing.makeDataFrame() > pq.write_table(pa.table(df), 'test.parquet') > fp.ParquetFile('test.parquet') > {code} > with > {code} > Traceback (most recent call last): > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 110, in __init__ > with open_with(fn2, 'rb') as f: > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/util.py", > line 38, in default_open > return open(f, mode) > NotADirectoryError: [Errno 20] Not a directory: 'test.parquet/_metadata' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "test.py", line 10, in > fp.ParquetFile('test.parquet') > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 116, in __init__ > self._parse_header(f, verify) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 135, in _parse_header > fmd = read_thrift(f, parquet_thrift.FileMetaData) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/thrift_structures.py", > line 25, in read_thrift > obj.read(pin) > 
File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py", > line 1929, in read > iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec]) > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 0: > invalid start byte > {code} > I don't recall making any metadata-related changes but I'm going to review > the patch to see if I can narrow down where the problem is to see whether > it's a bug with Arrow/parquet-cpp or with the third party library -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6679) [RELEASE] autobrew license in LICENSE.txt is not acceptable
Wes McKinney created ARROW-6679: --- Summary: [RELEASE] autobrew license in LICENSE.txt is not acceptable Key: ARROW-6679 URL: https://issues.apache.org/jira/browse/ARROW-6679 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Wes McKinney Fix For: 0.15.0 {code} This project includes code from the autobrew project. * r/tools/autobrew and dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb are based on code from the autobrew project. Copyright: Copyright (c) 2017 - 2019, Jeroen Ooms. All rights reserved. Homepage: https://github.com/jeroen/autobrew {code} This code needs to be made available under a Category A license https://apache.org/legal/resolved.html#category-a -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6678) [C++] Regression in Parquet file compatibility introduced by ARROW-3246
[ https://issues.apache.org/jira/browse/ARROW-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937250#comment-16937250 ] Wes McKinney commented on ARROW-6678: - The problem is that the serialized schema needs to be base64 encoded because Thrift string types must be UTF-8. https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift#L593 Working on a patch > [C++] Regression in Parquet file compatibility introduced by ARROW-3246 > --- > > Key: ARROW-6678 > URL: https://issues.apache.org/jira/browse/ARROW-6678 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Blocker > Fix For: 0.15.0 > > > I randomly discovered that this script fails after applying the patch for > ARROW-3246 > https://github.com/apache/arrow/commit/2ba0566b29312e84fafc987fd8dc9664748be96a > {code} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > import fastparquet as fp > df = pd.util.testing.makeDataFrame() > pq.write_table(pa.table(df), 'test.parquet') > fp.ParquetFile('test.parquet') > {code} > with > {code} > Traceback (most recent call last): > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 110, in __init__ > with open_with(fn2, 'rb') as f: > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/util.py", > line 38, in default_open > return open(f, mode) > NotADirectoryError: [Errno 20] Not a directory: 'test.parquet/_metadata' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "test.py", line 10, in > fp.ParquetFile('test.parquet') > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 116, in __init__ > self._parse_header(f, verify) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 135, in _parse_header > fmd = read_thrift(f, 
parquet_thrift.FileMetaData) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/thrift_structures.py", > line 25, in read_thrift > obj.read(pin) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py", > line 1929, in read > iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec]) > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 0: > invalid start byte > {code} > I don't recall making any metadata-related changes but I'm going to review > the patch to see if I can narrow down where the problem is to see whether > it's a bug with Arrow/parquet-cpp or with the third party library -- This message was sent by Atlassian Jira (v8.3.4#803005)
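The fix Wes describes — base64-encoding the serialized schema because Thrift {{string}} fields must be valid UTF-8 — can be illustrated with a small, pure-stdlib sketch. The byte string below is a made-up stand-in for the binary "ARROW:schema" flatbuffer payload, not a real serialized schema:

```python
import base64

# Hypothetical stand-in for a serialized Arrow schema: arbitrary binary
# bytes, like the flatbuffer stored under the "ARROW:schema" metadata key.
raw_schema_bytes = b"\xb4\x01\x00\x00schema-flatbuffer-payload"

# Storing the raw bytes in a Thrift `string` field fails, because Thrift
# strings must be valid UTF-8 and 0xb4 is not a valid UTF-8 start byte —
# exactly the UnicodeDecodeError fastparquet hits in the traceback above.
try:
    raw_schema_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Base64-encoding first yields pure ASCII, which is always valid UTF-8,
# so the metadata value survives a round trip through Thrift readers.
encoded = base64.b64encode(raw_schema_bytes).decode("ascii")
decoded = base64.b64decode(encoded)
assert not utf8_ok
assert decoded == raw_schema_bytes
```

This is why the patch encodes the schema before writing it into the Parquet key-value metadata and decodes it again on read.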
[jira] [Commented] (ARROW-6678) [C++] Regression in Parquet file compatibility introduced by ARROW-3246
[ https://issues.apache.org/jira/browse/ARROW-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937247#comment-16937247 ] Wes McKinney commented on ARROW-6678: - Luckily I did not have to search long. This is caused by the code that adds the "ARROW:schema" metadata field that was added here https://github.com/apache/arrow/commit/2ba0566b29312e84fafc987fd8dc9664748be96a#diff-806bd9c3d77823ae1bff914269e7db02R592 Investigating further > [C++] Regression in Parquet file compatibility introduced by ARROW-3246 > --- > > Key: ARROW-6678 > URL: https://issues.apache.org/jira/browse/ARROW-6678 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Blocker > Fix For: 0.15.0 > > > I randomly discovered that this script fails after applying the patch for > ARROW-3246 > https://github.com/apache/arrow/commit/2ba0566b29312e84fafc987fd8dc9664748be96a > {code} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > import fastparquet as fp > df = pd.util.testing.makeDataFrame() > pq.write_table(pa.table(df), 'test.parquet') > fp.ParquetFile('test.parquet') > {code} > with > {code} > Traceback (most recent call last): > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 110, in __init__ > with open_with(fn2, 'rb') as f: > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/util.py", > line 38, in default_open > return open(f, mode) > NotADirectoryError: [Errno 20] Not a directory: 'test.parquet/_metadata' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "test.py", line 10, in > fp.ParquetFile('test.parquet') > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > line 116, in __init__ > self._parse_header(f, verify) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", > 
line 135, in _parse_header > fmd = read_thrift(f, parquet_thrift.FileMetaData) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/thrift_structures.py", > line 25, in read_thrift > obj.read(pin) > File > "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py", > line 1929, in read > iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec]) > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 0: > invalid start byte > {code} > I don't recall making any metadata-related changes but I'm going to review > the patch to see if I can narrow down where the problem is to see whether > it's a bug with Arrow/parquet-cpp or with the third party library -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6678) [C++] Regression in Parquet file compatibility introduced by ARROW-3246
Wes McKinney created ARROW-6678: --- Summary: [C++] Regression in Parquet file compatibility introduced by ARROW-3246 Key: ARROW-6678 URL: https://issues.apache.org/jira/browse/ARROW-6678 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 0.15.0 I randomly discovered that this script fails after applying the patch for ARROW-3246 https://github.com/apache/arrow/commit/2ba0566b29312e84fafc987fd8dc9664748be96a {code} import pandas as pd import pyarrow as pa import pyarrow.parquet as pq import fastparquet as fp df = pd.util.testing.makeDataFrame() pq.write_table(pa.table(df), 'test.parquet') fp.ParquetFile('test.parquet') {code} with {code} Traceback (most recent call last): File "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", line 110, in __init__ with open_with(fn2, 'rb') as f: File "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/util.py", line 38, in default_open return open(f, mode) NotADirectoryError: [Errno 20] Not a directory: 'test.parquet/_metadata' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "test.py", line 10, in fp.ParquetFile('test.parquet') File "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", line 116, in __init__ self._parse_header(f, verify) File "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py", line 135, in _parse_header fmd = read_thrift(f, parquet_thrift.FileMetaData) File "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/thrift_structures.py", line 25, in read_thrift obj.read(pin) File "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py", line 1929, in read iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec]) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 0: invalid start byte {code} I don't 
recall making any metadata-related changes but I'm going to review the patch to see if I can narrow down where the problem is to see whether it's a bug with Arrow/parquet-cpp or with the third party library -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6629) [Doc][C++] Document the FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-6629. Fix Version/s: (was: 1.0.0) 0.15.0 Resolution: Fixed Issue resolved by pull request 5487 [https://github.com/apache/arrow/pull/5487] > [Doc][C++] Document the FileSystem API > -- > > Key: ARROW-6629 > URL: https://issues.apache.org/jira/browse/ARROW-6629 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Neal Richardson >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > In ARROW-6622, I was looking for a place in the docs to add about path > normalization, and I couldn't find filesystem docs at all. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-5379) [Python] support pandas' nullable Integer type in from_pandas
[ https://issues.apache.org/jira/browse/ARROW-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937175#comment-16937175 ] Daniel Nugent edited comment on ARROW-5379 at 9/24/19 8:30 PM: --- Is this actually something that would be appropriate to implement with extension types? It just requires that the mask parameter of pa.Array actually be used with Pandas integer columns, right? was (Author: nugend): Is this actually something that would be appropriate to implement with extension types? It just requires that the mask actually be used with Pandas integer columns. > [Python] support pandas' nullable Integer type in from_pandas > - > > Key: ARROW-5379 > URL: https://issues.apache.org/jira/browse/ARROW-5379 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > > From https://github.com/apache/arrow/issues/4168. We should add support for > pandas' nullable Integer extension dtypes, as those could map nicely to > Arrow's integer types. > Ideally this happens in a generic way though, and not specific for this > extension type, which is discussed in ARROW-5271 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-5379) [Python] support pandas' nullable Integer type in from_pandas
[ https://issues.apache.org/jira/browse/ARROW-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937175#comment-16937175 ] Daniel Nugent commented on ARROW-5379: -- Is this actually something that would be appropriate to implement with extension types? It just requires that the mask actually be used with Pandas integer columns. > [Python] support pandas' nullable Integer type in from_pandas > - > > Key: ARROW-5379 > URL: https://issues.apache.org/jira/browse/ARROW-5379 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > > From https://github.com/apache/arrow/issues/4168. We should add support for > pandas' nullable Integer extension dtypes, as those could map nicely to > Arrow's integer types. > Ideally this happens in a generic way though, and not specific for this > extension type, which is discussed in ARROW-5271 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6649) [R] print() methods for Table, RecordBatch, etc.
[ https://issues.apache.org/jira/browse/ARROW-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6649: -- Labels: pull-request-available (was: ) > [R] print() methods for Table, RecordBatch, etc. > > > Key: ARROW-6649 > URL: https://issues.apache.org/jira/browse/ARROW-6649 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Inspired by tibble: show schema, head of data, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6649) [R] print() methods for Table, RecordBatch, etc.
[ https://issues.apache.org/jira/browse/ARROW-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-6649: -- Assignee: Neal Richardson > [R] print() methods for Table, RecordBatch, etc. > > > Key: ARROW-6649 > URL: https://issues.apache.org/jira/browse/ARROW-6649 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Inspired by tibble: show schema, head of data, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-3850) [Python] Support MapType and StructType for enhanced PySpark integration
[ https://issues.apache.org/jira/browse/ARROW-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937159#comment-16937159 ] Bryan Cutler commented on ARROW-3850: - Now that SPARK-23836 is merged, a scalar Pandas UDF can return a StructType that will accept a pandas.DataFrame. By nested structs, I mean a column of StructType that has a child that is a StructType. Spark does not currently support this as an input column, or return type from Pandas UDFs. > [Python] Support MapType and StructType for enhanced PySpark integration > > > Key: ARROW-3850 > URL: https://issues.apache.org/jira/browse/ARROW-3850 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Affects Versions: 0.11.1 >Reporter: Florian Wilhelm >Priority: Major > Fix For: 1.0.0 > > > It would be great to support MapType and (nested) StructType in Arrow so that > PySpark can make use of it. > > Quite often, as in my use case, complex types are also saved in Hive table > cells. Currently it's not possible to use the new > {{[pandas_udf|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode#pyspark.sql.functions.pandas_udf]}} > decorator which internally uses Arrow to generate a UDF for columns with > complex types. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6677) [FlightRPC][C++] Document using Flight in C++
[ https://issues.apache.org/jira/browse/ARROW-6677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6677: -- Labels: pull-request-available (was: ) > [FlightRPC][C++] Document using Flight in C++ > - > > Key: ARROW-6677 > URL: https://issues.apache.org/jira/browse/ARROW-6677 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation, FlightRPC >Reporter: lidavidm >Assignee: lidavidm >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Similarly to ARROW-6390 for Python, we should have C++ documentation for > Flight. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6677) [FlightRPC][C++] Document using Flight in C++
lidavidm created ARROW-6677: --- Summary: [FlightRPC][C++] Document using Flight in C++ Key: ARROW-6677 URL: https://issues.apache.org/jira/browse/ARROW-6677 Project: Apache Arrow Issue Type: Bug Components: Documentation, FlightRPC Reporter: lidavidm Assignee: lidavidm Fix For: 1.0.0 Similarly to ARROW-6390 for Python, we should have C++ documentation for Flight. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6187) [C++] fallback to storage type when writing ExtensionType to Parquet
[ https://issues.apache.org/jira/browse/ARROW-6187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-6187. --- Fix Version/s: (was: 1.0.0) 0.15.0 Resolution: Fixed Issue resolved by pull request 5436 [https://github.com/apache/arrow/pull/5436] > [C++] fallback to storage type when writing ExtensionType to Parquet > > > Key: ARROW-6187 > URL: https://issues.apache.org/jira/browse/ARROW-6187 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: parquet, pull-request-available > Fix For: 0.15.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Writing a table that contains an ExtensionType array to a parquet file is not > yet implemented. It currently raises "ArrowNotImplementedError: Unhandled > type for Arrow to Parquet schema conversion: > extension" (for a PyExtensionType in this case). > I think minimal support can consist of writing the storage type / array. > We also might want to save the extension name and metadata in the parquet > FileMetadata. > Later on, this could potentially be used to restore the extension type > when reading. This is related to other issues that need to save the arrow > schema (categorical: ARROW-5480, time zones: ARROW-5888). Only in this case, > we probably want to store the serialised type in addition to the schema > (which only has the extension type's name). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6613) [C++] Remove dependency on boost::filesystem
[ https://issues.apache.org/jira/browse/ARROW-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937028#comment-16937028 ] Antoine Pitrou commented on ARROW-6613: --- I've started looking into this. There's some non-trivial Windows-specific code in boost::filesystem to handle reparse points and symlinks. It feels a bit counter-productive to copy/paste it without knowing exactly what it does. > [C++] Remove dependency on boost::filesystem > > > Key: ARROW-6613 > URL: https://issues.apache.org/jira/browse/ARROW-6613 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > See ARROW-2196 for details. > boost::filesystem should not be required for base functionality at least > (including filesystems, probably). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6675) [JS] Add scanReverse function
[ https://issues.apache.org/jira/browse/ARROW-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6675: --- Component/s: JavaScript > [JS] Add scanReverse function > - > > Key: ARROW-6675 > URL: https://issues.apache.org/jira/browse/ARROW-6675 > Project: Apache Arrow > Issue Type: New Feature > Components: JavaScript >Reporter: Malcolm MacLachlan >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > * Add scanReverse function to dataFrame and filteredDataframe > * Update tests -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6675) [JS] Add scanReverse function
[ https://issues.apache.org/jira/browse/ARROW-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6675: --- Summary: [JS] Add scanReverse function (was: Add scanReverse function) > [JS] Add scanReverse function > - > > Key: ARROW-6675 > URL: https://issues.apache.org/jira/browse/ARROW-6675 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Malcolm MacLachlan >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > * Add scanReverse function to dataFrame and filteredDataframe > * Update tests -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6676) [C++] [Parquet] Refactor encoding/decoding APIs for clarity
Benjamin Kietzman created ARROW-6676: Summary: [C++] [Parquet] Refactor encoding/decoding APIs for clarity Key: ARROW-6676 URL: https://issues.apache.org/jira/browse/ARROW-6676 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Benjamin Kietzman Assignee: Benjamin Kietzman {{encoding.h}} and {{encoding.cc}} are difficult to read and rewrite. I think there are also lost opportunities for more generic implementations. Simplify/winnow the interfaces while keeping an eye on the benchmarks for performance regressions -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6675) Add scanReverse function
[ https://issues.apache.org/jira/browse/ARROW-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936980#comment-16936980 ] Malcolm MacLachlan commented on ARROW-6675: --- [https://github.com/apache/arrow/pull/5480] > Add scanReverse function > > > Key: ARROW-6675 > URL: https://issues.apache.org/jira/browse/ARROW-6675 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Malcolm MacLachlan >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > * Add scanReverse function to dataFrame and filteredDataframe > * Update tests -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6675) Add scanReverse function
[ https://issues.apache.org/jira/browse/ARROW-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6675: -- Labels: pull-request-available (was: ) > Add scanReverse function > > > Key: ARROW-6675 > URL: https://issues.apache.org/jira/browse/ARROW-6675 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Malcolm MacLachlan >Priority: Minor > Labels: pull-request-available > > * Add scanReverse function to dataFrame and filteredDataframe > * Update tests -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6675) Add scanReverse function
Malcolm MacLachlan created ARROW-6675: - Summary: Add scanReverse function Key: ARROW-6675 URL: https://issues.apache.org/jira/browse/ARROW-6675 Project: Apache Arrow Issue Type: New Feature Reporter: Malcolm MacLachlan * Add scanReverse function to dataFrame and filteredDataframe * Update tests -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6674) [Python] Fix or ignore the test warnings
[ https://issues.apache.org/jira/browse/ARROW-6674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6674: -- Labels: pull-request-available (was: ) > [Python] Fix or ignore the test warnings > > > Key: ARROW-6674 > URL: https://issues.apache.org/jira/browse/ARROW-6674 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Minor > Labels: pull-request-available > > Currently when running the python tests, we have a bunch of warnings. Some of > them can be ignored, some of them can be fixed. But it is better to do that > explicitly, so that new warnings (which can be potentially important to see) > get more attention. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6158) [Python] possible to create StructArray with type that conflicts with child array's types
[ https://issues.apache.org/jira/browse/ARROW-6158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-6158: Assignee: Joris Van den Bossche > [Python] possible to create StructArray with type that conflicts with child > array's types > - > > Key: ARROW-6158 > URL: https://issues.apache.org/jira/browse/ARROW-6158 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Using the Python interface as example. This creates a {{StructArray}} where > the field types don't match the child array types: > {code} > a = pa.array([1, 2, 3], type=pa.int64()) > b = pa.array(['a', 'b', 'c'], type=pa.string()) > inconsistent_fields = [pa.field('a', pa.int32()), pa.field('b', pa.float64())] > a = pa.StructArray.from_arrays([a, b], fields=inconsistent_fields) > {code} > The above works fine. I didn't find anything that errors (eg conversion to > pandas, slicing), also validation passes, but the type actually has the > inconsistent child types: > {code} > In [2]: a > Out[2]: > > -- is_valid: all not null > -- child 0 type: int64 > [ > 1, > 2, > 3 > ] > -- child 1 type: string > [ > "a", > "b", > "c" > ] > In [3]: a.type > Out[3]: StructType(struct) > In [4]: a.to_pandas() > Out[4]: > array([{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3, 'b': 'c'}], > dtype=object) > In [5]: a.validate() > {code} > Shouldn't this be disallowed somehow? (it could be checked in the Python > {{from_arrays}} method, but maybe also in {{StructArray::Make}} which already > checks for the number of fields vs arrays and a consistent array length). > Similarly to discussion in ARROW-6132, I would also expect that > {{ValidateArray}} catches this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6674) [Python] Fix or ignore the test warnings
[ https://issues.apache.org/jira/browse/ARROW-6674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-6674: Assignee: Joris Van den Bossche > [Python] Fix or ignore the test warnings > > > Key: ARROW-6674 > URL: https://issues.apache.org/jira/browse/ARROW-6674 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Minor > > Currently when running the python tests, we have a bunch of warnings. Some of > them can be ignored, some of them can be fixed. But it is better to do that > explicitly, so that new warnings (which can be potentially important to see) > get more attention. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6674) [Python] Fix or ignore the test warnings
Joris Van den Bossche created ARROW-6674: Summary: [Python] Fix or ignore the test warnings Key: ARROW-6674 URL: https://issues.apache.org/jira/browse/ARROW-6674 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Currently when running the python tests, we have a bunch of warnings. Some of them can be ignored, some of them can be fixed. But it is better to do that explicitly, so that new warnings (which can be potentially important to see) get more attention. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6669) [Rust] [DataFusion] Implement physical expression for binary expressions
[ https://issues.apache.org/jira/browse/ARROW-6669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paddy Horan resolved ARROW-6669. Fix Version/s: 0.15.0 Resolution: Fixed Issue resolved by pull request 5478 [https://github.com/apache/arrow/pull/5478] > [Rust] [DataFusion] Implement physical expression for binary expressions > > > Key: ARROW-6669 > URL: https://issues.apache.org/jira/browse/ARROW-6669 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Implement comparison operators (<, <=, >, >=, =, !=) as well as binary > operators AND and OR. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6673) [Python] Consider separating libarrow.pxd into multiple definition files
Krisztian Szucs created ARROW-6673: -- Summary: [Python] Consider separating libarrow.pxd into multiple definition files Key: ARROW-6673 URL: https://issues.apache.org/jira/browse/ARROW-6673 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Krisztian Szucs See discussion https://github.com/apache/arrow/pull/5423#discussion_r327522836 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6613) [C++] Remove dependency on boost::filesystem
[ https://issues.apache.org/jira/browse/ARROW-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936855#comment-16936855 ] Wes McKinney commented on ARROW-6613: - I think we should remove Boost as a dependency of the _core_ build. Which may mean not building certain modules (like the code in src/arrow/filesystem) by default. I think that should make things easier > [C++] Remove dependency on boost::filesystem > > > Key: ARROW-6613 > URL: https://issues.apache.org/jira/browse/ARROW-6613 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > See ARROW-2196 for details. > boost::filesystem should not be required for base functionality at least > (including filesystems, probably). -- This message was sent by Atlassian Jira (v8.3.4#803005)
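The "core build" split Wes describes would amount to a CMake toggle that leaves src/arrow/filesystem, and with it the Boost dependency, out of the build unless explicitly requested. A rough sketch only, not the project's actual CMakeLists.txt; the {{ARROW_FILESYSTEM}} option name follows the suggestion elsewhere in this thread, and everything else is assumption:

```cmake
# Hypothetical sketch: gate the filesystem layer behind an option so the
# core Array / RecordBatch / IPC build does not pull in boost::filesystem.
option(ARROW_FILESYSTEM "Build the arrow filesystem layer" OFF)

if(ARROW_FILESYSTEM)
  # Only this module needs Boost in such a layout.
  find_package(Boost REQUIRED COMPONENTS filesystem system)
  add_subdirectory(src/arrow/filesystem)
  target_link_libraries(arrow_filesystem PRIVATE Boost::filesystem)
endif()
```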
[jira] [Resolved] (ARROW-3777) [C++] Implement a mock "high latency" filesystem
[ https://issues.apache.org/jira/browse/ARROW-3777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-3777. --- Fix Version/s: (was: 1.0.0) 0.15.0 Resolution: Fixed Issue resolved by pull request 5439 [https://github.com/apache/arrow/pull/5439] > [C++] Implement a mock "high latency" filesystem > > > Key: ARROW-3777 > URL: https://issues.apache.org/jira/browse/ARROW-3777 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > Some of our tools don't perform well out of the box for filesystems with high > latency reads, like cloud blob stores. In such cases, it may be better to use > buffered reads with a larger read ahead window. Having a mock filesystem to > introduce latency into reads will help with testing / developing APIs for this -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6613) [C++] Remove dependency on boost::filesystem
[ https://issues.apache.org/jira/browse/ARROW-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936833#comment-16936833 ] Antoine Pitrou commented on ARROW-6613: --- [~wesm] We should perhaps discuss whether it's really useful to have a bare-bones no-Boost build. We will probably end up copy-pasting some boost code along the way. > [C++] Remove dependency on boost::filesystem > > > Key: ARROW-6613 > URL: https://issues.apache.org/jira/browse/ARROW-6613 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > See ARROW-2196 for details. > boost::filesystem should not be required for base functionality at least > (including filesystems, probably). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6213) [C++] tests fail for AVX512
[ https://issues.apache.org/jira/browse/ARROW-6213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936829#comment-16936829 ] Wes McKinney commented on ARROW-6213: - It does. I'll create an account, add your SSH keys (https://github.com/pitrou.keys) and send you the connection information privately > [C++] tests fail for AVX512 > --- > > Key: ARROW-6213 > URL: https://issues.apache.org/jira/browse/ARROW-6213 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.14.1 > Environment: CentOS 7.6.1810, Intel Xeon Processor (Skylake, IBRS) > avx512 >Reporter: Charles Coulombe >Priority: Minor > Fix For: 2.0.0 > > Attachments: arrow-0.14.1-c++-failed-tests-cmake-conf.txt, > arrow-0.14.1-c++-failed-tests.txt > > > When building libraries for avx512 with GCC 7.3.0, two C++ tests fails. > {noformat} > The following tests FAILED: > 28 - arrow-compute-compare-test (Failed) > 30 - arrow-compute-filter-test (Failed) > Errors while running CTest{noformat} > while for avx2 they passes. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6158) [Python] possible to create StructArray with type that conflicts with child array's types
[ https://issues.apache.org/jira/browse/ARROW-6158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6158: -- Labels: pull-request-available (was: ) > [Python] possible to create StructArray with type that conflicts with child > array's types > - > > Key: ARROW-6158 > URL: https://issues.apache.org/jira/browse/ARROW-6158 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Using the Python interface as an example. This creates a {{StructArray}} where > the field types don't match the child array types: > {code} > a = pa.array([1, 2, 3], type=pa.int64()) > b = pa.array(['a', 'b', 'c'], type=pa.string()) > inconsistent_fields = [pa.field('a', pa.int32()), pa.field('b', pa.float64())] > a = pa.StructArray.from_arrays([a, b], fields=inconsistent_fields) > {code} > The above works fine. I didn't find anything that errors (eg conversion to > pandas, slicing), also validation passes, but the type actually has the > inconsistent child types: > {code} > In [2]: a > Out[2]: > > -- is_valid: all not null > -- child 0 type: int64 > [ > 1, > 2, > 3 > ] > -- child 1 type: string > [ > "a", > "b", > "c" > ] > In [3]: a.type > Out[3]: StructType(struct) > In [4]: a.to_pandas() > Out[4]: > array([{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3, 'b': 'c'}], > dtype=object) > In [5]: a.validate() > {code} > Shouldn't this be disallowed somehow? (it could be checked in the Python > {{from_arrays}} method, but maybe also in {{StructArray::Make}} which already > checks for the number of fields vs arrays and a consistent array length). > Similarly to the discussion in ARROW-6132, I would also expect that > {{ValidateArray}} catches this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6613) [C++] Remove dependency on boost::filesystem
[ https://issues.apache.org/jira/browse/ARROW-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936802#comment-16936802 ] Antoine Pitrou commented on ARROW-6613: --- boost::filesystem is used in conjunction with boost::process for testing Flight and S3FS. I don't think it's reasonable to reimplement boost::process. Still, we can try to make boost::filesystem unnecessary if tests are not built. > [C++] Remove dependency on boost::filesystem > > > Key: ARROW-6613 > URL: https://issues.apache.org/jira/browse/ARROW-6613 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > See ARROW-2196 for details. > boost::filesystem should not be required for base functionality at least > (including filesystems, probably). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6671) [C++] Sparse tensor naming
[ https://issues.apache.org/jira/browse/ARROW-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936786#comment-16936786 ] Wes McKinney commented on ARROW-6671: - I am OK with having the type be called "matrix". Good to make things as clear and consistent as possible > [C++] Sparse tensor naming > -- > > Key: ARROW-6671 > URL: https://issues.apache.org/jira/browse/ARROW-6671 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Assignee: Kenta Murata >Priority: Minor > > Currently there's {{SparseCOOIndex}} and {{SparseCSRIndex}}, but also > {{SparseTensorCOO}} and {{SparseTensorCSR}}. > For consistency, it would be nice to rename the latter {{SparseCOOTensor}} > and {{SparseCSRTensor}}. > Also, it's not obvious the {{SparseMatrixCSR}} alias is useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Issue Comment Deleted] (ARROW-6671) [C++] Sparse tensor naming
[ https://issues.apache.org/jira/browse/ARROW-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6671: Comment: was deleted (was: I don't know. We don't use the term "matrix" currently in Arrow. cc [~wesm]) > [C++] Sparse tensor naming > -- > > Key: ARROW-6671 > URL: https://issues.apache.org/jira/browse/ARROW-6671 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Assignee: Kenta Murata >Priority: Minor > > Currently there's {{SparseCOOIndex}} and {{SparseCSRIndex}}, but also > {{SparseTensorCOO}} and {{SparseTensorCSR}}. > For consistency, it would be nice to rename the latter {{SparseCOOTensor}} > and {{SparseCSRTensor}}. > Also, it's not obvious the {{SparseMatrixCSR}} alias is useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6671) [C++] Sparse tensor naming
[ https://issues.apache.org/jira/browse/ARROW-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936779#comment-16936779 ] Rok Mihevc commented on ARROW-6671: --- SparseCSRMatrix might be more misleading as it 'doesn't look like' a Tensor type. I think that is potentially more confusing than it being limited to 2D. +1 for the consistent naming > [C++] Sparse tensor naming > -- > > Key: ARROW-6671 > URL: https://issues.apache.org/jira/browse/ARROW-6671 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Assignee: Kenta Murata >Priority: Minor > > Currently there's {{SparseCOOIndex}} and {{SparseCSRIndex}}, but also > {{SparseTensorCOO}} and {{SparseTensorCSR}}. > For consistency, it would be nice to rename the latter {{SparseCOOTensor}} > and {{SparseCSRTensor}}. > Also, it's not obvious the {{SparseMatrixCSR}} alias is useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6629) [Doc][C++] Document the FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6629: -- Labels: pull-request-available (was: ) > [Doc][C++] Document the FileSystem API > -- > > Key: ARROW-6629 > URL: https://issues.apache.org/jira/browse/ARROW-6629 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Neal Richardson >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > In ARROW-6622, I was looking for a place in the docs to add about path > normalization, and I couldn't find filesystem docs at all. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6671) [C++] Sparse tensor naming
[ https://issues.apache.org/jira/browse/ARROW-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936727#comment-16936727 ] Antoine Pitrou commented on ARROW-6671: --- I don't know. We don't use the term "matrix" currently in Arrow. cc [~wesm] > [C++] Sparse tensor naming > -- > > Key: ARROW-6671 > URL: https://issues.apache.org/jira/browse/ARROW-6671 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Assignee: Kenta Murata >Priority: Minor > > Currently there's {{SparseCOOIndex}} and {{SparseCSRIndex}}, but also > {{SparseTensorCOO}} and {{SparseTensorCSR}}. > For consistency, it would be nice to rename the latter {{SparseCOOTensor}} > and {{SparseCSRTensor}}. > Also, it's not obvious the {{SparseMatrixCSR}} alias is useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6671) [C++] Sparse tensor naming
[ https://issues.apache.org/jira/browse/ARROW-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936724#comment-16936724 ] Kenta Murata commented on ARROW-6671: - Indeed. I want to make their names consistent, so I'll make a pull request tomorrow. [~apitrou] How about employing SparseCSRMatrix rather than SparseCSRTensor, because it cannot represent a tensor with more than 2 dimensions? > [C++] Sparse tensor naming > -- > > Key: ARROW-6671 > URL: https://issues.apache.org/jira/browse/ARROW-6671 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Assignee: Kenta Murata >Priority: Minor > > Currently there's {{SparseCOOIndex}} and {{SparseCSRIndex}}, but also > {{SparseTensorCOO}} and {{SparseTensorCSR}}. > For consistency, it would be nice to rename the latter {{SparseCOOTensor}} > and {{SparseCSRTensor}}. > Also, it's not obvious the {{SparseMatrixCSR}} alias is useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6671) [C++] Sparse tensor naming
[ https://issues.apache.org/jira/browse/ARROW-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenta Murata reassigned ARROW-6671: --- Assignee: Kenta Murata > [C++] Sparse tensor naming > -- > > Key: ARROW-6671 > URL: https://issues.apache.org/jira/browse/ARROW-6671 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Assignee: Kenta Murata >Priority: Minor > > Currently there's {{SparseCOOIndex}} and {{SparseCSRIndex}}, but also > {{SparseTensorCOO}} and {{SparseTensorCSR}}. > For consistency, it would be nice to rename the latter {{SparseCOOTensor}} > and {{SparseCSRTensor}}. > Also, it's not obvious the {{SparseMatrixCSR}} alias is useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6672) [Java] Extract a common interface for dictionary builders
[ https://issues.apache.org/jira/browse/ARROW-6672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6672: -- Labels: pull-request-available (was: ) > [Java] Extract a common interface for dictionary builders > - > > Key: ARROW-6672 > URL: https://issues.apache.org/jira/browse/ARROW-6672 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Minor > Labels: pull-request-available > > We need a common interface for dictionary builders to support more > sophisticated scenarios, like collecting dictionary statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6672) [Java] Extract a common interface for dictionary builders
Liya Fan created ARROW-6672: --- Summary: [Java] Extract a common interface for dictionary builders Key: ARROW-6672 URL: https://issues.apache.org/jira/browse/ARROW-6672 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan We need a common interface for dictionary builders to support more sophisticated scenarios, like collecting dictionary statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6671) [C++] Sparse tensor naming
[ https://issues.apache.org/jira/browse/ARROW-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936696#comment-16936696 ] Antoine Pitrou commented on ARROW-6671: --- cc [~mrkn] > [C++] Sparse tensor naming > -- > > Key: ARROW-6671 > URL: https://issues.apache.org/jira/browse/ARROW-6671 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Priority: Minor > > Currently there's {{SparseCOOIndex}} and {{SparseCSRIndex}}, but also > {{SparseTensorCOO}} and {{SparseTensorCSR}}. > For consistency, it would be nice to rename the latter {{SparseCOOTensor}} > and {{SparseCSRTensor}}. > Also, it's not obvious the {{SparseMatrixCSR}} alias is useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6671) [C++] Sparse tensor naming
Antoine Pitrou created ARROW-6671: - Summary: [C++] Sparse tensor naming Key: ARROW-6671 URL: https://issues.apache.org/jira/browse/ARROW-6671 Project: Apache Arrow Issue Type: Wish Components: C++ Reporter: Antoine Pitrou Currently there's {{SparseCOOIndex}} and {{SparseCSRIndex}}, but also {{SparseTensorCOO}} and {{SparseTensorCSR}}. For consistency, it would be nice to rename the latter {{SparseCOOTensor}} and {{SparseCSRTensor}}. Also, it's not obvious the {{SparseMatrixCSR}} alias is useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6629) [Doc][C++] Document the FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-6629: - Assignee: Antoine Pitrou > [Doc][C++] Document the FileSystem API > -- > > Key: ARROW-6629 > URL: https://issues.apache.org/jira/browse/ARROW-6629 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Neal Richardson >Assignee: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > In ARROW-6622, I was looking for a place in the docs to add about path > normalization, and I couldn't find filesystem docs at all. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6472) [Java] ValueVector#accept may have a potential cast exception
[ https://issues.apache.org/jira/browse/ARROW-6472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6472: -- Labels: pull-request-available (was: ) > [Java] ValueVector#accept may have a potential cast exception > -- > > Key: ARROW-6472 > URL: https://issues.apache.org/jira/browse/ARROW-6472 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Major > Labels: pull-request-available > > Per discussion > [https://github.com/apache/arrow/pull/5195#issuecomment-528425302] > We may use the API this way: > {code:java} > RangeEqualsVisitor visitor = new RangeEqualsVisitor(vector1, vector2); > vector3.accept(visitor, range){code} > if vector1/vector2 are, say, {{StructVector}}s and vector3 is an {{IntVector}} > - things can go bad. We'll use {{compareBaseFixedWidthVectors()}} and do > wrong type-casts for vector1/vector2. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-4930) [Python] Remove LIBDIR assumptions in Python build
[ https://issues.apache.org/jira/browse/ARROW-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936581#comment-16936581 ] Antoine Pitrou commented on ARROW-4930: --- I'm afraid I can't really help constructively :-/ My CMake-fu is quite weak. I'll cc [~kou], who's much more knowledgeable in the area. > [Python] Remove LIBDIR assumptions in Python build > -- > > Key: ARROW-4930 > URL: https://issues.apache.org/jira/browse/ARROW-4930 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.12.1 >Reporter: Suvayu Ali >Priority: Minor > Labels: setup.py > Fix For: 2.0.0 > > Attachments: FindArrow.cmake.patch, FindParquet.cmake.patch > > > This is in reference to (4) in > [this|http://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3C0AF328A1-ED2A-457F-B72D-3B49C8614850%40xhochy.com%3E] > mailing list discussion. > Certain sections of setup.py assume a specific location of the C++ libraries. > Removing this hard assumption will simplify PyArrow builds significantly. As > far as I could tell these assumptions are made in the > {{build_ext._run_cmake()}} method (wherever bundling of C++ libraries is > handled). > # The first occurrence is before invoking cmake (see line 237). > # The second occurrence is when the C++ libraries are moved from their build > directory to the Python tree (see line 347). The actual implementation is in > the function {{_move_shared_libs_unix(..)}} (see line 468). > Hope this helps. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-5845) [Java] Implement converter between Arrow record batches and Avro records
[ https://issues.apache.org/jira/browse/ARROW-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ji Liu updated ARROW-5845: -- Priority: Major (was: Minor) > [Java] Implement converter between Arrow record batches and Avro records > > > Key: ARROW-5845 > URL: https://issues.apache.org/jira/browse/ARROW-5845 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Major > Fix For: 1.0.0 > > > It would be useful for applications which need to convert Avro data to Arrow > data. > This is an adapter which converts data with an existing API (like the JDBC adapter) > rather than a native reader (like orc). > We implement this function through the Avro java project, receiving params like > Decoder/Schema/DatumReader of Avro and returning a VectorSchemaRoot. For each data > type we have a consumer class as below to get Avro data and write it into a > vector to avoid boxing/unboxing (e.g. GenericRecord#get returns Object) > {code:java} > public class AvroIntConsumer implements Consumer { > private final IntWriter writer; > public AvroIntConsumer(IntVector vector) { > this.writer = new IntWriterImpl(vector); > } > @Override > public void consume(Decoder decoder) throws IOException { > writer.writeInt(decoder.readInt()); > writer.setPosition(writer.getPosition() + 1); > } > } > {code} > We intend to support primitive and complex types (null values represented > via a union type with a null type); size limits and field selection could be > optional for users. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6592) [Java] Add support for skipping decoding of columns/field in Avro converter
[ https://issues.apache.org/jira/browse/ARROW-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6592: -- Labels: avro pull-request-available (was: avro) > [Java] Add support for skipping decoding of columns/field in Avro converter > --- > > Key: ARROW-6592 > URL: https://issues.apache.org/jira/browse/ARROW-6592 > Project: Apache Arrow > Issue Type: Sub-task > Components: Java >Reporter: Micah Kornfield >Assignee: Ji Liu >Priority: Major > Labels: avro, pull-request-available > > Users should be able to pass in a set of fields they wish to decode from Avro, > and the converter should avoid creating Vectors in the returned > ArrowSchemaRoot. This would ideally support nested columns, so if there was: > > Struct A { > int B; > int C; > } > > The user could choose to only read A.B or A.C or both. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6601) [Java] Improve JDBC adapter performance & add benchmark
[ https://issues.apache.org/jira/browse/ARROW-6601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6601. Fix Version/s: 0.15.0 Resolution: Fixed Issue resolved by pull request 5472 [https://github.com/apache/arrow/pull/5472] > [Java] Improve JDBC adapter performance & add benchmark > --- > > Key: ARROW-6601 > URL: https://issues.apache.org/jira/browse/ARROW-6601 > Project: Apache Arrow > Issue Type: Task > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Critical > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Add a performance test as well to get a baseline number, to avoid performance > regression when we change related code. > -- This message was sent by Atlassian Jira (v8.3.4#803005)