[jira] [Commented] (ARROW-5293) [C++] Take kernel on DictionaryArray does not preserve ordered flag

2019-05-14 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839993#comment-16839993
 ] 

Wes McKinney commented on ARROW-5293:
-

I seem to have fixed this in passing while working on ARROW-3144

> [C++] Take kernel on DictionaryArray does not preserve ordered flag
> ---
>
> Key: ARROW-5293
> URL: https://issues.apache.org/jira/browse/ARROW-5293
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> In the Python tests I was adding, this was failing for an ordered 
> DictionaryArray: 
> https://github.com/apache/arrow/pull/4281/commits/1f65936e1a06ae415647af7d5c7f54c5937861f6#diff-01b63f189a63c0d4016f2f91370e08fcR92
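
For illustration, a minimal sketch of the behavior under test (hypothetical example; it assumes a pyarrow version where DictionaryArray exposes a take method, which the release discussed here did not yet do):

{code:python}
import pyarrow as pa

# Build an ordered dictionary array.
dict_arr = pa.DictionaryArray.from_arrays(
    indices=pa.array([0, 1, 2, 0], type=pa.int8()),
    dictionary=pa.array(['low', 'medium', 'high']),
    ordered=True)

# Take a subset of the values.
taken = dict_arr.take(pa.array([3, 1]))

# The result should still carry the ordered flag; before the fix the
# Take kernel silently dropped it.
assert taken.type.ordered
{code}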



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5293) [C++] Take kernel on DictionaryArray does not preserve ordered flag

2019-05-14 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-5293:
---

Assignee: Wes McKinney

> [C++] Take kernel on DictionaryArray does not preserve ordered flag
> ---
>
> Key: ARROW-5293
> URL: https://issues.apache.org/jira/browse/ARROW-5293
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> In the Python tests I was adding, this was failing for an ordered 
> DictionaryArray: 
> https://github.com/apache/arrow/pull/4281/commits/1f65936e1a06ae415647af7d5c7f54c5937861f6#diff-01b63f189a63c0d4016f2f91370e08fcR92



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5320) [C++] Undefined symbol errors occur when linking parquet executables

2019-05-14 Thread Kenta Murata (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839930#comment-16839930
 ] 

Kenta Murata commented on ARROW-5320:
-

I found that `brew unlink boost` resolves this issue. So the cause is that the 
build picks up the Boost headers installed by Homebrew instead of the vendored 
version.

I think the vendored version should be prioritized over the system-installed 
one when -DARROW_BOOST_VENDORED=ON is specified.

> [C++] Undefined symbol errors occur when linking parquet executables
> ---
>
> Key: ARROW-5320
> URL: https://issues.apache.org/jira/browse/ARROW-5320
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
> Environment: Xcode 10.2 on macOS Mojave 10.14.4
>Reporter: Kenta Murata
>Priority: Major
>
> Undefined symbol errors occurred when linking debug/parquet-reader, 
> debug/parquet-file-deserialize-test, and debug/parquet-scan. The unresolved 
> symbol is a Boost Regex symbol referenced from libparquet.a.
> I tried to build the commit 608e846a9f825a30a0faa651bc0a3eebba20e7db with 
> Xcode 10.2 on macOS Mojave.
> I specified -DARROW_BOOST_VENDORED=ON to avoid the problem related to the 
> latest boost in Homebrew (See 
> [https://github.com/boostorg/process/issues/55]).
> The complete build log is available here:
> [https://gist.github.com/mrkn/e5489140c9a782ca13a1b4bb8dd33111]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5320) [C++] Undefined symbol errors occur when linking parquet executables

2019-05-14 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-5320:
---

 Summary: [C++] Undefined symbol errors occur when linking 
parquet executables
 Key: ARROW-5320
 URL: https://issues.apache.org/jira/browse/ARROW-5320
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
 Environment: Xcode 10.2 on macOS Mojave 10.14.4
Reporter: Kenta Murata


Undefined symbol errors occurred when linking debug/parquet-reader, 
debug/parquet-file-deserialize-test, and debug/parquet-scan. The unresolved 
symbol is a Boost Regex symbol referenced from libparquet.a.

I tried to build the commit 608e846a9f825a30a0faa651bc0a3eebba20e7db with Xcode 
10.2 on macOS Mojave.

I specified -DARROW_BOOST_VENDORED=ON to avoid the problem related to the 
latest boost in Homebrew (See [https://github.com/boostorg/process/issues/55]).

The complete build log is available here:
[https://gist.github.com/mrkn/e5489140c9a782ca13a1b4bb8dd33111]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5319) [CI] Enable ccache with MinGW builds

2019-05-14 Thread Kouhei Sutou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-5319:
---

Assignee: Kouhei Sutou

> [CI] Enable ccache with MinGW builds
> 
>
> Key: ARROW-5319
> URL: https://issues.apache.org/jira/browse/ARROW-5319
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Assignee: Kouhei Sutou
>Priority: Major
>
> MinGW builds on AppVeyor have become quite slow. We should be able to enable 
> ccache with them and use AppVeyor build caching to speed up some of those 
> builds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-473) [C++/Python] Add public API for retrieving block locations for a particular HDFS file

2019-05-14 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-473:
--
Labels: filesystem hdfs pull-request-available  (was: hdfs 
pull-request-available)

> [C++/Python] Add public API for retrieving block locations for a particular 
> HDFS file
> -
>
> Key: ARROW-473
> URL: https://issues.apache.org/jira/browse/ARROW-473
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem, hdfs, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This is necessary for applications looking to schedule data-local work. 
> libhdfs does not have APIs to request the block locations directly, so we 
> need to see if the {{hdfsGetHosts}} function will do what we need. For 
> libhdfs3 there is a public API function 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-560) [C++] Add support for zero-copy libhdfs reads

2019-05-14 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-560:
--
Labels: filesystem  (was: )

> [C++] Add support for zero-copy libhdfs reads
> -
>
> Key: ARROW-560
> URL: https://issues.apache.org/jira/browse/ARROW-560
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
>
> See *Rz* functions in 
> https://github.com/apache/arrow/blob/master/cpp/thirdparty/hadoop/include/hdfs.h



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1319) [Python] Add additional HDFS filesystem methods

2019-05-14 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-1319:
---
Labels: HDFS filesystem  (was: HDFS)

> [Python] Add additional HDFS filesystem methods
> ---
>
> Key: ARROW-1319
> URL: https://issues.apache.org/jira/browse/ARROW-1319
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Martin Durant
>Priority: Major
>  Labels: HDFS, filesystem
> Fix For: 1.0.0
>
>
> The python library hdfs3 http://hdfs3.readthedocs.io/en/latest/api.html 
> contains a wider set of file-system methods than arrow's python bindings. 
> These are probably simple to implement for arrow-hdfs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3503) [Python] Allow config hadoop_bin in pyarrow hdfs.py

2019-05-14 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-3503:
---
Labels: filesystem pull-request-available  (was: pull-request-available)

> [Python] Allow config hadoop_bin in pyarrow hdfs.py 
> 
>
> Key: ARROW-3503
> URL: https://issues.apache.org/jira/browse/ARROW-3503
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wenbo Zhao
>Priority: Major
>  Labels: filesystem, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Currently, the hadoop_bin is either from `HADOOP_HOME` or the `hadoop` 
> command. 
> [https://github.com/apache/arrow/blob/master/python/pyarrow/hdfs.py#L130]
> However, in some environment setups, hadoop_bin could be at some other 
> location. Can we do something like 
>  
> {code:java}
> if 'HADOOP_BIN' in os.environ:
>     hadoop_bin = os.environ['HADOOP_BIN']
> elif 'HADOOP_HOME' in os.environ:
>     hadoop_bin = '{0}/bin/hadoop'.format(os.environ['HADOOP_HOME'])
> else:
>     hadoop_bin = 'hadoop'
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2081) Hdfs client isn't fork-safe

2019-05-14 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-2081:
---
Labels: filesystem  (was: )

> Hdfs client isn't fork-safe
> ---
>
> Key: ARROW-2081
> URL: https://issues.apache.org/jira/browse/ARROW-2081
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: filesystem
>
> Given the following script:
>  
> {code:java}
> import multiprocessing as mp
> import pyarrow as pa
> def ls(h):
>     print("calling ls")
>     return h.ls("/tmp")
> if __name__ == '__main__':
>     h = pa.hdfs.connect()
>     print("Using 'spawn'")
>     pool = mp.get_context('spawn').Pool(2)
>     results = pool.map(ls, [h, h])
>     sol = h.ls("/tmp")
>     for r in results:
>         assert r == sol
>     print("'spawn' succeeded\n")
>     print("Using 'fork'")
>     pool = mp.get_context('fork').Pool(2)
>     results = pool.map(ls, [h, h])
>     sol = h.ls("/tmp")
>     for r in results:
>         assert r == sol
>     print("'fork' succeeded")
> {code}
>  
> Results in the following output:
>  
> {code:java}
> $ python test.py
> Using 'spawn'
> calling ls
> calling ls
> 'spawn' succeeded
> Using 'fork{code}
>  
> The process then hangs, and I have to `kill -9` the forked worker processes.
>  
> I'm unable to get the libhdfs3 driver to work, so I'm unsure if this is a 
> problem with libhdfs or just arrow's use of it (a quick google search didn't 
> turn up anything useful).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1292) [C++/Python] Expand libhdfs feature coverage

2019-05-14 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-1292:
---
Labels: filesystem  (was: )

> [C++/Python] Expand libhdfs feature coverage
> 
>
> Key: ARROW-1292
> URL: https://issues.apache.org/jira/browse/ARROW-1292
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: filesystem
> Fix For: 1.0.0
>
>
> Umbrella JIRA. Will create child issues for more granular tasks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4668) [C++] Support GCP BigQuery Storage API

2019-05-14 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-4668:
---
Labels: filesystem  (was: )

> [C++] Support GCP BigQuery Storage API
> --
>
> Key: ARROW-4668
> URL: https://issues.apache.org/jira/browse/ARROW-4668
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: filesystem
> Fix For: 0.14.0
>
>
> Docs: [https://cloud.google.com/bigquery/docs/reference/storage/] 
> Need to investigate the best way to do this; maybe just see if we can build 
> our client on GCP (once a protobuf definition is published to 
> https://github.com/googleapis/googleapis/tree/master/google)?
>  
> This will serve as a parent issue, and sub-issues will be added for subtasks 
> if necessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5073) [C++] Build toolchain support for libcurl

2019-05-14 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-5073:
---
Labels: filesystem  (was: )

> [C++] Build toolchain support for libcurl
> -
>
> Key: ARROW-5073
> URL: https://issues.apache.org/jira/browse/ARROW-5073
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
> Fix For: 0.14.0
>
>
> libcurl can be used in a number of different situations (e.g. TensorFlow uses 
> it for GCS interactions 
> https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/cloud/gcs_file_system.cc)
>  so this will likely be required once we begin to tackle that problem



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1119) [Python/C++] Implement NativeFile interfaces for Amazon S3

2019-05-14 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-1119:
---
Labels: filesystem  (was: )

> [Python/C++] Implement NativeFile interfaces for Amazon S3
> --
>
> Key: ARROW-1119
> URL: https://issues.apache.org/jira/browse/ARROW-1119
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
> Fix For: 0.14.0
>
>
> While we support HDFS and the local file system now, it would be nice to also 
> support S3 and eventually other cloud storage natively



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-453) [C++] Add file interface implementations for Amazon S3

2019-05-14 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-453:
--
Labels: filesystem  (was: )

> [C++] Add file interface implementations for Amazon S3
> --
>
> Key: ARROW-453
> URL: https://issues.apache.org/jira/browse/ARROW-453
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
> Fix For: 0.14.0
>
>
> The BSD-licensed C++ code in SFrame 
> (https://github.com/turi-code/SFrame/tree/master/oss_src/fileio) may provide 
> some inspiration. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1682) [Python] Add documentation / example for reading a directory of Parquet files on S3

2019-05-14 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-1682:
---
Labels: filesystem parquet  (was: parquet)

> [Python] Add documentation / example for reading a directory of Parquet files 
> on S3
> ---
>
> Key: ARROW-1682
> URL: https://issues.apache.org/jira/browse/ARROW-1682
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem, parquet
> Fix For: 0.14.0
>
>
> Opened based on comment 
> https://github.com/apache/arrow/pull/916#issuecomment-337563492
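
A sketch of what such a documentation example could look like (bucket and path names are placeholders; it assumes s3fs is installed and AWS credentials are configured in the environment):

{code:python}
import pyarrow.parquet as pq
import s3fs

# Connect to S3 via s3fs and read a directory of Parquet files as one table.
fs = s3fs.S3FileSystem()
dataset = pq.ParquetDataset('my-bucket/path/to/parquet-dir', filesystem=fs)
table = dataset.read()
print(table.num_rows)
{code}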



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5072) [Python] write_table fails silently on S3 errors

2019-05-14 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-5072:
---
Labels: filesystem parquet  (was: parquet)

> [Python] write_table fails silently on S3 errors
> 
>
> Key: ARROW-5072
> URL: https://issues.apache.org/jira/browse/ARROW-5072
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.1
> Environment: Python 3.6.8
>Reporter: Paul George
>Priority: Minor
>  Labels: filesystem, parquet
>
> {{pyarrow==0.12.1}}
> *pyarrow.parquet.write_table* called with where=S3File(...) fails silently 
> when encountering errors while writing to S3 (in the example below, boto3 is 
> raising a NoSuchBucket exception). However, instead of using S3File(), 
> calling write_table with where set to the S3 file path and with 
> filesystem=S3FileSystem() does *not* fail silently and raises, as expected.
> h4. Code/Repro
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from s3fs import S3File, S3FileSystem
> df = pd.DataFrame({'col0': []})
> s3_filepath = 's3://some-bogus-bucket/df.parquet'
> print('>> test 1')
> try:
>     # use S3File --> fails silently
>     pq.write_table(pa.Table.from_pandas(df.copy()),
>                    S3File(S3FileSystem(), s3_filepath, mode='wb'))
> except Exception:
>     print(' Exception raised!')
> else:
>     print(' Exception **NOT** raised!')
> print('>> test 2')
> try:
>     # use filepath and S3FileSystem --> raises Exception, as expected
>     pq.write_table(pa.Table.from_pandas(df.copy()),
>                    s3_filepath,
>                    filesystem=S3FileSystem())
> except Exception:
>     print(' Exception raised!')
> else:
>     print(' Exception **NOT** raised!'){code}
>  
> h4. Output
> {code:java}
> >> test 1
> Exception ignored in:  some-bogus-bucket/df.parquet>>
> Traceback (most recent call last):
> File "/lib/python3.6/site-packages/s3fs/core.py", line 1476, in 
> __del__
> self.close()
> File "/lib/python3.6/site-packages/s3fs/core.py", line 1454, in 
> close
> raise_from(IOError('Write failed: %s' % self.path), e)
> File "", line 3, in raise_from
> OSError: Write failed: some-bogus-bucket/df.parquet
>  Exception **NOT** raised!
> >> test 2
>  Exception raised!
> Exception ignored in:  some-bogus-bucket/df.parquet>>
> Traceback (most recent call last):
> File "/lib/python3.6/site-packages/s3fs/core.py", line 1476, in 
> __del__
> self.close()
> File "/lib/python3.6/site-packages/s3fs/core.py", line 1454, in 
> close
> raise_from(IOError('Write failed: %s' % self.path), e)
> File "", line 3, in raise_from
> OSError: Write failed: some-bogus-bucket/df.parquet
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1456) [Python] Run s3fs unit tests in Travis CI

2019-05-14 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-1456:
---
Labels: filesystem  (was: )

> [Python] Run s3fs unit tests in Travis CI
> -
>
> Key: ARROW-1456
> URL: https://issues.apache.org/jira/browse/ARROW-1456
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
> Fix For: 0.14.0
>
>
> We'll need to set up an S3 bucket to write to with credentials that cannot 
> compromise anyone's AWS account. I've been testing locally with a user that I 
> set up but I wouldn't be comfortable checking in these credentials, even in 
> encrypted form, without more scrutiny



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4860) [C++] Build AWS C++ SDK for Windows in conda-forge

2019-05-14 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-4860:
---
Labels: filesystem  (was: )

> [C++] Build AWS C++ SDK for Windows in conda-forge
> --
>
> Key: ARROW-4860
> URL: https://issues.apache.org/jira/browse/ARROW-4860
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
> Fix For: 0.14.0
>
>
> We need the aws-sdk-cpp package to be able to use the C++ SDK for S3 support. 
> It is currently available for Linux and macOS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-500) [C++] Implement concurrent IO read queue for file-like sources

2019-05-14 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-500:
--
Labels: filesystem  (was: )

> [C++] Implement concurrent IO read queue for file-like sources
> --
>
> Key: ARROW-500
> URL: https://issues.apache.org/jira/browse/ARROW-500
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
>
> In a multithreaded setting, we may spawn many threads which will have access 
> to a shared IO resources. It may be useful to create a thread-safe IO queue 
> implementing the {{arrow::io::ReadableFileInterface}}, limiting the number of 
> concurrent requests to the desired number (which may be 1, for services not 
> permitting concurrent access).
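
A conceptual sketch of the idea in Python (not the proposed C++ interface; the class name, the read_at method, and the wrapped source are made up for illustration):

{code:python}
import threading

class ConcurrencyLimitedReader:
    """Cap the number of concurrent positioned reads against a shared source.

    Conceptual sketch only: assumes the wrapped source provides a thread-safe
    positioned read (called read_at here), the kind of primitive the C++
    ReadableFileInterface exposes.
    """

    def __init__(self, source, max_concurrent_reads=1):
        self._source = source
        self._slots = threading.Semaphore(max_concurrent_reads)

    def read_at(self, offset, nbytes):
        # Block until one of the allowed request slots is free,
        # then forward the positioned read to the underlying source.
        with self._slots:
            return self._source.read_at(offset, nbytes)
{code}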



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-501) [C++] Implement concurrent / buffering InputStream for streaming data use cases

2019-05-14 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-501:
--
Labels: csv filesystem pull-request-available  (was: csv 
pull-request-available)

> [C++] Implement concurrent / buffering InputStream for streaming data use 
> cases
> ---
>
> Key: ARROW-501
> URL: https://issues.apache.org/jira/browse/ARROW-501
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: csv, filesystem, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> Related to ARROW-500, when processing an input data stream, we may wish to 
> continue buffering input (up to a maximum buffer size) in between 
> synchronous Read calls



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1848) [Python] Add documentation examples for reading single Parquet files and datasets from HDFS

2019-05-14 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-1848:
---
Labels: filesystem parquet  (was: parquet)

> [Python] Add documentation examples for reading single Parquet files and 
> datasets from HDFS
> ---
>
> Key: ARROW-1848
> URL: https://issues.apache.org/jira/browse/ARROW-1848
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem, parquet
> Fix For: 0.14.0
>
>
> see 
> https://stackoverflow.com/questions/47443151/read-a-parquet-files-from-hdfs-using-pyarrow
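
A sketch of the kind of example such documentation could show (host, port, and paths are placeholders; it uses the legacy pa.hdfs.connect API available at the time):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Connect to HDFS (namenode host and port are placeholders).
fs = pa.hdfs.connect('namenode-host', 8020)

# Read a single Parquet file.
with fs.open('/data/example.parquet', 'rb') as f:
    single_table = pq.read_table(f)

# Read a directory of Parquet files as one dataset.
dataset = pq.ParquetDataset('/data/example_dataset', filesystem=fs)
full_table = dataset.read()
{code}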



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5284) [Rust] Replace libc with std::alloc for memory allocation

2019-05-14 Thread Chao Sun (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved ARROW-5284.
-
   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4273
[https://github.com/apache/arrow/pull/4273]

> [Rust] Replace libc with std::alloc for memory allocation
> -
>
> Key: ARROW-5284
> URL: https://issues.apache.org/jira/browse/ARROW-5284
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5319) [CI] Enable ccache with MinGW builds

2019-05-14 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5319:
-

 Summary: [CI] Enable ccache with MinGW builds
 Key: ARROW-5319
 URL: https://issues.apache.org/jira/browse/ARROW-5319
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Continuous Integration
Reporter: Antoine Pitrou


MinGW builds on AppVeyor have become quite slow. We should be able to enable 
ccache with them and use AppVeyor build caching to speed up some of those 
builds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5319) [CI] Enable ccache with MinGW builds

2019-05-14 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839797#comment-16839797
 ] 

Antoine Pitrou commented on ARROW-5319:
---

[~kou]

> [CI] Enable ccache with MinGW builds
> 
>
> Key: ARROW-5319
> URL: https://issues.apache.org/jira/browse/ARROW-5319
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Major
>
> MinGW builds on AppVeyor have become quite slow. We should be able to enable 
> ccache with them and use AppVeyor build caching to speed up some of those 
> builds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5299) [C++] ListArray comparison is incorrect

2019-05-14 Thread Benjamin Kietzman (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Kietzman resolved ARROW-5299.
--
Resolution: Fixed

Resolved as a drive-by in pull request 4278: 
https://github.com/apache/arrow/pull/4278/files#diff-651c90e7833d833bab85492c313bf3feR587

> [C++] ListArray comparison is incorrect
> ---
>
> Key: ARROW-5299
> URL: https://issues.apache.org/jira/browse/ARROW-5299
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Benjamin Kietzman
>Assignee: Benjamin Kietzman
>Priority: Minor
>
> Equality comparison for ListArray accounts for offsets incorrectly, so slices 
> of ListArrays with identical offsets but differing values will sometimes 
> compare equal.
> For example, the following test fails:
> {code}
> TEST_F(TestListArray, ValuesEquality) {
>   auto type = list(int32());
>   auto left = ArrayFromJSON(type, "[[1, 2], [3, null], [5], []]");
>   auto right = ArrayFromJSON(type, "[[1, 2], [3, null], [9], []]");
>   AssertArraysEqual(*left->Slice(2), *right->Slice(2));
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5318) pyarrow hdfs reader overrequests

2019-05-14 Thread Ivan Dimitrov (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839714#comment-16839714
 ] 

Ivan Dimitrov commented on ARROW-5318:
--

Maybe this is some eager caching?

> pyarrow hdfs reader overrequests  
> --
>
> Key: ARROW-5318
> URL: https://issues.apache.org/jira/browse/ARROW-5318
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Ivan Dimitrov
>Priority: Blocker
>
> When reading using HdfsFilesystem's read method, the amount is not constant 
> and fluctuates between 0% and 300% extra.
> Example code 
> {code:java}
> fs = hdfs.connect(hostname, driver='libhdfs')
> f = fs.open(dataset_path)
> f.read(nbytes=350){code}
> In this case, the read can send back up to 15 MB. The issue occurs with 
> 'libhdfs3' as well, and is also present in newer versions of pyarrow.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5318) pyarrow hdfs reader overrequests

2019-05-14 Thread Ivan Dimitrov (JIRA)
Ivan Dimitrov created ARROW-5318:


 Summary: pyarrow hdfs reader overrequests  
 Key: ARROW-5318
 URL: https://issues.apache.org/jira/browse/ARROW-5318
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.10.0
Reporter: Ivan Dimitrov


When reading using HdfsFilesystem's read method, the amount is not constant and 
fluctuates between 0% and 300% extra.

Example code 
{code:java}
fs = hdfs.connect(hostname, driver='libhdfs')
f = fs.open(dataset_path)

f.read(nbytes=350){code}
In this case, the read can send back up to 15 MB. The issue occurs with 
'libhdfs3' as well, and is also present in newer versions of pyarrow.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-4806) [Rust] Support casting temporal arrays in cast kernels

2019-05-14 Thread Chao Sun (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved ARROW-4806.
-
Resolution: Fixed

Issue resolved by pull request 4150
[https://github.com/apache/arrow/pull/4150]

> [Rust] Support casting temporal arrays in cast kernels
> --
>
> Key: ARROW-4806
> URL: https://issues.apache.org/jira/browse/ARROW-4806
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Affects Versions: 0.12.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> [ARROW-3882] is too far in the review process to add temporal casts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5317) [Rust] [Parquet] impl IntoIterator for SerializedFileReader

2019-05-14 Thread Chao Sun (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839627#comment-16839627
 ] 

Chao Sun commented on ARROW-5317:
-

[~wesmckinn] @andygrove: could you add [~FabioBatSilva] into the contributor 
list so we can assign this Jira to him? Thanks.

> [Rust] [Parquet] impl IntoIterator for SerializedFileReader
> ---
>
> Key: ARROW-5317
> URL: https://issues.apache.org/jira/browse/ARROW-5317
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Fabio Batista da Silva
>Priority: Minor
>
> This is a follow up to [https://github.com/apache/arrow/issues/4301].
> The current implementation of the row iterator *RowIter* borrows the 
> *FileReader*, so the user has to keep the file reader alive for as long as 
> the iterator is alive.
> This makes it hard to iterate over multiple *FileReader* / *RowIter* pairs.
> {code:java}
> fn main() {
> let path1 = Path::new("path-to/1.snappy.parquet");
> let path2 = Path::new("path-to/2.snappy.parquet");
> let vec = vec![path1, path2];
> let it = vec.iter()
> .map(|p| {
> File::open(p).unwrap()
> })
> .map(|f| {
> SerializedFileReader::new(f).unwrap()
> })
> .flat_map(|reader| -> RowIter {
> RowIter::from_file(None, &reader).unwrap()
> //                       ^^^^^^^ `reader` is borrowed here
> // returns a value referencing data owned by the current function
> })
> ;
> for r in it {
> println!("{}", r);
> }
> }
> {code}
> One solution could be to implement a row iterator that takes ownership of the 
> reader.
> Perhaps implementing *std::iter::IntoIterator* for the *SerializedFileReader*
> {code:java}
> 
> .map(|p| {
> File::open(p).unwrap()
> })
> .map(|f| {
> SerializedFileReader::new(f).unwrap()
> })
> .flat_map(|r| -> r.into_iter())
> 
> {code}
>  
> Happy to put a PR out with this..
>  Please let me know if this makes sense and you guys already have some way of 
> doing this..



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-5317) [Rust] [Parquet] impl IntoIterator for SerializedFileReader

2019-05-14 Thread Chao Sun (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839627#comment-16839627
 ] 

Chao Sun edited comment on ARROW-5317 at 5/14/19 5:19 PM:
--

[~wesmckinn], [~andygrove] could you add [~FabioBatSilva] into the contributor 
list so we can assign this Jira to him? Thanks.


was (Author: csun):
[~wesmckinn] @andygrove: could you add [~FabioBatSilva] into the contributor 
list so we can assign this Jira to him? Thanks.

> [Rust] [Parquet] impl IntoIterator for SerializedFileReader
> ---
>
> Key: ARROW-5317
> URL: https://issues.apache.org/jira/browse/ARROW-5317
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Fabio Batista da Silva
>Priority: Minor
>
> This is a follow up to [https://github.com/apache/arrow/issues/4301].
> The current implementation of the row iterator *RowIter* borrows the 
> *FileReader*, so the user has to keep the file reader alive for as long as 
> the iterator is alive.
> This makes it hard to iterate over multiple *FileReader* / *RowIter* pairs.
> {code:java}
> fn main() {
> let path1 = Path::new("path-to/1.snappy.parquet");
> let path2 = Path::new("path-to/2.snappy.parquet");
> let vec = vec![path1, path2];
> let it = vec.iter()
> .map(|p| {
> File::open(p).unwrap()
> })
> .map(|f| {
> SerializedFileReader::new(f).unwrap()
> })
> .flat_map(|reader| -> RowIter {
> RowIter::from_file(None, &reader).unwrap()
> //                       ^^^^^^^ `reader` is borrowed here
> // returns a value referencing data owned by the current function
> })
> ;
> for r in it {
> println!("{}", r);
> }
> }
> {code}
> One solution could be to implement a row iterator that takes ownership of the 
> reader.
> Perhaps implementing *std::iter::IntoIterator* for the *SerializedFileReader*
> {code:java}
> 
> .map(|p| {
> File::open(p).unwrap()
> })
> .map(|f| {
> SerializedFileReader::new(f).unwrap()
> })
> .flat_map(|r| -> r.into_iter())
> 
> {code}
>  
> Happy to put a PR out with this..
>  Please let me know if this makes sense and you guys already have some way of 
> doing this..



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5317) [Rust] [Parquet] impl IntoIterator for SerializedFileReader

2019-05-14 Thread Chao Sun (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839624#comment-16839624
 ] 

Chao Sun commented on ARROW-5317:
-

[~FabioBatSilva] yes this does make sense. Can you put a PR for this? Thanks!

> [Rust] [Parquet] impl IntoIterator for SerializedFileReader
> ---
>
> Key: ARROW-5317
> URL: https://issues.apache.org/jira/browse/ARROW-5317
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Fabio Batista da Silva
>Priority: Minor
>
> This is a follow up to [https://github.com/apache/arrow/issues/4301].
> The current implementation of the row iterator *RowIter* borrows the 
> *FileReader*, so the user has to keep the file reader alive for as long as 
> the iterator is alive.
> This makes it hard to iterate over multiple *FileReader* / *RowIter* pairs.
> {code:java}
> fn main() {
> let path1 = Path::new("path-to/1.snappy.parquet");
> let path2 = Path::new("path-to/2.snappy.parquet");
> let vec = vec![path1, path2];
> let it = vec.iter()
> .map(|p| {
> File::open(p).unwrap()
> })
> .map(|f| {
> SerializedFileReader::new(f).unwrap()
> })
> .flat_map(|reader| -> RowIter {
> RowIter::from_file(None, &reader).unwrap()
> //                       ^^^^^^^ `reader` is borrowed here
> // returns a value referencing data owned by the current function
> })
> ;
> for r in it {
> println!("{}", r);
> }
> }
> {code}
> One solution could be to implement a row iterator that takes ownership of the 
> reader.
> Perhaps implementing *std::iter::IntoIterator* for the *SerializedFileReader*
> {code:java}
> 
> .map(|p| {
> File::open(p).unwrap()
> })
> .map(|f| {
> SerializedFileReader::new(f).unwrap()
> })
> .flat_map(|r| -> r.into_iter())
> 
> {code}
>  
> Happy to put a PR out with this..
>  Please let me know if this makes sense and you guys already have some way of 
> doing this..



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-653) [Python / C++] Add debugging function to print an array's buffer contents in hexadecimal

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-653:
-
Labels: pull-request-available  (was: )

> [Python / C++] Add debugging function to print an array's buffer contents in 
> hexadecimal
> 
>
> Key: ARROW-653
> URL: https://issues.apache.org/jira/browse/ARROW-653
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> This would help with debugging and illustrating the Arrow internals
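
A rough Python sketch of what such a helper could print (the helper name is hypothetical; it simply walks the array's buffers and hex-encodes their bytes):

{code:python}
import pyarrow as pa

def print_buffers_hex(arr):
    # Dump each underlying buffer of an array as hexadecimal bytes.
    for i, buf in enumerate(arr.buffers()):
        if buf is None:
            print("buffer {}: <absent>".format(i))
        else:
            print("buffer {}: {}".format(i, buf.to_pybytes().hex()))

# Example: an int16 array with a null, so both the validity bitmap and the
# values buffer are shown.
print_buffers_hex(pa.array([1, 2, None], type=pa.int16()))
{code}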



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1278) Integration tests for Fixed Size List type

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1278:
--
Labels: pull-request-available  (was: )

> Integration tests for Fixed Size List type
> --
>
> Key: ARROW-1278
> URL: https://issues.apache.org/jira/browse/ARROW-1278
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Integration
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5102) [C++] Reduce header dependencies

2019-05-14 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839567#comment-16839567
 ] 

Antoine Pitrou commented on ARROW-5102:
---

Also see 
https://aras-p.info/blog/2019/01/16/time-trace-timeline-flame-chart-profiler-for-Clang/
clang 9.0 will hopefully help us tackle this much more rationally.

> [C++] Reduce header dependencies
> 
>
> Key: ARROW-5102
> URL: https://issues.apache.org/jira/browse/ARROW-5102
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.14.0
>
>
> To tame C++ compile times, we should try to reduce the number of heavy 
> dependencies in our .h files.
> Two possible avenues come to mind:
> * avoid including `unordered_map` and friends
> * avoid including C++ stream libraries (such as `iostream`, `ios`, 
> `sstream`...)
> Unfortunately we're currently including `sstream` in `status.h` for some 
> template APIs. We may move those to a separate include file (e.g. 
> `status-builder.h`).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5317) [Rust] [Parquet] impl IntoIterator for SerializedFileReader

2019-05-14 Thread Fabio Batista da Silva (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabio Batista da Silva updated ARROW-5317:
--
Description: 
This is a follow up to [https://github.com/apache/arrow/issues/4301].

The current implementation of the row iterator *RowIter* borrows the 
*FileReader*, so the user has to keep the file reader alive for as long as the 
iterator is alive.

This makes it hard to iterate over multiple *FileReader* / *RowIter* pairs.
{code:java}
fn main() {
let path1 = Path::new("path-to/1.snappy.parquet");
let path2 = Path::new("path-to/2.snappy.parquet");
let vec = vec![path1, path2];
let it = vec.iter()
.map(|p| {
File::open(p).unwrap()
})
.map(|f| {
SerializedFileReader::new(f).unwrap()
})
.flat_map(|reader| -> RowIter {
RowIter::from_file(None, &reader).unwrap()
//                       ^^^^^^^ `reader` is borrowed here
// returns a value referencing data owned by the current function
})
;

for r in it {
println!("{}", r);
}
}
{code}
One solution could be to implement a row iterator that takes ownership of the 
reader.

Perhaps implementing *std::iter::IntoIterator* for the *SerializedFileReader*
{code:java}

.map(|p| {
File::open(p).unwrap()
})
.map(|f| {
SerializedFileReader::new(f).unwrap()
})
.flat_map(|r| -> r.into_iter())

{code}
 

Happy to put a PR out with this..
 Please let me know if this makes sense and you guys already have some way of 
doing this..

  was:
This is a follow up to [https://github.com/apache/arrow/issues/4301].

The current implementation of the row iterator *RowIter* borrows the 
*FileReader*, so the user has to keep the file reader alive for as long as the 
iterator is alive.

This makes it hard to iterate over multiple *FileReader* / *RowIter* pairs.
{code:java}
fn main() {
let path1 = Path::new("path-to/1.snappy.parquet");
let path2 = Path::new("path-to/2.snappy.parquet");
let vec = vec![path1, path2];
let it = vec.iter()
.map(|p| {
File::open(p).unwrap()
})
.map(|f| {
SerializedFileReader::new(f).unwrap()
})
.flat_map(|reader| -> RowIter {
RowIter::from_file(None, &reader).unwrap()
//                       ^^^^^^^ `reader` is borrowed here
// returns a value referencing data owned by the current function
})
;

for r in it {
println!("{}", r);
}
}
{code}
One solution could be to implement a row iterator that takes ownership of the 
reader.

Perhaps implementing *std::iter::IntoIterator* for the *SerializedFileReader*
{code:java}

.map(|p| {
File::open(p).unwrap()
})
.map(|f| {
SerializedFileReader::new(f).unwrap()
})
.flat_map(|r| -> r)

{code}
 

Happy to put a PR out with this..
Please let me know if this makes sense and you guys already have some way of 
doing this..


> [Rust] [Parquet] impl IntoIterator for SerializedFileReader
> ---
>
> Key: ARROW-5317
> URL: https://issues.apache.org/jira/browse/ARROW-5317
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Fabio Batista da Silva
>Priority: Minor
>
> This is a follow up to [https://github.com/apache/arrow/issues/4301].
> The current implementation of the row iterator *RowIter* borrows the 
> *FileReader*, so the user has to keep the file reader alive for as long as 
> the iterator is alive.
> This makes it hard to iterate over multiple *FileReader* / *RowIter* pairs.
> {code:java}
> fn main() {
> let path1 = Path::new("path-to/1.snappy.parquet");
> let path2 = Path::new("path-to/2.snappy.parquet");
> let vec = vec![path1, path2];
> let it = vec.iter()
> .map(|p| {
> File::open(p).unwrap()
> })
> .map(|f| {
> SerializedFileReader::new(f).unwrap()
> })
> .flat_map(|reader| -> RowIter {
> RowIter::from_file(None, &reader).unwrap()
> //                       ^^^^^^^ `reader` is borrowed here
> // returns a value referencing data owned by the current function
> })
> ;
> for r in it {
> println!("{}", r);
> }
> }
> {code}
> One solution could be to implement a row iterator that takes ownership of the 
> reader.
> Perhaps implementing *std::iter::IntoIterator* for the *SerializedFileReader*
> {code:java}
> 
> .map(|p| {
> File::open(p).unwrap()
> })
> .map(|f| {
> SerializedFileReader::new(f).unwrap()
> })
> .flat_map(|r| -> r.into_iter())
> 
> {code}
>  
> Happy to put a PR out with this..
>  

[jira] [Created] (ARROW-5317) [Rust] [Parquet] impl IntoIterator for SerializedFileReader

2019-05-14 Thread Fabio Batista da Silva (JIRA)
Fabio Batista da Silva created ARROW-5317:
-

 Summary: [Rust] [Parquet] impl IntoIterator for 
SerializedFileReader
 Key: ARROW-5317
 URL: https://issues.apache.org/jira/browse/ARROW-5317
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Fabio Batista da Silva


This is a follow up to [https://github.com/apache/arrow/issues/4301].

The current implementation of the row iterator *RowIter* borrows the 
*FileReader*, so the user has to keep the file reader alive for as long as the 
iterator is alive.

This makes it hard to iterate over multiple *FileReader* / *RowIter* pairs.
{code:java}
fn main() {
let path1 = Path::new("path-to/1.snappy.parquet");
let path2 = Path::new("path-to/2.snappy.parquet");
let vec = vec![path1, path2];
let it = vec.iter()
.map(|p| {
File::open(p).unwrap()
})
.map(|f| {
SerializedFileReader::new(f).unwrap()
})
.flat_map(|reader| -> RowIter {
RowIter::from_file(None, &reader).unwrap()
//                       ^^^^^^^ `reader` is borrowed here
// returns a value referencing data owned by the current function
})
;

for r in it {
println!("{}", r);
}
}
{code}
One solution could be to implement a row iterator that takes ownership of the 
reader.

Perhaps implementing *std::iter::IntoIterator* for the *SerializedFileReader*
{code:java}

.map(|p| {
File::open(p).unwrap()
})
.map(|f| {
SerializedFileReader::new(f).unwrap()
})
.flat_map(|r| -> r)

{code}
 

Happy to put a PR out with this..
Please let me know if this makes sense and you guys already have some way of 
doing this..



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5311) [C++] Return more specific invalid Status in Take kernel

2019-05-14 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839385#comment-16839385
 ] 

Wes McKinney commented on ARROW-5311:
-

Seems OK to me, interested in other opinions

> [C++] Return more specific invalid Status in Take kernel
> 
>
> Key: ARROW-5311
> URL: https://issues.apache.org/jira/browse/ARROW-5311
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Minor
> Fix For: 0.14.0
>
>
> Currently the {{Take}} kernel returns generic Invalid Status for certain 
> cases, that could use more specific error:
> - indices of wrong type (eg floats) -> TypeError instead of Invalid?
> - out of bounds index -> new IndexError ?
> From review in https://github.com/apache/arrow/pull/4281
> cc [~bkietz]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5315) [Rust] Gandiva binding.

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5315:
--
Labels: pull-request-available  (was: )

> [Rust] Gandiva binding.
> ---
>
> Key: ARROW-5315
> URL: https://issues.apache.org/jira/browse/ARROW-5315
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Renjie Liu
>Priority: Major
>  Labels: pull-request-available
>
> Add gandiva binding for rust.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5102) [C++] Reduce header dependencies

2019-05-14 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839360#comment-16839360
 ] 

Antoine Pitrou commented on ARROW-5102:
---

There are interesting measurements of compile time induced by standard C++ 
headers here:
http://virtuallyrandom.com/c-compilation-lies-damned-lies-and-statistics/


> [C++] Reduce header dependencies
> 
>
> Key: ARROW-5102
> URL: https://issues.apache.org/jira/browse/ARROW-5102
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.14.0
>
>
> To tame C++ compile times, we should try to reduce the number of heavy 
> dependencies in our .h files.
> Two possible avenues come to mind:
> * avoid including `unordered_map` and friends
> * avoid including C++ stream libraries (such as `iostream`, `ios`, 
> `sstream`...)
> Unfortunately we're currently including `sstream` in `status.h` for some 
> template APIs. We may move those to a separate include file (e.g. 
> `status-builder.h`).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5316) [Rust] Interfaces for gandiva bindings.

2019-05-14 Thread Renjie Liu (JIRA)
Renjie Liu created ARROW-5316:
-

 Summary: [Rust] Interfaces for gandiva bindings.
 Key: ARROW-5316
 URL: https://issues.apache.org/jira/browse/ARROW-5316
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Renjie Liu
Assignee: Renjie Liu


Create interfaces to demonstrate high level design and ideas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5315) [Rust] Gandiva binding.

2019-05-14 Thread Renjie Liu (JIRA)
Renjie Liu created ARROW-5315:
-

 Summary: [Rust] Gandiva binding.
 Key: ARROW-5315
 URL: https://issues.apache.org/jira/browse/ARROW-5315
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Renjie Liu


Add gandiva binding for rust.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5311) [C++] Return more specific invalid Status in Take kernel

2019-05-14 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839210#comment-16839210
 ] 

Antoine Pitrou commented on ARROW-5311:
---

I think we could probably add a Status::IndexError on the C++ side to designate 
out-of-bounds conditions (when detected :-)). [~wesmckinn] [~xhochy] what do 
you think?

> [C++] Return more specific invalid Status in Take kernel
> 
>
> Key: ARROW-5311
> URL: https://issues.apache.org/jira/browse/ARROW-5311
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Minor
> Fix For: 0.14.0
>
>
> Currently the {{Take}} kernel returns generic Invalid Status for certain 
> cases, that could use more specific error:
> - indices of wrong type (eg floats) -> TypeError instead of Invalid?
> - out of bounds index -> new IndexError ?
> From review in https://github.com/apache/arrow/pull/4281
> cc [~bkietz]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5270) [C++] Reenable Valgrind on Travis-CI

2019-05-14 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839183#comment-16839183
 ] 

Antoine Pitrou commented on ARROW-5270:
---

The instructions not recognized seem to be SSE4.2 instructions. We would 
probably need a more recent Valgrind version on Travis-CI...


> [C++] Reenable Valgrind on Travis-CI
> 
>
> Key: ARROW-5270
> URL: https://issues.apache.org/jira/browse/ARROW-5270
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Running Valgrind on Travis-CI was disabled in ARROW-4611 (apparently because 
> of issues within the re2 library).
> We should reenable it at some point in order to exercise the reliability of 
> our C++ code.
> (and/or have a build with another piece of instrumentation enabled such as 
> ASAN)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5272) [C++] [Gandiva] JIT code executed over uninitialized values

2019-05-14 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839181#comment-16839181
 ] 

Antoine Pitrou commented on ARROW-5272:
---

Just {{-DARROW_TEST_MEMCHECK=on}}.

This is on Ubuntu 18.04 with gcc 7.4.0, Valgrind 3.13.0 and LLVM 7.0.

> [C++] [Gandiva] JIT code executed over uninitialized values
> ---
>
> Key: ARROW-5272
> URL: https://issues.apache.org/jira/browse/ARROW-5272
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Antoine Pitrou
>Assignee: Pindikura Ravindra
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When running Gandiva tests with Valgrind, I get the following errors:
> {code}
> [==] Running 4 tests from 1 test case.
> [--] Global test environment set-up.
> [--] 4 tests from TestDecimal
> [ RUN  ] TestDecimal.TestSimple
> ==12052== Conditional jump or move depends on uninitialised value(s)
> ==12052==at 0x41110D5: ???
> ==12052== 
> {
>
>Memcheck:Cond
>obj:*
> }
> ==12052== Conditional jump or move depends on uninitialised value(s)
> ==12052==at 0x41110E8: ???
> ==12052== 
> {
>
>Memcheck:Cond
>obj:*
> }
> ==12052== Conditional jump or move depends on uninitialised value(s)
> ==12052==at 0x44B: ???
> ==12052== 
> {
>
>Memcheck:Cond
>obj:*
> }
> ==12052== Conditional jump or move depends on uninitialised value(s)
> ==12052==at 0x47B: ???
> ==12052== 
> {
>
>Memcheck:Cond
>obj:*
> }
> [   OK ] TestDecimal.TestSimple (16625 ms)
> [ RUN  ] TestDecimal.TestLiteral
> [   OK ] TestDecimal.TestLiteral (3480 ms)
> [ RUN  ] TestDecimal.TestIfElse
> [   OK ] TestDecimal.TestIfElse (2408 ms)
> [ RUN  ] TestDecimal.TestCompare
> [   OK ] TestDecimal.TestCompare (5303 ms)
> {code}
> I think this is legitimate. Gandiva runs computations over all values, even 
> when the bitmap indicates a null value. But decimal computations are complex 
> and involve conditional jumps, hence the error ("Conditional jump or move 
> depends on uninitialised value(s)").
> [~pravindra]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5270) [C++] Reenable Valgrind on Travis-CI

2019-05-14 Thread Pindikura Ravindra (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839104#comment-16839104
 ] 

Pindikura Ravindra commented on ARROW-5270:
---

[https://travis-ci.org/apache/arrow/jobs/531878628]

> [C++] Reenable Valgrind on Travis-CI
> 
>
> Key: ARROW-5270
> URL: https://issues.apache.org/jira/browse/ARROW-5270
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Running Valgrind on Travis-CI was disabled in ARROW-4611 (apparently because 
> of issues within the re2 library).
> We should reenable it at some point in order to exercise the reliability of 
> our C++ code.
> (and/or have a build with another piece of instrumentation enabled such as 
> ASAN)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5270) [C++] Reenable Valgrind on Travis-CI

2019-05-14 Thread Pindikura Ravindra (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839103#comment-16839103
 ] 

Pindikura Ravindra commented on ARROW-5270:
---

There are two issues:
 # instructions not recognized by Valgrind

==20276== Your program just tried to execute an instruction that Valgrind
==20276== did not recognise. There are two possible reasons for this.
==20276== 1. Your program has a bug and erroneously jumped to a non-code
==20276==    location. If you are running Memcheck and you just saw a
==20276==    warning about a bad jump, it's probably your program's fault.

2. the re2 issues

I think these are already covered by the suppressions listed in valgrind.supp, 
but they aren't being recognized due to missing symbols in the stack.

When I ran this on my Xenial setup without any conda environment, the stacks 
showed up correctly and got suppressed. So I suspect this is an issue with the 
conda binaries.

> [C++] Reenable Valgrind on Travis-CI
> 
>
> Key: ARROW-5270
> URL: https://issues.apache.org/jira/browse/ARROW-5270
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Running Valgrind on Travis-CI was disabled in ARROW-4611 (apparently because 
> of issues within the re2 library).
> We should reenable it at some point in order to exercise the reliability of 
> our C++ code.
> (and/or have a build with another piece of instrumentation enabled such as 
> ASAN)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)