[jira] [Created] (ARROW-7669) [CI] [C++] Turn optimizations off on AppVeyor

2020-01-24 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7669: Summary: [CI] [C++] Turn optimizations off on AppVeyor Key: ARROW-7669 URL: https://issues.apache.org/jira/browse/ARROW-7669 Project: Apache Arrow Issue Ty

Re: PySpark failure [RE: [NIGHTLY] Arrow Build Report for Job nightly-2020-01-15-0]

2020-01-24 Thread Joris Van den Bossche
Hi Bryan, For the case where the column is not a timestamp and was not modified: I don't think it will take copies of the full dataframe by assigning columns in a loop like that. But it is still doing work (it will copy data for that column into the array holding those data for 2D blocks), and which c

Improve the ergonomics of new PyArrow FileSystem API in Python ARROW-7584

2020-01-24 Thread Fabian Höring
Hello, I created this ticket to discuss possible improvements to the new PyArrow FileSystem API: https://issues.apache.org/jira/browse/ARROW-7584 As of today there seem to be only two popular projects that have an agnostic FileSystem API that can handle S3 & HDFS from Python: - PyArrow via https:

[jira] [Created] (ARROW-7670) [Python][Dataset] Better ergonomics for the filter expressions

2020-01-24 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-7670: Summary: [Python][Dataset] Better ergonomics for the filter expressions Key: ARROW-7670 URL: https://issues.apache.org/jira/browse/ARROW-7670 Project: Apache Arro

Re: [Format] Array/RowBatch filters

2020-01-24 Thread Francois Saint-Jacques
By filter, you mean a filter expression, or a selection vector/bitmap? On Thu, Jan 23, 2020 at 11:38 PM Micah Kornfield wrote: > > One of the things that I think got overlooked in the conversation on having > a slice offset in the C API was a suggestion from Jacques of perhaps > generalizing the

[jira] [Created] (ARROW-7671) [Python][Dataset] Add bindings for the DatasetFactory

2020-01-24 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-7671: Summary: [Python][Dataset] Add bindings for the DatasetFactory Key: ARROW-7671 URL: https://issues.apache.org/jira/browse/ARROW-7671 Project: Apache Arrow

[NIGHTLY] Arrow Build Report for Job nightly-2020-01-24-0

2020-01-24 Thread Crossbow
Arrow Build Report for Job nightly-2020-01-24-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-24-0 Failed Tasks: - conda-osx-clang-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-24-0-azure-conda-osx-clang-py38 - gand

[jira] [Created] (ARROW-7672) NULL pointer dereference bug

2020-01-24 Thread daehee jang (Jira)
daehee jang created ARROW-7672: Summary: NULL pointer dereference bug Key: ARROW-7672 URL: https://issues.apache.org/jira/browse/ARROW-7672 Project: Apache Arrow Issue Type: Bug Enviro

[jira] [Created] (ARROW-7673) [C++][Dataset] Revisit File discovery failure mode

2020-01-24 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-7673: Summary: [C++][Dataset] Revisit File discovery failure mode Key: ARROW-7673 URL: https://issues.apache.org/jira/browse/ARROW-7673 Project: Apache Arr

[jira] [Created] (ARROW-7674) Add helpful message for captcha challenge in merge_arrow_pr.py

2020-01-24 Thread Brian Hulette (Jira)
Brian Hulette created ARROW-7674: Summary: Add helpful message for captcha challenge in merge_arrow_pr.py Key: ARROW-7674 URL: https://issues.apache.org/jira/browse/ARROW-7674 Project: Apache Arrow

Re: [DISCUSS] Format additions for encoding/compression

2020-01-24 Thread John Muehlhausen
Thanks Micah, I will see if I can find some time to explore this further. On Thu, Jan 23, 2020 at 10:56 PM Micah Kornfield wrote: > Hi John, > Not Wes, but my thoughts on this are as follows: > > 1. Alternate bit/byte arrangements can also be useful for processing [1] in > addition to compressio

[jira] [Created] (ARROW-7675) [R][CI] Move Windows CI from Appveyor to GHA

2020-01-24 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7675: Summary: [R][CI] Move Windows CI from Appveyor to GHA Key: ARROW-7675 URL: https://issues.apache.org/jira/browse/ARROW-7675 Project: Apache Arrow Issue T

Re: [DISCUSS][JAVA] Correct the behavior of ListVector isEmpty

2020-01-24 Thread Brian Hulette
What about returning null for a null list? It looks like now the function returns a primitive boolean, so I guess that would be a substantial change, but null seems more correct to me. On Thu, Jan 23, 2020, 21:38 Micah Kornfield wrote: > I would vote for treating nulls as empty. > > On Fri, Jan
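The two behaviours discussed in this thread can be contrasted with a small sketch. It is written in Python purely as an illustration; the thread is about the Java ListVector, and the function names below are hypothetical, not part of any Arrow API. A null list entry is modelled as None.

```python
from typing import Optional, Sequence


def is_empty_nulls_as_empty(entry: Optional[Sequence]) -> bool:
    """Option A (hypothetical name): a null list entry counts as empty."""
    return entry is None or len(entry) == 0


def is_empty_nullable(entry: Optional[Sequence]) -> Optional[bool]:
    """Option B (hypothetical name): a null list entry yields a null answer.

    The return type is no longer a plain boolean, which is the
    substantial change mentioned in the message above.
    """
    if entry is None:
        return None
    return len(entry) == 0
```

With option A callers keep a two-valued result; with option B they must handle a third "unknown" outcome explicitly.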

Re: PySpark failure [RE: [NIGHTLY] Arrow Build Report for Job nightly-2020-01-15-0]

2020-01-24 Thread Bryan Cutler
Thanks Joris for clearing that up! It's correct that pyspark will allow the user to do operations on the resulting DataFrame, so it doesn't sound like I should set `split_blocks=True` in the conversion. You're right that the unnecessary assignments can be easily avoided if not timestamps, so that w

[jira] [Created] (ARROW-7676) [Packaging][Python] Ensure that the static libraries are not built in the wheel scripts

2020-01-24 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-7676: Summary: [Packaging][Python] Ensure that the static libraries are not built in the wheel scripts Key: ARROW-7676 URL: https://issues.apache.org/jira/browse/ARROW-7676

[jira] [Created] (ARROW-7677) [C++] Handle Windows file paths with backslashes in GetTargetStats

2020-01-24 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7677: Summary: [C++] Handle Windows file paths with backslashes in GetTargetStats Key: ARROW-7677 URL: https://issues.apache.org/jira/browse/ARROW-7677 Proj

[jira] [Created] (ARROW-7678) [C++][Parquet] setting TZ= in environment on Linux causes broken parquet

2020-01-24 Thread Joshua Pedrick (Jira)
Joshua Pedrick created ARROW-7678: Summary: [C++][Parquet] setting TZ= in environment on Linux causes broken parquet Key: ARROW-7678 URL: https://issues.apache.org/jira/browse/ARROW-7678 Project: Apa

Re: [DISCUSS] Format additions for encoding/compression

2020-01-24 Thread Micah Kornfield
Great John, I'd be interested to hear about progress. Also, IMO I think we should be only focusing on encodings that have the potential to be exploited for computational benefits (not just compressibility). I think this is what distinguishes Arrow from other formats like Parquet. I think this ech

[jira] [Created] (ARROW-7679) [R] Purge unnecessary dataset classes and methods

2020-01-24 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7679: Summary: [R] Purge unnecessary dataset classes and methods Key: ARROW-7679 URL: https://issues.apache.org/jira/browse/ARROW-7679 Project: Apache Arrow Is

Re: [Format] Array/RowBatch filters

2020-01-24 Thread Micah Kornfield
I was thinking selection vector/bitmap (possibly with different encodings), but really nothing for now. Ordinarily, I'd lean towards YAGNI but there isn't a good way to add this in easily in a forward compatible way unless we add a placeholder enum/table for 1.0 (the default option would be no fil
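For readers unfamiliar with the two terms used in this thread, here is a minimal sketch of the difference between a selection vector (a list of row indices) and a selection bitmap (one bit per row). It is plain Python, not Arrow code, and makes no claim about what the format proposal would look like; all function names are hypothetical. The bitmap is assumed to be least-significant-bit first, the same bit order Arrow uses for validity bitmaps.

```python
def selection_vector_to_bitmap(indices: list[int], length: int) -> bytes:
    """Pack selected row indices into an LSB-first bitmap of `length` bits."""
    out = bytearray((length + 7) // 8)
    for i in indices:
        out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def bitmap_to_selection_vector(bitmap: bytes, length: int) -> list[int]:
    """Return the indices of the set bits among the first `length` bits."""
    return [i for i in range(length) if (bitmap[i // 8] >> (i % 8)) & 1]


def apply_selection(values: list, indices: list[int]) -> list:
    """Materialize the rows named by a selection vector."""
    return [values[i] for i in indices]


values = ["a", "b", "c", "d", "e"]
selected = [0, 3, 4]                      # selection vector
bitmap = selection_vector_to_bitmap(selected, len(values))
```

A selection vector costs space proportional to the number of selected rows, while a bitmap costs one bit per row regardless of how many are selected, which is one reason different encodings might be considered.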