Re: PySpark failure [RE: [NIGHTLY] Arrow Build Report for Job nightly-2020-01-15-0]

2020-01-16 Thread Joris Van den Bossche
That sounds like a good solution. Having the zero-copy behavior depend on
whether you have only one column of a certain type might lead to surprising
results. To avoid yet another keyword, only doing it when split_blocks=True
sounds good to me (in practice, that's also when it will mostly happen,
except for very narrow dataframes with only a few columns).
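
For concreteness, a rough sketch of the behavior being discussed (this assumes
the split_blocks keyword from ARROW-3789 / https://github.com/apache/arrow/pull/6067;
the exact semantics may still change):

import pyarrow as pa

table = pa.table({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

# Default path: blocks are consolidated (copied), so columns stay writable.
df = table.to_pandas()
print(df["x"].values.flags.writeable)        # expected: True

# Opt-in path: one block per column, zero-copy where possible, so a column
# may be backed by read-only Arrow memory.
df_split = table.to_pandas(split_blocks=True)
print(df_split["x"].values.flags.writeable)  # may be False (zero-copy view)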

Joris

On Thu, 16 Jan 2020 at 22:44, Wes McKinney  wrote:

> hi Joris,
>
> Thanks for investigating this. It seems there were some unintended
> consequences of the zero-copy optimizations from ARROW-3789. Another
> way forward might be to "opt in" to this behavior, or to only do the
> zero copy optimizations when split_blocks=True. What do you think?
>
> - Wes
>
> On Thu, Jan 16, 2020 at 3:42 AM Joris Van den Bossche
>  wrote:
> >
> > So the spark integration build started to fail, and with the following
> test
> > error:
> >
> > ==
> > ERROR: test_toPandas_batch_order
> > (pyspark.sql.tests.test_arrow.EncryptionArrowTests)
> > --
> > Traceback (most recent call last):
> >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 422, in
> > test_toPandas_batch_order
> > run_test(*case)
> >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 409, in
> run_test
> > pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
> >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 152, in
> > _toPandas_arrow_toggle
> > pdf_arrow = df.toPandas()
> >   File "/spark/python/pyspark/sql/pandas/conversion.py", line 115, in
> toPandas
> > return _check_dataframe_localize_timestamps(pdf, timezone)
> >   File "/spark/python/pyspark/sql/pandas/types.py", line 180, in
> > _check_dataframe_localize_timestamps
> > pdf[column] = _check_series_localize_timestamps(series, timezone)
> >   File
> "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py",
> > line 3487, in __setitem__
> > self._set_item(key, value)
> >   File
> "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py",
> > line 3565, in _set_item
> > NDFrame._set_item(self, key, value)
> >   File
> "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/generic.py",
> > line 3381, in _set_item
> > self._data.set(key, value)
> >   File
> "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/managers.py",
> > line 1090, in set
> > blk.set(blk_locs, value_getitem(val_locs))
> >   File
> "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/blocks.py",
> > line 380, in set
> > self.values[locs] = values
> > ValueError: assignment destination is read-only
> >
> >
> > It's from a test that is doing conversions from spark to arrow to pandas
> > (so calling pyarrow.Table.to_pandas here
> > <https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/conversion.py#L111-L115>),
> > and on the resulting DataFrame, it is iterating through all columns,
> > potentially fixing timezones, and writing each column back into the
> > DataFrame (here
> > <https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/types.py#L179-L181>).
> >
> > Since it is giving an error about read-only, it might be related to
> > zero-copy behaviour of to_pandas, and thus might be related to the
> refactor
> > of the arrow->pandas conversion that landed yesterday (
> > https://github.com/apache/arrow/pull/6067, it says it changed to do
> > zero-copy for 1-column blocks if possible).
> > I am not sure if something should be fixed in pyarrow for this, but the
> > obvious thing that pyspark can do is specify they don't want zero-copy.
> >
> > Joris
> >
> > On Wed, 15 Jan 2020 at 14:32, Crossbow  wrote:
> >
>


[jira] [Created] (ARROW-7600) [C++][Parquet] Add a basic disabled unit test to exercise nesting functionality

2020-01-16 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7600:
--

 Summary: [C++][Parquet] Add a basic disabled unit test to 
exercise nesting functionality
 Key: ARROW-7600
 URL: https://issues.apache.org/jira/browse/ARROW-7600
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Micah Kornfield
Assignee: Micah Kornfield






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [CI] Java build broken on master

2020-01-16 Thread Ji Liu
Thanks, PR opened: https://github.com/apache/arrow/pull/6216. Please help merge it
once the build turns green.


--
From:Micah Kornfield 
Send Time: Friday, January 17, 2020, 14:53
To:Ji Liu 
Cc:dev 
Subject:Re: [CI] Java build broken on master

OK, I've opened https://issues.apache.org/jira/browse/ARROW-7599 to track.
On Thu, Jan 16, 2020 at 10:49 PM Ji Liu  wrote:

 I was fixing it, and will open a PR later.

Thanks,
Ji Liu

--
From:Micah Kornfield 
Send Time: Friday, January 17, 2020, 14:48
To:dev 
Subject:[CI] Java build broken on master

This was due to an unexpected conflict between two patches I just merged.
I'm going to see if I can fix this quickly; otherwise I will roll back.



Re: [CI] Java build broken on master

2020-01-16 Thread Micah Kornfield
OK, I've opened https://issues.apache.org/jira/browse/ARROW-7599 to track.

On Thu, Jan 16, 2020 at 10:49 PM Ji Liu  wrote:

>  I was fixing it, and will open a PR later.
>
> Thanks,
> Ji Liu
>
> --
> From:Micah Kornfield 
> Send Time: Friday, January 17, 2020, 14:48
> To:dev 
> Subject:[CI] Java build broken on master
>
> This was due to an unexpected conflict between two patches I just merged.
> I'm going to see if I can fix this quickly; otherwise I will roll back.
>
>
>


[jira] [Created] (ARROW-7599) [Java] Fix build break due to change in RangeEqualsVisitor

2020-01-16 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7599:
--

 Summary: [Java] Fix build break due to change in RangeEqualsVisitor
 Key: ARROW-7599
 URL: https://issues.apache.org/jira/browse/ARROW-7599
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Micah Kornfield






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [CI] Java build broken on master

2020-01-16 Thread Ji Liu
 I was fixing it, and will open a PR later.

Thanks,
Ji Liu


--
From:Micah Kornfield 
Send Time: Friday, January 17, 2020, 14:48
To:dev 
Subject:[CI] Java build broken on master

This was due to an unexpected conflict between two patches I just merged.
I'm going to see if I can fix this quickly; otherwise I will roll back.



[CI] Java build broken on master

2020-01-16 Thread Micah Kornfield
This was due to an unexpected conflict between two patches I just merged.
I'm going to see if I can fix this quickly; otherwise I will roll back.


Re: [C++] Arrow added to OSS-Fuzz

2020-01-16 Thread Marco Neumann
Hey Antoine, 

Thanks a lot also from my side. 

The build is likely succeeding right now due to the fuzzing work done by 
Fuzzit. We had loads of crashes in the beginning and fixed tons of edge cases, 
especially around null pointer handling. 

I also have some code locally for a Parquet fuzzing stub, which I can publish 
(there is a ticket for that task, I think). Sadly, it doesn't run as long as the 
Arrow IPC reader fuzzer and crashes after a few seconds (it was some issue in 
the Thrift deserialization the last time I tried it). So I would probably not 
add it directly to Fuzzit and OSS-Fuzz, to avoid being bombarded with error 
messages. 

Cheers, 
Marco Neumann 

Jan 16, 2020, 02:23 by liya.fa...@gmail.com:

> Hi Antoine,
>
> Good job! And thanks for sharing the great news!
>
> Best,
> Liya Fan
>
> On Thu, Jan 16, 2020 at 2:59 AM Antoine Pitrou  wrote:
>
>>
>> Hello,
>>
>> I would like to announce that Arrow has been accepted on the OSS-Fuzz
>> infrastructure (a continuous fuzzing infrastructure operated by Google):
>> https://github.com/google/oss-fuzz/pull/3233
>>
>> Right now the only fuzz targets are the C++ stream and file IPC readers.
>> The first build results haven't appeared yet.  They will appear on
>> https://oss-fuzz.com/ .   Access needs a Google account, and you also
>> need to be listed in the "auto_ccs" here:
>> https://github.com/google/oss-fuzz/blob/master/projects/arrow/project.yaml
>>
>> (if you are a PMC or core developer and want to be listed, just open a
>> PR to the oss-fuzz repository)
>>
>> Once we confirm the first builds succeed on OSS-Fuzz, we should probably
>> add more fuzz targets (for example for reading Parquet files).
>>
>> Regards
>>
>> Antoine.
>>



Re: [Format] Make fields required?

2020-01-16 Thread Micah Kornfield
I, too, couldn't find anything that says this would break backwards
compatibility for the binary format. But it probably pays to open an issue
with the Flatbuffers team just to be safe.

Three points:
1.  I'd like to make sure we are conservative in choosing "definitely
required".
2.  Before committing to the change, it would be good to get a sense of how
much this affects other language bindings (e.g. scope of work).
3.  If we decide to do this, it seems like it should be a 1.0.0 blocker?

On Thu, Jan 16, 2020 at 1:47 PM Wes McKinney  wrote:

> If using "required" does not alter the Flatbuffers binary format (it
> doesn't seem that it does, it adds non-null assertions on the write
> path and additional checks in the read verifiers, is that accurate?),
> then it may be worthwhile to set it on "definitely required" fields to
> spare clients from having to implement their own null checks. Thoughts
> from others?
>
> - Wes
>
> On Thu, Jan 16, 2020 at 8:13 AM Antoine Pitrou  wrote:
> >
> >
> > Hello,
> >
> > In Flatbuffers, all fields are optional by default.  It means that the
> > reader can get NULL (in C++) for a missing field.  In turn, this means
> > that message validation (at least in C++) should check all child table
> > fields for non-NULL.  Not only is this burdensome, but it's easy to miss
> > some checks.  Currently, we don't seem to do any of them.
> >
> > Instead, it seems we could mark most child fields *required*.  This
> > would allow the generated verifier to check that those fields are indeed
> > non-NULL when reading.  It would also potentially break compatibility,
> > though I'm not sure why (do people rely on the fields being missing
> > sometimes?).  What do you think?
> >
> >
> > To quote the Flatbuffers documentation:
> > """
> > By default, all fields are optional, i.e. may be left out. This is
> > desirable, as it helps with forwards/backwards compatibility, and
> > flexibility of data structures. It is also a burden on the reading code,
> > since for non-scalar fields it requires you to check against NULL and
> > take appropriate action. By specifying this field, you force code that
> > constructs FlatBuffers to ensure this field is initialized, so the
> > reading code may access it directly, without checking for NULL. If the
> > constructing code does not initialize this field, they will get an
> > assert, and also the verifier will fail on buffers that have missing
> > required fields.
> > """
> > https://google.github.io/flatbuffers/md__schemas.html
> >
> > Regards
> >
> > Antoine.
>


[jira] [Created] (ARROW-7598) Unable to install pyarrow

2020-01-16 Thread Rockwell Shabani (Jira)
Rockwell Shabani created ARROW-7598:
---

 Summary: Unable to install pyarrow
 Key: ARROW-7598
 URL: https://issues.apache.org/jira/browse/ARROW-7598
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.1
 Environment: Windows 10 64-bit
Reporter: Rockwell Shabani


I'm unable to install pyarrow with 64-bit Python 3.8 on 64-bit Windows.

I get the following error:

ModuleNotFoundError: No module named 'cmake'
error: command 'C:\\Users\\*\\AppData\\Local\\Programs\\Python\\Python38\\Scripts\\cmake.exe' failed with exit status 1

Pandas/Numpy are already installed.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7597) [C++] Improvements to CMake configuration console summary

2020-01-16 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-7597:
---

 Summary: [C++] Improvements to CMake configuration console summary
 Key: ARROW-7597
 URL: https://issues.apache.org/jira/browse/ARROW-7597
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 0.16.0


Make the output more compact and the options easier to copy and paste. Based on 
the discussion and patch in https://github.com/apache/arrow/pull/6193.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: PySpark failure [RE: [NIGHTLY] Arrow Build Report for Job nightly-2020-01-15-0]

2020-01-16 Thread Wes McKinney
I created https://issues.apache.org/jira/browse/ARROW-7596 and made it
a blocker for 0.16.0 so this does not get lost in the shuffle.

On Thu, Jan 16, 2020 at 3:43 PM Wes McKinney  wrote:
>
> hi Joris,
>
> Thanks for investigating this. It seems there were some unintended
> consequences of the zero-copy optimizations from ARROW-3789. Another
> way forward might be to "opt in" to this behavior, or to only do the
> zero copy optimizations when split_blocks=True. What do you think?
>
> - Wes
>
> On Thu, Jan 16, 2020 at 3:42 AM Joris Van den Bossche
>  wrote:
> >
> > So the spark integration build started to fail, and with the following test
> > error:
> >
> > ==
> > ERROR: test_toPandas_batch_order
> > (pyspark.sql.tests.test_arrow.EncryptionArrowTests)
> > --
> > Traceback (most recent call last):
> >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 422, in
> > test_toPandas_batch_order
> > run_test(*case)
> >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 409, in 
> > run_test
> > pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
> >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 152, in
> > _toPandas_arrow_toggle
> > pdf_arrow = df.toPandas()
> >   File "/spark/python/pyspark/sql/pandas/conversion.py", line 115, in 
> > toPandas
> > return _check_dataframe_localize_timestamps(pdf, timezone)
> >   File "/spark/python/pyspark/sql/pandas/types.py", line 180, in
> > _check_dataframe_localize_timestamps
> > pdf[column] = _check_series_localize_timestamps(series, timezone)
> >   File 
> > "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py",
> > line 3487, in __setitem__
> > self._set_item(key, value)
> >   File 
> > "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py",
> > line 3565, in _set_item
> > NDFrame._set_item(self, key, value)
> >   File 
> > "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/generic.py",
> > line 3381, in _set_item
> > self._data.set(key, value)
> >   File 
> > "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/managers.py",
> > line 1090, in set
> > blk.set(blk_locs, value_getitem(val_locs))
> >   File 
> > "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/blocks.py",
> > line 380, in set
> > self.values[locs] = values
> > ValueError: assignment destination is read-only
> >
> >
> > It's from a test that is doing conversions from spark to arrow to pandas
> > (so calling pyarrow.Table.to_pandas here
> > ),
> > and on the resulting DataFrame, it is iterating through all columns,
> > potentially fixing timezones, and writing each column back into the
> > DataFrame (here
> > 
> > ).
> >
> > Since it is giving an error about read-only, it might be related to
> > zero-copy behaviour of to_pandas, and thus might be related to the refactor
> > of the arrow->pandas conversion that landed yesterday (
> > https://github.com/apache/arrow/pull/6067, it says it changed to do
> > zero-copy for 1-column blocks if possible).
> > I am not sure if something should be fixed in pyarrow for this, but the
> > obvious thing that pyspark can do is specify they don't want zero-copy.
> >
> > Joris
> >
> > On Wed, 15 Jan 2020 at 14:32, Crossbow  wrote:
> >
> > >
> > > Arrow Build Report for Job nightly-2020-01-15-0
> > >
> > > All tasks:
> > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0
> > >
> > > Failed Tasks:
> > > - gandiva-jar-osx:
> > >   URL:
> > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-travis-gandiva-jar-osx
> > > - test-conda-python-3.7-spark-master:
> > >   URL:
> > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-circle-test-conda-python-3.7-spark-master
> > > - wheel-manylinux2014-cp35m:
> > >   URL:
> > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-wheel-manylinux2014-cp35m
> > >
> > > Succeeded Tasks:
> > > - centos-6:
> > >   URL:
> > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-centos-6
> > > - centos-7:
> > >   URL:
> > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-centos-7
> > > - centos-8:
> > >   URL:
> > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-centos-8
> > > - conda-linux-gcc-py27:
> > >   URL:
> > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-conda-linux-gcc-py27
> > > - conda-linux-gcc-py36:
> > >   URL:
> > > https://github.com/ur

[jira] [Created] (ARROW-7596) [Python] Only apply zero-copy DataFrame block optimizations when split_blocks=True

2020-01-16 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-7596:
---

 Summary: [Python] Only apply zero-copy DataFrame block 
optimizations when split_blocks=True
 Key: ARROW-7596
 URL: https://issues.apache.org/jira/browse/ARROW-7596
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.16.0


Follow-up to ARROW-3789, since there is downstream code that assumes that the 
DataFrame produced always has all mutable blocks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [Format] Make fields required?

2020-01-16 Thread Wes McKinney
If using "required" does not alter the Flatbuffers binary format (it
doesn't seem that it does, it adds non-null assertions on the write
path and additional checks in the read verifiers, is that accurate?),
then it may be worthwhile to set it on "definitely required" fields to
spare clients from having to implement their own null checks. Thoughts
from others?

- Wes

On Thu, Jan 16, 2020 at 8:13 AM Antoine Pitrou  wrote:
>
>
> Hello,
>
> In Flatbuffers, all fields are optional by default.  It means that the
> reader can get NULL (in C++) for a missing field.  In turn, this means
> that message validation (at least in C++) should check all child table
> fields for non-NULL.  Not only is this burdensome, but it's easy to miss
> some checks.  Currently, we don't seem to do any of them.
>
> Instead, it seems we could mark most child fields *required*.  This
> would allow the generated verifier to check that those fields are indeed
> non-NULL when reading.  It would also potentially break compatibility,
> though I'm not sure why (do people rely on the fields being missing
> sometimes?).  What do you think?
>
>
> To quote the Flatbuffers documentation:
> """
> By default, all fields are optional, i.e. may be left out. This is
> desirable, as it helps with forwards/backwards compatibility, and
> flexibility of data structures. It is also a burden on the reading code,
> since for non-scalar fields it requires you to check against NULL and
> take appropriate action. By specifying this field, you force code that
> constructs FlatBuffers to ensure this field is initialized, so the
> reading code may access it directly, without checking for NULL. If the
> constructing code does not initialize this field, they will get an
> assert, and also the verifier will fail on buffers that have missing
> required fields.
> """
> https://google.github.io/flatbuffers/md__schemas.html
>
> Regards
>
> Antoine.


Re: PySpark failure [RE: [NIGHTLY] Arrow Build Report for Job nightly-2020-01-15-0]

2020-01-16 Thread Wes McKinney
hi Joris,

Thanks for investigating this. It seems there were some unintended
consequences of the zero-copy optimizations from ARROW-3789. Another
way forward might be to "opt in" to this behavior, or to only do the
zero copy optimizations when split_blocks=True. What do you think?

- Wes

On Thu, Jan 16, 2020 at 3:42 AM Joris Van den Bossche
 wrote:
>
> So the spark integration build started to fail, and with the following test
> error:
>
> ==
> ERROR: test_toPandas_batch_order
> (pyspark.sql.tests.test_arrow.EncryptionArrowTests)
> --
> Traceback (most recent call last):
>   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 422, in
> test_toPandas_batch_order
> run_test(*case)
>   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 409, in run_test
> pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
>   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 152, in
> _toPandas_arrow_toggle
> pdf_arrow = df.toPandas()
>   File "/spark/python/pyspark/sql/pandas/conversion.py", line 115, in toPandas
> return _check_dataframe_localize_timestamps(pdf, timezone)
>   File "/spark/python/pyspark/sql/pandas/types.py", line 180, in
> _check_dataframe_localize_timestamps
> pdf[column] = _check_series_localize_timestamps(series, timezone)
>   File 
> "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py",
> line 3487, in __setitem__
> self._set_item(key, value)
>   File 
> "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py",
> line 3565, in _set_item
> NDFrame._set_item(self, key, value)
>   File 
> "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/generic.py",
> line 3381, in _set_item
> self._data.set(key, value)
>   File 
> "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/managers.py",
> line 1090, in set
> blk.set(blk_locs, value_getitem(val_locs))
>   File 
> "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/blocks.py",
> line 380, in set
> self.values[locs] = values
> ValueError: assignment destination is read-only
>
>
> It's from a test that is doing conversions from spark to arrow to pandas
> (so calling pyarrow.Table.to_pandas here
> ),
> and on the resulting DataFrame, it is iterating through all columns,
> potentially fixing timezones, and writing each column back into the
> DataFrame (here
> 
> ).
>
> Since it is giving an error about read-only, it might be related to
> zero-copy behaviour of to_pandas, and thus might be related to the refactor
> of the arrow->pandas conversion that landed yesterday (
> https://github.com/apache/arrow/pull/6067, it says it changed to do
> zero-copy for 1-column blocks if possible).
> I am not sure if something should be fixed in pyarrow for this, but the
> obvious thing that pyspark can do is specify they don't want zero-copy.
>
> Joris
>
> On Wed, 15 Jan 2020 at 14:32, Crossbow  wrote:
>
> >
> > Arrow Build Report for Job nightly-2020-01-15-0
> >
> > All tasks:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0
> >
> > Failed Tasks:
> > - gandiva-jar-osx:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-travis-gandiva-jar-osx
> > - test-conda-python-3.7-spark-master:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-circle-test-conda-python-3.7-spark-master
> > - wheel-manylinux2014-cp35m:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-wheel-manylinux2014-cp35m
> >
> > Succeeded Tasks:
> > - centos-6:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-centos-6
> > - centos-7:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-centos-7
> > - centos-8:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-centos-8
> > - conda-linux-gcc-py27:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-conda-linux-gcc-py27
> > - conda-linux-gcc-py36:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-conda-linux-gcc-py36
> > - conda-linux-gcc-py37:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-conda-linux-gcc-py37
> > - conda-linux-gcc-py38:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-conda-linux-gcc-py38
> > - conda-osx-clang-py27:
> >   UR

[jira] [Created] (ARROW-7595) [R][CI] R appveyor job fails on glob

2020-01-16 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7595:
--

 Summary: [R][CI] R appveyor job fails on glob
 Key: ARROW-7595
 URL: https://issues.apache.org/jira/browse/ARROW-7595
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 0.16.0


It looks like Appveyor is escaping/shell quoting things it didn't before. This 
line is failing now: 
[https://github.com/apache/arrow/blob/master/ci/appveyor-build-r.sh#L47]
 * before (passing): 
[https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/30150558/job/48s5uf3750yhvfqf#L2614]
 * after (failing): 
[https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/30166850/job/t4nuy7xqwu8pr4ph#L2632]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7594) [C++] Implement HTTP and FTP file systems

2020-01-16 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7594:
---

 Summary: [C++] Implement HTTP and FTP file systems
 Key: ARROW-7594
 URL: https://issues.apache.org/jira/browse/ARROW-7594
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Affects Versions: 0.15.1
Reporter: Ben Kietzman
 Fix For: 1.0.0


It'd be handy to have a (probably read-only) generic filesystem implementation 
which wraps {{any cURLable base url}}:

{code}
ARROW_ASSIGN_OR_RAISE(auto fs, HttpFileSystem::Make("https://some.site/json-api/v3"));
ASSERT_OK_AND_ASSIGN(auto json_stream, fs->OpenInputStream("slug"));
// ...
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7593) [CI][Python] Python datasets failing on master / not run on CI

2020-01-16 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7593:


 Summary: [CI][Python] Python datasets failing on master / not run 
on CI
 Key: ARROW-7593
 URL: https://issues.apache.org/jira/browse/ARROW-7593
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7592) [C++] Fix crashes on corrupt IPC input

2020-01-16 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7592:
-

 Summary: [C++] Fix crashes on corrupt IPC input
 Key: ARROW-7592
 URL: https://issues.apache.org/jira/browse/ARROW-7592
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


Fix the following issues spotted by OSS-Fuzz:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20117
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20124
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20127

Those are basic missing sanity checks when reading an IPC file.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[NIGHTLY] Arrow Build Report for Job nightly-2020-01-16-0

2020-01-16 Thread Crossbow


Arrow Build Report for Job nightly-2020-01-16-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0

Failed Tasks:
- centos-8:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-azure-centos-8
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-travis-gandiva-jar-osx
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-travis-homebrew-cpp
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-circle-test-conda-python-3.7-spark-master
- test-ubuntu-fuzzit-fuzzing:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-circle-test-ubuntu-fuzzit-fuzzing
- test-ubuntu-fuzzit-regression:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-circle-test-ubuntu-fuzzit-regression

Succeeded Tasks:
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-azure-centos-6
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-azure-centos-7
- conda-linux-gcc-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-azure-conda-linux-gcc-py27
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-azure-conda-osx-clang-py27
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-azure-conda-win-vs2015-py38
- debian-buster:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-azure-debian-buster
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-azure-debian-stretch
- gandiva-jar-trusty:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-travis-gandiva-jar-trusty
- macos-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-travis-macos-r-autobrew
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-circle-test-conda-cpp
- test-conda-python-2.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-circle-test-conda-python-2.7-pandas-latest
- test-conda-python-2.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-circle-test-conda-python-2.7
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-circle-test-conda-python-3.7-pandas-latest
- test-conda-python-3.7-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-circle-test-conda-python-3.7-pandas-master
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-circle-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-16-0-circle-test-conda-python-3.7
- test-conda-python

[Format] Make fields required?

2020-01-16 Thread Antoine Pitrou


Hello,

In Flatbuffers, all fields are optional by default.  It means that the
reader can get NULL (in C++) for a missing field.  In turn, this means
that message validation (at least in C++) should check all child table
fields for non-NULL.  Not only is this burdensome, but it's easy to miss
some checks.  Currently, we don't seem to do any of them.

Instead, it seems we could mark most child fields *required*.  This
would allow the generated verifier to check that those fields are indeed
non-NULL when reading.  It would also potentially break compatibility,
though I'm not sure why (do people rely on the fields being missing
sometimes?).  What do you think?


To quote the Flatbuffers documentation:
"""
By default, all fields are optional, i.e. may be left out. This is
desirable, as it helps with forwards/backwards compatibility, and
flexibility of data structures. It is also a burden on the reading code,
since for non-scalar fields it requires you to check against NULL and
take appropriate action. By specifying this field, you force code that
constructs FlatBuffers to ensure this field is initialized, so the
reading code may access it directly, without checking for NULL. If the
constructing code does not initialize this field, they will get an
assert, and also the verifier will fail on buffers that have missing
required fields.
"""
https://google.github.io/flatbuffers/md__schemas.html

Regards

Antoine.


[jira] [Created] (ARROW-7591) [Python] DictionaryArray.to_numpy returns dict of parts instead of numpy array

2020-01-16 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7591:


 Summary: [Python] DictionaryArray.to_numpy returns dict of parts 
instead of numpy array
 Key: ARROW-7591
 URL: https://issues.apache.org/jira/browse/ARROW-7591
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Currently, the {{to_numpy}} method doesn't return an ndarray in case of 
dictionary-type data:

{code}
In [54]: a = pa.array(pd.Categorical(["a", "b", "a"]))

In [55]: a
Out[55]:

-- dictionary:
  [
    "a",
    "b"
  ]
-- indices:
  [
    0,
    1,
    0
  ]

In [57]: a.to_numpy(zero_copy_only=False)
Out[57]:
{'indices': array([0, 1, 0], dtype=int8),
 'dictionary': array(['a', 'b'], dtype=object),
 'ordered': False}
{code}

This is actually just an internal representation that is passed from C++ to 
Python so that, on the Python side, a {{pd.Categorical}} / {{CategoricalBlock}} 
can be constructed, but it's not something we should return as such to the 
user. Rather, I think we should return a decoded / dense numpy array (or at 
least raise an error instead of returning this dict).

(Also, if the user wants those parts, they are already available from the 
dictionary array as {{a.indices}}, {{a.dictionary}} and {{a.type.ordered}}.)
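
For reference, a possible way to construct the decoded, dense array from those 
parts (a sketch only; it ignores nulls and assumes the same 
{{to_numpy(zero_copy_only=False)}} code path shown above also works for the 
dictionary and indices children):

{code}
import pandas as pd
import pyarrow as pa

a = pa.array(pd.Categorical(["a", "b", "a"]))

# Decode by taking the dictionary value at each index (null indices ignored here).
dictionary = a.dictionary.to_numpy(zero_copy_only=False)  # array(['a', 'b'], dtype=object)
indices = a.indices.to_numpy(zero_copy_only=False)        # array([0, 1, 0], dtype=int8)
dense = dictionary[indices]                               # array(['a', 'b', 'a'], dtype=object)
{code}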



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


PySpark failure [RE: [NIGHTLY] Arrow Build Report for Job nightly-2020-01-15-0]

2020-01-16 Thread Joris Van den Bossche
So the spark integration build started to fail, with the following test
error:

==
ERROR: test_toPandas_batch_order
(pyspark.sql.tests.test_arrow.EncryptionArrowTests)
--
Traceback (most recent call last):
  File "/spark/python/pyspark/sql/tests/test_arrow.py", line 422, in
test_toPandas_batch_order
run_test(*case)
  File "/spark/python/pyspark/sql/tests/test_arrow.py", line 409, in run_test
pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
  File "/spark/python/pyspark/sql/tests/test_arrow.py", line 152, in
_toPandas_arrow_toggle
pdf_arrow = df.toPandas()
  File "/spark/python/pyspark/sql/pandas/conversion.py", line 115, in toPandas
return _check_dataframe_localize_timestamps(pdf, timezone)
  File "/spark/python/pyspark/sql/pandas/types.py", line 180, in
_check_dataframe_localize_timestamps
pdf[column] = _check_series_localize_timestamps(series, timezone)
  File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py",
line 3487, in __setitem__
self._set_item(key, value)
  File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py",
line 3565, in _set_item
NDFrame._set_item(self, key, value)
  File 
"/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/generic.py",
line 3381, in _set_item
self._data.set(key, value)
  File 
"/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/managers.py",
line 1090, in set
blk.set(blk_locs, value_getitem(val_locs))
  File 
"/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/blocks.py",
line 380, in set
self.values[locs] = values
ValueError: assignment destination is read-only


It's from a test that is doing conversions from spark to arrow to pandas
(so calling pyarrow.Table.to_pandas here
<https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/conversion.py#L111-L115>),
and on the resulting DataFrame, it is iterating through all columns,
potentially fixing timezones, and writing each column back into the
DataFrame (here
<https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/types.py#L179-L181>).

Since it is giving an error about read-only, it might be related to
zero-copy behaviour of to_pandas, and thus might be related to the refactor
of the arrow->pandas conversion that landed yesterday (
https://github.com/apache/arrow/pull/6067, it says it changed to do
zero-copy for 1-column blocks if possible).
I am not sure if something should be fixed in pyarrow for this, but the
obvious thing that pyspark can do is specify they don't want zero-copy.
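
To illustrate one defensive option on the pyspark side (a sketch only, with a
hypothetical helper name, not the actual Spark code): build a new DataFrame from
the fixed columns instead of assigning them back into a frame whose blocks may
be read-only zero-copy views of Arrow memory:

import pandas as pd

def _localize_all_columns(pdf, timezone, fix_series):
    # fix_series stands in for _check_series_localize_timestamps; returning a
    # fresh DataFrame avoids the in-place write into a read-only block.
    fixed = {column: fix_series(pdf[column], timezone) for column in pdf.columns}
    return pd.DataFrame(fixed, index=pdf.index)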

Joris

On Wed, 15 Jan 2020 at 14:32, Crossbow  wrote:

>
> Arrow Build Report for Job nightly-2020-01-15-0
>
> All tasks:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0
>
> Failed Tasks:
> - gandiva-jar-osx:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-travis-gandiva-jar-osx
> - test-conda-python-3.7-spark-master:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-circle-test-conda-python-3.7-spark-master
> - wheel-manylinux2014-cp35m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-wheel-manylinux2014-cp35m
>
> Succeeded Tasks:
> - centos-6:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-centos-6
> - centos-7:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-centos-7
> - centos-8:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-centos-8
> - conda-linux-gcc-py27:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-conda-linux-gcc-py27
> - conda-linux-gcc-py36:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-conda-linux-gcc-py36
> - conda-linux-gcc-py37:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-conda-linux-gcc-py37
> - conda-linux-gcc-py38:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-conda-linux-gcc-py38
> - conda-osx-clang-py27:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-conda-osx-clang-py27
> - conda-osx-clang-py36:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-conda-osx-clang-py36
> - conda-osx-clang-py37:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-conda-osx-clang-py37
> - conda-osx-clang-py38:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-15-0-azure-conda-osx-clang-py38
> - conda-win-vs2015-py36:

[jira] [Created] (ARROW-7590) Update .gitignore for thirdparty

2020-01-16 Thread Jiajia Li (Jira)
Jiajia Li created ARROW-7590:


 Summary: Update .gitignore for thirdparty
 Key: ARROW-7590
 URL: https://issues.apache.org/jira/browse/ARROW-7590
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Jiajia Li
Assignee: Jiajia Li


Some files under thirdparty should not be ignored, e.g. flatbuffers, README.md, 
versions.txt.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)