Re: [VOTE] Adopt Arrow in-process C Data Interface specification

2019-12-06 Thread Jacques Nadeau
-1 (binding)

I'm voting -1 on this. I posted my thinking on the PR. The high-level point
is that I think it needs to better address the pipelined use case: right now
the proposal fails to support that use case at all, and that use case carries
too much weight to ignore.

I would have posted that feedback here, but I totally missed this vote thread
until just now (I'm traveling at the moment). My -1 is not an indefinite -1;
I'm simply asking for some small changes to the approach so that it also
supports the pipelined usage pattern.

On Sat, Dec 7, 2019 at 3:09 AM Wes McKinney  wrote:

> Hello,
>
> Could more PMC members take a look at this work?
>
> Thank you
>
> On Tue, Dec 3, 2019 at 1:50 PM Neal Richardson
>  wrote:
> >
> > +1 (non-binding)
> >
> > On Tue, Dec 3, 2019 at 10:56 AM Wes McKinney 
> wrote:
> >
> > > +1 (binding)
> > >
> > > On Tue, Dec 3, 2019 at 12:54 PM Wes McKinney 
> wrote:
> > > >
> > > > hello,
> > > >
> > > > We have been discussing the creation of a minimalist C-based data
> > > > interface for applications to exchange Arrow columnar data structures
> > > > with each other. Some notable features of this interface include:
> > > >
> > > > * A small amount of header-only C code can be copied into downstream
> > > > applications; no external dependencies are needed (notably, it is not
> > > > required to use Flatbuffers, though there are trade-offs resulting
> > > > from this)
> > > > * Low development investment (in other words: limited-scope use cases
> > > > can be accomplished with little code). Enable C libraries to export
> > > > Arrow columnar data at C call sites with minimal code
> > > >
> > > > This "C Data Interface" serves different use cases from the
> > > > language-independent IPC protocol and trades away a number of
> features
> > > > (such as forward/backward compatibility) in the interest of
> minimalism
> > > > / simplicity. It is not a replacement for the IPC protocol and will
> > > > only be used to interchange in-process data at C call sites.
> > > >
> > > > The PR providing the specification is here
> > > >
> > > > https://github.com/apache/arrow/pull/5442
> > > >
> > > > A fairly comprehensive C++ implementation of this demonstrating its
> > > > use is found here
> > > >
> > > > https://github.com/apache/arrow/pull/5608
> > > >
> > > > (note that other applications implementing the interface may choose
> to
> > > > only support a few features and thus have far less code to write)
> > > >
> > > > Please vote to adopt the SPECIFICATION (GitHub PR #5442).
> > > >
> > > > This vote will be open for at least 72 hours
> > > >
> > > > [ ] +1 Adopt C Data Interface specification
> > > > [ ] +0
> > > > [ ] -1 Do not adopt because...
> > > >
> > > > Thank you
> > >
>


[jira] [Created] (ARROW-7345) [Python] Writing partitions with NaNs silently drops data

2019-12-06 Thread Karl Dunkle Werner (Jira)
Karl Dunkle Werner created ARROW-7345:
-

 Summary: [Python] Writing partitions with NaNs silently drops data
 Key: ARROW-7345
 URL: https://issues.apache.org/jira/browse/ARROW-7345
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.1
Reporter: Karl Dunkle Werner


When writing a partitioned table, if the partitioning column has NA values, 
they're silently dropped. I think it would be helpful if there was a warning. 
Even better, from my perspective, would be writing out those partitions with a 
directory name like {{partition_col=NaN}}. 

Here's a small example where only the {{b = 2}} group is written out and the 
{{b = NaN}} group is dropped.

{code:python}
import os
import tempfile
import pyarrow.json
import pyarrow.parquet
from pathlib import Path

# Create a dataset with NaN:
json_str = """
{"a": 1, "b": 2}
{"a": 2, "b": null}
"""
with tempfile.NamedTemporaryFile() as tf:
    tf = Path(tf.name)
    tf.write_text(json_str)
    table = pyarrow.json.read_json(tf)

# Write out a partitioned dataset, using the NaN-containing column
with tempfile.TemporaryDirectory() as out_dir:
    pyarrow.parquet.write_to_dataset(table, out_dir, partition_cols=["b"])
    print(os.listdir(out_dir))
    read_table = pyarrow.parquet.read_table(out_dir)
    print(f"Wrote out {table.shape[0]} rows, read back {read_table.shape[0]} row")

# Output:
#> ['b=2.0']
#> Wrote out 2 rows, read back 1 row
{code}
 
It looks like this is caused by pandas dropping NaNs when doing [the {{groupby}} 
here|https://github.com/apache/arrow/blob/b16a3b53092ccfbc67e5a4e5c90be5913a67c8a5/python/pyarrow/parquet.py#L1434].
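
For illustration, here is a minimal sketch of that root cause (this assumes pandas' 
default groupby behavior, which drops NaN group keys):

{code:python}
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [2.0, float("nan")]})

# By default pandas drops NaN keys, so the b=NaN group never reaches
# the per-partition writer.
print(df.groupby("b").size())
# b
# 2.0    1
# dtype: int64
{code}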



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7344) [Packaging][Python] Build manylinux2014 wheels

2019-12-06 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7344:
--

 Summary: [Packaging][Python] Build manylinux2014 wheels
 Key: ARROW-7344
 URL: https://issues.apache.org/jira/browse/ARROW-7344
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Packaging, Python
Reporter: Neal Richardson
Assignee: Neal Richardson


See https://www.python.org/dev/peps/pep-0599/

https://github.com/pypa/manylinux/issues/338 tracks the standard's progress

https://quay.io/organization/pypa now has manylinux2014 images on it

I've been experimenting with it and have made 
https://dl.bintray.com/nealrichardson/pyarrow-dev/pyarrow-0.15.1.dev427+ga309da790.d20191206-cp37-cp37m-linux_x86_64.whl



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Human-readable version of Arrow Schema?

2019-12-06 Thread Micah Kornfield
Hi Christian,
As far as I know no-one is working on a canonical text representation for
schemas.  A JSON serializer exists for integration test purposes, but
IMO it shouldn't be relied upon as canonical.

It looks like Flatbuffers supports serialization to/from JSON [1];
using that functionality might be a promising avenue to pursue for a
human-readable schema. I could see adding a helper method someplace under IPC
for this.  Would that meet your needs?  I think if there are other
requirements, then a proposal would be welcome.  Ideally, a solution would
not require additional build/runtime dependencies.
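
To make the current gap concrete, here is a minimal pyarrow sketch (assuming
0.15.x APIs): today there is a debugging string and a binary IPC round trip,
but nothing that is both human-readable and round-trippable.

import pyarrow as pa

schema = pa.schema([("id", pa.int64()), ("name", pa.string())])

# Human-readable, but intended for debugging only (not parseable back):
print(str(schema))

# Round-trippable, but binary (the IPC serialization of the schema):
buf = schema.serialize()
assert pa.ipc.read_schema(buf).equals(schema)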


Thanks,
Micah

[1] See Text & schema parsing
https://google.github.io/flatbuffers/flatbuffers_guide_use_cpp.html

On Fri, Dec 6, 2019 at 1:26 PM Christian Hudon  wrote:

> Hi,
>
> For the uses I would like to make of Arrow, I would need a human-readable
> and -writable version of an Arrow Schema, that could be converted to and
> from the Arrow Schema C++ object. Going through the doc for 0.15.1, I don't
> see anything to that effect, with the closest being the ToString() method
> on DataType instances, but which is meant for debugging only. (I need an
> expression of an Arrow Schema that people can read, and that can live
> outside of the code for a particular operation.)
>
> Is a text representation of an Arrow Schema something that is being worked
> on now? If not, would you folks be interested in me putting up an initial
> proposal for discussion? Any design constraints I should pay attention to,
> then?
>
> Thanks,
>
>   Christian
> --
>
>
> │ Christian Hudon
>
> │ Applied Research Scientist
>
>Element AI, 6650 Saint-Urbain #500
>
>Montréal, QC, H2S 3G9, Canada
>Elementai.com
>


Re: [VOTE] Adopt Arrow in-process C Data Interface specification

2019-12-06 Thread Wes McKinney
Hello,

Could more PMC members take a look at this work?

Thank you

On Tue, Dec 3, 2019 at 1:50 PM Neal Richardson
 wrote:
>
> +1 (non-binding)
>
> On Tue, Dec 3, 2019 at 10:56 AM Wes McKinney  wrote:
>
> > +1 (binding)
> >
> > On Tue, Dec 3, 2019 at 12:54 PM Wes McKinney  wrote:
> > >
> > > hello,
> > >
> > > We have been discussing the creation of a minimalist C-based data
> > > interface for applications to exchange Arrow columnar data structures
> > > with each other. Some notable features of this interface include:
> > >
> > > * A small amount of header-only C code can be copied into downstream
> > > applications; no external dependencies are needed (notably, it is not
> > > required to use Flatbuffers, though there are trade-offs resulting
> > > from this)
> > > * Low development investment (in other words: limited-scope use cases
> > > can be accomplished with little code). Enable C libraries to export
> > > Arrow columnar data at C call sites with minimal code
> > >
> > > This "C Data Interface" serves different use cases from the
> > > language-independent IPC protocol and trades away a number of features
> > > (such as forward/backward compatibility) in the interest of minimalism
> > > / simplicity. It is not a replacement for the IPC protocol and will
> > > only be used to interchange in-process data at C call sites.
> > >
> > > The PR providing the specification is here
> > >
> > > https://github.com/apache/arrow/pull/5442
> > >
> > > A fairly comprehensive C++ implementation of this demonstrating its
> > > use is found here
> > >
> > > https://github.com/apache/arrow/pull/5608
> > >
> > > (note that other applications implementing the interface may choose to
> > > only support a few features and thus have far less code to write)
> > >
> > > Please vote to adopt the SPECIFICATION (GitHub PR #5442).
> > >
> > > This vote will be open for at least 72 hours
> > >
> > > [ ] +1 Adopt C Data Interface specification
> > > [ ] +0
> > > [ ] -1 Do not adopt because...
> > >
> > > Thank you
> >


[jira] [Created] (ARROW-7343) Memory leak in Flight ArrowMessage

2019-12-06 Thread David Li (Jira)
David Li created ARROW-7343:
---

 Summary: Memory leak in Flight ArrowMessage
 Key: ARROW-7343
 URL: https://issues.apache.org/jira/browse/ARROW-7343
 Project: Apache Arrow
  Issue Type: Bug
  Components: FlightRPC, Java
Affects Versions: 0.15.1
Reporter: David Li
Assignee: David Li


I believe this causes things like ARROW-4765.

If a stream is interrupted or otherwise not drained on the server-side, the 
serialized form of the ArrowMessage (DrainableByteBufInputStream) will sit 
around forever, leaking memory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Human-readable version of Arrow Schema?

2019-12-06 Thread Christian Hudon
Hi,

For the uses I would like to make of Arrow, I would need a human-readable
and -writable version of an Arrow Schema, that could be converted to and
from the Arrow Schema C++ object. Going through the doc for 0.15.1, I don't
see anything to that effect, with the closest being the ToString() method
on DataType instances, but which is meant for debugging only. (I need an
expression of an Arrow Schema that people can read, and that can live
outside of the code for a particular operation.)

Is a text representation of an Arrow Schema something that is being worked
on now? If not, would you folks be interested in me putting up an initial
proposal for discussion? Any design constraints I should pay attention to,
then?

Thanks,

  Christian
-- 


│ Christian Hudon

│ Applied Research Scientist

   Element AI, 6650 Saint-Urbain #500

   Montréal, QC, H2S 3G9, Canada
   Elementai.com


[jira] [Created] (ARROW-7342) [Java] offset buffer for vector of variable-width type with zero value count is empty

2019-12-06 Thread Steve M. Kim (Jira)
Steve M. Kim created ARROW-7342:
---

 Summary: [Java] offset buffer for vector of variable-width type 
with zero value count is empty
 Key: ARROW-7342
 URL: https://issues.apache.org/jira/browse/ARROW-7342
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Steve M. Kim


I am reporting what I think might be two related bugs in 
{{org.apache.arrow.vector.BaseVariableWidthVector}}:
 # The offset buffer is initialized as empty. I expect it to have 4 bytes 
that represent the integer zero.
 # The {{getBufferSize}} method returns 0 when the value count is zero, instead of 
4.

Compare to the pyarrow implementation, which I believe correctly populates the 
offset buffer:
{code:python}
>>> import pyarrow as pa
>>> array = pa.array([], type=pa.binary())
>>> array 
[]
>>> print([b.hex().decode() for b in array.buffers()])
['', '', '']

{code}
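
For reference, a non-empty array makes the expected offsets layout visible (a
sketch against pyarrow 0.15.x; the exact printed buffer contents are an
assumption):

{code:python}
import pyarrow as pa

# One 3-byte value => the offsets buffer should hold two little-endian
# int32 values, 0 and 3. buffers() returns [validity, offsets, data];
# validity may be None when there are no nulls.
array = pa.array([b"abc"], type=pa.binary())
print([b.hex() if b is not None else None for b in array.buffers()])
# e.g. [None, b'0000000003000000', b'616263']
{code}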
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7341) [CI] Unbreak nightly Conda R job

2019-12-06 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7341:
--

 Summary: [CI] Unbreak nightly Conda R job
 Key: ARROW-7341
 URL: https://issues.apache.org/jira/browse/ARROW-7341
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Neal Richardson
Assignee: Neal Richardson


ARROW-7146 fixed a number of issues in the regular R docker setup and made the 
testing more rigorous. At least one of those fixes also should have been added 
to the R conda setup.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Timestamp coerced by default writing to parquet when resolution is ns (python)

2019-12-06 Thread Weston Pace
Thanks.  I similarly noticed that uint32 gets converted to int64.  This
makes some surface sense as uint32 is a logical type with int64 as the
backing physical type.  However, uint8, uint16, and uint64 all keep their
data types so I was a little surprised.
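
A minimal sketch of that conversion (assuming pyarrow 0.15.x defaults; the
file path is illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"u32": pa.array([], type=pa.uint32())})
pq.write_table(table, "/tmp/u32.parquet")  # default Parquet format version '1.0'
print(pq.read_table("/tmp/u32.parquet").schema.field("u32").type)
# int64 (uint8, uint16, and uint64 round-trip with their original types)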

On Fri, Dec 6, 2019 at 6:52 AM Wes McKinney  wrote:

> Some notes
>
> * 96-bit nanosecond timestamps are deprecated in the Parquet format, so we
> don't write them by default unless you use the
> use_deprecated_int96_timestamps flag
> * 64-bit timestamps are relatively new to the Parquet format; I'm not
> actually sure what's required to write these. Using version='2.0' is
> not safe because our implementation of Parquet V2 data pages is
> incorrect (see PARQUET-458)
>
> So I'd recommend using the deprecated int96 flag if you need
> nanoseconds right now
>
> On Fri, Dec 6, 2019 at 8:50 AM Weston Pace  wrote:
> >
> > If my table has timestamp fields with ns resolution and I save the table
> to
> > parquet format without specifying any timestamp args (default coerce and
> > legacy settings) then it automatically converts my timestamp to us
> > resolution.
> >
> > As best I can tell Parquet supports ns resolution so I would prefer it
> just
> > keep that.  Is there some argument I can pass to write_table to get my
> > desired resolution?
> >
> > Here is an example program:
> >
> > import pyarrow as pa
> > import pyarrow.parquet as pq
> >
> > table = pa.table({'mytimestamp': []}, schema=pa.schema({'mytimestamp':
> > pa.timestamp('ns')}))
> > pq.write_table(table, '/tmp/foo.parquet')
> > table2 = pq.read_table('/tmp/foo.parquet')
> > print(table.schema.field('mytimestamp').type)
> > # timestamp[ns]
> > print(table2.schema.field('mytimestamp').type)
> > # timestamp[us]
>


[jira] [Created] (ARROW-7340) [CI] Prune defunct appveyor build setup

2019-12-06 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7340:
--

 Summary: [CI] Prune defunct appveyor build setup
 Key: ARROW-7340
 URL: https://issues.apache.org/jira/browse/ARROW-7340
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Neal Richardson


We don't run rust, go, or C# on appveyor anymore so delete references to them 
in the appveyor build scripts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7339) [CMake] Thrift version not respected in CMake configuration version.txt

2019-12-06 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-7339:
-

 Summary: [CMake] Thrift version not respected in CMake 
configuration version.txt
 Key: ARROW-7339
 URL: https://issues.apache.org/jira/browse/ARROW-7339
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Francois Saint-Jacques


If Thrift is requested via BUNDLED, Thrift 0.9.1 will be downloaded instead of 
the requested version. This is due to FindThrift.cmake overriding 
THRIFT_VERSION with the version of the locally installed Thrift compiler 
(0.9.1 on Ubuntu 18.04).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Timestamp coerced by default writing to parquet when resolution is ns (python)

2019-12-06 Thread Wes McKinney
Some notes

* 96-bit nanosecond timestamps are deprecated in the Parquet format, so we
don't write them by default unless you use the
use_deprecated_int96_timestamps flag
* 64-bit timestamps are relatively new to the Parquet format; I'm not
actually sure what's required to write these. Using version='2.0' is
not safe because our implementation of Parquet V2 data pages is
incorrect (see PARQUET-458)

So I'd recommend using the deprecated int96 flag if you need
nanoseconds right now
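
A minimal sketch of that workaround (assuming pyarrow 0.15.x; the file path
is illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'mytimestamp': pa.array([], type=pa.timestamp('ns'))})
# The deprecated INT96 encoding preserves nanosecond resolution on read,
# at the cost of writing a deprecated Parquet type.
pq.write_table(table, '/tmp/foo_int96.parquet',
               use_deprecated_int96_timestamps=True)
print(pq.read_table('/tmp/foo_int96.parquet')
      .schema.field('mytimestamp').type)
# timestamp[ns]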

On Fri, Dec 6, 2019 at 8:50 AM Weston Pace  wrote:
>
> If my table has timestamp fields with ns resolution and I save the table to
> parquet format without specifying any timestamp args (default coerce and
> legacy settings) then it automatically converts my timestamp to us
> resolution.
>
> As best I can tell Parquet supports ns resolution so I would prefer it just
> keep that.  Is there some argument I can pass to write_table to get my
> desired resolution?
>
> Here is an example program:
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> table = pa.table({'mytimestamp': []}, schema=pa.schema({'mytimestamp':
> pa.timestamp('ns')}))
> pq.write_table(table, '/tmp/foo.parquet')
> table2 = pq.read_table('/tmp/foo.parquet')
> print(table.schema.field('mytimestamp').type)
> # timestamp[ns]
> print(table2.schema.field('mytimestamp').type)
> # timestamp[us]


[jira] [Created] (ARROW-7338) [C++] Rename SimpleDataSource to InMemoryDataSource

2019-12-06 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-7338:
-

 Summary: [C++] Rename SimpleDataSource to InMemoryDataSource
 Key: ARROW-7338
 URL: https://issues.apache.org/jira/browse/ARROW-7338
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset
Reporter: Francois Saint-Jacques


The constructor should take a generator:

{code:c++}
// Some comments here
class InMemoryDataSource : public DataSource {
 public:
  // NOTE: the angle-bracket contents below are reconstructed; they were
  // dropped from the original message.
  using Generator = std::function<std::shared_ptr<RecordBatch>()>;

  InMemoryDataSource(Generator&& generator);
  // Convenience constructor to support a fixed list of RecordBatch
  InMemoryDataSource(std::shared_ptr<RecordBatch>);
  InMemoryDataSource(std::vector<std::shared_ptr<RecordBatch>>);

 private:
  Generator generator;
};
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Timestamp coerced by default writing to parquet when resolution is ns (python)

2019-12-06 Thread Weston Pace
If my table has timestamp fields with ns resolution and I save the table to
parquet format without specifying any timestamp args (default coerce and
legacy settings) then it automatically converts my timestamp to us
resolution.

As best I can tell Parquet supports ns resolution so I would prefer it just
keep that.  Is there some argument I can pass to write_table to get my
desired resolution?

Here is an example program:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'mytimestamp': []}, schema=pa.schema({'mytimestamp':
pa.timestamp('ns')}))
pq.write_table(table, '/tmp/foo.parquet')
table2 = pq.read_table('/tmp/foo.parquet')
print(table.schema.field('mytimestamp').type)
# timestamp[ns]
print(table2.schema.field('mytimestamp').type)
# timestamp[us]


[NIGHTLY] Arrow Build Report for Job nightly-2019-12-06-0

2019-12-06 Thread Crossbow


Arrow Build Report for Job nightly-2019-12-06-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0

Failed Tasks:
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-r-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-circle-test-conda-r-3.6

Succeeded Tasks:
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-azure-centos-6
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-azure-centos-7
- centos-8:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-azure-centos-8
- conda-linux-gcc-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-azure-conda-linux-gcc-py27
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-azure-conda-linux-gcc-py37
- conda-osx-clang-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-azure-conda-osx-clang-py27
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-azure-conda-osx-clang-py37
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-azure-conda-win-vs2015-py37
- debian-buster:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-azure-debian-buster
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-azure-debian-stretch
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-travis-gandiva-jar-osx
- gandiva-jar-trusty:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-travis-gandiva-jar-trusty
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-travis-homebrew-cpp
- macos-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-travis-macos-r-autobrew
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-circle-test-conda-cpp
- test-conda-python-2.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-circle-test-conda-python-2.7-pandas-latest
- test-conda-python-2.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-circle-test-conda-python-2.7
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-circle-test-conda-python-3.7-pandas-latest
- test-conda-python-3.7-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-circle-test-conda-python-3.7-pandas-master
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-circle-test-conda-python-3.7-spark-master
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-circle-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-circle-test-conda-python-3.7
- test-conda-python-3.8-dask-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-circle-test-conda-python-3.8-dask-master
- test-conda-python-3.8-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-circle-test-conda-python-3.8-pandas-latest
- test-debian-10-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-circle-test-debian-10-cpp
- test-debian-10-go-1.12:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-06-0-circle-test-debian-10-go-1.12

Re: Java - Spark dataframe to Arrow format

2019-12-06 Thread GaoXiang Wang
Hi Wes and Liya,

Appreciate your feedback and information.

Looking forward to a more efficient integration between Arrow and Spark at
the Java/Scala level. I would like to contribute if I can help in any way
during my free time.

Thank you very much.


*Best Regards,WANG GAOXIANG*
* (Eric) *
National University of Singapore Graduate ::
API Craft Singapore Co-organiser ::
Singapore Python User Group Co-organiser
*+6597685360 (P) :: wgx...@gmail.com  (E) ::
**https://medium.com/@wgx731
 **(W)*


On Fri, Dec 6, 2019 at 6:17 PM Fan Liya  wrote:

> Hi folks,
>
> Thanks for your clarification.
>
> I also think this is a universal requirement (including Java UDF in Arrow
> format).
>
> The Java converter provided by Spark is inefficient, for two reasons
> (IMO):
>
> 1. There are frequent memory copies between on-heap and off-heap memory.
> 2. The Spark API is in a row-oriented view (Iterator of InternalRow), so we
> need to perform some column/row conversion, and we cannot copy data in
> batch.
>
> To solve the problem, maybe we need something equivalent to pandas in Java
> (I think pandas acts as a bridge between PyArrow and PySpark).
> In addition, we need to integrate it in Arrow and Spark.
>
> Best,
> Liya Fan
>
> On Fri, Dec 6, 2019 at 2:14 AM Chen Li  wrote:
>
> > We have a similar use case, and we use ArrowConverters.scala mentioned by
> > Wes. However, the overhead of the conversion is kinda high.
> > --
> > *From:* Wes McKinney 
> > *Sent:* Thursday, December 5, 2019 6:53 AM
> > *To:* dev 
> > *Cc:* Fan Liya ;
> > jeetendra.jais...@impetus.co.in.invalid
> > 
> > *Subject:* Re: Java - Spark dataframe to Arrow format
> >
> > hi folks,
> >
> > I understand the question to be about serialization.
> >
> > see
> >
> > *
> >
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java
> > *
> >
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala
> > *
> >
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala
> >
> > This code is used to convert between Spark Data Frames and Arrow
> > columnar format for UDF evaluation purposes
> >
> > On Thu, Dec 5, 2019 at 6:58 AM GaoXiang Wang  wrote:
> > >
> > > Hi Jeetendra and Liya,
> > >
> > > I am actually having a similar use case. We have some data stored as
> > *parquet
> > > format in HDFS* and would like to make use of Apache Arrow to improve
> > > compute performance if possible. Right now, I don't see a direct
> > > way to do this in Java with Spark.
> > >
> > > I have searched the Spark documentation; it looks like Python support was
> > > added after 2.3.0 (
> > > https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html
> > > ). Is there any plan from the Apache Arrow team to provide *Spark
> > > integration for Java*?
> > >
> > > Thank you very much.
> > >
> > >
> > > *Best Regards,WANG GAOXIANG*
> > > * (Eric) *
> > > National University of Singapore Graduate ::
> > > API Craft Singapore Co-organiser ::
> > > Singapore Python User Group Co-organiser
> > > *+6597685360 (P) :: wgx...@gmail.com  (E) ::
> > > **
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__medium.com_-40wgx731=DwIBaQ=5VD0RTtNlTh3ycd41b3MUw=JX5y-LzqAOulZIcSbMRGYA=E587baUuoFDKcpKDDIF4Su8nBHs0I9bGTBqEhtErCuY=thoJd3JhOJ8HBCsAJTzhnfw91reStRfH0pUj9v-v5xE=
> > > <
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__medium.com_-40wgx731=DwIBaQ=5VD0RTtNlTh3ycd41b3MUw=JX5y-LzqAOulZIcSbMRGYA=E587baUuoFDKcpKDDIF4Su8nBHs0I9bGTBqEhtErCuY=thoJd3JhOJ8HBCsAJTzhnfw91reStRfH0pUj9v-v5xE=
> > > **(W)*
> > >
> > >
> > > On Thu, Dec 5, 2019 at 6:58 PM Fan Liya  wrote:
> > >
> > > > Hi Jeetendra,
> > > >
> > > > I am not sure if I understand your question correctly.
> > > >
> > > > Arrow is an in-memory columnar data format, and Spark has its own
> > in-memory
> > > > data format for DataFrame, which is invisible to end users.
> > > > So the Spark user has no control over the underlying in-memory
> layout.
> > > >
> > > > If you really want to convert a DataFrame into Arrow format, maybe
> you
> > can
> > > > save the results of a Spark job to some external store (e.g. in ORC
> > > > format), and then load it back to memory in Arrow format (if this is
> > what
> > > > you want).
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > > On Thu, Dec 5, 2019 at 5:53 PM Jeetendra Kumar Jaiswal
> > > >  wrote:
> > > >
> > > > > Hi Dev Team,
> > > > >
> > > > > Can someone please let me know how to convert spark data frame to
> > Arrow
> > > > > format. I am coding in Java.
> > > > >
> > > > > Java documentation of Arrow just has 

[jira] [Created] (ARROW-7337) [CI][C++] Exercise benchmarks as GitHub Actions cron job

2019-12-06 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-7337:
--

 Summary: [CI][C++] Exercise benchmarks as GitHub Actions cron job
 Key: ARROW-7337
 URL: https://issues.apache.org/jira/browse/ARROW-7337
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, Continuous Integration
Reporter: Krisztian Szucs


Just to ensure that they compile and run properly.
GitHub comment thread for reference: 
https://github.com/apache/arrow/pull/5948#issuecomment-561366522



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7336) implement minmax options

2019-12-06 Thread Yuan Zhou (Jira)
Yuan Zhou created ARROW-7336:


 Summary: implement minmax options
 Key: ARROW-7336
 URL: https://issues.apache.org/jira/browse/ARROW-7336
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Compute
Reporter: Yuan Zhou
Assignee: Yuan Zhou


The minmax kernel has MinMaxOptions, but it is not used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Java - Spark dataframe to Arrow format

2019-12-06 Thread Fan Liya
Hi folks,

Thanks for your clarification.

I also think this is a universal requirement (including Java UDF in Arrow
format).

The Java converter provided by Spark is inefficient, for two reasons
(IMO):

1. There are frequent memory copies between on-heap and off-heap memory.
2. The Spark API is in a row-oriented view (Iterator of InternalRow), so we
need to perform some column/row conversion, and we cannot copy data in
batch.

To solve the problem, maybe we need something equivalent to pandas in Java
(I think pandas acts as a bridge between PyArrow and PySpark).
In addition, we need to integrate it in Arrow and Spark.
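
For reference, a sketch of the Python-side bridge mentioned above (assuming
Spark 2.3+/2.4 with its Arrow-based pandas conversion enabled):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# With this setting, Spark transfers DataFrame contents to pandas via
# Arrow record batches instead of row-by-row serialization.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pdf = spark.range(1000).toPandas()

The thread above is about the lack of an equally convenient, efficient path
on the Java/Scala side.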

Best,
Liya Fan

On Fri, Dec 6, 2019 at 2:14 AM Chen Li  wrote:

> We have a similar use case, and we use ArrowConverters.scala mentioned by
> Wes. However, the overhead of the conversion is kinda high.
> --
> *From:* Wes McKinney 
> *Sent:* Thursday, December 5, 2019 6:53 AM
> *To:* dev 
> *Cc:* Fan Liya ;
> jeetendra.jais...@impetus.co.in.invalid
> 
> *Subject:* Re: Java - Spark dataframe to Arrow format
>
> hi folks,
>
> I understand the question to be about serialization.
>
> see
>
> *
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java
> *
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala
> *
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala
>
> This code is used to convert between Spark Data Frames and Arrow
> columnar format for UDF evaluation purposes
>
> On Thu, Dec 5, 2019 at 6:58 AM GaoXiang Wang  wrote:
> >
> > Hi Jeetendra and Liya,
> >
> > I am actually having a similar use case. We have some data stored as
> *parquet
> > format in HDFS* and would like to make use of Apache Arrow to improve
> > compute performance if possible. Right now, I don't see a direct
> > way to do this in Java with Spark.
> >
> > I have searched the Spark documentation; it looks like Python support was
> > added after 2.3.0 (
> > https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html
> > ). Is there any plan from the Apache Arrow team to provide *Spark
> > integration for Java*?
> >
> > Thank you very much.
> >
> >
> > *Best Regards,WANG GAOXIANG*
> > * (Eric) *
> > National University of Singapore Graduate ::
> > API Craft Singapore Co-organiser ::
> > Singapore Python User Group Co-organiser
> > *+6597685360 (P) :: wgx...@gmail.com  (E) ::
> > **
> https://urldefense.proofpoint.com/v2/url?u=https-3A__medium.com_-40wgx731=DwIBaQ=5VD0RTtNlTh3ycd41b3MUw=JX5y-LzqAOulZIcSbMRGYA=E587baUuoFDKcpKDDIF4Su8nBHs0I9bGTBqEhtErCuY=thoJd3JhOJ8HBCsAJTzhnfw91reStRfH0pUj9v-v5xE=
> > <
> https://urldefense.proofpoint.com/v2/url?u=https-3A__medium.com_-40wgx731=DwIBaQ=5VD0RTtNlTh3ycd41b3MUw=JX5y-LzqAOulZIcSbMRGYA=E587baUuoFDKcpKDDIF4Su8nBHs0I9bGTBqEhtErCuY=thoJd3JhOJ8HBCsAJTzhnfw91reStRfH0pUj9v-v5xE=
> > **(W)*
> >
> >
> > On Thu, Dec 5, 2019 at 6:58 PM Fan Liya  wrote:
> >
> > > Hi Jeetendra,
> > >
> > > I am not sure if I understand your question correctly.
> > >
> > > Arrow is an in-memory columnar data format, and Spark has its own
> in-memory
> > > data format for DataFrame, which is invisible to end users.
> > > So the Spark user has no control over the underlying in-memory layout.
> > >
> > > If you really want to convert a DataFrame into Arrow format, maybe you
> can
> > > save the results of a Spark job to some external store (e.g. in ORC
> > > format), and then load it back to memory in Arrow format (if this is
> what
> > > you want).
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Thu, Dec 5, 2019 at 5:53 PM Jeetendra Kumar Jaiswal
> > >  wrote:
> > >
> > > > Hi Dev Team,
> > > >
> > > > Can someone please let me know how to convert spark data frame to
> Arrow
> > > > format. I am coding in Java.
> > > >
> > > > Java documentation of Arrow just has function API information. It is a
> > > > little hard to develop without proper documentation.
> > > >
> > > > Is there a way to directly convert spark dataframe to Arrow format
> > > > dataframes.
> > > >
> > > > Thanks,
> > > > Jeetendra
> > > >
> > > > 
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >