[jira] [Created] (ARROW-8055) [GLib][Ruby] Add some metadata bindings to GArrowSchema

2020-03-10 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-8055:
---

 Summary: [GLib][Ruby] Add some metadata bindings to GArrowSchema
 Key: ARROW-8055
 URL: https://issues.apache.org/jira/browse/ARROW-8055
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib, Ruby
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Summary of RLE and other compression efforts?

2020-03-10 Thread Radev, Martin
Hey Evan,


thank you for the interest.

There has been some effort for compressing floating-point data on the Parquet 
side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not compress 
floating point data but makes it more compressible for when a compressor, such 
as ZSTD, LZ4, etc., is used. It only works well for high-entropy floating-point 
data, roughly 15 or more bits of entropy per element. I suppose the encoding 
might also make sense for high-entropy integer data, but I am not sure.
For low-entropy data, the dictionary encoding is good, though I suspect there 
is room for performance improvements.
This is my final report for the encoding here: 
https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf

Note that at some point my investigation turned out to be quite the same solution 
as the one in https://github.com/powturbo/Turbo-Transpose.
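
For intuition, here is a minimal NumPy sketch of the byte-split idea (an 
editorial illustration only; the actual Parquet encoder is implemented in C++ 
and also handles nulls and page boundaries):

{code:python}
import numpy as np

def byte_stream_split(values: np.ndarray) -> bytes:
    # View each value as its raw bytes, then group byte 0 of every value
    # together, byte 1 of every value together, and so on. This reordered
    # stream is what a general-purpose compressor (ZSTD, LZ4, ...) sees.
    raw = values.view(np.uint8).reshape(len(values), values.itemsize)
    return raw.T.tobytes()

def byte_stream_merge(data: bytes, dtype=np.float32) -> np.ndarray:
    # Inverse transform: regroup the byte planes back into values.
    itemsize = np.dtype(dtype).itemsize
    planes = np.frombuffer(data, dtype=np.uint8).reshape(itemsize, -1)
    return planes.T.copy().view(dtype).ravel()

values = np.random.random(1024).astype(np.float32)
assert np.array_equal(byte_stream_merge(byte_stream_split(values)), values)
{code}

Compressing the split stream with a general-purpose compressor typically does 
better than compressing the raw buffer for high-entropy floats, which is the 
effect measured in the report.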


Maybe the points I sent can be helpful.


Kind regards,

Martin


From: evan_c...@apple.com  on behalf of Evan Chan 

Sent: Tuesday, March 10, 2020 5:15:48 AM
To: dev@arrow.apache.org
Subject: Summary of RLE and other compression efforts?

Hi folks,

I’m curious about the state of efforts for more compressed encodings in the 
Arrow columnar format.  I saw discussions previously about RLE, but is there a 
place to summarize all of the different efforts that are ongoing to bring more 
compressed encodings?

Is there an effort to compress floating point or integer data using techniques 
such as XOR compression and Delta-Delta?  I can contribute to some of these 
efforts as well.

Thanks,
Evan




[jira] [Created] (ARROW-8056) [R] Support read and write orc file format

2020-03-10 Thread Dyfan Jones (Jira)
Dyfan Jones created ARROW-8056:
--

 Summary: [R] Support read and write orc file format
 Key: ARROW-8056
 URL: https://issues.apache.org/jira/browse/ARROW-8056
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Dyfan Jones


Currently the R package can read/write Arrow, Feather, Parquet, etc. How 
feasible is it for the ORC file format to be supported with read/write capabilities?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8057) Schema equality not roundtrip safe

2020-03-10 Thread Florian Jetter (Jira)
Florian Jetter created ARROW-8057:
-

 Summary: Schema equality not roundtrip safe
 Key: ARROW-8057
 URL: https://issues.apache.org/jira/browse/ARROW-8057
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Florian Jetter


When performing schema roundtrips, the equality check for fields breaks. This is 
a regression from PyArrow 0.16.0.

The equality check for entire schemas has never worked (but should, from my POV):
{code:python}
import pyarrow.parquet as pq
import pyarrow as pa
print(pa.__version__)
fields = [
pa.field("bool", pa.bool_()),
pa.field("byte", pa.binary()),
pa.field("date", pa.date32()),
pa.field("datetime64", pa.timestamp("us")),
pa.field("float32", pa.float64()),
pa.field("float64", pa.float64()),
pa.field("int16", pa.int64()),
pa.field("int32", pa.int64()),
pa.field("int64", pa.int64()),
pa.field("int8", pa.int64()),
pa.field("null", pa.null()),
pa.field("uint16", pa.uint64()),
pa.field("uint32", pa.uint64()),
pa.field("uint64", pa.uint64()),
pa.field("uint8", pa.uint64()),
pa.field("unicode", pa.string()),
pa.field("array_float32", pa.list_(pa.float64())),
pa.field("array_float64", pa.list_(pa.float64())),
pa.field("array_int16", pa.list_(pa.int64())),
pa.field("array_int32", pa.list_(pa.int64())),
pa.field("array_int64", pa.list_(pa.int64())),
pa.field("array_int8", pa.list_(pa.int64())),
pa.field("array_uint16", pa.list_(pa.uint64())),
pa.field("array_uint32", pa.list_(pa.uint64())),
pa.field("array_uint64", pa.list_(pa.uint64())),
pa.field("array_uint8", pa.list_(pa.uint64())),
pa.field("array_unicode", pa.list_(pa.string())),
]

schema = pa.schema(fields)

buf = pa.BufferOutputStream()
pq.write_metadata(schema, buf)
reader = pa.BufferReader(buf.getvalue().to_pybytes())
reconstructed_schema = pq.read_schema(reader)

assert reconstructed_schema == reconstructed_schema
assert reconstructed_schema[0] == reconstructed_schema[0]
# This breaks on master / regression from 0.16.0 
assert schema[0] == reconstructed_schema[0]

# This never worked but should
assert reconstructed_schema == schema
assert schema == reconstructed_schema
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[DISCUSS][Java] Support non-nullable vectors

2020-03-10 Thread Fan Liya
Dear all,

A non-nullable vector is one that is guaranteed to contain no nulls. We
want to support non-nullable vectors in Java.

*Motivations:*
1. It is widely used in practice. For example, in a database engine, a
column can be declared as not null, so it cannot contain null values.
2. Non-nullable vectors have significant performance advantages compared with
their nullable counterparts, such as:
  1) the memory space of the validity buffer can be saved;
  2) manipulation of the validity buffer can be bypassed;
  3) some if-else branches can be replaced by sequential instructions (by
the JIT compiler), leading to higher throughput for the CPU pipeline.

*Potential Cost:*
For nullable vectors, there can be extra checks against the nullability
flag. So we must change the code in a way that minimizes the cost.

*Proposed Changes:*
1. There is no need to create new vector classes. We add a final boolean to
the vector base classes as the nullability flag. The value of the flag can
be obtained from the field when creating the vector.
2. Add a method "boolean isNullable()" to the root interface ValueVector.
3. If a vector is non-nullable, its validity buffer should be an empty
buffer (not null, so much of the existing logic can be left unchanged).
4. For operations involving validity buffers (e.g. isNull, get, set), we
use the nullability flag to bypass manipulations to the validity buffer.

Therefore, it should be possible to support the feature with small code
changes.

BTW, please note that similar behaviors have already been supported in C++.

Would you please give your valuable feedback?

Best,
Liya Fan


[jira] [Created] (ARROW-8058) [C++][Python][Dataset] Provide an option to skip validation in FileSystemDatasetFactoryOptions

2020-03-10 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8058:
---

 Summary: [C++][Python][Dataset] Provide an option to skip 
validation in FileSystemDatasetFactoryOptions
 Key: ARROW-8058
 URL: https://issues.apache.org/jira/browse/ARROW-8058
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset, Python
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


This can be costly and is not always necessary.

At the same time we could move file validation into the scan tasks; currently 
all files are inspected as the dataset is constructed, which can be expensive 
if the filesystem is slow. We'll be performing the validation multiple times 
but the check will be cheap since at scan time we'll be reading the file into 
memory anyway.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS][Java] Support non-nullable vectors

2020-03-10 Thread Wes McKinney
hi Liya,

In C++ we elect certain faster code paths when the null count is 0 or
computed to be zero. When the null count is 0, we do not allocate a
validity bitmap. And there is a "nullable" metadata-only flag at the
Field level. Could the same kinds of optimizations be implemented in
Java without introducing a "nullable" concept?

- Wes
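
For reference, the behavior described above can be observed from Python (an 
editorial sketch, assuming a recent pyarrow build):

{code:python}
import pyarrow as pa

# An array with no nulls reports null_count == 0 and carries no validity
# bitmap: the first entry returned by buffers() is None.
arr = pa.array([1, 2, 3])
assert arr.null_count == 0
assert arr.buffers()[0] is None

# "nullable" is a metadata-only flag on the Field, not on the data buffers.
field = pa.field("x", pa.int64(), nullable=False)
assert not field.nullable
{code}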

On Tue, Mar 10, 2020 at 8:13 AM Fan Liya  wrote:
>
> Dear all,
>
> A non-nullable vector is one that is guaranteed to contain no nulls. We
> want to support non-nullable vectors in Java.
>
> *Motivations:*
> 1. It is widely used in practice. For example, in a database engine, a
> column can be declared as not null, so it cannot contain null values.
> 2.Non-nullable vectors has significant performance advantages compared with
> their nullable conterparts, such as:
>   1) the memory space of the validity buffer can be saved.
>   2) manipulation of the validity buffer can be bypassed
>   3) some if-else branches can be replaced by sequential instructions (by
> the JIT compiler), leading to high throughput for the CPU pipeline.
>
> *Potential Cost:*
> For nullable vectors, there can be extra checks against the nullablility
> flag. So we must change the code in a way that minimizes the cost.
>
> *Proposed Changes:*
> 1. There is no need to create new vector classes. We add a final boolean to
> the vector base classes as the nullability flag. The value of the flag can
> be obtained from the field when creating the vector.
> 2. Add a method "boolean isNullable()" to the root interface ValueVector.
> 3. If a vector is non-nullable, its validity buffer should be an empty
> buffer (not null, so much of the existing logic can be left unchanged).
> 4. For operations involving validity buffers (e.g. isNull, get, set), we
> use the nullability flag to bypass manipulations to the validity buffer.
>
> Therefore, it should be possible to support the feature with small code
> changes.
>
> BTW, please note that similar behaviors have already been supported in C++.
>
> Would you please give your valueable feedback?
>
> Best,
> Liya Fan


Re: [DISCUSS][Java] Support non-nullable vectors

2020-03-10 Thread Fan Liya
Hi Wes,

Thanks a lot for your quick reply.
I think what you mentioned is almost exactly what we want to do in Java. The
concept is not important.

Maybe there are only some minor differences:
1. In C++, the null_count is mutable, while for Java, once a vector is
constructed as non-nullable, its null count can only be 0.
2. In C++, a non-nullable array's validity buffer is null, while in Java,
the buffer is an empty buffer, and cannot be changed.

Best,
Liya Fan

On Tue, Mar 10, 2020 at 9:26 PM Wes McKinney  wrote:

> hi Liya,
>
> In C++ we elect certain faster code paths when the null count is 0 or
> computed to be zero. When the null count is 0, we do not allocate a
> validity bitmap. And there is a "nullable" metadata-only flag at the
> Field level. Could the same kinds of optimizations be implemented in
> Java without introducing a "nullable" concept?
>
> - Wes
>
> On Tue, Mar 10, 2020 at 8:13 AM Fan Liya  wrote:
> >
> > Dear all,
> >
> > A non-nullable vector is one that is guaranteed to contain no nulls. We
> > want to support non-nullable vectors in Java.
> >
> > *Motivations:*
> > 1. It is widely used in practice. For example, in a database engine, a
> > column can be declared as not null, so it cannot contain null values.
> > 2.Non-nullable vectors has significant performance advantages compared
> with
> > their nullable conterparts, such as:
> >   1) the memory space of the validity buffer can be saved.
> >   2) manipulation of the validity buffer can be bypassed
> >   3) some if-else branches can be replaced by sequential instructions (by
> > the JIT compiler), leading to high throughput for the CPU pipeline.
> >
> > *Potential Cost:*
> > For nullable vectors, there can be extra checks against the nullablility
> > flag. So we must change the code in a way that minimizes the cost.
> >
> > *Proposed Changes:*
> > 1. There is no need to create new vector classes. We add a final boolean
> to
> > the vector base classes as the nullability flag. The value of the flag
> can
> > be obtained from the field when creating the vector.
> > 2. Add a method "boolean isNullable()" to the root interface ValueVector.
> > 3. If a vector is non-nullable, its validity buffer should be an empty
> > buffer (not null, so much of the existing logic can be left unchanged).
> > 4. For operations involving validity buffers (e.g. isNull, get, set), we
> > use the nullability flag to bypass manipulations to the validity buffer.
> >
> > Therefore, it should be possible to support the feature with small code
> > changes.
> >
> > BTW, please note that similar behaviors have already been supported in
> C++.
> >
> > Would you please give your valueable feedback?
> >
> > Best,
> > Liya Fan
>


Re: Making a patch 0.16.1 Arrow release

2020-03-10 Thread Wes McKinney
It seems like the consensus is to push for a 0.17.0 major release
sooner rather than doing a patch release, since releases in general
are costly. This is fine with me. I see that a 0.17.0 milestone has
been created in JIRA and some JIRA gardening has begun. Do you think
we can be in a position to release by the week of March 23 or the week
of March 30?

On Thu, Mar 5, 2020 at 8:39 PM Wes McKinney  wrote:
>
> If people are generally on board with accelerating a 0.17.0 major
> release, then I would suggest renaming "1.0.0" to "0.17.0" and
> beginning to do issue gardening to whittle things down to
> critical-looking bugs and high probability patches for the next couple
> of weeks.
>
> On Thu, Mar 5, 2020 at 11:31 AM Wes McKinney  wrote:
> >
> > I recall there are some other issues that have been reported or fixed
> > that are critical and not yet marked with 0.16.1.
> >
> > I'm also OK with doing a 0.17.0 release sooner
> >
> > On Thu, Mar 5, 2020 at 11:31 AM Neal Richardson
> >  wrote:
> > >
> > > I would also be more supportive of doing 0.17 earlier instead of a patch
> > > release.
> > >
> > > Neal
> > >
> > >
> > > On Thu, Mar 5, 2020 at 9:29 AM Neal Richardson 
> > > 
> > > wrote:
> > >
> > > > If releases were costless to make, I'd be all for it, but it's not clear
> > > > to me that it's worth the diversion from other priorities to make a 
> > > > release
> > > > right now. Nothing on
> > > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%200.16.1
> > > > jumps out to me as super urgent--what are you seeing as critical?
> > > >
> > > > If we did decide to go forward, would it be possible to do a release 
> > > > that
> > > > is limited to the affected implementations (say, do a Python-only 
> > > > release)?
> > > > That might reduce the cost of building and verifying enough to make it
> > > > reasonable to consider.
> > > >
> > > > Neal
> > > >
> > > >
> > > > On Thu, Mar 5, 2020 at 8:19 AM Krisztián Szűcs 
> > > > 
> > > > wrote:
> > > >
> > > >> On Thu, Mar 5, 2020 at 5:07 PM Wes McKinney  
> > > >> wrote:
> > > >> >
> > > >> > hi folks,
> > > >> >
> > > >> > There have been a number of critical issues reported (many of them
> > > >> > fixed already) since 0.16.0 was released. Is there interest in
> > > >> > preparing a patch 0.16.1 release (with backported patches onto a
> > > >> > maint-0.16.x branch as with 0.15.1) since the next major release is a
> > > >> > minimum of 6-8 weeks away from general availability?
> > > >> >
> > > >> > Did the 0.15.1 patch release helper script that Krisztian wrote get
> > > >> > contributed as a PR?
> > > >> Not yet, but it is available at
> > > >> https://gist.github.com/kszucs/b2743546044ccd3215e5bb34fa0d76a0
> > > >> >
> > > >> > Thanks
> > > >> > Wes
> > > >>
> > > >


Re: [jira] [Created] (ARROW-8049) [C++] Upgrade bundled Thrift version to 0.13.0

2020-03-10 Thread Don Hilborn
Unsubscribe


-Don


On Mon, Mar 9, 2020 at 6:19 PM Wes McKinney (Jira)  wrote:

> Wes McKinney created ARROW-8049:
> ---
>
>  Summary: [C++] Upgrade bundled Thrift version to 0.13.0
>  Key: ARROW-8049
>  URL: https://issues.apache.org/jira/browse/ARROW-8049
>  Project: Apache Arrow
>   Issue Type: Improvement
>   Components: C++
> Reporter: Wes McKinney
>  Fix For: 0.17.0
>
>
> Follow up to discussion in ARROW-6821
>
>
>
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)
>


[jira] [Created] (ARROW-8059) [Python] Make FileSystem objects serializable

2020-03-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8059:


 Summary: [Python] Make FileSystem objects serializable
 Key: ARROW-8059
 URL: https://issues.apache.org/jira/browse/ARROW-8059
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


It would be good to be able to pickle {{pyarrow.fs.FileSystem}} objects (e.g. for 
use in dask.distributed).

cc [~apitrou]
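
A sketch of the desired behavior (editorial; assumes the experimental 
{{pyarrow.fs}} API):

{code:python}
import pickle

import pyarrow.fs as fs

# Desired behavior: a filesystem handle survives a pickle round trip so it
# can be shipped to dask.distributed workers.
local = fs.LocalFileSystem()
restored = pickle.loads(pickle.dumps(local))
assert isinstance(restored, fs.LocalFileSystem)
{code}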



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8060) [Python] Make dataset Expression objects serializable

2020-03-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8060:


 Summary: [Python] Make dataset Expression objects serializable
 Key: ARROW-8060
 URL: https://issues.apache.org/jira/browse/ARROW-8060
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


It would be good to be able to pickle pyarrow.dataset.Expression objects (e.g. 
for use in dask.distributed).
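
A sketch of the desired behavior (editorial; assumes the experimental 
{{pyarrow.dataset}} API and its {{field}} helper):

{code:python}
import pickle

import pyarrow.dataset as ds

# Desired behavior: a filter expression survives pickling so it can be sent
# to dask.distributed workers alongside the dataset pieces.
expr = ds.field("year") == 2020
restored = pickle.loads(pickle.dumps(expr))
{code}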



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8061) [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support row groups)

2020-03-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8061:


 Summary: [C++][Dataset] Ability to specify granularity of 
ParquetFileFragment (support row groups)
 Key: ARROW-8061
 URL: https://issues.apache.org/jira/browse/ARROW-8061
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset
Reporter: Joris Van den Bossche


Specifically for Parquet (not sure if it will be relevant for other file 
formats as well; for IPC/Feather, potentially the record batch), it would be 
useful to target row groups instead of files as fragments.

Quoting the original design documents: _"In datasets consisting of many 
fragments, the dataset API must expose the granularity of fragments in a public 
way to enable parallel processing, if desired. "._   
And a comment from Wes on that: _"a single Parquet file can "export" one or 
more fragments based on settings. The default might be to split fragments based 
on row group"_

Currently, the level on which fragments are defined (at least in the typical 
partitioned parquet dataset) is "1 file == 1 fragment".

Would it be possible or desirable to make this more fine-grained, where you 
could also opt to have a fragment per row group? We could have a ParquetFragment 
that has this option, and a ParquetFileFormat-specific option to say what the 
granularity of a fragment is (file vs row group)?

cc [~fsaintjacques] [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8062) [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file

2020-03-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8062:


 Summary: [C++][Dataset] Parquet Dataset factory from a 
_metadata/_common_metadata file
 Key: ARROW-8062
 URL: https://issues.apache.org/jira/browse/ARROW-8062
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset, Python
Reporter: Joris Van den Bossche


Partitioned parquet datasets sometimes come with {{_metadata}} / 
{{_common_metadata}} files. Those files include information about the schema of 
the full dataset and potentially all RowGroup metadata as well (for 
{{_metadata}}).

Using those files during the creation of a parquet {{Dataset}} can give a more 
efficient factory (using the stored schema instead of inferring the schema from 
unioning the schemas of all files + using the paths to individual parquet files 
instead of crawling the directory).

Basically, based on those files, the schema, list of paths and partition 
expressions (the information that is needed to create a Dataset) could be 
constructed.
Such logic could be put in a different factory class, e.g. 
{{ParquetManifestFactory}} (as suggested by [~fsaintjacques]).
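
A rough sketch of the pieces such a factory would consume (editorial; the 
path below is hypothetical and the factory class does not exist yet):

{code:python}
import pyarrow.parquet as pq

# Read the dataset-level Parquet metadata once instead of opening every file.
meta = pq.read_metadata("dataset/_metadata")  # hypothetical path
schema = meta.schema.to_arrow_schema()        # stored dataset schema

# Each row group records the relative path of the file it belongs to, so the
# list of data files can be recovered without crawling the directory tree.
paths = sorted({meta.row_group(i).column(0).file_path
                for i in range(meta.num_row_groups)})
{code}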



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8063) [Python] Add user guide documentation for Datasets API

2020-03-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8063:


 Summary: [Python] Add user guide documentation for Datasets API
 Key: ARROW-8063
 URL: https://issues.apache.org/jira/browse/ARROW-8063
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Joris Van den Bossche
 Fix For: 0.17.0


Currently, we only have API docs 
(https://arrow.apache.org/docs/python/api/dataset.html), but we also need prose 
docs explaining what the dataset module does with examples.

This can also include guidelines on how to use this instead of the 
ParquetDataset API (depending on how we end up doing ARROW-8039); this aspect 
is also covered by ARROW-8047.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [Rust] Dictionary encoding for strings?

2020-03-10 Thread Wes McKinney
I believe that dictionary encoding in-memory was very recently
implemented (February 28) in
https://github.com/apache/arrow/commit/c7a7d2dcc46ed06593b994cb54c5eaf9ccd1d21d#diff-72812e30873455dcee2ce2d1ee26e4ab.
Not sure about the other questions.

On Mon, Mar 9, 2020 at 11:07 PM Evan Chan  wrote:
>
> Hi,
>
> Does the Rust implementation support dictionary encoded strings?  It is not 
> in the documentation anywhere, but there seem to be some variable-sized 
> dictionary structs in the code base.
> If not, is there a plan to support it?
> Does DataFusion support reading from dictionary strings?
>
> It seems all the examples in DataFusion and the Rust part are focused on 
> numbers.  How robust is the string support, and how robust is the string 
> functionality overall?
>
> Thanks,
> Evan


[jira] [Created] (ARROW-8064) [Dev] Implement Comment bot via Github actions

2020-03-10 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8064:
--

 Summary: [Dev] Implement Comment bot via Github actions
 Key: ARROW-8064
 URL: https://issues.apache.org/jira/browse/ARROW-8064
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


A la {{@ursabot}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8065) [C++][Dataset] Untangle Dataset, Fragment and ScanOptions

2020-03-10 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8065:
-

 Summary: [C++][Dataset] Untangle Dataset, Fragment and ScanOptions
 Key: ARROW-8065
 URL: https://issues.apache.org/jira/browse/ARROW-8065
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Francois Saint-Jacques


We should be able to list fragments without going through the 
Scanner/ScanOptions hoops. This exposes a flaw in the current API: it requires 
a ScanOptions to create a Fragment. This is also a problem for 
ARROW-7824, i.e. why do we need a ScanOptions (read: manifest) to write record 
batches to a given path.
 # Remove {{ScanOptions}} from Fragment's properties and move it into 
{{Fragment::Scan}} parameters.
 # Remove {{ScanOptions}} from {{Dataset::GetFragments}}; if required, we can 
still provide an alternate signature, e.g. 
{{Dataset::GetFragments(std::shared_ptr predicate)}} for sub-tree 
pruning in FileSystemDataset.
 # Fragment constructor should take a schema (and store it as a property), 
usually extracted from the Dataset schema. Update the schema() method 
accordingly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8066) PyArrow discards timezones

2020-03-10 Thread Markovtsev Vadim (Jira)
Markovtsev Vadim created ARROW-8066:
---

 Summary: PyArrow discards timezones
 Key: ARROW-8066
 URL: https://issues.apache.org/jira/browse/ARROW-8066
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.16.0
Reporter: Markovtsev Vadim


The description is at [https://github.com/pandas-dev/pandas/issues/32587]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8067) [Python] FindPython3 fails on Python 3.5

2020-03-10 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8067:
---

 Summary: [Python] FindPython3 fails on Python 3.5
 Key: ARROW-8067
 URL: https://issues.apache.org/jira/browse/ARROW-8067
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.17.0


{code}
-- Could NOT find Backtrace (missing: Backtrace_LIBRARY Backtrace_INCLUDE_DIR)
-- Found PythonInterp: C:/Miniconda/python.exe (found version "3.7.4")
-- Found PythonLibs: C:/Miniconda/libs/Python37.lib
CMake Error at cmake_modules/FindNumPy.cmake:58 (message):
  NumPy import failure:

  Traceback (most recent call last):

File "", line 1, in 

  ModuleNotFoundError: No module named 'numpy'

Call Stack (most recent call first):
  cmake_modules/FindPython3Alt.cmake:31 (find_package)
  src/arrow/python/CMakeLists.txt:22 (find_package)


-- Configuring incomplete, errors occurred!
See also "C:/Users/wesmc/code/arrow/cpp/build/CMakeFiles/CMakeOutput.log".
See also "C:/Users/wesmc/code/arrow/cpp/build/CMakeFiles/CMakeError.log".
{code}

This appears to work in 0.16.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Summary of RLE and other compression efforts?

2020-03-10 Thread Wes McKinney
See this past mailing list thread

https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937%40%3Cdev.arrow.apache.org%3E

and associated PR

https://github.com/apache/arrow/pull/4815

There hasn't been a lot of movement on this, primarily because all
the key people who've expressed interest in it have been really busy
with other matters (myself included). Having RLE encoding in memory at a
minimum would be a huge benefit for a number of applications, so it
would be great to continue the discussion and create a more
comprehensive proposal document describing what we would like to
implement (and what we do not want to implement).

On Tue, Mar 10, 2020 at 3:41 AM Radev, Martin  wrote:
>
> Hey Evan,
>
>
> thank you for the interest.
>
> There has been some effort for compressing floating-point data on the Parquet 
> side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not compress 
> floating point data but makes it more compressible for when a compressor, 
> such as ZSTD, LZ4, etc, is used. It only works well for high-entropy 
> floating-point data, somewhere at least as large as >= 15 bits of entropy per 
> element. I suppose the encoding might actually also make sense for 
> high-entropy integer data but I am not super sure.
> For low-entropy data, the dictionary encoding is good though I suspect there 
> can be room for performance improvements.
> This is my final report for the encoding here: 
> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
>
> Note that at some point my investigation turned out be quite the same 
> solution as the one in https://github.com/powturbo/Turbo-Transpose.
>
>
> Maybe the points I sent can be helpful.
>
>
> Kinds regards,
>
> Martin
>
> 
> From: evan_c...@apple.com  on behalf of Evan Chan 
> 
> Sent: Tuesday, March 10, 2020 5:15:48 AM
> To: dev@arrow.apache.org
> Subject: Summary of RLE and other compression efforts?
>
> Hi folks,
>
> I’m curious about the state of efforts for more compressed encodings in the 
> Arrow columnar format.  I saw discussions previously about RLE, but is there 
> a place to summarize all of the different efforts that are ongoing to bring 
> more compressed encodings?
>
> Is there an effort to compress floating point or integer data using 
> techniques such as XOR compression and Delta-Delta?  I can contribute to some 
> of these efforts as well.
>
> Thanks,
> Evan
>
>


Re: Summary of RLE and other compression efforts?

2020-03-10 Thread Evan Chan
Martin,

Many thanks for the links.  

My main concern is not actually FP and integer data, but sparse string data.  
Having many very sparse arrays, each with a bitmap and values (assume 
dictionary also), would be really expensive. I have some ideas I’d like to 
throw out there, around something like a MapArray (think of it essentially as 
dictionaries of keys and values, plus List> for example), but 
something optimized for sparseness.

Overall, while I appreciate the design of Arrow arrays to be super fast for 
computation, I’d like to be able to keep more of such data in memory, so I’m 
interested in more compact representations that ideally don’t need a 
compressor; more like encodings.

I saw a thread in the middle of last year about RLE encodings; this would be in 
the right direction, I think. It could be implemented properly such that it 
doesn’t make random access that bad.

As for FP, I have my own scheme which is scale-tested, SIMD-friendly, should 
perform relatively well, and can fit in with different predictors including 
XOR, DFCM, etc. Due to the high cardinality of most such data, I wonder if it 
wouldn’t be simpler to stick with one such scheme for all FP data.

Anyway, I’m most curious about whether there is a plan to implement RLE, the FP 
schemes you describe, etc., and bring them to Arrow. I.e., is there a plan for 
space-efficient encodings overall for Arrow?

Thanks very much,
Evan

> On Mar 10, 2020, at 1:41 AM, Radev, Martin  wrote:
> 
> Hey Evan,
> 
> 
> thank you for the interest.
> 
> There has been some effort for compressing floating-point data on the Parquet 
> side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not compress 
> floating point data but makes it more compressible for when a compressor, 
> such as ZSTD, LZ4, etc, is used. It only works well for high-entropy 
> floating-point data, somewhere at least as large as >= 15 bits of entropy per 
> element. I suppose the encoding might actually also make sense for 
> high-entropy integer data but I am not super sure.
> For low-entropy data, the dictionary encoding is good though I suspect there 
> can be room for performance improvements.
> This is my final report for the encoding here: 
> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
> 
> Note that at some point my investigation turned out be quite the same 
> solution as the one in https://github.com/powturbo/Turbo-Transpose.
> 
> 
> Maybe the points I sent can be helpful.
> 
> 
> Kinds regards,
> 
> Martin
> 
> 
> From: evan_c...@apple.com  on behalf of Evan Chan 
> 
> Sent: Tuesday, March 10, 2020 5:15:48 AM
> To: dev@arrow.apache.org
> Subject: Summary of RLE and other compression efforts?
> 
> Hi folks,
> 
> I’m curious about the state of efforts for more compressed encodings in the 
> Arrow columnar format.  I saw discussions previously about RLE, but is there 
> a place to summarize all of the different efforts that are ongoing to bring 
> more compressed encodings?
> 
> Is there an effort to compress floating point or integer data using 
> techniques such as XOR compression and Delta-Delta?  I can contribute to some 
> of these efforts as well.
> 
> Thanks,
> Evan
> 
> 



Re: Summary of RLE and other compression efforts?

2020-03-10 Thread Evan Chan
Thank you Wes.  If the stars line up I’d be interested in joining and 
contributing to this effort.   I have a ton of ideas around efficient encodings 
for different types of data.

> On Mar 10, 2020, at 2:52 PM, Wes McKinney  wrote:
> 
> See this past mailing list thread
> 
> https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937%40%3Cdev.arrow.apache.org%3E
> 
> and associated PR
> 
> https://github.com/apache/arrow/pull/4815
> 
> There hasn't been a lot of movement on this but primarily because all
> the key people who've expressed interest in it have been really busy
> with other matters (myself included). Have RLE-encoding in memory at
> minimum would be a huge benefit for a number of applications, so it
> would be great to continue the discussion and create a more
> comprehensive proposal document describing what we would like to
> implement (and what we do not want to implement)
> 
> On Tue, Mar 10, 2020 at 3:41 AM Radev, Martin  wrote:
>> 
>> Hey Evan,
>> 
>> 
>> thank you for the interest.
>> 
>> There has been some effort for compressing floating-point data on the 
>> Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not 
>> compress floating point data but makes it more compressible for when a 
>> compressor, such as ZSTD, LZ4, etc, is used. It only works well for 
>> high-entropy floating-point data, somewhere at least as large as >= 15 bits 
>> of entropy per element. I suppose the encoding might actually also make 
>> sense for high-entropy integer data but I am not super sure.
>> For low-entropy data, the dictionary encoding is good though I suspect there 
>> can be room for performance improvements.
>> This is my final report for the encoding here: 
>> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
>> 
>> Note that at some point my investigation turned out be quite the same 
>> solution as the one in https://github.com/powturbo/Turbo-Transpose.
>> 
>> 
>> Maybe the points I sent can be helpful.
>> 
>> 
>> Kinds regards,
>> 
>> Martin
>> 
>> 
>> From: evan_c...@apple.com  on behalf of Evan Chan 
>> 
>> Sent: Tuesday, March 10, 2020 5:15:48 AM
>> To: dev@arrow.apache.org
>> Subject: Summary of RLE and other compression efforts?
>> 
>> Hi folks,
>> 
>> I’m curious about the state of efforts for more compressed encodings in the 
>> Arrow columnar format.  I saw discussions previously about RLE, but is there 
>> a place to summarize all of the different efforts that are ongoing to bring 
>> more compressed encodings?
>> 
>> Is there an effort to compress floating point or integer data using 
>> techniques such as XOR compression and Delta-Delta?  I can contribute to 
>> some of these efforts as well.
>> 
>> Thanks,
>> Evan
>> 
>> 



Re: Summary of RLE and other compression efforts?

2020-03-10 Thread Wes McKinney
On Tue, Mar 10, 2020 at 5:01 PM Evan Chan  wrote:
>
> Martin,
>
> Many thanks for the links.
>
> My main concern is not actually FP and integer data, but sparse string data.  
> Having many very sparse arrays, each with a bitmap and values (assume 
> dictionary also), would be really expensive. I have some ideas I’d like to 
> throw out there, around something like a MapArray (Think of it essentially as 
> dictionaries of keys and values, plus List> for example), but 
> something optimized for sparseness.
>
> Overall, while I appreciate the design of Arrow arrays to be super fast for 
> computation, I’d like to be able to keep more of such data in memory, thus 
> I’m interested in more compact representations, that ideally don’t need a 
> compressor.  More like encoding.
>
> I saw a thread middle of last year about RLE encodings, this would be in the 
> right direction I think.   It could be implemented properly such that it 
> doesn’t make random access that bad.
>
> As for FP, I have my own scheme which is scale tested, SIMD friendly and 
> should perform relatively well, and can fit in with different predictors 
> including XOR, DFCM, etc.   Due to the high cardinality of most such data, I 
> wonder if it wouldn’t be simpler to stick with one such scheme for all FP 
> data.
>
> Anyways, I’m most curious about if there is a plan to implement RLE, the FP 
> schemes you describe, etc. and bring them to Arrow.  IE, is there a plan for 
> space efficient encodings overall for Arrow?

It's been discussed many times in the past. As Arrow is developed by
volunteers, if someone volunteers their time to work on it, it can
happen. The first step would be to build consensus about what sort of
protocol level additions (see the format/ directory and associated
documentation) are needed. Once there is consensus about what to build
and a complete specification for that, then implementation can move
forward.

> Thanks very much,
> Evan
>
> > On Mar 10, 2020, at 1:41 AM, Radev, Martin  wrote:
> >
> > Hey Evan,
> >
> >
> > thank you for the interest.
> >
> > There has been some effort for compressing floating-point data on the 
> > Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not 
> > compress floating point data but makes it more compressible for when a 
> > compressor, such as ZSTD, LZ4, etc, is used. It only works well for 
> > high-entropy floating-point data, somewhere at least as large as >= 15 bits 
> > of entropy per element. I suppose the encoding might actually also make 
> > sense for high-entropy integer data but I am not super sure.
> > For low-entropy data, the dictionary encoding is good though I suspect 
> > there can be room for performance improvements.
> > This is my final report for the encoding here: 
> > https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
> >
> > Note that at some point my investigation turned out be quite the same 
> > solution as the one in https://github.com/powturbo/Turbo-Transpose.
> >
> >
> > Maybe the points I sent can be helpful.
> >
> >
> > Kinds regards,
> >
> > Martin
> >
> > 
> > From: evan_c...@apple.com  on behalf of Evan Chan 
> > 
> > Sent: Tuesday, March 10, 2020 5:15:48 AM
> > To: dev@arrow.apache.org
> > Subject: Summary of RLE and other compression efforts?
> >
> > Hi folks,
> >
> > I’m curious about the state of efforts for more compressed encodings in the 
> > Arrow columnar format.  I saw discussions previously about RLE, but is 
> > there a place to summarize all of the different efforts that are ongoing to 
> > bring more compressed encodings?
> >
> > Is there an effort to compress floating point or integer data using 
> > techniques such as XOR compression and Delta-Delta?  I can contribute to 
> > some of these efforts as well.
> >
> > Thanks,
> > Evan
> >
> >
>


[jira] [Created] (ARROW-8068) [Python] Externalize option whether to bundle zlib DLL in Python packages

2020-03-10 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8068:
---

 Summary: [Python] Externalize option whether to bundle zlib DLL in 
Python packages
 Key: ARROW-8068
 URL: https://issues.apache.org/jira/browse/ARROW-8068
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


I ran into an esoteric situation in ARROW-8015 where I built the C++ library 
with all bundled dependencies including zlib. I then built a Python wheel, but 
the Python build failed when using {{PYARROW_BUNDLE_ARROW_CPP=1}} because it 
could not find {{zlib.dll}}. The failure points were both in CMakeLists.txt and 
in setup.py.

Perhaps this situation will only arise esoterically, but we may want to add a 
flag to toggle off the zlib bundling behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8069) [C++] Should the default value of "check_metadata" arguments of Equals methods be "true"?

2020-03-10 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8069:
---

 Summary: [C++] Should the default value of "check_metadata" 
arguments of Equals methods be "true"?
 Key: ARROW-8069
 URL: https://issues.apache.org/jira/browse/ARROW-8069
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


We just changed the default in Python to False for usability reasons. Since C++ 
has different usability considerations, we don't necessarily need to have the 
default be the same, but I'm curious if anyone has any opinions one way or the 
other. I would be weakly supportive of changing the default to false.
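
For reference, a small editorial example of the new Python-side behavior 
referred to above:

{code:python}
import pyarrow as pa

a = pa.schema([pa.field("x", pa.int64())])
b = a.with_metadata({b"origin": b"example"})

# With the new Python default (check_metadata=False) the schemas compare
# equal; opting in to metadata checks makes them differ.
assert a.equals(b)
assert not a.equals(b, check_metadata=True)
{code}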



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[NIGHTLY] Arrow Build Report for Job nightly-2020-03-10-0

2020-03-10 Thread Crossbow


Arrow Build Report for Job nightly-2020-03-10-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0

Failed Tasks:
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-github-centos-7
- centos-8:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-github-centos-8
- conda-linux-gcc-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-linux-gcc-py38
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-github-debian-stretch
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-gandiva-jar-osx
- gandiva-jar-trusty:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-gandiva-jar-trusty
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-homebrew-cpp
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-circle-test-conda-cpp-valgrind
- test-conda-python-3.7-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-circle-test-conda-python-3.7-pandas-master
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-circle-test-conda-python-3.7-turbodbc-master
- test-r-rhub-debian-gcc-devel:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-test-r-rhub-debian-gcc-devel
- test-r-rhub-ubuntu-gcc-release:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-test-r-rhub-ubuntu-gcc-release
- test-r-rstudio-r-base-3.6-opensuse15:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-test-r-rstudio-r-base-3.6-opensuse15
- test-r-rstudio-r-base-3.6-opensuse42:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-test-r-rstudio-r-base-3.6-opensuse42
- ubuntu-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-github-ubuntu-xenial
- wheel-osx-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-wheel-osx-cp35m
- wheel-osx-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-wheel-osx-cp36m
- wheel-osx-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-wheel-osx-cp37m
- wheel-osx-cp38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-wheel-osx-cp38

Succeeded Tasks:
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-github-centos-6
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-linux-gcc-py37
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-win-vs2015-py38
- debian-buster:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-github-debian-buster
- macos-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-macos-r-autobrew
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-circle-test-conda-cpp
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-202

[jira] [Created] (ARROW-8070) [Python] Casting Segfault

2020-03-10 Thread Daniel Nugent (Jira)
Daniel Nugent created ARROW-8070:


 Summary: [Python] Casting Segfault
 Key: ARROW-8070
 URL: https://issues.apache.org/jira/browse/ARROW-8070
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Daniel Nugent


Was messing around with some nested arrays and found a pretty easy-to-reproduce 
segfault:


{code:java}
Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np, pyarrow as pa
>>> pa.__version__
'0.16.0'
>>> np.__version__
'1.18.1'
>>> x=[np.array([b'a',b'b'])]
>>> a = pa.array(x,pa.list_(pa.binary()))
>>> a

[
  [
61,
62
  ]
]
>>> a.cast(pa.string())
Segmentation fault
{code}

I don't know if that cast makes sense, but I left the checks on, so I would not 
expect a segfault from it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8071) [GLib] Build error with configure

2020-03-10 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-8071:
---

 Summary: [GLib] Build error with configure
 Key: ARROW-8071
 URL: https://issues.apache.org/jira/browse/ARROW-8071
 Project: Apache Arrow
  Issue Type: Bug
  Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


This is introduced by ARROW-8055.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-03-10-0

2020-03-10 Thread Sutou Kouhei
Hi,

Failures of Linux packages will be fixed by
https://github.com/apache/arrow/pull/6575 .
Sorry.

Thanks,
--
kou

In <5e6834bf.1c69fb81.a268f.f...@mx.google.com>
  "[NIGHTLY] Arrow Build Report for Job nightly-2020-03-10-0" on Tue, 10 Mar 
2020 17:45:51 -0700 (PDT),
  Crossbow  wrote:

> 
> Arrow Build Report for Job nightly-2020-03-10-0
> 
> All tasks: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0
> 
> Failed Tasks:
> - centos-7:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-github-centos-7
> - centos-8:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-github-centos-8
> - conda-linux-gcc-py38:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-linux-gcc-py38
> - debian-stretch:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-github-debian-stretch
> - gandiva-jar-osx:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-gandiva-jar-osx
> - gandiva-jar-trusty:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-gandiva-jar-trusty
> - homebrew-cpp:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-homebrew-cpp
> - test-conda-cpp-valgrind:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-circle-test-conda-cpp-valgrind
> - test-conda-python-3.7-pandas-master:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-circle-test-conda-python-3.7-pandas-master
> - test-conda-python-3.7-turbodbc-latest:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-circle-test-conda-python-3.7-turbodbc-latest
> - test-conda-python-3.7-turbodbc-master:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-circle-test-conda-python-3.7-turbodbc-master
> - test-r-rhub-debian-gcc-devel:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-test-r-rhub-debian-gcc-devel
> - test-r-rhub-ubuntu-gcc-release:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-test-r-rhub-ubuntu-gcc-release
> - test-r-rstudio-r-base-3.6-opensuse15:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-test-r-rstudio-r-base-3.6-opensuse15
> - test-r-rstudio-r-base-3.6-opensuse42:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-test-r-rstudio-r-base-3.6-opensuse42
> - ubuntu-xenial:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-github-ubuntu-xenial
> - wheel-osx-cp35m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-wheel-osx-cp35m
> - wheel-osx-cp36m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-wheel-osx-cp36m
> - wheel-osx-cp37m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-wheel-osx-cp37m
> - wheel-osx-cp38:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-wheel-osx-cp38
> 
> Succeeded Tasks:
> - centos-6:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-github-centos-6
> - conda-linux-gcc-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-linux-gcc-py36
> - conda-linux-gcc-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-linux-gcc-py37
> - conda-osx-clang-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-osx-clang-py36
> - conda-osx-clang-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-osx-clang-py37
> - conda-osx-clang-py38:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-osx-clang-py38
> - conda-win-vs2015-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-win-vs2015-py36
> - conda-win-vs2015-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-win-vs2015-py37
> - conda-win-vs2015-py38:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-win-vs2015-py38
> - debian-buster:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-github-debian-buster
> - macos-r-autobrew:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-macos-r-autobrew
> - test-conda-cpp:
>   URL: 
> https://github.co

Re: Summary of RLE and other compression efforts?

2020-03-10 Thread Micah Kornfield
+1 to what Wes said.

I'm hoping to have some more time to spend on this end of Q2/beginning of
Q3 if no progress is made by then.

I still think we should be careful about what is added to the spec; in
particular, we should focus on encodings that can be used to improve
computational efficiency rather than just smaller size. Also, it is
important to note that any sort of encoding/compression must be supportable
across multiple languages/platforms.

Thanks,
Micah

On Tue, Mar 10, 2020 at 3:12 PM Wes McKinney  wrote:

> On Tue, Mar 10, 2020 at 5:01 PM Evan Chan 
> wrote:
> >
> > Martin,
> >
> > Many thanks for the links.
> >
> > My main concern is not actually FP and integer data, but sparse string
> data.  Having many very sparse arrays, each with a bitmap and values
> (assume dictionary also), would be really expensive. I have some ideas I’d
> like to throw out there, around something like a MapArray (Think of it
> essentially as dictionaries of keys and values, plus List> for
> example), but something optimized for sparseness.
> >
> > Overall, while I appreciate the design of Arrow arrays to be super fast
> for computation, I’d like to be able to keep more of such data in memory,
> thus I’m interested in more compact representations, that ideally don’t
> need a compressor.  More like encoding.
> >
> > I saw a thread middle of last year about RLE encodings, this would be in
> the right direction I think.   It could be implemented properly such that
> it doesn’t make random access that bad.
> >
> > As for FP, I have my own scheme which is scale tested, SIMD friendly and
> should perform relatively well, and can fit in with different predictors
> including XOR, DFCM, etc.   Due to the high cardinality of most such data,
> I wonder if it wouldn’t be simpler to stick with one such scheme for all FP
> data.
> >
> > Anyways, I’m most curious about if there is a plan to implement RLE, the
> FP schemes you describe, etc. and bring them to Arrow.  IE, is there a plan
> for space efficient encodings overall for Arrow?
>
> It's been discussed many times in the past. As Arrow is developed by
> volunteers, if someone volunteers their time to work on it, it can
> happen. The first step would be to build consensus about what sort of
> protocol level additions (see the format/ directory and associated
> documentation) are needed. Once there is consensus about what to build
> and a complete specification for that, then implementation can move
> forward.
>
> > Thanks very much,
> > Evan
> >
> > > On Mar 10, 2020, at 1:41 AM, Radev, Martin 
> wrote:
> > >
> > > Hey Evan,
> > >
> > >
> > > thank you for the interest.
> > >
> > > There has been some effort for compressing floating-point data on the
> Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not
> compress floating point data but makes it more compressible for when a
> compressor, such as ZSTD, LZ4, etc, is used. It only works well for
> high-entropy floating-point data, somewhere at least as large as >= 15 bits
> of entropy per element. I suppose the encoding might actually also make
> sense for high-entropy integer data but I am not super sure.
> > > For low-entropy data, the dictionary encoding is good though I suspect
> there can be room for performance improvements.
> > > This is my final report for the encoding here:
> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
> > >
> > > Note that at some point my investigation turned out be quite the same
> solution as the one in https://github.com/powturbo/Turbo-Transpose.
> > >
> > >
> > > Maybe the points I sent can be helpful.
> > >
> > >
> > > Kinds regards,
> > >
> > > Martin
> > >
> > > 
> > > From: evan_c...@apple.com  on behalf of Evan
> Chan 
> > > Sent: Tuesday, March 10, 2020 5:15:48 AM
> > > To: dev@arrow.apache.org
> > > Subject: Summary of RLE and other compression efforts?
> > >
> > > Hi folks,
> > >
> > > I’m curious about the state of efforts for more compressed encodings
> in the Arrow columnar format.  I saw discussions previously about RLE, but
> is there a place to summarize all of the different efforts that are ongoing
> to bring more compressed encodings?
> > >
> > > Is there an effort to compress floating point or integer data using
> techniques such as XOR compression and Delta-Delta?  I can contribute to
> some of these efforts as well.
> > >
> > > Thanks,
> > > Evan
> > >
> > >
> >
>


Re: [jira] [Created] (ARROW-8049) [C++] Upgrade bundled Thrift version to 0.13.0

2020-03-10 Thread Micah Kornfield
Hi Don,
I believe you need to send an e-mail to dev-unsubscr...@arrow.apache.org instead of
simply replying to the list.

Thanks,
Micah

On Tue, Mar 10, 2020 at 8:57 AM Don Hilborn  wrote:

> Unsubscribe
>
>
> -Don
>
>
> On Mon, Mar 9, 2020 at 6:19 PM Wes McKinney (Jira) 
> wrote:
>
> > Wes McKinney created ARROW-8049:
> > ---
> >
> >  Summary: [C++] Upgrade bundled Thrift version to 0.13.0
> >  Key: ARROW-8049
> >  URL: https://issues.apache.org/jira/browse/ARROW-8049
> >  Project: Apache Arrow
> >   Issue Type: Improvement
> >   Components: C++
> > Reporter: Wes McKinney
> >  Fix For: 0.17.0
> >
> >
> > Follow up to discussion in ARROW-6821
> >
> >
> >
> > --
> > This message was sent by Atlassian Jira
> > (v8.3.4#803005)
> >
>


Re: [Java] Port vector validate functionality

2020-03-10 Thread Micah Kornfield
I agree, it would also be good to run with some of the fuzzed IPC files.

On Fri, Mar 6, 2020 at 6:20 AM Wes McKinney  wrote:

> Seems useful. It may be a good idea to run within integration tests as
> an extra sanity check also
>
> On Fri, Mar 6, 2020 at 2:27 AM Ji Liu  wrote:
> >
> >
> > Hi all,
> > In C++ side, we already have array validate functionality[1] but no
> similar functionality in Java side.
> > I was thinking if we should port this into Java implementation? Since we
> already has visitor interface[2] and it seems not very complicated. I
> created an issue to track this[3].
> >
> >
> > Thanks,
> > Ji Liu
> >
> > [1]
> https://github.com/apache/arrow/blob/6600a39ffe149971afd5ad3c78c2b538cdc03cfd/cpp/src/arrow/array/validate.h
> > [2]
> https://github.com/apache/arrow/blob/6600a39ffe149971afd5ad3c78c2b538cdc03cfd/java/vector/src/main/java/org/apache/arrow/vector/compare/VectorVisitor.java
> > [3] https://issues.apache.org/jira/browse/ARROW-8020
>


Re: [Discuss] [Java] Implement vector diff functionality

2020-03-10 Thread Micah Kornfield
I'm in favor of this.  I think this can be combined with a custom matcher
for Google's Truth [1] library, to make a lot of our unit tests much more
readable

[1] https://github.com/google/truth

On Thu, Mar 5, 2020 at 11:29 PM Ji Liu  wrote:

>
> Hi all,
> In C++ side, we already have array diff functionality[1] for array equals
> and testing to make it easy to see difference between arrays and reduce
> debugging time.
> I think it’s better to have similar functionality in Java side for better
> testing facilities,  and I created an issue to track this[2].
>
>
> Thanks,
> Ji Liu
>
> [1]
> https://github.com/apache/arrow/blob/6600a39ffe149971afd5ad3c78c2b538cdc03cfd/cpp/src/arrow/array/diff.h
> [2] https://issues.apache.org/jira/browse/ARROW-8019


Re: [DISCUSS][Java] Support non-nullable vectors

2020-03-10 Thread Micah Kornfield
Hi Liya Fan,
I'm a little concerned that this will change assumptions for at least some
of the clients using the library (some might always rely on the validity
buffer being present).

I think this is a good feature to have for the reasons you mentioned. It
seems like there would need to be some sort of configuration bit to set for
this behavior. But I'd be worried about the code complexity this would
introduce.

Thanks,
Micah

On Tue, Mar 10, 2020 at 6:42 AM Fan Liya  wrote:

> Hi Wes,
>
> Thanks a lot for your quick reply.
> I think what you mentioned is almost exactly what we want to do in Java.The
> concept is not important.
>
> Maybe there are only some minor differences:
> 1. In C++, the null_count is mutable, while for Java, once a vector is
> constructed as non-nullable, its null count can only be 0.
> 2. In C++, a non-nullable array's validity buffer is null, while in Java,
> the buffer is an empty buffer, and cannot be changed.
>
> Best,
> Liya Fan
>
> On Tue, Mar 10, 2020 at 9:26 PM Wes McKinney  wrote:
>
> > hi Liya,
> >
> > In C++ we elect certain faster code paths when the null count is 0 or
> > computed to be zero. When the null count is 0, we do not allocate a
> > validity bitmap. And there is a "nullable" metadata-only flag at the
> > Field level. Could the same kinds of optimizations be implemented in
> > Java without introducing a "nullable" concept?
> >
> > - Wes
> >
> > On Tue, Mar 10, 2020 at 8:13 AM Fan Liya  wrote:
> > >
> > > Dear all,
> > >
> > > A non-nullable vector is one that is guaranteed to contain no nulls. We
> > > want to support non-nullable vectors in Java.
> > >
> > > *Motivations:*
> > > 1. It is widely used in practice. For example, in a database engine, a
> > > column can be declared as not null, so it cannot contain null values.
> > > 2.Non-nullable vectors has significant performance advantages compared
> > with
> > > their nullable conterparts, such as:
> > >   1) the memory space of the validity buffer can be saved.
> > >   2) manipulation of the validity buffer can be bypassed
> > >   3) some if-else branches can be replaced by sequential instructions
> (by
> > > the JIT compiler), leading to high throughput for the CPU pipeline.
> > >
> > > *Potential Cost:*
> > > For nullable vectors, there can be extra checks against the
> nullablility
> > > flag. So we must change the code in a way that minimizes the cost.
> > >
> > > *Proposed Changes:*
> > > 1. There is no need to create new vector classes. We add a final
> boolean
> > to
> > > the vector base classes as the nullability flag. The value of the flag
> > can
> > > be obtained from the field when creating the vector.
> > > 2. Add a method "boolean isNullable()" to the root interface
> ValueVector.
> > > 3. If a vector is non-nullable, its validity buffer should be an empty
> > > buffer (not null, so much of the existing logic can be left unchanged).
> > > 4. For operations involving validity buffers (e.g. isNull, get, set),
> we
> > > use the nullability flag to bypass manipulations to the validity
> buffer.
> > >
> > > Therefore, it should be possible to support the feature with small code
> > > changes.
> > >
> > > BTW, please note that similar behaviors have already been supported in
> > C++.
> > >
> > > Would you please give your valueable feedback?
> > >
> > > Best,
> > > Liya Fan
> >
>


[jira] [Created] (ARROW-8072) Add const constraint when parsing data

2020-03-10 Thread Siyuan Zhuang (Jira)
Siyuan Zhuang created ARROW-8072:


 Summary: Add const constraint when parsing data
 Key: ARROW-8072
 URL: https://issues.apache.org/jira/browse/ARROW-8072
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Plasma
Reporter: Siyuan Zhuang
Assignee: Siyuan Zhuang


Input data for plasma protocol.h/protocol.cc should be const.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8073) [GLib] Add binding of arrow::fs::PathForest

2020-03-10 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-8073:
---

 Summary: [GLib] Add binding of arrow::fs::PathForest
 Key: ARROW-8073
 URL: https://issues.apache.org/jira/browse/ARROW-8073
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Kenta Murata
Assignee: Kenta Murata






--
This message was sent by Atlassian Jira
(v8.3.4#803005)