[jira] [Created] (ARROW-6420) [Java] Improve the performance of UnionVector when getting underlying vectors

2019-09-02 Thread Liya Fan (Jira)
Liya Fan created ARROW-6420:
---

 Summary: [Java] Improve the performance of UnionVector when 
getting underlying vectors
 Key: ARROW-6420
 URL: https://issues.apache.org/jira/browse/ARROW-6420
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Getting the underlying vector is a frequent operation for UnionVector: it
relies on this operation to get/set data at each index.

The current implementation is inefficient. In particular, it first gets the
minor type at the given index, and then compares it against all possible minor
types in a switch statement until a match is found.

We improve the performance by storing the internal vectors in an array, whose 
index is the ordinal of the minor type. So given a minor type, its 
corresponding underlying vector can be obtained in O(1) time.

It should be noted that this technique is also applicable to UnionReader and 
UnionWriter, and support for UnionReader is already implemented.
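To illustrate the technique (a minimal sketch in Python rather than the actual
Java code; all names here are hypothetical):

{code:python}
# Before: dispatch by comparing the minor type against every candidate,
# as the switch statement does -- O(n) in the number of minor types.
def get_vector_by_switch(minor_type, typed_vectors):
    for candidate_type, vector in typed_vectors:
        if candidate_type == minor_type:
            return vector
    raise ValueError("unknown minor type")

# After: keep the internal vectors in an array indexed by the minor
# type's ordinal, so the lookup is a single O(1) array access.
class OrdinalIndexedVectors:
    def __init__(self, num_minor_types):
        self._vectors = [None] * num_minor_types

    def put(self, ordinal, vector):
        self._vectors[ordinal] = vector

    def get(self, ordinal):
        return self._vectors[ordinal]
{code}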



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6419) [Website] Blog post about Parquet dictionary performance work coming in 0.15.x release

2019-09-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6419:
---

 Summary: [Website] Blog post about Parquet dictionary performance 
work coming in 0.15.x release
 Key: ARROW-6419
 URL: https://issues.apache.org/jira/browse/ARROW-6419
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Website
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 0.15.0






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6418) Plasma cmake targets are not exported

2019-09-02 Thread Tobias Mayer (Jira)
Tobias Mayer created ARROW-6418:
---

 Summary: Plasma cmake targets are not exported
 Key: ARROW-6418
 URL: https://issues.apache.org/jira/browse/ARROW-6418
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Plasma
Affects Versions: 0.14.1
Reporter: Tobias Mayer
 Fix For: 0.15.0


The generated arrowTargets.cmake files in the build and install directories do
not include the Plasma targets.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6417) [C++][Parquet] Non-dictionary BinaryArray reads from Parquet format have slowed down since 0.11.x

2019-09-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6417:
---

 Summary: [C++][Parquet] Non-dictionary BinaryArray reads from 
Parquet format have slowed down since 0.11.x
 Key: ARROW-6417
 URL: https://issues.apache.org/jira/browse/ARROW-6417
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Wes McKinney
 Attachments: 20190903_parquet_benchmark.py, 
20190903_parquet_read_perf.png

In doing some benchmarking, I have found that binary reads seem to have slowed
down from Arrow 0.11.1 to the master branch. The comparison isn't quite
apples-to-apples, since I think these results were produced with different
versions of gcc, but it would be a good idea to do some basic profiling to see
where we might improve our memory allocation strategy (or whatever the
bottleneck turns out to be).
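For reference, a minimal sketch of this kind of benchmark (the real script is
the attached 20190903_parquet_benchmark.py; the file name and data sizes below
are only illustrative):

{code:python}
import time

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A table with a single non-dictionary binary column.
df = pd.DataFrame({"col": [b"x" * 32 for _ in range(1000000)]})
table = pa.Table.from_pandas(df)
pq.write_table(table, "binary.parquet", use_dictionary=False)

start = time.perf_counter()
pq.read_table("binary.parquet")
print("read time: %.3f s" % (time.perf_counter() - start))
{code}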



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6416) [Python] Confusing API & documentation regarding chunksizes

2019-09-02 Thread Arik Funke (Jira)
Arik Funke created ARROW-6416:
-

 Summary: [Python] Confusing API & documentation regarding 
chunksizes
 Key: ARROW-6416
 URL: https://issues.apache.org/jira/browse/ARROW-6416
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Arik Funke


The Python API and documentation regarding chunksizes are confusing in my
opinion:

Example:

[https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchFileWriter.html#pyarrow.RecordBatchFileWriter.write_table]
{code:python}
def write_table(self, Table table, chunksize=None):
    """
    Write RecordBatch to stream

    Parameters
    ----------
    batch : RecordBatch
{code}
 
This suggests that the file will be written with a fixed chunk size, when in
fact the {{chunksize}} parameter is an upper bound on the size of the chunks
to be written.

In my opinion this parameter should be renamed {{max_chunksize}} to avoid 
confusion and reflect its true purpose.

This would also improve naming consistency in the code base, since in the C++
implementation this parameter is already named {{max_chunksize}} in
{{cpp/src/arrow/ipc/writer.cc}}:
{code:cpp}
Status RecordBatchWriter::WriteTable(const Table& table, int64_t max_chunksize) {
{code}
Similarly, the parameter should be renamed in {{pyarrow.Table.to_batches(self, 
chunksize=None)}}.
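The upper-bound behaviour is easy to observe (a sketch against the Python API
as documented above, where the parameter is still named {{chunksize}}):

{code:python}
import pandas as pd
import pyarrow as pa

table = pa.Table.from_pandas(pd.DataFrame({"x": list(range(10))}))

# 10 rows with chunksize=4 yield batches of 4, 4 and 2 rows --
# an upper bound on the batch size, not a fixed chunk size.
batches = table.to_batches(chunksize=4)
print([b.num_rows for b in batches])  # [4, 4, 2]
{code}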

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: KeyValue metadata for column

2019-09-02 Thread roman.karlstetter
Thanks for the feedback, I'll try to work on this if I find some time.

Roman

-Original Message-
From: Wes McKinney
Sent: Monday, September 2, 2019 16:25
To: dev@arrow.apache.org
Subject: Re: KeyValue metadata for column

hi Roman,

It's just not implemented. See issue related to preserving Field-level Arrow 
metadata in Parquet

https://issues.apache.org/jira/browse/ARROW-4359

I think implementing this should be pretty straightforward. You can follow the 
code that handles KeyValue metadata at the Schema level

- Wes

On Mon, Sep 2, 2019 at 8:05 AM  wrote:
>
> Hi everyone,
>
>
>
> reading the descriptions here
> 
> https://parquet.apache.org/documentation/latest/#metadata, I think it 
> should be possible in general to add arbitrary key-value metadata to 
> any parquet column. Is that correct?
>
>
>
> If yes, in the https://github.com/apache/arrow repository, I do not 
> really find that functionality in the API (no such thing as 
> KeyValueMetaData or similar in 
> ColumnChunkMetaData::ColumnChunkMetaDataImpl), so I guess that it's 
> either not yet implemented or I just don't see it. If it's not 
> implemented yet, is that something someone not very familiar with the 
> code-base like me could implement or does that require a little more insight 
> and experience?
>
>
>
> Regards,
>
> Roman
>



[jira] [Created] (ARROW-6415) [R] Remove usage of R CMD config CXXCPP

2019-09-02 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6415:
--

 Summary: [R] Remove usage of R CMD config CXXCPP
 Key: ARROW-6415
 URL: https://issues.apache.org/jira/browse/ARROW-6415
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
 Fix For: 0.15.0


From an email from BDR at CRAN:

"R CMD config CXXCPP has been deprecated: it is not used by R itself and 
there are several things wrong with the standard autoconf detection code:

- If CXXCPP is set by the user, it is not tested.  It could be empty, 
which AFAICS none of you allow for.
- The code looks at $CXX -E and /lib/cpp in turn, and tests a system C 
header without consulting CPPFLAGS.  /lib/cpp is unlikely to find C++ 
headers, and we have seen instances where without CPPFLAGS it did not 
find C headers.
- It is the setting for the default C++ compiler, in R-devel C++11 but 
not specified in earlier R (even 3.6.x could be C++98).

It would be better to use $(CXX) -E (or $(CXX11) etc) or test for yourself.

Please change at the next package update."



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6414) pyarrow cannot (de)serialise an empty MultiIndex-ed column DataFrame

2019-09-02 Thread Stephen Gowdy (Jira)
Stephen Gowdy created ARROW-6414:


 Summary: pyarrow cannot (de)serialise an empty MultiIndex-ed 
column DataFrame
 Key: ARROW-6414
 URL: https://issues.apache.org/jira/browse/ARROW-6414
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.0
Reporter: Stephen Gowdy


If you have empty MultiIndex columns in a pandas DataFrame, pyarrow cannot
serialise and then deserialise it. Example code below demonstrates this.

{code:python}
import pandas as pd
import pyarrow as pa
columns = pd.MultiIndex.from_tuples([('a', 'b', 'c')])
df = pd.DataFrame(columns = columns)
df = df[[]]
pa.deserialize_pandas(pa.serialize_pandas(df).to_pybytes())
...
AttributeError: 'dict' object has no attribute 'dtype'
{code}




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6413) [R] Support autogenerating column names

2019-09-02 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6413:
--

 Summary: [R] Support autogenerating column names
 Key: ARROW-6413
 URL: https://issues.apache.org/jira/browse/ARROW-6413
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson


Following ARROW-6231, the C++ library has a way to create column names. Enable 
that in R.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: [DISCUSS][FORMAT] Concerning about character encoding of binary string data

2019-09-02 Thread Antoine Pitrou


Hello,

On 02/09/2019 at 10:39, Kenta Murata wrote:
> 
> There are two options for managing a character encoding in a BinaryArray.
> The first is introducing an optional character_encoding field in
> BinaryType.  The second is using the custom_metadata field to supply
> the character encoding name.

I am against parameterizing Binary types with the character encoding.
Binary data in a Binary array is opaque, and the type should reflect
that.  Its application-dependent meaning should be (optionally) encoded
in the metadata.  For example, at some point people might also want to
add a "mime-type" metadata key.

So I think the second solution (defining a well-known metadata key) is fine.

> If we use custom_metadata, we should decide on the key for this
> information.  I guess “charset” is a good candidate for the key because
> it is widely used for specifying which character encoding is used.

Or perhaps "ARROW:charset"?
(I think I prefer "encoding" rather than "charset" personally...)
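
As a sketch of what that could look like from Python (the "ARROW:charset" key
is only the proposal above, not an adopted convention):

{code:python}
import pyarrow as pa

# Attach the proposed well-known key as ordinary field-level metadata;
# the Binary type itself stays opaque.
field = pa.field("name", pa.binary(),
                 metadata={"ARROW:charset": "Shift_JIS"})
schema = pa.schema([field])
print(schema.field_by_name("name").metadata)
{code}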

> The value must be the name of a character encoding, such as “UTF-8”
> or “Windows-31J”.  It would be better if we could decide on canonical
> encoding names, but I guess that is hard work because many systems use
> the same name for different encodings.

I don't think encoding name canonicalization is Arrow's concern. Each
system has its rules and aliases.  And I doubt we're willing to
implement string processing algorithms for encodings other than UTF-8.

Regards

Antoine.


Re: [DISCUSS][FORMAT] Concerning about character encoding of binary string data

2019-09-02 Thread Wes McKinney
hi Kenta,

It seems like using ExtensionType would be a simple way to handle this
for the immediate purpose of implementing user-facing Array types. If
we wanted to change the metadata representation to something more
"built-in" then we can keep discussing this. It seems like having a
distinct DataType subclass and Array subclass for unicode-but-not-UTF8
would be useful, as opposed to adding an encoding attribute to
BinaryType. Interested to know what you think about this solution.
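
A rough sketch of that direction in Python (assuming a pyarrow version that
exposes the Python ExtensionType API; the class and extension name here are
hypothetical):

{code:python}
import pyarrow as pa

class EncodedStringType(pa.ExtensionType):
    """Binary storage annotated with a character encoding."""

    def __init__(self, encoding):
        self.encoding = encoding
        super().__init__(pa.binary(), "example.encoded_string")

    def __arrow_ext_serialize__(self):
        # Persist the encoding name in the extension metadata.
        return self.encoding.encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls(serialized.decode())

pa.register_extension_type(EncodedStringType("Windows-31J"))
{code}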

- Wes

On Mon, Sep 2, 2019 at 3:40 AM Kenta Murata  wrote:
>
> [Abstract]
> When we have string data encoded in a character encoding other than
> UTF-8, we must use a BinaryArray for the data.  But Apache Arrow
> doesn’t provide a way to specify which character encoding is used in a
> BinaryArray.  In this mail, I’d like to discuss how Apache Arrow
> could provide a way to manage a character encoding in a BinaryArray.
>
> I’d appreciate any comments or suggestions.
>
> [Long description]
> Apache Arrow has a specialized type for UTF-8 encoded strings but
> doesn’t have types for other character encodings, such as ISO-8859-x
> and Shift_JIS.  We currently need to track which character encoding is
> used in a binary string array outside of the array itself, e.g. in
> metadata.
>
> In the Datasets project, one of the goals is to support database
> protocols.  Some databases support many character encodings, each in
> its own manner.  For example, PostgreSQL allows specifying the
> character encoding for each database, and MySQL allows us to specify
> character encodings separately at each level: database, table, and
> column.
>
> My concern is how Apache Arrow should provide a way to specify
> character encodings for values in arrays.
>
> If people can constrain all their string data to UTF-8, StringArray
> is enough for them.  But if they cannot unify their string data on
> UTF-8, should Apache Arrow provide a standard way to manage character
> encodings?
>
> An example use of Apache Arrow in such a case is the internal data of
> an OR mapper library, such as ActiveRecord of Ruby on Rails.
>
> My opinion is that Apache Arrow must have a standard way in both its
> format and its API.  The reasons are below:
>
> (1) Currently, when we use MySQL or PostgreSQL as the data source of
> record batch streams, we lose the information about the character
> encodings the original data had
>
> (2) Without a standard way of managing character encodings, we have
> to struggle to support character encoding handling for each
> combination of systems, which does not fit Apache Arrow’s philosophy
>
> (3) We cannot support character encoding handling at the
> language-binding level if Apache Arrow doesn’t provide standard APIs
> for character encoding management
>
> There are two options for managing a character encoding in a
> BinaryArray.  The first is introducing an optional character_encoding
> field in BinaryType.  The second is using the custom_metadata field
> to supply the character encoding name.
>
> If we use custom_metadata, we should decide on the key for this
> information.  I guess “charset” is a good candidate for the key
> because it is widely used for specifying which character encoding is
> used.  The value must be the name of a character encoding, such as
> “UTF-8” or “Windows-31J”.  It would be better if we could decide on
> canonical encoding names, but I guess that is hard work because many
> systems use the same name for different encodings.  For example,
> “Shift_JIS” means either IANA’s Shift_JIS or Windows-31J; they use
> the same coding rule, but the corresponding character sets are
> slightly different.  See the spreadsheet [1] for the correspondence
> of character encoding names between MySQL, PostgreSQL, Ruby, Python,
> IANA [3], and the Encoding standard of WHATWG [4].
>
> If we introduce a new optional field for the character encoding in
> BinaryType, I recommend letting this new field be a string that holds
> the name of a character encoding.  It would also be possible to make
> the field an integer holding an enum value, but I don’t know of a
> good standard for enum values of character encodings.  IANA manages
> MIBenum [2], though the registered character encodings [3] are not
> enough for our requirements, I think.
>
> I prefer the second way because the first way can supply the
> character encoding information only to a Field, not to a BinaryArray.
> [1] 
> https://docs.google.com/spreadsheets/d/1D0xlI5r2wJUV45aTY1q2TwqD__v7acmd8FOfr8xSOVQ/edit?usp=sharing
> [2] https://tools.ietf.org/html/rfc3808
> [3] https://www.iana.org/assignments/character-sets/character-sets.xhtml
> [4] https://encoding.spec.whatwg.org/
>
> --
> Kenta Murata


Re: KeyValue metadata for column

2019-09-02 Thread Wes McKinney
hi Roman,

It's just not implemented. See issue related to preserving Field-level
Arrow metadata in Parquet

https://issues.apache.org/jira/browse/ARROW-4359

I think implementing this should be pretty straightforward. You can
follow the code that handles KeyValue metadata at the Schema level
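
For example, at the Schema level this already round-trips end-to-end (a small
sketch; the key and file name are arbitrary):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=["a"])
table = table.replace_schema_metadata({"owner": "roman"})
pq.write_table(table, "meta.parquet")

# Schema-level KeyValue metadata survives the Parquet round trip;
# Field-level metadata is what is not yet preserved.
print(pq.read_table("meta.parquet").schema.metadata)
{code}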

- Wes

On Mon, Sep 2, 2019 at 8:05 AM  wrote:
>
> Hi everyone,
>
>
>
> reading the descriptions here
> 
> https://parquet.apache.org/documentation/latest/#metadata, I think it should
> be possible in general to add arbitrary key-value metadata to any parquet
> column. Is that correct?
>
>
>
> If yes, in the https://github.com/apache/arrow repository, I do not really
> find that functionality in the API (no such thing as KeyValueMetaData or
> similar in ColumnChunkMetaData::ColumnChunkMetaDataImpl), so I guess that
> it's either not yet implemented or I just don't see it. If it's not
> implemented yet, is that something someone not very familiar with the
> code-base like me could implement or does that require a little more insight
> and experience?
>
>
>
> Regards,
>
> Roman
>


[jira] [Created] (ARROW-6412) [C++] arrow-flight-test can crash because of port allocation

2019-09-02 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-6412:
-

 Summary: [C++] arrow-flight-test can crash because of port 
allocation
 Key: ARROW-6412
 URL: https://issues.apache.org/jira/browse/ARROW-6412
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


I get this error sometimes locally when running the tests in parallel:
{code}
[--] 11 tests from TestFlightClient
[ RUN  ] TestFlightClient.ListFlights
E0902 15:13:55.996271678   17281 socket_utils_common_posix.cc:201] check for 
SO_REUSEPORT: {"created":"@1567430035.996256600","description":"SO_REUSEPORT 
unavailable on compiling 
system","file":"../src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":169}
[   OK ] TestFlightClient.ListFlights (17 ms)
[ RUN  ] TestFlightClient.GetFlightInfo
E0902 15:13:56.013065793   17281 server_chttp2.cc:40]
{"created":"@1567430036.013032600","description":"No address added out of total 
1 
resolved","file":"../src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":394,"referenced_errors":[{"created":"@1567430036.013029044","description":"Unable
 to configure 
socket","fd":6,"file":"../src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":217,"referenced_errors":[{"created":"@1567430036.013021880","description":"Address
 already in 
use","errno":98,"file":"../src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address
 already in use","syscall":"bind"}]}]}
../src/arrow/flight/flight_test.cc:271: Failure
Failed
'server->Init(options)' failed with Unknown error: Server did not start properly
{code}




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


KeyValue metadata for column

2019-09-02 Thread roman.karlstetter
Hi everyone,

 

reading the descriptions here

https://parquet.apache.org/documentation/latest/#metadata, I think it should
be possible in general to add arbitrary key-value metadata to any parquet
column. Is that correct?

 

If yes, in the https://github.com/apache/arrow repository, I do not really
find that functionality in the API (no such thing as KeyValueMetaData or
similar in ColumnChunkMetaData::ColumnChunkMetaDataImpl), so I guess that
it's either not yet implemented or I just don't see it. If it's not
implemented yet, is that something someone not very familiar with the
code-base like me could implement or does that require a little more insight
and experience?

 

Regards,

Roman



[DISCUSS][FORMAT] Concerning about character encoding of binary string data

2019-09-02 Thread Kenta Murata
[Abstract]
When we have string data encoded in a character encoding other than
UTF-8, we must use a BinaryArray for the data.  But Apache Arrow
doesn’t provide a way to specify which character encoding is used in a
BinaryArray.  In this mail, I’d like to discuss how Apache Arrow
could provide a way to manage a character encoding in a BinaryArray.

I’d appreciate any comments or suggestions.

[Long description]
Apache Arrow has a specialized type for UTF-8 encoded strings but
doesn’t have types for other character encodings, such as ISO-8859-x
and Shift_JIS.  We currently need to track which character encoding is
used in a binary string array outside of the array itself, e.g. in
metadata.

In the Datasets project, one of the goals is to support database
protocols.  Some databases support many character encodings, each in
its own manner.  For example, PostgreSQL allows specifying the
character encoding for each database, and MySQL allows us to specify
character encodings separately at each level: database, table, and
column.

My concern is how Apache Arrow should provide a way to specify
character encodings for values in arrays.

If people can constrain all their string data to UTF-8, StringArray
is enough for them.  But if they cannot unify their string data on
UTF-8, should Apache Arrow provide a standard way to manage character
encodings?

An example use of Apache Arrow in such a case is the internal data of
an OR mapper library, such as ActiveRecord of Ruby on Rails.

My opinion is that Apache Arrow must have a standard way in both its
format and its API.  The reasons are below:

(1) Currently, when we use MySQL or PostgreSQL as the data source of
record batch streams, we lose the information about the character
encodings the original data had

(2) Without a standard way of managing character encodings, we have
to struggle to support character encoding handling for each
combination of systems, which does not fit Apache Arrow’s philosophy

(3) We cannot support character encoding handling at the
language-binding level if Apache Arrow doesn’t provide standard APIs
for character encoding management

There are two options for managing a character encoding in a
BinaryArray.  The first is introducing an optional character_encoding
field in BinaryType.  The second is using the custom_metadata field
to supply the character encoding name.

If we use custom_metadata, we should decide on the key for this
information.  I guess “charset” is a good candidate for the key
because it is widely used for specifying which character encoding is
used.  The value must be the name of a character encoding, such as
“UTF-8” or “Windows-31J”.  It would be better if we could decide on
canonical encoding names, but I guess that is hard work because many
systems use the same name for different encodings.  For example,
“Shift_JIS” means either IANA’s Shift_JIS or Windows-31J; they use
the same coding rule, but the corresponding character sets are
slightly different.  See the spreadsheet [1] for the correspondence
of character encoding names between MySQL, PostgreSQL, Ruby, Python,
IANA [3], and the Encoding standard of WHATWG [4].

If we introduce a new optional field for the character encoding in
BinaryType, I recommend letting this new field be a string that holds
the name of a character encoding.  It would also be possible to make
the field an integer holding an enum value, but I don’t know of a
good standard for enum values of character encodings.  IANA manages
MIBenum [2], though the registered character encodings [3] are not
enough for our requirements, I think.

I prefer the second way because the first way can supply the
character encoding information only to a Field, not to a BinaryArray.

[1] 
https://docs.google.com/spreadsheets/d/1D0xlI5r2wJUV45aTY1q2TwqD__v7acmd8FOfr8xSOVQ/edit?usp=sharing
[2] https://tools.ietf.org/html/rfc3808
[3] https://www.iana.org/assignments/character-sets/character-sets.xhtml
[4] https://encoding.spec.whatwg.org/

-- 
Kenta Murata


Re: [DISCUSS][Format][C++] Improvement of sparse tensor format and implementation

2019-09-02 Thread Kenta Murata
On Wed, Aug 28, 2019 at 8:57, Rok Mihevc wrote:
>
> On Wed, Aug 28, 2019 at 1:18 AM Wes McKinney  wrote:
>
> > null/NA. But, as far as I'm aware, this component of pandas is
> > relatively unique and was never intended as an alternative to sparse
> > matrix libraries.
> >
>
> Another example is
> https://sparse.pydata.org/en/latest/generated/sparse.SparseArray.html?highlight=fill%20value#sparse.SparseArray.fill_value,
> but it might have been influenced by Pandas.

pydata/sparse's COO tensor also has a fill_value property, and its
to_scipy_sparse method raises a ValueError when the tensor has a
non-zero fill value.
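
A tiny example of that behavior (assuming a recent pydata/sparse):

{code:python}
import sparse

# A 2x2 COO tensor whose background (fill) value is 1.0, not 0.
s = sparse.COO(coords=[[0], [0]], data=[2.0], shape=(2, 2), fill_value=1.0)

# scipy.sparse has no notion of a non-zero fill value, so this raises.
s.to_scipy_sparse()  # ValueError
{code}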

So we should support fill value someday, I think.

> I'm ok with dropping this for now.

Yes, we can advance without it and support it later.
And I think supporting a fill value is not difficult.

-- 
Kenta Murata


Re: [DISCUSS][Format][C++] Improvement of sparse tensor format and implementation

2019-09-02 Thread Kenta Murata
On Wed, Aug 28, 2019 at 6:05, Wes McKinney wrote:
> I'm also OK with these changes. Since we have not established a
> versioning or compatibility policy with regards to "Other" data
> structures like Tensor and SparseTensor, I don't know that a vote is
> needed, just a pull request.

I didn't realize that Tensor and SparseTensor aren't restricted by a
versioning and compatibility policy.

OK, I'll send some pull requests.

-- 
Kenta Murata