[jira] [Created] (ARROW-2661) [Python/C++] Allow passing HDFS Config values via map/dict instead of needing an hdfs-site.xml file

2018-06-01 Thread Matt Topol (JIRA)
Matt Topol created ARROW-2661:
-

 Summary: [Python/C++] Allow passing HDFS Config values via 
map/dict instead of needing an hdfs-site.xml file
 Key: ARROW-2661
 URL: https://issues.apache.org/jira/browse/ARROW-2661
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Python
Affects Versions: 0.10.0
Reporter: Matt Topol


Currently, in order to pass HDFS configuration values down to the underlying 
libhdfs driver, you have to set either HADOOP_HOME (for libhdfs) or 
LIBHDFS3_CONF (for libhdfs3) to point to the location of an hdfs-site.xml file. 
However, the API provided by both drivers allows calling hdfsBuilderConfSetStr 
to set arbitrary configuration values on the hdfsBuilder object. Exposing this 
would let consumers programmatically set any configuration they want, including 
encryption and other complex HDFS configurations, without needing to provide an 
hdfs-site.xml file.
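The mapping from a dict to the driver API could look roughly like the sketch 
below. This is a hypothetical illustration: `HdfsBuilderStub`, `conf_set_str`, 
and `apply_extra_conf` are made-up names standing in for the real binding, 
which would call the C function hdfsBuilderConfSetStr on the driver's builder.

```python
class HdfsBuilderStub:
    """Stand-in for hdfsBuilder; records key/value pairs the way
    hdfsBuilderConfSetStr(builder, key, value) would apply them."""

    def __init__(self):
        self.conf = {}

    def conf_set_str(self, key, value):
        self.conf[key] = value


def apply_extra_conf(builder, extra_conf):
    # Each dict entry maps to one hdfsBuilderConfSetStr call, so no
    # hdfs-site.xml file is needed for these settings.
    for key, value in extra_conf.items():
        builder.conf_set_str(key, str(value))


builder = HdfsBuilderStub()
apply_extra_conf(builder, {
    "dfs.encrypt.data.transfer": "true",
    "dfs.client.use.datanode.hostname": "true",
})
```

A `pa.hdfs.connect(..., extra_conf={...})`-style keyword argument could then 
funnel user-supplied settings through such a loop before the connection is 
established.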





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


RE: JDBC Adapter PR - 1759

2018-06-01 Thread Atul Dambalkar
Hi Laurent, 

Thanks for your review comments. We have completed the code changes and merged 
them as well. I have replied to a few of your comments. Please take a look when 
you get a chance.

Regards,
-Atul

-Original Message-
From: Laurent Goujon [mailto:laur...@dremio.com] 
Sent: Wednesday, May 30, 2018 5:38 AM
To: dev@arrow.apache.org
Subject: Re: JDBC Adapter PR - 1759

Same here.

On Tue, May 29, 2018 at 9:59 AM, Siddharth Teotia 
wrote:

> Hi Atul,
>
> I will take a look today.
>
> Thanks,
> Sidd
>
> On Tue, May 29, 2018 at 2:45 AM, Atul Dambalkar < 
> atul.dambal...@xoriant.com>
> wrote:
>
> > Hi Sid, Laurent, Uwe,
> >
> > Any idea when someone can take a look at the PR
> > https://github.com/apache/arrow/pull/1759/.
> >
> > Laurent had given a bunch of comments earlier and now we have taken
> > care of most of those. We have also added multiple test cases. It
> > would be great if someone could take a look.
> >
> > Regards,
> > -Atul
> >
> >
>


[C++] Avoiding Nullptrs?

2018-06-01 Thread Krisztián Szűcs
Hi Everyone,

Recently I've investigated a parquet edge case, ARROW-2591, which is caused by a 
nullptr returned here: 
https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L415
I'm wondering: would it be possible to avoid having nullptrs in the first place, 
instead of checking for them?
IMHO having them around causes hidden bugs as well as increasing complexity.

Krisztian


[jira] [Created] (ARROW-2660) [Python] Experiment with zero-copy pickling

2018-06-01 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2660:
-

 Summary: [Python] Experiment with zero-copy pickling
 Key: ARROW-2660
 URL: https://issues.apache.org/jira/browse/ARROW-2660
 Project: Apache Arrow
  Issue Type: Wish
  Components: Python
Affects Versions: 0.9.0
Reporter: Antoine Pitrou


PEP 574 has an implementation ready and a PyPI-available backport (at 
[https://pypi.org/project/pickle5/]). Adding experimental support for it would 
allow zero-copy pickling of Arrow arrays, columns, etc.

I think it mainly involves implementing {{__reduce_ex__}} on the {{Buffer}} 
class, as described in [https://www.python.org/dev/peps/pep-0574/#producer-api]

In addition, the consumer API added by PEP 574 could be used in Arrow's 
serialization layer, to avoid or minimize copies when serializing foreign 
objects.
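The producer API the PEP describes can be sketched with the stdlib pickle on 
Python 3.8+ (the pickle5 backport exposes the same names). `RawBuffer` below is 
a toy stand-in for an Arrow Buffer, not the real class; the point is that its 
`__reduce_ex__` hands out a `pickle.PickleBuffer` so protocol 5 can move the 
payload out-of-band instead of copying it into the pickle stream.

```python
import pickle


class RawBuffer:
    """Toy stand-in for an Arrow Buffer wrapping contiguous memory."""

    def __init__(self, data):
        # Accept bytes or any buffer-protocol object (e.g. the PickleBuffer
        # handed back at unpickling time) without copying it.
        self.view = memoryview(data)

    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            # Producer API: expose the payload as a PickleBuffer so it can
            # travel out-of-band via buffer_callback, zero-copy.
            return RawBuffer, (pickle.PickleBuffer(self.view),)
        # Older protocols fall back to an in-band copy.
        return RawBuffer, (self.view.tobytes(),)


buf = RawBuffer(b"x" * (1 << 20))  # 1 MiB payload
buffers = []
payload = pickle.dumps(buf, protocol=5, buffer_callback=buffers.append)
# The 1 MiB body went into `buffers`; the pickle itself stays tiny.
restored = pickle.loads(payload, buffers=buffers)
```

The consumer side is symmetric: whoever transports `payload` also transports 
`buffers` and passes them back to `loads(..., buffers=...)`.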





[jira] [Created] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

2018-06-01 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-2659:
--

 Summary: [Python] More graceful reading of empty String columns in 
ParquetDataset
 Key: ARROW-2659
 URL: https://issues.apache.org/jira/browse/ARROW-2659
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
Reporter: Uwe L. Korn
 Fix For: 0.11.0


When currently saving a {{ParquetDataset}} from Pandas, we don't get consistent 
schemas, even if the source was a single DataFrame. This is due to the fact 
that in some partitions, object columns such as string can become empty; the 
resulting Arrow schema will then differ. In the central metadata, we will store 
this column as {{pa.string}}, whereas in the partition file with the empty 
column, it will be stored as {{pa.null}}.

The two schemas are still a valid match in terms of schema evolution and we 
should respect that in 
https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L754
Instead of doing a {{pa.Schema.equals}} in 
https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L778
we should introduce a new method {{pa.Schema.can_evolve_to}} that is more 
graceful and returns {{True}} if a dataset piece has a null column where the 
main metadata states a nullable column of any type.
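The proposed check can be sketched as follows. This is a hypothetical model, 
not the real {{pa.Schema}} API: fields are represented as (name, type, 
nullable) tuples, and `can_evolve_to` is the not-yet-existing method described 
above.

```python
def can_evolve_to(piece_schema, dataset_schema):
    """Return True if piece_schema equals dataset_schema, except that an
    all-null column may stand in for any nullable column of the dataset."""
    if len(piece_schema) != len(dataset_schema):
        return False
    for (p_name, p_type, _), (d_name, d_type, d_nullable) in zip(
            piece_schema, dataset_schema):
        if p_name != d_name:
            return False
        if p_type == d_type:
            continue
        # The empty-partition case: pa.null in the piece vs. a nullable
        # column of any type in the central metadata.
        if p_type == "null" and d_nullable:
            continue
        return False
    return True


dataset = [("id", "int64", False), ("name", "string", True)]
empty_piece = [("id", "int64", False), ("name", "null", True)]
```

Note the check is deliberately asymmetric: an empty piece evolves to the 
dataset schema, but not the other way around.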


