[jira] [Created] (ARROW-2661) [Python/C++] Allow passing HDFS Config values via map/dict instead of needing an hdfs-site.xml file
Matt Topol created ARROW-2661:
------------------------------

             Summary: [Python/C++] Allow passing HDFS Config values via map/dict instead of needing an hdfs-site.xml file
                 Key: ARROW-2661
                 URL: https://issues.apache.org/jira/browse/ARROW-2661
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++, Python
    Affects Versions: 0.10.0
            Reporter: Matt Topol

Currently, in order to pass HDFS configuration values down to the underlying driver, you have to set either HADOOP_HOME (for libhdfs) or LIBHDFS3_CONF (for libhdfs3) to point to the location of an hdfs-site.xml file. However, the API provided by both libraries allows calling hdfsBuilderConfSetStr to set arbitrary configuration values on the hdfsBuilder object. Exposing this would let consumers programmatically set any configuration they want, including encryption and other complex HDFS settings, without needing to provide an hdfs-site.xml file.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
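The request above rests on the fact that hdfs-site.xml is just a flat list of name/value properties, so the same information can be carried in a plain dict and fed to hdfsBuilderConfSetStr one entry at a time. A minimal sketch of that equivalence (the example property names are illustrative, not required by the proposal):

```python
import xml.etree.ElementTree as ET

def hdfs_site_to_dict(xml_text: str) -> dict:
    """Parse Hadoop-style configuration XML into a {name: value} dict.

    Each <property> element in hdfs-site.xml carries a <name> and a
    <value>; the dict form is what could be passed straight through to
    hdfsBuilderConfSetStr without any file on disk.
    """
    root = ET.fromstring(xml_text)
    conf = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            conf[name] = prop.findtext("value")
    return conf

# Example hdfs-site.xml content (property names chosen for illustration):
example = """<configuration>
  <property><name>dfs.encrypt.data.transfer</name><value>true</value></property>
  <property><name>dfs.replication</name><value>3</value></property>
</configuration>"""

conf = hdfs_site_to_dict(example)
# conf == {"dfs.encrypt.data.transfer": "true", "dfs.replication": "3"}
```

On the C++ side, each entry of such a dict would map to one hdfsBuilderConfSetStr(builder, key, value) call before hdfsBuilderConnect.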
RE: JDBC Adapter PR - 1759
Hi Laurent,

Thanks for your review comments. We have completed the code changes and merged them as well. I have replied to a few of your comments. Please take a look when you get a chance.

Regards,
-Atul

-----Original Message-----
From: Laurent Goujon [mailto:laur...@dremio.com]
Sent: Wednesday, May 30, 2018 5:38 AM
To: dev@arrow.apache.org
Subject: Re: JDBC Adapter PR - 1759

Same here.

On Tue, May 29, 2018 at 9:59 AM, Siddharth Teotia wrote:

> Hi Atul,
>
> I will take a look today.
>
> Thanks,
> Sidd
>
> On Tue, May 29, 2018 at 2:45 AM, Atul Dambalkar <
> atul.dambal...@xoriant.com> wrote:
>
> > Hi Sid, Laurent, Uwe,
> >
> > Any idea when someone can take a look at the PR
> > https://github.com/apache/arrow/pull/1759/.
> >
> > Laurent had given a bunch of comments earlier and now we have taken
> > care of most of those. We have also added multiple test cases. It
> > will be great if someone can take a look.
> >
> > Regards,
> > -Atul
[C++] Avoiding Nullptrs?
Hi Everyone,

Recently I've been investigating a Parquet edge case, ARROW-2591, which is caused by a nullptr returned here:
https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L415

I'm wondering whether it would be possible to avoid having nullptrs at all instead of checking for them. IMHO, having them around causes hidden bugs and increases complexity.

Krisztian
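The trade-off raised above is not C++-specific: returning a null sentinel puts the burden on every caller to remember a check, while failing fast at the source keeps invalid values from propagating. A small Python analogue of the two styles (function and variable names are hypothetical, purely for illustration):

```python
def lookup_writer_checked(writers, name):
    """Sentinel style: returns None on a miss.

    Every caller must remember to check the result, and a forgotten
    check surfaces later as a confusing failure far from the cause --
    the "hidden bugs" the thread is worried about.
    """
    return writers.get(name)  # may silently return None

def lookup_writer_strict(writers, name):
    """Fail-fast style: an invalid name raises immediately at the source,
    so no null-like value ever escapes into the rest of the program."""
    try:
        return writers[name]
    except KeyError:
        raise KeyError(f"no writer registered for column {name!r}")
```

In C++ the same choice shows up as returning a raw pointer (possibly nullptr) versus returning a Status/Result or a reference that is guaranteed valid.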
[jira] [Created] (ARROW-2660) [Python] Experiment with zero-copy pickling
Antoine Pitrou created ARROW-2660:
----------------------------------

             Summary: [Python] Experiment with zero-copy pickling
                 Key: ARROW-2660
                 URL: https://issues.apache.org/jira/browse/ARROW-2660
             Project: Apache Arrow
          Issue Type: Wish
          Components: Python
    Affects Versions: 0.9.0
            Reporter: Antoine Pitrou

PEP 574 has an implementation ready and a PyPI-available backport (at [https://pypi.org/project/pickle5/]). Adding experimental support for it would allow for zero-copy pickling of Arrow arrays, columns, etc. I think it mainly involves implementing {{__reduce_ex__}} on the {{Buffer}} class, as described in [https://www.python.org/dev/peps/pep-0574/#producer-api]

In addition, the consumer API added by PEP 574 could be used in Arrow's serialization layer, to avoid or minimize copies when serializing foreign objects.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
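A minimal demonstration of the PEP 574 producer API the ticket refers to: a __reduce_ex__ that hands the pickler a PickleBuffer so the payload can travel out-of-band under pickle protocol 5. The Blob class here is a stand-in for illustration, not pyarrow's actual Buffer implementation (it runs on Python 3.8+, where protocol 5 is in the stdlib; on older Pythons the pickle5 backport provides the same API):

```python
import pickle
from pickle import PickleBuffer

class Blob:
    """Stand-in for a buffer-like object; NOT the real pyarrow.Buffer."""

    def __init__(self, data):
        self.data = bytearray(data)

    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            # Producer API: wrap the raw bytes in a PickleBuffer so the
            # pickler can route them through buffer_callback out-of-band
            # instead of copying them into the pickle stream.
            return (Blob, (PickleBuffer(self.data),))
        # Fallback for older protocols: an in-band copy.
        return (Blob, (bytes(self.data),))

blob = Blob(b"arrow")
out_of_band = []
payload = pickle.dumps(blob, protocol=5, buffer_callback=out_of_band.append)
# The bytes are NOT in `payload`; they were handed to the callback.
restored = pickle.loads(payload, buffers=out_of_band)
```

For true zero-copy the reconstructor would keep a view over the received buffer rather than copying it into a bytearray as this sketch does; the point here is only the out-of-band round trip.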
[jira] [Created] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset
Uwe L. Korn created ARROW-2659:
-------------------------------

             Summary: [Python] More graceful reading of empty String columns in ParquetDataset
                 Key: ARROW-2659
                 URL: https://issues.apache.org/jira/browse/ARROW-2659
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.9.0
            Reporter: Uwe L. Korn
             Fix For: 0.11.0

When saving a {{ParquetDataset}} from Pandas, we currently don't get consistent schemas, even if the source was a single DataFrame. This is because object columns such as strings can become empty in some partitions, and the resulting Arrow schema will then differ: in the central metadata the column is stored as {{pa.string}}, whereas in a partition file with the empty column it is stored as {{pa.null}}. The two schemas are still a valid match in terms of schema evolution, and we should respect that in https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L754

Instead of doing a {{pa.Schema.equals}} in https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L778 we should introduce a new method {{pa.Schema.can_evolve_to}} that is more graceful and returns {{True}} if a dataset piece has a null column where the main metadata states a nullable column of any type.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
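The proposed check can be sketched in plain Python, using simple field-name-to-type-name dicts instead of real pyarrow Schema objects (the function name follows the ticket's proposed {{can_evolve_to}}; everything else is a hypothetical simplification, ignoring nullability flags and nested types):

```python
def can_evolve_to(piece_schema, target_schema):
    """Return True if a dataset piece's schema is an acceptable match
    for the central metadata schema.

    piece_schema / target_schema: dicts mapping field name -> type name.
    A field matches if the types are equal, or if the piece stores
    "null" where the metadata declares some (nullable) concrete type --
    the empty-partition case described in the ticket.
    """
    if piece_schema.keys() != target_schema.keys():
        return False
    for name, piece_type in piece_schema.items():
        if piece_type == target_schema[name]:
            continue
        if piece_type == "null":
            # e.g. a string column that happened to be empty in this piece
            continue
        return False
    return True
```

Under this rule a piece with {"col": "null"} still matches metadata declaring {"col": "string"}, while a genuine type conflict such as int64 vs. string is still rejected, which is exactly the relaxation over a strict equals comparison that the ticket asks for.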