Sounds great. Regarding the min/max stats issue, is that an issue with the way 
the files are written or read? What's the Parquet project issue for that bug? 
What does the 1.9.1 release timeline look like?

I will aim to have a PR in by the end of the week. I feel strongly that either 
this or https://github.com/apache/spark/pull/15538 needs to make it into 2.1. 
The logging output issue is really bad. I would probably call it a blocker.

Michael


> On Nov 1, 2016, at 1:22 PM, Ryan Blue <rb...@netflix.com> wrote:
> 
> I can when I'm finished with a couple other issues if no one gets to it first.
> 
> Michael, if you're interested in updating to 1.9.0 I'm happy to help review 
> that PR.
> 
> On Tue, Nov 1, 2016 at 1:03 PM, Reynold Xin <r...@databricks.com> wrote:
> Ryan want to submit a pull request?
> 
> 
> On Tue, Nov 1, 2016 at 9:05 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> 1.9.0 includes some fixes intended specifically for Spark:
> 
> * PARQUET-389: Evaluates push-down predicates for missing columns as though 
> they are null. This is to address Spark's work-around that requires reading 
> and merging file schemas, even for metastore tables.
> * PARQUET-654: Adds an option to disable record-level predicate push-down 
> while keeping row group evaluation. This allows Spark to skip row groups 
> based on stats and dictionaries while implementing its own vectorized record 
> filtering (rough sketch below).
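> 
> In Spark terms, using that option would roughly mean pushing the predicate 
> down for row group skipping while turning record-level filtering off in the 
> Hadoop conf. A minimal sketch; the constant and the column name are my 
> assumptions from memory, not verified against the 1.9.0 API:
> 
>   import org.apache.hadoop.conf.Configuration
>   import org.apache.parquet.filter2.predicate.FilterApi
>   import org.apache.parquet.hadoop.ParquetInputFormat
> 
>   val hadoopConf = new Configuration()
>   // push a predicate down so row groups can still be skipped via stats and
>   // dictionaries ("id" is just a hypothetical column)
>   val pred = FilterApi.eq(FilterApi.intColumn("id"), Integer.valueOf(1))
>   ParquetInputFormat.setFilterPredicate(hadoopConf, pred)
>   // but leave record-level filtering to Spark's own vectorized reader
>   hadoopConf.setBoolean(ParquetInputFormat.RECORD_FILTERING_ENABLED, false)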
> 
> The Parquet community also ran performance tests to make sure there were no 
> regressions from moving to the ByteBuffer read path.
> 
> There is one concern about 1.9.0 that will be addressed in 1.9.1: stats 
> calculations were incorrectly using signed byte order for string comparison. 
> This means that min/max stats can't be used if the data contains (or may 
> contain) UTF8 characters with the msb set. For correctness, 1.9.0 won't 
> return min/max values in those cases, but there is a property to override 
> this behavior for data that doesn't use the affected code points.
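> 
> As a quick illustration of the ordering problem (plain JVM byte comparisons, 
> not Parquet code): a UTF8 byte at or above 0x80 is negative as a signed Java 
> byte, so signed comparison sorts it before plain ASCII, while the correct 
> unsigned order sorts it after.
> 
>   val a = "a".getBytes("UTF-8")(0)       // 0x61
>   val e = "\u00e9".getBytes("UTF-8")(0)  // first byte of "é" is 0xC3
>   println((a & 0xFF) < (e & 0xFF))       // true:  unsigned order, "a" < "é" (correct)
>   println(a < e)                         // false: signed order puts "é" first (the bug)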
> 
> Upgrading to 1.9.0 depends on how the community wants to handle the sort 
> order bug: whether correctness or performance should be the default.
> 
> rb
> 
> On Tue, Nov 1, 2016 at 2:22 AM, Sean Owen <so...@cloudera.com> wrote:
> Yes this came up from a different direction: 
> https://issues.apache.org/jira/browse/SPARK-18140
> 
> I think it's fine to pursue an upgrade to fix these issues. The question is 
> just how well it will play with other components, so it bears some testing and 
> evaluation of the changes from 1.8, but yes, this would be good.
> 
> On Mon, Oct 31, 2016 at 9:07 PM Michael Allman <mich...@videoamp.com> wrote:
> Hi All,
> 
> Is anyone working on updating Spark's Parquet library dep to 1.9? If not, I 
> can at least get started on it and publish a PR.
> 
> Cheers,
> 
> Michael
> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix
> 
