Ryan, do you want to submit a pull request?

On Tue, Nov 1, 2016 at 9:05 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:

> 1.9.0 includes some fixes intended specifically for Spark:
>
> * PARQUET-389: Evaluates push-down predicates for missing columns as
> though they are null. This addresses Spark's work-around that requires
> reading and merging file schemas, even for metastore tables.
> * PARQUET-654: Adds an option to disable record-level predicate push-down,
> but keep row group evaluation. This allows Spark to skip row groups based
> on stats and dictionaries, but implement its own vectorized record
> filtering.
>
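The missing-column behavior from PARQUET-389 can be modeled with a small sketch. This is illustrative only, not parquet-mr code; the function and predicate shape here are hypothetical. The idea is that a column absent from a file's schema is treated as if every value were null, so row-group predicate evaluation can still run instead of forcing Spark to read and merge file schemas:

```python
# Illustrative model of PARQUET-389 semantics (hypothetical names, not
# the parquet-mr API): a column missing from a file is all-null, so
# predicate evaluation at the row-group level remains well-defined.

def can_drop_row_group(predicate, row_group_columns):
    """Return True if the row group cannot contain any matching rows.
    `predicate` is a (op, column, value) tuple; both the tuple shape and
    this function are an assumption made for illustration."""
    op, column, value = predicate
    if column not in row_group_columns:
        # Missing column => every value is null. An equality test
        # against a concrete value can never match null, so the group
        # is safe to drop; an IS NULL test matches, so it must be kept.
        return op == "eq"
    return False  # real stats/dictionary-based checks would go here

# A file written before the column `extra` was added to the table schema:
columns_in_file = {"id", "name"}
print(can_drop_row_group(("eq", "extra", 42), columns_in_file))        # True
print(can_drop_row_group(("is_null", "extra", None), columns_in_file)) # False
```

With this semantics, PARQUET-654's split also makes sense: row-group elimination (as above, plus stats and dictionaries) stays on, while per-record filtering can be turned off so Spark's vectorized reader does that part itself.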
> The Parquet community also evaluated performance to ensure no performance
> regressions from moving to the ByteBuffer read path.
>
> One concern about 1.9.0, to be addressed in 1.9.1, is that stats
> calculations were incorrectly using unsigned byte order for string
> comparison. This means that min/max stats can't be used if the data
> contains (or may contain) UTF8 characters with the msb set. For
> correctness, 1.9.0 won't return the bad min/max values, but there is a
> property to override this behavior for data that doesn't use the
> affected code points.
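The sort-order problem can be illustrated with a short sketch (again not parquet-mr code). Signed and unsigned byte-wise orderings agree on pure-ASCII data but diverge exactly when a byte has the msb set (>= 0x80), which is why min/max stats computed under the wrong order are unusable for such strings:

```python
# Signed vs unsigned byte-wise comparison: they agree on ASCII-only
# data and disagree as soon as any byte has the high bit set.

def unsigned_cmp(a: bytes, b: bytes) -> int:
    return (a > b) - (a < b)  # Python bytes compare unsigned byte-wise

def signed_cmp(a: bytes, b: bytes) -> int:
    # Reinterpret each byte as a signed 8-bit value before comparing.
    sa = [x - 256 if x >= 128 else x for x in a]
    sb = [x - 256 if x >= 128 else x for x in b]
    return (sa > sb) - (sa < sb)

ascii_only = "abc".encode("utf-8")  # all bytes < 0x80
accented = "é".encode("utf-8")      # 0xC3 0xA9, msb set

# Pure ASCII: both orderings agree, so min/max stats are safe.
print(unsigned_cmp(ascii_only, b"abd") == signed_cmp(ascii_only, b"abd"))  # True

# With an msb-set byte the two orderings disagree, so a min/max
# computed under one order is wrong under the other:
print(unsigned_cmp(accented, ascii_only))  # 1  ("é" sorts after "abc")
print(signed_cmp(accented, ascii_only))    # -1 (negative signed bytes sort first)
```

This is also why the override mentioned above is safe only for data restricted to code points whose UTF-8 bytes never set the msb, i.e. 7-bit ASCII.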
>
> Upgrading to 1.9.0 depends on how the community wants to handle the sort
> order bug: whether correctness or performance should be the default.
>
> rb
>
> On Tue, Nov 1, 2016 at 2:22 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> Yes, this came up from a different direction:
>> https://issues.apache.org/jira/browse/SPARK-18140
>>
>> I think it's fine to pursue an upgrade to fix these several issues. The
>> question is just how well it will play with other components, so it bears
>> some testing and evaluation of the changes since 1.8, but yes, this would
>> be good.
>>
>> On Mon, Oct 31, 2016 at 9:07 PM Michael Allman <mich...@videoamp.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> Is anyone working on updating Spark's Parquet library dep to 1.9? If
>>> not, I can at least get started on it and publish a PR.
>>>
>>> Cheers,
>>>
>>> Michael
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
