Re: parquet sync up

Julien Le Dem Tue, 28 Oct 2014 15:40:34 -0700

Attendance:
- Criteo: Mickael (working on the Hive SerDe)
- Apache Drill: Parth (MapR)
- Cloudera: Ryan
- Netflix: Dan, Tonjie, Zhengxiao, Nezih (working on Presto)
- Twitter: Julien


Notes:
- Dealing with Lists and Maps containing nulls.
In the SerDe, Map of array and array of Map have been fixed.
Mickael is currently working on HIVE-6994 => null inside array.
Lists or arrays are modeled with a 3-level representation:
- One optional field for the list itself that can be null
- One repeated field for the items
- One optional field to allow storing nulls in the list
Ryan to send a PR for standardizing the representation of lists.
We need a permissive model for backward compatibility.
We need to make sure there's no ambiguity between user-defined one-field
groups and synthetic extra layers added to represent nulls in lists.
- Vectorized execution. The Netflix and Drill teams are working together.
  The proposed API is based on Presto.
  People interested should review it (Drill, Hive, Spark).
  Parth: we should be able to pass in an allocator (init and cleanup). See
PARQUET-8[7-8]
  Possibly we should use [Byte,...]Buffers instead of arrays.
- Jobs with significant setup time. What has been done to speed them up:
   PARQUET-100: HCatalog => write one file per partition.
   Increasing default parallelism.
These need to be reviewed.
- Java 8 support: Tom from Cloudera is working on it.
- Parquet release:
   - We need to add license headers.
   - plan: release, rename packages, merge byte buffer APIs, merge
2.0-related JIRAs
   - See PARQUET-111: release plan, to be reviewed
- encoding fallback: Julien to add description in PR
- new PRs for Parquet 2.0:
 encoding fallback
 new page formats
 predicate push down on dictionary
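
To make the 3-level list representation from the notes above concrete, here
is a minimal sketch in Python. It is only an illustration: plain dicts stand
in for Parquet groups, and the names encode_list, decode_list, "list" and
"element" are placeholders chosen for this example, not the parquet-mr API or
a representation agreed on in the meeting.

```python
# Illustrative model only: plain dicts stand in for Parquet groups.
# The names encode_list, decode_list, "list" and "element" are placeholders,
# not the parquet-mr API or an agreed-upon schema.
#
# Schema shape being modeled (field names are assumptions):
#   optional group my_list {        <- level 1: the list itself may be null
#     repeated group list {         <- level 2: one entry per item
#       optional int32 element;     <- level 3: an item may be null
#     }
#   }

def encode_list(value):
    """Encode a Python list (or None) into the 3-level nested shape."""
    if value is None:
        # Level 1: the optional outer field is absent, so the list is null.
        return None
    # Level 2: one repeated entry per item.
    # Level 3: each entry wraps an optional element, which may be None.
    return {"list": [{"element": item} for item in value]}


def decode_list(encoded):
    """Invert encode_list, recovering None, [] or a list with nulls."""
    if encoded is None:
        return None
    return [entry["element"] for entry in encoded["list"]]
```

With this shape, a null list, an empty list, and a list holding a single null
all encode differently: None, {"list": []} and {"list": [{"element": None}]}.
Dropping the innermost optional level would make the last case impossible to
store, which is why the extra layer is needed.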

Next sync up: Tuesday, Nov 18, 2014, 10:30 am PST
If you want a reminder, send an email.

On Tue, Oct 28, 2014 at 10:31 AM, Julien Le Dem <[email protected]> wrote:

> Happening now:
> https://plus.google.com/events/c2qu63kvjn2m31gnlq9hcrounh8
>
