Per the previous email thread from the Spark community, it seems they are following this parquet logical type standard: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types
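For reference, the nested-types section linked above defines LIST as a three-level structure (quoted from LogicalTypes.md):

```
<list-repetition> group <name> (LIST) {
  repeated group list {
    <element-repetition> <element-type> element;
  }
}
```

The middle `repeated group` is what carries the repetition, so the outer group can be OPTIONAL (a nullable list) and the inner `element` can be OPTIONAL (nullable elements), independently.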
Should Drill follow the same?

On Tue, Sep 1, 2015 at 10:09 AM, Steven Phillips <[email protected]> wrote:

> No, there is no trick. This is because Drill reads the data as it is
> physically written. At some point, we will add the ability to interpret
> these types according to their logical type. However, that will require
> that the parquet files are written with the correct OriginalType metadata.
> I don't know if Hive or Spark are currently doing this.
>
> On Fri, Aug 28, 2015 at 4:44 PM, Hao Zhu <[email protected]> wrote:
>
> > Thanks, and I do not want to argue about whether Drill's parquet format
> > is valid or whether Spark/Hive is doing the right thing.
> >
> > The current concern is that the nested types in parquet files generated
> > by Spark/Hive cannot be read properly in Drill.
> >
> > Take the previous simple list for example, converted by Spark to a
> > parquet file:
> >
> > 1. Spark can read it as a list:
> >
> > val parquetFile =
> >   sqlContext.parquetFile("/tmp/testjson_spark/part-r-00001.parquet")
> > parquetFile.registerTempTable("parquetFile")
> > val myresult = sqlContext.sql("SELECT * FROM parquetFile limit 10")
> > myresult.map(t => "Name: " + t(0)).collect().foreach(println)
> >
> > Name: ArrayBuffer(1, 2, 3)
> >
> > 2. Drill can only return it like this:
> >
> > select * from dfs.`/tmp/testjson_spark/part-r-00001.parquet`;
> > +------------------------------------------------+
> > | c1                                             |
> > +------------------------------------------------+
> > | {"bag":[{"array":1},{"array":2},{"array":3}]}  |
> > +------------------------------------------------+
> > 1 row selected (0.335 seconds)
> >
> > Is there any trick for reading the list properly in Drill?
> >
> > Thanks,
> > Hao
> >
> > On Fri, Aug 28, 2015 at 4:20 PM, Steven Phillips <[email protected]>
> > wrote:
> >
> > > Both the parquet and Drill internal data models are based on protobuf,
> > > meaning there are required, optional, and repeated fields.
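As a side note on the Drill output quoted above: since Drill exposes the raw physical structure, the `{"bag":[{"array":…}]}` shape can be unwrapped on the client side. A minimal Python sketch of that workaround (the `unwrap_spark_list` helper is my own illustration, not part of Drill):

```python
import json

# Drill returns a Spark-written list in its physical form: a "bag" group
# containing one {"array": value} record per element. This hypothetical
# client-side helper unwraps that structure into a plain list.
def unwrap_spark_list(cell):
    """Convert {"bag": [{"array": v}, ...]} into [v, ...]."""
    return [item["array"] for item in cell["bag"]]

row = json.loads('{"bag":[{"array":1},{"array":2},{"array":3}]}')
print(unwrap_spark_list(row))  # [1, 2, 3]
```

Inside Drill itself, the FLATTEN function combined with member access on the `bag` and `array` fields may offer a similar workaround, depending on the Drill version.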
> > > In this model, repeated fields cannot be null, nor can they have null
> > > elements. The 3-layer nested structure is necessary to represent a
> > > field where the array itself is nullable, as well as the elements of
> > > the array.
> > >
> > > We are going to add nullability to repeated types in Drill, and when
> > > we do so, it would make sense to adopt the same format for
> > > representing them in parquet that other projects have adopted.
> > >
> > > At the same time, I would argue that the fact that Drill writes the
> > > parquet data in a different format than Spark SQL is not a problem.
> > > The format that Drill currently writes is perfectly valid, and other
> > > parquet tools should be able to interpret it just fine. It's just
> > > that this way of writing an array doesn't allow for null values,
> > > which Drill internally doesn't currently support anyway.
> > >
> > > On Fri, Aug 28, 2015 at 11:41 AM, Hao Zhu <[email protected]> wrote:
> > >
> > > > Hi Team,
> > > >
> > > > I want to raise one topic about the standard for Parquet nested
> > > > data types. First, let me show you one simple example.
> > > >
> > > > Sample JSON file:
> > > > {"c1":[1,2,3]}
> > > >
> > > > Using Spark to convert it to parquet, the schema is:
> > > > c1: OPTIONAL F:1
> > > > .bag: REPEATED F:1
> > > > ..array: OPTIONAL INT64 R:1 D:3
> > > >
> > > > Using Drill to create the parquet file, the schema will be:
> > > > c1: REPEATED INT64 R:1 D:1
> > > >
> > > > So this means Drill cannot read the parquet nested data types
> > > > generated by Spark, or even Hive (see DRILL-1999
> > > > <https://issues.apache.org/jira/browse/DRILL-1999>).
> > > > The Spark community's answer to this question about the standard
> > > > for parquet nested data types is in:
> > > > https://www.mail-archive.com/[email protected]/msg35663.html
> > > >
> > > > What is Drill's standpoint on this topic?
> > > > Do we need to make some agreement on the standard for nested data
> > > > types in the Parquet community?
> > > >
> > > > Any comment is welcome.
> > > >
> > > > Thanks,
> > > > Hao
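Stepping back from the thread: Steven's point about why the three-level structure exists can be illustrated without parquet at all. The sketch below (plain Python, my own illustration, not any parquet API) models a bare repeated field, which cannot distinguish a null list from an empty one and cannot carry null elements:

```python
# A bare repeated field (like Drill's "c1: REPEATED INT64") has no slot
# for "the list is null" or "this element is null". Modeling that loss:
def flat_repeated_view(value):
    """What survives a repeated-only encoding (an illustrative assumption:
    a null list becomes no values, and null elements cannot be written)."""
    if value is None:
        return []  # null list is indistinguishable from an empty one
    return [v for v in value if v is not None]  # null elements are lost

# The three-level LIST structure exists precisely so that these cases
# stay distinct; a flat repeated field collapses them.
print(flat_repeated_view(None) == flat_repeated_view([]))  # True: info lost
```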
