Hi,

I'm trying to reproduce the example from the Dremel paper (https://research.google.com/pubs/archive/36632.pdf) in Apache Spark using PySpark. Is this possible at all?
Trying to follow the paper's example as closely as possible, I created this document type:

    from pyspark.sql.types import (
        ArrayType, IntegerType, LongType, StringType, StructField, StructType,
    )

    # Links: two arrays of document ids
    links_type = StructType([
        StructField("Backward", ArrayType(IntegerType(), containsNull=False), nullable=False),
        StructField("Forward", ArrayType(IntegerType(), containsNull=False), nullable=False),
    ])

    # Name.Language: required Code, optional Country
    language_type = StructType([
        StructField("Code", StringType(), nullable=False),
        StructField("Country", StringType()),
    ])

    # Name: a list of languages plus an optional Url
    names_type = StructType([
        StructField("Language", ArrayType(language_type, containsNull=False)),
        StructField("Url", StringType()),
    ])

    # Top-level Document record
    document_type = StructType([
        StructField("DocId", LongType(), nullable=False),
        StructField("Links", links_type, nullable=True),
        StructField("Name", ArrayType(names_type, containsNull=False)),
    ])

But when I store data in Parquet using this type, the resulting Parquet schema is different from the one described in the paper:

    message spark_schema {
      required int64 DocId;
      optional group Links {
        required group Backward (LIST) {
          repeated group list {
            required int32 element;
          }
        }
        required group Forward (LIST) {
          repeated group list {
            required int32 element;
          }
        }
      }
      optional group Name (LIST) {
        repeated group list {
          required group element {
            optional group Language (LIST) {
              repeated group list {
                required group element {
                  required binary Code (UTF8);
                  optional binary Country (UTF8);
                }
              }
            }
            optional binary Url (UTF8);
          }
        }
      }
    }

Moreover, if I create a Parquet file with the schema described in the Dremel paper using the Apache Parquet Java API and try to read it into Apache Spark, I get an exception:

    org.apache.spark.sql.execution.QueryExecutionException: Encounter error while reading parquet files. One possible cause: Parquet column cannot be converted in the corresponding files

Is it possible to create the example schema described in the Dremel paper using Apache Spark, and what is the correct approach to build this example?
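To make the difference between the two layouts concrete, here is a rough sketch in plain Python (no Spark or Parquet library involved). It renders the same logical Document record in two ways: with bare `repeated` fields, as in the paper, and with each repeated field wrapped in a three-level `(LIST)` group, as in the schema Spark produced above. The `DOCUMENT` description, the `render` and `to_message` helpers, and the choice of `optional` for every list wrapper are assumptions made for illustration only; they are not part of any Spark or Parquet API.

```python
# Hypothetical mini-description of the Document record from the Dremel paper.
# Each field is (repetition, name, type); a group's type is a list of fields.
DOCUMENT = [
    ("required", "DocId", "int64"),
    ("optional", "Links", [
        ("repeated", "Backward", "int64"),
        ("repeated", "Forward", "int64"),
    ]),
    ("repeated", "Name", [
        ("repeated", "Language", [
            ("required", "Code", "binary"),
            ("optional", "Country", "binary"),
        ]),
        ("optional", "Url", "binary"),
    ]),
]


def render(fields, three_level, indent=1):
    """Return schema lines for `fields`, optionally wrapping repeated fields in LIST groups."""
    pad = "  " * indent
    lines = []
    for repetition, name, ftype in fields:
        if repetition == "repeated" and three_level:
            # Three-level layout: outer LIST group, repeated "list", then "element".
            # Assumption: the outer wrapper is always rendered as optional here.
            lines.append(f"{pad}optional group {name} (LIST) {{")
            lines.append(f"{pad}  repeated group list {{")
            lines.extend(render([("required", "element", ftype)], three_level, indent + 2))
            lines.append(f"{pad}  }}")
            lines.append(f"{pad}}}")
        elif isinstance(ftype, list):
            # Plain group: keep its repetition and recurse into its children.
            lines.append(f"{pad}{repetition} group {name} {{")
            lines.extend(render(ftype, three_level, indent + 1))
            lines.append(f"{pad}}}")
        else:
            # Primitive leaf field.
            lines.append(f"{pad}{repetition} {ftype} {name};")
    return lines


def to_message(name, fields, three_level):
    """Wrap the rendered fields in a `message` block."""
    return "\n".join([f"message {name} {{"] + render(fields, three_level) + ["}"])


if __name__ == "__main__":
    print(to_message("Document", DOCUMENT, three_level=False))  # paper-style layout
    print()
    print(to_message("Document", DOCUMENT, three_level=True))   # LIST-wrapped layout
```

Printing both variants side by side shows that the leaf columns end up at different paths (for example `Links.Backward` versus `Links.Backward.list.element`), which may be related to the conversion error above, although that is only a guess.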
Regards,
Lubomir Chorbadjiev