The open-source implementation of Dremel is Parquet!

On Mon, Oct 29, 2018, 8:42 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> Hi,
>
> why not just use Dremel?
>
> Regards,
> Gourav Sengupta
>
> On Mon, Oct 29, 2018 at 1:35 PM lchorbadjiev <lubomir.chorbadj...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm trying to reproduce the example from the Dremel paper
>> (https://research.google.com/pubs/archive/36632.pdf) in Apache Spark
>> using pyspark, and I wonder whether it is possible at all.
>>
>> Trying to follow the paper's example as closely as possible, I created
>> this document type:
>>
>> from pyspark.sql.types import *
>>
>> links_type = StructType([
>>     StructField("Backward", ArrayType(IntegerType(), containsNull=False),
>>                 nullable=False),
>>     StructField("Forward", ArrayType(IntegerType(), containsNull=False),
>>                 nullable=False),
>> ])
>>
>> language_type = StructType([
>>     StructField("Code", StringType(), nullable=False),
>>     StructField("Country", StringType())
>> ])
>>
>> names_type = StructType([
>>     StructField("Language", ArrayType(language_type, containsNull=False)),
>>     StructField("Url", StringType()),
>> ])
>>
>> document_type = StructType([
>>     StructField("DocId", LongType(), nullable=False),
>>     StructField("Links", links_type, nullable=True),
>>     StructField("Name", ArrayType(names_type, containsNull=False))
>> ])
>>
>> But when I store data in Parquet using this type, the resulting Parquet
>> schema is different from the one described in the paper:
>>
>> message spark_schema {
>>   required int64 DocId;
>>   optional group Links {
>>     required group Backward (LIST) {
>>       repeated group list {
>>         required int32 element;
>>       }
>>     }
>>     required group Forward (LIST) {
>>       repeated group list {
>>         required int32 element;
>>       }
>>     }
>>   }
>>   optional group Name (LIST) {
>>     repeated group list {
>>       required group element {
>>         optional group Language (LIST) {
>>           repeated group list {
>>             required group element {
>>               required binary Code (UTF8);
>>               optional binary Country (UTF8);
>>             }
>>           }
>>         }
>>         optional binary Url (UTF8);
>>       }
>>     }
>>   }
>> }
>>
>> Moreover, if I create a Parquet file with the schema described in the
>> Dremel paper using the Apache Parquet Java API and try to read it into
>> Apache Spark, I get an exception:
>>
>> org.apache.spark.sql.execution.QueryExecutionException: Encounter error
>> while reading parquet files. One possible cause: Parquet column cannot be
>> converted in the corresponding files
>>
>> Is it possible to create the example schema described in the Dremel paper
>> using Apache Spark, and what is the correct approach to build this example?
>>
>> Regards,
>> Lubomir Chorbadjiev
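
To make the schema difference in the quoted message easier to see, here is a minimal pure-Python sketch. It assumes the gap comes from Parquet's three-level LIST layout (an outer group annotated LIST, a repeated group named "list", and a field named "element"), which is what the spark_schema output above shows, whereas the Dremel paper declares bare repeated fields. The Field class and the to_list_encoding and render helpers are hypothetical names made up for this illustration; they are not part of Spark or Parquet. The example uses int64 for Links.Backward and Links.Forward as in the paper, while the pyspark type above uses IntegerType and therefore yields int32.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """Hypothetical, minimal schema node (not a Parquet or Spark API)."""
    name: str
    repetition: str            # "required", "optional" or "repeated"
    type: str = "group"        # primitive type name, or "group"
    children: List["Field"] = field(default_factory=list)


def to_list_encoding(f: Field, nullable: bool = True) -> Field:
    """Rewrite Dremel-style `repeated` fields as three-level LIST groups."""
    children = [to_list_encoding(c, nullable) for c in f.children]
    if f.repetition != "repeated":
        return Field(f.name, f.repetition, f.type, children)
    # Innermost level: the actual value, assumed non-null (containsNull=False).
    element = Field("element", "required", f.type, children)
    # Middle level: the only field that stays `repeated`.
    middle = Field("list", "repeated", "group", [element])
    # Outer level: carries the LIST annotation and the list's own nullability.
    outer = "optional" if nullable else "required"
    return Field(f.name + " (LIST)", outer, "group", [middle])


def render(f: Field, indent: int = 0) -> List[str]:
    """Render a field as lines of Parquet-like schema text."""
    pad = "  " * indent
    if f.type != "group":
        return [f"{pad}{f.repetition} {f.type} {f.name};"]
    lines = [f"{pad}{f.repetition} group {f.name} {{"]
    for child in f.children:
        lines.extend(render(child, indent + 1))
    lines.append(pad + "}")
    return lines


# The Links group as written in the Dremel paper.
links = Field("Links", "optional", "group", [
    Field("Backward", "repeated", "int64"),
    Field("Forward", "repeated", "int64"),
])

dremel_lines = render(links)
# nullable=False mirrors the nullable=False arrays in the pyspark type above.
spark_like_lines = render(to_list_encoding(links, nullable=False))

print("\n".join(dremel_lines))
print()
print("\n".join(spark_like_lines))
```

The first printout has the shape used in the paper, the second has the shape of the Links group in the spark_schema output above. This only illustrates the structural difference; it does not show how to make Spark write or read the paper's layout.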