Re: dremel paper example schema

Debasish Das Mon, 29 Oct 2018 09:34:05 -0700

The open-source implementation of Dremel's storage format is Parquet!
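
As far as I can tell, the mismatch in the schema quoted below comes from how
arrays are encoded rather than from the data model: the paper marks a field
itself as repeated, while the file written by Spark wraps each array in a
group annotated with LIST. Here is a minimal sketch that only renders the two
forms as text for comparison. The helper names bare_repeated and
three_level_list are made up for illustration and are not part of Spark or
Parquet.

```python
def bare_repeated(name, element_type):
    """Dremel-paper style: the field itself carries the 'repeated' label."""
    return f"repeated {element_type} {name};"


def three_level_list(name, element_type, nullable=True, contains_null=False):
    """LIST-annotated style, as seen in the spark_schema dump quoted below.

    nullable      -> outer group is 'optional' (True) or 'required' (False)
    contains_null -> element is 'optional' (True) or 'required' (False)
    """
    outer = "optional" if nullable else "required"
    inner = "optional" if contains_null else "required"
    return "\n".join([
        f"{outer} group {name} (LIST) {{",
        "  repeated group list {",
        f"    {inner} {element_type} element;",
        "  }",
        "}",
    ])


if __name__ == "__main__":
    # Form used in the paper for Links.Backward.
    print(bare_repeated("Backward", "int64"))
    # Form found in the Parquet file written by Spark for the same field.
    print(three_level_list("Backward", "int32", nullable=False))
```

Called as three_level_list("Backward", "int32", nullable=False), it gives the
Backward group from the spark_schema dump; bare_repeated("Backward", "int64")
gives the form used in the paper.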

On Mon, Oct 29, 2018, 8:42 AM Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:


> Hi,
>
> Why not just use Dremel?
>
> Regards,
> Gourav Sengupta
>
> On Mon, Oct 29, 2018 at 1:35 PM lchorbadjiev <
> lubomir.chorbadj...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm trying to reproduce the example from the Dremel paper
>> (https://research.google.com/pubs/archive/36632.pdf) in Apache Spark using
>> pyspark, and I wonder whether it is possible at all.
>>
>> Trying to follow the paper's example as closely as possible, I created this
>> document type:
>>
>> from pyspark.sql.types import (
>>     ArrayType, IntegerType, LongType, StringType, StructField, StructType,
>> )
>>
>> links_type = StructType([
>>     StructField("Backward", ArrayType(IntegerType(), containsNull=False),
>>                 nullable=False),
>>     StructField("Forward", ArrayType(IntegerType(), containsNull=False),
>>                 nullable=False),
>> ])
>>
>> language_type = StructType([
>>     StructField("Code", StringType(), nullable=False),
>>     StructField("Country", StringType())
>> ])
>>
>> names_type = StructType([
>>     StructField("Language", ArrayType(language_type, containsNull=False)),
>>     StructField("Url", StringType()),
>> ])
>>
>> document_type = StructType([
>>     StructField("DocId", LongType(), nullable=False),
>>     StructField("Links", links_type, nullable=True),
>>     StructField("Name", ArrayType(names_type, containsNull=False))
>> ])
>>
>> But when I store data in Parquet using this type, the resulting Parquet
>> schema is different from the one described in the paper:
>>
>> message spark_schema {
>>   required int64 DocId;
>>   optional group Links {
>>     required group Backward (LIST) {
>>       repeated group list {
>>         required int32 element;
>>       }
>>     }
>>     required group Forward (LIST) {
>>       repeated group list {
>>         required int32 element;
>>       }
>>     }
>>   }
>>   optional group Name (LIST) {
>>     repeated group list {
>>       required group element {
>>         optional group Language (LIST) {
>>           repeated group list {
>>             required group element {
>>               required binary Code (UTF8);
>>               optional binary Country (UTF8);
>>             }
>>           }
>>         }
>>         optional binary Url (UTF8);
>>       }
>>     }
>>   }
>> }
>>
>> Moreover, if I create a Parquet file with the schema described in the Dremel
>> paper using the Apache Parquet Java API and try to read it into Apache Spark,
>> I get an exception:
>>
>> org.apache.spark.sql.execution.QueryExecutionException: Encounter error
>> while reading parquet files. One possible cause: Parquet column cannot be
>> converted in the corresponding files
>>
>> Is it possible to create the example schema described in the Dremel paper
>> using Apache Spark, and what is the correct approach to building this example?
>>
>> Regards,
>> Lubomir Chorbadjiev

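For reference, the two sample records from the paper (r1 and r2) can be
written as plain Python tuples in the field order of document_type from the
quoted message. This is only a sketch: using empty lists for absent repeated
fields is my assumption, and the language_codes helper is hypothetical. The
tuples are meant to be usable as rows for spark.createDataFrame with that
schema, but the Spark call itself is left out here.

```python
# Field order follows the quoted schema:
#   document: (DocId, Links, Name)
#   Links:    (Backward, Forward)
#   Name[i]:  (Language, Url)
#   Language: (Code, Country)

# Record r1 from the paper. Backward is non-nullable in links_type, so an
# absent Backward is written as an empty list (assumption, see above).
r1 = (
    10,
    ([], [20, 40, 60]),
    [
        ([("en-us", "us"), ("en", None)], "http://A"),
        ([], "http://B"),
        ([("en-gb", "gb")], None),
    ],
)

# Record r2 from the paper.
r2 = (
    20,
    ([10, 30], [80]),
    [
        ([], "http://C"),
    ],
)


def language_codes(record):
    """Hypothetical helper: list every Name.Language.Code in one record."""
    _doc_id, _links, names = record
    return [code for languages, _url in names for code, _country in languages]


if __name__ == "__main__":
    print(language_codes(r1))
    print(language_codes(r2))
```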