Hi,

I'm trying to reproduce the example from the Dremel paper
(https://research.google.com/pubs/archive/36632.pdf) in Apache Spark using
PySpark, and I wonder whether it is possible at all.

Following the paper's example as closely as possible, I created this
document type:

from pyspark.sql.types import *

# Links: a pair of repeated integer fields (Backward, Forward).
links_type = StructType([
    StructField("Backward", ArrayType(IntegerType(), containsNull=False), nullable=False),
    StructField("Forward", ArrayType(IntegerType(), containsNull=False), nullable=False),
])

# Language: required Code, optional Country.
language_type = StructType([
    StructField("Code", StringType(), nullable=False),
    StructField("Country", StringType()),
])

# Name: a repeated group of languages plus an optional Url.
names_type = StructType([
    StructField("Language", ArrayType(language_type, containsNull=False)),
    StructField("Url", StringType()),
])

# Document: the top-level record from the paper.
document_type = StructType([
    StructField("DocId", LongType(), nullable=False),
    StructField("Links", links_type, nullable=True),
    StructField("Name", ArrayType(names_type, containsNull=False)),
])
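
For reference, this is roughly how I then write a file with that schema; the
sample row mirrors record r1 from the paper, and the output path is made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One record shaped like r1 from the paper.
r1 = (
    10,                                  # DocId
    ([], [20, 40, 60]),                  # Links: (Backward, Forward)
    [                                    # Name
        ([("en-us", "us"), ("en", None)], "http://A"),
        ([], "http://B"),
        ([("en-gb", "gb")], None),
    ],
)

df = spark.createDataFrame([r1], schema=document_type)
df.write.mode("overwrite").parquet("/tmp/dremel_document.parquet")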

But when I store data in Parquet using this type, the resulting Parquet
schema differs from the one described in the paper:

message spark_schema {
  required int64 DocId;
  optional group Links {
    required group Backward (LIST) {
      repeated group list {
        required int32 element;
      }
    }
    required group Forward (LIST) {
      repeated group list {
        required int32 element;
      }
    }
  }
  optional group Name (LIST) {
    repeated group list {
      required group element {
        optional group Language (LIST) {
          repeated group list {
            required group element {
              required binary Code (UTF8);
              optional binary Country (UTF8);
            }
          }
        }
        optional binary Url (UTF8);
      }
    }
  }
}
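
For comparison, this is the schema from the paper as I understand it, written
in the same message syntax (the paper's string type corresponds to binary
(UTF8) in Parquet):

message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required binary Code (UTF8);
      optional binary Country (UTF8);
    }
    optional binary Url (UTF8);
  }
}

The difference seems to be that Spark wraps every repeated field in the
three-level list structure (a LIST-annotated group containing a repeated
"list" group with an "element" field), while the paper uses plain repeated
fields.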

Moreover, if I create a Parquet file with the schema described in the Dremel
paper using the Apache Parquet Java API and try to read it into Apache Spark,
I get an exception:

org.apache.spark.sql.execution.QueryExecutionException: Encounter error
while reading parquet files. One possible cause: Parquet column cannot be
converted in the corresponding files
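
The read side is nothing special; roughly this (the path is again made up):

# Reading the Java-written file back is what raises the exception above.
df = spark.read.parquet("/tmp/dremel_document_java.parquet")
df.show()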

Is it possible to create the example schema described in the Dremel paper
using Apache Spark, and if so, what is the correct approach to build this
example?

Regards,
Lubomir Chorbadjiev


