Could you please provide the full stack trace of the exception? And what's the Git commit hash of the version you were using?

Cheng

On 7/24/15 6:37 AM, Jerry Lam wrote:
Hi Cheng,

I ran into an issue related to ENUM when I tried to use filter push-down. I'm using Spark 1.5.0 (which contains fixes for Parquet filter push-down). The exception is the following:

java.lang.IllegalArgumentException: FilterPredicate column: item's declared type (org.apache.parquet.io.api.Binary) does not match the schema found in file metadata. Column item is of type: FullTypeDescriptor(PrimitiveType: BINARY, OriginalType: ENUM)
Valid types for this column are: null

Is it because Spark does not recognize the ENUM type in Parquet?

Best Regards,

Jerry



On Wed, Jul 22, 2015 at 12:21 AM, Cheng Lian <lian.cs....@gmail.com <mailto:lian.cs....@gmail.com>> wrote:

    On 7/22/15 9:03 AM, Ankit wrote:

    Thanks a lot Cheng. So it seems even in Spark 1.3 and 1.4,
    Parquet ENUMs were treated as strings in Spark SQL, right? So does
    this mean partitioning for ENUMs already works in previous
    versions too, since they are just treated as strings?

    It’s a little bit complicated. A Thrift/Avro/ProtoBuf |ENUM| value
    is represented as a |BINARY| annotated with the original type |ENUM|
    in Parquet. For example, an optional |ENUM| field |e| is
    translated to something like |optional BINARY e (ENUM)| in
    Parquet, and the underlying data is always a UTF8 string of the
    |ENUM| name. However, the Parquet original type |ENUM| is not
    documented, so Spark 1.3 and 1.4 don’t recognize the |ENUM|
    annotation and just see it as a normal |BINARY|. (I didn’t even
    notice the existence of |ENUM| in Parquet before PR #7048…)
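
    For concreteness, the Parquet schema of a record containing such a
    field might look like the sketch below (the message name is
    illustrative, not from any real file):

```
message Event {
  optional binary e (ENUM);
}
```

    A reader that ignores the |(ENUM)| annotation, as Spark 1.3/1.4 do,
    simply sees |optional binary e|.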

    On the other hand, Spark SQL has a boolean option named
    |spark.sql.parquet.binaryAsString|. When this option is set to
    |true|, all Parquet |BINARY| values are interpreted as and
    converted to UTF8 strings. This option was originally introduced
    to work around a bug in Hive, which writes strings as plain
    Parquet |BINARY| values without the proper |UTF8| annotation.

    So, by using
    |sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")|
    in Scala/Java/Python, or |SET
    spark.sql.parquet.binaryAsString=true| in SQL, you can read those
    |ENUM| values as plain UTF8 strings.
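
    A minimal sketch of that workaround in a Scala spark-shell session
    (the path "events.parquet" and the column name "e" are illustrative,
    and a |SQLContext| named |sqlContext| is assumed to be in scope):

```scala
// Tell Spark SQL to treat all Parquet BINARY columns as UTF8 strings,
// which also covers columns carrying the undocumented ENUM annotation.
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

// The ENUM-backed column now surfaces as a plain StringType column,
// so ordinary string comparisons work on it.
val df = sqlContext.read.parquet("events.parquet")
df.filter(df("e") === "NON").show()
```

    Note that |sqlContext.read.parquet| is the Spark 1.4+ API; on 1.3
    you would use |sqlContext.parquetFile| instead.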


    Also, is there a good way to verify that the partitioning is
    being used? I tried "explain" like (where data is partitioned by
    "type" column)

    scala> ev.filter("type = 'NON'").explain
    == Physical Plan ==
    Filter (type#4849 = NON)
     PhysicalRDD [...cols..], MapPartitionsRDD[103] at map at
    newParquet.scala:573

    but that is the same even with non partitioned data.

    Do you mean how to verify whether partition pruning is effective?
    You should be able to see log lines like this:

        15/07/22 11:14:35 INFO DataSourceStrategy: Selected 1
        partitions out of 3, pruned 66.66666666666667% partitions.
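
    The percentage in that log line is just the fraction of partitions
    that were skipped; a small sketch of the arithmetic (the function
    name is mine, not Spark's):

```scala
// Fraction of partitions pruned, as a percentage. With 1 of 3
// partitions selected, 2 of 3 (about 66.67%) are pruned, matching
// the log line above.
def prunedPercent(selected: Int, total: Int): Double =
  (total - selected) * 100.0 / total

println(prunedPercent(1, 3))
```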



    On Tue, Jul 21, 2015 at 4:35 PM, Cheng Lian
    <lian.cs....@gmail.com <mailto:lian.cs....@gmail.com>> wrote:

        Parquet support for Thrift/Avro/ProtoBuf ENUM types was just
        added to the master branch:
        https://github.com/apache/spark/pull/7048

        ENUM is actually not in the Parquet format spec; that's why we
        didn't support it in the first place. Basically, ENUMs are
        always treated as UTF8 strings in Spark SQL now.

        Cheng

        On 7/22/15 3:41 AM, ankits wrote:

            Hi, I am using a custom build of Spark 1.4 with the
            Parquet dependency upgraded to 1.7. I have Thrift data
            encoded with Parquet that I want to partition by a column
            of type ENUM. The Spark programming guide says partition
            discovery is only supported for string and numeric
            columns, so it seems partition discovery won't work out
            of the box here.

            Is there any workaround that will allow me to partition
            by ENUMs? Will Hive partitioning help here? I am
            unfamiliar with Hive and how it plays into Parquet,
            Thrift, and Spark, so I would appreciate any pointers in
            the right direction. Thanks.



            --
            View this message in context:
            http://apache-spark-user-list.1001560.n3.nabble.com/Partition-parquet-data-by-ENUM-column-tp23939.html
            Sent from the Apache Spark User List mailing list archive
            at Nabble.com.

            




