Re: Partition parquet data by ENUM column

2015-07-24 Thread Cheng Lian
Could you please provide the full stack trace of the exception? And 
what's the Git commit hash of the version you were using?


Cheng

On 7/24/15 6:37 AM, Jerry Lam wrote:

Hi Cheng,

I ran into ENUM-related issues when I tried to use filter push-down. I'm 
using Spark 1.5.0 (which contains fixes for Parquet filter push-down). 
The exception is the following:


java.lang.IllegalArgumentException: FilterPredicate column: item's 
declared type (org.apache.parquet.io.api.Binary) does not match the 
schema found in file metadata. Column item is of type: 
FullTypeDescriptor(PrimitiveType: BINARY, OriginalType: ENUM)

Valid types for this column are: null

Is it because Spark does not recognize the ENUM type in Parquet?

Best Regards,

Jerry



On Wed, Jul 22, 2015 at 12:21 AM, Cheng Lian <lian.cs@gmail.com> wrote:


On 7/22/15 9:03 AM, Ankit wrote:


Thanks a lot Cheng. So it seems that even in Spark 1.3 and 1.4,
Parquet ENUMs were treated as strings in Spark SQL, right? Does this
mean partitioning for enums already works in previous versions too,
since they are just treated as strings?


It’s a little bit complicated. A Thrift/Avro/ProtoBuf |ENUM| value
is represented as a |BINARY| annotated with the original type |ENUM|
in Parquet. For example, an optional |ENUM| field |e| is
translated to something like |optional BINARY e (ENUM)| in
Parquet, and the underlying data is always the UTF8 string of the
|ENUM| name. However, the Parquet original type |ENUM| is not
documented, so Spark 1.3 and 1.4 don’t recognize the |ENUM|
annotation and just see it as a normal |BINARY|. (I didn’t even
notice the existence of |ENUM| in Parquet before PR #7048…)

On the other hand, Spark SQL has a boolean option named
|spark.sql.parquet.binaryAsString|. When this option is set to
|true|, all Parquet |BINARY| values are assumed to be UTF8 strings
and converted accordingly. The original purpose of this option is
to work around a bug in Hive, which writes strings as plain Parquet
|BINARY| values without the proper |UTF8| annotation.

That said, by calling
|sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")|
in Scala/Java/Python, or running |SET
spark.sql.parquet.binaryAsString=true| in SQL, you may read those
|ENUM| values as plain UTF8 strings.
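[Editor's note: to make the effect of that option concrete, here is a small
self-contained sketch in plain Python — no Spark required, and the column
values are made up — of what flipping |binaryAsString| effectively changes
on read:]

```python
# Hypothetical raw values of an ENUM-annotated BINARY column: Parquet stores
# the UTF-8 bytes of each enum *name*.
raw_column = [b"NON", b"STANDARD", b"NON"]

def read_binary_column(values, binary_as_string=False):
    # Option off: BINARY surfaces as raw bytes.
    # Option on: every BINARY value is decoded as a UTF8 string.
    if binary_as_string:
        return [v.decode("utf-8") for v in values]
    return list(values)

print(read_binary_column(raw_column, binary_as_string=True))
# → ['NON', 'STANDARD', 'NON']
```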



Also, is there a good way to verify that the partitioning is
being used? I tried explain (where the data is partitioned by the
type column):

scala> ev.filter("type = 'NON'").explain
== Physical Plan ==
Filter (type#4849 = NON)
 PhysicalRDD [...cols..], MapPartitionsRDD[103] at map at
newParquet.scala:573

but that is the same even with non-partitioned data.


Do you mean how to verify whether partition pruning is effective?
You should be able to see log lines like this:

15/07/22 11:14:35 INFO DataSourceStrategy: Selected 1
partitions out of 3, pruned 66.67% partitions.
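[Editor's note: the percentage in that log line is just arithmetic over the
partition counts; a quick sketch, with the numbers taken from the log line
above:]

```python
# "Selected 1 partitions out of 3" => pruned fraction is (total - selected) / total.
total_partitions = 3
selected_partitions = 1
pruned_pct = (total_partitions - selected_partitions) / total_partitions * 100
print(f"pruned {pruned_pct:.2f}% partitions")
# → pruned 66.67% partitions
```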




On Tue, Jul 21, 2015 at 4:35 PM, Cheng Lian
<lian.cs@gmail.com> wrote:

Parquet support for Thrift/Avro/ProtoBuf ENUM types was just
added to the master branch:
https://github.com/apache/spark/pull/7048

ENUM types are actually not in the Parquet format spec;
that's why we didn't have them in the first place. Basically,
ENUMs are always treated as UTF8 strings in Spark SQL now.

Cheng

On 7/22/15 3:41 AM, ankits wrote:

Hi, I am using a custom build of Spark 1.4 with the
Parquet dependency upgraded to 1.7. I have Thrift data
encoded with Parquet that I want to partition by a column
of type ENUM. The Spark programming guide says partition
discovery is only supported for string and numeric
columns, so it seems partition discovery won't work out
of the box here.

Is there any workaround that will allow me to partition
by ENUMs? Will Hive partitioning help here? I am
unfamiliar with Hive and how it plays into Parquet,
Thrift, and Spark, so I would appreciate any pointers in
the right direction. Thanks.



--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Partition-parquet-data-by-ENUM-column-tp23939.html
Sent from the Apache Spark User List mailing list archive
at Nabble.com.


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org












Re: Partition parquet data by ENUM column

2015-07-21 Thread Cheng Lian

On 7/22/15 9:03 AM, Ankit wrote:

Thanks a lot Cheng. So it seems that even in Spark 1.3 and 1.4, Parquet 
ENUMs were treated as strings in Spark SQL, right? Does this mean 
partitioning for enums already works in previous versions too, since 
they are just treated as strings?


It’s a little bit complicated. A Thrift/Avro/ProtoBuf |ENUM| value is 
represented as a |BINARY| annotated with the original type |ENUM| in 
Parquet. For example, an optional |ENUM| field |e| is translated to 
something like |optional BINARY e (ENUM)| in Parquet, and the underlying 
data is always the UTF8 string of the |ENUM| name. However, the Parquet 
original type |ENUM| is not documented, so Spark 1.3 and 1.4 don’t 
recognize the |ENUM| annotation and just see it as a normal |BINARY|. (I 
didn’t even notice the existence of |ENUM| in Parquet before PR #7048…)


On the other hand, Spark SQL has a boolean option named 
|spark.sql.parquet.binaryAsString|. When this option is set to |true|, 
all Parquet |BINARY| values are assumed to be UTF8 strings and converted 
accordingly. The original purpose of this option is to work around a 
bug in Hive, which writes strings as plain Parquet |BINARY| values 
without the proper |UTF8| annotation.


That said, by calling 
|sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")| in 
Scala/Java/Python, or running |SET spark.sql.parquet.binaryAsString=true| 
in SQL, you may read those |ENUM| values as plain UTF8 strings.




Also, is there a good way to verify that the partitioning is being 
used? I tried explain (where the data is partitioned by the type column):


scala> ev.filter("type = 'NON'").explain
== Physical Plan ==
Filter (type#4849 = NON)
 PhysicalRDD [...cols..], MapPartitionsRDD[103] at map at 
newParquet.scala:573


but that is the same even with non-partitioned data.


Do you mean how to verify whether partition pruning is effective? You 
should be able to see log lines like this:


   15/07/22 11:14:35 INFO DataSourceStrategy: Selected 1 partitions out
   of 3, pruned 66.67% partitions.




On Tue, Jul 21, 2015 at 4:35 PM, Cheng Lian <lian.cs@gmail.com> wrote:


Parquet support for Thrift/Avro/ProtoBuf ENUM types was just added
to the master branch: https://github.com/apache/spark/pull/7048

ENUM types are actually not in the Parquet format spec; that's why
we didn't have them in the first place. Basically, ENUMs are always
treated as UTF8 strings in Spark SQL now.

Cheng

On 7/22/15 3:41 AM, ankits wrote:

Hi, I am using a custom build of Spark 1.4 with the Parquet
dependency upgraded to 1.7. I have Thrift data encoded with
Parquet that I want to partition by a column of type ENUM. The
Spark programming guide says partition discovery is only
supported for string and numeric columns, so it seems partition
discovery won't work out of the box here.

Is there any workaround that will allow me to partition by
ENUMs? Will Hive partitioning help here? I am unfamiliar with
Hive and how it plays into Parquet, Thrift, and Spark, so I
would appreciate any pointers in the right direction. Thanks.





Re: Partition parquet data by ENUM column

2015-07-21 Thread Ankit
Thanks a lot Cheng. So it seems that even in Spark 1.3 and 1.4, Parquet ENUMs
were treated as strings in Spark SQL, right? Does this mean partitioning
for enums already works in previous versions too, since they are just
treated as strings?

Also, is there a good way to verify that the partitioning is being used? I
tried explain (where the data is partitioned by the type column):

scala> ev.filter("type = 'NON'").explain
== Physical Plan ==
Filter (type#4849 = NON)
 PhysicalRDD [...cols..], MapPartitionsRDD[103] at map at
newParquet.scala:573

but that is the same even with non-partitioned data.


On Tue, Jul 21, 2015 at 4:35 PM, Cheng Lian <lian.cs@gmail.com> wrote:

 Parquet support for Thrift/Avro/ProtoBuf ENUM types was just added to the
 master branch: https://github.com/apache/spark/pull/7048

 ENUM types are actually not in the Parquet format spec; that's why we
 didn't have them in the first place. Basically, ENUMs are always treated as
 UTF8 strings in Spark SQL now.

 Cheng

 On 7/22/15 3:41 AM, ankits wrote:

 Hi, I am using a custom build of Spark 1.4 with the Parquet dependency
 upgraded to 1.7. I have Thrift data encoded with Parquet that I want to
 partition by a column of type ENUM. The Spark programming guide says
 partition discovery is only supported for string and numeric columns, so
 it seems partition discovery won't work out of the box here.

 Is there any workaround that will allow me to partition by ENUMs? Will
 Hive partitioning help here? I am unfamiliar with Hive and how it plays
 into Parquet, Thrift, and Spark, so I would appreciate any pointers in
 the right direction. Thanks.









Partition parquet data by ENUM column

2015-07-21 Thread ankits
Hi, I am using a custom build of Spark 1.4 with the Parquet dependency
upgraded to 1.7. I have Thrift data encoded with Parquet that I want to
partition by a column of type ENUM. The Spark programming guide says
partition discovery is only supported for string and numeric columns, so
it seems partition discovery won't work out of the box here.

Is there any workaround that will allow me to partition by ENUMs? Will
Hive partitioning help here? I am unfamiliar with Hive and how it plays
into Parquet, Thrift, and Spark, so I would appreciate any pointers in
the right direction. Thanks.
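[Editor's note: partition discovery works by parsing "column=value" segments
out of the directory layout, e.g. events/type=NON/part-00000.parquet. A rough
sketch of that inference follows — simplified and with made-up paths; real
Spark also infers value types, which is why only string and numeric columns
are supported:]

```python
# Simplified sketch of Hive-style partition discovery: each "key=value" path
# segment under the table root becomes a partition column (paths here are
# hypothetical, for illustration only).
def infer_partition_columns(file_path):
    partitions = {}
    for segment in file_path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = value
    return partitions

print(infer_partition_columns("events/type=NON/year=2015/part-00000.parquet"))
# → {'type': 'NON', 'year': '2015'}
```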






Re: Partition parquet data by ENUM column

2015-07-21 Thread Cheng Lian
Parquet support for Thrift/Avro/ProtoBuf ENUM types was just added to 
the master branch: https://github.com/apache/spark/pull/7048


ENUM types are actually not in the Parquet format spec; that's why we 
didn't have them in the first place. Basically, ENUMs are always treated 
as UTF8 strings in Spark SQL now.


Cheng

On 7/22/15 3:41 AM, ankits wrote:

Hi, I am using a custom build of Spark 1.4 with the Parquet dependency
upgraded to 1.7. I have Thrift data encoded with Parquet that I want to
partition by a column of type ENUM. The Spark programming guide says
partition discovery is only supported for string and numeric columns, so
it seems partition discovery won't work out of the box here.

Is there any workaround that will allow me to partition by ENUMs? Will
Hive partitioning help here? I am unfamiliar with Hive and how it plays
into Parquet, Thrift, and Spark, so I would appreciate any pointers in
the right direction. Thanks.


