RE: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-09-01 Thread chutium
Thanks a lot, Hao, that finally solved the problem. The changes to CSVSerDe are here:
https://github.com/chutium/csv-serde/commit/22c667c003e705613c202355a8791978d790591e
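
For reference, the core of the fix is to stop hard-coding
javaStringObjectInspector and instead derive one ObjectInspector per column
from the types declared in the DDL. A minimal sketch of that approach (not
the literal commit; the property names "columns" / "columns.types" are the
standard ones Hive passes to SerDe.initialize(), everything else here is
illustrative):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;

public class TypedRowInspectorSketch {
  public static ObjectInspector buildRowInspector(Properties tbl) {
    // Column names and declared types as passed to SerDe.initialize().
    List<String> columnNames =
        Arrays.asList(tbl.getProperty("columns").split(","));
    List<TypeInfo> columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(
        tbl.getProperty("columns.types"));

    // One inspector per declared type instead of string for everything.
    List<ObjectInspector> inspectors = new ArrayList<ObjectInspector>();
    for (TypeInfo typeInfo : columnTypes) {
      inspectors.add(
          TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(typeInfo));
    }
    return ObjectInspectorFactory.getStandardStructObjectInspector(
        columnNames, inspectors);
  }
}

The deserialize() side has to match, of course: once the inspectors declare
float or timestamp, the objects handed back per field must actually be
Float/Timestamp instances, not Strings.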

By the way, ADD JAR never seems to work for us in Spark's Hive support or in
the hive-thriftserver, so we build Spark with libraryDependencies += csv-serde ...

Or maybe we should try adding it to SPARK_CLASSPATH?






Re: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-08-31 Thread chutium
Hi Cheng, thank you very much for helping me finally uncover the secret
behind this magic...

Actually, we defined this external table with:
SID STRING
REQUEST_ID STRING
TIMES_DQ TIMESTAMP
TOTAL_PRICE FLOAT
...

Using desc ext_fullorders, the table is shown only as:
[# col_name data_type   comment ]
...
[times_dq   string  from deserializer   ]
[total_price    string  from deserializer   ]
...
because, as you said, CSVSerde sets all field object inspectors to
javaStringObjectInspector, which is also why every column carries the
"from deserializer" comment.
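
(For context, that behavior boils down to something like the following; a
paraphrase of the linked CSVSerde source, not a verbatim copy:)

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class StringOnlyInspectorsSketch {
  // Every column gets the same string inspector, regardless of the
  // type declared in the DDL.
  public static List<ObjectInspector> buildInspectors(int numCols) {
    List<ObjectInspector> columnOIs = new ArrayList<ObjectInspector>(numCols);
    for (int i = 0; i < numCols; i++) {
      columnOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    }
    return columnOIs;
  }
}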

But the StorageDescriptor holds the real user-defined types; with desc
extended ext_fullorders we can see that its sd:StorageDescriptor is:
FieldSchema(name:times_dq, type:timestamp, comment:null),
FieldSchema(name:total_price, type:float, comment:null)

And Spark's HiveContext reads the schema info from this StorageDescriptor:
https://github.com/apache/spark/blob/7e191fe29bb09a8560cd75d453c4f7f662dff406/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L316
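
(To make that concrete: the declared types can be read straight off the
metastore's StorageDescriptor; a hypothetical standalone snippet, where the
database name "default" is an assumption:)

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.Table;

public class ShowDeclaredTypes {
  public static void main(String[] args) throws Exception {
    HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
    Table table = client.getTable("default", "ext_fullorders");
    for (FieldSchema field : table.getSd().getCols()) {
      // Prints the declared types, e.g. times_dq -> timestamp,
      // total_price -> float, regardless of the SerDe's inspectors.
      System.out.println(field.getName() + " -> " + field.getType());
    }
    client.close();
  }
}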

So in the SchemaRDD, the fields in each Row were filled with strings (via
fillObject; all values were retrieved from CSVSerDe with
javaStringObjectInspector),

but Spark believes some of them are float or timestamp (the schema info comes
from sd:StorageDescriptor).

crazy...

And sorry for the update on the weekend...

A little more about how I found this problem and why it is a trouble for us:

We use the new Spark Thrift server; querying normal managed Hive tables works
fine,

but when we try to access external tables with a custom SerDe such as this
CSVSerDe, we get a ClassCastException like:
java.lang.ClassCastException: java.lang.String cannot be cast to
java.lang.Float

The reason is here:
https://github.com/apache/spark/blob/d94a44d7caaf3fe7559d9ad7b10872fa16cf81ca/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/server/SparkSQLOperationManager.scala#L104-L105

Here Spark's Thrift server tries to get a float value out of the Spark Row,
because in the schema info (sd:StorageDescriptor) this column is a float, but
the field in the Row was actually filled with a string value...
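
In other words, the failure reduces to something like this (a deliberately
simplified reduction, not the actual Spark code):

public class CastReduction {
  public static void main(String[] args) {
    // The row slot was filled via javaStringObjectInspector...
    Object[] sparkRow = new Object[] { "8.99" };
    // ...but the schema (from sd:StorageDescriptor) says float, so the
    // reader does the equivalent of this cast and fails at runtime:
    float totalPrice = (Float) sparkRow[0];
    // -> java.lang.ClassCastException: java.lang.String cannot be cast
    //    to java.lang.Float
    System.out.println(totalPrice);
  }
}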






RE: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-08-31 Thread Cheng, Hao
Yes, the root cause is that the output ObjectInspector in the SerDe
implementation doesn't reflect the real type info.

Hive actually provides an API for exactly this mapping:
TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(TypeInfo).

You probably need to update the code at 
https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java#L60.
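
For illustration, a minimal standalone use of that API (a hypothetical
snippet, not taken from CSVSerde itself):

import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;

public class TypeInfoMappingDemo {
  public static void main(String[] args) {
    // Map a declared Hive type to a matching ObjectInspector.
    TypeInfo floatType = TypeInfoUtils.getTypeInfoFromTypeString("float");
    ObjectInspector oi =
        TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(floatType);
    // For "float" this yields a float inspector rather than
    // javaStringObjectInspector, so the inspector agrees with the type
    // recorded in the table's StorageDescriptor.
    System.out.println(oi.getTypeName()); // prints: float
  }
}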




Re: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-08-27 Thread Cheng Lian
I believe in your case, the “magic” happens in TableReader.fillObject
https://github.com/apache/spark/blob/4fa2fda88fc7beebb579ba808e400113b512533b/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala.
Here we unwrap the field value according to the object inspector of that
field. It seems that somehow a FloatObjectInspector is specified for the
total_price field. I don’t think CSVSerde is responsible for this, since it
sets all field object inspectors to javaStringObjectInspector (here
https://github.com/ogrodnek/csv-serde/blob/f315c1ae4b21a8288eb939e7c10f3b29c1a854ef/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java#L59-L61
).
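
To make the unwrap step concrete, here is a simplified reduction of what
happens when a float inspector meets a String-backed field (a standalone
sketch, not the actual fillObject code):

import org.apache.hadoop.hive.serde2.objectinspector.primitive.FloatObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class UnwrapSketch {
  public static void main(String[] args) {
    // fillObject-style unwrapping: ask the field's inspector for the value.
    FloatObjectInspector floatOI =
        PrimitiveObjectInspectorFactory.javaFloatObjectInspector;
    Object fieldValue = "8.99"; // what CSVSerde actually hands back
    // The Java float inspector casts to Float internally, so this throws
    // java.lang.ClassCastException for a String-backed field:
    float f = floatOI.get(fieldValue);
    System.out.println(f);
  }
}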

Which version of Spark SQL are you using? If you are using a snapshot
version, please provide the exact Git commit hash. Thanks!





Re: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-08-26 Thread chutium
Oops, I tried on a managed table; there the column types are not changed.

So it is most likely due to the SerDe lib CSVSerDe
(https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java#L123)
or maybe the CSVReader from opencsv?

But if the columns are defined as string, then no matter what type the custom
SerDe or CSVReader returns, the values should be cast to string in the end,
right?

Why not use the schema from the Hive metadata directly?


