Is there any automatic data type conversion or detection in HiveContext?

All columns of a table are defined as string in the Hive metastore. One column is total_price, with values like 123.45, and this column is recognized as data type float in HiveContext.

Is this a feature or a bug? It really surprised me. How is it implemented? If it is a feature, can I turn it off? I want to get a schemaRDD with exactly the same data types as defined in the Hive metadata. I know the column total_price should contain float values, but they are not guaranteed to. What happens if there is a broken line in my huge CSV file, or if some total_price is 9,123.45 or $123.45 or something similar?

==============================================================

Here is an example from our environment:

MapR v3 cluster, newest Spark GitHub master, cloned yesterday
built with
sbt/sbt -Dhadoop.version=1.0.3-mapr-3.0.3 -Phive assembly
hive-site.xml configured

==============================================================

spark-shell scripts:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("use our_live_db")
hiveContext.sql("desc formatted et_fullorders").collect.foreach(println)
......
14/08/26 15:47:09 INFO SparkContext: Job finished: collect at SparkPlan.scala:85, took 0.0305408 s
[# col_name data_type comment ]
[ ]
[sid string from deserializer ]
[request_id string from deserializer ]
[*times_dq string* from deserializer ]
[*total_price string* from deserializer ]
[order_id string from deserializer ]
[ ]
[# Partition Information ]
[# col_name data_type comment ]
[ ]
[wt_date string None ]
[country string None ]
[ ]
[# Detailed Table Information ]
[Database: our_live_db ]
[Owner: client02 ]
[CreateTime: Fri Jan 31 12:23:40 CET 2014 ]
[LastAccessTime: UNKNOWN ]
[Protect Mode: None ]
[Retention: 0 ]
[Location: maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders ]
[Table Type: EXTERNAL_TABLE ]
[Table Parameters: ]
[ EXTERNAL TRUE ]
[ transient_lastDdlTime 1391167420 ]
[ ]
[# Storage Information ]
[SerDe Library: com.bizo.hive.serde.csv.CSVSerde ]
[InputFormat: org.apache.hadoop.mapred.TextInputFormat ]
[OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat ]
[Compressed: No ]
[Num Buckets: -1 ]
[Bucket Columns: [] ]
[Sort Columns: [] ]
[Storage Desc Params: ]
[ separatorChar ; ]
[ serialization.format 1 ]

Then, create a schemaRDD from this table:

val result = hiveContext.sql("select sid, order_id, total_price, times_dq from et_fullorders where wt_date='2014-04-14' and country='uk' limit 5")

OK, now printSchema:

scala> result.printSchema
root
 |-- sid: string (nullable = true)
 |-- order_id: string (nullable = true)
 |-- *total_price: float* (nullable = true)
 |-- *times_dq: timestamp* (nullable = true)

total_price was STRING in the metastore but is now FLOAT in the schemaRDD, and times_dq is now TIMESTAMP. Really strange and surprising.

Even stranger:

scala> result.map(row => row.getString(2)).collect.foreach(println)

I got

240.00
45.83
21.67
95.83
120.83

but

scala> result.map(row => row.getFloat(2)).collect.foreach(println)
14/08/26 16:01:24 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 8)
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Float
        at scala.runtime.BoxesRunTime.unboxToFloat(BoxesRunTime.java:114)

So the schema reports float, but the row still holds a String.

==============================================================

By the way, the files in this external table are gzipped CSV files:

14/08/26 15:49:56 INFO HadoopRDD: Input split: maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders/wt_date=2014-04-14/country=uk/getFullOrders_2014-04-14.csv.gz:0+16990

and the data in them:

scala> result.collect.foreach(println)
[5000000001402123123,12344000123454,240.00,2014-04-14 00:03:49.082000]
[5000000001402110123,12344000123455,45.83,2014-04-14 00:04:13.639000]
[5000000001402129123,12344000123458,21.67,2014-04-14 00:09:12.276000]
[5000000001402092123,12344000132457,95.83,2014-04-14 00:09:42.228000]
[5000000001402135123,12344000123460,120.83,2014-04-14 00:12:44.742000]

We use CSVSerDe:
https://drone.io/github.com/ogrodnek/csv-serde/files/target/csv-serde-1.1.2-0.11.0-all.jar

Maybe this is the reason? But then why are the 1st and 2nd columns not recognized as bigint or double or something?

Thanks for any ideas.
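
P.S. To make the worry about malformed values concrete, here is a minimal sketch in plain Scala (no Spark needed) of the parsing I would do myself if total_price stayed a string. PriceParsing and parsePrice are hypothetical names of my own, not part of Spark, Hive, or the CSV SerDe, and the sample strings other than 240.00 and 45.83 are only illustrative.

```scala
import scala.util.Try

object PriceParsing {
  // Hypothetical helper (not a Spark or Hive API): parse a raw CSV price
  // string, returning None for null, empty, or anything String.toFloat rejects.
  def parsePrice(raw: String): Option[Float] =
    Option(raw)
      .map(_.trim)
      .filter(_.nonEmpty)
      .flatMap(s => Try(s.toFloat).toOption)

  def main(args: Array[String]): Unit = {
    // Two values from the table above plus the malformed cases I am worried about.
    val samples = Seq("240.00", "45.83", "9,123.45", "$123.45", "", null)
    samples.foreach(s => println(s"$s -> ${parsePrice(s)}"))
  }
}
```

In spark-shell, something like result.map(row => PriceParsing.parsePrice(row.getString(2))) should then give None for broken rows instead of failing the task, given that getString(2) works on this column as shown above.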
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/HiveContext-schemaRDD-printSchema-get-different-dataTypes-feature-or-a-bug-really-strange-and-surpri-tp8034.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.