Is there any data type auto-conversion or auto-detection in HiveContext?

All columns of a table are defined as string in the Hive metastore. One column is
total_price, with values like 123.45, and this column is recognized as
data type Float in HiveContext.

Is this a feature or a bug? It really surprised me. How is it implemented?
If it is a feature, can I turn it off? I want to get a SchemaRDD with exactly
the same data types as defined in the Hive metadata. I know the column
total_price should contain float values, but that is not guaranteed. What
happens if there is a broken line in my huge CSV file? Or maybe some
total_price is 9,123.45 or $123.45 or
example for this in our env.

MapR v3 cluster, newest Spark GitHub master, cloned yesterday, built with

    sbt/sbt -Dhadoop.version=1.0.3-mapr-3.0.3 -Phive assembly

hive-site.xml
scripts:

    val hiveContext = new
    our_live_db")
    hiveContext.sql("desc formatted et_fullorders").collect.foreach(println)
    ......
    14/08/26 15:47:09 INFO SparkContext: Job finished: collect at SparkPlan.scala:85, took 0.0305408 s
    [# col_name             data_type               comment             ]
    [                ]
    [sid                    string                  from deserializer   ]
    [request_id             string                  from deserializer   ]
    [*times_dq               string*                  from deserializer   ]
    [*total_price            string*                  from deserializer   ]
    [order_id               string                  from deserializer   ]
    [                ]
    [# Partition Information                 ]
    [# col_name             data_type               comment             ]
    [                ]
    [wt_date                string                  None                ]
    [country                string                  None                ]
    [                ]
    [# Detailed Table Information            ]
    [Database:              our_live_db              ]
    [Owner:                 client02                 ]
    [CreateTime:            Fri Jan 31 12:23:40 CET 2014     ]
    [LastAccessTime:        UNKNOWN                  ]
    [Protect Mode:          None                     ]
    [Retention:             0                        ]
    [Location:              ]
    [Table Type:            EXTERNAL_TABLE           ]
    [Table Parameters:      ]
    [       EXTERNAL                TRUE                ]
    [       transient_lastDdlTime   1391167420          ]
    [                ]
    [# Storage Information           ]
    [SerDe Library:         com.bizo.hive.serde.csv.CSVSerde         ]
    [InputFormat:           org.apache.hadoop.mapred.TextInputFormat ]
    [OutputFormat:          ]
    [Compressed:            No                       ]
    [Num Buckets:           -1                       ]
    [Bucket Columns:        []                       ]
    [Sort Columns:          []                       ]
    [Storage Desc Params:   ]
    [       separatorChar           ;                   ]
    [       serialization.format    1                   ]

Then, create a schemaRDD from
this table:

    val result = hiveContext.sql("select sid, order_id, total_price, times_dq from et_fullorders where wt_date='2014-04-14' and country='uk' limit 5")

OK, now printSchema:

    scala> result.printSchema
    root
     |-- sid: string (nullable = true)
     |-- order_id: string (nullable = true)
     |-- *total_price: float* (nullable = true)
     |-- *times_dq: timestamp* (nullable = true)

total_price was STRING, but in the schemaRDD it is now FLOAT, and times_dq is
now TIMESTAMP. This is really strange and surprising.

Even stranger is:

    scala> => row.getString(2)).collect.foreach(println)

I got

    240.00
    45.83
    21.67
    95.83
    120.83

but

    scala> => row.getFloat(2)).collect.foreach(println)
    14/08/26 16:01:24 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 8)
    java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Float
            at

Files in this external table are gzipped CSV files:

    14/08/26 15:49:56 INFO HadoopRDD: Input split:

The data in it:

    scala>
    00:12:44.742000]

We use CSVSerDe

Is this the reason? But then why are the 1st and 2nd columns not recognized as
bigint or double or something?

Thanks for any ideas.
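
P.S. To make the worry about malformed values concrete, here is a minimal sketch in plain Scala, with no Spark or Hive involved. The object name PriceParsing and the helper parsePrice are hypothetical, made up only for this illustration. This is not how HiveContext converts values (that is exactly what I am asking about); it only shows that strings such as 9,123.45 or $123.45 cannot be converted to a float directly.

```scala
import scala.util.Try

// Hypothetical helper for illustration only; not part of Spark or Hive.
object PriceParsing {
  // Returns Some(value) when the raw string is a plain decimal number,
  // and None when the conversion throws (thousands separator,
  // currency sign, empty field, ...).
  def parsePrice(raw: String): Option[Float] =
    Try(raw.trim.toFloat).toOption

  def main(args: Array[String]): Unit = {
    // Sample values taken from the question, plus an empty field.
    val samples = Seq("123.45", "9,123.45", "$123.45", "")
    samples.foreach { s =>
      println("'" + s + "' -> " + parsePrice(s))
    }
  }
}
```

Only the first sample yields a value; the other three come back as None. That is why I would rather keep the column as a string and decide myself how to handle such rows.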

Sent from the Apache Spark Developers List mailing list archive at
