HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

chutium Tue, 26 Aug 2014 07:54:12 -0700

Is there any automatic data type conversion or detection in HiveContext?

All columns of a table are defined as string in the Hive metastore. One
column is total_price, with values like 123.45, and this column is
recognized as data type Float in HiveContext.

Is this a feature or a bug? It really surprised me. How is it implemented?
If it is a feature, can I turn it off? I want to get a schemaRDD with
exactly the same data types as defined in the Hive metadata. I know the
column total_price should contain float values, but that is not guaranteed.
What happens if there is a broken line in my huge CSV file, or if some
total_price is 9,123.45 or $123.45 or something similar?

==============================================================
Some
example for this in our env.

MapR v3 cluster, newest spark github master clone from yesterday, built with

sbt/sbt -Dhadoop.version=1.0.3-mapr-3.0.3 -Phive assembly

hive-site.xml configured

==============================================================
spark-shell scripts:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("use our_live_db")
hiveContext.sql("desc formatted et_fullorders").collect.foreach(println)
......
14/08/26 15:47:09 INFO SparkContext: Job finished: collect at SparkPlan.scala:85, took 0.0305408 s
[# col_name             data_type               comment             ]
[                ]
[sid                    string                  from deserializer   ]
[request_id             string                  from deserializer   ]
[*times_dq               string*                  from deserializer   ]
[*total_price            string*                  from deserializer   ]
[order_id               string                  from deserializer   ]
[                ]
[# Partition Information                 ]
[# col_name             data_type               comment             ]
[                ]
[wt_date                string                  None                ]
[country                string                  None                ]
[                ]
[# Detailed Table Information            ]
[Database:              our_live_db              ]
[Owner:                 client02                 ]
[CreateTime:            Fri Jan 31 12:23:40 CET 2014     ]
[LastAccessTime:        UNKNOWN                  ]
[Protect Mode:          None                     ]
[Retention:             0                        ]
[Location:              maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders    ]
[Table Type:            EXTERNAL_TABLE           ]
[Table Parameters:              ]
[       EXTERNAL                TRUE                ]
[       transient_lastDdlTime   1391167420          ]
[                ]
[# Storage Information           ]
[SerDe Library:         com.bizo.hive.serde.csv.CSVSerde         ]
[InputFormat:           org.apache.hadoop.mapred.TextInputFormat         ]
[OutputFormat:          org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       ]
[Compressed:            No                       ]
[Num Buckets:           -1                       ]
[Bucket Columns:        []                       ]
[Sort Columns:          []                       ]
[Storage Desc Params:           ]
[       separatorChar           ;                   ]
[       serialization.format    1                   ]

Then, create a schemaRDD from
this table:

val result = hiveContext.sql("select sid, order_id, total_price, times_dq from et_fullorders where wt_date='2014-04-14' and country='uk' limit 5")

OK, now printSchema:

scala> result.printSchema
root
 |-- sid: string (nullable = true)
 |-- order_id: string (nullable = true)
 |-- *total_price: float* (nullable = true)
 |-- *times_dq: timestamp* (nullable = true)

total_price was STRING but is now FLOAT in the schemaRDD, and times_dq is
now TIMESTAMP. Really strange and surprising.

Even stranger is:

scala> result.map(row => row.getString(2)).collect.foreach(println)

I got

240.00
45.83
21.67
95.83
120.83

but

scala> result.map(row => row.getFloat(2)).collect.foreach(println)
14/08/26 16:01:24 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 8)
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Float
        at scala.runtime.BoxesRunTime.unboxToFloat(BoxesRunTime.java:114)

==============================================================
By the way,
the files in this external table are gzipped CSV files:

14/08/26 15:49:56 INFO HadoopRDD: Input split: maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders/wt_date=2014-04-14/country=uk/getFullOrders_2014-04-14.csv.gz:0+16990

and the data in them:

scala> result.collect.foreach(println)
[5000000001402123123,12344000123454,240.00,2014-04-14 00:03:49.082000]
[5000000001402110123,12344000123455,45.83,2014-04-14 00:04:13.639000]
[5000000001402129123,12344000123458,21.67,2014-04-14 00:09:12.276000]
[5000000001402092123,12344000132457,95.83,2014-04-14 00:09:42.228000]
[5000000001402135123,12344000123460,120.83,2014-04-14 00:12:44.742000]

We use CSVSerDe:
https://drone.io/github.com/ogrodnek/csv-serde/files/target/csv-serde-1.1.2-0.11.0-all.jar

Maybe this is the reason? But then why are the 1st and 2nd columns not
recognized as bigint or double or something?

Thanks for any ideas.
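
For illustration: since row.getString(2) does return the raw text, values
such as 9,123.45 or $123.45 could be cleaned on the client side before
converting. The following is only a minimal sketch under assumptions.
PriceParsing and parsePrice are hypothetical names (not part of Spark or
Hive), and it assumes that '$' is a currency prefix and ',' is a thousands
separator.

```scala
import scala.util.Try

// Hypothetical helper, not part of Spark or Hive.
object PriceParsing {

  // Turns a raw price string such as "240.00", "$123.45" or "9,123.45"
  // into Some(Float). Returns None for null, for text that toFloat
  // rejects, and for NaN or infinite results.
  def parsePrice(raw: String): Option[Float] = {
    if (raw == null) None
    else {
      // Assumption: '$' is a currency prefix, ',' a thousands separator.
      val cleaned = raw.trim.replace("$", "").replace(",", "")
      Try(cleaned.toFloat).toOption
        .filter(f => !f.isNaN && !f.isInfinity)
    }
  }
}
```

In the session above this could presumably be applied as
result.map(row => PriceParsing.parsePrice(row.getString(2))), so that a
broken line shows up as None instead of failing the job.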




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/HiveContext-schemaRDD-printSchema-get-different-dataTypes-feature-or-a-bug-really-strange-and-surpri-tp8034.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
