Re: Spark 1.6.0 + Hive + HBase

2016-02-15 Thread chutium
has anyone taken a look at this issue:
https://issues.apache.org/jira/browse/HIVE-11166

i got the same exception when inserting into an hbase table



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-1-6-0-Hive-HBase-tp16128p16332.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



hive client.getAllPartitions in lookupRelation can take a very long time

2014-09-02 Thread chutium
in our hive warehouse there are many tables with a lot of partitions, such as
scala> hiveContext.sql("use db_external")
scala> val result = hiveContext.sql("show partitions et_fullorders").count
result: Long = 5879

i noticed that this part of the code:
https://github.com/apache/spark/blob/9d006c97371ddf357e0b821d5c6d1535d9b6fe41/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L55-L56

reads all the partition info at the beginning of the planning phase, so i
added a logInfo around this val partitions = ...
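roughly like this (paraphrased from memory; the exact surrounding code in
HiveMetastoreCatalog may differ):

// paraphrased sketch of the instrumentation, not the exact Spark source
logInfo("getAllPartitionsForPruner started")
val partitions = if (table.isPartitioned) {
  client.getAllPartitionsForPruner(table).toSeq   // fetches every partition of the table
} else {
  Nil
}
logInfo("getAllPartitionsForPruner finished")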

it shows:

scala> val result = hiveContext.sql("select * from db_external.et_fullorders limit 5")
14/09/02 16:15:56 INFO ParseDriver: Parsing command: select * from
db_external.et_fullorders limit 5
14/09/02 16:15:56 INFO ParseDriver: Parse Completed
14/09/02 16:15:56 INFO HiveContext$$anon$1: getAllPartitionsForPruner
started
14/09/02 16:17:35 INFO HiveContext$$anon$1: getAllPartitionsForPruner
finished

it took about 2 minutes just to get all the partitions...

is there any way to avoid this operation, for example by fetching only the
requested partitions somehow?
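something along these lines is what i have in mind, assuming the hive
client's filter-based lookup (getPartitionsByFilter) could be used here --
just a sketch, i have not tried to wire it into lookupRelation:

// hypothetical sketch: ask the metastore only for the matching partitions instead of all of them
import org.apache.hadoop.hive.ql.metadata.Hive

def partitionsByFilter(db: String, tbl: String, filter: String) = {
  val hive = Hive.get()                      // picks up hive-site.xml from the classpath
  val table = hive.getTable(db, tbl)
  // e.g. filter = "wt_date='2014-04-14' and country='uk'"
  hive.getPartitionsByFilter(table, filter)  // assumption: the metastore evaluates the filter server-side
}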

Thanks



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/hive-client-getAllPartitions-in-lookupRelation-can-take-a-very-long-time-tp8186.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-09-01 Thread chutium
thanks a lot, Hao, finally solved this problem, changes of CSVSerDe are here:
https://github.com/chutium/csv-serde/commit/22c667c003e705613c202355a8791978d790591e

btw, ADD JAR in spark hive or hive-thriftserver never works for us, so we
build spark with libraryDependencies += csv-serde ...

or maybe we should try adding it to SPARK_CLASSPATH?
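for reference, the build change is basically just this (the coordinates
below are from memory, so treat them as illustrative only):

// fragment of our build definition -- groupId/version are illustrative, not verified
libraryDependencies += "com.bizo" % "csv-serde" % "1.1.2-0.11.0"

// the untested alternative: put the fat jar on SPARK_CLASSPATH before starting the
// thrift server, e.g.  export SPARK_CLASSPATH=/path/to/csv-serde-1.1.2-0.11.0-all.jar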



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/HiveContext-schemaRDD-printSchema-get-different-dataTypes-feature-or-a-bug-really-strange-and-surpri-tp8035p8166.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-08-31 Thread chutium
Hi Cheng, thank you very much for helping me to finally find out the secret
of this magic...

actually we defined this external table with
SID STRING
REQUEST_ID STRING
TIMES_DQ TIMESTAMP
TOTAL_PRICE FLOAT
...

using desc on the table ext_fullorders, it is only shown as
[# col_name data_type   comment ]
...
[times_dq       string      from deserializer   ]
[total_price    string      from deserializer   ]
...
because, as you said, CSVSerde sets all field object inspectors to
javaStringObjectInspector, which is also why the comments say "from deserializer"

but the StorageDescriptor contains the real user-defined types;
using desc extended on ext_fullorders we can see that its
sd:StorageDescriptor is:
FieldSchema(name:times_dq, type:timestamp, comment:null),
FieldSchema(name:total_price, type:float, comment:null)

and Spark HiveContext reads the schema info from this StorageDescriptor
https://github.com/apache/spark/blob/7e191fe29bb09a8560cd75d453c4f7f662dff406/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L316

so, in the SchemaRDD, the fields in each Row were filled with strings (via
fillObject, all of the values were retrieved from CSVSerDe with
javaStringObjectInspector)

but Spark thinks some of them are float or timestamp (the schema info comes
from sd:StorageDescriptor)
crazy...

and sorry for the update on the weekend...

a little more about how i found this problem and why it is a problem for us.

we use the new spark thrift server; querying normal managed hive tables
works fine

but when we try to access external tables with a custom SerDe such as this
CSVSerDe, we get this ClassCastException:
java.lang.ClassCastException: java.lang.String cannot be cast to
java.lang.Float

the reason is
https://github.com/apache/spark/blob/d94a44d7caaf3fe7559d9ad7b10872fa16cf81ca/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/server/SparkSQLOperationManager.scala#L104-L105

here Spark's thrift server tries to get a float value from the SparkRow,
because in the schema info (sd:StorageDescriptor) this column is float, but
the field in the SparkRow was actually filled with a string value...
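the whole thing boils down to this tiny sketch (standalone scala, not Spark
code), which fails exactly like the thrift server does:

// the SerDe fills the row with strings, but the catalog schema claims float
val cell: Any = "240.00"                  // what CSVSerDe actually put into the Row
val ok  = cell.asInstanceOf[String]       // fine -- this is what getString(2) effectively does
val bad = cell.asInstanceOf[Float]        // throws java.lang.ClassCastException:
                                          //   java.lang.String cannot be cast to java.lang.Float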



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/HiveContext-schemaRDD-printSchema-get-different-dataTypes-feature-or-a-bug-really-strange-and-surpri-tp8035p8157.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-08-31 Thread chutium
has anyone tried to build it with hadoop.version=2.0.0-mr1-cdh4.3.0 or
hadoop.version=1.0.3-mapr-3.0.3?

see the comments in
https://issues.apache.org/jira/browse/SPARK-3124
https://github.com/apache/spark/pull/2035

i built a spark snapshot with hadoop.version=1.0.3-mapr-3.0.3
and the ticket creator built with hadoop.version=2.0.0-mr1-cdh4.3.0

neither hadoop version works

on 1.0.3-mapr-3.0.3, when i try to start spark-shell i get:

14/08/23 23:29:46 INFO SecurityManager: Changing view acls to: client09,
14/08/23 23:29:46 INFO SecurityManager: Changing modify acls to: client09,
14/08/23 23:29:46 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(client09, );
users with modify permissions: Set(client09, )
14/08/23 23:29:50 INFO Slf4jLogger: Slf4jLogger started
14/08/23 23:29:50 INFO Remoting: Starting remoting
14/08/23 23:29:50 ERROR ActorSystemImpl: Uncaught fatal error from thread
[spark-akka.actor.default-dispatcher-2] shutting down ActorSystem [spark]
java.lang.VerifyError: (class:
org/jboss/netty/channel/socket/nio/NioWorkerPool, method: createWorker
signature:
(Ljava/util/concurrent/Executor;)Lorg/jboss/netty/channel/socket/nio/AbstractNioWorker;)
Wrong return type in function
at
akka.remote.transport.netty.NettyTransport.<init>(NettyTransport.scala:282)
at
akka.remote.transport.netty.NettyTransport.<init>(NettyTransport.scala:239)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at
akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
at scala.util.Try$.apply(Try.scala:161)
...
...
...


it seems this netty jar conflict affects more than just the SQL component
and a few test cases
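a possible workaround i have not verified yet: exclude the old netty that
the mapr/cdh hadoop client drags in, so akka gets the version it was
compiled against -- this is just a guess at the build change:

// untested build fragment -- the version is our mapr build, the exclude is the guess
libraryDependencies += ("org.apache.hadoop" % "hadoop-client" % "1.0.3-mapr-3.0.3")
  .exclude("org.jboss.netty", "netty")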



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC3-tp8147p8159.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-08-26 Thread chutium
is there any automatic dataType conversion or detection in HiveContext?

all columns of the table are defined as string in the hive metastore

one column is total_price, with values like 123.45, and this column gets
recognized as dataType Float in HiveContext...

is this a feature or a bug? it really surprised me... how is it implemented?
if it is a feature, can i turn it off? i want to get a schemaRDD with
exactly the same datatypes as defined in the hive metadata. i know the
column total_price should contain float values, but it might not, and what
happens if there is a broken line in my huge CSV file? or some total_price
like 9,123.45 or $123.45 or something
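just to make the concern concrete, this is what would happen if spark
really treats those values as floats (plain scala, no hive involved):

// why an auto-detected float type is risky for our data
"123.45".toFloat       // 123.45f -- the clean case works
// "9,123.45".toFloat  // would throw java.lang.NumberFormatException
// "$123.45".toFloat   // would also throw java.lang.NumberFormatException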

==

some example for this in our env.

MapR v3 cluster, newest spark github master clone from yesterday

built with
sbt/sbt -Dhadoop.version=1.0.3-mapr-3.0.3 -Phive assembly

hive-site.xml configured

==

spark-shell scripts:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("use our_live_db")
hiveContext.sql("desc formatted et_fullorders").collect.foreach(println)
...
...
14/08/26 15:47:09 INFO SparkContext: Job finished: collect at
SparkPlan.scala:85, took 0.0305408 s
[# col_name data_type   comment ]
[]
[sid            string      from deserializer   ]
[request_id     string      from deserializer   ]
[*times_dq      string*     from deserializer   ]
[*total_price   string*     from deserializer   ]
[order_id       string      from deserializer   ]
[]
[# Partition Information ]
[# col_name data_type   comment ]
[]
[wt_date        string      None]
[country        string      None]
[]
[# Detailed Table Information]
[Database:  our_live_db]
[Owner: client02  ]
[CreateTime:Fri Jan 31 12:23:40 CET 2014 ]
[LastAccessTime:UNKNOWN  ]
[Protect Mode:  None ]
[Retention: 0]
[Location: 
maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders ]
[Table Type:EXTERNAL_TABLE   ]
[Table Parameters:   ]
[   EXTERNALTRUE]
[   transient_lastDdlTime   1391167420  ]
[]
[# Storage Information   ]
[SerDe Library: com.bizo.hive.serde.csv.CSVSerde ]
[InputFormat:   org.apache.hadoop.mapred.TextInputFormat ]
[OutputFormat: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   ]
[Compressed:No   ]
[Num Buckets:   -1   ]
[Bucket Columns:[]   ]
[Sort Columns:  []   ]
[Storage Desc Params:]
[   separatorChar   ;   ]
[   serialization.format1   ]

then, create a schemaRDD from this table

val result = hiveContext.sql("select sid, order_id, total_price, times_dq
from et_fullorders where wt_date='2014-04-14' and country='uk' limit 5")

ok now, printSchema...

scala> result.printSchema
root
 |-- sid: string (nullable = true)
 |-- order_id: string (nullable = true)
 |-- *total_price: float* (nullable = true)
 |-- *times_dq: timestamp* (nullable = true)


total_price was STRING but in the schemaRDD it is now FLOAT
and
times_dq is now TIMESTAMP

really strange and surprising...

and even stranger:

scala> result.map(row => row.getString(2)).collect.foreach(println)

i got
240.00
45.83
21.67
95.83
120.83

but

scala> result.map(row => row.getFloat(2)).collect.foreach(println)

14/08/26 16:01:24 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 8)
java.lang.ClassCastException: java.lang.String cannot be cast to
java.lang.Float
at scala.runtime.BoxesRunTime.unboxToFloat(BoxesRunTime.java:114)

==

btw, files in this external table are gzipped csv files:
14/08/26 15:49:56 INFO HadoopRDD: Input split:
maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders/wt_date=2014-04-14/country=uk/getFullOrders_2014-04-14.csv.gz:0+16990

and the data in it:

scala> result.collect.foreach(println)
[51402123123,12344000123454,240.00,2014-04-14 00:03:49.082000]
[51402110123,12344000123455,45.83,2014-04-14 00:04:13.639000]
[51402129123,12344000123458,21.67,2014-04-14 00:09:12.276000]
[51402092123,12344000132457,95.83,2014-04-14 00:09:42.228000]
[51402135123,12344000123460,120.83,2014-04-14 00:12:44.742000]

we use CSVSerDe
https://drone.io/github.com/ogrodnek/csv-serde/files/target/csv-serde-1.1.2-0.11.0-all.jar


Re: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-08-26 Thread chutium
oops, i tried with a managed table; there the column types are not changed

so it is mostly due to the serde lib CSVSerDe
(https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java#L123)
or maybe the CSVReader from opencsv?...

but if the columns are defined as string, no matter what type comes back
from the custom SerDe or CSVReader, they should be cast to string at the
end, right?

why not use the schema from the hive metadata directly?
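what i would have expected is a final cast driven by the metastore types,
something like this sketch (the helper name and the match arms are mine,
just to illustrate the idea):

// hypothetical coercion step: force whatever the SerDe returned into the declared type
def coerce(value: Any, metastoreType: String): Any = metastoreType match {
  case "string"    => value.toString
  case "float"     => value.toString.toFloat
  case "timestamp" => java.sql.Timestamp.valueOf(value.toString)
  case _           => value
}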



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/HiveContext-schemaRDD-printSchema-get-different-dataTypes-feature-or-a-bug-really-strange-and-surpri-tp8035p8039.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark SQL Query and join different data sources.

2014-08-21 Thread chutium
as far as i know, HQL queries look up the schema info of all the tables in
the query in the hive metastore, so it is not possible to join tables
registered on sqlContext from within hiveContext.hql

but this should work:

hiveContext.hql("select ...").registerAsTable("a")
sqlContext.jsonFile("xxx").registerAsTable("b")

then

sqlContext.sql("select ... from a join b on ...")


i created a ticket SPARK-2710 to add ResultSets from JDBC connections as a
new data source, but there is no predicate push-down yet, and it is not
available from HQL

so, if you are looking for something that can query different data sources
with full SQL92 syntax, facebook presto is still the only choice; they have
some kind of JDBC connector in development, and there are some unofficial
implementations...

but i am looking forward to seeing the progress of Spark SQL; since
SPARK-2179, SQLContext can handle any kind of structured data with a
sequence of DataTypes as schema, although turning the data into Rows is
still a little bit tricky...
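to make that last point concrete, this is roughly the pattern (spark-shell,
API names as i remember them from the 1.1 programming guide, so double-check
against your version):

// rough sketch of building a SchemaRDD from raw text with an explicit schema
import org.apache.spark.sql._

val schema = StructType(Seq(
  StructField("sid", StringType, nullable = true),
  StructField("total_price", FloatType, nullable = true)))

val rowRDD = sc.textFile("orders.csv").map(_.split(";")).map { cols =>
  Row(cols(0), cols(1).toFloat)   // the per-column conversion is the tricky part
}

val result = sqlContext.applySchema(rowRDD, schema)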



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Query-and-join-different-data-sources-tp7914p7937.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: spark-shell is broken! (bad option: '--master')

2014-08-08 Thread chutium
does no one use spark-shell on the master branch?

i created a PR as a follow-up commit to SPARK-2678 and PR #1801:

https://github.com/apache/spark/pull/1861



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/spark-shell-is-broken-bad-option-master-tp7778p7780.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org