RE: How to add a custom jar file to the Spark driver?

2016-03-08 Thread Wang, Daoyuan
Hi Gerhard,

How does EMR set its conf for Spark? I think if you set both SPARK_CLASSPATH and 
spark.driver.extraClassPath, Spark would ignore SPARK_CLASSPATH.
I think you can do this by reading the configuration from SparkConf, adding 
your custom settings to the corresponding key, and using the updated SparkConf to 
instantiate your SparkContext.
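For illustration, a minimal sketch of that pattern in Scala (the jar path and the choice 
of key are placeholders; note that in yarn-client mode the driver JVM is already up by 
the time this code runs, so driver-side classpath entries generally still have to be 
supplied at launch time):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
// read whatever the platform already put into the key, then append our jar
val key = "spark.executor.extraClassPath"
val existing = conf.get(key, "")
val ourJars = "/path/to/our/custom/jar/*"
conf.set(key, if (existing.isEmpty) ourJars else existing + ":" + ourJars)
val sc = new SparkContext(conf)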

Thanks,
Daoyuan

From: Gerhard Fiedler [mailto:gfied...@algebraixdata.com]
Sent: Wednesday, March 09, 2016 5:41 AM
To: user@spark.apache.org
Subject: How to add a custom jar file to the Spark driver?

We're running Spark 1.6.0 on EMR, in YARN client mode. We run Python code, but 
we want to add a custom jar file to the driver.

When running on a local one-node standalone cluster, we just use 
spark.driver.extraClassPath and everything works:

spark-submit --conf spark.driver.extraClassPath=/path/to/our/custom/jar/*  
our-python-script.py

But on EMR, this value is set to something that is needed to make their 
installation of Spark work. Setting it to point to our custom jar overwrites 
the original setting rather than adding to it, which breaks Spark.

Our current workaround is to capture whatever EMR sets 
spark.driver.extraClassPath to once, then use that path and add our jar file to 
it. Of course this breaks when EMR changes this path in their cluster settings, 
and we wouldn't necessarily notice that easily. This is how it looks:

spark-submit --conf 
spark.driver.extraClassPath=/path/to/our/custom/jar/*:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*
  our-python-script.py

We prefer not to do this...

We tried the spark-submit argument --jars, but it didn't seem to do anything. 
Like this:

spark-submit --jars /path/to/our/custom/jar/file.jar  our-python-script.py

We also tried to set CLASSPATH, but it doesn't seem to have any impact:

export CLASSPATH=/path/to/our/custom/jar/*
spark-submit  our-python-script.py

When using SPARK_CLASSPATH, we got warnings that it is deprecated, and the 
messages also seemed to imply that it affects the same configuration that is 
set by spark.driver.extraClassPath.


So, my question is: Is there a clean way to add a custom jar file to a Spark 
configuration?

Thanks,
Gerhard



RE: RE: Spark build with Hive

2015-05-20 Thread Wang, Daoyuan
In 1.4 I think we still only support 0.12.0 and 0.13.1.

From: guoqing0...@yahoo.com.hk [mailto:guoqing0...@yahoo.com.hk]
Sent: Thursday, May 21, 2015 12:03 PM
To: Cheng, Hao; Ted Yu
Cc: user
Subject: Re: RE: Spark build with Hive

Thanks very much. Which versions will be supported in the upcoming 1.4? I hope it 
will support more versions.


guoqing0...@yahoo.com.hk

From: Cheng, Hao
Date: 2015-05-21 11:20
To: Ted Yu; 
guoqing0...@yahoo.com.hk
CC: user
Subject: RE: Spark build with Hive
Yes, ONLY 0.12.0 and 0.13.1 are supported currently. Hopefully we can support higher 
versions in the next 1 or 2 releases.

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Thursday, May 21, 2015 11:12 AM
To: guoqing0...@yahoo.com.hk
Cc: user
Subject: Re: Spark build with Hive

I am afraid even Hive 1.0 is not supported, let alone Hive 1.2

Cheers

On Wed, May 20, 2015 at 8:08 PM, guoqing0...@yahoo.com.hk wrote:
Hi, can Spark-1.3.1 be built with Hive-1.2? It seems Spark-1.3.1 can only be built 
with Hive 0.13 or 0.12, according to the documentation.


# Apache Hadoop 2.4.X with Hive 13 support

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver 
-DskipTests clean package

# Apache Hadoop 2.4.X with Hive 12 support

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-0.12.0 
-Phive-thriftserver -DskipTests clean package


guoqing0...@yahoo.com.hk



RE: Spark SQL and java.lang.RuntimeException

2015-05-11 Thread Wang, Daoyuan
Hi,

Are you creating the table from hive? Which version of hive are you using?

Thanks,
Daoyuan

-Original Message-
From: Nick Travers [mailto:n.e.trav...@gmail.com] 
Sent: Sunday, May 10, 2015 10:34 AM
To: user@spark.apache.org
Subject: Spark SQL and java.lang.RuntimeException

I'm getting the following error when reading a table from Hive. Note the 
spelling of the 'Primitve' in the stack trace. I can't seem to find it anywhere 
else online.

It seems to only occur with this one particular table I am reading from.
Occasionally the task will completely fail, other times it will not.

I run into different variants of the exception, presumably for each of the 
different types of the columns (LONG, INT, BOOLEAN).

Has anyone else run into this issue? I'm running Spark 1.3.0 with the 
standalone cluster manager.

java.lang.RuntimeException: Primitve type LONG should not take parameters
at
org.apache.hadoop.hive.serde2.lazy.objectinspector.primitive.LazyPrimitiveObjectInspectorFactory.getLazyObjectInspector(LazyPrimitiveObjectInspectorFactory.java:136)
at
org.apache.hadoop.hive.serde2.lazy.objectinspector.primitive.LazyPrimitiveObjectInspectorFactory.getLazyObjectInspector(LazyPrimitiveObjectInspectorFactory.java:113)
at
org.apache.hadoop.hive.serde2.lazy.LazyFactory.createLazyObjectInspector(LazyFactory.java:224)
at
org.apache.hadoop.hive.serde2.lazy.LazyFactory.createColumnarStructInspector(LazyFactory.java:314)
at
org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe.initialize(ColumnarSerDe.java:88)
at
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:118)
at
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:115)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-and-java-lang-RuntimeException-tp22831.html



RE: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-05 Thread Wang, Daoyuan
How did you configure your metastore?

Thanks,
Daoyuan

From: 鹰 [mailto:980548...@qq.com]
Sent: Tuesday, May 05, 2015 3:11 PM
To: luohui20001
Cc: user
Subject: Re: spark Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient

hi luo,
thanks for your reply. In fact I can use Hive from Spark on my Spark master 
machine, but when I copy my Spark files to another machine and try to access 
Hive from Spark there, I get the error "Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient". I have copied 
hive-site.xml to the Spark conf directory and I am authenticated to access 
the Hive metastore warehouse;

Thanks , Best regards!
-- Original Message --
From: "luohui20001" <luohui20...@sina.com>
Sent: Tuesday, May 5, 2015 9:56 AM
To: "鹰" <980548...@qq.com>; "user" <user@spark.apache.org>
Subject: Re: spark Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient

you may need to copy hive-site.xml to your spark conf directory and check your 
hive metastore warehouse setting, and also check if you are authenticated to 
access hive metastore warehouse.



Thanks&Best regards!
罗辉 San.Luo

- Original Message -
From: "鹰" <980548...@qq.com>
To: "user" <user@spark.apache.org>
Subject: spark Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Date: 2015-05-05 08:49

hi all,
   when I submit a spark-sql program to select data from my hive 
database I get an error like this:
User class threw exception: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
What's wrong with my spark configuration? Thanks for any help!


RE: Re: RE: Re: Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread Wang, Daoyuan
You can use
Explain extended select ….
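
For example, run through the SQL context (a sketch only; the query is the one 
from later in this thread):

sqlContext.sql("EXPLAIN EXTENDED select a.name, a.startpoint, a.endpoint, a.piece " +
  "from db a join sample b on (a.name = b.name) where (b.startpoint > a.startpoint + 25)")
  .collect().foreach(println)

EXPLAIN EXTENDED should print the parsed, analyzed, optimized and physical plans.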

From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Tuesday, May 05, 2015 9:52 AM
To: Cheng, Hao; Olivier Girardot; user
Subject: Re: RE: Re: Re: sparksql running slow while joining 2 tables.


As far as I know, broadcast join is automatically enabled by 
spark.sql.autoBroadcastJoinThreshold.

refer to 
http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
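
As a sketch, the threshold can also be raised at runtime so that a table of this 
size qualifies for a broadcast join (104857600 is just an illustrative value, 
100 MB in bytes):

sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=104857600")

The same key can also be set on the conf before the context is created.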



And how can I check my app's physical plan, and other things like the optimized 
plan, executable plan, etc.?



thanks



Thanks&Best regards!
罗辉 San.Luo

- Original Message -
From: "Cheng, Hao" <hao.ch...@intel.com>
To: "Cheng, Hao" <hao.ch...@intel.com>, "luohui20...@sina.com" <luohui20...@sina.com>, 
Olivier Girardot <ssab...@gmail.com>, user <user@spark.apache.org>
Subject: RE: Re: Re: sparksql running slow while joining 2 tables.
Date: 2015-05-05 08:38

Or, have you ever tried broadcast join?

From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Tuesday, May 5, 2015 8:33 AM
To: luohui20...@sina.com; Olivier Girardot; user
Subject: RE: Re: Re: sparksql running slow while joining 2 tables.

Can you print out the physical plan?

EXPLAIN SELECT xxx…

From: luohui20...@sina.com 
[mailto:luohui20...@sina.com]
Sent: Monday, May 4, 2015 9:08 PM
To: Olivier Girardot; user
Subject: Re: Re: sparksql running slow while joining 2 tables.


hi Olivier

Spark 1.3.1, with Java 1.8.0_45.

I've attached 2 pics.

It seems like a GC issue. I also tried different parameters like memory 
size of driver & executor, memory fraction, java opts...

but this issue still happens.



Thanks&Best regards!
罗辉 San.Luo

- Original Message -
From: Olivier Girardot <ssab...@gmail.com>
To: luohui20...@sina.com, user <user@spark.apache.org>
Subject: Re: sparksql running slow while joining 2 tables.
Date: 2015-05-04 20:46

Hi,
What is your Spark version?

Regards,

Olivier.

On Mon, 4 May 2015 at 11:03, luohui20...@sina.com wrote:

hi guys

when I run a SQL query like "select a.name, a.startpoint, a.endpoint, a.piece 
from db a join sample b on (a.name = b.name) where (b.startpoint > 
a.startpoint + 25);" I find SparkSQL runs slowly, taking minutes, which may be 
caused by very long GC and shuffle times.



   table db is created from a txt file of about 56 MB while table sample 
is about 26 MB, both quite small.

   my spark cluster is a standalone pseudo-distributed spark cluster with 
an 8g executor and a 4g driver.

   any advice? thank you guys.





Thanks&Best regards!
罗辉 San.Luo



RE: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-17 Thread Wang, Daoyuan
Thank you for the explanation! I’ll check what can be done here.

From: Krist Rastislav [mailto:rkr...@vub.sk]
Sent: Friday, April 17, 2015 9:03 PM
To: Wang, Daoyuan; Michael Armbrust
Cc: user
Subject: RE: ClassCastException processing date fields using spark SQL since 
1.3.0

So finally, org.apache.spark.sql.catalyst.ScalaReflection#convertToCatalyst was 
the method I was looking for (this is the way it is done with case classes at 
least, so it should be good for me too ;-)). My problem is thus solved...

Someone should put that method also in JdbcRDD to make it work again.

Sorry for spamming you ;-)

Thank You very much, best regards

R.Krist


From: Krist Rastislav
Sent: Friday, April 17, 2015 11:57 AM
To: 'Wang, Daoyuan'; 'Michael Armbrust'
Cc: 'user'
Subject: RE: ClassCastException processing date fields using spark SQL since 
1.3.0

Hello again,

steps to reproduce the same problem in JdbcRDD:

- create a table containing a Date field in your favourite DBMS; I used PostgreSQL:

CREATE TABLE spark_test
(
  pk_spark_test integer NOT NULL,
  text character varying(25),
  date1 date,
  CONSTRAINT pk PRIMARY KEY (pk_spark_test)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE spark_test
  OWNER TO postgres;
GRANT ALL ON TABLE spark_test TO postgres;
GRANT ALL ON TABLE spark_test TO public;

- fill it with data:

insert into spark_test(pk_spark_test, text, date1) values (1, 'one', 
'2014-04-01')
insert into spark_test(pk_spark_test, text, date1) values (2, 'two', 
'2014-04-02')

- from scala REPL, try the following:
import org.apache.spark.sql.SQLContext

val sqc = new SQLContext(sc)
sqc.jdbc("jdbc:postgresql://localhost:5432/ebx_repository?schema=ebx_repository&user=abc&password=def",
 "spark_test").cache.registerTempTable("spark_test")  // don’t forget the cache 
method

sqc.sql("select * from spark_test").foreach(println)

the last command will produce the following error (if you don’t use cache, it 
will produce correct results as expected):

11:50:27.306 [Executor task launch worker-0] ERROR 
org.apache.spark.executor.Executor - Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableInt
at 
org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.getInt(SpecificMutableRow.scala:248)
 ~[spark-catalyst_2.11-1.3.0.jar:1.3.0]
at 
org.apache.spark.sql.columnar.IntColumnStats.gatherStats(ColumnStats.scala:191) 
~[spark-sql_2.11-1.3.0.jar:1.3.0]
at 
org.apache.spark.sql.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:56)
 ~[spark-sql_2.11-1.3.0.jar:1.3.0]
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:87)
 ~[spark-sql_2.11-1.3.0.jar:1.3.0]
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78)
 ~[spark-sql_2.11-1.3.0.jar:1.3.0]
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:87)
 ~[spark-sql_2.11-1.3.0.jar:1.3.0]
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:135)
 ~[spark-sql_2.11-1.3.0.jar:1.3.0]
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:111)
 ~[spark-sql_2.11-1.3.0.jar:1.3.0]
at 
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249) 
~[spark-core_2.11-1.3.0.jar:1.3.0]
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:172) 
~[spark-core_2.11-1.3.0.jar:1.3.0]
at 
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:79) 
~[spark-core_2.11-1.3.0.jar:1.3.0]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242) 
~[spark-core_2.11-1.3.0.jar:1.3.0]
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) 
~[spark-core_2.11-1.3.0.jar:1.3.0]
at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) 
~[spark-core_2.11-1.3.0.jar:1.3.0]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) 
~[spark-core_2.11-1.3.0.jar:1.3.0]
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) 
~[spark-core_2.11-1.3.0.jar:1.3.0]
at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) 
~[spark-core_2.11-1.3.0.jar:1.3.0]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) 
~[spark-core_2.11-1.3.0.jar:1.3.0]
at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) 
~[spark-core_2.11-1.3

RE: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-17 Thread Wang, Daoyuan
Normally I use something like the following in Scala:

>case class datetest(x: Int, y: java.sql.Date)
>val dt = sc.parallelize(1 to 3).map(p => datetest(p, new 
>java.sql.Date(p*1000*60*60*24)))
>sqlContext.createDataFrame(dt).registerTempTable("t1")
>sql("select * from t1").collect.foreach(println)

If you still meet exceptions, please let me know about your query. The 
implicit conversion should be triggered when you call createDataFrame.

Thanks,
Daoyuan

From: Krist Rastislav [mailto:rkr...@vub.sk]
Sent: Friday, April 17, 2015 3:52 PM
To: Wang, Daoyuan; Michael Armbrust
Cc: user
Subject: RE: ClassCastException processing date fields using spark SQL since 
1.3.0

Hello,

thank You for Your answer. I am creating the DataFrames manually using 
org.apache.spark.sql.SQLContext#createDataFrame. The RDD is my custom 
implementation encapsulating the invocation of a remote REST-based web service, 
and the schema is created programmatically from metadata (obtained from the same WS).
So in other words, the creation of Rows in the DataFrame is fully under my control 
and the implicit conversion thus cannot occur. Is there any best practice 
(ideally a utility method) for creating a Row instance from a set of values of 
the types represented by the DataFrame schema? I will try to take a deeper look into 
Your source code to locate the definition of the implicit conversion, but maybe 
some hint from Your side could help deliver a better implementation.

Thank You very much for Your help (and for the great work you are doing there).

Regards

R.Krist


From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com]
Sent: Friday, April 17, 2015 5:08 AM
To: Michael Armbrust; Krist Rastislav
Cc: user
Subject: RE: ClassCastException processing date fields using spark SQL since 
1.3.0

The conversion between date and int should be automatically handled by implicit 
conversion. So we accept date types externally, and represent them as 
integers internally.

From: Wang, Daoyuan
Sent: Friday, April 17, 2015 11:00 AM
To: 'Michael Armbrust'; rkrist
Cc: user
Subject: RE: ClassCastException processing date fields using spark SQL since 
1.3.0

Can you tell us how you created the dataframe?

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, April 17, 2015 2:52 AM
To: rkrist
Cc: user
Subject: Re: ClassCastException processing date fields using spark SQL since 
1.3.0

Filed: https://issues.apache.org/jira/browse/SPARK-6967

Shouldn't they be null?

 Statistics are only used to eliminate partitions that can't possibly hold 
matching values.  So while you are right this might result in a false positive, 
that will not result in a wrong answer.




RE: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-16 Thread Wang, Daoyuan
Can you tell us how you created the dataframe?

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, April 17, 2015 2:52 AM
To: rkrist
Cc: user
Subject: Re: ClassCastException processing date fields using spark SQL since 
1.3.0

Filed: https://issues.apache.org/jira/browse/SPARK-6967

Shouldn't they be null?

 Statistics are only used to eliminate partitions that can't possibly hold 
matching values.  So while you are right this might result in a false positive, 
that will not result in a wrong answer.


RE: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-16 Thread Wang, Daoyuan
The conversion between date and int should be automatically handled by implicit 
conversion. So we accept date types externally, and represent them as 
integers internally.

From: Wang, Daoyuan
Sent: Friday, April 17, 2015 11:00 AM
To: 'Michael Armbrust'; rkrist
Cc: user
Subject: RE: ClassCastException processing date fields using spark SQL since 
1.3.0

Can you tell us how you created the dataframe?

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, April 17, 2015 2:52 AM
To: rkrist
Cc: user
Subject: Re: ClassCastException processing date fields using spark SQL since 
1.3.0

Filed: https://issues.apache.org/jira/browse/SPARK-6967

Shouldn't they be null?

 Statistics are only used to eliminate partitions that can't possibly hold 
matching values.  So while you are right this might result in a false positive, 
that will not result in a wrong answer.


RE: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-15 Thread Wang, Daoyuan
Can you provide your spark version?

Thanks,
Daoyuan

From: Nathan McCarthy [mailto:nathan.mccar...@quantium.com.au]
Sent: Wednesday, April 15, 2015 1:57 PM
To: Nathan McCarthy; user@spark.apache.org
Subject: Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

Just an update, tried with the old JdbcRDD and that worked fine.

From: Nathan <nathan.mccar...@quantium.com.au>
Date: Wednesday, 15 April 2015 1:57 pm
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

Hi guys,

Trying to use a Spark SQL context's .load("jdbc", ...) method to create a DF 
from a JDBC data source. All seems to work well locally (master = local[*]), 
however as soon as we try and run on YARN we have problems.

We seem to be running into problems with the class path and loading up the JDBC 
driver. I'm using the jTDS 1.3.1 driver, net.sourceforge.jtds.jdbc.Driver.

./bin/spark-shell --jars /tmp/jtds-1.3.1.jar --master yarn-client

When trying to run I get an exception;

scala> sqlContext.load("jdbc", Map("url" -> 
"jdbc:jtds:sqlserver://blah:1433/MyDB;user=usr;password=pwd", "dbtable" -> 
"CUBE.DIM_SUPER_STORE_TBL"))

java.sql.SQLException: No suitable driver found for 
jdbc:jtds:sqlserver://blah:1433/MyDB;user=usr;password=pwd

Thinking maybe we need to force load the driver, if I supply "driver" -> 
"net.sourceforge.jtds.jdbc.Driver" to .load we get;

scala> sqlContext.load("jdbc", Map("url" -> 
"jdbc:jtds:sqlserver://blah:1433/MyDB;user=usr;password=pwd", "driver" -> 
"net.sourceforge.jtds.jdbc.Driver", "dbtable" -> "CUBE.DIM_SUPER_STORE_TBL"))

java.lang.ClassNotFoundException: net.sourceforge.jtds.jdbc.Driver
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:191)
at 
org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:97)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:290)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:679)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:21)

Yet if I run a Class.forName() just from the shell;

scala> Class.forName("net.sourceforge.jtds.jdbc.Driver")
res1: Class[_] = class net.sourceforge.jtds.jdbc.Driver

No problem finding the JAR. I've tried both in the shell and running with 
spark-submit (packing the driver in with my application as a fat JAR). Nothing 
seems to work.

I can also get a connection in the driver/shell no problem;

scala> import java.sql.DriverManager
import java.sql.DriverManager
scala> 
DriverManager.getConnection("jdbc:jtds:sqlserver://blah:1433/MyDB;user=usr;password=pwd")
res3: java.sql.Connection = 
net.sourceforge.jtds.jdbc.JtdsConnection@2a67ecd0

I'm probably missing some class path setting here. In 
jdbc.DefaultSource.createRelation it looks like the call to Class.forName 
doesn't specify a class loader, so it just uses the default Java behaviour to 
reflectively get the class loader. It almost feels like it's using a different 
class loader.
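
A rough illustration of that difference (a sketch only; whether it helps depends 
on which loader actually holds the jar in a given deployment):

// default behaviour: resolved against the classloader of the calling class
Class.forName("net.sourceforge.jtds.jdbc.Driver")

// explicitly resolving against the current thread's context classloader,
// which on Spark executors typically includes jars shipped with --jars
Class.forName("net.sourceforge.jtds.jdbc.Driver",
  true, Thread.currentThread().getContextClassLoader)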

I also tried seeing if the class path was there on all my executors by running;

import scala.collection.JavaConverters._
sc.parallelize(Seq(1,2,3,4)).flatMap(_ => 
java.sql.DriverManager.getDrivers().asScala.map(d => s"$d | 
${d.acceptsURL("jdbc:jtds:sqlserver://blah:1433/MyDB;user=usr;password=pwd")}")).collect().foreach(println)

This successfully returns;

15/04/15 01:07:37 INFO scheduler.DAGScheduler: Job 0 finished: collect at 
Main.scala:46, took 1.495597 s
org.apache.derby.jdbc.AutoloadedDriver40 | false
com.mysql.jdbc.Driver | false
net.sourceforge.jtds.jdbc.Driver | true
org.apache.derby.jdbc.AutoloadedDriver40 | false
com.mysql.jdbc.Driver | false
net.sourceforge.jtds.jdbc.Driver | true
org.apache.derby.jdbc.AutoloadedDriver40 | false
com.mysql.jdbc.Driver | false
net.sourceforge.jtds.jdbc.Driver | true
org.apache.derby.jdbc.AutoloadedDriver40 | false
com.mysql.jdbc.Driver | false
net.sourceforge.jtds.jdbc.Driver | true

As a final test we tried with postgres driver and had the same problem. Any 
ideas?

Cheers,
Nathan


RE: How to use Joda Time with Spark SQL?

2015-04-12 Thread Wang, Daoyuan
Actually, I did a little investigation on Joda time when I was working on 
SPARK-4987 for Timestamp ser-de in parquet format. I think Joda offers an 
interface to get a Java object from a Joda time object natively.

For example, to transform a java.util.Date (parent of java.sql.Date and 
java.sql.Timestamp) object named jd, in Java code you can use
DateTime dt = new DateTime(jd);
or in Scala code
val dt: DateTime = new DateTime(jd)

On the other hand, given a DateTime object named dt, you can use code like
val jd: java.sql.Timestamp = new Timestamp(dt.getMillis)
to get the Java object.
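
Putting the two together, a sketch of converting a Joda-based class into a 
Timestamp-based one before handing it to Spark SQL (the Event/EventRow case classes 
are made up for illustration, and createDataFrame is the 1.3-style API, assuming a 
shell-style sc and sqlContext):

import java.sql.Timestamp
import org.joda.time.DateTime

case class Event(id: Int, when: DateTime)        // hypothetical domain class using Joda time
case class EventRow(id: Int, when: Timestamp)    // Spark SQL friendly mirror

val events = sc.parallelize(Seq(Event(1, DateTime.now), Event(2, DateTime.now)))
val rows = events.map(e => EventRow(e.id, new Timestamp(e.when.getMillis)))
sqlContext.createDataFrame(rows).registerTempTable("events")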

Thanks,
Daoyuan.

From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Sunday, April 12, 2015 11:51 PM
To: Justin Yip
Cc: adamgerst; user@spark.apache.org
Subject: Re: How to use Joda Time with Spark SQL?

These common UDTs can always be wrapped in libraries and published to 
spark-packages http://spark-packages.org/ :-)

Cheng
On 4/12/15 3:00 PM, Justin Yip wrote:
Cheng, this is great info. I have a follow-up question. There are a few very 
common data types (e.g. Joda DateTime) that are not directly supported by 
SparkSQL. Do you know if there are any plans for accommodating some common data 
types in SparkSQL? They don't need to be first-class datatypes, but if they 
are available as UDTs and provided by the SparkSQL library, that will make 
DataFrame users' lives easier.

Justin

On Sat, Apr 11, 2015 at 5:41 AM, Cheng Lian <lian.cs@gmail.com> wrote:
One possible approach can be defining a UDT (user-defined type) for Joda time. 
A UDT maps an arbitrary type to and from Spark SQL data types. You may check 
the ExamplePointUDT [1] for more details.

[1]: 
https://github.com/apache/spark/blob/694aef0d71d2683eaf63cbd1d8e95c2da423b72e/sql/core/src/main/scala/org/apache/spark/sql/test/ExamplePointUDT.scala
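
For reference, a very rough sketch of what such a UDT might look like, mapping 
DateTime to a plain LongType (millis since epoch). This is only meant to show the 
shape of the methods ExamplePointUDT implements; the UDT API was a developer API in 
the 1.x line, and actually wiring it up needs the @SQLUserDefinedType annotation on 
the user class (or a wrapper), since Joda's DateTime itself cannot be annotated:

import org.apache.spark.sql.types._
import org.joda.time.DateTime

class DateTimeUDT extends UserDefinedType[DateTime] {
  // store the value as epoch millis internally
  override def sqlType: DataType = LongType
  override def serialize(obj: Any): Any = obj match {
    case dt: DateTime => dt.getMillis
  }
  override def deserialize(datum: Any): DateTime = datum match {
    case millis: Long => new DateTime(millis)
  }
  override def userClass: Class[DateTime] = classOf[DateTime]
}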


On 4/8/15 6:09 AM, adamgerst wrote:
I've been using Joda Time in all my spark jobs (by using the nscala-time
package) and have not run into any issues until I started trying to use
spark sql.  When I try to convert a case class that has a
com.github.nscala_time.time.Imports.DateTime object in it, an exception is
thrown with a MatchError.

My assumption is that this is because the basic types of spark sql are
java.sql.Timestamp and java.sql.Date and therefore spark doesn't know what to
do about the DateTime value.

How can I get around this? I would prefer not to have to change my code to
make the values be Timestamps but I'm concerned that might be the only way.
Would something like implicit conversions work here?

It seems that even if I specify the schema manually then I would still have
the issue since you have to specify the column type which has to be of type
org.apache.spark.sql.types.DataType



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-Joda-Time-with-Spark-SQL-tp22415.html




RE: Explanation on the Hive in the Spark assembly

2015-03-13 Thread Wang, Daoyuan
Hi bit1129,

1. The Hive in the Spark assembly has most of Hive's dependencies on Hadoop removed, 
to avoid conflicts.
2. This Hive is used to run some native commands, which do not rely on Spark 
or MapReduce.

Thanks,
Daoyuan

From: bit1...@163.com [mailto:bit1...@163.com]
Sent: Friday, March 13, 2015 4:24 PM
To: user
Subject: Explanation on the Hive in the Spark assembly

Hi, sparkers,

I am kind of confused about Hive in the Spark assembly. I think Hive in the 
Spark assembly is not the same thing as Hive on Spark (Hive on Spark is meant 
to run Hive using the Spark execution engine).
So, my questions are:
1. What is the difference between Hive in the Spark assembly and Hive on Hadoop?
2. Does Hive in the Spark assembly use the Spark execution engine or the Hadoop MR 
engine?
Thanks.



bit1...@163.com


RE: SparkSQL production readiness

2015-02-28 Thread Wang, Daoyuan
Hopefully the alpha tag will be removed in 1.4.0, if the community can review 
code a little bit faster :P

Thanks,
Daoyuan

From: Ashish Mukherjee [mailto:ashish.mukher...@gmail.com]
Sent: Saturday, February 28, 2015 4:28 PM
To: user@spark.apache.org
Subject: SparkSQL production readiness

Hi,

I am exploring SparkSQL for my purposes of performing large relational 
operations across a cluster. However, it seems to be in alpha right now. Is 
there any indication when it would be considered production-level? I don't see 
any info on the site.

Regards,
Ashish


RE: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread Wang, Daoyuan
Hi Xuelin,

What version of Spark are you using?

Thanks,
Daoyuan

From: Xuelin Cao [mailto:xuelincao2...@gmail.com]
Sent: Tuesday, January 20, 2015 5:22 PM
To: User
Subject: IF statement doesn't work in Spark-SQL?


Hi,

  I'm trying to migrate some hive scripts to Spark-SQL. However, I found 
some statements are incompatible with Spark-SQL.

  Here is my SQL. The same SQL works fine in the Hive environment.

SELECT
  if(ad_user_id>1000, 1000, ad_user_id) as user_id
FROM
  ad_search_keywords

 What I found is that the parser reports an error on the "if" statement:

No function to evaluate expression. type: AttributeReference, tree: ad_user_id#4


 Anyone have any idea about this?




RE: MatchError in JsonRDD.toLong

2015-01-19 Thread Wang, Daoyuan
Yes, actually that is exactly what I mean. And maybe you missed my last 
response: you can use the API
jsonRDD(json: RDD[String], schema: StructType)
to specify your schema explicitly. For numbers bigger than Long, we can use 
DecimalType.
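
A minimal sketch of that, written against the 1.3-style API (where the type classes 
live under org.apache.spark.sql.types; earlier releases expose them from a different 
package), reusing the field names and the sqlc/rdd values from the snippet quoted 
below in this thread:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("Click", StringType, nullable = true),
  StructField("Impression", LongType, nullable = true),
  // DisplayURL can exceed Long.MaxValue, so declare it as a decimal
  StructField("DisplayURL", DecimalType.Unlimited, nullable = true),
  StructField("AdId", LongType, nullable = true)))

val json = sqlc.jsonRDD(rdd, schema)   // no sampling-based inference when the schema is given
json.registerTempTable("test")
sqlc.sql("SELECT * FROM test").collect()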

Thanks,
Daoyuan


From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Tuesday, January 20, 2015 9:26 AM
To: Wang, Daoyuan
Cc: user
Subject: Re: MatchError in JsonRDD.toLong

Hi,

On Fri, Jan 16, 2015 at 6:14 PM, Wang, Daoyuan <daoyuan.w...@intel.com> wrote:
The second parameter of jsonRDD is the sampling ratio when we infer schema.

OK, I was aware of this, but I guess I understand the problem now. My sampling 
ratio is so low that I only see the Long-sized values of the data items, so the 
column is inferred as a Long. When I hit data that's actually too big for a Long, 
I get the error I posted; basically it's the same situation as when specifying a 
wrong schema manually.

So is there any way around this other than increasing the sample ratio to 
discover also the very BigDecimal-sized numbers?

Thanks
Tobias



RE: MatchError in JsonRDD.toLong

2015-01-16 Thread Wang, Daoyuan
And you can use jsonRDD(json: RDD[String], schema: StructType) to specify your 
schema explicitly. For numbers larger than Long, we can use DecimalType.

Thanks,
Daoyuan

From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com]
Sent: Friday, January 16, 2015 5:14 PM
To: Tobias Pfeiffer
Cc: user
Subject: RE: MatchError in JsonRDD.toLong

The second parameter of jsonRDD is the sampling ratio when we infer schema.

Thanks,
Daoyuan

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Friday, January 16, 2015 5:11 PM
To: Wang, Daoyuan
Cc: user
Subject: Re: MatchError in JsonRDD.toLong

Hi,

On Fri, Jan 16, 2015 at 5:55 PM, Wang, Daoyuan <daoyuan.w...@intel.com> wrote:
Can you provide how you create the JsonRDD?

This should be reproducible in the Spark shell:

-
import org.apache.spark.sql._
val sqlc = new SQLContext(sc)
val rdd = sc.parallelize("""{"Click":"nonclicked", "Impression":1, 
"DisplayURL":4401798909506983219, "AdId":21215341}""" ::
 """{"Click":"nonclicked", "Impression":1, 
"DisplayURL":14452800566866169008, "AdId":10587781}""" :: Nil)

// works fine
val json = sqlc.jsonRDD(rdd)
json.registerTempTable("test")
sqlc.sql("SELECT * FROM test").collect

// -> MatchError
val json2 = sqlc.jsonRDD(rdd, 0.1)
json2.registerTempTable("test2")
sqlc.sql("SELECT * FROM test2").collect
-

I guess the issue in the latter case is that the column is inferred as Long 
when some rows actually are too big for Long...

Thanks
Tobias



RE: MatchError in JsonRDD.toLong

2015-01-16 Thread Wang, Daoyuan
The second parameter of jsonRDD is the sampling ratio when we infer schema.

Thanks,
Daoyuan

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Friday, January 16, 2015 5:11 PM
To: Wang, Daoyuan
Cc: user
Subject: Re: MatchError in JsonRDD.toLong

Hi,

On Fri, Jan 16, 2015 at 5:55 PM, Wang, Daoyuan <daoyuan.w...@intel.com> wrote:
Can you provide how you create the JsonRDD?

This should be reproducible in the Spark shell:

-
import org.apache.spark.sql._
val sqlc = new SQLContext(sc)
val rdd = sc.parallelize("""{"Click":"nonclicked", "Impression":1, 
"DisplayURL":4401798909506983219, "AdId":21215341}""" ::
 """{"Click":"nonclicked", "Impression":1, 
"DisplayURL":14452800566866169008, "AdId":10587781}""" :: Nil)

// works fine
val json = sqlc.jsonRDD(rdd)
json.registerTempTable("test")
sqlc.sql("SELECT * FROM test").collect

// -> MatchError
val json2 = sqlc.jsonRDD(rdd, 0.1)
json2.registerTempTable("test2")
sqlc.sql("SELECT * FROM test2").collect
-

I guess the issue in the latter case is that the column is inferred as Long 
when some rows actually are too big for Long...

Thanks
Tobias



RE: MatchError in JsonRDD.toLong

2015-01-16 Thread Wang, Daoyuan
Hi Tobias,

Can you provide how you create the JsonRDD?

Thanks,
Daoyuan


From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Friday, January 16, 2015 4:01 PM
To: user
Subject: Re: MatchError in JsonRDD.toLong

Hi again,

On Fri, Jan 16, 2015 at 4:25 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
Now I'm wondering where this comes from (I haven't touched this component in a 
while, nor upgraded Spark etc.) [...]

So the reason that the error is showing up now is that suddenly data from a 
different dataset is showing up in my test dataset... don't ask me... anyway, 
this different dataset contains data like

  {"Click":"nonclicked", "Impression":1,
   "DisplayURL":4401798909506983219, "AdId":21215341, ...}
  {"Click":"nonclicked", "Impression":1,
   "DisplayURL":14452800566866169008, "AdId":10587781, ...}

and the DisplayURL seems to be too long for Long, while it is still inferred as 
a Long column.

So, what to do about this? Is jsonRDD inherently incapable of handling those 
long numbers or is it just an issue in the schema inference and I should file a 
JIRA issue?

Thanks
Tobias


RE: SparkSQL Timestamp query failure

2014-11-23 Thread Wang, Daoyuan
Hi,

I think you can try
 cast(l.timestamp as string)='2012-10-08 16:10:36.0'
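
Spelled out as a full statement (a sketch; "logs" stands in for the actual table name):

sqlContext.sql("SELECT * FROM logs l " +
  "WHERE cast(l.timestamp as string) = '2012-10-08 16:10:36.0'")
  .collect().foreach(println)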

Thanks,
Daoyuan

-Original Message-
From: whitebread [mailto:ale.panebia...@me.com] 
Sent: Sunday, November 23, 2014 12:11 AM
To: u...@spark.incubator.apache.org
Subject: Re: SparkSQL Timestamp query failure

Thanks for your answer Akhil, 

I have already tried that, and the query doesn't fail, but it doesn't return 
anything either, although it should.
Using single quotes, I think it reads the value as a string and not as a timestamp.

I don't know how to solve this. Any other hint by any chance?

Thanks,

Alessandro



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Timestamp-query-failure-tp19502p19554.html



RE: spark sql: timestamp in json - fails

2014-10-20 Thread Wang, Daoyuan
Yes, SPARK-3853 was just merged 11 days ago. It should be OK in 1.2.0. And for 
the first approach, it would be OK after SPARK-4003 is merged.

-Original Message-
From: tridib [mailto:tridib.sama...@live.com] 
Sent: Tuesday, October 21, 2014 11:09 AM
To: u...@spark.incubator.apache.org
Subject: RE: spark sql: timestamp in json - fails

Spark 1.1.0



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-timestamp-in-json-fails-tp16864p16888.html



RE: spark sql: timestamp in json - fails

2014-10-20 Thread Wang, Daoyuan
The exception of second approach, has been resolved by SPARK-3853.

Thanks,
Daoyuan

-Original Message-
From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com] 
Sent: Tuesday, October 21, 2014 11:06 AM
To: tridib; u...@spark.incubator.apache.org
Subject: RE: spark sql: timestamp in json - fails

That's weird, I think we have that Pattern match in enforceCorrectType. What 
version of spark are you using?

Thanks,
Daoyuan

-Original Message-
From: tridib [mailto:tridib.sama...@live.com]
Sent: Tuesday, October 21, 2014 11:03 AM
To: u...@spark.incubator.apache.org
Subject: Re: spark sql: timestamp in json - fails

Stack trace for my second case:


2014-10-20 23:00:36,903 ERROR [Executor task launch worker-0] executor.Executor 
(Logging.scala:logError(96)) - Exception in task 0.0 in stage 0.0 (TID 0)
scala.MatchError: TimestampType (of class
org.apache.spark.sql.catalyst.types.TimestampType$)
at
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:348)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$12.apply(JsonRDD.scala:381)
at scala.Option.map(Option.scala:145)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:380)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:365)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
at
org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:365)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:38)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:38)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2014-10-20 23:00:36,933 WARN  [Result resolver thread-1] 
scheduler.TaskSetManager (Logging.scala:logWarning(71)) - Lost task 0.0 in 
stage 0.0 (TID 0, localhost): scala.MatchError: TimestampType (of class
org.apache.spark.sql.catalyst.types.TimestampType$)
   
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:348)
   
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$12.apply(JsonRDD.scala:381)
scala.Option.map(Option.scala:145)
   
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:380)
   
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:365)
scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
   
org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:365)
   
org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:38)
   
org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:38)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala

RE: spark sql: timestamp in json - fails

2014-10-20 Thread Wang, Daoyuan
That's weird, I think we have that Pattern match in enforceCorrectType. What 
version of spark are you using?

Thanks,
Daoyuan

-Original Message-
From: tridib [mailto:tridib.sama...@live.com] 
Sent: Tuesday, October 21, 2014 11:03 AM
To: u...@spark.incubator.apache.org
Subject: Re: spark sql: timestamp in json - fails

Stack trace for my second case:


2014-10-20 23:00:36,903 ERROR [Executor task launch worker-0] executor.Executor 
(Logging.scala:logError(96)) - Exception in task 0.0 in stage 0.0 (TID 0)
scala.MatchError: TimestampType (of class
org.apache.spark.sql.catalyst.types.TimestampType$)
at
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:348)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$12.apply(JsonRDD.scala:381)
at scala.Option.map(Option.scala:145)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:380)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:365)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
at
org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:365)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:38)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:38)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2014-10-20 23:00:36,933 WARN  [Result resolver thread-1] 
scheduler.TaskSetManager (Logging.scala:logWarning(71)) - Lost task 0.0 in 
stage 0.0 (TID 0, localhost): scala.MatchError: TimestampType (of class
org.apache.spark.sql.catalyst.types.TimestampType$)
   
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:348)
   
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$12.apply(JsonRDD.scala:381)
scala.Option.map(Option.scala:145)
   
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:380)
   
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:365)
scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
   
org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:365)
   
org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:38)
   
org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:38)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scal

RE: spark sql: timestamp in json - fails

2014-10-20 Thread Wang, Daoyuan
Seems I made a mistake…

From: Wang, Daoyuan
Sent: Tuesday, October 21, 2014 10:35 AM
To: 'Yin Huai'
Cc: Michael Armbrust; tridib; u...@spark.incubator.apache.org
Subject: RE: spark sql: timestamp in json - fails

I got that, it is in JsonRDD.java of `typeOfPrimitiveValues`. I’ll fix that 
together.

Thanks,
Daoyuan

From: Yin Huai [mailto:huaiyin@gmail.com]
Sent: Tuesday, October 21, 2014 10:13 AM
To: Wang, Daoyuan
Cc: Michael Armbrust; tridib; 
u...@spark.incubator.apache.org<mailto:u...@spark.incubator.apache.org>
Subject: Re: spark sql: timestamp in json - fails

Seems the second approach does not go through applySchema. So, I was wondering 
if there is an issue related to our JSON apis in Java.

On Mon, Oct 20, 2014 at 10:04 PM, Wang, Daoyuan <daoyuan.w...@intel.com> wrote:
I think this has something to do with my recent work at 
https://issues.apache.org/jira/browse/SPARK-4003
You can check PR https://github.com/apache/spark/pull/2850 .

Thanks,
Daoyuan

From: Yin Huai [mailto:huaiyin@gmail.com<mailto:huaiyin@gmail.com>]
Sent: Tuesday, October 21, 2014 10:00 AM
To: Michael Armbrust
Cc: tridib; 
u...@spark.incubator.apache.org<mailto:u...@spark.incubator.apache.org>
Subject: Re: spark sql: timestamp in json - fails

Hi Tridib,

For the second approach, can you attach the complete stack trace?

Thanks,

Yin

On Mon, Oct 20, 2014 at 8:24 PM, Michael Armbrust <mich...@databricks.com> wrote:
I think you are running into a bug that will be fixed by this PR: 
https://github.com/apache/spark/pull/2850

On Mon, Oct 20, 2014 at 4:34 PM, tridib <tridib.sama...@live.com> wrote:
Hello Experts,
After repeated attempts I am unable to run a query on a mapped json date string. I
tried two approaches:

*** Approach 1 *** created a Bean class with a timestamp field. When I try to
run it I get scala.MatchError: class java.sql.Timestamp (of class
java.lang.Class). Here is the code:
import java.sql.Timestamp;

public class ComplexClaim {
private Timestamp timestamp;

public Timestamp getTimestamp() {
return timestamp;
}

public void setTimestamp(Timestamp timestamp) {
this.timestamp = timestamp;
}
}

JavaSparkContext ctx = getCtx(sc);
JavaSQLContext sqlCtx = getSqlCtx(getCtx(sc));
String path = "/hdd/spark/test.json";
JavaSchemaRDD test = sqlCtx.applySchema(ctx.textFile(path) ,
ComplexClaim.class);
sqlCtx.registerRDDAsTable(test, "test");
execSql(sqlCtx, "select * from test", 1);



*** Approach 2 ***
Created a StructType to map the date field. I got scala.MatchError:
TimestampType (of class org.apache.spark.sql.catalyst.types.TimestampType$).
here is the code:

public StructType createStructType() {
List<StructField> fields = new ArrayList<StructField>();
fields.add(DataType.createStructField("timestamp",
DataType.TimestampType, false));
return DataType.createStructType(fields);
}

public void testJsonStruct(SparkContext sc) {
JavaSQLContext sqlCtx = getSqlCtx(getCtx(sc));
String path = "/hdd/spark/test.json";
JavaSchemaRDD test = sqlCtx.jsonFile(path, createStructType());
sqlCtx.registerRDDAsTable(test, "test");
execSql(sqlCtx, "select * from test", 1);
}

Input file has a single record:
{"timestamp":"2014-10-10T01:01:01"}


Thanks
Tridib





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-timestamp-in-json-fails-tp16864.html





RE: spark sql: timestamp in json - fails

2014-10-20 Thread Wang, Daoyuan
I got that, it is in JsonRDD.java of `typeOfPrimitiveValues`. I’ll fix that 
together.

Thanks,
Daoyuan

From: Yin Huai [mailto:huaiyin@gmail.com]
Sent: Tuesday, October 21, 2014 10:13 AM
To: Wang, Daoyuan
Cc: Michael Armbrust; tridib; u...@spark.incubator.apache.org
Subject: Re: spark sql: timestamp in json - fails

Seems the second approach does not go through applySchema. So, I was wondering 
if there is an issue related to our JSON apis in Java.

On Mon, Oct 20, 2014 at 10:04 PM, Wang, Daoyuan <daoyuan.w...@intel.com> wrote:
I think this has something to do with my recent work at 
https://issues.apache.org/jira/browse/SPARK-4003
You can check PR https://github.com/apache/spark/pull/2850 .

Thanks,
Daoyuan

From: Yin Huai [mailto:huaiyin@gmail.com<mailto:huaiyin@gmail.com>]
Sent: Tuesday, October 21, 2014 10:00 AM
To: Michael Armbrust
Cc: tridib; 
u...@spark.incubator.apache.org<mailto:u...@spark.incubator.apache.org>
Subject: Re: spark sql: timestamp in json - fails

Hi Tridib,

For the second approach, can you attach the complete stack trace?

Thanks,

Yin

On Mon, Oct 20, 2014 at 8:24 PM, Michael Armbrust <mich...@databricks.com> wrote:
I think you are running into a bug that will be fixed by this PR: 
https://github.com/apache/spark/pull/2850

On Mon, Oct 20, 2014 at 4:34 PM, tridib <tridib.sama...@live.com> wrote:
Hello Experts,
After repeated attempts I am unable to run a query on a mapped json date string. I
tried two approaches:

*** Approach 1 *** created a Bean class with a timestamp field. When I try to
run it I get scala.MatchError: class java.sql.Timestamp (of class
java.lang.Class). Here is the code:
import java.sql.Timestamp;

public class ComplexClaim {
private Timestamp timestamp;

public Timestamp getTimestamp() {
return timestamp;
}

public void setTimestamp(Timestamp timestamp) {
this.timestamp = timestamp;
}
}

JavaSparkContext ctx = getCtx(sc);
JavaSQLContext sqlCtx = getSqlCtx(getCtx(sc));
String path = "/hdd/spark/test.json";
JavaSchemaRDD test = sqlCtx.applySchema(ctx.textFile(path) ,
ComplexClaim.class);
sqlCtx.registerRDDAsTable(test, "test");
execSql(sqlCtx, "select * from test", 1);



*** Approach 2 ***
Created a StructType to map the date field. I got scala.MatchError:
TimestampType (of class org.apache.spark.sql.catalyst.types.TimestampType$).
here is the code:

public StructType createStructType() {
List<StructField> fields = new ArrayList<StructField>();
fields.add(DataType.createStructField("timestamp",
DataType.TimestampType, false));
return DataType.createStructType(fields);
}

public void testJsonStruct(SparkContext sc) {
JavaSQLContext sqlCtx = getSqlCtx(getCtx(sc));
String path = "/hdd/spark/test.json";
JavaSchemaRDD test = sqlCtx.jsonFile(path, createStructType());
sqlCtx.registerRDDAsTable(test, "test");
execSql(sqlCtx, "select * from test", 1);
}

Input file has a single record:
{"timestamp":"2014-10-10T01:01:01"}


Thanks
Tridib





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-timestamp-in-json-fails-tp16864.html





RE: spark sql: timestamp in json - fails

2014-10-20 Thread Wang, Daoyuan
I think this has something to do with my recent work at 
https://issues.apache.org/jira/browse/SPARK-4003
You can check PR https://github.com/apache/spark/pull/2850 .

Thanks,
Daoyuan

From: Yin Huai [mailto:huaiyin@gmail.com]
Sent: Tuesday, October 21, 2014 10:00 AM
To: Michael Armbrust
Cc: tridib; u...@spark.incubator.apache.org
Subject: Re: spark sql: timestamp in json - fails

Hi Tridib,

For the second approach, can you attach the complete stack trace?

Thanks,

Yin

On Mon, Oct 20, 2014 at 8:24 PM, Michael Armbrust <mich...@databricks.com> wrote:
I think you are running into a bug that will be fixed by this PR: 
https://github.com/apache/spark/pull/2850

On Mon, Oct 20, 2014 at 4:34 PM, tridib <tridib.sama...@live.com> wrote:
Hello Experts,
After repeated attempts I am unable to run a query on a mapped json date string. I
tried two approaches:

*** Approach 1 *** created a Bean class with a timestamp field. When I try to
run it I get scala.MatchError: class java.sql.Timestamp (of class
java.lang.Class). Here is the code:
import java.sql.Timestamp;

public class ComplexClaim {
private Timestamp timestamp;

public Timestamp getTimestamp() {
return timestamp;
}

public void setTimestamp(Timestamp timestamp) {
this.timestamp = timestamp;
}
}

JavaSparkContext ctx = getCtx(sc);
JavaSQLContext sqlCtx = getSqlCtx(getCtx(sc));
String path = "/hdd/spark/test.json";
JavaSchemaRDD test = sqlCtx.applySchema(ctx.textFile(path) ,
ComplexClaim.class);
sqlCtx.registerRDDAsTable(test, "test");
execSql(sqlCtx, "select * from test", 1);



*** Approach 2 ***
Created a StructType to map the date field. I got scala.MatchError:
TimestampType (of class org.apache.spark.sql.catalyst.types.TimestampType$).
Here is the code:

public StructType createStructType() {
List<StructField> fields = new ArrayList<StructField>();
fields.add(DataType.createStructField("timestamp",
DataType.TimestampType, false));
return DataType.createStructType(fields);
}

public void testJsonStruct(SparkContext sc) {
JavaSQLContext sqlCtx = getSqlCtx(getCtx(sc));
String path = "/hdd/spark/test.json";
JavaSchemaRDD test = sqlCtx.jsonFile(path, createStructType());
sqlCtx.registerRDDAsTable(test, "test");
execSql(sqlCtx, "select * from test", 1);
}

Input file has a single record:
{"timestamp":"2014-10-10T01:01:01"}


Thanks
Tridib





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-timestamp-in-json-fails-tp16864.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.





RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Wang, Daoyuan
I have created an issue for this 
https://issues.apache.org/jira/browse/SPARK-4003


From: Cheng, Hao
Sent: Monday, October 20, 2014 9:20 AM
To: Ge, Yao (Y.); Wang, Daoyuan; user@spark.apache.org
Subject: RE: scala.MatchError: class java.sql.Timestamp

It looks like a bug in JavaSQLContext.getSchema(), which doesn't enumerate all of 
the data types supported by Catalyst.
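
Since getSchema() works purely by reflection over the bean's getters, the MatchError
is thrown when applySchema() is called, before any rows are read. A minimal sketch of
that failure mode (the class name below is hypothetical):

import java.io.Serializable;
import java.sql.Timestamp;

public class BeanWithTimestamp implements Serializable {
    private Timestamp ts;   // java.sql.Timestamp is the type getSchema() fails to match

    public Timestamp getTs() { return ts; }
    public void setTs(Timestamp ts) { this.ts = ts; }
}

// JavaSchemaRDD rdd = sqlCtx.applySchema(beans, BeanWithTimestamp.class);
//   -> scala.MatchError: class java.sql.Timestamp (of class java.lang.Class)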

From: Ge, Yao (Y.) [mailto:y...@ford.com]
Sent: Sunday, October 19, 2014 11:44 PM
To: Wang, Daoyuan; user@spark.apache.org
Subject: RE: scala.MatchError: class java.sql.Timestamp

scala.MatchError: class java.sql.Timestamp (of class java.lang.Class)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188)
at 
org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90)
at 
com.ford.dtc.ff.SessionStats.testTimeStamp(SessionStats.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown 
Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at 
org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)


From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com]
Sent: Sunday, October 19, 2014 10:31 AM
To: Ge, Yao (Y.); user@spark.apache.org
Subject: RE: scala.MatchError: class java.sql.Timestamp

Can you provide the exception stack?

Thanks,
Daoyuan

From: Ge, Yao (Y.) [mailto:y...@ford.com]
Sent: Sunday, October 19, 2014 10:17 PM
To: user@spark.apache.org
Subject: scala.MatchError: class java.sql.Timestamp

I am working with Spark 1.1.0 and I believe Timestamp is a supported data type 
for Spark SQL. However, I keep getting this MatchError for java.sql.Timestamp 
when I try to use reflection to register a Java Bean with a Timestamp field. 
Is anything wrong with my code below?

public static class Event implements Serializable {
private String name;
private Timestamp time;
public String getName() {
return name;
}
   

RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Wang, Daoyuan
Can you provide the exception stack?

Thanks,
Daoyuan

From: Ge, Yao (Y.) [mailto:y...@ford.com]
Sent: Sunday, October 19, 2014 10:17 PM
To: user@spark.apache.org
Subject: scala.MatchError: class java.sql.Timestamp

I am working with Spark 1.1.0 and I believe Timestamp is a supported data type 
for Spark SQL. However, I keep getting this MatchError for java.sql.Timestamp 
when I try to use reflection to register a Java Bean with a Timestamp field. 
Is anything wrong with my code below?

public static class Event implements Serializable {
private String name;
private Timestamp time;
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public Timestamp getTime() {
return time;
}
public void setTime(Timestamp time) {
this.time = time;
}
}

@Test
public void testTimeStamp() {
    JavaSparkContext sc = new JavaSparkContext("local", "timestamp");
    String[] data = {"1,2014-01-01", "2,2014-02-01"};
    JavaRDD<String> input = sc.parallelize(Arrays.asList(data));
    JavaRDD<Event> events = input.map(new Function<String, Event>() {
        public Event call(String arg0) throws Exception {
            String[] c = arg0.split(",");
            Event e = new Event();
            e.setName(c[0]);
            DateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
            e.setTime(new Timestamp(fmt.parse(c[1]).getTime()));
            return e;
        }
    });

    JavaSQLContext sqlCtx = new JavaSQLContext(sc);
    JavaSchemaRDD schemaEvent = sqlCtx.applySchema(events, Event.class);
    schemaEvent.registerTempTable("event");

    sc.stop();
}


RE: SQL queries fail in 1.2.0-SNAPSHOT

2014-09-29 Thread Wang, Daoyuan
Also some lines on another node :

14/09/30 10:22:31 ERROR nio.NioBlockTransferService: Exception handling buffer 
message
java.io.IOException: Error in reading 
org.apache.spark.network.FileSegmentManagedBuffer(/mnt/DP_disk10/animal/spark/spark-local-20140930101701-c9ee/38/shuffle_6_162_0.data,
 21118074, 544623) (actual file length 769648)
 at 
org.apache.spark.network.FileSegmentManagedBuffer.nioByteBuffer(ManagedBuffer.scala:80)
at 
org.apache.spark.network.nio.NioBlockTransferService.getBlock(NioBlockTransferService.scala:203)
 at 
org.apache.spark.network.nio.NioBlockTransferService.org$apache$spark$network$nio$NioBlockTransferService$$processBlockMessage(NioBlockTransferService.scala:179)
 at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$2.apply(NioBlockTransferService.scala:149)
 at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$2.apply(NioBlockTransferService.scala:149)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
 at 
org.apache.spark.network.nio.BlockMessageArray.foreach(BlockMessageArray.scala:28)
 at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at 
org.apache.spark.network.nio.BlockMessageArray.map(BlockMessageArray.scala:28)
 at 
org.apache.spark.network.nio.NioBlockTransferService.org$apache$spark$network$nio$NioBlockTransferService$$onBlockMessageReceive(NioBlockTransferService.scala:149)
 at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$init$1.apply(NioBlockTransferService.scala:68)
 at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$init$1.apply(NioBlockTransferService.scala:68)
 at 
org.apache.spark.network.nio.ConnectionManager.org$apache$spark$network$nio$ConnectionManager$$handleMessage(ConnectionManager.scala:677)
 at 
org.apache.spark.network.nio.ConnectionManager$$anon$10.run(ConnectionManager.scala:515)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Channel not open for writing - cannot extend 
file to required size
 at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:868)
 at 
org.apache.spark.network.FileSegmentManagedBuffer.nioByteBuffer(ManagedBuffer.scala:75)
 ... 20 more


From: Wang, Daoyuan
Sent: Tuesday, September 30, 2014 11:20 AM
To: Wang, Daoyuan; Reynold Xin
Cc: user@spark.apache.org
Subject: RE: SQL queries fail in 1.2.0-SNAPSHOT

And the 
/mnt/DP_disk2/animal/spark/spark-local-20140930102549-622d/11/shuffle_6_191_0.data
 file is comparatively much smaller than other shuffle*.data files


From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com]
Sent: Tuesday, September 30, 2014 10:54 AM
To: Reynold Xin
Cc: user@spark.apache.org
Subject: RE: SQL queries fail in 1.2.0-SNAPSHOT

Hi Reynold,

It seems I am getting an offset much larger than the file size:
reading 
org.apache.spark.network.FileSegmentManagedBuffer(/mnt/DP_disk2/animal/spark/spark-local-20140930102549-622d/11/shuffle_6_191_0.data,
 3154043, 588396) (actual file length 676025)
 at 
org.apache.spark.network.FileSegmentManagedBuffer.nioByteBuffer(ManagedBuffer.scala:80)
at 
org.apache.spark.network.nio.NioBlockTransferService.getBlock(NioBlockTransferService.scala:203)
 at 
org.apache.spark.network.nio.NioBlockTransferService.org$apache$spark$network$nio$NioBlockTransferService$$processBlockMessage(NioBlockTransferService.scala:179)
 at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$2.apply(NioBlockTransferService.scala:149)
 at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$2.apply(NioBlockTransferService.scala:149)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)


java.io.IOException: Error in reading 
org.apache.spark.network.FileSegmentManagedBuffer(/mnt/DP_disk2/animal/spark/spark-local-20140930102549-622d/11/shuffle_6_191_0.data,
 5618712, 616204) (actual file length 676025)

 at 
org.apache.spark.network.FileSegmentManagedBuffer.nioByteBuffer(ManagedBuffer.scala:80)


RE: SQL queries fail in 1.2.0-SNAPSHOT

2014-09-29 Thread Wang, Daoyuan
And the 
/mnt/DP_disk2/animal/spark/spark-local-20140930102549-622d/11/shuffle_6_191_0.data
 file is comparatively much smaller than other shuffle*.data files


From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com]
Sent: Tuesday, September 30, 2014 10:54 AM
To: Reynold Xin
Cc: user@spark.apache.org
Subject: RE: SQL queries fail in 1.2.0-SNAPSHOT

Hi Reynold,

It seems I am getting an offset much larger than the file size:
reading 
org.apache.spark.network.FileSegmentManagedBuffer(/mnt/DP_disk2/animal/spark/spark-local-20140930102549-622d/11/shuffle_6_191_0.data,
 3154043, 588396) (actual file length 676025)
 at 
org.apache.spark.network.FileSegmentManagedBuffer.nioByteBuffer(ManagedBuffer.scala:80)
at 
org.apache.spark.network.nio.NioBlockTransferService.getBlock(NioBlockTransferService.scala:203)
 at 
org.apache.spark.network.nio.NioBlockTransferService.org$apache$spark$network$nio$NioBlockTransferService$$processBlockMessage(NioBlockTransferService.scala:179)
 at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$2.apply(NioBlockTransferService.scala:149)
 at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$2.apply(NioBlockTransferService.scala:149)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)


java.io.IOException: Error in reading 
org.apache.spark.network.FileSegmentManagedBuffer(/mnt/DP_disk2/animal/spark/spark-local-20140930102549-622d/11/shuffle_6_191_0.data,
 5618712, 616204) (actual file length 676025)

 at 
org.apache.spark.network.FileSegmentManagedBuffer.nioByteBuffer(ManagedBuffer.scala:80)

at 
org.apache.spark.network.nio.NioBlockTransferService.getBlock(NioBlockTransferService.scala:203)

 at 
org.apache.spark.network.nio.NioBlockTransferService.org$apache$spark$network$nio$NioBlockTransferService$$processBlockMessage(NioBlockTransferService.scala:179)

 at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$2.apply(NioBlockTransferService.scala:149)

 at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$2.apply(NioBlockTransferService.scala:149)

 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

 at scala.collection.Iterator$class.foreach(Iterator.scala:727)

 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

These errors are intermittent: they occur in most but not every run, and they also depend on the queries.

Thanks,
Daoyuan

From: Reynold Xin [mailto:r...@databricks.com]
Sent: Tuesday, September 30, 2014 3:48 AM
To: Wang, Daoyuan
Cc: user@spark.apache.org
Subject: Re: SQL queries fail in 1.2.0-SNAPSHOT

Hi Daoyuan,

Do you mind applying this patch and looking at the exception again?

https://github.com/apache/spark/pull/2580


It has also been merged in master so if you pull from master, you should have 
that.


On Mon, Sep 29, 2014 at 1:17 AM, Wang, Daoyuan <daoyuan.w...@intel.com> wrote:
Hi all,

I had some of my queries running on 1.1.0-SNAPSHOT at commit b1b20301 (Aug 24), but 
on the current master branch they no longer work. I looked into the stderr 
file on an executor and found the following lines:

14/09/26 16:52:46 ERROR nio.NioBlockTransferService: Exception handling buffer 
message
java.io.IOException: Channel not open for writing - cannot extend file to 
required size
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:868)
at 
org.apache.spark.network.FileSegmentManagedBuffer.nioByteBuffer(ManagedBuffer.scala:73)
at 
org.apache.spark.network.nio.NioBlockTransferService.getBlock(NioBlockTransferService.scala:203)
at 
org.apache.spark.network.nio.NioBlockTransferService.org$apache$spark$network$nio$NioBlockTransferService$$processBlockMessage(NioBlockTransferService.scala:179)
at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$2.apply(NioBlockTransferService.scala:149)
at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$2.apply(NioBlockTransferService.scala:149)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at 
org.apache.spark.network.nio.BlockMessageArray.foreach(Block

RE: SQL queries fail in 1.2.0-SNAPSHOT

2014-09-29 Thread Wang, Daoyuan
Hi Reynold,

It seems I am getting an offset much larger than the file size:
reading 
org.apache.spark.network.FileSegmentManagedBuffer(/mnt/DP_disk2/animal/spark/spark-local-20140930102549-622d/11/shuffle_6_191_0.data,
 3154043, 588396) (actual file length 676025)
 at 
org.apache.spark.network.FileSegmentManagedBuffer.nioByteBuffer(ManagedBuffer.scala:80)
at 
org.apache.spark.network.nio.NioBlockTransferService.getBlock(NioBlockTransferService.scala:203)
 at 
org.apache.spark.network.nio.NioBlockTransferService.org$apache$spark$network$nio$NioBlockTransferService$$processBlockMessage(NioBlockTransferService.scala:179)
 at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$2.apply(NioBlockTransferService.scala:149)
 at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$2.apply(NioBlockTransferService.scala:149)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)


java.io.IOException: Error in reading 
org.apache.spark.network.FileSegmentManagedBuffer(/mnt/DP_disk2/animal/spark/spark-local-20140930102549-622d/11/shuffle_6_191_0.data,
 5618712, 616204) (actual file length 676025)

 at 
org.apache.spark.network.FileSegmentManagedBuffer.nioByteBuffer(ManagedBuffer.scala:80)

at 
org.apache.spark.network.nio.NioBlockTransferService.getBlock(NioBlockTransferService.scala:203)

 at 
org.apache.spark.network.nio.NioBlockTransferService.org$apache$spark$network$nio$NioBlockTransferService$$processBlockMessage(NioBlockTransferService.scala:179)

 at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$2.apply(NioBlockTransferService.scala:149)

 at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$2.apply(NioBlockTransferService.scala:149)

 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

 at scala.collection.Iterator$class.foreach(Iterator.scala:727)

 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

These errors are intermittent: they occur in most but not every run, and they also depend on the queries.

Thanks,
Daoyuan

From: Reynold Xin [mailto:r...@databricks.com]
Sent: Tuesday, September 30, 2014 3:48 AM
To: Wang, Daoyuan
Cc: user@spark.apache.org
Subject: Re: SQL queries fail in 1.2.0-SNAPSHOT

Hi Daoyuan,

Do you mind applying this patch and looking at the exception again?

https://github.com/apache/spark/pull/2580


It has also been merged in master so if you pull from master, you should have 
that.


On Mon, Sep 29, 2014 at 1:17 AM, Wang, Daoyuan <daoyuan.w...@intel.com> wrote:
Hi all,

I had some of my queries running on 1.1.0-SNAPSHOT at commit b1b20301 (Aug 24), but 
on the current master branch they no longer work. I looked into the stderr 
file on an executor and found the following lines:

14/09/26 16:52:46 ERROR nio.NioBlockTransferService: Exception handling buffer 
message
java.io.IOException: Channel not open for writing - cannot extend file to 
required size
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:868)
at 
org.apache.spark.network.FileSegmentManagedBuffer.nioByteBuffer(ManagedBuffer.scala:73)
at 
org.apache.spark.network.nio.NioBlockTransferService.getBlock(NioBlockTransferService.scala:203)
at 
org.apache.spark.network.nio.NioBlockTransferService.org$apache$spark$network$nio$NioBlockTransferService$$processBlockMessage(NioBlockTransferService.scala:179)
at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$2.apply(NioBlockTransferService.scala:149)
at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$2.apply(NioBlockTransferService.scala:149)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at 
org.apache.spark.network.nio.BlockMessageArray.foreach(BlockMessageArray.scala:28)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
org.apache.spark.network.nio.BlockMessageArray.map(BlockMessageArray.scala:28)
at 
org.apache.spark.network.nio.NioBlockTransferService.org$apache$spark$network$nio$NioBl

SQL queries fail in 1.2.0-SNAPSHOT

2014-09-29 Thread Wang, Daoyuan
Hi all,

I had some of my queries running on 1.1.0-SNAPSHOT at commit b1b20301 (Aug 24), but 
on the current master branch they no longer work. I looked into the stderr 
file on an executor and found the following lines:

14/09/26 16:52:46 ERROR nio.NioBlockTransferService: Exception handling buffer 
message
java.io.IOException: Channel not open for writing - cannot extend file to 
required size
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:868)
at 
org.apache.spark.network.FileSegmentManagedBuffer.nioByteBuffer(ManagedBuffer.scala:73)
at 
org.apache.spark.network.nio.NioBlockTransferService.getBlock(NioBlockTransferService.scala:203)
at 
org.apache.spark.network.nio.NioBlockTransferService.org$apache$spark$network$nio$NioBlockTransferService$$processBlockMessage(NioBlockTransferService.scala:179)
at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$2.apply(NioBlockTransferService.scala:149)
at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$2.apply(NioBlockTransferService.scala:149)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at 
org.apache.spark.network.nio.BlockMessageArray.foreach(BlockMessageArray.scala:28)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
org.apache.spark.network.nio.BlockMessageArray.map(BlockMessageArray.scala:28)
at 
org.apache.spark.network.nio.NioBlockTransferService.org$apache$spark$network$nio$NioBlockTransferService$$onBlockMessageReceive(NioBlockTransferService.scala:149)
at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$init$1.apply(NioBlockTransferService.scala:68)
at 
org.apache.spark.network.nio.NioBlockTransferService$$anonfun$init$1.apply(NioBlockTransferService.scala:68)
at 
org.apache.spark.network.nio.ConnectionManager.org$apache$spark$network$nio$ConnectionManager$$handleMessage(ConnectionManager.scala:677)
at 
org.apache.spark.network.nio.ConnectionManager$$anon$10.run(ConnectionManager.scala:515)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Shuffle compression was turned off because I encountered a parsing_error with 
shuffle compression enabled. Even after setting the native library path, I got 
errors when uncompressing with Snappy. With shuffle compression turned off, I still 
get the message above on some of my nodes, while the others report that an ack was 
not received after 60 seconds. Does anyone have any ideas? Thanks for your help!
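
For reference, a minimal sketch of how the compression settings mentioned above can
be toggled when reproducing this. spark.shuffle.compress and
spark.shuffle.spill.compress are standard Spark configuration keys; the app name and
the choice to also disable spill compression here are only illustrative:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ShuffleCompressRepro {
    public static void main(String[] args) {
        // Turn shuffle (and, while debugging, spill) compression off so the
        // snappy/parsing_error path is taken out of the picture.
        SparkConf conf = new SparkConf()
                .setAppName("shuffle-compress-repro")
                .set("spark.shuffle.compress", "false")
                .set("spark.shuffle.spill.compress", "false");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... run the failing SQL queries here ...
        sc.stop();
    }
}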

Thanks,
Daoyuan Wang