Writing UDF with variable number of arguments

2015-10-05 Thread tridib
Hi Friends,
I want to write a UDF which takes a variable number of arguments with varying
types, e.g.:

myudf(String key1, String value1, String key2, int value2, ...)

What is the best way to do it in Spark?

Thanks
Tridib
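
One way to handle a variable number of arguments with mixed types is to implement
a Hive GenericUDF and register it through HiveContext, since a GenericUDF receives
its arguments as arrays rather than as a fixed parameter list. The sketch below is
only an illustration: the class name, the key=value output format and the
registration statements are assumptions, not something prescribed by Spark.

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Hypothetical UDF: accepts any number of arguments of any type and folds the
// key/value pairs into a single "key=value;key=value;" string.
public class MyVarArgsUdf extends GenericUDF {

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // No arity/type checks, so myudf(key1, value1, key2, value2, ...) is accepted.
        // A real UDF would normally inspect and validate 'arguments' here.
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // Sketch only: toString() is used instead of converting via the ObjectInspectors.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < arguments.length; i += 2) {
            sb.append(arguments[i].get()).append('=').append(arguments[i + 1].get()).append(';');
        }
        return sb.toString();
    }

    @Override
    public String getDisplayString(String[] children) {
        return "myudf(" + java.util.Arrays.toString(children) + ")";
    }
}

// Registration and use (assuming a HiveContext named hiveContext and the class on the classpath):
//   hiveContext.sql("CREATE TEMPORARY FUNCTION myudf AS 'MyVarArgsUdf'");
//   hiveContext.sql("SELECT myudf(key1, value1, key2, value2) FROM some_table");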






RE: nested collection object query

2015-09-29 Thread Tridib Samanta
Well, I figured out a way to use explode. But it returns two rows if there are
two matches in the nested array objects.
 
select id from department LATERAL VIEW explode(employee) dummy_table as emp 
where emp.name = 'employee0'
 
I was looking for an operator that loops through the array, returns true if any
element matches the condition, and returns the parent object only once.
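
One way to collapse those duplicates (a sketch, not necessarily the only option)
is to keep the LATERAL VIEW but select the parent id distinctly, so the parent
comes back once no matter how many array elements match. Assuming a Spark 1.3+
HiveContext named hiveContext and the schema from this thread:

// DISTINCT folds the one-row-per-matching-element output of explode back to one row per parent.
hiveContext.sql(
    "SELECT DISTINCT id FROM department "
  + "LATERAL VIEW explode(employee) dummy_table AS emp "
  + "WHERE emp.name = 'employee0'").show();
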
From: tridib.sama...@live.com
To: java8...@hotmail.com; user@spark.apache.org
Subject: RE: nested collection object query
Date: Mon, 28 Sep 2015 22:26:46 -0700




Thanks for your response, Yong! The array syntax works fine. But I am not sure
how to use explode. Should I use it as follows?
select id from department where explode(employee).name = 'employee0'
 
This query gives me java.lang.UnsupportedOperationException. I am using
HiveContext.
 
From: java8...@hotmail.com
To: tridib.sama...@live.com; user@spark.apache.org
Subject: RE: nested collection object query
Date: Mon, 28 Sep 2015 20:42:11 -0400




Your employee is in fact an array of structs, not just a struct.
If you are using HiveContext, then you can refer to it like the following:
select id from member where employee[0].name = 'employee0'
employee[0] points to the 1st element of the array.
If you want to query all the elements in the array, then you have to use
"explode" in Hive.
See Hive document for this:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode
Yong
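
For reference, a minimal sketch of the explode route described in that Hive page
(assuming a HiveContext named hiveContext), which touches every element of the
array rather than just employee[0]:

hiveContext.sql(
    "SELECT id, emp.name, emp.speciality FROM member "
  + "LATERAL VIEW explode(employee) t AS emp").collect();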

> Date: Mon, 28 Sep 2015 16:37:23 -0700
> From: tridib.sama...@live.com
> To: user@spark.apache.org
> Subject: nested collection object query
> 
> Hi Friends,
> What is the right syntax to query on collection of nested object? I have a
> following schema and SQL. But it does not return anything. Is the syntax
> correct?
> 
> root
>  |-- id: string (nullable = false)
>  |-- employee: array (nullable = false)
>  ||-- element: struct (containsNull = true)
>  |||-- id: string (nullable = false)
>  |||-- name: string (nullable = false)
>  |||-- speciality: string (nullable = false)
> 
> 
> select id from member where employee.name = 'employee0'
> 
> Uploaded a test if some one want to try it out. NestedObjectTest.java
> 
>   
> 
> 
> 
> 

  

RE: nested collection object query

2015-09-28 Thread Tridib Samanta
Thanks for your response, Yong! The array syntax works fine. But I am not sure
how to use explode. Should I use it as follows?
select id from department where explode(employee).name = 'employee0'
 
This query gives me java.lang.UnsupportedOperationException. I am using
HiveContext.
 
From: java8...@hotmail.com
To: tridib.sama...@live.com; user@spark.apache.org
Subject: RE: nested collection object query
Date: Mon, 28 Sep 2015 20:42:11 -0400




Your employee is in fact an array of structs, not just a struct.
If you are using HiveContext, then you can refer to it like the following:
select id from member where employee[0].name = 'employee0'
employee[0] points to the 1st element of the array.
If you want to query all the elements in the array, then you have to use
"explode" in Hive.
See Hive document for this:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode
Yong

> Date: Mon, 28 Sep 2015 16:37:23 -0700
> From: tridib.sama...@live.com
> To: user@spark.apache.org
> Subject: nested collection object query
> 
> Hi Friends,
> What is the right syntax to query on collection of nested object? I have a
> following schema and SQL. But it does not return anything. Is the syntax
> correct?
> 
> root
>  |-- id: string (nullable = false)
>  |-- employee: array (nullable = false)
>  ||-- element: struct (containsNull = true)
>  |||-- id: string (nullable = false)
>  |||-- name: string (nullable = false)
>  |||-- speciality: string (nullable = false)
> 
> 
> select id from member where employee.name = 'employee0'
> 
> Uploaded a test if some one want to try it out. NestedObjectTest.java
> 
>   
> 
> 
> 
> 

  

nested collection object query

2015-09-28 Thread tridib
Hi Friends,
What is the right syntax to query on collection of nested object? I have a
following schema and SQL. But it does not return anything. Is the syntax
correct?

root
 |-- id: string (nullable = false)
 |-- employee: array (nullable = false)
 ||-- element: struct (containsNull = true)
 |||-- id: string (nullable = false)
 |||-- name: string (nullable = false)
 |||-- speciality: string (nullable = false)


select id from member where employee.name = 'employee0'

Uploaded a test if some one want to try it out. NestedObjectTest.java

  







Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-23 Thread tridib
Setting spark.sql.shuffle.partitions = 2000 solved my issue. I am able to
join two 1-billion-row tables in 3 minutes.






How to control spark.sql.shuffle.partitions per query

2015-09-23 Thread tridib
I am having GC issues with the default value of spark.sql.shuffle.partitions
(200). When I increase it to 2000, the shuffle join works fine.

I want to use different values for spark.sql.shuffle.partitions depending on
data volume, for different queries which are fired from the same SparkSQL
context.

Thanks
Tridib
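
For what it's worth, a minimal sketch of that (assuming a SQLContext or
HiveContext named sqlContext; the tables and queries are made up). The setting
is an ordinary SQL configuration entry, so it can be changed between statements
on the same context and each query picks up whatever value is current when it runs:

// Big join: use many shuffle partitions.
sqlContext.setConf("spark.sql.shuffle.partitions", "2000");
sqlContext.sql("SELECT a.id FROM big_a a JOIN big_b b ON a.id = b.id").count();

// Smaller aggregation: drop back to the default.
sqlContext.setConf("spark.sql.shuffle.partitions", "200");
sqlContext.sql("SELECT dept, COUNT(*) AS c FROM small_table GROUP BY dept").count();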






Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread tridib
By skewed, did you mean it's not distributed uniformly across partitions?
All of my columns are strings of almost the same size, e.g.:

id1,field11,fields12
id2,field21,field22







Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-21 Thread tridib
Did you get any solution to this? I am getting the same issue.






RE: Official Docker container for Spark

2015-05-29 Thread Tridib Samanta
Thanks all for your replies. I was evaluating which one fits best for me. I
picked epahomov/docker-spark from the Docker registry and it suffices my needs.
 
Thanks
Tridib
 
Date: Fri, 22 May 2015 14:15:42 +0530
Subject: Re: Official Docker container for Spark
From: riteshoneinamill...@gmail.com
To: 917361...@qq.com
CC: tridib.sama...@live.com; user@spark.apache.org

Use this:
sequenceiq/docker

Here's a link to their github repo:
docker-spark


They have repos for other big data tools too, which are again really nice. It's
being maintained properly by their devs.
  

Official Docker container for Spark

2015-05-21 Thread tridib
Hi,

I am using Spark 1.2.0. Can you suggest Docker containers which can be
deployed in production? I found a lot of Spark images in
https://registry.hub.docker.com/ but could not figure out which one to
use. None of them seems like an official image.

Does anybody have any recommendation?

Thanks
Tridib






RE: HBase HTable constructor hangs

2015-04-30 Thread Tridib Samanta
You are right. After I moved from HBase 0.98.1 to 1.0.0 this problem got 
solved. Thanks all!
 
Date: Wed, 29 Apr 2015 06:58:59 -0700
Subject: Re: HBase HTable constructor hangs
From: yuzhih...@gmail.com
To: tridib.sama...@live.com
CC: d...@ocirs.com; user@spark.apache.org

Can you verify whether the hbase release you're using has the following fix ?
HBASE-8 non environment variable solution for IllegalAccessError

Cheers
On Tue, Apr 28, 2015 at 10:47 PM, Tridib Samanta tridib.sama...@live.com 
wrote:



I turned on TRACE and I see a lot of the following exception:
 
java.lang.IllegalAccessError: com/google/protobuf/ZeroCopyLiteralByteString
 at 
org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:897)
 at 
org.apache.hadoop.hbase.protobuf.RequestConverter.buildGetRowOrBeforeRequest(RequestConverter.java:131)
 at 
org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1402)
 at org.apache.hadoop.hbase.client.HTable$2.call(HTable.java:701)
 at org.apache.hadoop.hbase.client.HTable$2.call(HTable.java:699)
 at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:120)
 at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:705)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:144)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:1102)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1162)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1054)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1011)
 at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:326)
 at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:192)
 
Thanks
Tridib
 
From: d...@ocirs.com
Date: Tue, 28 Apr 2015 22:24:39 -0700
Subject: Re: HBase HTable constructor hangs
To: tridib.sama...@live.com

In that case, something else is failing and the reason HBase looks like it 
hangs is that the hbase timeout or retry count is too high.
Try setting the following conf and hbase will only hang for a few mins max and 
return a helpful error message.
hbaseConf.setInt(HConstants.HBASE_CLIENT_RETRIES_NUMBER, 2);



--
Dean Chen


On Tue, Apr 28, 2015 at 10:18 PM, Tridib Samanta tridib.sama...@live.com 
wrote:



Nope, my hbase is unsecured.
 
From: d...@ocirs.com
Date: Tue, 28 Apr 2015 22:09:51 -0700
Subject: Re: HBase HTable constructor hangs
To: tridib.sama...@live.com

Hi Tridib,
Are you running this on a secure Hadoop/HBase cluster? I ran in to a similar 
issue where the HBase client can successfully connect in local mode and in the 
yarn-client driver but not on remote executors. The problem is that Spark 
doesn't distribute the hbase auth key, see the following Jira ticket and PR.
https://issues.apache.org/jira/browse/SPARK-6918

--
Dean Chen


On Tue, Apr 28, 2015 at 9:34 PM, Tridib Samanta tridib.sama...@live.com wrote:



I am not 100% sure how it's picking up the configuration. I copied the
hbase-site.xml to the hdfs/spark cluster (single machine). I also included
hbase-site.xml in the spark-job jar file. The spark-job jar file also has
yarn-site.xml, mapred-site.xml and core-site.xml in it.
 
One interesting thing is, when I run the spark-job jar standalone and execute
the HBase client from a main method, it works fine. The same client is unable
to connect/hangs when the jar is distributed in Spark.
 
Thanks
Tridib
 
Date: Tue, 28 Apr 2015 21:25:41 -0700
Subject: Re: HBase HTable constructor hangs
From: yuzhih...@gmail.com
To: tridib.sama...@live.com
CC: user@spark.apache.org

How did you distribute hbase-site.xml to the nodes ?
Looks like HConnectionManager couldn't find the hbase:meta server.
Cheers
On Tue, Apr 28, 2015 at 9:19 PM, Tridib Samanta tridib.sama...@live.com wrote:



I am using Spark 1.2.0 and HBase 0.98.1-cdh5.1.0.
 
Here is the jstack trace. Complete stack trace attached.
 
Executor task launch worker-1 #58 daemon prio=5 os_prio=0 
tid=0x7fd3d0445000 nid=0x488 waiting on condition [0x7fd4507d9000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
 at java.lang.Thread.sleep(Native Method)
 at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:152)
 - locked 0xf8cb7258 (a 
org.apache.hadoop.hbase.client.RpcRetryingCaller)
 at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:705)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:144)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:1102)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1162)
 - locked 0xf84ac0b0

RE: HBase HTable constructor hangs

2015-04-29 Thread Tridib Samanta
I turned on TRACE and I see a lot of the following exception:
 
java.lang.IllegalAccessError: com/google/protobuf/ZeroCopyLiteralByteString
 at 
org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:897)
 at 
org.apache.hadoop.hbase.protobuf.RequestConverter.buildGetRowOrBeforeRequest(RequestConverter.java:131)
 at 
org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1402)
 at org.apache.hadoop.hbase.client.HTable$2.call(HTable.java:701)
 at org.apache.hadoop.hbase.client.HTable$2.call(HTable.java:699)
 at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:120)
 at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:705)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:144)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:1102)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1162)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1054)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1011)
 at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:326)
 at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:192)
 
Thanks
Tridib
 
From: d...@ocirs.com
Date: Tue, 28 Apr 2015 22:24:39 -0700
Subject: Re: HBase HTable constructor hangs
To: tridib.sama...@live.com

In that case, something else is failing and the reason HBase looks like it 
hangs is that the hbase timeout or retry count is too high.
Try setting the following conf and hbase will only hang for a few mins max and 
return a helpful error message.
hbaseConf.setInt(HConstants.HBASE_CLIENT_RETRIES_NUMBER, 2);



--
Dean Chen


On Tue, Apr 28, 2015 at 10:18 PM, Tridib Samanta tridib.sama...@live.com 
wrote:



Nope, my hbase is unsecured.
 
From: d...@ocirs.com
Date: Tue, 28 Apr 2015 22:09:51 -0700
Subject: Re: HBase HTable constructor hangs
To: tridib.sama...@live.com

Hi Tridib,
Are you running this on a secure Hadoop/HBase cluster? I ran in to a similar 
issue where the HBase client can successfully connect in local mode and in the 
yarn-client driver but not on remote executors. The problem is that Spark 
doesn't distribute the hbase auth key, see the following Jira ticket and PR.
https://issues.apache.org/jira/browse/SPARK-6918

--
Dean Chen


On Tue, Apr 28, 2015 at 9:34 PM, Tridib Samanta tridib.sama...@live.com wrote:



I am not 100% sure how it's picking up the configuration. I copied the
hbase-site.xml to the hdfs/spark cluster (single machine). I also included
hbase-site.xml in the spark-job jar file. The spark-job jar file also has
yarn-site.xml, mapred-site.xml and core-site.xml in it.
 
One interesting thing is, when I run the spark-job jar standalone and execute
the HBase client from a main method, it works fine. The same client is unable
to connect/hangs when the jar is distributed in Spark.
 
Thanks
Tridib
 
Date: Tue, 28 Apr 2015 21:25:41 -0700
Subject: Re: HBase HTable constructor hangs
From: yuzhih...@gmail.com
To: tridib.sama...@live.com
CC: user@spark.apache.org

How did you distribute hbase-site.xml to the nodes ?
Looks like HConnectionManager couldn't find the hbase:meta server.
Cheers
On Tue, Apr 28, 2015 at 9:19 PM, Tridib Samanta tridib.sama...@live.com wrote:



I am using Spark 1.2.0 and HBase 0.98.1-cdh5.1.0.
 
Here is the jstack trace. Complete stack trace attached.
 
Executor task launch worker-1 #58 daemon prio=5 os_prio=0 
tid=0x7fd3d0445000 nid=0x488 waiting on condition [0x7fd4507d9000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
 at java.lang.Thread.sleep(Native Method)
 at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:152)
 - locked 0xf8cb7258 (a 
org.apache.hadoop.hbase.client.RpcRetryingCaller)
 at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:705)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:144)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:1102)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1162)
 - locked 0xf84ac0b0 (a java.lang.Object)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1054)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1011)
 at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:326)
 at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:192)
 at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:150)
 at com.mypackage.storeTuples(CubeStoreService.java

Re: HBase HTable constructor hangs

2015-04-28 Thread tridib
I am having exactly the same issue. I am running HBase and Spark in Docker
containers.






RE: HBase HTable constructor hangs

2015-04-28 Thread Tridib Samanta
I searched search-hadoop.com/?q=spark+HBase+HTable+constructor+hangs and
saw a very old thread with this subject.
Cheers
On Tue, Apr 28, 2015 at 7:12 PM, tridib tridib.sama...@live.com wrote:
I am exactly having same issue. I am running hbase and spark in docker

container.











  2015-04-28 23:20:33
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.40-b25 mixed mode):

Attach Listener #69 daemon prio=9 os_prio=0 tid=0x7fd430001000 nid=0x4a9 
waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

Executor task launch worker-0-EventThread #68 daemon prio=5 os_prio=0 
tid=0x7fd410296800 nid=0x494 waiting on condition [0x7fd3d8acb000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  0xf8692410 (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:491)

Executor task launch worker-0-SendThread(dfdevdt01:2181) #67 daemon prio=5 
os_prio=0 tid=0x7fd410296000 nid=0x493 runnable [0x7fd3d8bcc000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked 0xf864a078 (a sun.nio.ch.Util$2)
- locked 0xf864a068 (a java.util.Collections$UnmodifiableSet)
- locked 0xf8649f50 (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:338)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)

org.apache.hadoop.hdfs.PeerCache@440cfa5d #65 daemon prio=5 os_prio=0 
tid=0x7fd40c22f000 nid=0x492 waiting on condition [0x7fd3db0f1000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.hdfs.PeerCache.run(PeerCache.java:245)
at org.apache.hadoop.hdfs.PeerCache.access$000(PeerCache.java:41)
at org.apache.hadoop.hdfs.PeerCache$1.run(PeerCache.java:119)
at java.lang.Thread.run(Thread.java:745)

IPC Parameter Sending Thread #0 #64 daemon prio=5 os_prio=0 
tid=0x7fd4101c2000 nid=0x48f waiting on condition [0x7fd3f84d]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  0xe0cb65f0 (a 
java.util.concurrent.SynchronousQueue$TransferStack)
at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at 
java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
at 
java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

threadDeathWatcher-2-1 #61 daemon prio=1 os_prio=0 tid=0x7fd3dc101000 
nid=0x48d waiting on condition [0x7fd4501d4000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at 
io.netty.util.ThreadDeathWatcher$Watcher.run(ThreadDeathWatcher.java:137)
at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)

shuffle-client-0 #47 daemon prio=5 os_prio=0 tid=0x7fd41007e000 nid=0x48c 
runnable [0x7fd3f85d1000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79

RE: HBase HTable constructor hangs

2015-04-28 Thread Tridib Samanta
I am not 100% sure how it's picking up the configuration. I copied the
hbase-site.xml to the hdfs/spark cluster (single machine). I also included
hbase-site.xml in the spark-job jar file. The spark-job jar file also has
yarn-site.xml, mapred-site.xml and core-site.xml in it.
 
One interesting thing is, when I run the spark-job jar standalone and execute
the HBase client from a main method, it works fine. The same client is unable
to connect/hangs when the jar is distributed in Spark.
 
Thanks
Tridib
 
Date: Tue, 28 Apr 2015 21:25:41 -0700
Subject: Re: HBase HTable constructor hangs
From: yuzhih...@gmail.com
To: tridib.sama...@live.com
CC: user@spark.apache.org

How did you distribute hbase-site.xml to the nodes ?
Looks like HConnectionManager couldn't find the hbase:meta server.
Cheers
On Tue, Apr 28, 2015 at 9:19 PM, Tridib Samanta tridib.sama...@live.com wrote:



I am using Spark 1.2.0 and HBase 0.98.1-cdh5.1.0.
 
Here is the jstack trace. Complete stack trace attached.
 
Executor task launch worker-1 #58 daemon prio=5 os_prio=0 
tid=0x7fd3d0445000 nid=0x488 waiting on condition [0x7fd4507d9000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
 at java.lang.Thread.sleep(Native Method)
 at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:152)
 - locked 0xf8cb7258 (a 
org.apache.hadoop.hbase.client.RpcRetryingCaller)
 at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:705)
 at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:144)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:1102)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1162)
 - locked 0xf84ac0b0 (a java.lang.Object)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1054)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1011)
 at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:326)
 at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:192)
 at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:150)
 at com.mypackage.storeTuples(CubeStoreService.java:59)
 at 
com.mypackage.StorePartitionToHBaseStoreFunction.call(StorePartitionToHBaseStoreFunction.java:23)
 at 
com.mypackage.StorePartitionToHBaseStoreFunction.call(StorePartitionToHBaseStoreFunction.java:13)
 at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:195)
 at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:195)
 at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:773)
 at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:773)
 at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
 at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
Executor task launch worker-0 #57 daemon prio=5 os_prio=0 
tid=0x7fd3d0443800 nid=0x487 waiting for monitor entry [0x7fd4506d8000]
   java.lang.Thread.State: BLOCKED (on object monitor)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1156)
 - waiting to lock 0xf84ac0b0 (a java.lang.Object)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1054)
 at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1011)
 at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:326)
 at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:192)
 at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:150)
 at com.mypackage.storeTuples(CubeStoreService.java:59)
 at 
com.mypackage.StorePartitionToHBaseStoreFunction.call(StorePartitionToHBaseStoreFunction.java:23)
 at 
com.mypackage.StorePartitionToHBaseStoreFunction.call(StorePartitionToHBaseStoreFunction.java:13)
 at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:195)
 at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:195)
 at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:773)
 at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:773)
 at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply

spark sql median and standard deviation

2015-03-04 Thread tridib
Hello,
Is there a built-in function for getting the median and standard deviation in
Spark SQL? Currently I am converting the SchemaRDD to a DoubleRDD and calling
doubleRDD.stats(), but it does not include the median.

What is the most efficient way to get the median?

Thanks & Regards
Tridib
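
One hedged workaround, if a HiveContext is available, is to lean on Hive's
built-in aggregate functions, which Spark SQL can call directly:
percentile_approx gives an approximate median and stddev_pop/stddev_samp give
the standard deviation. A sketch (the table and column names are made up):

// Approximate median and population standard deviation of a numeric column.
hiveContext.sql(
    "SELECT percentile_approx(amount, 0.5) AS median, "
  + "stddev_pop(amount) AS sd "
  + "FROM claims").collect();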






Re: Running spark function on parquet without sql

2015-02-27 Thread tridib
Somehow my posts are not getting accepted, and replies are not visible here.
But I got the following reply from Zhan.

From Zhan Zhang's reply: yes, I still get parquet's advantages.

My next question is: if I operate on the SchemaRDD, will I get the advantage of
Spark SQL's in-memory columnar store when I cache the table using
cacheTable()?
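
For reference, a minimal sketch of the cacheTable() route (assuming Spark 1.3's
DataFrame API; imports, path, table name and query are made up). cacheTable()
stores the registered table in Spark SQL's in-memory columnar format, so later
queries read the cached columns instead of going back to the parquet files:

DataFrame claims = sqlContext.parquetFile("/data/claims.parquet");
claims.registerTempTable("claims");
sqlContext.cacheTable("claims");   // in-memory columnar cache
sqlContext.sql("SELECT speciality, COUNT(*) AS c FROM claims GROUP BY speciality").collect();
sqlContext.uncacheTable("claims"); // release it when done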






RE: group by order by fails

2015-02-27 Thread Tridib Samanta
Thanks Michael! It worked. Somehow my mails are not getting accepted by the
Spark user mailing list. :(
 
From: mich...@databricks.com
Date: Thu, 26 Feb 2015 17:49:43 -0800
Subject: Re: group by order by fails
To: tridib.sama...@live.com
CC: ak...@sigmoidanalytics.com; user@spark.apache.org

Assign an alias to the count in the select clause and use that alias in the 
order by clause.
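
In other words, roughly (using the query from the original post; DESC/LIMIT
added only because the stated goal was the top 10):

sqlContext.sql("SELECT s.name, COUNT(s.name) AS cnt FROM sample s "
             + "GROUP BY s.name ORDER BY cnt DESC LIMIT 10");
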
On Wed, Feb 25, 2015 at 11:17 PM, Tridib Samanta tridib.sama...@live.com 
wrote:



Actually I just realized, I am using 1.2.0.
 
Thanks
Tridib
 
Date: Thu, 26 Feb 2015 12:37:06 +0530
Subject: Re: group by order by fails
From: ak...@sigmoidanalytics.com
To: tridib.sama...@live.com
CC: user@spark.apache.org

Which version of Spark are you running? It seems there was a similar Jira:
https://issues.apache.org/jira/browse/SPARK-2474
Thanks
Best Regards

On Thu, Feb 26, 2015 at 12:03 PM, tridib tridib.sama...@live.com wrote:
Hi,

I need to find top 10 most selling samples. So query looks like:

select  s.name, count(s.name) from sample s group by s.name order by

count(s.name)



This query fails with following error:

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree:

Sort [COUNT(name#0) ASC], true

 Exchange (RangePartitioning [COUNT(name#0) ASC], 200)

  Aggregate false, [name#0], [name#0 AS

name#1,Coalesce(SUM(PartialCount#4L),0) AS count#2L,name#0]

   Exchange (HashPartitioning [name#0], 200)

Aggregate true, [name#0], [name#0,COUNT(name#0) AS PartialCount#4L]

 PhysicalRDD [name#0], MapPartitionsRDD[1] at mapPartitions at

JavaSQLContext.scala:102



at

org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)

at org.apache.spark.sql.execution.Sort.execute(basicOperators.scala:206)

at 
org.apache.spark.sql.execution.Project.execute(basicOperators.scala:43)

at

org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)

at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)

at

org.apache.spark.sql.api.java.JavaSchemaRDD.collect(JavaSchemaRDD.scala:114)

at

com.edifecs.platform.df.analytics.spark.domain.dao.OrderByTest.testGetVisitDistributionByPrimaryDx(OrderByTest.java:48)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at

org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)

at

org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)

at

org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)

at

org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)

at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)

at

org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)

at

org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)

at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)

at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)

at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)

at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)

at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)

at org.junit.runners.ParentRunner.run(ParentRunner.java:309)

at org.junit.runner.JUnitCore.run(JUnitCore.java:160)

at

com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)

at

com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)

at 
com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at

com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:121)

Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

execute, tree:

Exchange (RangePartitioning [COUNT(name#0) ASC], 200)

 Aggregate false, [name#0], [name#0 AS

name#1,Coalesce(SUM(PartialCount#4L),0) AS count#2L,name#0]

  Exchange (HashPartitioning [name#0], 200)

   Aggregate true, [name#0], [name#0,COUNT(name#0) AS PartialCount#4L]

PhysicalRDD [name#0], MapPartitionsRDD[1] at mapPartitions at

JavaSQLContext.scala

Running spark function on parquet without sql

2015-02-26 Thread tridib
Hello Experts,
In one of my projects we have parquet files, and we are using Spark SQL to get
our analytics. I am encountering situations where simple SQL does not get me
what I need, or the complex SQL is not supported by Spark SQL. In scenarios
like this I am able to get things done using low-level Spark constructs like
MapFunction and reducers.

My question is: if I create a JavaSchemaRDD on parquet and use basic Spark
constructs, will I still get the benefit of parquet's columnar format? Will
my aggregation be as fast as it would have been if I had used SQL?

Please advise.

Thanks & Regards
Tridib
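
A small sketch of the low-level pattern described above (Spark 1.2 Java API;
imports of Function/Function2 and org.apache.spark.sql.api.java.Row are
omitted, and the table/column names are made up): do the column projection in
SQL so parquet's column pruning still applies, then switch to plain RDD
operations for the custom logic.

// Only the projected column is read from the parquet files.
JavaSchemaRDD amounts = sqlCtx.sql("SELECT amount FROM claims");
double total = amounts.map(new Function<Row, Double>() {
    public Double call(Row row) {
        return row.getDouble(0);
    }
}).reduce(new Function2<Double, Double, Double>() {
    public Double call(Double a, Double b) {
        return a + b;
    }
});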






RE: spark sql: join sql fails after sqlCtx.cacheTable()

2015-02-25 Thread tridib
Using HiveContext solved it.






RE: group by order by fails

2015-02-25 Thread Tridib Samanta
Actually I just realized, I am using 1.2.0.
 
Thanks
Tridib
 
Date: Thu, 26 Feb 2015 12:37:06 +0530
Subject: Re: group by order by fails
From: ak...@sigmoidanalytics.com
To: tridib.sama...@live.com
CC: user@spark.apache.org

Which version of Spark are you running? It seems there was a similar Jira:
https://issues.apache.org/jira/browse/SPARK-2474
Thanks
Best Regards

On Thu, Feb 26, 2015 at 12:03 PM, tridib tridib.sama...@live.com wrote:
Hi,

I need to find top 10 most selling samples. So query looks like:

select  s.name, count(s.name) from sample s group by s.name order by

count(s.name)



This query fails with following error:

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree:

Sort [COUNT(name#0) ASC], true

 Exchange (RangePartitioning [COUNT(name#0) ASC], 200)

  Aggregate false, [name#0], [name#0 AS

name#1,Coalesce(SUM(PartialCount#4L),0) AS count#2L,name#0]

   Exchange (HashPartitioning [name#0], 200)

Aggregate true, [name#0], [name#0,COUNT(name#0) AS PartialCount#4L]

 PhysicalRDD [name#0], MapPartitionsRDD[1] at mapPartitions at

JavaSQLContext.scala:102



at

org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)

at org.apache.spark.sql.execution.Sort.execute(basicOperators.scala:206)

at 
org.apache.spark.sql.execution.Project.execute(basicOperators.scala:43)

at

org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)

at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)

at

org.apache.spark.sql.api.java.JavaSchemaRDD.collect(JavaSchemaRDD.scala:114)

at

com.edifecs.platform.df.analytics.spark.domain.dao.OrderByTest.testGetVisitDistributionByPrimaryDx(OrderByTest.java:48)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at

org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)

at

org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)

at

org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)

at

org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)

at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)

at

org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)

at

org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)

at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)

at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)

at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)

at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)

at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)

at org.junit.runners.ParentRunner.run(ParentRunner.java:309)

at org.junit.runner.JUnitCore.run(JUnitCore.java:160)

at

com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)

at

com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)

at 
com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at

com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:121)

Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

execute, tree:

Exchange (RangePartitioning [COUNT(name#0) ASC], 200)

 Aggregate false, [name#0], [name#0 AS

name#1,Coalesce(SUM(PartialCount#4L),0) AS count#2L,name#0]

  Exchange (HashPartitioning [name#0], 200)

   Aggregate true, [name#0], [name#0,COUNT(name#0) AS PartialCount#4L]

PhysicalRDD [name#0], MapPartitionsRDD[1] at mapPartitions at

JavaSQLContext.scala:102



at

org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)

at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:47)

at

org.apache.spark.sql.execution.Sort$$anonfun$execute$3.apply(basicOperators.scala:207)

at

org.apache.spark.sql.execution.Sort$$anonfun$execute$3.apply(basicOperators.scala:207)

at

org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46

group by order by fails

2015-02-25 Thread tridib
Hi,
I need to find the top 10 most-selling samples. So the query looks like:
select  s.name, count(s.name) from sample s group by s.name order by
count(s.name)

This query fails with following error:
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree:
Sort [COUNT(name#0) ASC], true
 Exchange (RangePartitioning [COUNT(name#0) ASC], 200)
  Aggregate false, [name#0], [name#0 AS
name#1,Coalesce(SUM(PartialCount#4L),0) AS count#2L,name#0]
   Exchange (HashPartitioning [name#0], 200)
Aggregate true, [name#0], [name#0,COUNT(name#0) AS PartialCount#4L]
 PhysicalRDD [name#0], MapPartitionsRDD[1] at mapPartitions at
JavaSQLContext.scala:102

at
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
at org.apache.spark.sql.execution.Sort.execute(basicOperators.scala:206)
at 
org.apache.spark.sql.execution.Project.execute(basicOperators.scala:43)
at
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
at
org.apache.spark.sql.api.java.JavaSchemaRDD.collect(JavaSchemaRDD.scala:114)
at
com.edifecs.platform.df.analytics.spark.domain.dao.OrderByTest.testGetVisitDistributionByPrimaryDx(OrderByTest.java:48)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
at
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
at
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
at 
com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:121)
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
execute, tree:
Exchange (RangePartitioning [COUNT(name#0) ASC], 200)
 Aggregate false, [name#0], [name#0 AS
name#1,Coalesce(SUM(PartialCount#4L),0) AS count#2L,name#0]
  Exchange (HashPartitioning [name#0], 200)
   Aggregate true, [name#0], [name#0,COUNT(name#0) AS PartialCount#4L]
PhysicalRDD [name#0], MapPartitionsRDD[1] at mapPartitions at
JavaSQLContext.scala:102

at
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:47)
at
org.apache.spark.sql.execution.Sort$$anonfun$execute$3.apply(basicOperators.scala:207)
at
org.apache.spark.sql.execution.Sort$$anonfun$execute$3.apply(basicOperators.scala:207)
at
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
... 37 more
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
No function to evaluate expression. type: Count, tree: COUNT(input[2])
at
org.apache.spark.sql.catalyst.expressions.AggregateExpression.eval(aggregates.scala:41)
at
org.apache.spark.sql.catalyst.expressions.RowOrdering.compare(Row.scala:250)
at
org.apache.spark.sql.catalyst.expressions.RowOrdering.compare(Row.scala:242)
at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
at 

Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread tridib
I am experimenting with two files and trying to generate 1 parquet file.

public class CompactParquetGenerator implements Serializable {

    public void generateParquet(JavaSparkContext sc, String jsonFilePath,
                                String parquetPath) {
        // int MB_128 = 128 * 1024 * 1024;
        // sc.hadoopConfiguration().setInt("dfs.blocksize", MB_128);
        // sc.hadoopConfiguration().setInt("parquet.block.size", MB_128);
        JavaSQLContext sqlCtx = new JavaSQLContext(sc);
        JavaRDD<Claim> claimRdd = sc.textFile(jsonFilePath)
                .map(new StringToClaimMapper()).filter(new NullFilter());
        JavaSchemaRDD claimSchemaRdd = sqlCtx.applySchema(claimRdd, Claim.class);
        claimSchemaRdd.coalesce(1);
        claimSchemaRdd.saveAsParquetFile(parquetPath);
    }
}






Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread tridib
public void generateParquet(JavaSparkContext sc, String jsonFilePath,
                            String parquetPath) {
    // int MB_128 = 128 * 1024 * 1024;
    // sc.hadoopConfiguration().setInt("dfs.blocksize", MB_128);
    // sc.hadoopConfiguration().setInt("parquet.block.size", MB_128);
    JavaSQLContext sqlCtx = new JavaSQLContext(sc);
    JavaRDD<Claim> claimRdd = sc.textFile(jsonFilePath)
            .map(new StringToClaimMapper()).filter(new NullFilter());
    JavaSchemaRDD claimSchemaRdd = sqlCtx.applySchema(claimRdd, Claim.class);
    claimSchemaRdd.coalesce(1, true); // tried with false also; tried repartition(1) too

    claimSchemaRdd.saveAsParquetFile(parquetPath);
}






Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread tridib
Ohh...how can I miss that. :(. Thanks!






Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread tridib
Thanks Michael,
It worked like a charm! I have a few more queries:
1. Is there a way to control the size of the parquet files?
2. Which method do you recommend: coalesce(n, true), coalesce(n, false) or
repartition(n)?

Thanks & Regards
Tridib







Control number of parquet generated from JavaSchemaRDD

2014-11-24 Thread tridib
Hello,
I am reading around 1000 input files from disk into an RDD and generating
parquet. It always produces the same number of parquet files as input
files. I tried to merge them using

rdd.coalesce(n) and/or rdd.repartition(n).
I also tried using:

int MB_128 = 128*1024*1024;
sc.hadoopConfiguration().setInt("dfs.blocksize", MB_128);
sc.hadoopConfiguration().setInt("parquet.block.size", MB_128);

No luck.
Is there a way to control the size/number of parquet files generated?

Thanks
Tridib






allocating different memory to different executor for same application

2014-11-21 Thread tridib
Hello Experts,
I have 5 worker machines with different sizes of RAM. Is there a way to
configure each with a different executor memory?

Currently I see that every worker spins up 1 executor with the same amount of
memory.

Thanks & Regards
Tridib






spark-sql broken

2014-11-21 Thread tridib
After taking today's build from the master branch I started getting this error
when I run spark-sql:

Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.

I used the following command for building:
 ./make-distribution.sh --tgz -Pyarn -Dyarn.version=2.4.0 -Phadoop-2.4
-Dhadoop.version=2.4.0 -Phive  -Phive-thriftserver -DskipTests

Is there anything I am missing?

Thanks
Tridib







sum/avg group by specified ranges

2014-11-18 Thread tridib
Hello Experts,
I need to get the total of an amount field for specified date ranges. Now that
group by on a calculated field does not work
(https://issues.apache.org/jira/browse/SPARK-4296), what is the best way to
get this done?

I thought of doing it using plain Spark, but I suspect I will miss the
performance of Spark SQL on top of parquet files. Any suggestions?

Thanks & Regards
Tridib
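
A possible workaround sketch while that Jira is open (assuming a HiveContext;
the table, columns and date boundaries are made up): compute the range label in
a subquery, then group on the plain column in the outer query.

hiveContext.sql(
    "SELECT bucket, SUM(amount) AS total FROM ("
  + "  SELECT amount,"
  + "         CASE WHEN service_date < '2014-07-01' THEN 'H1-2014' ELSE 'H2-2014' END AS bucket"
  + "  FROM claims"
  + ") t GROUP BY bucket").collect();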







spark sql - save to Parquet file - Unsupported datatype TimestampType

2014-11-11 Thread tridib
Hi Friends,
I am trying to save a JSON file to parquet. I got the error "Unsupported
datatype TimestampType".
Does parquet not support dates? Which parquet version does Spark use? Is there
any workaround?


Here is the stacktrace:

java.lang.RuntimeException: Unsupported datatype TimestampType
at scala.sys.package$.error(package.scala:27)
at
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:343)
at
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:292)
at scala.Option.getOrElse(Option.scala:120)
at
org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291)
at
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2$$anonfun$3.apply(ParquetTypes.scala:320)
at
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2$$anonfun$3.apply(ParquetTypes.scala:320)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:319)
at
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:292)
at scala.Option.getOrElse(Option.scala:120)
at
org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291)
at
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:363)
at
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:362)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at
org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:361)
at
org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:407)
at
org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:151)
at
org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:130)
at
org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:204)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
at
org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76)
at
org.apache.spark.sql.api.java.JavaSchemaRDD.saveAsParquetFile(JavaSchemaRDD.scala:42)
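
For anyone else hitting this: the parquet writer here does not handle TimestampType
(that is what the stack trace above is saying), so one workaround is to serialize the
timestamp as a string (or a long) before saving and cast it back after loading. A rough
sketch, assuming a HiveContext and an illustrative table events with a timestamp column ts:

// Hedged workaround sketch: store the timestamp as a string in parquet,
// cast back to timestamp after reloading. Table, column and paths are illustrative.
val hc = new org.apache.spark.sql.hive.HiveContext(sc)

val asStrings = hc.sql("SELECT cast(ts AS string) AS ts_str, id FROM events")
asStrings.saveAsParquetFile("/hdd/spark/events.parquet")

val reloaded = hc.parquetFile("/hdd/spark/events.parquet")
reloaded.registerTempTable("events_parquet")
hc.sql("SELECT cast(ts_str AS timestamp) AS ts, id FROM events_parquet").collect()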

Thanks & Regards
Tridib




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-save-to-Parquet-file-Unsupported-datatype-TimestampType-tp18691.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



sql - group by on UDF not working

2014-11-07 Thread Tridib Samanta
)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:968)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:916)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1011)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:353)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

 
Thanks & Regards
Tridib
  

Re: Unable to use HiveContext in spark-shell

2014-11-06 Thread tridib
Help please!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-use-HiveContext-in-spark-shell-tp18261p18280.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: spark sql: join sql fails after sqlCtx.cacheTable()

2014-11-06 Thread Tridib Samanta
I am getting an exception in spark-shell at the following line:
 
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
error: bad symbolic reference. A signature in HiveContext.class refers to term 
hive
in package org.apache.hadoop which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling 
HiveContext.class.
error:
 while compiling: console
during phase: erasure
 library version: version 2.10.4
compiler version: version 2.10.4
  reconstructed args:
  last tree to typer: Apply(value $outer)
  symbol: value $outer (flags: method synthetic stable 
expandedname triedcooking)
   symbol definition: val $outer(): $iwC.$iwC.type
 tpe: $iwC.$iwC.type
   symbol owners: value $outer - class $iwC - class $iwC - class $iwC - 
class $read - package $line5
  context owners: class $iwC - class $iwC - class $iwC - class $iwC - 
class $read - package $line5
== Enclosing template or block ==
ClassDef( // class $iwC extends Serializable
  0
  $iwC
  []
  Template( // val local $iwC: notype, tree.tpe=$iwC
java.lang.Object, scala.Serializable // parents
ValDef(
  private
  _
  tpt
  empty
)
// 5 statements
DefDef( // def init(arg$outer: $iwC.$iwC.$iwC.type): $iwC
  method triedcooking
  init
  []
  // 1 parameter list
  ValDef( // $outer: $iwC.$iwC.$iwC.type
param
$outer
tpt // tree.tpe=$iwC.$iwC.$iwC.type
empty
  )
  tpt // tree.tpe=$iwC
  Block( // tree.tpe=Unit
Apply( // def init(): Object in class Object, tree.tpe=Object
  $iwC.super.init // def init(): Object in class Object, 
tree.tpe=()Object
  Nil
)
()
  )
)
ValDef( // private[this] val sqlContext: 
org.apache.spark.sql.hive.HiveContext
  private local triedcooking
  sqlContext 
  tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext
  Apply( // def init(sc: org.apache.spark.SparkContext): 
org.apache.spark.sql.hive.HiveContext in class HiveContext, 
tree.tpe=org.apache.spark.sql.hive.HiveContext
new org.apache.spark.sql.hive.HiveContext.init // def init(sc: 
org.apache.spark.SparkContext): org.apache.spark.sql.hive.HiveContext in class 
HiveContext, tree.tpe=(sc: 
org.apache.spark.SparkContext)org.apache.spark.sql.hive.HiveContext
Apply( // val sc(): org.apache.spark.SparkContext, 
tree.tpe=org.apache.spark.SparkContext
  
$iwC.this.$line5$$read$$iwC$$iwC$$iwC$$iwC$$$outer().$line5$$read$$iwC$$iwC$$iwC$$$outer().$line5$$read$$iwC$$iwC$$$outer().$VAL1().$iw().$iw().sc
 // val sc(): org.apache.spark.SparkContext, 
tree.tpe=()org.apache.spark.SparkContext
  Nil
)
  )
)
DefDef( // val sqlContext(): org.apache.spark.sql.hive.HiveContext
  method stable accessor
  sqlContext
  []
  List(Nil)
  tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext
  $iwC.this.sqlContext  // private[this] val sqlContext: 
org.apache.spark.sql.hive.HiveContext, 
tree.tpe=org.apache.spark.sql.hive.HiveContext
)
ValDef( // protected val $outer: $iwC.$iwC.$iwC.type
  protected synthetic paramaccessor triedcooking
  $outer 
  tpt // tree.tpe=$iwC.$iwC.$iwC.type
  empty
)
DefDef( // val $outer(): $iwC.$iwC.$iwC.type
  method synthetic stable expandedname triedcooking
  $line5$$read$$iwC$$iwC$$iwC$$iwC$$$outer
  []
  List(Nil)
  tpt // tree.tpe=Any
  $iwC.this.$outer  // protected val $outer: $iwC.$iwC.$iwC.type, 
tree.tpe=$iwC.$iwC.$iwC.type
)
  )
)
== Expanded type of tree ==
ThisType(class $iwC)
uncaught exception during compilation: scala.reflect.internal.Types$TypeError
scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in 
HiveContext.class refers to term conf
in value org.apache.hadoop.hive which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling 
HiveContext.class.
That entry seems to have slain the compiler.  Shall I replay
your session? I can re-run each line except the last one.

 
Thanks
Tridib
 
Date: Tue, 21 Oct 2014 09:39:49 -0700
Subject: Re: spark sql: join sql fails after sqlCtx.cacheTable()
From: ri...@infoobjects.com
To: tridib.sama...@live.com
CC: u...@spark.incubator.apache.org

Hi Tridib,
I changed SQLContext to HiveContext and it started working. These are steps I 
used.







val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val person = sqlContext.jsonFile("json/person.json")
person.printSchema()
person.registerTempTable("person")
val address = sqlContext.jsonFile("json/address.json")
address.printSchema()
address.registerTempTable("address")
sqlContext.cacheTable("person")
sqlContext.cacheTable("address")
val rs2 = sqlContext.sql("select p.id,p.name,a.city from person

RE: Unable to use HiveContext in spark-shell

2014-11-06 Thread Tridib Samanta



I am using spark 1.1.0.
I built it using:
./make-distribution.sh -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive 
-DskipTests
 
My ultimate goal is to execute a query on a parquet file with a nested structure
and cast a date string to Date. This is required to calculate the age of the Person
entity. But I am not even able to get past this line:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

I made sure that the org.apache.hadoop package is in the spark assembly jar.
Re-attaching the stack trace for quick reference.

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

error: bad symbolic reference. A signature in HiveContext.class refers to term 
hive 
in package org.apache.hadoop which is not available. 
It may be completely missing from the current classpath, or the version on 
the classpath might be incompatible with the version used when compiling 
HiveContext.class. 
error: 
 while compiling: console
during phase: erasure 
 library version: version 2.10.4 
compiler version: version 2.10.4 
  reconstructed args: 

  last tree to typer: Apply(value $outer) 
  symbol: value $outer (flags: method synthetic stable 
expandedname triedcooking) 
   symbol definition: val $outer(): $iwC.$iwC.type 
 tpe: $iwC.$iwC.type 
   symbol owners: value $outer - class $iwC - class $iwC - class $iwC - 
class $read - package $line5 
  context owners: class $iwC - class $iwC - class $iwC - class $iwC - 
class $read - package $line5 

== Enclosing template or block == 

ClassDef( // class $iwC extends Serializable 
  0 
  $iwC 
  [] 
  Template( // val local $iwC: notype, tree.tpe=$iwC 
java.lang.Object, scala.Serializable // parents 
ValDef( 
  private 
  _ 
  tpt
  empty
) 
// 5 statements 
DefDef( // def init(arg$outer: $iwC.$iwC.$iwC.type): $iwC 
  method triedcooking
  init 
  [] 
  // 1 parameter list 
  ValDef( // $outer: $iwC.$iwC.$iwC.type 

$outer 
tpt // tree.tpe=$iwC.$iwC.$iwC.type 
empty
  ) 
  tpt // tree.tpe=$iwC 
  Block( // tree.tpe=Unit 
Apply( // def init(): Object in class Object, tree.tpe=Object 
  $iwC.super.init // def init(): Object in class Object, 
tree.tpe=()Object 
  Nil 
) 
() 
  ) 
) 
ValDef( // private[this] val sqlContext: 
org.apache.spark.sql.hive.HiveContext 
  private local triedcooking
  sqlContext  
  tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext 
  Apply( // def init(sc: org.apache.spark.SparkContext): 
org.apache.spark.sql.hive.HiveContext in class HiveContext, 
tree.tpe=org.apache.spark.sql.hive.HiveContext 
new org.apache.spark.sql.hive.HiveContext.init // def init(sc: 
org.apache.spark.SparkContext): org.apache.spark.sql.hive.HiveContext in class 
HiveContext, tree.tpe=(sc: 
org.apache.spark.SparkContext)org.apache.spark.sql.hive.HiveContext 
Apply( // val sc(): org.apache.spark.SparkContext, 
tree.tpe=org.apache.spark.SparkContext 
  
$iwC.this.$line5$$read$$iwC$$iwC$$iwC$$iwC$$$outer().$line5$$read$$iwC$$iwC$$iwC$$$outer().$line5$$read$$iwC$$iwC$$$outer().$VAL1().$iw().$iw().sc
 // val sc(): org.apache.spark.SparkContext, 
tree.tpe=()org.apache.spark.SparkContext 
  Nil 
) 
  ) 
) 
DefDef( // val sqlContext(): org.apache.spark.sql.hive.HiveContext 
  method stable accessor
  sqlContext 
  [] 
  List(Nil) 
  tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext 
  $iwC.this.sqlContext  // private[this] val sqlContext: 
org.apache.spark.sql.hive.HiveContext, 
tree.tpe=org.apache.spark.sql.hive.HiveContext 
) 
ValDef( // protected val $outer: $iwC.$iwC.$iwC.type 
  protected synthetic paramaccessor triedcooking
  $outer  
  tpt // tree.tpe=$iwC.$iwC.$iwC.type 
  empty
) 
DefDef( // val $outer(): $iwC.$iwC.$iwC.type 
  method synthetic stable expandedname triedcooking
  $line5$$read$$iwC$$iwC$$iwC$$iwC$$$outer 
  [] 
  List(Nil) 
  tpt // tree.tpe=Any 
  $iwC.this.$outer  // protected val $outer: $iwC.$iwC.$iwC.type, 
tree.tpe=$iwC.$iwC.$iwC.type 
) 
  ) 
) 

== Expanded type of tree == 

ThisType(class $iwC) 

uncaught exception during compilation: scala.reflect.internal.Types$TypeError 
scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in 
HiveContext.class refers to term conf 
in value org.apache.hadoop.hive which is not available. 
It may be completely missing from the current classpath, or the version on 
the classpath might be incompatible with the version used when compiling 
HiveContext.class. 
That entry seems to have slain the compiler.  Shall I replay 
your session? I can re-run each line except the last one. 
[y/n] 

 
Thanks
Tridib
 
 From: terry@smartfocus.com
 To: tridib.sama...@live.com; u...@spark.incubator.apache.org
 Subject: Re: Unable to use

Re: Unable to use HiveContext in spark-shell

2014-11-06 Thread tridib
Yes. I have org.apache.hadoop.hive package in spark assembly.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-use-HiveContext-in-spark-shell-tp18261p18322.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Unable to use HiveContext in spark-shell

2014-11-06 Thread tridib
I built spark-1.1.0 on a fresh new machine. The issue is gone! Thank you all
for your help.

Thanks & Regards
Tridib



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-use-HiveContext-in-spark-shell-tp18261p18324.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Unable to use HiveContext in spark-shell

2014-11-05 Thread tridib
I am connecting to a remote master using spark-shell. Then I am getting the
following error while trying to instantiate HiveContext.

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

error: bad symbolic reference. A signature in HiveContext.class refers to
term hive
in package org.apache.hadoop which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling
HiveContext.class.
error:
 while compiling: console
during phase: erasure
 library version: version 2.10.4
compiler version: version 2.10.4
  reconstructed args:

  last tree to typer: Apply(value $outer)
  symbol: value $outer (flags: method synthetic stable
expandedname triedcooking)
   symbol definition: val $outer(): $iwC.$iwC.type
 tpe: $iwC.$iwC.type
   symbol owners: value $outer - class $iwC - class $iwC - class $iwC
- class $read - package $line5
  context owners: class $iwC - class $iwC - class $iwC - class $iwC
- class $read - package $line5

== Enclosing template or block ==

ClassDef( // class $iwC extends Serializable
  0
  $iwC
  []
  Template( // val local $iwC: notype, tree.tpe=$iwC
java.lang.Object, scala.Serializable // parents
ValDef(
  private
  _
  tpt
  empty
)
// 5 statements
DefDef( // def init(arg$outer: $iwC.$iwC.$iwC.type): $iwC
  method triedcooking
  init
  []
  // 1 parameter list
  ValDef( // $outer: $iwC.$iwC.$iwC.type

$outer
tpt // tree.tpe=$iwC.$iwC.$iwC.type
empty
  )
  tpt // tree.tpe=$iwC
  Block( // tree.tpe=Unit
Apply( // def init(): Object in class Object, tree.tpe=Object
  $iwC.super.init // def init(): Object in class Object,
tree.tpe=()Object
  Nil
)
()
  )
)
ValDef( // private[this] val sqlContext:
org.apache.spark.sql.hive.HiveContext
  private local triedcooking
  sqlContext 
  tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext
  Apply( // def init(sc: org.apache.spark.SparkContext):
org.apache.spark.sql.hive.HiveContext in class HiveContext,
tree.tpe=org.apache.spark.sql.hive.HiveContext
new org.apache.spark.sql.hive.HiveContext.init // def init(sc:
org.apache.spark.SparkContext): org.apache.spark.sql.hive.HiveContext in
class HiveContext, tree.tpe=(sc:
org.apache.spark.SparkContext)org.apache.spark.sql.hive.HiveContext
Apply( // val sc(): org.apache.spark.SparkContext,
tree.tpe=org.apache.spark.SparkContext
 
$iwC.this.$line5$$read$$iwC$$iwC$$iwC$$iwC$$$outer().$line5$$read$$iwC$$iwC$$iwC$$$outer().$line5$$read$$iwC$$iwC$$$outer().$VAL1().$iw().$iw().sc
// val sc(): org.apache.spark.SparkContext,
tree.tpe=()org.apache.spark.SparkContext
  Nil
)
  )
)
DefDef( // val sqlContext(): org.apache.spark.sql.hive.HiveContext
  method stable accessor
  sqlContext
  []
  List(Nil)
  tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext
  $iwC.this.sqlContext  // private[this] val sqlContext:
org.apache.spark.sql.hive.HiveContext,
tree.tpe=org.apache.spark.sql.hive.HiveContext
)
ValDef( // protected val $outer: $iwC.$iwC.$iwC.type
  protected synthetic paramaccessor triedcooking
  $outer 
  tpt // tree.tpe=$iwC.$iwC.$iwC.type
  empty
)
DefDef( // val $outer(): $iwC.$iwC.$iwC.type
  method synthetic stable expandedname triedcooking
  $line5$$read$$iwC$$iwC$$iwC$$iwC$$$outer
  []
  List(Nil)
  tpt // tree.tpe=Any
  $iwC.this.$outer  // protected val $outer: $iwC.$iwC.$iwC.type,
tree.tpe=$iwC.$iwC.$iwC.type
)
  )
)

== Expanded type of tree ==

ThisType(class $iwC)

uncaught exception during compilation:
scala.reflect.internal.Types$TypeError
scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature
in HiveContext.class refers to term conf
in value org.apache.hadoop.hive which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling
HiveContext.class.
That entry seems to have slain the compiler.  Shall I replay
your session? I can re-run each line except the last one.
[y/n]


Thanks
Tridib



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-use-HiveContext-in-spark-shell-tp18261.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark sql create nested schema

2014-11-04 Thread tridib
I am trying to create a schema which will look like:

root
 |-- ParentInfo: struct (nullable = true)
 ||-- ID: string (nullable = true)
 ||-- State: string (nullable = true)
 ||-- Zip: string (nullable = true)
 |-- ChildInfo: struct (nullable = true)
 ||-- ID: string (nullable = true)
 ||-- State: string (nullable = true)
 ||-- Hobby: string (nullable = true)
 ||-- Zip: string (nullable = true)

How do I create a StructField of StructType? I think that's what the root
is.
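
In case it helps, this is how that schema can be spelled out in the Scala API (a minimal
sketch assuming Spark 1.1.0, where the types are exposed under org.apache.spark.sql);
the key point is that a StructField's dataType can itself be a StructType:

import org.apache.spark.sql._

// Each nested struct is a StructType used as the dataType of a StructField.
val parentInfo = StructType(Seq(
  StructField("ID", StringType, nullable = true),
  StructField("State", StringType, nullable = true),
  StructField("Zip", StringType, nullable = true)))

val childInfo = StructType(Seq(
  StructField("ID", StringType, nullable = true),
  StructField("State", StringType, nullable = true),
  StructField("Hobby", StringType, nullable = true),
  StructField("Zip", StringType, nullable = true)))

// The root is just another StructType whose fields hold the nested structs.
val rootSchema = StructType(Seq(
  StructField("ParentInfo", parentInfo, nullable = true),
  StructField("ChildInfo", childInfo, nullable = true)))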

Thanks & Regards
Tridib



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-nested-schema-tp18090.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



StructField of StructType

2014-11-04 Thread tridib
How do I create a StructField of StructType? I need to create a nested
schema.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/StructField-of-StructType-tp18091.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: submit query to spark cluster using spark-sql

2014-10-24 Thread tridib
Figured it out. spark-sql --master spark://sparkmaster:7077 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/submit-query-to-spark-cluster-using-spark-sql-tp17182p17183.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



hive timestamp column always returns null

2014-10-22 Thread tridib
Hello Experts,
I created a table using spark-sql CLI. No Hive is installed. I am using
spark 1.1.0.

create table date_test(my_date timestamp)
row format delimited
fields terminated by ' '
lines terminated by '\n'
LOCATION '/user/hive/date_test';

The data file has following data:
2014-12-11 00:00:00
2013-11-11T00:00:00
2012-11-11T00:00:00Z

When I query using select * from date_test, it returns:
NULL
NULL
NULL

Could you please help me to resolve this issue?
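
Two things stand out (hedged guesses, since I have not re-run this exact table): with
fields terminated by ' ', the space inside 2014-12-11 00:00:00 splits the value across
two columns, and the ISO-style values do not match Hive's expected
yyyy-MM-dd HH:mm:ss[.fffffffff] layout, so the cast to timestamp yields NULL. A rough
workaround sketch — keep the column as a string and cast at query time (table name and
location are illustrative):

// Sketch only: run from spark-shell with a HiveContext (the spark-sql CLI uses one underneath).
val hc = new org.apache.spark.sql.hive.HiveContext(sc)

// No field delimiter, so the whole line lands in my_date as a string.
hc.sql("""CREATE TABLE date_test_raw (my_date string)
          ROW FORMAT DELIMITED LINES TERMINATED BY '\n'
          LOCATION '/user/hive/date_test'""")

// The cast works for values already in yyyy-MM-dd HH:mm:ss form; the ISO-style ones
// still come back NULL until the 'T'/'Z' are stripped (e.g. with regexp_replace).
hc.sql("SELECT cast(my_date AS timestamp) FROM date_test_raw").collect().foreach(println)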

Thanks
Tridib



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/hive-timestamp-column-always-returns-null-tp17079.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread tridib
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val personPath = "/hdd/spark/person.json"
val person = sqlContext.jsonFile(personPath)
person.printSchema()
person.registerTempTable("person")
val addressPath = "/hdd/spark/address.json"
val address = sqlContext.jsonFile(addressPath)
address.printSchema()
address.registerTempTable("address")
sqlContext.cacheTable("person")
sqlContext.cacheTable("address")
val rs2 = sqlContext.sql("SELECT p.id, p.name, a.city FROM person p, address
a where p.id = a.id limit 10").collect.foreach(println)

person.json
{"id":1,"name":"Mr. X"}

address.json
{"city":"Earth","id":1}



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p16914.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread tridib
Hmm... I thought HiveContext would only work if Hive is present. I am curious
to know when to use HiveContext and when to use SQLContext.
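
For what it is worth — this is my understanding rather than something confirmed in this
thread — HiveContext does not require an existing Hive installation: it only needs the
-Phive build and creates a local metastore_db on first use, and it accepts a richer
HiveQL dialect than the basic SQLContext parser. A minimal sketch of the two side by side:

// Both work without a Hive installation; HiveContext needs the -Phive build and
// writes a local metastore_db directory on first use.
val sqlCtx  = new org.apache.spark.sql.SQLContext(sc)       // basic SQL dialect
val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc) // HiveQL, e.g. LATERAL VIEW, richer UDFs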

Thanks & Regards
Tridib



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p16924.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread tridib
Thanks for pointing that out.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p16933.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark sql: sqlContext.jsonFile date type detection and perforormance

2014-10-21 Thread tridib
Any help or comments?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881p16939.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark sql: sqlContext.jsonFile date type detection and perforormance

2014-10-21 Thread tridib
Yes, I am unable to get jsonFile() to detect the date type
automatically from the JSON data.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881p16974.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark sql: timestamp in json - fails

2014-10-20 Thread tridib
Hello Experts,
After repeated attempts I am unable to run a query that maps a JSON date string. I
tried two approaches:

*** Approach 1 *** I created a Bean class with a timestamp field. When I try to
run it I get scala.MatchError: class java.sql.Timestamp (of class
java.lang.Class). Here is the code:
import java.sql.Timestamp;

public class ComplexClaim {
    private Timestamp timestamp;

    public Timestamp getTimestamp() {
        return timestamp;
    }

    public void setTimestamp(Timestamp timestamp) {
        this.timestamp = timestamp;
    }
}

JavaSparkContext ctx = getCtx(sc);
JavaSQLContext sqlCtx = getSqlCtx(getCtx(sc));
String path = "/hdd/spark/test.json";
JavaSchemaRDD test = sqlCtx.applySchema(ctx.textFile(path),
    ComplexClaim.class);
sqlCtx.registerRDDAsTable(test, "test");
execSql(sqlCtx, "select * from test", 1);



*** Approach 2 *** I created a StructType to map the date field. I got scala.MatchError:
TimestampType (of class org.apache.spark.sql.catalyst.types.TimestampType$).
Here is the code:

public StructType createStructType() {
    List<StructField> fields = new ArrayList<StructField>();
    fields.add(DataType.createStructField("timestamp",
        DataType.TimestampType, false));
    return DataType.createStructType(fields);
}

public void testJsonStruct(SparkContext sc) {
    JavaSQLContext sqlCtx = getSqlCtx(getCtx(sc));
    String path = "/hdd/spark/test.json";
    JavaSchemaRDD test = sqlCtx.jsonFile(path, createStructType());
    sqlCtx.registerRDDAsTable(test, "test");
    execSql(sqlCtx, "select * from test", 1);
}

Input file has a single record:
{"timestamp":"2014-10-10T01:01:01"}
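
Until TimestampType works end to end here, the workaround I can think of (a rough
sketch, assuming the Spark 1.1 Scala API; only the timestamp field from the record
above is modeled) is to declare the field as a string in the schema and cast it in
the query:

import org.apache.spark.sql._

// Hedged sketch: read the field as a plain string, then cast in SQL.
val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
val schema = StructType(Seq(StructField("timestamp", StringType, nullable = true)))

val test = hiveCtx.jsonFile("/hdd/spark/test.json", schema)
test.registerTempTable("test")

// Backticks because the column shares its name with the timestamp type; values like
// 2014-10-10T01:01:01 may need the 'T' replaced with a space before the cast is non-null.
hiveCtx.sql("SELECT cast(`timestamp` AS timestamp) AS ts FROM test").collect().foreach(println)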


Thanks
Tridib





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-timestamp-in-json-fails-tp16864.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark SQL : sqlContext.jsonFile date type detection and perforormance

2014-10-20 Thread tridib
Hi Spark SQL team,
I am trying to explore automatic schema detection for JSON documents. I have a few
questions:
1. What should the date format be for the fields to be detected as date type?
2. Is automatic schema inference slower than applying a specific schema?
3. At the moment I am parsing the JSON myself using a map function and creating a
SchemaRDD from the parsed JavaRDD. Is there any performance impact from not
using the built-in jsonFile()? (See the sketch below.)
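
Regarding (2) and (3), a minimal sketch of handing jsonFile() an explicit schema so the
inference pass over the data is skipped (Spark 1.1 Scala API; the field names below are
illustrative, not the real schema):

import org.apache.spark.sql._

val sqlCtx = new SQLContext(sc)

// Illustrative fields only; the date-like field stays a string here, since there is
// no automatic date type detection to rely on.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("serviceDate", StringType, nullable = true)))

// Supplying the schema skips the inference pass over the data entirely.
val claims = sqlCtx.jsonFile("/hdd/spark/claims.json", schema)
claims.registerTempTable("claims")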

Thanks
Tridib




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark sql: timestamp in json - fails

2014-10-20 Thread tridib
Stack trace for my second case:


2014-10-20 23:00:36,903 ERROR [Executor task launch worker-0]
executor.Executor (Logging.scala:logError(96)) - Exception in task 0.0 in
stage 0.0 (TID 0)
scala.MatchError: TimestampType (of class
org.apache.spark.sql.catalyst.types.TimestampType$)
at
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:348)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$12.apply(JsonRDD.scala:381)
at scala.Option.map(Option.scala:145)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:380)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:365)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
at
org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:365)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:38)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:38)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2014-10-20 23:00:36,933 WARN  [Result resolver thread-1]
scheduler.TaskSetManager (Logging.scala:logWarning(71)) - Lost task 0.0 in
stage 0.0 (TID 0, localhost): scala.MatchError: TimestampType (of class
org.apache.spark.sql.catalyst.types.TimestampType$)
   
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:348)
   
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$12.apply(JsonRDD.scala:381)
scala.Option.map(Option.scala:145)
   
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:380)
   
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:365)
scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
   
org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:365)
   
org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:38)
   
org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:38)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)
   
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)

RE: spark sql: timestamp in json - fails

2014-10-20 Thread tridib
Spark 1.1.0



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-timestamp-in-json-fails-tp16864p16888.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-20 Thread tridib
Hello Experts,
I have two tables built using jsonFile(). I can successfully run join queries
on these tables. But once I cacheTable() them, all join queries fail.

Here is the stack trace:
java.lang.NullPointerException
at
org.apache.spark.sql.columnar.InMemoryRelation.statistics$lzycompute(InMemoryColumnarTableScan.scala:43)
at
org.apache.spark.sql.columnar.InMemoryRelation.statistics(InMemoryColumnarTableScan.scala:42)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$statistics$1.apply(LogicalPlan.scala:50)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$statistics$1.apply(LogicalPlan.scala:50)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.statistics$lzycompute(LogicalPlan.scala:50)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.statistics(LogicalPlan.scala:44)
at
org.apache.spark.sql.execution.SparkStrategies$HashJoin$.apply(SparkStrategies.scala:83)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:268)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at
org.apache.spark.sql.execution.SparkStrategies$HashAggregation$.apply(SparkStrategies.scala:146)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at
org.apache.spark.sql.execution.SparkStrategies$TakeOrdered$.apply(SparkStrategies.scala:191)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
at $iwC$$iwC$$iwC$$iwC.init(console:15)
at $iwC$$iwC$$iwC.init(console:20)
at $iwC$$iwC.init(console:22)
at $iwC.init(console:24)
at init(console:26)
at .init(console:30)
at .clinit(console)
at .init(console:7)
at .clinit(console)
at $print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:846)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1119)
at
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:672)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:703)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:667)
at