How to debug ClassCastException: java.lang.String cannot be cast to java.lang.Long in SparkSQL

2016-01-26 Thread Anfernee Xu
Hi, I'm using Spark 1.5.0. I wrote a custom Hadoop InputFormat to load data from a 3rd-party data source, and the data type mapping has been taken care of in my code, but when I issued the query below: SELECT * FROM ( SELECT count(*) as failures from test WHERE state != 'success' ) as tmp WHERE (
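
One hedged way to chase this kind of cast error, sketched under assumptions (the table and column names come from the truncated query above; everything else is hypothetical): print the schema Spark actually inferred, since a column arriving as StringType will break any comparison that expects a long.

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class CastDebug {
    static void inspect(SQLContext sqlContext, DataFrame df) {
        // Shows the type Spark believes each column has; a StringType where
        // LongType is expected is the usual source of this ClassCastException.
        df.printSchema();
        df.registerTempTable("test");
        DataFrame failures = sqlContext.sql(
                "SELECT count(*) AS failures FROM test WHERE state != 'success'");
        failures.printSchema(); // count(*) should come back as bigint (long)
    }
}
```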

Re: Are multiple Spark Contexts supported in Spark 1.5.0?

2015-12-04 Thread Anfernee Xu
If multiple users are looking at the same data set, then it's a good choice to share the SparkContext. But my use cases are different: users are looking at different data (I use a custom Hadoop InputFormat to load data from my data source based on the user input), and the data might not have any overlap.
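
A minimal sketch of serving different per-user data from one shared SparkContext; the generic load method and the "my.datasource.query" key are assumptions, not the poster's actual code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PerUserLoader {
    private final JavaSparkContext sc; // one shared context serves every user

    public PerUserLoader(JavaSparkContext sc) {
        this.sc = sc;
    }

    // Each request gets its own Hadoop Configuration, so the same custom
    // InputFormat can target different data without a second SparkContext.
    public <K, V, F extends InputFormat<K, V>> JavaPairRDD<K, V> loadFor(
            String userInput, Class<F> format, Class<K> keyClass, Class<V> valueClass) {
        Configuration conf = new Configuration();
        conf.set("my.datasource.query", userInput); // hypothetical config key
        return sc.newAPIHadoopRDD(conf, format, keyClass, valueClass);
    }
}
```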

Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Anfernee Xu
Hi, I have a doubt regarding yarn-cluster mode and spark.driver.allowMultipleContexts for the use cases below. I have a long-running backend server where I will create a short-lived Spark job in response to each user request, based on the fact that by default multiple Spark Contexts cannot be created
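
For reference, a minimal sketch of the flag under discussion; local[2] is used purely for illustration, and Spark treats this setting as a test-only escape hatch rather than a supported production mode.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ShortLivedJob {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("per-request-job")
                .setMaster("local[2]") // illustration only; the thread targets YARN
                // Lifts the one-SparkContext-per-JVM guard.
                .set("spark.driver.allowMultipleContexts", "true");
        JavaSparkContext sc = new JavaSparkContext(conf);
        try {
            // ... run the short-lived job for this user request ...
        } finally {
            sc.stop(); // release the context when the request completes
        }
    }
}
```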

Re: Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Anfernee Xu
> .set("spark.driver.allowMultipleContexts", "true") > ./core/src/test/scala/org/apache/spark/SparkContextSuite.scala > FYI > On Tue, Dec 1, 2015 at 3:32 PM, Anfernee Xu <anfernee...@gmail.com>

Re: Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Anfernee Xu
On Tue, Dec 1, 2015 at 3:32 PM, Anfernee Xu <anfernee...@gmail.com> wrote: > I have a long-running backend server where I will create a short-lived Spark > job in response to each user request, based on the fact that by default > multiple Spark Contexts cannot be created

Millions of entities in custom Hadoop InputFormat and broadcast variable

2015-11-26 Thread Anfernee Xu
Hi Spark experts, First of all, happy Thanksgiving! Then comes my question: I have implemented a custom Hadoop InputFormat to load millions of entities from my data source into Spark (as a JavaRDD, then transformed to a DataFrame). The approach I took in implementing the custom Hadoop RDD is loading all
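
For context, a bare-bones shape such an InputFormat could take, slicing the entities into ranges so each partition fetches only its own slice instead of the driver loading everything. The entity counts and EntityRangeSplit are invented for the sketch; a real version needs a matching RecordReader plus Writable serialization on the split.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class EntityInputFormat extends InputFormat<LongWritable, Text> {

    @Override
    public List<InputSplit> getSplits(JobContext context) {
        long total = 2_000_000L;  // hypothetical entity count
        long perSplit = 100_000L; // hypothetical slice size
        List<InputSplit> splits = new ArrayList<>();
        for (long start = 0; start < total; start += perSplit) {
            splits.add(new EntityRangeSplit(start, Math.min(start + perSplit, total)));
        }
        return splits;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // A real implementation returns a reader that fetches only the
        // [start, end) range held by its EntityRangeSplit.
        throw new UnsupportedOperationException("sketch only");
    }

    // Minimal split carrying a [start, end) entity range; a real split must
    // also implement Writable so it can be shipped to executors.
    public static class EntityRangeSplit extends InputSplit {
        final long start, end;

        EntityRangeSplit(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public long getLength() {
            return end - start;
        }

        @Override
        public String[] getLocations() {
            return new String[0]; // no locality preference
        }
    }
}
```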

RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread Anfernee Xu
Hi, I have a pretty large data set (2M entities) in my RDD. The data has already been partitioned by a specific key, and the key has a range (long type). Now I want to create a bunch of key buckets; for example, if the key has range 1 -> 100, I will break the whole range into the buckets below 1
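
A minimal sketch of the two options being weighed, assuming a pair RDD keyed by the long value; the table name test, column key, and the [1, 10] bucket are placeholders.

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class BucketQuery {
    // Option 1: RDD filter(), a full scan of every partition per bucket.
    static JavaPairRDD<Long, String> rddBucket(JavaPairRDD<Long, String> rdd) {
        return rdd.filter(t -> t._1() >= 1L && t._1() <= 10L);
    }

    // Option 2: a SparkSQL WHERE clause, which lets Catalyst prune or push
    // down the predicate when the source supports it.
    static DataFrame sqlBucket(SQLContext sqlContext) {
        return sqlContext.sql("SELECT * FROM test WHERE key BETWEEN 1 AND 10");
    }
}
```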

Re: RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread Anfernee Xu
Thanks Yong for your response. Let me see if I can understand what you're suggesting: for the whole data set, when I load it into Spark (I'm using a custom Hadoop InputFormat), I will add an extra field to each element in the RDD, like bucket_id. For example: Key 1 - 10, bucket_id=1; 11-20,
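
A minimal sketch of that bucket_id assignment at load time; the bucket width of 10 and the pair layout are assumptions based on the example above.

```java
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class BucketTagger {
    // Tag each element with a bucket_id derived from its key, so later
    // queries filter on the coarse bucket instead of scanning raw keys.
    static JavaPairRDD<Integer, Tuple2<Long, String>> withBucketId(
            JavaPairRDD<Long, String> rdd) {
        final long width = 10L; // keys 1-10 -> bucket 1, 11-20 -> bucket 2, ...
        return rdd.mapToPair(t -> new Tuple2<>((int) ((t._1() - 1) / width) + 1, t));
    }
}
```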

SparkSQL: What is the cost of DataFrame.registerTempTable(String)? Can I have multiple tables referencing the same DataFrame?

2015-10-28 Thread Anfernee Xu
Hi, I just want to understand the cost of DataFrame.registerTempTable(String): is it just a trivial operation (like creating an object reference) in the master (Driver) JVM? And can I have multiple tables with different names referencing the same DataFrame? Thanks -- Anfernee
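
On the second question, a quick sketch of what is being asked; to the best of my knowledge registerTempTable only records a name-to-logical-plan mapping in the driver-side catalog, so two names can point at one DataFrame (the names here are invented).

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class TempTableAliases {
    static void register(SQLContext sqlContext, DataFrame df) {
        // Both names resolve to the same logical plan; registerTempTable
        // itself copies or materializes no data.
        df.registerTempTable("events");
        df.registerTempTable("events_alias");
        sqlContext.sql("SELECT count(*) FROM events").show();
        sqlContext.sql("SELECT count(*) FROM events_alias").show();
    }
}
```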

Spark Streaming: how to use StreamingContext.queueStream with existing RDD

2015-10-26 Thread Anfernee Xu
Hi, here's my situation: I have some kind of offline dataset and got it loaded into Spark as an RDD, but I want to form a virtual data stream feeding into Spark Streaming. My code looks like this: // sort offline data by time, the dataset spans 2 hours 1) JavaRDD sortedByTime =
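
A minimal sketch of replaying pre-sorted offline RDDs through queueStream; the slicing of the two-hour dataset into perBatch RDDs is assumed, since the original code is truncated.

```java
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class OfflineReplay {
    static JavaDStream<String> replay(JavaStreamingContext jssc,
                                      List<JavaRDD<String>> perBatch) {
        // With oneAtATime = true, one queued RDD is dequeued per batch
        // interval, so the offline data replays as if it were arriving live.
        Queue<JavaRDD<String>> queue = new LinkedList<>(perBatch);
        return jssc.queueStream(queue, true);
    }
}
```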

Spark Streaming: how to use StreamingContext.queueStream

2015-10-23 Thread Anfernee Xu
Hi, here's my situation: I have some kind of offline dataset, but I want to form a virtual data stream feeding into Spark Streaming. My code looks like this: // sort offline data by time 1) JavaRDD sortedByTime = offlineDataRDD.sortBy( ); // compute a list of JavaRDD, each element

Application not found in Spark historyserver in yarn-client mode

2015-10-15 Thread Anfernee Xu
Sorry, I have to re-send this again as I did not get an answer. Here's the problem I'm facing: I'm using the Spark 1.5.0 release, and I have a standalone Java application which periodically submits Spark jobs to my YARN cluster; btw I'm not using 'spark-submit' or 'org.apache.spark.launcher' to submit
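
A hedged guess at the usual cause: the history server only lists applications that wrote event logs, and a programmatically created SparkContext does not enable them by default. A sketch of the relevant properties (the HDFS path is a placeholder):

```java
import org.apache.spark.SparkConf;

public class HistoryServerConf {
    static SparkConf withEventLogs(SparkConf conf) {
        // Without event logging the app finishes fine on YARN, but the
        // Spark history server has nothing to replay and reports it missing.
        return conf
                .set("spark.eventLog.enabled", "true")
                .set("spark.eventLog.dir", "hdfs:///spark-event-logs"); // placeholder
    }
}
```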

[no subject]

2015-10-15 Thread Anfernee Xu
Sorry, I have to re-send this again as I did not get an answer. Here's the problem I'm facing: I have a standalone Java application which periodically submits Spark jobs to my YARN cluster; btw I'm not using 'spark-submit' or 'org.apache.spark.launcher' to submit my jobs. These jobs are

Application not found in Spark historyserver in yarn-client mode

2015-10-14 Thread Anfernee Xu
Hi, here's the problem I'm facing: I have a standalone Java application which periodically submits Spark jobs to my YARN cluster; btw I'm not using 'spark-submit' or 'org.apache.spark.launcher' to submit my jobs. These jobs are successful and I can see them on the YARN RM web UI, but when I want to

Custom Hadoop InputSplit, Spark partitions, Spark executors/tasks and YARN containers

2015-09-23 Thread Anfernee Xu
Hi Spark experts, I'm coming across these terminologies and having some confusion; could you please help me understand them better? For instance, I have implemented a Hadoop InputFormat to load my external data into Spark; in turn my custom InputFormat will create a bunch of InputSplits, my
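
A small sketch of the first mapping in that chain: each InputSplit returned by getSplits() becomes one Spark partition, which can be checked directly. TextInputFormat stands in for the custom InputFormat here.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SplitToPartition {
    static void check(JavaSparkContext sc, Configuration hadoopConf) {
        // Swap in your own InputFormat and key/value types here.
        JavaPairRDD<LongWritable, Text> rdd = sc.newAPIHadoopRDD(
                hadoopConf, TextInputFormat.class, LongWritable.class, Text.class);
        // One Spark partition per InputSplit; each partition is processed by
        // one task, and tasks run inside executors hosted in YARN containers.
        System.out.println("partitions = " + rdd.partitions().size());
    }
}
```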