Date class not supported by SparkSQL

2015-04-19 Thread Lior Chaga
Using Spark 1.2.0. Tried to register an RDD as a table and got: scala.MatchError: class java.util.Date (of class java.lang.Class). I see this was resolved in https://issues.apache.org/jira/browse/SPARK-2562 (included in 1.2.0). Has anyone encountered this issue? Thanks, Lior

Re: Random pairs / RDD order

2015-04-19 Thread Aurélien Bellet
Hi Imran, Thanks for the suggestion! Unfortunately the type does not match, but I could write my own function that shuffles the sample. On 4/17/15 9:34 PM, Imran Rashid wrote: if you can store the entire sample for one partition in memory, I think you just want: val sample1 =
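If each partition's sample fits in memory, shuffling it locally can be sketched like this (a hedged sketch built on Imran's suggestion; the helper name and seed handling are assumptions, not code from the thread):

```scala
import scala.util.Random

import org.apache.spark.rdd.RDD

// Shuffle the elements of each partition in memory. Assumes, as suggested,
// that a single partition's sample fits in memory.
def shufflePartitions[T: scala.reflect.ClassTag](rdd: RDD[T], seed: Long): RDD[T] =
  rdd.mapPartitionsWithIndex { (idx, iter) =>
    val rng = new Random(seed + idx)      // per-partition RNG, reproducible across runs
    rng.shuffle(iter.toBuffer).iterator   // materialize, shuffle, hand back an iterator
  }
```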

Re: spark application was submitted twice unexpectedly

2015-04-19 Thread Pengcheng Liu
Looking into the work folder of the problematic application, it seems the application keeps creating executors, and the worker's error log is as below: Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs at

Re: Skipped Jobs

2015-04-19 Thread Mark Hamstra
Almost. Jobs don't get skipped. Stages and Tasks do if the needed results are already available. On Sun, Apr 19, 2015 at 3:18 PM, Denny Lee denny.g@gmail.com wrote: The job is skipped because the results are available in memory from a prior run. More info at:

Re: Skipped Jobs

2015-04-19 Thread Denny Lee
Thanks for the correction Mark :) On Sun, Apr 19, 2015 at 3:45 PM Mark Hamstra m...@clearstorydata.com wrote: Almost. Jobs don't get skipped. Stages and Tasks do if the needed results are already available. On Sun, Apr 19, 2015 at 3:18 PM, Denny Lee denny.g@gmail.com wrote: The job

Re: newAPIHadoopRDD file name

2015-04-19 Thread hnahak
At the record reader level you can pass the file name as the key or value. sc.newAPIHadoopRDD(job.getConfiguration, classOf[AvroKeyInputFormat[myObject]], classOf[AvroKey[myObject]], classOf[Text] // can contain your file) AvroKeyInputFormat extends
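Two common ways to recover the source file name without writing a custom record reader, sketched under the assumption of plain text input (the paths and types are illustrative, not from the thread):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.NewHadoopRDD

// (a) For many small files, wholeTextFiles already pairs (fileName, fileContents):
val byFile = sc.wholeTextFiles("hdfs:///data/*.txt")

// (b) For an arbitrary new-API input format, drop down to the input split:
val rdd = sc.newAPIHadoopFile("hdfs:///data/*.txt",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
val withNames = rdd.asInstanceOf[NewHadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit { (split, iter) =>
    val file = split.asInstanceOf[FileSplit].getPath.toString
    iter.map { case (_, line) => (file, line.toString) }
  }
```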

Skipped Jobs

2015-04-19 Thread James King
In the web UI I can see some jobs marked as 'skipped'. What does that mean? Why are these jobs skipped? Do they ever get executed? Regards jk

RE: Can a map function return null

2015-04-19 Thread Evo Eftimov
In fact you can return “NULL” from your initial map and hence not resort to Optional&lt;String&gt; at all From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Sunday, April 19, 2015 9:48 PM To: 'Steve Lewis' Cc: 'Olivier Girardot'; 'user@spark.apache.org' Subject: RE: Can a map function return

GraphX: unbalanced computation and slow runtime on livejournal network

2015-04-19 Thread harenbergsd
Hi all, I have been testing GraphX on the soc-LiveJournal1 network from the SNAP repository. Currently I am running on c3.8xlarge EC2 instances on Amazon. These instances have 32 cores and 60GB RAM per node, and so far I have run SSSP, PageRank, and WCC on a 1, 4, and 8 node cluster. The issues

RE: Can a map function return null

2015-04-19 Thread Evo Eftimov
Well, you can do another map to turn Optional&lt;String&gt; into String: in the cases when the Optional is empty you can store e.g. “NULL” as the value of the RDD element. If this is not acceptable (based on the objectives of your architecture), and IF returning plain null instead of Optional does

Re: GraphX: unbalanced computation and slow runtime on livejournal network

2015-04-19 Thread hnahak
Hi Steve, I did Spark 1.3.0 PageRank benchmarking on soc-LiveJournal1 in a 4 node cluster with 16, 16, 8, 8 GBs RAM respectively. The cluster has 4 workers, including the master, with 4, 4, 2, 2 CPUs. I set executor memory to 3g and driver to 5g. No. of Iterations -- GraphX(mins) 1 -- 1 2

Re: MLlib -Collaborative Filtering

2015-04-19 Thread Christian S. Perone
The easiest way to do that is to use a similarity metric between the different user factors. On Sat, Apr 18, 2015 at 7:49 AM, riginos samarasrigi...@gmail.com wrote: Is there any way that i can see the similarity table of 2 users in that algorithm? by that i mean the similarity between 2 users
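As a hedged sketch of that suggestion (model, userId1 and userId2 are assumed to be in scope; ALS itself does not expose a user-user similarity table), cosine similarity between two users' latent factors from a MatrixFactorizationModel could look like:

```scala
// model.userFeatures is an RDD[(Int, Array[Double])] of per-user latent factors.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = (a, b).zipped.map(_ * _).sum
  val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  dot / norms
}

val Seq(u1) = model.userFeatures.lookup(userId1)  // lookup returns Seq[Array[Double]]
val Seq(u2) = model.userFeatures.lookup(userId2)
val similarity = cosine(u1, u2)                   // 1.0 = same direction in factor space
```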

Aggregation by column and generating a json

2015-04-19 Thread dsub
I am exploring Spark SQL and DataFrames and trying to create an aggregation by column and generate a single JSON row with the aggregation. Any inputs on the right approach would be helpful. Here is my sample data user,sports,major,league,count [test1,Sports,Switzerland,NLA,6]
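One possible shape in Spark 1.3, sketched with column names taken from the sample row (the grouping keys and aggregate are assumptions about what is wanted):

```scala
import org.apache.spark.sql.functions.sum

// df is assumed to have columns (user, sports, major, league, count).
val aggregated = df.groupBy("major", "league").agg(sum("count").as("total"))

// DataFrame.toJSON renders each row as a JSON string (an RDD[String] in 1.3),
// so each aggregated group becomes one JSON row.
aggregated.toJSON.collect().foreach(println)
```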

Re: Can a map function return null

2015-04-19 Thread Steve Lewis
So you imagine something like this: JavaRDD&lt;String&gt; words = ... JavaRDD&lt;Optional&lt;String&gt;&gt; wordsFiltered = words.map(new Function&lt;String, Optional&lt;String&gt;&gt;() { @Override public Optional&lt;String&gt; call(String s) throws Exception { if ((s.length()) % 2 == 1) // drop strings of odd length
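For comparison, the Scala API expresses the same drop-odd-length idea with neither Optional nor null, because flatMap over Option removes empty results in one step (a sketch, not code from the thread):

```scala
val words = sc.parallelize(Seq("a", "ab", "abc", "abcd"))
val evenLength = words.flatMap { s =>
  if (s.length % 2 == 1) None else Some(s)  // None drops the element entirely
}
```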

GraphX: unbalanced computation and slow runtime on livejournal network

2015-04-19 Thread Steven Harenberg
Hi all, I have been testing GraphX on the soc-LiveJournal1 network from the SNAP repository. Currently I am running on c3.8xlarge EC2 instances on Amazon. These instances have 32 cores and 60GB RAM per node, and so far I have run SSSP, PageRank, and WCC on a 1, 4, and 8 node cluster. The issues

compilation error

2015-04-19 Thread Brahma Reddy Battula
Hi All, Getting the following error when I am compiling Spark. What did I miss? Even after googling I did not find an exact solution for this... [ERROR] Failed to execute goal org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project spark-assembly_2.10: Error creating shaded jar:

Re: compilation error

2015-04-19 Thread Ted Yu
What JDK release are you using? Can you give the complete command you used? Which Spark branch are you working with? Cheers On Sun, Apr 19, 2015 at 7:25 PM, Brahma Reddy Battula brahmareddy.batt...@huawei.com wrote: Hi All Getting following error, when I am compiling spark..What did I

Re: dataframe can not find fields after loading from hive

2015-04-19 Thread Yin Huai
Hi Cesar, Can you try 1.3.1 ( https://spark.apache.org/releases/spark-release-1-3-1.html) and see if it still shows the error? Thanks, Yin On Fri, Apr 17, 2015 at 1:58 PM, Reynold Xin r...@databricks.com wrote: This is strange. cc the dev list since it might be a bug. On Thu, Apr 16,

Re: Can't get SparkListener to work

2015-04-19 Thread Shixiong Zhu
The problem is the code you use to test: sc.parallelize(List(1, 2, 3)).map(throw new SparkException("test")).collect() is like the following example: def foo: Int =&gt; Nothing = { throw new SparkException("test") } sc.parallelize(List(1, 2, 3)).map(foo).collect() So actually the Spark jobs do not
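The distinction can be made concrete (a sketch; SparkException is used only to mirror the thread):

```scala
import org.apache.spark.SparkException

// Wrong: the bare `throw` expression is evaluated on the driver to produce the
// argument to map, so it throws before any job is submitted and no listener
// events ever fire:
//   sc.parallelize(List(1, 2, 3)).map(throw new SparkException("test")).collect()

// Right: wrap the throw in a function literal so it runs inside the tasks, and
// the listener then observes a failed job:
sc.parallelize(List(1, 2, 3))
  .map((i: Int) => throw new SparkException("test"))
  .collect()
```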

Re: Can't get SparkListener to work

2015-04-19 Thread Praveen Balaji
Thanks Shixiong. I'll try this. On Sun, Apr 19, 2015, 7:36 PM Shixiong Zhu zsxw...@gmail.com wrote: The problem is the code you use to test: sc.parallelize(List(1, 2, 3)).map(throw new SparkException("test")).collect() is like the following example: def foo: Int =&gt; Nothing = { throw

RE: compilation error

2015-04-19 Thread Brahma Reddy Battula
Hey Todd, thanks a lot for your reply. Kindly check the following details: spark version: 1.1.0 jdk: jdk1.7.0_60, command: mvn -Pbigtop-dist -Phive -Pyarn -Phadoop-2.4 -Dhadoop.version=V100R001C00 -DskipTests package Thanks Regards Brahma Reddy Battula

Re: [STREAMING KAFKA - Direct Approach] JavaPairRDD cannot be cast to HasOffsetRanges

2015-04-19 Thread Sean Owen
You need to access the underlying RDD with .rdd() and cast that. That works for me. On Mon, Apr 20, 2015 at 4:41 AM, RimBerry truonghoanglinhk55b...@gmail.com wrote: Hi everyone, i am trying to use the direct approach in streaming-kafka-integration
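A sketch of both variants (the stream variable names are illustrative):

```scala
import org.apache.spark.streaming.kafka.HasOffsetRanges

directStream.foreachRDD { rdd =>
  // Scala: the RDDs produced by the direct stream implement HasOffsetRanges.
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsets.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} -> ${o.untilOffset}")
  }
}

// Java: the JavaPairRDD wrapper is not itself HasOffsetRanges; unwrap it first:
//   OffsetRange[] offsets = ((HasOffsetRanges) javaPairRDD.rdd()).offsetRanges();
```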

Re: compilation error

2015-04-19 Thread Sean Owen
Brahma, since you can see the continuous integration builds are passing, it's got to be something specific to your environment, right? This is not even an error from Spark, but from Maven plugins. On Mon, Apr 20, 2015 at 4:42 AM, Ted Yu yuzhih...@gmail.com wrote: bq. -Dhadoop.version=V100R001C00

Code Deployment tools in Production

2015-04-19 Thread Arun Patel
Generally, what tools are used to schedule Spark jobs in production? How is Spark Streaming code deployed? I am interested in knowing the tools used, like cron, oozie etc. Thanks, Arun

[STREAMING KAFKA - Direct Approach] JavaPairRDD cannot be cast to HasOffsetRanges

2015-04-19 Thread RimBerry
Hi everyone, I am trying to use the direct approach in streaming-kafka-integration http://spark.apache.org/docs/latest/streaming-kafka-integration.html, pulling data from kafka as follows: JavaPairInputDStream&lt;String, String&gt; messages =

Re: how to make a spark cluster ?

2015-04-19 Thread Jörn Franke
Hi, If you have just one physical machine then I would try out Docker instead of a full VM (which would be a waste of memory and CPU). Best regards On 20 Apr 2015 00:11, hnahak harihar1...@gmail.com wrote: Hi All, I've a big physical machine with 16 CPUs, 256 GB RAM, 20 TB hard disk. I just need

RE: compilation error

2015-04-19 Thread Brahma Reddy Battula
Thanks a lot for your replies. @Ted, V100R001C00 is our internal hadoop version, which is based on hadoop 2.4.1. @Sean Owen, yes, you are correct. I just wanted to know what leads to this problem... Thanks Regards Brahma Reddy Battula From: Sean

SparkStreaming onStart not being invoked on CustomReceiver attached to master with multiple workers

2015-04-19 Thread Ankit Patel
I am experiencing a problem with Spark Streaming (Spark 1.2.0): the onStart method is never called on a CustomReceiver when calling spark-submit against a master node with multiple workers. However, Spark Streaming works fine with no master node set. Has anyone noticed this issue?
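One thing worth checking in this situation: each receiver pins one executor core, so a cluster whose executors expose too few cores can accept the receiver but never actually run it (or never run the tasks consuming its data). A minimal receiver skeleton, for reference (a hedged sketch, not the thread's code):

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  // onStart must return immediately; do the real work on a background thread,
  // or the receiver can appear to hang without ever "starting".
  def onStart(): Unit = {
    new Thread("my-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store("tick")        // push received data into Spark
          Thread.sleep(1000)
        }
      }
    }.start()
  }
  def onStop(): Unit = ()      // the loop above exits once isStopped() is true
}
```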

Re: Code Deployment tools in Production

2015-04-19 Thread Vova Shelgunov
On 20 Apr 2015 05:45, Arun Patel arunp.bigd...@gmail.com wrote: http://23.251.129.190:8090/spark-twitter-streaming-web/analysis/3fb28f76-62fe-47f3-a1a8-66ac610c2447.html spark jobs in production? How is spark streaming code is deployed? I am interested in knowing the tools used like cron,

Re: Dataframes Question

2015-04-19 Thread Ted Yu
That's right. On Sun, Apr 19, 2015 at 8:59 AM, Arun Patel arunp.bigd...@gmail.com wrote: Thanks Ted. So, whatever the operations I am performing now are DataFrames and not SchemaRDD? Is that right? Regards, Venkat On Sun, Apr 19, 2015 at 9:13 AM, Ted Yu yuzhih...@gmail.com wrote: bq.

Re: Dataframes Question

2015-04-19 Thread Ted Yu
bq. SchemaRDD is not existing in 1.3? That's right. See this thread for more background: http://search-hadoop.com/m/JW1q5zQ1Xw/spark+DataFrame+schemarddsubj=renaming+SchemaRDD+gt+DataFrame On Sat, Apr 18, 2015 at 5:43 PM, Abhishek R. Singh abhis...@tetrationanalytics.com wrote: I am no

Re: Dataframes Question

2015-04-19 Thread Arun Patel
Thanks Ted. So, whatever the operations I am performing now are DataFrames and not SchemaRDD? Is that right? Regards, Venkat On Sun, Apr 19, 2015 at 9:13 AM, Ted Yu yuzhih...@gmail.com wrote: bq. SchemaRDD is not existing in 1.3? That's right. See this thread for more background:

Re: Date class not supported by SparkSQL

2015-04-19 Thread Lior Chaga
Here's a code example: public class DateSparkSQLExample { public static void main(String[] args) { SparkConf conf = new SparkConf().setAppName("test").setMaster("local"); JavaSparkContext sc = new JavaSparkContext(conf); List&lt;SomeObject&gt; itemsList =
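Spark SQL's reflection-based schema inference supports java.sql.Timestamp (and, in later releases, java.sql.Date) but not java.util.Date, so a common workaround is to convert before registering the RDD. A sketch in Scala against the 1.2 API (the case class, field names and rawRdd are assumptions standing in for the thread's SomeObject):

```scala
import java.sql.Timestamp

// Hypothetical case class standing in for the thread's SomeObject.
case class Event(name: String, createdAt: Timestamp)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion in 1.2

val events = rawRdd.map { o =>     // rawRdd: an RDD of objects carrying java.util.Date
  Event(o.getName, new Timestamp(o.getDate.getTime))
}
events.registerTempTable("events")
```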