Date class not supported by SparkSQL
Using Spark 1.2.0. Tried to apply a schema to an RDD and register it as a table, and got: scala.MatchError: class java.util.Date (of class java.lang.Class). I see this was resolved in https://issues.apache.org/jira/browse/SPARK-2562 (included in 1.2.0). Has anyone encountered this issue?

Thanks,
Lior
Re: Random pairs / RDD order
Hi Imran,

Thanks for the suggestion! Unfortunately the type does not match. But I could write my own function that shuffles the sample, though.

On 4/17/15 9:34 PM, Imran Rashid wrote:

If you can store the entire sample for one partition in memory, I think you just want:

val sample1 = rdd.sample(true, 0.01, 42).mapPartitions(scala.util.Random.shuffle)
val sample2 = rdd.sample(true, 0.01, 43).mapPartitions(scala.util.Random.shuffle)
...

On Fri, Apr 17, 2015 at 3:05 AM, Aurélien Bellet aurelien.bel...@telecom-paristech.fr wrote:

Hi Sean,

Thanks a lot for your reply. The problem is that I need to sample random *independent* pairs. If I draw two samples and build all n*(n-1) pairs then there is a lot of dependency. My current solution is also not satisfying because some pairs (the closest ones in a partition) have a much higher probability of being sampled. Not sure how to fix this.

Aurelien

On 16/04/2015 20:44, Sean Owen wrote:

Use mapPartitions, and then take two random samples of the elements in the partition, and return an iterator over all pairs of them? Should be pretty simple, assuming your sample size n is smallish, since you're returning ~n^2 pairs.

On Thu, Apr 16, 2015 at 7:00 PM, abellet aurelien.bel...@telecom-paristech.fr wrote:

Hi everyone,

I have a large RDD and I am trying to create an RDD of a random sample of pairs of elements from this RDD. The elements composing a pair should come from the same partition, for efficiency. The idea I've come up with is to take two random samples and then use zipPartitions to pair each i-th element of the first sample with the i-th element of the second sample. Here is a sample code illustrating the idea:

---
val rdd = sc.parallelize(1 to 6, 16)
val sample1 = rdd.sample(true, 0.01, 42)
val sample2 = rdd.sample(true, 0.01, 43)

def myfunc(s1: Iterator[Int], s2: Iterator[Int]): Iterator[String] = {
  var res = List[String]()
  while (s1.hasNext && s2.hasNext) {
    val x = s1.next + " " + s2.next
    res ::= x
  }
  res.iterator
}

val pairs = sample1.zipPartitions(sample2)(myfunc)
---

However I am not happy with this solution because each element is most likely to be paired with elements that are close by in the partition. This is because sample returns an ordered iterator. Any idea how to fix this? I did not find a way to efficiently shuffle the random sample so far. Thanks a lot!
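A hedged fix for the type mismatch Aurélien mentions at the top of this thread: scala.util.Random.shuffle takes a collection rather than an Iterator, so each partition can be materialized first (feasible under Imran's stated assumption that one partition's sample fits in memory):

val sample1 = rdd.sample(true, 0.01, 42)
  .mapPartitions(it => scala.util.Random.shuffle(it.toSeq).iterator)
val sample2 = rdd.sample(true, 0.01, 43)
  .mapPartitions(it => scala.util.Random.shuffle(it.toSeq).iterator)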
Re: spark application was submitted twice unexpectedly
Looking into the work folder of the problematic application, it seems that the application keeps creating executors. The error log of the worker is as below:

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    ... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
    ... 7 more

In the master's log, I also see errors generated continuously:

...
15/04/19 17:24:20 ERROR EndpointWriter: dropping message [class org.apache.spark.deploy.DeployMessages$ExecutorUpdated] for non-local recipient [Actor[akka.tcp://sparkDriver@spark101:46215/user/$b#-642262056]] arriving at [akka.tcp://sparkDriver@spark101:46215] inbound addresses are [akka.tcp://sparkMaster@spark101:7077]
15/04/19 17:24:20 ERROR EndpointWriter: dropping message [class org.apache.spark.deploy.DeployMessages$ExecutorAdded] for non-local recipient [Actor[akka.tcp://sparkDriver@spark101:46215/user/$b#-642262056]] arriving at [akka.tcp://sparkDriver@spark101:46215] inbound addresses are [akka.tcp://sparkMaster@spark101:7077]
15/04/19 17:24:20 ERROR EndpointWriter: dropping message [class org.apache.spark.deploy.DeployMessages$ExecutorUpdated] for non-local recipient [Actor[akka.tcp://sparkDriver@spark101:46215/user/$b#-642262056]] arriving at [akka.tcp://sparkDriver@spark101:46215] inbound addresses are [akka.tcp://sparkMaster@spark101:7077]
15/04/19 17:24:21 ERROR EndpointWriter: dropping message [class org.apache.spark.deploy.DeployMessages$ExecutorUpdated] for non-local recipient [Actor[akka.tcp://sparkDriver@spark101:46215/user/$b#-642262056]] arriving at [akka.tcp://sparkDriver@spark101:46215] inbound addresses are [akka.tcp://sparkMaster@spark101:7077]
15/04/19 17:24:21 ERROR UserGroupInformation: PriviledgedActionException as:root cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
15/04/19 17:24:22 ERROR EndpointWriter: dropping message [class org.apache.spark.deploy.DeployMessages$ExecutorUpdated] for non-local recipient [Actor[akka.tcp://sparkDriver@spark101:53140/user/$b#-510580371]] arriving at [akka.tcp://sparkDriver@spark101:53140] inbound addresses are [akka.tcp://sparkMaster@spark101:7077]
15/04/19 17:24:22 ERROR EndpointWriter: dropping message [class org.apache.spark.deploy.DeployMessages$ExecutorAdded] for non-local recipient [Actor[akka.tcp://sparkDriver@spark101:53140/user/$b#-510580371]] arriving at [akka.tcp://sparkDriver@spark101:53140] inbound addresses are [akka.tcp://sparkMaster@spark101:7077]
...
Re: Skipped Jobs
Almost. Jobs don't get skipped. Stages and tasks do, if the needed results are already available.

On Sun, Apr 19, 2015 at 3:18 PM, Denny Lee denny.g@gmail.com wrote:

The job is skipped because the results are available in memory from a prior run. More info at: http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3ccakx7bf-u+jc6q_zm7gtsj1mihagd_4up4qxpd9jfdjrfjax...@mail.gmail.com%3E. HTH!

On Sun, Apr 19, 2015 at 1:43 PM, James King jakwebin...@gmail.com wrote:

In the web UI I can see some jobs as 'skipped'. What does that mean? Why are these jobs skipped? Do they ever get executed?

Regards
jk
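A minimal spark-shell sketch that reproduces a skipped stage in the UI (assuming nothing evicts the shuffle output between the two actions):

val pairs = sc.parallelize(1 to 10000).map(x => (x % 10, x))
val counts = pairs.reduceByKey(_ + _)
counts.count()    // first action: runs the shuffle map stage and the result stage
counts.collect()  // second action: the map stage shows as "skipped" because its shuffle output is reused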
Re: Skipped Jobs
Thanks for the correction Mark :)

On Sun, Apr 19, 2015 at 3:45 PM, Mark Hamstra m...@clearstorydata.com wrote:

Almost. Jobs don't get skipped. Stages and tasks do, if the needed results are already available.

On Sun, Apr 19, 2015 at 3:18 PM, Denny Lee denny.g@gmail.com wrote:

The job is skipped because the results are available in memory from a prior run. More info at: http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3ccakx7bf-u+jc6q_zm7gtsj1mihagd_4up4qxpd9jfdjrfjax...@mail.gmail.com%3E. HTH!

On Sun, Apr 19, 2015 at 1:43 PM, James King jakwebin...@gmail.com wrote:

In the web UI I can see some jobs as 'skipped'. What does that mean? Why are these jobs skipped? Do they ever get executed?

Regards
jk
Re: newAPIHadoopRDD file name
At the record reader level you can pass the file name as the key or the value:

sc.newAPIHadoopRDD(job.getConfiguration,
  classOf[AvroKeyInputFormat[myObject]],
  classOf[AvroKey[myObject]],
  classOf[Text]) // the Text value can carry your file name

AvroKeyInputFormat extends InputFormat[AvroKey[myObject], Text] {
  createRecordReader() { return new YourRecordReader() }
}

YourRecordReader extends RecordReader[AvroKey[myObject], Text] {
  initialize() {
    Path file = inputSplit.getPath();
    // you can pass this file as a value from your record reader
  }
}
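If your Spark version exposes it, NewHadoopRDD.mapPartitionsWithInputSplit is a hedged alternative that avoids writing a custom record reader; a sketch, where myObject stands in for the poster's Avro class and the key/value types mirror those above:

import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.lib.input.FileSplit
import org.apache.spark.rdd.NewHadoopRDD

val rdd = sc.newAPIHadoopRDD(job.getConfiguration,
  classOf[AvroKeyInputFormat[myObject]],
  classOf[AvroKey[myObject]],
  classOf[NullWritable])

val withFileNames = rdd.asInstanceOf[NewHadoopRDD[AvroKey[myObject], NullWritable]]
  .mapPartitionsWithInputSplit { (split, iter) =>
    val fileName = split.asInstanceOf[FileSplit].getPath.toString
    iter.map { case (key, _) => (fileName, key.datum()) }  // pair each record with its source file
  }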
Skipped Jobs
In the web UI I can see some jobs as 'skipped'. What does that mean? Why are these jobs skipped? Do they ever get executed?

Regards
jk
RE: Can a map function return null
In fact you can return "NULL" from your initial map and hence not resort to Optional<String> at all.

From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Sunday, April 19, 2015 9:48 PM
To: 'Steve Lewis'
Cc: 'Olivier Girardot'; 'user@spark.apache.org'
Subject: RE: Can a map function return null

Well you can do another map to turn Optional<String> into String: in the cases when the Optional is empty you can store e.g. "NULL" as the value of the RDD element. If this is not acceptable (based on the objectives of your architecture), and IF returning plain null instead of Optional does throw a Spark exception, THEN as far as I am concerned, checkmate.

From: Steve Lewis [mailto:lordjoe2...@gmail.com]
Sent: Sunday, April 19, 2015 8:16 PM
To: Evo Eftimov
Cc: Olivier Girardot; user@spark.apache.org
Subject: Re: Can a map function return null

So you imagine something like this:

JavaRDD<String> words = ...
JavaRDD<Optional<String>> wordsFiltered = words.map(new Function<String, Optional<String>>() {
    @Override
    public Optional<String> call(String s) throws Exception {
        if ((s.length()) % 2 == 1) // drop strings of odd length
            return Optional.empty();
        else
            return Optional.of(s);
    }
});

That seems to return the wrong type, a JavaRDD<Optional<String>>, which cannot be used as the JavaRDD<String> the next step expects.

On Sun, Apr 19, 2015 at 12:17 PM, Evo Eftimov evo.efti...@isecc.com wrote:

I am on the move at the moment so I can't try it immediately, but from previous memory/experience I think if you return plain null you will get a Spark exception. Anyway, you can try it and see what happens and then ask the question. If you do get an exception, try Optional instead of plain null.

Sent from Samsung Mobile

Original message
From: Olivier Girardot
Date: 2015/04/18 22:04 (GMT+00:00)
To: Steve Lewis, user@spark.apache.org
Subject: Re: Can a map function return null

You can return an RDD with null values inside, and afterwards filter on item != null. In Scala (or even in Java 8) you'd rather use Option/Optional, and in Scala they're directly usable from Spark. Example:

sc.parallelize(1 to 1000).flatMap(item => if (item % 2 == 0) Some(item) else None).collect()
res0: Array[Int] = Array(2, 4, 6, ...)

Regards,
Olivier.

On Sat, Apr 18, 2015 at 20:44, Steve Lewis lordjoe2...@gmail.com wrote:

I find a number of cases where I have a JavaRDD and I wish to transform the data and, depending on a test, return 0 or one item (don't suggest a filter - the real case is more complex). So I currently do something like the following - perform a flatMap returning a list with 0 or 1 entries depending on the isUsed function:

JavaRDD<Foo> original = ...
JavaRDD<Foo> words = original.flatMap(new FlatMapFunction<Foo, Foo>() {
    @Override
    public Iterable<Foo> call(final Foo s) throws Exception {
        List<Foo> ret = new ArrayList<Foo>();
        if (isUsed(s))
            ret.add(transform(s));
        return ret; // contains 0 items if isUsed is false
    }
});

My question is: can I do a map returning the transformed data, and null if nothing is to be returned, as shown below? What does Spark do with a map function returning null?

JavaRDD<Foo> words = original.map(new MapFunction<String, String>() {
    @Override
    Foo call(final Foo s) throws Exception {
        List<Foo> ret = new ArrayList<Foo>();
        if (isUsed(s))
            return transform(s);
        return null; // not used - what happens now
    }
});

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell) Skype lordjoe_com
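Following Olivier's first suggestion (return nulls, then filter them out), a minimal Scala sketch; isUsed and transform stand in for the poster's own helpers:

val words = original.map(s => if (isUsed(s)) transform(s) else null).filter(_ != null)

This keeps the element type of the RDD, at the cost of materializing placeholder nulls that the next step immediately drops.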
GraphX: unbalanced computation and slow runtime on livejournal network
Hi all,

I have been testing GraphX on the soc-LiveJournal1 network from the SNAP repository. Currently I am running on c3.8xlarge EC2 instances on Amazon. These instances have 32 cores and 60GB RAM per node, and so far I have run SSSP, PageRank, and WCC on 1, 4, and 8 node clusters.

The issues I am having, which are present for all three algorithms, are that (1) GraphX is not improving between 4 and 8 nodes and (2) GraphX seems to be heavily unbalanced, with some machines doing the majority of the computation.

PageRank (20 iterations) is the worst. For 1-node, 4-node, and 8-node clusters I get the following runtimes (wallclock): 192s, 154s, and 154s. This result is potentially understandable, though the times are significantly worse than the results in the paper https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx.pdf, where this algorithm ran in ~75s on a weaker cluster.

My main concern is that the computation seems to be heavily unbalanced. I have measured the CPU time of all the processes associated with GraphX during execution, and for a 4-node cluster it yielded the following CPU times (one per machine): 724s, 697s, 2216s, 694s. Is this normal? Should I expect a more even distribution of work across machines?

I am using the stock PageRank code found here: https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala. I use the configurations spark.executor.memory=40g and spark.cores.max=128 for the 4-node case. I also set the number of edge partitions to 64.

Could you please let me know if these results are reasonable, or if I am doing something wrong. I really appreciate the help.

Thanks,
Steve
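One hedged thing to try for the imbalance (an assumption, not something from the post itself): explicitly repartition the edges, since a 2D partitioning strategy often spreads the work of a skewed graph more evenly than the default layout. A sketch reusing the 64 edge partitions and 20 iterations from the post:

import org.apache.spark.graphx._

val graph = GraphLoader.edgeListFile(sc, "soc-LiveJournal1.txt", numEdgePartitions = 64)
  .partitionBy(PartitionStrategy.EdgePartition2D)
val ranks = graph.staticPageRank(20).vertices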
RE: Can a map function return null
Well you can do another map to turn Optional<String> into String: in the cases when the Optional is empty you can store e.g. "NULL" as the value of the RDD element. If this is not acceptable (based on the objectives of your architecture), and IF returning plain null instead of Optional does throw a Spark exception, THEN as far as I am concerned, checkmate.

From: Steve Lewis [mailto:lordjoe2...@gmail.com]
Sent: Sunday, April 19, 2015 8:16 PM
To: Evo Eftimov
Cc: Olivier Girardot; user@spark.apache.org
Subject: Re: Can a map function return null

So you imagine something like this:

JavaRDD<String> words = ...
JavaRDD<Optional<String>> wordsFiltered = words.map(new Function<String, Optional<String>>() {
    @Override
    public Optional<String> call(String s) throws Exception {
        if ((s.length()) % 2 == 1) // drop strings of odd length
            return Optional.empty();
        else
            return Optional.of(s);
    }
});

That seems to return the wrong type, a JavaRDD<Optional<String>>, which cannot be used as the JavaRDD<String> the next step expects.
Re: GraphX: unbalanced computation and slow runtime on livejournal network
Hi Steve,

I did Spark 1.3.0 PageRank benchmarking on soc-LiveJournal1 on a 4-node cluster with 16, 16, 8, and 8 GB of RAM respectively. The cluster has 4 workers, including the master, with 4, 4, 2, and 2 CPUs. I set executor memory to 3g and the driver to 5g.

No. of iterations -- GraphX (mins)
1  -- 1
2  -- 1.2
3  -- 1.3
5  -- 1.6
10 -- 2.6
20 -- 3.9
Re: MLlib -Collaborative Filtering
The easiest way to do that is to use a similarity metric between the different user factors.

On Sat, Apr 18, 2015 at 7:49 AM, riginos samarasrigi...@gmail.com wrote:

Is there any way that I can see the similarity table of 2 users in that algorithm? By that I mean the similarity between 2 users.

--
Blog http://blog.christianperone.com | Github https://github.com/perone | Twitter https://twitter.com/tarantulae
"Forgive, O Lord, my little jokes on Thee, and I'll forgive Thy great big joke on me."
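A minimal sketch of that suggestion, assuming model is a MatrixFactorizationModel produced by ALS.train(...) and user1/user2 are hypothetical user IDs present in the training data:

def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot  = a.zip(b).map { case (x, y) => x * y }.sum
  val norm = (v: Array[Double]) => math.sqrt(v.map(x => x * x).sum)
  dot / (norm(a) * norm(b))
}

// userFeatures is an RDD[(Int, Array[Double])] of latent factors per user
val f1 = model.userFeatures.lookup(user1).head
val f2 = model.userFeatures.lookup(user2).head
val similarity = cosine(f1, f2)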
Aggregation by column and generating a json
I am exploring Spark SQL and DataFrames and trying to create an aggregation by column and generate a single JSON row with the aggregation. Any inputs on the right approach will be helpful. Here is my sample data (user, sports, major, league, count):

[test1,Sports,Switzerland,NLA,6]
[test1,Football,Australia,A-League,6]
[test1,Ice Hockey,Sweden,SHL,3]
[test1,Ice Hockey,Switzerland,NLB,2]
[test1,Football,Romania,Liga I,1]

I want to aggregate by user and create a single JSON row:

{ "user": "test1",
  "sports": [ { "Ice Hockey": 11, "Football": 7 } ],
  "major": [ { "Switzerland": 8, "Australia": 6, "Sweden": 3, "Romania": 1 } ],
  "league": [ { "NLA": 6, "A-League": 6, "SHL": 3, "NLB": 2, "Liga I": 1 } ],
  "total": 18 }
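A minimal sketch with the Spark 1.3 DataFrame API, assuming df holds the columns user, sports, major, league, count shown above. Each groupBy produces one of the per-column aggregates; assembling them into the nested JSON row would then be done per user, e.g. with a JSON library on the driver:

val bySports = df.groupBy("user", "sports").sum("count")
val byMajor  = df.groupBy("user", "major").sum("count")
val byLeague = df.groupBy("user", "league").sum("count")
val totals   = df.groupBy("user").sum("count")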
Re: Can a map function return null
So you imagine something like this:

JavaRDD<String> words = ...
JavaRDD<Optional<String>> wordsFiltered = words.map(new Function<String, Optional<String>>() {
    @Override
    public Optional<String> call(String s) throws Exception {
        if ((s.length()) % 2 == 1) // drop strings of odd length
            return Optional.empty();
        else
            return Optional.of(s);
    }
});

That seems to return the wrong type, a JavaRDD<Optional<String>>, which cannot be used as the JavaRDD<String> the next step expects.

On Sun, Apr 19, 2015 at 12:17 PM, Evo Eftimov evo.efti...@isecc.com wrote:

I am on the move at the moment so I can't try it immediately, but from previous memory/experience I think if you return plain null you will get a Spark exception. Anyway, you can try it and see what happens and then ask the question. If you do get an exception, try Optional instead of plain null.

Sent from Samsung Mobile
GraphX: unbalanced computation and slow runtime on livejournal network
Hi all,

I have been testing GraphX on the soc-LiveJournal1 network from the SNAP repository. Currently I am running on c3.8xlarge EC2 instances on Amazon. These instances have 32 cores and 60GB RAM per node, and so far I have run SSSP, PageRank, and WCC on 1, 4, and 8 node clusters.

The issues I am having, which are present for all three algorithms, are that (1) GraphX is not improving between 4 and 8 nodes and (2) GraphX seems to be heavily unbalanced, with some machines doing the majority of the computation.

PageRank (20 iterations) is the worst. For 1-node, 4-node, and 8-node clusters I get the following runtimes (wallclock): 192s, 154s, and 154s. These results are potentially understandable, though the times are significantly worse than the results in the paper https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx.pdf, where this algorithm ran in ~75s on a weaker cluster.

My main concern is that the computation seems to be heavily unbalanced. I have measured the CPU time of all the processes associated with GraphX during execution, and for a 4-node cluster it yielded the following CPU times (one per machine): 724s, 697s, 2216s, 694s. Is this normal? Should I expect a more even distribution of work across machines?

I am using the stock PageRank code found here: https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala. I use the configurations spark.executor.memory=40g and spark.cores.max=128 for the 4-node case. I also set the number of edge partitions to 64.

Could you please let me know if these results are reasonable, or if there is a better way to ensure the computation is better distributed among the nodes in a cluster. I really appreciate the help.

Thanks,
Steve
compilation error
Hi All,

Getting the following error when I am compiling Spark. What did I miss? Even googled and did not find the exact solution for this...

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project spark-assembly_2.10: Error creating shaded jar: Error in ASM processing class com/ibm/icu/impl/data/LocaleElements_zh__PINYIN.class: 48188 -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project spark-assembly_2.10: Error creating shaded jar: Error in ASM processing class com/ibm/icu/impl/data/LocaleElements_zh__PINYIN.class
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)

Thanks & Regards,
Brahma Reddy Battula
Re: compilation error
What JDK release are you using? Can you give the complete command you used? Which Spark branch are you working with?

Cheers

On Sun, Apr 19, 2015 at 7:25 PM, Brahma Reddy Battula brahmareddy.batt...@huawei.com wrote:

Hi All,

Getting the following error when I am compiling Spark. What did I miss? Even googled and did not find the exact solution for this...

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project spark-assembly_2.10: Error creating shaded jar: Error in ASM processing class com/ibm/icu/impl/data/LocaleElements_zh__PINYIN.class: 48188 -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project spark-assembly_2.10: Error creating shaded jar: Error in ASM processing class com/ibm/icu/impl/data/LocaleElements_zh__PINYIN.class
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)

Thanks & Regards,
Brahma Reddy Battula
Re: dataframe can not find fields after loading from hive
Hi Cesar,

Can you try 1.3.1 (https://spark.apache.org/releases/spark-release-1-3-1.html) and see if it still shows the error?

Thanks,
Yin

On Fri, Apr 17, 2015 at 1:58 PM, Reynold Xin r...@databricks.com wrote:

This is strange. cc the dev list since it might be a bug.

On Thu, Apr 16, 2015 at 3:18 PM, Cesar Flores ces...@gmail.com wrote:

Never mind. I found the solution:

val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd, hiveLoadedDataFrame.schema)

which translates to converting the data frame to an RDD and back again to a data frame. Not the prettiest solution, but at least it solves my problem.

Thanks,
Cesar Flores

On Thu, Apr 16, 2015 at 11:17 AM, Cesar Flores ces...@gmail.com wrote:

I have a data frame in which I load data from a hive table. And my issue is that the data frame is missing the columns that I need to query. For example:

val newdataset = dataset.where(dataset("label") === 1)

gives me an error like the following:

ERROR yarn.ApplicationMaster: User class threw exception: resolved attributes label missing from label, user_id, ... (the rest of the fields of my table)
org.apache.spark.sql.AnalysisException: resolved attributes label missing from label, user_id, ... (the rest of the fields of my table)

where we can see that the label field actually exists. I managed to solve this issue by updating my syntax to:

val newdataset = dataset.where($"label" === 1)

which works. However I cannot apply this trick in all my queries. For example, when I try to do a unionAll of two subsets of the same data frame, the error I get is that all my fields are missing. Can someone tell me if I need to do some post-processing after loading from hive in order to avoid this kind of error?

Thanks
--
Cesar Flores
Re: Can't get SparkListener to work
The problem is the code you use to test:

sc.parallelize(List(1, 2, 3)).map(throw new SparkException("test")).collect();

is like the following example:

def foo: Int => Nothing = { throw new SparkException("test") }
sc.parallelize(List(1, 2, 3)).map(foo).collect();

So the Spark job is actually never submitted, since it fails in `foo`, which is used to create the map function. Change it to:

sc.parallelize(List(1, 2, 3)).map(i => throw new SparkException("test")).collect();

and you will see the correct messages from your listener.

Best Regards,
Shixiong (Ryan) Zhu

2015-04-19 1:06 GMT+08:00 Praveen Balaji secondorderpolynom...@gmail.com:

Thanks for the response, Archit. I get callbacks when I do not throw an exception from map. My use case, however, is to get callbacks for exceptions in transformations on executors. Do you think I'm going down the right route?

Cheers
-p

On Sat, Apr 18, 2015 at 1:49 AM, Archit Thakur archit279tha...@gmail.com wrote:

Hi Praveen, Can you try once removing the throw exception in map? Do you still not get it?

On Apr 18, 2015 8:14 AM, Praveen Balaji secondorderpolynom...@gmail.com wrote:

Thanks for the response, Imran. I probably chose the wrong methods for this email. I implemented all methods of SparkListener and the only callback I get is onExecutorMetricsUpdate. Here's the complete code:

==
import org.apache.spark.scheduler._

sc.addSparkListener(new SparkListener() {
  override def onStageCompleted(e: SparkListenerStageCompleted) = println("onStageCompleted");
  override def onStageSubmitted(e: SparkListenerStageSubmitted) = println("onStageSubmitted");
  override def onTaskStart(e: SparkListenerTaskStart) = println("onTaskStart");
  override def onTaskGettingResult(e: SparkListenerTaskGettingResult) = println("onTaskGettingResult");
  override def onTaskEnd(e: SparkListenerTaskEnd) = println("onTaskEnd");
  override def onJobStart(e: SparkListenerJobStart) = println("onJobStart");
  override def onJobEnd(e: SparkListenerJobEnd) = println("onJobEnd");
  override def onEnvironmentUpdate(e: SparkListenerEnvironmentUpdate) = println("onEnvironmentUpdate");
  override def onBlockManagerAdded(e: SparkListenerBlockManagerAdded) = println("onBlockManagerAdded");
  override def onBlockManagerRemoved(e: SparkListenerBlockManagerRemoved) = println("onBlockManagerRemoved");
  override def onUnpersistRDD(e: SparkListenerUnpersistRDD) = println("onUnpersistRDD");
  override def onApplicationStart(e: SparkListenerApplicationStart) = println("onApplicationStart");
  override def onApplicationEnd(e: SparkListenerApplicationEnd) = println("onApplicationEnd");
  override def onExecutorMetricsUpdate(e: SparkListenerExecutorMetricsUpdate) = println("onExecutorMetricsUpdate");
});

sc.parallelize(List(1, 2, 3)).map(throw new SparkException("test")).collect();
=

On Fri, Apr 17, 2015 at 4:13 PM, Imran Rashid iras...@cloudera.com wrote:

When you start the spark-shell, it's already too late to get the ApplicationStart event. Try listening for StageCompleted or JobEnd instead.

On Fri, Apr 17, 2015 at 5:54 PM, Praveen Balaji secondorderpolynom...@gmail.com wrote:

I'm trying to create a simple SparkListener to get notified of errors on executors. I do not get any callbacks on my SparkListener. Here is some simple code I'm executing in spark-shell. But I still don't get any callbacks on my listener. Am I doing something wrong? Thanks for any clue you can send my way.
Cheers
Praveen

==
import org.apache.spark.scheduler.SparkListener
import org.apache.spark.scheduler.SparkListenerApplicationStart
import org.apache.spark.scheduler.SparkListenerApplicationEnd
import org.apache.spark.SparkException

sc.addSparkListener(new SparkListener() {
  override def onApplicationStart(applicationStart: SparkListenerApplicationStart) {
    println("onApplicationStart: " + applicationStart.appName);
  }
  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd) {
    println("onApplicationEnd: " + applicationEnd.time);
  }
});

sc.parallelize(List(1, 2, 3)).map(throw new SparkException("test")).collect();
===

output:

scala> org.apache.spark.SparkException: hshsh
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
    at $iwC$$iwC$$iwC.<init>(<console>:34)
    at $iwC$$iwC.<init>(<console>:36)
    at $iwC.<init>(<console>:38)
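For the original goal (getting notified of task errors on executors), a hedged sketch of a listener that inspects the task end reason; ExceptionFailure carries the executor-side error description:

import org.apache.spark.ExceptionFailure
import org.apache.spark.scheduler._

sc.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = taskEnd.reason match {
    case e: ExceptionFailure => println("task failed: " + e.description)
    case _ => // task succeeded or ended for another reason
  }
});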
Re: Can't get SparkListener to work
Thanks Shixiong. I'll try this.

On Sun, Apr 19, 2015, 7:36 PM Shixiong Zhu zsxw...@gmail.com wrote:

The problem is the code you use to test:

sc.parallelize(List(1, 2, 3)).map(throw new SparkException("test")).collect();

is like the following example:

def foo: Int => Nothing = { throw new SparkException("test") }
sc.parallelize(List(1, 2, 3)).map(foo).collect();

So the Spark job is actually never submitted, since it fails in `foo`, which is used to create the map function. Change it to:

sc.parallelize(List(1, 2, 3)).map(i => throw new SparkException("test")).collect();

and you will see the correct messages from your listener.

Best Regards,
Shixiong (Ryan) Zhu
RE: compilation error
Hey Todd,

Thanks a lot for your reply. Kindly check the following details:

Spark version: 1.1.0
JDK: jdk1.7.0_60
Command: mvn -Pbigtop-dist -Phive -Pyarn -Phadoop-2.4 -Dhadoop.version=V100R001C00 -DskipTests package

Thanks & Regards,
Brahma Reddy Battula

From: Ted Yu [yuzhih...@gmail.com]
Sent: Monday, April 20, 2015 8:07 AM
To: Brahma Reddy Battula
Cc: user@spark.apache.org
Subject: Re: compilation error

What JDK release are you using? Can you give the complete command you used? Which Spark branch are you working with?

Cheers
Re: [STREAMING KAFKA - Direct Approach] JavaPairRDD cannot be cast to HasOffsetRanges
You need to access the underlying RDD with .rdd() and cast that. That works for me.

On Mon, Apr 20, 2015 at 4:41 AM, RimBerry truonghoanglinhk55b...@gmail.com wrote:

Hi everyone,

I am trying to use the direct approach in streaming-kafka-integration (http://spark.apache.org/docs/latest/streaming-kafka-integration.html), pulling data from kafka as follows:

JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(jssc,
    String.class, String.class, StringDecoder.class, StringDecoder.class,
    kafkaParams, topicsSet);

messages.foreachRDD(
    new Function<JavaPairRDD<String, String>, Void>() {
        @Override
        public Void call(JavaPairRDD<String, String> rdd) throws IOException {
            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd).offsetRanges();
            //.
            return null;
        }
    }
);

Then I get an error when running it:

*java.lang.ClassCastException: org.apache.spark.api.java.JavaPairRDD cannot be cast to org.apache.spark.streaming.kafka.HasOffsetRanges*

at the line: OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd).offsetRanges();

I am using version 1.3.1. Is this a bug in this version? Thank you for spending time with me.
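Applying Sean's fix to the snippet above means casting the underlying Scala RDD wrapped by the JavaPairRDD, i.e. ((HasOffsetRanges) rdd.rdd()).offsetRanges(). For comparison, a minimal Scala sketch of the same pattern (directKafkaStream assumed to come from KafkaUtils.createDirectStream), where the cast applies directly:

import org.apache.spark.streaming.kafka.HasOffsetRanges

directKafkaStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}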
Re: compilation error
Brahma, since you can see the continuous integration builds are passing, it's got to be something specific to your environment, right? This is not even an error from Spark, but from Maven plugins.

On Mon, Apr 20, 2015 at 4:42 AM, Ted Yu yuzhih...@gmail.com wrote:

bq. -Dhadoop.version=V100R001C00

First time I saw the above hadoop version. Doesn't look like an Apache release. I checked my local maven repo but didn't find impl under ~/.m2/repository/com/ibm/icu

FYI

On Sun, Apr 19, 2015 at 8:04 PM, Brahma Reddy Battula brahmareddy.batt...@huawei.com wrote:

Hey Todd,

Thanks a lot for your reply. Kindly check the following details:

Spark version: 1.1.0
JDK: jdk1.7.0_60
Command: mvn -Pbigtop-dist -Phive -Pyarn -Phadoop-2.4 -Dhadoop.version=V100R001C00 -DskipTests package

Thanks & Regards,
Brahma Reddy Battula
Code Deployment tools in Production
Generally, what tools are used to schedule Spark jobs in production? How is Spark Streaming code deployed? I am interested in knowing the tools used, like cron, oozie, etc.

Thanks,
Arun
[STREAMING KAFKA - Direct Approach] JavaPairRDD cannot be cast to HasOffsetRanges
Hi everyone,

I am trying to use the direct approach in streaming-kafka-integration (http://spark.apache.org/docs/latest/streaming-kafka-integration.html), pulling data from kafka as follows:

JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(jssc,
    String.class, String.class, StringDecoder.class, StringDecoder.class,
    kafkaParams, topicsSet);

messages.foreachRDD(
    new Function<JavaPairRDD<String, String>, Void>() {
        @Override
        public Void call(JavaPairRDD<String, String> rdd) throws IOException {
            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd).offsetRanges();
            //.
            return null;
        }
    }
);

Then I get an error when running it:

*java.lang.ClassCastException: org.apache.spark.api.java.JavaPairRDD cannot be cast to org.apache.spark.streaming.kafka.HasOffsetRanges*

at the line: OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd).offsetRanges();

I am using version 1.3.1. Is this a bug in this version? Thank you for spending time with me.
Re: how to make a spark cluster ?
Hi,

If you have just one physical machine then I would try out Docker instead of a full VM (which would be a waste of memory and CPU).

Best regards

On 20 Apr 2015 00:11, hnahak harihar1...@gmail.com wrote:

Hi All,

I've a big physical machine with 16 CPUs, 256 GB RAM, and a 20 TB hard disk. I just need to know what the best solution would be to make a Spark cluster. If I need to process TBs of data, then:

1. Only one machine, which contains driver, executor, job tracker and task tracker.
2. Create 4 VMs, each with 4 CPUs and 64 GB RAM.
3. Create 8 VMs, each with 2 CPUs and 32 GB RAM.

Please give me your views/suggestions.
RE: compilation error
Thanks a lot for your replies.

@Ted: V100R001C00 is our internal hadoop version, which is based on hadoop 2.4.1.
@Sean Owen: Yes, you are correct. I just wanted to know what leads to this problem...

Thanks & Regards,
Brahma Reddy Battula

From: Sean Owen [so...@cloudera.com]
Sent: Monday, April 20, 2015 9:14 AM
To: Ted Yu
Cc: Brahma Reddy Battula; user@spark.apache.org
Subject: Re: compilation error

Brahma, since you can see the continuous integration builds are passing, it's got to be something specific to your environment, right? This is not even an error from Spark, but from Maven plugins.
SparkStreaming onStart not being invoked on CustomReceiver attached to master with multiple workers
I am experiencing a problem with Spark Streaming (Spark 1.2.0): the onStart method is never called on my CustomReceiver when calling spark-submit against a master node with multiple workers. However, Spark Streaming works fine with no master node set. Has anyone noticed this issue?
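For reference, a minimal custom receiver skeleton along the lines of the streaming programming guide (the data source here is a placeholder). One documented detail worth checking against these symptoms: each receiver occupies one executor core, so the application must be allocated more cores than it has receivers for the receiver to be scheduled and data to be processed.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  def onStart(): Unit = {
    // Start a thread that feeds data into Spark; replace with a real source.
    new Thread("my-receiver") {
      override def run(): Unit = {
        while (!isStopped) {
          store("record")  // push one record into Spark
          Thread.sleep(1000)
        }
      }
    }.start()
  }
  def onStop(): Unit = {}  // the receiving thread checks isStopped and exits
}

// Usage: val stream = ssc.receiverStream(new MyReceiver)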
Re: Code Deployment tools in Production
On 20 Apr 2015 05:45, Arun Patel arunp.bigd...@gmail.com wrote:

http://23.251.129.190:8090/spark-twitter-streaming-web/analysis/3fb28f76-62fe-47f3-a1a8-66ac610c2447.html

Generally, what tools are used to schedule Spark jobs in production? How is Spark Streaming code deployed? I am interested in knowing the tools used, like cron, oozie, etc.

Thanks,
Arun
Re: Dataframes Question
That's right.

On Sun, Apr 19, 2015 at 8:59 AM, Arun Patel arunp.bigd...@gmail.com wrote:

Thanks Ted. So, whatever operations I am performing now are on DataFrames and not SchemaRDDs? Is that right?

Regards,
Venkat

On Sun, Apr 19, 2015 at 9:13 AM, Ted Yu yuzhih...@gmail.com wrote:

bq. SchemaRDD is not existing in 1.3?

That's right. See this thread for more background: http://search-hadoop.com/m/JW1q5zQ1Xw/spark+DataFrame+schemarddsubj=renaming+SchemaRDD+gt+DataFrame

On Sat, Apr 18, 2015 at 5:43 PM, Abhishek R. Singh abhis...@tetrationanalytics.com wrote:

I am no expert myself, but from what I understand DataFrame is grandfathering SchemaRDD. This was done for API stability as Spark SQL matured out of alpha as part of the 1.3.0 release. It is forward looking and brings (dataframe-like) syntax that was not available with the older SchemaRDD.

On Apr 18, 2015, at 4:43 PM, Arun Patel arunp.bigd...@gmail.com wrote:

Experts, I have a few basic questions on DataFrames vs Spark SQL. My confusion is more with DataFrames.

1) What is the difference between Spark SQL and DataFrames? Are they the same?
2) Documentation says SchemaRDD is renamed as DataFrame. Does this mean SchemaRDD does not exist in 1.3?
3) As per documentation, it looks like creating a DataFrame is no different from a SchemaRDD: df = sqlContext.jsonFile("examples/src/main/resources/people.json"). So, my question is, what is the difference?

Thanks for your help.
Arun
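A minimal sketch of the renamed API in 1.3, using the documentation path quoted in the thread (jsonFile now returns a DataFrame rather than a SchemaRDD):

val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
df.printSchema()
df.select("name").show()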
Re: Dataframes Question
bq. SchemaRDD is not existing in 1.3?

That's right. See this thread for more background: http://search-hadoop.com/m/JW1q5zQ1Xw/spark+DataFrame+schemarddsubj=renaming+SchemaRDD+gt+DataFrame

On Sat, Apr 18, 2015 at 5:43 PM, Abhishek R. Singh abhis...@tetrationanalytics.com wrote:

I am no expert myself, but from what I understand DataFrame is grandfathering SchemaRDD. This was done for API stability as Spark SQL matured out of alpha as part of the 1.3.0 release. It is forward looking and brings (dataframe-like) syntax that was not available with the older SchemaRDD.

On Apr 18, 2015, at 4:43 PM, Arun Patel arunp.bigd...@gmail.com wrote:

Experts, I have a few basic questions on DataFrames vs Spark SQL. My confusion is more with DataFrames.

1) What is the difference between Spark SQL and DataFrames? Are they the same?
2) Documentation says SchemaRDD is renamed as DataFrame. Does this mean SchemaRDD does not exist in 1.3?
3) As per documentation, it looks like creating a DataFrame is no different from a SchemaRDD: df = sqlContext.jsonFile("examples/src/main/resources/people.json"). So, my question is, what is the difference?

Thanks for your help.
Arun
Re: Dataframes Question
Thanks Ted. So, whatever operations I am performing now are on DataFrames and not SchemaRDDs? Is that right?

Regards,
Venkat

On Sun, Apr 19, 2015 at 9:13 AM, Ted Yu yuzhih...@gmail.com wrote:

bq. SchemaRDD is not existing in 1.3?

That's right. See this thread for more background: http://search-hadoop.com/m/JW1q5zQ1Xw/spark+DataFrame+schemarddsubj=renaming+SchemaRDD+gt+DataFrame

On Sat, Apr 18, 2015 at 5:43 PM, Abhishek R. Singh abhis...@tetrationanalytics.com wrote:

I am no expert myself, but from what I understand DataFrame is grandfathering SchemaRDD. This was done for API stability as Spark SQL matured out of alpha as part of the 1.3.0 release. It is forward looking and brings (dataframe-like) syntax that was not available with the older SchemaRDD.

On Apr 18, 2015, at 4:43 PM, Arun Patel arunp.bigd...@gmail.com wrote:

Experts, I have a few basic questions on DataFrames vs Spark SQL. My confusion is more with DataFrames.

1) What is the difference between Spark SQL and DataFrames? Are they the same?
2) Documentation says SchemaRDD is renamed as DataFrame. Does this mean SchemaRDD does not exist in 1.3?
3) As per documentation, it looks like creating a DataFrame is no different from a SchemaRDD: df = sqlContext.jsonFile("examples/src/main/resources/people.json"). So, my question is, what is the difference?

Thanks for your help.
Arun
Re: Date class not supported by SparkSQL
Here's a code example:

public class DateSparkSQLExample {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("test").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<SomeObject> itemsList = Lists.newArrayListWithCapacity(1);
        itemsList.add(new SomeObject(new Date(), 1L));
        JavaRDD<SomeObject> someObjectJavaRDD = sc.parallelize(itemsList);

        JavaSQLContext sqlContext = new org.apache.spark.sql.api.java.JavaSQLContext(sc);
        sqlContext.applySchema(someObjectJavaRDD, SomeObject.class).registerTempTable("temp_table");
    }

    private static class SomeObject implements Serializable {
        private Date timestamp;
        private Long value;

        public SomeObject() {
        }

        public SomeObject(Date timestamp, Long value) {
            this.timestamp = timestamp;
            this.value = value;
        }

        public Date getTimestamp() { return timestamp; }
        public void setTimestamp(Date timestamp) { this.timestamp = timestamp; }
        public Long getValue() { return value; }
        public void setValue(Long value) { this.value = value; }
    }
}

On Sun, Apr 19, 2015 at 4:27 PM, Lior Chaga lio...@taboola.com wrote:

Using Spark 1.2.0. Tried to apply a schema to an RDD and register it as a table, and got: scala.MatchError: class java.util.Date (of class java.lang.Class). I see this was resolved in https://issues.apache.org/jira/browse/SPARK-2562 (included in 1.2.0). Has anyone encountered this issue?

Thanks,
Lior
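A possible workaround sketch, given that Spark SQL's schema inference supports java.sql.Timestamp (TimestampType) rather than java.util.Date; shown in Scala with a hypothetical case class, using the Spark 1.2 SchemaRDD implicits:

case class SomeRecord(timestamp: java.sql.Timestamp, value: Long)

val records = sc.parallelize(Seq(
  SomeRecord(new java.sql.Timestamp(System.currentTimeMillis()), 1L)))
import sqlContext.createSchemaRDD  // implicit RDD-to-SchemaRDD conversion in 1.2
records.registerTempTable("temp_table")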