Re: Spark Job not using all nodes in cluster
No. I am not setting the number of executors anywhere (in the env file or in the program). Is it due to the large number of small files?

On Wed, May 20, 2015 at 5:11 PM, ayan guha wrote:
> What does your spark env file say? Are you setting the number of executors in
> the Spark context?
> On 20 May 2015 13:16, "Shailesh Birari" wrote:
>> Hi,
>>
>> I have a 4 node Spark 1.3.1 cluster. All four nodes have 4 cores and 64 GB
>> of RAM. I have around 600,000+ JSON files on HDFS. Each file is small,
>> around 1 KB in size; total data is around 16 GB. The Hadoop block size is
>> 256 MB. My application reads these files with the sc.textFile() (or
>> sc.jsonFile(), tried both) API, but all the files are getting read by only
>> one node (4 executors). The Spark UI shows all 600K+ tasks on one node and
>> 0 on the other nodes.
>>
>> I confirmed that all files are accessible from all nodes. Another
>> application which uses big files uses all nodes on the same cluster.
>>
>> Can you please let me know why it is behaving this way?
>>
>> Thanks,
>> Shailesh
Spark Job not using all nodes in cluster
Hi,

I have a 4 node Spark 1.3.1 cluster. All four nodes have 4 cores and 64 GB of RAM. I have around 600,000+ JSON files on HDFS. Each file is small, around 1 KB in size; total data is around 16 GB. The Hadoop block size is 256 MB. My application reads these files with the sc.textFile() (or sc.jsonFile(), tried both) API, but all the files are getting read by only one node (4 executors). The Spark UI shows all 600K+ tasks on one node and 0 on the other nodes.

I confirmed that all files are accessible from all nodes. Another application which uses big files uses all nodes on the same cluster.

Can you please let me know why it is behaving this way?

Thanks,
Shailesh

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Job-not-using-all-nodes-in-cluster-tp22951.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
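For reference, a common workaround for this many-tiny-files situation (a sketch, not from this thread) is to read the files in bulk and then repartition explicitly so the work spreads across all executors. The path, partition counts, and variable names below are hypothetical; `wholeTextFiles` and `repartition` are standard Spark 1.x API:

```scala
// Sketch (Spark 1.3-era API): wholeTextFiles reads many small files with far
// fewer partitions than one-task-per-file, returning (path, content) pairs.
val raw = sc.wholeTextFiles("hdfs:///demo/json/*", minPartitions = 64)

// Drop the paths and repartition so tasks land on all nodes, not just one.
// 16 here is a hypothetical target (e.g. 4 nodes x 4 cores).
val contents = raw.map { case (path, content) => content }.repartition(16)
```

Whether this helps depends on why locality pinned the tasks to one node, but it avoids scheduling 600K+ one-kilobyte tasks.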
Spark SQL Self join with aggregate
Hello,

I want to use Spark SQL to aggregate some columns of the data. For example, I have huge data with columns: time, src, dst, val1, val2. I want to calculate sum(val1) and sum(val2) for all unique pairs of src and dst. I tried forming the SQL query:

SELECT a.time, a.src, a.dst, sum(a.val1), sum(a.val2)
FROM table a, table b
WHERE a.src = b.src AND a.dst = b.dst

I know I am doing something wrong here. Can you please let me know whether it is doable, and how?

Thanks,
Shailesh

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Self-join-with-agreegate-tp22151.html
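As a sketch of the direction an answer would take: the aggregation described above does not need a self-join at all; a single GROUP BY over the (src, dst) pair produces both sums in one pass. The case class, file path, and table name below are hypothetical, assembled from the column names in the question (Spark 1.x-era API):

```scala
// Hypothetical schema matching the question's columns
case class Event(time: String, src: String, dst: String, val1: Double, val2: Double)

val events = sc.textFile("hdfs:///demo/events.csv")      // hypothetical path
  .map(_.split(","))
  .map(p => Event(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble))

events.registerTempTable("events")

// One pass, no self-join: group by the pair and sum both value columns
val sums = sqlContext.sql(
  "SELECT src, dst, SUM(val1) AS sum1, SUM(val2) AS sum2 " +
  "FROM events GROUP BY src, dst")
sums.collect().foreach(println)
```

Note that `time` cannot appear in the SELECT list unless it is also grouped or aggregated, which is why the query in the question fails.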
Re: Spark 1.2 – How to change Default (Random) port ….
Hi SM,

Apologies for the delayed response. No, the issue is with Spark 1.2.0; there is a bug in Spark 1.2.0. Spark recently made the 1.3.0 release, so it might be fixed there. I am not planning to test it soon, maybe after some time; you can try it.

Regards,
Shailesh

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-2-How-to-change-Default-Random-port-tp21306p22063.html
Re: Spark 1.2 – How to change Default (Random) port ….
Thanks. But after setting "spark.shuffle.blockTransferService" to "nio", the application fails with an Akka client disassociation:

15/01/27 13:38:11 ERROR TaskSchedulerImpl: Lost executor 3 on wynchcs218.wyn.cnw.co.nz: remote Akka client disassociated
15/01/27 13:38:11 INFO TaskSetManager: Re-queueing tasks for 3 from TaskSet 0.0
15/01/27 13:38:11 WARN TaskSetManager: Lost task 0.3 in stage 0.0 (TID 7, wynchcs218.wyn.cnw.co.nz): ExecutorLostFailure (executor lost)
15/01/27 13:38:11 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
15/01/27 13:38:11 WARN TaskSetManager: Lost task 1.3 in stage 0.0 (TID 6, wynchcs218.wyn.cnw.co.nz): ExecutorLostFailure (executor lost)
15/01/27 13:38:11 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/01/27 13:38:11 INFO TaskSchedulerImpl: Cancelling stage 0
15/01/27 13:38:11 INFO DAGScheduler: Failed to run count at RowMatrix.scala:71
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7, wynchcs218.wyn.cnw.co.nz): ExecutorLostFailure (executor lost)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/01/27 13:38:11 INFO DAGScheduler: Executor lost: 3 (epoch 3)
15/01/27 13:38:11 INFO BlockManagerMasterActor: Trying to remove executor 3 from BlockManagerMaster.
15/01/27 13:38:11 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor

On Mon, Jan 26, 2015 at 6:34 PM, Aaron Davidson wrote:
> This was a regression caused by the Netty Block Transfer Service. The fix for
> this just barely missed the 1.2 release, and you can see the associated
> JIRA here: https://issues.apache.org/jira/browse/SPARK-4837
>
> Current master has the fix, and the Spark 1.2.1 release will have it
> included. If you don't want to rebuild from master or wait, then you can
> turn it off by setting "spark.shuffle.blockTransferService" to "nio".
>
> On Sun, Jan 25, 2015 at 6:28 PM, Shailesh Birari wrote:
>> Can anyone please let me know?
>> I don't want to open all ports on the network, so I am interested in the
>> property by which I can configure this new port.
>>
>> Shailesh
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-2-How-to-change-Default-Random-port-tp21306p21360.html
Re: Spark 1.2 – How to change Default (Random) port ….
Can anyone please let me know? I don't want to open all ports on the network, so I am interested in the property by which I can configure this new port.

Shailesh

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-2-How-to-change-Default-Random-port-tp21306p21360.html
Spark 1.2 – How to change Default (Random) port ….
Hello,

Recently, I upgraded my setup to Spark 1.2 from Spark 1.1. I have a 4 node Ubuntu Spark cluster. With Spark 1.1, I used to write a Spark Scala program in Eclipse on my Windows development host and submit the job on the Ubuntu cluster from Eclipse (the Windows machine). As not all ports between the Spark cluster and the development machine are open on my network, I set the Spark process ports to valid ports. On Spark 1.1 this works perfectly. When I try to run the same program with the same user-defined ports on the Spark 1.2 cluster, it gives me a connection timeout for port *56117*. I referred to the Spark 1.2 configuration page (http://spark.apache.org/docs/1.2.0/configuration.html) but there are no new ports mentioned.

*Here is my code for reference:*

val conf = new SparkConf()
  .setMaster(sparkMaster)
  .setAppName("Spark SVD")
  .setSparkHome("/usr/local/spark")
  .setJars(jars)
  .set("spark.driver.host", "consb2a") // Windows host (development machine)
  .set("spark.driver.port", "51810")
  .set("spark.fileserver.port", "51811")
  .set("spark.broadcast.port", "51812")
  .set("spark.replClassServer.port", "51813")
  .set("spark.blockManager.port", "51814")
  .set("spark.executor.port", "51815")
  .set("spark.executor.memory", "2g")
  .set("spark.driver.memory", "4g")
val sc = new SparkContext(conf)

*Here is the exception:*

15/01/21 15:44:08 INFO BlockManagerMasterActor: Registering block manager wynchcs217.wyn.cnw.co.nz:37173 with 1059.9 MB RAM, BlockManagerId(2, wynchcs217.wyn.cnw.co.nz, 37173)
15/01/21 15:44:08 INFO BlockManagerMasterActor: Registering block manager wynchcs219.wyn.cnw.co.nz:53850 with 1059.9 MB RAM, BlockManagerId(1, wynchcs219.wyn.cnw.co.nz, 53850)
15/01/21 15:44:08 INFO BlockManagerMasterActor: Registering block manager wynchcs220.wyn.cnw.co.nz:35670 with 1060.3 MB RAM, BlockManagerId(0, wynchcs220.wyn.cnw.co.nz, 35670)
15/01/21 15:44:08 INFO BlockManagerMasterActor: Registering block manager wynchcs218.wyn.cnw.co.nz:46890 with 1059.9 MB RAM, BlockManagerId(3, wynchcs218.wyn.cnw.co.nz, 46890)
15/01/21 15:52:23 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, wynchcs217.wyn.cnw.co.nz): java.io.IOException: Connecting to CONSB2A.cnw.co.nz/143.96.130.27:56117 timed out (12 ms)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:188)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:701)
15/01/21 15:52:23 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 2, wynchcs220.wyn.cnw.co.nz, NODE_LOCAL, 1366 bytes)
15/01/21 15:55:35 INFO TaskSchedulerImpl: Cancelling stage 0
15/01/21 15:55:35 INFO TaskSchedulerImpl: Stage 0 was cancelled
15/01/21 15:55:35 INFO DAGScheduler: Job 0 failed: count at RowMatrix.scala:76, took 689.331309 s
Exception in thread "main" org.apache.spark.SparkException: Job 0 cancelled because Stage 0 was cancelled

Can you please let me know how I can change port 56117 to some other port?

Thanks,
Shailesh

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-2-How-to-change-Default-Random-port-tp21306.html
Re: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro
Thanks Aaron. Adding the Guava jar resolves the issue.

Shailesh

On Wed, Jan 21, 2015 at 3:26 PM, Aaron Davidson wrote:
> Spark's network-common package depends on Guava as a "provided" dependency
> in order to avoid conflicting with other libraries (e.g., Hadoop) that
> depend on specific versions. com/google/common/base/Preconditions has been
> present in Guava since version 2, so this is likely a "dependency not
> found" rather than a "wrong version of dependency" issue.
>
> To resolve this, please depend on some version of Guava (14.0.1 is
> guaranteed to work, as should any other version from the past few years).
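For a project built with sbt rather than plain Eclipse libraries, Aaron's suggestion amounts to one build line; the 14.0.1 version comes from his message, and `com.google.guava % guava` are the standard Maven Central coordinates (shown here as a sketch, not the thread's actual build file):

```scala
// build.sbt sketch: add Guava explicitly, since spark-network-common
// declares it "provided" and will not bring it onto the classpath itself
libraryDependencies += "com.google.guava" % "guava" % "14.0.1"
```

In an Eclipse user-library setup like the one described above, the equivalent is simply adding the guava-14.0.1.jar to the linked jars.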
Re: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro
Hi Frank,

It's a normal Eclipse project where I added the Scala and Spark libraries as user libraries. Though I am not attaching any Hadoop libraries, my application code has the following line:

System.setProperty("hadoop.home.dir", "C:\\SB\\HadoopWin")

This Hadoop home dir contains "winutils.exe" only. I don't think that's the issue.

Please suggest.

Thanks,
Shailesh

On Wed, Jan 21, 2015 at 2:19 PM, Frank Austin Nothaft wrote:
> Shailesh,
>
> To add, are you packaging Hadoop in your app? Hadoop will pull in Guava.
> Not sure if you are using Maven (or what) to build, but if you can pull up
> your build's dependency tree, you will likely find com.google.guava being
> brought in by one of your dependencies.
>
> Regards,
>
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466
Re: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro
Hello,

I double checked the libraries. I am linking only with Spark 1.2. Along with the Spark 1.2 jars I have Scala 2.10 jars and JRE 7 jars linked, and nothing else.

Thanks,
Shailesh

On Wed, Jan 21, 2015 at 12:58 PM, Sean Owen wrote:
> Guava is shaded in Spark 1.2+. It looks like you are mixing versions
> of Spark then, with some that still refer to unshaded Guava. Make sure
> you are not packaging Spark with your app and that you don't have
> other versions lying around.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-2-com-google-common-base-Preconditions-java-lang-NoClassDefFoundErro-tp21271.html
Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro
Hello,

I recently upgraded my setup from Spark 1.1 to Spark 1.2. My existing applications are working fine on the Ubuntu cluster. But when I try to execute a Spark MLlib application from Eclipse (Windows node), it gives a java.lang.NoClassDefFoundError: com/google/common/base/Preconditions exception.

Note:
1. With Spark 1.1 this was working fine.
2. The Spark 1.2 jar files are linked in the Eclipse project.
3. Checked the jar -tf output and found the above com.google.common.base is not present.

Exception log:

Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/base/Preconditions
    at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:94)
    at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:77)
    at org.apache.spark.network.netty.NettyBlockTransferService.init(NettyBlockTransferService.scala:62)
    at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:194)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:340)
    at org.apache.spark.examples.mllib.TallSkinnySVD$.main(TallSkinnySVD.scala:74)
    at org.apache.spark.examples.mllib.TallSkinnySVD.main(TallSkinnySVD.scala)
Caused by: java.lang.ClassNotFoundException: com.google.common.base.Preconditions
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    ... 7 more

jar -tf output:

consb2@CONSB2A /cygdrive/c/SB/spark-1.2.0-bin-hadoop2.4/spark-1.2.0-bin-hadoop2.4/lib
$ jar -tf spark-assembly-1.2.0-hadoop2.4.0.jar | grep Preconditions
org/spark-project/guava/common/base/Preconditions.class
org/spark-project/guava/common/math/MathPreconditions.class
com/clearspring/analytics/util/Preconditions.class
parquet/Preconditions.class
com/google/inject/internal/util/$Preconditions.class

Please help me in resolving this.

Thanks,
Shailesh

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-2-com-google-common-base-Preconditions-java-lang-NoClassDefFoundErro-tp21271.html
Re: Submitting Spark job on Unix cluster from dev environment (Windows)
Have you tried setting the host name/port to your Windows machine? Also specify the following ports for Spark. Make sure the ports you mention are not blocked (on the Windows machine):

spark.fileserver.port
spark.broadcast.port
spark.replClassServer.port
spark.blockManager.port
spark.executor.port

Hope this helps.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-Unix-cluster-from-dev-environment-Windows-tp16989p18436.html
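Putting the advice above together, setting these properties on the SparkConf before creating the context looks roughly like the following sketch. The host name and port numbers are hypothetical examples; the property keys are the ones named in this thread (Spark 1.x-era networking configuration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: pin Spark's otherwise-random service ports to known open ports,
// and point the driver host at the Windows development machine.
val conf = new SparkConf()
  .set("spark.driver.host", "my-windows-host")   // hypothetical host name
  .set("spark.driver.port", "51810")
  .set("spark.fileserver.port", "51811")
  .set("spark.broadcast.port", "51812")
  .set("spark.replClassServer.port", "51813")
  .set("spark.blockManager.port", "51814")
  .set("spark.executor.port", "51815")
val sc = new SparkContext(conf)
```

Only ports that are actually open in both directions between the Windows machine and the cluster should be chosen.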
Re: Spark SQL takes unexpected time
Yes, I am using Spark 1.1.0 and have used rdd.registerTempTable(). I tried adding sqlContext.cacheTable(), but it took 59 seconds (more than before). I also tried changing the schema to use the Long data type in some fields, but it seems the conversion takes more time. Is there any way to specify an index? I checked and didn't find any; I just want to confirm. For your reference, here is the snippet of code:

case class EventDataTbl(EventUID: Long,
  ONum: Long,
  RNum: Long,
  Timestamp: java.sql.Timestamp,
  Duration: String,
  Type: String,
  Source: String,
  OName: String,
  RName: String)

val format = new java.text.SimpleDateFormat("yyyy-MM-dd hh:mm:ss")
val cedFileName = "hdfs://hadoophost:8020/demo/poc/JoinCsv/output_2"
val cedRdd = sc.textFile(cedFileName).map(_.split(",", -1)).map(p =>
  EventDataTbl(p(0).toLong, p(1).toLong, p(2).toLong,
    new java.sql.Timestamp(format.parse(p(3)).getTime()),
    p(4), p(5), p(6), p(7), p(8)))

cedRdd.registerTempTable("EventDataTbl")
sqlCntxt.cacheTable("EventDataTbl")

val t1 = System.nanoTime()
println("\n\n10 Most frequent conversations between the Originators and Recipients\n")
sql("SELECT COUNT(*) AS Frequency,ONum,OName,RNum,RName FROM EventDataTbl GROUP BY ONum,OName,RNum,RName ORDER BY Frequency DESC LIMIT 10").collect().foreach(println)
val t2 = System.nanoTime()
println("Time taken " + (t2 - t1) / 1e9 + " Seconds") // nanoseconds -> seconds

Thanks,
Shailesh

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-takes-unexpected-time-tp17925p18017.html
Spark SQL takes unexpected time
Hello,

I have written a Spark SQL application which reads data from HDFS and queries it. The data size is around 2 GB (30 million records). The schema and query I am running are as below. The query takes around 05+ seconds to execute. I tried adding rdd.persist(StorageLevel.MEMORY_AND_DISK) and rdd.cache(), but in both cases it takes extra time, even if I run the query below a second time on the data (assuming Spark would have cached it for the first query).

case class EventDataTbl(ID: String, ONum: String, RNum: String, Timestamp: String, Duration: String, Type: String, Source: String, OName: String, RName: String)

sql("SELECT COUNT(*) AS Frequency,ONum,OName,RNum,RName FROM EventDataTbl GROUP BY ONum,OName,RNum,RName ORDER BY Frequency DESC LIMIT 10").collect().foreach(println)

Can you let me know if I am missing anything?

Thanks,
Shailesh

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-takes-unexpected-time-tp17925.html
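For reference, the caching pattern that the follow-up in this thread converges on, sketched with the schema from the question (Spark 1.1-era API; the file path is hypothetical, and `sqlContext.cacheTable` rather than `rdd.cache()` is what caches the table in Spark SQL's columnar format):

```scala
case class EventDataTbl(ID: String, ONum: String, RNum: String,
  Timestamp: String, Duration: String, Type: String,
  Source: String, OName: String, RName: String)

// Build the RDD, register it as a SQL table, then cache the *table*:
// rdd.cache() caches raw rows, while cacheTable caches the columnar form
// that Spark SQL actually scans.
val rdd = sc.textFile("hdfs://hadoophost:8020/demo/events")  // hypothetical path
  .map(_.split(",", -1))
  .map(p => EventDataTbl(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7), p(8)))

rdd.registerTempTable("EventDataTbl")
sqlContext.cacheTable("EventDataTbl")  // populated lazily by the first query
```

The first query after cacheTable still pays the parse-and-load cost; only subsequent queries read from memory.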
Re: Submitting Spark job on Unix cluster from dev environment (Windows)
Thanks. By setting the driver host to the Windows machine and specifying some ports (driver, fileserver, broadcast, etc.), it worked perfectly. I need to specify those ports as not all ports are open on my machine. For the driver host name, I assumed Spark would get it automatically, as on Linux we do not set it; I thought it was only needed when the driver host differs from the machine running the program.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-Unix-cluster-from-dev-environment-Windows-tp16989p17693.html
Re: Submitting Spark application through code
Yes, this is doable. I am submitting the Spark job using:

JavaSparkContext spark = new JavaSparkContext(sparkMaster, "app name",
    System.getenv("SPARK_HOME"), new String[] {"application JAR"});

To run this, first create the application jar and specify its absolute path in the above API. That's all; run your Java application like any other.

Shailesh

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Submiting-Spark-application-through-code-tp17452p17553.html
Re: Submitting Spark job on Unix cluster from dev environment (Windows)
Can anyone please help me here ?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-Unix-cluster-from-dev-environment-Windows-tp16989p17552.html
Re: Submitting Spark job on cluster from dev environment
Some more updates. I have now tried setting spark.driver.host to the Spark master node and spark.driver.port to 51800 (an available open port), but it is failing with a bind error. I was hoping it would start the driver on the supplied host:port, and since it is a Unix node there should not be any issue. Can you please suggest where I am going wrong? Here is the stack trace obtained in Eclipse:

14/10/28 17:13:37 INFO Remoting: Starting remoting
Exception in thread "main" org.jboss.netty.channel.ChannelException: Failed to bind to: wynchcs220.wyn.cnw.co.nz/143.96.25.30:51800
  at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
  at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:391)
  at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:388)
  at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
  at scala.util.Try$.apply(Try.scala:161)
  at scala.util.Success.map(Try.scala:206)
  at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
  at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
  at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
  at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
  at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
  at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
  at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
  at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
  at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
  at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42)
  at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.net.BindException: Cannot assign requested address: bind
  at sun.nio.ch.Net.bind0(Native Method)
  at sun.nio.ch.Net.bind(Unknown Source)
  at sun.nio.ch.Net.bind(Unknown Source)
  at sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
  at sun.nio.ch.ServerSocketAdaptor.bind(Unknown Source)
  at org.jboss.netty.channel.socket.nio.NioServerBoss$RegisterTask.run(NioServerBoss.java:193)
  at org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:366)
  at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:290)
  at org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  at java.lang.Thread.run(Unknown Source)
14/10/28 17:13:37 ERROR Remoting: Remoting error: [Startup failed] [
akka.remote.RemoteTransportException: Startup failed
  at akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:129)
  at akka.remote.Remoting.start(Remoting.scala:194)
  at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
  at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579)
  at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577)
  at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588)
  at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
  at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
  at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
  at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
  at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
  at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1446)
  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
  at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1442)
  at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
  at org.apache.spark.SparkEnv$.create(SparkEnv.scala:153)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:203)
  at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
  at org.wyn.RemoteWordCount.main(RemoteWordCount.java:51)
Caused by: org.jboss.netty.channel.ChannelException: Failed to bind to: wynchcs220.wyn.cnw.co.nz/143.96.25.30:51800 at
Re: Submitting Spark job on cluster from dev environment
Hello,

I am able to submit a job to the Spark cluster from my Windows desktop, but the executors are not able to run. When I check the Spark UI (which is on Windows, as the driver is there) it shows me JAVA_HOME, CLASS_PATH, and other environment variables for Windows. I tried setting spark.executor.extraClassPath to the Unix classpath and spark.getConf().setExecutorEnv("spark.executorEnv.JAVA_HOME", "/usr/lib/jvm/java-6-openjdk-amd64") within the program, but it didn't help. The executor gives the error below. Can you please let me know what I am missing?

14/10/28 09:36:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1563)
  at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
  at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
  at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
  at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
  at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
  at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
  at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
  at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
  at scala.concurrent.Await$.result(package.scala:107)
  at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125)
  at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
  at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:416)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
  ... 4 more

Thanks,
Shailesh

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-cluster-from-dev-environment-tp16989p17397.html
Re: Spark Streaming - How to write RDD's in same directory ?
Thanks Sameer for the quick reply. I will try to implement it.

Shailesh

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-How-to-write-RDD-s-in-same-directory-tp16962p16970.html
Spark Streaming - How to write RDD's in same directory ?
Hello,

Spark 1.1.0, Hadoop 2.4.1. I have written a Spark Streaming application and I am getting a FileAlreadyExistsException from rdd.saveAsTextFile(outputFolderPath). Here, briefly, is what I am trying to do.

My application creates a text file stream using the Java streaming context; the input file is on HDFS:

JavaDStream<String> textStream = ssc.textFileStream(InputFile);

Then it compares each line of the input stream with some data and filters it. The filtered data I store in a JavaDStream:

JavaDStream<String> suspectedStream = textStream.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterable<String> call(String line) throws Exception {
        List<String> filteredList = new ArrayList<String>();
        // doing filter job
        return filteredList;
    }
});

And this filtered list I store in HDFS as:

suspectedStream.foreach(new Function<JavaRDD<String>, Void>() {
    @Override
    public Void call(JavaRDD<String> rdd) throws Exception {
        rdd.saveAsTextFile(outputFolderPath);
        return null;
    }
});

But with this I am receiving org.apache.hadoop.mapred.FileAlreadyExistsException. I tried appending a random number to outputFolderPath and it works, but my requirement is to collect all output in one directory. Can you please suggest a way to get rid of this exception?

Thanks,
Shailesh

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-How-to-write-RDD-s-in-same-directory-tp16962.html
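(A common workaround, since saveAsTextFile refuses to write into an existing directory, is to give each micro-batch a unique subdirectory under the base folder and treat the base folder as the logical dataset. A minimal sketch of the path construction — the helper name and directory layout are my own, not Spark API:)

```scala
import java.time.format.DateTimeFormatter
import java.time.{Instant, ZoneOffset}

// Build a unique output directory per micro-batch by suffixing the batch
// time, so each saveAsTextFile call targets a fresh directory and never
// hits FileAlreadyExistsException. `basePath` is a hypothetical HDFS
// folder; `batchTimeMs` is the batch time in epoch milliseconds.
def batchOutputPath(basePath: String, batchTimeMs: Long): String = {
  val fmt = DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss").withZone(ZoneOffset.UTC)
  s"$basePath/batch-${fmt.format(Instant.ofEpochMilli(batchTimeMs))}"
}
```

(Inside the foreach you would then call rdd.saveAsTextFile(batchOutputPath(outputFolderPath, System.currentTimeMillis())), or use the batch Time if you switch to the two-argument foreachRDD. Note also that DStream's built-in saveAsTextFiles(prefix, suffix) does essentially this, appending the batch time to the prefix automatically.)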
Re: java.lang.OutOfMemoryError while running SVD MLLib example
Hi Xiangrui,

After setting k for SVD to a smaller value (200), it is working.

Thanks,
Shailesh

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-OutOfMemoryError-while-running-SVD-MLLib-example-tp14972p15179.html
Re: java.lang.OutOfMemoryError while running SVD MLLib example
Note: the data is random numbers (doubles). Any suggestions/pointers would be highly appreciated.

Thanks,
Shailesh

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-OutOfMemoryError-while-running-SVD-MLLib-example-tp14972p15083.html