Re: Need help for Spark-JobServer setup on Maven (for Java programming)
Does my question make sense, or does it require some elaboration?

Sasi
Re: Need help for Spark-JobServer setup on Maven (for Java programming)
Hey, why specifically Maven? We set up a Spark Job Server through SBT, which is an easy way to get the job server up and running.
Re: Need help for Spark-JobServer setup on Maven (for Java programming)
The reason is that we have a Vaadin (Java framework) application which displays data from a Spark RDD, which in turn gets its data from Cassandra. As we know, we need to use Maven to build against the Spark API in Java. We tested the spark-jobserver using SBT and were able to run it. However, for our requirement, we need to integrate with Vaadin (Java framework).
Re: Clustering text data with MLlib
KMeans really needs the number of clusters identified in advance. There are multiple algorithms (X-means, ART, ...) which do not need this information. Unfortunately, none of them is implemented in MLlib at the moment (you can give a hand and help the community). Anyway, it seems to me you will not be satisfied with those algorithms (X-means, ART, ...) either. I understood that what you want to achieve is a precise number of clusters. Notice that whenever you change the input parameters (random seed, ...), the number of clusters might be different. Clustering is a great tool, but it won't give you the one truth (one number).

regards,
Tomas
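For reference, in the MLlib 1.x API the number of clusters k is an explicit argument of KMeans.train, which is why it must be fixed up front. A minimal sketch, assuming an existing SparkContext sc and a hypothetical input file of space-separated doubles:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each line into a dense feature vector (the input path is an assumption).
val data = sc.textFile("hdfs:///data/features.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// k must be chosen in advance; changing it (or the random seed) changes the result.
val k = 5
val maxIterations = 20
val model = KMeans.train(data, k, maxIterations)
println("Within-set sum of squared errors: " + model.computeCost(data))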
Re: Need help for Spark-JobServer setup on Maven (for Java programming)
Ohh... just curious: we did a similar use case to yours, getting data out of Cassandra. Since the job server is a REST architecture, all we need is a URL to access it. Why does integrating with your framework matter here, when all we need is a URL?
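For what it's worth, a sketch of what that REST interaction can look like from plain JDK code, with no job-server client library. The host, the default port 8090, the app name "test", and the WordCountExample class path are assumptions taken from the spark-jobserver README:

import java.net.{HttpURLConnection, URL}
import scala.io.Source

// POST a job to a running spark-jobserver instance (endpoint per its README).
val url = new URL("http://localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample")
val conn = url.openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setDoOutput(true)
conn.getOutputStream.write("input.string = a b c a b".getBytes("UTF-8"))
// The response is a JSON document describing the submitted job.
println(Source.fromInputStream(conn.getInputStream).mkString)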
Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster
Thanks Sandy, it was the issue with the number of cores.

Another issue I was facing is that tasks are not getting distributed evenly among all executors and are running at the NODE_LOCAL locality level, i.e. all the tasks are running on the same executor where my kafka receiver(s) are running, even though other executors are idle. I configured spark.locality.wait=50 instead of the default 3000 ms, which forced task rebalancing among the nodes (a short sketch of setting this programmatically follows the tables below); let me know if there is a better way to deal with this.

On Tue, Dec 30, 2014 at 12:09 AM, Mukesh Jha me.mukesh@gmail.com wrote:
Makes sense. I've also tried it in standalone mode, where all 3 workers and the driver were running on the same 8-core box, and the results were similar. Anyway, I will share the results in YARN mode with 8-core YARN containers.

On Mon, Dec 29, 2014 at 11:58 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
When running in standalone mode, each executor will be able to use all 8 cores on the box. When running on YARN, each executor will only have access to 2 cores. So the comparison doesn't seem fair, no? -Sandy

On Mon, Dec 29, 2014 at 10:22 AM, Mukesh Jha me.mukesh@gmail.com wrote:
Nope, I am setting 5 executors with 2 cores each. Below is the command that I'm using to submit in YARN mode. This starts up 5 executors and a driver, as per the Spark application master UI.

spark-submit --master yarn-cluster --num-executors 5 --driver-memory 1024m --executor-memory 1024m --executor-cores 2 --class com.oracle.ci.CmsgK2H /homext/lib/MJ-ci-k2h.jar vm.cloud.com:2181/kafka spark-yarn avro 1 5000

On Mon, Dec 29, 2014 at 11:45 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
*oops, I mean: are you setting --executor-cores to 8?

On Mon, Dec 29, 2014 at 10:15 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
Are you setting --num-executors to 8?

On Mon, Dec 29, 2014 at 10:13 AM, Mukesh Jha me.mukesh@gmail.com wrote:
Sorry Sandy, the command is just for reference, but I can confirm that there are 4 executors and a driver, as shown in the Spark UI page. Each of these machines is an 8-core box with ~15G of RAM.

On Mon, Dec 29, 2014 at 11:23 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
Hi Mukesh, based on your spark-submit command, it looks like you're only running with 2 executors on YARN. Also, how many cores does each machine have? -Sandy

On Mon, Dec 29, 2014 at 4:36 AM, Mukesh Jha me.mukesh@gmail.com wrote:
Hello Experts, I'm benchmarking Spark on YARN (https://spark.apache.org/docs/latest/running-on-yarn.html) vs. a standalone Spark cluster (https://spark.apache.org/docs/latest/spark-standalone.html). I have a standalone cluster with 3 executors, and a Spark app running on YARN with 4 executors, as shown below. The Spark job running inside YARN is 10x slower than the one running on the standalone cluster (even though YARN has a larger number of workers); also, in both cases all the executors are in the same datacenter, so there shouldn't be any latency. On YARN each 5-second batch reads data from Kafka and processes it in 5 seconds; on the standalone cluster each 5-second batch is processed in 0.4 seconds. Also, in YARN mode the executors are not used evenly: vm-13 and vm-14 are running most of the tasks, whereas in standalone mode all the executors run tasks. Do I need to set some configuration to distribute the tasks evenly? And do you have any pointers on why the YARN job is 10x slower than the standalone job? Any suggestion is greatly appreciated; thanks in advance.
YARN (5 workers + driver)
=========================
Executor ID | Address | RDD Blocks | Memory Used | Disk Used | Active Tasks | Failed Tasks | Complete Tasks | Total Tasks | Task Time | Input | Shuffle Read | Shuffle Write
1 | vm-18.cloud.com:51796 | 0 | 0.0B/530.3MB | 0.0 B | 1 | 0 | 16 | 17 | 634 ms | 0.0 B | 2047.0 B | 1710.0 B
2 | vm-13.cloud.com:57264 | 0 | 0.0B/530.3MB | 0.0 B | 0 | 0 | 1427 | 1427 | 5.5 m | 0.0 B | 0.0 B | 0.0 B
3 | vm-14.cloud.com:54570 | 0 | 0.0B/530.3MB | 0.0 B | 0 | 0 | 1379 | 1379 | 5.2 m | 0.0 B | 1368.0 B | 2.8 KB
4 | vm-11.cloud.com:56201 | 0 | 0.0B/530.3MB | 0.0 B | 0 | 0 | 10 | 10 | 625 ms | 0.0 B | 1368.0 B | 1026.0 B
5 | vm-5.cloud.com:42958 | 0 | 0.0B/530.3MB | 0.0 B | 0 | 0 | 22 | 22 | 632 ms | 0.0 B | 1881.0 B | 2.8 KB
driver | vm.cloud.com:51847 | 0 | 0.0B/530.0MB | 0.0 B | 0 | 0 | 0 | 0 | 0 ms | 0.0 B | 0.0 B | 0.0 B

/homext/spark/bin/spark-submit --master yarn-cluster --num-executors 2 --driver-memory 512m --executor-memory 512m --executor-cores 2 --class com.oracle.ci.CmsgK2H /homext/lib/MJ-ci-k2h.jar vm.cloud.com:2181/kafka spark-yarn avro 1 5000

STANDALONE (3 workers + driver)
===============================
Executor ID | Address | RDD Blocks | Memory Used | Disk Used | Active Tasks | Failed Tasks | Complete Tasks | Total Tasks | Task Time | Input | Shuffle Read | Shuffle Write
0 | vm-71.cloud.com:55912 | 0 | 0.0B/265.0MB | 0.0 B | 0 | 0 | 1069 | 1069 | 6.0 m | 0.0 B | 1534.0 B | 3.0 KB
1 | vm-72.cloud.com:40897 | 0 | 0.0B/265.0MB | 0.0 B | 0 | 0 | 1057 | 1057 | 5.9 m | 0.0 B | 1368.0 B | 4.0 KB
2
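Regarding the spark.locality.wait change mentioned at the top of this reply, a minimal sketch of setting it programmatically when building the context (the app name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// Lower the locality wait so tasks fall back to other executors quickly
// instead of queuing behind the node running the Kafka receiver(s).
val conf = new SparkConf()
  .setAppName("kafka-streaming-app")  // placeholder
  .set("spark.locality.wait", "50")   // milliseconds; the default is 3000
val sc = new SparkContext(conf)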
Spark SQL implementation error
I have a table (CSV file) and loaded data from it by creating a POJO as per the table structure, then created a SchemaRDD as follows:

JavaRDD<Test> testSchema = sc.textFile("D:/testTable.csv").map(GetTableData); /* GetTableData transforms all the table data into Test objects */
JavaSchemaRDD schemaTest = sqlContext.applySchema(testSchema, Test.class);
schemaTest.registerTempTable("testTable");

JavaSchemaRDD sqlQuery = sqlContext.sql("SELECT * FROM testTable");
List<String> totDuration = sqlQuery.map(new Function<Row, String>() {
  public String call(Row row) {
    return "Field1 is: " + row.getInt(0);
  }
}).collect();

This works fine, but if I change the query (the rest of the code is the same) to:

JavaSchemaRDD sqlQuery = sqlContext.sql("SELECT sum(field1) FROM testTable GROUP BY field2");

I get the error:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.rdd.ShuffledRDD.<init>(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/Partitioner;)V

Please help and suggest.
Re: Need help for Spark-JobServer setup on Maven (for Java programming)
Thanks Abhishek. We understand your point and will try using the REST URL. However, one concern: we presently have around 1 lakh (100,000) rows in our Cassandra table. Will the REST URL result withstand that response size?
Re: Can we say 1 RDD is generated every batch interval?
Foreach iterates through the partitions in the RDD and executes the operations for each partition, I guess.

On 29-Dec-2014, at 10:19 pm, SamyaMaiti samya.maiti2...@gmail.com wrote:
Hi All, please clarify: can we say 1 RDD is generated every batch interval? If the above is true, then is the foreachRDD() operator executed only once for each batch processed?
Regards, Sam
Spark SQL insert overwrite table failed.
While I was doing a JOIN operation on three tables using Spark 1.1.1, I always got the following error. However, I never hit this exception in Spark 1.1.0 with the same operation and the same data. Has anyone met this problem?

14/12/30 17:49:33 ERROR CliDriver: org.apache.hadoop.hive.ql.metadata.HiveException: checkPaths: hdfs://xx.com:20632/tmp/hive-work/hive_2014-12-30_17-46-25_327_2097835982529092412-1/-ext-1 has nested directory hdfs://x/tmp/hive-work/hive_2014-12-30_17-46-25_327_2097835982529092412-1/-ext-1/_temporary
at org.apache.hadoop.hive.ql.metadata.Hive.checkPaths(Hive.java:2081)
at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:)
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1224)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.result$lzycompute(InsertIntoHiveTable.scala:238)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.result(InsertIntoHiveTable.scala:173)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.execute(InsertIntoHiveTable.scala:164)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:382)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382)
at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:103)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98)
at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:58)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:329)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Re: Can we say 1 RDD is generated every batch interval?
The DStream model is one RDD of data per interval, yes. foreachRDD performs an operation on each RDD in the stream, which means it is executed once* for the one RDD in each interval.

* ignoring the possibility here of failure and retry, of course
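To make the one-RDD-per-interval model concrete, a minimal sketch (the socket source, port, and batch size are placeholders; an existing SparkContext sc is assumed):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999) // placeholder source
lines.foreachRDD { rdd =>
  // This body runs once per batch interval, on the single RDD for that batch.
  println("records in this batch: " + rdd.count())
}
ssc.start()
ssc.awaitTermination()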
Writing and reading sequence file results in trailing extra data
Hi, I'm facing a weird issue; any help appreciated. When I execute the code below and compare input and output, each record in the output has some extra trailing data appended to it, and is hence corrupted. I'm just reading and writing, so the input and output should be exactly the same.

I'm using spark-core 1.2.0_2.10 and the Hadoop bundled in it (hadoop-common: 2.2.0, hadoop-core: 1.2.1). I also confirmed the binary is fine at the time it's passed to the Hadoop classes, and already has the extra data once in the Hadoop classes (I guess this makes it more of a Hadoop question...).

Code:
=====
def main(args: Array[String]) {
  val conf = new SparkConf()
    .setMaster("local[4]")
    .setAppName("Simple Application")
  val sc = new SparkContext(conf)

  // input.txt is a text file with some Base64-encoded binaries stored as lines
  val src = sc
    .textFile("input.txt")
    .map(DatatypeConverter.parseBase64Binary)
    .map(x => (NullWritable.get(), new BytesWritable(x)))
    .saveAsSequenceFile("s3n://fake-test/stored")

  val file = "s3n://fake-test/stored"
  val logData = sc.sequenceFile(file, classOf[NullWritable], classOf[BytesWritable])
  val count = logData
    .map { case (k, v) => v }
    .map(x => DatatypeConverter.printBase64Binary(x.getBytes))
    .saveAsTextFile("/tmp/output")
}
Re: Can we say 1 RDD is generated every batch interval?
Thanks Sean, that was helpful.

Regards, Sam
[SOLVED] Re: Writing and reading sequence file results in trailing extra data
This poor soul had the exact same problem and solution: http://stackoverflow.com/questions/24083332/write-and-read-raw-byte-arrays-in-spark-using-sequence-file-sequencefile
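For readers who don't follow the link, the gist of that fix (as I read it) is that BytesWritable.getBytes returns the underlying buffer, which Hadoop may pad beyond the valid record length, so the bytes must be sliced to getLength. A sketch of the corrected read side of the code above:

import java.util.Arrays
import javax.xml.bind.DatatypeConverter
import org.apache.hadoop.io.{BytesWritable, NullWritable}

val logData = sc.sequenceFile("s3n://fake-test/stored",
  classOf[NullWritable], classOf[BytesWritable])
logData
  .map { case (_, v) =>
    // getBytes returns the backing array, which can be longer than the record;
    // slice it to getLength to drop the trailing padding.
    Arrays.copyOfRange(v.getBytes, 0, v.getLength)
  }
  .map(DatatypeConverter.printBase64Binary)
  .saveAsTextFile("/tmp/output")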
Re: Need help for Spark-JobServer setup on Maven (for Java programming)
Frankly speaking, I have never tried this volume in practice. But I believe it should work.
Re: building spark1.2 meet error
Hi, well, Spark 1.2 was prepared for Scala 2.10. If you want a stable and fully functional tool, I'd compile it with this default compiler. I was able to compile Spark 1.2 with Java 7 and Scala 2.10 seamlessly.

I also tried Java 8 and Scala 2.11 (no -Dscala.usejavacp=true), but I failed with some other problem:

mvn -Pyarn -Phadoop-2.5 -Dhadoop.version=2.5.0 -Dscala-2.11 -X -DskipTests clean package

[INFO] Reactor Summary:
[INFO] Spark Project Parent POM ............... SUCCESS [ 14.453 s]
[INFO] Spark Project Core ..................... SUCCESS [ 47.508 s]
[INFO] Spark Project Bagel .................... SUCCESS [  3.646 s]
[INFO] Spark Project GraphX ................... SUCCESS [  5.533 s]
[INFO] Spark Project ML Library ............... SUCCESS [ 12.715 s]
[INFO] Spark Project Tools .................... SUCCESS [  1.854 s]
[INFO] Spark Project Networking ............... SUCCESS [  6.580 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  5.290 s]
[INFO] Spark Project Streaming ................ SUCCESS [ 10.846 s]
[INFO] Spark Project Catalyst ................. SUCCESS [  8.296 s]
[INFO] Spark Project SQL ...................... SUCCESS [ 12.921 s]
[INFO] Spark Project Hive ..................... SUCCESS [ 28.931 s]
[INFO] Spark Project Assembly ................. FAILURE [01:09 min]
[INFO] Spark Project External Twitter ......... SKIPPED
[INFO] Spark Project External Flume ........... SKIPPED
[INFO] Spark Project External Flume Sink ...... SKIPPED
[INFO] Spark Project External MQTT ............ SKIPPED
[INFO] Spark Project External ZeroMQ .......... SKIPPED
[INFO] Spark Project Examples ................. SKIPPED
[INFO] Spark Project REPL ..................... SKIPPED
[INFO] Spark Project YARN Parent POM .......... SKIPPED
[INFO] Spark Project YARN Stable API .......... SKIPPED
[INFO] Spark Project YARN Shuffle Service ..... SKIPPED
[INFO] BUILD FAILURE
[INFO] Total time: 03:49 min
[INFO] Finished at: 2014-12-30T12:41:59+01:00
[INFO] Final Memory: 59M/417M
[WARNING] The requested profile hadoop-2.5 could not be activated because it does not exist.
[ERROR] Failed to execute goal on project spark-assembly_2.10: Could not resolve dependencies for project org.apache.spark:spark-assembly_2.10:pom:1.2.0: The following artifacts could not be resolved: org.apache.spark:spark-repl_2.11:jar:1.2.0, org.apache.spark:spark-yarn_2.11:jar:1.2.0: Could not find artifact org.apache.spark:spark-repl_2.11:jar:1.2.0 in central (https://repo1.maven.org/maven2) -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal on project spark-assembly_2.10: Could not resolve dependencies for project org.apache.spark:spark-assembly_2.10:pom:1.2.0: The following artifacts could not be resolved: org.apache.spark:spark-repl_2.11:jar:1.2.0, org.apache.spark:spark-yarn_2.11:jar:1.2.0: Could not find artifact org.apache.spark:spark-repl_2.11:jar:1.2.0 in central (https://repo1.maven.org/maven2)
at org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.getDependencies(LifecycleDependencyResolver.java:220)
at org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.resolveProjectDependencies(LifecycleDependencyResolver.java:127)
at org.apache.maven.lifecycle.internal.MojoExecutor.ensureDependenciesAreResolved(MojoExecutor.java:257)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:200)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:120)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:347)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:154)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:584)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:213)
at
How to collect() each partition in scala ?
Hi all, I have one large data set. When I get the number of partitions, it shows 43. We can't collect() the large data set into memory, so I am thinking of collect()-ing each partition instead, so that each piece is small in size. Any thoughts?
Re: Need help for Spark-JobServer setup on Maven (for Java programming)
Thanks Abhishek. We are good now, with an answer to try.
Re: How to collect() each partition in scala ?
collect()-ing a partition still implies copying it to the driver, but you're suggesting you can't collect() the whole data set to the driver. What do you mean: collect() 1 partition? Or collect() some smaller result from each partition?
SparkContext with error from PySpark
Hi Team, I was trying to execute a PySpark job on a cluster. It gives me the following error. (When I run the same job locally it works fine, too :-()

Error from python worker:
/usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/context.py:209: Warning: 'with' will become a reserved keyword in Python 2.6
Traceback (most recent call last):
File "/home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/runpy.py", line 85, in run_module
loader = get_loader(mod_name)
File "/home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py", line 456, in get_loader
return find_loader(fullname)
File "/home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py", line 466, in find_loader
for importer in iter_importers(fullname):
File "/home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py", line 422, in iter_importers
__import__(pkg)
File "/usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/__init__.py", line 41, in <module>
from pyspark.context import SparkContext
File "/usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/context.py", line 209
with SparkContext._lock:
^
SyntaxError: invalid syntax

PYTHONPATH was:
/usr/lib/spark-1.2.0-bin-hadoop2.3/python:/usr/lib/spark-1.2.0-bin-hadoop2.3/python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark-1.2.0-bin-hadoop2.3/lib/spark-assembly-1.2.0-hadoop2.3.0.jar:/usr/lib/spark-1.2.0-bin-hadoop2.3/sbin/../python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark-1.2.0-bin-hadoop2.3/sbin/../python:/home/beehive/bin/utils/primitives:/home/beehive/bin/utils/pylogger:/home/beehive/bin/utils/asterScript:/home/beehive/bin/lib:/home/beehive/bin/utils/init:/home/beehive/installer/packages:/home/beehive/ncli

java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:102)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
14/12/31 04:49:58 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, aster4, NODE_LOCAL, 1321 bytes)
14/12/31 04:49:58 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on aster4:43309 (size: 3.8 KB, free: 265.0 MB)
14/12/31 04:49:59 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on executor aster4: org.apache.spark.SparkException (

Any clue how to resolve the same?

Best regards,
Jagan
Re: SparkContext with error from PySpark
The Python installed in your cluster is 2.5. You need at least 2.6.

Eric Friedman
Is it possible to store graph directly into HDFS?
Dear all :)

We're trying to build a graph from large input data and get a subgraph by applying some filter. Now, we want to save this graph to HDFS so that we can load it later. Is it possible to store a graph or subgraph directly in HDFS and load it back as a graph for future use? We would be glad for your suggestions.

Best regards,
Jason Hong
Re: Is it possible to store graph directly into HDFS?
How about saving it as objects?

Yours, Xuefeng Wu 吴雪峰 敬上
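To spell out that suggestion: GraphX (as of 1.2) has no direct graph save/load, but the vertex and edge RDDs can be written as Hadoop object files and the graph rebuilt on load. A minimal sketch, assuming an existing SparkContext sc and a Graph[Int, Int] (adjust the type parameters and paths to your graph):

import org.apache.spark.graphx.{Edge, Graph, VertexId}

def saveGraph(graph: Graph[Int, Int], path: String): Unit = {
  // Persist the two underlying RDDs separately under one HDFS directory.
  graph.vertices.saveAsObjectFile(path + "/vertices")
  graph.edges.saveAsObjectFile(path + "/edges")
}

def loadGraph(path: String): Graph[Int, Int] = {
  // Read the RDDs back and reassemble the graph from them.
  val vertices = sc.objectFile[(VertexId, Int)](path + "/vertices")
  val edges = sc.objectFile[Edge[Int]](path + "/edges")
  Graph(vertices, edges)
}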
Spark Standalone Cluster not correctly configured
Hi. I'm trying to configure a Spark standalone cluster with three master nodes (bigdata1, bigdata2 and bigdata3) managed by ZooKeeper. It seems there's a configuration problem, since every one of them says it is the cluster leader:

14/12/30 13:54:59 INFO Master: I have been elected leader! New state: ALIVE

The message above is dumped by every master I start. ZooKeeper is configured identically in all of them, as follows:

dataDir=/spark

The only difference is the myid file in the /spark directory, of course. The masters are started using the following configuration:

export SPARK_DAEMON_JAVA_OPTS=" \
-Dspark.deploy.recoverymode=ZOOKEEPER \
-Dspark.deploy.zookeeper.url=bigdata1:2181,bigdata2:2181,bigdata3:2181"

I'm not setting the spark.deploy.zookeeper.dir variable, since I'm using the default value, /spark, configured in ZooKeeper, as I mentioned before.

I would like to know if there is anything else I have to configure in order to make the masters behave correctly (only one master node active at a time). In the current situation, I can connect workers and applications to the whole cluster; for instance, I can connect a worker to the cluster using:

spark-class org.apache.spark.deploy.worker.Worker spark://bigdata1:2181,bigdata2:2181,bigdata3:2181

But the worker gets registered with each of the masters independently. If I stop one of the masters, it tries to re-register with it. The notion of an active master is completely lost. Do you have any idea?

Thanks a lot.
-Bob
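One detail worth double-checking in the configuration above: the property name is case-sensitive, and the documented spelling is spark.deploy.recoveryMode (capital M). If the lowercase spelling is not recognized, recovery falls back to the default (NONE) and each master elects itself leader, which would match the symptom described. A corrected spark-env.sh sketch:

export SPARK_DAEMON_JAVA_OPTS=" \
-Dspark.deploy.recoveryMode=ZOOKEEPER \
-Dspark.deploy.zookeeper.url=bigdata1:2181,bigdata2:2181,bigdata3:2181"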
Re: How to collect() each partition in scala ?
I'm not sure exactly what you're trying to do, but take a look at rdd.toLocalIterator if you haven't already.
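A minimal sketch of that suggestion: toLocalIterator streams the data to the driver one partition at a time, so only a single partition needs to fit in driver memory (the input path is a placeholder):

val largeRdd = sc.textFile("hdfs:///big/dataset") // placeholder input
largeRdd.toLocalIterator.foreach { record =>
  // Records arrive partition by partition, rather than all 43 partitions
  // being materialized at once as collect() would do.
  println(record)
}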
Re: Anaconda iPython notebook working with CDH Spark
Some time ago I took approach (2): I installed Anaconda on every node. But to avoid screwing up RedHat (it was CentOS in my case, which is the same), I installed Anaconda on every node as the user yarn and made it the default Python only for that user.

After you install it, Anaconda asks if it should add its installation path to the PATH variable in .bashrc for your user (that's the way it overrides the default Python). If you choose yes, it will override it only for the current user. And if that user is yarn, you can run Spark in cluster mode, on all the nodes in your cluster, using IPython (a lot better than the default Python console).

Just in case: you have to check that you have a directory in your HDFS for yarn (/user/yarn); it may not be created by default, and that would complicate everything, not allowing your Spark to run.

In summary, something like (correct the syntax if it's wrong, I'm not testing it):

# Create yarn directory in HDFS
su hdfs
hadoop fs -mkdir /user/yarn
hadoop fs -chown yarn:yarn /user/yarn
exit

# Install Anaconda for user yarn
# On every node:
su yarn
cd
wget http://09c8d0b2229f813c1b93-c95ac804525aac4b6dba79b00b39d1d3.r79.cf1.rackcdn.com/Anaconda-2.1.0-Linux-x86_64.sh
# Or the current link at the moment you are doing it: https://store.continuum.io/cshop/anaconda/
bash Anaconda*.sh
# When asked whether to set it as the default Python, or to add Anaconda to the PATH (I don't remember how they say it), choose yes

I hope that helps,

Sebastián Ramírez
www.senseta.com

On Sun, Dec 28, 2014 at 1:57 PM, Bin Wang binwang...@gmail.com wrote:
Hi there, I have a cluster with CDH 5.1 running on top of RedHat 6.5, where the default Python version is 2.6. I am trying to set up a proper IPython notebook environment to develop Spark applications using PySpark. Here (http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/) is a tutorial that I have been following. However, it turned out that the author was using IPython 1, whereas we have the latest Anaconda Python 2.7 installed on our name node. When I finished following the tutorial, I could connect to the Spark cluster, but whenever I tried to distribute the work, it errored out, and Google tells me it is the difference between the versions of Python across the cluster. Here are a few thoughts that I am planning to try: (1) remove the Anaconda Python from the namenode and install the IPython version that is compatible with Python 2.6; or (2) install Anaconda Python on every node and make it the default Python version across the whole cluster (however, I am not sure if this plan will totally screw up the existing environment, since some running services are built with Python 2.6...). Let me know which should be the proper way to set up an IPython notebook environment. Best regards, Bin
Re: SchemaRDD to RDD[String]
Do your debug printlns show values? I.e., what would you see if in rowToString you output println("row to string " + row + " " + sub)? Another thing to check would be to do schemaRDD.take(3) or something, to make sure you actually have data. You can also try this: rowToString(schemaRDD.first, list) and see if you get anything.
Host Error on EC2 while accessing hdfs from standalone
Hi, I am using Spark standalone on EC2. I can access the ephemeral HDFS from the spark-shell interface, but I can't access HDFS in a standalone application. I am using Spark 1.2.0 with Hadoop 2.4.0, and launched the cluster from the ec2 folder on my local machine. In my pom file I have declared the Hadoop client as 2.4.0:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.4.0</version>
</dependency>

The error is as follows:

java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: ip-10-40-121-200/10.40.121.200; destination host is: ec2-23-21-113-136.compute-1.amazonaws.com:9000;

Regards,
Laeeq
Spark 1.2 and Mesos 0.21.0 spark.executor.uri issue?
I've been working with Spark 1.2 and Mesos 0.21.0, and while I have set spark.executor.uri within spark-env.sh (and directly within bash as well), the Mesos slaves do not seem to be able to access the Spark tgz file via HTTP or HDFS, as per the messages below.

14/12/30 15:57:35 INFO SparkILoop: Created spark context.. Spark context available as sc.
scala>
14/12/30 15:57:38 INFO CoarseMesosSchedulerBackend: Mesos task 0 is now TASK_FAILED
14/12/30 15:57:38 INFO CoarseMesosSchedulerBackend: Mesos task 1 is now TASK_FAILED
14/12/30 15:57:39 INFO CoarseMesosSchedulerBackend: Mesos task 2 is now TASK_FAILED
14/12/30 15:57:41 INFO CoarseMesosSchedulerBackend: Mesos task 3 is now TASK_FAILED
14/12/30 15:57:41 INFO CoarseMesosSchedulerBackend: Blacklisting Mesos slave value: 20141228-183059-3045950474-5050-2788-S1 due to too many failures; is Spark installed on it?

I've verified that the Mesos slaves can access both the HTTP and HDFS locations. I'll start digging into the Mesos logs, but was wondering if anyone had run into this issue before. I was able to get this to run successfully on Spark 1.1 on GCP; my current environment that I'm experimenting with is Digital Ocean, so perhaps this is in play?

Thanks!
Denny
Re: Host Error on EC2 while accessing hdfs from standalone
Did you check firewall rules in security groups?
Re: Mapping directory structure to columns in SparkSQL
Hi Michael,

I've looked through the example and the test cases, and I think I understand what we need to do, so I'll give it a go. What I'd like to try is to allow files to be added at any time, so perhaps I can cache partition info; what may also be useful for us would be to derive the schema from the set of all files. Hopefully this is achievable too.

Thanks,
Mick

On 30 Dec 2014, at 04:49, Michael Armbrust mich...@databricks.com wrote:
You can't do this now without writing a bunch of custom logic (see here for an example: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala). I would like to make this easier as part of improvements to the data sources API that we are planning for Spark 1.3.

On Mon, Dec 29, 2014 at 2:19 AM, Mickalas michael.belldav...@gmail.com wrote:
I see that there is already a request to add wildcard support to the SQLContext.parquetFile function (https://issues.apache.org/jira/browse/SPARK-3928). What seems like a useful thing for our use case is to associate the directory structure with certain columns in the table, but it does not seem like this is supported. For example, we want to create Parquet files on a daily basis associated with geographic regions, and so will create a set of files under directories such as:

* 2014-12-29/Americas
* 2014-12-29/Asia
* 2014-12-30/Americas
* ...

Where queries have predicates that match the column values determinable from the directory structure, it would be good to only extract data from the matching files. Does anyone know if something like this is supported, or whether this is a reasonable thing to request?

Mick
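A minimal sketch of the kind of custom logic involved, assuming the Spark 1.2 API (SQLContext.parquetFile plus unionAll) and hypothetical paths; it prunes directories against the predicate before loading, instead of pushing the predicate into a data source:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val base = "hdfs:///events" // assumption: files live under <base>/<date>/<region>
val wantedDates = Seq("2014-12-29", "2014-12-30")
val wantedRegion = "Americas"

// Only load the directories whose path components match the predicate,
// then union the per-directory SchemaRDDs into a single queryable table.
val parts = wantedDates.map(d => sqlContext.parquetFile(s"$base/$d/$wantedRegion"))
val all = parts.reduce(_ unionAll _)
all.registerTempTable("events")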
Shuffle Problems in 1.2.0
Hey all, since upgrading to 1.2.0, a PySpark job that worked fine in 1.1.1 fails during the shuffle. I've tried reverting from the sort-based shuffle back to the hash-based one, and that fails as well. Does anyone see similar problems, or have an idea where to look next?

For the sort-based shuffle I get a bunch of exceptions like this in the executor logs:

2014-12-30 03:13:04,061 ERROR [Executor task launch worker-2] executor.Executor (Logging.scala:logError(96)) - Exception in task 4523.0 in stage 1.0 (TID 4524)
org.apache.spark.SparkException: PairwiseRDD: unexpected value: List([B@130dc7ad)
at org.apache.spark.api.python.PairwiseRDD$$anonfun$compute$2.apply(PythonRDD.scala:307)
at org.apache.spark.api.python.PairwiseRDD$$anonfun$compute$2.apply(PythonRDD.scala:305)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

For the hash-based shuffle, there are instead a bunch of these exceptions in the logs:

2014-12-30 04:14:01,688 ERROR [Executor task launch worker-0] executor.Executor (Logging.scala:logError(96)) - Exception in task 4479.0 in stage 1.0 (TID 4480)
java.io.FileNotFoundException: /mnt/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1419905501183_0004/spark-local-20141230035728-8fc0/23/merged_shuffle_1_68_0 (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
at org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
at org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Thank you!
-Sven
Re: Mllib native netlib-java/OpenBLAS
I'm halfway there. Following:

1. compiled and installed the OpenBLAS library
2. ln -s libopenblas_sandybridgep-r0.2.13.so /usr/lib/libblas.so.3
3. compiled and built Spark: mvn -Pnetlib-lgpl -DskipTests clean compile package

So far so fine. Then I ran into problems testing the solution:

bin/run-example mllib.LinearRegression data/mllib/sample_libsvm_data.txt

14/12/30 18:39:57 INFO BlockManagerMaster: Registered BlockManager
14/12/30 18:39:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/12/30 18:39:58 WARN LoadSnappy: Snappy native library not loaded
Training: 80, test: 20.
/usr/local/lib/jdk1.8.0//bin/java: symbol lookup error: /tmp/jniloader1826801168744171087netlib-native_system-linux-x86_64.so: undefined symbol: cblas_dscal

I created an issue report: https://issues.apache.org/jira/browse/SPARK-5010

Any help is deeply appreciated,
Tomas
Re: S3 files , Spark job hungsup
This here may also be of help: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html. Make sure to spread your objects across multiple partitions so as not to be rate-limited by S3. -Sven

On Mon, Dec 22, 2014 at 10:20 AM, durga katakam durgak...@gmail.com wrote:
Yes, I am reading thousands of files every hour. Is there any way I can tell Spark to time out? Thanks for your help. -D

On Mon, Dec 22, 2014 at 4:57 AM, Shuai Zheng szheng.c...@gmail.com wrote:
Is it possible there are too many connections open to read from S3 from one node? I had this issue before, because I opened a few hundred files on S3 to read from one node. It just blocked itself without error until a timeout later.

On Monday, December 22, 2014, durga durgak...@gmail.com wrote:
Hi All, I am facing a strange issue sporadically. Occasionally my Spark job hangs on reading S3 files. It is not throwing an exception or making progress; it just hangs there. Is this a known issue? Please let me know how I could solve it. Thanks, -D
Trying to make spark-jobserver work with yarn
Hi all, I'm investigating Spark for a new project, and I'm trying to use spark-jobserver because... I need to reuse and share RDDs, and from what I read in the forum that's the standard :D

It turns out that spark-jobserver doesn't seem to work on YARN, or at least it does not on 1.1.1. My config is Spark 1.1.1 (moving to 1.2.0 soon) and Hadoop 2.6 (which seems compatible with 2.4 from Spark's point of view... at least I was able to run spark-submit and shell tasks both in yarn-client and yarn-cluster modes).

Going back to my original point: I made some changes in spark-jobserver and in how I submit a job, but I get:

[2014-12-30 18:20:19,769] INFO e.spark.deploy.yarn.Client [] [akka://JobServer/user/context-supervisor/f983d86e-spark.jobserver.WordCountExample] - Max mem capabililty of a single resource in this cluster 15000
[2014-12-30 18:20:19,770] INFO e.spark.deploy.yarn.Client [] [akka://JobServer/user/context-supervisor/f983d86e-spark.jobserver.WordCountExample] - Preparing Local resources
[2014-12-30 18:20:20,041] INFO e.spark.deploy.yarn.Client [] [akka://JobServer/user/context-supervisor/f983d86e-spark.jobserver.WordCountExample] - Prepared Local resources Map(__spark__.jar -> resource { scheme: file port: -1 file: /home/ec2-user/.ivy2/cache/org.apache.spark/spark-yarn_2.10/jars/spark-yarn_2.10-1.1.1.jar } size: 343226 timestamp: 1416429031000 type: FILE visibility: PRIVATE)
[...]
[2014-12-30 18:20:20,139] INFO e.spark.deploy.yarn.Client [] [akka://JobServer/user/context-supervisor/f983d86e-spark.jobserver.WordCountExample] - Yarn AM launch context:
[2014-12-30 18:20:20,140] INFO e.spark.deploy.yarn.Client [] [akka://JobServer/user/context-supervisor/f983d86e-spark.jobserver.WordCountExample] - class: org.apache.spark.deploy.yarn.ExecutorLauncher
[2014-12-30 18:20:20,140] INFO e.spark.deploy.yarn.Client [] [akka://JobServer/user/context-supervisor/f983d86e-spark.jobserver.WordCountExample] - env: Map(CLASSPATH -> $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*:$PWD/__app__.jar:$PWD/*, SPARK_YARN_CACHE_FILES_FILE_SIZES -> 343226, SPARK_YARN_STAGING_DIR -> .sparkStaging/application_1419963137232_0001/, SPARK_YARN_CACHE_FILES_VISIBILITIES -> PRIVATE, SPARK_USER -> ec2-user, SPARK_YARN_MODE -> true, SPARK_YARN_CACHE_FILES_TIME_STAMPS -> 1416429031000, SPARK_YARN_CACHE_FILES -> file:/home/ec2-user/.ivy2/cache/org.apache.spark/spark-yarn_2.10/jars/spark-yarn_2.10-1.1.1.jar#__spark__.jar)
[...]
[2014-12-30 18:03:04,474] INFO YarnClientSchedulerBackend [] [akka://JobServer/user/context-supervisor/ebac0153-spark.jobserver.WordCountExample] - Application report from ASM: appMasterRpcPort: -1, appStartTime: 1419962580444, yarnAppState: FAILED
[2014-12-30 18:03:04,475] ERROR .jobserver.JobManagerActor [] [akka://JobServer/user/context-supervisor/ebac0153-spark.jobserver.WordCountExample] - Failed to create context ebac0153-spark.jobserver.WordCountExample, shutting down actor
org.apache.spark.SparkException: Yarn application already ended, might be killed or not able to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApp(YarnClientSchedulerBackend.scala:117)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:93)
In the Hadoop console I can see the detailed issue:
Diagnostics: File file:/home/ec2-user/.ivy2/cache/org.apache.spark/spark-yarn_2.10/jars/spark-yarn_2.10-1.1.1.jar does not exist
java.io.FileNotFoundException: File file:/home/ec2-user/.ivy2/cache/org.apache.spark/spark-yarn_2.10/jars/spark-yarn_2.10-1.1.1.jar does not exist
Now... it seems like Spark is trying to use, on other nodes, a file that only exists on the machine I launched the job from. Can anyone point me in the right direction as to where that might be getting set?
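Not from the thread, but one thing worth checking for this class of error: the spark.yarn.jar setting available in Spark 1.1/1.2 lets you host the Spark assembly on HDFS so YARN containers don't look for a jar that only exists in a local ivy cache. A minimal sketch, with an illustrative HDFS path:

import org.apache.spark.SparkConf

// Hedged sketch: upload the assembly to HDFS once (e.g. with `hadoop fs -put`)
// and point the YARN client at it instead of a node-local file. Path is made up.
val conf = new SparkConf()
  .set("spark.yarn.jar", "hdfs:///apps/spark/spark-assembly-1.1.1-hadoop2.4.0.jar")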
Re: Cached RDD
Without caching, each action is recomputed. So assuming rdd2 and rdd3 result in separate actions, the answer is yes. On Mon, Dec 29, 2014 at 7:53 PM, Corey Nolet cjno...@gmail.com wrote: If I have 2 RDDs which depend on the same RDD like the following: val rdd1 = ... val rdd2 = rdd1.groupBy()... val rdd3 = rdd1.groupBy()... If I don't cache rdd1, will its lineage be calculated twice (once for rdd2 and once for rdd3)?
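For the archive, a minimal sketch of the fix implied above, assuming an existing SparkContext sc (the input path and grouping keys are illustrative):

val rdd1 = sc.textFile("hdfs:///data/input").cache()  // lineage computed once
val rdd2 = rdd1.groupBy(_.length)
val rdd3 = rdd1.groupBy(_.take(1))
rdd2.count()  // materializes rdd1 and caches it
rdd3.count()  // reuses the cached rdd1 instead of re-reading the input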
Re: Spark SQL implementation error
Anytime you see java.lang.NoSuchMethodError it means that you have multiple conflicting versions of a library on the classpath, or you are trying to run code that was compiled against the wrong version of a library. On Tue, Dec 30, 2014 at 1:43 AM, sachin Singh sachin.sha...@gmail.com wrote: I have a table (CSV file), loaded data into it by creating a POJO as per the table structure, and created a SchemaRDD as under:
JavaRDD<Test1> testSchema = sc.textFile("D:/testTable.csv").map(GetTableData); /* GetTableData will transform all the table data into testTable objects */
JavaSchemaRDD schemaTest = sqlContext.applySchema(testSchema, Test.class);
schemaTest.registerTempTable("testTable");
JavaSchemaRDD sqlQuery = sqlContext.sql("SELECT * FROM testTable");
List<String> totDuration = sqlQuery.map(new Function<Row, String>() {
  public String call(Row row) { return "Field1 is : " + row.getInt(0); }
}).collect();
It's working fine, but if I change the query as follows (the rest of the code is the same):
JavaSchemaRDD sqlQuery = sqlContext.sql("SELECT sum(field1) FROM testTable group by field2");
I get the error: Exception in thread main java.lang.NoSuchMethodError: org.apache.spark.rdd.ShuffledRDD.<init>(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/Partitioner;)V Please help and suggest.
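One quick, generic diagnostic for this kind of conflict (a sketch, not something from the thread) is to print where the JVM actually loaded the Spark classes from:

// Prints the jar org.apache.spark.rdd.RDD was loaded from; if it is not the
// Spark version you compiled against, that is the classpath conflict.
// (getCodeSource can be null for bootstrap-classpath classes.)
println(classOf[org.apache.spark.rdd.RDD[_]]
  .getProtectionDomain.getCodeSource.getLocation)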
Re: SparkContext with error from PySpark
Hi I am using Anaconda Python. Is there any way to specify the Python we have to use for running pyspark in a cluster? Best regards Jagan On Tue, Dec 30, 2014 at 6:27 PM, Eric Friedman eric.d.fried...@gmail.com wrote: The Python installed in your cluster is 2.5. You need at least 2.6. Eric Friedman On Dec 30, 2014, at 7:45 AM, Jaggu jagana...@gmail.com wrote: Hi Team, I was trying to execute a Pyspark code in a cluster. It gives me the following error. (When I run the same job in local mode it works fine too :-()
Error from python worker: /usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/context.py:209: Warning: 'with' will become a reserved keyword in Python 2.6
Traceback (most recent call last):
File /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/runpy.py, line 85, in run_module loader = get_loader(mod_name)
File /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py, line 456, in get_loader return find_loader(fullname)
File /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py, line 466, in find_loader for importer in iter_importers(fullname):
File /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py, line 422, in iter_importers __import__(pkg)
File /usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/__init__.py, line 41, in <module> from pyspark.context import SparkContext
File /usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/context.py, line 209 with SparkContext._lock: ^ SyntaxError: invalid syntax
PYTHONPATH was: /usr/lib/spark-1.2.0-bin-hadoop2.3/python:/usr/lib/spark-1.2.0-bin-hadoop2.3/python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark-1.2.0-bin-hadoop2.3/lib/spark-assembly-1.2.0-hadoop2.3.0.jar:/usr/lib/spark-1.2.0-bin-hadoop2.3/sbin/../python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark-1.2.0-bin-hadoop2.3/sbin/../python:/home/beehive/bin/utils/primitives:/home/beehive/bin/utils/pylogger:/home/beehive/bin/utils/asterScript:/home/beehive/bin/lib:/home/beehive/bin/utils/init:/home/beehive/installer/packages:/home/beehive/ncli
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:102)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
14/12/31 04:49:58 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, aster4, NODE_LOCAL, 1321 bytes)
14/12/31 04:49:58 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on aster4:43309 (size: 3.8 KB, free: 265.0 MB)
14/12/31 04:49:59 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on executor aster4: org.apache.spark.SparkException (
Any clue how to resolve the same?
Best regards Jagan -- JAGANADH G http://jaganadhg.in ILUGCBE http://ilugcbe.org.in
Spark Accumulators exposed as Metrics to Graphite
Hi, Does Spark have a built-in possibility of exposing the current value of an Accumulator [1] using Monitoring and Instrumentation [2]? Unfortunately I couldn't find anything in the sources which could be used. Does it mean the only way to expose the current accumulator value is to implement a new Source which would hook into the Accumulator in the driver process, or listen for events on the bus? Many thanks! [1] https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.Accumulator [2] http://spark.apache.org/docs/1.1.1/monitoring.html
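For reference, a minimal sketch of the driver-side hook idea using the Codahale/Dropwizard metrics library that Spark's metrics system builds on. This is an assumption-laden sketch, not a documented Spark API: the registry here is your own, and wiring it to a GraphiteReporter is left out.

import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.spark.Accumulator

val registry = new MetricRegistry()

// Expose the accumulator's current driver-side value as a gauge; a
// GraphiteReporter attached to this registry could then ship it.
def exposeAccumulator(name: String, acc: Accumulator[Int]): Unit =
  registry.register(name, new Gauge[Int] {
    override def getValue: Int = acc.value
  })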
Re: Spark Streaming: HiveContext within Custom Actor
I am not sure that can be done. Receivers are designed to be run only on the executors/workers, whereas a SQLContext (for using Spark SQL) can only be defined on the driver. On Mon, Dec 29, 2014 at 6:45 PM, sranga sra...@gmail.com wrote: Hi Could Spark-SQL be used from within a custom actor that acts as a receiver for a streaming application? If yes, what is the recommended way of passing the SparkContext to the actor? Thanks for your help. - Ranga
Re: word count aggregation
For windows that large (1 hour), you will probably also have to increase the batch interval for efficiency. TD On Mon, Dec 29, 2014 at 12:16 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You can use reduceByKeyAndWindow for that. Here's a pretty clean example: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala Thanks Best Regards On Mon, Dec 29, 2014 at 1:30 PM, Hoai-Thu Vuong thuv...@gmail.com wrote: Dear Spark users, I've got a program streaming a folder: when a new file is created in this folder, I count the words that appear in the document and update the counts (I used StatefulNetworkWordCount to do it), and it works like a charm. However, I would like to know the difference between the top 10 words now and one hour before. How could I do it? I tried to use windowDuration, but it doesn't seem to work.
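A minimal sketch of the windowed approach being suggested, assuming words is an existing DStream[String] and that checkpointing is enabled (required by the inverse-reduce form); the durations are illustrative:

import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.StreamingContext._

val counts = words.map((_, 1))
  .reduceByKeyAndWindow(_ + _, _ - _, Minutes(60), Minutes(10))

// Top 10 words over the last hour, recomputed every 10 minutes; comparing
// successive outputs gives the "now vs. one hour ago" difference.
val top10 = counts.transform { rdd =>
  rdd.sparkContext.parallelize(rdd.top(10)(Ordering.by[(String, Int), Int](_._2)))
}
top10.print()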
Re: Shuffle Problems in 1.2.0
Hi Sven, Do you have a small example program that you can share which will allow me to reproduce this issue? If you have a workload that runs into this, you should be able to keep iteratively simplifying the job and reducing the data set size until you hit a fairly minimal reproduction (assuming the issue is deterministic, which it sounds like it is). On Tue, Dec 30, 2014 at 9:49 AM, Sven Krasser kras...@gmail.com wrote: Hey all, Since upgrading to 1.2.0 a pyspark job that worked fine in 1.1.1 fails during shuffle. I've tried reverting from the sort-based shuffle back to the hash one, and that fails as well. Does anyone see similar problems or has an idea on where to look next? For the sort-based shuffle I get a bunch of exception like this in the executor logs: 2014-12-30 03:13:04,061 ERROR [Executor task launch worker-2] executor.Executor (Logging.scala:logError(96)) - Exception in task 4523.0 in stage 1.0 (TID 4524) org.apache.spark.SparkException: PairwiseRDD: unexpected value: List([B@130dc7ad) at org.apache.spark.api.python.PairwiseRDD$$anonfun$compute$2.apply(PythonRDD.scala:307) at org.apache.spark.api.python.PairwiseRDD$$anonfun$compute$2.apply(PythonRDD.scala:305) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) For the hash-based shuffle, there are now a bunch of these exceptions in the logs: 2014-12-30 04:14:01,688 ERROR [Executor task launch worker-0] executor.Executor (Logging.scala:logError(96)) - Exception in task 4479.0 in stage 1.0 (TID 4480) java.io.FileNotFoundException: /mnt/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1419905501183_0004/spark-local-20141230035728-8fc0/23/merged_shuffle_1_68_0 (No such file or directory) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.init(FileOutputStream.java:221) at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123) at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192) at org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67) at org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at 
java.lang.Thread.run(Thread.java:745) Thank you! -Sven
Re: SparkContext with error from PySpark
To configure the Python executable used by PySpark, see the "Using the Shell" Python section in the Spark Programming Guide: https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell You can set the PYSPARK_PYTHON environment variable to choose the Python executable that will be used on the driver and executors. In addition, you can set PYSPARK_DRIVER_PYTHON to use a different Python executable only on the driver (this is useful if you want to use IPython on the driver but not on the executors).
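A minimal spark-env.sh sketch of that advice; the Anaconda paths are illustrative, not from the thread:

# In conf/spark-env.sh on the nodes (paths are examples only):
export PYSPARK_PYTHON=/opt/anaconda/bin/python         # driver and executors
export PYSPARK_DRIVER_PYTHON=/opt/anaconda/bin/ipython # driver only, optional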
Re: Spark Streaming: HiveContext within Custom Actor
Thanks. Will look at other options. - Ranga
Location of logs in local mode
I'm submitting a script using spark-submit in local mode for testing, and I'm having trouble figuring out where the logs are stored. The documentation indicates that they should be in the work folder in the directory in which Spark lives on my system, but I see no such folder there. I've set the SPARK_LOCAL_DIRS and SPARK_LOG_DIR environment variables in spark-env.sh, but there doesn't seem to be any log output generated in the locations I've specified there either. I'm just using spark-submit with master local, I haven't run any of the standalone cluster scripts, so I'm not sure if there's something I'm missing here as far as a default output location for logging. Thanks, Brett
Gradual slow down of the Streaming job (getCallSite at DStream.scala:294)
Here is the code for my streaming job.
~~
val sparkConf = new SparkConf().setAppName("SparkStreamingJob")
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.default.parallelism", "100")
sparkConf.set("spark.shuffle.consolidateFiles", "true")
sparkConf.set("spark.speculation", "true")
sparkConf.set("spark.speculation.interval", "5000")
sparkConf.set("spark.speculation.quantile", "0.9")
sparkConf.set("spark.speculation.multiplier", "3")
sparkConf.set("spark.mesos.coarse", "true")
sparkConf.set("spark.executor.extraJavaOptions", "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC")
sparkConf.set("spark.shuffle.manager", "SORT")

val ssc = new StreamingContext(sparkConf, Seconds(10))
ssc.checkpoint(checkpointDir)

val topics = "trace"
val numThreads = 1
val topicMap = topics.split(",").map((_, numThreads)).toMap

val kafkaPartitions = 20
val kafkaDStreams = (1 to kafkaPartitions).map { _ =>
  KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
}

val lines = ssc.union(kafkaDStreams)
val words = lines.map(line => doSomething_1(line))
val filteredWords = words.filter(word => word != "test")
val groupedWords = filteredWords.map(word => (word, 1))

val windowedWordCounts = groupedWords.reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(10))
val windowedWordsFiltered = windowedWordCounts.filter { case (word, count) => count > 50 }
val finalResult = windowedWordsFiltered.foreachRDD(words => doSomething_2(words))

ssc.start()
ssc.awaitTermination()
~~
I am running this job on a 9-slave AWS EC2 cluster where each slave node has 32 vCPUs and 60GB memory. When I start this job, the processing time is usually around 5 - 6 seconds for the 10-second batch and the scheduling delay is around 0 seconds or a few ms. However, as the job runs for 6 - 8 hours, the processing time increases to 15 - 20 seconds, but the scheduling delay increases to 4 - 6 hours. When I look at the completed stages, I see that the time taken for "getCallSite at DStream.scala:294" keeps increasing as time passes by. It goes from around 2 seconds to more than a few minutes.
Clicking on +details next to this stage description shows the following execution trace:
org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1088)
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:294)
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:288)
scala.Option.orElse(Option.scala:257)
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:285)
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:221)
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:221)
scala.util.Try$.apply(Try.scala:161)
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:221)
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:165)
When I click on one of these slow stages that executed after 6 - 8 hours, I find the following information for the individual tasks inside:
- All tasks seem to execute with PROCESS_LOCAL locality.
- Quite a few of these tasks seem to spend anywhere between 30 - 80% of their time in GC, although when I look at the total memory usage on each of the slave nodes under the executors information, I see that the usage is only around 200MB out of 20GB available.
Even after a few hours, the map stages (val groupedWords = filteredWords.map(word => (word, 1))) seem to have times consistent with the start of the job, which seems to indicate that this code is fine. Also, the number of waiting batches is either 0 or 1 even after 8 to 10 hours. Based on the information that the map is as fast as at the start of the job and that there are no waiting batches, I am assuming that the getCallSite stages correspond to getting data out of Kafka. Is this correct or not? If my assumption is correct, is there anything that I could do to optimize receiving data from Kafka? If not, which part of my code needs to be optimized to reduce the scheduling delay?
Kafka + Spark streaming
Hi Experts, A few general queries: 1. Can a single block/partition in an RDD have more than one Kafka message, or will there be only one Kafka message per block? More broadly, is the message count related to the block in any way, or is it just that any message received within a particular block interval will go in the same block? 2. If a worker which runs the Receiver for Kafka goes down, will the receiver be restarted on some other worker? Regards, Sam
Trouble using MultipleTextOutputFormat with Spark
Hi, I am trying to use MultipleTextOutputFormat to rename the output files of my Spark job to something different from the default part-N. I have implemented a custom MultipleTextOutputFormat class as follows:
class DriveOutputRenameMultipleTextOutputFormat extends MultipleTextOutputFormat[String, Any] {
  override def generateFileNameForKeyValue(key: String, value: Any, name: String): String = {
    "DRIVE" + "-" + name.split("-")(1) + ".csv"
  }
}
When I call the saveAsHadoopFile() function on an RDD[K,V], I get the following error:
sc.textFile("/mnt/raw/drive/2014/10/29/part-0").map(x => (x, null)).saveAsHadoopFile("/mnt/test", classOf[String], classOf[Any], classOf[DriveOutputRenameMultipleTextOutputFormat])
java.lang.RuntimeException: java.lang.NoSuchMethodException: line210ee86336174025bcee4914212e1ff6168.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$DriveOutputRenameMultipleTextOutputFormat.<init>()
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
at org.apache.hadoop.mapred.JobConf.getOutputFormat(JobConf.java:619)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1001)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:931)
Caused by: java.lang.NoSuchMethodException: line210ee86336174025bcee4914212e1ff6168.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$DriveOutputRenameMultipleTextOutputFormat.<init>()
at java.lang.Class.getConstructor0(Class.java:2892)
at java.lang.Class.getDeclaredConstructor(Class.java:2058)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
at org.apache.hadoop.mapred.JobConf.getOutputFormat(JobConf.java:619)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1001)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:931)
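The $$iwC wrappers in that exception suggest the class was defined inside the spark-shell REPL, where it becomes a nested class without the public no-arg constructor that Hadoop's ReflectionUtils.newInstance requires. A hedged sketch of the usual workaround, compiling the format as a top-level class into a jar added via --jars (package name is hypothetical):

package com.example.output  // hypothetical package; build into a jar instead
                            // of defining the class in the REPL

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class DriveOutputRenameMultipleTextOutputFormat
    extends MultipleTextOutputFormat[String, Any] {
  // A top-level class gets the no-arg constructor ReflectionUtils needs.
  override def generateFileNameForKeyValue(key: String, value: Any, name: String): String =
    "DRIVE-" + name.split("-")(1) + ".csv"
}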
JetS3T settings spark
I am not sure of the way I can pass a jets3t.properties file to spark-submit. The --files option doesn't seem to work. Can someone please help me? My production Spark jobs sporadically hang when reading S3 files. Thanks, -D
Re: Long-running job cleanup
Hi Patrick, to follow up on the below discussion, I am including a short code snippet that produces the problem on 1.1. This is kind of stupid code since it's a greatly simplified version of what I'm actually doing, but it has a number of the key components in place. I'm also including some example log output. Thank you.
def showMemoryUsage(sc: SparkContext) = {
  val usersPerStep = 25000
  val count = 100
  val numSteps = count / usersPerStep
  val users = sc.parallelize(1 to count)
  val zippedUsers = users.zipWithIndex().cache()
  val userFeatures: RDD[(Int, Int)] = sc.parallelize(1 to count).map(s => (s, 2)).cache()
  val productFeatures: RDD[(Int, Int)] = sc.parallelize(1 to 5000).map(s => (s, 4)).cache()
  for (i <- 1 to numSteps) {
    val usersFiltered = zippedUsers.filter(s => {
      ((i - 1) * usersPerStep <= s._2) && (s._2 < i * usersPerStep)
    }).map(_._1).collect()
    usersFiltered.foreach(user => {
      val mult = productFeatures.map(s => s._2 * userFeatures.lookup(user).head)
      mult.takeOrdered(20)
      // Normally this would then be written to disk
      // For the sake of the example this is all we're doing
    })
  }
}
Example broadcast variable added:
14/12/30 19:25:19 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on innovationclientnode01.cof.ds.capitalone.com:54640 (size: 794.0 B, free: 441.9 MB)
And then if I parse the entire log looking for "free: XXX.X MB" I see the available memory slowly ticking away:
Free 441.8 MB
Free 441.8 MB
Free 441.8 MB
Free 441.8 MB
Free 441.8 MB
Free 441.8 MB
Free 441.8 MB
…
Free 441.7 MB
Free 441.7 MB
Free 441.7 MB
Free 441.7 MB
And so on. Clearly the above code is not persisting the intermediate RDD (mult), yet memory is never being properly freed up.
From: Ilya Ganelin ilgan...@gmail.com Date: Sunday, December 28, 2014 at 4:02 PM To: Patrick Wendell pwend...@gmail.com, Ganelin, Ilya ilya.gane...@capitalone.com Cc: user@spark.apache.org Subject: Re: Long-running job cleanup
Hi Patrick - is that cleanup present in 1.1? The overhead I am talking about is with regard to what I believe is shuffle-related metadata. If I watch the execution log I see small broadcast variables created for every stage of execution, a few KB at a time, and a certain number of MB remaining of available memory on the driver. As I run, this available memory goes down, and these variables are never erased. The only RDDs that persist are those that are explicitly cached. The RDDs that are generated iteratively are not retained or referenced, so I would expect things to get cleaned up, but they do not. The items consuming memory are not RDDs but what appears to be shuffle metadata. I have a script that parses the logs to show memory consumption over time, and the script shows memory very steadily being consumed over many hours, without clearing, one small bit at a time. The specific computation I am doing is the generation of dot products between two RDDs of vectors. I need to generate this product for every combination of products between the two RDDs, but both RDDs are too big to fit in memory. Consequently, I iteratively generate this product across one entry from the first RDD and all entries from the second and retain the pared-down result within an accumulator (by retaining the top N results it is possible to actually store the Cartesian product, which is otherwise too large to fit on disk).
After a certain number of iterations these intermediate results are then written to disk. Each of these steps is tractable in itself, but due to the accumulation of memory, the overall job becomes intractable. I would appreciate any suggestions as to how to clean up these intermediate broadcast variables. Thank you. On Sun, Dec 28, 2014 at 1:56 PM Patrick Wendell pwend...@gmail.com wrote: What do you mean when you say "the overhead of spark shuffles start to accumulate"? Could you elaborate more? In newer versions of Spark shuffle data is cleaned up automatically when an RDD goes out of scope. It is safe to remove shuffle data at this point because the RDD can no longer be referenced. If you are seeing a large build-up of shuffle data, it's possible you are retaining references to older RDDs inadvertently. Could you explain what your job is actually doing? - Patrick On Mon, Dec 22, 2014 at 2:36 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Hi all, I have a long-running job iterating over a huge dataset. Parts of this operation are cached. Since the job runs for so long, eventually the overhead of Spark shuffles starts to accumulate, culminating in the driver starting to swap. I am aware of the spark.cleaner.ttl parameter that allows me to configure when cleanup happens but the issue with
Re: building spark1.2 meet error
No, it still fails using: mvn -Pyarn -Phadoop-2.5 -Dhadoop.version=2.5.0 -Dscala-2.10 -X -DskipTests clean package
...
[DEBUG] /opt/xdsp/spark-1.2.0/core/src/main/scala
[DEBUG] includes = [**/*.scala,**/*.java,]
[DEBUG] excludes = []
[WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile
[INFO] Using incremental compilation
[DEBUG] Setup = {
[DEBUG] scala compiler = /opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar
[DEBUG] scala library = /opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/org/scala-lang/scala-library/2.10.4/scala-library-2.10.4.jar
[DEBUG] scala extra = {
[DEBUG] /opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/org/scala-lang/scala-reflect/2.10.4/scala-reflect-2.10.4.jar
[DEBUG] }
[DEBUG] sbt interface = /opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/com/typesafe/sbt/sbt-interface/0.13.5/sbt-interface-0.13.5.jar
[DEBUG] compiler interface sources = /opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/com/typesafe/sbt/compiler-interface/0.13.5/compiler-interface-0.13.5-sources.jar
[DEBUG] java home =
[DEBUG] fork java = false
[DEBUG] cache directory = /root/.zinc/0.3.5
[DEBUG] }
[INFO] 'compiler-interface' not yet compiled for Scala 2.10.4. Compiling...
[DEBUG] Plain interface to Scala compiler 2.10.4 with arguments: -nowarn -d /tmp/sbt_8b816650 -bootclasspath /opt/jdk1.7/jre/lib/resources.jar:/opt/jdk1.7/jre/lib/rt.jar:/opt/jdk1.7/jre/lib/sunrsasign.jar:/opt/jdk1.7/jre/lib/jsse.jar:/opt/jdk1.7/jre/lib/jce.jar:/opt/jdk1.7/jre/lib/charsets.jar:/opt/jdk1.7/jre/lib/jfr.jar:/opt/jdk1.7/jre/classes:/opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/org/scala-lang/scala-library/2.10.4/scala-library-2.10.4.jar -classpath /opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/com/typesafe/sbt/sbt-interface/0.13.5/sbt-interface-0.13.5.jar:/opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/com/typesafe/sbt/compiler-interface/0.13.5/compiler-interface-0.13.5-sources.jar:/opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar:/opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/org/scala-lang/scala-reflect/2.10.4/scala-reflect-2.10.4.jar /tmp/sbt_b9456a7b/xsbt/API.scala /tmp/sbt_b9456a7b/xsbt/Analyzer.scala /tmp/sbt_b9456a7b/xsbt/Command.scala /tmp/sbt_b9456a7b/xsbt/Compat.scala /tmp/sbt_b9456a7b/xsbt/CompilerInterface.scala /tmp/sbt_b9456a7b/xsbt/ConsoleInterface.scala /tmp/sbt_b9456a7b/xsbt/DelegatingReporter.scala /tmp/sbt_b9456a7b/xsbt/Dependency.scala /tmp/sbt_b9456a7b/xsbt/ExtractAPI.scala /tmp/sbt_b9456a7b/xsbt/ExtractUsedNames.scala /tmp/sbt_b9456a7b/xsbt/LocateClassFile.scala /tmp/sbt_b9456a7b/xsbt/Log.scala /tmp/sbt_b9456a7b/xsbt/Message.scala /tmp/sbt_b9456a7b/xsbt/ScaladocInterface.scala
error: scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found.
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.getPackage(Mirrors.scala:172) at scala.reflect.internal.Mirrors$RootsBase.getRequiredPackage(Mirrors.scala:175) at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage$lzycompute(Definitions.scala:183) at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage(Definitions.scala:183) at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass$lzycompute(Definitions.scala:184) at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass(Definitions.scala:184) at scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr$lzycompute(Definitions.scala:1024) at scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr(Definitions.scala:1023) at scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1153) at scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1152) at scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1196) at scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode(Definitions.scala:1196) at scala.reflect.internal.Definitions$DefinitionsClass.init(Definitions.scala:1261) at
Re: JetS3T settings spark
This file needs to be on your CLASSPATH actually, not just in a directory. The best way to pass it in is probably to package it into your application JAR. You can put it in src/main/resources in a Maven or SBT project, and check that it makes it into the JAR using jar tf yourfile.jar. Matei
Re: Shuffle Problems in 1.2.0
Hey Josh, I am still trying to prune this to a minimal example, but it has been tricky since scale seems to be a factor. The job runs over ~720GB of data (the cluster's total RAM is around ~900GB, split across 32 executors). I've managed to run it over a vastly smaller data set without issues. Curiously, when I run it over slightly smaller data set of ~230GB (using sort-based shuffle), my job also fails, but I see no shuffle errors in the executor logs. All I see is the error below from the driver (this is also what the driver prints when erroring out on the large data set, but I assumed the executor errors to be the root cause). Any idea on where to look in the interim for more hints? I'll continue to try to get to a minimal repro. 2014-12-30 21:35:34,539 INFO [sparkDriver-akka.actor.default-dispatcher-14] spark.MapOutputTrackerMasterActor (Logging.scala:logInfo(59)) - Asked to send map output locations for shuffle 0 to sparkexecu...@ip-10-20-80-60.us-west-1.compute.internal:39739 2014-12-30 21:35:39,512 INFO [sparkDriver-akka.actor.default-dispatcher-17] spark.MapOutputTrackerMasterActor (Logging.scala:logInfo(59)) - Asked to send map output locations for shuffle 0 to sparkexecu...@ip-10-20-80-62.us-west-1.compute.internal:42277 2014-12-30 21:35:58,893 WARN [sparkDriver-akka.actor.default-dispatcher-16] remote.ReliableDeliverySupervisor (Slf4jLogger.scala:apply$mcV$sp(71)) - Association with remote system [akka.tcp://sparkyar...@ip-10-20-80-64.us-west-1.compute.internal:49584] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 2014-12-30 21:35:59,044 ERROR [Yarn application state monitor] cluster.YarnClientSchedulerBackend (Logging.scala:logError(75)) - Yarn application has already exited with state FINISHED! 2014-12-30 21:35:59,056 INFO [Yarn application state monitor] handler.ContextHandler (ContextHandler.java:doStop(788)) - stopped o.e.j.s.ServletContextHandler{/stages/stage/kill,null} [...] 2014-12-30 21:35:59,111 INFO [Yarn application state monitor] ui.SparkUI (Logging.scala:logInfo(59)) - Stopped Spark web UI at http://ip-10-20-80-37.us-west-1.compute.internal:4040 2014-12-30 21:35:59,130 INFO [Yarn application state monitor] scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Stopping DAGScheduler 2014-12-30 21:35:59,131 INFO [Yarn application state monitor] cluster.YarnClientSchedulerBackend (Logging.scala:logInfo(59)) - Shutting down all executors 2014-12-30 21:35:59,132 INFO [sparkDriver-akka.actor.default-dispatcher-14] cluster.YarnClientSchedulerBackend (Logging.scala:logInfo(59)) - Asking each executor to shut down 2014-12-30 21:35:59,132 INFO [Thread-2] scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Job 1 failed: collect at /home/hadoop/test_scripts/test.py:63, took 980.751936 s Traceback (most recent call last): File /home/hadoop/test_scripts/test.py, line 63, in module result = j.collect() File /home/hadoop/spark/python/pyspark/rdd.py, line 676, in collect bytesInJava = self._jrdd.collect().iterator() File /home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError2014-12-30 21:35:59,140 INFO [Yarn application state monitor] cluster.YarnClientSchedulerBackend (Logging.scala:logInfo(59)) - Stopped : An error occurred while calling o117.collect. 
: org.apache.spark.SparkException: Job cancelled because SparkContext was shut down at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:702) at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:701) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:701) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1428) at akka.actor.Actor$class.aroundPostStop(Actor.scala:475) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundPostStop(DAGScheduler.scala:1375) at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210) at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172) at akka.actor.ActorCell.terminate(ActorCell.scala:369) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at
Re: python: module pyspark.daemon not found
Could you share a link about this? It's common to use Java 7, that will be nice if we can fix this. On Mon, Dec 29, 2014 at 1:27 PM, Eric Friedman eric.d.fried...@gmail.com wrote: Was your spark assembly jarred with Java 7? There's a known issue with jar files made with that version. It prevents them from being used on PYTHONPATH. You can rejar with Java 6 for better results. Eric Friedman On Dec 29, 2014, at 8:01 AM, Naveen Kumar Pokala npok...@spcapitaliq.com wrote: 14/12/29 18:10:56 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 2, nj09mhf0730.mhf.mhc, PROCESS_LOCAL, 1246 bytes) 14/12/29 18:10:56 INFO TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) on executor nj09mhf0730.mhf.mhc: org.apache.spark.SparkException ( Error from python worker: python: module pyspark.daemon not found PYTHONPATH was: /home/npokala/data/spark-install/spark-master/python:/home/npokala/data/spark-install/spark-master/python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar:/home/npokala/data/spark-install/spark-master/sbin/../python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/sbin/../python: java.io.EOFException) [duplicate 1] 14/12/29 18:10:56 INFO TaskSetManager: Starting task 1.1 in stage 0.0 (TID 3, nj09mhf0731.mhf.mhc, PROCESS_LOCAL, 1246 bytes) 14/12/29 18:10:56 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 2) on executor nj09mhf0730.mhf.mhc: org.apache.spark.SparkException ( Error from python worker: python: module pyspark.daemon not found PYTHONPATH was: /home/npokala/data/spark-install/spark-master/python:/home/npokala/data/spark-install/spark-master/python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar:/home/npokala/data/spark-install/spark-master/sbin/../python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/sbin/../python: java.io.EOFException) [duplicate 2] 14/12/29 18:10:56 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 4, nj09mhf0731.mhf.mhc, PROCESS_LOCAL, 1246 bytes) 14/12/29 18:10:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on nj09mhf0731.mhf.mhc:48802 (size: 3.4 KB, free: 265.1 MB) 14/12/29 18:10:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on nj09mhf0731.mhf.mhc:41243 (size: 3.4 KB, free: 265.1 MB) 14/12/29 18:10:59 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 3) on executor nj09mhf0731.mhf.mhc: org.apache.spark.SparkException ( Error from python worker: python: module pyspark.daemon not found PYTHONPATH was: /home/npokala/data/spark-install/spark-master/python:/home/npokala/data/spark-install/spark-master/python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar:/home/npokala/data/spark-install/spark-master/sbin/../python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/sbin/../python: java.io.EOFException) [duplicate 3] 14/12/29 18:10:59 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 5, nj09mhf0730.mhf.mhc, PROCESS_LOCAL, 1246 bytes) 14/12/29 18:10:59 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 4) on executor nj09mhf0731.mhf.mhc: org.apache.spark.SparkException ( Error from python worker: python: module pyspark.daemon not found PYTHONPATH was: 
/home/npokala/data/spark-install/spark-master/python:/home/npokala/data/spark-install/spark-master/python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar:/home/npokala/data/spark-install/spark-master/sbin/../python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/sbin/../python: java.io.EOFException) [duplicate 4] 14/12/29 18:10:59 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 6, nj09mhf0730.mhf.mhc, PROCESS_LOCAL, 1246 bytes) 14/12/29 18:11:00 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on nj09mhf0730.mhf.mhc:60005 (size: 3.4 KB, free: 265.1 MB) 14/12/29 18:11:00 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on nj09mhf0730.mhf.mhc:40227 (size: 3.4 KB, free: 265.1 MB) 14/12/29 18:11:01 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 6) on executor nj09mhf0730.mhf.mhc: org.apache.spark.SparkException ( Error from python worker: python: module pyspark.daemon not found PYTHONPATH was:
Re: Python:Streaming Question
There is a known bug with the local scheduler; it will be fixed by https://github.com/apache/spark/pull/3779 On Sun, Dec 21, 2014 at 10:57 PM, Samarth Mailinglist mailinglistsama...@gmail.com wrote: I'm trying to run the stateful network word count at https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.py using the command: ./bin/spark-submit examples/src/main/python/streaming/stateful_network_wordcount.py localhost I am also running netcat at the same time (prior to running the above command): nc -lk However, no word count is printed (even though pprint() is being called). How do I print the results? How do I otherwise access the data in real time? Suppose I want to have a dashboard showing the data in running_counts. Note that https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py works perfectly fine. Running Spark 1.2.0, Hadoop 2.4.x prebuilt. Thanks, Samarth
Re: Kafka + Spark streaming
1. Of course, a single block / partition has many Kafka messages, even from different Kafka topics interleaved together. The message count is not related to the block count; any message received within a particular block interval will go in the same block. 2. Yes, the receiver will be started on another worker. TD
Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster
That is kind of expected due to data locality, though you should see some tasks running on the other executors as the data gets replicated to other nodes, which can then run tasks based on locality. You have two solutions: 1. kafkaStream.repartition() to explicitly repartition the received data across the cluster. 2. Create multiple Kafka streams and union them together. See http://spark.apache.org/docs/latest/streaming-programming-guide.html#reducing-the-processing-time-of-each-batch On Tue, Dec 30, 2014 at 1:43 AM, Mukesh Jha me.mukesh@gmail.com wrote: Thanks Sandy, it was the issue with the number of cores. Another issue I was facing is that tasks are not getting distributed evenly among all executors and are running at the NODE_LOCAL locality level, i.e. all the tasks are running on the same executor where my kafkareceiver(s) are running, even though other executors are idle. I configured spark.locality.wait=50 instead of the default 3000 ms, which forced the task rebalancing among nodes; let me know if there is a better way to deal with this. On Tue, Dec 30, 2014 at 12:09 AM, Mukesh Jha me.mukesh@gmail.com wrote: Makes sense. I've also tried it in standalone mode where all 3 workers + driver were running on the same 8-core box and the results were similar. Anyway, I will share the results in YARN mode with 8-core YARN containers. On Mon, Dec 29, 2014 at 11:58 PM, Sandy Ryza sandy.r...@cloudera.com wrote: When running in standalone mode, each executor will be able to use all 8 cores on the box. When running on YARN, each executor will only have access to 2 cores. So the comparison doesn't seem fair, no? -Sandy On Mon, Dec 29, 2014 at 10:22 AM, Mukesh Jha me.mukesh@gmail.com wrote: Nope, I am setting 5 executors with 2 cores each. Below is the command that I'm using to submit in YARN mode. This starts up 5 executor nodes and a driver as per the Spark application master UI. spark-submit --master yarn-cluster --num-executors 5 --driver-memory 1024m --executor-memory 1024m --executor-cores 2 --class com.oracle.ci.CmsgK2H /homext/lib/MJ-ci-k2h.jar vm.cloud.com:2181/kafka spark-yarn avro 1 5000 On Mon, Dec 29, 2014 at 11:45 PM, Sandy Ryza sandy.r...@cloudera.com wrote: *oops, I mean are you setting --executor-cores to 8 On Mon, Dec 29, 2014 at 10:15 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Are you setting --num-executors to 8? On Mon, Dec 29, 2014 at 10:13 AM, Mukesh Jha me.mukesh@gmail.com wrote: Sorry Sandy, the command is just for reference, but I can confirm that there are 4 executors and a driver as shown in the Spark UI page. Each of these machines is an 8-core box with ~15G of RAM. On Mon, Dec 29, 2014 at 11:23 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Mukesh, based on your spark-submit command, it looks like you're only running with 2 executors on YARN. Also, how many cores does each machine have? -Sandy On Mon, Dec 29, 2014 at 4:36 AM, Mukesh Jha me.mukesh@gmail.com wrote: Hello Experts, I'm benchmarking Spark on YARN (https://spark.apache.org/docs/latest/running-on-yarn.html) vs a standalone Spark cluster (https://spark.apache.org/docs/latest/spark-standalone.html). I have a standalone cluster with 3 executors, and a Spark app running on YARN with 4 executors as shown below. The Spark job running inside YARN is 10x slower than the one running on the standalone cluster (even though YARN has a larger number of workers); also, in both cases all the executors are in the same datacenter, so there shouldn't be any latency.
On YARN, each 5-second batch reads data from Kafka and is processed in ~5 seconds; on the standalone cluster, each 5-second batch is processed in ~0.4 seconds. Also, in YARN mode the executors are not getting used evenly, as vm-13 and vm-14 are running most of the tasks, whereas in standalone mode all the executors run tasks. Do I need to set up some configuration to evenly distribute the tasks? Also, do you have any pointers on the reasons the YARN job is 10x slower than the standalone job? Any suggestion is greatly appreciated. Thanks in advance.
YARN (5 workers + driver), one executor per row, columns: Executor ID | Address | RDD Blocks | Memory Used | Disk Used | Active/Failed/Complete/Total Tasks | Task Time | Input | Shuffle Read | Shuffle Write
1 | vm-18.cloud.com:51796 | 0 | 0.0B/530.3MB | 0.0 B | 1/0/16/17 | 634 ms | 0.0 B | 2047.0 B | 1710.0 B
2 | vm-13.cloud.com:57264 | 0 | 0.0B/530.3MB | 0.0 B | 0/0/1427/1427 | 5.5 m | 0.0 B | 0.0 B | 0.0 B
3 | vm-14.cloud.com:54570 | 0 | 0.0B/530.3MB | 0.0 B | 0/0/1379/1379 | 5.2 m | 0.0 B | 1368.0 B | 2.8 KB
4 | vm-11.cloud.com:56201 | 0 | 0.0B/530.3MB | 0.0 B | 0/0/10/10 | 625 ms | 0.0 B | 1368.0 B | 1026.0 B
5 | vm-5.cloud.com:42958 | 0 | 0.0B/530.3MB | 0.0 B | 0/0/22/22 | 632 ms | 0.0 B | 1881.0 B | 2.8 KB
driver | vm.cloud.com:51847 | 0 | 0.0B/530.0MB | 0.0 B | 0/0/0/0 | 0 ms | 0.0 B | 0.0 B | 0.0 B
/homext/spark/bin/spark-submit --master yarn-cluster
Re: Gradual slow down of the Streaming job (getCallSite at DStream.scala:294)
Which version of Spark Streaming are you using? When the batch processing time increases to 15-20 seconds, could you compare the task times to the task times when the application was just launched? Basically, is the increase from 6 seconds to 15-20 seconds caused by an increase in computation or by GC etc.?
org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1088)
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:294)
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:288)
scala.Option.orElse(Option.scala:257)
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:285)
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:221)
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:221)
scala.util.Try$.apply(Try.scala:161)
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:221)
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:165)
When I click on one of these slow stages that executed after 6-8 hours, I find the following information for the individual tasks inside. - All tasks seem to execute with PROCESS_LOCAL locality. - Quite a few of these tasks seem to spend anywhere between 30-80% of their time in GC. However, when I look at the total memory usage on each of the slave nodes under the executors information, I see that the usage is only around 200MB out of the 20GB available. Even after a few hours, the map stages (val groupedWords = filteredWords.map(word => (word, 1))) seem to have consistent times as during the start of the job which seems to indicate
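One thing worth checking, offered as a hypothesis rather than a confirmed diagnosis: with the invertible form of reduceByKeyAndWindow used above, keys stay in the windowed state even after their counts fall back to zero, so the state (and the per-batch work over it) can grow without bound over hours. The API has an overload that takes a filter function to drop such keys; a minimal sketch, assuming the same word-count types as in the job above:

~~
// Sketch only: drop keys whose windowed count has fallen to zero so the
// window state does not grow indefinitely. This overload also takes a
// partition count; 100 matches the spark.default.parallelism set above.
val windowedWordCounts = groupedWords.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,         // reduce
  (a: Int, b: Int) => a - b,         // inverse reduce
  Seconds(30),                       // window duration
  Seconds(10),                       // slide duration
  100,                               // number of partitions
  (kv: (String, Int)) => kv._2 > 0   // filterFunc: retain only non-zero counts
)
~~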
Re: Gradual slow down of the Streaming job (getCallSite at DStream.scala:294)
I am running the job on 1.1.1. I will let the job run overnight and send you more info on computation vs GC time tomorrow. BTW, do you know what the stage description named getCallSite at DStream.scala:294 might mean? Thanks, RK
Re: python: module pyspark.daemon not found
https://issues.apache.org/jira/browse/SPARK-1911 is one of several tickets on the problem. On Dec 30, 2014, at 8:36 PM, Davies Liu dav...@databricks.com wrote: Could you share a link about this? It's common to use Java 7; it would be nice if we could fix this. On Mon, Dec 29, 2014 at 1:27 PM, Eric Friedman eric.d.fried...@gmail.com wrote: Was your Spark assembly jarred with Java 7? There's a known issue with jar files made with that version: it prevents them from being used on PYTHONPATH. You can rejar with Java 6 for better results. Eric Friedman On Dec 29, 2014, at 8:01 AM, Naveen Kumar Pokala npok...@spcapitaliq.com wrote:

14/12/29 18:10:56 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 2, nj09mhf0730.mhf.mhc, PROCESS_LOCAL, 1246 bytes)
14/12/29 18:10:56 INFO TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) on executor nj09mhf0730.mhf.mhc: org.apache.spark.SparkException (Error from python worker: python: module pyspark.daemon not found PYTHONPATH was: /home/npokala/data/spark-install/spark-master/python:/home/npokala/data/spark-install/spark-master/python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar:/home/npokala/data/spark-install/spark-master/sbin/../python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/sbin/../python: java.io.EOFException) [duplicate 1]
14/12/29 18:10:56 INFO TaskSetManager: Starting task 1.1 in stage 0.0 (TID 3, nj09mhf0731.mhf.mhc, PROCESS_LOCAL, 1246 bytes)
14/12/29 18:10:56 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 2) on executor nj09mhf0730.mhf.mhc: org.apache.spark.SparkException (Error from python worker: python: module pyspark.daemon not found PYTHONPATH was: /home/npokala/data/spark-install/spark-master/python:/home/npokala/data/spark-install/spark-master/python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar:/home/npokala/data/spark-install/spark-master/sbin/../python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/sbin/../python: java.io.EOFException) [duplicate 2]
14/12/29 18:10:56 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 4, nj09mhf0731.mhf.mhc, PROCESS_LOCAL, 1246 bytes)
14/12/29 18:10:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on nj09mhf0731.mhf.mhc:48802 (size: 3.4 KB, free: 265.1 MB)
14/12/29 18:10:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on nj09mhf0731.mhf.mhc:41243 (size: 3.4 KB, free: 265.1 MB)
14/12/29 18:10:59 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 3) on executor nj09mhf0731.mhf.mhc: org.apache.spark.SparkException (Error from python worker: python: module pyspark.daemon not found PYTHONPATH was: /home/npokala/data/spark-install/spark-master/python:/home/npokala/data/spark-install/spark-master/python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar:/home/npokala/data/spark-install/spark-master/sbin/../python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/sbin/../python: java.io.EOFException) [duplicate 3]
14/12/29 18:10:59 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 5, nj09mhf0730.mhf.mhc, PROCESS_LOCAL, 1246 bytes)
14/12/29 18:10:59 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 4) on executor nj09mhf0731.mhf.mhc: org.apache.spark.SparkException (Error from python worker: python: module pyspark.daemon not found PYTHONPATH was: /home/npokala/data/spark-install/spark-master/python:/home/npokala/data/spark-install/spark-master/python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar:/home/npokala/data/spark-install/spark-master/sbin/../python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/sbin/../python: java.io.EOFException) [duplicate 4]
14/12/29 18:10:59 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 6, nj09mhf0730.mhf.mhc, PROCESS_LOCAL, 1246 bytes)
14/12/29 18:11:00 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on nj09mhf0730.mhf.mhc:60005 (size: 3.4 KB, free: 265.1 MB)
14/12/29 18:11:00 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on nj09mhf0730.mhf.mhc:40227 (size: 3.4 KB, free: 265.1 MB)
14/12/29 18:11:01 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 6) on executor nj09mhf0730.mhf.mhc: org.apache.spark.SparkException (Error from python worker: python: module pyspark.daemon not found PYTHONPATH was:
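If it helps to confirm whether this is the Java 7 packaging issue: the usual explanation (an assumption here, not verified against this cluster) is that Java 7's jar tool switches to the zip64 format once an archive exceeds 65535 entries, and CPython's zipimport, which loads pyspark from the assembly jar on PYTHONPATH, cannot read zip64. A rough sketch for counting entries in your assembly:

~~
// Hedged sketch: print the number of entries in a jar. If it is above
// 65535 and was built with Java 7, the zip64 problem likely applies.
import java.util.zip.ZipFile

object JarEntryCount extends App {
  val jar = new ZipFile(args(0)) // path to spark-assembly-*.jar
  try println(s"${args(0)}: ${jar.size()} entries")
  finally jar.close()
}
~~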
Spark app performance
I have a Spark app that involves a series of mapPartitions operations and then a keyBy operation. I have measured the time spent inside the mapPartitions function blocks, and these blocks take trivial time. Still, the application takes far longer than that, and the Spark UI shows as much. So I was wondering where the time actually goes and how I can reduce it. Thanks Raghavendra
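A general point that may apply here: mapPartitions is lazy, so time measured inside the function block excludes task scheduling, (de)serialization, and any shuffle triggered downstream; the cost typically surfaces in whichever action finally materializes the chain. A hedged sketch for localizing the time, where input, transform1, and extractKey are hypothetical stand-ins for the stages described:

~~
// Sketch only: force each step with an action and time it separately.
def timed[T](label: String)(f: => T): T = {
  val t0 = System.nanoTime()
  val result = f
  println(f"$label took ${(System.nanoTime() - t0) / 1e9}%.2f s")
  result
}

val step1 = input.mapPartitions(iter => iter.map(transform1)) // hypothetical
step1.cache() // so the second timing does not recompute the first stage
timed("mapPartitions")(step1.count())

val keyed = step1.keyBy(record => extractKey(record)) // hypothetical
timed("keyBy")(keyed.count())
~~

Note that keyBy itself is a narrow map; if a shuffle dominates, it will come from whatever grouping or join follows it.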
Re: serialize protobuf messages
Anyone have suggestions? On Tue, Dec 23, 2014 at 3:08 PM, Chen Song chen.song...@gmail.com wrote: Silly question: what is the best way to shuffle protobuf messages in a Spark (Streaming) job? Can I use Kryo on top of the protobuf Message type? -- Chen Song
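Not an authoritative answer, but one commonly used approach is to register the generated protobuf classes with Kryo and delegate the byte work to protobuf's own serializer; the chill-protobuf artifact provides one. A minimal sketch, where MyProto is a placeholder for your generated Message class and chill-protobuf is assumed to be on the classpath:

~~
// Sketch, not a verified recipe: register a protobuf Message type with Kryo.
import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.protobuf.ProtobufSerializer // from chill-protobuf
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class ProtoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // MyProto is a hypothetical generated protobuf class.
    kryo.register(classOf[MyProto], new ProtobufSerializer())
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "ProtoRegistrator") // use the fully qualified name
~~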
Re: JetS3T settings spark
Thanks Matei. -D On Tue, Dec 30, 2014 at 4:49 PM, Matei Zaharia matei.zaha...@gmail.com wrote: This file needs to be on your CLASSPATH, actually, not just in a directory. The best way to pass it in is probably to package it into your application JAR: you can put it in src/main/resources in a Maven or SBT project, and check that it makes it into the JAR using jar tf yourfile.jar. Matei On Dec 30, 2014, at 4:21 PM, durga durgak...@gmail.com wrote: I am not sure of the way to pass the jets3t.properties file to spark-submit; the --file option does not seem to work. Can someone please help me? My production Spark jobs sporadically hang when reading S3 files. Thanks, -D -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/JetS3T-settings-spark-tp20916.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
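For concreteness, a sketch of the Maven layout Matei describes (the project name is a placeholder):

~~
my-app/
├── pom.xml
└── src/
    └── main/
        └── resources/
            └── jets3t.properties   <-- packaged at the root of the application JAR
~~

After building, jar tf target/my-app.jar should list jets3t.properties at the top level, which puts it on the classpath wherever the JAR is shipped.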
Re: building spark1.2 meet error
This is still using a non-existent hadoop-2.5 profile, and -Dscala-2.10 won't do anything. These don't matter, though; this error is just some scalac problem, and I don't see it when compiling. On Wed, Dec 31, 2014 at 12:48 AM, j_soft zsof...@gmail.com wrote: no, it still fails with mvn -Pyarn -Phadoop-2.5 -Dhadoop.version=2.5.0 -Dscala-2.10 -X -DskipTests clean package - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
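For reference, the Spark 1.2 build documentation covers Hadoop 2.5.x through the hadoop-2.4 profile (which targets 2.4 and later). Something along these lines, though the exact profile set is worth verifying against the docs for your checkout:

~~
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0 -DskipTests clean package
~~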
Re: Is it possible to store graph directly into HDFS?
Thanks for your answer, Xuefeng Wu. But I don't understand how to save a graph as an object. :( Do you have any sample code? Jason Hong 2014-12-30 22:31 GMT+09:00 Xuefeng Wu ben...@gmail.com: how about save as object? Yours, Xuefeng Wu On 30 Dec 2014, at 9:27 PM, Jason Hong begger3...@gmail.com wrote: Dear all :) We're trying to build a graph from large input data and get a subgraph by applying some filter. Now we want to save this graph to HDFS so that we can load it later. Is it possible to store a graph or subgraph directly into HDFS and load it as a graph for future use? We would be glad for your suggestion. Best regards. Jason Hong -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-store-graph-directly-into-HDFS-tp20908.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
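Not from this thread, but a minimal sketch of what "save as object" can look like for GraphX: save the vertex and edge RDDs separately and reassemble the graph on load. VD and ED are placeholders for your concrete vertex and edge attribute types, and the HDFS paths are assumptions:

~~
import org.apache.spark.graphx.{Edge, Graph, VertexId}

// Save: write the graph's two underlying RDDs as Hadoop object files.
graph.vertices.saveAsObjectFile("hdfs:///graphs/my-graph/vertices")
graph.edges.saveAsObjectFile("hdfs:///graphs/my-graph/edges")

// Load: read both RDDs back and rebuild the graph.
val vertices = sc.objectFile[(VertexId, VD)]("hdfs:///graphs/my-graph/vertices")
val edges    = sc.objectFile[Edge[ED]]("hdfs:///graphs/my-graph/edges")
val loaded: Graph[VD, ED] = Graph(vertices, edges)
~~

Object files use Java serialization, so VD and ED must be serializable; for a filtered subgraph, saving the subgraph's vertices and edges works the same way.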
How to set local property in beeline connect to the spark thrift server
Hi all! I use Spark SQL 1.2 to start the Thrift server on YARN, and I want to use the fair scheduler in the Thrift server. I set these properties in spark-defaults.conf: spark.scheduler.mode FAIR spark.scheduler.allocation.file /opt/spark-1.2.0-bin-2.4.1/conf/fairscheduler.xml In the Thrift server UI I can see the scheduler pools are set up correctly. [inline screenshot of the scheduler pools omitted] How can I assign one SQL job to the test pool? By default the SQL jobs run in the default pool. In the http://spark.apache.org/docs/latest/job-scheduling.html document I see that sc.setLocalProperty("spark.scheduler.pool", "pool1") can be set in code. In beeline I executed set spark.scheduler.pool=test, but it had no effect. How can I set the local property in beeline?
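One thing to try, based on the scheduling section of the Spark SQL docs (worth verifying against the 1.2 documentation): the Thrift server reads a per-session variable of its own rather than spark.scheduler.pool, so in beeline you would run:

~~
SET spark.sql.thriftserver.scheduler.pool=test;
~~

If that variable exists in your build, subsequent statements in the same JDBC session should be submitted to the test pool.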