Re: Need help for Spark-JobServer setup on Maven (for Java programming)

2014-12-30 Thread Sasi
Does my question make sense, or does it require some elaboration?

Sasi






Re: Need help for Spark-JobServer setup on Maven (for Java programming)

2014-12-30 Thread abhishek
Hey,
Why specifically Maven? We set up a Spark JobServer through SBT, which is an easy
way to get the job server up and running.

On 30 Dec 2014 13:32, Sasi [via Apache Spark User List] 
ml-node+s1001560n20896...@n3.nabble.com wrote:

 Does my question make sense or required some elaboration?

 Sasi

 





Re: Need help for Spark-JobServer setup on Maven (for Java programming)

2014-12-30 Thread Sasi
The reason is that we have a Vaadin (Java framework) application which displays
data from a Spark RDD, which in turn gets its data from Cassandra. As we know, we
need to use Maven for building against the Spark API in Java.

We tested the spark-jobserver using SBT and were able to run it. However, for our
requirement, we need to integrate it with Vaadin (Java framework).






Re: Clustering text data with MLlib

2014-12-30 Thread xhudik
K-means really needs the number of clusters identified in advance. There
are multiple algorithms (X-means, ART, ...) which do not need this
information. Unfortunately, none of them is implemented in MLlib at the
moment (you can give a hand and help the community).

Anyway, it seems to me you will not be satisfied with those
algorithms (X-means, ART, ...) either. I understand that what you want to
achieve is a precise number of clusters. Note that whenever you change the input
parameters (random seed, ...), the number of clusters might be different.
Clustering is a great tool, but it won't give you one true answer (one number).
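For illustration, a minimal sketch of the point above using MLlib's KMeans (sc is an
existing SparkContext; the dataset path is just the sample file shipped with Spark and
k is a placeholder value):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("data/mllib/kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val k = 3                              // the number of clusters must be chosen up front
val model = KMeans.train(data, k, 20)  // (data, k, maxIterations)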


regards, Tomas






Re: Need help for Spark-JobServer setup on Maven (for Java programming)

2014-12-30 Thread abhishek
Ohh...
Just curious: we had a similar use case to yours, getting data out of
Cassandra. Since the job server exposes a REST interface, all we need is a URL to
access it. Why does integrating with your framework matter here, when all we
need is a URL?
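As a rough illustration of that point, a minimal sketch of fetching a job result over
HTTP from any JVM application (the port and the /jobs/<id> path are assumptions based
on spark-jobserver's defaults; adjust host, port and job id to your deployment):

import scala.io.Source

object JobResultClient {
  // Fetches the JSON result of a finished job from the job server's REST API.
  def fetchResult(host: String, jobId: String): String = {
    val url = s"http://$host:8090/jobs/$jobId"  // 8090 is the assumed default port
    Source.fromURL(url).mkString
  }

  def main(args: Array[String]): Unit =
    println(fetchResult("localhost", "some-job-id"))
}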

On 30 Dec 2014 14:05, Sasi [via Apache Spark User List] 
ml-node+s1001560n20898...@n3.nabble.com  wrote:

 The reason being, we had Vaadin (Java Framework) application which
displays data from Spark RDD, which in turn gets data from Cassandra. As we
know, we need to use Maven for building Spark API in Java.

 We tested the spark-jobserver using SBT and able to run it. However, for
our requirement, we need to integrate with Vaadin (Java Framework).

 





Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2014-12-30 Thread Mukesh Jha
Thanks Sandy, it was the issue with the number of cores.

Another issue I was facing is that tasks are not getting distributed evenly
among all executors and are running at the NODE_LOCAL locality level, i.e.
all the tasks are running on the same executors where my Kafka receiver(s)
are running, even though other executors are idle.

I configured *spark.locality.wait=50* instead of the default 3000 ms, which
forced task rebalancing among nodes; let me know if there is a better
way to deal with this.
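For reference, a minimal sketch of setting that property programmatically instead of
on the command line (the app name and batch interval are placeholders; 50 is the
millisecond value mentioned above):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-streaming-example")
  .set("spark.locality.wait", "50")  // default is 3000 ms; lower values force faster rebalancing

val ssc = new StreamingContext(conf, Seconds(5))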


On Tue, Dec 30, 2014 at 12:09 AM, Mukesh Jha me.mukesh@gmail.com
wrote:

 Makes sense, I've also tried it in standalone mode where all 3 workers &
 driver were running on the same 8-core box and the results were similar.

 Anyways I will share the results in YARN mode with 8-core YARN containers.

 On Mon, Dec 29, 2014 at 11:58 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 When running in standalone mode, each executor will be able to use all 8
 cores on the box.  When running on YARN, each executor will only have
 access to 2 cores.  So the comparison doesn't seem fair, no?

 -Sandy

 On Mon, Dec 29, 2014 at 10:22 AM, Mukesh Jha me.mukesh@gmail.com
 wrote:

 Nope, I am setting 5 executors with 2 cores each. Below is the command
 that I'm using to submit in YARN mode. This starts up 5 executor nodes and
 a driver, as per the Spark & application master UIs.

 spark-submit --master yarn-cluster --num-executors 5 --driver-memory
 1024m --executor-memory 1024m --executor-cores 2 --class
 com.oracle.ci.CmsgK2H /homext/lib/MJ-ci-k2h.jar vm.cloud.com:2181/kafka
  spark-yarn avro 1 5000

 On Mon, Dec 29, 2014 at 11:45 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 *oops, I mean are you setting --executor-cores to 8

 On Mon, Dec 29, 2014 at 10:15 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Are you setting --num-executors to 8?

 On Mon, Dec 29, 2014 at 10:13 AM, Mukesh Jha me.mukesh@gmail.com
 wrote:

 Sorry Sandy, the command is just for reference, but I can confirm that
 there are 4 executors and a driver, as shown on the Spark UI page.

 Each of these machines is an 8-core box with ~15G of RAM.

 On Mon, Dec 29, 2014 at 11:23 PM, Sandy Ryza sandy.r...@cloudera.com
  wrote:

 Hi Mukesh,

 Based on your spark-submit command, it looks like you're only
 running with 2 executors on YARN.  Also, how many cores does each 
 machine
 have?

 -Sandy

 On Mon, Dec 29, 2014 at 4:36 AM, Mukesh Jha me.mukesh@gmail.com
  wrote:

 Hello Experts,
 I'm bench-marking Spark on YARN (
 https://spark.apache.org/docs/latest/running-on-yarn.html) vs a
 standalone spark cluster (
 https://spark.apache.org/docs/latest/spark-standalone.html).
 I have a standalone cluster with 3 executors, and a spark app
 running on yarn with 4 executors as shown below.

 The Spark job running inside YARN is 10x slower than the one
 running on the standalone cluster (even though YARN has more
 workers); also, in both cases all the executors are in the same
 datacenter, so there shouldn't be any latency. On YARN each 5sec batch is
 reading data from Kafka and processing it in 5sec, whereas on the standalone
 cluster each 5sec batch is getting processed in 0.4sec.
 Also, in YARN mode the executors are not getting used evenly,
 as vm-13 & vm-14 are running most of the tasks, whereas in standalone
 mode all the executors are running tasks.

 Do I need to set up some configuration to evenly distribute the
 tasks? Also do you have any pointers on the reasons the yarn job is 10x
 slower than the standalone job?
 Any suggestion is greatly appreciated, Thanks in advance.

 YARN(5 workers + driver)
 
 Executor ID | Address | RDD Blocks | Memory Used | Disk Used | Active Tasks | Failed Tasks | Complete Tasks | Total Tasks | Task Time | Input | Shuffle Read | Shuffle Write
 1 | vm-18.cloud.com:51796 | 0 | 0.0B/530.3MB | 0.0 B | 1 | 0 | 16 | 17 | 634 ms | 0.0 B | 2047.0 B | 1710.0 B
 2 | vm-13.cloud.com:57264 | 0 | 0.0B/530.3MB | 0.0 B | 0 | 0 | 1427 | 1427 | 5.5 m | 0.0 B | 0.0 B | 0.0 B
 3 | vm-14.cloud.com:54570 | 0 | 0.0B/530.3MB | 0.0 B | 0 | 0 | 1379 | 1379 | 5.2 m | 0.0 B | 1368.0 B | 2.8 KB
 4 | vm-11.cloud.com:56201 | 0 | 0.0B/530.3MB | 0.0 B | 0 | 0 | 10 | 10 | 625 ms | 0.0 B | 1368.0 B | 1026.0 B
 5 | vm-5.cloud.com:42958 | 0 | 0.0B/530.3MB | 0.0 B | 0 | 0 | 22 | 22 | 632 ms | 0.0 B | 1881.0 B | 2.8 KB
 driver | vm.cloud.com:51847 | 0 | 0.0B/530.0MB | 0.0 B | 0 | 0 | 0 | 0 | 0 ms | 0.0 B | 0.0 B | 0.0 B

 /homext/spark/bin/spark-submit
 --master yarn-cluster --num-executors 2 --driver-memory 512m
 --executor-memory 512m --executor-cores 2
 --class com.oracle.ci.CmsgK2H /homext/lib/MJ-ci-k2h.jar
 vm.cloud.com:2181/kafka spark-yarn avro 1 5000

 STANDALONE (3 workers + driver)
 ==============================
 Executor ID | Address | RDD Blocks | Memory Used | Disk Used | Active Tasks | Failed Tasks | Complete Tasks | Total Tasks | Task Time | Input | Shuffle Read | Shuffle Write
 0 | vm-71.cloud.com:55912 | 0 | 0.0B/265.0MB | 0.0 B | 0 | 0 | 1069 | 1069 | 6.0 m | 0.0 B | 1534.0 B | 3.0 KB
 1 | vm-72.cloud.com:40897 | 0 | 0.0B/265.0MB | 0.0 B | 0 | 0 | 1057 | 1057 | 5.9 m | 0.0 B | 1368.0 B | 4.0 KB
 2

Spark SQL implementation error

2014-12-30 Thread sachin Singh
I have a table (CSV file) and loaded its data by creating a POJO as per the table
structure, then created a SchemaRDD as follows:

JavaRDD<Test1> testSchema =
sc.textFile("D:/testTable.csv").map(GetTableData); /* GetTableData transforms all the table data into testTable objects */
JavaSchemaRDD schemaTest = sqlContext.applySchema(testSchema, Test.class);
schemaTest.registerTempTable("testTable");

JavaSchemaRDD sqlQuery = sqlContext.sql("SELECT * FROM testTable");
List<String> totDuration = sqlQuery.map(new Function<Row, String>() {
  public String call(Row row) {
    return "Field1 is: " + row.getInt(0);
  }
}).collect();

This works fine, but if I change the query (the rest of the code is the same) to:

JavaSchemaRDD sqlQuery = sqlContext.sql("SELECT sum(field1) FROM testTable group by field2");

I get this error: Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.spark.rdd.ShuffledRDD.<init>(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/Partitioner;)V

Please help and suggest.






Re: Need help for Spark-JobServer setup on Maven (for Java programming)

2014-12-30 Thread Sasi
Thanks Abhishek. We understand your point and will try using the REST URL.
One concern, however: we currently have around 1 lakh (100,000) rows in our Cassandra
table. Will the REST response be able to handle that result size?






Re: Can we say 1 RDD is generated every batch interval?

2014-12-30 Thread Jahagirdar, Madhu
foreach iterates through the partitions in the RDD and executes the operations
for each partition, I guess.

 On 29-Dec-2014, at 10:19 pm, SamyaMaiti samya.maiti2...@gmail.com wrote:

 Hi All,

 Please clarify.

 Can we say 1 RDD is generated every batch interval?

 If the above is true, then is the foreachRDD() operator executed once & only
 once for each batch processed?

 Regards,
 Sam










Spark SQL insert overwrite table failed.

2014-12-30 Thread Mars Max
While doing a JOIN operation on three tables using Spark 1.1.1, I always got the
following error. However, I never hit this exception in Spark 1.1.0 with the same
operation and the same data. Has anyone else met this problem?

14/12/30 17:49:33 ERROR CliDriver:
org.apache.hadoop.hive.ql.metadata.HiveException: checkPaths:
hdfs://xx.com:20632/tmp/hive-work/hive_2014-12-30_17-46-25_327_2097835982529092412-1/-ext-1
has nested
directoryhdfs://x/tmp/hive-work/hive_2014-12-30_17-46-25_327_2097835982529092412-1/-ext-1/_temporary
at
org.apache.hadoop.hive.ql.metadata.Hive.checkPaths(Hive.java:2081)
at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:)
at
org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1224)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.result$lzycompute(InsertIntoHiveTable.scala:238)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.result(InsertIntoHiveTable.scala:173)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.execute(InsertIntoHiveTable.scala:164)
at
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:382)
at
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382)
at
org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:103)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:58)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:329)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)







Re: Can we say 1 RDD is generated every batch interval?

2014-12-30 Thread Sean Owen
The DStream model is one RDD of data per interval, yes. foreachRDD
performs an operation on each RDD in the stream, which means it is
executed once* for the one RDD in each interval.

* ignoring the possibility here of failure and retry of course
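For illustration, a minimal sketch of that model (sc is an existing SparkContext; the
socket source, host and port are placeholders, and any input DStream behaves the same way):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)

lines.foreachRDD { rdd =>
  // this body runs once per batch interval, on the driver, for that interval's RDD
  println(s"This batch's RDD has ${rdd.count()} records")
}

ssc.start()
ssc.awaitTermination()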

On Mon, Dec 29, 2014 at 4:49 PM, SamyaMaiti samya.maiti2...@gmail.com wrote:
 Hi All,

 Please clarify.

 Can we say 1 RDD is generated every batch interval?

 If the above is true, then is the foreachRDD() operator executed once & only
 once for each batch processed?

 Regards,
 Sam






Writing and reading sequence file results in trailing extra data

2014-12-30 Thread Enno Shioji
Hi, I'm facing a weird issue. Any help appreciated.

When I execute the below code and compare input and output, each record
in the output has some extra trailing data appended to it, and hence
corrupted. I'm just reading and writing, so the input and output should be
exactly the same.

I'm using spark-core 1.2.0_2.10 and the Hadoop bundled in it
(hadoop-common: 2.2.0, hadoop-core: 1.2.1). I also confirmed the binary is
fine at the time it's passed to Hadoop classes, and has already the extra
data when in Hadoop classes (I guess this makes it more of a Hadoop
question...).

Code:
=
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("Simple Application")

    val sc = new SparkContext(conf)

    // input.txt is a text file with some Base64 encoded binaries stored as
    // lines

    val src = sc
      .textFile("input.txt")
      .map(DatatypeConverter.parseBase64Binary)
      .map(x => (NullWritable.get(), new BytesWritable(x)))
      .saveAsSequenceFile("s3n://fake-test/stored")

    val file = "s3n://fake-test/stored"
    val logData = sc.sequenceFile(file, classOf[NullWritable],
      classOf[BytesWritable])

    val count = logData
      .map { case (k, v) => v }
      .map(x => DatatypeConverter.printBase64Binary(x.getBytes))
      .saveAsTextFile("/tmp/output")

  }



Re: Can we say 1 RDD is generated every batch interval?

2014-12-30 Thread Maiti, Samya
Thanks Sean.

That was helpful.

Regards,
Sam
On Dec 30, 2014, at 4:12 PM, Sean Owen so...@cloudera.com wrote:

 The DStream model is one RDD of data per interval, yes. foreachRDD
 performs an operation on each RDD in the stream, which means it is
 executed once* for the one RDD in each interval.

 * ignoring the possibility here of failure and retry of course

 On Mon, Dec 29, 2014 at 4:49 PM, SamyaMaiti samya.maiti2...@gmail.com wrote:
 Hi All,

 Please clarify.

 Can we say 1 RDD is generated every batch interval?

 If the above is true, then is the foreachRDD() operator executed once & only
 once for each batch processed?

 Regards,
 Sam










[SOLVED] Re: Writing and reading sequence file results in trailing extra data

2014-12-30 Thread Enno Shioji
This poor soul had the exact same problem and solution:

http://stackoverflow.com/questions/24083332/write-and-read-raw-byte-arrays-in-spark-using-sequence-file-sequencefile
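A minimal sketch of the fix described in that answer, reusing the names from the code
quoted below: BytesWritable.getBytes returns the (possibly padded) backing array, so
only the first getLength bytes are valid.

val logData = sc.sequenceFile(file, classOf[NullWritable], classOf[BytesWritable])

logData
  .map { case (_, v) => java.util.Arrays.copyOfRange(v.getBytes, 0, v.getLength) }
  .map(bytes => DatatypeConverter.printBase64Binary(bytes))
  .saveAsTextFile("/tmp/output")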


On Tue, Dec 30, 2014 at 10:58 AM, Enno Shioji eshi...@gmail.com wrote:

 Hi, I'm facing a weird issue. Any help appreciated.

 When I execute the below code and compare input and output, each
 record in the output has some extra trailing data appended to it, and hence
 corrupted. I'm just reading and writing, so the input and output should be
 exactly the same.

 I'm using spark-core 1.2.0_2.10 and the Hadoop bundled in it
 (hadoop-common: 2.2.0, hadoop-core: 1.2.1). I also confirmed the binary is
 fine at the time it's passed to Hadoop classes, and has already the extra
 data when in Hadoop classes (I guess this makes it more of a Hadoop
 question...).

 Code:
 =
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("Simple Application")

    val sc = new SparkContext(conf)

    // input.txt is a text file with some Base64 encoded binaries stored as
    // lines

    val src = sc
      .textFile("input.txt")
      .map(DatatypeConverter.parseBase64Binary)
      .map(x => (NullWritable.get(), new BytesWritable(x)))
      .saveAsSequenceFile("s3n://fake-test/stored")

    val file = "s3n://fake-test/stored"
    val logData = sc.sequenceFile(file, classOf[NullWritable],
      classOf[BytesWritable])

    val count = logData
      .map { case (k, v) => v }
      .map(x => DatatypeConverter.printBase64Binary(x.getBytes))
      .saveAsTextFile("/tmp/output")

  }




Re: Need help for Spark-JobServer setup on Maven (for Java programming)

2014-12-30 Thread abhishek
Frankly speaking, I have never tried this volume in practice, but I believe it
should work.
On 30 Dec 2014 15:26, Sasi [via Apache Spark User List] 
ml-node+s1001560n20902...@n3.nabble.com wrote:

 Thanks Abhishek. We understand your point and will try using REST URL.
 However one concern, we had around 1 lakh rows in our Cassandra table
 presently. Will REST URL result can withstand the response size?







Re: building spark1.2 meet error

2014-12-30 Thread xhudik
Hi,
well, Spark 1.2 was prepared for Scala 2.10. If you want a stable and fully
functional tool, I'd compile it with this default compiler.

*I was able to compile Spark 1.2 with Java 7 and Scala 2.10 seamlessly.*

I also tried Java 8 and Scala 2.11 (no -Dscala.usejavacp=true), but I failed
due to some other problem:

/mvn -Pyarn -Phadoop-2.5 -Dhadoop.version=2.5.0 -Dscala-2.11 -X -DskipTests
clean package 
[INFO]

[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM ... SUCCESS [ 14.453
s]
[INFO] Spark Project Core . SUCCESS [ 47.508
s]
[INFO] Spark Project Bagel  SUCCESS [  3.646
s]
[INFO] Spark Project GraphX ... SUCCESS [  5.533
s]
[INFO] Spark Project ML Library ... SUCCESS [ 12.715
s]
[INFO] Spark Project Tools  SUCCESS [  1.854
s]
[INFO] Spark Project Networking ... SUCCESS [  6.580
s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  5.290
s]
[INFO] Spark Project Streaming  SUCCESS [ 10.846
s]
[INFO] Spark Project Catalyst . SUCCESS [  8.296
s]
[INFO] Spark Project SQL .. SUCCESS [ 12.921
s]
[INFO] Spark Project Hive . SUCCESS [ 28.931
s]
[INFO] Spark Project Assembly . FAILURE [01:09
min]
[INFO] Spark Project External Twitter . SKIPPED
[INFO] Spark Project External Flume ... SKIPPED
[INFO] Spark Project External Flume Sink .. SKIPPED
[INFO] Spark Project External MQTT  SKIPPED
[INFO] Spark Project External ZeroMQ .. SKIPPED
[INFO] Spark Project Examples . SKIPPED
[INFO] Spark Project REPL . SKIPPED
[INFO] Spark Project YARN Parent POM .. SKIPPED
[INFO] Spark Project YARN Stable API .. SKIPPED
[INFO] Spark Project YARN Shuffle Service . SKIPPED
[INFO]

[INFO] BUILD FAILURE
[INFO]

[INFO] Total time: 03:49 min
[INFO] Finished at: 2014-12-30T12:41:59+01:00
[INFO] Final Memory: 59M/417M
[INFO]

[WARNING] The requested profile hadoop-2.5 could not be activated because
it does not exist.
[ERROR] Failed to execute goal on project spark-assembly_2.10: Could not
resolve dependencies for project
org.apache.spark:spark-assembly_2.10:pom:1.2.0: The following artifacts
could not be resolved: org.apache.spark:spark-repl_2.11:jar:1.2.0,
org.apache.spark:spark-yarn_2.11:jar:1.2.0: Could not find artifact
org.apache.spark:spark-repl_2.11:jar:1.2.0 in central
(https://repo1.maven.org/maven2) - [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
goal on project spark-assembly_2.10: Could not resolve dependencies for
project org.apache.spark:spark-assembly_2.10:pom:1.2.0: The following
artifacts could not be resolved: org.apache.spark:spark-repl_2.11:jar:1.2.0,
org.apache.spark:spark-yarn_2.11:jar:1.2.0: Could not find artifact
org.apache.spark:spark-repl_2.11:jar:1.2.0 in central
(https://repo1.maven.org/maven2)
at
org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.getDependencies(LifecycleDependencyResolver.java:220)
at
org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.resolveProjectDependencies(LifecycleDependencyResolver.java:127)
at
org.apache.maven.lifecycle.internal.MojoExecutor.ensureDependenciesAreResolved(MojoExecutor.java:257)
at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:200)
at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:120)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:347)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:154)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:584)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:213)
at 

How to collect() each partition in scala ?

2014-12-30 Thread DEVAN M.S.
Hi all,
I have one large data-set. When I get the number of partitions, it shows 43.
We can't collect() the whole data-set into memory, so I am thinking of
collect()-ing each partition separately, so that each piece will be small in size.

Any thoughts ?


Re: Need help for Spark-JobServer setup on Maven (for Java programming)

2014-12-30 Thread Sasi
Thanks Abhishek. We are good now, with an answer to try.






Re: How to collect() each partition in scala ?

2014-12-30 Thread Sean Owen
collect()-ing a partition still implies copying it to the driver, but
you're suggesting you can't collect() the whole data set to the
driver. What do you mean: collect() 1 partition? or collect() some
smaller result from each partition?

On Tue, Dec 30, 2014 at 11:54 AM, DEVAN M.S. msdeva...@gmail.com wrote:
 Hi all,
 i have one large data-set. when i am getting the number of partitions its
 showing 43.
 We can't collect() the large data-set in to  memory so i am thinking like
 this, collect() each partitions so that it will be small in size.

 Any thoughts ?





SparkContext with error from PySpark

2014-12-30 Thread Jaggu
Hi Team,

I was trying to execute PySpark code on a cluster. It gives me the following
error. (When I run the same job locally it works fine, too :-()

Error:

Error from python worker:
  /usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/context.py:209: Warning:
'with' will become a reserved keyword in Python 2.6
  Traceback (most recent call last):
File
/home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/runpy.py,
line 85, in run_module
  loader = get_loader(mod_name)
File
/home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py,
line 456, in get_loader
  return find_loader(fullname)
File
/home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py,
line 466, in find_loader
  for importer in iter_importers(fullname):
File
/home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py,
line 422, in iter_importers
  __import__(pkg)
File /usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/__init__.py,
line 41, in module
  from pyspark.context import SparkContext
File /usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/context.py,
line 209
  with SparkContext._lock:
  ^
  SyntaxError: invalid syntax
PYTHONPATH was:
 
/usr/lib/spark-1.2.0-bin-hadoop2.3/python:/usr/lib/spark-1.2.0-bin-hadoop2.3/python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark-1.2.0-bin-hadoop2.3/lib/spark-assembly-1.2.0-hadoop2.3.0.jar:/usr/lib/spark-1.2.0-bin-hadoop2.3/sbin/../python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark-1.2.0-bin-hadoop2.3/sbin/../python:/home/beehive/bin/utils/primitives:/home/beehive/bin/utils/pylogger:/home/beehive/bin/utils/asterScript:/home/beehive/bin/lib:/home/beehive/bin/utils/init:/home/beehive/installer/packages:/home/beehive/ncli
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
at
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:102)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)

14/12/31 04:49:58 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID
1, aster4, NODE_LOCAL, 1321 bytes)
14/12/31 04:49:58 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory
on aster4:43309 (size: 3.8 KB, free: 265.0 MB)
14/12/31 04:49:59 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on
executor aster4: org.apache.spark.SparkException (


Any clue how to resolve the same.

Best regards

Jagan






Re: SparkContext with error from PySpark

2014-12-30 Thread Eric Friedman
The Python installed in your cluster is 2.5. You need at least 2.6. 


Eric Friedman

 On Dec 30, 2014, at 7:45 AM, Jaggu jagana...@gmail.com wrote:
 
 Hi Team,
 
 I was trying to execute a Pyspark code in cluster. It gives me the following
 error. (Wne I run the same job in local it is working fine too :-()
 
 Eoor
 
 Error from python worker:
  /usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/context.py:209: Warning:
 'with' will become a reserved keyword in Python 2.6
  Traceback (most recent call last):
File
 /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/runpy.py,
 line 85, in run_module
  loader = get_loader(mod_name)
File
 /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py,
 line 456, in get_loader
  return find_loader(fullname)
File
 /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py,
 line 466, in find_loader
  for importer in iter_importers(fullname):
File
 /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py,
 line 422, in iter_importers
  __import__(pkg)
File /usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/__init__.py,
 line 41, in module
  from pyspark.context import SparkContext
File /usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/context.py,
 line 209
  with SparkContext._lock:
  ^
  SyntaxError: invalid syntax
 PYTHONPATH was:
 
 /usr/lib/spark-1.2.0-bin-hadoop2.3/python:/usr/lib/spark-1.2.0-bin-hadoop2.3/python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark-1.2.0-bin-hadoop2.3/lib/spark-assembly-1.2.0-hadoop2.3.0.jar:/usr/lib/spark-1.2.0-bin-hadoop2.3/sbin/../python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark-1.2.0-bin-hadoop2.3/sbin/../python:/home/beehive/bin/utils/primitives:/home/beehive/bin/utils/pylogger:/home/beehive/bin/utils/asterScript:/home/beehive/bin/lib:/home/beehive/bin/utils/init:/home/beehive/installer/packages:/home/beehive/ncli
 java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at
 org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
at
 org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at
 org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:102)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
 
 14/12/31 04:49:58 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID
 1, aster4, NODE_LOCAL, 1321 bytes)
 14/12/31 04:49:58 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory
 on aster4:43309 (size: 3.8 KB, free: 265.0 MB)
 14/12/31 04:49:59 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on
 executor aster4: org.apache.spark.SparkException (
 
 
 Any clue how to resolve the same.
 
 Best regards
 
 Jagan
 
 
 



Is it possible to store graph directly into HDFS?

2014-12-30 Thread Jason Hong
Dear all:)

We're trying to build a graph from large input data and get a subgraph
by applying some filter.

Now we want to save this graph to HDFS so that we can load it later.

Is it possible to store a graph or subgraph directly into HDFS and load it as
a graph for future use?

We will be glad for your suggestion.

Best regards.

Jason Hong










Re: Is it possible to store graph directly into HDFS?

2014-12-30 Thread Xuefeng Wu
How about saving it as object files?
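A minimal sketch of that idea, assuming a GraphX graph (the attribute types Int/Int
and the paths are placeholders): persist the vertex and edge RDDs as object files,
then rebuild the graph when needed.

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}

def saveGraph(graph: Graph[Int, Int], path: String): Unit = {
  graph.vertices.saveAsObjectFile(path + "/vertices")  // RDD[(VertexId, Int)]
  graph.edges.saveAsObjectFile(path + "/edges")        // RDD[Edge[Int]]
}

def loadGraph(sc: SparkContext, path: String): Graph[Int, Int] = {
  val vertices = sc.objectFile[(VertexId, Int)](path + "/vertices")
  val edges    = sc.objectFile[Edge[Int]](path + "/edges")
  Graph(vertices, edges)
}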


Yours, Xuefeng Wu 吴雪峰 敬上

 On 2014年12月30日, at 下午9:27, Jason Hong begger3...@gmail.com wrote:
 
 Dear all:)
 
 We're trying to make a graph using large input data and get a subgraph
 applied some filter.
 
 Now, we wanna save this graph to HDFS so that we can load later.
 
 Is it possible to store graph or subgraph directly into HDFS and load it as
 a graph for future use?
 
 We will be glad for your suggestion.
 
 Best regards.
 
 Jason Hong
 
 
 
 
 
 
 



Spark Standalone Cluster not correctly configured

2014-12-30 Thread frodo777
Hi.

I'm trying to configure a spark standalone cluster, with three master nodes
(bigdata1, bigdata2 and bigdata3) managed by Zookeeper.

It seems there's a configuration problem, since every master claims to be the
cluster leader:

 .
 14/12/30 13:54:59 INFO Master: I have been elected leader! New state:
ALIVE

The message above is dumped by every master I start.


Zookeeper is configured identically in all of them, as follows:

dataDir=/spark


The only difference is the myid file in the /spark directory, of course.


The masters are started using the following configuration:
.
export SPARK_DAEMON_JAVA_OPTS= \
-Dspark.deploy.recoverymode=ZOOKEEPER \
-Dspark.deploy.zookeeper.url=bigdata1:2181,bigdata2:2181,bigdata3:2181

I'm not setting the spark.deploy.zookeeper.dir variable, since I'm using the
default value, /spark, configured in zookeeper, as I mentioned before.

I would like to know if there is anything else I have to configure in
order to make the masters behave correctly (only one master node active
at a time).

With the current situation, I can connect workers and applications to the
whole cluster, for instance, I can connect a worker to the cluster using:

spark-class org.apache.spark.deploy.worker.Worker
spark://bigdata1:2181,bigdata2:2181,bigdata3:2181


But the worker gets registered to each of the masters independently.
If I stop one of the masters, it tries to re-register to it.
The notion of active-master is completely lost.

Do you have any idea?

Thanks a lot.
-Bob






Re: How to collect() each partition in scala ?

2014-12-30 Thread Cody Koeninger
I'm not sure exactly what you're trying to do, but take a look at
rdd.toLocalIterator if you haven't already.
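For illustration, a minimal sketch of that approach (sc is an existing SparkContext and
the path is a placeholder): toLocalIterator streams the data to the driver one partition
at a time, so the driver only has to hold a single partition in memory at once.

val bigRdd = sc.textFile("hdfs:///some/large/dataset")

bigRdd.toLocalIterator.foreach { line =>
  // process each record on the driver without collect()-ing the whole data set
  println(line)
}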

On Tue, Dec 30, 2014 at 6:16 AM, Sean Owen so...@cloudera.com wrote:

 collect()-ing a partition still implies copying it to the driver, but
 you're suggesting you can't collect() the whole data set to the
 driver. What do you mean: collect() 1 partition? or collect() some
 smaller result from each partition?

 On Tue, Dec 30, 2014 at 11:54 AM, DEVAN M.S. msdeva...@gmail.com wrote:
  Hi all,
  i have one large data-set. when i am getting the number of partitions its
  showing 43.
  We can't collect() the large data-set in to  memory so i am thinking like
  this, collect() each partitions so that it will be small in size.
 
  Any thoughts ?
 





Re: Anaconda iPython notebook working with CDH Spark

2014-12-30 Thread Sebastián Ramírez
Some time ago I did the (2) approach, I installed Anaconda on every node.

But to avoid screwing RedHat (it was CentOS in my case, which is the same)
I installed Anaconda on every node using the user yarn and made it the
default python only for that user.

After you install it, Anaconda asks if it should add its installation path
to the PATH variable in .bashrc for your user (that's the way it overrides
the default Python). If you choose yes, it will override it only for the
current user. And if that user is yarn, you can run Spark in cluster
mode, on all the nodes in your cluster, using IPython (a lot better than
the default Python console).

Just in case, you have to check that you have a directory in your HDFS for
yarn (/user/yarn); it may not be created by default, and that would
complicate everything, preventing your Spark jobs from running.

In summary, something like (correct the syntax if it's wrong, I'm not
testing it):

# Create yarn directory in HDFS
su hdfs
hadoop fs -mkdir /user/yarn
hadoop fs -chown yarn:yarn /user/yarn
exit

# Install Anaconda for user yarn
# In every node:
su yarn
cd
wget
http://09c8d0b2229f813c1b93-c95ac804525aac4b6dba79b00b39d1d3.r79.cf1.rackcdn.com/Anaconda-2.1.0-Linux-x86_64.sh
# Or the current link for the moment you are doing it:
https://store.continuum.io/cshop/anaconda/
bash Anaconda*.sh
# When asked if set it as the default Python, or to add Anaconda to the
PATH (I don't remember how they say it), choose yes


I hope that helps,


*Sebastián Ramírez*
Diseñador de Algoritmos

 http://www.senseta.com

 Tel: (+571) 795 7950 ext: 1012
 Cel: (+57) 300 370 77 10
 Calle 73 No 7 - 06  Piso 4
 Linkedin: co.linkedin.com/in/tiangolo/
 Twitter: @tiangolo https://twitter.com/tiangolo
 Email: sebastian.rami...@senseta.com
 www.senseta.com

On Sun, Dec 28, 2014 at 1:57 PM, Bin Wang binwang...@gmail.com wrote:

 Hi there,

 I have a cluster with CDH5.1 running on top of Redhat6.5, where the
 default Python version is 2.6. I am trying to set up a proper iPython
 notebook environment to develop spark application using pyspark.

 Here
 http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/
 is a tutorial that I have been following. However, it turned out that the
 author was using iPython1 where we have the latest Anaconda Python2.7
 installed on our name node. When I finished following the tutorial, I can
 connect to the spark cluster but whenever I tried to distribute the work,
 it will errorred out and google tells me it is the difference between the
 version of Python across the cluster.

 Here are a few thoughts that I am planning to try.
 (1) remove the Anaconda Python from the namenode and install the iPython
 version that is compatible with Python2.6.
 (2) or I need to install Anaconda Python on every node and make it the
 default Python version across the whole cluster (however, I am not sure if
 this plan will totally screw up the existing environment since some running
 services are built by Python2.6...)

 Let me which should be the proper way to set up an iPython notebook
 environment.

 Best regards,

 Bin




Re: SchemaRDD to RDD[String]

2014-12-30 Thread Yana
Does your debug println show values? I.e., what would you see if in rowToString
you output println("row to string " + row + " " + sub)?

Another thing to check would be to do schemaRDD.take(3) or something to
make sure you actually have data.

You can also try this: rowToString(schemaRDD.first, list) and see if you get
anything.






Host Error on EC2 while accessing hdfs from stadalone

2014-12-30 Thread Laeeq Ahmed
Hi,

I am using Spark standalone on EC2. I can access the ephemeral HDFS from the
spark-shell interface, but I can't access HDFS from a standalone application. I am
using Spark 1.2.0 with Hadoop 2.4.0 and launched the cluster from the ec2 folder on
my local machine. In my pom file I have given the hadoop client as 2.4.0:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.4.0</version>
</dependency>

The error is as follows:

java.io.IOException: Failed on local exception: java.io.EOFException; Host
Details : local host is: ip-10-40-121-200/10.40.121.200; destination host is:
ec2-23-21-113-136.compute-1.amazonaws.com:9000;

Regards,
Laeeq

Spark 1.2 and Mesos 0.21.0 spark.executor.uri issue?

2014-12-30 Thread Denny Lee
I've been working with Spark 1.2 and Mesos 0.21.0 and while I have set the
spark.executor.uri within spark-env.sh (and directly within bash as well),
the Mesos slaves do not seem to be able to access the spark tgz file via
HTTP or HDFS as per the message below.


14/12/30 15:57:35 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala 14/12/30 15:57:38 INFO CoarseMesosSchedulerBackend: Mesos task 0 is
now TASK_FAILED
14/12/30 15:57:38 INFO CoarseMesosSchedulerBackend: Mesos task 1 is now
TASK_FAILED
14/12/30 15:57:39 INFO CoarseMesosSchedulerBackend: Mesos task 2 is now
TASK_FAILED
14/12/30 15:57:41 INFO CoarseMesosSchedulerBackend: Mesos task 3 is now
TASK_FAILED
14/12/30 15:57:41 INFO CoarseMesosSchedulerBackend: Blacklisting Mesos
slave value: 20141228-183059-3045950474-5050-2788-S1
 due to too many failures; is Spark installed on it?


I've verified that the Mesos slaves can access both the HTTP and HDFS
locations.  I'll start digging into the Mesos logs but was wondering if
anyone had run into this issue before.  I was able to get this to run
successfully on Spark 1.1 on GCP - my current environment that I'm
experimenting with is Digital Ocean - perhaps this is in play?

Thanks!
Denny


Re: Host Error on EC2 while accessing hdfs from stadalone

2014-12-30 Thread Aniket Bhatnagar
Did you check firewall rules in security groups?

On Tue, Dec 30, 2014, 9:34 PM Laeeq Ahmed laeeqsp...@yahoo.com.invalid
wrote:

 Hi,

 I am using spark standalone on EC2. I can access ephemeral hdfs from
 spark-shell interface but I can't access hdfs in standalone application. I
 am using spark 1.2.0 with hadoop 2.4.0 and launched cluster from ec2 folder
 from my local machine. In my pom file I have given hadoop client as 2.4.0.
  <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.4.0</version>
  </dependency>

 The error is as fallows:

  *java.io.IOException: Failed on local exception: java.io.EOFException; Host
  Details : local host is: ip-10-40-121-200/10.40.121.200; destination host is:
  ec2-23-21-113-136.compute-1.amazonaws.com:9000;*

 Regards,
 Laeeq




Re: Mapping directory structure to columns in SparkSQL

2014-12-30 Thread Michael Davies
Hi Michael, 

I’ve looked through the example and the test cases and I think I understand 
what we need to do - so I’ll give it a go. 

I think what I’d like to try is to allow files to be added at any time, so
perhaps I can cache partition info. What may also be useful for us would be
to derive the schema from the set of all files; hopefully this is achievable as well.

Thanks

Mick
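For illustration, a rough sketch of deriving those extra columns by hand with the
Spark 1.2 SchemaRDD API (the HDFS paths, table names and column names here are
purely illustrative assumptions):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Read one directory per (day, region), tag the rows with columns derived from
// the directory names, then union the pieces together.
def load(day: String, region: String) = {
  val table = "t_" + day.replace("-", "_") + "_" + region
  val part = sqlContext.parquetFile(s"hdfs:///data/$day/$region")
  part.registerTempTable(table)
  sqlContext.sql(s"SELECT *, '$day' AS day, '$region' AS region FROM $table")
}

val all = load("2014-12-29", "Americas").unionAll(load("2014-12-29", "Asia"))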


 On 30 Dec 2014, at 04:49, Michael Armbrust mich...@databricks.com wrote:
 
 You can't do this now without writing a bunch of custom logic (see here for 
 an example: 
  https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala)
 
 I would like to make this easier as part of improvements to the datasources 
 api that we are planning for Spark 1.3
 
  On Mon, Dec 29, 2014 at 2:19 AM, Mickalas michael.belldav...@gmail.com wrote:
 I see that there is already a request to add wildcard support to the
 SQLContext.parquetFile function
  https://issues.apache.org/jira/browse/SPARK-3928.
 
 What seems like a useful thing for our use case is to associate the
 directory structure with certain columns in the table, but it does not seem
 like this is supported.
 
 For example we want to create parquet files on a daily basis associated with
 geographic regions and so will create a set of files under directories such
 as:
 
 * 2014-12-29/Americas
 * 2014-12-29/Asia
 * 2014-12-30/Americas
 * ...
 
 Where queries have predicates that match the column values determinable from
 directory structure it would be good to only extract data from matching
 files.
 
 Does anyone know if something like this is supported, or whether this is a
 reasonable thing to request?
 
 Mick
 
 
 
 
 
 
 
 
 
 



Shuffle Problems in 1.2.0

2014-12-30 Thread Sven Krasser
Hey all,

Since upgrading to 1.2.0 a pyspark job that worked fine in 1.1.1 fails
during shuffle. I've tried reverting from the sort-based shuffle back to
the hash one, and that fails as well. Does anyone see similar problems or
has an idea on where to look next?

For the sort-based shuffle I get a bunch of exception like this in the
executor logs:

2014-12-30 03:13:04,061 ERROR [Executor task launch worker-2]
executor.Executor (Logging.scala:logError(96)) - Exception in task
4523.0 in stage 1.0 (TID 4524)
org.apache.spark.SparkException: PairwiseRDD: unexpected value:
List([B@130dc7ad)
at 
org.apache.spark.api.python.PairwiseRDD$$anonfun$compute$2.apply(PythonRDD.scala:307)
at 
org.apache.spark.api.python.PairwiseRDD$$anonfun$compute$2.apply(PythonRDD.scala:305)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

For the hash-based shuffle, there are now a bunch of these exceptions
in the logs:


2014-12-30 04:14:01,688 ERROR [Executor task launch worker-0]
executor.Executor (Logging.scala:logError(96)) - Exception in task
4479.0 in stage 1.0 (TID 4480)
java.io.FileNotFoundException:
/mnt/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1419905501183_0004/spark-local-20141230035728-8fc0/23/merged_shuffle_1_68_0
(No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.init(FileOutputStream.java:221)
at 
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
at 
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
at 
org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
at 
org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Thank you!
-Sven



-- 
http://sites.google.com/site/krasser/?utm_source=sig


Re: Mllib native netlib-java/OpenBLAS

2014-12-30 Thread xhudik
I'm half-way there.
So far I have:
1. compiled and installed the OpenBLAS library
2. ln -s libopenblas_sandybridgep-r0.2.13.so /usr/lib/libblas.so.3
3. compiled and built Spark:
mvn -Pnetlib-lgpl -DskipTests clean compile package

So far so good. Then I ran into problems when testing the solution:
bin/run-example mllib.LinearRegression data/mllib/sample_libsvm_data.txt

14/12/30 18:39:57 INFO BlockManagerMaster: Registered BlockManager
14/12/30 18:39:58 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/12/30 18:39:58 WARN LoadSnappy: Snappy native library not loaded
Training: 80, test: 20.
/usr/local/lib/jdk1.8.0//bin/java: symbol lookup error:
/tmp/jniloader1826801168744171087netlib-native_system-linux-x86_64.so:
undefined symbol: cblas_dscal


I created an issue report:
https://issues.apache.org/jira/browse/SPARK-5010

any help is deeply appreciated, Tomas







Re: S3 files , Spark job hungsup

2014-12-30 Thread Sven Krasser
This here may also be of help:
http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html.
Make sure to spread your objects across multiple partitions to not be rate
limited by S3.
-Sven

On Mon, Dec 22, 2014 at 10:20 AM, durga katakam durgak...@gmail.com wrote:

 Yes . I am reading thousands of files every hours. Is there any way I can
 tell spark to timeout.
 Thanks for your help.

 -D

 On Mon, Dec 22, 2014 at 4:57 AM, Shuai Zheng szheng.c...@gmail.com
 wrote:

 Is it possible too many connections open to read from s3 from one node? I
 have this issue before because I open a few hundreds of files on s3 to read
 from one node. It just block itself without error until timeout later.

 On Monday, December 22, 2014, durga durgak...@gmail.com wrote:

 Hi All,

 I am facing a strange issue sporadically. occasionally my spark job is
 hungup on reading s3 files. It is not throwing exception . or making some
 progress, it is just hungs up there.

 Is this a known issue , Please let me know how could I solve this issue.

 Thanks,
 -D



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/S3-files-Spark-job-hungsup-tp20806.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





-- 
http://sites.google.com/site/krasser/?utm_source=sig


Trying to make spark-jobserver work with yarn

2014-12-30 Thread Fernando O.
Hi all,
I'm investigating spark for a new project and I'm trying to use
spark-jobserver because... I need to reuse and share RDDs and from what I
read in the forum that's the standard :D

Turns out that spark-jobserver doesn't seem to work on yarn, or at least it
does not on 1.1.1

My config is spark 1.1.1 (moving to 1.2.0 soon), hadoop 2.6 (which seems
compatible with 2.4 from spark point of view... at least I was able to run
spark-submit and shell tasks both in yarn-client and yarn-cluster modes)




Going back to my original point: I made some changes to spark-jobserver and to how I
submit a job, but I get:


[2014-12-30 18:20:19,769] INFO  e.spark.deploy.yarn.Client []
[akka://JobServer/user/context-supervisor/f983d86e-spark.jobserver.WordCountExample]
- Max mem capabililty of a single resource in this cluster 15000
[2014-12-30 18:20:19,770] INFO  e.spark.deploy.yarn.Client []
[akka://JobServer/user/context-supervisor/f983d86e-spark.jobserver.WordCountExample]
- Preparing Local resources
[2014-12-30 18:20:20,041] INFO  e.spark.deploy.yarn.Client []
[akka://JobServer/user/context-supervisor/f983d86e-spark.jobserver.WordCountExample]
- Prepared Local resources Map(__spark__.jar - resource { scheme: file
port: -1 file:
/home/ec2-user/.ivy2/cache/org.apache.spark/spark-yarn_2.10/jars/spark-yarn_2.10-1.1.1.jar
} size: 343226 timestamp: 1416429031000 type: FILE visibility: PRIVATE)

[...]

[2014-12-30 18:20:20,139] INFO  e.spark.deploy.yarn.Client []
[akka://JobServer/user/context-supervisor/f983d86e-spark.jobserver.WordCountExample]
- Yarn AM launch context:
[2014-12-30 18:20:20,140] INFO  e.spark.deploy.yarn.Client []
[akka://JobServer/user/context-supervisor/f983d86e-spark.jobserver.WordCountExample]
-   class:   org.apache.spark.deploy.yarn.ExecutorLauncher
[2014-12-30 18:20:20,140] INFO  e.spark.deploy.yarn.Client []
[akka://JobServer/user/context-supervisor/f983d86e-spark.jobserver.WordCountExample]
-   env: Map(CLASSPATH -
$PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*:$PWD/__app__.jar:$PWD/*,
SPARK_YARN_CACHE_FILES_FILE_SIZES - 343226, SPARK_YARN_STAGING_DIR -
.sparkStaging/application_1419963137232_0001/,
SPARK_YARN_CACHE_FILES_VISIBILITIES - PRIVATE, SPARK_USER - ec2-user,
SPARK_YARN_MODE - true, SPARK_YARN_CACHE_FILES_TIME_STAMPS -
1416429031000, SPARK_YARN_CACHE_FILES -
file:/home/ec2-user/.ivy2/cache/org.apache.spark/spark-yarn_2.10/jars/spark-yarn_2.10-1.1.1.jar#__spark__.jar)

[...]

[2014-12-30 18:03:04,474] INFO  YarnClientSchedulerBackend []
[akka://JobServer/user/context-supervisor/ebac0153-spark.jobserver.WordCountExample]
- Application report from ASM:
 appMasterRpcPort: -1
 appStartTime: 1419962580444
 yarnAppState: FAILED

[2014-12-30 18:03:04,475] ERROR .jobserver.JobManagerActor []
[akka://JobServer/user/context-supervisor/ebac0153-spark.jobserver.WordCountExample]
- Failed to create context ebac0153-spark.jobserver.WordCountExample,
shutting down actor
org.apache.spark.SparkException: Yarn application already ended,might be
killed or not able to launch application master.
at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApp(YarnClientSchedulerBackend.scala:117)
at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:93)



In the Hadoop console I can see the detailed issue:

Diagnostics: File
file:/home/ec2-user/.ivy2/cache/org.apache.spark/spark-yarn_2.10/jars/spark-yarn_2.10-1.1.1.jar
does not exist
java.io.FileNotFoundException: File
file:/home/ec2-user/.ivy2/cache/org.apache.spark/spark-yarn_2.10/jars/spark-yarn_2.10-1.1.1.jar
does not exist

now... it seems like Spark is trying to use, on the other nodes, a file that only exists
on the machine I launched from

Can anyone point me in the right direction of where that might be being set?
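
One possible direction (only a sketch, not a verified fix): the jar that YARN distributes as
__spark__.jar can be pinned explicitly through the spark.yarn.jar property instead of letting
it default to whatever happens to be on the launcher's local classpath. The HDFS path below is
an assumption for illustration only:

import org.apache.spark.SparkConf

// point YARN at an assembly jar that every node can reach, e.g. one uploaded to HDFS
val conf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName("jobserver-context")   // name is illustrative
  .set("spark.yarn.jar", "hdfs:///user/spark/share/lib/spark-assembly-1.1.1-hadoop2.4.0.jar")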


Re: Cached RDD

2014-12-30 Thread Rishi Yadav
Without caching, each action is recomputed. So assuming rdd2 and rdd3
result in separate actions answer is yes.
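
A minimal sketch of the difference (input path and transformations are placeholders):

val rdd1 = sc.textFile("hdfs:///path/to/input")   // placeholder path
  .map(_.toLowerCase)
  .cache()                                        // or persist(StorageLevel.MEMORY_AND_DISK)

val rdd2 = rdd1.groupBy(_.length)
val rdd3 = rdd1.groupBy(_.headOption)

rdd2.count()   // first action computes rdd1 and caches it
rdd3.count()   // second action reuses the cached rdd1 instead of recomputing its lineage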

On Mon, Dec 29, 2014 at 7:53 PM, Corey Nolet cjno...@gmail.com wrote:

 If I have 2 RDDs which depend on the same RDD like the following:

 val rdd1 = ...

 val rdd2 = rdd1.groupBy()...

 val rdd3 = rdd1.groupBy()...


 If I don't cache rdd1, will it's lineage be calculated twice (one for rdd2
 and one for rdd3)?



Re: Spark SQL implementation error

2014-12-30 Thread Michael Armbrust
Anytime you see java.lang.NoSuchMethodError it means that you have
multiple conflicting versions of a library on the classpath, or you are
trying to run code that was compiled against the wrong version of a library.
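
One quick way to confirm which jar a conflicting class actually came from (a sketch you can
paste into spark-shell; the class below is just the one from your stack trace):

// prints the jar (or directory) that ShuffledRDD was loaded from, which usually
// exposes the stray Spark version on the classpath
println(Class.forName("org.apache.spark.rdd.ShuffledRDD")
  .getProtectionDomain.getCodeSource.getLocation)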

On Tue, Dec 30, 2014 at 1:43 AM, sachin Singh sachin.sha...@gmail.com
wrote:

 I have a table(csv file) loaded data on that by creating POJO as per table
 structure,and created SchemaRDD as under
 JavaRDDTest1 testSchema =
 sc.textFile(D:/testTable.csv).map(GetTableData);/* GetTableData will
 transform the all table data in testTable object*/
 JavaSchemaRDD schemaTest = sqlContext.applySchema(testSchema, Test.class);
 schemaTest.registerTempTable(testTable);

 JavaSchemaRDD sqlQuery = sqlContext.sql(SELECT * FROM testTable);
 ListString totDuration = sqlQuery.map(new FunctionRow, String() {
   public String call(Row row) {
 return Field1is :  + row.getInt(0);
   }
 }).collect();
 its working fine
 but.
 if I am changing query as(rest code is same)-  JavaSchemaRDD sqlQuery =
 sqlContext.sql(SELECT sum(field1) FROM testTable group by field2);
 error as - Exception in thread main java.lang.NoSuchMethodError:

 org.apache.spark.rdd.ShuffledRDD.init(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/Partitioner;)V

 Please help and Suggest



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-implementation-error-tp20901.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: SparkContext with error from PySpark

2014-12-30 Thread JAGANADH G
Hi

I am using Aanonda Python. Is there any way to specify the Python which we
have o use for running pyspark in a cluster.

Best regards

Jagan

On Tue, Dec 30, 2014 at 6:27 PM, Eric Friedman eric.d.fried...@gmail.com
wrote:

 The Python installed in your cluster is 2.5. You need at least 2.6.

 
 Eric Friedman

  On Dec 30, 2014, at 7:45 AM, Jaggu jagana...@gmail.com wrote:
 
  Hi Team,
 
  I was trying to execute a Pyspark code in cluster. It gives me the
 following
  error. (Wne I run the same job in local it is working fine too :-()
 
  Eoor
 
  Error from python worker:
   /usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/context.py:209:
 Warning:
  'with' will become a reserved keyword in Python 2.6
   Traceback (most recent call last):
 File
 
 /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/runpy.py,
  line 85, in run_module
   loader = get_loader(mod_name)
 File
 
 /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py,
  line 456, in get_loader
   return find_loader(fullname)
 File
 
 /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py,
  line 466, in find_loader
   for importer in iter_importers(fullname):
 File
 
 /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py,
  line 422, in iter_importers
   __import__(pkg)
 File /usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/__init__.py,
  line 41, in module
   from pyspark.context import SparkContext
 File /usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/context.py,
  line 209
   with SparkContext._lock:
   ^
   SyntaxError: invalid syntax
  PYTHONPATH was:
 
 
 /usr/lib/spark-1.2.0-bin-hadoop2.3/python:/usr/lib/spark-1.2.0-bin-hadoop2.3/python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark-1.2.0-bin-hadoop2.3/lib/spark-assembly-1.2.0-hadoop2.3.0.jar:/usr/lib/spark-1.2.0-bin-hadoop2.3/sbin/../python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark-1.2.0-bin-hadoop2.3/sbin/../python:/home/beehive/bin/utils/primitives:/home/beehive/bin/utils/pylogger:/home/beehive/bin/utils/asterScript:/home/beehive/bin/lib:/home/beehive/bin/utils/init:/home/beehive/installer/packages:/home/beehive/ncli
  java.io.EOFException
 at java.io.DataInputStream.readInt(DataInputStream.java:392)
 at
 
 org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
 at
 
 org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
 at
 
 org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
 at
 org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:102)
 at
 org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at
  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:722)
 
  14/12/31 04:49:58 INFO TaskSetManager: Starting task 0.1 in stage 0.0
 (TID
  1, aster4, NODE_LOCAL, 1321 bytes)
  14/12/31 04:49:58 INFO BlockManagerInfo: Added broadcast_2_piece0 in
 memory
  on aster4:43309 (size: 3.8 KB, free: 265.0 MB)
  14/12/31 04:49:59 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID
 1) on
  executor aster4: org.apache.spark.SparkException (
 
 
  Any clue how to resolve the same.
 
  Best regards
 
  Jagan
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-with-error-from-PySpark-tp20907.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 




-- 
**
JAGANADH G
http://jaganadhg.in
*ILUGCBE*
http://ilugcbe.org.in


Spark Accumulators exposed as Metrics to Graphite

2014-12-30 Thread Łukasz Stefaniak
Hi
Does spark have built in possiblity of exposing current value of
Accumulator [1] using Monitoring and Instrumentation [2].
Unfortunately I couldn't find anything in Sources which could be used.

Does it mean only way to expose current accumulator value is to implement
new Source which would hook to Accumulator in the driver process, or listen
for events on the bus?

Many thanks!

[1]
https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.Accumulator
[2] http://spark.apache.org/docs/1.1.1/monitoring.html
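
For reference, one way to do the driver-side hook described above (a rough sketch; the
accumulator, metric names and Graphite host are all made up, and it assumes the
Codahale/dropwizard metrics-graphite module is on the classpath):

import java.net.InetSocketAddress
import java.util.concurrent.TimeUnit
import com.codahale.metrics.{Gauge, MetricRegistry}
import com.codahale.metrics.graphite.{Graphite, GraphiteReporter}

val recordsProcessed = sc.accumulator(0L, "recordsProcessed")

val registry = new MetricRegistry()
registry.register("myapp.recordsProcessed", new Gauge[Long] {
  // accumulator values are only reliable when read on the driver
  override def getValue: Long = recordsProcessed.value
})

val reporter = GraphiteReporter.forRegistry(registry)
  .prefixedWith("spark")
  .build(new Graphite(new InetSocketAddress("graphite.example.com", 2003)))
reporter.start(10, TimeUnit.SECONDS)   // push to Graphite every 10 seconds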


Re: Spark Streaming: HiveContext within Custom Actor

2014-12-30 Thread Tathagata Das
I am not sure that can be done. Receivers are designed to be run only
on the executors/workers, whereas a SQLContext (for using Spark SQL)
can only be defined on the driver.
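
A minimal sketch of that driver-side pattern (the stream and the JSON input are assumptions;
a HiveContext could be constructed the same way):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(ssc.sparkContext)   // built once, on the driver

stream.foreachRDD { rdd =>                          // stream: DStream[String] of JSON records
  val schemaRdd = sqlContext.jsonRDD(rdd)
  schemaRdd.registerTempTable("events")
  sqlContext.sql("SELECT COUNT(*) FROM events").collect().foreach(println)
}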


On Mon, Dec 29, 2014 at 6:45 PM, sranga sra...@gmail.com wrote:
 Hi

 Could Spark-SQL be used from within a custom actor that acts as a receiver
 for a streaming application? If yes, what is the recommended way of passing
 the SparkContext to the actor?
 Thanks for your help.


 - Ranga



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-HiveContext-within-Custom-Actor-tp20892.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: word count aggregation

2014-12-30 Thread Tathagata Das
For windows that large (1 hour), you will probably also have to
increase the batch interval for efficiency.

TD
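
For reference, a minimal sketch of the windowed variant (assuming a checkpointed
StreamingContext and a DStream[(String, Int)] called wordCounts):

// counts over the last hour, recomputed every 5 minutes; the inverse function (_ - _)
// keeps the computation incremental, which is what makes large windows feasible
val hourlyCounts = wordCounts.reduceByKeyAndWindow(
  _ + _,
  _ - _,
  Minutes(60),
  Minutes(5))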

On Mon, Dec 29, 2014 at 12:16 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
 You can use reduceByKeyAndWindow for that. Here's a pretty clean example
 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala

 Thanks
 Best Regards

 On Mon, Dec 29, 2014 at 1:30 PM, Hoai-Thu Vuong thuv...@gmail.com wrote:

 dear user of spark

 I've got a program, streaming a folder, when a new file is created in this
 folder, I count a word, which appears in this document and update it (I used
 StatefulNetworkWordCount to do it). And it work like charm. However, I would
 like to know the different of top 10 word at now and at time (one hour
 before). How could I do it? I try to use windowDuration, but it seem not
 work.



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Shuffle Problems in 1.2.0

2014-12-30 Thread Josh Rosen
Hi Sven,

Do you have a small example program that you can share which will allow me
to reproduce this issue?  If you have a workload that runs into this, you
should be able to keep iteratively simplifying the job and reducing the
data set size until you hit a fairly minimal reproduction (assuming the
issue is deterministic, which it sounds like it is).

On Tue, Dec 30, 2014 at 9:49 AM, Sven Krasser kras...@gmail.com wrote:

 Hey all,

 Since upgrading to 1.2.0 a pyspark job that worked fine in 1.1.1 fails
 during shuffle. I've tried reverting from the sort-based shuffle back to
 the hash one, and that fails as well. Does anyone see similar problems or
 has an idea on where to look next?

 For the sort-based shuffle I get a bunch of exception like this in the
 executor logs:

 2014-12-30 03:13:04,061 ERROR [Executor task launch worker-2] 
 executor.Executor (Logging.scala:logError(96)) - Exception in task 4523.0 in 
 stage 1.0 (TID 4524)
 org.apache.spark.SparkException: PairwiseRDD: unexpected value: 
 List([B@130dc7ad)
   at 
 org.apache.spark.api.python.PairwiseRDD$$anonfun$compute$2.apply(PythonRDD.scala:307)
   at 
 org.apache.spark.api.python.PairwiseRDD$$anonfun$compute$2.apply(PythonRDD.scala:305)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at 
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219)
   at 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)

 For the hash-based shuffle, there are now a bunch of these exceptions in the 
 logs:


 2014-12-30 04:14:01,688 ERROR [Executor task launch worker-0] 
 executor.Executor (Logging.scala:logError(96)) - Exception in task 4479.0 in 
 stage 1.0 (TID 4480)
 java.io.FileNotFoundException: 
 /mnt/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1419905501183_0004/spark-local-20141230035728-8fc0/23/merged_shuffle_1_68_0
  (No such file or directory)
   at java.io.FileOutputStream.open(Native Method)
   at java.io.FileOutputStream.init(FileOutputStream.java:221)
   at 
 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
   at 
 org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
   at 
 org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
   at 
 org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)

 Thank you!
 -Sven



 --
 http://sites.google.com/site/krasser/?utm_source=sig



Re: SparkContext with error from PySpark

2014-12-30 Thread Josh Rosen
To configure the Python executable used by PySpark, see the Using the
Shell Python section in the Spark Programming Guide:
https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell

You can set the PYSPARK_PYTHON environment variable to choose the Python
executable that will be used on the driver and executors.  In addition, you
can set PYSPARK_DRIVER_PYTHON to use a different Python executable only on
the driver (this is useful if you want to use IPython on the driver but not
on the executors).

On Tue, Dec 30, 2014 at 11:13 AM, JAGANADH G jagana...@gmail.com wrote:

 Hi

 I am using Aanonda Python. Is there any way to specify the Python which we
 have o use for running pyspark in a cluster.

 Best regards

 Jagan

 On Tue, Dec 30, 2014 at 6:27 PM, Eric Friedman eric.d.fried...@gmail.com
 wrote:

 The Python installed in your cluster is 2.5. You need at least 2.6.

 
 Eric Friedman

  On Dec 30, 2014, at 7:45 AM, Jaggu jagana...@gmail.com wrote:
 
  Hi Team,
 
  I was trying to execute a Pyspark code in cluster. It gives me the
 following
  error. (Wne I run the same job in local it is working fine too :-()
 
  Eoor
 
  Error from python worker:
   /usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/context.py:209:
 Warning:
  'with' will become a reserved keyword in Python 2.6
   Traceback (most recent call last):
 File
 
 /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/runpy.py,
  line 85, in run_module
   loader = get_loader(mod_name)
 File
 
 /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py,
  line 456, in get_loader
   return find_loader(fullname)
 File
 
 /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py,
  line 466, in find_loader
   for importer in iter_importers(fullname):
 File
 
 /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py,
  line 422, in iter_importers
   __import__(pkg)
 File /usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/__init__.py,
  line 41, in module
   from pyspark.context import SparkContext
 File /usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/context.py,
  line 209
   with SparkContext._lock:
   ^
   SyntaxError: invalid syntax
  PYTHONPATH was:
 
 
 /usr/lib/spark-1.2.0-bin-hadoop2.3/python:/usr/lib/spark-1.2.0-bin-hadoop2.3/python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark-1.2.0-bin-hadoop2.3/lib/spark-assembly-1.2.0-hadoop2.3.0.jar:/usr/lib/spark-1.2.0-bin-hadoop2.3/sbin/../python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark-1.2.0-bin-hadoop2.3/sbin/../python:/home/beehive/bin/utils/primitives:/home/beehive/bin/utils/pylogger:/home/beehive/bin/utils/asterScript:/home/beehive/bin/lib:/home/beehive/bin/utils/init:/home/beehive/installer/packages:/home/beehive/ncli
  java.io.EOFException
 at java.io.DataInputStream.readInt(DataInputStream.java:392)
 at
 
 org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
 at
 
 org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
 at
 
 org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
 at
 org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:102)
 at
 org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
 at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at
  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:722)
 
  14/12/31 04:49:58 INFO TaskSetManager: Starting task 0.1 in stage 0.0
 (TID
  1, aster4, NODE_LOCAL, 1321 bytes)
  14/12/31 04:49:58 INFO BlockManagerInfo: Added broadcast_2_piece0 in
 memory
  on aster4:43309 (size: 3.8 KB, free: 265.0 MB)
  14/12/31 04:49:59 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID
 1) on
  executor aster4: org.apache.spark.SparkException (
 
 
  Any clue how to resolve the same.
 
  Best regards
 
  Jagan
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-with-error-from-PySpark-tp20907.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 




 --
 **
 JAGANADH G
 http://jaganadhg.in
 *ILUGCBE*
 

Re: Spark Streaming: HiveContext within Custom Actor

2014-12-30 Thread Ranga
Thanks. Will look at other options.

On Tue, Dec 30, 2014 at 11:43 AM, Tathagata Das tathagata.das1...@gmail.com
 wrote:

 I am not sure that can be done. Receivers are designed to be run only
 on the executors/workers, whereas a SQLContext (for using Spark SQL)
 can only be defined on the driver.


 On Mon, Dec 29, 2014 at 6:45 PM, sranga sra...@gmail.com wrote:
  Hi
 
  Could Spark-SQL be used from within a custom actor that acts as a
 receiver
  for a streaming application? If yes, what is the recommended way of
 passing
  the SparkContext to the actor?
  Thanks for your help.
 
 
  - Ranga
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-HiveContext-within-Custom-Actor-tp20892.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 



Location of logs in local mode

2014-12-30 Thread Brett Meyer
I'm submitting a script using spark-submit in local mode for testing, and
I'm having trouble figuring out where the logs are stored.  The
documentation indicates that they should be in the work folder in the
directory in which Spark lives on my system, but I see no such folder there.
I've set the SPARK_LOCAL_DIRS and SPARK_LOG_DIR environment variables in
spark-env.sh, but there doesn't seem to be any log output generated in the
locations I've specified there either.  I'm just using spark-submit with
--master local, I haven't run any of the standalone cluster scripts, so I'm
not sure if there's something I'm missing here as far as a default output
location for logging.

Thanks,
Brett






Gradual slow down of the Streaming job (getCallSite at DStream.scala:294)

2014-12-30 Thread RK
Here is the code for my streaming job.
~~
val sparkConf = new SparkConf().setAppName("SparkStreamingJob")
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.default.parallelism", "100")
sparkConf.set("spark.shuffle.consolidateFiles", "true")
sparkConf.set("spark.speculation", "true")
sparkConf.set("spark.speculation.interval", "5000")
sparkConf.set("spark.speculation.quantile", "0.9")
sparkConf.set("spark.speculation.multiplier", "3")
sparkConf.set("spark.mesos.coarse", "true")
sparkConf.set("spark.executor.extraJavaOptions", "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC")
sparkConf.set("spark.shuffle.manager", "SORT")

val ssc = new StreamingContext(sparkConf, Seconds(10))
ssc.checkpoint(checkpointDir)

val topics = "trace"
val numThreads = 1
val topicMap = topics.split(",").map((_, numThreads)).toMap

val kafkaPartitions = 20
val kafkaDStreams = (1 to kafkaPartitions).map { _ =>
  KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
}

val lines = ssc.union(kafkaDStreams)
val words = lines.map(line => doSomething_1(line))
val filteredWords = words.filter(word => word != "test")
val groupedWords = filteredWords.map(word => (word, 1))

val windowedWordCounts = groupedWords.reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(10))
val windowedWordsFiltered = windowedWordCounts.filter { case (word, count) => count > 50 }
val finalResult = windowedWordsFiltered.foreachRDD(words => doSomething_2(words))

ssc.start()
ssc.awaitTermination()
~~
I am running this job on a 9-slave AWS EC2 cluster; each slave node has 32 vCPUs & 60GB memory.

When I start this job, the processing time is usually around 5 - 6 seconds for the 10 second
batches and the scheduling delay is around 0 seconds or a few ms. However, as the job runs for
6 - 8 hours, the processing time increases to 15 - 20 seconds but the scheduling delay grows
to 4 - 6 hours.

When I look at the completed stages, I see that the time taken for "getCallSite at
DStream.scala:294" keeps increasing as time passes. It goes from around 2 seconds to more
than a few minutes.

Clicking on "+details" next to this stage description shows the following execution trace:
org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1088)
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:294)
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:288)
scala.Option.orElse(Option.scala:257)
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:285)
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:221)
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:221)
scala.util.Try$.apply(Try.scala:161)
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:221)
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:165)

When I click on one of these slow stages that executed after 6 - 8 hours, I find the following
information for the individual tasks inside:
- All tasks seem to execute with PROCESS_LOCAL locality.
- Quite a few of these tasks seem to spend anywhere between 30 - 80% of their time in GC.
  Although, when I look at the total memory usage on each of the slave nodes under the
  executors information, I see that the usage is only around 200MB out of the 20GB available.

Even after a few hours, the map stages (val groupedWords = filteredWords.map(word => (word, 1)))
seem to have times consistent with the start of the job, which seems to indicate that this code
is fine. Also, the number of waiting batches is either 0 or 1 even after 8 to 10 hours.

Based on the information that the map is as fast as at the start of the job and that there are
no waiting batches, I am assuming that the getCallSite stages correspond to getting data out of
Kafka? Is this correct or not?
If my assumption is correct, is there anything that I could do to optimize receiving data from
Kafka?
If not, which part of my code needs to be optimized to reduce the scheduling

Kafka + Spark streaming

2014-12-30 Thread SamyaMaiti
Hi Experts,

A few general queries:

1. Can a single block/partition in an RDD have more than one Kafka message, or will there be
one and only one Kafka message per block? More broadly, is the message count related to the
block in any way, or is it just that any message received within a particular block interval
will go into the same block?

2. If the worker that runs the Receiver for Kafka goes down, will the
receiver be restarted on some other worker?

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-streaming-tp20914.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Trouble using MultipleTextOutputFormat with Spark

2014-12-30 Thread Arpan Ghosh
Hi,

I am trying to use the MultipleTextOutputFormat to rename the output files
of my Spark job to something different from the default part-N.

I have implemented a custom MultipleTextOutputFormat class as follows:


class DriveOutputRenameMultipleTextOutputFormat extends MultipleTextOutputFormat[String, Any] {
  override def generateFileNameForKeyValue(key: String, value: Any, name: String): String = {
    "DRIVE" + "-" + name.split("-")(1) + ".csv"
  }
}

When I call the saveAsHadoopFile() function on an RDD[K,V], I get the
following error:

sc.textFile("/mnt/raw/drive/2014/10/29/part-0")
  .map(x => (x, null))
  .saveAsHadoopFile("/mnt/test", classOf[String], classOf[Any],
    classOf[DriveOutputRenameMultipleTextOutputFormat])

java.lang.RuntimeException: java.lang.NoSuchMethodException:
line210ee86336174025bcee4914212e1ff6168.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$DriveOutputRenameMultipleTextOutputFormat.init()
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
at org.apache.hadoop.mapred.JobConf.getOutputFormat(JobConf.java:619) at
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1001)
at
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:931)
Caused by: java.lang.NoSuchMethodException:
line210ee86336174025bcee4914212e1ff6168.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$DriveOutputRenameMultipleTextOutputFormat.init()
at java.lang.Class.getConstructor0(Class.java:2892) at
java.lang.Class.getDeclaredConstructor(Class.java:2058) at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
at org.apache.hadoop.mapred.JobConf.getOutputFormat(JobConf.java:619) at
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1001)
at
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:931)


JetS3T settings spark

2014-12-30 Thread durga
I am not sure how I can pass the jets3t.properties file to spark-submit.
The --file option does not seem to work.
Can someone please help me? My production Spark jobs sporadically get hung up
when reading S3 files.

Thanks,
-D 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/JetS3T-settings-spark-tp20916.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Long-running job cleanup

2014-12-30 Thread Ganelin, Ilya
Hi Patrick, to follow up on the below discussion, I am including a short code 
snippet that produces the problem on 1.1. This is kind of stupid code since 
it’s a greatly simplified version of what I’m actually doing but it has a 
number of the key components in place. I’m also including some example log 
output. Thank you.


def showMemoryUsage(sc: SparkContext) = {

  val usersPerStep = 25000
  val count = 100
  val numSteps = count / usersPerStep

  val users = sc.parallelize(1 to count)
  val zippedUsers = users.zipWithIndex().cache()
  val userFeatures: RDD[(Int, Int)] = sc.parallelize(1 to count).map(s => (s, 2)).cache()
  val productFeatures: RDD[(Int, Int)] = sc.parallelize(1 to 5000)
    .map(s => (s, 4)).cache()

  for (i <- 1 to numSteps) {
    val usersFiltered = zippedUsers.filter(s => {
      ((i - 1) * usersPerStep <= s._2) && (s._2 < i * usersPerStep)
    }).map(_._1).collect()

    usersFiltered.foreach(user => {
      val mult = productFeatures.map(s => s._2 * userFeatures.lookup(user).head)
      mult.takeOrdered(20)

      // Normally this would then be written to disk
      // For the sake of the example this is all we're doing
    })
  }
}

Example broadcast variable added:

14/12/30 19:25:19 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
innovationclientnode01.cof.ds.capitalone.com:54640 (size: 794.0 B, free: 441.9 
MB)


And then if I parse the entire log looking for “free : XXX.X MB” I see the 
available memory slowly ticking away:

Free 441.8 MB

Free 441.8 MB

Free 441.8 MB

Free 441.8 MB

Free 441.8 MB

Free 441.8 MB

Free 441.8 MB

…

Free 441.7 MB

Free 441.7 MB

Free 441.7 MB

Free 441.7 MB

And so on.


Clearly the above code is not persisting the intermediate RDD (mult), yet 
memory is never being properly freed up.

From: Ilya Ganelin ilgan...@gmail.com
Date: Sunday, December 28, 2014 at 4:02 PM
To: Patrick Wendell pwend...@gmail.com, Ganelin, Ilya ilya.gane...@capitalone.com
Cc: user@spark.apache.org
Subject: Re: Long-running job cleanup

Hi Patrick - is that cleanup present in 1.1?

The overhead I am talking about is with regards to what I believe is shuffle 
related metadata. If I watch the execution log I see small broadcast variables 
created for every stage of execution, a few KB at a time, and a certain number 
of MB remaining of available memory on the driver. As I run, this available 
memory goes down, and these variables are never erased. The only RDDs that 
persist are those that are explicitly cached. The RDDs that are generated 
iteratively are not retained or referenced, so I would expect things to get 
cleaned up but they do not. The items consuming memory are not RDDs but what 
appears to be shuffle metadata.

I have a script that parses the logs to show memory consumption over time and 
the script shows memory very steadily being consumed over many hours without 
clearing one small bit at a time.

The specific computation I am doing is the generation of dot products between 
two RDDs of vectors. I need to generate this product for every combination of 
products between the two RDDs but both RDDs are too big to fit in memory. 
Consequently, I iteratively generate this product across one entry from the 
first RDD and all entries from the second and retain the pared-down result 
within an accumulator (by retaining the top N results it is possible to 
actually store the Cartesian which is otherwise too large to fit on disk). 
After a certain number of iterations these intermediate results are then 
written to disk. Each of these steps is tractable in itself but due to the 
accumulation of memory, the overall job becomes intractable.

I would appreciate any suggestions as to how to clean up these intermediate 
broadcast variables. Thank you.


On Sun, Dec 28, 2014 at 1:56 PM Patrick Wendell pwend...@gmail.com wrote:
What do you mean when you say the overhead of spark shuffles start to
accumulate? Could you elaborate more?

In newer versions of Spark shuffle data is cleaned up automatically
when an RDD goes out of scope. It is safe to remove shuffle data at
this point because the RDD can no longer be referenced. If you are
seeing a large build up of shuffle data, it's possible you are
retaining references to older RDDs inadvertently. Could you explain
what your job actually doing?

- Patrick

On Mon, Dec 22, 2014 at 2:36 PM, Ganelin, Ilya
ilya.gane...@capitalone.com wrote:
 Hi all, I have a long running job iterating over a huge dataset. Parts of
 this operation are cached. Since the job runs for so long, eventually the
 overhead of spark shuffles starts to accumulate culminating in the driver
 starting to swap.

 I am aware of the spark.cleanup.tll parameter that allows me to configure
 when cleanup happens but the issue with 

Re: building spark1.2 meet error

2014-12-30 Thread j_soft
No, it still fails when using mvn -Pyarn -Phadoop-2.5 -Dhadoop.version=2.5.0
-Dscala-2.10 -X -DskipTests clean package

...
[DEBUG] /opt/xdsp/spark-1.2.0/core/src/main/scala
[DEBUG] includes = [**/*.scala,**/*.java,]
[DEBUG] excludes = []
[WARNING] Zinc server is not available at port 3030 - reverting to normal
incremental compile
[INFO] Using incremental compilation
[DEBUG] Setup = {
[DEBUG]scala compiler =
/opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar
[DEBUG]scala library =
/opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/org/scala-lang/scala-library/2.10.4/scala-library-2.10.4.jar
[DEBUG]scala extra = {
[DEBUG]  
/opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/org/scala-lang/scala-reflect/2.10.4/scala-reflect-2.10.4.jar
[DEBUG]}
[DEBUG]sbt interface =
/opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/com/typesafe/sbt/sbt-interface/0.13.5/sbt-interface-0.13.5.jar
[DEBUG]compiler interface sources =
/opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/com/typesafe/sbt/compiler-interface/0.13.5/compiler-interface-0.13.5-sources.jar
[DEBUG]java home = 
[DEBUG]fork java = false
[DEBUG]cache directory = /root/.zinc/0.3.5
[DEBUG] }
[INFO] 'compiler-interface' not yet compiled for Scala 2.10.4. Compiling...
[DEBUG] Plain interface to Scala compiler 2.10.4  with arguments: 
-nowarn
-d
/tmp/sbt_8b816650
-bootclasspath
   
/opt/jdk1.7/jre/lib/resources.jar:/opt/jdk1.7/jre/lib/rt.jar:/opt/jdk1.7/jre/lib/sunrsasign.jar:/opt/jdk1.7/jre/lib/jsse.jar:/opt/jdk1.7/jre/lib/jce.jar:/opt/jdk1.7/jre/lib/charsets.jar:/opt/jdk1.7/jre/lib/jfr.jar:/opt/jdk1.7/jre/classes:/opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/org/scala-lang/scala-library/2.10.4/scala-library-2.10.4.jar
-classpath
   
/opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/com/typesafe/sbt/sbt-interface/0.13.5/sbt-interface-0.13.5.jar:/opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/com/typesafe/sbt/compiler-interface/0.13.5/compiler-interface-0.13.5-sources.jar:/opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar:/opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/org/scala-lang/scala-reflect/2.10.4/scala-reflect-2.10.4.jar
/tmp/sbt_b9456a7b/xsbt/API.scala
/tmp/sbt_b9456a7b/xsbt/Analyzer.scala
/tmp/sbt_b9456a7b/xsbt/Command.scala
/tmp/sbt_b9456a7b/xsbt/Compat.scala
/tmp/sbt_b9456a7b/xsbt/CompilerInterface.scala
/tmp/sbt_b9456a7b/xsbt/ConsoleInterface.scala
/tmp/sbt_b9456a7b/xsbt/DelegatingReporter.scala
/tmp/sbt_b9456a7b/xsbt/Dependency.scala
/tmp/sbt_b9456a7b/xsbt/ExtractAPI.scala
/tmp/sbt_b9456a7b/xsbt/ExtractUsedNames.scala
/tmp/sbt_b9456a7b/xsbt/LocateClassFile.scala
/tmp/sbt_b9456a7b/xsbt/Log.scala
/tmp/sbt_b9456a7b/xsbt/Message.scala
/tmp/sbt_b9456a7b/xsbt/ScaladocInterface.scala
error: scala.reflect.internal.MissingRequirementError: object scala.runtime
in compiler mirror not found.
at
scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at
scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
at
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
at
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
at
scala.reflect.internal.Mirrors$RootsBase.getPackage(Mirrors.scala:172)
at
scala.reflect.internal.Mirrors$RootsBase.getRequiredPackage(Mirrors.scala:175)
at
scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage$lzycompute(Definitions.scala:183)
at
scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage(Definitions.scala:183)
at
scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass$lzycompute(Definitions.scala:184)
at
scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass(Definitions.scala:184)
at
scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr$lzycompute(Definitions.scala:1024)
at
scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr(Definitions.scala:1023)
at
scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1153)
at
scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1152)
at
scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1196)
at
scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode(Definitions.scala:1196)
at
scala.reflect.internal.Definitions$DefinitionsClass.init(Definitions.scala:1261)
at 

Re: JetS3T settings spark

2014-12-30 Thread Matei Zaharia
This file needs to be on your CLASSPATH actually, not just in a directory. The 
best way to pass it in is probably to package it into your application JAR. You 
can put it in src/main/resources in a Maven or SBT project, and check that it 
makes it into the JAR using jar tf yourfile.jar.

Matei

 On Dec 30, 2014, at 4:21 PM, durga durgak...@gmail.com wrote:
 
 I am not sure , the way I can pass jets3t.properties file for spark-submit.
 --file option seems not working.
 can some one please help me. My production spark jobs get hung up when
 reading s3 file sporadically.
 
 Thanks,
 -D 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/JetS3T-settings-spark-tp20916.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Shuffle Problems in 1.2.0

2014-12-30 Thread Sven Krasser
Hey Josh,

I am still trying to prune this to a minimal example, but it has been
tricky since scale seems to be a factor. The job runs over ~720GB of data
(the cluster's total RAM is around ~900GB, split across 32 executors). I've
managed to run it over a vastly smaller data set without issues. Curiously,
when I run it over a slightly smaller data set of ~230GB (using sort-based
shuffle), my job also fails, but I see no shuffle errors in the executor
logs. All I see is the error below from the driver (this is also what the
driver prints when erroring out on the large data set, but I assumed the
executor errors to be the root cause).

Any idea on where to look in the interim for more hints? I'll continue to
try to get to a minimal repro.

2014-12-30 21:35:34,539 INFO
[sparkDriver-akka.actor.default-dispatcher-14]
spark.MapOutputTrackerMasterActor (Logging.scala:logInfo(59)) - Asked to
send map output locations for shuffle 0 to
sparkexecu...@ip-10-20-80-60.us-west-1.compute.internal:39739
2014-12-30 21:35:39,512 INFO
[sparkDriver-akka.actor.default-dispatcher-17]
spark.MapOutputTrackerMasterActor (Logging.scala:logInfo(59)) - Asked to
send map output locations for shuffle 0 to
sparkexecu...@ip-10-20-80-62.us-west-1.compute.internal:42277
2014-12-30 21:35:58,893 WARN
[sparkDriver-akka.actor.default-dispatcher-16]
remote.ReliableDeliverySupervisor (Slf4jLogger.scala:apply$mcV$sp(71)) -
Association with remote system
[akka.tcp://sparkyar...@ip-10-20-80-64.us-west-1.compute.internal:49584]
has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
2014-12-30 21:35:59,044 ERROR [Yarn application state monitor]
cluster.YarnClientSchedulerBackend (Logging.scala:logError(75)) - Yarn
application has already exited with state FINISHED!
2014-12-30 21:35:59,056 INFO  [Yarn application state monitor]
handler.ContextHandler (ContextHandler.java:doStop(788)) - stopped
o.e.j.s.ServletContextHandler{/stages/stage/kill,null}

[...]

2014-12-30 21:35:59,111 INFO  [Yarn application state monitor] ui.SparkUI
(Logging.scala:logInfo(59)) - Stopped Spark web UI at
http://ip-10-20-80-37.us-west-1.compute.internal:4040
2014-12-30 21:35:59,130 INFO  [Yarn application state monitor]
scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Stopping DAGScheduler
2014-12-30 21:35:59,131 INFO  [Yarn application state monitor]
cluster.YarnClientSchedulerBackend (Logging.scala:logInfo(59)) - Shutting
down all executors
2014-12-30 21:35:59,132 INFO
[sparkDriver-akka.actor.default-dispatcher-14]
cluster.YarnClientSchedulerBackend (Logging.scala:logInfo(59)) - Asking
each executor to shut down
2014-12-30 21:35:59,132 INFO  [Thread-2] scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Job 1 failed: collect at
/home/hadoop/test_scripts/test.py:63, took 980.751936 s
Traceback (most recent call last):
  File /home/hadoop/test_scripts/test.py, line 63, in module
result = j.collect()
  File /home/hadoop/spark/python/pyspark/rdd.py, line 676, in collect
bytesInJava = self._jrdd.collect().iterator()
  File
/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py,
line 538, in __call__
  File
/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line
300, in get_return_value
py4j.protocol.Py4JJavaError2014-12-30 21:35:59,140 INFO  [Yarn application
state monitor] cluster.YarnClientSchedulerBackend
(Logging.scala:logInfo(59)) - Stopped
: An error occurred while calling o117.collect.
: org.apache.spark.SparkException: Job cancelled because SparkContext was
shut down
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:702)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:701)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at
org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:701)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1428)
at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundPostStop(DAGScheduler.scala:1375)
at
akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
at
akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
at akka.actor.ActorCell.terminate(ActorCell.scala:369)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at

Re: python: module pyspark.daemon not found

2014-12-30 Thread Davies Liu
Could you share a link about this? It's common to use Java 7, so it would
be nice if we can fix this.

On Mon, Dec 29, 2014 at 1:27 PM, Eric Friedman
eric.d.fried...@gmail.com wrote:
 Was your spark assembly jarred with Java 7?  There's a known issue with jar
 files made with that version. It prevents them from being used on
 PYTHONPATH. You can rejar with Java 6 for better results.

 
 Eric Friedman

 On Dec 29, 2014, at 8:01 AM, Naveen Kumar Pokala npok...@spcapitaliq.com
 wrote:



 14/12/29 18:10:56 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID
 2, nj09mhf0730.mhf.mhc, PROCESS_LOCAL, 1246 bytes)

 14/12/29 18:10:56 INFO TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) on
 executor nj09mhf0730.mhf.mhc: org.apache.spark.SparkException (

 Error from python worker:

   python: module pyspark.daemon not found

 PYTHONPATH was:


 /home/npokala/data/spark-install/spark-master/python:/home/npokala/data/spark-install/spark-master/python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar:/home/npokala/data/spark-install/spark-master/sbin/../python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/sbin/../python:

 java.io.EOFException) [duplicate 1]

 14/12/29 18:10:56 INFO TaskSetManager: Starting task 1.1 in stage 0.0 (TID
 3, nj09mhf0731.mhf.mhc, PROCESS_LOCAL, 1246 bytes)

 14/12/29 18:10:56 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 2) on
 executor nj09mhf0730.mhf.mhc: org.apache.spark.SparkException (

 Error from python worker:

   python: module pyspark.daemon not found

 PYTHONPATH was:


 /home/npokala/data/spark-install/spark-master/python:/home/npokala/data/spark-install/spark-master/python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar:/home/npokala/data/spark-install/spark-master/sbin/../python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/sbin/../python:

 java.io.EOFException) [duplicate 2]

 14/12/29 18:10:56 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID
 4, nj09mhf0731.mhf.mhc, PROCESS_LOCAL, 1246 bytes)

 14/12/29 18:10:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
 on nj09mhf0731.mhf.mhc:48802 (size: 3.4 KB, free: 265.1 MB)

 14/12/29 18:10:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
 on nj09mhf0731.mhf.mhc:41243 (size: 3.4 KB, free: 265.1 MB)

 14/12/29 18:10:59 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 3) on
 executor nj09mhf0731.mhf.mhc: org.apache.spark.SparkException (

 Error from python worker:

   python: module pyspark.daemon not found

 PYTHONPATH was:


 /home/npokala/data/spark-install/spark-master/python:/home/npokala/data/spark-install/spark-master/python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar:/home/npokala/data/spark-install/spark-master/sbin/../python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/sbin/../python:

 java.io.EOFException) [duplicate 3]

 14/12/29 18:10:59 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID
 5, nj09mhf0730.mhf.mhc, PROCESS_LOCAL, 1246 bytes)

 14/12/29 18:10:59 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 4) on
 executor nj09mhf0731.mhf.mhc: org.apache.spark.SparkException (

 Error from python worker:

   python: module pyspark.daemon not found

 PYTHONPATH was:


 /home/npokala/data/spark-install/spark-master/python:/home/npokala/data/spark-install/spark-master/python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar:/home/npokala/data/spark-install/spark-master/sbin/../python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/sbin/../python:

 java.io.EOFException) [duplicate 4]

 14/12/29 18:10:59 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID
 6, nj09mhf0730.mhf.mhc, PROCESS_LOCAL, 1246 bytes)

 14/12/29 18:11:00 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
 on nj09mhf0730.mhf.mhc:60005 (size: 3.4 KB, free: 265.1 MB)

 14/12/29 18:11:00 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
 on nj09mhf0730.mhf.mhc:40227 (size: 3.4 KB, free: 265.1 MB)

 14/12/29 18:11:01 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 6) on
 executor nj09mhf0730.mhf.mhc: org.apache.spark.SparkException (

 Error from python worker:

   python: module pyspark.daemon not found

 PYTHONPATH was:


 

Re: Python:Streaming Question

2014-12-30 Thread Davies Liu
There is a known bug with local scheduler, will be fixed by
https://github.com/apache/spark/pull/3779

On Sun, Dec 21, 2014 at 10:57 PM, Samarth Mailinglist
mailinglistsama...@gmail.com wrote:
 I’m trying to run the stateful network word count at
 https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.py
 using the command:

 ./bin/spark-submit
 examples/src/main/python/streaming/stateful_network_wordcount.py localhost
 

 I am also running netcat at the same time (prior to running the above
 command):

 nc -lk 

 However, no wordcount is printed (even though pprint() is being called).

 How do I print the results?
 How do I otherwise access the data at real time? Suppose I want to have a
 dashboard showing the data in running_counts?

 Note that
 https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py
 works perfectly fine.

 Running Spark 1.2.0, hadoop 2.4.x prebuilt

 Thanks,
 Samarth

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Kafka + Spark streaming

2014-12-30 Thread Tathagata Das
1. Of course, a single block / partition has many Kafka messages, and
from different Kafka topics interleaved together. The message count is
not related to the block count. Any message received within a
particular block interval will go in the same block.

2. Yes, the receiver will be started on another worker.

TD
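
A small follow-up on 1: the number of blocks a receiver produces per batch is roughly
batchInterval / spark.streaming.blockInterval (200 ms by default), and that in turn sets the
number of tasks per batch for that receiver. A sketch only; the value is illustrative:

// fewer, larger blocks per batch: 500 ms blocks instead of the default 200 ms
sparkConf.set("spark.streaming.blockInterval", "500")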


On Tue, Dec 30, 2014 at 2:19 PM, SamyaMaiti samya.maiti2...@gmail.com wrote:
 Hi Experts,

 Few general Queries :

 1. Can a single block/partition in a RDD have more than 1 kafka message? or
 there will be one  only one kafka message per block? In a more broader way,
 is the message count related to block in any way or its just that any
 message received with in a particular block interval will go in the same
 block.

 2. If a worker goes down which runs the Receiver for Kafka, Will the
 receiver be restarted on some other worker?

 Regards,
 Sam



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-streaming-tp20914.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2014-12-30 Thread Tathagata Das
That is kind of expected due to data locality. Though you should see
some tasks running on the executors as the data gets replicated to
other nodes and tasks can therefore run based on locality. You have
two solutions:

1. kafkaStream.repartition() to explicitly repartition the received
data across the cluster.
2. Create multiple kafka streams and union them together.

See 
http://spark.apache.org/docs/latest/streaming-programming-guide.html#reducing-the-processing-time-of-each-batch
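
A minimal sketch of both suggestions together (zkQuorum, group and topicMap as in the
original job; the receiver and partition counts are just examples):

val numReceivers = 4
val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
}
val unified = ssc.union(kafkaStreams)   // 2. several receivers, unioned into one DStream
val balanced = unified.repartition(16)  // 1. spread the received blocks across all executors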

On Tue, Dec 30, 2014 at 1:43 AM, Mukesh Jha me.mukesh@gmail.com wrote:
 Thanks Sandy, it was an issue with the number of cores.

 Another issue I was facing is that tasks are not getting distributed evenly
 among all executors and are running at the NODE_LOCAL locality level, i.e.
 all the tasks are running on the same executor where my kafkareceiver(s) are
 running, even though other executors are idle.

 I configured spark.locality.wait=50 instead of the default 3000 ms, which
 forced the task rebalancing among nodes. Let me know if there is a better
 way to deal with this.


 On Tue, Dec 30, 2014 at 12:09 AM, Mukesh Jha me.mukesh@gmail.com
 wrote:

 Makes sense, I've also tried it in standalone mode where all 3 workers and the
 driver were running on the same 8-core box, and the results were similar.

 Anyways I will share the results in YARN mode with 8 core yarn containers.

 On Mon, Dec 29, 2014 at 11:58 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 When running in standalone mode, each executor will be able to use all 8
 cores on the box.  When running on YARN, each executor will only have access
 to 2 cores.  So the comparison doesn't seem fair, no?

 -Sandy

 On Mon, Dec 29, 2014 at 10:22 AM, Mukesh Jha me.mukesh@gmail.com
 wrote:

 Nope, I am setting 5 executors with 2 cores each. Below is the command
 that I'm using to submit in YARN mode. This starts up 5 executor nodes and
 a driver as per the Spark and application master UIs.

 spark-submit --master yarn-cluster --num-executors 5 --driver-memory
 1024m --executor-memory 1024m --executor-cores 2 --class
 com.oracle.ci.CmsgK2H /homext/lib/MJ-ci-k2h.jar vm.cloud.com:2181/kafka
 spark-yarn avro 1 5000

 On Mon, Dec 29, 2014 at 11:45 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 *oops, I mean are you setting --executor-cores to 8

 On Mon, Dec 29, 2014 at 10:15 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Are you setting --num-executors to 8?

 On Mon, Dec 29, 2014 at 10:13 AM, Mukesh Jha me.mukesh@gmail.com
 wrote:

 Sorry Sandy, The command is just for reference but I can confirm that
 there are 4 executors and a driver as shown in the spark UI page.

 Each of these machines is a 8 core box with ~15G of ram.

 On Mon, Dec 29, 2014 at 11:23 PM, Sandy Ryza
 sandy.r...@cloudera.com wrote:

 Hi Mukesh,

 Based on your spark-submit command, it looks like you're only
 running with 2 executors on YARN.  Also, how many cores does each 
 machine
 have?

 -Sandy

 On Mon, Dec 29, 2014 at 4:36 AM, Mukesh Jha
 me.mukesh@gmail.com wrote:

 Hello Experts,
 I'm bench-marking Spark on YARN
 (https://spark.apache.org/docs/latest/running-on-yarn.html) vs a 
 standalone
 spark cluster 
 (https://spark.apache.org/docs/latest/spark-standalone.html).
 I have a standalone cluster with 3 executors, and a spark app
 running on yarn with 4 executors as shown below.

 The spark job running inside YARN is 10x slower than the one
 running on the standalone cluster (even though YARN has more
 workers); also, in both cases all the executors are in the same
 datacenter, so there shouldn't be any latency. On YARN each 5-sec batch is
 reading data from Kafka and processing it in 5 sec, whereas on the standalone
 cluster each 5-sec batch is getting processed in 0.4 sec.
 Also, in YARN mode the executors are not getting used evenly,
 as vm-13 and vm-14 are running most of the tasks, whereas in standalone
 mode all the executors are running tasks.

 Do I need to set up some configuration to evenly distribute the
 tasks? Also, do you have any pointers on why the YARN job is 10x
 slower than the standalone job?
 Any suggestion is greatly appreciated. Thanks in advance.

 YARN (5 workers + driver)

 Executor ID | Address               | RDD Blocks | Memory Used  | Disk Used | Active | Failed | Complete | Total | Task Time | Input | Shuffle Read | Shuffle Write
 1           | vm-18.cloud.com:51796 | 0          | 0.0B/530.3MB | 0.0 B     | 1      | 0      | 16       | 17    | 634 ms    | 0.0 B | 2047.0 B     | 1710.0 B
 2           | vm-13.cloud.com:57264 | 0          | 0.0B/530.3MB | 0.0 B     | 0      | 0      | 1427     | 1427  | 5.5 m     | 0.0 B | 0.0 B        | 0.0 B
 3           | vm-14.cloud.com:54570 | 0          | 0.0B/530.3MB | 0.0 B     | 0      | 0      | 1379     | 1379  | 5.2 m     | 0.0 B | 1368.0 B     | 2.8 KB
 4           | vm-11.cloud.com:56201 | 0          | 0.0B/530.3MB | 0.0 B     | 0      | 0      | 10       | 10    | 625 ms    | 0.0 B | 1368.0 B     | 1026.0 B
 5           | vm-5.cloud.com:42958  | 0          | 0.0B/530.3MB | 0.0 B     | 0      | 0      | 22       | 22    | 632 ms    | 0.0 B | 1881.0 B     | 2.8 KB
 driver      | vm.cloud.com:51847    | 0          | 0.0B/530.0MB | 0.0 B     | 0      | 0      | 0        | 0     | 0 ms      | 0.0 B | 0.0 B        | 0.0 B

 /homext/spark/bin/spark-submit
 --master yarn-cluster 

Re: Gradual slow down of the Streaming job (getCallSite at DStream.scala:294)

2014-12-30 Thread Tathagata Das
Which version of Spark Streaming are you using?

When the batch processing time increases to 15-20 seconds, could you
compare the task times to the task times when the application
is just launched? Basically, is the increase from 6 seconds to 15-20
seconds caused by an increase in computation or by GC, etc.?

On Tue, Dec 30, 2014 at 1:41 PM, RK prk...@yahoo.com.invalid wrote:
 Here is the code for my streaming job.

 ~~
 val sparkConf = new SparkConf().setAppName("SparkStreamingJob")
 sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
 sparkConf.set("spark.default.parallelism", "100")
 sparkConf.set("spark.shuffle.consolidateFiles", "true")
 sparkConf.set("spark.speculation", "true")
 sparkConf.set("spark.speculation.interval", "5000")
 sparkConf.set("spark.speculation.quantile", "0.9")
 sparkConf.set("spark.speculation.multiplier", "3")
 sparkConf.set("spark.mesos.coarse", "true")
 sparkConf.set("spark.executor.extraJavaOptions",
   "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC")
 sparkConf.set("spark.shuffle.manager", "SORT")

 val ssc = new StreamingContext(sparkConf, Seconds(10))
 ssc.checkpoint(checkpointDir)

 val topics = "trace"
 val numThreads = 1
 val topicMap = topics.split(",").map((_, numThreads)).toMap

 val kafkaPartitions = 20
 val kafkaDStreams = (1 to kafkaPartitions).map { _ =>
   KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
 }

 val lines = ssc.union(kafkaDStreams)
 val words = lines.map(line => doSomething_1(line))
 val filteredWords = words.filter(word => word != "test")
 val groupedWords = filteredWords.map(word => (word, 1))

 val windowedWordCounts = groupedWords.reduceByKeyAndWindow(_ + _, _ - _,
   Seconds(30), Seconds(10))
 val windowedWordsFiltered = windowedWordCounts.filter { case (word, count) =>
   count > 50 }
 val finalResult = windowedWordsFiltered.foreachRDD(words =>
   doSomething_2(words))

 ssc.start()
 ssc.awaitTermination()
 ~~

 I am running this job on a 9-slave AWS EC2 cluster where each slave node has
 32 vCPUs and 60GB memory.

 When I start this job, the processing time is usually around 5 - 6 seconds
 for the 10-second batch and the scheduling delay is around 0 seconds or a
 few ms. However, as the job runs for 6 - 8 hours, the processing time
 increases to 15 - 20 seconds and the scheduling delay grows to 4 - 6
 hours.

 When I look at the completed stages, I see that the time taken for
 getCallSite at DStream.scala:294 keeps increasing as time passes by. It goes
 from around 2 seconds to more than a few minutes.

 Clicking on +details next to this stage description shows the following
 execution trace.
 org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1088)
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:294)
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:288)
 scala.Option.orElse(Option.scala:257)
 org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:285)
 org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
 org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
 org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
 scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
 org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
 org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:221)
 org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:221)
 scala.util.Try$.apply(Try.scala:161)
 org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:221)
 org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:165)

 When I click on one of these slow stages that executed after 6 - 8 hours, I
 find the following information for individual tasks inside.
 - All tasks seem to execute with PROCESS_LOCAL locality.
 - Quite a few of these tasks seem to spend anywhere between 30 - 80% of
 their time in GC. However, when I look at the total memory usage on each of
 the slave nodes under the executors information, I see that the usage is only
 around 200MB out of 20GB available.

 Even after a few hours, the map stages (val groupedWords =
 filteredWords.map(word => (word, 1))) seem to have consistent times as
 during the start of the job, which seems to indicate 

Re: Gradual slow down of the Streaming job (getCallSite at DStream.scala:294)

2014-12-30 Thread RK
I am running the job on 1.1.1.
I will let the job run overnight and send you more info on computation vs GC
time tomorrow.
BTW, do you know what the stage description named "getCallSite at
DStream.scala:294" might mean?
Thanks,
RK
 

 On Tuesday, December 30, 2014 6:02 PM, Tathagata Das 
tathagata.das1...@gmail.com wrote:
   

 Which version of Spark Streaming are you using?

When the batch processing time increases to 15-20 seconds, could you
compare the task times to the task times when the application
is just launched? Basically, is the increase from 6 seconds to 15-20
seconds caused by an increase in computation or by GC, etc.?

On Tue, Dec 30, 2014 at 1:41 PM, RK prk...@yahoo.com.invalid wrote:
 Here is the code for my streaming job.

 ~~
 val sparkConf = new SparkConf().setAppName("SparkStreamingJob")
 sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
 sparkConf.set("spark.default.parallelism", "100")
 sparkConf.set("spark.shuffle.consolidateFiles", "true")
 sparkConf.set("spark.speculation", "true")
 sparkConf.set("spark.speculation.interval", "5000")
 sparkConf.set("spark.speculation.quantile", "0.9")
 sparkConf.set("spark.speculation.multiplier", "3")
 sparkConf.set("spark.mesos.coarse", "true")
 sparkConf.set("spark.executor.extraJavaOptions",
   "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC")
 sparkConf.set("spark.shuffle.manager", "SORT")

 val ssc = new StreamingContext(sparkConf, Seconds(10))
 ssc.checkpoint(checkpointDir)

 val topics = "trace"
 val numThreads = 1
 val topicMap = topics.split(",").map((_, numThreads)).toMap

 val kafkaPartitions = 20
 val kafkaDStreams = (1 to kafkaPartitions).map { _ =>
   KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
 }

 val lines = ssc.union(kafkaDStreams)
 val words = lines.map(line => doSomething_1(line))
 val filteredWords = words.filter(word => word != "test")
 val groupedWords = filteredWords.map(word => (word, 1))

 val windowedWordCounts = groupedWords.reduceByKeyAndWindow(_ + _, _ - _,
   Seconds(30), Seconds(10))
 val windowedWordsFiltered = windowedWordCounts.filter { case (word, count) =>
   count > 50 }
 val finalResult = windowedWordsFiltered.foreachRDD(words =>
   doSomething_2(words))

 ssc.start()
 ssc.awaitTermination()
 ~~

 I am running this job on a 9-slave AWS EC2 cluster where each slave node has
 32 vCPUs and 60GB memory.

 When I start this job, the processing time is usually around 5 - 6 seconds
 for the 10-second batch and the scheduling delay is around 0 seconds or a
 few ms. However, as the job runs for 6 - 8 hours, the processing time
 increases to 15 - 20 seconds and the scheduling delay grows to 4 - 6
 hours.

 When I look at the completed stages, I see that the time taken for
 getCallSite at DStream.scala:294 keeps increasing as time passes by. It goes
 from around 2 seconds to more than a few minutes.

 Clicking on +details next to this stage description shows the following
 execution trace.
 org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1088)
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:294)
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:288)
 scala.Option.orElse(Option.scala:257)
 org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:285)
 org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
 org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
 org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
 scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
 org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
 org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:221)
 org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:221)
 scala.util.Try$.apply(Try.scala:161)
 org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:221)
 org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:165)

 When I click on one of these slow stages that executed after 6 - 8 hours, I
 find the following information for individual tasks inside.
 - All tasks seem to execute with PROCESS_LOCAL locality.
 - Quite a few of these tasks seem to spend anywhere between 30 - 80% of
 their time in GC. However, when I look at 

Re: python: module pyspark.daemon not found

2014-12-30 Thread Eric Friedman
https://issues.apache.org/jira/browse/SPARK-1911

Is one of several tickets on the problem. 

 On Dec 30, 2014, at 8:36 PM, Davies Liu dav...@databricks.com wrote:
 
 Could you share a link about this? It's common to use Java 7, so it
 would be nice if we can fix this.
 
 On Mon, Dec 29, 2014 at 1:27 PM, Eric Friedman
 eric.d.fried...@gmail.com wrote:
 Was your spark assembly jarred with Java 7?  There's a known issue with jar
 files made with that version. It prevents them from being used on
 PYTHONPATH. You can rejar with Java 6 for better results.
 
 
 Eric Friedman
 
 On Dec 29, 2014, at 8:01 AM, Naveen Kumar Pokala npok...@spcapitaliq.com
 wrote:
 
 
 
 14/12/29 18:10:56 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID
 2, nj09mhf0730.mhf.mhc, PROCESS_LOCAL, 1246 bytes)
 
 14/12/29 18:10:56 INFO TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) on
 executor nj09mhf0730.mhf.mhc: org.apache.spark.SparkException (
 
 Error from python worker:
 
  python: module pyspark.daemon not found
 
 PYTHONPATH was:
 
 
 /home/npokala/data/spark-install/spark-master/python:/home/npokala/data/spark-install/spark-master/python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar:/home/npokala/data/spark-install/spark-master/sbin/../python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/sbin/../python:
 
 java.io.EOFException) [duplicate 1]
 
 14/12/29 18:10:56 INFO TaskSetManager: Starting task 1.1 in stage 0.0 (TID
 3, nj09mhf0731.mhf.mhc, PROCESS_LOCAL, 1246 bytes)
 
 14/12/29 18:10:56 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 2) on
 executor nj09mhf0730.mhf.mhc: org.apache.spark.SparkException (
 
 Error from python worker:
 
  python: module pyspark.daemon not found
 
 PYTHONPATH was:
 
 
 /home/npokala/data/spark-install/spark-master/python:/home/npokala/data/spark-install/spark-master/python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar:/home/npokala/data/spark-install/spark-master/sbin/../python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/sbin/../python:
 
 java.io.EOFException) [duplicate 2]
 
 14/12/29 18:10:56 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID
 4, nj09mhf0731.mhf.mhc, PROCESS_LOCAL, 1246 bytes)
 
 14/12/29 18:10:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
 on nj09mhf0731.mhf.mhc:48802 (size: 3.4 KB, free: 265.1 MB)
 
 14/12/29 18:10:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
 on nj09mhf0731.mhf.mhc:41243 (size: 3.4 KB, free: 265.1 MB)
 
 14/12/29 18:10:59 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 3) on
 executor nj09mhf0731.mhf.mhc: org.apache.spark.SparkException (
 
 Error from python worker:
 
  python: module pyspark.daemon not found
 
 PYTHONPATH was:
 
 
 /home/npokala/data/spark-install/spark-master/python:/home/npokala/data/spark-install/spark-master/python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar:/home/npokala/data/spark-install/spark-master/sbin/../python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/sbin/../python:
 
 java.io.EOFException) [duplicate 3]
 
 14/12/29 18:10:59 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID
 5, nj09mhf0730.mhf.mhc, PROCESS_LOCAL, 1246 bytes)
 
 14/12/29 18:10:59 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 4) on
 executor nj09mhf0731.mhf.mhc: org.apache.spark.SparkException (
 
 Error from python worker:
 
  python: module pyspark.daemon not found
 
 PYTHONPATH was:
 
 
 /home/npokala/data/spark-install/spark-master/python:/home/npokala/data/spark-install/spark-master/python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar:/home/npokala/data/spark-install/spark-master/sbin/../python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/sbin/../python:
 
 java.io.EOFException) [duplicate 4]
 
 14/12/29 18:10:59 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID
 6, nj09mhf0730.mhf.mhc, PROCESS_LOCAL, 1246 bytes)
 
 14/12/29 18:11:00 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
 on nj09mhf0730.mhf.mhc:60005 (size: 3.4 KB, free: 265.1 MB)
 
 14/12/29 18:11:00 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
 on nj09mhf0730.mhf.mhc:40227 (size: 3.4 KB, free: 265.1 MB)
 
 14/12/29 18:11:01 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 6) on
 executor nj09mhf0730.mhf.mhc: org.apache.spark.SparkException (
 
 Error from python worker:
 
  python: module pyspark.daemon not found
 
 PYTHONPATH was:
 
 
 

Spark app performance

2014-12-30 Thread Raghavendra Pandey
I have a spark app that involves a series of mapPartitions operations and then
a keyBy operation. I have measured the time inside the mapPartitions function
blocks. These blocks take trivial time. Still, the application takes way too
much time, and even the Spark UI shows that much time.
So I was wondering where the time is going and how I can reduce it.

Thanks
Raghavendra


Re: serialize protobuf messages

2014-12-30 Thread Chen Song
Does anyone have suggestions?

On Tue, Dec 23, 2014 at 3:08 PM, Chen Song chen.song...@gmail.com wrote:

 Silly question, what is the best way to shuffle protobuf messages in a Spark
 (Streaming) job? Can I use Kryo on top of the protobuf Message type?

 --
 Chen Song




-- 
Chen Song
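
For reference, one commonly suggested approach (not confirmed in this thread)
is to plug a protobuf-aware serializer into Kryo, e.g. via Twitter's
chill-protobuf. The registrator below is only a sketch under that assumption,
and MyProtoMessage stands in for a protoc-generated class:

    import com.esotericsoftware.kryo.Kryo
    import com.twitter.chill.protobuf.ProtobufSerializer
    import org.apache.spark.serializer.KryoRegistrator

    class ProtoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        // MyProtoMessage is a placeholder for your generated protobuf Message class
        kryo.register(classOf[MyProtoMessage], new ProtobufSerializer())
      }
    }

    // and in the driver configuration:
    // conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // conf.set("spark.kryo.registrator", "com.example.ProtoRegistrator")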


Re: JetS3T settings spark

2014-12-30 Thread durga katakam
Thanks Matei.
-D

On Tue, Dec 30, 2014 at 4:49 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 This file needs to be on your CLASSPATH actually, not just in a directory.
 The best way to pass it in is probably to package it into your application
 JAR. You can put it in src/main/resources in a Maven or SBT project, and
 check that it makes it into the JAR using jar tf yourfile.jar.

 Matei

  On Dec 30, 2014, at 4:21 PM, durga durgak...@gmail.com wrote:
 
  I am not sure of the way I can pass the jets3t.properties file to
 spark-submit.
  The --file option does not seem to work.
  Can someone please help me? My production spark jobs get hung up
  sporadically when reading S3 files.
 
  Thanks,
  -D
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/JetS3T-settings-spark-tp20916.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 




Re: building spark1.2 meet error

2014-12-30 Thread Sean Owen
This is still using a non-existent hadoop-2.5 profile, and
-Dscala-2.10 won't do anything. These don't matter though; this error
is just some scalac problem. I don't see this error when compiling.
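
For what it's worth, the Spark 1.2 build documentation (if memory serves) uses
the hadoop-2.4 profile for Hadoop 2.4.x/2.5.x, so the intended command is
presumably closer to:

    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0 -DskipTests clean package

That only corrects the profile name; it does not address the scalac error
mentioned above.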

On Wed, Dec 31, 2014 at 12:48 AM, j_soft zsof...@gmail.com wrote:
 No, it still fails using mvn -Pyarn -Phadoop-2.5 -Dhadoop.version=2.5.0
 -Dscala-2.10 -X -DskipTests clean package


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is it possible to store graph directly into HDFS?

2014-12-30 Thread Jason Hong
Thanks for your answer, Xuefeng Wu.

But, I don't understand how to save a graph as object. :(

Do you have any sample codes?
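
For what it's worth, a minimal sketch of the save-as-object idea, assuming a
Graph[VD, ED] named graph, an existing SparkContext sc, and hypothetical HDFS
paths (VD and ED stand in for your concrete vertex and edge attribute types):

    import org.apache.spark.graphx._

    // save: write the vertex and edge RDDs to HDFS as object files
    graph.vertices.saveAsObjectFile("hdfs:///graphs/myGraph/vertices")
    graph.edges.saveAsObjectFile("hdfs:///graphs/myGraph/edges")

    // load later: read the two RDDs back and rebuild the graph
    val vertices = sc.objectFile[(VertexId, VD)]("hdfs:///graphs/myGraph/vertices")
    val edges = sc.objectFile[Edge[ED]]("hdfs:///graphs/myGraph/edges")
    val reloaded: Graph[VD, ED] = Graph(vertices, edges)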

2014-12-31 13:27 GMT+09:00 Jason Hong begger3...@gmail.com:

 Thanks for your answer, Xuefeng Wu.

 But, I don't understand how to save a graph as object. :(

 Do you have any sample codes?

 Jason Hong

 2014-12-30 22:31 GMT+09:00 Xuefeng Wu ben...@gmail.com:

 how about save as object?


 Yours, Xuefeng Wu 吴雪峰 敬上

  On Dec 30, 2014, at 9:27 PM, Jason Hong begger3...@gmail.com wrote:
 
  Dear all:)
 
  We're trying to make a graph using large input data and get a subgraph
  after applying some filters.
 
  Now, we wanna save this graph to HDFS so that we can load it later.
 
  Is it possible to store a graph or subgraph directly into HDFS and load
  it back as a graph for future use?
 
  We will be glad for your suggestion.
 
  Best regards.
 
  Jason Hong
 
 
 
 
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-store-graph-directly-into-HDFS-tp20908.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 





How to set local property in beeline connect to the spark thrift server

2014-12-30 Thread Xiaoyu Wang
Hi all!

I use Spark SQL 1.2 to start the thrift server on YARN.

I want to use the fair scheduler in the thrift server.

I set the properties in spark-defaults.conf like this:
spark.scheduler.mode FAIR
spark.scheduler.allocation.file
/opt/spark-1.2.0-bin-2.4.1/conf/fairscheduler.xml

In the thrift server UI I can see that the scheduler pools are OK.
[inline image: thrift server UI showing the scheduler pools]

How can I assign one SQL job to the test pool?
By default a SQL job runs in the default pool.

In the http://spark.apache.org/docs/latest/job-scheduling.html document
I see that sc.setLocalProperty("spark.scheduler.pool", "pool1") can be set in
the code.

In beeline I execute set spark.scheduler.pool=test, but it has no effect.
So how can I set the local property in beeline?
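
For reference, the programmatic form mentioned in that document is a per-thread
setting on the SparkContext; a minimal sketch (for an application that owns its
own context, which is not the beeline case) would be:

    // run the jobs issued from this thread in the "test" pool
    sc.setLocalProperty("spark.scheduler.pool", "test")
    // ... execute queries / actions ...
    // reset back to the default pool
    sc.setLocalProperty("spark.scheduler.pool", null)

Whether beeline exposes an equivalent session-level setting for the Thrift
server is exactly the open question above.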