java.lang.NoClassDefFoundError: org/apache/spark/deploy/worker/Worker

2014-05-18 Thread Hao Wang
Hi, all

*Spark version: bae07e3 [behind 1] fix different versions of commons-lang
dependency and apache/spark#746 addendum*

I have six worker nodes and four of them have this NoClassDefFoundError when
I use the start-slaves.sh script on my driver node. However, running ./bin/spark-class
org.apache.spark.deploy.worker.Worker spark://MASTER_IP:PORT on the worker
nodes works well.

I compile the /spark directory on the driver node and distribute it to all the
worker nodes. The paths on the different nodes are identical.

Here are the logs from one of the four failing worker nodes.

Spark Command: java -cp
::/home/wanghao/spark/conf:/home/wanghao/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.2.0.jar
-Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
org.apache.spark.deploy.worker.Worker spark://192.168.1.12:7077
--webui-port 8081


Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/deploy/worker/Worker
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.deploy.worker.Worker
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
Could not find the main class: org.apache.spark.deploy.worker.Worker.
Program will exit.

Here is spark-env.sh

export SPARK_WORKER_MEMORY=1g
export SPARK_MASTER_IP=192.168.1.12
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=2

hosts file:

127.0.0.1   localhost
192.168.1.12    sing12

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

192.168.1.11 sing11
192.168.1.59 sing59

###
# failed machines
###

192.168.1.122 host122
192.168.1.123 host123
192.168.1.124 host124
192.168.1.125 host125


Regards,
Wang Hao(王灏)

CloudTeam | School of Software Engineering
Shanghai Jiao Tong University
Address: 800 Dongchuan Road, Minhang District, Shanghai, 200240
Email: wh.s...@gmail.com


Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
Hi Xiangrui,
  I checked the stderr of the worker node; yes, it failed to load the implementation
from:
  com.github.fommil.netlib.NativeSystemBLAS...

  What do you mean by "include breeze-natives or netlib:all"?

  Things I've already done:
  1. added the breeze and breeze-natives dependencies to the sbt build file
  2. downloaded all the breeze jars to the slaves
  3. added the jars to the classpath on the slaves
  4. ran ln -s libopenblas_nehalemp-r0.2.9.rc2.so libblas.so.3 and added it to
LD_LIBRARY_PATH on the slaves

  Thank you for your help
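
For reference, a rough sketch of what "include breeze-natives or netlib:all" usually means in an sbt build; the versions below match the jars that show up in the worker classpath later in this thread, but treat them as illustrative:

// build.sbt -- a minimal sketch, not the exact build used in this thread
libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze" % "0.7",
  // breeze-natives pulls in the netlib-java native wrappers
  // (com.github.fommil.netlib.NativeSystemBLAS and friends)
  "org.scalanlp" %% "breeze-natives" % "0.7"
)

// "netlib:all" refers to netlib-java's umbrella artifact, an alternative way
// to pull in all of the native backends:
// libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()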



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/breeze-DGEMM-slow-in-spark-tp5950p5977.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
Hi Xiangrui,

  You said: "It doesn't work if you put the netlib-native jar inside an assembly
jar. Try to mark it provided in the dependencies, and use --jars to
include them with spark-submit. -Xiangrui"

  I'm not using an assembly jar which contains everything. I also marked the
breeze dependencies as provided, and manually downloaded the jars and added
them to the slave classpath, but it still doesn't work :(
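
For reference, a rough sketch of the "mark it provided and pass it with --jars" setup being suggested; the jar names follow the ones seen in this thread and the spark-submit line is only illustrative:

// build.sbt -- keep breeze out of the application assembly
libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze"         % "0.7" % "provided",
  "org.scalanlp" %% "breeze-natives" % "0.7" % "provided"
)

// then hand the jars to Spark explicitly at submit time, e.g.:
//   bin/spark-submit --class YourApp \
//     --jars breeze_2.10-0.7.jar,breeze-natives_2.10-0.7.jar,netlib-native_system-linux-x86_64-1.1-natives.jar \
//     your-app.jar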



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/breeze-DGEMM-slow-in-spark-tp5950p5979.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Configuring Spark for reduceByKey on on massive data sets

2014-05-18 Thread lukas nalezenec
Hi
Try using *reduceByKeyLocally*.
Regards
Lukas Nalezenec
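
A minimal sketch of the difference, assuming a simple pair RDD of counts. reduceByKeyLocally reduces within each partition and then merges the partial maps on the driver, so it avoids the shuffle but returns a Map to the driver and therefore only fits when the number of distinct keys is small:

import org.apache.spark.rdd.RDD

// sc is an existing SparkContext; the input path is a placeholder.
val pairs: RDD[(String, Int)] = sc.textFile("hdfs:///input").map(line => (line, 1))

// Regular reduceByKey: the result stays distributed and goes through a shuffle.
val distributed: RDD[(String, Int)] = pairs.reduceByKey(_ + _)

// reduceByKeyLocally: no shuffle, but the merged result must fit in driver memory.
val onDriver: scala.collection.Map[String, Int] = pairs.reduceByKeyLocally(_ + _)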


On Sun, May 18, 2014 at 3:33 AM, Matei Zaharia matei.zaha...@gmail.comwrote:

 Make sure you set up enough reduce partitions so you don’t overload them.
 Another thing that may help is checking whether you’ve run out of local
 disk space on the machines, and turning on spark.shuffle.consolidateFiles
 to produce fewer files. Finally, there’s been a recent fix in both branch
 0.9 and master that reduces the amount of memory used when there are small
 files (due to extra memory that was being taken by mmap()):
 https://issues.apache.org/jira/browse/SPARK-1145. You can find this in
 either the 1.0 release candidates on the dev list, or branch-0.9 in git.

 Matei
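
A rough sketch of the two knobs Matei mentions (an explicit reduce-side partition count and spark.shuffle.consolidateFiles); the partition count and path are illustrative, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("reduceByKey-tuning")
  .set("spark.shuffle.consolidateFiles", "true") // produce fewer shuffle files on disk
val sc = new SparkContext(conf)

// Ask for more reduce partitions explicitly so no single reducer is overloaded.
val counts = sc.textFile("hdfs:///input")
  .map(line => (line.split("\t")(0), 1L))
  .reduceByKey(_ + _, 2000)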

 On May 17, 2014, at 5:45 PM, Madhu ma...@madhu.com wrote:

  Daniel,
 
  How many partitions do you have?
  Are they more or less uniformly distributed?
  We have similar data volume currently running well on Hadoop MapReduce
 with
  roughly 30 nodes.
  I was planning to test it with Spark.
  I'm very interested in your findings.
 
 
 
  -
  Madhu
  https://www.linkedin.com/in/msiddalingaiah
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-Spark-for-reduceByKey-on-on-massive-data-sets-tp5966p5967.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: File list read into single RDD

2014-05-18 Thread Pat Ferrel
Doesn’t using an HDFS path pattern then restrict the URI to an HDFS URI? Since 
Spark supports several FS schemes I’m unclear about how much to assume about 
using the Hadoop file system APIs and conventions. Concretely, if I pass a 
pattern in with an HTTPS file system, will the pattern work? 

How does Spark implement its storage system? This seems to be an abstraction 
level beyond what is available in HDFS. In order to preserve that flexibility, 
what APIs should I be using? It would be easy to say HDFS only and use the HDFS 
APIs, but that would seem to limit things, especially where you would like to 
read from one cluster and write to another. This is not so easy to do with the 
HDFS APIs, or is beyond my knowledge.

If I can stick to passing URIs to sc.textFile() I’m ok but if I need to examine 
the structure of the file system, I’m unclear how I should do it without 
sacrificing Spark’s flexibility.
 
On Apr 29, 2014, at 12:55 AM, Christophe Préaud christophe.pre...@kelkoo.com 
wrote:

Hi,

You can also use any path pattern as defined here: 
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29

e.g.:
sc.textFile("{/path/to/file1,/path/to/file2}")
Christophe.

On 29/04/2014 05:07, Nicholas Chammas wrote:
 Not that I know of. We were discussing it on another thread and it came up. 
 
 I think if you look up the Hadoop FileInputFormat API (which Spark uses) 
 you'll see it mentioned there in the docs. 
 
 http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html
 
 But that's not obvious.
 
 Nick
 
 On Monday, April 28, 2014, Pat Ferrel pat.fer...@gmail.com wrote:
 Perfect. 
 
 BTW just so I know where to look next time, was that in some docs?
 
 On Apr 28, 2014, at 7:04 PM, Nicholas Chammas nicholas.cham...@gmail.com 
 wrote:
 
 Yep, as I just found out, you can also provide 
 sc.textFile() with a comma-delimited string of all the files you want to load.
 
 For example:
 
 sc.textFile("/path/to/file1,/path/to/file2")
 So once you have your list of files, concatenate their paths like that and 
 pass the single string to 
 textFile().
 
 Nick
 
 
 
 On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 sc.textFile(URI) supports reading multiple files in parallel but only with a 
 wildcard. I need to walk a dir tree, match a regex to create a list of files, 
 then I’d like to read them into a single RDD in parallel. I understand these 
 could go into separate RDDs then a union RDD can be created. Is there a way 
 to create a single RDD from a URI list?
 
 





Re: File list read into single RDD

2014-05-18 Thread Andrew Ash
Spark's sc.textFile() method
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456)
delegates to sc.hadoopFile(), which uses Hadoop's FileInputFormat.setInputPaths() call
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L546).
There is no alternate storage system; Spark just delegates to Hadoop
for the .textFile() call.

Hadoop can also support multiple URI schemes, not just hdfs:/// paths, so
you can use Spark on data in S3 using s3:/// just the same as you would
with HDFS.  See Apache's documentation on S3
(https://wiki.apache.org/hadoop/AmazonS3) for more details.

As far as interacting with a FileSystem (HDFS or other) to list files,
delete files, navigate paths, etc. from your driver program, you should be
able to just instantiate a FileSystem object and use the normal Hadoop APIs
from there.  The Apache getting started docs on reading/writing from Hadoop
DFS https://wiki.apache.org/hadoop/HadoopDfsReadWriteExample should work
the same for non-HDFS examples too.
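
For the concrete case in this thread (walk a directory, filter by a regex, read the matches as one RDD), a rough sketch along those lines; the URI, paths and regex are placeholders, and sc is an existing SparkContext:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// The scheme of the URI (hdfs://, s3n://, file://, ...) selects the FileSystem implementation.
val fs = FileSystem.get(new URI("hdfs://namenode:8020/"), new Configuration())

// List a directory and keep the files whose names match a regex.
val matching = fs.listStatus(new Path("/data/input"))
  .filter(status => status.isFile && status.getPath.getName.matches(""".*\.log$"""))
  .map(_.getPath.toString)

// textFile accepts a single comma-separated string of URIs.
val rdd = sc.textFile(matching.mkString(","))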

I do think we could use a little recipe in our documentation to make
interacting with HDFS a bit more straightforward.

Pat, if you get something that covers your case that you don't mind
sharing, we can format it for including in future Spark docs.

Cheers!
Andrew


On Sun, May 18, 2014 at 9:13 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 Doesn’t using an HDFS path pattern then restrict the URI to an HDFS URI.
 Since Spark supports several FS schemes I’m unclear about how much to
 assume about using the hadoop file systems APIs and conventions. Concretely
 if I pass a pattern in with a HTTPS file system, will the pattern work?

 How does Spark implement its storage system? This seems to be an
 abstraction level beyond what is available in HDFS. In order to preserve
 that flexibility what APIs should I be using? It would be easy to say, HDFS
 only and use HDFS APIs but that would seem to limit things. Especially
 where you would like to read from one cluster and write to another. This is
 not so easy to do inside the HDFS APIs, or is advanced beyond my knowledge.

 If I can stick to passing URIs to sc.textFile() I’m ok but if I need to
 examine the structure of the file system, I’m unclear how I should do it
 without sacrificing Spark’s flexibility.

 On Apr 29, 2014, at 12:55 AM, Christophe Préaud 
 christophe.pre...@kelkoo.com wrote:

  Hi,

 You can also use any path pattern as defined here:
 http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29

 e.g.:

 sc.textFile("{/path/to/file1,/path/to/file2}")

 Christophe.

 On 29/04/2014 05:07, Nicholas Chammas wrote:

 Not that I know of. We were discussing it on another thread and it came
 up.

  I think if you look up the Hadoop FileInputFormat API (which Spark uses)
 you'll see it mentioned there in the docs.


 http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html

  But that's not obvious.

  Nick

 On Monday, April 28, 2014, Pat Ferrel pat.fer...@gmail.com wrote:

 Perfect.

  BTW just so I know where to look next time, was that in some docs?

   On Apr 28, 2014, at 7:04 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

  Yep, as I just found out, you can also provide sc.textFile() with a
 comma-delimited string of all the files you want to load.

 For example:

 sc.textFile("/path/to/file1,/path/to/file2")

 So once you have your list of files, concatenate their paths like that
 and pass the single string to textFile().

 Nick


 On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 sc.textFile(URI) supports reading multiple files in parallel but only
 with a wildcard. I need to walk a dir tree, match a regex to create a list
 of files, then I’d like to read them into a single RDD in parallel. I
 understand these could go into separate RDDs then a union RDD can be
 created. Is there a way to create a single RDD from a URI list?









Re: Passing runtime config to workers?

2014-05-18 Thread Robert James
I see - I didn't realize that scope would work like that.  Are you
saying that any variable that is in scope of the lambda passed to map
will be automagically propagated to all workers? What if it's not
explicitly referenced in the map, only used by it?  E.g.:

def main:
  settings.setSettings
  rdd.map(x => F.f(x))

object F {
  def f(...)...
  val settings:...
}

F.f accesses F.settings, like a Singleton.  The master sets F.settings
before using F.f in a map.  Will all workers have the same F.settings
as seen by F.f?



On 5/16/14, DB Tsai dbt...@stanford.edu wrote:
 Since the env variables in the driver will not be passed into the workers, the
 easiest thing you can do is to refer to the variables directly in the workers
 from the driver.

 For example,

 val variableYouWantToUse = System.getenv("something defined in env")

 rdd.map { x =>
   // you can access `variableYouWantToUse` here
   ...
 }



 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Fri, May 16, 2014 at 1:59 PM, Robert James
 srobertja...@gmail.comwrote:

 What is a good way to pass config variables to workers?

 I've tried setting them in environment variables via spark-env.sh, but,
 as
 far as I can tell, the environment variables set there don't appear in
 workers' environments.  If I want to be able to configure all workers,
 what's a good way to do it?  For example, I want to tell all workers:
 USE_ALGO_A or USE_ALGO_B - but I don't want to recompile.




IllegalAccessError when writing to HBase?

2014-05-18 Thread Nan Zhu
Hi, all 

I tried to write data to HBase in a Spark-1.0 rc8 application.

The application is terminated due to a java.lang.IllegalAccessError. The HBase
shell works fine, and the same application works with a standalone HBase
deployment:

java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString 
at 
org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:930)
at 
org.apache.hadoop.hbase.protobuf.RequestConverter.buildGetRowOrBeforeRequest(RequestConverter.java:133)
at 
org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1466)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1236)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1110)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1067)
at 
org.apache.hadoop.hbase.client.AsyncProcess.findDestLocation(AsyncProcess.java:356)
at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:301)
at org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:955)
at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1239)
at org.apache.hadoop.hbase.client.HTable.close(HTable.java:1276)
at 
org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.close(TableOutputFormat.java:112)
at 
org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeShard$1(PairRDDFunctions.scala:720)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)


Can anyone give some hint to the issue?

Best, 

-- 
Nan Zhu



Re: IllegalAccessError when writing to HBase?

2014-05-18 Thread Nan Zhu
I tried hbase-0.96.2/0.98.1/0.98.2

HDFS version is 2.3 

-- 
Nan Zhu

On Sunday, May 18, 2014 at 4:18 PM, Nan Zhu wrote: 
 Hi, all 
 
 I tried to write data to HBase in a Spark-1.0 rc8  application, 
 
 the application is terminated due to the java.lang.IllegalAccessError, Hbase 
 shell works fine, and the same application works with a standalone Hbase 
 deployment
 
 java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString 
 at 
 org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:930)
 at 
 org.apache.hadoop.hbase.protobuf.RequestConverter.buildGetRowOrBeforeRequest(RequestConverter.java:133)
 at 
 org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1466)
 at 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1236)
 at 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1110)
 at 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1067)
 at 
 org.apache.hadoop.hbase.client.AsyncProcess.findDestLocation(AsyncProcess.java:356)
 at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:301)
 at 
 org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:955)
 at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1239)
 at org.apache.hadoop.hbase.client.HTable.close(HTable.java:1276)
 at 
 org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.close(TableOutputFormat.java:112)
 at 
 org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeShard$1(PairRDDFunctions.scala:720)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 
 
 Can anyone give some hint to the issue?
 
 Best, 
 
 -- 
 Nan Zhu
 



Re: unsubscribe

2014-05-18 Thread Andrew Ash
Hi Shangyu (and everyone else looking to unsubscribe!),

If you'd like to get off this mailing list, please send an email to
user-unsubscribe@spark.apache.org, not the regular user@spark.apache.org list.

How to use the Apache mailing list infrastructure is documented here:
https://www.apache.org/foundation/mailinglists.html
And the Spark User list specifically can be found here:
http://mail-archives.apache.org/mod_mbox/spark-user/

Thanks!
Andrew


On Sun, May 18, 2014 at 12:39 PM, Shangyu Luo lsy...@gmail.com wrote:

 Thanks!



Re: Text file and shuffle

2014-05-18 Thread Han JU
I think the shuffle is unavoidable given that the input partitions
(probably Hadoop input splits in your case) are not arranged in the way a
cogroup job needs them. But maybe you can try:

  1) Co-partition your data for the cogroup:

val par = new HashPartitioner(128)
val big = sc.textFile(..).map(...).partitionBy(par)
val small = sc.textFile(...).map(...).partitionBy(par)
...

  See discussion in
https://groups.google.com/forum/#!topic/spark-users/gUyCSoFo5RI

  2) Since you have 25GB of memory on each node, you can use a broadcast
variable in Spark to distribute the smaller dataset to each node and do the
join against it; a rough sketch follows.
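
A minimal sketch of that second option; the paths and the tab-separated parsing are placeholders, and the small side has to fit comfortably in memory:

// sc is an existing SparkContext.
// Collect the smaller dataset to the driver and broadcast it to every node.
val small = sc.textFile("hdfs:///small")
  .map { line => val Array(k, v) = line.split("\t", 2); (k, v) }
  .groupByKey()
  .collectAsMap()

val smallBc = sc.broadcast(small)

// Map-side join against the broadcast map -- the big dataset is never shuffled.
val joined = sc.textFile("hdfs:///big")
  .map { line => val Array(k, v) = line.split("\t", 2); (k, v) }
  .flatMap { case (k, v) => smallBc.value.get(k).map(vs => (k, (v, vs))) }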



2014-05-18 4:41 GMT+02:00 Puneet Lakhina puneet.lakh...@gmail.com:

 Hi,

 I'm new to Spark and I wanted to understand a few things conceptually so
 that I can optimize my Spark job. I have a large text file (~14G, 200k
 lines). This file is available on each worker node of my Spark cluster. The
 job I run calls sc.textFile(...).flatMap(...). The function that I pass
 into flatMap splits up each line from the file into a key and value. Now I
 have another text file which is smaller in size (~1.5G) but has a lot more
 lines because it has more than one value per key, spread across multiple
 lines. I call the same textFile and flatMap functions on the other file
 and then call groupByKey to have all values for a key available as a list.

 Having done this I then cogroup these 2 RDDs. I have the following
 questions

 1. Is this sequence of steps the best way to achieve what I want, i.e. a
 join across the 2 data sets?

 2. I have an 8-node cluster (25 GB memory each). The large file flatMap spawns
 about 400 odd tasks whereas the small file flatMap only spawns about 30 odd
 tasks. The large file's flatMap takes about 2-3 mins and during this time
 it seems to do about 3G of shuffle write. I want to understand if this
 shuffle write is something I can avoid. From what I have read, the shuffle
 write is a disk write. Is that correct? Also, is the reason for the shuffle
 write the fact that the partitioner for flatMap ends up having to
 redistribute the data across the cluster?

 Please let me know if I haven't provided enough information. I'm new to
 spark so if you see anything fundamental that I don't understand please
 feel free to just point me to a link that provides some detailed
 information.

 Thanks,
 Puneet




-- 
*JU Han*

Data Engineer @ Botify.com

+33 061960


First sample with Spark Streaming and three Time's?

2014-05-18 Thread Jacek Laskowski
Hi,

I'm quite new to Spark Streaming and developed the following
application to pass 4 strings, process them and shut down:

val conf = new SparkConf(false) // skip loading external settings
  .setMaster("local[1]") // run locally with one thread
  .setAppName("Spark Streaming with Scala") // name in Spark web UI
val ssc = new StreamingContext(conf, Seconds(5))
val stream: ReceiverInputDStream[String] = ssc.receiverStream(
  new Receiver[String](StorageLevel.MEMORY_ONLY_SER_2) {
    def onStart() {
      println("[ACTIVATOR] onStart called")
      store("one")
      store("two")
      store("three")
      store("four")
      stop("No more data...receiver stopped")
    }

    def onStop() {
      println("[ACTIVATOR] onStop called")
    }
  }
)
stream.count().map(cnt => "Received " + cnt + " events.").print()

ssc.start()
// ssc.awaitTermination(1000)
val stopSparkContext, stopGracefully = true
ssc.stop(stopSparkContext, stopGracefully)

I'm running it with `xsbt 'runMain StreamingApp'`, with xsbt and Spark
built from the latest sources.

What I noticed is that the app generates:

14/05/18 22:32:55 INFO DAGScheduler: Completed ResultTask(1, 0)
14/05/18 22:32:55 INFO DAGScheduler: Stage 1 (take at
DStream.scala:593) finished in 0.245 s
14/05/18 22:32:55 INFO SparkContext: Job finished: take at
DStream.scala:593, took 4.829798 s
---
Time: 140044517 ms
---

14/05/18 22:32:55 INFO DAGScheduler: Completed ResultTask(3, 0)
14/05/18 22:32:55 INFO DAGScheduler: Stage 3 (take at
DStream.scala:593) finished in 0.022 s
14/05/18 22:32:55 INFO SparkContext: Job finished: take at
DStream.scala:593, took 0.194738 s
---
Time: 1400445175000 ms
---

14/05/18 22:33:00 INFO DAGScheduler: Completed ResultTask(5, 0)
14/05/18 22:33:00 INFO DAGScheduler: Stage 5 (take at
DStream.scala:593) finished in 0.014 s
14/05/18 22:33:00 INFO SparkContext: Job finished: take at
DStream.scala:593, took 0.319387 s
---
Time: 140044518 ms
---

Why are there three jobs finished? I would expect one since after
`store` the app immediately calls `stop`. Can I have a single job that
would process these 4 `store`s?

Jacek

-- 
Jacek Laskowski | http://blog.japila.pl
Never discourage anyone who continually makes progress, no matter how
slow. Plato


Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
ok

Spark Executor Command: java -cp
:/root/ephemeral-hdfs/conf:/root/.ivy2/cache/org.scala-lang/scala-library/jars/scala-library-2.10.4.jar:/root/.ivy2/cache/org.scalanlp/breeze_2.10/jars/breeze_2.10-0.7.jar:/root/.ivy2/cache/org.scalanlp/breeze-macros_2.10/jars/breeze-macros_2.10-0.3.jar:/root/.sbt/boot/scala-2.10.3/lib/scala-reflect.jar:/root/.ivy2/cache/com.thoughtworks.paranamer/paranamer/jars/paranamer-2.2.jar:/root/.ivy2/cache/com.github.fommil.netlib/core/jars/core-1.1.2.jar:/root/.ivy2/cache/net.sourceforge.f2j/arpack_combined_all/jars/arpack_combined_all-0.1.jar:/root/.ivy2/cache/net.sourceforge.f2j/arpack_combined_all/jars/arpack_combined_all-0.1-javadoc.jar:/root/.ivy2/cache/net.sf.opencsv/opencsv/jars/opencsv-2.3.jar:/root/.ivy2/cache/com.github.rwl/jtransforms/jars/jtransforms-2.4.0.jar:/root/.ivy2/cache/junit/junit/jars/junit-4.8.2.jar:/root/.ivy2/cache/org.apache.commons/commons-math3/jars/commons-math3-3.2.jar:/root/.ivy2/cache/org.spire-math/spire_2.10/jars/spire_2.10-0.7.1.jar:/root/.ivy2/cache/org.spire-math/spire-macros_2.10/jars/spire-macros_2.10-0.7.1.jar:/root/.ivy2/cache/com.typesafe/scalalogging-slf4j_2.10/jars/scalalogging-slf4j_2.10-1.0.1.jar:/root/.ivy2/cache/org.slf4j/slf4j-api/jars/slf4j-api-1.7.2.jar:/root/.ivy2/cache/org.scalanlp/breeze-natives_2.10/jars/breeze-natives_2.10-0.7.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/jars/netlib-native_ref-osx-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/native_ref-java/jars/native_ref-java-1.1.jar:/root/.ivy2/cache/com.github.fommil/jniloader/jars/jniloader-1.1.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/jars/netlib-native_ref-linux-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-linux-i686/jars/netlib-native_ref-linux-i686-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-win-x86_64/jars/netlib-native_ref-win-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-win-i686/jars/netlib-native_ref-win-i686-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-linux-armhf/jars/netlib-native_ref-linux-armhf-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-osx-x86_64/jars/netlib-native_system-osx-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/native_system-java/jars/native_system-java-1.1.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-linux-x86_64/jars/netlib-native_system-linux-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-linux-i686/jars/netlib-native_system-linux-i686-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-linux-armhf/jars/netlib-native_system-linux-armhf-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-win-x86_64/jars/netlib-native_system-win-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-win-i686/jars/netlib-native_system-win-i686-1.1-natives.jar
::/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar
-Xms4096M -Xmx4096M



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/breeze-DGEMM-slow-in-spark-tp5950p5994.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Passing runtime config to workers?

2014-05-18 Thread DB Tsai
When you reference any variable outside the executor's scope, Spark will
automatically serialize it in the driver and send it to the executors,
which implies those variables have to be serializable.

For the example you mention, Spark will serialize object F, and if it's
not serializable, it will raise an exception.


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
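
A minimal sketch of two patterns that avoid relying on driver-side mutation of a singleton; the USE_ALGO variable echoes the earlier question and is only a placeholder:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("config-to-workers"))
val rdd = sc.parallelize(1 to 100)

// Pattern 1: read the setting on the driver into a local val. The closure then
// captures just this String rather than a whole singleton object.
val algo: String = sys.env.getOrElse("USE_ALGO", "A")
val results = rdd.map(x => if (algo == "A") x * 2 else x * 3)

// Pattern 2: wrap the settings in a small serializable object and broadcast it,
// so it is shipped once per executor rather than with every task.
case class Settings(algo: String)
val settingsBc = sc.broadcast(Settings(algo))
val results2 = rdd.map(x => if (settingsBc.value.algo == "A") x * 2 else x * 3)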


On Sun, May 18, 2014 at 12:58 PM, Robert James srobertja...@gmail.comwrote:

 I see - I didn't realize that scope would work like that.  Are you
 saying that any variable that is in scope of the lambda passed to map
 will be automagically propagated to all workers? What if it's not
 explicitly referenced in the map, only used by it.  E.g.:

 def main:
   settings.setSettings
   rdd.map(x = F.f(x))

 object F {
   def f(...)...
   val settings:...
 }

 F.f accesses F.settings, like a Singleton.  The master sets F.settings
 before using F.f in a map.  Will all workers have the same F.settings
 as seen by F.f?



 On 5/16/14, DB Tsai dbt...@stanford.edu wrote:
  Since the evn variables in driver will not be passed into workers, the
 most
  easy way you can do is refer to the variables directly in workers from
  driver.
 
  For example,
 
  val variableYouWantToUse = System.getenv(something defined in env)
 
  rdd.map(
  you can access `variableYouWantToUse` here
  )
 
 
 
  Sincerely,
 
  DB Tsai
  ---
  My Blog: https://www.dbtsai.com
  LinkedIn: https://www.linkedin.com/in/dbtsai
 
 
  On Fri, May 16, 2014 at 1:59 PM, Robert James
  srobertja...@gmail.comwrote:
 
  What is a good way to pass config variables to workers?
 
  I've tried setting them in environment variables via spark-env.sh, but,
  as
  far as I can tell, the environment variables set there don't appear in
  workers' environments.  If I want to be able to configure all workers,
  what's a good way to do it?  For example, I want to tell all workers:
  USE_ALGO_A or USE_ALGO_B - but I don't want to recompile.
 
 



sync master with slaves with bittorrent?

2014-05-18 Thread Daniel Mahler
I am launching a rather large cluster on ec2.
It seems like the launch is taking forever on

Setting up spark
RSYNC'ing /root/spark to slaves...
...

It seems that BitTorrent might be a faster way to replicate
the sizeable Spark directory to the slaves,
particularly if there are a lot of not-very-powerful slaves.

Just a thought ...

cheers
Daniel


Re: sync master with slaves with bittorrent?

2014-05-18 Thread Aaron Davidson
Out of curiosity, do you have a library in mind that would make it easy to
set up a BitTorrent network and distribute files in an rsync (i.e., apply a
diff to a tree, ideally) fashion? I'm not familiar with this space, but we
do want to minimize the complexity of our standard ec2 launch scripts to
reduce the chance of something breaking.


On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler dmah...@gmail.com wrote:

 I am launching a rather large cluster on ec2.
 It seems like the launch is taking forever on
 
 Setting up spark
 RSYNC'ing /root/spark to slaves...
 ...

 It seems that bittorrent might be a faster way to replicate
 the sizeable spark directory to the slaves
 particularly if there is a lot of not very powerful slaves.

 Just a thought ...

 cheers
 Daniel




Re: sync master with slaves with bittorrent?

2014-05-18 Thread Daniel Mahler
I am not an expert in this space either. I thought the initial rsync during
launch is really just a straight copy that does not need the tree diff. So
it seemed like having the slaves do the copying among each other would
be better than having the master copy to everyone directly. That made me
think of BitTorrent, though there may well be other systems that do this.
From the launches I did today it seems that it is taking around 1 minute
per slave to launch a cluster, which can be a problem for clusters with 10s
or 100s of slaves, particularly since on EC2 that time has to be paid for.


On Sun, May 18, 2014 at 11:54 PM, Aaron Davidson ilike...@gmail.com wrote:

 Out of curiosity, do you have a library in mind that would make it easy to
 setup a bit torrent network and distribute files in an rsync (i.e., apply a
 diff to a tree, ideally) fashion? I'm not familiar with this space, but we
 do want to minimize the complexity of our standard ec2 launch scripts to
 reduce the chance of something breaking.


 On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler dmah...@gmail.com wrote:

 I am launching a rather large cluster on ec2.
 It seems like the launch is taking forever on
 
 Setting up spark
 RSYNC'ing /root/spark to slaves...
 ...

 It seems that bittorrent might be a faster way to replicate
 the sizeable spark directory to the slaves
 particularly if there is a lot of not very powerful slaves.

 Just a thought ...

 cheers
 Daniel