java.lang.NoClassDefFoundError: org/apache/spark/deploy/worker/Worker

2014-05-18 Thread Hao Wang
Hi, all

*Spark version: bae07e3 [behind 1] fix different versions of commons-lang
dependency and apache/spark#746 addendum*

I have six worker nodes and four of them have this NoClassDefFoundError when
I use the start-slaves.sh script on my driver node. However, running ./bin/spark-class
org.apache.spark.deploy.worker.Worker spark://MASTER_IP:PORT on the worker
nodes works well.

I compile the /spark directory on the driver node and distribute it to all the
worker nodes. The paths on the different nodes are identical.

Here are the logs from one of the four failing worker nodes.

Spark Command: java -cp
::/home/wanghao/spark/conf:/home/wanghao/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.2.0.jar
-Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
org.apache.spark.deploy.worker.Worker spark://192.168.1.12:7077
--webui-port 8081


Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/deploy/worker/Worker
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.deploy.worker.Worker
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
Could not find the main class: org.apache.spark.deploy.worker.Worker.
Program will exit.

Here is spark-env.sh

export SPARK_WORKER_MEMORY=1g
export SPARK_MASTER_IP=192.168.1.12
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=2

hosts file:

127.0.0.1   localhost
192.168.1.12    sing12

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

192.168.1.11 sing11
192.168.1.59 sing59

###
# failed machines
###

192.168.1.122 host122
192.168.1.123 host123
192.168.1.124 host124
192.168.1.125 host125


Regards,
Wang Hao(王灏)

CloudTeam | School of Software Engineering
Shanghai Jiao Tong University
Address: 800 Dongchuan Road, Minhang District, Shanghai, 200240
Email: wh.s...@gmail.com


Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
Hi Xiangrui,
  I checked the stderr of the worker node; yes, it failed to load the implementation
from:
  com.github.fommil.netlib.NativeSystemBLAS...

  What do you mean by "include breeze-natives or netlib:all"?

  Things I've already done:
  1. added the breeze and breeze-natives dependencies to the sbt build file
  2. downloaded all the breeze jars to the slaves
  3. added the jars to the classpath on the slaves
  4. ran ln -s libopenblas_nehalemp-r0.2.9.rc2.so libblas.so.3 and added it to
LD_LIBRARY_PATH on the slaves

  Thank you for your help
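
For reference, a rough sketch of what "include breeze-natives or netlib:all" usually means in an sbt build; the versions below match the jars that show up in the worker classpath later in this thread, but treat them as illustrative:

// build.sbt -- a minimal sketch, not the exact build used in this thread
libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze" % "0.7",
  // breeze-natives pulls in the netlib-java native wrappers
  // (com.github.fommil.netlib.NativeSystemBLAS and friends)
  "org.scalanlp" %% "breeze-natives" % "0.7"
)

// "netlib:all" refers to netlib-java's umbrella artifact, an alternative way
// to pull in all of the native backends:
// libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()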



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/breeze-DGEMM-slow-in-spark-tp5950p5977.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
Hi Xiangrui,

  You said: "It doesn't work if you put the netlib-native jar inside an assembly
jar. Try to mark it provided in the dependencies, and use --jars to
include them with spark-submit. -Xiangrui"

  I'm not using an assembly jar which contains everything. I also marked the
breeze dependencies as provided, and manually downloaded the jars and added
them to the slave classpath, but it still doesn't work :(
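
For reference, a rough sketch of the "mark it provided and pass it with --jars" setup being suggested; the jar names follow the ones seen in this thread and the spark-submit line is only illustrative:

// build.sbt -- keep breeze out of the application assembly
libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze"         % "0.7" % "provided",
  "org.scalanlp" %% "breeze-natives" % "0.7" % "provided"
)

// then hand the jars to Spark explicitly at submit time, e.g.:
//   bin/spark-submit --class YourApp \
//     --jars breeze_2.10-0.7.jar,breeze-natives_2.10-0.7.jar,netlib-native_system-linux-x86_64-1.1-natives.jar \
//     your-app.jar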



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/breeze-DGEMM-slow-in-spark-tp5950p5979.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Configuring Spark for reduceByKey on on massive data sets

2014-05-18 Thread lukas nalezenec
Hi
Try using *reduceByKeyLocally*.
Regards
Lukas Nalezenec
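
A minimal sketch of the difference, assuming a simple pair RDD of counts. reduceByKeyLocally reduces within each partition and then merges the partial maps on the driver, so it avoids the shuffle but returns a Map to the driver and therefore only fits when the number of distinct keys is small:

import org.apache.spark.rdd.RDD

// sc is an existing SparkContext; the input path is a placeholder.
val pairs: RDD[(String, Int)] = sc.textFile("hdfs:///input").map(line => (line, 1))

// Regular reduceByKey: the result stays distributed and goes through a shuffle.
val distributed: RDD[(String, Int)] = pairs.reduceByKey(_ + _)

// reduceByKeyLocally: no shuffle, but the merged result must fit in driver memory.
val onDriver: scala.collection.Map[String, Int] = pairs.reduceByKeyLocally(_ + _)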


On Sun, May 18, 2014 at 3:33 AM, Matei Zaharia matei.zaha...@gmail.comwrote:

 Make sure you set up enough reduce partitions so you don’t overload them.
 Another thing that may help is checking whether you’ve run out of local
 disk space on the machines, and turning on spark.shuffle.consolidateFiles
 to produce fewer files. Finally, there’s been a recent fix in both branch
 0.9 and master that reduces the amount of memory used when there are small
 files (due to extra memory that was being taken by mmap()):
 https://issues.apache.org/jira/browse/SPARK-1145. You can find this in
 either the 1.0 release candidates on the dev list, or branch-0.9 in git.

 Matei
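
A rough sketch of the two knobs Matei mentions (an explicit reduce-side partition count and spark.shuffle.consolidateFiles); the partition count and path are illustrative, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("reduceByKey-tuning")
  .set("spark.shuffle.consolidateFiles", "true") // produce fewer shuffle files on disk
val sc = new SparkContext(conf)

// Ask for more reduce partitions explicitly so no single reducer is overloaded.
val counts = sc.textFile("hdfs:///input")
  .map(line => (line.split("\t")(0), 1L))
  .reduceByKey(_ + _, 2000)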

 On May 17, 2014, at 5:45 PM, Madhu ma...@madhu.com wrote:

  Daniel,
 
  How many partitions do you have?
  Are they more or less uniformly distributed?
  We have similar data volume currently running well on Hadoop MapReduce
 with
  roughly 30 nodes.
  I was planning to test it with Spark.
  I'm very interested in your findings.
 
 
 
  -
  Madhu
  https://www.linkedin.com/in/msiddalingaiah
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-Spark-for-reduceByKey-on-on-massive-data-sets-tp5966p5967.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: File list read into single RDD

2014-05-18 Thread Pat Ferrel
Doesn’t using an HDFS path pattern then restrict the URI to an HDFS URI? Since 
Spark supports several FS schemes I’m unclear about how much to assume about 
using the Hadoop file system APIs and conventions. Concretely, if I pass a 
pattern in with an HTTPS file system, will the pattern work? 

How does Spark implement its storage system? This seems to be an abstraction 
level beyond what is available in HDFS. In order to preserve that flexibility, 
what APIs should I be using? It would be easy to say HDFS only and use the HDFS 
APIs, but that would seem to limit things, especially where you would like to 
read from one cluster and write to another. This is not so easy to do with the 
HDFS APIs, or is beyond my knowledge.

If I can stick to passing URIs to sc.textFile() I’m ok but if I need to examine 
the structure of the file system, I’m unclear how I should do it without 
sacrificing Spark’s flexibility.
 
On Apr 29, 2014, at 12:55 AM, Christophe Préaud christophe.pre...@kelkoo.com 
wrote:

Hi,

You can also use any path pattern as defined here: 
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29

e.g.:
sc.textFile("{/path/to/file1,/path/to/file2}")
Christophe.

On 29/04/2014 05:07, Nicholas Chammas wrote:
 Not that I know of. We were discussing it on another thread and it came up. 
 
 I think if you look up the Hadoop FileInputFormat API (which Spark uses) 
 you'll see it mentioned there in the docs. 
 
 http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html
 
 But that's not obvious.
 
 Nick
 
 On Monday, April 28, 2014, Pat Ferrel pat.fer...@gmail.com wrote:
 Perfect. 
 
 BTW just so I know where to look next time, was that in some docs?
 
 On Apr 28, 2014, at 7:04 PM, Nicholas Chammas nicholas.cham...@gmail.com 
 wrote:
 
 Yep, as I just found out, you can also provide 
 sc.textFile() with a comma-delimited string of all the files you want to load.
 
 For example:
 
 sc.textFile("/path/to/file1,/path/to/file2")
 So once you have your list of files, concatenate their paths like that and 
 pass the single string to 
 textFile().
 
 Nick
 
 
 
 On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 sc.textFile(URI) supports reading multiple files in parallel but only with a 
 wildcard. I need to walk a dir tree, match a regex to create a list of files, 
 then I’d like to read them into a single RDD in parallel. I understand these 
 could go into separate RDDs then a union RDD can be created. Is there a way 
 to create a single RDD from a URI list?
 
 





Re: File list read into single RDD

2014-05-18 Thread Andrew Ash
Spark's sc.textFile() method
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456)
delegates to sc.hadoopFile(), which uses Hadoop's FileInputFormat.setInputPaths() call
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L546).
There is no alternate storage system; Spark just delegates to Hadoop
for the .textFile() call.

Hadoop can also support multiple URI schemes, not just hdfs:/// paths, so
you can use Spark on data in S3 using s3:/// just the same as you would
with HDFS.  See Apache's documentation on S3
(https://wiki.apache.org/hadoop/AmazonS3) for more details.

As far as interacting with a FileSystem (HDFS or other) to list files,
delete files, navigate paths, etc. from your driver program, you should be
able to just instantiate a FileSystem object and use the normal Hadoop APIs
from there.  The Apache getting started docs on reading/writing from Hadoop
DFS https://wiki.apache.org/hadoop/HadoopDfsReadWriteExample should work
the same for non-HDFS examples too.
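
For the concrete case in this thread (walk a directory, filter by a regex, read the matches as one RDD), a rough sketch along those lines; the URI, paths and regex are placeholders, and sc is an existing SparkContext:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// The scheme of the URI (hdfs://, s3n://, file://, ...) selects the FileSystem implementation.
val fs = FileSystem.get(new URI("hdfs://namenode:8020/"), new Configuration())

// List a directory and keep the files whose names match a regex.
val matching = fs.listStatus(new Path("/data/input"))
  .filter(status => status.isFile && status.getPath.getName.matches(""".*\.log$"""))
  .map(_.getPath.toString)

// textFile accepts a single comma-separated string of URIs.
val rdd = sc.textFile(matching.mkString(","))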

I do think we could use a little recipe in our documentation to make
interacting with HDFS a bit more straightforward.

Pat, if you get something that covers your case that you don't mind
sharing, we can format it for including in future Spark docs.

Cheers!
Andrew


On Sun, May 18, 2014 at 9:13 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 Doesn’t using an HDFS path pattern then restrict the URI to an HDFS URI.
 Since Spark supports several FS schemes I’m unclear about how much to
 assume about using the hadoop file systems APIs and conventions. Concretely
 if I pass a pattern in with a HTTPS file system, will the pattern work?

 How does Spark implement its storage system? This seems to be an
 abstraction level beyond what is available in HDFS. In order to preserve
 that flexibility what APIs should I be using? It would be easy to say, HDFS
 only and use HDFS APIs but that would seem to limit things. Especially
 where you would like to read from one cluster and write to another. This is
 not so easy to do inside the HDFS APIs, or is advanced beyond my knowledge.

 If I can stick to passing URIs to sc.textFile() I’m ok but if I need to
 examine the structure of the file system, I’m unclear how I should do it
 without sacrificing Spark’s flexibility.

 On Apr 29, 2014, at 12:55 AM, Christophe Préaud 
 christophe.pre...@kelkoo.com wrote:

  Hi,

 You can also use any path pattern as defined here:
 http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29

 e.g.:

 sc.textFile("{/path/to/file1,/path/to/file2}")

 Christophe.

 On 29/04/2014 05:07, Nicholas Chammas wrote:

 Not that I know of. We were discussing it on another thread and it came
 up.

  I think if you look up the Hadoop FileInputFormat API (which Spark uses)
 you'll see it mentioned there in the docs.


 http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html

  But that's not obvious.

  Nick

 On Monday, April 28, 2014, Pat Ferrel pat.fer...@gmail.com wrote:

 Perfect.

  BTW just so I know where to look next time, was that in some docs?

   On Apr 28, 2014, at 7:04 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

  Yep, as I just found out, you can also provide sc.textFile() with a
 comma-delimited string of all the files you want to load.

 For example:

 sc.textFile("/path/to/file1,/path/to/file2")

 So once you have your list of files, concatenate their paths like that
 and pass the single string to textFile().

 Nick


 On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 sc.textFile(URI) supports reading multiple files in parallel but only
 with a wildcard. I need to walk a dir tree, match a regex to create a list
 of files, then I’d like to read them into a single RDD in parallel. I
 understand these could go into separate RDDs then a union RDD can be
 created. Is there a way to create a single RDD from a URI list?









Re: Passing runtime config to workers?

2014-05-18 Thread Robert James
I see - I didn't realize that scope would work like that.  Are you
saying that any variable that is in scope of the lambda passed to map
will be automagically propagated to all workers? What if it's not
explicitly referenced in the map, only used by it?  E.g.:

def main:
  settings.setSettings
  rdd.map(x => F.f(x))

object F {
  def f(...)...
  val settings:...
}

F.f accesses F.settings, like a Singleton.  The master sets F.settings
before using F.f in a map.  Will all workers have the same F.settings
as seen by F.f?



On 5/16/14, DB Tsai dbt...@stanford.edu wrote:
 Since the env variables in the driver will not be passed into the workers, the
 easiest thing you can do is to refer to the variables directly in the workers
 from the driver.

 For example,

 val variableYouWantToUse = System.getenv("something defined in env")

 rdd.map { x =>
   // you can access `variableYouWantToUse` here
   ...
 }



 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Fri, May 16, 2014 at 1:59 PM, Robert James
 srobertja...@gmail.comwrote:

 What is a good way to pass config variables to workers?

 I've tried setting them in environment variables via spark-env.sh, but,
 as
 far as I can tell, the environment variables set there don't appear in
 workers' environments.  If I want to be able to configure all workers,
 what's a good way to do it?  For example, I want to tell all workers:
 USE_ALGO_A or USE_ALGO_B - but I don't want to recompile.




IllegalAccessError when writing to HBase?

2014-05-18 Thread Nan Zhu
Hi, all 

I tried to write data to HBase in a Spark-1.0 rc8 application.

The application is terminated due to a java.lang.IllegalAccessError. The HBase
shell works fine, and the same application works with a standalone HBase
deployment:

java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString 
at 
org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:930)
at 
org.apache.hadoop.hbase.protobuf.RequestConverter.buildGetRowOrBeforeRequest(RequestConverter.java:133)
at 
org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1466)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1236)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1110)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1067)
at 
org.apache.hadoop.hbase.client.AsyncProcess.findDestLocation(AsyncProcess.java:356)
at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:301)
at org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:955)
at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1239)
at org.apache.hadoop.hbase.client.HTable.close(HTable.java:1276)
at 
org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.close(TableOutputFormat.java:112)
at 
org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeShard$1(PairRDDFunctions.scala:720)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)


Can anyone give some hint to the issue?

Best, 

-- 
Nan Zhu



Re: IllegalAccessError when writing to HBase?

2014-05-18 Thread Nan Zhu
I tried hbase-0.96.2/0.98.1/0.98.2

HDFS version is 2.3 

-- 
Nan Zhu

On Sunday, May 18, 2014 at 4:18 PM, Nan Zhu wrote: 
 Hi, all 
 
 I tried to write data to HBase in a Spark-1.0 rc8  application, 
 
 the application is terminated due to the java.lang.IllegalAccessError, Hbase 
 shell works fine, and the same application works with a standalone Hbase 
 deployment
 
 java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString 
 at 
 org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:930)
 at 
 org.apache.hadoop.hbase.protobuf.RequestConverter.buildGetRowOrBeforeRequest(RequestConverter.java:133)
 at 
 org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1466)
 at 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1236)
 at 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1110)
 at 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1067)
 at 
 org.apache.hadoop.hbase.client.AsyncProcess.findDestLocation(AsyncProcess.java:356)
 at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:301)
 at 
 org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:955)
 at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1239)
 at org.apache.hadoop.hbase.client.HTable.close(HTable.java:1276)
 at 
 org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.close(TableOutputFormat.java:112)
 at 
 org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeShard$1(PairRDDFunctions.scala:720)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 
 
 Can anyone give some hint to the issue?
 
 Best, 
 
 -- 
 Nan Zhu
 



Re: unsubscribe

2014-05-18 Thread Andrew Ash
Hi Shangyu (and everyone else looking to unsubscribe!),

If you'd like to get off this mailing list, please send an email to
user-unsubscribe@spark.apache.org, not the regular user@spark.apache.org list.

How to use the Apache mailing list infrastructure is documented here:
https://www.apache.org/foundation/mailinglists.html
And the Spark User list specifically can be found here:
http://mail-archives.apache.org/mod_mbox/spark-user/

Thanks!
Andrew


On Sun, May 18, 2014 at 12:39 PM, Shangyu Luo lsy...@gmail.com wrote:

 Thanks!



Re: Text file and shuffle

2014-05-18 Thread Han JU
I think the shuffle is unavoidable given that the input partitions
(probably Hadoop input splits in your case) are not arranged in the way a
cogroup job needs them. But maybe you can try:

  1) Co-partition your data for the cogroup:

val par = new HashPartitioner(128)
val big = sc.textFile(..).map(...).partitionBy(par)
val small = sc.textFile(...).map(...).partitionBy(par)
...

  See discussion in
https://groups.google.com/forum/#!topic/spark-users/gUyCSoFo5RI

  2) Since you have 25GB of memory on each node, you can use a broadcast
variable in Spark to distribute the smaller dataset to each node and do the
join against it; a rough sketch follows.
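
A minimal sketch of that second option; the paths and the tab-separated parsing are placeholders, and the small side has to fit comfortably in memory:

// sc is an existing SparkContext.
// Collect the smaller dataset to the driver and broadcast it to every node.
val small = sc.textFile("hdfs:///small")
  .map { line => val Array(k, v) = line.split("\t", 2); (k, v) }
  .groupByKey()
  .collectAsMap()

val smallBc = sc.broadcast(small)

// Map-side join against the broadcast map -- the big dataset is never shuffled.
val joined = sc.textFile("hdfs:///big")
  .map { line => val Array(k, v) = line.split("\t", 2); (k, v) }
  .flatMap { case (k, v) => smallBc.value.get(k).map(vs => (k, (v, vs))) }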



2014-05-18 4:41 GMT+02:00 Puneet Lakhina puneet.lakh...@gmail.com:

 Hi,

 I'm new to Spark and I wanted to understand a few things conceptually so
 that I can optimize my Spark job. I have a large text file (~14G, 200k
 lines). This file is available on each worker node of my Spark cluster. The
 job I run calls sc.textFile(...).flatMap(...). The function that I pass
 into flatMap splits up each line from the file into a key and value. Now I
 have another text file which is smaller in size (~1.5G) but has a lot more
 lines because it has more than one value per key, spread across multiple
 lines. I call the same textFile and flatMap functions on the other file
 and then call groupByKey to have all values for a key available as a list.

 Having done this I then cogroup these 2 RDDs. I have the following
 questions

 1. Is this sequence of steps the best way to achieve what I want, i.e. a
 join across the 2 data sets?

 2. I have an 8-node cluster (25 GB memory each). The large file flatMap spawns
 about 400 odd tasks whereas the small file flatMap only spawns about 30 odd
 tasks. The large file's flatMap takes about 2-3 mins and during this time
 it seems to do about 3G of shuffle write. I want to understand if this
 shuffle write is something I can avoid. From what I have read, the shuffle
 write is a disk write. Is that correct? Also, is the reason for the shuffle
 write the fact that the partitioner for flatMap ends up having to
 redistribute the data across the cluster?

 Please let me know if I haven't provided enough information. I'm new to
 spark so if you see anything fundamental that I don't understand please
 feel free to just point me to a link that provides some detailed
 information.

 Thanks,
 Puneet




-- 
*JU Han*

Data Engineer @ Botify.com

+33 061960


First sample with Spark Streaming and three Time's?

2014-05-18 Thread Jacek Laskowski
Hi,

I'm quite new to Spark Streaming and developed the following
application to pass 4 strings, process them and shut down:

val conf = new SparkConf(false) // skip loading external settings
  .setMaster("local[1]") // run locally with one thread
  .setAppName("Spark Streaming with Scala") // name in Spark web UI
val ssc = new StreamingContext(conf, Seconds(5))
val stream: ReceiverInputDStream[String] = ssc.receiverStream(
  new Receiver[String](StorageLevel.MEMORY_ONLY_SER_2) {
    def onStart() {
      println("[ACTIVATOR] onStart called")
      store("one")
      store("two")
      store("three")
      store("four")
      stop("No more data...receiver stopped")
    }

    def onStop() {
      println("[ACTIVATOR] onStop called")
    }
  }
)
stream.count().map(cnt => "Received " + cnt + " events.").print()

ssc.start()
// ssc.awaitTermination(1000)
val stopSparkContext, stopGracefully = true
ssc.stop(stopSparkContext, stopGracefully)

I'm running it with `xsbt 'runMain StreamingApp'`, with xsbt and Spark
built from the latest sources.

What I noticed is that the app generates:

14/05/18 22:32:55 INFO DAGScheduler: Completed ResultTask(1, 0)
14/05/18 22:32:55 INFO DAGScheduler: Stage 1 (take at
DStream.scala:593) finished in 0.245 s
14/05/18 22:32:55 INFO SparkContext: Job finished: take at
DStream.scala:593, took 4.829798 s
---
Time: 140044517 ms
---

14/05/18 22:32:55 INFO DAGScheduler: Completed ResultTask(3, 0)
14/05/18 22:32:55 INFO DAGScheduler: Stage 3 (take at
DStream.scala:593) finished in 0.022 s
14/05/18 22:32:55 INFO SparkContext: Job finished: take at
DStream.scala:593, took 0.194738 s
---
Time: 1400445175000 ms
---

14/05/18 22:33:00 INFO DAGScheduler: Completed ResultTask(5, 0)
14/05/18 22:33:00 INFO DAGScheduler: Stage 5 (take at
DStream.scala:593) finished in 0.014 s
14/05/18 22:33:00 INFO SparkContext: Job finished: take at
DStream.scala:593, took 0.319387 s
---
Time: 140044518 ms
---

Why are there three jobs finished? I would expect one since after
`store` the app immediately calls `stop`. Can I have a single job that
would process these 4 `store`s?

Jacek

-- 
Jacek Laskowski | http://blog.japila.pl
Never discourage anyone who continually makes progress, no matter how
slow. Plato


Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
ok

Spark Executor Command: java -cp
:/root/ephemeral-hdfs/conf:/root/.ivy2/cache/org.scala-lang/scala-library/jars/scala-library-2.10.4.jar:/root/.ivy2/cache/org.scalanlp/breeze_2.10/jars/breeze_2.10-0.7.jar:/root/.ivy2/cache/org.scalanlp/breeze-macros_2.10/jars/breeze-macros_2.10-0.3.jar:/root/.sbt/boot/scala-2.10.3/lib/scala-reflect.jar:/root/.ivy2/cache/com.thoughtworks.paranamer/paranamer/jars/paranamer-2.2.jar:/root/.ivy2/cache/com.github.fommil.netlib/core/jars/core-1.1.2.jar:/root/.ivy2/cache/net.sourceforge.f2j/arpack_combined_all/jars/arpack_combined_all-0.1.jar:/root/.ivy2/cache/net.sourceforge.f2j/arpack_combined_all/jars/arpack_combined_all-0.1-javadoc.jar:/root/.ivy2/cache/net.sf.opencsv/opencsv/jars/opencsv-2.3.jar:/root/.ivy2/cache/com.github.rwl/jtransforms/jars/jtransforms-2.4.0.jar:/root/.ivy2/cache/junit/junit/jars/junit-4.8.2.jar:/root/.ivy2/cache/org.apache.commons/commons-math3/jars/commons-math3-3.2.jar:/root/.ivy2/cache/org.spire-math/spire_2.10/jars/spire_2.10-0.7.1.jar:/root/.ivy2/cache/org.spire-math/spire-macros_2.10/jars/spire-macros_2.10-0.7.1.jar:/root/.ivy2/cache/com.typesafe/scalalogging-slf4j_2.10/jars/scalalogging-slf4j_2.10-1.0.1.jar:/root/.ivy2/cache/org.slf4j/slf4j-api/jars/slf4j-api-1.7.2.jar:/root/.ivy2/cache/org.scalanlp/breeze-natives_2.10/jars/breeze-natives_2.10-0.7.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/jars/netlib-native_ref-osx-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/native_ref-java/jars/native_ref-java-1.1.jar:/root/.ivy2/cache/com.github.fommil/jniloader/jars/jniloader-1.1.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/jars/netlib-native_ref-linux-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-linux-i686/jars/netlib-native_ref-linux-i686-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-win-x86_64/jars/netlib-native_ref-win-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-win-i686/jars/netlib-native_ref-win-i686-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-linux-armhf/jars/netlib-native_ref-linux-armhf-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-osx-x86_64/jars/netlib-native_system-osx-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/native_system-java/jars/native_system-java-1.1.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-linux-x86_64/jars/netlib-native_system-linux-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-linux-i686/jars/netlib-native_system-linux-i686-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-linux-armhf/jars/netlib-native_system-linux-armhf-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-win-x86_64/jars/netlib-native_system-win-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-win-i686/jars/netlib-native_system-win-i686-1.1-natives.jar
::/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar
-Xms4096M -Xmx4096M



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/breeze-DGEMM-slow-in-spark-tp5950p5994.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Passing runtime config to workers?

2014-05-18 Thread DB Tsai
When you reference any variable outside the executor's scope, Spark will
automatically serialize it in the driver and send it to the executors,
which implies those variables have to be serializable.

For the example you mention, Spark will serialize object F, and if it's
not serializable, it will raise an exception.


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
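
A minimal sketch of two patterns that avoid relying on driver-side mutation of a singleton; the USE_ALGO variable echoes the earlier question and is only a placeholder:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("config-to-workers"))
val rdd = sc.parallelize(1 to 100)

// Pattern 1: read the setting on the driver into a local val. The closure then
// captures just this String rather than a whole singleton object.
val algo: String = sys.env.getOrElse("USE_ALGO", "A")
val results = rdd.map(x => if (algo == "A") x * 2 else x * 3)

// Pattern 2: wrap the settings in a small serializable object and broadcast it,
// so it is shipped once per executor rather than with every task.
case class Settings(algo: String)
val settingsBc = sc.broadcast(Settings(algo))
val results2 = rdd.map(x => if (settingsBc.value.algo == "A") x * 2 else x * 3)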


On Sun, May 18, 2014 at 12:58 PM, Robert James srobertja...@gmail.comwrote:

 I see - I didn't realize that scope would work like that.  Are you
 saying that any variable that is in scope of the lambda passed to map
 will be automagically propagated to all workers? What if it's not
 explicitly referenced in the map, only used by it.  E.g.:

 def main:
   settings.setSettings
   rdd.map(x = F.f(x))

 object F {
   def f(...)...
   val settings:...
 }

 F.f accesses F.settings, like a Singleton.  The master sets F.settings
 before using F.f in a map.  Will all workers have the same F.settings
 as seen by F.f?



 On 5/16/14, DB Tsai dbt...@stanford.edu wrote:
  Since the evn variables in driver will not be passed into workers, the
 most
  easy way you can do is refer to the variables directly in workers from
  driver.
 
  For example,
 
  val variableYouWantToUse = System.getenv(something defined in env)
 
  rdd.map(
  you can access `variableYouWantToUse` here
  )
 
 
 
  Sincerely,
 
  DB Tsai
  ---
  My Blog: https://www.dbtsai.com
  LinkedIn: https://www.linkedin.com/in/dbtsai
 
 
  On Fri, May 16, 2014 at 1:59 PM, Robert James
  srobertja...@gmail.comwrote:
 
  What is a good way to pass config variables to workers?
 
  I've tried setting them in environment variables via spark-env.sh, but,
  as
  far as I can tell, the environment variables set there don't appear in
  workers' environments.  If I want to be able to configure all workers,
  what's a good way to do it?  For example, I want to tell all workers:
  USE_ALGO_A or USE_ALGO_B - but I don't want to recompile.
 
 



sync master with slaves with bittorrent?

2014-05-18 Thread Daniel Mahler
I am launching a rather large cluster on ec2.
It seems like the launch is taking forever on

Setting up spark
RSYNC'ing /root/spark to slaves...
...

It seems that BitTorrent might be a faster way to replicate
the sizeable Spark directory to the slaves,
particularly if there are a lot of not-very-powerful slaves.

Just a thought ...

cheers
Daniel


Re: sync master with slaves with bittorrent?

2014-05-18 Thread Aaron Davidson
Out of curiosity, do you have a library in mind that would make it easy to
set up a BitTorrent network and distribute files in an rsync (i.e., apply a
diff to a tree, ideally) fashion? I'm not familiar with this space, but we
do want to minimize the complexity of our standard ec2 launch scripts to
reduce the chance of something breaking.


On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler dmah...@gmail.com wrote:

 I am launching a rather large cluster on ec2.
 It seems like the launch is taking forever on
 
 Setting up spark
 RSYNC'ing /root/spark to slaves...
 ...

 It seems that bittorrent might be a faster way to replicate
 the sizeable spark directory to the slaves
 particularly if there is a lot of not very powerful slaves.

 Just a thought ...

 cheers
 Daniel




Re: sync master with slaves with bittorrent?

2014-05-18 Thread Daniel Mahler
I am not an expert in this space either. I thought the initial rsync during
launch is really just a straight copy that does not need the tree diff. So
it seemed like having the slaves do the copying among each other would
be better than having the master copy to everyone directly. That made me
think of BitTorrent, though there may well be other systems that do this.
From the launches I did today it seems that it is taking around 1 minute
per slave to launch a cluster, which can be a problem for clusters with 10s
or 100s of slaves, particularly since on EC2 that time has to be paid for.


On Sun, May 18, 2014 at 11:54 PM, Aaron Davidson ilike...@gmail.com wrote:

 Out of curiosity, do you have a library in mind that would make it easy to
 setup a bit torrent network and distribute files in an rsync (i.e., apply a
 diff to a tree, ideally) fashion? I'm not familiar with this space, but we
 do want to minimize the complexity of our standard ec2 launch scripts to
 reduce the chance of something breaking.


 On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler dmah...@gmail.com wrote:

 I am launching a rather large cluster on ec2.
 It seems like the launch is taking forever on
 
 Setting up spark
 RSYNC'ing /root/spark to slaves...
 ...

 It seems that bittorrent might be a faster way to replicate
 the sizeable spark directory to the slaves
 particularly if there is a lot of not very powerful slaves.

 Just a thought ...

 cheers
 Daniel