1.0.0 Release Date?

2014-05-14 Thread bhusted
Can anyone comment on the anticipated date, or worst-case timeframe, for when
Spark 1.0.0 will be released?





Re: streaming on hdfs can detected all new file, but the sum of all the rdd.count() not equals which had detected

2014-05-14 Thread zzzzzqf12345
thanks for reply~~

I have solved the problem and found the reason: I was using the Master
node to upload files to HDFS, and this can take up a lot of the Master's
network resources. When I switched to uploading these files from another machine
that is not part of the cluster, I got the correct result.

QingFeng





Re: Spark on Yarn - A small issue !

2014-05-14 Thread Tom Graves
You need to look at the log files for YARN.  Generally this can be done with 
yarn logs -applicationId your_app_id.  That only works if you have log 
aggregation enabled, though.  You should be able to see at least the application 
master logs through the YARN ResourceManager web UI.  I would try that first. 

If that doesn't work you can turn on debug in the nodemanager:

To review per-container launch environment, increase 
yarn.nodemanager.delete.debug-delay-sec to a large value (e.g. 36000), and then 
access the application cache through yarn.nodemanager.local-dirs on the nodes 
on which containers are launched. This directory contains the launch script, 
jars, and all environment variables used for launching each container. This 
process is useful for debugging classpath problems in particular. (Note that 
enabling this requires admin privileges on cluster settings and a restart of 
all node managers. Thus, this is not applicable to hosted clusters).



Tom


On Monday, May 12, 2014 9:38 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:
 
Hi All, 

I wanted to launch Spark on YARN interactively, in yarn-client mode.

With default settings in yarn-site.xml and spark-env.sh, I followed this link: 
http://spark.apache.org/docs/0.8.1/running-on-yarn.html

I get the correct value of pi when I run without launching the shell.

When I launch the shell with the following command,
SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar
 \
SPARK_YARN_APP_JAR=examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar
 \
MASTER=yarn-client ./spark-shell
and try to create RDDs and perform some action on them, I get nothing. After some 
time, the tasks fail.

LogFile of spark: 
519095 14/05/12 13:30:40 INFO YarnClientClusterScheduler: 
YarnClientClusterScheduler.postStartHook done
519096 14/05/12 13:30:40 INFO BlockManagerMasterActor$BlockManagerInfo: 
Registering block manager s1:38355 with 324.4 MB RAM
519097 14/05/12 13:31:38 INFO MemoryStore: ensureFreeSpace(202584) called with 
curMem=0, maxMem=340147568
519098 14/05/12 13:31:38 INFO MemoryStore: Block broadcast_0 stored as values 
to memory (estimated size 197.8 KB, free 324.2 MB)
519099 14/05/12 13:31:49 INFO FileInputFormat: Total input paths to process : 1
519100 14/05/12 13:31:49 INFO NetworkTopology: Adding a new node: 
/default-rack/192.168.1.100:50010
519101 14/05/12 13:31:49 INFO SparkContext: Starting job: top at console:15
519102 14/05/12 13:31:49 INFO DAGScheduler: Got job 0 (top at console:15) 
with 4 output partitions (allowLocal=false)
519103 14/05/12 13:31:49 INFO DAGScheduler: Final stage: Stage 0 (top at 
console:15)
519104 14/05/12 13:31:49 INFO DAGScheduler: Parents of final stage: List()
519105 14/05/12 13:31:49 INFO DAGScheduler: Missing parents: List()
519106 14/05/12 13:31:49 INFO DAGScheduler: Submitting Stage 0 
(MapPartitionsRDD[2] at top at console:15), which has no missing parents
519107 14/05/12 13:31:49 INFO DAGScheduler: Submitting 4 missing tasks from 
Stage 0 (MapPartitionsRDD[2] at top at console:15)
519108 14/05/12 13:31:49 INFO YarnClientClusterScheduler: Adding task set 0.0 
with 4 tasks
519109 14/05/12 13:31:49 INFO RackResolver: Resolved s1 to /default-rack
519110 14/05/12 13:31:49 INFO ClusterTaskSetManager: Starting task 0.0:3 as TID 
0 on executor 1: s1 (PROCESS_LOCAL)
519111 14/05/12 13:31:49 INFO ClusterTaskSetManager: Serialized task 0.0:3 as 
1811 bytes in 4 ms
519112 14/05/12 13:31:49 INFO ClusterTaskSetManager: Starting task 0.0:0 as TID 
1 on executor 1: s1 (NODE_LOCAL)
519113 14/05/12 13:31:49 INFO ClusterTaskSetManager: Serialized task 0.0:0 as 
1811 bytes in 1 ms
519114 14/05/12 13:32:18 INFO YarnClientSchedulerBackend: Executor 1 
disconnected, so removing it
519115 14/05/12 13:32:18 ERROR YarnClientClusterScheduler: Lost executor 1 on 
s1: remote Akka client shutdown
519116 14/05/12 13:32:18 INFO ClusterTaskSetManager: Re-queueing tasks for 1 
from TaskSet 0.0
519117 14/05/12 13:32:18 WARN ClusterTaskSetManager: Lost TID 1 (task 0.0:0)
519118 14/05/12 13:32:18 WARN ClusterTaskSetManager: Lost TID 0 (task 0.0:3)
519119 14/05/12 13:32:18 INFO DAGScheduler: Executor lost: 1 (epoch 0)
519120 14/05/12 13:32:18 INFO BlockManagerMasterActor: Trying to remove 
executor 1 from BlockManagerMaster.
519121 14/05/12 13:32:18 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor


 Do I need to set any other environment variable specifically for Spark on YARN? What 
could be the issue?


Can anyone please help me in this regard.

Thanks in Advance !!

Re: How to run shark?

2014-05-14 Thread Sophia
My configuration is as follows. The slave nodes have been configured, but I
do not know what has happened to Shark. Can you help me?
shark-env.sh
export SPARK_USER_HOME=/root
export SPARK_MEM=2g
export SCALA_HOME=/root/scala-2.11.0-RC4
export SHARK_MASTER_MEM=1g
export HIVE_CONF_DIR=/usr/lib/hive/conf
export HIVE_HOME=/usr/lib/hive
export HADOOP_HOME=/usr/lib/hadoop
export SPARK_HOME=/root/spark-0.9.1
export MASTER=spark://192.168.10.220:7077
export SHARK_EXEC_MODE=yarn

SPARK_JAVA_OPTS=" -Dspark.local.dir=/tmp "
SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 "
SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps "
export SPARK_JAVA_OPTS
export
SPARK_ASSEMBLY_JAR=/root/spark-0.9.1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
export
SHARK_ASSEMBLY_JAR=/root/shark-0.9.1-bin-hadoop2/target/scala-2.10/shark_2.10-0.9.1.jar

Best regards,





EndpointWriter: AssociationError

2014-05-14 Thread Laurent Thoulon
Hi, 

I've been trying to run my newly created Spark job on my local master instead 
of just running it using Maven, and I haven't been able to make it work. My main 
issue seems to be related to this error: 

14/05/14 09:34:26 ERROR EndpointWriter: AssociationError 
[akka.tcp://sparkMaster@devsrv:7077] -> [akka.tcp://driverClient@devsrv.mydomain.priv:50237]: 
Error [Association failed with [akka.tcp://driverClient@devsrv.mydomain.priv:50237]] [ 
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://driverClient@devsrv.mydomain.priv:50237] 
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: devsrv.mydomain.priv/172.16.202.246:50237 
] 

FYI, the port 50237 is always changing, so I'm not sure what it's supposed to 
be. 
I get this kind of error from many commands, including 
./bin/spark-class org.apache.spark.deploy.Client kill spark://devsrv:7077 
driver-20140513165819-0001 
and 
./bin/spark-class org.apache.spark.deploy.Client launch spark://devsrv:7077 
file:///path/to/my/rspark-jobs-1.0.0.0-jar-with-dependencies.jar 
my.jobs.spark.Indexer 
I get this error from the kill even if the driver has already finished or does 
not exist. 
The launch actually works (or seems to), as I can see my driver appearing on the 
web UI as SUBMITTED. 

When I then deploy a worker, my job starts running and I have no error in its 
log. The job, though, never ends and hangs after it starts to spill to disk. 
2014-05-14 09:46:31 INFO BlockFetcherIterator$BasicBlockFetcherIterator:50 - 
Started 0 remote gets in 45 ms 
2014-05-14 09:46:33 WARN ExternalAppendOnlyMap:62 - Spilling in-memory map of 
147 MB to disk (1 time so far) 
2014-05-14 09:46:33 WARN ExternalAppendOnlyMap:62 - Spilling in-memory map of 
130 MB to disk (1 time so far) 
2014-05-14 09:46:33 WARN ExternalAppendOnlyMap:62 - Spilling in-memory map of 
118 MB to disk (1 time so far) 

and the worker ends up crashing with those errors: 
14/05/14 09:51:23 ERROR OneForOneStrategy: FAILED (of class 
scala.Enumeration$Val) 
scala.MatchError: FAILED (of class scala.Enumeration$Val) 
at 
org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:277)
 
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) 
at akka.actor.ActorCell.invoke(ActorCell.scala:456) 
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) 
at akka.dispatch.Mailbox.run(Mailbox.scala:219) 
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 
14/05/14 09:51:24 ERROR EndpointWriter: AssociationError 
[akka.tcp://sparkwor...@devsrv.mydomain.priv:35607] - 
[akka.tcp://dri...@devsrv.mydomain.priv:45792]: Error [Association failed with 
[akka.tcp://dri...@devsrv.mydomain.priv:45792]] [ 
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://dri...@devsrv.mydomain.priv:45792] 
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: devsrv.mydomain.priv/172.XXX.XXX.XXX:45792 
] 
14/05/14 09:51:24 ERROR EndpointWriter: AssociationError 
[akka.tcp://sparkwor...@devsrv.mydomain.priv:35607] - 
[akka.tcp://dri...@devsrv.mydomain.priv:45792]: Error [Association failed with 
[akka.tcp://dri...@devsrv.mydomain.priv:45792]] [ 
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://dri...@devsrv.mydomain.priv:45792] 
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: devsrv.mydomain.priv/172.XXX.XXX.XXX:45792 
] 
14/05/14 09:51:24 ERROR EndpointWriter: AssociationError 
[akka.tcp://sparkwor...@devsrv.mydomain.priv:35607] - 
[akka.tcp://dri...@devsrv.mydomain.priv:45792]: Error [Association failed with 
[akka.tcp://dri...@devsrv.mydomain.priv:45792]] [ 
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://dri...@devsrv.mydomain.priv:45792] 
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: devsrv.mydomain.priv/172.XXX.XXX.XXX:45792 

I'm almost sure those errors (or at least some of them) have nothing to do with 
our jar (as we get them even when killing a nonexistent driver). 
We're using spark 0.9.1 for hadoop 1. 

Any suggestions ? 
Thanks 


Regards 
Laurent 


Re: 1.0.0 Release Date?

2014-05-14 Thread Patrick Wendell
Hey Brian,

We've had a fairly stable 1.0 branch for a while now. I started
voting on the dev list last night... voting can take some time, but it
usually wraps up anywhere from a few days to a few weeks.

However, you can get started right now with the release candidates.
These are likely to be almost identical to the final release.

- Patrick

On Tue, May 13, 2014 at 9:40 AM, bhusted brian.hus...@gmail.com wrote:
 Can anyone comment on the anticipated date or worse case timeframe for when
 Spark 1.0.0 will be released?



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/1-0-0-Release-Date-tp5664.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Distribute jar dependencies via sc.AddJar(fileName)

2014-05-14 Thread Xiangrui Meng
I don't know whether this would fix the problem. In v0.9, you need
`yarn-standalone` instead of `yarn-cluster`.

See 
https://github.com/apache/spark/commit/328c73d037c17440c2a91a6c88b4258fbefa0c08

On Tue, May 13, 2014 at 11:36 PM, Xiangrui Meng men...@gmail.com wrote:
 Does v0.9 support yarn-cluster mode? I checked SparkContext.scala in
 v0.9.1 and didn't see special handling of `yarn-cluster`. -Xiangrui

 On Mon, May 12, 2014 at 11:14 AM, DB Tsai dbt...@stanford.edu wrote:
 We're deploying Spark in yarn-cluster mode (Spark 0.9), and we add jar
 dependencies in command line with --addJars option. However, those
 external jars are only available in the driver (application running in
 hadoop), and not available in the executors (workers).

 After doing some research, we realize that we've to push those jars to
 executors in driver via sc.AddJar(fileName). Although in the driver's log
 (see the following), the jar is successfully added in the http server in the
 driver, and I confirm that it's downloadable from any machine in the
 network, I still get `java.lang.NoClassDefFoundError` in the executors.

 14/05/09 14:51:41 INFO spark.SparkContext: Added JAR
 analyticshadoop-eba5cdce1.jar at
 http://10.0.0.56:42522/jars/analyticshadoop-eba5cdce1.jar with timestamp
 1399672301568

 Then I check the log in the executors, and I don't find anything `Fetching
 file with timestamp timestamp`, which implies something is wrong; the
 executors are not downloading the external jars.

 Any suggestion what we can look at?

 After digging into how spark distributes external jars, I wonder the
 scalability of this approach. What if there are thousands of nodes
 downloading the jar from single http server in the driver? Why don't we push
 the jars into HDFS distributed cache by default instead of distributing them
 via http server?

 Thanks.

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai
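
As a minimal sketch of the pattern being debugged here (not from the thread itself),
adding a jar from the driver so executors fetch it before running tasks looks like
this, assuming an existing SparkContext sc; the path is illustrative:

// push a dependency jar from the driver to the executors
sc.addJar("/local/path/on/driver/analyticshadoop-eba5cdce1.jar")
// each executor should then log a line along the lines of
//   Fetching http://<driver-host>:<port>/jars/analyticshadoop-eba5cdce1.jar with timestamp ...
// before the first task that needs classes from that jar runs.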


accessing partition i+1 from mapper of partition i

2014-05-14 Thread Mohit Jaggi
Hi,
I am trying to find a way to fill in missing values in an RDD. The RDD is a
sorted sequence.
For example, (1, 2, 3, 5, 8, 11, ...)
I need to fill in the missing numbers and get (1,2,3,4,5,6,7,8,9,10,11)

One way to do this is to slide and zip
rdd1 =  sc.parallelize(List(1, 2, 3, 5, 8, 11, ...))
x = rdd1.first
rdd2 = rdd1 filter (_ != x)
rdd3 = rdd2 zip rdd1
rdd4 = rdd3 flatMap { (x, y) => generate missing elements between x and y }

Another method, which I think is more efficient, is to use mapPartitions()
on rdd1 to be able to iterate on elements of rdd1 in each partition.
However, that leaves the boundaries of the partitions to be unfilled. *Is
there a way within the function passed to mapPartitions, to read the first
element in the next partition?*

The latter approach also appears to work for a general sliding window
calculation on the RDD. The former technique requires a lot of sliding and
zipping and I believe it is not efficient. If only I could read the next
partition... I have tried passing a reference to rdd1 into the function passed
to mapPartitions, but the rdd1 reference turns out to be null, I guess because
Spark cannot deal with a mapper calling another mapper (since it happens on
a worker, not the driver).
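
A rough sketch (not from the original post) of one way to let partition i see the
first element of partition i+1: collect the per-partition head elements to the driver
and broadcast them back. It assumes a sorted RDD[Int] named rdd1 and a SparkContext sc
as in the shell; empty partitions are not handled.

// gather the first element of each partition, keyed by partition index
val heads = rdd1.mapPartitionsWithIndex { (idx, it) =>
  if (it.hasNext) Iterator((idx, it.next())) else Iterator.empty
}.collect().toMap
val headsBc = sc.broadcast(heads)

// fill the gaps inside each partition, using the next partition's head as the boundary
val filled = rdd1.mapPartitionsWithIndex { (idx, it) =>
  val elems = it.toVector
  val nextHead = headsBc.value.get(idx + 1)
  val extended = nextHead.map(elems :+ _).getOrElse(elems)
  val gaps = extended.iterator.zip(extended.iterator.drop(1)).flatMap {
    case (a, b) => a until b                  // emits a, a+1, ..., b-1
  }
  // the very last element of the RDD has no successor, so emit it explicitly
  gaps ++ (if (nextHead.isEmpty && elems.nonEmpty) Iterator(elems.last) else Iterator.empty)
}
// e.g. (1, 2, 3, 5, 8, 11) becomes (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)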

Mohit.


RE: How to use Mahout VectorWritable in Spark.

2014-05-14 Thread Stuti Awasthi
The console:12: error: not found: type Text issue is resolved by the import 
statement, but I am still facing an issue with the import of VectorWritable.
The mahout-math jar is added to the classpath, as I can verify on the Web UI as well as in the shell.

scala> System.getenv
res1: java.util.Map[String,String] = {TERM=xterm, 
JAVA_HOME=/usr/lib/jvm/java-6-openjdk, SHLVL=2, 
SHELL_JARS=/home/hduser/installations/work-space/mahout-math-0.7.jar, 
SPARK_MASTER_WEBUI_PORT=5050, LESSCLOSE=/usr/bin/lesspipe %s %s, 
SSH_CLIENT=10.112.67.149 55123 22, 
SPARK_HOME=/home/hduser/installations/spark-0.9.0, MAIL=/var/mail/hduser, 
SPARK_WORKER_DIR=/tmp/spark-hduser-worklogs/work, 
XDG_SESSION_COOKIE=fbd2e4304c8c75dd606c36100186-1400039480.256868-916349946,
 https_proxy=https://DS-1078D2486320:3128/, NICKNAME=vm01, JAVA_OPTS=  
-Djava.library.path= -Xms512m -Xmx512m, 
PWD=/home/hduser/installations/work-space/KMeansClustering_1, 
SSH_TTY=/dev/pts/0, SPARK_MASTER_PORT=7077, LOGNAME=hduser, 
MASTER=spark://VM-52540048731A:7077, SPARK_WORKER_MEMORY=2g, 
HADOOP_HOME=/usr/lib/hadoop, SS...

I am still not able to import the Mahout classes. Any ideas?

Thanks
Stuti Awasthi

-Original Message-
From: Stuti Awasthi 
Sent: Wednesday, May 14, 2014 1:13 PM
To: user@spark.apache.org
Subject: RE: How to use Mahout VectorWritable in Spark.

Hi Xiangrui,
Thanks for the response .. I tried few ways to include mahout-math jar while 
launching Spark shell.. but no success.. Can you please point what I am doing 
wrong

1. mahout-math.jar exported in CLASSPATH, and PATH 2. Tried Launching Spark 
Shell by :  MASTER=spark://HOSTNAME:PORT 
ADD_JARS=~/installations/work-space/mahout-math-0.7.jar 
park-0.9.0/bin/spark-shell

 After launching, I checked the environment details on WebUi: It looks like 
mahout-math jar is included.
spark.jars  /home/hduser/installations/work-space/mahout-math-0.7.jar

Then I try :
scala import org.apache.mahout.math.VectorWritable
console:10: error: object mahout is not a member of package org.apache
   import org.apache.mahout.math.VectorWritable

scala val raw = sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])  
console:12: error: not found: type Text
   val data = 
sc.sequenceFile(/stuti/ML/Clustering/KMeans/HAR/KMeans_dataset_seq/part-r-0,
 classOf[Text], classOf[VectorWritable])

 ^ Im using Spark 0.9 and Hadoop 1.0.4 and Mahout 
0.7

Thanks
Stuti 



-Original Message-
From: Xiangrui Meng [mailto:men...@gmail.com]
Sent: Wednesday, May 14, 2014 11:56 AM
To: user@spark.apache.org
Subject: Re: How to use Mahout VectorWritable in Spark.

You need

 val raw = sc.sequenceFile(path, classOf[Text],
 classOf[VectorWriteable])

to load the data. After that, you can do

 val data = raw.values.map(_.get)

To get an RDD of mahout's Vector. You can use `--jar mahout-math.jar` when you 
launch spark-shell to include mahout-math.

Best,
Xiangrui

On Tue, May 13, 2014 at 10:37 PM, Stuti Awasthi stutiawas...@hcl.com wrote:
 Hi All,

 I am very new to Spark and trying to play around with Mllib hence 
 apologies for the basic question.



 I am trying to run KMeans algorithm using Mahout and Spark MLlib to 
 see the performance. Now initial datasize was 10 GB. Mahout converts 
 the data in Sequence File Text,VectorWritable which is used for KMeans 
 Clustering.
 The Sequence File crated was ~ 6GB in size.



 Now I wanted if I can use the Mahout Sequence file to be executed in 
 Spark MLlib for KMeans . I have read that SparkContext.sequenceFile 
 may be used here. Hence I tried to read my sequencefile as below but getting 
 the error :



 Command on Spark Shell :

 scala val data = sc.sequenceFile[String,VectorWritable](/
 KMeans_dataset_seq/part-r-0,String,VectorWritable)

 console:12: error: not found: type VectorWritable

val data = sc.sequenceFile[String,VectorWritable](
 /KMeans_dataset_seq/part-r-0,String,VectorWritable)



 Here I have 2 ques:

 1.  Mahout has “Text” as Key but Spark is printing “not found: type:Text”
 hence I changed it to String.. Is this correct ???

 2. How will VectorWritable be found in Spark. Do I need to include 
 Mahout jar in Classpath or any other option ??



 Please Suggest



 Regards

 Stuti Awasthi




Proper way to create standalone app with custom Spark version

2014-05-14 Thread Andrei
We can create standalone Spark application by simply adding
spark-core_2.x to build.sbt/pom.xml and connecting it to Spark master.

We can also compile custom version of Spark (e.g. compiled against Hadoop
2.x) from source and deploy it to cluster manually.

But what is the proper way to use a _custom version_ of Spark in a _standalone
application_? We can't simply reference the custom version in
build.sbt/pom.xml, since it's not in the central repository.



I'm currently trying to deploy the custom version to a local Maven repository and
add it to the SBT project. Another option is to add Spark as a local jar to every
project. But both of these approaches look overcomplicated and generally wrong.

What is the intended way to solve this issue?
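
Concretely, the local-repository route above looks roughly like this in build.sbt
(a sketch; the version string and repository choice are illustrative): after running
sbt publishLocal (or mvn install) on the custom Spark tree, the standalone project
depends on whatever version it was published as.

// depend on the locally published custom build; the version string must match
// the version the custom tree was published under
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT"

// sbt already resolves the local Ivy repository; add the local Maven repository
// only if the custom build was installed with Maven
resolvers += Resolver.mavenLocal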

Thanks,
Andrei


Re: logging in pyspark

2014-05-14 Thread Diana Carroll
foreach vs. map isn't the issue.  Both require serializing the called
function, so the pickle error would still apply, yes?

And at the moment, I'm just testing.  Definitely wouldn't want to log
something for each element, but may want to detect something and log for
SOME elements.

So my question is: how are other people doing logging from distributed
tasks, given the serialization issues?

The same issue actually exists in Scala, too.  I could work around it by
creating a small serializable object that provides a logger, but it seems
kind of kludgy to me, so I'm wondering if other people are logging from
tasks, and if so, how?
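
In Scala, the "small serializable object that provides a logger" workaround looks
roughly like this (a sketch with illustrative names; log4j is assumed since Spark
ships with it): the logger is marked @transient lazy so it is not serialized with
the closure and is re-created on each executor.

import org.apache.log4j.Logger

class TaskLogger extends Serializable {
  // @transient keeps the non-serializable logger out of the serialized closure;
  // lazy makes it initialize again on each executor JVM when first used
  @transient lazy val log: Logger = Logger.getLogger("task-logger")
}

// illustrative usage inside a task:
// val tl = new TaskLogger
// rdd.map { x => tl.log.info("saw " + x); x }.count()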

Diana


On Tue, May 6, 2014 at 6:24 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 I think you're looking for RDD.foreach()
 (http://spark.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#foreach).

 According to the programming guide
 (http://spark.apache.org/docs/latest/scala-programming-guide.html):

 Run a function func on each element of the dataset. This is usually done
 for side effects such as updating an accumulator variable (see below) or
 interacting with external storage systems.


 Do you really want to log something for each element of your RDD?

 Nick


 On Tue, May 6, 2014 at 3:31 PM, Diana Carroll dcarr...@cloudera.comwrote:

 What should I do if I want to log something as part of a task?

 This is what I tried.  To set up a logger, I followed the advice here:
 http://py4j.sourceforge.net/faq.html#how-to-turn-logging-on-off

 logger = logging.getLogger("py4j")
 logger.setLevel(logging.INFO)
 logger.addHandler(logging.StreamHandler())

 This works fine when I call it from my driver (ie pyspark):
 logger.info("this works fine")

 But I want to try logging within a distributed task so I did this:

 def logTestMap(a):
     logger.info("test")
     return a

 myrdd.map(logTestMap).count()

 and got:
 PicklingError: Can't pickle 'lock' object

 So it's trying to serialize my function and can't because of a lock
 object used in logger, presumably for thread-safeness.  But then...how
 would I do it?  Or is this just a really bad idea?

 Thanks
 Diana





Re: Distribute jar dependencies via sc.AddJar(fileName)

2014-05-14 Thread DB Tsai
Hi Xiangrui,

I actually used `yarn-standalone`; sorry for the confusion. I did some debugging over
the last couple of days, and everything up to updateDependency in
executor.scala works. I also checked the file size and md5sum in the
executors, and they are the same as the one in driver. Gonna do more
testing tomorrow.

Thanks.


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Tue, May 13, 2014 at 11:41 PM, Xiangrui Meng men...@gmail.com wrote:

 I don't know whether this would fix the problem. In v0.9, you need
 `yarn-standalone` instead of `yarn-cluster`.

 See
 https://github.com/apache/spark/commit/328c73d037c17440c2a91a6c88b4258fbefa0c08

 On Tue, May 13, 2014 at 11:36 PM, Xiangrui Meng men...@gmail.com wrote:
  Does v0.9 support yarn-cluster mode? I checked SparkContext.scala in
  v0.9.1 and didn't see special handling of `yarn-cluster`. -Xiangrui
 
  On Mon, May 12, 2014 at 11:14 AM, DB Tsai dbt...@stanford.edu wrote:
  We're deploying Spark in yarn-cluster mode (Spark 0.9), and we add jar
  dependencies in command line with --addJars option. However, those
  external jars are only available in the driver (application running in
  hadoop), and not available in the executors (workers).
 
  After doing some research, we realize that we've to push those jars to
  executors in driver via sc.AddJar(fileName). Although in the driver's
 log
  (see the following), the jar is successfully added in the http server
 in the
  driver, and I confirm that it's downloadable from any machine in the
  network, I still get `java.lang.NoClassDefFoundError` in the executors.
 
  14/05/09 14:51:41 INFO spark.SparkContext: Added JAR
  analyticshadoop-eba5cdce1.jar at
  http://10.0.0.56:42522/jars/analyticshadoop-eba5cdce1.jar with
 timestamp
  1399672301568
 
  Then I check the log in the executors, and I don't find anything
 `Fetching
  file with timestamp timestamp`, which implies something is wrong;
 the
  executors are not downloading the external jars.
 
  Any suggestion what we can look at?
 
  After digging into how spark distributes external jars, I wonder the
  scalability of this approach. What if there are thousands of nodes
  downloading the jar from single http server in the driver? Why don't we
 push
  the jars into HDFS distributed cache by default instead of distributing
 them
  via http server?
 
  Thanks.
 
  Sincerely,
 
  DB Tsai
  ---
  My Blog: https://www.dbtsai.com
  LinkedIn: https://www.linkedin.com/in/dbtsai



Re: java.lang.StackOverflowError when calling count()

2014-05-14 Thread Nicholas Chammas
Would cache() + count() every N iterations work just as well as
checkPoint() + count() to get around this issue?

We're basically trying to get Spark to avoid working on too lengthy a
lineage at once, right?
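
For concreteness, the checkpoint-every-N-iterations pattern under discussion looks
roughly like this (a sketch; the checkpoint directory, interval, and iteration body
are all illustrative):

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // any reliable directory works

var rdd = sc.parallelize(1 to 1000000).map(_.toDouble)
val checkpointInterval = 25
for (i <- 1 to 200) {
  rdd = rdd.map(x => x * 1.000001)   // stand-in for one iteration of the real algorithm
  if (i % checkpointInterval == 0) {
    rdd.cache()        // avoid recomputing the long lineage twice
    rdd.checkpoint()   // mark for checkpointing; this cuts the lineage once written
    rdd.count()        // force materialization so the checkpoint is actually written
  }
}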

Nick


On Tue, May 13, 2014 at 12:04 PM, Xiangrui Meng men...@gmail.com wrote:

 After checkPoint, call count directly to materialize it. -Xiangrui

 On Tue, May 13, 2014 at 4:20 AM, Mayur Rustagi mayur.rust...@gmail.com
 wrote:
  We are running into same issue. After 700 or so files the stack
 overflows,
  cache, persist  checkpointing dont help.
  Basically checkpointing only saves the RDD when it is materialized  it
 only
  materializes in the end, then it runs out of stack.
 
  Regards
  Mayur
 
  Mayur Rustagi
  Ph: +1 (760) 203 3257
  http://www.sigmoidanalytics.com
  @mayur_rustagi
 
 
 
  On Tue, May 13, 2014 at 11:40 AM, Xiangrui Meng men...@gmail.com
 wrote:
 
  You have a long lineage that causes the StackOverflow error. Try
  rdd.checkPoint() and rdd.count() for every 20~30 iterations.
  checkPoint can cut the lineage. -Xiangrui
 
  On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan gh...@lanl.gov wrote:
   Dear Sparkers:
  
   I am using Python spark of version 0.9.0 to implement some iterative
   algorithm. I got some errors shown at the end of this email. It seems
   that
   it's due to the Java Stack Overflow error. The same error has been
   duplicated on a mac desktop and a linux workstation, both running the
   same
   version of Spark.
  
   The same line of code works correctly after quite some iterations. At
   the
   line of error, rdd__new.count() could be 0. (In some previous rounds,
   this
   was also 0 without any problem).
  
   Any thoughts on this?
  
   Thank you very much,
   - Guanhua
  
  
   
   CODE:print round, round, rdd__new.count()
   
 File
  
  
 /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py,
   line 542, in count
   14/05/12 16:20:28 INFO TaskSetManager: Loss was due to
   java.lang.StackOverflowError [duplicate 1]
   return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   14/05/12 16:20:28 ERROR TaskSetManager: Task 8419.0:0 failed 1 times;
   aborting job
 File
  
  
 /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py,
   line 533, in sum
   14/05/12 16:20:28 INFO TaskSchedulerImpl: Ignoring update with state
   FAILED
   from TID 1774 because its task set is gone
   return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
 File
  
  
 /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py,
   line 499, in reduce
   vals = self.mapPartitions(func).collect()
 File
  
  
 /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py,
   line 463, in collect
   bytesInJava = self._jrdd.collect().iterator()
 File
  
  
 /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py,
   line 537, in __call__
 File
  
  
 /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py,
   line 300, in get_return_value
   py4j.protocol.Py4JJavaError: An error occurred while calling
   o4317.collect.
   : org.apache.spark.SparkException: Job aborted: Task 8419.0:1 failed 1
   times
   (most recent failure: Exception failure: java.lang.StackOverflowError)
   at
  
  
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
   at
  
  
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
   at
  
  
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at
  
   org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
   at
  
  
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
   at
  
  
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
   at scala.Option.foreach(Option.scala:236)
   at
  
  
 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
   at
  
  
 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at
  
  
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at
  
  
 

Re: java.lang.StackOverflowError when calling count()

2014-05-14 Thread lalit1303
If we do cache() + count() after, say, every 50 iterations, the whole process
becomes very slow.
I have tried checkpoint(), cache() + count(), and saveAsObjectFiles().
Nothing works.
Materializing RDDs leads to a drastic decrease in performance, and if we don't
materialize, we face a StackOverflowError.


On Wed, May 14, 2014 at 10:25 AM, Nick Chammas [via Apache Spark User List]
ml-node+s1001560n5683...@n3.nabble.com wrote:

 Would cache() + count() every N iterations work just as well as
 checkPoint() + count() to get around this issue?

 We're basically trying to get Spark to avoid working on too lengthy a
 lineage at once, right?

 Nick


 On Tue, May 13, 2014 at 12:04 PM, Xiangrui Meng [hidden email] wrote:

 After checkPoint, call count directly to materialize it. -Xiangrui

 On Tue, May 13, 2014 at 4:20 AM, Mayur Rustagi [hidden email] wrote:
  We are running into same issue. After 700 or so files the stack
 overflows,
  cache, persist  checkpointing dont help.
  Basically checkpointing only saves the RDD when it is materialized  it
 only
  materializes in the end, then it runs out of stack.
 
  Regards
  Mayur
 
  Mayur Rustagi
  Ph: +1 (760) 203 3257
  http://www.sigmoidanalytics.com
  @mayur_rustagi

 
 
 
  On Tue, May 13, 2014 at 11:40 AM, Xiangrui Meng [hidden email] wrote:
 
  You have a long lineage that causes the StackOverflow error. Try
  rdd.checkPoint() and rdd.count() for every 20~30 iterations.
  checkPoint can cut the lineage. -Xiangrui
 
  On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan [hidden email] wrote:
   Dear Sparkers:
  
   I am using Python spark of version 0.9.0 to implement some iterative
   algorithm. I got some errors shown at the end of this email. It seems
   that
   it's due to the Java Stack Overflow error. The same error has been
   duplicated on a mac desktop and a linux workstation, both running the
   same
   version of Spark.
  
   The same line of code works correctly after quite some iterations. At
   the
   line of error, rdd__new.count() could be 0. (In some previous rounds,
   this
   was also 0 without any problem).
  
   Any thoughts on this?
  
   Thank you very much,
   - Guanhua
  
  
   
   CODE:print round, round, rdd__new.count()
   
 File
  
  
 /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py,
   line 542, in count
   14/05/12 16:20:28 INFO TaskSetManager: Loss was due to
   java.lang.StackOverflowError [duplicate 1]
   return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   14/05/12 16:20:28 ERROR TaskSetManager: Task 8419.0:0 failed 1 times;
   aborting job
 File
  
  
 /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py,
   line 533, in sum
   14/05/12 16:20:28 INFO TaskSchedulerImpl: Ignoring update with state
   FAILED
   from TID 1774 because its task set is gone
   return self.mapPartitions(lambda x:
 [sum(x)]).reduce(operator.add)
 File
  
  
 /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py,
   line 499, in reduce
   vals = self.mapPartitions(func).collect()
 File
  
  
 /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py,
   line 463, in collect
   bytesInJava = self._jrdd.collect().iterator()
 File
  
  
 /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py,
   line 537, in __call__
 File
  
  
 /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py,
   line 300, in get_return_value
   py4j.protocol.Py4JJavaError: An error occurred while calling
   o4317.collect.
   : org.apache.spark.SparkException: Job aborted: Task 8419.0:1 failed
 1
   times
   (most recent failure: Exception failure:
 java.lang.StackOverflowError)
   at
  
  
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
   at
  
  
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
   at
  
  
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at
  
   org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
   at
  
  
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
   at
  
  
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
   at 

Re: How to run shark?

2014-05-14 Thread Mayur Rustagi
Is your Spark working? Can you try running the Spark shell?
http://spark.apache.org/docs/0.9.1/quick-start.html
If Spark is working, we can move this to the Shark user list (copied here).
Also, I am anything but a sir :)

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Wed, May 14, 2014 at 12:49 PM, Sophia sln-1...@163.com wrote:

 My configuration is just like this,the slave's node has been
 configuate,but I
 donnot know what's happened to the shark?Can you help me Sir?
 shark-env.sh
 export SPARK_USER_HOME=/root
 export SPARK_MEM=2g
 export SCALA_HOME=/root/scala-2.11.0-RC4
 export SHARK_MASTER_MEM=1g
 export HIVE_CONF_DIR=/usr/lib/hive/conf
 export HIVE_HOME=/usr/lib/hive
 export HADOOP_HOME=/usr/lib/hadoop
 export SPARK_HOME=/root/spark-0.9.1
 export MASTER=spark://192.168.10.220:7077
 export SHARK_EXEC_MODE=yarn

 SPARK_JAVA_OPTS= -Dspark.local.dir=/tmp 
 SPARK_JAVA_OPTS+=-Dspark.kryoserializer.buffer.mb=10 
 SPARK_JAVA_OPTS+=-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps 
 export SPARK_JAVA_OPTS
 export

 SPARK_ASSEMBLY_JAR=/root/spark-0.9.1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
 export

 SHARK_ASSEMBLY_JAR=/root/shark-0.9.1-bin-hadoop2/target/scala-2.10/shark_2.10-0.9.1.jar

 Best regards,



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-shark-tp5581p5688.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-14 Thread wxhsdp
Hi DB,
  I've added the Breeze jars to the workers using sc.addJar().
  The Breeze jars include:
  breeze-natives_2.10-0.7.jar
  breeze-macros_2.10-0.3.jar
  breeze-macros_2.10-0.3.1.jar
  breeze_2.10-0.8-SNAPSHOT.jar
  breeze_2.10-0.7.jar

  These are almost all the Breeze jars I can find, but I still get NoSuchMethodError:
breeze.linalg.DenseMatrix.

  From the executor stderr, you can see that the executor successfully fetches
these jars. What's wrong with my approach? Thank you!

14/05/14 20:36:02 INFO Executor: Fetching
http://192.168.0.106:42883/jars/breeze-natives_2.10-0.7.jar with timestamp
1400070957376
14/05/14 20:36:02 INFO Utils: Fetching
http://192.168.0.106:42883/jars/breeze-natives_2.10-0.7.jar to
/tmp/fetchFileTemp7468892065227766972.tmp
14/05/14 20:36:02 INFO Executor: Adding
file:/home/wxhsdp/spark/spark/tags/v1.0.0-rc3/work/app-20140514203557-/0/./breeze-natives_2.10-0.7.jar
to class loader
14/05/14 20:36:02 INFO Executor: Fetching
http://192.168.0.106:42883/jars/breeze-macros_2.10-0.3.jar with timestamp
1400070957441
14/05/14 20:36:02 INFO Utils: Fetching
http://192.168.0.106:42883/jars/breeze-macros_2.10-0.3.jar to
/tmp/fetchFileTemp2324565598765584917.tmp
14/05/14 20:36:02 INFO Executor: Adding
file:/home/wxhsdp/spark/spark/tags/v1.0.0-rc3/work/app-20140514203557-/0/./breeze-macros_2.10-0.3.jar
to class loader
14/05/14 20:36:02 INFO Executor: Fetching
http://192.168.0.106:42883/jars/breeze_2.10-0.8-SNAPSHOT.jar with timestamp
1400070957358
14/05/14 20:36:02 INFO Utils: Fetching
http://192.168.0.106:42883/jars/breeze_2.10-0.8-SNAPSHOT.jar to
/tmp/fetchFileTemp8730123100104850193.tmp
14/05/14 20:36:02 INFO Executor: Adding
file:/home/wxhsdp/spark/spark/tags/v1.0.0-rc3/work/app-20140514203557-/0/./breeze_2.10-0.8-SNAPSHOT.jar
to class loader
14/05/14 20:36:02 INFO Executor: Fetching
http://192.168.0.106:42883/jars/breeze-macros_2.10-0.3.1.jar with timestamp
1400070957414
14/05/14 20:36:02 INFO Utils: Fetching
http://192.168.0.106:42883/jars/breeze-macros_2.10-0.3.1.jar to
/tmp/fetchFileTemp3473404556989515218.tmp
14/05/14 20:36:02 INFO Executor: Adding
file:/home/wxhsdp/spark/spark/tags/v1.0.0-rc3/work/app-20140514203557-/0/./breeze-macros_2.10-0.3.1.jar
to class loader
14/05/14 20:36:02 INFO Executor: Fetching
http://192.168.0.106:42883/jars/build-project_2.10-1.0.jar with timestamp
1400070956753
14/05/14 20:36:02 INFO Utils: Fetching
http://192.168.0.106:42883/jars/build-project_2.10-1.0.jar to
/tmp/fetchFileTemp1289055585501269156.tmp
14/05/14 20:36:02 INFO Executor: Adding
file:/home/wxhsdp/spark/spark/tags/v1.0.0-rc3/work/app-20140514203557-/0/./build-project_2.10-1.0.jar
to class loader
14/05/14 20:36:02 INFO Executor: Fetching
http://192.168.0.106:42883/jars/breeze_2.10-0.7.jar with timestamp
1400070957228
14/05/14 20:36:02 INFO Utils: Fetching
http://192.168.0.106:42883/jars/breeze_2.10-0.7.jar to
/tmp/fetchFileTemp1287317286108432726.tmp
14/05/14 20:36:02 INFO Executor: Adding
file:/home/wxhsdp/spark/spark/tags/v1.0.0-rc3/work/app-20140514203557-/0/./breeze_2.10-0.7.jar
to class loader


DB Tsai-2 wrote
 Since the breeze jar is brought into spark by mllib package, you may want
 to add mllib as your dependency in spark 1.0. For bring it from your
 application yourself, you can either use sbt assembly in ur build project
 to generate a flat myApp-assembly.jar which contains breeze jar, or use
 spark add jar api like Yadid said.
 
 
 Sincerely,
 
 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai
 
 
  On Sun, May 4, 2014 at 10:24 PM, wxhsdp <wxhsdp@...> wrote:
 
 Hi, DB, i think it's something related to sbt publishLocal

 if i remove the breeze dependency in my sbt file, breeze can not be found

 [error] /home/wxhsdp/spark/example/test/src/main/scala/test.scala:5: not
 found: object breeze
 [error] import breeze.linalg._
 [error]^

 here's my sbt file:

  name := "Build Project"

  version := "1.0"

  scalaVersion := "2.10.4"

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT"

  resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

 i run sbt publishLocal on the Spark tree.

 but if i manully put spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar in
 /lib
 directory, sbt package is
 ok, i can run my app in workers without addJar

 what's the difference between add dependency in sbt after sbt
 publishLocal
 and manully put spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar in /lib
 directory?

 why can i run my app in worker without addJar this time?


 DB Tsai-2 wrote
  If you add the breeze dependency in your build.sbt project, it will not
 be
  available to all the workers.
 
  There are couple options, 1) use sbt assembly to package breeze into
 your
  application jar. 2) manually copy breeze jar into all the nodes, and
 have
  them in the classpath. 3) spark 1.0 has breeze jar in the spark flat
  assembly jar, so you don't need to add breeze dependency 
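
A rough build.sbt sketch of the suggestion quoted above: depend on spark-mllib so
that a matching Breeze version comes in transitively, instead of shipping several
Breeze jars by hand (the version strings are illustrative and should match the
Spark build actually used on the cluster):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.0.0-SNAPSHOT",
  "org.apache.spark" %% "spark-mllib" % "1.0.0-SNAPSHOT"   // brings in Breeze
)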

Re: Packaging a spark job using maven

2014-05-14 Thread Laurent T
Hi,

Thanks François, but this didn't change much. I'm not even sure what this
reference.conf is. It isn't mentioned anywhere in the Spark documentation. Should
I have one in my resources?

Thanks
Laurent





Re: Spark LIBLINEAR

2014-05-14 Thread Debasish Das
Hi Professor Lin,

On our internal datasets, I am getting accuracy on par with glmnet-R for
sparse feature selection from LIBLINEAR. The default MLlib gradient
descent was way off. I did not tune the learning rate, but I ran with varying
lambda. The feature selection was weak.

I used the LIBLINEAR code. Next I will explore the distributed LIBLINEAR.

Adding the code on GitHub will definitely help collaboration.

I am experimenting with whether a BFGS/OWL-QN-based sparse logistic regression in Spark MLlib
gives us accuracy on par with LIBLINEAR.

If the LIBLINEAR solver outperforms them (in either accuracy or performance), we have
to bring TRON to MLlib and let other algorithms benefit from it as well.

We are using the BFGS and OWL-QN solvers from breeze.optimize.

Thanks.
Deb
 On May 12, 2014 9:07 PM, DB Tsai dbt...@stanford.edu wrote:

 It seems that the code isn't managed in github. Can be downloaded from
 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/spark/spark-liblinear-1.94.zip

 It will be easier to track the changes in github.



 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Mon, May 12, 2014 at 7:53 AM, Xiangrui Meng men...@gmail.com wrote:

 Hi Chieh-Yen,

 Great to see the Spark implementation of LIBLINEAR! We will definitely
 consider adding a wrapper in MLlib to support it. Is the source code
 on github?

 Deb, Spark LIBLINEAR uses BSD license, which is compatible with Apache.

 Best,
 Xiangrui

 On Sun, May 11, 2014 at 10:29 AM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hello Prof. Lin,
 
  Awesome news ! I am curious if you have any benchmarks comparing C++ MPI
  with Scala Spark liblinear implementations...
 
  Is Spark Liblinear apache licensed or there are any specific
 restrictions on
  using it ?
 
  Except using native blas libraries (which each user has to manage by
 pulling
  in their best proprietary BLAS package), all Spark code is Apache
 licensed.
 
  Thanks.
  Deb
 
 
  On Sun, May 11, 2014 at 3:01 AM, DB Tsai dbt...@stanford.edu wrote:
 
  Dear Prof. Lin,
 
  Interesting! We had an implementation of L-BFGS in Spark and already
  merged in the upstream now.
 
  We read your paper comparing TRON and OWL-QN for logistic regression
 with
  L1 (http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf), but it seems
 that it's
  not in the distributed setup.
 
  Will be very interesting to know the L2 logistic regression benchmark
  result in Spark with your TRON optimizer and the L-BFGS optimizer
 against
  different datasets (sparse, dense, and wide, etc).
 
  I'll try your TRON out soon.
 
 
  Sincerely,
 
  DB Tsai
  ---
  My Blog: https://www.dbtsai.com
  LinkedIn: https://www.linkedin.com/in/dbtsai
 
 
  On Sun, May 11, 2014 at 1:49 AM, Chieh-Yen r01944...@csie.ntu.edu.tw
  wrote:
 
  Dear all,
 
  Recently we released a distributed extension of LIBLINEAR at
 
  http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/
 
  Currently, TRON for logistic regression and L2-loss SVM is supported.
  We provided both MPI and Spark implementations.
  This is very preliminary so your comments are very welcome.
 
  Thanks,
  Chieh-Yen
 
 
 





RE: How to use Mahout VectorWritable in Spark.

2014-05-14 Thread Stuti Awasthi
Hi Xiangrui,
Thanks for the response. I tried a few ways to include the mahout-math jar while 
launching the Spark shell, but with no success. Can you please point out what I am doing 
wrong?

1. mahout-math.jar exported in CLASSPATH and PATH
2. Tried launching the Spark shell by:  MASTER=spark://HOSTNAME:PORT 
ADD_JARS=~/installations/work-space/mahout-math-0.7.jar 
spark-0.9.0/bin/spark-shell

After launching, I checked the environment details on the Web UI; it looks like the
mahout-math jar is included:
spark.jars  /home/hduser/installations/work-space/mahout-math-0.7.jar

Then I try:
scala> import org.apache.mahout.math.VectorWritable
<console>:10: error: object mahout is not a member of package org.apache
   import org.apache.mahout.math.VectorWritable

scala> val raw = sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])
<console>:12: error: not found: type Text
   val data = 
sc.sequenceFile("/stuti/ML/Clustering/KMeans/HAR/KMeans_dataset_seq/part-r-0",
 classOf[Text], classOf[VectorWritable])

 ^
I'm using Spark 0.9, Hadoop 1.0.4, and Mahout 0.7.

Thanks
Stuti 



-Original Message-
From: Xiangrui Meng [mailto:men...@gmail.com] 
Sent: Wednesday, May 14, 2014 11:56 AM
To: user@spark.apache.org
Subject: Re: How to use Mahout VectorWritable in Spark.

You need

 val raw = sc.sequenceFile(path, classOf[Text], 
 classOf[VectorWriteable])

to load the data. After that, you can do

 val data = raw.values.map(_.get)

To get an RDD of mahout's Vector. You can use `--jar mahout-math.jar` when you 
launch spark-shell to include mahout-math.
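
Putting that together, a minimal sketch (assuming spark-shell was launched with the
mahout-math jar available to both the driver and the executors; the path below is
illustrative):

import org.apache.spark.SparkContext._
import org.apache.hadoop.io.Text
import org.apache.mahout.math.VectorWritable

val raw = sc.sequenceFile("/path/to/KMeans_dataset_seq", classOf[Text], classOf[VectorWritable])
val data = raw.values.map(_.get)   // RDD of org.apache.mahout.math.Vector
// note: Hadoop may reuse Writable instances, so copy the vectors before caching
// if the values look repeated
data.count()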

Best,
Xiangrui

On Tue, May 13, 2014 at 10:37 PM, Stuti Awasthi stutiawas...@hcl.com wrote:
 Hi All,

 I am very new to Spark and trying to play around with Mllib hence 
 apologies for the basic question.



 I am trying to run KMeans algorithm using Mahout and Spark MLlib to 
 see the performance. Now initial datasize was 10 GB. Mahout converts 
 the data in Sequence File Text,VectorWritable which is used for KMeans 
 Clustering.
 The Sequence File crated was ~ 6GB in size.



 Now I wanted if I can use the Mahout Sequence file to be executed in 
 Spark MLlib for KMeans . I have read that SparkContext.sequenceFile 
 may be used here. Hence I tried to read my sequencefile as below but getting 
 the error :



 Command on Spark Shell :

 scala val data = sc.sequenceFile[String,VectorWritable](/
 KMeans_dataset_seq/part-r-0,String,VectorWritable)

 console:12: error: not found: type VectorWritable

val data = sc.sequenceFile[String,VectorWritable](
 /KMeans_dataset_seq/part-r-0,String,VectorWritable)



 Here I have 2 ques:

 1.  Mahout has “Text” as Key but Spark is printing “not found: type:Text”
 hence I changed it to String.. Is this correct ???

 2. How will VectorWritable be found in Spark. Do I need to include 
 Mahout jar in Classpath or any other option ??



 Please Suggest



 Regards

 Stuti Awasthi





saveAsTextFile with replication factor in HDFS

2014-05-14 Thread Sai Prasanna
Hi,

Can we override the default file replication factor when using
saveAsTextFile() to write to HDFS?

My default replication factor is 1, but intermediate files that I want to put in
HDFS while running a Spark query need not be replicated, so is there a way?
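
A possible approach (a sketch; whether it fits this setup is an assumption, and the
path and value are illustrative): dfs.replication is a client-side Hadoop setting,
so it can be set on the SparkContext's Hadoop configuration before writing, and it
then applies to files created by saveAsTextFile.

sc.hadoopConfiguration.set("dfs.replication", "1")   // replication for files written below
rdd.saveAsTextFile("hdfs:///tmp/intermediate-output")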


Thanks !


Worker re-spawn and dynamic node joining

2014-05-14 Thread Han JU
Hi all,

Just 2 questions:

  1. Is there a way to automatically re-spawn Spark workers? We have
situations where an executor OOM causes the worker process to be marked DEAD, and it does
not come back automatically.

  2. How can we dynamically add (or remove) worker machines to (from) the
cluster? We'd like to leverage EC2 auto-scaling groups, for example.

We're using Spark standalone.

Thanks a lot.

-- 
*JU Han*

Data Engineer @ Botify.com

+33 061960


Re: Spark unit testing best practices

2014-05-14 Thread Andrew Ash
There's an undocumented mode that looks like it simulates a cluster:

SparkContext.scala:
// Regular expression for simulating a Spark cluster of [N, cores,
memory] locally
val LOCAL_CLUSTER_REGEX =
local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*].r

Can you try running your tests with a master URL of local-cluster[2,2,512] to
see if that exercises serialization?
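
A rough sketch of what that looks like in a test (local-cluster mode spawning real
executor JVMs is the point; the assumption that it needs a built Spark assembly /
SPARK_HOME on the test machine should be verified):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local-cluster[2,2,512]")   // 2 workers, 2 cores each, 512 MB each
  .setAppName("serialization-test")
val sc = new SparkContext(conf)
try {
  // tasks run in separate executor JVMs, so closures and results must serialize
  val result = sc.parallelize(1 to 100, 4).map(_ * 2).collect()
  assert(result.sum == 2 * (1 to 100).sum)
} finally {
  sc.stop()
}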


On Wed, May 14, 2014 at 3:34 AM, Andras Nemeth 
andras.nem...@lynxanalytics.com wrote:

 Hi,

 Spark's local mode is great to create simple unit tests for our spark
 logic. The disadvantage however is that certain types of problems are never
 exposed in local mode because things never need to be put on the wire.

 E.g. if I accidentally use a closure which has something non-serializable
 in it, then my test will happily succeed in local mode but go down in
 flames on a real cluster.

 Other example is kryo: I'd like to use setRegistrationRequired(true) to
 avoid any hidden performance problems due to forgotten registration. And of
 course I'd like things to fail in tests. But it won't happen because we
 never actually need to serialize the RDDs in local mode.

 So, is there some good solution to the above problems? Is there some
 local-like mode which simulates serializations as well? Or is there an easy
 way to start up *from code* a standalone spark cluster on the machine
 running the unit test?

 Thanks,
 Andras




Re: Spark unit testing best practices

2014-05-14 Thread Philip Ogren
Have you actually found this to be true?  I have found Spark local mode 
to be quite good about blowing up if there is something non-serializable 
and so my unit tests have been great for detecting this.  I have never 
seen something that worked in local mode that didn't work on the cluster 
because of different serialization requirements between the two.  
Perhaps it is different when using Kryo.



On 05/14/2014 04:34 AM, Andras Nemeth wrote:
E.g. if I accidentally use a closure which has something 
non-serializable in it, then my test will happily succeed in local 
mode but go down in flames on a real cluster.




little confused about SPARK_JAVA_OPTS alternatives

2014-05-14 Thread Koert Kuipers
I have some settings that I think are relevant for my application. They are
spark.akka settings, so I assume they are relevant for both the executors and my
driver program.

I used to do:
SPARK_JAVA_OPTS=-Dspark.akka.frameSize=1

Now this is deprecated. The alternatives mentioned are:
* some spark-submit settings, which are not relevant to me since I do not
use spark-submit (I launch Spark jobs from an existing application)
* spark.executor.extraJavaOptions to set -X options. I am not sure what -X
options are, but it doesn't sound like what I need, since it's only for
executors
* SPARK_DAEMON_OPTS to set Java options for standalone daemons (i.e.
master, worker); that sounds like something I should not use, since I am trying to
change settings for an app, not a daemon.

Am I missing the correct setting to use?
Should I pass -Dspark.akka.frameSize=1 on my application launch directly,
and then also set spark.executor.extraJavaOptions? So, basically, repeat it?
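
One hedged sketch of an alternative (not confirmed in this thread): when the driver
is your own application rather than spark-submit, spark.* settings such as the Akka
frame size can be set programmatically on the SparkConf before the SparkContext is
created; the master URL and value below are illustrative.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // illustrative master URL
  .setAppName("my-app")
  .set("spark.akka.frameSize", "100")      // frame size in MB; the value is illustrative
val sc = new SparkContext(conf)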


Re: Packaging a spark job using maven

2014-05-14 Thread François Le Lay
I have a similar objective of using Maven as our build tool and ran into the
same issue.
The problem is that your config file is actually not found: your fat jar
assembly does not contain the reference.conf resource.

I added the following to the resources section of my pom to make it work:

<resource>
  <directory>src/main/resources</directory>
  <includes>
    <include>*.conf</include>
  </includes>
  <targetPath>${project.build.directory}/classes</targetPath>
</resource>

I think Paul's gist also achieves a similar effect by specifying a proper
appender in the shading conf.

cheers
François







On Tue, May 13, 2014 at 4:09 AM, Laurent Thoulon 
laurent.thou...@ldmobile.net wrote:

 (I've never actually received my previous mail so i'm resending it. Sorry
 if it creates a duplicate.)


 Hi,

 I'm quite new to spark (and scala) but has anyone ever successfully
 compiled and run a spark job using java and maven ?
 Packaging seems to go fine but when i try to execute the job using

 mvn package
 java -Xmx4g -cp target/jobs-1.4.0.0-jar-with-dependencies.jar
 my.jobs.spark.TestJob

 I get the following error
 Exception in thread main com.typesafe.config.ConfigException$Missing: No
 configuration setting found for key 'akka.version'
 at
 com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:115)
 at
 com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:136)
 at
 com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:142)
 at
 com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:150)
 at
 com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:155)
 at
 com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:197)
 at akka.actor.ActorSystem$Settings.init(ActorSystem.scala:136)
 at akka.actor.ActorSystemImpl.init(ActorSystem.scala:470)
 at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
 at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
 at
 org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:96)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:126)
 at org.apache.spark.SparkContext.init(SparkContext.scala:139)
 at
 org.apache.spark.api.java.JavaSparkContext.init(JavaSparkContext.scala:47)
 at my.jobs.spark.TestJob.run(TestJob.java:56)


 Here's the code right until line 56

  SparkConf conf = new SparkConf()
      .setMaster("local[" + cpus + "]")
      .setAppName(this.getClass().getSimpleName())
      .setSparkHome("/data/spark")
      .setJars(JavaSparkContext.jarOfClass(this.getClass()))
      .set("spark.default.parallelism", String.valueOf(cpus * 2))
      .set("spark.executor.memory", "4g")
      .set("spark.storage.memoryFraction", "0.6")
      .set("spark.shuffle.memoryFraction", "0.3");
 JavaSparkContext sc = new JavaSparkContext(conf);

 Thanks
 Regards,
 Laurent




-- 
François /fly Le Lay
Data Infra Chapter Lead NYC
+1 (646)-656-0075


NotSerializableException in Spark Streaming

2014-05-14 Thread Diana Carroll
Hey all, trying to set up a pretty simple streaming app and getting some
weird behavior.

First, a non-streaming job that works fine:  I'm trying to pull out lines
of a log file that match a regex, for which I've set up a function:

def getRequestDoc(s: String): String = { "KBDOC-[0-9]*".r.findFirstIn(s).orNull }
val logs = sc.textFile(logfiles)
logs.map(getRequestDoc).take(10)

That works, but I want to run that on the same data, but streaming, so I
tried this:

val logs = ssc.socketTextStream("localhost", ...)
logs.map(getRequestDoc).print()
ssc.start()

From this code, I get:
14/05/08 03:32:08 ERROR JobScheduler: Error running job streaming job
1399545128000 ms.0
org.apache.spark.SparkException: Job aborted: Task not serializable:
java.io.NotSerializableException:
org.apache.spark.streaming.StreamingContext


But if I do the map function inline instead of calling a separate function,
it works:

logs.map("KBDOC-[0-9]*".r.findFirstIn(_).orNull).print()

So why is it able to serialize my little function in regular Spark, but not
in streaming?
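
For context, a minimal sketch of the workaround I'm planning to try, assuming
the problem is that the function is defined in a scope that drags the
StreamingContext into the closure (the port number is a placeholder):

// Defining the helper in a standalone object keeps the closure free of the
// enclosing scope (and thus of the StreamingContext).
object RequestDocExtractor extends Serializable {
  def getRequestDoc(s: String): String = "KBDOC-[0-9]*".r.findFirstIn(s).orNull
}

// ssc is the StreamingContext from the snippet above
val logs = ssc.socketTextStream("localhost", 9999)  // placeholder port
logs.map(RequestDocExtractor.getRequestDoc).print()
ssc.start()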

Thanks,
Diana


No configuration setting found for key 'akka.zeromq'

2014-05-14 Thread Francis . Hu
Hi all,

When I run the ZeroMQWordCount example on the cluster, the worker log says:
Caused by: com.typesafe.config.ConfigException$Missing: No configuration
setting found for key 'akka.zeromq'

Actually, I can see that the reference.conf in
spark-examples-assembly-0.9.1.jar contains the configuration below.
Does anyone know what is happening?

#####################################
# Akka ZeroMQ Reference Config File #
#####################################

# This is the reference config file that contains all the default settings.
# Make your edits/overrides in your application.conf.

akka {

  zeromq {

    # The default timeout for a poll on the actual zeromq socket.
    poll-timeout = 100ms

    # Timeout for creating a new socket
    new-socket-timeout = 5s

    socket-dispatcher {
      # A zeromq socket needs to be pinned to the thread that created it.
      # Changing this value results in weird errors and race conditions within
      # zeromq
      executor = thread-pool-executor
      type = PinnedDispatcher
      thread-pool-executor.allow-core-timeout = off
    }
  }
}

 

Exception in the worker:

akka.actor.ActorInitializationException: exception during creation
    at akka.actor.ActorInitializationException$.apply(Actor.scala:218)
Caused by: com.typesafe.config.ConfigException$Missing: No configuration
setting found for key 'akka.zeromq'

14/05/06 21:26:19 ERROR actor.ActorCell: changing Recreate into Create after
akka.actor.ActorInitializationException: exception during creation

Thanks,

Francis.Hu



Re: How to use spark-submit

2014-05-14 Thread phoenix bai
I used spark-submit to run the MovieLensALS example from the examples
module. Here is the command:

$spark-submit --master local
/home/phoenix/spark/spark-dev/examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop1.0.4.jar
--class org.apache.spark.examples.mllib.MovieLensALS u.data

Also, you can check the parameters of spark-submit with $spark-submit --help.

Hope this helps!


On Wed, May 7, 2014 at 9:27 AM, Tathagata Das
tathagata.das1...@gmail.comwrote:

 Doesn't the run-example script work for you? Also, are you on the latest
 commit of branch-1.0?

 TD


 On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta 
 soumya.sima...@gmail.comwrote:



 Yes, I'm struggling with a similar problem where my classes are not found
 on the worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate it
 if someone could provide some documentation on the usage of spark-submit.

 Thanks

  On May 5, 2014, at 10:24 PM, Stephen Boesch java...@gmail.com wrote:
 
 
  I have a Spark Streaming application that uses the external streaming
 modules (e.g. kafka, mqtt, ...) as well. It is not clear how to properly
 invoke the spark-submit script: what --driver-class-path and/or
 -Dspark.executor.extraClassPath parameters are required?
 
   For reference, the following error is proving difficult to resolve:
 
  java.lang.ClassNotFoundException:
 org.apache.spark.streaming.examples.StreamingExamples
 





Spark on yarn-standalone throws StackOverflowError and fails sometimes and succeeds for the rest

2014-05-14 Thread phoenix bai
Hi all,

My Spark code is running on yarn-standalone.

The last three lines of the code are as below:

val result = model.predict(prdctpairs)
result.map(x => x.user + "," + x.product + "," + x.rating).saveAsTextFile(output)
sc.stop()

The same code is sometimes able to run successfully and gives the right
result, while from time to time it throws a StackOverflowError and fails.

And I don't have a clue how I should debug it.

Below is the error (the start and end portions, to be exact):


14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-17]
MapOutputTrackerMasterActor: Asked to send map output locations for shuffle
44 to sp...@rxx43.mc10.site.net:43885
14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-17]
MapOutputTrackerMaster: Size of output statuses for shuffle 44 is 148 bytes
14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-35]
MapOutputTrackerMasterActor: Asked to send map output locations for shuffle
45 to sp...@rxx43.mc10.site.net:43885
14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-35]
MapOutputTrackerMaster: Size of output statuses for shuffle 45 is 453 bytes
14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-20]
MapOutputTrackerMasterActor: Asked to send map output locations for shuffle
44 to sp...@rxx43.mc10.site.net:56767
14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-29]
MapOutputTrackerMasterActor: Asked to send map output locations for shuffle
45 to sp...@rxx43.mc10.site.net:56767
14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-29]
MapOutputTrackerMasterActor: Asked to send map output locations for shuffle
44 to sp...@rxx43.mc10.site.net:49879
14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-29]
MapOutputTrackerMasterActor: Asked to send map output locations for shuffle
45 to sp...@rxx43.mc10.site.net:49879
14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-17]
TaskSetManager: Starting task 946.0:17 as TID 146 on executor 6:
rx15.mc10.site.net (PROCESS_LOCAL)
14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-17]
TaskSetManager: Serialized task 946.0:17 as 6414 bytes in 0 ms
14-05-09 17:55:51 WARN [Result resolver thread-0] TaskSetManager: Lost TID
133 (task 946.0:4)
14-05-09 17:55:51 WARN [Result resolver thread-0] TaskSetManager: Loss was
due to java.lang.StackOverflowError
java.lang.StackOverflowError
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)



at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:969)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1848)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1946)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1870)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1946)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1870)
14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-5]
TaskSetManager: Starting task 946.0:4 as TID 147 on executor 6:
r15.mc10.site.net (PROCESS_LOCAL)
14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-5]
TaskSetManager: Serialized task 946.0:4 as 6414 bytes in 0 ms
14-05-09 17:55:51 WARN [Result resolver thread-1] TaskSetManager: Lost TID
139 (task 946.0:10)
14-05-09 17:55:51 INFO [Result resolver thread-1] TaskSetManager: Loss was
due to java.lang.StackOverflowError [duplicate 1]
14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-5]
CoarseGrainedSchedulerBackend: Executor 4 disconnected, so removing it
14-05-09 17:55:51 ERROR 
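
For reference, one thing I am considering trying: a minimal sketch, assuming
the overflow comes from a very long RDD lineage and that checkpointing the
intermediate result truncates it (the checkpoint directory is a placeholder):

// Checkpointing materializes the RDD and cuts its lineage, which is a common
// way to avoid deep-recursion errors during task (de)serialization.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // placeholder path

val result = model.predict(prdctpairs)
result.checkpoint()   // marked for checkpointing...
result.count()        // ...which actually happens when an action runs

result.map(x => x.user + "," + x.product + "," + x.rating).saveAsTextFile(output)
sc.stop()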

Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-14 Thread Gerard Maas
Hi Jacob,

Thanks for the help and the answer on the Docker question. Have you already
experimented with the new link feature in Docker? That does not help the
HDFS issue, as the DataNode needs the NameNode and vice versa, but it does
facilitate simpler client-server interactions.

The issue I described at the beginning is related to networking between the
host and the Docker images, but I was losing too much time tracking down
the exact problem, so I moved my Spark job driver onto the Mesos node and
it started working. Sadly, my Mesos UI is partially crippled, as workers
are not addressable (and therefore Spark job logs are hard to gather).

Your discussion about dynamic port allocation is very relevant to
understanding why some components cannot talk to each other. I'll need to
read that discussion in more depth to find a better solution for my local
development environment.

regards,  Gerard.



On Tue, May 6, 2014 at 3:30 PM, Jacob Eisinger jeis...@us.ibm.com wrote:

 Howdy,

 You might find the discussion Andrew and I have been having about Docker
 and network security [1] applicable.

 Also, I posted an answer [2] to your stackoverflow question.

 [1]
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-driver-interacting-with-Workers-in-YARN-mode-firewall-blocking-communication-tp5237p5441.html
 [2]
 http://stackoverflow.com/questions/23410505/how-to-run-hdfs-cluster-without-dns/23495100#23495100

 Jacob D. Eisinger
 IBM Emerging Technologies
 jeis...@us.ibm.com - (512) 286-6075


 From: Gerard Maas gerard.m...@gmail.com
 To: user@spark.apache.org
 Date: 05/05/2014 04:18 PM
 Subject: Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't
 submit jobs.
 --



 Hi Benjamin,

 Yes, we initially used a modified version of the AmpLabs docker scripts
 [1]. The amplab docker images are a good starting point.
 One of the biggest hurdles has been HDFS, which requires reverse-DNS and I
 didn't want to go the dnsmasq route to keep the containers relatively
 simple to use without the need of external scripts. Ended up running a
 1-node setup nnode+dnode. I'm still looking for a better solution for HDFS
 [2]

 Our usecase using docker is to easily create local dev environments both
 for development and for automated functional testing (using cucumber). My
 aim is to strongly reduce the time of the develop-deploy-test cycle.
 That  also means that we run the minimum number of instances required to
 have a functionally working setup. E.g. 1 Zookeeper, 1 Kafka broker, ...

 For the actual cluster deployment we have Chef-based devops toolchain that
  put things in place on public cloud providers.
 Personally, I think Docker rocks and would like to replace those complex
 cookbooks with Dockerfiles once the technology is mature enough.

 -greetz, Gerard.

 [1] https://github.com/amplab/docker-scripts

 [2] http://stackoverflow.com/questions/23410505/how-to-run-hdfs-cluster-without-dns


 On Mon, May 5, 2014 at 11:00 PM, Benjamin bboui...@gmail.com wrote:

Hi,

Before considering running on Mesos, did you try to submit the
application on Spark deployed without Mesos on Docker containers ?

Currently investigating this idea to deploy quickly a complete set of
clusters with Docker, I'm interested by your findings on sharing the
settings of Kafka and Zookeeper across nodes. How many broker and zookeeper
do you use ?

Regards,



On Mon, May 5, 2014 at 10:11 PM, Gerard Maas gerard.m...@gmail.com wrote:
   Hi all,

   I'm currently working on creating a set of docker images to
   facilitate local development with Spark/streaming on Mesos (+zk, hdfs,
   kafka)

   After solving the initial hurdles to get things working together in
   docker containers, now everything seems to start-up correctly and the 
 mesos
   UI shows slaves as they are started.

   I'm trying to submit a job from IntelliJ and the job submissions
   seem to get lost in Mesos translation. The logs are not helping me
   figure out what's wrong, so I'm posting them here in the hope that they
   ring a bell and somebody can give me a hint on what's wrong/missing
   with my setup.


    DRIVER (IntelliJ running a Job.scala main) 
   14/05/05 21:52:31 INFO MetadataCleaner: Ran metadata cleaner for
   SHUFFLE_BLOCK_MANAGER
   14/05/05 21:52:31 INFO BlockManager: Dropping broadcast blocks
   older than 1399319251962
   14/05/05 21:52:31 INFO 

Re: master attempted to re-register the worker and then took all workers as unregistered

2014-05-14 Thread Siyuan he
Hi Cheney,
Which mode are you running, YARN or standalone?
I got the same exception when I ran Spark on YARN.


On Tue, May 6, 2014 at 10:06 PM, Cheney Sun sun.che...@gmail.com wrote:

 Hi Nan,

 In the worker's log, I see the following exception thrown when it tries to
 launch an executor. (The SPARK_HOME is wrongly specified on purpose, so
 there is no such file /usr/local/spark1/bin/compute-classpath.sh.)
 After the exception was thrown several times, the worker was requested to
 kill the executor. Following the kill, the worker tries to register again
 with the master, but the master rejects the registration with the WARN
 message "Got heartbeat from unregistered worker
 worker-20140504140005-host-spark-online001".

 It looks like the issue wasn't fixed in 0.9.1. Do you know of any pull
 request addressing this issue? Thanks.

 java.io.IOException: Cannot run program
 /usr/local/spark1/bin/compute-classpath.sh (in directory .): error=2,
 No such file or directory
 at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
 at
 org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:600)
 at
 org.apache.spark.deploy.worker.CommandUtils$.buildJavaOpts(CommandUtils.scala:58)
 at
 org.apache.spark.deploy.worker.CommandUtils$.buildCommandSeq(CommandUtils.scala:37)
 at
 org.apache.spark.deploy.worker.ExecutorRunner.getCommandSeq(ExecutorRunner.scala:104)
 at
 org.apache.spark.deploy.worker.ExecutorRunner.fetchAndRunExecutor(ExecutorRunner.scala:119)
 at
 org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:59)
 Caused by: java.io.IOException: error=2, No such file or directory
 at java.lang.UNIXProcess.forkAndExec(Native Method)
 at java.lang.UNIXProcess.init(UNIXProcess.java:135)
 at java.lang.ProcessImpl.start(ProcessImpl.java:130)
 at java.lang.ProcessBuilder.start(ProcessBuilder.java:1021)
 ... 6 more
 ..
 14/05/04 21:35:45 INFO Worker: Asked to kill executor
 app-20140504213545-0034/18
 14/05/04 21:35:45 INFO Worker: Executor app-20140504213545-0034/18
 finished with state FAILED message class java.io.IOException: Cannot run
 program /usr/local/spark1/bin/compute-classpath.sh (in directory .):
 error=2, No such file or directory
 14/05/04 21:35:45 ERROR OneForOneStrategy: key not found:
 app-20140504213545-0034/18
 java.util.NoSuchElementException: key not found: app-20140504213545-0034/18
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
 at
 org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:232)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 14/05/04 21:35:45 INFO Worker: Starting Spark worker
 host-spark-online001:7078 with 10 cores, 28.0 GB RAM
 14/05/04 21:35:45 INFO Worker: Spark home: /usr/local/spark-0.9.1-cdh4.2.0
 14/05/04 21:35:45 INFO WorkerWebUI: Started Worker web UI at
 http://host-spark-online001:8081
 14/05/04 21:35:45 INFO Worker: Connecting to master
 spark://host-spark-online001:7077...
 14/05/04 21:35:45 INFO Worker: Successfully registered with master
 spark://host-spark-online001:7077





Re: Unable to load native-hadoop library problem

2014-05-14 Thread Shivani Rao
Hello Sophia,

You are only providing the Spark jar here (a Spark jar that contains Hadoop
libraries in it, admittedly, but that is not sufficient). Where is your
Hadoop installed? (Most probably /usr/lib/hadoop/*.)

So I guess you need to add that to your classpath (using -cp). Let me know
if that works.

Shivani


On Tue, May 6, 2014 at 6:25 PM, Sophia sln-1...@163.com wrote:

 Hi everyone,
 [root@CHBM220 spark-0.9.1]#

 SPARK_JAR=.assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
 ./bin/spark-class org.apache.spark.deploy.yarn.Client --jar
 examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar --class
 org.apache.spark.examples.SparkPi --args yarn-standalone --num-workers 3
 --master-memory 2g --worker-memory 2g --worker-cores 1
 14/05/07 09:05:14 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 14/05/07 09:05:14 INFO RMProxy: Connecting to ResourceManager at
 CHBM220/192.168.10.220:8032
 Then it stopped. My HADOOP_CONF_DIR has been configured correctly, so what
 should I do?
 Wishing you a happy day.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-load-native-hadoop-library-problem-tp5469.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




-- 
Software Engineer
Analytics Engineering Team@ Box
Mountain View, CA


Express VMs - good idea?

2014-05-14 Thread Marco Shaw
Hi,

I've wanted to play with Spark, so I wanted to fast-track things and just use
one of the vendors' express VMs. I've tried Cloudera CDH 5.0 and
Hortonworks HDP 2.1.

I've not written down all of my issues, but for certain, when I try to run
spark-shell it doesn't work.  Cloudera seems to crash, and both complain
when I try to use SparkContext in a simple Scala command.

So, just a basic question on whether anyone has had success getting these
express VMs to work properly with Spark *out of the box* (HDP does require
you to install Spark manually).

I know Cloudera recommends 8GB of RAM, but I've been running it with 4GB.

Could it be that 4GB is just not enough and is causing issues, or have others
had success using these Hadoop 2.x pre-built VMs with Spark 0.9.x?

Marco


Re: instalation de spark

2014-05-14 Thread Madhu
I've forgotten most of my French.

You can download a Spark binary or build from source.
This is how I build from source:

Download and install sbt:

http://www.scala-sbt.org/
I installed it in C:\sbt.
Check C:\sbt\conf\sbtconfig.txt and use these options:

-Xmx512M
-XX:MaxPermSize=256m
-XX:ReservedCodeCacheSize=128m

Download and install git:

http://git-scm.com/downloads
Make sure git is in your PATH

Download and untar Spark into some directory, e.g. C:\spark-0.9.1

cd C:\spark-0.9.1
C:\sbt\bin\sbt assembly
Do something else; this takes some time.

Enjoy!

bin\spark-shell.cmd
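
Once the shell comes up, a quick smoke test (just a sketch) to confirm the
build works:

// should print 55
val xs = sc.parallelize(1 to 10)
println(xs.reduce(_ + _))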

This assumes you have Java installed; I have Java 7.
You can install Scala 2.10.x for Scala development.
I have Python 2.7.6 for PySpark.
I use the ScalaIDE Eclipse plugin.

Let me know how it works out.




-
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/instalation-de-spark-tp5689p5724.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.