Re: How to read a multipart s3 file?

2014-05-11 Thread Nicholas Chammas
On Tue, May 6, 2014 at 10:07 PM, kamatsuoka  wrote:

> I was using s3n:// but I got frustrated by how
> slow it is at writing files.
>

I'm curious: How slow is slow? How long does it take you, for example, to
save a 1GB file to S3 using s3n vs s3?


java.lang.NoSuchMethodError on Java API

2014-05-11 Thread Alessandro De Carli
Dear All,

I'm new to the whole Spark framework, but I've already fallen in love with it
:). For a research project at the University of Zurich I'm trying to
implement a Matrix Centroid Decomposition in Spark. I'm using the Java
API.

My problem occurs when I try to call a JavaPairRDD.reduce:
"""
java.lang.NoSuchMethodError:
org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2;
"""

I read in a forum post that the issue here might be that I'm using
the 0.9.1 version from Maven
(http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-from-Spark-Java-td4937.html)

I downloaded the Git version of Spark and compiled it with sbt, but I
don't really know how to force my Java project to use that build
instead of the Maven version.

Does anyone have advice on how I could achieve this? I'm using
IntelliJ as my IDE and the project is set up with Maven.


Best Regards
-- 
Alessandro De Carli
Sonnmattstr. 121
CH-5242 Birr

Email: decarli@gmail.com
Twitter: @a_d_c_
Tel: +41 76 305 75 00
Web: http://www.papers.ch


Spark LIBLINEAR

2014-05-11 Thread Chieh-Yen
Dear all,

Recently we released a distributed extension of LIBLINEAR at

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/

Currently, TRON for logistic regression and L2-loss SVM is supported.
We provided both MPI and Spark implementations.
This is very preliminary so your comments are very welcome.

Thanks,
Chieh-Yen


Re: Spark LIBLINEAR

2014-05-11 Thread DB Tsai
Dear Prof. Lin,

Interesting! We have an implementation of L-BFGS in Spark, and it has already
been merged upstream.

We read your paper comparing TRON and OWL-QN for logistic regression with
L1 regularization (http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf), but it seems that
it does not cover the distributed setup.

It will be very interesting to see L2 logistic regression benchmark
results in Spark comparing your TRON optimizer with the L-BFGS optimizer on
different datasets (sparse, dense, wide, etc.).

I'll try your TRON out soon.


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sun, May 11, 2014 at 1:49 AM, Chieh-Yen wrote:

> Dear all,
>
> Recently we released a distributed extension of LIBLINEAR at
>
> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/
>
> Currently, TRON for logistic regression and L2-loss SVM is supported.
> We provided both MPI and Spark implementations.
> This is very preliminary so your comments are very welcome.
>
> Thanks,
> Chieh-Yen
>


Re: Is there anything that I need to modify?

2014-05-11 Thread Arpit Tak
Try adding a hostname-to-IP mapping in /etc/hosts; it is not able to resolve the IP to a hostname.
Try this:
192.168.10.220  CHBM220




On Wed, May 7, 2014 at 12:50 PM, Sophia  wrote:

> [root@CHBM220 spark-0.9.1]#
>
> SPARK_JAR=.assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
> ./bin/spark-class org.apache.spark.deploy.yarn.Client --jar
> examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar --class
> org.apache.spark.examples.SparkPi --args yarn-standalone --num-workers 3
> --master-memory 2g --worker-memory 2g --worker-cores 1
> 14:50:45,485%5P RMProxy:56-Connecting to ResourceManager at
> CHBM220/192.168.10.220:8032
> Exception in thread "main" java.io.IOException: Failed on local exception:
> com.google.protobuf.InvalidProtocolBufferException: Protocol message
> contained an invalid tag (zero).; Host Details : local host is:
> "CHBM220/192.168.10.220"; destination host is: "CHBM220":8032;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
> at org.apache.hadoop.ipc.Client.call(Client.java:1351)
> at org.apache.hadoop.ipc.Client.call(Client.java:1300)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> at com.sun.proxy.$Proxy7.getClusterMetrics(Unknown Source)
> at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152)
> at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy8.getClusterMetrics(Unknown Source)
> at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246)
> at org.apache.spark.deploy.yarn.Client.logClusterResourceDetails(Client.scala:144)
> at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:79)
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:115)
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:493)
> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
> at com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
> at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
> at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.(RpcHeaderProtos.java:1398)
> at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.(RpcHeaderProtos.java:1362)
> at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto$1.parsePartialFrom(RpcHeaderProtos.java:1492)
> at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto$1.parsePartialFrom(RpcHeaderProtos.java:1487)
> at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
> at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:241)
> at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
> at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
> at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
> at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.parseDelimitedFrom(RpcHeaderProtos.java:2364)
> at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:996)
> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:891)
> [root@CHBM220:spark-0.9.1]#
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-anything-that-I-need-to-modify-tp5477.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Re: java.io.FileNotFoundException: /test/spark-0.9.1/work/app-20140505053550-0000/2/stdout (No such file or directory)

2014-05-11 Thread Francis . Hu
I have just resolved the problem by running the master and worker daemons
individually on the machines where they live.

If I start them via sbin/start-all.sh, the problem always exists.

 

 

From: Francis.Hu [mailto:francis...@reachjunction.com]
Sent: Tuesday, May 06, 2014 10:31
To: user@spark.apache.org
Subject: Re: Re: java.io.FileNotFoundException:
/test/spark-0.9.1/work/app-20140505053550-/2/stdout (No such file or
directory)

 

I looked into the log again; all the exceptions are FileNotFoundExceptions.
In the web UI there is no more info I can check apart from the basic description of
the job.

I attached the log file; could you take a look? Thanks.

 

Francis.Hu

 

From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Tuesday, May 06, 2014 10:16
To: user@spark.apache.org
Subject: Re: Re: java.io.FileNotFoundException:
/test/spark-0.9.1/work/app-20140505053550-/2/stdout (No such file or
directory)

 

Can you check the Spark worker logs on that machine, either from the web UI or
directly? They should be under /test/spark-XXX/logs/. See if they show any error.

If there is no permission issue, I am not sure why stdout and stderr are not being
generated.

 

TD

 

On Mon, May 5, 2014 at 7:13 PM, Francis.Hu  wrote:

The file does indeed not exist, and there is no permission issue.

 

francis@ubuntu-4:/test/spark-0.9.1$ ll work/app-20140505053550-/

total 24

drwxrwxr-x  6 francis francis 4096 May  5 05:35 ./

drwxrwxr-x 11 francis francis 4096 May  5 06:18 ../

drwxrwxr-x  2 francis francis 4096 May  5 05:35 2/

drwxrwxr-x  2 francis francis 4096 May  5 05:35 4/

drwxrwxr-x  2 francis francis 4096 May  5 05:35 7/

drwxrwxr-x  2 francis francis 4096 May  5 05:35 9/

 

Francis

 

From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Tuesday, May 06, 2014 3:45
To: user@spark.apache.org
Subject: Re: java.io.FileNotFoundException:
/test/spark-0.9.1/work/app-20140505053550-/2/stdout (No such file or
directory)

 

Do those files actually exist? The stdout/stderr files should contain the output of the
Spark executors running on the workers, and it's weird that they don't exist.
Could it be a permission issue? Maybe the directories/files are not being generated
because the worker cannot write them.

 

TD

 

On Mon, May 5, 2014 at 3:06 AM, Francis.Hu  wrote:

Hi,All

 

 

We run a Spark cluster with three workers. We created a Spark Streaming
application, then ran the project using the command below:

 

shell> sbt run spark://192.168.219.129:7077 tcp://192.168.20.118:5556 foo

 

We looked at the workers' web UI: the jobs failed without any error or info, but a
FileNotFoundException occurred in the workers' log files, as shown below.
Is this a known issue in Spark?

 

 

-in workers' 
logs/spark-francis-org.apache.spark.deploy.worker.Worker-1-ubuntu-4.out

 

14/05/05 02:39:39 WARN AbstractHttpConnection: /logPage/?appId=app-20140505053550-&executorId=2&logType=stdout
java.io.FileNotFoundException: /test/spark-0.9.1/work/app-20140505053550-/2/stdout (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.(FileInputStream.java:138)
at org.apache.spark.util.Utils$.offsetBytes(Utils.scala:687)
at org.apache.spark.deploy.worker.ui.WorkerWebUI.logPage(WorkerWebUI.scala:119)
at org.apache.spark.deploy.worker.ui.WorkerWebUI$$anonfun$6.apply(WorkerWebUI.scala:52)
at org.apache.spark.deploy.worker.ui.WorkerWebUI$$anonfun$6.apply(WorkerWebUI.scala:52)
at org.apache.spark.ui.JettyUtils$$anon$1.handle(JettyUtils.scala:61)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1040)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:976)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:363)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:483)
at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:920)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:982)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:635)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:628)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
at org.eclipse.jetty.util.thread.QueuedThreadP

Re: How can adding a random count() change the behavior of my program?

2014-05-11 Thread Walrus theCat
Nick,

I have encountered strange things like this before (usually when
programming with mutable structures and side-effects), and for me, the
answer was that, until .count (or .first, or similar), is called, your
variable 'a' refers to a set of instructions that only get executed to form
the object you expect when you're asking something of it.  Back before I
was using side-effect-free techniques on immutable data structures, I had
to call .first or .count or similar to get the behavior I wanted.  There
are still special cases where I have to purposefully "collapse" the RDD for
some reason or another.  This may not be new information to you, but I've
encountered similar behavior before and highly suspect this is playing a
role here.
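To make the laziness concrete, here is a minimal Scala sketch (the path and the
master URL are made-up placeholders, not taken from your program):

import org.apache.spark.{SparkConf, SparkContext}

object LazyRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-rdd-sketch").setMaster("local[2]"))
    // Transformations only record a lineage of instructions; nothing has run yet.
    val a = sc.textFile("/tmp/input.txt").map(line => (line.length, line))
    // The first action forces the whole pipeline to actually execute.
    println(a.count())
    sc.stop()
  }
}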


On Mon, May 5, 2014 at 5:52 PM, Nicholas Chammas  wrote:

> I’m running into something very strange today. I’m getting an error on the
> follow innocuous operations.
>
> a = sc.textFile('s3n://...')
> a = a.repartition(8)
> a = a.map(...)
> c = a.countByKey() # ERRORs out on this action. See below for traceback. [1]
>
> If I add a count() right after the repartition(), this error magically
> goes away.
>
> a = sc.textFile('s3n://...')
> a = a.repartition(8)
> print a.count()
> a = a.map(...)
> c = a.countByKey() # A-OK! WTF?
>
> To top it off, this “fix” is inconsistent. Sometimes, I still get this
> error.
>
> This is strange. How do I get to the bottom of this?
>
> Nick
>
> [1] Here’s the traceback:
>
> Traceback (most recent call last):
>   File "", line 7, in 
>   File "file.py", line 187, in function_blah
>     c = a.countByKey()
>   File "/root/spark/python/pyspark/rdd.py", line 778, in countByKey
>     return self.map(lambda x: x[0]).countByValue()
>   File "/root/spark/python/pyspark/rdd.py", line 624, in countByValue
>     return self.mapPartitions(countPartition).reduce(mergeMaps)
>   File "/root/spark/python/pyspark/rdd.py", line 505, in reduce
>     vals = self.mapPartitions(func).collect()
>   File "/root/spark/python/pyspark/rdd.py", line 469, in collect
>     bytesInJava = self._jrdd.collect().iterator()
>   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
>   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o46.collect.
>
>
> --
> View this message in context: How can adding a random count() change the behavior of my program?
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Error starting EC2 cluster

2014-05-11 Thread wxhsdp
The SSH "connection refused" error is due to not waiting long enough; the remote
machine is not ready at that time. I set the wait time to 500 seconds and it works.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-starting-EC2-cluster-tp5332p5501.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Fwd: Is there a way to load a large file from HDFS faster into Spark

2014-05-11 Thread Mayur Rustagi
.gz files are not splittable and hence harder to process. The easiest fix is to move to
a splittable compression codec such as LZO and break the file into multiple blocks
that can be read and processed in parallel.
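If switching codecs isn't an option right away, one workaround (a rough sketch; the
path and partition count below are assumptions) is to accept the single-partition read
and repartition immediately afterwards so the downstream work is spread across the cluster:

import org.apache.spark.{SparkConf, SparkContext}

object GzipRepartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("gzip-repartition-sketch"))
    // A .gz file is not splittable, so it is read as a single partition by one task.
    val raw = sc.textFile("hdfs:///data/big-file.gz")
    // Redistribute the records so later stages can use all cores in the cluster.
    val spread = raw.repartition(48).cache()
    println(spread.count())
    sc.stop()
  }
}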
On 11 May 2014 09:01, "Soumya Simanta"  wrote:

>
>
> I've a Spark cluster with 3 worker nodes.
>
>
>- *Workers:* 3
>- *Cores:* 48 Total, 48 Used
>- *Memory:* 469.8 GB Total, 72.0 GB Used
>
> I want to process a single compressed (*.gz) file on HDFS. The file is
> 1.5GB compressed and 11GB uncompressed.
> When I try to read the compressed file from HDFS it takes a while (4-5
> minutes) to load it into an RDD. If I use the .cache operation it takes even
> longer. Is there a way to make loading of the RDD from HDFS faster?
>
> Thanks
>  -Soumya
>
>
>


Re: Fwd: Is there a way to load a large file from HDFS faster into Spark

2014-05-11 Thread Soumya Simanta
Yep. I figured that out. I uncompressed the file and it looks much faster
now. Thanks.



On Sun, May 11, 2014 at 8:14 AM, Mayur Rustagi wrote:

> .gz files are not splittable hence harder to process. Easiest is to move
> to a splittable compression like lzo and break file into multiple blocks to
> be read and for subsequent processing.
> On 11 May 2014 09:01, "Soumya Simanta"  wrote:
>
>>
>>
>> I've a Spark cluster with 3 worker nodes.
>>
>>
>>- *Workers:* 3
>>- *Cores:* 48 Total, 48 Used
>>- *Memory:* 469.8 GB Total, 72.0 GB Used
>>
>> I want to process a single compressed (*.gz) file on HDFS. The file is
>> 1.5GB compressed and 11GB uncompressed.
>> When I try to read the compressed file from HDFS it takes a while (4-5
>> minutes) to load it into an RDD. If I use the .cache operation it takes even
>> longer. Is there a way to make loading of the RDD from HDFS faster?
>>
>> Thanks
>>  -Soumya
>>
>>
>>


Re: java.lang.NoSuchMethodError on Java API

2014-05-11 Thread Madhu
Alessandro,

I'm using Eclipse, IntelliJ settings will be similar.
I created a standard project, without maven.

For me, the easiest way was to add this jar to my Eclipse project build
path:

/assembly/target/scala-2.10/spark-assembly-x.x.x-hadoop1.0.4.jar

It works for either Java or Scala plugin.
That Jar is quite large (~100MB) and has everything in it.
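If you would rather keep the project on Maven or sbt, another option (just a sketch,
not something I have verified against this exact setup; the version string below is an
assumption) is to publish the locally built Spark to your local repository with
sbt/sbt publish-local in the Spark checkout, and then depend on that version, e.g. in build.sbt:

scalaVersion := "2.10.4"  // assumption: match the Scala version of your Spark build

// Use whatever version string your local Spark build publishes (e.g. a SNAPSHOT).
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT"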



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-NoSuchMethodError-on-Java-API-tp5545p5552.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Test

2014-05-11 Thread Azuryy
Got it.

But that doesn't mean everyone received this test.

The mailing list has been unstable recently.


Sent from my iPhone5s

> On May 10, 2014, at 13:31, Matei Zaharia  wrote:
> 
> This message has no content.


Re: details about event log

2014-05-11 Thread Andrew Or
Hi wxhsdp,

These times are computed from Java's System.currentTimeMillis(), which is "the
difference, measured in milliseconds, between the current time and
midnight, January 1, 1970 UTC." Thus, this quantity doesn't mean much by
itself, but is only meaningful when you subtract it from another
System.currentTimeMillis() to find the time elapsed. For instance, in your
case (Finish Time - Launch Time) = 1862, which means it took 1862 ms for
the task to complete (but the actual execution only took 1781 ms, the rest
being overhead).

Correct, (Shuffle Finish Time - Launch Time) is the total time it took for
this task to fetch blocks, and (Finish Time - Shuffle Finish Time) is the
actual execution after fetching all blocks.

"Fetch Wait Time" on the other hand is the time spent blocking on the
thread to wait for shuffle blocks while not doing anything else. For
instance, the example given in the code comments is: "if block B is being
fetched while the task is not finished with processing block A, it is not
considered to be blocking on block B."

"Shuffle Write Time" is the time written to write the shuffle files (only
for ShuffleMapTask). It is in nanoseconds, which is slightly inconsistent
with other values in these metrics.
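To make the arithmetic concrete, here is a tiny sketch using made-up epoch
timestamps (only the differences matter):

object TaskMetricsArithmeticSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical System.currentTimeMillis() values as they would appear in the event log.
    val launchTime        = 1399999990000L
    val shuffleFinishTime = 1399999990081L
    val finishTime        = 1399999991862L

    val fetchPhase = shuffleFinishTime - launchTime // time spent fetching shuffle blocks (ms)
    val execPhase  = finishTime - shuffleFinishTime // execution after all blocks arrived (ms)
    val total      = finishTime - launchTime        // total task time (ms), 1862 in this example
    println(s"fetch=$fetchPhase ms, exec=$execPhase ms, total=$total ms")
  }
}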

By the way, all of this information is available in the code comments:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala




On Tue, May 6, 2014 at 11:10 PM, wxhsdp  wrote:

> any ideas?  thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/details-about-event-log-tp5411p5476.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Is there any problem on the spark mailing list?

2014-05-11 Thread lukas nalezenec
There was an outage: https://blogs.apache.org/infra/entry/mail_outage



On Fri, May 9, 2014 at 1:27 PM, wxhsdp  wrote:

> i think so, fewer questions and answers these three days
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-any-problem-on-the-spark-mailing-list-tp5509p5522.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Turn BLAS on MacOSX

2014-05-11 Thread DB Tsai
Hi Debasish,

The Dependencies section of
https://github.com/apache/spark/blob/master/docs/mllib-guide.md talks about
the native BLAS dependency issue.

For netlib which breeze uses internally, if the native library isn't found,
the jblas implementation will be used.

Here is more detail about how to install the native libraries on different
platforms.
https://github.com/fommil/netlib-java/blob/master/README.md#machine-optimised-system-libraries
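If you want to confirm at runtime which implementation netlib-java actually loaded,
a quick check like the sketch below works (it assumes netlib-java is on the classpath,
which it is when MLlib/breeze is used):

object BlasBackendCheck {
  def main(args: Array[String]): Unit = {
    // Prints e.g. ...NativeSystemBLAS or ...NativeRefBLAS when a native library was found,
    // and the pure-Java fallback implementation otherwise.
    println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)
  }
}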


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, May 7, 2014 at 10:52 AM, Debasish Das wrote:

> Hi,
>
> How do I load native BLAS libraries on Mac ?
>
> I am getting the following errors while running LR and SVM with SGD:
>
> 14/05/07 10:48:13 WARN BLAS: Failed to load implementation from:
> com.github.fommil.netlib.NativeSystemBLAS
>
> 14/05/07 10:48:13 WARN BLAS: Failed to load implementation from:
> com.github.fommil.netlib.NativeRefBLAS
>
> centos it was fine...but on mac I am getting these warnings..
>
> Also when it fails to run native blas does it use java code for BLAS
> operations ?
>
> May be after addition of breeze, we should add these details on a page as
> well so that users are aware of it before they report any performance
> results..
>
> Thanks.
>
> Deb
>


Re: why is Spark 0.9.1 (context creation?) so slow on my OSX laptop?

2014-05-11 Thread Madhu
Svend,

I built it on my iMac and it was about the same speed as Windows 7, RHEL 6
VM on Windows 7, and Linux on EC2. Spark is pleasantly easy to build on all
of these platforms, which is wonderful.

How long does it take to start spark-shell?
Maybe it's a JVM memory setting problem on your Laptop?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/why-is-Spark-0-9-1-context-creation-so-slow-on-my-OSX-laptop-tp5535p5556.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: writing my own RDD

2014-05-11 Thread Koert Kuipers
resending... my email somehow never made it to the user list.


On Fri, May 9, 2014 at 2:11 PM, Koert Kuipers  wrote:

> in writing my own RDD i ran into a few issues with respect to stuff being
> private in spark.
>
> in compute i would like to return an iterator that respects task killing
> (as HadoopRDD does), but the mechanics for that are inside the private
> InterruptibleIterator. also the exception i am supposed to throw
> (TaskKilledException) is private to spark.
>


Re: Is there any problem on the spark mailing list?

2014-05-11 Thread Chris Fregly
By the way, you can see all the "missing" messages from May 7th (the start of the outage)
here:
http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/browser

The last message I received in my inbox was this one:

Cheney Sun, "Re: master attempted to re-register the worker and then took all
workers as unregistered", Wed, 07 May, 02:06

You might want to repost if you initiated a message from that point until
this morning.


On Sun, May 11, 2014 at 8:31 AM, lukas nalezenec
wrote:

> There was an outage: https://blogs.apache.org/infra/entry/mail_outage
>
>
>
> On Fri, May 9, 2014 at 1:27 PM, wxhsdp  wrote:
>
>> i think so, fewer questions and answers these three days
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-any-problem-on-the-spark-mailing-list-tp5509p5522.html
>>
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>


Re: is Mesos falling out of favor?

2014-05-11 Thread Gary Malouf
For what it is worth, our team here at
MediaCrossing has
been using the Spark/Mesos combination since last summer with much success
(low operations overhead, high developer performance).

IMO, Hadoop is overcomplicated from both a development and operations
perspective so I am looking to lower our dependencies on it, not increase
them.  Our stack currently includes:


   - Spark 0.9.1
   - Mesos 0.17
   - Chronos
   - HDFS (CDH 5.0-mr1)
   - Flume 1.4.0
   - ZooKeeper
   - Cassandra 2.0 (key-value store alternative to HBase)
   - Storm 0.9 (which we currently prefer to Spark Streaming)

We've used Shark in the past as well, but since most of us prefer the Spark
Shell we have not been maintaining it.

Using Mesos to run Spark allows for us to optimize our available resources
(CPU + RAM currently ) between Spark, Chronos and a number of other
services.  I see YARN as being heavily focused on MR2, but the reality is
we are using Spark in large part because writing MapReduce jobs is verbose,
hard to maintain and not performant (against Spark).  We have the advantage
of not having any real legacy Map/Reduce jobs to maintain, so that
consideration does not come into play.

Finally, I am a believer that for the long term direction of our company,
the Berkeley stack  will serve us
best.  Leveraging Mesos and Spark from the onset paves the way for this.


On Sun, May 11, 2014 at 1:28 PM, Paco Nathan  wrote:

> That's FUD. Tracking the Mesos and Spark use cases, there are very large
> production deployments of these together. Some are rather private but
> others are being surfaced. IMHO, one of the most amazing case studies is
> from Christina Delimitrou http://youtu.be/YpmElyi94AA
>
> For a tutorial, use the following but upgrade it to latest production for
> Spark. There was a related O'Reilly webcast and Strata tutorial as well:
> http://mesosphere.io/learn/run-spark-on-mesos/
>
> FWIW, I teach "Intro to Spark" with sections on CM4, YARN, Mesos, etc.
> Based on lots of student experiences, Mesos is clearly the shortest path to
> deploying a Spark cluster if you want to leverage the robustness,
> multi-tenancy for mixed workloads, less ops overhead, etc., that show up
> repeatedly in the use case analyses.
>
> My opinion only and not that of any of my clients: "Don't believe the FUD
> from YHOO unless you really want to be stuck in 2009."
>
>
> On Wed, May 7, 2014 at 8:30 AM, deric  wrote:
>
>> I'm also using right now SPARK_EXECUTOR_URI, though I would prefer
>> distributing Spark as a binary package.
>>
>> For running examples with `./bin/run-example ...` it works fine, however
>> tasks from spark-shell are getting lost.
>>
>> Error: Could not find or load main class
>> org.apache.spark.executor.MesosExecutorBackend
>>
>> which looks more like problem with sbin/spark-executor and missing paths
>> to
>> jar. Anyone encountered this error before?
>>
>> I guess Yahoo invested quite a lot of effort into YARN and Spark
>> integration
>> (moreover when Mahout is migrating to Spark there's much more interest in
>> Hadoop and Spark integration). If there would be some "Mesos company"
>> working on Spark - Mesos integration it could be at least on the same
>> level.
>>
>> I don't see any other reason why would be YARN better than Mesos,
>> personally
>> I like the latter, however I haven't checked YARN for a while, maybe
>> they've
>> made a significant progress. I think Mesos is more universal and flexible
>> than YARN.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/is-Mesos-falling-out-of-favor-tp5444p5481.html
>>
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>


Re: java.lang.NoSuchMethodError on Java API

2014-05-11 Thread Alessandro De Carli
Madhu,

Thank you! I have now switched to Eclipse and imported the assembly jar; the IDE
successfully finds the imports. But when I try to run my code I get
"java.lang.NoClassDefFoundError:
org/apache/spark/api/java/function/PairFunction". Is there anything special
to consider when I want to run my development code directly from Eclipse? (I
checked the build path settings and made sure to completely export the
assembly jar.)

Best Regards
On May 11, 2014 4:48 PM, "Madhu"  wrote:

> Alessandro,
>
> I'm using Eclipse, IntelliJ settings will be similar.
> I created a standard project, without maven.
>
> For me, the easiest way was to add this jar to my Eclipse project build
> path:
>
>  dir>/assembly/target/scala-2.10/spark-assembly-x.x.x-hadoop1.0.4.jar
>
> It works for either Java or Scala plugin.
> That Jar is quite large (~100MB) and has everything in it.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-NoSuchMethodError-on-Java-API-tp5545p5552.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: is Mesos falling out of favor?

2014-05-11 Thread Paco Nathan
That's FUD. Tracking the Mesos and Spark use cases, there are very large
production deployments of these together. Some are rather private but
others are being surfaced. IMHO, one of the most amazing case studies is
from Christina Delimitrou http://youtu.be/YpmElyi94AA

For a tutorial, use the following but upgrade it to latest production for
Spark. There was a related O'Reilly webcast and Strata tutorial as well:
http://mesosphere.io/learn/run-spark-on-mesos/

FWIW, I teach "Intro to Spark" with sections on CM4, YARN, Mesos, etc.
Based on lots of student experiences, Mesos is clearly the shortest path to
deploying a Spark cluster if you want to leverage the robustness,
multi-tenancy for mixed workloads, less ops overhead, etc., that show up
repeatedly in the use case analyses.

My opinion only and not that of any of my clients: "Don't believe the FUD
from YHOO unless you really want to be stuck in 2009."


On Wed, May 7, 2014 at 8:30 AM, deric  wrote:

> I'm also using right now SPARK_EXECUTOR_URI, though I would prefer
> distributing Spark as a binary package.
>
> For running examples with `./bin/run-example ...` it works fine, however
> tasks from spark-shell are getting lost.
>
> Error: Could not find or load main class
> org.apache.spark.executor.MesosExecutorBackend
>
> which looks more like problem with sbin/spark-executor and missing paths to
> jar. Anyone encountered this error before?
>
> I guess Yahoo invested quite a lot of effort into YARN and Spark
> integration
> (moreover when Mahout is migrating to Spark there's much more interest in
> Hadoop and Spark integration). If there would be some "Mesos company"
> working on Spark - Mesos integration it could be at least on the same
> level.
>
> I don't see any other reason why would be YARN better than Mesos,
> personally
> I like the latter, however I haven't checked YARN for a while, maybe
> they've
> made a significant progress. I think Mesos is more universal and flexible
> than YARN.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/is-Mesos-falling-out-of-favor-tp5444p5481.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: cant get tests to pass anymore on master master

2014-05-11 Thread Koert Kuipers
resending because the list didnt seem to like my email before


On Wed, May 7, 2014 at 5:01 PM, Koert Kuipers  wrote:

> i used to be able to get all tests to pass.
>
> with java 6 and sbt i get PermGen errors (no matter how high i make the
> PermGen). so i have given up on that.
>
> with java 7 i see 1 error in a bagel test and a few in streaming tests.
> any ideas? see the error in BagelSuite below.
>
> [info] - large number of iterations *** FAILED *** (10 seconds, 105 milliseconds)
> [info]   The code passed to failAfter did not complete within 10 seconds. (BagelSuite.scala:85)
> [info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
> [info]   at org.scalatest.concurrent.Timeouts$$anonfun$failAfter$1.apply(Timeouts.scala:250)
> [info]   at org.scalatest.concurrent.Timeouts$$anonfun$failAfter$1.apply(Timeouts.scala:250)
> [info]   at org.scalatest.concurrent.Timeouts$class.timeoutAfter(Timeouts.scala:282)
> [info]   at org.scalatest.concurrent.Timeouts$class.failAfter(Timeouts.scala:246)
> [info]   at org.apache.spark.bagel.BagelSuite.failAfter(BagelSuite.scala:32)
> [info]   at org.apache.spark.bagel.BagelSuite$$anonfun$3.apply$mcV$sp(BagelSuite.scala:85)
> [info]   at org.apache.spark.bagel.BagelSuite$$anonfun$3.apply(BagelSuite.scala:85)
> [info]   at org.apache.spark.bagel.BagelSuite$$anonfun$3.apply(BagelSuite.scala:85)
> [info]   at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265)
> [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1974)
> [info]   at org.apache.spark.bagel.BagelSuite.withFixture(BagelSuite.scala:32)
>
>


Re: Spark LIBLINEAR

2014-05-11 Thread Debasish Das
Hello Prof. Lin,

Awesome news! I am curious whether you have any benchmarks comparing the C++ MPI
implementation with the Scala Spark LIBLINEAR implementation...

Is Spark LIBLINEAR Apache licensed, or are there any specific restrictions
on using it?

Apart from the native BLAS libraries (which each user has to manage by
pulling in their preferred proprietary BLAS package), all Spark code is Apache
licensed.

Thanks.
Deb


On Sun, May 11, 2014 at 3:01 AM, DB Tsai  wrote:

> Dear Prof. Lin,
>
> Interesting! We have an implementation of L-BFGS in Spark, and it has already
> been merged upstream.
>
> We read your paper comparing TRON and OWL-QN for logistic regression with
> L1 regularization (http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf), but it seems that
> it does not cover the distributed setup.
>
> It will be very interesting to see L2 logistic regression benchmark
> results in Spark comparing your TRON optimizer with the L-BFGS optimizer on
> different datasets (sparse, dense, wide, etc.).
>
> I'll try your TRON out soon.
>
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Sun, May 11, 2014 at 1:49 AM, Chieh-Yen wrote:
>
>> Dear all,
>>
>> Recently we released a distributed extension of LIBLINEAR at
>>
>> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/
>>
>> Currently, TRON for logistic regression and L2-loss SVM is supported.
>> We provided both MPI and Spark implementations.
>> This is very preliminary so your comments are very welcome.
>>
>> Thanks,
>> Chieh-Yen
>>
>
>


Re: How to use spark-submit

2014-05-11 Thread Soumya Simanta

Will sbt-pack and the Maven solution work for the Scala REPL?

I need the REPL because it saves a lot of time when I'm playing with large data
sets: I load them once, cache them, and then try things out interactively
before putting them in a standalone driver.

I have sbt working for my own driver program on Spark 0.9.



> On May 11, 2014, at 3:49 PM, Stephen Boesch  wrote:
> 
> Just discovered sbt-pack: that addresses (quite well) the last item for 
> identifying and packaging the external jars.
> 
> 
> 2014-05-11 12:34 GMT-07:00 Stephen Boesch :
>> HI Sonal,
>> Yes I am working towards that same idea.  How did you go about creating 
>> the non-spark-jar dependencies ?  The way I am doing it is a separate 
>> straw-man project that does not include spark but has the external third 
>> party jars included. Then running sbt compile:managedClasspath and reverse 
>> engineering the lib jars from it.  That is obviously not ideal.
>> 
>> The maven "run" will be useful for other projects built by maven: i will 
>> keep in my notes.
>> 
>> AFA sbt run-example, it requires additional libraries to be added for my 
>> external dependencies.  I tried several items including  ADD_JARS,  
>> --driver-class-path  and combinations of extraClassPath. I have deferred 
>> that ad-hoc approach to finding a systematic one.
>> 
>> 
>> 
>> 
>> 2014-05-08 5:26 GMT-07:00 Sonal Goyal :
>> 
>>> I am creating a jar with only my dependencies and run spark-submit through 
>>> my project mvn build. I have configured the mvn exec goal to the location 
>>> of the script. Here is how I have set it up for my app. The mainClass is my 
>>> driver program, and I am able to send my custom args too. Hope this helps.
>>> 
>>> <plugin>
>>>   <groupId>org.codehaus.mojo</groupId>
>>>   <artifactId>exec-maven-plugin</artifactId>
>>>   <executions>
>>>     <execution>
>>>       <goals>
>>>         <goal>exec</goal>
>>>       </goals>
>>>     </execution>
>>>   </executions>
>>>   <configuration>
>>>     <executable>/home/sgoyal/spark/bin/spark-submit</executable>
>>>     <arguments>
>>>       <argument>${jars}</argument>
>>>       <argument>--class</argument>
>>>       <argument>${mainClass}</argument>
>>>       <argument>--arg</argument>
>>>       <argument>${spark.master}</argument>
>>>       <argument>--arg</argument>
>>>       <argument>${my app arg 1}</argument>
>>>       <argument>--arg</argument>
>>>       <argument>${my arg 2}</argument>
>>>     </arguments>
>>>   </configuration>
>>> </plugin>
>>> 
>>> 
>>> Best Regards,
>>> Sonal
>>> Nube Technologies 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
 On Wed, May 7, 2014 at 6:57 AM, Tathagata Das 
  wrote:
 Doesnt the run-example script work for you? Also, are you on the latest 
 commit of branch-1.0 ?
 
 TD
 
 
> On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta  
> wrote:
> 
> 
> Yes, I'm struggling with a similar problem where my class are not found 
> on the worker nodes. I'm using 1.0.0_SNAPSHOT.  I would really appreciate 
> if someone can provide some documentation on the usage of spark-submit.
> 
> Thanks
> 
> > On May 5, 2014, at 10:24 PM, Stephen Boesch  wrote:
> >
> >
> > I have a spark streaming application that uses the external streaming 
> > modules (e.g. kafka, mqtt, ..) as well.  It is not clear how to 
> > properly invoke the spark-submit script: what are the 
> > ---driver-class-path and/or -Dspark.executor.extraClassPath parameters 
> > required?
> >
> >  For reference, the following error is proving difficult to resolve:
> >
> > java.lang.ClassNotFoundException: 
> > org.apache.spark.streaming.examples.StreamingExamples
> >
> 


Re: java.lang.NoSuchMethodError on Java API

2014-05-11 Thread Madhu
No, you don't need to do anything special to get it to run in Eclipse.
Just add the assembly jar to the build path, create a main method, add your
code, and click the green "run" button.

Can you post your code here?
I can try it in my environment.



-
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-NoSuchMethodError-on-Java-API-tp5545p5567.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How to use spark-submit

2014-05-11 Thread Stephen Boesch
Just discovered sbt-pack: that addresses (quite well) the last item for
identifying and packaging the external jars.


2014-05-11 12:34 GMT-07:00 Stephen Boesch :

> HI Sonal,
> Yes I am working towards that same idea.  How did you go about
> creating the non-spark-jar dependencies ?  The way I am doing it is a
> separate straw-man project that does not include spark but has the external
> third party jars included. Then running sbt compile:managedClasspath and
> reverse engineering the lib jars from it.  That is obviously not ideal.
>
> The maven "run" will be useful for other projects built by maven: i will
> keep in my notes.
>
> AFA sbt run-example, it requires additional libraries to be added for my
> external dependencies.  I tried several items including  ADD_JARS,
>  --driver-class-path  and combinations of extraClassPath. I have deferred
> that ad-hoc approach to finding a systematic one.
>
>
>
>
> 2014-05-08 5:26 GMT-07:00 Sonal Goyal :
>
> I am creating a jar with only my dependencies and run spark-submit through
>> my project mvn build. I have configured the mvn exec goal to the location
>> of the script. Here is how I have set it up for my app. The mainClass is my
>> driver program, and I am able to send my custom args too. Hope this helps.
>>
>> <plugin>
>>   <groupId>org.codehaus.mojo</groupId>
>>   <artifactId>exec-maven-plugin</artifactId>
>>   <executions>
>>     <execution>
>>       <goals>
>>         <goal>exec</goal>
>>       </goals>
>>     </execution>
>>   </executions>
>>   <configuration>
>>     <executable>/home/sgoyal/spark/bin/spark-submit</executable>
>>     <arguments>
>>       <argument>${jars}</argument>
>>       <argument>--class</argument>
>>       <argument>${mainClass}</argument>
>>       <argument>--arg</argument>
>>       <argument>${spark.master}</argument>
>>       <argument>--arg</argument>
>>       <argument>${my app arg 1}</argument>
>>       <argument>--arg</argument>
>>       <argument>${my arg 2}</argument>
>>     </arguments>
>>   </configuration>
>> </plugin>
>>
>>
>> Best Regards,
>> Sonal
>> Nube Technologies 
>>
>>  
>>
>>
>>
>>
>> On Wed, May 7, 2014 at 6:57 AM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Doesnt the run-example script work for you? Also, are you on the latest
>>> commit of branch-1.0 ?
>>>
>>> TD
>>>
>>>
>>> On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta >> > wrote:
>>>


 Yes, I'm struggling with a similar problem where my class are not found
 on the worker nodes. I'm using 1.0.0_SNAPSHOT.  I would really appreciate
 if someone can provide some documentation on the usage of spark-submit.

 Thanks

 > On May 5, 2014, at 10:24 PM, Stephen Boesch 
 wrote:
 >
 >
 > I have a spark streaming application that uses the external streaming
 modules (e.g. kafka, mqtt, ..) as well.  It is not clear how to properly
 invoke the spark-submit script: what are the ---driver-class-path and/or
 -Dspark.executor.extraClassPath parameters required?
 >
 >  For reference, the following error is proving difficult to resolve:
 >
 > java.lang.ClassNotFoundException:
 org.apache.spark.streaming.examples.StreamingExamples
 >

>>>
>>>
>>
>


Re: writing my own RDD

2014-05-11 Thread Aaron Davidson
You got a good point there, those APIs should probably be marked as
@DeveloperAPI. Would you mind filing a JIRA for that (
https://issues.apache.org/jira/browse/SPARK)?


On Sun, May 11, 2014 at 11:51 AM, Koert Kuipers  wrote:

> resending... my email somehow never made it to the user list.
>
>
> On Fri, May 9, 2014 at 2:11 PM, Koert Kuipers  wrote:
>
>> in writing my own RDD i ran into a few issues with respect to stuff being
>> private in spark.
>>
>> in compute i would like to return an iterator that respects task killing
>> (as HadoopRDD does), but the mechanics for that are inside the private
>> InterruptibleIterator. also the exception i am supposed to throw
>> (TaskKilledException) is private to spark.
>>
>
>


Re: Comprehensive Port Configuration reference?

2014-05-11 Thread Mark Baker
On Tue, May 6, 2014 at 9:09 AM, Jacob Eisinger  wrote:
> In a nutshell, Spark opens up a couple of well-known ports. And then the
> workers and the shell open up dynamic ports for each job. These dynamic
> ports make securing the Spark network difficult.

Indeed.

Judging by the frequency with which this topic arises, this is a
concern for many (myself included).

I couldn't find anything in JIRA about it, but I'm curious to know
whether the Spark team considers this a problem in need of a fix?

Mark.


Re: writing my own RDD

2014-05-11 Thread Koert Kuipers
will do
On May 11, 2014 6:44 PM, "Aaron Davidson"  wrote:

> You got a good point there, those APIs should probably be marked as
> @DeveloperAPI. Would you mind filing a JIRA for that (
> https://issues.apache.org/jira/browse/SPARK)?
>
>
> On Sun, May 11, 2014 at 11:51 AM, Koert Kuipers  wrote:
>
>> resending... my email somehow never made it to the user list.
>>
>>
>> On Fri, May 9, 2014 at 2:11 PM, Koert Kuipers  wrote:
>>
>>> in writing my own RDD i ran into a few issues with respect to stuff
>>> being private in spark.
>>>
>>> in compute i would like to return an iterator that respects task killing
>>> (as HadoopRDD does), but the mechanics for that are inside the private
>>> InterruptibleIterator. also the exception i am supposed to throw
>>> (TaskKilledException) is private to spark.
>>>
>>
>>
>


Re: is Mesos falling out of favor?

2014-05-11 Thread Tim St Clair




- Original Message -
> From: "deric" 
> To: u...@spark.incubator.apache.org
> Sent: Tuesday, May 6, 2014 11:42:58 AM
> Subject: Re: is Mesos falling out of favor?

Nope.

> 
> I guess it's due to missing documentation and quite complicated setup.
> Continuous integration would be nice!

The setup is pretty simple, but stack integration tests are certainly missing, 
and Spark POM's have been out of date for some time.  There are JIRAs in both 
projects to clean up integration and update. 

> 
> Btw. is it possible to use spark as a shared library and not to fetch spark
> tarball for each task?

It's really easy to edit Spark's Mesos scheduler and executor to do what you
want, e.g. run local binaries, etc.

> 
> Do you point SPARK_EXECUTOR_URI to HDFS url?

Typically yes, but again it's pretty easy edit to the scheduler and executor to 
do what you want. 

> 
> 
> 
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/is-Mesos-falling-out-of-favor-tp5444p5448.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 

-- 
Cheers,
Tim
Freedom, Features, Friends, First -> Fedora
https://fedoraproject.org/wiki/SIGs/bigdata


Re: How to use spark-submit

2014-05-11 Thread Stephen Boesch
Hi Sonal,
Yes I am working towards that same idea.  How did you go about creating
the non-spark-jar dependencies ?  The way I am doing it is a separate
straw-man project that does not include spark but has the external third
party jars included. Then running sbt compile:managedClasspath and reverse
engineering the lib jars from it.  That is obviously not ideal.

The maven "run" will be useful for other projects built by maven: i will
keep in my notes.

AFA sbt run-example, it requires additional libraries to be added for my
external dependencies.  I tried several items including  ADD_JARS,
 --driver-class-path  and combinations of extraClassPath. I have deferred
that ad-hoc approach to finding a systematic one.




2014-05-08 5:26 GMT-07:00 Sonal Goyal :

> I am creating a jar with only my dependencies and run spark-submit through
> my project mvn build. I have configured the mvn exec goal to the location
> of the script. Here is how I have set it up for my app. The mainClass is my
> driver program, and I am able to send my custom args too. Hope this helps.
>
> <plugin>
>   <groupId>org.codehaus.mojo</groupId>
>   <artifactId>exec-maven-plugin</artifactId>
>   <executions>
>     <execution>
>       <goals>
>         <goal>exec</goal>
>       </goals>
>     </execution>
>   </executions>
>   <configuration>
>     <executable>/home/sgoyal/spark/bin/spark-submit</executable>
>     <arguments>
>       <argument>${jars}</argument>
>       <argument>--class</argument>
>       <argument>${mainClass}</argument>
>       <argument>--arg</argument>
>       <argument>${spark.master}</argument>
>       <argument>--arg</argument>
>       <argument>${my app arg 1}</argument>
>       <argument>--arg</argument>
>       <argument>${my arg 2}</argument>
>     </arguments>
>   </configuration>
> </plugin>
>
>
> Best Regards,
> Sonal
> Nube Technologies 
>
>  
>
>
>
>
> On Wed, May 7, 2014 at 6:57 AM, Tathagata Das  > wrote:
>
>> Doesnt the run-example script work for you? Also, are you on the latest
>> commit of branch-1.0 ?
>>
>> TD
>>
>>
>> On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta 
>> wrote:
>>
>>>
>>>
>>> Yes, I'm struggling with a similar problem where my class are not found
>>> on the worker nodes. I'm using 1.0.0_SNAPSHOT.  I would really appreciate
>>> if someone can provide some documentation on the usage of spark-submit.
>>>
>>> Thanks
>>>
>>> > On May 5, 2014, at 10:24 PM, Stephen Boesch  wrote:
>>> >
>>> >
>>> > I have a spark streaming application that uses the external streaming
>>> modules (e.g. kafka, mqtt, ..) as well.  It is not clear how to properly
>>> invoke the spark-submit script: what are the ---driver-class-path and/or
>>> -Dspark.executor.extraClassPath parameters required?
>>> >
>>> >  For reference, the following error is proving difficult to resolve:
>>> >
>>> > java.lang.ClassNotFoundException:
>>> org.apache.spark.streaming.examples.StreamingExamples
>>> >
>>>
>>
>>
>


Re: Test

2014-05-11 Thread Aaron Davidson
I didn't get the original message, only the reply. Ruh-roh.


On Sun, May 11, 2014 at 8:09 AM, Azuryy  wrote:

> Got.
>
> But it doesn't indicate all can receive this test.
>
> Mail list is unstable recently.
>
>
> Sent from my iPhone5s
>
> On May 10, 2014, at 13:31, Matei Zaharia  wrote:
>
> *This message has no content.*
>
>


Re: File present but file not found exception

2014-05-11 Thread Koert Kuipers
are you running spark on a cluster? if so, the executors will not be able
to find a file on your local computer.


On Thu, May 8, 2014 at 2:48 PM, Sai Prasanna wrote:

> Hi Everyone,
>
> I think all are pretty busy, the response time in this group has slightly
> increased.
>
> But anyway, this is a pretty silly problem that I could not get over.
>
> I have a file in my localFS, but when i try to create an RDD out of it,
> tasks fails with file not found exception is thrown at the log files.
>
> *var file = sc.textFile("file:///home/sparkcluster/spark/input.txt");*
> *file.top(1);*
>
> input.txt exists in the above folder, but Spark still couldn't find it. Do some
> parameters need to be set?
>
> Any help is really appreciated. Thanks !!
>


streaming on hdfs can detected all new file, but the sum of all the rdd.count() not equals which had detected

2014-05-11 Thread zzzzzqf12345
When I put 200 PNG files into HDFS, I found that Spark Streaming could detect all 200
files, but the sum of rdd.count() over the batches is less than 200, always between 130 and
170. I don't know why... is this a bug?
PS: When I put the 200 files into HDFS before the streaming job starts, it gets the correct
count and the right result.

Here is the code:

def main(args: Array[String]) {
  val conf = new SparkConf().setMaster(SparkURL)
    .setAppName("QimageStreaming-broadcast")
    .setSparkHome(System.getenv("SPARK_HOME"))
    .setJars(SparkContext.jarOfClass(this.getClass()))
  conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  conf.set("spark.kryo.registrator", "qing.hdu.Image.MyRegistrator")
  conf.set("spark.kryoserializer.buffer.mb", "10")
  val ssc = new StreamingContext(conf, Seconds(2))
  val inputFormatClass = classOf[QimageInputFormat[Text, Qimage]]
  val outputFormatClass = classOf[QimageOutputFormat[Text, Qimage]]
  val input_path = HdfsURL + "/Qimage/input"
  val output_path = HdfsURL + "/Qimage/output/"
  val bg_path = HdfsURL + "/Qimage/bg/"
  val bg = ssc.sparkContext.newAPIHadoopFile[Text, Qimage,
    QimageInputFormat[Text, Qimage]](bg_path)
  val bbg = bg.map(data => (data._1.toString(), data._2))
  val broadcastbg = ssc.sparkContext.broadcast(bbg)
  val file = ssc.fileStream[Text, Qimage, QimageInputFormat[Text, Qimage]](input_path)
  val qingbg = broadcastbg.value.collectAsMap
  val foreachFunc = (rdd: RDD[(Text, Qimage)], time: Time) => {
    val rddnum = rdd.count
    System.out.println("\n\n" + "rddnum is " + rddnum + "\n\n")
    if (rddnum > 0) {
      System.out.println("here is foreachFunc")
      val a = rdd.keys
      val b = a.first
      val cbg = qingbg.get(getbgID(b)).getOrElse(new Qimage)
      rdd.map(data => (data._1, (new QimageProc(data._1, data._2)).koutu(cbg)))
        .saveAsNewAPIHadoopFile(output_path, classOf[Text], classOf[Qimage], outputFormatClass)
    }
  }
  file.foreachRDD(foreachFunc)
  ssc.start()
  ssc.awaitTermination()
}



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/streaming-on-hdfs-can-detected-all-new-file-but-the-sum-of-all-the-rdd-count-not-equals-which-had-ded-tp5572.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


build shark(hadoop CDH5) on hadoop2.0.0 CDH4

2014-05-11 Thread Sophia
I have built Shark with sbt, but this sbt exception shows up:
[error] sbt.resolveException: unresolved dependency:
org.apache.hadoop#hadoop-client;2.0.0: not found.
What can I do to build it successfully?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/build-shark-hadoop-CDH5-on-hadoop2-0-0-CDH4-tp5574.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Is there any problem on the spark mailing list?

2014-05-11 Thread ankurdave
I haven't been getting mail either. This was the last message I received:
http://apache-spark-user-list.1001560.n3.nabble.com/master-attempted-to-re-register-the-worker-and-then-took-all-workers-as-unregistered-tp553p5491.html



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-any-problem-on-the-spark-mailing-list-tp5509p5515.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


about spark interactive shell

2014-05-11 Thread fengshen
Hi all,
I am now using Spark in production, but I notice that the Spark driver holds the RDDs
and the DAG, and the executors try to register with the driver.

I think the driver should run on the cluster and the client should run on the
gateway. Something like:

 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/about-spark-interactive-shell-tp5575.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: about spark interactive shell

2014-05-11 Thread fengshen
can we do it? 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/about-spark-interactive-shell-tp5575p5576.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Driver process succeed exiting but web UI shows FAILED

2014-05-11 Thread Cheney Sun
Hi,

I'm running Spark 0.9.1 in standalone mode. I submitted one job and the
driver ran successfully to the end; see the log messages below:

2014-05-12 10:34:14,358 - [INFO] (Logging.scala:50) - Finished TID 254 in
19 ms on spark-host007 (progress: 62/63)
2014-05-12 10:34:14,359 - [INFO] (Logging.scala:50) - Finished TID 255 in
18 ms on spark-host002 (progress: 63/63)
2014-05-12 10:34:14,359 - [INFO] (Logging.scala:50) - Completed
ResultTask(7, 63)
2014-05-12 10:34:14,359 - [INFO] (Logging.scala:50) - Removed TaskSet 7.0,
whose tasks have all completed, from pool
2014-05-12 10:34:14,360 - [INFO] (Logging.scala:50) - Stage 7 (take at
ComputeTask.java:110) finished in 0.165 s
2014-05-12 10:34:14,360 - [INFO] (Logging.scala:50) - Job finished: take at
ComputeTask.java:110, took 0.189718 s
2014-05-12 10:34:14,408 - [INFO] (ContextHandler.java:795) - stopped
o.e.j.s.h.ContextHandler{/,null}
2014-05-12 10:34:14,409 - [INFO] (ContextHandler.java:795) - stopped
o.e.j.s.h.ContextHandler{/static,null}
2014-05-12 10:34:14,409 - [INFO] (ContextHandler.java:795) - stopped
o.e.j.s.h.ContextHandler{/metrics/json,null}
2014-05-12 10:34:14,409 - [INFO] (ContextHandler.java:795) - stopped
o.e.j.s.h.ContextHandler{/executors,null}
2014-05-12 10:34:14,410 - [INFO] (ContextHandler.java:795) - stopped
o.e.j.s.h.ContextHandler{/environment,null}
2014-05-12 10:34:14,410 - [INFO] (ContextHandler.java:795) - stopped
o.e.j.s.h.ContextHandler{/stages,null}
2014-05-12 10:34:14,410 - [INFO] (ContextHandler.java:795) - stopped
o.e.j.s.h.ContextHandler{/stages/pool,null}
2014-05-12 10:34:14,410 - [INFO] (ContextHandler.java:795) - stopped
o.e.j.s.h.ContextHandler{/stages/stage,null}
2014-05-12 10:34:14,411 - [INFO] (ContextHandler.java:795) - stopped
o.e.j.s.h.ContextHandler{/storage,null}
2014-05-12 10:34:14,411 - [INFO] (ContextHandler.java:795) - stopped
o.e.j.s.h.ContextHandler{/storage/rdd,null}
2014-05-12 10:34:14,466 - [INFO] (Logging.scala:50) - Shutting down all
executors
2014-05-12 10:34:14,468 - [INFO] (Logging.scala:50) - Asking each executor
to shut down
2014-05-12 10:34:15,527 - [INFO] (Logging.scala:50) - MapOutputTrackerActor
stopped!
2014-05-12 10:34:15,580 - [INFO] (Logging.scala:50) - Selector thread was
interrupted!
2014-05-12 10:34:15,581 - [INFO] (Logging.scala:50) - ConnectionManager
stopped
2014-05-12 10:34:15,582 - [INFO] (Logging.scala:50) - MemoryStore cleared
2014-05-12 10:34:15,583 - [INFO] (Logging.scala:50) - BlockManager stopped
2014-05-12 10:34:15,584 - [INFO] (Logging.scala:50) - Stopping
BlockManagerMaster
2014-05-12 10:34:15,584 - [INFO] (Logging.scala:50) - BlockManagerMaster
stopped
2014-05-12 10:34:15,586 - [INFO] (Logging.scala:50) - Successfully stopped
SparkContext
2014-05-12 10:34:15,586 - [INFO] (ComputeTask.java:174) - Compute Task
success!
2014-05-12 10:34:15,590 - [INFO] (Slf4jLogger.scala:74) - Shutting down
remote daemon.
2014-05-12 10:34:15,592 - [INFO] (Slf4jLogger.scala:74) - Remote daemon
shut down; proceeding with flushing remote transports.
2014-05-12 10:34:15,631 - [INFO] (Slf4jLogger.scala:74) - Remoting shut down
2014-05-12 10:34:15,632 - [INFO] (Slf4jLogger.scala:74) - Remoting shut
down.
2014-05-12 10:34:15,911 - [INFO] (ComputeTask.java:209) - process success!


But the web UI shows FAILED. Has anyone run into this before? What's the
reason behind this inconsistent state?

app-20140512103331-0020 Compute-Task 13 5.0 GB 2014/05/12 10:33:31 root
FAILED 19 s

Thanks,
Cheney


Re: How to use spark-submit

2014-05-11 Thread Sonal Goyal
Hi Stephen,

I am using the Maven shade plugin to create my uber jar. I have marked the Spark
dependencies as provided.
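For the sbt side, the rough equivalent of that setup is to mark Spark as "provided" so
it stays out of the assembled jar and spark-submit supplies it at runtime (a sketch; the
version numbers are assumptions, adjust them to your build):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.0.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided"
)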

Best Regards,
Sonal
Nube Technologies 






On Mon, May 12, 2014 at 1:04 AM, Stephen Boesch  wrote:

> HI Sonal,
> Yes I am working towards that same idea.  How did you go about
> creating the non-spark-jar dependencies ?  The way I am doing it is a
> separate straw-man project that does not include spark but has the external
> third party jars included. Then running sbt compile:managedClasspath and
> reverse engineering the lib jars from it.  That is obviously not ideal.
>
> The maven "run" will be useful for other projects built by maven: i will
> keep in my notes.
>
> AFA sbt run-example, it requires additional libraries to be added for my
> external dependencies.  I tried several items including  ADD_JARS,
>  --driver-class-path  and combinations of extraClassPath. I have deferred
> that ad-hoc approach to finding a systematic one.
>
>
>
>
> 2014-05-08 5:26 GMT-07:00 Sonal Goyal :
>
> I am creating a jar with only my dependencies and run spark-submit through
>> my project mvn build. I have configured the mvn exec goal to the location
>> of the script. Here is how I have set it up for my app. The mainClass is my
>> driver program, and I am able to send my custom args too. Hope this helps.
>>
>> <plugin>
>>   <groupId>org.codehaus.mojo</groupId>
>>   <artifactId>exec-maven-plugin</artifactId>
>>   <executions>
>>     <execution>
>>       <goals>
>>         <goal>exec</goal>
>>       </goals>
>>     </execution>
>>   </executions>
>>   <configuration>
>>     <executable>/home/sgoyal/spark/bin/spark-submit</executable>
>>     <arguments>
>>       <argument>${jars}</argument>
>>       <argument>--class</argument>
>>       <argument>${mainClass}</argument>
>>       <argument>--arg</argument>
>>       <argument>${spark.master}</argument>
>>       <argument>--arg</argument>
>>       <argument>${my app arg 1}</argument>
>>       <argument>--arg</argument>
>>       <argument>${my arg 2}</argument>
>>     </arguments>
>>   </configuration>
>> </plugin>
>>
>>
>> Best Regards,
>> Sonal
>> Nube Technologies 
>>
>>  
>>
>>
>>
>>
>> On Wed, May 7, 2014 at 6:57 AM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Doesnt the run-example script work for you? Also, are you on the latest
>>> commit of branch-1.0 ?
>>>
>>> TD
>>>
>>>
>>> On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta >> > wrote:
>>>


 Yes, I'm struggling with a similar problem where my class are not found
 on the worker nodes. I'm using 1.0.0_SNAPSHOT.  I would really appreciate
 if someone can provide some documentation on the usage of spark-submit.

 Thanks

 > On May 5, 2014, at 10:24 PM, Stephen Boesch 
 wrote:
 >
 >
 > I have a spark streaming application that uses the external streaming
 modules (e.g. kafka, mqtt, ..) as well.  It is not clear how to properly
 invoke the spark-submit script: what are the ---driver-class-path and/or
 -Dspark.executor.extraClassPath parameters required?
 >
 >  For reference, the following error is proving difficult to resolve:
 >
 > java.lang.ClassNotFoundException:
 org.apache.spark.streaming.examples.StreamingExamples
 >

>>>
>>>
>>
>