Is there a way to load a large file from HDFS faster into Spark

2014-05-11 Thread Soumya Simanta
I've a Spark cluster with 3 worker nodes.


   - *Workers:* 3
   - *Cores:* 48 Total, 48 Used
   - *Memory:* 469.8 GB Total, 72.0 GB Used

I want to process a single compressed file (*.gz) on HDFS. The file is 1.5GB
compressed and 11GB uncompressed.
When I try to read the compressed file from HDFS, it takes a while (4-5
minutes) to load it into an RDD. If I use the .cache operation it takes even
longer. Is there a way to make loading the RDD from HDFS faster?

Thanks
-Soumya


Re: How to read a multipart s3 file?

2014-05-11 Thread Nicholas Chammas
On Tue, May 6, 2014 at 10:07 PM, kamatsuoka ken...@gmail.com wrote:

 I was using s3n:// but I got frustrated by how
 slow it is at writing files.


I'm curious: How slow is slow? How long does it take you, for example, to
save a 1GB file to S3 using s3n vs s3?


Spark LIBLINEAR

2014-05-11 Thread Chieh-Yen
Dear all,

Recently we released a distributed extension of LIBLINEAR at

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/

Currently, TRON for logistic regression and L2-loss SVM is supported.
We provided both MPI and Spark implementations.
This is very preliminary so your comments are very welcome.

Thanks,
Chieh-Yen


Re: Is there anything that I need to modify?

2014-05-11 Thread Arpit Tak
Try adding a hostname-to-IP mapping in /etc/hosts; it is not able to resolve
the IP address to a hostname.
Try this:
192.168.10.220   CHBM220




On Wed, May 7, 2014 at 12:50 PM, Sophia sln-1...@163.com wrote:

 [root@CHBM220 spark-0.9.1]#

 SPARK_JAR=.assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
 ./bin/spark-class org.apache.spark.deploy.yarn.Client --jar
 examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar --class
 org.apache.spark.examples.SparkPi --args yarn-standalone --num-workers 3
 --master-memory 2g --worker-memory 2g --worker-cores 1
 14:50:45,485%5P RMProxy:56-Connecting to ResourceManager at CHBM220/192.168.10.220:8032
 Exception in thread "main" java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).; Host Details : local host is: "CHBM220/192.168.10.220"; destination host is: "CHBM220":8032;
 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
 at org.apache.hadoop.ipc.Client.call(Client.java:1351)
 at org.apache.hadoop.ipc.Client.call(Client.java:1300)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
 at com.sun.proxy.$Proxy7.getClusterMetrics(Unknown Source)
 at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152)
 at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
 at com.sun.proxy.$Proxy8.getClusterMetrics(Unknown Source)
 at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246)
 at org.apache.spark.deploy.yarn.Client.logClusterResourceDetails(Client.scala:144)
 at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:79)
 at org.apache.spark.deploy.yarn.Client.run(Client.scala:115)
 at org.apache.spark.deploy.yarn.Client$.main(Client.scala:493)
 at org.apache.spark.deploy.yarn.Client.main(Client.scala)
 Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
 at com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
 at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
 at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.<init>(RpcHeaderProtos.java:1398)
 at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.<init>(RpcHeaderProtos.java:1362)
 at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto$1.parsePartialFrom(RpcHeaderProtos.java:1492)
 at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto$1.parsePartialFrom(RpcHeaderProtos.java:1487)
 at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
 at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:241)
 at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
 at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
 at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
 at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.parseDelimitedFrom(RpcHeaderProtos.java:2364)
 at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:996)
 at org.apache.hadoop.ipc.Client$Connection.run(Client.java:891)
 [root@CHBM220:spark-0.9.1]#




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-anything-that-I-need-to-modify-tp5477.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Re: java.io.FileNotFoundException: /test/spark-0.9.1/work/app-20140505053550-0000/2/stdout (No such file or directory)

2014-05-11 Thread Francis . Hu
I have just resolved the problem by running the master and worker daemons
individually on the machines where they live.

If I execute the script sbin/start-all.sh, the problem always exists.

 

 

From: Francis.Hu [mailto:francis...@reachjunction.com]
Sent: Tuesday, May 06, 2014 10:31
To: user@spark.apache.org
Subject: Re: Re: java.io.FileNotFoundException:
/test/spark-0.9.1/work/app-20140505053550-/2/stdout (No such file or
directory)

 

I looked into the log again; all the exceptions are FileNotFoundExceptions.
In the web UI there is no further info I can check except the basic description of
the job.

I attached the log file; could you take a look? Thanks.

 

Francis.Hu

 

From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Tuesday, May 06, 2014 10:16
To: user@spark.apache.org
Subject: Re: Re: java.io.FileNotFoundException:
/test/spark-0.9.1/work/app-20140505053550-/2/stdout (No such file or
directory)

 

Can you check the Spark worker logs on that machine, either from the web UI or
directly? They should be under /test/spark-XXX/logs/. See if they contain any errors.

If there is no permission issue, I am not sure why stdout and stderr are not being
generated.

 

TD

 

On Mon, May 5, 2014 at 7:13 PM, Francis.Hu francis...@reachjunction.com wrote:

The file does not in fact exist, and there is no permission issue.

 

francis@ubuntu-4:/test/spark-0.9.1$ ll work/app-20140505053550-/

total 24

drwxrwxr-x  6 francis francis 4096 May  5 05:35 ./

drwxrwxr-x 11 francis francis 4096 May  5 06:18 ../

drwxrwxr-x  2 francis francis 4096 May  5 05:35 2/

drwxrwxr-x  2 francis francis 4096 May  5 05:35 4/

drwxrwxr-x  2 francis francis 4096 May  5 05:35 7/

drwxrwxr-x  2 francis francis 4096 May  5 05:35 9/

 

Francis

 

From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Tuesday, May 06, 2014 3:45
To: user@spark.apache.org
Subject: Re: java.io.FileNotFoundException:
/test/spark-0.9.1/work/app-20140505053550-/2/stdout (No such file or
directory)

 

Do those files actually exist? Those stdout/stderr files should have the output of the
Spark executors running on the workers, and it's weird that they don't exist.
Could it be a permission issue - maybe the directories/files are not being generated
because they cannot be?

 

TD

 

On Mon, May 5, 2014 at 3:06 AM, Francis.Hu francis...@reachjunction.com wrote:

Hi,All

 

 

We run a Spark cluster with three workers, created a Spark Streaming application,
and then ran the project using the command below:

 

shell sbt run spark://192.168.219.129:7077 tcp://192.168.20.118:5556 foo

 

We looked at the web UI of the workers; the jobs failed without any error or info, but a
FileNotFoundException occurred in the workers' log files, as shown below.

Is this a known issue in Spark?

 

 

-in workers' 
logs/spark-francis-org.apache.spark.deploy.worker.Worker-1-ubuntu-4.out

 

14/05/05 02:39:39 WARN AbstractHttpConnection: /logPage/?appId=app-20140505053550-&executorId=2&logType=stdout

java.io.FileNotFoundException: /test/spark-0.9.1/work/app-20140505053550-/2/stdout (No such file or directory)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at org.apache.spark.util.Utils$.offsetBytes(Utils.scala:687)
        at org.apache.spark.deploy.worker.ui.WorkerWebUI.logPage(WorkerWebUI.scala:119)
        at org.apache.spark.deploy.worker.ui.WorkerWebUI$$anonfun$6.apply(WorkerWebUI.scala:52)
        at org.apache.spark.deploy.worker.ui.WorkerWebUI$$anonfun$6.apply(WorkerWebUI.scala:52)
        at org.apache.spark.ui.JettyUtils$$anon$1.handle(JettyUtils.scala:61)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1040)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:976)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:363)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:483)
        at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:920)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:982)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:635)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
        at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:628)
        at ...

Re: How can adding a random count() change the behavior of my program?

2014-05-11 Thread Walrus theCat
Nick,

I have encountered strange things like this before (usually when
programming with mutable structures and side effects), and for me the
answer was that, until .count (or .first, or similar) is called, your
variable 'a' refers to a set of instructions that only get executed to form
the object you expect once you ask something of it.  Before I
was using side-effect-free techniques on immutable data structures, I had
to call .first or .count or similar to get the behavior I wanted.  There
are still special cases where I have to purposefully collapse the RDD for
some reason or another.  This may not be new information to you, but I've
encountered similar behavior before and strongly suspect it is playing a
role here.
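
To make the laziness concrete, here is a minimal Scala sketch of the pattern being
described (it assumes the spark-shell, where sc and the pair-RDD implicits are in scope;
the path and the parsing are purely illustrative, not Nick's actual job):

val raw = sc.textFile("hdfs:///data/events.txt")               // nothing is read yet
val mapped = raw.repartition(8).map(line => (line.split(",")(0), 1))
mapped.cache()                                                 // still just a recipe
println(mapped.count())                                        // action: the lineage runs and the cache fills
val counts = mapped.countByKey()                               // later actions reuse the cached partitions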


On Mon, May 5, 2014 at 5:52 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 I’m running into something very strange today. I’m getting an error on the
 following innocuous operations.

 a = sc.textFile('s3n://...')
 a = a.repartition(8)
 a = a.map(...)
 c = a.countByKey() # ERRORs out on this action. See below for traceback. [1]

 If I add a count() right after the repartition(), this error magically
 goes away.

 a = sc.textFile('s3n://...')
 a = a.repartition(8)
 print a.count()
 a = a.map(...)
 c = a.countByKey() # A-OK! WTF?

 To top it off, this “fix” is inconsistent. Sometimes, I still get this
 error.

 This is strange. How do I get to the bottom of this?

 Nick

 [1] Here’s the traceback:

 Traceback (most recent call last):
   File "<stdin>", line 7, in <module>
   File "file.py", line 187, in function_blah
     c = a.countByKey()
   File "/root/spark/python/pyspark/rdd.py", line 778, in countByKey
     return self.map(lambda x: x[0]).countByValue()
   File "/root/spark/python/pyspark/rdd.py", line 624, in countByValue
     return self.mapPartitions(countPartition).reduce(mergeMaps)
   File "/root/spark/python/pyspark/rdd.py", line 505, in reduce
     vals = self.mapPartitions(func).collect()
   File "/root/spark/python/pyspark/rdd.py", line 469, in collect
     bytesInJava = self._jrdd.collect().iterator()
   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling o46.collect.


 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-can-adding-a-random-count-change-the-behavior-of-my-program-tp5406.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Fwd: Is there a way to load a large file from HDFS faster into Spark

2014-05-11 Thread Soumya Simanta
Yep. I figured that out. I uncompressed the file and it looks much faster
now. Thanks.



On Sun, May 11, 2014 at 8:14 AM, Mayur Rustagi mayur.rust...@gmail.comwrote:

  .gz files are not splittable and hence harder to process. The easiest fix is to move
  to a splittable compression format like LZO and to break the file into multiple blocks
  that can be read and processed in parallel.
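
  For illustration, a minimal Scala sketch of the single-reader effect and the usual
  workaround (the path and core count are made up):

  val lines = sc.textFile("hdfs:///data/big-file.gz") // gzip is not splittable: one partition, one reader
  val spread = lines.repartition(48)                  // spread the decompressed records across the cluster
  spread.cache()                                      // later stages and repeated actions use 48 partitions
  println(spread.count())
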
 On 11 May 2014 09:01, Soumya Simanta soumya.sima...@gmail.com wrote:



 I've a Spark cluster with 3 worker nodes.


- *Workers:* 3
- *Cores:* 48 Total, 48 Used
- *Memory:* 469.8 GB Total, 72.0 GB Used

  I want to process a single compressed file (*.gz) on HDFS. The file is
  1.5GB compressed and 11GB uncompressed.
  When I try to read the compressed file from HDFS, it takes a while (4-5
  minutes) to load it into an RDD. If I use the .cache operation it takes even
  longer. Is there a way to make loading the RDD from HDFS faster?

 Thanks
  -Soumya





Re: Test

2014-05-11 Thread Azuryy
Got it.

But it doesn't indicate that everyone can receive this test.

The mailing list has been unstable recently.


Sent from my iPhone5s

 On May 10, 2014, at 13:31, Matei Zaharia matei.zaha...@gmail.com wrote:
 
 This message has no content.


Re: Is there any problem on the spark mailing list?

2014-05-11 Thread lukas nalezenec
There was an outage: https://blogs.apache.org/infra/entry/mail_outage



On Fri, May 9, 2014 at 1:27 PM, wxhsdp wxh...@gmail.com wrote:

 I think so; fewer questions and answers these past three days.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-any-problem-on-the-spark-mailing-list-tp5509p5522.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: why is Spark 0.9.1 (context creation?) so slow on my OSX laptop?

2014-05-11 Thread Madhu
Svend,

I built it on my iMac and it was about the same speed as Windows 7, RHEL 6
VM on Windows 7, and Linux on EC2. Spark is pleasantly easy to build on all
of these platforms, which is wonderful.

How long does it take to start spark-shell?
Maybe it's a JVM memory setting problem on your laptop?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/why-is-Spark-0-9-1-context-creation-so-slow-on-my-OSX-laptop-tp5535p5556.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: writing my own RDD

2014-05-11 Thread Koert Kuipers
resending... my email somehow never made it to the user list.


On Fri, May 9, 2014 at 2:11 PM, Koert Kuipers ko...@tresata.com wrote:

 In writing my own RDD I ran into a few issues with respect to things being
 private in Spark.

 In compute I would like to return an iterator that respects task killing
 (as HadoopRDD does), but the mechanics for that are inside the private
 InterruptibleIterator. Also, the exception I am supposed to throw
 (TaskKilledException) is private to Spark.
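
 For reference, a bare-bones custom RDD under the public API of this era looks roughly
 like the sketch below (the class and its range-splitting logic are illustrative); the
 task-killing machinery mentioned above is exactly the part that is not reachable from here:

 import org.apache.spark.{Partition, SparkContext, TaskContext}
 import org.apache.spark.rdd.RDD

 // Hypothetical partition type: just carries its index.
 case class SimplePartition(index: Int) extends Partition

 // Emits the integers 0 until n, split across `chunks` partitions.
 class RangeChunkRDD(sc: SparkContext, n: Int, chunks: Int) extends RDD[Int](sc, Nil) {

   override def getPartitions: Array[Partition] =
     Array.tabulate(chunks)(i => SimplePartition(i): Partition)

   override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
     val per = n / chunks
     val start = split.index * per
     val end = if (split.index == chunks - 1) n else start + per
     (start until end).iterator  // a plain iterator; no public InterruptibleIterator to wrap it in
   }
 }

 Used as new RangeChunkRDD(sc, 1000, 4).collect() this works, but it cannot react to a
 task-kill request the way HadoopRDD does.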



Re: is Mesos falling out of favor?

2014-05-11 Thread Gary Malouf
For what it is worth, our team here at MediaCrossing (http://mediacrossing.com) has
been using the Spark/Mesos combination since last summer with much success
(low operations overhead, high developer performance).

IMO, Hadoop is overcomplicated from both a development and operations
perspective so I am looking to lower our dependencies on it, not increase
them.  Our stack currently includes:


   - Spark 0.9.1
   - Mesos 0.17
   - Chronos
   - HDFS (CDH 5.0-mr1)
   - Flume 1.4.0
   - ZooKeeper
   - Cassandra 2.0 (key-value store alternative to HBase)
   - Storm 0.9 (which we currently prefer over Spark Streaming)

We've used Shark in the past as well, but since most of us prefer the Spark
Shell we have not been maintaining it.

Using Mesos to run Spark allows us to optimize our available resources
(currently CPU + RAM) across Spark, Chronos and a number of other
services.  I see YARN as being heavily focused on MR2, but the reality is
we are using Spark in large part because writing MapReduce jobs is verbose,
hard to maintain and not performant (compared to Spark).  We have the advantage
of not having any real legacy Map/Reduce jobs to maintain, so that
consideration does not come into play.

Finally, I am a believer that for the long-term direction of our company,
the Berkeley stack (https://amplab.cs.berkeley.edu/software/) will serve us
best.  Leveraging Mesos and Spark from the outset paves the way for this.


On Sun, May 11, 2014 at 1:28 PM, Paco Nathan cet...@gmail.com wrote:

 That's FUD. Tracking the Mesos and Spark use cases, there are very large
 production deployments of these together. Some are rather private but
 others are being surfaced. IMHO, one of the most amazing case studies is
 from Christina Delimitrou http://youtu.be/YpmElyi94AA

  For a tutorial, use the following, but upgrade it to the latest production release of
  Spark. There was a related O'Reilly webcast and Strata tutorial as well:
 http://mesosphere.io/learn/run-spark-on-mesos/

 FWIW, I teach Intro to Spark with sections on CM4, YARN, Mesos, etc.
 Based on lots of student experiences, Mesos is clearly the shortest path to
 deploying a Spark cluster if you want to leverage the robustness,
 multi-tenancy for mixed workloads, less ops overhead, etc., that show up
 repeatedly in the use case analyses.

 My opinion only and not that of any of my clients: Don't believe the FUD
 from YHOO unless you really want to be stuck in 2009.


 On Wed, May 7, 2014 at 8:30 AM, deric barton.to...@gmail.com wrote:

  I'm also using SPARK_EXECUTOR_URI right now, though I would prefer
  distributing Spark as a binary package.

  For running examples with `./bin/run-example ...` it works fine; however,
  tasks from spark-shell are getting lost:

  Error: Could not find or load main class
  org.apache.spark.executor.MesosExecutorBackend

  which looks more like a problem with sbin/spark-executor and missing paths
  to the jar. Has anyone encountered this error before?

  I guess Yahoo invested quite a lot of effort into YARN and Spark
  integration (moreover, with Mahout migrating to Spark there is much more
  interest in Hadoop and Spark integration). If some Mesos company were
  working on Spark-Mesos integration, it could be at least on the same
  level.

  I don't see any other reason why YARN would be better than Mesos;
  personally I prefer the latter. However, I haven't checked YARN for a
  while; maybe they've made significant progress. I think Mesos is more
  universal and flexible than YARN.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/is-Mesos-falling-out-of-favor-tp5444p5481.html

 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Spark LIBLINEAR

2014-05-11 Thread Debasish Das
Hello Prof. Lin,

Awesome news! I am curious whether you have any benchmarks comparing the C++ MPI
and Scala/Spark LIBLINEAR implementations...

Is Spark LIBLINEAR Apache licensed, or are there any specific restrictions
on using it?

Except for native BLAS libraries (which each user has to manage by
pulling in their preferred proprietary BLAS package), all Spark code is Apache
licensed.

Thanks.
Deb


On Sun, May 11, 2014 at 3:01 AM, DB Tsai dbt...@stanford.edu wrote:

 Dear Prof. Lin,

  Interesting! We have an implementation of L-BFGS in Spark, and it is already
  merged upstream.

  We read your paper comparing TRON and OWL-QN for logistic regression with
  L1 regularization (http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf), but it seems
  that it does not cover the distributed setup.

  It will be very interesting to see the L2 logistic regression benchmark
  results in Spark comparing your TRON optimizer against the L-BFGS optimizer on
  different datasets (sparse, dense, wide, etc.).

 I'll try your TRON out soon.


 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Sun, May 11, 2014 at 1:49 AM, Chieh-Yen r01944...@csie.ntu.edu.twwrote:

 Dear all,

 Recently we released a distributed extension of LIBLINEAR at

 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/

 Currently, TRON for logistic regression and L2-loss SVM is supported.
 We provided both MPI and Spark implementations.
 This is very preliminary so your comments are very welcome.

 Thanks,
 Chieh-Yen





Re: How to use spark-submit

2014-05-11 Thread Soumya Simanta

Will sbt-pack and the maven solution work for the Scala REPL?

I need the REPL because it saves a lot of time when I'm playing with large data
sets: I load them once, cache them, and then try things out interactively
before putting them in a standalone driver.

I've got sbt working for my own driver program on Spark 0.9.



 On May 11, 2014, at 3:49 PM, Stephen Boesch java...@gmail.com wrote:
 
 Just discovered sbt-pack: that addresses (quite well) the last item for 
 identifying and packaging the external jars.
 
 
 2014-05-11 12:34 GMT-07:00 Stephen Boesch java...@gmail.com:
  Hi Sonal,
  Yes, I am working towards that same idea.  How did you go about creating
  the non-Spark jar dependencies?  The way I am doing it is a separate
  straw-man project that does not include Spark but has the external third-party
  jars included, then running sbt compile:managedClasspath and reverse
  engineering the lib jars from it.  That is obviously not ideal.

  The maven run will be useful for other projects built by maven: I will
  keep it in my notes.

  As for sbt run-example, it requires additional libraries to be added for my
  external dependencies.  I tried several things, including ADD_JARS,
  --driver-class-path, and combinations of extraClassPath. I have deferred
  that ad-hoc approach in favor of finding a systematic one.
 
 
 
 
 2014-05-08 5:26 GMT-07:00 Sonal Goyal sonalgoy...@gmail.com:
 
  I am creating a jar with only my dependencies and running spark-submit through
  my project's mvn build. I have configured the mvn exec goal to point to the location
  of the script. Here is how I have set it up for my app. The mainClass is my
  driver program, and I am able to send my custom args too. Hope this helps.
 
  <plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>exec-maven-plugin</artifactId>
    <executions>
      <execution>
        <goals>
          <goal>exec</goal>
        </goals>
      </execution>
    </executions>
    <configuration>
      <executable>/home/sgoyal/spark/bin/spark-submit</executable>
      <arguments>
        <argument>${jars}</argument>
        <argument>--class</argument>
        <argument>${mainClass}</argument>
        <argument>--arg</argument>
        <argument>${spark.master}</argument>
        <argument>--arg</argument>
        <argument>${my app arg 1}</argument>
        <argument>--arg</argument>
        <argument>${my arg 2}</argument>
      </arguments>
    </configuration>
  </plugin>
 
 
 Best Regards,
 Sonal
 Nube Technologies 
 
 
 
 
 
 
 On Wed, May 7, 2014 at 6:57 AM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:
  Doesn't the run-example script work for you? Also, are you on the latest
  commit of branch-1.0?
 
 TD
 
 
 On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta soumya.sima...@gmail.com 
 wrote:
 
 
  Yes, I'm struggling with a similar problem where my classes are not found
  on the worker nodes. I'm using 1.0.0_SNAPSHOT.  I would really appreciate it
  if someone could provide some documentation on the usage of spark-submit.
 
 Thanks
 
  On May 5, 2014, at 10:24 PM, Stephen Boesch java...@gmail.com wrote:
 
 
   I have a Spark Streaming application that uses the external streaming
   modules (e.g. kafka, mqtt, ..) as well.  It is not clear how to
   properly invoke the spark-submit script: what are the
   --driver-class-path and/or -Dspark.executor.extraClassPath parameters
   required?
 
   For reference, the following error is proving difficult to resolve:
 
  java.lang.ClassNotFoundException: 
  org.apache.spark.streaming.examples.StreamingExamples
 
 


Re: Comprehensive Port Configuration reference?

2014-05-11 Thread Mark Baker
On Tue, May 6, 2014 at 9:09 AM, Jacob Eisinger jeis...@us.ibm.com wrote:
 In a nutshell, Spark opens up a couple of well-known ports. And then the
 workers and the shell open up dynamic ports for each job.  These dynamic
 ports make securing the Spark network difficult.

Indeed.

Judging by the frequency with which this topic arises, this is a
concern for many (myself included).

I couldn't find anything in JIRA about it, but I'm curious to know
whether the Spark team considers this a problem in need of a fix?

Mark.
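
For reference, a hedged sketch of the two ports that can be pinned from a SparkConf in
this era (property names as listed in the 0.9/1.0 configuration docs; values illustrative).
The executor-side data ports are still chosen dynamically per job, which is exactly the
difficulty described above:

val conf = new org.apache.spark.SparkConf()
  .setAppName("port-pinning-sketch")
  .set("spark.driver.port", "7001")   // driver <-> executor RPC endpoint
  .set("spark.ui.port", "4040")       // driver web UI
// Worker and executor ports for shuffle and block transfer remain ephemeral,
// so a firewall still has to allow a wide dynamic range between cluster nodes.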


Re: writing my own RDD

2014-05-11 Thread Koert Kuipers
will do
On May 11, 2014 6:44 PM, Aaron Davidson ilike...@gmail.com wrote:

 You got a good point there, those APIs should probably be marked as
 @DeveloperAPI. Would you mind filing a JIRA for that (
 https://issues.apache.org/jira/browse/SPARK)?


 On Sun, May 11, 2014 at 11:51 AM, Koert Kuipers ko...@tresata.com wrote:

 resending... my email somehow never made it to the user list.


 On Fri, May 9, 2014 at 2:11 PM, Koert Kuipers ko...@tresata.com wrote:

  In writing my own RDD I ran into a few issues with respect to things
  being private in Spark.

  In compute I would like to return an iterator that respects task killing
  (as HadoopRDD does), but the mechanics for that are inside the private
  InterruptibleIterator. Also, the exception I am supposed to throw
  (TaskKilledException) is private to Spark.






Re: How to use spark-submit

2014-05-11 Thread Stephen Boesch
Hi Sonal,
Yes, I am working towards that same idea.  How did you go about creating
the non-Spark jar dependencies?  The way I am doing it is a separate
straw-man project that does not include Spark but has the external third-party
jars included, then running sbt compile:managedClasspath and reverse
engineering the lib jars from it.  That is obviously not ideal.

The maven run will be useful for other projects built by maven: I will
keep it in my notes.

As for sbt run-example, it requires additional libraries to be added for my
external dependencies.  I tried several things, including ADD_JARS,
--driver-class-path, and combinations of extraClassPath. I have deferred
that ad-hoc approach in favor of finding a systematic one.
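
One systematic route is to mark Spark itself as "provided" in sbt and let the sbt-assembly
plugin (enabled in project/plugins.sbt) bundle only the third-party jars. A minimal
build.sbt sketch, with illustrative names and versions (the external module shown is just
an example):

name := "my-streaming-app"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.0.0" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.0.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.0.0"  // bundled into the assembly jar
)

The resulting assembly jar is then handed to spark-submit, so the external classes end up
on both the driver and executor classpaths.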




2014-05-08 5:26 GMT-07:00 Sonal Goyal sonalgoy...@gmail.com:

 I am creating a jar with only my dependencies and running spark-submit through
 my project's mvn build. I have configured the mvn exec goal to point to the location
 of the script. Here is how I have set it up for my app. The mainClass is my
 driver program, and I am able to send my custom args too. Hope this helps.

 <plugin>
   <groupId>org.codehaus.mojo</groupId>
   <artifactId>exec-maven-plugin</artifactId>
   <executions>
     <execution>
       <goals>
         <goal>exec</goal>
       </goals>
     </execution>
   </executions>
   <configuration>
     <executable>/home/sgoyal/spark/bin/spark-submit</executable>
     <arguments>
       <argument>${jars}</argument>
       <argument>--class</argument>
       <argument>${mainClass}</argument>
       <argument>--arg</argument>
       <argument>${spark.master}</argument>
       <argument>--arg</argument>
       <argument>${my app arg 1}</argument>
       <argument>--arg</argument>
       <argument>${my arg 2}</argument>
     </arguments>
   </configuration>
 </plugin>


 Best Regards,
 Sonal
 Nube Technologies http://www.nubetech.co

  http://in.linkedin.com/in/sonalgoyal




 On Wed, May 7, 2014 at 6:57 AM, Tathagata Das tathagata.das1...@gmail.com
  wrote:

 Doesn't the run-example script work for you? Also, are you on the latest
 commit of branch-1.0?

 TD


 On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta 
 soumya.sima...@gmail.comwrote:



 Yes, I'm struggling with a similar problem where my classes are not found
 on the worker nodes. I'm using 1.0.0_SNAPSHOT.  I would really appreciate it
 if someone could provide some documentation on the usage of spark-submit.

 Thanks

  On May 5, 2014, at 10:24 PM, Stephen Boesch java...@gmail.com wrote:
 
 
  I have a Spark Streaming application that uses the external streaming
 modules (e.g. kafka, mqtt, ..) as well.  It is not clear how to properly
 invoke the spark-submit script: what are the --driver-class-path and/or
 -Dspark.executor.extraClassPath parameters required?
 
   For reference, the following error is proving difficult to resolve:
 
  java.lang.ClassNotFoundException:
 org.apache.spark.streaming.examples.StreamingExamples
 






Re: Test

2014-05-11 Thread Aaron Davidson
I didn't get the original message, only the reply. Ruh-roh.


On Sun, May 11, 2014 at 8:09 AM, Azuryy azury...@gmail.com wrote:

 Got it.

 But it doesn't indicate that everyone can receive this test.

 The mailing list has been unstable recently.


 Sent from my iPhone5s

 On May 10, 2014, at 13:31, Matei Zaharia matei.zaha...@gmail.com wrote:

 *This message has no content.*




streaming on hdfs can detected all new file, but the sum of all the rdd.count() not equals which had detected

2014-05-11 Thread zzzzzqf12345
When I put 200 PNG files into HDFS, I found that Spark Streaming could detect all 200
files, but the sum of rdd.count() over the batches is less than 200, always between 130
and 170. I don't know why... Is this a bug?
PS: When I put the 200 files into HDFS before the streaming job starts, I get the correct
count and the right result.

Here is the code:

def main(args: Array[String]) {
  val conf = new SparkConf().setMaster(SparkURL)
    .setAppName("QimageStreaming-broadcast")
    .setSparkHome(System.getenv("SPARK_HOME"))
    .setJars(SparkContext.jarOfClass(this.getClass()))
  conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  conf.set("spark.kryo.registrator", "qing.hdu.Image.MyRegistrator")
  conf.set("spark.kryoserializer.buffer.mb", "10")
  val ssc = new StreamingContext(conf, Seconds(2))
  val inputFormatClass = classOf[QimageInputFormat[Text, Qimage]]
  val outputFormatClass = classOf[QimageOutputFormat[Text, Qimage]]
  val input_path = HdfsURL + "/Qimage/input"
  val output_path = HdfsURL + "/Qimage/output/"
  val bg_path = HdfsURL + "/Qimage/bg/"
  val bg = ssc.sparkContext.newAPIHadoopFile[Text, Qimage,
    QimageInputFormat[Text, Qimage]](bg_path)
  val bbg = bg.map(data => (data._1.toString(), data._2))
  val broadcastbg = ssc.sparkContext.broadcast(bbg)
  val file = ssc.fileStream[Text, Qimage, QimageInputFormat[Text, Qimage]](input_path)
  val qingbg = broadcastbg.value.collectAsMap
  val foreachFunc = (rdd: RDD[(Text, Qimage)], time: Time) => {
    val rddnum = rdd.count
    System.out.println("\n\n" + "rddnum is " + rddnum + "\n\n")
    if (rddnum > 0) {
      System.out.println("here is foreachFunc")
      val a = rdd.keys
      val b = a.first
      val cbg = qingbg.get(getbgID(b)).getOrElse(new Qimage)
      rdd.map(data => (data._1, (new QimageProc(data._1, data._2)).koutu(cbg)))
        .saveAsNewAPIHadoopFile(output_path, classOf[Text], classOf[Qimage], outputFormatClass)
    }
  }
  file.foreachRDD(foreachFunc)
  ssc.start()
  ssc.awaitTermination()
}



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/streaming-on-hdfs-can-detected-all-new-file-but-the-sum-of-all-the-rdd-count-not-equals-which-had-ded-tp5572.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


build shark(hadoop CDH5) on hadoop2.0.0 CDH4

2014-05-11 Thread Sophia
I have built Shark the sbt way, but this sbt exception turned up:
[error] sbt.ResolveException: unresolved dependency:
org.apache.hadoop#hadoop-client;2.0.0: not found
What can I do to build it successfully?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/build-shark-hadoop-CDH5-on-hadoop2-0-0-CDH4-tp5574.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Is there any problem on the spark mailing list?

2014-05-11 Thread ankurdave
I haven't been getting mail either. This was the last message I received:
http://apache-spark-user-list.1001560.n3.nabble.com/master-attempted-to-re-register-the-worker-and-then-took-all-workers-as-unregistered-tp553p5491.html



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-any-problem-on-the-spark-mailing-list-tp5509p5515.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


about spark interactive shell

2014-05-11 Thread fengshen
Hi all,
I am now using Spark in production, but I notice that the Spark driver holds the RDDs
and the DAG... and the executors will try to register with the driver.

I think the driver should run on the cluster, and the client should run on the
gateway. Something like:
http://apache-spark-user-list.1001560.n3.nabble.com/file/n5575/Spark-interactive_shell.jpg
 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/about-spark-interactive-shell-tp5575.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.