Is there a way to load a large file from HDFS faster into Spark
I have a Spark cluster with 3 worker nodes.
- *Workers:* 3
- *Cores:* 48 Total, 48 Used
- *Memory:* 469.8 GB Total, 72.0 GB Used
I want to process a single compressed (*.gz) file on HDFS. The file is 1.5GB compressed and 11GB uncompressed. When I try to read the compressed file from HDFS, it takes a while (4-5 minutes) to load it into an RDD. If I use the .cache operation it takes even longer. Is there a way to make loading of the RDD from HDFS faster? Thanks -Soumya
Re: How to read a multipart s3 file?
On Tue, May 6, 2014 at 10:07 PM, kamatsuoka ken...@gmail.com wrote: I was using s3n:// but I got frustrated by how slow it is at writing files. I'm curious: How slow is slow? How long does it take you, for example, to save a 1GB file to S3 using s3n vs s3?
Spark LIBLINEAR
Dear all, Recently we released a distributed extension of LIBLINEAR at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/ Currently, TRON for logistic regression and L2-loss SVM is supported. We provided both MPI and Spark implementations. This is very preliminary so your comments are very welcome. Thanks, Chieh-Yen
Re: Is there anything that I need to modify?
Try mapping the hostname to its IP in /etc/hosts; it is not able to resolve the IP to the hostname. Try something like: 192.168.10.220 CHBM220 On Wed, May 7, 2014 at 12:50 PM, Sophia sln-1...@163.com wrote:

[root@CHBM220 spark-0.9.1]# SPARK_JAR=.assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar ./bin/spark-class org.apache.spark.deploy.yarn.Client --jar examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar --class org.apache.spark.examples.SparkPi --args yarn-standalone --num-workers 3 --master-memory 2g --worker-memory 2g --worker-cores 1

14:50:45,485%5P RMProxy:56-Connecting to ResourceManager at CHBM220/192.168.10.220:8032
Exception in thread main java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).; Host Details : local host is: CHBM220/192.168.10.220; destination host is: CHBM220:8032;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
at org.apache.hadoop.ipc.Client.call(Client.java:1351)
at org.apache.hadoop.ipc.Client.call(Client.java:1300)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy7.getClusterMetrics(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy8.getClusterMetrics(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246)
at org.apache.spark.deploy.yarn.Client.logClusterResourceDetails(Client.scala:144)
at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:79)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:115)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:493)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
at com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.init(RpcHeaderProtos.java:1398)
at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.init(RpcHeaderProtos.java:1362)
at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto$1.parsePartialFrom(RpcHeaderProtos.java:1492)
at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto$1.parsePartialFrom(RpcHeaderProtos.java:1487)
at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:241)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.parseDelimitedFrom(RpcHeaderProtos.java:2364)
at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:996)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:891)
[root@CHBM220:spark-0.9.1]#
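For reference, a minimal /etc/hosts sketch along the lines the reply suggests (assuming CHBM220 is the machine's hostname and 192.168.10.220 its address; the format is IP first, then hostname):

```
# /etc/hosts -- sketch; values taken from the log above
127.0.0.1        localhost
192.168.10.220   CHBM220
```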
Re: Re: java.io.FileNotFoundException: /test/spark-0.9.1/work/app-20140505053550-0000/2/stdout (No such file or directory)
I have just resolved the problem by running the master and worker daemons individually on the machines where they live. If I execute the shell sbin/start-all.sh, the problem always exists.

From: Francis.Hu [mailto:francis...@reachjunction.com]
Sent: Tuesday, May 06, 2014 10:31
To: user@spark.apache.org
Subject: Re: Re: java.io.FileNotFoundException: /test/spark-0.9.1/work/app-20140505053550-0000/2/stdout (No such file or directory)

I looked into the log again; all exceptions are about FileNotFoundException. In the web UI, there is no more info I can check except the basic description of the job. Attached is the log file; could you help take a look? Thanks. Francis.Hu

From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Tuesday, May 06, 2014 10:16
To: user@spark.apache.org
Subject: Re: Re: java.io.FileNotFoundException: /test/spark-0.9.1/work/app-20140505053550-0000/2/stdout (No such file or directory)

Can you check the Spark worker logs on that machine, either from the web UI or directly? Should be /test/spark-XXX/logs/ See if that has any error. If there is no permission issue, I am not sure why stdout and stderr are not being generated. TD

On Mon, May 5, 2014 at 7:13 PM, Francis.Hu francis...@reachjunction.com wrote:

The files do not exist in fact, and there is no permission issue.

francis@ubuntu-4:/test/spark-0.9.1$ ll work/app-20140505053550-0000/
total 24
drwxrwxr-x  6 francis francis 4096 May 5 05:35 ./
drwxrwxr-x 11 francis francis 4096 May 5 06:18 ../
drwxrwxr-x  2 francis francis 4096 May 5 05:35 2/
drwxrwxr-x  2 francis francis 4096 May 5 05:35 4/
drwxrwxr-x  2 francis francis 4096 May 5 05:35 7/
drwxrwxr-x  2 francis francis 4096 May 5 05:35 9/

Francis

From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Tuesday, May 06, 2014 3:45
To: user@spark.apache.org
Subject: Re: java.io.FileNotFoundException: /test/spark-0.9.1/work/app-20140505053550-0000/2/stdout (No such file or directory)

Do those files actually exist? Those stdout/stderr should have the output of the Spark executors running in the workers, and it's weird that they don't exist. Could be a permission issue - maybe the directories/files are not being generated because it cannot? TD

On Mon, May 5, 2014 at 3:06 AM, Francis.Hu francis...@reachjunction.com wrote:

Hi, All

We run a Spark cluster with three workers. We created a Spark Streaming application, then ran the Spark project using the command below:

shell sbt run spark://192.168.219.129:7077 tcp://192.168.20.118:5556 foo

We looked at the web UI of the workers; jobs failed without any error or info, but a FileNotFoundException occurred in the workers' log files as below. Is this a known issue of Spark?
-in workers' logs/spark-francis-org.apache.spark.deploy.worker.Worker-1-ubuntu-4.out

14/05/05 02:39:39 WARN AbstractHttpConnection: /logPage/?appId=app-20140505053550-0000&executorId=2&logType=stdout
java.io.FileNotFoundException: /test/spark-0.9.1/work/app-20140505053550-0000/2/stdout (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.init(FileInputStream.java:138)
at org.apache.spark.util.Utils$.offsetBytes(Utils.scala:687)
at org.apache.spark.deploy.worker.ui.WorkerWebUI.logPage(WorkerWebUI.scala:119)
at org.apache.spark.deploy.worker.ui.WorkerWebUI$$anonfun$6.apply(WorkerWebUI.scala:52)
at org.apache.spark.deploy.worker.ui.WorkerWebUI$$anonfun$6.apply(WorkerWebUI.scala:52)
at org.apache.spark.ui.JettyUtils$$anon$1.handle(JettyUtils.scala:61)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1040)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:976)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:363)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:483)
at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:920)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:982)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:635)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:628) at
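For anyone hitting the same thing, a sketch of what "running the daemons individually" can look like on Spark 0.9.x (the master URL is the one from the thread; the worker instance number and paths are assumptions):

```
# On the master machine
./sbin/start-master.sh

# On each worker machine (worker instance 1, pointing at the master)
./sbin/start-slave.sh 1 spark://192.168.219.129:7077
```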
Re: How can adding a random count() change the behavior of my program?
Nick, I have encountered strange things like this before (usually when programming with mutable structures and side effects), and for me the answer was that, until .count() (or .first(), or similar) is called, your variable 'a' refers to a set of instructions that only get executed to form the object you expect when you ask something of it. Back before I was using side-effect-free techniques on immutable data structures, I had to call .first or .count or similar to get the behavior I wanted. There are still special cases where I have to purposefully collapse the RDD for some reason or another. This may not be new information to you, but I've encountered similar behavior before and highly suspect this is playing a role here.

On Mon, May 5, 2014 at 5:52 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

I'm running into something very strange today. I'm getting an error on the following innocuous operations.

a = sc.textFile('s3n://...')
a = a.repartition(8)
a = a.map(...)
c = a.countByKey() # ERRORs out on this action. See below for traceback. [1]

If I add a count() right after the repartition(), this error magically goes away.

a = sc.textFile('s3n://...')
a = a.repartition(8)
print a.count()
a = a.map(...)
c = a.countByKey() # A-OK!

WTF? To top it off, this "fix" is inconsistent. Sometimes, I still get this error. This is strange. How do I get to the bottom of this? Nick

[1] Here's the traceback:

Traceback (most recent call last):
File "<stdin>", line 7, in <module>
File "file.py", line 187, in function_blah
c = a.countByKey()
File "/root/spark/python/pyspark/rdd.py", line 778, in countByKey
return self.map(lambda x: x[0]).countByValue()
File "/root/spark/python/pyspark/rdd.py", line 624, in countByValue
return self.mapPartitions(countPartition).reduce(mergeMaps)
File "/root/spark/python/pyspark/rdd.py", line 505, in reduce
vals = self.mapPartitions(func).collect()
File "/root/spark/python/pyspark/rdd.py", line 469, in collect
bytesInJava = self._jrdd.collect().iterator()
File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o46.collect.
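For what it's worth, a minimal sketch of the laziness being described (Scala here; the path and names are illustrative): transformations only record a lineage, and nothing executes until an action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lazy-demo").setMaster("local"))

// Transformations: nothing is executed yet, only a lineage is recorded.
val words = sc.textFile("hdfs:///tmp/input.txt") // hypothetical path
val pairs = words.map(w => (w, 1))

// Actions: these trigger actual computation of the lineage above.
val total = pairs.count()
val first = pairs.first()
```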
Re: Fwd: Is there a way to load a large file from HDFS faster into Spark
Yep. I figured that out. I uncompressed the file and it looks much faster now. Thanks. On Sun, May 11, 2014 at 8:14 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: .gz files are not splittable and hence harder to process. The easiest fix is to move to a splittable compression like LZO and break the file into multiple blocks to be read for subsequent processing. On 11 May 2014 09:01, Soumya Simanta soumya.sima...@gmail.com wrote: I have a Spark cluster with 3 worker nodes.
- *Workers:* 3
- *Cores:* 48 Total, 48 Used
- *Memory:* 469.8 GB Total, 72.0 GB Used
I want to process a single compressed (*.gz) file on HDFS. The file is 1.5GB compressed and 11GB uncompressed. When I try to read the compressed file from HDFS, it takes a while (4-5 minutes) to load it into an RDD. If I use the .cache operation it takes even longer. Is there a way to make loading of the RDD from HDFS faster? Thanks -Soumya
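A related workaround, sketched below under the assumption that you must keep the .gz file: since a gzip file is read as a single partition, read it once and repartition immediately so downstream stages use all cores (the path and partition count are illustrative).

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("gz-load"))

// A .gz file is not splittable, so this RDD starts out as a single partition.
val raw = sc.textFile("hdfs:///data/big.gz") // hypothetical path

// Spread the data across the cluster before doing any heavy work;
// 48 matches the total core count mentioned in the thread.
val spread = raw.repartition(48).cache()
println(spread.count())
```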
Re: Test
Got it. But it doesn't indicate that everyone can receive this test. The mailing list has been unstable recently. Sent from my iPhone5s On May 10, 2014, at 13:31, Matei Zaharia matei.zaha...@gmail.com wrote: This message has no content.
Re: Is there any problem on the spark mailing list?
There was an outage: https://blogs.apache.org/infra/entry/mail_outage On Fri, May 9, 2014 at 1:27 PM, wxhsdp wxh...@gmail.com wrote: I think so; fewer questions and answers these three days.
Re: why is Spark 0.9.1 (context creation?) so slow on my OSX laptop?
Svend, I built it on my iMac and it was about the same speed as on Windows 7, a RHEL 6 VM on Windows 7, and Linux on EC2. Spark is pleasantly easy to build on all of these platforms, which is wonderful. How long does it take to start spark-shell? Maybe it's a JVM memory setting problem on your laptop?
Re: writing my own RDD
Resending... my email somehow never made it to the user list. On Fri, May 9, 2014 at 2:11 PM, Koert Kuipers ko...@tresata.com wrote: In writing my own RDD I ran into a few issues with respect to stuff being private in Spark. In compute I would like to return an iterator that respects task killing (as HadoopRDD does), but the mechanics for that are inside the private InterruptibleIterator. Also, the exception I am supposed to throw (TaskKilledException) is private to Spark.
Re: is Mesos falling out of favor?
For what it is worth, our team here at MediaCrossing (http://mediacrossing.com) has been using the Spark/Mesos combination since last summer with much success (low operations overhead, high developer performance). IMO, Hadoop is overcomplicated from both a development and operations perspective, so I am looking to lower our dependencies on it, not increase them. Our stack currently includes:
- Spark 0.9.1
- Mesos 0.17
- Chronos
- HDFS (CDH 5.0-mr1)
- Flume 1.4.0
- ZooKeeper
- Cassandra 2.0 (a key-value store alternative to HBase)
- Storm 0.9 (we currently prefer it to Spark Streaming)
We've used Shark in the past as well, but since most of us prefer the Spark shell we have not been maintaining it. Using Mesos to run Spark allows us to optimize our available resources (CPU + RAM currently) between Spark, Chronos and a number of other services. I see YARN as being heavily focused on MR2, but the reality is we are using Spark in large part because writing MapReduce jobs is verbose, hard to maintain, and not performant (compared to Spark). We have the advantage of not having any real legacy Map/Reduce jobs to maintain, so that consideration does not come into play. Finally, I am a believer that for the long-term direction of our company, the Berkeley stack (https://amplab.cs.berkeley.edu/software/) will serve us best. Leveraging Mesos and Spark from the onset paves the way for this.

On Sun, May 11, 2014 at 1:28 PM, Paco Nathan cet...@gmail.com wrote: That's FUD. Tracking the Mesos and Spark use cases, there are very large production deployments of these together. Some are rather private but others are being surfaced. IMHO, one of the most amazing case studies is from Christina Delimitrou: http://youtu.be/YpmElyi94AA For a tutorial, use the following, but upgrade it to the latest production Spark. There was a related O'Reilly webcast and Strata tutorial as well: http://mesosphere.io/learn/run-spark-on-mesos/ FWIW, I teach Intro to Spark with sections on CM4, YARN, Mesos, etc. Based on lots of student experiences, Mesos is clearly the shortest path to deploying a Spark cluster if you want to leverage the robustness, multi-tenancy for mixed workloads, less ops overhead, etc., that show up repeatedly in the use case analyses. My opinion only and not that of any of my clients: Don't believe the FUD from YHOO unless you really want to be stuck in 2009.

On Wed, May 7, 2014 at 8:30 AM, deric barton.to...@gmail.com wrote: I'm also using SPARK_EXECUTOR_URI right now, though I would prefer distributing Spark as a binary package. For running examples with `./bin/run-example ...` it works fine; however, tasks from spark-shell are getting lost. Error: Could not find or load main class org.apache.spark.executor.MesosExecutorBackend which looks more like a problem with sbin/spark-executor and missing paths to the jar. Has anyone encountered this error before? I guess Yahoo invested quite a lot of effort into YARN and Spark integration (moreover, as Mahout is migrating to Spark, there's much more interest in Hadoop and Spark integration). If there were some Mesos company working on Spark-Mesos integration, it could be at least on the same level. I don't see any other reason why YARN would be better than Mesos; personally I like the latter. However, I haven't checked YARN for a while; maybe they've made significant progress. I think Mesos is more universal and flexible than YARN.
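On the SPARK_EXECUTOR_URI point, a minimal conf/spark-env.sh sketch for running Spark 0.9.x on Mesos (the library path, tarball location, and ZooKeeper URL are assumptions; the URI must point at a Spark distribution that executors can fetch):

```
# conf/spark-env.sh (sketch)
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs:///frameworks/spark-0.9.1-bin.tar.gz
export MASTER=mesos://zk://zk1:2181/mesos
```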
Re: Spark LIBLINEAR
Hello Prof. Lin, Awesome news! I am curious if you have any benchmarks comparing the C++ MPI and Scala Spark liblinear implementations... Is Spark LIBLINEAR Apache licensed, or are there any specific restrictions on using it? Except for the use of native BLAS libraries (which each user has to manage by pulling in their best proprietary BLAS package), all Spark code is Apache licensed. Thanks. Deb On Sun, May 11, 2014 at 3:01 AM, DB Tsai dbt...@stanford.edu wrote: Dear Prof. Lin, Interesting! We have an implementation of L-BFGS in Spark, already merged upstream now. We read your paper comparing TRON and OWL-QN for logistic regression with L1 (http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf), but it seems that it's not in the distributed setup. It will be very interesting to know the L2 logistic regression benchmark result in Spark with your TRON optimizer and the L-BFGS optimizer against different datasets (sparse, dense, wide, etc.). I'll try your TRON out soon. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, May 11, 2014 at 1:49 AM, Chieh-Yen r01944...@csie.ntu.edu.tw wrote: Dear all, Recently we released a distributed extension of LIBLINEAR at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/ Currently, TRON for logistic regression and L2-loss SVM is supported. We provide both MPI and Spark implementations. This is very preliminary, so your comments are very welcome. Thanks, Chieh-Yen
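For anyone who wants a baseline to benchmark against, a minimal MLlib logistic regression sketch (Spark 0.9.x API assumed; the data path and input format are illustrative, not from the thread):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

val sc = new SparkContext("local", "lr-baseline")

// Parse "label f1 f2 ..." lines into LabeledPoints (format is illustrative).
val data = sc.textFile("hdfs:///data/train.txt").map { line =>
  val parts = line.split(' ').map(_.toDouble)
  LabeledPoint(parts.head, parts.tail)
}

// Train with a fixed iteration budget; compare against TRON on the same data.
val model = LogisticRegressionWithSGD.train(data, 100)
```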
Re: How to use spark-submit
Will sbt-pack and the maven solution work for the Scala REPL? I need the REPL because it saves a lot of time when I'm playing with large data sets: I load them once, cache them, and then try things out interactively before putting them in a standalone driver. I have sbt working for my own driver program on Spark 0.9. On May 11, 2014, at 3:49 PM, Stephen Boesch java...@gmail.com wrote: Just discovered sbt-pack: that addresses (quite well) the last item, identifying and packaging the external jars. 2014-05-11 12:34 GMT-07:00 Stephen Boesch java...@gmail.com: Hi Sonal, Yes, I am working towards that same idea. How did you go about creating the non-spark-jar dependencies? The way I am doing it is a separate straw-man project that does not include Spark but has the external third-party jars included, then running sbt compile:managedClasspath and reverse engineering the lib jars from it. That is obviously not ideal. The maven run will be useful for other projects built by maven; I will keep it in my notes. As for sbt run-example, it requires additional libraries to be added for my external dependencies. I tried several items including ADD_JARS, --driver-class-path and combinations of extraClassPath. I have deferred that ad-hoc approach in favor of finding a systematic one. 2014-05-08 5:26 GMT-07:00 Sonal Goyal sonalgoy...@gmail.com: I am creating a jar with only my dependencies and running spark-submit through my project mvn build. I have configured the mvn exec goal to the location of the script. Here is how I have set it up for my app. The mainClass is my driver program, and I am able to send my custom args too. Hope this helps.

<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>exec-maven-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>exec</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <executable>/home/sgoyal/spark/bin/spark-submit</executable>
    <arguments>
      <argument>${jars}</argument>
      <argument>--class</argument>
      <argument>${mainClass}</argument>
      <argument>--arg</argument>
      <argument>${spark.master}</argument>
      <argument>--arg</argument>
      <argument>${my app arg 1}</argument>
      <argument>--arg</argument>
      <argument>${my arg 2}</argument>
    </arguments>
  </configuration>
</plugin>

Best Regards, Sonal Nube Technologies On Wed, May 7, 2014 at 6:57 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Doesn't the run-example script work for you? Also, are you on the latest commit of branch-1.0? TD On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Yes, I'm struggling with a similar problem where my classes are not found on the worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate it if someone could provide some documentation on the usage of spark-submit. Thanks On May 5, 2014, at 10:24 PM, Stephen Boesch java...@gmail.com wrote: I have a spark streaming application that uses the external streaming modules (e.g. kafka, mqtt, ..) as well. It is not clear how to properly invoke the spark-submit script: what are the --driver-class-path and/or -Dspark.executor.extraClassPath parameters required? For reference, the following error is proving difficult to resolve: java.lang.ClassNotFoundException: org.apache.spark.streaming.examples.StreamingExamples
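Regarding the REPL question, a sketch of what worked on Spark 0.9.x for getting extra jars onto the shell's classpath (the jar paths are hypothetical; ADD_JARS ships them to executors, SPARK_CLASSPATH puts them on the driver's classpath):

```
# Sketch for Spark 0.9.x spark-shell with external dependencies
ADD_JARS=/path/to/dep1.jar,/path/to/dep2.jar \
SPARK_CLASSPATH=/path/to/dep1.jar:/path/to/dep2.jar \
./bin/spark-shell
```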
Re: Comprehensive Port Configuration reference?
On Tue, May 6, 2014 at 9:09 AM, Jacob Eisinger jeis...@us.ibm.com wrote: In a nutshell, Spark opens up a couple of well-known ports, and then the workers and the shell open up dynamic ports for each job. These dynamic ports make securing the Spark network difficult. Indeed. Judging by the frequency with which this topic arises, this is a concern for many (myself included). I couldn't find anything in JIRA about it, but I'm curious to know whether the Spark team considers this a problem in need of a fix? Mark.
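For what it's worth, at least the driver-side port can be pinned via configuration so a firewall rule can cover it; a sketch (the property name is from the Spark docs of this era, the port value is arbitrary):

```scala
import org.apache.spark.SparkConf

// Pin the driver port instead of letting Spark pick a random one.
// (Sketch; 50001 is an arbitrary choice -- open it in the firewall.)
val conf = new SparkConf()
  .setAppName("pinned-ports")
  .set("spark.driver.port", "50001")
```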
Re: writing my own RDD
Will do. On May 11, 2014 6:44 PM, Aaron Davidson ilike...@gmail.com wrote: You've got a good point there; those APIs should probably be marked as @DeveloperAPI. Would you mind filing a JIRA for that (https://issues.apache.org/jira/browse/SPARK)? On Sun, May 11, 2014 at 11:51 AM, Koert Kuipers ko...@tresata.com wrote: Resending... my email somehow never made it to the user list. On Fri, May 9, 2014 at 2:11 PM, Koert Kuipers ko...@tresata.com wrote: In writing my own RDD I ran into a few issues with respect to stuff being private in Spark. In compute I would like to return an iterator that respects task killing (as HadoopRDD does), but the mechanics for that are inside the private InterruptibleIterator. Also, the exception I am supposed to throw (TaskKilledException) is private to Spark.
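For readers following along, a minimal custom-RDD sketch (the class names, partition count, and values are illustrative; it returns a plain Iterator, sidestepping the private InterruptibleIterator the thread mentions):

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A partition that just remembers its index.
class SimplePartition(val index: Int) extends Partition

// A toy RDD of n partitions, each yielding its own index.
class IndexRDD(sc: SparkContext, n: Int) extends RDD[Int](sc, Nil) {
  override def getPartitions: Array[Partition] =
    (0 until n).map(i => new SimplePartition(i): Partition).toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator.single(split.index)
}
```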
Re: How to use spark-submit
Hi Sonal, Yes, I am working towards that same idea. How did you go about creating the non-spark-jar dependencies? The way I am doing it is a separate straw-man project that does not include Spark but has the external third-party jars included, then running sbt compile:managedClasspath and reverse engineering the lib jars from it. That is obviously not ideal. The maven run will be useful for other projects built by maven; I will keep it in my notes. As for sbt run-example, it requires additional libraries to be added for my external dependencies. I tried several items including ADD_JARS, --driver-class-path and combinations of extraClassPath. I have deferred that ad-hoc approach in favor of finding a systematic one. 2014-05-08 5:26 GMT-07:00 Sonal Goyal sonalgoy...@gmail.com: I am creating a jar with only my dependencies and running spark-submit through my project mvn build. I have configured the mvn exec goal to the location of the script. Here is how I have set it up for my app. The mainClass is my driver program, and I am able to send my custom args too. Hope this helps.

<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>exec-maven-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>exec</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <executable>/home/sgoyal/spark/bin/spark-submit</executable>
    <arguments>
      <argument>${jars}</argument>
      <argument>--class</argument>
      <argument>${mainClass}</argument>
      <argument>--arg</argument>
      <argument>${spark.master}</argument>
      <argument>--arg</argument>
      <argument>${my app arg 1}</argument>
      <argument>--arg</argument>
      <argument>${my arg 2}</argument>
    </arguments>
  </configuration>
</plugin>

Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Wed, May 7, 2014 at 6:57 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Doesn't the run-example script work for you? Also, are you on the latest commit of branch-1.0? TD On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Yes, I'm struggling with a similar problem where my classes are not found on the worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate it if someone could provide some documentation on the usage of spark-submit. Thanks On May 5, 2014, at 10:24 PM, Stephen Boesch java...@gmail.com wrote: I have a spark streaming application that uses the external streaming modules (e.g. kafka, mqtt, ..) as well. It is not clear how to properly invoke the spark-submit script: what are the --driver-class-path and/or -Dspark.executor.extraClassPath parameters required? For reference, the following error is proving difficult to resolve: java.lang.ClassNotFoundException: org.apache.spark.streaming.examples.StreamingExamples
Re: Test
I didn't get the original message, only the reply. Ruh-roh. On Sun, May 11, 2014 at 8:09 AM, Azuryy azury...@gmail.com wrote: Got it. But it doesn't indicate that everyone can receive this test. The mailing list has been unstable recently. Sent from my iPhone5s On May 10, 2014, at 13:31, Matei Zaharia matei.zaha...@gmail.com wrote: *This message has no content.*
streaming on HDFS can detect all new files, but the sum of all the rdd.count() does not equal the number detected
When I put 200 PNG files into HDFS, I found Spark Streaming could detect all 200 files, but the sum of rdd.count() is less than 200, always between 130 and 170. I don't know why... Is this a bug? PS: When I put the 200 files into HDFS before the streaming job runs, it gets the correct count and the right result. Here is the code:

def main(args: Array[String]) {
  val conf = new SparkConf().setMaster(SparkURL)
    .setAppName("QimageStreaming-broadcast")
    .setSparkHome(System.getenv("SPARK_HOME"))
    .setJars(SparkContext.jarOfClass(this.getClass()))
  conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  conf.set("spark.kryo.registrator", "qing.hdu.Image.MyRegistrator")
  conf.set("spark.kryoserializer.buffer.mb", "10")
  val ssc = new StreamingContext(conf, Seconds(2))
  val inputFormatClass = classOf[QimageInputFormat[Text, Qimage]]
  val outputFormatClass = classOf[QimageOutputFormat[Text, Qimage]]
  val input_path = HdfsURL + "/Qimage/input"
  val output_path = HdfsURL + "/Qimage/output/"
  val bg_path = HdfsURL + "/Qimage/bg/"
  val bg = ssc.sparkContext.newAPIHadoopFile[Text, Qimage, QimageInputFormat[Text, Qimage]](bg_path)
  val bbg = bg.map(data => (data._1.toString(), data._2))
  val broadcastbg = ssc.sparkContext.broadcast(bbg)
  val file = ssc.fileStream[Text, Qimage, QimageInputFormat[Text, Qimage]](input_path)
  val qingbg = broadcastbg.value.collectAsMap
  val foreachFunc = (rdd: RDD[(Text, Qimage)], time: Time) => {
    val rddnum = rdd.count
    System.out.println("\n\n" + "rddnum is " + rddnum + "\n\n")
    if (rddnum > 0) {
      System.out.println("here is foreachFunc")
      val a = rdd.keys
      val b = a.first
      val cbg = qingbg.get(getbgID(b)).getOrElse(new Qimage)
      rdd.map(data => (data._1, (new QimageProc(data._1, data._2)).koutu(cbg)))
        .saveAsNewAPIHadoopFile(output_path, classOf[Text], classOf[Qimage], outputFormatClass)
    }
  }
  file.foreachRDD(foreachFunc)
  ssc.start()
  ssc.awaitTermination()
}
build shark (hadoop CDH5) on hadoop 2.0.0 CDH4
I tried to build Shark via sbt, but this sbt exception turned up: [error] sbt.ResolveException: unresolved dependency: org.apache.hadoop#hadoop-client;2.0.0: not found. What can I do to build it successfully?
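One likely fix, sketched under the assumption that Shark's sbt build reads the Hadoop version from the SHARK_HADOOP_VERSION environment variable and that a bare 2.0.0 is not a published hadoop-client artifact version (check Shark's README, and match the CDH version string to your cluster):

```
# Sketch: build Shark against a full CDH4 artifact version
# (the exact version string below is an assumption)
SHARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt package
```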
Re: Is there any problem on the spark mailing list?
I haven't been getting mail either. This was the last message I received: http://apache-spark-user-list.1001560.n3.nabble.com/master-attempted-to-re-register-the-worker-and-then-took-all-workers-as-unregistered-tp553p5491.html
about spark interactive shell
Hi all, I am now using Spark in production, but I notice the Spark driver holds the RDDs and the DAG... and the executors will try to register with the driver. I think the driver should run on the cluster, and the client should run on the gateway. Similar to: http://apache-spark-user-list.1001560.n3.nabble.com/file/n5575/Spark-interactive_shell.jpg