Re: How to read a multipart s3 file?
Amazon also strongly discourages the use of s3:// because the block file system it maps to is deprecated. From http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html:

Note: The configuration of Hadoop running on Amazon EMR differs from the default configuration provided by Apache Hadoop. On Amazon EMR, s3n:// and s3:// both map to the Amazon S3 native file system, *while in the default configuration provided by Apache Hadoop s3:// is mapped to the Amazon S3 block storage system.* Amazon S3 block is a deprecated file system that is not recommended because it can trigger a race condition that might cause your cluster to fail. It may be required by legacy applications.

On Tue, May 6, 2014 at 8:23 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

There's a difference between s3:// and s3n:// in the Hadoop S3 access layer. Make sure you use the right one when reading stuff back. In general s3n:// ought to be better because it will create things that look like files in other S3 tools. s3:// was present when the file size limit in S3 was much lower, and it uses S3 objects as blocks in a kind of overlay file system.

If you use s3n:// for both, you should be able to pass the exact same file to load as you did to save. Make sure you also set your AWS keys in the environment or in SparkContext.hadoopConfiguration.

Matei

On May 6, 2014, at 5:19 PM, kamatsuoka ken...@gmail.com wrote:

I have a Spark app that writes out a file, s3://mybucket/mydir/myfile.txt. Behind the scenes, the S3 driver creates a bunch of files like s3://mybucket//mydir/myfile.txt/part-, as well as the block files like s3://mybucket/block_3574186879395643429. How do I construct a URL to use this file as input to another Spark app? I tried all the variations of s3://mybucket/mydir/myfile.txt, but none of them work.
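To make the suggestion concrete, here is a minimal sketch of saving and then reading back a multipart output over s3n://, with the AWS keys set on SparkContext.hadoopConfiguration. The bucket name and credentials are placeholders, and the fs.s3n.* property names are the standard Hadoop s3n settings of that era, not something quoted from this thread:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("multipart-s3-sketch"))

    // Credentials can live in the Hadoop configuration instead of the environment.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

    // Write with s3n://; Spark creates part-XXXXX objects under the given path.
    sc.parallelize(1 to 100).map(_.toString).saveAsTextFile("s3n://mybucket/mydir/myfile.txt")

    // Read it back with the same directory-style path; all part files are picked up.
    val readBack = sc.textFile("s3n://mybucket/mydir/myfile.txt")
    println(readBack.count())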
Re: How to read a multipart s3 file?
Just a few additions to the other answers: if you output to, say, `s3://bucket/myfile`, then you can use that path as the input of other jobs (sc.textFile('s3://bucket/myfile')). By default all `part-xxx` files will be used. There's also `sc.wholeTextFiles` that you can play with.

If your file is small and needs to be interoperable with other tools/languages, s3n may be a better choice. But in my experience, when reading directly from s3n, Spark creates only one input partition per file, regardless of the file size. This may lead to performance problems if you have big files.

2014-05-07 2:39 GMT+02:00 Andre Kuhnen andrekuh...@gmail.com:

Try using s3n instead of s3

On 06/05/2014 21:19, kamatsuoka ken...@gmail.com wrote:

I have a Spark app that writes out a file, s3://mybucket/mydir/myfile.txt. Behind the scenes, the S3 driver creates a bunch of files like s3://mybucket//mydir/myfile.txt/part-, as well as the block files like s3://mybucket/block_3574186879395643429. How do I construct a URL to use this file as input to another Spark app? I tried all the variations of s3://mybucket/mydir/myfile.txt, but none of them work.

--
JU Han
Data Engineer @ Botify.com
+33 061960
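If the one-partition-per-file behaviour described above bites you, a hedged workaround is to ask for more splits at read time or to repartition explicitly. This assumes an existing SparkContext sc; the path and partition counts are illustrative only:

    // Hint at a minimum number of input partitions when reading (second argument).
    val lines = sc.textFile("s3n://bucket/myfile", 64)

    // If the hint is not honoured for your file layout, an explicit shuffle also spreads the data.
    val spread = lines.repartition(64)
    println(spread.partitions.length)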
Re: Easy one
Thanks!

From: Aaron Davidson ilike...@gmail.com
Reply-To: user@spark.apache.org
Date: Tuesday, May 6, 2014 at 5:32 PM
To: user@spark.apache.org
Subject: Re: Easy one

If you're using standalone mode, you need to make sure the Spark Workers know about the extra memory. This can be configured in spark-env.sh on the workers as

export SPARK_WORKER_MEMORY=4g

On Tue, May 6, 2014 at 5:29 PM, Ian Ferreira ianferre...@hotmail.com wrote:

Hi there,

Why can't I seem to kick the executor memory higher? See below from EC2 deployment using m1.large.

And in the spark-env.sh: export SPARK_MEM=6154m

And in the spark context: sconf.setExecutorEnv("spark.executor.memory", "4g")

Cheers
- Ian
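A minimal sketch of the two places memory usually has to be declared in standalone mode; the 4g value is illustrative, and note that executor memory is a Spark configuration property rather than an executor environment variable. On each worker, in conf/spark-env.sh (restart the workers afterwards):

    export SPARK_WORKER_MEMORY=4g

And in the driver program, request executor memory through SparkConf instead of setExecutorEnv:

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.executor.memory must fit within the worker's SPARK_WORKER_MEMORY.
    val sconf = new SparkConf().set("spark.executor.memory", "4g")
    val sc = new SparkContext(sconf)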
Is there anything that I need to modify?
[root@CHBM220 spark-0.9.1]# SPARK_JAR=.assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar ./bin/spark-class org.apache.spark.deploy.yarn.Client --jar examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar --class org.apache.spark.examples.SparkPi --args yarn-standalone --num-workers 3 --master-memory 2g --worker-memory 2g --worker-cores 1

14:50:45,485%5P RMProxy:56-Connecting to ResourceManager at CHBM220/192.168.10.220:8032

Exception in thread "main" java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).; Host Details : local host is: CHBM220/192.168.10.220; destination host is: CHBM220:8032;
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
    at org.apache.hadoop.ipc.Client.call(Client.java:1351)
    at org.apache.hadoop.ipc.Client.call(Client.java:1300)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy7.getClusterMetrics(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy8.getClusterMetrics(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246)
    at org.apache.spark.deploy.yarn.Client.logClusterResourceDetails(Client.scala:144)
    at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:79)
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:115)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:493)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
    at com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
    at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
    at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.<init>(RpcHeaderProtos.java:1398)
    at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.<init>(RpcHeaderProtos.java:1362)
    at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto$1.parsePartialFrom(RpcHeaderProtos.java:1492)
    at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto$1.parsePartialFrom(RpcHeaderProtos.java:1487)
    at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
    at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:241)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
    at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.parseDelimitedFrom(RpcHeaderProtos.java:2364)
    at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:996)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:891)

[root@CHBM220:spark-0.9.1]#
Re: How to use spark-submit
Doesn't the run-example script work for you? Also, are you on the latest commit of branch-1.0?

TD

On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta soumya.sima...@gmail.com wrote:

Yes, I'm struggling with a similar problem where my classes are not found on the worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate it if someone could provide some documentation on the usage of spark-submit.

Thanks

On May 5, 2014, at 10:24 PM, Stephen Boesch java...@gmail.com wrote:

I have a Spark Streaming application that uses the external streaming modules (e.g. kafka, mqtt, ...) as well. It is not clear how to properly invoke the spark-submit script: what --driver-class-path and/or -Dspark.executor.extraClassPath parameters are required?

For reference, the following error is proving difficult to resolve:

java.lang.ClassNotFoundException: org.apache.spark.streaming.examples.StreamingExamples
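As a hedged sketch only (the jar paths, main class, and master URL below are placeholders, not taken from this thread), a submission that ships the external streaming module to both the driver and the executors might look like this; a ClassNotFoundException on the workers usually means the jar containing that class was never distributed:

    ./bin/spark-submit \
      --master spark://your-master:7077 \
      --class your.app.StreamingMain \
      --jars /path/to/spark-streaming-kafka_2.10-1.0.0.jar,/path/to/kafka_2.10-0.8.1.1.jar \
      --driver-class-path /path/to/spark-streaming-kafka_2.10-1.0.0.jar \
      /path/to/your-app.jar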
Unable to load native-hadoop library problem
Hi everyone,

[root@CHBM220 spark-0.9.1]# SPARK_JAR=.assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar ./bin/spark-class org.apache.spark.deploy.yarn.Client --jar examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar --class org.apache.spark.examples.SparkPi --args yarn-standalone --num-workers 3 --master-memory 2g --worker-memory 2g --worker-cores 1

14/05/07 09:05:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/05/07 09:05:14 INFO RMProxy: Connecting to ResourceManager at CHBM220/192.168.10.220:8032

Then it stopped. My HADOOP_CONF_DIR has been configured correctly; what should I do?

Wishing you a happy day.
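A hedged sketch of the environment settings usually involved here; the paths are placeholders for wherever Hadoop is actually installed. The native-library message is only a warning, so the hang is more likely a ResourceManager connectivity or configuration issue:

    # Make sure the YARN client can find yarn-site.xml and the other cluster configs.
    export HADOOP_CONF_DIR=/etc/hadoop/conf

    # Optional: silence the native-hadoop warning by adding the native libs to the search path.
    export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/usr/local/hadoop/lib/native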
Re: details about event log
Any ideas? Thanks!
Re: Spark and Java 8
Running Hadoop and HDFS on an unsupported JVM runtime sounds a little adventurous, but as long as Spark can run in a separate Java 8 runtime it's all good. I think having lambdas and type inference is huge when writing these jobs, and using Scala (paying the price of complexity, poor tooling, etc.) for this tiny feature is often not justified.

On Wed, May 7, 2014 at 2:03 AM, Dean Wampler deanwamp...@gmail.com wrote:

Cloudera customers will need to put pressure on them to support Java 8. They only officially supported Java 7 when Oracle stopped supporting Java 6.

dean

On Wed, May 7, 2014 at 5:05 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

Java 8 support is a feature in Spark, but vendors need to decide for themselves when they'd like to support Java 8 commercially. You can still run Spark on Java 7 or 6 without taking advantage of the new features (indeed our builds are always against Java 6).

Matei

On May 6, 2014, at 8:59 AM, Ian O'Connell i...@ianoconnell.com wrote:

I think the distinction there might be that they never said they ran that code under CDH5, just that Spark supports it and Spark runs under CDH5 -- not that you can use these features while running under CDH5. They could use Mesos or the standalone scheduler to run them.

On Tue, May 6, 2014 at 6:16 AM, Kristoffer Sjögren sto...@gmail.com wrote:

Hi

I just read an article [1] about Spark, CDH5 and Java 8, but did not get exactly how Spark can run Java 8 on a YARN cluster at runtime. Is Spark using a separate JVM that runs on the data nodes, or is it reusing the YARN JVM runtime somehow, like hadoop1? CDH5 only supports Java 7 [2] as far as I know.

Cheers,
-Kristoffer

[1] http://blog.cloudera.com/blog/2014/04/making-apache-spark-easier-to-use-in-java-with-java-8/
[2] http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Requirements-and-Supported-Versions/CDH5-Requirements-and-Supported-Versions.html

--
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com
Re: sbt run with spark.ContextCleaner ERROR
Okay, this needs to be fixed. Thanks for reporting this!

On Mon, May 5, 2014 at 11:00 PM, wxhsdp wxh...@gmail.com wrote:

Hi, TD

I tried on v1.0.0-rc3 and still got the error.
Re: master attempted to re-register the worker and then took all workers as unregistered
Hi Nan,

In the worker's log, I see the following exception thrown when it tries to launch an executor. (SPARK_HOME is wrongly specified on purpose, so there is no such file as /usr/local/spark1/bin/compute-classpath.sh.) After the exception was thrown several times, the worker was asked to kill the executor. Following the kill, the worker tries to register again with the master, but the master rejects the registration with the WARN message "Got heartbeat from unregistered worker worker-20140504140005-host-spark-online001".

It looks like the issue wasn't fixed in 0.9.1. Do you know of any pull request addressing it? Thanks.

java.io.IOException: Cannot run program "/usr/local/spark1/bin/compute-classpath.sh" (in directory "."): error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:600)
    at org.apache.spark.deploy.worker.CommandUtils$.buildJavaOpts(CommandUtils.scala:58)
    at org.apache.spark.deploy.worker.CommandUtils$.buildCommandSeq(CommandUtils.scala:37)
    at org.apache.spark.deploy.worker.ExecutorRunner.getCommandSeq(ExecutorRunner.scala:104)
    at org.apache.spark.deploy.worker.ExecutorRunner.fetchAndRunExecutor(ExecutorRunner.scala:119)
    at org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:59)
Caused by: java.io.IOException: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:135)
    at java.lang.ProcessImpl.start(ProcessImpl.java:130)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1021)
    ... 6 more

..

14/05/04 21:35:45 INFO Worker: Asked to kill executor app-20140504213545-0034/18
14/05/04 21:35:45 INFO Worker: Executor app-20140504213545-0034/18 finished with state FAILED message class java.io.IOException: Cannot run program "/usr/local/spark1/bin/compute-classpath.sh" (in directory "."): error=2, No such file or directory
14/05/04 21:35:45 ERROR OneForOneStrategy: key not found: app-20140504213545-0034/18
java.util.NoSuchElementException: key not found: app-20140504213545-0034/18
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
    at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:232)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/05/04 21:35:45 INFO Worker: Starting Spark worker host-spark-online001:7078 with 10 cores, 28.0 GB RAM
14/05/04 21:35:45 INFO Worker: Spark home: /usr/local/spark-0.9.1-cdh4.2.0
14/05/04 21:35:45 INFO WorkerWebUI: Started Worker web UI at http://host-spark-online001:8081
14/05/04 21:35:45 INFO Worker: Connecting to master spark://host-spark-online001:7077...
14/05/04 21:35:45 INFO Worker: Successfully registered with master spark://host-spark-online001:7077
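For what it's worth, a hedged sketch of the worker-side setting that avoids the missing compute-classpath.sh failure in the first place; the path below is taken from the "Spark home" log line above, and the master's rejection of the re-registration looks like a separate issue:

    # conf/spark-env.sh on each worker: point SPARK_HOME at the real install
    # directory so bin/compute-classpath.sh can be found when executors launch.
    export SPARK_HOME=/usr/local/spark-0.9.1-cdh4.2.0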