Re: How to read a multipart s3 file?

2014-05-07 Thread Nicholas Chammas
Amazon also strongly discourages the use of s3:// because the block file
system it maps to is deprecated.

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html

Note
 The configuration of Hadoop running on Amazon EMR differs from the default
 configuration provided by Apache Hadoop. On Amazon EMR, s3n:// and s3://
 both map to the Amazon S3 native file system, *while in the default
 configuration provided by Apache Hadoop s3:// is mapped to the Amazon S3
 block storage system.*


Amazon S3 block is a deprecated file system that is not recommended because
 it can trigger a race condition that might cause your cluster to fail. It
 may be required by legacy applications.




On Tue, May 6, 2014 at 8:23 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 There’s a difference between s3:// and s3n:// in the Hadoop S3 access
 layer. Make sure you use the right one when reading stuff back. In general
 s3n:// ought to be better because it will create things that look like
 files in other S3 tools. s3:// was present when the file size limit in S3
 was much lower, and it uses S3 objects as blocks in a kind of overlay file
 system.

 If you use s3n:// for both, you should be able to pass the exact same file
 to load as you did to save. Make sure you also set your AWS keys in the
 environment or in SparkContext.hadoopConfiguration.

 Matei

 On May 6, 2014, at 5:19 PM, kamatsuoka ken...@gmail.com wrote:

  I have a Spark app that writes out a file,
 s3://mybucket/mydir/myfile.txt.
 
  Behind the scenes, the S3 driver creates a bunch of files like
  s3://mybucket//mydir/myfile.txt/part-, as well as the block files
 like
  s3://mybucket/block_3574186879395643429.
 
  How do I construct a URL to use this file as input to another Spark
  app? I tried all the variations of s3://mybucket/mydir/myfile.txt, but
  none of them work.
 
 
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-a-multipart-s3-file-tp5463.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: How to read a multipart s3 file?

2014-05-07 Thread Han JU
Just some additions to complement the other answers:

If you output to, say, `s3://bucket/myfile`, then you can use that path as
the input of other jobs (sc.textFile('s3://bucket/myfile')). By default all
`part-xxx` files under it will be used. There's also `sc.wholeTextFiles` that
you can play with.

If your files are small and need to be interoperable with other tools and
languages, s3n may be a better choice. But in my experience, when reading
directly from s3n, Spark creates only one input partition per file,
regardless of the file size. This may lead to performance problems if you
have big files.
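
To make that concrete, a small sketch (the bucket and path are placeholders,
the master is assumed to come from spark-submit, and repartition() is just
one way to spread the work if too few partitions come back):

  import org.apache.spark.{SparkConf, SparkContext}

  object MultipartReadSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("multipart-read-sketch"))

      // textFile on the output prefix reads every part-xxx file underneath it.
      val lines = sc.textFile("s3n://bucket/myfile")  // placeholder bucket and path

      // If the read yields too few partitions, repartitioning afterwards is one workaround.
      println(lines.repartition(64).count())

      // wholeTextFiles returns (fileName, fileContent) pairs, one pair per file.
      val whole = sc.wholeTextFiles("s3n://bucket/myfile")
      whole.take(1).foreach { case (name, _) => println(name) }

      sc.stop()
    }
  }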


2014-05-07 2:39 GMT+02:00 Andre Kuhnen andrekuh...@gmail.com:

 Try using s3n instead of s3
 On 06/05/2014 21:19, kamatsuoka ken...@gmail.com wrote:

 I have a Spark app that writes out a file, s3://mybucket/mydir/myfile.txt.

 Behind the scenes, the S3 driver creates a bunch of files like
 s3://mybucket//mydir/myfile.txt/part-, as well as the block files like
 s3://mybucket/block_3574186879395643429.

 How do I construct a URL to use this file as input to another Spark app? I
 tried all the variations of s3://mybucket/mydir/myfile.txt, but none of them
 work.





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-a-multipart-s3-file-tp5463.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




-- 
*JU Han*

Data Engineer @ Botify.com

+33 061960


Re: Easy one

2014-05-07 Thread Ian Ferreira
Thanks!

From:  Aaron Davidson ilike...@gmail.com
Reply-To:  user@spark.apache.org
Date:  Tuesday, May 6, 2014 at 5:32 PM
To:  user@spark.apache.org
Subject:  Re: Easy one

If you're using standalone mode, you need to make sure the Spark Workers
know about the extra memory. This can be configured in spark-env.sh on the
workers as

export SPARK_WORKER_MEMORY=4g
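
On the application side, spark.executor.memory is a Spark configuration
property rather than an executor environment variable (setExecutorEnv, as in
the quoted snippet below, only sets environment variables). A minimal sketch
of setting it on the SparkConf; the app name and master URL are placeholders:

  import org.apache.spark.{SparkConf, SparkContext}

  object ExecutorMemorySketch {
    def main(args: Array[String]): Unit = {
      // spark.executor.memory is a Spark property, not an env var, and the
      // amount requested must fit within what the workers advertise
      // (SPARK_WORKER_MEMORY above).
      val conf = new SparkConf()
        .setAppName("executor-memory-sketch")   // placeholder app name
        .setMaster("spark://master-host:7077")  // placeholder master URL
        .set("spark.executor.memory", "4g")
      val sc = new SparkContext(conf)
      println(sc.parallelize(1 to 1000).count())
      sc.stop()
    }
  }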


On Tue, May 6, 2014 at 5:29 PM, Ian Ferreira ianferre...@hotmail.com
wrote:
 Hi there,
 
  Why can't I seem to kick the executor memory higher? See below from EC2
 deployment using m1.large
 
 
 And in the spark-env.sh
 export SPARK_MEM=6154m
 
 
 And in the spark context
  sconf.setExecutorEnv("spark.executor.memory", "4g")
 
 Cheers
 - Ian
 





Is there anything that I need to modify?

2014-05-07 Thread Sophia
[root@CHBM220 spark-0.9.1]#
SPARK_JAR=.assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
./bin/spark-class org.apache.spark.deploy.yarn.Client --jar
examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar --class
org.apache.spark.examples.SparkPi --args yarn-standalone --num-workers 3
--master-memory 2g --worker-memory 2g --worker-cores 1 
14:50:45,485%5P RMProxy:56-Connecting to ResourceManager at
CHBM220/192.168.10.220:8032
Exception in thread main java.io.IOException: Failed on local exception:
com.google.protobuf.InvalidProtocolBufferException: Protocol message
contained an invalid tag (zero).; Host Details : local host is:
CHBM220/192.168.10.220; destination host is: CHBM220:8032;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
at org.apache.hadoop.ipc.Client.call(Client.java:1351)
at org.apache.hadoop.ipc.Client.call(Client.java:1300)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy7.getClusterMetrics(Unknown Source)
at
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy8.getClusterMetrics(Unknown Source)
at
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246)
at
org.apache.spark.deploy.yarn.Client.logClusterResourceDetails(Client.scala:144)
at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:79)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:115)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:493)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol
message contained an invalid tag (zero).
at
com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
at
com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
at
org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.<init>(RpcHeaderProtos.java:1398)
at
org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.<init>(RpcHeaderProtos.java:1362)
at
org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto$1.parsePartialFrom(RpcHeaderProtos.java:1492)
at
org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto$1.parsePartialFrom(RpcHeaderProtos.java:1487)
at
com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
at
com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:241)
at
com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
at
com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
at
com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
at
org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.parseDelimitedFrom(RpcHeaderProtos.java:2364)
at
org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:996)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:891)
[root@CHBM220:spark-0.9.1]#




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-anything-that-I-need-to-modify-tp5477.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How to use spark-submit

2014-05-07 Thread Tathagata Das
Doesn't the run-example script work for you? Also, are you on the latest
commit of branch-1.0?

TD


On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta soumya.sima...@gmail.com wrote:



  Yes, I'm struggling with a similar problem where my classes are not found on
 the worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate it if
 someone could provide some documentation on the usage of spark-submit.

 Thanks

  On May 5, 2014, at 10:24 PM, Stephen Boesch java...@gmail.com wrote:
 
 
  I have a Spark Streaming application that uses the external streaming
 modules (e.g. Kafka, MQTT, ...) as well. It is not clear how to properly
 invoke the spark-submit script: what --driver-class-path and/or
 -Dspark.executor.extraClassPath parameters are required?
 
   For reference, the following error is proving difficult to resolve:
 
  java.lang.ClassNotFoundException:
 org.apache.spark.streaming.examples.StreamingExamples
 



Unable to load native-hadoop library problem

2014-05-07 Thread Sophia
Hi, everyone,
[root@CHBM220 spark-0.9.1]#
SPARK_JAR=.assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
./bin/spark-class org.apache.spark.deploy.yarn.Client --jar
examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar --class
org.apache.spark.examples.SparkPi --args yarn-standalone --num-workers 3
--master-memory 2g --worker-memory 2g --worker-cores 1
14/05/07 09:05:14 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/05/07 09:05:14 INFO RMProxy: Connecting to ResourceManager at
CHBM220/192.168.10.220:8032
Then it stopped. My HADOOP_CONF_DIR has been configured correctly; what
should I do?
Wishing you happiness every day.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-load-native-hadoop-library-problem-tp5469.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: details about event log

2014-05-07 Thread wxhsdp
Any ideas? Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/details-about-event-log-tp5411p5476.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark and Java 8

2014-05-07 Thread Kristoffer Sjögren
Running Hadoop and HDFS on an unsupported JVM runtime sounds a little
adventurous. But as long as Spark can run in a separate Java 8 runtime, it's
all good. I think having lambdas and type inference is huge when writing
these jobs, and using Scala (paying the price of complexity, poor tooling,
etc.) for this tiny feature is often not justified.


On Wed, May 7, 2014 at 2:03 AM, Dean Wampler deanwamp...@gmail.com wrote:

 Cloudera customers will need to put pressure on them to support Java 8.
 They only officially supported Java 7 when Oracle stopped supporting Java 6.

 dean


 On Wed, May 7, 2014 at 5:05 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

 Java 8 support is a feature in Spark, but vendors need to decide for
 themselves when they'd like to support Java 8 commercially. You can still run
 Spark on Java 7 or 6 without taking advantage of the new features (indeed
 our builds are always against Java 6).

 Matei

 On May 6, 2014, at 8:59 AM, Ian O'Connell i...@ianoconnell.com wrote:

 I think the distinction there might be that they never said they ran that
 code under CDH5, just that Spark supports it and Spark runs under CDH5, not
 that you can use these features while running under CDH5.

 They could use Mesos or the standalone scheduler to run them.


 On Tue, May 6, 2014 at 6:16 AM, Kristoffer Sjögren sto...@gmail.com wrote:

 Hi

 I just read an article [1] about Spark, CDH5 and Java 8 but did not get
 exactly how Spark can run Java 8 on a YARN cluster at runtime. Is Spark
 using a separate JVM that runs on the data nodes, or is it reusing the YARN
 JVM runtime somehow, like Hadoop 1?

 CDH5 only supports Java 7 [2] as far as I know?

 Cheers,
 -Kristoffer


 [1]
 http://blog.cloudera.com/blog/2014/04/making-apache-spark-easier-to-use-in-java-with-java-8/
 [2]
 http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Requirements-and-Supported-Versions/CDH5-Requirements-and-Supported-Versions.html









 --
 Dean Wampler, Ph.D.
 Typesafe
 @deanwampler
 http://typesafe.com
 http://polyglotprogramming.com



Re: sbt run with spark.ContextCleaner ERROR

2014-05-07 Thread Tathagata Das
Okay, this needs to be fixed. Thanks for reporting this!



On Mon, May 5, 2014 at 11:00 PM, wxhsdp wxh...@gmail.com wrote:

 Hi, TD

  I tried on v1.0.0-rc3 and still got the error



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/sbt-run-with-spark-ContextCleaner-ERROR-tp5304p5421.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: master attempted to re-register the worker and then took all workers as unregistered

2014-05-07 Thread Cheney Sun
Hi Nan,

In the worker's log, I see the following exception thrown when it tries to
launch an executor. (SPARK_HOME is wrongly specified on purpose, so there is
no such file as /usr/local/spark1/bin/compute-classpath.sh.)
After the exception was thrown several times, the worker was asked to kill
the executor. Following the kill, the worker tries to register again with the
master, but the master rejects the registration with the WARN message "Got
heartbeat from unregistered worker
worker-20140504140005-host-spark-online001".

Looks like the issue wasn't fixed in 0.9.1. Do you know of any pull request
addressing this issue? Thanks.

java.io.IOException: Cannot run program
/usr/local/spark1/bin/compute-classpath.sh (in directory .): error=2,
No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:600)
at
org.apache.spark.deploy.worker.CommandUtils$.buildJavaOpts(CommandUtils.scala:58)
at
org.apache.spark.deploy.worker.CommandUtils$.buildCommandSeq(CommandUtils.scala:37)
at
org.apache.spark.deploy.worker.ExecutorRunner.getCommandSeq(ExecutorRunner.scala:104)
at
org.apache.spark.deploy.worker.ExecutorRunner.fetchAndRunExecutor(ExecutorRunner.scala:119)
at
org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:59)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:135)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1021)
... 6 more
..
14/05/04 21:35:45 INFO Worker: Asked to kill executor
app-20140504213545-0034/18
14/05/04 21:35:45 INFO Worker: Executor app-20140504213545-0034/18 finished
with state FAILED message class java.io.IOException: Cannot run program
/usr/local/spark1/bin/compute-classpath.sh (in directory .): error=2,
No such file or directory
14/05/04 21:35:45 ERROR OneForOneStrategy: key not found:
app-20140504213545-0034/18
java.util.NoSuchElementException: key not found: app-20140504213545-0034/18
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at
org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:232)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/05/04 21:35:45 INFO Worker: Starting Spark worker
host-spark-online001:7078 with 10 cores, 28.0 GB RAM
14/05/04 21:35:45 INFO Worker: Spark home: /usr/local/spark-0.9.1-cdh4.2.0
14/05/04 21:35:45 INFO WorkerWebUI: Started Worker web UI at
http://host-spark-online001:8081
14/05/04 21:35:45 INFO Worker: Connecting to master
spark://host-spark-online001:7077...
14/05/04 21:35:45 INFO Worker: Successfully registered with master
spark://host-spark-online001:7077