Re: How to read a multipart s3 file?
Amazon also strongly discourages the use of s3:// because the block file system it maps to is deprecated. From http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html:

Note: The configuration of Hadoop running on Amazon EMR differs from the default configuration provided by Apache Hadoop. On Amazon EMR, s3n:// and s3:// both map to the Amazon S3 native file system, *while in the default configuration provided by Apache Hadoop s3:// is mapped to the Amazon S3 block storage system.* Amazon S3 block is a deprecated file system that is not recommended because it can trigger a race condition that might cause your cluster to fail. It may be required by legacy applications.

On Tue, May 6, 2014 at 8:23 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

There's a difference between s3:// and s3n:// in the Hadoop S3 access layer. Make sure you use the right one when reading stuff back. In general s3n:// ought to be better because it will create things that look like files in other S3 tools. s3:// was present when the file size limit in S3 was much lower, and it uses S3 objects as blocks in a kind of overlay file system.

If you use s3n:// for both, you should be able to pass the exact same file to load as you did to save. Make sure you also set your AWS keys in the environment or in SparkContext.hadoopConfiguration.

Matei

On May 6, 2014, at 5:19 PM, kamatsuoka ken...@gmail.com wrote:

I have a Spark app that writes out a file, s3://mybucket/mydir/myfile.txt. Behind the scenes, the S3 driver creates a bunch of files like s3://mybucket//mydir/myfile.txt/part-, as well as the block files like s3://mybucket/block_3574186879395643429. How do I construct a URL to use this file as input to another Spark app? I tried all the variations of s3://mybucket/mydir/myfile.txt, but none of them work.
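To make the suggestion concrete, here is a minimal sketch of saving and then reading back a multipart output over s3n://, with the AWS keys set on SparkContext.hadoopConfiguration. The bucket name and credentials are placeholders, and the fs.s3n.* property names are the standard Hadoop s3n settings of that era, not something quoted from this thread:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("multipart-s3-sketch"))

    // Credentials can live in the Hadoop configuration instead of the environment.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

    // Write with s3n://; Spark creates part-XXXXX objects under the given path.
    sc.parallelize(1 to 100).map(_.toString).saveAsTextFile("s3n://mybucket/mydir/myfile.txt")

    // Read it back with the same directory-style path; all part files are picked up.
    val readBack = sc.textFile("s3n://mybucket/mydir/myfile.txt")
    println(readBack.count())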
Re: How to read a multipart s3 file?
Just a few additions to the other answers: if you output to, say, `s3://bucket/myfile`, then you can use that path as the input of other jobs (sc.textFile('s3://bucket/myfile')). By default all `part-xxx` files will be used. There's also `sc.wholeTextFiles` that you can play with.

If your file is small and needs to be interoperable with other tools/languages, s3n may be a better choice. But in my experience, when reading directly from s3n, Spark creates only one input partition per file, regardless of the file size. This may lead to performance problems if you have big files.

2014-05-07 2:39 GMT+02:00 Andre Kuhnen andrekuh...@gmail.com:

Try using s3n instead of s3

On 06/05/2014 21:19, kamatsuoka ken...@gmail.com wrote:

I have a Spark app that writes out a file, s3://mybucket/mydir/myfile.txt. Behind the scenes, the S3 driver creates a bunch of files like s3://mybucket//mydir/myfile.txt/part-, as well as the block files like s3://mybucket/block_3574186879395643429. How do I construct a URL to use this file as input to another Spark app? I tried all the variations of s3://mybucket/mydir/myfile.txt, but none of them work.

--
JU Han
Data Engineer @ Botify.com
+33 061960
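If the one-partition-per-file behaviour described above bites you, a hedged workaround is to ask for more splits at read time or to repartition explicitly. This assumes an existing SparkContext sc; the path and partition counts are illustrative only:

    // Hint at a minimum number of input partitions when reading (second argument).
    val lines = sc.textFile("s3n://bucket/myfile", 64)

    // If the hint is not honoured for your file layout, an explicit shuffle also spreads the data.
    val spread = lines.repartition(64)
    println(spread.partitions.length)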
Re: Easy one
Thanks!

From: Aaron Davidson ilike...@gmail.com
Reply-To: user@spark.apache.org
Date: Tuesday, May 6, 2014 at 5:32 PM
To: user@spark.apache.org
Subject: Re: Easy one

If you're using standalone mode, you need to make sure the Spark Workers know about the extra memory. This can be configured in spark-env.sh on the workers as

export SPARK_WORKER_MEMORY=4g

On Tue, May 6, 2014 at 5:29 PM, Ian Ferreira ianferre...@hotmail.com wrote:

Hi there,

Why can't I seem to kick the executor memory higher? See below from EC2 deployment using m1.large.

And in the spark-env.sh: export SPARK_MEM=6154m

And in the spark context: sconf.setExecutorEnv("spark.executor.memory", "4g")

Cheers
- Ian
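A minimal sketch of the two places memory usually has to be declared in standalone mode; the 4g value is illustrative, and note that executor memory is a Spark configuration property rather than an executor environment variable. On each worker, in conf/spark-env.sh (restart the workers afterwards):

    export SPARK_WORKER_MEMORY=4g

And in the driver program, request executor memory through SparkConf instead of setExecutorEnv:

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.executor.memory must fit within the worker's SPARK_WORKER_MEMORY.
    val sconf = new SparkConf().set("spark.executor.memory", "4g")
    val sc = new SparkContext(sconf)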
Is there anything that I need to modify?
[root@CHBM220 spark-0.9.1]# SPARK_JAR=.assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar ./bin/spark-class org.apache.spark.deploy.yarn.Client --jar examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar --class org.apache.spark.examples.SparkPi --args yarn-standalone --num-workers 3 --master-memory 2g --worker-memory 2g --worker-cores 1

14:50:45,485%5P RMProxy:56-Connecting to ResourceManager at CHBM220/192.168.10.220:8032

Exception in thread "main" java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).; Host Details : local host is: CHBM220/192.168.10.220; destination host is: CHBM220:8032;
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
    at org.apache.hadoop.ipc.Client.call(Client.java:1351)
    at org.apache.hadoop.ipc.Client.call(Client.java:1300)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy7.getClusterMetrics(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy8.getClusterMetrics(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246)
    at org.apache.spark.deploy.yarn.Client.logClusterResourceDetails(Client.scala:144)
    at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:79)
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:115)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:493)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
    at com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
    at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
    at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.<init>(RpcHeaderProtos.java:1398)
    at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.<init>(RpcHeaderProtos.java:1362)
    at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto$1.parsePartialFrom(RpcHeaderProtos.java:1492)
    at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto$1.parsePartialFrom(RpcHeaderProtos.java:1487)
    at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
    at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:241)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
    at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.parseDelimitedFrom(RpcHeaderProtos.java:2364)
    at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:996)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:891)

[root@CHBM220:spark-0.9.1]#
Re: How to use spark-submit
Doesn't the run-example script work for you? Also, are you on the latest commit of branch-1.0?

TD

On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta soumya.sima...@gmail.com wrote:

Yes, I'm struggling with a similar problem where my classes are not found on the worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate it if someone could provide some documentation on the usage of spark-submit.

Thanks

On May 5, 2014, at 10:24 PM, Stephen Boesch java...@gmail.com wrote:

I have a Spark Streaming application that uses the external streaming modules (e.g. kafka, mqtt, ...) as well. It is not clear how to properly invoke the spark-submit script: what --driver-class-path and/or -Dspark.executor.extraClassPath parameters are required?

For reference, the following error is proving difficult to resolve:

java.lang.ClassNotFoundException: org.apache.spark.streaming.examples.StreamingExamples
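As a hedged sketch only (the jar paths, main class, and master URL below are placeholders, not taken from this thread), a submission that ships the external streaming module to both the driver and the executors might look like this; a ClassNotFoundException on the workers usually means the jar containing that class was never distributed:

    ./bin/spark-submit \
      --master spark://your-master:7077 \
      --class your.app.StreamingMain \
      --jars /path/to/spark-streaming-kafka_2.10-1.0.0.jar,/path/to/kafka_2.10-0.8.1.1.jar \
      --driver-class-path /path/to/spark-streaming-kafka_2.10-1.0.0.jar \
      /path/to/your-app.jar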
Unable to load native-hadoop library problem
Hi everyone,

[root@CHBM220 spark-0.9.1]# SPARK_JAR=.assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar ./bin/spark-class org.apache.spark.deploy.yarn.Client --jar examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar --class org.apache.spark.examples.SparkPi --args yarn-standalone --num-workers 3 --master-memory 2g --worker-memory 2g --worker-cores 1

14/05/07 09:05:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/05/07 09:05:14 INFO RMProxy: Connecting to ResourceManager at CHBM220/192.168.10.220:8032

Then it stopped. My HADOOP_CONF_DIR has been configured correctly; what should I do?

Wishing you a happy day.
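A hedged sketch of the environment settings usually involved here; the paths are placeholders for wherever Hadoop is actually installed. The native-library message is only a warning, so the hang is more likely a ResourceManager connectivity or configuration issue:

    # Make sure the YARN client can find yarn-site.xml and the other cluster configs.
    export HADOOP_CONF_DIR=/etc/hadoop/conf

    # Optional: silence the native-hadoop warning by adding the native libs to the search path.
    export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/usr/local/hadoop/lib/native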
Re: details about event log
Any ideas? Thanks!
Re: Spark and Java 8
Running Hadoop and HDFS on an unsupported JVM runtime sounds a little adventurous, but as long as Spark can run in a separate Java 8 runtime it's all good. I think having lambdas and type inference is huge when writing these jobs, and using Scala (paying the price of complexity, poor tooling, etc.) for this tiny feature is often not justified.

On Wed, May 7, 2014 at 2:03 AM, Dean Wampler deanwamp...@gmail.com wrote:

Cloudera customers will need to put pressure on them to support Java 8. They only officially supported Java 7 when Oracle stopped supporting Java 6.

dean

On Wed, May 7, 2014 at 5:05 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

Java 8 support is a feature in Spark, but vendors need to decide for themselves when they'd like to support Java 8 commercially. You can still run Spark on Java 7 or 6 without taking advantage of the new features (indeed our builds are always against Java 6).

Matei

On May 6, 2014, at 8:59 AM, Ian O'Connell i...@ianoconnell.com wrote:

I think the distinction there might be that they never said they ran that code under CDH5, just that Spark supports it and Spark runs under CDH5 -- not that you can use these features while running under CDH5. They could use Mesos or the standalone scheduler to run them.

On Tue, May 6, 2014 at 6:16 AM, Kristoffer Sjögren sto...@gmail.com wrote:

Hi

I just read an article [1] about Spark, CDH5 and Java 8, but did not get exactly how Spark can run Java 8 on a YARN cluster at runtime. Is Spark using a separate JVM that runs on the data nodes, or is it reusing the YARN JVM runtime somehow, like hadoop1? CDH5 only supports Java 7 [2] as far as I know.

Cheers,
-Kristoffer

[1] http://blog.cloudera.com/blog/2014/04/making-apache-spark-easier-to-use-in-java-with-java-8/
[2] http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Requirements-and-Supported-Versions/CDH5-Requirements-and-Supported-Versions.html

--
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com
Re: sbt run with spark.ContextCleaner ERROR
Okay, this needs to be fixed. Thanks for reporting this!

On Mon, May 5, 2014 at 11:00 PM, wxhsdp wxh...@gmail.com wrote:

Hi, TD

I tried on v1.0.0-rc3 and still got the error.
Re: master attempted to re-register the worker and then took all workers as unregistered
Hi Nan,

In the worker's log, I see the following exception thrown when it tries to launch an executor. (SPARK_HOME is wrongly specified on purpose, so there is no such file as /usr/local/spark1/bin/compute-classpath.sh.) After the exception was thrown several times, the worker was asked to kill the executor. Following the kill, the worker tries to register again with the master, but the master rejects the registration with the WARN message "Got heartbeat from unregistered worker worker-20140504140005-host-spark-online001".

It looks like the issue wasn't fixed in 0.9.1. Do you know of any pull request addressing it? Thanks.

java.io.IOException: Cannot run program "/usr/local/spark1/bin/compute-classpath.sh" (in directory "."): error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:600)
    at org.apache.spark.deploy.worker.CommandUtils$.buildJavaOpts(CommandUtils.scala:58)
    at org.apache.spark.deploy.worker.CommandUtils$.buildCommandSeq(CommandUtils.scala:37)
    at org.apache.spark.deploy.worker.ExecutorRunner.getCommandSeq(ExecutorRunner.scala:104)
    at org.apache.spark.deploy.worker.ExecutorRunner.fetchAndRunExecutor(ExecutorRunner.scala:119)
    at org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:59)
Caused by: java.io.IOException: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:135)
    at java.lang.ProcessImpl.start(ProcessImpl.java:130)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1021)
    ... 6 more

..

14/05/04 21:35:45 INFO Worker: Asked to kill executor app-20140504213545-0034/18
14/05/04 21:35:45 INFO Worker: Executor app-20140504213545-0034/18 finished with state FAILED message class java.io.IOException: Cannot run program "/usr/local/spark1/bin/compute-classpath.sh" (in directory "."): error=2, No such file or directory
14/05/04 21:35:45 ERROR OneForOneStrategy: key not found: app-20140504213545-0034/18
java.util.NoSuchElementException: key not found: app-20140504213545-0034/18
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
    at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:232)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/05/04 21:35:45 INFO Worker: Starting Spark worker host-spark-online001:7078 with 10 cores, 28.0 GB RAM
14/05/04 21:35:45 INFO Worker: Spark home: /usr/local/spark-0.9.1-cdh4.2.0
14/05/04 21:35:45 INFO WorkerWebUI: Started Worker web UI at http://host-spark-online001:8081
14/05/04 21:35:45 INFO Worker: Connecting to master spark://host-spark-online001:7077...
14/05/04 21:35:45 INFO Worker: Successfully registered with master spark://host-spark-online001:7077
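For what it's worth, a hedged sketch of the worker-side setting that avoids the missing compute-classpath.sh failure in the first place; the path below is taken from the "Spark home" log line above, and the master's rejection of the re-registration looks like a separate issue:

    # conf/spark-env.sh on each worker: point SPARK_HOME at the real install
    # directory so bin/compute-classpath.sh can be found when executors launch.
    export SPARK_HOME=/usr/local/spark-0.9.1-cdh4.2.0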