Re: Spark + Kinesis

2015-04-03 Thread Daniil Osipov
Assembly settings have an option to exclude jars. You need something
similar to:
assemblyExcludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
  val excludes = Set(
    "minlog-1.2.jar"
  )
  cp filter { jar => excludes(jar.data.getName) }
}

in your build file (may need to be refactored into a .scala file)

On Fri, Apr 3, 2015 at 12:57 PM, Vadim Bichutskiy 
vadim.bichuts...@gmail.com wrote:

 Removed "provided" and got the following error:

 [error] (*:assembly) deduplicate: different file contents found in the
 following:

 [error]
 /Users/vb/.ivy2/cache/com.esotericsoftware.kryo/kryo/bundles/kryo-2.21.jar:com/esotericsoftware/minlog/Log$Logger.class

 [error]
 /Users/vb/.ivy2/cache/com.esotericsoftware.minlog/minlog/jars/minlog-1.2.jar:com/esotericsoftware/minlog/Log$Logger.class

 On Fri, Apr 3, 2015 at 3:48 PM, Tathagata Das t...@databricks.com wrote:

 Just remove "provided" for spark-streaming-kinesis-asl

 libraryDependencies += "org.apache.spark" %%
 "spark-streaming-kinesis-asl" % "1.3.0"

 On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy 
 vadim.bichuts...@gmail.com wrote:

 Thanks. So how do I fix it?

 On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

   spark-streaming-kinesis-asl is not part of the Spark distribution on
 your cluster, so you cannot have it be just a provided dependency.  This
 is also why the KCL and its dependencies were not included in the assembly
 (but yes, they should be).


  ~ Jonathan Kelly

   From: Vadim Bichutskiy vadim.bichuts...@gmail.com
 Date: Friday, April 3, 2015 at 12:26 PM
 To: Jonathan Kelly jonat...@amazon.com
 Cc: user@spark.apache.org user@spark.apache.org
 Subject: Re: Spark + Kinesis

   Hi all,

  Good news! I was able to create a Kinesis consumer and assemble it
 into an uber jar by following
 http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
 and the example at
 https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
 .

  However when I try to spark-submit it I get the following exception:

  *Exception in thread "main" java.lang.NoClassDefFoundError:
 com/amazonaws/auth/AWSCredentialsProvider*

  Do I need to include the KCL dependency in *build.sbt*? Here's what it
 looks like currently:

  import AssemblyKeys._
 name := "Kinesis Consumer"
 version := "1.0"
 organization := "com.myconsumer"
 scalaVersion := "2.11.5"

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" %
 "provided"
 libraryDependencies += "org.apache.spark" %% "spark-streaming" %
 "1.3.0" % "provided"
 libraryDependencies += "org.apache.spark" %%
 "spark-streaming-kinesis-asl" % "1.3.0" % "provided"

  assemblySettings
 jarName in assembly := "consumer-assembly.jar"
 assemblyOption in assembly := (assemblyOption in
 assembly).value.copy(includeScala=false)

  Any help appreciated.

  Thanks,
 Vadim

 On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

  It looks like you're attempting to mix Scala versions, so that's
 going to cause some problems.  If you really want to use Scala 2.11.5, you
 must also use Spark package versions built for Scala 2.11 rather than
 2.10.  Anyway, that's not quite the correct way to specify Scala
 dependencies in build.sbt.  Instead of placing the Scala version after the
 artifactId (like spark-core_2.10), what you actually want is to use just
 spark-core with two percent signs before it.  Using two percent signs
 will make it use the version of Scala that matches your declared
 scalaVersion.  For example:

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
 % "provided"

  libraryDependencies += "org.apache.spark" %% "spark-streaming" %
 "1.3.0" % "provided"

  libraryDependencies += "org.apache.spark" %%
 "spark-streaming-kinesis-asl" % "1.3.0"

  I think that may get you a little closer, though I think you're
 probably going to run into the same problems I ran into in this thread:
 https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I
 never really got an answer for that, and I temporarily moved on to other
 things for now.


  ~ Jonathan Kelly

   From: 'Vadim Bichutskiy' vadim.bichuts...@gmail.com
 Date: Thursday, April 2, 2015 at 9:53 AM
 To: user@spark.apache.org user@spark.apache.org
 Subject: Spark + Kinesis

   Hi all,

  I am trying to write an Amazon Kinesis consumer Scala app that
 processes data in the
 Kinesis stream. Is this the correct way to specify *build.sbt*:

  ---
 import AssemblyKeys._
 name := "Kinesis Consumer"
 version := "1.0"
 organization := "com.myconsumer"
 scalaVersion := "2.11.5"

 libraryDependencies ++= Seq(
   "org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
   "org.apache.spark" % "spark-streaming_2.10" % "1.3.0",
   "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

 assemblySettings
 jarName in assembly := "consumer-assembly.jar"
 assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala=false)
 ---

  In 

Re: Spark-EC2 Security Group Error

2015-04-01 Thread Daniil Osipov
Appears to be a problem with boto. Make sure you have boto 2.34 on your
system.

On Wed, Apr 1, 2015 at 11:19 AM, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:

 Hi all – I’m trying to bring up a spark ec2 cluster with the script below
 and see the following error. Can anyone please advise as to what’s going
 on? Is this indicative of me being unable to connect in the first place?
 The keys are known to work (they’re used elsewhere).

 ./spark-ec2 -k $AWS_KEYPAIR_NAME -i $AWS_PRIVATE_KEY -s 2
 --region=us-west1 --zone=us-west-1a --instance-type=r3.2xlarge launch
 FuntimePartyLand

 Setting up security groups...
 Traceback (most recent call last):
    File "./spark_ec2.py", line 1383, in <module>
  main()
    File "./spark_ec2.py", line 1375, in main
  real_main()
    File "./spark_ec2.py", line 1210, in real_main
  (master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name)
    File "./spark_ec2.py", line 431, in launch_cluster
  master_group = get_or_make_group(conn, cluster_name + "-master",
  opts.vpc_id)
    File "./spark_ec2.py", line 310, in get_or_make_group
  groups = conn.get_all_security_groups()
  AttributeError: 'NoneType' object has no attribute
  'get_all_security_groups'

 Thank you,
 Ilya Ganelin






-- 

*Dan Osipov*
*Shazam*


Re: Spark on EC2

2015-04-01 Thread Daniil Osipov
You're probably requesting more instances than allowed by your account, so
the error gets generated for the extra instances. Try launching a smaller
cluster.

On Wed, Apr 1, 2015 at 12:41 PM, Vadim Bichutskiy 
vadim.bichuts...@gmail.com wrote:

 Hi all,

 I just tried launching a Spark cluster on EC2 as described in
 http://spark.apache.org/docs/1.3.0/ec2-scripts.html

 I got the following response:


  *<Response><Errors><Error><Code>PendingVerification</Code><Message>Your
  account is currently being verified. Verification normally takes less than
  2 hours. Until your account is verified, you may not be able to launch
  additional instances or create additional volumes. If you are still
  receiving this message after more than 2 hours, please let us know by
  writing to aws-verificat...@amazon.com. We
  appreciate your patience...</Message></Error></Errors></Response>*
 However, I can see the EC2 instances in the AWS console as running.

 Any thoughts on what's going on?

 Thanks,
 Vadim




-- 

*Dan Osipov*
*Shazam*


Re: GraphX pregel: getting the current iteration number

2015-02-03 Thread Daniil Osipov
I don't think it's possible to access. What I've done before is to send the
current or next iteration index with the message, where the message is a
case class.
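
A minimal sketch of that approach, assuming a toy graph built in local mode
(the Msg case class and the (lastIterationSeen, value) vertex attribute are
made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Hypothetical message type: the payload travels together with the index of
// the superstep that produced it, since Pregel's own loop counter is not exposed.
case class Msg(iteration: Int, payload: Double)

object IterationAwarePregel {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("IterationAwarePregel").setMaster("local[*]"))

    // Toy graph; each vertex attribute is (lastIterationSeen, value).
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
    val graph = Graph.fromEdges(edges, defaultValue = (0, 1.0))

    val result = graph.pregel(Msg(0, 0.0), maxIterations = 5)(
      // vprog: remember the superstep index delivered with the message
      (_, state, msg) => (msg.iteration, state._2 + msg.payload),
      // sendMsg: bump the index so receivers know which superstep this came from
      triplet => Iterator((triplet.dstId, Msg(triplet.srcAttr._1 + 1, triplet.srcAttr._2))),
      // mergeMsg: keep the message from the later superstep
      (a, b) => if (a.iteration >= b.iteration) a else b)

    result.vertices.collect().foreach(println)
    sc.stop()
  }
}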

HTH
Dan

On Tue, Feb 3, 2015 at 10:20 AM, Matthew Cornell corn...@cs.umass.edu
wrote:

 Hi Folks,

 I'm new to GraphX and Scala, and my sendMsg function needs to index into an
 input list to my algorithm based on the pregel()() iteration number, but I
 don't see a way to access that. I see in
 https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
 that it's just an index variable i in a while loop, but is there a way
 for sendMsg to access it within the loop's scope? I don't think so, so barring
 that, given Scala's functional stateless nature, what other approaches
 would you take to do this? I'm considering a closure, but a var that gets
 updated by all the sendMsgs seems like a recipe for trouble.

 Thank you,

 matt
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark S3 Performance

2014-11-24 Thread Daniil Osipov
Can you verify that it's reading the entire file on each worker using
network monitoring stats? If it does, that would be a bug in my opinion.

On Mon, Nov 24, 2014 at 2:06 PM, Nitay Joffe ni...@actioniq.co wrote:

 Andrei, Ashish,

 To be clear, I don't think it's *counting* the entire file. It just seems
 from the logging and the timing that it is doing a get of the entire file,
 then figures out it only needs certain blocks, and does another get of
 only the specific block.

 Regarding # partitions - I think I see now that it has to do with Hadoop's
 block size being set at 64MB. This is not a big deal to me; the main issue
 is the first one: why is every worker doing a call to get the entire file
 followed by the *real* call to get only the specific partitions it needs?

 Best,

 - Nitay
 Founder & CTO


 On Sat, Nov 22, 2014 at 8:28 PM, Andrei faithlessfri...@gmail.com wrote:

 Concerning your second question, I believe you try to set the number of
 partitions with something like this:

 rdd = sc.textFile(..., 8)

 but things like `textFile()` don't actually take a fixed number of
 partitions. Instead, they expect a *minimal* number of partitions. Since
 your file has 21 blocks of data, it creates exactly 21 partitions (which is
 greater than 8, as expected). To set an exact number of partitions,
 use `repartition()` or its full version, `coalesce()` (see example [1])

 [1]:
 http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#coalesce
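
A short Scala sketch of the difference (the SparkContext sc and the S3 path
are assumed):

// minPartitions is only a lower bound: with 21 input blocks you still get at least 21 partitions.
val rdd = sc.textFile("s3n://some-bucket/some-file", 8)
println(rdd.partitions.size)      // likely 21 here, not 8

// To actually go below the block count, coalesce after reading.
val eight = rdd.coalesce(8)       // or rdd.repartition(8) to force a shuffle
println(eight.partitions.size)    // 8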



 On Sat, Nov 22, 2014 at 10:04 PM, Ashish Rangole arang...@gmail.com
 wrote:

 What makes you think that each executor is reading the whole file? If
 that is the case, then the count value returned to the driver will be the
 actual count x NumOfExecutors. Is that the case when compared with the actual
 number of lines in the input file? If the count returned is the same as the
 actual count, then you probably don't have an extra read problem.

 I also see this in your logs, which indicates a read that starts from an
 offset and reads one split size (64MB) worth of data:

 14/11/20 15:39:45 [Executor task launch worker-1 ] INFO HadoopRDD: Input
 split: s3n://mybucket/myfile:335544320+67108864
 On Nov 22, 2014 7:23 AM, Nitay Joffe ni...@actioniq.co wrote:

 Err I meant #1 :)

 - Nitay
 Founder & CTO


 On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe ni...@actioniq.co
 wrote:

 Anyone have any thoughts on this? Trying to understand especially #2
 if it's a legit bug or something I'm doing wrong.

 - Nitay
 Founder & CTO


 On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe ni...@actioniq.co
 wrote:

 I have a simple S3 job to read a text file and do a line count.
 Specifically I'm doing *sc.textFile("s3n://mybucket/myfile").count*. The
 file is about 1.2GB. My setup is a standalone spark cluster with 4 workers,
 each with 2 cores / 16GB RAM. I'm using branch-1.2 code built against
 hadoop 2.4 (though I'm not actually using HDFS, just straight S3 =>
 Spark).

 The whole count is taking on the order of a couple of minutes, which
 seems extremely slow.
 I've been looking into it and so far have noticed two things, hoping
 the community has seen this before and knows what to do...

 1) Every executor seems to make an S3 call to read the *entire file* before
 making another call to read just its split. Here's a paste I've cleaned up
 to show just one task: http://goo.gl/XCfyZA. I've verified this
 happens in every task. It is taking a long time (40-50 seconds), and I don't
 see why it is doing this.
 2) I've tried a few numPartitions parameters. When I make the
 parameter anything below 21 it seems to get ignored. Under the hood
 FileInputFormat is doing something that always ends up with at least 21
 partitions of ~64MB or so. I've also tried 40, 60, and 100 partitions and
 have seen that the performance only gets worse as I increase it beyond 
 21.
 I would like to try 8 just to see, but again I don't see how to force it 
 to
 go below 21.

 Thanks for the help,
 - Nitay
 Founder & CTO








Re: Multiple Spark Context

2014-11-14 Thread Daniil Osipov
It's not recommended to have multiple Spark contexts in one JVM, but you
could launch a separate JVM per context. How resources get allocated is
probably outside the scope of Spark, and more of a task for the cluster
manager.

On Fri, Nov 14, 2014 at 12:58 PM, Charles charles...@cenx.com wrote:

 I need to continuously run multiple calculations concurrently on a cluster.
 They are not sharing RDDs.
 Each of the calculations needs a different number of cores and memory. Also,
 some of them are long running calculations and others are short running
 calculations. They all need to be run on a regular basis and finish in time. The
 short ones cannot wait for the long ones to finish. They cannot run too slowly
 either, since the short ones' running interval is short as well.

 It looks like the sharing inside a SparkContext cannot guarantee that the
 short ones will get enough resources to finish in time if long ones are already
 running. Or am I wrong about that?

 I tried to create a SparkContext for each of the calculations but only the
 first one stays alive. The rest die. I am getting the error below. Is it
 possible to create multiple SparkContexts from inside one application JVM?

 ERROR 2014-11-14 14:59:46 akka.actor.OneForOneStrategy:
 spark.httpBroadcast.uri
 java.util.NoSuchElementException: spark.httpBroadcast.uri
 at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:151)
 at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:151)
 at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
 at scala.collection.AbstractMap.getOrElse(Map.scala:58)
 at org.apache.spark.SparkConf.get(SparkConf.scala:151)
 at

 org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:104)
 at

 org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcast.scala:70)
 at
 org.apache.spark.broadcast.BroadcastManager.initialize(Broadcast.scala:81)
 at org.apache.spark.broadcast.BroadcastManager.<init>(Broadcast.scala:68)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:175)
 at org.apache.spark.executor.Executor.<init>(Executor.scala:110)
 at

 org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:56)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at

 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at

 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at

 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 INFO 2014-11-14 14:59:46
 org.apache.spark.executor.CoarseGrainedExecutorBackend: Connecting to
 driver: akka.tcp://spark@172.32.1.12:51590/user/CoarseGrainedScheduler
 ERROR 2014-11-14 14:59:46
 org.apache.spark.executor.CoarseGrainedExecutorBackend: Slave registration
 failed: Duplicate executor ID: 1



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Mulitple-Spark-Context-tp18975.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




GraphX: Get edges for a vertex

2014-11-13 Thread Daniil Osipov
Hello,

I'm attempting to implement a clustering algorithm on top of the Pregel
implementation in GraphX, however I'm hitting a wall. Ideally, I'd like to
be able to get all edges for a specific vertex, since they factor into the
calculation. My understanding was that the sendMsg function would receive all
relevant edges in participating vertices (all initially, declining as they
converge and stop changing state), and I was planning to keep vertex edges
associated with each vertex and propagate them to other vertices that need
to know about these edges.
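
For reference, a minimal sketch of keeping incident edges in the vertex
attribute before running Pregel (the tiny graph and attribute types are
made up, and it assumes a GraphX version where GraphOps.collectEdges is
available):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Join every vertex's incident edges into its attribute so that
// sendMsg/vprog can inspect them directly during the Pregel run.
val sc = new SparkContext(new SparkConf().setAppName("EdgesPerVertex").setMaster("local[*]"))
val graph = Graph.fromEdges(
  sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 1L, 1.0), Edge(2L, 3L, 1.0))),
  defaultValue = 0.0)

// Array of incident edges per vertex, in both directions.
val incident: VertexRDD[Array[Edge[Double]]] = graph.collectEdges(EdgeDirection.Either)

// New vertex attribute: (original value, incident edges).
val graphWithEdges: Graph[(Double, Array[Edge[Double]]), Double] =
  graph.outerJoinVertices(incident) { (_, attr, edgesOpt) =>
    (attr, edgesOpt.getOrElse(Array.empty[Edge[Double]]))
  }

graphWithEdges.vertices.collect().foreach { case (id, (v, es)) =>
  println(s"vertex $id has ${es.length} incident edges")
}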

What I'm finding is that not all edges get iterated on by sendMsg before
sending messages to vprog. Even if I try to keep track of edges, I don't
account for all of them, leading to incorrect results.

The graph I'm testing on has edges between all nodes, one for each
direction, and I'm using EdgeDirection.Both.

Anyone seen something similar, and have some suggestions?
Dan


Re: Example of Fold

2014-10-31 Thread Daniil Osipov
You should look at how fold is used in scala in general to help. Here is a
blog post that may also give some guidance:
http://blog.madhukaraphatak.com/spark-rdd-fold

The zero value should be your bean, with the 4th parameter set to the
minimum value. Your fold function should compare the 4th param of the
incoming records and choose the larger one.

If you want to group by the 3 parameters prior to folding, you're probably
better off using a reduce function.
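
A minimal sketch of both options, assuming an existing SparkContext named sc
and a made-up bean with four properties:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

// Hypothetical bean: three grouping fields and a fourth value field.
case class Record(a: String, b: String, c: String, value: Double)

def maxByValue(sc: SparkContext): Unit = {
  val rdd = sc.parallelize(Seq(
    Record("x", "y", "z", 1.0),
    Record("x", "y", "z", 5.0),
    Record("p", "q", "r", 2.0)))

  // fold: the zero element carries the minimum possible value for the 4th field.
  val zero = Record("", "", "", Double.MinValue)
  val overallMax = rdd.fold(zero)((l, r) => if (l.value >= r.value) l else r)

  // Grouping by the first three fields, then keeping the max per group.
  val maxPerGroup = rdd
    .map(r => ((r.a, r.b, r.c), r))
    .reduceByKey((l, r) => if (l.value >= r.value) l else r)

  println(overallMax)
  maxPerGroup.collect().foreach(println)
}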

On Fri, Oct 31, 2014 at 12:01 PM, Ron Ayoub ronalday...@live.com wrote:

 I'm want to fold an RDD into a smaller RDD with max elements. I have
 simple bean objects with 4 properties. I want to group by 3 of the
 properties and then select the max of the 4th. So I believe fold is the
 appropriate method for this. My question is, is there a good fold example
 out there. Additionally, what it the zero value used for as the first
 argument? Thanks.



Re: Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?

2014-10-22 Thread Daniil Osipov
You can use the --spark-version argument to spark-ec2 to specify a GIT hash
corresponding to the version you want to check out. If you made changes that
are not in the master repository, you can use --spark-git-repo to specify
the git repository to pull Spark down from, which should contain the specified
commit hash.

On Tue, Oct 21, 2014 at 3:52 PM, sameerf same...@databricks.com wrote:

 Hi,

 Can you post what the error looks like?


 Sameer F.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Usage-of-spark-ec2-how-to-deploy-a-revised-version-of-spark-1-1-0-tp16943p16963.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniil Osipov
How are you launching the cluster, and how are you submitting the job to
it? Can you list any Spark configuration parameters you provide?

On Mon, Oct 20, 2014 at 12:53 PM, Daniel Mahler dmah...@gmail.com wrote:


 I am launching EC2 clusters using the spark-ec2 scripts.
 My understanding is that this configures spark to use the available
 resources.
 I can see that spark will use the available memory on larger instance types.
 However I have never seen spark running at more than 400% (using 100% on 4
 cores)
 on machines with many more cores.
 Am I misunderstanding the docs? Is it just that high end ec2 instances get
 I/O starved when running spark? It would be strange if that consistently
 produced a 400% hard limit though.

 thanks
 Daniel



Re: Multipart uploads to Amazon S3 from Apache Spark

2014-10-13 Thread Daniil Osipov
Not directly related, but FWIW, EMR seems to back away from s3n usage:

"Previously, Amazon EMR used the S3 Native FileSystem with the URI scheme,
s3n. While this still works, we recommend that you use the s3 URI scheme
for the best performance, security, and reliability."

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html


On Mon, Oct 13, 2014 at 1:42 PM, Nick Chammas nicholas.cham...@gmail.com
wrote:

 Cross posting an interesting question on Stack Overflow
 http://stackoverflow.com/questions/26321947/multipart-uploads-to-amazon-s3-from-apache-spark
 .

 Nick


 --
 View this message in context: Multipart uploads to Amazon S3 from Apache
 Spark
 http://apache-spark-user-list.1001560.n3.nabble.com/Multipart-uploads-to-Amazon-S3-from-Apache-Spark-tp16315.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.



Re: S3 Bucket Access

2014-10-13 Thread Daniil Osipov
(Copying the user list)
You should use the spark_ec2 script to configure the cluster. If you use the
trunk version you can use the new --copy-aws-credentials option to configure
the S3 parameters automatically; otherwise either include them in your
SparkConf variable or add them to
/root/spark/ephemeral-hdfs/conf/core-site.xml
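
A minimal sketch of the SparkConf/Hadoop-configuration route (the key names
match the fs.s3n.* properties mentioned below in this thread; the bucket path
and the use of environment variables are assumptions):

import org.apache.spark.{SparkConf, SparkContext}

// Set the S3 credentials on the Hadoop configuration used by Spark,
// as an alternative to editing ephemeral-hdfs/conf/core-site.xml.
val sc = new SparkContext(new SparkConf().setAppName("S3Access"))
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

// Hypothetical bucket/path, using the s3n:// scheme.
val lines = sc.textFile("s3n://some-bucket/some-prefix/")
println(lines.count())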

On Mon, Oct 13, 2014 at 2:56 PM, Ranga sra...@gmail.com wrote:

 The cluster is deployed on EC2 and I am trying to access the S3 files from
 within a spark-shell session.

 On Mon, Oct 13, 2014 at 2:51 PM, Daniil Osipov daniil.osi...@shazam.com
 wrote:

 So is your cluster running on EC2, or locally? If you're running locally,
 you should still be able to access S3 files, you just need to locate the
 core-site.xml and add the parameters as defined in the error.

 On Mon, Oct 13, 2014 at 2:49 PM, Ranga sra...@gmail.com wrote:

 Hi Daniil

 No. I didn't create the spark-cluster using the ec2 scripts. Is that
 something that I need to do? I just downloaded Spark-1.1.0 and Hadoop-2.4.
 However, I am trying to access files on S3 from this cluster.


 - Ranga

 On Mon, Oct 13, 2014 at 2:36 PM, Daniil Osipov daniil.osi...@shazam.com
  wrote:

 Did you add the fs.s3n.aws* configuration parameters in
 /root/spark/ephemeral-hdfs/conf/core-site.xml?

 On Mon, Oct 13, 2014 at 11:03 AM, Ranga sra...@gmail.com wrote:

 Hi

 I am trying to access files/buckets in S3 and encountering a
 permissions issue. The buckets are configured to authenticate using an
 IAMRole provider.
 I have set the KeyId and Secret using environment variables (
 AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still
 unable to access the S3 buckets.

 Before setting the access key and secret the error was: 
 java.lang.IllegalArgumentException:
 AWS Access Key ID and Secret Access Key must be specified as the username
 or password (respectively) of a s3n URL, or by setting the
 fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties
 (respectively).

 After setting the access key and secret, the error is: The AWS
 Access Key Id you provided does not exist in our records.

 The id/secret being set are the right values. This makes me believe
 that something else (token, etc.) needs to be set as well.
 Any help is appreciated.


 - Ranga








Re: Cannot read from s3 using sc.textFile

2014-10-07 Thread Daniil Osipov
Try using s3n:// instead of s3 (for the credential configuration as well).

On Tue, Oct 7, 2014 at 9:51 AM, Sunny Khatri sunny.k...@gmail.com wrote:

 Not sure if it's supposed to work. Can you try newAPIHadoopFile(), passing
 in the required configuration object?

 On Tue, Oct 7, 2014 at 4:20 AM, Tomer Benyamini tomer@gmail.com
 wrote:

 Hello,

 I'm trying to read from s3 using a simple spark java app:

 -

 SparkConf sparkConf = new SparkConf().setAppName("TestApp");
 sparkConf.setMaster("local");
 JavaSparkContext sc = new JavaSparkContext(sparkConf);
 sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XX");
 sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "XX");

 String path = "s3://bucket/test/testdata";
 JavaRDD<String> textFile = sc.textFile(path);
 System.out.println(textFile.count());

 -
 But getting this error:

 org.apache.hadoop.mapred.InvalidInputException: Input path does not
 exist: s3://bucket/test/testdata
 at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
 at
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:175)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1097)
 at org.apache.spark.rdd.RDD.count(RDD.scala:861)
 at
 org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:365)
 at org.apache.spark.api.java.JavaRDD.count(JavaRDD.scala:29)
 

 Looking at the debug log I see that
 org.jets3t.service.impl.rest.httpclient.RestS3Service returned 404
 error trying to locate the file.

 Using a simple java program with
 com.amazonaws.services.s3.AmazonS3Client works just fine.

 Any idea?

 Thanks,
 Tomer

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: compiling spark source code

2014-09-11 Thread Daniil Osipov
In the spark source folder, execute `sbt/sbt assembly`

On Thu, Sep 11, 2014 at 8:27 AM, rapelly kartheek kartheek.m...@gmail.com
wrote:

 Hi,


 Can someone please tell me how to compile the Spark source code so that
 changes to the source take effect? I was trying to ship the jars to all the
 slaves, but in vain.

 -Karthik



Re: Spark on Raspberry Pi?

2014-09-11 Thread Daniil Osipov
Limited memory could also cause you some problems and limit usability. If
you're looking for a local testing environment, vagrant boxes may serve you
much better.

On Thu, Sep 11, 2014 at 6:18 AM, Chen He airb...@gmail.com wrote:




 Pi's bus speed, memory size and access speed, and processing ability are
 limited. The only benefit could be the power consumption.

 On Thu, Sep 11, 2014 at 8:04 AM, Sandeep Singh sand...@techaddict.me
 wrote:

 Has anyone tried using Raspberry Pi for Spark? How efficient is it to use
 around 10 Pis for a local testing environment?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Raspberry-Pi-tp13965.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: PrintWriter error in foreach

2014-09-10 Thread Daniil Osipov
Try providing the full path to the file you want to write, and make sure the
directory exists and is writable by the Spark process.
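
A minimal sketch of that suggestion (the absolute path and the id parameter
are assumptions); note the foreach body runs on the executors, so the
directory must exist on every worker, not on the edge node:

import java.io.{File, PrintWriter}

// Assumes an RDD[String] of ids named rdd, as in the abbreviated code below.
rdd.foreach { id =>
  val dir = new File("/tmp/transoutput")           // hypothetical worker-local directory
  dir.mkdirs()                                     // make sure it exists on this executor
  val outFile = new PrintWriter(new File(dir, "output.%s".format(id)))
  try outFile.println("test") finally outFile.close()
}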

On Wed, Sep 10, 2014 at 3:46 PM, Arun Luthra arun.lut...@gmail.com wrote:

 I have a spark program that worked in local mode, but throws an error in
 yarn-client mode on a cluster. On the edge node in my home directory, I
 have an output directory (called transout) which is ready to receive files.
 The spark job I'm running is supposed to write a few hundred files into
 that directory, once for each iteration of a foreach function. This works
 in local mode, and my only guess as to why this would fail in yarn-client
 mode is that the RDD is distributed across many nodes and the program is
 trying to use the PrintWriter on the datanodes, where the output directory
 doesn't exist. Is this what's happening? Any proposed solution?

 abbreviation of the code:

  import java.io.PrintWriter
  ...
  rdd.foreach {
    val outFile = new PrintWriter("transoutput/output.%s".format(id))
    outFile.println("test")
    outFile.close()
  }

 Error:

 14/09/10 16:57:09 WARN TaskSetManager: Lost TID 1826 (task 0.0:26)
 14/09/10 16:57:09 WARN TaskSetManager: Loss was due to
 java.io.FileNotFoundException
 java.io.FileNotFoundException: transoutput/input.598718 (No such file or
 directory)
 at java.io.FileOutputStream.open(Native Method)
 at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
 at java.io.FileOutputStream.<init>(FileOutputStream.java:84)
 at java.io.PrintWriter.<init>(PrintWriter.java:146)
 at
 com.att.bdcoe.cip.ooh.TransformationLayer$$anonfun$main$3.apply(TransformLayer.scala:98)
 at
 com.att.bdcoe.cip.ooh.TransformationLayer$$anonfun$main$3.apply(TransformLayer.scala:95)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:703)
 at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:703)
 at
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
 at
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)



Re: Crawler and Scraper with different priorities

2014-09-08 Thread Daniil Osipov
Depending on what you want to do with the result of the scraping, Spark may
not be the best framework for your use case. Take a look at a general Akka
application.

On Sun, Sep 7, 2014 at 12:15 AM, Sandeep Singh sand...@techaddict.me
wrote:

 Hi all,

 I am implementing a Crawler and Scraper. It should be able to process
 requests for crawling & scraping within a few seconds of submitting the
 job (around 1mil/sec); for the rest I can take some time (scheduled evenly
 all over the day). What is the best way to implement this?

 Thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Crawler-and-Scraper-with-different-priorities-tp13645.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: spark-ec2 [Errno 110] Connection time out

2014-09-02 Thread Daniil Osipov
Make sure your key pair is configured to access whatever region you're
deploying to - it defaults to us-east-1, but you can provide a custom one
with the --region parameter.


On Sat, Aug 30, 2014 at 12:53 AM, David Matheson david.j.mathe...@gmail.com
 wrote:

 I'm following the latest documentation on configuring a cluster on ec2
 (http://spark.apache.org/docs/latest/ec2-scripts.html).  Running
  ./spark-ec2 -k Blah -i .ssh/Blah.pem -s 2 launch spark-ec2-test
 gets a generic timeout error that's coming from
    File "./spark_ec2.py", line 717, in real_main
  conn = ec2.connect_to_region(opts.region)

 Any suggestions on how to debug the cause of the timeout?

 Note: I replaced the name of my keypair with Blah.

 Thanks,
 David




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-Errno-110-Connection-time-out-tp13171.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




FileNotFoundException (No space left on device) writing to S3

2014-08-27 Thread Daniil Osipov
Hello,

I've been seeing the following errors when trying to save to S3:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 4058 in stage 2.1 failed 4 times, most recent failure:
Lost task 4058.3 in stage 2.1 (TID 12572, ip-10-81-151-40.ec2.internal):
java.io.FileNotFoundException:
/mnt/spark/spark-local-20140827191008-05ae/0c/shuffle_1_7570_5768 (No space left on device)
        java.io.FileOutputStream.open(Native Method)
        java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:107)
        org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:175)
        org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
        org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
        scala.collection.Iterator$class.foreach(Iterator.scala:727)
        scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:745)

DF tells me there is plenty of space left on the worker node:
root@ip-10-81-151-40 ~]$ df -h
FilesystemSize  Used Avail Use% Mounted on
/dev/xvda17.9G  4.6G  3.3G  59% /
tmpfs 7.4G 0  7.4G   0% /dev/shm
/dev/xvdb  37G   11G   25G  30% /mnt
/dev/xvdf  37G  9.5G   26G  27% /mnt2

Any suggestions?
Dan


Re: countByWindow save the count ?

2014-08-25 Thread Daniil Osipov
You could try to use foreachRDD on the result of countByWindow with a
function that performs the save operation.
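
A minimal sketch of that approach (the socket source, batch/window durations,
and the println sink are placeholders for a real Kafka or Redis writer):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("CountByWindowSave").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/checkpoint")                 // window operations require checkpointing

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.countByWindow(Seconds(60), Seconds(10))

// Save each windowed count instead of just printing the stream.
counts.foreachRDD { rdd =>
  rdd.collect().foreach { count =>
    // replace this println with a write to Kafka, Redis, etc.
    println(s"window count: $count")
  }
}

ssc.start()
ssc.awaitTermination()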


On Fri, Aug 22, 2014 at 1:58 AM, Josh J joshjd...@gmail.com wrote:

 Hi,

 Hopefully a simple question. Though is there an example of where to save
 the output of countByWindow ? I would like to save the results to external
 storage (kafka or redis). The examples show only stream.print()

 Thanks,
 Josh



OOM Java heap space error on saveAsTextFile

2014-08-21 Thread Daniil Osipov
Hello,

My job keeps failing at the saveAsTextFile stage (frustrating after a 3 hour
run) with an OOM exception. The log is below. I'm running the job on an
input of ~8Tb of gzipped JSON files, executing on 15 m3.xlarge instances.
Each executor is given 13Gb of memory, and I'm setting two custom preferences
in the job: spark.akka.frameSize: 50 (otherwise it fails due to exceeding the
limit of 10Mb), and spark.storage.memoryFraction: 0.2.
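
For reference, those two settings expressed as a SparkConf sketch (the app
name is taken from the log below; everything else mirrors the description
above):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("RecRateApp")                      // name taken from the saveAsTextFile call site in the log
  .set("spark.akka.frameSize", "50")             // MB; raises the 10 MB default frame size
  .set("spark.storage.memoryFraction", "0.2")    // leave more heap for shuffle/serialization
val sc = new SparkContext(conf)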

Any suggestions?

14/08/21 19:29:26 INFO spark.MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 1 to spark@ip-10-99-160-181.ec2.internal:36962
14/08/21 19:29:31 INFO spark.MapOutputTrackerMaster: Size of output
statuses for shuffle 1 is 17541459 bytes
14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 1 to spark@ip-10-144-221-26.ec2.internal:49973
14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 1 to spark@ip-10-69-31-121.ec2.internal:34569
14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 1 to spark@ip-10-165-70-221.ec2.internal:49193
14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 1 to spark@ip-10-218-181-93.ec2.internal:57648
14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 1 to spark@ip-10-142-187-230.ec2.internal:48115
14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 1 to spark@ip-10-101-178-68.ec2.internal:51931
14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 1 to spark@ip-10-99-165-121.ec2.internal:38153
14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 1 to spark@ip-10-179-187-182.ec2.internal:55645
14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 1 to spark@ip-10-182-231-107.ec2.internal:54088
14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 1 to spark@ip-10-165-79-9.ec2.internal:40112
14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 1 to spark@ip-10-111-169-138.ec2.internal:40394
14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 1 to spark@ip-10-203-161-222.ec2.internal:47447
14/08/21 19:29:31 INFO spark.MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 1 to spark@ip-10-153-141-230.ec2.internal:53906
14/08/21 19:29:32 ERROR actor.ActorSystemImpl: Uncaught fatal error from
thread [spark-akka.actor.default-dispatcher-20] shutting down ActorSystem
[spark]
java.lang.OutOfMemoryError: Java heap space
at
com.google.protobuf_spark.AbstractMessageLite.toByteArray(AbstractMessageLite.java:62)
at
akka.remote.transport.AkkaPduProtobufCodec$.constructPayload(AkkaPduCodec.scala:145)
at
akka.remote.transport.AkkaProtocolHandle.write(AkkaProtocolTransport.scala:156)
at
akka.remote.EndpointWriter$$anonfun$7.applyOrElse(Endpoint.scala:569)
at
akka.remote.EndpointWriter$$anonfun$7.applyOrElse(Endpoint.scala:544)
at
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at akka.actor.FSM$class.processEvent(FSM.scala:595)
at akka.remote.EndpointWriter.processEvent(Endpoint.scala:443)
at akka.actor.FSM$class.akka$actor$FSM$$processMsg(FSM.scala:589)
at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:583)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/08/21 19:29:32 INFO scheduler.DAGScheduler: Failed to run saveAsTextFile
at RecRateApp.scala:88
Exception in thread "main" org.apache.spark.SparkException: Job cancelled
because SparkContext was shut down
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:639)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:638)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at