Re: Spark + Kinesis

2015-04-03 Thread Daniil Osipov
Assembly settings have an option to exclude jars. You need something similar to: `assemblyExcludedJars in assembly := (fullClasspath in assembly) map { cp => val excludes = Set("minlog-1.2.jar"); cp filter { jar => excludes(jar.data.getName) } }` in your build file (may need to be
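Laid out over multiple lines, the snippet above is a build.sbt fragment (sbt-assembly key names as in the thread; "minlog-1.2.jar" is the example artifact from the thread, so substitute whatever jar you need to drop):

```scala
// build.sbt sketch: filter named jars out of the assembly classpath.
assemblyExcludedJars in assembly := (fullClasspath in assembly) map { cp =>
  val excludes = Set("minlog-1.2.jar") // jar file names to exclude
  cp filter { jar => excludes(jar.data.getName) }
}
```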

Re: Spark-EC2 Security Group Error

2015-04-01 Thread Daniil Osipov
Appears to be a problem with boto. Make sure you have boto 2.34 on your system. On Wed, Apr 1, 2015 at 11:19 AM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Hi all – I’m trying to bring up a spark ec2 cluster with the script below and see the following error. Can anyone please advise as

Re: Spark on EC2

2015-04-01 Thread Daniil Osipov
You're probably requesting more instances than allowed by your account, so the error gets generated for the extra instances. Try launching a smaller cluster. On Wed, Apr 1, 2015 at 12:41 PM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote: Hi all, I just tried launching a Spark cluster on

Re: GraphX pregel: getting the current iteration number

2015-02-03 Thread Daniil Osipov
I don't think it's possible to access. What I've done before is to send the current or next iteration index with the message, where the message is a case class. HTH Dan On Tue, Feb 3, 2015 at 10:20 AM, Matthew Cornell corn...@cs.umass.edu wrote: Hi Folks, I'm new to GraphX and Scala and my
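A minimal sketch of the case-class trick described above: GraphX's Pregel vertex program is not told which superstep it is running in, so the message type itself can carry the counter. The names `IterMsg` and `mergeMsgs` are hypothetical, and the payload type is just an example:

```scala
// Hypothetical message type: carries the iteration index alongside the payload.
case class IterMsg(iteration: Int, payload: Double)

// mergeMsg-style combiner: sum the payloads; both messages were produced in
// the same superstep, so their iteration indices agree.
def mergeMsgs(a: IterMsg, b: IterMsg): IterMsg =
  IterMsg(a.iteration, a.payload + b.payload)

// In sendMsg you would emit IterMsg(msg.iteration + 1, ...), so the receiving
// vertex program can branch on the iteration number.
val merged = mergeMsgs(IterMsg(2, 1.0), IterMsg(2, 2.5))
```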

Re: Spark S3 Performance

2014-11-24 Thread Daniil Osipov
Can you verify that it's reading the entire file on each worker using network monitoring stats? If it does, that would be a bug in my opinion. On Mon, Nov 24, 2014 at 2:06 PM, Nitay Joffe ni...@actioniq.co wrote: Andrei, Ashish, To be clear, I don't think it's *counting* the entire file. It

Re: Mulitple Spark Context

2014-11-14 Thread Daniil Osipov
It's not recommended to have multiple Spark contexts in one JVM, but you could launch a separate JVM per context. How resources get allocated is probably outside the scope of Spark, and more of a task for the cluster manager. On Fri, Nov 14, 2014 at 12:58 PM, Charles charles...@cenx.com wrote: I

GraphX: Get edges for a vertex

2014-11-13 Thread Daniil Osipov
Hello, I'm attempting to implement a clustering algorithm on top of the Pregel implementation in GraphX; however, I'm hitting a wall. Ideally, I'd like to be able to get all edges for a specific vertex, since they factor into the calculation. My understanding was that the sendMsg function would receive

Re: Example of Fold

2014-10-31 Thread Daniil Osipov
You should look at how fold is used in scala in general to help. Here is a blog post that may also give some guidance: http://blog.madhukaraphatak.com/spark-rdd-fold The zero value should be your bean, with the 4th parameter set to the minimum value. Your fold function should compare the 4th
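A sketch of the zero-value advice above in plain Scala (the `Record` bean and field names are hypothetical stand-ins for the poster's class, and the 4th field is taken to be a `Double`):

```scala
// Hypothetical 4-field bean; we fold to find the record with the largest 4th field.
case class Record(a: String, b: String, c: String, value: Double)

// Zero value: a record whose 4th field is the smallest possible Double,
// so any real record wins the comparison.
val zero = Record("", "", "", Double.MinValue)

// Combining function: keep whichever record has the larger 4th field.
def pickMax(acc: Record, r: Record): Record =
  if (r.value > acc.value) r else acc

val records = List(Record("a", "b", "c", 3.0), Record("d", "e", "f", 7.5))
val best = records.fold(zero)(pickMax)
```

On an RDD the call is the same shape, `rdd.fold(zero)(pickMax)`, which is why the zero value must be neutral under the combining function.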

Re: Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?

2014-10-22 Thread Daniil Osipov
You can use the --spark-version argument to spark-ec2 to specify a Git hash corresponding to the version you want to check out. If you made changes that are not in the master repository, you can use --spark-git-repo to specify the git repository to pull down spark from, which contains the specified
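A command sketch combining the two flags described above (the repository URL, key names, and commit hash are placeholders):

```shell
# Launch a cluster built from a specific commit of your own Spark fork.
./spark-ec2 --key-pair=mykey --identity-file=mykey.pem \
  --spark-git-repo=https://github.com/yourname/spark \
  --spark-version=<commit-hash> \
  launch my-cluster
```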

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniil Osipov
How are you launching the cluster, and how are you submitting the job to it? Can you list any Spark configuration parameters you provide? On Mon, Oct 20, 2014 at 12:53 PM, Daniel Mahler dmah...@gmail.com wrote: I am launching EC2 clusters using the spark-ec2 scripts. My understanding is that

Re: Multipart uploads to Amazon S3 from Apache Spark

2014-10-13 Thread Daniil Osipov
Not directly related, but FWIW, EMR seems to back away from s3n usage: "Previously, Amazon EMR used the S3 Native FileSystem with the URI scheme, s3n. While this still works, we recommend that you use the s3 URI scheme for the best performance, security, and reliability."

Re: S3 Bucket Access

2014-10-13 Thread Daniil Osipov
-hdfs/conf/core-site.xml On Mon, Oct 13, 2014 at 2:56 PM, Ranga sra...@gmail.com wrote: The cluster is deployed on EC2 and I am trying to access the S3 files from within a spark-shell session. On Mon, Oct 13, 2014 at 2:51 PM, Daniil Osipov daniil.osi...@shazam.com wrote: So is your cluster

Re: Cannot read from s3 using sc.textFile

2014-10-07 Thread Daniil Osipov
Try using s3n:// instead of s3:// (in the credential configuration keys as well). On Tue, Oct 7, 2014 at 9:51 AM, Sunny Khatri sunny.k...@gmail.com wrote: Not sure if it's supposed to work. Can you try newAPIHadoopFile() passing in the required configuration object. On Tue, Oct 7, 2014 at 4:20 AM,
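A configuration sketch of the suggestion above, assuming an existing SparkContext `sc`; the property names are the standard Hadoop NativeS3FileSystem credential keys, while the bucket and path are placeholders:

```scala
// s3n:// URIs are served by Hadoop's NativeS3FileSystem, which reads these keys.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

// Note the s3n:// scheme here must match the scheme in the credential keys.
val lines = sc.textFile("s3n://my-bucket/path/to/data.txt")
```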

Re: compiling spark source code

2014-09-11 Thread Daniil Osipov
In the spark source folder, execute `sbt/sbt assembly` On Thu, Sep 11, 2014 at 8:27 AM, rapelly kartheek kartheek.m...@gmail.com wrote: HI, Can someone please tell me how to compile the spark source code to effect the changes in the source code. I was trying to ship the jars to all the

Re: Spark on Raspberry Pi?

2014-09-11 Thread Daniil Osipov
Limited memory could also cause you some problems and limit usability. If you're looking for a local testing environment, vagrant boxes may serve you much better. On Thu, Sep 11, 2014 at 6:18 AM, Chen He airb...@gmail.com wrote: Pi's bus speed, memory size and access speed, and processing

Re: PrintWriter error in foreach

2014-09-10 Thread Daniil Osipov
Try providing full path to the file you want to write, and make sure the directory exists and is writable by the Spark process. On Wed, Sep 10, 2014 at 3:46 PM, Arun Luthra arun.lut...@gmail.com wrote: I have a spark program that worked in local mode, but throws an error in yarn-client mode on

Re: Crawler and Scraper with different priorities

2014-09-08 Thread Daniil Osipov
Depending on what you want to do with the result of the scraping, Spark may not be the best framework for your use case. Take a look at building a plain Akka application instead. On Sun, Sep 7, 2014 at 12:15 AM, Sandeep Singh sand...@techaddict.me wrote: Hi all, I am Implementing a Crawler, Scraper. The It

Re: spark-ec2 [Errno 110] Connection time out

2014-09-02 Thread Daniil Osipov
Make sure your key pair is configured to access whatever region you're deploying to - it defaults to us-east-1, but you can provide a custom one with parameter --region. On Sat, Aug 30, 2014 at 12:53 AM, David Matheson david.j.mathe...@gmail.com wrote: I'm following the latest documentation
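A command sketch for the --region parameter mentioned above (key names and region are example placeholders):

```shell
# Deploy outside the default us-east-1; the key pair must exist in that region.
./spark-ec2 --key-pair=mykey --identity-file=mykey.pem \
  --region=eu-west-1 launch my-cluster
```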

FileNotFoundException (No space left on device) writing to S3

2014-08-27 Thread Daniil Osipov
Hello, I've been seeing the following errors when trying to save to S3: Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 4058 in stage 2.1 failed 4 times, most recent failure: Lost task 4058.3 in stage 2.1 (TID 12572,

Re: countByWindow save the count ?

2014-08-25 Thread Daniil Osipov
You could try to use foreachRDD on the result of countByWindow with a function that performs the save operation. On Fri, Aug 22, 2014 at 1:58 AM, Josh J joshjd...@gmail.com wrote: Hi, Hopefully a simple question. Though is there an example of where to save the output of countByWindow ? I
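A sketch of the foreachRDD suggestion above, assuming an existing DStream `events` from a StreamingContext; the window and slide durations and the output path are placeholders:

```scala
import org.apache.spark.streaming.Seconds

// Count elements over a 30s window, sliding every 10s.
val counts = events.countByWindow(Seconds(30), Seconds(10))

// foreachRDD is the output operation; write one directory per batch,
// keyed by the batch time so successive windows don't collide.
counts.foreachRDD { (rdd, time) =>
  rdd.saveAsTextFile(s"hdfs:///counts/batch-${time.milliseconds}")
}
```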

OOM Java heap space error on saveAsTextFile

2014-08-21 Thread Daniil Osipov
Hello, My job keeps failing on saveAsTextFile stage (frustrating after a 3 hour run) with an OOM exception. The log is below. I'm running the job on an input of ~8Tb gzipped JSON files, executing on 15 m3.xlarge instances. Executor is given 13Gb memory, and I'm setting two custom preferences in