Re: [ERROR] Insufficient Space

2015-06-19 Thread Vadim Bichutskiy
, or at least if there is, it probably > still requires manual work on the new nodes. This would be the advantage of > EMR over EC2, as we take care of all of that configuration. > > ~ Jonathan > > From: Vadim Bichutskiy > Date: Friday, June 19, 2015 at 5:21 PM >

Re: [ERROR] Insufficient Space

2015-06-19 Thread Vadim Bichutskiy
able to use Spark on EMR rather than on EC2? EMR clusters > allow easy resizing of the cluster, and EMR also now supports Spark 1.3.1 > as of EMR AMI 3.8.0. See http://aws.amazon.com/emr/spark > > ~ Jonathan > > From: Vadim Bichutskiy > Date: Friday, June 19, 2015 at

[ERROR] Insufficient Space

2015-06-19 Thread Vadim Bichutskiy
Hello Spark Experts, I've been running a standalone Spark cluster on EC2 for a few months now, and today I get this error: "IOError: [Errno 28] No space left on device Spark assembly has been built with Hive, including Datanucleus jars on classpath OpenJDK 64-Bit Server VM warning: Insufficient s
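The usual cause of this error is Spark's scratch space (shuffle spill and temporary files, by default under /tmp on the root volume) filling up. A hedged sketch of one common remedy, pointing spark.local.dir at a larger volume — the mount path /mnt/spark-scratch is a made-up example, not something from this thread:

```
# spark-defaults.conf (illustrative; /mnt/spark-scratch is a
# hypothetical mount point for a larger ephemeral or EBS volume)
spark.local.dir    /mnt/spark-scratch
```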

Re: Is anyone using Amazon EC2?

2015-05-23 Thread Vadim Bichutskiy
Yes, we're running Spark on EC2. Will transition to EMR soon. -Vadim On Sat, May 23, 2015 at 2:22 PM, Johan Beisser wrote: > Yes. > > We're looking at bootstrapping in EMR... > > On Sat, May 23, 2015 at 07:21 Joe Wass wrote: > >> I used Spark on EC2 a while ago >> >

Re: textFileStream Question

2015-05-17 Thread Vadim Bichutskiy
Stream.scala#L172> > > Thanks > Best Regards > > On Fri, May 15, 2015 at 2:25 AM, Vadim Bichutskiy < > vadim.bichuts...@gmail.com> wrote: > >> How does textFileStream work behind the scenes? How does Spark Streaming >> know what files are new and need to

textFileStream Question

2015-05-14 Thread Vadim Bichutskiy
How does textFileStream work behind the scenes? How does Spark Streaming know what files are new and need to be processed? Is it based on time stamp, file name? Thanks, Vadim
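The behavior asked about here can be sketched in plain Python. This is an illustration of the idea only, not Spark's implementation: FileInputDStream treats a file as new when its modification time is later than the previous scan and it has not already been processed; the file name itself does not matter. The helper name new_files is made up.

```python
import os
import tempfile

def new_files(directory, last_checked, seen):
    """Sketch (not Spark's actual code): a file counts as 'new' when its
    modification time is later than the last scan and it has not been
    processed before -- file names play no role."""
    fresh = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.getmtime(path) > last_checked and path not in seen:
            fresh.append(path)
            seen.add(path)
    return fresh

# tiny demonstration
d = tempfile.mkdtemp()
open(os.path.join(d, "a.txt"), "w").close()
seen = set()
print(len(new_files(d, 0, seen)))   # the file is new on the first scan
print(len(new_files(d, 0, seen)))   # but not on the second
```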

Re: DStream Union vs. StreamingContext Union

2015-05-14 Thread Vadim Bichutskiy
@TD How do I file a JIRA? On Tue, May 12, 2015 at 2:06 PM, Tathagata Das wrote: > I wonder that may be a bug in the Python API. Please file it as a JIRA > along with sample code to reproduce it and sample output you get. > > On Tue, May 12, 2015 at 10:00 AM, Vadim Bichutskiy <

Re: DStream Union vs. StreamingContext Union

2015-05-12 Thread Vadim Bichutskiy
I can confirm it does work in Java >> >> >> >> *From:* Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] >> *Sent:* Tuesday, May 12, 2015 5:53 PM >> *To:* Evo Eftimov >> *Cc:* Saisai Shao; user@spark.apache.org >> >> *Subject:* Re: DStream

Re: DStream Union vs. StreamingContext Union

2015-05-12 Thread Vadim Bichutskiy
reamRDDs in this way > DstreamRDD1.union(DstreamRDD2).union(DstreamRDD3) etc etc > > > > Ps: the API is not “redundant” it offers several ways for achieving the > same thing as a convenience depending on the situation > > > > *From:* Vadim Bichutskiy [mailto:vadim.b

Re: DStream Union vs. StreamingContext Union

2015-05-12 Thread Vadim Bichutskiy
l case of StreamingContext.union: > > def union(that: DStream[T]): DStream[T] = new UnionDStream[T](Array(this, > that)) > > So there's no difference, if you want to union more than two DStreams, > just use the one in StreamingContext, otherwise, both two APIs are fine. >

DStream Union vs. StreamingContext Union

2015-05-11 Thread Vadim Bichutskiy
Can someone explain to me the difference between DStream union and StreamingContext union? When do you use one vs the other? Thanks, Vadim
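As the thread concludes, the two produce the same result; DStream.union joins exactly two streams, while StreamingContext.union takes a whole collection. A plain-list analogy (no Spark involved; the lists merely stand in for DStreams):

```python
from functools import reduce

# Plain lists standing in for DStreams -- an analogy only, no Spark here.
d1, d2, d3 = [1, 2], [3], [4, 5]

# DStream.union is pairwise, so three streams must be chained:
pairwise = reduce(lambda a, b: a + b, [d1, d2, d3])

# StreamingContext.union takes the whole collection at once:
at_once = [x for s in (d1, d2, d3) for x in s]

assert pairwise == at_once == [1, 2, 3, 4, 5]
```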

Re: Spark + Kinesis

2015-05-09 Thread Vadim Bichutskiy
have this working out of the box, so i'd like to implement anything i > can do to make that happen. > > lemme know. > > btw, Spark 1.4 will have some improvements to the Kinesis Spark Streaming. > > TD and I have been working together on this. > > thanks! > > -chris >

Re: Weird error/exception

2015-04-28 Thread Vadim Bichutskiy
I was having this issue when my batch interval was very big -- like 5 minutes. When my batch interval is smaller, I don't get this exception. Can someone explain to me why this might be happening? Vadim On Tue, Apr 28, 2015 at 4:26 PM, Vadim Bichutskiy < vadim.bichuts...@gmail.com>

Re: Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-28 Thread Vadim Bichutskiy
I was wondering about the same thing. Vadim On Tue, Apr 28, 2015 at 10:19 PM, bit1...@163.com wrote: > Looks to me that the same thing also applies to the SparkContext.textFile > or SparkContext.wholeTextFile, there is no way in RDD to figure out the > file information where the data in RDD

Re: Weird error/exception

2015-04-28 Thread Vadim Bichutskiy
I think I need to modify my code as discussed under "Design Patterns for using foreachRDD" in the docs. On Tue, Apr 28, 2015 at 4:26 PM, Vadim Bichutskiy < vadim.bichuts...@gmail.com> wrote: > I am using Spark Streaming to monitor an S3 bucket. Everything appears to > be
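The pattern referenced here, from the "Design Patterns for using foreachRDD" section of the Spark Streaming guide, amounts to creating one expensive resource (such as a connection) per partition rather than per record. A plain-Python sketch with a made-up FakeConnection class standing in for a real client:

```python
# Illustrative sketch only -- FakeConnection is a stand-in for a real
# connection object, and the list of lists stands in for rdd.foreachPartition.

class FakeConnection:
    opened = 0
    def __init__(self):
        FakeConnection.opened += 1
        self.sent = []
    def send(self, record):
        self.sent.append(record)
    def close(self):
        pass

def send_partition(records):
    conn = FakeConnection()      # one connection per partition...
    for r in records:
        conn.send(r)             # ...reused for every record in it
    conn.close()

partitions = [[1, 2, 3], [4, 5]]
for part in partitions:
    send_partition(part)

assert FakeConnection.opened == 2   # per partition, not per record
```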

Weird error/exception

2015-04-28 Thread Vadim Bichutskiy
I am using Spark Streaming to monitor an S3 bucket. Everything appears to be fine. But every batch interval I get the following: *15/04/28 16:12:36 WARN HttpMethodReleaseInputStream: Attempting to release HttpMethod in finalize() as its response data stream has gone out of scope. This attempt will

Re: Map Question

2015-04-23 Thread Vadim Bichutskiy
share a spark context all will work as expected. > > > http://stackoverflow.com/questions/142545/python-how-to-make-a-cross-module-variable > > > > Sent with Good (www.good.com) > > > > -Original Message- > *From: *Vadim Bichutskiy [vadim.bichuts...@gmai

Re: Map Question

2015-04-23 Thread Vadim Bichutskiy
(x): metadata = broadcastVar.value # NameError: broadcastVar not found -- HOW TO FIX? ... metadata.py def get_metadata(): ... return mylist On Wed, Apr 22, 2015 at 6:47 PM, Tathagata Das wrote: > Can you give full code? especially the myfunc? > > On Wed, Apr 22, 2015 at

Re: Map Question

2015-04-22 Thread Vadim Bichutskiy
s not defined* The myfunc function is in a different module. How do I make it aware of broadcastVar? On Wed, Apr 22, 2015 at 2:13 PM, Vadim Bichutskiy < vadim.bichuts...@gmail.com> wrote: > Great. Will try to modify the code. Always room to optimize! > > On Wed, Apr 22, 201

Re: Map Question

2015-04-22 Thread Vadim Bichutskiy
Can I use broadcast vars in local mode? On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das wrote: > Yep. Not efficient. Pretty bad actually. That's why broadcast variable > were introduced right at the very beginning of Spark. > > > > On Wed, Apr 22, 2015 at 10:

Re: Map Question

2015-04-22 Thread Vadim Bichutskiy
s, and if you update the list > at the driver, you will have to broadcast it again. > > TD > > On Wed, Apr 22, 2015 at 9:28 AM, Vadim Bichutskiy < > vadim.bichuts...@gmail.com> wrote: > >> I am using Spark Streaming with Python. For each RDD, I call a map, i.e., >&

Map Question

2015-04-22 Thread Vadim Bichutskiy
I am using Spark Streaming with Python. For each RDD, I call a map, i.e., myrdd.map(myfunc), myfunc is in a separate Python module. In yet another separate Python module I have a global list, i.e. mylist, that's populated with metadata. I can't get myfunc to see mylist...it's always empty. Alternat
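The resolution this thread converges on: a module-level global populated on the driver is never visible to the workers, so the metadata has to travel with the function, typically as a broadcast variable whose value myfunc receives explicitly. A plain-Python sketch of that shape (no Spark involved; make_myfunc is a made-up name, and with real Spark the closure would capture broadcastVar.value):

```python
# Sketch of the fix discussed in this thread (illustration only):
# hand the metadata to myfunc explicitly instead of relying on a
# module-level global that worker processes never see.

def make_myfunc(metadata):
    # Closure over the (broadcast) value, built once on the driver.
    def myfunc(x):
        return (x, metadata)
    return myfunc

mylist = ["meta1", "meta2"]          # stand-in for get_metadata()
myfunc = make_myfunc(mylist)

# stand-in for myrdd.map(myfunc):
result = [myfunc(x) for x in [1, 2]]
assert result == [(1, ["meta1", "meta2"]), (2, ["meta1", "meta2"])]
```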

Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
> you can find them easily. Or consider somehow sending the batches of > data straight into Redshift? no idea how that is done but I imagine > it's doable. > > On Thu, Apr 16, 2015 at 6:38 PM, Vadim Bichutskiy > wrote: >> Thanks Sean. I want to load each batch into Redshi

Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
ch RDDs streaming in all the time as part of a DStream > > If you want fine / detailed management of the writing to HDFS you can > implement your own HDFS adapter and invoke it in forEachRDD and foreach > > Regards > Evo Eftimov > > From: Vadim Bichutskiy [mailto:vad

Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
;, which are really directories containing > partitions, as is common in Hadoop. You can move them later, or just > read them where they are. > > On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy > wrote: >> I am using Spark Streaming where during each micro-batch I output data

saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
I am using Spark Streaming where during each micro-batch I output data to S3 using saveAsTextFile. Right now each batch of data is put into its own directory containing 2 objects, "_SUCCESS" and "part-0." How do I output each batch into a common directory? Thanks, Vadim
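Per the replies, saveAsTextFile always writes a directory per call, so the practical approach is one common root with a per-batch, time-stamped prefix underneath. A small helper sketching that layout (illustrative only; the bucket name and helper name are made up):

```python
import datetime

def batch_output_path(root, batch_time):
    """Illustrative helper: build a per-batch prefix under one common
    root, so batches can be found and moved easily later."""
    stamp = batch_time.strftime("%Y%m%d-%H%M%S")
    return "{}/batch-{}".format(root, stamp)

path = batch_output_path("s3://my-bucket/output",
                         datetime.datetime(2015, 4, 16, 18, 30, 0))
assert path == "s3://my-bucket/output/batch-20150416-183000"
```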

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-14 Thread Vadim Bichutskiy
.2.0.jar --class com.xxx.DataConsumer >>> target/scala-2.10/xxx-assembly-0.1-SNAPSHOT.jar >>> >>> I still end up with the following error... >>> >>> Exception in thread "main" java.lang.NoClassDefFoundError: >>> org/joda/time/format

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-13 Thread Vadim Bichutskiy
g a spark-submit job via uber jar). Feel free to add me to > gmail chat and maybe we can help each other. > >> On Mon, Apr 13, 2015 at 6:46 PM, Vadim Bichutskiy >> wrote: >> I don't believe the Kinesis asl should be provided. I used mergeStrategy >> succe

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-13 Thread Vadim Bichutskiy
I don't believe the Kinesis asl should be provided. I used mergeStrategy successfully to produce an "uber jar." Fyi, I've been having trouble consuming data out of Kinesis with Spark with no success :( Would be curious to know if you got it working. Vadim > On Apr 13, 2015, at 9:36 PM, Mike T

Re: Empty RDD?

2015-04-08 Thread Vadim Bichutskiy
Thanks TD! > On Apr 8, 2015, at 9:36 PM, Tathagata Das wrote: > > Aah yes. The jsonRDD method needs to walk through the whole RDD to understand > the schema, and does not work if there is not data in it. Making sure there > is no data in it using take(1) should work. > > TD --

Re: Spark Streaming and SQL

2015-04-08 Thread Vadim Bichutskiy
Hi all, I figured it out! The DataFrames and SQL example in the Spark Streaming docs was useful. Best, Vadim On Wed, Apr 8, 2015 at 2:38 PM, Vadim Bichutskiy wrote: > Hi all, > > I am using Spark Streaming to monitor an S3 bucket for objects that > contain JSON. I want > to i

Empty RDD?

2015-04-08 Thread Vadim Bichutskiy
When I call *transform* or *foreachRDD *on* DStream*, I keep getting an error that I have an empty RDD, which makes sense since my batch interval may be smaller than the rate at which new data are coming in. How to guard against it? Thanks, Vadim
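The guard suggested in the related "Empty RDD?" thread is a cheap take(1) check before processing each batch. A plain-Python sketch with a stand-in RDD class (FakeRDD and process_if_nonempty are made-up names):

```python
# Sketch of the take(1) guard from this thread -- FakeRDD is a tiny
# stand-in so the pattern can be shown without Spark.

class FakeRDD:
    def __init__(self, data):
        self.data = data
    def take(self, n):
        return self.data[:n]

def process_if_nonempty(rdd, handler):
    if rdd.take(1):          # cheap emptiness check before doing work
        handler(rdd)
        return True
    return False

handled = []
assert process_if_nonempty(FakeRDD([1, 2]), handled.append) is True
assert process_if_nonempty(FakeRDD([]), handled.append) is False
assert len(handled) == 1
```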

Spark Streaming and SQL

2015-04-08 Thread Vadim Bichutskiy
Hi all, I am using Spark Streaming to monitor an S3 bucket for objects that contain JSON. I want to import that JSON into Spark SQL DataFrame. Here's my current code: *from pyspark import SparkContext, SparkConf* *from pyspark.streaming import StreamingContext* *import json* *from pyspark.sql im
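The per-batch step being built here — turning each micro-batch of JSON lines into rows — can be sketched without Spark; in Spark 1.3 the real conversion would be sqlContext.jsonRDD inside foreachRDD. Illustrative only; skipping malformed records is an assumption, not something stated in the thread:

```python
import json

def parse_json_batch(lines):
    """Illustrative stand-in for the per-batch JSON-to-rows step;
    malformed records are skipped rather than failing the batch."""
    rows = []
    for line in lines:
        try:
            rows.append(json.loads(line))
        except ValueError:
            pass  # skip records that are not valid JSON
    return rows

batch = ['{"id": 1}', 'not json', '{"id": 2}']
assert parse_json_batch(batch) == [{"id": 1}, {"id": 2}]
```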

Re: Spark + Kinesis

2015-04-07 Thread Vadim Bichutskiy
Hey y'all, While I haven't been able to get Spark + Kinesis integration working, I pivoted to plan B: I now push data to S3 where I set up a DStream to monitor an S3 bucket with textFileStream, and that works great. I <3 Spark! Best, Vadim On Mon, Apr 6, 2015 at 12:23 PM, Vad

Re: Spark + Kinesis

2015-04-06 Thread Vadim Bichutskiy
Hi all, I am wondering, has anyone on this list been able to successfully implement Spark on top of Kinesis? Best, Vadim On Sun, Apr 5, 2015 at 1:50 PM, Vadim Bichutskiy wrote: > Hi all, > > Below is the output that I am getting. My Kinesis stream has 1 shard, and > my S

Re: Spark + Kinesis

2015-04-05 Thread Vadim Bichutskiy
409 ms *** 15/04/05 17:14:50 INFO scheduler.ReceivedBlockTracker: Deleting batches ArrayBuffer(142825407 ms) On Sat, Apr 4, 2015 at 3:13 PM, Vadim Bichutskiy wrote: > Hi all, > > More good news! I was able to utilize mergeStrategy to assembly my Kinesis > consumer into an &quo

Re: Spark + Kinesis

2015-04-04 Thread Vadim Bichutskiy
* Union all the streams */ val unionStreams = ssc.union(kinesisStreams).map(byteArray => new String(byteArray)) unionStreams.print() ssc.start() ssc.awaitTermination() }}* On Fri, Apr 3, 2015 at 3:48 PM, Tathagata Das wrote: > Just remove "provided" for spa

Re: Spark + Kinesis

2015-04-03 Thread Vadim Bichutskiy
-streaming-kinesis-asl" > % "1.3.0" > > On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy < > vadim.bichuts...@gmail.com> wrote: > >> Thanks. So how do I fix it? >> >> On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan >> wrote: &g

Re: Spark + Kinesis

2015-04-03 Thread Vadim Bichutskiy
s were not included in the assembly > (but yes, they should be). > > > ~ Jonathan Kelly > > From: Vadim Bichutskiy > Date: Friday, April 3, 2015 at 12:26 PM > To: Jonathan Kelly > Cc: "user@spark.apache.org" > Subject: Re: Spark + Kinesis > > Hi

Re: Spark + Kinesis

2015-04-03 Thread Vadim Bichutskiy
Dependencies += "org.apache.spark" %% > "spark-streaming-kinesis-asl" % "1.3.0" > > I think that may get you a little closer, though I think you're probably > going to run into the same problems I ran into in this thread: > https://www.mail-archive.

Re: Spark + Kinesis

2015-04-02 Thread Vadim Bichutskiy
that may get you a little closer, though I think you're probably > going to run into the same problems I ran into in this thread: > https://www.mail-archive.com/user@spark.apache.org/msg23891.html I never > really got an answer for that, and I temporarily moved on to other things

Re: How to learn Spark ?

2015-04-02 Thread Vadim Bichutskiy
http://twitter.com/deanwampler > http://polyglotprogramming.com > > On Thu, Apr 2, 2015 at 8:33 AM, Vadim Bichutskiy < > vadim.bichuts...@gmail.com> wro

Spark + Kinesis

2015-04-02 Thread Vadim Bichutskiy
Hi all, I am trying to write an Amazon Kinesis consumer Scala app that processes data in the Kinesis stream. Is this the correct way to specify *build.sbt*: --- *import AssemblyKeys._* *name := "Kinesis Consumer"* *version := "1.0" organization := "com.myconsumer" scalaVersion := "2.11.5"
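Pulling together the advice later in this thread, a hedged build.sbt sketch (not a verified build): per TD, the Kinesis ASL dependency should not be marked "provided", and scalaVersion must match a Scala version for which the %%-resolved artifacts were actually published — the 2.10.4 below is an assumption to check against Maven Central.

```sbt
// Illustrative sketch assembled from this thread, not a verified build.
import AssemblyKeys._

name := "Kinesis Consumer"

version := "1.0"

organization := "com.myconsumer"

scalaVersion := "2.10.4"  // assumption: pick a version with published artifacts

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"  // not "provided"
)

assemblySettings
```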

Re: How to learn Spark ?

2015-04-02 Thread Vadim Bichutskiy
You can start with http://spark.apache.org/docs/1.3.0/index.html Also get the Learning Spark book http://amzn.to/1NDFI5x. It's great. Enjoy! Vadim On Thu, Apr 2, 2015 at 4:19 AM, Star Guo wrote: > Hi, all > > > > I am new to here. Could you give me some suggestion to learn Spark ? > Thanks.

Spark on EC2

2015-04-01 Thread Vadim Bichutskiy
Hi all, I just tried launching a Spark cluster on EC2 as described in http://spark.apache.org/docs/1.3.0/ec2-scripts.html I got the following response: *"PendingVerification: Your account is currently being verified. Verification normally takes less than 2 hours. Until your account is verified,