, or at least if there is, it probably
> still requires manual work on the new nodes. This would be the advantage of
> EMR over EC2, as we take care of all of that configuration.
>
> ~ Jonathan
>
> From: Vadim Bichutskiy
> Date: Friday, June 19, 2015 at 5:21 PM
>
able to use Spark on EMR rather than on EC2? EMR clusters
> allow easy resizing of the cluster, and EMR also now supports Spark 1.3.1
> as of EMR AMI 3.8.0. See http://aws.amazon.com/emr/spark
>
> ~ Jonathan
>
> From: Vadim Bichutskiy
> Date: Friday, June 19, 2015 at
Hello Spark Experts,
I've been running a standalone Spark cluster on EC2 for a few months now,
and today I get this error:
"IOError: [Errno 28] No space left on device
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
OpenJDK 64-Bit Server VM warning: Insufficient s
Yes, we're running Spark on EC2. Will transition to EMR soon. -Vadim
On Sat, May 23, 2015 at 2:22 PM, Johan Beisser wrote:
> Yes.
>
> We're looking at bootstrapping in EMR...
>
> On Sat, May 23, 2015 at 07:21 Joe Wass wrote:
>
>> I used Spark on EC2 a while ago
>>
>
Stream.scala#L172>
>
> Thanks
> Best Regards
>
> On Fri, May 15, 2015 at 2:25 AM, Vadim Bichutskiy <
> vadim.bichuts...@gmail.com> wrote:
>
>> How does textFileStream work behind the scenes? How does Spark Streaming
>> know what files are new and need to
How does textFileStream work behind the scenes? How does Spark Streaming
know what files are new and need to be processed? Is it based on time
stamp, file name?
Thanks,
Vadim
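A minimal PySpark sketch of this kind of setup (bucket path, app name and the 30-second batch interval are placeholders). As far as I understand, the FileInputDStream behind textFileStream picks files up by modification timestamp: only files that appear under the monitored directory after the stream starts, within the current batch window, get processed, so objects should be written or moved in atomically rather than appended to.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="TextFileStreamExample")
ssc = StreamingContext(sc, 30)  # 30-second batches

# Only files that show up under this prefix after the stream starts are read.
lines = ssc.textFileStream("s3n://my-bucket/incoming/")
lines.pprint()

ssc.start()
ssc.awaitTermination()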
@TD How do I file a JIRA?
On Tue, May 12, 2015 at 2:06 PM, Tathagata Das
wrote:
> I wonder if that may be a bug in the Python API. Please file it as a JIRA
> along with sample code to reproduce it and sample output you get.
>
> On Tue, May 12, 2015 at 10:00 AM, Vadim Bichutskiy <
I can confirm it does work in Java
>>
>>
>>
>> *From:* Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com]
>> *Sent:* Tuesday, May 12, 2015 5:53 PM
>> *To:* Evo Eftimov
>> *Cc:* Saisai Shao; user@spark.apache.org
>>
>> *Subject:* Re: DStream
reamRDDs in this way
> DstreamRDD1.union(DstreamRDD2).union(DstreamRDD3) etc etc
>
>
>
> Ps: the API is not “redundant”; it offers several ways of achieving the
> same thing as a convenience, depending on the situation
>
>
>
> *From:* Vadim Bichutskiy [mailto:vadim.b
l case of StreamingContext.union:
>
> def union(that: DStream[T]): DStream[T] = new UnionDStream[T](Array(this,
> that))
>
> So there's no difference; if you want to union more than two DStreams,
> just use the one in StreamingContext; otherwise, both APIs are fine.
>
Can someone explain to me the difference between DStream union and
StreamingContext union?
When do you use one vs the other?
Thanks,
Vadim
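A short PySpark sketch of the two forms side by side, using throwaway queue streams purely for illustration: DStream.union joins exactly one other stream per call, while StreamingContext.union accepts any number at once.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="UnionExample")
ssc = StreamingContext(sc, 10)

# Three dummy input streams built from queues of RDDs, just for illustration.
s1 = ssc.queueStream([sc.parallelize([1, 2])])
s2 = ssc.queueStream([sc.parallelize([3, 4])])
s3 = ssc.queueStream([sc.parallelize([5, 6])])

# DStream.union is pairwise, so chain it for more than two streams...
chained = s1.union(s2).union(s3)

# ...or hand all of them to StreamingContext.union in one call.
flat = ssc.union(s1, s2, s3)

chained.pprint()
flat.pprint()

ssc.start()
ssc.awaitTermination()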
have this working out of the box, so i'd like to implement anything i
> can do to make that happen.
>
> lemme know.
>
> btw, Spark 1.4 will have some improvements to the Kinesis Spark Streaming.
>
> TD and I have been working together on this.
>
> thanks!
>
> -chris
>
I was having this issue when my batch interval was very big -- like 5
minutes. When my batch interval is
smaller, I don't get this exception. Can someone explain to me why this
might be happening?
Vadim
On Tue, Apr 28, 2015 at 4:26 PM, Vadim Bichutskiy <
vadim.bichuts...@gmail.com>
I was wondering about the same thing.
Vadim
On Tue, Apr 28, 2015 at 10:19 PM, bit1...@163.com wrote:
> Looks to me that the same thing also applies to the SparkContext.textFile
> or SparkContext.wholeTextFile, there is no way in RDD to figure out the
> file information where the data in RDD
I think I need to modify my code as discussed under "Design Patterns for
using foreachRDD" in the docs.
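Roughly the connection-per-partition shape that docs section describes, sketched in PySpark; dstream stands for the input DStream and create_connection is a hypothetical stand-in for whatever client (S3, Redshift, etc.) is being opened. The point is to open it once per partition on the workers rather than once per record or on the driver.

def send_partition(records):
    conn = create_connection()   # hypothetical, non-serializable client, created on the worker
    for record in records:
        conn.send(record)        # hypothetical send call
    conn.close()

def process(rdd):
    rdd.foreachPartition(send_partition)

dstream.foreachRDD(process)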
On Tue, Apr 28, 2015 at 4:26 PM, Vadim Bichutskiy <
vadim.bichuts...@gmail.com> wrote:
> I am using Spark Streaming to monitor an S3 bucket. Everything appears to
> be
I am using Spark Streaming to monitor an S3 bucket. Everything appears to
be fine. But every batch interval I get the following:
15/04/28 16:12:36 WARN HttpMethodReleaseInputStream: Attempting to release
HttpMethod in finalize() as its response data stream has gone out of scope.
This attempt will
share a spark context all will work as expected.
>
>
> http://stackoverflow.com/questions/142545/python-how-to-make-a-cross-module-variable
>
>
>
> Sent with Good (www.good.com)
>
>
>
> -Original Message-
> *From: *Vadim Bichutskiy [vadim.bichuts...@gmai
def myfunc(x):
    metadata = broadcastVar.value  # NameError: broadcastVar not found -- HOW TO FIX?
    ...

metadata.py

def get_metadata():
    ...
    return mylist
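One way out that should work, sketched under the assumption that changing myfunc's signature is acceptable: pass the broadcast handle into myfunc explicitly so it travels with the closure, instead of relying on a module-level global. The module and function names below just mirror the hypothetical layout above.

# driver script
from pyspark import SparkContext
import metadata    # module with get_metadata()
import mymodule    # module with myfunc

sc = SparkContext(appName="BroadcastAcrossModules")
broadcastVar = sc.broadcast(metadata.get_metadata())

rdd = sc.parallelize(range(10))
# Hand the broadcast handle to myfunc instead of expecting a global.
result = rdd.map(lambda x: mymodule.myfunc(x, broadcastVar)).collect()

# mymodule.py
# def myfunc(x, bvar):
#     meta = bvar.value   # no NameError: the handle is captured in the closure
#     ...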
On Wed, Apr 22, 2015 at 6:47 PM, Tathagata Das wrote:
> Can you give full code? especially the myfunc?
>
> On Wed, Apr 22, 2015 at
s not defined*
The myfunc function is in a different module. How do I make it aware of
broadcastVar?
On Wed, Apr 22, 2015 at 2:13 PM, Vadim Bichutskiy <
vadim.bichuts...@gmail.com> wrote:
> Great. Will try to modify the code. Always room to optimize!
>
> On Wed, Apr 22, 201
Can I use broadcast vars in local mode?
On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das wrote:
> Yep. Not efficient. Pretty bad actually. That's why broadcast variables
> were introduced right at the very beginning of Spark.
>
>
>
> On Wed, Apr 22, 2015 at 10:
s, and if you update the list
> at the driver, you will have to broadcast it again.
>
> TD
>
> On Wed, Apr 22, 2015 at 9:28 AM, Vadim Bichutskiy <
> vadim.bichuts...@gmail.com> wrote:
>
>> I am using Spark Streaming with Python. For each RDD, I call a map, i.e.,
>&
I am using Spark Streaming with Python. For each RDD, I call a map, i.e.,
myrdd.map(myfunc); myfunc is in a separate Python module. In yet another
separate Python module I have a global list, i.e. mylist, that's populated
with metadata. I can't get myfunc to see mylist...it's always empty.
Alternat
> you can find them easily. Or consider somehow sending the batches of
> data straight into Redshift? no idea how that is done but I imagine
> it's doable.
>
> On Thu, Apr 16, 2015 at 6:38 PM, Vadim Bichutskiy
> wrote:
>> Thanks Sean. I want to load each batch into Redshi
ch RDDs streaming in all the time as part of a DStream
>
> If you want fine / detailed management of the writing to HDFS you can
> implement your own HDFS adapter and invoke it in forEachRDD and foreach
>
> Regards
> Evo Eftimov
>
> From: Vadim Bichutskiy [mailto:vad
", which are really directories containing
> partitions, as is common in Hadoop. You can move them later, or just
> read them where they are.
>
> On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy
> wrote:
>> I am using Spark Streaming where during each micro-batch I output data
I am using Spark Streaming where during each micro-batch I output data to
S3 using
saveAsTextFile. Right now each batch of data is put into its own directory
containing
2 objects, "_SUCCESS" and "part-0."
How do I output each batch into a common directory?
Thanks,
Vadim
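One way to get that, sketched with placeholder bucket/prefix names and assuming dstream is the stream being saved: call saveAsTextFile yourself inside foreachRDD and key each batch by a timestamp under a single shared prefix, since writing every batch to exactly the same path would fail as soon as the directory already exists.

import time

def save_batch(rdd):
    if rdd.take(1):   # skip empty micro-batches
        rdd.saveAsTextFile("s3n://my-bucket/output/batch-%d" % int(time.time()))

dstream.foreachRDD(save_batch)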
.2.0.jar --class com.xxx.DataConsumer
>>> target/scala-2.10/xxx-assembly-0.1-SNAPSHOT.jar
>>>
>>> I still end up with the following error...
>>>
>>> Exception in thread "main" java.lang.NoClassDefFoundError:
>>> org/joda/time/format
g a spark-submit job via uber jar). Feel free to add me to
> gmail chat and maybe we can help each other.
>
>> On Mon, Apr 13, 2015 at 6:46 PM, Vadim Bichutskiy
>> wrote:
>> I don't believe the Kinesis asl should be provided. I used mergeStrategy
>> succe
I don't believe the Kinesis asl should be provided. I used mergeStrategy
successfully to produce an "uber jar."
Fyi, I've been having trouble consuming data out of Kinesis with Spark with no
success :(
Would be curious to know if you got it working.
Vadim
> On Apr 13, 2015, at 9:36 PM, Mike T
Thanks TD!
> On Apr 8, 2015, at 9:36 PM, Tathagata Das wrote:
>
> Aah yes. The jsonRDD method needs to walk through the whole RDD to understand
> the schema, and does not work if there is no data in it. Making sure there
> is data in it using take(1) should work.
>
> TD
--
Hi all,
I figured it out! The DataFrames and SQL example in Spark Streaming docs
were useful.
Best,
Vadim
On Wed, Apr 8, 2015 at 2:38 PM, Vadim Bichutskiy wrote:
> Hi all,
>
> I am using Spark Streaming to monitor an S3 bucket for objects that
> contain JSON. I want
> to i
When I call transform or foreachRDD on a DStream, I keep getting an error
that I have an empty RDD, which makes sense since my batch interval may be
shorter than the interval at which new data arrive. How do I guard
against it?
Thanks,
Vadim
Hi all,
I am using Spark Streaming to monitor an S3 bucket for objects that contain
JSON. I want
to import that JSON into Spark SQL DataFrame.
Here's my current code:
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
import json
from pyspark.sql im
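A fuller sketch along those lines (paths, app name and the 60-second batch interval are assumptions), with a take(1) guard so jsonRDD never sees an empty batch:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="S3JsonMonitor")
ssc = StreamingContext(sc, 60)
sqlContext = SQLContext(sc)

# Each line of each new object under this prefix is expected to be one JSON record.
lines = ssc.textFileStream("s3n://my-bucket/json-in/")

def to_dataframe(rdd):
    if rdd.take(1):                      # guard against empty micro-batches
        df = sqlContext.jsonRDD(rdd)     # infer the schema from the JSON strings
        df.registerTempTable("events")
        df.show()

lines.foreachRDD(to_dataframe)
ssc.start()
ssc.awaitTermination()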
Hey y'all,
While I haven't been able to get Spark + Kinesis integration working, I
pivoted to plan B: I now push data to S3 where I set up a DStream to
monitor an S3 bucket with textFileStream, and that works great.
I <3 Spark!
Best,
Vadim
On Mon, Apr 6, 2015 at 12:23 PM, Vad
Hi all,
I am wondering, has anyone on this list been able to successfully implement
Spark on top of Kinesis?
Best,
Vadim
On Sun, Apr 5, 2015 at 1:50 PM, Vadim Bichutskiy wrote:
> Hi all,
>
> Below is the output that I am getting. My Kinesis stream has 1 shard, and
> my S
409 ms
***
15/04/05 17:14:50 INFO scheduler.ReceivedBlockTracker: Deleting batches
ArrayBuffer(142825407 ms)
On Sat, Apr 4, 2015 at 3:13 PM, Vadim Bichutskiy wrote:
> Hi all,
>
> More good news! I was able to utilize mergeStrategy to assembly my Kinesis
> consumer into an &quo
/* Union all the streams */
val unionStreams = ssc.union(kinesisStreams).map(byteArray => new String(byteArray))
unionStreams.print()

ssc.start()
ssc.awaitTermination()
  }
}
On Fri, Apr 3, 2015 at 3:48 PM, Tathagata Das wrote:
> Just remove "provided" for spa
-streaming-kinesis-asl"
> % "1.3.0"
>
> On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy <
> vadim.bichuts...@gmail.com> wrote:
>
>> Thanks. So how do I fix it?
>>
>> On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan
>> wrote:
>
s were not included in the assembly
> (but yes, they should be).
>
>
> ~ Jonathan Kelly
>
> From: Vadim Bichutskiy
> Date: Friday, April 3, 2015 at 12:26 PM
> To: Jonathan Kelly
> Cc: "user@spark.apache.org"
> Subject: Re: Spark + Kinesis
>
> Hi
Dependencies += "org.apache.spark" %%
> "spark-streaming-kinesis-asl" % "1.3.0"
>
> I think that may get you a little closer, though I think you're probably
> going to run into the same problems I ran into in this thread:
> https://www.mail-archive.
that may get you a little closer, though I think you're probably
> going to run into the same problems I ran into in this thread:
> https://www.mail-archive.com/user@spark.apache.org/msg23891.html I never
> really got an answer for that, and I temporarily moved on to other things
> http://twitter.com/deanwampler
> http://polyglotprogramming.com
>
> On Thu, Apr 2, 2015 at 8:33 AM, Vadim Bichutskiy <
> vadim.bichuts...@gmail.com> wro
Hi all,
I am trying to write an Amazon Kinesis consumer Scala app that processes
data in the
Kinesis stream. Is this the correct way to specify build.sbt:
---
import AssemblyKeys._

name := "Kinesis Consumer"
version := "1.0"
organization := "com.myconsumer"
scalaVersion := "2.11.5"
You can start with http://spark.apache.org/docs/1.3.0/index.html
Also get the Learning Spark book http://amzn.to/1NDFI5x. It's great.
Enjoy!
Vadim
On Thu, Apr 2, 2015 at 4:19 AM, Star Guo wrote:
> Hi, all
>
>
>
> I am new to here. Could you give me some suggestion to learn Spark ?
> Thanks.
Hi all,
I just tried launching a Spark cluster on EC2 as described in
http://spark.apache.org/docs/1.3.0/ec2-scripts.html
I got the following response:
*"PendingVerificationYour
account is currently being verified. Verification normally takes less than
2 hours. Until your account is verified,