(map)}
}
}
var uberRdd = esRdds(0)
for (rdd <- esRdds) {
uberRdd = uberRdd ++ rdd
}
uberRdd.foreach(x => println(x))
}
From: Jeetendra Gangele [mailto:gangele...@gmail.com]
Sent: April 22, 2015 2:52 PM
To: Adrian Mocanu
Cc: u
Hi
I use the ElasticSearch package for Spark and very often it times out reading
data from ES into an RDD.
How can I keep the connection alive? (Why doesn't it stay alive? Is this a bug?)
Here's the exception I get:
org.elasticsearch.hadoop.serialization.EsHadoopSerializationException:
I use scroll to get the data from ES, about 150 items at a
time. The usual delay when I perform the same query from a browser plugin ranges
from 1 to 5 seconds.
Thanks
From: Jeetendra Gangele [mailto:gangele...@gmail.com]
Sent: April 22, 2015 3:09 PM
To: Adrian Mocanu
Cc: u...@spark.incubator.apache.org
Subject: Re
Hi
I need help fixing a time out exception thrown from ElasticSearch Hadoop. The
ES cluster is up all the time.
I use ElasticSearch Hadoop to read data from ES into RDDs. I get a collection
of these RDDs, which I traverse (with foreachRDD), creating more RDDs from each
RDD in the collection.
Here's my log output from a streaming job.
What is this?
09:54:27.504 [RecurringTimer - JobGenerator] DEBUG
o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time
1427378067504
09:54:27.505 [RecurringTimer - JobGenerator] DEBUG
o.a.s.streaming.util.RecurringTimer -
]
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
~[na:1.7.0_75]
this is a huge stack trace... but it keeps repeating
What could this be from?
From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: March 26, 2015 2:10 PM
To: u
Hi
Is there a way to write all RDDs in a DStream to the same file?
I tried this and got an empty file. I think it's because the file is not closed, i.e.
ESMinibatchFunctions.writer.close() executes before the stream is created.
Here's my code
myStream.foreachRDD(rdd => {
rdd.foreach(x => {
Hi
Does updateStateByKey pass elements to updateFunc (in Seq[V]) in the order in which
they appear in the RDD?
My guess is no, which means updateFunc needs to be commutative. Am I correct?
I've asked this question before but there were no takers.
Here's the scala docs for updateStateByKey
/**
*
Here's my use case:
I read an array into an RDD and I use a hash partitioner to partition the RDD.
This is the array type: Array[(String, Iterable[(Long, Int)])]
topK:Array[(String, Iterable[(Long, Int)])] = ...
import org.apache.spark.HashPartitioner
val hashPartitioner = new HashPartitioner(10)
Hi
I have an RDD: RDD[(String, scala.Iterable[(Long, Int)])] which I want to print
into a file, a file for each key string.
I tried to trigger a repartition of the RDD by doing a groupBy on it. The
grouping gives RDD[(String, scala.Iterable[Iterable[(Long, Int)]])] so I
flattened that:
I was looking at the updateStateByKey documentation.
It passes in a values Seq which contains values that have the same key.
I would like to know if there is any ordering to these values. My feeling is
that there is no ordering, but maybe it does preserve RDD ordering.
Example: RDD[ (a,2), (a,3),
Hi
I get this exception when I run a Spark test case on my local machine:
An exception or error caused a run to abort:
-thrift % 2.0.3
intransitive()
val sparkStreamingFromKafka = "org.apache.spark" % "spark-streaming-kafka_2.10"
% "0.9.1" excludeAll(
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: January-22-15 11:39 AM
To: Adrian Mocanu
Cc: u...@spark.incubator.apache.org
Subject: Re
I use spark 1.1.0-SNAPSHOT
val spark = "org.apache.spark" %% "spark-core" % "1.1.0-SNAPSHOT" % "provided"
excludeAll(
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: January-22-15 11:39 AM
To: Adrian Mocanu
Cc: u...@spark.incubator.apache.org
Subject: Re: Exception
Hi
I have a question regarding design trade-offs and best practices. I'm working
on a real-time analytics system. For simplicity, I have data with timestamps
(the key) and counts (the value). I use DStreams for the real-time aspect.
Tuples with the same timestamp can be spread across various RDDs, and I
I’m resurrecting this thread because I’m interested in doing transpose on a
RowMatrix.
There is this other thread too:
http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-multiplication-in-spark-td12562.html
Which presents https://issues.apache.org/jira/browse/SPARK-3434 which is still
Hi
I'd like to use the result of one RDD (RDD1) in another (RDD2). Normally I would use
something like a barrier to make the 2nd RDD wait until the computation of the
1st RDD is done, then include the result from RDD1 in the closure for RDD2.
Currently I create another RDD, RDD3, out of the result of RDD1
My understanding is that the reason you have an Option is so you could filter
out tuples when None is returned. This way your state data won't grow forever.
-Original Message-
From: spr [mailto:s...@yarcdata.com]
Sent: November-12-14 2:25 PM
To: u...@spark.incubator.apache.org
Subject:
You are correct; the filtering I’m talking about is done implicitly. You don’t
have to do it yourself. Spark will do it for you and remove those entries from
the state collection.
From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
Sent: November-12-14 3:50 PM
To: Adrian Mocanu
Cc: spr; u
I use
https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala
to help me with testing.
In Spark 0.9.1 my tests depending on TestSuiteBase worked fine. As soon as I
switched to the latest (1.0.1), all tests fail. My sbt import is:
b0c1, could you post your code? I am interested in your solution.
Thanks
Adrian
From: boci [mailto:boci.b...@gmail.com]
Sent: June-26-14 6:17 PM
To: user@spark.apache.org
Subject: Re:
are you using. If it is 0.9.1, I can see that the
cleaner in
ShuffleBlockManager (https://github.com/apache/spark/blob/v0.9.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockManager.scala)
is not stopped, so it is a bug.
TD
On Thu, May 22, 2014 at 9:24 AM, Adrian Mocanu
amoc
Hi
After using Spark's TestSuiteBase to run some tests I've noticed that at the
end, after finishing all tests, the cleaner is still running and periodically outputs the
following:
INFO o.apache.spark.util.MetadataCleaner - Ran metadata cleaner for
SHUFFLE_BLOCK_MANAGER
I use method
I have a few test cases for Spark which extend TestSuiteBase from
org.apache.spark.streaming.
The tests run fine on my machine, but when I commit to the repo and the tests run
automatically with Bamboo, the test cases fail with these errors.
How do I fix this?
21-May-2014 16:33:09
[info]
java.net.BindException: Address already in use
Is there a way to set these connections up so that they don't all start on the
same port? (That's my guess for the root cause of the issue.)
From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: May-21-14 4:58 PM
To: u...@spark.incubator.apache.org; user
Please ignore. This was sent last week; not sure why it arrived so late.
-Original Message-
From: amoc [mailto:amoc...@verticalscope.com]
Sent: May-09-14 10:13 AM
To: u...@spark.incubator.apache.org
Subject: Re: slf4j and log4j loop
Hi Patrick/Sean,
Sorry to resurrect this thread, but
I recall someone from the Spark team (TD?) saying that Spark 0.9.1 would change
the logger and the circular loop error between slf4j and log4j wouldn't show up.
Yet on Spark 0.9.1 I still get:
SLF4J: Detected both log4j-over-slf4j.jar AND slf4j-log4j12.jar on the class
path, preempting
Hey guys,
I've asked before (in Spark 0.9; I now use 0.9.1) about removing the log4j
dependency and was told that it was gone. However, I still find it pulled in by the
ZooKeeper imports. This is fine since I exclude it myself in the sbt file, but
another issue arises.
I wonder if anyone else has run into
...@spark.incubator.apache.org
Subject: Re: another updateStateByKey question
Could be a bug. Can you share a code with data that I can use to reproduce this?
TD
On May 2, 2014 9:49 AM, Adrian Mocanu
amoc...@verticalscope.com wrote:
Has anyone else noticed that sometimes
Forgot to mention my batch interval is 1 second:
val ssc = new StreamingContext(conf, Seconds(1))
hence the Thread.sleep(1100)
From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: May-05-14 12:06 PM
To: user@spark.apache.org
Cc: u...@spark.incubator.apache.org
Subject: RE: another
, Adrian Mocanu amoc...@verticalscope.com
wrote:
Hi TD,
Why does the example keep recalculating the count via fold?
Wouldn't it make more sense to get the last count from the values Seq,
add 1 to it, and save that as the current count?
From what Sean explained I understand that all values in Seq
I'm trying to understand updateStateByKey.
Here's an example I'm testing with:
Input data: DStream( RDD( (a,2) ), RDD( (a,3) ), RDD( (a,4) ), RDD(
(a,5) ), RDD( (a,6) ), RDD( (a,7) ) )
Code:
val updateFunc = (values: Seq[Int], state: Option[StateClass]) => {
val previousState =
If I use a range partitioner, will this make updateStateByKey take the tuples
in order?
Right now I see them not being taken in order (most of them are ordered but not
all)
-Adrian
What is Seq[V] in updateStateByKey?
Does this store the collected tuples of the RDD in a collection?
Method signature:
def updateStateByKey[S: ClassTag](updateFunc: (Seq[V], Option[S]) =>
Option[S]): DStream[(K, S)]
In my case I used Seq[Double] assuming a sequence of Doubles in the RDD; the
Any suggestions where I can find this in the documentation or elsewhere?
Thanks
From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: April-24-14 11:26 AM
To: u...@spark.incubator.apache.org
Subject: reduceByKeyAndWindow - spark internals
If I have this code:
val stream1
If I have this code:
val stream1= doublesInputStream.window(Seconds(10), Seconds(2))
val stream2= stream1.reduceByKeyAndWindow(_ + _, Seconds(10), Seconds(10))
Does reduceByKeyAndWindow merge all RDDs from stream1 that came in the 10
second window?
Example, in the first 10 secs stream1 will
Has anyone managed to write Booleans to Cassandra from an RDD with Calliope?
My Booleans give compile-time errors: expression of type List[Any] does not
conform to expected type Types.CQLRowValues.
CQLColumnValue is defined as ByteBuffer: type CQLColumnValue = ByteBuffer
For now I convert them
I'd like to resurrect this thread since I don't have an answer yet.
From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: March-27-14 10:04 AM
To: u...@spark.incubator.apache.org
Subject: function state lost when next RDD is processed
Is there a way to pass a custom function to spark
is processed
As long as the amount of state being passed is relatively small, it's probably
easiest to send it back to the driver and to introduce it into RDD
transformations as the zero value of a fold.
On Fri, Mar 28, 2014 at 7:12 AM, Adrian Mocanu
amoc...@verticalscope.com
I think you should sort each RDD
-Original Message-
From: yh18190 [mailto:yh18...@gmail.com]
Sent: March-28-14 4:44 PM
To: u...@spark.incubator.apache.org
Subject: Re: Splitting RDD and Grouping together to perform computation
Hi,
Thanks Nanzhu. I tried to implement your suggestion on
I say you need to remap so you have a key for each tuple that you can sort on.
Then call rdd.sortByKey(true), like this: mystream.transform(rdd =>
rdd.sortByKey(true))
For this fn to be available you need to import
org.apache.spark.rdd.OrderedRDDFunctions
-Original Message-
From: yh18190
Not sure how to change your code because you'd need to generate the keys where
you get the data. Sorry about that.
I can tell you where to put the code to remap and sort though.
import org.apache.spark.rdd.OrderedRDDFunctions
val res2 = reduced_hccg.map(_._2)
.map(x => (newkey, x)).sortByKey(true)
Found this transform fn in StreamingContext which takes in a DStream[_] and a
function which acts on each of its RDDs
Unfortunately I can't figure out how to transform my DStream[(String,Int)] into
DStream[_]
/*** Create a new DStream in which each RDD is generated by applying a function
on
Please disregard; I didn't see the Seq wrapper.
From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: March-27-14 11:57 AM
To: u...@spark.incubator.apache.org
Subject: StreamingContext.transform on a DStream
Found this transform fn in StreamingContext which takes in a DStream
I create several RDDs by merging several consecutive RDDs from a DStream. Is
there a way to add these new RDDs to a DStream?
-Adrian
I'm passing a moving average function during the map phase like this:
val average = new Sma(window = 3)
stream.map(x => average.addNumber(x))
where
class Sma extends Serializable { .. }
I also tried to put the creation of the average object in an object, like I saw in
another post:
object Average {
creators to comment on this.
-A
From: Benjamin Black [mailto:b...@b3k.us]
Sent: March-26-14 11:50 AM
To: user@spark.apache.org
Subject: Re: closures moving averages (state)
Perhaps you want reduce rather than map?
On Wednesday, March 26, 2014, Adrian Mocanu
amoc...@verticalscope.com
Hi Diana,
I'll answer Q3:
You can check if an RDD is empty in several ways.
Someone here mentioned that using an iterator was safer:
val isEmpty = rdd.mapPartitions(iter => Iterator(!iter.hasNext)).reduce(_ && _)
You can also check with a fold or rdd.count
rdd.reduce(_ + _) // can't handle
Let me rephrase that,
Do you think it is possible to use an accumulator to skip the first few
incomplete RDDs?
-Original Message-
From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: March-25-14 9:57 AM
To: user@spark.apache.org
Cc: u...@spark.incubator.apache.org
Subject: RE
I have a DStream like this:
..RDD[a,b],RDD[b,c]..
Is there a way to remove duplicates across the entire DStream? Ie: I would like
the output to be (by removing one of the b's):
..RDD[a],RDD[b,c].. or ..RDD[a,b],RDD[c]..
Thanks
-Adrian
I have what I would call unexpected behaviour when using window on a stream.
I have 2 windowed streams with a 5s batch interval. One window stream is
(5s,5s)=smallWindow and the other (10s,5s)=bigWindow
What I've noticed is that the 1st RDD produced by bigWindow is incorrect and is
of the size
Hi
Quick question here,
I know that .foreach is not idempotent. I am wondering whether collect() is
idempotent, meaning that once I've collect()-ed, if a Spark node crashes I can't
get the same values from the stream ever again.
Thanks
-Adrian
Hi
Have you encountered a slf4j and log4j loop when using Spark? I pull a few
packages via sbt.
The Spark package uses slf4j-log4j12.jar and another package uses
log4j-over-slf4j.jar, which creates the circular loop between the 2 loggers and
thus the exception below. Do you know of a fix for