Why would Scala 0.11 change things here? I'm not familiar with what
features you're referring to.
I would support a prelude file in ~/.sparkrc or similar that is
automatically imported on spark shell startup if it exists.
Sent from my mobile phone
On Feb 17, 2014 9:11 PM, Prashant Sharma
Hi,
What is the size of RDD two?
You want to map a line from RDD one to multiple values from RDD two and get
the sum of all of them?
So as a result you would have an RDD of the size of RDD1, containing a number
per line?
2014-02-18 8:06 GMT+01:00 hanbo hanbo...@gmail.com:
Sincerely thank you for
Hi Robin,
Please make sure whether any new file is taken as an input or not.
If that is fine, then please check what exact error is coming up on the
cluster web UI.
Thanks & Regards
Amit Ku. Behera
Thanks for your reply:
What is the size of RDD two?
RDD two is a paired RDD; during iteration, its size may vary from
4 to 450.
You want to map a line from RDD one to multiple values from RDD two and
get the sum of all of them?
Yes
So as result you would have an
Here's what I would do:
RDD1 :
"1" "2" "3"
"1" "3" "5"
RDD2 :
("1", 11)
("2", 22)
("3", 33)
("5", 55)
1/ flatMap your lines from RDD1 to RDD1bis (key, lineId) (you'll
have to use mapPartitionsWithIndex
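A rough sketch of that approach, assuming a SparkContext sc and the toy RDDs above (the line-id encoding is just illustrative):
import org.apache.spark.SparkContext._
val rdd1 = sc.parallelize(Seq(Seq("1", "2", "3"), Seq("1", "3", "5")))     // RDD1: lines of keys
val rdd2 = sc.parallelize(Seq(("1", 11), ("2", 22), ("3", 33), ("5", 55))) // RDD2: (key, value)
// Step 1: assign each line an id with mapPartitionsWithIndex, then emit one (key, lineId) pair per key
val keyToLine = rdd1.mapPartitionsWithIndex { (partId, iter) =>
  iter.zipWithIndex.map { case (line, i) => (line, (partId.toLong << 32) | i) }
}.flatMap { case (line, lineId) => line.map(key => (key, lineId)) }
// Step 2: join with RDD2 on the key, then sum the joined values per line
val sumsPerLine = keyToLine.join(rdd2)
  .map { case (_, (lineId, value)) => (lineId, value) }
  .reduceByKey(_ + _)   // (lineId, sum of the values matched on that line)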
Hi,
I couldn't find any documentation on how to test whether an RDD is empty.
I'm doing a reduce operation, but it throws an
UnsupportedOperationException if the RDD is empty. I'd like to check if
the RDD is empty before calling reduce.
On the Google Groups list Reynold Xin had suggested using
Unfortunately there isn't an easy way to test whether an RDD is empty
unless you count it. But fortunately, RDD.fold can solve your problem.
All you need to do is to provide a zero value.
For an RDD r, r.reduce(f) is roughly equivalent to r.fold(r.first)(f), that
is, the first element of r is
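A minimal sketch of that idea, assuming an RDD[Int] named numbers that may be empty:
// reduce(_ + _) throws UnsupportedOperationException on an empty RDD,
// but fold returns the zero value instead of failing.
val total = numbers.fold(0)(_ + _)   // evaluates to 0 when numbers is empty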
Hi,
I'm experimenting with a Spark analytic on a 9-node cluster, and the Python
version runs in about 5 minutes, whereas the Java version with all the same
SparkContext configurations (and everything else being equal) takes 40+ minutes.
Does anyone know what may be causing this performance
Hi Tathagata!
Thanks for getting back to me.
I've 2 follow up questions.
Q3: You've explained how data is checkpointed, but never addressed the part
about new batches overwriting old batches. Is it true that data is overwritten,
or is it that all data is saved, resulting in a huge blob?
Q4:
I
Here is an example that is bundled with Spark:
for (i <- 1 to ITERATIONS) {
  println("On iteration " + i)
  val gradient = points.map { p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  }.reduce(_ + _)
  w -= gradient
}
As you can see, an action is called on
Hi Guys,
I am doing some tests with the NetworkWordCount scala code where I am
counting and summing a stream of words received from the network using the
foreach action, thanks TD. Firstly I have begun with this scenario: 1
Master + 1 Worker (also acting as the stream source) and I have obtained
the
I am getting this error when trying to run the code from the following page in the
shell:
http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html
I believe that DStream[(String, Int)] is precisely the class that should
have this function so I am confused.
Thanks!
scala> val
Hi,
Since getting Spark + MongoDB to work together was not very obvious (at
least to me) I wrote a tutorial about it in my blog with an example
application:
http://codeforhire.com/2014/02/18/using-spark-with-mongodb/
Hope it's of use to someone else as well.
Cheers,
*Sampo Niskanen*
Very cool, thanks for writing this. I’ll link it from our website.
Matei
On Feb 18, 2014, at 12:44 PM, Sampo Niskanen sampo.niska...@wellmo.com wrote:
Hi,
Since getting Spark + MongoDB to work together was not very obvious (at least
to me) I wrote a tutorial about it in my blog with an
I'm using sbt to build and run my Spark driver program. It's complaining
that it cannot find a class (twitter4j/Status) even though my code compiles
fine.
Do I need to package all external dependencies into one fat jar? If yes,
can someone tell me the preferred way of doing it with sbt? I'm new
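One common route is the sbt-assembly plugin; a rough sketch (the plugin version is only an example, check the plugin README for your sbt version):
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
// build.sbt (older sbt-assembly releases wire in the settings like this)
import AssemblyKeys._
assemblySettings
Then run sbt assembly and ship the resulting fat jar with your driver (for example via SparkConf.setJars).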
I'm writing unit tests with Spark and need some help.
I've already read this helpful article:
http://blog.quantifind.com/posts/spark-unit-test/
There are a couple of differences between my testing environment and the blog's.
1. I'm using FunSpec instead of FunSuite. So my tests look like
class
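For what it's worth, a minimal sketch of such a FunSpec-based test (the class, app, and data names are made up):
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.scalatest.{BeforeAndAfterAll, FunSpec}

class WordCountSpec extends FunSpec with BeforeAndAfterAll {
  var sc: SparkContext = _   // one local context shared by the examples in this spec

  override def beforeAll() { sc = new SparkContext("local", "test") }
  override def afterAll()  { sc.stop() }

  describe("word counting") {
    it("counts occurrences of each word") {
      val counts = sc.parallelize(Seq("a", "b", "a")).map((_, 1)).reduceByKey(_ + _).collectAsMap()
      assert(counts("a") === 2)
    }
  }
}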
Hi, sorry about this, my question is not related to yours; I am just
checking around the mailing list in the hope of finding the solution to my
question.
I am compiling a standalone Scala Spark program, but it
reports sbt.ResolveException: unresolved dependency:
Hi,
I have a question on achieving fault tolerance when counting with Spark and
storing the aggregate count into Cassandra.
Example input: RDD 1 [a,a,a], RDD 2 [a,a]
After aggregation of RDD1 (i.e. map + reduceByKey) we get Map:[a -> 3]
And after aggregation for RDD2 we get Map:[a -> 2]
Now let's store
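For reference, a minimal sketch of that per-batch aggregation (the input RDD is built inline just for illustration):
import org.apache.spark.SparkContext._
val batch = sc.parallelize(Seq("a", "a", "a"))                // RDD 1
val counts = batch.map(word => (word, 1)).reduceByKey(_ + _)  // count per key
// counts.collect() gives Array(("a", 3))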
Addendum
Forgot to mention that I use a StreamingContext: val streamingContext = new
StreamingContext(conf, Seconds(10))
And I have no idea how option a), reading the DStream from the beginning, would
be implemented in Spark, but I think it might instead read it somewhere in the
middle or last
A3: The basic RDD model is that the dataset is immutable. As new batches of
data come in, each batch is treated as an RDD. Then RDD transformations are
applied to create new RDDs. When some of these RDDs are checkpointed, they
create separate HDFS files. So, yes, the checkpoint files will keep
You have to import StreamingContext._
That imports implicit conversions that allow reduceByKey() to be applied on
DStreams with key-value pairs.
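For example, a minimal sketch (words is assumed to be a DStream[String]):
import org.apache.spark.streaming.StreamingContext._   // brings in the pair-DStream implicits
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)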
TD
On Tue, Feb 18, 2014 at 12:03 PM, bethesda swearinge...@mac.com wrote:
I am getting this error when trying to run the code from the following page in
I'm sorry, but what does "it's not fundamentally not possible to avoid
checkpointing" mean?
Are you saying that for these two stateful streaming apps, it's possible to
avoid checkpointing? How?
On Tue, Feb 18, 2014 at 5:44 PM, Tathagata Das
tathagata.das1...@gmail.comwrote:
A3: The basic RDD model
Thanks for your reply.
I have changed scalaVersion := "2.10" to scalaVersion := "2.10.3" and then
everything is good.
So this is a documentation bug :)
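For reference, a minimal standalone build.sbt along those lines might look like this (the Spark artifact version is just an example from that era):
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"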
dachuan.
On Tue, Feb 18, 2014 at 6:50 PM, Denny Lee denny.g@gmail.com wrote:
What version of Scala are you using? For example, if you're using
Dachuan,
Where did you find that faulty documentation? I'd like to get it fixed.
Thanks!
Andrew
On Tue, Feb 18, 2014 at 4:15 PM, dachuan hdc1...@gmail.com wrote:
Thanks for your reply.
I have changed scalaVersion := "2.10" to scalaVersion := "2.10.3" and then
everything is good.
So this is a
oh~Sorry, Andrew
I just made the PR, it's an error in the site config.yml
Best,
--
Nan Zhu
On Tuesday, February 18, 2014 at 7:16 PM, Andrew Ash wrote:
Dachuan,
Where did you find that faulty documentation? I'd like to get it fixed.
Thanks!
Andrew
On Tue, Feb 18, 2014 at
http://spark.incubator.apache.org/docs/latest/quick-start.html
it's in this page.
On Tue, Feb 18, 2014 at 7:24 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
oh~Sorry, Andrew
I just made the PR, it's an error in the site config.yml
Best,
--
Nan Zhu
On Tuesday, February 18, 2014 at 7:16 PM,
I guess here: http://spark.incubator.apache.org/docs/latest/quick-start.html
Regards
Mayur
Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi
On Tue, Feb 18, 2014 at 4:16 PM, Andrew Ash and...@andrewash.com
Thank you very much for your answer. We have tried the method above before.
This is the problem we ran into while doing so.
1. We want to avoid the collect method because we do this step in the iteration
and RDD2 changes in every iteration, so the speed, memory usage, and
network traffic bother us a lot.
Let's say I have an RDD of text files from HDFS. During the runtime, is it
possible to check for new files in a particular directory and if present,
add them to the existing RDD?
Does Mesos offer a given resource to only one framework to avoid conflicts,
or does it offer it to many frameworks simultaneously and resolve conflicts,
similar to the concept of late binding?
Because
http://mesos.apache.org/documentation/latest/app-framework-development-guide/
mentions that possibility of
The driver is running on the machine where you run a command like ./spark-shell,
but in 0.9, you can run an in-cluster driver:
http://spark.incubator.apache.org/docs/latest/spark-standalone.html#launching-applications-inside-the-cluster
Best,
--
Nan Zhu
On Tuesday, February 18, 2014 at 10:06 PM,
Yea, for testing for an empty RDD, Guillaume's *mapPartitions* solution is
definitely better than *count* since it doesn't require traversing all the
elements :-)
On Tue, Feb 18, 2014 at 10:08 PM, Guillaume Pitel
guillaume.pi...@exensa.com wrote:
I think you could do something like that : (but Cheng's
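Along those lines, a rough sketch of a mapPartitions-based emptiness check (just one way to write it):
import org.apache.spark.rdd.RDD
// Each non-empty partition contributes a single marker element, so we never
// materialize the full dataset; take(1) then tells us whether any marker exists.
def isEmpty[T](rdd: RDD[T]): Boolean =
  rdd.mapPartitions(iter => if (iter.hasNext) Iterator(true) else Iterator.empty)
     .take(1)
     .isEmpty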
Thank you for your reply.
we have tried this method before, but step 2 is very time consuming because
the number of values per key is not well distributed. Some keys in the
lines of RDD1 are very dense, but others are very sparse. After the join, the
splits containing dense keys are very large and
I'm trying to launch an application inside the cluster (standalone mode).
According to the docs, the jar-url can be in either file:// or hdfs:// format. (
https://spark.incubator.apache.org/docs/latest/spark-standalone.html)
But when I tried to run spark-class, it seemed unable to parse the hdfs://xx
format.
The default zeromq receiver that comes with the Spark repository does not
guarantee on which machine the zeromq receiver will be launched; it can be on
any of the worker machines in the cluster, NOT the application machine
(called the driver in our terms). And your understanding of the code and
the
You can create a *lib* directory at the root of your project home and put
all the jars in it.
Or you can specify the location of your jars by setting the *setJars*
property in the SparkConf like:
val conf = new SparkConf()
  .setMaster("mesos://akhldz:5050")
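A rough sketch of how that chain might continue (the app name and jar path are only illustrative):
  .setAppName("MyApp")
  .setJars(Seq("target/scala-2.10/myapp_2.10-0.1.jar"))
val sc = new SparkContext(conf)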
hi all,
It seems that windowing in Spark Streaming is driven by absolute
time, not conventionally by the timestamp of the data; can anybody
kindly explain why? What can I do if I need windowing driven by the
data timestamp?
Thanks!
Aries.Kong
It could be that the hostname that Spark uses to identify the node is
different from the one you are providing. Are you using the Spark
standalone mode? In that case, you can check out the hostnames that Spark
is seeing and use that name.
Let me know if that works out.
TD
On Mon, Feb 17, 2014
Apologies if that was confusing. I meant that in a streaming application
where a few specific DStream operations like updateStateByKey, etc. are
used, you have to enable checkpointing. In fact, the system automatically
enables checkpointing with some default interval for DStreams that are
generated
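For example, a rough sketch of enabling checkpointing for a stateful operation (the directory and names are illustrative, and words is assumed to be a DStream[String]):
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")   // required for updateStateByKey

val runningCounts = words.map(word => (word, 1)).updateStateByKey[Int] {
  (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0))
}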
Hi TD,
in case of multiple streams, will the streaming code be like:
val ssc = ...
(1 to n).foreach { i =>
  val nwStream = KafkaUtils.createStream(...)
  nwStream.flatMap(...).map(...).reduceByKeyAndWindow(...).foreachRDD(saveToCassandra())
}
ssc.start()
Will it create any problem in execution (like
Take a look at the trait the spark tests are using:
https://github.com/apache/incubator-spark/blob/master/core/src/test/scala/org/apache/spark/SharedSparkContext.scala?source=cc
/Heiko
On 18 Feb 2014, at 22:36, Ameet Kini ameetk...@gmail.com wrote:
I'm writing unit tests with Spark and need
Both standalone mode and Mesos were tested, with the same outcome. After your
suggestion, I tried again in standalone mode and specified the host with
what was written in the log of a worker node. The problem remains.
A bit more detail: the "bind failed" error is reported on the driver node.
Say,