Dear all,
I need to run a series of transformations that map an RDD into another RDD.
The computation changes over time, and so does the resulting RDD. Each
result is then saved to disk in order to do further analysis (for
example, variation of the result over time).
The question is, if I
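A minimal Scala sketch of the map-then-save-each-result pattern described above (the paths and the transformation itself are placeholders, not part of the original question):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("snapshot-example"))
val input = sc.textFile("hdfs:///data/events")             // hypothetical source

// The transformation changes over time; save each result separately so the
// outputs can be compared later (e.g. variation of the result over time).
for (run <- 1 to 3) {
  val result = input.map(line => (line, line.length))      // placeholder transformation
  result.saveAsObjectFile(s"hdfs:///results/run-$run")     // one directory per result
}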
Hey,
I am getting the same error as well.
I am running:
sudo ./run-example org.apache.spark.streaming.examples.FlumeEventCount
spark://spark_master_hostname:7077 spark_master_hostname 7781
and am getting no events in Spark Streaming.
---
Time:
Here is the stage overview:
[image: Inline image 2]
and here are the stage details for stage 0:
[image: Inline image 1]
Transformations from the first stage to the second one are trivial, so they
should not be the bottleneck (apart from keyBy().groupByKey(), which causes
the shuffle write/read).
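For reference, a minimal sketch (with made-up field names) of the kind of keyBy().groupByKey() step referred to above, which is where the shuffle write/read comes from:

import org.apache.spark.rdd.RDD

case class Event(userId: String, payload: String)

val events: RDD[Event] = sc.parallelize(Seq(Event("u1", "a"), Event("u2", "b")))

// keyBy builds (userId, Event) pairs; groupByKey then moves all events for a
// given key to one task, producing the shuffle write/read seen in the stage details.
val grouped = events.keyBy(_.userId).groupByKey()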
Kind
Hi,
I searched more articles, ran a few examples, and have clarified my doubts.
This answer by TD in another thread (
https://groups.google.com/d/msg/spark-users/GQoxJHAAtX4/0kiRX0nm1xsJ ) helped
me a lot.
Here is the summary of my finding:
1) A DStream can consist of 0, 1, or more RDDs.
2)
Thanks for sharing here.
Sent from my iPhone5s
On March 21, 2014, at 20:44, Sanjay Awatramani sanjay_a...@yahoo.com wrote:
Hi,
I searched more articles, ran a few examples, and have clarified my doubts.
This answer by TD in another thread (
Hi,
This is my summary of the gap between expected behavior and actual behavior.
FlumeEventCount spark://spark_master_hostname:7077 address port
Expected: an 'agent' listening on (bound to) address:port. In the context
of Spark, this agent should be running on one of the slaves, which should be
On 03/21/2014 06:17 PM, anoldbrain [via Apache Spark User List] wrote:
...the actual address, which in turn causes the 'Fail to bind to ...'
error. This comes naturally because the slave that is running the code
to bind to address:port has a different ip.
So if we run the code on the slave where
On 03/21/2014 06:17 PM, anoldbrain [via Apache Spark User List] wrote:
...the actual address, which in turn causes the 'Fail to bind to ...'
error. This comes naturally because the slave that is running the code
to bind to address:port has a different ip.
I ran sudo ./run-example
It is my understanding that there is no way to make FlumeInputDStream work in
a cluster environment with the current release. Switching to Kafka, if you can,
would be my suggestion, although I have not used KafkaInputDStream. There is
a big difference between the Kafka and Flume InputDStreams:
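To illustrate the difference being referred to (a sketch only; hosts, ports, and topics are made up, and it assumes the streaming connector APIs of that era): a Flume receiver has to bind to a specific address:port, so it only works when it is scheduled on that exact machine, whereas a Kafka receiver connects out to ZooKeeper/the brokers and can run on any worker.

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))

// Flume: the receiver *binds* to host:port, so it must run on that very machine.
val flumeStream = FlumeUtils.createStream(ssc, "worker-1.example.com", 7781)

// Kafka: the receiver *connects out* to ZooKeeper/brokers, so any worker can host it.
val kafkaStream = KafkaUtils.createStream(ssc,
  "zk-1.example.com:2181",   // ZooKeeper quorum
  "my-consumer-group",       // consumer group id
  Map("events" -> 1))        // topic -> number of receiver threads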
Hi
I need to partition my data, represented as an RDD, into n folds, run metrics
computation in each fold, and finally compute the means of my metrics
over all the folds.
Can Spark do this data partitioning out of the box, or do I need to
implement it myself? I know that RDD has a partitions method
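As far as I know there is no dedicated k-fold primitive in the core RDD API (newer MLlib releases add a kFold helper in MLUtils), but a hand-rolled sketch using only plain RDD operations could look like the following; the hash-based fold assignment and all names are just illustrative:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Split an RDD into k (training, validation) pairs.
def kFolds[T: ClassTag](data: RDD[T], k: Int): Seq[(RDD[T], RDD[T])] = {
  val withFold = data.map(x => (((x.hashCode % k) + k) % k, x))
  (0 until k).map { i =>
    val validation = withFold.filter(_._1 == i).map(_._2)
    val training   = withFold.filter(_._1 != i).map(_._2)
    (training, validation)
  }
}

// Usage sketch: one metric per fold, then the mean over all folds.
// val perFold = kFolds(myRdd, 5).map { case (train, test) => computeMetric(train, test) }
// val mean    = perFold.sum / perFold.size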
Hello,
I have a task that runs on a week's worth of data (let's say) and
produces a Set of tuples such as Set[(String,Long)] (essentially the output
of countByValue.toMap).
I want to produce 4 sets, one for each of 4 different weeks, and run an
intersection of the 4 sets.
I have the sequential
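A sketch of one way to combine the four weekly results on the driver (the weekly RDDs and their contents are made up):

// Hypothetical weekly RDDs; in practice each would come from one week of input data.
val week1 = sc.parallelize(Seq("a", "b", "b"))
val week2 = sc.parallelize(Seq("b", "c", "b"))
val week3 = sc.parallelize(Seq("b", "d"))
val week4 = sc.parallelize(Seq("b", "b", "e"))

// Per-week count sets, e.g. Set(("b", 2L), ("a", 1L)).
val weeklySets: Seq[Set[(String, Long)]] =
  Seq(week1, week2, week3, week4).map(_.countByValue().toMap.toSet)

// Intersection across the four weeks: (value, count) pairs that match in every week.
val common: Set[(String, Long)] = weeklySets.reduce(_ intersect _)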
I'm trying to run Spark on Mesos and I'm getting this error:
java.lang.ClassNotFoundException: org/apache/spark/serializer/JavaSerializer
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at
In your task details I don't see a large skew across tasks, so the low CPU
usage period occurs between stages or during stage execution.
One possible issue: your shuffle read is 89 GB; if the machine
producing the shuffle data is not the one processing it, data shuffling
across machines may be
Hi everyone,
We are planning to set up Spark. The documentation mentions that it
is possible to run Spark in standalone mode on a Hadoop cluster. Does anyone
have any comments on stability and performance of this mode?
Both are quite stable. YARN support is still in beta, though, so it would be
good to stay on standalone until Spark 1.0.0.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Mar 21, 2014 at 2:19 PM, Sameer Tilak
Jaonary
import org.apache.spark.rdd.RDD

val loadedData: RDD[(String, (String, Array[Byte]))] =
  sc.objectFile(yourObjectFileName)
Deenar
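For completeness, a small save-and-reload sketch along the same lines (the element type, path, and sample data are only illustrative):

import org.apache.spark.rdd.RDD

val data: RDD[(String, (String, Array[Byte]))] =
  sc.parallelize(Seq(("key1", ("meta", Array[Byte](1, 2, 3)))))

// saveAsObjectFile writes the elements as Java-serialized objects in a SequenceFile ...
data.saveAsObjectFile("hdfs:///tmp/my-object-file")

// ... and objectFile reads them back; the element type is supplied at the call site.
val reloaded: RDD[(String, (String, Array[Byte]))] =
  sc.objectFile("hdfs:///tmp/my-object-file")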
Thanks, Mayur.
Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone
-------- Original message --------
From: Mayur Rustagi mayur.rust...@gmail.com
Date: 03/21/2014 11:32 AM (GMT-08:00)
To: user@spark.apache.org
Subject: Re: Spark and Hadoop cluster
On Tue, Mar 18, 2014 at 12:56 PM, Ognen Duzlevski
og...@plainvanillagames.com wrote:
On 3/18/14, 4:49 AM, dmpou...@gmail.com wrote:
On Sunday, 2 March 2014 19:19:49 UTC+2, Aureliano Buendia wrote:
Is there a reason for Spark using the older Akka?
On Sun, Mar 2, 2014 at 1:53 PM, 1esha
Howdy, folks!
Does anybody out there have a working Kafka _output_ for Spark Streaming?
Perhaps one that doesn't involve instantiating a new producer for every
batch?
Thanks!
b
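One common pattern to avoid instantiating a new producer for every batch is to keep a single lazily-created producer per executor JVM and reuse it across batches. A sketch only, assuming the standalone Kafka producer client (which may be newer than what was available at the time); the broker, topic, and names are made up:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// One producer per executor JVM, created lazily on first use and reused across batches.
object KafkaSink {
  lazy val producer: KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka-1.example.com:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](props)
  }
}

def writeToKafka(lines: DStream[String]): Unit =
  lines.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      records.foreach(r => KafkaSink.producer.send(new ProducerRecord[String, String]("output-topic", r)))
    }
  }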
Hi,
Our Spark app reduces a few hundred GB of data down to a few hundred KB of CSV.
We found that a partition count of 1000 is a good number to speed the process
up. However, it does not make sense to end up with 1000 CSV files, each
less than 1 KB.
We used RDD.coalesce(1) to get only 1 CSV file, but
Try passing the shuffle=true parameter to coalesce, then it will do the map in
parallel but still pass all the data through one reduce node for writing it
out. That’s probably the fastest it will get. No need to cache if you do that.
Matei
On Mar 21, 2014, at 4:04 PM, Aureliano Buendia
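Put together, the suggestion amounts to something like this (a sketch; the input data, reduction, and output path are hypothetical):

import org.apache.spark.rdd.RDD

// Hypothetical input: (key, value) pairs; in the real job this is a few hundred GB.
val pairs: RDD[(String, Double)] = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)))

// Do the heavy reduction with many partitions (1000 in the thread above) ...
val summarized = pairs
  .reduceByKey(_ + _, 1000)
  .map { case (k, v) => s"$k,$v" }

// ... then funnel the tiny result through one task so a single CSV part file is written.
// With shuffle = true the map side still runs in parallel; only the final write is serial.
summarized.coalesce(1, shuffle = true).saveAsTextFile("hdfs:///output/summary-csv")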
Hi,
I am able to successfully create a Shark table with 3 columns and 2 rows.
val recList = List(("value A1", "value B1", "value C1"),
                   ("value A2", "value B2", "value c2"))
val dbFields = List("Col A", "Col B", "Col C")
val rdd = sc.parallelize(recList)
Good to know it's as simple as that! I wonder why shuffle=true is not the
default for coalesce().
On Fri, Mar 21, 2014 at 11:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Try passing the shuffle=true parameter to coalesce, then it will do the
map in parallel but still pass all the data
Matei
It turns out that saveAsObjectFile(), saveAsSequenceFile() and
saveAsHadoopFile() currently do not pick up the Hadoop settings, as Aureliano
found out in this post:
http://apache-spark-user-list.1001560.n3.nabble.com/Turning-kryo-on-does-not-decrease-binary-output-tp212p249.html
Deenar
Ah, the reason is that coalesce is often used to deal with lots of small
input files on HDFS. In that case you don’t want to reshuffle them all across
the network, you just want each mapper to directly read multiple files (and you
want fewer than one mapper per file).
Matei
On Mar 21,
Aureliano
Apologies for hijacking this thread.
Matei
On the subject of processing lots (millions) of small input files on HDFS,
what are the best practices to follow in Spark? Currently my code looks
something like this. Without coalesce there is one task and one output file
per input file.
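A generic sketch of that pattern (this is not the original poster's actual code; the glob and the partition count are made up):

// Read millions of small files with one glob; without coalesce, each input
// file tends to become its own task and its own output part file.
val raw = sc.textFile("hdfs:///data/small-files/*/*.log")

// coalesce without a shuffle lets a single task read several input files,
// so far fewer, larger output files get written.
raw.coalesce(500).saveAsTextFile("hdfs:///data/combined")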
Hi all, I'm wondering if there are any settings I can use to reduce the
memory needed by the PythonRDD when computing simple stats. I am
getting OutOfMemoryError exceptions while calculating count() on big,
but not absurd, records. It seems like PythonRDD is trying to keep
too many of these
I think you bumped the wrong thread.
As I mentioned in the other thread:
saveAsHadoopFile only applies compression when the codec is available, and
it does not seem to respect the global Hadoop compression properties.
I'm not sure whether this is a feature or a bug in Spark.
If this is a feature,
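Either way, one workaround consistent with that observation is to pass the codec explicitly rather than relying on the global Hadoop properties (a sketch with made-up paths):

import org.apache.hadoop.io.Text
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapred.TextOutputFormat
import org.apache.spark.rdd.RDD

val pairs: RDD[(Text, Text)] =
  sc.parallelize(Seq(("k", "v"))).map { case (k, v) => (new Text(k), new Text(v)) }

// Pass the codec explicitly instead of relying on mapred.output.compress* settings.
pairs.saveAsHadoopFile("hdfs:///out/compressed",
  classOf[Text], classOf[Text],
  classOf[TextOutputFormat[Text, Text]],
  classOf[GzipCodec])

// saveAsTextFile also has an overload that takes a codec:
// sc.parallelize(Seq("a", "b")).saveAsTextFile("hdfs:///out/text-gz", classOf[GzipCodec])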
I am getting the following error when trying to build Spark. I tried various
sizes for -Xmx and other memory-related arguments to the java command line,
but the assembly command still fails.
$ sbt/sbt assembly
...
[info] Compiling 298 Scala sources and 17 Java sources to