Hi,
What is the best way to process a batch window in Spark Streaming:
kafkaStream.foreachRDD(rdd => {
rdd.collect().foreach(event => {
// process the event
process(event)
})
})
Or
kafkaStream.foreachRDD(rdd => {
rdd.map(event => {
//
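For what it's worth, here is a minimal sketch of the second, distributed approach (assuming process(event) is a serializable function available on the executors; this is an illustration, not a definitive answer). Note that rdd.map on its own is a lazy transformation, so without an action nothing would actually run; foreachPartition is an action and keeps the work on the executors instead of collecting every event to the driver.

kafkaStream.foreachRDD { rdd =>
  rdd.foreachPartition { events =>
    // runs on the executors, one iterator of events per partition
    events.foreach(event => process(event))
  }
}

Collecting to the driver, as in the first snippet, only makes sense when each batch is small enough to fit in driver memory.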
You might be interested in the new s3a filesystem in Hadoop 2.6.0 [1].
1. https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-10400
On Nov 26, 2014 12:24 PM, Aaron Davidson ilike...@gmail.com wrote:
Spark has a known problem where it will do a pass of metadata on a large
number
Hi,
I have 2 files which come from a CSV import of 2 Oracle tables.
F1 has 46730613 rows
F2 has 3386740 rows
I build 2 tables with Spark.
Table F1 is joined with table F2 on c1=d1.
All keys F2.d1 exist in F1.c1, so I expect to retrieve 46730613 rows. But
it returns only 3437 rows.
// ---
Hi,
I build 2 tables from files. Table F1 is joined with table F2 on c5=d4.
F1 has 46730613 rows
F2 has 3386740 rows
All keys d4 exist in F1.c5, so I expect to retrieve 46730613 rows. But it
returns only 3437 rows.
// --- begin code ---
val sqlContext = new
Hi,
I am building a graph from a large CSV file. Each record contains a couple of
nodes and about 10 edges. When I try to load a large portion of the graph
using multiple partitions, I get inconsistent results in the number of edges
between different runs. However, if I use a single
Thanks
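In case a concrete starting point helps, a minimal sketch of building edges from a CSV in GraphX and counting them; the src,dst column layout and numeric vertex IDs are assumptions, not the poster's actual record format.

import org.apache.spark.graphx.{Edge, Graph}

val edges = sc.textFile("edges.csv").map { line =>
  val cols = line.split(",")
  // assumed layout: srcId,dstId -- adjust to the real record structure
  Edge(cols(0).toLong, cols(1).toLong, 1)
}
val graph = Graph.fromEdges(edges, 0)
println(graph.edges.count())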
Hi,
I'm a newbie in Spark and face the following use case:
val data = Array(("A", "1;2;3"))
val rdd = sc.parallelize(data)
// Something here to produce an RDD of (key, value) pairs:
// ("A", 1), ("A", 2), ("A", 3)
Does anybody know how to do this?
Thanks
Hi,
But I've only one RDD. Here is a more complete example:
My RDD is something like ("A", "1;2;3"), ("B", "2;5;6"), ("C", "3;2;1")
and I expect the following result:
("A", 1), ("A", 2), ("A", 3), ("B", 2), ("B", 5), ("B", 6), ("C", 3),
("C", 2), ("C", 1)
Any idea how I can achieve this?
Thanks
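For later readers, a minimal sketch of one way to do this with flatMap, assuming the values are ';'-separated strings as shown above:

val rdd = sc.parallelize(Seq(("A", "1;2;3"), ("B", "2;5;6"), ("C", "3;2;1")))
val decomposed = rdd.flatMap { case (key, values) =>
  // emit one (key, value) pair per ';'-separated element
  values.split(";").map(v => (key, v))
}
// decomposed contains (A,1), (A,2), (A,3), (B,2), (B,5), (B,6), (C,3), (C,2), (C,1)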
Hi,
I finally found a solution after reading this post:
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-split-RDD-by-key-and-save-to-different-path-td11887.html#a11983
the documentation and found nothing
specifically relevant to Cassandra. Is there such a piece of documentation?
Thank you,
- David
Thanks, that worked! I downloaded the version pre-built against hadoop1 and
the examples worked.
- David
On Tue, Sep 30, 2014 at 5:08 PM, Kan Zhang kzh...@apache.org wrote:
java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.JobContext, but class was expected
Hi All,
After some hair pulling, I've reached the realisation that an operation I
am currently doing via:
myRDD.groupByKey.mapValues(func)
should be done more efficiently using aggregateByKey or combineByKey. Both
of these methods would do, and they seem very similar to me in terms of
their
, mergeCombiners.
Hope this helps!
Liquan
On Sun, Sep 28, 2014 at 11:59 PM, David Rowe davidr...@gmail.com wrote:
Hi All,
After some hair pulling, I've reached the realisation that an operation I
am currently doing via:
myRDD.groupByKey.mapValues(func)
should be done more efficiently using
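To make the difference concrete, a sketch of both forms for a simple per-key sum (the real func from the original post is not shown, so the sum is only an assumed placeholder); combOp below plays the mergeCombiners role mentioned above.

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey materializes every value for a key before func runs
val viaGroup = pairs.groupByKey.mapValues(vs => vs.sum)

// aggregateByKey combines values map-side as it goes
val viaAggregate = pairs.aggregateByKey(0)(
  (acc, v) => acc + v, // seqOp: fold one value into the partition-local accumulator
  (a, b) => a + b)     // combOp: merge accumulators across partitions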
Hi,
Does anybody know how to use sortByKey in Scala on an RDD like:
val rddToSave = file.map(l => l.split("\\|")).map(r => (r(34) + "-" + r(3),
r(4), r(10), r(12)))
because I receive an error: sortByKey is not a member of
org.apache.spark.rdd.RDD[(String,String,String,String)].
What I try to do
Thanks
I've already tried this solution but it does not compile (in Eclipse).
I'm surprised to see that in the Spark shell, sortByKey works fine on both of
these forms:
(String,String,String,String)
(String,(String,String,String))
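For what it's worth, sortByKey is only defined (via OrderedRDDFunctions) on an RDD of two-element (key, value) tuples, so a common workaround is to nest the remaining fields into the value, sketched below under that assumption:

// on older Spark versions you may also need: import org.apache.spark.SparkContext._
val rddToSave = file.map(l => l.split("\\|"))
  .map(r => (r(34) + "-" + r(3), (r(4), r(10), r(12)))) // (key, (v1, v2, v3))
val sorted = rddToSave.sortByKey()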
Hi Andrew,
I can't speak for Theodore, but I would find that incredibly useful.
Dave
On Wed, Sep 24, 2014 at 11:24 AM, Andrew Ash and...@andrewash.com wrote:
Hi Theodore,
What do you mean by module diagram? A high level architecture diagram of
how the classes are organized into packages?
Hi,
I've seen this problem before, and I'm not convinced it's GC.
When Spark shuffles, it writes a lot of small files to store the data to be
sent to other executors (AFAICT). According to what I've read around the
place, the intention is that these files be stored in disk buffers, and
since
Is there a shell available for Spark SQL, similar to the way the Shark
or Hive shells work?
From my reading up on Spark SQL, it seems like one can execute SQL
queries in the Spark shell, but only from within code in a programming
language such as Scala. There does not seem to be any way to
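For the 1.x-era shell, running SQL from Scala looks roughly like the sketch below (the input file and table name are placeholders, and the exact API names varied slightly across 1.x releases); whether a standalone SQL-only shell exists is exactly the open question above.

// inside spark-shell, where sc is already defined
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val people = sqlContext.jsonFile("people.json") // placeholder input
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people").collect().foreach(println)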
I generally call values.stats, e.g.:
val stats = myPairRdd.values.stats
On Fri, Sep 12, 2014 at 4:46 PM, rzykov rzy...@gmail.com wrote:
Is it possible to use DoubleRDDFunctions
https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/rdd/DoubleRDDFunctions.html
for calculating mean
Oh I see, I think you're trying to do something like (in SQL):
SELECT order, mean(price) FROM orders GROUP BY order
In this case, I'm not aware of a way to use the DoubleRDDFunctions, since
you have a single RDD of pairs where each pair is of type (KeyType,
Iterable[Double]).
It seems to me
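One possible way to get a per-key mean without DoubleRDDFunctions is to aggregate a (sum, count) pair per key; this is a sketch under the assumption that the input is an RDD of (order, price) pairs with Double prices.

val prices = sc.parallelize(Seq(("order1", 10.0), ("order1", 20.0), ("order2", 5.0)))

val meanByKey = prices
  .aggregateByKey((0.0, 0L))(
    (acc, price) => (acc._1 + price, acc._2 + 1), // fold one price into (sum, count)
    (a, b) => (a._1 + b._1, a._2 + b._2))         // merge partial (sum, count) pairs
  .mapValues { case (sum, count) => sum / count }
// meanByKey contains ("order1", 15.0), ("order2", 5.0)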
conn = ec2.connect_to_region(opts.region)
Any suggestions on how to debug the cause of the timeout?
Note: I replaced the name of my keypair with Blah.
Thanks,
David
I'm still bumping up against this issue: Spark (and Shark) are breaking
my inputs into 64MB-sized splits. Anyone know where/how to configure
Spark so that it either doesn't split the inputs, or at least uses a
much larger split size? (E.g., 512MB.)
Thanks,
DR
On 07/15/2014 05:58 PM, David
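One knob that sometimes helps here is the minimum Hadoop input-split size. Treat this as a hedged sketch: the property below belongs to the newer mapreduce API, and whether it takes effect depends on the InputFormat and Hadoop version in use; textFile's minPartitions argument can only increase the number of splits, not reduce it.

// ask Hadoop-based inputs for splits of at least 512 MB
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.minsize",
  (512L * 1024 * 1024).toString)
val lines = sc.textFile("hdfs:///path/to/input")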
));
Thanks in advance, David
at the content inside of the map function or should I
be doing something else entirely?
Thanks
David
.
It may be the case that you don't really need a bunch of RDDs at all,
but can operate on an RDD of pairs of Strings (roots) and
something-elses, all at once.
On Mon, Aug 18, 2014 at 2:31 PM, David Tinker david.tin...@gmail.com
wrote:
Hi All.
I need to create a lot of RDDs starting from
Got a spark/shark cluster up and running recently, and have been kicking
the tires on it. However, been wrestling with an issue on it that I'm
not quite sure how to solve. (Or, at least, not quite sure about the
correct way to solve it.)
I ran a simple Hive query (select count ...) against
For some reason the patch did not make it.
Trying via email:
/D
On May 23, 2014, at 9:52 AM, lemieud david.lemi...@radialpoint.com wrote:
Hi,
I think I found the problem.
In SparkFlumeEvent the readExternal method uses in.read(bodyBuff), which reads
the first 1020 bytes, but no more. The
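For context, a sketch of the general pitfall (the method and names below are illustrative, not the actual SparkFlumeEvent code): ObjectInput.read(buf) may return after filling only part of the buffer, while readFully keeps reading until the buffer is full.

import java.io.ObjectInput

def readBody(in: ObjectInput, length: Int): Array[Byte] = {
  val body = new Array[Byte](length)
  // in.read(body) can stop short (e.g. after one internal buffer's worth of bytes);
  // readFully blocks until all 'length' bytes have been read or EOF is reached
  in.readFully(body)
  body
}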
Created https://issues.apache.org/jira/browse/SPARK-1916
I'll submit a pull request soon.
/D
On May 23, 2014, at 9:56 AM, David Lemieux david.lemi...@radialpoint.com
wrote:
For some reason the patch did not make it.
Trying via email:
/D
On May 23, 2014, at 9:52 AM, lemieud david.lemi
@spark.apache.org
Cc: user@spark.apache.org
Subject: Re: K-means with large K
David,
Just curious to know what kind of use cases demand such large k clusters
Chester
Sent from my iPhone
On Apr 28, 2014, at 9:19 AM, Buttler, David
buttl...@llnl.gov wrote:
Hi,
I am trying
This sounds like a configuration issue. Either you have not set the MASTER
correctly, or possibly another process is using up all of the cores.
Dave
From: ge ko [mailto:koenig@gmail.com]
Sent: Sunday, April 13, 2014 12:51 PM
To: user@spark.apache.org
Subject:
Hi,
I'm still going to start
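For reference, a minimal sketch of the two settings the reply points at, the master URL and a cap on cores; the values are placeholders, not the poster's actual setup.

val conf = new org.apache.spark.SparkConf()
  .setAppName("example")
  .setMaster("spark://master-host:7077") // placeholder master URL
  .set("spark.cores.max", "4")           // cap cores so other applications are not starved
val sc = new org.apache.spark.SparkContext(conf)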
During a Spark stage, how are tasks split among the workers? Specifically
for a HadoopRDD, who determines which worker has to get which task?
What is the difference between checkpointing and caching an RDD?
Can someone explain how RDD is resilient? If one of the partition is lost,
who is responsible to recreate that partition - is it the driver program?
Is there a way to see 'Application Detail UI' page (at master:4040) for
completed applications? Currently, I can see that page only for running
applications, I would like to see various numbers for the application after
it has completed.
That helps! Thank you.
On Fri, Mar 28, 2014 at 12:36 AM, Sonal Goyal sonalgoy...@gmail.com wrote:
Hi David,
I am sorry but your question is not clear to me. Are you talking about
taking some value and sharing it across your cluster so that it is present
on all the nodes? You can look
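If the question is indeed about making a value available on every node (my reading of the reply, not something the thread states explicitly), a minimal broadcast-variable sketch looks like this:

val lookup = Map("a" -> 1, "b" -> 2) // some driver-side value
val lookupBc = sc.broadcast(lookup)  // shipped once to each executor

val rdd = sc.parallelize(Seq("a", "b", "a"))
val resolved = rdd.map(k => lookupBc.value.getOrElse(k, 0))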
What is the concept of Block and BlockManager in Spark? How is a Block
related to a Partition of a RDD?
:
https://spark-project.atlassian.net/browse/SPARK-1021).
David Thomas dt5434...@gmail.com
March 11, 2014 at 9:49 PM
For example, is the distinct() transformation lazy?
When I look at the Spark source code, distinct applies a map -> reduceByKey ->
map chain to the RDD elements. Why is this lazy
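The chain mentioned above can be sketched roughly as follows (a simplified illustration of what the source does, not the exact code); each step is itself a transformation, so the whole pipeline stays lazy until an action such as count or collect runs.

val rdd = sc.parallelize(Seq(1, 2, 2, 3))
// roughly what rdd.distinct() expands to
val deduped = rdd.map(x => (x, null))
  .reduceByKey((a, _) => a) // keep one representative per key
  .map(_._1)
deduped.collect() // the action that finally triggers the computation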
I have an RDD of (K, Array[V]) pairs.
For example: ((key1, (1,2,3)), (key2, (3,2,4)), (key1, (4,3,2)))
How can I do a groupByKey such that I get back an RDD of the form (K,
Array[V]) pairs?
Ex: ((key1, (1,2,3,4,3,2)), (key2, (3,2,4)))
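A sketch of one way to do this with reduceByKey, assuming the values really are Arrays; ++ concatenates them, and reduceByKey merges map-side rather than shuffling every element the way groupByKey would.

val rdd = sc.parallelize(Seq(
  ("key1", Array(1, 2, 3)),
  ("key2", Array(3, 2, 4)),
  ("key1", Array(4, 3, 2))))

val merged = rdd.reduceByKey(_ ++ _)
// merged contains e.g. ("key1", Array(1, 2, 3, 4, 3, 2)), ("key2", Array(3, 2, 4))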