No, I am not collecting the result to the driver; I simply send the result to Kafka.
BTW, the image addresses are:
https://cloud.githubusercontent.com/assets/5170878/7389463/ac03bf34-eea0-11e4-9e6b-1d2fba170c1c.png
and
Hello,
I'm facing a problem with custom RDD transformations.
I would like to transform an RDD[K, V] into a Map[K, RDD[V]], meaning a map
of RDDs by key.
This would be useful, for example, in order to run MLlib clustering on the V
values grouped by K.
I know I could do it using filter() on my RDD
Hi Daniel,
I understood Sébastien was talking about having a high number of keys; I
guess I was prejudiced by my own problem! :) Anyway, I don't think you need
to use disk or a database to generate an RDD per key. You can use filter,
which I guess would be more efficient because IO is avoided,
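Something like this (an untested sketch; splitByKey is just an illustrative
name, not an existing API, and it assumes the set of distinct keys is small
enough to collect to the driver):

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Build one filtered RDD per distinct key.
    def splitByKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]): Map[K, RDD[V]] = {
      // Collect the distinct keys to the driver (assumed to be few).
      val keys = rdd.keys.distinct().collect()
      // For each key, keep only its pairs and drop the key column.
      keys.map(k => k -> rdd.filter(_._1 == k).values).toMap
    }

Note that each filter call re-scans the parent RDD, which is the efficiency
concern raised later in the thread.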
Check out http://stackoverflow.com/a/26051042/3318517. It's a nice method
for saving the RDD into separate files by key in a single pass. Then you
can read the files into separate RDDs.
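The core of it looks roughly like this (a from-memory sketch, not the exact
code from the post; RDDMultipleTextOutputFormat and the String key/value
types are illustrative):

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.rdd.RDD

    // One output file per key, written in a single pass.
    class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      // Drop the key from the file contents...
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()
      // ...and use it as the file name instead.
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.asInstanceOf[String]
    }

    def saveByKey(rdd: RDD[(String, String)], dir: String): Unit =
      rdd.saveAsHadoopFile(dir, classOf[String], classOf[String],
        classOf[RDDMultipleTextOutputFormat])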
On Wed, Apr 29, 2015 at 2:10 PM, Juan Rodríguez Hortalá
juan.rodriguez.hort...@gmail.com wrote:
Hi
Hi all, we are migrating our Hive jobs to Spark SQL, but we found a little
difference between Hive and Spark SQL: our SQL has a statement like:
select A from B where id regexp '^12345$'
in Hive it works fine, but in Spark SQL we got a:
java.lang.ClassCastException: java.lang.Long cannot be
You can check JIRA for any existing plans. If there isn't any, then feel
free to create a JIRA and make the case there for why this would be a good
feature to add.
Nick
On Wed, Apr 29, 2015 at 7:30 AM Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
Hi,
Is there any plan to add the
I can't comment on the direction of the DataFrame API (that's more for
Reynold or Michael I guess), but I just wanted to point out that the JIRA
would be the recommended way to create a central place for discussing a
feature add like that.
Nick
On Wed, Apr 29, 2015 at 3:43 PM Olivier Girardot
To give you a broader idea of the current use case, I have a few
transformations (sort and column creations) oriented towards a simple goal.
My data is timestamped, and if two lines are otherwise identical, the time
difference between them has to be more than X days for them to be kept, so
there are a few shifts
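For illustration, the kind of shift I mean would look roughly like this
(made-up df/column names and X = 30 days; this leans on window functions,
which may not be available in the DataFrame API yet):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, lag, unix_timestamp}

    // Keep a row only if the previous identical row is more than X days older.
    val w = Window.partitionBy("key").orderBy("ts")
    val xDays = 30L * 24 * 60 * 60  // X days, in seconds
    val kept = df
      .withColumn("prevTs", lag(col("ts"), 1).over(w))
      .where(col("prevTs").isNull ||
             unix_timestamp(col("ts")) - unix_timestamp(col("prevTs")) > xDays)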
I guess you can use cast(id as String) instead of just id in your where
clause?
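i.e. something like (untested, assuming a Hive-backed sqlContext):

    // Cast the bigint column to a string before the regexp match
    // (table/column names taken from the query in the thread).
    val result = sqlContext.sql(
      "SELECT A FROM B WHERE CAST(id AS STRING) REGEXP '^12345$'")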
On Wed, Apr 29, 2015 at 12:13, lonely Feb lonely8...@gmail.com wrote:
Hi all, we are migrating our Hive jobs to Spark SQL, but we found a little
difference between Hive and Spark SQL: our SQL has a statement
Spark currently doesn't allocate any memory off of the heap for shuffle
objects. When the in-memory data gets too large, it will write it out to a
file, and then merge the spilled files later.
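For reference, the knobs that control this in Spark 1.x (values and path
here are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Fraction of the heap used for shuffle data before spilling.
      .set("spark.shuffle.memoryFraction", "0.2")
      // Local directory where spill and shuffle files land.
      .set("spark.local.dir", "/mnt/spark-local")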
What exactly do you mean by store shuffle data in HDFS?
-Sandy
On Tue, Apr 14, 2015 at 10:15 AM, Kannan
In this case it's fine to discuss whether this would fit in Spark
DataFrames' high level direction before putting it in JIRA. Otherwise we
might end up creating a lot of tickets just for querying whether something
might be a good idea.
About this specific feature -- I'm not sure what it means in
In general there's a tension between ordered data and the set-oriented data
model underlying DataFrames. You can force a total ordering on the data,
but it may come at a high cost with respect to performance.
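For example (df and the column name are illustrative):

    import org.apache.spark.sql.functions.col

    // orderBy triggers a full range-partitioned shuffle and sort across
    // the cluster, which is the cost mentioned above.
    val totallyOrdered = df.orderBy(col("timestamp"))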
It would be good to get a sense of the use case you're trying to support,
but one suggestion
Scaladoc isn't much of a problem because scaladocs are grouped. Java/Python
is the main problem ...
See
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
On Wed, Apr 29, 2015 at 3:38 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
My feeling
I agree, Ewan.
We should also look into combining Flink and Spark into one. This would
ease industry adoption.
Thanking you.
With Regards
Sree
On Wednesday, April 29, 2015 3:21 AM, Ewan Higgs ewan.hi...@ugent.be
wrote:
Hi all,
A quick question about Tungsten. The
To add a little bit more context, some pros/cons I can think of are:
Option 1: Very easy for users to find the function, since they are all in
org.apache.spark.sql.functions. However, there will be quite a large number
of them.
Option 2: I can't tell why we would want this one over Option 3,
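For Option 1, the user experience would be a single flat import (df and the
column names are illustrative; upper and abs already exist in that package):

    // One import exposes every function in the flat namespace.
    import org.apache.spark.sql.functions._

    val projected = df.select(upper(col("name")), abs(col("delta")))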
Actually I'm doing some cleanups related to type coercion, and I will take
care of this.
On Wed, Apr 29, 2015 at 5:10 PM, lonely Feb lonely8...@gmail.com wrote:
OK, I'll try.
On Apr 30, 2015 06:54, Reynold Xin r...@databricks.com wrote:
We added ExpectedInputConversion rule recently in
Before we make DataFrame non-alpha, it would be great to decide how we want
to namespace all the functions. There are 3 alternatives:
1. Put all in org.apache.spark.sql.functions. This is how SQL does it,
since SQL doesn't have namespaces. I estimate eventually we will have ~ 200
functions.
2.
My feeling is that we should have a handful of namespaces (say 4 or 5). It
becomes too cumbersome to import / remember more package names, and having
everything in one package makes it hard to read the scaladoc, etc.
Thanks
Shivaram
On Wed, Apr 29, 2015 at 3:30 PM, Reynold Xin r...@databricks.com
Hi Juan, Daniel,
thank you for your explanations. Indeed, I don't have a big number of keys,
at least not enough to get the scheduler stuck.
I was using a method quite similar to what you posted, Juan, and yes it
works, but I think it would be more efficient not to call filter on each
key. So, I was
I am working on it. Here is the (very rough) version:
https://github.com/apache/spark/compare/apache:master...marmbrus:multiHiveVersions
On Mon, Apr 27, 2015 at 1:03 PM, Punyashloka Biswal punya.bis...@gmail.com
wrote:
Thanks Marcelo and Patrick - I don't know how I missed that ticket in my
We definitely still have the name collision problem in SQL.
On Wed, Apr 29, 2015 at 10:01 PM, Punyashloka Biswal punya.bis...@gmail.com
wrote:
Do we still have to keep the names of the functions distinct to avoid
collisions in SQL? Or is there a plan to allow importing a namespace into
SQL
I have the real DEBS taxi data in a CSV file. In order to operate over it,
how can I simulate a Spout-like event generator using the timestamps in the
CSV file?
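What I have in mind is roughly this kind of replay loop (an untested
sketch; it assumes the timestamp is in column 0 with the format shown, and
emit would push each line to whatever the streaming job consumes):

    import java.text.SimpleDateFormat
    import scala.io.Source

    // Replay CSV rows in file order, sleeping for the gap between
    // consecutive timestamps so events arrive at their original pace.
    def replay(path: String, emit: String => Unit): Unit = {
      val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
      var prev = -1L
      for (line <- Source.fromFile(path).getLines()) {
        val ts = fmt.parse(line.split(",")(0)).getTime
        if (prev >= 0) Thread.sleep(math.max(0L, ts - prev))
        emit(line)
        prev = ts
      }
    }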
--
SERC-IISC
Thanks Regards,
Anshu Shukla