Re: Re: Driver memory leak?

2015-04-29 Thread wyphao.2007
No, I am not collecting the result to the driver; I simply send the result to Kafka. BTW, the image addresses are: https://cloud.githubusercontent.com/assets/5170878/7389463/ac03bf34-eea0-11e4-9e6b-1d2fba170c1c.png and

RDD split into multiple RDDs

2015-04-29 Thread Sébastien Soubré-Lanabère
Hello, I'm facing a problem with custom RDD transformations. I would like to transform an RDD[K, V] into a Map[K, RDD[V]], meaning a map of RDDs by key. This would be great, for example, in order to run MLlib clustering on the V values grouped by K. I know I could do it using filter() on my RDD
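The filter() approach mentioned here scans the data once per distinct key. A minimal local sketch of that pattern, with plain Python lists standing in for RDDs (in Spark this would be `rdd.filter(lambda kv: kv[0] == key)` per key; nothing below is actual Spark API):

```python
# Local stand-in for "split an RDD of (K, V) pairs into one dataset per key
# by filtering once per distinct key" -- the approach discussed in the thread.

def split_by_key(pairs):
    """Return {key: [values]}, scanning the data once per distinct key."""
    keys = {k for k, _ in pairs}
    return {k: [v for kk, v in pairs if kk == k] for k in keys}

data = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
print(split_by_key(data))   # {'a': [1, 3], 'b': [2], 'c': [4]}
```

As the later replies point out, this costs one full pass per key, which is why the thread discusses single-pass alternatives.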

Re: RDD split into multiple RDDs

2015-04-29 Thread Juan Rodríguez Hortalá
Hi Daniel, I understood Sébastien was talking about having a high number of keys; I guess I was prejudiced by my own problem! :) Anyway, I don't think you need to use disk or a database to generate an RDD per key; you can use filter, which I guess would be more efficient because IO is avoided,

Re: RDD split into multiple RDDs

2015-04-29 Thread Daniel Darabos
Check out http://stackoverflow.com/a/26051042/3318517. It's a nice method for saving the RDD into separate files by key in a single pass. Then you can read the files into separate RDDs. On Wed, Apr 29, 2015 at 2:10 PM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi
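The linked answer writes each key's records to its own file in a single pass, then reads each file back as its own dataset. A local sketch of that idea (in Spark the write step would use `saveAsHadoopFile` with a `MultipleTextOutputFormat` subclass, as in the Stack Overflow answer; the file naming here is hypothetical):

```python
# Single-pass split-by-key via per-key files: one scan to write, then each
# key's file can be read back independently (locally, as its own list;
# in Spark, as its own RDD).
import os
import tempfile

def write_by_key(pairs, out_dir):
    handles = {}
    try:
        for k, v in pairs:                      # one pass over all the data
            if k not in handles:
                handles[k] = open(os.path.join(out_dir, f"{k}.txt"), "w")
            handles[k].write(f"{v}\n")
    finally:
        for h in handles.values():
            h.close()

def read_key(out_dir, key):
    with open(os.path.join(out_dir, f"{key}.txt")) as f:
        return [line.strip() for line in f]

out = tempfile.mkdtemp()
write_by_key([("a", 1), ("b", 2), ("a", 3)], out)
print(read_key(out, "a"))   # ['1', '3']
```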

Spark SQL cannot tolerate regexp with BIGINT

2015-04-29 Thread lonely Feb
Hi all, we are transferring our Hive jobs to Spark SQL, but we found a little difference between Hive and Spark SQL: our SQL has a statement like: select A from B where id regexp '^12345$' In Hive it works fine, but in Spark SQL we got a: java.lang.ClassCastException: java.lang.Long cannot be

Re: Pandas' Shift in Dataframe

2015-04-29 Thread Nicholas Chammas
You can check JIRA for any existing plans. If there isn't any, then feel free to create a JIRA and make the case there for why this would be a good feature to add. Nick On Wed, Apr 29, 2015 at 7:30 AM Olivier Girardot o.girar...@lateral-thoughts.com wrote: Hi, Is there any plan to add the

Re: Pandas' Shift in Dataframe

2015-04-29 Thread Nicholas Chammas
I can't comment on the direction of the DataFrame API (that's more for Reynold or Michael I guess), but I just wanted to point out that the JIRA would be the recommended way to create a central place for discussing a feature add like that. Nick On Wed, Apr 29, 2015 at 3:43 PM Olivier Girardot

Re: Pandas' Shift in Dataframe

2015-04-29 Thread Olivier Girardot
To give you a broader idea of the current use case: I have a few transformations (sorts and column creations) oriented towards a simple goal. My data is timestamped, and if two lines are identical, their time difference has to be more than X days for a line to be kept, so there are a few shifts

Re: Spark SQL cannot tolerate regexp with BIGINT

2015-04-29 Thread Olivier Girardot
I guess you can use cast(id as String) instead of just id in your where clause? On Wed, Apr 29, 2015 at 12:13, lonely Feb lonely8...@gmail.com wrote: Hi all, we are transferring our Hive jobs to Spark SQL, but we found a little difference between Hive and Spark SQL: our SQL has a statement
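The suggested workaround can be illustrated outside Spark SQL as well: cast the numeric column to a string before applying the regexp. The sketch below uses SQLite (which has no built-in REGEXP, so one is registered by hand) with a made-up table, purely to demonstrate the cast, not Spark SQL's behavior:

```python
# Demonstrate "regexp on cast(id as string)" with SQLite.
# SQLite evaluates  X REGEXP Y  as regexp(Y, X), i.e. regexp(pattern, value).
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.create_function("regexp", 2,
                     lambda pat, val: re.search(pat, val) is not None)
conn.execute("CREATE TABLE B (id INTEGER, A TEXT)")
conn.executemany("INSERT INTO B VALUES (?, ?)", [(12345, "x"), (67890, "y")])

# cast(id AS TEXT) makes the regexp operate on a string, mirroring the
# "cast(id as String)" suggestion for Spark SQL.
rows = conn.execute(
    "SELECT A FROM B WHERE cast(id AS TEXT) REGEXP '^12345$'"
).fetchall()
print(rows)   # [('x',)]
```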

Re: Using memory mapped file for shuffle

2015-04-29 Thread Sandy Ryza
Spark currently doesn't allocate any memory off of the heap for shuffle objects. When the in-memory data gets too large, it will write it out to a file, and then merge the spilled files later. What exactly do you mean by "store shuffle data in HDFS"? -Sandy On Tue, Apr 14, 2015 at 10:15 AM, Kannan

Re: Pandas' Shift in Dataframe

2015-04-29 Thread Reynold Xin
In this case it's fine to discuss whether this would fit in Spark DataFrames' high level direction before putting it in JIRA. Otherwise we might end up creating a lot of tickets just for querying whether something might be a good idea. About this specific feature -- I'm not sure what it means in

Re: Pandas' Shift in Dataframe

2015-04-29 Thread Evan R. Sparks
In general, there's a tension between ordered data and the set-oriented data model underlying DataFrames. You can force a total ordering on the data, but it may come at a high cost with respect to performance. It would be good to get a sense of the use case you're trying to support, but one suggestion

Re: [discuss] DataFrame function namespacing

2015-04-29 Thread Reynold Xin
Scaladoc isn't much of a problem because scaladocs are grouped. Java/Python is the main problem ... See https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$ On Wed, Apr 29, 2015 at 3:38 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: My feeling

Re: Tungsten + Flink

2015-04-29 Thread Sree V
I agree, Ewan. We should also look into combining both Flink and Spark into one. This would ease industry adoption instead. Thanking you. With Regards Sree On Wednesday, April 29, 2015 3:21 AM, Ewan Higgs ewan.hi...@ugent.be wrote: Hi all, A quick question about Tungsten. The

Re: [discuss] DataFrame function namespacing

2015-04-29 Thread Reynold Xin
To add a little bit more context, some pros/cons I can think of are: Option 1: Very easy for users to find the function, since they are all in org.apache.spark.sql.functions. However, there will be quite a large number of them. Option 2: I can't tell why we would want this one over Option 3,

Re: Spark SQL cannot tolerate regexp with BIGINT

2015-04-29 Thread Reynold Xin
Actually I'm doing some cleanups related to type coercion, and I will take care of this. On Wed, Apr 29, 2015 at 5:10 PM, lonely Feb lonely8...@gmail.com wrote: OK, I'll try. On Apr 30, 2015 06:54, Reynold Xin r...@databricks.com wrote: We added ExpectedInputConversion rule recently in

[discuss] DataFrame function namespacing

2015-04-29 Thread Reynold Xin
Before we make DataFrame non-alpha, it would be great to decide how we want to namespace all the functions. There are 3 alternatives: 1. Put all in org.apache.spark.sql.functions. This is how SQL does it, since SQL doesn't have namespaces. I estimate eventually we will have ~ 200 functions. 2.

Re: [discuss] DataFrame function namespacing

2015-04-29 Thread Shivaram Venkataraman
My feeling is that we should have a handful of namespaces (say 4 or 5). It becomes too cumbersome to import / remember more package names and having everything in one package makes it hard to read scaladoc etc. Thanks Shivaram On Wed, Apr 29, 2015 at 3:30 PM, Reynold Xin r...@databricks.com

Re: RDD split into multiple RDDs

2015-04-29 Thread Sébastien Soubré-Lanabère
Hi Juan, Daniel, thank you for your explanations. Indeed, I don't have a big number of keys, at least not enough to get the scheduler stuck. I was using a method quite similar to what you posted, Juan, and yes it works, but I think it would be more efficient not to call filter on each key. So, I was

Re: Plans for upgrading Hive dependency?

2015-04-29 Thread Michael Armbrust
I am working on it. Here is the (very rough) version: https://github.com/apache/spark/compare/apache:master...marmbrus:multiHiveVersions On Mon, Apr 27, 2015 at 1:03 PM, Punyashloka Biswal punya.bis...@gmail.com wrote: Thanks Marcelo and Patrick - I don't know how I missed that ticket in my

Re: [discuss] DataFrame function namespacing

2015-04-29 Thread Reynold Xin
We definitely still have the name collision problem in SQL. On Wed, Apr 29, 2015 at 10:01 PM, Punyashloka Biswal punya.bis...@gmail.com wrote: Do we still have to keep the names of the functions distinct to avoid collisions in SQL? Or is there a plan to allow importing a namespace into SQL

Event generator for SPARK-Streaming from csv

2015-04-29 Thread anshu shukla
I have the real DEBS taxi data in a CSV file. In order to operate over it, how can I simulate a Spout-like event generator using the timestamps in the CSV file? -- SERC-IISC Thanks & Regards, Anshu Shukla
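One common way to do this is to sort the rows by timestamp and sleep out the gap between consecutive timestamps before emitting each row. A minimal sketch, assuming the first CSV field is a numeric timestamp (the real DEBS taxi schema differs; this layout is hypothetical), with a speedup factor so replays can run faster than real time:

```python
# Replay CSV rows as timed events: sleep the inter-arrival gap between
# consecutive timestamps, scaled by a speedup factor (speedup=0 emits
# everything immediately, which is handy for testing).
import csv
import io
import time

def replay(csv_text, emit, speedup=1.0):
    rows = sorted(csv.reader(io.StringIO(csv_text)), key=lambda r: float(r[0]))
    prev_ts = None
    for row in rows:
        ts = float(row[0])
        if prev_ts is not None and speedup > 0:
            time.sleep((ts - prev_ts) / speedup)   # wait out the gap
        prev_ts = ts
        emit(row)

events = []
replay("3.0,c\n1.0,a\n2.0,b\n", events.append, speedup=0)
print([r[1] for r in events])   # ['a', 'b', 'c']
```

In a Spark Streaming setup, `emit` would push each row to a socket or a Kafka topic that the streaming job consumes.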