Re: Custom PersistanceEngine and LeaderAgent implementation in Java

2015-04-29 Thread Reynold Xin
We should change the trait to an abstract class, and then your problem will go away. Do you want to submit a pull request? On Wed, Apr 29, 2015 at 11:02 PM, Niranda Perera wrote: > Hi, > > this is a follow-up on the feature in [1] > > I'm trying to implement a custom persistence engi

Custom PersistanceEngine and LeaderAgent implementation in Java

2015-04-29 Thread Niranda Perera
Hi, this is a follow-up on the feature in [1]. I'm trying to implement a custom persistence engine and a leader agent in the Java environment. Vis-à-vis Scala, when I implement the PersistenceEngine trait in Java, I would have to implement methods such as readPersistedData, removeDri

Event generator for SPARK-Streaming from csv

2015-04-29 Thread anshu shukla
I have the real DEBS-Taxi data in a CSV file. In order to operate over it, how do I simulate a "Spout"-like event generator using the timestamps in the CSV file? -- SERC-IISC Thanks & Regards, Anshu Shukla
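The replay idea asked about above can be sketched in plain Python: read the CSV, then sleep between rows according to the gaps in the timestamp column (column names here are hypothetical, not from the actual DEBS dataset).

```python
import csv
import io
import time

def replay_events(csv_text, emit, speedup=1.0):
    """Replay CSV rows as events, sleeping according to timestamp gaps.

    A minimal sketch: `timestamp` is an assumed column name, and
    `speedup` lets a simulation run faster than real time.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    prev_ts = None
    for row in rows:
        ts = float(row["timestamp"])
        if prev_ts is not None:
            time.sleep(max(0.0, (ts - prev_ts) / speedup))
        prev_ts = ts
        emit(row)  # e.g. push into a Spark Streaming source / socket

events = []
data = "timestamp,taxi_id\n0.0,a\n0.1,b\n0.2,a\n"
replay_events(data, events.append, speedup=1000.0)
```

In practice `emit` would write to whatever source the streaming job reads from (a socket or a Kafka topic, say) rather than a list.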

Re: [discuss] DataFrame function namespacing

2015-04-29 Thread Reynold Xin
We definitely still have the name collision problem in SQL. On Wed, Apr 29, 2015 at 10:01 PM, Punyashloka Biswal wrote: > Do we still have to keep the names of the functions distinct to avoid > collisions in SQL? Or is there a plan to allow "importing" a namespace into > SQL somehow? > > I ask b

Re: [discuss] DataFrame function namespacing

2015-04-29 Thread Punyashloka Biswal
Do we still have to keep the names of the functions distinct to avoid collisions in SQL? Or is there a plan to allow "importing" a namespace into SQL somehow? I ask because if we have to keep worrying about name collisions then I'm not sure what the added complexity of #2 and #3 buys us. Punya On

Re: Plans for upgrading Hive dependency?

2015-04-29 Thread Michael Armbrust
I am working on it. Here is the (very rough) version: https://github.com/apache/spark/compare/apache:master...marmbrus:multiHiveVersions On Mon, Apr 27, 2015 at 1:03 PM, Punyashloka Biswal wrote: > Thanks Marcelo and Patrick - I don't know how I missed that ticket in my > Jira search earlier. I

Re: Spark SQL cannot tolerate regexp with BIGINT

2015-04-29 Thread Reynold Xin
Actually I'm doing some cleanups related to type coercion, and I will take care of this. On Wed, Apr 29, 2015 at 5:10 PM, lonely Feb wrote: > OK, I'll try. > On Apr 30, 2015 06:54, "Reynold Xin" wrote: > >> We added ExpectedInputConversion rule recently in analysis: >> https://github.com/apach

Re: Spark SQL cannot tolerate regexp with BIGINT

2015-04-29 Thread lonely Feb
OK, I'll try. On Apr 30, 2015 06:54, "Reynold Xin" wrote: > We added ExpectedInputConversion rule recently in analysis: > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala#L647 > > With this rule, the analyzer aut

Re: Tungsten + Flink

2015-04-29 Thread Sree V
I agree, Ewan. We should also look into combining both Flink and Spark into one. This would ease industry adoption. Thanking you. With Regards Sree On Wednesday, April 29, 2015 3:21 AM, Ewan Higgs wrote: Hi all, A quick question about Tungsten. The announcement of the Tun

Re: Spark SQL cannot tolerate regexp with BIGINT

2015-04-29 Thread Reynold Xin
We added ExpectedInputConversion rule recently in analysis: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala#L647 With this rule, the analyzer automatically adds cast for expressions that inherit ExpectsInputTypes
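The effect of such an analysis rule can be sketched with a toy model: an expression that declares expected input types gets mismatched children wrapped in a cast. The class and attribute names below are illustrative, not Catalyst's actual API.

```python
# Toy model of an analyzer rule that inserts implicit casts for
# expressions declaring expected input types (names are hypothetical).

class Cast:
    def __init__(self, child, to_type):
        self.child, self.to_type = child, to_type

class Column:
    def __init__(self, name, dtype):
        self.name, self.dtype = name, dtype

class RegexpMatch:
    # the expression declares what input types it expects
    expected_input_types = ("string", "string")
    def __init__(self, *children):
        self.children = list(children)

def add_implicit_casts(expr):
    """Wrap any child whose type differs from the expected one in a Cast."""
    for i, (child, expected) in enumerate(
            zip(expr.children, expr.expected_input_types)):
        if child.dtype != expected:
            expr.children[i] = Cast(child, expected)
    return expr

# `id regexp '^12345$'` with id: bigint -> the bigint child gets a cast
expr = add_implicit_casts(
    RegexpMatch(Column("id", "bigint"), Column("pat", "string")))
```

This is the shape of the fix being discussed: the user's query is untouched, and the analyzer rewrites the plan so string-only expressions never see a Long.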

Re: [discuss] DataFrame function namespacing

2015-04-29 Thread Reynold Xin
Scaladoc isn't much of a problem because scaladocs are grouped. Java/Python is the main problem ... See https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$ On Wed, Apr 29, 2015 at 3:38 PM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > My feeli

Re: [discuss] DataFrame function namespacing

2015-04-29 Thread Shivaram Venkataraman
My feeling is that we should have a handful of namespaces (say 4 or 5). It becomes too cumbersome to import / remember more package names and having everything in one package makes it hard to read scaladoc etc. Thanks Shivaram On Wed, Apr 29, 2015 at 3:30 PM, Reynold Xin wrote: > To add a littl

Re: [discuss] DataFrame function namespacing

2015-04-29 Thread Reynold Xin
To add a little bit more context, some pros/cons I can think of are: Option 1: Very easy for users to find the function, since they are all in org.apache.spark.sql.functions. However, there will be quite a large number of them. Option 2: I can't tell why we would want this one over Option 3, sinc

[discuss] DataFrame function namespacing

2015-04-29 Thread Reynold Xin
Before we make DataFrame non-alpha, it would be great to decide how we want to namespace all the functions. There are 3 alternatives: 1. Put all in org.apache.spark.sql.functions. This is how SQL does it, since SQL doesn't have namespaces. I estimate eventually we will have ~ 200 functions. 2. Ha

Re: Using memory mapped file for shuffle

2015-04-29 Thread Sandy Ryza
Spark currently doesn't allocate any memory off the heap for shuffle objects. When the in-memory data gets too large, it will write it out to a file, and then merge the spilled files later. What exactly do you mean by storing shuffle data in HDFS? -Sandy On Tue, Apr 14, 2015 at 10:15 AM, Kannan Ra
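The spill-then-merge behavior described above can be sketched in a few lines: buffer records in memory, spill sorted runs to temp files when the buffer fills, and merge all runs in one pass at the end. This is a toy illustration of the idea, not Spark's ExternalSorter.

```python
import heapq
import os
import pickle
import tempfile

class ToyExternalSorter:
    """Spill-and-merge sketch: sorted runs go to temp files when the
    in-memory buffer is full; sorted_iter() merges all runs lazily."""

    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self.buffer, self.spill_files = [], []

    def insert(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        f = tempfile.NamedTemporaryFile(delete=False)
        for rec in sorted(self.buffer):
            pickle.dump(rec, f)
        f.close()
        self.spill_files.append(f.name)
        self.buffer = []

    def _read_run(self, name):
        with open(name, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    break
        os.unlink(name)

    def sorted_iter(self):
        runs = [sorted(self.buffer)]
        runs += [self._read_run(name) for name in self.spill_files]
        return heapq.merge(*runs)

sorter = ToyExternalSorter(max_in_memory=3)
for x in [5, 1, 4, 2, 8, 3, 7]:
    sorter.insert(x)
result = list(sorter.sorted_iter())  # merged across two spills + buffer
```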

Re: Spark SQL cannot tolerate regexp with BIGINT

2015-04-29 Thread Olivier Girardot
I guess you can use cast(id as String) instead of just id in your where clause? On Wed, Apr 29, 2015 at 12:13 PM, lonely Feb wrote: > Hi all, we are transferring our HIVE jobs into Spark SQL, but we found a little > difference between HIVE and Spark SQL in that our sql has a statement like: > > select A

Re: Pandas' Shift in Dataframe

2015-04-29 Thread Olivier Girardot
To give you a broader idea of the current use case, I have a few transformations (sort and column creations) oriented towards a simple goal. My data is timestamped and if two lines are identical, that time difference will have to be more than X days in order to be kept, so there are a few shifts do
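The filter described above can be sketched single-machine style: keep a row only if the previous row with the same key was seen more than X days earlier (keys, the 30-day threshold, and "update last-seen even for dropped rows" are all assumptions for illustration).

```python
from datetime import datetime, timedelta

def keep_spaced(rows, min_gap=timedelta(days=30)):
    """Keep a (key, timestamp) row only if the previous identical key
    was seen more than min_gap earlier -- a sketch of the dedup rule."""
    last_seen = {}
    kept = []
    for key, ts in sorted(rows, key=lambda r: r[1]):
        prev = last_seen.get(key)
        if prev is None or ts - prev > min_gap:
            kept.append((key, ts))
        last_seen[key] = ts  # assumed: dropped rows still move the clock
    return kept

rows = [
    ("a", datetime(2015, 1, 1)),
    ("a", datetime(2015, 1, 10)),  # only 9 days after the previous "a"
    ("a", datetime(2015, 3, 1)),
    ("b", datetime(2015, 1, 5)),
]
kept = keep_spaced(rows)
```

In a DataFrame setting this is the classic lag/shift-within-a-sorted-group pattern the thread is circling around, which is exactly where the ordering-versus-set-semantics tension comes in.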

Re: Pandas' Shift in Dataframe

2015-04-29 Thread Evan R. Sparks
In general there's a tension between ordered data and set-oriented data model underlying DataFrames. You can force a total ordering on the data, but it may come at a high cost with respect to performance. It would be good to get a sense of the use case you're trying to support, but one suggestion

Re: Pandas' Shift in Dataframe

2015-04-29 Thread Reynold Xin
In this case it's fine to discuss whether this would fit in Spark DataFrames' high level direction before putting it in JIRA. Otherwise we might end up creating a lot of tickets just for querying whether something might be a good idea. About this specific feature -- I'm not sure what it means in g

Re: Pandas' Shift in Dataframe

2015-04-29 Thread Nicholas Chammas
I can't comment on the direction of the DataFrame API (that's more for Reynold or Michael I guess), but I just wanted to point out that the JIRA would be the recommended way to create a central place for discussing a feature add like that. Nick On Wed, Apr 29, 2015 at 3:43 PM Olivier Girardot < o

Re: Pandas' Shift in Dataframe

2015-04-29 Thread Olivier Girardot
Hi Nicholas, yes I've already checked, and I've just created the https://issues.apache.org/jira/browse/SPARK-7247 I'm not even sure why this would be a good feature to add except the fact that some of the data scientists I'm working with are using it, and it would be therefore useful for me to tran

Re: Pandas' Shift in Dataframe

2015-04-29 Thread Nicholas Chammas
You can check JIRA for any existing plans. If there isn't any, then feel free to create a JIRA and make the case there for why this would be a good feature to add. Nick On Wed, Apr 29, 2015 at 7:30 AM Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > Hi, > Is there any plan to add the

Re: Spilling when not expected

2015-04-29 Thread Tom Hubregtsen
Hi Reynold, It took me some time, but I've finally found that there is a difference between spilling on the map side and spilling on the reduce side for a shuffle. Spilling to disk on the map side happens by default (with the spillToPartitionFiles call from insertAll in ExternalSorter; don't know

Re: RDD split into multiple RDDs

2015-04-29 Thread Sébastien Soubré-Lanabère
Hi Juan, Daniel, thank you for your explanations. Indeed, I don't have a big number of keys, at least not enough to get the scheduler stuck. I was using a method quite similar to what you posted, Juan, and yes it works, but I think it would be more efficient not to call filter on each key. So, I was

Re: RDD split into multiple RDDs

2015-04-29 Thread Juan Rodríguez Hortalá
Hi Daniel, I understood Sébastien was talking about having a high number of keys, I guess I was prejudiced by my own problem! :) Anyway I don't think you need to use disk or a database to generate an RDD per key; you can use filter, which I guess would be more efficient because IO is avoided, espec

Re: RDD split into multiple RDDs

2015-04-29 Thread Daniel Darabos
Check out http://stackoverflow.com/a/26051042/3318517. It's a nice method for saving the RDD into separate files by key in a single pass. Then you can read the files into separate RDDs. On Wed, Apr 29, 2015 at 2:10 PM, Juan Rodríguez Hortalá < juan.rodriguez.hort...@gmail.com> wrote: > Hi Sébasti

Re: RDD split into multiple RDDs

2015-04-29 Thread Juan Rodríguez Hortalá
Hi Sébastien, I came with a similar problem some time ago, you can see the discussion in the Spark users mailing list at http://markmail.org/message/fudmem4yy63p62ar#query:+page:1+mid:qv4gw6czf6lb6hpq+state:results . My experience was that when you create too many RDDs the Spark scheduler gets stu

Pandas' Shift in Dataframe

2015-04-29 Thread Olivier Girardot
Hi, Is there any plan to add the "shift" method from Pandas to Spark Dataframe, not that I think it's an easy task... c.f. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html Regards, Olivier.
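For readers unfamiliar with it, Pandas' `shift` semantics can be illustrated in plain Python: move values forward or backward by `periods` positions and fill the vacated slots.

```python
def shift(values, periods=1, fill_value=None):
    """Shift a list by `periods`, filling vacated slots -- a plain-Python
    illustration of what pandas.DataFrame.shift does column-wise."""
    n = len(values)
    if periods >= 0:
        return [fill_value] * min(periods, n) + values[:n - periods]
    return values[-periods:] + [fill_value] * min(-periods, n)

shift([1, 2, 3, 4], 1)   # [None, 1, 2, 3]
shift([1, 2, 3, 4], -2)  # [3, 4, None, None]
```

The reason this is non-trivial for Spark DataFrames, as the replies note, is that a shift presumes a total row order, which a distributed, set-oriented DataFrame does not have for free.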

Re: Re: Driver memory leak?

2015-04-29 Thread wyphao.2007
No, I am not collecting the result to the driver; I simply send the result to Kafka. BTW, the image addresses are: https://cloud.githubusercontent.com/assets/5170878/7389463/ac03bf34-eea0-11e4-9e6b-1d2fba170c1c.png and https://cloud.githubusercontent.com/assets/5170878/7389480/c629d236-eea0-11e4-983a-dc5

RDD split into multiple RDDs

2015-04-29 Thread Sébastien Soubré-Lanabère
Hello, I'm facing a problem with custom RDD transformations. I would like to transform an RDD[K, V] into a Map[K, RDD[V]], meaning a map of RDDs by key. This would be great, for example, in order to run MLlib clustering on the V values grouped by K. I know I could do it using filter() on my RDD a
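The driver-side analogue of the requested transformation is a single-pass group-by, sketched below in plain Python. With real RDDs the thread's options are one `filter()` per key, or writing the data out partitioned by key and re-reading each part; neither gives a true Map[K, RDD[V]] in one pass.

```python
from collections import defaultdict

def split_by_key(pairs):
    """Group (K, V) pairs into a dict of lists in a single pass --
    a local sketch of the Map[K, RDD[V]] shape being discussed."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

groups = split_by_key([("a", 1), ("b", 2), ("a", 3)])
```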

Driver memory leak?

2015-04-29 Thread wyphao.2007
Hi, Dear developer, I am using Spark Streaming to read data from Kafka. The program has already run for about 120 hours, but today it failed because of a driver OOM, as follows: Container [pid=49133,containerID=container_1429773909253_0050_02_01] is running beyond physical memory limits. Cu

Tungsten + Flink

2015-04-29 Thread Ewan Higgs
Hi all, A quick question about Tungsten. The announcement of the Tungsten project is on the back of Hadoop Summit in Brussels where some of the Flink devs were giving talks [1] on how Flink manages memory using byte arrays and the like to avoid the overhead of all the Java types[2]. Is there a
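The byte-array memory management Ewan refers to can be illustrated with a toy: pack fixed-width records into one flat bytearray instead of a list of boxed objects, so there is one buffer to manage rather than millions of heap objects. This is a sketch of the idea, not Flink's or Tungsten's actual layout.

```python
import struct

# One int64 + one float64 per record, stored contiguously.
RECORD = struct.Struct("<qd")

class RecordBuffer:
    """Fixed-width records in a flat bytearray -- a toy version of the
    byte-array memory management described in the Flink talks."""

    def __init__(self, capacity):
        self.buf = bytearray(capacity * RECORD.size)
        self.count = 0

    def append(self, key, value):
        RECORD.pack_into(self.buf, self.count * RECORD.size, key, value)
        self.count += 1

    def get(self, i):
        return RECORD.unpack_from(self.buf, i * RECORD.size)

buf = RecordBuffer(capacity=100)
buf.append(42, 3.5)
buf.append(7, -1.0)
```

Because records are fixed-width bytes, sorting or spilling can operate on offsets into the buffer without deserializing, which is the GC and cache-locality win both projects are after.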

Re: hive initialization on executors

2015-04-29 Thread Manku Timma
The problem was in my hive-13 branch. So ignore this. On 27 April 2015 at 10:34, Manku Timma wrote: > I am facing an exception "Hive.get() called without a hive db setup" in > the executor. I wanted to understand how Hive object is initialized in the > executor threads? I only see Hive.get(hivec

Spark SQL cannot tolerate regexp with BIGINT

2015-04-29 Thread lonely Feb
Hi all, we are transferring our HIVE jobs into Spark SQL, but we found a little difference between HIVE and Spark SQL: our sql has a statement like: select A from B where id regexp '^12345$' In HIVE it works fine, but in Spark SQL we got a: java.lang.ClassCastException: java.lang.Long cannot be cas
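A Python analogue of the failure and of the `cast(id as String)` workaround suggested in the thread: applying a regexp to a non-string value raises a type error, much as Spark SQL's ClassCastException reflects a Long reaching a string-only expression; converting to string first fixes it.

```python
import re

row_id = 12345  # plays the role of the BIGINT column

# Without a cast: regexp against a non-string fails.
try:
    re.match(r"^12345$", row_id)
    failed = False
except TypeError:
    failed = True

# The workaround, analogous to `where cast(id as string) regexp '^12345$'`:
matched = re.match(r"^12345$", str(row_id)) is not None
```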