Re: PySpark API divergence + improving pandas interoperability

2016-03-21 Thread Reynold Xin
Hi Wes, I agree it is difficult to do this design case by case, but what I was pointing out was "it is difficult to generalize without seeing a lot more cases". I do think we need to see a lot of these cases and then make a call. My intuition is that we can just have config options that control

Re: PySpark API divergence + improving pandas interoperability

2016-03-21 Thread Wes McKinney
Hi Reynold, It's of course possible to find solutions to specific issues, but what I'm curious about is a general decision-making framework around building strong user experiences for programmers using each of the Spark APIs. Right now, the semantics of using Spark are very tied to the semantics

Re: SparkML algos limitations question.

2016-03-21 Thread Joseph Bradley
The indexing I mentioned is more restrictive than that: each index corresponds to a unique position in a binary tree. (I.e., the first index of row 0 is 1, the first of row 1 is 2, the first of row 2 is 4, etc., IIRC) You're correct that this restriction could be removed; with some careful
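
A minimal sketch of that indexing scheme, assuming 1-based power-of-two indices as described (illustrative helper names, not Spark's internals):

    // Row r starts at index 2^r, so the children of node i are 2*i and 2*i + 1.
    def firstIndexOfRow(row: Int): Int = 1 << row   // row 0 -> 1, row 1 -> 2, row 2 -> 4
    def leftChild(index: Int): Int = 2 * index
    def rightChild(index: Int): Int = 2 * index + 1

    firstIndexOfRow(2)              // 4
    (leftChild(3), rightChild(3))   // (6, 7): the row-2 children of the node at index 3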

Re: java.lang.OutOfMemoryError: Unable to acquire bytes of memory

2016-03-21 Thread Nezih Yigitbasi
Andrew, thanks for the suggestion, but unfortunately it didn't work -- still getting the same exception. On Mon, Mar 21, 2016 at 10:32 AM Andrew Or wrote: > @Nezih, can you try again after setting `spark.memory.useLegacyMode` to > true? Can you still reproduce the OOM that

Merging ML Estimator and Model

2016-03-21 Thread Joseph Bradley
Spark devs & users, I want to bring attention to a proposal to merge the MLlib (spark.ml) concepts of Estimator and Model in Spark 2.0. Please comment & discuss on SPARK-14033 (not in this email thread). *TL;DR:* *Proposal*: Merge Estimator

Re: SparkML algos limitations question.

2016-03-21 Thread Eugene Morozov
Hi Joseph, I thought I understood why it has a limit of 30 levels for decision trees, but now I'm not that sure. I thought that was because the decision tree is stored in an array, whose length is an int and therefore cannot exceed 2^31-1. But here are my new discoveries. I've trained two
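
As a quick check on that reading: with the power-of-two indexing Joseph describes, the largest index in row r is 2^(r+1) - 1, so a signed 32-bit int runs out exactly past row 30. A sketch:

    // Compute the largest node index per row in a Long, then check whether it
    // still fits in a signed 32-bit Int. It stops fitting past row 30, which
    // matches the roughly-30-level limit.
    for (row <- 29 to 31) {
      val maxIndex = (1L << (row + 1)) - 1
      println(s"row $row: max index $maxIndex, fits in Int: ${maxIndex <= Int.MaxValue}")
    }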

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-21 Thread Nicholas Chammas
Is someone going to retry fixing these packages? It's still a problem. Also, it would be good to understand why this is happening. On Fri, Mar 18, 2016 at 6:49 PM Jakob Odersky wrote: > I just realized you're using a different download site. Sorry for the > confusion, the

RE: MLPC model can not be saved

2016-03-21 Thread Ulanov, Alexander
Hi Pan, There is a pull request that is supposed to fix the issue: https://github.com/apache/spark/pull/9854 There is a workaround for saving/loading a model (however I am not sure if it will work for the pipeline): sc.parallelize(Seq(model), 1).saveAsObjectFile("path") val sameModel =
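
For completeness, the workaround end to end as a sketch: the path is a placeholder, model is the fitted model from context, and the concrete model type on the read side is an assumption based on the MLPC subject.

    // Save: wrap the model in a one-element RDD and write it as an object file.
    sc.parallelize(Seq(model), 1).saveAsObjectFile("path")

    // Load: read the object file back and take the single element. The type
    // parameter must match the class of the model that was saved.
    val sameModel = sc.objectFile[MultilayerPerceptronClassificationModel]("path").first()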

Re: java.lang.OutOfMemoryError: Unable to acquire bytes of memory

2016-03-21 Thread Andrew Or
@Nezih, can you try again after setting `spark.memory.useLegacyMode` to true? Can you still reproduce the OOM that way? 2016-03-21 10:29 GMT-07:00 Nezih Yigitbasi: > Hi Spark devs, > I am using 1.6.0 with dynamic allocation on yarn. I am trying to run a >
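
For anyone wanting to run the same experiment, a minimal way to flip that flag: set it before the SparkContext is created (the app name here is illustrative).

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("oom-repro")
      .set("spark.memory.useLegacyMode", "true")  // fall back to pre-1.6 memory management
    val sc = new SparkContext(conf)

    // or equivalently: spark-submit --conf spark.memory.useLegacyMode=true ...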

java.lang.OutOfMemoryError: Unable to acquire bytes of memory

2016-03-21 Thread Nezih Yigitbasi
Hi Spark devs, I am using 1.6.0 with dynamic allocation on yarn. I am trying to run a relatively big application with 10s of jobs and 100K+ tasks and my app fails with the exception below. The closest JIRA issue I could find is SPARK-11293, which
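
For reference, a sketch of the setup being described: dynamic allocation on YARN also requires the external shuffle service, and the executor bound below is a placeholder.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")        // required for dynamic allocation
      .set("spark.dynamicAllocation.maxExecutors", "200")  // placeholder upper bound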

RE: Performance improvements for sorted RDDs

2016-03-21 Thread JOAQUIN GUANTER GONZALBEZ
Hi Daniel, I am glad you already ran the numbers on this change ☺ (for anyone reading, they can be found on slide 19 in http://www.slideshare.net/SparkSummit/interactive-graph-analytics-daniel-darabos ). I haven’t done any formal benchmarking, but the speedup in our jobs is highly noticeable.

Re: PySpark API divergence + improving pandas interoperability

2016-03-21 Thread Reynold Xin
Hi Wes, Thanks for the email. It is difficult to generalize without seeing a lot more cases, but the boolean issue is simply a query analysis rule. I can see us having a config option that changes analysis to be more Python/R-like, which changes the behavior of implicit type coercion and

Re: Performance improvements for sorted RDDs

2016-03-21 Thread Daniel Darabos
There is related discussion in https://issues.apache.org/jira/browse/SPARK-8836. It's not too hard to implement this without modifying Spark and we measured ~10x improvement over plain RDD joins. I haven't benchmarked against DataFrames -- maybe they also realize this performance advantage. On
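
A sketch of the idea for anyone curious, assuming left and right are RDD[(Int, String)]s that share a partitioner, are sorted within each partition, and have unique keys (an illustration of the approach, not the SPARK-8836 patch):

    // Merge-join two co-partitioned, per-partition-sorted RDDs with
    // zipPartitions, avoiding the shuffle that a plain join would trigger.
    val joined = left.zipPartitions(right, preservesPartitioning = true) { (l, r0) =>
      val r = r0.buffered
      l.flatMap { case (k, a) =>
        while (r.hasNext && r.head._1 < k) r.next()   // skip smaller right-side keys
        if (r.hasNext && r.head._1 == k) Iterator((k, (a, r.next()._2)))
        else Iterator.empty                           // inner join: drop unmatched keys
      }
    }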

Re: Performance improvements for sorted RDDs

2016-03-21 Thread Ted Yu
Do you have performance numbers to back up this proposal for the cogroup operation? Thanks. On Mon, Mar 21, 2016 at 1:06 AM, JOAQUIN GUANTER GONZALBEZ <joaquin.guantergonzal...@telefonica.com> wrote: > Hello devs, > > > > I have found myself in a situation where Spark is doing sub-optimal >

Performance improvements for sorted RDDs

2016-03-21 Thread JOAQUIN GUANTER GONZALBEZ
Hello devs, I have found myself in a situation where Spark is doing sub-optimal computations for my RDDs, and I was wondering whether a patch to enable improved performance for this scenario would be a welcome addition to Spark or not. The scenario happens when trying to cogroup two RDDs that

CTAS support in sparksql

2016-03-21 Thread Kashish Jain
Hi, I was exploring the possibility of CTAS with spark-sql (Spark 1.3.1) for saving big results into CSV-formatted files for offline viewing. These are the two things that I did: 1. CREATE TABLE IF NOT EXISTS csv_dump27 ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY
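
For reference, a sketch of a complete statement of the form being attempted, issued through a HiveContext; the trailing clauses and the source query are illustrative, since the preview cuts off.

    // Hive-style CTAS: the storage clauses define the CSV layout, and the
    // SELECT populates the table. The source table name is a placeholder.
    sqlContext.sql("""
      CREATE TABLE IF NOT EXISTS csv_dump27
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\n'
      STORED AS TEXTFILE
      AS SELECT * FROM some_big_table
    """)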