Why does Spark 2.0 change the number of partitions when reading a parquet file?

2016-12-22 Thread Kristina Rogale Plazonic
Hi, I write a randomly generated 30,000-row dataframe to parquet. I verify that it has 200 partitions (both in Spark and by inspecting the parquet file in HDFS). When I read it back in, it has 23 partitions?! Is there some optimization going on? (This doesn't happen in Spark 1.5.) *How can I force…
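A sketch of two ways to get a predictable partition count back (the config names are real Spark 2.x settings; the values and path are illustrative): Spark 2.0's file scan packs small Parquet parts into fewer read tasks based on these settings.

spark.conf.set("spark.sql.files.maxPartitionBytes", 4 * 1024 * 1024) // smaller bins => more read partitions
val df = spark.read.parquet("/tmp/my30k.parquet")
println(df.rdd.getNumPartitions)

// Or simply restore a fixed partition count after reading:
val df200 = df.repartition(200)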

StandardScaler in spark.ml.feature requires vector input?

2016-01-09 Thread Kristina Rogale Plazonic
Hi, The code below gives me an unexpected result. I expected that StandardScaler (in ml, not mllib) would take a specified column of an input dataframe, subtract the mean of the column, and divide the difference by the standard deviation of the column. However, Spark gives me the…
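The usual workaround, sketched below: spark.ml's StandardScaler operates on a Vector column, so wrap the scalar column with VectorAssembler first (the dataframe df and column names "x", "xVec", "xScaled" are illustrative).

import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}

// Pack the scalar column into a 1-element Vector column.
val assembled = new VectorAssembler()
  .setInputCols(Array("x"))
  .setOutputCol("xVec")
  .transform(df)

// withMean defaults to false, which is a common source of surprise here.
val scaled = new StandardScaler()
  .setInputCol("xVec")
  .setOutputCol("xScaled")
  .setWithMean(true)
  .setWithStd(true)
  .fit(assembled)
  .transform(assembled)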

Re: pyspark dataframe: row with a minimum value of a column for each group

2016-01-06 Thread Kristina Rogale Plazonic
Try redefining your window without the sortBy part. In other words, rerun your code with window = Window.partitionBy("a"). The thing is that the window is defined differently in these two cases. In your example, in the group where "a" is 1: if you include the "sortBy" option, it is a rolling…
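The same idea sketched in Scala (the thread is pyspark; df and columns "a"/"b" are from the example):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, min}

// Without an orderBy, min is computed over the entire group, not a rolling frame.
val w = Window.partitionBy("a")
val result = df
  .withColumn("minB", min("b").over(w))
  .where(col("b") === col("minB")) // keep the row(s) attaining the group minimum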

Re: finding distinct count using dataframe

2016-01-05 Thread Kristina Rogale Plazonic
I think it's an expression rather than a function you'd find in the API (as a function you could do df.select(col).distinct.count). This will give you the number of distinct rows over both columns: scala> df.select(countDistinct("name", "age")) res397: org.apache.spark.sql.DataFrame = …
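Both forms spelled out as a runnable sketch (a df with columns "name" and "age" is assumed):

import org.apache.spark.sql.functions.countDistinct

df.select(countDistinct("name", "age")).show() // count of distinct (name, age) pairs
val n = df.select("name", "age").distinct.count() // same number, as a Long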

passing RDDs/DataFrames as arguments to functions - what happens?

2015-11-08 Thread Kristina Rogale Plazonic
Hi, I thought I understood RDDs and DataFrames, but one noob thing is bugging me (because I'm seeing weird errors involving joins): *What does Spark do when you pass a big dataframe as an argument to a function?* Are these dataframes included in the closure of the function, and is therefore…
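For context, a sketch of the situation being asked about (all names are hypothetical): a DataFrame is a lightweight handle to a logical plan, so passing it to a function copies no data; closure capture only matters for functions shipped to executors, such as map/filter lambdas.

import org.apache.spark.sql.DataFrame

// Passing DataFrames around is cheap: only the plan reference moves.
def enrich(users: DataFrame, events: DataFrame): DataFrame =
  users.join(events, "userId") // "userId" is a hypothetical join key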

Re: Spark SQL running totals

2015-10-15 Thread Kristina Rogale Plazonic
You can do this and many other transformations very easily with window functions; see this blog post: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html In your case you would do (in Scala): import org.apache.spark.sql.expressions.Window import…
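A sketch of the running-total pattern from the linked post (column names "acct", "date", "amount" are illustrative; in 1.5-era Scala the unbounded-preceding frame is written with Long.MinValue):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val w = Window.partitionBy("acct").orderBy("date").rowsBetween(Long.MinValue, 0)
val withTotal = df.withColumn("runningTotal", sum("amount").over(w))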

Where to put import sqlContext.implicits._ to be able to work on DataFrames in another file?

2015-10-05 Thread Kristina Rogale Plazonic
Hi all, I have a Scala project with multiple files: a main file and a file with utility functions on DataFrames. However, using $"colname" to refer to a column of the DataFrame in the utils file (see code below) produces a compile-time error as follows: "value $ is not a member of StringContext"
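Two common fixes, sketched below (DfUtils and the column "value" are made-up names): either avoid $ via col(), or import the implicits from the DataFrame's own SQLContext inside the method.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object DfUtils {
  // Option 1: col("...") needs no implicits at all.
  def doubled(df: DataFrame): DataFrame =
    df.withColumn("doubled", col("value") * 2)

  // Option 2: $ comes from an instance import, so bring it in from df itself.
  def positives(df: DataFrame): DataFrame = {
    import df.sqlContext.implicits._
    df.filter($"value" > 0)
  }
}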

RandomForestClassifier does not recognize the number of classes, nor can the number of classes be set

2015-09-29 Thread Kristina Rogale Plazonic
Hi, I'm trying out ml.classification.RandomForestClassifier() on a simple dataframe, and it returns an exception saying the number of classes has not been set in my dataframe. However, I cannot find a function that would set the number of classes, or a way to pass it as an argument anywhere. In mllib, numClasses…
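The usual fix, as a sketch (a df with "label" and "features" columns is assumed): ml estimators read the number of classes from the label column's metadata, and StringIndexer attaches that metadata.

import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.classification.RandomForestClassifier

// Index the label column; this attaches nominal metadata carrying the class count.
val indexed = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(df)
  .transform(df)

val model = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
  .fit(indexed)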

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-09-09 Thread Kristina Rogale Plazonic
…arithmetic operations for use in Scala. I really would appreciate any feedback! On Tue, Aug 25, 2015 at 11:06 AM, Kristina Rogale Plazonic kpl...@gmail.com wrote: YES PLEASE! :))) On Tue, Aug 25, 2015 at 1:57 PM, Bura…

Re: Build k-NN graph for large dataset

2015-08-26 Thread Kristina Rogale Plazonic
If you don't want to compute all N^2 similarities, you need to implement some kind of blocking first, for example LSH (locality-sensitive hashing). A quick search gave this link to a Spark implementation: …
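A toy sketch of the blocking idea (not a library call; every name here is made up): bucket each vector by the sign pattern of k random projections, then compare pairs only within a bucket instead of across all N^2 pairs.

import scala.util.Random
import org.apache.spark.rdd.RDD

def lshBuckets(points: RDD[(Long, Array[Double])], dim: Int, k: Int, seed: Long)
    : RDD[(String, Iterable[(Long, Array[Double])])] = {
  val rng    = new Random(seed)
  val planes = Array.fill(k, dim)(rng.nextGaussian()) // k random hyperplanes
  points.map { case (id, v) =>
    // Signature = which side of each hyperplane the vector falls on.
    val sig = planes.map { p =>
      val dot = p.iterator.zip(v.iterator).map { case (a, b) => a * b }.sum
      if (dot >= 0) '1' else '0'
    }
    (new String(sig), (id, v))
  }.groupByKey() // candidate pairs now live inside each bucket
}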

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Kristina Rogale Plazonic
However, I do think it's easier than it seems to write the implicits; it doesn't involve new classes or anything. Yes, it's pretty much just what you wrote. There is a class Vector in Spark. This declaration can be in an object; you don't implement your own class. (Also, you can use toBreeze to…
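A sketch of the hack being discussed: an implicit class in a utility object, with no subclassing of Vector. (toBreeze is private to Spark in 1.x, so this version goes through toArray instead.)

import org.apache.spark.mllib.linalg.{Vector, Vectors}

object VectorOps {
  implicit class RichVector(v: Vector) {
    def +(other: Vector): Vector =
      Vectors.dense(v.toArray.zip(other.toArray).map { case (a, b) => a + b })
    def -(other: Vector): Vector =
      Vectors.dense(v.toArray.zip(other.toArray).map { case (a, b) => a - b })
  }
}

// Usage: import VectorOps._ and then v1 + v2 / v1 - v2 work on mllib Vectors.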

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Kristina Rogale Plazonic
…be able to write a lot of the source code just as you imagine it, as if the Breeze methods were available on the Vector object in MLlib. On Tue, Aug 25, 2015 at 3:35 PM, Kristina Rogale Plazonic kpl...@gmail.com wrote: Well, yes, the hack below works (that's all I have time…

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Kristina Rogale Plazonic
…with that code if people are interested. Best, Burak. On Tue, Aug 25, 2015 at 10:54 AM, Kristina Rogale Plazonic kpl...@gmail.com wrote: However I do think it's easier than it seems to write the implicits; it doesn't involve new classes or anything. Yes, it's pretty much just what you wrote…

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Kristina Rogale Plazonic
…Sonal Goyal sonalgoy...@gmail.com wrote: From what I have understood, you probably need to convert your vector to Breeze and do your operations there. Check stackoverflow.com/questions/28232829/addition-of-two-rddmllib-linalg-vectors On Aug 25, 2015 7:06 PM, Kristina Rogale Plazonic kpl…

Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Kristina Rogale Plazonic
Hi all, I'm still not clear on the best (or, indeed, ANY) way to add/subtract two org.apache.spark.mllib.linalg.Vector objects in Scala. OK, I understand there was a conscious Spark decision not to support linear algebra operations in Scala and to leave it to the user to choose a linear algebra library.

Embarrassingly parallel computation in SparkR?

2015-08-17 Thread Kristina Rogale Plazonic
Hi, I'm wondering how to achieve, say, a Monte Carlo simulation in SparkR without using the low-level RDD functions that were made private in 1.4, such as parallelize and map. Something like parallelize(sc, 1:1000).map( ### R code that does my computation ), where the code is the same on every…
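For reference, the embarrassingly parallel pattern the question describes, sketched in Scala (the SparkR equivalent is exactly what is being asked for; Pi estimation stands in for "R code that does my computation"):

val n = 1000000
val hits = sc.parallelize(1 to n).map { _ =>
  val x = math.random; val y = math.random
  if (x * x + y * y <= 1.0) 1L else 0L // same simulation code on every element
}.reduce(_ + _)
println(s"Pi is approximately ${4.0 * hits / n}")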

DataFrame DAG recomputed even though DataFrame is cached?

2015-07-28 Thread Kristina Rogale Plazonic
Hi, I'm puzzling over the following problem: when I cache a small sample of a big dataframe, the small dataframe is recomputed when selecting a column (but not if show() or count() is invoked). Why is that, and how can I avoid recomputation of the small sample dataframe? More details: - I…
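A common workaround, sketched below (bigDF, the fraction, and "someCol" are illustrative): cache() is lazy, so force materialization once before reusing the sample.

val sample = bigDF.sample(withReplacement = false, fraction = 0.01).cache()
sample.count() // materializes the cached partitions
sample.select("someCol").show() // now served from the cache rather than recomputed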