Hi,
I write a randomly generated 30,000-row dataframe to parquet. I verify that
it has 200 partitions (both in Spark and by inspecting the parquet file in
HDFS).
When I read it back in, it has 23 partitions?! Is there some optimization
going on? (This doesn't happen in Spark 1.5.)
*How can I force Spark to keep the original 200 partitions when reading the
file back in?*
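A minimal sketch of one workaround, assuming the goal is simply to restore
the partition count after the read (the path and the count are placeholders):

// Spark may coalesce small parquet files on read, so repartition explicitly:
val df = sqlContext.read.parquet("/path/to/data.parquet").repartition(200)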
Hi,
The code below gives me an unexpected result. I expected that
StandardScaler (in ml, not mllib) would take a specified column of an input
dataframe, subtract the mean of the column, and divide the difference by
the standard deviation of that column.
However, Spark gives me the
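For reference, a sketch of how ml's StandardScaler is typically wired up
(column names are my assumptions; note that it operates on a Vector column,
and that withMean defaults to false, a common source of surprise):

import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}

// assemble the numeric column into the Vector column the scaler expects
val assembled = new VectorAssembler()
  .setInputCols(Array("x"))
  .setOutputCol("features")
  .transform(df)

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaled")
  .setWithMean(true)  // subtract the column mean (off by default)
  .setWithStd(true)   // divide by the standard deviation
val scaled = scaler.fit(assembled).transform(assembled)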
Try redefining your window without the sortBy part. In other words, rerun your
code with
window = Window.partitionBy("a")
The thing is that the window frame is defined differently in these two cases.
In your example, in the group where "a" is 1:
- If you include the sortBy option, it is a rolling (cumulative) frame: each
row aggregates only over the rows from the start of the group up to the
current row.
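A small Scala sketch of the contrast (the sum aggregate and column "b" are
my assumptions, following the example above):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val running = Window.partitionBy("a").orderBy("b") // frame: start of group up to current row
val whole   = Window.partitionBy("a")              // frame: the entire group

df.select(col("a"), col("b"),
  sum("b").over(running).as("runningSum"),
  sum("b").over(whole).as("groupSum"))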
I think it's an expression rather than a function you'd find in the API
(as a function call you could do df.select(col).distinct.count).
This will give you the number of distinct (name, age) combinations across
both columns:
scala> import org.apache.spark.sql.functions.countDistinct
scala> df.select(countDistinct("name", "age"))
res397: org.apache.spark.sql.DataFrame =
Hi,
I thought I understood RDDs and DataFrames, but one noob thing is bugging
me (because I'm seeing weird errors involving joins):
*What does Spark do when you pass a big dataframe as an argument to a
function?*
Are these dataframes included in the closure of the function, and is
therefore
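For what it's worth, a sketch of the pattern in question (entirely my own
example): a DataFrame is a driver-side handle to a logical plan rather than
the data itself, so passing one to a driver-side function is cheap; trouble
usually starts only when a DataFrame is referenced inside a closure that is
shipped to executors.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// fine: runs on the driver, only the logical plan is manipulated
def addFlag(df: DataFrame): DataFrame = df.withColumn("flag", lit(true))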
You can do this and many other transformations very easily with window
functions; see this blog post:
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
In your case you would do (in Scala):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
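Purely as an illustration of the style (the column names are my
assumptions), such a windowed transformation might look like:

val w = Window.partitionBy("id").orderBy("ts")
// difference between each row's value and the previous row's, per group
val withDelta = df.withColumn("delta", col("value") - lag("value", 1).over(w))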
Hi all,
I have a Scala project with multiple files: a main file and a file with
utility functions on DataFrames. However, using $"colname" to refer to a
column of the DataFrame in the utils file (see code below) produces a
compile-time error as follows:
"value $ is not a member of StringContext"
Hi,
I'm trying out ml.classification.RandomForestClassifier() on a simple
dataframe and it throws an exception that the number of classes has not been
set in my dataframe. However, I cannot find a function that would set the
number of classes, or any way to pass it as an argument. In mllib, numClasses
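A sketch of the usual remedy (column names are assumptions): in the ml API
the classifier reads the number of classes from the label column's ML
metadata, which StringIndexer attaches:

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
val indexed = indexer.fit(df).transform(df)

// the fitted indexer stamps indexedLabel with the class count
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
val model = rf.fit(indexed)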
> …arithmetic operations for use in Scala. I really would appreciate
> any feedback!
>
> On Tue, Aug 25, 2015 at 11:06 AM, Kristina Rogale Plazonic
> <kpl...@gmail.com> wrote:
>
>> YES PLEASE!
>>
>> :)))
>>
>> On Tue, Aug 25, 2015 at 1:57 PM, Burak
If you don't want to compute all N^2 similarities, you need to implement
some kind of blocking first, for example LSH (locality-sensitive hashing).
A quick search gave this link to a Spark implementation:
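As a sketch of what such blocking can look like: newer Spark releases (2.1+)
ship LSH in spark.ml (the column names and distance threshold below are
placeholders):

import org.apache.spark.ml.feature.MinHashLSH

val lsh = new MinHashLSH()
  .setInputCol("features") // expects binary/sparse feature vectors
  .setOutputCol("hashes")
  .setNumHashTables(5)
val model = lsh.fit(df)
// join the dataset with itself, keeping only near-neighbour pairs
// instead of all N^2 comparisons
val pairs = model.approxSimilarityJoin(df, df, 0.6)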
However I do think it's easier than it seems to write the implicits;
it doesn't involve new classes or anything. Yes it's pretty much just
what you wrote. There is a class Vector in Spark. This declaration
can be in an object; you don't implement your own class. (Also you can
use toBreeze to be able to write a lot of the source code just as you
imagine it, as if the Breeze methods were available on the Vector object
in MLlib.)
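A minimal sketch of such implicits, assuming the conversion goes through
Breeze as discussed in this thread (the object and class names are mine):

import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object VectorImplicits {
  implicit class VectorArith(v: Vector) {
    def +(other: Vector): Vector =
      Vectors.dense((BDV(v.toArray) + BDV(other.toArray)).toArray)
    def -(other: Vector): Vector =
      Vectors.dense((BDV(v.toArray) - BDV(other.toArray)).toArray)
  }
}
// usage: import VectorImplicits._ and then write a + b directly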
On Tue, Aug 25, 2015 at 3:35 PM, Kristina Rogale Plazonic
<kpl...@gmail.com> wrote:
Well, yes, the hack below works (that's all I have time
…with that code if people are interested.
Best,
Burak
…Sonal Goyal sonalgoy...@gmail.com wrote:
From what I have understood, you probably need to convert your vector to
Breeze and do your operations there. Check
stackoverflow.com/questions/28232829/addition-of-two-rddmllib-linalg-vectors
On Aug 25, 2015 7:06 PM, Kristina Rogale Plazonic kpl
Hi all,
I'm still not clear on the best (or, ANY) way to add/subtract
two org.apache.spark.mllib.Vector objects in Scala.
Ok, I understand there was a conscious Spark decision not to support linear
algebra operations in Scala and to leave the choice of a linear algebra
library to the user.
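The one-off workaround that comes up later in the thread looks roughly like
this (a sketch; the values are placeholders):

import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val a: Vector = Vectors.dense(1.0, 2.0)
val b: Vector = Vectors.dense(0.5, 0.5)
// round-trip through Breeze, since mllib's Vector defines no + or -
val sum = Vectors.dense((BDV(a.toArray) + BDV(b.toArray)).toArray)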
Hi,
I'm wondering how to achieve, say, a Monte Carlo simulation in SparkR
without using the low-level RDD functions that were made private in 1.4, such
as parallelize and map. Something like
parallelize(sc, 1:1000).map (
### R code that does my computation
)
where the code is the same on every
Hi,
I'm puzzling over the following problem: when I cache a small sample of a
big dataframe, the small dataframe is recomputed when selecting a column
(but not if show() or count() is invoked).
Why is that so, and how can I avoid recomputation of the small sample
dataframe?
More details:
- I
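A sketch of the pattern usually suggested for this (my own example; the
sample fraction and column name are placeholders): materialize the cached
sample once with an action, so later queries reuse the cached rows instead
of re-running the nondeterministic sample():

val small = big.sample(withReplacement = false, fraction = 0.01).cache()
small.count()               // an action, so the cache actually fills
val c = small.select("x")   // later work now reuses the cached sample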