Re: VectorUDT with spark.ml.linalg.Vector

2016-08-17 Thread Michał Zieliński
I'm using Spark 1.6.2 for a Vector-based UDAF and this works: def inputSchema: StructType = new StructType().add("input", new VectorUDT()). Maybe it was made private in 2.0. On 17 August 2016 at 05:31, Alexey Svyatkovskiy wrote: > Hi Yanbo, > > Thanks for your reply. I will
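
For reference, a minimal sketch of a Vector-typed UDAF under Spark 2.x, where the ml VectorUDT constructor is no longer public; it assumes org.apache.spark.ml.linalg.SQLDataTypes.VectorType as the public handle for the vector type, and the class name and summing logic are only illustrative:

    import org.apache.spark.ml.linalg.{SQLDataTypes, Vector}
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    // Sums all elements of a vector column; the aggregation itself is just a placeholder.
    class VectorElementSum extends UserDefinedAggregateFunction {
      def inputSchema: StructType = new StructType().add("input", SQLDataTypes.VectorType)
      def bufferSchema: StructType = new StructType().add("sum", DoubleType)
      def dataType: DataType = DoubleType
      def deterministic: Boolean = true
      def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getAs[Vector](0).toArray.sum
      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
      def evaluate(buffer: Row): Any = buffer.getDouble(0)
    }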

Re: Spark ML : One hot Encoding for multiple columns

2016-08-17 Thread Michał Zieliński
You can just map over your columns and create a pipeline: val columns = Array("colA", "colB", "colC") val transformers: Array[PipelineStage] = columns.map { x => new OneHotEncoder().setInputCol(x).setOutputCol(x + "Encoded") } val pipeline = new Pipeline().setStages(transformers) On 17
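
A slightly fuller sketch of that pattern; it assumes the three columns already hold numeric category indices (otherwise a StringIndexer stage per column would come first), and df stands for the input DataFrame:

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.feature.OneHotEncoder

    val columns = Array("colA", "colB", "colC")
    // One encoder per column, all chained into a single Pipeline
    val encoders: Array[PipelineStage] = columns.map { c =>
      new OneHotEncoder().setInputCol(c).setOutputCol(c + "Encoded")
    }
    val pipeline = new Pipeline().setStages(encoders)
    // val encoded = pipeline.fit(df).transform(df)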

Re: Do not wrap result of a UDAF in an Struct

2016-03-29 Thread Michał Zieliński
Matthias, you don't need StructType, you can have ArrayType directly: def bufferSchema: StructType = StructType(StructField("vals", DataTypes.createArrayType(StringType)) :: Nil) def dataType: DataType = DataTypes.createArrayType(StringType) def evaluate(buffer: Row): Any =
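
A minimal sketch of a complete UDAF along those lines, collecting string values into a plain ArrayType(StringType) result; the class name and collect-everything semantics are illustrative only:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    class CollectStrings extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)
      def bufferSchema: StructType =
        StructType(StructField("vals", DataTypes.createArrayType(StringType)) :: Nil)
      def dataType: DataType = DataTypes.createArrayType(StringType)
      def deterministic: Boolean = false
      def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = Seq.empty[String]
      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) buffer(0) = buffer.getSeq[String](0) :+ input.getString(0)
      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1(0) = buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0)
      def evaluate(buffer: Row): Any = buffer.getSeq[String](0)
    }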

Fwd: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-26 Thread Michał Zieliński
83732/Scalable%20Predictive%20Pipelines%20with%20Spark%20and%20Scala.pdf -- Forwarded message -- From: Ted Yu <yuzhih...@gmail.com> Date: 26 March 2016 at 12:51 Subject: Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)? To: Michał Zieliński <ziel

Re: Column explode a map

2016-03-24 Thread Michał Zieliński
n you can just extract them > directly. > > Here are a few examples > <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1826874816150656/2840265927289860/latest.html> > . > > On Thu, Mar 24, 2016 at 12:01 PM,

Column explode a map

2016-03-24 Thread Michał Zieliński
Hi, Imagine you have a structure like this: val events = sqlContext.createDataFrame( Seq( ("a", Map("a"->1,"b"->1)), ("b", Map("b"->1,"c"->1)), ("c", Map("a"->1,"c"->1)) ) ).toDF("id","map") What I want to achieve is to have the map values as separate columns. Basically I
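
One way to get there, assuming the set of keys ("a", "b", "c") is known up front, is to pull each key out with getItem; keys missing from a row's map simply come back as null:

    import org.apache.spark.sql.functions.col

    val events = sqlContext.createDataFrame(Seq(
      ("a", Map("a" -> 1, "b" -> 1)),
      ("b", Map("b" -> 1, "c" -> 1)),
      ("c", Map("a" -> 1, "c" -> 1))
    )).toDF("id", "map")

    val flattened = events.select(
      col("id"),
      col("map").getItem("a").as("a"),
      col("map").getItem("b").as("b"),
      col("map").getItem("c").as("c")
    )
    // flattened.show()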

Re: how to convert the RDD[Array[Double]] to RDD[Double]

2016-03-14 Thread Michał Zieliński
For RDDs you can use flatMap; for DataFrames, explode would be the best fit. On 14 March 2016 at 08:28, lizhenm...@163.com wrote: > > hi: > *I want to* convert the RDD[Array[Double]] to RDD[Double]. For example, > it stored 1.0 2.0 3.0 in the file, how do I read > >
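
A small sketch of the RDD route, with the input data hard-coded instead of read from a file; sc is the SparkContext:

    // RDD[Array[Double]] flattened to RDD[Double] with flatMap
    val nested: org.apache.spark.rdd.RDD[Array[Double]] =
      sc.parallelize(Seq(Array(1.0, 2.0, 3.0), Array(4.0, 5.0)))
    val flat: org.apache.spark.rdd.RDD[Double] = nested.flatMap(arr => arr)
    // flat.collect()  // Array(1.0, 2.0, 3.0, 4.0, 5.0)
    // The DataFrame equivalent would use the explode function on an array column.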

Re: Spark ML - Scaling logistic regression for many features

2016-03-07 Thread Michał Zieliński
We're using SparseVector columns in a DataFrame, so they are definitely supported. But maybe for LR some implicit magic is happening inside. On 7 March 2016 at 23:04, Devin Jones wrote: > I could be wrong but it's possible that toDF populates a dataframe which I >
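
For illustration, building a DataFrame with a SparseVector column under the 1.6-era mllib API discussed in the thread (the 2.x ml.linalg package works the same way); the label/features column names are just conventional:

    import org.apache.spark.mllib.linalg.Vectors

    val training = sqlContext.createDataFrame(Seq(
      (0.0, Vectors.sparse(5, Array(1, 3), Array(1.0, 2.0))),
      (1.0, Vectors.sparse(5, Array(0, 4), Array(3.0, 4.0)))
    )).toDF("label", "features")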

Re: Flattening Data within DataFrames

2016-02-29 Thread Michał Zieliński
Hi Kevin, This should help: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-spark.html On 29 February 2016 at 16:54, Kevin Mellott wrote: > Fellow Sparkers, > > I'm trying to "flatten" my view of data within a DataFrame, and am having >
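
The gist of the linked pivot approach, sketched on made-up column names (id, category, value): the distinct values of one column become columns of the result.

    import org.apache.spark.sql.functions.sum

    val flattened = df
      .groupBy("id")
      .pivot("category")   // each distinct category becomes its own column
      .agg(sum("value"))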

Re: Spark SQL support for sub-queries

2016-02-26 Thread Michał Zieliński
Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > Thanks Michael. Great > > d.filter(col("id") === lit(m)).show > > BTW where are all these methods like lit etc. documented? Also I guess any > action call like apply(0) or getInt(0) refers to the "current"

Re: Spark SQL support for sub-queries

2016-02-26 Thread Michał Zieliński
You need to collect the value. val m: Int = d.agg(max($"id")).collect.apply(0).getInt(0) d.filter(col("id") === lit(m)) On 26 February 2016 at 09:41, Mich Talebzadeh wrote: > Can this be done using DFs? > > > > scala> val HiveContext = new
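
For completeness, an equivalent spelling of the same pattern: first() is shorthand for collect.apply(0), and d stands for the DataFrame from the thread:

    import org.apache.spark.sql.functions.{col, lit, max}

    val m: Int = d.agg(max(col("id"))).first().getInt(0)  // pull the scalar to the driver
    val filtered = d.filter(col("id") === lit(m))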

Re: Performing multiple aggregations over the same data

2016-02-23 Thread Michał Zieliński
Do you mean something like this? data.agg(sum("var1"),sum("var2"),sum("var3")) On 24 February 2016 at 01:49, Daniel Imberman wrote: > Hi guys, > > So I'm running into a speed issue where I have a dataset that needs to be > aggregated multiple times. > > Initially my
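
Spelled out, with data standing for the DataFrame from the thread; all three sums are computed in a single pass over the data:

    import org.apache.spark.sql.functions.sum

    val totals = data.agg(sum("var1"), sum("var2"), sum("var3"))
    // Grouped variant: data.groupBy("key").agg(sum("var1"), sum("var2"), sum("var3"))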

Re: Using functional programming rather than SQL

2016-02-22 Thread Michał Zieliński
Your SQL query will look something like this in DataFrames (but as Ted said, check the docs to see the signatures). smallsales .join(times,"time_id") .join(channels,"channel_id") .groupBy("calendar_month_desc","channel_desc") .agg(sum(col("amount_sold")).as("TotalSales"),
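
An illustrative, complete version of that chain; the message is cut off after TotalSales, so everything past that point (the single aggregate and the ordering) is assumed rather than taken from the original:

    import org.apache.spark.sql.functions.{col, sum}

    val report = smallsales
      .join(times, "time_id")
      .join(channels, "channel_id")
      .groupBy("calendar_month_desc", "channel_desc")
      .agg(sum(col("amount_sold")).as("TotalSales"))
      .orderBy("calendar_month_desc", "channel_desc")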

Re: How to get the code for class in spark

2016-02-20 Thread Michał Zieliński
Probably you mean reflection: https://stackoverflow.com/questions/2224251/reflection-on-a-scala-case-class On 19 February 2016 at 15:14, Ashok Kumar wrote: > Hi, > > class body thanks > > > On Friday, 19 February 2016, 11:23, Ted Yu wrote: > >
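
A tiny example in the spirit of that answer, inspecting a case class at runtime; Person is a made-up class:

    case class Person(name: String, age: Int)

    val p = Person("Ann", 32)
    val fieldNames = p.getClass.getDeclaredFields.map(_.getName)  // e.g. name, age
    val fieldValues = p.productIterator.toList                    // List(Ann, 32)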

Re: Checking for null values when mapping

2016-02-20 Thread Michał Zieliński
You can use filter and isNotNull on Column before the map. On 20 February 2016 at 08:24, Mich Talebzadeh wrote: > > > I have a DF like below reading a csv file > > > > > > val df = >
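
A short sketch of that, where df is the CSV-backed DataFrame from the thread and "amount" is an assumed column name:

    import org.apache.spark.sql.functions.col

    val nonNull = df.filter(col("amount").isNotNull)
    // nonNull.map(row => ...)  // rows reaching the map now have a non-null "amount"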

sparkr-submit additional R files

2015-07-07 Thread Michał Zieliński
Hi all, *spark-submit* for Python and Java/Scala has *--py-files* and *--jars* options for submitting additional files on top of the main application. Is there any such option for *sparkr-submit*? I know that there is an *includePackage()* R function to add library dependencies, but can you add