[ML] Allow CrossValidation ParamGrid on SVMWithSGD

2018-01-19 Thread Tomasz Dudek
Hello,

is there any way to use CrossValidator's ParamGrid with SVMWithSGD?

usually, e.g. when using RandomForest, you can specify a number of parameters
to automate the param grid search (when used with CrossValidator):

val algorithm = new RandomForestClassifier()
val paramGrid = new ParamGridBuilder()
  .addGrid(algorithm.impurity, Array("gini", "entropy"))
  .addGrid(algorithm.maxDepth, Array(3, 5, 10))
  .addGrid(algorithm.numTrees, Array(2, 3, 5, 15, 50))
  .addGrid(algorithm.minInfoGain, Array(0.01, 0.001))
  .addGrid(algorithm.minInstancesPerNode, Array(10, 50, 500))
  .build()
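
For reference, a minimal sketch of how such a grid plugs into CrossValidator
(the evaluator and the number of folds below are just placeholder choices,
and trainingDF is assumed to exist):

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator

val cv = new CrossValidator()
  .setEstimator(algorithm)                               // the RandomForestClassifier above
  .setEvaluator(new MulticlassClassificationEvaluator()) // placeholder metric
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)                                        // placeholder fold count
// val cvModel = cv.fit(trainingDF)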

with SVMWithSGD, however, the parameters live inside its GradientDescent
optimizer. You can explicitly tune the params, either through SVMWithSGD's
constructor or by calling setters on the optimizer:

val algorithm = new SVMWithSGD()
algorithm.optimizer
  .setMiniBatchFraction(0.1) // a fraction of the data per iteration, not a batch size
  .setNumIterations(200)
  .setRegParam(0.01)

Neither of those approaches, however, lets me use ParamGridBuilder properly.

There is no such thing as algorithm.optimizer.numIterations or
algorithm.optimizer.regParam, only setters (and ParamGridBuilder requires
Params, not setters).

I could, of course, create each SVM model manually, build one huge Pipeline
with each model writing its result to a different column, and then manually
decide which performed best. That requires a lot of coding, though, and so
far CrossValidator's ParamGrid has done that job for me.
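
For completeness, a rough sketch of the manual approach I mean (the
train/validation split, the grid values and the AUC metric below are just
assumptions):

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// training and validation are assumed to be RDD[LabeledPoint]s prepared elsewhere
val candidates = for {
  numIterations <- Seq(100, 200)
  regParam      <- Seq(0.01, 0.001)
} yield {
  val algorithm = new SVMWithSGD()
  algorithm.optimizer
    .setNumIterations(numIterations)
    .setRegParam(regParam)
  val model = algorithm.run(training)
  model.clearThreshold() // raw scores, needed for AUC
  val scoreAndLabels = validation.map(p => (model.predict(p.features), p.label))
  val auc = new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
  (auc, (numIterations, regParam))
}
val (bestAuc, bestParams) = candidates.maxBy(_._1)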

Am I missing something? Is it WIP or is there any hack to do that?

Yours,
Tomasz


Reverse MinMaxScaler in SparkML

2018-01-08 Thread Tomasz Dudek
Hello,

since a similar question on StackOverflow remains unanswered (
https://stackoverflow.com/questions/46092114/is-there-no-inverse-transform-method-for-a-scaler-like-minmaxscaler-in-spark
) and perhaps there is a solution I am simply not aware of, I'll ask:

After training a MinMaxScaler (or a similar scaler), is there any built-in
way to revert the process, i.e. to transform the scaled data back to its
original form? scikit-learn has a dedicated method, inverse_transform, that
does exactly that.

I can, of course, get the originalMin/originalMax Vectors from the
MinMaxScalerModel and then map the values back myself, but it would be nice
to have this built-in.
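
For the record, a rough sketch of the manual mapping I have in mind (the
column names and the default [0, 1] target range are assumptions):

import org.apache.spark.ml.feature.MinMaxScalerModel
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

// scalerModel is a MinMaxScalerModel fitted elsewhere
def unscale(scalerModel: MinMaxScalerModel) = udf { scaled: Vector =>
  val (lo, hi) = (scalerModel.getMin, scalerModel.getMax)   // target range, [0, 1] by default
  val (origMin, origMax) = (scalerModel.originalMin, scalerModel.originalMax)
  Vectors.dense(Array.tabulate(scaled.size) { i =>
    (scaled(i) - lo) / (hi - lo) * (origMax(i) - origMin(i)) + origMin(i)
  })
}

// val restored = scaledDF.withColumn("unscaled", unscale(scalerModel)(col("scaled")))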

Yours,
Tomasz


Re: Row Encoder For DataSet

2017-12-10 Thread Tomasz Dudek
Hello Sandeep,

you can pass a whole Row to a UDAF. Just provide a proper inputSchema for it.

Check out this example:
https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
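
Something along these lines (a minimal sketch; the column names and the
sum-like aggregation are just placeholders):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// inputSchema has several fields, so update() receives them all as one Row
object SumUV extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(Seq(
    StructField("u", IntegerType), StructField("v", IntegerType)))
  def bufferSchema: StructType = StructType(Seq(StructField("total", LongType)))
  def dataType: DataType = LongType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    buffer(0) = buffer.getLong(0) + input.getInt(0) + input.getInt(1)
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  def evaluate(buffer: Row): Any = buffer.getLong(0)
}

// usage, assuming a df with columns x, u, v:
// df.groupBy(col("x")).agg(SumUV(col("u"), col("v")))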

Yours,
Tomasz

2017-12-10 11:55 GMT+01:00 Sandip Mehta :

> Thanks Georg. I have looked at UDAF based on your suggestion. Looks like
> you can only pass a single column to a UDAF. Is there any way to pass the
> entire Row to the aggregate function?
>
> I want to apply a list of user defined functions to a given row object,
> perform the aggregation and return an aggregated Row object.
>
> Regards
> Sandeep
>
> On Fri, Dec 8, 2017 at 12:47 PM Georg Heiler 
> wrote:
>
>> You are looking for a UDAF.
>> Sandip Mehta  wrote on Fri, 8 Dec 2017 at 06:20:
>>
>>> Hi,
>>>
>>> I want to group on certain columns and then, for every group, apply a
>>> custom UDF to it. Currently groupBy only allows adding an aggregation
>>> function to GroupedData.
>>>
>>> For this I was thinking of using groupByKey, which returns a
>>> KeyValueGroupedDataset, and then applying the UDF for every group, but I
>>> haven't really been able to solve this.
>>>
>>> SM
>>>
>>> On Fri, Dec 8, 2017 at 10:29 AM Weichen Xu 
>>> wrote:
>>>
 You can groupBy multiple columns on a dataframe, so why do you need such a
 complicated schema?

 suppose df schema: (x, y, u, v, z)

 df.groupBy($"x", $"y").agg(...)

 Is this what you want?

 On Fri, Dec 8, 2017 at 11:51 AM, Sandip Mehta <
 sandip.mehta@gmail.com> wrote:

> Hi,
>
> During my aggregation I end up with the following schema:
>
> Row(Row(val1,val2), Row(val1,val2,val3...))
>
> val values = Seq(
> (Row(10, 11), Row(10, 2, 11)),
> (Row(10, 11), Row(10, 2, 11)),
> (Row(20, 11), Row(10, 2, 11))
>   )
>
>
> The 1st tuple is used to group the relevant records for aggregation. I
> have used the following to create the dataset:
>
> val s = StructType(Seq(
>   StructField("x", IntegerType, true),
>   StructField("y", IntegerType, true)
> ))
> val s1 = StructType(Seq(
>   StructField("u", IntegerType, true),
>   StructField("v", IntegerType, true),
>   StructField("z", IntegerType, true)
> ))
>
> val ds = sparkSession.sqlContext.createDataset(
>   sparkSession.sparkContext.parallelize(values))(
>   Encoders.tuple(RowEncoder(s), RowEncoder(s1)))
>
> Is this the correct way of representing this?
>
> How do I create a dataset and row encoder for such a use case, so that I
> can do groupByKey on it?
>
>
>
> Regards
> Sandeep
>




Re: Question on using pseudo columns in spark jdbc options

2017-12-07 Thread Tomasz Dudek
Hey Ravion,

yes, you can obviously specify a column other than the primary key. Be aware,
though, that if the key range is not spread evenly (for example in your
code, if there is a "gap" in the primary keys and no row has an id between 0
and 17220), some of the executors may not assist in loading the data, because
"SELECT * FROM orders WHERE order_id BETWEEN 0 AND 17220" will return an
empty result. I think you might want to repartition afterwards to ensure
that the df is evenly distributed (<--- could somebody confirm my last
sentence? I don't want to mislead and I am not sure).
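
Roughly, with your options Spark issues one query per partition, along these
lines (a sketch of my understanding, not the exact clauses it generates):

// stride ~= (upperBound - lowerBound) / numPartitions = (68883 - 1) / 4 = 17220, so roughly:
//   partition 0: WHERE order_id < 17221 OR order_id IS NULL
//   partition 1: WHERE order_id >= 17221 AND order_id < 34441
//   partition 2: WHERE order_id >= 34441 AND order_id < 51661
//   partition 3: WHERE order_id >= 51661
// If one of those ranges is (almost) empty, its partition ends up (almost) empty
// too, hence the idea of rebalancing afterwards:
val evened = df.repartition(4)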

As for the first question - could you just check it and share the answer with us? :)

Cheers,
Tomasz

2017-12-03 7:39 GMT+01:00 ☼ R Nair (रविशंकर नायर) <
ravishankar.n...@gmail.com>:

> Hi all,
>
> I am using a query to fetch data from MySQL as follows:
>
> var df = spark.read.
> format("jdbc").
> option("url", "jdbc:mysql://10.0.0.192:3306/retail_db").
> option("driver" ,"com.mysql.jdbc.Driver").
> option("user", "retail_dba").
> option("password", "cloudera").
> option("dbtable", "orders").
> option("partitionColumn", "order_id").
> option("lowerBound", "1").
> option("upperBound", "68883").
> option("numPartitions", "4").
> load()
>
> The question is, can I use a pseudo column (like ROWNUM in Oracle or
> RRN(employeeno) in DB2) in the option where I specify the "partitionColumn"?
> If not, can we specify a partition column which is not a primary key?
>
> Best,
> Ravion
>