[jira] [Commented] (SPARK-15194) Add Python ML API for MultivariateGaussian

2016-05-21 Thread praveen dareddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15295413#comment-15295413
 ] 

praveen dareddy commented on SPARK-15194:
-

[~josephkb][~holdenk]
I have sent a PR to resolve this issue.
Kindly review the PR.

Thanks,
praveen

> Add Python ML API for MultivariateGaussian
> --
>
> Key: SPARK-15194
> URL: https://issues.apache.org/jira/browse/SPARK-15194
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> We have a PySpark API for the MLlib version but not the ML version. This 
> would allow Python's `GaussianMixture` to more closely match the Scala API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15364) Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python

2016-05-17 Thread praveen dareddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15287533#comment-15287533
 ] 

praveen dareddy commented on SPARK-15364:
-

Hi,
[~mengxr]
I would like to work on this issue.
After going through 
https://github.com/apache/spark/blob/e2efe0529acd748f26dbaa41331d1733ed256237/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala,
 I found pickler classes for Vector, DenseVector, NewDense Vector, etc.

So, do we need to add these classes to the ML API as well?
Can you guide me on how to proceed here?

Thanks,
Praveen
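
For readers unfamiliar with how a pickler rebuilds a vector, here is a minimal, hedged sketch of the reduce protocol in plain Python. The `DenseVector` class below is a hypothetical stand-in, not the PySpark class, and the packed little-endian float64 layout is an assumption chosen for illustration. A pickle stream records a constructor reference plus its arguments, and unpickling calls the constructor with those arguments; the last two lines perform that step by hand.

```python
import struct


class DenseVector(object):
    """Hypothetical stand-in for a dense vector; not the PySpark class."""

    def __init__(self, values):
        if isinstance(values, bytes):
            # Reconstruction path: unpack little-endian float64 values.
            count = len(values) // 8
            values = struct.unpack("<%dd" % count, values)
        self.values = [float(v) for v in values]

    def __reduce__(self):
        # Describe how to rebuild this object: call DenseVector(packed_bytes).
        packed = struct.pack("<%dd" % len(self.values), *self.values)
        return (DenseVector, (packed,))

    def __eq__(self, other):
        return isinstance(other, DenseVector) and self.values == other.values


original = DenseVector([1.0, 0.0, 3.5])

# What unpickling does with a reduce tuple: call the constructor with the args.
constructor, args = original.__reduce__()
restored = constructor(*args)
```

A JVM-side pickler would need to emit an equivalent constructor reference and byte payload, so moving the picklers to a new package presumably also means keeping the class paths they reference in step with the Python side.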

 

> Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python
> ---
>
> Key: SPARK-15364
> URL: https://issues.apache.org/jira/browse/SPARK-15364
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> Now picklers for both new and old vectors are implemented under 
> PythonMLlibAPI. To separate spark.mllib from spark.ml, we should implement 
> them under `spark.ml.python` instead. I set the target to 2.1 since those are 
> private APIs.






[jira] [Commented] (SPARK-15194) Add Python ML API for MultivariateGaussian

2016-05-16 Thread praveen dareddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285618#comment-15285618
 ] 

praveen dareddy commented on SPARK-15194:
-

[~josephkb] Thanks for clarifying this.
I will continue working on this issue once the blocking issue, SPARK-14906, is 
merged to master.

Thanks,
praveen

> Add Python ML API for MultivariateGaussian
> --
>
> Key: SPARK-15194
> URL: https://issues.apache.org/jira/browse/SPARK-15194
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> We have a PySpark API for the MLlib version but not the ML version. This 
> would allow Python's `GaussianMixture` to more closely match the Scala API.






[jira] [Commented] (SPARK-15194) Add Python ML API for MultivariateGaussian

2016-05-11 Thread praveen dareddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280563#comment-15280563
 ] 

praveen dareddy commented on SPARK-15194:
-

Hi All,
After going through the ml and mllib APIs, it seems that MultivariateGaussian in 
Scala uses the Breeze library for linear algebra.
So, are we implementing the same in Python using NumPy, or using a wrapper around 
the Scala MultivariateGaussian?

I have tried using the JavaWrapper class in 
https://github.com/apache/spark/blob/master/python/pyspark/ml/wrapper.py
as a wrapper solution, but I am getting constructor errors (I need to pass a Vector 
and a DenseMatrix to MultivariateGaussian).

Are there any other wrapper APIs I am missing?
Kindly help me out.

Thanks,
Praveen

Here is my code:

    from pyspark.ml.wrapper import JavaWrapper

    __all__ = ['MultivariateGaussian']


    class MultivariateGaussian(JavaWrapper):

        # @keyword_only
        def __init__(self, mu, sigma):
            super(MultivariateGaussian, self).__init__()
            self._java_obj = self._new_java_obj(
                "org.apache.spark.ml.stat.distribution.MultivariateGaussian", (mu, sigma))
            self.mu = mu
            self.sigma = sigma
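
For comparison, the pure-Python alternative raised in the question could look roughly like the sketch below. It is only an illustration under stated assumptions: it uses the standard library instead of NumPy so that it runs anywhere, it swaps in a Cholesky factorization and therefore requires a positive-definite covariance matrix, and it is not a port of the Scala class. The helper `cholesky` and the choice of plain lists for `mu` and `sigma` are hypothetical.

```python
import math


def cholesky(matrix):
    """Return lower-triangular L with L * L^T == matrix (positive definite only)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                diag = matrix[i][i] - partial
                if diag <= 0.0:
                    raise ValueError("covariance matrix is not positive definite")
                lower[i][j] = math.sqrt(diag)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


class MultivariateGaussian(object):
    """Sketch of a multivariate normal with mean mu and covariance sigma."""

    def __init__(self, mu, sigma):
        if len(sigma) != len(mu) or any(len(row) != len(mu) for row in sigma):
            raise ValueError("sigma must be a square matrix matching len(mu)")
        self.mu = list(mu)
        self.sigma = [list(row) for row in sigma]
        self._lower = cholesky(self.sigma)
        # log|sigma| = 2 * sum(log(L_ii)) for the Cholesky factor L.
        self._log_det = 2.0 * sum(math.log(self._lower[i][i]) for i in range(len(mu)))

    def logpdf(self, x):
        n = len(self.mu)
        delta = [x[i] - self.mu[i] for i in range(n)]
        # Forward substitution: solve L z = delta, so z.z is the squared
        # Mahalanobis distance delta^T sigma^-1 delta.
        z = [0.0] * n
        for i in range(n):
            partial = sum(self._lower[i][k] * z[k] for k in range(i))
            z[i] = (delta[i] - partial) / self._lower[i][i]
        mahalanobis_sq = sum(v * v for v in z)
        return -0.5 * (n * math.log(2.0 * math.pi) + self._log_det + mahalanobis_sq)

    def pdf(self, x):
        return math.exp(self.logpdf(x))


standard = MultivariateGaussian([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
# For the standard 2-D normal this equals 1 / (2 * pi), up to rounding.
density_at_mean = standard.pdf([0.0, 0.0])
```

A wrapper around the Scala class would avoid maintaining two implementations of the numerics, at the cost of converting the Python vector and matrix arguments to their JVM counterparts before calling the constructor.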


> Add Python ML API for MultivariateGaussian
> --
>
> Key: SPARK-15194
> URL: https://issues.apache.org/jira/browse/SPARK-15194
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> We have a PySpark API for the MLlib version but not the ML version. This 
> would allow Python's `GaussianMixture` to more closely match the Scala API.






[jira] [Comment Edited] (SPARK-15194) Add Python ML API for MultivariateGaussian

2016-05-06 Thread praveen dareddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275076#comment-15275076
 ] 

praveen dareddy edited comment on SPARK-15194 at 5/7/16 4:33 AM:
-

Hi,

It seems the PySpark version of GaussianMixture is currently implemented in 
clustering.py as the GaussianMixtureModel and GaussianMixture classes.
https://github.com/apache/spark/blob/302a18686998b8b96546526bfccec9cf5b667386/python/pyspark/mllib/clustering.py

Can anyone point me in the right direction here?

Thanks,
Praveen


was (Author: praveen red):

Hi,

It seems the PySpark version of GaussianMixture is currently implemented in 
clustering.py as the GaussianMixtureModel class.
https://github.com/apache/spark/blob/302a18686998b8b96546526bfccec9cf5b667386/python/pyspark/mllib/clustering.py

Can anyone point me in the right direction here?

Thanks,
Praveen

> Add Python ML API for MultivariateGaussian
> --
>
> Key: SPARK-15194
> URL: https://issues.apache.org/jira/browse/SPARK-15194
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> We have a PySpark API for the MLlib version but not the ML version. This 
> would allow Python's `GaussianMixture` to more closely match the Scala API.






[jira] [Commented] (SPARK-15194) Add Python ML API for MultivariateGaussian

2016-05-06 Thread praveen dareddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275076#comment-15275076
 ] 

praveen dareddy commented on SPARK-15194:
-


Hi,

It seems the PySpark version of GaussianMixture is currently implemented in 
clustering.py as the GaussianMixtureModel class.
https://github.com/apache/spark/blob/302a18686998b8b96546526bfccec9cf5b667386/python/pyspark/mllib/clustering.py

Can anyone point me in the right direction here?

Thanks,
Praveen

> Add Python ML API for MultivariateGaussian
> --
>
> Key: SPARK-15194
> URL: https://issues.apache.org/jira/browse/SPARK-15194
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> We have a PySpark API for the MLlib version but not the ML version. This 
> would allow Python's `GaussianMixture` to more closely match the Scala API.






[jira] [Commented] (SPARK-15194) Add Python ML API for MultivariateGaussian

2016-05-06 Thread praveen dareddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275074#comment-15275074
 ] 

praveen dareddy commented on SPARK-15194:
-

Can I contribute to this issue?
From what I have understood so far, we need to mirror 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
in the PySpark API.

Am I understanding this correctly?

Thanks,
Praveen



> Add Python ML API for MultivariateGaussian
> --
>
> Key: SPARK-15194
> URL: https://issues.apache.org/jira/browse/SPARK-15194
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> We have a PySpark API for the MLlib version but not the ML version. This 
> would allow Python's `GaussianMixture` to more closely match the Scala API.






[jira] [Commented] (SPARK-14759) After join one cannot drop dynamically added column

2016-04-25 Thread praveen dareddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255935#comment-15255935
 ] 

praveen dareddy commented on SPARK-14759:
-

Hi Tomasz,

I am new to Spark but would like to help with this issue. I have the environment 
set up on my local system. I recently started going through the code base and am 
eager to contribute. Can you point me to the specific module I need to understand 
to solve this issue?

Thanks,
Red

> After join one cannot drop dynamically added column
> ---
>
> Key: SPARK-14759
> URL: https://issues.apache.org/jira/browse/SPARK-14759
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> Running the following code:
> {code}
> from pyspark.sql.functions import *
> df1 = sqlContext.createDataFrame([(1,10,)], ['any','hour'])
> df2 = sqlContext.createDataFrame([(1,)], ['any']).withColumn('hour',lit(10))
> j = df1.join(df2,[df1.hour == df2.hour],how='left')
> print("columns after join:{0}".format(j.columns))
> jj = j.drop(df2.hour)
> print("columns after removing 'hour':{0}".format(jj.columns))
> {code}
> should show that, after joining and then dropping df2.hour, I end up with only 
> one 'hour' column in the dataframe.
> Unfortunately, this column is not dropped:
> {code}
> columns after join:['any', 'hour', 'any', 'hour']
> columns after removing 'hour': ['any', 'hour', 'any', 'hour']
> {code}
> I found that this happens only when the column is added dynamically before 
> the join.


