[GitHub] spark issue #13959: [SPARK-14351] [MLlib] [ML] Optimize findBestSplits metho...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13959 I don't understand. If you don't have time to review, that is fine (I've been there too), but there is no need to close a PR due to the unavailability of committers. This is one of the reasons I am happy to have stopped contributing to Spark and to focus my energy elsewhere... Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17621: [SPARK-6227][MLLIB][PYSPARK] Implement PySpark wrappers ...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/17621 Thanks @MLnick !
[GitHub] spark pull request #14273: [SPARK-9140] [ML] Replace TimeTracker by MultiSto...
Github user MechCoder closed the pull request at: https://github.com/apache/spark/pull/14273
[GitHub] spark issue #7963: [SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/7963 Thanks for the reviews @holdenk . Unfortunately I will not be able to work on this anytime soon. Feel free to cherry-pick the commits if you wish.
[GitHub] spark issue #14640: [SPARK-17055] [MLLIB] add labelKFold to CrossValidator
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/14640 Just FYI, we plan to rename "LabelKFold" to "GroupKFold" in the next version of sklearn, as a label can mean several things (including the target label).
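The idea behind a group-aware K-fold is that all samples sharing a group (the thing previously called "label") must land in the same test fold. A minimal pure-Python sketch of that constraint; this is an illustration, not sklearn's actual `GroupKFold` implementation, which balances folds differently:

```python
from collections import defaultdict

def group_k_fold(groups, n_splits):
    """Split sample indices into n_splits test folds such that all
    samples sharing a group land in the same fold (illustrative sketch)."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    # Greedily assign larger groups first to the currently smallest fold,
    # so fold sizes stay roughly balanced.
    folds = [[] for _ in range(n_splits)]
    for g, idxs in sorted(by_group.items(), key=lambda kv: -len(kv[1])):
        min(folds, key=len).extend(idxs)
    return folds

groups = ["a", "a", "b", "c", "c", "c", "d", "e"]
folds = group_k_fold(groups, 3)
```

Each group ends up entirely inside exactly one fold, which is the property that makes the "Group" naming clearer than "Label".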
[GitHub] spark issue #13650: [SPARK-9623] [ML] Provide conditional variance for Rando...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13650 @yanboliang Sorry for the long delay! Hope you are still here. 1. The term "variance in predictions" is ambiguous and a bit misleading. Given the original data-generating distribution, the variance in prediction for a decision tree describes how much the prediction changes from one decision tree to another, each fit on a subsample of the data. As we know, this variance is high for a single decision tree and reduces to zero for a random forest (assuming a huge number of uncorrelated trees). I have updated the PR title to reflect this. 2. No, the paper as such is not widely cited. Also, what @sethah describes is correct: this approach picks a random tree with equal probability and uses the expected variance obtained from it. However, the conditional distribution of Y|X is NOT the mean of the conditional distributions of the individual trees. That is, P(Y | x) != (P(Y_1 | x) + P(Y_2 | x) + ... + P(Y_n | x)) / n. It is only the expectation of Y|x that is given by the mean of the expectations of the individual trees. The correct way of deriving the conditional CDF of Y | x is given in the well-cited paper (http://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf). 3. However, the formula derived in that paper is the same as the weighted variance, with weights given to the target variable in the training data as defined in formula 5 of http://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf . I have verified it on synthetic data in a notebook here (https://github.com/MechCoder/Notebooks/blob/master/Conditional_variances.ipynb). I have spent more time than I initially expected on this pull request and I'm willing to do anything more that is required to merge.
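For concreteness, the weighted-variance estimate referred to in point 3 (formula 5 of the Meinshausen paper) can be sketched in a few lines. This is a toy illustration with the weights simply given, not Spark's implementation; in a real forest the weight of training point i would come from its average leaf co-membership with the query point x:

```python
def weighted_variance(ys, weights):
    """Estimate Var(Y | x) as sum_i w_i * y_i**2 - (sum_i w_i * y_i)**2,
    where the weights w_i sum to one (cf. formula 5 of Meinshausen, 2006).
    ys are the training targets; weights are assumed precomputed."""
    mean = sum(w * y for w, y in zip(weights, ys))
    return sum(w * y * y for w, y in zip(weights, ys)) - mean * mean
```

With uniform weights this reduces to the ordinary population variance of the targets, which is a quick sanity check on the formula.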
[GitHub] spark pull request #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/ca...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/14579#discussion_r75567101 --- Diff: python/pyspark/rdd.py --- @@ -188,6 +188,12 @@ def __init__(self, jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSeri self._id = jrdd.id() self.partitioner = None +def __enter__(self): --- End diff -- Yes, also known as the "if you don't know what to do, raise an error" approach :p
[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Improve handling of PySpark P...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12790 Yes, I agree that allowing `stages` to be an empty sequence or list in a Pipeline is non-intuitive, but I'm fine with allowing that corner case.
[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12790 Awesome! Thanks!
[GitHub] spark pull request #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing erro...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12790#discussion_r75390810 --- Diff: python/pyspark/status.py --- @@ -83,6 +85,8 @@ def getJobInfo(self, jobId): job = self._jtracker.getJobInfo(jobId) if job is not None: return SparkJobInfo(jobId, job.stageIds(), str(job.status())) +else: --- End diff -- Python returns None by default ;)
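The remark can be checked directly: a Python function that falls off the end without hitting a `return` yields `None`, so the explicit `else: return None` branch in the diff is redundant. The names below are illustrative, not the PySpark API:

```python
def get_job_info(jobs, job_id):
    # Mirrors the pattern in the diff: no explicit `return None` is
    # needed, because falling off the end already returns None.
    job = jobs.get(job_id)
    if job is not None:
        return ("SparkJobInfo", job_id, job)

assert get_job_info({}, 42) is None
assert get_job_info({42: "RUNNING"}, 42) == ("SparkJobInfo", 42, "RUNNING")
```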
[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12790 If that's the case, then the piece of documentation that promises the Pipeline behaves as an identity transformer when no stages are set has to be changed (or removed).
[GitHub] spark pull request #14653: [SPARK-10931][PYSPARK][ML] PySpark ML Models shou...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/14653#discussion_r75230698 --- Diff: python/pyspark/ml/wrapper.py --- @@ -243,7 +240,7 @@ def __init__(self, java_model=None): """ Initialize this instance with a Java model object. Subclasses should call this constructor, initialize params, -and then call _transfer_params_from_java. +and then call _transformer_params. --- End diff -- Not sure you intended this change.
[GitHub] spark issue #13036: [SPARK-15243][ML][SQL][PYSPARK] Param methods should use...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13036 lgtm
[GitHub] spark issue #14653: [SPARK-10931][PYSPARK][ML] PySpark ML Models should cont...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/14653 Should we start having `PredictorParams` -> (HasLabelCol, HasFeaturesCol, HasPredictionCol) and `ClassifierParams` -> (HasRawPredictionCol), as done on the Scala side?
[GitHub] spark pull request #14653: [SPARK-10931][PYSPARK][ML] PySpark ML Models shou...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/14653#discussion_r75228035 --- Diff: python/pyspark/ml/classification.py --- @@ -59,6 +59,16 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti ... Row(label=0.0, weight=2.0, features=Vectors.sparse(1, [], []))]).toDF() >>> lr = LogisticRegression(maxIter=5, regParam=0.01, weightCol="weight") >>> model = lr.fit(df) +>>> emap = lr.extractParamMap() +>>> mmap = model.extractParamMap() +>>> all([emap[getattr(lr, param.name)] == value for (param, value) in mmap.items()]) --- End diff -- style: Also `(param, value)` -> `param, value` (brackets are redundant)
[GitHub] spark pull request #14653: [SPARK-10931][PYSPARK][ML] PySpark ML Models shou...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/14653#discussion_r75227783 --- Diff: python/pyspark/ml/classification.py --- @@ -59,6 +59,16 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti ... Row(label=0.0, weight=2.0, features=Vectors.sparse(1, [], []))]).toDF() >>> lr = LogisticRegression(maxIter=5, regParam=0.01, weightCol="weight") >>> model = lr.fit(df) +>>> emap = lr.extractParamMap() +>>> mmap = model.extractParamMap() +>>> all([emap[getattr(lr, param.name)] == value for (param, value) in mmap.items()]) --- End diff -- `emap[getattr(lr, param.name)]` is the same as `emap[param]`, no?
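The question hinges on whether the `param` coming out of the model's param map keys `emap` the same way as the estimator's own `Param` attribute. The reviewer's suggestion assumes they resolve to the same key; a toy model of that assumption (these are stand-in classes, not the PySpark ones, whose `Param` equality semantics across estimator and model are exactly what is being questioned here):

```python
class Param:
    def __init__(self, parent, name):
        self.parent = parent
        self.name = name

class Estimator:
    pass

lr = Estimator()
lr.maxIter = Param(lr, "maxIter")

param = lr.maxIter        # in the doctest this would come from mmap.items()
emap = {param: 5}
# When both expressions resolve to the same object, the two lookups hit
# the same dict key, so `emap[getattr(lr, param.name)]` == `emap[param]`.
assert getattr(lr, param.name) is param
assert emap[getattr(lr, param.name)] == emap[param] == 5
```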
[GitHub] spark pull request #14653: [SPARK-10931][PYSPARK][ML] PySpark ML Models shou...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/14653#discussion_r75225024 --- Diff: python/pyspark/ml/classification.py --- @@ -59,6 +59,16 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti ... Row(label=0.0, weight=2.0, features=Vectors.sparse(1, [], []))]).toDF() >>> lr = LogisticRegression(maxIter=5, regParam=0.01, weightCol="weight") >>> model = lr.fit(df) +>>> emap = lr.extractParamMap() --- End diff -- style: `emap` -> `estimator_paramMap` `mmap` -> `model_paramMap` ?
[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12790 LGTM: cc @yanboliang @srowen
[GitHub] spark pull request #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing erro...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12790#discussion_r75024959 --- Diff: python/pyspark/ml/pipeline.py --- @@ -57,9 +57,8 @@ def __init__(self, stages=None): """ __init__(self, stages=None) """ -if stages is None: -stages = [] super(Pipeline, self).__init__() +self._setDefault(stages=[]) --- End diff -- Could you add a comment on why this is being done for future reference?
[GitHub] spark pull request #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing erro...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12790#discussion_r75023876 --- Diff: python/pyspark/ml/tests.py --- @@ -230,6 +230,15 @@ def test_pipeline(self): self.assertEqual(5, transformer3.dataset_index) self.assertEqual(6, dataset.index) +def test_identity_pipeline(self): +dataset = MockDataset() + +def doTransform(pipeline): +pipeline_model = pipeline.fit(dataset) +return pipeline_model.transform(dataset) +self.assertEqual(dataset.index, doTransform(Pipeline()).index) +self.assertEqual(dataset.index, doTransform(Pipeline(stages=[])).index) --- End diff -- Should we also check that `setParams(stages=[])` and `Pipeline().getStages()` return the expected value?
[GitHub] spark pull request #14467: [SPARK-16861][PYSPARK][CORE] Refactor PySpark acc...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/14467#discussion_r74845665 --- Diff: python/pyspark/context.py --- @@ -173,9 +173,8 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, # they will be passed back to us through a TCP server self._accumulatorServer = accumulators._start_update_server() (host, port) = self._accumulatorServer.server_address -self._javaAccumulator = self._jsc.accumulator( -self._jvm.java.util.ArrayList(), -self._jvm.PythonAccumulatorParam(host, port)) +self._javaAccumulator = self._jvm.PythonAccumulatorV2(host, port) +self._jsc.sc().register(self._javaAccumulator) --- End diff -- I cannot fully understand why an accumulator is created for every instance of SparkContext. I see it is used when the attribute `_jrdd` is accessed, but that still does not clear things up :(
[GitHub] spark pull request #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/ca...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/14579#discussion_r74813935 --- Diff: python/pyspark/rdd.py --- @@ -188,6 +188,12 @@ def __init__(self, jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSeri self._id = jrdd.id() self.partitioner = None +def __enter__(self): --- End diff -- Is that true? Doesn't it call `__enter__` on the instance of `rdd.cache().map(...)`, where `is_cached` is set to False? Quick verification:

```python
def __enter__(self):
    if self.is_cached:
        return self
    else:
        raise ValueError("r")

with rdd.cache().map(lambda x: x) as t:
    pass
```

raises a `ValueError`
[GitHub] spark pull request #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/ca...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/14579#discussion_r74811199 --- Diff: python/pyspark/rdd.py --- @@ -188,6 +188,12 @@ def __init__(self, jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSeri self._id = jrdd.id() self.partitioner = None +def __enter__(self): --- End diff -- Is it reasonable just to raise an error saying that the context manager is meant to work only with cached RDDs (DataFrames) if `self.is_cached` is not True?
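The suggestion can be sketched with a stand-in class (not the actual PySpark RDD/DataFrame): `__enter__` refuses to proceed unless the object has been cached, and `__exit__` plays the role of `unpersist()`:

```python
class CacheAware:
    """Toy stand-in for an RDD/DataFrame whose context manager only
    works after cache() has been called (illustrative names only)."""

    def __init__(self):
        self.is_cached = False

    def cache(self):
        self.is_cached = True
        return self

    def __enter__(self):
        # The "if you don't know what to do, raise an error" approach:
        # refuse to manage an uncached object.
        if not self.is_cached:
            raise ValueError(
                "the context manager is meant for cached RDDs/DataFrames")
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.is_cached = False  # stand-in for unpersist()
        return False

with CacheAware().cache() as rdd:
    assert rdd.is_cached
```

This also makes the `rdd.cache().map(...)` pitfall discussed above fail loudly, since the mapped object would start with `is_cached` as False.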
[GitHub] spark issue #14273: [SPARK-9140] [ML] Replace TimeTracker by MultiStopwatch
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/14273 bump?
[GitHub] spark pull request #12889: [SPARK-15113][PySpark][ML] Add missing num featur...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12889#discussion_r73947457 --- Diff: python/pyspark/ml/classification.py --- @@ -44,6 +44,23 @@ @inherit_doc +class JavaClassificationModel(JavaPredictionModel): +""" +(Private) Java Model produced by a ``Classifier``. +Classes are indexed {0, 1, ..., numClasses - 1}. +To be mixed in with class:`pyspark.ml.JavaModel` +""" + +@property +@since("2.0.0") --- End diff -- Should be 2.1?
[GitHub] spark issue #12983: [SPARK-15213][PySpark] Unify 'range' usages
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12983 In sklearn, we use `sklearn.six.moves`, which allows `range` and `xrange` to be used interchangeably. In Python 3, both `range` and `xrange` return a `range` instance, and in Python 2, both return an `xrange` instance. Something like

```python
import sys

if sys.version_info[0] == 3:
    xrange = range
elif sys.version_info[0] == 2:
    range = xrange
```

That being said, I'm OK with this PR being merged as it is, since as a Python 3 user it is more natural for me to use `range` (but only as a Python 3 user). In any case, I believe we should be consistent in usage.
[GitHub] spark pull request #13571: [SPARK-15369][WIP][RFC][PySpark][SQL] Expose pote...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13571#discussion_r72870886 --- Diff: python/pyspark/sql/functions.py --- @@ -1731,13 +1749,115 @@ def sort_array(col, asc=True): # User Defined Function -- +def _wrap_jython_func(sc, src, ser_vars, ser_imports, setup_code, returnType): +return sc._jvm.org.apache.spark.sql.execution.python.JythonFunction( +src, ser_vars, ser_imports, setup_code, sc._jsc.sc()) + + def _wrap_function(sc, func, returnType): command = (func, returnType) pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command) return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes, sc.pythonExec, sc.pythonVer, broadcast_vars, sc._javaAccumulator) +class UserDefinedJythonFunction(object): +""" +User defined function in Jython - note this might be a bad idea to use. + +.. versionadded:: 2.0 +.. Note: Experimental +""" +def __init__(self, func, returnType, name=None, setupCode=""): +self.func = func +self.returnType = returnType +self.setupCode = setupCode +self._judf = self._create_judf(name) + +def _create_judf(self, name): +func = self.func +from pyspark.sql import SQLContext +sc = SparkContext.getOrCreate() +# Empty strings allow the Scala code to recognize no data and skip adding the Jython +# code to handle vars or imports if not needed. +serialized_vars = "" +serialized_imports = "" +if isinstance(func, basestring): +src = func +else: +try: +import dill --- End diff -- Currently it seems pyspark uses cloudpickle to serialize and deserialize otherwise non-serializable functions. What are the advantages of using dill here instead of cloudpickle?
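Context for the question above: the standard-library `pickle` serializes functions by module-level name, so it cannot handle lambdas or interactively defined functions; that baseline limitation is why libraries such as cloudpickle and dill exist at all. The snippet below only demonstrates the baseline failure; it does not compare the two libraries:

```python
import pickle

# Plain pickle looks the function up by its qualified name, and an
# anonymous lambda has no importable name, so pickling it fails.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_picklable = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_picklable = False
```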
[GitHub] spark issue #14273: [SPARK-9140] [ML] Replace TimeTracker by MultiStopwatch
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/14273 @jkbradley Would you be able to have a look?
[GitHub] spark pull request #12889: [SPARK-15113][PySpark][ML] Add missing num featur...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12889#discussion_r71629288 --- Diff: python/pyspark/ml/classification.py --- @@ -581,8 +602,11 @@ def _create_model(self, java_model): @inherit_doc -class DecisionTreeClassificationModel(DecisionTreeModel, JavaMLWritable, JavaMLReadable): +class DecisionTreeClassificationModel(DecisionTreeModel, JavaClassificationModel, JavaMLWritable, --- End diff -- Just curious to know why we don't expose `numClasses` in `GBTClassificationModel`. Do we not support multiclass currently, or is there some other reason?
[GitHub] spark issue #12889: [SPARK-15113][PySpark][ML] Add missing num features num ...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12889 Just `LinearRegressionModel` seems missing to me. LGTM otherwise.
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r71615972 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -137,14 +137,47 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { { val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, Map(), Set(), -Array(3), Gini, QuantileStrategy.Sort, +Array(2), Gini, QuantileStrategy.Sort, 0, 0, 0.0, 0, 0 ) val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble) val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) assert(splits.length === 1) assert(splits(0) === 1.0) } + +// find splits for constant feature +{ + val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, +Map(), Set(), +Array(3), Gini, QuantileStrategy.Sort, +0, 0, 0.0, 0, 0 + ) + val featureSamples = Array(0, 0, 0).map(_.toDouble) + val featureSamplesEmpty = Array.empty[Double] + val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) + assert(splits === Array[Double]()) + val splitsEmpty = +RandomForest.findSplitsForContinuousFeature(featureSamplesEmpty, fakeMetadata, 0) + assert(splitsEmpty === Array[Double]()) +} + } + + test("train with constant features") { +val lp = LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0)) +val data = Array.fill(5)(lp) +val rdd = sc.parallelize(data) +val strategy = new OldStrategy( + OldAlgo.Classification, + Gini, + maxDepth = 2, + numClasses = 100, --- End diff -- Is it the case that `numClasses` can be greater than the number of unique labels in the data? If yes, then ignore this comment.
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r71615183 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -692,14 +692,20 @@ private[spark] object RandomForest extends Logging { node.stats } -// For each (feature, split), calculate the gain, and select the best (feature, split). -val (bestSplit, bestSplitStats) = - Range(0, binAggregates.metadata.numFeaturesPerNode).map { featureIndexIdx => -val featureIndex = if (featuresForNode.nonEmpty) { - featuresForNode.get.apply(featureIndexIdx) +val validFeatureSplits = + Range(0, binAggregates.metadata.numFeaturesPerNode).view.map { featureIndexIdx => +if (featuresForNode.nonEmpty) { + (featureIndexIdx, featuresForNode.get.apply(featureIndexIdx)) --- End diff -- Is the `apply` here redundant?
[GitHub] spark issue #12374: [SPARK-14610][ML] Remove superfluous split for continuou...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12374 LGTM
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r71615061 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -137,14 +137,47 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { { val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, Map(), Set(), -Array(3), Gini, QuantileStrategy.Sort, +Array(2), Gini, QuantileStrategy.Sort, 0, 0, 0.0, 0, 0 ) val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble) val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) assert(splits.length === 1) assert(splits(0) === 1.0) } + +// find splits for constant feature +{ + val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, +Map(), Set(), +Array(3), Gini, QuantileStrategy.Sort, +0, 0, 0.0, 0, 0 + ) + val featureSamples = Array(0, 0, 0).map(_.toDouble) + val featureSamplesEmpty = Array.empty[Double] + val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) + assert(splits === Array[Double]()) + val splitsEmpty = +RandomForest.findSplitsForContinuousFeature(featureSamplesEmpty, fakeMetadata, 0) + assert(splitsEmpty === Array[Double]()) +} + } + + test("train with constant features") { +val lp = LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0)) +val data = Array.fill(5)(lp) +val rdd = sc.parallelize(data) +val strategy = new OldStrategy( + OldAlgo.Classification, + Gini, + maxDepth = 2, + numClasses = 100, --- End diff -- My concern was that `numClasses=100` is set here, but the way the data is initialized suggests that `numClasses=1`. If we ever write code that validates `numClasses` against the data, these tests will break. (Similar concerns apply to `categoricalFeaturesInfo`.)
[GitHub] spark issue #13248: [SPARK-15194] [ML] Add Python ML API for MultivariateGau...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13248 Can you please reopen the pull request against the Spark master branch?
[GitHub] spark issue #14273: [SPARK-9140] [ML] Replace TimeTracker by MultiStopwatch
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/14273 ping @jkbradley @mengxr
[GitHub] spark pull request #14273: [SPARK-9140] [ML] Replace TimeTracker by MultiSto...
GitHub user MechCoder opened a pull request: https://github.com/apache/spark/pull/14273 [SPARK-9140] [ML] Replace TimeTracker by MultiStopwatch ## What changes were proposed in this pull request? Builds upon the work done by @hhbyyh in https://github.com/apache/spark/pull/7871. This replaces all occurrences of `TimeTracker` with the more useful `MultiStopwatch`, which can also measure the total time spent across the worker nodes, for instance in the method `binsToBestSplit` via a `DistributedStopwatch`. It also makes it easy to quantify the timing improvements proposed in https://github.com/apache/spark/pull/13959, so it should be merged before that PR is reviewed. Finally, it removes `TimeTracker`, since it is not used anywhere outside the tree module. ## How was this patch tested? It was run with `setLogLevel("INFO")`, and the following timings were printed: 16/07/19 16:45:18 INFO RandomForest: { binsToBestSplit: 26ms, chooseSplits: 301ms, findBestSplits: 307ms, findSplitsBins: 553ms, init: 1229ms, total: 1572ms } You can merge this pull request into a Git repository by running: $ git pull https://github.com/MechCoder/spark timeTracker Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14273.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14273 commit fc055532ff3afac0df14e5ff8b63358f9410eae6 Author: Yuhao Yang <hhb...@gmail.com> Date: 2015-08-02T16:25:30Z Initial draft commit c981ad554fa4706fbda40b42acbe4b275a2dbf47 Author: MechCoder <mks...@nyu.edu> Date: 2016-07-19T21:50:12Z Remove unused import commit ea9caf497f392b9149572a2ff4fcefed9d66f9ab Author: MechCoder <mks...@nyu.edu> Date: 2016-07-19T22:32:19Z Add MultiStopWatch to GBT's commit 7cb2fa09232f8512b018e0673d9b2d4402f88c86 Author: MechCoder <mks...@nyu.edu> Date: 2016-07-19T22:33:37Z Remove TimeTracker commit 
3dd9b3135722aa937b04052501876dc2b3ebb06f Author: MechCoder <mks...@nyu.edu> Date: 2016-07-19T23:21:50Z Pass MultiStopWatch instead of LocalStopWatch commit e5b077de8a901bae666ff25d2e1800caf622681b Author: MechCoder <mks...@nyu.edu> Date: 2016-07-19T23:48:51Z add distributed timer to multitimer
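The idea behind a multi-stopwatch — a registry of named timers that accumulate elapsed time and print a combined summary like the log line above — can be sketched as follows. This is an illustrative Python sketch only; the class and method names here are hypothetical and do not reflect Spark's actual `MultiStopwatch` API.

```python
import time

class MultiStopwatch:
    """Registry of named timers that accumulate elapsed time (illustrative sketch)."""

    def __init__(self):
        self._elapsed = {}   # name -> accumulated seconds
        self._started = {}   # name -> start timestamp of a running timer

    def start(self, name):
        self._started[name] = time.perf_counter()

    def stop(self, name):
        # Accumulate, so the same timer can be started and stopped repeatedly.
        self._elapsed[name] = self._elapsed.get(name, 0.0) + (
            time.perf_counter() - self._started.pop(name))

    def summary(self):
        # Mirrors the log format quoted above, e.g. "{ init: 1229ms, total: 1572ms }"
        parts = ", ".join(
            f"{name}: {int(secs * 1000)}ms"
            for name, secs in sorted(self._elapsed.items()))
        return "{ " + parts + " }"

timers = MultiStopwatch()
timers.start("findSplitsBins")
time.sleep(0.01)
timers.stop("findSplitsBins")
print(timers.summary())  # prints something like: { findSplitsBins: 10ms }
```

A distributed variant would additionally merge per-executor elapsed times (e.g. via an accumulator) into the same named entry.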
[GitHub] spark issue #7871: [SPARK-9140][MLlib] Replace TimeTracker by Stopwatch
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/7871 @hhbyyh What is your opinion about renaming `addLocal` to `addOrGetLocal`, which returns a local stopwatch if it already exists? That should solve your concerns.
[GitHub] spark issue #12374: [SPARK-14610][ML] Remove superfluous split for continuou...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12374 Outside of this PR, I would like to either: 1. Update the documentation of `findSplitsForContinuousFeature` to reflect that the return type is an array of thresholds, rather than an array of Splits. 2. Change the return type of `findSplitsForContinuousFeature` to return an array of splits directly. (The second option is preferable.)
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r69827161 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -137,14 +137,47 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { { val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, Map(), Set(), -Array(3), Gini, QuantileStrategy.Sort, +Array(2), Gini, QuantileStrategy.Sort, 0, 0, 0.0, 0, 0 ) val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble) val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) assert(splits.length === 1) assert(splits(0) === 1.0) } + +// find splits for constant feature +{ + val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, +Map(), Set(), +Array(3), Gini, QuantileStrategy.Sort, +0, 0, 0.0, 0, 0 + ) + val featureSamples = Array(0, 0, 0).map(_.toDouble) + val featureSamplesEmpty = Array.empty[Double] + val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) + assert(splits === Array[Double]()) + val splitsEmpty = +RandomForest.findSplitsForContinuousFeature(featureSamplesEmpty, fakeMetadata, 0) + assert(splitsEmpty === Array[Double]()) +} + } + + test("train with constant features") { +val lp = LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0)) +val data = Array.fill(5)(lp) +val rdd = sc.parallelize(data) +val strategy = new OldStrategy( + OldAlgo.Classification, + Gini, + maxDepth = 2, + numClasses = 100, + maxBins = 100, + categoricalFeaturesInfo = Map(0 -> 2, 1 -> 5)) +val Array(tree) = RandomForest.run(rdd, strategy, 1, "all", 42L, instr = None) +assert(tree.rootNode.impurity === -1.0) --- End diff -- Why is the impurity of the rootNode "-1"? Since there is only one class, should it not just be zero?
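The expectation in the comment above is easy to verify: the Gini impurity of a node with per-class proportions p_k is 1 − Σ p_k², which is exactly 0 when all labels belong to a single class. A minimal check (illustrative Python, not Spark's `Gini` implementation):

```python
def gini(counts):
    """Gini impurity 1 - sum(p_k^2) for a list of per-class label counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([5]))     # single class -> 0.0 (a pure node)
print(gini([2, 2]))  # two balanced classes -> 0.5 (maximally impure for 2 classes)
```

So a root impurity of -1.0 cannot be a real Gini value; it reads like a sentinel for "not computed", which is what the review question is probing.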
[GitHub] spark issue #12374: [SPARK-14610][ML] Remove superfluous split for continuou...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12374 @sethah Nice catch! This superfluous split seems to occur only for continuous features in which the number of unique values minus one is less than or equal to the number of splits. Can you update the PR title or description to reflect this change? Thanks!
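The condition described above — a continuous feature admits at most (number of distinct values − 1) usable thresholds, so asking for more splits than that yields superfluous ones — can be sketched as follows. This is illustrative logic only, not Spark's weighted-quantile implementation of `findSplitsForContinuousFeature`:

```python
def candidate_thresholds(feature_samples, num_splits):
    """Return at most min(num_splits, #distinct - 1) thresholds.

    We use the left value of each adjacent pair of sorted distinct values as
    the threshold (any value in the gap would separate the same points).
    """
    distinct = sorted(set(feature_samples))
    boundaries = distinct[:-1]  # one usable threshold per adjacent pair
    return boundaries[:num_splits] if num_splits < len(boundaries) else boundaries

# Three distinct values -> only two usable thresholds, even with num_splits=3.
print(candidate_thresholds([0, 1, 2, 2, 2, 2], num_splits=3))  # -> [0, 1]
# A constant feature has no usable threshold at all.
print(candidate_thresholds([7, 7, 7], num_splits=2))           # -> []
```

Any threshold beyond that cap would place zero points on one side, which is exactly the superfluous split this PR removes.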
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r69826097 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -137,14 +137,47 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { { val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, Map(), Set(), -Array(3), Gini, QuantileStrategy.Sort, +Array(2), Gini, QuantileStrategy.Sort, 0, 0, 0.0, 0, 0 ) val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble) val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) assert(splits.length === 1) assert(splits(0) === 1.0) } + +// find splits for constant feature +{ + val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, +Map(), Set(), +Array(3), Gini, QuantileStrategy.Sort, +0, 0, 0.0, 0, 0 + ) + val featureSamples = Array(0, 0, 0).map(_.toDouble) + val featureSamplesEmpty = Array.empty[Double] + val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) + assert(splits === Array[Double]()) + val splitsEmpty = +RandomForest.findSplitsForContinuousFeature(featureSamplesEmpty, fakeMetadata, 0) + assert(splitsEmpty === Array[Double]()) +} + } + + test("train with constant features") { +val lp = LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0)) +val data = Array.fill(5)(lp) +val rdd = sc.parallelize(data) +val strategy = new OldStrategy( + OldAlgo.Classification, + Gini, + maxDepth = 2, + numClasses = 100, --- End diff -- `numClasses=100`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r69825860 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -137,14 +137,47 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { { val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, Map(), Set(), -Array(3), Gini, QuantileStrategy.Sort, +Array(2), Gini, QuantileStrategy.Sort, 0, 0, 0.0, 0, 0 ) val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble) val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) assert(splits.length === 1) assert(splits(0) === 1.0) } + +// find splits for constant feature +{ + val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, +Map(), Set(), +Array(3), Gini, QuantileStrategy.Sort, +0, 0, 0.0, 0, 0 + ) + val featureSamples = Array(0, 0, 0).map(_.toDouble) + val featureSamplesEmpty = Array.empty[Double] + val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) + assert(splits === Array[Double]()) + val splitsEmpty = +RandomForest.findSplitsForContinuousFeature(featureSamplesEmpty, fakeMetadata, 0) + assert(splitsEmpty === Array[Double]()) +} + } + + test("train with constant features") { +val lp = LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0)) +val data = Array.fill(5)(lp) +val rdd = sc.parallelize(data) +val strategy = new OldStrategy( + OldAlgo.Classification, + Gini, + maxDepth = 2, + numClasses = 100, + maxBins = 100, + categoricalFeaturesInfo = Map(0 -> 2, 1 -> 5)) --- End diff -- I would just remove `categoricalFeaturesInfo` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r69825752 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -137,14 +137,47 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { { val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, Map(), Set(), -Array(3), Gini, QuantileStrategy.Sort, +Array(2), Gini, QuantileStrategy.Sort, 0, 0, 0.0, 0, 0 ) val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble) val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) assert(splits.length === 1) assert(splits(0) === 1.0) } + +// find splits for constant feature +{ + val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, +Map(), Set(), +Array(3), Gini, QuantileStrategy.Sort, +0, 0, 0.0, 0, 0 + ) + val featureSamples = Array(0, 0, 0).map(_.toDouble) + val featureSamplesEmpty = Array.empty[Double] + val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) + assert(splits === Array[Double]()) + val splitsEmpty = +RandomForest.findSplitsForContinuousFeature(featureSamplesEmpty, fakeMetadata, 0) + assert(splitsEmpty === Array[Double]()) +} + } + + test("train with constant features") { --- End diff -- "train with constant features" -> "train with constant continuous features"? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r69824592 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -137,14 +137,47 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { { val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, Map(), Set(), -Array(3), Gini, QuantileStrategy.Sort, +Array(2), Gini, QuantileStrategy.Sort, 0, 0, 0.0, 0, 0 ) val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble) val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) assert(splits.length === 1) assert(splits(0) === 1.0) } + +// find splits for constant feature +{ + val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, +Map(), Set(), +Array(3), Gini, QuantileStrategy.Sort, +0, 0, 0.0, 0, 0 + ) + val featureSamples = Array(0, 0, 0).map(_.toDouble) + val featureSamplesEmpty = Array.empty[Double] + val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) + assert(splits === Array[Double]()) + val splitsEmpty = --- End diff -- When will this ever happen, or in other words what corner case does this fix? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r69824338 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -114,7 +114,7 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { ) val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble) val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) - assert(splits.length === 3) + assert(splits.length === 2) --- End diff -- I would check the splits explicitly, that is, `Array(1, 2)`.
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r69824214 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -712,17 +712,23 @@ private[spark] object RandomForest extends Logging { splitIndex += 1 } // Find best split. - val (bestFeatureSplitIndex, bestFeatureGainStats) = -Range(0, numSplits).map { case splitIdx => - val leftChildStats = binAggregates.getImpurityCalculator(nodeFeatureOffset, splitIdx) - val rightChildStats = -binAggregates.getImpurityCalculator(nodeFeatureOffset, numSplits) - rightChildStats.subtract(leftChildStats) - gainAndImpurityStats = calculateImpurityStats(gainAndImpurityStats, -leftChildStats, rightChildStats, binAggregates.metadata) - (splitIdx, gainAndImpurityStats) -}.maxBy(_._2.gain) - (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats) + if (numSplits == 0) { --- End diff -- This seems slightly hacky to me. What is your opinion about filtering out the feature indices that have zero splits (something similar to this)?

```scala
val validFeaturesSplits = Range(0, binAggregates.metadata.numFeaturesPerNode).filter { featureIndexIdx =>
  val featureIndex = if (featuresForNode.nonEmpty) {
    featuresForNode.get.apply(featureIndexIdx)
  } else {
    featureIndexIdx
  }
  binAggregates.metadata.numSplits(featureIndex) != 0
}
```

That would avoid rewriting code for this corner case in PRs such as https://github.com/apache/spark/pull/13959 and https://github.com/apache/spark/pull/8540
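The guard being proposed — drop features with zero candidate splits before taking the arg-max over gains, instead of special-casing `numSplits == 0` inside the loop — can be illustrated with a small Python sketch (hypothetical names mirroring the Scala above, not Spark's actual code):

```python
def best_split(num_splits_per_feature, gain):
    """Pick the (feature, split) pair with maximal gain.

    Features with zero candidate splits are filtered out up front, so the
    arg-max never sees them and no numSplits == 0 special case is needed.
    `gain` is a callable gain(feature_idx, split_idx) -> float.
    """
    valid = [f for f, n in enumerate(num_splits_per_feature) if n != 0]
    candidates = [(f, s) for f in valid for s in range(num_splits_per_feature[f])]
    if not candidates:
        return None  # no feature can be split at this node -> make it a leaf
    return max(candidates, key=lambda fs: gain(*fs))

# Toy gain that grows with feature and split index.
g = lambda f, s: f + 0.1 * s
print(best_split([0, 2, 3], g))  # feature 0 is filtered out -> (2, 2)
```

The empty-candidates case is the constant-features scenario from the test above: every feature is filtered out, and the node simply cannot be split.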
[GitHub] spark issue #14016: [SPARK-16399] Force PYSPARK_PYTHON to python
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/14016 I agree with you; I've created a new JIRA and renamed the title.
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69770334 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +97,25 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val varianceData: RDD[LabeledPoint] = TreeTests.varianceData(sc) +val varianceDF = TreeTests.setMetadata(varianceData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) + .setSeed(0) +val transformVarDF = dt.fit(varianceDF).transform(varianceDF) +val calculatedVariances = transformVarDF.select(dt.getVarianceCol).collect().map { --- End diff -- Ah, I see. Thanks for the note. I wasn't familiar with the Dataset API till now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13981: [SPARK-16307] [ML] Add test to verify the predicted vari...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13981 Thanks @sethah @yanboliang for the reviews!!
[GitHub] spark issue #13981: [SPARK-16307] [ML] Add test to verify the predicted vari...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13981 @yanboliang Would appreciate it if you could look at https://github.com/apache/spark/pull/13650
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69660324 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +97,25 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val varianceData: RDD[LabeledPoint] = TreeTests.varianceData(sc) +val varianceDF = TreeTests.setMetadata(varianceData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) + .setSeed(0) +val transformVarDF = dt.fit(varianceDF).transform(varianceDF) +val calculatedVariances = transformVarDF.select(dt.getVarianceCol).collect().map { --- End diff -- You mean after the collect? It fails with `is not a member of Seq[org.apache.spark.sql.Row]` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13981: [SPARK-16307] [ML] Add test to verify the predicted vari...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13981 OK, that should be it. I removed all the unused variables and imports.
[GitHub] spark issue #13981: [SPARK-16307] [ML] Add test to verify the predicted vari...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13981 @yanboliang Thanks! Addressed your comments. Let me know if there is anything else.
[GitHub] spark issue #8013: [SPARK-3181][MLLIB]: Add Robust Regression Algorithm with...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/8013 I'll be happy to review it.
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69419165

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala ---
@@ -96,6 +108,15 @@ class DecisionTreeRegressorSuite
       assert(variance === expectedVariance,
         s"Expected variance $expectedVariance but got $variance.")
     }
+
+    val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0)
+    dt.setMaxDepth(1)
+      .setMaxBins(6)
--- End diff --

If you would like to reduce the number of warnings, then this should be kept as is (unless I am misunderstanding something).
[GitHub] spark issue #13981: [SPARK-16307] [ML] Add test to verify the predicted vari...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13981 I'm slightly in favour of keeping the original test because the impurity is set to "variance" explicitly by the `setImpurity` method, so it's a safe assumption that the `calculate` method returns the variance.
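As an aside, the manual check being discussed (verifying that an impurity's `calculate` really returns the variance) can be sketched in plain Python. This is an illustrative stand-in, not Spark's code: `impurity_calculate` mimics a sufficient-statistics (count, sum, sum of squares) variance computation, and the labels are hypothetical.

```python
def population_variance(labels):
    """Population variance computed by hand: mean of squared deviations."""
    n = len(labels)
    mean = sum(labels) / n
    return sum((x - mean) ** 2 for x in labels) / n

def impurity_calculate(count, total, total_sq):
    """Variance from the sufficient statistics a tree impurity typically
    aggregates: (count, sum, sum of squares)."""
    mean = total / count
    return total_sq / count - mean * mean

labels = [1.0, 2.0, 3.0, 6.0]
stats = (len(labels), sum(labels), sum(x * x for x in labels))
# The two routes must agree, which is what the test wants to pin down.
assert abs(population_variance(labels) - impurity_calculate(*stats)) < 1e-12
```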
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69363260

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala ---
@@ -96,6 +108,15 @@ class DecisionTreeRegressorSuite
       assert(variance === expectedVariance,
         s"Expected variance $expectedVariance but got $variance.")
     }
+
+    val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0)
+    dt.setMaxDepth(1)
+      .setMaxBins(6)
--- End diff --

"Explicit is better than implicit" ;)
[GitHub] spark issue #14016: [SPARK-15761] [FOLLOWUP] Set DEFAULT_PYTHON to python
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/14016 @srowen fixed.
[GitHub] spark issue #13981: [SPARK-16307] [ML] Add test to verify the predicted vari...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13981 @sethah Thank you for your comments. I have addressed them. Do you have anything else?
[GitHub] spark issue #14016: [SPARK-15761] [FOLLOWUP] Set DEFAULT_PYTHON to python
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/14016 Thanks for clarifying! It might be a good time to get rid of it.
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69333966

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala ---
@@ -96,6 +108,15 @@ class DecisionTreeRegressorSuite
       assert(variance === expectedVariance,
         s"Expected variance $expectedVariance but got $variance.")
     }
+
+    val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0)
+    dt.setMaxDepth(1)
+      .setMaxBins(6)
--- End diff --

I verified and my intuition was correct. I get this warning for the default setting:

    WARN DecisionTreeMetadata: DecisionTree reducing maxBins from 32 to 6 (= number of training instances)
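The warning quoted above comes from the tree metadata capping the number of bins at the number of training instances. A minimal sketch of that capping logic (the function name is illustrative, not Spark's):

```python
def effective_max_bins(requested_bins, num_instances):
    # A decision tree cannot use more candidate bins than there are
    # training instances, so the requested value is capped. When the cap
    # kicks in, Spark logs the "reducing maxBins" warning quoted above.
    return min(requested_bins, num_instances)

# Default setting on the 6-point toy data: reduced from 32 to 6 (warns).
assert effective_max_bins(32, 6) == 6
# Setting maxBins to 6 explicitly: no reduction, no warning.
assert effective_max_bins(6, 6) == 6
```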
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69332114

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala ---
@@ -96,6 +108,15 @@ class DecisionTreeRegressorSuite
       assert(variance === expectedVariance,
         s"Expected variance $expectedVariance but got $variance.")
     }
+
+    val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0)
+    dt.setMaxDepth(1)
+      .setMaxBins(6)
--- End diff --

Because there are 6 datapoints, and I want each datapoint to be a split.
[GitHub] spark issue #14016: [SPARK-15761] [FOLLOWUP] Set DEFAULT_PYTHON to python
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/14016 ping @srowen @JoshRosen
[GitHub] spark pull request #14016: [SPARK-15761] [FOLLOWUP] Set DEFAULT_PYTHON to py...
GitHub user MechCoder opened a pull request: https://github.com/apache/spark/pull/14016 [SPARK-15761] [FOLLOWUP] Set DEFAULT_PYTHON to python

## What changes were proposed in this pull request?

I would like to change

```bash
if hash python2.7 2>/dev/null; then
  # Attempt to use Python 2.7, if installed:
  DEFAULT_PYTHON="python2.7"
else
  DEFAULT_PYTHON="python"
fi
```

to just

```bash
DEFAULT_PYTHON="python"
```

I'm not sure it is a good assumption that python2.7 should be used by default when `python` points to something else.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MechCoder/spark followup

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14016.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14016

commit 4661493466ff220ede24257c2c83274ea78f73fb
Author: MechCoder <mks...@nyu.edu>
Date: 2016-07-01T17:24:46Z
[SPARK-15761] Set DEFAULT_PYTHON to python
[GitHub] spark issue #13503: [SPARK-15761] [MLlib] [PySpark] Load ipython when defaul...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13503 retest this please
[GitHub] spark issue #13503: [SPARK-15761] [MLlib] [PySpark] Load ipython when defaul...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13503 I would also like to change

```bash
if hash python2.7 2>/dev/null; then
  # Attempt to use Python 2.7, if installed:
  DEFAULT_PYTHON="python2.7"
else
  DEFAULT_PYTHON="python"
fi
```

to just

```bash
DEFAULT_PYTHON="python"
```

I'm not sure it is a good assumption that python2.7 should be used by default when `python` points to something else.
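The interpreter-selection logic being questioned can be modeled as a pure function. This is an illustrative Python sketch of the shell snippet's behaviour, not the actual script; `available` stands in for `hash <cmd>` succeeding on the PATH:

```python
def pick_default_python(available):
    """Mirror of the shell logic: prefer python2.7 if installed,
    otherwise fall back to plain python."""
    return "python2.7" if "python2.7" in available else "python"

# Current behaviour the PR questions: python2.7 wins even when the
# user's `python` deliberately points somewhere else.
assert pick_default_python({"python2.7", "python"}) == "python2.7"
# The proposed change makes the result always "python", i.e. the
# function degenerates to a constant.
assert pick_default_python({"python"}) == "python"
```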
[GitHub] spark issue #13503: [SPARK-15761] [MLlib] [PySpark] Load ipython when defaul...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13503 @JoshRosen Fixed, thanks! Let me know if you need any other changes.
[GitHub] spark issue #13503: [SPARK-15761] [MLlib] [PySpark] Load ipython when defaul...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13503 bump?
[GitHub] spark issue #12983: [SPARK-15213][PySpark] Unify 'range' usages
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12983 I don't really get the difference; could you please explain it to me? The previous version renamed `range` in Python 3 to `xrange`, and this pull request does the same thing by renaming `xrange` in Python 2 to `range`. Not sure this is necessary. There should be no performance change, since both are evaluated lazily.
[GitHub] spark pull request #13997: [SPARK-16328][ML][MLLIB][PYSPARK] Add 'asML' and ...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13997#discussion_r69176771

--- Diff: python/pyspark/mllib/linalg/__init__.py ---
@@ -1044,6 +1122,28 @@ def toSparse(self):
         return SparseMatrix(self.numRows, self.numCols, colPtrs, rowIndices, values)

+    def asML(self):
+        """
+        Convert this matrix to the new mllib-local representation.
+        This does NOT copy the data; it copies references.
+
+        >>> mllibDM = Matrices.dense(2, 2, [0, 1, 2, 3])
+        >>> mlDM1 = newlinalg.Matrices.dense(2, 2, [0, 1, 2, 3])
+        >>> mlDM2 = mllibDM.asML()
+        >>> mlDM2 == mlDM1
+        True
+        >>> mllibDMt = DenseMatrix(2, 2, [0, 1, 2, 3], True)
+        >>> mlDMt1 = newlinalg.DenseMatrix(2, 2, [0, 1, 2, 3], True)
+        >>> mlDMt2 = mllibDMt.asML()
+        >>> mlDMt2 == mlDMt1
+        True
+
+        :return: :py:class:`pyspark.ml.linalg.DenseMatrix`
+
+        .. versionadded:: 2.0.0
+        """
+        return newlinalg.DenseMatrix(self.numRows, self.numCols, self.values, self.isTransposed)
--- End diff --

> 79 ;)
[GitHub] spark issue #13997: [SPARK-16328][ML][MLLIB][PYSPARK] Add 'asML' and 'fromML...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13997 LGTM pending nitpicks.
[GitHub] spark pull request #13997: [SPARK-16328][ML][MLLIB][PYSPARK] Add 'asML' and ...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13997#discussion_r69176457

--- Diff: python/pyspark/mllib/linalg/__init__.py ---
@@ -846,6 +890,33 @@ def dense(*elements):
         return DenseVector(elements)

     @staticmethod
+    def fromML(vec):
+        """
+        Convert a vector from the new mllib-local representation.
+        This does NOT copy the data; it copies references.
+
+        >>> mllibDV1 = Vectors.dense([1, 2, 3])
+        >>> mlDV = newlinalg.Vectors.dense([1, 2, 3])
+        >>> mllibDV2 = Vectors.fromML(mlDV)
+        >>> mllibDV1 == mllibDV2
+        True
+        >>> mllibSV1 = Vectors.sparse(4, {1: 1.0, 3: 5.5})
+        >>> mlSV = newlinalg.Vectors.sparse(4, {1: 1.0, 3: 5.5})
+        >>> mllibSV2 = Vectors.fromML(mlSV)
+        >>> mllibSV1 == mllibSV2
+        True
+
+        :param vec: a :py:class:`pyspark.ml.linalg.Vector`
+        :return: a :py:class:`pyspark.mllib.linalg.Vector`
+        """
+        if type(vec) == newlinalg.DenseVector:
--- End diff --

It's common pythonic practice to use `isinstance` in such cases. If we inherit something from `DenseVector`, then this check will fail.
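The subclass failure mode described above is easy to demonstrate in isolation. A minimal sketch with stand-in classes (not the real PySpark `DenseVector`):

```python
class DenseVector:
    """Stand-in for pyspark.ml.linalg.DenseVector."""
    pass

class MyDenseVector(DenseVector):
    """A hypothetical subclass a user might define."""
    pass

v = MyDenseVector()

# The exact-type comparison used in the diff rejects subclasses...
assert not (type(v) == DenseVector)
# ...while isinstance accepts them, which is the pythonic check.
assert isinstance(v, DenseVector)
```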
[GitHub] spark pull request #13650: [SPARK-9623] [ML] Provide variance for RandomFore...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13650#discussion_r69039559

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/RandomForestRegressorSuite.scala ---
@@ -105,6 +108,55 @@ class RandomForestRegressorSuite extends SparkFunSuite with MLlibTestSparkContex
     }
   }
+
+  test("Random Forest variance") {
--- End diff --

The first test is meant to pass for all impurities, since it compares the variance of a forest with one tree (with bootstrapping turned off). You are right that we have to be deterministic about checking the predicted variances. I have done it for the DecisionTrees here (https://github.com/apache/spark/pull/13981), but I'm not sure it is straightforward for a RandomForest.
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
GitHub user MechCoder opened a pull request: https://github.com/apache/spark/pull/13981 [SPARK-16307] [ML] Add test to verify the predicted variances of a DT on toy data

## What changes were proposed in this pull request?

The current test assumes that `impurity.calculate()` returns the variance correctly. It would be better to make the test independent of this assumption; in other words, verify that the computed variance equals the variance calculated manually on a small tree.

## How was this patch tested?

The patch is a test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MechCoder/spark dt_variance

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13981.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13981
[GitHub] spark issue #13981: [SPARK-16307] [ML] Add test to verify the predicted vari...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13981 @yanboliang Could you have a look?
[GitHub] spark pull request #13650: [SPARK-9623] [ML] Provide variance for RandomFore...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13650#discussion_r68985017

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/RandomForestRegressor.scala ---
@@ -168,15 +173,39 @@ class RandomForestRegressionModel private[ml] (
   // Note: We may add support for weights (based on tree performance) later on.
   private lazy val _treeWeights: Array[Double] = Array.fill[Double](_trees.length)(1.0)

+  @Since("2.1.0")
+  /** @group getParam */
+  def setVarianceCol(value: String): this.type = set(varianceCol, value)
+
   @Since("1.4.0")
   override def treeWeights: Array[Double] = _treeWeights

   override protected def transformImpl(dataset: Dataset[_]): DataFrame = {
     val bcastModel = dataset.sparkSession.sparkContext.broadcast(this)
+
+    var output = dataset
+
     val predictUDF = udf { (features: Any) =>
       bcastModel.value.predict(features.asInstanceOf[Vector])
     }
-    dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol))))
+    val predictions = predictUDF(col($(featuresCol)))
+    output = dataset.withColumn($(predictionCol), predictions)
+
+    val varianceUDF = udf { (features: Any) =>
+      val leafNodes = bcastModel.value.returnLeafNodes(features.asInstanceOf[Vector])
+      leafNodes.map { leafNode =>
--- End diff --

Nice! I'll address this for the `RandomForest` here, and we can move the decision-tree test strengthening to another PR.
[GitHub] spark issue #13959: [SPARK-14351] [MLlib] [ML] Optimize findBestSplits metho...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13959 The test failure is just due to binary incompatibility. I can fix that once we decide that the current PR is the way to proceed.
[GitHub] spark issue #13959: [SPARK-14351] [MLlib] [ML] Optimize findBestSplits metho...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13959 @jkbradley @sethah Please have a look when free!
[GitHub] spark pull request #13959: [SPARK-14351] [MLlib] [ML] Optimize findBestSplit...
GitHub user MechCoder opened a pull request: https://github.com/apache/spark/pull/13959 [SPARK-14351] [MLlib] [ML] Optimize findBestSplits method for decision trees (and random forest)

## What changes were proposed in this pull request?

The current `findBestSplits` method creates an instance of `ImpurityCalculator` and `ImpurityStats` for every possible split and feature in the search for the best split. Every instance of `ImpurityCalculator` creates an array of size `statsSize`, which is unnecessary and takes a non-negligible amount of time. This pull request tackles the problem as follows:

1. Remove the `ImpurityCalculator` instantiation for every possible split and feature. Replace it with a `calculateGain` method for each impurity that computes the gain directly from the `allStats` attribute of the `DTStatsAggregator`, which holds all the necessary information.
2. Instead of returning an instance of `ImpurityStats` for every possible split and feature, return just the information gain, since the gain is sufficient to determine the `bestSplit`. Construct an `ImpurityStats` instance only once, for the `bestSplit`.
3. Remove the not-so-useful `calculateImpurityStats` method.

## How was this patch tested?

Since this is a performance improvement, timing benchmarks are necessary. Here are the improvements for a `RandomForestRegressor` with `maxDepth` set to 30, `subSamplingRate` set to 1 and `maxBins` set to 20 on synthetic data. The timings were measured locally, taking the mean of 3 attempts.
| n_trees | n_samples | n_features | time in master | total time in this branch |
| - | :-: | ---: | ---: | --: |
| 1 | 1 | 500 | 8.954 | 7.786 |
| 10 | 1 | 500 | 9.44 | 6.825 |
| 100 | 1 | 500 | 18.457 | 16.498 |
| 1 | 500 | 1 | 8.718 | 6.783 |
| 10 | 500 | 1 | 8.579 | 6.853 |
| 100 | 500 | 1 | 17.593 | 15.905 |
| 1 | 1000 | 1000 | 8.323 | 6.456 |
| 10 | 1000 | 1000 | 8.841 | 6.633 |
| 100 | 1000 | 1000 | 17.834 | 16.077 |
| 500 | 1000 | 1000 | 64.3 | 58.94 |

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MechCoder/spark again

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13959.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13959

commit 64d066b90b152ceb71b185b7e17313486974ae77
Author: MechCoder <mks...@nyu.edu>
Date: 2016-06-27T20:42:44Z
Add calculateGain method to all Impurity objects

commit f1d8c8950f8adace6ee175cd569b20ed6468bb61
Author: MechCoder <mks...@nyu.edu>
Date: 2016-06-27T21:32:56Z
Refactor gain calculation for categorical splits

commit 6e31e3a7b36981c8ccbf867e013363aa6f784e39
Author: MechCoder <mks...@nyu.edu>
Date: 2016-06-27T22:58:10Z
Remove impurity calculation to outside the for loop

commit ea4a0735c14ff91ad1071fb517da3fd890080354
Author: MechCoder <mks...@nyu.edu>
Date: 2016-06-28T00:45:36Z
Remove per feature impurityCalculator initialization

commit ca8b36088b74cacb7f162fb793070c4d3c6a1a8c
Author: MechCoder <mks...@nyu.edu>
Date: 2016-06-28T17:27:40Z
Get rid of calculateImpurityStats

commit 67b401a6a0e59b48a167e4f3036ca9f3f6a5df1f
Author: MechCoder <mks...@nyu.edu>
Date: 2016-06-28T18:17:32Z
where did that come from?

commit e8b89141f6cabfef5f582fe9521f4443afa9ec65
Author: MechCoder <mks...@nyu.edu>
Date: 2016-06-29T00:01:55Z
Add documentation
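The core of the optimization described in this PR is computing the information gain directly from aggregated statistics instead of allocating calculator objects per candidate split. A hedged Python sketch of such a gain computation with a variance impurity; the function names and the (count, sum, sum-of-squares) triple layout are illustrative, not Spark's actual API:

```python
def variance_from_stats(count, total, total_sq):
    """Variance from sufficient statistics: E[x^2] - (E[x])^2."""
    mean = total / count
    return total_sq / count - mean * mean

def calculate_gain(left, right):
    """Information gain of a split, computed straight from the
    (count, sum, sum-of-squares) triples of each child. No
    intermediate calculator objects are allocated per split."""
    parent = tuple(l + r for l, r in zip(left, right))
    n, n_l, n_r = parent[0], left[0], right[0]
    return (variance_from_stats(*parent)
            - (n_l / n) * variance_from_stats(*left)
            - (n_r / n) * variance_from_stats(*right))

left = (2, 2.0, 2.0)     # stats for labels [1, 1]
right = (2, 10.0, 50.0)  # stats for labels [5, 5]
# A perfect split removes all variance, so gain == parent variance (4.0).
assert abs(calculate_gain(left, right) - 4.0) < 1e-12
```

Finding the best split then reduces to taking the argmax of this scalar over all candidates, deferring any richer stats object to the winner only.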
[GitHub] spark issue #7963: [SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/7963 Bump?
[GitHub] spark issue #13650: [SPARK-9623] [ML] Provide variance for RandomForestRegre...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13650 cc: @yanboliang @MLnick
[GitHub] spark pull request #13650: [SPARK-9623] [ML] Provide variance for RandomFore...
GitHub user MechCoder opened a pull request: https://github.com/apache/spark/pull/13650 [SPARK-9623] [ML] Provide variance for RandomForestRegressor predictions

## What changes were proposed in this pull request?

It is useful to get the variance of predictions from the `RandomForestRegressor` in order to plot confidence intervals on the predictions. I verified the formula from page 17 of this paper (http://arxiv.org/pdf/1211.0906v2.pdf).

## How was this patch tested?

I added a couple of tests to the RandomForestRegression test suite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MechCoder/spark random_forest_var

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13650.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13650

commit 75254c91cf8d9c2f3638a3f9b1cfd5c029e10996
Author: MechCoder <mks...@nyu.edu>
Date: 2016-06-09T18:22:53Z
[SPARK-9623] [ML] Provide variance for RandomForestRegressor predictions
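To make the idea concrete, here is a deliberately simplified sketch: a per-instance variance estimate taken as the spread of individual tree predictions around the forest mean. Note this is an illustrative simplification, NOT necessarily the exact formula from the cited paper (arXiv:1211.0906) that the PR implements:

```python
def forest_prediction_and_variance(tree_predictions):
    """Forest prediction = mean of tree predictions; a naive variance
    estimate = population variance of those predictions. Stand-in for
    the paper's formula, for intuition only."""
    n = len(tree_predictions)
    mean = sum(tree_predictions) / n
    var = sum((p - mean) ** 2 for p in tree_predictions) / n
    return mean, var

mean, var = forest_prediction_and_variance([1.0, 2.0, 3.0])
assert abs(mean - 2.0) < 1e-12
assert abs(var - 2.0 / 3.0) < 1e-12
```

A wide spread across trees signals an instance the ensemble disagrees on, which is exactly what a confidence interval around the prediction should reflect.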
[GitHub] spark issue #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrowth fail...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13493 lgtm cc: @MLnick
[GitHub] spark issue #13540: [SPARK-15788][PYSPARK][ML] PySpark IDFModel missing "idf...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13540 LGTM as well, pending the nitpick by @BryanCutler. Not related, but it's been a while since I hacked on Spark or PySpark; at some point, do we need better docs for PySpark? I couldn't figure out how the IDFs are calculated without looking at the Scala documentation.
[GitHub] spark issue #12370: [SPARK-14599][ML] BaggedPoint should support sample weig...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12370 Should there be a sanity check that provides an input RDD of instance objects and `extractSampleWeight` as a callable that simply returns the weight for each instance? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #12370: [SPARK-14599][ML] BaggedPoint should support samp...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12370#discussion_r65994490 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/BaggedPoint.scala --- @@ -33,13 +33,20 @@ import org.apache.spark.util.random.XORShiftRandom * this datum has 1 copy, 0 copies, and 4 copies in the 3 subsamples, respectively. * * @param datum Data instance - * @param subsampleWeights Weight of this instance in each subsampled dataset. - * - * TODO: This does not currently support (Double) weighted instances. Once MLlib has weighted - * dataset support, update. (We store subsampleWeights as Double for this future extension.) + * @param subsampleCounts Number of samples of this instance in each subsampled dataset. + * @param sampleWeight The weight of this instance. */ -private[spark] class BaggedPoint[Datum](val datum: Datum, val subsampleWeights: Array[Double]) - extends Serializable +private[spark] class BaggedPoint[Datum]( +val datum: Datum, +val subsampleCounts: Array[Int], +val sampleWeight: Double) extends Serializable { + + /** + * Subsample counts weighted by the sample weight. + */ + def weightedCounts: Array[Double] = subsampleCounts.map(_ * sampleWeight) --- End diff -- Should this be a `val`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
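The change under review stores integer subsample counts plus a single sample weight per point, and `weightedCounts` multiplies them on demand. A minimal Python sketch of the same idea (names mirror the diff, but this is illustrative, not Spark code); the `val`-vs-`def` question above is a cache-the-array versus recompute-per-call trade-off:

```python
class BaggedPoint:
    """One datum with its per-subsample counts and a single sample weight."""

    def __init__(self, datum, subsample_counts, sample_weight):
        self.datum = datum
        self.subsample_counts = subsample_counts  # int copies per subsample
        self.sample_weight = sample_weight

    def weighted_counts(self):
        # Recomputed on every call (the `def` variant in the diff);
        # precomputing this in __init__ (the `val` variant) would trade
        # memory for speed when it is read many times.
        return [c * self.sample_weight for c in self.subsample_counts]

# a point with 1, 0, and 4 copies in three subsamples, weight 0.5
bp = BaggedPoint("row-0", [1, 0, 4], 0.5)
```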
[GitHub] spark issue #13503: [SPARK-15761] [MLlib] [PySpark] Load ipython when defaul...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13503 Merge? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13503: [SPARK-15761] [MLlib] [PySpark] Load ipython when defaul...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13503 cc @JoshRosen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13248: [SPARK-15194] [ML] Add Python ML API for MultivariateGau...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13248 @praveendareddy21 Just made a first pass. Also, please run a PEP8 check on your code. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794904
--- Diff: python/pyspark/ml/stat/distribution.py ---
@@ -0,0 +1,267 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.linalg import DenseVector, DenseMatrix, Vector
+import numpy as np
+
+__all__ = ['MultivariateGaussian']
+
+
+class MultivariateGaussian():
+    """
+    This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution.
+    In the event that the covariance matrix is singular, the density will be computed in a
+    reduced dimensional subspace under which the distribution is supported.
+    (see [[http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case]])
+
+    mu    The mean vector of the distribution
+    sigma The covariance matrix of the distribution
+
+    >>> mu = Vectors.dense([0.0, 0.0])
+    >>> sigma = DenseMatrix(2, 2, [1.0, 1.0, 1.0, 1.0])
+    >>> x = Vectors.dense([1.0, 1.0])
+    >>> m = MultivariateGaussian(mu, sigma)
+    >>> m.pdf(x)
+    0.0682586811486
+    """
+
+    def __init__(self, mu, sigma):
+        """
+        __init__(self, mu, sigma)
+
+        mu    The mean vector of the distribution
+        sigma The covariance matrix of the distribution
+
+        mu and sigma must be instances of DenseVector and DenseMatrix respectively.
+        """
+        assert isinstance(mu, DenseVector), "mu must be a DenseVector Object"
+        assert isinstance(sigma, DenseMatrix), "sigma must be a DenseMatrix Object"
+
+        sigma_shape = sigma.toArray().shape
+        assert sigma_shape[0] == sigma_shape[1], "Covariance matrix must be square"
+        assert sigma_shape[0] == mu.size, "Mean vector length must match covariance matrix size"
+
+        # initialize eagerly precomputed attributes
+        self.mu = mu
+
+        # sigma is stored as a numpy.ndarray;
+        # further calculations are done on the ndarray only
+        self.sigma = sigma.toArray()
+
+        # initialize attributes to be computed later
+        self.prec_U = None
+        self.log_det_cov = None
+
+        # compute distribution dependent constants
+        self.__calculateCovarianceConstants()
+
+    def pdf(self, x):
+        """
+        Returns the density of this multivariate Gaussian at a point given by Vector x
+        """
+        assert isinstance(x, Vector), "x must be of Vector Type"
+        return float(self.__pdf(x))
+
+    def logpdf(self, x):
+        """
+        Returns the log-density of this multivariate Gaussian at a point given by Vector x
+        """
+        assert isinstance(x, Vector), "x must be of Vector Type"
+        return float(self.__logpdf(x))
+
+    def __calculateCovarianceConstants(self):
+        """
+        Calculates distribution dependent components used for the density function,
+        based on the scipy multivariate library
+        (refer https://github.com/scipy/scipy/blob/master/scipy/stats/_multivariate.py);
+        tested with a precision of 9 significant digits (refer testcase)
+        """
+        try:
+            # pre-processing input parameters
+            # throws ValueError with invalid inputs
+            self.dim, self.mu, self.sigma = self.__process_parameters(None, self.mu, self.sigma)
+
+            # return the eigenvalues and eigenvectors
+            # of a Hermitian or symmetric matrix.
+            # s = eigen values
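The degenerate-case density that the quoted code is building can be sketched with plain NumPy, following the same eigendecomposition approach as SciPy: drop near-zero eigenvalues, use the pseudo-determinant, and compute the Mahalanobis distance in the supported subspace. This is an illustrative re-derivation (the function name is hypothetical, not the PR's API); it uses the full dimension d in the normalizing constant, which reproduces the docstring's doctest value for the singular covariance [[1, 1], [1, 1]]:

```python
import numpy as np

def degenerate_mvn_pdf(mu, sigma, x, eps=1e-9):
    """Density of N(mu, sigma) at x, tolerating a singular covariance:
    near-zero eigenvalues are dropped, so the pseudo-determinant and a
    projected Mahalanobis distance are used in the degenerate subspace."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    x = np.asarray(x, dtype=float)
    d = mu.size
    s, u = np.linalg.eigh(sigma)           # eigenvalues s, eigenvectors u
    keep = s > eps                         # support of the distribution
    log_pdet = np.sum(np.log(s[keep]))     # log pseudo-determinant
    prec_u = u[:, keep] / np.sqrt(s[keep])
    dev = x - mu
    maha = np.sum((dev @ prec_u) ** 2)     # Mahalanobis distance in the subspace
    log_pdf = -0.5 * (d * np.log(2.0 * np.pi) + log_pdet + maha)
    return float(np.exp(log_pdf))

p = degenerate_mvn_pdf([0.0, 0.0], [[1.0, 1.0], [1.0, 1.0]], [1.0, 1.0])
# p is approximately 0.0682586811486, the doctest value above
```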
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794126
--- Diff: python/pyspark/ml/stat/distribution.py ---
+def pdf(self,x):
--- End diff --
Should we fall back to SciPy's multivariate normal if that is present? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
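The fallback this comment suggests could look like the following: prefer SciPy's `scipy.stats.multivariate_normal` (which handles singular covariances via `allow_singular=True`) when it is importable, otherwise use a pure-NumPy density. This is only a sketch of the idea under those assumptions, not proposed Spark code:

```python
import numpy as np

def make_pdf(mu, sigma):
    """Return a pdf callable for N(mu, sigma), using SciPy when available."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    try:
        from scipy.stats import multivariate_normal
        # SciPy's frozen distribution also copes with singular covariances
        return multivariate_normal(mean=mu, cov=sigma, allow_singular=True).pdf
    except ImportError:
        # NumPy-only fallback, valid for non-singular sigma
        inv = np.linalg.inv(sigma)
        norm = 1.0 / np.sqrt((2.0 * np.pi) ** mu.size * np.linalg.det(sigma))

        def pdf(x):
            dev = np.asarray(x, dtype=float) - mu
            return float(norm * np.exp(-0.5 * dev @ inv @ dev))

        return pdf

pdf = make_pdf([0.0, 0.0], np.eye(2))
```

Either path gives the standard bivariate normal density 1/(2*pi) at the mean.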
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794056

--- Diff: python/pyspark/ml/stat/distribution.py ---
@@ -0,0 +1,267 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.linalg import DenseVector, DenseMatrix, Vector
+import numpy as np
+
+__all__ = ['MultivariateGaussian']
+
+
+class MultivariateGaussian():
+    """
+    This class provides basic functionality for a Multivariate Gaussian
+    (Normal) Distribution. In the event that the covariance matrix is
+    singular, the density will be computed in a reduced-dimensional
+    subspace under which the distribution is supported.
+    (see [[http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case]])
+
+    mu    The mean vector of the distribution
+    sigma The covariance matrix of the distribution
+
+    >>> mu = Vectors.dense([0.0, 0.0])
+    >>> sigma = DenseMatrix(2, 2, [1.0, 1.0, 1.0, 1.0])
+    >>> x = Vectors.dense([1.0, 1.0])
+    >>> m = MultivariateGaussian(mu, sigma)
+    >>> m.pdf(x)
+    0.0682586811486
+    """
+
+    def __init__(self, mu, sigma):
+        """
+        __init__(self, mu, sigma)
+
+        mu    The mean vector of the distribution
+        sigma The covariance matrix of the distribution
+
+        mu and sigma must be instances of DenseVector and DenseMatrix
+        respectively.
+        """
+        assert isinstance(mu, DenseVector), "mu must be a DenseVector object"
+        assert isinstance(sigma, DenseMatrix), "sigma must be a DenseMatrix object"
+
+        sigma_shape = sigma.toArray().shape
+        assert sigma_shape[0] == sigma_shape[1], "Covariance matrix must be square"
+        assert sigma_shape[0] == mu.size, "Mean vector length must match covariance matrix size"
+
+        # eagerly precomputed attributes
+        self.mu = mu
+
+        # store sigma as a numpy.ndarray;
+        # further calculations are done on the ndarray only
+        self.sigma = sigma.toArray()
+
+        # attributes to be computed later
+        self.prec_U = None
+        self.log_det_cov = None
+
+        # compute distribution-dependent constants
+        self.__calculateCovarianceConstants()
+
+    def pdf(self, x):
+        """
+        Returns the density of this multivariate Gaussian at the point given by Vector x.
+        """
+        assert isinstance(x, Vector), "x must be of Vector type"
+        return float(self.__pdf(x))
+
+    def logpdf(self, x):
+        """
+        Returns the log-density of this multivariate Gaussian at the point given by Vector x.
+        """
+        assert isinstance(x, Vector), "x must be of Vector type"
+        return float(self.__logpdf(x))
+
+    def __calculateCovarianceConstants(self):
+        """
+        Calculates the distribution-dependent constants used by the density
+        function, based on SciPy's multivariate library; see
+        https://github.com/scipy/scipy/blob/master/scipy/stats/_multivariate.py
+        Tested to a precision of 9 significant digits (see the test case).
+        """
+        try:
+            # pre-process input parameters;
+            # raises ValueError on invalid inputs
+            self.dim, self.mu, self.sigma = self.__process_parameters(None, self.mu, self.sigma)
+
+            # compute the eigenvalues and eigenvectors
+            # of a Hermitian or symmetric matrix
+            # s = eigenvalues
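The quoted `__calculateCovarianceConstants` follows SciPy's approach: eigendecompose sigma, treat near-zero eigenvalues as zero so singular (degenerate) covariances are handled via the pseudo-inverse, and cache `prec_U` and the log pseudo-determinant. A NumPy-only sketch of that idea (the function name and the tolerance choice are illustrative assumptions, not the PR's exact code):

```python
import numpy as np

def covariance_constants(sigma, cond=1e-9):
    """Return (prec_U, log_det_cov, rank) for a symmetric covariance matrix.

    Eigenvalues below a relative tolerance are treated as zero, so a
    singular covariance yields constants for the supported subspace.
    """
    s, u = np.linalg.eigh(sigma)           # s: eigenvalues, u: eigenvectors
    eps = cond * np.max(np.abs(s))         # relative cutoff (assumed tolerance)
    if np.min(s) < -eps:
        raise ValueError("covariance matrix must be positive semidefinite")
    keep = s > eps                         # support of the distribution
    prec_U = u[:, keep] * np.sqrt(1.0 / s[keep])  # columns scaled by 1/sqrt(eigval)
    log_det_cov = float(np.sum(np.log(s[keep])))  # log pseudo-determinant
    return prec_U, log_det_cov, int(keep.sum())

# The singular sigma from the docstring example has eigenvalues {0, 2},
# so only one eigenvector is kept and log_det_cov = log(2).
prec_U, log_det_cov, rank = covariance_constants(np.array([[1.0, 1.0],
                                                           [1.0, 1.0]]))
```

With these constants, the log-density is `-0.5 * (rank * log(2*pi) + log_det_cov + ||prec_U.T @ (x - mu)||**2)`, which is the form SciPy's `_multivariate.py` evaluates.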
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65793951

--- Diff: python/pyspark/ml/stat/distribution.py ---
+        sigma_shape = sigma.toArray().shape
+        assert sigma_shape[0] == sigma_shape[1], "Covariance matrix must be square"
+        assert sigma_shape[0] == mu.size, "Mean vector length must match covariance matrix size"
--- End diff --

You can use the `numRows` and `numCols` attributes.
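The suggestion above is to read the matrix dimensions from `DenseMatrix.numRows`/`numCols` directly rather than materializing the full array with `toArray().shape`. A sketch of the revised checks; since `pyspark` may not be installed, a minimal stand-in class exposing the same attributes is used here for illustration only:

```python
class FakeDenseMatrix:
    """Minimal stand-in exposing the same numRows/numCols attributes as
    pyspark.ml.linalg.DenseMatrix (illustrative assumption, not the real class)."""
    def __init__(self, numRows, numCols, values):
        self.numRows, self.numCols, self.values = numRows, numCols, values

def check_shapes(mu_size, sigma):
    # Cheap attribute reads instead of building a NumPy array just for .shape.
    assert sigma.numRows == sigma.numCols, "Covariance matrix must be square"
    assert sigma.numRows == mu_size, \
        "Mean vector length must match covariance matrix size"

# Mirrors the docstring example: a 2x2 sigma with a length-2 mean vector.
check_shapes(2, FakeDenseMatrix(2, 2, [1.0, 1.0, 1.0, 1.0]))
```

Skipping `toArray()` in the validation path avoids an unnecessary copy of the matrix data when the inputs are invalid.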