[GitHub] spark issue #13959: [SPARK-14351] [MLlib] [ML] Optimize findBestSplits metho...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13959 I don't understand. If you don't have time to review, that is fine (I've been there too), but there is no need to close a PR due to the unavailability of committers. This is one of the reasons I am happy to have stopped contributing to Spark and to focus my energy elsewhere... Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17621: [SPARK-6227][MLLIB][PYSPARK] Implement PySpark wrappers ...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/17621 Thanks @MLnick !
[GitHub] spark pull request #14273: [SPARK-9140] [ML] Replace TimeTracker by MultiSto...
Github user MechCoder closed the pull request at: https://github.com/apache/spark/pull/14273
[GitHub] spark issue #7963: [SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/7963 Thanks for the reviews @holdenk . Unfortunately I will not be able to work on this anytime soon. Feel free to cherry-pick the commits if you wish.
[GitHub] spark issue #14640: [SPARK-17055] [MLLIB] add labelKFold to CrossValidator
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/14640 Just FYI, we plan to rename "LabelKFold" to "GroupKFold" in the next version of sklearn, as a label can mean several things (including the target label).
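The idea behind a group-aware K-fold is that all samples sharing a group (the thing previously called "label") must land in the same test fold. A minimal pure-Python sketch of that constraint; this is an illustration, not sklearn's actual `GroupKFold` implementation, which balances folds differently:

```python
from collections import defaultdict

def group_k_fold(groups, n_splits):
    """Split sample indices into n_splits test folds such that all
    samples sharing a group land in the same fold (illustrative sketch)."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    # Greedily assign larger groups first to the currently smallest fold,
    # so fold sizes stay roughly balanced.
    folds = [[] for _ in range(n_splits)]
    for g, idxs in sorted(by_group.items(), key=lambda kv: -len(kv[1])):
        min(folds, key=len).extend(idxs)
    return folds

groups = ["a", "a", "b", "c", "c", "c", "d", "e"]
folds = group_k_fold(groups, 3)
```

Each group ends up entirely inside exactly one fold, which is the property that makes the "Group" naming clearer than "Label".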
[GitHub] spark issue #13650: [SPARK-9623] [ML] Provide conditional variance for Rando...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13650 @yanboliang Sorry for the long delay! Hope you are still here. 1. The term "variance in predictions" is ambiguous and a bit misleading. Given the original data-generating distribution, the variance in prediction for a decision tree describes how much the prediction changes from one decision tree to another, each fit on a subsample of the data. As we know, this variance is high for a single decision tree and reduces to zero for a random forest (assuming a huge number of uncorrelated trees). I have updated the PR title to reflect this. 2. No, the paper as such is not widely cited. Also, what @sethah describes is correct: this approach picks a random tree with equal probability and uses the expected variance obtained from it. However, the conditional distribution of Y|X is NOT the mean of the conditional distributions of the individual trees. That is, P(Y | x) != (P(Y_1 | x) + P(Y_2 | x) + ... + P(Y_n | x)) / n. It is only the expectation of Y|x that is given by the mean of the expectations of the individual trees. The correct way of deriving the conditional CDF of Y | x is given in the well-cited paper (http://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf). 3. However, the formula derived in that paper is the same as the weighted variance, with weights given to the target variable in the training data as defined in formula 5 of http://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf . I have verified it on synthetic data in a notebook here (https://github.com/MechCoder/Notebooks/blob/master/Conditional_variances.ipynb). I have spent more time than I initially expected on this pull request and I'm willing to do anything more that is required to merge.
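For concreteness, the weighted-variance estimate referred to in point 3 (formula 5 of the Meinshausen paper) can be sketched in a few lines. This is a toy illustration with the weights simply given, not Spark's implementation; in a real forest the weight of training point i would come from its average leaf co-membership with the query point x:

```python
def weighted_variance(ys, weights):
    """Estimate Var(Y | x) as sum_i w_i * y_i**2 - (sum_i w_i * y_i)**2,
    where the weights w_i sum to one (cf. formula 5 of Meinshausen, 2006).
    ys are the training targets; weights are assumed precomputed."""
    mean = sum(w * y for w, y in zip(weights, ys))
    return sum(w * y * y for w, y in zip(weights, ys)) - mean * mean
```

With uniform weights this reduces to the ordinary population variance of the targets, which is a quick sanity check on the formula.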
[GitHub] spark pull request #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/ca...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/14579#discussion_r75567101 --- Diff: python/pyspark/rdd.py --- @@ -188,6 +188,12 @@ def __init__(self, jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSeri self._id = jrdd.id() self.partitioner = None +def __enter__(self): --- End diff -- Yes, also known as the "if you don't know what to do, raise an error" approach :p
[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Improve handling of PySpark P...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12790 Yes, I agree that allowing `stages` to be an empty sequence or list in a Pipeline is non-intuitive, but I'm fine with allowing that corner case.
[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12790 Awesome! Thanks!
[GitHub] spark pull request #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing erro...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12790#discussion_r75390810 --- Diff: python/pyspark/status.py --- @@ -83,6 +85,8 @@ def getJobInfo(self, jobId): job = self._jtracker.getJobInfo(jobId) if job is not None: return SparkJobInfo(jobId, job.stageIds(), str(job.status())) +else: --- End diff -- Python returns None by default ;)
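The remark can be checked directly: a Python function that falls off the end without hitting a `return` yields `None`, so the explicit `else: return None` branch in the diff is redundant. The names below are illustrative, not the PySpark API:

```python
def get_job_info(jobs, job_id):
    # Mirrors the pattern in the diff: no explicit `return None` is
    # needed, because falling off the end already returns None.
    job = jobs.get(job_id)
    if job is not None:
        return ("SparkJobInfo", job_id, job)

assert get_job_info({}, 42) is None
assert get_job_info({42: "RUNNING"}, 42) == ("SparkJobInfo", 42, "RUNNING")
```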
[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12790 If that's the case, then the piece of documentation that promises the Pipeline behaves as an identity transformer when no stages are set has to be changed (or removed).
[GitHub] spark pull request #14653: [SPARK-10931][PYSPARK][ML] PySpark ML Models shou...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/14653#discussion_r75230698 --- Diff: python/pyspark/ml/wrapper.py --- @@ -243,7 +240,7 @@ def __init__(self, java_model=None): """ Initialize this instance with a Java model object. Subclasses should call this constructor, initialize params, -and then call _transfer_params_from_java. +and then call _transformer_params. --- End diff -- Not sure you intended this change.
[GitHub] spark issue #13036: [SPARK-15243][ML][SQL][PYSPARK] Param methods should use...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13036 lgtm
[GitHub] spark issue #14653: [SPARK-10931][PYSPARK][ML] PySpark ML Models should cont...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/14653 Should we start having `PredictorParams` -> (HasLabelCol, HasFeaturesCol, HasPredictionCol) and `ClassifierParams` -> (HasRawPredictionCol), as done on the Scala side?
[GitHub] spark pull request #14653: [SPARK-10931][PYSPARK][ML] PySpark ML Models shou...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/14653#discussion_r75228035 --- Diff: python/pyspark/ml/classification.py --- @@ -59,6 +59,16 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti ... Row(label=0.0, weight=2.0, features=Vectors.sparse(1, [], []))]).toDF() >>> lr = LogisticRegression(maxIter=5, regParam=0.01, weightCol="weight") >>> model = lr.fit(df) +>>> emap = lr.extractParamMap() +>>> mmap = model.extractParamMap() +>>> all([emap[getattr(lr, param.name)] == value for (param, value) in mmap.items()]) --- End diff -- style: Also `(param, value)` -> `param, value` (brackets are redundant)
[GitHub] spark pull request #14653: [SPARK-10931][PYSPARK][ML] PySpark ML Models shou...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/14653#discussion_r75227783 --- Diff: python/pyspark/ml/classification.py --- @@ -59,6 +59,16 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti ... Row(label=0.0, weight=2.0, features=Vectors.sparse(1, [], []))]).toDF() >>> lr = LogisticRegression(maxIter=5, regParam=0.01, weightCol="weight") >>> model = lr.fit(df) +>>> emap = lr.extractParamMap() +>>> mmap = model.extractParamMap() +>>> all([emap[getattr(lr, param.name)] == value for (param, value) in mmap.items()]) --- End diff -- `emap[getattr(lr, param.name)]` is the same as `emap[param]`, no?
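The question hinges on whether the `param` coming out of the model's param map keys `emap` the same way as the estimator's own `Param` attribute. The reviewer's suggestion assumes they resolve to the same key; a toy model of that assumption (these are stand-in classes, not the PySpark ones, whose `Param` equality semantics across estimator and model are exactly what is being questioned here):

```python
class Param:
    def __init__(self, parent, name):
        self.parent = parent
        self.name = name

class Estimator:
    pass

lr = Estimator()
lr.maxIter = Param(lr, "maxIter")

param = lr.maxIter        # in the doctest this would come from mmap.items()
emap = {param: 5}
# When both expressions resolve to the same object, the two lookups hit
# the same dict key, so `emap[getattr(lr, param.name)]` == `emap[param]`.
assert getattr(lr, param.name) is param
assert emap[getattr(lr, param.name)] == emap[param] == 5
```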
[GitHub] spark pull request #14653: [SPARK-10931][PYSPARK][ML] PySpark ML Models shou...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/14653#discussion_r75225024 --- Diff: python/pyspark/ml/classification.py --- @@ -59,6 +59,16 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti ... Row(label=0.0, weight=2.0, features=Vectors.sparse(1, [], []))]).toDF() >>> lr = LogisticRegression(maxIter=5, regParam=0.01, weightCol="weight") >>> model = lr.fit(df) +>>> emap = lr.extractParamMap() --- End diff -- style: `emap` -> `estimator_paramMap` `mmap` -> `model_paramMap` ?
[GitHub] spark issue #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing error if Py...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12790 LGTM: cc @yanboliang @srowen
[GitHub] spark pull request #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing erro...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12790#discussion_r75024959 --- Diff: python/pyspark/ml/pipeline.py --- @@ -57,9 +57,8 @@ def __init__(self, stages=None): """ __init__(self, stages=None) """ -if stages is None: -stages = [] super(Pipeline, self).__init__() +self._setDefault(stages=[]) --- End diff -- Could you add a comment on why this is being done for future reference?
[GitHub] spark pull request #12790: [SPARK-15018][PYSPARK][ML] Fixed bug causing erro...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12790#discussion_r75023876 --- Diff: python/pyspark/ml/tests.py --- @@ -230,6 +230,15 @@ def test_pipeline(self): self.assertEqual(5, transformer3.dataset_index) self.assertEqual(6, dataset.index) +def test_identity_pipeline(self): +dataset = MockDataset() + +def doTransform(pipeline): +pipeline_model = pipeline.fit(dataset) +return pipeline_model.transform(dataset) +self.assertEqual(dataset.index, doTransform(Pipeline()).index) +self.assertEqual(dataset.index, doTransform(Pipeline(stages=[])).index) --- End diff -- Should we also check that `setParams(stages=[])` and `Pipeline().getStages()` return the expected value?
[GitHub] spark pull request #14467: [SPARK-16861][PYSPARK][CORE] Refactor PySpark acc...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/14467#discussion_r74845665 --- Diff: python/pyspark/context.py --- @@ -173,9 +173,8 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, # they will be passed back to us through a TCP server self._accumulatorServer = accumulators._start_update_server() (host, port) = self._accumulatorServer.server_address -self._javaAccumulator = self._jsc.accumulator( -self._jvm.java.util.ArrayList(), -self._jvm.PythonAccumulatorParam(host, port)) +self._javaAccumulator = self._jvm.PythonAccumulatorV2(host, port) +self._jsc.sc().register(self._javaAccumulator) --- End diff -- I cannot fully understand why an accumulator is created for every instance of SparkContext. I see it is used when the attribute `_jrdd` is accessed, but that still does not clear things up :(
[GitHub] spark pull request #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/ca...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/14579#discussion_r74813935 --- Diff: python/pyspark/rdd.py --- @@ -188,6 +188,12 @@ def __init__(self, jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSeri self._id = jrdd.id() self.partitioner = None +def __enter__(self): --- End diff -- Is that true? Doesn't it call `__enter__` on the instance of `rdd.cache().map(...)`, where `is_cached` is set to False? Quick verification:

```python
def __enter__(self):
    if self.is_cached:
        return self
    else:
        raise ValueError("r")

with rdd.cache().map(lambda x: x) as t:
    pass
```

raises a `ValueError`
[GitHub] spark pull request #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/ca...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/14579#discussion_r74811199 --- Diff: python/pyspark/rdd.py --- @@ -188,6 +188,12 @@ def __init__(self, jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSeri self._id = jrdd.id() self.partitioner = None +def __enter__(self): --- End diff -- Is it reasonable just to raise an error saying that the context manager is meant to work only with cached RDDs (DataFrames) if `self.is_cached` is not True?
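The suggestion can be sketched with a stand-in class (not the actual PySpark RDD/DataFrame): `__enter__` refuses to proceed unless the object has been cached, and `__exit__` plays the role of `unpersist()`:

```python
class CacheAware:
    """Toy stand-in for an RDD/DataFrame whose context manager only
    works after cache() has been called (illustrative names only)."""

    def __init__(self):
        self.is_cached = False

    def cache(self):
        self.is_cached = True
        return self

    def __enter__(self):
        # The "if you don't know what to do, raise an error" approach:
        # refuse to manage an uncached object.
        if not self.is_cached:
            raise ValueError(
                "the context manager is meant for cached RDDs/DataFrames")
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.is_cached = False  # stand-in for unpersist()
        return False

with CacheAware().cache() as rdd:
    assert rdd.is_cached
```

This also makes the `rdd.cache().map(...)` pitfall discussed above fail loudly, since the mapped object would start with `is_cached` as False.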
[GitHub] spark issue #14273: [SPARK-9140] [ML] Replace TimeTracker by MultiStopwatch
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/14273 bump?
[GitHub] spark pull request #12889: [SPARK-15113][PySpark][ML] Add missing num featur...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12889#discussion_r73947457 --- Diff: python/pyspark/ml/classification.py --- @@ -44,6 +44,23 @@ @inherit_doc +class JavaClassificationModel(JavaPredictionModel): +""" +(Private) Java Model produced by a ``Classifier``. +Classes are indexed {0, 1, ..., numClasses - 1}. +To be mixed in with class:`pyspark.ml.JavaModel` +""" + +@property +@since("2.0.0") --- End diff -- Should be 2.1?
[GitHub] spark issue #12983: [SPARK-15213][PySpark] Unify 'range' usages
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12983 In sklearn, we use `sklearn.six.moves`, which allows `range` and `xrange` to be used interchangeably. In Python 3, both `range` and `xrange` return a `range` instance, and in Python 2, both return an `xrange` instance. Something like

```python
import sys

if sys.version_info[0] == 3:
    xrange = range
elif sys.version_info[0] == 2:
    range = xrange
```

That being said, I'm OK with this PR being merged as it is, since as a Python 3 user it is more natural for me to use `range` (but only as a Python 3 user). In any case, I believe we should be consistent in usage.
[GitHub] spark pull request #13571: [SPARK-15369][WIP][RFC][PySpark][SQL] Expose pote...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13571#discussion_r72870886 --- Diff: python/pyspark/sql/functions.py --- @@ -1731,13 +1749,115 @@ def sort_array(col, asc=True): # User Defined Function -- +def _wrap_jython_func(sc, src, ser_vars, ser_imports, setup_code, returnType): +return sc._jvm.org.apache.spark.sql.execution.python.JythonFunction( +src, ser_vars, ser_imports, setup_code, sc._jsc.sc()) + + def _wrap_function(sc, func, returnType): command = (func, returnType) pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command) return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes, sc.pythonExec, sc.pythonVer, broadcast_vars, sc._javaAccumulator) +class UserDefinedJythonFunction(object): +""" +User defined function in Jython - note this might be a bad idea to use. + +.. versionadded:: 2.0 +.. Note: Experimental +""" +def __init__(self, func, returnType, name=None, setupCode=""): +self.func = func +self.returnType = returnType +self.setupCode = setupCode +self._judf = self._create_judf(name) + +def _create_judf(self, name): +func = self.func +from pyspark.sql import SQLContext +sc = SparkContext.getOrCreate() +# Empty strings allow the Scala code to recognize no data and skip adding the Jython +# code to handle vars or imports if not needed. +serialized_vars = "" +serialized_imports = "" +if isinstance(func, basestring): +src = func +else: +try: +import dill --- End diff -- Currently it seems pyspark uses cloudpickle to serialize and deserialize otherwise non-serializable functions. What are the advantages of using dill here instead of cloudpickle?
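Context for the question above: the standard-library `pickle` serializes functions by module-level name, so it cannot handle lambdas or interactively defined functions; that baseline limitation is why libraries such as cloudpickle and dill exist at all. The snippet below only demonstrates the baseline failure; it does not compare the two libraries:

```python
import pickle

# Plain pickle looks the function up by its qualified name, and an
# anonymous lambda has no importable name, so pickling it fails.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_picklable = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_picklable = False
```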
[GitHub] spark issue #14273: [SPARK-9140] [ML] Replace TimeTracker by MultiStopwatch
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/14273 @jkbradley Would you be able to have a look?
[GitHub] spark pull request #12889: [SPARK-15113][PySpark][ML] Add missing num featur...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12889#discussion_r71629288 --- Diff: python/pyspark/ml/classification.py --- @@ -581,8 +602,11 @@ def _create_model(self, java_model): @inherit_doc -class DecisionTreeClassificationModel(DecisionTreeModel, JavaMLWritable, JavaMLReadable): +class DecisionTreeClassificationModel(DecisionTreeModel, JavaClassificationModel, JavaMLWritable, --- End diff -- Just curious to know why we don't expose `numClasses` in `GBTClassificationModel`. Do we not support multiclass currently, or is there some other reason?
[GitHub] spark issue #12889: [SPARK-15113][PySpark][ML] Add missing num features num ...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12889 Just `LinearRegressionModel` seems missing to me. LGTM otherwise.
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r71615972 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -137,14 +137,47 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { { val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, Map(), Set(), -Array(3), Gini, QuantileStrategy.Sort, +Array(2), Gini, QuantileStrategy.Sort, 0, 0, 0.0, 0, 0 ) val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble) val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) assert(splits.length === 1) assert(splits(0) === 1.0) } + +// find splits for constant feature +{ + val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, +Map(), Set(), +Array(3), Gini, QuantileStrategy.Sort, +0, 0, 0.0, 0, 0 + ) + val featureSamples = Array(0, 0, 0).map(_.toDouble) + val featureSamplesEmpty = Array.empty[Double] + val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) + assert(splits === Array[Double]()) + val splitsEmpty = +RandomForest.findSplitsForContinuousFeature(featureSamplesEmpty, fakeMetadata, 0) + assert(splitsEmpty === Array[Double]()) +} + } + + test("train with constant features") { +val lp = LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0)) +val data = Array.fill(5)(lp) +val rdd = sc.parallelize(data) +val strategy = new OldStrategy( + OldAlgo.Classification, + Gini, + maxDepth = 2, + numClasses = 100, --- End diff -- Is it the case that `numClasses` can be greater than the number of unique labels in the data? If yes, then ignore this comment.
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r71615183 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -692,14 +692,20 @@ private[spark] object RandomForest extends Logging { node.stats } -// For each (feature, split), calculate the gain, and select the best (feature, split). -val (bestSplit, bestSplitStats) = - Range(0, binAggregates.metadata.numFeaturesPerNode).map { featureIndexIdx => -val featureIndex = if (featuresForNode.nonEmpty) { - featuresForNode.get.apply(featureIndexIdx) +val validFeatureSplits = + Range(0, binAggregates.metadata.numFeaturesPerNode).view.map { featureIndexIdx => +if (featuresForNode.nonEmpty) { + (featureIndexIdx, featuresForNode.get.apply(featureIndexIdx)) --- End diff -- Is the `apply` here redundant?
[GitHub] spark issue #12374: [SPARK-14610][ML] Remove superfluous split for continuou...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12374 LGTM
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r71615061 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -137,14 +137,47 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { { val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, Map(), Set(), -Array(3), Gini, QuantileStrategy.Sort, +Array(2), Gini, QuantileStrategy.Sort, 0, 0, 0.0, 0, 0 ) val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble) val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) assert(splits.length === 1) assert(splits(0) === 1.0) } + +// find splits for constant feature +{ + val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, +Map(), Set(), +Array(3), Gini, QuantileStrategy.Sort, +0, 0, 0.0, 0, 0 + ) + val featureSamples = Array(0, 0, 0).map(_.toDouble) + val featureSamplesEmpty = Array.empty[Double] + val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) + assert(splits === Array[Double]()) + val splitsEmpty = +RandomForest.findSplitsForContinuousFeature(featureSamplesEmpty, fakeMetadata, 0) + assert(splitsEmpty === Array[Double]()) +} + } + + test("train with constant features") { +val lp = LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0)) +val data = Array.fill(5)(lp) +val rdd = sc.parallelize(data) +val strategy = new OldStrategy( + OldAlgo.Classification, + Gini, + maxDepth = 2, + numClasses = 100, --- End diff -- My concern was that `numClasses=100` is set here, but the way the data is initialized suggests that `numClasses=1`. If we ever write code that validates `numClasses` against the data, these tests will break. (Similar concerns apply to `categoricalFeaturesInfo`.)
[GitHub] spark issue #13248: [SPARK-15194] [ML] Add Python ML API for MultivariateGau...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13248 Can you please reopen the pull request against the Spark master branch?
[GitHub] spark issue #14273: [SPARK-9140] [ML] Replace TimeTracker by MultiStopwatch
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/14273 ping @jkbradley @mengxr
[GitHub] spark pull request #14273: [SPARK-9140] [ML] Replace TimeTracker by MultiSto...
GitHub user MechCoder opened a pull request: https://github.com/apache/spark/pull/14273 [SPARK-9140] [ML] Replace TimeTracker by MultiStopwatch ## What changes were proposed in this pull request? Builds upon the work done by @hhbyyh in https://github.com/apache/spark/pull/7871. This replaces all occurrences of `TimeTracker` with the more useful `MultiStopwatch`, which can also measure the total time spent across the worker nodes, for instance in the method `binsToBestSplit` via a `DistributedStopwatch`. It also makes it easy to quantify the timing improvements proposed in https://github.com/apache/spark/pull/13959, so it should be merged before that PR is reviewed. Finally, it removes `TimeTracker`, since it is not used anywhere outside the tree module. ## How was this patch tested? It was run with `setLogLevel("INFO")`, and the following timings were printed: 16/07/19 16:45:18 INFO RandomForest: { binsToBestSplit: 26ms, chooseSplits: 301ms, findBestSplits: 307ms, findSplitsBins: 553ms, init: 1229ms, total: 1572ms } You can merge this pull request into a Git repository by running: $ git pull https://github.com/MechCoder/spark timeTracker Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14273.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14273 commit fc055532ff3afac0df14e5ff8b63358f9410eae6 Author: Yuhao Yang <hhb...@gmail.com> Date: 2015-08-02T16:25:30Z Initial draft commit c981ad554fa4706fbda40b42acbe4b275a2dbf47 Author: MechCoder <mks...@nyu.edu> Date: 2016-07-19T21:50:12Z Remove unused import commit ea9caf497f392b9149572a2ff4fcefed9d66f9ab Author: MechCoder <mks...@nyu.edu> Date: 2016-07-19T22:32:19Z Add MultiStopWatch to GBT's commit 7cb2fa09232f8512b018e0673d9b2d4402f88c86 Author: MechCoder <mks...@nyu.edu> Date: 2016-07-19T22:33:37Z Remove TimeTracker commit 
3dd9b3135722aa937b04052501876dc2b3ebb06f Author: MechCoder <mks...@nyu.edu> Date: 2016-07-19T23:21:50Z Pass MultiStopWatch instead of LocalStopWatch commit e5b077de8a901bae666ff25d2e1800caf622681b Author: MechCoder <mks...@nyu.edu> Date: 2016-07-19T23:48:51Z add distributed timer to multitimer
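The idea behind a multi-stopwatch — a registry of named timers that accumulate elapsed time and print a combined summary like the log line above — can be sketched as follows. This is an illustrative Python sketch only; the class and method names here are hypothetical and do not reflect Spark's actual `MultiStopwatch` API.

```python
import time

class MultiStopwatch:
    """Registry of named timers that accumulate elapsed time (illustrative sketch)."""

    def __init__(self):
        self._elapsed = {}   # name -> accumulated seconds
        self._started = {}   # name -> start timestamp of a running timer

    def start(self, name):
        self._started[name] = time.perf_counter()

    def stop(self, name):
        # Accumulate, so the same timer can be started and stopped repeatedly.
        self._elapsed[name] = self._elapsed.get(name, 0.0) + (
            time.perf_counter() - self._started.pop(name))

    def summary(self):
        # Mirrors the log format quoted above, e.g. "{ init: 1229ms, total: 1572ms }"
        parts = ", ".join(
            f"{name}: {int(secs * 1000)}ms"
            for name, secs in sorted(self._elapsed.items()))
        return "{ " + parts + " }"

timers = MultiStopwatch()
timers.start("findSplitsBins")
time.sleep(0.01)
timers.stop("findSplitsBins")
print(timers.summary())  # prints something like: { findSplitsBins: 10ms }
```

A distributed variant would additionally merge per-executor elapsed times (e.g. via an accumulator) into the same named entry.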
[GitHub] spark issue #7871: [SPARK-9140][MLlib] Replace TimeTracker by Stopwatch
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/7871 @hhbyyh What is your opinion about renaming `addLocal` to `addOrGetLocal`, which returns a local stopwatch if it already exists? That should solve your concerns.
[GitHub] spark issue #12374: [SPARK-14610][ML] Remove superfluous split for continuou...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12374 Outside of this PR, I would like to either: 1. Update the documentation of `findSplitsForContinuousFeature` to reflect that the return type is an array of thresholds, rather than an array of Splits. 2. Change the return type of `findSplitsForContinuousFeature` to return an array of splits directly. (The second option is preferable.)
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r69827161 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -137,14 +137,47 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { { val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, Map(), Set(), -Array(3), Gini, QuantileStrategy.Sort, +Array(2), Gini, QuantileStrategy.Sort, 0, 0, 0.0, 0, 0 ) val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble) val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) assert(splits.length === 1) assert(splits(0) === 1.0) } + +// find splits for constant feature +{ + val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, +Map(), Set(), +Array(3), Gini, QuantileStrategy.Sort, +0, 0, 0.0, 0, 0 + ) + val featureSamples = Array(0, 0, 0).map(_.toDouble) + val featureSamplesEmpty = Array.empty[Double] + val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) + assert(splits === Array[Double]()) + val splitsEmpty = +RandomForest.findSplitsForContinuousFeature(featureSamplesEmpty, fakeMetadata, 0) + assert(splitsEmpty === Array[Double]()) +} + } + + test("train with constant features") { +val lp = LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0)) +val data = Array.fill(5)(lp) +val rdd = sc.parallelize(data) +val strategy = new OldStrategy( + OldAlgo.Classification, + Gini, + maxDepth = 2, + numClasses = 100, + maxBins = 100, + categoricalFeaturesInfo = Map(0 -> 2, 1 -> 5)) +val Array(tree) = RandomForest.run(rdd, strategy, 1, "all", 42L, instr = None) +assert(tree.rootNode.impurity === -1.0) --- End diff -- Why is the impurity of the rootNode "-1"? Since there is only one class, should it not just be zero?
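The expectation in the comment above is easy to verify: the Gini impurity of a node with per-class proportions p_k is 1 − Σ p_k², which is exactly 0 when all labels belong to a single class. A minimal check (illustrative Python, not Spark's `Gini` implementation):

```python
def gini(counts):
    """Gini impurity 1 - sum(p_k^2) for a list of per-class label counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([5]))     # single class -> 0.0 (a pure node)
print(gini([2, 2]))  # two balanced classes -> 0.5 (maximally impure for 2 classes)
```

So a root impurity of -1.0 cannot be a real Gini value; it reads like a sentinel for "not computed", which is what the review question is probing.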
[GitHub] spark issue #12374: [SPARK-14610][ML] Remove superfluous split for continuou...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12374 @sethah Nice catch! This superfluous split seems to occur only for continuous features in which the number of unique values minus one is less than or equal to the number of splits. Can you update the PR title or description to reflect this change? Thanks!
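The condition described above — a continuous feature admits at most (number of distinct values − 1) usable thresholds, so asking for more splits than that yields superfluous ones — can be sketched as follows. This is illustrative logic only, not Spark's weighted-quantile implementation of `findSplitsForContinuousFeature`:

```python
def candidate_thresholds(feature_samples, num_splits):
    """Return at most min(num_splits, #distinct - 1) thresholds.

    We use the left value of each adjacent pair of sorted distinct values as
    the threshold (any value in the gap would separate the same points).
    """
    distinct = sorted(set(feature_samples))
    boundaries = distinct[:-1]  # one usable threshold per adjacent pair
    return boundaries[:num_splits] if num_splits < len(boundaries) else boundaries

# Three distinct values -> only two usable thresholds, even with num_splits=3.
print(candidate_thresholds([0, 1, 2, 2, 2, 2], num_splits=3))  # -> [0, 1]
# A constant feature has no usable threshold at all.
print(candidate_thresholds([7, 7, 7], num_splits=2))           # -> []
```

Any threshold beyond that cap would place zero points on one side, which is exactly the superfluous split this PR removes.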
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r69826097 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -137,14 +137,47 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { { val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, Map(), Set(), -Array(3), Gini, QuantileStrategy.Sort, +Array(2), Gini, QuantileStrategy.Sort, 0, 0, 0.0, 0, 0 ) val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble) val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) assert(splits.length === 1) assert(splits(0) === 1.0) } + +// find splits for constant feature +{ + val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, +Map(), Set(), +Array(3), Gini, QuantileStrategy.Sort, +0, 0, 0.0, 0, 0 + ) + val featureSamples = Array(0, 0, 0).map(_.toDouble) + val featureSamplesEmpty = Array.empty[Double] + val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) + assert(splits === Array[Double]()) + val splitsEmpty = +RandomForest.findSplitsForContinuousFeature(featureSamplesEmpty, fakeMetadata, 0) + assert(splitsEmpty === Array[Double]()) +} + } + + test("train with constant features") { +val lp = LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0)) +val data = Array.fill(5)(lp) +val rdd = sc.parallelize(data) +val strategy = new OldStrategy( + OldAlgo.Classification, + Gini, + maxDepth = 2, + numClasses = 100, --- End diff -- `numClasses=100`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r69825860 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -137,14 +137,47 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { { val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, Map(), Set(), -Array(3), Gini, QuantileStrategy.Sort, +Array(2), Gini, QuantileStrategy.Sort, 0, 0, 0.0, 0, 0 ) val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble) val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) assert(splits.length === 1) assert(splits(0) === 1.0) } + +// find splits for constant feature +{ + val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, +Map(), Set(), +Array(3), Gini, QuantileStrategy.Sort, +0, 0, 0.0, 0, 0 + ) + val featureSamples = Array(0, 0, 0).map(_.toDouble) + val featureSamplesEmpty = Array.empty[Double] + val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) + assert(splits === Array[Double]()) + val splitsEmpty = +RandomForest.findSplitsForContinuousFeature(featureSamplesEmpty, fakeMetadata, 0) + assert(splitsEmpty === Array[Double]()) +} + } + + test("train with constant features") { +val lp = LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0)) +val data = Array.fill(5)(lp) +val rdd = sc.parallelize(data) +val strategy = new OldStrategy( + OldAlgo.Classification, + Gini, + maxDepth = 2, + numClasses = 100, + maxBins = 100, + categoricalFeaturesInfo = Map(0 -> 2, 1 -> 5)) --- End diff -- I would just remove `categoricalFeaturesInfo` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r69825752 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -137,14 +137,47 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { { val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, Map(), Set(), -Array(3), Gini, QuantileStrategy.Sort, +Array(2), Gini, QuantileStrategy.Sort, 0, 0, 0.0, 0, 0 ) val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble) val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) assert(splits.length === 1) assert(splits(0) === 1.0) } + +// find splits for constant feature +{ + val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, +Map(), Set(), +Array(3), Gini, QuantileStrategy.Sort, +0, 0, 0.0, 0, 0 + ) + val featureSamples = Array(0, 0, 0).map(_.toDouble) + val featureSamplesEmpty = Array.empty[Double] + val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) + assert(splits === Array[Double]()) + val splitsEmpty = +RandomForest.findSplitsForContinuousFeature(featureSamplesEmpty, fakeMetadata, 0) + assert(splitsEmpty === Array[Double]()) +} + } + + test("train with constant features") { --- End diff -- "train with constant features" -> "train with constant continuous features"? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r69824592 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -137,14 +137,47 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { { val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, Map(), Set(), -Array(3), Gini, QuantileStrategy.Sort, +Array(2), Gini, QuantileStrategy.Sort, 0, 0, 0.0, 0, 0 ) val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble) val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) assert(splits.length === 1) assert(splits(0) === 1.0) } + +// find splits for constant feature +{ + val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0, +Map(), Set(), +Array(3), Gini, QuantileStrategy.Sort, +0, 0, 0.0, 0, 0 + ) + val featureSamples = Array(0, 0, 0).map(_.toDouble) + val featureSamplesEmpty = Array.empty[Double] + val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) + assert(splits === Array[Double]()) + val splitsEmpty = --- End diff -- When will this ever happen, or in other words what corner case does this fix? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r69824338 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -114,7 +114,7 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { ) val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble) val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) - assert(splits.length === 3) + assert(splits.length === 2) --- End diff -- I would check the splits explicitly, that is, `Array(1, 2)`.
[GitHub] spark pull request #12374: [SPARK-14610][ML] Remove superfluous split for co...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12374#discussion_r69824214 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -712,17 +712,23 @@ private[spark] object RandomForest extends Logging { splitIndex += 1 } // Find best split. - val (bestFeatureSplitIndex, bestFeatureGainStats) = -Range(0, numSplits).map { case splitIdx => - val leftChildStats = binAggregates.getImpurityCalculator(nodeFeatureOffset, splitIdx) - val rightChildStats = -binAggregates.getImpurityCalculator(nodeFeatureOffset, numSplits) - rightChildStats.subtract(leftChildStats) - gainAndImpurityStats = calculateImpurityStats(gainAndImpurityStats, -leftChildStats, rightChildStats, binAggregates.metadata) - (splitIdx, gainAndImpurityStats) -}.maxBy(_._2.gain) - (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats) + if (numSplits == 0) { --- End diff -- This seems slightly hacky to me. What is your opinion about filtering out the feature indices that have zero splits (something similar to this)?

```scala
val validFeaturesSplits = Range(0, binAggregates.metadata.numFeaturesPerNode).filter { featureIndexIdx =>
  val featureIndex = if (featuresForNode.nonEmpty) {
    featuresForNode.get.apply(featureIndexIdx)
  } else {
    featureIndexIdx
  }
  binAggregates.metadata.numSplits(featureIndex) != 0
}
```

That would avoid rewriting code for this corner case in PRs such as https://github.com/apache/spark/pull/13959 and https://github.com/apache/spark/pull/8540
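The guard being proposed — drop features with zero candidate splits before taking the arg-max over gains, instead of special-casing `numSplits == 0` inside the loop — can be illustrated with a small Python sketch (hypothetical names mirroring the Scala above, not Spark's actual code):

```python
def best_split(num_splits_per_feature, gain):
    """Pick the (feature, split) pair with maximal gain.

    Features with zero candidate splits are filtered out up front, so the
    arg-max never sees them and no numSplits == 0 special case is needed.
    `gain` is a callable gain(feature_idx, split_idx) -> float.
    """
    valid = [f for f, n in enumerate(num_splits_per_feature) if n != 0]
    candidates = [(f, s) for f in valid for s in range(num_splits_per_feature[f])]
    if not candidates:
        return None  # no feature can be split at this node -> make it a leaf
    return max(candidates, key=lambda fs: gain(*fs))

# Toy gain that grows with feature and split index.
g = lambda f, s: f + 0.1 * s
print(best_split([0, 2, 3], g))  # feature 0 is filtered out -> (2, 2)
```

The empty-candidates case is the constant-features scenario from the test above: every feature is filtered out, and the node simply cannot be split.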
[GitHub] spark issue #14016: [SPARK-16399] Force PYSPARK_PYTHON to python
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/14016 I agree with you; I've created a new JIRA and renamed the title.
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69770334 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +97,25 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val varianceData: RDD[LabeledPoint] = TreeTests.varianceData(sc) +val varianceDF = TreeTests.setMetadata(varianceData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) + .setSeed(0) +val transformVarDF = dt.fit(varianceDF).transform(varianceDF) +val calculatedVariances = transformVarDF.select(dt.getVarianceCol).collect().map { --- End diff -- Ah, I see. Thanks for the note. I wasn't familiar with the Dataset API till now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13981: [SPARK-16307] [ML] Add test to verify the predicted vari...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13981 Thanks @sethah @yanboliang for the reviews!!
[GitHub] spark issue #13981: [SPARK-16307] [ML] Add test to verify the predicted vari...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13981 @yanboliang Would appreciate it if you could look at https://github.com/apache/spark/pull/13650
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69660324 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +97,25 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val varianceData: RDD[LabeledPoint] = TreeTests.varianceData(sc) +val varianceDF = TreeTests.setMetadata(varianceData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) + .setSeed(0) +val transformVarDF = dt.fit(varianceDF).transform(varianceDF) +val calculatedVariances = transformVarDF.select(dt.getVarianceCol).collect().map { --- End diff -- You mean after the collect? It fails with `is not a member of Seq[org.apache.spark.sql.Row]` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13981: [SPARK-16307] [ML] Add test to verify the predicted vari...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13981 OK, that should be it. I removed all the unused variables and imports.
[GitHub] spark issue #13981: [SPARK-16307] [ML] Add test to verify the predicted vari...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13981 @yanboliang Thanks! Addressed your comments. Let me know if there is anything else.
[GitHub] spark issue #8013: [SPARK-3181][MLLIB]: Add Robust Regression Algorithm with...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/8013 I'll be happy to review it.
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69419165

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala ---
@@ -96,6 +108,15 @@ class DecisionTreeRegressorSuite
       assert(variance === expectedVariance,
         s"Expected variance $expectedVariance but got $variance.")
     }
+
+    val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0)
+    dt.setMaxDepth(1)
+      .setMaxBins(6)
--- End diff --

If you would like to reduce the number of warnings, then this should be kept as is (unless I am misunderstanding something).
[GitHub] spark issue #13981: [SPARK-16307] [ML] Add test to verify the predicted vari...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13981 I'm slightly in favour of keeping the original test because the impurity is set to "variance" explicitly by the `setImpurity` method, so it's a safe assumption that the `calculate` method returns the variance.
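As an aside, the manual check being discussed (verifying that an impurity's `calculate` really returns the variance) can be sketched in plain Python. This is an illustrative stand-in, not Spark's code: `impurity_calculate` mimics a sufficient-statistics (count, sum, sum of squares) variance computation, and the labels are hypothetical.

```python
def population_variance(labels):
    """Population variance computed by hand: mean of squared deviations."""
    n = len(labels)
    mean = sum(labels) / n
    return sum((x - mean) ** 2 for x in labels) / n

def impurity_calculate(count, total, total_sq):
    """Variance from the sufficient statistics a tree impurity typically
    aggregates: (count, sum, sum of squares)."""
    mean = total / count
    return total_sq / count - mean * mean

labels = [1.0, 2.0, 3.0, 6.0]
stats = (len(labels), sum(labels), sum(x * x for x in labels))
# The two routes must agree, which is what the test wants to pin down.
assert abs(population_variance(labels) - impurity_calculate(*stats)) < 1e-12
```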
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69363260

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala ---
@@ -96,6 +108,15 @@ class DecisionTreeRegressorSuite
       assert(variance === expectedVariance,
         s"Expected variance $expectedVariance but got $variance.")
     }
+
+    val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0)
+    dt.setMaxDepth(1)
+      .setMaxBins(6)
--- End diff --

"Explicit is better than implicit" ;)
[GitHub] spark issue #14016: [SPARK-15761] [FOLLOWUP] Set DEFAULT_PYTHON to python
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/14016 @srowen fixed.
[GitHub] spark issue #13981: [SPARK-16307] [ML] Add test to verify the predicted vari...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13981 @sethah Thank you for your comments. I have addressed them. Do you have anything else?
[GitHub] spark issue #14016: [SPARK-15761] [FOLLOWUP] Set DEFAULT_PYTHON to python
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/14016 Thanks for clarifying! It might be a good time to get rid of it.
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69333966

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala ---
@@ -96,6 +108,15 @@ class DecisionTreeRegressorSuite
       assert(variance === expectedVariance,
         s"Expected variance $expectedVariance but got $variance.")
     }
+
+    val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0)
+    dt.setMaxDepth(1)
+      .setMaxBins(6)
--- End diff --

I verified and my intuition was correct. I get this warning for the default setting:

    WARN DecisionTreeMetadata: DecisionTree reducing maxBins from 32 to 6 (= number of training instances)
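The warning quoted above comes from the tree metadata capping the number of bins at the number of training instances. A minimal sketch of that capping logic (the function name is illustrative, not Spark's):

```python
def effective_max_bins(requested_bins, num_instances):
    # A decision tree cannot use more candidate bins than there are
    # training instances, so the requested value is capped. When the cap
    # kicks in, Spark logs the "reducing maxBins" warning quoted above.
    return min(requested_bins, num_instances)

# Default setting on the 6-point toy data: reduced from 32 to 6 (warns).
assert effective_max_bins(32, 6) == 6
# Setting maxBins to 6 explicitly: no reduction, no warning.
assert effective_max_bins(6, 6) == 6
```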
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69332114

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala ---
@@ -96,6 +108,15 @@ class DecisionTreeRegressorSuite
       assert(variance === expectedVariance,
         s"Expected variance $expectedVariance but got $variance.")
     }
+
+    val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0)
+    dt.setMaxDepth(1)
+      .setMaxBins(6)
--- End diff --

Because there are 6 datapoints, and I want each datapoint to be a split.
[GitHub] spark issue #14016: [SPARK-15761] [FOLLOWUP] Set DEFAULT_PYTHON to python
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/14016 ping @srowen @JoshRosen
[GitHub] spark pull request #14016: [SPARK-15761] [FOLLOWUP] Set DEFAULT_PYTHON to py...
GitHub user MechCoder opened a pull request: https://github.com/apache/spark/pull/14016 [SPARK-15761] [FOLLOWUP] Set DEFAULT_PYTHON to python

## What changes were proposed in this pull request?

I would like to change

```bash
if hash python2.7 2>/dev/null; then
  # Attempt to use Python 2.7, if installed:
  DEFAULT_PYTHON="python2.7"
else
  DEFAULT_PYTHON="python"
fi
```

to just

```bash
DEFAULT_PYTHON="python"
```

I'm not sure it is a good assumption that python2.7 should be used by default when `python` points to something else.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MechCoder/spark followup

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14016.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14016

commit 4661493466ff220ede24257c2c83274ea78f73fb
Author: MechCoder <mks...@nyu.edu>
Date: 2016-07-01T17:24:46Z
[SPARK-15761] Set DEFAULT_PYTHON to python
[GitHub] spark issue #13503: [SPARK-15761] [MLlib] [PySpark] Load ipython when defaul...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13503 retest this please
[GitHub] spark issue #13503: [SPARK-15761] [MLlib] [PySpark] Load ipython when defaul...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13503 I would also like to change

```bash
if hash python2.7 2>/dev/null; then
  # Attempt to use Python 2.7, if installed:
  DEFAULT_PYTHON="python2.7"
else
  DEFAULT_PYTHON="python"
fi
```

to just

```bash
DEFAULT_PYTHON="python"
```

I'm not sure it is a good assumption that python2.7 should be used by default when `python` points to something else.
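The interpreter-selection logic being questioned can be modeled as a pure function. This is an illustrative Python sketch of the shell snippet's behaviour, not the actual script; `available` stands in for `hash <cmd>` succeeding on the PATH:

```python
def pick_default_python(available):
    """Mirror of the shell logic: prefer python2.7 if installed,
    otherwise fall back to plain python."""
    return "python2.7" if "python2.7" in available else "python"

# Current behaviour the PR questions: python2.7 wins even when the
# user's `python` deliberately points somewhere else.
assert pick_default_python({"python2.7", "python"}) == "python2.7"
# The proposed change makes the result always "python", i.e. the
# function degenerates to a constant.
assert pick_default_python({"python"}) == "python"
```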
[GitHub] spark issue #13503: [SPARK-15761] [MLlib] [PySpark] Load ipython when defaul...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13503 @JoshRosen Fixed, thanks! Let me know if you need any other changes.
[GitHub] spark issue #13503: [SPARK-15761] [MLlib] [PySpark] Load ipython when defaul...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13503 bump?
[GitHub] spark issue #12983: [SPARK-15213][PySpark] Unify 'range' usages
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12983 I don't really get the difference; could you please explain it to me? The previous version renamed `range` in Python 3 to `xrange`, and this pull request does the same thing by renaming `xrange` in Python 2 to `range`. Not sure this is necessary. There should be no performance change, since both are evaluated lazily.
[GitHub] spark pull request #13997: [SPARK-16328][ML][MLLIB][PYSPARK] Add 'asML' and ...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13997#discussion_r69176771

--- Diff: python/pyspark/mllib/linalg/__init__.py ---
@@ -1044,6 +1122,28 @@ def toSparse(self):
         return SparseMatrix(self.numRows, self.numCols, colPtrs, rowIndices, values)

+    def asML(self):
+        """
+        Convert this matrix to the new mllib-local representation.
+        This does NOT copy the data; it copies references.
+
+        >>> mllibDM = Matrices.dense(2, 2, [0, 1, 2, 3])
+        >>> mlDM1 = newlinalg.Matrices.dense(2, 2, [0, 1, 2, 3])
+        >>> mlDM2 = mllibDM.asML()
+        >>> mlDM2 == mlDM1
+        True
+        >>> mllibDMt = DenseMatrix(2, 2, [0, 1, 2, 3], True)
+        >>> mlDMt1 = newlinalg.DenseMatrix(2, 2, [0, 1, 2, 3], True)
+        >>> mlDMt2 = mllibDMt.asML()
+        >>> mlDMt2 == mlDMt1
+        True
+
+        :return: :py:class:`pyspark.ml.linalg.DenseMatrix`
+
+        .. versionadded:: 2.0.0
+        """
+        return newlinalg.DenseMatrix(self.numRows, self.numCols, self.values, self.isTransposed)
--- End diff --

> 79 ;)
[GitHub] spark issue #13997: [SPARK-16328][ML][MLLIB][PYSPARK] Add 'asML' and 'fromML...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13997 LGTM pending nitpicks.
[GitHub] spark pull request #13997: [SPARK-16328][ML][MLLIB][PYSPARK] Add 'asML' and ...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13997#discussion_r69176457

--- Diff: python/pyspark/mllib/linalg/__init__.py ---
@@ -846,6 +890,33 @@ def dense(*elements):
         return DenseVector(elements)

     @staticmethod
+    def fromML(vec):
+        """
+        Convert a vector from the new mllib-local representation.
+        This does NOT copy the data; it copies references.
+
+        >>> mllibDV1 = Vectors.dense([1, 2, 3])
+        >>> mlDV = newlinalg.Vectors.dense([1, 2, 3])
+        >>> mllibDV2 = Vectors.fromML(mlDV)
+        >>> mllibDV1 == mllibDV2
+        True
+        >>> mllibSV1 = Vectors.sparse(4, {1: 1.0, 3: 5.5})
+        >>> mlSV = newlinalg.Vectors.sparse(4, {1: 1.0, 3: 5.5})
+        >>> mllibSV2 = Vectors.fromML(mlSV)
+        >>> mllibSV1 == mllibSV2
+        True
+
+        :param vec: a :py:class:`pyspark.ml.linalg.Vector`
+        :return: a :py:class:`pyspark.mllib.linalg.Vector`
+        """
+        if type(vec) == newlinalg.DenseVector:
--- End diff --

It's common pythonic practice to use `isinstance` in such cases. If we inherit something from `DenseVector`, then this check will fail.
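The subclass failure mode described above is easy to demonstrate in isolation. A minimal sketch with stand-in classes (not the real PySpark `DenseVector`):

```python
class DenseVector:
    """Stand-in for pyspark.ml.linalg.DenseVector."""
    pass

class MyDenseVector(DenseVector):
    """A hypothetical subclass a user might define."""
    pass

v = MyDenseVector()

# The exact-type comparison used in the diff rejects subclasses...
assert not (type(v) == DenseVector)
# ...while isinstance accepts them, which is the pythonic check.
assert isinstance(v, DenseVector)
```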
[GitHub] spark pull request #13650: [SPARK-9623] [ML] Provide variance for RandomFore...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13650#discussion_r69039559

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/RandomForestRegressorSuite.scala ---
@@ -105,6 +108,55 @@ class RandomForestRegressorSuite extends SparkFunSuite with MLlibTestSparkContex
     }
   }
+
+  test("Random Forest variance") {
--- End diff --

The first test is meant to pass for all impurities, since it compares the variance of a forest with one tree (with bootstrapping turned off). You are right that we have to be deterministic about checking the predicted variances. I have done it for the DecisionTrees here (https://github.com/apache/spark/pull/13981), but I'm not sure it is straightforward for a RandomForest.
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
GitHub user MechCoder opened a pull request: https://github.com/apache/spark/pull/13981 [SPARK-16307] [ML] Add test to verify the predicted variances of a DT on toy data

## What changes were proposed in this pull request?

The current test assumes that `impurity.calculate()` returns the variance correctly. It would be better to make the test independent of this assumption; in other words, verify that the computed variance equals the variance calculated manually on a small tree.

## How was this patch tested?

The patch is a test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MechCoder/spark dt_variance

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13981.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13981
[GitHub] spark issue #13981: [SPARK-16307] [ML] Add test to verify the predicted vari...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13981 @yanboliang Could you have a look?
[GitHub] spark pull request #13650: [SPARK-9623] [ML] Provide variance for RandomFore...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13650#discussion_r68985017

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/RandomForestRegressor.scala ---
@@ -168,15 +173,39 @@ class RandomForestRegressionModel private[ml] (
   // Note: We may add support for weights (based on tree performance) later on.
   private lazy val _treeWeights: Array[Double] = Array.fill[Double](_trees.length)(1.0)

+  @Since("2.1.0")
+  /** @group getParam */
+  def setVarianceCol(value: String): this.type = set(varianceCol, value)
+
   @Since("1.4.0")
   override def treeWeights: Array[Double] = _treeWeights

   override protected def transformImpl(dataset: Dataset[_]): DataFrame = {
     val bcastModel = dataset.sparkSession.sparkContext.broadcast(this)
+
+    var output = dataset
+
     val predictUDF = udf { (features: Any) =>
       bcastModel.value.predict(features.asInstanceOf[Vector])
     }
-    dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol))))
+    val predictions = predictUDF(col($(featuresCol)))
+    output = dataset.withColumn($(predictionCol), predictions)
+
+    val varianceUDF = udf { (features: Any) =>
+      val leafNodes = bcastModel.value.returnLeafNodes(features.asInstanceOf[Vector])
+      leafNodes.map { leafNode =>
--- End diff --

Nice! I'll address this for the `RandomForest` here, and we can move the decision-tree test strengthening to another PR.
[GitHub] spark issue #13959: [SPARK-14351] [MLlib] [ML] Optimize findBestSplits metho...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13959 The test failure is just due to binary incompatibility. I can fix that once we decide that the current PR is the way to proceed.
[GitHub] spark issue #13959: [SPARK-14351] [MLlib] [ML] Optimize findBestSplits metho...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13959 @jkbradley @sethah Please have a look when free!
[GitHub] spark pull request #13959: [SPARK-14351] [MLlib] [ML] Optimize findBestSplit...
GitHub user MechCoder opened a pull request: https://github.com/apache/spark/pull/13959 [SPARK-14351] [MLlib] [ML] Optimize findBestSplits method for decision trees (and random forest)

## What changes were proposed in this pull request?

The current `findBestSplits` method creates an instance of `ImpurityCalculator` and `ImpurityStats` for every possible split and feature in the search for the best split. Every instance of `ImpurityCalculator` creates an array of size `statsSize`, which is unnecessary and takes a non-negligible amount of time. This pull request tackles the problem as follows:

1. Remove the `ImpurityCalculator` instantiation for every possible split and feature. Replace it with a `calculateGain` method for each impurity that computes the gain directly from the `allStats` attribute of the `DTStatsAggregator`, which holds all the necessary information.
2. Instead of returning an instance of `ImpurityStats` for every possible split and feature, return just the information gain, since the gain is sufficient to determine the `bestSplit`. Construct an `ImpurityStats` instance only once, for the `bestSplit`.
3. Remove the not-so-useful `calculateImpurityStats` method.

## How was this patch tested?

Since this is a performance improvement, timing benchmarks are necessary. Here are the improvements for a `RandomForestRegressor` with `maxDepth` set to 30, `subSamplingRate` set to 1 and `maxBins` set to 20 on synthetic data. The timings were measured locally, taking the mean of 3 attempts.
| n_trees | n_samples | n_features | time in master | total time in this branch |
| - | :-: | ---: | ---: | --: |
| 1 | 1 | 500 | 8.954 | 7.786 |
| 10 | 1 | 500 | 9.44 | 6.825 |
| 100 | 1 | 500 | 18.457 | 16.498 |
| 1 | 500 | 1 | 8.718 | 6.783 |
| 10 | 500 | 1 | 8.579 | 6.853 |
| 100 | 500 | 1 | 17.593 | 15.905 |
| 1 | 1000 | 1000 | 8.323 | 6.456 |
| 10 | 1000 | 1000 | 8.841 | 6.633 |
| 100 | 1000 | 1000 | 17.834 | 16.077 |
| 500 | 1000 | 1000 | 64.3 | 58.94 |

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MechCoder/spark again

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13959.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13959

commit 64d066b90b152ceb71b185b7e17313486974ae77
Author: MechCoder <mks...@nyu.edu>
Date: 2016-06-27T20:42:44Z
Add calculateGain method to all Impurity objects

commit f1d8c8950f8adace6ee175cd569b20ed6468bb61
Author: MechCoder <mks...@nyu.edu>
Date: 2016-06-27T21:32:56Z
Refactor gain calculation for categorical splits

commit 6e31e3a7b36981c8ccbf867e013363aa6f784e39
Author: MechCoder <mks...@nyu.edu>
Date: 2016-06-27T22:58:10Z
Remove impurity calculation to outside the for loop

commit ea4a0735c14ff91ad1071fb517da3fd890080354
Author: MechCoder <mks...@nyu.edu>
Date: 2016-06-28T00:45:36Z
Remove per feature impurityCalculator initialization

commit ca8b36088b74cacb7f162fb793070c4d3c6a1a8c
Author: MechCoder <mks...@nyu.edu>
Date: 2016-06-28T17:27:40Z
Get rid of calculateImpurityStats

commit 67b401a6a0e59b48a167e4f3036ca9f3f6a5df1f
Author: MechCoder <mks...@nyu.edu>
Date: 2016-06-28T18:17:32Z
where did that come from?

commit e8b89141f6cabfef5f582fe9521f4443afa9ec65
Author: MechCoder <mks...@nyu.edu>
Date: 2016-06-29T00:01:55Z
Add documentation
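The core of the optimization described in this PR is computing the information gain directly from aggregated statistics instead of allocating calculator objects per candidate split. A hedged Python sketch of such a gain computation with a variance impurity; the function names and the (count, sum, sum-of-squares) triple layout are illustrative, not Spark's actual API:

```python
def variance_from_stats(count, total, total_sq):
    """Variance from sufficient statistics: E[x^2] - (E[x])^2."""
    mean = total / count
    return total_sq / count - mean * mean

def calculate_gain(left, right):
    """Information gain of a split, computed straight from the
    (count, sum, sum-of-squares) triples of each child. No
    intermediate calculator objects are allocated per split."""
    parent = tuple(l + r for l, r in zip(left, right))
    n, n_l, n_r = parent[0], left[0], right[0]
    return (variance_from_stats(*parent)
            - (n_l / n) * variance_from_stats(*left)
            - (n_r / n) * variance_from_stats(*right))

left = (2, 2.0, 2.0)     # stats for labels [1, 1]
right = (2, 10.0, 50.0)  # stats for labels [5, 5]
# A perfect split removes all variance, so gain == parent variance (4.0).
assert abs(calculate_gain(left, right) - 4.0) < 1e-12
```

Finding the best split then reduces to taking the argmax of this scalar over all candidates, deferring any richer stats object to the winner only.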
[GitHub] spark issue #7963: [SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/7963 Bump?
[GitHub] spark issue #13650: [SPARK-9623] [ML] Provide variance for RandomForestRegre...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13650 cc: @yanboliang @MLnick
[GitHub] spark pull request #13650: [SPARK-9623] [ML] Provide variance for RandomFore...
GitHub user MechCoder opened a pull request: https://github.com/apache/spark/pull/13650 [SPARK-9623] [ML] Provide variance for RandomForestRegressor predictions

## What changes were proposed in this pull request?

It is useful to get the variance of predictions from the `RandomForestRegressor` in order to plot confidence intervals on the predictions. I verified the formula from page 17 of this paper (http://arxiv.org/pdf/1211.0906v2.pdf).

## How was this patch tested?

I added a couple of tests to the RandomForestRegression test suite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MechCoder/spark random_forest_var

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13650.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13650

commit 75254c91cf8d9c2f3638a3f9b1cfd5c029e10996
Author: MechCoder <mks...@nyu.edu>
Date: 2016-06-09T18:22:53Z
[SPARK-9623] [ML] Provide variance for RandomForestRegressor predictions
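To make the idea concrete, here is a deliberately simplified sketch: a per-instance variance estimate taken as the spread of individual tree predictions around the forest mean. Note this is an illustrative simplification, NOT necessarily the exact formula from the cited paper (arXiv:1211.0906) that the PR implements:

```python
def forest_prediction_and_variance(tree_predictions):
    """Forest prediction = mean of tree predictions; a naive variance
    estimate = population variance of those predictions. Stand-in for
    the paper's formula, for intuition only."""
    n = len(tree_predictions)
    mean = sum(tree_predictions) / n
    var = sum((p - mean) ** 2 for p in tree_predictions) / n
    return mean, var

mean, var = forest_prediction_and_variance([1.0, 2.0, 3.0])
assert abs(mean - 2.0) < 1e-12
assert abs(var - 2.0 / 3.0) < 1e-12
```

A wide spread across trees signals an instance the ensemble disagrees on, which is exactly what a confidence interval around the prediction should reflect.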
[GitHub] spark issue #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrowth fail...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13493 lgtm cc: @MLnick
[GitHub] spark issue #13540: [SPARK-15788][PYSPARK][ML] PySpark IDFModel missing "idf...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13540 LGTM as well, pending the nitpick by @BryanCutler. Not related, but it's been a while since I hacked on Spark or PySpark; at some point, do we need better docs for PySpark? I couldn't figure out how the IDFs are calculated without looking at the Scala documentation.
[GitHub] spark issue #12370: [SPARK-14599][ML] BaggedPoint should support sample weig...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/12370 Should there be a sanity check that provides an input RDD of instance objects and `extractSampleWeight` as a callable that simply returns the weight for each instance? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #12370: [SPARK-14599][ML] BaggedPoint should support samp...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/12370#discussion_r65994490 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/BaggedPoint.scala --- @@ -33,13 +33,20 @@ import org.apache.spark.util.random.XORShiftRandom * this datum has 1 copy, 0 copies, and 4 copies in the 3 subsamples, respectively. * * @param datum Data instance - * @param subsampleWeights Weight of this instance in each subsampled dataset. - * - * TODO: This does not currently support (Double) weighted instances. Once MLlib has weighted - * dataset support, update. (We store subsampleWeights as Double for this future extension.) + * @param subsampleCounts Number of samples of this instance in each subsampled dataset. + * @param sampleWeight The weight of this instance. */ -private[spark] class BaggedPoint[Datum](val datum: Datum, val subsampleWeights: Array[Double]) - extends Serializable +private[spark] class BaggedPoint[Datum]( +val datum: Datum, +val subsampleCounts: Array[Int], +val sampleWeight: Double) extends Serializable { + + /** + * Subsample counts weighted by the sample weight. + */ + def weightedCounts: Array[Double] = subsampleCounts.map(_ * sampleWeight) --- End diff -- Should this be a `val`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
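The change under review stores integer subsample counts plus a single sample weight per point, and `weightedCounts` multiplies them on demand. A minimal Python sketch of the same idea (names mirror the diff, but this is illustrative, not Spark code); the `val`-vs-`def` question above is a cache-the-array versus recompute-per-call trade-off:

```python
class BaggedPoint:
    """One datum with its per-subsample counts and a single sample weight."""

    def __init__(self, datum, subsample_counts, sample_weight):
        self.datum = datum
        self.subsample_counts = subsample_counts  # int copies per subsample
        self.sample_weight = sample_weight

    def weighted_counts(self):
        # Recomputed on every call (the `def` variant in the diff);
        # precomputing this in __init__ (the `val` variant) would trade
        # memory for speed when it is read many times.
        return [c * self.sample_weight for c in self.subsample_counts]

# a point with 1, 0, and 4 copies in three subsamples, weight 0.5
bp = BaggedPoint("row-0", [1, 0, 4], 0.5)
```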
[GitHub] spark issue #13503: [SPARK-15761] [MLlib] [PySpark] Load ipython when defaul...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13503 Merge? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13503: [SPARK-15761] [MLlib] [PySpark] Load ipython when defaul...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13503 cc @JoshRosen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13248: [SPARK-15194] [ML] Add Python ML API for MultivariateGau...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13248 @praveendareddy21 Just made a first pass. Also, please run a PEP8 check on your code. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794904
--- Diff: python/pyspark/ml/stat/distribution.py ---
@@ -0,0 +1,267 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.linalg import DenseVector, DenseMatrix, Vector
+import numpy as np
+
+__all__ = ['MultivariateGaussian']
+
+
+class MultivariateGaussian():
+    """
+    This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution.
+    In the event that the covariance matrix is singular, the density will be computed in a
+    reduced dimensional subspace under which the distribution is supported.
+    (see [[http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case]])
+
+    mu    The mean vector of the distribution
+    sigma The covariance matrix of the distribution
+
+    >>> mu = Vectors.dense([0.0, 0.0])
+    >>> sigma = DenseMatrix(2, 2, [1.0, 1.0, 1.0, 1.0])
+    >>> x = Vectors.dense([1.0, 1.0])
+    >>> m = MultivariateGaussian(mu, sigma)
+    >>> m.pdf(x)
+    0.0682586811486
+    """
+
+    def __init__(self, mu, sigma):
+        """
+        __init__(self, mu, sigma)
+
+        mu    The mean vector of the distribution
+        sigma The covariance matrix of the distribution
+
+        mu and sigma must be instances of DenseVector and DenseMatrix respectively.
+        """
+        assert isinstance(mu, DenseVector), "mu must be a DenseVector Object"
+        assert isinstance(sigma, DenseMatrix), "sigma must be a DenseMatrix Object"
+
+        sigma_shape = sigma.toArray().shape
+        assert sigma_shape[0] == sigma_shape[1], "Covariance matrix must be square"
+        assert sigma_shape[0] == mu.size, "Mean vector length must match covariance matrix size"
+
+        # initialize eagerly precomputed attributes
+        self.mu = mu
+
+        # sigma is stored as a numpy.ndarray;
+        # further calculations are done on the ndarray only
+        self.sigma = sigma.toArray()
+
+        # initialize attributes to be computed later
+        self.prec_U = None
+        self.log_det_cov = None
+
+        # compute distribution dependent constants
+        self.__calculateCovarianceConstants()
+
+    def pdf(self, x):
+        """
+        Returns the density of this multivariate Gaussian at a point given by Vector x
+        """
+        assert isinstance(x, Vector), "x must be of Vector Type"
+        return float(self.__pdf(x))
+
+    def logpdf(self, x):
+        """
+        Returns the log-density of this multivariate Gaussian at a point given by Vector x
+        """
+        assert isinstance(x, Vector), "x must be of Vector Type"
+        return float(self.__logpdf(x))
+
+    def __calculateCovarianceConstants(self):
+        """
+        Calculates distribution dependent components used for the density function,
+        based on the scipy multivariate library
+        (refer https://github.com/scipy/scipy/blob/master/scipy/stats/_multivariate.py);
+        tested with a precision of 9 significant digits (refer testcase)
+        """
+        try:
+            # pre-processing input parameters
+            # throws ValueError with invalid inputs
+            self.dim, self.mu, self.sigma = self.__process_parameters(None, self.mu, self.sigma)
+
+            # return the eigenvalues and eigenvectors
+            # of a Hermitian or symmetric matrix.
+            # s = eigen values
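The degenerate-case density that the quoted code is building can be sketched with plain NumPy, following the same eigendecomposition approach as SciPy: drop near-zero eigenvalues, use the pseudo-determinant, and compute the Mahalanobis distance in the supported subspace. This is an illustrative re-derivation (the function name is hypothetical, not the PR's API); it uses the full dimension d in the normalizing constant, which reproduces the docstring's doctest value for the singular covariance [[1, 1], [1, 1]]:

```python
import numpy as np

def degenerate_mvn_pdf(mu, sigma, x, eps=1e-9):
    """Density of N(mu, sigma) at x, tolerating a singular covariance:
    near-zero eigenvalues are dropped, so the pseudo-determinant and a
    projected Mahalanobis distance are used in the degenerate subspace."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    x = np.asarray(x, dtype=float)
    d = mu.size
    s, u = np.linalg.eigh(sigma)           # eigenvalues s, eigenvectors u
    keep = s > eps                         # support of the distribution
    log_pdet = np.sum(np.log(s[keep]))     # log pseudo-determinant
    prec_u = u[:, keep] / np.sqrt(s[keep])
    dev = x - mu
    maha = np.sum((dev @ prec_u) ** 2)     # Mahalanobis distance in the subspace
    log_pdf = -0.5 * (d * np.log(2.0 * np.pi) + log_pdet + maha)
    return float(np.exp(log_pdf))

p = degenerate_mvn_pdf([0.0, 0.0], [[1.0, 1.0], [1.0, 1.0]], [1.0, 1.0])
# p is approximately 0.0682586811486, the doctest value above
```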
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794126
--- Diff: python/pyspark/ml/stat/distribution.py ---
+def pdf(self,x):
--- End diff --
Should we fall back to SciPy's multivariate normal if that is present? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
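The fallback this comment suggests could look like the following: prefer SciPy's `scipy.stats.multivariate_normal` (which handles singular covariances via `allow_singular=True`) when it is importable, otherwise use a pure-NumPy density. This is only a sketch of the idea under those assumptions, not proposed Spark code:

```python
import numpy as np

def make_pdf(mu, sigma):
    """Return a pdf callable for N(mu, sigma), using SciPy when available."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    try:
        from scipy.stats import multivariate_normal
        # SciPy's frozen distribution also copes with singular covariances
        return multivariate_normal(mean=mu, cov=sigma, allow_singular=True).pdf
    except ImportError:
        # NumPy-only fallback, valid for non-singular sigma
        inv = np.linalg.inv(sigma)
        norm = 1.0 / np.sqrt((2.0 * np.pi) ** mu.size * np.linalg.det(sigma))

        def pdf(x):
            dev = np.asarray(x, dtype=float) - mu
            return float(norm * np.exp(-0.5 * dev @ inv @ dev))

        return pdf

pdf = make_pdf([0.0, 0.0], np.eye(2))
```

Either path gives the standard bivariate normal density 1/(2*pi) at the mean.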
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794056

--- Diff: python/pyspark/ml/stat/distribution.py ---
@@ -0,0 +1,267 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.linalg import DenseVector, DenseMatrix, Vector
+import numpy as np
+
+__all__ = ['MultivariateGaussian']
+
+
+class MultivariateGaussian():
+    """
+    This class provides basic functionality for a Multivariate Gaussian
+    (Normal) Distribution. In the event that the covariance matrix is
+    singular, the density will be computed in a reduced-dimensional
+    subspace under which the distribution is supported.
+    (see [[http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case]])
+
+    mu    The mean vector of the distribution
+    sigma The covariance matrix of the distribution
+
+    >>> mu = Vectors.dense([0.0, 0.0])
+    >>> sigma = DenseMatrix(2, 2, [1.0, 1.0, 1.0, 1.0])
+    >>> x = Vectors.dense([1.0, 1.0])
+    >>> m = MultivariateGaussian(mu, sigma)
+    >>> m.pdf(x)
+    0.0682586811486
+    """
+
+    def __init__(self, mu, sigma):
+        """
+        __init__(self, mu, sigma)
+
+        mu    The mean vector of the distribution
+        sigma The covariance matrix of the distribution
+
+        mu and sigma must be instances of DenseVector and DenseMatrix
+        respectively.
+        """
+        assert isinstance(mu, DenseVector), "mu must be a DenseVector object"
+        assert isinstance(sigma, DenseMatrix), "sigma must be a DenseMatrix object"
+
+        sigma_shape = sigma.toArray().shape
+        assert sigma_shape[0] == sigma_shape[1], "Covariance matrix must be square"
+        assert sigma_shape[0] == mu.size, "Mean vector length must match covariance matrix size"
+
+        # eagerly precomputed attributes
+        self.mu = mu
+
+        # store sigma as a numpy.ndarray;
+        # further calculations are done on the ndarray only
+        self.sigma = sigma.toArray()
+
+        # attributes to be computed later
+        self.prec_U = None
+        self.log_det_cov = None
+
+        # compute distribution-dependent constants
+        self.__calculateCovarianceConstants()
+
+    def pdf(self, x):
+        """
+        Returns the density of this multivariate Gaussian at the point given by Vector x.
+        """
+        assert isinstance(x, Vector), "x must be of Vector type"
+        return float(self.__pdf(x))
+
+    def logpdf(self, x):
+        """
+        Returns the log-density of this multivariate Gaussian at the point given by Vector x.
+        """
+        assert isinstance(x, Vector), "x must be of Vector type"
+        return float(self.__logpdf(x))
+
+    def __calculateCovarianceConstants(self):
+        """
+        Calculates the distribution-dependent constants used by the density
+        function, based on SciPy's multivariate library; see
+        https://github.com/scipy/scipy/blob/master/scipy/stats/_multivariate.py
+        Tested to a precision of 9 significant digits (see the test case).
+        """
+        try:
+            # pre-process input parameters;
+            # raises ValueError on invalid inputs
+            self.dim, self.mu, self.sigma = self.__process_parameters(None, self.mu, self.sigma)
+
+            # compute the eigenvalues and eigenvectors
+            # of a Hermitian or symmetric matrix
+            # s = eigenvalues
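The quoted `__calculateCovarianceConstants` follows SciPy's approach: eigendecompose sigma, treat near-zero eigenvalues as zero so singular (degenerate) covariances are handled via the pseudo-inverse, and cache `prec_U` and the log pseudo-determinant. A NumPy-only sketch of that idea (the function name and the tolerance choice are illustrative assumptions, not the PR's exact code):

```python
import numpy as np

def covariance_constants(sigma, cond=1e-9):
    """Return (prec_U, log_det_cov, rank) for a symmetric covariance matrix.

    Eigenvalues below a relative tolerance are treated as zero, so a
    singular covariance yields constants for the supported subspace.
    """
    s, u = np.linalg.eigh(sigma)           # s: eigenvalues, u: eigenvectors
    eps = cond * np.max(np.abs(s))         # relative cutoff (assumed tolerance)
    if np.min(s) < -eps:
        raise ValueError("covariance matrix must be positive semidefinite")
    keep = s > eps                         # support of the distribution
    prec_U = u[:, keep] * np.sqrt(1.0 / s[keep])  # columns scaled by 1/sqrt(eigval)
    log_det_cov = float(np.sum(np.log(s[keep])))  # log pseudo-determinant
    return prec_U, log_det_cov, int(keep.sum())

# The singular sigma from the docstring example has eigenvalues {0, 2},
# so only one eigenvector is kept and log_det_cov = log(2).
prec_U, log_det_cov, rank = covariance_constants(np.array([[1.0, 1.0],
                                                           [1.0, 1.0]]))
```

With these constants, the log-density is `-0.5 * (rank * log(2*pi) + log_det_cov + ||prec_U.T @ (x - mu)||**2)`, which is the form SciPy's `_multivariate.py` evaluates.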
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65793951

--- Diff: python/pyspark/ml/stat/distribution.py ---
+        sigma_shape = sigma.toArray().shape
+        assert sigma_shape[0] == sigma_shape[1], "Covariance matrix must be square"
+        assert sigma_shape[0] == mu.size, "Mean vector length must match covariance matrix size"
--- End diff --

You can use the `numRows` and `numCols` attributes.
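The suggestion above is to read the matrix dimensions from `DenseMatrix.numRows`/`numCols` directly rather than materializing the full array with `toArray().shape`. A sketch of the revised checks; since `pyspark` may not be installed, a minimal stand-in class exposing the same attributes is used here for illustration only:

```python
class FakeDenseMatrix:
    """Minimal stand-in exposing the same numRows/numCols attributes as
    pyspark.ml.linalg.DenseMatrix (illustrative assumption, not the real class)."""
    def __init__(self, numRows, numCols, values):
        self.numRows, self.numCols, self.values = numRows, numCols, values

def check_shapes(mu_size, sigma):
    # Cheap attribute reads instead of building a NumPy array just for .shape.
    assert sigma.numRows == sigma.numCols, "Covariance matrix must be square"
    assert sigma.numRows == mu_size, \
        "Mean vector length must match covariance matrix size"

# Mirrors the docstring example: a 2x2 sigma with a length-2 mean vector.
check_shapes(2, FakeDenseMatrix(2, 2, [1.0, 1.0, 1.0, 1.0]))
```

Skipping `toArray()` in the validation path avoids an unnecessary copy of the matrix data when the inputs are invalid.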