GitHub user dorx opened a pull request:
https://github.com/apache/spark/pull/1911
[SPARK-2993] [MLLib] colStats (wrapper around
MultivariateStatisticalSummary) in Statistics
For both Scala and Python.
The ser/de util functions were moved out of `PythonMLLibAPI
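A rough illustration of the kind of per-column summary a `colStats` call returns. This is a minimal pure-Python sketch, not the MLlib implementation: the helper name `col_stats`, the choice of Welford's single-pass update, and the use of the unbiased (n - 1) sample variance are all assumptions made for the example.

```python
def col_stats(rows):
    """Return (count, means, variances) per column in one pass over the rows.

    Hypothetical helper for illustration only; uses Welford's update.
    """
    count = 0
    means = []
    m2 = []  # running sum of squared deviations per column
    for row in rows:
        if count == 0:
            means = [0.0] * len(row)
            m2 = [0.0] * len(row)
        count += 1
        for j, x in enumerate(row):
            delta = x - means[j]
            means[j] += delta / count
            m2[j] += delta * (x - means[j])
    # Assumption: unbiased sample variance; a single row yields 0.0.
    variances = [s / (count - 1) if count > 1 else 0.0 for s in m2]
    return count, means, variances


count, means, variances = col_stats([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```

Each row is treated as one observation and each position in the row as one column, which mirrors how a summary over an RDD of vectors is usually described.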
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1733#discussion_r16075494
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software
GitHub user dorx opened a pull request:
https://github.com/apache/spark/pull/1866
[SPARK-2937] Separate out sampleByKeyExact as its own API in PairRDDFunctions
To enable consistency with Python and an `Experimental` label on the
`sampleByKeyExact` API.
You can merge this pull request
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1733#discussion_r16024688
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1733#discussion_r16024698
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1733#discussion_r16024886
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1733#issuecomment-51570348
@mengxr @jkbradley @falaki
In case you guys haven't noticed, the latest version implements the
discussed APIs.
---
If your project is set up for it, you can reply
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1733#discussion_r16009653
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1733#discussion_r16009835
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/stat/HypothesisTestSuite.scala ---
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1733#discussion_r16015802
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1733#issuecomment-51286506
@mengxr @jkbradley @falaki
PR ready for review now.
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1733#discussion_r15854474
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala
---
@@ -89,4 +90,76 @@ object Statistics {
*/
@Experimental
GitHub user dorx opened a pull request:
https://github.com/apache/spark/pull/1713
[SPARK-2786][mllib] Python correlations
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dorx/spark pythonCorrelation
Alternatively you can review
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1713#discussion_r15709151
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -456,6 +458,37 @@ class PythonMLLibAPI extends Serializable
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1713#discussion_r15717671
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala
---
@@ -49,43 +49,48 @@ private[stat] trait Correlation
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1713#discussion_r15717902
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -456,6 +458,37 @@ class PythonMLLibAPI extends Serializable
GitHub user dorx opened a pull request:
https://github.com/apache/spark/pull/1733
[SPARK-2515][mllib] Chi Squared test
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dorx/spark chisquare
Alternatively you can review and apply
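For readers unfamiliar with the test named in this PR title, here is a minimal sketch of Pearson's chi-squared goodness-of-fit statistic in plain Python. It is not the code from `ChiSquaredTest.scala`; the function name `chi_squared_statistic`, the default of a uniform expected distribution, and the rescaling of expected counts are assumptions made for the example, and no p-value is computed.

```python
def chi_squared_statistic(observed, expected=None):
    """Return (statistic, degrees_of_freedom) for a goodness-of-fit test.

    Hypothetical helper: `expected` defaults to a uniform distribution and
    is rescaled so that it sums to the same total as `observed`.
    """
    total = float(sum(observed))
    if expected is None:
        expected = [1.0] * len(observed)
    if len(expected) != len(observed):
        raise ValueError("observed and expected must have the same length")
    scale = total / sum(expected)
    scaled = [e * scale for e in expected]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, scaled))
    return statistic, len(observed) - 1


statistic, dof = chi_squared_statistic([10, 20, 30])
```

With the observed counts above and a uniform expectation of 20 per category, the statistic is (100 + 0 + 100) / 20.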
GitHub user dorx opened a pull request:
https://github.com/apache/spark/pull/1710
[SPARK-2782][mllib] Bug fix for getRanks in SpearmanCorrelation
Before this patch, getRanks computed the wrong rank when numPartition = size
in the input RDDs. Added unit tests to address this bug.
You can
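As background for the rank computation mentioned here, the following is a small single-machine sketch of Spearman's correlation: rank each series, then take the Pearson correlation of the ranks. It does not reproduce the partition-aware `getRanks` logic from the PR; the helper names are hypothetical, and assigning tied values the average of their positions is an assumption of this sketch.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


rho = spearman([1, 2, 3, 4], [1, 4, 9, 16])
```

In a distributed setting the difficult part is assigning global ranks across partitions, which is where a bug like the one described would arise; this sketch sidesteps that by sorting everything in memory.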
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1710#issuecomment-50845017
@mengxr I'd really appreciate it if we can get this merged ASAP so I can
send out my python correlation PR before the code freeze. Thanks!
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1710#discussion_r15681388
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala
---
@@ -89,20 +89,17 @@ private[stat] object
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1628#issuecomment-50663192
@JoshRosen any suggestions on what to do for the `random` name collision
issue?
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1628#issuecomment-50664227
The simple, but perhaps not most elegant solution is adding the following
inside of pyspark/__init__.py:
```
import sys, importlib
s = sys.path.pop(0
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1628#issuecomment-50679008
@JoshRosen tried it inside mllib/__init__.py and pyspark/__init__.py and
still get the import error when trying to run anything inside of mllib.
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1628#issuecomment-50689664
@JoshRosen yep I was also able to force it to work with an unnecessary
import from pyspark.context to force it to import python's random first. The
problem is now importing
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1628#issuecomment-50525518
In NumPy's source, they had a directory named random:
https://github.com/numpy/numpy/tree/master/numpy/random
It seems like having directory hierarchy is the only way
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1628#discussion_r15563853
--- Diff: python/pyspark/mllib/random/RandomRDDGenerators.py ---
@@ -0,0 +1,201 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1628#issuecomment-50567408
Btw `from pyspark.mllib import random` now works with the latest commit in
the pyspark shell.
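The thread above concerns a local module named `random` shadowing Python's standard-library `random`. The snippet quoted earlier in the thread is cut off in this archive, so the following is only a guess at the general idea, not the author's code: temporarily drop the first `sys.path` entry (typically the script or package directory) while importing, then restore it. The function name `import_without_first_path_entry` is hypothetical.

```python
import importlib
import sys


def import_without_first_path_entry(name):
    """Import `name` with sys.path[0] temporarily removed, then restore it.

    Illustrative only. If `name` is already cached in sys.modules, the
    cached module is returned regardless of sys.path.
    """
    first = sys.path.pop(0)
    try:
        return importlib.import_module(name)
    finally:
        sys.path.insert(0, first)


stdlib_random = import_without_first_path_entry("random")
```

The caching caveat matters in practice: import order decides which `random` ends up in `sys.modules`, which may be why forcing an early import of the standard module also appeared to work in the discussion.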
GitHub user dorx reopened a pull request:
https://github.com/apache/spark/pull/1025
[SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact
sample size
Implemented stratified sampling that guarantees exact sample size using
ScaRSR with two passes over the RDD
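To make the "exact sample size" guarantee concrete, here is a naive in-memory sketch of stratified sampling without replacement. It is not ScaRSR and not the PR's implementation: it simply groups values by key and draws a fixed number from each group. The function name, the `fractions` mapping, and rounding the per-key sample size up with `math.ceil` are assumptions of this example.

```python
import math
import random
from collections import defaultdict


def sample_by_key_exact(pairs, fractions, seed=0):
    """Sample exactly ceil(fraction * count) values per key, without replacement.

    Hypothetical helper: `pairs` is an iterable of (key, value) tuples and
    `fractions` maps each key to its sampling fraction in [0, 1].
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for key, value in pairs:
        strata[key].append(value)
    sample = []
    for key, values in strata.items():
        size = math.ceil(fractions[key] * len(values))
        sample.extend((key, v) for v in rng.sample(values, size))
    return sample


pairs = [("a", i) for i in range(4)] + [("b", i) for i in range(8)]
sample = sample_by_key_exact(pairs, {"a": 0.5, "b": 0.25})
```

Materializing every stratum in memory is exactly what a distributed algorithm has to avoid, which is why the PR describes a two-pass approach over the RDD instead.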
GitHub user dorx opened a pull request:
https://github.com/apache/spark/pull/1628
[SPARK-2724] Python version of RandomRDDGenerators
RandomRDDGenerators but without support for randomRDD and randomVectorRDD,
which take in arbitrary DistributionGenerator.
`randomRDD.py
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1520#discussion_r15423644
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/random/DistributionGeneratorSuite.scala
---
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1520#issuecomment-50213475
Jenkins, retest this please.
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1581#issuecomment-50213453
Jenkins, retest this please.
Github user dorx closed the pull request at:
https://github.com/apache/spark/pull/1025
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1025#issuecomment-50071767
Looks like there are some API changes from Xiangrui's updates. @mateiz
@pwendell
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1025#issuecomment-50073867
Also, seems like there wasn't a single line of code preserved from before
the updates. We should probably close this PR and let Xiangrui submit his
version in a separate PR
GitHub user dorx opened a pull request:
https://github.com/apache/spark/pull/1581
[SPARK-2679] [MLLib] Ser/De for Double
Added a set of serializer/deserializer for Double in _common.py and
PythonMLLibAPI in MLLib.
You can merge this pull request into a Git repository by running
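As a generic illustration of serializing a double into 8 bytes, the sketch below uses Python's `struct` module. It does not reproduce the wire format used by `_common.py` or `PythonMLLibAPI`: the little-endian byte order, the absence of any header or type tag, and the function names are all assumptions of this example.

```python
import struct


def serialize_double(value):
    """Pack a float into exactly 8 bytes (IEEE 754, little-endian assumed)."""
    return struct.pack("<d", value)


def deserialize_double(payload):
    """Unpack 8 bytes produced by serialize_double back into a float."""
    if len(payload) != 8:
        raise ValueError("expected exactly 8 bytes, got %d" % len(payload))
    return struct.unpack("<d", payload)[0]


payload = serialize_double(3.141592653589793)
restored = deserialize_double(payload)
```

A bare 8-byte payload like this cannot be told apart from any other 8-byte value, such as a 64-bit integer, which is the kind of ambiguity the later comments in this thread discuss.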
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1581#issuecomment-50091128
@falaki @mengxr Created a separate PR for this so I can use it in both the
python correlation and python randomRDD additions.
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1581#issuecomment-50092950
@mengxr Given the current list of supported types, no, but if someone down
the road adds Long or arrays of chars/shorts, etc, which isn't far-fetched,
then it becomes
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1581#issuecomment-50101155
The issue is what other things we can reasonably serialize into 8 bytes.
Not sure how other types of doubles are relevant here since the size would be
different and cause
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1520#discussion_r15305960
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala
---
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1520#discussion_r15307029
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDGenerators.scala ---
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software
GitHub user dorx opened a pull request:
https://github.com/apache/spark/pull/1554
[SPARK-2656] Python version of stratified sampling
Exact sample size is not supported for now.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dorx
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1554#issuecomment-49951984
@mengxr @falaki
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1520#discussion_r15262023
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/random/DistributionGenerator.scala
---
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1520#discussion_r15265484
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/random/DistributionGenerator.scala
---
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1520#discussion_r15265778
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala
---
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1520#discussion_r15265929
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala
---
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1520#discussion_r15266470
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala
---
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1520#discussion_r15266943
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDGenerators.scala ---
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1425#issuecomment-49683184
@dbtsai this is awesome! I actually created a JIRA on this after trying to
use TestUtils in one of my unit suites, but it looks like you're already taking
care
GitHub user dorx opened a pull request:
https://github.com/apache/spark/pull/1520
[SPARK-2514] [mllib] Random RDD generator
Utilities for generating random RDDs.
RandomRDD and RandomVectorRDD are created instead of using
`sc.parallelize(range:Range)` because `Range
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1520#issuecomment-49695577
@falaki @jkbradley @mengxr
Github user dorx closed the pull request at:
https://github.com/apache/spark/pull/1473
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1473#discussion_r15098158
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -135,7 +135,7 @@ class RangePartitioner[K : Ordering : ClassTag, V](
val k
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r15098446
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
---
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r15098523
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
---
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1473#issuecomment-49466619
A superficial look at the failed unit tests seems to suggest some Spark SQL
optimizations rely on the fact that 1000 is set as the sequential scan
threshold. @rxin
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r15135411
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
---
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache
GitHub user dorx opened a pull request:
https://github.com/apache/spark/pull/1473
Fixed a typo in the comments in RangePartitioner
Checked with Holden, the original author as per the log, and was told the
code is right and the comment is wrong.
You can merge this pull request into a Git
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r15017783
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r15017975
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r15020400
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r15022756
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala
---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r15028681
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r15030155
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r15033178
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala
---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r15036448
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala
---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r15036918
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r14896742
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala
---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r14896912
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala
---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r14897552
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/PearsonCorrelation.scala
---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r14899055
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r14899116
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r14900326
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r14900429
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1367#issuecomment-48953501
@mengxr Thanks for the feedback. Can you respond to my followup questions
before I update my PR?
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1025#discussion_r14906065
--- Diff: pom.xml ---
@@ -257,6 +257,11 @@
<version>1.5</version>
</dependency>
<dependency>
+<groupId>org.apache.commons
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1025#discussion_r14906349
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
---
@@ -195,6 +193,37 @@ class PairRDDFunctions[K, V](self: RDD[(K, V
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1025#discussion_r14906412
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
---
@@ -195,6 +193,37 @@ class PairRDDFunctions[K, V](self: RDD[(K, V
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1025#discussion_r14906680
--- Diff:
core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala ---
@@ -0,0 +1,311 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1025#discussion_r14906754
--- Diff:
core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala ---
@@ -0,0 +1,311 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1025#discussion_r14906825
--- Diff:
core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala ---
@@ -0,0 +1,311 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1025#discussion_r14906919
--- Diff:
core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala ---
@@ -0,0 +1,311 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1025#discussion_r14907155
--- Diff:
core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala ---
@@ -0,0 +1,311 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1025#discussion_r14907202
--- Diff:
core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala ---
@@ -0,0 +1,311 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1025#discussion_r14907670
--- Diff:
core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala ---
@@ -0,0 +1,311 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1025#discussion_r14907870
--- Diff:
core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala ---
@@ -0,0 +1,311 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1025#discussion_r14907896
--- Diff:
core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala ---
@@ -0,0 +1,311 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r14836617
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala
---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r14836715
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala
---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r14836922
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala
---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1367#discussion_r14846509
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala
---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software
GitHub user dorx opened a pull request:
https://github.com/apache/spark/pull/1367
[SPARK-2359][MLlib] Correlations
Implementation for Pearson and Spearman's correlation.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dorx/spark
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1367#issuecomment-48682999
@mengxr, @falaki, @jkbradley please take a look
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1025#discussion_r14672589
--- Diff:
core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala ---
@@ -0,0 +1,335 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1025#issuecomment-48386518
Jenkins, retest this please.
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1025#discussion_r14688550
--- Diff:
core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala ---
@@ -45,11 +50,75 @@ private[spark] object SamplingUtils {
val
Github user dorx commented on a diff in the pull request:
https://github.com/apache/spark/pull/1025#discussion_r14688633
--- Diff:
core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala ---
@@ -0,0 +1,310 @@
+/*
+ * Licensed to the Apache Software
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1025#issuecomment-48419179
Holding out on updating the docs until the python version is supported.
For the python version, any objections to using _jrdd to invoke the java
version of sampleByKey
Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/1025#issuecomment-48419578
Jenkins, retest this please.