[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

2018-09-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22365


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

2018-09-19 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22365#discussion_r219034294
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -370,29 +370,76 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    * @since 1.5.0
    */
   def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame = {
+    sampleBy(Column(col), fractions, seed)
+  }
+
+  /**
+   * Returns a stratified sample without replacement based on the fraction given on each stratum.
+   * @param col column that defines strata
+   * @param fractions sampling fraction for each stratum. If a stratum is not specified, we treat
+   *                  its fraction as zero.
+   * @param seed random seed
+   * @tparam T stratum type
+   * @return a new `DataFrame` that represents the stratified sample
+   *
+   * @since 1.5.0
+   */
+  def sampleBy[T](col: String, fractions: ju.Map[T, jl.Double], seed: Long): DataFrame = {
+    sampleBy(col, fractions.asScala.toMap.asInstanceOf[Map[T, Double]], seed)
+  }
+
+  /**
+   * Returns a stratified sample without replacement based on the fraction given on each stratum.
+   * @param col column that defines strata
+   * @param fractions sampling fraction for each stratum. If a stratum is not specified, we treat
+   *                  its fraction as zero.
+   * @param seed random seed
+   * @tparam T stratum type
+   * @return a new `DataFrame` that represents the stratified sample
+   *
+   * The stratified sample can be performed over multiple columns:
+   * {{{
+   *    import org.apache.spark.sql.Row
+   *    import org.apache.spark.sql.functions.struct
+   *
+   *    val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
+   *      ("Alice", 10))).toDF("name", "age")
+   *    val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0)
+   *    df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show()
+   *    +-----+---+
+   *    | name|age|
+   *    +-----+---+
+   *    | Nico|  8|
+   *    |Alice| 10|
+   *    +-----+---+
+   * }}}
+   *
+   * @since 3.0.0
--- End diff --

the next release is 2.5.0





[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

2018-09-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22365#discussion_r217257137
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -370,29 +370,76 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    * @since 1.5.0
    */
   def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame = {
--- End diff --

Will probably send an email after 2.4.0 since it's not going to be super 
urgent.





[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

2018-09-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22365#discussion_r217256279
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -370,29 +370,76 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    * @since 1.5.0
    */
   def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame = {
--- End diff --

I'm +1 for it, but we probably need to send an email to the dev list to get more 
feedback.





[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

2018-09-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22365#discussion_r217252035
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -370,29 +370,76 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    * @since 1.5.0
    */
   def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame = {
--- End diff --

@cloud-fan, WDYT about starting to deprecate the String method?





[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

2018-09-10 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22365#discussion_r216482340
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -880,18 +880,23 @@ def sampleBy(self, col, fractions, seed=None):
         |  0|    5|
         |  1|    9|
         +---+-----+
+        >>> dataset.sampleBy(col("key"), fractions={2: 1.0}, seed=0).count()
--- End diff --

Added





[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

2018-09-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22365#discussion_r216233575
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -880,18 +880,23 @@ def sampleBy(self, col, fractions, seed=None):
         |  0|    5|
         |  1|    9|
         +---+-----+
+        >>> dataset.sampleBy(col("key"), fractions={2: 1.0}, seed=0).count()
+        33
 
         """
-        if not isinstance(col, basestring):
-            raise ValueError("col must be a string, but got %r" % type(col))
+        if isinstance(col, basestring):
+            col = Column(col)
+        elif not isinstance(col, Column):
+            raise ValueError("col must be a string or a column, but got %r" % type(col))
         if not isinstance(fractions, dict):
             raise ValueError("fractions must be a dict but got %r" % type(fractions))
         for k, v in fractions.items():
             if not isinstance(k, (float, int, long, basestring)):
                 raise ValueError("key must be float, int, long, or string, but got %r" % type(k))
             fractions[k] = float(v)
         seed = seed if seed is not None else random.randint(0, sys.maxsize)
-        return DataFrame(self._jdf.stat().sampleBy(col, self._jmap(fractions), seed), self.sql_ctx)
+        return DataFrame(self._jdf.stat()
+                         .sampleBy(col._jc, self._jmap(fractions), seed), self.sql_ctx)
--- End diff --

I would just do `col = col._jc`
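
A minimal sketch of that suggestion, assuming Python 3 (`str` in place of `basestring`). `Column` here is a hypothetical stand-in for `pyspark.sql.Column`, and `normalize_col` is an illustrative helper that is not part of the PR; it only shows the idea of rebinding `col` to the wrapped JVM column once, so the later call can pass `col` directly:

```python
class Column:
    """Hypothetical stand-in for pyspark.sql.Column (assumption, not the real class)."""

    def __init__(self, jc):
        # In PySpark, `_jc` holds the underlying JVM column; here it is any object.
        self._jc = jc


def normalize_col(col):
    """Accept a column name or a Column and return the wrapped JVM-side value."""
    if isinstance(col, str):
        col = Column(col)
    elif not isinstance(col, Column):
        raise ValueError("col must be a string or a column, but got %r" % type(col))
    # Rebind once, as suggested in the review, instead of writing `col._jc` at the call site.
    col = col._jc
    return col
```

With this shape, the final call would read `sampleBy(col, self._jmap(fractions), seed)` and fit on one line.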





[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

2018-09-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22365#discussion_r216233066
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -880,18 +880,23 @@ def sampleBy(self, col, fractions, seed=None):
         |  0|    5|
         |  1|    9|
         +---+-----+
+        >>> dataset.sampleBy(col("key"), fractions={2: 1.0}, seed=0).count()
--- End diff --

@MaxGekk, shall we add:

```python
.. versionchanged:: 3.0
   blah blah blah
```





[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

2018-09-08 Thread MaxGekk
GitHub user MaxGekk opened a pull request:

https://github.com/apache/spark/pull/22365

[SPARK-25381][SQL] Stratified sampling by Column argument

## What changes were proposed in this pull request?

In this PR, I propose adding an overloaded `sampleBy` method whose first 
argument is of the `Column` type. This allows sampling by any complex column 
as well as by multiple columns. For example:

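```Scala
// Assumes a spark-shell session, where `spark` and `spark.implicits._` (for `$"..."`) are in scope.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.struct

spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
  ("Alice", 10))).toDF("name", "age")
  .stat
  .sampleBy(struct($"name", $"age"), Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0), 36L)
  .show()

+-----+---+
| name|age|
+-----+---+
| Nico|  8|
|Alice| 10|
+-----+---+
```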
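
To illustrate the semantics only (this is not Spark code and not necessarily how `sampleBy` is implemented), here is a plain-Python sketch of stratified sampling without replacement over a composite key. `sample_by` and `composite_key` are hypothetical names; the sketch assumes each row is kept independently with the probability given for its stratum, and that a stratum missing from `fractions` has fraction zero, as the scaladoc states:

```python
import random


def sample_by(rows, key, fractions, seed):
    """Keep each row with the probability assigned to its stratum.

    Strata absent from `fractions` are treated as fraction 0.0 and never sampled.
    """
    rng = random.Random(seed)  # seeded generator, so the same seed reproduces the sample
    sampled = []
    for row in rows:
        fraction = fractions.get(key(row), 0.0)
        # rng.random() is in [0.0, 1.0): fraction 1.0 always keeps, 0.0 never keeps.
        if rng.random() < fraction:
            sampled.append(row)
    return sampled


def composite_key(row):
    """Stratum defined by both columns (name, age), mirroring struct($"name", $"age")."""
    return (row[0], row[1])


rows = [("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17), ("Alice", 10)]
fractions = {("Alice", 10): 0.3, ("Nico", 8): 1.0}
result = sample_by(rows, composite_key, fractions, seed=36)
```

Here `("Nico", 8)` is always kept and the `("Bob", 17)` rows are always dropped. Which `("Alice", 10)` rows survive depends on the random generator, so it will not necessarily match the Spark output above even with the same seed.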

## How was this patch tested?

Added a new Scala test for sampling by multiple columns, plus Java and 
Python tests that check `sampleBy` can sample by a `Column` type 
argument.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MaxGekk/spark-1 sample-by-column

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22365.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22365


commit 3832f2137676a76d6d06a0bb6dbcedcba801910b
Author: Maxim Gekk 
Date:   2018-09-08T13:30:49Z

Adding overloaded sampleBy with Column type

commit 5cd3229ce8bfe894dac8ebc097109da237d95401
Author: Maxim Gekk 
Date:   2018-09-08T13:39:30Z

Adding overloaded sampleBy with Column type for Java

commit e2e61498c47da9d7b36d2e0727ce8642d5d71472
Author: Maxim Gekk 
Date:   2018-09-08T14:56:36Z

Adding overloaded sampleBy with Column type for Python



