[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22365 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22365#discussion_r219034294

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -370,29 +370,76 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
        * @since 1.5.0
        */
       def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame = {
    +    sampleBy(Column(col), fractions, seed)
    +  }
    +
    +  /**
    +   * Returns a stratified sample without replacement based on the fraction given on each stratum.
    +   * @param col column that defines strata
    +   * @param fractions sampling fraction for each stratum. If a stratum is not specified, we treat
    +   *                  its fraction as zero.
    +   * @param seed random seed
    +   * @tparam T stratum type
    +   * @return a new `DataFrame` that represents the stratified sample
    +   *
    +   * @since 1.5.0
    +   */
    +  def sampleBy[T](col: String, fractions: ju.Map[T, jl.Double], seed: Long): DataFrame = {
    +    sampleBy(col, fractions.asScala.toMap.asInstanceOf[Map[T, Double]], seed)
    +  }
    +
    +  /**
    +   * Returns a stratified sample without replacement based on the fraction given on each stratum.
    +   * @param col column that defines strata
    +   * @param fractions sampling fraction for each stratum. If a stratum is not specified, we treat
    +   *                  its fraction as zero.
    +   * @param seed random seed
    +   * @tparam T stratum type
    +   * @return a new `DataFrame` that represents the stratified sample
    +   *
    +   * The stratified sample can be performed over multiple columns:
    +   * {{{
    +   *    import org.apache.spark.sql.Row
    +   *    import org.apache.spark.sql.functions.struct
    +   *
    +   *    val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
    +   *      ("Alice", 10))).toDF("name", "age")
    +   *    val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0)
    +   *    df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show()
    +   *    +-----+---+
    +   *    | name|age|
    +   *    +-----+---+
    +   *    | Nico|  8|
    +   *    |Alice| 10|
    +   *    +-----+---+
    +   * }}}
    +   *
    +   * @since 3.0.0
    --- End diff --

    the next release is 2.5.0
[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22365#discussion_r217257137

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -370,29 +370,76 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
        * @since 1.5.0
        */
       def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame = {
    --- End diff --

    Will probably send an email after 2.4.0 since it's not going to be super urgent.
[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22365#discussion_r217256279

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -370,29 +370,76 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
        * @since 1.5.0
        */
       def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame = {
    --- End diff --

    I'm +1 on it, but we should probably send an email to the dev list to get more feedback.
[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22365#discussion_r217252035

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -370,29 +370,76 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
        * @since 1.5.0
        */
       def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame = {
    --- End diff --

    @cloud-fan, WDYT about starting to deprecate the String method?
[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22365#discussion_r216482340

    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -880,18 +880,23 @@ def sampleBy(self, col, fractions, seed=None):
             |  0|    5|
             |  1|    9|
             +---+-----+
    +        >>> dataset.sampleBy(col("key"), fractions={2: 1.0}, seed=0).count()
    --- End diff --

    Added
[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22365#discussion_r216233575

    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -880,18 +880,23 @@ def sampleBy(self, col, fractions, seed=None):
             |  0|    5|
             |  1|    9|
             +---+-----+
    +        >>> dataset.sampleBy(col("key"), fractions={2: 1.0}, seed=0).count()
    +        33
             """
    -        if not isinstance(col, basestring):
    -            raise ValueError("col must be a string, but got %r" % type(col))
    +        if isinstance(col, basestring):
    +            col = Column(col)
    +        elif not isinstance(col, Column):
    +            raise ValueError("col must be a string or a column, but got %r" % type(col))
             if not isinstance(fractions, dict):
                 raise ValueError("fractions must be a dict but got %r" % type(fractions))
             for k, v in fractions.items():
                 if not isinstance(k, (float, int, long, basestring)):
                     raise ValueError("key must be float, int, long, or string, but got %r" % type(k))
                 fractions[k] = float(v)
             seed = seed if seed is not None else random.randint(0, sys.maxsize)
    -        return DataFrame(self._jdf.stat().sampleBy(col, self._jmap(fractions), seed), self.sql_ctx)
    +        return DataFrame(self._jdf.stat()
    +                         .sampleBy(col._jc, self._jmap(fractions), seed), self.sql_ctx)
    --- End diff --

    I would just do `col = col._jc`
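For readers following the review, the argument handling in the diff above, together with the suggested `col = col._jc` rebinding, can be sketched in isolation. This is only an illustration under stated assumptions: the `Column` class here is a hypothetical stand-in (not `pyspark.sql.Column`), the helper name `normalize_sample_by_args` is made up, and Python 3 types (`str`, `int`) replace the `basestring` and `long` checks from the patch. Unlike the patch, the sketch copies the fractions into a new dict instead of mutating the caller's dict.

```python
class Column:
    """Hypothetical stand-in for a column wrapper; `_jc` mimics the wrapped JVM object."""

    def __init__(self, name):
        self._jc = "jcol(%s)" % name


def normalize_sample_by_args(col, fractions):
    """Validate sampleBy-style arguments and return (jvm_column, float_fractions)."""
    # Accept either a column name or a Column, as in the patched check.
    if isinstance(col, str):
        col = Column(col)
    elif not isinstance(col, Column):
        raise ValueError("col must be a string or a column, but got %r" % type(col))
    if not isinstance(fractions, dict):
        raise ValueError("fractions must be a dict but got %r" % type(fractions))
    normalized = {}
    for k, v in fractions.items():
        if not isinstance(k, (float, int, str)):
            raise ValueError("key must be float, int, or string, but got %r" % type(k))
        normalized[k] = float(v)
    # Rebind once, as suggested in the review, so later code passes `col` directly.
    col = col._jc
    return col, normalized


jcol_from_name, fracs = normalize_sample_by_args("key", {0: 1, 1: 0.5})
jcol_from_column, _ = normalize_sample_by_args(Column("key"), {2: 1.0})
```

Rebinding `col` up front keeps the final call site on one line, which appears to be the motivation for the review comment.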
[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22365#discussion_r216233066

    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -880,18 +880,23 @@ def sampleBy(self, col, fractions, seed=None):
             |  0|    5|
             |  1|    9|
             +---+-----+
    +        >>> dataset.sampleBy(col("key"), fractions={2: 1.0}, seed=0).count()
    --- End diff --

    @MaxGekk, shall we add:

    ```python
    .. versionchanged:: 3.0
       blah blah blah
    ```
[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/22365

    [SPARK-25381][SQL] Stratified sampling by Column argument

    ## What changes were proposed in this pull request?

    In the PR, I propose to add an overloaded `sampleBy` method that accepts a `Column` as its first argument. This allows sampling by complex columns as well as by multiple columns. For example:

    ```Scala
    spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
      ("Alice", 10))).toDF("name", "age")
      .stat
      .sampleBy(struct($"name", $"age"), Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0), 36L)
      .show()
    +-----+---+
    | name|age|
    +-----+---+
    | Nico|  8|
    |Alice| 10|
    +-----+---+
    ```

    ## How was this patch tested?

    Added a new Scala test for sampling by multiple columns, plus Java and Python tests that check that `sampleBy` accepts a `Column` argument.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 sample-by-column

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22365.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22365

commit 3832f2137676a76d6d06a0bb6dbcedcba801910b
Author: Maxim Gekk
Date:   2018-09-08T13:30:49Z

    Adding overloaded sampleBy with Column type

commit 5cd3229ce8bfe894dac8ebc097109da237d95401
Author: Maxim Gekk
Date:   2018-09-08T13:39:30Z

    Adding overloaded sampleBy with Column type for Java

commit e2e61498c47da9d7b36d2e0727ce8642d5d71472
Author: Maxim Gekk
Date:   2018-09-08T14:56:36Z

    Adding overloaded sampleBy with Column type for Python
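The sampling semantics discussed in this thread (a stratified sample without replacement, a per-stratum fraction, and a fraction of zero for any stratum that is not listed) can be illustrated with a small pure-Python sketch. This is not Spark's implementation and will not reproduce Spark's output for a given seed; `sample_by` and its parameters are hypothetical names, and per-row Bernoulli sampling is assumed as the mechanism. Tuples play the role of the `struct($"name", $"age")` key from the PR example.

```python
import random


def sample_by(rows, key, fractions, seed):
    """Keep each row independently with the probability assigned to its stratum."""
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        # Strata missing from `fractions` get fraction 0.0 and are never kept.
        fraction = fractions.get(key(row), 0.0)
        # rng.random() is in [0.0, 1.0), so fraction 1.0 always keeps the row.
        if rng.random() < fraction:
            sampled.append(row)
    return sampled


rows = [("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17), ("Alice", 10)]
# The stratum key is the whole (name, age) tuple, i.e. sampling by two columns.
fractions = {("Alice", 10): 0.3, ("Nico", 8): 1.0}
sample = sample_by(rows, key=lambda r: r, fractions=fractions, seed=36)
```

With these fractions, every `("Nico", 8)` row is kept, no `("Bob", 17)` row is kept, and each `("Alice", 10)` row is kept with probability 0.3, so the fractions are expected rates and not exact counts.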