[jira] [Assigned] (SPARK-22981) Incorrect results of casting Struct to String
[ https://issues.apache.org/jira/browse/SPARK-22981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22981:
------------------------------------

    Assignee: Apache Spark
[jira] [Commented] (SPARK-22981) Incorrect results of casting Struct to String
[ https://issues.apache.org/jira/browse/SPARK-22981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315071#comment-16315071 ]

Apache Spark commented on SPARK-22981:
--------------------------------------

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/20176
[jira] [Assigned] (SPARK-22981) Incorrect results of casting Struct to String
[ https://issues.apache.org/jira/browse/SPARK-22981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22981:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Created] (SPARK-22981) Incorrect results of casting Struct to String
Takeshi Yamamuro created SPARK-22981:
-------------------------------------

             Summary: Incorrect results of casting Struct to String
                 Key: SPARK-22981
                 URL: https://issues.apache.org/jira/browse/SPARK-22981
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.1
            Reporter: Takeshi Yamamuro


{code}
scala> val df = Seq(((1, "a"), 0), ((2, "b"), 0)).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: struct<_1: int, _2: string>, b: int]

scala> df.write.saveAsTable("t")

scala> sql("SELECT CAST(a AS STRING) FROM t").show
+---------------+
|              a|
+---------------+
|[0,1,180001,61]|
|[0,2,180001,62]|
+---------------+
{code}
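As a point of reference (not part of the report): to_json renders a struct column as readable JSON and can serve as a stopgap while the cast prints internal row bytes. A minimal PySpark sketch, assuming a running SparkSession named spark and the table t created above:

{code}
from pyspark.sql.functions import col, to_json

# Workaround sketch: render the struct as JSON text instead of relying on
# CAST(a AS STRING), which currently prints the internal row bytes.
spark.table("t") \
    .select(to_json(col("a")).alias("a_as_text")) \
    .show(truncate=False)
# Expected to print values such as {"_1":1,"_2":"a"} rather than [0,1,180001,61]
{code}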
[jira] [Resolved] (SPARK-22973) Incorrect results of casting Map to String
[ https://issues.apache.org/jira/browse/SPARK-22973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-22973.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0

Issue resolved by pull request 20166
[https://github.com/apache/spark/pull/20166]

> Incorrect results of casting Map to String
> -------------------------------------------
>
>                 Key: SPARK-22973
>                 URL: https://issues.apache.org/jira/browse/SPARK-22973
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>             Fix For: 2.3.0
>
> {code}
> scala> Seq(Map(1 -> "a", 2 -> "b")).toDF("a").write.saveAsTable("t")
> scala> sql("SELECT cast(a as String) FROM t").show(false)
> +-----------------------------------------------------------------+
> |a                                                                |
> +-----------------------------------------------------------------+
> |org.apache.spark.sql.catalyst.expressions.UnsafeMapData@38bdd75d|
> +-----------------------------------------------------------------+
> {code}
[jira] [Assigned] (SPARK-22973) Incorrect results of casting Map to String
[ https://issues.apache.org/jira/browse/SPARK-22973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-22973:
-----------------------------------

    Assignee: Takeshi Yamamuro
[jira] [Updated] (SPARK-22980) Wrong answer when using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-22980:
----------------------------
    Priority: Blocker  (was: Major)
[jira] [Commented] (SPARK-22980) Wrong answer when using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315042#comment-16315042 ]

Xiao Li commented on SPARK-22980:
---------------------------------

cc [~icexelloss] [~bryanc] [~ueshin]
[jira] [Created] (SPARK-22980) Wrong answer when using pandas_udf
Xiao Li created SPARK-22980:
----------------------------

             Summary: Wrong answer when using pandas_udf
                 Key: SPARK-22980
                 URL: https://issues.apache.org/jira/browse/SPARK-22980
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark
    Affects Versions: 2.3.0
            Reporter: Xiao Li


{noformat}
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import col, lit
from pyspark.sql.types import LongType

df = spark.range(3)
f = pandas_udf(lambda x, y: len(x) + y, LongType())
df.select(f(lit('text'), col('id'))).show()
{noformat}

{noformat}
from pyspark.sql.functions import udf
from pyspark.sql.functions import col, lit
from pyspark.sql.types import LongType

df = spark.range(3)
f = udf(lambda x, y: len(x) + y, LongType())
df.select(f(lit('text'), col('id'))).show()
{noformat}

The results of pandas_udf are different from udf.
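For context (not stated in the ticket): with the row-at-a-time udf, x is the Python string 'text', so len(x) is 4 and the expected column is 4 + id. A scalar pandas_udf receives its arguments as pandas Series, so len(x) is presumably the batch length instead, which would explain the mismatch. A sketch of the plain-udf baseline, assuming a running SparkSession named spark:

{code}
from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import LongType

df = spark.range(3)
f = udf(lambda x, y: len(x) + y, LongType())

# Row-at-a-time UDF: len('text') == 4, so the column should be 4, 5, 6
# for id = 0, 1, 2. The pandas_udf variant above returns something else.
df.select(f(lit('text'), col('id')).alias('out')).show()
{code}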
[jira] [Updated] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-20498:
--------------------------------------
    Target Version/s: 2.4.0  (was: 2.3.0)

> RandomForestRegressionModel should expose getMaxDepth in PySpark
> -----------------------------------------------------------------
>
>                 Key: SPARK-20498
>                 URL: https://issues.apache.org/jira/browse/SPARK-20498
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>    Affects Versions: 2.1.0
>            Reporter: Nick Lothian
>            Assignee: Xin Ren
>            Priority: Minor
>
> Currently it isn't clear how to get the max depth of a
> RandomForestRegressionModel (e.g., after doing a grid search).
>
> It is possible to call
> {{regressor._java_obj.getMaxDepth()}}
> but most other decision trees allow
> {{regressor.getMaxDepth()}}
[jira] [Commented] (SPARK-18569) Support R formula arithmetic
[ https://issues.apache.org/jira/browse/SPARK-18569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314994#comment-16314994 ]

Joseph K. Bradley commented on SPARK-18569:
-------------------------------------------

[~felixcheung] There are a few JIRAs for R still targeted for 2.3. Would you mind vetting them to see which should be retargeted for 2.4 vs. un-targeted? Thanks!

> Support R formula arithmetic
> ----------------------------
>
>                 Key: SPARK-18569
>                 URL: https://issues.apache.org/jira/browse/SPARK-18569
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, SparkR
>            Reporter: Felix Cheung
>
> I think we should support arithmetic, which makes it a lot more convenient to
> build models. Something like
> {code}
> log(y) ~ a + log(x)
> {code}
> And to avoid resolution confusion we should support the I() operator:
> {code}
> I
> I(X*Z)  as is: include a new variable consisting of these variables multiplied
> {code}
> Such that this works:
> {code}
> y ~ a + I(b+c)
> {code}
> where the term b+c is to be interpreted as the sum of b and c.
[jira] [Commented] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314993#comment-16314993 ]

Joseph K. Bradley commented on SPARK-20498:
-------------------------------------------

We'll need to retarget this for 2.4, but I'll mark myself as shepherd to try to make sure it gets in.
[jira] [Updated] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-20498:
--------------------------------------
    Shepherd: Joseph K. Bradley
[jira] [Commented] (SPARK-20602) Adding LBFGS optimizer and Squared_hinge loss for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-20602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314979#comment-16314979 ]

Joseph K. Bradley commented on SPARK-20602:
-------------------------------------------

I'm afraid we'll need to re-target this since the branch has been cut for 2.3. I'll change it to 2.4, but [~yanboliang] please retarget as needed.

> Adding LBFGS optimizer and Squared_hinge loss for LinearSVC
> ------------------------------------------------------------
>
>                 Key: SPARK-20602
>                 URL: https://issues.apache.org/jira/browse/SPARK-20602
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: yuhao yang
>            Assignee: yuhao yang
>
> Currently LinearSVC in Spark only supports OWLQN as the optimizer (check
> https://issues.apache.org/jira/browse/SPARK-14709). I made a comparison between
> LBFGS and OWLQN on several public datasets and found that LBFGS converges much
> faster for LinearSVC in most cases.
> The following table presents the number of training iterations and the f1 score
> of both optimizers at convergence:
> ||Dataset||LBFGS with hinge||OWLQN with hinge||LBFGS with squared_hinge||
> |news20.binary| 31 (0.99) | 413 (0.99) | 185 (0.99) |
> |mushroom| 28 (1.0) | 170 (1.0) | 24 (1.0) |
> |madelon| 143 (0.75) | 8129 (0.70) | 823 (0.74) |
> |breast-cancer-scale| 15 (1.0) | 16 (1.0) | 15 (1.0) |
> |phishing | 329 (0.94) | 231 (0.94) | 67 (0.94) |
> |a1a(adult) | 466 (0.87) | 282 (0.87) | 77 (0.86) |
> |a7a | 237 (0.84) | 372 (0.84) | 69 (0.84) |
> data source:
> https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
> training code: new LinearSVC().setMaxIter(1).setTol(1e-6)
> LBFGS requires fewer iterations in most cases (except for a1a) and is probably
> a better default optimizer.
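For readers following along in Python, the benchmark's training call looks roughly like the sketch below (not from the ticket; the dataset path and maxIter value are illustrative, and as of 2.2 this still trains with the existing OWLQN-based hinge objective):

{code}
from pyspark.ml.classification import LinearSVC

# Load one of the LIBSVM datasets referenced above (path is a placeholder).
data = spark.read.format("libsvm").load("data/news20.binary")

# Roughly mirrors the Scala call in the description.
svc = LinearSVC(maxIter=100, tol=1e-6)
model = svc.fit(data)
print(model.coefficients.size)
{code}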
[jira] [Updated] (SPARK-20602) Adding LBFGS optimizer and Squared_hinge loss for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-20602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-20602:
--------------------------------------
    Target Version/s: 2.4.0  (was: 2.3.0)
[jira] [Commented] (SPARK-22796) Add multiple column support to PySpark QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-22796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314978#comment-16314978 ]

Joseph K. Bradley commented on SPARK-22796:
-------------------------------------------

We'll need to re-target this for 2.4 now that the branch has been cut.

> Add multiple column support to PySpark QuantileDiscretizer
> -----------------------------------------------------------
>
>                 Key: SPARK-22796
>                 URL: https://issues.apache.org/jira/browse/SPARK-22796
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, PySpark
>    Affects Versions: 2.3.0
>            Reporter: Nick Pentreath
[jira] [Commented] (SPARK-18348) Improve tree ensemble model summary
[ https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314976#comment-16314976 ]

Joseph K. Bradley commented on SPARK-18348:
-------------------------------------------

I'll remove the target, but please re-add as needed

> Improve tree ensemble model summary
> ------------------------------------
>
>                 Key: SPARK-18348
>                 URL: https://issues.apache.org/jira/browse/SPARK-18348
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SparkR
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Felix Cheung
>
> During work on R APIs for tree ensemble models (e.g., Random Forest, GBT) it was
> discovered and discussed that
> - we don't have a good summary on nodes or trees for their observations,
>   loss, probability and so on
> - we don't have a shared API with nicely formatted output
> We believe this could be a shared API that benefits multiple language
> bindings, including R, when available.
> For example, here is what R {code}rpart{code} shows for model summary:
> {code}
> Call:
> rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
>     method = "class")
> n= 81
>           CP nsplit rel error    xerror      xstd
> 1 0.17647059      0     1.000     1.000 0.2155872
> 2 0.01960784      1 0.8235294 0.9411765 0.2107780
> 3 0.0100           4 0.7647059 1.0588235 0.2200975
> Variable importance
>  Start    Age Number
>     64     24     12
> Node number 1: 81 observations,    complexity param=0.1764706
>   predicted class=absent   expected loss=0.2098765  P(node) =1
>     class counts:    64    17
>    probabilities: 0.790 0.210
>   left son=2 (62 obs) right son=3 (19 obs)
>   Primary splits:
>       Start  < 8.5  to the right, improve=6.762330, (0 missing)
>       Number < 5.5  to the left,  improve=2.866795, (0 missing)
>       Age    < 39.5 to the left,  improve=2.250212, (0 missing)
>   Surrogate splits:
>       Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)
> Node number 2: 62 observations,    complexity param=0.01960784
>   predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
>     class counts:    56     6
>    probabilities: 0.903 0.097
>   left son=4 (29 obs) right son=5 (33 obs)
>   Primary splits:
>       Start  < 14.5 to the right, improve=1.0205280, (0 missing)
>       Age    < 55   to the left,  improve=0.6848635, (0 missing)
>       Number < 4.5  to the left,  improve=0.2975332, (0 missing)
>   Surrogate splits:
>       Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
>       Age    < 16   to the left,  agree=0.597, adj=0.138, (0 split)
> Node number 3: 19 observations
>   predicted class=present  expected loss=0.4210526  P(node) =0.2345679
>     class counts:     8    11
>    probabilities: 0.421 0.579
> Node number 4: 29 observations
>   predicted class=absent   expected loss=0  P(node) =0.3580247
>     class counts:    29     0
>    probabilities: 1.000 0.000
> Node number 5: 33 observations,    complexity param=0.01960784
>   predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
>     class counts:    27     6
>    probabilities: 0.818 0.182
>   left son=10 (12 obs) right son=11 (21 obs)
>   Primary splits:
>       Age    < 55   to the left,  improve=1.2467530, (0 missing)
>       Start  < 12.5 to the right, improve=0.2887701, (0 missing)
>       Number < 3.5  to the right, improve=0.1753247, (0 missing)
>   Surrogate splits:
>       Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
>       Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)
> Node number 10: 12 observations
>   predicted class=absent   expected loss=0  P(node) =0.1481481
>     class counts:    12     0
>    probabilities: 1.000 0.000
> Node number 11: 21 observations,    complexity param=0.01960784
>   predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
>     class counts:    15     6
>    probabilities: 0.714 0.286
>   left son=22 (14 obs) right son=23 (7 obs)
>   Primary splits:
>       Age    < 111  to the right, improve=1.71428600, (0 missing)
>       Start  < 12.5 to the right, improve=0.79365080, (0 missing)
>       Number < 3.5  to the right, improve=0.07142857, (0 missing)
> Node number 22: 14 observations
>   predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
>     class counts:    12     2
>    probabilities: 0.857 0.143
> Node number 23: 7 observations
>   predicted class=present  expected loss=0.4285714  P(node) =0.08641975
>     class counts:     3     4
>    probabilities: 0.429 0.571
> {code}
[jira] [Updated] (SPARK-18348) Improve tree ensemble model summary
[ https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-18348:
--------------------------------------
    Target Version/s:   (was: 2.3.0)
[jira] [Updated] (SPARK-15572) ML persistence in R format: compatibility with other languages
[ https://issues.apache.org/jira/browse/SPARK-15572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-15572:
--------------------------------------
    Target Version/s:   (was: 2.3.0)

> ML persistence in R format: compatibility with other languages
> ---------------------------------------------------------------
>
>                 Key: SPARK-15572
>                 URL: https://issues.apache.org/jira/browse/SPARK-15572
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SparkR
>            Reporter: Joseph K. Bradley
>
> Currently, models saved in R cannot be loaded easily into other languages.
> This is because R saves extra metadata (feature names) alongside the model.
> We should fix this issue so that models can be transferred seamlessly between
> languages.
[jira] [Comment Edited] (SPARK-15572) ML persistence in R format: compatibility with other languages
[ https://issues.apache.org/jira/browse/SPARK-15572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15987592#comment-15987592 ]

Joseph K. Bradley edited comment on SPARK-15572 at 1/6/18 10:40 PM:
--------------------------------------------------------------------

Retargeting since 2.2 has been cut -> I'll remove the target, but please add as needed.

CC [~falaki]


was (Author: josephkb):
Retargeting since 2.2 has been cut -> and to 2.4 now
[jira] [Comment Edited] (SPARK-15572) ML persistence in R format: compatibility with other languages
[ https://issues.apache.org/jira/browse/SPARK-15572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15987592#comment-15987592 ]

Joseph K. Bradley edited comment on SPARK-15572 at 1/6/18 10:39 PM:
--------------------------------------------------------------------

Retargeting since 2.2 has been cut -> and to 2.4 now


was (Author: josephkb):
Retargeting since 2.2 has been cut
[jira] [Comment Edited] (SPARK-15784) Add Power Iteration Clustering to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15987588#comment-15987588 ]

Joseph K. Bradley edited comment on SPARK-15784 at 1/6/18 10:39 PM:
--------------------------------------------------------------------

Retargeting since 2.2 has been cut -> and now to 2.4


was (Author: josephkb):
Retargeting since 2.2 has been cut

> Add Power Iteration Clustering to spark.ml
> -------------------------------------------
>
>                 Key: SPARK-15784
>                 URL: https://issues.apache.org/jira/browse/SPARK-15784
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Xinh Huynh
>            Assignee: Miao Wang
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.
[jira] [Updated] (SPARK-15784) Add Power Iteration Clustering to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-15784:
--------------------------------------
    Target Version/s: 2.4.0  (was: 2.3.0)
[jira] [Updated] (SPARK-18618) SparkR GLM model predict should support type as an argument
[ https://issues.apache.org/jira/browse/SPARK-18618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-18618:
--------------------------------------
    Target Version/s:   (was: 2.3.0)

> SparkR GLM model predict should support type as an argument
> ------------------------------------------------------------
>
>                 Key: SPARK-18618
>                 URL: https://issues.apache.org/jira/browse/SPARK-18618
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SparkR
>            Reporter: Yanbo Liang
>
> SparkR GLM model {{predict}} should support {{type}} as an argument. This will
> make it consistent with native R predict, such as
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html .
[jira] [Commented] (SPARK-18618) SparkR GLM model predict should support type as an argument
[ https://issues.apache.org/jira/browse/SPARK-18618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314966#comment-16314966 ]

Joseph K. Bradley commented on SPARK-18618:
-------------------------------------------

I'll remove the target, but please add a new one as needed. I'll also CC [~falaki] who may be interested.
[jira] [Assigned] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value
[ https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22951:
------------------------------------

    Assignee:     (was: Apache Spark)

> count() after dropDuplicates() on emptyDataFrame returns incorrect value
> -------------------------------------------------------------------------
>
>                 Key: SPARK-22951
>                 URL: https://issues.apache.org/jira/browse/SPARK-22951
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.2, 2.2.0, 2.3.0
>            Reporter: Michael Dreibelbis
>
> here is a minimal Spark Application to reproduce:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
>
> object DropDupesApp extends App {
>
>   override def main(args: Array[String]): Unit = {
>     val conf = new SparkConf()
>       .setAppName("test")
>       .setMaster("local")
>     val sc = new SparkContext(conf)
>     val sql = SQLContext.getOrCreate(sc)
>     assert(sql.emptyDataFrame.count == 0) // expected
>     assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected
>   }
>
> }
> {code}
[jira] [Assigned] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value
[ https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22951:
------------------------------------

    Assignee: Apache Spark
[jira] [Commented] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value
[ https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314508#comment-16314508 ]

Apache Spark commented on SPARK-22951:
--------------------------------------

User 'liufengdb' has created a pull request for this issue:
https://github.com/apache/spark/pull/20174
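Not part of the ticket, but the same check can be written through the Python API; since the problem sits in the optimizer, it should presumably reproduce there as well. A sketch, assuming a running SparkSession named spark:

{code}
from pyspark.sql.types import StructType

# An empty DataFrame (no rows, no columns), analogous to
# sqlContext.emptyDataFrame in the Scala reproduction above.
empty = spark.createDataFrame([], StructType([]))

print(empty.count())                   # expected: 0
print(empty.dropDuplicates().count())  # expected: 0, but the bug yields 1
{code}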
[jira] [Updated] (SPARK-21786) The 'spark.sql.parquet.compression.codec' configuration doesn't take effect on tables with partition field(s)
[ https://issues.apache.org/jira/browse/SPARK-21786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-21786:
----------------------------
    Description:
Since Hive 1.1, Hive allows users to set the parquet compression codec via the
table-level property parquet.compression; see the JIRA:
https://issues.apache.org/jira/browse/HIVE-7858 . We already support
orc.compression for ORC, so for external users it is more straightforward to
support both. See the Stack Overflow question:
https://stackoverflow.com/questions/36941122/spark-sql-ignores-parquet-compression-propertie-specified-in-tblproperties

On the Spark side, our table-level compression conf compression was added by
#11464 in Spark 2.0. We need to support both table-level confs. Users might
also use the session-level conf spark.sql.parquet.compression.codec. The
priority rule is: if other compression codec configuration is found through
Hive or Parquet, the precedence is compression, parquet.compression,
spark.sql.parquet.compression.codec. Acceptable values include: none,
uncompressed, snappy, gzip, lzo. After the change, the rule for Parquet is
consistent with ORC.

Changes:
1. Acquire 'compressionCodecClassName' from parquet.compression as well, with
the precedence order compression, parquet.compression,
spark.sql.parquet.compression.codec, just like what we do in OrcOptions.
2. Change spark.sql.parquet.compression.codec to support "none". In
ParquetOptions we already treat "none" as equivalent to "uncompressed", but it
is not currently allowed as a configured value.

  was:
For tables created like below, 'spark.sql.parquet.compression.codec' doesn't
take any effect when inserting data. And because the default compression codec
is 'uncompressed', if I want to change the compression codec, I have to change
it by 'set parquet.compression='.

In contrast, tables without any partition field work normally with
'spark.sql.parquet.compression.codec', and the default compression codec is
'snappy', but then 'parquet.compression' seems to no longer take effect.
Should we use the 'spark.sql.parquet.compression.codec' configuration
uniformly?

CREATE TABLE Test_Parquet(provincecode int, citycode int, districtcode int)
PARTITIONED BY (p_provincecode int)
STORED AS PARQUET;
INSERT OVERWRITE TABLE Test_Parquet select * from TableB;
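To make the precedence concrete, here is a small sketch (not from the ticket) of the two settings the description mentions, expressed through PySpark; the table and column names follow the example in the old description:

{code}
# Session-level default, the lowest of the precedence levels listed above.
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

# Hive-style table-level property (parquet.compression); per the description,
# this should take precedence over the session conf once the change is in.
spark.sql("""
  CREATE TABLE Test_Parquet (provincecode INT, citycode INT, districtcode INT)
  PARTITIONED BY (p_provincecode INT)
  STORED AS PARQUET
  TBLPROPERTIES ('parquet.compression' = 'SNAPPY')
""")
{code}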
[jira] [Assigned] (SPARK-21786) The 'spark.sql.parquet.compression.codec' configuration doesn't take effect on tables with partition field(s)
[ https://issues.apache.org/jira/browse/SPARK-21786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li reassigned SPARK-21786:
-------------------------------
    Assignee: Jinhua Fu

> The 'spark.sql.parquet.compression.codec' configuration doesn't take effect
> on tables with partition field(s)
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-21786
>                 URL: https://issues.apache.org/jira/browse/SPARK-21786
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Jinhua Fu
>            Assignee: Jinhua Fu
>             Fix For: 2.3.0
[jira] [Resolved] (SPARK-21786) The 'spark.sql.parquet.compression.codec' configuration doesn't take effect on tables with partition field(s)
[ https://issues.apache.org/jira/browse/SPARK-21786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-21786.
-----------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0
[jira] [Resolved] (SPARK-22793) Memory leak in Spark Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-22793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-22793.
-----------------------------
       Resolution: Fixed
         Assignee: zuotingbing
    Fix Version/s: 2.3.0

> Memory leak in Spark Thrift Server
> -----------------------------------
>
>                 Key: SPARK-22793
>                 URL: https://issues.apache.org/jira/browse/SPARK-22793
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2, 2.2.1
>            Reporter: zuotingbing
>            Assignee: zuotingbing
>            Priority: Critical
>             Fix For: 2.3.0
>
> 1. Start HiveThriftServer2.
> 2. Connect to the thriftserver through beeline.
> 3. Close the beeline.
> 4. Repeat steps 2 and 3 several times, which causes the memory leak.
>
> We found many directories that are never dropped under the paths
> {code:java}
> hive.exec.local.scratchdir
> {code}
> and
> {code:java}
> hive.exec.scratchdir
> {code}
> As we know, the scratchdir is added to deleteOnExit when it is created, so the
> cache size of FileSystem deleteOnExit keeps increasing until the JVM
> terminates.
> In addition, we use
> {code:java}
> jmap -histo:live [PID]
> {code}
> to print out the sizes of objects in the HiveThriftServer2 process, and we can
> see that the counts of "org.apache.spark.sql.hive.client.HiveClientImpl" and
> "org.apache.hadoop.hive.ql.session.SessionState" keep increasing even though
> we closed all the beeline connections, which causes the memory leak.
[jira] [Commented] (SPARK-22901) Add non-deterministic to Python UDF
[ https://issues.apache.org/jira/browse/SPARK-22901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314476#comment-16314476 ]

Apache Spark commented on SPARK-22901:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/20173

> Add non-deterministic to Python UDF
> ------------------------------------
>
>                 Key: SPARK-22901
>                 URL: https://issues.apache.org/jira/browse/SPARK-22901
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.1
>            Reporter: Xiao Li
>            Assignee: Marco Gaido
>             Fix For: 2.3.0
>
> Add a new API for Python UDF to allow users to change the determinism from
> deterministic to non-deterministic.
[jira] [Commented] (SPARK-22979) Avoid per-record type dispatch in Python data conversion (EvaluatePython.fromJava)
[ https://issues.apache.org/jira/browse/SPARK-22979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314459#comment-16314459 ]

Apache Spark commented on SPARK-22979:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/20172
[jira] [Assigned] (SPARK-22979) Avoid per-record type dispatch in Python data conversion (EvaluatePython.fromJava)
[ https://issues.apache.org/jira/browse/SPARK-22979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22979:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-22979) Avoid per-record type dispatch in Python data conversion (EvaluatePython.fromJava)
[ https://issues.apache.org/jira/browse/SPARK-22979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22979:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Updated] (SPARK-22978) Register Vectorized UDFs for SQL Statement
[ https://issues.apache.org/jira/browse/SPARK-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-22978:
----------------------------
    Description:
Capable of registering vectorized UDFs and then using them in SQL statements.

For example,

{noformat}
>>> import random
>>> from pyspark.sql.types import IntegerType
>>> from pyspark.sql.functions import pandas_udf
>>> random_pandas_udf = pandas_udf(
...     lambda x: random.randint(0, 100) + x, IntegerType())
...     .asNondeterministic()  # doctest: +SKIP
>>> _ = spark.catalog.registerFunction(
...     "random_pandas_udf", random_pandas_udf, IntegerType())  # doctest: +SKIP
>>> spark.sql("SELECT random_pandas_udf(2)").collect()  # doctest: +SKIP
[Row(random_pandas_udf(2)=84)]
{noformat}

  was:
Capable of registering vectorized UDFs and then using them in SQL statements.
[jira] [Assigned] (SPARK-22978) Register Vectorized UDFs for SQL Statement
[ https://issues.apache.org/jira/browse/SPARK-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22978:
------------------------------------

    Assignee: Xiao Li  (was: Apache Spark)
[jira] [Assigned] (SPARK-22978) Register Vectorized UDFs for SQL Statement
[ https://issues.apache.org/jira/browse/SPARK-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22978:
------------------------------------

    Assignee: Apache Spark  (was: Xiao Li)
[jira] [Commented] (SPARK-22978) Register Vectorized UDFs for SQL Statement
[ https://issues.apache.org/jira/browse/SPARK-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314456#comment-16314456 ]

Apache Spark commented on SPARK-22978:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20171
[jira] [Created] (SPARK-22979) Avoid per-record type dispatch in Python data conversion (EvaluatePython.fromJava)
Hyukjin Kwon created SPARK-22979:
---------------------------------

             Summary: Avoid per-record type dispatch in Python data conversion (EvaluatePython.fromJava)
                 Key: SPARK-22979
                 URL: https://issues.apache.org/jira/browse/SPARK-22979
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
    Affects Versions: 2.3.0
            Reporter: Hyukjin Kwon


Seems we are type dispatching per record between Java objects (from Pyrolite) and
Spark's internal data format. See
https://github.com/apache/spark/blob/3f958a99921d149fb9fdf7ba7e78957afdad1405/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L89-L162

Looks like we can make a converter for each type and then reuse it.
[jira] [Updated] (SPARK-22978) Register Vectorized UDFs for SQL Statement
[ https://issues.apache.org/jira/browse/SPARK-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22978: Description: Capable of registering vectorized UDFs and then use it in SQL statement (was: R) > Register Vectorized UDFs for SQL Statement > -- > > Key: SPARK-22978 > URL: https://issues.apache.org/jira/browse/SPARK-22978 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Capable of registering vectorized UDFs and then use it in SQL statement -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22978) Register Vectorized UDFs for SQL Statement
[ https://issues.apache.org/jira/browse/SPARK-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22978: Summary: Register Vectorized UDFs for SQL Statement (was: Register Vectori) > Register Vectorized UDFs for SQL Statement > -- > > Key: SPARK-22978 > URL: https://issues.apache.org/jira/browse/SPARK-22978 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > R -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22978) Register Vectori
Xiao Li created SPARK-22978: --- Summary: Register Vectori Key: SPARK-22978 URL: https://issues.apache.org/jira/browse/SPARK-22978 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 2.3.0 Reporter: Xiao Li Assignee: Xiao Li R -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-7551) Don't split by dot if within backticks for DataFrame attribute resolution
[ https://issues.apache.org/jira/browse/SPARK-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haris updated SPARK-7551: - Comment: was deleted (was: I encountered a similar problem in spark 2,2 while using pyspark, I tried to split a column with period (.) and it did not behave well even after providing escape chars: {code} >>> spark.sql("select split('a.aaa','.')").show() +---+ |split(a.aaa, .)| +---+ | [, , , , , ]| +---+ >>> spark.sql("select split('a.aaa','\\.')").show() +---+ |split(a.aaa, .)| +---+ | [, , , , , ]| +---+ >>> spark.sql("select split('a.aaa','[.]')").show() +-+ |split(a.aaa, [.])| +-+ | [a, aaa]| +-+ {code} It uses period only when we provide it like {code} [.] {code} while it should also be working with escape seq {code}'\\.'{code}) > Don't split by dot if within backticks for DataFrame attribute resolution > - > > Key: SPARK-7551 > URL: https://issues.apache.org/jira/browse/SPARK-7551 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Wenchen Fan >Priority: Critical > Fix For: 1.4.0 > > > DataFrame's resolve: > {code} > protected[sql] def resolve(colName: String): NamedExpression = { > queryExecution.analyzed.resolve(colName.split("\\."), > sqlContext.analyzer.resolver).getOrElse { > throw new AnalysisException( > s"""Cannot resolve column name "$colName" among > (${schema.fieldNames.mkString(", ")})""") > } > } > {code} > We should not split the parts quoted by backticks (`). > For example, `ab.cd`.`efg` should be split into two parts "ab.cd" and "efg". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7551) Don't split by dot if within backticks for DataFrame attribute resolution
[ https://issues.apache.org/jira/browse/SPARK-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314443#comment-16314443 ] Haris commented on SPARK-7551: -- I encountered a similar problem in Spark 2.2 while using PySpark: I tried to split a column on a period (.) and it did not behave well even after providing escape characters: {code} >>> spark.sql("select split('a.aaa','.')").show() +---+ |split(a.aaa, .)| +---+ | [, , , , , ]| +---+ >>> spark.sql("select split('a.aaa','\\.')").show() +---+ |split(a.aaa, .)| +---+ | [, , , , , ]| +---+ >>> spark.sql("select split('a.aaa','[.]')").show() +-+ |split(a.aaa, [.])| +-+ | [a, aaa]| +-+ {code} The split honours the period only when it is written as {code}[.]{code}, while it should also work with the escaped form {code}'\\.'{code} > Don't split by dot if within backticks for DataFrame attribute resolution > - > > Key: SPARK-7551 > URL: https://issues.apache.org/jira/browse/SPARK-7551 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Wenchen Fan >Priority: Critical > Fix For: 1.4.0 > > > DataFrame's resolve: > {code} > protected[sql] def resolve(colName: String): NamedExpression = { > queryExecution.analyzed.resolve(colName.split("\\."), > sqlContext.analyzer.resolver).getOrElse { > throw new AnalysisException( > s"""Cannot resolve column name "$colName" among > (${schema.fieldNames.mkString(", ")})""") > } > } > {code} > We should not split the parts quoted by backticks (`). > For example, `ab.cd`.`efg` should be split into two parts "ab.cd" and "efg". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
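A plausible reading of the behaviour reported above (not confirmed in this thread, so treat it as an assumption): the pattern passes through two escaping layers, the Python string literal and the Spark SQL string literal, so a '\\.' typed in a regular Python string may reach the regex engine as a bare '.', which matches every character; a character class like [.] sidesteps both layers. A minimal sketch, assuming a running SparkSession named spark (e.g. the pyspark shell):
{code}
# Assumes a live SparkSession called `spark` (e.g. the pyspark shell).
# Each comment shows what the regex engine likely receives, under the
# assumption that both Python and Spark SQL unescape backslashes.
spark.sql("select split('a.aaa', '\\.')").show()    # Python -> \.  SQL -> .   (splits on everything, as observed)
spark.sql(r"select split('a.aaa', '\\.')").show()   # raw string keeps \\. -> regex \. (literal dot)
spark.sql("select split('a.aaa', '\\\\.')").show()  # same effect without a raw string
spark.sql("select split('a.aaa', '[.]')").show()    # character class, no escaping needed
{code}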
[jira] [Resolved] (SPARK-22930) Improve the description of Vectorized UDFs for non-deterministic cases
[ https://issues.apache.org/jira/browse/SPARK-22930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22930. - Resolution: Fixed Assignee: Li Jin Fix Version/s: 2.3.0 > Improve the description of Vectorized UDFs for non-deterministic cases > -- > > Key: SPARK-22930 > URL: https://issues.apache.org/jira/browse/SPARK-22930 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Li Jin > Fix For: 2.3.0 > > > After we merge this commit > https://github.com/apache/spark/commit/ff48b1b338241039a7189e7a3c04333b1256fdb3, > we also need to update the function description of Vectorized UDFs. Users > are able to create non-deterministic Vectorized UDFs. > Also, add the related test cases. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
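For context on what the improved description covers, a minimal sketch of a non-deterministic vectorized UDF in PySpark 2.3 (the UDF name and body are illustrative only, not taken from the commit referenced above):
{code}
import random
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import LongType

# A scalar vectorized UDF: receives a pandas.Series, returns a pandas.Series.
plus_rand = pandas_udf(lambda s: s + random.randint(0, 100), LongType())

# Mark it non-deterministic so the optimizer does not assume that repeated
# evaluation of the same input produces the same result.
plus_rand = plus_rand.asNondeterministic()
{code}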