[jira] [Assigned] (SPARK-22981) Incorrect results of casting Struct to String
[ https://issues.apache.org/jira/browse/SPARK-22981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22981:
------------------------------------

    Assignee: Apache Spark
[jira] [Commented] (SPARK-22981) Incorrect results of casting Struct to String
[ https://issues.apache.org/jira/browse/SPARK-22981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315071#comment-16315071 ]

Apache Spark commented on SPARK-22981:
--------------------------------------

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/20176
[jira] [Assigned] (SPARK-22981) Incorrect results of casting Struct to String
[ https://issues.apache.org/jira/browse/SPARK-22981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22981:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Created] (SPARK-22981) Incorrect results of casting Struct to String
Takeshi Yamamuro created SPARK-22981:
-------------------------------------

             Summary: Incorrect results of casting Struct to String
                 Key: SPARK-22981
                 URL: https://issues.apache.org/jira/browse/SPARK-22981
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.1
            Reporter: Takeshi Yamamuro


{code}
scala> val df = Seq(((1, "a"), 0), ((2, "b"), 0)).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: struct<_1: int, _2: string>, b: int]

scala> df.write.saveAsTable("t")

scala> sql("SELECT CAST(a AS STRING) FROM t").show
+---------------+
|              a|
+---------------+
|[0,1,180001,61]|
|[0,2,180001,62]|
+---------------+
{code}
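As a point of reference (not part of the report): to_json renders a struct column as readable JSON and can serve as a stopgap while the cast prints internal row bytes. A minimal PySpark sketch, assuming a running SparkSession named spark and the table t created above:

{code}
from pyspark.sql.functions import col, to_json

# Workaround sketch: render the struct as JSON text instead of relying on
# CAST(a AS STRING), which currently prints the internal row bytes.
spark.table("t") \
    .select(to_json(col("a")).alias("a_as_text")) \
    .show(truncate=False)
# Expected to print values such as {"_1":1,"_2":"a"} rather than [0,1,180001,61]
{code}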
[jira] [Resolved] (SPARK-22973) Incorrect results of casting Map to String
[ https://issues.apache.org/jira/browse/SPARK-22973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-22973.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0

Issue resolved by pull request 20166
[https://github.com/apache/spark/pull/20166]

> Incorrect results of casting Map to String
> -------------------------------------------
>
>                 Key: SPARK-22973
>                 URL: https://issues.apache.org/jira/browse/SPARK-22973
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>             Fix For: 2.3.0
>
> {code}
> scala> Seq(Map(1 -> "a", 2 -> "b")).toDF("a").write.saveAsTable("t")
> scala> sql("SELECT cast(a as String) FROM t").show(false)
> +-----------------------------------------------------------------+
> |a                                                                |
> +-----------------------------------------------------------------+
> |org.apache.spark.sql.catalyst.expressions.UnsafeMapData@38bdd75d|
> +-----------------------------------------------------------------+
> {code}
[jira] [Assigned] (SPARK-22973) Incorrect results of casting Map to String
[ https://issues.apache.org/jira/browse/SPARK-22973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-22973:
-----------------------------------

    Assignee: Takeshi Yamamuro
[jira] [Updated] (SPARK-22980) Wrong answer when using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-22980:
----------------------------
    Priority: Blocker  (was: Major)
[jira] [Commented] (SPARK-22980) Wrong answer when using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315042#comment-16315042 ]

Xiao Li commented on SPARK-22980:
---------------------------------

cc [~icexelloss] [~bryanc] [~ueshin]
[jira] [Created] (SPARK-22980) Wrong answer when using pandas_udf
Xiao Li created SPARK-22980:
----------------------------

             Summary: Wrong answer when using pandas_udf
                 Key: SPARK-22980
                 URL: https://issues.apache.org/jira/browse/SPARK-22980
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark
    Affects Versions: 2.3.0
            Reporter: Xiao Li


{noformat}
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import col, lit
from pyspark.sql.types import LongType

df = spark.range(3)
f = pandas_udf(lambda x, y: len(x) + y, LongType())
df.select(f(lit('text'), col('id'))).show()
{noformat}

{noformat}
from pyspark.sql.functions import udf
from pyspark.sql.functions import col, lit
from pyspark.sql.types import LongType

df = spark.range(3)
f = udf(lambda x, y: len(x) + y, LongType())
df.select(f(lit('text'), col('id'))).show()
{noformat}

The results of pandas_udf are different from udf.
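For context (not stated in the ticket): with the row-at-a-time udf, x is the Python string 'text', so len(x) is 4 and the expected column is 4 + id. A scalar pandas_udf receives its arguments as pandas Series, so len(x) is presumably the batch length instead, which would explain the mismatch. A sketch of the plain-udf baseline, assuming a running SparkSession named spark:

{code}
from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import LongType

df = spark.range(3)
f = udf(lambda x, y: len(x) + y, LongType())

# Row-at-a-time UDF: len('text') == 4, so the column should be 4, 5, 6
# for id = 0, 1, 2. The pandas_udf variant above returns something else.
df.select(f(lit('text'), col('id')).alias('out')).show()
{code}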
[jira] [Updated] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-20498:
--------------------------------------
    Target Version/s: 2.4.0  (was: 2.3.0)

> RandomForestRegressionModel should expose getMaxDepth in PySpark
> -----------------------------------------------------------------
>
>                 Key: SPARK-20498
>                 URL: https://issues.apache.org/jira/browse/SPARK-20498
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>    Affects Versions: 2.1.0
>            Reporter: Nick Lothian
>            Assignee: Xin Ren
>            Priority: Minor
>
> Currently it isn't clear how to get the max depth of a
> RandomForestRegressionModel (e.g., after doing a grid search).
>
> It is possible to call
> {{regressor._java_obj.getMaxDepth()}}
> but most other decision trees allow
> {{regressor.getMaxDepth()}}
[jira] [Commented] (SPARK-18569) Support R formula arithmetic
[ https://issues.apache.org/jira/browse/SPARK-18569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314994#comment-16314994 ]

Joseph K. Bradley commented on SPARK-18569:
-------------------------------------------

[~felixcheung] There are a few JIRAs for R still targeted for 2.3. Would you mind vetting them to see which should be retargeted for 2.4 vs. un-targeted? Thanks!

> Support R formula arithmetic
> ----------------------------
>
>                 Key: SPARK-18569
>                 URL: https://issues.apache.org/jira/browse/SPARK-18569
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, SparkR
>            Reporter: Felix Cheung
>
> I think we should support arithmetic, which makes it a lot more convenient to
> build models. Something like
> {code}
> log(y) ~ a + log(x)
> {code}
> And to avoid resolution confusion we should support the I() operator:
> {code}
> I
> I(X*Z)  as is: include a new variable consisting of these variables multiplied
> {code}
> Such that this works:
> {code}
> y ~ a + I(b+c)
> {code}
> where the term b+c is to be interpreted as the sum of b and c.
[jira] [Commented] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314993#comment-16314993 ]

Joseph K. Bradley commented on SPARK-20498:
-------------------------------------------

We'll need to retarget this for 2.4, but I'll mark myself as shepherd to try to make sure it gets in.
[jira] [Updated] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-20498:
--------------------------------------
    Shepherd: Joseph K. Bradley
[jira] [Commented] (SPARK-20602) Adding LBFGS optimizer and Squared_hinge loss for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-20602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314979#comment-16314979 ]

Joseph K. Bradley commented on SPARK-20602:
-------------------------------------------

I'm afraid we'll need to re-target this since the branch has been cut for 2.3. I'll change it to 2.4, but [~yanboliang] please retarget as needed.

> Adding LBFGS optimizer and Squared_hinge loss for LinearSVC
> ------------------------------------------------------------
>
>                 Key: SPARK-20602
>                 URL: https://issues.apache.org/jira/browse/SPARK-20602
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: yuhao yang
>            Assignee: yuhao yang
>
> Currently LinearSVC in Spark only supports OWLQN as the optimizer (check
> https://issues.apache.org/jira/browse/SPARK-14709). I made a comparison between
> LBFGS and OWLQN on several public datasets and found that LBFGS converges much
> faster for LinearSVC in most cases.
> The following table presents the number of training iterations and the f1 score
> of both optimizers at convergence:
> ||Dataset||LBFGS with hinge||OWLQN with hinge||LBFGS with squared_hinge||
> |news20.binary| 31 (0.99) | 413 (0.99) | 185 (0.99) |
> |mushroom| 28 (1.0) | 170 (1.0) | 24 (1.0) |
> |madelon| 143 (0.75) | 8129 (0.70) | 823 (0.74) |
> |breast-cancer-scale| 15 (1.0) | 16 (1.0) | 15 (1.0) |
> |phishing | 329 (0.94) | 231 (0.94) | 67 (0.94) |
> |a1a(adult) | 466 (0.87) | 282 (0.87) | 77 (0.86) |
> |a7a | 237 (0.84) | 372 (0.84) | 69 (0.84) |
> data source:
> https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
> training code: new LinearSVC().setMaxIter(1).setTol(1e-6)
> LBFGS requires fewer iterations in most cases (except for a1a) and is probably
> a better default optimizer.
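For readers following along in Python, the benchmark's training call looks roughly like the sketch below (not from the ticket; the dataset path and maxIter value are illustrative, and as of 2.2 this still trains with the existing OWLQN-based hinge objective):

{code}
from pyspark.ml.classification import LinearSVC

# Load one of the LIBSVM datasets referenced above (path is a placeholder).
data = spark.read.format("libsvm").load("data/news20.binary")

# Roughly mirrors the Scala call in the description.
svc = LinearSVC(maxIter=100, tol=1e-6)
model = svc.fit(data)
print(model.coefficients.size)
{code}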
[jira] [Updated] (SPARK-20602) Adding LBFGS optimizer and Squared_hinge loss for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-20602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-20602:
--------------------------------------
    Target Version/s: 2.4.0  (was: 2.3.0)
[jira] [Commented] (SPARK-22796) Add multiple column support to PySpark QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-22796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314978#comment-16314978 ]

Joseph K. Bradley commented on SPARK-22796:
-------------------------------------------

We'll need to re-target this for 2.4 now that the branch has been cut.

> Add multiple column support to PySpark QuantileDiscretizer
> -----------------------------------------------------------
>
>                 Key: SPARK-22796
>                 URL: https://issues.apache.org/jira/browse/SPARK-22796
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, PySpark
>    Affects Versions: 2.3.0
>            Reporter: Nick Pentreath
[jira] [Commented] (SPARK-18348) Improve tree ensemble model summary
[ https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314976#comment-16314976 ]

Joseph K. Bradley commented on SPARK-18348:
-------------------------------------------

I'll remove the target, but please re-add as needed

> Improve tree ensemble model summary
> ------------------------------------
>
>                 Key: SPARK-18348
>                 URL: https://issues.apache.org/jira/browse/SPARK-18348
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SparkR
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Felix Cheung
>
> During work on R APIs for tree ensemble models (e.g., Random Forest, GBT) it was
> discovered and discussed that
> - we don't have a good summary on nodes or trees for their observations,
>   loss, probability and so on
> - we don't have a shared API with nicely formatted output
> We believe this could be a shared API that benefits multiple language
> bindings, including R, when available.
> For example, here is what R {code}rpart{code} shows for model summary:
> {code}
> Call:
> rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
>     method = "class")
> n= 81
>           CP nsplit rel error    xerror      xstd
> 1 0.17647059      0     1.000     1.000 0.2155872
> 2 0.01960784      1 0.8235294 0.9411765 0.2107780
> 3 0.0100           4 0.7647059 1.0588235 0.2200975
> Variable importance
>  Start    Age Number
>     64     24     12
> Node number 1: 81 observations,    complexity param=0.1764706
>   predicted class=absent   expected loss=0.2098765  P(node) =1
>     class counts:    64    17
>    probabilities: 0.790 0.210
>   left son=2 (62 obs) right son=3 (19 obs)
>   Primary splits:
>       Start  < 8.5  to the right, improve=6.762330, (0 missing)
>       Number < 5.5  to the left,  improve=2.866795, (0 missing)
>       Age    < 39.5 to the left,  improve=2.250212, (0 missing)
>   Surrogate splits:
>       Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)
> Node number 2: 62 observations,    complexity param=0.01960784
>   predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
>     class counts:    56     6
>    probabilities: 0.903 0.097
>   left son=4 (29 obs) right son=5 (33 obs)
>   Primary splits:
>       Start  < 14.5 to the right, improve=1.0205280, (0 missing)
>       Age    < 55   to the left,  improve=0.6848635, (0 missing)
>       Number < 4.5  to the left,  improve=0.2975332, (0 missing)
>   Surrogate splits:
>       Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
>       Age    < 16   to the left,  agree=0.597, adj=0.138, (0 split)
> Node number 3: 19 observations
>   predicted class=present  expected loss=0.4210526  P(node) =0.2345679
>     class counts:     8    11
>    probabilities: 0.421 0.579
> Node number 4: 29 observations
>   predicted class=absent   expected loss=0  P(node) =0.3580247
>     class counts:    29     0
>    probabilities: 1.000 0.000
> Node number 5: 33 observations,    complexity param=0.01960784
>   predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
>     class counts:    27     6
>    probabilities: 0.818 0.182
>   left son=10 (12 obs) right son=11 (21 obs)
>   Primary splits:
>       Age    < 55   to the left,  improve=1.2467530, (0 missing)
>       Start  < 12.5 to the right, improve=0.2887701, (0 missing)
>       Number < 3.5  to the right, improve=0.1753247, (0 missing)
>   Surrogate splits:
>       Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
>       Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)
> Node number 10: 12 observations
>   predicted class=absent   expected loss=0  P(node) =0.1481481
>     class counts:    12     0
>    probabilities: 1.000 0.000
> Node number 11: 21 observations,    complexity param=0.01960784
>   predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
>     class counts:    15     6
>    probabilities: 0.714 0.286
>   left son=22 (14 obs) right son=23 (7 obs)
>   Primary splits:
>       Age    < 111  to the right, improve=1.71428600, (0 missing)
>       Start  < 12.5 to the right, improve=0.79365080, (0 missing)
>       Number < 3.5  to the right, improve=0.07142857, (0 missing)
> Node number 22: 14 observations
>   predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
>     class counts:    12     2
>    probabilities: 0.857 0.143
> Node number 23: 7 observations
>   predicted class=present  expected loss=0.4285714  P(node) =0.08641975
>     class counts:     3     4
>    probabilities: 0.429 0.571
> {code}
[jira] [Updated] (SPARK-18348) Improve tree ensemble model summary
[ https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-18348:
--------------------------------------
    Target Version/s:   (was: 2.3.0)
[jira] [Updated] (SPARK-15572) ML persistence in R format: compatibility with other languages
[ https://issues.apache.org/jira/browse/SPARK-15572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-15572:
--------------------------------------
    Target Version/s:   (was: 2.3.0)

> ML persistence in R format: compatibility with other languages
> ---------------------------------------------------------------
>
>                 Key: SPARK-15572
>                 URL: https://issues.apache.org/jira/browse/SPARK-15572
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SparkR
>            Reporter: Joseph K. Bradley
>
> Currently, models saved in R cannot be loaded easily into other languages.
> This is because R saves extra metadata (feature names) alongside the model.
> We should fix this issue so that models can be transferred seamlessly between
> languages.
[jira] [Comment Edited] (SPARK-15572) ML persistence in R format: compatibility with other languages
[ https://issues.apache.org/jira/browse/SPARK-15572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15987592#comment-15987592 ]

Joseph K. Bradley edited comment on SPARK-15572 at 1/6/18 10:40 PM:
--------------------------------------------------------------------

Retargeting since 2.2 has been cut -> I'll remove the target, but please add as needed.

CC [~falaki]


was (Author: josephkb):
Retargeting since 2.2 has been cut -> and to 2.4 now
[jira] [Comment Edited] (SPARK-15572) ML persistence in R format: compatibility with other languages
[ https://issues.apache.org/jira/browse/SPARK-15572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15987592#comment-15987592 ]

Joseph K. Bradley edited comment on SPARK-15572 at 1/6/18 10:39 PM:
--------------------------------------------------------------------

Retargeting since 2.2 has been cut -> and to 2.4 now


was (Author: josephkb):
Retargeting since 2.2 has been cut
[jira] [Comment Edited] (SPARK-15784) Add Power Iteration Clustering to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15987588#comment-15987588 ]

Joseph K. Bradley edited comment on SPARK-15784 at 1/6/18 10:39 PM:
--------------------------------------------------------------------

Retargeting since 2.2 has been cut -> and now to 2.4


was (Author: josephkb):
Retargeting since 2.2 has been cut

> Add Power Iteration Clustering to spark.ml
> -------------------------------------------
>
>                 Key: SPARK-15784
>                 URL: https://issues.apache.org/jira/browse/SPARK-15784
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Xinh Huynh
>            Assignee: Miao Wang
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.
[jira] [Updated] (SPARK-15784) Add Power Iteration Clustering to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-15784:
--------------------------------------
    Target Version/s: 2.4.0  (was: 2.3.0)
[jira] [Updated] (SPARK-18618) SparkR GLM model predict should support type as an argument
[ https://issues.apache.org/jira/browse/SPARK-18618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-18618:
--------------------------------------
    Target Version/s:   (was: 2.3.0)

> SparkR GLM model predict should support type as an argument
> ------------------------------------------------------------
>
>                 Key: SPARK-18618
>                 URL: https://issues.apache.org/jira/browse/SPARK-18618
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SparkR
>            Reporter: Yanbo Liang
>
> SparkR GLM model {{predict}} should support {{type}} as an argument. This will
> make it consistent with native R predict, such as
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html .
[jira] [Commented] (SPARK-18618) SparkR GLM model predict should support type as an argument
[ https://issues.apache.org/jira/browse/SPARK-18618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314966#comment-16314966 ]

Joseph K. Bradley commented on SPARK-18618:
-------------------------------------------

I'll remove the target, but please add a new one as needed. I'll also CC [~falaki] who may be interested.
[jira] [Assigned] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value
[ https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22951:
------------------------------------

    Assignee:     (was: Apache Spark)

> count() after dropDuplicates() on emptyDataFrame returns incorrect value
> -------------------------------------------------------------------------
>
>                 Key: SPARK-22951
>                 URL: https://issues.apache.org/jira/browse/SPARK-22951
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.2, 2.2.0, 2.3.0
>            Reporter: Michael Dreibelbis
>
> here is a minimal Spark Application to reproduce:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
>
> object DropDupesApp extends App {
>
>   override def main(args: Array[String]): Unit = {
>     val conf = new SparkConf()
>       .setAppName("test")
>       .setMaster("local")
>     val sc = new SparkContext(conf)
>     val sql = SQLContext.getOrCreate(sc)
>     assert(sql.emptyDataFrame.count == 0) // expected
>     assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected
>   }
>
> }
> {code}
[jira] [Assigned] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value
[ https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22951:
------------------------------------

    Assignee: Apache Spark
[jira] [Commented] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value
[ https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314508#comment-16314508 ]

Apache Spark commented on SPARK-22951:
--------------------------------------

User 'liufengdb' has created a pull request for this issue:
https://github.com/apache/spark/pull/20174
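Not part of the ticket, but the same check can be written through the Python API; since the problem sits in the optimizer, it should presumably reproduce there as well. A sketch, assuming a running SparkSession named spark:

{code}
from pyspark.sql.types import StructType

# An empty DataFrame (no rows, no columns), analogous to
# sqlContext.emptyDataFrame in the Scala reproduction above.
empty = spark.createDataFrame([], StructType([]))

print(empty.count())                   # expected: 0
print(empty.dropDuplicates().count())  # expected: 0, but the bug yields 1
{code}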
[jira] [Updated] (SPARK-21786) The 'spark.sql.parquet.compression.codec' configuration doesn't take effect on tables with partition field(s)
[ https://issues.apache.org/jira/browse/SPARK-21786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-21786:
----------------------------
    Description:
Since Hive 1.1, Hive allows users to set the parquet compression codec via the
table-level property parquet.compression; see the JIRA:
https://issues.apache.org/jira/browse/HIVE-7858 . We already support
orc.compression for ORC, so for external users it is more straightforward to
support both. See the Stack Overflow question:
https://stackoverflow.com/questions/36941122/spark-sql-ignores-parquet-compression-propertie-specified-in-tblproperties

On the Spark side, our table-level compression conf compression was added by
#11464 in Spark 2.0. We need to support both table-level confs. Users might
also use the session-level conf spark.sql.parquet.compression.codec. The
priority rule is: if other compression codec configuration is found through
Hive or Parquet, the precedence is compression, parquet.compression,
spark.sql.parquet.compression.codec. Acceptable values include: none,
uncompressed, snappy, gzip, lzo. After the change, the rule for Parquet is
consistent with ORC.

Changes:
1. Acquire 'compressionCodecClassName' from parquet.compression as well, with
the precedence order compression, parquet.compression,
spark.sql.parquet.compression.codec, just like what we do in OrcOptions.
2. Change spark.sql.parquet.compression.codec to support "none". In
ParquetOptions we already treat "none" as equivalent to "uncompressed", but it
is not currently allowed as a configured value.

  was:
For tables created like below, 'spark.sql.parquet.compression.codec' doesn't
take any effect when inserting data. And because the default compression codec
is 'uncompressed', if I want to change the compression codec, I have to change
it by 'set parquet.compression='.

In contrast, tables without any partition field work normally with
'spark.sql.parquet.compression.codec', and the default compression codec is
'snappy', but then 'parquet.compression' seems to no longer take effect.
Should we use the 'spark.sql.parquet.compression.codec' configuration
uniformly?

CREATE TABLE Test_Parquet(provincecode int, citycode int, districtcode int)
PARTITIONED BY (p_provincecode int)
STORED AS PARQUET;
INSERT OVERWRITE TABLE Test_Parquet select * from TableB;
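To make the precedence concrete, here is a small sketch (not from the ticket) of the two settings the description mentions, expressed through PySpark; the table and column names follow the example in the old description:

{code}
# Session-level default, the lowest of the precedence levels listed above.
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

# Hive-style table-level property (parquet.compression); per the description,
# this should take precedence over the session conf once the change is in.
spark.sql("""
  CREATE TABLE Test_Parquet (provincecode INT, citycode INT, districtcode INT)
  PARTITIONED BY (p_provincecode INT)
  STORED AS PARQUET
  TBLPROPERTIES ('parquet.compression' = 'SNAPPY')
""")
{code}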
[jira] [Assigned] (SPARK-21786) The 'spark.sql.parquet.compression.codec' configuration doesn't take effect on tables with partition field(s)
[ https://issues.apache.org/jira/browse/SPARK-21786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li reassigned SPARK-21786:
-------------------------------
    Assignee: Jinhua Fu

> The 'spark.sql.parquet.compression.codec' configuration doesn't take effect
> on tables with partition field(s)
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-21786
>                 URL: https://issues.apache.org/jira/browse/SPARK-21786
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Jinhua Fu
>            Assignee: Jinhua Fu
>             Fix For: 2.3.0
[jira] [Resolved] (SPARK-21786) The 'spark.sql.parquet.compression.codec' configuration doesn't take effect on tables with partition field(s)
[ https://issues.apache.org/jira/browse/SPARK-21786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-21786.
-----------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0
[jira] [Resolved] (SPARK-22793) Memory leak in Spark Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-22793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-22793.
-----------------------------
       Resolution: Fixed
         Assignee: zuotingbing
    Fix Version/s: 2.3.0

> Memory leak in Spark Thrift Server
> -----------------------------------
>
>                 Key: SPARK-22793
>                 URL: https://issues.apache.org/jira/browse/SPARK-22793
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2, 2.2.1
>            Reporter: zuotingbing
>            Assignee: zuotingbing
>            Priority: Critical
>             Fix For: 2.3.0
>
> 1. Start HiveThriftServer2.
> 2. Connect to the thriftserver through beeline.
> 3. Close the beeline.
> 4. Repeat steps 2 and 3 several times, which causes the memory leak.
>
> We found many directories that are never dropped under the paths
> {code:java}
> hive.exec.local.scratchdir
> {code}
> and
> {code:java}
> hive.exec.scratchdir
> {code}
> As we know, the scratchdir is added to deleteOnExit when it is created, so the
> cache size of FileSystem deleteOnExit keeps increasing until the JVM
> terminates.
> In addition, we use
> {code:java}
> jmap -histo:live [PID]
> {code}
> to print out the sizes of objects in the HiveThriftServer2 process, and we can
> see that the counts of "org.apache.spark.sql.hive.client.HiveClientImpl" and
> "org.apache.hadoop.hive.ql.session.SessionState" keep increasing even though
> we closed all the beeline connections, which causes the memory leak.
[jira] [Commented] (SPARK-22901) Add non-deterministic to Python UDF
[ https://issues.apache.org/jira/browse/SPARK-22901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314476#comment-16314476 ]

Apache Spark commented on SPARK-22901:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/20173

> Add non-deterministic to Python UDF
> ------------------------------------
>
>                 Key: SPARK-22901
>                 URL: https://issues.apache.org/jira/browse/SPARK-22901
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.1
>            Reporter: Xiao Li
>            Assignee: Marco Gaido
>             Fix For: 2.3.0
>
> Add a new API for Python UDF to allow users to change the determinism from
> deterministic to non-deterministic.
[jira] [Commented] (SPARK-22979) Avoid per-record type dispatch in Python data conversion (EvaluatePython.fromJava)
[ https://issues.apache.org/jira/browse/SPARK-22979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314459#comment-16314459 ]

Apache Spark commented on SPARK-22979:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/20172
[jira] [Assigned] (SPARK-22979) Avoid per-record type dispatch in Python data conversion (EvaluatePython.fromJava)
[ https://issues.apache.org/jira/browse/SPARK-22979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22979:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-22979) Avoid per-record type dispatch in Python data conversion (EvaluatePython.fromJava)
[ https://issues.apache.org/jira/browse/SPARK-22979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22979:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Updated] (SPARK-22978) Register Vectorized UDFs for SQL Statement
[ https://issues.apache.org/jira/browse/SPARK-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-22978:
----------------------------
    Description:
Capable of registering vectorized UDFs and then using them in SQL statements.

For example,

{noformat}
>>> import random
>>> from pyspark.sql.types import IntegerType
>>> from pyspark.sql.functions import pandas_udf
>>> random_pandas_udf = pandas_udf(
...     lambda x: random.randint(0, 100) + x, IntegerType())
...     .asNondeterministic()  # doctest: +SKIP
>>> _ = spark.catalog.registerFunction(
...     "random_pandas_udf", random_pandas_udf, IntegerType())  # doctest: +SKIP
>>> spark.sql("SELECT random_pandas_udf(2)").collect()  # doctest: +SKIP
[Row(random_pandas_udf(2)=84)]
{noformat}

  was:
Capable of registering vectorized UDFs and then using them in SQL statements.
[jira] [Assigned] (SPARK-22978) Register Vectorized UDFs for SQL Statement
[ https://issues.apache.org/jira/browse/SPARK-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22978:
------------------------------------

    Assignee: Xiao Li  (was: Apache Spark)
[jira] [Assigned] (SPARK-22978) Register Vectorized UDFs for SQL Statement
[ https://issues.apache.org/jira/browse/SPARK-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22978:
------------------------------------

    Assignee: Apache Spark  (was: Xiao Li)
[jira] [Commented] (SPARK-22978) Register Vectorized UDFs for SQL Statement
[ https://issues.apache.org/jira/browse/SPARK-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314456#comment-16314456 ]

Apache Spark commented on SPARK-22978:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20171
[jira] [Created] (SPARK-22979) Avoid per-record type dispatch in Python data conversion (EvaluatePython.fromJava)
Hyukjin Kwon created SPARK-22979:
---------------------------------

             Summary: Avoid per-record type dispatch in Python data conversion (EvaluatePython.fromJava)
                 Key: SPARK-22979
                 URL: https://issues.apache.org/jira/browse/SPARK-22979
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
    Affects Versions: 2.3.0
            Reporter: Hyukjin Kwon


Seems we are type dispatching per record between Java objects (from Pyrolite) and
Spark's internal data format. See
https://github.com/apache/spark/blob/3f958a99921d149fb9fdf7ba7e78957afdad1405/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L89-L162

Looks like we can make a converter for each type and then reuse it.
[jira] [Updated] (SPARK-22978) Register Vectorized UDFs for SQL Statement
[ https://issues.apache.org/jira/browse/SPARK-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22978: Description: Capable of registering vectorized UDFs and then use it in SQL statement (was: R) > Register Vectorized UDFs for SQL Statement > -- > > Key: SPARK-22978 > URL: https://issues.apache.org/jira/browse/SPARK-22978 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Capable of registering vectorized UDFs and then use it in SQL statement -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22978) Register Vectorized UDFs for SQL Statement
[ https://issues.apache.org/jira/browse/SPARK-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22978: Summary: Register Vectorized UDFs for SQL Statement (was: Register Vectori) > Register Vectorized UDFs for SQL Statement > -- > > Key: SPARK-22978 > URL: https://issues.apache.org/jira/browse/SPARK-22978 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > R -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22978) Register Vectori
Xiao Li created SPARK-22978: --- Summary: Register Vectori Key: SPARK-22978 URL: https://issues.apache.org/jira/browse/SPARK-22978 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 2.3.0 Reporter: Xiao Li Assignee: Xiao Li R -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-7551) Don't split by dot if within backticks for DataFrame attribute resolution
[ https://issues.apache.org/jira/browse/SPARK-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haris updated SPARK-7551: - Comment: was deleted (was: I encountered a similar problem in spark 2,2 while using pyspark, I tried to split a column with period (.) and it did not behave well even after providing escape chars: {code} >>> spark.sql("select split('a.aaa','.')").show() +---+ |split(a.aaa, .)| +---+ | [, , , , , ]| +---+ >>> spark.sql("select split('a.aaa','\\.')").show() +---+ |split(a.aaa, .)| +---+ | [, , , , , ]| +---+ >>> spark.sql("select split('a.aaa','[.]')").show() +-+ |split(a.aaa, [.])| +-+ | [a, aaa]| +-+ {code} It uses period only when we provide it like {code} [.] {code} while it should also be working with escape seq {code}'\\.'{code}) > Don't split by dot if within backticks for DataFrame attribute resolution > - > > Key: SPARK-7551 > URL: https://issues.apache.org/jira/browse/SPARK-7551 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Wenchen Fan >Priority: Critical > Fix For: 1.4.0 > > > DataFrame's resolve: > {code} > protected[sql] def resolve(colName: String): NamedExpression = { > queryExecution.analyzed.resolve(colName.split("\\."), > sqlContext.analyzer.resolver).getOrElse { > throw new AnalysisException( > s"""Cannot resolve column name "$colName" among > (${schema.fieldNames.mkString(", ")})""") > } > } > {code} > We should not split the parts quoted by backticks (`). > For example, `ab.cd`.`efg` should be split into two parts "ab.cd" and "efg". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7551) Don't split by dot if within backticks for DataFrame attribute resolution
[ https://issues.apache.org/jira/browse/SPARK-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314443#comment-16314443 ] Haris commented on SPARK-7551: -- I encountered a similar problem in Spark 2.2 while using PySpark: I tried to split a column on a period (.) and it did not behave well even after providing escape characters: {code} >>> spark.sql("select split('a.aaa','.')").show() +---+ |split(a.aaa, .)| +---+ | [, , , , , ]| +---+ >>> spark.sql("select split('a.aaa','\\.')").show() +---+ |split(a.aaa, .)| +---+ | [, , , , , ]| +---+ >>> spark.sql("select split('a.aaa','[.]')").show() +-+ |split(a.aaa, [.])| +-+ | [a, aaa]| +-+ {code} The split honours the period only when it is written as {code}[.]{code}, while it should also work with the escaped form {code}'\\.'{code} > Don't split by dot if within backticks for DataFrame attribute resolution > - > > Key: SPARK-7551 > URL: https://issues.apache.org/jira/browse/SPARK-7551 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Wenchen Fan >Priority: Critical > Fix For: 1.4.0 > > > DataFrame's resolve: > {code} > protected[sql] def resolve(colName: String): NamedExpression = { > queryExecution.analyzed.resolve(colName.split("\\."), > sqlContext.analyzer.resolver).getOrElse { > throw new AnalysisException( > s"""Cannot resolve column name "$colName" among > (${schema.fieldNames.mkString(", ")})""") > } > } > {code} > We should not split the parts quoted by backticks (`). > For example, `ab.cd`.`efg` should be split into two parts "ab.cd" and "efg". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
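A plausible reading of the behaviour reported above (not confirmed in this thread, so treat it as an assumption): the pattern passes through two escaping layers, the Python string literal and the Spark SQL string literal, so a '\\.' typed in a regular Python string may reach the regex engine as a bare '.', which matches every character; a character class like [.] sidesteps both layers. A minimal sketch, assuming a running SparkSession named spark (e.g. the pyspark shell):
{code}
# Assumes a live SparkSession called `spark` (e.g. the pyspark shell).
# Each comment shows what the regex engine likely receives, under the
# assumption that both Python and Spark SQL unescape backslashes.
spark.sql("select split('a.aaa', '\\.')").show()    # Python -> \.  SQL -> .   (splits on everything, as observed)
spark.sql(r"select split('a.aaa', '\\.')").show()   # raw string keeps \\. -> regex \. (literal dot)
spark.sql("select split('a.aaa', '\\\\.')").show()  # same effect without a raw string
spark.sql("select split('a.aaa', '[.]')").show()    # character class, no escaping needed
{code}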
[jira] [Resolved] (SPARK-22930) Improve the description of Vectorized UDFs for non-deterministic cases
[ https://issues.apache.org/jira/browse/SPARK-22930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22930. - Resolution: Fixed Assignee: Li Jin Fix Version/s: 2.3.0 > Improve the description of Vectorized UDFs for non-deterministic cases > -- > > Key: SPARK-22930 > URL: https://issues.apache.org/jira/browse/SPARK-22930 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Li Jin > Fix For: 2.3.0 > > > After we merge this commit > https://github.com/apache/spark/commit/ff48b1b338241039a7189e7a3c04333b1256fdb3, > we also need to update the function description of Vectorized UDFs. Users > are able to create non-deterministic Vectorized UDFs. > Also, add the related test cases. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
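For context on what the improved description covers, a minimal sketch of a non-deterministic vectorized UDF in PySpark 2.3 (the UDF name and body are illustrative only, not taken from the commit referenced above):
{code}
import random
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import LongType

# A scalar vectorized UDF: receives a pandas.Series, returns a pandas.Series.
plus_rand = pandas_udf(lambda s: s + random.randint(0, 100), LongType())

# Mark it non-deterministic so the optimizer does not assume that repeated
# evaluation of the same input produces the same result.
plus_rand = plus_rand.asNondeterministic()
{code}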