[jira] [Commented] (SPARK-28990) SparkSQL invalid call to toAttribute on unresolved object, tree: *

2019-12-23 Thread Wenchao Wu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002698#comment-17002698
 ] 

Wenchao Wu commented on SPARK-28990:


[~lucusguo] [~xiaozhang] me too

> SparkSQL invalid call to toAttribute on unresolved object, tree: *
> --
>
> Key: SPARK-28990
> URL: https://issues.apache.org/jira/browse/SPARK-28990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: fengchaoge
>Priority: Major
>
> SparkSQL CREATE TABLE AS SELECT from a table that may not exist throws
> exceptions like:
> {code}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> toAttribute on unresolved object, tree:
> {code}
> This is not user friendly; a Spark user may have no idea what is wrong.
> A simple SQL statement can reproduce it, like this:
> {code}
> spark-sql (default)> create table default.spark as select * from default.dual;
> {code}
> {code}
> 2019-09-05 16:27:24,127 INFO (main) [Logging.scala:logInfo(54)] - Parsing 
> command: create table default.spark as select * from default.dual
> 2019-09-05 16:27:24,772 ERROR (main) [Logging.scala:logError(91)] - Failed in 
> [create table default.spark as select * from default.dual]
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> toAttribute on unresolved object, tree: *
> at 
> org.apache.spark.sql.catalyst.analysis.Star.toAttribute(unresolved.scala:245)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:296)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project.output(basicLogicalOperators.scala:52)
> at 
> org.apache.spark.sql.hive.HiveAnalysis$$anonfun$apply$3.applyOrElse(HiveStrategies.scala:160)
> at 
> org.apache.spark.sql.hive.HiveAnalysis$$anonfun$apply$3.applyOrElse(HiveStrategies.scala:148)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:107)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:106)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsDown(AnalysisHelper.scala:106)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperators(AnalysisHelper.scala:73)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
> at org.apache.spark.sql.hive.HiveAnalysis$.apply(HiveStrategies.scala:148)
> at org.apache.spark.sql.hive.HiveAnalysis$.apply(HiveStrategies.scala:147)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
> at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
> at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
> at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:127)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:121)

[jira] [Commented] (SPARK-28990) SparkSQL invalid call to toAttribute on unresolved object, tree: *

2019-12-23 Thread Xiao Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002696#comment-17002696
 ] 

Xiao Zhang commented on SPARK-28990:


[~fengchaoge] me too

> SparkSQL invalid call to toAttribute on unresolved object, tree: *
> --
>
> Key: SPARK-28990
> URL: https://issues.apache.org/jira/browse/SPARK-28990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: fengchaoge
>Priority: Major
>
> SparkSQL CREATE TABLE AS SELECT from a table that may not exist throws
> exceptions like:
> {code}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> toAttribute on unresolved object, tree:
> {code}
> This is not user friendly; a Spark user may have no idea what is wrong.
> A simple SQL statement can reproduce it, like this:
> {code}
> spark-sql (default)> create table default.spark as select * from default.dual;
> {code}
> {code}
> 2019-09-05 16:27:24,127 INFO (main) [Logging.scala:logInfo(54)] - Parsing 
> command: create table default.spark as select * from default.dual
> 2019-09-05 16:27:24,772 ERROR (main) [Logging.scala:logError(91)] - Failed in 
> [create table default.spark as select * from default.dual]
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> toAttribute on unresolved object, tree: *
> at 
> org.apache.spark.sql.catalyst.analysis.Star.toAttribute(unresolved.scala:245)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:296)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project.output(basicLogicalOperators.scala:52)
> at 
> org.apache.spark.sql.hive.HiveAnalysis$$anonfun$apply$3.applyOrElse(HiveStrategies.scala:160)
> at 
> org.apache.spark.sql.hive.HiveAnalysis$$anonfun$apply$3.applyOrElse(HiveStrategies.scala:148)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:107)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:106)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsDown(AnalysisHelper.scala:106)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperators(AnalysisHelper.scala:73)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
> at org.apache.spark.sql.hive.HiveAnalysis$.apply(HiveStrategies.scala:148)
> at org.apache.spark.sql.hive.HiveAnalysis$.apply(HiveStrategies.scala:147)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
> at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
> at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
> at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:127)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:121)
> 

[jira] [Commented] (SPARK-28990) SparkSQL invalid call to toAttribute on unresolved object, tree: *

2019-12-23 Thread lucusguo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002695#comment-17002695
 ] 

lucusguo commented on SPARK-28990:
--

But I cannot reproduce it in Spark 2.4.3.

> SparkSQL invalid call to toAttribute on unresolved object, tree: *
> --
>
> Key: SPARK-28990
> URL: https://issues.apache.org/jira/browse/SPARK-28990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: fengchaoge
>Priority: Major
>
> SparkSQL CREATE TABLE AS SELECT from a table that may not exist throws
> exceptions like:
> {code}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> toAttribute on unresolved object, tree:
> {code}
> This is not user friendly; a Spark user may have no idea what is wrong.
> A simple SQL statement can reproduce it, like this:
> {code}
> spark-sql (default)> create table default.spark as select * from default.dual;
> {code}
> {code}
> 2019-09-05 16:27:24,127 INFO (main) [Logging.scala:logInfo(54)] - Parsing 
> command: create table default.spark as select * from default.dual
> 2019-09-05 16:27:24,772 ERROR (main) [Logging.scala:logError(91)] - Failed in 
> [create table default.spark as select * from default.dual]
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> toAttribute on unresolved object, tree: *
> at 
> org.apache.spark.sql.catalyst.analysis.Star.toAttribute(unresolved.scala:245)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:296)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project.output(basicLogicalOperators.scala:52)
> at 
> org.apache.spark.sql.hive.HiveAnalysis$$anonfun$apply$3.applyOrElse(HiveStrategies.scala:160)
> at 
> org.apache.spark.sql.hive.HiveAnalysis$$anonfun$apply$3.applyOrElse(HiveStrategies.scala:148)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:107)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:106)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsDown(AnalysisHelper.scala:106)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperators(AnalysisHelper.scala:73)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
> at org.apache.spark.sql.hive.HiveAnalysis$.apply(HiveStrategies.scala:148)
> at org.apache.spark.sql.hive.HiveAnalysis$.apply(HiveStrategies.scala:147)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
> at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
> at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
> at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:127)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:121)

[jira] [Created] (SPARK-30342) Update LIST JAR/FILE command

2019-12-23 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-30342:
--

 Summary: Update LIST JAR/FILE command
 Key: SPARK-30342
 URL: https://issues.apache.org/jira/browse/SPARK-30342
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Rakesh Raushan


The LIST FILE/JAR command is not documented properly.
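For reference, a minimal usage sketch of these commands run through PySpark (the jar and file paths are placeholders, not from this ticket):
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()

# Register resources with the session, then list what has been added.
spark.sql("ADD JAR /tmp/example.jar")         # placeholder path
spark.sql("LIST JARS").show(truncate=False)

spark.sql("ADD FILE /tmp/example.txt")        # placeholder path
spark.sql("LIST FILES").show(truncate=False)
{code}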



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30333) Bump jackson-databind to 2.6.7.3

2019-12-23 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-30333.
--
Fix Version/s: 2.4.5
 Assignee: Sandeep Katta
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/26986]

> Bump  jackson-databind to 2.6.7.3 
> --
>
> Key: SPARK-30333
> URL: https://issues.apache.org/jira/browse/SPARK-30333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Sandeep Katta
>Assignee: Sandeep Katta
>Priority: Major
> Fix For: 2.4.5
>
>
> To fix below CVE
>  
> CVE-2018-14718
> CVE-2018-14719
> CVE-2018-14720
> CVE-2018-14721
> CVE-2018-19360,
> CVE-2018-19361
> CVE-2018-19362



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30341) check overflow for interval arithmetic operations

2019-12-23 Thread Kent Yao (Jira)
Kent Yao created SPARK-30341:


 Summary: check overflow for interval arithmetic operations
 Key: SPARK-30341
 URL: https://issues.apache.org/jira/browse/SPARK-30341
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


The interval arithmetic functions, e.g. add/subtract/negative/multiply/divide,
should enable overflow checks when ANSI mode is on; when ANSI mode is off,
add/subtract/negative should return NULL on overflow, as multiply/divide already do.
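A hedged reproduction sketch of the kind of case this covers; the flag name spark.sql.ansi.enabled and the expression below are assumptions for illustration, not test cases from this ticket, and the comments describe the proposed behaviour rather than what current builds do:
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()

# ANSI mode off: the proposal is that overflowing interval arithmetic yields NULL.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT interval 2147483647 months * 2 AS result").show()

# ANSI mode on: the proposal is that the same expression raises an overflow error.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT interval 2147483647 months * 2 AS result").show()
{code}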



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30340) Python tests failed on arm64/x86

2019-12-23 Thread huangtianhua (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangtianhua updated SPARK-30340:
-
Summary: Python tests failed on arm64/x86  (was: Python tests failed on 
arm64 )

> Python tests failed on arm64/x86
> 
>
> Key: SPARK-30340
> URL: https://issues.apache.org/jira/browse/SPARK-30340
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Major
>
> Jenkins job spark-master-test-python-arm failed after the commit 
> c6ab7165dd11a0a7b8aea4c805409088e9a41a74:
> File 
> "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py",
>  line 2790, in __main__.FMClassifier
>  Failed example:
>  model.transform(test0).select("features", "probability").show(10, False)
>  Expected:
> +--------+------------------------------------------+
> |features|probability                               |
> +--------+------------------------------------------+
> |[-1.0]  |[0.97574736,2.425264676902229E-10]        |
> |[0.5]   |[0.47627851732981163,0.5237214826701884]  |
> |[1.0]   |[5.491554426243495E-4,0.9994508445573757] |
> |[2.0]   |[2.00573870645E-10,0.97994233]            |
> +--------+------------------------------------------+
>  Got:
> +--------+------------------------------------------+
> |features|probability                               |
> +--------+------------------------------------------+
> |[-1.0]  |[0.97574736,2.425264676902229E-10]        |
> |[0.5]   |[0.47627851732981163,0.5237214826701884]  |
> |[1.0]   |[5.491554426243495E-4,0.9994508445573757] |
> |[2.0]   |[2.00573870645E-10,0.97994233]            |
> +--------+------------------------------------------+
>  
>  **
>  File 
> "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py",
>  line 2803, in __main__.FMClassifier
>  Failed example:
>  model.factors
>  Expected:
>  DenseMatrix(1, 2, [0.0028, 0.0048], 1)
>  Got:
>  DenseMatrix(1, 2, [-0.0122, 0.0106], 1)
>  **
>  2 of 10 in __main__.FMClassifier
>  ***Test Failed*** 2 failures.
>  
> For details, see
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/91/console]
> And it seems the tests also failed on x86:
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115668/console]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115665/console]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30340) Python tests failed on arm64

2019-12-23 Thread huangtianhua (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangtianhua updated SPARK-30340:
-
Description: 
Jenkins job spark-master-test-python-arm failed after the commit 
c6ab7165dd11a0a7b8aea4c805409088e9a41a74:

File 
"/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py",
 line 2790, in __main__.FMClassifier
 Failed example:
 model.transform(test0).select("features", "probability").show(10, False)
 Expected:
+--------+------------------------------------------+
|features|probability                               |
+--------+------------------------------------------+
|[-1.0]  |[0.97574736,2.425264676902229E-10]        |
|[0.5]   |[0.47627851732981163,0.5237214826701884]  |
|[1.0]   |[5.491554426243495E-4,0.9994508445573757] |
|[2.0]   |[2.00573870645E-10,0.97994233]            |
+--------+------------------------------------------+
 Got:
+--------+------------------------------------------+
|features|probability                               |
+--------+------------------------------------------+
|[-1.0]  |[0.97574736,2.425264676902229E-10]        |
|[0.5]   |[0.47627851732981163,0.5237214826701884]  |
|[1.0]   |[5.491554426243495E-4,0.9994508445573757] |
|[2.0]   |[2.00573870645E-10,0.97994233]            |
+--------+------------------------------------------+
 
 **
 File 
"/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py",
 line 2803, in __main__.FMClassifier
 Failed example:
 model.factors
 Expected:
 DenseMatrix(1, 2, [0.0028, 0.0048], 1)
 Got:
 DenseMatrix(1, 2, [-0.0122, 0.0106], 1)
 **
 2 of 10 in __main__.FMClassifier
 ***Test Failed*** 2 failures.

 

For details, see
[https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/91/console]

And it seems the tests also failed on x86:

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115668/console]

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115665/console]

  was:
Jenkins job spark-master-test-python-arm failed after the commit 
c6ab7165dd11a0a7b8aea4c805409088e9a41a74:

File 
"/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py",
 line 2790, in __main__.FMClassifier
 Failed example:
 model.transform(test0).select("features", "probability").show(10, False)
 Expected:
 +-+-+
|features|probability|

+-+-+
|[-1.0]|[0.97574736,2.425264676902229E-10]|
|[0.5]|[0.47627851732981163,0.5237214826701884]|
|[1.0]|[5.491554426243495E-4,0.9994508445573757]|
|[2.0]|[2.00573870645E-10,0.97994233]|

+-+-+
 Got:
 +-+-+
|features|probability|

+-+-+
|[-1.0]|[0.97574736,2.425264676902229E-10]|
|[0.5]|[0.47627851732981163,0.5237214826701884]|
|[1.0]|[5.491554426243495E-4,0.9994508445573757]|
|[2.0]|[2.00573870645E-10,0.97994233]|

+-+-+
 
 **
 File 
"/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py",
 line 2803, in __main__.FMClassifier
 Failed example:
 model.factors
 Expected:
 DenseMatrix(1, 2, [0.0028, 0.0048], 1)
 Got:
 DenseMatrix(1, 2, [-0.0122, 0.0106], 1)
 **
 2 of 10 in __main__.FMClassifier
 ***Test Failed*** 2 failures.

 

The details see 
[https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/91/console]


> Python tests failed on arm64 
> -
>
> Key: SPARK-30340
> URL: https://issues.apache.org/jira/browse/SPARK-30340
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Major
>
> Jenkins job spark-master-test-python-arm failed after the commit 
> c6ab7165dd11a0a7b8aea4c805409088e9a41a74:
> File 
> "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py",
>  line 2790, in __main__.FMClassifier
>  Failed example:
>  model.transform(test0).select("features", "probability").show(10, False)
>  Expected:
>  +--++
> |features|probability|
> +--++
> |[-1.0]|[0.97574736,2.425264676902229E-10]|
> |[0.5]|[0.47627851732981163,0.5237214826701884]|
> |[1.0]|[5.491554426243495E-4,0.9994508445573757]|
> |[2.0]|[2.00573870645E-10,0.97994233]|
> +--++
>  Got:
>  +

[jira] [Commented] (SPARK-30328) Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files

2019-12-23 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002613#comment-17002613
 ] 

Ankit Raj Boudh commented on SPARK-30328:
-

Thank you [~tobe], I will analyse this issue and update you.

> Fail to write local files with RDD.saveTextFile when setting the incorrect 
> Hadoop configuration files
> -
>
> Key: SPARK-30328
> URL: https://issues.apache.org/jira/browse/SPARK-30328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: chendihao
>Priority: Major
>
> We found that incorrect Hadoop configuration files cause saving an RDD to the
> local file system to fail. This is unexpected because we specified a local URL,
> and the DataFrame.write.text API does not have this issue. It is easy to
> reproduce and verify with Spark 2.3.0.
> 1. Do not set the `HADOOP_CONF_DIR` environment variable.
> 2. Install pyspark and run the local Python script below. This should work and
> save files to the local file system.
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> sc = spark.sparkContext
> rdd = sc.parallelize([1, 2, 3])
> rdd.saveAsTextFile("file:///tmp/rdd.text")
> {code}
> 3. Set the `HADOOP_CONF_DIR` environment variable and put the Hadoop
> configuration files there. Make sure the format of `core-site.xml` is right
> but that it contains an unresolvable host name.
> 4. Run the same Python script again. It tries to connect to HDFS, finds the
> unresolvable host name, and a Java exception is thrown.
> We think `saveAsTextFile("file:///...")` should not attempt to connect to HDFS
> regardless of whether `HADOOP_CONF_DIR` is set. In fact, the following
> DataFrame code works with the same incorrect Hadoop configuration files.
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> df = spark.createDataFrame(rows, ["attribute", "value"])
> df.write.parquet("file:///tmp/df.parquet")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30328) Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files

2019-12-23 Thread chendihao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chendihao updated SPARK-30328:
--
Description: 
We found that incorrect Hadoop configuration files cause saving an RDD to the
local file system to fail. This is unexpected because we specified a local URL,
and the DataFrame.write.text API does not have this issue. It is easy to
reproduce and verify with Spark 2.3.0.

1. Do not set the `HADOOP_CONF_DIR` environment variable.

2. Install pyspark and run the local Python script below. This should work and
save files to the local file system.
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3])
rdd.saveAsTextFile("file:///tmp/rdd.text")
{code}
3. Set the `HADOOP_CONF_DIR` environment variable and put the Hadoop
configuration files there. Make sure the format of `core-site.xml` is right
but that it contains an unresolvable host name.

4. Run the same Python script again. It tries to connect to HDFS, finds the
unresolvable host name, and a Java exception is thrown.

We think `saveAsTextFile("file:///...")` should not attempt to connect to HDFS
regardless of whether `HADOOP_CONF_DIR` is set. In fact, the following
DataFrame code works with the same incorrect Hadoop configuration files.
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()
df = spark.createDataFrame(rows, ["attribute", "value"])
df.write.parquet("file:///tmp/df.parquet")
{code}

  was:
We find that the incorrect Hadoop configuration files cause the failure of 
saving RDD to local file system. It is not expected because we have specify the 
local url and the API of DataFrame.write.text does not have this issue. It is 
easy to reproduce and verify with Spark 2.3.0.

1.Do not set environment variable of `HADOOP_CONF_DIR`.

2.Install pyspark and run the local Python script. This should work and save 
files to local file system.
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()
sc = spark.sparkContextrdd = sc.parallelize([1, 2, 3])
rdd.saveAsTextFile("file:///tmp/rdd.text")
{code}
3.Set environment variable of `HADOOP_CONF_DIR` and put the Hadoop 
configuration files there. Make sure the format of `core-site.xml` is right but 
it has an unresolved host name.

4.Run the same Python script again. If it try to connect HDFS and found the 
unresolved host name, Java exception happens.

We thinks `saveAsTextFile("file:///)` should not attempt to connect HDFS not 
matter `HADOOP_CONF_DIR` is set. Actually the following code will work with the 
same incorrect Hadoop configuration files.
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()
df = spark.createDataFrame(rows, ["attribute", "value"])
df.write.parquet("file:///tmp/df.parquet")
{code}


> Fail to write local files with RDD.saveTextFile when setting the incorrect 
> Hadoop configuration files
> -
>
> Key: SPARK-30328
> URL: https://issues.apache.org/jira/browse/SPARK-30328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: chendihao
>Priority: Major
>
> We found that incorrect Hadoop configuration files cause saving an RDD to the
> local file system to fail. This is unexpected because we specified a local URL,
> and the DataFrame.write.text API does not have this issue. It is easy to
> reproduce and verify with Spark 2.3.0.
> 1. Do not set the `HADOOP_CONF_DIR` environment variable.
> 2. Install pyspark and run the local Python script below. This should work and
> save files to the local file system.
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> sc = spark.sparkContext
> rdd = sc.parallelize([1, 2, 3])
> rdd.saveAsTextFile("file:///tmp/rdd.text")
> {code}
> 3. Set the `HADOOP_CONF_DIR` environment variable and put the Hadoop
> configuration files there. Make sure the format of `core-site.xml` is right
> but that it contains an unresolvable host name.
> 4. Run the same Python script again. It tries to connect to HDFS, finds the
> unresolvable host name, and a Java exception is thrown.
> We think `saveAsTextFile("file:///...")` should not attempt to connect to HDFS
> regardless of whether `HADOOP_CONF_DIR` is set. In fact, the following
> DataFrame code works with the same incorrect Hadoop configuration files.
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> df = spark.createDataFrame(rows, ["attribute", "value"])
> df.write.parquet("file:///tmp/df.parquet")
> {code}
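A possible workaround sketch for the report above, under the assumption that pinning fs.defaultFS to the local filesystem for the session (via the generic spark.hadoop.* configuration route) bypasses the broken core-site.xml; this is not something the reporter verified:
{code:java}
from pyspark.sql import SparkSession

# Force the default Hadoop filesystem to the local one for this session so
# that file:/// writes never consult the misconfigured cluster settings.
spark = (SparkSession.builder
         .master("local")
         .config("spark.hadoop.fs.defaultFS", "file:///")
         .getOrCreate())

rdd = spark.sparkContext.parallelize([1, 2, 3])
rdd.saveAsTextFile("file:///tmp/rdd.text")
{code}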



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (SPARK-30328) Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files

2019-12-23 Thread chendihao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002609#comment-17002609
 ] 

chendihao commented on SPARK-30328:
---

Of course, and thanks [~Ankitraj]. We don't have time to dig into the source
code, but I think it may be a problem of initializing the Hadoop client before
parsing the local filesystem URL.

> Fail to write local files with RDD.saveTextFile when setting the incorrect 
> Hadoop configuration files
> -
>
> Key: SPARK-30328
> URL: https://issues.apache.org/jira/browse/SPARK-30328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: chendihao
>Priority: Major
>
> We found that incorrect Hadoop configuration files cause saving an RDD to the
> local file system to fail. This is unexpected because we specified a local URL,
> and the DataFrame.write.text API does not have this issue. It is easy to
> reproduce and verify with Spark 2.3.0.
> 1. Do not set the `HADOOP_CONF_DIR` environment variable.
> 2. Install pyspark and run the local Python script below. This should work and
> save files to the local file system.
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> sc = spark.sparkContext
> rdd = sc.parallelize([1, 2, 3])
> rdd.saveAsTextFile("file:///tmp/rdd.text")
> {code}
> 3. Set the `HADOOP_CONF_DIR` environment variable and put the Hadoop
> configuration files there. Make sure the format of `core-site.xml` is right
> but that it contains an unresolvable host name.
> 4. Run the same Python script again. It tries to connect to HDFS, finds the
> unresolvable host name, and a Java exception is thrown.
> We think `saveAsTextFile("file:///...")` should not attempt to connect to HDFS
> no matter whether `HADOOP_CONF_DIR` is set. In fact, the following code works
> with the same incorrect Hadoop configuration files.
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> df = spark.createDataFrame(rows, ["attribute", "value"])
> df.write.parquet("file:///tmp/df.parquet")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30328) Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files

2019-12-23 Thread chendihao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002609#comment-17002609
 ] 

chendihao edited comment on SPARK-30328 at 12/24/19 2:57 AM:
-

Of course, and thanks [~Ankitraj]. We don't have time to dig into the source
code, but I think it may be a problem of initializing the Hadoop client before
parsing the local filesystem URL.


was (Author: tobe):
Of course and thanks [~Ankitraj] . We don't have time to dig into the source 
code but I think it may to problem of initialing Hadoop client before parsing 
the local filesystem url. 

> Fail to write local files with RDD.saveTextFile when setting the incorrect 
> Hadoop configuration files
> -
>
> Key: SPARK-30328
> URL: https://issues.apache.org/jira/browse/SPARK-30328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: chendihao
>Priority: Major
>
> We found that incorrect Hadoop configuration files cause saving an RDD to the
> local file system to fail. This is unexpected because we specified a local URL,
> and the DataFrame.write.text API does not have this issue. It is easy to
> reproduce and verify with Spark 2.3.0.
> 1. Do not set the `HADOOP_CONF_DIR` environment variable.
> 2. Install pyspark and run the local Python script below. This should work and
> save files to the local file system.
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> sc = spark.sparkContext
> rdd = sc.parallelize([1, 2, 3])
> rdd.saveAsTextFile("file:///tmp/rdd.text")
> {code}
> 3. Set the `HADOOP_CONF_DIR` environment variable and put the Hadoop
> configuration files there. Make sure the format of `core-site.xml` is right
> but that it contains an unresolvable host name.
> 4. Run the same Python script again. It tries to connect to HDFS, finds the
> unresolvable host name, and a Java exception is thrown.
> We think `saveAsTextFile("file:///...")` should not attempt to connect to HDFS
> no matter whether `HADOOP_CONF_DIR` is set. In fact, the following code works
> with the same incorrect Hadoop configuration files.
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> df = spark.createDataFrame(rows, ["attribute", "value"])
> df.write.parquet("file:///tmp/df.parquet")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30340) Python tests failed on arm64

2019-12-23 Thread huangtianhua (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangtianhua updated SPARK-30340:
-
Description: 
Jenkins job spark-master-test-python-arm failed after the commit 
c6ab7165dd11a0a7b8aea4c805409088e9a41a74:

File 
"/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py",
 line 2790, in __main__.FMClassifier
 Failed example:
 model.transform(test0).select("features", "probability").show(10, False)
 Expected:
+--------+------------------------------------------+
|features|probability                               |
+--------+------------------------------------------+
|[-1.0]  |[0.97574736,2.425264676902229E-10]        |
|[0.5]   |[0.47627851732981163,0.5237214826701884]  |
|[1.0]   |[5.491554426243495E-4,0.9994508445573757] |
|[2.0]   |[2.00573870645E-10,0.97994233]            |
+--------+------------------------------------------+
 Got:
+--------+------------------------------------------+
|features|probability                               |
+--------+------------------------------------------+
|[-1.0]  |[0.97574736,2.425264676902229E-10]        |
|[0.5]   |[0.47627851732981163,0.5237214826701884]  |
|[1.0]   |[5.491554426243495E-4,0.9994508445573757] |
|[2.0]   |[2.00573870645E-10,0.97994233]            |
+--------+------------------------------------------+
 
 **
 File 
"/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py",
 line 2803, in __main__.FMClassifier
 Failed example:
 model.factors
 Expected:
 DenseMatrix(1, 2, [0.0028, 0.0048], 1)
 Got:
 DenseMatrix(1, 2, [-0.0122, 0.0106], 1)
 **
 2 of 10 in __main__.FMClassifier
 ***Test Failed*** 2 failures.

 

For details, see
[https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/91/console]

  was:
Jenkins job spark-master-test-python-arm failed after the commit 
c6ab7165dd11a0a7b8aea4c805409088e9a41a74:

File 
"/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py",
 line 2790, in __main__.FMClassifier
Failed example:
 model.transform(test0).select("features", "probability").show(10, False)
Expected:
 ++--+
 |features|probability |
 ++--+
 |[-1.0] |[0.97574736,2.425264676902229E-10]|
 |[0.5] |[0.47627851732981163,0.5237214826701884] |
 |[1.0] |[5.491554426243495E-4,0.9994508445573757] |
 |[2.0] |[2.00573870645E-10,0.97994233]|
 ++--+
Got:
 ++--+
 |features|probability |
 ++--+
 |[-1.0] |[0.97574736,2.425264676902229E-10]|
 |[0.5] |[0.47627851732981163,0.5237214826701884] |
 |[1.0] |[5.491554426243495E-4,0.9994508445573757] |
 |[2.0] |[2.00573870645E-10,0.97994233]|
 ++--+
 
**
File 
"/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py",
 line 2803, in __main__.FMClassifier
Failed example:
 model.factors
Expected:
 DenseMatrix(1, 2, [0.0028, 0.0048], 1)
Got:
 DenseMatrix(1, 2, [-0.0122, 0.0106], 1)
**
 2 of 10 in __main__.FMClassifier
***Test Failed*** 2 failures.


> Python tests failed on arm64 
> -
>
> Key: SPARK-30340
> URL: https://issues.apache.org/jira/browse/SPARK-30340
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Major
>
> Jenkins job spark-master-test-python-arm failed after the commit 
> c6ab7165dd11a0a7b8aea4c805409088e9a41a74:
> File 
> "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py",
>  line 2790, in __main__.FMClassifier
>  Failed example:
>  model.transform(test0).select("features", "probability").show(10, False)
>  Expected:
>  +-+-+
> |features|probability|
> +-+-+
> |[-1.0]|[0.97574736,2.425264676902229E-10]|
> |[0.5]|[0.47627851732981163,0.5237214826701884]|
> |[1.0]|[5.491554426243495E-4,0.9994508445573757]|
> |[2.0]|[2.00573870645E-10,0.97994233]|
> +-+-+
>  Got:
>  +-+-+
> |features|probability|
> +-+-+
> |[-1.0]|[0.97574736,2.425264676902229E-10]|
> |[0.5]|[0.47627851732981163,0.5237214826701884]|
> |[1.0]|[5.491554426243495E-4,0.9994508445573757]|
> |[2.0]|[2.005

[jira] [Updated] (SPARK-30339) Avoid to fail twice in function lookup

2019-12-23 Thread Zhenhua Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-30339:
-
Description: Currently, if a function lookup fails, Spark gives it a second
chance by casting decimal types to double. But for cases where no decimal type
is involved, it is meaningless to look up again, and it causes extra cost such
as unnecessary metastore access. We should throw the exception directly in
these cases.  (was: Currently, if a function lookup fails, Spark gives it a
second chance by casting decimal types to double. But for cases where no
decimal type is involved, it is meaningless to look up again, and it causes
extra cost such as unnecessary metastore access.)

> Avoid to fail twice in function lookup
> --
>
> Key: SPARK-30339
> URL: https://issues.apache.org/jira/browse/SPARK-30339
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Zhenhua Wang
>Priority: Minor
>
> Currently, if a function lookup fails, Spark gives it a second chance by
> casting decimal types to double. But for cases where no decimal type is
> involved, it is meaningless to look up again, and it causes extra cost such as
> unnecessary metastore access. We should throw the exception directly in these
> cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30340) Python tests failed on arm64

2019-12-23 Thread huangtianhua (Jira)
huangtianhua created SPARK-30340:


 Summary: Python tests failed on arm64 
 Key: SPARK-30340
 URL: https://issues.apache.org/jira/browse/SPARK-30340
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 3.0.0
Reporter: huangtianhua


Jenkins job spark-master-test-python-arm failed after the commit 
c6ab7165dd11a0a7b8aea4c805409088e9a41a74:

File 
"/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py",
 line 2790, in __main__.FMClassifier
Failed example:
 model.transform(test0).select("features", "probability").show(10, False)
Expected:
 +--------+------------------------------------------+
 |features|probability                               |
 +--------+------------------------------------------+
 |[-1.0]  |[0.97574736,2.425264676902229E-10]        |
 |[0.5]   |[0.47627851732981163,0.5237214826701884]  |
 |[1.0]   |[5.491554426243495E-4,0.9994508445573757] |
 |[2.0]   |[2.00573870645E-10,0.97994233]            |
 +--------+------------------------------------------+
Got:
 +--------+------------------------------------------+
 |features|probability                               |
 +--------+------------------------------------------+
 |[-1.0]  |[0.97574736,2.425264676902229E-10]        |
 |[0.5]   |[0.47627851732981163,0.5237214826701884]  |
 |[1.0]   |[5.491554426243495E-4,0.9994508445573757] |
 |[2.0]   |[2.00573870645E-10,0.97994233]            |
 +--------+------------------------------------------+
 
**
File 
"/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py",
 line 2803, in __main__.FMClassifier
Failed example:
 model.factors
Expected:
 DenseMatrix(1, 2, [0.0028, 0.0048], 1)
Got:
 DenseMatrix(1, 2, [-0.0122, 0.0106], 1)
**
 2 of 10 in __main__.FMClassifier
***Test Failed*** 2 failures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30339) Avoid to fail twice in function lookup

2019-12-23 Thread Zhenhua Wang (Jira)
Zhenhua Wang created SPARK-30339:


 Summary: Avoid to fail twice in function lookup
 Key: SPARK-30339
 URL: https://issues.apache.org/jira/browse/SPARK-30339
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.5, 3.0.0
Reporter: Zhenhua Wang


Currently, if a function lookup fails, Spark gives it a second chance by
casting decimal types to double. But for cases where no decimal type is
involved, it is meaningless to look up again, and it causes extra cost such as
unnecessary metastore access.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30328) Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files

2019-12-23 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002585#comment-17002585
 ] 

Ankit Raj Boudh commented on SPARK-30328:
-

@chendihao, can I check this issue?

> Fail to write local files with RDD.saveTextFile when setting the incorrect 
> Hadoop configuration files
> -
>
> Key: SPARK-30328
> URL: https://issues.apache.org/jira/browse/SPARK-30328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: chendihao
>Priority: Major
>
> We found that incorrect Hadoop configuration files cause saving an RDD to the
> local file system to fail. This is unexpected because we specified a local URL,
> and the DataFrame.write.text API does not have this issue. It is easy to
> reproduce and verify with Spark 2.3.0.
> 1. Do not set the `HADOOP_CONF_DIR` environment variable.
> 2. Install pyspark and run the local Python script below. This should work and
> save files to the local file system.
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> sc = spark.sparkContext
> rdd = sc.parallelize([1, 2, 3])
> rdd.saveAsTextFile("file:///tmp/rdd.text")
> {code}
> 3. Set the `HADOOP_CONF_DIR` environment variable and put the Hadoop
> configuration files there. Make sure the format of `core-site.xml` is right
> but that it contains an unresolvable host name.
> 4. Run the same Python script again. It tries to connect to HDFS, finds the
> unresolvable host name, and a Java exception is thrown.
> We think `saveAsTextFile("file:///...")` should not attempt to connect to HDFS
> no matter whether `HADOOP_CONF_DIR` is set. In fact, the following code works
> with the same incorrect Hadoop configuration files.
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> df = spark.createDataFrame(rows, ["attribute", "value"])
> df.write.parquet("file:///tmp/df.parquet")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30338) Avoid unnecessary InternalRow copies in ParquetRowConverter

2019-12-23 Thread Josh Rosen (Jira)
Josh Rosen created SPARK-30338:
--

 Summary: Avoid unnecessary InternalRow copies in 
ParquetRowConverter
 Key: SPARK-30338
 URL: https://issues.apache.org/jira/browse/SPARK-30338
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Josh Rosen
Assignee: Josh Rosen


ParquetRowConverter calls {{InternalRow.copy()}} in cases where the copy is 
unnecessary; this can severely harm performance when reading deeply-nested 
Parquet.

It looks like this copying was originally added to handle arrays and maps of 
structs (in which case we need to keep the copying), but we can omit it for the 
more common case of structs nested directly in structs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25603) Generalize Nested Column Pruning

2019-12-23 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002577#comment-17002577
 ] 

Takeshi Yamamuro commented on SPARK-25603:
--

Is this still WIP? Since we've finished implementing the basic part of nested
column pruning, can we set this as resolved for now? cc: [~dongjoon] [~smilegator]

> Generalize Nested Column Pruning
> 
>
> Key: SPARK-25603
> URL: https://issues.apache.org/jira/browse/SPARK-25603
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30337) Convert case class with var to normal class in spark-sql-kafka module

2019-12-23 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-30337:


 Summary: Convert case class with var to normal class in 
spark-sql-kafka module
 Key: SPARK-30337
 URL: https://issues.apache.org/jira/browse/SPARK-30337
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


A review comment in SPARK-25151 pointed this out, but we decided to mark it as
a TODO because that review already had 300+ comments and we didn't want to drag
it out further.

This issue tracks the effort to address those TODO comments.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30336) Move Kafka consumer related classes to its own package

2019-12-23 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-30336:


 Summary: Move Kafka consumer related classes to its own package
 Key: SPARK-30336
 URL: https://issues.apache.org/jira/browse/SPARK-30336
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


There are too many classes placed directly in the "org.apache.spark.sql.kafka010"
package; the classes should be grouped by purpose.

As part of the change in SPARK-21869, we moved producer-related classes out to
"org.apache.spark.sql.kafka010.producer" and only exposed the necessary
classes/methods outside the package. We can apply the same approach to
consumer-related classes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30120) LSH approxNearestNeighbors should use BoundedPriorityQueue when numNearestNeighbors is small

2019-12-23 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-30120.
--
Resolution: Not A Problem

> LSH approxNearestNeighbors should use BoundedPriorityQueue when 
> numNearestNeighbors is small
> 
>
> Key: SPARK-30120
> URL: https://issues.apache.org/jira/browse/SPARK-30120
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> ping [~huaxingao]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29245) CCE during creating HiveMetaStoreClient

2019-12-23 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002539#comment-17002539
 ] 

Xiao Li commented on SPARK-29245:
-

Since JDK 11 support is experimental, it is not a blocker for Spark 3.0. It only
affects JDK 11 users, based on my understanding.

However, we should still fix it in 3.0, so let us target it for 3.0.

> CCE during creating HiveMetaStoreClient 
> 
>
> Key: SPARK-29245
> URL: https://issues.apache.org/jira/browse/SPARK-29245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> From `master` branch build, when I try to connect to an external HMS, I hit 
> the following.
> {code}
> 19/09/25 10:58:46 ERROR hive.log: Got exception: java.lang.ClassCastException 
> class [Ljava.lang.Object; cannot be cast to class [Ljava.net.URI; 
> ([Ljava.lang.Object; and [Ljava.net.URI; are in module java.base of loader 
> 'bootstrap')
> java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast to 
> class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module 
> java.base of loader 'bootstrap')
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:200)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:70)
> {code}
> With HIVE-21508, I can get the following.
> {code}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>       /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.4)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sql("show databases").show
> +------------+
> |databaseName|
> +------------+
> |  .  |
> ...
> {code}
> With 2.3.7-SNAPSHOT, the following basic tests are tested.
> - SHOW DATABASES / TABLES
> - DESC DATABASE / TABLE
> - CREATE / DROP / USE DATABASE
> - CREATE / DROP / INSERT / LOAD / SELECT TABLE



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29245) CCE during creating HiveMetaStoreClient

2019-12-23 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-29245:

Priority: Major  (was: Blocker)

> CCE during creating HiveMetaStoreClient 
> 
>
> Key: SPARK-29245
> URL: https://issues.apache.org/jira/browse/SPARK-29245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> From `master` branch build, when I try to connect to an external HMS, I hit 
> the following.
> {code}
> 19/09/25 10:58:46 ERROR hive.log: Got exception: java.lang.ClassCastException 
> class [Ljava.lang.Object; cannot be cast to class [Ljava.net.URI; 
> ([Ljava.lang.Object; and [Ljava.net.URI; are in module java.base of loader 
> 'bootstrap')
> java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast to 
> class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module 
> java.base of loader 'bootstrap')
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:200)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:70)
> {code}
> With HIVE-21508, I can get the following.
> {code}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>       /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.4)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sql("show databases").show
> +------------+
> |databaseName|
> +------------+
> |  .  |
> ...
> {code}
> With 2.3.7-SNAPSHOT, the following basic tests are tested.
> - SHOW DATABASES / TABLES
> - DESC DATABASE / TABLE
> - CREATE / DROP / USE DATABASE
> - CREATE / DROP / INSERT / LOAD / SELECT TABLE



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30316) data size boom after shuffle writing dataframe save as parquet

2019-12-23 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002529#comment-17002529
 ] 

Xiao Li commented on SPARK-30316:
-

The compression ratio depends on your data layout, not on the number of rows.
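A hedged sketch of acting on that observation: restore locality after the shuffle by sorting within partitions before writing, so Parquet's run-length and dictionary encoders see similar values next to each other again. The paths, the partition count, and the "key" column below are placeholders, not details from this report:
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()

# Placeholder input; in the report this is an ~800 MB Parquet file.
df = spark.read.parquet("file:///tmp/input.parquet")

# Repartitioning alone scrambles row order, which can hurt Parquet encoding.
# Sorting within each partition usually brings the output size back down.
(df.repartition(200, "key")
   .sortWithinPartitions("key")
   .write.mode("overwrite")
   .parquet("file:///tmp/output.parquet"))
{code}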

> data size boom after shuffle writing dataframe save as parquet
> --
>
> Key: SPARK-30316
> URL: https://issues.apache.org/jira/browse/SPARK-30316
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Affects Versions: 2.4.4
>Reporter: Cesc 
>Priority: Major
>
> When I read the same Parquet file and then save it in two ways, with shuffle
> and without shuffle, I found the sizes of the output Parquet files are quite
> different. For example, for an original Parquet file of 800 MB, if I save it
> without a shuffle the size is still 800 MB, whereas if I repartition it and
> then save it in Parquet format, the data size increases to 2.5 GB. The row
> counts, column counts, and contents of the two output files are all the same.
> I wonder:
> firstly, why does the data size increase after repartition/shuffle?
> secondly, if I need to shuffle the input dataframe, how can I save it as a
> Parquet file efficiently and avoid the data size boom?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30316) data size boom after shuffle writing dataframe save as parquet

2019-12-23 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30316:

Priority: Major  (was: Blocker)

> data size boom after shuffle writing dataframe save as parquet
> --
>
> Key: SPARK-30316
> URL: https://issues.apache.org/jira/browse/SPARK-30316
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Affects Versions: 2.4.4
>Reporter: Cesc 
>Priority: Major
>
> When I read the same parquet file and then save it in two ways, with shuffle 
> and without shuffle, I found the sizes of the output parquet files are quite 
> different. For example, for an original parquet file of 800 MB: if I save it 
> without shuffle, the size is still 800 MB, whereas if I repartition it and then 
> save it in parquet format, the data size increases to 2.5 GB. The row count, 
> column count and content of the two output files are all the same.
> I wonder:
> firstly, why does the data size increase after repartition/shuffle?
> secondly, if I need to shuffle the input dataframe, how can I save it as a 
> parquet file efficiently to avoid the data size boom?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.

2019-12-23 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-21869.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26845
[https://github.com/apache/spark/pull/26845]

> A cached Kafka producer should not be closed if any task is using it.
> -
>
> Key: SPARK-21869
> URL: https://issues.apache.org/jira/browse/SPARK-21869
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Shixiong Zhu
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Right now a cached Kafka producer may be closed if a large task uses it for 
> more than 10 minutes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.

2019-12-23 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin reassigned SPARK-21869:
--

Assignee: Jungtaek Lim  (was: Gabor Somogyi)

> A cached Kafka producer should not be closed if any task is using it.
> -
>
> Key: SPARK-21869
> URL: https://issues.apache.org/jira/browse/SPARK-21869
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Shixiong Zhu
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> Right now a cached Kafka producer may be closed if a large task uses it for 
> more than 10 minutes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30335) Clarify behavior of FIRST and LAST without OVER clause.

2019-12-23 Thread xqods9o5ekm3 (Jira)
xqods9o5ekm3 created SPARK-30335:


 Summary: Clarify behavior of FIRST and LAST without OVER clause.
 Key: SPARK-30335
 URL: https://issues.apache.org/jira/browse/SPARK-30335
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.0, 3.0.0
Reporter: xqods9o5ekm3


Unlike many databases, Spark SQL allows usage of {{FIRST}} and {{LAST}} in 
non-analytic contexts.

 

At the moment {{FIRST}}

 

> first(expr[, isIgnoreNull]) - Returns the first value of {{expr}} for a group 
> of rows. If {{isIgnoreNull}} is true, returns only non-null values.

 

and {{LAST}}

 

> last(expr[, isIgnoreNull]) - Returns the last value of {{expr}} for a group 
> of rows. If {{isIgnoreNull}} is true, returns only non-null values.

 

descriptions suggest that their behavior is deterministic, and many users 
assume that they return specific values, for example for a query like
 
{code:sql}
SELECT first(foo)
FROM (
SELECT * FROM table ORDER BY bar
)
{code}

That however doesn't seem to be the case.

To make the situation worse, it seems to work (for example on small samples in 
local mode).
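
A sketch of the contrast (table and column names follow the example above): the first query relies on the undefined row ordering seen by a grouped first(), while the second pins the ordering explicitly with a window, which is the deterministic way to express "value of foo for the smallest bar".
{code:scala}
// Not guaranteed: the subquery's ORDER BY does not constrain which row first() picks.
spark.sql("""
  SELECT first(foo) FROM (SELECT * FROM table ORDER BY bar)
""")

// Deterministic alternative: make the ordering part of the computation itself.
spark.sql("""
  SELECT foo
  FROM (SELECT foo, row_number() OVER (ORDER BY bar) AS rn FROM table)
  WHERE rn = 1
""")
{code}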





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27838) Support user provided non-nullable avro schema for nullable catalyst schema without any null record

2019-12-23 Thread Frank Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002440#comment-17002440
 ] 

Frank Lee commented on SPARK-27838:
---

Hello

Is there a workaround for this before this is released? Currently our avro 
schema is defined as

(we are using avdl)
 
{code:java}
protocol Foo {
  record FooRecord {
    string something;
    string anotherthing;
    long count;
  }
}
{code}

And AvroSerializer throws the error "AvroRuntimeException: Not a union: "string"".
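
A possible direction (a sketch, not a confirmed workaround): once this feature is available (Fix Version 3.0.0), spark-avro's `avroSchema` write option can carry a hand-written non-nullable record so the writer is not forced to emit a union. The schema string below is a hypothetical translation of FooRecord, and the source table name is made up.
{code:scala}
// Hypothetical non-nullable Avro schema matching FooRecord from the avdl above.
val fooSchema =
  """{"type":"record","name":"FooRecord","fields":[
    {"name":"something","type":"string"},
    {"name":"anotherthing","type":"string"},
    {"name":"count","type":"long"}]}"""

val df = spark.table("foo")   // hypothetical source DataFrame with matching columns

df.write
  .format("avro")
  .option("avroSchema", fooSchema)   // user-provided schema instead of the derived nullable one
  .save("/tmp/foo_avro")
{code}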

> Support user provided non-nullable avro schema for nullable catalyst schema 
> without any null record
> ---
>
> Key: SPARK-27838
> URL: https://issues.apache.org/jira/browse/SPARK-27838
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>
> When the data is read from the sources, the catalyst schema is always 
> nullable. Since Avro uses a union type to represent nullable fields, when any 
> non-nullable avro file is read and then written out, the schema will always 
> be changed. This PR provides a solution for users to keep the Avro schema 
> without being forced to use a union type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29224) Implement Factorization Machines as a ml-pipeline component

2019-12-23 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002402#comment-17002402
 ] 

Ruslan Dautkhanov commented on SPARK-29224:
---

E.g. would this work with 0.1m or 1m sparse features?

> Implement Factorization Machines as a ml-pipeline component
> ---
>
> Key: SPARK-29224
> URL: https://issues.apache.org/jira/browse/SPARK-29224
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: mob-ai
>Assignee: mob-ai
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: url_loss.xlsx
>
>
> Factorization Machines are widely used in advertising and recommendation 
> systems to estimate CTR (click-through rate).
> Advertising and recommendation systems usually have a lot of data, so we need 
> Spark to estimate the CTR, and Factorization Machines are a common ML model 
> for doing so.
> Goal: Implement Factorization Machines as a ml-pipeline component
> Requirements:
> 1. loss function supports: logloss, mse
> 2. optimizer: mini batch SGD
> References:
> 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International 
> Conference on Data Mining (ICDM), pp. 995–1000, 2010.
> https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
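
For context, a minimal sketch of what the resulting ml-pipeline component looks like from the user side (class and setter names assume the FMRegressor/FMClassifier API added by the linked PR; the data path is just an example from the Spark distribution):
{code:scala}
import org.apache.spark.ml.regression.FMRegressor

// libsvm-formatted training data.
val training = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")

val fm = new FMRegressor()
  .setFactorSize(8)     // dimensionality of the pairwise-interaction factors
  .setStepSize(0.01)    // mini-batch SGD step size
  .setMaxIter(100)

val model = fm.fit(training)
model.transform(training).select("label", "prediction").show(5)
{code}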



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27762) Support user provided avro schema for writing fields with different ordering

2019-12-23 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-27762:
---

Assignee: DB Tsai

> Support user provided avro schema for writing fields with different ordering
> 
>
> Key: SPARK-27762
> URL: https://issues.apache.org/jira/browse/SPARK-27762
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29224) Implement Factorization Machines as a ml-pipeline component

2019-12-23 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002398#comment-17002398
 ] 

Ruslan Dautkhanov commented on SPARK-29224:
---

That's great.

Out of curiosity, what's the largest number of features this was tested with?

 

> Implement Factorization Machines as a ml-pipeline component
> ---
>
> Key: SPARK-29224
> URL: https://issues.apache.org/jira/browse/SPARK-29224
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: mob-ai
>Assignee: mob-ai
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: url_loss.xlsx
>
>
> Factorization Machines are widely used in advertising and recommendation 
> systems to estimate CTR (click-through rate).
> Advertising and recommendation systems usually have a lot of data, so we need 
> Spark to estimate the CTR, and Factorization Machines are a common ML model 
> for doing so.
> Goal: Implement Factorization Machines as a ml-pipeline component
> Requirements:
> 1. loss function supports: logloss, mse
> 2. optimizer: mini batch SGD
> References:
> 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International 
> Conference on Data Mining (ICDM), pp. 995–1000, 2010.
> https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30334) Add metadata around semi-structured columns to Spark

2019-12-23 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-30334:

Description: 
Semi-structured data is used widely in the data industry for reporting events 
in a wide variety of formats. Click events in product analytics can be stored 
as json. Some application logs can be in the form of delimited key=value text. 
Some data may be in xml.

The goal of this project is to be able to signal Spark that such a column 
exists. This will then enable Spark to "auto-parse" these columns on the fly. 
The proposal is to store this information as part of the column metadata, in 
the fields:

 - format: The format of the semi-structured column, e.g. json, xml, avro

 - options: Options for parsing these columns

Then imagine having the following data:
{code:java}
+------------+-------+--------------------+
| ts         | event | raw                |
+------------+-------+--------------------+
| 2019-10-12 | click | {"field":"value"}  |
+------------+-------+--------------------+ {code}
SELECT raw.field FROM data

will return "value"

or the following data
{code:java}
+------------+-------+----------------------+
| ts         | event | raw                  |
+------------+-------+----------------------+
| 2019-10-12 | click | field1=v1|field2=v2  |
+------------+-------+----------------------+ {code}
SELECT raw.field1 FROM data

will return v1.

 

As a first step, we will introduce the function "as_json", which accomplishes 
this for JSON columns.

  was:
Semi-structured data is used widely in the data industry for reporting events 
in a wide variety of formats. Click events in product analytics can be stored 
as json. Some application logs can be in the form of delimited key=value text. 
Some data may be in xml.

The goal of this project is to be able to signal Spark that such a column 
exists. This will then enable Spark to "auto-parse" these columns on the fly. 
The proposal is to store this information as part of the column metadata, in 
the fields:

 - format: The format of the semi-structured column, e.g. json, xml, avro

 - options: Options for parsing these columns

Then imagine having the following data:
{code:java}
+------------+-------+--------------------+
| ts         | event | raw                |
+------------+-------+--------------------+
| 2019-10-12 | click | {"field":"value"}  |
+------------+-------+--------------------+ {code}
SELECT raw.field FROM data

will return "value"

or the following data
{code:java}
+------------+-------+----------------------+
| ts         | event | raw                  |
+------------+-------+----------------------+
| 2019-10-12 | click | field1=v1|field2=v2  |
+------------+-------+----------------------+ {code}
SELECT raw.field1 FROM data

will return v1.


> Add metadata around semi-structured columns to Spark
> 
>
> Key: SPARK-30334
> URL: https://issues.apache.org/jira/browse/SPARK-30334
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Burak Yavuz
>Priority: Major
>
> Semi-structured data is used widely in the data industry for reporting events 
> in a wide variety of formats. Click events in product analytics can be stored 
> as json. Some application logs can be in the form of delimited key=value 
> text. Some data may be in xml.
> The goal of this project is to be able to signal Spark that such a column 
> exists. This will then enable Spark to "auto-parse" these columns on the fly. 
> The proposal is to store this information as part of the column metadata, in 
> the fields:
>  - format: The format of the semi-structured column, e.g. json, xml, avro
>  - options: Options for parsing these columns
> Then imagine having the following data:
> {code:java}
> +------------+-------+--------------------+
> | ts         | event | raw                |
> +------------+-------+--------------------+
> | 2019-10-12 | click | {"field":"value"}  |
> +------------+-------+--------------------+ {code}
> SELECT raw.field FROM data
> will return "value"
> or the following data
> {code:java}
> +------------+-------+----------------------+
> | ts         | event | raw                  |
> +------------+-------+----------------------+
> | 2019-10-12 | click | field1=v1|field2=v2  |
> +------------+-------+----------------------+ {code}
> SELECT raw.field1 FROM data
> will return v1.
>  
> As a first step, we will introduce the function "as_json", which accomplishes 
> this for JSON columns.
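
A rough sketch of what the proposal implies (the metadata keys `format` and `options` come from the description above; nothing in Spark acts on them yet, so the raw column still has to be parsed explicitly, e.g. with from_json):
{code:scala}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField, StructType}

// Tag the raw column with the proposed metadata fields.
val semiStructured = new MetadataBuilder()
  .putString("format", "json")
  .putString("options", "{}")
  .build()

val tagged = spark.table("data")
  .select(col("ts"), col("event"), col("raw").as("raw", semiStructured))

// What "auto-parse" would do implicitly today requires an explicit schema and from_json.
val schema = StructType(Seq(StructField("field", StringType)))
tagged.select(from_json(col("raw"), schema).getField("field").alias("field")).show()
{code}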



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30334) Add metadata around semi-structured columns to Spark

2019-12-23 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-30334:
---

 Summary: Add metadata around semi-structured columns to Spark
 Key: SPARK-30334
 URL: https://issues.apache.org/jira/browse/SPARK-30334
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.4
Reporter: Burak Yavuz


Semi-structured data is used widely in the data industry for reporting events 
in a wide variety of formats. Click events in product analytics can be stored 
as json. Some application logs can be in the form of delimited key=value text. 
Some data may be in xml.

The goal of this project is to be able to signal Spark that such a column 
exists. This will then enable Spark to "auto-parse" these columns on the fly. 
The proposal is to store this information as part of the column metadata, in 
the fields:

 - format: The format of the semi-structured column, e.g. json, xml, avro

 - options: Options for parsing these columns

Then imagine having the following data:
{code:java}
+------------+-------+--------------------+
| ts         | event | raw                |
+------------+-------+--------------------+
| 2019-10-12 | click | {"field":"value"}  |
+------------+-------+--------------------+ {code}
SELECT raw.field FROM data

will return "value"

or the following data
{code:java}
+------------+-------+----------------------+
| ts         | event | raw                  |
+------------+-------+----------------------+
| 2019-10-12 | click | field1=v1|field2=v2  |
+------------+-------+----------------------+ {code}
SELECT raw.field1 FROM data

will return v1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26663) Cannot query a Hive table with subdirectories

2019-12-23 Thread Xiaoguang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002367#comment-17002367
 ] 

Xiaoguang Wang commented on SPARK-26663:


I am hitting the same problem here.

How can I debug it?
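
Not a confirmed fix, but a common place to start debugging this symptom is the session configuration below: keep the Hive reader path and enable recursive input listing so the subdirectories created by the UNION ALL are actually read (treat these settings as suggestions to experiment with, not as the resolution of this ticket).
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-subdir-debug")
  // Keep Hive's ORC reader instead of Spark's native one.
  .config("spark.sql.hive.convertMetastoreOrc", "false")
  // Hadoop/Hive settings that allow listing nested data directories.
  .config("spark.hadoop.hive.mapred.supports.subdirectories", "true")
  .config("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", "true")
  .enableHiveSupport()
  .getOrCreate()

spark.table("c").show()
spark.table("c").count()
{code}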

> Cannot query a Hive table with subdirectories
> -
>
> Key: SPARK-26663
> URL: https://issues.apache.org/jira/browse/SPARK-26663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Aäron
>Priority: Major
>
> Hello,
>  
> I want to report the following issue (my first one :) )
> When I create a table in Hive based on a UNION ALL, Spark 2.4 is unable to 
> query this table.
> To reproduce:
> *Hive 1.2.1*
> {code:java}
> hive> create table a(id int);
> insert into a values(1);
> hive> create table b(id int);
> insert into b values(2);
> hive> create table c as select id from a union all select id from b;
> {code}
>  
> *Spark 2.3.1*
>  
> {code:java}
> scala> spark.table("c").show
> +---+
> | id|
> +---+
> | 1|
> | 2|
> +---+
> scala> spark.table("c").count
> res5: Long = 2
>  {code}
>  
> *Spark 2.4.0*
> {code:java}
> scala> spark.table("c").show
> 19/01/18 17:00:49 WARN HiveMetastoreCatalog: Unable to infer schema for table 
> perftest_be.c from file format ORC (inference mode: INFER_AND_SAVE). Using 
> metastore schema.
> +---+
> | id|
> +---+
> +---+
> scala> spark.table("c").count
> res3: Long = 0
> {code}
> I did not find an existing issue for this.  Might be important to investigate.
>  
> +Extra info:+ Spark 2.3.1 and 2.4.0 use the same spark-defaults.conf.
>  
> Kind regards.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29224) Implement Factorization Machines as a ml-pipeline component

2019-12-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29224.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26124
[https://github.com/apache/spark/pull/26124]

> Implement Factorization Machines as a ml-pipeline component
> ---
>
> Key: SPARK-29224
> URL: https://issues.apache.org/jira/browse/SPARK-29224
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: mob-ai
>Assignee: mob-ai
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: url_loss.xlsx
>
>
> Factorization Machines are widely used in advertising and recommendation 
> systems to estimate CTR (click-through rate).
> Advertising and recommendation systems usually have a lot of data, so we need 
> Spark to estimate the CTR, and Factorization Machines are a common ML model 
> for doing so.
> Goal: Implement Factorization Machines as a ml-pipeline component
> Requirements:
> 1. loss function supports: logloss, mse
> 2. optimizer: mini batch SGD
> References:
> 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International 
> Conference on Data Mining (ICDM), pp. 995–1000, 2010.
> https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29224) Implement Factorization Machines as a ml-pipeline component

2019-12-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-29224:


Assignee: mob-ai

> Implement Factorization Machines as a ml-pipeline component
> ---
>
> Key: SPARK-29224
> URL: https://issues.apache.org/jira/browse/SPARK-29224
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: mob-ai
>Assignee: mob-ai
>Priority: Major
> Attachments: url_loss.xlsx
>
>
> Factorization Machines are widely used in advertising and recommendation 
> systems to estimate CTR (click-through rate).
> Advertising and recommendation systems usually have a lot of data, so we need 
> Spark to estimate the CTR, and Factorization Machines are a common ML model 
> for doing so.
> Goal: Implement Factorization Machines as a ml-pipeline component
> Requirements:
> 1. loss function supports: logloss, mse
> 2. optimizer: mini batch SGD
> References:
> 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International 
> Conference on Data Mining (ICDM), pp. 995–1000, 2010.
> https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30333) Bump jackson-databind to 2.6.7.3

2019-12-23 Thread Sandeep Katta (Jira)
Sandeep Katta created SPARK-30333:
-

 Summary: Bump  jackson-databind to 2.6.7.3 
 Key: SPARK-30333
 URL: https://issues.apache.org/jira/browse/SPARK-30333
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: Sandeep Katta


To fix the CVEs below:

CVE-2018-14718
CVE-2018-14719
CVE-2018-14720
CVE-2018-14721
CVE-2018-19360
CVE-2018-19361
CVE-2018-19362



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception

2019-12-23 Thread Izek Greenfield (Jira)
Izek Greenfield created SPARK-30332:
---

 Summary: When running sql query with limit catalyst throw 
StackOverFlow exception 
 Key: SPARK-30332
 URL: https://issues.apache.org/jira/browse/SPARK-30332
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
 Environment: spark version 3.0.0-preview
Reporter: Izek Greenfield


Running that SQL:
{code:sql}
SELECT  BT_capital.asof_date,
BT_capital.run_id,
BT_capital.v,
BT_capital.id,
BT_capital.entity,
BT_capital.level_1,
BT_capital.level_2,
BT_capital.level_3,
BT_capital.level_4,
BT_capital.level_5,
BT_capital.level_6,
BT_capital.path_bt_capital,
BT_capital.line_item,
t0.target_line_item,
t0.line_description,
BT_capital.col_item,
BT_capital.rep_amount,
root.orgUnitId,
root.cptyId,
root.instId,
root.startDate,
root.maturityDate,
root.amount,
root.nominalAmount,
root.quantity,
root.lkupAssetLiability,
root.lkupCurrency,
root.lkupProdType,
root.interestResetDate,
root.interestResetTerm,
root.noticePeriod,
root.historicCostAmount,
root.dueDate,
root.lkupResidence,
root.lkupCountryOfUltimateRisk,
root.lkupSector,
root.lkupIndustry,
root.lkupAccountingPortfolioType,
root.lkupLoanDepositTerm,
root.lkupFixedFloating,
root.lkupCollateralType,
root.lkupRiskType,
root.lkupEligibleRefinancing,
root.lkupHedging,
root.lkupIsOwnIssued,
root.lkupIsSubordinated,
root.lkupIsQuoted,
root.lkupIsSecuritised,
root.lkupIsSecuritisedServiced,
root.lkupIsSyndicated,
root.lkupIsDeRecognised,
root.lkupIsRenegotiated,
root.lkupIsTransferable,
root.lkupIsNewBusiness,
root.lkupIsFiduciary,
root.lkupIsNonPerforming,
root.lkupIsInterGroup,
root.lkupIsIntraGroup,
root.lkupIsRediscounted,
root.lkupIsCollateral,
root.lkupIsExercised,
root.lkupIsImpaired,
root.facilityId,
root.lkupIsOTC,
root.lkupIsDefaulted,
root.lkupIsSavingsPosition,
root.lkupIsForborne,
root.lkupIsDebtRestructuringLoan,
root.interestRateAAR,
root.interestRateAPRC,
root.custom1,
root.custom2,
root.custom3,
root.lkupSecuritisationType,
root.lkupIsCashPooling,
root.lkupIsEquityParticipationGTE10,
root.lkupIsConvertible,
root.lkupEconomicHedge,
root.lkupIsNonCurrHeldForSale,
root.lkupIsEmbeddedDerivative,
root.lkupLoanPurpose,
root.lkupRegulated,
root.lkupRepaymentType,
root.glAccount,
root.lkupIsRecourse,
root.lkupIsNotFullyGuaranteed,
root.lkupImpairmentStage,
root.lkupIsEntireAmountWrittenOff,
root.lkupIsLowCreditRisk,
root.lkupIsOBSWithinIFRS9,
root.lkupIsUnderSpecialSurveillance,
root.lkupProtection,
root.lkupIsGeneralAllowance,
root.lkupSectorUltimateRisk,
root.cptyOrgUnitId,
root.name,
root.lkupNationality,
root.lkupSize,
root.lkupIsSPV,
root.lkupIsCentralCounterparty,
root.lkupIsMMRMFI,
root.lkupIsKeyManagement,
root.lkupIsOtherRelatedParty,
root.lkupResidenceProvince,
root.lkupIsTradingBook,
root.entityHierarchy_entityId,
root.entityHierarchy_Residence,
root.lkupLocalCurrency,
root.cpty_entityhierarchy_entityId,
root.lkupRelationship,
root.cpty_lkupRelationship,
root.entityNationality,
root.lkupRepCurrency,
root.startDateFinancialYear,
root.numEmployees,
root.numEmployeesTotal,
root.collateralAmount,
root.guaranteeAmount,
root.impairmentSpecificIndividual,
root.impairmentSpecificCollective,
root.impairmentGeneral,
root.creditRiskAmount,
root.provisionSpecificIndividual,
root.provisionSpecificCollective,
root.provisionGeneral,
root.writeOffAmount,
root.interest,
root.fairValueAmount,
root.grossCarryingAmount,
root.carryingAmount,
root.code,
root.lkupInstrumentType,
root.price,
root.amountAtIssue,
root.yield,
root.totalFacilityAmount,
root.facility_rate,
root.spec_indiv_est,
root.spec_coll_est,
root.coll_inc_loss,
root.impairment_amount,
root.provision_amount,
root.accumulated_impairment,
root.exclusionFlag,
root.lkupIsHoldingCompany,
root.instrument_startDate,
root.entityResidence,
fxRate.enumerator,
fxRate.lkupFromCurrency,
fxRate.rate,
fxRate.custom1,
fxRate.custom2,
fxRate.custom3,
GB_position.lkupIsECGDGuaranteed,
GB_position.lkupIsMultiAcctOffsetMortgage,
GB_position.lkupIsIndexLinked,
GB_position.lkupIsRetail,
GB_position.lkupCollateralLocation,
GB_position.percentAboveBBR,
GB_position.lkupIsMoreInArrears,
GB_position.lkupIsArrearsCapitalised,
GB_position.lkupCollateralPossession,
GB_position.lkupIsLifetimeMortgage,
GB_position.lkupLoanConcessionType,
GB_position.lkupIsMultiCurrency,
GB_position.lkupIsJointIncomeBasis,
GB_position.ratioIncomeMultiple,
GB_position.interestRate,
GB_position.exclusionFlag,
GB_position.lkupFDIDirection,
GB_position.lkupIsRTGS,
GB_positionExtended.nonRecourseFinanceAmount,
GB_positionExtended.arrearsAmount,
GB_Counterparty.lkupIsClearingFirm,
GB_Counterparty.lkupIsIntermediateFinCorp,
GB_Counterparty.lkupIsImpairedCreditHistory,
GB_Counterparty.lkupFDIRelationship  FROM portfolio_41446 BT_capital
JOIN aggr_41390 root ON (root.id = BT_capital.id AND root.entity = 
BT_capital.entity AND (root.instance_id = 
'e3b82807-9371-44f4-9c97-d63cde

[jira] [Created] (SPARK-30331) The final AdaptiveSparkPlan event is not marked with `isFinalPlan=true`

2019-12-23 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-30331:
--

 Summary: The final AdaptiveSparkPlan event is not marked with 
`isFinalPlan=true`
 Key: SPARK-30331
 URL: https://issues.apache.org/jira/browse/SPARK-30331
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Manu Zhang


This is because the final AdaptiveSparkPlan event is sent out before the 
{{isFinalPlan}} variable is set to `true`. This breaks any listener attempting 
to catch the final event by pattern matching on `isFinalPlan=true`.
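
For illustration, the kind of listener that breaks (class and field names assume Spark 3.0's AQE UI events and are an approximation, not taken from this ticket):
{code:scala}
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecutionUpdate

class FinalPlanListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    // Never matches if the event is posted before isFinalPlan is flipped to true.
    case e: SparkListenerSQLAdaptiveExecutionUpdate
        if e.physicalPlanDescription.contains("isFinalPlan=true") =>
      println(s"final adaptive plan for execution ${e.executionId}")
    case _ => // ignore other events
  }
}

spark.sparkContext.addSparkListener(new FinalPlanListener)
{code}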



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28332) SQLMetric wrong initValue

2019-12-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28332:
---

Assignee: EdisonWang

> SQLMetric wrong initValue 
> --
>
> Key: SPARK-28332
> URL: https://issues.apache.org/jira/browse/SPARK-28332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Song Jun
>Assignee: EdisonWang
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently SQLMetrics.createSizeMetric creates a SQLMetric with initValue set 
> to -1.
> If there is a ShuffleMapStage with lots of tasks that read 0 bytes of data, 
> these tasks will send the metric (whose value is still the initValue of -1) to 
> the driver. The driver then merges the metrics for this stage in 
> DAGScheduler.updateAccumulators, which causes the merged metric value of the 
> stage to become negative.
> This is incorrect; we should set the initValue to 0.
> The same applies to SQLMetrics.createTimingMetric.
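
A toy illustration of the merge problem described above (plain Scala, not the actual SQLMetric class): with an initial value of -1, a stage whose tasks read nothing sums to a large negative number, while an initial value of 0 gives the expected total.
{code:scala}
// 1000 tasks that read 0 bytes and therefore never update the metric.
val perTaskWithMinusOneInit = Seq.fill(1000)(-1L)
val mergedWrong = perTaskWithMinusOneInit.sum      // -1000: the negative stage value

val perTaskWithZeroInit = Seq.fill(1000)(0L)
val mergedRight = perTaskWithZeroInit.sum          // 0: what the stage metric should show
{code}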



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28332) SQLMetric wrong initValue

2019-12-23 Thread EdisonWang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002148#comment-17002148
 ] 

EdisonWang commented on SPARK-28332:


I've taken it [~cloud_fan]

> SQLMetric wrong initValue 
> --
>
> Key: SPARK-28332
> URL: https://issues.apache.org/jira/browse/SPARK-28332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Song Jun
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently SQLMetrics.createSizeMetric creates a SQLMetric with initValue set 
> to -1.
> If there is a ShuffleMapStage with lots of tasks that read 0 bytes of data, 
> these tasks will send the metric (whose value is still the initValue of -1) to 
> the driver. The driver then merges the metrics for this stage in 
> DAGScheduler.updateAccumulators, which causes the merged metric value of the 
> stage to become negative.
> This is incorrect; we should set the initValue to 0.
> The same applies to SQLMetrics.createTimingMetric.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26002) SQL date operators calculates with incorrect dayOfYears for dates before 1500-03-01

2019-12-23 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-26002:

Labels: correctness  (was: )

> SQL date operators calculates with incorrect dayOfYears for dates before 
> 1500-03-01
> ---
>
> Key: SPARK-26002
> URL: https://issues.apache.org/jira/browse/SPARK-26002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 
> 2.3.2, 2.4.0, 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Running the following SQL the result is incorrect:
> {noformat}
> scala> sql("select dayOfYear('1500-01-02')").show()
> +---+
> |dayofyear(CAST(1500-01-02 AS DATE))|
> +---+
> |  1|
> +---+
> {noformat}
> This off by one day is more annoying right at the beginning of a year:
> {noformat}
> scala> sql("select year('1500-01-01')").show()
> +--+
> |year(CAST(1500-01-01 AS DATE))|
> +--+
> |  1499|
> +--+
> scala> sql("select month('1500-01-01')").show()
> +---+
> |month(CAST(1500-01-01 AS DATE))|
> +---+
> | 12|
> +---+
> scala> sql("select dayOfYear('1500-01-01')").show()
> +---+
> |dayofyear(CAST(1500-01-01 AS DATE))|
> +---+
> |365|
> +---+
> {noformat}
>  
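
For comparison, the expected values under the proleptic Gregorian calendar used by java.time (the off-by-one above appears to come from mixing that with the legacy hybrid Julian/Gregorian calendar for dates before the Gregorian cutover):
{code:scala}
import java.time.LocalDate

LocalDate.of(1500, 1, 1).getYear        // 1500
LocalDate.of(1500, 1, 1).getDayOfYear   // 1
LocalDate.of(1500, 1, 2).getDayOfYear   // 2
{code}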



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30330) Support single quotes json parsing for get_json_object and json_tuple

2019-12-23 Thread Fang Wen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fang Wen updated SPARK-30330:
-
External issue URL: https://github.com/apache/spark/pull/26965

> Support single quotes json parsing for get_json_object and json_tuple
> -
>
> Key: SPARK-30330
> URL: https://issues.apache.org/jira/browse/SPARK-30330
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3, 2.4.4
>Reporter: Fang Wen
>Priority: Major
>  Labels: release-notes
>
> I execute a query such as:
> {code:java}
>  select get_json_object(ytag, '$.y1') AS y1 from t4{code}
> SparkSQL returns null but Hive returns the correct results.
> In my production environment, ytag is a JSON string wrapped in single quotes, 
> as follows:
> {code:java}
> {'y1': 'shuma', 'y2': 'shuma:shouji'}
> {'y1': 'jiaoyu', 'y2': 'jiaoyu:gaokao'}
> {'y1': 'yule', 'y2': 'yule:mingxing'}
> {code}
> Then I realized some functions, including get_json_object and json_tuple, do 
> not support parsing single-quoted JSON. They return null in this situation.
> I think such behavior is unfriendly to users.
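
A possible workaround in the meantime (a sketch, assuming the single-quoted values parse cleanly): from_json accepts the JSON datasource options, including allowSingleQuotes, which get_json_object and json_tuple do not expose.
{code:scala}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val ytagSchema = StructType(Seq(
  StructField("y1", StringType),
  StructField("y2", StringType)))

spark.table("t4")
  .withColumn("ytag_struct",
    from_json(col("ytag"), ytagSchema, Map("allowSingleQuotes" -> "true")))
  .select(col("ytag_struct").getField("y1").alias("y1"))
  .show()
{code}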



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30330) Support single quotes json parsing for get_json_object and json_tuple

2019-12-23 Thread Fang Wen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fang Wen updated SPARK-30330:
-
External issue URL:   (was: https://github.com/apache/spark/pull/26965)

> Support single quotes json parsing for get_json_object and json_tuple
> -
>
> Key: SPARK-30330
> URL: https://issues.apache.org/jira/browse/SPARK-30330
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3, 2.4.4
>Reporter: Fang Wen
>Priority: Major
>  Labels: release-notes
>
> I execute a query such as:
> {code:java}
>  select get_json_object(ytag, '$.y1') AS y1 from t4{code}
> SparkSQL returns null but Hive returns the correct results.
> In my production environment, ytag is a JSON string wrapped in single quotes, 
> as follows:
> {code:java}
> {'y1': 'shuma', 'y2': 'shuma:shouji'}
> {'y1': 'jiaoyu', 'y2': 'jiaoyu:gaokao'}
> {'y1': 'yule', 'y2': 'yule:mingxing'}
> {code}
> Then I realized some functions, including get_json_object and json_tuple, do 
> not support parsing single-quoted JSON. They return null in this situation.
> I think such behavior is unfriendly to users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30330) Support single quotes json parsing for get_json_object and json_tuple

2019-12-23 Thread Fang Wen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fang Wen updated SPARK-30330:
-
Labels: release-notes  (was: )

> Support single quotes json parsing for get_json_object and json_tuple
> -
>
> Key: SPARK-30330
> URL: https://issues.apache.org/jira/browse/SPARK-30330
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3, 2.4.4
>Reporter: Fang Wen
>Priority: Major
>  Labels: release-notes
>
> I execute a query such as:
> {code:java}
>  select get_json_object(ytag, '$.y1') AS y1 from t4{code}
> SparkSQL returns null but Hive returns the correct results.
> In my production environment, ytag is a JSON string wrapped in single quotes, 
> as follows:
> {code:java}
> {'y1': 'shuma', 'y2': 'shuma:shouji'}
> {'y1': 'jiaoyu', 'y2': 'jiaoyu:gaokao'}
> {'y1': 'yule', 'y2': 'yule:mingxing'}
> {code}
> Then I realized some functions, including get_json_object and json_tuple, do 
> not support parsing single-quoted JSON. They return null in this situation.
> I think such behavior is unfriendly to users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30330) Support single quotes json parsing for get_json_object and json_tuple

2019-12-23 Thread Fang Wen (Jira)
Fang Wen created SPARK-30330:


 Summary: Support single quotes json parsing for get_json_object 
and json_tuple
 Key: SPARK-30330
 URL: https://issues.apache.org/jira/browse/SPARK-30330
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.4, 2.4.3
Reporter: Fang Wen


I execute a query such as:
{code:java}
 select get_json_object(ytag, '$.y1') AS y1 from t4{code}
SparkSQL returns null but Hive returns the correct results.
In my production environment, ytag is a JSON string wrapped in single quotes, as follows:
{code:java}
{'y1': 'shuma', 'y2': 'shuma:shouji'}
{'y1': 'jiaoyu', 'y2': 'jiaoyu:gaokao'}
{'y1': 'yule', 'y2': 'yule:mingxing'}
{code}
Then I realized some functions, including get_json_object and json_tuple, do 
not support parsing single-quoted JSON. They return null in this situation.

I think such behavior is unfriendly to users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30328) Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files

2019-12-23 Thread chendihao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chendihao updated SPARK-30328:
--
Description: 
We find that incorrect Hadoop configuration files cause saving an RDD to the 
local file system to fail. This is not expected because we have specified a 
local URL, and the DataFrame.write.text API does not have this issue. It is 
easy to reproduce and verify with Spark 2.3.0.

1. Do not set the `HADOOP_CONF_DIR` environment variable.

2. Install pyspark and run the local Python script below. This should work and 
save files to the local file system.
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3])
rdd.saveAsTextFile("file:///tmp/rdd.text")
{code}
3. Set the `HADOOP_CONF_DIR` environment variable and put the Hadoop 
configuration files there. Make sure the format of `core-site.xml` is right but 
that it contains an unresolved host name.

4. Run the same Python script again. It tries to connect to HDFS, finds the 
unresolved host name, and a Java exception happens.

We think `saveAsTextFile("file:///")` should not attempt to connect to HDFS no 
matter whether `HADOOP_CONF_DIR` is set. Actually, the following code works with 
the same incorrect Hadoop configuration files.
{code:java}
from pyspark.sql import SparkSession

rows = [("a", "1"), ("b", "2")]  # example rows; not defined in the original snippet
spark = SparkSession.builder.master("local").getOrCreate()
df = spark.createDataFrame(rows, ["attribute", "value"])
df.write.parquet("file:///tmp/df.parquet")
{code}

> Fail to write local files with RDD.saveTextFile when setting the incorrect 
> Hadoop configuration files
> -
>
> Key: SPARK-30328
> URL: https://issues.apache.org/jira/browse/SPARK-30328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: chendihao
>Priority: Major
>
> We find that incorrect Hadoop configuration files cause saving an RDD to the 
> local file system to fail. This is not expected because we have specified a 
> local URL, and the DataFrame.write.text API does not have this issue. It is 
> easy to reproduce and verify with Spark 2.3.0.
> 1. Do not set the `HADOOP_CONF_DIR` environment variable.
> 2. Install pyspark and run the local Python script below. This should work and 
> save files to the local file system.
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> sc = spark.sparkContext
> rdd = sc.parallelize([1, 2, 3])
> rdd.saveAsTextFile("file:///tmp/rdd.text")
> {code}
> 3. Set the `HADOOP_CONF_DIR` environment variable and put the Hadoop 
> configuration files there. Make sure the format of `core-site.xml` is right 
> but that it contains an unresolved host name.
> 4. Run the same Python script again. It tries to connect to HDFS, finds the 
> unresolved host name, and a Java exception happens.
> We think `saveAsTextFile("file:///")` should not attempt to connect to HDFS no 
> matter whether `HADOOP_CONF_DIR` is set. Actually, the following code works 
> with the same incorrect Hadoop configuration files.
> {code:java}
> from pyspark.sql import SparkSession
> rows = [("a", "1"), ("b", "2")]  # example rows; not defined in the original snippet
> spark = SparkSession.builder.master("local").getOrCreate()
> df = spark.createDataFrame(rows, ["attribute", "value"])
> df.write.parquet("file:///tmp/df.parquet")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org