[jira] [Commented] (SPARK-28990) SparkSQL invalid call to toAttribute on unresolved object, tree: *
[ https://issues.apache.org/jira/browse/SPARK-28990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002698#comment-17002698 ] Wenchao Wu commented on SPARK-28990: [~lucusguo] [~xiaozhang] me too > SparkSQL invalid call to toAttribute on unresolved object, tree: * > -- > > Key: SPARK-28990 > URL: https://issues.apache.org/jira/browse/SPARK-28990 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: fengchaoge >Priority: Major > > A SparkSQL CREATE TABLE AS SELECT from a table that may not exist throws an > exception like: > {code} > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > toAttribute on unresolved object, tree: > {code} > This is not user-friendly; a Spark user may have no idea what is wrong. > A simple SQL statement can reproduce it, like this: > {code} > spark-sql (default)> create table default.spark as select * from default.dual; > {code} > {code} > 2019-09-05 16:27:24,127 INFO (main) [Logging.scala:logInfo(54)] - Parsing > command: create table default.spark as select * from default.dual > 2019-09-05 16:27:24,772 ERROR (main) [Logging.scala:logError(91)] - Failed in > [create table default.spark as select * from default.dual] > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > toAttribute on unresolved object, tree: * > at > org.apache.spark.sql.catalyst.analysis.Star.toAttribute(unresolved.scala:245) > at > org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52) > at > org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at 
scala.collection.immutable.List.map(List.scala:296) > at > org.apache.spark.sql.catalyst.plans.logical.Project.output(basicLogicalOperators.scala:52) > at > org.apache.spark.sql.hive.HiveAnalysis$$anonfun$apply$3.applyOrElse(HiveStrategies.scala:160) > at > org.apache.spark.sql.hive.HiveAnalysis$$anonfun$apply$3.applyOrElse(HiveStrategies.scala:148) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:107) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:106) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsDown(AnalysisHelper.scala:106) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperators(AnalysisHelper.scala:73) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29) > at org.apache.spark.sql.hive.HiveAnalysis$.apply(HiveStrategies.scala:148) > at org.apache.spark.sql.hive.HiveAnalysis$.apply(HiveStrategies.scala:147) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84) > at > 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:127) > at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.sc
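The stack trace above shows HiveAnalysis asking Project.output for attributes while the plan still contains an unresolved '*' (the source table lookup failed earlier). A toy Python sketch (illustration only, not Spark source; class names and messages are invented) of why checking resolution first would surface a friendlier error:

```python
# Toy illustration only -- not Spark code. An unresolved '*' node raises an
# internal-looking error when asked for its attributes; checking resolution
# first lets the analyzer report the real problem (e.g. a missing table).
class Star:
    resolved = False  # '*' stays unresolved when the source table is missing

    def to_attribute(self):
        raise TypeError(
            "Invalid call to toAttribute on unresolved object, tree: *")


class Project:
    def __init__(self, expressions):
        self.expressions = expressions

    def output_unfriendly(self):
        # What the stack trace shows: toAttribute is called unconditionally.
        return [e.to_attribute() for e in self.expressions]

    def output_friendly(self):
        # Hypothetical alternative: fail with a user-facing message instead.
        for e in self.expressions:
            if not e.resolved:
                raise ValueError(
                    "cannot resolve '*': the source table was not found")
        return [e.to_attribute() for e in self.expressions]
```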
[jira] [Commented] (SPARK-28990) SparkSQL invalid call to toAttribute on unresolved object, tree: *
[ https://issues.apache.org/jira/browse/SPARK-28990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002696#comment-17002696 ] Xiao Zhang commented on SPARK-28990: [~fengchaoge] me too
[jira] [Commented] (SPARK-28990) SparkSQL invalid call to toAttribute on unresolved object, tree: *
[ https://issues.apache.org/jira/browse/SPARK-28990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002695#comment-17002695 ] lucusguo commented on SPARK-28990: -- but, I cannot reproduce it in spark2.4.3
[jira] [Created] (SPARK-30342) Update LIST JAR/FILE command
Rakesh Raushan created SPARK-30342: -- Summary: Update LIST JAR/FILE command Key: SPARK-30342 URL: https://issues.apache.org/jira/browse/SPARK-30342 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Rakesh Raushan LIST FILE/JAR command is not documented properly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30333) Bump jackson-databind to 2.6.7.3
[ https://issues.apache.org/jira/browse/SPARK-30333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-30333. -- Fix Version/s: 2.4.5 Assignee: Sandeep Katta Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/26986] > Bump jackson-databind to 2.6.7.3 > -- > > Key: SPARK-30333 > URL: https://issues.apache.org/jira/browse/SPARK-30333 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Sandeep Katta >Assignee: Sandeep Katta >Priority: Major > Fix For: 2.4.5 > > > To fix the CVEs below: > > CVE-2018-14718 > CVE-2018-14719 > CVE-2018-14720 > CVE-2018-14721 > CVE-2018-19360 > CVE-2018-19361 > CVE-2018-19362 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30341) check overflow for interval arithmetic operations
Kent Yao created SPARK-30341: Summary: check overflow for interval arithmetic operations Key: SPARK-30341 URL: https://issues.apache.org/jira/browse/SPARK-30341 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao The interval arithmetic functions (add/subtract/negate/multiply/divide) should enable overflow checking when ANSI mode is on; when ANSI mode is off, add/subtract/negate should return NULL on overflow, as multiply/divide already do. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
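The requested semantics can be sketched as follows (a toy Python model, not Spark source; Spark's intervals carry months/days/microseconds, reduced here to a single 32-bit months field for illustration): ANSI on means an overflow raises, ANSI off means the result becomes NULL.

```python
# Toy sketch of the proposed overflow semantics -- not Spark's implementation.
# With ANSI mode on, an overflowing interval operation should raise; with
# ANSI mode off it should yield NULL (None here), as multiply/divide
# already do for intervals.
INT_MIN, INT_MAX = -2**31, 2**31 - 1  # interval months fit a 32-bit int

def multiply_interval_months(months, factor, ansi_enabled):
    result = months * factor
    if INT_MIN <= result <= INT_MAX:
        return result          # no overflow: normal result
    if ansi_enabled:
        raise ArithmeticError("interval arithmetic overflow")  # ANSI: fail
    return None                # non-ANSI: overflow becomes NULL
```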
[jira] [Updated] (SPARK-30340) Python tests failed on arm64/x86
[ https://issues.apache.org/jira/browse/SPARK-30340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huangtianhua updated SPARK-30340: - Summary: Python tests failed on arm64/x86 (was: Python tests failed on arm64 ) > Python tests failed on arm64/x86 > > > Key: SPARK-30340 > URL: https://issues.apache.org/jira/browse/SPARK-30340 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Major > > Jenkins job spark-master-test-python-arm failed after the commit > c6ab7165dd11a0a7b8aea4c805409088e9a41a74: > File > "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py", > line 2790, in __main__.FMClassifier > Failed example: > model.transform(test0).select("features", "probability").show(10, False) > Expected: > +--++ > |features|probability| > +--++ > |[-1.0]|[0.97574736,2.425264676902229E-10]| > |[0.5]|[0.47627851732981163,0.5237214826701884]| > |[1.0]|[5.491554426243495E-4,0.9994508445573757]| > |[2.0]|[2.00573870645E-10,0.97994233]| > +--++ > Got: > +--++ > |features|probability| > +--++ > |[-1.0]|[0.97574736,2.425264676902229E-10]| > |[0.5]|[0.47627851732981163,0.5237214826701884]| > |[1.0]|[5.491554426243495E-4,0.9994508445573757]| > |[2.0]|[2.00573870645E-10,0.97994233]| > +--++ > > ** > File > "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py", > line 2803, in __main__.FMClassifier > Failed example: > model.factors > Expected: > DenseMatrix(1, 2, [0.0028, 0.0048], 1) > Got: > DenseMatrix(1, 2, [-0.0122, 0.0106], 1) > ** > 2 of 10 in __main__.FMClassifier > ***Test Failed*** 2 failures. 
> > For details, see > [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/91/console] > And it seems the tests also failed on x86: > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115668/console] > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115665/console] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
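A generic note on why doctests like the one above drift (a hypothetical illustration, not a diagnosis of this particular failure): a doctest that hard-codes learned model parameters only stays green if fitting is deterministic, for example via a fixed random seed. A toy sketch:

```python
# Generic illustration -- unrelated to Spark internals. Hard-coded expected
# values in a model-fitting doctest require deterministic fitting; a fixed
# seed makes repeated fits produce identical "learned" values.
import random

def fit_toy_factors(seed=None):
    rng = random.Random(seed)
    # Stand-in for learned model factors: two pseudo-random weights.
    return [rng.uniform(-1, 1), rng.uniform(-1, 1)]
```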
[jira] [Updated] (SPARK-30340) Python tests failed on arm64
[ https://issues.apache.org/jira/browse/SPARK-30340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huangtianhua updated SPARK-30340: - Description: Jenkins job spark-master-test-python-arm failed after the commit c6ab7165dd11a0a7b8aea4c805409088e9a41a74: File "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py", line 2790, in __main__.FMClassifier Failed example: model.transform(test0).select("features", "probability").show(10, False) Expected: +--++ |features|probability| +--++ |[-1.0]|[0.97574736,2.425264676902229E-10]| |[0.5]|[0.47627851732981163,0.5237214826701884]| |[1.0]|[5.491554426243495E-4,0.9994508445573757]| |[2.0]|[2.00573870645E-10,0.97994233]| +--++ Got: +--++ |features|probability| +--++ |[-1.0]|[0.97574736,2.425264676902229E-10]| |[0.5]|[0.47627851732981163,0.5237214826701884]| |[1.0]|[5.491554426243495E-4,0.9994508445573757]| |[2.0]|[2.00573870645E-10,0.97994233]| +--++ ** File "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py", line 2803, in __main__.FMClassifier Failed example: model.factors Expected: DenseMatrix(1, 2, [0.0028, 0.0048], 1) Got: DenseMatrix(1, 2, [-0.0122, 0.0106], 1) ** 2 of 10 in __main__.FMClassifier ***Test Failed*** 2 failures. 
The details see [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/91/console] And seems the tests failed on x86: [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115668/console] [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115665/console] was: Jenkins job spark-master-test-python-arm failed after the commit c6ab7165dd11a0a7b8aea4c805409088e9a41a74: File "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py", line 2790, in __main__.FMClassifier Failed example: model.transform(test0).select("features", "probability").show(10, False) Expected: +-+-+ |features|probability| +-+-+ |[-1.0]|[0.97574736,2.425264676902229E-10]| |[0.5]|[0.47627851732981163,0.5237214826701884]| |[1.0]|[5.491554426243495E-4,0.9994508445573757]| |[2.0]|[2.00573870645E-10,0.97994233]| +-+-+ Got: +-+-+ |features|probability| +-+-+ |[-1.0]|[0.97574736,2.425264676902229E-10]| |[0.5]|[0.47627851732981163,0.5237214826701884]| |[1.0]|[5.491554426243495E-4,0.9994508445573757]| |[2.0]|[2.00573870645E-10,0.97994233]| +-+-+ ** File "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py", line 2803, in __main__.FMClassifier Failed example: model.factors Expected: DenseMatrix(1, 2, [0.0028, 0.0048], 1) Got: DenseMatrix(1, 2, [-0.0122, 0.0106], 1) ** 2 of 10 in __main__.FMClassifier ***Test Failed*** 2 failures. 
The details see [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/91/console]
[jira] [Commented] (SPARK-30328) Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files
[ https://issues.apache.org/jira/browse/SPARK-30328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002613#comment-17002613 ] Ankit Raj Boudh commented on SPARK-30328: - Thank you [~tobe], I will analyse this issue and update you > Fail to write local files with RDD.saveTextFile when setting the incorrect > Hadoop configuration files > - > > Key: SPARK-30328 > URL: https://issues.apache.org/jira/browse/SPARK-30328 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: chendihao >Priority: Major > > We find that incorrect Hadoop configuration files cause saving an RDD to the > local file system to fail. This is not expected because we have specified a > local URL, and the DataFrame.write.text API does not have this issue. > It is easy to reproduce and verify with Spark 2.3.0. > 1. Do not set the environment variable `HADOOP_CONF_DIR`. > 2. Install pyspark and run the local Python script. This should work and save > files to the local file system. > {code:java} > from pyspark.sql import SparkSession > spark = SparkSession.builder.master("local").getOrCreate() > sc = spark.sparkContext > rdd = sc.parallelize([1, 2, 3]) > rdd.saveAsTextFile("file:///tmp/rdd.text") > {code} > 3. Set the environment variable `HADOOP_CONF_DIR` and put the Hadoop > configuration files there. Make sure the format of `core-site.xml` is right > but it has an unresolved host name. > 4. Run the same Python script again. It tries to connect to HDFS, finds the > unresolved host name, and a Java exception occurs. > We think `saveAsTextFile("file:///...")` should not attempt to connect to HDFS > regardless of whether `HADOOP_CONF_DIR` is set. Actually, the following > DataFrame code works with the same incorrect Hadoop configuration files. 
> {code:java} > from pyspark.sql import SparkSession > spark = SparkSession.builder.master("local").getOrCreate() > df = spark.createDataFrame(rows, ["attribute", "value"]) > df.write.parquet("file:///tmp/df.parquet") > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
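The reporter's expectation amounts to scheme-based filesystem dispatch, which can be sketched as follows (toy Python, not Hadoop/Spark source; the class names are illustrative): a `file://` URI should resolve to the local filesystem without consulting `core-site.xml` at all.

```python
# Toy sketch of scheme-based filesystem resolution -- not Hadoop/Spark code.
# The reporter's expectation: a file:// URL never needs the HDFS client, so
# a bad core-site.xml should be irrelevant to saveAsTextFile("file:///...").
from urllib.parse import urlparse

def filesystem_for(path):
    scheme = urlparse(path).scheme or "file"  # no scheme -> local filesystem
    if scheme == "file":
        return "LocalFileSystem"          # no Hadoop config consulted
    if scheme == "hdfs":
        return "DistributedFileSystem"    # would read core-site.xml here
    raise ValueError("unsupported scheme: " + scheme)
```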
[jira] [Updated] (SPARK-30328) Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files
[ https://issues.apache.org/jira/browse/SPARK-30328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chendihao updated SPARK-30328: -- Description: We find that incorrect Hadoop configuration files cause saving an RDD to the local file system to fail. This is not expected because we have specified a local URL, and the DataFrame.write.text API does not have this issue. It is easy to reproduce and verify with Spark 2.3.0. 1. Do not set the environment variable `HADOOP_CONF_DIR`. 2. Install pyspark and run the local Python script. This should work and save files to the local file system. {code:java} from pyspark.sql import SparkSession spark = SparkSession.builder.master("local").getOrCreate() sc = spark.sparkContext rdd = sc.parallelize([1, 2, 3]) rdd.saveAsTextFile("file:///tmp/rdd.text") {code} 3. Set the environment variable `HADOOP_CONF_DIR` and put the Hadoop configuration files there. Make sure the format of `core-site.xml` is right but it has an unresolved host name. 4. Run the same Python script again. It tries to connect to HDFS, finds the unresolved host name, and a Java exception occurs. We think `saveAsTextFile("file:///...")` should not attempt to connect to HDFS regardless of whether `HADOOP_CONF_DIR` is set. Actually, the following DataFrame code works with the same incorrect Hadoop configuration files. {code:java} from pyspark.sql import SparkSession spark = SparkSession.builder.master("local").getOrCreate() df = spark.createDataFrame(rows, ["attribute", "value"]) df.write.parquet("file:///tmp/df.parquet") {code} was: We find that the incorrect Hadoop configuration files cause the failure of saving RDD to local file system. It is not expected because we have specify the local url and the API of DataFrame.write.text does not have this issue. It is easy to reproduce and verify with Spark 2.3.0. 1.Do not set environment variable of `HADOOP_CONF_DIR`. 2.Install pyspark and run the local Python script. 
This should work and save files to local file system. {code:java} from pyspark.sql import SparkSession spark = SparkSession.builder.master("local").getOrCreate() sc = spark.sparkContext rdd = sc.parallelize([1, 2, 3]) rdd.saveAsTextFile("file:///tmp/rdd.text") {code} 3.Set environment variable of `HADOOP_CONF_DIR` and put the Hadoop configuration files there. Make sure the format of `core-site.xml` is right but it has an unresolved host name. 4.Run the same Python script again. If it try to connect HDFS and found the unresolved host name, Java exception happens. We thinks `saveAsTextFile("file:///)` should not attempt to connect HDFS not matter `HADOOP_CONF_DIR` is set. Actually the following code will work with the same incorrect Hadoop configuration files. {code:java} from pyspark.sql import SparkSession spark = SparkSession.builder.master("local").getOrCreate() df = spark.createDataFrame(rows, ["attribute", "value"]) df.write.parquet("file:///tmp/df.parquet") {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30328) Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files
[ https://issues.apache.org/jira/browse/SPARK-30328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002609#comment-17002609 ] chendihao commented on SPARK-30328: --- Of course and thanks [~Ankitraj] . We don't have time to dig into the source code but I think it may to problem of initialing Hadoop client before parsing the local filesystem url. > Fail to write local files with RDD.saveTextFile when setting the incorrect > Hadoop configuration files > - > > Key: SPARK-30328 > URL: https://issues.apache.org/jira/browse/SPARK-30328 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: chendihao >Priority: Major > > We find that the incorrect Hadoop configuration files cause the failure of > saving RDD to local file system. It is not expected because we have specify > the local url and the API of DataFrame.write.text does not have this issue. > It is easy to reproduce and verify with Spark 2.3.0. > 1.Do not set environment variable of `HADOOP_CONF_DIR`. > 2.Install pyspark and run the local Python script. This should work and save > files to local file system. > {code:java} > from pyspark.sql import SparkSession > spark = SparkSession.builder.master("local").getOrCreate() > sc = spark.sparkContextrdd = sc.parallelize([1, 2, 3]) > rdd.saveAsTextFile("file:///tmp/rdd.text") > {code} > 3.Set environment variable of `HADOOP_CONF_DIR` and put the Hadoop > configuration files there. Make sure the format of `core-site.xml` is right > but it has an unresolved host name. > 4.Run the same Python script again. If it try to connect HDFS and found the > unresolved host name, Java exception happens. > We thinks `saveAsTextFile("file:///)` should not attempt to connect HDFS not > matter `HADOOP_CONF_DIR` is set. Actually the following code will work with > the same incorrect Hadoop configuration files. 
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> rows = [("attr1", "v1"), ("attr2", "v2")]  # sample rows (not in the original report)
> df = spark.createDataFrame(rows, ["attribute", "value"])
> df.write.parquet("file:///tmp/df.parquet")
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
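The reporter's hypothesis (that the Hadoop client is initialized before the local filesystem URL is parsed) can be illustrated with a small scheme-dispatch sketch in plain Python; none of these names are Spark's or Hadoop's actual APIs, it only shows the ordering that would avoid the bug.

```python
from urllib.parse import urlparse

# Hypothetical filesystem registry; the "hdfs" entry stands in for building
# a client from core-site.xml, which fails when the host is unresolvable.
def connect_hdfs():
    raise RuntimeError("unresolvable host in core-site.xml")

HANDLERS = {
    "file": lambda path: ("local", path),
    "hdfs": lambda path: connect_hdfs(),
}

def save_text_file(url):
    # Parse the scheme BEFORE initializing any Hadoop client, so a
    # file:// URL never touches the (possibly broken) HDFS config.
    parsed = urlparse(url)
    return HANDLERS[parsed.scheme or "file"](parsed.path)

# A local URL resolves without consulting the HDFS configuration at all.
assert save_text_file("file:///tmp/rdd.text") == ("local", "/tmp/rdd.text")
```

If the scheme dispatch happened after client initialization instead, the broken `hdfs` setup would fail even for `file://` URLs, which matches the reported behavior.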
[jira] [Comment Edited] (SPARK-30328) Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files
[ https://issues.apache.org/jira/browse/SPARK-30328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002609#comment-17002609 ] chendihao edited comment on SPARK-30328 at 12/24/19 2:57 AM: - Of course, and thanks [~Ankitraj]. We don't have time to dig into the source code, but I think it may be a problem of initializing the Hadoop client before parsing the local filesystem URL. was (Author: tobe): Of course and thanks [~Ankitraj] . We don't have time to dig into the source code but I think it may to problem of initialing Hadoop client before parsing the local filesystem url.
> Fail to write local files with RDD.saveTextFile when setting the incorrect
> Hadoop configuration files
> -
>
> Key: SPARK-30328
> URL: https://issues.apache.org/jira/browse/SPARK-30328
>
[jira] [Updated] (SPARK-30340) Python tests failed on arm64
[ https://issues.apache.org/jira/browse/SPARK-30340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huangtianhua updated SPARK-30340: - Description: Jenkins job spark-master-test-python-arm failed after the commit c6ab7165dd11a0a7b8aea4c805409088e9a41a74: File "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py", line 2790, in __main__.FMClassifier Failed example: model.transform(test0).select("features", "probability").show(10, False) Expected: +-+-+ |features|probability| +-+-+ |[-1.0]|[0.97574736,2.425264676902229E-10]| |[0.5]|[0.47627851732981163,0.5237214826701884]| |[1.0]|[5.491554426243495E-4,0.9994508445573757]| |[2.0]|[2.00573870645E-10,0.97994233]| +-+-+ Got: +-+-+ |features|probability| +-+-+ |[-1.0]|[0.97574736,2.425264676902229E-10]| |[0.5]|[0.47627851732981163,0.5237214826701884]| |[1.0]|[5.491554426243495E-4,0.9994508445573757]| |[2.0]|[2.00573870645E-10,0.97994233]| +-+-+ ** File "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py", line 2803, in __main__.FMClassifier Failed example: model.factors Expected: DenseMatrix(1, 2, [0.0028, 0.0048], 1) Got: DenseMatrix(1, 2, [-0.0122, 0.0106], 1) ** 2 of 10 in __main__.FMClassifier ***Test Failed*** 2 failures. 
For details, see [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/91/console]. was: Jenkins job spark-master-test-python-arm failed after the commit c6ab7165dd11a0a7b8aea4c805409088e9a41a74: File "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py", line 2790, in __main__.FMClassifier Failed example: model.transform(test0).select("features", "probability").show(10, False) Expected: ++--+ |features|probability | ++--+ |[-1.0] |[0.97574736,2.425264676902229E-10]| |[0.5] |[0.47627851732981163,0.5237214826701884] | |[1.0] |[5.491554426243495E-4,0.9994508445573757] | |[2.0] |[2.00573870645E-10,0.97994233]| ++--+ Got: ++--+ |features|probability | ++--+ |[-1.0] |[0.97574736,2.425264676902229E-10]| |[0.5] |[0.47627851732981163,0.5237214826701884] | |[1.0] |[5.491554426243495E-4,0.9994508445573757] | |[2.0] |[2.00573870645E-10,0.97994233]| ++--+ ** File "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py", line 2803, in __main__.FMClassifier Failed example: model.factors Expected: DenseMatrix(1, 2, [0.0028, 0.0048], 1) Got: DenseMatrix(1, 2, [-0.0122, 0.0106], 1) ** 2 of 10 in __main__.FMClassifier ***Test Failed*** 2 failures. 
> Python tests failed on arm64 > - > > Key: SPARK-30340 > URL: https://issues.apache.org/jira/browse/SPARK-30340 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Major > > Jenkins job spark-master-test-python-arm failed after the commit > c6ab7165dd11a0a7b8aea4c805409088e9a41a74: > File > "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py", > line 2790, in __main__.FMClassifier > Failed example: > model.transform(test0).select("features", "probability").show(10, False) > Expected: > +-+-+ > |features|probability| > +-+-+ > |[-1.0]|[0.97574736,2.425264676902229E-10]| > |[0.5]|[0.47627851732981163,0.5237214826701884]| > |[1.0]|[5.491554426243495E-4,0.9994508445573757]| > |[2.0]|[2.00573870645E-10,0.97994233]| > +-+-+ > Got: > +-+-+ > |features|probability| > +-+-+ > |[-1.0]|[0.97574736,2.425264676902229E-10]| > |[0.5]|[0.47627851732981163,0.5237214826701884]| > |[1.0]|[5.491554426243495E-4,0.9994508445573757]| > |[2.0]|[2.005
[jira] [Updated] (SPARK-30339) Avoid to fail twice in function lookup
[ https://issues.apache.org/jira/browse/SPARK-30339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-30339: - Description: Currently, if a function lookup fails, Spark gives it a second chance by casting decimal types to double types. But in cases where no decimal type is involved, it is pointless to look up again, and doing so incurs extra cost such as unnecessary metastore access. We should throw the exception directly in these cases. (was: Currently if function lookup fails, spark will give it a second change by casting decimal type to double type. But for cases where decimal type doesn't exist, it's meaningless to lookup again and causes extra cost like unnecessary metastore access.)
> Avoid to fail twice in function lookup
> --
>
> Key: SPARK-30339
> URL: https://issues.apache.org/jira/browse/SPARK-30339
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.5, 3.0.0
> Reporter: Zhenhua Wang
> Priority: Minor
>
> Currently, if a function lookup fails, Spark gives it a second chance by
> casting decimal types to double types. But in cases where no decimal type is
> involved, it is pointless to look up again, and doing so incurs extra cost
> such as unnecessary metastore access. We should throw the exception directly
> in these cases.
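The proposed behavior can be sketched as a toy resolver in plain Python (not Spark's actual FunctionRegistry or analyzer): retry with decimal-to-double coercion only when a decimal argument is actually present, and otherwise fail on the first miss so no second metastore round-trip happens.

```python
from decimal import Decimal

# Toy registry: maps (name, argument types) to an implementation.
REGISTRY = {("plus_one", (float,)): lambda x: x + 1.0}
lookups = []  # tracks how many (possibly remote) lookups were performed

def lookup(name, args):
    lookups.append(name)
    return REGISTRY.get((name, tuple(type(a) for a in args)))

def resolve(name, args):
    fn = lookup(name, args)
    if fn is None and any(isinstance(a, Decimal) for a in args):
        # Second chance: coerce decimal arguments to double and retry.
        args = [float(a) if isinstance(a, Decimal) else a for a in args]
        fn = lookup(name, args)
    if fn is None:
        raise ValueError(f"undefined function {name}")  # fail once, no retry
    return fn(*args)

assert resolve("plus_one", [Decimal("1")]) == 2.0 and len(lookups) == 2
lookups.clear()
try:
    resolve("no_such_fn", ["x"])  # no decimal argument: exactly one lookup
except ValueError:
    pass
assert len(lookups) == 1
```

The key point is the `any(isinstance(a, Decimal) ...)` guard: without it, every failed lookup would pay for a second one.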
[jira] [Created] (SPARK-30340) Python tests failed on arm64
huangtianhua created SPARK-30340: Summary: Python tests failed on arm64 Key: SPARK-30340 URL: https://issues.apache.org/jira/browse/SPARK-30340 Project: Spark Issue Type: Bug Components: ML Affects Versions: 3.0.0 Reporter: huangtianhua Jenkins job spark-master-test-python-arm failed after the commit c6ab7165dd11a0a7b8aea4c805409088e9a41a74: File "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py", line 2790, in __main__.FMClassifier Failed example: model.transform(test0).select("features", "probability").show(10, False) Expected: ++--+ |features|probability | ++--+ |[-1.0] |[0.97574736,2.425264676902229E-10]| |[0.5] |[0.47627851732981163,0.5237214826701884] | |[1.0] |[5.491554426243495E-4,0.9994508445573757] | |[2.0] |[2.00573870645E-10,0.97994233]| ++--+ Got: ++--+ |features|probability | ++--+ |[-1.0] |[0.97574736,2.425264676902229E-10]| |[0.5] |[0.47627851732981163,0.5237214826701884] | |[1.0] |[5.491554426243495E-4,0.9994508445573757] | |[2.0] |[2.00573870645E-10,0.97994233]| ++--+ ** File "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py", line 2803, in __main__.FMClassifier Failed example: model.factors Expected: DenseMatrix(1, 2, [0.0028, 0.0048], 1) Got: DenseMatrix(1, 2, [-0.0122, 0.0106], 1) ** 2 of 10 in __main__.FMClassifier ***Test Failed*** 2 failures. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30339) Avoid to fail twice in function lookup
Zhenhua Wang created SPARK-30339: Summary: Avoid to fail twice in function lookup Key: SPARK-30339 URL: https://issues.apache.org/jira/browse/SPARK-30339 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.5, 3.0.0 Reporter: Zhenhua Wang Currently, if a function lookup fails, Spark gives it a second chance by casting decimal types to double types. But in cases where no decimal type is involved, it is pointless to look up again, and doing so incurs extra cost such as unnecessary metastore access.
[jira] [Commented] (SPARK-30328) Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files
[ https://issues.apache.org/jira/browse/SPARK-30328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002585#comment-17002585 ] Ankit Raj Boudh commented on SPARK-30328: - @chendihao, can I check this issue?
> Fail to write local files with RDD.saveTextFile when setting the incorrect
> Hadoop configuration files
> -
>
> Key: SPARK-30328
> URL: https://issues.apache.org/jira/browse/SPARK-30328
>
[jira] [Created] (SPARK-30338) Avoid unnecessary InternalRow copies in ParquetRowConverter
Josh Rosen created SPARK-30338: -- Summary: Avoid unnecessary InternalRow copies in ParquetRowConverter Key: SPARK-30338 URL: https://issues.apache.org/jira/browse/SPARK-30338 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Josh Rosen Assignee: Josh Rosen ParquetRowConverter calls {{InternalRow.copy()}} in cases where the copy is unnecessary; this can severely harm performance when reading deeply-nested Parquet. It looks like this copying was originally added to handle arrays and maps of structs (in which case we need to keep the copying), but we can omit it for the more common case of structs nested directly in structs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
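Why the copy is still needed for arrays and maps of structs can be shown with a minimal sketch of a converter that reuses one mutable row buffer (plain Python, not Spark's actual ParquetRowConverter classes): storing the reused object without copying makes every collection element alias the last value written.

```python
# Minimal stand-in for a reusable mutable row, as converters typically use.
class MutableRow:
    def __init__(self):
        self.values = []

    def copy(self):
        snapshot = MutableRow()
        snapshot.values = list(self.values)
        return snapshot

reused = MutableRow()
no_copy, with_copy = [], []
for v in [1, 2, 3]:
    reused.values = [v]              # converter overwrites the same buffer
    no_copy.append(reused)           # aliasing bug: all entries share one row
    with_copy.append(reused.copy())  # safe: snapshot of current contents

# Every aliased entry reflects only the last value written...
assert [r.values for r in no_copy] == [[3], [3], [3]]
# ...while copies preserve each element, at the cost of extra allocation.
assert [r.values for r in with_copy] == [[1], [2], [3]]
```

For a struct nested directly in a struct there is a single current value rather than a growing collection, which is why the copy can be omitted in that case.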
[jira] [Commented] (SPARK-25603) Generalize Nested Column Pruning
[ https://issues.apache.org/jira/browse/SPARK-25603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002577#comment-17002577 ] Takeshi Yamamuro commented on SPARK-25603: -- Still WIP? Since we've finished implementing the basic part for nested column pruning, we can set this as resolved for now? cc: [~dongjoon] [~smilegator] > Generalize Nested Column Pruning > > > Key: SPARK-25603 > URL: https://issues.apache.org/jira/browse/SPARK-25603 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30337) Convert case class with var to normal class in spark-sql-kafka module
Jungtaek Lim created SPARK-30337: Summary: Convert case class with var to normal class in spark-sql-kafka module Key: SPARK-30337 URL: https://issues.apache.org/jira/browse/SPARK-30337 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Jungtaek Lim A review comment in SPARK-25151 pointed this out, but we decided to mark it as a TODO because that PR already had 300+ comments and we didn't want to drag it out further. This issue tracks the effort of addressing those TODO comments.
[jira] [Created] (SPARK-30336) Move Kafka consumer related classes to its own package
Jungtaek Lim created SPARK-30336: Summary: Move Kafka consumer related classes to its own package Key: SPARK-30336 URL: https://issues.apache.org/jira/browse/SPARK-30336 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Jungtaek Lim There are too many classes placed directly in the package "org.apache.spark.sql.kafka010"; they should be grouped by purpose. As part of the change in SPARK-21869, we moved producer-related classes out to "org.apache.spark.sql.kafka010.producer" and exposed only the necessary classes/methods outside that package. We can apply the same to consumer-related classes.
[jira] [Resolved] (SPARK-30120) LSH approxNearestNeighbors should use BoundedPriorityQueue when numNearestNeighbors is small
[ https://issues.apache.org/jira/browse/SPARK-30120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-30120. -- Resolution: Not A Problem > LSH approxNearestNeighbors should use BoundedPriorityQueue when > numNearestNeighbors is small > > > Key: SPARK-30120 > URL: https://issues.apache.org/jira/browse/SPARK-30120 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > ping [~huaxingao] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29245) CCE during creating HiveMetaStoreClient
[ https://issues.apache.org/jira/browse/SPARK-29245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002539#comment-17002539 ] Xiao Li commented on SPARK-29245: - Since JDK support is experimental, it is not a blocker of Spark 3.0. It only affects JDK 11 users based on my understanding. However, we should still fix it in 3.0 and let us target it to 3.0 > CCE during creating HiveMetaStoreClient > > > Key: SPARK-29245 > URL: https://issues.apache.org/jira/browse/SPARK-29245 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > From `master` branch build, when I try to connect to an external HMS, I hit > the following. > {code} > 19/09/25 10:58:46 ERROR hive.log: Got exception: java.lang.ClassCastException > class [Ljava.lang.Object; cannot be cast to class [Ljava.net.URI; > ([Ljava.lang.Object; and [Ljava.net.URI; are in module java.base of loader > 'bootstrap') > java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast to > class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module > java.base of loader 'bootstrap') > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:200) > at > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:70) > {code} > With HIVE-21508, I can get the following. > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.4) > Type in expressions to have them evaluated. > Type :help for more information. > scala> sql("show databases").show > ++ > |databaseName| > ++ > | . | > ... > {code} > With 2.3.7-SNAPSHOT, the following basic tests are tested. 
> - SHOW DATABASES / TABLES > - DESC DATABASE / TABLE > - CREATE / DROP / USE DATABASE > - CREATE / DROP / INSERT / LOAD / SELECT TABLE -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29245) CCE during creating HiveMetaStoreClient
[ https://issues.apache.org/jira/browse/SPARK-29245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-29245: Priority: Major (was: Blocker)
> CCE during creating HiveMetaStoreClient
>
> Key: SPARK-29245
> URL: https://issues.apache.org/jira/browse/SPARK-29245
>
[jira] [Commented] (SPARK-30316) data size boom after shuffle writing dataframe save as parquet
[ https://issues.apache.org/jira/browse/SPARK-30316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002529#comment-17002529 ] Xiao Li commented on SPARK-30316: - The compression ratio depends on your data layout, not on the number of rows.
> data size boom after shuffle writing dataframe save as parquet
> --
>
> Key: SPARK-30316
> URL: https://issues.apache.org/jira/browse/SPARK-30316
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, SQL
> Affects Versions: 2.4.4
> Reporter: Cesc
> Priority: Major
>
> When I read the same parquet file and save it in two ways, with a shuffle and
> without one, I find the sizes of the output parquet files are quite
> different. For example, for an original parquet file of 800 MB, saving
> without a shuffle keeps the size at 800 MB, whereas if I call repartition and
> then save in parquet format, the data size increases to 2.5 GB. The row
> counts, column counts, and contents of the two output files are all the same.
> I wonder:
> firstly, why does the data size increase after a repartition/shuffle?
> secondly, if I need to shuffle the input dataframe, how can I save it as a
> parquet file efficiently and avoid the data size boom?
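The layout effect behind this comment can be demonstrated with a tiny zlib sketch (parquet's encodings and codecs differ, but the principle is the same): the same values compress far better when clustered into runs, as in a sorted column chunk, than after a random shuffle.

```python
import random
import zlib

# A column whose values arrive clustered: long runs of repeated values,
# mimicking a sorted parquet column chunk.
values = [i // 1000 for i in range(100_000)]   # 0,0,...,1,1,...,99
clustered = bytes(v % 256 for v in values)

shuffled_values = values[:]
random.seed(42)                                # deterministic for the demo
random.shuffle(shuffled_values)                # mimics a repartition/shuffle
shuffled = bytes(v % 256 for v in shuffled_values)

# Same rows, same count; only the layout differs, yet the compressed
# sizes diverge sharply.
assert len(zlib.compress(clustered)) < len(zlib.compress(shuffled))
```

A commonly suggested mitigation (an assumption here, not from this thread) is to restore locality after the shuffle, e.g. `df.repartition(n).sortWithinPartitions("key")` before writing.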
[jira] [Updated] (SPARK-30316) data size boom after shuffle writing dataframe save as parquet
[ https://issues.apache.org/jira/browse/SPARK-30316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-30316: Priority: Major (was: Blocker)
> data size boom after shuffle writing dataframe save as parquet
>
> Key: SPARK-30316
> URL: https://issues.apache.org/jira/browse/SPARK-30316
>
[jira] [Resolved] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.
[ https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-21869. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26845 [https://github.com/apache/spark/pull/26845] > A cached Kafka producer should not be closed if any task is using it. > - > > Key: SPARK-21869 > URL: https://issues.apache.org/jira/browse/SPARK-21869 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4, 3.0.0 >Reporter: Shixiong Zhu >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.0.0 > > > Right now a cached Kafka producer may be closed if a large task uses it for > more than 10 minutes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.
[ https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-21869: -- Assignee: Jungtaek Lim (was: Gabor Somogyi) > A cached Kafka producer should not be closed if any task is using it. > - > > Key: SPARK-21869 > URL: https://issues.apache.org/jira/browse/SPARK-21869 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4, 3.0.0 >Reporter: Shixiong Zhu >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > Right now a cached Kafka producer may be closed if a large task uses it for > more than 10 minutes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30335) Clarify behavior of FIRST and LAST without OVER clause.
xqods9o5ekm3 created SPARK-30335: Summary: Clarify behavior of FIRST and LAST without OVER clause. Key: SPARK-30335 URL: https://issues.apache.org/jira/browse/SPARK-30335 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: xqods9o5ekm3 Unlike many databases, Spark SQL allows usage of {{FIRST}} and {{LAST}} in non-analytic contexts. At the moment the {{FIRST}} description
> first(expr[, isIgnoreNull]) - Returns the first value of {{expr}} for a group
> of rows. If {{isIgnoreNull}} is true, returns only non-null values.
and the {{LAST}} description
> last(expr[, isIgnoreNull]) - Returns the last value of {{expr}} for a group
> of rows. If {{isIgnoreNull}} is true, returns only non-null values.
suggest that their behavior is deterministic, and many users assume that they return specific values, for example for the query
{code:sql}
SELECT first(foo) FROM (
    SELECT * FROM table ORDER BY bar
)
{code}
That, however, is not guaranteed. To make the situation worse, it often appears to work (for example on small samples in local mode).
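A toy model in plain Python (not Spark's actual scheduler) of why this is not deterministic: an exchange splits the sorted subquery output into partitions, and downstream operators may consume those partitions in any order, so "the first row" depends on partition ordering, not on the ORDER BY.

```python
# Rows sorted by bar, i.e. the output of the ORDER BY subquery.
rows = sorted([(3, "c"), (1, "a"), (4, "d"), (2, "b")])

def repartition(data, n):
    # Round-robin split, as a stand-in for Spark's exchange operator.
    return [data[i::n] for i in range(n)]

parts = repartition(rows, 2)     # [[(1,'a'),(3,'c')], [(2,'b'),(4,'d')]]

# If the runtime happens to emit partition 1 before partition 0,
# the "first" row is no longer the globally smallest one.
reordered = parts[1] + parts[0]
assert rows[0] == (1, "a")
assert reordered[0] == (2, "b")  # first() would silently return this
```

A deterministic alternative is an explicit window, e.g. `first(foo) OVER (ORDER BY bar ...)`, which pins the ordering to the function itself.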
[jira] [Commented] (SPARK-27838) Support user provided non-nullable avro schema for nullable catalyst schema without any null record
[ https://issues.apache.org/jira/browse/SPARK-27838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002440#comment-17002440 ] Frank Lee commented on SPARK-27838: --- Hello Is there a workaround for this before this is released? Currently our avro schema is defined as (we are using avdl) {code:java} protocol Foo { record FooRecord { string something; string anotherthing; long count; } } {code} And AvroSerializer throw error "AvroRuntimeException: Not a union: "string"" > Support user provided non-nullable avro schema for nullable catalyst schema > without any null record > --- > > Key: SPARK-27838 > URL: https://issues.apache.org/jira/browse/SPARK-27838 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.3 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.0.0 > > > When the data is read from the sources, the catalyst schema is always > nullable. Since Avro uses Union type to represent nullable, when any > non-nullable avro file is read and then written out, the schema will always > be changed. This PR provides a solution for users to keep the Avro schema > without being forced to use Union type. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29224) Implement Factorization Machines as a ml-pipeline component
[ https://issues.apache.org/jira/browse/SPARK-29224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002402#comment-17002402 ] Ruslan Dautkhanov commented on SPARK-29224: --- E.g. would this work with 0.1m or 1m sparse features? > Implement Factorization Machines as a ml-pipeline component > --- > > Key: SPARK-29224 > URL: https://issues.apache.org/jira/browse/SPARK-29224 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.0 >Reporter: mob-ai >Assignee: mob-ai >Priority: Major > Fix For: 3.0.0 > > Attachments: url_loss.xlsx > > > Factorization Machines is widely used in advertising and recommendation > system to estimate CTR(click-through rate). > Advertising and recommendation system usually has a lot of data, so we need > Spark to estimate the CTR, and Factorization Machines are common ml model to > estimate CTR. > Goal: Implement Factorization Machines as a ml-pipeline component > Requirements: > 1. loss function supports: logloss, mse > 2. optimizer: mini batch SGD > References: > 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International > Conference on Data Mining (ICDM), pp. 995–1000, 2010. > https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27762) Support user provided avro schema for writing fields with different ordering
[ https://issues.apache.org/jira/browse/SPARK-27762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-27762: --- Assignee: DB Tsai > Support user provided avro schema for writing fields with different ordering > > > Key: SPARK-27762 > URL: https://issues.apache.org/jira/browse/SPARK-27762 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29224) Implement Factorization Machines as a ml-pipeline component
[ https://issues.apache.org/jira/browse/SPARK-29224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002398#comment-17002398 ] Ruslan Dautkhanov commented on SPARK-29224: --- That's great. Out of curiosity - what's largest number of features this was tested with? > Implement Factorization Machines as a ml-pipeline component > --- > > Key: SPARK-29224 > URL: https://issues.apache.org/jira/browse/SPARK-29224 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.0 >Reporter: mob-ai >Assignee: mob-ai >Priority: Major > Fix For: 3.0.0 > > Attachments: url_loss.xlsx > > > Factorization Machines is widely used in advertising and recommendation > system to estimate CTR(click-through rate). > Advertising and recommendation system usually has a lot of data, so we need > Spark to estimate the CTR, and Factorization Machines are common ml model to > estimate CTR. > Goal: Implement Factorization Machines as a ml-pipeline component > Requirements: > 1. loss function supports: logloss, mse > 2. optimizer: mini batch SGD > References: > 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International > Conference on Data Mining (ICDM), pp. 995–1000, 2010. > https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30334) Add metadata around semi-structured columns to Spark
[ https://issues.apache.org/jira/browse/SPARK-30334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-30334: Description: Semi-structured data is used widely in the data industry for reporting events in a wide variety of formats. Click events in product analytics can be stored as json. Some application logs can be in the form of delimited key=value text. Some data may be in xml. The goal of this project is to be able to signal Spark that such a column exists. This will then enable Spark to "auto-parse" these columns on the fly. The proposal is to store this information as part of the column metadata, in the fields: - format: The format of the semi-structured column, e.g. json, xml, avro - options: Options for parsing these columns Then imagine having the following data: {code:java} ++---++ | ts | event |raw | ++---++ | 2019-10-12 | click | {"field":"value"} | ++---++ {code} SELECT raw.field FROM data will return "value" or the following data {code:java} ++---+--+ | ts | event | raw | ++---+--+ | 2019-10-12 | click | field1=v1|field2=v2 | ++---+--+ {code} SELECT raw.field1 FROM data will return v1. As a first step, we will introduce the function "as_json", which accomplishes this for JSON columns. was: Semi-structured data is used widely in the data industry for reporting events in a wide variety of formats. Click events in product analytics can be stored as json. Some application logs can be in the form of delimited key=value text. Some data may be in xml. The goal of this project is to be able to signal Spark that such a column exists. This will then enable Spark to "auto-parse" these columns on the fly. The proposal is to store this information as part of the column metadata, in the fields: - format: The format of the semi-structured column, e.g. 
json, xml, avro - options: Options for parsing these columns Then imagine having the following data: {code:java} ++---++ | ts | event |raw | ++---++ | 2019-10-12 | click | {"field":"value"} | ++---++ {code} SELECT raw.field FROM data will return "value" or the following data {code:java} ++---+--+ | ts | event | raw | ++---+--+ | 2019-10-12 | click | field1=v1|field2=v2 | ++---+--+ {code} SELECT raw.field1 FROM data will return v1. > Add metadata around semi-structured columns to Spark > > > Key: SPARK-30334 > URL: https://issues.apache.org/jira/browse/SPARK-30334 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.4 >Reporter: Burak Yavuz >Priority: Major > > Semi-structured data is used widely in the data industry for reporting events > in a wide variety of formats. Click events in product analytics can be stored > as json. Some application logs can be in the form of delimited key=value > text. Some data may be in xml. > The goal of this project is to be able to signal Spark that such a column > exists. This will then enable Spark to "auto-parse" these columns on the fly. > The proposal is to store this information as part of the column metadata, in > the fields: > - format: The format of the semi-structured column, e.g. json, xml, avro > - options: Options for parsing these columns > Then imagine having the following data: > {code:java} > ++---++ > | ts | event |raw | > ++---++ > | 2019-10-12 | click | {"field":"value"} | > ++---++ {code} > SELECT raw.field FROM data > will return "value" > or the following data > {code:java} > ++---+--+ > | ts | event | raw | > ++---+--+ > | 2019-10-12 | click | field1=v1|field2=v2 | > ++---+--+ {code} > SELECT raw.field1 FROM data > will return v1. > > As a first step, we will introduce the function "as_json", which accomplishes > this for JSON columns. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
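The two raw formats in the example above (a JSON document and delimited key=value text) can each be parsed by an ordinary parser once the column's format is known; that is essentially what the proposed `format`/`options` metadata would let Spark select automatically. A hedged pure-Python sketch of the dispatch, not Spark's implementation; the `keyvalue` format name and option keys are made up for illustration:

```python
import json

def parse_semi_structured(raw, fmt, options=None):
    """Dispatch on the proposed 'format' metadata field to pick a parser."""
    if fmt == "json":
        return json.loads(raw)
    if fmt == "keyvalue":
        # e.g. "field1=v1|field2=v2"; delimiters would come from 'options'
        opts = options or {}
        pair_sep = opts.get("pair_sep", "|")
        kv_sep = opts.get("kv_sep", "=")
        return dict(pair.split(kv_sep, 1) for pair in raw.split(pair_sep))
    raise ValueError("unsupported format: %s" % fmt)
```

With this in place, `SELECT raw.field` and `SELECT raw.field1` from the examples reduce to a metadata lookup followed by a dictionary access.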
[jira] [Created] (SPARK-30334) Add metadata around semi-structured columns to Spark
Burak Yavuz created SPARK-30334: --- Summary: Add metadata around semi-structured columns to Spark Key: SPARK-30334 URL: https://issues.apache.org/jira/browse/SPARK-30334 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.4 Reporter: Burak Yavuz Semi-structured data is used widely in the data industry for reporting events in a wide variety of formats. Click events in product analytics can be stored as json. Some application logs can be in the form of delimited key=value text. Some data may be in xml. The goal of this project is to be able to signal Spark that such a column exists. This will then enable Spark to "auto-parse" these columns on the fly. The proposal is to store this information as part of the column metadata, in the fields: - format: The format of the semi-structured column, e.g. json, xml, avro - options: Options for parsing these columns Then imagine having the following data: {code:java} ++---++ | ts | event |raw | ++---++ | 2019-10-12 | click | {"field":"value"} | ++---++ {code} SELECT raw.field FROM data will return "value" or the following data {code:java} ++---+--+ | ts | event | raw | ++---+--+ | 2019-10-12 | click | field1=v1|field2=v2 | ++---+--+ {code} SELECT raw.field1 FROM data will return v1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26663) Cannot query a Hive table with subdirectories
[ https://issues.apache.org/jira/browse/SPARK-26663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002367#comment-17002367 ] Xiaoguang Wang commented on SPARK-26663: I met the same problem here. How can I debug it? > Cannot query a Hive table with subdirectories > - > > Key: SPARK-26663 > URL: https://issues.apache.org/jira/browse/SPARK-26663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Aäron >Priority: Major > > Hello, > > I want to report the following issue (my first one :) ) > When I create a table in Hive based on a union all, Spark 2.4 is unable > to query this table. > To reproduce: > *Hive 1.2.1* > {code:java} > hive> create table a(id int); > insert into a values(1); > hive> create table b(id int); > insert into b values(2); > hive> create table c(id int) as select id from a union all select id from b; > {code} > > *Spark 2.3.1* > > {code:java} > scala> spark.table("c").show > +---+ > | id| > +---+ > | 1| > | 2| > +---+ > scala> spark.table("c").count > res5: Long = 2 > {code} > > *Spark 2.4.0* > {code:java} > scala> spark.table("c").show > 19/01/18 17:00:49 WARN HiveMetastoreCatalog: Unable to infer schema for table > perftest_be.c from file format ORC (inference mode: INFER_AND_SAVE). Using > metastore schema. > +---+ > | id| > +---+ > +---+ > scala> spark.table("c").count > res3: Long = 0 > {code} > I did not find an existing issue for this. Might be important to investigate. > > +Extra info:+ Spark 2.3.1 and 2.4.0 use the same spark-defaults.conf. > > Kind regards. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
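For readers hitting the same symptom: workarounds commonly suggested elsewhere for Hive tables whose data lives in subdirectories involve disabling Spark's native ORC conversion and enabling recursive input listing. These settings are not confirmed in this thread; treat the property names and their effectiveness as a sketch to verify against your Spark/Hive versions.

```sql
-- Sketch only; verify these properties apply to your versions.
SET spark.sql.hive.convertMetastoreOrc=false;
SET mapreduce.input.fileinputformat.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
```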
[jira] [Resolved] (SPARK-29224) Implement Factorization Machines as a ml-pipeline component
[ https://issues.apache.org/jira/browse/SPARK-29224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29224. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26124 [https://github.com/apache/spark/pull/26124] > Implement Factorization Machines as a ml-pipeline component > --- > > Key: SPARK-29224 > URL: https://issues.apache.org/jira/browse/SPARK-29224 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.0 >Reporter: mob-ai >Assignee: mob-ai >Priority: Major > Fix For: 3.0.0 > > Attachments: url_loss.xlsx > > > Factorization Machines is widely used in advertising and recommendation > system to estimate CTR(click-through rate). > Advertising and recommendation system usually has a lot of data, so we need > Spark to estimate the CTR, and Factorization Machines are common ml model to > estimate CTR. > Goal: Implement Factorization Machines as a ml-pipeline component > Requirements: > 1. loss function supports: logloss, mse > 2. optimizer: mini batch SGD > References: > 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International > Conference on Data Mining (ICDM), pp. 995–1000, 2010. > https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29224) Implement Factorization Machines as a ml-pipeline component
[ https://issues.apache.org/jira/browse/SPARK-29224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29224: Assignee: mob-ai > Implement Factorization Machines as a ml-pipeline component > --- > > Key: SPARK-29224 > URL: https://issues.apache.org/jira/browse/SPARK-29224 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.0 >Reporter: mob-ai >Assignee: mob-ai >Priority: Major > Attachments: url_loss.xlsx > > > Factorization Machines is widely used in advertising and recommendation > system to estimate CTR(click-through rate). > Advertising and recommendation system usually has a lot of data, so we need > Spark to estimate the CTR, and Factorization Machines are common ml model to > estimate CTR. > Goal: Implement Factorization Machines as a ml-pipeline component > Requirements: > 1. loss function supports: logloss, mse > 2. optimizer: mini batch SGD > References: > 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International > Conference on Data Mining (ICDM), pp. 995–1000, 2010. > https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30333) Bump jackson-databind to 2.6.7.3
Sandeep Katta created SPARK-30333: - Summary: Bump jackson-databind to 2.6.7.3 Key: SPARK-30333 URL: https://issues.apache.org/jira/browse/SPARK-30333 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.4 Reporter: Sandeep Katta To fix below CVE CVE-2018-14718 CVE-2018-14719 CVE-2018-14720 CVE-2018-14721 CVE-2018-19360, CVE-2018-19361 CVE-2018-19362 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception
Izek Greenfield created SPARK-30332: --- Summary: When running sql query with limit catalyst throw StackOverFlow exception Key: SPARK-30332 URL: https://issues.apache.org/jira/browse/SPARK-30332 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Environment: spark version 3.0.0-preview Reporter: Izek Greenfield Running that SQL: {code:sql} SELECT BT_capital.asof_date, BT_capital.run_id, BT_capital.v, BT_capital.id, BT_capital.entity, BT_capital.level_1, BT_capital.level_2, BT_capital.level_3, BT_capital.level_4, BT_capital.level_5, BT_capital.level_6, BT_capital.path_bt_capital, BT_capital.line_item, t0.target_line_item, t0.line_description, BT_capital.col_item, BT_capital.rep_amount, root.orgUnitId, root.cptyId, root.instId, root.startDate, root.maturityDate, root.amount, root.nominalAmount, root.quantity, root.lkupAssetLiability, root.lkupCurrency, root.lkupProdType, root.interestResetDate, root.interestResetTerm, root.noticePeriod, root.historicCostAmount, root.dueDate, root.lkupResidence, root.lkupCountryOfUltimateRisk, root.lkupSector, root.lkupIndustry, root.lkupAccountingPortfolioType, root.lkupLoanDepositTerm, root.lkupFixedFloating, root.lkupCollateralType, root.lkupRiskType, root.lkupEligibleRefinancing, root.lkupHedging, root.lkupIsOwnIssued, root.lkupIsSubordinated, root.lkupIsQuoted, root.lkupIsSecuritised, root.lkupIsSecuritisedServiced, root.lkupIsSyndicated, root.lkupIsDeRecognised, root.lkupIsRenegotiated, root.lkupIsTransferable, root.lkupIsNewBusiness, root.lkupIsFiduciary, root.lkupIsNonPerforming, root.lkupIsInterGroup, root.lkupIsIntraGroup, root.lkupIsRediscounted, root.lkupIsCollateral, root.lkupIsExercised, root.lkupIsImpaired, root.facilityId, root.lkupIsOTC, root.lkupIsDefaulted, root.lkupIsSavingsPosition, root.lkupIsForborne, root.lkupIsDebtRestructuringLoan, root.interestRateAAR, root.interestRateAPRC, root.custom1, root.custom2, root.custom3, root.lkupSecuritisationType, root.lkupIsCashPooling, 
root.lkupIsEquityParticipationGTE10, root.lkupIsConvertible, root.lkupEconomicHedge, root.lkupIsNonCurrHeldForSale, root.lkupIsEmbeddedDerivative, root.lkupLoanPurpose, root.lkupRegulated, root.lkupRepaymentType, root.glAccount, root.lkupIsRecourse, root.lkupIsNotFullyGuaranteed, root.lkupImpairmentStage, root.lkupIsEntireAmountWrittenOff, root.lkupIsLowCreditRisk, root.lkupIsOBSWithinIFRS9, root.lkupIsUnderSpecialSurveillance, root.lkupProtection, root.lkupIsGeneralAllowance, root.lkupSectorUltimateRisk, root.cptyOrgUnitId, root.name, root.lkupNationality, root.lkupSize, root.lkupIsSPV, root.lkupIsCentralCounterparty, root.lkupIsMMRMFI, root.lkupIsKeyManagement, root.lkupIsOtherRelatedParty, root.lkupResidenceProvince, root.lkupIsTradingBook, root.entityHierarchy_entityId, root.entityHierarchy_Residence, root.lkupLocalCurrency, root.cpty_entityhierarchy_entityId, root.lkupRelationship, root.cpty_lkupRelationship, root.entityNationality, root.lkupRepCurrency, root.startDateFinancialYear, root.numEmployees, root.numEmployeesTotal, root.collateralAmount, root.guaranteeAmount, root.impairmentSpecificIndividual, root.impairmentSpecificCollective, root.impairmentGeneral, root.creditRiskAmount, root.provisionSpecificIndividual, root.provisionSpecificCollective, root.provisionGeneral, root.writeOffAmount, root.interest, root.fairValueAmount, root.grossCarryingAmount, root.carryingAmount, root.code, root.lkupInstrumentType, root.price, root.amountAtIssue, root.yield, root.totalFacilityAmount, root.facility_rate, root.spec_indiv_est, root.spec_coll_est, root.coll_inc_loss, root.impairment_amount, root.provision_amount, root.accumulated_impairment, root.exclusionFlag, root.lkupIsHoldingCompany, root.instrument_startDate, root.entityResidence, fxRate.enumerator, fxRate.lkupFromCurrency, fxRate.rate, fxRate.custom1, fxRate.custom2, fxRate.custom3, GB_position.lkupIsECGDGuaranteed, GB_position.lkupIsMultiAcctOffsetMortgage, GB_position.lkupIsIndexLinked, 
GB_position.lkupIsRetail, GB_position.lkupCollateralLocation, GB_position.percentAboveBBR, GB_position.lkupIsMoreInArrears, GB_position.lkupIsArrearsCapitalised, GB_position.lkupCollateralPossession, GB_position.lkupIsLifetimeMortgage, GB_position.lkupLoanConcessionType, GB_position.lkupIsMultiCurrency, GB_position.lkupIsJointIncomeBasis, GB_position.ratioIncomeMultiple, GB_position.interestRate, GB_position.exclusionFlag, GB_position.lkupFDIDirection, GB_position.lkupIsRTGS, GB_positionExtended.nonRecourseFinanceAmount, GB_positionExtended.arrearsAmount, GB_Counterparty.lkupIsClearingFirm, GB_Counterparty.lkupIsIntermediateFinCorp, GB_Counterparty.lkupIsImpairedCreditHistory, GB_Counterparty.lkupFDIRelationship FROM portfolio_41446 BT_capital JOIN aggr_41390 root ON (root.id = BT_capital.id AND root.entity = BT_capital.entity AND (root.instance_id = 'e3b82807-9371-44f4-9c97-d63cde
[jira] [Created] (SPARK-30331) The final AdaptiveSparkPlan event is not marked with `isFinalPlan=true`
Manu Zhang created SPARK-30331: -- Summary: The final AdaptiveSparkPlan event is not marked with `isFinalPlan=true` Key: SPARK-30331 URL: https://issues.apache.org/jira/browse/SPARK-30331 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Manu Zhang This is because the final AdaptiveSparkPlan event is sent out before the {{isFinalPlan}} variable is set to `true`. This breaks any listener attempting to catch the final event by pattern matching on `isFinalPlan=true`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
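The ordering bug described above is easy to model outside Spark: if an event is emitted before the flag it carries is flipped, a listener that pattern-matches on the flag never fires. A toy Python sketch (all names are illustrative, not Spark's listener API):

```python
class AdaptivePlanEvent:
    """Toy stand-in for the plan-update event; only the flag matters here."""
    def __init__(self, is_final_plan):
        self.is_final_plan = is_final_plan

final_events = []

def listener(event):
    # Mirrors a listener that pattern-matches on `isFinalPlan=true`.
    if event.is_final_plan:
        final_events.append(event)

# Buggy ordering (as reported): the event is built and sent while the
# flag is still False; the flag is set to True only afterwards.
flag = False
listener(AdaptivePlanEvent(flag))
flag = True
missed = (len(final_events) == 0)   # the listener never saw a final plan

# Correct ordering: flip the flag before building and sending the event.
flag = True
listener(AdaptivePlanEvent(flag))
```

The fix amounts to moving the flag assignment ahead of the event emission.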
[jira] [Assigned] (SPARK-28332) SQLMetric wrong initValue
[ https://issues.apache.org/jira/browse/SPARK-28332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-28332: --- Assignee: EdisonWang > SQLMetric wrong initValue > -- > > Key: SPARK-28332 > URL: https://issues.apache.org/jira/browse/SPARK-28332 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Song Jun >Assignee: EdisonWang >Priority: Minor > Fix For: 3.0.0 > > > Currently SQLMetrics.createSizeMetric creates a SQLMetric with initValue set > to -1. > If there is a ShuffleMapStage with lots of tasks which read 0 bytes of data, > these tasks will send the metric (its value still the initValue of -1) to the > Driver; the Driver then merges the metrics for this Stage in > DAGScheduler.updateAccumulators, which causes the merged metric value of > this Stage to become negative. > This is incorrect; we should set the initValue to 0. > The same issue exists in SQLMetrics.createTimingMetric. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
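The negative-merge effect the description walks through can be reproduced with plain integers: the driver-side merge is effectively a sum over per-task values, so tasks that read 0 bytes and never update the metric each contribute their initial value. A hedged sketch (not Spark's actual SQLMetric class):

```python
def merge_task_metrics(task_values):
    # Driver-side merge is effectively a sum over per-task metric values.
    return sum(task_values)

num_tasks = 1000

# initValue = -1: tasks that read 0 bytes never touch the metric, so each
# still carries -1 when reported back, and the stage total goes negative.
merged_bad = merge_task_metrics([-1] * num_tasks)

# initValue = 0: the same zero-read tasks merge to 0, as expected.
merged_ok = merge_task_metrics([0] * num_tasks)
```

With 1000 zero-read tasks the merged size metric comes out as -1000 instead of 0, which is exactly the symptom the issue describes.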
[jira] [Commented] (SPARK-28332) SQLMetric wrong initValue
[ https://issues.apache.org/jira/browse/SPARK-28332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002148#comment-17002148 ] EdisonWang commented on SPARK-28332: I've taken it [~cloud_fan] > SQLMetric wrong initValue > -- > > Key: SPARK-28332 > URL: https://issues.apache.org/jira/browse/SPARK-28332 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Song Jun >Priority: Minor > Fix For: 3.0.0 > > > Currently SQLMetrics.createSizeMetric create a SQLMetric with initValue set > to -1. > If there is a ShuffleMapStage with lots of Tasks which read 0 bytes data, > these tasks will send the metric(the metric value still be the initValue with > -1) to Driver, then Driver do metric merge for this Stage in > DAGScheduler.updateAccumulators, this will cause the merged metric value of > this Stage set to be a negative value. > This is incorrect, we should set the initValue to 0 . > Another same case in SQLMetrics.createTimingMetric. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26002) SQL date operators calculates with incorrect dayOfYears for dates before 1500-03-01
[ https://issues.apache.org/jira/browse/SPARK-26002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-26002: Labels: correctness (was: ) > SQL date operators calculates with incorrect dayOfYears for dates before > 1500-03-01 > --- > > Key: SPARK-26002 > URL: https://issues.apache.org/jira/browse/SPARK-26002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, > 2.3.2, 2.4.0, 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > Running the following SQL the result is incorrect: > {noformat} > scala> sql("select dayOfYear('1500-01-02')").show() > +---+ > |dayofyear(CAST(1500-01-02 AS DATE))| > +---+ > | 1| > +---+ > {noformat} > This off by one day is more annoying right at the beginning of a year: > {noformat} > scala> sql("select year('1500-01-01')").show() > +--+ > |year(CAST(1500-01-01 AS DATE))| > +--+ > | 1499| > +--+ > scala> sql("select month('1500-01-01')").show() > +---+ > |month(CAST(1500-01-01 AS DATE))| > +---+ > | 12| > +---+ > scala> sql("select dayOfYear('1500-01-01')").show() > +---+ > |dayofyear(CAST(1500-01-01 AS DATE))| > +---+ > |365| > +---+ > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
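The off-by-one above likely traces back to calendar systems: 1500 is a leap year under the Julian rule (every 4th year) but not under the Gregorian rule (divisible by 100 and not by 400), so hybrid Julian/Gregorian date handling drifts by a day for dates before 1500-03-01 relative to the proleptic Gregorian calendar. Python's `datetime` is proleptic Gregorian, so it shows the values the fixed behavior should produce; the leap-year helpers below are illustrative:

```python
from datetime import date

# Proleptic Gregorian day-of-year, i.e. the expected results:
assert date(1500, 1, 1).timetuple().tm_yday == 1
assert date(1500, 1, 2).timetuple().tm_yday == 2

def is_gregorian_leap(y):
    return y % 4 == 0 and (y % 100 != 0 or y % 400 == 0)

def is_julian_leap(y):
    return y % 4 == 0

# 1500 is a Julian leap year but not a Gregorian one; the extra Julian
# leap day is what shifts dates before 1500-03-01 by one day, producing
# results like dayofyear('1500-01-02') = 1 in the report above.
```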
[jira] [Updated] (SPARK-30330) Support single quotes json parsing for get_json_object and json_tuple
[ https://issues.apache.org/jira/browse/SPARK-30330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fang Wen updated SPARK-30330: - External issue URL: https://github.com/apache/spark/pull/26965 > Support single quotes json parsing for get_json_object and json_tuple > - > > Key: SPARK-30330 > URL: https://issues.apache.org/jira/browse/SPARK-30330 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3, 2.4.4 >Reporter: Fang Wen >Priority: Major > Labels: release-notes > > I executed a query such as > {code:java} > select get_json_object(ytag, '$.y1') AS y1 from t4{code} > SparkSQL returns null but Hive returns the correct results. > In my production environment, ytag is JSON wrapped in single quotes, as > follows > {code:java} > {'y1': 'shuma', 'y2': 'shuma:shouji'} > {'y1': 'jiaoyu', 'y2': 'jiaoyu:gaokao'} > {'y1': 'yule', 'y2': 'yule:mingxing'} > {code} > Then I realized that some functions, including get_json_object and json_tuple, do > not support parsing single-quoted JSON. They return null in this > situation. > I think this behavior is unfriendly to users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
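The null results are consistent with strict JSON parsing: the JSON spec only allows double-quoted strings, so a strict parser rejects the single-quoted records outright, and the parse failure surfaces as null. This is illustrated below with Python's strict `json` module as a stand-in for Spark's parser; Hive's behavior corresponds to a more lenient parser:

```python
import json

single_quoted = "{'y1': 'shuma', 'y2': 'shuma:shouji'}"

# A strict, spec-compliant parser rejects single-quoted JSON...
try:
    json.loads(single_quoted)
    strict_parse_ok = True
except json.JSONDecodeError:
    strict_parse_ok = False

# ...while the double-quoted equivalent parses fine.
double_quoted = single_quoted.replace("'", '"')
parsed = json.loads(double_quoted)
```

So supporting the reporter's data means either relaxing the parser or pre-normalizing the quotes before extraction.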
[jira] [Updated] (SPARK-30330) Support single quotes json parsing for get_json_object and json_tuple
[ https://issues.apache.org/jira/browse/SPARK-30330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fang Wen updated SPARK-30330: - External issue URL: (was: https://github.com/apache/spark/pull/26965) > Support single quotes json parsing for get_json_object and json_tuple > - > > Key: SPARK-30330 > URL: https://issues.apache.org/jira/browse/SPARK-30330 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3, 2.4.4 >Reporter: Fang Wen >Priority: Major > Labels: release-notes > > I execute some query as > {code:java} > select get_json_object(ytag, '$.y1') AS y1 from t4{code} > SparkSQL return null but Hive return correct results. > In my production environment, ytag is a json wrapped by single quotes,as > follows > {code:java} > {'y1': 'shuma', 'y2': 'shuma:shouji'} > {'y1': 'jiaoyu', 'y2': 'jiaoyu:gaokao'} > {'y1': 'yule', 'y2': 'yule:mingxing'} > {code} > Then l realized some functions including get_json_object and json_tuple does > not support single quotes json parsing. It will return null for this > situation. > I think such a treatment is unfriendly for users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30330) Support single quotes json parsing for get_json_object and json_tuple
[ https://issues.apache.org/jira/browse/SPARK-30330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fang Wen updated SPARK-30330: - Labels: release-notes (was: ) > Support single quotes json parsing for get_json_object and json_tuple > - > > Key: SPARK-30330 > URL: https://issues.apache.org/jira/browse/SPARK-30330 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3, 2.4.4 >Reporter: Fang Wen >Priority: Major > Labels: release-notes > > I execute some query as > {code:java} > select get_json_object(ytag, '$.y1') AS y1 from t4{code} > SparkSQL return null but Hive return correct results. > In my production environment, ytag is a json wrapped by single quotes,as > follows > {code:java} > {'y1': 'shuma', 'y2': 'shuma:shouji'} > {'y1': 'jiaoyu', 'y2': 'jiaoyu:gaokao'} > {'y1': 'yule', 'y2': 'yule:mingxing'} > {code} > Then l realized some functions including get_json_object and json_tuple does > not support single quotes json parsing. It will return null for this > situation. > I think such a treatment is unfriendly for users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30330) Support single quotes json parsing for get_json_object and json_tuple
Fang Wen created SPARK-30330: Summary: Support single quotes json parsing for get_json_object and json_tuple Key: SPARK-30330 URL: https://issues.apache.org/jira/browse/SPARK-30330 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.4, 2.4.3 Reporter: Fang Wen I execute some query as {code:java} select get_json_object(ytag, '$.y1') AS y1 from t4{code} SparkSQL return null but Hive return correct results. In my production environment, ytag is a json wrapped by single quotes,as follows {code:java} {'y1': 'shuma', 'y2': 'shuma:shouji'} {'y1': 'jiaoyu', 'y2': 'jiaoyu:gaokao'} {'y1': 'yule', 'y2': 'yule:mingxing'} {code} Then l realized some functions including get_json_object and json_tuple does not support single quotes json parsing. It will return null for this situation. I think such a treatment is unfriendly for users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30328) Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files
[ https://issues.apache.org/jira/browse/SPARK-30328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chendihao updated SPARK-30328: -- Description: We found that incorrect Hadoop configuration files cause saving an RDD to the local file system to fail. This is not expected because we have specified a local URL, and the DataFrame.write.text API does not have this issue. It is easy to reproduce and verify with Spark 2.3.0. 1. Do not set the `HADOOP_CONF_DIR` environment variable. 2. Install pyspark and run the local Python script. This should work and save files to the local file system. {code:java} from pyspark.sql import SparkSession spark = SparkSession.builder.master("local").getOrCreate() sc = spark.sparkContext rdd = sc.parallelize([1, 2, 3]) rdd.saveAsTextFile("file:///tmp/rdd.text") {code} 3. Set the `HADOOP_CONF_DIR` environment variable and put the Hadoop configuration files there. Make sure the format of `core-site.xml` is right but that it contains an unresolved host name. 4. Run the same Python script again. If it tries to connect to HDFS and finds the unresolved host name, a Java exception happens. We think `saveAsTextFile("file:///...")` should not attempt to connect to HDFS no matter whether `HADOOP_CONF_DIR` is set. In fact, the following code works with the same incorrect Hadoop configuration files. {code:java} from pyspark.sql import SparkSession spark = SparkSession.builder.master("local").getOrCreate() df = spark.createDataFrame(rows, ["attribute", "value"]) df.write.parquet("file:///tmp/df.parquet") {code} > Fail to write local files with RDD.saveTextFile when setting the incorrect > Hadoop configuration files > - > > Key: SPARK-30328 > URL: https://issues.apache.org/jira/browse/SPARK-30328 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: chendihao >Priority: Major > > We found that incorrect Hadoop configuration files cause saving an RDD to the > local file system to fail.
It is not expected because we have specified > a local URL, and the DataFrame.write.text API does not have this issue. > It is easy to reproduce and verify with Spark 2.3.0. > 1. Do not set the `HADOOP_CONF_DIR` environment variable. > 2. Install pyspark and run the local Python script. This should work and save > files to the local file system. > {code:java} > from pyspark.sql import SparkSession > spark = SparkSession.builder.master("local").getOrCreate() > sc = spark.sparkContext > rdd = sc.parallelize([1, 2, 3]) > rdd.saveAsTextFile("file:///tmp/rdd.text") > {code} > 3. Set the `HADOOP_CONF_DIR` environment variable and put the Hadoop > configuration files there. Make sure the format of `core-site.xml` is right > but that it contains an unresolved host name. > 4. Run the same Python script again. If it tries to connect to HDFS and finds the > unresolved host name, a Java exception happens. > We think `saveAsTextFile("file:///...")` should not attempt to connect to HDFS no > matter whether `HADOOP_CONF_DIR` is set. In fact, the following code works with > the same incorrect Hadoop configuration files. > {code:java} > from pyspark.sql import SparkSession > spark = SparkSession.builder.master("local").getOrCreate() > df = spark.createDataFrame(rows, ["attribute", "value"]) > df.write.parquet("file:///tmp/df.parquet") > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
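The expectation in the report, that a `file:///` URI should never touch HDFS regardless of the configured default filesystem, comes down to resolving the filesystem from the URI scheme when one is present and only falling back to the configured default for scheme-less paths. A sketch of that resolution rule with Python's stdlib; this is illustrative and not Hadoop's actual `FileSystem.get` logic, and the default-FS host name is a placeholder:

```python
from urllib.parse import urlparse

def resolve_filesystem(path, default_fs="hdfs://unresolved-host:8020"):
    """Pick a filesystem scheme: an explicit URI scheme wins over the default."""
    scheme = urlparse(path).scheme
    if scheme:
        return scheme                        # "file:///tmp/rdd.text" -> local FS
    return urlparse(default_fs).scheme       # bare "/tmp/out" -> configured default
```

Under this rule, `saveAsTextFile("file:///tmp/rdd.text")` would go to the local filesystem even when `core-site.xml` points `fs.defaultFS` at an unresolvable HDFS host, matching the behavior the reporter observed for `DataFrame.write`.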