[jira] [Updated] (SPARK-34481) Refactor dataframe reader/writer path option logic
     [ https://issues.apache.org/jira/browse/SPARK-34481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuchen Huo updated SPARK-34481:
-------------------------------
    Priority: Trivial  (was: Major)

> Refactor dataframe reader/writer path option logic
> ---------------------------------------------------
>
>                 Key: SPARK-34481
>                 URL: https://issues.apache.org/jira/browse/SPARK-34481
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Yuchen Huo
>            Priority: Trivial
>
> Refactor the DataFrame reader/writer logic so that the handling of paths passed through the options map has its own function.
[jira] [Created] (SPARK-34481) Refactor dataframe reader/writer path option logic
Yuchen Huo created SPARK-34481:
-----------------------------------

             Summary: Refactor dataframe reader/writer path option logic
                 Key: SPARK-34481
                 URL: https://issues.apache.org/jira/browse/SPARK-34481
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Yuchen Huo

Refactor the DataFrame reader/writer logic so that the handling of paths passed through the options map has its own function.
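For context, a minimal sketch of the refactor idea (the helper name and shape are assumptions for illustration, not the actual Spark change): pull the "paths may also arrive via the path/paths options" handling out of load()/save() into a single reusable function.

{code:java}
// Hypothetical sketch only -- names are illustrative, not Spark's internal API.
object PathOptionSketch {
  /** Merge explicitly passed paths with any "path" option, rejecting ambiguity. */
  def resolvePaths(explicitPaths: Seq[String], options: Map[String, String]): Seq[String] = {
    val fromOption = options.get("path").toSeq
    if (explicitPaths.nonEmpty && fromOption.nonEmpty) {
      // Having both an explicit path argument and a "path" option is ambiguous.
      throw new IllegalArgumentException(
        "Path was specified both as an argument and in the options map")
    }
    if (explicitPaths.nonEmpty) explicitPaths else fromOption
  }
}
{code}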
[jira] [Commented] (SPARK-30417) SPARK-29976 calculation of slots wrong for Standalone Mode
     [ https://issues.apache.org/jira/browse/SPARK-30417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010142#comment-17010142 ]

Yuchen Huo commented on SPARK-30417:
------------------------------------

[~tgraves] Sure. Is there a more stable way to get the number of cores the executor is using, instead of checking the value of EXECUTOR_CORES, which might not be set? cc [~jiangxb1987]

> SPARK-29976 calculation of slots wrong for Standalone Mode
> -----------------------------------------------------------
>
>                 Key: SPARK-30417
>                 URL: https://issues.apache.org/jira/browse/SPARK-30417
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Thomas Graves
>            Priority: Major
>
> In SPARK-29976 we added a config to determine if we should allow speculation when the number of tasks is less than the number of slots on a single executor. The problem is that for standalone mode (and Mesos coarse-grained) the EXECUTOR_CORES config is not set properly by default: in those modes the number of executor cores is all the cores of the Worker, while the default of EXECUTOR_CORES is 1.
> The calculation:
> {code:java}
> val speculationTasksLessEqToSlots = numTasks <= (conf.get(EXECUTOR_CORES) / sched.CPUS_PER_TASK)
> {code}
> If someone sets cpus per task > 1, this ends up being false even with only 1 task. Note that in the default case (cpus per task = 1, executor cores = 1) it works out OK, but the check is then only applied when there is 1 task rather than comparing against the real number of slots on the executor.
> Here we really don't know the number of executor cores for standalone mode or Mesos, so I think a decent solution is to just use 1 in those cases and document the difference. Something like:
> {code:java}
> max(conf.get(EXECUTOR_CORES) / sched.CPUS_PER_TASK, 1)
> {code}
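A small standalone illustration of the slot calculation described in the ticket (plain Ints stand in for the SparkConf lookups; this is not the final patch, just the arithmetic): with the standalone-mode default of executorCores = 1 and cpusPerTask = 2, integer division yields 0 slots, so a single task can never be speculated; the suggested max(..., 1) guard restores the intended behaviour.

{code:java}
object SlotCalcSketch {
  // Current behaviour: integer division can yield 0 slots when executorCores
  // defaults to 1 (standalone / Mesos coarse-grained) and cpusPerTask > 1.
  def speculationAllowed(numTasks: Int, executorCores: Int, cpusPerTask: Int): Boolean =
    numTasks <= (executorCores / cpusPerTask)

  // Suggested guard from the ticket: never let the computed slot count drop below 1.
  def speculationAllowedFixed(numTasks: Int, executorCores: Int, cpusPerTask: Int): Boolean =
    numTasks <= math.max(executorCores / cpusPerTask, 1)

  def main(args: Array[String]): Unit = {
    println(speculationAllowed(1, 1, 2))      // false -- a lone task is never speculated
    println(speculationAllowedFixed(1, 1, 2)) // true
  }
}
{code}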
[jira] [Created] (SPARK-30314) Add identifier and catalog information to DataSourceV2Relation
Yuchen Huo created SPARK-30314:
-----------------------------------

             Summary: Add identifier and catalog information to DataSourceV2Relation
                 Key: SPARK-30314
                 URL: https://issues.apache.org/jira/browse/SPARK-30314
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Yuchen Huo

Add identifier and catalog information to DataSourceV2Relation so that it would be possible to do richer checks in the *checkAnalysis* step.
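A hedged sketch of the idea (field and class names here are illustrative, not the actual DataSourceV2Relation signature): carry the catalog and the multi-part identifier alongside the resolved table so analysis rules can inspect where the relation came from.

{code:java}
// Illustrative only -- assumes Spark 3.x connector catalog types on the classpath.
import org.apache.spark.sql.connector.catalog.{CatalogPlugin, Identifier, Table}

case class ResolvedV2RelationInfo(
    table: Table,                      // the resolved V2 table
    catalog: Option[CatalogPlugin],    // which catalog the table was loaded from, if any
    identifier: Option[Identifier])    // the table's name within that catalog, if known
{code}

With this information attached, a checkAnalysis-style rule could, for example, reject writes that target the same catalog/identifier the query reads from.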
[jira] [Created] (SPARK-29976) Allow speculation even if there is only one task
Yuchen Huo created SPARK-29976:
-----------------------------------

             Summary: Allow speculation even if there is only one task
                 Key: SPARK-29976
                 URL: https://issues.apache.org/jira/browse/SPARK-29976
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: Yuchen Huo

In the current speculative execution implementation, if there is only one task in the stage then no speculative run is conducted. However, there might be cases where an executor has a problem writing to its disk and just hangs forever. In that case, if the single-task stage gets assigned to the problematic executor, the whole job hangs forever. It would be better if we could run the task on another executor when this happens.
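For reference, a sketch of how speculation is enabled today; spark.speculation and spark.speculation.quantile are long-standing settings. The single-task behaviour proposed here would need an additional, duration-based trigger; the config name below is an assumption for illustration, since the ticket does not quote it.

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("speculation-example")
  .config("spark.speculation", "true")          // turn speculative execution on
  .config("spark.speculation.quantile", "0.75") // fraction of tasks that must finish before speculating
  // Assumed/illustrative knob: a wall-clock threshold so a lone hung task can
  // still be re-launched on another executor even with no sibling tasks to compare against.
  .config("spark.speculation.task.duration.threshold", "300s")
  .getOrCreate()
{code}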
[jira] [Commented] (SPARK-23822) Improve error message for Parquet schema mismatches
     [ https://issues.apache.org/jira/browse/SPARK-23822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537412#comment-16537412 ]

Yuchen Huo commented on SPARK-23822:
------------------------------------

Yes, the error message shows the column name, the expected type, and the type that was found.

> Improve error message for Parquet schema mismatches
> ----------------------------------------------------
>
>                 Key: SPARK-23822
>                 URL: https://issues.apache.org/jira/browse/SPARK-23822
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Yuchen Huo
>            Assignee: Yuchen Huo
>            Priority: Major
>             Fix For: 2.3.1, 2.4.0
>
> If a user attempts to read Parquet files with mismatched schemas and schema merging is disabled, this may result in very confusing UnsupportedOperationException and ParquetDecodingException errors from Parquet.
> e.g.
> {code:java}
> Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/")
> Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/")
> spark.read.parquet(s"$path/").collect()
> {code}
> Would result in
> {code:java}
> Caused by: java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:474)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
>   at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:617)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
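A hedged sketch of the kind of wrapping the comment describes (class and message wording are illustrative, not necessarily the API that shipped): catch the low-level Parquet failure and rethrow it with the column name, the expected type, and the type that was actually found.

{code:java}
// Hypothetical exception for illustration only.
class ParquetSchemaMismatchException(
    column: String,
    expectedType: String,
    foundType: String)
  extends RuntimeException(
    s"Parquet column cannot be converted. Column: [$column], " +
    s"Expected: $expectedType, Found: $foundType")

// Example of the message a reader could surface instead of "Unimplemented type: IntegerType":
//   Parquet column cannot be converted. Column: [a], Expected: int, Found: BINARY
{code}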
[jira] [Updated] (SPARK-23822) Improve error message for Parquet schema mismatches
     [ https://issues.apache.org/jira/browse/SPARK-23822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuchen Huo updated SPARK-23822:
-------------------------------
    Component/s:     (was: Input/Output)
                 SQL

> Improve error message for Parquet schema mismatches
> ----------------------------------------------------
>
>                 Key: SPARK-23822
>                 URL: https://issues.apache.org/jira/browse/SPARK-23822
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Yuchen Huo
>            Priority: Major
>
> If a user attempts to read Parquet files with mismatched schemas and schema merging is disabled, this may result in very confusing UnsupportedOperationException and ParquetDecodingException errors from Parquet.
> e.g.
> {code:java}
> Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/")
> Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/")
> spark.read.parquet(s"$path/").collect()
> {code}
> Would result in
> {code:java}
> Caused by: java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
>   ... (stack trace identical to the one quoted above, ending in java.lang.Thread.run)
> {code}
[jira] [Updated] (SPARK-23822) Improve error message for Parquet schema mismatches
     [ https://issues.apache.org/jira/browse/SPARK-23822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuchen Huo updated SPARK-23822:
-------------------------------
    Description:
If a user attempts to read Parquet files with mismatched schemas and schema merging is disabled then this may result in a very confusing UnsupportedOperationException and ParquetDecodingException errors from Parquet.

e.g.
{code:java}
Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/")
Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/")
spark.read.parquet(s"$path/").collect()
{code}
Would result in
{code:java}
Caused by: java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
  ... (stack trace identical to the one quoted above, ending in java.lang.Thread.run)
{code}

  was:
If a user attempts to read Parquet files with mismatched schemas and schema merging is disabled then this may result in a very confusing UnsupportedOperationException and ParquetDecodingException errors from Parquet. e.g.
{code:java}
Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/")
Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/")
spark.read.parquet(s"$path/").collect()
{code}
Would result in

> Improve error message for Parquet schema mismatches
> ----------------------------------------------------
>
>                 Key: SPARK-23822
>                 URL: https://issues.apache.org/jira/browse/SPARK-23822
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.3.0
>            Reporter: Yuchen Huo
>            Priority: Major
>
> If a user attempts to read Parquet files with mismatched schemas and schema merging is disabled then this may result in a very confusing UnsupportedOperationException and ParquetDecodingException errors from Parquet.
> e.g.
> {code:java}
> Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/")
> Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/")
> spark.read.parquet(s"$path/").collect()
> {code}
> Would result in
> {code:java}
> Caused by: java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:474)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
>   at
[jira] [Created] (SPARK-23822) Improve error message for Parquet schema mismatches
Yuchen Huo created SPARK-23822:
-----------------------------------

             Summary: Improve error message for Parquet schema mismatches
                 Key: SPARK-23822
                 URL: https://issues.apache.org/jira/browse/SPARK-23822
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output
    Affects Versions: 2.3.0
            Reporter: Yuchen Huo

If a user attempts to read Parquet files with mismatched schemas and schema merging is disabled, this may result in very confusing UnsupportedOperationException and ParquetDecodingException errors from Parquet.

e.g.
{code:java}
Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/")
Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/")
spark.read.parquet(s"$path/").collect()
{code}
Would result in