[jira] [Updated] (SPARK-34481) Refactor dataframe reader/writer path option logic

2021-02-19 Thread Yuchen Huo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuchen Huo updated SPARK-34481:
---
Priority: Trivial  (was: Major)

> Refactor dataframe reader/writer path option logic
> --
>
> Key: SPARK-34481
> URL: https://issues.apache.org/jira/browse/SPARK-34481
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuchen Huo
>Priority: Trivial
>
> Refactor the DataFrame reader/writer logic so that the path-in-options 
> handling logic has its own function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34481) Refactor dataframe reader/writer path option logic

2021-02-19 Thread Yuchen Huo (Jira)
Yuchen Huo created SPARK-34481:
--

 Summary: Refactor dataframe reader/writer path option logic
 Key: SPARK-34481
 URL: https://issues.apache.org/jira/browse/SPARK-34481
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuchen Huo


Refactor the DataFrame reader/writer logic so that the path-in-options 
handling logic has its own function.
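A minimal, hypothetical sketch of the shape such a refactor could take: the path bookkeeping moves into a single helper that both the reader and writer call. The class and method names here are illustrative only and do not match Spark's internal DataFrameReader/DataFrameWriter code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: pull the "path" option handling out of the
// load()/save() bodies into one shared helper. Names are hypothetical.
public class PathOptionSketch {

    // Merge an explicitly passed path into the option map under the "path"
    // key, refusing to silently overwrite a conflicting "path" option.
    static Map<String, String> withPathOption(Map<String, String> options, String path) {
        Map<String, String> merged = new HashMap<>(options);
        if (path != null) {
            String existing = merged.get("path");
            if (existing != null && !existing.equals(path)) {
                throw new IllegalArgumentException(
                    "Both a path argument and a 'path' option were given: "
                        + path + " vs " + existing);
            }
            merged.put("path", path);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> opts = withPathOption(new HashMap<>(), "/tmp/data");
        System.out.println(opts.get("path")); // prints /tmp/data
    }
}
```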






[jira] [Commented] (SPARK-30417) SPARK-29976 calculation of slots wrong for Standalone Mode

2020-01-07 Thread Yuchen Huo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010142#comment-17010142
 ] 

Yuchen Huo commented on SPARK-30417:


[~tgraves] Sure. Is there a more reliable way to get the number of cores the 
executor is using, instead of checking the value of EXECUTOR_CORES, which 
might not be set?

 

cc [~jiangxb1987]

> SPARK-29976 calculation of slots wrong for Standalone Mode
> --
>
> Key: SPARK-30417
> URL: https://issues.apache.org/jira/browse/SPARK-30417
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> In SPARK-29976 we added a config to determine whether we should allow speculation 
> when the number of tasks is less than the number of slots on a single 
> executor. The problem is that for standalone mode (and Mesos coarse-grained 
> mode) the EXECUTOR_CORES config is not set properly by default: in those 
> modes the number of executor cores is all the cores of the Worker, but the 
> default of EXECUTOR_CORES is 1.
> The calculation:
> {code:java}
> val speculationTasksLessEqToSlots = numTasks <= (conf.get(EXECUTOR_CORES) / sched.CPUS_PER_TASK)
> {code}
> If someone sets cpus per task > 1, this would end up false even with a 
> single task. Note that in the default case, where cpus per task is 1 and 
> executor cores is 1, it works out OK, but the check only compares a single 
> task against the number of slots on the executor.
> Here we really don't know the number of executor cores for standalone mode or 
> Mesos, so I think a decent solution is to just use 1 in those cases and 
> document the difference.
> Something like:
> {code:java}
> max(conf.get(EXECUTOR_CORES) / sched.CPUS_PER_TASK, 1)
> {code}
>  
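The failure mode and the proposed guard can be sketched outside Spark. This toy version only mirrors the arithmetic described above; the method names follow the snippet in the issue, not real Spark scheduler internals.

```java
// Toy reproduction of the slot arithmetic discussed above. In standalone mode
// EXECUTOR_CORES can default to 1 even though the executor really has all the
// Worker's cores, so integer division can yield 0 slots.
public class SpeculationSlotCheck {

    // The check as written in the issue: numTasks <= executorCores / cpusPerTask.
    static boolean unguarded(int numTasks, int executorCores, int cpusPerTask) {
        return numTasks <= executorCores / cpusPerTask;
    }

    // The proposed fix: never let the computed slot count drop below 1.
    static boolean guarded(int numTasks, int executorCores, int cpusPerTask) {
        return numTasks <= Math.max(executorCores / cpusPerTask, 1);
    }

    public static void main(String[] args) {
        // cpus-per-task = 2 with the (wrong) default executorCores = 1:
        // 1 / 2 == 0 slots, so even a single task fails the unguarded check.
        System.out.println(unguarded(1, 1, 2)); // prints false
        System.out.println(guarded(1, 1, 2));   // prints true
    }
}
```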






[jira] [Created] (SPARK-30314) Add identifier and catalog information to DataSourceV2Relation

2019-12-19 Thread Yuchen Huo (Jira)
Yuchen Huo created SPARK-30314:
--

 Summary: Add identifier and catalog information to 
DataSourceV2Relation
 Key: SPARK-30314
 URL: https://issues.apache.org/jira/browse/SPARK-30314
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuchen Huo


Add identifier and catalog information to DataSourceV2Relation so that it 
becomes possible to do richer checks in the *checkAnalysis* step.
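As a rough illustration of the idea (field names are hypothetical; the real DataSourceV2Relation is a Catalyst plan node in Scala), carrying the catalog and identifier lets an analysis-time check compare relations by identity rather than only by schema:

```java
// Hypothetical sketch: attach catalog and identifier to a relation so an
// analysis-time check can tell which table it refers to. Not Spark's API.
public class RelationIdentitySketch {

    static final class Relation {
        final String catalog;    // e.g. "spark_catalog"; null when unknown
        final String identifier; // e.g. "db.tbl"; null when unknown

        Relation(String catalog, String identifier) {
            this.catalog = catalog;
            this.identifier = identifier;
        }
    }

    // A richer checkAnalysis-style rule becomes possible once identity is
    // attached, e.g. detecting that a write's source and target are the same table.
    static boolean refersToSameTable(Relation a, Relation b) {
        return a.catalog != null && a.identifier != null
            && a.catalog.equals(b.catalog)
            && a.identifier.equals(b.identifier);
    }

    public static void main(String[] args) {
        Relation source = new Relation("spark_catalog", "db.events");
        Relation target = new Relation("spark_catalog", "db.events");
        System.out.println(refersToSameTable(source, target)); // prints true
    }
}
```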






[jira] [Created] (SPARK-29976) Allow speculation even if there is only one task

2019-11-20 Thread Yuchen Huo (Jira)
Yuchen Huo created SPARK-29976:
--

 Summary: Allow speculation even if there is only one task
 Key: SPARK-29976
 URL: https://issues.apache.org/jira/browse/SPARK-29976
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Yuchen Huo


In the current speculative execution implementation, if there is only one task 
in the stage then no speculative run is conducted. However, there might be 
cases where an executor has a problem writing to its disk and just hangs 
forever. In that case, if the single-task stage gets assigned to the 
problematic executor, the whole job hangs forever. It would be better if we 
could run the task on another executor when this happens.
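One way to express that idea is a time-based escape hatch for single-task stages. Spark 3.0 ultimately exposed something along these lines as spark.speculation.task.duration.threshold, but the decision logic below is a simplified, hypothetical sketch, not the scheduler's actual code.

```java
// Simplified sketch of single-task speculation: multi-task stages keep the
// usual median-runtime heuristic (not modeled here); a lone task instead
// becomes eligible once it has run past a fixed duration threshold.
public class SingleTaskSpeculation {

    // thresholdMs == null models "no threshold configured": the old behavior,
    // where a single-task stage is never speculated.
    static boolean shouldSpeculateLoneTask(long runtimeMs, Long thresholdMs) {
        if (thresholdMs == null) {
            return false;
        }
        return runtimeMs > thresholdMs;
    }

    public static void main(String[] args) {
        // A task hung for 10 minutes against a 1-minute threshold.
        System.out.println(shouldSpeculateLoneTask(600_000L, 60_000L)); // prints true
        System.out.println(shouldSpeculateLoneTask(600_000L, null));    // prints false
    }
}
```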






[jira] [Commented] (SPARK-23822) Improve error message for Parquet schema mismatches

2018-07-09 Thread Yuchen Huo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537412#comment-16537412
 ] 

Yuchen Huo commented on SPARK-23822:


Yes, the error message shows the column name, the expected type, and the type 
that was found.
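For illustration, a message along those lines can be assembled as follows. The exact wording Spark emits may differ; this sketch only shows the shape of the information the comment describes.

```java
// Sketch of a schema-mismatch message carrying the three pieces of
// information mentioned above: column name, expected type, and the type
// actually found in the Parquet file.
public class SchemaMismatchMessage {

    static String mismatchMessage(String column, String expectedType, String foundType) {
        return String.format(
            "Parquet column cannot be converted. Column: %s, Expected: %s, Found: %s",
            column, expectedType, foundType);
    }

    public static void main(String[] args) {
        // Mirrors the repro in the issue: an int column read where the file
        // actually contains binary data.
        System.out.println(mismatchMessage("a", "IntegerType", "BINARY"));
    }
}
```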

> Improve error message for Parquet schema mismatches
> ---
>
> Key: SPARK-23822
> URL: https://issues.apache.org/jira/browse/SPARK-23822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuchen Huo
>Assignee: Yuchen Huo
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> If a user attempts to read Parquet files with mismatched schemas and schema 
> merging is disabled, this may result in very confusing 
> UnsupportedOperationException and ParquetDecodingException errors from 
> Parquet.
> e.g.
> {code:java}
> Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/")
> Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/")
> spark.read.parquet(s"$path/").collect()
> {code}
> Would result in
> {code:java}
> Caused by: java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:474)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
>   at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:617)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23822) Improve error message for Parquet schema mismatches

2018-03-29 Thread Yuchen Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuchen Huo updated SPARK-23822:
---
Component/s: (was: Input/Output)
 SQL

> Improve error message for Parquet schema mismatches
> ---
>
> Key: SPARK-23822
> URL: https://issues.apache.org/jira/browse/SPARK-23822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuchen Huo
>Priority: Major
>
> If a user attempts to read Parquet files with mismatched schemas and schema 
> merging is disabled, this may result in very confusing 
> UnsupportedOperationException and ParquetDecodingException errors from 
> Parquet.
> e.g.
> {code:java}
> Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/")
> Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/")
> spark.read.parquet(s"$path/").collect()
> {code}
> Would result in the same UnsupportedOperationException stack trace quoted in 
> full earlier in this thread.






[jira] [Updated] (SPARK-23822) Improve error message for Parquet schema mismatches

2018-03-29 Thread Yuchen Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuchen Huo updated SPARK-23822:
---
Description: 
If a user attempts to read Parquet files with mismatched schemas and schema 
merging is disabled, this may result in very confusing 
UnsupportedOperationException and ParquetDecodingException errors from Parquet.

e.g.
{code:java}
Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/")
Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/")

spark.read.parquet(s"$path/").collect()
{code}
Would result in
{code:java}
Caused by: java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:474)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
  at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:617)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:748)
{code}
 

  was:
If a user attempts to read Parquet files with mismatched schemas and schema 
merging is disabled, this may result in very confusing 
UnsupportedOperationException and ParquetDecodingException errors from Parquet.

e.g.
{code:java}
Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/")
Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/")

spark.read.parquet(s"$path/").collect()
{code}
Would result in

 


> Improve error message for Parquet schema mismatches
> ---
>
> Key: SPARK-23822
> URL: https://issues.apache.org/jira/browse/SPARK-23822
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Yuchen Huo
>Priority: Major
>
> If a user attempts to read Parquet files with mismatched schemas and schema 
> merging is disabled, this may result in very confusing 
> UnsupportedOperationException and ParquetDecodingException errors from 
> Parquet.
> e.g.
> {code:java}
> Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/")
> Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/")
> spark.read.parquet(s"$path/").collect()
> {code}
> Would result in the UnsupportedOperationException stack trace quoted in full 
> earlier in this thread.

[jira] [Created] (SPARK-23822) Improve error message for Parquet schema mismatches

2018-03-29 Thread Yuchen Huo (JIRA)
Yuchen Huo created SPARK-23822:
--

 Summary: Improve error message for Parquet schema mismatches
 Key: SPARK-23822
 URL: https://issues.apache.org/jira/browse/SPARK-23822
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.3.0
Reporter: Yuchen Huo


If a user attempts to read Parquet files with mismatched schemas and schema 
merging is disabled, this may result in very confusing 
UnsupportedOperationException and ParquetDecodingException errors from Parquet.

e.g.
{code:java}
Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/")
Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/")

spark.read.parquet(s"$path/").collect()
{code}
Would result in

 


