[jira] [Commented] (SPARK-41232) High-order function: array_append
[ https://issues.apache.org/jira/browse/SPARK-41232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641272#comment-17641272 ]

Senthil Kumar commented on SPARK-41232:
---------------------------------------

[~podongfeng] Shall I work on this?

> High-order function: array_append
> ---------------------------------
>
>                 Key: SPARK-41232
>                 URL: https://issues.apache.org/jira/browse/SPARK-41232
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>
> refer to
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_append.html

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
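The Snowflake function referenced in the ticket above appends an element to the end of an array. As a plain-Python illustration of the intended semantics only (the real PySpark function would operate on Column objects, not Python lists):

```python
def array_append(arr, elem):
    """Return a new list with elem appended to the end of arr.

    Illustrative sketch of the documented array_append semantics;
    the input array is left unmodified.
    """
    return list(arr) + [elem]

print(array_append([1, 2, 3], 4))  # [1, 2, 3, 4]
```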
[jira] [Commented] (SPARK-40367) Total size of serialized results of 3730 tasks (64.0 GB) is bigger than spark.driver.maxResultSize (64.0 GB)
[ https://issues.apache.org/jira/browse/SPARK-40367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17606414#comment-17606414 ]

Senthil Kumar commented on SPARK-40367:
---------------------------------------

Hi [~jackyjfhu],

Check whether the rows/bytes you are sending back to the driver exceed "spark.driver.maxResultSize". If so, keep increasing "spark.driver.maxResultSize" until the error stops, but while increasing it be careful that it does not exceed the driver memory.

_Note: driver-memory > spark.driver.maxResultSize > rows/bytes sent to the driver_

> Total size of serialized results of 3730 tasks (64.0 GB) is bigger than
> spark.driver.maxResultSize (64.0 GB)
> ------------------------------------------------------------------------
>
>                 Key: SPARK-40367
>                 URL: https://issues.apache.org/jira/browse/SPARK-40367
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2
>            Reporter: jackyjfhu
>            Priority: Blocker
>
> I use this code: spark.sql("xx").selectExpr(spark.table(target).columns:_*).write.mode("overwrite").insertInto(target), and I get this error:
>
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 3730 tasks (64.0 GB) is bigger than spark.driver.maxResultSize (64.0 GB)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1609)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1597)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1596)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1596)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at scala.Option.foreach(Option.scala:257)
>   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1830)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1779)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1768)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
>   at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:304)
>   at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:76)
>   at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:73)
>   at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:97)
>   at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
>   at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
>   at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>
> --conf spark.driver.maxResultSize=64g
> --conf spark.sql.broadcastTimeout=36000
> --conf spark.sql.autoBroadcastJoinThreshold=204857600
> --conf spark.memory.offHeap.enabled=true
> --conf spark.memory.offHeap.size=4g
> --num-exe
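The sizing rule quoted in the comment above (driver-memory > spark.driver.maxResultSize > rows/bytes sent to the driver) can be checked mechanically. A minimal sketch with hypothetical helper and parameter names, not a Spark API:

```python
def validate_driver_sizing(driver_memory_gb, max_result_size_gb, expected_result_gb):
    """Check the rule of thumb from the comment:
    driver-memory > spark.driver.maxResultSize > bytes collected to the driver.
    Returns 'ok' or a hint about which knob to adjust."""
    if not driver_memory_gb > max_result_size_gb:
        return "raise driver-memory above spark.driver.maxResultSize"
    if not max_result_size_gb > expected_result_gb:
        return "raise spark.driver.maxResultSize above the expected result size"
    return "ok"

# The failing case from this ticket: 64.0 GB of serialized results
# against a 64 GB spark.driver.maxResultSize limit.
print(validate_driver_sizing(driver_memory_gb=64, max_result_size_gb=64, expected_result_gb=64.0))
```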
[jira] [Commented] (SPARK-38213) support Metrics information report to kafkaSink.
[ https://issues.apache.org/jira/browse/SPARK-38213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492391#comment-17492391 ]

Senthil Kumar commented on SPARK-38213:
---------------------------------------

Working on this

> support Metrics information report to kafkaSink.
> ------------------------------------------------
>
>                 Key: SPARK-38213
>                 URL: https://issues.apache.org/jira/browse/SPARK-38213
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 3.2.1
>            Reporter: YuanGuanhu
>            Priority: Major
>
> Spark now supports ConsoleSink/CsvSink/GraphiteSink/JmxSink etc. Now we want to report metrics information to Kafka as well; we can work to support this.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (SPARK-37936) Use error classes in the parsing errors of intervals
[ https://issues.apache.org/jira/browse/SPARK-37936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480099#comment-17480099 ]

Senthil Kumar commented on SPARK-37936:
---------------------------------------

[~maxgekk], I have queries that match:
 * invalidIntervalFormError - "SELECT INTERVAL '1 DAY 2' HOUR"
 * fromToIntervalUnsupportedError - "SELECT extract(MONTH FROM INTERVAL '2021-11' YEAR TO DAY)"

It would be helpful if you could share queries for the scenarios below:
 * moreThanOneFromToUnitInIntervalLiteralError
 * invalidIntervalLiteralError
 * invalidFromToUnitValueError
 * mixedIntervalUnitsError

> Use error classes in the parsing errors of intervals
> ----------------------------------------------------
>
>                 Key: SPARK-37936
>                 URL: https://issues.apache.org/jira/browse/SPARK-37936
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Max Gekk
>            Priority: Major
>
> Modify the following methods in QueryParsingErrors to use error classes:
> * moreThanOneFromToUnitInIntervalLiteralError
> * invalidIntervalLiteralError
> * invalidIntervalFormError
> * invalidFromToUnitValueError
> * fromToIntervalUnsupportedError
> * mixedIntervalUnitsError
> Throw an implementation of SparkThrowable. Also write a test for every error in QueryParsingErrorsSuite.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (SPARK-37944) Use error classes in the execution errors of casting
[ https://issues.apache.org/jira/browse/SPARK-37944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477551#comment-17477551 ]

Senthil Kumar commented on SPARK-37944:
---------------------------------------

I will work on this

> Use error classes in the execution errors of casting
> ----------------------------------------------------
>
>                 Key: SPARK-37944
>                 URL: https://issues.apache.org/jira/browse/SPARK-37944
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Max Gekk
>            Priority: Major
>
> Migrate the following errors in QueryExecutionErrors to use error classes:
> * failedToCastValueToDataTypeForPartitionColumnError
> * invalidInputSyntaxForNumericError
> * cannotCastToDateTimeError
> * invalidInputSyntaxForBooleanError
> * nullLiteralsCannotBeCastedError
> Throw an implementation of SparkThrowable. Also write a test for every error in QueryExecutionErrorsSuite.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (SPARK-37945) Use error classes in the execution errors of arithmetic ops
[ https://issues.apache.org/jira/browse/SPARK-37945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477550#comment-17477550 ]

Senthil Kumar commented on SPARK-37945:
---------------------------------------

I will work on this

> Use error classes in the execution errors of arithmetic ops
> -----------------------------------------------------------
>
>                 Key: SPARK-37945
>                 URL: https://issues.apache.org/jira/browse/SPARK-37945
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Max Gekk
>            Priority: Major
>
> Migrate the following errors in QueryExecutionErrors to use error classes:
> * overflowInSumOfDecimalError
> * overflowInIntegralDivideError
> * arithmeticOverflowError
> * unaryMinusCauseOverflowError
> * binaryArithmeticCauseOverflowError
> * unscaledValueTooLargeForPrecisionError
> * decimalPrecisionExceedsMaxPrecisionError
> * outOfDecimalTypeRangeError
> * integerOverflowError
> Throw an implementation of SparkThrowable. Also write a test for every error in QueryExecutionErrorsSuite.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (SPARK-37940) Use error classes in the compilation errors of partitions
[ https://issues.apache.org/jira/browse/SPARK-37940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477549#comment-17477549 ]

Senthil Kumar commented on SPARK-37940:
---------------------------------------

I will work on this

> Use error classes in the compilation errors of partitions
> ---------------------------------------------------------
>
>                 Key: SPARK-37940
>                 URL: https://issues.apache.org/jira/browse/SPARK-37940
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Max Gekk
>            Priority: Major
>
> Migrate the following errors in QueryCompilationErrors to use error classes:
> * unsupportedIfNotExistsError
> * nonPartitionColError
> * missingStaticPartitionColumn
> * alterV2TableSetLocationWithPartitionNotSupportedError
> * invalidPartitionSpecError
> * partitionNotSpecifyLocationUriError
> * describeDoesNotSupportPartitionForV2TablesError
> * tableDoesNotSupportPartitionManagementError
> * tableDoesNotSupportAtomicPartitionManagementError
> * alterTableRecoverPartitionsNotSupportedForV2TablesError
> * partitionColumnNotSpecifiedError
> * invalidPartitionColumnError
> * multiplePartitionColumnValuesSpecifiedError
> * cannotUseDataTypeForPartitionColumnError
> * cannotUseAllColumnsForPartitionColumnsError
> * partitionColumnNotFoundInSchemaError
> * mismatchedTablePartitionColumnError
> Throw an implementation of SparkThrowable. Also write a test for every error in QueryCompilationErrorsSuite.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (SPARK-37939) Use error classes in the parsing errors of properties
[ https://issues.apache.org/jira/browse/SPARK-37939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477548#comment-17477548 ]

Senthil Kumar commented on SPARK-37939:
---------------------------------------

I will work on this

> Use error classes in the parsing errors of properties
> -----------------------------------------------------
>
>                 Key: SPARK-37939
>                 URL: https://issues.apache.org/jira/browse/SPARK-37939
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Max Gekk
>            Priority: Major
>
> Migrate the following errors in QueryParsingErrors to use error classes:
> * cannotCleanReservedNamespacePropertyError
> * cannotCleanReservedTablePropertyError
> * invalidPropertyKeyForSetQuotedConfigurationError
> * invalidPropertyValueForSetQuotedConfigurationError
> * propertiesAndDbPropertiesBothSpecifiedError
> Throw an implementation of SparkThrowable. Also write a test for every error in QueryParsingErrorsSuite.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (SPARK-37936) Use error classes in the parsing errors of intervals
[ https://issues.apache.org/jira/browse/SPARK-37936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477544#comment-17477544 ]

Senthil Kumar commented on SPARK-37936:
---------------------------------------

Working on this

> Use error classes in the parsing errors of intervals
> ----------------------------------------------------
>
>                 Key: SPARK-37936
>                 URL: https://issues.apache.org/jira/browse/SPARK-37936
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Max Gekk
>            Priority: Major
>
> Modify the following methods in QueryParsingErrors to use error classes:
> * moreThanOneFromToUnitInIntervalLiteralError
> * invalidIntervalLiteralError
> * invalidIntervalFormError
> * invalidFromToUnitValueError
> * fromToIntervalUnsupportedError
> * mixedIntervalUnitsError
> Throw an implementation of SparkThrowable. Also write a test for every error in QueryParsingErrorsSuite.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (SPARK-36996) fixing "SQL column nullable setting not retained as part of spark read" issue
[ https://issues.apache.org/jira/browse/SPARK-36996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428141#comment-17428141 ]

Senthil Kumar commented on SPARK-36996:
---------------------------------------

Sample output after this change:

SQL:

mysql> CREATE TABLE Persons(Id int NOT NULL, FirstName varchar(255), LastName varchar(255), Age int);

mysql> desc Persons;
+-----------+--------------+------+-----+---------+-------+
| Field     | Type         | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+-------+
| Id        | int          | NO   |     | NULL    |       |
| FirstName | varchar(255) | YES  |     | NULL    |       |
| LastName  | varchar(255) | YES  |     | NULL    |       |
| Age       | int          | YES  |     | NULL    |       |
+-----------+--------------+------+-----+---------+-------+

Spark:

scala> val df = spark.read.format("jdbc").option("database","Test_DB").option("user", "root").option("password", "").option("driver", "com.mysql.cj.jdbc.Driver").option("url", "jdbc:mysql://localhost:3306/Test_DB").option("dbtable", "Persons").load()
df: org.apache.spark.sql.DataFrame = [Id: int, FirstName: string ... 2 more fields]

scala> df.printSchema()
root
 |-- Id: integer (nullable = false)
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- Age: integer (nullable = true)

And for TIMESTAMP columns:

SQL:

create table timestamp_test(id int(11), time_stamp timestamp not null default current_timestamp);

Spark:

scala> val df = spark.read.format("jdbc").option("database","Test_DB").option("user", "root").option("password", "").option("driver", "com.mysql.cj.jdbc.Driver").option("url", "jdbc:mysql://localhost:3306/Test_DB").option("dbtable", "timestamp_test").load()
df: org.apache.spark.sql.DataFrame = [id: int, time_stamp: timestamp]

scala> df.printSchema()
root
 |-- id: integer (nullable = true)
 |-- time_stamp: timestamp (nullable = true)

> fixing "SQL column nullable setting not retained as part of spark read" issue
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-36996
>                 URL: https://issues.apache.org/jira/browse/SPARK-36996
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.0, 3.1.1, 3.1.2
>            Reporter: Senthil Kumar
>            Priority: Major
>
> SQL 'nullable' columns do not retain their 'nullable' setting when read through Spark's jdbc format.
>
> SQL:
>
> mysql> CREATE TABLE Persons(Id int NOT NULL, FirstName varchar(255), LastName varchar(255), Age int);
>
> mysql> desc Persons;
> +-----------+--------------+------+-----+---------+-------+
> | Field     | Type         | Null | Key | Default | Extra |
> +-----------+--------------+------+-----+---------+-------+
> | Id        | int          | NO   |     | NULL    |       |
> | FirstName | varchar(255) | YES  |     | NULL    |       |
> | LastName  | varchar(255) | YES  |     | NULL    |       |
> | Age       | int          | YES  |     | NULL    |       |
> +-----------+--------------+------+-----+---------+-------+
>
> But in Spark we get all the columns as "Nullable":
> =====
> scala> val df = spark.read.format("jdbc").option("database","Test_DB").option("user", "root").option("password", "").option("driver", "com.mysql.cj.jdbc.Driver").option("url", "jdbc:mysql://localhost:3306/Test_DB").option("dbtable", "Persons").load()
> df: org.apache.spark.sql.DataFrame = [Id: int, FirstName: string ... 2 more fields]
>
> scala> df.printSchema()
> root
>  |-- Id: integer (nullable = true)
>  |-- FirstName: string (nullable = true)
>  |-- LastName: string (nullable = true)
>  |-- Age: integer (nullable = true)
> =====

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (SPARK-36996) fixing "SQL column nullable setting not retained as part of spark read" issue
[ https://issues.apache.org/jira/browse/SPARK-36996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428140#comment-17428140 ]

Senthil Kumar commented on SPARK-36996:
---------------------------------------

We need to consider 2 scenarios:
 # maintain the NULLABLE value as per the SQL metadata for non-timestamp columns
 # set NULLABLE to true (always) for timestamp columns

> fixing "SQL column nullable setting not retained as part of spark read" issue
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-36996
>                 URL: https://issues.apache.org/jira/browse/SPARK-36996
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.0, 3.1.1, 3.1.2
>            Reporter: Senthil Kumar
>            Priority: Major
>
> SQL 'nullable' columns do not retain their 'nullable' setting when read through Spark's jdbc format.
>
> SQL:
>
> mysql> CREATE TABLE Persons(Id int NOT NULL, FirstName varchar(255), LastName varchar(255), Age int);
>
> mysql> desc Persons;
> +-----------+--------------+------+-----+---------+-------+
> | Field     | Type         | Null | Key | Default | Extra |
> +-----------+--------------+------+-----+---------+-------+
> | Id        | int          | NO   |     | NULL    |       |
> | FirstName | varchar(255) | YES  |     | NULL    |       |
> | LastName  | varchar(255) | YES  |     | NULL    |       |
> | Age       | int          | YES  |     | NULL    |       |
> +-----------+--------------+------+-----+---------+-------+
>
> But in Spark we get all the columns as "Nullable":
> =====
> scala> val df = spark.read.format("jdbc").option("database","Test_DB").option("user", "root").option("password", "").option("driver", "com.mysql.cj.jdbc.Driver").option("url", "jdbc:mysql://localhost:3306/Test_DB").option("dbtable", "Persons").load()
> df: org.apache.spark.sql.DataFrame = [Id: int, FirstName: string ... 2 more fields]
>
> scala> df.printSchema()
> root
>  |-- Id: integer (nullable = true)
>  |-- FirstName: string (nullable = true)
>  |-- LastName: string (nullable = true)
>  |-- Age: integer (nullable = true)
> =====

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
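The two scenarios in the comment above can be sketched as a tiny mapping function. This is a hypothetical helper with made-up parameter names, purely to illustrate the rule; the real change would live in Spark's JDBC schema-resolution code:

```python
def jdbc_nullable(sql_null_flag, sql_type):
    """Derive the Spark-side nullability for a JDBC column.

    sql_null_flag: the 'Null' value from the database metadata ('YES' or 'NO').
    sql_type: the column's SQL type name.
    Scenario 2: timestamp columns are always reported nullable.
    Scenario 1: every other column keeps the database's own setting.
    """
    if sql_type.upper().startswith("TIMESTAMP"):
        return True
    return sql_null_flag.upper() == "YES"

print(jdbc_nullable("NO", "int"))        # False: NOT NULL is retained
print(jdbc_nullable("NO", "timestamp"))  # True: timestamps stay nullable
```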
[jira] [Commented] (SPARK-36996) fixing "SQL column nullable setting not retained as part of spark read" issue
[ https://issues.apache.org/jira/browse/SPARK-36996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428104#comment-17428104 ]

Senthil Kumar commented on SPARK-36996:
---------------------------------------

Based on further analysis, Spark always hard-codes "nullable" as "true". This change was included because of https://issues.apache.org/jira/browse/SPARK-19726.

> fixing "SQL column nullable setting not retained as part of spark read" issue
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-36996
>                 URL: https://issues.apache.org/jira/browse/SPARK-36996
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.0, 3.1.1, 3.1.2
>            Reporter: Senthil Kumar
>            Priority: Major
>
> SQL 'nullable' columns do not retain their 'nullable' setting when read through Spark's jdbc format.
>
> SQL:
>
> mysql> CREATE TABLE Persons(Id int NOT NULL, FirstName varchar(255), LastName varchar(255), Age int);
>
> mysql> desc Persons;
> +-----------+--------------+------+-----+---------+-------+
> | Field     | Type         | Null | Key | Default | Extra |
> +-----------+--------------+------+-----+---------+-------+
> | Id        | int          | NO   |     | NULL    |       |
> | FirstName | varchar(255) | YES  |     | NULL    |       |
> | LastName  | varchar(255) | YES  |     | NULL    |       |
> | Age       | int          | YES  |     | NULL    |       |
> +-----------+--------------+------+-----+---------+-------+
>
> But in Spark we get all the columns as "Nullable":
> =====
> scala> val df = spark.read.format("jdbc").option("database","Test_DB").option("user", "root").option("password", "").option("driver", "com.mysql.cj.jdbc.Driver").option("url", "jdbc:mysql://localhost:3306/Test_DB").option("dbtable", "Persons").load()
> df: org.apache.spark.sql.DataFrame = [Id: int, FirstName: string ... 2 more fields]
>
> scala> df.printSchema()
> root
>  |-- Id: integer (nullable = true)
>  |-- FirstName: string (nullable = true)
>  |-- LastName: string (nullable = true)
>  |-- Age: integer (nullable = true)
> =====

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (SPARK-36996) fixing "SQL column nullable setting not retained as part of spark read" issue
[ https://issues.apache.org/jira/browse/SPARK-36996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428105#comment-17428105 ]

Senthil Kumar commented on SPARK-36996:
---------------------------------------

I'm working on this

> fixing "SQL column nullable setting not retained as part of spark read" issue
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-36996
>                 URL: https://issues.apache.org/jira/browse/SPARK-36996
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.0, 3.1.1, 3.1.2
>            Reporter: Senthil Kumar
>            Priority: Major
>
> SQL 'nullable' columns do not retain their 'nullable' setting when read through Spark's jdbc format.
>
> SQL:
>
> mysql> CREATE TABLE Persons(Id int NOT NULL, FirstName varchar(255), LastName varchar(255), Age int);
>
> mysql> desc Persons;
> +-----------+--------------+------+-----+---------+-------+
> | Field     | Type         | Null | Key | Default | Extra |
> +-----------+--------------+------+-----+---------+-------+
> | Id        | int          | NO   |     | NULL    |       |
> | FirstName | varchar(255) | YES  |     | NULL    |       |
> | LastName  | varchar(255) | YES  |     | NULL    |       |
> | Age       | int          | YES  |     | NULL    |       |
> +-----------+--------------+------+-----+---------+-------+
>
> But in Spark we get all the columns as "Nullable":
> =====
> scala> val df = spark.read.format("jdbc").option("database","Test_DB").option("user", "root").option("password", "").option("driver", "com.mysql.cj.jdbc.Driver").option("url", "jdbc:mysql://localhost:3306/Test_DB").option("dbtable", "Persons").load()
> df: org.apache.spark.sql.DataFrame = [Id: int, FirstName: string ... 2 more fields]
>
> scala> df.printSchema()
> root
>  |-- Id: integer (nullable = true)
>  |-- FirstName: string (nullable = true)
>  |-- LastName: string (nullable = true)
>  |-- Age: integer (nullable = true)
> =====

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (SPARK-36996) fixing "SQL column nullable setting not retained as part of spark read" issue
Senthil Kumar created SPARK-36996:
----------------------------------

             Summary: fixing "SQL column nullable setting not retained as part of spark read" issue
                 Key: SPARK-36996
                 URL: https://issues.apache.org/jira/browse/SPARK-36996
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.1.2, 3.1.1, 3.1.0, 3.0.0
            Reporter: Senthil Kumar

SQL 'nullable' columns do not retain their 'nullable' setting when read through Spark's jdbc format.

SQL:

mysql> CREATE TABLE Persons(Id int NOT NULL, FirstName varchar(255), LastName varchar(255), Age int);

mysql> desc Persons;
+-----------+--------------+------+-----+---------+-------+
| Field     | Type         | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+-------+
| Id        | int          | NO   |     | NULL    |       |
| FirstName | varchar(255) | YES  |     | NULL    |       |
| LastName  | varchar(255) | YES  |     | NULL    |       |
| Age       | int          | YES  |     | NULL    |       |
+-----------+--------------+------+-----+---------+-------+

But in Spark we get all the columns as "Nullable":
=====
scala> val df = spark.read.format("jdbc").option("database","Test_DB").option("user", "root").option("password", "").option("driver", "com.mysql.cj.jdbc.Driver").option("url", "jdbc:mysql://localhost:3306/Test_DB").option("dbtable", "Persons").load()
df: org.apache.spark.sql.DataFrame = [Id: int, FirstName: string ... 2 more fields]

scala> df.printSchema()
root
 |-- Id: integer (nullable = true)
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- Age: integer (nullable = true)
=====

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (SPARK-36238) Spark UI load event timeline too slow for huge stage
[ https://issues.apache.org/jira/browse/SPARK-36238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423419#comment-17423419 ]

Senthil Kumar commented on SPARK-36238:
---------------------------------------

[~angerszhuuu] Did you try increasing the heap memory for the Spark History Server?

> Spark UI load event timeline too slow for huge stage
> ----------------------------------------------------
>
>                 Key: SPARK-36238
>                 URL: https://issues.apache.org/jira/browse/SPARK-36238
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>    Affects Versions: 3.2.0
>            Reporter: angerszhu
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (SPARK-36901) ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs
[ https://issues.apache.org/jira/browse/SPARK-36901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423409#comment-17423409 ]

Senthil Kumar commented on SPARK-36901:
---------------------------------------

[~rangareddy.av...@gmail.com] This looks like normal Spark behaviour. Because of Spark's lazy evaluation, the "BroadcastExchangeExec" is only executed when the action runs; at that point Spark finds that the cluster lacks resources, logs WARN messages, waits for 300 s, and then logs an ERROR stating that the "BroadcastExchangeExec" timed out.

> ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-36901
>                 URL: https://issues.apache.org/jira/browse/SPARK-36901
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.4.0
>            Reporter: Ranga Reddy
>            Priority: Major
>
> While running a Spark application, if there are no further resources to launch executors, the application fails after 5 minutes with the exception below.
> {code:java}
> 21/09/24 06:12:45 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
> ...
> 21/09/24 06:17:29 ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs.
> java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
> ...
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>     at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>     at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:146)
>     ... 71 more
> 21/09/24 06:17:30 INFO spark.SparkContext: Invoking stop() from shutdown hook
> {code}
> *Expectation*: it should either throw a proper exception saying *"there are no further resources to run the application"* or it should *"wait till it gets resources"*.
>
> To reproduce the issue we have used the following sample code.
> *PySpark Code (test_broadcast_timeout.py):*
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName("Test Broadcast Timeout").getOrCreate()
> t1 = spark.range(5)
> t2 = spark.range(5)
> q = t1.join(t2, t1.id == t2.id)
> q.explain
> q.show()
> {code}
> *Spark Submit Command:*
> {code:java}
> spark-submit --executor-memory 512M test_broadcast_timeout.py
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
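Given the explanation above, two common mitigations can be expressed as submit-time configuration. Both properties (spark.sql.broadcastTimeout, spark.sql.autoBroadcastJoinThreshold) already appear elsewhere in this thread; the values below are illustrative only, not recommendations:

```shell
# Option 1: raise the broadcast timeout (seconds; the default is 300).
spark-submit --conf spark.sql.broadcastTimeout=600 test_broadcast_timeout.py

# Option 2: disable automatic broadcast joins entirely (threshold -1),
# so Spark falls back to a sort-merge join instead of broadcasting.
spark-submit --conf spark.sql.autoBroadcastJoinThreshold=-1 test_broadcast_timeout.py
```

Neither option frees cluster resources; they only change how the broadcast step behaves while the application is waiting for executors.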
[jira] [Comment Edited] (SPARK-36861) Partition columns are overly eagerly parsed as dates
[ https://issues.apache.org/jira/browse/SPARK-36861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423402#comment-17423402 ]

Senthil Kumar edited comment on SPARK-36861 at 10/1/21, 7:24 PM:
-----------------------------------------------------------------

Yes, in Spark 3.3 the hour column is created as "DateType", but I can still see the hour part in the subdirs created:

===
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.0-SNAPSHOT
      /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = Seq(("2021-01-01T00", 0), ("2021-01-01T01", 1), ("2021-01-01T02", 2)).toDF("hour", "i")
df: org.apache.spark.sql.DataFrame = [hour: string, i: int]

scala> df.write.partitionBy("hour").parquet("/tmp/t1")

scala> spark.read.parquet("/tmp/t1").schema
res1: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(hour,DateType,true))
===

and the subdirs created are:

===
ls -l
total 0
-rw-r--r--  1 senthilkumar  wheel    0 Oct  2 00:44 _SUCCESS
drwxr-xr-x  4 senthilkumar  wheel  128 Oct  2 00:44 hour=2021-01-01T00
drwxr-xr-x  4 senthilkumar  wheel  128 Oct  2 00:44 hour=2021-01-01T01
drwxr-xr-x  4 senthilkumar  wheel  128 Oct  2 00:44 hour=2021-01-01T02
===

It would be helpful if you could share the list of sub-dirs created in your case.

> Partition columns are overly eagerly parsed as dates
> ----------------------------------------------------
>
>                 Key: SPARK-36861
>                 URL: https://issues.apache.org/jira/browse/SPARK-36861
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Tanel Kiis
>            Priority: Blocker
>
> I have an input directory with subdirs:
> * hour=2021-01-01T00
> * hour=2021-01-01T01
> * hour=2021-01-01T02
> * ...
>
> In spark 3.1 the 'hour' column is parsed as a string type, but in the 3.2 RC it is parsed as date type and the hour part is lost.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (SPARK-36861) Partition columns are overly eagerly parsed as dates
[ https://issues.apache.org/jira/browse/SPARK-36861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423402#comment-17423402 ] Senthil Kumar commented on SPARK-36861: --- Yes in Spark 3.3 hour column is created as "DateType" but I could see hour part in subdirs created === Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.3.0-SNAPSHOT /_/ Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_292) Type in expressions to have them evaluated. Type :help for more information. scala> val df = Seq(("2021-01-01T00", 0), ("2021-01-01T01", 1), ("2021-01-01T02", 2)).toDF("hour", "i") df: org.apache.spark.sql.DataFrame = [hour: string, i: int] scala> df.write.partitionBy("hour").parquet("/tmp/t1") scala> spark.read.parquet("/tmp/t1").schema res1: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(hour,DateType,true)) scala> === and subdirs created are === ls -l total 0 -rw-r--r-- 1 senthilkumar wheel 0 Oct 2 00:44 _SUCCESS drwxr-xr-x 4 senthilkumar wheel 128 Oct 2 00:44 hour=2021-01-01T00 drwxr-xr-x 4 senthilkumar wheel 128 Oct 2 00:44 hour=2021-01-01T01 drwxr-xr-x 4 senthilkumar wheel 128 Oct 2 00:44 hour=2021-01-01T02 === It will be helpful if you share the list of sub-dirs created in your case. > Partition columns are overly eagerly parsed as dates > > > Key: SPARK-36861 > URL: https://issues.apache.org/jira/browse/SPARK-36861 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Tanel Kiis >Priority: Blocker > > I have an input directory with subdirs: > * hour=2021-01-01T00 > * hour=2021-01-01T01 > * hour=2021-01-01T02 > * ... > in spark 3.1 the 'hour' column is parsed as a string type, but in 3.2 RC it > is parsed as date type and the hour part is lost. 
[jira] [Commented] (SPARK-36861) Partition columns are overly eagerly parsed as dates
[ https://issues.apache.org/jira/browse/SPARK-36861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421328#comment-17421328 ] Senthil Kumar commented on SPARK-36861: --- [~tanelk] This issue is not reproducible even in 3.1.2 root |-- i: integer (nullable = true) |-- hour: string (nullable = true) > Partition columns are overly eagerly parsed as dates > > > Key: SPARK-36861 > URL: https://issues.apache.org/jira/browse/SPARK-36861 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Tanel Kiis >Priority: Blocker > > I have an input directory with subdirs: > * hour=2021-01-01T00 > * hour=2021-01-01T01 > * hour=2021-01-01T02 > * ... > in spark 3.1 the 'hour' column is parsed as a string type, but in 3.2 RC it > is parsed as date type and the hour part is lost.
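For anyone hitting this while the ticket is open, a possible workaround sketch: Spark's partition-value type inference can be switched off so that `hour=2021-01-01T00` is read back as a plain string. This assumes the `spark.sql.sources.partitionColumnTypeInference.enabled` conf behaves as its name suggests; verify against your Spark version before relying on it.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: disable partition column type inference so partition values
// like "2021-01-01T00" stay StringType instead of being narrowed to DateType.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
  .getOrCreate()

// With inference off, the "hour" partition column should come back as string,
// preserving the hour suffix that DateType inference would drop.
val df = spark.read.parquet("/tmp/t1")
df.printSchema()
```

Alternatively, supplying an explicit schema to `spark.read.schema(...)` sidesteps inference for that one read without changing a session-wide setting.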
[jira] [Commented] (SPARK-36781) The log could not get the correct line number
[ https://issues.apache.org/jira/browse/SPARK-36781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421281#comment-17421281 ] Senthil Kumar commented on SPARK-36781: --- [~chenxusheng] Could you please share the sample code to simulate this issue? > The log could not get the correct line number > - > > Key: SPARK-36781 > URL: https://issues.apache.org/jira/browse/SPARK-36781 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.6, 3.0.3, 3.1.2 >Reporter: chenxusheng >Priority: Major > > INFO 18:16:46 [Thread-1] > org.apache.spark.internal.Logging$class.logInfo({color:#FF}Logging.scala:54{color}) > MemoryStore cleared > INFO 18:16:46 [Thread-1] > org.apache.spark.internal.Logging$class.logInfo({color:#FF}Logging.scala:54{color}) > BlockManager stopped > INFO 18:16:46 [Thread-1] > org.apache.spark.internal.Logging$class.logInfo({color:#FF}Logging.scala:54{color}) > BlockManagerMaster stopped > INFO 18:16:46 [dispatcher-event-loop-0] > org.apache.spark.internal.Logging$class.logInfo({color:#FF}Logging.scala:54{color}) > OutputCommitCoordinator stopped! > INFO 18:16:46 [Thread-1] > org.apache.spark.internal.Logging$class.logInfo({color:#FF}Logging.scala:54{color}) > Successfully stopped SparkContext > INFO 18:16:46 [Thread-1] > org.apache.spark.internal.Logging$class.logInfo({color:#FF}Logging.scala:54{color}) > Shutdown hook called > all are : {color:#FF}Logging.scala:54{color} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36801) Document change for Spark sql jdbc
[ https://issues.apache.org/jira/browse/SPARK-36801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Senthil Kumar updated SPARK-36801: -- Description: Reading using Spark SQL jdbc DataSource does not maintain nullable type and changes "non nullable" columns to "nullable". For example: mysql> CREATE TABLE Persons(Id int NOT NULL, FirstName varchar(255), LastName varchar(255), Age int); Query OK, 0 rows affected (0.04 sec) mysql> show tables; +---+ | Tables_in_test_db | +---+ | Persons | +---+ 1 row in set (0.00 sec) mysql> desc Persons; +---+--+--+-+-+---+ | Field | Type | Null | Key | Default | Extra | +---+--+--+-+-+---+ | Id | int | NO | | NULL | | | FirstName | varchar(255) | YES | | NULL | | | LastName | varchar(255) | YES | | NULL | | | Age | int | YES | | NULL | | +---+--+--+-+-+---+ {color:#cc7832}val {color}df = spark.read.format({color:#6a8759}"jdbc"{color}) .option({color:#6a8759}"database"{color}{color:#cc7832},{color}{color:#6a8759}"Test_DB"{color}) .option({color:#6a8759}"user"{color}{color:#cc7832}, {color}{color:#6a8759}"root"{color}) .option({color:#6a8759}"password"{color}{color:#cc7832}, {color}{color:#6a8759}""{color}) .option({color:#6a8759}"driver"{color}{color:#cc7832}, {color}{color:#6a8759}"com.mysql.cj.jdbc.Driver"{color}) .option({color:#6a8759}"url"{color}{color:#cc7832}, {color}{color:#6a8759}"jdbc:mysql://localhost:3306/Test_DB"{color}) .option({color:#6a8759}"query"{color}{color:#cc7832}, {color}{color:#6a8759}"(select * from Persons)"{color}) .load() df.printSchema() *output:* root |-- Id: integer (nullable = true) |-- FirstName: string (nullable = true) |-- LastName: string (nullable = true) |-- Age: integer (nullable = true) So we need to add a note, in Documentation[1], "All columns are automatically converted to be nullable for compatibility reasons." 
Ref: [1 ][https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#jdbc-to-other-databases] was: Reading using Spark SQL jdbc DataSource does not maintain nullable type and changes "non nullable" columns to "nullable". So we need to add a note, in Documentation[1], "All columns are automatically converted to be nullable for compatibility reasons." [1 ]https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#jdbc-to-other-databases > Document change for Spark sql jdbc > -- > > Key: SPARK-36801 > URL: https://issues.apache.org/jira/browse/SPARK-36801 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.0, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2 >Reporter: Senthil Kumar >Priority: Trivial > > Reading using Spark SQL jdbc DataSource does not maintain nullable type and > changes "non nullable" columns to "nullable". > > For example: > mysql> CREATE TABLE Persons(Id int NOT NULL, FirstName varchar(255), LastName > varchar(255), Age int); > Query OK, 0 rows affected (0.04 sec) > mysql> show tables; > +---+ > | Tables_in_test_db | > +---+ > | Persons | > +---+ > 1 row in set (0.00 sec) > mysql> desc Persons; > +---+--+--+-+-+---+ > | Field | Type | Null | Key | Default | Extra | > +---+--+--+-+-+---+ > | Id | int | NO | | NULL | | > | FirstName | varchar(255) | YES | | NULL | | > | LastName | varchar(255) | YES | | NULL | | > | Age | int | YES | | NULL | | > +---+--+--+-+-+---+ > > > {color:#cc7832}val {color}df = spark.read.format({color:#6a8759}"jdbc"{color}) > > .option({color:#6a8759}"database"{color}{color:#cc7832},{color}{color:#6a8759}"Test_DB"{color}) > .option({color:#6a8759}"user"{color}{color:#cc7832}, > {color}{color:#6a8759}"root"{color}) > .option({color:#6a8759}"password"{color}{color:#cc7832}, > {color}{color:#6a8759}""{color}) > .option({color:#6a8759}"driver"{color}{color:#cc7832}, > {color}{color:#6a8759}"com.mysql.cj.jdbc.Driver"{color}) > .option({color:#6a8759}"url"{color}{color:#cc7832}, > 
{color}{color:#6a8759}"jdbc:mysql://localhost:3306/Test_DB"{color}) > .option({color:#6a8759}"query"{color}{color:#cc7832}, > {color}{color:#6a8759}"(select * from Persons)"{color}) > .load() > df.printSchema() > > *output:* > > root > |-- Id: integer (nullable = true) > |-- FirstName: string (nullable = true) > |-- LastName: string (nullable = true) > |-- Age: integer (nullable = true) > > > So we need to add a note, in Documentation[1], "All columns are automatically > converted to be nullable for compatibility reasons." > Ref: > [1 > ][https://spar
[jira] [Updated] (SPARK-36801) Document change for Spark sql jdbc
[ https://issues.apache.org/jira/browse/SPARK-36801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Senthil Kumar updated SPARK-36801: -- Description: Reading using Spark SQL jdbc DataSource does not maintain nullable type and changes "non nullable" columns to "nullable". So we need to add a note, in Documentation[1], "All columns are automatically converted to be nullable for compatibility reasons." [1 ]https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#jdbc-to-other-databases was: Reading using Spark SQL jdbc DataSource does not maintain nullable type and changes "non nullable" columns to "nullable". So we need to add a note, in Documentation[1], "{color:#a9b7c6}All columns are automatically converted to be nullable for compatibility reasons.{color}" [1 ]https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#jdbc-to-other-databases > Document change for Spark sql jdbc > -- > > Key: SPARK-36801 > URL: https://issues.apache.org/jira/browse/SPARK-36801 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.0, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2 >Reporter: Senthil Kumar >Priority: Trivial > > Reading using Spark SQL jdbc DataSource does not maintain nullable type and > changes "non nullable" columns to "nullable". > So we need to add a note, in Documentation[1], "All columns are automatically > converted to be nullable for compatibility reasons." > > [1 > ]https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#jdbc-to-other-databases -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36801) Document change for Spark sql jdbc
Senthil Kumar created SPARK-36801: - Summary: Document change for Spark sql jdbc Key: SPARK-36801 URL: https://issues.apache.org/jira/browse/SPARK-36801 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 3.1.2, 3.1.1, 3.1.0, 3.0.3, 3.0.2, 3.0.0 Reporter: Senthil Kumar Reading using Spark SQL jdbc DataSource does not maintain nullable type and changes "non nullable" columns to "nullable". So we need to add a note, in Documentation[1], "{color:#a9b7c6}All columns are automatically converted to be nullable for compatibility reasons.{color}" [1 ]https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#jdbc-to-other-databases -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
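Since the JDBC source forces every column to nullable, callers who know the database-side constraints (e.g. `Persons.Id` is `NOT NULL` in the example above) sometimes rebuild the DataFrame against a corrected schema. A hedged sketch of that pattern follows; `withNonNullable` is our own helper name, not a Spark API, and the round-trip through `df.rdd` has a planning cost.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}

// Sketch: re-impose nullable = false on columns the source guarantees NOT NULL.
// Spark itself cannot verify the guarantee, so this shifts responsibility to
// the caller -- a null slipping through can surface as a runtime error later.
def withNonNullable(df: DataFrame, nonNullCols: Set[String]): DataFrame = {
  val fixedSchema = StructType(df.schema.map {
    case StructField(name, dt, _, meta) if nonNullCols(name) =>
      StructField(name, dt, nullable = false, meta)
    case other => other
  })
  // Rebuild the DataFrame with the corrected schema.
  df.sparkSession.createDataFrame(df.rdd, fixedSchema)
}

// Usage sketch against the JDBC read from the description:
// val fixed = withNonNullable(df, Set("Id"))
// fixed.printSchema()  // Id: integer (nullable = false)
```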
[jira] [Commented] (SPARK-36743) Backporting SPARK-36327 changes into Spark 2.4 version
[ https://issues.apache.org/jira/browse/SPARK-36743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414746#comment-17414746 ] Senthil Kumar commented on SPARK-36743: --- [~hyukjin.kwon], [~dongjoon]. Thanks for the kind and immediate response on this. > Backporting SPARK-36327 changes into Spark 2.4 version > -- > > Key: SPARK-36743 > URL: https://issues.apache.org/jira/browse/SPARK-36743 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Senthil Kumar >Priority: Minor > > Could we back port changes merged by PR > [https://github.com/apache/spark/pull/33577] into Spark 2.4 too? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36743) Backporting SPARK-36327 changes into Spark 2.4 version
[ https://issues.apache.org/jira/browse/SPARK-36743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414242#comment-17414242 ] Senthil Kumar commented on SPARK-36743: --- [~hyukjin.kwon], [~dongjoon] > Backporting SPARK-36327 changes into Spark 2.4 version > -- > > Key: SPARK-36743 > URL: https://issues.apache.org/jira/browse/SPARK-36743 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Senthil Kumar >Priority: Minor > Fix For: 3.3.0 > > > Could we back port changes merged by PR > [https://github.com/apache/spark/pull/33577] into Spark 2.4 too? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36743) Backporting SPARK-36327 changes into Spark 2.4 version
[ https://issues.apache.org/jira/browse/SPARK-36743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Senthil Kumar updated SPARK-36743: -- Summary: Backporting SPARK-36327 changes into Spark 2.4 version (was: Backporting changes into Spark 2.4 version) > Backporting SPARK-36327 changes into Spark 2.4 version > -- > > Key: SPARK-36743 > URL: https://issues.apache.org/jira/browse/SPARK-36743 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Senthil Kumar >Priority: Minor > Fix For: 3.3.0 > > > Could we back port changes merged by PR > [https://github.com/apache/spark/pull/33577] into Spark 2.4 too? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36743) Backporting changes into Spark 2.4 version
Senthil Kumar created SPARK-36743: - Summary: Backporting changes into Spark 2.4 version Key: SPARK-36743 URL: https://issues.apache.org/jira/browse/SPARK-36743 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Senthil Kumar Fix For: 3.3.0 Could we back port changes merged by PR [https://github.com/apache/spark/pull/33577] into Spark 2.4 too? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-35623) Volcano resource manager for Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-35623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408750#comment-17408750 ] Senthil Kumar edited comment on SPARK-35623 at 9/2/21, 11:56 AM: - [~dipanjanK] Include me too pls. mail id: senthissen...@gmail.com was (Author: senthh): [~dipanjanK] Include me too pls > Volcano resource manager for Spark on Kubernetes > > > Key: SPARK-35623 > URL: https://issues.apache.org/jira/browse/SPARK-35623 > Project: Spark > Issue Type: Brainstorming > Components: Kubernetes >Affects Versions: 3.1.1, 3.1.2 >Reporter: Dipanjan Kailthya >Priority: Minor > Labels: kubernetes, resourcemanager > > Dear Spark Developers, > > Hello from the Netherlands! Posting this here as I still haven't gotten > accepted to post in the spark dev mailing list. > > My team is planning to use spark with Kubernetes support on our shared > (multi-tenant) on premise Kubernetes cluster. However we would like to have > certain scheduling features like fair-share and preemption which as we > understand are not built into the current spark-kubernetes resource manager > yet. We have been working on and are close to a first successful prototype > integration with Volcano ([https://volcano.sh/en/docs/]). Briefly this means > a new resource manager component with lots in common with existing > spark-kubernetes resource manager, but instead of pods it launches Volcano > jobs which delegate the driver and executor pod creation and lifecycle > management to Volcano. We are interested in contributing this to open source, > either directly in spark or as a separate project. > > So, two questions: > > 1. Do the spark maintainers see this as a valuable contribution to the > mainline spark codebase? If so, can we have some guidance on how to publish > the changes? > > 2. Are any other developers / organizations interested to contribute to this > effort? If so, please get in touch. 
> > Best, > Dipanjan
[jira] [Commented] (SPARK-35623) Volcano resource manager for Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-35623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408750#comment-17408750 ] Senthil Kumar commented on SPARK-35623: --- [~dipanjanK] Include me too pls > Volcano resource manager for Spark on Kubernetes > > > Key: SPARK-35623 > URL: https://issues.apache.org/jira/browse/SPARK-35623 > Project: Spark > Issue Type: Brainstorming > Components: Kubernetes >Affects Versions: 3.1.1, 3.1.2 >Reporter: Dipanjan Kailthya >Priority: Minor > Labels: kubernetes, resourcemanager > > Dear Spark Developers, > > Hello from the Netherlands! Posting this here as I still haven't gotten > accepted to post in the spark dev mailing list. > > My team is planning to use spark with Kubernetes support on our shared > (multi-tenant) on premise Kubernetes cluster. However we would like to have > certain scheduling features like fair-share and preemption which as we > understand are not built into the current spark-kubernetes resource manager > yet. We have been working on and are close to a first successful prototype > integration with Volcano ([https://volcano.sh/en/docs/]). Briefly this means > a new resource manager component with lots in common with existing > spark-kubernetes resource manager, but instead of pods it launches Volcano > jobs which delegate the driver and executor pod creation and lifecycle > management to Volcano. We are interested in contributing this to open source, > either directly in spark or as a separate project. > > So, two questions: > > 1. Do the spark maintainers see this as a valuable contribution to the > mainline spark codebase? If so, can we have some guidance on how to publish > the changes? > > 2. Are any other developers / organizations interested to contribute to this > effort? If so, please get in touch. 
> > Best, > Dipanjan
[jira] [Updated] (SPARK-36643) Add more information in ERROR log while SparkConf is modified when spark.sql.legacy.setCommandRejectsSparkCoreConfs is set
[ https://issues.apache.org/jira/browse/SPARK-36643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Senthil Kumar updated SPARK-36643: -- Component/s: SQL > Add more information in ERROR log while SparkConf is modified when > spark.sql.legacy.setCommandRejectsSparkCoreConfs is set > -- > > Key: SPARK-36643 > URL: https://issues.apache.org/jira/browse/SPARK-36643 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.1.2 >Reporter: Senthil Kumar >Priority: Minor > > Right now, by default sql.legacy.setCommandRejectsSparkCoreConfs is set as > true in Spark 3.* versions int order to avoid changing Spark Confs. But from > the error message we get confused if we can not modify/change Spark conf in > Spark 3.* or not. > Current Error Message : > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot > modify the value of a Spark config: spark.driver.host > at > org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:156) > at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:40){code} > > So adding little more information( how to modify Spark Conf), in ERROR log > while SparkConf is modified when > spark.sql.legacy.setCommandRejectsSparkCoreConfs is 'true', will be helpful > to avoid confusions. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-36643) Add more information in ERROR log while SparkConf is modified when spark.sql.legacy.setCommandRejectsSparkCoreConfs is set
[ https://issues.apache.org/jira/browse/SPARK-36643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408282#comment-17408282 ] Senthil Kumar edited comment on SPARK-36643 at 9/1/21, 5:20 PM: New ERROR message will be as below, {code:java} scala> spark.conf.set("spark.driver.host", "localhost") org.apache.spark.sql.AnalysisException: Cannot modify the value of a Spark config: spark.driver.host, please set spark.sql.legacy.setCommandRejectsSparkCoreConfs as 'false' in order to make change value of Spark config: spark.driver.host . at org.apache.spark.sql.errors.QueryCompilationErrors$.cannotModifyValueOfSparkConfigError(QueryCompilationErrors.scala:2336) at org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:157) at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:41) ... 47 elided{code} was (Author: senthh): New ERROR message will be as below, {code:java} scala> spark.conf.set("spark.driver.host", "localhost") org.apache.spark.sql.AnalysisException: Cannot modify the value of a Spark config: spark.driver.host, please set spark.sql.legacy.setCommandRejectsSparkCoreConfs as 'false' in order to make change value of Spark config: spark.driver.host . at org.apache.spark.sql.errors.QueryCompilationErrors$.cannotModifyValueOfSparkConfigError(QueryCompilationErrors.scala:2336) at org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:157) at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:41){code} ... 
47 elided > Add more information in ERROR log while SparkConf is modified when > spark.sql.legacy.setCommandRejectsSparkCoreConfs is set > -- > > Key: SPARK-36643 > URL: https://issues.apache.org/jira/browse/SPARK-36643 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Senthil Kumar >Priority: Minor > > Right now, by default sql.legacy.setCommandRejectsSparkCoreConfs is set as > true in Spark 3.* versions int order to avoid changing Spark Confs. But from > the error message we get confused if we can not modify/change Spark conf in > Spark 3.* or not. > Current Error Message : > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot > modify the value of a Spark config: spark.driver.host > at > org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:156) > at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:40){code} > > So adding little more information( how to modify Spark Conf), in ERROR log > while SparkConf is modified when > spark.sql.legacy.setCommandRejectsSparkCoreConfs is 'true', will be helpful > to avoid confusions. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36643) Add more information in ERROR log while SparkConf is modified when spark.sql.legacy.setCommandRejectsSparkCoreConfs is set
[ https://issues.apache.org/jira/browse/SPARK-36643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408282#comment-17408282 ] Senthil Kumar commented on SPARK-36643: --- New ERROR message will be as below, {code:java} scala> spark.conf.set("spark.driver.host", "localhost") org.apache.spark.sql.AnalysisException: Cannot modify the value of a Spark config: spark.driver.host, please set spark.sql.legacy.setCommandRejectsSparkCoreConfs as 'false' in order to make change value of Spark config: spark.driver.host . at org.apache.spark.sql.errors.QueryCompilationErrors$.cannotModifyValueOfSparkConfigError(QueryCompilationErrors.scala:2336) at org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:157) at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:41){code} ... 47 elided > Add more information in ERROR log while SparkConf is modified when > spark.sql.legacy.setCommandRejectsSparkCoreConfs is set > -- > > Key: SPARK-36643 > URL: https://issues.apache.org/jira/browse/SPARK-36643 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Senthil Kumar >Priority: Minor > > Right now, by default sql.legacy.setCommandRejectsSparkCoreConfs is set as > true in Spark 3.* versions int order to avoid changing Spark Confs. But from > the error message we get confused if we can not modify/change Spark conf in > Spark 3.* or not. 
> Current Error Message : > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot > modify the value of a Spark config: spark.driver.host > at > org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:156) > at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:40){code} > > So adding little more information( how to modify Spark Conf), in ERROR log > while SparkConf is modified when > spark.sql.legacy.setCommandRejectsSparkCoreConfs is 'true', will be helpful > to avoid confusions. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36643) Add more information in ERROR log while SparkConf is modified when spark.sql.legacy.setCommandRejectsSparkCoreConfs is set
[ https://issues.apache.org/jira/browse/SPARK-36643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408274#comment-17408274 ] Senthil Kumar commented on SPARK-36643: --- Creating PR for this > Add more information in ERROR log while SparkConf is modified when > spark.sql.legacy.setCommandRejectsSparkCoreConfs is set > -- > > Key: SPARK-36643 > URL: https://issues.apache.org/jira/browse/SPARK-36643 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Senthil Kumar >Priority: Minor > > Right now, by default sql.legacy.setCommandRejectsSparkCoreConfs is set as > true in Spark 3.* versions int order to avoid changing Spark Confs. But from > the error message we get confused if we can not modify/change Spark conf in > Spark 3.* or not. > Current Error Message : > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot > modify the value of a Spark config: spark.driver.host > at > org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:156) > at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:40){code} > > So adding little more information( how to modify Spark Conf), in ERROR log > while SparkConf is modified when > spark.sql.legacy.setCommandRejectsSparkCoreConfs is 'true', will be helpful > to avoid confusions. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36643) Add more information in ERROR log while SparkConf is modified when spark.sql.legacy.setCommandRejectsSparkCoreConfs is set
Senthil Kumar created SPARK-36643: - Summary: Add more information in ERROR log while SparkConf is modified when spark.sql.legacy.setCommandRejectsSparkCoreConfs is set Key: SPARK-36643 URL: https://issues.apache.org/jira/browse/SPARK-36643 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.2 Reporter: Senthil Kumar Right now, by default sql.legacy.setCommandRejectsSparkCoreConfs is set to true in Spark 3.* versions in order to avoid changing Spark confs. But the current error message leaves it unclear whether Spark confs can be modified in Spark 3.* at all. Current Error Message : {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot modify the value of a Spark config: spark.driver.host at org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:156) at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:40){code} So adding a little more information (how to modify the Spark conf) to the ERROR log emitted when a SparkConf is modified while spark.sql.legacy.setCommandRejectsSparkCoreConfs is 'true' will help avoid confusion.
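For context, a sketch of the opt-out the improved message would point users at. This assumes `spark.sql.legacy.setCommandRejectsSparkCoreConfs` is itself modifiable at runtime in the session being used; note that even with the check relaxed, core confs like `spark.driver.host` are effectively static, so changing them on a running driver may have no real effect.

```scala
// Sketch: relax the legacy check before attempting the conf change.
spark.conf.set("spark.sql.legacy.setCommandRejectsSparkCoreConfs", "false")

// With the check off this no longer throws AnalysisException, but the driver
// host of an already-started application is not actually re-bound.
spark.conf.set("spark.driver.host", "localhost")
```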
[jira] [Commented] (SPARK-36604) timestamp type column analyze result is wrong
[ https://issues.apache.org/jira/browse/SPARK-36604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407068#comment-17407068 ] Senthil Kumar commented on SPARK-36604: --- [~yghu] I tested this scenario in Spark 2.4, but I don't see this issue occurring. Are you seeing it only in Spark 3.1.1?
{panel}
scala> spark.sql("create table c(a timestamp)")
res16: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert into c select '2021-08-15 15:30:01'")
res17: org.apache.spark.sql.DataFrame = []

scala> spark.sql("analyze table c compute statistics for columns a")
res18: org.apache.spark.sql.DataFrame = []

scala> spark.sql("desc formatted c a").show(true)
+--------------+--------------------+
|     info_name|          info_value|
+--------------+--------------------+
|      col_name|                   a|
|     data_type|           timestamp|
|       comment|                NULL|
|           min|2021-08-15 15:30:...|
|           max|2021-08-15 15:30:...|
|     num_nulls|                   0|
|distinct_count|                   1|
|   avg_col_len|                   8|
|   max_col_len|                   8|
|     histogram|                NULL|
+--------------+--------------------+
{panel}
> timestamp type column analyze result is wrong
> ---------------------------------------------
>
>                 Key: SPARK-36604
>                 URL: https://issues.apache.org/jira/browse/SPARK-36604
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1, 3.1.2
>         Environment: Spark 3.1.1
>            Reporter: YuanGuanhu
>            Priority: Major
>
> When we create a table with a timestamp column, the min and max values in the
> analyze result for that column are wrong.
> eg:
> {code}
> select * from a;
> {code}
> {code}
> 2021-08-15 15:30:01
> Time taken: 2.789 seconds, Fetched 1 row(s)
> spark-sql> desc formatted a a;
> col_name        a
> data_type       timestamp
> comment         NULL
> min             2021-08-15 07:30:01.00
> max             2021-08-15 07:30:01.00
> num_nulls       0
> distinct_count  1
> avg_col_len     8
> max_col_len     8
> histogram       NULL
> Time taken: 0.278 seconds, Fetched 10 row(s)
> spark-sql> desc a;
> a       timestamp       NULL
> Time taken: 1.432 seconds, Fetched 1 row(s)
> {code}
>
> Reproduce steps:
> {code}
> create table a(a timestamp);
> insert into a select '2021-08-15 15:30:01';
> analyze table a compute statistics for columns a;
> desc formatted a a;
> select * from a;
> {code}
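The 8-hour gap between the inserted value (15:30:01) and the reported min/max (07:30:01) is consistent with a session-timezone conversion problem: Spark stores timestamps internally as microseconds since the UTC epoch, and the column statistics appear to be formatted as UTC instead of being converted back to the session timezone. A minimal Python sketch of that arithmetic, assuming the reporter's session runs at UTC+8 (e.g. Asia/Shanghai):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# The value the user inserted, interpreted in a UTC+8 session timezone.
local = datetime(2021, 8, 15, 15, 30, 1, tzinfo=ZoneInfo("Asia/Shanghai"))

# Rendering the same instant in UTC (i.e. skipping the conversion back to
# the session zone) reproduces the value shown in the analyze stats.
as_utc = local.astimezone(timezone.utc)
print(as_utc.strftime("%Y-%m-%d %H:%M:%S"))  # 2021-08-15 07:30:01
```

This does not pinpoint where in Spark the conversion is skipped; it only shows that the observed min/max is the same instant rendered in UTC rather than a corrupted value.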
[jira] [Updated] (SPARK-36412) Add Test Coverage to meet viewFs(Hadoop federation) scenario
[ https://issues.apache.org/jira/browse/SPARK-36412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Senthil Kumar updated SPARK-36412: -- Summary: Add Test Coverage to meet viewFs(Hadoop federation) scenario (was: Create coverage Test to meet viewFs(Hadoop federation) scenario) > Add Test Coverage to meet viewFs(Hadoop federation) scenario > - > > Key: SPARK-36412 > URL: https://issues.apache.org/jira/browse/SPARK-36412 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: Senthil Kumar >Priority: Major > > Create coverage Test to meet viewFs(Hadoop federation) scenario. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36412) Create coverage Test to meet viewFs(Hadoop federation) scenario
[ https://issues.apache.org/jira/browse/SPARK-36412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17392976#comment-17392976 ] Senthil Kumar commented on SPARK-36412: --- I am working on this > Create coverage Test to meet viewFs(Hadoop federation) scenario > --- > > Key: SPARK-36412 > URL: https://issues.apache.org/jira/browse/SPARK-36412 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: Senthil Kumar >Priority: Major > > Create coverage Test to meet viewFs(Hadoop federation) scenario. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36412) Create coverage Test to meet viewFs(Hadoop federation) scenario
Senthil Kumar created SPARK-36412: - Summary: Create coverage Test to meet viewFs(Hadoop federation) scenario Key: SPARK-36412 URL: https://issues.apache.org/jira/browse/SPARK-36412 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.2 Reporter: Senthil Kumar Create coverage Test to meet viewFs(Hadoop federation) scenario. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36327) Spark sql creates staging dir inside database directory rather than creating inside table directory
[ https://issues.apache.org/jira/browse/SPARK-36327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390851#comment-17390851 ] Senthil Kumar commented on SPARK-36327: --- Hi [~sunchao] Hive creates .staging directories inside the "/db/table/" location, but spark-sql creates them inside the "/db/" location when Hadoop federation (viewFs) is used. Other filesystems such as HDFS behave as expected (.staging is created inside /db/table/).

Hive:
{code}
# beeline
> use dicedb;
> insert into table part_test partition (j=1) values (1);
...
INFO : Loading data to table dicedb.part_test partition (j=1) from viewfs://cloudera/user/daisuke/dicedb/part_test/j=1/.hive-staging_hive_2021-07-19_13-04-44_989_6775328876605030677-1/-ext-1
{code}

But Spark's behaviour:
{code}
spark-sql> use dicedb;
spark-sql> insert into table part_test partition (j=2) values (2);
21/07/19 13:07:37 INFO FileUtils: Creating directory if it doesn't exist: viewfs://cloudera/user/daisuke/dicedb/.hive-staging_hive_2021-07-19_13-07-37_317_5083528872437596950-1
...
{code}

The reason we need this change: if spark-sql is allowed to create the .staging directory inside the /db/ location, we end up with a security problem, because permission on the "viewfs:///db/" location would have to be granted to every user who submits Spark jobs.
After this change is applied, spark-sql creates .staging inside /db/table/, similar to Hive, as below:
{code}
spark-sql> use dicedb;
21/07/28 00:22:47 INFO SparkSQLCLIDriver: Time taken: 0.929 seconds
spark-sql> insert into table part_test partition (j=8) values (8);
21/07/28 00:23:25 INFO HiveMetaStoreClient: Closed a connection to metastore, current connections: 1
21/07/28 00:23:26 INFO FileUtils: Creating directory if it doesn't exist: viewfs://cloudera/user/daisuke/dicedb/part_test/.hive-staging_hive_2021-07-28_00-23-26_109_4548714524589026450-1
{code}

Why this issue occurs in spark-sql but not in Hive: Hive passes a "/db/table/tmp" directory structure as the path, so path.getParent returns "/db/table/". Spark passes just "/db/table", so calling path.getParent is not appropriate for Hadoop federation (viewFs).

> Spark sql creates staging dir inside database directory rather than creating
> inside table directory
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-36327
>                 URL: https://issues.apache.org/jira/browse/SPARK-36327
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 3.1.2
>            Reporter: Senthil Kumar
>            Priority: Minor
>
> Spark sql creates staging dir inside database directory rather than creating
> inside table directory.
>
> This arises only when viewfs:// is configured. When the location is hdfs://,
> it doesn't occur.
>
> Based on further investigation in file *SaveAsHiveFile.scala*, I could see
> that the directory hierarchy has not been properly handled for the viewFS
> condition.
> The parent path (db path) is passed rather than the actual directory (table
> location).
> {code}
> // Mostly copied from Context.java#getExternalTmpPath of Hive 1.2
> private def newVersionExternalTempPath(
>     path: Path,
>     hadoopConf: Configuration,
>     stagingDir: String): Path = {
>   val extURI: URI = path.toUri
>   if (extURI.getScheme == "viewfs") {
>     getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
>   } else {
>     new Path(getExternalScratchDir(extURI, hadoopConf, stagingDir), "-ext-1")
>   }
> }
> {code}
> Please refer to the lines below:
> {code}
> if (extURI.getScheme == "viewfs") {
>   getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
> {code}
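The Hive-vs-Spark difference described above comes down to which path is handed to getParent. An illustrative Python sketch, with pathlib standing in for Hadoop's Path API and the dicedb paths taken from the transcripts in this thread:

```python
from pathlib import PurePosixPath

# Hive passes a child under the table directory, so .parent stays inside the table.
hive_input = PurePosixPath("/user/daisuke/dicedb/part_test/tmp")
print(hive_input.parent)   # /user/daisuke/dicedb/part_test

# Spark passes the table directory itself, so the same .parent call climbs
# out into the database directory, which is where .hive-staging then lands.
spark_input = PurePosixPath("/user/daisuke/dicedb/part_test")
print(spark_input.parent)  # /user/daisuke/dicedb
```

This is why dropping the getParent call for the viewfs branch (when the caller already passes the table location) keeps the staging directory under /db/table/.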
[jira] [Commented] (SPARK-36327) Spark sql creates staging dir inside database directory rather than creating inside table directory
[ https://issues.apache.org/jira/browse/SPARK-36327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17389827#comment-17389827 ] Senthil Kumar commented on SPARK-36327: --- Hi [~dongjoon], [~hyukjin.kwon] Could you please review these minor changes?
[jira] [Updated] (SPARK-36327) Spark sql creates staging dir inside database directory rather than creating inside table directory
[ https://issues.apache.org/jira/browse/SPARK-36327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Senthil Kumar updated SPARK-36327: -- Component/s: SQL
[jira] [Commented] (SPARK-36327) Spark sql creates staging dir inside database directory rather than creating inside table directory
[ https://issues.apache.org/jira/browse/SPARK-36327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17389814#comment-17389814 ] Senthil Kumar commented on SPARK-36327: --- Created PR https://github.com/apache/spark/pull/33577
[jira] [Commented] (SPARK-36327) Spark sql creates staging dir inside database directory rather than creating inside table directory
[ https://issues.apache.org/jira/browse/SPARK-36327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388597#comment-17388597 ] Senthil Kumar commented on SPARK-36327: --- Shall I work on this Jira to fix this issue?
[jira] [Created] (SPARK-36327) Spark sql creates staging dir inside database directory rather than creating inside table directory
Senthil Kumar created SPARK-36327: - Summary: Spark sql creates staging dir inside database directory rather than creating inside table directory Key: SPARK-36327 URL: https://issues.apache.org/jira/browse/SPARK-36327 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2 Reporter: Senthil Kumar

Spark sql creates staging dir inside database directory rather than creating inside table directory.

This arises only when viewfs:// is configured. When the location is hdfs://, it doesn't occur.

Based on further investigation in file *SaveAsHiveFile.scala*, I could see that the directory hierarchy has not been properly handled for the viewFS condition. The parent path (db path) is passed rather than the actual directory (table location).

{code}
// Mostly copied from Context.java#getExternalTmpPath of Hive 1.2
private def newVersionExternalTempPath(
    path: Path,
    hadoopConf: Configuration,
    stagingDir: String): Path = {
  val extURI: URI = path.toUri
  if (extURI.getScheme == "viewfs") {
    getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
  } else {
    new Path(getExternalScratchDir(extURI, hadoopConf, stagingDir), "-ext-1")
  }
}
{code}

Please refer to the lines below:
{code}
if (extURI.getScheme == "viewfs") {
  getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
{code}