[jira] [Created] (SPARK-48458) Dynamic partition override mode might be ignored in certain scenarios causing data loss
Artem Kupchinskiy created SPARK-48458:
--------------------------------------

             Summary: Dynamic partition override mode might be ignored in certain scenarios causing data loss
                 Key: SPARK-48458
                 URL: https://issues.apache.org/jira/browse/SPARK-48458
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.1, 2.4.8, 4.0.0
            Reporter: Artem Kupchinskiy


If an active Spark session is stopped in the middle of an insert into a file system, the session config responsible for the partition overwrite behavior might not be respected. The failure scenario is basically the following:
# The Spark context is stopped just before [getting the partition override mode setting|https://github.com/apache/spark/blob/8bbbde7cb3c396bc369c06853ed3a2ec021a2530/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L69].
# Due to the [fallback config being used when the Spark context is stopped|https://github.com/apache/spark/blob/8bbbde7cb3c396bc369c06853ed3a2ec021a2530/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L121], the mode evaluates to static (the default mode in the default SQLConf used as a fallback).
# The data is cleared completely [here|https://github.com/apache/spark/blob/8bbbde7cb3c396bc369c06853ed3a2ec021a2530/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L131], which is effectively data loss from the perspective of a user who intended to overwrite the data only partially.

This [gist|https://gist.github.com/akupchinskiy/b5f31781d59e5c0e9b172e7de40132cd] reproduces the behavior. On my local machine, it takes 1-3 iterations before the pre-created data is cleared completely.

A mitigation for this bug is to use the explicit write option `partitionOverwriteMode` instead of relying on the session configuration, as sketched below.
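A minimal sketch of that mitigation, pinning the overwrite mode on the write itself so it does not depend on the session config; the DataFrame {{df}}, the partition column {{date}}, and the output path are placeholders for illustration only.

{code:scala}
// Sketch: set partitionOverwriteMode per write instead of relying on
// spark.sql.sources.partitionOverwriteMode from the (possibly stopped) session.
// `df`, the "date" partition column, and the path are assumed placeholders.
df.write
  .mode("overwrite")
  .option("partitionOverwriteMode", "dynamic") // takes precedence over the session conf
  .partitionBy("date")
  .parquet("/tmp/partitioned-output")
{code}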
[jira] [Commented] (SPARK-38116) Ability to turn off auto commit in JDBC source for read only operations
[ https://issues.apache.org/jira/browse/SPARK-38116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489052#comment-17489052 ]

Artem Kupchinskiy commented on SPARK-38116:
-------------------------------------------

[~yoda-mon] Nice catch. Indeed, for Postgres it appears not to be a problem, because it is handled at the dialect layer. Theoretically, it would still be nice to have some control over the connection in the generic case. However, as a downside, users could encounter confusing results if this option were active during DML operations (that is why the proposal mentions the need for guards preventing an illegal option combination; a rough sketch of such a guard follows below). I guess the maintainers should decide whether this could be a useful feature in the general case.

> Ability to turn off auto commit in JDBC source for read only operations
> ------------------------------------------------------------------------
>
>                 Key: SPARK-38116
>                 URL: https://issues.apache.org/jira/browse/SPARK-38116
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: Artem Kupchinskiy
>            Priority: Minor
>
> Currently, all JDBC connections on the executor side always work with the autocommit option set to true.
> However, there are cases where this mode makes it hard to use JdbcRelationProvider at all, e.g. reading huge datasets from Postgres (the whole result set is collected regardless of the fetch size when autocommit is set to true, see https://jdbc.postgresql.org/documentation/91/query.html#query-with-cursor).
> So the proposal is the following:
> # Add a boolean option "autocommit" to the JDBC source, allowing a user to turn off autocommit mode for read-only operations.
> # Add guards which prevent using this option in DML operations.
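A rough, hypothetical sketch of the guard mentioned above. The "autocommit" option, the helper object, and the error message are assumptions for illustration; none of this exists in Spark's JDBC source today.

{code:scala}
// Hypothetical guard: reject a user-supplied autocommit=false on any write/DML path.
// The option name and this helper are illustrative only, not actual Spark code.
object JdbcAutocommitGuard {
  def validateForWrite(options: Map[String, String]): Unit = {
    val autocommit = options.get("autocommit").map(_.toBoolean).getOrElse(true)
    require(autocommit,
      "The 'autocommit' option can only be disabled for read-only operations; " +
        "it is not allowed when writing through the JDBC source.")
  }
}

// Usage sketch: would be called before executing INSERT/UPDATE statements.
// JdbcAutocommitGuard.validateForWrite(Map("autocommit" -> "false"))  // throws
{code}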
[jira] [Created] (SPARK-38116) Ability to turn off auto commit in JDBC source for read only operations
Artem Kupchinskiy created SPARK-38116:
--------------------------------------

             Summary: Ability to turn off auto commit in JDBC source for read only operations
                 Key: SPARK-38116
                 URL: https://issues.apache.org/jira/browse/SPARK-38116
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.2.1
            Reporter: Artem Kupchinskiy


Currently, all JDBC connections on the executor side always work with the autocommit option set to true.

However, there are cases where this mode makes it hard to use JdbcRelationProvider at all, e.g. reading huge datasets from Postgres (the whole result set is collected regardless of the fetch size when autocommit is set to true, see https://jdbc.postgresql.org/documentation/91/query.html#query-with-cursor).

So the proposal is the following:
# Add a boolean option "autocommit" to the JDBC source, allowing a user to turn off autocommit mode for read-only operations (see the usage sketch below).
# Add guards which prevent using this option in DML operations.
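A usage sketch of how the proposed option might look from the user side. The "autocommit" option is only a proposal and does not exist in Spark; the JDBC URL, table name, and credentials are placeholders.

{code:scala}
// Sketch of the *proposed* (not existing) "autocommit" option for a large Postgres read.
// With autocommit off, the Postgres driver can honor fetchsize and stream the result
// set with a cursor instead of materializing it entirely.
val hugeTable = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/warehouse") // placeholder
  .option("dbtable", "public.events")                        // placeholder
  .option("user", "reader")
  .option("password", "secret")
  .option("fetchsize", "10000")
  .option("autocommit", "false") // proposed option; has no effect in current Spark
  .load()
{code}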
[jira] [Updated] (SPARK-33689) ArrayData specialization to enable RowEncoder primitive arrays support
[ https://issues.apache.org/jira/browse/SPARK-33689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Kupchinskiy updated SPARK-33689:
--------------------------------------
    Summary: ArrayData specialization to enable RowEncoder primitive arrays support  (was: RowEncoder primitive arrays support)

> ArrayData specialization to enable RowEncoder primitive arrays support
> ------------------------------------------------------------------------
>
>                 Key: SPARK-33689
>                 URL: https://issues.apache.org/jira/browse/SPARK-33689
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.1
>            Reporter: Artem Kupchinskiy
>            Priority: Major
>
> Currently, even non-nullable array fields are represented as WrappedArray.ofRef in external rows. This leads to a larger memory footprint as well as unnecessary cycles spent on boxing/unboxing operations. Ideally, if arrays are non-nullable and contain primitive values, they should be represented as specialized, lower-overhead versions of WrappedArray.
[jira] [Created] (SPARK-33689) RowEncoder primitive arrays support
Artem Kupchinskiy created SPARK-33689:
--------------------------------------

             Summary: RowEncoder primitive arrays support
                 Key: SPARK-33689
                 URL: https://issues.apache.org/jira/browse/SPARK-33689
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.1
            Reporter: Artem Kupchinskiy


Currently, even non-nullable array fields are represented as WrappedArray.ofRef in external rows. This leads to a larger memory footprint as well as unnecessary cycles spent on boxing/unboxing operations. Ideally, if arrays are non-nullable and contain primitive values, they should be represented as specialized, lower-overhead versions of WrappedArray.
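A small snippet illustrating the behavior the issue describes, assuming Spark 3.0.x with Scala 2.12; the case class and values are placeholders, and the printed class name is what the issue reports rather than something this sketch guarantees.

{code:scala}
// Sketch: even an Array[Double] column comes back in the external Row as a generic
// WrappedArray over boxed java.lang.Double values rather than a primitive-specialized
// wrapper, which is the indirection/boxing cost the issue wants to avoid.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("boxing-demo").getOrCreate()
import spark.implicits._

case class Measurement(values: Array[Double]) // placeholder schema

val row = Seq(Measurement(Array(1.0, 2.0, 3.0))).toDF().head()
// Expected to print something like "scala.collection.mutable.WrappedArray$ofRef".
println(row.get(0).getClass.getName)
{code}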
[jira] [Created] (SPARK-25885) HighlyCompressedMapStatus deserialization optimization
Artem Kupchinskiy created SPARK-25885:
--------------------------------------

             Summary: HighlyCompressedMapStatus deserialization optimization
                 Key: SPARK-25885
                 URL: https://issues.apache.org/jira/browse/SPARK-25885
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.2.0
            Reporter: Artem Kupchinskiy


HighlyCompressedMapStatus uses an unnecessary level of indirection during deserialization and construction: it uses an ArrayBuffer as interim storage before constructing the actual map. Since both methods can be application hot spots under certain workloads, it is worth getting rid of that intermediate level.
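A generic sketch of the pattern being described, not the actual HighlyCompressedMapStatus code: the entry count, the DataInputStream layout, and the plain HashMap are assumptions used only to contrast the two approaches.

{code:scala}
// Indirect vs. direct deserialization of (id, size) entries from a stream.
import java.io.DataInputStream
import scala.collection.mutable

// Indirect version: an extra ArrayBuffer allocation plus a second pass over the data.
def readSizesIndirect(in: DataInputStream): mutable.HashMap[Int, Long] = {
  val count = in.readInt()
  val buffer = mutable.ArrayBuffer.empty[(Int, Long)]
  var i = 0
  while (i < count) { buffer += (in.readInt() -> in.readLong()); i += 1 }
  mutable.HashMap(buffer: _*)
}

// Direct version: fill the map while reading, avoiding the intermediate buffer.
def readSizesDirect(in: DataInputStream): mutable.HashMap[Int, Long] = {
  val count = in.readInt()
  val sizes = new mutable.HashMap[Int, Long]()
  var i = 0
  while (i < count) { sizes.put(in.readInt(), in.readLong()); i += 1 }
  sizes
}
{code}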
[jira] [Commented] (SPARK-21418) NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true
[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174642#comment-16174642 ]

Artem Kupchinskiy commented on SPARK-21418:
-------------------------------------------

There is still a place in FileSourceScanExec.scala where a None.get error could theoretically appear:
{code:scala}
val needsUnsafeRowConversion: Boolean = if (relation.fileFormat.isInstanceOf[ParquetSource]) {
  SparkSession.getActiveSession.get.sessionState.conf.parquetVectorizedReaderEnabled
} else {
  false
}
{code}
I think it is worth guarding this val initialization as well (falling back to a default configuration value when no active session is present), although I have not encountered any None.get errors here so far after applying this patch. A sketch of such a guard follows after the quoted issue below.

> NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-21418
>                 URL: https://issues.apache.org/jira/browse/SPARK-21418
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Daniel Darabos
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 2.2.1, 2.3.0
>
> I don't have a minimal reproducible example yet, sorry. I have the following lines in a unit test for our Spark application:
> {code}
> val df = mySparkSession.read.format("jdbc")
>   .options(Map("url" -> url, "dbtable" -> "test_table"))
>   .load()
> df.show
> println(df.rdd.collect)
> {code}
> The output shows the DataFrame contents from {{df.show}}. But the {{collect}} fails:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.util.NoSuchElementException: None.get
> java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:347)
> at scala.None$.get(Option.scala:345)
> at org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70)
> at org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54)
> at org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52)
> at org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75)
> at org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349)
> at org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75)
> at org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60)
> at org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75)
> at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556)
> at org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451)
> at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576)
> at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480)
> at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477)
> at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474)
> at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOu
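A minimal sketch of the guard suggested in the comment above, shown as a fragment in the same context as the quoted snippet (so `relation` and `ParquetSource` are assumed from FileSourceScanExec); the fallback via SQLConf.get is an assumption of this sketch, not the actual patch.

{code:scala}
// Sketch of the suggested guard (not the actual Spark patch): avoid .get on a
// possibly-empty active session and fall back to a default SQLConf value instead.
val needsUnsafeRowConversion: Boolean =
  if (relation.fileFormat.isInstanceOf[ParquetSource]) {
    SparkSession.getActiveSession
      .map(_.sessionState.conf.parquetVectorizedReaderEnabled)
      .getOrElse(SQLConf.get.parquetVectorizedReaderEnabled) // assumed fallback
  } else {
    false
  }
{code}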