[jira] [Created] (SPARK-48458) Dynamic partition override mode might be ignored in certain scenarios causing data loss

2024-05-29 Thread Artem Kupchinskiy (Jira)
Artem Kupchinskiy created SPARK-48458:
-

 Summary: Dynamic partition override mode might be ignored in 
certain scenarios causing data loss
 Key: SPARK-48458
 URL: https://issues.apache.org/jira/browse/SPARK-48458
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.1, 2.4.8, 4.0.0
Reporter: Artem Kupchinskiy


If the active Spark session is stopped in the middle of an insert into a file 
system, the session config that controls partition overwrite behavior might 
not be respected. The failure scenario is essentially the following:
 # The Spark context is stopped just before [the partition override mode setting is read|https://github.com/apache/spark/blob/8bbbde7cb3c396bc369c06853ed3a2ec021a2530/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L69]
 # Due to the [fallback config used when the Spark context is stopped|https://github.com/apache/spark/blob/8bbbde7cb3c396bc369c06853ed3a2ec021a2530/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L121], the mode evaluates to static (the default mode in the default SQLConf used as a fallback)
 # The data is cleared completely [here|https://github.com/apache/spark/blob/8bbbde7cb3c396bc369c06853ed3a2ec021a2530/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L131], which is effectively data loss from the perspective of a user who intended to overwrite the data only partially.

This 
[gist|https://gist.github.com/akupchinskiy/b5f31781d59e5c0e9b172e7de40132cd] 
reproduces the behavior. On my local machine, it takes 1-3 iterations for the 
pre-created data to be cleared completely.
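
A simplified Scala sketch of the kind of race the gist exercises; the output path, the sleep duration, and the exact thread orchestration are illustrative assumptions, not the gist's actual code:

{code:java}
import java.util.concurrent.Executors

import org.apache.spark.sql.SparkSession

object DynamicOverwriteRace {
  def main(args: Array[String]): Unit = {
    val out = "/tmp/dyn-overwrite-demo" // hypothetical output path
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      .getOrCreate()
    import spark.implicits._

    // Pre-create two partitions the user expects to keep.
    Seq((1, "a"), (2, "b")).toDF("value", "part")
      .write.partitionBy("part").mode("overwrite").parquet(out)

    // Stop the session from another thread while an overwrite touching only
    // partition "a" is in flight. If the stop wins the race before the write
    // command reads partitionOverwriteMode, the fallback SQLConf reports the
    // static mode and the whole output directory (including partition "b")
    // is cleared before the write itself fails.
    val pool = Executors.newSingleThreadExecutor()
    pool.submit(new Runnable {
      def run(): Unit = { Thread.sleep(5); spark.stop() }
    })

    Seq((3, "a")).toDF("value", "part")
      .write.partitionBy("part").mode("overwrite").parquet(out)

    pool.shutdown()
  }
}
{code}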

A mitigation for this bug is to pass the write option `partitionOverwriteMode` 
explicitly instead of relying on the session configuration.
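
A minimal sketch of that mitigation, reusing the session and implicits from the sketch above; the output path is a placeholder, and the assumption is that the per-write option is honored by the file-source write path without consulting the session-level setting:

{code:java}
// Pass the mode as a per-write option instead of relying on
// spark.sql.sources.partitionOverwriteMode from the session conf.
Seq((3, "a")).toDF("value", "part")
  .write
  .option("partitionOverwriteMode", "dynamic")
  .partitionBy("part")
  .mode("overwrite")
  .parquet("/tmp/dyn-overwrite-demo")
{code}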






[jira] [Commented] (SPARK-38116) Ability to turn off auto commit in JDBC source for read only operations

2022-02-08 Thread Artem Kupchinskiy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489052#comment-17489052
 ] 

Artem Kupchinskiy commented on SPARK-38116:
---

[~yoda-mon] Nice catch - indeed, for Postgres it appears not to be a problem 
because it is handled at the dialect layer.

Theoretically, it would be nice to have some control over the connection in the 
generic case. However, as a downside, users could see confusing results if this 
option were active during DML operations (which is why the proposal mentions the 
need for guards preventing an illegal option combination). I guess the maintainers 
should decide whether it would be a useful feature in the general case.

> Ability to turn off auto commit in JDBC source for read only operations
> ---
>
> Key: SPARK-38116
> URL: https://issues.apache.org/jira/browse/SPARK-38116
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Artem Kupchinskiy
>Priority: Minor
>
> Currently, all JDBC connections on the executor side always work with the 
> autocommit option set to true.
> However, there are cases where this mode makes JdbcRelationProvider hard to 
> use at all, e.g. reading huge datasets from Postgres (the whole result set is 
> collected regardless of the fetch size when autocommit is set to true: 
> https://jdbc.postgresql.org/documentation/91/query.html#query-with-cursor )
> The proposal is the following:
>  # Add a boolean option "autocommit" to the JDBC source allowing a user to 
> turn off autocommit mode for read-only operations.
>  # Add guards which prevent using this option in DML operations.
>  






[jira] [Created] (SPARK-38116) Ability to turn off auto commit in JDBC source for read only operations

2022-02-05 Thread Artem Kupchinskiy (Jira)
Artem Kupchinskiy created SPARK-38116:
-

 Summary: Ability to turn off auto commit in JDBC source for read 
only operations
 Key: SPARK-38116
 URL: https://issues.apache.org/jira/browse/SPARK-38116
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.1
Reporter: Artem Kupchinskiy


Currently, all JDBC connections on the executor side always work with the 
autocommit option set to true.

However, there are cases where this mode makes JdbcRelationProvider hard to use 
at all, e.g. reading huge datasets from Postgres (the whole result set is 
collected regardless of the fetch size when autocommit is set to true: 
https://jdbc.postgresql.org/documentation/91/query.html#query-with-cursor )

The proposal is the following (a plain-JDBC sketch of the underlying driver 
behavior follows the list):
 # Add a boolean option "autocommit" to the JDBC source allowing a user to turn off 
autocommit mode for read-only operations.
 # Add guards which prevent using this option in DML operations.
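
For reference, a plain-JDBC Scala sketch of the driver behavior the proposal builds on, assuming a reachable PostgreSQL instance; the URL, credentials, table name, and fetch size are placeholders:

{code:java}
import java.sql.DriverManager

// With autocommit on, the PostgreSQL driver materializes the entire ResultSet in
// memory; disabling autocommit together with a positive fetch size makes it stream
// rows through a server-side cursor.
val conn = DriverManager.getConnection(
  "jdbc:postgresql://localhost:5432/db", "user", "password")
try {
  conn.setAutoCommit(false)     // prerequisite for cursor-based fetching
  val stmt = conn.createStatement()
  stmt.setFetchSize(10000)      // rows fetched per round trip instead of the whole result set
  val rs = stmt.executeQuery("SELECT * FROM test_table")
  while (rs.next()) {
    // process one row at a time without holding the full result set in memory
  }
  rs.close()
  stmt.close()
} finally {
  conn.close()
}
{code}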

 






[jira] [Updated] (SPARK-33689) ArrayData specialization to enable RowEncoder primitive arrays support

2020-12-07 Thread Artem Kupchinskiy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Kupchinskiy updated SPARK-33689:
--
Summary: ArrayData specialization to enable RowEncoder primitive arrays 
support  (was: RowEncoder primitive arrays support)

> ArrayData specialization to enable RowEncoder primitive arrays support
> --
>
> Key: SPARK-33689
> URL: https://issues.apache.org/jira/browse/SPARK-33689
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Artem Kupchinskiy
>Priority: Major
>
> Currently, even non-nullable array fields are represented as 
> WrappedArray.ofRef in external rows. This increases the memory footprint and 
> wastes cycles on unnecessary boxing/unboxing. Ideally, if arrays are 
> non-nullable and contain primitive values, they should be represented as 
> specialized, lower-overhead versions of WrappedArray.






[jira] [Created] (SPARK-33689) RowEncoder primitive arrays support

2020-12-07 Thread Artem Kupchinskiy (Jira)
Artem Kupchinskiy created SPARK-33689:
-

 Summary: RowEncoder primitive arrays support
 Key: SPARK-33689
 URL: https://issues.apache.org/jira/browse/SPARK-33689
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.1
Reporter: Artem Kupchinskiy


Currently, even non-nullable array fields are represented as WrappedArray.ofRef 
in external rows. This increases the memory footprint and wastes cycles on 
unnecessary boxing/unboxing. Ideally, if arrays are non-nullable and contain 
primitive values, they should be represented as specialized, lower-overhead 
versions of WrappedArray.
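
A small Scala sketch to observe the representation described above; the exact class name printed depends on the Spark and Scala versions, so the commented expectation is an assumption rather than a guaranteed output:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// A non-nullable array of primitives in a Dataset[Row].
val row = Seq(Array(1.0, 2.0, 3.0)).toDF("xs").collect().head
val xs = row.getSeq[Double](0)

// Per the description above, this is expected to print a boxed wrapper such as
// scala.collection.mutable.WrappedArray$ofRef rather than a specialized,
// primitive-backed one like WrappedArray.ofDouble.
println(xs.getClass.getName)

spark.stop()
{code}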






[jira] [Created] (SPARK-25885) HighlyCompressedMapStatus deserialization optimization

2018-10-30 Thread Artem Kupchinskiy (JIRA)
Artem Kupchinskiy created SPARK-25885:
-

 Summary: HighlyCompressedMapStatus deserialization optimization
 Key: SPARK-25885
 URL: https://issues.apache.org/jira/browse/SPARK-25885
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Artem Kupchinskiy


HighlyCompressedMapStatus uses an unnecessary level of indirection during 
deserialization and construction: it uses an ArrayBuffer as interim storage 
before the actual map construction. Since both methods can be application hot 
spots under certain workloads, it is worth getting rid of that intermediate 
level.
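
A generic Scala sketch of the pattern in question, not Spark's actual HighlyCompressedMapStatus code; the record layout (an int key and a long size per entry) is an illustrative assumption. The first variant buffers the deserialized entries in an ArrayBuffer and then builds the map, while the second fills a map builder directly while reading:

{code:java}
import java.io.ObjectInput

import scala.collection.mutable

// "Before": interim ArrayBuffer, then a second pass to build the map.
def readSizesWithBuffer(in: ObjectInput): Map[Int, Long] = {
  val count = in.readInt()
  val buf = new mutable.ArrayBuffer[(Int, Long)](count)
  var i = 0
  while (i < count) {
    buf += ((in.readInt(), in.readLong()))
    i += 1
  }
  buf.toMap
}

// "After": single pass, entries go straight into the map builder.
def readSizesDirect(in: ObjectInput): Map[Int, Long] = {
  val count = in.readInt()
  val builder = Map.newBuilder[Int, Long]
  builder.sizeHint(count)
  var i = 0
  while (i < count) {
    builder += ((in.readInt(), in.readLong()))
    i += 1
  }
  builder.result()
}
{code}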






[jira] [Commented] (SPARK-21418) NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true

2017-09-21 Thread Artem Kupchinskiy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174642#comment-16174642
 ] 

Artem Kupchinskiy commented on SPARK-21418:
---

There is still a place in FileSourceScanExec.scala where a None.get error could 
theoretically appear.
{code:java}
val needsUnsafeRowConversion: Boolean = if (relation.fileFormat.isInstanceOf[ParquetSource]) {
  SparkSession.getActiveSession.get.sessionState.conf.parquetVectorizedReaderEnabled
} else {
  false
}
{code}

I think it is worth guarding this val initialization as well (falling back to a 
default configuration value when there is no active session), although I have 
not encountered any None.get errors so far after applying this patch.
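
A hedged sketch of the kind of guard meant above, not the actual patch; it keeps relation and ParquetSource from the snippet's surrounding context and assumes that falling back to SQLConf.get (or an equivalent default conf, depending on the Spark version) is acceptable when no active session is present:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SQLConf

val needsUnsafeRowConversion: Boolean = if (relation.fileFormat.isInstanceOf[ParquetSource]) {
  // Avoid None.get: use the active session's conf if present, otherwise a default SQLConf.
  SparkSession.getActiveSession
    .map(_.sessionState.conf)
    .getOrElse(SQLConf.get)
    .parquetVectorizedReaderEnabled
} else {
  false
}
{code}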



> NoSuchElementException: None.get in DataSourceScanExec with 
> sun.io.serialization.extendedDebugInfo=true
> ---
>
> Key: SPARK-21418
> URL: https://issues.apache.org/jira/browse/SPARK-21418
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Daniel Darabos
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
>
> I don't have a minimal reproducible example yet, sorry. I have the following 
> lines in a unit test for our Spark application:
> {code}
> val df = mySparkSession.read.format("jdbc")
>   .options(Map("url" -> url, "dbtable" -> "test_table"))
>   .load()
> df.show
> println(df.rdd.collect)
> {code}
> The output shows the DataFrame contents from {{df.show}}. But the {{collect}} 
> fails:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> serialization failed: java.util.NoSuchElementException: None.get
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52)
>   at 
> org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349)
>   at 
> org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60)
>   at 
> org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOu