[jira] [Commented] (SPARK-27030) DataFrameWriter.insertInto fails when writing in parallel to a hive table

2023-05-03 Thread Shrikant Prasad (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719158#comment-17719158
 ] 

Shrikant Prasad commented on SPARK-27030:
-

[~lev] [~shivuson...@gmail.com] Did you find any resolution for this issue? I 
am encountering the same error.
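
For context, the collision comes from both writers resolving the same pending 
directory under the table path. Here is a minimal sketch of how that directory 
is derived (an illustration assuming Hadoop's default FileOutputCommitter 
behaviour, as described in the report below):
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch only: mirrors how the job attempt path is computed. With no
// application attempt id set, every concurrent job lands on .../_temporary/0.
def jobAttemptPath(tableOutput: Path, conf: Configuration): Path = {
  val appAttemptId = conf.getInt("mapreduce.job.application.attempt.id", 0) // falls back to 0
  new Path(new Path(tableOutput, "_temporary"), String.valueOf(appAttemptId))
}
{code}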

> DataFrameWriter.insertInto fails when writing in parallel to a hive table
> -
>
> Key: SPARK-27030
> URL: https://issues.apache.org/jira/browse/SPARK-27030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Lev Katzav
>Priority: Major
>  Labels: bulk-closed
>
> When writing to a hive table, the following temp directory is used:
> {code:java}
> /path/to/table/_temporary/0/{code}
> (the 0 at the end comes from the config
> {code:java}
> "mapreduce.job.application.attempt.id"{code}
> since that config is not set, it falls back to 0)
> When two processes write to the same table, the following race condition can 
> occur:
>  # p1 creates temp folder and uses it
>  # p2 uses temp folder
>  # p1 finishes and deletes temp folder
>  # p2 fails since temp folder is missing
>  
> It is possible to reproduce this error locally with the following code
> (the code runs locally, but I experienced the same error when running on a 
> cluster with two jobs writing to the same table):
> {code:java}
> // Note: on Scala 2.11 (the default for Spark 2.4.0) use
> // scala.concurrent.forkjoin.ForkJoinPool instead of java.util.concurrent.ForkJoinPool.
> import java.util.concurrent.ForkJoinPool
>
> import scala.collection.parallel.ForkJoinTaskSupport
>
> import org.apache.spark.sql.SaveMode
> import org.apache.spark.sql.functions._
>
> val df = spark
>   .range(1000)
>   .toDF("a")
>   .withColumn("partition", lit(0))
>   .cache()
>
> // create db
> sqlContext.sql("CREATE DATABASE IF NOT EXISTS db").count()
>
> // create table
> df.write
>   .partitionBy("partition")
>   .saveAsTable("db.table")
>
> // insert into different partitions in parallel
> val x = (1 to 100).par
> x.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))
> x.foreach { p =>
>   val df2 = df.withColumn("partition", lit(p))
>   df2.write
>     .mode(SaveMode.Overwrite)
>     .insertInto("db.table")
> }
> {code}
>  
> The resulting error is:
> {code:java}
> java.io.FileNotFoundException: File 
> file:/path/to/warehouse/db.db/table/_temporary/0 does not exist
>  at 
> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:406)
>  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1497)
>  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1537)
>  at 
> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:669)
>  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1497)
>  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1537)
>  at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:283)
>  at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:325)
>  at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
>  at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:166)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:185)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>  at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> 

[jira] [Updated] (SPARK-42655) Incorrect ambiguous column reference error

2023-03-02 Thread Shrikant Prasad (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shrikant Prasad updated SPARK-42655:

Description: 
val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*)
df2.select("id").show()
 
This query runs fine.
 
But when we change the casing in op_cols to a mix of upper and lower case 
("id" and "ID"), it throws an ambiguous column reference error:
 
val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID")
val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
df3.select("id").show()



org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id, id.
  at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:363)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:112)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$resolveExpressionByPlanChildren$1(Analyzer.scala:1857)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$resolveExpression$2(Analyzer.scala:1787)
  at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:60)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.innerResolve$1(Analyzer.scala:1794)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpression(Analyzer.scala:1812)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpressionByPlanChildren(Analyzer.scala:1863)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$17.$anonfun$applyOrElse$94(Analyzer.scala:1577)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:209)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at scala.collection.TraversableLike.map(TraversableLike.scala:286)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
  at scala.collection.AbstractTraversable.map(Traversable.scala:108)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:209)


Since Spark is case-insensitive by default, the second case should also work 
when the column list contains both upper- and lower-case column names.

It also works fine in Spark 2.3.
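
One way to sidestep the error until this is fixed is to drop the 
case-insensitive duplicate before the select. This is only a sketch, assuming 
the default spark.sql.caseSensitive=false resolution:

{code:java}
val opColsMixedCase = List("id", "col2", "col3", "col4", "col5", "ID")

// Keep the first occurrence of each column name, compared case-insensitively,
// so the projection no longer contains both "id" and "ID".
val dedupedCols = opColsMixedCase.foldLeft(List.empty[String]) { (acc, c) =>
  if (acc.exists(_.equalsIgnoreCase(c))) acc else acc :+ c
}

val df3 = df1.select(dedupedCols.head, dedupedCols.tail: _*)
df3.select("id").show()
{code}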
 

  was:
val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
df2.select("id").show() 
 
This query runs fine.
 
But when we change the casing of the op_cols to have mix of upper & lower case 
("id" & "ID") it throws an ambiguous col ref error:
 
val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "ID")
val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
df2.select("id").show() 
 
Since, Spark is case insensitive, it should work for second case also when we 
have upper and lower case column names in the column list.

It also works fine in Spark 2.3.
 


> Incorrect ambiguous column reference error
> --
>
> Key: SPARK-42655
> URL: https://issues.apache.org/jira/browse/SPARK-42655
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Shrikant Prasad
>Priority: Major
>
> val df1 = 
> sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
> "col5")
> val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
> val df2 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*)
> df2.select("id").show()
>  
> This query runs fine.
>  
> But when we change the casing of the op_cols to have mix of upper & lower 
> case ("id" & "ID") it throws an ambiguous col ref error:
>  
> val df1 = 
> 

[jira] [Updated] (SPARK-42655) Incorrect ambiguous column reference error

2023-03-02 Thread Shrikant Prasad (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shrikant Prasad updated SPARK-42655:

Description: 
val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
df2.select("id").show() 
 
This query runs fine.
 
But when we change the casing of the op_cols to have mix of upper & lower case 
("id" & "ID") it throws an ambiguous col ref error:
 
val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "ID")
val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
df2.select("id").show() 
 
Since, Spark is case insensitive, it should work for second case also when we 
have upper and lower case column names in the column list.

It also works fine in Spark 2.3.
 

  was:
val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
df2.select("id").show() 
 
This query runs fine.
 
But when we change the casing of the op_cols to have mix of upper & lower case 
("id" & "ID") it throws an ambiguous col ref error:
 
val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "ID")
val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
df2.select("id").show() 
 
Since, Spark is case insensitive, it should work for second case also when we 
have upper and lower case column names in the column list.
 


> Incorrect ambiguous column reference error
> --
>
> Key: SPARK-42655
> URL: https://issues.apache.org/jira/browse/SPARK-42655
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Shrikant Prasad
>Priority: Major
>
> val df1 = 
> sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
> "col5")
> val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
> val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
> df2.select("id").show() 
>  
> This query runs fine.
>  
> But when we change the casing of the op_cols to have mix of upper & lower 
> case ("id" & "ID") it throws an ambiguous col ref error:
>  
> val df1 = 
> sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
> "col5")
> val op_cols_same_case = List("id","col2","col3","col4", "col5", "ID")
> val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
> df2.select("id").show() 
>  
> Since, Spark is case insensitive, it should work for second case also when we 
> have upper and lower case column names in the column list.
> It also works fine in Spark 2.3.
>  






[jira] [Created] (SPARK-42655) Incorrect ambiguous column reference error

2023-03-02 Thread Shrikant Prasad (Jira)
Shrikant Prasad created SPARK-42655:
---

 Summary: Incorrect ambiguous column reference error
 Key: SPARK-42655
 URL: https://issues.apache.org/jira/browse/SPARK-42655
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Shrikant Prasad


val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
df2.select("id").show() 
 
This query runs fine.
 
But when we change the casing of the op_cols to have mix of upper & lower case 
("id" & "ID") it throws an ambiguous col ref error:
 
val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "ID")
val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
df2.select("id").show() 
 
Since, Spark is case insensitive, it should work for second case also when we 
have upper and lower case column names in the column list.
 






[jira] [Updated] (SPARK-42466) spark.kubernetes.file.upload.path not deleting files under HDFS after job completes

2023-02-23 Thread Shrikant Prasad (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shrikant Prasad updated SPARK-42466:

Affects Version/s: 3.3.2

> spark.kubernetes.file.upload.path not deleting files under HDFS after job 
> completes
> ---
>
> Key: SPARK-42466
> URL: https://issues.apache.org/jira/browse/SPARK-42466
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.0, 3.3.2
>Reporter: Jagadeeswara Rao
>Priority: Major
>
> In cluster mode, files uploaded to the HDFS location given by the 
> spark.kubernetes.file.upload.path property are not cleaned up after the job 
> completes. Each file is uploaded to an HDFS directory of the form 
> spark-upload-[randomUUID] when {{KubernetesUtils}} is requested to 
> uploadFileUri. 
> [https://github.com/apache/spark/blob/76a134ade60a9f354aca01eaca0b2e2477c6bd43/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala#L310]
> The driver log below shows that the driver completed successfully but the 
> shutdown hook did not clean up the HDFS files.
> {code:java}
> 23/02/16 18:06:56 INFO KubernetesClusterSchedulerBackend: Shutting down all 
> executors
> 23/02/16 18:06:56 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 23/02/16 18:06:56 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has 
> been closed.
> 23/02/16 18:06:57 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 23/02/16 18:06:57 INFO MemoryStore: MemoryStore cleared
> 23/02/16 18:06:57 INFO BlockManager: BlockManager stopped
> 23/02/16 18:06:57 INFO BlockManagerMaster: BlockManagerMaster stopped
> 23/02/16 18:06:57 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 23/02/16 18:06:57 INFO SparkContext: Successfully stopped SparkContext
> 23/02/16 18:06:57 INFO ShutdownHookManager: Shutdown hook called
> 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-efb8f725-4ead-4729-a8e0-f478280121b7
> 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory 
> /spark-local2/spark-66dbf7e6-fe7e-4655-8724-69d76d93fc1f
> 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory 
> /spark-local1/spark-53aefaee-58a5-4fce-b5b0-5e29f42e337f{code}
>  






[jira] [Commented] (SPARK-42466) spark.kubernetes.file.upload.path not deleting files under HDFS after job completes

2023-02-17 Thread Shrikant Prasad (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690381#comment-17690381
 ] 

Shrikant Prasad commented on SPARK-42466:
-

Working on the fix.
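
Until the fix lands, stale upload directories can be cleaned out of band. A 
rough sketch follows (the upload path and the retention window are 
assumptions, not Spark defaults):

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Remove spark-upload-* directories older than the retention window from the
// directory configured as spark.kubernetes.file.upload.path.
val uploadRoot = new Path("hdfs:///tmp")      // assumed upload path
val maxAgeMs   = 7L * 24 * 60 * 60 * 1000     // assumed 7-day retention
val fs = uploadRoot.getFileSystem(new Configuration())

fs.listStatus(uploadRoot)
  .filter(s => s.isDirectory && s.getPath.getName.startsWith("spark-upload-"))
  .filter(s => System.currentTimeMillis() - s.getModificationTime > maxAgeMs)
  .foreach(s => fs.delete(s.getPath, true))
{code}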

> spark.kubernetes.file.upload.path not deleting files under HDFS after job 
> completes
> ---
>
> Key: SPARK-42466
> URL: https://issues.apache.org/jira/browse/SPARK-42466
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Jagadeeswara Rao
>Priority: Major
>
> In cluster mode, files uploaded to the HDFS location given by the 
> spark.kubernetes.file.upload.path property are not cleaned up after the job 
> completes. Each file is uploaded to an HDFS directory of the form 
> spark-upload-[randomUUID] when {{KubernetesUtils}} is requested to 
> uploadFileUri. 
> [https://github.com/apache/spark/blob/76a134ade60a9f354aca01eaca0b2e2477c6bd43/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala#L310]
> The driver log below shows that the driver completed successfully but the 
> shutdown hook did not clean up the HDFS files.
> {code:java}
> 23/02/16 18:06:56 INFO KubernetesClusterSchedulerBackend: Shutting down all 
> executors
> 23/02/16 18:06:56 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 23/02/16 18:06:56 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has 
> been closed.
> 23/02/16 18:06:57 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 23/02/16 18:06:57 INFO MemoryStore: MemoryStore cleared
> 23/02/16 18:06:57 INFO BlockManager: BlockManager stopped
> 23/02/16 18:06:57 INFO BlockManagerMaster: BlockManagerMaster stopped
> 23/02/16 18:06:57 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 23/02/16 18:06:57 INFO SparkContext: Successfully stopped SparkContext
> 23/02/16 18:06:57 INFO ShutdownHookManager: Shutdown hook called
> 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-efb8f725-4ead-4729-a8e0-f478280121b7
> 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory 
> /spark-local2/spark-66dbf7e6-fe7e-4655-8724-69d76d93fc1f
> 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory 
> /spark-local1/spark-53aefaee-58a5-4fce-b5b0-5e29f42e337f{code}
>  






[jira] [Updated] (SPARK-41719) Spark SSLOptions sub settings should be set only when ssl is enabled

2022-12-26 Thread Shrikant Prasad (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shrikant Prasad updated SPARK-41719:

Description: 
If ${ns}.enabled is false, there is no use of setting rest of ${ns}.* settings 
in SSLOptions as this requires unnecessary operations to be performed to set 
these properties. 

for ex: ${ns} - spark.ssl

As per SSLOptions,
 * SSLOptions is intended to provide the maximum common set of SSL settings, 
which are supported
 * by the protocol, which it can generate the configuration for.
*
 * @param enabled enables or disables SSL; *if it is set to false, the rest of 
the settings are disregarded*
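
A minimal sketch of the intended short-circuit (the function and parameter 
names below are illustrative, not the actual SSLOptions API):

{code:java}
// Resolve ${ns}.enabled first; skip the remaining ${ns}.* lookups when SSL is
// disabled, since those settings are disregarded anyway.
def resolveSslSettings(conf: Map[String, String], ns: String): Map[String, String] = {
  val enabled = conf.getOrElse(s"$ns.enabled", "false").toBoolean
  if (!enabled) Map(s"$ns.enabled" -> "false")
  else conf.filter { case (key, _) => key.startsWith(s"$ns.") }
}
{code}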

  was:
If ${ns}.enabled is false, there is no use of setting rest of ${ns}.* settings 
in SSLOptions as this requires unnecessary operations to be performed to set 
these properties. 

As per SSLOptions,
 * SSLOptions is intended to provide the maximum common set of SSL settings, 
which are supported
 * by the protocol, which it can generate the configuration for.
*
 * @param enabled enables or disables SSL; *if it is set to false, the rest of 
the settings are disregarded*


> Spark SSLOptions sub settings should be set only when ssl is enabled
> 
>
> Key: SPARK-41719
> URL: https://issues.apache.org/jira/browse/SPARK-41719
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.4
>Reporter: Shrikant Prasad
>Priority: Major
>
> If ${ns}.enabled is false, there is no use of setting rest of ${ns}.* 
> settings in SSLOptions as this requires unnecessary operations to be 
> performed to set these properties. 
> for ex: ${ns} - spark.ssl
> As per SSLOptions,
>  * SSLOptions is intended to provide the maximum common set of SSL settings, 
> which are supported
>  * by the protocol, which it can generate the configuration for.
> *
>  * @param enabled enables or disables SSL; *if it is set to false, the rest 
> of the settings are disregarded*






[jira] [Updated] (SPARK-41719) Spark SSLOptions sub settings should be set only when ssl is enabled

2022-12-26 Thread Shrikant Prasad (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shrikant Prasad updated SPARK-41719:

Summary: Spark SSLOptions sub settings should be set only when ssl is 
enabled  (was: Spark SSL Options should be set only when ssl is enabled)

> Spark SSLOptions sub settings should be set only when ssl is enabled
> 
>
> Key: SPARK-41719
> URL: https://issues.apache.org/jira/browse/SPARK-41719
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.4
>Reporter: Shrikant Prasad
>Priority: Major
>
> If ${ns}.enabled is false, there is no use of setting rest of ${ns}.* 
> settings in SSLOptions as this requires unnecessary operations to be 
> performed to set these properties. 
> As per SSLOptions,
>  * SSLOptions is intended to provide the maximum common set of SSL settings, 
> which are supported
>  * by the protocol, which it can generate the configuration for.
> *
>  * @param enabled enables or disables SSL; *if it is set to false, the rest 
> of the settings are disregarded*






[jira] [Updated] (SPARK-41719) Spark SSL Options should be set only when ssl is enabled

2022-12-26 Thread Shrikant Prasad (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shrikant Prasad updated SPARK-41719:

Description: 
If ${ns}.enabled is false, there is no use of setting rest of ${ns}.* settings 
in SSLOptions as this requires unnecessary operations to be performed to set 
these properties. 

As per SSLOptions,
 * SSLOptions is intended to provide the maximum common set of SSL settings, 
which are supported
 * by the protocol, which it can generate the configuration for.
*
 * @param enabled enables or disables SSL; *if it is set to false, the rest of 
the settings are disregarded*

  was:
If ${ns}.enabled is false, there is no use of setting rest of ${ns}.* settings 
in SSLOptions as this requires unnecessary operations to be performed to set 
these properties. 

As per SSLOptions,
 * SSLOptions is intended to provide the maximum common set of SSL settings, 
which are supported
* by the protocol, which it can generate the configuration for.
*
* @param enabled enables or disables SSL; *if it is set to false, the rest of 
the*
** settings are disregarded*


> Spark SSL Options should be set only when ssl is enabled
> 
>
> Key: SPARK-41719
> URL: https://issues.apache.org/jira/browse/SPARK-41719
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.4
>Reporter: Shrikant Prasad
>Priority: Major
>
> If ${ns}.enabled is false, there is no use of setting rest of ${ns}.* 
> settings in SSLOptions as this requires unnecessary operations to be 
> performed to set these properties. 
> As per SSLOptions,
>  * SSLOptions is intended to provide the maximum common set of SSL settings, 
> which are supported
>  * by the protocol, which it can generate the configuration for.
> *
>  * @param enabled enables or disables SSL; *if it is set to false, the rest 
> of the settings are disregarded*






[jira] [Created] (SPARK-41719) Spark SSL Options should be set only when ssl is enabled

2022-12-26 Thread Shrikant Prasad (Jira)
Shrikant Prasad created SPARK-41719:
---

 Summary: Spark SSL Options should be set only when ssl is enabled
 Key: SPARK-41719
 URL: https://issues.apache.org/jira/browse/SPARK-41719
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.4
Reporter: Shrikant Prasad


If ${ns}.enabled is false, there is no point in resolving the rest of the 
${ns}.* settings in SSLOptions, as that performs unnecessary work to set 
properties which are disregarded anyway.

As per the SSLOptions scaladoc:
 * SSLOptions is intended to provide the maximum common set of SSL settings, 
which are supported by the protocol, which it can generate the configuration for.
 * @param enabled enables or disables SSL; *if it is set to false, the rest of 
the settings are disregarded*






[jira] [Commented] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code

2022-09-08 Thread Shrikant Prasad (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601827#comment-17601827
 ] 

Shrikant Prasad commented on SPARK-26365:
-

A spark-submit exit code ($?) of 0 is expected here, since there was no error 
in submitting the job. It is the job itself that failed, and that information 
is available in the container exit code (1). When job submission does fail, we 
get a proper non-zero exit code, so this does not seem to be a bug.
{code:java}
container status: 
 container name: spark-kubernetes-driver
 container image: **
 container state: terminated
 container started at: 2022-09-08T13:40:39Z
 container finished at: 2022-09-08T13:40:43Z
 exit code: 1
 termination reason: Error {code}

> spark-submit for k8s cluster doesn't propagate exit code
> 
>
> Key: SPARK-26365
> URL: https://issues.apache.org/jira/browse/SPARK-26365
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core, Spark Submit
>Affects Versions: 2.3.2, 2.4.0, 3.0.0, 3.1.0
>Reporter: Oscar Bonilla
>Priority: Major
> Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, 
> spark-3.0.0-raise-exception-k8s-failure.patch
>
>
> When launching apps using spark-submit in a Kubernetes cluster, if the Spark 
> application fails (returns exit code = 1, for example), spark-submit will 
> still exit gracefully and return exit code = 0.
> This is problematic, since there's no way to know if there's been a problem 
> with the Spark application.






[jira] [Commented] (SPARK-39399) proxy-user not working for Spark on k8s in cluster deploy mode

2022-08-18 Thread Shrikant Prasad (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581495#comment-17581495
 ] 

Shrikant Prasad commented on SPARK-39399:
-

[~dongjoon] [~hyukjin.kwon] Can you please have a look at this issue and let me 
know if I need to add any more details in order to take this forward.
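
For context, in cluster mode the driver is expected to access kerberized HDFS 
by impersonating the proxy user, roughly along these lines (a sketch using 
Hadoop's UserGroupInformation API; the user name and path are placeholders):

{code:java}
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

// The kerberos-authenticated login user impersonates the proxy user; the
// AccessControlException reported here surfaces inside this doAs block.
val realUgi  = UserGroupInformation.getLoginUser
val proxyUgi = UserGroupInformation.createProxyUser("proxy_user", realUgi)

proxyUgi.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    val fs = FileSystem.get(new Configuration())
    fs.listStatus(new Path("/tmp")).foreach(status => println(status.getPath))
  }
})
{code}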

> proxy-user not working for Spark on k8s in cluster deploy mode
> --
>
> Key: SPARK-39399
> URL: https://issues.apache.org/jira/browse/SPARK-39399
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.2.0
>Reporter: Shrikant Prasad
>Priority: Major
>
> As part of https://issues.apache.org/jira/browse/SPARK-25355, proxy user 
> support was added for Spark on K8s. However, the PR only added the proxy-user 
> argument to the spark-submit command; the actual authentication as the proxy 
> user does not work in cluster deploy mode.
> We get an AccessControlException when trying to access kerberized HDFS 
> through a proxy user.
> Spark-Submit:
> $SPARK_HOME/bin/spark-submit \
> --master  \
> --deploy-mode cluster \
> --name with_proxy_user_di \
> --proxy-user  \
> --class org.apache.spark.examples.SparkPi \
> --conf spark.kubernetes.container.image= \
> --conf spark.kubernetes.driver.limit.cores=1 \
> --conf spark.executor.instances=1 \
> --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
> --conf spark.kubernetes.namespace= \
> --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \
> --conf spark.eventLog.enabled=true \
> --conf spark.eventLog.dir=hdfs:///scaas/shs_logs \
> --conf spark.kubernetes.file.upload.path=hdfs:///tmp \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0-1.jar 
> Driver Logs:
> {code:java}
> ++ id -u
> + myuid=185
> ++ id -g
> + mygid=0
> + set +e
> ++ getent passwd 185
> + uidentry=
> + set -e
> + '[' -z '' ']'
> + '[' -w /etc/passwd ']'
> + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
> + SPARK_CLASSPATH=':/opt/spark/jars/*'
> + env
> + grep SPARK_JAVA_OPT_
> + sort -t_ -k4 -n
> + sed 's/[^=]*=\(.*\)/\1/g'
> + readarray -t SPARK_EXECUTOR_JAVA_OPTS
> + '[' -n '' ']'
> + '[' -z ']'
> + '[' -z ']'
> + '[' -n '' ']'
> + '[' -z x ']'
> + SPARK_CLASSPATH='/opt/hadoop/conf::/opt/spark/jars/*'
> + '[' -z x ']'
> + SPARK_CLASSPATH='/opt/spark/conf:/opt/hadoop/conf::/opt/spark/jars/*'
> + case "$1" in
> + shift 1
> + CMD=("$SPARK_HOME/bin/spark-submit" --conf 
> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client 
> "$@")
> + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress= --deploy-mode client --proxy-user proxy_user 
> --properties-file /opt/spark/conf/spark.properties --class 
> org.apache.spark.examples.SparkPi spark-internal
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0-1.jar) to constructor 
> java.nio.DirectByteBuffer(long,int)
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", 
> sampleName="Ops", always=false, type=DEFAULT, value={"Rate of successful 
> kerberos logins and latency (milliseconds)"}, valueName="Time")
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", 
> sampleName="Ops", always=false, type=DEFAULT, value={"Rate of failed kerberos 
> logins and latency (milliseconds)"}, valueName="Time")
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", 
> sampleName="Ops", always=false, type=DEFAULT, value={"GetGroups"}, 
> valueName="Time")
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field private 
> org.apache.hadoop.metrics2.lib.MutableGaugeLong 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailuresTotal
>  with annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", 
> sampleName="Ops", 

[jira] [Commented] (SPARK-39993) Spark on Kubernetes doesn't filter data by date

2022-08-18 Thread Shrikant Prasad (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581492#comment-17581492
 ] 

Shrikant Prasad commented on SPARK-39993:
-

[~h.liashchuk] The code snippet you have shared works fine in cluster deploy 
mode if we write to HDFS instead of S3, so I don't think there is any issue 
with the k8s master. You might also first check the output of df.show() to see 
whether the DataFrame contains the expected rows.

> Spark on Kubernetes doesn't filter data by date
> ---
>
> Key: SPARK-39993
> URL: https://issues.apache.org/jira/browse/SPARK-39993
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.2
> Environment: Kubernetes v1.23.6
> Spark 3.2.2
> Java 1.8.0_312
> Python 3.9.13
> Aws dependencies:
> aws-java-sdk-bundle-1.11.901.jar and hadoop-aws-3.3.1.jar
>Reporter: Hanna Liashchuk
>Priority: Major
>  Labels: kubernetes
>
> I'm creating a Dataset with a date-typed column and saving it to S3. When I 
> read it back and apply a where() clause, it doesn't return data even though 
> the data is there.
> Below is the code snippet I'm running:
>  
> {code:java}
> from pyspark.sql.types import Row
> from pyspark.sql.functions import *
> ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date", 
> col("date").cast("date"))
> ds.where("date = '2022-01-01'").show()
> ds.write.mode("overwrite").parquet("s3a://bucket/test")
> df = spark.read.format("parquet").load("s3a://bucket/test")
> df.where("date = '2022-01-01'").show()
> {code}
> The first show() returns data, while the second one does not.
> I've noticed that it's related to the Kubernetes master, as the same code 
> snippet works fine with master "local".
> UPD: if the column is used as a partition and has the type "date" there is no 
> filtering problem.
>  
>  


