[jira] [Commented] (SPARK-40103) Support read/write.csv() in SparkR

2022-08-16 Thread deshanxiao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580623#comment-17580623
 ] 

deshanxiao commented on SPARK-40103:


Yes, read.csv and read.csv2 are already used by the R utils package.

> Support read/write.csv() in SparkR
> --
>
> Key: SPARK-40103
> URL: https://issues.apache.org/jira/browse/SPARK-40103
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 3.3.0
>Reporter: deshanxiao
>Priority: Major
>
> Today, almost all languages support the DataFrameReader.csv API; only R is 
> missing it. Currently we need to use the generic df.read() to read a CSV file. 
> We need a higher-level API for it.
> Java:
> [DataFrameReader.csv()|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html]
> Scala:
> [DataFrameReader.csv()|https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameReader.html#csv(paths:String*):org.apache.spark.sql.DataFrame]
> Python:
> [DataFrameReader.csv()|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.csv.html#pyspark.sql.DataFrameReader.csv]
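For reference, the high-level csv() shortcut that this issue asks SparkR to mirror already exists alongside the generic reader path in the other language APIs. A minimal Scala sketch, assuming a spark-shell session where `spark` is the active SparkSession and the file paths are placeholders:

{code}
// Generic DataSource path: available today in every language, but verbose.
val viaFormat = spark.read
  .format("csv")
  .option("header", "true")
  .load("/tmp/people.csv")      // placeholder input path

// High-level shortcut already offered by the Scala/Java/Python readers,
// and requested here for SparkR.
val viaCsv = spark.read
  .option("header", "true")
  .csv("/tmp/people.csv")

// Matching writer-side shortcut.
viaCsv.write
  .option("header", "true")
  .csv("/tmp/people_out")       // placeholder output path
{code}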



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40116) Remove Arrow in AppVeyor for now

2022-08-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40116:
-
Summary: Remove Arrow in AppVeyor for now  (was: Pin Arrow version to 8.0.0 
in AppVeyor)

> Remove Arrow in AppVeyor for now
> 
>
> Key: SPARK-40116
> URL: https://issues.apache.org/jira/browse/SPARK-40116
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, SparkR
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> SparkR does not support Arrow 9.0.0 (SPARK-40114) so the tests currently 
> fail, https://ci.appveyor.com/project/HyukjinKwon/spark/builds/44490387.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40116) Pin Arrow version to 8.0.0 in AppVeyor

2022-08-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40116.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37546
[https://github.com/apache/spark/pull/37546]

> Pin Arrow version to 8.0.0 in AppVeyor
> --
>
> Key: SPARK-40116
> URL: https://issues.apache.org/jira/browse/SPARK-40116
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, SparkR
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> SparkR does not support Arrow 9.0.0 (SPARK-40114) so the tests currently 
> fail, https://ci.appveyor.com/project/HyukjinKwon/spark/builds/44490387.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40117) Convert condition to java in DataFrameWriterV2.overwrite

2022-08-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40117.
--
Fix Version/s: 3.3.1
   3.1.4
   3.2.3
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37547
[https://github.com/apache/spark/pull/37547]

> Convert condition to java in DataFrameWriterV2.overwrite
> 
>
> Key: SPARK-40117
> URL: https://issues.apache.org/jira/browse/SPARK-40117
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Wenli Looi
>Assignee: Wenli Looi
>Priority: Major
> Fix For: 3.3.1, 3.1.4, 3.2.3, 3.4.0
>
>
> DataFrameWriterV2.overwrite() fails to convert the condition parameter to 
> java. This prevents the function from being called.
> It is caused by the following commit that deleted the `_to_java_column` call 
> instead of fixing it: 
> [https://github.com/apache/spark/commit/a1e459ed9f6777fb8d5a2d09fda666402f9230b9]
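For context, a minimal Scala sketch of the DataFrameWriterV2 API that the PySpark wrapper forwards to (the table name, the `updates` DataFrame, and the column are placeholders). On the Python side, the `condition` Column has to be converted to its JVM counterpart before this call, which is the conversion the deleted `_to_java_column` call used to perform:

{code}
import org.apache.spark.sql.functions.col

// Replace only the rows of the V2 table that match the condition.
updates.writeTo("catalog.db.events")
  .overwrite(col("day") === "2022-08-16")
{code}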



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40117) Convert condition to java in DataFrameWriterV2.overwrite

2022-08-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40117:


Assignee: Wenli Looi

> Convert condition to java in DataFrameWriterV2.overwrite
> 
>
> Key: SPARK-40117
> URL: https://issues.apache.org/jira/browse/SPARK-40117
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Wenli Looi
>Assignee: Wenli Looi
>Priority: Major
>
> DataFrameWriterV2.overwrite() fails to convert the condition parameter to 
> java. This prevents the function from being called.
> It is caused by the following commit that deleted the `_to_java_column` call 
> instead of fixing it: 
> [https://github.com/apache/spark/commit/a1e459ed9f6777fb8d5a2d09fda666402f9230b9]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40116) Pin Arrow version to 8.0.0 in AppVeyor

2022-08-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40116:


Assignee: Hyukjin Kwon

> Pin Arrow version to 8.0.0 in AppVeyor
> --
>
> Key: SPARK-40116
> URL: https://issues.apache.org/jira/browse/SPARK-40116
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, SparkR
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> SparkR does not support Arrow 9.0.0 (SPARK-40114) so the tests currently 
> fail, https://ci.appveyor.com/project/HyukjinKwon/spark/builds/44490387.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40113) Refactor ParquetScanBuilder DataSourceV2 interface implementation

2022-08-16 Thread Mars (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars updated SPARK-40113:
-
Summary: Refactor ParquetScanBuilder DataSourceV2 interface implementation  
(was: Unify ParquetScanBuilder DataSourceV2 interface implementation)

> Refactor ParquetScanBuilder DataSourceV2 interface implementation
> 
>
> Key: SPARK-40113
> URL: https://issues.apache.org/jira/browse/SPARK-40113
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.3.0
>Reporter: Mars
>Priority: Minor
>
> Currently the `FileScanBuilder` interface is not fully implemented in 
> `ParquetScanBuilder`, unlike `OrcScanBuilder`, `AvroScanBuilder`, and 
> `CSVScanBuilder`.
> In order to unify the logic of the code and make it clearer, this part of the 
> implementation should be unified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-08-16 Thread Vivek Garg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580598#comment-17580598
 ] 

Vivek Garg edited comment on SPARK-22588 at 8/17/22 5:53 AM:
-

According to the [Salesforce CPQ 
Certification|https://www.igmguru.com/salesforce/salesforce-cpq-training/] 
Exam, our Salesforce CPQ Certification Training program has been created. The 
core abilities needed for effectively implementing Salesforce CPQ solutions are 
developed in this course on Salesforce CPQ. Through instruction using practical 
examples, this course will go deeper into developing a quoting process, pricing 
strategies, configuration, CPQ object data model, and more. This online 
Salesforce CPQ training course includes practical projects that will aid you in 
passing the Salesforce CPQ Certification test.


was (Author: JIRAUSER294516):
[salesforce cpq 
certification|https://www.igmguru.com/salesforce/salesforce-cpq-training/]

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum  | Value_1  | Value_2 | Value_3  | Value_4
>  14 |A |B|   C  |   null
>  19 |X |Y|  null|   null
>  21 |R |   null  |  null|   null
> I want to load the data into a DynamoDB table with ClientNum as the key, following:
> Analyze Your Data on Amazon DynamoDB with Apache Spark
> Using Spark SQL for ETL
> Here is the code that I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> fails with the following error:
> Caused by: java.lang.NullPointerException
> The null values caused the error; if I try with only ClientNum and Value_1, it 
> works and the data is correctly inserted into the DynamoDB table.
> Thanks for your help !!
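The NullPointerException above comes from calling `.toString` on null row values before `setS`. A minimal null-safe rewrite of the mapping step, assuming the same classes as the snippet above (the import paths are my assumption, based on the EMR-DDB connector and AWS SDK v1):

{code}
import java.util.HashMap

import com.amazonaws.services.dynamodbv2.model.AttributeValue
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.io.Text

val ddbInsertFormattedRDD = df_rdd.map { row =>
  val item = new HashMap[String, AttributeValue]()

  // Key column, assumed to always be present.
  val clientNum = new AttributeValue()
  clientNum.setN(row.get(0).toString)
  item.put("ClientNum", clientNum)

  // Add the optional string columns only when they are non-null,
  // so setS never receives a null value.
  Seq("Value_1", "Value_2", "Value_3", "Value_4").zipWithIndex.foreach {
    case (name, i) =>
      val v = row.get(i + 1)
      if (v != null) {
        val attr = new AttributeValue()
        attr.setS(v.toString)
        item.put(name, attr)
      }
  }

  val writable = new DynamoDBItemWritable()
  writable.setItem(item)
  (new Text(""), writable)
}
{code}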



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additiona

[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-08-16 Thread Vivek Garg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580598#comment-17580598
 ] 

Vivek Garg edited comment on SPARK-22588 at 8/17/22 5:52 AM:
-

[salesforce cpq 
certification|https://www.igmguru.com/salesforce/salesforce-cpq-training/]


was (Author: JIRAUSER294516):
[https://www.igmguru.com/salesforce/salesforce-cpq-training/Salesforce CPQ 
Certification]

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum  | Value_1  | Value_2 | Value_3  | Value_4
>  14 |A |B|   C  |   null
>  19 |X |Y|  null|   null
>  21 |R |   null  |  null|   null
> I want to load the data into a DynamoDB table with ClientNum as the key, following:
> Analyze Your Data on Amazon DynamoDB with Apache Spark
> Using Spark SQL for ETL
> Here is the code that I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> fails with the following error:
> Caused by: java.lang.NullPointerException
> The null values caused the error; if I try with only ClientNum and Value_1, it 
> works and the data is correctly inserted into the DynamoDB table.
> Thanks for your help !!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-08-16 Thread Vivek Garg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580598#comment-17580598
 ] 

Vivek Garg edited comment on SPARK-22588 at 8/17/22 5:50 AM:
-

[https://www.igmguru.com/salesforce/salesforce-cpq-training/Salesforce CPQ 
Certification]


was (Author: JIRAUSER294516):
[[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/ SAP 
analytics cloud training]]

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum  | Value_1  | Value_2 | Value_3  | Value_4
>  14 |A |B|   C  |   null
>  19 |X |Y|  null|   null
>  21 |R |   null  |  null|   null
> I want to load the data into a DynamoDB table with ClientNum as the key, following:
> Analyze Your Data on Amazon DynamoDB with Apache Spark
> Using Spark SQL for ETL
> Here is the code that I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> fails with the following error:
> Caused by: java.lang.NullPointerException
> The null values caused the error; if I try with only ClientNum and Value_1, it 
> works and the data is correctly inserted into the DynamoDB table.
> Thanks for your help !!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-08-16 Thread Vivek Garg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580598#comment-17580598
 ] 

Vivek Garg edited comment on SPARK-22588 at 8/17/22 5:49 AM:
-

[[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/ SAP 
analytics cloud training]]


was (Author: JIRAUSER294516):
[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/ SAP 
analytics cloud training]

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum  | Value_1  | Value_2 | Value_3  | Value_4
>  14 |A |B|   C  |   null
>  19 |X |Y|  null|   null
>  21 |R |   null  |  null|   null
> I want to load the data into a DynamoDB table with ClientNum as the key, following:
> Analyze Your Data on Amazon DynamoDB with Apache Spark
> Using Spark SQL for ETL
> Here is the code that I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> fails with the following error:
> Caused by: java.lang.NullPointerException
> The null values caused the error; if I try with only ClientNum and Value_1, it 
> works and the data is correctly inserted into the DynamoDB table.
> Thanks for your help !!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-08-16 Thread Vivek Garg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580598#comment-17580598
 ] 

Vivek Garg edited comment on SPARK-22588 at 8/17/22 5:48 AM:
-

https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/";>SAP
 analytics cloud training
[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/](SAP 
analytics cloud training)
(https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/)[SAP 
analytics cloud training]
[url=https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/]SAP
 analytics cloud training[/url]
[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/ SAP 
analytics cloud training]
[SAP analytics cloud 
training](https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/)
(SAP analytics cloud 
training)[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/]



was (Author: JIRAUSER294516):
According to the [Salesforce CPQ 
Certification|[https://www.igmguru.com/salesforce/salesforce-cpq-training/]] 
Exam, our Salesforce CPQ Training program has been created. The core abilities 
needed for effectively implementing Salesforce CPQ solutions are developed in 
this course on Salesforce CPQ. Through instruction using practical examples, 
this course will go deeper into developing a quoting process, pricing 
strategies, configuration, CPQ object data model, and more. This online 
Salesforce CPQ training course includes practical projects that will aid you in 
passing the Salesforce CPQ Certification test.

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum  | Value_1  | Value_2 | Value_3  | Value_4
>  14 |A |B|   C  |   null
>  19 |X |Y|  null|   null
>  21 |R |   null  |  null|   null
> I want to load the data into a DynamoDB table with ClientNum as the key, following:
> Analyze Your Data on Amazon DynamoDB with Apache Spark
> Using Spark SQL for ETL
> Here is the code that I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(

[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-08-16 Thread Vivek Garg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580598#comment-17580598
 ] 

Vivek Garg edited comment on SPARK-22588 at 8/17/22 5:49 AM:
-


(https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/)[SAP 
analytics cloud training]
[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/ SAP 
analytics cloud training]
[SAP analytics cloud 
training](https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/)
(SAP analytics cloud 
training)[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/]



was (Author: JIRAUSER294516):
https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/";>SAP
 analytics cloud training
[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/](SAP 
analytics cloud training)
(https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/)[SAP 
analytics cloud training]
[url=https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/]SAP
 analytics cloud training[/url]
[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/ SAP 
analytics cloud training]
[SAP analytics cloud 
training](https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/)
(SAP analytics cloud 
training)[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/]


> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum  | Value_1  | Value_2 | Value_3  | Value_4
>  14 |A |B|   C  |   null
>  19 |X |Y|  null|   null
>  21 |R |   null  |  null|   null
> I want to load the data into a DynamoDB table with ClientNum as the key, following:
> Analyze Your Data on Amazon DynamoDB with Apache Spark
> Using Spark SQL for ETL
> Here is the code that I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsH

[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-08-16 Thread Vivek Garg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580598#comment-17580598
 ] 

Vivek Garg edited comment on SPARK-22588 at 8/17/22 5:49 AM:
-

[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/ SAP 
analytics cloud training]


was (Author: JIRAUSER294516):

(https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/)[SAP 
analytics cloud training]
[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/ SAP 
analytics cloud training]
[SAP analytics cloud 
training](https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/)
(SAP analytics cloud 
training)[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/]


> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum  | Value_1  | Value_2 | Value_3  | Value_4
>  14 |A |B|   C  |   null
>  19 |X |Y|  null|   null
>  21 |R |   null  |  null|   null
> I want to load the data into a DynamoDB table with ClientNum as the key, following:
> Analyze Your Data on Amazon DynamoDB with Apache Spark
> Using Spark SQL for ETL
> Here is the code that I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> fails with the following error:
> Caused by: java.lang.NullPointerException
> The null values caused the error; if I try with only ClientNum and Value_1, it 
> works and the data is correctly inserted into the DynamoDB table.
> Thanks for your help !!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-08-16 Thread Vivek Garg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580598#comment-17580598
 ] 

Vivek Garg commented on SPARK-22588:


According to the [Salesforce CPQ 
Certification|[https://www.igmguru.com/salesforce/salesforce-cpq-training/]] 
Exam, our Salesforce CPQ Training program has been created. The core abilities 
needed for effectively implementing Salesforce CPQ solutions are developed in 
this course on Salesforce CPQ. Through instruction using practical examples, 
this course will go deeper into developing a quoting process, pricing 
strategies, configuration, CPQ object data model, and more. This online 
Salesforce CPQ training course includes practical projects that will aid you in 
passing the Salesforce CPQ Certification test.

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum  | Value_1  | Value_2 | Value_3  | Value_4
>  14 |A |B|   C  |   null
>  19 |X |Y|  null|   null
>  21 |R |   null  |  null|   null
> I want to load the data into a DynamoDB table with ClientNum as the key, following:
> Analyze Your Data on Amazon DynamoDB with Apache Spark
> Using Spark SQL for ETL
> Here is the code that I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> fails with the following error:
> Caused by: java.lang.NullPointerException
> The null values caused the error; if I try with only ClientNum and Value_1, it 
> works and the data is correctly inserted into the DynamoDB table.
> Thanks for your help !!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40117) Convert condition to java in DataFrameWriterV2.overwrite

2022-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580588#comment-17580588
 ] 

Apache Spark commented on SPARK-40117:
--

User 'looi' has created a pull request for this issue:
https://github.com/apache/spark/pull/37547

> Convert condition to java in DataFrameWriterV2.overwrite
> 
>
> Key: SPARK-40117
> URL: https://issues.apache.org/jira/browse/SPARK-40117
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Wenli Looi
>Priority: Major
>
> DataFrameWriterV2.overwrite() fails to convert the condition parameter to 
> java. This prevents the function from being called.
> It is caused by the following commit that deleted the `_to_java_column` call 
> instead of fixing it: 
> [https://github.com/apache/spark/commit/a1e459ed9f6777fb8d5a2d09fda666402f9230b9]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40117) Convert condition to java in DataFrameWriterV2.overwrite

2022-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580587#comment-17580587
 ] 

Apache Spark commented on SPARK-40117:
--

User 'looi' has created a pull request for this issue:
https://github.com/apache/spark/pull/37547

> Convert condition to java in DataFrameWriterV2.overwrite
> 
>
> Key: SPARK-40117
> URL: https://issues.apache.org/jira/browse/SPARK-40117
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Wenli Looi
>Priority: Major
>
> DataFrameWriterV2.overwrite() fails to convert the condition parameter to 
> java. This prevents the function from being called.
> It is caused by the following commit that deleted the `_to_java_column` call 
> instead of fixing it: 
> [https://github.com/apache/spark/commit/a1e459ed9f6777fb8d5a2d09fda666402f9230b9]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40117) Convert condition to java in DataFrameWriterV2.overwrite

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40117:


Assignee: (was: Apache Spark)

> Convert condition to java in DataFrameWriterV2.overwrite
> 
>
> Key: SPARK-40117
> URL: https://issues.apache.org/jira/browse/SPARK-40117
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Wenli Looi
>Priority: Major
>
> DataFrameWriterV2.overwrite() fails to convert the condition parameter to 
> java. This prevents the function from being called.
> It is caused by the following commit that deleted the `_to_java_column` call 
> instead of fixing it: 
> [https://github.com/apache/spark/commit/a1e459ed9f6777fb8d5a2d09fda666402f9230b9]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40117) Convert condition to java in DataFrameWriterV2.overwrite

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40117:


Assignee: Apache Spark

> Convert condition to java in DataFrameWriterV2.overwrite
> 
>
> Key: SPARK-40117
> URL: https://issues.apache.org/jira/browse/SPARK-40117
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Wenli Looi
>Assignee: Apache Spark
>Priority: Major
>
> DataFrameWriterV2.overwrite() fails to convert the condition parameter to 
> java. This prevents the function from being called.
> It is caused by the following commit that deleted the `_to_java_column` call 
> instead of fixing it: 
> [https://github.com/apache/spark/commit/a1e459ed9f6777fb8d5a2d09fda666402f9230b9]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40117) Convert condition to java in DataFrameWriterV2.overwrite

2022-08-16 Thread Wenli Looi (Jira)
Wenli Looi created SPARK-40117:
--

 Summary: Convert condition to java in DataFrameWriterV2.overwrite
 Key: SPARK-40117
 URL: https://issues.apache.org/jira/browse/SPARK-40117
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.2.2, 3.3.0, 3.1.3
Reporter: Wenli Looi


DataFrameWriterV2.overwrite() fails to convert the condition parameter to java. 
This prevents the function from being called.

It is caused by the following commit that deleted the `_to_java_column` call 
instead of fixing it: 
[https://github.com/apache/spark/commit/a1e459ed9f6777fb8d5a2d09fda666402f9230b9]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40116) Pin Arrow version to 8.0.0 in AppVeyor

2022-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580574#comment-17580574
 ] 

Apache Spark commented on SPARK-40116:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37546

> Pin Arrow version to 8.0.0 in AppVeyor
> --
>
> Key: SPARK-40116
> URL: https://issues.apache.org/jira/browse/SPARK-40116
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, SparkR
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> SparkR does not support Arrow 9.0.0 (SPARK-40114) so the tests currently 
> fail, https://ci.appveyor.com/project/HyukjinKwon/spark/builds/44490387.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40116) Pin Arrow version to 8.0.0 in AppVeyor

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40116:


Assignee: Apache Spark

> Pin Arrow version to 8.0.0 in AppVeyor
> --
>
> Key: SPARK-40116
> URL: https://issues.apache.org/jira/browse/SPARK-40116
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, SparkR
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> SparkR does not support Arrow 9.0.0 (SPARK-40114) so the tests currently 
> fail, https://ci.appveyor.com/project/HyukjinKwon/spark/builds/44490387.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40116) Pin Arrow version to 8.0.0 in AppVeyor

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40116:


Assignee: (was: Apache Spark)

> Pin Arrow version to 8.0.0 in AppVeyor
> --
>
> Key: SPARK-40116
> URL: https://issues.apache.org/jira/browse/SPARK-40116
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, SparkR
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> SparkR does not support Arrow 9.0.0 (SPARK-40114) so the tests currently 
> fail, https://ci.appveyor.com/project/HyukjinKwon/spark/builds/44490387.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40116) Pin Arrow version to 8.0.0 in AppVeyor

2022-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580573#comment-17580573
 ] 

Apache Spark commented on SPARK-40116:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37546

> Pin Arrow version to 8.0.0 in AppVeyor
> --
>
> Key: SPARK-40116
> URL: https://issues.apache.org/jira/browse/SPARK-40116
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, SparkR
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> SparkR does not support Arrow 9.0.0 (SPARK-40114) so the tests currently 
> fail, https://ci.appveyor.com/project/HyukjinKwon/spark/builds/44490387.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40116) Pin Arrow version to 8.0.0 in AppVeyor

2022-08-16 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-40116:


 Summary: Pin Arrow version to 8.0.0 in AppVeyor
 Key: SPARK-40116
 URL: https://issues.apache.org/jira/browse/SPARK-40116
 Project: Spark
  Issue Type: Test
  Components: Project Infra, SparkR
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


SparkR does not support Arrow 9.0.0 (SPARK-40114) so the tests currently fail, 
https://ci.appveyor.com/project/HyukjinKwon/spark/builds/44490387.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40115) Pin Arrow version to 8.0.0 in AppVeyor

2022-08-16 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-40115:


 Summary: Pin Arrow version to 8.0.0 in AppVeyor
 Key: SPARK-40115
 URL: https://issues.apache.org/jira/browse/SPARK-40115
 Project: Spark
  Issue Type: Test
  Components: Project Infra, SparkR
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


Currently SparkR tests fail 
(https://ci.appveyor.com/project/HyukjinKwon/spark/builds/44490387) because 
SparkR does not support Arrow 9.0.0+; see also SPARK-40114.

We should pin the version to 8.0.0 for now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40114) Arrow 9.0.0 support with SparkR

2022-08-16 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-40114:


 Summary: Arrow 9.0.0 support with SparkR
 Key: SPARK-40114
 URL: https://issues.apache.org/jira/browse/SPARK-40114
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


{code}
== Failed ==
-- 1. Error (test_sparkSQL_arrow.R:103:3): dapply() Arrow optimization -
Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
'n' argument
Backtrace:
 1. SparkR::collect(ret)
  at test_sparkSQL_arrow.R:103:2
 2. SparkR::collect(ret)
 3. SparkR (local) .local(x, ...)
 7. SparkR:::readRaw(conn)
 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
-- 2. Error (test_sparkSQL_arrow.R:133:3): dapply() Arrow optimization - type sp
Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
'n' argument
Backtrace:
 1. SparkR::collect(ret)
  at test_sparkSQL_arrow.R:133:2
 2. SparkR::collect(ret)
 3. SparkR (local) .local(x, ...)
 7. SparkR:::readRaw(conn)
 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
-- 3. Error (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type sp
Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
'n' argument
Backtrace:
  1. testthat::expect_true(all(collect(ret) == rdf))
   at test_sparkSQL_arrow.R:143:2
  5. SparkR::collect(ret)
  6. SparkR (local) .local(x, ...)
 10. SparkR:::readRaw(conn)
 11. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
-- 4. Error (test_sparkSQL_arrow.R:184:3): gapply() Arrow optimization -
Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
'n' argument
Backtrace:
 1. SparkR::collect(ret)
  at test_sparkSQL_arrow.R:184:2
 2. SparkR::collect(ret)
 3. SparkR (local) .local(x, ...)
 7. SparkR:::readRaw(conn)
 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
-- 5. Error (test_sparkSQL_arrow.R:217:3): gapply() Arrow optimization - type sp
Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
'n' argument
Backtrace:
 1. SparkR::collect(ret)
  at test_sparkSQL_arrow.R:217:2
 2. SparkR::collect(ret)
 3. SparkR (local) .local(x, ...)
 7. SparkR:::readRaw(conn)
 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
-- 6. Error (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type sp
Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 
'n' argument
Backtrace:
  1. testthat::expect_true(all(collect(ret) == rdf))
   at test_sparkSQL_arrow.R:229:2
  5. SparkR::collect(ret)
  6. SparkR (local) .local(x, ...)
 10. SparkR:::readRaw(conn)
 11. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
-- 7. Failure (test_sparkSQL_arrow.R:247:3): SPARK-32478: gapply() Arrow optimiz
`count(...)` threw an error with unexpected message.
Expected match: "expected IntegerType, IntegerType, got IntegerType, StringType"
Actual message: "org.apache.spark.SparkException: Job aborted due to stage 
failure: Task 0 in stage 29.0 failed 1 times, most recent failure: Lost task 
0.0 in stage 29.0 (TID 54) (APPVYR-WIN executor driver): 
org.apache.spark.SparkException: R unexpectedly exited.\nR worker produced 
errors: The tzdb package is not installed. Timezones will not be available to 
Arrow compute functions.\nError in arrow::write_arrow(df, raw()) : write_arrow 
has been removed\nCalls:  -> writeRaw -> writeInt -> writeBin -> 
\nExecution halted\n\r\n\tat 
org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:144)\r\n\tat
 
org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:137)\r\n\tat
 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)\r\n\tat
 
org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:194)\r\n\tat
 
org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:123)\r\n\tat
 
org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:113)\r\n\tat
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)\r\n\tat 
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)\r\n\tat 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.hashAgg_doAggregateWithoutKey_0$(Unknown
 Source)\r\n\tat 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
 Source)\r\n\tat 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)\r\n\tat
 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)\r\n\tat
 scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)\r\n\tat 
org.apache.spark.shuffle.sort.BypassMer

[jira] [Updated] (SPARK-40113) Unify ParquetScanBuilder DataSourceV2 interface implementation

2022-08-16 Thread Mars (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars updated SPARK-40113:
-
Priority: Minor  (was: Major)

> Unify ParquetScanBuilder DataSourceV2 interface implementation
> --
>
> Key: SPARK-40113
> URL: https://issues.apache.org/jira/browse/SPARK-40113
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.3.0
>Reporter: Mars
>Priority: Minor
>
> Now `FileScanBuilder` interface is not fully implemented in 
> `ParquetScanBuilder` like 
> `OrcScanBuilder`,`AvroScanBuilder`,`CSVScanBuilder`
> In order to unify the logic of the code and make it clearer, this part of the 
> implementation is unified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40113) Unify ParquetScanBuilder DataSourceV2 interface implementation

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40113:


Assignee: Apache Spark

> Unify ParquetScanBuilder DataSourceV2 interface implementation
> --
>
> Key: SPARK-40113
> URL: https://issues.apache.org/jira/browse/SPARK-40113
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.3.0
>Reporter: Mars
>Assignee: Apache Spark
>Priority: Major
>
> Now `FileScanBuilder` interface is not fully implemented in 
> `ParquetScanBuilder` like 
> `OrcScanBuilder`,`AvroScanBuilder`,`CSVScanBuilder`
> In order to unify the logic of the code and make it clearer, this part of the 
> implementation is unified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40113) Unify ParquetScanBuilder DataSourceV2 interface implementation

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40113:


Assignee: (was: Apache Spark)

> Unify ParquetScanBuilder DataSourceV2 interface implementation
> --
>
> Key: SPARK-40113
> URL: https://issues.apache.org/jira/browse/SPARK-40113
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.3.0
>Reporter: Mars
>Priority: Major
>
> Now `FileScanBuilder` interface is not fully implemented in 
> `ParquetScanBuilder` like 
> `OrcScanBuilder`,`AvroScanBuilder`,`CSVScanBuilder`
> In order to unify the logic of the code and make it clearer, this part of the 
> implementation is unified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40113) Unify ParquetScanBuilder DataSourceV2 interface implementation

2022-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580551#comment-17580551
 ] 

Apache Spark commented on SPARK-40113:
--

User 'yabola' has created a pull request for this issue:
https://github.com/apache/spark/pull/37545

> Unify ParquetScanBuilder DataSourceV2 interface implementation
> --
>
> Key: SPARK-40113
> URL: https://issues.apache.org/jira/browse/SPARK-40113
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.3.0
>Reporter: Mars
>Priority: Major
>
> Now `FileScanBuilder` interface is not fully implemented in 
> `ParquetScanBuilder` like 
> `OrcScanBuilder`,`AvroScanBuilder`,`CSVScanBuilder`
> In order to unify the logic of the code and make it clearer, this part of the 
> implementation is unified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40113) Unify ParquetScanBuilder DataSourceV2 interface implementation

2022-08-16 Thread Mars (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars updated SPARK-40113:
-
Summary: Unify ParquetScanBuilder DataSourceV2 interface implementation  
(was: Unified ParquetScanBuilder DataSourceV2 interface implementation)

> Unify ParquetScanBuilder DataSourceV2 interface implementation
> --
>
> Key: SPARK-40113
> URL: https://issues.apache.org/jira/browse/SPARK-40113
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.3.0
>Reporter: Mars
>Priority: Major
>
> Now `FileScanBuilder` interface is not fully implemented in 
> `ParquetScanBuilder` like 
> `OrcScanBuilder`,`AvroScanBuilder`,`CSVScanBuilder`
> In order to unify the logic of the code and make it clearer, this part of the 
> implementation is unified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40113) Unified ParquetScanBuilder DataSourceV2 interface implementation

2022-08-16 Thread Mars (Jira)
Mars created SPARK-40113:


 Summary: Unified ParquetScanBuilder DataSourceV2 interface 
implementation
 Key: SPARK-40113
 URL: https://issues.apache.org/jira/browse/SPARK-40113
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 3.3.0
Reporter: Mars


Currently the `FileScanBuilder` interface is not fully implemented in 
`ParquetScanBuilder`, unlike `OrcScanBuilder`, `AvroScanBuilder`, and `CSVScanBuilder`.
To unify the logic and make the code clearer, this part of the implementation 
should be brought in line with the other scan builders.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38648) SPIP: Simplified API for DL Inferencing

2022-08-16 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580548#comment-17580548
 ] 

Yikun Jiang commented on SPARK-38648:
-

By the way, just curious, is this SPIP expected to be a feature in version 3.4?

> SPIP: Simplified API for DL Inferencing
> ---
>
> Key: SPARK-38648
> URL: https://issues.apache.org/jira/browse/SPARK-38648
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Lee Yang
>Priority: Minor
>
> h1. Background and Motivation
> The deployment of deep learning (DL) models to Spark clusters can be a point 
> of friction today.  DL practitioners often aren't well-versed with Spark, and 
> Spark experts often aren't well-versed with the fast-changing DL frameworks.  
> Currently, the deployment of trained DL models is done in a fairly ad-hoc 
> manner, with each model integration usually requiring significant effort.
> To simplify this process, we propose adding an integration layer for each 
> major DL framework that can introspect their respective saved models to 
> more-easily integrate these models into Spark applications.  You can find a 
> detailed proposal here: 
> [https://docs.google.com/document/d/1n7QPHVZfmQknvebZEXxzndHPV2T71aBsDnP4COQa_v0]
> h1. Goals
>  - Simplify the deployment of pre-trained single-node DL models to Spark 
> inference applications.
>  - Follow pandas_udf for simple inference use-cases.
>  - Follow Spark ML Pipelines APIs for transfer-learning use-cases.
>  - Enable integrations with popular third-party DL frameworks like 
> TensorFlow, PyTorch, and Huggingface.
>  - Focus on PySpark, since most of the DL frameworks use Python.
>  - Take advantage of built-in Spark features like GPU scheduling and Arrow 
> integration.
>  - Enable inference on both CPU and GPU.
> h1. Non-goals
>  - DL model training.
>  - Inference w/ distributed models, i.e. "model parallel" inference.
> h1. Target Personas
>  - Data scientists who need to deploy DL models on Spark.
>  - Developers who need to deploy DL models on Spark.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38648) SPIP: Simplified API for DL Inferencing

2022-08-16 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580547#comment-17580547
 ] 

Yikun Jiang commented on SPARK-38648:
-

If we want to run ONNX models directly, we might want to support onnxruntime as one of 
the DL frameworks, e.g. sparkext.onnxruntime.Model(url).

For other frameworks, users can first convert the ONNX model to the framework-specific 
model format [1] and then call sparkext.onnxruntime.Model(converted_url); I don't 
think that is too difficult.

So I personally think the model format should not be unified; ONNX is just one of 
the supported formats.

[1]https://pytorch.org/docs/stable/onnx.html#torch-onnx
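
For what it's worth, here is a minimal standalone sketch of the PyTorch-to-ONNX path 
referenced in [1] plus running the exported file with onnxruntime. The toy model, 
shapes, and the "input"/"output" tensor names are assumptions for illustration only, 
and sparkext is still just the proposed API from the SPIP, so it is not used here:

{code:python}
# Illustrative sketch only: a toy model exported to ONNX and run with onnxruntime.
import numpy as np
import torch
import onnxruntime as ort

model = torch.nn.Linear(4, 2)
dummy = torch.randn(1, 4)

# Export the PyTorch model to ONNX (the direction described in [1]).
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)

# Run the exported model with onnxruntime on a small batch.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
batch = np.random.rand(8, 4).astype(np.float32)
(result,) = sess.run(None, {"input": batch})
print(result.shape)  # (8, 2)
{code}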

> SPIP: Simplified API for DL Inferencing
> ---
>
> Key: SPARK-38648
> URL: https://issues.apache.org/jira/browse/SPARK-38648
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Lee Yang
>Priority: Minor
>
> h1. Background and Motivation
> The deployment of deep learning (DL) models to Spark clusters can be a point 
> of friction today.  DL practitioners often aren't well-versed with Spark, and 
> Spark experts often aren't well-versed with the fast-changing DL frameworks.  
> Currently, the deployment of trained DL models is done in a fairly ad-hoc 
> manner, with each model integration usually requiring significant effort.
> To simplify this process, we propose adding an integration layer for each 
> major DL framework that can introspect their respective saved models to 
> more-easily integrate these models into Spark applications.  You can find a 
> detailed proposal here: 
> [https://docs.google.com/document/d/1n7QPHVZfmQknvebZEXxzndHPV2T71aBsDnP4COQa_v0]
> h1. Goals
>  - Simplify the deployment of pre-trained single-node DL models to Spark 
> inference applications.
>  - Follow pandas_udf for simple inference use-cases.
>  - Follow Spark ML Pipelines APIs for transfer-learning use-cases.
>  - Enable integrations with popular third-party DL frameworks like 
> TensorFlow, PyTorch, and Huggingface.
>  - Focus on PySpark, since most of the DL frameworks use Python.
>  - Take advantage of built-in Spark features like GPU scheduling and Arrow 
> integration.
>  - Enable inference on both CPU and GPU.
> h1. Non-goals
>  - DL model training.
>  - Inference w/ distributed models, i.e. "model parallel" inference.
> h1. Target Personas
>  - Data scientists who need to deploy DL models on Spark.
>  - Developers who need to deploy DL models on Spark.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40110) Add JDBCWithAQESuite

2022-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580545#comment-17580545
 ] 

Apache Spark commented on SPARK-40110:
--

User 'kazuyukitanimura' has created a pull request for this issue:
https://github.com/apache/spark/pull/37544

> Add JDBCWithAQESuite
> 
>
> Key: SPARK-40110
> URL: https://issues.apache.org/jira/browse/SPARK-40110
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> Currently `JDBCSuite` assumes that AQE is always turned off. We should also 
> test with AQE turned on



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40110) Add JDBCWithAQESuite

2022-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580546#comment-17580546
 ] 

Apache Spark commented on SPARK-40110:
--

User 'kazuyukitanimura' has created a pull request for this issue:
https://github.com/apache/spark/pull/37544

> Add JDBCWithAQESuite
> 
>
> Key: SPARK-40110
> URL: https://issues.apache.org/jira/browse/SPARK-40110
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> Currently `JDBCSuite` assumes that AQE is always turned off. We should also 
> test with AQE turned on



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40110) Add JDBCWithAQESuite

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40110:


Assignee: Apache Spark

> Add JDBCWithAQESuite
> 
>
> Key: SPARK-40110
> URL: https://issues.apache.org/jira/browse/SPARK-40110
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Assignee: Apache Spark
>Priority: Minor
>
> Currently `JDBCSuite` assumes that AQE is always turned off. We should also 
> test with AQE turned on



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40110) Add JDBCWithAQESuite

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40110:


Assignee: (was: Apache Spark)

> Add JDBCWithAQESuite
> 
>
> Key: SPARK-40110
> URL: https://issues.apache.org/jira/browse/SPARK-40110
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> Currently `JDBCSuite` assumes that AQE is always turned off. We should also 
> test with AQE turned on



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39975) Upgrade rocksdbjni to 7.4.5

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39975:


Assignee: Apache Spark

> Upgrade rocksdbjni to 7.4.5
> ---
>
> Key: SPARK-39975
> URL: https://issues.apache.org/jira/browse/SPARK-39975
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
> [https://github.com/facebook/rocksdb/releases/tag/v7.4.5]
>  
> {code:java}
> Fix a bug starting in 7.4.0 in which some fsync operations might be skipped 
> in a DB after any DropColumnFamily on that DB, until it is re-opened. This 
> can lead to data loss on power loss. (For custom FileSystem implementations, 
> this could lead to FSDirectory::Fsync or FSDirectory::Close after the first 
> FSDirectory::Close; Also, valgrind could report call to close() with fd=-1.) 
> {code}
>  
> Fixed a bug that caused data loss
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39975) Upgrade rocksdbjni to 7.4.5

2022-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580539#comment-17580539
 ] 

Apache Spark commented on SPARK-39975:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37543

> Upgrade rocksdbjni to 7.4.5
> ---
>
> Key: SPARK-39975
> URL: https://issues.apache.org/jira/browse/SPARK-39975
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> [https://github.com/facebook/rocksdb/releases/tag/v7.4.5]
>  
> {code:java}
> Fix a bug starting in 7.4.0 in which some fsync operations might be skipped 
> in a DB after any DropColumnFamily on that DB, until it is re-opened. This 
> can lead to data loss on power loss. (For custom FileSystem implementations, 
> this could lead to FSDirectory::Fsync or FSDirectory::Close after the first 
> FSDirectory::Close; Also, valgrind could report call to close() with fd=-1.) 
> {code}
>  
> Fixed a bug that caused data loss
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39975) Upgrade rocksdbjni to 7.4.5

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39975:


Assignee: (was: Apache Spark)

> Upgrade rocksdbjni to 7.4.5
> ---
>
> Key: SPARK-39975
> URL: https://issues.apache.org/jira/browse/SPARK-39975
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> [https://github.com/facebook/rocksdb/releases/tag/v7.4.5]
>  
> {code:java}
> Fix a bug starting in 7.4.0 in which some fsync operations might be skipped 
> in a DB after any DropColumnFamily on that DB, until it is re-opened. This 
> can lead to data loss on power loss. (For custom FileSystem implementations, 
> this could lead to FSDirectory::Fsync or FSDirectory::Close after the first 
> FSDirectory::Close; Also, valgrind could report call to close() with fd=-1.) 
> {code}
>  
> Fixed a bug that caused data loss
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40046) Use Jackson instead of json4s to serialize `RocksDBMetrics`

2022-08-16 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-40046.
--
Resolution: Won't Fix

> Use Jackson instead of json4s to serialize `RocksDBMetrics`
> ---
>
> Key: SPARK-40046
> URL: https://issues.apache.org/jira/browse/SPARK-40046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40075) Refactor kafka010.JsonUtils to use Jackson instead of Json4s

2022-08-16 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-40075.
--
Resolution: Won't Fix

> Refactor kafka010.JsonUtils to use Jackson instead of Json4s
> 
>
> Key: SPARK-40075
> URL: https://issues.apache.org/jira/browse/SPARK-40075
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40101) `include an external JAR in SparkR` in core module but need antlr4

2022-08-16 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580535#comment-17580535
 ] 

Hyukjin Kwon commented on SPARK-40101:
--

I think mvn install won't work. It has to be mvn package IIRC, since it requires 
jars to load; mvn install doesn't create the jars IIRC.

> `include an external JAR in SparkR` in core module but need antlr4
> --
>
> Key: SPARK-40101
> URL: https://issues.apache.org/jira/browse/SPARK-40101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> Run following commands:
>  
> {code:java}
> mvn clean -Phadoop-3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl 
> -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
> mvn clean install -DskipTests -pl core -am
> mvn clean test -pl core 
> -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest -Dtest=none 
> -DwildcardSuites=org.apache.spark.deploy.SparkSubmitSuite {code}
>  
>  
> `include an external JAR in SparkR` failed as follows:
>  
> {code:java}
> include an external JAR in SparkR *** FAILED ***
>   spark-submit returned with exit code 1.
>   Command line: '/Users/Spark/spark-source/bin/spark-submit' '--name' 
> 'testApp' '--master' 'local' '--jars' 
> 'file:/Users/Spark/spark-source/core/target/tmp/spark-e15e960c-1c10-44cd-99d4-f1905e4c18be/sparkRTestJar-1660632952368.jar'
>  '--verbose' '--conf' 'spark.ui.enabled=false' 
> '/Users/Spark/spark-source/R/pkg/tests/fulltests/jarTest.R'
>   
>   2022-08-15 23:55:53.495 - stderr> Using properties file: null
>   2022-08-15 23:55:53.58 - stderr> Parsed arguments:
>   2022-08-15 23:55:53.58 - stderr>   master                  local
>   2022-08-15 23:55:53.58 - stderr>   deployMode              null
>   2022-08-15 23:55:53.58 - stderr>   executorMemory          null
>   2022-08-15 23:55:53.581 - stderr>   executorCores           null
>   2022-08-15 23:55:53.581 - stderr>   totalExecutorCores      null
>   2022-08-15 23:55:53.581 - stderr>   propertiesFile          null
>   2022-08-15 23:55:53.581 - stderr>   driverMemory            null
>   2022-08-15 23:55:53.581 - stderr>   driverCores             null
>   2022-08-15 23:55:53.581 - stderr>   driverExtraClassPath    null
>   2022-08-15 23:55:53.581 - stderr>   driverExtraLibraryPath  null
>   2022-08-15 23:55:53.581 - stderr>   driverExtraJavaOptions  null
>   2022-08-15 23:55:53.581 - stderr>   supervise               false
>   2022-08-15 23:55:53.581 - stderr>   queue                   null
>   2022-08-15 23:55:53.581 - stderr>   numExecutors            null
>   2022-08-15 23:55:53.581 - stderr>   files                   null
>   2022-08-15 23:55:53.581 - stderr>   pyFiles                 null
>   2022-08-15 23:55:53.581 - stderr>   archives                null
>   2022-08-15 23:55:53.581 - stderr>   mainClass               null
>   2022-08-15 23:55:53.581 - stderr>   primaryResource         
> file:/Users/Spark/spark-source/R/pkg/tests/fulltests/jarTest.R
>   2022-08-15 23:55:53.581 - stderr>   name                    testApp
>   2022-08-15 23:55:53.581 - stderr>   childArgs               []
>   2022-08-15 23:55:53.581 - stderr>   jars                    
> file:/Users/Spark/spark-source/core/target/tmp/spark-e15e960c-1c10-44cd-99d4-f1905e4c18be/sparkRTestJar-1660632952368.jar
>   2022-08-15 23:55:53.581 - stderr>   packages                null
>   2022-08-15 23:55:53.581 - stderr>   packagesExclusions      null
>   2022-08-15 23:55:53.581 - stderr>   repositories            null
>   2022-08-15 23:55:53.581 - stderr>   verbose                 true
>   2022-08-15 23:55:53.581 - stderr> 
>   2022-08-15 23:55:53.581 - stderr> Spark properties used, including those 
> specified through
>   2022-08-15 23:55:53.581 - stderr>  --conf and those from the properties 
> file null:
>   2022-08-15 23:55:53.581 - stderr>   (spark.ui.enabled,false)
>   2022-08-15 23:55:53.581 - stderr> 
>   2022-08-15 23:55:53.581 - stderr>     
>   2022-08-15 23:55:53.729 - stderr> 
> /Users/Spark/spark-source/core/target/tmp/spark-e15e960c-1c10-44cd-99d4-f1905e4c18be/sparkRTestJar-1660632952368.jar
>  doesn't contain R source code, skipping...
>   2022-08-15 23:55:54.058 - stderr> Main class:
>   2022-08-15 23:55:54.058 - stderr> org.apache.spark.deploy.RRunner
>   2022-08-15 23:55:54.058 - stderr> Arguments:
>   2022-08-15 23:55:54.058 - stderr> 
> file:/Users/Spark/spark-source/R/pkg/tests/fulltests/jarTest.R
>   2022-08-15 23:55:54.06 - stderr> Spark config:
>   2022-08-15 23:55:54.06 - stderr> (spark.app.name,testApp)
>   2022-08-15 23:55:54.06 - stderr> (spark.app.submitTime,1660632954058)
>   2022-08-15 23:55:54.06 - stderr> 
> (spark.files,file:/Users/Spark/spark-source/R/pkg/tests/fulltests/jarTest.R)

[jira] [Commented] (SPARK-40103) Support read/write.csv() in SparkR

2022-08-16 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580522#comment-17580522
 ] 

Hyukjin Kwon commented on SPARK-40103:
--

The main problem is that the signature conflicts with the base R API, IIRC. We 
should probably use a different name for this.
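
For reference, a short PySpark sketch of the high-level reader/writer entry points the 
issue asks SparkR to mirror (the CSV paths here are placeholders, not from the report):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The high-level CSV entry points that already exist in PySpark.
df = spark.read.csv("/tmp/people.csv", header=True, inferSchema=True)
df.write.mode("overwrite").csv("/tmp/people_out", header=True)
{code}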

> Support read/write.csv() in SparkR
> --
>
> Key: SPARK-40103
> URL: https://issues.apache.org/jira/browse/SPARK-40103
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 3.3.0
>Reporter: deshanxiao
>Priority: Major
>
> Today, almost all languages support the DataFrameReader.csv API; only R is 
> missing. We need to use df.read() to read the CSV file. We need a more 
> high-level API for it.
> Java:
> [DataFrameReader.csv()|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html]
> Scala:
> [DataFrameReader.csv()|https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameReader.html#csv(paths:String*):org.apache.spark.sql.DataFrame]
> Python:
> [DataFrameReader.csv()|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.csv.html#pyspark.sql.DataFrameReader.csv]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39184) ArrayIndexOutOfBoundsException for some date/time sequences in some time-zones

2022-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580521#comment-17580521
 ] 

Apache Spark commented on SPARK-39184:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/37542

> ArrayIndexOutOfBoundsException for some date/time sequences in some time-zones
> --
>
> Key: SPARK-39184
> URL: https://issues.apache.org/jira/browse/SPARK-39184
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3, 3.2.1, 3.3.0, 3.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
>
> The following query gets an {{ArrayIndexOutOfBoundsException}} when run from 
> the {{America/Los_Angeles}} time-zone:
> {noformat}
> spark-sql> select sequence(timestamp'2022-03-13 00:00:00', 
> timestamp'2022-03-16 03:00:00', interval 1 day 1 hour) as x;
> 22/05/13 14:47:27 ERROR SparkSQLDriver: Failed in [select 
> sequence(timestamp'2022-03-13 00:00:00', timestamp'2022-03-16 03:00:00', 
> interval 1 day 1 hour) as x]
> java.lang.ArrayIndexOutOfBoundsException: 3
> {noformat}
> In fact, any such query will get an {{ArrayIndexOutOfBoundsException}} if the 
> start-stop period in your time-zone includes more instances of "spring 
> forward" than instances of "fall back" and the start-stop period is evenly 
> divisible by the interval.
> In the {{America/Los_Angeles}} time-zone, examples include:
> {noformat}
> -- This query encompasses 2 instances of "spring forward" but only one
> -- instance of "fall back".
> select sequence(
>   timestamp'2022-03-13',
>   timestamp'2022-03-13' + (interval '42' hours * 209),
>   interval '42' hours) as x;
> {noformat}
> {noformat}
> select sequence(
>   timestamp'2022-03-13',
>   timestamp'2022-03-13' + (interval '31' hours * 11),
>   interval '31' hours) as x;
> {noformat}
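
Since the examples above only fail in certain time zones, here is a small PySpark 
sketch of one way to reproduce the first query, with the session time zone pinned 
explicitly (an assumption about how one reproduces this, not part of the report):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The failure is time-zone dependent, so pin the session time zone first.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

# On affected versions this raises java.lang.ArrayIndexOutOfBoundsException.
spark.sql(
    "SELECT sequence(timestamp'2022-03-13 00:00:00', "
    "timestamp'2022-03-16 03:00:00', interval 1 day 1 hour) AS x"
).show(truncate=False)
{code}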



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40063) pyspark.pandas .apply() changing rows ordering

2022-08-16 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580518#comment-17580518
 ] 

Hyukjin Kwon commented on SPARK-40063:
--

[~marcelorossini] what's your default index type, i.e. the 
{{compute.default_index_type}} configuration?
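
For anyone checking this, a small pandas-API-on-Spark sketch of how that configuration 
can be inspected and changed (the value set below is just one of the documented options):

{code:python}
import pyspark.pandas as ps

# Inspect the default index type currently in effect.
print(ps.get_option("compute.default_index_type"))

# Documented values are "sequence", "distributed-sequence", and "distributed";
# they differ in how the implicit index is generated, which can affect the
# apparent row ordering of results such as the .apply() output above.
ps.set_option("compute.default_index_type", "distributed-sequence")
{code}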

> pyspark.pandas .apply() changing rows ordering
> --
>
> Key: SPARK-40063
> URL: https://issues.apache.org/jira/browse/SPARK-40063
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.3.0
> Environment: Databricks Runtime 11.1
>Reporter: Marcelo Rossini Castro
>Priority: Minor
>  Labels: Pandas, PySpark
>
> When using the apply function to apply a function to a DataFrame column, it 
> ends up mixing the column's rows ordering.
> A command like this:
> {code:java}
> def example_func(df_col):
>   return df_col ** 2 
> df['col_to_apply_function'] = df.apply(lambda row: 
> example_func(row['col_to_apply_function']), axis=1) {code}
> A workaround is to assign the results to a new column instead of the same 
> one, but if the old column is dropped, the same error is produced.
> Setting one column as index also didn't work.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39184) ArrayIndexOutOfBoundsException for some date/time sequences in some time-zones

2022-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580519#comment-17580519
 ] 

Apache Spark commented on SPARK-39184:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/37542

> ArrayIndexOutOfBoundsException for some date/time sequences in some time-zones
> --
>
> Key: SPARK-39184
> URL: https://issues.apache.org/jira/browse/SPARK-39184
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3, 3.2.1, 3.3.0, 3.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
>
> The following query gets an {{ArrayIndexOutOfBoundsException}} when run from 
> the {{America/Los_Angeles}} time-zone:
> {noformat}
> spark-sql> select sequence(timestamp'2022-03-13 00:00:00', 
> timestamp'2022-03-16 03:00:00', interval 1 day 1 hour) as x;
> 22/05/13 14:47:27 ERROR SparkSQLDriver: Failed in [select 
> sequence(timestamp'2022-03-13 00:00:00', timestamp'2022-03-16 03:00:00', 
> interval 1 day 1 hour) as x]
> java.lang.ArrayIndexOutOfBoundsException: 3
> {noformat}
> In fact, any such query will get an {{ArrayIndexOutOfBoundsException}} if the 
> start-stop period in your time-zone includes more instances of "spring 
> forward" than instances of "fall back" and the start-stop period is evenly 
> divisible by the interval.
> In the {{America/Los_Angeles}} time-zone, examples include:
> {noformat}
> -- This query encompasses 2 instances of "spring forward" but only one
> -- instance of "fall back".
> select sequence(
>   timestamp'2022-03-13',
>   timestamp'2022-03-13' + (interval '42' hours * 209),
>   interval '42' hours) as x;
> {noformat}
> {noformat}
> select sequence(
>   timestamp'2022-03-13',
>   timestamp'2022-03-13' + (interval '31' hours * 11),
>   interval '31' hours) as x;
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40112) Improve the TO_BINARY() function

2022-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580517#comment-17580517
 ] 

Apache Spark commented on SPARK-40112:
--

User 'vitaliili-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/37483

> Improve the TO_BINARY() function
> 
>
> Key: SPARK-40112
> URL: https://issues.apache.org/jira/browse/SPARK-40112
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Priority: Major
>
> Original description SPARK-37507:
> {quote}to_binary(expr, fmt) is a common function available in many other 
> systems to provide a unified entry for string to binary data conversion, 
> where fmt can be utf8, base64, hex and base2 (or whatever the reverse 
> operation to_char()supports).
> [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html]
> [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html]
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes]
> [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw]
> Related Spark functions: unbase64, unhex
> {quote}
>  
> Expected improvement:
>  * `base64` behaves more strictly, i.e. does not allow symbols not included 
> in base64 dictionary (A-Za-z0-9+/) and verifies correct padding and symbol 
> groups (see RFC 4648 § 4). Whitespaces are ignored.
>  ** Current implementation allows arbitrary strings and invalid symbols are 
> skipped.
>  * `hex` converts only valid hexadecimal strings and throws errors otherwise. 
> Whitespaces are not allowed.
>  * `utf-8` and `utf8` are interchangeable.
>  * Correct errors are thrown and classified for invalid input 
> (CONVERSION_INVALID_INPUT) and invalid format (CONVERSION_INVALID_FORMAT)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40112) Improve the TO_BINARY() function

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40112:


Assignee: Apache Spark

> Improve the TO_BINARY() function
> 
>
> Key: SPARK-40112
> URL: https://issues.apache.org/jira/browse/SPARK-40112
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Assignee: Apache Spark
>Priority: Major
>
> Original description SPARK-37507:
> {quote}to_binary(expr, fmt) is a common function available in many other 
> systems to provide a unified entry for string to binary data conversion, 
> where fmt can be utf8, base64, hex and base2 (or whatever the reverse 
> operation to_char()supports).
> [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html]
> [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html]
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes]
> [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw]
> Related Spark functions: unbase64, unhex
> {quote}
>  
> Expected improvement:
>  * `base64` behaves more strictly, i.e. does not allow symbols not included 
> in base64 dictionary (A-Za-z0-9+/) and verifies correct padding and symbol 
> groups (see RFC 4648 § 4). Whitespaces are ignored.
>  ** Current implementation allows arbitrary strings and invalid symbols are 
> skipped.
>  * `hex` converts only valid hexadecimal strings and throws errors otherwise. 
> Whitespaces are not allowed.
>  * `utf-8` and `utf8` are interchangeable.
>  * Correct errors are thrown and classified for invalid input 
> (CONVERSION_INVALID_INPUT) and invalid format (CONVERSION_INVALID_FORMAT)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40112) Improve the TO_BINARY() function

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40112:


Assignee: (was: Apache Spark)

> Improve the TO_BINARY() function
> 
>
> Key: SPARK-40112
> URL: https://issues.apache.org/jira/browse/SPARK-40112
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Priority: Major
>
> Original description SPARK-37507:
> {quote}to_binary(expr, fmt) is a common function available in many other 
> systems to provide a unified entry for string to binary data conversion, 
> where fmt can be utf8, base64, hex and base2 (or whatever the reverse 
> operation to_char()supports).
> [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html]
> [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html]
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes]
> [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw]
> Related Spark functions: unbase64, unhex
> {quote}
>  
> Expected improvement:
>  * `base64` behaves more strictly, i.e. does not allow symbols not included 
> in base64 dictionary (A-Za-z0-9+/) and verifies correct padding and symbol 
> groups (see RFC 4648 § 4). Whitespaces are ignored.
>  ** Current implementation allows arbitrary strings and invalid symbols are 
> skipped.
>  * `hex` converts only valid hexadecimal strings and throws errors otherwise. 
> Whitespaces are not allowed.
>  * `utf-8` and `utf8` are interchangeable.
>  * Correct errors are thrown and classified for invalid input 
> (CONVERSION_INVALID_INPUT) and invalid format (CONVERSION_INVALID_FORMAT)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40112) Improve the TO_BINARY() function

2022-08-16 Thread Vitalii Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalii Li updated SPARK-40112:
---
Description: 
Original description SPARK-37507:
{quote}to_binary(expr, fmt) is a common function available in many other 
systems to provide a unified entry for string to binary data conversion, where 
fmt can be utf8, base64, hex and base2 (or whatever the reverse operation 
to_char()supports).

[https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html]

[https://docs.snowflake.com/en/sql-reference/functions/to_binary.html]

[https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes]

[https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw]

Related Spark functions: unbase64, unhex
{quote}
 

Expected improvement:
 * `base64` behaves more strictly, i.e. does not allow symbols not included in 
base64 dictionary (A-Za-z0-9+/) and verifies correct padding and symbol groups 
(see RFC 4648 § 4). Whitespaces are ignored.
 ** Current implementation allows arbitrary strings and invalid symbols are 
skipped.
 * `hex` converts only valid hexadecimal strings and throws errors otherwise. 
Whitespaces are not allowed.
 * `utf-8` and `utf8` are interchangeable.
 * Correct errors are thrown and classified for invalid input 
(CONVERSION_INVALID_INPUT) and invalid format (CONVERSION_INVALID_FORMAT)

  was:
Original description SPARK-37507:
{quote}to_binary(expr, fmt) is a common function available in many other 
systems to provide a unified entry for string to binary data conversion, where 
fmt can be utf8, base64, hex and base2 (or whatever the reverse operation 
to_char()supports).

[https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html]

[https://docs.snowflake.com/en/sql-reference/functions/to_binary.html]

[https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes]

[https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw]

Related Spark functions: unbase64, unhex
{quote}
 

Expected improvement:
 * `base64` behaves more strictly, i.e. does not allow symbols not included in 
base64 dictionary (A-Za-z0-9+/) and verifies correct padding and symbol groups 
(see RFC 4648 § 4). Whitespaces are ignored.
 ** Current implementation allows arbitrary strings and invalid symbols are 
skipped.
 * `hex` converts only valid hexadecimal strings and throws errors otherwise. 
Whitespaces are not allowed.
 * `utf-8` and `utf8` are interchangeable.
 * Correct errors are thrown and classified for invalid input and invalid 
format:


> Improve the TO_BINARY() function
> 
>
> Key: SPARK-40112
> URL: https://issues.apache.org/jira/browse/SPARK-40112
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Priority: Major
>
> Original description SPARK-37507:
> {quote}to_binary(expr, fmt) is a common function available in many other 
> systems to provide a unified entry for string to binary data conversion, 
> where fmt can be utf8, base64, hex and base2 (or whatever the reverse 
> operation to_char()supports).
> [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html]
> [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html]
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes]
> [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw]
> Related Spark functions: unbase64, unhex
> {quote}
>  
> Expected improvement:
>  * `base64` behaves more strictly, i.e. does not allow symbols not included 
> in base64 dictionary (A-Za-z0-9+/) and verifies correct padding and symbol 
> groups (see RFC 4648 § 4). Whitespaces are ignored.
>  ** Current implementation allows arbitrary strings and invalid symbols are 
> skipped.
>  * `hex` converts only valid hexadecimal strings and throws errors otherwise. 
> Whitespaces are not allowed.
>  * `utf-8` and `utf8` are interchangeable.
>  * Correct errors are thrown and classified for invalid input 
> (CONVERSION_INVALID_INPUT) and invalid format (CONVERSION_INVALID_FORMAT)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40112) Improve the TO_BINARY() function

2022-08-16 Thread Vitalii Li (Jira)
Vitalii Li created SPARK-40112:
--

 Summary: Improve the TO_BINARY() function
 Key: SPARK-40112
 URL: https://issues.apache.org/jira/browse/SPARK-40112
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Vitalii Li


Original description SPARK-37507:
{quote}to_binary(expr, fmt) is a common function available in many other 
systems to provide a unified entry for string to binary data conversion, where 
fmt can be utf8, base64, hex and base2 (or whatever the reverse operation 
to_char()supports).

[https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html]

[https://docs.snowflake.com/en/sql-reference/functions/to_binary.html]

[https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes]

[https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw]

Related Spark functions: unbase64, unhex
{quote}
 

Expected improvement:
 * `base64` behaves more strictly, i.e. does not allow symbols not included in 
base64 dictionary (A-Za-z0-9+/) and verifies correct padding and symbol groups 
(see RFC 4648 § 4). Whitespaces are ignored.
 ** Current implementation allows arbitrary strings and invalid symbols are 
skipped.
 * `hex` converts only valid hexadecimal strings and throws errors otherwise. 
Whitespaces are not allowed.
 * `utf-8` and `utf8` are interchangeable.
 * Correct errors are thrown and classified for invalid input and invalid 
format:
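
For context, a few examples of the current to_binary() entry point whose behaviour 
this ticket proposes to tighten, run through PySpark only for convenience; the 
literals are arbitrary sample values:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# '537061726B' is the hex encoding of "Spark"; 'U3BhcmsgU1FM' is base64 for "Spark SQL".
spark.sql("SELECT to_binary('537061726B', 'hex')").show()
spark.sql("SELECT to_binary('U3BhcmsgU1FM', 'base64')").show()
spark.sql("SELECT to_binary('Spark SQL', 'utf-8')").show()
{code}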



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40111) Make pyspark.rdd examples self-contained

2022-08-16 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40111:
-

 Summary: Make pyspark.rdd examples self-contained
 Key: SPARK-40111
 URL: https://issues.apache.org/jira/browse/SPARK-40111
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40111) Make pyspark.rdd examples self-contained

2022-08-16 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40111:
-

Assignee: Ruifeng Zheng

> Make pyspark.rdd examples self-contained
> 
>
> Key: SPARK-40111
> URL: https://issues.apache.org/jira/browse/SPARK-40111
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40110) Add JDBCWithAQESuite

2022-08-16 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-40110:
-

 Summary: Add JDBCWithAQESuite
 Key: SPARK-40110
 URL: https://issues.apache.org/jira/browse/SPARK-40110
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.4.0
Reporter: Kazuyuki Tanimura


Currently `JDBCSuite` assumes that AQE is always turned off. We should also 
test with AQE turned on.
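
Not the proposed Scala suite itself, but a quick PySpark sketch of the switch such a 
suite would flip (AQE is controlled by a single SQL conf):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The JDBC tests would be re-run with adaptive query execution enabled.
spark.conf.set("spark.sql.adaptive.enabled", "true")
print(spark.conf.get("spark.sql.adaptive.enabled"))
{code}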



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40109) New SQL function: get()

2022-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580502#comment-17580502
 ] 

Apache Spark commented on SPARK-40109:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37541

> New SQL function: get()
> ---
>
> Key: SPARK-40109
> URL: https://issues.apache.org/jira/browse/SPARK-40109
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, when accessing array element with invalid index under ANSI SQL 
> mode, the error is like:
> {quote}[INVALID_ARRAY_INDEX] The index -1 is out of bounds. The array has 3 
> elements. Use `try_element_at` and increase the array index by 1(the starting 
> array index is 1 for `try_element_at`) to tolerate accessing element at 
> invalid index and return NULL instead. If necessary set 
> "spark.sql.ansi.enabled" to "false" to bypass this error.
> {quote}
> The provided solution is complicated. I suggest introducing a new method 
> get() which always returns null on an invalid array index. This is from 
> [https://docs.snowflake.com/en/sql-reference/functions/get.html.]
> Since Spark's map access always returns null, let's don't support map type in 
> the get method for now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40109) New SQL function: get()

2022-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580501#comment-17580501
 ] 

Apache Spark commented on SPARK-40109:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37541

> New SQL function: get()
> ---
>
> Key: SPARK-40109
> URL: https://issues.apache.org/jira/browse/SPARK-40109
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, when accessing array element with invalid index under ANSI SQL 
> mode, the error is like:
> {quote}[INVALID_ARRAY_INDEX] The index -1 is out of bounds. The array has 3 
> elements. Use `try_element_at` and increase the array index by 1(the starting 
> array index is 1 for `try_element_at`) to tolerate accessing element at 
> invalid index and return NULL instead. If necessary set 
> "spark.sql.ansi.enabled" to "false" to bypass this error.
> {quote}
> The provided solution is complicated. I suggest introducing a new method 
> get() which always returns null on an invalid array index. This is from 
> [https://docs.snowflake.com/en/sql-reference/functions/get.html.]
> Since Spark's map access always returns null, let's don't support map type in 
> the get method for now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40109) New SQL function: get()

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40109:


Assignee: Apache Spark  (was: Gengliang Wang)

> New SQL function: get()
> ---
>
> Key: SPARK-40109
> URL: https://issues.apache.org/jira/browse/SPARK-40109
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Currently, when accessing array element with invalid index under ANSI SQL 
> mode, the error is like:
> {quote}[INVALID_ARRAY_INDEX] The index -1 is out of bounds. The array has 3 
> elements. Use `try_element_at` and increase the array index by 1(the starting 
> array index is 1 for `try_element_at`) to tolerate accessing element at 
> invalid index and return NULL instead. If necessary set 
> "spark.sql.ansi.enabled" to "false" to bypass this error.
> {quote}
> The provided solution is complicated. I suggest introducing a new method 
> get() which always returns null on an invalid array index. This is from 
> [https://docs.snowflake.com/en/sql-reference/functions/get.html.]
> Since Spark's map access always returns null, let's don't support map type in 
> the get method for now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40109) New SQL function: get()

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40109:


Assignee: Gengliang Wang  (was: Apache Spark)

> New SQL function: get()
> ---
>
> Key: SPARK-40109
> URL: https://issues.apache.org/jira/browse/SPARK-40109
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, when accessing an array element with an invalid index under ANSI SQL 
> mode, the error is:
> {quote}[INVALID_ARRAY_INDEX] The index -1 is out of bounds. The array has 3 
> elements. Use `try_element_at` and increase the array index by 1(the starting 
> array index is 1 for `try_element_at`) to tolerate accessing element at 
> invalid index and return NULL instead. If necessary set 
> "spark.sql.ansi.enabled" to "false" to bypass this error.
> {quote}
> The provided solution is complicated. I suggest introducing a new method 
> get() which always returns null on an invalid array index. This follows 
> [https://docs.snowflake.com/en/sql-reference/functions/get.html].
> Since Spark's map access always returns null, let's not support the map type in 
> the get() method for now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40109) New SQL function: get()

2022-08-16 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-40109:
--

 Summary: New SQL function: get()
 Key: SPARK-40109
 URL: https://issues.apache.org/jira/browse/SPARK-40109
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


Currently, when accessing an array element with an invalid index under ANSI SQL mode, 
the error is:
{quote}[INVALID_ARRAY_INDEX] The index -1 is out of bounds. The array has 3 
elements. Use `try_element_at` and increase the array index by 1(the starting 
array index is 1 for `try_element_at`) to tolerate accessing element at invalid 
index and return NULL instead. If necessary set "spark.sql.ansi.enabled" to 
"false" to bypass this error.
{quote}
The provided solution is complicated. I suggest introducing a new method get() 
which always returns null on an invalid array index. This follows 
[https://docs.snowflake.com/en/sql-reference/functions/get.html].

Since Spark's map access always returns null, let's not support the map type in 
the get() method for now.
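
For illustration, a minimal PySpark sketch of the behavior this issue targets. The
name, argument order, and 0-based indexing of get() are assumptions here (modeled on
the Snowflake function linked above), since the function has not been added yet:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

df = spark.sql("SELECT array(1, 2, 3) AS a")

# Today: an out-of-range element_at/array index fails under ANSI mode; the
# suggested workaround is try_element_at with a 1-based index, which returns
# NULL instead of raising the error quoted above.
df.selectExpr("try_element_at(a, 5)").show()

# Proposed: get(a, 4) would simply return NULL for the invalid (0-based) index.
# df.selectExpr("get(a, 4)").show()   # hypothetical until this issue is resolved
{code}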



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40108) JDBC connection to Hive Metastore fails without first calling any .jdbc call

2022-08-16 Thread In-Ho Yi (Jira)
In-Ho Yi created SPARK-40108:


 Summary: JDBC connection to Hive Metastore fails without first 
calling any .jdbc call
 Key: SPARK-40108
 URL: https://issues.apache.org/jira/browse/SPARK-40108
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.3.0
 Environment: PySpark==3.3.0
Java 11
Reporter: In-Ho Yi


Tested on pyspark==3.3.0. When talking to a Hive metastore with a MySQL backend, I 
installed the MySQL driver with spark.jars.packages, alongside other necessary 
settings:

ss = SparkSession.builder.master('local[*]') \
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:3.3.3," +
            "org.apache.hadoop:hadoop-common:3.3.3,mysql:mysql-connector-java:8.0.30") \
    .config("spark.executor.memory", "10g") \
    .config("spark.driver.memory", "10g") \
    .config("spark.memory.offHeap.enabled", "true") \
    .config("spark.memory.offHeap.size", "32g") \
    .config("spark.hadoop.javax.jdo.option.ConnectionURL",
            "jdbc:mysql://localhost:3306/hive") \
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "") \
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "") \
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName",
            "com.mysql.cj.jdbc.Driver") \
    .config("spark.sql.hive.metastore.sharedPrefixes", "com.mysql") \
    .config("spark.sql.warehouse.dir", "s3://-/") \
    .enableHiveSupport() \
    .appName("hms_test").config(conf=conf).getOrCreate()

Now, if I just run ss.sql("SHOW DATABASES;").show(), I get a lot of errors 
saying:

Unable to open a test connection to the given database. JDBC url = 
jdbc:mysql://localhost:3306/hive, username = . Terminating connection pool 
(set lazyInit to true if you expect to start your database after your app). 
Original Exception: --
java.sql.SQLException: No suitable driver found for 
jdbc:mysql://localhost:3306/hive
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:702)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:189)
    at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
    at com.jolbox.bonecp.BoneCP.(BoneCP.java:416)
...

However, if I do any "jdbc" read, even if the call ends up in an error, then 
the call to Hive Metastore seem to succeed without any issue:

try:
    _ = ss.read.format("jdbc") \
        .option("url", "jdbc:mysql://localhost:3306/hive") \
        .option("query", "SHOW TABLES;") \
        .option("driver", "com.mysql.cj.jdbc.Driver").load()
except:
    pass

ss.sql("SHOW DATABASES;").show() # this now works fine.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40063) pyspark.pandas .apply() changing rows ordering

2022-08-16 Thread Marcelo Rossini Castro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580459#comment-17580459
 ] 

Marcelo Rossini Castro edited comment on SPARK-40063 at 8/16/22 8:33 PM:
-

Yes, it doesn't change the other ones.

When I do this on the same column, replacing the data, the order on this column 
changes.
But if I assign the results to a new column, I get the right order, but if I 
drop the old one, I get the same problem again on the new one.

About the compute.ordered_head, I actually tried it, but it didn't help.


was (Author: JIRAUSER294354):
Yes, when I do this on the same column, replacing the data, the order changes.
But if I assign the results to a new column, I get the right order, but if I 
drop the old one, I get the same problem again.

About the compute.ordered_head, I actually tried it, but it didn't help.

> pyspark.pandas .apply() changing rows ordering
> --
>
> Key: SPARK-40063
> URL: https://issues.apache.org/jira/browse/SPARK-40063
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.3.0
> Environment: Databricks Runtime 11.1
>Reporter: Marcelo Rossini Castro
>Priority: Minor
>  Labels: Pandas, PySpark
>
> When using the apply function to apply a function to a DataFrame column, it 
> ends up changing the ordering of the column's rows.
> A command like this:
> {code:java}
> def example_func(df_col):
>   return df_col ** 2 
> df['col_to_apply_function'] = df.apply(lambda row: 
> example_func(row['col_to_apply_function']), axis=1) {code}
> A workaround is to assign the results to a new column instead of the same 
> one, but if the old column is dropped, the same error is produced.
> Setting one column as index also didn't work.
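
A self-contained sketch of the reported behavior and workaround; the data and column
names are illustrative, not taken from the original report:

{code:python}
import pyspark.pandas as ps

psdf = ps.DataFrame({"x": [1, 2, 3, 4]})

# Reported to change the row ordering when the result is written back
# to the same column:
psdf["x"] = psdf.apply(lambda row: row["x"] ** 2, axis=1)

# Reported workaround: assign the result to a new column instead
# (dropping the old column afterwards reportedly brings the problem back).
psdf["x_squared"] = psdf.apply(lambda row: row["x"] ** 2, axis=1)
{code}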



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40063) pyspark.pandas .apply() changing rows ordering

2022-08-16 Thread Marcelo Rossini Castro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580459#comment-17580459
 ] 

Marcelo Rossini Castro commented on SPARK-40063:


Yes, when I do this on the same column, replacing the data, the order changes.
But if I assign the results to a new column, I get the right order, but if I 
drop the old one, I get the same problem again.

About the compute.ordered_head, I actually tried it, but it didn't help.

> pyspark.pandas .apply() changing rows ordering
> --
>
> Key: SPARK-40063
> URL: https://issues.apache.org/jira/browse/SPARK-40063
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.3.0
> Environment: Databricks Runtime 11.1
>Reporter: Marcelo Rossini Castro
>Priority: Minor
>  Labels: Pandas, PySpark
>
> When using the apply function to apply a function to a DataFrame column, it 
> ends up changing the ordering of the column's rows.
> A command like this:
> {code:java}
> def example_func(df_col):
>   return df_col ** 2 
> df['col_to_apply_function'] = df.apply(lambda row: 
> example_func(row['col_to_apply_function']), axis=1) {code}
> A workaround is to assign the results to a new column instead of the same 
> one, but if the old column is dropped, the same error is produced.
> Setting one column as index also didn't work.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36511) Remove ColumnIO once PARQUET-2050 is released in Parquet 1.13

2022-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580448#comment-17580448
 ] 

Apache Spark commented on SPARK-36511:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37529

> Remove ColumnIO once PARQUET-2050 is released in Parquet 1.13
> -
>
> Key: SPARK-36511
> URL: https://issues.apache.org/jira/browse/SPARK-36511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> {{ColumnIO}} doesn't expose methods to get repetition and definition level so 
> Spark has to use a hack. This should be removed once PARQUET-2050 is released.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36511) Remove ColumnIO once PARQUET-2050 is released in Parquet 1.13

2022-08-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-36511.
--
Fix Version/s: 3.4.0
 Assignee: BingKun Pan
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37529

> Remove ColumnIO once PARQUET-2050 is released in Parquet 1.13
> -
>
> Key: SPARK-36511
> URL: https://issues.apache.org/jira/browse/SPARK-36511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> {{ColumnIO}} doesn't expose methods to get repetition and definition level so 
> Spark has to use a hack. This should be removed once PARQUET-2050 is released.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40089:


Assignee: (was: Apache Spark)

> Sorting of at least Decimal(20, 2) fails for some values near the max.
> --
>
> Key: SPARK-40089
> URL: https://issues.apache.org/jira/browse/SPARK-40089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: Robert Joseph Evans
>Priority: Major
> Attachments: input.parquet
>
>
> I have been doing some testing with Decimal values for the RAPIDS Accelerator 
> for Apache Spark. I have been trying to add in new corner cases and when I 
> tried to enable the maximum supported value for a sort I started to get 
> failures.  On closer inspection it looks like the CPU is sorting things 
> incorrectly.  Specifically anything that is "99.50" or above 
> is placed as a chunk in the wrong location in the outputs.
>  In local mode with 12 tasks.
> {code:java}
> spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println)
>  {code}
>  
> Here you will notice that the last entry printed is 
> {{[99.49]}}, and {{[99.99]}} is near the top 
> near {{[-99.99]}}
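
A self-contained sketch that builds Decimal(20, 2) values near the maximum directly,
as an assumed stand-in for the attached input.parquet (which is not reproduced here);
it may or may not trigger the exact misordering reported above:

{code:python}
from decimal import Decimal
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType, StructField, StructType

spark = SparkSession.builder.master("local[12]").getOrCreate()

vals = [Decimal("-999999999999999999.99"),
        Decimal("999999999999999999.49"),
        Decimal("999999999999999999.50"),
        Decimal("999999999999999999.99")]
schema = StructType([StructField("a", DecimalType(20, 2))])
df = spark.createDataFrame([(v,) for v in vals], schema)

# With the reported bug, values at or above ...999.50 can be emitted as a
# block in the wrong position of the ascending sort output.
for row in df.orderBy(col("a")).collect():
    print(row)
{code}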



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40089:


Assignee: Apache Spark

> Sorting of at least Decimal(20, 2) fails for some values near the max.
> --
>
> Key: SPARK-40089
> URL: https://issues.apache.org/jira/browse/SPARK-40089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: Robert Joseph Evans
>Assignee: Apache Spark
>Priority: Major
> Attachments: input.parquet
>
>
> I have been doing some testing with Decimal values for the RAPIDS Accelerator 
> for Apache Spark. I have been trying to add in new corner cases and when I 
> tried to enable the maximum supported value for a sort I started to get 
> failures.  On closer inspection it looks like the CPU is sorting things 
> incorrectly.  Specifically anything that is "99.50" or above 
> is placed as a chunk in the wrong location in the outputs.
>  In local mode with 12 tasks.
> {code:java}
> spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println)
>  {code}
>  
> Here you will notice that the last entry printed is 
> {{[99.49]}}, and {{[99.99]}} is near the top 
> near {{[-99.99]}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40089:


Assignee: Apache Spark

> Sorting of at least Decimal(20, 2) fails for some values near the max.
> --
>
> Key: SPARK-40089
> URL: https://issues.apache.org/jira/browse/SPARK-40089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: Robert Joseph Evans
>Assignee: Apache Spark
>Priority: Major
> Attachments: input.parquet
>
>
> I have been doing some testing with Decimal values for the RAPIDS Accelerator 
> for Apache Spark. I have been trying to add in new corner cases and when I 
> tried to enable the maximum supported value for a sort I started to get 
> failures.  On closer inspection it looks like the CPU is sorting things 
> incorrectly.  Specifically anything that is "99.50" or above 
> is placed as a chunk in the wrong location in the outputs.
>  In local mode with 12 tasks.
> {code:java}
> spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println)
>  {code}
>  
> Here you will notice that the last entry printed is 
> {{[99.49]}}, and {{[99.99]}} is near the top 
> near {{[-99.99]}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.

2022-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580437#comment-17580437
 ] 

Apache Spark commented on SPARK-40089:
--

User 'revans2' has created a pull request for this issue:
https://github.com/apache/spark/pull/37540

> Sorting of at least Decimal(20, 2) fails for some values near the max.
> --
>
> Key: SPARK-40089
> URL: https://issues.apache.org/jira/browse/SPARK-40089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: Robert Joseph Evans
>Priority: Major
> Attachments: input.parquet
>
>
> I have been doing some testing with Decimal values for the RAPIDS Accelerator 
> for Apache Spark. I have been trying to add in new corner cases and when I 
> tried to enable the maximum supported value for a sort I started to get 
> failures.  On closer inspection it looks like the CPU is sorting things 
> incorrectly.  Specifically anything that is "99.50" or above 
> is placed as a chunk in the wrong location in the outputs.
>  In local mode with 12 tasks.
> {code:java}
> spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println)
>  {code}
>  
> Here you will notice that the last entry printed is 
> {{[99.49]}}, and {{[99.99]}} is near the top 
> near {{[-99.99]}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.

2022-08-16 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580434#comment-17580434
 ] 

Robert Joseph Evans commented on SPARK-40089:
-

I put up a PR https://github.com/apache/spark/pull/37540

> Sorting of at least Decimal(20, 2) fails for some values near the max.
> --
>
> Key: SPARK-40089
> URL: https://issues.apache.org/jira/browse/SPARK-40089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: Robert Joseph Evans
>Priority: Major
> Attachments: input.parquet
>
>
> I have been doing some testing with Decimal values for the RAPIDS Accelerator 
> for Apache Spark. I have been trying to add in new corner cases and when I 
> tried to enable the maximum supported value for a sort I started to get 
> failures.  On closer inspection it looks like the CPU is sorting things 
> incorrectly.  Specifically anything that is "99.50" or above 
> is placed as a chunk in the wrong location in the outputs.
>  In local mode with 12 tasks.
> {code:java}
> spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println)
>  {code}
>  
> Here you will notice that the last entry printed is 
> {{[99.49]}}, and {{[99.99]}} is near the top 
> near {{[-99.99]}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]

2022-08-16 Thread Maksim Grinman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580431#comment-17580431
 ] 

Maksim Grinman edited comment on SPARK-38330 at 8/16/22 7:38 PM:
-

I apologize ahead of time, as I did not understand some of the discussion. Is 
there a way to work around this issue while waiting for a version of Spark 
that uses Hadoop 3.3.4 (Spark 3.4?)? It seems like anyone using Spark with AWS 
S3 is stuck on 3.1.2 until then (although AWS EMR has recent releases claiming 
to work with Spark 3.2 somehow).


was (Author: JIRAUSER290629):
I apologize ahead of time, as I did not understand some of the discussion. Is 
there a way to work around this issue while waiting for a version of Spark 
that uses Hadoop 3.3.4 (Spark 3.4?)? 

> Certificate doesn't match any of the subject alternative names: 
> [*.s3.amazonaws.com, s3.amazonaws.com]
> --
>
> Key: SPARK-38330
> URL: https://issues.apache.org/jira/browse/SPARK-38330
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 3.2.1
> Environment: Spark 3.2.1 built with `hadoop-cloud` flag.
> Direct access to s3 using default file committer.
> JDK8.
>  
>Reporter: André F.
>Priority: Major
>
> Trying to run any job after bumping our Spark version from 3.1.2 to 3.2.1 
> led us to the following exception while reading files on S3:
> {code:java}
> org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on 
> s3a:///.parquet: com.amazonaws.SdkClientException: Unable to 
> execute HTTP request: Certificate for  doesn't match 
> any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: 
> Unable to execute HTTP request: Certificate for  doesn't match any of 
> the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208) at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170) at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277) 
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274) 
> at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245) at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596) {code}
>  
> {code:java}
> Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for 
>  doesn't match any of the subject alternative names: 
> [*.s3.amazonaws.com, s3.amazonaws.com]
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:507)
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:437)
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
>   at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
>   at com.amazonaws.http.conn.$Proxy16.connect(Unknown Source)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
>

[jira] [Commented] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]

2022-08-16 Thread Maksim Grinman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580431#comment-17580431
 ] 

Maksim Grinman commented on SPARK-38330:


I apologize ahead of time, as I did not understand some of the discussion. Is 
there a way to work around this issue while waiting for a version of Spark 
that uses Hadoop 3.3.4 (Spark 3.4?)? 

> Certificate doesn't match any of the subject alternative names: 
> [*.s3.amazonaws.com, s3.amazonaws.com]
> --
>
> Key: SPARK-38330
> URL: https://issues.apache.org/jira/browse/SPARK-38330
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 3.2.1
> Environment: Spark 3.2.1 built with `hadoop-cloud` flag.
> Direct access to s3 using default file committer.
> JDK8.
>  
>Reporter: André F.
>Priority: Major
>
> Trying to run any job after bumping our Spark version from 3.1.2 to 3.2.1 
> led us to the following exception while reading files on S3:
> {code:java}
> org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on 
> s3a:///.parquet: com.amazonaws.SdkClientException: Unable to 
> execute HTTP request: Certificate for  doesn't match 
> any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: 
> Unable to execute HTTP request: Certificate for  doesn't match any of 
> the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208) at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170) at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277) 
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274) 
> at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245) at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596) {code}
>  
> {code:java}
> Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for 
>  doesn't match any of the subject alternative names: 
> [*.s3.amazonaws.com, s3.amazonaws.com]
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:507)
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:437)
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
>   at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
>   at com.amazonaws.http.conn.$Proxy16.connect(Unknown Source)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
>   at 
> com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.e

[jira] [Assigned] (SPARK-40107) Pull out empty2null conversion from FileFormatWriter

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40107:


Assignee: (was: Apache Spark)

> Pull out empty2null conversion from FileFormatWriter
> 
>
> Key: SPARK-40107
> URL: https://issues.apache.org/jira/browse/SPARK-40107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Priority: Major
>
> This is a follow-up for SPARK-37287. We can pull the physical projection that 
> converts empty-string partition columns to null out of `FileFormatWriter` and 
> into logical planning as well.
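
For context, a small sketch of the conversion this projection performs: empty-string
partition values are written as the default (null) partition. The output path and
column names are illustrative only:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame([("a", ""), ("b", "p1")], ["value", "part"])
df.write.mode("overwrite").partitionBy("part").parquet("/tmp/empty2null_demo")

# The "" partition value is stored as the default (null) partition,
# i.e. the empty2null projection rewrites "" to null before the write.
spark.read.parquet("/tmp/empty2null_demo").show()
{code}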



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40107) Pull out empty2null conversion from FileFormatWriter

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40107:


Assignee: Apache Spark

> Pull out empty2null conversion from FileFormatWriter
> 
>
> Key: SPARK-40107
> URL: https://issues.apache.org/jira/browse/SPARK-40107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Assignee: Apache Spark
>Priority: Major
>
> This is a follow-up for SPARK-37287. We can pull the physical projection that 
> converts empty-string partition columns to null out of `FileFormatWriter` and 
> into logical planning as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40107) Pull out empty2null conversion from FileFormatWriter

2022-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580426#comment-17580426
 ] 

Apache Spark commented on SPARK-40107:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/37539

> Pull out empty2null conversion from FileFormatWriter
> 
>
> Key: SPARK-40107
> URL: https://issues.apache.org/jira/browse/SPARK-40107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Priority: Major
>
> This is a follow-up for SPARK-37287. We can pull the physical projection that 
> converts empty-string partition columns to null out of `FileFormatWriter` and 
> into logical planning as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40107) Pull out empty2null conversion from FileFormatWriter

2022-08-16 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-40107:
-
Description: This is a follow-up for SPARK-37287. We can pull the physical 
projection that converts empty-string partition columns to null out of 
`FileFormatWriter` and into logical planning as well.  (was: This is a follow-up 
for SPARK-37287. )

> Pull out empty2null conversion from FileFormatWriter
> 
>
> Key: SPARK-40107
> URL: https://issues.apache.org/jira/browse/SPARK-40107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Priority: Major
>
> This is a follow-up for SPARK-37287. We can pull the physical projection that 
> converts empty-string partition columns to null out of `FileFormatWriter` and 
> into logical planning as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40107) Pull out empty2null conversion from FileFormatWriter

2022-08-16 Thread Allison Wang (Jira)
Allison Wang created SPARK-40107:


 Summary: Pull out empty2null conversion from FileFormatWriter
 Key: SPARK-40107
 URL: https://issues.apache.org/jira/browse/SPARK-40107
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Allison Wang


This is a follow-up for SPARK-37287. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.

2022-08-16 Thread Robert Joseph Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated SPARK-40089:

Summary: Sorting of at least Decimal(20, 2) fails for some values near the 
max.  (was: Doring of at least Decimal(20, 2) fails for some values near the 
max.)

> Sorting of at least Decimal(20, 2) fails for some values near the max.
> --
>
> Key: SPARK-40089
> URL: https://issues.apache.org/jira/browse/SPARK-40089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: Robert Joseph Evans
>Priority: Major
> Attachments: input.parquet
>
>
> I have been doing some testing with Decimal values for the RAPIDS Accelerator 
> for Apache Spark. I have been trying to add in new corner cases and when I 
> tried to enable the maximum supported value for a sort I started to get 
> failures.  On closer inspection it looks like the CPU is sorting things 
> incorrectly.  Specifically anything that is "99.50" or above 
> is placed as a chunk in the wrong location in the outputs.
>  In local mode with 12 tasks.
> {code:java}
> spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println)
>  {code}
>  
> Here you will notice that the last entry printed is 
> {{[99.49]}}, and {{[99.99]}} is near the top 
> near {{[-99.99]}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40089) Doring of at least Decimal(20, 2) fails for some values near the max.

2022-08-16 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580377#comment-17580377
 ] 

Robert Joseph Evans commented on SPARK-40089:
-

Never mind, I figured out that there is a separate prefixComparator that does 
the same kinds of checks. But I have a fix that works, so I will put up a PR 
shortly.

> Doring of at least Decimal(20, 2) fails for some values near the max.
> -
>
> Key: SPARK-40089
> URL: https://issues.apache.org/jira/browse/SPARK-40089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: Robert Joseph Evans
>Priority: Major
> Attachments: input.parquet
>
>
> I have been doing some testing with Decimal values for the RAPIDS Accelerator 
> for Apache Spark. I have been trying to add in new corner cases and when I 
> tried to enable the maximum supported value for a sort I started to get 
> failures.  On closer inspection it looks like the CPU is sorting things 
> incorrectly.  Specifically anything that is "99.50" or above 
> is placed as a chunk in the wrong location in the outputs.
>  In local mode with 12 tasks.
> {code:java}
> spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println)
>  {code}
>  
> Here you will notice that the last entry printed is 
> {{[99.49]}}, and {{[99.99]}} is near the top 
> near {{[-99.99]}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37442) In AQE, wrong InMemoryRelation size estimation causes "Cannot broadcast the table that is larger than 8GB: 8 GB" failure

2022-08-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580369#comment-17580369
 ] 

Dongjoon Hyun commented on SPARK-37442:
---

Hi, [~irelandbird]. Apache Spark 2.4 and 3.0 are End-Of-Life releases. Please 
try the latest Apache Spark version, such as 3.3.0.

> In AQE, wrong InMemoryRelation size estimation causes "Cannot broadcast the 
> table that is larger than 8GB: 8 GB" failure
> 
>
> Key: SPARK-37442
> URL: https://issues.apache.org/jira/browse/SPARK-37442
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer, SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Michael Chen
>Assignee: Michael Chen
>Priority: Major
> Fix For: 3.2.1, 3.3.0
>
>
> There is a period in time where an InMemoryRelation will have the cached 
> buffers loaded, but the statistics will be inaccurate (anywhere between 0 -> 
> size in bytes reported by accumulators). When AQE is enabled, it is possible 
> that join planning strategies will happen in this window. In this scenario, 
> join children sizes including InMemoryRelation are greatly underestimated and 
> a broadcast join can be planned when it shouldn't be. We have seen scenarios 
> where a broadcast join is planned with the builder size greater than 8GB 
> because at planning time, the optimizer believes the InMemoryRelation is 0 
> bytes.
> Here is an example test case where the broadcast threshold is being ignored. 
> It can mimic the 8GB error by increasing the size of the tables.
> {code:java}
> withSQLConf(
>   SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
>   SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "1048584") {
>   // Spark estimates a string column as 20 bytes so with 60k rows, these 
> relations should be
>   // estimated at ~120m bytes which is greater than the broadcast join 
> threshold
>   Seq.fill(6)("a").toDF("key")
> .createOrReplaceTempView("temp")
>   Seq.fill(6)("b").toDF("key")
> .createOrReplaceTempView("temp2")
>   Seq("a").toDF("key").createOrReplaceTempView("smallTemp")
>   spark.sql("SELECT key as newKey FROM temp").persist()
>   val query =
>   s"""
>  |SELECT t3.newKey
>  |FROM
>  |  (SELECT t1.newKey
>  |  FROM (SELECT key as newKey FROM temp) as t1
>  |JOIN
>  |(SELECT key FROM smallTemp) as t2
>  |ON t1.newKey = t2.key
>  |  ) as t3
>  |  JOIN
>  |  (SELECT key FROM temp2) as t4
>  |  ON t3.newKey = t4.key
>  |UNION
>  |SELECT t1.newKey
>  |FROM
>  |(SELECT key as newKey FROM temp) as t1
>  |JOIN
>  |(SELECT key FROM temp2) as t2
>  |ON t1.newKey = t2.key
>  |""".stripMargin
>   val df = spark.sql(query)
>   df.collect()
>   val adaptivePlan = df.queryExecution.executedPlan
>   val bhj = findTopLevelBroadcastHashJoin(adaptivePlan)
>   assert(bhj.length == 1) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40089) Doring of at least Decimal(20, 2) fails for some values near the max.

2022-08-16 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580360#comment-17580360
 ] 

Robert Joseph Evans commented on SPARK-40089:
-

I have been trying to come up with a patch, but keep hitting some issues. I 
first tried to change 

{code}
 case dt: DecimalType if dt.precision - dt.scale <= Decimal.MAX_LONG_DIGITS =>
{code}

to

{code}
 case dt: DecimalType if dt.precision - dt.scale < Decimal.MAX_LONG_DIGITS =>
{code}

So that we would bypass the overflow case entirely and use the Double prefix 
logic. But when I do that, the negative values all come after the positive 
values when sorting ascending. So now I have a lot of other tests/debugging 
to run to understand what is happening there, because I think I have found 
another bug. 

[~ulysses] I don't have a ton of time that I can devote to this right now; I 
will keep working towards a patch, but if you want to put one up, I would 
love to see it. 

> Doring of at least Decimal(20, 2) fails for some values near the max.
> -
>
> Key: SPARK-40089
> URL: https://issues.apache.org/jira/browse/SPARK-40089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: Robert Joseph Evans
>Priority: Major
> Attachments: input.parquet
>
>
> I have been doing some testing with Decimal values for the RAPIDS Accelerator 
> for Apache Spark. I have been trying to add in new corner cases and when I 
> tried to enable the maximum supported value for a sort I started to get 
> failures.  On closer inspection it looks like the CPU is sorting things 
> incorrectly.  Specifically anything that is "99.50" or above 
> is placed as a chunk in the wrong location in the outputs.
>  In local mode with 12 tasks.
> {code:java}
> spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println)
>  {code}
>  
> Here you will notice that the last entry printed is 
> {{[99.49]}}, and {{[99.99]}} is near the top 
> near {{[-99.99]}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40106) Task failure handlers should always run if the task failed

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40106:


Assignee: (was: Apache Spark)

> Task failure handlers should always run if the task failed
> --
>
> Key: SPARK-40106
> URL: https://issues.apache.org/jira/browse/SPARK-40106
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ryan Johnson
>Priority: Major
>
> Today, if a task body succeeds, but a task completion listener fails, task 
> failure listeners are not called -- even though the task has indeed failed at 
> that point.
> If a completion listener fails, and failure listeners were not previously 
> invoked, we should invoke them before running the remaining completion 
> listeners.
> Such a change would increase the utility of task listeners, especially ones 
> intended to assist with task cleanup. 
> To give one arbitrary example, code like this appears at several places in 
> the code (taken from {{executeTask}} method of FileFormatWriter.scala):
> {code:java}
>     try {
>       Utils.tryWithSafeFinallyAndFailureCallbacks(block = {
>         // Execute the task to write rows out and commit the task.
>         dataWriter.writeWithIterator(iterator)
>         dataWriter.commit()
>       })(catchBlock = {
>         // If there is an error, abort the task
>     dataWriter.abort()
>         logError(s"Job $jobId aborted.")
>       }, finallyBlock = {
>         dataWriter.close()
>       })
>     } catch {
>       case e: FetchFailedException =>
>         throw e
>       case f: FileAlreadyExistsException if 
> SQLConf.get.fastFailFileFormatOutput =>
>         // If any output file to write already exists, it does not make sense 
> to re-run this task.
>         // We throw the exception and let Executor throw ExceptionFailure to 
> abort the job.
>         throw new TaskOutputFileAlreadyExistException(f)
>       case t: Throwable =>
>         throw QueryExecutionErrors.taskFailedWhileWritingRowsError(t)
>     }{code}
> If failure listeners were reliably called, the above idiom could potentially 
> be factored out as two failure listeners plus a completion listener, and 
> reused rather than duplicated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40106) Task failure handlers should always run if the task failed

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40106:


Assignee: Apache Spark

> Task failure handlers should always run if the task failed
> --
>
> Key: SPARK-40106
> URL: https://issues.apache.org/jira/browse/SPARK-40106
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ryan Johnson
>Assignee: Apache Spark
>Priority: Major
>
> Today, if a task body succeeds, but a task completion listener fails, task 
> failure listeners are not called -- even though the task has indeed failed at 
> that point.
> If a completion listener fails, and failure listeners were not previously 
> invoked, we should invoke them before running the remaining completion 
> listeners.
> Such a change would increase the utility of task listeners, especially ones 
> intended to assist with task cleanup. 
> To give one arbitrary example, code like this appears at several places in 
> the code (taken from {{executeTask}} method of FileFormatWriter.scala):
> {code:java}
>     try {
>       Utils.tryWithSafeFinallyAndFailureCallbacks(block = {
>         // Execute the task to write rows out and commit the task.
>         dataWriter.writeWithIterator(iterator)
>         dataWriter.commit()
>       })(catchBlock = {
>         // If there is an error, abort the task
>     dataWriter.abort()
>         logError(s"Job $jobId aborted.")
>       }, finallyBlock = {
>         dataWriter.close()
>       })
>     } catch {
>       case e: FetchFailedException =>
>         throw e
>       case f: FileAlreadyExistsException if 
> SQLConf.get.fastFailFileFormatOutput =>
>         // If any output file to write already exists, it does not make sense 
> to re-run this task.
>         // We throw the exception and let Executor throw ExceptionFailure to 
> abort the job.
>         throw new TaskOutputFileAlreadyExistException(f)
>       case t: Throwable =>
>         throw QueryExecutionErrors.taskFailedWhileWritingRowsError(t)
>     }{code}
> If failure listeners were reliably called, the above idiom could potentially 
> be factored out as two failure listeners plus a completion listener, and 
> reused rather than duplicated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40106) Task failure handlers should always run if the task failed

2022-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580339#comment-17580339
 ] 

Apache Spark commented on SPARK-40106:
--

User 'ryan-johnson-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/37531

> Task failure handlers should always run if the task failed
> --
>
> Key: SPARK-40106
> URL: https://issues.apache.org/jira/browse/SPARK-40106
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ryan Johnson
>Priority: Major
>
> Today, if a task body succeeds, but a task completion listener fails, task 
> failure listeners are not called -- even though the task has indeed failed at 
> that point.
> If a completion listener fails, and failure listeners were not previously 
> invoked, we should invoke them before running the remaining completion 
> listeners.
> Such a change would increase the utility of task listeners, especially ones 
> intended to assist with task cleanup. 
> To give one arbitrary example, code like this appears at several places in 
> the code (taken from {{executeTask}} method of FileFormatWriter.scala):
> {code:java}
>     try {
>       Utils.tryWithSafeFinallyAndFailureCallbacks(block = {
>         // Execute the task to write rows out and commit the task.
>         dataWriter.writeWithIterator(iterator)
>         dataWriter.commit()
>       })(catchBlock = {
>         // If there is an error, abort the task
>     dataWriter.abort()
>         logError(s"Job $jobId aborted.")
>       }, finallyBlock = {
>         dataWriter.close()
>       })
>     } catch {
>       case e: FetchFailedException =>
>         throw e
>       case f: FileAlreadyExistsException if 
> SQLConf.get.fastFailFileFormatOutput =>
>         // If any output file to write already exists, it does not make sense 
> to re-run this task.
>         // We throw the exception and let Executor throw ExceptionFailure to 
> abort the job.
>         throw new TaskOutputFileAlreadyExistException(f)
>       case t: Throwable =>
>         throw QueryExecutionErrors.taskFailedWhileWritingRowsError(t)
>     }{code}
> If failure listeners were reliably called, the above idiom could potentially 
> be factored out as two failure listeners plus a completion listener, and 
> reused rather than duplicated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40106) Task failure handlers should always run if the task failed

2022-08-16 Thread Ryan Johnson (Jira)
Ryan Johnson created SPARK-40106:


 Summary: Task failure handlers should always run if the task failed
 Key: SPARK-40106
 URL: https://issues.apache.org/jira/browse/SPARK-40106
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Ryan Johnson


Today, if a task body succeeds, but a task completion listener fails, task 
failure listeners are not called -- even though the task has indeed failed at that 
point.

If a completion listener fails, and failure listeners were not previously 
invoked, we should invoke them before running the remaining completion 
listeners.

Such a change would increase the utility of task listeners, especially ones 
intended to assist with task cleanup. 

To give one arbitrary example, code like this appears at several places in the 
code (taken from {{executeTask}} method of FileFormatWriter.scala):
{code:java}
    try {
      Utils.tryWithSafeFinallyAndFailureCallbacks(block = {
        // Execute the task to write rows out and commit the task.
        dataWriter.writeWithIterator(iterator)
        dataWriter.commit()
      })(catchBlock = {
        // If there is an error, abort the task
        dataWriter.abort()
        logError(s"Job $jobId aborted.")
      }, finallyBlock = {
        dataWriter.close()
      })
    } catch {
      case e: FetchFailedException =>
        throw e
      case f: FileAlreadyExistsException if 
SQLConf.get.fastFailFileFormatOutput =>
        // If any output file to write already exists, it does not make sense 
to re-run this task.
        // We throw the exception and let Executor throw ExceptionFailure to 
abort the job.
        throw new TaskOutputFileAlreadyExistException(f)
      case t: Throwable =>
        throw QueryExecutionErrors.taskFailedWhileWritingRowsError(t)
    }{code}
If failure listeners were reliably called, the above idiom could potentially be 
factored out as two failure listeners plus a completion listener, and reused 
rather than duplicated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40102) Use SparkException instead of IllegalStateException in SparkPlan

2022-08-16 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-40102:


Assignee: Yi kaifei

> Use SparkException instead of IllegalStateException in SparkPlan
> 
>
> Key: SPARK-40102
> URL: https://issues.apache.org/jira/browse/SPARK-40102
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yi kaifei
>Assignee: Yi kaifei
>Priority: Major
> Fix For: 3.4.0
>
>
> This pr aims to use SparkException instead of IllegalStateException in 
> SparkPlan, for details, see: https://github.com/apache/spark/pull/37524



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40102) Use SparkException instead of IllegalStateException in SparkPlan

2022-08-16 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40102.
--
Resolution: Fixed

Issue resolved by pull request 37535
[https://github.com/apache/spark/pull/37535]

> Use SparkException instead of IllegalStateException in SparkPlan
> 
>
> Key: SPARK-40102
> URL: https://issues.apache.org/jira/browse/SPARK-40102
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yi kaifei
>Assignee: Yi kaifei
>Priority: Major
> Fix For: 3.4.0
>
>
> This pr aims to use SparkException instead of IllegalStateException in 
> SparkPlan, for details, see: https://github.com/apache/spark/pull/37524



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40102) Use SparkException instead of IllegalStateException in SparkPlan

2022-08-16 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40102:
-
Parent: SPARK-37935
Issue Type: Sub-task  (was: Improvement)

> Use SparkException instead of IllegalStateException in SparkPlan
> 
>
> Key: SPARK-40102
> URL: https://issues.apache.org/jira/browse/SPARK-40102
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yi kaifei
>Priority: Major
> Fix For: 3.4.0
>
>
> This pr aims to use SparkException instead of IllegalStateException in 
> SparkPlan, for details, see: https://github.com/apache/spark/pull/37524



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40102) Use SparkException instead of IllegalStateException in SparkPlan

2022-08-16 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580326#comment-17580326
 ] 

Max Gekk commented on SPARK-40102:
--

[~kaifeiYi] Please, open a sub-task of 
https://issues.apache.org/jira/browse/SPARK-37935 next time.

> Use SparkException instead of IllegalStateException in SparkPlan
> 
>
> Key: SPARK-40102
> URL: https://issues.apache.org/jira/browse/SPARK-40102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yi kaifei
>Priority: Major
> Fix For: 3.4.0
>
>
> This pr aims to use SparkException instead of IllegalStateException in 
> SparkPlan, for details, see: https://github.com/apache/spark/pull/37524



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40036) LevelDB/RocksDBIterator.next should return false after iterator or db close

2022-08-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40036.
--
Fix Version/s: 3.4.0
 Assignee: Yang Jie
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37471

> LevelDB/RocksDBIterator.next should return false after iterator or db close
> ---
>
> Key: SPARK-40036
> URL: https://issues.apache.org/jira/browse/SPARK-40036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> {code:java}
> @Test
> public void testHasNextAndNextAfterIteratorClose() throws Exception {
>   String prefix = "test_db_iter_close.";
>   String suffix = ".ldb";
>   File path = File.createTempFile(prefix, suffix);
>   path.delete();
>   LevelDB db = new LevelDB(path);
>   // Write one record for the test
>   db.write(createCustomType1(0));
>   KVStoreIterator<CustomType1> iter =
>     db.view(CustomType1.class).closeableIterator();
>   // iter.hasNext() should return true before the iterator is closed
>   assertTrue(iter.hasNext());
>   // close iter
>   iter.close();
>   // iter.hasNext should be false after iter close
>   assertFalse(iter.hasNext());
>   // iter.next should throw NoSuchElementException after iter close
>   assertThrows(NoSuchElementException.class, iter::next);
>   db.close();
>   assertTrue(path.exists());
>   FileUtils.deleteQuietly(path);
>   assertFalse(path.exists());
> }
> @Test
> public void testHasNextAndNextAfterDBClose() throws Exception {
>   String prefix = "test_db_db_close.";
>   String suffix = ".ldb";
>   File path = File.createTempFile(prefix, suffix);
>   path.delete();
>   LevelDB db = new LevelDB(path);
>   // Write one record for test
>   db.write(createCustomType1(0));
>   KVStoreIterator<CustomType1> iter =
>     db.view(CustomType1.class).closeableIterator();
>   // iter.hasNext() should return true before the DB is closed
>   assertTrue(iter.hasNext());
>   // close db
>   db.close();
>   // iter.hasNext should be false after db close
>   assertFalse(iter.hasNext());
>   // iter.next should throw NoSuchElementException after db close
>   assertThrows(NoSuchElementException.class, iter::next);
>   assertTrue(path.exists());
>   FileUtils.deleteQuietly(path);
>   assertFalse(path.exists());
> } {code}
>  
> In both cases above, after the iterator or the DB has been closed, `hasNext` 
> still returns true and `next` still returns values that were not consumed 
> before the close.
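A minimal sketch of the guard such a fix implies; the wrapper class below is hypothetical and only illustrates the closed-flag pattern, not the actual LevelDBIterator/RocksDBIterator code:

{code:java}
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical wrapper showing the closed-flag guard: once close() is called,
// hasNext() reports exhaustion and next() throws, even if the underlying
// iterator still has buffered elements.
final class ClosableIterator<T> implements Iterator<T>, AutoCloseable {
  private final Iterator<T> delegate;
  private volatile boolean closed = false;

  ClosableIterator(Iterator<T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public boolean hasNext() {
    return !closed && delegate.hasNext();
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return delegate.next();
  }

  @Override
  public void close() {
    closed = true; // the real iterators also release the underlying DB resources here
  }
}
{code}

With a guard like this, hasNext() returns false and next() throws NoSuchElementException once the iterator has been closed; covering the DB-close case additionally requires the store to close its outstanding iterators, which the actual change can do by tracking them.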



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40042) Make pyspark.sql.streaming.query examples self-contained

2022-08-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40042.
--
Fix Version/s: 3.4.0
 Assignee: Qian Sun
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37482

> Make pyspark.sql.streaming.query examples self-contained
> 
>
> Key: SPARK-40042
> URL: https://issues.apache.org/jira/browse/SPARK-40042
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Assignee: Qian Sun
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40042) Make pyspark.sql.streaming.query examples self-contained

2022-08-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40042:
-
Priority: Minor  (was: Major)

> Make pyspark.sql.streaming.query examples self-contained
> 
>
> Key: SPARK-40042
> URL: https://issues.apache.org/jira/browse/SPARK-40042
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37980) Extend METADATA column to support row indices for file based data sources

2022-08-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37980:
---

Assignee: Ala Luszczak

> Extend METADATA column to support row indices for file based data sources
> -
>
> Key: SPARK-37980
> URL: https://issues.apache.org/jira/browse/SPARK-37980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Prakhar Jain
>Assignee: Ala Luszczak
>Priority: Major
> Fix For: 3.4.0
>
>
> Spark recently added hidden metadata column support for file-based data 
> sources as part of SPARK-37273.
> We should extend it to also support ROW_INDEX/ROW_POSITION.
>  
> Meaning of ROW_POSITION:
> ROW_INDEX/ROW_POSITION is the index of a row within its file, e.g. the 5th 
> row in a file has ROW_INDEX 5.
>  
> Use cases: 
> Row indexes can be used in a variety of ways. A (fileName, rowIndex) tuple 
> uniquely identifies a row in a table, so it can be used to mark rows, for 
> example by an indexer.
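As a hedged illustration of how such a field might be consumed once exposed, the snippet below selects the existing hidden _metadata column; the row_index field name is an assumption mirroring the ROW_INDEX described above, not a confirmed API:

{code:java}
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RowIndexExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("row-index-example").getOrCreate();

    // _metadata.file_name already exists via SPARK-37273; _metadata.row_index is
    // the assumed name for the per-file row position this ticket asks for.
    Dataset<Row> df = spark.read().parquet("/tmp/example_table");
    df.select(col("_metadata.file_name"), col("_metadata.row_index")).show(false);

    spark.stop();
  }
}
{code}

Together, (file_name, row_index) then plays the role of the (fileName, rowIndex) tuple mentioned in the use cases.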



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37980) Extend METADATA column to support row indices for file based data sources

2022-08-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37980.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37228
[https://github.com/apache/spark/pull/37228]

> Extend METADATA column to support row indices for file based data sources
> -
>
> Key: SPARK-37980
> URL: https://issues.apache.org/jira/browse/SPARK-37980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Prakhar Jain
>Priority: Major
> Fix For: 3.4.0
>
>
> Spark recently added hidden metadata column support for file-based data 
> sources as part of SPARK-37273.
> We should extend it to also support ROW_INDEX/ROW_POSITION.
>  
> Meaning of ROW_POSITION:
> ROW_INDEX/ROW_POSITION is the index of a row within its file, e.g. the 5th 
> row in a file has ROW_INDEX 5.
>  
> Use cases: 
> Row indexes can be used in a variety of ways. A (fileName, rowIndex) tuple 
> uniquely identifies a row in a table, so it can be used to mark rows, for 
> example by an indexer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40105) Improve repartition in ReplaceCTERefWithRepartition

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40105:


Assignee: Apache Spark

> Improve repartition in ReplaceCTERefWithRepartition
> ---
>
> Key: SPARK-40105
> URL: https://issues.apache.org/jira/browse/SPARK-40105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Minor
>
> If a CTE cannot be inlined, ReplaceCTERefWithRepartition adds a repartition 
> to force a shuffle so that the references can reuse the shuffle exchange.
> The added repartition should be optimizable by AQE for better performance.
> If the user has already specified a rebalance, ReplaceCTERefWithRepartition 
> should skip adding the repartition.
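To make the two behaviours concrete, a small hedged example (table and query invented for illustration) of a CTE that is referenced twice and whose body already carries a user-specified REBALANCE hint:

{code:java}
import org.apache.spark.sql.SparkSession;

public class CteRebalanceExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("cte-rebalance-example").getOrCreate();

    spark.range(0, 1000).createOrReplaceTempView("t");

    // The CTE is referenced twice, so it may stay out-of-line; in that case the
    // optimizer normally inserts a repartition so both references can share the
    // shuffle. The REBALANCE hint inside the CTE is the "user has specified a
    // rebalance" case that the ticket says should skip the extra repartition.
    spark.sql(
        "WITH v AS (SELECT /*+ REBALANCE */ id % 10 AS k, id FROM t) " +
        "SELECT a.k, count(*) AS cnt FROM v a JOIN v b ON a.k = b.k GROUP BY a.k"
    ).explain();

    spark.stop();
  }
}
{code}

Comparing the explain output before and after the proposed change would show whether an extra repartition is still inserted on top of the user's rebalance.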



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40105) Improve repartition in ReplaceCTERefWithRepartition

2022-08-16 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-40105:
--
Description: 
If a CTE cannot be inlined, ReplaceCTERefWithRepartition adds a repartition 
to force a shuffle so that the references can reuse the shuffle exchange.
The added repartition should be optimizable by AQE for better performance.

If the user has already specified a rebalance, ReplaceCTERefWithRepartition should 
skip adding the repartition.

  was:
If a CTE cannot be inlined, ReplaceCTERefWithRepartition adds a repartition 
to force a shuffle so that the reference can reuse the shuffle exchange. It cannot 
be optimized by AQE since the repartition has a fixed shuffle partition number. 

 


> Improve repartition in ReplaceCTERefWithRepartition
> ---
>
> Key: SPARK-40105
> URL: https://issues.apache.org/jira/browse/SPARK-40105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Minor
>
> If a CTE cannot be inlined, ReplaceCTERefWithRepartition adds a repartition 
> to force a shuffle so that the references can reuse the shuffle exchange.
> The added repartition should be optimizable by AQE for better performance.
> If the user has already specified a rebalance, ReplaceCTERefWithRepartition 
> should skip adding the repartition.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40105) Improve repartition in ReplaceCTERefWithRepartition

2022-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40105:


Assignee: (was: Apache Spark)

> Improve repartition in ReplaceCTERefWithRepartition
> ---
>
> Key: SPARK-40105
> URL: https://issues.apache.org/jira/browse/SPARK-40105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Minor
>
> If a CTE cannot be inlined, ReplaceCTERefWithRepartition adds a repartition 
> to force a shuffle so that the references can reuse the shuffle exchange.
> The added repartition should be optimizable by AQE for better performance.
> If the user has already specified a rebalance, ReplaceCTERefWithRepartition 
> should skip adding the repartition.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40105) Improve repartition in ReplaceCTERefWithRepartition

2022-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580279#comment-17580279
 ] 

Apache Spark commented on SPARK-40105:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/37537

> Improve repartition in ReplaceCTERefWithRepartition
> ---
>
> Key: SPARK-40105
> URL: https://issues.apache.org/jira/browse/SPARK-40105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Minor
>
> If a CTE cannot be inlined, ReplaceCTERefWithRepartition adds a repartition 
> to force a shuffle so that the references can reuse the shuffle exchange.
> The added repartition should be optimizable by AQE for better performance.
> If the user has already specified a rebalance, ReplaceCTERefWithRepartition 
> should skip adding the repartition.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


