[jira] [Commented] (SPARK-40103) Support read/write.csv() in SparkR
[ https://issues.apache.org/jira/browse/SPARK-40103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580623#comment-17580623 ] deshanxiao commented on SPARK-40103: Yes, read.csv and read.csv2 have been used in the R utils package. > Support read/write.csv() in SparkR > -- > > Key: SPARK-40103 > URL: https://issues.apache.org/jira/browse/SPARK-40103 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 3.3.0 >Reporter: deshanxiao >Priority: Major > > Today, almost all languages support the DataFrameReader.csv API; only R is > missing. Currently we need to use df.read() to read a CSV file, so a more > high-level API is needed. > Java: > [DataFrameReader.csv()|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html] > Scala: > [DataFrameReader.csv()|https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameReader.html#csv(paths:String*):org.apache.spark.sql.DataFrame] > Python: > [DataFrameReader.csv()|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.csv.html#pyspark.sql.DataFrameReader.csv] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
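What the ticket asks for is a thin format-specific wrapper over the generic reader, the same pattern DataFrameReader.csv follows in the other languages. A minimal, self-contained Python sketch of that delegation pattern (the `Reader` class here is a hypothetical stand-in, not SparkR or PySpark code):

```python
# Hypothetical sketch: a convenience csv() method that fixes the format,
# maps keyword arguments to reader options, and delegates to a generic load().
class Reader:
    def __init__(self):
        self._format = None
        self._options = {}

    def format(self, fmt):
        self._format = fmt
        return self

    def option(self, key, value):
        self._options[key] = value
        return self

    def load(self, path):
        # A real implementation would return a DataFrame; here we just
        # report what would be read so the delegation is visible.
        return (self._format, path, dict(self._options))

    def csv(self, path, header=False):
        # The high-level API the ticket wants: callers write reader.csv(path)
        # instead of reader.format("csv").load(path).
        return self.option("header", header).format("csv").load(path)
```

The design point is that the convenience method adds no new capability; it only packages the common format("csv")/option/load sequence behind one call.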
[jira] [Updated] (SPARK-40116) Remove Arrow in AppVeyor for now
[ https://issues.apache.org/jira/browse/SPARK-40116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40116: - Summary: Remove Arrow in AppVeyor for now (was: Pin Arrow version to 8.0.0 in AppVeyor) > Remove Arrow in AppVeyor for now > > > Key: SPARK-40116 > URL: https://issues.apache.org/jira/browse/SPARK-40116 > Project: Spark > Issue Type: Test > Components: Project Infra, SparkR >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > SparkR does not support Arrow 9.0.0 (SPARK-40114) so the tests currently > fail, https://ci.appveyor.com/project/HyukjinKwon/spark/builds/44490387.
[jira] [Resolved] (SPARK-40116) Pin Arrow version to 8.0.0 in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-40116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40116. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37546 [https://github.com/apache/spark/pull/37546] > Pin Arrow version to 8.0.0 in AppVeyor > -- > > Key: SPARK-40116 > URL: https://issues.apache.org/jira/browse/SPARK-40116 > Project: Spark > Issue Type: Test > Components: Project Infra, SparkR >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > SparkR does not support Arrow 9.0.0 (SPARK-40114) so the tests currently > fail, https://ci.appveyor.com/project/HyukjinKwon/spark/builds/44490387.
[jira] [Resolved] (SPARK-40117) Convert condition to java in DataFrameWriterV2.overwrite
[ https://issues.apache.org/jira/browse/SPARK-40117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40117. -- Fix Version/s: 3.3.1 3.1.4 3.2.3 3.4.0 Resolution: Fixed Issue resolved by pull request 37547 [https://github.com/apache/spark/pull/37547] > Convert condition to java in DataFrameWriterV2.overwrite > > > Key: SPARK-40117 > URL: https://issues.apache.org/jira/browse/SPARK-40117 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.3, 3.3.0, 3.2.2 >Reporter: Wenli Looi >Assignee: Wenli Looi >Priority: Major > Fix For: 3.3.1, 3.1.4, 3.2.3, 3.4.0 > > > DataFrameWriterV2.overwrite() fails to convert the condition parameter to > java. This prevents the function from being called. > It is caused by the following commit that deleted the `_to_java_column` call > instead of fixing it: > [https://github.com/apache/spark/commit/a1e459ed9f6777fb8d5a2d09fda666402f9230b9]
[jira] [Assigned] (SPARK-40117) Convert condition to java in DataFrameWriterV2.overwrite
[ https://issues.apache.org/jira/browse/SPARK-40117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40117: Assignee: Wenli Looi > Convert condition to java in DataFrameWriterV2.overwrite > > > Key: SPARK-40117 > URL: https://issues.apache.org/jira/browse/SPARK-40117 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.3, 3.3.0, 3.2.2 >Reporter: Wenli Looi >Assignee: Wenli Looi >Priority: Major > > DataFrameWriterV2.overwrite() fails to convert the condition parameter to > java. This prevents the function from being called. > It is caused by the following commit that deleted the `_to_java_column` call > instead of fixing it: > [https://github.com/apache/spark/commit/a1e459ed9f6777fb8d5a2d09fda666402f9230b9]
[jira] [Assigned] (SPARK-40116) Pin Arrow version to 8.0.0 in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-40116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40116: Assignee: Hyukjin Kwon > Pin Arrow version to 8.0.0 in AppVeyor > -- > > Key: SPARK-40116 > URL: https://issues.apache.org/jira/browse/SPARK-40116 > Project: Spark > Issue Type: Test > Components: Project Infra, SparkR >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > SparkR does not support Arrow 9.0.0 (SPARK-40114) so the tests currently > fail, https://ci.appveyor.com/project/HyukjinKwon/spark/builds/44490387.
[jira] [Updated] (SPARK-40113) Refactor ParquetScanBuilder DataSourceV2 interface implementation
[ https://issues.apache.org/jira/browse/SPARK-40113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mars updated SPARK-40113: - Summary: Refactor ParquetScanBuilder DataSourceV2 interface implementation (was: Unify ParquetScanBuilder DataSourceV2 interface implementation) > Refactor ParquetScanBuilder DataSourceV2 interface implementation > > > Key: SPARK-40113 > URL: https://issues.apache.org/jira/browse/SPARK-40113 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.3.0 >Reporter: Mars >Priority: Minor > > Currently the `FileScanBuilder` interface is not fully implemented in > `ParquetScanBuilder`, unlike `OrcScanBuilder`, `AvroScanBuilder`, and > `CSVScanBuilder`. To unify the logic of the code and make it clearer, this > part of the implementation is unified.
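To illustrate the kind of unification the ticket describes: shared pushdown bookkeeping lives once in the base builder, and each format subclass supplies only its format-specific `build()`. This is a hypothetical Python sketch of the pattern, not the actual Scala `FileScanBuilder` code:

```python
# Sketch of the unified-interface pattern: common state handling in the base
# class, format-specific construction in each subclass.
class FileScanBuilder:
    def __init__(self):
        self.pushed_filters = []

    def push_filters(self, filters):
        # Shared logic: every format builder records pushed filters
        # the same way, so it lives here rather than in each subclass.
        self.pushed_filters.extend(filters)
        return self

    def build(self):
        raise NotImplementedError

class ParquetScanBuilder(FileScanBuilder):
    def build(self):
        # Only the format-specific part remains in the subclass.
        return ("parquet", list(self.pushed_filters))

class CSVScanBuilder(FileScanBuilder):
    def build(self):
        return ("csv", list(self.pushed_filters))
```

With the shared logic hoisted into the base class, `ParquetScanBuilder` follows the same shape as the Orc, Avro, and CSV builders instead of carrying its own copy of the bookkeeping.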
[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580598#comment-17580598 ] Vivek Garg edited comment on SPARK-22588 at 8/17/22 5:53 AM: - According to the [Salesforce CPQ Certification|https://www.igmguru.com/salesforce/salesforce-cpq-training/] Exam, our Salesforce CPQ Certification Training program has been created. The core abilities needed for effectively implementing Salesforce CPQ solutions are developed in this course on Salesforce CPQ. Through instruction using practical examples, this course will go deeper into developing a quoting process, pricing strategies, configuration, CPQ object data model, and more. This online Salesforce CPQ training course includes practical projects that will aid you in passing the Salesforce CPQ Certification test. was (Author: JIRAUSER294516): [salesforce cpq certification|https://www.igmguru.com/salesforce/salesforce-cpq-training/] > SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values > - > > Key: SPARK-22588 > URL: https://issues.apache.org/jira/browse/SPARK-22588 > Project: Spark > Issue Type: Question > Components: Deploy >Affects Versions: 2.1.1 >Reporter: Saanvi Sharma >Priority: Minor > Labels: dynamodb, spark > Original Estimate: 24h > Remaining Estimate: 24h > > I am using Spark 2.1 on EMR and I have a dataframe like this: > ClientNum | Value_1 | Value_2 | Value_3 | Value_4 > 14 |A |B| C | null > 19 |X |Y| null| null > 21 |R | null | null| null > I want to load data into a DynamoDB table with ClientNum as key, following: > Analyze Your Data on Amazon DynamoDB with Apache Spark > Using Spark SQL for ETL > here is the code that I tried: > var jobConf = new JobConf(sc.hadoopConfiguration) > jobConf.set("dynamodb.servicename", "dynamodb") > jobConf.set("dynamodb.input.tableName", "table_name") > jobConf.set("dynamodb.output.tableName", "table_name") > jobConf.set("dynamodb.endpoint", 
"dynamodb.eu-west-1.amazonaws.com") > jobConf.set("dynamodb.regionid", "eu-west-1") > jobConf.set("dynamodb.throughput.read", "1") > jobConf.set("dynamodb.throughput.read.percent", "1") > jobConf.set("dynamodb.throughput.write", "1") > jobConf.set("dynamodb.throughput.write.percent", "1") > > jobConf.set("mapred.output.format.class", > "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat") > jobConf.set("mapred.input.format.class", > "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat") > #Import Data > val df = > sqlContext.read.format("com.databricks.spark.csv").option("header", > "true").option("inferSchema", "true").load(path) > I performed a transformation to have an RDD that matches the types that the > DynamoDB custom output format knows how to write. The custom output format > expects a tuple containing the Text and DynamoDBItemWritable types. > Create a new RDD with those types in it, in the following map call: > #Convert the dataframe to rdd > val df_rdd = df.rdd > > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > MapPartitionsRDD[10] at rdd at :41 > > #Print first rdd > df_rdd.take(1) > > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null]) > var ddbInsertFormattedRDD = df_rdd.map(a => { > var ddbMap = new HashMap[String, AttributeValue]() > var ClientNum = new AttributeValue() > ClientNum.setN(a.get(0).toString) > ddbMap.put("ClientNum", ClientNum) > var Value_1 = new AttributeValue() > Value_1.setS(a.get(1).toString) > ddbMap.put("Value_1", Value_1) > var Value_2 = new AttributeValue() > Value_2.setS(a.get(2).toString) > ddbMap.put("Value_2", Value_2) > var Value_3 = new AttributeValue() > Value_3.setS(a.get(3).toString) > ddbMap.put("Value_3", Value_3) > var Value_4 = new AttributeValue() > Value_4.setS(a.get(4).toString) > ddbMap.put("Value_4", Value_4) > var item = new DynamoDBItemWritable() > item.setItem(ddbMap) > (new Text(""), item) > }) > This last call uses the job configuration that defines the EMR-DDB connector > 
to write out the new RDD you created in the expected format: > ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf) > fails with the following error: > Caused by: java.lang.NullPointerException > null values caused the error; if I try with only ClientNum and Value_1 it works > and data is correctly inserted into the DynamoDB table. > Thanks for your help !!
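The NullPointerException in the report comes from calling `.toString` on a null cell before building the attribute map. The usual workaround is to skip null columns entirely, since DynamoDB does not accept empty attribute values. A plain-Python sketch of that guard (illustrative only, not the EMR-DDB connector or AWS SDK API):

```python
def row_to_item(row, columns):
    """Build a DynamoDB-style attribute map from one row, skipping nulls.

    The Scala code in the report calls a.get(i).toString unconditionally,
    which throws on null cells; dropping the attribute instead is the
    standard fix.
    """
    item = {}
    for name, value in zip(columns, row):
        if value is None:
            # The guard the original code was missing: never serialize nulls.
            continue
        # Numbers map to DynamoDB's "N" type, everything else to "S".
        if isinstance(value, (int, float)):
            item[name] = {"N": str(value)}
        else:
            item[name] = {"S": str(value)}
    return item
```

Applied per row before writing, this makes rows like `[21, "R", null, null, null]` produce an item with only the non-null attributes, which DynamoDB accepts.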
[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580598#comment-17580598 ] Vivek Garg edited comment on SPARK-22588 at 8/17/22 5:52 AM: - [salesforce cpq certification|https://www.igmguru.com/salesforce/salesforce-cpq-training/] was (Author: JIRAUSER294516): [https://www.igmguru.com/salesforce/salesforce-cpq-training/Salesforce CPQ Certification] > SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values > - > > Key: SPARK-22588 > URL: https://issues.apache.org/jira/browse/SPARK-22588 > Project: Spark > Issue Type: Question > Components: Deploy >Affects Versions: 2.1.1 >Reporter: Saanvi Sharma >Priority: Minor > Labels: dynamodb, spark > Original Estimate: 24h > Remaining Estimate: 24h > > I am using spark 2.1 on EMR and i have a dataframe like this: > ClientNum | Value_1 | Value_2 | Value_3 | Value_4 > 14 |A |B| C | null > 19 |X |Y| null| null > 21 |R | null | null| null > I want to load data into DynamoDB table with ClientNum as key fetching: > Analyze Your Data on Amazon DynamoDB with apche Spark11 > Using Spark SQL for ETL3 > here is my code that I tried to solve: > var jobConf = new JobConf(sc.hadoopConfiguration) > jobConf.set("dynamodb.servicename", "dynamodb") > jobConf.set("dynamodb.input.tableName", "table_name") > jobConf.set("dynamodb.output.tableName", "table_name") > jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com") > jobConf.set("dynamodb.regionid", "eu-west-1") > jobConf.set("dynamodb.throughput.read", "1") > jobConf.set("dynamodb.throughput.read.percent", "1") > jobConf.set("dynamodb.throughput.write", "1") > jobConf.set("dynamodb.throughput.write.percent", "1") > > jobConf.set("mapred.output.format.class", > "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat") > jobConf.set("mapred.input.format.class", > "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat") > #Import Data > val df = > 
sqlContext.read.format("com.databricks.spark.csv").option("header", > "true").option("inferSchema", "true").load(path) > I performed a transformation to have an RDD that matches the types that the > DynamoDB custom output format knows how to write. The custom output format > expects a tuple containing the Text and DynamoDBItemWritable types. > Create a new RDD with those types in it, in the following map call: > #Convert the dataframe to rdd > val df_rdd = df.rdd > > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > MapPartitionsRDD[10] at rdd at :41 > > #Print first rdd > df_rdd.take(1) > > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null]) > var ddbInsertFormattedRDD = df_rdd.map(a => { > var ddbMap = new HashMap[String, AttributeValue]() > var ClientNum = new AttributeValue() > ClientNum.setN(a.get(0).toString) > ddbMap.put("ClientNum", ClientNum) > var Value_1 = new AttributeValue() > Value_1.setS(a.get(1).toString) > ddbMap.put("Value_1", Value_1) > var Value_2 = new AttributeValue() > Value_2.setS(a.get(2).toString) > ddbMap.put("Value_2", Value_2) > var Value_3 = new AttributeValue() > Value_3.setS(a.get(3).toString) > ddbMap.put("Value_3", Value_3) > var Value_4 = new AttributeValue() > Value_4.setS(a.get(4).toString) > ddbMap.put("Value_4", Value_4) > var item = new DynamoDBItemWritable() > item.setItem(ddbMap) > (new Text(""), item) > }) > This last call uses the job configuration that defines the EMR-DDB connector > to write out the new RDD you created in the expected format: > ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf) > fails with the follwoing error: > Caused by: java.lang.NullPointerException > null values caused the error, if I try with ClientNum and Value_1 it works > data is correctly inserted on DynamoDB table. > Thanks for your help !! 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580598#comment-17580598 ] Vivek Garg edited comment on SPARK-22588 at 8/17/22 5:50 AM: - [https://www.igmguru.com/salesforce/salesforce-cpq-training/Salesforce CPQ Certification] was (Author: JIRAUSER294516): [[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/ SAP analytics cloud training]] > SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values > - > > Key: SPARK-22588 > URL: https://issues.apache.org/jira/browse/SPARK-22588 > Project: Spark > Issue Type: Question > Components: Deploy >Affects Versions: 2.1.1 >Reporter: Saanvi Sharma >Priority: Minor > Labels: dynamodb, spark > Original Estimate: 24h > Remaining Estimate: 24h > > I am using spark 2.1 on EMR and i have a dataframe like this: > ClientNum | Value_1 | Value_2 | Value_3 | Value_4 > 14 |A |B| C | null > 19 |X |Y| null| null > 21 |R | null | null| null > I want to load data into DynamoDB table with ClientNum as key fetching: > Analyze Your Data on Amazon DynamoDB with apche Spark11 > Using Spark SQL for ETL3 > here is my code that I tried to solve: > var jobConf = new JobConf(sc.hadoopConfiguration) > jobConf.set("dynamodb.servicename", "dynamodb") > jobConf.set("dynamodb.input.tableName", "table_name") > jobConf.set("dynamodb.output.tableName", "table_name") > jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com") > jobConf.set("dynamodb.regionid", "eu-west-1") > jobConf.set("dynamodb.throughput.read", "1") > jobConf.set("dynamodb.throughput.read.percent", "1") > jobConf.set("dynamodb.throughput.write", "1") > jobConf.set("dynamodb.throughput.write.percent", "1") > > jobConf.set("mapred.output.format.class", > "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat") > jobConf.set("mapred.input.format.class", > "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat") > #Import Data > val df = > 
sqlContext.read.format("com.databricks.spark.csv").option("header", > "true").option("inferSchema", "true").load(path) > I performed a transformation to have an RDD that matches the types that the > DynamoDB custom output format knows how to write. The custom output format > expects a tuple containing the Text and DynamoDBItemWritable types. > Create a new RDD with those types in it, in the following map call: > #Convert the dataframe to rdd > val df_rdd = df.rdd > > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > MapPartitionsRDD[10] at rdd at :41 > > #Print first rdd > df_rdd.take(1) > > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null]) > var ddbInsertFormattedRDD = df_rdd.map(a => { > var ddbMap = new HashMap[String, AttributeValue]() > var ClientNum = new AttributeValue() > ClientNum.setN(a.get(0).toString) > ddbMap.put("ClientNum", ClientNum) > var Value_1 = new AttributeValue() > Value_1.setS(a.get(1).toString) > ddbMap.put("Value_1", Value_1) > var Value_2 = new AttributeValue() > Value_2.setS(a.get(2).toString) > ddbMap.put("Value_2", Value_2) > var Value_3 = new AttributeValue() > Value_3.setS(a.get(3).toString) > ddbMap.put("Value_3", Value_3) > var Value_4 = new AttributeValue() > Value_4.setS(a.get(4).toString) > ddbMap.put("Value_4", Value_4) > var item = new DynamoDBItemWritable() > item.setItem(ddbMap) > (new Text(""), item) > }) > This last call uses the job configuration that defines the EMR-DDB connector > to write out the new RDD you created in the expected format: > ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf) > fails with the follwoing error: > Caused by: java.lang.NullPointerException > null values caused the error, if I try with ClientNum and Value_1 it works > data is correctly inserted on DynamoDB table. > Thanks for your help !! 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580598#comment-17580598 ] Vivek Garg edited comment on SPARK-22588 at 8/17/22 5:49 AM: - [[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/ SAP analytics cloud training]] was (Author: JIRAUSER294516): [https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/ SAP analytics cloud training] > SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values > - > > Key: SPARK-22588 > URL: https://issues.apache.org/jira/browse/SPARK-22588 > Project: Spark > Issue Type: Question > Components: Deploy >Affects Versions: 2.1.1 >Reporter: Saanvi Sharma >Priority: Minor > Labels: dynamodb, spark > Original Estimate: 24h > Remaining Estimate: 24h > > I am using spark 2.1 on EMR and i have a dataframe like this: > ClientNum | Value_1 | Value_2 | Value_3 | Value_4 > 14 |A |B| C | null > 19 |X |Y| null| null > 21 |R | null | null| null > I want to load data into DynamoDB table with ClientNum as key fetching: > Analyze Your Data on Amazon DynamoDB with apche Spark11 > Using Spark SQL for ETL3 > here is my code that I tried to solve: > var jobConf = new JobConf(sc.hadoopConfiguration) > jobConf.set("dynamodb.servicename", "dynamodb") > jobConf.set("dynamodb.input.tableName", "table_name") > jobConf.set("dynamodb.output.tableName", "table_name") > jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com") > jobConf.set("dynamodb.regionid", "eu-west-1") > jobConf.set("dynamodb.throughput.read", "1") > jobConf.set("dynamodb.throughput.read.percent", "1") > jobConf.set("dynamodb.throughput.write", "1") > jobConf.set("dynamodb.throughput.write.percent", "1") > > jobConf.set("mapred.output.format.class", > "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat") > jobConf.set("mapred.input.format.class", > "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat") > #Import Data > val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", > "true").option("inferSchema", "true").load(path) > I performed a transformation to have an RDD that matches the types that the > DynamoDB custom output format knows how to write. The custom output format > expects a tuple containing the Text and DynamoDBItemWritable types. > Create a new RDD with those types in it, in the following map call: > #Convert the dataframe to rdd > val df_rdd = df.rdd > > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > MapPartitionsRDD[10] at rdd at :41 > > #Print first rdd > df_rdd.take(1) > > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null]) > var ddbInsertFormattedRDD = df_rdd.map(a => { > var ddbMap = new HashMap[String, AttributeValue]() > var ClientNum = new AttributeValue() > ClientNum.setN(a.get(0).toString) > ddbMap.put("ClientNum", ClientNum) > var Value_1 = new AttributeValue() > Value_1.setS(a.get(1).toString) > ddbMap.put("Value_1", Value_1) > var Value_2 = new AttributeValue() > Value_2.setS(a.get(2).toString) > ddbMap.put("Value_2", Value_2) > var Value_3 = new AttributeValue() > Value_3.setS(a.get(3).toString) > ddbMap.put("Value_3", Value_3) > var Value_4 = new AttributeValue() > Value_4.setS(a.get(4).toString) > ddbMap.put("Value_4", Value_4) > var item = new DynamoDBItemWritable() > item.setItem(ddbMap) > (new Text(""), item) > }) > This last call uses the job configuration that defines the EMR-DDB connector > to write out the new RDD you created in the expected format: > ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf) > fails with the follwoing error: > Caused by: java.lang.NullPointerException > null values caused the error, if I try with ClientNum and Value_1 it works > data is correctly inserted on DynamoDB table. > Thanks for your help !! 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580598#comment-17580598 ] Vivek Garg edited comment on SPARK-22588 at 8/17/22 5:48 AM: - https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/";>SAP analytics cloud training [https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/](SAP analytics cloud training) (https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/)[SAP analytics cloud training] [url=https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/]SAP analytics cloud training[/url] [https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/ SAP analytics cloud training] [SAP analytics cloud training](https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/) (SAP analytics cloud training)[https://www.igmguru.com/erp-training/sac-analytics-cloud-online-training/] was (Author: JIRAUSER294516): According to the [Salesforce CPQ Certification|[https://www.igmguru.com/salesforce/salesforce-cpq-training/]] Exam, our Salesforce CPQ Training program has been created. The core abilities needed for effectively implementing Salesforce CPQ solutions are developed in this course on Salesforce CPQ. Through instruction using practical examples, this course will go deeper into developing a quoting process, pricing strategies, configuration, CPQ object data model, and more. This online Salesforce CPQ training course includes practical projects that will aid you in passing the Salesforce CPQ Certification test. 
> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values > - > > Key: SPARK-22588 > URL: https://issues.apache.org/jira/browse/SPARK-22588 > Project: Spark > Issue Type: Question > Components: Deploy >Affects Versions: 2.1.1 >Reporter: Saanvi Sharma >Priority: Minor > Labels: dynamodb, spark > Original Estimate: 24h > Remaining Estimate: 24h > > I am using spark 2.1 on EMR and i have a dataframe like this: > ClientNum | Value_1 | Value_2 | Value_3 | Value_4 > 14 |A |B| C | null > 19 |X |Y| null| null > 21 |R | null | null| null > I want to load data into DynamoDB table with ClientNum as key fetching: > Analyze Your Data on Amazon DynamoDB with apche Spark11 > Using Spark SQL for ETL3 > here is my code that I tried to solve: > var jobConf = new JobConf(sc.hadoopConfiguration) > jobConf.set("dynamodb.servicename", "dynamodb") > jobConf.set("dynamodb.input.tableName", "table_name") > jobConf.set("dynamodb.output.tableName", "table_name") > jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com") > jobConf.set("dynamodb.regionid", "eu-west-1") > jobConf.set("dynamodb.throughput.read", "1") > jobConf.set("dynamodb.throughput.read.percent", "1") > jobConf.set("dynamodb.throughput.write", "1") > jobConf.set("dynamodb.throughput.write.percent", "1") > > jobConf.set("mapred.output.format.class", > "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat") > jobConf.set("mapred.input.format.class", > "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat") > #Import Data > val df = > sqlContext.read.format("com.databricks.spark.csv").option("header", > "true").option("inferSchema", "true").load(path) > I performed a transformation to have an RDD that matches the types that the > DynamoDB custom output format knows how to write. The custom output format > expects a tuple containing the Text and DynamoDBItemWritable types. 
> Create a new RDD with those types in it, in the following map call: > #Convert the dataframe to rdd > val df_rdd = df.rdd > > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > MapPartitionsRDD[10] at rdd at :41 > > #Print first rdd > df_rdd.take(1) > > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null]) > var ddbInsertFormattedRDD = df_rdd.map(a => { > var ddbMap = new HashMap[String, AttributeValue]() > var ClientNum = new AttributeValue() > ClientNum.setN(a.get(0).toString) > ddbMap.put("ClientNum", ClientNum) > var Value_1 = new AttributeValue() > Value_1.setS(a.get(1).toString) > ddbMap.put("Value_1", Value_1) > var Value_2 = new AttributeValue() > Value_2.setS(a.get(2).toString) > ddbMap.put("Value_2", Value_2) > var Value_3 = new AttributeValue() > Value_3.setS(a.get(3).toString) > ddbMap.put("Value_3", Value_3) > var Value_4 = new AttributeValue() > Value_4.setS(a.get(4).toString) > ddbMap.put("Value_4", Value_4) > var item = new DynamoDBItemWritable() > item.setItem(ddbMap) > (new Text(
[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580598#comment-17580598 ] Vivek Garg edited comment on SPARK-22588 at 8/17/22 5:49 AM: - -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580598#comment-17580598 ] Vivek Garg edited comment on SPARK-22588 at 8/17/22 5:49 AM: -
> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
> Issue Type: Question
> Components: Deploy
> Affects Versions: 2.1.1
> Reporter: Saanvi Sharma
> Priority: Minor
> Labels: dynamodb, spark
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
> ClientNum | Value_1 | Value_2 | Value_3 | Value_4
> 14 | A | B | C | null
> 19 | X | Y | null | null
> 21 | R | null | null | null
> I want to load data into a DynamoDB table with ClientNum as the key, following:
> Analyze Your Data on Amazon DynamoDB with Apache Spark
> Using Spark SQL for ETL
> Here is the code I tried:
> var jobConf = new JobConf(sc.hadoopConfiguration)
> jobConf.set("dynamodb.servicename", "dynamodb")
> jobConf.set("dynamodb.input.tableName", "table_name")
> jobConf.set("dynamodb.output.tableName", "table_name")
> jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
> jobConf.set("dynamodb.regionid", "eu-west-1")
> jobConf.set("dynamodb.throughput.read", "1")
> jobConf.set("dynamodb.throughput.read.percent", "1")
> jobConf.set("dynamodb.throughput.write", "1")
> jobConf.set("dynamodb.throughput.write.percent", "1")
> jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
> jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
> // Import data
> val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the DynamoDB custom output format knows how to write. The custom output format expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
> // Convert the dataframe to an RDD
> val df_rdd = df.rdd
> df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[10] at rdd at :41
> // Print the first row
> df_rdd.take(1)
> res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
> var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
> })
> This last call uses the job configuration that defines the EMR-DDB connector to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> This fails with the following error:
> Caused by: java.lang.NullPointerException
> The null values caused the error; if I try with only ClientNum and Value_1 it works and the data is correctly inserted into the DynamoDB table.
> Thanks for your help! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
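The NullPointerException above comes from calling .toString on row values that are null: every column is written unconditionally, so any missing value blows up the map. One common fix is to add each attribute only when the value is present. The sketch below shows that pattern in plain Python with dict-based items (the function name and the simplified {"N"/"S": ...} attribute encoding are illustrative stand-ins for the Java AttributeValue API, not the connector's actual interface):

```python
def row_to_item(row, columns):
    """Build a DynamoDB-style item from a row, skipping null values.

    Mirrors the fix for the NullPointerException above: an attribute is
    added only when its value is present, instead of unconditionally
    stringifying every column.
    """
    item = {}
    for name, value in zip(columns, row):
        if value is None:
            continue  # omit null columns instead of writing them
        # DynamoDB distinguishes number ("N") and string ("S") attributes
        kind = "N" if isinstance(value, (int, float)) else "S"
        item[name] = {kind: str(value)}
    return item

cols = ["ClientNum", "Value_1", "Value_2", "Value_3", "Value_4"]
print(row_to_item((14, "A", "B", "C", None), cols))
# → {'ClientNum': {'N': '14'}, 'Value_1': {'S': 'A'}, 'Value_2': {'S': 'B'}, 'Value_3': {'S': 'C'}}
```

In the Scala code above, the equivalent change is to wrap each setS/put pair in a null check (or use Option) inside the map closure.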
[jira] [Commented] (SPARK-40117) Convert condition to java in DataFrameWriterV2.overwrite
[ https://issues.apache.org/jira/browse/SPARK-40117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580588#comment-17580588 ] Apache Spark commented on SPARK-40117: -- User 'looi' has created a pull request for this issue: https://github.com/apache/spark/pull/37547 > Convert condition to java in DataFrameWriterV2.overwrite > > > Key: SPARK-40117 > URL: https://issues.apache.org/jira/browse/SPARK-40117 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.3, 3.3.0, 3.2.2 >Reporter: Wenli Looi >Priority: Major > > DataFrameWriterV2.overwrite() fails to convert the condition parameter to > java. This prevents the function from being called. > It is caused by the following commit that deleted the `_to_java_column` call > instead of fixing it: > [https://github.com/apache/spark/commit/a1e459ed9f6777fb8d5a2d09fda666402f9230b9] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40117) Convert condition to java in DataFrameWriterV2.overwrite
[ https://issues.apache.org/jira/browse/SPARK-40117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580587#comment-17580587 ] Apache Spark commented on SPARK-40117: -- User 'looi' has created a pull request for this issue: https://github.com/apache/spark/pull/37547 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40117) Convert condition to java in DataFrameWriterV2.overwrite
[ https://issues.apache.org/jira/browse/SPARK-40117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40117: Assignee: (was: Apache Spark) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40117) Convert condition to java in DataFrameWriterV2.overwrite
[ https://issues.apache.org/jira/browse/SPARK-40117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40117: Assignee: Apache Spark -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40117) Convert condition to java in DataFrameWriterV2.overwrite
Wenli Looi created SPARK-40117: -- Summary: Convert condition to java in DataFrameWriterV2.overwrite Key: SPARK-40117 URL: https://issues.apache.org/jira/browse/SPARK-40117 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 3.2.2, 3.3.0, 3.1.3 Reporter: Wenli Looi DataFrameWriterV2.overwrite() fails to convert the condition parameter to java. This prevents the function from being called. It is caused by the following commit that deleted the `_to_java_column` call instead of fixing it: [https://github.com/apache/spark/commit/a1e459ed9f6777fb8d5a2d09fda666402f9230b9] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
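The bug here is that the Python-side Column wrapper is handed straight to the JVM instead of being unwrapped first. The sketch below illustrates that conversion pattern with pure-Python stand-in classes (JavaColumn, Column, and the overwrite function are illustrative stubs, not pyspark's real internals; only the `_jc` attribute and the `_to_java_column` name mirror pyspark's convention):

```python
class JavaColumn:
    """Stand-in for the JVM-side column handle (hypothetical)."""
    def __init__(self, expr):
        self.expr = expr

class Column:
    """Stand-in for pyspark's Column wrapper: holds a JVM handle in _jc."""
    def __init__(self, expr):
        self._jc = JavaColumn(expr)

def _to_java_column(col):
    """Unwrap a Column to its JVM handle; pyspark's helper also accepts
    a string column name, which it resolves to a column first."""
    if isinstance(col, Column):
        return col._jc
    return JavaColumn(str(col))  # treat anything else as a column name

def overwrite(condition):
    """Sketch of the fixed overwrite(): convert the Python-side condition
    before handing it to the JVM writer (the call SPARK-40117 restores)."""
    jcolumn = _to_java_column(condition)
    return jcolumn  # a real writer would pass this on to the JVM

print(overwrite(Column("id > 10")).expr)
# → id > 10
```

Without the `_to_java_column` call, the JVM receives the Python wrapper object itself and the invocation fails, which is why the function cannot be called at all.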
[jira] [Commented] (SPARK-40116) Pin Arrow version to 8.0.0 in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-40116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580574#comment-17580574 ] Apache Spark commented on SPARK-40116: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37546 > Pin Arrow version to 8.0.0 in AppVeyor > -- > > Key: SPARK-40116 > URL: https://issues.apache.org/jira/browse/SPARK-40116 > Project: Spark > Issue Type: Test > Components: Project Infra, SparkR >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > SparkR does not support Arrow 9.0.0 (SPARK-40114) so the tests currently > fail, https://ci.appveyor.com/project/HyukjinKwon/spark/builds/44490387. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40116) Pin Arrow version to 8.0.0 in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-40116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40116: Assignee: Apache Spark -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40116) Pin Arrow version to 8.0.0 in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-40116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40116: Assignee: (was: Apache Spark) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40116) Pin Arrow version to 8.0.0 in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-40116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580573#comment-17580573 ] Apache Spark commented on SPARK-40116: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37546 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40116) Pin Arrow version to 8.0.0 in AppVeyor
Hyukjin Kwon created SPARK-40116: Summary: Pin Arrow version to 8.0.0 in AppVeyor Key: SPARK-40116 URL: https://issues.apache.org/jira/browse/SPARK-40116 Project: Spark Issue Type: Test Components: Project Infra, SparkR Affects Versions: 3.4.0 Reporter: Hyukjin Kwon SparkR does not support Arrow 9.0.0 (SPARK-40114) so the tests currently fail, https://ci.appveyor.com/project/HyukjinKwon/spark/builds/44490387. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40115) Pin Arrow version to 8.0.0 in AppVeyor
Hyukjin Kwon created SPARK-40115: Summary: Pin Arrow version to 8.0.0 in AppVeyor Key: SPARK-40115 URL: https://issues.apache.org/jira/browse/SPARK-40115 Project: Spark Issue Type: Test Components: Project Infra, SparkR Affects Versions: 3.4.0 Reporter: Hyukjin Kwon Currently the SparkR tests fail (https://ci.appveyor.com/project/HyukjinKwon/spark/builds/44490387) because SparkR does not support Arrow 9.0.0+; see also SPARK-40114. We should pin the version to 8.0.0 for now. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40114) Arrow 9.0.0 support with SparkR
Hyukjin Kwon created SPARK-40114: Summary: Arrow 9.0.0 support with SparkR Key: SPARK-40114 URL: https://issues.apache.org/jira/browse/SPARK-40114 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 3.4.0 Reporter: Hyukjin Kwon
{code}
== Failed ==
-- 1. Error (test_sparkSQL_arrow.R:103:3): dapply() Arrow optimization -
Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 'n' argument
Backtrace:
 1. SparkR::collect(ret) at test_sparkSQL_arrow.R:103:2
 2. SparkR::collect(ret)
 3. SparkR (local) .local(x, ...)
 7. SparkR:::readRaw(conn)
 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
-- 2. Error (test_sparkSQL_arrow.R:133:3): dapply() Arrow optimization - type sp
Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 'n' argument
Backtrace:
 1. SparkR::collect(ret) at test_sparkSQL_arrow.R:133:2
 2. SparkR::collect(ret)
 3. SparkR (local) .local(x, ...)
 7. SparkR:::readRaw(conn)
 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
-- 3. Error (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type sp
Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 'n' argument
Backtrace:
 1. testthat::expect_true(all(collect(ret) == rdf)) at test_sparkSQL_arrow.R:143:2
 5. SparkR::collect(ret)
 6. SparkR (local) .local(x, ...)
 10. SparkR:::readRaw(conn)
 11. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
-- 4. Error (test_sparkSQL_arrow.R:184:3): gapply() Arrow optimization -
Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 'n' argument
Backtrace:
 1. SparkR::collect(ret) at test_sparkSQL_arrow.R:184:2
 2. SparkR::collect(ret)
 3. SparkR (local) .local(x, ...)
 7. SparkR:::readRaw(conn)
 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
-- 5. Error (test_sparkSQL_arrow.R:217:3): gapply() Arrow optimization - type sp
Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 'n' argument
Backtrace:
 1. SparkR::collect(ret) at test_sparkSQL_arrow.R:217:2
 2. SparkR::collect(ret)
 3. SparkR (local) .local(x, ...)
 7. SparkR:::readRaw(conn)
 8. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
-- 6. Error (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type sp
Error in `readBin(con, raw(), as.integer(dataLen), endian = "big")`: invalid 'n' argument
Backtrace:
 1. testthat::expect_true(all(collect(ret) == rdf)) at test_sparkSQL_arrow.R:229:2
 5. SparkR::collect(ret)
 6. SparkR (local) .local(x, ...)
 10. SparkR:::readRaw(conn)
 11. base::readBin(con, raw(), as.integer(dataLen), endian = "big")
-- 7. Failure (test_sparkSQL_arrow.R:247:3): SPARK-32478: gapply() Arrow optimiz
`count(...)` threw an error with unexpected message.
Expected match: "expected IntegerType, IntegerType, got IntegerType, StringType"
Actual message: "org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 29.0 failed 1 times, most recent failure: Lost task 0.0 in stage 29.0 (TID 54) (APPVYR-WIN executor driver): org.apache.spark.SparkException: R unexpectedly exited.
R worker produced errors: The tzdb package is not installed. Timezones will not be available to Arrow compute functions.
Error in arrow::write_arrow(df, raw()) : write_arrow has been removed
Calls: -> writeRaw -> writeInt -> writeBin ->
Execution halted
	at org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:144)
	at org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:137)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
	at org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:194)
	at org.apache.spark.sql.execution.r.ArrowRRunner$$anon$2.read(ArrowRRunner.scala:123)
	at org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:113)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMer
[jira] [Updated] (SPARK-40113) Unify ParquetScanBuilder DataSourceV2 interface implementation
[ https://issues.apache.org/jira/browse/SPARK-40113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mars updated SPARK-40113: - Priority: Minor (was: Major) > Unify ParquetScanBuilder DataSourceV2 interface implementation > -- > > Key: SPARK-40113 > URL: https://issues.apache.org/jira/browse/SPARK-40113 > Project: Spark > Issue Type: Improvement > Components: Optimizer > Affects Versions: 3.3.0 > Reporter: Mars > Priority: Minor > > Currently the `FileScanBuilder` interface is not fully implemented in `ParquetScanBuilder`, unlike `OrcScanBuilder`, `AvroScanBuilder`, and `CSVScanBuilder`. > To unify the logic and make the code clearer, this part of the implementation should be unified. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40113) Unify ParquetScanBuilder DataSourceV2 interface implementation
[ https://issues.apache.org/jira/browse/SPARK-40113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40113: Assignee: Apache Spark -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40113) Unify ParquetScanBuilder DataSourceV2 interface implementation
[ https://issues.apache.org/jira/browse/SPARK-40113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40113: Assignee: (was: Apache Spark) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40113) Unify ParquetScanBuilder DataSourceV2 interface implementation
[ https://issues.apache.org/jira/browse/SPARK-40113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580551#comment-17580551 ] Apache Spark commented on SPARK-40113: -- User 'yabola' has created a pull request for this issue: https://github.com/apache/spark/pull/37545 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40113) Unify ParquetScanBuilder DataSourceV2 interface implementation
[ https://issues.apache.org/jira/browse/SPARK-40113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mars updated SPARK-40113: - Summary: Unify ParquetScanBuilder DataSourceV2 interface implementation (was: Unified ParquetScanBuilder DataSourceV2 interface implementation) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40113) Unified ParquetScanBuilder DataSourceV2 interface implementation
Mars created SPARK-40113: Summary: Unified ParquetScanBuilder DataSourceV2 interface implementation Key: SPARK-40113 URL: https://issues.apache.org/jira/browse/SPARK-40113 Project: Spark Issue Type: Improvement Components: Optimizer Affects Versions: 3.3.0 Reporter: Mars Currently the `FileScanBuilder` interface is not fully implemented in `ParquetScanBuilder`, unlike `OrcScanBuilder`, `AvroScanBuilder`, and `CSVScanBuilder`. To unify the logic and make the code clearer, this part of the implementation should be unified. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38648) SPIP: Simplified API for DL Inferencing
[ https://issues.apache.org/jira/browse/SPARK-38648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580548#comment-17580548 ] Yikun Jiang commented on SPARK-38648: - By the way, just curious, is this SPIP expected to be a feature in version 3.4? > SPIP: Simplified API for DL Inferencing > --- > > Key: SPARK-38648 > URL: https://issues.apache.org/jira/browse/SPARK-38648 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Lee Yang >Priority: Minor > > h1. Background and Motivation > The deployment of deep learning (DL) models to Spark clusters can be a point > of friction today. DL practitioners often aren't well-versed with Spark, and > Spark experts often aren't well-versed with the fast-changing DL frameworks. > Currently, the deployment of trained DL models is done in a fairly ad-hoc > manner, with each model integration usually requiring significant effort. > To simplify this process, we propose adding an integration layer for each > major DL framework that can introspect their respective saved models to > more-easily integrate these models into Spark applications. You can find a > detailed proposal here: > [https://docs.google.com/document/d/1n7QPHVZfmQknvebZEXxzndHPV2T71aBsDnP4COQa_v0] > h1. Goals > - Simplify the deployment of pre-trained single-node DL models to Spark > inference applications. > - Follow pandas_udf for simple inference use-cases. > - Follow Spark ML Pipelines APIs for transfer-learning use-cases. > - Enable integrations with popular third-party DL frameworks like > TensorFlow, PyTorch, and Huggingface. > - Focus on PySpark, since most of the DL frameworks use Python. > - Take advantage of built-in Spark features like GPU scheduling and Arrow > integration. > - Enable inference on both CPU and GPU. > h1. Non-goals > - DL model training. > - Inference w/ distributed models, i.e. "model parallel" inference. > h1. 
Target Personas > - Data scientists who need to deploy DL models on Spark. > - Developers who need to deploy DL models on Spark. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38648) SPIP: Simplified API for DL Inferencing
[ https://issues.apache.org/jira/browse/SPARK-38648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580547#comment-17580547 ] Yikun Jiang commented on SPARK-38648: - If we want to run ONNX directly, we might want to support onnxruntime as one of the DL frameworks, e.g. sparkext.onnxruntime.Model(url). For other frameworks, users can first convert the ONNX model to the framework-specific model format [1] and then call sparkext.onnxruntime.Model(converted_url); I don't think that's too difficult. So I personally think the model format should not be unified; ONNX is just one of the options. [1] https://pytorch.org/docs/stable/onnx.html#torch-onnx -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40110) Add JDBCWithAQESuite
[ https://issues.apache.org/jira/browse/SPARK-40110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580545#comment-17580545 ] Apache Spark commented on SPARK-40110: -- User 'kazuyukitanimura' has created a pull request for this issue: https://github.com/apache/spark/pull/37544 > Add JDBCWithAQESuite > > > Key: SPARK-40110 > URL: https://issues.apache.org/jira/browse/SPARK-40110 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Priority: Minor > > Currently `JDBCSuite` assumes that AQE is always turned off. We should also > test with AQE turned on -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40110) Add JDBCWithAQESuite
[ https://issues.apache.org/jira/browse/SPARK-40110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40110: Assignee: Apache Spark > Add JDBCWithAQESuite > > > Key: SPARK-40110 > URL: https://issues.apache.org/jira/browse/SPARK-40110 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Assignee: Apache Spark >Priority: Minor > > Currently `JDBCSuite` assumes that AQE is always turned off. We should also > test with AQE turned on -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40110) Add JDBCWithAQESuite
[ https://issues.apache.org/jira/browse/SPARK-40110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40110: Assignee: (was: Apache Spark) > Add JDBCWithAQESuite > > > Key: SPARK-40110 > URL: https://issues.apache.org/jira/browse/SPARK-40110 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Priority: Minor > > Currently `JDBCSuite` assumes that AQE is always turned off. We should also > test with AQE turned on -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39975) Upgrade rocksdbjni to 7.4.5
[ https://issues.apache.org/jira/browse/SPARK-39975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39975: Assignee: Apache Spark > Upgrade rocksdbjni to 7.4.5 > --- > > Key: SPARK-39975 > URL: https://issues.apache.org/jira/browse/SPARK-39975 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > [https://github.com/facebook/rocksdb/releases/tag/v7.4.5] > > {code:java} > Fix a bug starting in 7.4.0 in which some fsync operations might be skipped > in a DB after any DropColumnFamily on that DB, until it is re-opened. This > can lead to data loss on power loss. (For custom FileSystem implementations, > this could lead to FSDirectory::Fsync or FSDirectory::Close after the first > FSDirectory::Close; Also, valgrind could report call to close() with fd=-1.) > {code} > > Fixed a bug that caused data loss > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39975) Upgrade rocksdbjni to 7.4.5
[ https://issues.apache.org/jira/browse/SPARK-39975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580539#comment-17580539 ] Apache Spark commented on SPARK-39975: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37543 > Upgrade rocksdbjni to 7.4.5 > --- > > Key: SPARK-39975 > URL: https://issues.apache.org/jira/browse/SPARK-39975 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > [https://github.com/facebook/rocksdb/releases/tag/v7.4.5] > > {code:java} > Fix a bug starting in 7.4.0 in which some fsync operations might be skipped > in a DB after any DropColumnFamily on that DB, until it is re-opened. This > can lead to data loss on power loss. (For custom FileSystem implementations, > this could lead to FSDirectory::Fsync or FSDirectory::Close after the first > FSDirectory::Close; Also, valgrind could report call to close() with fd=-1.) > {code} > > Fixed a bug that caused data loss > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39975) Upgrade rocksdbjni to 7.4.5
[ https://issues.apache.org/jira/browse/SPARK-39975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39975: Assignee: (was: Apache Spark) > Upgrade rocksdbjni to 7.4.5 > --- > > Key: SPARK-39975 > URL: https://issues.apache.org/jira/browse/SPARK-39975 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > [https://github.com/facebook/rocksdb/releases/tag/v7.4.5] > > {code:java} > Fix a bug starting in 7.4.0 in which some fsync operations might be skipped > in a DB after any DropColumnFamily on that DB, until it is re-opened. This > can lead to data loss on power loss. (For custom FileSystem implementations, > this could lead to FSDirectory::Fsync or FSDirectory::Close after the first > FSDirectory::Close; Also, valgrind could report call to close() with fd=-1.) > {code} > > Fixed a bug that caused data loss > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40046) Use Jackson instead of json4s to serialize `RocksDBMetrics`
[ https://issues.apache.org/jira/browse/SPARK-40046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-40046. -- Resolution: Won't Fix > Use Jackson instead of json4s to serialize `RocksDBMetrics` > --- > > Key: SPARK-40046 > URL: https://issues.apache.org/jira/browse/SPARK-40046 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40075) Refactor kafka010.JsonUtils to use Jackson instead of Json4s
[ https://issues.apache.org/jira/browse/SPARK-40075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-40075. -- Resolution: Won't Fix > Refactor kafka010.JsonUtils to use Jackson instead of Json4s > > > Key: SPARK-40075 > URL: https://issues.apache.org/jira/browse/SPARK-40075 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40101) `include an external JAR in SparkR` in core module but need antlr4
[ https://issues.apache.org/jira/browse/SPARK-40101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580535#comment-17580535 ] Hyukjin Kwon commented on SPARK-40101: -- I think mvn install won't work. It has to be package IIRC, since it requires jars to load. mvn install doesn't create jars IIRC. > `include an external JAR in SparkR` in core module but need antlr4 > -- > > Key: SPARK-40101 > URL: https://issues.apache.org/jira/browse/SPARK-40101 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > Run the following commands: > > {code:java} > mvn clean -Phadoop-3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl > -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > mvn clean install -DskipTests -pl core -am > mvn clean test -pl core > -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest -Dtest=none > -DwildcardSuites=org.apache.spark.deploy.SparkSubmitSuite {code} > > > `include an external JAR in SparkR` failed as follows: > > {code:java} > include an external JAR in SparkR *** FAILED *** > spark-submit returned with exit code 1.
> Command line: '/Users/Spark/spark-source/bin/spark-submit' '--name' > 'testApp' '--master' 'local' '--jars' > 'file:/Users/Spark/spark-source/core/target/tmp/spark-e15e960c-1c10-44cd-99d4-f1905e4c18be/sparkRTestJar-1660632952368.jar' > '--verbose' '--conf' 'spark.ui.enabled=false' > '/Users/Spark/spark-source/R/pkg/tests/fulltests/jarTest.R' > > 2022-08-15 23:55:53.495 - stderr> Using properties file: null > 2022-08-15 23:55:53.58 - stderr> Parsed arguments: > 2022-08-15 23:55:53.58 - stderr> master local > 2022-08-15 23:55:53.58 - stderr> deployMode null > 2022-08-15 23:55:53.58 - stderr> executorMemory null > 2022-08-15 23:55:53.581 - stderr> executorCores null > 2022-08-15 23:55:53.581 - stderr> totalExecutorCores null > 2022-08-15 23:55:53.581 - stderr> propertiesFile null > 2022-08-15 23:55:53.581 - stderr> driverMemory null > 2022-08-15 23:55:53.581 - stderr> driverCores null > 2022-08-15 23:55:53.581 - stderr> driverExtraClassPath null > 2022-08-15 23:55:53.581 - stderr> driverExtraLibraryPath null > 2022-08-15 23:55:53.581 - stderr> driverExtraJavaOptions null > 2022-08-15 23:55:53.581 - stderr> supervise false > 2022-08-15 23:55:53.581 - stderr> queue null > 2022-08-15 23:55:53.581 - stderr> numExecutors null > 2022-08-15 23:55:53.581 - stderr> files null > 2022-08-15 23:55:53.581 - stderr> pyFiles null > 2022-08-15 23:55:53.581 - stderr> archives null > 2022-08-15 23:55:53.581 - stderr> mainClass null > 2022-08-15 23:55:53.581 - stderr> primaryResource > file:/Users/Spark/spark-source/R/pkg/tests/fulltests/jarTest.R > 2022-08-15 23:55:53.581 - stderr> name testApp > 2022-08-15 23:55:53.581 - stderr> childArgs [] > 2022-08-15 23:55:53.581 - stderr> jars > file:/Users/Spark/spark-source/core/target/tmp/spark-e15e960c-1c10-44cd-99d4-f1905e4c18be/sparkRTestJar-1660632952368.jar > 2022-08-15 23:55:53.581 - stderr> packages null > 2022-08-15 23:55:53.581 - stderr> packagesExclusions null > 2022-08-15 23:55:53.581 - stderr> repositories null > 2022-08-15 
23:55:53.581 - stderr> verbose true > 2022-08-15 23:55:53.581 - stderr> > 2022-08-15 23:55:53.581 - stderr> Spark properties used, including those > specified through > 2022-08-15 23:55:53.581 - stderr> --conf and those from the properties > file null: > 2022-08-15 23:55:53.581 - stderr> (spark.ui.enabled,false) > 2022-08-15 23:55:53.581 - stderr> > 2022-08-15 23:55:53.581 - stderr> > 2022-08-15 23:55:53.729 - stderr> > /Users/Spark/spark-source/core/target/tmp/spark-e15e960c-1c10-44cd-99d4-f1905e4c18be/sparkRTestJar-1660632952368.jar > doesn't contain R source code, skipping... > 2022-08-15 23:55:54.058 - stderr> Main class: > 2022-08-15 23:55:54.058 - stderr> org.apache.spark.deploy.RRunner > 2022-08-15 23:55:54.058 - stderr> Arguments: > 2022-08-15 23:55:54.058 - stderr> > file:/Users/Spark/spark-source/R/pkg/tests/fulltests/jarTest.R > 2022-08-15 23:55:54.06 - stderr> Spark config: > 2022-08-15 23:55:54.06 - stderr> (spark.app.name,testApp) > 2022-08-15 23:55:54.06 - stderr> (spark.app.submitTime,1660632954058) > 2022-08-15 23:55:54.06 - stderr> > (spark.files,file:/Users/Spark/spark-source/R/pkg/tests/fulltests/jarTest.R)
[jira] [Commented] (SPARK-40103) Support read/write.csv() in SparkR
[ https://issues.apache.org/jira/browse/SPARK-40103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580522#comment-17580522 ] Hyukjin Kwon commented on SPARK-40103: -- The main problem is that the signature conflicts with the R base API IIRC. We should probably use a different name for this. > Support read/write.csv() in SparkR > -- > > Key: SPARK-40103 > URL: https://issues.apache.org/jira/browse/SPARK-40103 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 3.3.0 >Reporter: deshanxiao >Priority: Major > > Today, almost all languages support the DataFrameReader.csv API; only R is > missing. We currently need to use df.read() to read a CSV file. We need a > higher-level API for it. > Java: > [DataFrameReader.csv()|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html] > Scala: > [DataFrameReader.csv()|https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameReader.html#csv(paths:String*):org.apache.spark.sql.DataFrame] > Python: > [DataFrameReader.csv()|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.csv.html#pyspark.sql.DataFrameReader.csv] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39184) ArrayIndexOutOfBoundsException for some date/time sequences in some time-zones
[ https://issues.apache.org/jira/browse/SPARK-39184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580521#comment-17580521 ] Apache Spark commented on SPARK-39184: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/37542 > ArrayIndexOutOfBoundsException for some date/time sequences in some time-zones > -- > > Key: SPARK-39184 > URL: https://issues.apache.org/jira/browse/SPARK-39184 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.1, 3.3.0, 3.4.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > Fix For: 3.4.0, 3.3.1, 3.2.3 > > > The following query gets an {{ArrayIndexOutOfBoundsException}} when run from > the {{America/Los_Angeles}} time-zone: > {noformat} > spark-sql> select sequence(timestamp'2022-03-13 00:00:00', > timestamp'2022-03-16 03:00:00', interval 1 day 1 hour) as x; > 22/05/13 14:47:27 ERROR SparkSQLDriver: Failed in [select > sequence(timestamp'2022-03-13 00:00:00', timestamp'2022-03-16 03:00:00', > interval 1 day 1 hour) as x] > java.lang.ArrayIndexOutOfBoundsException: 3 > {noformat} > In fact, any such query will get an {{ArrayIndexOutOfBoundsException}} if the > start-stop period in your time-zone includes more instances of "spring > forward" than instances of "fall back" and the start-stop period is evenly > divisible by the interval. > In the {{America/Los_Angeles}} time-zone, examples include: > {noformat} > -- This query encompasses 2 instances of "spring forward" but only one > -- instance of "fall back". 
> select sequence( > timestamp'2022-03-13', > timestamp'2022-03-13' + (interval '42' hours * 209), > interval '42' hours) as x; > {noformat} > {noformat} > select sequence( > timestamp'2022-03-13', > timestamp'2022-03-13' + (interval '31' hours * 11), > interval '31' hours) as x; > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
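The mechanics behind this exception can be reproduced outside Spark: the result array is pre-sized from the physical (UTC) duration between start and stop, while the elements are generated by wall-clock interval arithmetic. A standalone Python sketch of the mismatch for the query above (assuming `zoneinfo` tz data is available; Spark's internals are only mirrored approximately):

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

LA = ZoneInfo("America/Los_Angeles")
start = datetime(2022, 3, 13, 0, 0, tzinfo=LA)
stop = datetime(2022, 3, 16, 3, 0, tzinfo=LA)
step = timedelta(days=1, hours=1)  # interval '1 day 1 hour' = 25 wall-clock hours

# Physical elapsed time: the window crosses "spring forward"
# (2022-03-13 02:00 PST -> 03:00 PDT), so one wall-clock hour never happens.
# 3 days 3 hours on the wall clock, but only 74 physical hours.
physical_hours = (stop.astimezone(timezone.utc)
                  - start.astimezone(timezone.utc)).total_seconds() / 3600

# Result size estimated from the physical duration (mirroring the
# pre-allocation): floor(74 / 25) + 1 = 3 slots.
estimated = int(physical_hours // 25) + 1

# The actual walk adds the interval in local calendar time and lands exactly
# on the stop timestamp, producing 4 elements -> writing index 3 overflows
# the 3-slot array, hence "ArrayIndexOutOfBoundsException: 3".
seq, t = [], start
while t <= stop:
    seq.append(t)
    t = (t.replace(tzinfo=None) + step).replace(tzinfo=LA)
```

The same mismatch appears whenever "spring forward" crossings outnumber "fall back" crossings and the period divides evenly by the interval, as the description states.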
[jira] [Commented] (SPARK-40063) pyspark.pandas .apply() changing rows ordering
[ https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580518#comment-17580518 ] Hyukjin Kwon commented on SPARK-40063: -- [~marcelorossini] what's your default index type (the {{compute.default_index_type}} configuration)? > pyspark.pandas .apply() changing rows ordering > -- > > Key: SPARK-40063 > URL: https://issues.apache.org/jira/browse/SPARK-40063 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.3.0 > Environment: Databricks Runtime 11.1 >Reporter: Marcelo Rossini Castro >Priority: Minor > Labels: Pandas, PySpark > > When using the apply function on a DataFrame column, it > ends up changing the column's row ordering. > A command like this: > {code:java} > def example_func(df_col): > return df_col ** 2 > df['col_to_apply_function'] = df.apply(lambda row: > example_func(row['col_to_apply_function']), axis=1) {code} > A workaround is to assign the results to a new column instead of the same > one, but if the old column is dropped, the same error is produced. > Setting one column as the index also didn't work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40112) Improve the TO_BINARY() function
[ https://issues.apache.org/jira/browse/SPARK-40112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580517#comment-17580517 ] Apache Spark commented on SPARK-40112: -- User 'vitaliili-db' has created a pull request for this issue: https://github.com/apache/spark/pull/37483 > Improve the TO_BINARY() function > > > Key: SPARK-40112 > URL: https://issues.apache.org/jira/browse/SPARK-40112 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vitalii Li >Priority: Major > > Original description SPARK-37507: > {quote}to_binary(expr, fmt) is a common function available in many other > systems to provide a unified entry for string to binary data conversion, > where fmt can be utf8, base64, hex and base2 (or whatever the reverse > operation to_char()supports). > [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html] > [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html] > [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes] > [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw] > Related Spark functions: unbase64, unhex > {quote} > > Expected improvement: > * `base64` behaves more strictly, i.e. does not allow symbols not included > in base64 dictionary (A-Za-z0-9+/) and verifies correct padding and symbol > groups (see RFC 4648 § 4). Whitespaces are ignored. > ** Current implementation allows arbitrary strings and invalid symbols are > skipped. > * `hex` converts only valid hexadecimal strings and throws errors otherwise. > Whitespaces are not allowed. > * `utf-8` and `utf8` are interchangeable. 
> * Correct errors are thrown and classified for invalid input > (CONVERSION_INVALID_INPUT) and invalid format (CONVERSION_INVALID_FORMAT) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
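The proposed semantics can be illustrated with a small self-contained sketch (plain Python, not Spark's implementation; the error strings reuse the classification names from the description, and the exact edge-case behavior is an assumption):

```python
import base64

def to_binary(s: str, fmt: str) -> bytes:
    """Sketch of the stricter to_binary(expr, fmt) semantics described above."""
    f = fmt.lower()
    if f == "base64":
        # Strict: only dictionary symbols (A-Za-z0-9+/) with correct padding;
        # whitespace is ignored (stripped before decoding).
        try:
            return base64.b64decode("".join(s.split()), validate=True)
        except ValueError:
            raise ValueError("CONVERSION_INVALID_INPUT") from None
    if f == "hex":
        # Strict: valid hexadecimal only; whitespace is NOT allowed.
        if any(c.isspace() for c in s):
            raise ValueError("CONVERSION_INVALID_INPUT")
        try:
            return bytes.fromhex(s)
        except ValueError:
            raise ValueError("CONVERSION_INVALID_INPUT") from None
    if f in ("utf-8", "utf8"):  # interchangeable spellings
        return s.encode("utf-8")
    raise ValueError("CONVERSION_INVALID_FORMAT")
```

Note the contrast with the current behavior described above: invalid base64 symbols raise an error instead of being silently skipped.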
[jira] [Assigned] (SPARK-40112) Improve the TO_BINARY() function
[ https://issues.apache.org/jira/browse/SPARK-40112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40112: Assignee: Apache Spark > Improve the TO_BINARY() function > > > Key: SPARK-40112 > URL: https://issues.apache.org/jira/browse/SPARK-40112 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vitalii Li >Assignee: Apache Spark >Priority: Major > > Original description SPARK-37507: > {quote}to_binary(expr, fmt) is a common function available in many other > systems to provide a unified entry for string to binary data conversion, > where fmt can be utf8, base64, hex and base2 (or whatever the reverse > operation to_char()supports). > [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html] > [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html] > [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes] > [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw] > Related Spark functions: unbase64, unhex > {quote} > > Expected improvement: > * `base64` behaves more strictly, i.e. does not allow symbols not included > in base64 dictionary (A-Za-z0-9+/) and verifies correct padding and symbol > groups (see RFC 4648 § 4). Whitespaces are ignored. > ** Current implementation allows arbitrary strings and invalid symbols are > skipped. > * `hex` converts only valid hexadecimal strings and throws errors otherwise. > Whitespaces are not allowed. > * `utf-8` and `utf8` are interchangeable. > * Correct errors are thrown and classified for invalid input > (CONVERSION_INVALID_INPUT) and invalid format (CONVERSION_INVALID_FORMAT) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40112) Improve the TO_BINARY() function
[ https://issues.apache.org/jira/browse/SPARK-40112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40112: Assignee: (was: Apache Spark) > Improve the TO_BINARY() function > > > Key: SPARK-40112 > URL: https://issues.apache.org/jira/browse/SPARK-40112 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vitalii Li >Priority: Major > > Original description SPARK-37507: > {quote}to_binary(expr, fmt) is a common function available in many other > systems to provide a unified entry for string to binary data conversion, > where fmt can be utf8, base64, hex and base2 (or whatever the reverse > operation to_char()supports). > [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html] > [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html] > [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes] > [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw] > Related Spark functions: unbase64, unhex > {quote} > > Expected improvement: > * `base64` behaves more strictly, i.e. does not allow symbols not included > in base64 dictionary (A-Za-z0-9+/) and verifies correct padding and symbol > groups (see RFC 4648 § 4). Whitespaces are ignored. > ** Current implementation allows arbitrary strings and invalid symbols are > skipped. > * `hex` converts only valid hexadecimal strings and throws errors otherwise. > Whitespaces are not allowed. > * `utf-8` and `utf8` are interchangeable. > * Correct errors are thrown and classified for invalid input > (CONVERSION_INVALID_INPUT) and invalid format (CONVERSION_INVALID_FORMAT) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40112) Improve the TO_BINARY() function
[ https://issues.apache.org/jira/browse/SPARK-40112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vitalii Li updated SPARK-40112: --- Description: Original description SPARK-37507: {quote}to_binary(expr, fmt) is a common function available in many other systems to provide a unified entry for string to binary data conversion, where fmt can be utf8, base64, hex and base2 (or whatever the reverse operation to_char()supports). [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html] [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html] [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes] [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw] Related Spark functions: unbase64, unhex {quote} Expected improvement: * `base64` behaves more strictly, i.e. does not allow symbols not included in base64 dictionary (A-Za-z0-9+/) and verifies correct padding and symbol groups (see RFC 4648 § 4). Whitespaces are ignored. ** Current implementation allows arbitrary strings and invalid symbols are skipped. * `hex` converts only valid hexadecimal strings and throws errors otherwise. Whitespaces are not allowed. * `utf-8` and `utf8` are interchangeable. * Correct errors are thrown and classified for invalid input (CONVERSION_INVALID_INPUT) and invalid format (CONVERSION_INVALID_FORMAT) was: Original description SPARK-37507: {quote}to_binary(expr, fmt) is a common function available in many other systems to provide a unified entry for string to binary data conversion, where fmt can be utf8, base64, hex and base2 (or whatever the reverse operation to_char()supports). 
[https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html] [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html] [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes] [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw] Related Spark functions: unbase64, unhex {quote} Expected improvement: * `base64` behaves more strictly, i.e. does not allow symbols not included in base64 dictionary (A-Za-z0-9+/) and verifies correct padding and symbol groups (see RFC 4648 § 4). Whitespaces are ignored. ** Current implementation allows arbitrary strings and invalid symbols are skipped. * `hex` converts only valid hexadecimal strings and throws errors otherwise. Whitespaces are not allowed. * `utf-8` and `utf8` are interchangeable. * Correct errors are thrown and classified for invalid input and invalid format: > Improve the TO_BINARY() function > > > Key: SPARK-40112 > URL: https://issues.apache.org/jira/browse/SPARK-40112 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vitalii Li >Priority: Major > > Original description SPARK-37507: > {quote}to_binary(expr, fmt) is a common function available in many other > systems to provide a unified entry for string to binary data conversion, > where fmt can be utf8, base64, hex and base2 (or whatever the reverse > operation to_char()supports). > [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html] > [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html] > [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes] > [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw] > Related Spark functions: unbase64, unhex > {quote} > > Expected improvement: > * `base64` behaves more strictly, i.e. 
does not allow symbols not included > in base64 dictionary (A-Za-z0-9+/) and verifies correct padding and symbol > groups (see RFC 4648 § 4). Whitespaces are ignored. > ** Current implementation allows arbitrary strings and invalid symbols are > skipped. > * `hex` converts only valid hexadecimal strings and throws errors otherwise. > Whitespaces are not allowed. > * `utf-8` and `utf8` are interchangeable. > * Correct errors are thrown and classified for invalid input > (CONVERSION_INVALID_INPUT) and invalid format (CONVERSION_INVALID_FORMAT) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40112) Improve the TO_BINARY() function
Vitalii Li created SPARK-40112: -- Summary: Improve the TO_BINARY() function Key: SPARK-40112 URL: https://issues.apache.org/jira/browse/SPARK-40112 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Vitalii Li

Original description SPARK-37507: {quote}to_binary(expr, fmt) is a common function available in many other systems to provide a unified entry for string to binary data conversion, where fmt can be utf8, base64, hex and base2 (or whatever the reverse operation to_char() supports). [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html] [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html] [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes] [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw] Related Spark functions: unbase64, unhex {quote}

Expected improvement:
* `base64` behaves more strictly, i.e. it does not allow symbols outside the base64 dictionary (A-Za-z0-9+/) and verifies correct padding and symbol groups (see RFC 4648 § 4). Whitespace is ignored.
** The current implementation allows arbitrary strings, and invalid symbols are skipped.
* `hex` converts only valid hexadecimal strings and throws errors otherwise. Whitespace is not allowed.
* `utf-8` and `utf8` are interchangeable.
* Correct errors are thrown and classified for invalid input (CONVERSION_INVALID_INPUT) and invalid format (CONVERSION_INVALID_FORMAT).

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
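The stricter validation rules described in this ticket can be sketched in plain Python. This is an illustrative model only — the helper name `to_binary` and the exact regular expressions are assumptions, not Spark's Catalyst implementation — but it captures the listed behavior: strict base64 with RFC 4648 § 4 padding, hex without whitespace, and interchangeable `utf-8`/`utf8`:

```python
import base64
import binascii
import re

def to_binary(expr: str, fmt: str) -> bytes:
    """Illustrative model of the stricter to_binary(expr, fmt) rules above."""
    f = fmt.lower()
    if f in ("utf-8", "utf8"):
        # utf-8 and utf8 are interchangeable.
        return expr.encode("utf-8")
    if f == "hex":
        # Only valid hexadecimal strings; whitespace is not allowed.
        if not re.fullmatch(r"[0-9A-Fa-f]*", expr):
            raise ValueError(f"[CONVERSION_INVALID_INPUT] {expr!r} is not valid hex")
        # Pad odd-length input with a leading zero, as Spark's unhex does.
        return binascii.unhexlify(expr if len(expr) % 2 == 0 else "0" + expr)
    if f == "base64":
        # Whitespace is ignored; the alphabet, grouping, and padding rules of
        # RFC 4648 section 4 are enforced before decoding.
        s = re.sub(r"\s+", "", expr)
        groups = r"(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
        if not re.fullmatch(groups, s):
            raise ValueError(f"[CONVERSION_INVALID_INPUT] {expr!r} is not valid base64")
        return base64.b64decode(s)
    raise ValueError(f"[CONVERSION_INVALID_FORMAT] unsupported format {fmt!r}")
```

Unlike the current implementation, which skips invalid base64 symbols, this model rejects the whole input — matching the "behaves more strictly" requirement.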
[jira] [Created] (SPARK-40111) Make pyspark.rdd examples self-contained
Ruifeng Zheng created SPARK-40111: - Summary: Make pyspark.rdd examples self-contained Key: SPARK-40111 URL: https://issues.apache.org/jira/browse/SPARK-40111 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40111) Make pyspark.rdd examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-40111: - Assignee: Ruifeng Zheng > Make pyspark.rdd examples self-contained > > > Key: SPARK-40111 > URL: https://issues.apache.org/jira/browse/SPARK-40111 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40110) Add JDBCWithAQESuite
Kazuyuki Tanimura created SPARK-40110: - Summary: Add JDBCWithAQESuite Key: SPARK-40110 URL: https://issues.apache.org/jira/browse/SPARK-40110 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 3.4.0 Reporter: Kazuyuki Tanimura Currently `JDBCSuite` assumes that AQE is always turned off. We should also test with AQE turned on. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40109) New SQL function: get()
[ https://issues.apache.org/jira/browse/SPARK-40109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580502#comment-17580502 ] Apache Spark commented on SPARK-40109: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/37541 > New SQL function: get() > --- > > Key: SPARK-40109 > URL: https://issues.apache.org/jira/browse/SPARK-40109 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Currently, when accessing array element with invalid index under ANSI SQL > mode, the error is like: > {quote}[INVALID_ARRAY_INDEX] The index -1 is out of bounds. The array has 3 > elements. Use `try_element_at` and increase the array index by 1(the starting > array index is 1 for `try_element_at`) to tolerate accessing element at > invalid index and return NULL instead. If necessary set > "spark.sql.ansi.enabled" to "false" to bypass this error. > {quote} > The provided solution is complicated. I suggest introducing a new method > get() which always returns null on an invalid array index. This is from > [https://docs.snowflake.com/en/sql-reference/functions/get.html.] > Since Spark's map access always returns null, let's don't support map type in > the get method for now. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40109) New SQL function: get()
[ https://issues.apache.org/jira/browse/SPARK-40109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580501#comment-17580501 ] Apache Spark commented on SPARK-40109: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/37541 > New SQL function: get() > --- > > Key: SPARK-40109 > URL: https://issues.apache.org/jira/browse/SPARK-40109 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Currently, when accessing array element with invalid index under ANSI SQL > mode, the error is like: > {quote}[INVALID_ARRAY_INDEX] The index -1 is out of bounds. The array has 3 > elements. Use `try_element_at` and increase the array index by 1(the starting > array index is 1 for `try_element_at`) to tolerate accessing element at > invalid index and return NULL instead. If necessary set > "spark.sql.ansi.enabled" to "false" to bypass this error. > {quote} > The provided solution is complicated. I suggest introducing a new method > get() which always returns null on an invalid array index. This is from > [https://docs.snowflake.com/en/sql-reference/functions/get.html.] > Since Spark's map access always returns null, let's don't support map type in > the get method for now. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40109) New SQL function: get()
[ https://issues.apache.org/jira/browse/SPARK-40109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40109: Assignee: Apache Spark (was: Gengliang Wang) > New SQL function: get() > --- > > Key: SPARK-40109 > URL: https://issues.apache.org/jira/browse/SPARK-40109 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Currently, when accessing array element with invalid index under ANSI SQL > mode, the error is like: > {quote}[INVALID_ARRAY_INDEX] The index -1 is out of bounds. The array has 3 > elements. Use `try_element_at` and increase the array index by 1(the starting > array index is 1 for `try_element_at`) to tolerate accessing element at > invalid index and return NULL instead. If necessary set > "spark.sql.ansi.enabled" to "false" to bypass this error. > {quote} > The provided solution is complicated. I suggest introducing a new method > get() which always returns null on an invalid array index. This is from > [https://docs.snowflake.com/en/sql-reference/functions/get.html.] > Since Spark's map access always returns null, let's don't support map type in > the get method for now. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40109) New SQL function: get()
[ https://issues.apache.org/jira/browse/SPARK-40109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40109: Assignee: Gengliang Wang (was: Apache Spark) > New SQL function: get() > --- > > Key: SPARK-40109 > URL: https://issues.apache.org/jira/browse/SPARK-40109 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Currently, when accessing array element with invalid index under ANSI SQL > mode, the error is like: > {quote}[INVALID_ARRAY_INDEX] The index -1 is out of bounds. The array has 3 > elements. Use `try_element_at` and increase the array index by 1(the starting > array index is 1 for `try_element_at`) to tolerate accessing element at > invalid index and return NULL instead. If necessary set > "spark.sql.ansi.enabled" to "false" to bypass this error. > {quote} > The provided solution is complicated. I suggest introducing a new method > get() which always returns null on an invalid array index. This is from > [https://docs.snowflake.com/en/sql-reference/functions/get.html.] > Since Spark's map access always returns null, let's don't support map type in > the get method for now. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40109) New SQL function: get()
Gengliang Wang created SPARK-40109: -- Summary: New SQL function: get() Key: SPARK-40109 URL: https://issues.apache.org/jira/browse/SPARK-40109 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Gengliang Wang Assignee: Gengliang Wang

Currently, when accessing an array element with an invalid index under ANSI SQL mode, the error is like: {quote}[INVALID_ARRAY_INDEX] The index -1 is out of bounds. The array has 3 elements. Use `try_element_at` and increase the array index by 1(the starting array index is 1 for `try_element_at`) to tolerate accessing element at invalid index and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. {quote}

The provided solution is complicated. I suggest introducing a new method get() which always returns null on an invalid array index. This is from [https://docs.snowflake.com/en/sql-reference/functions/get.html]. Since Spark's map access always returns null, let's not support the map type in the get method for now.

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
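The proposed semantics can be sketched as a small pure-Python model — a hypothetical illustration, not Spark's implementation. It assumes 0-based indexing as in the linked Snowflake GET function, with `None` standing in for SQL NULL:

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")

def get(arr: Optional[Sequence[T]], index: Optional[int]) -> Optional[T]:
    # 0-based access that returns None (standing in for SQL NULL) instead of
    # raising INVALID_ARRAY_INDEX when the index is negative or out of bounds.
    if arr is None or index is None:
        return None
    if index < 0 or index >= len(arr):
        return None  # array access under ANSI mode would raise here
    return arr[index]
```

Compared with the `try_element_at` workaround quoted in the error message, the caller no longer needs to shift to 1-based indexing to tolerate invalid indices.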
[jira] [Created] (SPARK-40108) JDBC connection to Hive Metastore fails without first calling any .jdbc call
In-Ho Yi created SPARK-40108: Summary: JDBC connection to Hive Metastore fails without first calling any .jdbc call Key: SPARK-40108 URL: https://issues.apache.org/jira/browse/SPARK-40108 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.3.0 Environment: PySpark==3.3.0 Java 11 Reporter: In-Ho Yi

Tested on pyspark==3.3.0. When talking to a Hive metastore with a MySQL backend, I installed the MySQL driver with spark.jars.packages, along with other necessary settings:

ss = SparkSession.builder.master('local[*]') \
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:3.3.3," +
            "org.apache.hadoop:hadoop-common:3.3.3,mysql:mysql-connector-java:8.0.30") \
    .config("spark.executor.memory", "10g") \
    .config("spark.driver.memory", "10g") \
    .config("spark.memory.offHeap.enabled", "true") \
    .config("spark.memory.offHeap.size", "32g") \
    .config("spark.hadoop.javax.jdo.option.ConnectionURL", "jdbc:mysql://localhost:3306/hive") \
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "") \
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "") \
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "com.mysql.cj.jdbc.Driver") \
    .config("spark.sql.hive.metastore.sharedPrefixes", "com.mysql") \
    .config("spark.sql.warehouse.dir", "s3://-/") \
    .enableHiveSupport() \
    .appName("hms_test").config(conf=conf).getOrCreate()

Now, if I just do:

ss.sql("SHOW DATABASES;").show()

I get a lot of errors, saying:

Unable to open a test connection to the given database. JDBC url = jdbc:mysql://localhost:3306/hive, username = . Terminating connection pool (set lazyInit to true if you expect to start your database after your app).
Original Exception: -- java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/hive at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:702) at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:189) at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361) at com.jolbox.bonecp.BoneCP.(BoneCP.java:416) ...

However, if I do any "jdbc" read, even if the call ends up in an error, then the call to the Hive Metastore seems to succeed without any issue:

try:
    _ = ss.read.format("jdbc") \
        .option("url", "jdbc:mysql://localhost:3306/hive") \
        .option("query", "SHOW TABLES;") \
        .option("driver", "com.mysql.cj.jdbc.Driver").load()
except:
    pass

ss.sql("SHOW DATABASES;").show()  # this now works fine.

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40063) pyspark.pandas .apply() changing rows ordering
[ https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580459#comment-17580459 ] Marcelo Rossini Castro edited comment on SPARK-40063 at 8/16/22 8:33 PM: - Yes, it doesn't change the other ones. When I do this on the same column, replacing the data, the order on this column changes. But if I assign the results to a new column, I get the right order, but if I drop the old one, I get the same problem again on the new one. About the compute.ordered_head, I actually tried it, but it didn't help. was (Author: JIRAUSER294354): Yes, when I do this on the same column, replacing the data, the order changes. But if I assign the results to a new column, I get the right order, but if I drop the old one, I get the same problem again. About the compute.ordered_head, I actually tried it, but it didn't help. > pyspark.pandas .apply() changing rows ordering > -- > > Key: SPARK-40063 > URL: https://issues.apache.org/jira/browse/SPARK-40063 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.3.0 > Environment: Databricks Runtime 11.1 >Reporter: Marcelo Rossini Castro >Priority: Minor > Labels: Pandas, PySpark > > When using the apply function to apply a function to a DataFrame column, it > ends up mixing the column's rows ordering. > A command like this: > {code:java} > def example_func(df_col): > return df_col ** 2 > df['col_to_apply_function'] = df.apply(lambda row: > example_func(row['col_to_apply_function']), axis=1) {code} > A workaround is to assign the results to a new column instead of the same > one, but if the old column is dropped, the same error is produced. > Setting one column as index also didn't work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40063) pyspark.pandas .apply() changing rows ordering
[ https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580459#comment-17580459 ] Marcelo Rossini Castro commented on SPARK-40063: Yes, when I do this on the same column, replacing the data, the order changes. But if I assign the results to a new column, I get the right order, but if I drop the old one, I get the same problem again. About the compute.ordered_head, I actually tried it, but it didn't help. > pyspark.pandas .apply() changing rows ordering > -- > > Key: SPARK-40063 > URL: https://issues.apache.org/jira/browse/SPARK-40063 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.3.0 > Environment: Databricks Runtime 11.1 >Reporter: Marcelo Rossini Castro >Priority: Minor > Labels: Pandas, PySpark > > When using the apply function to apply a function to a DataFrame column, it > ends up mixing the column's rows ordering. > A command like this: > {code:java} > def example_func(df_col): > return df_col ** 2 > df['col_to_apply_function'] = df.apply(lambda row: > example_func(row['col_to_apply_function']), axis=1) {code} > A workaround is to assign the results to a new column instead of the same > one, but if the old column is dropped, the same error is produced. > Setting one column as index also didn't work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36511) Remove ColumnIO once PARQUET-2050 is released in Parquet 1.13
[ https://issues.apache.org/jira/browse/SPARK-36511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580448#comment-17580448 ] Apache Spark commented on SPARK-36511: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/37529 > Remove ColumnIO once PARQUET-2050 is released in Parquet 1.13 > - > > Key: SPARK-36511 > URL: https://issues.apache.org/jira/browse/SPARK-36511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > > {{ColumnIO}} doesn't expose methods to get repetition and definition level so > Spark has to use a hack. This should be removed once PARQUET-2050 is released. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36511) Remove ColumnIO once PARQUET-2050 is released in Parquet 1.13
[ https://issues.apache.org/jira/browse/SPARK-36511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-36511. -- Fix Version/s: 3.4.0 Assignee: BingKun Pan Resolution: Fixed Resolved by https://github.com/apache/spark/pull/37529 > Remove ColumnIO once PARQUET-2050 is released in Parquet 1.13 > - > > Key: SPARK-36511 > URL: https://issues.apache.org/jira/browse/SPARK-36511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > > {{ColumnIO}} doesn't expose methods to get repetition and definition level so > Spark has to use a hack. This should be removed once PARQUET-2050 is released. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.
[ https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40089: Assignee: (was: Apache Spark) > Sorting of at least Decimal(20, 2) fails for some values near the max. > -- > > Key: SPARK-40089 > URL: https://issues.apache.org/jira/browse/SPARK-40089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Robert Joseph Evans >Priority: Major > Attachments: input.parquet > > > I have been doing some testing with Decimal values for the RAPIDS Accelerator > for Apache Spark. I have been trying to add in new corner cases and when I > tried to enable the maximum supported value for a sort I started to get > failures. On closer inspection it looks like the CPU is sorting things > incorrectly. Specifically anything that is "99.50" or above > is placed as a chunk in the wrong location in the outputs. > In local mode with 12 tasks. > {code:java} > spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println) > {code} > > Here you will notice that the last entry printed is > {{[99.49]}}, and {{[99.99]}} is near the top > near {{[-99.99]}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.
[ https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40089: Assignee: Apache Spark > Sorting of at least Decimal(20, 2) fails for some values near the max. > -- > > Key: SPARK-40089 > URL: https://issues.apache.org/jira/browse/SPARK-40089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Robert Joseph Evans >Assignee: Apache Spark >Priority: Major > Attachments: input.parquet > > > I have been doing some testing with Decimal values for the RAPIDS Accelerator > for Apache Spark. I have been trying to add in new corner cases and when I > tried to enable the maximum supported value for a sort I started to get > failures. On closer inspection it looks like the CPU is sorting things > incorrectly. Specifically anything that is "99.50" or above > is placed as a chunk in the wrong location in the outputs. > In local mode with 12 tasks. > {code:java} > spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println) > {code} > > Here you will notice that the last entry printed is > {{[99.49]}}, and {{[99.99]}} is near the top > near {{[-99.99]}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.
[ https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40089: Assignee: Apache Spark > Sorting of at least Decimal(20, 2) fails for some values near the max. > -- > > Key: SPARK-40089 > URL: https://issues.apache.org/jira/browse/SPARK-40089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Robert Joseph Evans >Assignee: Apache Spark >Priority: Major > Attachments: input.parquet > > > I have been doing some testing with Decimal values for the RAPIDS Accelerator > for Apache Spark. I have been trying to add in new corner cases and when I > tried to enable the maximum supported value for a sort I started to get > failures. On closer inspection it looks like the CPU is sorting things > incorrectly. Specifically anything that is "99.50" or above > is placed as a chunk in the wrong location in the outputs. > In local mode with 12 tasks. > {code:java} > spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println) > {code} > > Here you will notice that the last entry printed is > {{[99.49]}}, and {{[99.99]}} is near the top > near {{[-99.99]}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.
[ https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580437#comment-17580437 ] Apache Spark commented on SPARK-40089: -- User 'revans2' has created a pull request for this issue: https://github.com/apache/spark/pull/37540 > Sorting of at least Decimal(20, 2) fails for some values near the max. > -- > > Key: SPARK-40089 > URL: https://issues.apache.org/jira/browse/SPARK-40089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Robert Joseph Evans >Priority: Major > Attachments: input.parquet > > > I have been doing some testing with Decimal values for the RAPIDS Accelerator > for Apache Spark. I have been trying to add in new corner cases and when I > tried to enable the maximum supported value for a sort I started to get > failures. On closer inspection it looks like the CPU is sorting things > incorrectly. Specifically anything that is "99.50" or above > is placed as a chunk in the wrong location in the outputs. > In local mode with 12 tasks. > {code:java} > spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println) > {code} > > Here you will notice that the last entry printed is > {{[99.49]}}, and {{[99.99]}} is near the top > near {{[-99.99]}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.
[ https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580434#comment-17580434 ] Robert Joseph Evans commented on SPARK-40089: - I put up a PR https://github.com/apache/spark/pull/37540 > Sorting of at least Decimal(20, 2) fails for some values near the max. > -- > > Key: SPARK-40089 > URL: https://issues.apache.org/jira/browse/SPARK-40089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Robert Joseph Evans >Priority: Major > Attachments: input.parquet > > > I have been doing some testing with Decimal values for the RAPIDS Accelerator > for Apache Spark. I have been trying to add in new corner cases and when I > tried to enable the maximum supported value for a sort I started to get > failures. On closer inspection it looks like the CPU is sorting things > incorrectly. Specifically anything that is "99.50" or above > is placed as a chunk in the wrong location in the outputs. > In local mode with 12 tasks. > {code:java} > spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println) > {code} > > Here you will notice that the last entry printed is > {{[99.49]}}, and {{[99.99]}} is near the top > near {{[-99.99]}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
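As a reference for the expected behavior in this sort bug, Python's arbitrary-precision decimal module sorts such values correctly. The boundary values below are illustrative stand-ins (assuming Decimal(20, 2) tops out at 10^18 − 0.01) chosen to mirror the elided near-maximum values in the report:

```python
from decimal import Decimal

# Illustrative values near the maximum of a Decimal(20, 2) column,
# i.e. 18 integer digits and 2 fractional digits.
MAX_20_2 = Decimal("999999999999999999.99")
values = [
    MAX_20_2,
    Decimal("-999999999999999999.99"),
    Decimal("999999999999999999.49"),
    Decimal("999999999999999999.50"),
    Decimal("0.00"),
]

# A correct sort keeps the near-max values at the very end, with the .50
# value immediately after .49; the reported bug instead places everything
# from ...99.50 upward as a misordered chunk near the minimum.
expected = sorted(values)
```

Comparing Spark's `orderBy` output against a reference ordering like this is essentially what the repro in the description does with the attached parquet file.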
[jira] [Comment Edited] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]
[ https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580431#comment-17580431 ] Maksim Grinman edited comment on SPARK-38330 at 8/16/22 7:38 PM: - Apologize ahead of time as I did not understand some of the discussion. Is there a way to work-around this issue while waiting for a version of Spark which uses hadoop 3.3.4 (Spark 3.4?)? It seems like anyone using Spark with AWS s3 is stuck on 3.1.2 until then (although AWS EMR has latest releases claiming to work with Spark 3.2 somehow). was (Author: JIRAUSER290629): Apologize ahead of time as I did not understand some of the discussion. Is there a way to work-around this issue while waiting for a version of Spark which uses hadoop 3.3.4 (Spark 3.4?)? > Certificate doesn't match any of the subject alternative names: > [*.s3.amazonaws.com, s3.amazonaws.com] > -- > > Key: SPARK-38330 > URL: https://issues.apache.org/jira/browse/SPARK-38330 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 3.2.1 > Environment: Spark 3.2.1 built with `hadoop-cloud` flag. > Direct access to s3 using default file committer. > JDK8. > >Reporter: André F. 
>Priority: Major > > Trying to run any job after bumping our Spark version from 3.1.2 to 3.2.1, > lead us to the current exception while reading files on s3: > {code:java} > org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on > s3a:///.parquet: com.amazonaws.SdkClientException: Unable to > execute HTTP request: Certificate for doesn't match > any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: > Unable to execute HTTP request: Certificate for doesn't match any of > the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208) at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170) at > org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277) > at > org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245) > at scala.Option.getOrElse(Option.scala:189) at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245) at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596) {code} > > {code:java} > Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for > doesn't match any of the subject alternative names: > [*.s3.amazonaws.com, s3.amazonaws.com] > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:507) > at > 
com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:437) > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384) > at > com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) > at > com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) > at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76) > at com.amazonaws.http.conn.$Proxy16.connect(Unknown Source) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) > at > com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) >
[jira] [Commented] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]
[ https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580431#comment-17580431 ] Maksim Grinman commented on SPARK-38330: Apologize ahead of time as I did not understand some of the discussion. Is there a way to work-around this issue while waiting for a version of Spark which uses hadoop 3.3.4 (Spark 3.4?)? > Certificate doesn't match any of the subject alternative names: > [*.s3.amazonaws.com, s3.amazonaws.com] > -- > > Key: SPARK-38330 > URL: https://issues.apache.org/jira/browse/SPARK-38330 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 3.2.1 > Environment: Spark 3.2.1 built with `hadoop-cloud` flag. > Direct access to s3 using default file committer. > JDK8. > >Reporter: André F. >Priority: Major > > Trying to run any job after bumping our Spark version from 3.1.2 to 3.2.1, > lead us to the current exception while reading files on s3: > {code:java} > org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on > s3a:///.parquet: com.amazonaws.SdkClientException: Unable to > execute HTTP request: Certificate for doesn't match > any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: > Unable to execute HTTP request: Certificate for doesn't match any of > the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208) at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170) at > org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277) > at > org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370) > at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245) > at scala.Option.getOrElse(Option.scala:189) at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245) at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596) {code} > > {code:java} > Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for > doesn't match any of the subject alternative names: > [*.s3.amazonaws.com, s3.amazonaws.com] > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:507) > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:437) > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384) > at > com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) > at > com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) > at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76) > at com.amazonaws.http.conn.$Proxy16.connect(Unknown Source) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) > at > 
com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) > at > com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) > at > com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) > at > com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.e
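A workaround often suggested for this class of failure (a bucket name containing dots failing the `*.s3.amazonaws.com` wildcard-certificate check) is to switch the S3A connector to path-style requests, so the bucket name is no longer part of the TLS hostname. This is a hedged sketch, assuming the cluster is configured via `spark-defaults.conf`; verify the setting against your Hadoop/S3A version before relying on it:

```
# Path-style requests ("https://s3.amazonaws.com/bucket/key") keep a dotted
# bucket name out of the hostname matched against *.s3.amazonaws.com.
spark.hadoop.fs.s3a.path.style.access  true
```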
[jira] [Assigned] (SPARK-40107) Pull out empty2null conversion from FileFormatWriter
[ https://issues.apache.org/jira/browse/SPARK-40107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40107: Assignee: (was: Apache Spark) > Pull out empty2null conversion from FileFormatWriter > > > Key: SPARK-40107 > URL: https://issues.apache.org/jira/browse/SPARK-40107 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Priority: Major > > This is a follow-up for SPARK-37287. We can pull out the physical project to > convert empty string partition columns to null in `FileFormatWriter` into > logical planning as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40107) Pull out empty2null conversion from FileFormatWriter
[ https://issues.apache.org/jira/browse/SPARK-40107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40107: Assignee: Apache Spark > Pull out empty2null conversion from FileFormatWriter > > > Key: SPARK-40107 > URL: https://issues.apache.org/jira/browse/SPARK-40107 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Assignee: Apache Spark >Priority: Major > > This is a follow-up for SPARK-37287. We can pull out the physical project to > convert empty string partition columns to null in `FileFormatWriter` into > logical planning as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40107) Pull out empty2null conversion from FileFormatWriter
[ https://issues.apache.org/jira/browse/SPARK-40107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580426#comment-17580426 ] Apache Spark commented on SPARK-40107: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/37539 > Pull out empty2null conversion from FileFormatWriter > > > Key: SPARK-40107 > URL: https://issues.apache.org/jira/browse/SPARK-40107 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Priority: Major > > This is a follow-up for SPARK-37287. We can pull out the physical project to > convert empty string partition columns to null in `FileFormatWriter` into > logical planning as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40107) Pull out empty2null conversion from FileFormatWriter
[ https://issues.apache.org/jira/browse/SPARK-40107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-40107: - Description: This is a follow-up for SPARK-37287. We can pull out the physical project to convert empty string partition columns to null in `FileFormatWriter` into logical planning as well. (was: This is a follow-up for SPARK-37287. ) > Pull out empty2null conversion from FileFormatWriter > > > Key: SPARK-40107 > URL: https://issues.apache.org/jira/browse/SPARK-40107 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Priority: Major > > This is a follow-up for SPARK-37287. We can pull out the physical project to > convert empty string partition columns to null in `FileFormatWriter` into > logical planning as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40107) Pull out empty2null conversion from FileFormatWriter
Allison Wang created SPARK-40107: Summary: Pull out empty2null conversion from FileFormatWriter Key: SPARK-40107 URL: https://issues.apache.org/jira/browse/SPARK-40107 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Allison Wang This is a follow-up for SPARK-37287. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
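The empty2null conversion this issue refers to can be illustrated outside Spark. A minimal sketch, assuming the semantics described in SPARK-37287 (empty-string partition values are converted to null when writing partitioned output, so they land in the default partition instead of producing an invalid empty directory name):

```python
# Hypothetical illustration of the "empty2null" conversion: empty-string
# partition values become null before the partition layout is computed.
def empty2null(partition_value):
    return None if partition_value == "" else partition_value

rows = [("a", "2024"), ("b", ""), ("c", None)]
converted = [(value, empty2null(part)) for value, part in rows]
print(converted)  # [('a', '2024'), ('b', None), ('c', None)]
```

The issue proposes moving this projection out of `FileFormatWriter`'s physical planning and into logical planning, where the optimizer can see it.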
[jira] [Updated] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.
[ https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated SPARK-40089: Summary: Sorting of at least Decimal(20, 2) fails for some values near the max. (was: Doring of at least Decimal(20, 2) fails for some values near the max.) > Sorting of at least Decimal(20, 2) fails for some values near the max. > -- > > Key: SPARK-40089 > URL: https://issues.apache.org/jira/browse/SPARK-40089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Robert Joseph Evans >Priority: Major > Attachments: input.parquet > > > I have been doing some testing with Decimal values for the RAPIDS Accelerator > for Apache Spark. I have been trying to add in new corner cases and when I > tried to enable the maximum supported value for a sort I started to get > failures. On closer inspection it looks like the CPU is sorting things > incorrectly. Specifically anything that is "99.50" or above > is placed as a chunk in the wrong location in the outputs. > In local mode with 12 tasks. > {code:java} > spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println) > {code} > > Here you will notice that the last entry printed is > {{[99.49]}}, and {{[99.99]}} is near the top > near {{[-99.99]}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.
[ https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580377#comment-17580377 ] Robert Joseph Evans commented on SPARK-40089: - Never mind I figured out that there is a separate prefixComparator that does the same kinds of checks. But I have a fix that works, so I will put up a PR shortly. > Doring of at least Decimal(20, 2) fails for some values near the max. > - > > Key: SPARK-40089 > URL: https://issues.apache.org/jira/browse/SPARK-40089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Robert Joseph Evans >Priority: Major > Attachments: input.parquet > > > I have been doing some testing with Decimal values for the RAPIDS Accelerator > for Apache Spark. I have been trying to add in new corner cases and when I > tried to enable the maximum supported value for a sort I started to get > failures. On closer inspection it looks like the CPU is sorting things > incorrectly. Specifically anything that is "99.50" or above > is placed as a chunk in the wrong location in the outputs. > In local mode with 12 tasks. > {code:java} > spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println) > {code} > > Here you will notice that the last entry printed is > {{[99.49]}}, and {{[99.99]}} is near the top > near {{[-99.99]}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37442) In AQE, wrong InMemoryRelation size estimation causes "Cannot broadcast the table that is larger than 8GB: 8 GB" failure
[ https://issues.apache.org/jira/browse/SPARK-37442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580369#comment-17580369 ] Dongjoon Hyun commented on SPARK-37442: --- Hi, [~irelandbird]. Apache Spark 2.4 and 3.0 are End-Of-Life releases. Please try to use the latest Apache Spark version like 3.3.0. > In AQE, wrong InMemoryRelation size estimation causes "Cannot broadcast the > table that is larger than 8GB: 8 GB" failure > > > Key: SPARK-37442 > URL: https://issues.apache.org/jira/browse/SPARK-37442 > Project: Spark > Issue Type: Sub-task > Components: Optimizer, SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: Michael Chen >Assignee: Michael Chen >Priority: Major > Fix For: 3.2.1, 3.3.0 > > > There is a period in time where an InMemoryRelation will have the cached > buffers loaded, but the statistics will be inaccurate (anywhere between 0 -> > size in bytes reported by accumulators). When AQE is enabled, it is possible > that join planning strategies will happen in this window. In this scenario, > join children sizes including InMemoryRelation are greatly underestimated and > a broadcast join can be planned when it shouldn't be. We have seen scenarios > where a broadcast join is planned with the builder size greater than 8GB > because at planning time, the optimizer believes the InMemoryRelation is 0 > bytes. > Here is an example test case where the broadcast threshold is being ignored. > It can mimic the 8GB error by increasing the size of the tables. 
> {code:java} > withSQLConf( > SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true", > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "1048584") { > // Spark estimates a string column as 20 bytes so with 60k rows, these > relations should be > // estimated at ~120m bytes which is greater than the broadcast join > threshold > Seq.fill(6)("a").toDF("key") > .createOrReplaceTempView("temp") > Seq.fill(6)("b").toDF("key") > .createOrReplaceTempView("temp2") > Seq("a").toDF("key").createOrReplaceTempView("smallTemp") > spark.sql("SELECT key as newKey FROM temp").persist() > val query = > s""" > |SELECT t3.newKey > |FROM > | (SELECT t1.newKey > | FROM (SELECT key as newKey FROM temp) as t1 > |JOIN > |(SELECT key FROM smallTemp) as t2 > |ON t1.newKey = t2.key > | ) as t3 > | JOIN > | (SELECT key FROM temp2) as t4 > | ON t3.newKey = t4.key > |UNION > |SELECT t1.newKey > |FROM > |(SELECT key as newKey FROM temp) as t1 > |JOIN > |(SELECT key FROM temp2) as t2 > |ON t1.newKey = t2.key > |""".stripMargin > val df = spark.sql(query) > df.collect() > val adaptivePlan = df.queryExecution.executedPlan > val bhj = findTopLevelBroadcastHashJoin(adaptivePlan) > assert(bhj.length == 1) {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.
[ https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580360#comment-17580360 ] Robert Joseph Evans commented on SPARK-40089: - I have been trying to come up with a patch, but keep hitting some issues. I first tried to change {code} case dt: DecimalType if dt.precision - dt.scale <= Decimal.MAX_LONG_DIGITS => {code} to {code} case dt: DecimalType if dt.precision - dt.scale < Decimal.MAX_LONG_DIGITS => {code} So that we would bypass the overflow case entirely and use the Double prefix logic. But when I do that the negative values all come after the positive values when sorting ascending. So now I have a lot of other tests/debugging that I need to run to understand what is happening there. Just because I think I have found another bug. [~ulysses] I don't have a ton of time that I can devote to this right now, I will keep working towards a patch, but if you want to put up one, then I would love to see it. > Doring of at least Decimal(20, 2) fails for some values near the max. > - > > Key: SPARK-40089 > URL: https://issues.apache.org/jira/browse/SPARK-40089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Robert Joseph Evans >Priority: Major > Attachments: input.parquet > > > I have been doing some testing with Decimal values for the RAPIDS Accelerator > for Apache Spark. I have been trying to add in new corner cases and when I > tried to enable the maximum supported value for a sort I started to get > failures. On closer inspection it looks like the CPU is sorting things > incorrectly. Specifically anything that is "99.50" or above > is placed as a chunk in the wrong location in the outputs. > In local mode with 12 tasks. 
> {code:java} > spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println) > {code} > > Here you will notice that the last entry printed is > {{[99.49]}}, and {{[99.99]}} is near the top > near {{[-99.99]}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
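The overflow discussed in this thread can be sanity-checked outside Spark: a `Decimal(20, 2)` has 18 integer digits, so its unscaled representation (value × 10^scale) needs up to 20 digits and no longer fits in a signed 64-bit long. A quick sketch in plain Python, where the constants mirror the JVM's `Long.MaxValue` and Spark's `Decimal.MAX_LONG_DIGITS` (the sketch is an illustration of the arithmetic, not Spark's actual prefix code):

```python
# The long-based sort-prefix path encodes a decimal via its unscaled value
# (value * 10^scale).  For Decimal(20, 2) that value can have 20 digits,
# which overflows a signed 64-bit long -- consistent with values "near the
# max" being misplaced in the sorted output.
LONG_MAX = 2**63 - 1            # JVM Long.MaxValue, roughly 9.22e18
MAX_LONG_DIGITS = 18            # Spark's Decimal.MAX_LONG_DIGITS

max_unscaled_20_2 = 10**20 - 1  # largest Decimal(20, 2), times 10^2
print(max_unscaled_20_2 > LONG_MAX)  # True: the long prefix overflows

# By contrast, an 18-digit unscaled value still fits:
print(10**18 - 1 <= LONG_MAX)        # True
```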
[jira] [Assigned] (SPARK-40106) Task failure handlers should always run if the task failed
[ https://issues.apache.org/jira/browse/SPARK-40106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40106: Assignee: (was: Apache Spark) > Task failure handlers should always run if the task failed > -- > > Key: SPARK-40106 > URL: https://issues.apache.org/jira/browse/SPARK-40106 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Ryan Johnson >Priority: Major > > Today, if a task body succeeds, but a task completion listener fails, task > failure listeners are not called -- even tho the task has indeed failed at > that point. > If a completion listener fails, and failure listeners were not previously > invoked, we should invoke them before running the remaining completion > listeners. > Such a change would increase the utility of task listeners, especially ones > intended to assist with task cleanup. > To give one arbitrary example, code like this appears at several places in > the code (taken from {{executeTask}} method of FileFormatWriter.scala): > {code:java} > try { > Utils.tryWithSafeFinallyAndFailureCallbacks(block = { > // Execute the task to write rows out and commit the task. > dataWriter.writeWithIterator(iterator) > dataWriter.commit() > })(catchBlock = { > // If there is an error, abort the task > dataWriter.abort() > logError(s"Job $jobId aborted.") > }, finallyBlock = { > dataWriter.close() > }) > } catch { > case e: FetchFailedException => > throw e > case f: FileAlreadyExistsException if > SQLConf.get.fastFailFileFormatOutput => > // If any output file to write already exists, it does not make sense > to re-run this task. > // We throw the exception and let Executor throw ExceptionFailure to > abort the job. 
> throw new TaskOutputFileAlreadyExistException(f) > case t: Throwable => > throw QueryExecutionErrors.taskFailedWhileWritingRowsError(t) > }{code} > If failure listeners were reliably called, the above idiom could potentially > be factored out as two failure listeners plus a completion listener, and > reused rather than duplicated. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40106) Task failure handlers should always run if the task failed
[ https://issues.apache.org/jira/browse/SPARK-40106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40106: Assignee: Apache Spark > Task failure handlers should always run if the task failed > -- > > Key: SPARK-40106 > URL: https://issues.apache.org/jira/browse/SPARK-40106 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Ryan Johnson >Assignee: Apache Spark >Priority: Major > > Today, if a task body succeeds, but a task completion listener fails, task > failure listeners are not called -- even tho the task has indeed failed at > that point. > If a completion listener fails, and failure listeners were not previously > invoked, we should invoke them before running the remaining completion > listeners. > Such a change would increase the utility of task listeners, especially ones > intended to assist with task cleanup. > To give one arbitrary example, code like this appears at several places in > the code (taken from {{executeTask}} method of FileFormatWriter.scala): > {code:java} > try { > Utils.tryWithSafeFinallyAndFailureCallbacks(block = { > // Execute the task to write rows out and commit the task. > dataWriter.writeWithIterator(iterator) > dataWriter.commit() > })(catchBlock = { > // If there is an error, abort the task > dataWriter.abort() > logError(s"Job $jobId aborted.") > }, finallyBlock = { > dataWriter.close() > }) > } catch { > case e: FetchFailedException => > throw e > case f: FileAlreadyExistsException if > SQLConf.get.fastFailFileFormatOutput => > // If any output file to write already exists, it does not make sense > to re-run this task. > // We throw the exception and let Executor throw ExceptionFailure to > abort the job. 
> throw new TaskOutputFileAlreadyExistException(f) > case t: Throwable => > throw QueryExecutionErrors.taskFailedWhileWritingRowsError(t) > }{code} > If failure listeners were reliably called, the above idiom could potentially > be factored out as two failure listeners plus a completion listener, and > reused rather than duplicated. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40106) Task failure handlers should always run if the task failed
[ https://issues.apache.org/jira/browse/SPARK-40106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580339#comment-17580339 ] Apache Spark commented on SPARK-40106: -- User 'ryan-johnson-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/37531 > Task failure handlers should always run if the task failed > -- > > Key: SPARK-40106 > URL: https://issues.apache.org/jira/browse/SPARK-40106 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Ryan Johnson >Priority: Major > > Today, if a task body succeeds, but a task completion listener fails, task > failure listeners are not called -- even tho the task has indeed failed at > that point. > If a completion listener fails, and failure listeners were not previously > invoked, we should invoke them before running the remaining completion > listeners. > Such a change would increase the utility of task listeners, especially ones > intended to assist with task cleanup. > To give one arbitrary example, code like this appears at several places in > the code (taken from {{executeTask}} method of FileFormatWriter.scala): > {code:java} > try { > Utils.tryWithSafeFinallyAndFailureCallbacks(block = { > // Execute the task to write rows out and commit the task. > dataWriter.writeWithIterator(iterator) > dataWriter.commit() > })(catchBlock = { > // If there is an error, abort the task > dataWriter.abort() > logError(s"Job $jobId aborted.") > }, finallyBlock = { > dataWriter.close() > }) > } catch { > case e: FetchFailedException => > throw e > case f: FileAlreadyExistsException if > SQLConf.get.fastFailFileFormatOutput => > // If any output file to write already exists, it does not make sense > to re-run this task. > // We throw the exception and let Executor throw ExceptionFailure to > abort the job. 
> throw new TaskOutputFileAlreadyExistException(f) > case t: Throwable => > throw QueryExecutionErrors.taskFailedWhileWritingRowsError(t) > }{code} > If failure listeners were reliably called, the above idiom could potentially > be factored out as two failure listeners plus a completion listener, and > reused rather than duplicated. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40106) Task failure handlers should always run if the task failed
Ryan Johnson created SPARK-40106: Summary: Task failure handlers should always run if the task failed Key: SPARK-40106 URL: https://issues.apache.org/jira/browse/SPARK-40106 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0 Reporter: Ryan Johnson Today, if a task body succeeds, but a task completion listener fails, task failure listeners are not called -- even tho the task has indeed failed at that point. If a completion listener fails, and failure listeners were not previously invoked, we should invoke them before running the remaining completion listeners. Such a change would increase the utility of task listeners, especially ones intended to assist with task cleanup. To give one arbitrary example, code like this appears at several places in the code (taken from {{executeTask}} method of FileFormatWriter.scala): {code:java} try { Utils.tryWithSafeFinallyAndFailureCallbacks(block = { // Execute the task to write rows out and commit the task. dataWriter.writeWithIterator(iterator) dataWriter.commit() })(catchBlock = { // If there is an error, abort the task dataWriter.abort() logError(s"Job $jobId aborted.") }, finallyBlock = { dataWriter.close() }) } catch { case e: FetchFailedException => throw e case f: FileAlreadyExistsException if SQLConf.get.fastFailFileFormatOutput => // If any output file to write already exists, it does not make sense to re-run this task. // We throw the exception and let Executor throw ExceptionFailure to abort the job. throw new TaskOutputFileAlreadyExistException(f) case t: Throwable => throw QueryExecutionErrors.taskFailedWhileWritingRowsError(t) }{code} If failure listeners were reliably called, the above idiom could potentially be factored out as two failure listeners plus a completion listener, and reused rather than duplicated. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
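The ordering the issue proposes can be modeled in a few lines. A hypothetical sketch in plain Python (not Spark's actual `TaskContext` API): if a completion listener fails and the failure listeners have not yet run, invoke them before the remaining completion listeners.

```python
# Model of the proposed semantics: a completion-listener failure marks the
# task as failed, so failure listeners run once, before the remaining
# completion listeners.
def run_listeners(completion_listeners, failure_listeners):
    log = []
    failure_invoked = False
    for listener in completion_listeners:
        try:
            listener(log)
        except Exception:
            if not failure_invoked:
                failure_invoked = True
                for f in failure_listeners:
                    f(log)
    return log

def ok(name):
    return lambda log: log.append(name)

def failing(name):
    def run(log):
        log.append(name)
        raise RuntimeError("listener failed")
    return run

order = run_listeners(
    completion_listeners=[failing("completion-1"), ok("completion-2")],
    failure_listeners=[ok("failure-1")],
)
print(order)  # ['completion-1', 'failure-1', 'completion-2']
```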
[jira] [Assigned] (SPARK-40102) Use SparkException instead of IllegalStateException in SparkPlan
[ https://issues.apache.org/jira/browse/SPARK-40102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-40102: Assignee: Yi kaifei > Use SparkException instead of IllegalStateException in SparkPlan > > > Key: SPARK-40102 > URL: https://issues.apache.org/jira/browse/SPARK-40102 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yi kaifei >Assignee: Yi kaifei >Priority: Major > Fix For: 3.4.0 > > > This pr aims to use SparkException instead of IllegalStateException in > SparkPlan, for details, see: https://github.com/apache/spark/pull/37524 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40102) Use SparkException instead of IllegalStateException in SparkPlan
[ https://issues.apache.org/jira/browse/SPARK-40102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40102. -- Resolution: Fixed Issue resolved by pull request 37535 [https://github.com/apache/spark/pull/37535] > Use SparkException instead of IllegalStateException in SparkPlan > > > Key: SPARK-40102 > URL: https://issues.apache.org/jira/browse/SPARK-40102 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yi kaifei >Assignee: Yi kaifei >Priority: Major > Fix For: 3.4.0 > > > This pr aims to use SparkException instead of IllegalStateException in > SparkPlan, for details, see: https://github.com/apache/spark/pull/37524 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40102) Use SparkException instead of IllegalStateException in SparkPlan
[ https://issues.apache.org/jira/browse/SPARK-40102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-40102: - Parent: SPARK-37935 Issue Type: Sub-task (was: Improvement) > Use SparkException instead of IllegalStateException in SparkPlan > > > Key: SPARK-40102 > URL: https://issues.apache.org/jira/browse/SPARK-40102 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yi kaifei >Priority: Major > Fix For: 3.4.0 > > > This pr aims to use SparkException instead of IllegalStateException in > SparkPlan, for details, see: https://github.com/apache/spark/pull/37524 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40102) Use SparkException instead of IllegalStateException in SparkPlan
[ https://issues.apache.org/jira/browse/SPARK-40102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580326#comment-17580326 ] Max Gekk commented on SPARK-40102: -- [~kaifeiYi] Please, open a sub-task of https://issues.apache.org/jira/browse/SPARK-37935 next time. > Use SparkException instead of IllegalStateException in SparkPlan > > > Key: SPARK-40102 > URL: https://issues.apache.org/jira/browse/SPARK-40102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yi kaifei >Priority: Major > Fix For: 3.4.0 > > > This pr aims to use SparkException instead of IllegalStateException in > SparkPlan, for details, see: https://github.com/apache/spark/pull/37524 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40036) LevelDB/RocksDBIterator.next should return false after iterator or db close
[ https://issues.apache.org/jira/browse/SPARK-40036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40036. -- Fix Version/s: 3.4.0 Assignee: Yang Jie Resolution: Fixed Resolved by https://github.com/apache/spark/pull/37471
> LevelDB/RocksDBIterator.next should return false after iterator or db close
> ---
>
> Key: SPARK-40036
> URL: https://issues.apache.org/jira/browse/SPARK-40036
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Yang Jie
> Assignee: Yang Jie
> Priority: Minor
> Fix For: 3.4.0
>
> {code:java}
> @Test
> public void testHasNextAndNextAfterIteratorClose() throws Exception {
>   String prefix = "test_db_iter_close.";
>   String suffix = ".ldb";
>   File path = File.createTempFile(prefix, suffix);
>   path.delete();
>   LevelDB db = new LevelDB(path);
>   // Write one record for the test
>   db.write(createCustomType1(0));
>   KVStoreIterator iter = db.view(CustomType1.class).closeableIterator();
>   // iter.hasNext should be true before close
>   assertTrue(iter.hasNext());
>   // close iter
>   iter.close();
>   // iter.hasNext should be false after iter close
>   assertFalse(iter.hasNext());
>   // iter.next should throw NoSuchElementException after iter close
>   assertThrows(NoSuchElementException.class, iter::next);
>   db.close();
>   assertTrue(path.exists());
>   FileUtils.deleteQuietly(path);
>   assertFalse(path.exists());
> }
>
> @Test
> public void testHasNextAndNextAfterDBClose() throws Exception {
>   String prefix = "test_db_db_close.";
>   String suffix = ".ldb";
>   File path = File.createTempFile(prefix, suffix);
>   path.delete();
>   LevelDB db = new LevelDB(path);
>   // Write one record for the test
>   db.write(createCustomType1(0));
>   KVStoreIterator iter = db.view(CustomType1.class).closeableIterator();
>   // iter.hasNext should be true before close
>   assertTrue(iter.hasNext());
>   // close db
>   db.close();
>   // iter.hasNext should be false after db close
>   assertFalse(iter.hasNext());
>   // iter.next should throw NoSuchElementException after db close
>   assertThrows(NoSuchElementException.class, iter::next);
>   assertTrue(path.exists());
>   FileUtils.deleteQuietly(path);
>   assertFalse(path.exists());
> }
> {code}
>
> In both cases above, after the iterator/db has been closed, `hasNext` still returns true and `next` still returns the values that were not consumed before the close. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
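The semantics the ticket asks for can be sketched independently of Spark as an iterator wrapper that tracks a closed flag. Note this is a minimal illustration, not Spark's actual LevelDBIterator/RocksDBIterator code; the class and method names below are invented for the example, and the real fix in the linked PR may differ in detail:

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Illustrative wrapper: once close() is called, hasNext() returns false and
// next() throws NoSuchElementException, matching the behavior the ticket
// wants after the iterator (or the underlying DB) is closed.
class ClosedAwareIterator<T> implements Iterator<T>, AutoCloseable {
    private final Iterator<T> delegate;
    private boolean closed = false;

    ClosedAwareIterator(Iterator<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean hasNext() {
        // Guard on the closed flag before consulting the delegate.
        return !closed && delegate.hasNext();
    }

    @Override
    public T next() {
        if (closed) {
            throw new NoSuchElementException("iterator is closed");
        }
        return delegate.next();
    }

    @Override
    public void close() {
        closed = true; // in Spark this would also release the native DB iterator
    }
}

public class Main {
    public static void main(String[] args) {
        ClosedAwareIterator<Integer> iter =
            new ClosedAwareIterator<>(List.of(1, 2, 3).iterator());
        System.out.println(iter.hasNext()); // true
        iter.close();
        System.out.println(iter.hasNext()); // false
    }
}
```

In the Spark case the same flag would additionally be set when the owning DB is closed, which is why the second test (`testHasNextAndNextAfterDBClose`) also passes once the fix is in.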
[jira] [Resolved] (SPARK-40042) Make pyspark.sql.streaming.query examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40042. -- Fix Version/s: 3.4.0 Assignee: Qian Sun Resolution: Fixed Resolved by https://github.com/apache/spark/pull/37482 > Make pyspark.sql.streaming.query examples self-contained > > > Key: SPARK-40042 > URL: https://issues.apache.org/jira/browse/SPARK-40042 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Qian Sun >Assignee: Qian Sun >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40042) Make pyspark.sql.streaming.query examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40042: - Priority: Minor (was: Major) > Make pyspark.sql.streaming.query examples self-contained > > > Key: SPARK-40042 > URL: https://issues.apache.org/jira/browse/SPARK-40042 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Qian Sun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37980) Extend METADATA column to support row indices for file based data sources
[ https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37980: --- Assignee: Ala Luszczak > Extend METADATA column to support row indices for file based data sources > - > > Key: SPARK-37980 > URL: https://issues.apache.org/jira/browse/SPARK-37980 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Prakhar Jain >Assignee: Ala Luszczak >Priority: Major > Fix For: 3.4.0 > > > Spark recently added hidden metadata column support for file-based > data sources as part of SPARK-37273. > We should extend it to support ROW_INDEX/ROW_POSITION as well. > > Meaning of ROW_POSITION: > ROW_INDEX/ROW_POSITION is the index of a row within a file, e.g. the 5th > row in a file has ROW_INDEX 5. > > Use cases: > Row indexes can be used in a variety of ways. A (fileName, rowIndex) tuple > uniquely identifies a row in a table. This information can be used to mark rows, > e.g. by an indexer. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37980) Extend METADATA column to support row indices for file based data sources
[ https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37980. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37228 [https://github.com/apache/spark/pull/37228] > Extend METADATA column to support row indices for file based data sources > - > > Key: SPARK-37980 > URL: https://issues.apache.org/jira/browse/SPARK-37980 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Prakhar Jain >Priority: Major > Fix For: 3.4.0 > > > Spark recently added hidden metadata column support for file-based > data sources as part of SPARK-37273. > We should extend it to support ROW_INDEX/ROW_POSITION as well. > > Meaning of ROW_POSITION: > ROW_INDEX/ROW_POSITION is the index of a row within a file, e.g. the 5th > row in a file has ROW_INDEX 5. > > Use cases: > Row indexes can be used in a variety of ways. A (fileName, rowIndex) tuple > uniquely identifies a row in a table. This information can be used to mark rows, > e.g. by an indexer. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40105) Improve repartition in ReplaceCTERefWithRepartition
[ https://issues.apache.org/jira/browse/SPARK-40105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40105: Assignee: Apache Spark > Improve repartition in ReplaceCTERefWithRepartition > --- > > Key: SPARK-40105 > URL: https://issues.apache.org/jira/browse/SPARK-40105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Minor > > If a CTE cannot be inlined, ReplaceCTERefWithRepartition adds a repartition > to force a shuffle so that the references can reuse the shuffle exchange. > The added repartition should be optimized by AQE for better performance. > If the user has already specified a rebalance, ReplaceCTERefWithRepartition > should skip adding the repartition. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40105) Improve repartition in ReplaceCTERefWithRepartition
[ https://issues.apache.org/jira/browse/SPARK-40105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-40105: -- Description: If a CTE cannot be inlined, ReplaceCTERefWithRepartition adds a repartition to force a shuffle so that the references can reuse the shuffle exchange. The added repartition should be optimized by AQE for better performance. If the user has already specified a rebalance, ReplaceCTERefWithRepartition should skip adding the repartition. was: If a CTE cannot be inlined, ReplaceCTERefWithRepartition adds a repartition to force a shuffle so that the references can reuse the shuffle exchange. It cannot be optimized by AQE since it has a defined shuffle partition number. > Improve repartition in ReplaceCTERefWithRepartition > --- > > Key: SPARK-40105 > URL: https://issues.apache.org/jira/browse/SPARK-40105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Minor > > If a CTE cannot be inlined, ReplaceCTERefWithRepartition adds a repartition > to force a shuffle so that the references can reuse the shuffle exchange. > The added repartition should be optimized by AQE for better performance. > If the user has already specified a rebalance, ReplaceCTERefWithRepartition > should skip adding the repartition. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40105) Improve repartition in ReplaceCTERefWithRepartition
[ https://issues.apache.org/jira/browse/SPARK-40105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40105: Assignee: (was: Apache Spark) > Improve repartition in ReplaceCTERefWithRepartition > --- > > Key: SPARK-40105 > URL: https://issues.apache.org/jira/browse/SPARK-40105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Minor > > If a CTE cannot be inlined, ReplaceCTERefWithRepartition adds a repartition > to force a shuffle so that the references can reuse the shuffle exchange. > The added repartition should be optimized by AQE for better performance. > If the user has already specified a rebalance, ReplaceCTERefWithRepartition > should skip adding the repartition. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40105) Improve repartition in ReplaceCTERefWithRepartition
[ https://issues.apache.org/jira/browse/SPARK-40105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580279#comment-17580279 ] Apache Spark commented on SPARK-40105: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/37537 > Improve repartition in ReplaceCTERefWithRepartition > --- > > Key: SPARK-40105 > URL: https://issues.apache.org/jira/browse/SPARK-40105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Minor > > If a CTE cannot be inlined, ReplaceCTERefWithRepartition adds a repartition > to force a shuffle so that the references can reuse the shuffle exchange. > The added repartition should be optimized by AQE for better performance. > If the user has already specified a rebalance, ReplaceCTERefWithRepartition > should skip adding the repartition. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org