[ANNOUNCE] Apache Celeborn(incubating) 0.3.1 available
Hi all,

The Apache Celeborn (Incubating) community is glad to announce the new release of Apache Celeborn (Incubating) 0.3.1. Celeborn is dedicated to improving the efficiency and elasticity of different map-reduce engines, and provides an elastic, highly efficient service for intermediate data including shuffle data, spilled data, result data, etc.

Download Link: https://celeborn.apache.org/download/

GitHub Release Tag:
- https://github.com/apache/incubator-celeborn/releases/tag/v0.3.1-incubating

Release Notes:
- https://celeborn.apache.org/community/release_notes/release_note_0.3.1

Home Page: https://celeborn.apache.org/

Celeborn Resources:
- Issue Management: https://issues.apache.org/jira/projects/CELEBORN
- Mailing List: d...@celeborn.apache.org

Thanks,
Cheng Pan
On behalf of the Apache Celeborn (Incubating) community
Fwd: Fw: Can not complete the read csv task
Dear group members,

I'm trying to get a fresh start with Spark, but came across the following issue: I tried to read a few CSV files from a folder, but the task got stuck and didn't complete (content copied from the terminal below). Can someone help me understand what is going wrong?

Versions:
- java version "11.0.16" 2022-07-19 LTS
- Java(TM) SE Runtime Environment 18.9 (build 11.0.16+11-LTS-199)
- Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.16+11-LTS-199, mixed mode)
- Python 3.9.13
- Windows 10

Copied from the terminal:

          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/ '_/
       /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
          /_/

    Using Python version 3.9.13 (main, Aug 25 2022 23:51:50)
    Spark context Web UI available at http://LK510FIDSLW4.ey.net:4041
    Spark context available as 'sc' (master = local[*], app id = local-1697089858181).
    SparkSession available as 'spark'.
    >>> merged_spark_data = spark.read.csv(r"C:\Users\Kelum.Perera\Downloads\data-master\nyse_all\nyse_data\*", header=False)
    Exception in thread "globPath-ForkJoinPool-1-worker-115" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:793)
        at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1249)
        at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1454)
        at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:601)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
        at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
        at org.apache.hadoop.fs.Globber.listStatus(Globber.java:128)
        at org.apache.hadoop.fs.Globber.doGlob(Globber.java:291)
        at org.apache.hadoop.fs.Globber.glob(Globber.java:202)
        at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2124)
        at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:238)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$3(DataSource.scala:737)
        at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:380)
        at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
        at scala.util.Success.$anonfun$map$1(Try.scala:255)
        at scala.util.Success.map(Try.scala:213)
        at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
        at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
        at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
        at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)

Nothing happens afterwards. I appreciate your kind input to solve this.

Best Regards,
Kelum Perera
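[Editor's note: this `UnsatisfiedLinkError` on `NativeIO$Windows.access0` typically means Spark on Windows cannot find the Hadoop native libraries (`winutils.exe` and `hadoop.dll`). A common workaround, sketched below under that assumption, is to point `HADOOP_HOME` at a directory containing them before the SparkSession is created. The `C:\hadoop` path is a hypothetical install location, not something from the original message.]

```python
import os

# Assumption: winutils.exe and hadoop.dll matching your Hadoop version
# live in C:\hadoop\bin (hypothetical path). Set the environment before
# any SparkSession is created, so Hadoop's NativeIO can load them.
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] = (
    os.environ["HADOOP_HOME"] + r"\bin" + os.pathsep + os.environ["PATH"]
)

# With the native libraries on the PATH, the original read can be
# retried unchanged:
# merged_spark_data = spark.read.csv(
#     r"C:\Users\Kelum.Perera\Downloads\data-master\nyse_all\nyse_data\*",
#     header=False,
# )
```

Setting the variables inside the Python process only works if it happens before the JVM is launched; setting them system-wide (System Properties > Environment Variables) and restarting the terminal is the more reliable route.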
Fw: Can not complete the read csv task
From: Kelum Perera
Sent: Thursday, October 12, 2023 11:40 AM
To: user@spark.apache.org; Kelum Perera; Kelum Gmail
Subject: Can not complete the read csv task

Dear friends,

I'm trying to get a fresh start with Spark. I tried to read a few CSV files in a folder, but the task got stuck and did not complete, failing with the same environment details and java.lang.UnsatisfiedLinkError stack trace (org.apache.hadoop.io.nativeio.NativeIO$Windows.access0) quoted in full in the forwarded copy above. Can someone help me understand what is going wrong?

Nothing happens afterwards. I appreciate your kind input to solve this.

Best Regards,
Kelum Perera
Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column
This issue is related to the CharVarcharCodegenUtils readSidePadding method. It appends white spaces while reading ENUM data from MySQL, causing issues when querying, and when writing the same data to Cassandra.

On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera, wrote:

> I have upgraded my spark job from spark 3.3.1 to spark 3.5.0. I am querying a MySQL
> database and applying `*UPPER(col) = UPPER(value)*` in the subsequent SQL query. It
> works as expected in Spark 3.3.1, but not with 3.5.0.
>
> Where condition: `*UPPER(vn) = 'ERICSSON' AND (upper(st) = 'OPEN' OR upper(st) = 'REOPEN' OR upper(st) = 'CLOSED')*`
>
> The *st* column is an ENUM in the database, and it is causing the issue.
>
> Below is the physical plan of the *FILTER* phase:
>
> For 3.3.1:
>
> +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(st#42) = OPEN) OR
> (upper(st#42) = REOPEN)) OR (upper(st#42) = CLOSED)))
>
> For 3.5.0:
>
> +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true)) = OPEN) OR
> (upper(staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true)) = REOPEN)) OR
> (upper(staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true)) = CLOSED)))
>
> -
>
> I have debugged it and found that Spark added a property in version 3.4.0,
> i.e. **spark.sql.readSideCharPadding**, which has the default value **true**.
>
> Link to the JIRA: https://issues.apache.org/jira/browse/SPARK-40697
>
> It added a new method in the class **CharVarcharCodegenUtils**:
>
>     public static UTF8String readSidePadding(UTF8String inputStr, int limit) {
>       int numChars = inputStr.numChars();
>       if (numChars == limit) {
>         return inputStr;
>       } else if (numChars < limit) {
>         return inputStr.rpad(limit, SPACE);
>       } else {
>         return inputStr;
>       }
>     }
>
> **This method appends whitespace padding to the ENUM values while reading,
> causing the issue.**
>
> ---
>
> When I remove the UPPER function from the where condition, the **FILTER**
> phase looks like this:
>
> +- Filter (((staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true) = OPEN         ) OR
> (staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true) = REOPEN       )) OR
> (staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true) = CLOSED       ))
>
> **You can see it has added white space after the value, and the query runs
> fine, giving the correct result.**
>
> But with the UPPER function I am not getting the data.
>
> --
>
> I have also tried disabling the property *spark.sql.readSideCharPadding =
> false* with the following cases:
>
> 1. With the UPPER function in the where clause:
>    it does not push the filters to the database, and the *query works fine*.
>
> +- Filter (((upper(st#42) = OPEN) OR (upper(st#42) = REOPEN)) OR
> (upper(st#42) = CLOSED))
>
> 2. But when I remove the UPPER function,
>
> *it pushes the filter to MySQL with the white spaces and I am not
> getting the data. (THIS IS CAUSING A VERY BIG ISSUE)*
>
> PushedFilters: [*IsNotNull(vn), *EqualTo(vn,ERICSSON),
> *Or(Or(EqualTo(st,OPEN         ),EqualTo(st,REOPEN       )),EqualTo(st,CLOSED       ))]
>
> I cannot move this filter to the JDBC read query, and I also can't remove
> the UPPER function from the where clause.
>
> I also found the same data getting written to CASSANDRA with *PADDING*.
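[Editor's note: the padding behavior described in this thread can be illustrated outside Spark. The sketch below is a plain-Python port of the quoted `readSidePadding` logic, written for illustration only (it is not Spark's actual code path). It shows why the padded-value-vs-padded-literal comparison still matches, while `upper(st) = 'OPEN'` stops matching once read-side padding is applied.]

```python
def read_side_padding(s: str, limit: int) -> str:
    """Illustrative port of CharVarcharCodegenUtils.readSidePadding:
    right-pad values shorter than the CHAR column width with spaces."""
    if len(s) < limit:
        return s.ljust(limit)  # equivalent of UTF8String.rpad(limit, SPACE)
    return s

# CHAR(13) column, matching the plan: readSidePadding(st#42, 13, ...)
st = read_side_padding("OPEN", 13)  # "OPEN" followed by 9 spaces

# Without UPPER, Spark pads the literal to the same width, so the
# comparison succeeds (this is the "= OPEN         " seen in the plan):
print(st == read_side_padding("OPEN", 13))  # True

# With UPPER, upper() preserves the trailing spaces, so the padded
# column value no longer equals the unpadded literal:
print(st.upper() == "OPEN")  # False
```

Under this model, the mismatch is purely about trailing spaces, which is consistent with the observation that removing UPPER (so both sides get padded) returns correct results, while pushing the padded literal down to MySQL (where ENUM values are stored unpadded) returns nothing.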