Re: Can not complete the read csv task
This command only defines a new DataFrame; Spark evaluates lazily, so nothing is actually read yet. To see results you need to trigger an action, e.g. call merged_spark_data.show() on a new line.

Regarding the error: that UnsatisfiedLinkError from NativeIO$Windows is the typical failure you get when running Spark on Windows without the Hadoop native utilities. You can fix it by downloading a winutils.exe that matches your Hadoop version and setting the HADOOP_HOME environment variable to point at its install directory.

On Thu, 12 Oct 2023, 11:58 Kelum Perera wrote:

> Dear friends,
>
> I'm trying to get a fresh start with Spark. I tried to read a few CSV files
> in a folder, but the task got stuck and did not complete, as shown in the
> content copied from the terminal.
>
> Can someone help me understand what is going wrong?
>
> Versions:
> java version "11.0.16" 2022-07-19 LTS
> Java(TM) SE Runtime Environment 18.9 (build 11.0.16+11-LTS-199)
> Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.16+11-LTS-199, mixed mode)
> Python 3.9.13
> Windows 10
>
> Copied from the terminal:
>
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
>       /_/
>
> Using Python version 3.9.13 (main, Aug 25 2022 23:51:50)
> Spark context Web UI available at http://LK510FIDSLW4.ey.net:4041
> Spark context available as 'sc' (master = local[*], app id = local-1697089858181).
> SparkSession available as 'spark'.
>
> >>> merged_spark_data = spark.read.csv(r"C:\Users\Kelum.Perera\Downloads\data-master\nyse_all\nyse_data\*", header=False)
> Exception in thread "globPath-ForkJoinPool-1-worker-115" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
>     at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
>     at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:793)
>     at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1249)
>     at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1454)
>     at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:601)
>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>     at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
>     at org.apache.hadoop.fs.Globber.listStatus(Globber.java:128)
>     at org.apache.hadoop.fs.Globber.doGlob(Globber.java:291)
>     at org.apache.hadoop.fs.Globber.glob(Globber.java:202)
>     at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2124)
>     at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:238)
>     at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$3(DataSource.scala:737)
>     at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:380)
>     at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
>     at scala.util.Success.$anonfun$map$1(Try.scala:255)
>     at scala.util.Success.map(Try.scala:213)
>     at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
>     at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
>     at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
>     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
>     at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
>     at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
>     at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
>     at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
>     at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
>     at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
>
> Nothing happens afterwards. I appreciate your kind input to solve this.
>
> Best Regards,
> Kelum Perera
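If it helps, here is a minimal sketch of both points above: the winutils setup and an action that actually forces the read. The C:\hadoop install location is an assumption — point HADOOP_HOME at wherever you place a winutils.exe that matches the Hadoop version your Spark build was compiled against (Spark 3.5.x ships with Hadoop 3.3.x client libraries). Environment variables can also be set system-wide in Windows settings instead of in code.

```python
import os

# Assumed layout: %HADOOP_HOME%\bin\winutils.exe. Set this BEFORE the
# Spark JVM starts, or it will not pick up the native utilities.
os.environ["HADOOP_HOME"] = r"C:\hadoop"  # hypothetical install location
os.environ["PATH"] = os.environ["HADOOP_HOME"] + r"\bin;" + os.environ["PATH"]

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-csv").getOrCreate()

# spark.read.csv(...) is lazy: it only builds a query plan. An action
# such as .show() or .count() is what triggers reading the files.
merged_spark_data = spark.read.csv(
    r"C:\Users\Kelum.Perera\Downloads\data-master\nyse_all\nyse_data\*",
    header=False,
)
merged_spark_data.show(5)         # display the first 5 rows
print(merged_spark_data.count())  # total rows across all matched files
```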
Fw: Can not complete the read csv task
From: Kelum Perera
Sent: Thursday, October 12, 2023 11:40 AM
To: user@spark.apache.org; Kelum Perera; Kelum Gmail
Subject: Can not complete the read csv task
Can not complete the read csv task
Dear friends,

I'm trying to get a fresh start with Spark. I tried to read a few CSV files in a folder, but the task got stuck and did not complete, as shown in the content copied from the terminal.

Can someone help me understand what is going wrong?

Versions:
java version "11.0.16" 2022-07-19 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.16+11-LTS-199)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.16+11-LTS-199, mixed mode)
Python 3.9.13
Windows 10

Copied from the terminal:

      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/

Using Python version 3.9.13 (main, Aug 25 2022 23:51:50)
Spark context Web UI available at http://LK510FIDSLW4.ey.net:4041
Spark context available as 'sc' (master = local[*], app id = local-1697089858181).
SparkSession available as 'spark'.

>>> merged_spark_data = spark.read.csv(r"C:\Users\Kelum.Perera\Downloads\data-master\nyse_all\nyse_data\*", header=False)
Exception in thread "globPath-ForkJoinPool-1-worker-115" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:793)
    at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1249)
    at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1454)
    at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:601)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
    at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
    at org.apache.hadoop.fs.Globber.listStatus(Globber.java:128)
    at org.apache.hadoop.fs.Globber.doGlob(Globber.java:291)
    at org.apache.hadoop.fs.Globber.glob(Globber.java:202)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2124)
    at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:238)
    at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$3(DataSource.scala:737)
    at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:380)
    at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
    at scala.util.Success.$anonfun$map$1(Try.scala:255)
    at scala.util.Success.map(Try.scala:213)
    at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
    at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
    at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)

Nothing happens afterwards. I appreciate your kind input to solve this.

Best Regards,
Kelum Perera