Okay, that was some caching issue. Now there is a shared mount point between the place the PySpark code is executed and the Spark nodes it runs on. Hrmph, I was hoping that wouldn't be the case. Fair enough!
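Since the resolution hinged on both sides resolving the same `file:` path, a quick pre-flight check can catch the mismatch before Spark's FileOutputCommitter does. This is a hypothetical helper, not something from the thread — just a sketch of the rule that a `file:` warehouse URI is resolved against the local disk of whichever machine touches it:

```python
import os
from urllib.parse import urlparse

def warehouse_path_exists_locally(warehouse_uri: str) -> bool:
    """With a file: warehouse URI, whichever host performs the commit
    resolves the path on its *local* filesystem, so the directory has
    to be visible on every machine involved in the write."""
    parsed = urlparse(warehouse_uri)
    if parsed.scheme not in ("", "file"):
        # hdfs://, s3a://, etc. are resolved by their own FS client,
        # not the local disk, so this check doesn't apply
        return True
    return os.path.isdir(parsed.path)

# e.g. run this on the driver machine before building the SparkSession
print(warehouse_path_exists_locally("file:/data/hive/warehouse"))
```

If this prints `False` on the machine running the PySpark script while the directory exists inside the container, that is exactly the split-brain the errors below describe.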
On Thu, Mar 7, 2024 at 11:23 PM Tom Barber <t...@spicule.co.uk> wrote:

> Okay interesting, maybe my assumption was incorrect, although I'm still
> confused.
>
> I tried to mount a central mount point that would be the same on my local
> machine and the container. Same error, although I moved the path to
> /tmp/hive/data/hive/.... but when I rerun the test code to save a table,
> the complaint is still:
>
> Warehouse Dir: file:/tmp/hive/data/hive/warehouse
> Metastore URIs: thrift://192.168.1.245:9083
> Warehouse Dir: file:/tmp/hive/data/hive/warehouse
> Metastore URIs: thrift://192.168.1.245:9083
> ERROR FileOutputCommitter: Mkdirs failed to create
> file:/data/hive/warehouse/input.db/accounts_20240307_232110_1_0_6_post21_g4fdc321_d20240307/_temporary/0
>
> So what is /data/hive even referring to, when I print out the Spark conf
> values and neither now refers to /data/hive/?
>
> On Thu, Mar 7, 2024 at 9:49 PM Tom Barber <t...@spicule.co.uk> wrote:
>
>> Wonder if anyone can sort my brain out here as to what's possible or
>> not.
>>
>> I have a container running Spark, with Hive and a Thrift server. I want
>> to run code against it remotely.
>> If I take something simple like this:
>>
>> from pyspark.sql import SparkSession
>> from pyspark.sql.types import StructType, StructField, IntegerType, StringType
>>
>> # Initialize SparkSession
>> spark = SparkSession.builder \
>>     .appName("ShowDatabases") \
>>     .master("spark://192.168.1.245:7077") \
>>     .config("spark.sql.warehouse.dir", "file:/data/hive/warehouse") \
>>     .config("hive.metastore.uris", "thrift://192.168.1.245:9083") \
>>     .enableHiveSupport() \
>>     .getOrCreate()
>>
>> # Define the schema of the DataFrame
>> schema = StructType([
>>     StructField("id", IntegerType(), True),
>>     StructField("name", StringType(), True)
>> ])
>>
>> # Data to be converted into a DataFrame
>> data = [(1, "John Doe"), (2, "Jane Doe"), (3, "Mike Johnson")]
>>
>> # Create the DataFrame
>> df = spark.createDataFrame(data, schema)
>>
>> # Show the DataFrame (optional, for verification)
>> df.show()
>>
>> # Save the DataFrame to a table named "my_table"
>> df.write.mode("overwrite").saveAsTable("my_table")
>>
>> # Stop the SparkSession
>> spark.stop()
>>
>> When I run it in the container it runs fine, but when I run it remotely
>> it says:
>>
>> : java.io.FileNotFoundException: File
>> file:/data/hive/warehouse/my_table/_temporary/0 does not exist
>>     at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
>>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>>     at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
>>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>>     at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:334)
>>     at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:404)
>>     at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:377)
>>     at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
>>     at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:192)
>>
>> My assumption is that it's trying to look on my local machine for
>> /data/hive/warehouse and failing, because on the remote box I can see
>> those folders exist.
>>
>> So the question is: if you're not backing it with Hadoop or something, do
>> you have to mount the drive in the same place on the computer running the
>> PySpark? Or am I missing a config option somewhere?
>>
>> Thanks!
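For anyone hitting the same thing later: the symptom above is what a local `file:` warehouse looks like when the driver and the workers don't share the path — every JVM involved in the write resolves the path against its own disk. Short of mounting the same directory everywhere (which is where this thread ends up), the usual alternative is to point the warehouse at storage every node resolves identically, such as HDFS or an object store. A sketch of the relevant settings, with hypothetical endpoints you'd replace with your own:

```properties
# spark-defaults.conf — hypothetical endpoints, adjust to your cluster
spark.sql.warehouse.dir            hdfs://namenode:8020/user/hive/warehouse
# or, with the s3a connector on the classpath:
# spark.sql.warehouse.dir          s3a://my-bucket/hive/warehouse

# Hive settings can be passed through to the Hadoop conf via the
# spark.hadoop. prefix
spark.hadoop.hive.metastore.uris   thrift://192.168.1.245:9083
```

With either of those schemes the FileOutputCommitter talks to the distributed filesystem's client rather than the local disk, so it no longer matters which machine submits the job.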