Okay, interesting. Maybe my assumption was incorrect, although I'm still confused.
I tried to mount a central mount point that would be the same on my local machine and in the container. Same error, although I moved the path to /tmp/hive/data/hive/.... When I rerun the test code to save a table, the config prints as:

Warehouse Dir: file:/tmp/hive/data/hive/warehouse
Metastore URIs: thrift://192.168.1.245:9083
Warehouse Dir: file:/tmp/hive/data/hive/warehouse
Metastore URIs: thrift://192.168.1.245:9083

but the complaint is still:

ERROR FileOutputCommitter: Mkdirs failed to create file:/data/hive/warehouse/input.db/accounts_20240307_232110_1_0_6_post21_g4fdc321_d20240307/_temporary/0

So what is /data/hive even referring to, when I print out the Spark conf values and neither of them now refers to /data/hive/? (Roughly what I'm printing and checking is sketched after the quoted message below.)

On Thu, Mar 7, 2024 at 9:49 PM Tom Barber <t...@spicule.co.uk> wrote:

> Wonder if anyone can just sort my brain out here as to what's possible or
> not.
>
> I have a container running Spark, with Hive and a ThriftServer. I want to
> run code against it remotely.
>
> If I take something simple like this:
>
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField, IntegerType, StringType
>
> # Initialize SparkSession
> spark = SparkSession.builder \
>     .appName("ShowDatabases") \
>     .master("spark://192.168.1.245:7077") \
>     .config("spark.sql.warehouse.dir", "file:/data/hive/warehouse") \
>     .config("hive.metastore.uris", "thrift://192.168.1.245:9083") \
>     .enableHiveSupport() \
>     .getOrCreate()
>
> # Define the schema of the DataFrame
> schema = StructType([
>     StructField("id", IntegerType(), True),
>     StructField("name", StringType(), True)
> ])
>
> # Data to be converted into a DataFrame
> data = [(1, "John Doe"), (2, "Jane Doe"), (3, "Mike Johnson")]
>
> # Create the DataFrame
> df = spark.createDataFrame(data, schema)
>
> # Show the DataFrame (optional, for verification)
> df.show()
>
> # Save the DataFrame to a table named "my_table"
> df.write.mode("overwrite").saveAsTable("my_table")
>
> # Stop the SparkSession
> spark.stop()
>
> When I run it in the container it runs fine, but when I run it remotely it
> says:
>
> : java.io.FileNotFoundException: File file:/data/hive/warehouse/my_table/_temporary/0 does not exist
>     at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>     at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>     at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:334)
>     at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:404)
>     at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:377)
>     at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
>     at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:192)
>
> My assumption is that it's trying to look on my local machine for
> /data/hive/warehouse and failing, because on the remote box I can see those
> folders.
>
> So the question is: if you're not backing it with Hadoop or something, do
> you have to mount the drive in the same place on the computer running the
> PySpark code? Or am I missing a config option somewhere?
>
> Thanks!
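For reference, this is roughly what I'm printing and checking. It's only a sketch: the app name is arbitrary, the `input` database name is taken from the input.db part of the error path, and the DESCRIBE statement is there just to see what location the metastore itself has recorded for that database, separately from whatever spark.sql.warehouse.dir is set to in the session.

from pyspark.sql import SparkSession

# Same connection settings as the test code above.
spark = SparkSession.builder \
    .appName("InspectWarehouse") \
    .master("spark://192.168.1.245:7077") \
    .config("spark.sql.warehouse.dir", "file:/tmp/hive/data/hive/warehouse") \
    .config("hive.metastore.uris", "thrift://192.168.1.245:9083") \
    .enableHiveSupport() \
    .getOrCreate()

# What the session itself thinks (this is what the printout above shows).
print("Warehouse Dir:", spark.conf.get("spark.sql.warehouse.dir"))
print("Metastore URIs:", spark.conf.get("hive.metastore.uris", "not set"))

# What the metastore has recorded for the database; the Location row here comes
# from the metastore, not from the current session config.
spark.sql("DESCRIBE DATABASE EXTENDED input").show(truncate=False)

spark.stop()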
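And to poke at the assumption in the quoted message (that the commit step is looking on the machine running the PySpark code rather than on the cluster), something like this on the submitting machine before re-running the save. It isn't meant as a fix, only a check; /data/hive/warehouse is the original path from the quoted script, and creating it locally will likely need elevated permissions.

import os

# Create the warehouse path on the *local* (submitting) machine only, then rerun
# the saveAsTable test unchanged. If the commit-phase errors go away or change,
# that points at the driver side using the local filesystem for file:/ paths.
os.makedirs("/data/hive/warehouse", exist_ok=True)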