Hoping someone can sort my brain out here as to what's possible or not.
I have a container running Spark, with Hive and a ThriftServer. I want to run code against it remotely. Take something simple like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("ShowDatabases") \
    .master("spark://192.168.1.245:7077") \
    .config("spark.sql.warehouse.dir", "file:/data/hive/warehouse") \
    .config("hive.metastore.uris", "thrift://192.168.1.245:9083") \
    .enableHiveSupport() \
    .getOrCreate()

# Define schema of the DataFrame
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True)
])

# Data to be converted into a DataFrame
data = [(1, "John Doe"), (2, "Jane Doe"), (3, "Mike Johnson")]

# Create DataFrame
df = spark.createDataFrame(data, schema)

# Show the DataFrame (optional, for verification)
df.show()

# Save the DataFrame to a table named "my_table"
df.write.mode("overwrite").saveAsTable("my_table")

# Stop the SparkSession
spark.stop()
```

When I run it in the container it runs fine, but when I run it remotely it fails with:

```
java.io.FileNotFoundException: File file:/data/hive/warehouse/my_table/_temporary/0 does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
    at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:334)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:404)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:377)
    at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
    at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:192)
```

My assumption is that it's looking for /data/hive/warehouse on my local machine and failing; the folders definitely exist on the remote box. So the question is: if you're not backing it with Hadoop or similar, do you have to mount the drive at the same path on the computer running PySpark? Or am I missing a config option somewhere? Thanks!
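For what it's worth, my guess at the "config option" route would be pointing the warehouse at storage that both my remote driver and the cluster can reach, something like this (the hdfs:// host, port, and path here are made up for illustration, I don't actually have HDFS running):

```python
from pyspark.sql import SparkSession

# Hypothetical: use a warehouse URI on shared storage (HDFS, S3, NFS, ...)
# instead of file:/, so the driver-side job commit and the executor-side
# task writes all resolve to the same filesystem. The namenode address
# below is an assumption, not my real setup.
spark = SparkSession.builder \
    .appName("ShowDatabases") \
    .master("spark://192.168.1.245:7077") \
    .config("spark.sql.warehouse.dir", "hdfs://192.168.1.245:9000/data/hive/warehouse") \
    .config("hive.metastore.uris", "thrift://192.168.1.245:9083") \
    .enableHiveSupport() \
    .getOrCreate()
```

Is that the expected approach, or is there a way to make a plain file:/ warehouse work when the driver isn't on the same box?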