Wondering if anyone can sort my brain out here as to what's possible or
not.

I have a container running Spark, with Hive and a ThriftServer. I want to
run code against it remotely.
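
(Plain SQL over the ThriftServer itself is a separate route; a minimal
sketch of that, assuming PyHive is installed and the server listens on
its default port 10000, is below. What I'm after, though, is the full
SparkSession API.)

from pyhive import hive  # assumption: pip install "pyhive[hive]"

# Sketch: talk to the container's ThriftServer over the HiveServer2
# protocol, so all execution stays inside the container. Assumes the
# default ThriftServer port 10000.
conn = hive.Connection(host="192.168.1.245", port=10000)
cur = conn.cursor()
cur.execute("SHOW DATABASES")
print(cur.fetchall())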

If I take something simple like this:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("ShowDatabases") \
    .master("spark://192.168.1.245:7077") \
    .config("spark.sql.warehouse.dir", "file:/data/hive/warehouse") \
    .config("hive.metastore.uris","thrift://192.168.1.245:9083")\
    .enableHiveSupport() \
    .getOrCreate()

# Define schema of the DataFrame
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True)
])

# Data to be converted into a DataFrame
data = [(1, "John Doe"), (2, "Jane Doe"), (3, "Mike Johnson")]

# Create DataFrame
df = spark.createDataFrame(data, schema)

# Show the DataFrame (optional, for verification)
df.show()

# Save the DataFrame to a table named "my_table"
df.write.mode("overwrite").saveAsTable("my_table")

# Stop the SparkSession
spark.stop()

When I run it in the container it runs fine, but when I run it remotely it
says:

: java.io.FileNotFoundException: File file:/data/hive/warehouse/my_table/_temporary/0 does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:334)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:404)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:377)
at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:192)

My assumption is that it's trying to look for /data/hive/warehouse on
my local machine and failing: the folders exist on the remote box (I
can see them there), but not on the machine I'm running from.
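
If it helps anyone reason about it, here's a quick check of which
filesystem the driver resolves that path against. It's only a sketch:
it pokes at PySpark's internal _jvm/_jsc handles, which aren't public
API, and has to run after getOrCreate() and before spark.stop().

# Sketch: resolve the warehouse path through the driver's Hadoop
# FileSystem to see whose disk "file:" actually points at. Uses the
# internal _jvm/_jsc handles, so treat it as debugging only.
jvm = spark._jvm
path = jvm.org.apache.hadoop.fs.Path("file:/data/hive/warehouse")
fs = path.getFileSystem(spark._jsc.hadoopConfiguration())
print(fs.exists(path))  # False from my machine would mean the driver
                        # is looking at its own local filesystem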

So the question is: if you're not backing it with Hadoop or something,
do you have to mount the drive in the same place on the computer
running the PySpark code? Or am I missing a config option somewhere?
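
To be concrete about what I mean by a config option: something like
pointing spark.sql.warehouse.dir at storage both sides can reach. The
hdfs:// address below is hypothetical (I don't actually have HDFS set
up); it's just to illustrate the idea.

# Hypothetical sketch: put the warehouse on storage that both the
# driver (my machine) and the executors (the container) can resolve,
# e.g. HDFS, S3, or a commonly-mounted path. The
# hdfs://192.168.1.245:9000 address is made up for illustration.
spark = SparkSession.builder \
    .appName("ShowDatabases") \
    .master("spark://192.168.1.245:7077") \
    .config("spark.sql.warehouse.dir", "hdfs://192.168.1.245:9000/data/hive/warehouse") \
    .config("hive.metastore.uris", "thrift://192.168.1.245:9083") \
    .enableHiveSupport() \
    .getOrCreate()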

Thanks!
