[ https://issues.apache.org/jira/browse/SPARK-40287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17603819#comment-17603819 ]
Drew commented on SPARK-40287:
------------------------------

Hey [~ste...@apache.org],

Yes, I see the same behavior with those criteria as well. It looks like the data is moved to the new table location again.

> Load Data using Spark by a single partition moves entire dataset under same location in S3
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-40287
>                 URL: https://issues.apache.org/jira/browse/SPARK-40287
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 3.2.1
>            Reporter: Drew
>            Priority: Major
>
> Hello,
>
> I'm experiencing an issue in PySpark when creating a Hive table and loading data into it. I'm using an Amazon S3 bucket as the data location, creating the table as parquet, and trying to load data into that table for a single partition, and I'm seeing some odd behavior: when I point LOAD DATA at the S3 location of a partitioned parquet dataset, all of the data is moved into the location specified in my CREATE TABLE command, including the partitions I didn't specify in the LOAD DATA command. For example:
> {code:java}
> # create a data frame in pyspark with partitions
> df = spark.createDataFrame([("a", 1, "x"), ("b", 2, "y"), ("c", 3, "y")], ["c1", "c2", "p"])
> # save it to S3, partitioned by column "p"
> df.write.format("parquet").mode("overwrite").partitionBy("p").save("s3://bucket/data/")
> {code}
> At this point S3 should have a new folder `data` with two partition folders, each containing parquet files:
>
> - s3://bucket/data/p=x/
>   - part-00001.snappy.parquet
> - s3://bucket/data/p=y/
>   - part-00002.snappy.parquet
>   - part-00003.snappy.parquet
>
> {code:java}
> # create a new partitioned table
> spark.sql("create table src (c1 string, c2 int) PARTITIONED BY (p string) STORED AS parquet LOCATION 's3://bucket/new/'")
> # load the saved data from s3, specifying the single partition value x
> spark.sql("LOAD DATA INPATH 's3://bucket/data/' INTO TABLE src PARTITION (p='x')")
> spark.sql("select * from src").show()
> # output:
> # +---+---+---+
> # | c1| c2| p|
> # +---+---+---+
> # +---+---+---+
> {code}
> After running the LOAD DATA command and looking at the table, I'm left with no data loaded in. When checking S3, the data we saved earlier has been moved under `s3://bucket/new/`; oddly enough, it also brought over the other partition along with its directory structure, listed below:
>
> - s3://bucket/new/
>   - p=x/
>     - p=x/
>       - part-00001.snappy.parquet
>     - p=y/
>       - part-00002.snappy.parquet
>       - part-00003.snappy.parquet
>
> Is this the intended behavior when loading data in from a partitioned parquet file? Is the previous file supposed to be moved/deleted from the source directory?
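For reference, a minimal sketch of two ways to load only the p=x files without relocating the whole dataset, reusing the paths and table name from the repro above; this is an untested suggestion, not something confirmed against this bucket:
{code:java}
# Sketch 1 (assumption: pointing INPATH at the partition's own subdirectory
# moves only that partition's files into the table's p=x directory).
spark.sql("LOAD DATA INPATH 's3://bucket/data/p=x/' INTO TABLE src PARTITION (p='x')")

# Sketch 2 (assumption: registering the existing files in place avoids any
# move at all, by adding the partition with an explicit LOCATION).
spark.sql("ALTER TABLE src ADD PARTITION (p='x') LOCATION 's3://bucket/data/p=x/'")

# check that only the x partition is visible
spark.sql("select * from src where p = 'x'").show()
{code}
Note that LOAD DATA INPATH moves rather than copies the source files, so with the first form the files would no longer be under s3://bucket/data/ afterwards; only the second form leaves the original layout untouched.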