[ https://issues.apache.org/jira/browse/SPARK-40287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17603819#comment-17603819 ]

Drew commented on SPARK-40287:
------------------------------

Hey [~ste...@apache.org], 

Yes, I see the same behavior with those criteria as well. It looks like the 
data is moved to the new table location again.

> Load Data using Spark by a single partition moves entire dataset under same 
> location in S3
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-40287
>                 URL: https://issues.apache.org/jira/browse/SPARK-40287
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 3.2.1
>            Reporter: Drew
>            Priority: Major
>
> Hello,
> I'm running into an issue in PySpark when creating a Hive table and loading 
> data into it. I'm using an Amazon S3 bucket as the data location, creating 
> the table as Parquet, and trying to load data into it one partition at a 
> time, and I'm seeing some odd behavior: when I point LOAD DATA at the S3 
> location of the Parquet dataset, all of the data is moved into the location 
> from my CREATE TABLE command, including the partitions I didn't specify in 
> the LOAD DATA command. For example:
> {code:java}
> # create a DataFrame in PySpark with a partition column "p"
> df = spark.createDataFrame([("a", 1, "x"), ("b", 2, "y"), ("c", 3, "y")], ["c1", "c2", "p"])
> # save it to S3, partitioned by "p"
> df.write.format("parquet").mode("overwrite").partitionBy("p").save("s3://bucket/data/")
> {code}
> At this point S3 should have a new `data` folder with one subfolder per 
> partition value, each containing that partition's Parquet file(s); a quick 
> way to verify the listing is sketched after the tree.
>
>  - s3://bucket/data/p=x/
>     - part-00001.snappy.parquet
>  - s3://bucket/data/p=y/
>     - part-00002.snappy.parquet
>     - part-00003.snappy.parquet
>  
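> To double-check that layout, a listing along these lines should work (just a 
> sketch; it assumes boto3 and AWS credentials are available, and `bucket` is 
> the placeholder name from the example above):
> {code:java}
> # rough sanity check: list the keys under the data/ prefix
> # (list_objects_v2 returns at most 1000 keys, plenty for this example)
> import boto3
>
> s3 = boto3.client("s3")
> resp = s3.list_objects_v2(Bucket="bucket", Prefix="data/")
> for obj in resp.get("Contents", []):
>     print(obj["Key"])
> # expect keys like data/p=x/part-00001.snappy.parquet
> {code}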
> {code:java}
> # create a new partitioned table
> spark.sql("create table src (c1 string, c2 int) PARTITIONED BY (p string) STORED AS parquet LOCATION 's3://bucket/new/'")
> # load the saved data from s3, specifying the single partition value x
> spark.sql("LOAD DATA INPATH 's3://bucket/data/' INTO TABLE src PARTITION (p='x')")
> spark.sql("select * from src").show()
> # output:
> # +---+---+---+
> # | c1| c2|  p|
> # +---+---+---+
> # +---+---+---+
> {code}
> After running the `load data` command and looking at the table, I'm left 
> with no data loaded in. Checking S3, the source data we saved earlier has 
> been moved under `s3://bucket/new/`, and oddly enough it also brought over 
> the partition I didn't specify; the resulting directory structure is listed 
> below.
> - s3://bucket/new/
>     - p=x/
>         - p=x/
>             - part-00001.snappy.parquet
>         - p=y/
>             - part-00002.snappy.parquet
>             - part-00003.snappy.parquet
> Is this the intended behavior when loading data from a partitioned Parquet 
> dataset? And are the source files supposed to be moved/deleted from the 
> source directory?
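> If it helps narrow things down: my working assumption is that LOAD DATA with 
> a PARTITION spec expects INPATH to contain only that partition's files, so 
> something like the following may be the intended usage (a sketch, not 
> verified here; same table and paths as above):
> {code:java}
> # hypothetical corrected load: point INPATH at the partition's own
> # directory so only the p=x files are moved into the table location
> spark.sql("LOAD DATA INPATH 's3://bucket/data/p=x/' INTO TABLE src PARTITION (p='x')")
> spark.sql("select * from src").show()
> {code}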


