[jira] [Commented] (SPARK-40286) Load Data from S3 deletes data source file
[ https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17605509#comment-17605509 ]

Drew commented on SPARK-40286:
------------------------------

[https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables]

If the keyword LOCAL is _not_ specified, then Hive will either use the full URI of _filepath_, if one is specified, or will apply the following rules:
 * If scheme or authority are not specified, Hive will use the scheme and authority from the hadoop configuration variable {{fs.default.name}} that specifies the Namenode URI.
 * If the path is not absolute, then Hive will interpret it relative to {{/user/}}
 * Hive will _move_ the files addressed by _filepath_ into the table (or partition

> Load Data from S3 deletes data source file
> ------------------------------------------
>
>                 Key: SPARK-40286
>                 URL: https://issues.apache.org/jira/browse/SPARK-40286
>             Project: Spark
>          Issue Type: Question
>          Components: Documentation
>    Affects Versions: 3.2.1
>            Reporter: Drew
>            Priority: Major
>
> Hello,
> I'm using spark to [load data|https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html] into
> a hive table through Pyspark, and when I load data from a path in Amazon S3,
> the original file is getting wiped from the Directory. The file is found, and
> is populating the table with data. I also tried to add the `Local` clause but
> that throws an error when looking for the file. When looking through the
> documentation it doesn't explicitly state that this is the intended behavior.
> Thanks in advance!
> {code:java}
> spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile")
> spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE src"){code}

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
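The path-resolution rules quoted from the Hive manual can be sketched in a few lines of Python. This is only an illustration of the quoted documentation, not Hive's or Spark's actual implementation. The function name `resolve_load_path`, the example `default_fs` value, and the `user_dir` argument (standing in for the base directory that relative paths are resolved against) are assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit
import posixpath


def resolve_load_path(filepath, default_fs, user_dir):
    """Resolve a non-LOCAL LOAD DATA filepath following the quoted rules.

    default_fs: value of fs.default.name, e.g. "hdfs://namenode:8020"
                (hypothetical value, for illustration only).
    user_dir:   base directory for relative paths (hypothetical parameter).
    """
    parts = urlsplit(filepath)
    if parts.scheme:
        # A full URI was given, so it is used as-is.
        return filepath

    # No scheme/authority: borrow both from the default filesystem URI.
    default = urlsplit(default_fs)
    path = parts.path
    if not path.startswith("/"):
        # Relative path: interpret it relative to the user directory.
        path = posixpath.join(user_dir, path)
    return urlunsplit((default.scheme, default.netloc, path, "", ""))


if __name__ == "__main__":
    for p in ("s3://bucket/kv1.txt", "/data/kv1.txt", "kv1.txt"):
        print(p, "->", resolve_load_path(p, "hdfs://namenode:8020", "/user/drew"))
```

Under these rules a full URI such as `s3://bucket/kv1.txt` is used as given. According to the last quoted rule, the files at the resolved location are then moved, not copied, into the table, which would match the behavior reported in this issue.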
[jira] [Commented] (SPARK-40286) Load Data from S3 deletes data source file
[ https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17603826#comment-17603826 ]

Drew commented on SPARK-40286:
------------------------------

Hi [~ste...@apache.org],

Yeah, is there anything significant there that I should be looking for? When doing this with those same criteria I get the same results, and nothing in the logs raises any suspicion to me.
[jira] [Commented] (SPARK-40286) Load Data from S3 deletes data source file
[ https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598837#comment-17598837 ]

Steve Loughran commented on SPARK-40286:
----------------------------------------

This is EMR. Can you replicate it in an ASF Spark release through the s3a connector and committers?

If you can replicate it, especially in Spark standalone, turn Spark and org.apache.hadoop.fs.s3a logging on to debug and see what it says.
[jira] [Commented] (SPARK-40286) Load Data from S3 deletes data source file
[ https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598574#comment-17598574 ]

Sean R. Owen commented on SPARK-40286:
--------------------------------------

I could be completely wrong, but then I'd be just as surprised as you are if that's how this is meant to work. If so, it needs to be in the docs.
[jira] [Commented] (SPARK-40286) Load Data from S3 deletes data source file
[ https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598572#comment-17598572 ]

Drew commented on SPARK-40286:
------------------------------

[~srowen] Interesting, this is the only information I could find regarding the data source being moved. Is this how Spark's LOAD DATA works as well?

[https://stackoverflow.com/a/40182243/11558988]
[jira] [Commented] (SPARK-40286) Load Data from S3 deletes data source file
[ https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598568#comment-17598568 ]

Sean R. Owen commented on SPARK-40286:
--------------------------------------

No, LOAD DATA does not delete source data. I'm not sure what's happening here, but I suspect something else is removing those files.
[jira] [Commented] (SPARK-40286) Load Data from S3 deletes data source file
[ https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598556#comment-17598556 ]

Drew commented on SPARK-40286:
------------------------------

[~srowen] I see, the table is located in s3 in another bucket of mine. So now the file is being moved into the new source directory. So instead of it living in s3://bucket it's now in s3://bucket_two. Is that correct?
[jira] [Commented] (SPARK-40286) Load Data from S3 deletes data source file
[ https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598542#comment-17598542 ]

Sean R. Owen commented on SPARK-40286:
--------------------------------------

Where is src stored? LOAD DATA should not affect the source, but you are OVERWRITEing whatever is in src's storage.
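One way to answer the question of where `src` is stored is to read the table's location from the metastore. The following is a minimal sketch, assuming an active SparkSession named `spark` with Hive support. The helper name `table_location` is hypothetical, and the sketch relies on `DESCRIBE FORMATTED` returning a row whose `col_name` is `Location`, which is worth confirming on the Spark version in use.

```python
def table_location(spark, table):
    """Return the storage location reported for a table, or None if absent.

    `spark` is assumed to be an active SparkSession; `table` is a table name
    such as "src". The helper name is hypothetical.
    """
    rows = spark.sql(f"DESCRIBE FORMATTED {table}").collect()
    for row in rows:
        # DESCRIBE output has the columns col_name, data_type, comment;
        # the location value appears in the data_type column.
        if row["col_name"] == "Location":
            return row["data_type"]
    return None
```

Calling `table_location(spark, "src")` before running LOAD DATA shows which bucket and prefix the OVERWRITE will replace, which can then be compared with the source path.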
[jira] [Commented] (SPARK-40286) Load Data from S3 deletes data source file
[ https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598538#comment-17598538 ]

Drew commented on SPARK-40286:
------------------------------

Hi [~srowen],

In this case, before loading data into the table, my bucket in S3 has the file kv1.txt. Then, when I run the code block above, the file is removed from the s3 bucket directory. The data is in the table when I run:
{code:java}
spark.sql('select * from src').show(){code}
I was wondering if that's expected?
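If the source object has to stay where it is, one possible workaround is to skip LOAD DATA and instead read the file and insert its rows into the table. Nobody in this thread proposes or confirms this; it is a sketch under stated assumptions. The helper name `load_without_moving` is hypothetical, `spark` is assumed to be an active SparkSession, and the default separator `\x01` assumes the table uses Hive's default text-file field delimiter. Spark's CSV reader also handles quoting and escaping differently from Hive's text SerDe, so the result should be checked against the data.

```python
def load_without_moving(spark, source_path, table, sep="\x01"):
    """Fill `table` from `source_path` while leaving the source file in place.

    Reads the file with the table's own schema, then overwrites the table
    contents, mirroring the OVERWRITE in the original statement.
    """
    schema = spark.table(table).schema
    df = spark.read.csv(source_path, schema=schema, sep=sep)
    df.write.insertInto(table, overwrite=True)
```

With the table from the issue, this would be called as `load_without_moving(spark, "s3://bucket/kv1.txt", "src")`. The rows are rewritten into the table's own storage, so the object at the source path is only read.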
[jira] [Commented] (SPARK-40286) Load Data from S3 deletes data source file
[ https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598535#comment-17598535 ]

Drew commented on SPARK-40286:
------------------------------

In this case, before loading data into the table, my bucket in S3 has `kv1.txt`. Then, when I run the code block above, the file is removed from the s3 bucket directory. The data is in the table when I run `spark.sql('select * from src')`. I was wondering if that's expected?
[jira] [Commented] (SPARK-40286) Load Data from S3 deletes data source file
[ https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598510#comment-17598510 ]

Sean R. Owen commented on SPARK-40286:
--------------------------------------

There is no delete here. Why do you think Spark is deleting something vs something else you're doing? What files, where?