[ https://issues.apache.org/jira/browse/SPARK-21938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Abbi McClintic updated SPARK-21938:
-----------------------------------
    Description: 
Hello,

My team has been experiencing a recurring, unpredictable bug in which only a partial write to CSV in S3 is performed for one partition of our Dataset. For example, in a Dataset of 10 partitions written to CSV in S3, we might see 9 of the partitions at 2.8 GB in size but one at only 1.6 GB. The job nevertheless does not fail.

This becomes problematic in the following ways:
1. When we copy the data to Redshift, we get a bad decrypt error on the partial file, suggesting that the write stopped at an arbitrary byte offset rather than on a record boundary.
2. We lose data - sometimes as much as 10%.

We don't see this problem with Parquet, which we also use, but moving all of our data to Parquet is not currently feasible. We're using the Java API.

Any help on resolving this would be much appreciated.

  was: (the previous description was identical, except that it did not yet note that we're using the Java API)


> Spark partial CSV write fails silently
> --------------------------------------
>
>                 Key: SPARK-21938
>                 URL: https://issues.apache.org/jira/browse/SPARK-21938
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API, Spark Core
>    Affects Versions: 2.2.0
>        Environment: Amazon EMR 5.8, varying instance types
>            Reporter: Abbi McClintic
>
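For context, the failing write is roughly of the following shape. This is a minimal sketch only: the 10-way partitioning, the CSV output to S3, and the use of the Java API come from the description above, while the bucket names, the Parquet input source, the header option, and the class name CsvWriteRepro are illustrative placeholders rather than details from the report.

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class CsvWriteRepro {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-partial-write-repro")
                .getOrCreate();

        // Source data for the job; the real input is not specified in the report,
        // so a Parquet input under a placeholder bucket is assumed here.
        Dataset<Row> data = spark.read().parquet("s3://example-bucket/input/");

        // Write the Dataset as CSV to S3 across 10 partitions, matching the
        // scenario in the description (one of the resulting part files comes out short).
        data.repartition(10)
            .write()
            .mode(SaveMode.Overwrite)
            .option("header", "true")
            .csv("s3://example-bucket/output/csv/");

        spark.stop();
    }
}
{code}

On the Amazon EMR 5.8 environment noted above (which ships Spark 2.2.0), a job like this would be packaged and launched with spark-submit; because the job reports success, the undersized partition file is only discovered afterwards, for example when the Redshift copy fails.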