[ https://issues.apache.org/jira/browse/SPARK-15654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15306731#comment-15306731 ]
Jurriaan Pruis commented on SPARK-15654:
----------------------------------------

cc [~davies] [~marmbrus] I saw you worked on code regarding the {{FileSourceStrategy}}. Maybe you know more about this issue and how to fix it?

> Reading gzipped files results in duplicate rows
> -----------------------------------------------
>
>                 Key: SPARK-15654
>                 URL: https://issues.apache.org/jira/browse/SPARK-15654
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Jurriaan Pruis
>            Priority: Blocker
>
> When gzipped files are larger than {{spark.sql.files.maxPartitionBytes}},
> reading the file will result in duplicate rows in the DataFrame.
> Given an example gzipped wordlist (of 740K bytes):
> {code}
> $ gzcat words.gz | wc -l
> 235886
> {code}
> Reading it using Spark results in the following output:
> {code}
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '1000')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 81244093
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '10000')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 8348566
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '100000')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 1051469
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '1000000')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 235886
> {code}
> You can clearly see that the row count scales with the number of
> partitions: the data is somehow duplicated whenever the file is split
> into more than one partition, by a factor that, as seen above, is
> roughly inversely proportional to the partition size.
> Using distinct() you get the correct answer:
> {code}
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '10000')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").distinct().count()
> 235886
> {code}
> This looks like a pretty serious bug.
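As a side note, the scaling pattern above is consistent with each partition task emitting the *whole* file: gzip is not a splittable format, so a reader handed a byte range of a gzip file must still decompress from byte 0. The following is a minimal standalone Python sketch of that failure mode (not Spark's actual code; the file, line count, and partition count are made up for illustration):

```python
import gzip

# Build a small gzipped "file" of 1000 lines in memory.
lines = [f"word{i}\n" for i in range(1000)]
blob = gzip.compress("".join(lines).encode())

def read_partition(data, start, length):
    # A gzip stream can only be decoded starting at byte 0, so a naive
    # reader that is assigned the byte range [start, start+length) still
    # decompresses the entire stream and returns every line.
    return gzip.decompress(data).decode().splitlines()

num_partitions = 4
chunk = len(blob) // num_partitions
rows = []
for i in range(num_partitions):
    rows += read_partition(blob, i * chunk, chunk)

print(len(rows))  # 4000: every partition re-emits all 1000 lines
```

If {{FileSourceStrategy}} splits a gzipped file into maxPartitionBytes-sized ranges but each task decodes the full stream, the count would multiply by the number of partitions exactly as observed, which would also explain why distinct() recovers the true count on a unique wordlist.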
-- This message was sent by Atlassian JIRA (v6.3.4#6332)