[ https://issues.apache.org/jira/browse/SPARK-15654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15307043#comment-15307043 ]
Takeshi Yamamuro edited comment on SPARK-15654 at 5/31/16 11:10 PM:
--------------------------------------------------------------------

It seems the root cause is that LineRecordReader cannot split files compressed by some codecs. Hadoop v2.8+ throws an exception if such files are passed into LineRecordReader (https://github.com/apache/hadoop/blame/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L103); this check was added in MAPREDUCE-2094. One solution is to use Long.MaxValue as defaultMaxSplitBytes in FileSourceStrategy if unsplittable files are detected there: https://github.com/apache/spark/compare/master...maropu:SPARK-15654

was (Author: maropu):
Seems a root cause is that LineRecordReader cannot split files compressed by some codecs. hadoop-v2.8+ throws an exception if these kinds of files are passed into LineRecordReader (https://github.com/apache/hadoop/blame/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L103) This is fixed in MAPREDUCE-2094. One solution is to set Long.MaxValue at defaultMaxSplitBytes in FileSourceStrategy if splittable files detected there. https://github.com/apache/spark/compare/master...maropu:SPARK-15654

> Reading gzipped files results in duplicate rows
> -----------------------------------------------
>
>                 Key: SPARK-15654
>                 URL: https://issues.apache.org/jira/browse/SPARK-15654
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Jurriaan Pruis
>            Priority: Blocker
>
> When gzipped files are larger than {{spark.sql.files.maxPartitionBytes}}, reading the file will result in duplicate rows in the DataFrame.
> Given an example gzipped wordlist (of 740K bytes):
> {code}
> $ gzcat words.gz | wc -l
> 235886
> {code}
> Reading it using Spark results in the following output:
> {code}
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '1000')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 81244093
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '10000')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 8348566
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '100000')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 1051469
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '1000000')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 235886
> {code}
> You can clearly see how the number of rows scales with the number of partitions.
> Somehow the data is duplicated when the number of partitions exceeds one (and, as seen above, the amount of duplication roughly scales with the number of partitions).
> When using {{distinct()}} you get the correct count:
> {code}
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '10000')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").distinct().count()
> 235886
> {code}
> This looks like a pretty serious bug.
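For illustration, a minimal Scala sketch of the idea in the comment above (the standalone object and helper names are hypothetical, not Spark's actual FileSourceStrategy code): check whether every input file uses a splittable codec via Hadoop's CompressionCodecFactory, and fall back to Long.MaxValue for the max split size when one does not, so an unsplittable file (e.g. gzip) is never split across partitions.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

object SplitSizeSketch {
  // A file is splittable if it is uncompressed or its codec supports splitting
  // (gzip does not implement SplittableCompressionCodec; bzip2 does).
  def isSplittable(path: Path, conf: Configuration): Boolean = {
    val codec = new CompressionCodecFactory(conf).getCodec(path)
    codec == null || codec.isInstanceOf[SplittableCompressionCodec]
  }

  // Keep the configured limit only when every file can be split; otherwise
  // disable splitting so a gzipped file is read by exactly one task.
  def chooseMaxSplitBytes(files: Seq[Path], conf: Configuration, defaultMaxSplitBytes: Long): Long = {
    if (files.forall(isSplittable(_, conf))) defaultMaxSplitBytes else Long.MaxValue
  }
}
{code}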