[ https://issues.apache.org/jira/browse/SPARK-18414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15663327#comment-15663327 ]
Sean Owen commented on SPARK-18414:
-----------------------------------

I suppose it depends on how common this is. Core Hadoop, and therefore Spark, already supports common compression codecs out of the box, and I think Spark would just inherit Hadoop's support unless there was a big reason to add something further. In this case, if it requires GPL code, Spark can't ship it directly anyway. You can, however, add it to your app if you like, and that seems like it might be sufficient given the use case now.

> sc.textFile doesn't seem to use LzoTextInputFormat when hadoop-lzo is installed
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18414
>                 URL: https://issues.apache.org/jira/browse/SPARK-18414
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.1
>            Reporter: Renan Vicente Gomes da Silva
>            Priority: Minor
>              Labels: hadoop-lzo
>
> When reading LZO files using sc.textFile, it misses a few files from time to
> time.
> Sample:
>       val Data = sc.textFile(Files)
>       listFiles += Data.count()
> Here, Files is an HDFS directory containing LZO files. If this is executed,
> for example, 1000 times, it gets different results a few times.
> If you instead use newAPIHadoopFile to force it to use
> com.hadoop.mapreduce.LzoTextInputFormat, it works perfectly and shows the same
> results in all executions.
> Sample:
>       val Data = sc.newAPIHadoopFile(Files,
>         classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>         classOf[org.apache.hadoop.io.LongWritable],
>         classOf[org.apache.hadoop.io.Text]).map(_._2.toString)
>       listFiles += Data.count()
> Looking at the Spark code, it looks like it uses TextInputFormat by default and
> does not use com.hadoop.mapreduce.LzoTextInputFormat when hadoop-lzo is
> installed.
> https://github.com/apache/spark/blob/v2.0.1/core/src/main/scala/org/apache/spark/SparkContext.scala#L795-L801