[ https://issues.apache.org/jira/browse/SPARK-18414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15657599#comment-15657599 ]
Sean Owen commented on SPARK-18414:
-----------------------------------

textFile uses TextInputFormat on purpose. If you want to use an alternative InputFormat, then yes, you are doing it correctly. So this already works, right?

> sc.textFile doesn't seem to use LzoTextInputFormat when hadoop-lzo is
> installed
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18414
>                 URL: https://issues.apache.org/jira/browse/SPARK-18414
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.1
>            Reporter: Renan Vicente Gomes da Silva
>            Priority: Minor
>              Labels: hadoop-lzo
>
> When reading LZO files using sc.textFile, it misses a few files from time to
> time.
> Sample:
> val Data = sc.textFile(Files)
> listFiles += Data.count()
> Here Files is an HDFS directory containing LZO files. If this is executed,
> for example, 1000 times, it returns different results a few of those times.
> If you instead use newAPIHadoopFile to force
> com.hadoop.mapreduce.LzoTextInputFormat, it works perfectly and shows the
> same result in every execution.
> Sample:
> val Data = sc.newAPIHadoopFile(Files,
>   classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>   classOf[org.apache.hadoop.io.LongWritable],
>   classOf[org.apache.hadoop.io.Text]).map(_._2.toString)
> listFiles += Data.count()
> Looking at the Spark code, it uses TextInputFormat by default and does not
> switch to com.hadoop.mapreduce.LzoTextInputFormat when hadoop-lzo is
> installed:
> https://github.com/apache/spark/blob/v2.0.1/core/src/main/scala/org/apache/spark/SparkContext.scala#L795-L801

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org