[jira] [Commented] (SPARK-18414) sc.textFile doesn't seem to use LzoTextInputFormat when hadoop-lzo is installed

Sean Owen (JIRA) Mon, 14 Nov 2016 01:54:13 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-18414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15663327#comment-15663327
 ]


Sean Owen commented on SPARK-18414:
-----------------------------------

I suppose it depends on how common this is. Core Hadoop and therefore Spark 
already support the common compression codecs out of the box, and I think Spark 
would just inherit Hadoop's support unless there were a strong reason to add 
something further. In this case, if it requires GPL code, Spark can't ship it 
directly anyway. You can, however, add it to your app if you like, and that 
seems like it might be sufficient for the use case as described.

> sc.textFile doesn't seem to use LzoTextInputFormat when hadoop-lzo is 
> installed
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18414
>                 URL: https://issues.apache.org/jira/browse/SPARK-18414
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.1
>            Reporter: Renan Vicente Gomes da Silva
>            Priority: Minor
>              Labels: hadoop-lzo
>
> When reading LZO files using sc.textFile, it misses a few files from time to 
> time.
> Sample:
>       val Data = sc.textFile(Files)
>       listFiles += Data.count()
> Here Files is an HDFS directory containing LZO files. If this is executed, 
> for example, 1000 times, it returns different results a few times.
> If you instead use newAPIHadoopFile to force it to use 
> com.hadoop.mapreduce.LzoTextInputFormat, it works perfectly and returns the 
> same result in every execution.
> Sample:
>       val Data = sc.newAPIHadoopFile(Files,
>         classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>         classOf[org.apache.hadoop.io.LongWritable],
>         classOf[org.apache.hadoop.io.Text]).map(_._2.toString)
>       listFiles += Data.count()
> Looking at the Spark code, it looks like it uses TextInputFormat by default 
> and does not use com.hadoop.mapreduce.LzoTextInputFormat when hadoop-lzo is 
> installed.
> https://github.com/apache/spark/blob/v2.0.1/core/src/main/scala/org/apache/spark/SparkContext.scala#L795-L801
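
The report above rests on running the same count many times and comparing the results. A minimal sketch of such a check follows, in plain Scala with no Spark dependency. The object RepeatCheck and its methods tally and isStable are hypothetical names introduced only for illustration; they are not part of Spark, Hadoop, or hadoop-lzo.

```scala
// Hypothetical helper for illustration only; not part of Spark or hadoop-lzo.
object RepeatCheck {

  /** Runs `count` `runs` times and tallies how often each result occurred. */
  def tally(runs: Int)(count: () => Long): Map[Long, Int] =
    (1 to runs)
      .map(_ => count())                                 // collect one result per run
      .groupBy(identity)                                 // group equal results together
      .map { case (value, hits) => value -> hits.size }  // result -> number of occurrences

  /** True when every run returned the same value (or there were no runs). */
  def isStable(results: Map[Long, Int]): Boolean = results.size <= 1
}
```

In a Spark shell, assuming sc and Files are defined as in the samples above, one could pass `() => sc.textFile(Files).count()` as the count function and compare the tally with the one obtained from the newAPIHadoopFile variant. More than one key in the resulting map would indicate the run-to-run variation described in the report.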



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

