An update on this issue: Spark is now able to read the LZO file,
recognizes that it has an index, and starts multiple map tasks. You need
to use the following function instead of textFile:

csv = sc.newAPIHadoopFile(opts.input,
                          "com.hadoop.mapreduce.LzoTextInputFormat",
                          "org.apache.hadoop.io.LongWritable",
                          "org.apache.hadoop.io.Text").count()
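
If you need the actual lines rather than just a count, note that the
resulting RDD holds (offset, line) pairs, so you take the values to get
the text. A minimal sketch, assuming as above that opts.input points to
an .lzo file with its .index next to it:

pairs = sc.newAPIHadoopFile(opts.input,
                            "com.hadoop.mapreduce.LzoTextInputFormat",
                            "org.apache.hadoop.io.LongWritable",
                            "org.apache.hadoop.io.Text")
# each element is (byte offset, line of text); keep only the text
lines = pairs.map(lambda kv: kv[1])
print(lines.count())

The index itself can be built with the indexer that ships in hadoop-lzo,
e.g. (the jar path here is just a placeholder for your install):

hadoop jar /path/to/hadoop-lzo.jar \
    com.hadoop.compression.lzo.DistributedLzoIndexer /path/to/file.lzo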

- Gurvinder
On 07/03/2014 06:24 PM, Gurvinder Singh wrote:
> Hi all,
> 
> I am trying to read LZO files. It seems Spark recognizes that the
> input file is compressed and gets the decompressor:
> 
> 14/07/03 18:11:01 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
> 14/07/03 18:11:01 INFO lzo.LzoCodec: Successfully loaded & initialized
> native-lzo library [hadoop-lzo rev
> ee825cb06b23d3ab97cdd87e13cbbb630bd75b98]
> 14/07/03 18:11:01 INFO Configuration.deprecation: hadoop.native.lib is
> deprecated. Instead, use io.native.lib.available
> 14/07/03 18:11:01 INFO compress.CodecPool: Got brand-new decompressor
> [.lzo]
> 
> But there are two issues:
> 
> 1. It just gets stuck here without doing anything; I waited 15 minutes
> for a small file.
> 2. I used hadoop-lzo to create the index so that Spark can split the
> input into multiple maps, but Spark creates only one mapper.
> 
> I am using Python, reading with sc.textFile(). The Spark version is
> current git master.
> 
> Regards,
> Gurvinder
> 
