An update on this issue: Spark is now able to read the LZO file, recognizes that it has an index, and starts multiple map tasks. You need to use the following function instead of textFile:
csv = sc.newAPIHadoopFile(opts.input,
                          "com.hadoop.mapreduce.LzoTextInputFormat",
                          "org.apache.hadoop.io.LongWritable",
                          "org.apache.hadoop.io.Text").count()

- Gurvinder

On 07/03/2014 06:24 PM, Gurvinder Singh wrote:
> Hi all,
>
> I am trying to read LZO files. It seems Spark recognizes that the
> input file is compressed and gets the decompressor:
>
> 14/07/03 18:11:01 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
> 14/07/03 18:11:01 INFO lzo.LzoCodec: Successfully loaded & initialized
> native-lzo library [hadoop-lzo rev ee825cb06b23d3ab97cdd87e13cbbb630bd75b98]
> 14/07/03 18:11:01 INFO Configuration.deprecation: hadoop.native.lib is
> deprecated. Instead, use io.native.lib.available
> 14/07/03 18:11:01 INFO compress.CodecPool: Got brand-new decompressor [.lzo]
>
> But there are two issues:
>
> 1. It just gets stuck here without doing anything; I waited 15 min for a
> small file.
> 2. I used hadoop-lzo to create the index so that Spark can split the
> input into multiple maps, but Spark creates only one mapper.
>
> I am using Python, reading with sc.textFile(). The Spark version is
> from the git master.
>
> Regards,
> Gurvinder
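For reference, the one-liner above can be expanded into a small helper. This is only a sketch: the helper name and the example path are made up, but the input-format class comes from the hadoop-lzo project and the key/value classes are standard Hadoop writables. The function takes the SparkContext as a parameter, so it does not itself depend on a running cluster:

```python
def read_lzo_text(sc, path):
    """Return an RDD of (byte offset, line) pairs from an LZO-compressed
    text file. When an .index file produced by hadoop-lzo sits next to
    the .lzo file, LzoTextInputFormat uses it to split the input across
    multiple map tasks instead of one."""
    return sc.newAPIHadoopFile(
        path,
        "com.hadoop.mapreduce.LzoTextInputFormat",  # from hadoop-lzo
        "org.apache.hadoop.io.LongWritable",        # key: byte offset
        "org.apache.hadoop.io.Text",                # value: line text
    )

# Usage inside a PySpark job (path is hypothetical):
#   rdd = read_lzo_text(sc, "hdfs:///data/events.lzo")
#   print(rdd.count())
```

Note that `count()` is an action, so the read is only triggered at that point; the `newAPIHadoopFile` call itself just builds the RDD.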