Re: reading compress lzo files
I found it quite painful to figure out all the steps required, and have filed SPARK-2394 (https://issues.apache.org/jira/browse/SPARK-2394) to track improving this. Perhaps I have been going about it the wrong way, but it seems way more painful than it should be to set up a Spark cluster built using spark-ec2 to read LZO-compressed input.

Nick
Re: reading compress lzo files
On 07/06/2014 05:19 AM, Nicholas Chammas wrote:
> On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh gurvinder.si...@uninett.no wrote:
>> csv = sc.newAPIHadoopFile(opts.input, "com.hadoop.mapreduce.LzoTextInputFormat", "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text").count()
>
> Does anyone know what the rough equivalent of this would be in the Scala API?

I am not sure; I haven't tested it using Scala. The com.hadoop.mapreduce.LzoTextInputFormat class is from this package: https://github.com/twitter/hadoop-lzo. I have installed it from the Cloudera hadoop-lzo package, along with the liblzo2-2 Debian package, on all of my workers. Make sure you have hadoop-lzo.jar in your classpath for Spark.

- Gurvinder

> I am trying the following, but the first import yields an error on my spark-ec2 cluster:
>
>     import com.hadoop.mapreduce.LzoTextInputFormat
>     import org.apache.hadoop.io.LongWritable
>     import org.apache.hadoop.io.Text
>
>     sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data", LzoTextInputFormat, LongWritable, Text)
>
>     scala> import com.hadoop.mapreduce.LzoTextInputFormat
>     <console>:12: error: object hadoop is not a member of package com
>            import com.hadoop.mapreduce.LzoTextInputFormat
>
> Nick
Re: reading compress lzo files
Ah, indeed it looks like I need to install this separately, as it is not part of the core: https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1

Nick

On Sun, Jul 6, 2014 at 2:22 AM, Gurvinder Singh gurvinder.si...@uninett.no wrote:
> [...]
Re: reading compress lzo files
Pardon, I was wrong about this. There is actually code distributed under com.hadoop, and that's where this class is. Oops.
https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/source/browse/trunk/src/java/com/hadoop/mapreduce/LzoTextInputFormat.java

On Sun, Jul 6, 2014 at 6:37 AM, Sean Owen so...@cloudera.com wrote:
> The package com.hadoop.mapreduce certainly looks wrong. If it is a Hadoop class, it starts with org.apache.hadoop.
>
> On Jul 6, 2014 4:20 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
>> [...]
Re: reading compress lzo files
Hi Nick,

The cluster I was working on in those linked messages was a private data center cluster, not on EC2. I'd imagine that the setup would be pretty similar, but I'm not familiar with the EC2 init scripts that Spark uses. Also, I upgraded that cluster to 1.0 recently and am continuing to use LZO-compressed data, so I know there's not a version issue.

Andrew

On Sun, Jul 6, 2014 at 12:02 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
> I've been reading through several pages trying to figure out how to set up my spark-ec2 cluster to read LZO-compressed files from S3.
>
> - http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E
> - http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E
> - https://github.com/twitter/hadoop-lzo
> - http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
>
> It seems that several things may have changed since the above pages were put together, so getting this to work is more work than I expected. Is there a simple set of instructions somewhere one can follow to get a Spark EC2 cluster reading LZO-compressed input files correctly?
>
> Nick
>
> On Sun, Jul 6, 2014 at 10:55 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
>> Ah, indeed it looks like I need to install this separately, as it is not part of the core: https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1
>>
>> Nick
>>
>> On Sun, Jul 6, 2014 at 2:22 AM, Gurvinder Singh gurvinder.si...@uninett.no wrote:
>>> [...]
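For what it's worth, the setup Gurvinder describes earlier in the thread can be sketched roughly as below. The package name and liblzo2-2 come from his message; the jar path and the spark-env.sh approach are assumptions for a Debian-based cluster, not tested spark-ec2 instructions:

```shell
# 1. Native LZO library on every worker (per Gurvinder's setup):
sudo apt-get install -y liblzo2-2

# 2. Get hadoop-lzo.jar onto every node, either from the Cloudera
#    hadoop-lzo package or built from https://github.com/twitter/hadoop-lzo.
#    The path below is an assumed install location.

# 3. Put the jar on Spark's classpath, e.g. in conf/spark-env.sh:
export SPARK_CLASSPATH=/usr/lib/hadoop/lib/hadoop-lzo.jar:$SPARK_CLASSPATH
```

The same jar and native library need to be present on the driver and all workers, which is presumably why the spark-ec2 init scripts would need to handle this.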
Re: reading compress lzo files
On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh gurvinder.si...@uninett.no wrote:
> csv = sc.newAPIHadoopFile(opts.input, "com.hadoop.mapreduce.LzoTextInputFormat", "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text").count()

Does anyone know what the rough equivalent of this would be in the Scala API?

I am trying the following, but the first import yields an error on my spark-ec2 cluster:

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.LongWritable
    import org.apache.hadoop.io.Text

    sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data", LzoTextInputFormat, LongWritable, Text)

    scala> import com.hadoop.mapreduce.LzoTextInputFormat
    <console>:12: error: object hadoop is not a member of package com
           import com.hadoop.mapreduce.LzoTextInputFormat

Nick
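A rough sketch of what the Scala equivalent might look like: the Scala API takes Class objects rather than the string class names the Python API uses, and the import error above suggests hadoop-lzo.jar was not on the classpath of that cluster, so this sketch assumes the jar is present on driver and workers:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.hadoop.io.{LongWritable, Text}
    import com.hadoop.mapreduce.LzoTextInputFormat

    object LzoCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("lzo-count"))
        // Pass the input format and key/value types as classOf references,
        // not as strings:
        val records = sc.newAPIHadoopFile(
          "s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
          classOf[LzoTextInputFormat],
          classOf[LongWritable],
          classOf[Text])
        println(records.count())
        sc.stop()
      }
    }

This is untested against that dataset; it only illustrates the shape of the call.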
Re: reading compress lzo files
The package com.hadoop.mapreduce certainly looks wrong. If it is a Hadoop class, it starts with org.apache.hadoop.

On Jul 6, 2014 4:20 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
> [...]
Re: reading compress lzo files
An update on this issue: Spark is now able to read the LZO file, recognizes that it has an index, and starts multiple map tasks. You need to use the following function instead of textFile:

    csv = sc.newAPIHadoopFile(opts.input, "com.hadoop.mapreduce.LzoTextInputFormat", "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text").count()

- Gurvinder

On 07/03/2014 06:24 PM, Gurvinder Singh wrote:
> Hi all,
>
> I am trying to read LZO files. It seems Spark recognizes that the input file is compressed and gets the decompressor:
>
>     14/07/03 18:11:01 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
>     14/07/03 18:11:01 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev ee825cb06b23d3ab97cdd87e13cbbb630bd75b98]
>     14/07/03 18:11:01 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
>     14/07/03 18:11:01 INFO compress.CodecPool: Got brand-new decompressor [.lzo]
>
> But it has two issues:
>
> 1. It just gets stuck here without doing anything; I waited 15 min for a small file.
> 2. I used hadoop-lzo to create the index so that Spark can split the input to multiple maps, but Spark creates only one mapper.
>
> I am using Python, reading with sc.textFile(). The Spark version is from the git master.
>
> Regards,
> Gurvinder
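The index creation mentioned above can be sketched as follows; the indexer classes come from https://github.com/twitter/hadoop-lzo, while the jar path and file path here are assumptions:

```shell
# Build the .index file that lets the LZO input be split across
# multiple map tasks. Without it, the whole file goes to one task.
hadoop jar /usr/lib/hadoop/lib/hadoop-lzo.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer \
  /path/to/data.lzo
# For small files, the single-process variant also exists:
#   com.hadoop.compression.lzo.LzoIndexer
```

This writes data.lzo.index alongside the file; LzoTextInputFormat then uses it to compute splits, which is why textFile() alone still produced a single mapper.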