Re: reading compress lzo files

2014-07-07 Thread Nicholas Chammas
I found it quite painful to figure out all the steps required and have
filed SPARK-2394 https://issues.apache.org/jira/browse/SPARK-2394 to
track improving this. Perhaps I have been going about it the wrong way, but
it seems way more painful than it should be to set up a Spark cluster built
using spark-ec2 to read LZO-compressed input.

Nick


Re: reading compress lzo files

2014-07-06 Thread Gurvinder Singh
On 07/06/2014 05:19 AM, Nicholas Chammas wrote:
 On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh
 gurvinder.si...@uninett.no wrote:
 
 csv = sc.newAPIHadoopFile(opts.input, "com.hadoop.mapreduce.LzoTextInputFormat", "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text").count()
 
 Does anyone know what the rough equivalent of this would be in the Scala
 API?
 
I am not sure; I haven't tested it using Scala. The
com.hadoop.mapreduce.LzoTextInputFormat class is from this package:
https://github.com/twitter/hadoop-lzo

I installed it from the Cloudera hadoop-lzo package, along with the
liblzo2-2 Debian package, on all of my workers. Make sure you have
hadoop-lzo.jar on your classpath for Spark.
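
A minimal sketch of wiring that up through SparkConf when creating the
context (untested; the jar and native-library paths are assumptions,
adjust them to wherever your distribution installs hadoop-lzo and liblzo2):

    import org.apache.spark.{SparkConf, SparkContext}

    // Assumed install locations; check your own nodes.
    val conf = new SparkConf()
      .setAppName("lzo-example")
      .set("spark.executor.extraClassPath", "/usr/lib/hadoop/lib/hadoop-lzo.jar")
      .set("spark.executor.extraLibraryPath", "/usr/lib/hadoop/lib/native")
    val sc = new SparkContext(conf)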

- Gurvinder

 I am trying the following, but the first import yields an error on my
 spark-ec2 cluster:
 
 import com.hadoop.mapreduce.LzoTextInputFormat
 import org.apache.hadoop.io.LongWritable
 import org.apache.hadoop.io.Text
 
 sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
  LzoTextInputFormat, LongWritable, Text)
 
 scala> import com.hadoop.mapreduce.LzoTextInputFormat
 <console>:12: error: object hadoop is not a member of package com
        import com.hadoop.mapreduce.LzoTextInputFormat
 
 Nick




Re: reading compress lzo files

2014-07-06 Thread Nicholas Chammas
Ah, indeed it looks like I need to install this separately
https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1
as it is not part of the core.

Nick



On Sun, Jul 6, 2014 at 2:22 AM, Gurvinder Singh gurvinder.si...@uninett.no
wrote:

 On 07/06/2014 05:19 AM, Nicholas Chammas wrote:
  On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh
  gurvinder.si...@uninett.no wrote:
 
  csv = sc.newAPIHadoopFile(opts.input, "com.hadoop.mapreduce.LzoTextInputFormat", "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text").count()
 
  Does anyone know what the rough equivalent of this would be in the Scala
  API?
 
 I am not sure; I haven't tested it using Scala. The
 com.hadoop.mapreduce.LzoTextInputFormat class is from this package:
 https://github.com/twitter/hadoop-lzo

 I installed it from the Cloudera hadoop-lzo package, along with the
 liblzo2-2 Debian package, on all of my workers. Make sure you have
 hadoop-lzo.jar on your classpath for Spark.

 - Gurvinder

  I am trying the following, but the first import yields an error on my
  spark-ec2 cluster:
 
  import com.hadoop.mapreduce.LzoTextInputFormat
  import org.apache.hadoop.io.LongWritable
  import org.apache.hadoop.io.Text
 
  sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
  LzoTextInputFormat, LongWritable, Text)
 
  scala> import com.hadoop.mapreduce.LzoTextInputFormat
  <console>:12: error: object hadoop is not a member of package com
         import com.hadoop.mapreduce.LzoTextInputFormat
 
  Nick





Re: reading compress lzo files

2014-07-06 Thread Sean Owen
Pardon, I was wrong about this. There is actually code distributed
under com.hadoop, and that's where this class is. Oops.

https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/source/browse/trunk/src/java/com/hadoop/mapreduce/LzoTextInputFormat.java
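
For reference, a rough Scala equivalent of the PySpark call quoted below
might look like this (an untested sketch: it assumes hadoop-lzo.jar is on
the driver and executor classpath, and note that the Scala API takes Class
objects via classOf where the Python API takes class-name strings):

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    // LzoTextInputFormat implements the new (org.apache.hadoop.mapreduce)
    // API, so it goes through newAPIHadoopFile rather than hadoopFile.
    val csv = sc.newAPIHadoopFile(
      "s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
      classOf[LzoTextInputFormat],
      classOf[LongWritable],
      classOf[Text])
    csv.count()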

On Sun, Jul 6, 2014 at 6:37 AM, Sean Owen so...@cloudera.com wrote:
 The package com.hadoop.mapreduce certainly looks wrong. If it is a Hadoop
 class, it starts with org.apache.hadoop.

 On Jul 6, 2014 4:20 AM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh
 gurvinder.si...@uninett.no wrote:

 csv = sc.newAPIHadoopFile(opts.input, "com.hadoop.mapreduce.LzoTextInputFormat", "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text").count()

 Does anyone know what the rough equivalent of this would be in the Scala
 API?

 I am trying the following, but the first import yields an error on my
 spark-ec2 cluster:

 import com.hadoop.mapreduce.LzoTextInputFormat
 import org.apache.hadoop.io.LongWritable
 import org.apache.hadoop.io.Text


 sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
 LzoTextInputFormat, LongWritable, Text)

 scala> import com.hadoop.mapreduce.LzoTextInputFormat
 <console>:12: error: object hadoop is not a member of package com
        import com.hadoop.mapreduce.LzoTextInputFormat

 Nick


Re: reading compress lzo files

2014-07-06 Thread Andrew Ash
Hi Nick,

The cluster I was working on in those linked messages was a private data
center cluster, not on EC2.  I'd imagine that the setup would be pretty
similar, but I'm not familiar with the EC2 init scripts that Spark uses.

Also, I upgraded that cluster to 1.0 recently and am continuing to use
LZO-compressed data, so I know it's not a version issue.

Andrew


On Sun, Jul 6, 2014 at 12:02 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 I’ve been reading through several pages trying to figure out how to set up
 my spark-ec2 cluster to read LZO-compressed files from S3.

 - http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E
 - http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E
 - https://github.com/twitter/hadoop-lzo
 - http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/

 It seems that several things may have changed since the above pages were
 put together, so getting this to work is more work than I expected.

 Is there a simple set of instructions somewhere one can follow to get a
 Spark EC2 cluster reading LZO-compressed input files correctly?

 Nick


 On Sun, Jul 6, 2014 at 10:55 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Ah, indeed it looks like I need to install this separately
 https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1
 as it is not part of the core.

 Nick



 On Sun, Jul 6, 2014 at 2:22 AM, Gurvinder Singh 
 gurvinder.si...@uninett.no wrote:

 On 07/06/2014 05:19 AM, Nicholas Chammas wrote:
  On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh
  gurvinder.si...@uninett.no
 wrote:
 
  csv = sc.newAPIHadoopFile(opts.input, "com.hadoop.mapreduce.LzoTextInputFormat", "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text").count()
 
  Does anyone know what the rough equivalent of this would be in the
  Scala API?
 
 I am not sure; I haven't tested it using Scala. The
 com.hadoop.mapreduce.LzoTextInputFormat class is from this package:
 https://github.com/twitter/hadoop-lzo

 I installed it from the Cloudera hadoop-lzo package, along with the
 liblzo2-2 Debian package, on all of my workers. Make sure you have
 hadoop-lzo.jar on your classpath for Spark.

 - Gurvinder

  I am trying the following, but the first import yields an error on my
  spark-ec2 cluster:
 
  import com.hadoop.mapreduce.LzoTextInputFormat
  import org.apache.hadoop.io.LongWritable
  import org.apache.hadoop.io.Text
 
  sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
  LzoTextInputFormat, LongWritable, Text)
 
  scala> import com.hadoop.mapreduce.LzoTextInputFormat
  <console>:12: error: object hadoop is not a member of package com
         import com.hadoop.mapreduce.LzoTextInputFormat
 
  Nick







Re: reading compress lzo files

2014-07-05 Thread Nicholas Chammas
On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh gurvinder.si...@uninett.no
wrote:

csv = sc.newAPIHadoopFile(opts.input, "com.hadoop.mapreduce.LzoTextInputFormat", "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text").count()

Does anyone know what the rough equivalent of this would be in the Scala
API?

I am trying the following, but the first import yields an error on my
spark-ec2 cluster:

import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text

sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
LzoTextInputFormat, LongWritable, Text)

scala> import com.hadoop.mapreduce.LzoTextInputFormat
<console>:12: error: object hadoop is not a member of package com
       import com.hadoop.mapreduce.LzoTextInputFormat

Nick


Re: reading compress lzo files

2014-07-05 Thread Sean Owen
The package com.hadoop.mapreduce certainly looks wrong. If it is a Hadoop
class, it starts with org.apache.hadoop.
On Jul 6, 2014 4:20 AM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:

 On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh 
 gurvinder.si...@uninett.no wrote:

 csv = sc.newAPIHadoopFile(opts.input, "com.hadoop.mapreduce.LzoTextInputFormat", "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text").count()

 Does anyone know what the rough equivalent of this would be in the Scala
 API?

 I am trying the following, but the first import yields an error on my
 spark-ec2 cluster:

 import com.hadoop.mapreduce.LzoTextInputFormat
 import org.apache.hadoop.io.LongWritable
 import org.apache.hadoop.io.Text

 sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
  LzoTextInputFormat, LongWritable, Text)

 scala> import com.hadoop.mapreduce.LzoTextInputFormat
 <console>:12: error: object hadoop is not a member of package com
        import com.hadoop.mapreduce.LzoTextInputFormat

 Nick



Re: reading compress lzo files

2014-07-04 Thread Gurvinder Singh
An update on this issue: Spark is now able to read the LZO file,
recognize that it has an index, and start multiple map tasks. You need
to use the following function instead of textFile:

csv = sc.newAPIHadoopFile(opts.input, "com.hadoop.mapreduce.LzoTextInputFormat", "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text").count()

- Gurvinder
On 07/03/2014 06:24 PM, Gurvinder Singh wrote:
 Hi all,
 
 I am trying to read LZO files. It seems Spark recognizes that the
 input file is compressed and gets the decompressor:
 
 14/07/03 18:11:01 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
 14/07/03 18:11:01 INFO lzo.LzoCodec: Successfully loaded & initialized
 native-lzo library [hadoop-lzo rev
 ee825cb06b23d3ab97cdd87e13cbbb630bd75b98]
 14/07/03 18:11:01 INFO Configuration.deprecation: hadoop.native.lib is
 deprecated. Instead, use io.native.lib.available
 14/07/03 18:11:01 INFO compress.CodecPool: Got brand-new decompressor
 [.lzo]
 
 But there are two issues:
 
 1. It just gets stuck here without doing anything; I waited 15 minutes
 for a small file.
 2. I used hadoop-lzo to create the index so that Spark can split the
 input across multiple maps, but Spark creates only one mapper.
 
 I am using Python, reading with sc.textFile(). The Spark version is
 from git master.
 
 Regards,
 Gurvinder
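
One quick way to confirm the index is actually being used: with
LzoTextInputFormat and a .index file alongside the .lzo file, the number
of partitions should track the number of LZO splits instead of staying
at 1. A hedged Scala sketch (the input path is hypothetical):

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    // Hypothetical path; point this at an indexed .lzo file.
    val rdd = sc.newAPIHadoopFile("hdfs:///data/input.lzo",
      classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])
    // Without the .index file this will typically report one partition.
    println(rdd.partitions.length)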