Re: reading compress lzo files

2014-07-07 Thread Nicholas Chammas
I found it quite painful to figure out all the steps required and have filed SPARK-2394 (https://issues.apache.org/jira/browse/SPARK-2394) to track improving this. Perhaps I have been going about it the wrong way, but it seems way more painful than it should be to set up a Spark cluster built using

Re: reading compress lzo files

2014-07-06 Thread Gurvinder Singh
On 07/06/2014 05:19 AM, Nicholas Chammas wrote: On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh gurvinder.si...@uninett.no wrote: csv =

Re: reading compress lzo files

2014-07-06 Thread Nicholas Chammas
Ah, indeed it looks like I need to install this separately (https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1), as it is not part of the core. Nick On Sun, Jul 6, 2014 at 2:22 AM, Gurvinder Singh gurvinder.si...@uninett.no wrote: On 07/06/2014 05:19 AM,
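[Editor's note] Since hadoop-gpl-compression/hadoop-lzo is not part of core Hadoop, the jar and the native LZO libraries also have to be made visible to Spark itself. A minimal sketch of one way to do that from PySpark, assuming the library has already been built and installed on every node; the /opt/hadoop-lzo paths are placeholders, not something prescribed in this thread:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("lzo-read")
            # Put the hadoop-lzo jar on the driver and executor classpaths.
            .set("spark.driver.extraClassPath", "/opt/hadoop-lzo/hadoop-lzo.jar")
            .set("spark.executor.extraClassPath", "/opt/hadoop-lzo/hadoop-lzo.jar")
            # Point the JVMs at the native liblzo2/libgplcompression libraries.
            .set("spark.driver.extraLibraryPath", "/opt/hadoop-lzo/native")
            .set("spark.executor.extraLibraryPath", "/opt/hadoop-lzo/native"))
    sc = SparkContext(conf=conf)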

Re: reading compress lzo files

2014-07-06 Thread Sean Owen
Pardon, I was wrong about this. There is actually code distributed under com.hadoop, and that's where this class is. Oops. https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/source/browse/trunk/src/java/com/hadoop/mapreduce/LzoTextInputFormat.java On Sun, Jul 6, 2014 at 6:37

Re: reading compress lzo files

2014-07-06 Thread Andrew Ash
Hi Nick, The cluster I was working on in those linked messages was a private data center cluster, not on EC2. I'd imagine that the setup would be pretty similar, but I'm not familiar with the EC2 init scripts that Spark uses. Also I upgraded that cluster to 1.0 recently and am continuing to use

Re: reading compress lzo files

2014-07-05 Thread Nicholas Chammas
On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh gurvinder.si...@uninett.no wrote: csv = sc.newAPIHadoopFile(opts.input, com.hadoop.mapreduce.LzoTextInputFormat, org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text).count() Does anyone know what the rough equivalent of this would be in
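[Editor's note] For reference, a cleaned-up, self-contained version of the call under discussion, as a PySpark sketch; the input path is a placeholder, and it assumes a SparkContext sc set up as above and a Spark version whose Python API exposes newAPIHadoopFile. LzoTextInputFormat yields (LongWritable, Text) records, which arrive in Python as (byte offset, line) tuples:

    pairs = sc.newAPIHadoopFile(
        "hdfs:///data/input.lzo",  # placeholder path
        "com.hadoop.mapreduce.LzoTextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text")
    print(pairs.count())                 # one record per line of the file
    lines = pairs.map(lambda kv: kv[1])  # drop the offsets to get plain lines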

Re: reading compress lzo files

2014-07-05 Thread Sean Owen
The package com.hadoop.mapreduce certainly looks wrong. If it is a Hadoop class, it starts with org.apache.hadoop. On Jul 6, 2014 4:20 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh gurvinder.si...@uninett.no wrote: csv =

Re: reading compress lzo files

2014-07-04 Thread Gurvinder Singh
An update on this issue: Spark is now able to read the LZO file, recognize that it has an index, and start multiple map tasks. You need to use the following function instead of textFile: csv = sc.newAPIHadoopFile(opts.input, com.hadoop.mapreduce.LzoTextInputFormat, org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text).count()
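[Editor's note] The multiple map tasks mentioned here depend on a .index file sitting next to each .lzo file; without it, the whole file is read as a single split. A sketch of building that index with the LzoIndexer tool from the hadoop-lzo project, driven from Python; the jar and input paths are placeholders:

    import subprocess

    # Writes hdfs:///data/input.lzo.index next to the input file; with the
    # index present, LzoTextInputFormat can split the file across many tasks.
    subprocess.check_call([
        "hadoop", "jar", "/opt/hadoop-lzo/hadoop-lzo.jar",
        "com.hadoop.compression.lzo.LzoIndexer",
        "hdfs:///data/input.lzo"])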