Hi Andrew,

Thanks for your example. I used your command and I get the following errors from the workers (missing codec on the workers, I guess). How do I get the codecs over to the worker machines?

regards
Rajeev

*******************************************************************
13/12/26 12:34:42 INFO TaskSetManager: Loss was due to java.io.IOException: Codec for file hdfs://hadoop00/tmp/ldpc_dec_top_2450000_to_2750000.vcd.sstv3.lzo not found, cannot run
        at com.hadoop.mapreduce.LzoLineRecordReader.initialize(LzoLineRecordReader.java:97)
        at spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:68)
        at spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:57)
        at spark.RDD.computeOrReadCheckpoint(RDD.scala:207)
        at spark.RDD.iterator(RDD.scala:196)
        at spark.scheduler.ResultTask.run(ResultTask.scala:77)
        at spark.executor.Executor$TaskRunner.run(Executor.scala:98)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
13/12/26 12:34:42 INFO TaskSetManager: Starting task 0.0:15 as TID 28 on executor 4: hadoop02 (preferred)
13/12/26 12:34:42 INFO TaskSetManager: Serialized task 0.0:15 as 1358 bytes in 0 ms
13/12/26 12:34:42 INFO TaskSetManager: Lost TID 22 (task 0.0:20)
13/12/26 12:34:42 INFO TaskSetManager: Loss was due to java.io.IOException: Codec for file hdfs://hadoop00/tmp/ldpc_dec_top_2450000_to_2750000.vcd.sstv3.lzo not found, cannot run [duplicate 1]

Rajeev Srivastava
Silverline Design Inc
2118 Walsh ave, suite 204
Santa Clara, CA, 95050
cell : 408-409-0940
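[For anyone hitting the same "Codec ... not found" error: the hadoop-lzo jar has to be visible on every worker, and the LZO codecs have to be registered in the Hadoop configuration the tasks use. Below is a minimal sketch of one way to do both from the driver, using the pre-1.0 `spark` package visible in the stack trace above. The master URL and jar path are placeholders for this cluster's actual values, and it assumes the native libgplcompression libraries are already installed on each worker.]

    import org.apache.hadoop.conf.Configuration
    import spark.SparkContext

    // Ship the hadoop-lzo jar to the executors. Alternatively, add it to
    // SPARK_CLASSPATH in conf/spark-env.sh on every worker machine.
    val sc = new SparkContext(
      "spark://hadoop00:7077",  // placeholder master URL
      "LzoReadTest",
      System.getenv("SPARK_HOME"),
      Seq("/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar"))

    // Register the LZO codecs so CompressionCodecFactory on the workers can
    // resolve .lzo files; without this, LzoLineRecordReader throws the
    // "Codec for file ... not found" IOException seen in the log above.
    val conf = new Configuration()
    conf.set("io.compression.codecs",
      "org.apache.hadoop.io.compress.DefaultCodec," +
      "com.hadoop.compression.lzo.LzoCodec," +
      "com.hadoop.compression.lzo.LzopCodec")

    val records = sc.newAPIHadoopFile(
      "hdfs://hadoop00/tmp/ldpc_dec_top_2450000_to_2750000.vcd.sstv3.lzo",
      classOf[com.hadoop.mapreduce.LzoTextInputFormat],
      classOf[org.apache.hadoop.io.LongWritable],
      classOf[org.apache.hadoop.io.Text],
      conf)
    records.count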
On Tue, Dec 24, 2013 at 5:20 PM, Andrew Ash <and...@andrewash.com> wrote:

> Hi Berkeley,
>
> By RF=3 I mean a replication factor of 3 on the files in HDFS, so each
> block is stored 3 times across the cluster. It's a pretty standard choice
> of replication factor because it gives a hardware team time to replace
> bad hardware after a failure. With RF=3 the cluster can sustain the
> failure of any two nodes without data loss, but the loss of a third node
> may cause data loss.
>
> When reading the LZO files with the newAPIHadoopFile() call I showed
> below, the data in the RDD is already decompressed -- it transparently
> looks the same to my Spark program as if I were operating on an
> uncompressed file.
>
> Cheers,
> Andrew
>
>
> On Tue, Dec 24, 2013 at 12:29 PM, Berkeley Malagon <
> berke...@firestickgames.com> wrote:
>
>> Andrew, this is great.
>>
>> Excuse my ignorance, but what do you mean by RF=3? Also, after reading
>> the LZO files, are you able to access the contents directly, or do you
>> have to decompress them after reading them?
>>
>> Sent from my iPhone
>>
>> On Dec 24, 2013, at 12:03 AM, Andrew Ash <and...@andrewash.com> wrote:
>>
>> Hi Rajeev,
>>
>> I'm not sure if you ever got it working, but I just got mine up and
>> going. If you just use sc.textFile(...), the file will be read but the
>> LZO index won't be used, so a .count() on my 1B+ row file took 2483s.
>> When I ran it like this, though:
>>
>> sc.newAPIHadoopFile("hdfs:///path/to/myfile.lzo",
>>     classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>>     classOf[org.apache.hadoop.io.LongWritable],
>>     classOf[org.apache.hadoop.io.Text]).count
>>
>> the LZO index file was used and the .count() took just 101s. For
>> reference, this file is 43GB when .gz compressed and 78.4GB when .lzo
>> compressed. I have RF=3, and this is across 4 pretty beefy machines with
>> Hadoop DataNodes and Spark both running on each machine.
>>
>> Cheers!
>> Andrew
>>
>>
>> On Mon, Dec 16, 2013 at 2:34 PM, Rajeev Srivastava <
>> raj...@silverline-da.com> wrote:
>>
>>> Thanks for your suggestion. I will try this and update by late evening.
>>>
>>> regards
>>> Rajeev
>>>
>>> Rajeev Srivastava
>>> Silverline Design Inc
>>> 2118 Walsh ave, suite 204
>>> Santa Clara, CA, 95050
>>> cell : 408-409-0940
>>>
>>>
>>> On Mon, Dec 16, 2013 at 11:24 AM, Andrew Ash <and...@andrewash.com> wrote:
>>>
>>>> Hi Rajeev,
>>>>
>>>> It looks like you're using the
>>>> com.hadoop.mapred.DeprecatedLzoTextInputFormat input format above,
>>>> while Stephen referred to com.hadoop.mapreduce.LzoTextInputFormat.
>>>>
>>>> I think the way to use this in Spark would be to use the
>>>> SparkContext.hadoopFile() or SparkContext.newAPIHadoopFile() methods
>>>> with the path and the InputFormat as parameters. Can you give those a
>>>> shot?
>>>>
>>>> Andrew
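[For concreteness, a sketch of the two calls Andrew suggests above, with a placeholder path and assuming the `sc` of a spark-shell session. The old (mapred) API pairs with com.hadoop.mapred.DeprecatedLzoTextInputFormat, the same format the hadoop-streaming job quoted below uses; the new (mapreduce) API pairs with com.hadoop.mapreduce.LzoTextInputFormat, the variant Andrew reports working above.]

    import org.apache.hadoop.io.{LongWritable, Text}

    // Old (mapred) API -- matches DeprecatedLzoTextInputFormat, as used by
    // the hadoop-streaming job further down the thread.
    val oldApi = sc.hadoopFile(
      "hdfs:///path/to/myfile.lzo",
      classOf[com.hadoop.mapred.DeprecatedLzoTextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    // New (mapreduce) API -- matches LzoTextInputFormat, the variant Andrew
    // shows working above.
    val newApi = sc.newAPIHadoopFile(
      "hdfs:///path/to/myfile.lzo",
      classOf[com.hadoop.mapreduce.LzoTextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    // Either way, each record is (byte offset, decompressed line of text).
    val lines = newApi.map(_._2.toString)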
>>>>
>>>> On Wed, Dec 11, 2013 at 8:59 PM, Rajeev Srivastava <
>>>> raj...@silverline-da.com> wrote:
>>>>
>>>>> Hi Stephen,
>>>>> I tried the same LZO file with a simple Hadoop streaming script, and
>>>>> it seems to work fine:
>>>>>
>>>>> HADOOP_HOME=/usr/lib/hadoop
>>>>> /usr/bin/hadoop jar
>>>>> /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop-mapreduce/hadoop-streaming.jar \
>>>>> -libjars
>>>>> /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar \
>>>>> -input /tmp/ldpc.sstv3.lzo \
>>>>> -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
>>>>> -output wc_test \
>>>>> -mapper 'cat' \
>>>>> -reducer 'wc -l'
>>>>>
>>>>> This means Hadoop is able to handle the LZO file correctly.
>>>>>
>>>>> Can you suggest what I should do in Spark to make it work?
>>>>>
>>>>> regards
>>>>> Rajeev
>>>>>
>>>>>
>>>>> Rajeev Srivastava
>>>>> Silverline Design Inc
>>>>> 2118 Walsh ave, suite 204
>>>>> Santa Clara, CA, 95050
>>>>> cell : 408-409-0940
>>>>>
>>>>>
>>>>> On Tue, Dec 10, 2013 at 1:20 PM, Stephen Haberman <
>>>>> stephen.haber...@gmail.com> wrote:
>>>>>
>>>>>> > System.setProperty("spark.io.compression.codec",
>>>>>> > "com.hadoop.compression.lzo.LzopCodec")
>>>>>>
>>>>>> This spark.io.compression.codec is a completely different setting
>>>>>> from the codecs that are used for reading/writing from HDFS. (It is
>>>>>> for compressing Spark's internal, non-HDFS intermediate output.)
>>>>>>
>>>>>> > Hope this helps and someone can help read a LZO file
>>>>>>
>>>>>> Spark just uses the regular Hadoop FileSystem API, so any issues with
>>>>>> reading LZO files would be Hadoop issues. I would search the Hadoop
>>>>>> issue tracker and look for information on using LZO files with
>>>>>> Hadoop/Hive; whatever works for them should magically work for Spark
>>>>>> as well.
>>>>>>
>>>>>> This looks like a good place to start:
>>>>>>
>>>>>> https://github.com/twitter/hadoop-lzo
>>>>>>
>>>>>> IANAE, but I would try passing one of these:
>>>>>>
>>>>>> https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/mapreduce/LzoTextInputFormat.java
>>>>>>
>>>>>> to the SparkContext.hadoopFile method.
>>>>>>
>>>>>> - Stephen
>>>>>>
>>>>>
>>>>
>>>
>>
>
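[One detail implicit in Andrew's timing comparison above: LzoTextInputFormat can only split an .lzo file across tasks if a sidecar .lzo.index file exists next to it. A sketch of building that index with the indexer that ships in twitter/hadoop-lzo, with a placeholder path:]

    import org.apache.hadoop.util.ToolRunner
    import com.hadoop.compression.lzo.DistributedLzoIndexer

    // Runs a small MapReduce job that writes myfile.lzo.index next to the
    // file; without the index, the whole .lzo is read by a single task.
    ToolRunner.run(new DistributedLzoIndexer(), Array("hdfs:///path/to/myfile.lzo"))

[The twitter/hadoop-lzo README shows the equivalent command-line form, hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer, which may be more convenient than calling it from Scala.]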