Rajeev,

You should have something like this in your core-site.xml file in Hadoop:

<property>
  <name>io.compression.codecs</name>
  <value>com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

I also had to add the LZO jar to Spark's classpath with SPARK_CLASSPATH in spark-env.sh, so you may need to do that too.
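For example, something like this in spark-env.sh (just a sketch -- the jar path is the CDH parcel location from your streaming command below, so adjust it to wherever your hadoop-lzo jar actually lives):

export SPARK_CLASSPATH=$SPARK_CLASSPATH:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar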
Cheers,
Andrew


On Thu, Dec 26, 2013 at 3:48 PM, Rajeev Srivastava <raj...@silverline-da.com> wrote:

> Hi Andrew,
> Thanks for your example. I used your command and I get the following
> errors from the workers (missing codec on the workers, I guess). How do
> I get the codecs over to the worker machines?
> regards
> Rajeev
>
> *******************************************************************
> 13/12/26 12:34:42 INFO TaskSetManager: Loss was due to java.io.IOException:
> Codec for file hdfs://hadoop00/tmp/ldpc_dec_top_2450000_to_2750000.vcd.sstv3.lzo
> not found, cannot run
>         at com.hadoop.mapreduce.LzoLineRecordReader.initialize(LzoLineRecordReader.java:97)
>         at spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:68)
>         at spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:57)
>         at spark.RDD.computeOrReadCheckpoint(RDD.scala:207)
>         at spark.RDD.iterator(RDD.scala:196)
>         at spark.scheduler.ResultTask.run(ResultTask.scala:77)
>         at spark.executor.Executor$TaskRunner.run(Executor.scala:98)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:724)
> 13/12/26 12:34:42 INFO TaskSetManager: Starting task 0.0:15 as TID 28 on
> executor 4: hadoop02 (preferred)
> 13/12/26 12:34:42 INFO TaskSetManager: Serialized task 0.0:15 as 1358 bytes in 0 ms
> 13/12/26 12:34:42 INFO TaskSetManager: Lost TID 22 (task 0.0:20)
> 13/12/26 12:34:42 INFO TaskSetManager: Loss was due to java.io.IOException:
> Codec for file hdfs://hadoop00/tmp/ldpc_dec_top_2450000_to_2750000.vcd.sstv3.lzo
> not found, cannot run [duplicate 1]
>
> Rajeev Srivastava
> Silverline Design Inc
> 2118 Walsh Ave, Suite 204
> Santa Clara, CA 95050
> cell: 408-409-0940
>
>
> On Tue, Dec 24, 2013 at 5:20 PM, Andrew Ash <and...@andrewash.com> wrote:
>
>> Hi Berkeley,
>>
>> By RF=3 I mean a replication factor of 3 on the files in HDFS, so each
>> block is stored 3 times across the cluster. It's a pretty standard
>> choice for the replication factor, in order to give a hardware team
>> time to replace bad hardware in the case of failure. With RF=3 the
>> cluster can sustain the failure of any two nodes without data loss,
>> but the loss of a third node may cause data loss.
>>
>> When reading the LZO files with the newAPIHadoopFile() call I showed
>> below, the data in the RDD is already decompressed -- it transparently
>> looks the same to my Spark program as if I were operating on an
>> uncompressed file.
>>
>> Cheers,
>> Andrew
>>
>>
>> On Tue, Dec 24, 2013 at 12:29 PM, Berkeley Malagon
>> <berke...@firestickgames.com> wrote:
>>
>>> Andrew, this is great.
>>>
>>> Excuse my ignorance, but what do you mean by RF=3? Also, after
>>> reading the LZO files, are you able to access the contents directly,
>>> or do you have to decompress them after reading them?
>>>
>>> Sent from my iPhone
>>>
>>> On Dec 24, 2013, at 12:03 AM, Andrew Ash <and...@andrewash.com> wrote:
>>>
>>> Hi Rajeev,
>>>
>>> I'm not sure if you ever got it working, but I just got mine up and
>>> going. If you just use sc.textFile(...), the file will be read, but
>>> the LZO index won't be used, so a .count() on my 1B+ row file took
>>> 2483s.
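>>> (For reference, the slow version was just the plain call, something
>>> like:
>>>
>>>   sc.textFile("hdfs:///path/to/myfile.lzo").count
>>>
>>> The LzopCodec still decompresses the file, but without the index the
>>> file isn't splittable, so I believe it all gets read in a single
>>> task. The path here is made up, of course.)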
>>> When I ran it like this, though:
>>>
>>>   sc.newAPIHadoopFile("hdfs:///path/to/myfile.lzo",
>>>     classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>>>     classOf[org.apache.hadoop.io.LongWritable],
>>>     classOf[org.apache.hadoop.io.Text]).count
>>>
>>> the LZO index file was used and the .count() took just 101s. For
>>> reference, this file is 43GB when .gz compressed and 78.4GB when .lzo
>>> compressed. I have RF=3, and this is across 4 pretty beefy machines
>>> with Hadoop DataNodes and Spark both running on each machine.
>>>
>>> Cheers!
>>> Andrew
>>>
>>>
>>> On Mon, Dec 16, 2013 at 2:34 PM, Rajeev Srivastava
>>> <raj...@silverline-da.com> wrote:
>>>
>>>> Thanks for your suggestion. I will try this and update by late
>>>> evening.
>>>>
>>>> regards
>>>> Rajeev
>>>>
>>>> Rajeev Srivastava
>>>> Silverline Design Inc
>>>> 2118 Walsh Ave, Suite 204
>>>> Santa Clara, CA 95050
>>>> cell: 408-409-0940
>>>>
>>>>
>>>> On Mon, Dec 16, 2013 at 11:24 AM, Andrew Ash <and...@andrewash.com> wrote:
>>>>
>>>>> Hi Rajeev,
>>>>>
>>>>> It looks like you're using the
>>>>> com.hadoop.mapred.DeprecatedLzoTextInputFormat input format above,
>>>>> while Stephen referred to com.hadoop.mapreduce.LzoTextInputFormat.
>>>>>
>>>>> I think the way to use this in Spark would be the
>>>>> SparkContext.hadoopFile() or SparkContext.newAPIHadoopFile()
>>>>> methods, with the path and the InputFormat as parameters. Can you
>>>>> give those a shot?
>>>>>
>>>>> Andrew
>>>>>
>>>>>
>>>>> On Wed, Dec 11, 2013 at 8:59 PM, Rajeev Srivastava
>>>>> <raj...@silverline-da.com> wrote:
>>>>>
>>>>>> Hi Stephen,
>>>>>> I tried the same LZO file with a simple Hadoop streaming job, and
>>>>>> it seems to work fine:
>>>>>>
>>>>>> HADOOP_HOME=/usr/lib/hadoop
>>>>>> /usr/bin/hadoop jar /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop-mapreduce/hadoop-streaming.jar \
>>>>>>   -libjars /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar \
>>>>>>   -input /tmp/ldpc.sstv3.lzo \
>>>>>>   -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
>>>>>>   -output wc_test \
>>>>>>   -mapper 'cat' \
>>>>>>   -reducer 'wc -l'
>>>>>>
>>>>>> This means Hadoop is able to handle the LZO file correctly.
>>>>>>
>>>>>> Can you suggest what I should do in Spark for it to work?
>>>>>>
>>>>>> regards
>>>>>> Rajeev
>>>>>>
>>>>>> Rajeev Srivastava
>>>>>> Silverline Design Inc
>>>>>> 2118 Walsh Ave, Suite 204
>>>>>> Santa Clara, CA 95050
>>>>>> cell: 408-409-0940
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 10, 2013 at 1:20 PM, Stephen Haberman
>>>>>> <stephen.haber...@gmail.com> wrote:
>>>>>>
>>>>>>> > System.setProperty("spark.io.compression.codec",
>>>>>>> > "com.hadoop.compression.lzo.LzopCodec")
>>>>>>>
>>>>>>> This spark.io.compression.codec is a completely different setting
>>>>>>> from the codecs used for reading/writing from HDFS. (It is for
>>>>>>> compressing Spark's internal/non-HDFS intermediate output.)
>>>>>>>
>>>>>>> > Hope this helps and someone can help read a LZO file
>>>>>>>
>>>>>>> Spark just uses the regular Hadoop FileSystem API, so any issues
>>>>>>> with reading LZO files would be Hadoop issues. I would search the
>>>>>>> Hadoop issue tracker and look for information on using LZO files
>>>>>>> with Hadoop/Hive; whatever works for them should magically work
>>>>>>> for Spark as well.
>>>>>>>
>>>>>>> This looks like a good place to start:
>>>>>>>
>>>>>>> https://github.com/twitter/hadoop-lzo
>>>>>>>
>>>>>>> IANAE, but I would try passing one of these:
>>>>>>>
>>>>>>> https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/mapreduce/LzoTextInputFormat.java
>>>>>>>
>>>>>>> to the SparkContext.hadoopFile method.
>>>>>>>
>>>>>>> - Stephen
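>>>>>>>
>>>>>>> P.S. An untested sketch of that call (the path is made up, and
>>>>>>> since LzoTextInputFormat is a new-API (org.apache.hadoop.mapreduce)
>>>>>>> InputFormat, I'd guess the newAPIHadoopFile variant is the one
>>>>>>> you want):
>>>>>>>
>>>>>>>   val rdd = sc.newAPIHadoopFile("hdfs:///path/to/file.lzo",
>>>>>>>     classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>>>>>>>     classOf[org.apache.hadoop.io.LongWritable],
>>>>>>>     classOf[org.apache.hadoop.io.Text])
>>>>>>>   // the values are the decompressed text lines
>>>>>>>   val lines = rdd.map(_._2.toString)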