Hi Andrew,

Thanks for your example. I used your command and I get the following errors from the workers (missing codec on the workers, I guess). How do I get the codecs over to the worker machines?

regards
Rajeev

*******************************************************************
13/12/26 12:34:42 INFO TaskSetManager: Loss was due to java.io.IOException: Codec for file hdfs://hadoop00/tmp/ldpc_dec_top_2450000_to_2750000.vcd.sstv3.lzo not found, cannot run
        at com.hadoop.mapreduce.LzoLineRecordReader.initialize(LzoLineRecordReader.java:97)
        at spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:68)
        at spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:57)
        at spark.RDD.computeOrReadCheckpoint(RDD.scala:207)
        at spark.RDD.iterator(RDD.scala:196)
        at spark.scheduler.ResultTask.run(ResultTask.scala:77)
        at spark.executor.Executor$TaskRunner.run(Executor.scala:98)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
13/12/26 12:34:42 INFO TaskSetManager: Starting task 0.0:15 as TID 28 on executor 4: hadoop02 (preferred)
13/12/26 12:34:42 INFO TaskSetManager: Serialized task 0.0:15 as 1358 bytes in 0 ms
13/12/26 12:34:42 INFO TaskSetManager: Lost TID 22 (task 0.0:20)
13/12/26 12:34:42 INFO TaskSetManager: Loss was due to java.io.IOException: Codec for file hdfs://hadoop00/tmp/ldpc_dec_top_2450000_to_2750000.vcd.sstv3.lzo not found, cannot run [duplicate 1]

Rajeev Srivastava
Silverline Design Inc
2118 Walsh ave, suite 204
Santa Clara, CA, 95050
cell : 408-409-0940
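[For anyone hitting the same "Codec ... not found" error: the hadoop-lzo jar has to be visible on every worker, and the LZO codecs have to be registered in the Hadoop configuration the tasks use. Below is a minimal sketch of one way to do both from the driver, using the pre-1.0 `spark` package visible in the stack trace above. The master URL and jar path are placeholders for this cluster's actual values, and it assumes the native libgplcompression libraries are already installed on each worker.]

    import org.apache.hadoop.conf.Configuration
    import spark.SparkContext

    // Ship the hadoop-lzo jar to the executors. Alternatively, add it to
    // SPARK_CLASSPATH in conf/spark-env.sh on every worker machine.
    val sc = new SparkContext(
      "spark://hadoop00:7077",  // placeholder master URL
      "LzoReadTest",
      System.getenv("SPARK_HOME"),
      Seq("/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar"))

    // Register the LZO codecs so CompressionCodecFactory on the workers can
    // resolve .lzo files; without this, LzoLineRecordReader throws the
    // "Codec for file ... not found" IOException seen in the log above.
    val conf = new Configuration()
    conf.set("io.compression.codecs",
      "org.apache.hadoop.io.compress.DefaultCodec," +
      "com.hadoop.compression.lzo.LzoCodec," +
      "com.hadoop.compression.lzo.LzopCodec")

    val records = sc.newAPIHadoopFile(
      "hdfs://hadoop00/tmp/ldpc_dec_top_2450000_to_2750000.vcd.sstv3.lzo",
      classOf[com.hadoop.mapreduce.LzoTextInputFormat],
      classOf[org.apache.hadoop.io.LongWritable],
      classOf[org.apache.hadoop.io.Text],
      conf)
    records.count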
On Tue, Dec 24, 2013 at 5:20 PM, Andrew Ash <and...@andrewash.com> wrote:

> Hi Berkeley,
>
> By RF=3 I mean a replication factor of 3 on the files in HDFS, so each
> block is stored 3 times across the cluster. It's a pretty standard choice
> of replication factor because it gives a hardware team time to replace
> bad hardware after a failure. With RF=3 the cluster can sustain the
> failure of any two nodes without data loss, but the loss of a third node
> may cause data loss.
>
> When reading the LZO files with the newAPIHadoopFile() call I showed
> below, the data in the RDD is already decompressed -- it transparently
> looks the same to my Spark program as if I were operating on an
> uncompressed file.
>
> Cheers,
> Andrew
>
>
> On Tue, Dec 24, 2013 at 12:29 PM, Berkeley Malagon <
> berke...@firestickgames.com> wrote:
>
>> Andrew, this is great.
>>
>> Excuse my ignorance, but what do you mean by RF=3? Also, after reading
>> the LZO files, are you able to access the contents directly, or do you
>> have to decompress them after reading them?
>>
>> Sent from my iPhone
>>
>> On Dec 24, 2013, at 12:03 AM, Andrew Ash <and...@andrewash.com> wrote:
>>
>> Hi Rajeev,
>>
>> I'm not sure if you ever got it working, but I just got mine up and
>> going. If you just use sc.textFile(...), the file will be read but the
>> LZO index won't be used, so a .count() on my 1B+ row file took 2483s.
>> When I ran it like this, though:
>>
>> sc.newAPIHadoopFile("hdfs:///path/to/myfile.lzo",
>>     classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>>     classOf[org.apache.hadoop.io.LongWritable],
>>     classOf[org.apache.hadoop.io.Text]).count
>>
>> the LZO index file was used and the .count() took just 101s. For
>> reference, this file is 43GB when .gz compressed and 78.4GB when .lzo
>> compressed. I have RF=3, and this is across 4 pretty beefy machines with
>> Hadoop DataNodes and Spark both running on each machine.
>>
>> Cheers!
>> Andrew
>>
>>
>> On Mon, Dec 16, 2013 at 2:34 PM, Rajeev Srivastava <
>> raj...@silverline-da.com> wrote:
>>
>>> Thanks for your suggestion. I will try this and update by late evening.
>>>
>>> regards
>>> Rajeev
>>>
>>> Rajeev Srivastava
>>> Silverline Design Inc
>>> 2118 Walsh ave, suite 204
>>> Santa Clara, CA, 95050
>>> cell : 408-409-0940
>>>
>>>
>>> On Mon, Dec 16, 2013 at 11:24 AM, Andrew Ash <and...@andrewash.com> wrote:
>>>
>>>> Hi Rajeev,
>>>>
>>>> It looks like you're using the
>>>> com.hadoop.mapred.DeprecatedLzoTextInputFormat input format above,
>>>> while Stephen referred to com.hadoop.mapreduce.LzoTextInputFormat.
>>>>
>>>> I think the way to use this in Spark would be to use the
>>>> SparkContext.hadoopFile() or SparkContext.newAPIHadoopFile() methods
>>>> with the path and the InputFormat as parameters. Can you give those a
>>>> shot?
>>>>
>>>> Andrew
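[For concreteness, a sketch of the two calls Andrew suggests above, with a placeholder path and assuming the `sc` of a spark-shell session. The old (mapred) API pairs with com.hadoop.mapred.DeprecatedLzoTextInputFormat, the same format the hadoop-streaming job quoted below uses; the new (mapreduce) API pairs with com.hadoop.mapreduce.LzoTextInputFormat, the variant Andrew reports working above.]

    import org.apache.hadoop.io.{LongWritable, Text}

    // Old (mapred) API -- matches DeprecatedLzoTextInputFormat, as used by
    // the hadoop-streaming job further down the thread.
    val oldApi = sc.hadoopFile(
      "hdfs:///path/to/myfile.lzo",
      classOf[com.hadoop.mapred.DeprecatedLzoTextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    // New (mapreduce) API -- matches LzoTextInputFormat, the variant Andrew
    // shows working above.
    val newApi = sc.newAPIHadoopFile(
      "hdfs:///path/to/myfile.lzo",
      classOf[com.hadoop.mapreduce.LzoTextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    // Either way, each record is (byte offset, decompressed line of text).
    val lines = newApi.map(_._2.toString)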
>>>>
>>>> On Wed, Dec 11, 2013 at 8:59 PM, Rajeev Srivastava <
>>>> raj...@silverline-da.com> wrote:
>>>>
>>>>> Hi Stephen,
>>>>> I tried the same LZO file with a simple Hadoop streaming script, and
>>>>> it seems to work fine:
>>>>>
>>>>> HADOOP_HOME=/usr/lib/hadoop
>>>>> /usr/bin/hadoop jar
>>>>> /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop-mapreduce/hadoop-streaming.jar \
>>>>> -libjars
>>>>> /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar \
>>>>> -input /tmp/ldpc.sstv3.lzo \
>>>>> -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
>>>>> -output wc_test \
>>>>> -mapper 'cat' \
>>>>> -reducer 'wc -l'
>>>>>
>>>>> This means Hadoop is able to handle the LZO file correctly.
>>>>>
>>>>> Can you suggest what I should do in Spark to make it work?
>>>>>
>>>>> regards
>>>>> Rajeev
>>>>>
>>>>>
>>>>> Rajeev Srivastava
>>>>> Silverline Design Inc
>>>>> 2118 Walsh ave, suite 204
>>>>> Santa Clara, CA, 95050
>>>>> cell : 408-409-0940
>>>>>
>>>>>
>>>>> On Tue, Dec 10, 2013 at 1:20 PM, Stephen Haberman <
>>>>> stephen.haber...@gmail.com> wrote:
>>>>>
>>>>>> > System.setProperty("spark.io.compression.codec",
>>>>>> > "com.hadoop.compression.lzo.LzopCodec")
>>>>>>
>>>>>> This spark.io.compression.codec is a completely different setting
>>>>>> from the codecs that are used for reading/writing from HDFS. (It is
>>>>>> for compressing Spark's internal, non-HDFS intermediate output.)
>>>>>>
>>>>>> > Hope this helps and someone can help read a LZO file
>>>>>>
>>>>>> Spark just uses the regular Hadoop FileSystem API, so any issues with
>>>>>> reading LZO files would be Hadoop issues. I would search the Hadoop
>>>>>> issue tracker and look for information on using LZO files with
>>>>>> Hadoop/Hive; whatever works for them should magically work for Spark
>>>>>> as well.
>>>>>>
>>>>>> This looks like a good place to start:
>>>>>>
>>>>>> https://github.com/twitter/hadoop-lzo
>>>>>>
>>>>>> IANAE, but I would try passing one of these:
>>>>>>
>>>>>> https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/mapreduce/LzoTextInputFormat.java
>>>>>>
>>>>>> to the SparkContext.hadoopFile method.
>>>>>>
>>>>>> - Stephen
>>>>>>
>>>>>
>>>>
>>>
>>
>
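[One detail implicit in Andrew's timing comparison above: LzoTextInputFormat can only split an .lzo file across tasks if a sidecar .lzo.index file exists next to it. A sketch of building that index with the indexer that ships in twitter/hadoop-lzo, with a placeholder path:]

    import org.apache.hadoop.util.ToolRunner
    import com.hadoop.compression.lzo.DistributedLzoIndexer

    // Runs a small MapReduce job that writes myfile.lzo.index next to the
    // file; without the index, the whole .lzo is read by a single task.
    ToolRunner.run(new DistributedLzoIndexer(), Array("hdfs:///path/to/myfile.lzo"))

[The twitter/hadoop-lzo README shows the equivalent command-line form, hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer, which may be more convenient than calling it from Scala.]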