lzop should work.

On Mon, Sep 27, 2010 at 10:59 AM, Rohan Rai <rohan....@inmobi.com> wrote:
> Well
>
> I haven't tried (rather I don't remember) compressing via lzop and then
> putting on cluster...
> So can't tell you about that... Here is what works for me.
>
> I do it by first putting the file on cluster and then doing Stream
> Compression.
>
> And yes it need not be indexed (I guess it doesn't matter for a small
> test file, otherwise it is unwise,
> for one loses the benefit of parallelism)
>
> Regards
> Rohan
>
> pig wrote:
>
>> Hi Rohan,
>>
>> The test file (test_input_chars.txt.lzo) is not indexed. I created it
>> using the command
>>
>> 'lzop test_input_chars.txt'
>>
>> It's a really small file (only 6 lines) so I didn't think it needed to be
>> indexed. Do all files regardless of size need to be indexed for the
>> LzoTokenizedLoader to work?
>>
>> Thank you!
>>
>> ~Ed
>>
>> On Mon, Sep 27, 2010 at 1:25 AM, Rohan Rai <rohan....@inmobi.com> wrote:
>>
>>> Oh Sorry I am completely out of sync...
>>>
>>> Can you tell how you lzo'ed and indexed the file
>>>
>>> Regards
>>> Rohan
>>>
>>> Rohan Rai wrote:
>>>
>>>> Oh Sorry I did not see this mail ...
>>>>
>>>> It's not an official patch/release
>>>>
>>>> But here is a fork on elephant-bird which works with pig 0.7
>>>>
>>>> for normal LZOText Loading etc
>>>>
>>>> (Not HbaseLoader)
>>>>
>>>> Regards
>>>> Rohan
>>>>
>>>> Dmitriy Ryaboy wrote:
>>>>
>>>>> The 0.7 branch is not tested.. it's quite likely it doesn't actually
>>>>> work :).
>>>>> Rohan Rai was working on it.. Rohan, think you can take a look and help
>>>>> Ed out?
>>>>>
>>>>> Ed, you may want to check if the same input works when you use Pig 0.6
>>>>> (and the official elephant-bird, on Kevin Weil's github).
>>>>>
>>>>> -D
>>>>>
>>>>> On Thu, Sep 23, 2010 at 6:49 AM, pig <hadoopn...@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> After getting all the errors to go away with LZO libraries not being
>>>>>> found and missing jar files for elephant-bird I've run into a new
>>>>>> problem when using the elephant-bird branch for pig 0.7
>>>>>>
>>>>>> The following simple pig script works as expected
>>>>>>
>>>>>> REGISTER elephant-bird-1.0.jar
>>>>>> REGISTER /usr/lib/elephant-bird/lib/google-collect-1.0.jar
>>>>>> A = load '/usr/foo/input/test_input_chars.txt';
>>>>>> DUMP A;
>>>>>>
>>>>>> This just dumps out the contents of the test_input_chars.txt file
>>>>>> which is tab delimited. The output looks like:
>>>>>>
>>>>>> (1,a,a,a,a,a,a)
>>>>>> (2,b,b,b,b,b,b)
>>>>>> (3,c,c,c,c,c,c)
>>>>>> (4,d,d,d,d,d,d)
>>>>>> (5,e,e,e,e,e,e)
>>>>>>
>>>>>> I then lzop the test file to get test_input_chars.txt.lzo (I
>>>>>> decompressed this with lzop -d to make sure the compression worked
>>>>>> fine and everything looks good).
>>>>>> If I run the exact same script provided above on the lzo file it works
>>>>>> fine. However, this file is really small and doesn't need to use
>>>>>> indexes. As a result, I wanted to have LZO support that worked with
>>>>>> indexes. Based on this I decided to try out the elephant-bird branch
>>>>>> for pig 0.7 located here (http://github.com/hirohanin/elephant-bird/)
>>>>>> as recommended by Dmitriy.
>>>>>>
>>>>>> I created the following pig script that mirrors the above script but
>>>>>> should hopefully work on LZO files (including indexed ones)
>>>>>>
>>>>>> REGISTER elephant-bird-1.0.jar
>>>>>> REGISTER /usr/lib/elephant-bird/lib/google-collect-1.0.jar
>>>>>> A = load '/usr/foo/input/test_input_chars.txt.lzo' USING
>>>>>> com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\t');
>>>>>> DUMP A;
>>>>>>
>>>>>> When I run this script which uses the LzoTokenizedLoader there is no
>>>>>> output. The script appears to run without errors but there are zero
>>>>>> Records Written and 0 Bytes Written.
>>>>>>
>>>>>> Here is the exact output:
>>>>>>
>>>>>> grunt > DUMP A;
>>>>>> [main] INFO com.twitter.elephantbird.pig.load.LzoTokenizedLoader - LzoTokenizedLoader with given delimited [ ]
>>>>>> [main] INFO com.twitter.elephantbird.pig.load.LzoTokenizedLoader - LzoTokenizedLoader with given delimited [ ]
>>>>>> [main] INFO com.twitter.elephantbird.pig.load.LzoTokenizedLoader - LzoTokenizedLoader with given delimited [ ]
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: Store(hdfs://master:9000/tmp/temp-2052828736/tmp-1533645117:org.apache.pig.builtin.BinStorage) - 1-4 Operator Key: 1-4
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
>>>>>> [Thread-12] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>>>>>> [Thread-12] INFO com.twitter.elephantbird.pig.load.LzoTokenizedLoader - LzoTokenizedLoader with given delimiter [ ]
>>>>>> [Thread-12] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201009101108_0151
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at http://master:50030/jobdetails.jsp?jobid=job_201009101108_0151
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Succesfully stored result in "hdfs://amb-hadoop-01:9000/tmp/temp-2052828736/tmp-1533645117
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written: 0
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written: 0
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Spillable Memory Manager spill count : 0
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Proactive spill count : 0
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
>>>>>> [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process: 1
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process: 1
>>>>>> grunt >
>>>>>>
>>>>>> I'm not sure if I'm doing something wrong in my use of
>>>>>> LzoTokenizedLoader or if there is a problem with the class itself
>>>>>> (most likely the problem is with my code heh) Thank you for any help!
>>>>>>
>>>>>> ~Ed
>>>>>
>>>>
>>>> The information contained in this communication is intended solely for
>>>> the use of the individual or entity to whom it is addressed and others
>>>> authorized to receive it. It may contain confidential or legally
>>>> privileged information. If you are not the intended recipient you are
>>>> hereby notified that any disclosure, copying, distribution or taking
>>>> any action in reliance on the contents of this information is strictly
>>>> prohibited and may be unlawful. If you have received this communication
>>>> in error, please notify us immediately by responding to this email and
>>>> then delete it from your system.
>>>> The firm is neither liable for the proper and complete transmission of
>>>> the information contained in this communication nor for any delay in its
>>>> receipt.