lzop should work.

On Mon, Sep 27, 2010 at 10:59 AM, Rohan Rai <rohan....@inmobi.com> wrote:

> Well,
>
> I haven't tried (or rather, I don't remember trying) compressing with lzop
> and then putting the file on the cluster, so I can't tell you about that.
> Here is what works for me.
>
> I first put the file on the cluster and then do stream compression.
>
> And yes, it need not be indexed. (I guess it doesn't matter for a small
> test file; for larger files it is unwise to skip indexing, because one
> loses the benefit of parallelism.)
>
> Regards
> Rohan
>
>
> pig wrote:
>
>> Hi Rohan,
>>
>> The test file (test_input_chars.txt.lzo) is not indexed.  I created it
>> using the command
>>
>> 'lzop test_input_chars.txt'
>>
>> It's a really small file (only 6 lines), so I didn't think it needed to be
>> indexed.  Do all files, regardless of size, need to be indexed for the
>> LzoTokenizedLoader to work?
>>
>> Thank you!
>>
>> ~Ed
>>
>> On Mon, Sep 27, 2010 at 1:25 AM, Rohan Rai <rohan....@inmobi.com> wrote:
>>
>>
>>> Oh, sorry, I am completely out of sync...
>>>
>>> Can you tell me how you LZO-compressed and indexed the file?
>>>
>>>
>>> Regards
>>> Rohan
>>>
>>> Rohan Rai wrote:
>>>
>>>
>>>> Oh, sorry, I did not see this mail...
>>>>
>>>> It's not an official patch/release, but here is a fork of elephant-bird
>>>> that works with Pig 0.7 for normal LZO text loading etc. (not the
>>>> HBase loader).
>>>>
>>>> Regards
>>>> Rohan
>>>>
>>>> Dmitriy Ryaboy wrote:
>>>>
>>>>> The 0.7 branch is not tested... it's quite likely it doesn't actually
>>>>> work :).
>>>>> Rohan Rai was working on it. Rohan, think you can take a look and help
>>>>> Ed out?
>>>>>
>>>>> Ed, you may want to check if the same input works when you use Pig 0.6
>>>>> (and the official elephant-bird, on Kevin Weil's github).
>>>>>
>>>>> -D
>>>>>
>>>>> On Thu, Sep 23, 2010 at 6:49 AM, pig <hadoopn...@gmail.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> After getting all the errors to go away with LZO libraries not being
>>>>>> found and missing jar files for elephant-bird, I've run into a new
>>>>>> problem when using the elephant-bird branch for Pig 0.7.
>>>>>>
>>>>>> The following simple pig script works as expected
>>>>>>
>>>>>>   REGISTER elephant-bird-1.0.jar
>>>>>>   REGISTER /usr/lib/elephant-bird/lib/google-collect-1.0.jar
>>>>>>   A = load '/usr/foo/input/test_input_chars.txt';
>>>>>>   DUMP A;
>>>>>>
>>>>>> This just dumps out the contents of the test_input_chars.txt file,
>>>>>> which is tab delimited. The output looks like:
>>>>>>
>>>>>>   (1,a,a,a,a,a,a)
>>>>>>   (2,b,b,b,b,b,b)
>>>>>>   (3,c,c,c,c,c,c)
>>>>>>   (4,d,d,d,d,d,d)
>>>>>>   (5,e,e,e,e,e,e)
>>>>>>
>>>>>> I then lzop the test file to get test_input_chars.txt.lzo (I
>>>>>> decompressed this with lzop -d to make sure the compression worked fine
>>>>>> and everything looks good).
>>>>>> If I run the exact same script provided above on the lzo file it works
>>>>>> fine.  However, this file is really small and doesn't need to use
>>>>>> indexes. As a result, I wanted to have LZO support that worked with
>>>>>> indexes.  Based on this I decided to try out the elephant-bird branch
>>>>>> for Pig 0.7 located here (http://github.com/hirohanin/elephant-bird/)
>>>>>> as recommended by Dmitriy.
>>>>>>
>>>>>> I created the following pig script that mirrors the above script but
>>>>>> should hopefully work on LZO files (including indexed ones):
>>>>>>
>>>>>>   REGISTER elephant-bird-1.0.jar
>>>>>>   REGISTER /usr/lib/elephant-bird/lib/google-collect-1.0.jar
>>>>>>   A = load '/usr/foo/input/test_input_chars.txt.lzo' USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\t');
>>>>>>   DUMP A;
>>>>>>
>>>>>> When I run this script, which uses the LzoTokenizedLoader, there is no
>>>>>> output.  The script appears to run without errors, but there are zero
>>>>>> Records Written and 0 Bytes Written.
>>>>>>
>>>>>> Here is the exact output:
>>>>>>
>>>>>> grunt > DUMP A;
>>>>>> [main] INFO com.twitter.elephantbird.pig.load.LzoTokenizedLoader - LzoTokenizedLoader with given delimited [     ]
>>>>>> [main] INFO com.twitter.elephantbird.pig.load.LzoTokenizedLoader - LzoTokenizedLoader with given delimited [     ]
>>>>>> [main] INFO com.twitter.elephantbird.pig.load.LzoTokenizedLoader - LzoTokenizedLoader with given delimited [     ]
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: Store(hdfs://master:9000/tmp/temp-2052828736/tmp-1533645117:org.apache.pig.builtin.BinStorage) - 1-4 Operator Key: 1-4
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
>>>>>> [Thread-12] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments.  Applications should implement Tool for the same.
>>>>>> [Thread-12] INFO com.twitter.elephantbird.pig.load.LzoTokenizedLoader - LzoTokenizedLoader with given delimiter [     ]
>>>>>> [Thread-12] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201009101108_0151
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at http://master:50030/jobdetails.jsp?jobid=job_201009101108_0151
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Succesfully stored result in "hdfs://amb-hadoop-01:9000/tmp/temp-2052828736/tmp-1533645117
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written: 0
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written: 0
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Spillable Memory Manager spill count : 0
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Proactive spill count : 0
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
>>>>>> [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process: 1
>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process: 1
>>>>>> grunt >
>>>>>>
>>>>>> I'm not sure if I'm doing something wrong in my use of
>>>>>> LzoTokenizedLoader or if there is a problem with the class itself (most
>>>>>> likely the problem is with my code, heh).  Thank you for any help!
>>>>>>
>>>>>> ~Ed
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>> The information contained in this communication is intended solely for
>>>> the use of the individual or entity to whom it is addressed and others
>>>> authorized to receive it. It may contain confidential or legally
>>>> privileged information. If you are not the intended recipient you are
>>>> hereby notified that any disclosure, copying, distribution or taking any
>>>> action in reliance on the contents of this information is strictly
>>>> prohibited and may be unlawful. If you have received this communication
>>>> in error, please notify us immediately by responding to this email and
>>>> then delete it from your system. The firm is neither liable for the
>>>> proper and complete transmission of the information contained in this
>>>> communication nor for any delay in its receipt.
>>>>
>>>>
>>>>
>>>>
>>
>>
>>
>
>
