FWIW, something similar happened with the HBase loader in 0.8 -- only the first of the combined splits was read. (I worked around this by turning off split combination in the loader's setLocation; see PIG-1680.)
D

On Tue, Mar 1, 2011 at 2:02 PM, Charles Gonçalves <charles...@gmail.com> wrote:
> Ok ...
>
> I'm sending both.
>
> Versions:
> Apache Pig version 0.8.0 (r1043805)
> compiled Dec 08 2010, 17:26:09
> Hadoop 0.20.2
>
> On Tue, Mar 1, 2011 at 6:44 PM, Daniel Dai <jiany...@yahoo-inc.com> wrote:
>> Combined input splits should be able to handle compressed files: a
>> separate RecordReader is created for each file within one input split,
>> so gzip concatenation should not be the cause. I am not sure what is
>> happening in your script. If possible, give us more information
>> (script, UDF, data, version).
>>
>> Daniel
>>
>> On 02/28/2011 05:40 PM, Charles Gonçalves wrote:
>>
>> Guys,
>>
>> The amount of data in the source dir:
>> hdfs://hydra1:57810/user/cdh-hadoop/mscdata/201010_raw  22567369111
>>
>> What I did was: I ran with all 43458 logs, and the counters are:
>>
>> Counter             Map            Reduce       Total
>> FILE_BYTES_READ     253,905,706    372,708,857  626,614,563
>> HDFS_BYTES_READ     2,553,123,734  0            2,553,123,734
>> FILE_BYTES_WRITTEN  619,877,917    372,708,857  992,586,774
>> HDFS_BYTES_WRITTEN  0              535          535
>>
>> I did a manual join of the files and ran again on the 336 files (the
>> merge of all those files). The job hasn't finished yet and the
>> counters are:
>>
>> Counter             Map             Reduce          Total
>> FILE_BYTES_READ     21,054,970,818  0               21,054,970,818
>> HDFS_BYTES_READ     16,772,063,486  0               16,772,063,486
>> FILE_BYTES_WRITTEN  39,797,038,008  10,404,287,551  50,201,325,55
>>
>> I think the problem could be in the combination of the input files.
>> Is the combination class aware of compression? Because *all my files
>> are compressed*, maybe the class performs a concatenation and we hit
>> the HDFS limitation on concatenated gzip files.
>>
>> On Mon, Feb 28, 2011 at 8:47 PM, Charles Gonçalves <charles...@gmail.com> wrote:
>>>
>>> On Mon, Feb 28, 2011 at 7:39 PM, Thejas M Nair <te...@yahoo-inc.com> wrote:
>>>> Hi Charles,
>>>> Which load function are you using?
>>>
>>> I'm using a user-defined load function ..
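The gzip-concatenation pitfall raised above can be illustrated outside Hadoop: a decoder that stops at the end of the first gzip member silently drops everything after it, which would look exactly like "only one day of data survives". A minimal Python sketch (the record contents and dates here are made up for illustration):

```python
import gzip
import zlib

# Two gzip members concatenated into one byte stream, the way
# "cat a.gz b.gz > all.gz" (or a naive byte-level merge) would produce.
member1 = gzip.compress(b"2010-10-20 record\n")
member2 = gzip.compress(b"2010-10-21 record\n")
blob = member1 + member2

# A concatenation-aware reader decodes every member and sees both records.
full = gzip.decompress(blob)

# A decoder that stops at the first member's end-of-stream marker sees
# only the first record; the second member is left behind as unread input.
d = zlib.decompressobj(wbits=31)  # wbits=31 selects gzip framing
first_only = d.decompress(blob)
# d.unused_data now holds the undecoded second member.
```

If the split-combining input path behaved like the second decoder, the symptom would match the thread: a subset of the data silently disappears with no error.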
>>>> Is it the default (PigStorage)?
>>>
>>> Nope ...
>>>
>>>> In the Hadoop counters for the job in the JobTracker UI, do you see
>>>> the expected number of input records being read?
>>>
>>> Is it possible to see the counters in the history interface on the
>>> JobTracker?
>>>
>>> I will run the jobs again to compare the counters, but my guess is
>>> probably not!
>>>
>>>> -Thejas
>>>>
>>>> On 2/28/11 10:57 AM, "Charles Gonçalves" <charles...@gmail.com> wrote:
>>>>
>>>> I'm not using any filtering in the script.
>>>> I just want to see the total traffic per day across all the logs.
>>>>
>>>> If I combine 1000 log files into one and run the script on that file,
>>>> I get the correct answer for those logs.
>>>> But when I run with all *43458* log files I get incorrect output:
>>>> the correct result would be a histogram for each day of 2010-10, but
>>>> the result contains only data from 2010-10-21.
>>>> And if I process all the logs with an awk script, I get the correct
>>>> answer.
>>>>
>>>> On Mon, Feb 28, 2011 at 3:29 PM, Daniel Dai <jiany...@yahoo-inc.com> wrote:
>>>>
>>>>> Not sure if I get your question. In 0.8, Pig combines small files
>>>>> into one map, so it is possible you get fewer output files.
>>>>
>>>> This is not the problem.
>>>> But thanks anyway!
>>>>
>>>>> If that is your concern, you can try to disable split combination
>>>>> using "-Dpig.splitCombination=false"
>>>>>
>>>>> Daniel
>>>>>
>>>>> Charles Gonçalves wrote:
>>>>>
>>>>>> I tried to process a big number of small files in Pig and I hit a
>>>>>> strange problem.
>>>>>>
>>>>>> 2011-02-27 00:00:58,746 [Thread-15] INFO
>>>>>>   org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total
>>>>>>   input paths to process : *43458*
>>>>>> 2011-02-27 00:00:58,755 [Thread-15] INFO
>>>>>>   org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
>>>>>>   Total input paths to process : *43458*
>>>>>> 2011-02-27 00:01:14,173 [Thread-15] INFO
>>>>>>   org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
>>>>>>   Total input paths (combined) to process : *329*
>>>>>>
>>>>>> When the script finishes processing, the result covers just a
>>>>>> subgroup of the input files. These are logs from a whole month,
>>>>>> but the results contain only day 21.
>>>>>>
>>>>>> Maybe I'm missing something.
>>>>>> Any ideas?
>>>>
>>>> --
>>>> *Charles Ferreira Gonçalves *
>>>> http://homepages.dcc.ufmg.br/~charles/
>>>> UFMG - ICEx - Dcc
>>>> Cel.: 55 31 87741485
>>>> Tel.: 55 31 34741485
>>>> Lab.: 55 31 34095840
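For reference, the per-day tally that was cross-checked with awk amounts to something like the following sketch. The log layout is assumed here (a YYYY-MM-DD date in the first column and a byte count in the second); the real format is never shown in the thread:

```python
from collections import defaultdict

def traffic_per_day(lines):
    """Build a per-day traffic histogram from log lines.

    Assumes each line starts with a YYYY-MM-DD date followed by a byte
    count -- a hypothetical layout, since the thread never shows one.
    """
    totals = defaultdict(int)
    for line in lines:
        day, nbytes = line.split()[:2]
        totals[day] += int(nbytes)
    return dict(totals)

# With correct input handling, every day of the month gets a bucket --
# not just 2010-10-21 as in the bad Pig run.
sample = [
    "2010-10-20 100",
    "2010-10-21 250",
    "2010-10-21 50",
]
histogram = traffic_per_day(sample)
```

A result like the bad run described above would show up here as a dictionary containing only the `2010-10-21` key.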