On Mon, Feb 28, 2011 at 7:39 PM, Thejas M Nair <te...@yahoo-inc.com> wrote:
> Hi Charles,
> Which load function are you using?
>
I'm using a user-defined load function.

> Is it the default (PigStorage)?
>
Nope.

> In the hadoop counters for the job in the jobtracker ui, do you see the
> expected number of input records being read?
>
Is it possible to see the counters in the history interface on the JobTracker?
I will run the jobs again to compare the counters, but my guess is probably not!

> -Thejas
>
>
> On 2/28/11 10:57 AM, "Charles Gonçalves" <charles...@gmail.com> wrote:
>
> I'm not using any filtering in the script.
> I just want to see the total traffic per day across all logs.
>
> If I combine 1000 log files into one and run the script on the combined
> file, I get the correct answer for those logs.
> But when I run it on all *43458* log files, I get incorrect output.
> The correct output would be a histogram for each day of 2010-10, but the
> result contains only data from 2010-10-21.
> And if I process all the logs with an awk script, I get the correct answer.
>
>
> On Mon, Feb 28, 2011 at 3:29 PM, Daniel Dai <jiany...@yahoo-inc.com> wrote:
>
> > Not sure if I get your question. In 0.8, Pig combines small files into
> > one map, so it is possible you get fewer output files.
>
> This is not the problem.
> But thanks anyway!
>
> > If that is your concern, you can try to disable split combine using
> > "-Dpig.splitCombination=false"
> >
> > Daniel
> >
> >
> > Charles Gonçalves wrote:
> >
> >> I tried to process a large number of small files with Pig and ran into
> >> a strange problem.
> >>
> >> 2011-02-27 00:00:58,746 [Thread-15] INFO
> >> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> >> paths to process : *43458*
> >> 2011-02-27 00:00:58,755 [Thread-15] INFO
> >> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> >> input paths to process : *43458*
> >> 2011-02-27 00:01:14,173 [Thread-15] INFO
> >> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> >> input paths (combined) to process : *329*
> >>
> >> When the script finishes, the result covers only a subset of the
> >> input files.
> >> These are logs from a whole month, but the results are only from
> >> day 21.
> >>
> >>
> >> Maybe I'm missing something.
> >> Any ideas?
> >>
> >
>
> --
> *Charles Ferreira Gonçalves *
> http://homepages.dcc.ufmg.br/~charles/
> UFMG - ICEx - Dcc
> Cel.: 55 31 87741485
> Tel.: 55 31 34741485
> Lab.: 55 31 34095840
>

--
*Charles Ferreira Gonçalves *
http://homepages.dcc.ufmg.br/~charles/
UFMG - ICEx - Dcc
Cel.: 55 31 87741485
Tel.: 55 31 34741485
Lab.: 55 31 34095840
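
For anyone who wants to repeat the kind of independent cross-check described in the thread (the awk comparison), here is a minimal Python sketch. It is only an illustration: the actual log format is not shown in the thread, so the sketch assumes a hypothetical layout where each line starts with a YYYY-MM-DD date and ends with a byte count. The function names traffic_per_day and missing_days are made up for this example.

```python
from collections import defaultdict


def traffic_per_day(lines):
    """Sum the byte counts per day.

    Assumed (hypothetical) line layout: 'YYYY-MM-DD <other fields> <bytes>'.
    Lines that do not fit this layout are skipped.
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        day, size = fields[0], fields[-1]
        if not size.isdigit():
            continue
        totals[day] += int(size)
    return dict(totals)


def missing_days(reference, candidate):
    """Return the days present in the reference totals but absent from the candidate."""
    return sorted(set(reference) - set(candidate))


if __name__ == "__main__":
    # Tiny made-up sample standing in for the real logs.
    sample = [
        "2010-10-20 GET /a 100",
        "2010-10-21 GET /b 250",
        "2010-10-21 GET /c 50",
    ]
    reference = traffic_per_day(sample)
    # Pretend the job under test only produced day 21, as reported in the thread.
    job_output = {"2010-10-21": 300}
    print(missing_days(reference, job_output))
```

Comparing the set of days (and the per-day totals) from a plain script like this against the job output shows quickly whether whole days are being dropped or only some records within a day.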