FWIW, something similar happened with the HBase loader in 0.8 -- only the first of the combined splits was read. (I worked around this by turning off split combination in the loader's setLocation; see PIG-1680.)
D

On Tue, Mar 1, 2011 at 2:02 PM, Charles Gonçalves <charles...@gmail.com> wrote:
> Ok ...
>
> I'm sending both.
>
> Versions:
> Apache Pig version 0.8.0 (r1043805)
> compiled Dec 08 2010, 17:26:09
> Hadoop 0.20.2
>
> On Tue, Mar 1, 2011 at 6:44 PM, Daniel Dai <jiany...@yahoo-inc.com> wrote:
>> Combined input splits should be able to handle compressed files: a
>> separate RecordReader is created for each file within one input split,
>> so gzip concatenation should not be the cause. I am not sure what is
>> happening in your script. If possible, give us more information
>> (script, UDF, data, version).
>>
>> Daniel
>>
>> On 02/28/2011 05:40 PM, Charles Gonçalves wrote:
>>
>> Guys,
>>
>> The amount of data in the source dir:
>> hdfs://hydra1:57810/user/cdh-hadoop/mscdata/201010_raw  22567369111
>>
>> What I did was: I ran with all 43458 logs, and the counters are:
>>
>> Counter             Map            Reduce       Total
>> FILE_BYTES_READ     253,905,706    372,708,857  626,614,563
>> HDFS_BYTES_READ     2,553,123,734  0            2,553,123,734
>> FILE_BYTES_WRITTEN  619,877,917    372,708,857  992,586,774
>> HDFS_BYTES_WRITTEN  0              535          535
>>
>> I did a manual join of the files and ran again on the 336 files (the
>> merge of all those files). The job hasn't finished yet and the
>> counters are:
>>
>> Counter             Map             Reduce          Total
>> FILE_BYTES_READ     21,054,970,818  0               21,054,970,818
>> HDFS_BYTES_READ     16,772,063,486  0               16,772,063,486
>> FILE_BYTES_WRITTEN  39,797,038,008  10,404,287,551  50,201,325,55
>>
>> I think the problem could be in the combination of the input files.
>> Is the combination class aware of compression? Because *all my files
>> are compressed*, maybe the class performs a concatenation and we hit
>> the HDFS limitation on concatenated gzip files.
>>
>> On Mon, Feb 28, 2011 at 8:47 PM, Charles Gonçalves <charles...@gmail.com> wrote:
>>>
>>> On Mon, Feb 28, 2011 at 7:39 PM, Thejas M Nair <te...@yahoo-inc.com> wrote:
>>>> Hi Charles,
>>>> Which load function are you using?
>>>
>>> I'm using a user-defined load function ..
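The gzip-concatenation pitfall raised above can be illustrated outside Hadoop: a decoder that stops at the end of the first gzip member silently drops everything after it, which would look exactly like "only one day of data survives". A minimal Python sketch (the record contents and dates here are made up for illustration):

```python
import gzip
import zlib

# Two gzip members concatenated into one byte stream, the way
# "cat a.gz b.gz > all.gz" (or a naive byte-level merge) would produce.
member1 = gzip.compress(b"2010-10-20 record\n")
member2 = gzip.compress(b"2010-10-21 record\n")
blob = member1 + member2

# A concatenation-aware reader decodes every member and sees both records.
full = gzip.decompress(blob)

# A decoder that stops at the first member's end-of-stream marker sees
# only the first record; the second member is left behind as unread input.
d = zlib.decompressobj(wbits=31)  # wbits=31 selects gzip framing
first_only = d.decompress(blob)
# d.unused_data now holds the undecoded second member.
```

If the split-combining input path behaved like the second decoder, the symptom would match the thread: a subset of the data silently disappears with no error.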
>>>> Is it the default (PigStorage)?
>>>
>>> Nope ...
>>>
>>>> In the Hadoop counters for the job in the JobTracker UI, do you see
>>>> the expected number of input records being read?
>>>
>>> Is it possible to see the counters in the history interface on the
>>> JobTracker?
>>>
>>> I will run the jobs again to compare the counters, but my guess is
>>> probably not!
>>>
>>>> -Thejas
>>>>
>>>> On 2/28/11 10:57 AM, "Charles Gonçalves" <charles...@gmail.com> wrote:
>>>>
>>>> I'm not using any filtering in the script.
>>>> I just want to see the total traffic per day across all the logs.
>>>>
>>>> If I combine 1000 log files into one and run the script on that file,
>>>> I get the correct answer for those logs.
>>>> But when I run with all *43458* log files I get incorrect output:
>>>> the correct result would be a histogram for each day of 2010-10, but
>>>> the result contains only data from 2010-10-21.
>>>> And if I process all the logs with an awk script, I get the correct
>>>> answer.
>>>>
>>>> On Mon, Feb 28, 2011 at 3:29 PM, Daniel Dai <jiany...@yahoo-inc.com> wrote:
>>>>
>>>>> Not sure if I get your question. In 0.8, Pig combines small files
>>>>> into one map, so it is possible you get fewer output files.
>>>>
>>>> This is not the problem.
>>>> But thanks anyway!
>>>>
>>>>> If that is your concern, you can try to disable split combination
>>>>> using "-Dpig.splitCombination=false"
>>>>>
>>>>> Daniel
>>>>>
>>>>> Charles Gonçalves wrote:
>>>>>
>>>>>> I tried to process a big number of small files in Pig and I hit a
>>>>>> strange problem.
>>>>>>
>>>>>> 2011-02-27 00:00:58,746 [Thread-15] INFO
>>>>>>   org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total
>>>>>>   input paths to process : *43458*
>>>>>> 2011-02-27 00:00:58,755 [Thread-15] INFO
>>>>>>   org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
>>>>>>   Total input paths to process : *43458*
>>>>>> 2011-02-27 00:01:14,173 [Thread-15] INFO
>>>>>>   org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
>>>>>>   Total input paths (combined) to process : *329*
>>>>>>
>>>>>> When the script finishes processing, the result covers just a
>>>>>> subgroup of the input files. These are logs from a whole month,
>>>>>> but the results contain only day 21.
>>>>>>
>>>>>> Maybe I'm missing something.
>>>>>> Any ideas?
>>>>
>>>> --
>>>> *Charles Ferreira Gonçalves *
>>>> http://homepages.dcc.ufmg.br/~charles/
>>>> UFMG - ICEx - Dcc
>>>> Cel.: 55 31 87741485
>>>> Tel.: 55 31 34741485
>>>> Lab.: 55 31 34095840
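For reference, the per-day tally that was cross-checked with awk amounts to something like the following sketch. The log layout is assumed here (a YYYY-MM-DD date in the first column and a byte count in the second); the real format is never shown in the thread:

```python
from collections import defaultdict

def traffic_per_day(lines):
    """Build a per-day traffic histogram from log lines.

    Assumes each line starts with a YYYY-MM-DD date followed by a byte
    count -- a hypothetical layout, since the thread never shows one.
    """
    totals = defaultdict(int)
    for line in lines:
        day, nbytes = line.split()[:2]
        totals[day] += int(nbytes)
    return dict(totals)

# With correct input handling, every day of the month gets a bucket --
# not just 2010-10-21 as in the bad Pig run.
sample = [
    "2010-10-20 100",
    "2010-10-21 250",
    "2010-10-21 50",
]
histogram = traffic_per_day(sample)
```

A result like the bad run described above would show up here as a dictionary containing only the `2010-10-21` key.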