I am not an expert, learning just like you, but I will attempt to answer
and provide some justification.

IMO, depending on sorted file naming / pre-sorted data would restrict your
processing logic. What you want to leverage is parallelism, so keep the
digest-and-process logic generic so that it can run in parallel.
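
For example, a generic shape for your Start/Complete matching could look
something like the sketch below. This is rough and untested:
MatchStartComplete is a hypothetical UDF you would write yourself to walk
each ordered bag and pair every Start with the first Complete that follows
it, and the paths are placeholders.

-- register the jar containing your (hypothetical) matching UDF
REGISTER myudfs.jar;

events = LOAD '/data/events' USING PigStorage(',')
         AS (ts:chararray, loc:chararray, src:chararray,
             tgt:chararray, etype:chararray, descr:chararray);

-- shuffle by the (Location, Source, Target) key so every key's events
-- land together, regardless of which input file they came from
by_key = GROUP events BY (loc, src, tgt);

-- re-sort each group's bag by timestamp, then let the UDF pair up
-- Start/Complete events and emit one tuple per matched pair
durations = FOREACH by_key {
    ordered = ORDER events BY ts;
    GENERATE FLATTEN(myudfs.MatchStartComplete(ordered));
};

STORE durations INTO '/data/durations' USING PigStorage(',');

The point is that the input file order stops mattering: the shuffle
re-sorts the events per key, and every group runs through the same generic
logic in parallel.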

While reading the files, you should just control the split size / combined
split size (http://pig.apache.org/docs/r0.14.0/perf.html#combine-files) and
let Pig handle the rest of the mapper generation.
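
For instance, both knobs from that page can be set right inside the script;
the values below are just illustrative (a common starting point is to push
the combined split size toward your HDFS block size):

-- merge many small input files into fewer, larger splits
SET pig.splitCombination true;
-- upper bound for a combined split, in bytes (128 MB here)
SET pig.maxCombinedSplitSize 134217728;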

Similarly, there are efficiency algorithms built into Pig's shuffle phase
that should optimize how the reducer load is distributed across keys.
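
And if the default reducer estimate turns out to be off for your data, you
can still override it explicitly; 20 below is just a placeholder:

-- script-wide default for all reduce-side operators
SET default_parallel 20;

-- or per operator, on the grouping itself
by_key = GROUP events BY (loc, src, tgt) PARALLEL 20;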



Cheers !!
Arvind

On Tue, Mar 17, 2015 at 1:29 AM, Troy X <troy...@hotmail.com> wrote:

>
> Hi Experts,
> I'm trying to transform a couple of thousand delimited files stored
> on HDFS using Pig. Each file is between 20 and 200 MB in size. The
> files have a very simple column definition, like an event history:
> TimeStamp, Location, Source, Target, EventType, Description
> The logic is as follows:
> - Each file is already in natural order by the timestamp column
> - Event type can be either Start or Complete
> What I'm trying to do is match the first Complete event that occurred
> after a Start event for a given Location, Source, Target combination,
> so that I can calculate the durations. So the transformation will
> convert all the files
> FROM:
>
> TimeStamp,Location,Source,Target,EventType,Description
> 14:00:43,A,S1,D1,Start,Description1
> 14:01:02,A,S1,D2,Start,Description2
> 14:01:43,A,S1,D1,Complete,Description3
> 14:03:02,A,S1,D2,Complete,Description4
> 14:03:43,A,S2,D1,Start,Description5
> 14:03:43,A,S1,D1,Start,Description6
> 14:04:53,A,S2,D1,Complete,Description7
> TO:
>
> TimeStamp,Location,Source,Target,Duration
> 14:00:43,A,S1,D1,01:00
> 14:01:02,A,S1,D2,02:00
> 14:03:43,A,S2,D1,01:10
> I thought I should leverage the fact that the individual files are
> already sorted and the filenames reveal which file comes first, or
> that I could import them all and sort them together at once. However,
> I'm not sure how to process the files in that order and apply the
> grouping / sequence-based duration extraction to each file.
> Can I ask for your opinion or some guidance / hints? Which way is
> better for leveraging the parallelism of the Hadoop cluster?
>
> Kind Regards,
>
