Hi Experts,

I'm trying to transform a couple of thousand delimited files stored on HDFS using Pig. Each file is between 20 and 200 MB in size. The files have a very simple column definition, an event history:

TimeStamp, Location, Source, Target, EventType, Description

The logic is as follows:
- Each file is already in natural order by the timestamp column
- EventType can be either Start or Complete

What I'm trying to do is match the first Complete event that occurred after a Start event for a given Location, Source, Target combination, so that I can calculate the durations. So the transformation will convert all the files

FROM:

TimeStamp,Location,Source,Target,EventType,Description
14:00:43,A,S1,D1,Start,Description1
14:01:02,A,S1,D2,Start,Description2
14:01:43,A,S1,D1,Complete,Description3
14:03:02,A,S1,D2,Complete,Description4
14:03:43,A,S2,D1,Start,Description5
14:03:43,A,S1,D1,Start,Description6
14:04:53,A,S2,D1,Complete,Description7

TO:

TimeStamp,Location,Source,Target,Duration
14:00:43,A,S1,D1,01:00
14:01:02,A,S1,D2,02:00
14:03:43,A,S2,D1,01:10

I thought I could either leverage the fact that the individual files are already sorted (and the filenames reveal which file comes first), or import them all and sort them together at once. However, I'm not sure how to process the files in that order and apply the grouping / sequence-based duration extraction within each file. Can I ask for your opinion or some guidance / hints? Which approach better leverages the parallelism of the Hadoop cluster?
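To make the pairing rule concrete, here is a small sketch of the per-(Location, Source, Target) matching I have in mind, written in plain Python rather than Pig (this is just illustrative pseudologic for what a UDF or reducer would do; the function and variable names are my own, not from any library):

```python
from datetime import datetime

def match_durations(rows):
    """rows: (ts, location, source, target, event_type, description) tuples,
    already sorted by timestamp. Yields (start_ts, loc, src, tgt, duration)
    whenever a Complete follows a Start for the same key."""
    open_starts = {}  # (loc, src, tgt) -> timestamp of the pending Start
    for ts, loc, src, tgt, etype, _desc in rows:
        key = (loc, src, tgt)
        if etype == "Start":
            open_starts[key] = ts  # remember the pending Start for this key
        elif etype == "Complete" and key in open_starts:
            start = open_starts.pop(key)
            secs = int((datetime.strptime(ts, "%H:%M:%S")
                        - datetime.strptime(start, "%H:%M:%S")).total_seconds())
            mm, ss = divmod(secs, 60)
            yield (start, loc, src, tgt, f"{mm:02d}:{ss:02d}")

# The sample data from above:
rows = [
    ("14:00:43", "A", "S1", "D1", "Start", "Description1"),
    ("14:01:02", "A", "S1", "D2", "Start", "Description2"),
    ("14:01:43", "A", "S1", "D1", "Complete", "Description3"),
    ("14:03:02", "A", "S1", "D2", "Complete", "Description4"),
    ("14:03:43", "A", "S2", "D1", "Start", "Description5"),
    ("14:03:43", "A", "S1", "D1", "Start", "Description6"),
    ("14:04:53", "A", "S2", "D1", "Complete", "Complete7"),
]
for out in match_durations(rows):
    print(",".join(out))
```

Running this on the sample data prints the three expected output rows (the second Start for A,S1,D1 stays unmatched because no Complete follows it). The open question is how best to express this sequential, order-dependent logic in Pig across many files.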
Kind Regards,