Omg, thanks, it's exactly the thing I need. But I can't do it before GROUP. I need to group by key, then sort by the timestamp field inside each group. Only after the sort is done can I determine which records are not valid. The case I provided is simplified.
The only problem is that SPLIT is not allowed in a nested FOREACH statement.

2013/7/23 Pradeep Gollakota <pradeep...@gmail.com>

> You can use the SPLIT operator to split a relation into two (or more)
> relations. http://pig.apache.org/docs/r0.11.1/basic.html#SPLIT
>
> Also, you should probably do this before GROUP. As a best practice (and
> general pig optimization strategy), you should filter (and project) early
> and often.
>
>
> On Tue, Jul 23, 2013 at 4:27 AM, Serega Sheypak <serega.shey...@gmail.com
> > wrote:
>
> > Hi, I have a rather simple problem and I can't come up with a nice solution.
> > Here is my input:
> > msisdn  longitude  latitude  ts
> > 1       20.30      40.50     123
> > 1       0.0        null      456
> > 2       60.70      34.67     678
> > 2       null       null      978
> >
> > I need:
> > group by msisdn
> > order by ts inside each group
> > filter records in each group:
> > 1. put all records where longitude, latitude are valid on one side
> > 2. put all records where longitude/latitude = 0.0/null on the other side
> >
> > Here is pig pseudo-code:
> > rawRecords = LOAD '/data' as ...;
> > grouped = GROUP rawRecords BY msisdn;
> > validAndNotValidRecords = FOREACH grouped {
> >     ordered = ORDER rawRecords BY ts;
> >     --do something here to filter valid and not valid records....
> > }
> > STORE notValidRecords INTO '/not_valid_data';
> >
> > someOtherProjection = GROUP validRecords BY msisdn;
> > --continue to work with filtered valid records...
> >
> > Can I do it in a single pig script, or do I need to create two scripts:
> > the first one would filter not valid records and store them
> > the second one would continue to process the filtered set of records?
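
One possible workaround for the SPLIT restriction, as a minimal sketch and not a tested solution: FILTER and ORDER are both allowed inside a nested FOREACH, so two complementary FILTERs can play the role of SPLIT. The schema, the tab delimiter, and the relation names partitioned, valid, notValid, validOrdered and notValidOrdered are my assumptions. I also assume a record is not valid when either coordinate is null or 0.0. The real validity check that depends on the sort order is not shown here, since the case above is simplified.

```pig
-- Assumed schema and delimiter; a literal "null" in the input fails the
-- cast to double and is loaded as a Pig null.
rawRecords = LOAD '/data' USING PigStorage('\t')
             AS (msisdn:chararray, longitude:double, latitude:double, ts:long);

grouped = GROUP rawRecords BY msisdn;

-- SPLIT is not allowed here, so use two complementary FILTERs instead.
-- Each filtered bag is ordered afterwards, so the result does not rely on
-- FILTER preserving the order of its input bag.
partitioned = FOREACH grouped {
    valid    = FILTER rawRecords BY (longitude IS NOT NULL) AND (longitude != 0.0)
                                AND (latitude  IS NOT NULL) AND (latitude  != 0.0);
    notValid = FILTER rawRecords BY (longitude IS NULL) OR (longitude == 0.0)
                                 OR (latitude  IS NULL) OR (latitude  == 0.0);
    validOrdered    = ORDER valid BY ts;
    notValidOrdered = ORDER notValid BY ts;
    GENERATE group AS msisdn,
             validOrdered    AS valid,
             notValidOrdered AS notValid;
};

-- Not valid side: flatten the bag back into plain records and store it.
notValidRecords = FOREACH partitioned GENERATE FLATTEN(notValid);
STORE notValidRecords INTO '/not_valid_data';

-- Valid side: still one row per msisdn, with a bag of records ordered by ts.
validRecords = FOREACH partitioned GENERATE msisdn, valid;
--continue to work with filtered valid records...
```

Since partitioned already holds one row per msisdn, the second GROUP from the pseudo-code may not be needed. Both branches can live in one script: as far as I know, Pig's multi-query execution handles several STORE statements in a single script, so splitting this into two scripts should not be necessary.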