You can use the SPLIT operator to split a relation into two (or more) relations. http://pig.apache.org/docs/r0.11.1/basic.html#SPLIT
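
Here is a rough sketch of what that could look like for your data. I'm guessing at the schema, the tab delimiter and the exact rule for "not valid" (either coordinate null or 0.0), so adjust those to your real input. The relation names and paths are taken from your pseudo-code.

```pig
-- Assumed schema and delimiter; a literal "null" in a double column loads as a Pig null.
rawRecords = LOAD '/data' USING PigStorage('\t')
    AS (msisdn:chararray, longitude:double, latitude:double, ts:long);

-- Split before grouping. The null checks come first so that the
-- numeric comparisons only decide rows where both values are present.
SPLIT rawRecords INTO
    notValidRecords IF (longitude IS NULL OR latitude IS NULL
                        OR longitude == 0.0 OR latitude == 0.0),
    validRecords IF (longitude IS NOT NULL AND latitude IS NOT NULL
                     AND longitude != 0.0 AND latitude != 0.0);

STORE notValidRecords INTO '/not_valid_data';

-- Continue with the valid records only: group by msisdn, order by ts inside each group.
grouped = GROUP validRecords BY msisdn;
orderedByTs = FOREACH grouped {
    ordered = ORDER validRecords BY ts;
    GENERATE group AS msisdn, ordered;
};
```

Both branches live in the same script, so you should not need a separate script just to store the invalid records.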
Also, you should probably do this before the GROUP. As a best practice (and a general Pig optimization strategy), you should filter (and project) early and often.

On Tue, Jul 23, 2013 at 4:27 AM, Serega Sheypak <[email protected]> wrote:

> Hi, I have rather simple problem and I can't create nice solution.
> Here is my input:
> msisdn longitude latitude ts
> 1 20.30 40.50 123
> 1 0.0 null 456
> 2 60.70 34.67 678
> 2 null null 978
>
> I need:
> group by msisdn
> order by ts inside each group
> filter records in each group:
> 1. put all records where longitude, latitude are valid on one side
> 2. put all records where longitude/latitude = 0.0/null to the other side
>
> Here is pig pseudo-code:
> rawRecords = LOAD '/data' as ...;
> grouped = GROUP rawRecords BY msisdn;
> validAndNotValidRecords = FOREACH grouped{
> ordered = ORDER rawRecords BY ts;
> --do something here to filter valid and not valid records....
> }
> STORE notValidRecords INTO /not_valid_data;
>
> someOtherProjection = GROUP validRecords By msisdn;
> --continue to work with filtered valid records...
>
> Can I do it in a single pig script, or I need to create two scripts:
> the first one would filter not valid records and store them
> the second one will continue to process filtered set of records?
>
