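Even better, push the nested filter
tag_with_amenity = FILTER tag BY (tag_attr_k == 'amenity');
as high (that is, as early) in the script as possible, so that fewer records flow through the later steps.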
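One way to read this, as an untested sketch rather than a definitive script: apply the nested FILTER in the first FOREACH after the FLATTEN, then drop the non-matching nodes before any JOIN or GROUP. The names WithAmenity, Result and amenity_count are made up for illustration. The field names come from the DESCRIBE output in the quoted message, and I am assuming they can be referenced without the null:: prefix, as the quoted script does.

```pig
-- Unwrap the outer tuple so the fields can be addressed by name.
XmlTag = FOREACH xmlToTuple GENERATE FLATTEN($0);

-- Filter the nested bag right away, before any JOIN or GROUP.
WithAmenity = FOREACH XmlTag {
    tag_with_amenity = FILTER tag BY (tag_attr_k == 'amenity');
    GENERATE node_attr_id, node_attr_lon, node_attr_lat, tag,
             COUNT(tag_with_amenity) AS amenity_count;
};

-- Keep only the nodes that carry at least one amenity tag.
Result = FOREACH (FILTER WithAmenity BY amenity_count > 0)
         GENERATE node_attr_id, node_attr_lon, node_attr_lat, tag;
```

Because the full tag bag is carried along in each record, no JOIN or GROUP is needed afterwards to rebuild the list of tags per node.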
2013/1/31 Cheolsoo Park <cheol...@cloudera.com>

> Hi Jerome,
>
> Try this:
>
> XmlTag = FOREACH xmlToTuple GENERATE FLATTEN ($0);
> XmlTag2 = FOREACH XmlTag {
>     tag_with_amenity = FILTER tag BY (tag_attr_k == 'amenity');
>     GENERATE *, COUNT(tag_with_amenity) AS count;
> };
> XmlTag3 = FOREACH (FILTER XmlTag2 BY count > 0) GENERATE node_attr_id,
>     node_attr_lon, node_attr_lat, tag;
>
> Thanks,
> Cheolsoo
>
> On Thu, Jan 31, 2013 at 9:19 AM, Jerome Pierson
> <jerome.pier...@hurence.com> wrote:
>
> > Hi There,
> >
> > I am a beginner. I achieved something, but I guess I could have done
> > better. Let me explain.
> > (Pig 0.10)
> >
> > My data is DESCRIBEd as:
> >
> > xmlToTuple: {(node_attr_id: int,node_attr_lon: chararray,node_attr_lat: chararray,tag: {(tag_attr_k: chararray,tag_attr_v: chararray)})}
> >
> > and DUMPed like this:
> >
> > ((100312088,45.2745669,-12.7776222,{(created_by,JOSM)}))
> > ((100948454,45.2620946,-12.7849171,))
> > ((100948519,45.2356985,-12.7707014,{(created_by,JOSM)}))
> > ((704398904,45.2416667,-13.0058333,{(lat,-13.00583333),(lon,45.24166667)}))
> > ((1230941976,45.0743117,-12.6888807,{(place,village)}))
> > ((1230941977,45.0832807,-12.6810328,{(name,Mtsahara)}))
> > ((1976927219,45.2272263,-12.7794359,))
> > ((1751057677,45.2216163,-12.7825896,{(amenity,fast_food),(name,Brochetterie)}))
> > ((1751057678,45.2216953,-12.7829678,{(amenity,fast_food),(name,Brochetterie)}))
> > ((100948360,45.2338541,-12.7762230,{(amenity,ferry_terminal)}))
> > ((362795028,45.2086809,-12.8062991,{(amenity,fuel),(operator,Total)}))
> >
> > I want to extract the records which have a certain value for the
> > tag_attr_k field. For example, give me the records where there is a
> > tag_attr_k = amenity.
> > That should be:
> >
> > (100948360,-12.7762230,45.2338541,{(amenity,ferry_terminal)})
> > (362795028,-12.8062991,45.2086809,{(operator,Total),(amenity,fuel)})
> > (1751057677,-12.7825896,45.2216163,{(amenity,fast_food),(name,Brochetterie)})
> > (1751057678,-12.7829678,45.2216953,{(amenity,fast_food),(name,Brochetterie)})
> >
> > So (node_attr_id, node_attr_lat, node_attr_lon, {(tag_attr_k,tag_attr_v)...(tag_attr_k,tag_attr_v)})
> >
> > I ended up with this script.
> >
> > ...
> > XmlTag = foreach xmlToTuple GENERATE FLATTEN ($0); --removed top including level bag
> > XmlTag2 = foreach XmlTag GENERATE $0 as id, $1 as lon, $2 as lat, FLATTEN (tag) as (key, value); --flatten the bag of tags
> > XmlTag3 = FILTER XmlTag2 BY key == 'amenity'; -- get all the records with amenity tags
> > XmlTag4 = JOIN XmlTag3 BY id, XmlTag2 BY id; -- re-build records with all tags containing amenity tag
> > XmlTag7 = foreach XmlTag4 GENERATE $0 as id,$1 as lon, $2 as lat,$8 as key, $9 as value; -- re-build records : removing redundant field
> > XmlTag5 = GROUP XmlTag7 BY (id,lat,lon); -- re-build records : grouping redundant records
> > XmlTag8 = foreach XmlTag5 { --rebuild records : id,lat,long {(key,value)...(key,value)}
> >     tag = foreach XmlTag7 GENERATE key, value;
> >     GENERATE group.id,group.lat,group.lon,tag;
> > };
> >
> > Using these variables:
> >
> > xmlToTuple: {(node_attr_id: int,node_attr_lon: chararray,node_attr_lat: chararray,tag: {(tag_attr_k: chararray,tag_attr_v: chararray)})}
> > XmlTag: {null::node_attr_id: int,null::node_attr_lon: chararray,null::node_attr_lat: chararray,null::tag: {(tag_attr_k: chararray,tag_attr_v: chararray)}}
> > XmlTag2: {id: int,lon: chararray,lat: chararray,key: chararray,value: chararray}
> > XmlTag3: {id: int,lon: chararray,lat: chararray,key: chararray,value: chararray}
> > XmlTag4: {XmlTag3::id: int,XmlTag3::lon: chararray,XmlTag3::lat: chararray,XmlTag3::key: chararray,XmlTag3::value: chararray,XmlTag2::id: int,XmlTag2::lon: chararray,XmlTag2::lat: chararray,XmlTag2::key: chararray,XmlTag2::value: chararray}
> > XmlTag7: {id: int,lon: chararray,lat: chararray,key: chararray,value: chararray}
> > XmlTag5: {group: (id: int,lat: chararray,lon: chararray),XmlTag7: {(id: int,lon: chararray,lat: chararray,key: chararray,value: chararray)}}
> > XmlTag8: {id: int,lat: chararray,lon: chararray,tag: {(key: chararray,value: chararray)}}
> >
> > I guess this is not very straightforward and can be largely optimized.
> > Please give me some hints.
> >
> > Regards,
> > Jérôme