[ 
https://issues.apache.org/jira/browse/PIG-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey resolved PIG-3402.
-------------------------

    Resolution: Not A Problem

I've found the problem.
The 'ts' atom comes from an Avro file, and this field was defined as 'int' in
the Avro schema.
Later in the Pig script it was cast to long.
I removed the cast to long and kept the native int type.
The problem is gone.
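
For illustration only, here is a minimal Python sketch of how a sort can come
out this way when the values in one column carry different runtime types. It
assumes the comparator orders values of different types by type first and
compares the values only when the types match; that is an assumption about the
mechanism, not something confirmed in this issue. The INT and LONG tags and
the compare function are hypothetical names.

```python
from functools import cmp_to_key

# Hypothetical type tags: the only assumption that matters is INT < LONG.
INT, LONG = 0, 1


def compare(a, b):
    """Compare (type_tag, value) pairs: type first, value only if types match."""
    (ta, va), (tb, vb) = a, b
    if ta != tb:
        return -1 if ta < tb else 1
    return (va > vb) - (va < vb)


# ts values from the issue: two carried as int, one carried as long.
rows = [(INT, 1374529463), (INT, 1374550275), (LONG, 1374521131)]

ordered = sorted(rows, key=cmp_to_key(compare))
print([value for _, value in ordered])
```

Under that assumption the smallest ts ends up last simply because it is the
only value tagged as long, which matches the order seen in the UDF input.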

                
> Incorrect ORDER BY after UNION ONSCHEMA. Pig handles Long atom as chararray
> ---------------------------------------------------------------------------
>
>                 Key: PIG-3402
>                 URL: https://issues.apache.org/jira/browse/PIG-3402
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Sergey
>
> Here is a part of script:
> {code}
> lastEndPoints24h = LOAD '$lastEndPoints24h' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
> describe lastEndPoints24h;
> dump lastEndPoints24h;
> lastEndPoints24hProj = FOREACH lastEndPoints24h GENERATE msisdn, toLong((chararray)ts) as ts:long,
>     center_lon, center_lat,
>     lac, cid, lon, lat, cell_type, is_active, azimuth, hpbw, max_dist,
>     tile_id, zone_col, zone_row,
>     is_end_point, end_point_type;
> describe lastEndPoints24hProj;
> dump lastEndPoints24hProj;
> unionOfPivotsAndLastEndPoints = UNION ONSCHEMA validPivotsProj, lastEndPoints24hProj;
> describe unionOfPivotsAndLastEndPoints;
> dump unionOfPivotsAndLastEndPoints;
> groupedValidPivots = GROUP unionOfPivotsAndLastEndPoints BY msisdn;
> pivotsWithEndPoints = FOREACH groupedValidPivots {
>                 ordered = ORDER unionOfPivotsAndLastEndPoints BY ts;
> {code}
> The problem is that unionOfPivotsAndLastEndPoints is not sorted correctly.
> It looks like Pig assumes that the ts field is a chararray.
> Here are the dumps and schemas of the relations:
> {code}
> lastEndPoints24h: {msisdn: long,ts: long,center_lon: double,center_lat: 
> double,lac: int,cid: int,lon: double,lat: double,cell_type: 
> chararray,is_active: boolean,azimuth: int,hpbw: int,max_dist: int,tile_id: 
> int,zone_col: int,zone_row: int,is_end_point: boolean,end_point_type: 
> chararray}
> --dump
> (79263332100,1374521131,37.553441893272755,55.880436657140294,7712,24316,37.5473,55.8792,OUTDOOR,true,75,60,1102,49646,469,410,true,JITTER_START)
> {code}
> {code}
> lastEndPoints24hProj: {msisdn: long,ts: long,center_lon: double,center_lat: 
> double,lac: int,cid: int,lon: double,lat: double,cell_type: 
> chararray,is_active: boolean,azimuth: int,hpbw: int,max_dist: int,tile_id: 
> int,zone_col: int,zone_row: int,is_end_point: boolean,end_point_type: 
> chararray}
> (79263332100,1374521131,37.553441893272755,55.880436657140294,7712,24316,37.5473,55.8792,OUTDOOR,true,75,60,1102,49646,469,410,true,JITTER_START)
> {code}
> {code}
> unionOfPivotsAndLastEndPoints: {msisdn: long,ts: long,lac: int,cid: int,lon: 
> double,lat: double,azimuth: int,hpbw: int,max_dist: int,cell_type: 
> chararray,branch_id: int,center_lon: double,center_lat: double,tile_id: 
> int,zone_col: int,zone_row: int,is_active: boolean,is_end_point: 
> boolean,end_point_type: chararray}
> --union dump:
> (79263332100,1374529463,7712,5258,37.5564,55.8845,210,60,765,OUTDOOR,5145,37.55330379777028,55.881137048806984,49646,469,410,true,,)
> (79263332100,1374550275,7712,24316,37.5473,55.8792,75,60,1102,OUTDOOR,5145,37.55614891372749,55.88052982685867,49646,471,410,true,,)
> --more lines here...
> --the last one came from projection lastEndPoints24hProj
> (79263332100,1374521131,7712,24316,37.5473,55.8792,75,60,1102,OUTDOOR,,37.553441893272755,55.880436657140294,49646,469,410,true,true,JITTER_START)
> {code}
> Everything looks OK so far, but it is not.
> Here is the input to the UDF after ORDER BY:
> {code}
> --a part of code
> groupedValidPivots = GROUP unionOfPivotsAndLastEndPoints BY msisdn;
> pivotsWithEndPoints = FOREACH groupedValidPivots {
>                 ordered = ORDER unionOfPivotsAndLastEndPoints BY ts;
>                 GENERATE FLATTEN(udf.mark_end_points(ordered, 'ts:1, lac:2, cid:3, is_end_point:17, lon:4, lat:5, azimuth:6, hpbw:7, max_dist:8'))
> {code}
> The ordered projection as printed from the UDF:
> {code}
> ITERATE PIVOTS: 0 ) (79263332100L, 1374529463, 7712, 5258, 37.5564, 55.8845, 
> 210, 60, 765, u'OUTDOOR', 5145, 37.55330379777028, 55.881137048806984, 49646, 
> 469, 410, True, None, None)
> --more lines here...
> ITERATE PIVOTS: 22 ) (79263332100L, 1374521131L, 7712, 24316, 37.5473, 
> 55.8792, 75, 60, 1102, u'OUTDOOR', None, 37.553441893272755, 
> 55.880436657140294, 49646, 469, 410, True, True, u'JITTER_START')
> {code}
> Note that 1374521131L has the "L" suffix and 1374529463 does not (both are
> ts atom values).
> Also, 1374529463 > 1374521131, but the tuple with ts=1374521131L is at the
> end of the list. It looks like sorting was applied to ts:chararray, not to
> ts:long.
> It's weird. :(

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
