[ https://issues.apache.org/jira/browse/PIG-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olga Natkovich resolved PIG-801. -------------------------------- Resolution: Fixed This is subset of PIG-1434 > Pig needs to handle scalar aliases to improve programmer and code execution > efficiency > -------------------------------------------------------------------------------------- > > Key: PIG-801 > URL: https://issues.apache.org/jira/browse/PIG-801 > Project: Pig > Issue Type: New Feature > Reporter: David Ciemiewicz > Assignee: Aniket Mokashi > Fix For: 0.8.0 > > > In Pig, it is often the case that the result of an operation is a scalar > value that needs to be applied to the next step of processing. > For example: > * FILTER by MAX of group -- See: PIG-772 > * Compute proportions by dividing by total (SUM) of grouped alias > Today Pig programmers need to go through distasteful and slow contortions of > using FLATTEN or CROSS to propagate the scalar computation to EVERY row of > data to perform these operations creating needless copies of data. Or, the > user must write the global sum to a file, then read it back in to gain the > efficiency. > If the language were simply extended to have the notion of scalar aliases, > then coding would be simplified without contortions for the programmer and, I > believe, execution of the code would be faster too. > For instance, to compute global proportions, I want to do the following: > {code} > CountryPopulations = load 'country.dat' using PigStorage() as ( country: > chararray, population: long ); > AllCountryPopulations= group CountryPopulations all; > Total = foreach AllCountryPopulations generate > SUM(CountryPopulations.population) as population; > PopulationProportions = foreach CountryPopulations generate > country, population, (double)population / (double)Total.population as > global_proportion; > {code} > One of the very distasteful workarounds for this is to do something like: > {code} > CountryPopulations = load 'country.dat' using PigStorage() as ( country: > chararray, population: long ); > AllCountryPopulations= group CountryPopulations all; > Total = foreach AllCountryPopulations generate > SUM(CountryPopulations.population) as population; > CountryPopulationsTotal = cross CountryPopulations, Total; > PopulationProportions = foreach CountryPopulations generate > CountryPopulations::country, > CountryPopulations::population, > (double)CountryPopulations::population / (double)Total::population as > global_proportion; > {code} > This just makes me cringe every time I have to do it. Constructing new rows > of data simply to apply > the same scalar value row after row after row for potentially billions of > rows of data just feels horribly wrong > and inefficient both from the coding standpoint and from the execution > standpoint. > In SQL, I'd just code this as: > {code} > select > country, > population, > population / SUM(population) > from > CountryPopulations; > {code} > In writing a SQL to Pig translator, it would seem that this construct or > idiom would need to be supported, so why not create a higher level of Pig > which would support the notion of scalars efficiently. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.