[jira] Updated: (PIG-801) Pig needs to handle scalar aliases to improve programmer and code execution efficiency

Olga Natkovich (JIRA) Fri, 09 Jul 2010 17:48:48 -0700

     [ 
https://issues.apache.org/jira/browse/PIG-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Olga Natkovich updated PIG-801:
-------------------------------

         Assignee: Aniket Mokashi
    Fix Version/s: 0.8.0

> Pig needs to handle scalar aliases to improve programmer and code execution 
> efficiency
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-801
>                 URL: https://issues.apache.org/jira/browse/PIG-801
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: David Ciemiewicz
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>
> In Pig, it is often the case that the result of an operation is a scalar 
> value that needs to be applied to the next step of processing.
> For example:
> * FILTER by MAX of group -- See: PIG-772
> * Compute proportions by dividing by total (SUM) of grouped alias
> Today, Pig programmers must go through distasteful and slow contortions, 
> using FLATTEN or CROSS to propagate the scalar value to EVERY row of data, 
> which creates needless copies of that data.  Alternatively, the user must 
> write the global sum to a file and then read it back in to regain the 
> efficiency.
> If the language were simply extended to have the notion of scalar aliases, 
> then coding would be simplified without contortions for the programmer and, I 
> believe, execution of the code would be faster too.
> For instance, to compute global proportions, I want to do the following:
> {code}
> CountryPopulations = load 'country.dat' using PigStorage()
>     as ( country: chararray, population: long );
> AllCountryPopulations = group CountryPopulations all;
> Total = foreach AllCountryPopulations generate
>     SUM(CountryPopulations.population) as population;
> PopulationProportions = foreach CountryPopulations generate
>     country, population,
>     (double)population / (double)Total.population as global_proportion;
> {code}
> One of the very distasteful workarounds for this is to do something like:
> {code}
> CountryPopulations = load 'country.dat' using PigStorage()
>     as ( country: chararray, population: long );
> AllCountryPopulations = group CountryPopulations all;
> Total = foreach AllCountryPopulations generate
>     SUM(CountryPopulations.population) as population;
> CountryPopulationsTotal = cross CountryPopulations, Total;
> PopulationProportions = foreach CountryPopulationsTotal generate
>     CountryPopulations::country,
>     CountryPopulations::population,
>     (double)CountryPopulations::population / (double)Total::population
>         as global_proportion;
> {code}
> This just makes me cringe every time I have to do it.  Constructing new rows 
> of data simply to apply the same scalar value row after row after row for 
> potentially billions of rows of data just feels horribly wrong and 
> inefficient both from the coding standpoint and from the execution 
> standpoint.
> In SQL, I'd just code this as:
> {code}
> select
>      country,
>      population,
>      cast(population as double precision) / SUM(population) over ()
> from
>      CountryPopulations;
> {code}
> In writing a SQL to Pig translator, it would seem that this construct or 
> idiom would need to be supported, so why not create a higher level of Pig 
> that supports the notion of scalars efficiently?
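> The first use case listed above (FILTER by MAX of group) could be written 
> in the same style.  This is only a sketch that assumes the scalar alias 
> syntax proposed in this issue; the file 'scores.dat' and the aliases Scores, 
> AllScores, MaxScore and TopScores are hypothetical names chosen for 
> illustration:
> {code}
> -- Hypothetical input: one (name, score) pair per row.
> Scores = load 'scores.dat' using PigStorage()
>     as ( name: chararray, score: long );
> -- Collapse all rows into a single group and compute one global maximum.
> AllScores = group Scores all;
> MaxScore = foreach AllScores generate MAX(Scores.score) as score;
> -- Assumed scalar alias: MaxScore.score is treated as a single value,
> -- so no CROSS is needed to attach it to every row.
> TopScores = filter Scores by score == MaxScore.score;
> {code}
> As with the proportions example, the scalar would be computed once and then 
> referenced by name, rather than copied onto every row first.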

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
