[jira] [Commented] (PIG-2661) Pig uses an extra job for loading data in Pigmix L9

Dmitriy V. Ryaboy (JIRA) Mon, 03 Sep 2012 10:11:10 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447354#comment-13447354
 ]


Dmitriy V. Ryaboy commented on PIG-2661:
----------------------------------------

Ok, some fresh thoughts rolling in after sleeping on this.

Why do we have this foreach in the first place? It's inserted to achieve the 
following goals:
* pad nulls (in PIG-2824, Jie saw perf problems from that, and I suggested we 
get rid of the foreach altogether, getting POLoad to do the null padding 
instead).
* coerce tuples generated by the loader into schemas specified in the "load 
as.." statement
* drop unneeded columns

(please let me know if this list is incomplete)

For padding nulls, I believe we can achieve the same effect much more cheaply, 
and without the side effect that's biting us here, by making basic 
modifications to POLoad.

For coercing into schemas, we can do the same thing -- copy all the fields from 
the incoming tuple (including excess ones), and only convert the ones we know 
something about. This can also be done directly in POLoad, and only be 
triggered if the loader doesn't already tell us what the schema is it's 
returning, or the schemas don't match type-wise.

This leaves dropping columns. Since in that case the whole point is to not 
carry along unwanted columns, this use case is clearly in conflict with the way 
the PoissonSampleLoader wants to work, by inserting extra columns and sneaking 
them through to the UDF linked to it. Moreover, if we go the route of putting 
the plan between load and skewed join between the sample loader and the 
GetMemNumRows UDF, other things may also break the sampling -- for example, 
filters that happen to filter out the specially marked tuples, by accident. 
This is telling us that messing with the tuples PSL returns is problematic. 
What if instead we created a UDF that was fed all the tuples from a regular 
loader, with the rest of the pipeline that gets inserted, but was able to 
signal to its consumers when it's done -- thus effectively recreating 
PoissonSampleLoader's functionality in addition to GetMemNumRows ? It would 
output sample tuples or nulls, and we can add a null filter right above it. I 
believe that gives us everything we are looking for and simplifies the pipeline 
a fair bit.  We'd have to add capability for UDFs to early-terminate, of 
course. That's already been done for Accumulative UDFs in PIG-2066 and I think 
should be straightforward to do for regular UDFs.

Thoughts?
                
> Pig uses an extra job for loading data in Pigmix L9
> ---------------------------------------------------
>
>                 Key: PIG-2661
>                 URL: https://issues.apache.org/jira/browse/PIG-2661
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Jie Li
>            Assignee: Jie Li
>         Attachments: PIG-2661.0.patch, PIG-2661.1.patch, PIG-2661.2.patch, 
> PIG-2661.3.patch, PIG-2661.4.patch, PIG-2661.5.patch, PIG-2661.6.patch, 
> PIG-2661.7.patch, PIG-2661.8.patch, PIG-2661.plan.txt
>
>
> See 
> https://issues.apache.org/jira/browse/PIG-200?focusedCommentId=13260155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13260155

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-2661) Pig uses an extra job for loading data in Pigmix L9

Reply via email to