[
https://issues.apache.org/jira/browse/PIG-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447354#comment-13447354
]
Dmitriy V. Ryaboy commented on PIG-2661:
----------------------------------------
Ok, some fresh thoughts rolling in after sleeping on this.
Why do we have this foreach in the first place? It's inserted to achieve the
following goals:
* pad nulls (in PIG-2824, Jie saw perf problems from that, and I suggested we
get rid of the foreach altogether, getting POLoad to do the null padding
instead).
* coerce tuples generated by the loader into schemas specified in the "load
as.." statement
* drop unneeded columns
(please let me know if this list is incomplete)
For padding nulls, I believe we can achieve the same effect much more cheaply,
and without the side effect that's biting us here, by making basic
modifications to POLoad.
For coercing into schemas, we can do the same thing -- copy all the fields from
the incoming tuple (including excess ones), and only convert the ones we know
something about. This can also be done directly in POLoad, and only needs to be
triggered if the loader doesn't already tell us the schema it's returning, or
if the schemas don't match type-wise.
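To make the padding and coercion idea concrete, here is a minimal sketch in
plain Java. It is not Pig code and does not touch POLoad: tuples are modeled as
List<Object>, and LoadCoercionSketch, FieldType, and padAndCoerce are
hypothetical names made up for illustration. It also assumes that a value which
fails to parse becomes null.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: pad and coerce a loaded tuple in a single pass,
// without a separate foreach. Not part of Pig.
class LoadCoercionSketch {
    // Stand-in for the types declared in a "load ... as" clause.
    enum FieldType { INT, LONG, CHARARRAY }

    // Convert one value to the declared type.
    // Assumption: unparseable values become null.
    static Object convert(Object value, FieldType type) {
        if (value == null) {
            return null;
        }
        String s = value.toString().trim();
        try {
            switch (type) {
                case INT:
                    return Integer.valueOf(s);
                case LONG:
                    return Long.valueOf(s);
                default:
                    return value.toString();
            }
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Copy every incoming field, including excess ones, converting only
    // the fields the declared schema knows about. Fields that are declared
    // but missing from the input are padded with null.
    static List<Object> padAndCoerce(List<Object> in, List<FieldType> declared) {
        int size = Math.max(in.size(), declared.size());
        List<Object> out = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            Object v = i < in.size() ? in.get(i) : null;
            out.add(i < declared.size() ? convert(v, declared.get(i)) : v);
        }
        return out;
    }
}
```

With a declared schema of (INT, CHARARRAY), a one-field input is padded to two
fields, while a three-field input keeps its third field untouched. That second
property is what would let extra columns pass through unharmed.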
This leaves dropping columns. Since in that case the whole point is to not
carry along unwanted columns, this use case is clearly in conflict with the way
the PoissonSampleLoader wants to work, which is to insert extra columns and
sneak them through to the UDF linked to it. Moreover, if we take the plan that
sits between the load and the skewed join and put it between the sample loader
and the GetMemNumRows UDF, other things may also break the sampling -- for
example, filters that accidentally filter out the specially marked tuples.
This is telling us that messing with the tuples PSL returns is problematic.
What if instead we created a UDF that was fed all the tuples from a regular
loader, with the rest of the pipeline that gets inserted, but was able to
signal to its consumers when it's done -- thus effectively recreating
PoissonSampleLoader's functionality in addition to GetMemNumRows? It would
output sample tuples or nulls, and we can add a null filter right above it. I
believe that gives us everything we are looking for and simplifies the pipeline
a fair bit. We'd have to add the capability for UDFs to early-terminate, of
course. That's already been done for Accumulative UDFs in PIG-2066, and I think
it should be straightforward to do for regular UDFs.
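The shape of such a UDF could look roughly like the sketch below. It is plain
Java, not Pig code: SamplingUdfSketch, exec, isFinished, and run are
hypothetical names, tuples are modeled as List<Object>, and simple Bernoulli
sampling with a fixed rate stands in for the real Poisson sampling logic. It
does not attempt the memory estimation that GetMemNumRows performs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of a sampling UDF that emits sample tuples or nulls
// and can tell its consumer when it is done. Not part of Pig.
class SamplingUdfSketch {
    private final Random rng;
    private final double sampleRate;
    private final int maxSamples;
    private int emitted = 0;
    private long seen = 0;

    SamplingUdfSketch(double sampleRate, int maxSamples, long seed) {
        this.sampleRate = sampleRate;
        this.maxSamples = maxSamples;
        this.rng = new Random(seed);
    }

    // Returns the tuple if it is chosen as a sample, otherwise null.
    List<Object> exec(List<Object> tuple) {
        if (isFinished()) {
            return null;
        }
        seen++;
        if (rng.nextDouble() < sampleRate) {
            emitted++;
            return tuple;
        }
        return null;
    }

    // Early-termination signal: true once enough samples were emitted.
    boolean isFinished() {
        return emitted >= maxSamples;
    }

    long rowsSeen() {
        return seen;
    }

    // Stand-in for the consuming pipeline: stop feeding the UDF once it
    // reports it is finished, and drop nulls (the "null filter").
    static List<List<Object>> run(Iterable<List<Object>> input, SamplingUdfSketch udf) {
        List<List<Object>> samples = new ArrayList<>();
        for (List<Object> tuple : input) {
            if (udf.isFinished()) {
                break;
            }
            List<Object> result = udf.exec(tuple);
            if (result != null) {
                samples.add(result);
            }
        }
        return samples;
    }
}
```

Because the UDF sees ordinary tuples from an ordinary loader, whatever sits
upstream of it (null padding, coercion, column pruning, filters) no longer has
to preserve specially marked tuples.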
Thoughts?
> Pig uses an extra job for loading data in Pigmix L9
> ---------------------------------------------------
>
> Key: PIG-2661
> URL: https://issues.apache.org/jira/browse/PIG-2661
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.9.0
> Reporter: Jie Li
> Assignee: Jie Li
> Attachments: PIG-2661.0.patch, PIG-2661.1.patch, PIG-2661.2.patch,
> PIG-2661.3.patch, PIG-2661.4.patch, PIG-2661.5.patch, PIG-2661.6.patch,
> PIG-2661.7.patch, PIG-2661.8.patch, PIG-2661.plan.txt
>
>
> See
> https://issues.apache.org/jira/browse/PIG-200?focusedCommentId=13260155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13260155