[
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788971#action_12788971
]
Sriranjan Manjunath commented on PIG-1143:
------------------------------------------
To describe the problem in more detail, the current implementation does not
handle a glob efficiently. When the sample loader encounters a directory (or
combinations thereof), it gets the element descriptors of all the files inside
the directory to compute the file sizes.
For example, A = load "{view, click}" results in computing the file sizes of all
the files underneath both the "view" and "click" directories. With a large
number of mappers, this generates a very large number of HDFS calls, clogging
the NameNode.
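To give a rough sense of scale, here is a back-of-the-envelope sketch. The file and mapper counts are made-up illustrative numbers, and it assumes one size lookup per file for each mapper that recomputes the sample count. The class and method names are hypothetical and are not part of Pig.

```java
// Illustrative estimate only; not Pig code. All numbers are assumptions.
class FileSizeCallEstimate {
    // Each mapper that recomputes the sample count looks up every file once.
    static long calls(long mappersComputing, long filesUnderGlob) {
        return Math.multiplyExact(mappersComputing, filesUnderGlob);
    }

    public static void main(String[] args) {
        long files = 10_000 + 10_000; // assumed: files under "view" plus "click"
        long mappers = 500;           // assumed number of map tasks
        System.out.println("every mapper computes: " + calls(mappers, files));
        System.out.println("computed only once:    " + calls(1, files));
    }
}
```

With these assumed numbers, the lookups drop from 10,000,000 to 20,000 when only one task does the computation.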
I intend to modify Poisson Sample Loader as follows. The algorithm for
computing the total number of samples remains the same, but it will no longer
be run by every mapper. I will use the UDFContext object to share this
information across mappers. Since mappers and reducers can only read
information from UDFContext, the slicer will store it. As before, PigSlice will
call computeSamples() for the first map; the slicer will then store the
resulting sample count as a property in the UDFContext object. For later maps,
the slicer will check UDFContext to see whether this value is set and, if it
is, use it instead of computing it again. I intend to use
"pig.input.0.sampleCount" as the key.
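The compute-once idea can be sketched roughly as below. This is not the actual patch: a plain java.util.Properties object stands in for the properties held by UDFContext, and the class name SampleCountCache, the method getOrCompute, and the LongSupplier parameter are hypothetical names used only for illustration. Only the key name comes from the proposal above.

```java
import java.util.Properties;
import java.util.function.LongSupplier;

// Hypothetical stand-in for the slicer-side logic; not Pig's actual code.
class SampleCountCache {
    // Key proposed in this comment.
    static final String SAMPLE_COUNT_KEY = "pig.input.0.sampleCount";

    // Returns the stored sample count if the key is set; otherwise runs the
    // expensive computation once and stores the result under the key.
    static long getOrCompute(Properties props, LongSupplier computeSamples) {
        String cached = props.getProperty(SAMPLE_COUNT_KEY);
        if (cached != null) {
            return Long.parseLong(cached);
        }
        long count = computeSamples.getAsLong();
        props.setProperty(SAMPLE_COUNT_KEY, Long.toString(count));
        return count;
    }

    public static void main(String[] args) {
        Properties props = new Properties(); // stands in for UDFContext properties
        int[] invocations = {0};
        // Placeholder for computeSamples(); the returned value is arbitrary.
        LongSupplier expensive = () -> { invocations[0]++; return 1234L; };

        for (int map = 0; map < 5; map++) {
            getOrCompute(props, expensive);
        }
        System.out.println("computeSamples() invoked " + invocations[0] + " time(s)");
    }
}
```

In this sketch only the first call pays for the computation. Every later call reads the stored value, which is the behaviour the slicer is meant to have.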
This solution will reduce the fileSize() invocations to a minimum and should
reduce the burden on the name node.
> Poisson Sample Loader should compute the number of samples required only once
> -----------------------------------------------------------------------------
>
> Key: PIG-1143
> URL: https://issues.apache.org/jira/browse/PIG-1143
> Project: Pig
> Issue Type: Bug
> Reporter: Sriranjan Manjunath
> Assignee: Sriranjan Manjunath
>
> The current poisson sampler forces each of the maps to compute the sample
> number. This is redundant and causes issues when a large directory is
> specified in the join. The sampler should be changed to calculate the sample
> count only once and this information should be shared with the remaining
> mappers.