[ https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696416#action_12696416 ]
Giridharan Kesavan commented on PIG-733:
----------------------------------------
I'm going to resubmit the patch to Hudson.
> Order by sampling dumps entire sample to hdfs which causes dfs "FileSystem
> closed" error on large input
> -------------------------------------------------------------------------------------------------------
>
> Key: PIG-733
> URL: https://issues.apache.org/jira/browse/PIG-733
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.2.0
> Reporter: Pradeep Kamath
> Assignee: Pradeep Kamath
> Fix For: 0.3.0
>
> Attachments: PIG-733-v2.patch, PIG-733.patch
>
>
> Order by has a sampling job which samples the input and creates a sorted list
> of sample items. Currently the number of items sampled is 100 per map task,
> so if the input is large and results in many maps (say 50,000), the sample is
> big. This sorted sample is stored on DFS. The WeightedRangePartitioner
> computes quantile boundaries and weighted probabilities for repeating values
> in each map by reading the samples file from DFS. In queries with many maps
> (on the order of 50,000), the DFS read of the sample file fails with a
> "FileSystem closed" error. This seems to point to a DFS issue wherein a big
> DFS file being read simultaneously by many DFS clients (in this case, all
> maps) causes the clients to be closed. However, on the Pig side, loading the
> sample in each map of the final map reduce job and computing the quantile
> boundaries and weighted probabilities there is inefficient. We should do this
> computation through a FindQuantiles UDF in the same map reduce job that
> produces the sorted samples. This way less data is written to DFS, and in
> the final map reduce job the WeightedRangePartitioner only needs to load the
> computed information.
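
To make the proposal above concrete, here is a minimal Python sketch of the kind of computation described: deriving quantile boundaries and per-partition weights for repeating keys once, from the sorted sample, so the partitioner only has to load the result. This is not Pig's FindQuantiles or WeightedRangePartitioner code. The function names, the equal-width split of the sample, and the requirement that the sample holds at least as many items as there are partitions are all assumptions made for illustration.

```python
import random
from bisect import bisect_right
from collections import defaultdict


def find_quantiles(sorted_samples, num_partitions):
    """Hypothetical stand-in for the proposed UDF, not Pig's implementation.

    Returns (boundaries, weights):
      boundaries: the first sampled key of partitions 1..num_partitions-1.
      weights: for each key whose samples span more than one partition,
               the fraction of its samples that fell into each partition.
    """
    n = len(sorted_samples)
    if num_partitions < 1 or n < num_partitions:
        raise ValueError("need at least one sample per partition")

    counts = defaultdict(lambda: [0] * num_partitions)
    boundaries = []
    for i, key in enumerate(sorted_samples):
        # Assumption: split the sorted sample into equal-width slices.
        p = i * num_partitions // n
        if p > len(boundaries):
            # First sample seen in a new partition becomes its boundary.
            boundaries.append(key)
        counts[key][p] += 1

    weights = {}
    for key, per_part in counts.items():
        if sum(1 for c in per_part if c) > 1:
            total = sum(per_part)
            weights[key] = [c / total for c in per_part]
    return boundaries, weights


def choose_partition(key, boundaries, weights, rng=random):
    """Pick a partition using only the precomputed information."""
    if key in weights:
        # Repeating key: spread it according to its weighted probabilities.
        probs = weights[key]
        return rng.choices(range(len(probs)), weights=probs)[0]
    # Any other key: plain range lookup against the boundaries.
    return bisect_right(boundaries, key)


if __name__ == "__main__":
    boundaries, weights = find_quantiles([1, 2, 2, 2, 2, 2, 3, 4], 4)
    print(boundaries, weights)
    print(choose_partition(1, boundaries, weights))
```

In this sketch only the boundaries and the weights of keys that straddle partitions would need to be written out, which is far smaller than the full sorted sample that every map currently reads.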
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.