[ https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pradeep Kamath updated PIG-733:
-------------------------------

    Attachment: PIG-733-v2.patch

> Order by sampling dumps entire sample to hdfs which causes dfs "FileSystem
> closed" error on large input
> -------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-733
>                 URL: https://issues.apache.org/jira/browse/PIG-733
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.0
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>             Fix For: 0.3.0
>
>         Attachments: PIG-733-v2.patch, PIG-733.patch
>
>
> Order by has a sampling job which samples the input and creates a sorted list
> of sample items. Currently 100 items are sampled per map task, so if the
> input is large enough to produce many maps (say 50,000), the sample is big.
> This sorted sample is stored on dfs. The WeightedRangePartitioner computes
> quantile boundaries and weighted probabilities for repeating values in each
> map by reading the samples file from DFS. In queries with many maps (on the
> order of 50,000), the dfs read of the sample file fails with a "FileSystem
> closed" error. This seems to point to a dfs issue wherein a big dfs file
> being read simultaneously by many dfs clients (in this case, all the maps)
> causes the clients to be closed. Regardless, on the pig side, loading the
> sample in each map of the final map reduce job and computing the quantile
> boundaries and weighted probabilities is inefficient. We should do this
> computation through a FindQuantiles udf in the same map reduce job which
> produces the sorted samples. This way less data is written to dfs, and in the
> final map reduce job the WeightedRangePartitioner needs only to load the
> precomputed information.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
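The quantile-boundary step described in the issue can be sketched roughly as follows. This is an illustrative simplification, not Pig's actual FindQuantiles udf or WeightedRangePartitioner; the class and method names are hypothetical, and it ignores the weighted-probability handling for repeating values:

```java
import java.util.Arrays;

// Illustrative sketch: given the sorted sample produced by the sampling job,
// pick (numPartitions - 1) boundary values at evenly spaced positions so the
// partitioner in the final map reduce job can route each key to a reducer.
public class QuantileSketch {

    // Hypothetical helper (not Pig's FindQuantiles): returns the
    // numPartitions - 1 quantile boundaries from the sorted sample.
    public static int[] findQuantiles(int[] sortedSample, int numPartitions) {
        int n = sortedSample.length;
        int[] boundaries = new int[numPartitions - 1];
        for (int i = 0; i < numPartitions - 1; i++) {
            // Evenly spaced index into the sorted sample.
            boundaries[i] = sortedSample[(i + 1) * n / numPartitions];
        }
        return boundaries;
    }

    public static void main(String[] args) {
        int[] sample = {1, 2, 3, 4, 5, 6, 7, 8};
        // 4 partitions -> 3 boundaries: keys <= 3 go to reducer 0, etc.
        System.out.println(Arrays.toString(findQuantiles(sample, 4)));
        // prints [3, 5, 7]
    }
}
```

Under the proposed fix, a computation like this runs once, in the same job that produces the sorted samples, so only the small boundary array (rather than the full sample) is written to dfs for the final job's partitioner to load.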