[ 
https://issues.apache.org/jira/browse/PIG-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857699#comment-13857699
 ] 

Rohini Palaniswamy commented on PIG-3645:
-----------------------------------------

> UUID returns takes up 36 chars, while random we are only using 10 chars. 
> Would prefer shorter filenames to take up less space in NN and prevent GCs
  This is actually a bigger concern for me as intermediate jobs will create lot 
of temporary files under the temporary directory with MR when there are lot of 
maps/reduces. Since these will be short lived files and will be deleted soon 
will be a pressure on NN if they have long paths. We already have serious 
problems with NN performance on big clusters with heavy usage and hadoop team 
has been squeezing optimizations in cutting down on the number of calls MR 
framework makes. 

> Replace Random with UUID in FileLocalizer.getTemporaryPath()
> ------------------------------------------------------------
>
>                 Key: PIG-3645
>                 URL: https://issues.apache.org/jira/browse/PIG-3645
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>            Priority: Minor
>             Fix For: 0.13.0
>
>         Attachments: PIG-3645-1.patch
>
>
> Currently, temporary paths are generated by FileLocalizer using 
> Random.nextInt(). To provide strong randomness, MapReduceLauncher resets the 
> Random object every time when compiling physical plan to MR plan:
> {code}
> MRCompiler comp = new MRCompiler(php, pc); 
> comp.randomizeFileLocalizer(); // This in turn calls FileLocalizer.setR(new 
> Random()).
> {code}
> Besides, there are a couple of places calling FileLocalizer.setR() (e.g. 
> MRCompiler) with some random seed.
> I think-
> # Randomizing Random seed is unnecessary if we switch to UUID.
> # Setting Random objects in code like this is error-prone because it can be 
> easily broken by having or missing a FileLocalizer.setR() somewhere else. See 
> an example [here|http://search-hadoop.com/m/2nxTzQXfHw1].
> So I propose that we remove all this "randomizing Random seed" code and use 
> UUID instead in temporary paths.
> For unit tests that compare the results against gold files, we should still 
> allow to set Random seed through FileLocalizer.setR(). But this method will 
> be annotated as "VisibleForTesting" to ensure it is not used nowhere else 
> other than in unit tests.
> Regarding the existing gold files, they can be easily regenerated by 
> TestMRCompiler as follows-
> {code}
> FileOutputStream fos = new FileOutputStream(expectedFile + "_new");
> PrintWriter pw = new PrintWriter(fos);
> pw.write(compiledPlan);
> {code}
> I assume there won't be any kind of regressions due to this change. But 
> please let me know if I am wrong.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Reply via email to