[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-07-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886556#action_12886556
 ] 

Hudson commented on HDFS-1140:
--

Integrated in Hadoop-Hdfs-trunk-Commit #334 (See 
[http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/334/])
HDFS-1140. Speedup INode.getPathComponents. Contributed by Dmytro Molkov.


> Speedup INode.getPathComponents
> ---
>
> Key: HDFS-1140
> URL: https://issues.apache.org/jira/browse/HDFS-1140
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.22.0
>Reporter: Dmytro Molkov
>Assignee: Dmytro Molkov
>Priority: Minor
> Fix For: 0.22.0
>
> Attachments: HDFS-1140.2.patch, HDFS-1140.3.patch, HDFS-1140.4.patch, 
> HDFS-1140.patch
>
>
> When the namenode is loading the image there is a significant amount of time 
> being spent in the DFSUtil.string2Bytes. We have a very specific workload 
> here. The path that namenode does getPathComponents for shares N - 1 
> component with the previous path this method was called for (assuming current 
> path has N components).
> Hence we can improve the image load time by caching the result of previous 
> conversion.
> We thought of using some simple LRU cache for components, but the reality is, 
> String.getBytes gets optimized during runtime and LRU cache doesn't perform 
> as well, however using just the latest path components and their translation 
> to bytes in two arrays gives quite a performance boost.
> I could get another 20% off of the time to load the image on our cluster (30 
> seconds vs 24) and I wrote a simple benchmark that tests performance with and 
> without caching.
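
The caching idea described above can be sketched roughly as follows. This is a hypothetical illustration, not the code from the HDFS-1140 patches: the class name `LastPathCache`, its fields, and the conversion counter are all assumptions made for the example. It keeps only the most recent path in two parallel arrays (component names and their byte translations) and reuses the bytes for the leading components that match.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a "last path only" cache; not the actual HDFS code. */
final class LastPathCache {
  // Two parallel arrays: names of the previous path and their UTF-8 bytes.
  private String[] lastNames = new String[0];
  private byte[][] lastBytes = new byte[0][];
  private int conversions = 0; // counts real String -> byte[] conversions

  /** Splits a '/'-separated path and converts each component to bytes. */
  byte[][] getPathComponents(String path) {
    String trimmed = path.startsWith("/") ? path.substring(1) : path;
    String[] names = trimmed.isEmpty() ? new String[0] : trimmed.split("/");
    byte[][] bytes = new byte[names.length][];
    boolean prefixMatches = true;
    for (int i = 0; i < names.length; i++) {
      // Reuse cached bytes only while the leading components are identical.
      prefixMatches = prefixMatches
          && i < lastNames.length
          && names[i].equals(lastNames[i]);
      if (prefixMatches) {
        bytes[i] = lastBytes[i];
      } else {
        bytes[i] = names[i].getBytes(StandardCharsets.UTF_8);
        conversions++;
      }
    }
    lastNames = names;
    lastBytes = bytes;
    return bytes;
  }

  int conversions() {
    return conversions;
  }
}
```

With paths that share N - 1 leading components with their predecessor, only the last component is converted on each call. Note that the returned arrays are shared between calls in this sketch, so callers would have to treat them as read-only.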

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-07-08 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886528#action_12886528
 ] 

Konstantin Shvachko commented on HDFS-1140:
---

I filed HDFS-1284 and HDFS-1285 to address two other test failures. I checked 
the javaDoc warnings locally and don't see anything related to this jira.




[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-07-07 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886202#action_12886202
 ] 

Todd Lipcon commented on HDFS-1140:
---

I opened HDFS-1286. Unfortunately I'm leaving for a vacation on Tuesday and am 
pretty booked between now and then, so may not get to it until the end of this 
month. Thanks Konstantin.




[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-07-07 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886194#action_12886194
 ] 

Konstantin Shvachko commented on HDFS-1140:
---

Todd, thanks for looking. 
[Here|http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/423/testReport/]
 is another report that also failed the same test cases. I don't see it now on 
my box either, but you can check the logs to understand what is going on, and 
probably model. The message "The directory is already locked." means that the 
previous DN is still running or did not release the lock on the directory.
If we could isolate TestFileAppend4 into a separate jira, then this one can be 
closed.




[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-07-07 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886177#action_12886177
 ] 

Todd Lipcon commented on HDFS-1140:
---

Hmm, TestFileAppend4 passes for me on a trunk checkout. It seems like some test 
that ran prior to it didn't close resources properly? Does it fail on your 
machine?




[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-07-07 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886125#action_12886125
 ] 

Konstantin Shvachko commented on HDFS-1140:
---

I am not sure. My guess is based on this 
[log|http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/421/artifact/trunk/patchprocess/patchJavadocWarnings.txt].
I am saying somebody should investigate, if possible.




[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-07-07 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886103#action_12886103
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-1140:
--

> I believe this because org.apache.hadoop.hdfs.util.GSet links java.util.Map 
> and Set in its javaDocs.  ...

Konstantin, are you sure?  The GSet patch was committed in May and there is no 
javadoc warning for quite a few Hudson builds, e.g. 
[this|https://issues.apache.org/jira/browse/HDFS-1258?focusedCommentId=12883441&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12883441],
 
[this|https://issues.apache.org/jira/browse/HDFS-1093?focusedCommentId=12884884&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12884884]
 and 
[this|https://issues.apache.org/jira/browse/HDFS-1093?focusedCommentId=12884883&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12884883].




[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-07-07 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886088#action_12886088
 ] 

Konstantin Shvachko commented on HDFS-1140:
---

Dmytro.
There is one javaDoc warning saying 
{code}[javadoc] javadoc: warning - Error fetching URL: 
http://java.sun.com/javase/6/docs/api/package-list{code}

I believe this is because {{org.apache.hadoop.hdfs.util.GSet}} links 
{{java.util.Map}} and {{Set}} in its javaDocs, maybe because they are not 
imported in the module. This is definitely not related to your patch.

There are 4 test failures:
* org.apache.hadoop.hdfs.TestFileAppend4.testRecoverFinalizedBlock
* org.apache.hadoop.hdfs.TestFileAppend4.testCompleteOtherLeaseHoldersFile
* org.apache.hadoop.hdfs.security.token.block.TestBlockToken.testBlockTokenRpc
* org.apache.hadoop.hdfs.server.common.TestJspHelper.testGetUgi

I see TestBlockToken and TestJspHelper failing on Hudson from time to time. We 
should file a jira to fix them.
I don't see anybody reporting failure of TestFileAppend4. Maybe Todd should 
look at it, as it seems the failures are due to not closing or freeing some 
resources.
Could you please investigate these issues?




[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-07-06 Thread Dmytro Molkov (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885636#action_12885636
 ] 

Dmytro Molkov commented on HDFS-1140:
-

My patch doesn't have anything to do with all of those -1s.
I reran hudsonQA locally:

[exec] +1 overall.  
[exec] 
[exec] +1 @author.  The patch does not contain any @author tags.
[exec] 
[exec] +1 tests included.  The patch appears to include 3 new or modified 
tests.
[exec] 
[exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
[exec] 
[exec] +1 javac.  The applied patch does not increase the total number of 
javac compiler warnings.
[exec] 
[exec] +1 findbugs.  The patch does not introduce any new Findbugs warnings.
[exec] 
[exec] +1 release audit.  The applied patch does not increase the total 
number of release audit warnings.





[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-07-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884832#action_12884832
 ] 

Hadoop QA commented on HDFS-1140:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448256/HDFS-1140.4.patch
  against trunk revision 959874.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated 1 warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/421/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/421/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/421/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/421/console

This message is automatically generated.




[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-07-02 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884781#action_12884781
 ] 

Konstantin Shvachko commented on HDFS-1140:
---

+1. The patch looks good.
I will commit it.

> Speedup INode.getPathComponents
> ---
>
> Key: HDFS-1140
> URL: https://issues.apache.org/jira/browse/HDFS-1140
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Dmytro Molkov
>Assignee: Dmytro Molkov
>Priority: Minor
> Attachments: HDFS-1140.2.patch, HDFS-1140.3.patch, HDFS-1140.4.patch, 
> HDFS-1140.patch
>
>



[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-06-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883109#action_12883109
 ] 

Hadoop QA commented on HDFS-1140:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12447511/HDFS-1140.3.patch
  against trunk revision 957669.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 3 new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/200/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/200/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/200/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/200/console

This message is automatically generated.

> Speedup INode.getPathComponents
> ---
>
> Key: HDFS-1140
> URL: https://issues.apache.org/jira/browse/HDFS-1140
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Dmytro Molkov
>Assignee: Dmytro Molkov
>Priority: Minor
> Attachments: HDFS-1140.2.patch, HDFS-1140.3.patch, HDFS-1140.patch
>
>


[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-06-22 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881498#action_12881498
 ] 

Konstantin Shvachko commented on HDFS-1140:
---

Some review comments:
# {{FSImage.isParent(String, String)}} is not used, please remove.
# Could you please add separators between the methods and javaDoc descriptions 
for the new methods if possible.
# {{INode.getPathFromComponents()}} should be {{DFSUtil.byteArray2String()}}.
# {{TestPathComponents}} should use junit 4 style rather than junit 3.
# I'd advise to reuse {{U_STR}} instead of allocating {{DeprecatedUTF8 buff}} 
directly in {{FSImage.loadFSImage()}}. 
In order to do that you can provide a convenience method similar to 
{{readString()}} or {{readBytes()}}:
{code}
static byte[][] readPathComponents(DataInputStream in) throws IOException {
  U_STR.readFields(in);
  return DFSUtil.bytes2byteArray(U_STR.getBytes(), U_STR.getLength(),
                                 (byte)Path.SEPARATOR_CHAR);
}
{code}
The idea was to remove DeprecatedUTF8 at some point, so it is better to keep 
this stuff in one place right after the declaration of U_STR.
# It does not look like {{FSDirectory.addToParent(String src ...)}} is used 
anywhere anymore. Could you please verify and remove it if so.
# Same with {{INodeDirectory.addToParent(String path, ...)}} - can we eliminate 
it too?

> Speedup INode.getPathComponents
> ---
>
> Key: HDFS-1140
> URL: https://issues.apache.org/jira/browse/HDFS-1140
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Dmytro Molkov
>Assignee: Dmytro Molkov
>Priority: Minor
> Attachments: HDFS-1140.2.patch, HDFS-1140.3.patch, HDFS-1140.patch
>
>


[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-05-26 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872041#action_12872041
 ] 

Eli Collins commented on HDFS-1140:
---

Makes sense. I was mostly curious to get your thoughts on how hard it would be 
to use byte[] throughout. It's probably not worth refactoring the code to have 
eg INode#name be an index into a path byte[].

> Speedup INode.getPathComponents
> ---
>
> Key: HDFS-1140
> URL: https://issues.apache.org/jira/browse/HDFS-1140
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Dmytro Molkov
>Assignee: Dmytro Molkov
> Attachments: HDFS-1140.2.patch, HDFS-1140.patch
>
>


[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-05-26 Thread Dmytro Molkov (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871984#action_12871984
 ] 

Dmytro Molkov commented on HDFS-1140:
-

Eli, you are right, this patch already moves us from more or less user-friendly 
String passing to byte[][] passing. However, I do not really see how we can avoid 
those copies. The first one is due to the nature of Writable: if you do not 
copy the bytes, the array you end up with can be a combination of the path 
currently read and, at the end of the array, bytes left over from an earlier read. 
You could probably extend bytes2byteArray to take an offset and a length into 
the given byte array and perform the split on that range.
The second copy is also kind of unavoidable (or I do not know a good way to do 
it) since we need to end up with a byte[][]. The problem with using a single byte[] 
lies in how we traverse the tree of directories to find the INode the path 
points to. Eventually, when you call INodeDirectory.getChildINode, you need to 
have the byte[] representation of the name of the child you are looking for.
Right now, as far as I understand, every piece of code inside the NameNode 
relies on the byte[][] representation of the path, where each element is 
the byte[] representation of an INode name. I am not sure how we can fix this.
I can look into making bytes2byteArray more flexible to get rid of one 
byte[] copy.

Does all of this make sense? I will make other changes shortly.
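
A minimal sketch of the more flexible split mentioned above, assuming a hypothetical helper named splitPath that takes an offset and a length into a reused buffer. It is not the bytes2byteArray from the patch; the signature and the choice to skip empty components are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PathSplitSketch {
    /**
     * Hypothetical helper (not from the patch): splits the region
     * buf[offset, offset + length) on the separator byte, skipping
     * empty components. Bytes outside the region are never read.
     */
    static byte[][] splitPath(byte[] buf, int offset, int length, byte separator) {
        List<byte[]> parts = new ArrayList<>();
        int end = offset + length;
        int start = offset;
        for (int i = offset; i <= end; i++) {
            if (i == end || buf[i] == separator) {
                if (i > start) {
                    // One copy per component; the buffer itself is not copied first.
                    parts.add(Arrays.copyOfRange(buf, start, i));
                }
                start = i + 1;
            }
        }
        return parts.toArray(new byte[0][]);
    }

    public static void main(String[] args) {
        // The trailing "XXXX" stands in for stale bytes from an earlier read.
        byte[] buf = "/user/dmytro/fileXXXX".getBytes(StandardCharsets.UTF_8);
        byte[][] comps = splitPath(buf, 0, buf.length - 4, (byte) '/');
        for (byte[] c : comps) {
            System.out.println(new String(c, StandardCharsets.UTF_8));
        }
    }
}
```

With an explicit length, the stale tail of the reused buffer is ignored, so the first of the two copies discussed above would no longer be needed; the per-component copy remains.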

> Speedup INode.getPathComponents
> ---
>
> Key: HDFS-1140
> URL: https://issues.apache.org/jira/browse/HDFS-1140
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Dmytro Molkov
>Assignee: Dmytro Molkov
> Attachments: HDFS-1140.2.patch, HDFS-1140.patch




[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-05-26 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871865#action_12871865
 ] 

Eli Collins commented on HDFS-1140:
---

Hey Dmytro,  

Definitely an improvement. I noticed there's still a lot of copying going on: 
readBytes copies the string's bytes to a byte array, then bytes2byteArray copies 
this byte array into another byte array (it's hard for bytes2byteArray to use 
readBytes without copying). Would it make sense to go whole hog and just use the 
byte[] representation of a path internally? I understand that's a large change, 
but it would remove a bunch of copies, and since this change is all about using 
a less user-friendly abstraction in the name of reducing overhead, it might be 
worth considering.

* Do we need to add the new addToParent to preserve the old String-based API?  
Would be nice to have FSImage use a single representation of a path.

* bytes2byteArray could use a javadoc. 

* Adding and using the following helper function as you've done with isParent 
would help readability.  
{{boolean isRoot(byte[][] pathComp) { return pathComp.length == 1 && 
pathComp[0].length == 0; }}}
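
For illustration, the suggested helper could be exercised as below. The isRoot body is taken from the comment as written; the wrapper class IsRootSketch and the sample inputs are hypothetical, and the assumption that the root path is represented as a single empty component comes from that definition rather than from the patch.

```java
import java.nio.charset.StandardCharsets;

public class IsRootSketch {
    /** True when the components describe the root: exactly one, empty, component. */
    static boolean isRoot(byte[][] pathComp) {
        return pathComp.length == 1 && pathComp[0].length == 0;
    }

    public static void main(String[] args) {
        byte[][] root = { new byte[0] };
        byte[][] nested = { new byte[0], "user".getBytes(StandardCharsets.UTF_8) };
        System.out.println(isRoot(root));   // prints true
        System.out.println(isRoot(nested)); // prints false
    }
}
```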



> Speedup INode.getPathComponents
> ---
>
> Key: HDFS-1140
> URL: https://issues.apache.org/jira/browse/HDFS-1140
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Dmytro Molkov
>Assignee: Dmytro Molkov
> Attachments: HDFS-1140.2.patch, HDFS-1140.patch




[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-05-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869495#action_12869495
 ] 

Hadoop QA commented on HDFS-1140:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/1264/HDFS-1140.2.patch
  against trunk revision 946488.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 3 new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/178/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/178/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/178/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/178/console

This message is automatically generated.

> Speedup INode.getPathComponents
> ---
>
> Key: HDFS-1140
> URL: https://issues.apache.org/jira/browse/HDFS-1140
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Dmytro Molkov
>Assignee: Dmytro Molkov
> Attachments: HDFS-1140.2.patch, HDFS-1140.patch




[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-05-13 Thread Dmytro Molkov (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867364#action_12867364
 ] 

Dmytro Molkov commented on HDFS-1140:
-

@Hairong. Well, I agree with you that the conversion to String is currently 
unnecessary. I guess I was trying to make the argument that the format of the 
path in the image and the format of the path in memory could potentially 
differ, if someone changes one of them. In that case having a String 
representation in the middle might simplify things.
Anyway, since the byte representation is currently the same, it does make sense 
to operate on the byte arrays right from the start.
Please see the attached patch. It doesn't convert the bytes read to a String and 
introduces a code path to insert a node based on the byte[][] representation 
right from the start. Let me know if you have further comments.

> Speedup INode.getPathComponents
> ---
>
> Key: HDFS-1140
> URL: https://issues.apache.org/jira/browse/HDFS-1140
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Dmytro Molkov
>Assignee: Dmytro Molkov
> Attachments: HDFS-1140.2.patch, HDFS-1140.patch




[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-05-13 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867190#action_12867190
 ] 

Hairong Kuang commented on HDFS-1140:
-

> two different abstractions that are communicating via String paths 
No, I do not think so. What matters is the encoding. The goal is to convert 
what's on disk to the Java UTF8 encoding as stored in memory. Currently it happens 
that what's on disk uses the same encoding as what's in memory. Why bother 
converting to String first?
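
The point about encoding can be illustrated with a small sketch. It assumes, as stated in the comment, that the on-disk and in-memory forms use the same encoding, and it uses standard UTF-8 for the demonstration; the class RoundTripSketch and the method viaString are hypothetical names, not part of Hadoop.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripSketch {
    /** Decodes the bytes to a String and encodes them again, as a String-based load path would. */
    static byte[] viaString(byte[] onDisk) {
        return new String(onDisk, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] onDisk = "/user/dmytro/data".getBytes(StandardCharsets.UTF_8);
        // When both sides share the encoding, the detour through String
        // yields the same bytes it started with.
        System.out.println(Arrays.equals(onDisk, viaString(onDisk))); // prints true
    }
}
```

If the two encodings ever diverged, the conversion step would have to come back in some form, which is the trade-off discussed elsewhere in this thread.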

> Speedup INode.getPathComponents
> ---
>
> Key: HDFS-1140
> URL: https://issues.apache.org/jira/browse/HDFS-1140
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Dmytro Molkov
>Assignee: Dmytro Molkov
> Attachments: HDFS-1140.patch




[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-05-11 Thread Dmytro Molkov (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866242#action_12866242
 ] 

Dmytro Molkov commented on HDFS-1140:
-

Well, I completely agree now, but since storing the image file on disk and 
storing the in-memory state are currently kind of two different abstractions that 
are communicating via String paths, this seems like a clean way to get some 
performance improvement.

> Speedup INode.getPathComponents
> ---
>
> Key: HDFS-1140
> URL: https://issues.apache.org/jira/browse/HDFS-1140
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Dmytro Molkov
>Assignee: Dmytro Molkov
> Attachments: HDFS-1140.patch




[jira] Commented: (HDFS-1140) Speedup INode.getPathComponents

2010-05-11 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866224#action_12866224
 ] 

Hairong Kuang commented on HDFS-1140:
-

All paths are stored as bytes in memory. In theory, we do not need to convert 
bytes to string and then to bytes when loading fsimage.  But this needs a lot 
of re-organization of our code.

> Speedup INode.getPathComponents
> ---
>
> Key: HDFS-1140
> URL: https://issues.apache.org/jira/browse/HDFS-1140
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Dmytro Molkov
>Assignee: Dmytro Molkov
> Attachments: HDFS-1140.patch
