[jira] [Commented] (HDFS-2092) Create a light inner conf class in DFSClient

2011-06-24 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054247#comment-13054247
 ] 

Aaron T. Myers commented on HDFS-2092:
--

Note: I'm not necessarily opposed to this change, but please justify its 
usefulness. From what I can tell so far, this patch seems to be optimizing 
something that's not actually an issue.

bq. That was just a sample of measurement for a day. 

Sure, but what was it actually measuring? Increase in child heap size per task 
attempt? Increase in heap size per TT? Something else?

bq. Also, going forward, PIG 0.9 will store lots of metadata in the conf and 
also one can embed the PIG script itself in the conf.

I don't know much about Pig, but that sounds like a bad idea on its part. Maybe 
I'm wrong about that.

bq. This can potentially blow up the TT.

Can it? I've seen users have a lot of different problems with Hadoop, but Task 
Trackers falling over because of conf objects being too large isn't one I can 
recall.

bq. Since one can store anything in the job conf, we should be careful with the 
references to this object - we should not hold them for a long duration.

At most these references will be held for the lifetime of a task attempt, 
right? So not so long?

 Create a light inner conf class in DFSClient
 

 Key: HDFS-2092
 URL: https://issues.apache.org/jira/browse/HDFS-2092
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client
Affects Versions: 0.23.0
Reporter: Bharath Mundlapudi
Assignee: Bharath Mundlapudi
 Fix For: 0.23.0

 Attachments: HDFS-2092-1.patch, HDFS-2092-2.patch


 At present, DFSClient stores a reference to the configuration object. Since 
 these configuration objects can be pretty big at times, they can bloat 
 processes that have multiple DFSClient objects, such as the TaskTracker. This 
 is an attempt to remove the reference to the conf object in DFSClient. 
 This patch creates a light inner conf class and copies the required keys from 
 the Configuration object.
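
 The idea described above can be sketched roughly as follows. This is not the
 code from the attached patches: a plain Map stands in for Hadoop's
 Configuration, and the class names, key names, and default values are all
 hypothetical, chosen only to illustrate copying a few required values at
 construction time instead of keeping a reference to the whole object.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: a Map stands in for Hadoop's Configuration object. */
class LightConfSketch {

    /** Immutable snapshot of the few settings the client actually reads. */
    static final class Conf {
        // All key names and defaults below are hypothetical.
        final int writePacketSize;
        final long blockSize;
        final int socketTimeoutMs;

        Conf(Map<String, String> full) {
            writePacketSize = Integer.parseInt(
                full.getOrDefault("client.write.packet.size", "65536"));
            blockSize = Long.parseLong(
                full.getOrDefault("client.block.size", "67108864"));
            socketTimeoutMs = Integer.parseInt(
                full.getOrDefault("client.socket.timeout.ms", "60000"));
        }
    }

    /** Client that keeps only the light snapshot, never the full map. */
    static final class Client {
        final Conf conf;

        Client(Map<String, String> full) {
            // Copy the required values; do not store 'full' in a field.
            this.conf = new Conf(full);
        }
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        jobConf.put("client.block.size", "134217728");
        // Stand-in for arbitrary large data a job might put in its conf.
        jobConf.put("job.embedded.script", "a large embedded script");

        Client client = new Client(jobConf);
        // The client holds no reference to jobConf, so the caller alone
        // decides how long the full map stays reachable.
        jobConf = null;
        System.out.println("blockSize = " + client.conf.blockSize);
    }
}
```

 One consequence of this design is that the snapshot does not see later
 changes to the original configuration, which may or may not be acceptable
 depending on how callers use it.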

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2092) Create a light inner conf class in DFSClient

2011-06-24 Thread Bharath Mundlapudi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054264#comment-13054264
 ] 

Bharath Mundlapudi commented on HDFS-2092:
--

We are not concerned about the task attempt. The problem here is the Task 
Tracker's availability. The way conf was designed has its own benefits. At the 
same time it comes with some disadvantages. What if a task attempt runs for 
a day or more? This is not uncommon in our clusters.

Again, I am listing a couple of issues:
1. With UGI, a conf will be created per user in the TT. (Security folks?)
2. PIG or any other job can store arbitrary data. The Hadoop framework should 
be able to deal with it as far as it can. 
3. Last but not least, the API should not hold on to the client's data. 

As every job is different, workloads can be different too. So one can't see 
or hear about all the problems.











[jira] [Commented] (HDFS-2092) Create a light inner conf class in DFSClient

2011-06-24 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054285#comment-13054285
 ] 

Aaron T. Myers commented on HDFS-2092:
--

bq. We are not concerned about the task attempt. The problem here is for Task 
Tracker's availability.

Have you actually experienced TTs crashing because conf objects were too large? 
Or where conf objects were taking up a substantial portion of the available 
heap space?

bq. The way conf was designed has its own benefits. At the same time it comes 
with some disadvantages. What if a task attempt runs for a day or more? This 
is not uncommon in our clusters.

I would conjecture that such a task attempt is likely using many MBs or GBs of 
memory for the actual work it's doing. Is this patch which saves a few hundred 
KBs at the extreme end really going to move the needle?

bq. 1. With UGI, conf will be created per user in TT. (Security folks?)

But presumably only for every user which is concurrently running a task attempt 
on that TT, so not that many, right? Unless I'm missing something, which is 
certainly possible.

bq. 2. PIG or any other job can store arbitrary data. Hadoop framework should 
be able to deal with it as far as it can. 

No disagreement there.

bq. 3. Last but not least, API should not hold on to client's data.

I see no principled reason the DFSClient should not hold on to client's data 
in the form of the conf object. If this is actually negatively impacting 
performance or availability, then we should certainly fix that, but you haven't 
demonstrated that yet.

bq. As every job is different, workloads can be different too. So one can't 
see or hear about all the problems.

Certainly, but we can validate this issue with some testing. Can you please 
describe what you did to gather these measurements? What exactly are they 
actually measuring?

My issue here is that this change is being done purely as an optimization, but 
it's unclear to me that negative issues exist without this patch, or that this 
patch necessarily addresses those issues. If you can demonstrate those, I'll 
shut up immediately. :)





[jira] [Commented] (HDFS-2092) Create a light inner conf class in DFSClient

2011-06-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054421#comment-13054421
 ] 

Hudson commented on HDFS-2092:
--

Integrated in Hadoop-Hdfs-trunk #705 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/705/])
HDFS-2092. Remove some object references to Configuration in DFSClient.  
Contributed by Bharath Mundlapudi

szetszwo : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1139097
Files : 
* /hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/DFSOutputStream.java
* /hadoop/common/trunk/hdfs/CHANGES.txt
* /hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/DFSInputStream.java
* /hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/DFSClient.java






[jira] [Commented] (HDFS-2092) Create a light inner conf class in DFSClient

2011-06-24 Thread Bharath Mundlapudi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054632#comment-13054632
 ] 

Bharath Mundlapudi commented on HDFS-2092:
--

Todd, thanks for the reasons. 

When we say client, it can be anything, such as the TT/JT, which hold TIP/JIP. 
You are right, a client's TIP/JIP can have references to JobConf. But then the 
reference scope is decided by the client. And yes, eventually we also need to 
fix the FS cache you are referring to, if there are any leaks. 







[jira] [Commented] (HDFS-2092) Create a light inner conf class in DFSClient

2011-06-23 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054224#comment-13054224
 ] 

Aaron T. Myers commented on HDFS-2092:
--

If I read that right, we're talking about a change that at the 99th percentile 
saves at most 386 KB? I'm skeptical that those modest savings warrant this 
change.

Also, how exactly were these gains measured? In what unit can we expect these 
memory savings? i.e. per TT? per DFSClient instance?





[jira] [Commented] (HDFS-2092) Create a light inner conf class in DFSClient

2011-06-23 Thread Bharath Mundlapudi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054245#comment-13054245
 ] 

Bharath Mundlapudi commented on HDFS-2092:
--

Hi Aaron,

That was just a sample of measurements for a day. We should care about the MAX 
in this case. Also, going forward, PIG 0.9 will store lots of metadata in the 
conf, and one can also embed the PIG script itself in the conf. This can 
potentially blow up the TT. We can measure the approximate size of a conf from 
the job.xml file in the job history location. Since one can store anything in 
the job conf, we should be careful with the references to this object - we 
should not hold them for a long duration. 
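
As a rough illustration of estimating conf size from a job.xml file, here is a
minimal sketch. It assumes the usual Hadoop configuration layout of
<property> elements with <name> and <value> children, and it only counts
characters in names and values. The class and method names are hypothetical,
and the character count is only a rough proxy for heap usage, since it ignores
per-object overhead in the JVM.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Sketch only: estimates conf size from a job.xml-style document. */
class ConfSizeEstimate {

    /** Returns {property count, total characters in names and values}. */
    static long[] estimate(InputStream jobXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(jobXml);
            NodeList props = doc.getElementsByTagName("property");
            long chars = 0;
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                chars += text(p, "name").length() + text(p, "value").length();
            }
            return new long[] { props.getLength(), chars };
        } catch (Exception e) {
            throw new IllegalStateException("could not parse conf XML", e);
        }
    }

    /** Text of the first child element with the given tag, or "" if absent. */
    private static String text(Element parent, String tag) {
        NodeList n = parent.getElementsByTagName(tag);
        return n.getLength() == 0 ? "" : n.item(0).getTextContent();
    }

    public static void main(String[] args) {
        // Tiny inline sample; a real run would open the job.xml file instead.
        String xml = "<configuration>"
            + "<property><name>a.b</name><value>12345</value></property>"
            + "<property><name>c</name><value>xy</value></property>"
            + "</configuration>";
        long[] r = estimate(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(r[0] + " properties, about " + r[1] + " characters");
    }
}
```

Running this over the job.xml files of many jobs would give a distribution of
sizes, from which a maximum or a percentile could be read off.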



  
