[jira] [Commented] (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

2011-04-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016927#comment-13016927
 ] 

Hudson commented on MAPREDUCE-1752:
---

Integrated in Hadoop-Mapreduce-trunk #643 (See 
[https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk/643/])


> Implement getFileBlockLocations in HarFilesystem
> 
>
> Key: MAPREDUCE-1752
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1752
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: harchive
>Reporter: Dmytro Molkov
>Assignee: Dmytro Molkov
> Fix For: 0.23.0
>
> Attachments: MAPREDUCE-1752.2.patch, MAPREDUCE-1752.3.patch, 
> MR-1752.patch
>
>
> To efficiently run map reduce on the data that has been HAR'ed it will be 
> great to actually implement getFileBlockLocations for a given filename.
> This way the JobTracker will have information about data locality and will 
> schedule tasks appropriately.
> I believe the overhead introduced by doing lookups in the index files can be 
> smaller than that of copying data over the wire.
> Will upload the patch shortly, but would love to get some feedback on this. 
> And any ideas on how to test it are very welcome.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

2010-12-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966835#action_12966835
 ] 

Hudson commented on MAPREDUCE-1752:
---

Integrated in Hadoop-Mapreduce-trunk-Commit #557 (See 
[https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/557/])
MAPREDUCE-1752. Implement getFileBlockLocations in HarFilesystem.
(Patrick Kling via dhruba)



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

2010-11-23 Thread Dmytro Molkov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935104#action_12935104
 ] 

Dmytro Molkov commented on MAPREDUCE-1752:
--

The patch looks good to me. +1
The main question, I guess, is whether anyone has objections to the approach 
in general.

> Implement getFileBlockLocations in HarFilesystem
> 
>
> Key: MAPREDUCE-1752
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1752
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: harchive
>Reporter: Dmytro Molkov
>Assignee: Dmytro Molkov
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1752.2.patch, MAPREDUCE-1752.3.patch, 
> MR-1752.patch
>
>
> To efficiently run map reduce on the data that has been HAR'ed it will be 
> great to actually implement getFileBlockLocations for a given filename.
> This way the JobTracker will have information about data locality and will 
> schedule tasks appropriately.
> I believe the overhead introduced by doing lookups in the index files can be 
> smaller than that of copying data over the wire.
> Will upload the patch shortly, but would love to get some feedback on this. 
> And any ideas on how to test it are very welcome.




[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

2010-11-22 Thread Patrick Kling (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934635#action_12934635
 ] 

Patrick Kling commented on MAPREDUCE-1752:
--

Mahadev/Nicholas, could one of you please have a look at this patch?

ant test-patch results:
{code}
 [exec] +1 overall.
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 6 new or modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the total number of release audit warnings.
 [exec] 
 [exec] +1 system test framework.  The patch passed system test framework compile.
 [exec]
{code}




[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

2010-11-01 Thread Dmytro Molkov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927186#action_12927186
 ] 

Dmytro Molkov commented on MAPREDUCE-1752:
--

In the second case it would of course be b1=.

I personally like the second way of fixing it more, since it gives predictable 
offsets. For the file f the block locations would start with offset 0 and the 
total length would sum up to the total length of the file. The problem with it 
might be that the block location of the first block will have length different 
from the actual block length in this file.
The way block locations are returned currently, each of them except for the 
last one has the length of the block and starts at an offset that is a 
multiple of the block length. And even when I call getBlockLocations with an 
offset and length different from 0 and status.getLength(), I am not guaranteed 
to get a result where the sum of the lengths equals the requested length and 
the smallest block offset equals the requested offset.

That said, I think the second approach fits better into this system, unless 
having blocks of different lengths would be a problem.
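
To illustrate the current behaviour described above (whole blocks, aligned to multiples of the block length, regardless of the requested range), here is a minimal Java sketch. WholeBlockLookup and blocksCovering are hypothetical names, not Hadoop APIs, and the file length (1300) and block size (512) are made-up numbers chosen only for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not Hadoop code. */
final class WholeBlockLookup {

    /**
     * Returns {offset, length} pairs for every block that overlaps the
     * requested range. Blocks are returned whole, so the first offset may be
     * smaller than the requested offset and the lengths may sum to more than
     * the requested length. Assumes offset >= 0 and blockSize > 0.
     */
    static List<long[]> blocksCovering(long fileLen, long blockSize,
                                       long offset, long length) {
        List<long[]> out = new ArrayList<>();
        long end = Math.min(offset + length, fileLen);
        // Start at the block boundary at or below the requested offset.
        for (long b = (offset / blockSize) * blockSize; b < end; b += blockSize) {
            // Only the last block of the file may be shorter than blockSize.
            out.add(new long[] {b, Math.min(blockSize, fileLen - b)});
        }
        return out;
    }

    public static void main(String[] args) {
        // Ask for 100 bytes starting at offset 600 of a 1300-byte file.
        for (long[] r : blocksCovering(1300, 512, 600, 100)) {
            System.out.println("offset=" + r[0] + " length=" + r[1]);
        }
    }
}
```

With these numbers the sketch reports a single block at offset 512 with length 512, even though the request was for offset 600 and length 100.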

> Implement getFileBlockLocations in HarFilesystem
> 
>
> Key: MAPREDUCE-1752
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1752
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: harchive
>Reporter: Dmytro Molkov
>Assignee: Dmytro Molkov
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1752.2.patch, MR-1752.patch
>
>
> To efficiently run map reduce on the data that has been HAR'ed it will be 
> great to actually implement getFileBlockLocations for a given filename.
> This way the JobTracker will have information about data locality and will 
> schedule tasks appropriately.
> I believe the overhead introduced by doing lookups in the index files can be 
> smaller than that of copying data over the wire.
> Will upload the patch shortly, but would love to get some feedback on this. 
> And any ideas on how to test it are very welcome.




[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

2010-11-01 Thread Patrick Kling (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927178#action_12927178
 ] 

Patrick Kling commented on MAPREDUCE-1752:
--

There is something really strange about the semantics of the offsets and 
lengths returned by this. Consider the following part file consisting of 3 
blocks containing a file f starting at offset 896 with length 512:

{code}
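+-----------------------+
| ...                   |
+-----------------------+
0

+-----------------+-----+
| ...             | f   |
+-----------------+-----+
512               896

+-----------------+-----+
| f               | ... |
+-----------------+-----+
1024              1408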
{code}

Calling getFileBlockLocations on this file will return 2 LocatedBlocks: 
b1=, b2=. This indicates that b1 
contains the first 512 bytes of the block, even though in fact it only contains 
the first 128 bytes. This is a problem when the client uses these LocatedBlocks 
to detect whether a portion of f has been corrupted.

I can think of 2 possible ways of fixing this:

1) Fix the offset of the returned blocks by subtracting hstatus.getStartIndex() 
(i.e., the offset of f in the part file) from the block offset. This would 
return b1= and b2=, indicating 
to the client that the first 384 bytes of b1 are not part of f and correctly 
indicating the length of each block. In a way, this is similar to how 
FSNamesystem.getBlockLocations returns entire blocks even if the caller asks 
for a range that covers only part of these blocks.

2) Fix the length on the first block returned to reflect the portion of f that 
is contained in this block, i.e., return b1=, 
b2=. This seems somewhat less clean to me but avoids 
negative offsets. Also, it would break the convention that all blocks of a file 
with the exception of the last block are the same length.
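
The arithmetic behind the two options can be sketched as follows, under one reading of the proposals. HarBlockRanges, shiftedBlocks, and clippedBlocks are hypothetical names that do not come from the patch, and the block size of 512 is inferred from the block boundaries in the diagram above (0, 512, 1024). The sketch also assumes every block of the part file is full-sized.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for comparing the two options; not part of the patch. */
final class HarBlockRanges {

    /** Option 1: keep whole blocks and subtract the file's start offset. */
    static List<long[]> shiftedBlocks(long fileStart, long fileLen, long blockSize) {
        List<long[]> out = new ArrayList<>();
        long fileEnd = fileStart + fileLen;
        // First part-file block that contains any byte of the file.
        long first = (fileStart / blockSize) * blockSize;
        for (long b = first; b < fileEnd; b += blockSize) {
            // The offset goes negative when the block starts before the file.
            out.add(new long[] {b - fileStart, blockSize});
        }
        return out;
    }

    /** Option 2: clip each block to the bytes of the file it actually holds. */
    static List<long[]> clippedBlocks(long fileStart, long fileLen, long blockSize) {
        List<long[]> out = new ArrayList<>();
        long fileEnd = fileStart + fileLen;
        long first = (fileStart / blockSize) * blockSize;
        for (long b = first; b < fileEnd; b += blockSize) {
            long from = Math.max(b, fileStart);
            long to = Math.min(b + blockSize, fileEnd);
            out.add(new long[] {from - fileStart, to - from});
        }
        return out;
    }

    public static void main(String[] args) {
        // f starts at offset 896 in the part file and is 512 bytes long.
        for (long[] r : shiftedBlocks(896, 512, 512)) {
            System.out.println("option 1: offset=" + r[0] + " length=" + r[1]);
        }
        for (long[] r : clippedBlocks(896, 512, 512)) {
            System.out.println("option 2: offset=" + r[0] + " length=" + r[1]);
        }
    }
}
```

For the example file this gives offsets -384 and 128 with length 512 each under option 1, and (offset 0, length 128) followed by (offset 128, length 384) under option 2, so only option 2 has lengths that sum to the 512 bytes of f.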




[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

2010-05-27 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872668#action_12872668
 ] 

Tsz Wo (Nicholas), SZE commented on MAPREDUCE-1752:
---

> I guess my approach was making it right and then looking at the ways we can 
> optimize it rather than trying to hack up a fast solution right from the 
> start.
> Do you have any other ideas that may be worth exploring?

Yes, your approach totally makes sense.  A potential improvement would be 
caching the masterIndex, archiveIndex and all the file statuses, since a client 
calls getBlockLocation(..) multiple times when submitting a job.
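
A minimal sketch of the caching idea, assuming the parsed index data for an archive can be produced by a single loader function. HarIndexCache, its loader, and the string stand-in for the parsed index are all hypothetical; a real implementation would hold whatever HarFileSystem parses from masterIndex and archiveIndex.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical per-archive cache; V stands in for the parsed index data. */
final class HarIndexCache<V> {
    private final Map<String, V> cache = new ConcurrentHashMap<>();
    private final Function<String, V> loader;
    private final AtomicInteger loads = new AtomicInteger();

    HarIndexCache(Function<String, V> loader) {
        this.loader = loader;
    }

    /** Loads the index for an archive on first use and reuses it afterwards. */
    V get(String archivePath) {
        return cache.computeIfAbsent(archivePath, p -> {
            loads.incrementAndGet(); // counts how often the expensive read runs
            return loader.apply(p);
        });
    }

    int loadCount() {
        return loads.get();
    }

    public static void main(String[] args) {
        // The loader here is a placeholder for reading and parsing the index files.
        HarIndexCache<String> cache = new HarIndexCache<>(p -> "index of " + p);
        for (int i = 0; i < 3; i++) {
            cache.get("/user/example/data.har"); // hypothetical archive path
        }
        System.out.println("loads=" + cache.loadCount());
    }
}
```

Repeated lookups against the same archive then pay for the index reads once. The sketch leaves out invalidation, which should matter less for archives that no longer change.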

> Implement getFileBlockLocations in HarFilesystem
> 
>
> Key: MAPREDUCE-1752
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1752
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: harchive
>Reporter: Dmytro Molkov
>Assignee: Dmytro Molkov
> Fix For: 0.22.0
>
> Attachments: MR-1752.patch
>
>
> To efficiently run map reduce on the data that has been HAR'ed it will be 
> great to actually implement getFileBlockLocations for a given filename.
> This way the JobTracker will have information about data locality and will 
> schedule tasks appropriately.
> I believe the overhead introduced by doing lookups in the index files can be 
> smaller than that of copying data over the wire.
> Will upload the patch shortly, but would love to get some feedback on this. 
> And any ideas on how to test it are very welcome.




[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

2010-05-27 Thread Rodrigo Schmidt (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872392#action_12872392
 ] 

Rodrigo Schmidt commented on MAPREDUCE-1752:


I've been following this discussion.

I think Dmytro's idea makes a lot of sense, especially for big jobs that read 
from big files. In such cases, the performance gains from having local reads 
would easily compensate for the extra delay at setup time.

The idea behind it is to use files stored in hadoop archives as input for 
mapreduce jobs. I don't think this method will be used elsewhere.

Using har to store mapreduce files that are stable (won't change anymore) but 
still necessary for read queries is a huge win for the namenode scalability, 
since it reduces the number of objects it has to store in memory.




[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

2010-05-27 Thread Dmytro Molkov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872372#action_12872372
 ] 

Dmytro Molkov commented on MAPREDUCE-1752:
--

Nicholas, I do see that this approach is somewhat expensive. However it gives 
us the locality when we are running a job.
And this time will only be added to the job setup time, right?

I guess my approach was making it right and then looking at the ways we can 
optimize it rather than trying to hack up a fast solution right from the start.
Do you have any other ideas that may be worth exploring?




[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

2010-05-26 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871891#action_12871891
 ] 

Tsz Wo (Nicholas), SZE commented on MAPREDUCE-1752:
---

Also, the approach is quite expensive.  It requires
# read masterIndex
#- fs.open(masterIndex)
#- fs.getFileStatus(masterIndex)
#- read from datanode
# read archiveIndex
#- fs.open(archiveIndex)
#- read from datanode
# fs.getFileStatus(part)

> Implement getFileBlockLocations in HarFilesystem
> 
>
> Key: MAPREDUCE-1752
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1752
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Dmytro Molkov
>Assignee: Dmytro Molkov
> Fix For: 0.22.0
>
> Attachments: MR-1752.patch
>
>
> To efficiently run map reduce on the data that has been HAR'ed it will be 
> great to actually implement getFileBlockLocations for a given filename.
> This way the JobTracker will have information about data locality and will 
> schedule tasks appropriately.
> I believe the overhead introduced by doing lookups in the index files can be 
> smaller than that of copying data over the wire.
> Will upload the patch shortly, but would love to get some feedback on this. 
> And any ideas on how to test it are very welcome.




[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

2010-05-26 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871871#action_12871871
 ] 

Tsz Wo (Nicholas), SZE commented on MAPREDUCE-1752:
---

Dmytro, the patch does not compile.




[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

2010-05-24 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870770#action_12870770
 ] 

Tsz Wo (Nicholas), SZE commented on MAPREDUCE-1752:
---

Hi Dmytro, are you still working on this?  Will you upload a patch soon?

> Implement getFileBlockLocations in HarFilesystem
> 
>
> Key: MAPREDUCE-1752
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1752
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Dmytro Molkov
>Assignee: Dmytro Molkov
> Fix For: 0.22.0
>
>
> To efficiently run map reduce on the data that has been HAR'ed it will be 
> great to actually implement getFileBlockLocations for a given filename.
> This way the JobTracker will have information about data locality and will 
> schedule tasks appropriately.
> I believe the overhead introduced by doing lookups in the index files can be 
> smaller than that of copying data over the wire.
> Will upload the patch shortly, but would love to get some feedback on this. 
> And any ideas on how to test it are very welcome.




[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

2010-05-04 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864148#action_12864148
 ] 

dhruba borthakur commented on MAPREDUCE-1752:
-

Sounds like a good idea. +1

The idea is to make the contents of a Har file work well with FileInputFormat 
or CombineFileInputFormat, isn't it? In that case, you can look at 
TestCombineFileInputFormat and see if you can extend it to test the case when 
the input file(s) are har files.

> Implement getFileBlockLocations in HarFilesystem
> 
>
> Key: MAPREDUCE-1752
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1752
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Dmytro Molkov
>
> To efficiently run map reduce on the data that has been HAR'ed it will be 
> great to actually implement getFileBlockLocations for a given filename.
> This way the JobTracker will have information about data locality and will 
> schedule tasks appropriately.
> I believe the overhead introduced by doing lookups in the index files can be 
> smaller than that of copying data over the wire.
> Will upload the patch shortly, but would love to get some feedback on this. 
> And any ideas on how to test it are very welcome.
