Get Current Block or Split ID, and using it, the Block Path
Hi,

Is it possible to get the 'id' of the currently executing split or block from within the mapper? Using this block ID / split ID, I want to be able to query the NameNode to get the names of the hosts holding that block / split, and the actual path to the data. I need this for some analytics that I'm doing.

Is there a client API that allows doing this? If not, what's the best way to do this?

Best,
Deepak Nettem
Re: Get Current Block or Split ID, and using it, the Block Path
I think if you called getInputFormat on JobConf and then called getSplits, you would at least get the locations. http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/InputSplit.html

On Sun, Apr 8, 2012 at 9:16 AM, Deepak Nettem deepaknet...@gmail.com wrote:
> Is it possible to get the 'id' of the currently executing split or block from within the mapper?
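A minimal sketch of that suggestion, using the old "mapred" API from the linked javadoc: enumerate the job's input splits up front and print each split's host locations. The input path here is hypothetical, and this must run against a Hadoop installation (it is not standalone).

```java
// Sketch: list each input split's host locations via
// JobConf.getInputFormat() -> InputFormat.getSplits() -> InputSplit.getLocations().
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

public class SplitLocations {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();
    // Hypothetical input path for illustration.
    FileInputFormat.setInputPaths(conf, new Path("/user/deepak/input"));
    InputFormat<?, ?> format = conf.getInputFormat();
    // The numSplits argument (1 here) is only a hint; the format may
    // return more or fewer splits than requested.
    for (InputSplit split : format.getSplits(conf, 1)) {
      System.out.println(split + " -> hosts: "
          + String.join(",", split.getLocations()));
    }
  }
}
```

Note this tells you which hosts serve each split, not which block IDs back it.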
Re: Get Current Block or Split ID, and using it, the Block Path
Deepak,

On Sun, Apr 8, 2012 at 9:46 PM, Deepak Nettem deepaknet...@gmail.com wrote:
> Hi, Is it possible to get the 'id' of the currently executing split or block from within the mapper? Using this block ID / split ID, I want to be able to query the NameNode to get the names of the hosts holding that block / split, and the actual path to the data.

You can get the list of host locations for the current Mapper's split via: https://gist.github.com/2339170 (or generally, from a FileSystem object, via https://gist.github.com/2339181)

You can't get block IDs via any publicly supported API, so consider getting the local block file path unavailable as well.

> I need this for some analytics that I'm doing. Is there a client API that allows doing this? If not, what's the best way to do this?

There are some ways to go about it (I wouldn't consider it impossible), but I'm curious what your 'analytics' is and how it correlates with needing block IDs and actual block file paths - your problem may also be solvable by other, already-available means.

--
Harsh J
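The FileSystem route Harsh mentions can be sketched as follows: ask the NameNode (through the FileSystem client) for the block locations of a file. This shows hosts and offsets per block, but, as Harsh says, not block IDs. The file path is hypothetical, and the program needs a Hadoop deployment to run.

```java
// Sketch: query block locations for a file via
// FileSystem.getFileBlockLocations(FileStatus, offset, length).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockHosts {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical file path for illustration.
    Path file = new Path("/user/deepak/input/part-00000");
    FileStatus status = fs.getFileStatus(file);
    // One BlockLocation per block; each lists the DataNode hosts
    // holding a replica, plus the block's offset and length in the file.
    for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("offset=" + loc.getOffset()
          + " len=" + loc.getLength()
          + " hosts=" + String.join(",", loc.getHosts()));
    }
  }
}
```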
Re: Get Current Block or Split ID, and using it, the Block Path
I have a related question about blocks. Normally, a reduce job outputs several files, all in the same directory. But why? Since we know that Hadoop is abstracting our file for us, shouldn't the part-r-* outputs ultimately be thought of as a single file? What is the correspondence between the part-r-00000, part-r-00001, ... outputs from a reducer and the native blocks stored by HDFS (if any)?

Jay Vyas
MMSB
UCHC

On Apr 8, 2012, at 2:00 PM, Harsh J ha...@cloudera.com wrote:
> You can't get block IDs via any publicly supported API, so consider getting the local block file path unavailable as well.
Re: Get Current Block or Split ID, and using it, the Block Path
Hi,

The "part" in the default filename stands for partition. In some cases, I agree, you would not mind viewing them as a single file instead of having to read directories - but there are also use cases where you would want each partition file to stay distinct, because you partitioned and processed them that way.

In any case, because HDFS lists files in sorted order, you can use the fs -getmerge shell command to get them out as one file, if that suits your application. There is also an on-HDFS concat feature in Hadoop 2.x (formerly 0.23.x).

On Mon, Apr 9, 2012 at 2:04 AM, JAX jayunit...@gmail.com wrote:
> What is the correspondence between the part-r-00000, part-r-00001, ... outputs from a reducer and the native blocks stored by HDFS (if any)?

--
Harsh J
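The getmerge step can also be done programmatically; a sketch, assuming FileUtil.copyMerge from the Hadoop fs package, which merges a directory's files (in sorted order, so part-r-* order is preserved) into a single destination file. Both paths are hypothetical, and a Hadoop deployment is required.

```java
// Sketch: merge a job's part-r-* output files into one file,
// the programmatic equivalent of "hadoop fs -getmerge".
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeOutput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical source directory and destination file.
    FileUtil.copyMerge(fs, new Path("/user/jay/job-output"),
                       fs, new Path("/user/jay/merged.txt"),
                       false /* keep the source files */, conf,
                       null /* no string appended between files */);
  }
}
```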