Hello, I am a beginner with the Hadoop framework. I am trying to create a distributed crawling application. I have googled a lot, but the resources I found are scarce. Can anyone please help me with the following topics?
1. I want to access the local file system of a datanode. Suppose I have crawled sites A and B. Is it possible, using the Hadoop API, to control which datanode is used to store which data? For example, I want to store site A on datanode 1 and site B on datanode 2, or wherever I choose.

2. When I write a MapReduce job for Lucene indexing, if a map task on datanode 1 requires data stored on datanode 2, will all of that data pass through the master node? Since I need to access the files with hdfs://master:port, does that mean all data is exchanged through the master node?

3. How can I make sure that a map task (such as Lucene indexing on crawled data) runs on the datanode that actually holds the data? (Maybe I am not explaining this well.) I really do not want datanode 2 (which stores site B) to end up indexing site A (which is stored on datanode 1), because that would consume a lot of network traffic.

Please reply as early as possible.

Thanks,
Burhan Uddin
Student, Department of Computer Science & Engineering
Shahjalal University of Science & Technology, Bangladesh
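P.S. For context on question 2, this is roughly how I read crawled data back out of HDFS at the moment (a minimal sketch of my current approach; the NameNode address, port, and file path below are placeholders, not my real cluster settings):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadCrawledPage {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Placeholder NameNode URI -- in my code I connect via hdfs://master:port.
        FileSystem fs = FileSystem.get(URI.create("hdfs://master:9000"), conf);

        // Placeholder path to a crawled page stored in HDFS.
        Path page = new Path("/crawl/siteA/page1.html");

        // Open the file through the HDFS client and print its contents.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(page)))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

This needs a running HDFS cluster and the Hadoop jars on the classpath, so I cannot show output here. My question is whether this pattern forces the file bytes through the master node, or whether the client fetches blocks directly from the datanodes.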
