Hello,
I am a beginner with the Hadoop framework, and I am trying to create a
distributed crawling application. I have googled a lot, but the resources
are scarce. Could anyone please help me with the following questions?

1. I want to control where data lands on the datanodes' local file
systems. Suppose I have crawled sites A and B. Is it somehow possible,
using the Hadoop API, to control which datanode stores which site — for
example, site A on datanode 1 and site B on datanode 2, or any placement
I choose?
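For reference, this is roughly how I am writing crawled pages into HDFS
at the moment (a minimal sketch, not a working crawler — the hostname,
port, and paths below are placeholders, and it assumes the Hadoop client
jars are on the classpath):

```java
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CrawlStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the namenode; "master" and port 9000 are placeholders.
        FileSystem fs = FileSystem.get(URI.create("hdfs://master:9000"), conf);

        // Write one crawled page. Note that HDFS itself chooses which
        // datanodes receive the blocks; I have not found a public API
        // at this level to pin a file to a particular datanode.
        Path out = new Path("/crawl/siteA/page1.html");
        OutputStream os = fs.create(out);
        try {
            os.write("<html>...</html>".getBytes("UTF-8"));
        } finally {
            os.close();
        }
        fs.close();
    }
}
```

So my question is whether anything at or below this API level lets me
influence block placement per file.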

2. When I write a MapReduce job for Lucene indexing, if a map task on
datanode 1 requires data stored on datanode 2, will all of that data
travel through the master node? Since I access files via
hdfs://master:port, does that mean all data is exchanged through the
master node?

3. How can I make sure that a map task (such as Lucene indexing of
crawled data) runs on the datanode that actually holds the data? (Maybe
I am not explaining this well: I really don't want datanode 2, which
stores site B, to end up indexing site A, which is stored on datanode 1,
because that would consume a lot of network traffic.)


I would appreciate a reply as early as possible.

Thanks
Burhan Uddin

Student
Department of Computer Science & Engineering
Shahjalal University of Science & Technology
Bangladesh
