Hi,
How about this? The large model data stays in HDFS, but with many replicas, and the MapReduce program reads the model from HDFS. In theory, the replication factor of the model data equals the number of data nodes, so with the Short-Circuit Local Reads feature of the HDFS datanode, the map or reduce tasks read the model from their local disk.
In my suggestion, map or reduce tasks do not use the distributed cache. They use the file directly from HDFS with short-circuit local reads. It is like a shared-storage method, but almost every node holds the data because of the high replication factor.
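
For concreteness, a minimal sketch of the client-side settings this relies on. The domain socket path is an assumption; it must match whatever the datanodes have in their hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ShortCircuitConf {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Enable short-circuit local reads on the client side.
        conf.setBoolean("dfs.client.read.shortcircuit", true);
        // Assumed socket path -- must match the datanodes' hdfs-site.xml.
        conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");
        FileSystem fs = FileSystem.get(conf);
        System.out.println("short-circuit reads configured for " + fs.getUri());
    }
}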
Drake 민영근 Ph.D
On Wed, Jan 21, 2015 at 1:49 PM, unmesha sreeveni unmeshab...@gmail.com wrote:
Yes, almost the same. I assume the most time-consuming part was copying the model data from the datanode that holds it to the actual processing node (tasktracker or nodemanager).
What about the model data's replication factor? How many nodes do you have? If you have 4 or more nodes, you can increase the replication factor.
I have 4 nodes and the replication factor is set to 3.
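
If you go that way, raising the replication is one call; a sketch, with a hypothetical model path (the shell equivalent is "hdfs dfs -setrep 4 /user/hadoop/knn/model"):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RaiseReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // 4 nodes in the cluster, so put a copy of the model on each of them.
        fs.setReplication(new Path("/user/hadoop/knn/model"), (short) 4);
    }
}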
On Wed, Jan 21, 2015 at 11:15 AM, Drake민영근 drake@nexr.com wrote:
Yes, I tried the same, Drake.
I don't know if I understood your answer correctly.
Instead of loading the model into setup() through the cache, I read it directly from HDFS in the map section, and for each incoming record I found the distance to all the records in HDFS.
That is, if R and S are my datasets, R is the model.
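
Roughly what this looks like, as a sketch; the knn.model.path job property and the comma-separated record format are my assumptions, not the actual code from this thread.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KnnMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final List<double[]> model = new ArrayList<double[]>();

    @Override
    protected void setup(Context context) throws IOException {
        // Read the model (R) straight from HDFS; with a high replication
        // factor plus short-circuit reads this should be a local read.
        Path modelPath = new Path(context.getConfiguration().get("knn.model.path"));
        FileSystem fs = modelPath.getFileSystem(context.getConfiguration());
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(modelPath)));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                model.add(parse(line));
            }
        } finally {
            in.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // For each incoming S record, find its distance to every R record.
        double[] s = parse(value.toString());
        double best = Double.MAX_VALUE;
        for (double[] r : model) {
            best = Math.min(best, euclidean(r, s));
        }
        context.write(value, new Text(Double.toString(best)));
    }

    private static double[] parse(String line) {
        String[] fields = line.split(",");
        double[] v = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            v[i] = Double.parseDouble(fields[i]);
        }
        return v;
    }

    private static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }
}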
Is there any way..
Waiting for a reply. I have posted the question everywhere, but no one is responding.
I feel this is the right place to ask doubts, as some of you may have come across the same issue and gotten stuck.
On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni unmeshab...@gmail.com
wrote:
In a KNN-like algorithm we need to load the model data into the cache for predicting the records.
Here is the example for KNN.
[image: Inline image 1]
So if the model is a large file, say 1 or 2 GB, will we be able to load it into the distributed cache?
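
For reference, the distributed-cache route looks roughly like the sketch below; the KnnDriver/KnnMapper names and the paths are placeholders. Note that shipping a 1-2 GB model this way means every node localizes its own copy, which is exactly the copying cost discussed above.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KnnDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "knn");
        job.setJarByClass(KnnDriver.class);
        job.setMapperClass(KnnMapper.class);
        job.setNumReduceTasks(0); // map-only scoring job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Ship the model to every task; a cache-based mapper would then
        // open the local "model" symlink in setup() instead of HDFS.
        job.addCacheFile(new URI("/user/hadoop/knn/model#model"));
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}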
One way is to split/partition the model.
Yes, one of my friends is implementing the same. I know that global sharing of data is not possible across Hadoop MapReduce, but I need to check whether it can be done somehow in Hadoop MapReduce, because I found some papers on KNN in Hadoop as well.
And I am trying to compare the performance too.
Hope some of you can help.
Have you considered implementing it with something like Spark? That could be much easier than raw MapReduce.
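
A rough sketch of that idea, using a Spark broadcast variable as the analogue of the distributed cache; paths and the parsing are assumptions, not this thread's code.

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class KnnSpark {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("knn"));
        // Collect the model (R) once and broadcast it to every executor,
        // instead of copying it for every task.
        List<String> model = sc.textFile("/user/hadoop/knn/model").collect();
        Broadcast<List<String>> bModel = sc.broadcast(model);
        // For each record in S, find its minimum distance to the model.
        JavaRDD<String> result = sc.textFile(args[0])
                .map(rec -> rec + "\t" + minDistance(rec, bModel.value()));
        result.saveAsTextFile(args[1]);
        sc.stop();
    }

    private static double minDistance(String rec, List<String> model) {
        double[] s = parse(rec);
        double best = Double.MAX_VALUE;
        for (String line : model) {
            double[] r = parse(line);
            double sum = 0;
            for (int i = 0; i < r.length; i++) {
                sum += (r[i] - s[i]) * (r[i] - s[i]);
            }
            best = Math.min(best, Math.sqrt(sum));
        }
        return best;
    }

    private static double[] parse(String line) {
        String[] fields = line.split(",");
        double[] v = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            v[i] = Double.parseDouble(fields[i]);
        }
        return v;
    }
}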
On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni unmeshab...@gmail.com
wrote: