Yes, I tried the same, Drake. I don't know if I understood your answer correctly.
Instead of loading the model in setup() through the distributed cache, I read it directly from HDFS in the map section, and for each incoming record I computed the distance to all the records in HDFS. That is, if R and S are my datasets, where R is the model data stored in HDFS, then when S is taken for processing: S1-R (finding the distance to the whole R set), S2-R, and so on. But it is taking a long time, since the distances have to be computed against the whole of R for every record.
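For reference, a minimal sketch of what I mean, reading the model from HDFS once per task in setup() instead of re-reading it inside map() for every record. The "knn.model.path" property, the comma-separated record layout, and the Euclidean distance are assumptions:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KnnMapper extends Mapper<LongWritable, Text, Text, Text> {

  // The model set R, held in memory for the lifetime of the task.
  private final List<double[]> model = new ArrayList<double[]>();

  @Override
  protected void setup(Context context)
      throws IOException, InterruptedException {
    // Read the model from HDFS exactly once per task, not once per record.
    // "knn.model.path" is a hypothetical job property.
    Path modelPath = new Path(context.getConfiguration().get("knn.model.path"));
    FileSystem fs = modelPath.getFileSystem(context.getConfiguration());
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(modelPath)));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        model.add(parseFeatures(line));
      }
    } finally {
      reader.close();
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // One record of S against the whole in-memory R.
    double[] s = parseFeatures(value.toString());
    double best = Double.MAX_VALUE;
    for (double[] r : model) {
      best = Math.min(best, euclidean(s, r));
    }
    context.write(value, new Text(String.valueOf(best)));
  }

  // Assumes comma-separated numeric features.
  private static double[] parseFeatures(String line) {
    String[] parts = line.split(",");
    double[] v = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      v[i] = Double.parseDouble(parts[i]);
    }
    return v;
  }

  private static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }
}

This way HDFS is read only as many times as there are tasks, and the per-record work becomes a pure in-memory scan.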
On Wed, Jan 21, 2015 at 10:31 AM, Drake민영근 <drake....@nexr.com> wrote:

> In my suggestion, map or reduce tasks do not use the distributed cache. They
> use the file directly from HDFS, with short-circuit local reads. It is like a
> shared-storage method, but almost every node has the data locally because of
> the high replication factor.
>
> Drake 민영근 Ph.D
>
> On Wed, Jan 21, 2015 at 1:49 PM, unmesha sreeveni <unmeshab...@gmail.com>
> wrote:
>
>> But still, if the model is large enough, how can we load it into the
>> Distributed Cache or something like that?
>> Here is one source:
>> http://www.cs.utah.edu/~lifeifei/papers/knnslides.pdf
>> But it is confusing me.
>>
>> On Wed, Jan 21, 2015 at 7:30 AM, Drake민영근 <drake....@nexr.com> wrote:
>>
>>> Hi,
>>>
>>> How about this? The large model data stays in HDFS, but with many
>>> replications, and the MapReduce program reads the model from HDFS. In
>>> theory, the replication factor of the model data equals the number of
>>> data nodes, and with the Short Circuit Local Reads feature of the HDFS
>>> datanode, the map or reduce tasks read the model data from their own
>>> disks.
>>>
>>> This way may use a lot of HDFS storage, but the annoying partition
>>> problem will be gone.
>>>
>>> Thanks
>>>
>>> Drake 민영근 Ph.D
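A minimal sketch of the client-side part of this suggestion, assuming the model lives at the hypothetical path /data/knn/model.txt and that 20 roughly matches the number of datanodes (hdfs dfs -setrep 20 /data/knn/model.txt does the same from the shell). Short-circuit local reads themselves are enabled cluster-side in hdfs-site.xml via dfs.client.read.shortcircuit and dfs.domain.socket.path:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicateModel {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path model = new Path("/data/knn/model.txt"); // assumed model location
    FileSystem fs = model.getFileSystem(conf);

    // Raise the replication factor so (nearly) every datanode holds a local
    // replica; map tasks can then read the model without network traffic.
    // 20 is only an example value.
    fs.setReplication(model, (short) 20);
  }
}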
>>>
>>> On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni <unmeshab...@gmail.com>
>>> wrote:
>>>
>>>> Is there any way?
>>>> Waiting for a reply. I have posted the question everywhere, but no one
>>>> is responding.
>>>> I feel like this is the right place to ask doubts, as some of you may
>>>> have come across the same issue and got stuck.
>>>>
>>>> On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni <
>>>> unmeshab...@gmail.com> wrote:
>>>>
>>>>> Yes, one of my friends is implementing the same. I know that global
>>>>> sharing of data is not possible across Hadoop MapReduce, but I need to
>>>>> check whether it can somehow be done in Hadoop MapReduce as well,
>>>>> because I found some papers on KNN on Hadoop too.
>>>>> And I am trying to compare the performance as well.
>>>>>
>>>>> Hope some pointers can help me.
>>>>>
>>>>> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning <ted.dunn...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Have you considered implementing this using something like Spark?
>>>>>> That could be much easier than raw map-reduce.
>>>>>>
>>>>>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <
>>>>>> unmeshab...@gmail.com> wrote:
>>>>>>
>>>>>>> In a KNN-like algorithm we need to load the model data into the
>>>>>>> cache for predicting the records.
>>>>>>>
>>>>>>> Here is the example for KNN.
>>>>>>>
>>>>>>> [image: Inline image 1]
>>>>>>>
>>>>>>> So if the model is a large file, say 1 or 2 GB, we will not be able
>>>>>>> to load it into the Distributed Cache.
>>>>>>>
>>>>>>> One way is to split/partition the model data into several files,
>>>>>>> perform the distance calculation for all records in each file, and
>>>>>>> then find the minimum distance and the most frequent class label to
>>>>>>> predict the outcome.
>>>>>>>
>>>>>>> How can we partition the file and perform the operation on these
>>>>>>> partitions?
>>>>>>>
>>>>>>> i.e. 1st record <Distance> partition1, partition2, ...
>>>>>>>      2nd record <Distance> partition1, partition2, ...
>>>>>>>
>>>>>>> This is what came to my thought.
>>>>>>>
>>>>>>> Is there any further way?
>>>>>>>
>>>>>>> Any pointers would help me.
>>>>>>>
>>>>>>> --
>>>>>>> *Thanks & Regards *
>>>>>>>
>>>>>>> *Unmesha Sreeveni U.B*
>>>>>>> *Hadoop, Bigdata Developer*
>>>>>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>>>>>> http://www.unmeshasreeveni.blogspot.in/
>>>>>>>

--
*Thanks & Regards *

*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/
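Coming back to the partition idea quoted above, a minimal sketch of the merge step, assuming each map task scans one partition of the model and emits, per test record, its local k best candidates as "distance,label" strings. KnnMergeReducer, K, and the value format are all hypothetical:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class KnnMergeReducer extends Reducer<Text, Text, Text, Text> {

  private static final int K = 5; // example value; the real k is a job parameter

  private static final class Candidate {
    final double distance;
    final String label;
    Candidate(double distance, String label) {
      this.distance = distance;
      this.label = label;
    }
  }

  @Override
  protected void reduce(Text testRecordId, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // Max-heap on distance: the head is the worst of the current k best,
    // so it is evicted whenever a closer candidate arrives.
    PriorityQueue<Candidate> best = new PriorityQueue<Candidate>(
        (a, b) -> Double.compare(b.distance, a.distance));

    for (Text value : values) {
      String[] parts = value.toString().split(","); // "distance,label"
      best.add(new Candidate(Double.parseDouble(parts[0]), parts[1]));
      if (best.size() > K) {
        best.poll(); // drop the farthest candidate
      }
    }

    // Majority vote over the global k nearest neighbours.
    Map<String, Integer> votes = new HashMap<String, Integer>();
    for (Candidate c : best) {
      votes.merge(c.label, 1, Integer::sum);
    }
    String predicted = votes.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .get().getKey();

    context.write(testRecordId, new Text(predicted));
  }
}

The bounded heap keeps memory at k candidates per test record no matter how many partitions the model is split into, so the merge cost stays independent of the model size.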