Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Hey Trevor,

I have completed the KMeans code and will soon commit it, as per the instructions you shared with me in the other mail chain.

Best Regards
Parth

On Sat, May 20, 2017 at 2:29 AM, Trevor Grant wrote:
> Bumping this-
>
> Parth, is there anything we can do to assist you?
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Bumping this-

Parth, is there anything we can do to assist you?

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
@Trevor and @Dmitriy

The tough bug in the aggregating transpose is fixed. One issue is still left which is blocking completion of the KMeans code: assigning the row keys of the DRM to the "closest cluster index" found.

Consider the matrix of data points given as follows:

{
  0 => {0: 1.0, 1: 1.0, 2: 1.0, 3: 3.0}
  1 => {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
  2 => {0: 1.0, 1: 3.0, 2: 4.0, 3: 5.0}
  3 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}

Here 0, 1, 2 and 3 are the row keys. The zeroth column (0) holds the values that will be used to store the count of points assigned to each cluster, and columns 1 to 3 contain the coordinates of the data points.

After the cluster-assignment step of the KMeans algorithm, which @Dmitriy outlined at the beginning of this mail chain, the above matrix should look like this (assuming the 0th and 1st data points are assigned to the cluster with index 0, and the 2nd and 3rd data points to the cluster with index 1):

{
  0 => {0: 1.0, 1: 1.0, 2: 1.0, 3: 3.0}
  0 => {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
  1 => {0: 1.0, 1: 3.0, 2: 4.0, 3: 5.0}
  1 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}

To achieve the above result I am using the following lines of code:

// 11. Iterating over the data matrix (in DrmLike[Int] format)
dataDrmX.mapBlock() {
  case (keys, block) =>
    for (row <- 0 until block.nrow) {
      val dataPoint = block(row, ::)

      // 12. findTheClosestCentriod finds the centroid closest to the
      //     data point given by "dataPoint"
      val closestIndex = findTheClosestCentriod(dataPoint, centriods)

      // 13. Assign the closest centroid's index to the row key
      keys(row) = closestIndex
    }
    keys -> block
}

But it turns out to be:

{
  0 => {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
  1 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}

So is there anything wrong with the syntax of the above code? I am unable to find any reference describing how I should assign a value to the row keys.

@Trevor, as per what you mentioned in the above mail chain:
"Got it- in short no.
Think of the keys like a dictionary or HashMap.
That's why everything is ending up on row 1."

But according to the algorithm outlined by @Dmitriy at the start of the mail chain, assigning the same key to multiple rows should be possible. The same is also mentioned in the book written by Dmitriy and Andrew: rows having the same row keys are summed up when we take the aggregating transpose.

I am now confused about whether what I have described above is achievable, whether it is not possible, or whether it is a bug in the API.

Thanks & Regards
Parth
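The cluster-assignment step described above (find the centroid closest to each data point, then write that centroid's index into the row key) can be sketched outside Mahout in plain Python. Here `closest_centroid` is a hypothetical stand-in for the `findTheClosestCentriod` helper mentioned in the mail, and the points and centroids are made-up illustration data, not values from the thread:

```python
# Plain-Python sketch of the assignment step (not the Mahout Samsara API).
# closest_centroid is a hypothetical stand-in for findTheClosestCentriod.
def closest_centroid(point, centroids):
    """Return the index of the centroid nearest to point (squared Euclidean)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: dist2(point, centroids[i]))

# Coordinate columns of the example matrix (columns 1..3); the count column
# is omitted since it plays no role in the distance computation.
points = [[1.0, 1.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0], [4.0, 5.0, 6.0]]
centroids = [[1.5, 2.0, 3.5], [3.5, 4.5, 5.5]]  # assumed initial centroids

# The new "row keys": one cluster index per data point.
keys = [closest_centroid(p, centroids) for p in points]
print(keys)  # -> [0, 0, 1, 1]
```

With these assumed centroids, the first two points land in cluster 0 and the last two in cluster 1, matching the key layout the mail expects after assignment.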
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Can Mahout be used for self-driving tech?

Thanks,

Khurrum.
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
> > values > > > > > > to > > > > > > > > the key to Row Matrix and aggregating transpose.I think > > assigning > > > > new > > > > > > > > values to the key to Row Matrix and aggregating transpose is > > > > causing > > > > > > the > > > > > > > > main problem in Kmean code > > > > > > > > Following is the link to Github repo for this code. > > > > > > > > TestClusterAssign.scala > > > > > > > > <https://github.com/parth2691/Spark_Mahout/blob/Dmitriy- > > > > > > > Lyubimov/TestClusterAssign.scala> > > > > > > > > > > > > > > > > above code contains the hard coded data. Following is the > > > expected > > > > > and > > > > > > > the > > > > > > > > actual output of the above code > > > > > > > > Out of 1st println After New Cluster assignment should be > > > > > > > > This > > > > > > > > { > > > > > > > > 0 => {0:1.0,1: 1.0,2: 1.0, 3: 3.0} > > > > > > > > 1 => {0:1.0,1: 2.0,2: 3.0, 3: 4.0} > > > > > > > > 2 => {0:1.0,1: 3.0,2: 4.0, 3: 5.0} > > > > > > > > 3 => {0:1.0,1: 4.0,2: 5.0, 3: 6.0} > > > > > > > > } > > > > > > > > (Here zeroth Column is used to store the centriod count and > > > column > > > > > 1,2 > > > > > > > and > > > > > > > > 3 Contains Data) > > > > > > > > > > > > > > > > But Turns out to be this > > > > > > > > { > > > > > > > > 0 => {} > > > > > > > > 1 => {0:1.0,1:4.0,2:5.0,3:6.0} > > > > > > > > 2 => {} > > > > > > > > 3 => {} > > > > > > > > } > > > > > > > > And the result of aggregating Transpose should be > > > > > > > > { > > > > > > > > 0 => {1: 4.0} > > > > > > > > 1 => {1: 9.0} > > > > > > > > 2 => {1: 12.0} > > > > > > > > 3 => {1: 15.0} > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > Thanks Trevor for such a great Help > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > Parth > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Apr 21, 2017 at 4:20 AM, Trevor 
Grant < > > > > > > trevor.d.gr...@gmail.com> > > > > > > > > wrote: > > > > > > > > > > > > > > > >> Hey > > > > > > > >> > > > > > > > >> Sorry for delay- was getting ready to tear into this. > > > > > > > >> > > > > > > > >> Would you mind posting a small sample of data that you would > > > > expect > > > > > > this > > > > > > > >> application to consume. > > > > > > > >> > > > > > > > >> tg > > > > > > > >> > > > > > > > >> > > > > > > > >> Trevor Grant > > > > > > > >> Data Scientist > > > > > > > >> https://github.com/rawkintrevo > > > > > > > >> http://stackexchange.com/users/3002022/rawkintrevo > > > > > > > >> http://trevorgrant.org > > > > > > > >> > > > > > > > >> *"Fortunate is he, who is able to know the causes of > things." > > > > > > -Virgil* > > > > > > > >> > > > > > > > >> > > &
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
t; > But Turns out to be this > > > > > > > { > > > > > > > 0 => {} > > > > > > > 1 => {0:1.0,1:4.0,2:5.0,3:6.0} > > > > > > > 2 => {} > > > > > > > 3 => {} > > > > > > > } > > > > > > > And the result of aggregating Transpose should be > > > > > > > { > > > > > > > 0 => {1: 4.0} > > > > > > > 1 => {1: 9.0} > > > > > > > 2 => {1: 12.0} > > > > > > > 3 => {1: 15.0} > > > > > > > } > > > > > > > > > > > > > > > > > > > > > Thanks Trevor for such a great Help > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > Parth > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Apr 21, 2017 at 4:20 AM, Trevor Grant < > > > > > trevor.d.gr...@gmail.com> > > > > > > > wrote: > > > > > > > > > > > > > >> Hey > > > > > > >> > > > > > > >> Sorry for delay- was getting ready to tear into this. > > > > > > >> > > > > > > >> Would you mind posting a small sample of data that you would > > > expect > > > > > this > > > > > > >> application to consume. > > > > > > >> > > > > > > >> tg > > > > > > >> > > > > > > >> > > > > > > >> Trevor Grant > > > > > > >> Data Scientist > > > > > > >> https://github.com/rawkintrevo > > > > > > >> http://stackexchange.com/users/3002022/rawkintrevo > > > > > > >> http://trevorgrant.org > > > > > > >> > > > > > > >> *"Fortunate is he, who is able to know the causes of things." > > > > > -Virgil* > > > > > > >> > > > > > > >> > > > > > > >> On Tue, Apr 18, 2017 at 11:32 PM, KHATWANI PARTH BHARAT < > > > > > > >> h2016...@pilani.bits-pilani.ac.in> wrote: > > > > > > >> > > > > > > >> > @Dmitriy,@Trevor and @Andrew Sir, > > > > > > >> > I am still stuck at the above problem can you please help me > > out > > > > > with > > > > > > >> it. > > > > > > >> > I am unable to find the proper reference to solve the above > > > > issue. 
> > > > > > >> > > > > > > > >> > Thanks & Regards > > > > > > >> > Parth Khatwani > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > <https://mailtrack.io/> Sent with Mailtrack > > > > > > >> > <https://mailtrack.io/install?source=signature&lang=en&; > > > > > > >> > referral=h2016...@pilani.bits-pilani.ac.in&idSignature=22> > > > > > > >> > > > > > > > >> > On Sat, Apr 15, 2017 at 10:07 AM, KHATWANI PARTH BHARAT < > > > > > > >> > h2016...@pilani.bits-pilani.ac.in> wrote: > > > > > > >> > > > > > > > >> > > @Dmitriy, > > > > > > >> > > @Trevor and @Andrew > > > > > > >> > > > > > > > > >> > > I have tried > > > > > > >> > > Testing this Row Key assignment issue which i have > mentioned > > > in > > > > > the > > > > > > >> above > > > > > > >> > > mail, > > > > > > >> > &g
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
gt; >> > > > > > >> > > > > > >> On Tue, Apr 18, 2017 at 11:32 PM, KHATWANI PARTH BHARAT < > > > > > >> h2016...@pilani.bits-pilani.ac.in> wrote: > > > > > >> > > > > > >> > @Dmitriy,@Trevor and @Andrew Sir, > > > > > >> > I am still stuck at the above problem can you please help me > out > > > > with > > > > > >> it. > > > > > >> > I am unable to find the proper reference to solve the above > > > issue. > > > > > >> > > > > > > >> > Thanks & Regards > > > > > >> > Parth Khatwani > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > <https://mailtrack.io/> Sent with Mailtrack > > > > > >> > <https://mailtrack.io/install?source=signature&lang=en&; > > > > > >> > referral=h2016...@pilani.bits-pilani.ac.in&idSignature=22> > > > > > >> > > > > > > >> > On Sat, Apr 15, 2017 at 10:07 AM, KHATWANI PARTH BHARAT < > > > > > >> > h2016...@pilani.bits-pilani.ac.in> wrote: > > > > > >> > > > > > > >> > > @Dmitriy, > > > > > >> > > @Trevor and @Andrew > > > > > >> > > > > > > > >> > > I have tried > > > > > >> > > Testing this Row Key assignment issue which i have mentioned > > in > > > > the > > > > > >> above > > > > > >> > > mail, > > > > > >> > > By Writing the a separate code where i am assigning the a > > > default > > > > > >> value 1 > > > > > >> > > to each row Key of The DRM and then taking the aggregating > > > > transpose > > > > > >> > > I have committed the separate test code to the Github > Branch > > > > > >> > > <https://github.com/parth2691/Spark_Mahout/tree/Dmitriy- > > > Lyubimov > > > > >. 
> > > > > >> > > > > > > > >> > > The Code is as follows > > > > > >> > > > > > > > >> > > val inCoreA = dense((1,1, 2, 3), (1,2, 3, 4), (1,3, 4, 5), > > (1,4, > > > > 5, > > > > > >> 6)) > > > > > >> > > val A = drmParallelize(m = inCoreA) > > > > > >> > > > > > > > >> > > //Mapblock > > > > > >> > > val drm2 = A.mapBlock() { > > > > > >> > > case (keys, block) =>for(row <- 0 until > > > keys.size) { > > > > > >> > > > > > > > >> > > * //assigning 1 to each row index* > keys(row) > > > = 1 > > > > > >> > }(keys, block)}prinln("After New Cluster > > > > > assignment") > > > > > >> > println(""+drm2.collect)val aggTranspose = drm2.t > > > > > >> println("Result of > > > > > >> > aggregating tranpose")println(""+aggTranspose.collect) > > > > > >> > > > > > > > >> > > Out of 1st println After New Cluster assignment should be > > > > > >> > > This > > > > > >> > > { > > > > > >> > > 0 => {0:1.0,1: 1.0,2: 1.0, 3: 3.0} > > > > > >> > > 1 => {0:1.0,1: 2.0,2: 3.0, 3: 4.0} > > > > > >> > > 2 => {0:1.0,1: 3.0,2: 4.0, 3: 5.0} > > > > > >> > > 3 => {0:1.0,1: 4.0,2: 5.0, 3: 6.0} > > > > > >> > > } > > > > > >> > > (Here zeroth Column is used to store the centriod count and > > > column > > > > > 1,2 > > > > > >> > and > > > > > >> > > 3 Contains Data) > > > > > >> > > > > > > > >> > > But
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
nd column > > 1,2 > > > > and > > > > > 3 Contains Data) > > > > > > > > > > But Turns out to be this > > > > > { > > > > > 0 => {} > > > > > 1 => {0:1.0,1:4.0,2:5.0,3:6.0} > > > > > 2 => {} > > > > > 3 => {} > > > > > } > > > > > And the result of aggregating Transpose should be > > > > > { > > > > > 0 => {1: 4.0} > > > > > 1 => {1: 9.0} > > > > > 2 => {1: 12.0} > > > > > 3 => {1: 15.0} > > > > > } > > > > > > > > > > > > > > > Thanks Trevor for such a great Help > > > > > > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > Parth > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Apr 21, 2017 at 4:20 AM, Trevor Grant < > > > trevor.d.gr...@gmail.com> > > > > > wrote: > > > > > > > > > >> Hey > > > > >> > > > > >> Sorry for delay- was getting ready to tear into this. > > > > >> > > > > >> Would you mind posting a small sample of data that you would > expect > > > this > > > > >> application to consume. > > > > >> > > > > >> tg > > > > >> > > > > >> > > > > >> Trevor Grant > > > > >> Data Scientist > > > > >> https://github.com/rawkintrevo > > > > >> http://stackexchange.com/users/3002022/rawkintrevo > > > > >> http://trevorgrant.org > > > > >> > > > > >> *"Fortunate is he, who is able to know the causes of things." > > > -Virgil* > > > > >> > > > > >> > > > > >> On Tue, Apr 18, 2017 at 11:32 PM, KHATWANI PARTH BHARAT < > > > > >> h2016...@pilani.bits-pilani.ac.in> wrote: > > > > >> > > > > >> > @Dmitriy,@Trevor and @Andrew Sir, > > > > >> > I am still stuck at the above problem can you please help me out > > > with > > > > >> it. > > > > >> > I am unable to find the proper reference to solve the above > > issue. 
> > > > >> > > > > > >> > Thanks & Regards > > > > >> > Parth Khatwani > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > <https://mailtrack.io/> Sent with Mailtrack > > > > >> > <https://mailtrack.io/install?source=signature&lang=en&; > > > > >> > referral=h2016...@pilani.bits-pilani.ac.in&idSignature=22> > > > > >> > > > > > >> > On Sat, Apr 15, 2017 at 10:07 AM, KHATWANI PARTH BHARAT < > > > > >> > h2016...@pilani.bits-pilani.ac.in> wrote: > > > > >> > > > > > >> > > @Dmitriy, > > > > >> > > @Trevor and @Andrew > > > > >> > > > > > > >> > > I have tried > > > > >> > > Testing this Row Key assignment issue which i have mentioned > in > > > the > > > > >> above > > > > >> > > mail, > > > > >> > > By Writing the a separate code where i am assigning the a > > default > > > > >> value 1 > > > > >> > > to each row Key of The DRM and then taking the aggregating > > > transpose > > > > >> > > I have committed the separate test code to the Github Branch > > > > >> > > <https://github.com/parth2691/Spark_Mahout/tree/Dmitriy- > > Lyubimov > > > >. > > > > >> > > > > > > >> > > The Code is as follows > > > > >> > > > > > > >> > > val inCoreA = dense((1,1, 2, 3), (1,2, 3, 4), (1,3, 4, 5), > (1,4, > > > 5, > > > > >> 6))
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
gt; >> Data Scientist > > > >> https://github.com/rawkintrevo > > > >> http://stackexchange.com/users/3002022/rawkintrevo > > > >> http://trevorgrant.org > > > >> > > > >> *"Fortunate is he, who is able to know the causes of things." > > -Virgil* > > > >> > > > >> > > > >> On Tue, Apr 18, 2017 at 11:32 PM, KHATWANI PARTH BHARAT < > > > >> h2016...@pilani.bits-pilani.ac.in> wrote: > > > >> > > > >> > @Dmitriy,@Trevor and @Andrew Sir, > > > >> > I am still stuck at the above problem can you please help me out > > with > > > >> it. > > > >> > I am unable to find the proper reference to solve the above > issue. > > > >> > > > > >> > Thanks & Regards > > > >> > Parth Khatwani > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > <https://mailtrack.io/> Sent with Mailtrack > > > >> > <https://mailtrack.io/install?source=signature&lang=en&; > > > >> > referral=h2016...@pilani.bits-pilani.ac.in&idSignature=22> > > > >> > > > > >> > On Sat, Apr 15, 2017 at 10:07 AM, KHATWANI PARTH BHARAT < > > > >> > h2016...@pilani.bits-pilani.ac.in> wrote: > > > >> > > > > >> > > @Dmitriy, > > > >> > > @Trevor and @Andrew > > > >> > > > > > >> > > I have tried > > > >> > > Testing this Row Key assignment issue which i have mentioned in > > the > > > >> above > > > >> > > mail, > > > >> > > By Writing the a separate code where i am assigning the a > default > > > >> value 1 > > > >> > > to each row Key of The DRM and then taking the aggregating > > transpose > > > >> > > I have committed the separate test code to the Github Branch > > > >> > > <https://github.com/parth2691/Spark_Mahout/tree/Dmitriy- > Lyubimov > > >. 
> > > >> > > > > > >> > > The Code is as follows > > > >> > > > > > >> > > val inCoreA = dense((1,1, 2, 3), (1,2, 3, 4), (1,3, 4, 5), (1,4, > > 5, > > > >> 6)) > > > >> > > val A = drmParallelize(m = inCoreA) > > > >> > > > > > >> > > //Mapblock > > > >> > > val drm2 = A.mapBlock() { > > > >> > > case (keys, block) =>for(row <- 0 until > keys.size) { > > > >> > > > > > >> > > * //assigning 1 to each row index* keys(row) > = 1 > > > >> > }(keys, block)}prinln("After New Cluster > > > assignment") > > > >> > println(""+drm2.collect)val aggTranspose = drm2.t > > > >> println("Result of > > > >> > aggregating tranpose")println(""+aggTranspose.collect) > > > >> > > > > > >> > > Out of 1st println After New Cluster assignment should be > > > >> > > This > > > >> > > { > > > >> > > 0 => {0:1.0,1: 1.0,2: 1.0, 3: 3.0} > > > >> > > 1 => {0:1.0,1: 2.0,2: 3.0, 3: 4.0} > > > >> > > 2 => {0:1.0,1: 3.0,2: 4.0, 3: 5.0} > > > >> > > 3 => {0:1.0,1: 4.0,2: 5.0, 3: 6.0} > > > >> > > } > > > >> > > (Here zeroth Column is used to store the centriod count and > column > > > 1,2 > > > >> > and > > > >> > > 3 Contains Data) > > > >> > > > > > >> > > But Turns out to be this > > > >> > > { > > > >> > > 0 => {} > > > >> > > 1 => {0:1.0,1:4.0,2:5.0,3:6.0} > > > >> > > 2 => {} > > > >> > > 3 => {} > > > >> > > } > > > >> > > And the result of aggregating Transpose should be > > > >> > > { > > > >> > > 0 => {1: 4.0} > >
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
The output of the 1st println ("After New Cluster assignment") should be:

{
  0 => {0: 1.0, 1: 1.0, 2: 1.0, 3: 3.0}
  1 => {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
  2 => {0: 1.0, 1: 3.0, 2: 4.0, 3: 5.0}
  3 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}

(Here the zeroth column is used to store the centroid count, and columns 1, 2 and 3 contain the data.)

But it turns out to be:

{
  0 => {}
  1 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
  2 => {}
  3 => {}
}

And the result of the aggregating transpose should be:

{
  0 => {1: 4.0}
  1 => {1: 9.0}
  2 => {1: 12.0}
  3 => {1: 15.0}
}

Thanks, Trevor, for such great help.

Best Regards,
Parth
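For reference, the aggregating transpose discussed in this thread sums together rows that share a row key. The following is a minimal plain-Python sketch of that sum-by-key semantics, not the Mahout Samsara implementation; `aggregating_transpose` is a hypothetical helper name used only for illustration.

```python
def aggregating_transpose(keys, rows):
    # Transpose a keyed row-matrix, summing rows that share a row key:
    # the result maps each original column index c to {key: summed value}.
    ncol = len(rows[0])
    out = {c: {} for c in range(ncol)}
    for key, row in zip(keys, rows):
        for c, v in enumerate(row):
            out[c][key] = out[c].get(key, 0.0) + v
    return out

# The 4x4 matrix from the thread, with every row key reassigned to 1:
rows = [[1, 1, 1, 3], [1, 2, 3, 4], [1, 3, 4, 5], [1, 4, 5, 6]]
result = aggregating_transpose([1, 1, 1, 1], rows)
# result[0] == {1: 4.0}: all four rows collapse onto key 1, so each
# result row holds the corresponding column sum.
```

With distinct keys (e.g. cluster indices 0, 0, 1, 1) the same sum-by-key rule yields per-cluster column sums, which is how the count stored in column 0 becomes the cluster size after the transpose.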
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Hey,

Sorry for the delay; I was getting ready to tear into this.

Would you mind posting a small sample of the data that you would expect this application to consume?

tg

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
@Dmitriy, @Trevor and @Andrew Sir,

I am still stuck at the above problem; can you please help me out with it? I am unable to find a proper reference for solving the above issue.

Thanks & Regards,
Parth Khatwani
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
@Dmitriy, @Trevor and @Andrew,

I have tried testing this row-key assignment issue, which I mentioned in the mail above, by writing a separate piece of code in which I assign a default value of 1 to each row key of the DRM and then take the aggregating transpose. I have committed the separate test code to the GitHub branch
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>.

The code is as follows:

    val inCoreA = dense((1, 1, 2, 3), (1, 2, 3, 4), (1, 3, 4, 5), (1, 4, 5, 6))
    val A = drmParallelize(m = inCoreA)

    // mapBlock
    val drm2 = A.mapBlock() {
      case (keys, block) =>
        for (row <- 0 until keys.size) {
          // assigning 1 to each row key
          keys(row) = 1
        }
        (keys, block)
    }
    println("After New Cluster assignment")
    println("" + drm2.collect)

    val aggTranspose = drm2.t
    println("Result of aggregating transpose")
    println("" + aggTranspose.collect)

The output of the 1st println ("After New Cluster assignment") should be this:

{
  0 => {0: 1.0, 1: 1.0, 2: 1.0, 3: 3.0}
  1 => {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
  2 => {0: 1.0, 1: 3.0, 2: 4.0, 3: 5.0}
  3 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
}

(Here the zeroth column is used to store the centroid count, and columns 1, 2 and 3 contain the data.)

But it turns out to be this:

{
  0 => {}
  1 => {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}
  2 => {}
  3 => {}
}

And the result of the aggregating transpose should be:

{
  0 => {1: 4.0}
  1 => {1: 9.0}
  2 => {1: 12.0}
  3 => {1: 15.0}
}

I have referred to the book written by Andrew and Dmitriy, Apache Mahout: Beyond MapReduce
<https://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785>; the aggregating transpose and other concepts are explained very nicely there, but I am unable to find any example where row keys are assigned new values. The Mahout Samsara manual
http://apache.github.io/mahout/doc/ScalaSparkBindings.html also does not contain any such examples. It would be great if I could get some reference toward a solution of the mentioned issue.

Thanks,
Parth Khatwani
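One plausible reading of the surprising collect output reported in this thread: collect gathers rows into an in-core matrix indexed by row key, so when every row carries the same key the writes collide and only the last row survives. A plain-Python sketch of that collision under this assumption (illustrative only; `collect_by_key` is a hypothetical helper, not the Samsara API):

```python
def collect_by_key(keys, rows, nrow):
    # Gather keyed rows into a dense nrow-slot table, one slot per key;
    # rows that share a key collide, and the last write wins.
    out = [{} for _ in range(nrow)]
    for key, row in zip(keys, rows):
        out[key] = {c: float(v) for c, v in enumerate(row)}
    return out

rows = [[1, 1, 1, 3], [1, 2, 3, 4], [1, 3, 4, 5], [1, 4, 5, 6]]
collect_by_key([1, 1, 1, 1], rows, 4)
# → [{}, {0: 1.0, 1: 4.0, 2: 5.0, 3: 6.0}, {}, {}]
```

Under this reading, the observed output (slot 1 holding only the last row, slots 0, 2 and 3 empty) is exactly what duplicate keys would produce.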
RE: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
+1

Sent from my Verizon Wireless 4G LTE smartphone

Original message
From: Trevor Grant
Date: 04/14/2017 11:40 (GMT-08:00)
To: dev@mahout.apache.org
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"

On Thu, Apr 13, 2017 at 6:24 PM, KHATWANI PARTH BHARAT <
h2016...@pilani.bits-pilani.ac.in> wrote:

> Dmitriy Sir,
> I have created a GitHub branch having the initial KMeans code:
> <https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>
>
> Thanks & Regards,
> Parth Khatwani

On Thu, Apr 13, 2017 at 3:19 AM, Andrew Palumbo wrote:

> +1 to creating a branch.
>
> Sent from my Verizon Wireless 4G LTE smartphone

Original message
From: Dmitriy Lyubimov
Date: 04/12/2017 11:25 (GMT-08:00)
To: dev@mahout.apache.org
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"

> Can't say I can read this code well formatted that way...
>
> It would seem to me that the code is not using the broadcast variable and
> is instead using a closure variable; that's the only thing I can
> immediately see by looking in the middle of it.
>
> It would be better if you created a branch on GitHub for that code; that
> would allow for easy check-outs and comments.
>
> -d

On Wed, Apr 12, 2017 at 10:29 AM, KHATWANI PARTH BHARAT <
h2016...@pilani.bits-pilani.ac.in> wrote:

> @Dmitriy Sir,
> I have completed the KMeans code as per the algorithm you have outlined
> above.
>
> My code
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Parth and Dmitriy,

This is awesome. As a follow-on, can we work on getting this rolled into the algorithms framework? Happy to work with you on this, Parth!

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
I would think reassigning keys should work in most cases. The only exception is that technically Spark contracts imply that the effect should be idempotent if a task is retried, which might be a problem in the specific scenario where the object tree comes out of the block cache, stays there, and is retried again. But specifically w.r.t. this key assignment I don't see any problem, since the action obviously would be idempotent even if this code is run multiple times on the same (key, block) pair. This part should be good IMO.
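Dmitriy's idempotency point can be sanity-checked with a small plain-Scala sketch (no Mahout types; arrays stand in for a DRM block, and all names here are illustrative): because each new key is computed purely from the row contents and the broadcast centroids, re-running the same (key, block) task yields identical keys.

```scala
// Plain-Scala stand-in for recomputing a DRM block's row keys from the
// block contents alone. Because nothing but (row, centroids) determines
// the new key, re-running the same task (a Spark retry) changes nothing.
def assignKeys(block: Array[Array[Double]], centroids: Array[Array[Double]]): Array[Int] = {
  def dist2(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  block.map(row => centroids.indices.minBy(i => dist2(row, centroids(i))))
}

val block     = Array(Array(1.0, 1.0), Array(9.0, 9.0))
val centroids = Array(Array(0.0, 0.0), Array(10.0, 10.0))

val firstRun = assignKeys(block, centroids)
val retryRun = assignKeys(block, centroids) // simulated task retry
```

Both runs produce the same key array, which is what makes the operation safe under Spark's retry semantics.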
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
@Dmitriy Sir,

In the k-means code above I think I am doing the following incorrectly: assigning the closest centroid index to the row keys of the DRM.

    //11. iterating over the data matrix (in DrmLike[Int] format)
    //    to assign each data point to its closest centroid
    dataDrmX.mapBlock() {
      case (keys, block) =>
        for (row <- 0 until block.nrow) {
          var dataPoint = block(row, ::)

          //12. findTheClosestCentriod finds the centroid closest to
          //    the data point specified by "dataPoint"
          val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

          //13. assigning the closest index to the key
          keys(row) = closesetIndex
        }
        keys -> block
    }

In step 12 I am finding the centroid closest to the current dataPoint. In step 13 I am assigning closesetIndex to the key of the corresponding row represented by dataPoint. I think I am doing step 13 incorrectly.

Also, I am unable to find a proper reference for this in the reference links you have mentioned above.

Thanks & Regards
Parth Khatwani
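One plausible cause of the step-13 problem, offered as a guess rather than a confirmed reading of the Mahout sources: mapBlock() is functional and returns a new DRM rather than mutating dataDrmX in place, so a result that is never captured in a val is simply lost. The sketch below mimics that contract in plain Scala (Block, mapBlock, and closestIndex are made-up stand-ins, not Mahout API):

```scala
// Arrays stand in for a DRM's (keys, block) pair. Like a functional
// mapBlock, this builds and RETURNS a new Block, leaving the input
// unchanged -- so the caller must capture the result.
case class Block(keys: Array[Int], rows: Array[Array[Double]])

def mapBlock(b: Block)(
    f: (Array[Int], Array[Array[Double]]) => (Array[Int], Array[Array[Double]])): Block = {
  val (k2, r2) = f(b.keys.clone(), b.rows.map(_.clone()))
  Block(k2, r2)
}

def closestIndex(p: Array[Double], cs: Array[Array[Double]]): Int =
  cs.indices.minBy { i =>
    p.zip(cs(i)).map { case (x, y) => (x - y) * (x - y) }.sum
  }

val centroids = Array(Array(0.0, 0.0), Array(10.0, 10.0))
val data = Block(Array(0, 1, 2),
  Array(Array(1.0, 1.0), Array(9.0, 9.0), Array(0.5, 0.0)))

// Capture the result; `data` itself is not modified.
val labeled = mapBlock(data) { (keys, rows) =>
  for (r <- rows.indices) keys(r) = closestIndex(rows(r), centroids)
  keys -> rows
}
```

The relabeled keys live only in `labeled`; inspecting `data.keys` afterward shows the original keys untouched, which is consistent with the symptom of the assignment appearing not to "stick".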
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Dmitriy Sir,

I have created a GitHub branch having the initial k-means code:
<https://github.com/parth2691/Spark_Mahout/tree/Dmitriy-Lyubimov>

Thanks & Regards
Parth Khatwani
RE: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
+1 to creating a branch.

Sent from my Verizon Wireless 4G LTE smartphone

---- Original message ----
From: Dmitriy Lyubimov
Date: 04/12/2017 11:25 (GMT-08:00)
To: dev@mahout.apache.org
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"

Can't say I can read this code well formatted that way...

It would seem to me that the code is not using the broadcast variable and instead is using a closure variable. That's the only thing I can immediately see by looking in the middle of it.

It would be better if you created a branch on GitHub for that code; that would allow for easy check-outs and comments.

-d
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
@Dmitriy Sir,

I have completed the k-means code as per the algorithm you have outlined above. My code is as follows. This code works fine till step number 10. In step 11 I am assigning the new centroid index to the corresponding row key of the data point in the matrix. I think I am doing something wrong in step 11, maybe I am using incorrect syntax.

Can you help me find out what I am doing wrong?

  //start of main method
  def main(args: Array[String]) {
    //1. initialize the spark and mahout context
    val conf = new SparkConf()
      .setAppName("DRMExample")
      .setMaster(args(0))
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
    implicit val sc = new SparkDistributedContext(new SparkContext(conf))

    //2. read the data file and save it in the rdd
    val lines = sc.textFile(args(1))

    //3. convert the data read in as strings into arrays of double
    val test = lines.map(line => line.split('\t').map(_.toDouble))

    //4. add a column having value 1 to each array of double; this will
    //   create something like (1 | D), which will be used while calculating (1 | D)'
    val augumentedArray = test.map(addCentriodColumn _)

    //5. convert the rdd of array of double into an rdd of DenseVector
    val rdd = augumentedArray.map(dvec(_))

    //6. convert the rdd to a DrmRdd
    val rddMatrixLike: DrmRdd[Int] = rdd.zipWithIndex.map { case (v, idx) => (idx.toInt, v) }

    //7. convert the DrmRdd to a CheckpointedDrm[Int]
    val matrix = drmWrap(rddMatrixLike)

    //8. separating the column having all ones created in step 4; will use it later
    val oneVector = matrix(::, 0 until 1)

    //9. final input data in DrmLike[Int] format
    val dataDrmX = matrix(::, 1 until 4)

    //9. sampling to select the initial centroids
    val centriods = drmSampleKRows(dataDrmX, 2, false)
    centriods.size

    //10. broadcasting the initial centroids
    val broadCastMatrix = drmBroadcast(centriods)

    //11. iterating over the data matrix (in DrmLike[Int] format)
    //    to assign each data point to its closest centroid
    dataDrmX.mapBlock() {
      case (keys, block) =>
        for (row <- 0 until block.nrow) {
          var dataPoint = block(row, ::)

          //12. findTheClosestCentriod finds the centroid closest to
          //    the data point specified by "dataPoint"
          val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

          //13. assigning the closest index to the key
          keys(row) = closesetIndex
        }
        keys -> block
    }

    //14. calculating (1|D)
    val b = (oneVector cbind dataDrmX)

    //15. aggregating transpose (1|D)'
    val bTranspose = (oneVector cbind dataDrmX).t
    // after step 15, bTranspose will have data in the following format:
    // (n+1) x K, where n = dimension of the data point, K = number of clusters;
    // the zeroth row will contain the count of points assigned to each cluster
    // (assuming 3d data points)

    val nrows = b.nrow.toInt

    //16. slicing the count vectors out
    val pointCountVectors = drmBroadcast(b(0 until 1, ::).collect(0, ::))
    val vectorSums = b(1 until nrows, ::)

    //17. dividing the data points by the count vector
    vectorSums.mapBlock() {
      case (keys, block) =>
        for (row <- 0 until block.nrow) {
          block(row, ::) /= pointCountVectors
        }
        keys -> block
    }

    //18. separating the count vectors
    val newCentriods = vectorSums.t(::, 1 until centriods.size)

    //19. iterate over the above code till the convergence criteria is met
  } //end of main method

  // method to find the centroid closest to a data point (vec: Vector)
  def findTheClosestCentriod(vec: Vector, matrix: Matrix): Int = {
    var index = 0
    var closest = Double.PositiveInfinity
    for (row <- 0 until matrix.nrow) {
      val squaredSum = ssr(vec, matrix(row, ::))
      val tempDist = Math.sqrt(squaredSum)
      if (tempDist < closest) {
        closest = tempDist
        index = row
      }
    }
    index
  }

  // calculating the sum of squared distances between the points (vectors)
  def ssr(a: Vector, b: Vector): Double = {
    (a - b) ^= 2 sum
  }

  // method used to create (1|D)
  def addCentriodColumn(arg: Array[Double]): Array[Double] = {
    val newArr = new Array[Double](arg.length + 1)
    newArr(0) = 1.0
    for (i <- 0 until arg.size) {
      newArr(i + 1) = arg(i)
    }
    newArr
  }

Thanks & Regards
Parth Khatwani
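The bookkeeping that steps 14-18 delegate to the aggregating transpose can be checked in plain Scala on a tiny example (recomputeCentroids is an illustrative stand-in, not Mahout API): each cluster accumulates a count coming from the all-ones column of (1|D), plus per-coordinate sums, and dividing sums by counts gives the updated centroids.

```scala
// Plain-Scala check of the (1|D)' bookkeeping: for each cluster,
// slot 0 accumulates the point count (the "1" column of (1|D)) and the
// remaining slots accumulate coordinate sums; dividing sums by counts
// yields the updated centroids.
def recomputeCentroids(points: Array[Array[Double]],
                       assign: Array[Int], k: Int): Array[Array[Double]] = {
  val n = points(0).length
  val b = Array.fill(k)(new Array[Double](n + 1)) // per cluster: (count, sums...)
  for ((p, j) <- points.zip(assign)) {
    b(j)(0) += 1.0
    for (d <- 0 until n) b(j)(d + 1) += p(d)
  }
  // guard empty clusters (count 0) instead of dividing by zero
  b.map(col => if (col(0) == 0.0) new Array[Double](n)
               else col.drop(1).map(_ / col(0)))
}

val pts    = Array(Array(1.0, 1.0), Array(3.0, 3.0), Array(10.0, 0.0))
val assign = Array(0, 0, 1)
val c      = recomputeCentroids(pts, assign, 2)
```

Here cluster 0 gets count 2 and sums (4, 4), so its new centroid is (2, 2), and cluster 1 keeps its single point (10, 0) as its centroid.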
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
ps1: this assumes row-wise construction of A based on a training set of m
n-dimensional points.

ps2: since we are doing multiple passes over A, it may make sense to make
sure it is committed to the Spark cache (by using the checkpoint api), if
Spark is used.

On Fri, Mar 31, 2017 at 10:53 AM, Dmitriy Lyubimov wrote:

> here is the outline. For details of APIs, please refer to the Samsara
> manual [2]; I will not be repeating it.
>
> Assume your training data input is an m x n matrix A. For simplicity,
> let's assume it's a DRM with int row keys, i.e., DrmLike[Int].
>
> Initialization:
>
> First, classic k-means starts by selecting initial clusters, by sampling
> them out. You can do that by using the sampling api [1], thus forming a
> k x n in-memory matrix C (current centroids). C is therefore of Mahout's
> Matrix type.
>
> You then proceed by alternating between cluster assignments and
> recomputing the centroid matrix C till convergence, based on some test or
> simply limited by an epoch count budget, your choice.
>
> Cluster assignments: here, we go over the current generation of A and
> recompute the centroid index for each row in A. Once we recompute an
> index, we put it into the row key. You can do that by assigning centroid
> indices to keys of A using operator mapBlock() (details in [2], [3], [4]).
> You also need to broadcast C in order to be able to access it in an
> efficient manner inside the mapBlock() closure. Examples of that are
> plenty in [2]. Essentially, in mapBlock, you'd reform the row keys to
> reflect the cluster index in C. While going over A, you'd have a "nearest
> neighbor" problem to solve for each row of A and the centroids C. This is
> the bulk of the computation really, and there are a few tricks there that
> can speed this step up in both exact and approximate manner, but you can
> start with a naive search.
>
> Centroid recomputation:
> once you have assigned centroids to the keys of matrix A, you'd want to do
> an aggregating transpose of A to compute, essentially, the average of the
> rows of A grouped by the centroid key. The trick is to do a computation of
> (1|A)', which results in a matrix of the shape (counts|sums of cluster
> rows). This is the part I find difficult to explain without latex
> graphics.
>
> In Samsara, construction of (1|A)' corresponds to the DRM expression
>
> (1 cbind A).t (again, see [2]).
>
> So when you compute, say,
>
> B = (1 | A)',
>
> then B is (n+1) x k, so each column contains a vector corresponding to a
> cluster 1..k. In such a column, the first element would be the # of points
> in the cluster, and the rest of it would correspond to the sum of all
> points. So in order to arrive at an updated matrix C, we need to collect B
> into memory, and slice out the counters (first row) from the rest of it.
>
> So, to compute C:
>
> C <- B(2:,:) each row divided by B(1,:)
>
> (watch out for empty clusters with 0 elements; this will cause lack of
> convergence and NaNs in the newly computed C).
>
> This operation obviously uses subblocking and row-wise iteration over B,
> for which I am again making reference to [2].
>
> [1] https://github.com/apache/mahout/blob/master/math-scala/
> src/main/scala/org/apache/mahout/math/drm/package.scala#L149
>
> [2] Samsara manual, a bit dated but viable: http://apache.github.
> io/mahout/doc/ScalaSparkBindings.html
>
> [3] scaladoc, again dated but largely viable for the purpose of this
> exercise:
> http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.htm
>
> [4] mapBlock etc.: http://apache.github.io/mahout/0.10.1/docs/mahout-
> math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps
>
> On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI PARTH BHARAT <
> h2016...@pilani.bits-pilani.ac.in> wrote:
>
>> @Dmitriy, can you please again tell me the approach to move ahead.
>>
>>
>> Thanks
>> Parth Khatwani
>>
>>
>> On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI PARTH BHARAT <
>> h2016...@pilani.bits-pilani.ac.in> wrote:
>>
>> > yes, I am unable to figure out the way ahead.
>> > Like how to create the augmented matrix A := (0|D) which you have
>> > mentioned.
>> >
>> >
>> > On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy Lyubimov
>> > wrote:
>> >
>> >> has my reply to your post on @user been a bit confusing?
>> >>
>> >> On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI PARTH BHARAT <
>> >> h2016...@pilani.bits-pilani.ac.in> wrote:
>> >>
>> >> > Sir,
>> >> > I am trying to write the kmeans clustering algorithm using Mahout
>> >> > Samsara, but I am a bit confused about how to leverage the
>> >> > Distributed Row Matrix for the same. Can anybody help me with the
>> >> > same?
>> >> >
>> >> > Thanks
>> >> > Parth Khatwani
>> >
>> >
>
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
here is the outline. For details of APIs, please refer to the Samsara manual
[2]; I will not be repeating it.

Assume your training data input is an m x n matrix A. For simplicity, let's
assume it's a DRM with int row keys, i.e., DrmLike[Int].

Initialization:

First, classic k-means starts by selecting initial clusters, by sampling
them out. You can do that by using the sampling api [1], thus forming a
k x n in-memory matrix C (current centroids). C is therefore of Mahout's
Matrix type.

You then proceed by alternating between cluster assignments and recomputing
the centroid matrix C till convergence, based on some test or simply limited
by an epoch count budget, your choice.

Cluster assignments: here, we go over the current generation of A and
recompute the centroid index for each row in A. Once we recompute an index,
we put it into the row key. You can do that by assigning centroid indices to
keys of A using operator mapBlock() (details in [2], [3], [4]). You also
need to broadcast C in order to be able to access it in an efficient manner
inside the mapBlock() closure. Examples of that are plenty in [2].
Essentially, in mapBlock, you'd reform the row keys to reflect the cluster
index in C. While going over A, you'd have a "nearest neighbor" problem to
solve for each row of A and the centroids C. This is the bulk of the
computation really, and there are a few tricks there that can speed this
step up in both exact and approximate manner, but you can start with a naive
search.

Centroid recomputation:
once you have assigned centroids to the keys of matrix A, you'd want to do
an aggregating transpose of A to compute, essentially, the average of the
rows of A grouped by the centroid key. The trick is to do a computation of
(1|A)', which results in a matrix of the shape (counts|sums of cluster
rows). This is the part I find difficult to explain without latex graphics.

In Samsara, construction of (1|A)' corresponds to the DRM expression

(1 cbind A).t (again, see [2]).
So when you compute, say,

B = (1 | A)',

then B is (n+1) x k, so each column contains a vector corresponding to a
cluster 1..k. In such a column, the first element would be the # of points
in the cluster, and the rest of it would correspond to the sum of all
points. So in order to arrive at an updated matrix C, we need to collect B
into memory, and slice out the counters (first row) from the rest of it.

So, to compute C:

C <- B(2:,:) each row divided by B(1,:)

(watch out for empty clusters with 0 elements; this will cause lack of
convergence and NaNs in the newly computed C).

This operation obviously uses subblocking and row-wise iteration over B, for
which I am again making reference to [2].

[1] https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L149

[2] Samsara manual, a bit dated but viable: http://apache.github.io/mahout/doc/ScalaSparkBindings.html

[3] scaladoc, again dated but largely viable for the purpose of this
exercise:
http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.htm

[4] mapBlock etc.: http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps

On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI PARTH BHARAT <
h2016...@pilani.bits-pilani.ac.in> wrote:

> @Dmitriy, can you please again tell me the approach to move ahead.
>
>
> Thanks
> Parth Khatwani
>
>
> On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI PARTH BHARAT <
> h2016...@pilani.bits-pilani.ac.in> wrote:
>
> > yes, I am unable to figure out the way ahead.
> > Like how to create the augmented matrix A := (0|D) which you have
> > mentioned.
> >
> >
> > On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy Lyubimov
> > wrote:
> >
> >> has my reply to your post on @user been a bit confusing?
> >>
> >> On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI PARTH BHARAT <
> >> h2016...@pilani.bits-pilani.ac.in> wrote:
> >>
> >> > Sir,
> >> > I am trying to write the kmeans clustering algorithm using Mahout
> >> > Samsara, but I am a bit confused about how to leverage the
> >> > Distributed Row Matrix for the same. Can anybody help me with the
> >> > same?
> >> >
> >> > Thanks
> >> > Parth Khatwani
> >
> >
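The centroid-recomputation step described above (group the rows of A by their assigned cluster key, take per-cluster counts and sums via the (1|A)' trick, then divide the sums by the counts) can be sketched in plain Scala without the DRM machinery. This is only an illustration under assumed types: a Seq of (clusterKey, point) pairs stands in for the keyed DRM, and a Map stands in for the collected matrix C. Note that an empty cluster simply produces no entry here, rather than the NaN column the email warns about:

```scala
// Plain-Scala sketch of the centroid update: for each cluster key, the
// count plays the role of the first entry of the corresponding (1|A)'
// column, and the per-dimension sums play the role of the remaining entries.
object CentroidUpdateSketch {
  // rows: (clusterKey, point); returns the new centroid for each cluster key
  def recomputeCentroids(rows: Seq[(Int, Array[Double])]): Map[Int, Array[Double]] =
    rows.groupBy(_._1).map { case (key, group) =>
      val count = group.size.toDouble            // "first row" of B: points per cluster
      val dim   = group.head._2.length
      val sums  = new Array[Double](dim)         // "remaining rows" of B: coordinate sums
      for ((_, p) <- group; i <- 0 until dim) sums(i) += p(i)
      key -> sums.map(_ / count)                 // C <- sums divided by counts
    }
}
```

For example, two points (1,1) and (3,3) assigned to cluster 0 yield the centroid (2,2), matching the "average of rows grouped by centroid key" that the aggregating transpose computes in the distributed setting.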
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
@Dmitriy, can you please again tell me the approach to move ahead.

Thanks
Parth Khatwani

On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI PARTH BHARAT <
h2016...@pilani.bits-pilani.ac.in> wrote:

> yes, I am unable to figure out the way ahead.
> Like how to create the augmented matrix A := (0|D) which you have
> mentioned.
>
>
> On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy Lyubimov
> wrote:
>
>> has my reply to your post on @user been a bit confusing?
>>
>> On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI PARTH BHARAT <
>> h2016...@pilani.bits-pilani.ac.in> wrote:
>>
>> > Sir,
>> > I am trying to write the kmeans clustering algorithm using Mahout
>> > Samsara, but I am a bit confused about how to leverage the Distributed
>> > Row Matrix for the same. Can anybody help me with the same?
>> >
>> > Thanks
>> > Parth Khatwani
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
yes, I am unable to figure out the way ahead.
Like how to create the augmented matrix A := (0|D) which you have mentioned.

On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy Lyubimov wrote:

> has my reply to your post on @user been a bit confusing?
>
> On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI PARTH BHARAT <
> h2016...@pilani.bits-pilani.ac.in> wrote:
>
> > Sir,
> > I am trying to write the kmeans clustering algorithm using Mahout
> > Samsara, but I am a bit confused about how to leverage the Distributed
> > Row Matrix for the same. Can anybody help me with the same?
> >
> > Thanks
> > Parth Khatwani
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
has my reply to your post on @user been a bit confusing?

On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI PARTH BHARAT <
h2016...@pilani.bits-pilani.ac.in> wrote:

> Sir,
> I am trying to write the kmeans clustering algorithm using Mahout
> Samsara, but I am a bit confused about how to leverage the Distributed
> Row Matrix for the same. Can anybody help me with the same?
>
> Thanks
> Parth Khatwani
Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Sir,
I am trying to write the kmeans clustering algorithm using Mahout Samsara,
but I am a bit confused about how to leverage the Distributed Row Matrix for
the same. Can anybody help me with the same?

Thanks
Parth Khatwani