RE: Algorithm implementations in Pig
Thanks for the clarification, Ankur. Do you have any performance comparison between Pig 0.6 and Hadoop? I would be interested to look at it. The last comparison I heard about was in http://osdir.com/ml/hive-user-hadoop-apache/2009-06/msg00078.html. Pig 0.7.0 seems interesting. Thanks for sharing the information; I am looking forward to experimenting with it.

Thanks
Pallavi

-----Original Message-----
From: Ankur C. Goel [mailto:gan...@yahoo-inc.com]
Sent: Wednesday, February 24, 2010 1:24 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Algorithm implementations in Pig

Pallavi,
Thanks for your comments. Some clarifications w.r.t. Pig:

Pig does not generate any M/R code. What it generates are logical, physical, and map-reduce plans, which are nothing but DAGs. The map-reduce plan is then interpreted by Pig's own mappers/reducers. The plan generation itself is done on the client side and takes a few seconds, or minutes if you have a really big script.

About performance tuning in Hadoop: all the M/R parameters can be adjusted in Pig to have the same effect they'd have in Java M/R programs. Pig 0.7 is moving towards using Hadoop's input/output formats in its load/store functions, so your custom I/O formats can be reused with little additional effort. Pig also provides very nice features like multi-query optimization and skewed merge join that are hard to implement in Java M/R every time you need them. With the latest Pig release, 0.6, the performance gap between Java M/R and Pig has been narrowed to a good extent. Simple statistical measures that you would use to understand or preprocess your data take just a few lines of Pig code, and a lot of utility UDFs are available for that.

Besides all the good things, I agree that there are compatibility issues running pig-x on hadoop-y, but this also has to do with new features of Hadoop that Pig is able to exploit in its pipeline.

I also agree with the general opinion that for Pig's adoption in Mahout land it should play well with Mahout's vector formats. At the moment I don't have the free time to look into this, but I will surely get back to evaluating the feasibility of this integration in the coming few weeks. Till then, any of the interested folks can file a JIRA for this and work on it.

On 2/24/10 12:27 PM, "Palleti, Pallavi" pallavi.pall...@corp.aol.com wrote:

I too have a mixed opinion w.r.t. Pig. Pig would be a good choice to quickly prototype and test. However, the following are the pitfalls I have observed in Pig. It is not easy to debug in Pig. Also, it has performance issues, as it is a layer on top of Hadoop, so there is the overhead of converting Pig into map-reduce code. Also, when the code is written directly in Hadoop, it is in the developer's/user's hands to improve performance using various parameters, say, the number of mappers, different input formats, etc. That is not the case with Pig. Also, there are some compatibility issues between Pig and Hadoop. Say I am using pig-x on hadoop-y; there might be compatibility issues, and one needs to spend time resolving them, as it is not easy to figure out the errors. I believe the main motto of Mahout is to provide scalable algorithms which can be used to solve real-world problems. In such a case, if Pig has got rid of the above pitfalls, then it would be a good choice, as it needs much less development effort.
Thanks
Pallavi

-----Original Message-----
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: Monday, February 22, 2010 11:32 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Algorithm implementations in Pig

As an interesting test case, can you write a Pig program that counts words, BUT takes an input file name AND an input field name?

On Mon, Feb 22, 2010 at 9:56 AM, Ted Dunning ted.dunn...@gmail.com wrote:

That isn't an issue here. It is the invocation of Pig programs, and passing useful information to them, that is the problem.

On Mon, Feb 22, 2010 at 9:20 AM, Ankur C. Goel gan...@yahoo-inc.com wrote:

The scripting ability, while still limited, has better streaming support, so you can have relations streamed into a custom script executing in either the map or the reduce phase, depending upon where it is placed.

--
Ted Dunning, CTO
DeepDyve
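For what it's worth, one hedged way to meet Ted's test case is to embed Pig via its Java PigServer API and splice the field name into the script text, since Pig Latin of that era had no first-class way to parameterize a field reference. A minimal sketch, assuming a tab-separated input with an inline-declared schema; the three-column schema and the class name are illustrative assumptions, not something from this thread:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class FieldWordCount {
      public static void main(String[] args) throws Exception {
        String inputFile = args[0];  // e.g. "docs.tsv"
        String fieldName = args[1];  // e.g. "body"

        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // The three-column, tab-separated schema is an assumption of this
        // sketch; the requested field must be one of the declared names.
        pig.registerQuery("docs = LOAD '" + inputFile
            + "' AS (id:chararray, title:chararray, body:chararray);");
        // The field name has to be spliced into the script text itself.
        pig.registerQuery("words = FOREACH docs GENERATE FLATTEN(TOKENIZE("
            + fieldName + ")) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        pig.store("counts", inputFile + ".wordcounts");
      }
    }

Passing the file name is trivial (it is just a string inside LOAD); the field name is the awkward part, which seems to be exactly Ted's point: the script text itself has to change.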
RE: Fuzzy K Means
How many iterations of FuzzyKMeans are you running? Here is my observation: when I ran for a few iterations, the cluster centroids were far off. However, when I ran for more than 50 iterations or so, the cluster points were still different, but they were very close to each other, almost as if they were the same. By the way, I am using m=3 in the membership function.

Thanks
Pallavi

-----Original Message-----
From: Robin Anil [mailto:robin.a...@gmail.com]
Sent: Wednesday, February 17, 2010 8:10 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Fuzzy K Means

Tests are passing fine, but not when testing Reuters.

On Wed, Feb 17, 2010 at 8:07 PM, Pallavi Palleti pallavi.pall...@corp.aol.com wrote:

If we just need to verify with some sample dataset, we already have the data in the TestFuzzyKMeansClustering code. Won't that suffice? Otherwise, I would need to manually generate a sample dataset, as I don't have such a small dataset with me. I am actually running on MovieLens data, using movie ratings as the vector (movie as dimension, rating as coefficient) and user as the point.

Thanks
Pallavi

Robin Anil wrote:

I tracked the versions back to before the change to Writables was done. There is no significant change in the code. Can you give me a small dataset, say 10 points with maybe 5 dimensions? I can verify the trunk in that case.

Robin

On Wed, Feb 17, 2010 at 7:49 PM, Pallavi Palleti pallavi.pall...@corp.aol.com wrote:

I have a local version which I submitted long back, and I am using it on real data; it is not giving the same point for all clusters. However, I haven't tried the latest Mahout code. I have kept my code outputting data as text so that it is easy for me to verify. The current Mahout code, however, outputs it as binary data (as a SequenceFile), so it is difficult to verify.

Thanks
Pallavi

Robin Anil wrote:

Have you verified the trunk code on some real data? I am getting the same point for all clusters regardless of the distance measure.

Robin

On Wed, Feb 17, 2010 at 6:41 PM, Pallavi Palleti pallavi.pall...@corp.aol.com wrote:

Yes, it shouldn't be a problem. My point was that we are inheriting numPoints as part of ClusterBase, though we are not using it in SoftCluster. Other than that, I don't see any issue w.r.t. functionality.

Thanks
Pallavi

Robin Anil wrote:

In the impl of SoftCluster, on write-out it calculates the centroid and writes it, and on read(in) it reads the centroid into the center. In ClusterDumper it reads into the ClusterBase and does value.getCenter(). It should work normally, right?

Robin

On Wed, Feb 17, 2010 at 6:02 PM, Pallavi Palleti pallavi.pall...@corp.aol.com wrote:

Yes, but not the total number of points. So, the numPoints from ClusterBase will not be used in SoftCluster. numPoints is specific to k-means, similar to the weighted point total for fuzzy k-means.

Robin Anil wrote:

The center is still the averaged-out centroid, right? weightedTotalVector/totalProbWeight

On Wed, Feb 17, 2010 at 5:10 PM, Pallavi Palleti pallavi.pall...@corp.aol.com wrote:

I haven't yet gone through ClusterDumper. However, ClusterBase averages out over the number of points (pointTotal/numPoints, as per k-means), whereas SoftCluster has a weighted point total. So, I am wondering how we can reuse ClusterBase here?

Thanks
Pallavi

Robin Anil wrote:

Yes, so that the cluster dumper can print it out.

On Wed, Feb 17, 2010 at 5:02 PM, Pallavi Palleti pallavi.pall...@corp.aol.com wrote:

Hi Robin, when you say reusing ClusterBase, are you planning to extend ClusterBase in SoftCluster? For example, SoftCluster extends ClusterBase?
Thanks
Pallavi

Robin Anil wrote:

I have been trying to convert FuzzyKMeans' SoftCluster (which should ideally be named FuzzyKMeansCluster) to use the ClusterBase. I am getting *the same center* for all the clusters. To aid the conversion, all I did was remove the center vector from the SoftCluster class and reuse the one from ClusterBase. This essentially makes no change in the tests, which pass correctly. So I am questioning whether the implementation keeps the average center at all. Anyone who has used FuzzyKMeans experiencing this?

Robin
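For reference, the standard fuzzy c-means update that SoftCluster is meant to implement, restated here from the textbook definition (per the Wikipedia article cited on the related JIRA issue; it is not quoted from this thread), for point $x_i$, cluster $j$, and fuzziness $m > 1$:

    u_{ij} = \frac{1}{\sum_{k=1}^{K} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{2/(m-1)}},
    \qquad
    c_j = \frac{\sum_i u_{ij}^{\,m}\, x_i}{\sum_i u_{ij}^{\,m}}

The weightedTotalVector/totalProbWeight quantities discussed above correspond to the numerator and denominator of the $c_j$ update. Note also that larger $m$ flattens all memberships toward $1/K$, which pulls every centroid toward the overall data mean; with m = 3, that could partly explain the observation that the centers end up very close together after many iterations.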
RE: Yourkit License for all of you
Hi Robin,
I would like to have the license, if possible.

Thanks
Pallavi

-----Original Message-----
From: Robin Anil [mailto:robin.a...@gmail.com]
Sent: Wednesday, September 02, 2009 2:03 PM
To: mahout-dev
Subject: Yourkit License for all of you

Dear Mahout Devs,
The YourKit sales rep gave me my open-source license. If anyone else would like to get one, I can aggregate and send all the requests to him. If you would like to have an open-source license for the YourKit Profiler, reply on this thread within 24 hours of this email. I will compile the list and send it across.

Robin
RE: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
Yeah. But I am wondering how the test cases succeeded? I ran them using the mvn clean install command.

Thanks
Pallavi

-----Original Message-----
From: Jeff Eastman [mailto:j...@windwardsolutions.com]
Sent: Thursday, March 19, 2009 9:56 AM
To: mahout-dev@lucene.apache.org
Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

The synthetic control k-means job calls the canopy job to build its initial clusters, as is commonly done. If the k-means record format was changed and the canopy job not changed accordingly, then everything would still compile, but there would be a mismatch when the k-means mapper tried to read in the clusters.

Jeff

Richard Tomsett (JIRA) wrote:

[ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683252#action_12683252 ]

Richard Tomsett commented on MAHOUT-99:
---------------------------------------

Yup, just downloaded the latest trunk, ran with Hadoop 0.19.1, and I get the same error on the synthetic control example. It seems to be because the new KMeans code uses a KeyValueLineRecordReader object to read the input cluster centres from the canopy clustering output, but the canopy clustering job outputs a SequenceFile (and the old KMeans code read in a SequenceFile for the cluster centres). Think that's the problem at least; I'll have a quick play.

Improving speed of KMeans
-------------------------
Key: MAHOUT-99
URL: https://issues.apache.org/jira/browse/MAHOUT-99
Project: Mahout
Issue Type: Improvement
Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
Fix For: 0.1
Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch

Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info as a formatted string was being sent. Also removed the implicit "combiner runs only once" assumption, and the code is modified accordingly so that it won't create a bug when the combiner runs zero or more than once.
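As a rough sketch of the fix Richard is pointing at, reading the canopy output back as the SequenceFile it actually is, with the Hadoop API of that era, might look like the following. The path and the Text/Text key-value types are illustrative assumptions here; the actual types depend on what the canopy reducer writes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ReadCanopyClusters {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Illustrative path to the canopy job's output part file.
        Path canopyPart = new Path("output/canopies/part-00000");
        // Read it as a SequenceFile, rather than as key/value text
        // lines (which is what KeyValueLineRecordReader expects).
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, canopyPart, conf);
        Text key = new Text();
        Text value = new Text();
        while (reader.next(key, value)) {
          System.out.println(key + " => " + value); // cluster id => formatted cluster
        }
        reader.close();
      }
    }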
RE: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
It depends on the kind of output. If we are outputting only some numeric values, then it is preferable to use a SequenceFile, as the data is written as binary. If not, it is preferable to write simple text: a text file is readable, whereas binary is not. As we treat the data as text in the reducers of both Canopy and KMeans, I don't see any performance improvement in using SequenceFile. So, I used TextInputFormat, which is read-friendly.

Thanks
Pallavi

-----Original Message-----
From: Jeff Eastman [mailto:j...@windwardsolutions.com]
Sent: Thursday, March 19, 2009 10:19 AM
To: mahout-dev@lucene.apache.org
Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Also, why not consider just converting canopy? Which reader is better?

Jeff Eastman wrote:

Sure, why don't you go ahead and post a patch?

Pallavi Palleti (JIRA) wrote:

[ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683312#action_12683312 ]

Pallavi Palleti commented on MAHOUT-99:
---------------------------------------

I had used KeyValueLineRecordReader internally for my code and forgot to revert back to the SequenceFile reader. Will it be sufficient to add another patch on the latest code, modifying only KMeansDriver to use the SequenceFile reader? Kindly let me know.

Thanks
Pallavi
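For context, the choice being debated comes down to a one-line job configuration on the old mapred API. A hedged sketch (the helper class is illustrative; only the Hadoop calls are real):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class OutputFormatSketch {
      static void configure(JobConf conf, boolean humanReadable) {
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        if (humanReadable) {
          // Tab-separated text: easy to inspect by eye, as Pallavi prefers.
          conf.setOutputFormat(TextOutputFormat.class);
        } else {
          // Binary, type-preserving, and cheap to re-read in the next job.
          conf.setOutputFormat(SequenceFileOutputFormat.class);
        }
      }
    }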
RE: [jira] Commented: (MAHOUT-79) Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks
Hi Grant,
Let me know if you are still facing this issue.

Thanks
Pallavi

-----Original Message-----
From: Grant Ingersoll (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Friday, October 17, 2008 11:48 PM
To: mahout-dev@lucene.apache.org
Subject: [jira] Commented: (MAHOUT-79) Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks

[ https://issues.apache.org/jira/browse/MAHOUT-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640608#action_12640608 ]

Grant Ingersoll commented on MAHOUT-79:
---------------------------------------

Pallavi, I'm getting:

[junit] 08/10/17 14:15:03 INFO mapred.FileInputFormat: Total input paths to process : 2
[junit] 08/10/17 14:15:03 INFO mapred.FileInputFormat: Total input paths to process : 2
[junit] 08/10/17 14:15:03 INFO mapred.JobClient: Running job: job_local_0002
[junit] 08/10/17 14:15:03 INFO mapred.FileInputFormat: Total input paths to process : 2
[junit] 08/10/17 14:15:03 INFO mapred.FileInputFormat: Total input paths to process : 2
[junit] 08/10/17 14:15:03 INFO mapred.MapTask: numReduceTasks: 0
[junit] 08/10/17 14:15:03 INFO fuzzykmeans.FuzzyKMeansMapper: In Mapper Configure:
[junit] 08/10/17 14:15:03 WARN mapred.LocalJobRunner: job_local_0002
[junit] java.lang.NullPointerException: Cluster is empty!!!
[junit]     at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansMapper.configure(FuzzyKMeansMapper.java:76)
[junit]     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
[junit]     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
[junit]     at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
[junit]     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
[junit]     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
[junit]     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
[junit]     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
[junit] 08/10/17 14:15:04 WARN fuzzykmeans.FuzzyKMeansDriver: java.io.IOException: Job failed!
[junit] java.io.IOException: Job failed!
[junit]     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1113)
[junit]     at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.runClustering(FuzzyKMeansDriver.java:207)
[junit]     at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.runJob(FuzzyKMeansDriver.java:116)
[junit]     at org.apache.mahout.clustering.fuzzykmeans.TestFuzzyKmeansClustering.testFuzzyKMeansMRJob(TestFuzzyKmeansClustering.java:248)
[junit]     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit]     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
[junit]     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit]     at java.lang.reflect.Method.invoke(Method.java:597)
[junit]     at junit.framework.TestCase.runTest(TestCase.java:164)
[junit]     at junit.framework.TestCase.runBare(TestCase.java:130)
[junit]     at junit.framework.TestResult$1.protect(TestResult.java:106)
[junit]     at junit.framework.TestResult.runProtected(TestResult.java:124)
[junit]     at junit.framework.TestResult.run(TestResult.java:109)
[junit]     at junit.framework.TestCase.run(TestCase.java:120)
[junit]     at junit.framework.TestSuite.runTest(TestSuite.java:230)
[junit]     at junit.framework.TestSuite.run(TestSuite.java:225)
[junit]     at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:421)
[junit]     at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:912)
[junit]     at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:766)
[junit] ----------------------------------------------------------------
[junit] Testcase: testFuzzyKMeansMRJob(org.apache.mahout.clustering.fuzzykmeans.TestFuzzyKmeansClustering): Caused an ERROR
[junit] output/points/part-0 (No such file or directory)
[junit] java.io.FileNotFoundException: output/points/part-0 (No such file or directory)
[junit]     at java.io.FileInputStream.open(Native Method)
[junit]     at java.io.FileInputStream.<init>(FileInputStream.java:106)
[junit]     at java.io.FileInputStream.<init>(FileInputStream.java:66)
[junit]     at org.apache.mahout.clustering.fuzzykmeans.TestFuzzyKmeansClustering.testFuzzyKMeansMRJob(TestFuzzyKmeansClustering.java:257)
[junit]
[junit]
[junit] Testcase: testFuzzyKMeansReducer(org.apache.mahout.clustering.fuzzykmeans.TestFuzzyKmeansClustering): Caused an ERROR
[junit] For input string: "9.0, [s2, 0"
[junit] java.lang.NumberFormatException: For input string: "9.0, [s2, 0"
RE: Any one working on Cluto- Clustering Algorithm or similar to this?
I am interested in http://glaros.dtc.umn.edu/gkhome/node/193. I have yet to go through this publication; after that, I will be able to say more clearly. Right now, I am interested in the algorithms, and in any pruning or filtering that is done for large-dimensional sparse data sets in Cluto.

Thanks
Pallavi

-----Original Message-----
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 08, 2008 4:57 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Any one working on Cluto- Clustering Algorithm or similar to this?

On Oct 8, 2008, at 1:26 AM, Palleti, Pallavi wrote:

Hi all, I have come across Cluto (http://glaros.dtc.umn.edu/gkhome/views/cluto), a clustering package. I would like to know if there is any work going on in Mahout in this regard. If yes, I am willing to use it. If not, I might be interested in working on a similar clustering algorithm.

I don't think anyone is working on it, but I am curious about what you intend to do. The link you give points to a whole package of tools, AFAICT. Is there a publication (http://glaros.dtc.umn.edu/gkhome/cluto/cluto/publications) or algorithm that you are particularly interested in?
RE: OutOfMemory Error
Yeah, that was the problem. And Hama can surely be useful for large-scale matrix operations. But for this problem, I modified the code to pass just the ID information and read the vector information only when it is needed; in this case, it was needed only in the reducer phase. This avoided the out-of-memory error, and the job is also faster now.

Thanks
Pallavi

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Edward J. Yoon
Sent: Friday, September 19, 2008 10:35 AM
To: [EMAIL PROTECTED]; mahout-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: OutOfMemory Error

> The key is of the form ID:DenseVector representation in Mahout

I guess the vector size seems too large, so it'll need a distributed vector architecture (or 2D partitioning strategies) for large-scale matrix operations. The Hama team is investigating these problem areas. So, it will be improved if Hama can be used for Mahout in the future.

/Edward

On Thu, Sep 18, 2008 at 12:28 PM, Pallavi Palleti [EMAIL PROTECTED] wrote:

Hadoop version: 0.17.1
io.sort.factor = 10

The key is of the form "ID:DenseVector representation in Mahout" with dimensionality = 160K. For example: C1:[,0.0011, 3.002, .. 1.001,]

So, the typical size of the key of the mapper output can be 160K*6 (assuming a double in string form takes 5 bytes, plus a separator) + 5 bytes for "C1:[]" + the size required to record that the object is of type Text, i.e. roughly 1 MB per key.

Thanks
Pallavi

Devaraj Das wrote:

On 9/17/08 6:06 PM, Pallavi Palleti [EMAIL PROTECTED] wrote:

Hi all, I am getting an out-of-memory error, as shown below, when I ran map-red on a huge amount of data:

java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:52)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:90)
    at org.apache.hadoop.io.SequenceFile$Reader.nextRawKey(SequenceFile.java:1974)
    at org.apache.hadoop.io.SequenceFile$Sorter$SegmentDescriptor.nextRawKey(SequenceFile.java:3002)
    at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:2802)
    at org.apache.hadoop.io.SequenceFile$Sorter.merge(SequenceFile.java:2511)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1040)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

The above error comes almost at the end of the map job. I have set the heap size to 1 GB. Still, the problem persists. Can someone please help me avoid this error?

What is the typical size of your key? What is the value of io.sort.factor? Hadoop version?

--
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org
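The fix Pallavi describes, in sketch form on the old mapred API: emit only the short cluster ID as the map output key, rather than serializing the whole 160K-dimensional vector into it. The class name and record layout below are illustrative assumptions, not the actual Mahout code:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class IdOnlyKeyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      public void map(LongWritable offset, Text record,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // Input records look like "C1:[0.0011, 3.002, ...]".
        // Keep only "C1" as the key; the reducer re-reads the vector
        // from the stored data when (and only when) it needs it.
        String id = record.toString().split(":", 2)[0];
        output.collect(new Text(id), new Text("")); // payload elided in this sketch
      }
    }

This keeps each sort/merge record at a few bytes instead of roughly a megabyte, which is why the spill-merge phase stops running out of heap.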
RE: [jira] Updated: (MAHOUT-74) Fuzzy K-Means clustering
Great. Thanks, Grant, for the modifications.

-----Original Message-----
From: Grant Ingersoll (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 21, 2008 7:09 PM
To: mahout-dev@lucene.apache.org
Subject: [jira] Updated: (MAHOUT-74) Fuzzy K-Means clustering

[ https://issues.apache.org/jira/browse/MAHOUT-74?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-74:
----------------------------------

Attachment: MAHOUT-74.patch

Looking pretty good, Pallavi. I modified it slightly so that m is set just via the JobConf, like the other values. I think we are in pretty good shape, and I will commit soon. I also made m a float; looking at the wiki link you have there, I don't see any reason why m should be restricted to an int.

Fuzzy K-Means clustering
------------------------
Key: MAHOUT-74
URL: https://issues.apache.org/jira/browse/MAHOUT-74
Project: Mahout
Issue Type: New Feature
Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 0.1
Attachments: MAHOUT-74.patch, MAHOUT-74.patch, mahout-74.patch, mahout-74.patch

The fuzzy k-means clustering algorithm is an extension of the traditional k-means algorithm that performs soft clustering. More details about fuzzy k-means can be found here: http://en.wikipedia.org/wiki/Data_clustering#Fuzzy_c-means_clustering. I have implemented a fuzzy k-means prototype and tests in org.apache.mahout.clustering.fuzzykmeans.

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
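What "set via the JobConf like the other values" amounts to, as a hedged sketch; the property name below is hypothetical (the real key is defined in the Mahout source), and 2.0 is the commonly used default fuzziness:

    import org.apache.hadoop.mapred.JobConf;

    public class FuzzyKMeansConfigSketch {
      // Hypothetical property name; the actual key lives in the Mahout code.
      private static final String M_KEY = "org.apache.mahout.clustering.fuzzykmeans.m";

      // Driver side: stash m on the JobConf like the other parameters.
      static void setM(JobConf conf, float m) {
        conf.set(M_KEY, Float.toString(m));
      }

      // Mapper/reducer side, typically called from configure(JobConf):
      static float getM(JobConf conf) {
        return Float.parseFloat(conf.get(M_KEY, "2.0"));
      }
    }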
RE: asFormatString tests fail
Along the same lines of optimization: I added some optimizations for SparseVector, especially for operators like minus, plus, and divide. Please look at the MAHOUT-67 and MAHOUT-66 patches for these. The methods in AbstractVector were being called for minus, plus, and divide, and they iterate through all the keys irrespective of whether a key holds an empty value or not.

Thanks
Pallavi

-----Original Message-----
From: Sean Owen [mailto:[EMAIL PROTECTED]
Sent: Saturday, August 09, 2008 12:28 AM
To: mahout-dev@lucene.apache.org
Subject: Re: asFormatString tests fail

Yeah, I'm not really worried about the boxing/unboxing yet, since it is buying some code simplicity, though I took the liberty of eliminating boxing where it is redundant, like:

int c = new Integer(someString)

versus

int c = Integer.parseInt(someString)

I agree we can go to that trouble if it becomes clear it is non-trivially slowing down the code. It may well. I was more interested in more obvious wins from adjusting the use of some Collections API methods, as in dot(). I'd be happy to hack away on this, but I am hesitant about doing anything but trivial changes to code others are working on, as it might be looked at as premature. If it's viewed as a good thing, I can go for it.

How about changing asFormatString() to sort its output? Was that the right solution, or is its output order not guaranteed? I could take care of it.

On Fri, Aug 8, 2008 at 2:45 PM, Ted Dunning [EMAIL PROTECTED] wrote:

Worrying about small effects like iterating over keys or entries should be made moot by just switching to a dedicated primitive-based hash table. Trove has a nice implementation, but I believe the license would prevent its use. Colt has another, not quite as nice, implementation that is fast, and I think it comes under a BSD license. It is also very easy to hack a special-purpose structure into place.
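A standalone illustration of the point Pallavi is making; this is not the actual Mahout code, just a contrast of the two loop shapes over a toy Map-based sparse vector:

    import java.util.HashMap;
    import java.util.Map;

    public class SparseMinusSketch {
      public static void main(String[] args) {
        // Two sparse vectors of (conceptual) dimensionality 160,000,
        // with only a couple of stored entries each.
        Map<Integer, Double> a = new HashMap<Integer, Double>();
        Map<Integer, Double> b = new HashMap<Integer, Double>();
        a.put(3, 1.5);
        b.put(3, 0.5);
        b.put(42, 2.0);

        // Dense-style loop (what the generic AbstractVector methods do):
        //   for (int i = 0; i < 160000; i++) { ... }
        // 160,000 iterations, almost all of them no-ops on zeros.

        // Sparse-style loop: one iteration per stored entry of b.
        Map<Integer, Double> result = new HashMap<Integer, Double>(a);
        for (Map.Entry<Integer, Double> e : b.entrySet()) {
          Double cur = result.get(e.getKey());
          double base = (cur == null) ? 0.0 : cur.doubleValue();
          // A real implementation would also drop entries that become 0.0.
          result.put(e.getKey(), base - e.getValue());
        }
        System.out.println(result); // 3 -> 1.0, 42 -> -2.0 (iteration order unspecified)
      }
    }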