[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836543#action_12836543 ]

Ankur commented on MAHOUT-305:
------------------------------

Sean,
Thanks for filing the JIRA. Noting points from our discussion here.

1. We need to decide on the dataset to run both implementations on. I have the Netflix dataset in mind, but a strange thing I observed during my tests with it is that there were 2-3 users who rated more than 10,000 movies! This seemed a little odd to me. Can you or someone else who has had experience with the dataset validate my observation?

2. Both implementations need to run on the dataset in an identical environment to gauge performance and accuracy. For accuracy I believe we need to do a precision-recall test. My understanding of it is:
   a) Do an 80-20 split of the data (80% train and 20% test), with the split happening on a timeline.
   b) Feed the training data to the algorithm and generate recommendations for a subset of users from the training data.
   c) Compare those recommendations with the items actually present in the user's history in the test data.
   d) Calculate precision = tp / (tp + fp) = (recommendations actually present in user's history) / (total items recommended).
   e) Calculate recall = tp / (tp + fn) = (recommendations actually present in user's history) / (total items in user's history).
   f) Finally, take a simple average of both across all the users to get an approximate global precision/recall.

Please feel free to correct any of the steps above if I misunderstood anything.

Combine both cooccurrence-based CF M/R jobs
-------------------------------------------
Key: MAHOUT-305
URL: https://issues.apache.org/jira/browse/MAHOUT-305
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Priority: Minor

We have two different but essentially identical MapReduce jobs to make recommendations based on item co-occurrence: org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be merged. Not sure exactly how to approach that but noting this in JIRA, per Ankur.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
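Steps d) to f) of the procedure above can be sketched in a few lines of Java. This is only an illustration, not code from either Mahout job: the class name PrecisionRecallSketch, its method names, and the representation of recommendations and held-out history as sets of item IDs keyed by user ID are all assumptions made for the example.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Mahout: per-user precision/recall and a simple average.
final class PrecisionRecallSketch {

    /** tp = recommended items that also appear in the user's held-out (test) history. */
    static int truePositives(Set<Long> recommended, Set<Long> heldOut) {
        Set<Long> hits = new HashSet<>(recommended);
        hits.retainAll(heldOut);
        return hits.size();
    }

    /** precision = tp / (tp + fp) = tp / (total items recommended). */
    static double precision(Set<Long> recommended, Set<Long> heldOut) {
        if (recommended.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(recommended, heldOut) / recommended.size();
    }

    /** recall = tp / (tp + fn) = tp / (total items in the user's held-out history). */
    static double recall(Set<Long> recommended, Set<Long> heldOut) {
        if (heldOut.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(recommended, heldOut) / heldOut.size();
    }

    /** Simple average over all test users; returns {averagePrecision, averageRecall}. */
    static double[] averageOverUsers(Map<Long, Set<Long>> recommendedByUser,
                                     Map<Long, Set<Long>> heldOutByUser) {
        double precisionSum = 0.0;
        double recallSum = 0.0;
        int users = 0;
        for (Map.Entry<Long, Set<Long>> entry : heldOutByUser.entrySet()) {
            // A user with no recommendations counts as precision 0 and recall 0.
            Set<Long> recs = recommendedByUser.getOrDefault(
                entry.getKey(), Collections.<Long>emptySet());
            precisionSum += precision(recs, entry.getValue());
            recallSum += recall(recs, entry.getValue());
            users++;
        }
        if (users == 0) {
            return new double[] {0.0, 0.0};
        }
        return new double[] {precisionSum / users, recallSum / users};
    }
}
```

Treating an empty recommendation list as precision 0 is a choice made here for simplicity; skipping such users instead would give a different average.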
Re: Algorithm implementations in Pig
Hi, Glad to hear here that mahout devs are interested in pig. Actually I believe pig is very helpful when you want to quickly implement a prototype of machine learning algorithms. And Pig has java API, it is easy to integrate pig script with java. Maybe we can start with implementing NB using pig first. On Mon, Feb 22, 2010 at 3:56 PM, Ted Dunning ted.dunn...@gmail.com wrote: I have had both positive and negative results with PIG. The positive results were that I was able to express large recommendation computations in a very concise way. That was really helpful. My negative results have been to do with the brittle nature of PIG vis a vis the version of the underlying hadoop system. That problem may have abated somewhat as everybody in the world except me and Amazon's EMR has pretty much piled up on version 20. I also know little about how Pig would interface well with other components. I know that I have had difficulty in the past injecting outside information into Pig, but that has been improved. I also know that Pigs eat anything, but have no clear idea how well this would play out with, say, our vector formats and vectorizers. Ankur, what recent experience do you have? How well do PIG scripts play with other programs any more? On Sun, Feb 21, 2010 at 11:41 PM, Ankur C. Goel gan...@yahoo-inc.com wrote: I had Sean's opinion on this and he was not too comfortable with the Idea of having things in different languages in Mahout. However, given the benefits of PIG, I feel otherwise. I may be biased here due to my own experience of being able to do more in lesser time in Pig then in M/R, so I thought let me ask how folks feel. Ted, I believe you have some PIG experience yourself so any thoughts on this ? -- Ted Dunning, CTO DeepDyve -- Best Regards Jeff Zhang
Re: Algorithm implementations in Pig
I see pig as useful for data preparation, but for any numerical tasks, it is likely to be completely hopeless. On Mon, Feb 22, 2010 at 12:16 AM, Jeff Zhang zjf...@gmail.com wrote: Glad to hear here that mahout devs are interested in pig. Actually I believe pig is very helpful when you want to quickly implement a prototype of machine learning algorithms. And Pig has java API, it is easy to integrate pig script with java. Maybe we can start with implementing NB using pig first. -- Ted Dunning, CTO DeepDyve
Re: Algorithm implementations in Pig
Pig can only make the implementation of map-reduce easier; the numerical computation can be done in a UDF. And Piglet is a DSL on top of Pig Latin which adds loop support to Pig.
http://github.com/iconara/piglet

On Mon, Feb 22, 2010 at 4:25 PM, Ted Dunning ted.dunn...@gmail.com wrote:

> I see pig as useful for data preparation, but for any numerical tasks, it is likely to be completely hopeless.
>
> On Mon, Feb 22, 2010 at 12:16 AM, Jeff Zhang zjf...@gmail.com wrote:
>
> > Glad to hear here that mahout devs are interested in pig. Actually I believe pig is very helpful when you want to quickly implement a prototype of machine learning algorithms. And Pig has java API, it is easy to integrate pig script with java. Maybe we can start with implementing NB using pig first.
>
> --
> Ted Dunning, CTO
> DeepDyve

--
Best Regards
Jeff Zhang
Re: Algorithm implementations in Pig
On Mon, Feb 22, 2010 at 1:55 PM, Ted Dunning ted.dunn...@gmail.com wrote:

> I see pig as useful for data preparation, but for any numerical tasks, it is likely to be completely hopeless.

PIG will be a great tool for experimenting quickly with algorithms. But with people here trying to focus on using Vector to standardize the input/output process, it will be tough for the small bunch here to port that to PIG, or to help PIG scripts reuse it. As long as the input and output of PIG-based algorithms are based on VectorWritable, I don't see any problem with including PIG. But bear in mind that the previous PIG submission, https://issues.apache.org/jira/browse/MAHOUT-106, still hasn't moved into trunk. If anyone is willing to help standardize on using PIG with vectors as input, they are more than welcome. One thing we definitely don't want at this point is for every algorithm to have its own kind of input format.

Robin
Re: Algorithm implementations in Pig
Ted,

The latest Pig release, 0.6.0 on Hadoop 20, is a clear winner, not just for performance but also for doing a better job of managing memory in its MR job pipeline. Support for both inner and outer skewed joins is also something that I found indispensable when dealing with really large datasets. There is support for streaming in Pig that lets you stream your relation through an external Perl/Python/Ruby... script. Support for UDFs in scripting languages is also expected in the near future.

About interfacing with other systems, I assume you have an RDBMS in mind. There is a patch (for Pig 0.7) that lets you write directly from Pig to an RDBMS like MySQL. Support for writing directly to HBase was always there and has been improved, I believe.

With the 0.7 release, Pig has decided to let its load/store functions rely on Hadoop's input/output formats, so our vector format shouldn't be a problem IMHO. The only thing I am concerned about is the not-too-efficient Tuple implementation in Pig, which does not give performance equivalent to Java MR.

Recently I implemented shingling in Pig and found it to work beautifully. One problem that I hit had to do with using clusters to generate recommendations, since some clusters were quite large ( 10 K). For this I needed to do a self-join and wanted the join load to be split evenly. That's where skewed join came to the rescue.

Apart from this, I also want to contribute my implementation to Mahout (the reason for starting this thread :-))

-...@nkur

On 2/22/10 1:26 PM, Ted Dunning ted.dunn...@gmail.com wrote:

> I have had both positive and negative results with PIG.
>
> The positive results were that I was able to express large recommendation computations in a very concise way. That was really helpful.
>
> My negative results have been to do with the brittle nature of PIG vis a vis the version of the underlying hadoop system. That problem may have abated somewhat as everybody in the world except me and Amazon's EMR has pretty much piled up on version 20.
>
> I also know little about how Pig would interface well with other components. I know that I have had difficulty in the past injecting outside information into Pig, but that has been improved. I also know that Pigs eat anything, but have no clear idea how well this would play out with, say, our vector formats and vectorizers.
>
> Ankur, what recent experience do you have? How well do PIG scripts play with other programs any more?
>
> On Sun, Feb 21, 2010 at 11:41 PM, Ankur C. Goel gan...@yahoo-inc.com wrote:
>
> > I had Sean's opinion on this and he was not too comfortable with the Idea of having things in different languages in Mahout. However, given the benefits of PIG, I feel otherwise. I may be biased here due to my own experience of being able to do more in lesser time in Pig then in M/R, so I thought let me ask how folks feel. Ted, I believe you have some PIG experience yourself so any thoughts on this ?
>
> --
> Ted Dunning, CTO
> DeepDyve
Re: Algorithm implementations in Pig
I'm all for Pig, especially once we are a TLP. I haven't had the proper time to review the PLSI implementation, but it looks useful. I agree on the other points, though, in that I think we it would be nice to have consistent formats based on Vector so that things can be more portable. On Feb 22, 2010, at 2:41 AM, Ankur C. Goel wrote: Hi Folks, I would like to know how mahout community feels about having some of the Mahout algorithms implemented in pig - http://hadoop.apache.org/pig. The benefits of using Pig are many including. 1. Small learning curve, people with a bit of SQL knowledge will find it very easy. 2. Operations like grouping, aggregations, join need just few lines of pig code. 3. Insulation against Hadoop complexity - Job chains and JobConf. 4. Quick prototyping and hence increased programmer productivity. I had Sean's opinion on this and he was not too comfortable with the Idea of having things in different languages in Mahout. However, given the benefits of PIG, I feel otherwise. I may be biased here due to my own experience of being able to do more in lesser time in Pig then in M/R, so I thought let me ask how folks feel. Ted, I believe you have some PIG experience yourself so any thoughts on this ? Regards -...@nkur
[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836598#action_12836598 ]

Robin Anil commented on MAHOUT-300:
-----------------------------------

To calculate the speed in MB/s, we should be multiplying by sparsity instead of cardinality for the Sparse and Seq vectors, and by cardinality for the dense vector.

Solve performance issues with Vector Implementations
----------------------------------------------------
Key: MAHOUT-300
URL: https://issues.apache.org/jira/browse/MAHOUT-300
Project: Mahout
Issue Type: Improvement
Affects Versions: 0.3
Reporter: Robin Anil
Fix For: 0.3
Attachments: MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch

AbstractVector operations like times

    public Vector times(double x) {
      Vector result = clone();
      Iterator<Element> iter = iterateNonZero();
      while (iter.hasNext()) {
        Element element = iter.next();
        int index = element.index();
        result.setQuick(index, element.get() * x);
      }
      return result;
    }

should be implemented as follows

    public Vector times(double x) {
      Vector result = clone();
      Iterator<Element> iter = result.iterateNonZero();
      while (iter.hasNext()) {
        Element element = iter.next();
        element.set(element.get() * x);
      }
      return result;
    }
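The MB/s correction suggested in this comment can be sketched as follows. The 12 bytes per element (a 4-byte int index plus an 8-byte double value) is an assumption inferred from the figures posted later in this thread, where 466537.6 operations/sec at cardinality 1000 is reported as 5598.451 MB/s; the class and method names are hypothetical and do not come from the benchmark patch.

```java
// Hypothetical helper, not part of the MAHOUT-300 patch.
final class ThroughputSketch {

    // Assumption: each element costs an int index (4 bytes) plus a double value (8 bytes).
    static final int BYTES_PER_ELEMENT = 12;

    /** Elements actually touched per operation: all of them for dense, only non-zeros for sparse. */
    static int elementsTouched(boolean dense, int cardinality, int sparsity) {
        return dense ? cardinality : sparsity;
    }

    /** Converts operations per second into MB/s (1 MB taken as 10^6 bytes). */
    static double megabytesPerSecond(double opsPerSecond, int elementsTouched) {
        return opsPerSecond * elementsTouched * BYTES_PER_ELEMENT / 1e6;
    }
}
```

With this split, a sparse vector of cardinality 1000 and sparsity 300 is charged for 300 elements per operation rather than 1000, which is the change the comment asks for.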
Re: Algorithm implementations in Pig
Seems like the guys at twitter are going down the pig/hadoop http://highscalability.com/blog/2010/2/19/twitters-plan-to-analyze-100-billion-tweets.html route could be worth getting them on board the Mahout wagon especially with previous discussion had about classification efforts http://old.nabble.com/Twitter-Classification-td27227638.html On 22 Feb 2010, at 12:13, Grant Ingersoll wrote: I'm all for Pig, especially once we are a TLP. I haven't had the proper time to review the PLSI implementation, but it looks useful. I agree on the other points, though, in that I think we it would be nice to have consistent formats based on Vector so that things can be more portable. On Feb 22, 2010, at 2:41 AM, Ankur C. Goel wrote: Hi Folks, I would like to know how mahout community feels about having some of the Mahout algorithms implemented in pig - http://hadoop.apache.org/pig. The benefits of using Pig are many including. 1. Small learning curve, people with a bit of SQL knowledge will find it very easy. 2. Operations like grouping, aggregations, join need just few lines of pig code. 3. Insulation against Hadoop complexity - Job chains and JobConf. 4. Quick prototyping and hence increased programmer productivity. I had Sean's opinion on this and he was not too comfortable with the Idea of having things in different languages in Mahout. However, given the benefits of PIG, I feel otherwise. I may be biased here due to my own experience of being able to do more in lesser time in Pig then in M/R, so I thought let me ask how folks feel. Ted, I believe you have some PIG experience yourself so any thoughts on this ? Regards -...@nkur
[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836624#action_12836624 ]

Robin Anil commented on MAHOUT-300:
-----------------------------------

I think the irregularity is due to the sparse vector generation process, where duplicate index values could get generated, leaving some vectors much sparser than the sparsity value:

{code}
Vector v = new SequentialAccessSparseVector(cardinality, sparsity); // sparsity!
int[] indexes = new int[sparsity];
double[] values = new double[sparsity];
for (int j = 0; j < sparsity; j++) {
  double value = r.nextGaussian();
  int index = sparsity < cardinality ? r.nextInt(cardinality) : j;
  v.set(index, value);
  indexes[j] = index;
  values[j] = value;
}
{code}

Instead I suggest this:

{code}
Vector v = new SequentialAccessSparseVector(cardinality, sparsity); // sparsity!
boolean[] featureSpace = new boolean[cardinality];
int[] indexes = new int[sparsity];
double[] values = new double[sparsity];
int j = 0;
while (j < sparsity) {
  double value = r.nextGaussian();
  int index = r.nextInt(cardinality);
  if (featureSpace[index] == false) {
    featureSpace[index] = true;
    indexes[j] = index;
    values[j++] = value;
    v.set(index, value);
  }
}
{code}
[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836630#action_12836630 ]

Robin Anil commented on MAHOUT-300:
-----------------------------------

Ted, your loop structure seems to be slower by about 150 MB/s than the null-based impl. Does it need more loops before optimisations kick in?
[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836633#action_12836633 ]

Sean Owen commented on MAHOUT-300:
----------------------------------

Tiny comment -- will probably be wise to use BitSet rather than boolean[], as booleans are stored as full 32 bit value (!). A 32x reduction in memory is non-trivial with cardinalities in the millions.
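A minimal sketch of the BitSet variant suggested here, applied to the duplicate-free index generation proposed earlier in the thread. The class and method names are hypothetical. One caveat on the memory figure: on common HotSpot JVMs a boolean[] typically takes one byte per element, so the saving from java.util.BitSet (one bit per element) is usually closer to 8x than 32x; the exact layout is JVM-specific.

```java
import java.util.BitSet;
import java.util.Random;

// Hypothetical helper, not part of the MAHOUT-300 patch.
final class UniqueIndexSketch {

    /**
     * Draws `sparsity` distinct indexes in [0, cardinality) by rejection sampling,
     * tracking already-used indexes in a BitSet instead of a boolean[].
     */
    static int[] distinctIndexes(int cardinality, int sparsity, Random r) {
        if (sparsity > cardinality) {
            throw new IllegalArgumentException("sparsity must not exceed cardinality");
        }
        BitSet seen = new BitSet(cardinality);
        int[] indexes = new int[sparsity];
        int j = 0;
        while (j < sparsity) {
            int index = r.nextInt(cardinality);
            if (!seen.get(index)) {
                // First time this index is drawn: record it and advance.
                seen.set(index);
                indexes[j++] = index;
            }
        }
        return indexes;
    }
}
```

Rejection sampling slows down as sparsity approaches cardinality, because most draws then hit an index that is already taken; for nearly dense vectors a shuffle of all indexes would be a more predictable choice.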
[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836649#action_12836649 ]

Robin Anil commented on MAHOUT-300:
-----------------------------------

On dense data 1000, 1000

{noformat}
BenchMarks      DenseVector  RandSparseVector  SeqSparseVector  Dense.dot(Rand)  Dense.dot(Seq)  Rand.dot(Dense)  Rand.dot(Seq)  Seq.dot(Dense)  Seq.dot(Rand)
DotProduct
  nCalls        2            2                 2                2                2               2                2              2               2
  sum (s)       0.042869     1.139837          0.293336         0.882977         0.452817        1.330815         0.843993       0.931822        1.093099
  min (ms)      0.0010       0.046             0.01             0.03             0.011           0.049            0.027          0.036           0.049
  max (ms)      2.717        21.51             3.156            25.346           26.567          14.738           53.265         9.44            4.017
  mean (ms)     0.002143     0.056991          0.014666         0.044148         0.02264         0.06654          0.042199       0.046591        0.054654
  stdDev (ms)   0.027798     0.194404          0.053138         0.30642          0.255753        0.212913         0.446643       0.131948        0.054681
  Speed (/sec)  466537.6     17546.367         68181.195        22650.646        44167.953       15028.385        23696.877      21463.326       18296.604
  Rate (MB/s)   5598.451     210.55641         818.17444        271.80777        530.01544       180.34062        284.36255      257.55994       219.55927
{noformat}

On Sparse Data (1000, 300). Don't compare the MB/s; see the unit/s.

{noformat}
BenchMarks      DenseVector  RandSparseVector  SeqSparseVector  Dense.dot(Rand)  Dense.dot(Seq)  Rand.dot(Dense)  Rand.dot(Seq)  Seq.dot(Dense)  Seq.dot(Rand)
DotProduct
  nCalls        2            2                 2                2                2               2                2              2               2
  sum (s)       0.048355     0.569326          0.338478         0.408213         0.205143        0.469473         0.242953       0.291587        0.362947
  min (ms)      0.0010       0.018             0.011            0.012            0.0040          0.017            0.01           0.011           0.014
  max (ms)      6.525        33.768            3.936            26.649           27.028          3.969            3.042          4.704           7.04
  mean (ms)     0.002417     0.028466          0.016923         0.02041          0.010257        0.023473         0.012147       0.014579        0.018147
  stdDev (ms)   0.062427     0.302488          0.059426         0.237577         0.222142        0.05819          0.026846       0.06257         0.06777
  Speed (/sec)  413607.7     35129.258         59088.03         48994.03         97492.96        42600.957        82320.45       68590.164
{noformat}
[jira] Updated: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-300:
------------------------------
Attachment: MAHOUT-300.patch
[jira] Updated: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-300:
------------------------------
Attachment: MAHOUT-300.patch

Increased loop by 3x to give more stability to perf values
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1283#action_1283 ]

Ankur commented on MAHOUT-305:
------------------------------

Hey Sean,
Have you played with the Netflix dataset? Are there really users who have rated more than 10,000 movies? For the PR test, do we have something already that will work in this case, or is some coding required? Any other thoughts?
[jira] Assigned: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur reassigned MAHOUT-305:
----------------------------
Assignee: Ankur
[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836679#action_12836679 ]

Robin Anil commented on MAHOUT-300:
-----------------------------------

I found the anomaly Jake was talking about. It was due to too many instanceof checks in dot in AbstractVector. I moved that code out and split it into smaller checks in the overridden dot of each of the impls. The numbers just doubled, confirming my suspicion that instanceof is heavyweight.
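The restructuring described in this comment can be illustrated with a toy hierarchy. The classes below (Vec, DenseVec, SparseVec) are hypothetical stand-ins, not Mahout's Vector classes, and the sketch makes no claim about how much time instanceof actually costs; it only shows the shape of the change: the base class keeps a check-free generic fallback, and each implementation's override performs only the checks it can exploit.

```java
import java.util.Arrays;

// Hypothetical stand-in for an abstract vector base class.
abstract class Vec {
    abstract int size();
    abstract double get(int index);

    /** Generic fallback with no type checks; works for any pair of implementations. */
    double dot(Vec other) {
        double sum = 0.0;
        for (int i = 0; i < size(); i++) {
            sum += get(i) * other.get(i);
        }
        return sum;
    }
}

final class DenseVec extends Vec {
    private final double[] values;

    DenseVec(double... values) {
        this.values = values;
    }

    @Override int size() { return values.length; }
    @Override double get(int index) { return values[index]; }

    @Override double dot(Vec other) {
        // Single check: dense-dense can read both backing arrays directly.
        if (other instanceof DenseVec) {
            double[] o = ((DenseVec) other).values;
            double sum = 0.0;
            for (int i = 0; i < values.length; i++) {
                sum += values[i] * o[i];
            }
            return sum;
        }
        return super.dot(other);
    }
}

final class SparseVec extends Vec {
    private final int size;
    private final int[] indexes;   // non-zero positions, sorted ascending
    private final double[] values; // values[k] belongs to indexes[k]

    SparseVec(int size, int[] indexes, double[] values) {
        this.size = size;
        this.indexes = indexes;
        this.values = values;
    }

    @Override int size() { return size; }

    @Override double get(int index) {
        int pos = Arrays.binarySearch(indexes, index);
        return pos >= 0 ? values[pos] : 0.0;
    }

    @Override double dot(Vec other) {
        // No type check needed: iterate only this vector's non-zero entries.
        double sum = 0.0;
        for (int k = 0; k < indexes.length; k++) {
            sum += values[k] * other.get(indexes[k]);
        }
        return sum;
    }
}
```

The sketch omits size-mismatch validation to stay short; a real implementation would check that both vectors have the same cardinality before multiplying.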
[jira] Updated: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-300:
------------------------------
Attachment: MAHOUT-300.patch
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836689#action_12836689 ]

Sean Owen commented on MAHOUT-305:
----------------------------------

Yes there are some prolific users.

I don't have anything ready-made for such a test; the existing eval framework won't work here. I think it would need a bit of coding to pull out some test data, run the job, compare the results.

I have only one little tweak to make to the procedure you mention here. Really, we ought to pull out the most-preferred movies as test data. After all the recommendations will be for those movies that should be rated highly. We wouldn't want to punish the algorithm for failing to recommend something I have rated, but didn't like, over something I haven't rated but indeed would like.

One very crude way to do this is remove all 5-star ratings in the data set, and see how many of those actually come back in the recommendations.
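The crude test described above (hold out every 5-star rating, then count how many held-out items reappear in the recommendations) might look roughly like this in Java. Everything here is a hypothetical sketch: the Rating class, the method names, and the in-memory maps are stand-ins chosen for illustration, and the step that actually runs the recommender job on the training data is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Mahout's evaluation framework.
final class TopRatedHoldOutSketch {

    /** One (user, item, stars) preference; a stand-in for a real rating record. */
    static final class Rating {
        final long user;
        final long item;
        final int stars;

        Rating(long user, long item, int stars) {
            this.user = user;
            this.item = item;
            this.stars = stars;
        }
    }

    /** Ratings below the threshold stay in the training set. */
    static List<Rating> training(List<Rating> all, int topStars) {
        List<Rating> result = new ArrayList<>();
        for (Rating r : all) {
            if (r.stars < topStars) {
                result.add(r);
            }
        }
        return result;
    }

    /** Items rated at or above the threshold become the test set, grouped by user. */
    static Map<Long, Set<Long>> heldOut(List<Rating> all, int topStars) {
        Map<Long, Set<Long>> result = new HashMap<>();
        for (Rating r : all) {
            if (r.stars >= topStars) {
                result.computeIfAbsent(r.user, u -> new HashSet<>()).add(r.item);
            }
        }
        return result;
    }

    /** Fraction of all held-out items that come back in the per-user recommendations. */
    static double fractionRecovered(Map<Long, Set<Long>> heldOut,
                                    Map<Long, List<Long>> recommendations) {
        int total = 0;
        int hits = 0;
        for (Map.Entry<Long, Set<Long>> entry : heldOut.entrySet()) {
            List<Long> recs = recommendations.getOrDefault(
                entry.getKey(), Collections.<Long>emptyList());
            for (Long item : entry.getValue()) {
                total++;
                if (recs.contains(item)) {
                    hits++;
                }
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }
}
```

The recovered fraction computed this way is pooled over all held-out items rather than averaged per user, so prolific users weigh more heavily; a per-user average would be the alternative.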
[jira] Updated: (MAHOUT-304) MeanShift doesn't read from VectorWritable
[ https://issues.apache.org/jira/browse/MAHOUT-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-304:
------------------------------
Attachment: MAHOUT-304.patch

Jeff, MeanShift uses only the ids generated by the mapper to keep vector membership. I don't yet see how you can get the membership information, i.e. Vector docid = Canopy Id. Isn't that job missing? Maybe for later, in 0.4?

MeanShift doesn't read from VectorWritable
------------------------------------------
Key: MAHOUT-304
URL: https://issues.apache.org/jira/browse/MAHOUT-304
Project: Mahout
Issue Type: Improvement
Components: Clustering
Affects Versions: 0.3
Reporter: Robin Anil
Assignee: Robin Anil
Fix For: 0.3
Attachments: MAHOUT-304.patch, MAHOUT-304.patch, MAHOUT-304.patch

Need an M/R job for converting a sequence file containing VectorWritable to MeanShiftCanopy before the MeanShift M/R
Re: Algorithm implementations in Pig
Actually, no. I meant other programs written in pure Java. It used to be that the very restricted scripting ability of Pig made processing chains composed of Pig and map-reduce programs very brittle. In fact, just gluing together multiple Pig programs used to be very ugly. On Mon, Feb 22, 2010 at 12:42 AM, Ankur C. Goel gan...@yahoo-inc.com wrote: About interfacing with other systems I assume you have an RDBMS in mind. -- Ted Dunning, CTO DeepDyve
Re: Algorithm implementations in Pig
Has the interface for writing UDFs stabilized? For quite some time, the UDF API was changing every 3 months. On Mon, Feb 22, 2010 at 12:35 AM, Jeff Zhang zjf...@gmail.com wrote: Pig can only make the implementation of map-reduce easier; the numerical computation can be done in a UDF. -- Ted Dunning, CTO DeepDyve
[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836706#action_12836706 ] Jake Mannix commented on MAHOUT-300: The sparse data is odd... (-vs 50 -sp 5000) (running with 1000, 300 is really not very sparse at all...) I haven't applied any newer patches (just the one I submitted most recently), but I have svn upped. These results are counterintuitive.
{code}
BenchMarks: DotProduct (nCalls = 2500 for every column)

                                    sumTime    minTime  maxTime   meanTime    stdDevTime  Speed           Rate
DenseVector                         3.660321s  1.31ms   10.149ms  1.464128ms  0.329025ms  683.0002 /sec   4098.001 MB/s
RandomAccessSparseVector            1.481516s  0.459ms  36.691ms  0.592606ms  0.852156ms  1687.4606 /sec  10124.764 MB/s
SequentialAccessSparseVector        0.448737s  0.102ms  4.552ms   0.179494ms  0.234261ms  5571.192 /sec   33427.152 MB/s
Dense.dot(RandomAccess)             2.098937s  0.716ms  5.437ms   0.839574ms  0.179854ms  1191.0791 /sec  7146.4746 MB/s
Dense.dot(SequentialAccess)         0.856259s  0.24ms   11.856ms  0.342503ms  0.286798ms  2919.6772 /sec  17518.062 MB/s
RandomAccess.dot(Dense)             2.277742s  0.776ms  8.059ms   0.911096ms  0.268853ms  1097.5781 /sec  6585.4688 MB/s
RandomAccess.dot(SequentialAccess)  0.607507s  0.18ms   4.509ms   0.243002ms  0.115022ms  4115.1787 /sec  24691.072 MB/s
SequentialAccess.dot(Dense)         1.341608s  0.442ms  2.136ms   0.536643ms  0.171088ms  1863.4355 /sec  11180.613 MB/s
SequentialAccess.dot(RandomAccess)  0.741622s  0.209ms  2.031ms   0.296648ms  0.115263ms  3370.9895 /sec  20225.936 MB/s
{code}
Solve performance issues with Vector Implementations Key: MAHOUT-300 URL: https://issues.apache.org/jira/browse/MAHOUT-300 Project: Mahout Issue Type: Improvement Affects Versions: 0.3 Reporter: Robin Anil Fix For: 0.3 Attachments: MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch
AbstractVector operations like times
    public Vector times(double x) {
      Vector result = clone();
      Iterator<Element> iter = iterateNonZero();
      while (iter.hasNext()) {
        Element element = iter.next();
        int index = element.index();
        result.setQuick(index, element.get() * x);
      }
      return result;
    }
should be implemented as follows
    public Vector times(double x) {
      Vector result = clone();
      Iterator<Element> iter = result.iterateNonZero();
      while (iter.hasNext()) {
        Element element = iter.next();
        element.set(element.get() * x);
      }
      return result;
    }
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836713#action_12836713 ] Robin Anil commented on MAHOUT-300: --- Can I commit the latest, if you don't have any changes pending on your end? Whatever the case, we need to ensure correctness and proceed with 0.3. We are much better in terms of perf now than at the beginning of this issue.
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836725#action_12836725 ] Ankur commented on MAHOUT-305: -- Typically, when doing a train-test data split, we divide the data on a timeline. As a simple example, if we have 10 days of data, we would keep the last 2 days as test data and the rest as training data. If we remove all 5-star ratings the crude way, we may not be able to ensure this condition; it is not a hard requirement, but still a best practice AFAIK. Also, I am not sure whether 5-star ratings would make up 20% or even 10% of the total data. The crude way you mentioned is OK for a start, but I am not sure whether it is a fair evaluation. Also, with this we would effectively be calculating
precision = (5-star recommendations actually present in user's history) / (total 5-star recommendations)
recall = (5-star recommendations actually present in user's history) / (total 5-star items in user's history)
Is that what you mean?
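The timeline split Ankur describes could be sketched roughly as follows. This is only an illustration, not code from either Mahout job: the `Rating` holder, the `split` helper, and the choice to cut by rating count (rather than by calendar day) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TimelineSplit {

  /** Hypothetical rating record; not a Mahout type. */
  static final class Rating {
    final long user;
    final long item;
    final int stars;
    final long timestamp;

    Rating(long user, long item, int stars, long timestamp) {
      this.user = user;
      this.item = item;
      this.stars = stars;
      this.timestamp = timestamp;
    }
  }

  /**
   * Sorts ratings by time and cuts them into [train, test].
   * The oldest trainFraction of the ratings become training data,
   * the most recent remainder becomes test data.
   */
  static List<List<Rating>> split(List<Rating> ratings, double trainFraction) {
    List<Rating> sorted = new ArrayList<>(ratings);
    sorted.sort(Comparator.comparingLong((Rating r) -> r.timestamp));
    int cut = (int) Math.round(sorted.size() * trainFraction);
    List<List<Rating>> parts = new ArrayList<>();
    parts.add(new ArrayList<>(sorted.subList(0, cut)));
    parts.add(new ArrayList<>(sorted.subList(cut, sorted.size())));
    return parts;
  }

  public static void main(String[] args) {
    List<Rating> all = new ArrayList<>();
    for (long day = 1; day <= 10; day++) {
      all.add(new Rating(1L, day, 5, day)); // one rating per "day"
    }
    List<List<Rating>> parts = split(all, 0.8);
    System.out.println("train=" + parts.get(0).size() + " test=" + parts.get(1).size());
  }
}
```

With one rating per day over 10 days and an 80/20 split, the last 2 days end up as test data, matching the example in the comment.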
Re: Algorithm implementations in Pig
In the next Pig release (0.7), Pig's load/store funcs will move to Hadoop's input/output formats, so there are some changes planned for that - http://wiki.apache.org/pig/Pig070IncompatibleChanges After that I don't expect any interface-level change in UDFs. -...@nkur On 2/22/10 10:10 PM, Ted Dunning ted.dunn...@gmail.com wrote: Has the interface for writing UDFs stabilized? For quite some time, the UDF API was changing every 3 months. On Mon, Feb 22, 2010 at 12:35 AM, Jeff Zhang zjf...@gmail.com wrote: Pig can only make the implementation of map-reduce easier; the numerical computation can be done in a UDF. -- Ted Dunning, CTO DeepDyve
Re: Algorithm implementations in Pig
I agree with you, and while some of that has been remedied, I wouldn't say things are perfect. Scripting ability, while still limited, has better streaming support, so you can have relations streamed into a custom script executing in either the map or the reduce phase, depending upon where it is placed. If you want to glue together a bunch of map-reduce programs and Pig scripts, then the best option is to invoke Pig from your Java program that also manages your M/R chain. A Hadoop workflow system (Oozie) is coming along, which should make this better. For gluing together multiple Pig programs, the best there is is exec script.pig, which can be called from inside your script. However, it is not a very neat solution, since you would want to pass a bunch of things to the invoked script and also check that certain conditions exist. So again a Java program or a perl/python/ruby script managing your chain is a better option. Regards -...@nkur On 2/22/10 10:08 PM, Ted Dunning ted.dunn...@gmail.com wrote: Actually, no. I meant other programs written in pure Java. It used to be that the very restricted scripting ability of Pig made processing chains composed of Pig and map-reduce programs very brittle. In fact, just gluing together multiple Pig programs used to be very ugly. On Mon, Feb 22, 2010 at 12:42 AM, Ankur C. Goel gan...@yahoo-inc.com wrote: About interfacing with other systems I assume you have an RDBMS in mind. -- Ted Dunning, CTO DeepDyve
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836733#action_12836733 ] Sean Owen commented on MAHOUT-305: -- Say I've made the following ratings:
5 stars: Harry Potter
5 stars: Harry Potter 2
1 star: Maid in Manhattan
Say I remove Maid in Manhattan as test data. I run recommendations and it recommends to me Harry Potter 3 (which presumably I would rate highly). The implementation would be penalized for not returning Maid in Manhattan, when that's surely not what it should have returned. Even if you take out only the most highly-rated movies as test data (this is what the existing CF precision/recall evaluator does), this phenomenon can still occur: the recommender could return a movie that's better than anything you've yet seen, but that would be considered 'bad' by this evaluation style. It's still not a fair test, but it's less unfair. Yes, you could take the 20% most-highly-rated movies from each user as test data if you like, not just 5-star. Say I ask for 10 recommendations. Precision @ 10 is the proportion of those 10 that were in the user's history (top ratings). Recall @ 10 is the proportion of all top-rated items that appeared in those 10. I think this is a little different than what you're saying?
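Precision @ N and recall @ N as defined in the comment above can be sketched as below. This is a minimal illustration, not the existing Mahout evaluator: the class and method names are made up, items are plain strings, the recommendation list is assumed to contain no duplicates, and precision is divided by the number of recommendations actually returned when fewer than N come back (the comment does not say how that case should be handled).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PrecisionRecallAtN {

  /** Proportion of the top-n recommendations that are in the held-out top-rated items. */
  static double precisionAt(int n, List<String> recommended, Set<String> heldOut) {
    int considered = Math.min(n, recommended.size());
    if (considered == 0) {
      return 0.0;
    }
    return (double) hits(considered, recommended, heldOut) / considered;
  }

  /** Proportion of all held-out top-rated items that appear in the top-n recommendations. */
  static double recallAt(int n, List<String> recommended, Set<String> heldOut) {
    if (heldOut.isEmpty()) {
      return 0.0;
    }
    int considered = Math.min(n, recommended.size());
    return (double) hits(considered, recommended, heldOut) / heldOut.size();
  }

  /** Counts how many of the first 'considered' recommendations were held out. */
  private static int hits(int considered, List<String> recommended, Set<String> heldOut) {
    int hits = 0;
    for (String item : recommended.subList(0, considered)) {
      if (heldOut.contains(item)) {
        hits++;
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    // The two 5-star movies are held out as test data.
    Set<String> heldOut = new HashSet<>(Arrays.asList("Harry Potter", "Harry Potter 2"));
    List<String> recs =
        Arrays.asList("Harry Potter 3", "Harry Potter 2", "Maid in Manhattan", "Harry Potter");
    System.out.println("precision@4 = " + precisionAt(4, recs, heldOut));
    System.out.println("recall@4 = " + recallAt(4, recs, heldOut));
  }
}
```

Note that the sketch still counts "Harry Potter 3" as a miss, which is exactly the unfairness described above: a good recommendation outside the user's history lowers precision.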
Re: Algorithm implementations in Pig
That isn't an issue here. It is the invocation of pig programs and passing useful information to them that is the problem. On Mon, Feb 22, 2010 at 9:20 AM, Ankur C. Goel gan...@yahoo-inc.com wrote: Scripting ability while still limited has better streaming support so you can have relations streamed into a custom script executing in either map or reduce phase depending upon where it is placed. -- Ted Dunning, CTO DeepDyve
Re: Algorithm implementations in Pig
As an interesting test case, can you write a pig program that counts words. BUT, it takes an input file name AND an input field name. On Mon, Feb 22, 2010 at 9:56 AM, Ted Dunning ted.dunn...@gmail.com wrote: That isn't an issue here. It is the invocation of pig programs and passing useful information to them that is the problem. On Mon, Feb 22, 2010 at 9:20 AM, Ankur C. Goel gan...@yahoo-inc.com wrote: Scripting ability while still limited has better streaming support so you can have relations streamed into a custom script executing in either map or reduce phase depending upon where it is placed. -- Ted Dunning, CTO DeepDyve
[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836815#action_12836815 ] Jake Mannix commented on MAHOUT-300: With these opts: -vs 50 -sp 500 -nv 50 -l 500 -no 10 Dot product looks more sensible. Executive summary: fastest is SequentialAccess.dot(Dense), clocking in at 69,246 units/s, which is as expected. Leaderboard for dotProduct:
{code}
Seq.dot(Den) : 69,246 units/s
Seq.dot(Seq) : 63,958 units/s
Seq.dot(Rnd) : 49,638 units/s
Rnd.dot(Seq) : 39,019 units/s
Den.dot(Seq) : 30,366 units/s
Rnd.dot(Rnd) : 5,320 units/s
Den.dot(Rnd) : 5,177 units/s
Rnd.dot(Den) : 5,101 units/s
Den.dot(Den) : 516 units/s
{code}
{code}
INFO: DotProduct DenseVector sum = 48.442942s; min = 1.554ms; max = 32.55ms; mean = 1.937717ms; stdDev = 0.55081ms; Speed: 516.07104 UnitsProcessed/sec 3.0964262 MBytes/sec
INFO: DotProduct RandSparseVector sum = 4.69924s; min = 0.116ms; max = 24.211ms; mean = 0.187969ms; stdDev = 0.343685ms; Speed: 5320.0093 UnitsProcessed/sec 31.920053 MBytes/sec
INFO: DotProduct SeqSparseVector sum = 0.390877s; min = 0.012ms; max = 2.698ms; mean = 0.015635ms; stdDev = 0.037619ms; Speed: 63958.742 UnitsProcessed/sec 383.7524 MBytes/sec
INFO: DotProduct Dense.dot(Rand) sum = 4.828592s; min = 0.137ms; max = 4.09ms; mean = 0.193143ms; stdDev = 0.052169ms; Speed: 5177.4927 UnitsProcessed/sec 31.064955 MBytes/sec
INFO: DotProduct Dense.dot(Seq) sum = 0.823286s; min = 0.0ms; max = 4.606ms; mean = 0.032931ms; stdDev = 0.03774ms; Speed: 30366.117 UnitsProcessed/sec 182.1967 MBytes/sec
INFO: DotProduct Rand.dot(Dense) sum = 4.900044s; min = 0.14ms; max = 3.969ms; mean = 0.196001ms; stdDev = 0.056772ms; Speed: 5101.995 UnitsProcessed/sec 30.61197 MBytes/sec
INFO: DotProduct Rand.dot(Seq) sum = 0.640713s; min = 0.0ms; max = 2.253ms; mean = 0.025628ms; stdDev = 0.041805ms; Speed: 39019.027 UnitsProcessed/sec 234.11417 MBytes/sec
INFO: DotProduct Seq.dot(Dense) sum = 0.361031s; min = 0.0ms; max = 4.63ms; mean = 0.014441ms; stdDev = 0.040413ms; Speed: 69246.13 UnitsProcessed/sec 415.47675 MBytes/sec
INFO: DotProduct Seq.dot(Rand) sum = 0.503642s; min = 0.0090ms; max = 5.203ms; mean = 0.020145ms; stdDev = 0.05134ms; Speed: 49638.434 UnitsProcessed/sec 297.8306 MBytes/sec
{code}
[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836817#action_12836817 ] Ted Dunning commented on MAHOUT-300: These are getting respectable! As a quick hack, the fact that dot is commutative should make it possible to get identical results for dense.dot(seq) as for seq.dot(dense). Likewise for dense.dot(rand). A similar, but less dramatic win might come from rnd.dot(seq) being redone as seq.dot(rnd).
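Ted's suggestion of exploiting commutativity could look roughly like the sketch below. The `Vec`, `Dense`, and `Sparse` types here are hypothetical stand-ins written for this example and are not Mahout's Vector classes; the point is only that `dense.dot(sparse)` can be answered by iterating the sparse operand's non-zeros, so both argument orders take the same fast path.

```java
import java.util.Map;
import java.util.TreeMap;

class CommutativeDot {

  /** Hypothetical minimal vector interface; not Mahout's Vector. */
  interface Vec {
    int size();
    double get(int index);
  }

  static final class Dense implements Vec {
    private final double[] values;

    Dense(double... values) {
      this.values = values.clone();
    }

    public int size() {
      return values.length;
    }

    public double get(int index) {
      return values[index];
    }
  }

  /** Sparse vector that stores only its non-zero entries, ordered by index. */
  static final class Sparse implements Vec {
    private final int size;
    private final TreeMap<Integer, Double> nonZeros = new TreeMap<>();

    Sparse(int size) {
      this.size = size;
    }

    Sparse set(int index, double value) {
      nonZeros.put(index, value);
      return this;
    }

    public int size() {
      return size;
    }

    public double get(int index) {
      return nonZeros.getOrDefault(index, 0.0);
    }
  }

  /** a.b == b.a, so always iterate over a sparse operand when there is one. */
  static double dot(Vec a, Vec b) {
    if (a.size() != b.size()) {
      throw new IllegalArgumentException("size mismatch");
    }
    if (a instanceof Sparse) {
      return sparseDot((Sparse) a, b);
    }
    if (b instanceof Sparse) {
      return sparseDot((Sparse) b, a); // swapped: same result by commutativity
    }
    double sum = 0.0;
    for (int i = 0; i < a.size(); i++) {
      sum += a.get(i) * b.get(i);
    }
    return sum;
  }

  /** Touches only the non-zero entries of the sparse operand. */
  private static double sparseDot(Sparse sparse, Vec other) {
    double sum = 0.0;
    for (Map.Entry<Integer, Double> e : sparse.nonZeros.entrySet()) {
      sum += e.getValue() * other.get(e.getKey());
    }
    return sum;
  }

  public static void main(String[] args) {
    Vec dense = new Dense(1.0, 2.0, 3.0, 4.0);
    Vec sparse = new Sparse(4).set(1, 5.0).set(3, 2.0);
    System.out.println(dot(dense, sparse) + " " + dot(sparse, dense));
  }
}
```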
[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836818#action_12836818 ] Jake Mannix commented on MAHOUT-300: Agreed, Ted. I'm liking that we're getting 60-70k units/s on Seq.dot(Den) and Seq.dot(Seq), with vectors with 500 nonzero elements. Since a dot requires a multiply and an add per nonzero element, this is doing about 60 Mflops on my laptop in my IDE, with the browser running, etc. Not bad.
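The back-of-the-envelope figure above follows from 2 floating-point operations (one multiply, one add) per nonzero element. A tiny sketch of that arithmetic, with a made-up helper name:

```java
class DotFlops {

  /** Mflops for a dot-product benchmark: 2 flops (multiply + add) per nonzero element. */
  static double mflops(double dotsPerSecond, int nonZeroElements) {
    return dotsPerSecond * nonZeroElements * 2 / 1e6;
  }

  public static void main(String[] args) {
    // 60-70k dot products per second on vectors with 500 nonzero elements
    System.out.println(mflops(60000, 500) + " to " + mflops(70000, 500) + " Mflops");
  }
}
```

So 60k to 70k units/s at 500 nonzeros corresponds to roughly 60 to 70 Mflops.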
[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836819#action_12836819 ] Robin Anil commented on MAHOUT-300: --- Seq.rand and rand.seq should get the same perf level now, with an instanceof removed.
[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836826#action_12836826 ] Jake Mannix commented on MAHOUT-300: And now my run (of three comments ago) is finally done; dot product is removed since it has already been reported. This properly demonstrates how slow it is to build up a SeqAcc vector incrementally, since it's not random-access, among other things.
{code}
INFO: BenchMarks (nCalls = 25000 for every column)

                        DenseVector       RandSparseVector  SeqSparseVector
Clone
  sum                   222.552872s       34.923269s        34.251326s
  min                   4.598ms           0.446ms           0.4ms
  max                   265.445ms         184.352ms         182.734ms
  mean                  8.902114ms        1.39693ms         1.370053ms
  stdDev                11.676773ms       4.533406ms        5.002041ms
  Speed                 112.33286 /sec    715.8551 /sec     729.89874 /sec
  Rate                  0.6739971 MB/s    4.2951303 MB/s    4.379392 MB/s
Create (copy)
  sum                   209.506424s       1.371177s         0.667553s
  min                   1.427ms           0.0050ms          0.021ms
  max                   11802.223ms       21.322ms          10.036ms
  mean                  8.380256ms        0.054847ms        0.026702ms
  stdDev                27.862112ms       0.324031ms        0.130493ms
  Speed                 119.32809 /sec    18232.512 /sec    37450.207 /sec
  Rate                  0.7159685 MB/s    109.395065 MB/s   224.70125 MB/s
Create (incrementally)
  sum                   0.570172s         0.755783s         3.969259s
  min                   0.0ms             0.0ms             0.093ms
  max                   4.148ms           23.108ms          13.452ms
  mean                  0.022806ms        0.030231ms        0.15877ms
  stdDev                0.060237ms        0.196128ms        0.192234ms
  Speed                 43846.414 /sec    33078.277 /sec    6298.405 /sec
  Rate                  263.0785 MB/s     198.46967 MB/s    37.79043 MB/s
org.apache.mahout.common.distance.CosineDistanceMeasure
  sum                   500.69893s        29.026116s        3.367885s
  min                   16.147ms          0.896ms           0.086ms
  max                   163.619ms         10.819ms          11.731ms
  mean                  20.027957ms       1.161044ms        0.134715ms
  stdDev                4.146275ms        0.345399ms        0.092807ms
  Speed                 49.930202 /sec    861.29333 /sec    7423.056 /sec
  Rate                  0.2995812 MB/s    5.16776 MB/s      44.538334 MB/s
org.apache.mahout.common.distance.EuclideanDistanceMeasure
  sum                   501.080023s       26.812884s        3.649897s
  min                   17.011ms          0.924ms           0.086ms
  max                   120.138ms         9.692ms           13.113ms
  mean                  20.0432ms         1.072515ms        0.145995ms
  stdDev                4.410452ms        0.262769ms        0.192273ms
Speed = 49.89223 /sec Speed = 932.3876 /sec Speed =
Re: [jira] Updated: (MAHOUT-304) MeanShift doesn't read from VectorWritable
If the Vector-MSCanopy pre-job outputs all of its canopies, then each of those canopies would contain the generated canopyId, and its canopy center would contain the original vector with its docId. Seems like one could use that data set to get the membership information in a separate post-processing step. Certainly the post-processing job should be for later, after the List<Vector> -> List<canopyId> optimization. Robin Anil (JIRA) wrote: [ https://issues.apache.org/jira/browse/MAHOUT-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-304: -- Attachment: MAHOUT-304.patch Jeff, MeanShift uses only ids generated by the mapper to keep vector membership. I don't yet see how you can get the membership information, i.e. Vector docid = Canopy Id. Isn't that job missing? Maybe for later, in 0.4? MeanShift doesn't read from VectorWritable -- Key: MAHOUT-304 URL: https://issues.apache.org/jira/browse/MAHOUT-304 Project: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.3 Reporter: Robin Anil Assignee: Robin Anil Fix For: 0.3 Attachments: MAHOUT-304.patch, MAHOUT-304.patch, MAHOUT-304.patch Need an M/R job for converting sequence file containing VectorWritable to MeanShiftCanopy before the MeanShift M/R
[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836839#action_12836839 ] Robin Anil commented on MAHOUT-300: ---
{noformat}
seq.seq     = 46,855
rand.seq    = 37,397
seq.dense   = 36,460
seq.rand    = 34,348
dense.seq   = 25,453
rand.rand   = 5,436
dense.rand  = 5,303
rand.dense  = 4,754
dense.dense = 477
{noformat}
Re: [jira] Updated: (MAHOUT-304) MeanShift doesn't read from VectorWritable
after the List<Vector> -> List<canopyId> optimization. I did that in the patch. Take a look :)
[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836848#action_12836848 ] Robin Anil commented on MAHOUT-300: ---
{noformat}
rand.rand   = 14,435
dense.rand  = 9,172
rand.dense  = 10,578
dense.dense = 477
{noformat}
[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836909#action_12836909 ] Jake Mannix commented on MAHOUT-300: New benchmark additions:
{code}
INFO: BenchMarks (nCalls = 25000 for every column)

                        DenseVector       RandSparseVector  SeqSparseVector
Clone
  sum                   222.609888s       0.427272s         32.833216s
  min                   4.509ms           0.0030ms          0.381ms
  max                   205.425ms         17.397ms          164.729ms
  mean                  8.904395ms        0.01709ms         1.313328ms
  stdDev                11.839592ms       0.256237ms        4.730696ms
  Speed                 112.30409 /sec    58510.74 /sec     761.424 /sec
  Rate                  0.6738245 MB/s    351.06442 MB/s    4.568544 MB/s
Create (copy)
  sum                   153.385135s       1.316737s         0.654021s
  min                   1.291ms           0.0080ms          0.0ms
  max                   149.59ms          18.778ms          8.555ms
  mean                  6.135405ms        0.052669ms        0.02616ms
  stdDev                9.730283ms        0.276396ms        0.116822ms
  Speed                 162.9884 /sec     18986.328 /sec    38225.074 /sec
  Rate                  0.9779304 MB/s    113.91796 MB/s    229.35042 MB/s
Create (incrementally)
  sum                   0.556807s         1.914268s         4.109328s
  min                   0.0ms             0.02ms            0.093ms
  max                   2.523ms           184.955ms         16.624ms
  mean                  0.022272ms        0.07657ms         0.164373ms
  stdDev                0.038841ms        1.192837ms        0.214126ms
  Speed                 44898.863 /sec    13059.822 /sec    6083.72 /sec
  Rate                  269.39316 MB/s    78.35893 MB/s     36.50232 MB/s

DotProduct
                    sum         min       max       mean        stdDev
DenseVector         48.730579s  1.581ms   14.217ms  1.949223ms  0.342952ms
RandSparseVector    1.214007s   0.0040ms  26.558ms  0.04856ms   0.216698ms
SeqSparseVector     0.421372s   0.0ms     2.628ms   0.016854ms  0.028979ms
Dense.fn(Rand)      2.091561s   0.036ms   9.386ms   0.083662ms  0.070128ms
Dense.fn(Seq)       0.883674s   0.0ms     8.269ms   0.035346ms  0.065883ms
Rand.fn(Dense)      2.110771s   0.033ms   8.159ms   0.08443ms   0.064003ms
Rand.fn(Seq)        0.571964s   0.018ms   1.525ms   0.022878ms  0.026759ms
Seq.fn(Dense)       0.370673s   0.0ms     1.674ms   0.014826ms  0.034967ms
Seq.fn(Rand)        0.624421s   0.019ms   7.62ms    0.024976ms  0.059001ms
Speed = 513.0249 /sec Speed = 20592.96 /sec Speed = 59330.0 /sec Speed =
[jira] Updated: (MAHOUT-304) MeanShift doesn't read from VectorWritable
[ https://issues.apache.org/jira/browse/MAHOUT-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-304: -- Affects Version/s: (was: 0.3) 0.4 Fix Version/s: (was: 0.3) 0.4 MeanShift doesn't read from VectorWritable -- Key: MAHOUT-304 URL: https://issues.apache.org/jira/browse/MAHOUT-304 Project: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.4 Reporter: Robin Anil Assignee: Robin Anil Fix For: 0.4 Attachments: MAHOUT-304.patch, MAHOUT-304.patch, MAHOUT-304.patch Need an M/R job for converting sequence file containing VectorWritable to MeanShiftCanopy before the MeanShift M/R -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix updated MAHOUT-301:
---
Attachment: MAHOUT-301.patch

Fancy new version. Run as follows:

Set your $MAHOUT_CONF_DIR to a directory where you will have your own overrides (or, if unset, defaults to ./core/src/main/resources). In that directory, there should be a file called driver.classes.props with contents like so:

{code}
org.apache.mahout.utils.vectors.VectorDumper=vecDump
org.apache.mahout.utils.clustering.ClusterDumper=clusty
org.apache.mahout.utils.SequenceFileDumper=seqDump
org.apache.mahout.clustering.kmeans.KMeansDriver=kmeans
org.apache.mahout.clustering.canopy.CanopyDriver=canopy
org.apache.mahout.utils.vectors.lucene.Driver=luceneVecs
org.apache.mahout.text.SequenceFilesFromDirectory=dirToSeq
org.apache.mahout.text.WikipediaToSequenceFile=wikToSeq
org.apache.mahout.classifier.bayes.TestClassifier=TestClassifier
{code}

Etc. The right hand side can be whatever you want, *but* whatever it is determines where MahoutDriver will look for a default properties file. For example:

{code}
$MAHOUT_HOME/bin/mahout run wikToSeq
{code}

would look for the file $MAHOUT_CONF_DIR/wikToSeq.props and in that file, take each line and transform it into command line arguments for WikipediaToSequenceFile, using the logic as follows: on each line of wikToSeq.props, there is a key-value pair:

{code}
i | input = my/wiki/input/path
o | output = my/output/path
c | categories = my/wikiCategories/file
e | exactMatch = true
all = true
{code}

The part of the key before the vertical bar is the short-name of the argument to pass, and the second part is the long name. If there is only one, they are assumed to be the same. You can also pass Hadoop options here, like

{code}
Djava.io.tmpdir = /var/tmp/mahout
{code}

which would lead to the program being called with -Djava.io.tmpdir=/var/tmp/mahout passed in.
Improve command-line shell script by allowing default properties files
--
Key: MAHOUT-301
URL: https://issues.apache.org/jira/browse/MAHOUT-301
Project: Mahout
Issue Type: New Feature
Components: Utils
Affects Versions: 0.3
Reporter: Jake Mannix
Assignee: Jake Mannix
Priority: Minor
Fix For: 0.4
Attachments: MAHOUT-301.patch, MAHOUT-301.patch, MAHOUT-301.patch

Snippet from javadoc gives the idea:

{code}
/**
 * General-purpose driver class for Mahout programs. Utilizes org.apache.hadoop.util.ProgramDriver to run
 * main methods of other classes, but first loads up default properties from a properties file.
 *
 * Usage: run on Hadoop like so:
 *
 * $HADOOP_HOME/bin/hadoop -jar path/to/job org.apache.mahout.driver.MahoutDriver [classes.props file] shortJobName \
 *   [default.props file for this class] [over-ride options, all specified in long form: --input, --jarFile, etc]
 *
 * TODO: set the Main-Class to just be MahoutDriver, so that this option isn't needed?
 *
 * (note: using the current shell script, this could be modified to be just
 * $MAHOUT_HOME/bin/mahout [classes.props file] shortJobName [default.props file] [over-ride options]
 * )
 *
 * Works like this: by default, the file core/src/main/resources/driver.classes.prop is loaded, which
 * defines a mapping between short names like VectorDumper and fully qualified class names. This file may
 * instead be overridden on the command line by having the first argument be some string of the form *classes.props.
 *
 * The next argument to the Driver is supposed to be the short name of the class to be run (as defined in the
 * driver.classes.props file). After this, if the next argument ends in .props / .properties, it is taken to
 * be the file to use as the default properties file for this execution, and key-value pairs are built up from that:
 * if the file contains
 *
 * input=/path/to/my/input
 * output=/path/to/my/output
 *
 * Then the class which will be run will have its main called with
 *
 * main(new String[] { "--input", "/path/to/my/input", "--output", "/path/to/my/output" });
 *
 * After all the default properties are loaded from the file, any further command-line arguments are taken in,
 * and over-ride the defaults.
 */
{code}

Could be cleaned up, as it's kinda ugly with the whole file named in .props, but gives the idea. Really helps cut down on repetitive long command lines, and lets defaults be put in props files instead of locked into the code.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
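The properties-to-arguments step described in this message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the attached patch: the class name PropsToArgs and both method names are hypothetical, every property is assumed to become a "--longName value" pair, and Hadoop-style -D entries are not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not the actual MahoutDriver code. */
class PropsToArgs {

  /** Parses lines such as "i | input = my/path" into longName -> value, keeping file order. */
  static Map<String, String> parseDefaults(List<String> lines) {
    Map<String, String> defaults = new LinkedHashMap<>();
    for (String line : lines) {
      int eq = line.indexOf('=');
      if (eq < 0) {
        continue; // skip blank or malformed lines
      }
      String key = line.substring(0, eq).trim();
      String value = line.substring(eq + 1).trim();
      // "short | long": keep the long name; a lone name serves as both.
      int bar = key.indexOf('|');
      String longName = bar < 0 ? key : key.substring(bar + 1).trim();
      defaults.put(longName, value);
    }
    return defaults;
  }

  /** Builds the argument array; "--name value" pairs in overrides replace the file defaults. */
  static String[] buildArgs(Map<String, String> defaults, String[] overrides) {
    Map<String, String> merged = new LinkedHashMap<>(defaults);
    for (int i = 0; i + 1 < overrides.length; i += 2) {
      String name = overrides[i].startsWith("--") ? overrides[i].substring(2) : overrides[i];
      merged.put(name, overrides[i + 1]);
    }
    List<String> args = new ArrayList<>();
    for (Map.Entry<String, String> e : merged.entrySet()) {
      args.add("--" + e.getKey());
      args.add(e.getValue());
    }
    return args.toArray(new String[0]);
  }

  public static void main(String[] argv) {
    List<String> lines = List.of(
        "i | input = my/wiki/input/path",
        "o | output = my/output/path",
        "all = true");
    String[] args = buildArgs(parseDefaults(lines), new String[] {"--output", "other/path"});
    System.out.println(String.join(" ", args));
  }
}
```

Keeping the defaults in an insertion-ordered map means a command-line override replaces the value without changing the position of the argument, which makes the resulting command line easy to compare against the props file.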
[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836952#action_12836952 ]

Jake Mannix commented on MAHOUT-301:

Oh, I forgot to finish my sentence which began "run as follows"... Once you've got default property files in your $MAHOUT_CONF_DIR, you can run like so:

{code}
$MAHOUT_HOME/bin/mahout run wikToSeq
{code}

and that's it. If you want to override the options in your wikToSeq.props file, just pass them in on that same command line above, and they override as desired.

If this can be tested out and debugged, this patch is ready for committing, and significantly improves the command line experience.
[jira] Created: (MAHOUT-306) Profile and improve perfomance of algorithms based on vectors
Profile and improve perfomance of algorithms based on vectors - Key: MAHOUT-306 URL: https://issues.apache.org/jira/browse/MAHOUT-306 Project: Mahout Issue Type: Improvement Affects Versions: 0.4 Reporter: Robin Anil Fix For: 0.4 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil resolved MAHOUT-300.
---
Resolution: Fixed
Assignee: Robin Anil

Solve performance issues with Vector Implementations

Key: MAHOUT-300
URL: https://issues.apache.org/jira/browse/MAHOUT-300
Project: Mahout
Issue Type: Sub-task
Affects Versions: 0.3
Reporter: Robin Anil
Assignee: Robin Anil
Fix For: 0.3
Attachments: MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch

AbstractVector operations like times

  public Vector times(double x) {
    Vector result = clone();
    Iterator<Element> iter = iterateNonZero();
    while (iter.hasNext()) {
      Element element = iter.next();
      int index = element.index();
      result.setQuick(index, element.get() * x);
    }
    return result;
  }

should be implemented as follows

  public Vector times(double x) {
    Vector result = clone();
    Iterator<Element> iter = result.iterateNonZero();
    while (iter.hasNext()) {
      Element element = iter.next();
      element.set(element.get() * x);
    }
    return result;
  }

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
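The difference between the two versions of times can be illustrated with a toy sparse vector. This is a sketch only: ToySparseVector and its methods are hypothetical stand-ins backed by a HashMap, not Mahout's Vector API. It assumes that writing by index costs one lookup per element, whereas updating an element handed out by the copy's own iterator does not.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical toy class for illustration; not a Mahout vector implementation. */
class ToySparseVector {
  private final Map<Integer, Double> nonZero = new HashMap<>();

  void set(int index, double value) {
    nonZero.put(index, value);
  }

  double get(int index) {
    return nonZero.getOrDefault(index, 0.0);
  }

  ToySparseVector copy() {
    ToySparseVector c = new ToySparseVector();
    c.nonZero.putAll(nonZero);
    return c;
  }

  /** Original pattern: iterate this vector and write into the copy by index (one lookup per element). */
  ToySparseVector timesByIndex(double x) {
    ToySparseVector result = copy();
    for (Map.Entry<Integer, Double> e : nonZero.entrySet()) {
      result.set(e.getKey(), e.getValue() * x);
    }
    return result;
  }

  /** Proposed pattern: iterate the copy's own entries and update each one in place. */
  ToySparseVector timesInPlace(double x) {
    ToySparseVector result = copy();
    for (Map.Entry<Integer, Double> e : result.nonZero.entrySet()) {
      e.setValue(e.getValue() * x);
    }
    return result;
  }

  public static void main(String[] args) {
    ToySparseVector v = new ToySparseVector();
    v.set(2, 1.5);
    v.set(7, -4.0);
    System.out.println(v.timesByIndex(2.0).get(7) + " " + v.timesInPlace(2.0).get(7));
  }
}
```

Both methods return the same values and leave the source vector untouched; only the access pattern on the copy differs.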
[jira] Updated: (MAHOUT-306) Profile and improve performance of algorithms based on vectors
[ https://issues.apache.org/jira/browse/MAHOUT-306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-306: -- Summary: Profile and improve performance of algorithms based on vectors (was: Profile and improve perfomance of algorithms based on vectors) Profile and improve performance of algorithms based on vectors -- Key: MAHOUT-306 URL: https://issues.apache.org/jira/browse/MAHOUT-306 Project: Mahout Issue Type: Improvement Affects Versions: 0.4 Reporter: Robin Anil Fix For: 0.4 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-300) Solve performance issues with Vector Implementations
[ https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-300: -- Issue Type: Sub-task (was: Improvement) Parent: MAHOUT-306
[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836962#action_12836962 ]

Robin Anil commented on MAHOUT-301:
---
The help comments are missing from the mahout/bin script. Scroll up that file and you will see a pretty printed help string. Just add the Mahout driver description and possibly a wiki link there. Otherwise looks good to commit. I haven't checked the full functionality yet. If anyone else wants to take a look, please do so quickly.
Look! No more ISSUES
Waiting for 301 to get committed. https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310751&styleName=Html&version=12314281 PMCs, it's in your hands now :D Robin
Re: [jira] Updated: (MAHOUT-304) MeanShift doesn't read from VectorWritable
Robin Anil wrote: after the List<Vector> -> List<canopyId> optimization. I did that in the patch. Take a look :)

+1 Simply marvelous
Re: more svn:ignore
Ok, I've committed the ignores for .classpath, .project, .settings created by Eclipse and a couple of target directories that hadn't been excluded. I'll get the IDEA stuff on another pass once I figure out how to do global wildcard ignores.

On Sun, Feb 21, 2010 at 7:53 AM, Sean Owen sro...@gmail.com wrote:

IntelliJ 8 and before used the .ipr and .iws files -- IntelliJ 9 puts it in an .idea directory. It will auto-ignore these files. Still I think it doesn't hurt to ensure SVN never has anything to do with it. That and .DS_Store files from the Mac. Maybe it already ignores them.

On Sat, Feb 20, 2010 at 10:47 PM, Ted Dunning ted.dunn...@gmail.com wrote:

I don't know the normal conventions (and they all seem to have changed recently anyway). *.ipr is the project file and the workspace and project files used to be at the top level. The module files could be below or not. The .idea directory is new and I don't grok it yet. It would only appear at the top level, I think. If you don't use IDEA, you might punt on this. IDEA is pretty good about not checking extra goo in and I don't see Eclipse users accidentally checking in IDEA files.

On Sat, Feb 20, 2010 at 1:35 PM, Drew Farris drew.far...@gmail.com wrote:

On Sat, Feb 20, 2010 at 3:04 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Wouldn't hurt to do the same for the IDEA project (*.ipr), module (*.iml) and workspace (*.iws) files.

Lately, it seems IDEA is keeping this all in a .idea sub-directory of the parent. So, just adding an svn:ignore on the parent for .idea would ignore what we're looking for? AFAIK, ignore is set only on a per-directory basis; I'd have to do some minor digging to determine if we can ignore files with those extensions everywhere.

-- Ted Dunning, CTO DeepDyve
[jira] Updated: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Drew Farris updated MAHOUT-301:
---
Attachment: MAHOUT-301-drew.patch

Did some testing, here's a patch to clean some of these things up + a couple questions:

Could we load the default driver.classes.props from the classpath? If it was loaded that way the default would work regardless of where the mahout script is run from (it currently only works if ./bin/mahout is run, not ./mahout for example) and regardless of whether we're running from a binary release or the dev environment. (included in patch)

Something else I noticed is that the 'mahout' script doesn't add the classes in $MAHOUT_HOME/lib/*.jar to the classpath. This breaks the binary release in that it can't run anything, e.g.:

{code}
./mahout vectordump
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/cli2/OptionException
Caused by: java.lang.ClassNotFoundException: org.apache.commons.cli2.OptionException
{code}

(fixed in patch)

Using -core in the context of a dev build should work properly, but leaving out -core will cause the script to error unless run in the context of a release -- this is the way it should work, right?

Also wondering what the purpose of adding the job jars to the classpath is? (removed in patch)

Also added a help message for the 'run' argument.

Does executing './mahout run --help' hang for anyone else or is it something specific to my environment? (didn't track this one down)
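Loading the default driver.classes.props from the classpath, as suggested in this message, might look something like the sketch below. It is an assumption-laden illustration rather than the contents of MAHOUT-301-drew.patch: the class name ClasspathProps, the method load, and the choice to try the classpath first and a file second are all hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical helper for illustration; not taken from the attached patch. */
class ClasspathProps {

  /**
   * Looks the resource up on the classpath, so the result does not depend on the
   * working directory; falls back to a plain file if the resource is missing.
   * Returns empty Properties when neither source exists.
   */
  static Properties load(String resourceName, Path fallbackFile) throws IOException {
    Properties props = new Properties();
    ClassLoader loader = Thread.currentThread().getContextClassLoader();
    if (loader == null) {
      loader = ClasspathProps.class.getClassLoader();
    }
    InputStream in = loader.getResourceAsStream(resourceName);
    if (in == null && fallbackFile != null && Files.exists(fallbackFile)) {
      in = Files.newInputStream(fallbackFile);
    }
    if (in != null) {
      try (InputStream stream = in) {
        props.load(stream);
      }
    }
    return props;
  }

  public static void main(String[] args) throws IOException {
    Properties props = load("driver.classes.props", null);
    System.out.println(props.size() + " driver classes found");
  }
}
```

Because getResourceAsStream resolves against the classpath, the same lookup behaves identically whether the script is started as ./bin/mahout or ./mahout, provided the file is packaged on the classpath.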
Re: Algorithm implementations in Pig
Those would be passed as parameters, either through the -param option or through a parameter file with the -param_file option, and Pig's preprocessor just substitutes the values in your script. Since it's just a blind parameter substitution, in my shingling script I even had the schema definition passed to it. I suppose passing an input field name shouldn't be an issue as long as it is valid in the context of the script execution plan.

-...@nkur 10 11:32 PM, Ted Dunning ted.dunn...@gmail.com wrote:

As an interesting test case, can you write a pig program that counts words. BUT, it takes an input file name AND an input field name.

On Mon, Feb 22, 2010 at 9:56 AM, Ted Dunning ted.dunn...@gmail.com wrote:

That isn't an issue here. It is the invocation of pig programs and passing useful information to them that is the problem.

On Mon, Feb 22, 2010 at 9:20 AM, Ankur C. Goel gan...@yahoo-inc.com wrote:

Scripting ability, while still limited, has better streaming support, so you can have relations streamed into a custom script executing in either the map or reduce phase, depending upon where it is placed.

-- Ted Dunning, CTO DeepDyve

-- Ted Dunning, CTO DeepDyve
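What "blind parameter substitution" means can be sketched in a few lines of Java. This is not Pig's preprocessor, only a hypothetical stand-in (the class BlindSubstitution and its method are invented for illustration): every $name token is replaced textually by its value, with no check that the result is a valid script, which is why a file name, a field name or a whole schema definition can all be passed the same way.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of textual $name substitution; not Pig's implementation. */
class BlindSubstitution {
  private static final Pattern PARAM = Pattern.compile("\\$(\\w+)");

  /** Replaces each $name with its value; unknown parameters are left untouched. */
  static String substitute(String script, Map<String, String> params) {
    Matcher m = PARAM.matcher(script);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      String value = params.get(m.group(1));
      // quoteReplacement keeps any '$' or '\' in the value literal.
      m.appendReplacement(out, Matcher.quoteReplacement(value != null ? value : m.group()));
    }
    m.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    String script = "A = LOAD '$input' AS ($schema); B = FOREACH A GENERATE $field;";
    System.out.println(substitute(script, Map.of(
        "input", "docs.txt",
        "schema", "id:int, text:chararray",
        "field", "text")));
  }
}
```

The substituted text only has to make sense once the script is parsed, which matches the remark that a field name is fine as long as it is valid in the context of the execution plan.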
Re: Algorithm implementations in Pig
Good answer. On Mon, Feb 22, 2010 at 8:52 PM, Ankur C. Goel gan...@yahoo-inc.com wrote: Those would be passed as parameters, either through the -param option or through a parameter file with the -param_file option, and Pig's preprocessor just substitutes the values in your script. Since it's just a blind parameter substitution, in my shingling script I even had the schema definition passed to it. I suppose passing an input field name shouldn't be an issue as long as it is valid in the context of the script execution plan.
[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837120#action_12837120 ]

Robin Anil commented on MAHOUT-301:
---
Including the job jar is much cleaner than adding all deps. Plus there is nothing more to configure to execute it on top of Hadoop.

BTW, how is Hadoop execution done using the shell script? i.e. hadoop jar mahout-examples-0.3.job o.a.m...DictionaryVectorizer --input . args
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837123#action_12837123 ]

Ankur commented on MAHOUT-305:
--
With co-occurrence analysis we are dropping ratings. So if a lot of people who watched Harry Potter also watched Maid in Manhattan, it will have a higher chance of getting recommended regardless of ratings. I am trying not to be influenced too much by ratings, as that is not the strength of this algorithm. Where it really shines is when you have lots and lots of sparse user click data where a click may be present or absent, something like an online book store or a shopping site. We are sticking with Netflix as there is no such publicly available dataset AFAIK.

Ok, so moving forward with the action plan, here is what I propose to do. Please feel free to suggest modifications.

1. For each user, take out the most recent movies that he has rated 3, 4 or 5 as TEST data. Use the remaining as TRAIN data.
2. Run both implementations in an identical environment on test data and record runtimes and results.
3. Join recommendation results with TEST data on the 'user' key and calculate precision and recall.
4. Report average precision and recall.

Ok, so when separating top ratings as TEST data, for each user:

Precision @ 10 = (3,4,5 rating movies recommended actually present) / 10
Recall @ 10 = (3,4,5 rating movies recommended actually present) / (all 3,4,5 movies seen by user)

Hope this was more clear.

Combine both cooccurrence-based CF M/R jobs
---
Key: MAHOUT-305
URL: https://issues.apache.org/jira/browse/MAHOUT-305
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Assignee: Ankur
Priority: Minor

We have two different but essentially identical MapReduce jobs to make recommendations based on item co-occurrence: org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be merged. Not sure exactly how to approach that but noting this in JIRA, per Ankur.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
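The per-user Precision @ 10 and Recall @ 10 defined in this comment can be written out as a small sketch. The class TopNMetrics and its method names are hypothetical, items are represented as plain strings, and the held-out set is assumed to contain the movies the user rated 3, 4 or 5 in the TEST split.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of either M/R job. */
class TopNMetrics {

  /** Counts how many of the first n recommendations appear in the user's held-out items. */
  static int hits(List<String> recommended, Set<String> heldOut, int n) {
    int count = 0;
    for (String item : recommended.subList(0, Math.min(n, recommended.size()))) {
      if (heldOut.contains(item)) {
        count++;
      }
    }
    return count;
  }

  /** Precision @ n = hits / n, dividing by n as in the comment above. */
  static double precisionAtN(List<String> recommended, Set<String> heldOut, int n) {
    return n == 0 ? 0.0 : (double) hits(recommended, heldOut, n) / n;
  }

  /** Recall @ n = hits / (number of held-out items for the user). */
  static double recallAtN(List<String> recommended, Set<String> heldOut, int n) {
    return heldOut.isEmpty() ? 0.0 : (double) hits(recommended, heldOut, n) / heldOut.size();
  }

  public static void main(String[] args) {
    List<String> recs = List.of("a", "b", "c", "d", "e", "f", "g", "h", "i", "j");
    Set<String> held = Set.of("a", "c", "e", "x", "y", "z");
    System.out.println(precisionAtN(recs, held, 10) + " " + recallAtN(recs, held, 10));
  }
}
```

Averaging these two numbers over all users then gives the reported figures in step 4.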
[jira] Updated: (MAHOUT-283) Update assemblies to include mahout-collections for release build
[ https://issues.apache.org/jira/browse/MAHOUT-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-283: -- Fix Version/s: (was: 0.4) 0.3 Update assemblies to include mahout-collections for release build - Key: MAHOUT-283 URL: https://issues.apache.org/jira/browse/MAHOUT-283 Project: Mahout Issue Type: Sub-task Affects Versions: 0.3 Reporter: Drew Farris Assignee: Drew Farris Fix For: 0.3 Attachments: MAHOUT-283.patch The release assemblies need to be updated to include the new mahout-collections project. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-281) scm urls are wrong in the poms
[ https://issues.apache.org/jira/browse/MAHOUT-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-281: -- Fix Version/s: (was: 0.4) 0.3 scm urls are wrong in the poms -- Key: MAHOUT-281 URL: https://issues.apache.org/jira/browse/MAHOUT-281 Project: Mahout Issue Type: Bug Affects Versions: 0.3 Reporter: Benson Margulies Assignee: Benson Margulies Fix For: 0.3 Attachments: MAHOUT-281.diff The scm urls in the poms are wrong. This must be fixed before running the release plugin to make an 0.3 release. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-280) Clean some redundant POM declarations
[ https://issues.apache.org/jira/browse/MAHOUT-280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-280: -- Fix Version/s: (was: 0.4) 0.3 Clean some redundant POM declarations - Key: MAHOUT-280 URL: https://issues.apache.org/jira/browse/MAHOUT-280 Project: Mahout Issue Type: Improvement Affects Versions: 0.3 Reporter: Benson Margulies Assignee: Benson Margulies Fix For: 0.3 Attachments: MAHOUT-280.diff I am about to attach a simple patch to clean up some redundant stuff in the poms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.