[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836543#action_12836543
 ] 

Ankur commented on MAHOUT-305:
--

Sean, thanks for filing the JIRA. Noting points from our discussion here.

1. We need to decide on the dataset to run both implementations on. I have 
the Netflix dataset in mind, but a strange thing I observed during my tests 
with it is that there were 2 - 3 users who rated more than 10,000 movies! This 
seemed a little odd to me. Can you or someone else who has experience with the 
dataset validate my observation?

2. Both implementations need to run on the same dataset in an identical 
environment to gauge performance and accuracy. For accuracy I believe we need 
to do a precision-recall test. My understanding of it is:

  a) Do an 80-20 split of the data (80% train and 20% test), with the split 
happening on a timeline. 
  b) Feed the training data to the algorithm and generate recommendations for a 
subset of users from the training data. 
  c) Compare those recommendations with the items actually present in the 
user's history in the test data.
  d) Calculate precision = tp / (tp + fp) = (recommendations actually 
present in user's history) / (total items recommended)
  e) Calculate recall = tp / (tp + fn) = (recommendations actually 
present in user's history) / (total items in user's history)
  f) Finally, take a simple average of both across all the users to get an 
approximate global precision/recall. 

Please feel free to correct any of the steps above if I misunderstood anything.
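
Steps (c) to (f) can be sketched in Java roughly as follows. This is only a minimal illustration, not existing Mahout code: the class name PrecisionRecall, the use of Long item ids, and the assumption that each recommendation list contains no duplicates are all made up for the example.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper for steps (c) to (f); not part of Mahout. */
final class PrecisionRecall {

  /** tp: recommended items that appear in the user's test history. */
  static int truePositives(List<Long> recommended, Set<Long> testHistory) {
    int tp = 0;
    for (Long item : recommended) {  // assumes no duplicate recommendations
      if (testHistory.contains(item)) {
        tp++;
      }
    }
    return tp;
  }

  /** precision = tp / (tp + fp) = tp / (total items recommended). */
  static double precision(List<Long> recommended, Set<Long> testHistory) {
    if (recommended.isEmpty()) {
      return 0.0;
    }
    return (double) truePositives(recommended, testHistory) / recommended.size();
  }

  /** recall = tp / (tp + fn) = tp / (total items in the user's test history). */
  static double recall(List<Long> recommended, Set<Long> testHistory) {
    if (testHistory.isEmpty()) {
      return 0.0;
    }
    return (double) truePositives(recommended, testHistory) / testHistory.size();
  }

  /** Step (f): simple average of a per-user metric across all users. */
  static double average(double[] perUserValues) {
    if (perUserValues.length == 0) {
      return 0.0;
    }
    double sum = 0.0;
    for (double v : perUserValues) {
      sum += v;
    }
    return sum / perUserValues.length;
  }
}
```

For one user with recommendations {1, 2, 3, 4} and test history {2, 4, 9}, this gives precision 2/4 and recall 2/3.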

 Combine both cooccurrence-based CF M/R jobs
 ---

 Key: MAHOUT-305
 URL: https://issues.apache.org/jira/browse/MAHOUT-305
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Priority: Minor

 We have two different but essentially identical MapReduce jobs to make 
 recommendations based on item co-occurrence: 
 org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be 
 merged. Not sure exactly how to approach that but noting this in JIRA, per 
 Ankur.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Algorithm implementations in Pig

2010-02-22 Thread Jeff Zhang
Hi,

Glad to hear that Mahout devs are interested in Pig. I believe Pig is very
helpful when you want to quickly implement a prototype of a machine learning
algorithm. Pig also has a Java API, so it is easy to integrate a Pig script
with Java.  Maybe we can start by implementing NB using Pig.



On Mon, Feb 22, 2010 at 3:56 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 I have had both positive and negative results with PIG.

 The positive results were that I was able to express large recommendation
 computations in a very concise way.  That was really helpful.

 My negative results have been to do with the brittle nature of PIG vis a vis
 the version of the underlying hadoop system.  That problem may have abated
 somewhat as everybody in the world except me and Amazon's EMR has pretty
 much piled up on version 20.  I also know little about how Pig would
 interface well with other components.  I know that I have had difficulty in
 the past injecting outside information into Pig, but that has been
 improved.  I also know that Pigs eat anything, but have no clear idea how
 well this would play out with, say, our vector formats and vectorizers.

 Ankur, what recent experience do you have?  How well do PIG scripts play
 with other programs any more?

 On Sun, Feb 21, 2010 at 11:41 PM, Ankur C. Goel gan...@yahoo-inc.com
 wrote:

  I had Sean's opinion on this and he was not too comfortable with the Idea
  of having things in different languages in Mahout. However, given the
  benefits of PIG, I feel otherwise. I may be biased here due to my own
  experience of being able to do more in lesser time in Pig then in  M/R, so I
  thought let me ask how folks feel.
 
  Ted, I believe you have some PIG experience yourself so any thoughts on
  this ?
 



 --
 Ted Dunning, CTO
 DeepDyve




-- 
Best Regards

Jeff Zhang


Re: Algorithm implementations in Pig

2010-02-22 Thread Ted Dunning
I see pig as useful for data preparation, but for any numerical tasks, it is
likely to be completely hopeless.

-- 
Ted Dunning, CTO
DeepDyve


Re: Algorithm implementations in Pig

2010-02-22 Thread Jeff Zhang
Pig only makes the MapReduce implementation easier; the numerical
computation can be done in UDFs. Also, piglet is a DSL on top of Pig Latin
that adds loop support to Pig.
http://github.com/iconara/piglet



-- 
Best Regards

Jeff Zhang


Re: Algorithm implementations in Pig

2010-02-22 Thread Robin Anil
On Mon, Feb 22, 2010 at 1:55 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 I see pig as useful for data preparation, but for any numerical tasks, it is
 likely to be completely hopeless.


PIG will be a great tool to experiment quickly on algorithms.  But with
people here trying to focus on using Vector to standardize the input/output
process, it will be tough for the small bunch here to port that to PIG, or
help PIG scripts reuse it. As long as the input/output of PIG-based
algorithms is based on VectorWritable, I don't see any problem with including
PIG. But bear in mind that the previous PIG submission
https://issues.apache.org/jira/browse/MAHOUT-106 still hasn't moved into
the trunk. If anyone is willing to help standardize on using PIG with
vectors as input, they are more than welcome.

One thing we definitely don't want at this point is for all the algorithms
to have different kinds of input formats.

Robin


Re: Algorithm implementations in Pig

2010-02-22 Thread Ankur C. Goel
Ted,
 The latest Pig release, 0.6.0 on Hadoop 20, is a clear winner, not just for 
performance but also for doing a better job of managing memory in its MR job 
pipeline. Support for both inner and outer skewed joins is something that I 
found indispensable when dealing with really large datasets. There is also 
support for streaming in Pig, which lets you stream your relation through an 
external perl/python/ruby... script. Support for UDFs in scripting languages 
is expected in the near future.

About interfacing with other systems, I assume you have an RDBMS in mind. There 
is a patch (for Pig 0.7) that lets you write directly from Pig to an RDBMS like 
MySQL. Support for writing directly to HBase was always there and has been 
improved, I believe. With the 0.7 release, Pig has decided to let its load/store 
functions rely on Hadoop's input/output formats, so our vector format shouldn't 
be a problem IMHO. The only thing I am concerned about is the not-too-efficient 
Tuple implementation in Pig, which does not give performance equivalent to 
Java MR.

Recently I implemented shingling in Pig and found it to work beautifully. One 
problem that I hit had to do with using clusters to generate recommendations, 
since some clusters were quite large (> 10 K). For this I needed to do a 
self-join and wanted the join load to be split evenly. That's where skewed join 
came to the rescue.

Apart from this, I also want to contribute my implementation to Mahout (the 
reason for starting this thread :-))

-...@nkur



Re: Algorithm implementations in Pig

2010-02-22 Thread Grant Ingersoll
I'm all for Pig, especially once we are a TLP.  I haven't had the proper time 
to review the PLSI implementation, but it looks useful.  I agree on the other 
points, though, in that I think it would be nice to have consistent formats 
based on Vector so that things can be more portable.


On Feb 22, 2010, at 2:41 AM, Ankur C. Goel wrote:

 Hi Folks,
   I would like to know how mahout community feels about having 
 some of the Mahout algorithms implemented in pig - 
 http://hadoop.apache.org/pig. The benefits of using Pig are many including.
 
 
 1.  Small learning curve, people with a bit of SQL knowledge will find it 
 very easy.
 2.  Operations like grouping, aggregations, join need just few lines of pig 
 code.
 3.  Insulation against Hadoop complexity - Job chains and JobConf.
 4.  Quick prototyping and hence increased programmer productivity.
 
 I had Sean's opinion on this and he was not too comfortable with the Idea of 
 having things in different languages in Mahout. However, given the benefits 
 of PIG, I feel otherwise. I may be biased here due to my own experience of 
 being able to do more in lesser time in Pig then in  M/R, so I thought let me 
 ask how folks feel.
 
 Ted, I believe you have some PIG experience yourself so any thoughts on this ?
 
 Regards
 -...@nkur



[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836598#action_12836598
 ] 

Robin Anil commented on MAHOUT-300:
---

We should multiply by sparsity instead of cardinality to calculate the speed 
in MB/s for the Sparse and Seq vectors, and by cardinality for the dense vector.
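
As a rough illustration of that calculation, the sketch below turns an operations-per-second figure into MB/s from the number of elements actually touched per operation. The class name ThroughputEstimate and the cost of 12 bytes per element (an 8-byte double plus a 4-byte int index) are assumptions made for this example, not values taken from the benchmark patch.

```java
/** Hypothetical helper; the 12-byte element size is an assumption. */
final class ThroughputEstimate {

  /** Assumed cost of one element: 8-byte double value plus 4-byte int index. */
  static final int BYTES_PER_ELEMENT = 12;

  /**
   * Converts operations per second into MB/s.
   *
   * @param opsPerSecond    measured vector operations per second
   * @param elementsTouched cardinality for a dense vector, or the number of
   *                        non-zero entries (sparsity) for a sparse vector
   */
  static double megabytesPerSecond(double opsPerSecond, int elementsTouched) {
    return opsPerSecond * elementsTouched * BYTES_PER_ELEMENT / 1e6;
  }
}
```

With this, a dense vector of cardinality 1000 would be reported via megabytesPerSecond(speed, 1000), while a sparse vector holding 300 non-zero entries would use megabytesPerSecond(speed, 300), so the sparse rate is no longer inflated by elements that were never read.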

 Solve performance issues with Vector Implementations
 

 Key: MAHOUT-300
 URL: https://issues.apache.org/jira/browse/MAHOUT-300
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.3
Reporter: Robin Anil
 Fix For: 0.3

 Attachments: MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, 
 MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch


 AbstractVector operations like times
   public Vector times(double x) {
     Vector result = clone();
     Iterator<Element> iter = iterateNonZero();
     while (iter.hasNext()) {
       Element element = iter.next();
       int index = element.index();
       result.setQuick(index, element.get() * x);
     }
     return result;
   }
 should be implemented as follows
   public Vector times(double x) {
     Vector result = clone();
     Iterator<Element> iter = result.iterateNonZero();
     while (iter.hasNext()) {
       Element element = iter.next();
       element.set(element.get() * x);
     }
     return result;
   }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Algorithm implementations in Pig

2010-02-22 Thread David Stuart
Seems like the guys at Twitter are going down the Pig/Hadoop route:
http://highscalability.com/blog/2010/2/19/twitters-plan-to-analyze-100-billion-tweets.html
It could be worth getting them on board the Mahout wagon, especially given the 
previous discussion about classification efforts:
http://old.nabble.com/Twitter-Classification-td27227638.html


[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836624#action_12836624
 ] 

Robin Anil commented on MAHOUT-300:
---

I think the irregularity is due to the sparse vector generation process, where 
duplicate index values can be generated, leaving some vectors much sparser 
than the sparsity value.

{code}
  Vector v = new SequentialAccessSparseVector(cardinality, sparsity); // sparsity!
  int[] indexes = new int[sparsity];
  double[] values = new double[sparsity];
  for (int j = 0; j < sparsity; j++) {
    double value = r.nextGaussian();
    // random indexes may repeat, so fewer than sparsity entries can end up set
    int index = sparsity < cardinality ? r.nextInt(cardinality) : j;
    v.set(index, value);
    indexes[j] = index;
    values[j] = value;
  }
{code}

Instead I suggest this:

{code}
  Vector v = new SequentialAccessSparseVector(cardinality, sparsity); // sparsity!
  boolean[] featureSpace = new boolean[cardinality]; // marks indexes already used
  int[] indexes = new int[sparsity];
  double[] values = new double[sparsity];
  int j = 0;
  // note: this loop only terminates if sparsity <= cardinality
  while (j < sparsity) {
    double value = r.nextGaussian();
    int index = r.nextInt(cardinality);
    if (!featureSpace[index]) {
      featureSpace[index] = true;
      indexes[j] = index;
      values[j++] = value;
      v.set(index, value);
    }
  }
{code}




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836630#action_12836630
 ] 

Robin Anil commented on MAHOUT-300:
---

Ted, your loop structure seems to be slower by about 150 MB/s than the 
null-based impl. Does it need more loops before optimisations kick in?
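
One way to check whether JIT warm-up explains such a gap is to run the operation a number of times before starting the clock. The sketch below is a generic illustration under that assumption; WarmupTimer and its parameters are hypothetical names and do not come from the MAHOUT-300 benchmark code.

```java
/** Hypothetical timing helper that discards warm-up iterations. */
final class WarmupTimer {

  /**
   * Runs the task warmupRuns times untimed, so the JIT compiler has a chance
   * to optimise it, then returns the mean time per call in nanoseconds over
   * timedRuns further calls.
   */
  static double meanNanosAfterWarmup(Runnable task, int warmupRuns, int timedRuns) {
    for (int i = 0; i < warmupRuns; i++) {
      task.run();  // not measured
    }
    long start = System.nanoTime();
    for (int i = 0; i < timedRuns; i++) {
      task.run();
    }
    return (System.nanoTime() - start) / (double) timedRuns;
  }
}
```

Comparing the result for a small and a large warmupRuns value gives a hint whether the measured loop has reached steady state; it does not replace a proper benchmarking harness.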




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836633#action_12836633
 ] 

Sean Owen commented on MAHOUT-300:
--

Tiny comment -- it will probably be wise to use BitSet rather than boolean[], 
as booleans are stored as a full 32-bit value (!). A 32x reduction in memory is 
non-trivial with cardinalities in the millions.
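
A BitSet-based variant of the unique-index generation could look roughly like the sketch below. The class and method names are made up for illustration, and the guard against count exceeding cardinality is an addition of this sketch. The exact memory saving depends on the JVM: on HotSpot a boolean[] element commonly occupies one byte, which would make the saving closer to 8x than 32x, but BitSet stays the more compact choice either way.

```java
import java.util.BitSet;
import java.util.Random;

/** Hypothetical helper that draws distinct random indexes using a BitSet. */
final class UniqueIndexSampler {

  /** Returns count distinct indexes in [0, cardinality), in draw order. */
  static int[] sample(int cardinality, int count, Random r) {
    if (count > cardinality) {
      // Without this guard the rejection loop below would never finish.
      throw new IllegalArgumentException("count must not exceed cardinality");
    }
    BitSet used = new BitSet(cardinality);  // one bit per possible index
    int[] indexes = new int[count];
    int j = 0;
    while (j < count) {
      int index = r.nextInt(cardinality);
      if (!used.get(index)) {  // skip indexes that were already drawn
        used.set(index);
        indexes[j++] = index;
      }
    }
    return indexes;
  }
}
```

Rejection sampling like this is cheap while count is well below cardinality; when the two are close, shuffling the full index range and taking a prefix would likely be the better fit.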




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836649#action_12836649
 ] 

Robin Anil commented on MAHOUT-300:
---

On dense data (1000, 1000)

{noformat}
DotProduct benchmarks

Benchmark         nCalls  sum (s)   min (ms)  max (ms)  mean (ms)  stdDev (ms)  Speed (/sec)  Rate (MB/s)
DenseVector       2       0.042869  0.0010    2.717     0.002143   0.027798     466537.6      5598.451
RandSparseVector  2       1.139837  0.046     21.51     0.056991   0.194404     17546.367     210.55641
SeqSparseVector   2       0.293336  0.01      3.156     0.014666   0.053138     68181.195     818.17444
Dense.dot(Rand)   2       0.882977  0.03      25.346    0.044148   0.30642      22650.646     271.80777
Dense.dot(Seq)    2       0.452817  0.011     26.567    0.02264    0.255753     44167.953     530.01544
Rand.dot(Dense)   2       1.330815  0.049     14.738    0.06654    0.212913     15028.385     180.34062
Rand.dot(Seq)     2       0.843993  0.027     53.265    0.042199   0.446643     23696.877     284.36255
Seq.dot(Dense)    2       0.931822  0.036     9.44      0.046591   0.131948     21463.326     257.55994
Seq.dot(Rand)     2       1.093099  0.049     4.017     0.054654   0.054681     18296.604     219.55927
{noformat}

On sparse data (1000, 300)
Don't compare the MB/s; look at the units/sec.

{noformat}
DotProduct benchmarks

Benchmark         nCalls  sum (s)   min (ms)  max (ms)  mean (ms)  stdDev (ms)  Speed (/sec)
DenseVector       2       0.048355  0.0010    6.525     0.002417   0.062427     413607.7
RandSparseVector  2       0.569326  0.018     33.768    0.028466   0.302488     35129.258
SeqSparseVector   2       0.338478  0.011     3.936     0.016923   0.059426     59088.03
Dense.dot(Rand)   2       0.408213  0.012     26.649    0.02041    0.237577     48994.03
Dense.dot(Seq)    2       0.205143  0.0040    27.028    0.010257   0.222142     97492.96
Rand.dot(Dense)   2       0.469473  0.017     3.969     0.023473   0.05819      42600.957
Rand.dot(Seq)     2       0.242953  0.01      3.042     0.012147   0.026846     82320.45
Seq.dot(Dense)    2       0.291587  0.011     4.704     0.014579   0.06257      68590.164
Seq.dot(Rand)     2       0.362947  0.014     7.04      0.018147   0.06777      (truncated)
{noformat}

(The message is cut off here in the archive; the last Speed value and the Rate 
row are missing.)

[jira] Updated: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-300:
--

Attachment: MAHOUT-300.patch




[jira] Updated: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-300:
--

Attachment: MAHOUT-300.patch

Increased loop by 3x to give more stability to perf values




[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1283#action_1283
 ] 

Ankur commented on MAHOUT-305:
--

Hey Sean,
   Have you played with the Netflix dataset? Are there really users who 
have rated more than 10,000 movies? For the PR (precision-recall) test, do we 
have something already that will work in this case, or is some coding required? 
Any other thoughts?




[jira] Assigned: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Ankur (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur reassigned MAHOUT-305:


Assignee: Ankur




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836679#action_12836679
 ] 

Robin Anil commented on MAHOUT-300:
---

I found the anomaly Jake was talking about. It was due to too many instanceof 
checks in dot in AbstractVector. I moved the code out and split it into smaller 
checks in the overridden dot of each of the impls. The numbers just doubled, 
confirming my suspicion that instanceof is a heavyweight operation.
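
The idea of moving the type checks out of the base class can be sketched as follows. Vec, DenseVec and SparseVec are hypothetical toy classes written for this illustration; they are not Mahout's AbstractVector hierarchy, and the sparse variant assumes its indexes are sorted.

```java
import java.util.Arrays;

/** Hypothetical base class: the generic dot needs no instanceof chain. */
abstract class Vec {
  abstract int size();
  abstract double get(int i);

  /** Fallback used when a subclass has no specialised path. */
  double dot(Vec other) {
    double sum = 0.0;
    for (int i = 0; i < size(); i++) {
      sum += get(i) * other.get(i);
    }
    return sum;
  }
}

final class DenseVec extends Vec {
  final double[] values;

  DenseVec(double... values) { this.values = values; }

  @Override int size() { return values.length; }
  @Override double get(int i) { return values[i]; }

  @Override
  double dot(Vec other) {
    // Only the one check that matters for this implementation.
    if (other instanceof DenseVec) {
      double[] o = ((DenseVec) other).values;
      double sum = 0.0;
      for (int i = 0; i < values.length; i++) {
        sum += values[i] * o[i];
      }
      return sum;
    }
    return super.dot(other);
  }
}

final class SparseVec extends Vec {
  final int size;
  final int[] indexes;    // sorted positions of the non-zero entries
  final double[] values;  // values[k] belongs to position indexes[k]

  SparseVec(int size, int[] indexes, double[] values) {
    this.size = size;
    this.indexes = indexes;
    this.values = values;
  }

  @Override int size() { return size; }

  @Override
  double get(int i) {
    int k = Arrays.binarySearch(indexes, i);
    return k >= 0 ? values[k] : 0.0;
  }

  @Override
  double dot(Vec other) {
    // Visit only the non-zero entries; no type check needed at all.
    double sum = 0.0;
    for (int k = 0; k < indexes.length; k++) {
      sum += values[k] * other.get(indexes[k]);
    }
    return sum;
  }
}
```

Each implementation decides locally which fast path applies, so a call never walks through checks that are irrelevant to its own type.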




[jira] Updated: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-300:
--

Attachment: MAHOUT-300.patch




[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836689#action_12836689
 ] 

Sean Owen commented on MAHOUT-305:
--

Yes there are some prolific users. I don't have anything ready-made for such a 
test; the existing eval framework won't work here. I think it would need a bit 
of coding to pull out some test data, run the job, compare the results.

I have only one little tweak to make to the procedure you mention here. Really, 
we ought to pull out the most-preferred movies as test data. After all, the 
recommendations will be for those movies that should be rated highly. We 
wouldn't want to punish the algorithm for failing to recommend something I have 
rated, but didn't like, over something I haven't rated but indeed would like.

One very crude way to do this is to remove all 5-star ratings in the data set, 
and see how many of those actually come back in the recommendations.

 Combine both cooccurrence-based CF M/R jobs
 ---

 Key: MAHOUT-305
 URL: https://issues.apache.org/jira/browse/MAHOUT-305
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Assignee: Ankur
Priority: Minor

 We have two different but essentially identical MapReduce jobs to make 
 recommendations based on item co-occurrence: 
 org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be 
 merged. Not sure exactly how to approach that but noting this in JIRA, per 
 Ankur.




[jira] Updated: (MAHOUT-304) MeanShift doesn't read from VectorWritable

2010-02-22 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-304:
--

Attachment: MAHOUT-304.patch

Jeff, MeanShift uses only IDs generated by the mapper to keep vector 
membership.  I don't yet see how you can get the membership information, i.e. 
Vector docid => Canopy Id. Isn't that job missing? Maybe for later, in 0.4?

 MeanShift doesn't read from VectorWritable
 --

 Key: MAHOUT-304
 URL: https://issues.apache.org/jira/browse/MAHOUT-304
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.3
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.3

 Attachments: MAHOUT-304.patch, MAHOUT-304.patch, MAHOUT-304.patch


 Need an M/R job for converting sequence file containing VectorWritable to 
 MeanShiftCanopy before the MeanShift M/R 




Re: Algorithm implementations in Pig

2010-02-22 Thread Ted Dunning
Actually, no.

I meant other programs written in pure Java.  It used to be that the very
restricted scripting ability of Pig made processing chains composed of Pig
and map-reduce programs very brittle.  In fact, just gluing together
multiple Pig programs used to be very ugly.

On Mon, Feb 22, 2010 at 12:42 AM, Ankur C. Goel gan...@yahoo-inc.com wrote:

 About interfacing with other systems I assume you have an RDBMS in mind.




-- 
Ted Dunning, CTO
DeepDyve


Re: Algorithm implementations in Pig

2010-02-22 Thread Ted Dunning
Has the interface for writing UDF's stabilized?  For quite some time, the
UDF API was changing every 3 months.

On Mon, Feb 22, 2010 at 12:35 AM, Jeff Zhang zjf...@gmail.com wrote:

 Pig can only make the implementation of map-reduce easier; the numerical
 computation can be done in a UDF.




-- 
Ted Dunning, CTO
DeepDyve


[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836706#action_12836706
 ] 

Jake Mannix commented on MAHOUT-300:


The sparse data is odd... (-vs 50 -sp 5000) (running with 1000, 300 is 
really not very sparse at all...)  

I haven't applied any newer patches (just the one I submitted most recently), 
but I have svn upped.

These results are counterintuitive.

{code}
BenchMarks: DotProduct (nCalls = 2500 for every row)

Operands                            sumTime    minTime  maxTime   meanTime    stdDevTime  Speed (/sec)  Rate (MB/s)
DenseVector                         3.660321s  1.31ms   10.149ms  1.464128ms  0.329025ms  683.0002      4098.001
RandomAccessSparseVector            1.481516s  0.459ms  36.691ms  0.592606ms  0.852156ms  1687.4606     10124.764
SequentialAccessSparseVector        0.448737s  0.102ms  4.552ms   0.179494ms  0.234261ms  5571.192      33427.152
Dense.dot(RandomAccess)             2.098937s  0.716ms  5.437ms   0.839574ms  0.179854ms  1191.0791     7146.4746
Dense.dot(SequentialAccess)         0.856259s  0.24ms   11.856ms  0.342503ms  0.286798ms  2919.6772     17518.062
RandomAccess.dot(Dense)             2.277742s  0.776ms  8.059ms   0.911096ms  0.268853ms  1097.5781     6585.4688
RandomAccess.dot(SequentialAccess)  0.607507s  0.18ms   4.509ms   0.243002ms  0.115022ms  4115.1787     24691.072
SequentialAccess.dot(Dense)         1.341608s  0.442ms  2.136ms   0.536643ms  0.171088ms  1863.4355     11180.613
SequentialAccess.dot(RandomAccess)  0.741622s  0.209ms  2.031ms   0.296648ms  0.115263ms  3370.9895     20225.936
{code}




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836713#action_12836713
 ] 

Robin Anil commented on MAHOUT-300:
---

Can I commit the latest, if you don't have any changes pending on your end? 
Whatever the case, we need to ensure correctness and proceed with 0.3. We are 
much better in terms of perf now than at the beginning of this issue.




[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836725#action_12836725
 ] 

Ankur commented on MAHOUT-305:
--

Typically, when doing a train-test data split, we divide the data on a 
timeline. As a simple example, if we have 10 days of data, then we would keep 
the last 2 days as test data and the remainder as training data. If we remove 
all 5-star ratings the crude way, we may not be able to ensure this condition; 
it is not a hard requirement, but still a best practice AFAIK.  Also, I am not 
sure whether 5-star ratings would be 20% or even 10% of the total data.
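
A timeline split like the one described above could be sketched as follows. 
This is only an illustration: the Rating class, its field names and the split 
method are assumptions made up for the example, not part of Mahout or of the 
Netflix data format.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch of a train/test split on a timeline; all names here are hypothetical. */
final class TimelineSplit {

  /** Minimal rating record assumed for this example. */
  static final class Rating {
    final long userId;
    final long itemId;
    final int stars;
    final long timestamp;

    Rating(long userId, long itemId, int stars, long timestamp) {
      this.userId = userId;
      this.itemId = itemId;
      this.stars = stars;
      this.timestamp = timestamp;
    }
  }

  /**
   * Sorts the ratings by time and returns two lists: index 0 is the training
   * data (oldest ratings), index 1 is the test data (the most recent
   * testFraction of the ratings).
   */
  static List<List<Rating>> split(List<Rating> ratings, double testFraction) {
    List<Rating> sorted = new ArrayList<>(ratings);
    sorted.sort(Comparator.comparingLong((Rating r) -> r.timestamp));
    int trainSize = (int) Math.round(sorted.size() * (1.0 - testFraction));
    List<List<Rating>> parts = new ArrayList<>();
    parts.add(new ArrayList<>(sorted.subList(0, trainSize)));
    parts.add(new ArrayList<>(sorted.subList(trainSize, sorted.size())));
    return parts;
  }
}
```

With testFraction = 0.2 this keeps the newest 20% of the ratings as test data, 
which matches the "last 2 days out of 10" example only if ratings are spread 
evenly over time; splitting on a cut-off date instead would be a small change.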

The crude way you mentioned is OK for a start, but I am not sure whether it's a 
fair evaluation. Also, with this we would effectively be calculating precision 
and recall as
precision = (5-star recommendations actually present in user's history) / 
(total 5-star recommendations)
recall = (5-star recommendations actually present in user's history) / (total 
5-star items in user's history)

Is that what you mean?




Re: Algorithm implementations in Pig

2010-02-22 Thread Ankur C. Goel
In the next Pig release (0.7), Pig's load/store func will move to using 
Hadoop's input/output formats, so some changes are planned for that: 
http://wiki.apache.org/pig/Pig070IncompatibleChanges
After that I don't expect any interface-level change in UDFs.

-...@nkur

On 2/22/10 10:10 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Has the interface for writing UDF's stabilized?  For quite some time, the
UDF API was changing every 3 months.

On Mon, Feb 22, 2010 at 12:35 AM, Jeff Zhang zjf...@gmail.com wrote:

 Pig can only make the implementation of map-reduce easier; the numerical
 computation can be done in a UDF.




--
Ted Dunning, CTO
DeepDyve



Re: Algorithm implementations in Pig

2010-02-22 Thread Ankur C. Goel
I agree with you, and while some of that has been remedied, I wouldn't say 
things are perfect. Scripting ability, while still limited, has better 
streaming support, so you can have relations streamed into a custom script 
executing in either the map or the reduce phase, depending upon where it is 
placed.

If you want to glue together a bunch of map-reduce programs and Pig scripts, 
then the best option is to invoke Pig from your Java program that also manages 
your M/R chain. The Hadoop workflow system (Oozie) is coming along, which 
should make this better.

For gluing together multiple Pig programs, the best available option is exec 
script.pig, which can be called from inside your script. However, it is not a 
very neat solution, since you would want to pass a bunch of things to the 
invoked script and also check that certain conditions exist. So again, a Java 
program or a perl/python/ruby script managing your chain is a better option.

Regards
-...@nkur

On 2/22/10 10:08 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Actually, no.

I meant other programs written in pure Java.  It used to be that the very
restricted scripting ability of Pig made processing chains composed of Pig
and map-reduce programs very brittle.  In fact, just gluing together
multiple Pig programs used to be very ugly.

On Mon, Feb 22, 2010 at 12:42 AM, Ankur C. Goel gan...@yahoo-inc.com wrote:

 About interfacing with other systems I assume you have an RDBMS in mind.




--
Ted Dunning, CTO
DeepDyve



[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836733#action_12836733
 ] 

Sean Owen commented on MAHOUT-305:
--

Say I've made the following ratings:

5 stars: Harry Potter
5 stars: Harry Potter 2
1 star: Maid in Manhattan

Say I remove Maid in Manhattan as test data. I run recommendations and it 
recommends to me Harry Potter 3 (which presumably I would rate highly). The 
implementation would be penalized for not returning Maid in Manhattan, when 
that's surely not what it should have returned.

Even if you take out only the most highly-rated movies as test data (this is 
what the existing CF precision/recall evaluator does), this phenomenon can 
still occur: the recommender could return a movie that's better than anything 
you've yet seen, but that would be considered 'bad' by this evaluation style. 
It's still not a fair test, but it's less unfair.

Yes you could take the 20% most-highly-rated movies from each user as test data 
if you like, not just 5-star.
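
As a rough sketch of that per-user selection: the class and method names below 
are hypothetical, and the tie-breaking and rounding rules are assumptions for 
the example rather than what the existing evaluator does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch: pick a user's most highly rated items as held-out test data. */
final class TopRatedHoldout {

  /**
   * Returns the item IDs of roughly the top `fraction` of one user's ratings,
   * highest rating first. At least one item is held out if the user has any.
   */
  static List<Long> topRated(Map<Long, Integer> ratingsByItem, double fraction) {
    List<Map.Entry<Long, Integer>> entries = new ArrayList<>(ratingsByItem.entrySet());
    // Highest rating first; ties are broken by item ID to keep the result stable.
    entries.sort((a, b) -> {
      int byStars = Integer.compare(b.getValue(), a.getValue());
      return byStars != 0 ? byStars : Long.compare(a.getKey(), b.getKey());
    });
    int count = Math.max(1, (int) Math.round(entries.size() * fraction));
    count = Math.min(count, entries.size());
    List<Long> heldOut = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      heldOut.add(entries.get(i).getKey());
    }
    return heldOut;
  }
}
```

With the three ratings from the example above, a fraction of 0.2 would hold out 
one of the 5-star movies and never the 1-star one.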

Say I ask for 10 recommendations. Precision @ 10 is the proportion of those 10 
that were in the users' history (top ratings). Recall @ 10 is the proportion of 
all top-rated items that appeared in those 10. I think this is a little 
different than what you're saying?
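
To make the two definitions concrete, here is a small sketch. The class and 
method names are made up for illustration; heldOut stands for the user's 
top-rated items that were removed as test data, and recommended is the ordered 
list returned by the recommender.

```java
import java.util.List;
import java.util.Set;

/** Sketch of precision and recall "at N" for a single user. */
final class PrecisionRecallAtN {

  /** Counts how many of the first n recommendations are held-out items. */
  static int hits(List<Long> recommended, Set<Long> heldOut, int n) {
    int hits = 0;
    int limit = Math.min(n, recommended.size());
    for (int i = 0; i < limit; i++) {
      if (heldOut.contains(recommended.get(i))) {
        hits++;
      }
    }
    return hits;
  }

  /** Proportion of the n requested recommendations that were held-out items. */
  static double precisionAtN(List<Long> recommended, Set<Long> heldOut, int n) {
    return n == 0 ? 0.0 : (double) hits(recommended, heldOut, n) / n;
  }

  /** Proportion of all held-out items that appeared in the first n recommendations. */
  static double recallAtN(List<Long> recommended, Set<Long> heldOut, int n) {
    return heldOut.isEmpty() ? 0.0 : (double) hits(recommended, heldOut, n) / heldOut.size();
  }
}
```

Note that precision here divides by n even if fewer than n recommendations came 
back; dividing by the number actually returned is an equally defensible choice.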




Re: Algorithm implementations in Pig

2010-02-22 Thread Ted Dunning
That isn't an issue here.  It is the invocation of pig programs and passing
useful information to them that is the problem.

On Mon, Feb 22, 2010 at 9:20 AM, Ankur C. Goel gan...@yahoo-inc.com wrote:

 Scripting ability while still limited has better streaming support so you
 can have relations streamed
 Into a custom script executing in either map or reduce phase depending upon
 where it is placed.




-- 
Ted Dunning, CTO
DeepDyve


Re: Algorithm implementations in Pig

2010-02-22 Thread Ted Dunning
As an interesting test case, can you write a Pig program that counts words?

BUT, it takes an input file name AND an input field name.

On Mon, Feb 22, 2010 at 9:56 AM, Ted Dunning ted.dunn...@gmail.com wrote:


 That isn't an issue here.  It is the invocation of pig programs and passing
 useful information to them that is the problem.


 On Mon, Feb 22, 2010 at 9:20 AM, Ankur C. Goel gan...@yahoo-inc.com wrote:

 Scripting ability while still limited has better streaming support so you
 can have relations streamed
 Into a custom script executing in either map or reduce phase depending
 upon where it is placed.




 --
 Ted Dunning, CTO
 DeepDyve




-- 
Ted Dunning, CTO
DeepDyve


[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836815#action_12836815
 ] 

Jake Mannix commented on MAHOUT-300:


With these opts: -vs 50 -sp 500 -nv 50 -l 500 -no 10

Dot product looks more sensible.  

Executive summary: fastest is  SequentialAccess.dot(Dense), clocking in at 
69,246 units/s, which is as expected. 

Leaderboard for dotProduct:
{code}
Seq.dot(Den) :  69,246 units/s
Seq.dot(Seq) :  63,958 units/s
Seq.dot(Rnd) :  49,638 units/s
Rnd.dot(Seq) :  39,019 units/s
Den.dot(Seq) :  30,337 units/s
Rnd.dot(Rnd) :  5,320 units/s
Den.dot(Rnd) :  5,177 units/s
Rnd.dot(Den) :  5,101 units/s
Den.dot(Den) :  516 units/s
{code}

{code}
INFO: DotProduct DenseVector 
sum = 48.442942s;
min = 1.554ms;
max = 32.55ms;
mean = 1.937717ms;
stdDev = 0.55081ms; 
Speed: 516.07104 UnitsProcessed/sec 3.0964262 MBytes/sec   

INFO: DotProduct RandSparseVector 
sum = 4.69924s;
min = 0.116ms;
max = 24.211ms;
mean = 0.187969ms;
stdDev = 0.343685ms; 
Speed: 5320.0093 UnitsProcessed/sec 31.920053 MBytes/sec 
  
INFO: DotProduct SeqSparseVector 
sum = 0.390877s;
min = 0.012ms;
max = 2.698ms;
mean = 0.015635ms;
stdDev = 0.037619ms; 
Speed: 63958.742 UnitsProcessed/sec 383.7524 MBytes/sec   

INFO: DotProduct Dense.dot(Rand) 
sum = 4.828592s;
min = 0.137ms;
max = 4.09ms;
mean = 0.193143ms;
stdDev = 0.052169ms; 
Speed: 5177.4927 UnitsProcessed/sec 31.064955 MBytes/sec   

INFO: DotProduct Dense.dot(Seq) 
sum = 0.823286s;
min = 0.0ms;
max = 4.606ms;
mean = 0.032931ms;
stdDev = 0.03774ms; 
Speed: 30366.117 UnitsProcessed/sec 182.1967 MBytes/sec   

INFO: DotProduct Rand.dot(Dense) 
sum = 4.900044s;
min = 0.14ms;
max = 3.969ms;
mean = 0.196001ms;
stdDev = 0.056772ms; 
Speed: 5101.995 UnitsProcessed/sec 30.61197 MBytes/sec
   
INFO: DotProduct Rand.dot(Seq) 
sum = 0.640713s;
min = 0.0ms;
max = 2.253ms;
mean = 0.025628ms;
stdDev = 0.041805ms; 
Speed: 39019.027 UnitsProcessed/sec 234.11417 MBytes/sec 
  
INFO: DotProduct Seq.dot(Dense) 
sum = 0.361031s;
min = 0.0ms;
max = 4.63ms;
mean = 0.014441ms;
stdDev = 0.040413ms; 
Speed: 69246.13 UnitsProcessed/sec 415.47675 MBytes/sec   

INFO: DotProduct Seq.dot(Rand) 
sum = 0.503642s;
min = 0.0090ms;
max = 5.203ms;
mean = 0.020145ms;
stdDev = 0.05134ms; 
Speed: 49638.434 UnitsProcessed/sec 297.8306 MBytes/sec   
{code}




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836817#action_12836817
 ] 

Ted Dunning commented on MAHOUT-300:



These are getting respectable!

As a quick hack, the fact that dot is commutative should make it possible to 
get identical results for dense.dot(seq) as for seq.dot(dense).  Likewise for 
dense.dot(rand).

A similar, but less dramatic win might come from rnd.dot(seq) being redone as 
seq.dot(rnd).
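
A minimal sketch of that hack, under the assumption that the sparse side is the 
cheaper one to iterate. Vec, DenseVec and SeqSparseVec are hypothetical toy 
types written for this example; they are not Mahout's Vector classes, and the 
instanceof check is just one way to do the dispatch.

```java
import java.util.Arrays;

/** Hypothetical minimal vector interface for this sketch. */
interface Vec {
  double get(int index);
  double dot(Vec other);
}

/** Toy dense vector. */
final class DenseVec implements Vec {
  private final double[] values;

  DenseVec(double... values) {
    this.values = values.clone();
  }

  public double get(int index) {
    return values[index];
  }

  public double dot(Vec other) {
    if (other instanceof SeqSparseVec) {
      // Dot is commutative: let the sparse operand drive the loop instead.
      return other.dot(this);
    }
    double sum = 0.0;
    for (int i = 0; i < values.length; i++) {
      sum += values[i] * other.get(i);
    }
    return sum;
  }
}

/** Toy sequential-access sparse vector with sorted indices. */
final class SeqSparseVec implements Vec {
  private final int[] indices;
  private final double[] values;

  SeqSparseVec(int[] indices, double[] values) {
    this.indices = indices.clone();
    this.values = values.clone();
  }

  public double get(int index) {
    int offset = Arrays.binarySearch(indices, index);
    return offset < 0 ? 0.0 : values[offset];
  }

  public double dot(Vec other) {
    // Only the stored (non-zero) entries contribute to the sum.
    double sum = 0.0;
    for (int i = 0; i < indices.length; i++) {
      sum += values[i] * other.get(indices[i]);
    }
    return sum;
  }
}
```

With this delegation, dense.dot(seq) does the same work as seq.dot(dense), so 
the two should end up with the same timing as well as the same result.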




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836818#action_12836818
 ] 

Jake Mannix commented on MAHOUT-300:


agreed, Ted.  

I'm liking that we're getting 60-70k units/s on Seq.dot(Den) and Seq.dot(Seq), 
with vectors with 500 nonzero elements.  

Since a dot requires a multiply and an add per nonzero element, this is doing 
60 mflops on my laptop in my IDE, with the browser running, etc.  Not bad.
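
The arithmetic behind that estimate, written out as a tiny helper (the class 
and method names are made up for the example):

```java
/** Back-of-the-envelope flops estimate for sparse dot products. */
final class DotThroughput {

  /** Each non-zero element costs one multiply and one add, i.e. two flops. */
  static double flopsPerSecond(double dotsPerSecond, int nonZeroElements) {
    int flopsPerElement = 2;
    return dotsPerSecond * nonZeroElements * flopsPerElement;
  }
}
```

For 60,000 to 70,000 dot products per second on vectors with 500 nonzero 
elements this gives 60 to 70 million floating-point operations per second.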




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836819#action_12836819
 ] 

Robin Anil commented on MAHOUT-300:
---

Seq.rand and rand.seq should get the same perf level now, with an instanceof 
removed.




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836826#action_12836826
 ] 

Jake Mannix commented on MAHOUT-300:


and now that my run (of three comments ago) is finally done, with dot product 
removed since it's already been reported.

This properly demonstrates how slow it is to build up a SeqAcc vector 
incrementally, since it's not random-access, among other things.
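
As an illustration of why incremental construction is costly for a structure 
that keeps its indices sorted, here is a small sketch. SortedSparseBuilder is a 
hypothetical class written for this example, not the SequentialAccess vector 
from the patch; it only assumes that entries live in sorted parallel arrays.

```java
import java.util.Arrays;

/** Sketch: a sparse vector that keeps its indices sorted on every write. */
final class SortedSparseBuilder {
  private int[] indices = new int[0];
  private double[] values = new double[0];

  /** Inserts or overwrites one entry; a new index shifts all later entries. */
  void set(int index, double value) {
    int pos = Arrays.binarySearch(indices, index);
    if (pos >= 0) {
      values[pos] = value; // index already present: overwrite in place
      return;
    }
    int insertAt = -(pos + 1);
    int[] newIndices = new int[indices.length + 1];
    double[] newValues = new double[values.length + 1];
    // Copy the entries before the insertion point, then the ones after it.
    System.arraycopy(indices, 0, newIndices, 0, insertAt);
    System.arraycopy(values, 0, newValues, 0, insertAt);
    newIndices[insertAt] = index;
    newValues[insertAt] = value;
    System.arraycopy(indices, insertAt, newIndices, insertAt + 1, indices.length - insertAt);
    System.arraycopy(values, insertAt, newValues, insertAt + 1, values.length - insertAt);
    indices = newIndices;
    values = newValues;
  }

  double get(int index) {
    int pos = Arrays.binarySearch(indices, index);
    return pos < 0 ? 0.0 : values[pos];
  }

  int[] indices() {
    return indices.clone();
  }
}
```

Every insert of a new index copies the existing entries, so building a vector 
of n entries one at a time costs on the order of n squared element copies, 
whereas a hash-based random-access vector would pay roughly constant time per 
insert.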

{code}
INFO: 
BenchMarks      DenseVector     RandSparseVector  SeqSparseVector

Clone
  nCalls        25000           25000             25000
  sum           222.552872s     34.923269s        34.251326s
  min           4.598ms         0.446ms           0.4ms
  max           265.445ms       184.352ms         182.734ms
  mean          8.902114ms      1.39693ms         1.370053ms
  stdDev        11.676773ms     4.533406ms        5.002041ms
  Speed (/sec)  112.33286       715.8551          729.89874
  Rate (MB/s)   0.6739971       4.2951303         4.379392

Create (copy)
  nCalls        25000           25000             25000
  sum           209.506424s     1.371177s         0.667553s
  min           1.427ms         0.0050ms          0.021ms
  max           11802.223ms     21.322ms          10.036ms
  mean          8.380256ms      0.054847ms        0.026702ms
  stdDev        27.862112ms     0.324031ms        0.130493ms
  Speed (/sec)  119.32809       18232.512         37450.207
  Rate (MB/s)   0.7159685       109.395065        224.70125

Create (incrementally)
  nCalls        25000           25000             25000
  sum           0.570172s       0.755783s         3.969259s
  min           0.0ms           0.0ms             0.093ms
  max           4.148ms         23.108ms          13.452ms
  mean          0.022806ms      0.030231ms        0.15877ms
  stdDev        0.060237ms      0.196128ms        0.192234ms
  Speed (/sec)  43846.414       33078.277         6298.405
  Rate (MB/s)   263.0785        198.46967         37.79043

org.apache.mahout.common.distance.CosineDistanceMeasure
  nCalls        25000           25000             25000
  sum           500.69893s      29.026116s        3.367885s
  min           16.147ms        0.896ms           0.086ms
  max           163.619ms       10.819ms          11.731ms
  mean          20.027957ms     1.161044ms        0.134715ms
  stdDev        4.146275ms      0.345399ms        0.092807ms
  Speed (/sec)  49.930202       861.29333         7423.056
  Rate (MB/s)   0.2995812       5.16776           44.538334

org.apache.mahout.common.distance.EuclideanDistanceMeasure
  nCalls        25000           25000             25000
  sum           501.080023s     26.812884s        3.649897s
  min           17.011ms        0.924ms           0.086ms
  max           120.138ms       9.692ms           13.113ms
  mean          20.0432ms       1.072515ms        0.145995ms
  stdDev        4.410452ms      0.262769ms        0.192273ms
  Speed (/sec)  49.89223        932.3876
Re: [jira] Updated: (MAHOUT-304) MeanShift doesn't read from VectorWritable

2010-02-22 Thread Jeff Eastman
If the Vector->MSCanopy pre-job outputs all of its canopies then each of 
those canopies would contain the generated canopyId and its canopy 
center would contain the original vector with its docId. Seems like one 
could use that data set to get the membership information in a separate 
post-processing step. Certainly the post-processing job should be for 
later, after the List<Vector> -> List<canopyId> optimization.



Robin Anil (JIRA) wrote:

 [ 
https://issues.apache.org/jira/browse/MAHOUT-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-304:
--

Attachment: MAHOUT-304.patch

Jeff, Meanshift uses only ids generated by the mapper to keep vector membership.  
I dont yet see how you can get the membership information i.e Vector docid = 
Canopy Id. Isnt that job missing? Maybe for later 0.4?

  

MeanShift doesn't read from VectorWritable
--

Key: MAHOUT-304
URL: https://issues.apache.org/jira/browse/MAHOUT-304
Project: Mahout
 Issue Type: Improvement
 Components: Clustering
   Affects Versions: 0.3
   Reporter: Robin Anil
   Assignee: Robin Anil
Fix For: 0.3

Attachments: MAHOUT-304.patch, MAHOUT-304.patch, MAHOUT-304.patch


Need an M/R job for converting sequence file containing VectorWritable to MeanShiftCanopy before the MeanShift M/R 



  




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836839#action_12836839
 ] 

Robin Anil commented on MAHOUT-300:
---

{noformat}
seq.seq     = 46,855
rand.seq    = 37,397
seq.dense   = 36,460
seq.rand    = 34,348
dense.seq   = 25,453
rand.rand   = 5,436
dense.rand  = 5,303
rand.dense  = 4,754
dense.dense = 477
{noformat}




Re: [jira] Updated: (MAHOUT-304) MeanShift doesn't read from VectorWritable

2010-02-22 Thread Robin Anil

 after the List<Vector> -> List<canopyId> optimization.

I did that in the patch. Take a look :)


[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836848#action_12836848
 ] 

Robin Anil commented on MAHOUT-300:
---

{noformat}
rand.rand   = 14,435
dense.rand  = 9,172
rand.dense  = 10,578
dense.dense = 477
{noformat}




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836909#action_12836909
 ] 

Jake Mannix commented on MAHOUT-300:


New benchmark additions:

{code}INFO:
BenchMarks columns: DenseVector, RandSparseVector, SeqSparseVector,
                    Dense.fn(Rand), Dense.fn(Seq), Rand.fn(Dense),
                    Rand.fn(Seq), Seq.fn(Dense), Seq.fn(Rand)

Clone
          DenseVector        RandSparseVector   SeqSparseVector
nCalls    25000              25000              25000
sum       222.609888s        0.427272s          32.833216s
min       4.509ms            0.0030ms           0.381ms
max       205.425ms          17.397ms           164.729ms
mean      8.904395ms         0.01709ms          1.313328ms
stdDev    11.839592ms        0.256237ms         4.730696ms
Speed     112.30409 /sec     58510.74 /sec      761.424 /sec
Rate      0.6738245 MB/s     351.06442 MB/s     4.568544 MB/s

Create (copy)
          DenseVector        RandSparseVector   SeqSparseVector
nCalls    25000              25000              25000
sum       153.385135s        1.316737s          0.654021s
min       1.291ms            0.0080ms           0.0ms
max       149.59ms           18.778ms           8.555ms
mean      6.135405ms         0.052669ms         0.02616ms
stdDev    9.730283ms         0.276396ms         0.116822ms
Speed     162.9884 /sec      18986.328 /sec     38225.074 /sec
Rate      0.9779304 MB/s     113.91796 MB/s     229.35042 MB/s

Create (incrementally)
          DenseVector        RandSparseVector   SeqSparseVector
nCalls    25000              25000              25000
sum       0.556807s          1.914268s          4.109328s
min       0.0ms              0.02ms             0.093ms
max       2.523ms            184.955ms          16.624ms
mean      0.022272ms         0.07657ms          0.164373ms
stdDev    0.038841ms         1.192837ms         0.214126ms
Speed     44898.863 /sec     13059.822 /sec     6083.72 /sec
Rate      269.39316 MB/s     78.35893 MB/s      36.50232 MB/s

DotProduct
          DenseVector        RandSparseVector   SeqSparseVector
nCalls    25000              25000              25000
sum       48.730579s         1.214007s          0.421372s
min       1.581ms            0.0040ms           0.0ms
max       14.217ms           26.558ms           2.628ms
mean      1.949223ms         0.04856ms          0.016854ms
stdDev    0.342952ms         0.216698ms         0.028979ms
Speed     513.0249 /sec      20592.96 /sec      59330.0 /sec

          Dense.fn(Rand)     Dense.fn(Seq)      Rand.fn(Dense)
nCalls    25000              25000              25000
sum       2.091561s          0.883674s          2.110771s
min       0.036ms            0.0ms              0.033ms
max       9.386ms            8.269ms            8.159ms
mean      0.083662ms         0.035346ms         0.08443ms
stdDev    0.070128ms         0.065883ms         0.064003ms

          Rand.fn(Seq)       Seq.fn(Dense)      Seq.fn(Rand)
nCalls    25000              25000              25000
sum       0.571964s          0.370673s          0.624421s
min       0.018ms            0.0ms              0.019ms
max       1.525ms            1.674ms            7.62ms
mean      0.022878ms         0.014826ms         0.024976ms
stdDev    0.026759ms         0.034967ms         0.059001ms

[jira] Updated: (MAHOUT-304) MeanShift doesn't read from VectorWritable

2010-02-22 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-304:
--

Affects Version/s: (was: 0.3)
   0.4
Fix Version/s: (was: 0.3)
   0.4

 MeanShift doesn't read from VectorWritable
 --

 Key: MAHOUT-304
 URL: https://issues.apache.org/jira/browse/MAHOUT-304
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.4
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.4

 Attachments: MAHOUT-304.patch, MAHOUT-304.patch, MAHOUT-304.patch


 Need an M/R job for converting sequence file containing VectorWritable to 
 MeanShiftCanopy before the MeanShift M/R 




[jira] Updated: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-22 Thread Jake Mannix (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jake Mannix updated MAHOUT-301:
---

Attachment: MAHOUT-301.patch

Fancy new version.  Run as follows:

Set your $MAHOUT_CONF_DIR to a directory where you will have your own overrides 
(or, if unset, defaults to ./core/src/main/resources).

In that directory, there should be a file called driver.classes.props with 
contents like so:
{code}
org.apache.mahout.utils.vectors.VectorDumper=vecDump
org.apache.mahout.utils.clustering.ClusterDumper=clusty
org.apache.mahout.utils.SequenceFileDumper=seqDump
org.apache.mahout.clustering.kmeans.KMeansDriver=kmeans
org.apache.mahout.clustering.canopy.CanopyDriver=canopy
org.apache.mahout.utils.vectors.lucene.Driver=luceneVecs
org.apache.mahout.text.SequenceFilesFromDirectory=dirToSeq
org.apache.mahout.text.WikipediaToSequenceFile=wikToSeq
org.apache.mahout.classifier.bayes.TestClassifier=TestClassifier
{code}

Etc.  The right hand side can be whatever you want, *but* whatever it is 
determines where MahoutDriver will look for a default properties file.  For 
example:

{code}
$MAHOUT_HOME/bin/mahout run wikToSeq
{code}

would look for the file $MAHOUT_CONF_DIR/wikToSeq.props and in that file, take 
each line and transform it into command line arguments for 
WikipediaToSequenceFile, using the following logic.

Each line of wikToSeq.props holds a key-value pair:

{code}
i | input = my/wiki/input/path
o | output = my/output/path
c | categories = my/wikiCategories/file
e | exactMatch = true
all = true
{code}

The part of the key before the vertical bar is the short-name of the argument 
to pass, and the second part is the long name.  If there is only one, they are 
assumed to be the same.

You can also pass Hadoop options here, like 
{code}
Djava.io.tmpdir = /var/tmp/mahout 
{code}

which would lead to the program being called with 
-Djava.io.tmpdir=/var/tmp/mahout passed in.
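
A rough sketch of that line-to-argument translation, for anyone who wants to see the rule spelled out. This is not the code from the patch: the class name PropsToArgs, the "--longName value" output form (taken from the javadoc example on this issue) and the rule that a key starting with "D" and containing a dot is a Hadoop -D option are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns "short | long = value" lines into command-line arguments. */
final class PropsToArgs {
  static List<String> toArgs(List<String> lines) {
    List<String> args = new ArrayList<>();
    for (String line : lines) {
      int eq = line.indexOf('=');
      if (eq < 0) {
        continue; // not a key-value line, skip it
      }
      String key = line.substring(0, eq).trim();
      String value = line.substring(eq + 1).trim();

      // Assumption: keys such as "Djava.io.tmpdir" are Hadoop/JVM options.
      if (key.startsWith("D") && key.contains(".")) {
        args.add("-" + key + "=" + value);
        continue;
      }

      // "i | input" -> short name "i", long name "input";
      // a key without a bar serves as both short and long name.
      int bar = key.indexOf('|');
      String longName = bar < 0 ? key : key.substring(bar + 1).trim();
      args.add("--" + longName);
      args.add(value);
    }
    return args;
  }
}
```

With the wikToSeq.props lines above, this sketch would produce arguments such as --input my/wiki/input/path and --all true, followed by -Djava.io.tmpdir=/var/tmp/mahout for the Hadoop option line.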


 Improve command-line shell script by allowing default properties files
 --

 Key: MAHOUT-301
 URL: https://issues.apache.org/jira/browse/MAHOUT-301
 Project: Mahout
  Issue Type: New Feature
  Components: Utils
Affects Versions: 0.3
Reporter: Jake Mannix
Assignee: Jake Mannix
Priority: Minor
 Fix For: 0.4

 Attachments: MAHOUT-301.patch, MAHOUT-301.patch, MAHOUT-301.patch


 Snippet from javadoc gives the idea:
 {code}
 /**
  * General-purpose driver class for Mahout programs.  Utilizes org.apache.hadoop.util.ProgramDriver to run
  * main methods of other classes, but first loads up default properties from a properties file.
  *
  * Usage: run on Hadoop like so:
  *
  * $HADOOP_HOME/bin/hadoop -jar path/to/job org.apache.mahout.driver.MahoutDriver [classes.props file] shortJobName \
  *   [default.props file for this class] [over-ride options, all specified in long form: --input, --jarFile, etc]
  *
  * TODO: set the Main-Class to just be MahoutDriver, so that this option isn't needed?
  *
  * (note: using the current shell script, this could be modified to be just
  * $MAHOUT_HOME/bin/mahout [classes.props file] shortJobName [default.props file] [over-ride options]
  * )
  *
  * Works like this: by default, the file core/src/main/resources/driver.classes.props is loaded, which
  * defines a mapping between short names like VectorDumper and fully qualified class names.  This file may
  * instead be overridden on the command line by having the first argument be some string of the form *classes.props.
  *
  * The next argument to the Driver is supposed to be the short name of the class to be run (as defined in the
  * driver.classes.props file).  After this, if the next argument ends in .props / .properties, it is taken to
  * be the file to use as the default properties file for this execution, and key-value pairs are built up from that:
  * if the file contains
  *
  * input=/path/to/my/input
  * output=/path/to/my/output
  *
  * Then the class which will be run will have its main called with
  *
  *   main(new String[] { "--input", "/path/to/my/input", "--output", "/path/to/my/output" });
  *
  * After all the default properties are loaded from the file, any further command-line arguments are taken in,
  * and over-ride the defaults.
  */
 {code}
 Could be cleaned up, as it's kinda ugly with the whole file named in 
 .props, but gives the idea.  Really helps cut down on repetitive long 
 command lines, and lets defaults be put in props files instead of locked 
 into the code.




[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-22 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836952#action_12836952
 ] 

Jake Mannix commented on MAHOUT-301:


Oh, I forgot to finish my sentence which began "run as follows..."

Once you've got default property files in your $MAHOUT_CONF_DIR, you can run 
like so:

{code}
$MAHOUT_HOME/bin/mahout run wikToSeq
{code}

and that's it.  If you want to override the options in your wikToSeq.props 
file, just pass them in on that same command line above, and they override as 
desired.
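
A small sketch of the override behaviour described here, under stated assumptions: defaults and command-line options are both treated as "--name value" pairs, and the class name OverrideMerge is made up for illustration. The actual patch may handle flags and short names differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: command-line options replace defaults loaded from a props file. */
final class OverrideMerge {
  /**
   * @param defaults    option name (e.g. "--input") to default value, in file order
   * @param commandLine extra arguments given as alternating name/value pairs
   * @return the final argument list, defaults first, with overridden values replaced
   */
  static List<String> merge(Map<String, String> defaults, String[] commandLine) {
    Map<String, String> merged = new LinkedHashMap<>(defaults);
    for (int i = 0; i + 1 < commandLine.length; i += 2) {
      merged.put(commandLine[i], commandLine[i + 1]); // later value wins
    }
    List<String> args = new ArrayList<>();
    for (Map.Entry<String, String> entry : merged.entrySet()) {
      args.add(entry.getKey());
      args.add(entry.getValue());
    }
    return args;
  }
}
```

For example, if wikToSeq.props supplies --output my/output/path and the command line adds --output other/path, the merged list keeps the position of --output but carries the command-line value.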

If this can be tested out and debugged, this patch is ready for committing, and 
significantly improves the command line experience.

 Improve command-line shell script by allowing default properties files
 --

 Key: MAHOUT-301
 URL: https://issues.apache.org/jira/browse/MAHOUT-301
 Project: Mahout
  Issue Type: New Feature
  Components: Utils
Affects Versions: 0.3
Reporter: Jake Mannix
Assignee: Jake Mannix
Priority: Minor
 Fix For: 0.4

 Attachments: MAHOUT-301.patch, MAHOUT-301.patch, MAHOUT-301.patch





[jira] Created: (MAHOUT-306) Profile and improve perfomance of algorithms based on vectors

2010-02-22 Thread Robin Anil (JIRA)
Profile and improve perfomance of algorithms based on vectors
-

 Key: MAHOUT-306
 URL: https://issues.apache.org/jira/browse/MAHOUT-306
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.4
Reporter: Robin Anil
 Fix For: 0.4







[jira] Resolved: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil resolved MAHOUT-300.
---

Resolution: Fixed
  Assignee: Robin Anil

 Solve performance issues with Vector Implementations
 

 Key: MAHOUT-300
 URL: https://issues.apache.org/jira/browse/MAHOUT-300
 Project: Mahout
  Issue Type: Sub-task
Affects Versions: 0.3
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.3

 Attachments: MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, 
 MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, 
 MAHOUT-300.patch, MAHOUT-300.patch





[jira] Updated: (MAHOUT-306) Profile and improve performance of algorithms based on vectors

2010-02-22 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-306:
--

Summary: Profile and improve performance of algorithms based on vectors  
(was: Profile and improve perfomance of algorithms based on vectors)

 Profile and improve performance of algorithms based on vectors
 --

 Key: MAHOUT-306
 URL: https://issues.apache.org/jira/browse/MAHOUT-306
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.4
Reporter: Robin Anil
 Fix For: 0.4







[jira] Updated: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-300:
--

Issue Type: Sub-task  (was: Improvement)
Parent: MAHOUT-306

 Solve performance issues with Vector Implementations
 

 Key: MAHOUT-300
 URL: https://issues.apache.org/jira/browse/MAHOUT-300
 Project: Mahout
  Issue Type: Sub-task
Affects Versions: 0.3
Reporter: Robin Anil
 Fix For: 0.3

 Attachments: MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, 
 MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, 
 MAHOUT-300.patch, MAHOUT-300.patch





[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836962#action_12836962
 ] 

Robin Anil commented on MAHOUT-301:
---

The help comments are missing from the mahout/bin script. Scroll up that file 
and you will see a pretty-printed help string. Just add the Mahout driver 
description and possibly a wiki link there. Otherwise looks good to commit. I 
haven't checked the full functionality yet. If anyone else wants to take a 
look, please do so quickly.

 Improve command-line shell script by allowing default properties files
 --

 Key: MAHOUT-301
 URL: https://issues.apache.org/jira/browse/MAHOUT-301
 Project: Mahout
  Issue Type: New Feature
  Components: Utils
Affects Versions: 0.3
Reporter: Jake Mannix
Assignee: Jake Mannix
Priority: Minor
 Fix For: 0.4

 Attachments: MAHOUT-301.patch, MAHOUT-301.patch, MAHOUT-301.patch





Look! No more ISSUES

2010-02-22 Thread Robin Anil
Waiting for 301 to get committed.

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310751&styleName=Html&version=12314281

PMCs, it's in your hands now :D

Robin


Re: [jira] Updated: (MAHOUT-304) MeanShift doesn't read from VectorWritable

2010-02-22 Thread Jeff Eastman

Robin Anil wrote:

after the List<Vector> -> List<canopyId> optimization.



I did that in the patch. Take a look :)

  

+1 Simply marvelous


Re: more svn:ignore

2010-02-22 Thread Drew Farris
Ok, I've committed the ignores for .classpath, .project, .settings
created by eclipse and a couple target directories that hadn't been
excluded. I'll get the idea stuff on another pass once I figure out
how to do global wildcard ignores.

On Sun, Feb 21, 2010 at 7:53 AM, Sean Owen sro...@gmail.com wrote:
 IntelliJ 8 and before used the .ipr and .iws files -- IntelliJ 9 puts
 it in an .idea directory. It will auto-ignore these files. Still I
 think it doesn't hurt to ensure SVN never has anything to do with it.

 That and .DS_Store files from the Mac. maybe it already ignores them.

 On Sat, Feb 20, 2010 at 10:47 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 I don't know the normal conventions (and they all seem to have changed
 recently anyway).

 *.ipr is the project file and the workspace and project files used to be at
 the top level.  the module files could be below or not.

 The .idea directory is new and I don't grok it yet.  It would only appear at
 the top-level, I think.

 If you don't use IDEA, you might punt on this.  IDEA is pretty good about
 not checking extra goo in and I don't see eclipse users accidentally
 checking in IDEA files.

 On Sat, Feb 20, 2010 at 1:35 PM, Drew Farris drew.far...@gmail.com wrote:

 On Sat, Feb 20, 2010 at 3:04 PM, Ted Dunning ted.dunn...@gmail.com wrote:
  Wouldn't hurt to do the same for the IDEA project (*.ipr), module (*.iml)
  and workspace (*.iws) files.  Lately, it seems IDEA is keeping this all in a
  .idea sub-directory of the parent.

 so, just adding an svn:ignore on the parent for .idea would ignore
 what we're looking for? AFAIK, ignore is set only on a per directory
 basis, I'd have to do some minor digging to determine if we can ignore
 files with those extensions everywhere.




 --
 Ted Dunning, CTO
 DeepDyve




[jira] Updated: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-22 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-301:
---

Attachment: MAHOUT-301-drew.patch

Did some testing, here's a patch to clean some of these things up + a couple 
questions: 

Could we load the default driver.classes.props from the classpath? If it was 
loaded that way the default would work regardless of where the mahout script is 
run from (it currently only works if ./bin/mahout is run, not ./mahout for 
example) and regardless of whether we're running from a binary release or the 
dev environment. (included in patch)
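
For reference, loading a properties file from the classpath usually looks something like the sketch below. It only uses standard JDK calls; the class name ClasspathProps and the choice to return an empty Properties object when the resource is missing are assumptions for illustration, not necessarily what the attached patch does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper: reads a .props file from the classpath instead of a relative path. */
final class ClasspathProps {
  static Properties load(String resourceName) {
    Properties props = new Properties();
    ClassLoader loader = Thread.currentThread().getContextClassLoader();
    if (loader == null) {
      loader = ClasspathProps.class.getClassLoader();
    }
    // getResourceAsStream returns null when the resource is not on the classpath.
    try (InputStream in = loader.getResourceAsStream(resourceName)) {
      if (in != null) {
        props.load(in);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return props;
  }
}
```

A call such as ClasspathProps.load("driver.classes.props") would then behave the same whether the script is started from ./bin, from the project root, or from an unpacked release, as long as the file is packaged on the classpath.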

Something else I noticed is that the 'mahout' script doesn't add the classes in 
$MAHOUT_HOME/lib/*.jar to the classpath. This breaks the binary release in 
that it can't run anything, e.g.:

{code}
./mahout vectordump
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/cli2/OptionException
Caused by: java.lang.ClassNotFoundException: org.apache.commons.cli2.OptionException
{code}
(fixed in patch)

Using -core in the context of a dev build should work properly, but leaving out 
-core will cause the script to error unless run in the context of a release -- 
this is the way it should work, right?

Also wondering what the purpose of adding the job jars to the classpath is? 
(removed in patch)

Also added a help message for the 'run' argument.

Does executing './mahout run --help' hang for anyone else or is it something 
specific to my environment? (didn't track this one down)

 Improve command-line shell script by allowing default properties files
 --

 Key: MAHOUT-301
 URL: https://issues.apache.org/jira/browse/MAHOUT-301
 Project: Mahout
  Issue Type: New Feature
  Components: Utils
Affects Versions: 0.3
Reporter: Jake Mannix
Assignee: Jake Mannix
Priority: Minor
 Fix For: 0.4

 Attachments: MAHOUT-301-drew.patch, MAHOUT-301.patch, 
 MAHOUT-301.patch, MAHOUT-301.patch





Re: Algorithm implementations in Pig

2010-02-22 Thread Ankur C. Goel
Those would be passed as parameters, either through the -param option or 
through a parameter file with the -param_file option; Pig's preprocessor just 
substitutes the values into your script.
Since it's just a blind parameter substitution, in my shingling script I even 
had the schema definition passed to it. I suppose passing an input field name 
shouldn't be an issue as long as it is valid in the context of the script's 
execution plan.
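
To illustrate what "blind parameter substitution" means, here is a toy sketch in Java. It is not Pig's preprocessor: the class name ParamSubstitution is made up, and leaving unknown $names untouched is an assumption of this sketch. It only shows that a file name, a schema and a field name can all be spliced into the script text in the same way.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of purely textual $name substitution in a script. */
final class ParamSubstitution {
  // A parameter is "$" followed by an identifier; "$0"-style positional references do not match.
  private static final Pattern PARAM = Pattern.compile("\\$([A-Za-z_][A-Za-z0-9_]*)");

  static String substitute(String script, Map<String, String> params) {
    Matcher matcher = PARAM.matcher(script);
    StringBuffer out = new StringBuffer();
    while (matcher.find()) {
      String value = params.get(matcher.group(1));
      // Unknown names are left as they are in this sketch.
      String replacement = value != null ? value : matcher.group();
      matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
    }
    matcher.appendTail(out);
    return out.toString();
  }
}
```

With parameters input, schema and field, a template such as A = LOAD '$input' AS ($schema); B = FOREACH A GENERATE $field; becomes an ordinary script, which is all that Ted's word-count test case (an input file name plus an input field name) needs.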

-...@nkur

10 11:32 PM, Ted Dunning ted.dunn...@gmail.com wrote:

As an interesting test case, can you write a pig program that counts words.

BUT, it takes an input file name AND an input field name.

On Mon, Feb 22, 2010 at 9:56 AM, Ted Dunning ted.dunn...@gmail.com wrote:


 That isn't an issue here.  It is the invocation of pig programs and passing
 useful information to them that is the problem.


 On Mon, Feb 22, 2010 at 9:20 AM, Ankur C. Goel gan...@yahoo-inc.com wrote:

 Scripting ability, while still limited, has better streaming support, so you
 can have relations streamed into a custom script executing in either the map
 or reduce phase, depending upon where it is placed.




 --
 Ted Dunning, CTO
 DeepDyve




--
Ted Dunning, CTO
DeepDyve



Re: Algorithm implementations in Pig

2010-02-22 Thread Ted Dunning
Good answer.

On Mon, Feb 22, 2010 at 8:52 PM, Ankur C. Goel gan...@yahoo-inc.com wrote:

 Those would be passed as parameters, either through the -param option or
 through a parameter file with the -param_file option; Pig's preprocessor just
 substitutes the values into your script.
 Since it's just a blind parameter substitution, in my shingling script I
 even had the schema definition passed to it. I suppose passing an input field
 name shouldn't be an issue as long as it is valid in the context of the
 script's execution plan.




[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837120#action_12837120
 ] 

Robin Anil commented on MAHOUT-301:
---

Including the job jar is much cleaner than adding all deps. Plus there is 
nothing more to configure to execute it on top of Hadoop.

BTW, how is Hadoop execution done using the shell script? i.e.

hadoop jar mahout-examples-0.3.job o.a.m...DictionaryVectorizer --input . 
args

 Improve command-line shell script by allowing default properties files
 --

 Key: MAHOUT-301
 URL: https://issues.apache.org/jira/browse/MAHOUT-301
 Project: Mahout
  Issue Type: New Feature
  Components: Utils
Affects Versions: 0.3
Reporter: Jake Mannix
Assignee: Jake Mannix
Priority: Minor
 Fix For: 0.4

 Attachments: MAHOUT-301-drew.patch, MAHOUT-301.patch, 
 MAHOUT-301.patch, MAHOUT-301.patch





[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837123#action_12837123
 ] 

Ankur commented on MAHOUT-305:
--

With co-occurrence analysis we are dropping ratings. So if a lot of the people 
who watched Harry Potter also watched Maid in Manhattan, it will have a higher 
chance of getting recommended regardless of ratings.

I am trying not to be influenced too much by ratings, as that is not the 
strength of this algorithm. Where it really shines is when you have lots and 
lots of sparse user click data, where a click may be present or absent: 
something like an online book store or a shopping site. We are sticking with 
Netflix as there is no such publicly available dataset AFAIK.

Ok, so moving forward with the action plan, here is what I propose to do. 
Please feel free to suggest modifications.

1. For each user, take out the most recent movies that he has rated 3, 4 or 5 
as TEST data. Use the remaining as TRAIN data.
2. Run both implementations in an identical environment on test data and 
record runtimes and results.
3. Join recommendation results with TEST data on 'user' key and calculate 
precision & recall.
4. Report average precision & recall.
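
Step 1 of the plan above could be sketched as follows. This is only an 
illustration: the class and field names are made up, and the cap maxTest on 
the number of held-out movies per user is an assumption, since the plan does 
not say how many recent movies to take.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the per-user TRAIN/TEST split; all names are illustrative.
class HoldOutSplitSketch {

  /** One rating by a single user. */
  static final class Rating {
    final long movieId;
    final int stars;
    final long timestamp;

    Rating(long movieId, int stars, long timestamp) {
      this.movieId = movieId;
      this.stars = stars;
      this.timestamp = timestamp;
    }
  }

  /**
   * Moves up to maxTest of the user's most recent ratings of 3, 4 or 5 into test;
   * every other rating goes into train. maxTest is an assumed parameter.
   */
  static void split(List<Rating> userRatings, int maxTest,
                    List<Rating> train, List<Rating> test) {
    List<Rating> newestFirst = new ArrayList<>(userRatings);
    newestFirst.sort(Comparator.comparingLong((Rating r) -> r.timestamp).reversed());
    for (Rating r : newestFirst) {
      if (r.stars >= 3 && test.size() < maxTest) {
        test.add(r);
      } else {
        train.add(r);
      }
    }
  }

  public static void main(String[] args) {
    List<Rating> ratings = Arrays.asList(
        new Rating(1L, 5, 100L), new Rating(2L, 2, 200L),
        new Rating(3L, 4, 300L), new Rating(4L, 3, 50L));
    List<Rating> train = new ArrayList<>();
    List<Rating> test = new ArrayList<>();
    split(ratings, 2, train, test);
    System.out.println("test size = " + test.size() + ", train size = " + train.size());
  }
}
```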
 
Ok so when separating top ratings as TEST data, for each user:

Precision @ 10 = (3,4,5 rating movies recommended & actually present) / 10
Recall @ 10 = (3,4,5 rating movies recommended & actually present) / (all 
3,4,5 movies seen by user)

Hope this was more clear.
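
As a rough sketch, the two per-user formulas above could be computed like 
this. The class and method names are made up for illustration. It assumes the 
recommendation list has no duplicates and that "relevant" holds the user's 
held-out movies rated 3, 4 or 5.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of per-user precision/recall at n; names are illustrative.
class PrecisionRecallSketch {

  /** Counts how many of the top n recommendations are in the relevant set. */
  static int hits(List<Long> recommended, Set<Long> relevant, int n) {
    int hits = 0;
    for (Long item : recommended.subList(0, Math.min(n, recommended.size()))) {
      if (relevant.contains(item)) {
        hits++;
      }
    }
    return hits;
  }

  /** Precision @ n = hits / n, as in the formula above with n = 10. */
  static double precisionAt(List<Long> recommended, Set<Long> relevant, int n) {
    return (double) hits(recommended, relevant, n) / n;
  }

  /** Recall @ n = hits / (number of relevant movies); 0 if the user has none. */
  static double recallAt(List<Long> recommended, Set<Long> relevant, int n) {
    return relevant.isEmpty() ? 0.0 : (double) hits(recommended, relevant, n) / relevant.size();
  }

  public static void main(String[] args) {
    List<Long> recommended = Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L);
    Set<Long> relevant = new HashSet<>(Arrays.asList(2L, 5L, 11L, 12L));
    System.out.println("precision@10 = " + precisionAt(recommended, relevant, 10));
    System.out.println("recall@10 = " + recallAt(recommended, relevant, 10));
  }
}
```

Averaging these two values over all evaluated users gives the numbers to 
report in step 4.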


 Combine both cooccurrence-based CF M/R jobs
 ---

 Key: MAHOUT-305
 URL: https://issues.apache.org/jira/browse/MAHOUT-305
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Assignee: Ankur
Priority: Minor

 We have two different but essentially identical MapReduce jobs to make 
 recommendations based on item co-occurrence: 
 org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be 
 merged. Not sure exactly how to approach that but noting this in JIRA, per 
 Ankur.




[jira] Updated: (MAHOUT-283) Update assemblies to include mahout-collections for release build

2010-02-22 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-283:
--

Fix Version/s: (was: 0.4)
   0.3

 Update assemblies to include mahout-collections for release build
 -

 Key: MAHOUT-283
 URL: https://issues.apache.org/jira/browse/MAHOUT-283
 Project: Mahout
  Issue Type: Sub-task
Affects Versions: 0.3
Reporter: Drew Farris
Assignee: Drew Farris
 Fix For: 0.3

 Attachments: MAHOUT-283.patch


 The release assemblies need to be updated to include the new 
 mahout-collections project.




[jira] Updated: (MAHOUT-281) scm urls are wrong in the poms

2010-02-22 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-281:
--

Fix Version/s: (was: 0.4)
   0.3

 scm urls are wrong in the poms
 --

 Key: MAHOUT-281
 URL: https://issues.apache.org/jira/browse/MAHOUT-281
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.3
Reporter: Benson Margulies
Assignee: Benson Margulies
 Fix For: 0.3

 Attachments: MAHOUT-281.diff


 The scm urls in the poms are wrong. This must be fixed before running the 
 release plugin to make an 0.3 release.




[jira] Updated: (MAHOUT-280) Clean some redundant POM declarations

2010-02-22 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-280:
--

Fix Version/s: (was: 0.4)
   0.3

 Clean some redundant POM declarations
 -

 Key: MAHOUT-280
 URL: https://issues.apache.org/jira/browse/MAHOUT-280
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.3
Reporter: Benson Margulies
Assignee: Benson Margulies
 Fix For: 0.3

 Attachments: MAHOUT-280.diff


 I am about to attach a simple patch to clean up some redundant stuff in the 
 poms.
