Re: Possible contributions

2011-05-28 Thread Dmitriy Lyubimov
Yes. There's always a workaround. Say you have input1 and it's tab-separated text with 3 attributes, and you have another input2 in a sequence file with another 6 attributes. So yes, you could run 2 map-only jobs on them to bring them to a homogeneous format, with a join key indicating which part ...
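A minimal sketch of one of those map-only normalization jobs. All names here (TagInput1, the "input1:" key prefix) are illustrative, and it assumes the join key sits in the first tab-separated field - none of this is from the thread itself:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TagInput1 {

  // Tags each tab-separated record with its source so a later join
  // can tell which input a record came from.
  public static class TagTextMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      // fields[0] is assumed to be the join key; the rest are the 3 attributes.
      ctx.write(new Text("input1:" + fields[0]), line);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "tag-input1");
    job.setJarByClass(TagInput1.class);
    job.setMapperClass(TagTextMapper.class);
    job.setNumReduceTasks(0); // map-only: just rewrite the records
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A twin job would read input2's sequence file and emit the same key/value shape with an "input2:" prefix; a single downstream job can then join the two homogeneous outputs.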

Re: Possible contributions

2011-05-28 Thread Sean Owen
You can't mix and match old and new APIs in general, no. It's better to use new APIs unless it would make the implementation really hard or really slow. The new APIs lack MultipleInputs as of 0.20.x. That doesn't mean you can't have multiple inputs. You can add several input paths as Shannon says ...

Re: Possible contributions

2011-05-28 Thread Dmitriy Lyubimov
A job's input path is always potentially multiple paths; you don't need multiple inputs to specify that. What you need multiple inputs for is to be able to specify different input file formats and assign different mappers to handle them. If all your input is formatted homogeneously, both record structure-wise ...

Re: Possible contributions

2011-05-28 Thread Dmitriy Lyubimov
As I said, and as I think Shannon's reply confirms in part, you sometimes can weasel your way out of this, but this is not how this API is intended to be used. To begin with, the old and new APIs have never been intended to be used together (so you are already breaking interop guarantees with any future ...

Re: Possible contributions

2011-05-28 Thread Shannon Quinn
Isn't this just a matter of making multiple calls to FileInputFormat.addInputPath(...) (to adhere to the new APIs)? On 5/28/11 5:54 PM, Dmitriy Lyubimov wrote: > I don't see how you can use the deprecated multiple inputs, as, if I am not missing anything, its signature is tied to old-API types, such as JobConf ...
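A minimal sketch of what Shannon is describing, with made-up paths: the new (org.apache.hadoop.mapreduce) API accepts any number of input paths on a single job, even without MultipleInputs.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MultiPathJobSetup {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "multi-path-job");
    // Several input paths, one job: fine under the new API.
    FileInputFormat.addInputPath(job, new Path("/data/input1"));
    FileInputFormat.addInputPath(job, new Path("/data/input2"));
    // The catch: every path is read with the same InputFormat and fed to
    // the same Mapper class, which is the limitation Dmitriy points out.
  }
}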

Re: Possible contributions

2011-05-28 Thread Dmitriy Lyubimov
I don't see how you can use the deprecated multiple inputs because, if I am not missing anything, its signature is tied to old-API types, such as JobConf, which you of course won't have as you define a new-API job. On Sat, May 28, 2011 at 3:43 PM, Dhruv Kumar wrote: > Isabel and Dmitriy, thank you for your input on this ...
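For illustration, the signature Dmitriy means: the deprecated org.apache.hadoop.mapred.lib.MultipleInputs takes a JobConf and old-API mapper classes, so it cannot be attached to a new-API Job. IdentityMapper is used here only to keep the sketch self-contained.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class OldApiMultipleInputs {
  public static void main(String[] args) {
    JobConf conf = new JobConf(OldApiMultipleInputs.class);
    // Each path gets its own InputFormat and its own old-API Mapper --
    // exactly the per-input flexibility the new API lacks in 0.20.x.
    MultipleInputs.addInputPath(conf, new Path(args[0]),
        TextInputFormat.class, IdentityMapper.class);
    MultipleInputs.addInputPath(conf, new Path(args[1]),
        SequenceFileInputFormat.class, IdentityMapper.class);
  }
}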

Re: Possible contributions

2011-05-28 Thread Shannon Quinn
As far as I understand, the problem isn't adding multiple inputs; you can do it exactly as the documentation you linked shows. The problem (which is what we're trying to solve in MAHOUT-537) is how to tell, within the Mapper/Reducer itself, which input path the current data are taken from; there's ...
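A minimal sketch of the usual workaround for exactly this (the mapper name and the path test are hypothetical): when the job uses a file-based input format, the mapper can inspect its InputSplit to learn which path it is reading.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SourceAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

  private boolean fromInput1;

  @Override
  protected void setup(Context ctx) {
    // Holds only when the input format hands out FileSplits
    // (true for FileInputFormat and its subclasses).
    FileSplit split = (FileSplit) ctx.getInputSplit();
    fromInput1 = split.getPath().toString().contains("input1");
  }

  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    // Branch on the source so one mapper class can serve both inputs.
    ctx.write(new Text(fromInput1 ? "input1" : "input2"), value);
  }
}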

Re: Possible contributions

2011-05-28 Thread Dhruv Kumar
Isabel and Dmitriy, Thank you for your input on this. I've noticed that Mahout's code uses the new mapreduce package, so I have been following the new APIs. This was also suggested by Sean w.r.t. MAHOUT-294. Multiple inputs are a requirement for my project and I was planning on using the old mapred ...

Re: Possible contributions

2011-05-28 Thread Dmitriy Lyubimov
Dhruv, Just a warning before you lock yourself into the new APIs: yes, the new APIs are preferable, but it is not always possible to use them, because 0.20.2 lacks _a lot_ in terms of bare necessities in the new-API realm (multiple inputs/outputs come to mind at once). I think I did weasel my way out ...

Re: Possible contributions

2011-05-27 Thread Isabel Drost
On 18.05.2011 Dhruv Kumar wrote: > For the GSoC project, which version of Hadoop's API should I follow? Try to use the new M/R APIs where possible - we had the same discussion in an earlier thread on spectral clustering; in addition, Sean just opened an issue concerning upgrading to newer Hadoop versions ...

Re: Possible contributions

2011-05-24 Thread Ted Dunning
Good man. On Mon, May 23, 2011 at 3:45 PM, Hector Yee wrote: > FYI the ICLA has been filed.

Re: Possible contributions

2011-05-23 Thread Hector Yee
FYI the ICLA has been filed. On Wed, May 18, 2011 at 3:27 AM, Ted Dunning wrote: > Hector, an in-core variant or a sequential on-disk variant is a great starting point, and focussing on the kernelized ranker is also a good place to start. It would help if you can provide lots of visibility early in the process ...

Re: Possible contributions

2011-05-18 Thread Dhruv Kumar
On Wed, May 18, 2011 at 6:38 AM, Sean Owen wrote: > I think it first has to finish embracing MapReduce! The code base already uses 2.5 different versions of Hadoop. It would be better to clean up the modest clutter of approaches we already have before thinking about extending it. For the ...

Re: Possible contributions

2011-05-18 Thread Hector Yee
I just completed and submitted an online passive-aggressive classifier as my test case (MAHOUT-702). I believe I've followed the how-to, except I couldn't find a CHANGES.txt to write my changes in. On Wed, May 18, 2011 at 6:27 PM, Ted Dunning wrote: > Hector, an in-core variant or a sequential ...

Re: Possible contributions

2011-05-18 Thread Ted Dunning
Well, this much I think is uncontroversial. On Wed, May 18, 2011 at 3:38 AM, Sean Owen wrote: > And I do think we need to focus on cleanup now rather than later. For example, I will shortly suggest deprecating M/R jobs that use Hadoop 0.19 APIs in the name of moving forward.

Re: Possible contributions

2011-05-18 Thread Sean Owen
I think it first has to finish embracing MapReduce! The code base already uses 2.5 different versions of Hadoop. It would be better to clean up the modest clutter of approaches we already have before thinking about extending it. Good news is there's a fair bit of time before any other particular framework ...

Re: Possible contributions

2011-05-18 Thread Ted Dunning
Hector, An in-core variant or a sequential on-disk variant is a great starting point, and focussing on the kernelized ranker is also a good place to start. It would help if you can provide lots of visibility early in the process. If the JIRA process of attaching a diff becomes cumbersome, then you ...

Re: Possible contributions

2011-05-18 Thread Ted Dunning
This is a theme that is going to come up over and over. I think that, strategically, Mahout is going to have to embrace the MapReduce NextGen work so that we can have flexible computation models. We already need this with all the large-scale SVD work. We could very much use it for the SGD st...

Re: Possible contributions

2011-05-18 Thread Grant Ingersoll
https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute On May 18, 2011, at 1:17 AM, Hector Yee wrote: > Re: boosting scalability, I've implemented it on thousands of machines, but not with MapReduce, rather with direct RPC calls. The gradient computation tends to be iterative, so ...

Re: Possible contributions

2011-05-17 Thread Hector Yee
Re: boosting scalability, I've implemented it on thousands of machines, but not with MapReduce, rather with direct RPC calls. The gradient computation tends to be iterative, so one way to do it is to run each iteration as its own MapReduce job: compute gradients in the mapper, gather them in the reducer, ...
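A minimal sketch of that per-iteration pattern, not Hector's actual code: a linear model with squared loss, where the mapper emits per-record partial gradients keyed by weight index and the reducer sums them. The driver (omitted) would broadcast the current weights - here, naively, through the job configuration key "model.weights", an assumed name - apply the update w -= eta * g, and launch the next iteration's job.

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GradientIteration {

  // Input lines look like: label<TAB>f0,f1,...  Weights arrive via the config.
  public static class GradientMapper
      extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
    private double[] w;

    @Override
    protected void setup(Context ctx) {
      String[] parts = ctx.getConfiguration().get("model.weights").split(",");
      w = new double[parts.length];
      for (int i = 0; i < parts.length; i++) {
        w[i] = Double.parseDouble(parts[i]);
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] cols = value.toString().split("\t");
      double label = Double.parseDouble(cols[0]);
      String[] feats = cols[1].split(",");
      double dot = 0;
      for (int i = 0; i < w.length; i++) {
        dot += w[i] * Double.parseDouble(feats[i]);
      }
      double err = dot - label;
      // Emit one partial-gradient component per weight index.
      for (int i = 0; i < w.length; i++) {
        ctx.write(new IntWritable(i),
                  new DoubleWritable(err * Double.parseDouble(feats[i])));
      }
    }
  }

  // Sums the partial gradients; the driver applies the actual update.
  public static class SumReducer
      extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
    @Override
    protected void reduce(IntWritable idx, Iterable<DoubleWritable> parts,
                          Context ctx) throws IOException, InterruptedException {
      double sum = 0;
      for (DoubleWritable d : parts) {
        sum += d.get();
      }
      ctx.write(idx, new DoubleWritable(sum));
    }
  }
}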

Re: Possible contributions

2011-05-17 Thread Ted Dunning
On Tue, May 17, 2011 at 5:26 PM, Hector Yee wrote: > I have some proposed contributions and I wonder if they will be useful in Mahout (otherwise I will just commit it in a new open source project on GitHub). These generally sound pretty good. > - Sparse autoencoder (think of it as something ...

Possible contributions

2011-05-17 Thread Hector Yee
Hello, Some background on myself: I was at Google for the last 5 years, working on machine learning for the self-driving car, image search, and YouTube (http://www.linkedin.com/in/yeehector). I have some proposed contributions and I wonder if they will be useful in Mahout (otherwise I will just commit it in a new open source project on GitHub) ...