Hi Dmitriy,

here's the code. It does cooccurrence analysis with log-likelihood ratio tests. I haven't run it on a cluster yet:
https://gist.github.com/sscdotopen/8314254

--sebastian

On 07.01.2014 23:53, Dmitriy Lyubimov wrote:
> @Sebastian,
> wanna post a link?
>
>
> On Tue, Jan 7, 2014 at 2:46 PM, Sebastian Schelter <[email protected]> wrote:
>
>> I also have some spark cooccurrence analysis code lying around that
>> might be a nice contribution.
>>
>> On 07.01.2014 23:44, Dmitriy Lyubimov wrote:
>>> if you want to contribute to Mahout, obviously you want to speak to Mahout
>>> dev audience. Spark is not yet officially integrated into Mahout, but we
>>> are actively contemplating it and I have been doing some work off SVN e.g.
>>> https://issues.apache.org/jira/browse/MAHOUT-1346,
>>> https://issues.apache.org/jira/browse/MAHOUT-1365 and some other algorithm
>>> ports.
>>>
>>>
>>> On Tue, Jan 7, 2014 at 1:30 PM, Oleksandr Olgashko <[email protected]> wrote:
>>>
>>>> Didn't work with Spark before (just read their overview page).
>>>> Should i ask arising questions here or better switch to Spark's mailing
>>>> lists?
>>>>
>>>>
>>>> 2014/1/7 Sebastian Schelter <[email protected]>
>>>>
>>>>> IIRC that papers talks about MapReduce on a shared-memory system, not on
>>>>> a shared-nothing system such as the Hadoop implementation.
>>>>>
>>>>> As a rule of thumb, iterations in Hadoop are about 10x slower than in
>>>>> systems such as Giraph, Spark or Stratosphere.
>>>>>
>>>>> --sebastian
>>>>>
>>>>> On 07.01.2014 22:01, Oleksandr Olgashko wrote:
>>>>>> What can you say about
>>>>>> http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf
>>>>>> ?
>>>>>>
>>>>>>
>>>>>> 2014/1/7 Dmitriy Lyubimov <[email protected]>
>>>>>>
>>>>>>> yes. Create working notes how exactly to do that. (Or, what i am a bit
>>>>>>> pushing you towards, Spark, since MR is not really iteration friendly
>>>>>>> platform and it looks like iterations are needed in fastICA.).
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 7, 2014 at 12:38 PM, Oleksandr Olgashko <[email protected]> wrote:
>>>>>>>
>>>>>>>> So the problem is to adapt ICA for MR, am i right?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014/1/7 Dmitriy Lyubimov <[email protected]>
>>>>>>>>
>>>>>>>>> i already looked at fast ICA. while it claims to be parallel, this work
>>>>>>>>> doesn't exactly map it into map reduce (or spark) paradigm and from what i
>>>>>>>>> can recollect still implies outer iterations for fitting principal
>>>>>>>>> component vectors one by one. Which means it probably already is
>>>>>>>>> MR-unfriendly by construction; Spark may show far better promise here but
>>>>>>>>> still a working notes document is required to show how exactly. that's what
>>>>>>>>> i mean.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jan 7, 2014 at 1:35 AM, Oleksandr Olgashko <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Could you please take a look on this article?
>>>>>>>>>> http://cran.r-project.org/web/packages/fastICA/fastICA.pdf
>>>>>>>>>> I have learned that re-inventing the wheel is wrong for most problems, and
>>>>>>>>>> usually exists a better solution. However, it often needs some "grinding",
>>>>>>>>>> so I may research those ways, in case of approval.
>>>>>>>>>>
>>>>>>>>>> About Scala: unfortunately, I have never worked with this language before,
>>>>>>>>>> but wanted to. I'd like to fill that gap in my skills, but I don't know
>>>>>>>>>> exactly where to start.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2014/1/7 Dmitriy Lyubimov <[email protected]>
>>>>>>>>>>
>>>>>>>>>>> ICA is a very useful technique for dimensionality reduction. I believe
>>>>>>>>>>> Mahout would benefit from it; however challenges are fairly significant in
>>>>>>>>>>> terms of proven parallelization technique and acceptable efficacy, which
>>>>>>>>>>> makes it hard to just "implement" (I am not familiar at this point with any
>>>>>>>>>>> concrete work on parallel ICA). So like i said before i am not very
>>>>>>>>>>> hopeful. However, if one never tries, then nothing will get ever done. who
>>>>>>>>>>> knows.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jan 6, 2014 at 2:18 PM, Isabel Drost-Fromm <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jan 06, 2014 at 10:40:45PM +0200, Oleksandr Olgashko wrote:
>>>>>>>>>>>>> Returning back to question about theme to work, asked 2 months ago.
>>>>>>>>>>>>> What algorithm should I implement?
>>>>>>>>>>>>
>>>>>>>>>>>> To be quite frank with you: None. Personally I'd rather see improvements
>>>>>>>>>>>> (in terms of documentation, integration, stableisation, performance
>>>>>>>>>>>> optimisation) of the existing Mahout source.
>>>>>>>>>>>>
>>>>>>>>>>>> Feel free to take a closer look at the thread concerning "getting
>>>>>>>>>>>> involved" that we had around Christmas last year for inspiration.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Isabel
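
For reference, the log-likelihood ratio test mentioned at the top of this message can be sketched roughly as follows. This is a minimal single-machine Python illustration of the standard G^2 statistic on a 2x2 contingency table, not the Spark code in the gist. The function names (x_log_x, entropy, llr, cooccurrence_llr) and the idea of representing each item by the set of users who interacted with it are assumptions made for this example only.

```python
import math


def x_log_x(x):
    """Return x * ln(x), using the convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def entropy(*counts):
    """Unnormalized Shannon entropy of a group of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: both events occurred together
    k12: only the first event occurred
    k21: only the second event occurred
    k22: neither event occurred
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def cooccurrence_llr(users_a, users_b, n_users):
    """Score how strongly two items co-occur (hypothetical helper).

    users_a and users_b are sets of user ids that interacted with
    item A and item B; n_users is the total number of users.
    """
    both = len(users_a & users_b)
    only_a = len(users_a - users_b)
    only_b = len(users_b - users_a)
    neither = n_users - both - only_a - only_b
    return llr(both, only_a, only_b, neither)
```

A score near zero means the two items co-occur about as often as chance would predict; larger scores indicate a stronger association. A distributed version would additionally need to compute the four counts per item pair across the cluster, which this sketch does not attempt.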
