If you are looking for a good source of data, try the FIMI repository: http://fimi.cs.helsinki.fi/data/
Thanks, Neal.

On Sat, Sep 18, 2010 at 12:38 PM, Neal Richter <[email protected]> wrote:

> +1
>
> Try data other than your own as well.
>
> On 9/18/10, Ted Dunning <[email protected]> wrote:
> > Good advice relative to Mahout as well. Trying it on a smaller sample
> > will tell you if it is due to bad scaling or really a hangup.
> >
> > On Sat, Sep 18, 2010 at 12:03 PM, Mark <[email protected]> wrote:
> >
> >> Thanks. I'll give this a try and see how it performs.
> >>
> >> On 9/18/10 12:01 PM, Neal Richter wrote:
> >>
> >>> I suggest you take a sample of your data and run it on these
> >>> non-hadoop implementations of itemset miners, FPGrowth is one of the
> >>> available algorithms.
> >>>
> >>> http://www.borgelt.net/fpm.html
> >>>
> >>> If you have success on a small sample then start upscaling the sample
> >>> as well as investigate the distributions of your data.
> >>>
> >>> - Neal
> >>>
> >>> On Sat, Sep 18, 2010 at 12:30 PM, Ted Dunning <[email protected]> wrote:
> >>>
> >>>> In order to encourage your excellent practice of reposting, I will
> >>>> repeat my (non)-answer here.
> >>>>
> >>>> -------------------------------------------
> >>>> I don't know the answer to this, but previously this kind of problem
> >>>> was caused by highly skewed statistics in the input data.
> >>>>
> >>>> If there are things that cooccur with everything, then you will have
> >>>> problems with the speed of the algorithm.
> >>>>
> >>>> Can you say something about the distribution of your data? Can you
> >>>> post a frequency by rank table?
> >>>>
> >>>> On Sat, Sep 18, 2010 at 10:37 AM, Mark <[email protected]> wrote:
> >>>>
> >>>>> I am trying to run FPGrowth:
> >>>>>
> >>>>> hadoop jar /opt/mahout-0.3/mahout-examples-0.3.job
> >>>>> org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver -i
> >>>>> output/product/part-r-00000 -o pfp -method mapreduce -regex [\\t]
> >>>>> -s 5 -g 17500 -k 50
> >>>>>
> >>>>> However, the 3rd task, "Processing FPTree: Bottom Up FP Growth >
> >>>>> reduce", will not finish. It's basically stuck at 85% and hasn't
> >>>>> budged in over an hour. The output of the first task reported about
> >>>>> 37K features, so I set -g to 17500. Does anyone know what's going on
> >>>>> and how I can speed this up?
> >>>>>
> >>>>> Thanks
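For anyone following the thread, here is a rough sketch of how the sample and the frequency by rank table asked for above could be produced outside of Hadoop. It is not part of Mahout. The function names (read_transactions, sample_transactions, frequency_by_rank) are made up for illustration, and the input is assumed to be one transaction per line with tab-separated items, as the -regex [\\t] option in the quoted command suggests.

```python
import random
import re
from collections import Counter


def read_transactions(lines, sep=r"[\t]"):
    """Yield one list of items per non-empty line.

    `sep` is a regex; the default assumes tab-separated items.
    """
    splitter = re.compile(sep)
    for line in lines:
        line = line.strip("\n")
        if line.strip():
            yield [item for item in splitter.split(line) if item]


def sample_transactions(transactions, fraction, seed=0):
    """Keep each transaction independently with probability `fraction`."""
    rng = random.Random(seed)
    return [t for t in transactions if rng.random() < fraction]


def frequency_by_rank(transactions):
    """Return (rank, item, count) rows, most frequent item first.

    An item is counted at most once per transaction, so `count` is the
    number of transactions containing the item (its support).
    """
    counts = Counter()
    for t in transactions:
        counts.update(set(t))
    # Sort by descending count; break ties by item name for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, item, count)
            for rank, (item, count) in enumerate(ranked, start=1)]


if __name__ == "__main__":
    # Tiny made-up input; replace with the lines of a real transaction file.
    demo = ["a\tb\tc", "a\tb", "a\td", "a"]
    txns = list(read_transactions(demo))
    for rank, item, count in frequency_by_rank(txns):
        print(rank, item, count, f"{count / len(txns):.0%}")
```

If the top-ranked items show up in nearly every transaction, that is the kind of skew described above as a likely cause of the slow reduce step. Running the table on a small sample first (for example sample_transactions(txns, 0.01)) and then on larger fractions shows whether the shape of the distribution holds as the sample grows.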
