Upgrading from 0.6 and ClassifierContext

2014-01-23 Thread Grant Ingersoll
Hi, I'm upgrading some classification code from 0.6 to 0.8 and am wondering what the replacement is for the ClassifierContext? Thanks, Grant

[OT] Uses Cases for Taming Text, 2nd ed.

2014-01-20 Thread Grant Ingersoll
Hi Mahout Users, Drew Farris, Tom Morton and I are currently working on the 2nd Edition of Taming Text (http://www.manning.com/ingersoll for first ed.) and are soliciting interested parties who would be willing to contribute to a chapter on practical use cases (i.e. you have something in produc

Re: Question about clusterdump

2013-08-22 Thread Grant Ingersoll
238095238095238 > > Sorry if this is an obvious question but I find it hard to find details on > these specifics. > > Many thanks, > > Will Grant Ingersoll | @gsingers http://www.lucidworks.com

Apache Mahout 0.8 Released

2013-07-25 Thread Grant Ingersoll
The Apache Mahout PMC is pleased to announce the release of Mahout 0.8. Mahout's goal is to build scalable machine learning libraries focused primarily in the areas of collaborative filtering (recommenders), clustering and classification (known collectively as the "3Cs"), as well as the necessa

Mahout 0.8 Release Candidate

2013-07-08 Thread Grant Ingersoll
A _preview_ of release artifacts for 0.8 are at https://repository.apache.org/content/repositories/orgapachemahout-113/org/apache/mahout/. This is not an official release. I will call a vote in a day or two, pending feedback on this thread, so please review/test. A _preview_ of the release no

Re: Applying clustering techique

2013-06-13 Thread Grant Ingersoll
> >> You need to group by user before converting to vector to get sensible >> clustering. >> >> >> On Wed, Jun 12, 2013 at 1:06 PM, Grant Ingersoll >> wrote: >> >>> The CSVVectorIterator in the Integration package will take in a CSV file >

Re: Applying clustering techique

2013-06-12 Thread Grant Ingersoll
ng algorithm. My doubt is, Is > there any need to convert the movielens rating.csv file into a sequence > file. If needed what are the commands for applying clustering technique > using mahout and the hadoop. > > Thanking you, > Neetha Suan Thampi ------

Re: [DRAFT] 0.8 Release Announcement + Future Plans Discussion

2013-06-08 Thread Grant Ingersoll
.m.math.hadoop.decomposer and port all code that uses it to SSVD. No opinion. +1 on everything else. > > To all users and other committers, this is a biased first proposal, > please shout, if you see things different and want to have things kept. > > Best, > Sebastian > >

Re: Dictionary file format in Lucene-Mahout integration

2013-06-06 Thread Grant Ingersoll
parse -> rowid -> cvb. lucene.vector will still give you higher performance at the cost of extra storage (and the fact that it doesn't work in M/R and can't handle multiple directories). I'd say we keep it for now. > > > > > _

Re: Dictionary file format in Lucene-Mahout integration

2013-06-05 Thread Grant Ingersoll
as a sequence file from lucene.vector? > > Thanks for your help. > > James Grant Ingersoll | @gsingers http://www.lucidworks.com

Re: FP Growth

2013-06-04 Thread Grant Ingersoll
On Jun 2, 2013, at 10:42 AM, Sebastian Schelter wrote: > I don't think unmaintained code should stay in our codebase. +1 > This will > only create frustration amongst our users, as they will not get > questions answered and bugs fixed. It would also be an obstacle for a > 1.0 release, where we

FP Growth

2013-06-01 Thread Grant Ingersoll
FP Growth seems to not have a lot of dev support. Are there users out there using it? Should it live on or get the axe prior to 1.0? -Grant

Re: seq2sparse in 0.8 throwing class not found for analyzers

2013-04-24 Thread Grant Ingersoll
o >> ./contentDataDir/sparseVectors --namedVector -wt tf -a >> org.apache.lucene.analysis.EnglishAnalyzer >> >> java.lang.ClassNotFoundException: org.apache.lucene.analysis.EnglishAnalyzer >> >> Looking at the output from bin/mahout classpath >> >> it shows that lucene-analyzers-common-4.2.1.jar is in there as a dependancy >> so any idea why is the above throwing an exception. Grant Ingersoll | @gsingers http://www.lucidworks.com

[OT] Internships at LucidWorks

2013-02-13 Thread Grant Ingersoll
Hi, I'm looking for interns for the summer for those interested in Mahout and Machine Learning: Research Engineer Internship DESCRIPTION LucidWorks, the leading commercial company for Apache Lucene and Solr, is looking for interns to work on building next generation search, analytics and mach

Re: Clustering using Solr Index vs Lucene Index : Different Results

2013-01-30 Thread Grant Ingersoll
stering-using-Solr-Index-vs-Lucene-Index-Different-Results-tp4037198.html > Sent from the Mahout User List mailing list archive at Nabble.com. Grant Ingersoll http://www.lucidworks.com

Re: IndexFormatTooOldException with Solr4.0 ?

2012-11-19 Thread Grant Ingersoll
Is there a way to build and to use any actual version with Lucene 4.0? > > thanks, > > --tomw > > > > > > > Grant Ingersoll http://www.lucidworks.com

Re: Conversion of point numbers to key strings

2012-11-19 Thread Grant Ingersoll
Analyzer.(DefaultAnalyzer.java:34) >>... 11 more >> >> Any idea what causes this? >> Grant Ingersoll http://www.lucidworks.com

Re: If you're at Hadoop World this year

2012-10-21 Thread Grant Ingersoll
e, heavily modified) on top of YARN. >> >> See ya'll there. >> >> JP >> >> -- >> Twitter: @jpatanooga >> Principal Solution Architect @ Cloudera >> hadoop: http://www.cloudera.com >> Grant Ingersoll http://www.lucidworks.com

Re: TFIDFPartialVectorReducer minDf

2012-09-22 Thread Grant Ingersoll
nd SIPC. Unless clearly > stated, nothing herein shall be construed to be an offer to sell, nor a > solicitation of an offer to buy, any financial product. Grant Ingersoll http://www.lucidworks.com

SGD model sizes

2012-09-04 Thread Grant Ingersoll
Hi, I'm wondering if anyone has any rules of thumb around model size and memory usage for SGD? I'm doing some testing of it myself, but thought I would ask to see how it compares. Thanks, Grant

Re: clusterdump lucene document ID

2012-06-11 Thread Grant Ingersoll
ut-user/201204.mbox/%3cca+y9ocwgs2se7doqqrse3p+qe5gvxct8xutucfdzvgkjkpo...@mail.gmail.com%3E > Grant Ingersoll http://www.lucidimagination.com

Re: Commercializing Mahout: the Myrrix recommender platform

2012-04-21 Thread Grant Ingersoll
On Apr 20, 2012, at 12:05 PM, Hector Yee wrote: > On a related note, wish i could share the data i have to see how these > algorithms stack up to the ones we use for large scale learning. That certainly would be interesting. > > Are there other examples of large data sets people use? I know th

Re: Commercializing Mahout: the Myrrix recommender platform

2012-04-06 Thread Grant Ingersoll
, attention, and work to the Mahout >> project, rather than subtract from it. I hope to see more >> Mahout-related commercializations, beyond the inclusions in >> distributions we're already seeing, in 2012, as it's key to the >> long-term project health. It's most certainly going to be the year of >> the application layer (analytics, machine learning) for Big Data. >> >> Thank you! >> Sean >> Grant Ingersoll http://www.lucidimagination.com

[Job] Research Internships

2012-02-27 Thread Grant Ingersoll
Hi, I have internships open for this summer for students interested in working on search and machine learning. Description is below. -Grant Research Engineer Internship DESCRIPTION Lucid Imagination, the leading commercial company for Apache Lucene and Solr, is looking for interns to work on

Lucene Revolution in Boston in May (with a side of Mahout)

2012-02-24 Thread Grant Ingersoll
Hi Mahout's, Thought some here might be interested as search and machine learning often go together. -- Lucene Revolution will be here May 9-10 in Boston. Reserve your spot today with Early Bird pricing of $575. Committers and accepted speakers are entitled to free admission. Our CFP is op

Re: Goals for Mahout 0.7

2012-02-24 Thread Grant Ingersoll
>> and fixing it up". Since I think this is the only realistic approach >> to a next version, in this conversation I could not support anything >> approach that pretends to do five more things in the next version -- >> at least not unless accompanied by some plan to address the >> contributions already in line in JIRA. It's not OK to be implicitly >> rejecting so much from the community by not planning to fix that first >> and foremost. >> >> > Grant Ingersoll http://www.lucidimagination.com

Re: 0.7 Priorities

2012-02-22 Thread Grant Ingersoll
On Feb 22, 2012, at 7:24 AM, Jake Mannix wrote: > On recent threads on the dev@ list, and discussions off-list, it's pretty > clear that we need to have "cleanup" be a priority for the next release. > > How about this for a formal proposal: > > > - The 0.7 release will have issues (both ne

Re: status of hadoop hidden markov model in mahout

2012-02-01 Thread Grant Ingersoll
On Jan 31, 2012, at 2:14 PM, Keary Cavin wrote: > > Dhruv, I downloaded the MAHOUT-627 patch and applied the files to the current > mahout release. I'll let you know when I have questions. Note, the plan is to put this patch into 0.7 once the remaining test issue is fixed. -Grant

[Job] Research Engineer at Lucid Imagination

2012-02-01 Thread Grant Ingersoll
dwood City, California TRAVEL Minimal -------- Grant Ingersoll http://www.lucidimagination.com

Re: term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

2012-01-24 Thread Grant Ingersoll
wt tf --minSupport 2 > --minDF 2 --maxDFSigma 3 -seq > > Thanks, > John > > On Sun, Jan 22, 2012 at 3:00 PM, Grant Ingersoll wrote: > >> What were the command/options you were passing in? >> >> >> On Jan 18, 2012, at 4:26 PM, John Conw

Re: term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

2012-01-22 Thread Grant Ingersoll
> >DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, > outputDir, tfDirName, conf, minSupport, maxNGramSize, > > minLLRValue, -1.0f, false, reduceTasks, chunkSize, > sequentialAccessOutput, namedVectors); > > } > > -- > > Thanks, > John C > > > > > -- > > -- John C Grant Ingersoll http://www.lucidimagination.com

Re: Help using mahout for k-means clustering on existing vectors

2012-01-09 Thread Grant Ingersoll
e > there existing tools?) ---- Grant Ingersoll http://www.lucidimagination.com

Re: Help needed on TF IDF.

2012-01-09 Thread Grant Ingersoll
Doc 3 is ZZZ similar* Have a look at the RowSimilarityJob, which will do pairwise similarity. > * > * > Can you please help? > > -- > Regards > Junaid Grant Ingersoll http://www.lucidimagination.com
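For readers unfamiliar with the term, "pairwise similarity" here means scoring every pair of document vectors (rows) against each other. The following is a minimal in-memory sketch in plain Java, assuming cosine similarity as the measure; it is not the RowSimilarityJob code, which runs as a Hadoop job, and the class and method names are hypothetical.

```java
// Illustrative sketch only: plain Java, no Mahout or Hadoop dependencies.
// The class name and methods are hypothetical, not part of Mahout.
class PairwiseSimilaritySketch {

    // Cosine similarity between two dense vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // treat an all-zero row as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Symmetric matrix holding the similarity of every pair of rows.
    static double[][] pairwise(double[][] rows) {
        int n = rows.length;
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                sim[i][j] = cosine(rows[i], rows[j]);
                sim[j][i] = sim[i][j];
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        // Hypothetical term-weight rows for three documents.
        double[][] docs = { {1, 0, 1}, {2, 0, 2}, {0, 1, 0} };
        double[][] sim = pairwise(docs);
        for (int i = 0; i < sim.length; i++) {
            for (int j = i + 1; j < sim.length; j++) {
                System.out.println("Doc " + (i + 1) + " vs Doc " + (j + 1) + ": " + sim[i][j]);
            }
        }
    }
}
```

The double loop is quadratic in the number of rows, which is why a distributed job is the usual choice once the document count grows.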

Re: Help regarding Apache Mahout.

2012-01-04 Thread Grant Ingersoll
euters, amongst others, for examples of these in action. > > On Wed, Jan 4, 2012 at 8:31 AM, Grant Ingersoll wrote: >> Hu Junaid, >> >> Have a look at the SparseVectorsFromSequenceFiles class, as this does this >> already, in combination with SequenceFilesFromDirect

Re: Help regarding Apache Mahout.

2012-01-04 Thread Grant Ingersoll
totype to calculate the TF IDF from the documents > present in a directory. > > Can you please help me with the Steps to go about it using Apache Mahout? > Thank you. > > -- > Regards > Junaid ---- Grant Ingersoll http://www.lucidimagination.com

Re: Mahout on EMR

2012-01-04 Thread Grant Ingersoll
hms of Mahout on Amazon EMR including > clusterdumper following the instructions on: > > https://cwiki.apache.org/MAHOUT/mahout-on-elastic-mapreduce.html > > Thanks once again, > Ipshita -------- Grant Ingersoll http://www.lucidimagination.com

Re: SGD and memory

2012-01-03 Thread Grant Ingersoll
ASF projects. The basic task is to try and predict what project an email belongs to based on its content. > Are these textual > features? Or what? > > On Tue, Jan 3, 2012 at 2:53 PM, Grant Ingersoll wrote: > >> I'm trying to run the full ASF email SGD classifier p

SGD and memory

2012-01-03 Thread Grant Ingersoll
I'm trying to run the full ASF email SGD classifier problem and am facing heap size issues. My current setup has 105 features and I am using a cardinality of 100K. I'm using the AdaptiveLogisticRegression. I'm getting heap errors and they occur when trying to construct the ALR class (i.e. not

Re: how to download data for example asf-email-examples.sh?

2012-01-02 Thread Grant Ingersoll
ome notes in the script to document this a bit more. Note, there are some issues w/ this example and the SGD code that are still being worked through. See https://issues.apache.org/jira/browse/MAHOUT-904 for more info. ---- Grant Ingersoll http://www.lucidimagination.com

Re: all keys going to one reducer in subgram step of CollocDriver (?)

2011-12-28 Thread Grant Ingersoll
rong? > > I'm running a 0.6-SNAPSHOT I cloned today from github. Was considering > trying 0.5 but a quick look at recent changes doesn't seem to suggest this > code has changed in awhile... > > Cheers, > Mat Grant Ingersoll http://www.lucidimagination.com

Re: Will "mahout arff.vector" correctly convert string attributes?

2011-12-28 Thread Grant Ingersoll
should become a separate binary attribute, > MapBackedARFFModel.java doesn't seem to do the right thing. We can patch this if you have an alternate implementation. > > Seems like a compressed binary format would be useful for representing such > attributes, unless you also neede

Re: Will "mahout arff.vector" correctly convert string attributes?

2011-12-21 Thread Grant Ingersoll
will mahout insert derived attributes (hour of day, day > of week)? I presume not and I presume I have to add them myself. > > Thanks, Don Grant Ingersoll http://www.lucidimagination.com

Re: SequenceFile cast problems

2011-12-19 Thread Grant Ingersoll
s Grant that was the point of my first question.. >> Now I'll take a look at the vector implementation. >> Thanks again >> Daniele >> >> On 14 December 2011 23:44, Grant Ingersoll wrote: >>> While Ted answered the Dissector question, your original issu

Re: SequenceFile cast problems

2011-12-14 Thread Grant Ingersoll
> org.apache.mahout.math.VectorWritable >>> at >>> >>> >> org.apache.mahout.classifier.naivebayes.training.IndexInstancesMapper.map(IndexInstancesMapper.java:1) >>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) >>> at org.apache.hado

Re: SequenceFile cast problems

2011-12-13 Thread Grant Ingersoll
ethod through the seqdirectory program i get this error: > > java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to > org.apache.mahout.math.VectorWritable > > Do you have some hints on the right usage of this class? > > Thanks, > Daniele Volpi ---

Re: mahout exception (lucene.vector)

2011-12-09 Thread Grant Ingersoll
> > -- > View this message in context: > http://lucene.472066.n3.nabble.com/mahout-exception-lucene-vector-tp3569144p3569144.html > Sent from the Mahout User List mailing list archive at Nabble.com. Grant Ingersoll http://www.lucidimagination.com

Re: 20newsgroups example does not print verbose output

2011-12-04 Thread Grant Ingersoll
RK_DIR}/myproj-bydate/bayes-test-input \ > -type bayes \ > -ng 1 \ > -source hdfs \ > -v \ > -method mapreduce > > Any suggestions? Thanks > Grant Ingersoll http://www.lucidimagination.com

Re: DisplayKMean

2011-12-02 Thread Grant Ingersoll
Consider the examples ret > > > > > >Thanks and Regards, > >S SYED ABDUL KATHER > >9731841519 Grant Ingersoll http://www.lucidimagination.com

Re: DisplayKMean

2011-12-02 Thread Grant Ingersoll
> > >Thanks and Regards, >S SYED ABDUL KATHER > 9731841519 Grant Ingersoll http://www.lucidimagination.com

Re: Clustering graph coloring and layout

2011-12-01 Thread Grant Ingersoll
, Dec 1, 2011 at 3:32 AM, Ted Dunning wrote: >>> Sure. I attached it, but those get stripped. I didn't realize that this >>> was going to the list. >>> >>> Try here: http://dl.dropbox.com/u/36863361/cluster-viz.r >>> >>> And here for the i

Re: ASF archives?

2011-12-01 Thread Grant Ingersoll
I launched a micro instance and mounted the volume and downloaded it. That's the only way to get that exact data set that I am aware of. I've got a smaller sample up on the Lucid website. Otherwise, if you just want something like it, you can use your ASF credentials to get it. I can point y

Re: Clustering graph coloring and layout

2011-11-30 Thread Grant Ingersoll
le clusters are near. > > On Tue, Nov 29, 2011 at 8:03 AM, Grant Ingersoll wrote: > I'm still learning R, do you have code handy you could share? > > On Nov 29, 2011, at 6:25 AM, Ted Dunning wrote: > > > Coloring is pretty easy in R, which is what I use. I just bu

Re: Clustering graph coloring and layout

2011-11-29 Thread Grant Ingersoll
ns, I vary the transparency according to how seriously > down-sampled the cluster is. That lets me get a good visual feel for the > actual cluster size. > > On Tue, Nov 29, 2011 at 5:03 AM, Grant Ingersoll wrote: > >> Anyone have an easy algorithm for coloring clusters

Clustering graph coloring and layout

2011-11-29 Thread Grant Ingersoll
on https://issues.apache.org/jira/browse/MAHOUT-899) but would really like to be able to produce much prettier visualizations out of the box. -------- Grant Ingersoll http://www.lucidimagination.com

Re: MinHash Clustering in Mahout

2011-11-29 Thread Grant Ingersoll
xamples. > > > It may be a good idea to add 'JaccardDistance' measure to the existing > Distance measures in Mahout (unless there was a reason for not having it in > the first place). TanimotoDistanceMeasure is the Jaccard Distance. > > > > __
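The equivalence mentioned in this reply can be checked by hand: for 0/1 vectors, the Tanimoto coefficient a.b / (|a|^2 + |b|^2 - a.b) is the same number as the Jaccard coefficient |A and B| / |A or B| of the corresponding sets. The sketch below is plain Java written for illustration, assuming binary vectors; it is not the TanimotoDistanceMeasure source, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: shows that Tanimoto distance on 0/1 vectors
// equals Jaccard distance on the matching sets. Names are hypothetical.
class JaccardTanimotoSketch {

    // Jaccard distance: 1 - |A intersect B| / |A union B|.
    static double jaccardDistance(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // two empty sets are treated as identical
        }
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        return 1.0 - (double) intersection.size() / union.size();
    }

    // Tanimoto distance: 1 - a.b / (|a|^2 + |b|^2 - a.b).
    static double tanimotoDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double denominator = normA + normB - dot;
        if (denominator == 0.0) {
            return 0.0; // both vectors are all zeros
        }
        return 1.0 - dot / denominator;
    }

    public static void main(String[] args) {
        // The same two documents, once as term-index sets and once as 0/1 vectors.
        Set<Integer> setA = new HashSet<>(Arrays.asList(0, 1, 3));
        Set<Integer> setB = new HashSet<>(Arrays.asList(0, 2, 3));
        double[] vecA = {1, 1, 0, 1};
        double[] vecB = {1, 0, 1, 1};
        System.out.println("Jaccard distance:  " + jaccardDistance(setA, setB));
        System.out.println("Tanimoto distance: " + tanimotoDistance(vecA, vecB));
    }
}
```

With weighted (non-binary) vectors the Tanimoto formula still applies, while the set-based Jaccard form no longer has a direct counterpart.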

Re: MinHash Clustering in Mahout

2011-11-28 Thread Grant Ingersoll
ught the actual text content in both the > files is different. How? > > I am assuming that the NGram attribute was set to the default value of 1 when > creating the tf-idf vectors from sequence files. > > Suneel > > > >

Re: mahout command problems

2011-11-27 Thread Grant Ingersoll
the mahout root directory? > > Isabel -------- Grant Ingersoll http://www.lucidimagination.com

Reminder: SF Mahout User Meeting

2011-11-25 Thread Grant Ingersoll
For those in the San Francisco area, there will be a Mahout User Meeting on Nov. 29th at Lucid Imagination's offices. Details and RSVP are at http://sf-mahout-11-11.eventbrite.com/ For those not in the SF area, I _believe_ we will be recording it and posting it.

Re: Facing problem while fetching the document id from cluser

2011-11-25 Thread Grant Ingersoll
alue = new WeightedVectorWritable(); > while (reader.next(key, value)) > { > > System.out.println(value.toString() + " belongs to cluster "+ > key.toString()); > > } > > reader.close(); > But it is returning null . > Please help me to move further . > > Thanks and Regards, > S SYED ABDUL KATHER Grant Ingersoll http://www.lucidimagination.com

Re: MinHash Clustering in Mahout

2011-11-23 Thread Grant Ingersoll
t to the default value of 1 when > creating the tf-idf vectors from sequence files. > > Suneel > > > > > From: Grant Ingersoll > To: user@mahout.apache.org > Sent: Tuesday, October 25, 2011 5:55 AM > Subject: Re: MinHash C

Re: MinHash Clustering in Mahout

2011-11-23 Thread Grant Ingersoll
> If yes then I would update the wiki page >> https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering with >> the instructions. >> >> Otherwise if someone could tell me on what am I doing wrong. > > I haven't looked into the code, but I get similar outputs, so I assume it is > working. Might be good to incorporate this into the build-reuters.sh as well > as try it on some other input. > > -Grant Grant Ingersoll http://www.lucidimagination.com

Re: clustering hardware requirements

2011-11-22 Thread Grant Ingersoll
hout in Action. If they do what I think they do, I will definitely try > them, and probably complain on the list (Ted) if I can't interpret them right > :). > > Thanks for the reply, > > -- > Ioan Eugen Stan Grant Ingersoll http://www.lucidimagination.com

Re: Trouble understanding how to use the FP_Growth algorithm

2011-11-21 Thread Grant Ingersoll
>x2: JList[JPair[JList[String], JLong]]) = { > println(x1 + ":" + >x2.map(pair => "[" + pair.getFirst.mkString(",") + "] : " + > pair.getSecond).mkString("; ")) > } >

Large Scale Clustering

2011-11-18 Thread Grant Ingersoll
Might be of interest: "Clustering Very Large Multi-dimensional Datasets with MapReduce" http://www.cs.cmu.edu/~jclopez/ref/kdd2011-mr-clustering.pdf -------- Grant Ingersoll http://www.lucidimagination.com

Re: clustering hardware requirements

2011-11-18 Thread Grant Ingersoll
On Nov 16, 2011, at 9:39 PM, Ioan Eugen Stan wrote: > Hello, > > I have to figure out how much hardware is required to do clustering > for my company on about 10+ milion user accounts, each with 100-5000 > documents. The documents will be indexed so vector creation will be > done at indexing. >

Re: lsi

2011-11-17 Thread Grant Ingersoll
I've never implemented LSI. Is there a way to incrementally build the model (by simply indexing documents) or is it something that one only runs after the fact once one has built up the much bigger matrix? If it's the former, I bet it wouldn't be that hard to just implement the appropriate new

Re: NewsKMeansClustering does not find any clusters!

2011-11-17 Thread Grant Ingersoll
should I give the class? I tried to change the canopy > thresholds (250, 120) to some other numbers, tried also changing the > EuclideanDistanceMeasure for the canopy clustering to > CosineDistanceMeasure, with no use. > > Many thanks in advance, > Ahmad Grant Ingersoll http://www.lucidimagination.com

Re: lsi

2011-11-14 Thread Grant Ingersoll
Might be useful: https://github.com/algoriffic/lsa4solr Looks like it hasn't been kept up to date. On Nov 13, 2011, at 1:47 PM, Sebastian Schelter wrote: > Is there some documentation/tutorial available on how to build a LSI > pipeline with mahout and lucene? > > --sebastian

Re: incosistent output while using clusterdumper

2011-11-11 Thread Grant Ingersoll
> MSV-770{n=1 c=[0:-0.025,1:0.011,2:0.032,..etc > > As seen above in MSV-441 there is no presence of ":" in the output whereas > MSV-770 has ):-0.025. > Can anyone throw some light as to what is the difference and why is it > present there..?? > > Thanks.

Re: NewsKMeansClustering - the result most people want seems to be missing

2011-11-09 Thread Grant Ingersoll
solution anyhow). See the ClusterDumper code. > > I'm new to Mahout and have to admit I've been struggling even to get this > far. Any help would be gratefully received. > > > R Grant Ingersoll http://www.lucidimagination.com

Re: SGD TrainNewsGroups interim output

2011-11-09 Thread Grant Ingersoll
Cool, how about adding it to the Wiki? On Nov 9, 2011, at 8:15 AM, Suneel Marthi wrote: > I can put together a doc if we don't already have one, know the SGD code > pretty well. > > Regards, > Suneel > > > > ____ >

SGD TrainNewsGroups interim output

2011-11-09 Thread Grant Ingersoll
In the SGD TrainNewsGroups example, we have: System.out.printf("%.2f\t%.2f\t%.2f\t%.2f\t%.8g\t%.8g\t", maxBeta, nonZeros, positive, norm, lambda, mu); Do we have any docs explaining what these values mean and what one should be looking for to know whether the system is performing or not? Thanks

Re: Minhash key groups

2011-11-08 Thread Grant Ingersoll
. -Grant On Nov 7, 2011, at 8:54 PM, Suneel Marthi wrote: > Do we have an answer for this? > > Sent from my iPhone > > On Nov 2, 2011, at 7:20 AM, Grant Ingersoll wrote: > >> What's the Minhash key groups value used for in the MinhashDriver? I mean, >> I see

Re: Minhash key groups

2011-11-08 Thread Grant Ingersoll
I haven't seen an answer yet. I also asked on dev@. On Nov 7, 2011, at 8:54 PM, Suneel Marthi wrote: > Do we have an answer for this? > > Sent from my iPhone > > On Nov 2, 2011, at 7:20 AM, Grant Ingersoll wrote: > >> What's the Minhash key groups value

Re: creating vectors from lucene index which does NOT store vectors

2011-11-05 Thread Grant Ingersoll
st of the heavy lifting. On Nov 5, 2011, at 11:20 AM, Robert Stewart wrote: > Can you point me to the code in trunk which implements "lucene.vector" > command? > > Bob > > > On Nov 4, 2011, at 2:05 PM, Grant Ingersoll wrote: > >> Should be doable, but like

Re: getting mahout clustering info back into lucene

2011-11-05 Thread Grant Ingersoll
to make this work via codecs. -Grant -------- Grant Ingersoll http://www.lucidimagination.com

SF Apache Mahout User Meeting (MUM) Nov 29th @ Lucid Imagination HQ

2011-11-04 Thread Grant Ingersoll
g to have two speakers giving presentations related to Mahout: Ted Dunning, MapR and Grant Ingersoll of Lucid Imagination (me). Both Ted and Grant are long time committers on the Mahout project. Ted's talk: How and why random projections work? Mine: Using Mahout to Cluster, Classify a

Re: creating vectors from lucene index which does NOT store vectors

2011-11-04 Thread Grant Ingersoll
ector file that I can use with mahout. > > Probably I should not use internal docid, but instead some unique identifier > field. > > Also, I assume at some point this could be a map-reduce job in hadoop. > > I'm just asking for sanity check, or if there are any better ideas out there. > > Thanks > Bob -- Grant Ingersoll http://www.lucidimagination.com

Re: Can anybody explain the distance method in SquaredEuclideanDistanceMeasure?

2011-11-04 Thread Grant Ingersoll
> code to explain each distance measure implementation, that will really help, > thanks guys. That would be a great addition! Also, javadoc would be helpful, so patches would be great there. Grant Ingersoll http://www.lucidimagination.com

Watchmaker framework usage

2011-11-04 Thread Grant Ingersoll
We've been debating removing/archiving the Watchmaker integration in Mahout due to seeming lack of maintenance and interest. Is anybody actually using it? -Grant

Re: How to find which point belongs which cluster after running KMeansClusterer

2011-11-04 Thread Grant Ingersoll
>> So, what helped me was to process this into a map with cluster Id as the >>>>>> key and vector list as the value. I read the clustered points and all >>>>>> the data in the map in the form. In the end, the list against each >>>>>>
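The approach described in the quoted text, a map keyed by cluster id with the list of member points as the value, might look roughly like the sketch below. It is plain Java and assumes the clustered points have already been read into parallel arrays of cluster ids and point names; those inputs and all names here are hypothetical, and reading the actual clustered-points output is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: groups (clusterId, point) pairs by cluster id.
// Inputs and names are hypothetical stand-ins for the clustered-points data.
class ClusterMembershipSketch {

    // clusterIds[i] is the cluster that pointNames[i] was assigned to.
    static Map<Integer, List<String>> groupByCluster(int[] clusterIds, String[] pointNames) {
        Map<Integer, List<String>> byCluster = new TreeMap<>();
        for (int i = 0; i < clusterIds.length; i++) {
            // Create the member list on first sight of a cluster id, then append.
            byCluster.computeIfAbsent(clusterIds[i], id -> new ArrayList<>()).add(pointNames[i]);
        }
        return byCluster;
    }

    public static void main(String[] args) {
        int[] clusterIds = {7, 12, 7, 12, 7};
        String[] pointNames = {"doc-a", "doc-b", "doc-c", "doc-d", "doc-e"};
        Map<Integer, List<String>> byCluster = groupByCluster(clusterIds, pointNames);
        for (Map.Entry<Integer, List<String>> entry : byCluster.entrySet()) {
            System.out.println("cluster " + entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

Holding every member list in memory is only practical for modest data sizes; for larger runs the same grouping would need to be done out of core.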

Re: Graphical Mahout Cluster Visualization Tools?

2011-11-03 Thread Grant Ingersoll
I've tried various open source tools (Gephi, others), but haven't found one yet that can handle large volumes of points in an efficient way. FWIW, the Carrot2 workbench is BSD, perhaps it could be used with some work? That being said, I did recently add the ability to ClusterDumper to output

Re: How to find which point belongs which cluster after running KMeansClusterer

2011-11-03 Thread Grant Ingersoll
ll points/vector belong >>> to this cluster, but... so did i miss something? Thanks a lot. Cheers >>> Ramon >>> >>> >>> - >>> No virus found in this message. >>> Checked by AVG - www.avg.com >>> Version: 10.0.1411 / Virus Database: 2092/3992 - Release Date: 11/02/11 >> > Grant Ingersoll http://www.lucidimagination.com

How To Contribute

2011-11-02 Thread Grant Ingersoll
In the vein of users become contributors become committers: It seems there has been some spark of interest in contributing more, so I thought I would pass along a few pointers: 1. https://cwiki.apache.org/MAHOUT/how-to-contribute.html -- Details how to submit patches, etc. IDE codestyles at

Re: does anyone use the "row label bindings" stuff in Vector / Matrix?

2011-11-02 Thread Grant Ingersoll
ssion about them way back when and Ted and Jeff went through a few iterations to add them in. > > -jake > > On Wed, Nov 2, 2011 at 8:08 AM, Grant Ingersoll wrote: > >> >> On Nov 2, 2011, at 10:58 AM, Jake Mannix wrote: >> >>> On Wed, Nov 2, 20

Re: does anyone use the "row label bindings" stuff in Vector / Matrix?

2011-11-02 Thread Grant Ingersoll
On Nov 2, 2011, at 10:58 AM, Jake Mannix wrote: > On Wed, Nov 2, 2011 at 7:34 AM, Grant Ingersoll wrote: > >> What functionality, specifically, are you proposing to remove? > > > I'm suggesting we kill, from Matrix.java and descendents, all of the >

Re: Embedding mahout in a java app

2011-11-02 Thread Grant Ingersoll
On Nov 2, 2011, at 7:17 AM, Tharindu Mathew wrote: > I want to create a java UI tool (based on a web app) that can pick and > apply different algorithms available in Mahout to different data sets. Very cool! Keep us posted, as this would be immensely useful! Any chance it will be donated back

Re: does anyone use the "row label bindings" stuff in Vector / Matrix?

2011-11-02 Thread Grant Ingersoll
What functionality, specifically, are you proposing to remove? I know we had a lot of discussion around some of this stuff way back when as to how best to do it, but of course, that doesn't mean it has uptake. If it's on the Matrix, then doesn't it more easily get shipped around via the Writab

Minhash key groups

2011-11-02 Thread Grant Ingersoll
What's the Minhash key groups value used for in the MinhashDriver? I mean, I see it is used for building up the key out of the hashed values, but what's the significance of different values for it? The default is 2, what does it mean practically speaking if I choose, say, 10? AFAICT, it would
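As background to the question, here is a rough sketch of the general idea of building a cluster key out of several min-hash values. It is plain Java written for illustration and is an assumption about the technique in general, not the MinhashDriver implementation: the hash functions, the wrap-around grouping, and every name below are hypothetical. In a scheme like this, two items share a key only if all of the grouped min-hash values match, so a larger group size makes keys more selective.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: NOT Mahout's MinhashDriver code.
// Hash functions, grouping scheme, and names are hypothetical.
class MinHashKeyGroupsSketch {

    static final long PRIME = 2147483647L;

    // One min-hash value per hash function, over non-negative item ids.
    static long[] minHashes(Set<Integer> items, int numHashFunctions) {
        long[] mins = new long[numHashFunctions];
        for (int i = 0; i < numHashFunctions; i++) {
            long a = 2L * i + 3;   // arbitrary per-function multiplier
            long b = 7L * i + 11;  // arbitrary per-function offset
            long min = Long.MAX_VALUE;
            for (int item : items) {
                min = Math.min(min, (a * item + b) % PRIME);
            }
            mins[i] = min;
        }
        return mins;
    }

    // Builds one key per hash function by joining keyGroups consecutive
    // min-hash values (wrapping around the end of the array).
    static List<String> keys(long[] mins, int keyGroups) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < mins.length; i++) {
            StringBuilder key = new StringBuilder();
            for (int j = 0; j < keyGroups; j++) {
                if (j > 0) {
                    key.append('-');
                }
                key.append(mins[(i + j) % mins.length]);
            }
            result.add(key.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> docA = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5));
        Set<Integer> docB = new HashSet<>(Arrays.asList(1, 2, 3, 4, 6));
        for (int keyGroups : new int[] {1, 2, 4}) {
            List<String> shared = new ArrayList<>(keys(minHashes(docA, 4), keyGroups));
            shared.retainAll(keys(minHashes(docB, 4), keyGroups));
            System.out.println("keyGroups=" + keyGroups + " shared keys: " + shared.size());
        }
    }
}
```

Items that share at least one key would land in the same candidate group, so the group size trades recall (small values, more collisions) against precision (large values, fewer collisions).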

Re: We need help about how to install mahout

2011-11-01 Thread Grant Ingersoll
On Nov 1, 2011, at 2:16 PM, Patrick Hunt wrote: > On Tue, Nov 1, 2011 at 10:44 AM, Ted Dunning wrote: >> On Tue, Nov 1, 2011 at 9:18 AM, Patrick Hunt wrote: >> >>> 2011/10/31 Ted Dunning : Keep in mind that Cloudera has packaged the 0.5 release. That is >>> probably OK for most reco

Re: Production use cases of Mahout

2011-11-01 Thread Grant Ingersoll
any info if it's available? > > -- > Regards, > > Tharindu > > blog: http://mackiemathew.com/ -------- Grant Ingersoll http://www.lucidimagination.com

Re: User vs. Item performance

2011-10-26 Thread Grant Ingersoll
you like, but on Hadoop. > > On Wed, Oct 26, 2011 at 1:56 PM, Grant Ingersoll wrote: >> I seem to recall past discussions on where one hits the bottleneck w/ user >> based recommendation approaches in Mahout, but I can't seem to locate it >> anymore. Anyone know of

User vs. Item performance

2011-10-26 Thread Grant Ingersoll
I seem to recall past discussions on where one hits the bottleneck w/ user based recommendation approaches in Mahout, but I can't seem to locate it anymore. Anyone know off hand? Where do user based approaches hit their limits, more or less? Thanks, Grant

Re: Exception in thread "main" org.apache.lucene.index.CorruptIndexException: unrecognized format -3 in file "_b.fnm"

2011-10-26 Thread Grant Ingersoll
the one the index was >> created with? >> >> >> Isabel >> > > > -- > Lance Norskog > goks...@gmail.com Grant Ingersoll http://www.lucidimagination.com

Re: MinHash Clustering in Mahout

2011-10-25 Thread Grant Ingersoll
On Oct 19, 2011, at 11:38 AM, Varun Thacker wrote: > I was trying to run the MinHash algorithm on the Reuters data set, so I did > the following before running MinHashDriver > > - Get the Reuters dataset > - Run org.apache.lucene.benchmark.utils.ExtractReuters to generate > reuters-out fro

Re: Exception thrown while running K-means clustering using Mahout

2011-10-22 Thread Grant Ingersoll
m/zkpy0k.png> > > Please help me out . Thanks a lot . -------- Grant Ingersoll http://www.lucidimagination.com

Mahout Training/Talks at ApacheCon

2011-10-22 Thread Grant Ingersoll
Just a friendly nudge to those on the fence for ApacheCon in Vancouver this year that there will be both a Mahout training and some Mahout talks. I think a few of us committers will also be hacking Mahout on Tuesday if you are interested. Training info: http://na11.apachecon.com/talks/18395 M

Re: Bayes classifier can't get model when running on Hadoop

2011-10-17 Thread Grant Ingersoll
rainer-thetaNormalizer > drwxrwxrwx - hadoop supergroup 0 2011-10-17 10:18 > /user/hadoop/bayes-model/trainer-weights > > And I use this model to classify new data, all sample will be classified to > "unknown" > > My Environment: > >

Re: RecommenderJob and NaN

2011-10-14 Thread Grant Ingersoll
the shell script with -x as > you will probably have to tweak it. > > Lance > > On Thu, Oct 13, 2011 at 11:04 PM, Sebastian Schelter wrote: > >> Only got the raw data, how did you convert it to our standard >> recommender input? >> >> --sebastian >>

Re: RecommenderJob and NaN

2011-10-14 Thread Grant Ingersoll
>> recommender input? >> >> --sebastian >> >> >> On 14.10.2011 01:17, Grant Ingersoll wrote: >>> Were you able to get the data, Sebastian? >>> >>> On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote: >>> >>>>
