Hey Mike,

Let me clarify:
The tokens are not unique.  Let's say doc1 contains the token foo and has
the properties weight1 = 0.75, weight2 = 0.90, frequency = 10.  Now let's
say doc2 also contains the token foo, with properties weight1 = 0.8,
weight2 = 0.75, frequency = 5.

Now, if I search for all the documents that contain foo sorted by
frequency, I get doc1, doc2.  If I instead search for all the documents
that contain foo sorted by weight1, I get doc2, doc1.

Does that clarify?
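For what it's worth, if I follow your one-document-per-doc/token-pair
suggestion below, I think the indexing side would look roughly like this.
This is an untested sketch; the field names are the ones from your list,
but the method and variable names are just made up:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericField;

    // One Lucene Document per (doc, token) pair, using a fixed set of
    // field names rather than names derived from the keyword.
    public Document makePairDoc(String docId, String token,
                                float weight1, float weight2, int frequency) {
        Document d = new Document();
        // unique key for the pair (your "id (doc#/token#)" field)
        d.add(new Field("id", docId + "/" + token,
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        // ties the pair back to the source document
        d.add(new Field("doc-id", docId,
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        // the token itself is what gets searched on
        d.add(new Field("token", token,
                Field.Store.NO, Field.Index.NOT_ANALYZED));
        // numeric fields with constant names, so they can be sorted on directly
        d.add(new NumericField("weight1", Field.Store.YES, true)
                .setFloatValue(weight1));
        d.add(new NumericField("weight2", Field.Store.YES, true)
                .setFloatValue(weight2));
        d.add(new NumericField("frequency", Field.Store.YES, true)
                .setIntValue(frequency));
        return d;
    }

That would keep the number of distinct field names fixed instead of growing
with the vocabulary, which I suspect is what is killing my current setup.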
On May 5, 2011, at 3:01 PM, Mike Sokolov wrote:

> Are the tokens unique within a document?  If so, why not store a document
> for every doc/token pair with fields:
>
> id (doc#/token#)
> doc-id (doc#)
> token
> weight1
> weight2
> frequency
>
> Then search for token, sort by weight1, weight2 or frequency.
>
> If the token matches are unique within a document, you will only get each
> document listed once.  If they aren't unique, it's not clear what you want
> to sort by anyway...
>
> -Mike
>
> On 05/05/2011 04:12 PM, Chris Schilling wrote:
>> Hi,
>>
>> I am trying to figure out how to solve this problem:
>>
>> I have about 500,000 files that I would like to index, but the files are
>> structured.  Each file has the following layout:
>>
>> doc1
>> token1, weight11, frequency1, weight21
>> token2, weight12, frequency2, weight22
>> .
>> .
>> .
>>
>> and so on for 500,000 docs.
>>
>> Basically, I would like to index the tokens for each doc.  When I search
>> for a token, I would like to return the top docs sorted by weight1,
>> frequency, or weight2.
>>
>> In my naive setup, I loop through the files in the directory, then loop
>> through the lines of each file, calling this function for every line:
>>
>> public Document processKeywords(String keyword, Float weight1,
>>         Float weight2, Integer frequency) throws Exception {
>>     Document doc = new Document();
>>     doc.add(new Field("keywords", keyword, Field.Store.NO,
>>             Field.Index.ANALYZED));
>>     doc.add(new NumericField(keyword + "weight1",
>>             Field.Store.YES, true).setFloatValue(weight1));
>>     doc.add(new NumericField(keyword + "weight2",
>>             Field.Store.YES, true).setFloatValue(weight2));
>>     doc.add(new NumericField(keyword + "frequency",
>>             Field.Store.YES, true).setFloatValue(frequency));
>>     return doc;
>> }
>>
>> So, for each token, I create three new fields.  Notice how I index the
>> keyword itself in the "keywords" field, while for the weights and the
>> frequency I create new fields whose names are based on the keyword.  On
>> average, I have 100 tokens per document, so each document ends up with
>> about 300 distinct fields.
>>
>> When I run the program, the Lucene portion eats up tons of memory, and
>> when it reaches the maximum allotted to the JVM (I have tried allowing
>> up to 4 GB), the program slows to a crawl.  I assume it is spending all
>> of its time in garbage collection due to all these fields.
>>
>> The code above seems like a very hacky way of accomplishing what I want
>> (sorting the documents that match a keyword search by one of several
>> numeric fields associated with that keyword).
>>
>> FYI, here is the main search code, where q is the token I am searching
>> for and sortby is the field I want to sort on.  I set up a QueryParser
>> to search for the keyword in the "keywords" field.  Then I can extract
>> the stats that I indexed for the given query keyword.
>>
>> private static final QueryParser parser = new QueryParser(
>>         Version.LUCENE_30, "keywords",
>>         new StandardAnalyzer(Version.LUCENE_30));
>>
>> public void search(String q, String sortby) throws IOException,
>>         ParseException {
>>     Query query = parser.parse(q);
>>     long start = System.currentTimeMillis();
>>     TopDocs hits = this.is.search(query, null, 10,
>>             new Sort(new SortField(q + sortby, SortField.FLOAT, true)));
>>     long end = System.currentTimeMillis();
>>     System.out.println("Found " + hits.totalHits +
>>             " document(s) (in " + (end - start) +
>>             " milliseconds) that matched query '" + q + "':");
>>     for (ScoreDoc scoreDoc : hits.scoreDocs) {
>>         Document doc = this.is.doc(scoreDoc.doc);
>>         String hash = doc.get("hash");
>>         System.out.println(hash + " " + doc.get(q + sortby));
>>     }
>> }
>>
>> I am pretty new to Lucene, so I hope this makes sense.  I tried to pare
>> my problem down as much as possible.  Like I said, the main problem is
>> that after processing about 30,000 documents, the indexing slows to a
>> crawl and seems to spend all of its time in the garbage collector.  I am
>> looking for a more efficient/effective way of solving this problem.
>> Code tidbits would help, but are not necessary :)
>>
>> Thanks for your help,
>> Chris S.
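P.S.  The query side of your suggestion would then presumably collapse to a
plain TermQuery plus a Sort on a fixed field name.  Again an untested
sketch with the same field names as above (and a made-up method name):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    // sortBy is "weight1" or "weight2"; for "frequency" the sort type
    // would be SortField.INT, since it was indexed with setIntValue().
    public void searchPairs(IndexSearcher searcher, String token, String sortBy)
            throws IOException {
        // true = descending, so the highest-weighted pairs come first
        Sort sort = new Sort(new SortField(sortBy, SortField.FLOAT, true));
        TopDocs hits = searcher.search(
                new TermQuery(new Term("token", token)), null, 10, sort);
        for (ScoreDoc sd : hits.scoreDocs) {
            Document d = searcher.doc(sd.doc);
            System.out.println(d.get("doc-id") + " " + d.get(sortBy));
        }
    }

No per-keyword field names anywhere, so the field space stays tiny no
matter how many distinct tokens there are.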