Re: java.io.IOException when trying to list terms in index (IndexReader)

2009-08-02 Thread se3g2011
Hi, going by the error messages you listed below, please move the 'reader.close()' call to the bottom of the method. I think that if you invoke it first, the underlying stream is closed, so the exception is thrown. ohaya wrote: > > Hi, > > I changed the beginning of the try to: > > try { >

Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread ohaya
Hi, I've noticed a kind of strange problem with term counts and actual terms. Some background: I wrote an app that creates an index, including a "path" field. I am now working on an app (code was in the previous thread) that, as part of what it does, needs to get a list of all of the "path"

Re: Group by in Lucene ?

2009-08-02 Thread Erik Hatcher
Don't overlook Solr: http://lucene.apache.org/solr Erik On Aug 1, 2009, at 5:43 AM, mschipperheyn wrote: http://code.google.com/p/bobo-browse looks like it may be the ticket. Marc -- View this message in context: http://www.nabble.com/Group-by-in-Lucene---tp13581760p24767693.html

Weird behaviour

2009-08-02 Thread prashant ullegaddi
Hi, I've indexed some 50 million documents. I've indexed the target URL of each document as the "url" field, using StandardAnalyzer with Field.Index.ANALYZED. Suppose there is a Wikipedia page with title:"Rahul Dravid" and url: http://en.wikipedia.org/wiki/Rahul_Dravid. But when I search for +title:"Rahu

Re: Weird behaviour

2009-08-02 Thread Shai Erera
You write that you index the string under the "url" field. Do you also index it under "title"? If not, that can explain why title:"Rahul Dravid" does not work for you. Also, did you try to look at the index w/ Luke? It will show you what are the terms in the index. Another thing which is always g

Re: Weird behaviour

2009-08-02 Thread prashant ullegaddi
Firstly, I'm indexing the string in the url field only. I've never used Luke, and I don't know how to use it. What I'm trying to do is search for those documents which are from some particular site, and have a given title. On Sun, Aug 2, 2009 at 4:07 PM, Shai Erera wrote: > You write that you index the

Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread ohaya
Hi, BTW, my indexer app is basically the same as the demo IndexFiles.java. Here's part of the main: try { IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED); System.out.println("Indexing to directory '" +INDEX_DIR+

score from spans

2009-08-02 Thread Eran Sevi
Hi, How can I get the score of a span that is the result of SpanQuery.getSpans()? The score can be the same for each document, but if it's unique per span, it's even better. I tried looking for a way to expose this functionality through the Spans class but it looks too complicated. I'm no

Re: Weird behaviour

2009-08-02 Thread Shai Erera
How do you parse/convert the page to a Document object? Are you sure the title "Rahul Dravid" is extracted properly and put in the "title" field? You can read about Luke here: http://www.getopt.org/luke/. Can you do System.out.println(document.toString()) before you add it to the index, and paste

Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread Phil Whelan
Hi Jim, On Sun, Aug 2, 2009 at 1:32 AM, wrote: > I first noticed the problem that I'm seeing while working on this latter app. > Basically, what I noticed was that while I was adding 13 documents to the > index, when I listed the "path" terms, there were only 12 of them. Field text (the whole

Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread Phil Whelan
Hi Jim, On Sun, Aug 2, 2009 at 9:08 AM, Phil Whelan wrote: > >> So then, I reviewed the index using Luke, and what I saw with that was that >> there were indeed only 12 "path" terms (under "Term Count" on the left), >> but, when I clicked the "Show Top Terms" in Luke, there were 13 terms listed

Re: Weird behaviour

2009-08-02 Thread prashant ullegaddi
Yes, I'm sure that title:"Rahul Dravid" is extracted properly, and there is a document relevant to this query as well. The following query and its results prove it: Enter query: Searching for: +title:"rahul dravid" +url:wiki 4 total matching documents trec-id: clueweb09-enwp02-13-14368, URL: h

Re: Weird behaviour

2009-08-02 Thread Phil Whelan
Hi Prashant, I agree with Shai that using Luke, and printing out what the Document looks like before it goes into the index, are going to be your best bet for debugging this problem. The problem you're having is that StandardAnalyzer does not break up the hostname into separate terms, as it has a

Re: ThreadedIndexWriter vs. IndexWriter

2009-08-02 Thread Michael McCandless
Woops sorry for the confusion! Mike On Sat, Aug 1, 2009 at 1:03 PM, Phil Whelan wrote: > Hi Mike, > > It's Jibo, not me, having the problem. But thanks for the link. I was > interested to look at the code. Will be buying the book soon. > > Phil > > On Sat, Aug 1, 2009 at 2:08 AM, Michael McCandle

Re: Weird behaviour

2009-08-02 Thread prashant ullegaddi
Hi Phil, The query you gave did work. Well, that proves StandardAnalyzer has a different way of tokenizing URLs. Thanks, Prashant. On Sun, Aug 2, 2009 at 11:22 PM, Phil Whelan wrote: > Hi Prashant, > > I agree with Shai, that using Luke and printing out what the Document > looks like before it

Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread Andrzej Bialecki
Phil Whelan wrote: Hi Jim, On Sun, Aug 2, 2009 at 9:08 AM, Phil Whelan wrote: So then, I reviewed the index using Luke, and what I saw with that was that there were indeed only 12 "path" terms (under "Term Count" on the left), but, when I clicked the "Show Top Terms" in Luke, there were 13 te

Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread Phil Whelan
On Sun, Aug 2, 2009 at 10:58 AM, Andrzej Bialecki wrote: > Thank you Phil for spotting this bug - this fix will be included in the next > release of Luke. Glad to help. Thanks for building this great tool! Phil

Re: Weird behaviour

2009-08-02 Thread Shai Erera
You can always create your own Analyzer which creates a TokenStream just like StandardAnalyzer, but instead of using StandardFilter, write another TokenFilter which receives the HOST token type, and breaks it further to its components (e.g., extract "en", "wikipedia" and "org"). You can also return
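The core of the suggestion above, breaking a host name such as "en.wikipedia.org" into "en", "wikipedia" and "org", can be sketched in plain Java. This is only an illustration of the splitting step: `HostTokenSplitter` and `split` are hypothetical names, not Lucene classes, and wiring the logic into a real TokenFilter is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Lucene): splits a host name into its
// dot-separated components, the way a custom filter might treat a HOST token.
final class HostTokenSplitter {

    // Returns the non-empty components of the host, in order.
    static List<String> split(String host) {
        List<String> parts = new ArrayList<>();
        for (String part : host.split("\\.")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints [en, wikipedia, org]
        System.out.println(split("en.wikipedia.org"));
    }
}
```

With components indexed as separate terms, a query on a single component (for example "wikipedia") would have something to match, which is presumably the goal of the custom analyzer discussed here.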

Re: Weird behaviour

2009-08-02 Thread prashant ullegaddi
Thank you Phil and Shai. I will write a different Analyzer. On Sun, Aug 2, 2009 at 11:50 PM, Shai Erera wrote: > You can always create your own Analyzer which creates a TokenStream just > like StandardAnalyzer, but instead of using StandardFilter, write another > TokenFilter which receives the

Re: arabic analyzer

2009-08-02 Thread Robert Muir
> the fact is, plural (as an example) is not supported, and that is one of > the most common things that a person doing some search will expect to Walid, I'm not sure this is true. Many plurals are supported (certainly not exceptional cases or broken plurals). This is no different than the other l

Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread ohaya
Hi Phil, As for the problem with my app, it wasn't what you suggested (about the tokens, etc.). For some later things, my indexer creates both a "path" field that is analyzed (and thus tokenized, etc.) and another field, "fullpath", which is not analyzed (and thus, not tokenized). The problem with my

Re: java.io.IOException when trying to list terms in index (IndexReader)

2009-08-02 Thread ohaya
Hi, I thought that, in the code that I posted, there was a close() in the finally? Or are you saying that when an IndexReader is opened, it somehow persists in the system, even past my Java app terminating? FYI, I'm doing this testing on Windows, under Eclipse... Jim se3g2011

Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread Phil Whelan
Hi Jim, On Sun, Aug 2, 2009 at 12:12 PM, wrote: > i.e., I was ignoring the 1st term in the TermEnum (since the .next() bumps > the TermEnum to the 2nd term, initially). Great! Glad you found the problem. I couldn't see it. Phil
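The off-by-one Jim describes can be reproduced without Lucene. The sketch below uses a hypothetical `PositionedCursor` class as a stand-in for an enumeration that is already positioned on its first element when it is handed out; it is not Lucene code, and whether a given TermEnum starts out positioned this way depends on how it was obtained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in (not Lucene): a cursor that is ALREADY positioned on
// its first element when created.
final class PositionedCursor {
    private final List<String> items;
    private int pos = 0;

    PositionedCursor(List<String> items) {
        this.items = items;
    }

    // Element at the current position, or null if the cursor is exhausted.
    String current() {
        return pos < items.size() ? items.get(pos) : null;
    }

    // Advances by one; returns false once there is nothing left.
    boolean next() {
        pos++;
        return pos < items.size();
    }
}

final class CursorLoopDemo {

    // Buggy pattern: calling next() before reading skips the first element.
    static List<String> skipFirst(List<String> items) {
        PositionedCursor cursor = new PositionedCursor(items);
        List<String> seen = new ArrayList<>();
        while (cursor.next()) {
            seen.add(cursor.current());
        }
        return seen;
    }

    // Fixed pattern: read the current element first, then advance.
    static List<String> readAll(List<String> items) {
        PositionedCursor cursor = new PositionedCursor(items);
        List<String> seen = new ArrayList<>();
        if (cursor.current() == null) {
            return seen; // nothing to read
        }
        do {
            seen.add(cursor.current());
        } while (cursor.next());
        return seen;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("a.txt", "b.txt", "c.txt");
        System.out.println(skipFirst(paths).size()); // prints 2
        System.out.println(readAll(paths).size());   // prints 3
    }
}
```

The same shape explains the symptom in this thread: with 13 entries, the first loop would report 12.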

Re: java.io.IOException when trying to list terms in index (IndexReader)

2009-08-02 Thread Erick Erickson
I've seen Eclipse get into weird states, but I don't think that's your problem. You open the IndexReader and set up a TermEnum on it. Then, no matter what, you close the underlying IndexReader in the finally block. Then later you use the TermEnum *even though the underlying reader has been closed*.
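The ordering problem Erick describes, using something after the resource behind it has been closed, can be shown with a small stand-in. `FakeReader` below is a hypothetical class, not Lucene's IndexReader; it only mimics the behaviour of failing with an IOException once closed. The point of the sketch is that all reads happen inside the try, and close() runs last.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in (not Lucene): refuses reads once it has been closed.
final class FakeReader implements Closeable {
    private final List<String> terms;
    private boolean closed = false;

    FakeReader(List<String> terms) {
        this.terms = terms;
    }

    String termAt(int index) throws IOException {
        if (closed) {
            throw new IOException("reader is closed");
        }
        return terms.get(index);
    }

    int size() {
        return terms.size();
    }

    @Override
    public void close() {
        closed = true;
    }
}

final class CloseOrderDemo {

    // Correct ordering: finish every read inside the try, close in finally.
    static List<String> listTerms(List<String> source) throws IOException {
        FakeReader reader = new FakeReader(source);
        List<String> out = new ArrayList<>();
        try {
            for (int i = 0; i < reader.size(); i++) {
                out.add(reader.termAt(i));
            }
        } finally {
            reader.close();
        }
        return out;
    }

    // Wrong ordering: reading after close() fails with an IOException.
    static boolean failsAfterClose(List<String> source) {
        FakeReader reader = new FakeReader(source);
        reader.close();
        try {
            reader.termAt(0);
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> terms = Arrays.asList("alpha", "beta");
        System.out.println(listTerms(terms));       // prints [alpha, beta]
        System.out.println(failsAfterClose(terms)); // prints true
    }
}
```

Keeping the finally block is still fine as long as nothing that depends on the reader is used after it runs.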

Re: java.io.IOException when trying to list terms in index (IndexReader)

2009-08-02 Thread ohaya
Erick, It's working now (I removed the finally, and put the close() elsewhere). Thanks for the explanation. Jim Erick Erickson wrote: > I've seen Eclipse get into weird states, but I don't think that's your > problem. > > You open the IndexReader and set up a TermEnum on it. Then, no

question about indexing/searching using standardanalyzer for KEYWORD field that contains alphanumeric data

2009-08-02 Thread Leonard Gestrin
Hello, I have a question about the KEYWORD type and searching/updating. I am getting strange behavior that I can't quite comprehend. My index is created using the standard analyzer, which is used for both writing and searching. It has three fields: userpin - alphanumeric field which is stored as TEXT documentkey

Re: Boosting Search Results

2009-08-02 Thread bourne71
Thanks for all the replies. They helped me understand the problem better, but is it possible to create a query that gives an additional boost to a result if and only if both of the words are found in it? That would make sure those results appear higher up in the list.

Re: Lucene for dynamic data retrieval

2009-08-02 Thread Otis Gospodnetic
Hi Satish, Lucene doesn't enforce an index schema, so each document can have a different set of fields. It sounds like you need to write a custom indexer that follows your custom rules and creates Lucene Documents with different Fields, depending on what you want indexed. You also mention sea

How to improve search time?

2009-08-02 Thread prashant ullegaddi
Hi, I've a single index of size 87GB containing around 50M documents. When I search for any query, best search time I observed was 8sec. And when query is expanded with synonyms, search takes minutes (~ 2-3min). Is there a better way to search so that overall search time reduces? Thanks, Prashant

Re: How to improve search time?

2009-08-02 Thread Phil Whelan
Hi Prashant, Take a look at this... http://wiki.apache.org/lucene-java/ImproveSearchingSpeed Cheers, Phil On Sun, Aug 2, 2009 at 9:33 PM, prashant ullegaddi wrote: > Hi, > > I've a single index of size 87GB containing around 50M documents. When I > search for any query, > best search time I obse

Re: Boosting Search Results

2009-08-02 Thread henok sahilu
Hello there, I would like to know about boosting search results. Thanks. --- On Sun, 8/2/09, bourne71 wrote: From: bourne71 Subject: Re: Boosting Search Results To: java-user@lucene.apache.org Date: Sunday, August 2, 2009, 8:14 PM Thanks for all the reply. It help me to understand problem