Re: hit highlighting in lucene ?

2009-05-21 Thread KK
Thank you all. @Muir Thanks for sharing your views. I'd like to have some more details on the process you mentioned, as I've absolutely no idea about this highlighting stuff and could not make much out of your mail. Can you point me to some tutorials/good write ups on the same, if you have some write ups

Retrieving payloads for terms matched by a query

2009-05-21 Thread Dmitri Bichko
Hi, I may be missing something obvious, but how do I get the payloads for the specific token positions that were matched by a query? For example, if I have a phrase query like "A keyword B" that matches the field "A keyword B A", I can get the payloads for A and B with IndexReader.termPositions()

Re: Parsing large xml files

2009-05-21 Thread crackeur
http://vtd-xml.sf.net - Original Message - From: "Sithu D. Sudarsan" To: java-user@lucene.apache.org Sent: Thursday, May 21, 2009 7:42:59 AM GMT -08:00 US/Canada Pacific Subject: Parsing large xml files Hi, While trying to parse xml documents of about 50MB size, we run into

Re: Do TermDocs and TermEnum need to be closed?

2009-05-21 Thread Jeremy Volkman
Thanks Mike. In the meantime I'll just not close them. :) On Thu, May 21, 2009 at 12:19 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > You're right, SegmentTermDocs/TermEnum.close calls close on its > IndexInputs, but those IndexInputs were obtained by calling clone() on > the "real

Re: Query rewriting/optimization

2009-05-21 Thread Preetham Kajekar
Thanks for the response ! Will post my findings. Thx, ~preetham Michael McCandless wrote: Alas, Lucene in general does not do such structural optimization (and I agree, we should). EG we could do it during Query.rewrite(). There are certain corner cases that are handled, eg a BooleanQuery wit

Re: Searching index problems with tomcat

2009-05-21 Thread Marco Lazzara
>>>> Can you post your indexReader/Searcher initialization code from your >>>>>> standalone app, as well as your webapp. >>>>>> >>>>>> Could you further post your Analyzer Setup/Query Building code from >>>>>> both

Re: Phrase Highlighting

2009-05-21 Thread Michael McCandless
On Thu, May 21, 2009 at 3:09 PM, Max Lynch wrote: > Sorry, the following code is in python, but I can hack a Java thing together > if necessary. I'm a big Python fan :) > HighlighterSpanScorer is the SpanScorer from the highlight > package just renamed to avoid conflict with the other SpanScorer

Re: Phrase Highlighting

2009-05-21 Thread Max Lynch
On Thu, Apr 30, 2009 at 5:16 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Thu, Apr 30, 2009 at 12:15 AM, Max Lynch wrote: > > You should switch to the SpanScorer (in o.a.l.search.highlighter). > >> That fragment scorer should only match true phrase matches. > >> > >> Mike > >>

Re: Query rewriting/optimization

2009-05-21 Thread Michael McCandless
Alas, Lucene in general does not do such structural optimization (and I agree, we should). EG we could do it during Query.rewrite(). There are certain corner cases that are handled, eg a BooleanQuery with a single BooleanClause, or BooleanQuery where minimumNumberShouldMatch exceeds the number of

Re: corpus vacabulary

2009-05-21 Thread Otis Gospodnetic
Hello, Perhaps the following will help: asf-lucene/contrib$ ff HighFreq*java ./miscellaneous/src/java/org/apache/lucene/misc/HighFreqTerms.java Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Ridzwan Aminuddin > To: java-user@lucene.apa

Re: Lucene 2.9

2009-05-21 Thread Michael McCandless
Darned that Google; they need to do better ;) Here's the entry from CHANGES.txt on Lucene's trunk: 2. LUCENE-1382: Add an optional arbitrary String "commitUserData" to IndexWriter.commit(), which is stored in the segments file and is then retrievable via IndexReader.getCommitUserData ins

Re: Lucene 2.9

2009-05-21 Thread Tim Williams
On Thu, May 21, 2009 at 1:12 PM, Michael McCandless wrote: > Sorry for the slow response. > > It's really not clear when 2.9 will be released.  We have accumulated > a number of good improvements -- higher performance field sorting, new > higher performance Collector (replaces HitCollector) API, >

Re: Lucene 2.9

2009-05-21 Thread Michael McCandless
Sorry for the slow response. It's really not clear when 2.9 will be released. We have accumulated a number of good improvements -- higher performance field sorting, new higher performance Collector (replaces HitCollector) API, segment-based searching, attaching a String label to each commit from

Query rewriting/optimization

2009-05-21 Thread Preetham Kajekar
Hi, I am wondering if Lucene internally rewrites/optimizes a Query. I am programmatically generating a Query based on various user options, and quite often I have BooleanQueries wrapped inside BooleanQueries etc. Like, ((Src:Testing Dst:Test) (Src:Test2 Port:http)). In this case, would Lucene optim

Re: Term frequencies within a search

2009-05-21 Thread Michael McCandless
This is often requested, but Lucene doesn't make it easy. I'd love for someone to come up and build this feature :) Do you need term freqs for just the top N that were collected? Or for all docs that matched the query? Mike On Thu, May 21, 2009 at 6:34 AM, Robert Young wrote: > Hi, > I would

Re: Do TermDocs and TermEnum need to be closed?

2009-05-21 Thread Michael McCandless
You're right, SegmentTermDocs/TermEnum.close calls close on its IndexInputs, but those IndexInputs were obtained by calling clone() on the "real" IndexInputs and so for NIOFSDirectory, FSDirectory and RAMDirectory at least, when a clone's close() is called, that's a no-op. I think there are many p

The org.apache.lucene.SegmentReader.class system property

2009-05-21 Thread Michael McCandless
Does anyone set that property in order to customize the SegmentReader class that Lucene uses? A while back, this was added for GCJ specific code (appears under src/gcj/* in a source checkout), but that code hasn't kept up w/ recent changes to Lucene (eg readOnly IndexReader) and won't work out-of-

Re: Does Lucene fail fast on boolean queries?

2009-05-21 Thread Joel Halbert
TrieRangeQuery - thanks for the tip. -Original Message- From: Michael McCandless Reply-To: java-user@lucene.apache.org To: java-user@lucene.apache.org Subject: Re: Does Lucene fail fast on boolean queries? Date: Thu, 21 May 2009 11:39:23 -0400 On Thu, May 21, 2009 at 10:58 AM, Joel Halb

Re: Does Lucene fail fast on boolean queries?

2009-05-21 Thread Michael McCandless
On Thu, May 21, 2009 at 10:58 AM, Joel Halbert wrote: > Thx.  We're not relying on the internal implementation, but I was > wondering with respect to how efficient it is with respect to doing a > boolean AND query. > > i.e. does clause precedence affect the efficiency of the query - so is X > && Y

RE: Parsing large xml files

2009-05-21 Thread Sudarsan, Sithu D.
Thanks, I'll try that and get back to you Sincerely, Sithu D Sudarsan -Original Message- From: Michael Barbarelli [mailto:mbarbare...@gmail.com] Sent: Thursday, May 21, 2009 10:52 AM To: java-user@lucene.apache.org Subject: Re: Parsing large xml files Why not use an XML pull parser?

Re: Parsing large xml files

2009-05-21 Thread Erick Erickson
What fails and what is the stack trace? Have you tried just parsing the XML in a stand-alone program independent of indexing? You should easily be able to parse a 50MB file with that much memory. I suspect something else is going on here. Perhaps you're not *really* allocating that much memory to

Re: Parsing large xml files

2009-05-21 Thread Joel Halbert
Try http://piccolo.sourceforge.net/; it is small and fast. -Original Message- From: Michael Barbarelli Reply-To: java-user@lucene.apache.org To: java-user@lucene.apache.org Subject: Re: Parsing large xml files Date: Thu, 21 May 2009 15:52:00 +0100 Why not use an XML pull parser? I recommen

Re: Does Lucene fail fast on boolean queries?

2009-05-21 Thread Joel Halbert
Thx. We're not relying on the internal implementation, but I was wondering how efficient it is at doing a boolean AND query, i.e. does clause precedence affect the efficiency of the query - so is X && Y faster than Y && X if there are fewer hits for X? From how you de

Re: Parsing large xml files

2009-05-21 Thread Michael Barbarelli
Why not use an XML pull parser? I recommend against using an in-memory parser. On Thu, May 21, 2009 at 3:42 PM, Sudarsan, Sithu D. < sithu.sudar...@fda.hhs.gov> wrote: > > Hi, > > While trying to parse xml documents of about 50MB size, we run into > OutOfMemoryError due to java heap space. Incre
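To illustrate the pull-parser suggestion, here is a minimal sketch using StAX (javax.xml.stream), which ships with Java 6 and later. It streams through the document one event at a time instead of building the whole tree in memory. The class name, the helper countElements and the element names "docs"/"doc" are hypothetical and only for illustration; the poster's actual schema is not shown in the thread.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class PullParseSketch {

    /**
     * Counts start tags with the given local name while streaming.
     * Only the current event is held in memory, never the whole document.
     */
    public static int countElements(Reader in, String localName) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            int count = 0;
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT
                            && localName.equals(reader.getLocalName())) {
                        // A real indexer would collect the element's text here
                        // and add one Lucene Document per record.
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
            return count;
        } catch (XMLStreamException e) {
            throw new RuntimeException("XML is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample input; a real run would pass a FileReader or stream.
        String xml = "<docs><doc/><doc>x</doc></docs>";
        System.out.println(countElements(new StringReader(xml), "doc"));
    }
}
```

For a 50MB file the same loop applies; memory use depends on how much is kept per record, not on the file size.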

Parsing large xml files

2009-05-21 Thread Sudarsan, Sithu D.
Hi, While trying to parse xml documents of about 50MB size, we run into OutOfMemoryError due to java heap space. Increasing the JVM heap to close to 2GB (that is the max) does not help. Is there any API that could be used to handle such large single xml files? If Lucene is not the right place, please l

Do TermDocs and TermEnum need to be closed?

2009-05-21 Thread Jeremy Volkman
Greetings all, I currently have a FieldExistsFilter which returns all documents that contain a particular field. I'm in the process of converting my custom filters to be DocIdSet based rather than BitSet based. This filter, however, requires the use of a TermDocs object to iterate over terms and D

Re: Does Lucene fail fast on boolean queries?

2009-05-21 Thread Michael McCandless
Well... scoring of AND queries currently is done doc-at-once. So Lucene will first step to doc 1 for Name, then ask age to skip to doc >= 1, will see that both have doc=1 and collect it. The same thing happens for doc=2. Then, Lucene will ask for the next doc of Name, which returns "false" (end
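The doc-at-once stepping described above can be sketched as a "leapfrog" intersection of two sorted docID lists. This is a simplified illustration, not Lucene's actual scorer code: plain int arrays stand in for postings, and the names ConjunctionSketch, advance and intersect are hypothetical. It uses the example from the original question (Name matches docs 1,2; Age matches docs 1,2,5,10).

```java
import java.util.ArrayList;
import java.util.List;

public class ConjunctionSketch {

    /** Returns the index of the first entry >= target at or after 'from', or docs.length if none. */
    public static int advance(int[] docs, int from, int target) {
        int i = from;
        while (i < docs.length && docs[i] < target) {
            i++;
        }
        return i;
    }

    /** Intersects two sorted docID lists; stops as soon as either list is exhausted. */
    public static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> hits = new ArrayList<Integer>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                hits.add(a[i]);           // both clauses are on the same doc: collect it
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i = advance(a, i, b[j]);  // ask a to skip to doc >= b's current doc
            } else {
                j = advance(b, j, a[i]);  // ask b to skip to doc >= a's current doc
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] name = {1, 2};
        int[] age = {1, 2, 5, 10};
        System.out.println(intersect(name, age)); // prints [1, 2]
    }
}
```

In this sketch the loop ends once the Name list runs out after doc 2, so docs 5 and 10 of the Age list are never visited.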

Re: Does Lucene fail fast on boolean queries?

2009-05-21 Thread Joel Halbert
Thx. so, just to clarify, in the example I gave below... Lucene will search for documents matching on Name and find doc 1 and doc 2. Then it will search age and find docs 1, 2 and then break. It will not go on to seek 5 and 10...? -Original Message- From: Michael McCandless Reply-To: jav

Re: Searching index problems with tomcat

2009-05-21 Thread Marco Lazzara
>>> Could you further post your Analyzer Setup/Query Building code from >>>>>> both apps. >>>>>> >>>>>> Could you further post the document creation code used at indexing >>>>>> time? (Which analyzer, and which fields are index

Re: hit highlighting in lucene ?

2009-05-21 Thread Robert Muir
It's definitely an area in Lucene that could use some improvement. My recommendation for multilingual text is to apply the Unicode "default" algorithms: Tokenize text according to UAX #29: Unicode text segmentation. Apply full case-folding (Unicode ch. 3.13) with FC_NFKC closure. Apply UAX #15: unic
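UAX #15 is the Unicode normalization standard, and Java 6 exposes it through java.text.Normalizer. The sketch below covers only that normalization step. The preview above is cut off before it names a form, so the choice of NFKC here is an assumption, and the class and method names are hypothetical. Tokenization per UAX #29 and full case folding are not covered.

```java
import java.text.Normalizer;

public class NormalizeSketch {

    /**
     * Normalizes text before analysis. NFKC is assumed here; the original
     * message is truncated before it says which form to use.
     */
    public static String normalize(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "e" followed by a combining acute accent composes to one code point.
        String decomposed = "e\u0301";
        System.out.println(normalize(decomposed).length()); // prints 1

        // The "fi" ligature (U+FB01) folds to the two letters "fi" under NFKC.
        System.out.println(normalize("\uFB01")); // prints fi
    }
}
```

Whatever form is chosen, the same normalization has to be applied at index time and at query time, or equivalent strings will not match.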

Re: Does Lucene fail fast on boolean queries?

2009-05-21 Thread Michael McCandless
Yes. As soon as Lucene sees that the Name docID iteration has ended, the search will break. Mike On Thu, May 21, 2009 at 8:44 AM, Joel Halbert wrote: > Hi, > > When Lucene performs a Boolean query, say: > > Field Name = Male > AND > Field Age = 30 > > assuming the resultant docs for each portio

Re: About sort questions

2009-05-21 Thread Erick Erickson
I suspect that your boost values are too small to really influence the scores very much. Have you tried using boost values of, say, d:5^100 OR uid:10^10 OR lang:lisp ? But if you have specific documents that you *know* you want in specific places, why play around with boosting at all? You can use s

Re: How to query/search unicoded docs in lucene using unicode text as query?

2009-05-21 Thread Robert Muir
Hello, your example (Hindi) is probably suffering from a number of search issues: I don't recommend StandardAnalyzer, as for this example it will break words around the dependent vowels and nukta dot, etc. WhitespaceAnalyzer might be a good start. Also, is it possible to apply unicode normalizati

How to query/search unicoded docs in lucene using unicode text as query?

2009-05-21 Thread KK
Hi All, I've indexed some docs [non-English] in UTF-8 format. For both indexing as well as searching/querying I'm using SimpleAnalyzer. For English texts, when I tried with single words it's working, then I thought of trying non-English texts. So I wrote those words [multiple words] in babe

Re: hit highlighting in lucene ?

2009-05-21 Thread Joel Halbert
> If I index english pages > with the same indexer, it will not take care of stemming and stop word > removal? correct > Cant we have a single indexer that handles non-eng and eng in > equally good ways? You can have a single indexer, but, if you wanted to use one Analyzer for English docume

Re: Searching index problems with tomcat

2009-05-21 Thread Matthew Hall
It's been a few days, and we haven't heard back about this issue, can we assume that you fixed it via using fully qualified paths then? Matt Ian Lea wrote: Marco You haven't answered Matt's question about where you are running it from. Tomcat's default directory may well not be the same as y

Re: hit highlighting in lucene ?

2009-05-21 Thread KK
Initially I was using StandardAnalyzer but I switched to SimpleAnalyzer, which I guess does not do more than tokenizing [and may be tokenizing] and I think this does not do stemming, which I don't/can't do because I've no stemmer for the languages I'm indexing. For indexing and querying I'm using the sam

Re: hit highlighting in lucene ?

2009-05-21 Thread Joel Halbert
The highlighter should be language independent, so long as you are consistent with your use of Analyzer between indexing/query/highlighting. As for the most appropriate Analyzer to use for your local language, this is a separate question - especially if you are using stop word and stemming filters

Does Lucene fail fast on boolean queries?

2009-05-21 Thread Joel Halbert
Hi, When Lucene performs a Boolean query, say: Field Name = Male AND Field Age = 30 assuming the resultant docs for each portion of the query were: Matching docs for: Name = 1,2 Matching docs for: Age = 1,2,5,10 Will Lucene stop searching for documents matching the Age term once it has found

Re: Posting unicode data to lucene not working during searching/retreival!

2009-05-21 Thread KK
Thank you very much. As you told me I just added a single line in the jsp page mentioning the charset as utf-8 and it worked like a charm. Thank you. KK On Thu, May 21, 2009 at 5:47 PM, Uwe Schindler wrote: > If you print the result e.g. to a webpage through the servlet API, the > output is don

hit highlighting in lucene ?

2009-05-21 Thread KK
Hi All, I was looking for various ways of implementing hit highlighting in Lucene and found some standard classes that support highlighting, like this: lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/highlight/package-summary.html but what I believe is that this is only for

RE: Posting unicode data to lucene not working during searching/retreival!

2009-05-21 Thread Uwe Schindler
Hi KK, > right? and remove this conversion that I'm doing later , > > byte [] utfEncodeByteArray = textOnly.getBytes(); > String utfString = new String(utfEncodeByteArray, Charset.forName("UTF- > 8")); > > This will make sure I'm not depending on the platform encoding, right? In principle, ye

RE: Posting unicode data to lucene not working during searching/retreival!

2009-05-21 Thread Uwe Schindler
If you print the result e.g. to a webpage through the servlet API, the output is done with ISO-8859-1 (which is the default for HTTP). If you want to change this, you must tell the servlet layer the encoding before getting a PrintWriter (response.setCharacterEncoding(), response.setContentType("text/html; ch

Re: how to get the word before and the word after the matched Term?

2009-05-21 Thread Grant Ingersoll
See http://www.lucidimagination.com/search/document/7fe40486bc935ce4/get_term_neighbours (although I think you can do better than the code in the third reply by using a TermVectorMapper such that you can process the TermVector as it comes from disk.) Essentially, you need to use a combinati

Term frequencies within a search

2009-05-21 Thread Robert Young
Hi, I would like to perform a query and then get a summary of the term frequencies of the result. Is this possible? Thanks Rob

Re: Posting unicode data to lucene not working during searching/retreival!

2009-05-21 Thread KK
I did all the changes but no improvement. the data is getting indexed properly, I think because I'm able to see the results through luke and luke has option for seeing the results in both utf-8 encoding and string default encoding. I tried to use both but no difference. In both the cases I'm able t

Re: Posting unicode data to lucene not working during searching/retreival!

2009-05-21 Thread KK
Thanks @Uwe. #To answer your last mail's query, textOnly is the output of the method downloadPage(), the complete text including all html tags etc... #Instead of doing the encode/decode later, what I should do is when downloading the page through buffered reader put the charset as utf-8 as you me

RE: Posting unicode data to lucene not working during searching/retreival!

2009-05-21 Thread Uwe Schindler
I forgot: > byte [] utfEncodeByteArray = textOnly.getBytes(); > String utfString = new String(utfEncodeByteArray, Charset.forName("UTF- > 8")); > > here textonly is the text extracted from the downloaded page What is textonly here? A String, if yes, why decode and then again encode it? The impor
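The point about the quoted snippet can be sketched as follows: textOnly.getBytes() with no argument encodes with the platform default charset, so decoding the result as UTF-8 only round-trips when that default happens to be UTF-8. A more robust approach is to name the charset once, at the point where the downloaded bytes are turned into characters. The class and method names below are hypothetical, and the sample Hindi word stands in for a downloaded page.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

public class DecodeSketch {

    /** Reads the whole stream, decoding bytes with an explicitly named charset. */
    public static String readAll(InputStream in, Charset charset) {
        try {
            Reader reader = new InputStreamReader(in, charset);
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        // Stand-in for the bytes of a downloaded page: a Hindi word encoded as UTF-8.
        String original = "\u0939\u093F\u0928\u094D\u0926\u0940";
        byte[] pageBytes = original.getBytes(utf8);

        String decoded = readAll(new ByteArrayInputStream(pageBytes), utf8);
        System.out.println(decoded.equals(original)); // prints true
    }
}
```

Once the text is a correctly decoded String, no further getBytes()/new String() conversion is needed before handing it to Lucene.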

RE: Posting unicode data to lucene not working during searching/retreival!

2009-05-21 Thread Uwe Schindler
Hello KK, > Thanks for your quick response. Let me explain the whole thing. > I'm downloading the pages for given URLs and then extracting text and > converting that to unicode utf-8 this way, > > byte [] utfEncodeByteArray = textOnly.getBytes(); > String utfString = new String(utfEncodeByteArray

Re: Posting unicode data to lucene not working during searching/retreival!

2009-05-21 Thread KK
Thanks for your quick response. Let me explain the whole thing. I'm downloading the pages for given URLs and then extracting text and converting that to unicode utf-8 this way, byte [] utfEncodeByteArray = textOnly.getBytes(); String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-8