Re: View lucene index file
Check out this page: http://jakarta.apache.org/lucene/docs/contributions.html
There are tools like LIMO and Luke for viewing an index.

Cheo

On Thu, 9 Sep 2004 23:38:17 -0400, Anne Y. Zhang <[EMAIL PROTECTED]> wrote:
> I am using Nutch. Is there any way I can view the lucene index file?
> It seems that lucene writes the index as binary files. Could anybody explain
> how lucene does the indexing and where the index files are located?
> Thank you very much!
>
> Ya

--
Cheolgoo, Kang

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
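For programmatic inspection, here is a minimal sketch against the Lucene 1.4 IndexReader API (the index path is a placeholder, not anything from the original message):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;

    public class DumpTerms {
      public static void main(String[] args) throws Exception {
        // Open the index directory and report some basic statistics.
        IndexReader reader = IndexReader.open("/path/to/index");
        System.out.println("documents: " + reader.numDocs());
        // Walk every indexed term and print its document frequency.
        TermEnum terms = reader.terms();
        while (terms.next()) {
          System.out.println(terms.term() + " df=" + terms.docFreq());
        }
        terms.close();
        reader.close();
      }
    }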
View lucene index file
I am using Nutch. Is there any way I can view the lucene index file? It seems that Lucene writes the index as binary files. Could anybody explain how Lucene does the indexing and where the index files are located? Thank you very much!

Ya

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: combining open office spellchecker with Lucene
David Spencer wrote:
> Good heuristics, but are there any more precise, standard guidelines as to
> how to balance or combine what I think are the following possible criteria
> in suggesting a better choice:

Not that I know of.

> - ignore (penalize?) terms that are rare

I think this one is easy to threshold: ignore matching terms that are rarer than the term entered.

> - ignore (penalize?) terms that are common

This, in effect, falls out of the previous criterion. A term that is very common will not have any matching terms that are more common. As an optimization, you could avoid even looking for matching terms when a term is very common.

> - terms that are closer (string distance) to the term entered are better

This is the meaty one.

> - terms that start w/ the same 'n' chars as the user's term are better

Perhaps. Are folks really better at spelling the beginning of words?

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
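A rough sketch of those two frequency rules (assuming the 1.4 IndexReader API; the field name and the 10% cutoff are made-up values for illustration):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    public class FrequencyGate {
      // Only keep candidate corrections that are more frequent than the entered term.
      static boolean acceptCandidate(IndexReader reader, String field,
                                     String entered, String candidate) throws Exception {
        int enteredFreq = reader.docFreq(new Term(field, entered));
        int candidateFreq = reader.docFreq(new Term(field, candidate));
        return candidateFreq > enteredFreq;
      }

      // Don't attempt correction at all for very common terms.
      static boolean worthCorrecting(IndexReader reader, String field, String entered)
          throws Exception {
        int df = reader.docFreq(new Term(field, entered));
        double ratio = (double) df / reader.numDocs();
        return ratio < 0.10;  // arbitrary threshold, for illustration only
      }
    }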
Re: MultiFieldQueryParser seems broken... Fix attached.
> But, inspired by that message, couldn't MultiFieldQueryParser just be a
> subclass of QueryParser that overrides getFieldQuery()?

I wasn't sure that everything "went through" getFieldQuery(). If so, yes, that should work. In either case, I don't even think a subclass is necessary. Just have a different constructor for QueryParser that takes multiple default field names, and add the behavior to QueryParser, keyed off that characteristic (more than one default field name).

Bill

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: MultiFieldQueryParser seems broken... Fix attached.
> is it a problem if the users will search "coffee OR tea" as a search
> string in the case that MultifieldQueryParser is
> modified as Bill suggested, and the default operator is set to AND?

Here's what you get (which is correct):

    % java -classpath /usr/local/lib/lucene-1.4.1.jar:. \
        -DSearchText.QueryDefaultOperator=AND \
        -DSearchTest.QueryParser=new SearchTest 'coffee OR tea'
    query is (title:coffee authors:coffee contents:coffee) (title:tea authors:tea contents:tea)
    %

Bill

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
On Thursday 09 September 2004 19:47, Daniel Taurat wrote:
> I am facing an out of memory problem using Lucene 1.4.1.

Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing.

Regards
 Daniel

--
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Existing Parsers
Some of the tools listed use cmd line execs to output a doc of some sort to text, and then I grab the text and add it to a lucene doc, etc. Any stats on the scalability of that? In large scale applications, I'm assuming this will cause some serious issues... anyone have any input on this?

-Chris Fraschetti

On Thu, 09 Sep 2004 09:54:43 -0700, David Spencer <[EMAIL PROTECTED]> wrote:
> Honey George wrote:
> > Hi,
> > I know some of them.
> > 1. PDF
> >    + http://www.pdfbox.org/
> >    + http://www.foolabs.com/xpdf/download.html
> >      - I am using this and found it good. It even supports
>
> My dated experience from 2 years ago was that (the evil, native code)
> foolabs pdf parser was the best, but obviously things could have changed.
>
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg02912.html
>
> >      various languages.
> > 2. word
> >    + http://sourceforge.net/projects/wvware
> > 3. excel
> >    + http://www.jguru.com/faq/view.jsp?EID=1074230
> >
> > -George
> >
> > --- [EMAIL PROTECTED] wrote:
> > > Anyone know of any reliable parsers out there for
> > > pdf word excel or powerpoint?
>
> For powerpoint it's not easy. I've been using this and it has worked
> fine until recently, and it seems to sometimes go into an infinite loop now
> on some recent PPTs. Native code and a package that seems to be dormant,
> but to some extent it does the job. The file "ppthtml" does the work.
>
> http://chicago.sourceforge.net/xlhtml

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: combining open office spellchecker with Lucene
Doug Cutting wrote:
> Aad Nales wrote:
> > Before I start reinventing wheels I would like to do a short check to see
> > if anybody else has already tried this. A customer has requested us to
> > look into the possibility to perform a spell check on queries. So far the
> > most promising way of doing this seems to be to create an Analyzer based
> > on the spellchecker of OpenOffice. My question is: "has anybody tried
> > this before?"
>
> Note that a spell checker used with a search engine should use collection
> frequency information. That's to say, only "corrections" which are more
> frequent in the collection than what the user entered should be displayed.
> Frequency information can also be used when constructing the checker. For
> example, one need never consider proposing terms that occur in very few
> documents. And one should not try correction at all for terms which occur
> in a large proportion of the collection.
>
> Doug

Good heuristics, but are there any more precise, standard guidelines as to how to balance or combine what I think are the following possible criteria in suggesting a better choice:

- ignore (penalize?) terms that are rare
- ignore (penalize?) terms that are common
- terms that are closer (string distance) to the term entered are better
- terms that start w/ the same 'n' chars as the user's term are better

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi,

I am facing an out of memory problem using Lucene 1.4.1. I am re-indexing a pretty large number (about 30,000) of documents. I identify old instances by checking for a unique ID field, delete those with indexReader.delete(), and add the new document version. A heap dump says I have a huge number of HashMaps with SegmentTermEnum objects (256891). The IndexReader is closed directly after delete(term)... It seems to me that this did not happen with version 1.2 (same number of objects and all...). Does anyone have an idea why I get these "hanging" objects, or what to do in order to avoid them?

Thanks
Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
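For reference, a minimal sketch of the delete-then-re-add pattern described above (assuming the Lucene 1.4 API; the field names are examples, not the poster's actual schema):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class Reindexer {
      public void reindex(String indexDir, String uid, String contents) throws Exception {
        // 1. Remove any old versions of the document, identified by a unique ID field.
        IndexReader reader = IndexReader.open(indexDir);
        reader.delete(new Term("uid", uid));
        reader.close();  // close the reader before opening the writer

        // 2. Add the new version.
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(Field.Keyword("uid", uid));
        doc.add(Field.UnStored("contents", contents));
        writer.addDocument(doc);
        writer.close();
      }
    }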
Lucene working example.
I used lucene in one of my projects which required searching content on web pages. I wrote my own JSP and HTML parser that handled the major cases, and it successfully parsed almost all of my required web pages. I also came up with an example implementation (http://dharmanand.tarundua.net/lucene_eg.war), as I didn't find very good working examples on the internet.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: MultiFieldQueryParser seems broken... Fix attached.
On Thursday 09 September 2004 18:52, Doug Cutting wrote:
> I have not been able to construct a two-word query that returns a page
> without both words in either the content, the title, the url or in a
> single anchor. Can you?

Like this one?

    konvens leitseite

Leitseite is only in the title of the first match (www.gldv.org), konvens is only in the body.

--
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: combining open office spellchecker with Lucene
Aad Nales wrote:
> Before I start reinventing wheels I would like to do a short check to see
> if anybody else has already tried this. A customer has requested us to
> look into the possibility to perform a spell check on queries. So far the
> most promising way of doing this seems to be to create an Analyzer based
> on the spellchecker of OpenOffice. My question is: "has anybody tried this
> before?"

Note that a spell checker used with a search engine should use collection frequency information. That's to say, only "corrections" which are more frequent in the collection than what the user entered should be displayed. Frequency information can also be used when constructing the checker. For example, one need never consider proposing terms that occur in very few documents. And one should not try correction at all for terms which occur in a large proportion of the collection.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Existing Parsers
Honey George wrote:
> Hi,
> I know some of them.
> 1. PDF
>    + http://www.pdfbox.org/
>    + http://www.foolabs.com/xpdf/download.html
>      - I am using this and found it good. It even supports

My dated experience from 2 years ago was that (the evil, native code) foolabs pdf parser was the best, but obviously things could have changed.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg02912.html

>      various languages.
> 2. word
>    + http://sourceforge.net/projects/wvware
> 3. excel
>    + http://www.jguru.com/faq/view.jsp?EID=1074230
>
> -George
>
> --- [EMAIL PROTECTED] wrote:
> > Anyone know of any reliable parsers out there for
> > pdf word excel or powerpoint?

For powerpoint it's not easy. I've been using this and it has worked fine until recently, and it seems to sometimes go into an infinite loop now on some recent PPTs. Native code and a package that seems to be dormant, but to some extent it does the job. The file "ppthtml" does the work.

http://chicago.sourceforge.net/xlhtml

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: MultiFieldQueryParser seems broken... Fix attached.
Bill Janssen wrote:
> I'd think that if a user specified a query "cutting lucene", with an
> implicit AND and the default fields "title" and "author", they'd expect to
> see a match in which both "cutting" and "lucene" appears. That is,
>
>   (title:cutting OR author:cutting) AND (title:lucene OR author:lucene)

Your proposal is certainly an improvement.

It's interesting to note that in Nutch I implemented something different. There, a search for "cutting lucene" expands to something like:

  (+url:cutting^4.0 +url:lucene^4.0 +url:"cutting lucene"~2147483647^4.0)
  (+anchor:cutting^2.0 +anchor:lucene^2.0 +anchor:"cutting lucene"~4^2.0)
  (+content:cutting +content:lucene +content:"cutting lucene"~2147483647)

So a page with "cutting" in the body and "lucene" in anchor text won't match: the body, anchor or url must contain all query terms. A single authority (content, url or anchor) must vouch for all attributes. Note that Nutch also boosts matches where the terms are close together. Using "~2147483647" permits them to be anywhere in the document, but boosts more when they're closer and in-order. (The "~4" in anchor matches is to prohibit matches across different anchors. Each anchor is separated by a Token.positionIncrement() of 4.)

But perhaps this is not a feature. Perhaps Nutch should instead expand this to:

  +(url:cutting^4.0 anchor:cutting^2.0 content:cutting)
  +(url:lucene^4.0 anchor:lucene^2.0 content:lucene)
  url:"cutting lucene"~2147483647^4.0
  anchor:"cutting lucene"~4^2.0
  content:"cutting lucene"~2147483647

That would, e.g., permit a match with only "lucene" in an anchor and "cutting" in the content, which the earlier formulation would not.

Can anyone tell whether Google has this requirement? I have not been able to construct a two-word query that returns a page without both words in either the content, the title, the url or in a single anchor. Can you?

If you're interested, the Nutch query expansion code in question is:

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/query-basic/src/java/net/nutch/searcher/basic/BasicQueryFilter.java?view=markup

To play with it you can download Nutch and use the command:

  bin/nutch net.nutch.searcher.Query

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116

Yes, the approach there is similar. I attempted to complete the solution and provide a working replacement for MultiFieldQueryParser. But, inspired by that message, couldn't MultiFieldQueryParser just be a subclass of QueryParser that overrides getFieldQuery()?

Cheers,

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
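A minimal sketch of that second, per-term expansion, built directly with the 1.4 BooleanQuery API (an illustration only, not the Nutch code; the field names and boosts just follow the example above):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class PerTermExpansion {
      // For each user term, require a match in at least one of the fields.
      public static Query expand(String[] terms) {
        BooleanQuery query = new BooleanQuery();
        for (int i = 0; i < terms.length; i++) {
          BooleanQuery anyField = new BooleanQuery();
          TermQuery url = new TermQuery(new Term("url", terms[i]));
          url.setBoost(4.0f);
          TermQuery anchor = new TermQuery(new Term("anchor", terms[i]));
          anchor.setBoost(2.0f);
          TermQuery content = new TermQuery(new Term("content", terms[i]));
          anyField.add(url, false, false);      // optional clause
          anyField.add(anchor, false, false);   // optional clause
          anyField.add(content, false, false);  // optional clause
          query.add(anyField, true, false);     // required: every term must match somewhere
        }
        return query;
      }
    }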
Re: combining open office spellchecker with Lucene
Andrzej Bialecki wrote:
> David Spencer wrote:
> > I can/should send the code out. The logic is that for any terms in a query
> > that have zero matches, go thru all the terms(!) and calculate the
> > Levenshtein string distance, and return the best matches. A more
> > intelligent way of doing this is to instead look for terms that also
> > match on the 1st "n" (prob 3) chars.
>
> ...or prepare in advance a fast lookup index - split all existing terms to
> bi- or trigrams, create a separate lookup index, and then simply for each
> term ask a phrase query (phrase = all n-grams from an input term), with a
> slop > 0, to get similar existing terms. This should be fast, and you could
> provide a "did you mean" function too...

Sounds interesting/fun but I'm not sure if I'm following exactly. Let's talk thru the trigram index case. Are you saying that for every trigram in every word there will be a mapping of trigram -> term? Thus if "recursive" is in the (orig) index then we'd create entries like:

  rec -> recursive
  ecu -> ...
  cur -> ...
  urs -> ...
  rsi -> ...
  siv -> ...
  ive -> ...

And so on for all terms in the orig index. OK, fine. But now the user types in a query like "recursivz". What's the algorithm - obviously I guess take all trigrams in the bad term and go thru the trigram-index, but there will be lots of suggestions. Now what - use string distance to score them? I guess that makes sense - please confirm if I understand.

And so I guess the point here is we precalculate the trigram -> term mappings to avoid an expensive traversal of all terms in an index, but we still use string distance as a 2nd pass (and we probably should force the matches to always match on the 1st n (3) chars, using the heuristic that people can usually start the spelling of a word correctly).

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
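A rough sketch of that two-pass idea, with one variation: the candidate lookup ORs the input term's trigrams rather than requiring all of them in a phrase, so a gram that is missing from the correct spelling doesn't rule it out. This assumes the 1.4 API; the "word"/"gram" field names are invented, and the lookup index's writer is assumed to use a whitespace-style analyzer so the grams stay intact:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class NGramLookup {
      // Split a word into trigrams, e.g. "recursive" -> "rec ecu cur urs rsi siv ive".
      static String trigrams(String word) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i + 3 <= word.length(); i++) {
          if (i > 0) sb.append(' ');
          sb.append(word.substring(i, i + 3));
        }
        return sb.toString();
      }

      // Build the lookup index: one document per term of the original index.
      static void add(IndexWriter writer, String term) throws Exception {
        Document doc = new Document();
        doc.add(Field.Keyword("word", term));          // stored, returned as the suggestion
        doc.add(Field.Text("gram", trigrams(term)));   // indexed trigrams
        writer.addDocument(doc);
      }

      // Query: OR together the misspelled word's trigrams and take the top hits,
      // which can then be re-ranked by string distance in a second pass.
      static Hits candidates(IndexSearcher searcher, String misspelled) throws Exception {
        BooleanQuery query = new BooleanQuery();
        String[] grams = trigrams(misspelled).split(" ");
        for (int i = 0; i < grams.length; i++) {
          query.add(new TermQuery(new Term("gram", grams[i])), false, false);
        }
        return searcher.search(query);
      }
    }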
Re: combining open office spellchecker with Lucene
David Spencer wrote:
> I can/should send the code out. The logic is that for any terms in a query
> that have zero matches, go thru all the terms(!) and calculate the
> Levenshtein string distance, and return the best matches. A more intelligent
> way of doing this is to instead look for terms that also match on the 1st
> "n" (prob 3) chars.

...or prepare in advance a fast lookup index - split all existing terms to bi- or trigrams, create a separate lookup index, and then simply for each term ask a phrase query (phrase = all n-grams from an input term), with a slop > 0, to get similar existing terms. This should be fast, and you could provide a "did you mean" function too...

--
Best regards,
Andrzej Bialecki

Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
FreeBSD developer (http://www.freebsd.org)

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: combining open office spellchecker with Lucene
Aad Nales wrote:
> Hi All,
>
> Before I start reinventing wheels I would like to do a short check to see
> if anybody else has already tried this. A customer has requested us to look
> into the possibility to perform a spell check on queries. So far the most
> promising way of doing this seems to be to create an Analyzer based on the
> spellchecker of OpenOffice. My question is: "has anybody tried this
> before?"

I did a WordNet/synonym query expander. Search for "WordNet" on this page. Of interest is that it stores the WordNet info in a separate Lucene index, as at its essence an index is just a database.

http://jakarta.apache.org/lucene/docs/lucene-sandbox/

Also, another variation is to instead spell-correct based on what terms are in the index, not what an external dictionary says. I've done this on my experimental site searchmorph.com in a dumb/inefficient way. Here's an example:

http://www.searchmorph.com/kat/search.jsp?s=recursivz

After you click above it takes ~10sec as it produces terms close to "recursivz". Oops - looking at the output, it looks like the same word is suggested multiple times - ouch - I must be considering all fields, not just the contents field. TBD is fixing this. (Or no wonder it's so slow :))

I can/should send the code out. The logic is that for any terms in a query that have zero matches, go thru all the terms(!) and calculate the Levenshtein string distance, and return the best matches. A more intelligent way of doing this is to instead look for terms that also match on the 1st "n" (prob 3) chars.

> Cheers,
> Aad
>
> --
> Aad Nales
> [EMAIL PROTECTED], +31-(0)6 54 207 340

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
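A minimal sketch of that brute-force pass, walking every term and keeping candidates within a small edit distance (assuming the 1.4 TermEnum API; the distance cutoff of 2 and the field name are arbitrary):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;

    public class BruteForceSuggest {
      // Walk all terms of one field and print the ones within edit distance 2.
      public static void suggest(IndexReader reader, String field, String word) throws Exception {
        TermEnum terms = reader.terms();
        while (terms.next()) {
          if (!field.equals(terms.term().field())) continue;  // only the contents field
          String candidate = terms.term().text();
          if (distance(word, candidate) <= 2) {
            System.out.println("did you mean: " + candidate);
          }
        }
        terms.close();
      }

      // Plain dynamic-programming Levenshtein distance.
      static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
          for (int j = 1; j <= b.length(); j++) {
            int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
            d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                               d[i - 1][j - 1] + cost);
          }
        }
        return d[a.length()][b.length()];
      }
    }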
Re: (n00b) Meaning of Hits.id (int)
Oh, it's that simple. :) Thanks for that!

Peter

Morus Walter wrote:
> It's Lucene's internal id or document number which allows you to access the
> document and its stored fields. See IndexSearcher.doc(int i) or
> IndexReader.document(int n). The docs just don't name the parameter 'id'.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
RE: Existing Parsers
For Word, see the tm-extractor at www.text-mining.org (based on POI). Pretty simple to use.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Thursday, September 9, 2004 15:47
To: Lucene Users List
Subject: Existing Parsers

Anyone know of any reliable parsers out there for pdf word excel or powerpoint?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Existing Parsers
Hi,

I know some of them.

1. PDF
   + http://www.pdfbox.org/
   + http://www.foolabs.com/xpdf/download.html
     - I am using this and found it good. It even supports various languages.
2. Word
   + http://sourceforge.net/projects/wvware
3. Excel
   + http://www.jguru.com/faq/view.jsp?EID=1074230

-George

--- [EMAIL PROTECTED] wrote:
> Anyone know of any reliable parsers out there for
> pdf word excel or powerpoint?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Existing Parsers
There are a number of libraries for Java that provide PDF text extraction functionality. A pretty comprehensive list is available at <http://www.geocities.com/marcoschmidt.geo/java-libraries-pdf.html>.

I'm obviously biased towards recommending our solution, PDFTextStream <http://snowtide.com/home/PDFTextStream/>; it's the fastest thing out there for Java, and it provides a very easy-to-use Lucene integration module that will have you up and running in no time <http://snowtide.com/home/PDFTextStream/techtips/easy_lucene_integration>.

For office documents, just about the only game in town that I know of is the Jakarta POI project <http://jakarta.apache.org/poi/>. It's been quite a while since I've touched it, but it's definitely the best place to start.

Chas Emerick | [EMAIL PROTECTED]
PDFTextStream: fast PDF text extraction for Java apps and Lucene
http://snowtide.com/home/PDFTextStream/

On Sep 9, 2004, at 9:47 AM, <[EMAIL PROTECTED]> wrote:
> Anyone know of any reliable parsers out there for pdf word excel or
> powerpoint?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Case sensitiveness and wildcard searches
George,

The QueryParser does toLowerCase() on wildcard queries by default. Hence you'd need to follow Daniel's advice and use QueryParser's setLowercaseWildcardTerms(false) if you wanted IM* to stay IM*.

Cheers,
René

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
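A small sketch of that setting (assuming the 1.4 QueryParser API referenced above; the field name and query string are just examples):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class WildcardCase {
      public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("title", new StandardAnalyzer());
        // Keep wildcard terms as typed instead of lowercasing them.
        parser.setLowercaseWildcardTerms(false);
        Query q = parser.parse("IM*");
        System.out.println(q);  // title:IM* rather than title:im*
      }
    }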
Existing Parsers
Anyone know of any reliable parsers out there for pdf, word, excel or powerpoint?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Case sensitiveness and wildcard searches
Thanks for the links René,

The mail is not exactly talking about my case, because the StandardAnalyzer which I use does lowercase the input. So it is the same scenario as the FAQ entry.

-George

--- "René Hackl" <[EMAIL PROTECTED]> wrote:
> Hi George,
>
> I'm not sure about v1.3, but you may want to take a look at
>
> http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=9342
>
> or
>
> http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1806371
>
> cheers,
> René

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Handling user queries (Was: Re: MultiFieldQueryParser seems broken... Fix attached.)
René Hackl wrote:
> > is it a problem if the users will search "coffee OR tea" as a search
> > string in the case that MultifieldQueryParser is
> > modified as Bill suggested, and the default operator is set to AND?
>
> No. There's not a problem with the proposed correction to MFQP. MFQP should
> work the way Bill suggested. My babbling about coffee or tea was more aimed
> at Bill's referring to "darn users started demanding". So this is a totally
> different matter. In my experience, many users fall into everyday language
> traps, like in: "What do you want to drink, coffee or tea?" The answer
> normally isn't 'yes' to both, is it?

This problem may be solved if the users know what the following signs mean: - + "" * ~
That will improve the results beyond what our parsing is doing...

> I have an app where in some cases I make subqueries for an initial
> user-stated query. The aim is to come up with pointers to partial matching
> docs. The background is, one ill-advised NOT can ruin a query. But this has
> nothing to do with MFQP. Just random thoughts about making users happy even
> when they are new to formulating queries :-)
>
> Cheers,
> René

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using MySpell iso the Snowball Analyzer
Hi Aad,

Use the stemmed result as what you index, but then also remember to stem the query terms as well - you need to do the same on the way out as on the way in. We don't use MySpell, but we do use our own stemmer in this way, as there are many examples where Snowball falls down, like:

  caught -> caught instead of catch
  buses  -> buse   instead of bus

and Snowball gets worse for non-English languages like Dutch.

Cheers
Pete

----- Original Message -----
From: "Aad Nales" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, September 09, 2004 8:44 AM
Subject: Using MySpell iso the Snowball Analyzer

> For an educational customer we have been requested to add spell checking
> to queries that enter lucene. The MySpell classes of Pietschmann seem to
> make this more than feasible. What I wonder is if somebody else has done
> this before? Any tips, questions or remarks?
>
> MySpell is the successor of ISpell and is used as the spellchecker in
> OpenOffice. It executes a stemming algorithm in combination with a
> dictionary. My second question is if anyone has extracted the stemming
> result to be used in an index?
>
> Thanks for any or all feedback,
> cheers,
> Aad
>
> --
> Aad Nales
> [EMAIL PROTECTED], +31-(0)6 54 207 340

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
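A small sketch of that symmetry - one analyzer instance for both indexing and query parsing (a generic illustration; StandardAnalyzer stands in for whatever MySpell-based stemming analyzer you build, and the index path is a placeholder):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;

    public class SymmetricStemming {
      public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();  // replace with your stemming analyzer

        // Index: terms go through the analyzer on the way in.
        IndexWriter writer = new IndexWriter("/tmp/stem-index", analyzer, true);
        Document doc = new Document();
        doc.add(Field.Text("contents", "the buses were caught in traffic"));
        writer.addDocument(doc);
        writer.close();

        // Search: the same analyzer processes the query terms on the way out.
        QueryParser parser = new QueryParser("contents", analyzer);
        IndexSearcher searcher = new IndexSearcher("/tmp/stem-index");
        Hits hits = searcher.search(parser.parse("bus caught"));
        System.out.println(hits.length() + " hits");
        searcher.close();
      }
    }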
Handling user queries (Was: Re: MultiFieldQueryParser seems broken... Fix attached.)
> is it a problem if the users will search "coffee OR tea" as a search
> string in the case that MultifieldQueryParser is
> modified as Bill suggested, and the default operator is set to AND?

No. There's not a problem with the proposed correction to MFQP. MFQP should work the way Bill suggested. My babbling about coffee or tea was more aimed at Bill's referring to "darn users started demanding". So this is a totally different matter. In my experience, many users fall into everyday language traps, like in: "What do you want to drink, coffee or tea?" The answer normally isn't 'yes' to both, is it?

I have an app where in some cases I make subqueries for an initial user-stated query. The aim is to come up with pointers to partial matching docs. The background is, one ill-advised NOT can ruin a query. But this has nothing to do with MFQP. Just random thoughts about making users happy even when they are new to formulating queries :-)

Cheers,
René

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
TermQuery PROBLEM!!!
Hello everybody and esp. Erik :)

In case the search entry is empty, I'd like to return all documents that came in during the last minute. I have a field (created as doc.add(Field.Keyword(F_PUBLISHORT, publishort));) containing this data in "MMddhhMM" format. The problem is I get nothing, but I do know I have documents with, for example, the value "200404271420". What am I doing wrong? When I do queries based on QueryParser (i.e. filter != null) everything is ok.

Thanks in advance,
J.

    ..
    if (filter == null || filter.equals("")) {
        filter = null;
        line = "200404271420";
        fld = "publishort";
    }
    ...
    if (filter == null) {
        query = new TermQuery(new Term(fld, line));
    } else {
        NeisQueryParser nqp = new NeisQueryParser();
        query = nqp.parse(line);
    }
    formated_query = query.toString();
    Sort sort = null;
    ms = getMS(); // MultiSearcher
    if (filter == null) {
        if (sort_byscore) hits = ms.search(query, getCurrentTimeFilter());
        else hits = ms.search(query, getCurrentTimeFilter(), "publishort");
    } else {
        if (range_flag) {
            if (sort_byscore) hits = ms.search(query, getDateFilter(), (String) null);
            else hits = ms.search(query, getDateFilter(), sort);
        } else {
            if (sort_byscore) hits = ms.search(query);
            else hits = ms.search(query, sort);
        }
    }
    total_hitnum = hits.length();
    String logdata = "";

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Using MySpell iso the Snowball Analyzer
For an educational customer we have been requested to add spell checking to queries that enter lucene. The MySpell classes of Pietschmann seem to make this more than feasible. What I wonder is if somebody else has done this before? Any tips, questions or remarks?

MySpell is the successor of ISpell and is used as the spellchecker in OpenOffice. It executes a stemming algorithm in combination with a dictionary. My second question is if anyone has extracted the stemming result to be used in an index?

Thanks for any or all feedback,
cheers,
Aad

--
Aad Nales
[EMAIL PROTECTED], +31-(0)6 54 207 340

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: PDF->Text Performance comparison
> 1) I tried to migrate to newer versions (0.6.4, 0.6.5, 0.6.6), but all the
> time I had problems with parsing the same pdf documents, which worked well
> for 0.6.3. I mentioned my problems here:
> https://sourceforge.net/tracker/?func=detail&atid=552832&aid=1021691&group_id=78314

I am waiting for a response from you on this issue; try to log in to SF when posting bugs so you get a notification when it is updated.

> 2) When I started with 0.6.3 I experienced performance problems too,
> especially with large pdf documents (I had several with more than 20MB
> size). I changed the source a bit, wrapping the following line of the
> BaseParser class:

I will give that a try, thanks for letting me know.

Ben

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: (n00b) Meaning of Hits.id (int)
Peter Pimley writes:
> My documents are not stored in their original form by lucene, but in a
> separate database. My lucene docs do however store the primary key, so
> that I can fetch the original version from the database to show the user
> (does that sound sane?)

Yes.

> I see that the 'Hits' class has an id (int) method, which sounds
> interesting. The javadoc says "Returns the id for the nth document in
> this set.". However, I can't find any mention anywhere else about
> Document ids. Could anybody explain what this is?

It's Lucene's internal id or document number, which allows you to access the document and its stored fields. See IndexSearcher.doc(int i) or IndexReader.document(int n). The docs just don't name the parameter 'id'.

Morus

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
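A short sketch showing both accessors together (assuming the 1.4 API; the "pk" stored field is just an example name for the database primary key):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class FetchByHit {
      public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Query query = QueryParser.parse("semantics", "contents", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
          int docId = hits.id(i);               // Lucene's internal document number
          Document doc = searcher.doc(docId);   // same document as hits.doc(i)
          String primaryKey = doc.get("pk");    // stored field holding the DB key
          System.out.println(docId + " -> " + primaryKey + " (score " + hits.score(i) + ")");
        }
        searcher.close();
      }
    }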
Re: MultiFieldQueryParser seems broken... Fix attached.
René Hackl wrote:
> Bill,
>
> Thank you for clarifying on that issue. I missed the...
>
> > (title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
> ...
> > (title:cutting OR title:lucene) AND (author:cutting OR author:lucene)
> >
> > Note that this would match even if only "lucene" occurred in the
> ... "only lucene"/"only cutting" match.
>
> > I'd think that if a user specified a query "cutting lucene", with an
> > implicit AND and the default fields "title" and "author", they'd
> > expect to see a match in which both "cutting" and "lucene" appears.
>
> Hopefully they'd expect that. Sometimes users assume that e.g. "coffee OR
> tea" would provide matches with either term, but not both. But this is
> already "user-attune your application" territory. Your proposal makes
> perfect sense, of course.
>
> René

Is it a problem if the users will search "coffee OR tea" as a search string, in the case that MultifieldQueryParser is modified as Bill suggested and the default operator is set to AND? I don't think so... I think that the resulting Query should be:

  (title:cutting OR author:cutting) OR (title:lucene OR author:lucene)

And I think that the results will be correct. Am I wrong?

I don't know exactly what will happen with more complex queries that use grouping, exact matches and the NOT operator, like:

  (alcohol NOT tea) OR ("black tea" AND brandy)

What will happen if you send this to a MultifieldQueryParser that searches in an index with the fields "drink" and "juices"?

Maybe this kind of search constructions should be a part of the JUnit tests, if they are not already there.

Thanks,
Sergiu

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
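A sketch of what such a test could look like (JUnit 3 style, using the static MultiFieldQueryParser.parse from 1.4; it only checks that the queries parse, and prints the expansions so the multi-field behavior can be inspected by eye):

    import junit.framework.TestCase;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.MultiFieldQueryParser;
    import org.apache.lucene.search.Query;

    public class MultiFieldExpansionTest extends TestCase {
      public void testComplexQueriesParse() throws Exception {
        String[] fields = { "drink", "juices" };
        String[] queries = {
          "coffee OR tea",
          "(alcohol NOT tea) OR (\"black tea\" AND brandy)",
        };
        for (int i = 0; i < queries.length; i++) {
          Query q = MultiFieldQueryParser.parse(queries[i], fields, new StandardAnalyzer());
          assertNotNull(q);
          System.out.println(queries[i] + " -> " + q.toString());
        }
      }
    }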
Re: Case sensitiveness and wildcard searches
Hi George,

I'm not sure about v1.3, but you may want to take a look at

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=9342

or

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1806371

cheers,
René

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
(n00b) Meaning of Hits.id (int)
Hello everyone.

I'm in the process of writing "my first lucene app", and I've got to the bit where I get my search results back (very exciting! ;).

My documents are not stored in their original form by lucene, but in a separate database. My lucene docs do however store the primary key, so that I can fetch the original version from the database to show the user (does that sound sane?)

I see that the 'Hits' class has an id (int) method, which sounds interesting. The javadoc says "Returns the id for the nth document in this set.". However, I can't find any mention anywhere else about Document ids. Could anybody explain what this is?

Many Thanks in Advance,
Peter Pimley, Semantico

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Case sensitiveness and wildcard searches
Hi,

I noticed a behavior with wildcard searches and would like to clarify. From the FAQ http://www.jguru.com/faq/view.jsp?EID=538312 in JGuru, the Analyzer is not used for wildcard queries. In my case I have a document which contains the word IMPORTANT. I use PorterStemFilter + StandardAnalyzer for indexing & searching. I am getting the document if I search for the word IM*. But if the analyzer is not used, then who does the conversion of the word to lowercase?

My code looks like this:

    QueryParser qp = new QueryParser("title", new MyAnalyzer());
    Query q = qp.parse(text);

Though I pass the text in uppercase (IM*), when I print the Query object I can see it in lowercase, something like (title:im*). I am using lucene-1.3-final. Can someone explain this?

Thanks & regards,
George

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: indexing size
Dmitry Serebrennikov wrote:
> Niraj Alok wrote:
> > Hi PA,
> > Thanks for the detail! Since we are using lucene to store the data also,
> > I guess I would not be able to use it.
>
> By the way, I could be wrong, but I think the 35% figure you referenced in
> your first e-mail actually does not include any stored fields. The deal
> with 35% was, I think, to illustrate that the index data structures used
> for searching by Lucene are efficient. But Lucene does nothing special
> about stored content - no compression or anything like that. So you end up
> with the pure size of your data plus the 35% of the indexed data.

There will be a patch available by the end of this week which allows you to store binary values compressed within a lucene index. It means that you will be able to store and retrieve whole documents within lucene in a very efficient way ;-)

regards
bernhard

> Cheers.
> Dmitry.
>
> > Regards,
> > Niraj
> >
> > ----- Original Message -----
> > From: "petite_abeille" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Wednesday, September 01, 2004 1:14 PM
> > Subject: Re: indexing size
> >
> > > Hi Niraj,
> > >
> > > On Sep 01, 2004, at 06:45, Niraj Alok wrote:
> > > > If I make some of them Field.Unstored, I can see from the javadocs
> > > > that it will be indexed and tokenized but not stored. If it is not
> > > > stored, how can I use it while searching?
> > >
> > > The different type of fields don't impact how you do your search. This
> > > is always the same. Using Unstored fields simply means that you use
> > > Lucene as a pure index for search purposes only, not for storing any
> > > data. Specifically, the assumption is that your original data lives
> > > somewhere else, outside of Lucene. If this assumption is true, then you
> > > can index everything as Unstored with the addition of one Keyword per
> > > document. The Keyword field holds some sort of unique identifier which
> > > allows you to retrieve the original data if necessary (e.g. a primary
> > > key, an URI, what not). Here is an example of this approach:
> > >
> > > (1) For indexing, check the indexValuesWithID() method
> > > http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/SZIndex.java?view=markup
> > > Note the addition of a Field.Keyword for each document and the use of
> > > Field.UnStored for everything else.
> > >
> > > (2) For fetching, check objectsWithSpecificationAndHitsInStore()
> > > http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/SZFinder.java?view=markup
> > >
> > > HTH.
> > >
> > > Cheers,
> > > PA.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
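A compact sketch of that index-only pattern (assuming the 1.4 field types; the field names and values are examples, not anything from the referenced code):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class IndexOnly {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/index-only", new StandardAnalyzer(), true);
        Document doc = new Document();
        // Keyword: stored and indexed untokenized -- the handle back to the real data store.
        doc.add(Field.Keyword("id", "row-42"));
        // UnStored: indexed and tokenized but not stored, so the index stays small.
        doc.add(Field.UnStored("title", "Lucene index size"));
        doc.add(Field.UnStored("body", "the original content lives in the database"));
        writer.addDocument(doc);
        writer.close();
      }
    }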
Re: PDF->Text Performance comparison
Hello Ben,

I've been using PDFBox within the last year, but only version 0.6.3, because of 2 reasons:

1) I tried to migrate to newer versions (0.6.4, 0.6.5, 0.6.6), but all the time I had problems with parsing the same pdf documents, which worked well for 0.6.3. I mentioned my problems here:
https://sourceforge.net/tracker/?func=detail&atid=552832&aid=1021691&group_id=78314

2) When I started with 0.6.3 I experienced performance problems too, especially with large pdf documents (I had several with more than 20MB size). I changed the source a bit, wrapping the following line of the BaseParser class:

    out = stream.createFilteredStream( streamLength );

to

    out = new BufferedOutputStream(stream.createFilteredStream( streamLength ));

The performance increase I got was huge: parsing a 21MB pdf document to text took 78 seconds before the modification and 12 seconds after, so more than 6 times faster. I also tried to use buffered streams in some other places, but the effect was not as visible. I hope this change can also be incorporated into the current 0.6.6 release, and then the benchmarks may stay on PDFBox's side :)

Max

BL> On Wed, 8 Sep 2004, Chas Emerick wrote:
BL> > PDFTextStream: fast PDF text extraction for Java applications
BL> > http://snowtide.com/home/PDFTextStream/
BL>
BL> For those that have not seen, snowtide.com has done a performance
BL> comparison against several Java PDF->Text libraries, including Snowtide's
BL> PDFTextStream, PDFBox, Etymon PJ and JPedal. It appears to be fairly well done.
BL>
BL> http://snowtide.com/home/PDFTextStream/Performance
BL>
BL> PDFBox: slow PDF text extraction for Java applications
BL> http://www.pdfbox.org
BL>
BL> :)
BL>
BL> Ben

--
Best regards,
Max    mailto:[EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
combining open office spellchecker with Lucene
Hi All, Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to be to create an Analyzer based on the spellchecker of OpenOffice. My question is: "has anybody tried this before?" Cheers, Aad -- Aad Nales [EMAIL PROTECTED], +31-(0)6 54 207 340 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: MultiFieldQueryParser seems broken... Fix attached.
Bill,

Thank you for clarifying on that issue. I missed the...

> (title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
...
> (title:cutting OR title:lucene) AND (author:cutting OR author:lucene)
>
> Note that this would match even if only "lucene" occurred in the
... "only lucene"/"only cutting" match.

> I'd think that if a user specified a query "cutting lucene", with an
> implicit AND and the default fields "title" and "author", they'd
> expect to see a match in which both "cutting" and "lucene" appears.

Hopefully they'd expect that. Sometimes users assume that e.g. "coffee OR tea" would provide matches with either term, but not both. But this is already "user-attune your application" territory. Your proposal makes perfect sense, of course.

René

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
test pls ignore
-- Aad Nales [EMAIL PROTECTED], +31-(0)6 54 207 340 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]