Re: View lucene index file

2004-09-09 Thread Cheolgoo Kang
Check out this page:

http://jakarta.apache.org/lucene/docs/contributions.html

We have tools like LIMO and Luke.

Cheo

On Thu, 9 Sep 2004 23:38:17 -0400, Anne Y. Zhang <[EMAIL PROTECTED]> wrote:
> I am using Nutch. Is there any way I can view the Lucene index file?
> It seems that Lucene writes the index as a binary file. Could anybody explain
> how Lucene does the indexing and where the index file is located?
> Thank you very much!
> 
> Ya
> 



-- 
Cheolgoo, Kang




View lucene index file

2004-09-09 Thread Anne Y. Zhang
I am using Nutch. Is there any way I can view the Lucene index file?
It seems that Lucene writes the index as a binary file. Could anybody explain
how Lucene does the indexing and where the index file is located?
Thank you very much!

Ya
 





Re: combining open office spellchecker with Lucene

2004-09-09 Thread Doug Cutting
David Spencer wrote:
Good heuristics but are there any more precise, standard guidelines as 
to how to balance or combine what I think are the following possible 
criteria in suggesting a better choice:
Not that I know of.
- ignore(penalize?) terms that are rare
I think this one is easy to threshold: ignore matching terms that are 
rarer than the term entered.

- ignore(penalize?) terms that are common
This, in effect, falls out of the previous criterion.  A term that is 
very common will not have any matching terms that are more common.  As 
an optimization, you could avoid even looking for matching terms when a 
term is very common.

- terms that are closer (string distance) to the term entered are better
This is the meaty one.
- terms that start w/ the same 'n' chars as the users term are better
Perhaps.  Are folks really better at spelling the beginning of words?
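
In code, the frequency threshold from the first two points might look like
this (a sketch only, assuming a Lucene 1.4-era IndexReader is open on the
collection; the field name is a placeholder):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Keep a candidate correction only if it occurs in more documents than
// the term the user actually typed.
static boolean worthSuggesting(IndexReader reader, String field,
                               String entered, String candidate)
        throws IOException {
    return reader.docFreq(new Term(field, candidate))
         > reader.docFreq(new Term(field, entered));
}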
Doug


Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-09 Thread Bill Janssen
> But, inspired by that message, couldn't MultiFieldQueryParser just be a 
> subclass of QueryParser that overrides getFieldQuery()?

I wasn't sure that everything "went through" getFieldQuery().  If so,
yes, that should work.  In either case, I don't even think a subclass
is necessary.  Just have a different constructor for QueryParser that
takes multiple default field names, and just add the behavior to
QueryParser, keyed off that characteristic (more than one default
field name).
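
(For illustration, a rough sketch of the subclass idea quoted above,
assuming the parser exposes a protected getFieldQuery(String field,
String queryText) hook as in the 1.4-era QueryParser; the class and names
here are made up, not the actual fix:)

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class MultiDefaultFieldQueryParser extends QueryParser {
    private String[] fields;

    public MultiDefaultFieldQueryParser(String[] fields, Analyzer a) {
        super(fields[0], a); // first field doubles as the "default" field
        this.fields = fields;
    }

    // Expand a query on the default field into (f1:text f2:text ...).
    protected Query getFieldQuery(String field, String queryText)
            throws ParseException {
        if (!field.equals(fields[0])) {
            return super.getFieldQuery(field, queryText); // explicit field:term
        }
        BooleanQuery bq = new BooleanQuery();
        for (int i = 0; i < fields.length; i++) {
            Query q = super.getFieldQuery(fields[i], queryText);
            if (q != null) {
                bq.add(q, false, false); // optional clause
            }
        }
        return bq;
    }
}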

Bill




Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-09 Thread Bill Janssen
> is it a problem if the users will search "coffee OR tea" as a search
> string in the case that MultiFieldQueryParser is
> modified as Bill suggested, and the default operator is set to AND?
> 

Here's what you get (which is correct):

% java -classpath /usr/local/lib/lucene-1.4.1.jar:. \
   -DSearchText.QueryDefaultOperator=AND \
   -DSearchTest.QueryParser=new SearchTest 'coffee OR tea'
query is (title:coffee authors:coffee contents:coffee) (title:tea authors:tea 
contents:tea)
%

Bill




Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-09 Thread Daniel Naber
On Thursday 09 September 2004 19:47, Daniel Taurat wrote:

> I am facing an out of memory problem using Lucene 1.4.1.

Could you try with a recent CVS version? There has been a fix about files 
not being deleted after 1.4.1. Not sure if that could cause the problems 
you're experiencing.

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: Existing Parsers

2004-09-09 Thread Chris Fraschetti
Some of the tools listed use command-line executables to output a doc of some
sort to text, and then I grab the text and add it to a Lucene doc, etc
etc...

Any stats on the scalability of that? In large scale applications, I'm
assuming this will cause some serious issues... anyone have any input
on this?

-Chris Fraschetti


On Thu, 09 Sep 2004 09:54:43 -0700, David Spencer
<[EMAIL PROTECTED]> wrote:
> Honey George wrote:
> 
> > Hi,
> >   I know some of them.
> > 1. PDF
> >  + http://www.pdfbox.org/
> >  + http://www.foolabs.com/xpdf/download.html
> >- I am using this and found it good. It even supports
> 
> My dated experience from 2 years ago was that (the evil, native code)
> foolabs pdf parser was the best, but obviously things could have changed.
> 
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg02912.html
> 
> >  various languages.
> > 2. word
> >   + http://sourceforge.net/projects/wvware
> > 3. excel
> >   + http://www.jguru.com/faq/view.jsp?EID=1074230
> >
> > -George
> >  --- [EMAIL PROTECTED] wrote:
> >
> >>Anyone know of any reliable parsers out there for
> >>pdf word
> >>excel or powerpoint?
> 
> For PowerPoint it's not easy. I've been using this and it had worked
> fine until recently, but it now seems to sometimes go into an infinite loop
> on some recent PPTs. It's native code and a package that seems to be dormant,
> but to some extent it does the job. The file "ppthtml" does the work.
> 
> http://chicago.sourceforge.net/xlhtml



Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Doug Cutting wrote:
Aad Nales wrote:
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
based on the spellchecker of OpenOffice. My question is: "has anybody
tried this before?" 

Note that a spell checker used with a search engine should use 
collection frequency information.  That's to say, only "corrections" 
which are more frequent in the collection than what the user entered 
should be displayed.  Frequency information can also be used when 
constructing the checker.  For example, one need never consider 
proposing terms that occur in very few documents.  And one should not 
try correction at all for terms which occur in a large proportion of the 
collection.
Good heuristics but are there any more precise, standard guidelines as 
to how to balance or combine what I think are the following possible 
criteria in suggesting a better choice:

- ignore(penalize?) terms that are rare
- ignore(penalize?) terms that are common
- terms that are closer (string distance) to the term entered are better
- terms that start w/ the same 'n' chars as the users term are better


Doug


Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-09 Thread Daniel Taurat
Hi,
I am facing an out of memory problem using  Lucene 1.4.1.
I am re-indexing a pretty large number (about 30,000) of documents.
I identify old instances by checking for a unique ID field, delete those
with indexReader.delete() and add the new document version.
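
In code, what I do looks roughly like this (a sketch; field and path names
are placeholders):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

void reindex(String indexDir, String uid, Document newDoc) throws IOException {
    IndexReader reader = IndexReader.open(indexDir);
    reader.delete(new Term("uid", uid)); // delete all old versions
    reader.close();                      // close before opening the writer

    IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
    writer.addDocument(newDoc);          // add the new version
    writer.close();
}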

HeapDump says I am having  a huge number of HashMaps with 
SegmentTermEnum objects (256891) .

IndexReader is closed directly after delete(term)...
It seems to me that this did not happen with version 1.2 (same number of
objects and all...).
Does anyone have an idea why I get these "hanging" objects? Or what to do in
order to avoid them?

Thanks
Daniel


Lucene working example.

2004-09-09 Thread Mr dharmanand singh
I used Lucene in one of my projects, which required
searching content on web pages. I wrote my own JSP and
HTML parser that handled the major cases, and it
successfully parsed almost all of my required web pages.
I also came up with an example implementation
(http://dharmanand.tarundua.net/lucene_eg.war) as I
didn't find very good working examples on the internet.





Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-09 Thread Daniel Naber
On Thursday 09 September 2004 18:52, Doug Cutting wrote:

> I have not been
> able to construct a two-word query that returns a page without both
> words in either the content, the title, the url or in a single anchor.
> Can you?

Like this one?

konvens leitseite 

Leitseite is only in the title of the first match (www.gldv.org), konvens 
is only in the body.

-- 
http://www.danielnaber.de




Re: combining open office spellchecker with Lucene

2004-09-09 Thread Doug Cutting
Aad Nales wrote:
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
based on the spellchecker of OpenOffice. My question is: "has anybody
tried this before?" 
Note that a spell checker used with a search engine should use 
collection frequency information.  That's to say, only "corrections" 
which are more frequent in the collection than what the user entered 
should be displayed.  Frequency information can also be used when 
constructing the checker.  For example, one need never consider 
proposing terms that occur in very few documents.  And one should not 
try correction at all for terms which occur in a large proportion of the 
collection.

Doug


Re: Existing Parsers

2004-09-09 Thread David Spencer
Honey George wrote:
Hi,
  I know some of them.
1. PDF
 + http://www.pdfbox.org/
 + http://www.foolabs.com/xpdf/download.html
   - I am using this and found it good. It even supports
My dated experience from 2 years ago was that (the evil, native code) 
foolabs pdf parser was the best, but obviously things could have changed.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg02912.html
 various languages.
2. word
  + http://sourceforge.net/projects/wvware
3. excel
  + http://www.jguru.com/faq/view.jsp?EID=1074230
-George
 --- [EMAIL PROTECTED] wrote: 

Anyone know of any reliable parsers out there for
pdf word 
excel or powerpoint?
For PowerPoint it's not easy. I've been using this and it had worked
fine until recently, but it now seems to sometimes go into an infinite loop
on some recent PPTs. It's native code and a package that seems to be dormant,
but to some extent it does the job. The file "ppthtml" does the work.

http://chicago.sourceforge.net/xlhtml



Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-09 Thread Doug Cutting
Bill Janssen wrote:
I'd think that if a user specified a query "cutting lucene", with an
implicit AND and the default fields "title" and "author", they'd
expect to see a match in which both "cutting" and "lucene" appear.  That is,
(title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
Your proposal is certainly an improvement.
It's interesting to note that in Nutch I implemented something 
different.  There, a search for "cutting lucene" expands to something like:

 (+url:cutting^4.0 +url:lucene^4.0 +url:"cutting lucene"~2147483647^4.0)
 (+anchor:cutting^2.0 +anchor:lucene^2.0 +anchor:"cutting lucene"~4^2.0)
 (+content:cutting +content:lucene +content:"cutting lucene"~2147483647)
So a page with "cutting" in the body and "lucene" in anchor text won't 
match: the body, anchor or url must contain all query terms.  A single 
authority (content, url or anchor) must vouch for all attributes.

Note that Nutch also boosts matches where the terms are close together. 
 Using "~2147483647" permits them to be anywhere in the document, but 
boosts more when they're closer and in-order.  (The "~4" in anchor 
matches is to prohibit matches across different anchors.  Each anchor is 
separated by a Token.positionIncrement() of 4.)
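
(As an aside, a sloppy phrase like anchor:"cutting lucene"~4^2.0 can be
built programmatically; a sketch, not Nutch's actual code:)

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

static Query anchorPhrase() {
    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("anchor", "cutting"));
    pq.add(new Term("anchor", "lucene"));
    pq.setSlop(4);       // terms may be up to 4 positions apart, i.e. ~4
    pq.setBoost(2.0f);   // the ^2.0 boost
    return pq;
}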

But perhaps this is not a feature.  Perhaps Nutch should instead expand 
this to:

 +(url:cutting^4.0 anchor:cutting^2.0 content:cutting)
 +(url:lucene^4.0 anchor:lucene^2.0 content:lucene)
 url:"cutting lucene"~2147483647^4.0
 anchor:"cutting lucene"~4^2.0
 content:"cutting lucene"~2147483647
That would, e.g., permit a match with only "lucene" in an anchor and 
"cutting" in the content, which the earlier formulation would not.

Can anyone tell whether Google has this requirement?  I have not been 
able to construct a two-word query that returns a page without both 
words in either the content, the title, the url or in a single anchor. 
Can you?

If you're interested, the Nutch query expansion code in question is:
http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/query-basic/src/java/net/nutch/searcher/basic/BasicQueryFilter.java?view=markup
To play with it you can download Nutch and use the command:
  bin/nutch net.nutch.searcher.Query
http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116

Yes, the approach there is similar.  I attempted to complete the
solution and provide a working replacement for MultiFieldQueryParser.
But, inspired by that message, couldn't MultiFieldQueryParser just be a 
subclass of QueryParser that overrides getFieldQuery()?

Cheers,
Doug


Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Andrzej Bialecki wrote:
David Spencer wrote:
I can/should send the code out. The logic is that for any terms in a 
query that have zero matches, go thru all the terms(!) and calculate 
the Levenshtein string distance, and return the best matches. A more 
intelligent way of doing this is to instead look for terms that also 
match on the 1st "n" (prob 3) chars.

...or prepare in advance a fast lookup index - split all existing terms 
to bi- or trigrams, create a separate lookup index, and then simply for 
each term ask a phrase query (phrase = all n-grams from an input term), 
with a slop > 0, to get similar existing terms. This should be fast, and 
you could provide a "did you mean" function too...
Sounds interesting/fun but I'm not sure if I'm following exactly.
Let's talk thru the trigram index case.
Are you saying that for every trigram in every word there will be a 
mapping of trigram -> term?
Thus if "recursive" is in the (orig) index then we'd create entries like:

rec -> recursive
ecu -> ...
cur -> ...
urs -> ...
rsi -> ...
siv -> ...
ive -> ...
And so on for all terms in the orig index.
OK fine.
But now the user types in a query like "recursivz".
What's the algorithm - obviously I guess take all trigrams in the bad 
term and go thru the trigram-index, but there will be lots of 
suggestions. Now what - use string distance to score them? I guess that 
makes sense - plz confirm if I understand. And so I guess the point 
here is that we precalculate the trigram->term mappings to avoid an expensive 
traversal of all terms in an index, but we still use string distance as 
a 2nd pass (and prob should force the matches to always match on the 1st 
n (3) chars, using the heuristic that people can usually start 
spelling a word correctly).
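
To make the idea concrete, here is a minimal sketch of building such a
trigram lookup index with 1.4-era APIs; the directory and field names are
made up:

import java.io.IOException;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermEnum;

// One document per term of the original index: searchable trigrams plus
// the term itself as a stored keyword.
static void buildTrigramIndex(String origDir, String lookupDir) throws IOException {
    IndexReader reader = IndexReader.open(origDir);
    IndexWriter writer = new IndexWriter(lookupDir, new WhitespaceAnalyzer(), true);
    TermEnum terms = reader.terms();
    while (terms.next()) {
        String word = terms.term().text();
        if (word.length() < 3) continue; // too short for a trigram
        StringBuffer grams = new StringBuffer();
        for (int i = 0; i <= word.length() - 3; i++) {
            grams.append(word.substring(i, i + 3)).append(' ');
        }
        Document doc = new Document();
        doc.add(Field.UnStored("gram3", grams.toString())); // indexed trigrams
        doc.add(Field.Keyword("word", word));               // the suggestion
        writer.addDocument(doc);
    }
    terms.close();
    writer.close();
    reader.close();
}

At query time you would split the misspelled term into trigrams the same
way and run them as a query (sloppy phrase or boolean) against "gram3".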






Re: combining open office spellchecker with Lucene

2004-09-09 Thread Andrzej Bialecki
David Spencer wrote:
I can/should send the code out. The logic is that for any terms in a 
query that have zero matches, go thru all the terms(!) and calculate the 
Levenshtein string distance, and return the best matches. A more 
intelligent way of doing this is to instead look for terms that also 
match on the 1st "n" (prob 3) chars.
...or prepare in advance a fast lookup index - split all existing terms 
to bi- or trigrams, create a separate lookup index, and then simply for 
each term ask a phrase query (phrase = all n-grams from an input term), 
with a slop > 0, to get similar existing terms. This should be fast, and 
you could provide a "did you mean" function too...

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)


Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Aad Nales wrote:
Hi All,
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
based on the spellchecker of OpenOffice. My question is: "has anybody
tried this before?" 
I did a WordNet/synonym query expander. Search for "WordNet" on this 
page. Of interest: it stores the WordNet info in a separate Lucene 
index, since at its essence an index is just a database.

http://jakarta.apache.org/lucene/docs/lucene-sandbox/
Also, another variation is to instead spell-check based on what terms are in 
the index, not what an external dictionary says. I've done this on my 
experimental site searchmorph.com in a dumb/inefficient way. Here's an 
example:

http://www.searchmorph.com/kat/search.jsp?s=recursivz
After you click above it takes ~10sec as it produces terms close to 
"recursivz". Oops - looking at the output, it looks like the same word 
is suggested multiple times - ouch - I must be considering all fields, not 
just the contents field. TBD is fixing this. (or no wonder it's so slow :))

I can/should send the code out. The logic is that for any terms in a 
query that have zero matches, go thru all the terms(!) and calculate the 
Levenshtein string distance, and return the best matches. A more 
intelligent way of doing this is to instead look for terms that also 
match on the 1st "n" (prob 3) chars.
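
In code, the brute-force version looks roughly like this (a sketch
assuming 1.4-era APIs; the field name is a placeholder, and restricting
to one field also fixes the multiple-suggestions problem above):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;

// Scan every term of the given field and keep the closest one.
static String closestTerm(IndexReader reader, String field, String word)
        throws IOException {
    String best = null;
    int bestDist = Integer.MAX_VALUE;
    TermEnum terms = reader.terms();
    while (terms.next()) {
        if (!terms.term().field().equals(field)) continue;
        String t = terms.term().text();
        int d = levenshtein(word, t);
        if (d < bestDist) { bestDist = d; best = t; }
    }
    terms.close();
    return best;
}

// Standard dynamic-programming edit distance.
static int levenshtein(String a, String b) {
    int[][] m = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) m[i][0] = i;
    for (int j = 0; j <= b.length(); j++) m[0][j] = j;
    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
            m[i][j] = Math.min(Math.min(m[i - 1][j] + 1, m[i][j - 1] + 1),
                               m[i - 1][j - 1] + cost);
        }
    }
    return m[a.length()][b.length()];
}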



Cheers,
Aad
--
Aad Nales
[EMAIL PROTECTED], +31-(0)6 54 207 340 




Re: (n00b) Meaning of Hits.id (int)

2004-09-09 Thread Peter Pimley
Oh, it's that simple. :)
Thanks for that!
Peter
Morus Walter wrote:
It's lucenes internal id or document number which allows you to access
the document and its stored fields.
See 
IndexSearcher.doc(int i)
or
IndexReader.document(int n)

The docs just don't name the parameter 'id'.
 




RE: Existing Parsers

2004-09-09 Thread Cocula Remi
For Word see the tm-extractor at www.text-mining.org (based on POI). Pretty simple to use.


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Thursday, September 9, 2004 15:47
To: Lucene Users List
Subject: Existing Parsers


Anyone know of any reliable parsers out there for pdf word 
excel or powerpoint?




Re: Existing Parsers

2004-09-09 Thread Honey George
Hi,
  I know some of them.
1. PDF
 + http://www.pdfbox.org/
 + http://www.foolabs.com/xpdf/download.html
   - I am using this and found it good. It even supports
 various languages.
2. word
  + http://sourceforge.net/projects/wvware
3. excel
  + http://www.jguru.com/faq/view.jsp?EID=1074230

-George
 --- [EMAIL PROTECTED] wrote: 
> Anyone know of any reliable parsers out there for
> pdf word 
> excel or powerpoint?
> 
>



Re: Existing Parsers

2004-09-09 Thread Chas Emerick
There are a number of libraries for Java that provide PDF text 
extraction functionality.  A pretty comprehensive list is available at 
< http://www.geocities.com/marcoschmidt.geo/java-libraries-pdf.html >.  
I'm obviously biased towards recommending our solution, PDFTextStream < 
http://snowtide.com/home/PDFTextStream/ >; it's the fastest thing out 
there for Java, and it provides a very easy-to-use Lucene integration 
module that will have you up and running in no time < 
http://snowtide.com/home/PDFTextStream/techtips/easy_lucene_integration 
>.

For office documents, just about the only game in town that I know of 
is the Jakarta POI project < http://jakarta.apache.org/poi/ >.  It's 
been quite a while since I've touched it, but it's definitely the best 
place to start.

Chas Emerick   |   [EMAIL PROTECTED]
PDFTextStream: fast PDF text extraction for Java apps and Lucene
http://snowtide.com/home/PDFTextStream/
On Sep 9, 2004, at 9:47 AM, <[EMAIL PROTECTED]> wrote:
Anyone know of any reliable parsers out there for pdf word
excel or powerpoint?



Re: Case sensitiveness and wildcard searches

2004-09-09 Thread "René Hackl"
George,

The QueryParser does toLowerCase() on WildcardQueries by default. Hence
you'd need to follow Daniel's advice to use

 QueryParser's setLowercaseWildcardTerms(false) 

if you wanted IM* to stay IM*.
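
A quick sketch, assuming Lucene 1.4's QueryParser (earlier versions may
lack this setter):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;

public class WildcardCase {
    public static void main(String[] args) throws Exception {
        QueryParser qp = new QueryParser("title", new StandardAnalyzer());
        qp.setLowercaseWildcardTerms(false);  // keep wildcard terms as typed
        System.out.println(qp.parse("IM*"));  // title:IM* instead of title:im*
    }
}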

Cheers,
René





Existing Parsers

2004-09-09 Thread dhatcher
Anyone know of any reliable parsers out there for pdf word 
excel or powerpoint?




Re: Case sensitiveness and wildcard searches

2004-09-09 Thread Honey George
Thanks for links René,
 The mail is not exactly talking about my case because
the StandardAnalyzer which I use does lowercase the
input. So it is the same scenario as the FAQ entry.

-George

 --- "René_Hackl" <[EMAIL PROTECTED]> wrote: 
> Hi George,
> 
> I'm not sure about v1.3, but you may want to take a
> look
> at
> 
>
http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=9342
> 
> or
> 
>
http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1806371
> 
> cheers,
> René
> 



Re: Handling user queries (Was: Re: MultiFieldQueryParser seems broken... Fix attached.)

2004-09-09 Thread sergiu gordea
René Hackl wrote:
is it a problem if the users will search "coffee OR tea" as a search 
string in the case that MultiFieldQueryParser is
modified as Bill suggested, and the default operator is set to AND?
   

No. There's not a problem with the proposed correction to MFQP. MFQP should
work the way Bill suggested.
My babbling about coffee or tea was more aimed at Bill's referring to "darn
users started demanding" . So this is a totally different
matter. In my experience, many users fall to everyday language traps, like
in: "What do you want to drink, coffee or tea?" The answer normally isn't
'yes' to both, is it?  

 

This problem may be solved if the users know what the following signs mean:
- + "" * ~
This will improve the results more than our parsing does ...

I have an app where in some cases I make subqueries for an initial
user-stated query. The aim is to come up with pointers to partial matching
docs. The background is, one ill-advised NOT can ruin a query. But this has
nothing to do with MFQP. Just random thoughts about making users happy even
when they are new to formulating queries :-)
Cheers,
René
 






Re: Using MySpell iso the Snowball Analyzer

2004-09-09 Thread Pete Lewis
Hi Aad

Use the stemmed result as what you index, but then also remember to stem the
query terms as well - you need to do the same on the way out as on the way
in.
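
For example, a minimal analyzer sketch, using the stock PorterStemFilter
as a stand-in for MySpell or a custom stemmer:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Pass the same analyzer to IndexWriter and QueryParser so the
// index-time and query-time stems line up.
public class StemmingAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new StandardTokenizer(reader);
        ts = new LowerCaseFilter(ts);
        return new PorterStemFilter(ts);
    }
}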

We don't use MySpell but we do use our own stemmer in this way, as there are
many examples where Snowball falls down like:

caught -> caught instead of catch
buses -> buse instead of bus

and Snowball gets worse for non-English languages like Dutch

Cheers
Pete

- Original Message - 
From: "Aad Nales" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, September 09, 2004 8:44 AM
Subject: Using MySpell iso the Snowball Analyzer


> For an educational customer we have been requested to add spell
> checking to queries that enter Lucene. The MySpell classes of
> Pietschmann seem to make this more than feasible. What I wonder is
> whether somebody else has done this before? Any tips, questions or remarks?
>
> MySpell is the successor of ISpell and is used as the spellchecker in
> OpenOffice. It executes a stemming algorithm in combination with a
> dictionary. My second question is whether anyone has extracted the stemming
> result to be used in an index?
>
> Thanks for any or all feedback,
> cheers,
> Aad
>
>
> --
> Aad Nales
> [EMAIL PROTECTED], +31-(0)6 54 207 340
>
>
>



Handling user queries (Was: Re: MultiFieldQueryParser seems broken... Fix attached.)

2004-09-09 Thread "René Hackl"
> is it a problem if the users will search "coffee OR tea" as a search 
> string in the case that MultiFieldQueryParser is
> modified as Bill suggested, and the default operator is set to AND?

No. There's not a problem with the proposed correction to MFQP. MFQP should
work the way Bill suggested.

My babbling about coffee or tea was more aimed at Bill's referring to "darn
users started demanding" . So this is a totally different
matter. In my experience, many users fall to everyday language traps, like
in: "What do you want to drink, coffee or tea?" The answer normally isn't
'yes' to both, is it?  

I have an app where in some cases I make subqueries for an initial
user-stated query. The aim is to come up with pointers to partial matching
docs. The background is, one ill-advised NOT can ruin a query. But this has
nothing to do with MFQP. Just random thoughts about making users happy even
when they are new to formulating queries :-)

Cheers,
René




TermQuery PROBLEM!!!

2004-09-09 Thread iouli . golovatyi

Hello everybody and esp. Erik:)

In case the search entry is empty, I'd like to retrieve all documents that came
in during the last minute.
I have a field (created as doc.add(Field.Keyword(F_PUBLISHORT,
publishort));) containing this data in "MMddhhMM" format.

The problem is I get nothing, but I do know I have documents with, for
example, the "200404271420" value.

WHAT AM I DOING WRONG?

When I do queries based on QueryParser (i.e. filter!=null) everything is
ok.

Thanks in advance
J.


...
if (filter == null || filter.equals("")) {
    filter = null;
    line = "200404271420";
    fld = "publishort";
}
...

if (filter == null) {
    query = new TermQuery(new Term(fld, line));
} else {
    NeisQueryParser nqp = new NeisQueryParser();
    query = nqp.parse(line);
}

formated_query = query.toString();
Sort sort = null;
ms = getMS(); // MultiSearcher

if (filter == null) {
    if (sort_byscore) hits = ms.search(query, getCurrentTimeFilter());
    else hits = ms.search(query, getCurrentTimeFilter(), "publishort");
} else {
    if (range_flag) {
        if (sort_byscore) hits = ms.search(query, getDateFilter(), (String) null);
        else hits = ms.search(query, getDateFilter(), sort);
    } else {
        if (sort_byscore) hits = ms.search(query);
        else hits = ms.search(query, sort);
    }
}
total_hitnum = hits.length();
String logdata = "";






Using MySpell iso the Snowball Analyzer

2004-09-09 Thread Aad Nales
For an educational customer we have been requested to add spell
checking to queries that enter Lucene. The MySpell classes of
Pietschmann seem to make this more than feasible. What I wonder is
whether somebody else has done this before? Any tips, questions or remarks?

MySpell is the successor of ISpell and is used as the spellchecker in
OpenOffice. It executes a stemming algorithm in combination with a
dictionary. My second question is whether anyone has extracted the stemming
result to be used in an index?

Thanks for any or all feedback,
cheers,
Aad


--
Aad Nales
[EMAIL PROTECTED], +31-(0)6 54 207 340 






Re: PDF->Text Performance comparison

2004-09-09 Thread Ben Litchfield
>  1) I tried to migrate to newer versions (0.6.4, 0.6.5, 0.6.6), but all the time I had
>  problems with parsing the same pdf documents, which worked well for
>  0.6.3. I mentioned my problems here:
>   https://sourceforge.net/tracker/?func=detail&atid=552832&aid=1021691&group_id=78314

I am waiting for a response from you on this issue; try to log in to SF
when posting bugs so you get a notification when the issue is updated.



>  2) When I started with 0.6.3 I experienced performance problems
>  too, especially with large pdf documents (I had several of more
>  than 20MB in size). I changed the source a bit, wrapping the following line
>  of the BaseParser class:

I will give that a try, thanks for letting me know.

Ben




Re: (n00b) Meaning of Hits.id (int)

2004-09-09 Thread Morus Walter
Peter Pimley writes:
> 
> My documents are not stored in their original form by Lucene, but in a 
> separate database.  My Lucene docs do, however, store the primary key, so 
> that I can fetch the original version from the database to show the user 
> (does that sound sane?)
> 
yes.

> I see that the 'Hits' class has an id (int) method, which sounds 
> interesting.  The javadoc says "Returns the id for the nth document in 
> this set.".  However, I can't find any mention anywhere else about 
> Document ids.  Could anybody explain what this is?
> 
It's lucenes internal id or document number which allows you to access
the document and its stored fields.

See 
IndexSearcher.doc(int i)
or
IndexReader.document(int n)

The docs just don't name the parameter 'id'.
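
A minimal sketch of pulling the stored database key back out (the "pk"
field name is just a placeholder):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;

static void showResults(IndexSearcher searcher, Hits hits) throws IOException {
    for (int i = 0; i < hits.length(); i++) {
        Document d = hits.doc(i);          // or searcher.doc(hits.id(i))
        String primaryKey = d.get("pk");   // the stored database key
        // ... fetch the record from the database and render it
    }
}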

Morus




Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-09 Thread sergiu gordea
René Hackl wrote:
Bill,
Thank you for clarifying on that issue. I missed the...
 

(title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
   

  ...
 

(title:cutting OR title:lucene) AND (author:cutting OR author:lucene)
Note that this would match even if only "lucene" occurred in the
   

... "only lucene"/"only cutting" match. 

 

I'd think that if a user specified a query "cutting lucene", with an
implicit AND and the default fields "title" and "author", they'd
expect to see a match in which both "cutting" and "lucene" appear. 
   

Hopefully they'd expect that. Sometimes users assume that e.g. "coffee OR
tea" would provide matches with either term, but not both. But this is
already "user-attune your application" territory. Your proposal makes
perfect sense, of course.
René
 

is it a problem if the users will search "coffee OR tea" as a search 
string in the case that MultifieldQueryParser is
modifyed as Bill suggested?, and the default opperator is set to AND?

I don't think so ... I think that the resulting Query should be:
(title:cutting OR author:cutting) OR (title:lucene OR author:lucene)
And I think that the results will be correct.
Am I wrong?
I don't know exactly what will happen with more complex queries that use grouping, 
exact matches and the NOT operator,
like:
 (alcohol NOT tea) OR ("black tea" AND brandy)
What will happen if you send this to a MultiFieldQueryParser that searches an index 
with the fields "drink" and "juices"?
Maybe these kinds of search constructions should be part of the JUnit tests, if they are 
not already there.
Thanks,
Sergiu 
 





Re: Case sensitiveness and wildcard searches

2004-09-09 Thread "René Hackl"
Hi George,

I'm not sure about v1.3, but you may want to take a look
at

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=9342

or

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1806371

cheers,
René




(n00b) Meaning of Hits.id (int)

2004-09-09 Thread Peter Pimley
Hello everyone.
I'm in the process of writing "my first lucene app", and I've got to the 
bit where I get my search results back (very exciting! ;).

My documents are not stored in their original form by Lucene, but in a 
separate database.  My Lucene docs do, however, store the primary key, so 
that I can fetch the original version from the database to show the user 
(does that sound sane?)

I see that the 'Hits' class has an id (int) method, which sounds 
interesting.  The javadoc says "Returns the id for the nth document in 
this set.".  However, I can't find any mention anywhere else about 
Document ids.  Could anybody explain what this is?

Many Thanks in Advance,
Peter Pimley, Semantico


Case sensitiveness and wildcard searches

2004-09-09 Thread Honey George
Hi,
 I noticed a behavior with wildcard searches and would like
to clarify it.

From the FAQ
http://www.jguru.com/faq/view.jsp?EID=538312
on JGuru, the Analyzer is not used for wildcard queries.
In my case I have a document which contains the word
IMPORTANT. I use PorterStemFilter + StandardAnalyzer
for indexing & searching. I am getting the document if
I search for the word IM*. But if the analyzer is not used,
then what converts the word to lowercase?

My code will look like this.

---
QueryParser qp=new QueryParser("title",
  new MyAnalyzer());
Query q = qp.parse(text);
---


Though I pass the text in uppercase (IM*), when I
print the Query object I can see it in lowercase,
something like  (title:im*)

I am using lucene-1.3-final. Can someone explain this?

Thanks & regards,
   George










Re: indexing size

2004-09-09 Thread Bernhard Messer
Dmitry Serebrennikov wrote:
Niraj Alok wrote:
Hi PA,
Thanks for the detail! Since we are using Lucene to store the data 
also, I guess I would not be able to use it.
 

By the way, I could be wrong, but I think the 35% figure you 
referenced in the your first e-mail actually does not include any 
stored fields. The deal with 35% was, I think, to illustrate that 
index data structures used for searching by Lucene are efficient. But 
Lucene does nothing special about stored content - no compression or 
anything like that. So you end up with the pure size of your data plus 
the 35% of the indexed data.
There will be a patch available by the end of this week which allows 
you to store binary values compressed within a Lucene index. It means 
that you will be able to store and retrieve whole documents within 
Lucene in a very efficient way ;-)

regards
bernhard

Cheers.
Dmitry.
Regards,
Niraj
- Original Message -
From: "petite_abeille" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 01, 2004 1:14 PM
Subject: Re: indexing size
 

Hi Niraj,
On Sep 01, 2004, at 06:45, Niraj Alok wrote:
  

If I make some of them Field.Unstored, I can see from the javadocs
that it
will be indexed and tokenized but not stored. If it is not stored, how
can I
use it while searching?

The different type of fields don't impact how you do your search. This
is always the same.
Using Unstored fields simply means that you use Lucene as a pure index
for search purpose only, not for storing any data.
Specifically, the assumption is that your original data lives somewhere
else, outside of Lucene. If this assumption is true, then you can index
everything as Unstored with the addition of one Keyword per document.
The Keyword field holds some sort of unique identifier which allows you
to retrieve the original data if necessary (e.g. a primary key, an URI,
what not).
Here is an example of this approach:
(1) For indexing, check the indexValuesWithID() method
http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/
SZIndex.java?view=markup
Note the addition of a Field.Keyword for each document and the use of
Field.UnStored for everything else
(2) For fetching, check objectsWithSpecificationAndHitsInStore()
http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/
SZFinder.java?view=markup
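In code, that combination looks roughly like this (a sketch with
placeholder field names):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

static void indexRecord(IndexWriter writer, String primaryKey, String text)
        throws IOException {
    Document doc = new Document();
    doc.add(Field.Keyword("id", primaryKey));  // stored, untokenized identifier
    doc.add(Field.UnStored("contents", text)); // indexed for search, not stored
    writer.addDocument(doc);
}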
HTH.
Cheers,
PA.


Re: PDF->Text Performance comparison

2004-09-09 Thread Maxim Patramanskij
Hello Ben,

I've been using PDFBox within last year, but only version 0.6.3,
because of 2 reasons:

 1) I tried to migrate to newer versions (0.6.4, 0.6.5, 0.6.6), but all the time I had
 problems with parsing the same pdf documents, which worked well for
 0.6.3. I mentioned my problems here:
  https://sourceforge.net/tracker/?func=detail&atid=552832&aid=1021691&group_id=78314

 2) When I started with 0.6.3 I experienced performance problems
 too, especially with large pdf documents (I had several of more
 than 20MB in size). I changed the source a bit, wrapping the following line
 of the BaseParser class:

out = stream.createFilteredStream( streamLength );

to

out = new BufferedOutputStream(stream.createFilteredStream( streamLength ));


 The performance increase I got was huge:
 parsing a 21MB pdf document to text was taking 78
 seconds before the modification and 12 seconds after, so more than 6 times faster.

 I also tried to use buffered streams in some other places, but the effect was
 not that visible. I hope this change can also be incorporated into
 the current 0.6.6 release, and then the benchmarks may swing to PDFBox's side
 :)


 Max


BL> On Wed, 8 Sep 2004, Chas Emerick wrote:
>> PDFTextStream: fast PDF text extraction for Java applications
>> http://snowtide.com/home/PDFTextStream/


BL> For those that have not seen, snowtide.com has done a performance
BL> comparison against several Java PDF->Text libraries, including Snowtide's
BL> PDFTextStream, PDFBox, Etymon PJ and JPedal.  It appears to be fairly well
BL> done.

BL> http://snowtide.com/home/PDFTextStream/Performance


BL> PDFBox: slow PDF text extraction for Java applications
BL> http://www.pdfbox.org

BL> :)

BL> Ben






-- 
Best regards,
 Maxim    mailto:[EMAIL PROTECTED]





combining open office spellchecker with Lucene

2004-09-09 Thread Aad Nales
Hi All,

Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
based on the spellchecker of OpenOffice. My question is: "has anybody
tried this before?" 

Cheers,
Aad


--
Aad Nales
[EMAIL PROTECTED], +31-(0)6 54 207 340 






Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-09 Thread "René Hackl"
Bill,

Thank you for clarifying on that issue. I missed the...

> (title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
   ...
> (title:cutting OR title:lucene) AND (author:cutting OR author:lucene)
> 
> Note that this would match even if only "lucene" occurred in the

... "only lucene"/"only cutting" match. 

> I'd think that if a user specified a query "cutting lucene", with an
> implicit AND and the default fields "title" and "author", they'd
> expect to see a match in which both "cutting" and "lucene" appear. 

Hopefully they'd expect that. Sometimes users assume that e.g. "coffee OR
tea" would provide matches with either term, but not both. But this is
already "user-attune your application" territory. Your proposal makes
perfect sense, of course.

René




