Re: umlaut normalisation

2004-01-27 Thread Thomas Scheffler

Andrzej Bialecki wrote:
> Thomas Scheffler wrote:
>
>> Hi,
>>
>> Is it possible with Lucene to use umlaut normalisation?
>> For example: query "Hühnerstall" --> query "Huehnerstall".
>>
>> This of course requires that the document was indexed with normalised
>> umlauts.
>> This issue is very important, because not everyone searching
>> German documents has a German keyboard.
>
> It seems to me the best place would be to put this replacement in a
> custom Analyzer (perhaps extend GermanAnalyzer?).

I thought it would already be available somehow, since it's supported in
other major text-search engines, for example NSE from IBM. Why not in
Lucene?
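A sketch of the replacement Andrzej suggests: expand German umlauts to their two-letter ASCII forms both at index time and at query time. In a custom Analyzer this would sit in a TokenFilter; the transform itself is plain Java, and the class name here is made up:

```java
// Sketch: expand German umlauts to ASCII digraphs, so that
// "Hühnerstall" and "Huehnerstall" index and search identically.
// In a custom Analyzer this would run inside a TokenFilter before stemming.
public class UmlautExpander {
    public static String expand(String s) {
        StringBuilder out = new StringBuilder(s.length() + 4);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case 'ä': out.append("ae"); break;
                case 'ö': out.append("oe"); break;
                case 'ü': out.append("ue"); break;
                case 'Ä': out.append("Ae"); break;
                case 'Ö': out.append("Oe"); break;
                case 'Ü': out.append("Ue"); break;
                case 'ß': out.append("ss"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```

The crucial part is applying the same transform on both sides, so the indexed terms and the query terms agree.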

>
>> This brings me to the next problem. Currently only Luke delivers results
>> for "Hühnerstall"; my self-implemented solution always turns it into
>> "huhnerstall" in the query (why?). But there is no "huhnerstall"
>> indexed.
>>
>
> Please check which Analyzer you're using in each case.
>

DEBUG Query: MyCoReDemoDC_derivate_0014-->Hühnerstall
DEBUG Set DerivateID to MyCoReDemoDC_derivate_0014 for next query...
DEBUG parsing query using: org.apache.lucene.analysis.de.GermanAnalyzer
DEBUG adding clause: content:huhnerstall
DEBUG preparsed query:(+DerivateID:MyCoReDemoDC_derivate_0014
+content:huhnerstall)

It's the GermanAnalyzer. No matter what I choose in Luke, it always finds
documents for "Hühnerstall", but I'm not able to find them the
self-programmed way. My extended QueryParser overrides parse(): it prints
out the Analyzer and then parses the string with super.parse(String). The
resulting Query is put into a BooleanClause and later combined with the
first part (a field query using WhitespaceAnalyzer) you see above into a
new Query.
So there is one query part using the WhitespaceAnalyzer and the other
using the GermanAnalyzer. But I don't know why "Hühnerstall" gets turned
into "huhnerstall".

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



umlaut normalisation

2004-01-27 Thread Thomas Scheffler
Hi,

Is it possible with Lucene to use umlaut normalisation?
For example: query "Hühnerstall" --> query "Huehnerstall".

This of course requires that the document was indexed with normalised umlauts.
This issue is very important, because not everyone searching
German documents has a German keyboard.

This brings me to the next problem. Currently only Luke delivers results
for "Hühnerstall"; my self-implemented solution always turns it into
"huhnerstall" in the query (why?). But there is no "huhnerstall"
indexed.

regards Thomas





Re: BooleanQuery question

2004-01-16 Thread Thomas Scheffler

Karl Koch wrote:
> Hi all,
>
> Why does the boolean query have a "required" and a "prohibited" field
> (boolean
> value)? If something is required it cannot be forbidden, and vice versa.
> How does this match the Boolean model we know from theory?

What if required and prohibited are both off? That's something we need.

>
> Are there differences between Lucene and the Boolean model in theory?

To encode three states you need at least two bits. So much for the
theory.
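The three states those two bits encode can be sketched like this (plain Java, independent of the Lucene API; the class and method names are made up, and required together with prohibited is simply an invalid combination):

```java
// Sketch: two boolean flags encode three valid clause states.
public class ClauseState {
    public static String describe(boolean required, boolean prohibited) {
        if (required && prohibited)
            throw new IllegalArgumentException("cannot be both required and prohibited");
        if (required)   return "MUST";      // +term
        if (prohibited) return "MUST_NOT";  // -term
        return "SHOULD";                    // both off: optional, affects scoring only
    }
}
```

The "both off" case is the one that makes the two flags necessary: an optional clause contributes to the score without constraining the result set.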


Kind regards

Thomas




Re: merged search of document

2004-01-14 Thread Thomas Scheffler

Thomas Scheffler wrote:
> Hi,
>
> I need a tip for an implementation. I have several documents, all of them
> with a field named DocID. DocID identifies not a single Lucene Document
> but a collection of them. When I start a search, it should handle the
> search as if these Lucene documents were one.
>
> Example:
>
> Document 1: DocID:XYZ
>
> containing: foo
>
> Document 2: DocID:XYZ
>
> containing: bar
>
> Document 3: DocID:ABC
>
> containing: foo bar
>
> Document 4: DocID:GHJ
>
> containing: foo
>
> As you have already guessed, when I'm searching for "+foo +bar" I want
> the hits to contain Document 1, Document 2 and Document 3, but not
> Document 4. Is it clear what I want? How do I implement such a monster?
> Is that possible with Lucene? The content is not stored within Lucene;
> it's just tokenized and indexed.

OK,

in case this ever needs to be answered again, I'm posting my solution
here. It's not optimal yet, but it works. In my concrete implementation
the "DocID" is called "DerivateID".

First of all I get all DerivateIDs out of the index and perform the
search for every DerivateID (that's the non-optimal part). For a search
like (foo bar) one query is:

(+DerivateID:MyCoReDemoDC_derivate_0014 +foo)
(+DerivateID:MyCoReDemoDC_derivate_0014 +bar)

The (in this case two) "OR" subqueries are split up, and for every such
subquery hits.length() must be greater than 0. If it is, then
MyCoReDemoDC_derivate_0014 is a hit for the search (foo bar). For (foo
-bar) it was a bit more difficult, as you have to make sure that no
document with the same DerivateID contains "bar". So another query is
started, and only if both hit counts, the original and the rechecking
one, are equal does the query deliver a result.

As mentioned before, it's not optimal yet, but it allows you to search
for things like (foo bar), ("foo bar") and (foo -bar), and that was all
I needed.
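The per-DocID check can be simulated without Lucene at all. Here is a self-contained sketch of the logic (all names made up): group the pages by DocID, then require every MUST term to occur somewhere in the group and every MUST NOT term to occur nowhere in it.

```java
import java.util.*;

// Sketch: treat all pages sharing a DocID as one logical document.
// A group matches when every required term occurs in at least one page
// of the group and no prohibited term occurs in any page.
public class MergedSearch {
    public static Set<String> search(Map<String, List<String>> pagesByDocId,
                                     List<String> required,
                                     List<String> prohibited) {
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, List<String>> e : pagesByDocId.entrySet()) {
            // Flatten the group's pages into one bag of terms.
            List<String> terms =
                Arrays.asList(String.join(" ", e.getValue()).split("\\s+"));
            boolean ok = terms.containsAll(required);
            for (String p : prohibited) {
                if (terms.contains(p)) ok = false;
            }
            if (ok) hits.add(e.getKey());
        }
        return hits;
    }
}
```

For the example from the original message, required={foo, bar} matches XYZ and ABC but not GHJ, while required={foo}, prohibited={bar} matches only GHJ.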

If you want to look inside the code go on our cvs system:

http://www.mycore.de/repository/cgi-bin/viewcvs.cgi/mycore/sources/org/mycore/backend/lucene/

I'm thanking all of you for your great help.

Thomas Scheffler




Re: merged search of document

2004-01-12 Thread Thomas Scheffler

Thomas Scheffler wrote:
>
> Erik Hatcher sagte:
>> On Jan 12, 2004, at 8:21 AM, Thomas Scheffler wrote:
>>> OK, I've looked inside QueryParser and it seems to be the right
>>> place to
>>> do that. But it's rather complicated to transform one query into
>>> another, since QueryParserTokenManager, as an extreme example, is not
>>> quite understandable and takes a huge amount of time to work into.
>>
>> Keep in mind that QueryParser is reasonably overridable.  Simply
>> subclass it and override one of the get*Query methods.  This may be the
>> simplest way for you to inject additional clauses into a query.
>
> You're right as long as the source query is a combined (non-atomic) one;
> for those it already works quite well. But because of the last check in
> Query(String) of QueryParser, it returns firstQuery directly and
> getBooleanQuery is never called. As I'm not able to override
> Query(String), I'm not quite sure how to handle searches like: foo.
>

Of course you can override parse(String), and so I do:

public Query parse(String query) throws ParseException {
    Query queryTemp = super.parse(query);
    if (queryTemp.toString(field).equals(query)) {
        Vector v = new Vector();
        BooleanClause clause = new BooleanClause(queryTemp, true, false);
        v.add(clause);
        return getBooleanQuery(v);
    }
    return queryTemp;
}

Now all relevant cases seem to work. Thanks for the help; you have pushed
my work a great step forward.






Re: merged search of document

2004-01-12 Thread Thomas Scheffler

Erik Hatcher wrote:
> On Jan 12, 2004, at 8:21 AM, Thomas Scheffler wrote:
>> OK, I've looked inside QueryParser and it seems to be the right
>> place to
>> do that. But it's rather complicated to transform one query into
>> another, since QueryParserTokenManager, as an extreme example, is not
>> quite understandable and takes a huge amount of time to work into.
>
> Keep in mind that QueryParser is reasonably overridable.  Simply
> subclass it and override one of the get*Query methods.  This may be the
> simplest way for you to inject additional clauses into a query.

You're right as long as the source query is a combined (non-atomic) one;
for those it already works quite well. But because of the last check in
Query(String) of QueryParser, it returns firstQuery directly and
getBooleanQuery is never called. As I'm not able to override
Query(String), I'm not quite sure how to handle searches like: foo.

Thomas




Re: merged search of document

2004-01-12 Thread Thomas Scheffler

Erik Hatcher wrote:
> On Jan 7, 2004, at 4:18 PM, Dror Matalon wrote:
>> Actually I would guess that performance should be fine. I would look at
>> the code generated by the standard analyzer,
>> http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/
>> standard/package-summary.html
>> which translates from "(a AND b) OR c" to "+a +b c" and then see what
>> it does with it.
>
> Huh?  StandardAnalyzer does not do that type of translation.
> QueryParser is what parses "(a AND b) OR c" into a nested BooleanQuery.
>
> Query.toString is what will visually turn it into the +/- view.

OK, I've looked inside QueryParser and it seems to be the right place to
do that. But it's rather complicated to transform one query into another,
since QueryParserTokenManager, as an extreme example, is not quite
understandable and takes a huge amount of time to work into.

Here's my progress. First I grab all values for "DocID" in the Lucene
index. This seems to work well, following a sample posted earlier on this
mailing list.
For every "DocID" I want to search for the submitted query. Therefore I
need to transform the queries according to the following rules:

"foo bar" --> +DocID:inserthere +"foo bar"

foo bar --> (+DocID:inserthere +foo) AND (+DocID:inserthere +bar)

foo -bar --> (+DocID:inserthere +foo) AND (+DocID:inserthere -bar)

I think this should do the trick, since (foo bar) should be understood as
a query where both "foo" and "bar" must be in the document. So the
relatively simple syntax of my queries needs only a little adaptation to
run with Lucene, I think. But of course I don't know how to transform
queries styled as above.
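The rewrite rules above can be sketched as plain string manipulation over a flat query (a simplification: it ignores phrases and nesting, "inserthere" is replaced by the concrete DocID, and the class name is made up):

```java
import java.util.*;

// Sketch: rewrite a flat query like "foo -bar" into per-DocID clauses,
// e.g. (+DocID:XYZ +foo) AND (+DocID:XYZ -bar). Phrases/nesting ignored.
public class DocIdRewriter {
    public static String rewrite(String query, String docId) {
        List<String> clauses = new ArrayList<>();
        for (String term : query.trim().split("\\s+")) {
            if (term.startsWith("-")) {
                // Prohibited term: keep the minus inside the per-DocID clause.
                clauses.add("(+DocID:" + docId + " -" + term.substring(1) + ")");
            } else {
                clauses.add("(+DocID:" + docId + " +" + term + ")");
            }
        }
        return String.join(" AND ", clauses);
    }
}
```

Each generated clause would then be run as its own search, with the hit counts compared as described in the follow-up message.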

Any help on this topic available?

Thank you.

Thomas




Re: merged search of document

2004-01-07 Thread Thomas Scheffler
On Wed, 07 Jan 2004 at 22:38, Erik Hatcher wrote:
> On Jan 7, 2004, at 4:18 PM, Dror Matalon wrote:
> > Actually I would guess that performance should be fine. I would look at
> > the code generated by the standard analyzer,
> > http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/
> > standard/package-summary.html
> > which translates from "(a AND b) OR c" to "+a +b c" and then see what  
> > it does with it.
> 
> Huh?  StandardAnalyzer does not do that type of translation.   
> QueryParser is what parses "(a AND b) OR c" into a nested BooleanQuery.
> 
> Query.toString is what will visually turn it into the +/- view.

QueryParser delivers a single Query to me. Since I have to build several
queries, how does writing my own QueryParser help me?
How do QueryParser, Query and IndexSearcher work together?
Which class do I need to extend?





Re: merged search of document

2004-01-07 Thread Thomas Scheffler
On Wed, 07 Jan 2004 at 21:57, Dror Matalon wrote:
> I can see the problem, but I'm not sure it's something Lucene should
> provide. I guess you can try to do some post processing of Lucene
> results. For AND and OR operations it should be quite easy. If you get
> any hits for a page in a book, the whole book has the terms. The hard
> part will be handling "NOT" operations. Seems like you'd have to
> actually do a '+' search for the term and then rule out all the books
> that do contain the term. Yuck.

That sounds quite good to me; I hope it doesn't perform too badly.
How can I turn a query like (foo OR bar AND foobar AND -"foo bar") into a
bunch of queries (I hope I got all the important cases)? What would you
suggest?





Re: merged search of document

2004-01-07 Thread Thomas Scheffler
On Wed, 07 Jan 2004 at 20:10, Dror Matalon wrote:
> On Wed, Jan 07, 2004 at 07:58:52PM +0100, Thomas Scheffler wrote:
> > On Wed, 07 Jan 2004 at 19:00, Dror Matalon wrote:
> > > The solution is simple, but you need to think of it conceptually in a
> > > different way. Instead of "all documents with the same DocID are the same
> > > document" think "fetch all the document where DocId is XYZ."
> > > 
> > > Assuming the contents are in a field called contents
> > > you do 
> > > +(DocID:XYZ) (contents:foo) (contents:bar)
> > 
> > I had already gone that way, but think of a search like (foo -bar). With
> > your solution it will result in a hit, because on page 345 (to keep my
> > example) the word "foo" occurs and there is no "bar". Of course, with my
> > model, I want the book not to get a hit for that query. You see how hard
> > it is to handle, don't you?
> 
> I think I'm starting to understand. So you want to treat several
> documents as one, and if the hit fails for one of the documents, it
> should fail for all the documents with the same id. OK. This begs the
> question: why don't you make all these documents with the same id one
> document, and index them together?

This would be a functional but not nice solution. The "pages" are sent to
my Java class; that I cannot change because of an API restriction. To
index 1000 pages I have to index the first one; when I get the second one
I need to re-fetch the first page, bind both together and send them to
the IndexWriter. I must keep track of every single page the "book"
contains. This procedure is repeated for every page and gets uglier as
the page count increases. Furthermore my "book" allows single pages to be
deleted or updated. Every time such an atomic task (adding/deleting) is
performed, the index for the whole "book" must be rebuilt. The mechanism
that transforms a "page" into a Lucene document is very time-consuming,
so I want to do that as rarely as possible. It would be great, as you
see, if Lucene could somehow treat a "logical document" (consisting of
several Lucene documents) like a normal Lucene document.

> 
> > 
> > > 
> > > For that matter, you can use a standard analyzer on the query and use a
> > > boolean to tie it to the specific document set.
> > > 
> > > This is how we do searching on a specific channel at fastbuzz.com.
> > > 
> > > Dror
> > > 
> > > 
> > > On Wed, Jan 07, 2004 at 05:21:43PM +0100, Thomas Scheffler wrote:
> > > > 
> > > > Jamie Stallwood wrote:
> > > > > +(DocID:XYZ DocID:ABC) +(foo bar)
> > > > >
> > > > > will find a document that (MUST have (xyz OR abc)) AND (MUST have (foo OR
> > > > > bar)).
> > > > 
> > > > This is just a solution for the example; in the real world I really
> > > > don't have documents containing "foo" or "bar". What I meant was:
> > > > make Lucene think that all documents with the same DocID are ONE
> > > > document. Imagine you have a big book, say 1000 pages. Instead of
> > > > putting the whole book into the index, you split it up into single
> > > > pages and index them. Then it's faster, when a page changes or is
> > > > deleted, to update your index instead of doing it over and over
> > > > again for all 1000 pages. But the problem starts when you're
> > > > searching the book. You search for (foo bar); foo is on page 345
> > > > while bar is on page 435. You want to get a hit for the book. So I
> > > > need a solution matching this more generic example.
> > > > 
> > > > >
> > > > > -Original Message-
> > > > > From: Thomas Scheffler [mailto:[EMAIL PROTECTED]
> > > > > Sent: 07 January 2004 11:23
> > > > > To: [EMAIL PROTECTED]
> > > > > Subject: merged search of document
> > > > >
> > > > > Hi,
> > > > >
> > > > > I need a tip for an implementation. I have several documents, all
> > > > > of them with a field named DocID. DocID identifies not a single
> > > > > Lucene Document but a collection of them. When I start a search, it
> > > > > should handle the search as if these Lucene documents were one.
> > > > >
> > > > > example:
> > > > >
> > > > > Document 1: DocID:XYZ
> > > > >
> > > > > con

Re: merged search of document

2004-01-07 Thread Thomas Scheffler
On Wed, 07 Jan 2004 at 19:00, Dror Matalon wrote:
> The solution is simple, but you need to think of it conceptually in a
> different way. Instead of "all documents with the same DocID are the same
> document", think "fetch all the documents where DocID is XYZ."
> 
> Assuming the contents are in a field called contents
> you do 
> +(DocID:XYZ) (contents:foo) (contents:bar)

I had already gone that way, but think of a search like (foo -bar). With
your solution it will result in a hit, because on page 345 (to keep my
example) the word "foo" occurs and there is no "bar". Of course, with my
model, I want the book not to get a hit for that query. You see how hard
it is to handle, don't you?
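The failure mode described here is easy to reproduce without Lucene: page-by-page matching accepts a (foo -bar) query as soon as one page contains "foo" and not "bar", even though the book as a whole contains "bar". A self-contained sketch (all names made up):

```java
import java.util.*;

// Sketch: page-level vs. book-level matching for a (foo -bar) query.
public class NotQueryPitfall {
    // Page-level: does any single page contain "foo" but not "bar"?
    public static boolean pageLevelHit(List<String> pages) {
        for (String page : pages) {
            if (page.contains("foo") && !page.contains("bar")) return true;
        }
        return false;
    }

    // Book-level: the whole book must contain "foo" and must not contain "bar".
    public static boolean bookLevelHit(List<String> pages) {
        String book = String.join(" ", pages);
        return book.contains("foo") && !book.contains("bar");
    }
}
```

For a book with "foo" on one page and "bar" on another, pageLevelHit returns true while bookLevelHit returns false, which is exactly the discrepancy at issue.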

> 
> For that matter, you can use a standard analyzer on the query and use a
> boolean to tie it to the specific document set.
> 
> This is how we do searching on a specific channel at fastbuzz.com.
> 
> Dror
> 
> 
> On Wed, Jan 07, 2004 at 05:21:43PM +0100, Thomas Scheffler wrote:
> > 
> > Jamie Stallwood wrote:
> > > +(DocID:XYZ DocID:ABC) +(foo bar)
> > >
> > > will find a document that (MUST have (xyz OR abc)) AND (MUST have (foo OR
> > > bar)).
> > 
> > This is just a solution for the example; in the real world I really
> > don't have documents containing "foo" or "bar". What I meant was: make
> > Lucene think that all documents with the same DocID are ONE document.
> > Imagine you have a big book, say 1000 pages. Instead of putting the
> > whole book into the index, you split it up into single pages and index
> > them. Then it's faster, when a page changes or is deleted, to update
> > your index instead of doing it over and over again for all 1000 pages.
> > But the problem starts when you're searching the book. You search for
> > (foo bar); foo is on page 345 while bar is on page 435. You want to get
> > a hit for the book. So I need a solution matching this more generic
> > example.
> > 
> > >
> > > -Original Message-
> > > From: Thomas Scheffler [mailto:[EMAIL PROTECTED]
> > > Sent: 07 January 2004 11:23
> > > To: [EMAIL PROTECTED]
> > > Subject: merged search of document
> > >
> > > Hi,
> > >
> > > I need a tip for an implementation. I have several documents, all of
> > > them with a field named DocID. DocID identifies not a single Lucene
> > > Document but a collection of them. When I start a search, it should
> > > handle the search as if these Lucene documents were one.
> > >
> > > Example:
> > >
> > > Document 1: DocID:XYZ
> > >
> > > containing: foo
> > >
> > > Document 2: DocID:XYZ
> > >
> > > containing: bar
> > >
> > > Document 3: DocID:ABC
> > >
> > > containing: foo bar
> > >
> > > Document 4: DocID:GHJ
> > >
> > > containing: foo
> > >
> > > As you have already guessed, when I'm searching for "+foo +bar" I want
> > > the hits to contain Document 1, Document 2 and Document 3, but not
> > > Document 4. Is it clear what I want? How do I implement such a
> > > monster? Is that possible with Lucene? The content is not stored
> > > within Lucene; it's just tokenized and indexed.
> > >
> > > Any help?
> > >
> > > Thanks in advance!
> > >
> > > Thomas Scheffler
> > >
> > >
> > 
> > 
> > 
--
Computer-science terms, simply explained
=
No. 37 -- Fault-tolerant:

The program does not accept any user input.





RE: merged search of document

2004-01-07 Thread Thomas Scheffler

Jamie Stallwood wrote:
> +(DocID:XYZ DocID:ABC) +(foo bar)
>
> will find a document that (MUST have (xyz OR abc)) AND (MUST have (foo OR
> bar)).

This is just a solution for the example; in the real world I really don't
have documents containing "foo" or "bar". What I meant was: make Lucene
think that all documents with the same DocID are ONE document. Imagine
you have a big book, say 1000 pages. Instead of putting the whole book
into the index, you split it up into single pages and index them. Then
it's faster, when a page changes or is deleted, to update your index
instead of doing it over and over again for all 1000 pages. But the
problem starts when you're searching the book. You search for (foo bar);
foo is on page 345 while bar is on page 435. You want to get a hit for
the book. So I need a solution matching this more generic example.

>
> -Original Message-
> From: Thomas Scheffler [mailto:[EMAIL PROTECTED]
> Sent: 07 January 2004 11:23
> To: [EMAIL PROTECTED]
> Subject: merged search of document
>
> Hi,
>
> I need a tip for an implementation. I have several documents, all of them
> with a field named DocID. DocID identifies not a single Lucene Document
> but a collection of them. When I start a search, it should handle the
> search as if these Lucene documents were one.
>
> Example:
>
> Document 1: DocID:XYZ
>
> containing: foo
>
> Document 2: DocID:XYZ
>
> containing: bar
>
> Document 3: DocID:ABC
>
> containing: foo bar
>
> Document 4: DocID:GHJ
>
> containing: foo
>
> As you have already guessed, when I'm searching for "+foo +bar" I want
> the hits to contain Document 1, Document 2 and Document 3, but not
> Document 4. Is it clear what I want? How do I implement such a monster?
> Is that possible with Lucene? The content is not stored within Lucene;
> it's just tokenized and indexed.
>
> Any help?
>
> Thanks in advance!
>
> Thomas Scheffler
>





merged search of document

2004-01-07 Thread Thomas Scheffler
Hi,

I need a tip for an implementation. I have several documents, all of them
with a field named DocID. DocID identifies not a single Lucene Document
but a collection of them. When I start a search, it should handle the
search as if these Lucene documents were one.

Example:

Document 1: DocID:XYZ

containing: foo

Document 2: DocID:XYZ

containing: bar

Document 3: DocID:ABC

containing: foo bar

Document 4: DocID:GHJ

containing: foo

As you have already guessed, when I'm searching for "+foo +bar" I want
the hits to contain Document 1, Document 2 and Document 3, but not
Document 4. Is it clear what I want? How do I implement such a monster?
Is that possible with Lucene? The content is not stored within Lucene;
it's just tokenized and indexed.

Any help?

Thanks in advance!

Thomas Scheffler

