Re: Summarization; sentence-level and document-level filters.

2003-12-17 Thread Ulrich Mayring
Gregor Heinrich wrote:
Yes, copying a summary from one field to an untokenized field was the plan.

I identified DocumentWriter.invertDocument() as a possible place to add
this document-level analysis. But I admit it appears way too low-level and
inflexible for the overall design.
So I'll make it two-pass indexing.
The way I did it: I'm indexing HTML documents, so before Lucene can do 
anything I need to run an HTML parser. This parser, while scanning the 
tags, builds two text strings at the same time: one that contains the 
document content for indexing and one that contains it for summarizing.

There are relevant differences between those two strings, for example in 
the handling of headlines and punctuation. Lucene then gets to index the 
first string, and the second is given to a summarizer. The summary it 
returns is added to a Lucene field.

This way I can do summarizing and indexing in one pass.
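
A minimal sketch of that one-pass flow, assuming a hypothetical HtmlParser
that fills both buffers and a summarize() hook standing in for whatever
summarizer is used (field names are illustrative; Lucene 1.3-era Field API):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Build one Lucene Document per HTML page in a single pass.
HtmlParser parser = new HtmlParser(html);   // hypothetical parser
parser.parse();                             // scans the tags once, fills both strings

Document doc = new Document();

// First string: content prepared for indexing (tokenized and stored).
doc.add(Field.Text("contents", parser.getIndexText()));

// Second string: content prepared for summarizing (sentence structure kept).
String summary = summarize(parser.getSummaryText());  // your summarizer here

// The summary is stored as-is; it is never tokenized or indexed.
doc.add(Field.UnIndexed("summary", summary));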

Ulrich





Re: Summarization; sentence-level and document-level filters.

2003-12-17 Thread maurits van wijland
Gregor,

I don't have any benchmarks for summarization. Sorry!
I have two test versions of commercial summarizers and
their performance is better than Classifier4J's, but those
are written in C++, so you can't compare properly.

regards,
Maurits


- Original Message - 
From: Gregor Heinrich [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 9:35 PM
Subject: RE: Summarization; sentence-level and document-level filters.


 Maurits: thanks for the hint to Classifier4J -- I have had a look at this
 package and tried the SimpleSummarizer, and it seems to work fine.
 (However, as I don't know the benchmarks for summarization, I'm not the
 one to judge.)

 Do you have experience with it?

 Gregor

 -Original Message-
 From: maurits van wijland [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, December 16, 2003 1:09 AM
 To: Lucene Users List; [EMAIL PROTECTED]
 Subject: Re: Summarization; sentence-level and document-level filters.


 Hi Gregor,

 So far as I know, there is no summarizer in the plans. But maybe I can
 help you along the way: have a look at the Classifier4J project on
 SourceForge.

 http://classifier4j.sourceforge.net/

 It has a small document summarizer besides a Bayes classifier. It might
 speed up your coding.

 On the level of Lucene, I have no idea. My gut feeling says that a
 summary should be built before the text is tokenized! The tokenizer can
 of course be used when analysing a document, but hooking into the Lucene
 indexing is a bad idea, I think.
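
A sketch of that "summarize before tokenizing" order with Classifier4J
(package and method names as I remember them from the Classifier4J docs,
so please verify against your version; rawText stands for the untokenized
input):

import net.sf.classifier4J.summariser.ISummariser;
import net.sf.classifier4J.summariser.SimpleSummariser;

// Summarize the raw text BEFORE any Lucene analysis touches it...
ISummariser summariser = new SimpleSummariser();
String summary = summariser.summarise(rawText, 3);  // 3-sentence summary

// ...then hand the raw text to the indexer and keep the summary
// in its own field, as discussed above.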

 Does anyone else have ideas?

 regards,

 Maurits




 - Original Message -
 From: Gregor Heinrich [EMAIL PROTECTED]
 To: 'Lucene Users List' [EMAIL PROTECTED]
 Sent: Monday, December 15, 2003 7:41 PM
 Subject: Summarization; sentence-level and document-level filters.


  Hi,
 
  is there any possibility to do sentence-level or document-level analysis
  with the current Analysis/TokenStream architecture? Or where else is the
  best place to plug in customised document-level and sentence-level
  analysis features? Is there a precedent for this?
 
  My technical problem:
 
  I'd like to include a summarization feature in my system, which should
  (1) make the best use of the architecture already there in Lucene, and
  (2) be able to trigger summarization on a per-document basis while
  requiring sentence-level information, such as full stops and commas. To
  preserve this punctuation, a special Tokenizer can be used that outputs
  such "landmarks" as tokens instead of filtering them out. The actual
  SummaryFilter then filters out the punctuation for its successors in the
  Analyzer's filter chain.
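
A sketch of such a SummaryFilter against the TokenStream API of that era
(the punctuation test is deliberately crude, and the summarizer call-back
is a hypothetical hook, not part of Lucene):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class SummaryFilter extends TokenFilter {

    public SummaryFilter(TokenStream in) {
        input = in;  // TokenFilter's protected input field
    }

    public Token next() throws IOException {
        Token t;
        while ((t = input.next()) != null) {
            String text = t.termText();
            if (text.equals(".") || text.equals(",")) {
                // A "landmark" from the Tokenizer: note it for the
                // summarizer, but hide it from the rest of the chain.
                // summarizer.addBoundary(text);  // hypothetical hook
                continue;
            }
            return t;  // ordinary tokens pass through unchanged
        }
        return null;   // end of stream
    }
}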
 
  The other, more complex thing is the document-level information: as
  Lucene's architecture uses a filter concept that does not know about the
  document the tokens are generated from (which is good abstraction), a
  document-specific operation like summarization is a bit awkward here
  (and originally not intended, I guess). On the other hand, I'd like to
  keep the existing filter structure in place for preprocessing of the
  input, because my raw texts are generated by converters from other
  formats that output unwanted chars (from figures, page numbers, etc.),
  which are filtered out anyway by my custom Analyzer.
 
  Any idea how to solve this second problem? Is there any support for such
  document / sentence structure analysis planned?
 
  Thanks and regards,
 
  Gregor
 
 
 



Re: Lucene and Mysql

2003-12-17 Thread Jeff Linwood
Hi,

You should create a Lucene Document for each record in your table.  Make
each of the columns that contains text a field on the Document object.  Also
store the primary key of the record as a field.
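
A bare-bones sketch of that (table and column names are made up; Lucene
1.3-era Field API; error handling omitted):

import java.sql.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);

Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery("SELECT id, title, body FROM articles");

while (rs.next()) {
    Document doc = new Document();
    doc.add(Field.Keyword("id", rs.getString("id")));    // primary key: stored, not tokenized
    doc.add(Field.Text("title", rs.getString("title"))); // text columns: tokenized and stored
    doc.add(Field.Text("body", rs.getString("body")));
    writer.addDocument(doc);
}

rs.close();
stmt.close();
conn.close();
writer.optimize();
writer.close();

At search time, read the stored id field from each hit and fetch the full
record from MySQL.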

Here's a very basic article I wrote about using Lucene:

http://builder.com.com/5100-6389-5054799.html

Jeff
- Original Message - 
From: Stefan Trcko [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 2:30 PM
Subject: Lucene and Mysql


Hello

I'm new to Lucene. I want users to be able to search text which is stored
in a MySQL database.
Is there any tutorial on how to implement this kind of search feature?

Best regards,
Stefan





Displaying Query

2003-12-17 Thread Gayo Diallo
Hi all,

I use this code:

Query query = QueryParser.parse(q, "Contenu", new Analyseur());
String larequet = query.toString();

System.out.println("la requête à traiter est: " + larequet);

And this is the line displayed: [EMAIL PROTECTED]

I don't know why my query string isn't displayed correctly. Can someone
help me?

Best regards,

Gayo


RE: Displaying Query

2003-12-17 Thread Tate Avery

Try:

String larequet = query.toString("default field name here");

Example:

String larequet = query.toString("texte");

That should give the string version of the query. (The no-argument
toString() you called apparently falls back to the default Object output,
a class name and hash code, which is what you are seeing.)





Indexing Speed: Documents vs. Sentences

2003-12-17 Thread Jochen Frey
Hi,

I am using Lucene to index a large number of web pages (a few 100GB) and the
indexing speed is great.

Lately I have been trying to index on a sentence level, not the document
level. My problem is that the indexing speed has gone down dramatically and
I am wondering if there is any way for me to improve on that.

When indexing on the sentence level, the overall amount of data stays the
same while the number of records increases substantially (since there are
usually many sentences in one web page).

It seems to me like the indexing speed (everything else being the same)
depends largely on the number of Documents inserted into the index, and not
so much on the size of the data within the documents (correct?).

I have played with the merge factor, using a RAMDirectory, etc., and I am
quite comfortable with our overall configuration, so my guess is that that
is not the issue (and I am QUITE happy with the indexing speed as long as I
use complete pages and not sentences).

Maybe there is a different way of attacking this? My goal is to be able to
execute a query and get the sentences that match the query in the most
efficient way while maintaining good/great indexing speed. I would prefer
not having to search the complete document for the sentence in question.

My current solution is to have one Lucene Document for each page (containing
the URL and other information I require) that does NOT contain the text of
the page. Then I have one Lucene Document for each sentence within that
document, which contains the text of this particular sentence in addition to
some identifying information that references the entry of the page itself.
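
For the record, a sketch of that two-level scheme (splitSentences() is a
placeholder for whatever sentence splitter you use; field names are
illustrative; Lucene 1.3-era Field API):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

void indexPage(IndexWriter writer, String pageId, String url, String text)
        throws IOException {
    // Page-level record: metadata only, no body text.
    Document page = new Document();
    page.add(Field.Keyword("pageId", pageId));
    page.add(Field.Keyword("url", url));
    writer.addDocument(page);

    // One Document per sentence, each pointing back at its page.
    String[] sentences = splitSentences(text);  // hypothetical helper
    for (int i = 0; i < sentences.length; i++) {
        Document s = new Document();
        s.add(Field.Keyword("pageId", pageId));           // back-reference
        s.add(Field.UnIndexed("pos", String.valueOf(i))); // sentence order, stored only
        s.add(Field.Text("contents", sentences[i]));      // the sentence text itself
        writer.addDocument(s);
    }
}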

Any and all suggestions are welcome.

Thanks!
Jochen





RE: Indexing Speed: Documents vs. Sentences

2003-12-17 Thread Dan Quaroni
I'm confused about something - what's the point of creating a document for
every sentence?




RE: Indexing Speed: Documents vs. Sentences

2003-12-17 Thread Jochen Frey
Hi!

In essence:
1) I don't care about the whole page.

2) I only care about the actual sentence that matches the query.

3) I want the matching for the query to happen only within one sentence and
not across sentence boundaries (even when I do a PhraseQuery with some slop).

The query: "i like the beach"~20
should not match: "And we go to the restaurant and i really like it. the
beach was wonderful as well."

4) I would much prefer not to parse the actual page at search time to find
the sentence that matches the query (though I obviously will, if I have
to). A sentence-splitting sketch follows below.
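
For point 4, a sketch of doing the splitting once at index time with the
JDK's own BreakIterator, so each sentence can become its own Document
(the locale and the trimming are simplifications):

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

static String[] splitSentences(String text) {
    List sentences = new ArrayList();  // pre-1.5 style, no generics
    BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
    it.setText(text);
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
        String s = text.substring(start, end).trim();
        if (s.length() > 0) {
            sentences.add(s);
        }
    }
    return (String[]) sentences.toArray(new String[sentences.size()]);
}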

Does that answer your question?

Thanks!
Jochen




RE: Indexing Speed: Documents vs. Sentences

2003-12-17 Thread Dan Quaroni
When you parse the page, you can prevent sentence-boundary hits from
matching your criteria.
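
A sketch of that check: after a hit on a page-level index, accept the page
only if a single sentence contains all the query terms (a crude
case-insensitive substring test; a proper version would compare analyzed
tokens):

import java.text.BreakIterator;
import java.util.Locale;

static boolean matchesWithinOneSentence(String pageText, String[] terms) {
    BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
    it.setText(pageText);
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
        String sentence = pageText.substring(start, end).toLowerCase();
        boolean all = true;
        for (int i = 0; i < terms.length; i++) {
            if (sentence.indexOf(terms[i].toLowerCase()) < 0) {
                all = false;  // this sentence is missing a term
                break;
            }
        }
        if (all) {
            return true;  // every term found inside one sentence
        }
    }
    return false;
}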

-Original Message-
From: Jochen Frey [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 17, 2003 4:34 PM
To: 'Lucene Users List'
Subject: RE: Indexing Speed: Documents vs. Sentences


Right.

However, even if I do that, my problem #3 from my earlier message remains
unsolved: I do not wish to match phrases across sentence boundaries.

Anyone have a neat solution (or pointers to one)?

Thanks again!
Jochen

 -Original Message-
 From: Dan Quaroni [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, December 17, 2003 1:29 PM
 To: 'Lucene Users List'
 Subject: RE: Indexing Speed: Documents vs. Sentences
 
 Yeah.  I'd suggest parsing the page, unfortunately. :)
 



RE: Indexing Speed: Documents vs. Sentences

2003-12-17 Thread Jochen Frey
Dan, I will send you a separate e-mail directly to your address.

In the meantime, I hope to get input from other people. Maybe someone else
knows how to solve my original problem.

Thanks!
Jochen
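
One trick sometimes used for problem #3, indexing whole pages but still
confining sloppy phrase matches to a sentence, is to join the sentences
with a run of "barrier" tokens longer than any slop you allow, so the
positional distance across a boundary always exceeds the slop. A sketch
(the marker token and the gap size are arbitrary choices, and note the
markers do end up in the index):

// Join sentences with MAX_SLOP+1 barrier tokens so that a PhraseQuery
// with slop <= MAX_SLOP can never match across a sentence boundary.
static final int MAX_SLOP = 20;

static String joinWithBarriers(String[] sentences) {
    StringBuffer buf = new StringBuffer();
    for (int i = 0; i < sentences.length; i++) {
        if (i > 0) {
            for (int j = 0; j <= MAX_SLOP; j++) {
                buf.append(" zzboundaryzz");  // made-up marker; must survive the analyzer
            }
        }
        if (buf.length() > 0) {
            buf.append(' ');
        }
        buf.append(sentences[i]);
    }
    return buf.toString();
}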
