Re: Dmitry's Term Vector stuff, plus some

2004-02-27 Thread Bruce Ritchie
Doug Cutting wrote: Doug, do you believe the storing (as an option of course) of token offset information would be something that you'd accept as a contribution to the core of Lucene? Does anyone else think that this would be beneficial information to have? I have mixed feelings about this.

Re: Dmitry's Term Vector stuff, plus some

2004-02-26 Thread Otis Gospodnetic
--- [EMAIL PROTECTED] wrote: > Doug, > nice suggestion about capping the highlighter's number of tokens - > I'll add that in. Or just add hooks that let the caller specify that the highlighter should return as soon as it finds something satisfactory (not sure how that could be defined... maybe un

Re: Dmitry's Term Vector stuff, plus some

2004-02-26 Thread Doug Cutting
Bruce Ritchie wrote: Doug, do you believe the storing (as an option of course) of token offset information would be something that you'd accept as a contribution to the core of Lucene? Does anyone else think that this would be beneficial information to have? I have mixed feelings about this. A

Re: Dmitry's Term Vector stuff, plus some

2004-02-26 Thread markharw00d
>> Another approach that someone mentioned for solving this problem is to create a fragment index for long documents. Alternatively, could you use term sequence positions to guess where to start extracting text from the doc? If you have identified the best section of the doc based purely on ide

Re: Dmitry's Term Vector stuff, plus some

2004-02-25 Thread Paolo Spadafora
[EMAIL PROTECTED] wrote: > I'm not sure what applications people have in mind for Term Vector s

Re: Dmitry's Term Vector stuff, plus some

2004-02-25 Thread Bruce Ritchie
[EMAIL PROTECTED] wrote: nice suggestion about capping the highlighter's number of tokens - I'll add that in. I agree, good suggestion. I've had a quick look at your knowledgebase docs. Can't you split them at index time into multiple smaller docs using the tags as doc boundaries? Each lucene do

Re: Dmitry's Term Vector stuff, plus some

2004-02-25 Thread markharw00d
Doug, nice suggestion about capping the highlighter's number of tokens - I'll add that in. Bruce, I've had a quick look at your knowledgebase docs. Can't you split them at index time into multiple smaller docs using the tags as doc boundaries? Each lucene document could then have a field with th
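The index-time splitting suggested above could look roughly like the following. This is only a sketch under stated assumptions: the boundary marker `<section>` and the class and method names are hypothetical (the thread does not say which tags the knowledgebase docs use), and no Lucene classes are involved. Each returned piece would then be indexed as its own small document.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

    /**
     * Splits one large text into smaller pieces at every occurrence of marker.
     * Empty pieces (for example before a leading marker) are dropped.
     * The marker itself is a hypothetical stand-in for whatever tag the docs use.
     */
    static List<String> split(String text, String marker) {
        if (marker.isEmpty()) {
            throw new IllegalArgumentException("marker must not be empty");
        }
        List<String> parts = new ArrayList<>();
        int from = 0;
        while (true) {
            int at = text.indexOf(marker, from);
            String piece = (at < 0 ? text.substring(from) : text.substring(from, at)).trim();
            if (!piece.isEmpty()) {
                parts.add(piece);
            }
            if (at < 0) {
                break; // no further marker: the tail has been handled
            }
            from = at + marker.length();
        }
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical input: two sections separated by an assumed <section> tag.
        for (String part : split("<section>Install steps<section>Known issues", "<section>")) {
            System.out.println(part);
        }
    }
}
```

The highlighter would then only ever tokenize one small piece per hit instead of the whole 200k document.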

Re: Dmitry's Term Vector stuff, plus some

2004-02-25 Thread Bruce Ritchie
[EMAIL PROTECTED] wrote: The conclusion? * Reading text from the index is very quick (6 meg of document text in 10ms!) * Tokenizing the text is much slower but this is only noticeable if you're processing a LOT of text. (My docs were an average of 1k in size so only took 5ms each) * The time taken

Re: Dmitry's Term Vector stuff, plus some

2004-02-25 Thread Doug Cutting
[EMAIL PROTECTED] wrote: Bruce, Could a short term (and possibly compromised) solution to your performance problem be to offer only the first 3k of these large 200k docs to the highlighter in order to minimize the amount of tokenization required? Arguably the most relevant bit of a document is ty
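A minimal sketch of the "offer only the first 3k" idea, assuming the document text is already available as a String. The class and method names are made up for illustration and are not part of the highlighter or of Lucene; the only point is that the cap is applied before any tokenization happens.

```java
public class HighlightCapSketch {

    /**
     * Returns at most maxChars characters from the start of text.
     * The cut is moved back to the nearest whitespace so that no word is
     * split; if there is no whitespace to fall back on, it is a hard cut.
     */
    static String head(String text, int maxChars) {
        if (text.length() <= maxChars) {
            return text;
        }
        int cut = maxChars;
        while (cut > 0 && !Character.isWhitespace(text.charAt(cut))) {
            cut--;
        }
        if (cut == 0) {
            cut = maxChars; // no usable whitespace found: hard cut
        }
        return text.substring(0, cut);
    }

    public static void main(String[] args) {
        String doc = "alpha beta gamma";
        // Only this prefix would be handed to the tokenizer/highlighter.
        System.out.println(head(doc, 8));
    }
}
```

With a cap of 3 * 1024 characters, a 200k document costs roughly the same to tokenize as a 3k one, at the price of never highlighting hits further down.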

Re: Dmitry's Term Vector stuff, plus some

2004-02-25 Thread markharw00d
I've just run some stats on the overhead of tokenizing text/highlighting. It looks like it's tokenizing that's the main problem and it is CPU bound. I ran three tests, all on the same index/machine: Pentium 3 800 MHz, 360 MB index, Lucene 1.3 final, JDK 1.4.1, Porter stemmer based analyser. For

Re: Dmitry's Term Vector stuff, plus some

2004-02-25 Thread Bruce Ritchie
Doug Cutting wrote: I'm not sure what applications people have in mind for Term Vector support but I would prefer to have the original text positions (not term sequence positions) stored so I can offer this: 1) Significant terms/phrases identification Like "Gigabits" on gigablast.com - used to o

Re: Dmitry's Term Vector stuff, plus some

2004-02-25 Thread Doug Cutting
[EMAIL PROTECTED] wrote: I'm not sure what applications people have in mind for Term Vector support but I would prefer to have the original text positions (not term sequence positions) stored so I can offer this: 1) Significant terms/phrases identification Like "Gigabits" on gigablast.com - used

Re: Dmitry's Term Vector stuff, plus some

2004-02-24 Thread markharw00d
I'm not sure what applications people have in mind for Term Vector support but I would prefer to have the original text positions (not term sequence positions) stored so I can offer this: 1) Significant terms/phrases identification Like "Gigabits" on gigablast.com - used to offer choices of (uns

Re: Dmitry's Term Vector stuff, plus some

2004-02-24 Thread Grant Ingersoll
This is provided by the Token.startOffset() and Token.endOffset() at indexing time, I think. I don't know if this is accessible at run time. A good place to see what is stored in the files is the File Formats section located at http://jakarta.apache.org/lucene/docs/fileformats.html. (Get the
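To make the distinction in this thread concrete, here is a small self-contained sketch of what "original text positions" (character offsets) look like next to term sequence positions. It deliberately does not use Lucene's classes: `Tok` and `tokenize` are hypothetical stand-ins, and the tokenization rule (runs of letters or digits, lowercased) is an assumption chosen for simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class OffsetSketch {

    /** Hypothetical token: term text, character offsets [start, end), and sequence position. */
    static final class Tok {
        final String term;
        final int start;
        final int end;
        final int position;

        Tok(String term, int start, int end, int position) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.position = position;
        }
    }

    /** Splits text into runs of letters/digits, recording where each run sits in the original text. */
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        int pos = 0;
        int n = text.length();
        while (i < n) {
            // Skip separators.
            while (i < n && !Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (i >= n) {
                break;
            }
            int start = i;
            // Consume one term.
            while (i < n && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            out.add(new Tok(text.substring(start, i).toLowerCase(Locale.ROOT), start, i, pos++));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : tokenize("Term vectors, in Lucene")) {
            System.out.println(t.term + " chars=[" + t.start + "," + t.end + ") pos=" + t.position);
        }
    }
}
```

A highlighter needs the character offsets to cut a fragment out of the stored text, whereas the sequence position alone only says that a term is, say, the fourth token.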

Re: Dmitry's Term Vector stuff, plus some

2004-02-24 Thread Bruce Ritchie
Grant Ingersoll wrote: It is the location of the token in the document (see IndexReader.termPositions()). This information is already being stored in other parts of the index, it just isn't very efficient to get at it. Ok, that wasn't the answer I was hoping for :) I was hoping that the positi

Re: Dmitry's Term Vector stuff, plus some

2004-02-24 Thread Grant Ingersoll
It is the location of the token in the document (see IndexReader.termPositions()). This information is already being stored in other parts of the index, it just isn't very efficient to get at it. I think it would be useful to add to the IndexReader a way to get a list of positions given a te

Re: Dmitry's Term Vector stuff, plus some

2004-02-24 Thread Bruce Ritchie
Doug Cutting wrote: Grant Ingersoll wrote: Do you see any reason to write position information at all for the term vectors? It could be useful to some folks. If, for example, you only want to expand a query with terms that occur near query terms, like automatic phrase identification. In ge

Re: Dmitry's Term Vector stuff, plus some

2004-02-17 Thread Grant Ingersoll
I am going to leave them off for now. >>> [EMAIL PROTECTED] 02/17/04 04:03PM >>> Grant Ingersoll wrote: > Do you see any reason to write position information at all for the term vectors? It could be useful to some folks. If, for example, you only want to expand a query with terms that occur nea

Re: Dmitry's Term Vector stuff, plus some

2004-02-17 Thread Doug Cutting
Grant Ingersoll wrote: Do you see any reason to write position information at all for the term vectors? It could be useful to some folks. If, for example, you only want to expand a query with terms that occur near query terms, like automatic phrase identification. In general, the vector stuff i

Re: Dmitry's Term Vector stuff, plus some

2004-02-17 Thread Grant Ingersoll
Do you see any reason to write position information at all for the term vectors? Isn't this information stored elsewhere? Seems like I could drop that piece altogether...

Re: Dmitry's Term Vector stuff, plus some

2004-02-17 Thread Doug Cutting
Grant Ingersoll wrote: I agree with your assessment about getting it right the first time. I can make the changes, as I don't think they are that involved and it will benefit me and my employer in the long run if the changes are committed since we won't have to reapply the patches every time there is

Re: Dmitry's Term Vector stuff, plus some

2004-02-17 Thread Grant Ingersoll
Doug, I agree with your assessment about getting it right the first time. I can make the changes, as I don't think they are that involved and it will benefit me and my employer in the long run if the changes are committed since we won't have to reapply the patches every time there is a new releas

Re: Dmitry's Term Vector stuff, plus some

2004-02-17 Thread Doug Cutting
Grant Ingersoll wrote: Was wondering if you consider your comments on the Term vector stuff to be a show stopper or not? There hasn't been much response to your questions, so I wanted to bring it up again, as I do not want to see this go the way of the last attempt. I proposed three changes: 1.

RE: Dmitry's Term Vector stuff, plus some

2004-02-15 Thread Grant Ingersoll
Hey Doug, Was wondering if you consider your comments on the Term vector stuff to be a show stopper or not? There hasn't been much response to your questions, so I wanted to bring it up again, as I do not want to see this go the way of the last attempt. If there is anything I can do to help w/

RE: Dmitry's Term Vector stuff, plus some

2004-02-10 Thread Grant Ingersoll
ext [PATCH] so the committers know there is a > patch available for that bug. > > Looking forward to seeing your contribution, > Eric > > -Original Message- > From: Doug Cutting [mailto:[EMAIL PROTECTED] > Sent: Thursday, February 05, 2004 5:33 PM > To: Luce

RE: Dmitry's Term Vector stuff, plus some

2004-02-10 Thread Karl Koch
ug Cutting [mailto:[EMAIL PROTECTED] > Sent: Thursday, February 05, 2004 5:33 PM > To: Lucene Developers List > Subject: Re: Dmitry's Term Vector stuff, plus some > > > The best way to generate a patch is to connect to the root of your > Lucene CVS checkout, and run:

RE: Dmitry's Term Vector stuff, plus some

2004-02-06 Thread Grant Ingersoll
lable for that bug. Looking forward to seeing your contribution, Eric -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Thursday, February 05, 2004 5:33 PM To: Lucene Developers List Subject: Re: Dmitry's Term Vector stuff, plus some The best way to generate a p

RE: Dmitry's Term Vector stuff, plus some

2004-02-06 Thread Eric Isakson
CH] so the committers know there is a patch available for that bug. Looking forward to seeing your contribution, Eric -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Thursday, February 05, 2004 5:33 PM To: Lucene Developers List Subject: Re: Dmitry's Term Vector s

Re: Dmitry's Term Vector stuff, plus some

2004-02-05 Thread Doug Cutting
The best way to generate a patch is to connect to the root of your Lucene CVS checkout, and run: cvs diff -Nu > my.patch This will include all newly added files. Hmm. Perhaps that requires that you're a developer. If it does, then simply tar up the new files separately from the patch file.