Doug Cutting wrote:
Doug, do you believe the storing (as an option of course) of token
offset information would be something that you'd accept as a
contribution to the core of lucene? Does anyone else think that this
would be beneficial information to have?
I have mixed feelings about this.
--- [EMAIL PROTECTED] wrote:
> Doug,
> nice suggestion about capping the highlighter's number of tokens -
> I'll add that in.
Or just add hooks that let the caller specify that the highlighter
should return as soon as it finds something satisfactory (not sure how
that could be defined... maybe un
>>Another approach that someone mentioned for solving this problem is to create a
>>fragment index for long documents.
Alternatively, could you use term sequence positions to guess where to start
extracting text from the doc?
If you have identified the best section of the doc based purely on ide
To: "Lucene Developers List" <[EMAIL PROTECTED]>
Sent: Wednesday, February 25, 2004 1:04 PM
Subject: Re: Dmitry's Term Vector stuff, plus some
> [EMAIL PROTECTED] wrote:
> > I'm not sure what applications people have in mind for Term Vectors
[EMAIL PROTECTED] wrote:
nice suggestion about capping the highlighter's number of tokens - I'll add that in.
I agree, good suggestion.
I've had a quick look at your knowledgebase docs. Can't you split them at index time into multiple
smaller docs using the tags as doc boundaries?
Each lucene do
Doug,
nice suggestion about capping the highlighter's number of tokens - I'll add that in.
Bruce,
I've had a quick look at your knowledgebase docs. Can't you split them at index time
into multiple smaller docs using the tags as doc boundaries?
Each lucene document could then have a field with th
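The split-at-index-time idea can be sketched without any Lucene classes. A hypothetical helper like the one below (the class name, method name, and tag are my own illustrations, not from the thread) carves one large knowledgebase document into chunks at tag boundaries; each chunk could then be indexed as its own Lucene Document:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: split one large knowledgebase document into
 *  smaller chunks at a tag boundary, so each chunk can become its own
 *  Lucene Document. The tag string is an assumption for illustration. */
public class DocSplitter {
    public static List<String> splitOnTag(String text, String tag) {
        List<String> chunks = new ArrayList<String>();
        int start = 0;
        int next;
        // Each occurrence of the tag (after the first character) starts a new chunk.
        while ((next = text.indexOf(tag, start + 1)) != -1) {
            chunks.add(text.substring(start, next));
            start = next;
        }
        chunks.add(text.substring(start));
        return chunks;
    }
}
```

Any text before the first tag simply lands in the first chunk, which matches the "use the tags as doc boundaries" idea above.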
[EMAIL PROTECTED] wrote:
The conclusion?
* Reading text from the index is very quick (6 meg of document text in 10ms!)
* Tokenizing the text is much slower, but this is only noticeable if you're processing a LOT of text.
(My docs were an average of 1k in size so only took 5ms each)
* The time taken
[EMAIL PROTECTED] wrote:
Bruce,
Could a short-term (and possibly compromised) solution to your performance problem be to offer only the first 3k of these large 200k docs to
the highlighter in order to minimize the amount of tokenization required? Arguably the most relevant bit of a document is ty
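A minimal version of that cap, kept separate from Lucene itself, might look like this (the class name, method name, and the 3072-character figure are illustrative of the "first 3k" suggestion, not code from the thread):

```java
/** Hypothetical sketch: cap the text handed to the highlighter at
 *  maxChars, cutting back to the last space so a token is never split
 *  in half. A caller might pass maxChars = 3072 for the "first 3k" idea. */
public class HighlightCap {
    public static String firstChars(String text, int maxChars) {
        if (text.length() <= maxChars) {
            return text;
        }
        int cut = text.lastIndexOf(' ', maxChars);
        if (cut <= 0) {
            cut = maxChars; // no space found: hard cut
        }
        return text.substring(0, cut);
    }
}
```

Only the returned prefix would then be tokenized and highlighted, which is where the CPU time goes according to the stats above.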
I've just run some stats on the overhead of tokenizing text/highlighting.
It looks like it's the tokenizing that's the main problem, and it is CPU bound.
I ran three tests, all on the same index/machine: Pentium 3 800MHz, 360MB index,
Lucene 1.3 final, JDK 1.4.1, Porter-stemmer-based analyser.
For
Doug Cutting wrote:
I'm not sure what applications people have in mind for Term Vector
support but I would prefer to have the original text positions (not
term sequence positions) stored so I can offer this:
1) Significant terms/phrases identification
Like "Gigabits" on gigablast.com - used to o
This is provided by the Token.startOffset() and Token.endOffset() at indexing time, I
think.
I don't know if this is accessible at run time. A good place to see what is stored in
the files is the File Formats section located at
http://jakarta.apache.org/lucene/docs/fileformats.html. (Get the
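For illustration, here is a toy standalone tokenizer (not Lucene code; all names are my own) that records the same start/end offsets that Token.startOffset() and Token.endOffset() carry at analysis time:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch (not Lucene code): a whitespace tokenizer that records the
 *  original-text offsets for each token, i.e. the information Lucene's
 *  Token.startOffset()/endOffset() expose while analyzing. */
public class OffsetTokenizer {
    public static class Tok {
        public final String term;
        public final int start; // inclusive offset into the original text
        public final int end;   // exclusive offset
        Tok(String term, int start, int end) {
            this.term = term; this.start = start; this.end = end;
        }
    }

    public static List<Tok> tokenize(String text) {
        List<Tok> toks = new ArrayList<Tok>();
        int i = 0;
        while (i < text.length()) {
            // skip whitespace between tokens
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            // consume one token
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) toks.add(new Tok(text.substring(start, i), start, i));
        }
        return toks;
    }
}
```

Storing these offsets in the index is exactly what would let a highlighter jump straight to the matching region of the original text without re-tokenizing.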
Grant Ingersoll wrote:
It is the location of the token in the document (see IndexReader.termPositions()).
This information is already being stored in other parts of the index, it just isn't very efficient to get at it.
Ok, that wasn't the answer I was hoping for :) I was hoping that the positi
It is the location of the token in the document (see IndexReader.termPositions()).
This information is already being stored in other parts of the index, it just isn't
very efficient to get at it.
I think it would be useful to add to the IndexReader a way to get a list of positions
given a te
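As a rough sketch of what "a list of positions given a term" means: in Lucene 1.x that lookup goes through IndexReader.termPositions(Term), but the data shape can be illustrated with a standalone in-memory map from term to the token-sequence positions at which it occurred (class and method names here are my own):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified, Lucene-free sketch of a term -> positions lookup:
 *  for each term, the list of token-sequence positions where it occurs.
 *  The real equivalent in Lucene 1.x is IndexReader.termPositions(Term). */
public class TermPositionsSketch {
    public static Map<String, List<Integer>> positions(String[] tokens) {
        Map<String, List<Integer>> index = new HashMap<String, List<Integer>>();
        for (int pos = 0; pos < tokens.length; pos++) {
            List<Integer> list = index.get(tokens[pos]);
            if (list == null) {
                list = new ArrayList<Integer>();
                index.put(tokens[pos], list);
            }
            list.add(pos); // record the token-sequence position, not a byte offset
        }
        return index;
    }
}
```

Note these are term sequence positions, which is exactly the distinction drawn earlier in the thread against original-text offsets.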
Doug Cutting wrote:
Grant Ingersoll wrote:
Do you see any reason to write position information at all for the
term vectors?
It could be useful to some folks. If, for example, you only want to
expand a query with terms that occur near query terms, like automatic
phrase identification. In ge
I am going to leave them off for now.
>>> [EMAIL PROTECTED] 02/17/04 04:03PM >>>
Grant Ingersoll wrote:
> Do you see any reason to write position information at all for the term vectors?
It could be useful to some folks. If, for example, you only want to
expand a query with terms that occur nea
Do you see any reason to write position information at all for the term vectors?
Isn't this information stored elsewhere? Seems like I could drop that piece all
together...
Grant Ingersoll wrote:
I agree with your assessment about getting it right the first time. I can make the changes, as I don't think they are that involved, and it will benefit me and my employer in the long run if the changes are committed, since we won't have to reapply the patches every time there is
Doug,
I agree with your assessment about getting it right the first time. I can make the
changes, as I don't think they are that involved and it will benefit me and my
employer in the long run if the changes are committed since we won't have to reapply the
patches every time there is a new releas
Grant Ingersoll wrote:
Was wondering if you consider your comments on the Term vector stuff to be a show stopper or not? There hasn't been much response to your questions, so I wanted to bring it up again, as I do not want to see this go the way of the last attempt.
I proposed three changes:
1.
Hey Doug,
Was wondering if you consider your comments on the Term vector stuff to be a show
stopper or not? There hasn't been much response to your questions, so I wanted to
bring it up again, as I do not want to see this go the way of the last attempt.
If there is anything I can do to help w/
ext [PATCH] so the committers know there is a
> patch available for that bug.
>
> Looking forward to seeing your contribution,
> Eric
>
> -Original Message-
> From: Doug Cutting [mailto:[EMAIL PROTECTED]
> Sent: Thursday, February 05, 2004 5:33 PM
> To: Luce
The best way to generate a patch is to connect to the root of your
Lucene CVS checkout, and run:
cvs diff -Nu > my.patch
This will include all newly added files.
Hmm. Perhaps that requires that you're a developer. If it does, then
simply tar up the new files separately from the patch file.