bug 323 community, where all
x87 floating point errors in gcc come to die!
Marvin Humphrey
anges.
I'm a little confused. What do you mean by a "full to-hardware flush" and how
is that different from the sync()/fsync() calls that Lucene makes by default
on each IndexWriter commit()?
Marvin Humphrey
ecomes a more generalized notion, with the
TF/IDF-specific functionality moving into a subclass.
Maybe something similar could be made to work in Lucene. Dunno how McCandless
has things set up for spec'ing codecs on the flex branch.
Marvin Humphrey
andard posting formats that Lucene offers. But the point of
flex is to provide an extension framework, I thought.
Well, whatever. It's just another place where Lucy and Lucene will part ways.
Marvin Humphrey
a flat position space.
A generic aggregator wouldn't know that it needed to do that. The postings
codec developer would be forced to write aggregation code in addition to
segment-level code.
Marvin Humphrey
private
> attributes. The attrs pass through Multi*Enum.
Hmm. Does that mean that the consumer needs to refresh the attributes with
each iteration? Because what happens when you switch sub-enums within the
Multi*Enum? Don't those attributes go stale, as they belong to a sub-enum
that
, now and
forever, one per method call.
> Still torn... I think it's convenience vs performance.
But convenience for the posting format plugin developer matters too, right?
Are you confident that a generic aggregator can support all possible codecs,
or wil
ngs across multiple
segments, so IMO it's best to shunt users like Renaud towards the segment
level.
Marvin Humphrey
by
> the OR operator, so that, never mind whether a record contains a or
> contains b (both of which supposedly are required), so long as it contains
> c, it's a hit?
IMO, that's the only sensible way to handle unary operators. If they were
global rather than nested, what would this
sed to contribute to a score, interleaving is also easy,
because you just compare scores and the higher score wins.
It's only important if you wanted to, say, sort by distance.
Marvin Humphrey
/75ac07b7e2d6160d/pluggable_indexreader_was_real_time_updates
Yes. That post is from last spring; since then, a prototype of the Lucy
pluggable IndexReader design has been implemented in KinoSearch, so you could
write such a component today.
On Thu, Dec 17, 2009 at 09:33:11AM -0800, Marvin Humphrey wrote:
> On Thu, Dec 17, 2009 at 05:03:11PM +0100, Toke Eskildsen wrote:
>
> > A third alternative would be to count the number of unique datetime-values
> > and make a compressed representation, but that would make the c
we're sorting string data rather than date time values (so the
memory occupied by all those unique values is significant). What algorithms
would you consider for performing the uniquing?
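To make the question concrete, here's roughly what I mean by "uniquing"
(just a sketch; the class and variable names are invented):

    import java.util.Arrays;
    import java.util.TreeSet;

    // Collapse one-string-per-doc field values into a sorted unique list
    // plus an ordinal per document, which is the shape a sort cache wants.
    public class UniquingSketch {
        public static int[] buildOrds(String[] values) {
            TreeSet<String> uniqueSet = new TreeSet<String>();
            for (String v : values) {
                if (v != null) uniqueSet.add(v);
            }
            String[] unique = uniqueSet.toArray(new String[uniqueSet.size()]);

            int[] ords = new int[values.length];
            for (int doc = 0; doc < values.length; doc++) {
                ords[doc] = (values[doc] == null)
                    ? -1
                    : Arrays.binarySearch(unique, values[doc]);
            }
            return ords;
        }
    }

The interesting part is doing better than the TreeSet once the number of
unique values gets large, which is what I'm fishing for.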
Marvin Humphrey
ver accepted patches from anyone, though -- since then they have to
get permission from all contributors for relicensing.
Marvin Humphrey
the IP in the interface is entirely original, then the copyright holder for
the original library has no claim on it and can't stop the interface author
from doing whatever they want with their own material... right?
Marvin Humphrey
re is that Lucene gets to use multiple threads within
one process, while Lucy has to at least be capable of using a multiple-process
concurrency model in order to support real-time search for non-threaded hosts.
Marvin Humphrey
uses it. So "it depends" I guess?
For the purposes of MergePolicy, all you would need are the doc counts and the
delcounts, and optionally other stuff in SegmentInfos. In theory you could
lazy load the other stuff like the term dictionary index. Obviously that
would be an unacce
ill plenty fast.
Actually, if you're not warming sort caches, launching a Lucene IndexReader
isn't obscenely expensive any more -- just expensive. Right?
Marvin Humphrey
[1] At least on Unixen. I believe we can support all of this using Windows
MapViewOfFile and friends.
version checking
code and adhere to the spec, making it possible to (maybe) safely interpret
that data.
Marvin Humphrey
pen SegmentReaders for all segments.
Yeah, that's gonna be a bigger problem. :( It's cake to give Lucy's indexer
a reader, because opening readers is cheap. But the Lucene heavy-IndexReader
model messes that up -- IndexWriter has traditionally been a fast class to
open.
Marvin Humphrey
descendants? Does IndexWriter always
have an IndexReader at its disposal yet? And if the answer to those two
questions is yes, can you refactor MergePolicy to work off of an IndexReader
rather than a SegmentInfos?
Marvin Humphrey
e query object and want a string that QueryParser will
> reparse fairly exactly.
Query objects are serializable, so that they can be sent over a network in a
search cluster. Can you use the serialization facilities?
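Something like this is what I have in mind (a sketch; it assumes the
concrete Query classes involved implement java.io.Serializable, which I
believe the core ones do):

    import java.io.*;
    import org.apache.lucene.search.Query;

    // Round-trip a Query through plain Java serialization.
    public class QuerySerializationSketch {
        public static byte[] toBytes(Query query) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(query);
            out.close();
            return bytes.toByteArray();
        }

        public static Query fromBytes(byte[] data)
                throws IOException, ClassNotFoundException {
            ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(data));
            return (Query) in.readObject();
        }
    }

It won't give you a QueryParser-compatible string, but it does give you a
faithful round trip.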
Marvin Humphrey
On Wed, Mar 18, 2009 at 08:09:33AM +0100, Paul Libbrecht wrote:
> LSI is patented, so there hasn't been a flurry of implementation attempts.
Hasn't the original patent expired?
http://mail.python.org/pipermail/python-list/2007-July/621547.html
Marvin Humphrey
: after calling terms() it's already pointing at the first
term. So you
need to rewrite this as a do-while loop.
Possibly my least favourite feature of Lucene. :-(
What would a better API look like?
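For reference, the idiom in question, as I understand it (a sketch; the
field name is made up):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    // terms(Term) comes back already positioned on the first matching term,
    // hence the do-while instead of a plain while (next()).
    static void walkField(IndexReader reader, String field)
            throws java.io.IOException {
        TermEnum terms = reader.terms(new Term(field, ""));
        try {
            do {
                Term t = terms.term();
                if (t == null || !field.equals(t.field())) {
                    break;  // walked off the end of the field
                }
                // ... do something with t ...
            } while (terms.next());
        } finally {
            terms.close();
        }
    }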
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
e.NET, the latter-day Ferret (C/Ruby), and
KinoSearch (C/Perl) (and possibly others, I haven't reviewed them all)
all get close to the metal. None rely on hash-based method dispatch
for inner-loop code, and all manipulate primitive types directly.
Marvin Humphrey
Rectangular Research
t compatible with
Lucene, and it doesn't make sense to impede progress by imposing that
constraint. Better to finish it up, present a polished version and a
coherent explanation of the design, then have Mike McCandless riff off
of it -- that worked pretty well for indexing speed improve
t a secret!
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
nt field values for display and searching
* lazy loading
* arbitrary data
* arbitrary compression algo choice
* complete document recovery
* ...
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
our app in particular -- how do you handle identical XML tag
names that mean totally different things when nested inside different
elements?
Acme
Widget
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
product_id => 'acme-lt-1',
attr_weight => 6.3,
attr_heat_dissipation_factor => 20,
});
I'll need to make a few backend tweaks, but this API pretty much
solves the multi-dimensional data problem. :)
Thoughts?
Marvin Humphrey
p with
a FieldDef subclass that handles multi-dimensional data. I seem to
recall that Solr had something along those lines, using prefixed
field names or something. Do I recall correctly?
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
nsion until it is modified.
I guess my short answer is that I don't have an
opinion on adding another type-safe constant TOKENIZED_NO_NORMS
because I don't like the whole scheme.
KS 0.20 doesn't even have Document or Field classes. :) They've
been eliminated, and n
st as useful as those in a shorter document --
would lose. That wouldn't be helpful for a common case in naive web
search, where impossible-to-exclude navigational and advertising text
could end up diluting the scores of perfectly
.
http://www.mail-archive.com/java-dev@lucene.apache.org/msg04509.html
http://www.mail-archive.com/java-dev@lucene.apache.org/msg01704.html
Searching the mail archives for "lengthNorm" will turn up some more.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
languages such as Thai and Japanese, where "words"
are not separated by spaces and may consist of multiple characters?
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
his:
http://www.glue.umd.edu/~oard/courses/708a/fall01/838/P2/
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
org/texts/introduction.html>.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
; in the body should exclude doc 1.
However, since the title matches against 'a -foo' ("a" is present,
and "foo" is not), I believe you'll get a hit.
If you can solve that problem, and also return 1 hit for the query
string "a +foo", let me know!
oost assigned to a
title field, I've found that I can't get really good IR precision
without going back to a non-truncating tf() for title.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
een tags is more important than text
between tags and boost it. There's no good way to handle that right now.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
exact title matches.
The answer is to use Lucene's default lengthNorm for title and the modified
lengthNorm for bodytext. Wasn't sure how to do that in Lucene; now I know.
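For the archives, the kind of thing I mean (a sketch; the field names are
invented and the bodytext curve is only illustrative):

    import org.apache.lucene.search.DefaultSimilarity;

    // Default lengthNorm for "title", a flatter curve for "bodytext".
    public class PerFieldLengthNormSimilarity extends DefaultSimilarity {
        public float lengthNorm(String fieldName, int numTokens) {
            if ("bodytext".equals(fieldName)) {
                // Cap the length penalty for long body text.
                return (float) (1.0 / Math.sqrt(Math.min(numTokens, 1000)));
            }
            // "title" and everything else keep the stock 1/sqrt(numTokens).
            return super.lengthNorm(fieldName, numTokens);
        }
    }

Install it with IndexWriter.setSimilarity() before adding documents, since
the norms get baked in at index time.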
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Greets,
Is it possible to have an IndexWriter apply different Similarity
models to different Fields?
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
about building a larger BooleanQuery by combining the output of
the QueryParser with custom-built Query objects based on your metadata?
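Something along these lines (a sketch against the 2.x API; the field names
and the category value are placeholders, and exception handling is omitted):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // Parse the user's input as usual...
    Query userQuery =
        new QueryParser("content", new StandardAnalyzer()).parse(userInput);

    // ...then AND it together with a restriction built from your metadata.
    BooleanQuery combined = new BooleanQuery();
    combined.add(userQuery, BooleanClause.Occur.MUST);
    combined.add(new TermQuery(new Term("category", "books")),
                 BooleanClause.Occur.MUST);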
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
le to
retrieving all docs all the time.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
d examine the contents of a "category" field. That's a lot of
overhead. Is there another way?
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
both a Perl/KinoSearch and a Java Lucene version, and they will use
the Reuters corpus.
Where are you at with your app?
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
ML out onto the file
system. You'll find it at <http://www.rectangular.com/svn/kinosearch/trunk/t/benchmarks/extract_reuters.plx>.
Best,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
On Apr 6, 2006, at 4:23 PM, Daniel Noll wrote:
Marvin Humphrey wrote:
I wrote:
It looks like StopAnalyzer tokenizes by letter, and doesn't
handle apostrophes. So, the input "I don't know" produces these
tokens:
don
t
know
Is that right?
It's not
is a stopword, so the tokens are:
don
know
Phew, that's much more useful.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Greets,
It looks like StopAnalyzer tokenizes by letter, and doesn't handle
apostrophes. So, the input "I don't know" produces these tokens:
don
t
know
Is that right?
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
this subject perform a web search for
"proposal defaultsimilarity lengthnorm".
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
-seek time. Decompressing a Lucene term dictionary file
isn't *that* intense.
I hope you won't mind if I don't volunteer to do the actual coding or
data collection, though, as I have my hands full porting all of
Lucene. :)
nt for this; I would
normally expect a more common term to take longer, as there are more
docs to score. Anybody got an explanation handy?
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
would provide all the
useful benefits of setIndexInterval() at write-time. There's no
significant disk-space issue involved. The startup time to fill the
cache isn't worth worrying about, either.
Coincidentally, I'm porting this exact section of TermInfosReader
this mo
t to do that before asking for trouble by kludging
setIndexInterval into 1.4.3. The internals of TermInfosWriter are
quite complex.
Best,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
exInterval -- you're rewriting the entire index, but it's faster
than reindexing from scratch because you don't need to redo the IO or
the analysis. Few people will find it useful to tinker with this,
but you're the exception, and I'll be interested to hear about your
f
e, and the reader has to be able to adapt on the fly.
Best,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
I would rather have an application which used less memory and took
longer, than one which uses all the available RAM just to milk out
a bit of extra speed.
You have grasped the tradeoff precisely.
Best,
Marvin Humphrey
Rectangular Research
I could look further, and ask some French linguists for help.
I'm asking because a new version of my own search engine library has
a default tokenizer which keeps apostrophic strings together (like
StandardTokenizer), and I want to be aware of cases where this choice
causes problems. However, it's unlikely I'll change that behavior,
as the problem is addressed by making it trivially easy to customize
the tokenizer. So I would say that for my own purposes, consulting a
linguist is probably overkill.
Cheers,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
ver want to search for the ll
in you'll, or the O in O'Reilly, etc.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
list, so I'm sending my reply there (and cc'ing Jian).
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
erruns.
I am hoping that the answer to this will be a fix to the encoding
mechanism in Lucene so that it really does use legal UTF-8. The most
efficient way to go about this has not yet presented itself.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
ng the two-byte form.
    else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0) {
        // code point 0 takes the two-byte "Modified UTF-8" form here
        writeByte((byte)(0xC0 | (code >> 6)));
        writeByte((byte)(0x80 | (code & 0x3F)));
    }
http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
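For comparison, what the JDK itself does (quick sketch; exception handling
omitted):

    // Legal UTF-8 encodes U+0000 as the single byte 0x00...
    byte[] legal = "\u0000".getBytes("UTF-8");            // { 0x00 }

    // ...while DataOutput's "modified UTF-8" writes 0xC0 0x80, just like the
    // snippet above (after a 2-byte length prefix).
    java.io.ByteArrayOutputStream buf = new java.io.ByteArrayOutputStream();
    new java.io.DataOutputStream(buf).writeUTF("\u0000");
    // buf now holds: 0x00 0x02 0xC0 0x80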
Can someone please confirm that the intention is
he guts of Lucene's searching, but from a high level it looks
similar, so this might work...
Keep track of positions matched during phrase-matching. Use those
for highlighting terms which are part of the phrase match. Use post-
search analysis for highlighting anything that isn
On Aug 15, 2005, at 8:53 PM, Marvin Humphrey wrote:
Create a phrase query that when it encounters ab => { tokenlength
=> 2 } knows to look for something at position 3.
Fencepost error! That should have been "position 2".
Not that correcting the error makes the algo an
subwords is non-trivial.
Tag every term with its length in tokens. :)
Index at these positions.
Pos0: a ab abc abcd
Pos1: b bc bcd
Pos2: c cd
Pos3: d
Create a phrase query that when it encounters ab => { tokenlength =>
2 } knows to look for something at position 3.
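Spelled out as a loop, that position layout is just this (a sketch; how the
tokenlength tag rides along with each token is left open):

    // For "abcd", emit every substring at the position of its first char:
    //   Pos0: a ab abc abcd   Pos1: b bc bcd   Pos2: c cd   Pos3: d
    String word = "abcd";
    for (int start = 0; start < word.length(); start++) {
        for (int end = start + 1; end <= word.length(); end++) {
            String subword = word.substring(start, end);
            int tokenLength = end - start;   // the "tokenlength" tag
            // index subword at position `start`, tagged with tokenLength
        }
    }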
Marvin Humphrey
Rectangular Research
eas?
How about this?
1) Lowercase.
2) Convert non-alphanumeric characters to spaces.
3) Introduce a space at every boundary between a letter and a number.
4) Concatenate all 1-, 2-, 3-, .. n-term combinations and index them.
5) Don't stem.
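In code, steps 1 through 3 might look roughly like this (untested sketch;
step 4's combinations would then be built from the resulting tokens):

    // 1) lowercase, 2) punctuation to spaces, 3) split letter/digit boundaries
    String normalized = input.toLowerCase()
        .replaceAll("[^a-z0-9]+", " ")
        .replaceAll("(?<=[a-z])(?=[0-9])|(?<=[0-9])(?=[a-z])", " ")
        .trim();
    String[] tokens = normalized.split("\\s+");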
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
unstemmed form from the original text.
Wouldn't that fall down if you had two distinct terms which produce
the same string when stemmed?
Best,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Greetings,
Is it possible to have Lucene parse malformed queries? For instance,
is there a way to have this query...
art museums "new york city
... return results for ...
art museums "new york city"
... or is that just a parse error, end of story?
It's a DWIM* thing.
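One crude fallback, as a sketch (assumes a QueryParser with the static
escape() helper; it just neutralizes the special syntax on a retry rather
than guessing where the missing quote goes):

    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    static Query parseLeniently(QueryParser parser, String input)
            throws ParseException {
        try {
            return parser.parse(input);     // e.g.  art museums "new york city
        } catch (ParseException e) {
            // Escape everything QueryParser treats as syntax and try again.
            return parser.parse(QueryParser.escape(input));
        }
    }

Real DWIM would need to patch up the unbalanced quote before falling back
this far.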