Re: Repeatability of results

2012-04-04 Thread Marvin Humphrey
bug 323 community, where all x87 floating point errors in gcc come to die! Marvin Humphrey - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: SSD Experience (on developer machine)

2011-08-23 Thread Marvin Humphrey
anges. I'm a little confused. What do you mean by a "full to-hardware flush" and how is that different from the sync()/fsync() calls that Lucene makes by default on each IndexWriter commit()? Marvin Humphrey

Re: BM25 Scoring Patch

2010-02-17 Thread Marvin Humphrey
ecomes a more generalized notion, with the TF/IDF-specific functionality moving into a subclass. Maybe something similar could be made to work in Lucene. Dunno how McCandless has things set up for spec'ing codecs on the flex branch. Marvin Humphrey

Re: Flex & Docs/AndPositionsEnum

2010-02-11 Thread Marvin Humphrey
andard posting formats that Lucene offers. But the point of flex is to provide an extension framework, I thought. Well, whatever. It's just another place where Lucy and Lucene will part ways. Marvin Humphrey

Re: Flex & Docs/AndPositionsEnum

2010-02-10 Thread Marvin Humphrey
a flat position space. A generic aggregator wouldn't know that it needed to do that. The postings codec developer would be forced to write aggregation code in addition to segment-level code. Marvin Humphrey

Re: Flex & Docs/AndPositionsEnum

2010-02-10 Thread Marvin Humphrey
private > attributes. The attrs pass through Multi*Enum. Hmm. Does that mean that the consumer needs to refresh the attributes with each iteration? Because what happens when you switch sub-enums within the Multi*Enum? Don't those attributes go stale, as they belong to a sub-enum that

Re: Flex & Docs/AndPositionsEnum

2010-02-09 Thread Marvin Humphrey
, now and forever, one per method call. > Still torn... I think it's convenience vs performance. But convenience for the posting format plugin developer matters too, right? Are you confident that a generic aggregator can support all possible codecs, or wil

Re: Flex & Docs/AndPositionsEnum

2010-02-09 Thread Marvin Humphrey
ngs across multiple segments, so IMO it's best to shunt users like Renaud towards the segment level. Marvin Humphrey

Re: Unary Operators and Operator Precedence

2010-01-19 Thread Marvin Humphrey
by > the OR operator, so that nevermind whether a record contains a or > contains b both of which supposedly are required, so long as it contains > c, it's a hit? IMO, that's the only sensible way to handle unary operators. If they were global rather than nested, what would this

Re: R-Tree in lucene thoughts?

2010-01-07 Thread Marvin Humphrey
sed to contribute to a score, interleaving is also easy, because you just compare scores and the higher score wins. It's only important if you wanted to, say, sort by distance. Marvin Humphrey

Re: R-Tree in lucene thoughts?

2010-01-07 Thread Marvin Humphrey
/75ac07b7e2d6160d/pluggable_indexreader_was_real_time_updates Yes. That post is from last spring; since then, a prototype of the Lucy pluggable IndexReader design has been implemented in KinoSearch, so you could write such a component today.

Re: External sort

2009-12-17 Thread Marvin Humphrey
On Thu, Dec 17, 2009 at 09:33:11AM -0800, Marvin Humphrey wrote: > On Thu, Dec 17, 2009 at 05:03:11PM +0100, Toke Eskildsen wrote: > > > A third alternative would be to count the number of unique datetime-values > > and make a compressed representation, but that would make the c

Re: External sort

2009-12-17 Thread Marvin Humphrey
we're sorting string data rather than date time values (so the memory occupied by all those unique values is significant). What algorithms would you consider for performing the uniquing? Marvin Humphrey

Re: Questions about SEN patch submissions

2009-11-09 Thread Marvin Humphrey
ver accepted patches from anyone, though -- since then they have to get permission from all contributors for relicensing. Marvin Humphrey

Re: Questions about SEN patch submissions

2009-11-09 Thread Marvin Humphrey
the IP in the interface is entirely original, then the copyright holder for the original library has no claim on it and can't stop the interface author from doing whatever they want with their own material... right? Marvin Humphrey

Re: MergePolicy public but SegmentInfos package protected?

2009-03-27 Thread Marvin Humphrey
re is that Lucene gets to use multiple threads within one process, while Lucy has to at least be capable of using a multiple-process concurrency model in order to support real-time search for non-threaded hosts. Marvin Humphrey

Re: MergePolicy public but SegmentInfos package protected?

2009-03-27 Thread Marvin Humphrey
uses it. So "it depends" I guess? For the purposes of MergePolicy, all you would need are the doc counts and the delcounts, and optionally other stuff in SegmentInfos. In theory you could lazy load the other stuff like the term dictionary index. Obviously that would be an unacce

Re: MergePolicy public but SegmentInfos package protected?

2009-03-27 Thread Marvin Humphrey
ill plenty fast. Actually, if you're not warming sort caches, launching a Lucene IndexReader isn't obscenely expensive any more -- just expensive. Right? Marvin Humphrey [1] At least on Unixen. I believe we can support all of this using Windows MapViewOfFile and frien

Re: MergePolicy public but SegmentInfos package protected?

2009-03-27 Thread Marvin Humphrey
version checking code and adhere to the spec, making it possible to (maybe) safely interpret that data. Marvin Humphrey

Re: MergePolicy public but SegmentInfos package protected?

2009-03-26 Thread Marvin Humphrey
pen SegmentReaders for all segments. Yeah, that's gonna be a bigger problem. :( It's cake to give Lucy's indexer a reader, because opening readers is cheap. But the Lucene heavy-IndexReader model messes that up -- IndexWriter has traditionally been a fast class to open. Marvin

Re: MergePolicy public but SegmentInfos package protected?

2009-03-25 Thread Marvin Humphrey
descendants? Does IndexWriter always have an IndexReader at its disposal yet? And if the answer to those two questions is yes, can you refactor MergePolicy to work off of an IndexReader rather than a SegmentInfos? Marvin Humphrey

Re: robust inverse of query parser?

2009-03-20 Thread Marvin Humphrey
e query object and want a string that QueryParser will > reparse fairly exactly. Query objects are serializable, so that they can be sent over a network in a search cluster. Can you use the serialization facilities? Marvin Humphrey

Re: lsi as indexing algorithm with lucene

2009-03-18 Thread Marvin Humphrey
On Wed, Mar 18, 2009 at 08:09:33AM +0100, Paul Libbrecht wrote: > LSI is patented so it's not been a flurry of implementation attempts. Hasn't the original patent expired? http://mail.python.org/pipermail/python-list/2007-July/621547.html Ma

Re: Problems about using Lucene to generate tag cloud..

2008-04-06 Thread Marvin Humphrey
: after calling terms() it's already pointing at the first term. So you need to rewrite this as a do-while loop. Possibly my least favourite feature of Lucene. :-( What would a better API look like? Marvin Humphrey Rectangular Research http://www.rectangula
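The do-while point above can be illustrated with a minimal sketch in plain Java. The `Cursor` class here is hypothetical and only mimics an enumeration that is already positioned on its first element when it is handed out; it is not Lucene's TermEnum, and the method names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DoWhileDemo {
    /** Hypothetical cursor that starts out already positioned on the first term. */
    static class Cursor {
        private final List<String> terms;
        private int index = 0;

        Cursor(List<String> terms) {
            this.terms = terms;
        }

        /** Current term, or null once the cursor is exhausted (or the list is empty). */
        String term() {
            return index < terms.size() ? terms.get(index) : null;
        }

        /** Advances to the next term; returns false when there is none. */
        boolean next() {
            index++;
            return index < terms.size();
        }
    }

    /** Visits every term: reads the current one before advancing. */
    static List<String> collectWithDoWhile(Cursor cursor) {
        List<String> seen = new ArrayList<>();
        if (cursor.term() == null) {
            return seen; // nothing to visit
        }
        do {
            seen.add(cursor.term());
        } while (cursor.next());
        return seen;
    }

    /** Advances before reading, so the first term is silently skipped. */
    static List<String> collectWithWhile(Cursor cursor) {
        List<String> seen = new ArrayList<>();
        while (cursor.next()) {
            seen.add(cursor.term());
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apple", "banana", "cherry");
        System.out.println(collectWithDoWhile(new Cursor(terms))); // [apple, banana, cherry]
        System.out.println(collectWithWhile(new Cursor(terms)));   // [banana, cherry]
    }
}
```

The plain while loop is the natural idiom for iterators that start before the first element, which is presumably why a pre-positioned enumeration is easy to misuse.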

Re: SOC: Lulu, a Lua implementation of Lucene

2008-02-29 Thread Marvin Humphrey
e.NET, the latter-day Ferret (C/Ruby), and KinoSearch (C/Perl) (and possibly others, I haven't reviewed them all) all get close to the metal. None rely on hash-based method dispatch for inner-loop code, and all manipulate primitive types directly. Marvin Humphrey Rectangula

Re: SOC: Lulu, a Lua implementation of Lucene

2008-02-29 Thread Marvin Humphrey
t compatible with Lucene, and it doesn't make sense to impede progress by imposing that constraint. Better to finish it up, present a polished version and a coherent explanation of the design, then have Mike McCandless riff off of it -- that worked pretty well for indexing speed improve

Re: improving RAM usage by IndexWriter

2007-03-19 Thread Marvin Humphrey
t a secret! Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: NO_NORMS and TOKENIZED?

2007-02-21 Thread Marvin Humphrey
nt field values for display and searching * lazy loading * arbitrary data * arbitrary compression algo choice * complete document recovery * ... Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: NO_NORMS and TOKENIZED?

2007-02-19 Thread Marvin Humphrey
our app in particular -- how do you handle identical XML tag names that mean totally different things when nested inside different elements? Acme Widget Marvin Humphrey Rectangular Research http://www.recta

Re: NO_NORMS and TOKENIZED?

2007-02-19 Thread Marvin Humphrey
product_id => 'acme-lt-1', attr_weight => 6.3, attr_heat_dissipation_factor => 20, }); I'll need to make a few backend tweaks, but this API pretty much solves the multi-dimensional data problem. :) Thoughts? Marvin H

Re: NO_NORMS and TOKENIZED?

2007-02-18 Thread Marvin Humphrey
p with a FieldDef subclass that handles multi-dimensional data. I seem to recall that Solr had something along those lines, using prefixed field names or something. Do I recall correctly? Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: NO_NORMS and TOKENIZED?

2007-02-16 Thread Marvin Humphrey
nsion until it is modified. I guess my short answer is that I don't have an opinion on adding another type-safe constant TOKENIZED_NO_NORMS because I don't like the whole scheme. KS 0.20 doesn't even have Document or Field classes. :) They've been eliminated, and n

Re: Lucene & LSA

2006-12-14 Thread Marvin Humphrey
st as useful as those in a shorter document -- would lose. That wouldn't be helpful for a common case in naive web search, where impossible-to-exclude navigational and advertising text could end up diluting the scores of perfectly

Re: Lucene scoring: Term frequency normalisation

2006-12-12 Thread Marvin Humphrey
. http://www.mail-archive.com/java-dev@lucene.apache.org/msg04509.html http://www.mail-archive.com/java-dev@lucene.apache.org/msg01704.html Searching the mail archives for "lengthNorm" will turn up some more. Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: Analysis/tokenization of compound words

2006-09-23 Thread Marvin Humphrey
languages such as Thai and Japanese, where "words" are not separated by spaces and may consist of multiple characters? Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: Analysis/tokenization of compound words

2006-09-19 Thread Marvin Humphrey
his: http://www.glue.umd.edu/~oard/courses/708a/fall01/838/P2/ Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: Stemmer Implementation Strategy - feedback?

2006-08-07 Thread Marvin Humphrey
org/texts/introduction.html>. Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: MultiFieldQueryParser.parse deprecated. What can I use?

2006-07-25 Thread Marvin Humphrey
; in the body should exclude doc 1. However, since the title matches against 'a -foo' ("a" is present, and "foo" is not), I believe you'll get a hit. If you can solve that problem, and also return 1 hit for the query string "a +foo", let me know!

Re: What are norms?

2006-07-14 Thread Marvin Humphrey
oost assigned to a title field, I've found that I can't get really good IR precision without going back to a non-truncating tf() for title. Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: Per-token weighting / attribute data in index

2006-06-02 Thread Marvin Humphrey
een tags is more important than text between tags and boost it. There's no good way to handle that right now. Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: Per-Field Similarity

2006-05-23 Thread Marvin Humphrey
exact title matches. The answer is to use Lucene's default lengthNorm for title and the modified lengthNorm for bodytext. Wasn't sure how to do that in Lucene; now I know. Marvin Humphrey Rectangular Research http://www.rectangular.com/
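A minimal sketch of the per-field idea described above, in plain Java with no Lucene dependency. It assumes the length-normalization hook receives the field name, so the choice can be made per field. `defaultLengthNorm` uses 1/sqrt(numTokens), which is the formula I understand Lucene's default similarity to have used at the time. `modifiedLengthNorm` is purely hypothetical, since the thread snippet does not say what the modified version was; here it simply treats any field shorter than 100 tokens as if it had 100.

```java
class PerFieldLengthNorm {
    /** Default-style normalization: shorter fields get a larger norm. */
    static float defaultLengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    /**
     * Hypothetical modified normalization (an assumption for illustration):
     * fields shorter than 100 tokens are not rewarded for being short.
     */
    static float modifiedLengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(Math.max(numTokens, 100)));
    }

    /** Dispatches on the field name: modified norm for bodytext, default otherwise. */
    static float lengthNorm(String fieldName, int numTokens) {
        if ("bodytext".equals(fieldName)) {
            return modifiedLengthNorm(numTokens);
        }
        return defaultLengthNorm(numTokens);
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm("title", 4));    // short title keeps its full boost
        System.out.println(lengthNorm("bodytext", 4)); // short body text is flattened
    }
}
```

With this split, an exact match on a four-token title still scores well, while a very short body field no longer outranks longer documents on length alone.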

Per-Field Similarity

2006-05-23 Thread Marvin Humphrey
Greets, Is it possible to have an IndexWriter apply different Similarity models to different Fields? Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: Searching API: QueryParser vs Programatic queries

2006-05-22 Thread Marvin Humphrey
about building a larger BooleanQuery by combining the output of the QueryParser with custom-built Query objects based on your metadata? Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: Aggregating category hits

2006-05-16 Thread Marvin Humphrey
le to retrieving all docs all the time. Marvin Humphrey Rectangular Research http://www.rectangular.com/

Aggregating category hits

2006-05-15 Thread Marvin Humphrey
d examine the contents of a "category" field. That's a lot of overhead. Is there another way? Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: Lucene search benchmark/stress test tool

2006-04-30 Thread Marvin Humphrey
both a Perl/KinoSearch and a Java Lucene version, and they will use the Reuters corpus. Where are you at with your app? Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: Reuters

2006-04-21 Thread Marvin Humphrey
ML out onto the file system. You'll find it at <http://www.rectangular.com/svn/kinosearch/trunk/t/benchmarks/extract_reuters.plx>. Best, Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: StopAnalyzer and apostrophes

2006-04-06 Thread Marvin Humphrey
On Apr 6, 2006, at 4:23 PM, Daniel Noll wrote: Marvin Humphrey wrote: I wrote: It looks like StopAnalyzer tokenizes by letter, and doesn't handle apostrophes. So, the input "I don't know" produces these tokens: don t know Is that right? It's not

Re: StopAnalyzer and apostrophes

2006-04-06 Thread Marvin Humphrey
is a stopword, so the tokens are: don know Phew, that's much more useful. Marvin Humphrey Rectangular Research http://www.rectangular.com/
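The behavior discussed in this thread can be sketched in plain Java, without Lucene: split on anything that is not a letter, lowercase, then drop stopwords. The class and method names are made up for the example, and the stop set is an assumption chosen to reproduce the tokens quoted in the thread; it is not Lucene's actual stopword list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LetterStopSketch {
    /** Illustrative stop set only (an assumption), not Lucene's real list. */
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("i", "t"));

    /** Splits on every non-letter character and lowercases, so an apostrophe ends a token. */
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Letter tokenization followed by stopword removal. */
    static List<String> analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : letterTokens(text)) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(letterTokens("I don't know")); // [i, don, t, know]
        System.out.println(analyze("I don't know"));      // [don, know]
    }
}
```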

StopAnalyzer and apostrophes

2006-04-06 Thread Marvin Humphrey
Greets, It looks like StopAnalyzer tokenizes by letter, and doesn't handle apostrophes. So, the input "I don't know" produces these tokens: don t know Is that right? Marvin Humphrey Rectangular Research http:/

Re: Changing ranking

2006-03-23 Thread Marvin Humphrey
this subject perform a web search for "proposal defaultsimilarity lengthnorm". Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: Memory Usage

2005-11-17 Thread Marvin Humphrey
-seek time. Decompressing a Lucene term dictionary file isn't *that* intense. I hope you won't mind if I don't volunteer to do the actual coding or data collection, though, as I have my hands full porting all of Lucene. :)

Re: Memory Usage

2005-11-15 Thread Marvin Humphrey
nt for this; I would normally expect a more common term to take longer, as there are more docs to score. Anybody got an explanation handy? Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: Memory Usage

2005-11-14 Thread Marvin Humphrey
would provide all the useful benefits of setIndexInterval() at write-time. There's no significant disk-space issue involved. The startup time to fill the cache isn't worth worrying about, either. Coincidentally, I'm porting this exact section of TermInfosReader this mo

Re: Memory Usage

2005-11-14 Thread Marvin Humphrey
t to do that before asking for trouble by kludging setIndexInterval into 1.4.3. The internals of TermInfosWriter are quite complex. Best, Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: Memory Usage

2005-11-13 Thread Marvin Humphrey
exInterval -- you're rewriting the entire index, but it's faster than reindexing from scratch because you don't need to redo the IO or the analysis. Few people will find it useful to tinker with this, but you're the exception, and I'll be interested to hear about your f

Re: Memory Usage

2005-11-13 Thread Marvin Humphrey
e, and the reader has to be able to adapt on the fly. Best, Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: Memory Usage

2005-11-09 Thread Marvin Humphrey
I would rather have an application which used less memory and took longer, than one which uses all the available RAM just to milk out a bit of extra speed. You have grasped the tradeoff precisely. Best, Marvin Humphrey Rectangular

Re: Bad behaviors of FrenchAnalyzer

2005-10-11 Thread Marvin Humphrey
I could look further, and ask some french linguists help. I'm asking because a new version of my own search engine library has a default tokenizer which keeps apostrophic strings together (like StandardTokenizer), and I want to be aware of cases where this choice causes problems. However, it's unlikely I'll change that behavior, as the problem is addressed by making it trivially easy to customize the tokenizer. So I would say that for my own purposes, consulting a linguist is probably overkill. Cheers, Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: Bad behaviors of FrenchAnalyzer

2005-10-11 Thread Marvin Humphrey
ver want to search for the ll in you'll, or the O in O'Reilly, etc. Marvin Humphrey Rectangular Research http://www.rectangular.com/

Re: Lucene does NOT use UTF-8.

2005-08-27 Thread Marvin Humphrey
list, so I'm sending my reply there (and cc'ing Jian). Marvin Humphrey Rectangular Research http://www.rectangular.com/

Lucene does NOT use UTF-8.

2005-08-26 Thread Marvin Humphrey
erruns. I am hoping that the answer to this will be a fix to the encoding mechanism in Lucene so that it really does use legal UTF-8. The most efficient way to go about this has not yet presented itself. Marvin Humphrey Rectangular Research http://www.rectangular.com/ #--

Standard or Modified UTF-8?

2005-08-26 Thread Marvin Humphrey
ng the two-byte form. else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0) { writeByte((byte)(0xC0 | (code >> 6))); writeByte((byte)(0x80 | (code & 0x3F))); http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8 Can someone please confirm that the intention is
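The two-byte encoding of code 0 quoted above can be observed directly from the JDK. This small sketch compares standard UTF-8 (`String.getBytes`) with Java's modified UTF-8 (`DataOutputStream.writeUTF`) for U+0000. The class and helper names are made up for the example, and nothing here touches Lucene's own writer.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Comparison {
    /** Standard UTF-8: U+0000 is the single byte 0x00. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Java's modified UTF-8, with the 2-byte length prefix from writeUTF removed. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(standardUtf8("\u0000"))); // [0]
        System.out.println(Arrays.toString(modifiedUtf8("\u0000"))); // [-64, -128], i.e. 0xC0 0x80
    }
}
```

The bytes 0xC0 0x80 match what the quoted branch produces for code == 0: `0xC0 | (0 >> 6)` followed by `0x80 | (0 & 0x3F)`. A strict UTF-8 decoder rejects that sequence as an overlong encoding.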

Re: Thinking about better highlighting

2005-08-26 Thread Marvin Humphrey
he guts of Lucene's searching, but from a high level it looks similar, so this might work... Keep track of positions matched during phrase-matching. Use those for highlighting terms which are part of the phrase match. Use post-search analysis for highlighting anything that isn

Re: intra-word delimiters

2005-08-15 Thread Marvin Humphrey
On Aug 15, 2005, at 8:53 PM, Marvin Humphrey wrote: Create a phrase query that when it encounters ab => { tokenlength => 2 } knows to look for something at position 3. Fencepost error! That should have been "position 2". Not that correcting the error makes the algo an

Re: intra-word delimiters

2005-08-15 Thread Marvin Humphrey
subwords is non-trivial. Tag every term with its length in tokens. :) Index at these positions.
Pos0: a ab abc abcd
Pos1: b bc bcd
Pos2: c cd
Pos3: d
Create a phrase query that when it encounters ab => { tokenlength => 2 } knows to look for something at position 3. Marvin Humphrey Recta
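The position table in this message can be generated mechanically. The sketch below, in plain Java with hypothetical names, lists for each starting position every concatenation of consecutive tokens that begins there. It only covers producing the terms; the phrase-query side of the proposal is not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SubwordPositions {
    /** For each start position, returns every run of consecutive tokens beginning there, concatenated. */
    static List<List<String>> termsByPosition(List<String> tokens) {
        List<List<String>> positions = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            List<String> terms = new ArrayList<>();
            StringBuilder run = new StringBuilder();
            for (int end = start; end < tokens.size(); end++) {
                run.append(tokens.get(end)); // extend the run by one token
                terms.add(run.toString());
            }
            positions.add(terms);
        }
        return positions;
    }

    public static void main(String[] args) {
        List<List<String>> table = termsByPosition(Arrays.asList("a", "b", "c", "d"));
        for (int pos = 0; pos < table.size(); pos++) {
            System.out.println("Pos" + pos + ": " + String.join(" ", table.get(pos)));
        }
    }
}
```

For n tokens this emits n(n+1)/2 terms, so in practice one would probably cap the run length.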

Re: intra-word delimiters

2005-08-15 Thread Marvin Humphrey
eas? How about this? 1) Lowercase. 2) Convert non-alphanumeric characters to spaces. 3) Introduce a space at every boundary between a letter and a number. 4) Concatenate all 1, 2, 3 .. n term combinations and index them. 5) Don't stem. Marvin Humphrey Rectangular Research http://www.
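Steps 1 through 3 of the list above can be sketched as a small normalizer in plain Java. The class name is hypothetical, and the regular expressions assume ASCII letters and digits only, which is a simplification; the original message does not say how non-ASCII text should be treated. Step 4 (concatenating term combinations) is left out here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class DelimiterTokenizer {
    /** Applies steps 1-3 and returns the resulting tokens (no stemming, per step 5). */
    static List<String> tokenize(String input) {
        // Step 1: lowercase.
        String s = input.toLowerCase(Locale.ROOT);
        // Step 2: convert runs of non-alphanumeric characters to a space (ASCII-only assumption).
        s = s.replaceAll("[^a-z0-9]+", " ");
        // Step 3: insert a space at every letter/digit boundary.
        s = s.replaceAll("(?<=[a-z])(?=[0-9])|(?<=[0-9])(?=[a-z])", " ");
        s = s.trim();
        if (s.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(s.split(" +")));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Wi-Fi 802.11b")); // [wi, fi, 802, 11, b]
    }
}
```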

Re: How to get the un-stemed word

2005-07-08 Thread Marvin Humphrey
unstemmed form from the original text. Wouldn't that fall down if you had two distinct terms which produce the same string when stemmed? Best, Marvin Humphrey Rectangular Research http://www.rectangular.com/

Error tolerant query parsing

2005-06-28 Thread Marvin Humphrey
Greetings, Is it possible to have Lucene parse malformed queries? For instance, is there a way to have this query... art museums "new york city ... return results for ... art museums "new york city" ... or is that just a parse error, end of story? It's a DWIM* thing.
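One way to get the do-what-I-mean behavior asked about here is to repair the query string before handing it to a parser. The sketch below is an assumption about how that could be done, not a feature of Lucene's QueryParser: it appends a closing double quote when the count is odd. It deliberately ignores backslash-escaped quotes and every other kind of malformed input.

```java
class QueryRepair {
    /** Appends a closing double quote if the query contains an odd number of them. */
    static String closeUnbalancedQuote(String query) {
        long quotes = query.chars().filter(c -> c == '"').count();
        return (quotes % 2 == 1) ? query + "\"" : query;
    }

    public static void main(String[] args) {
        System.out.println(closeUnbalancedQuote("art museums \"new york city"));
    }
}
```

The repaired string can then be parsed normally, with the original parse error kept as a fallback for inputs this heuristic does not fix.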