[jira] Commented: (LUCENE-1306) CombinedNGramTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605483#action_12605483 ] Hiroaki Kawai commented on LUCENE-1306:
---
After thinking about it for a week, I think this idea is nice. IMHO, this might simply be renamed to NGramTokenizer. A general n-gram tokenizer accepts a sequence that has no gaps in it. By that concept, a TokenFilter accepts a token stream (a gapped sequence), and the current NGramTokenFilter does not work well in that sense. CombinedNGramTokenFilter fills the gaps with a prefix (^) and suffix ($), the token stream virtually becomes a simple stream again, and n-gram tokenization works nicely again.

Comments:
1. The prefix and suffix chars should be configurable, because the user must choose a char that is not used in the terms.
2. The prefix and suffix might be a whitespace, because most users are not interested in whitespace itself.
3. If you want to do a phrase query (for example, "This is"), we have to generate a $^ token in the gap to make the positions valid.
4. The n-gram algorithm should be rewritten to make the positions valid. Please see LUCENE-1225.

I think "^h" is OK, because the prefix and suffix are chars that were introduced as a workaround.

> CombinedNGramTokenFilter
>
> Key: LUCENE-1306
> URL: https://issues.apache.org/jira/browse/LUCENE-1306
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/analyzers
> Reporter: Karl Wettin
> Assignee: Karl Wettin
> Priority: Trivial
> Attachments: LUCENE-1306.txt
>
> Alternative NGram filter that produces tokens with composite prefix and suffix markers.
> {code:java}
> ts = new WhitespaceTokenizer(new StringReader("hello"));
> ts = new CombinedNGramTokenFilter(ts, 2, 2);
> assertNext(ts, "^h");
> assertNext(ts, "he");
> assertNext(ts, "el");
> assertNext(ts, "ll");
> assertNext(ts, "lo");
> assertNext(ts, "o$");
> assertNull(ts.next());
> {code}
-- This message is automatically generated by JIRA.
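The decorated-term idea is easy to sketch outside Lucene. The following standalone Java sketch (not the attached filter's actual code; the class and method names here are invented for illustration) surrounds a term with configurable prefix/suffix markers, as comment 1 suggests, and then emits every n-gram of the decorated string, reproducing the token sequence from the example above:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone illustration of the combined n-gram idea (hypothetical
// helper, not the Lucene filter): decorate the term with configurable
// prefix/suffix markers, then emit every n-gram of the decorated string.
public class CombinedNGrams {
    public static List<String> ngrams(String term, int n,
                                      char prefix, char suffix) {
        String decorated = prefix + term + suffix; // e.g. "^hello$"
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= decorated.length(); i++) {
            out.add(decorated.substring(i, i + n));
        }
        return out;
    }
}
```

For "hello" with n=2 this yields ^h, he, el, ll, lo, o$, matching the test case in the issue description.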
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Obtain IndexCommits from directory
Need to be able to get a list of IndexCommits for a Directory, and also to open an IndexReader for each IndexCommit. I am thinking of an API such as:

IndexCommit[] commits = IndexReader.listCommitPoints(Directory directory);

and

IndexReader.open(IndexCommit commit, Directory directory);

I suppose this could cause problems for reopen.
handling token created/deleted events in an Index
With LUCENE-1297, the SpellChecker will be able to choose how to estimate the distance between two words. Here are some other enhancements:

* The capacity to synchronize the main index and the SpellChecker index. Handling token creation is easy: a simple TokenFilter can do the work. But token deletion is a bit harder. Lazy deletion can be used if, each time, token popularity is checked in the main index. That is a pull strategy; a push from the Directory should be lighter.
* Choosing the similarity strategy. Right now it is only an n-gram computation. Homophony could be nice, for example.
* The spell index can be used for dynamic similarity without disturbing the main index. For example, Snowball is nice for grouping words by their roots, but it disturbs the index if you want to make a starts-with query.

Some time ago I suggested a patch, LUCENE-1190, but I guess it's too monolithic. A more modular approach should be better. Any comments or suggestions?

M.
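On the similarity point: the kind of n-gram computation mentioned above can be illustrated with a small standalone sketch (invented class and method names, not the SpellChecker's actual code) that scores two words by the overlap of their character bigrams, i.e. a Dice coefficient:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of an n-gram similarity (hypothetical helper, not
// the SpellChecker's code): score two words by bigram overlap (Dice).
public class NGramSimilarity {
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    // 2 * |common bigrams| / (|bigrams(a)| + |bigrams(b)|), in [0,1]
    public static double dice(String a, String b) {
        Set<String> ga = bigrams(a), gb = bigrams(b);
        if (ga.isEmpty() && gb.isEmpty()) return 1.0;
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return 2.0 * common.size() / (ga.size() + gb.size());
    }
}
```

Swapping in a homophony-based measure would then just mean replacing this scoring function, which is what a pluggable similarity strategy would allow.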
[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker
[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605290#action_12605290 ] Grant Ingersoll commented on LUCENE-1297:
---
+1 on committing this. I downloaded the latest patch and applied it, ran the tests, etc., and it looks good.

> Allow other string distance measures in spellchecker
>
> Key: LUCENE-1297
> URL: https://issues.apache.org/jira/browse/LUCENE-1297
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/spellchecker
> Affects Versions: 2.4
> Environment: n/a
> Reporter: Thomas Morton
> Assignee: Otis Gospodnetic
> Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1297.patch, LUCENE-1297.patch
>
> Updated spelling code to allow other string distance measures to be used.
> Created a StringDistance interface.
> Modified the existing Levenshtein distance measure to implement the interface (and renamed the class).
> Verified that the change to the Levenshtein distance didn't impact runtime performance.
> Implemented the Jaro/Winkler distance metric.
> Modified SpellChecker to take a distance measure in the constructor or in a set method, and to use the interface when calling.
-- This message is automatically generated by JIRA.
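A minimal sketch of what such a pluggable interface could look like (the exact names and signatures in the attached patch may differ; this follows the convention of returning a similarity in [0,1], with 1 meaning identical), paired with a normalized Levenshtein implementation:

```java
// Sketch of a pluggable string-distance interface (the patch's actual
// signatures may differ): 1.0 means identical, 0.0 means maximally far.
interface StringDistance {
    float getDistance(String s1, String s2);
}

// Classic dynamic-programming edit distance, normalized by the length
// of the longer string so the result lands in [0,1].
class LevenshteinDistance implements StringDistance {
    public float getDistance(String s1, String s2) {
        int[][] d = new int[s1.length() + 1][s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) d[i][0] = i;
        for (int j = 0; j <= s2.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,
                                            d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        int max = Math.max(s1.length(), s2.length());
        return max == 0 ? 1f : 1f - (float) d[s1.length()][s2.length()] / max;
    }
}
```

With this shape, a Jaro/Winkler implementation is just another class implementing the same interface, which is what lets SpellChecker accept the measure as a constructor argument.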
[jira] Updated: (LUCENE-1301) Refactor DocumentsWriter
[ https://issues.apache.org/jira/browse/LUCENE-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1301:
---
Attachment: LUCENE-1301.take2.patch

Whoops, sorry, I forgot to svn add that. I'm attaching my current state with that file added. Does this one work? (You may need to forcefully remove DocumentsWriterFieldData.java if applying the patch doesn't do so.)

> Refactor DocumentsWriter
>
> Key: LUCENE-1301
> URL: https://issues.apache.org/jira/browse/LUCENE-1301
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.3, 2.3.1, 2.3.2, 2.4
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1301.patch, LUCENE-1301.take2.patch
>
> I've been working on refactoring DocumentsWriter to make it more
> modular, so that adding new indexing functionality (like column-stride
> stored fields, LUCENE-1231) is just a matter of adding a plugin into
> the indexing chain.
> This is an initial step towards flexible indexing (but there is still
> a lot more to do!).
> And it's very much still a work in progress -- there are intermittent
> thread-safety issues, I need to add test cases and test/iterate on
> performance, there are many "nocommits", etc. This is a snapshot of my
> current state...
> The approach introduces "consumers" (abstract classes defining the
> interface) at different levels during indexing. E.g. DocConsumer
> consumes the whole document. DocFieldConsumer consumes separate
> fields, one at a time. InvertedDocConsumer consumes tokens produced
> by running each field through the analyzer. TermsHashConsumer writes
> its own bytes into in-memory posting lists stored in byte slices,
> indexed by term, etc.
> DocumentsWriter*.java is then much simpler: it only interacts with a
> DocConsumer and has no idea what that consumer is doing.
> Under that DocConsumer there is a whole "indexing chain" that does the real work:
> * NormsWriter holds norms in memory and then flushes them to _X.nrm.
> * FreqProxTermsWriter holds postings data in memory and then flushes to _X.frq/prx.
> * StoredFieldsWriter flushes immediately to _X.fdx/fdt.
> * TermVectorsTermsWriter flushes immediately to _X.tvx/tvf/tvd.
> DocumentsWriter still manages things like flushing a segment, closing
> doc stores, buffering & applying deletes, freeing memory, aborting
> when necessary, etc.
> In this first step, everything is package-private, and the indexing
> chain is hardwired (instantiated in DocumentsWriter) to the chain
> currently matching Lucene trunk. Over time we can open this up.
> There are no changes to the index file format.
> For the most part this is just a [large] refactoring, except for these
> two small actual changes:
> * Improved concurrency with mixed large/small docs: previously the
> thread state would be tied up when docs finished indexing
> out-of-order. Now it's not: instead I use a separate class to
> hold any pending state to flush to the doc stores, and immediately
> free up the thread state to index other docs.
> * Buffered norms in memory now remain sparse until flushed to the
> _X.nrm file. Previously we would "fill holes" in norms in memory
> as we go, which could easily use way too much memory. Really this
> isn't a solution to the problem of sparse norms (LUCENE-830); it
> just delays that issue from causing memory blowup during indexing;
> memory use will still blow up during searching.
> I expect performance (indexing throughput) will be worse with this
> change. I'll profile & iterate to minimize this, but I think we can
> accept some loss. I also plan to measure the benefit of manually
> recycling RawPostingList instances from our own pool vs. letting GC
> recycle them.
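The consumer hierarchy the patch describes can be sketched in miniature. The class names below mirror the patch, but the bodies are invented and carry none of the real state; the point is only the shape of the chain, where DocumentsWriter sees just a DocConsumer, which fans each field out to a downstream DocFieldConsumer:

```java
import java.util.List;

// Miniature sketch of the "indexing chain" shape (names mirror the
// patch; bodies are invented). Fields are modeled as [name, value].
abstract class DocConsumer {
    abstract void processDocument(List<String[]> fields);
}

abstract class DocFieldConsumer {
    abstract void processField(String name, String value);
}

// The only DocConsumer DocumentsWriter would need to know about: it
// walks the document's fields and hands each to the downstream consumer.
class DocFieldProcessor extends DocConsumer {
    private final DocFieldConsumer downstream;

    DocFieldProcessor(DocFieldConsumer downstream) {
        this.downstream = downstream;
    }

    @Override
    void processDocument(List<String[]> fields) {
        for (String[] f : fields) downstream.processField(f[0], f[1]);
    }
}

// Trivial downstream consumer used to demonstrate the fan-out.
class CountingFieldConsumer extends DocFieldConsumer {
    int count;

    @Override
    void processField(String name, String value) { count++; }
}
```

Swapping in a different DocFieldConsumer (or chaining several) is then the "plugin" step the refactoring is aiming for, without DocumentsWriter changing at all.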
A few interesting papers from WWW2008
Hi,

I found the following papers of potential interest to the Lucene community:

* http://www2008.org/papers/pdf/p387-zhangA.pdf "Performance of Compressed Inverted List Caching in Search Engines" discusses a new compression algorithm for inverted indexes, PForDelta, and its performance benefits over other well-known algorithms.
* http://www2008.org/papers/pdf/p1213-ding.pdf "Using Graphics Processors for High-Performance IR Query Processing" discusses the application of GPUs to posting list decompression and intersection.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com