[jira] Commented: (LUCENE-1306) CombinedNGramTokenFilter

2008-06-16 Thread Hiroaki Kawai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605483#action_12605483
 ] 

Hiroaki Kawai commented on LUCENE-1306:
---

After thinking for a week, I think this idea is nice.

IMHO, this might simply be renamed to NGramTokenizer. A general n-gram 
tokenizer accepts a sequence that has no gaps in it. By that concept, a 
TokenFilter accepts a token stream (a gapped sequence), and the current 
NGramTokenFilter does not work well in that sense. CombinedNGramTokenFilter 
fills the gaps with prefix (^) and suffix ($) markers, so the token stream 
effectively becomes a simple stream again and n-gram tokenization works nicely.

Comments:
1. The prefix and suffix chars should be configurable, because the user must 
choose chars that are not used in the terms.
2. The prefix and suffix might be whitespace, because most users are not 
interested in whitespace itself.
3. If you want to do a phrase query (for example, "This is"), we have to 
generate a $^ token in the gap to make the positions valid.
4. The n-gram algorithm should be rewritten to make the positions valid. 
Please see LUCENE-1225.

I think "^h" is OK, because the prefix and suffix are chars that were 
introduced as a workaround.
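To make the marker concept concrete, here is a minimal plain-Java sketch of 
n-gram generation over a term padded with ^ and $ markers (illustrative only; 
the class and method names are hypothetical and this is not the actual filter 
implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class MarkedNGrams {
    // Surround the term with prefix/suffix markers, then emit every n-gram
    // of the padded string in order.
    static List<String> ngrams(String term, int n, char prefix, char suffix) {
        String padded = prefix + term + suffix;
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            grams.add(padded.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Mirrors the test case in the issue description: "hello" with n=2.
        System.out.println(ngrams("hello", 2, '^', '$'));
        // prints [^h, he, el, ll, lo, o$]
    }
}
```

This also shows why configurable markers (comment 1 above) matter: the choice 
of '^' and '$' is passed in rather than hardcoded.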


> CombinedNGramTokenFilter
> 
>
> Key: LUCENE-1306
> URL: https://issues.apache.org/jira/browse/LUCENE-1306
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Reporter: Karl Wettin
>Assignee: Karl Wettin
>Priority: Trivial
> Attachments: LUCENE-1306.txt
>
>
> Alternative NGram filter that produces tokens with composite prefix and suffix 
> markers.
> {code:java}
> ts = new WhitespaceTokenizer(new StringReader("hello"));
> ts = new CombinedNGramTokenFilter(ts, 2, 2);
> assertNext(ts, "^h");
> assertNext(ts, "he");
> assertNext(ts, "el");
> assertNext(ts, "ll");
> assertNext(ts, "lo");
> assertNext(ts, "o$");
> assertNull(ts.next());
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Obtain IndexCommits from directory

2008-06-16 Thread Jason Rutherglen
Need to be able to get a list of IndexCommits for a directory, and also open an
IndexReader for each IndexCommit.  I am thinking of an API such as the
following.  I suppose this could cause problems for reopen.

IndexCommit[] commits = IndexReader.listCommitPoints(Directory directory);

and

IndexReader.open(IndexCommit commit, Directory directory);


handling token created/deleted events in an Index

2008-06-16 Thread Mathieu Lecarme
With LUCENE-1297, the SpellChecker will be able to choose how to  
estimate the distance between two words.


Here are some other enhancements:
 * The capacity to synchronize the main Index and the SpellChecker  
Index. Handling token creation is easy; a simple TokenFilter can do  
the work. But token deletion is a bit harder. Lazy deletion can be  
used if, each time, token popularity is checked in the main Index.  
That is a pull strategy; a push from the Directory should be lighter.
 * Choosing the similarity strategy. Right now it is only an n-gram  
computation. Homophony could be nice, for example.
 * The spell Index can be used for dynamic similarity without disturbing  
the main Index. For example, Snowball is nice for grouping words by  
their roots, but it disturbs the Index if you want to make a  
starts-with query.


Some time ago I suggested a patch, LUCENE-1190, but I guess it's too  
monolithic. A more modular approach would be better.


Any comments or suggestions?

M.




[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker

2008-06-16 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605290#action_12605290
 ] 

Grant Ingersoll commented on LUCENE-1297:
-

+1 on committing this.  I downloaded the latest and applied, ran the tests, 
etc. and it looks good.

> Allow other string distance measures in spellchecker
> 
>
> Key: LUCENE-1297
> URL: https://issues.apache.org/jira/browse/LUCENE-1297
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/spellchecker
>Affects Versions: 2.4
> Environment: n/a
>Reporter: Thomas Morton
>Assignee: Otis Gospodnetic
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1297.patch, LUCENE-1297.patch
>
>
> Updated the spelling code to allow for other string distance measures to be 
> used.
> Created a StringDistance interface.
> Modified the existing Levenshtein distance measure to implement the interface 
> (and renamed the class).
> Verified that the change to the Levenshtein distance didn't impact runtime 
> performance.
> Implemented the Jaro/Winkler distance metric.
> Modified SpellChecker to take the distance measure in the constructor or in a 
> set method, and to use the interface when calling.
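As a sketch of what such a pluggable interface might look like (the actual 
names and signatures in the patch may differ; this is a hypothetical 
self-contained illustration, not the committed code):

```java
// A pluggable string distance: returns a similarity in [0, 1],
// where 1.0f means the strings are identical.
interface StringDistance {
    float getDistance(String s1, String s2);
}

// Classic dynamic-programming Levenshtein edit distance, normalized
// by the longer string's length so it fits the [0, 1] contract.
class LevenshteinDistance implements StringDistance {
    @Override
    public float getDistance(String s1, String s2) {
        int[][] d = new int[s1.length() + 1][s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) d[i][0] = i;
        for (int j = 0; j <= s2.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        int max = Math.max(s1.length(), s2.length());
        return max == 0 ? 1.0f : 1.0f - (float) d[s1.length()][s2.length()] / max;
    }
}
```

With this shape, a SpellChecker constructor can accept any StringDistance (a 
Jaro/Winkler implementation, for instance) without the caller changing.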






[jira] Updated: (LUCENE-1301) Refactor DocumentsWriter

2008-06-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1301:
---

Attachment: LUCENE-1301.take2.patch

Woops, sorry, I forgot to svn add that.  I'm attaching my current
state, with that file added.  Does this one work?  (You may need to
forcefully remove DocumentsWriterFieldData.java if applying the patch
doesn't do so).



> Refactor DocumentsWriter
> 
>
> Key: LUCENE-1301
> URL: https://issues.apache.org/jira/browse/LUCENE-1301
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.3, 2.3.1, 2.3.2, 2.4
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1301.patch, LUCENE-1301.take2.patch
>
>
> I've been working on refactoring DocumentsWriter to make it more
> modular, so that adding new indexing functionality (like column-stride
> stored fields, LUCENE-1231) is just a matter of adding a plugin into
> the indexing chain.
> This is an initial step towards flexible indexing (but there is still
> a lot more to do!).
> And it's very much still a work in progress -- there are intermittent
> thread safety issues, I need to add test cases and test/iterate on
> performance, many "nocommits", etc.  This is a snapshot of my current
> state...
> The approach introduces "consumers" (abstract classes defining the
> interface) at different levels during indexing.  EG DocConsumer
> consumes the whole document.  DocFieldConsumer consumes separate
> fields, one at a time.  InvertedDocConsumer consumes tokens produced
> by running each field through the analyzer.  TermsHashConsumer writes
> its own bytes into in-memory posting lists stored in byte slices,
> indexed by term, etc.
> DocumentsWriter*.java is then much simpler: it only interacts with a
> DocConsumer and has no idea what that consumer is doing.  Under that
> DocConsumer there is a whole "indexing chain" that does the real work:
>   * NormsWriter holds norms in memory and then flushes them to _X.nrm.
>   * FreqProxTermsWriter holds postings data in memory and then flushes
> to _X.frq/prx.
>   * StoredFieldsWriter flushes immediately to _X.fdx/fdt
>   * TermVectorsTermsWriter flushes immediately to _X.tvx/tvf/tvd
> DocumentsWriter still manages things like flushing a segment, closing
> doc stores, buffering & applying deletes, freeing memory, aborting
> when necessary, etc.
> In this first step, everything is package-private, and, the indexing
> chain is hardwired (instantiated in DocumentsWriter) to the chain
> currently matching Lucene trunk.  Over time we can open this up.
> There are no changes to the index file format.
> For the most part this is just a [large] refactoring, except for these
> two small actual changes:
>   * Improved concurrency with mixed large/small docs: previously the
> thread state would be tied up when docs finished indexing
> out-of-order.  Now, it's not: instead I use a separate class to
> hold any pending state to flush to the doc stores, and immediately
> free up the thread state to index other docs.
>   * Buffered norms in memory now remain sparse, until flushed to the
> _X.nrm file.  Previously we would "fill holes" in norms in memory,
> as we go, which could easily use way too much memory.  Really this
> isn't a solution to the problem of sparse norms (LUCENE-830); it
> just delays that issue from causing memory blowup during indexing;
> memory use will still blow up during searching.
> I expect performance (indexing throughput) will be worse with this
> change.  I'll profile & iterate to minimize this, but I think we can
> accept some loss.  I also plan to measure benefit of manually
> re-cycling RawPostingList instances from our own pool, vs letting GC
> recycle them.
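The consumer-chain idea described above can be sketched in plain Java 
(illustrative only: the class names, signatures, and the simplified 
String-based document model here are hypothetical, not the actual classes in 
the patch):

```java
import java.util.List;

// Top-level consumer: sees the whole document, knows nothing about
// what happens downstream.
abstract class DocConsumer {
    abstract void processDocument(List<String> fields);
    abstract void flush();
}

// Per-field consumer: sees one field at a time.
abstract class DocFieldConsumer {
    abstract void processField(String field);
    abstract void flush();
}

// A concrete DocConsumer that splits the document into fields and
// forwards each one down the chain, mirroring how DocumentsWriter
// only interacts with a single DocConsumer.
class DocFieldProcessor extends DocConsumer {
    private final DocFieldConsumer fieldConsumer;
    private int docCount = 0;

    DocFieldProcessor(DocFieldConsumer fieldConsumer) {
        this.fieldConsumer = fieldConsumer;
    }

    @Override
    void processDocument(List<String> fields) {
        for (String field : fields) {
            fieldConsumer.processField(field);
        }
        docCount++;
    }

    @Override
    void flush() {
        fieldConsumer.flush();
    }

    int docCount() { return docCount; }
}
```

The point of the pattern is that swapping in a new indexing feature means 
implementing another consumer and wiring it into the chain, with no change to 
the driver above it.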






A few interesting papers from WWW2008

2008-06-16 Thread Andrzej Bialecki

Hi,

I found the following papers of potential interest to Lucene community:

* http://www2008.org/papers/pdf/p387-zhangA.pdf "Performance of 
Compressed Inverted List Caching in Search Engines", discusses a new 
compression algorithm for inverted indexes, PForDelta, and its 
performance benefits over other well-known algorithms.


* http://www2008.org/papers/pdf/p1213-ding.pdf "Using Graphics 
Processors for High-Performance IR Query Processing", discusses the 
application of GPU for posting list decompression and intersection.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

