[jira] [Commented] (LUCENE-2605) queryparser parses on whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325862#comment-15325862 ]

Fuad Efendi commented on LUCENE-2605:
-------------------------------------

This was a really painful problem (unexpected "tokenization" by the query parser!). Thank you for fixing it!

> queryparser parses on whitespace
> --------------------------------
>
>                 Key: LUCENE-2605
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2605
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/queryparser
>            Reporter: Robert Muir
>            Assignee: Steve Rowe
>             Fix For: 4.9, 6.0
>
>         Attachments: LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch
>
> The queryparser parses input on whitespace, and sends each whitespace-separated term to its own independent token stream.
> This breaks the following at query time, because they can't see across whitespace boundaries:
> * n-gram analysis
> * shingles
> * synonyms (especially multi-word, for whitespace-separated languages)
> * languages where a 'word' can contain whitespace (e.g. Vietnamese)
> It's also rather unexpected: users think their charfilters/tokenizers/tokenfilters will do the same thing at index and query time, but in many cases they can't. Instead, preferably the queryparser would parse around only real 'operators'.
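As a quick illustration of the problem being fixed here, a minimal sketch against the Lucene 6.x-era classic QueryParser API (the field name and analyzer choice are assumptions, not taken from the issue):

{code}
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

// The parser splits "wi fi" on whitespace *before* analysis, so each
// term goes through the analysis chain alone: a multi-word synonym or
// shingle filter never sees both tokens together.
public class WhitespaceSplitDemo {
  public static void main(String[] args) throws Exception {
    QueryParser qp = new QueryParser("body", new WhitespaceAnalyzer());
    Query q = qp.parse("wi fi");
    // Prints something like "body:wi body:fi" -- two independent
    // TermQuerys, not one analyzed phrase.
    System.out.println(q);
  }
}
{code}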
[jira] [Commented] (SOLR-2357) Thread Local memory leaks on restart
[ https://issues.apache.org/jira/browse/SOLR-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189540#comment-15189540 ]

Fuad Efendi commented on SOLR-2357:
-----------------------------------

- Tomcat leaks memory when a web application uses custom ThreadLocal instances (as key and value) and fails to remove them.
- Tomcat 7.0.6 and later work around the problem by renewing the threads in the pool. Please see http://wiki.apache.org/tomcat/MemoryLeakProtection for details.

Can we close this issue now? Thanks,

> Thread Local memory leaks on restart
> ------------------------------------
>
>                 Key: SOLR-2357
>                 URL: https://issues.apache.org/jira/browse/SOLR-2357
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - Solr Cell (Tika extraction), search
>    Affects Versions: 1.4.1
>         Environment: Windows Server 2008, Apache Tomcat 7.0.8, Java 1.6.23
>            Reporter: Gus Heck
>              Labels: memory_leak, threadlocal
>
> Restarting Solr (via a change to a watched resource, or via the manager app, for example) after submitting documents with Solr Cell gives the following message (many, many times) and causes Tomcat to shut down completely.
> SEVERE: The web application [/solr] created a ThreadLocal with key of type [org.apache.solr.common.util.DateUtil.ThreadLocalDateFormat] (value [org.apache.solr.common.util.DateUtil$ThreadLocalDateFormat@dc30dfa]) and a value of type [java.text.SimpleDateFormat] (value [java.text.SimpleDateFormat@5af7aed5]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
> Feb 10, 2011 7:17:53 AM org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
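For readers hitting the same warning in their own webapps, a minimal sketch of the kind of cleanup Tomcat is asking for (hypothetical class, not Solr's actual fix):

{code}
import java.text.SimpleDateFormat;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

// A ThreadLocal held in a static field must be removed on shutdown,
// or the webapp classloader can never be garbage collected.
public class DateFormatHolder implements ServletContextListener {

  static final ThreadLocal<SimpleDateFormat> FORMAT =
      new ThreadLocal<SimpleDateFormat>() {
        @Override
        protected SimpleDateFormat initialValue() {
          return new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        }
      };

  public void contextInitialized(ServletContextEvent sce) { }

  public void contextDestroyed(ServletContextEvent sce) {
    // Best effort: remove() clears the value for the *current* thread only.
    // Values parked on other pool threads are exactly why Tomcat renews
    // its worker threads after the webapp stops.
    FORMAT.remove();
  }
}
{code}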
[jira] [Commented] (LUCENE-2605) queryparser parses on whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15188046#comment-15188046 ]

Fuad Efendi commented on LUCENE-2605:
-------------------------------------

Is this resolved? Is anyone working on it? Thanks

> queryparser parses on whitespace
> --------------------------------
>
>                 Key: LUCENE-2605
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2605
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/queryparser
>            Reporter: Robert Muir
>             Fix For: 4.9, master
>
> The queryparser parses input on whitespace, and sends each whitespace-separated term to its own independent token stream.
> This breaks the following at query time, because they can't see across whitespace boundaries:
> * n-gram analysis
> * shingles
> * synonyms (especially multi-word, for whitespace-separated languages)
> * languages where a 'word' can contain whitespace (e.g. Vietnamese)
> It's also rather unexpected: users think their charfilters/tokenizers/tokenfilters will do the same thing at index and query time, but in many cases they can't. Instead, preferably the queryparser would parse around only real 'operators'.
[jira] [Updated] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: SOLR-2233.patch

Revised version of the old patch (11-Nov-2010); the previous version was hard to read ;-)

Main changes:
- the connection is no longer closed and reopened after a timeout
- the connection can no longer be closed by a second thread unexpectedly, behind the first thread's back (the initial bug is fixed)

Please note it works fine with MS SQL Server. However, running concurrent statements (in concurrent threads) over the same connection object is tricky; a JDBC driver may or may not support it (the JDBC-ODBC bridge, for instance).

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
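A minimal sketch of the one-connection-per-thread idea the patch describes (illustrative names; this is not the patch code itself):

{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/** Sketch: one Connection per worker thread, so thread B can never
 *  close a connection that thread A is still iterating over. */
public class PerThreadConnections {

  private final String url, user, password;

  public PerThreadConnections(String url, String user, String password) {
    this.url = url;
    this.user = user;
    this.password = password;
  }

  private final ThreadLocal<Connection> conn = new ThreadLocal<Connection>() {
    @Override
    protected Connection initialValue() {
      try {
        return DriverManager.getConnection(url, user, password);
      } catch (SQLException e) {
        throw new RuntimeException("could not open connection", e);
      }
    }
  };

  public Connection get() {
    return conn.get();
  }

  /** Each worker thread must call this when the import finishes. */
  public void closeCurrent() throws SQLException {
    Connection c = conn.get();
    conn.remove();
    c.close();
  }
}
{code}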
[jira] [Commented] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044460#comment-13044460 ]

Fuad Efendi commented on SOLR-2233:
-----------------------------------

Note that with this implementation the connection is closed only when the main instance of the main class is finalized, i.e. the connection is effectively never closed; so the code is still naive (the server can close the connection; how will we know that?). Fortunately that hasn't happened in my specific case over a few months of nightly imports...

We should use connection pooling; that would be the next improvement. conn.close() would then return the connection to the pool (without closing it), and the pool is responsible for testing connections for liveness.

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
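A sketch of the pooling approach suggested here, assuming Apache Commons DBCP 2 as the pool (DIH does not ship one; the URL, credentials, and query are hypothetical):

{code}
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.apache.commons.dbcp2.BasicDataSource;

public class PooledImport {
  public static void main(String[] args) throws SQLException {
    BasicDataSource pool = new BasicDataSource();
    pool.setUrl("jdbc:mysql://localhost/solrimport"); // hypothetical URL
    pool.setUsername("solr");
    pool.setPassword("secret");
    pool.setMaxTotal(16);                 // one connection per import thread
    pool.setValidationQuery("SELECT 1");  // liveness test on checkout

    Connection c = pool.getConnection();  // borrowed from the pool
    try {
      Statement st = c.createStatement();
      ResultSet rs = st.executeQuery("SELECT id FROM item"); // hypothetical
      while (rs.next()) {
        // ... feed the row to the import ...
      }
    } finally {
      // Returns the connection to the pool; does not close the socket.
      c.close();
    }
  }
}
{code}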
[jira] [Updated] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: SOLR-2233.patch

- fixed a small bug with closeResources()
- each ResultSetIterator now has its own (separate) Connection instance: extremely good for performance (multithreading), but not transactional (different connections can return different results); we are optimistic

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch, SOLR-2233.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Updated] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: SOLR-2233-001.patch

To avoid mistakes I added a version number: SOLR-2233-001.patch (the previous attachment was wrong).

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-001.patch, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch, SOLR-2233.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Updated] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: (was: SOLR-2233.patch)

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Updated] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: (was: FE-patch.txt)

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Updated] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: (was: SOLR-2233-JdbcDataSource.patch)

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Updated] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: SOLR-2233-001.patch

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: SOLR-2233-001.patch, SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Commented] (SOLR-304) Dynamic fields cause IsValidUpdateIndexDocument to fail
[ https://issues.apache.org/jira/browse/SOLR-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044465#comment-13044465 ]

Fuad Efendi commented on SOLR-304:
----------------------------------

Such an old bug report, and no watchers; case closed, cannot reproduce ;-)

Dynamic fields cause IsValidUpdateIndexDocument to fail
-------------------------------------------------------

                Key: SOLR-304
                URL: https://issues.apache.org/jira/browse/SOLR-304
            Project: Solr
         Issue Type: Bug
         Components: clients - C#
   Affects Versions: 1.2
           Reporter: Jeff Rodenburg
           Assignee: Jeff Rodenburg

I am using solrsharp-1.2-07082007. I have a dynamicField declared in my schema.xml file as
<dynamicField name="*_demo" type="text_ws" indexed="true" stored="true"/>
but if I try to add a field from my VB.NET application with doc.Add("id_demo", s), where s is a string value, the document fails solrSearcher.SolrSchema.IsValidUpdateIndexDocument(doc).
[jira] [Updated] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Affects Version/s: 1.4
                       1.4.1
                       3.1
                       3.2

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.4, 1.4.1, 1.5, 3.1, 3.2
           Reporter: Fuad Efendi
        Attachments: SOLR-2233-001.patch, SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Commented] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041788#comment-13041788 ]

Fuad Efendi commented on SOLR-2233:
-----------------------------------

Hi Frank, yes, correct; although it's hard to recall what I did... unfortunately it got reformatted... I can resubmit (generate the patch in the Lucene style), but it's better to redo it from scratch. The existing code doesn't run multithreaded, and it is slow even single-threaded (inappropriate JDBC usage).

I completely removed this code:
{code}
-  private Connection getConnection() throws Exception {
-    long currTime = System.currentTimeMillis();
-    if (currTime - connLastUsed > CONN_TIME_OUT) {
-      synchronized (this) {
-        Connection tmpConn = factory.call();
-        closeConnection();
-        connLastUsed = System.currentTimeMillis();
-        return conn = tmpConn;
-      }
-    } else {
-      connLastUsed = currTime;
-      return conn;
-    }
-  }
{code}

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Commented] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041811#comment-13041811 ]

Fuad Efendi commented on SOLR-2233:
-----------------------------------

The existing implementation uses a single Connection for a 10-second time interval, and even shares this object with other threads (if you run multithreaded). So the problem becomes vendor specific: to open a new connection to Oracle 10g, for instance, we need to authenticate, and in dedicated-server mode that can take a long time, plus dedicated resources for each connection; the server can get overloaded. MySQL, on the other hand, does not close the connection internally (even if you call conn.close() in your code); the connection is simply returned to a pool of connection objects. And what if something goes wrong (what if MySQL or Oracle internals need additional time for closing or opening)? We might even hit problems like "too many connections". Modern apps don't see that because they use managed connection pooling instead of close-open...

I need to verify this patch; it was a quick solution to make the threads=... attribute work, and it currently runs in a production system (MS SQL).

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Commented] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041884#comment-13041884 ]

Fuad Efendi commented on SOLR-2233:
-----------------------------------

Hi Frank, thanks for the patch; unfortunately it is not thread safe... If you don't mind, let me continue working on this; I want to use an internal connection pool (if a JNDI data source is not available)...

My initial patch already contains *too much*; the new one will remove ResultSetIterator and make the code much simpler to understand (and multithreaded). The code shouldn't have any dependency on rare *optionally supported* patterns such as ResultSet.TYPE_FORWARD_ONLY; READ_ONLY should be managed differently (and it is hard to manage if the data size is huge and the data is concurrently updated while we are importing it).

A possible solution could be connection.close() after reading each single record (with the initial query returning the PKs of the records), but that would be a next step... I wrote the initial patch for a production system where complex 10-query-based documents (about 500k docs) took many hours to import (and now it takes only about 40 minutes). (And what happens if we have a network problem while we are in the middle of an Iterator?)

Thanks

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Commented] (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034999#comment-13034999 ]

Fuad Efendi commented on LUCENE-2230:
-------------------------------------

I believe this issue should be closed due to the significant performance improvements related to LUCENE-2089 and LUCENE-2258. I don't think there is any interest from the community in continuing with this (BK-tree and Strike a Match) naive approach, although some people found it useful. Of course we might add a few more distance implementations as a separate improvement. Please close it. Thanks

Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
----------------------------------------------------------------

                Key: LUCENE-2230
                URL: https://issues.apache.org/jira/browse/LUCENE-2230
            Project: Lucene - Java
         Issue Type: Improvement
         Components: core/search
   Affects Versions: 3.0
        Environment: Lucene currently uses a brute-force full-term scanner and calculates the distance for each term. The new BKTree structure improves performance on average 20 times when the distance is 1, and 3 times when the distance is 3. I tested with an index of several million docs and 250,000 terms. The new algorithm uses integer distances between objects.
           Reporter: Fuad Efendi
        Attachments: BKTree.java, Distance.java, DistanceImpl.java, FuzzyTermEnumNEW.java, FuzzyTermEnumNEW.java

  Original Estimate: 1m
 Remaining Estimate: 1m

W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973
http://portal.acm.org/citation.cfm?doid=362003.362025
I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick Johnson, Google).
Additionally, the simplified algorithm at http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more logically correct than Levenshtein distance, and it is 3-5 times faster (isolated tests).
Big list of distance implementations: http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm
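For context, here is a minimal BK-tree sketch with a plain Levenshtein metric (an illustrative reimplementation, not the attached BKTree.java):

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal BK-tree over an integer string metric (Levenshtein here).
 *  Search visits only children whose edge distance d satisfies
 *  |d - dist(query, node)| <= maxDist, pruning most of the term set. */
public class BKTree {

  private String term;
  private final Map<Integer, BKTree> children = new HashMap<Integer, BKTree>();

  public void add(String s) {
    if (term == null) { term = s; return; }
    int d = levenshtein(s, term);
    BKTree child = children.get(d);
    if (child == null) {
      child = new BKTree();
      children.put(d, child);
    }
    child.add(s);
  }

  public List<String> search(String query, int maxDist) {
    List<String> hits = new ArrayList<String>();
    collect(query, maxDist, hits);
    return hits;
  }

  private void collect(String query, int maxDist, List<String> hits) {
    if (term == null) return;
    int d = levenshtein(query, term);
    if (d <= maxDist) hits.add(term);
    // Triangle inequality: only these edges can contain matches.
    for (int edge = d - maxDist; edge <= d + maxDist; edge++) {
      BKTree child = children.get(edge);
      if (child != null) child.collect(query, maxDist, hits);
    }
  }

  static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1], cur = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;
    for (int i = 1; i <= a.length(); i++) {
      cur[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
        cur[j] = Math.min(sub, Math.min(prev[j] + 1, cur[j - 1] + 1));
      }
      int[] t = prev; prev = cur; cur = t;
    }
    return prev[b.length()];
  }
}
{code}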
[jira] [Commented] (SOLR-2338) improved per-field similarity integration into schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024811#comment-13024811 ]

Fuad Efendi commented on SOLR-2338:
-----------------------------------

test-files/solr/conf/schema.xml contains a sample of the per-field definitions; example/solr/schema.xml doesn't have one yet.

improved per-field similarity integration into schema.xml
----------------------------------------------------------

                Key: SOLR-2338
                URL: https://issues.apache.org/jira/browse/SOLR-2338
            Project: Solr
         Issue Type: Improvement
         Components: Schema and Analysis
   Affects Versions: 4.0
           Reporter: Robert Muir
           Assignee: Robert Muir
            Fix For: 4.0
        Attachments: SOLR-2338.patch, SOLR-2338.patch, SOLR-2338.patch

Currently, since LUCENE-2236, we can enable Similarity per field, but in schema.xml there is only a 'global' factory for the SimilarityProvider. In my opinion this is too low-level, because to customize Similarity on a per-field basis you have to set your own CustomSimilarityProvider with <similarity class="..."/> and manage the per-field mapping yourself in Java code. Instead I think it would be better if you could just specify the Similarity in the FieldType, like after <analyzer>. As far as the example goes, one idea from LUCENE-1360 was to make a short_text or metadata_text used by the various metadata fields in the example that has better norm quantization for its shortness...
[jira] [Commented] (SOLR-792) Pivot (ie: Decision Tree) Faceting Component
[ https://issues.apache.org/jira/browse/SOLR-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015415#comment-13015415 ]

Fuad Efendi commented on SOLR-792:
----------------------------------

Hi, Jason Folk posted:
bq. facet.tree currently seems to bark at exclusion tags, I wouldn't mind trying to take a crack at this (as I currently do need it), but not really sure where to begin looking.

Is this resolved? My client currently uses pivot faceting in production, a few million records. If it's not resolved yet I can dig into it...

Pivot (ie: Decision Tree) Faceting Component
--------------------------------------------

                Key: SOLR-792
                URL: https://issues.apache.org/jira/browse/SOLR-792
            Project: Solr
         Issue Type: New Feature
           Reporter: Erik Hatcher
           Assignee: Yonik Seeley
           Priority: Minor
        Attachments: SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-as-helper-class.patch, SOLR-792-raw-type.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch

A component to do multi-level faceting.
[jira] Commented: (SOLR-2006) DataImportHandler creates multiple DB connections during a delta update
[ https://issues.apache.org/jira/browse/SOLR-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968866#action_12968866 ]

Fuad Efendi commented on SOLR-2006:
-----------------------------------

I believe it is resolved in SOLR-2233.

DataImportHandler creates multiple DB connections during a delta update
------------------------------------------------------------------------

                Key: SOLR-2006
                URL: https://issues.apache.org/jira/browse/SOLR-2006
            Project: Solr
         Issue Type: Improvement
         Components: contrib - DataImportHandler
   Affects Versions: 1.4, 1.4.1, 3.1, 4.0
           Reporter: Lance Norskog

The DataImportHandler code for delta updates creates a separate copy of each datasource for each entity in the document. This creates a separate JDBC connection for each entity. In some relational databases, connections are a heavyweight resource and their use should be limited. A JDBC pool would help avoid this problem, and also assist in doing multi-threaded DIH indexing jobs.
[jira] Commented: (SOLR-1828) DIH Handler separate connection for delta and full index
[ https://issues.apache.org/jira/browse/SOLR-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968869#action_12968869 ]

Fuad Efendi commented on SOLR-1828:
-----------------------------------

Related issue patch: SOLR-2233

DIH Handler separate connection for delta and full index
---------------------------------------------------------

                Key: SOLR-1828
                URL: https://issues.apache.org/jira/browse/SOLR-1828
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.4
        Environment: Linux
           Reporter: Bill Bell

We would like to configure the DIH handler to use a SLAVE connection for FULL imports, and a MASTER connection for DELTA imports.
Use case:
1. The DIH full index slams the database pretty hard, and we would like those to run once a day on the SLAVE MySQL connection.
2. The DIH delta index does not hit the database very hard, and we would like that to run off the MASTER MySQL connection.
Currently the DIH handler does not allow a name="db-1" on the deltaQuery=; it is only at the entity level. Please add it to each delta, full, etc. as an option.
[jira] Issue Comment Edited: (SOLR-1828) DIH Handler separate connection for delta and full index
[ https://issues.apache.org/jira/browse/SOLR-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968869#action_12968869 ]

Fuad Efendi edited comment on SOLR-1828 at 12/7/10 1:55 PM:
------------------------------------------------------------

Performance-related issue patch: SOLR-2233

This seems to be wrong: MySQL is better optimized for read-mostly...? It shouldn't be like that... all reads should go to the slave...

was (Author: funtick):
Related issue patch: SOLR-2233

DIH Handler separate connection for delta and full index
---------------------------------------------------------

                Key: SOLR-1828
                URL: https://issues.apache.org/jira/browse/SOLR-1828
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.4
        Environment: Linux
           Reporter: Bill Bell

We would like to configure the DIH handler to use a SLAVE connection for FULL imports, and a MASTER connection for DELTA imports.
Use case:
1. The DIH full index slams the database pretty hard, and we would like those to run once a day on the SLAVE MySQL connection.
2. The DIH delta index does not hit the database very hard, and we would like that to run off the MASTER MySQL connection.
Currently the DIH handler does not allow a name="db-1" on the deltaQuery=; it is only at the entity level. Please add it to each delta, full, etc. as an option.
[jira] Issue Comment Edited: (SOLR-1828) DIH Handler separate connection for delta and full index
[ https://issues.apache.org/jira/browse/SOLR-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968869#action_12968869 ]

Fuad Efendi edited comment on SOLR-1828 at 12/7/10 1:56 PM:
------------------------------------------------------------

Performance-related issue patch: SOLR-2233

This seems to be wrong: MySQL-MASTER is better optimized for read-mostly?! It shouldn't be like that... all reads should go to the slave...

was (Author: funtick):
Performance-related issue patch: SOLR-2233

This seems to be wrong: MySQL is better optimized for read-mostly...? It shouldn't be like that... all reads should go to the slave...

DIH Handler separate connection for delta and full index
---------------------------------------------------------

                Key: SOLR-1828
                URL: https://issues.apache.org/jira/browse/SOLR-1828
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.4
        Environment: Linux
           Reporter: Bill Bell

We would like to configure the DIH handler to use a SLAVE connection for FULL imports, and a MASTER connection for DELTA imports.
Use case:
1. The DIH full index slams the database pretty hard, and we would like those to run once a day on the SLAVE MySQL connection.
2. The DIH delta index does not hit the database very hard, and we would like that to run off the MASTER MySQL connection.
Currently the DIH handler does not allow a name="db-1" on the deltaQuery=; it is only at the entity level. Please add it to each delta, full, etc. as an option.
[jira] Commented: (SOLR-1916) investigate DIH use of default locale
[ https://issues.apache.org/jira/browse/SOLR-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968889#action_12968889 ]

Fuad Efendi commented on SOLR-1916:
-----------------------------------

I had a similar issue: Microsoft SQL Server, DATETIME type. DIH stores the Date on the filesystem using Solr's default timezone and locale. Then delta import executes a query with WHERE last_update_date > '01.12.2010' (just as a sample); a localized string is used instead of a real date. And the timezone of the remote database is not necessarily the same as Solr's. Fortunately, it's easy to fix (without altering code).

investigate DIH use of default locale
--------------------------------------

                Key: SOLR-1916
                URL: https://issues.apache.org/jira/browse/SOLR-1916
            Project: Solr
         Issue Type: Task
         Components: contrib - DataImportHandler
   Affects Versions: 3.1, 4.0
           Reporter: Robert Muir
           Priority: Blocker
            Fix For: 3.1, 4.0

This is a spinoff from LUCENE-2466. In that issue I changed my locale to various locales and found some problems in Lucene/Solr triggered by use of the default Locale. I noticed some use of the default locale for Date operations in DIH (TimeZone.getDefault/Locale.getDefault) and, while no tests fail, I think it might be better to support a locale parameter for this. The wiki documents that numeric parsing can support localized numeric formats: http://wiki.apache.org/solr/DataImportHandler#NumberFormatTransformer
In both cases, I don't think we should ever use the default Locale. If no Locale is provided, I find that new Locale("") -- the Unicode root locale -- is a better default for a server situation in a lot of cases, as it won't change depending on the computer; or perhaps we just make Locale params mandatory for this. Finally, in both cases, if localized numbers/dates are explicitly supported, I think we should come up with a test strategy to ensure everything is working. One idea is to do something similar to, or make use of, Lucene's LocalizedTestCase.
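A small demonstration of the failure mode described above (both date patterns are assumptions, for illustration only):

{code}
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// The same Date renders differently depending on the JVM's default
// locale/timezone, so a delta-import WHERE clause built from it may
// not parse correctly on the database side.
public class LocaleDateDemo {
  public static void main(String[] args) {
    Date d = new Date(0L); // the epoch, just for illustration

    // Locale/timezone dependent: whatever the server JVM happens to use.
    SimpleDateFormat def = new SimpleDateFormat("dd MMM yyyy HH:mm:ss");
    System.out.println(def.format(d));

    // Deterministic: root locale and explicit UTC, safe to embed in SQL
    // (assuming the database expects this pattern).
    SimpleDateFormat fixed =
        new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ROOT);
    fixed.setTimeZone(TimeZone.getTimeZone("UTC"));
    System.out.println(fixed.format(d));
  }
}
{code}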
[jira] Commented: (SOLR-2186) DataImportHandler multi-threaded option throws exception
[ https://issues.apache.org/jira/browse/SOLR-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968890#action_12968890 ]

Fuad Efendi commented on SOLR-2186:
-----------------------------------

I resolved this issue for SQL in SOLR-2233; it was related to 'thread A closes the connection needed by thread B'.

DataImportHandler multi-threaded option throws exception
---------------------------------------------------------

                Key: SOLR-2186
                URL: https://issues.apache.org/jira/browse/SOLR-2186
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
           Reporter: Lance Norskog
           Assignee: Grant Ingersoll
        Attachments: TikaResolver.patch

The multi-threaded option for the DataImportHandler throws an exception and the entire operation fails. This is true even if only 1 thread is configured via *threads='1'*.
[jira] Updated: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Component/s: contrib - DataImportHandler

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] Commented: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931839#action_12931839 ]

Fuad Efendi commented on SOLR-2233:
-----------------------------------

It is 3 times faster after I applied the changes:

Before: 729 documents/minute
After: 2639 documents/minute

In my test there are 10 sub-entities, some of them multi-valued (and it's hard to use CachedJdbcDataSource with composite PKs). I can't explain it by the threads=16 option alone (which this patch makes possible). It is probably the connection close / connection open overhead, which is very expensive for SQL Server (unlike the MySQL JDBC driver, which internally uses connection pooling).

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] Commented: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931401#action_12931401 ]

Fuad Efendi commented on SOLR-2233:
-----------------------------------

The only remaining problem is what to do if the database server closes/drops the connection (for instance, due to timeout settings on the database, heavy load, or a network problem). The more time the indexing takes, the more frequent such problems become. Even a connection pool (accessed via JNDI) won't help, because the existing (and new) code tries to keep the same connection for a long time, without any logic to check that the connection is still alive. What do we do if the database drops the connection while we are in the middle of a RecordSet?

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
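One possible liveness check, sketched with the JDBC 4 Connection.isValid() API (a hypothetical holder class; it cannot rescue a ResultSet that dies mid-iteration, the query would have to be re-run from the last processed key):

{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Validate before each reuse; reconnect if the server dropped the
// connection behind our back.
public class LiveConnectionHolder {
  private final String url;
  private Connection conn;

  public LiveConnectionHolder(String url) {
    this.url = url;
  }

  public synchronized Connection get() throws SQLException {
    if (conn == null || conn.isClosed() || !conn.isValid(5 /* seconds */)) {
      conn = DriverManager.getConnection(url);
    }
    return conn;
  }
}
{code}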
[jira] Created: (SOLR-2231) DataImportHandler - MultiThreaded - Logging
DataImportHandler - MultiThreaded - Logging
-------------------------------------------

                 Key: SOLR-2231
                 URL: https://issues.apache.org/jira/browse/SOLR-2231
             Project: Solr
          Issue Type: Improvement
    Affects Versions: 1.5
            Reporter: Fuad Efendi
            Priority: Trivial

Please use
{code}
if (LOG.isInfoEnabled()) LOG.info(...)
{code}
For instance, line 95 of ThreadedEntityProcessorWrapper creates huge log output which is impossible to manage via logging properties:
{code}
LOG.info("arow : " + arow);
{code}
[jira] Updated: (SOLR-2231) DataImportHandler - MultiThreaded - Logging
[ https://issues.apache.org/jira/browse/SOLR-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2231:
------------------------------

    Description:
Please use
{code}
if (LOG.isInfoEnabled()) LOG.info(...)
{code}
For instance, line 95 of ThreadedEntityProcessorWrapper creates huge log output which is impossible to manage via logging properties:
{code}
LOG.info("arow : " + arow);
{code}
This line (in a loop) will output the results of all SQL queries against the database (and will slow down Solr's performance). It's even better to use LOG.debug instead of LOG.info, since INFO is enabled by default.

  was:
Please use
{code}
if (LOG.isInfoEnabled()) LOG.info(...)
{code}
For instance, line 95 of ThreadedEntityProcessorWrapper creates huge log output which is impossible to manage via logging properties:
{code}
LOG.info("arow : " + arow);
{code}

DataImportHandler - MultiThreaded - Logging
-------------------------------------------

                 Key: SOLR-2231
                 URL: https://issues.apache.org/jira/browse/SOLR-2231
             Project: Solr
          Issue Type: Improvement
    Affects Versions: 1.5
            Reporter: Fuad Efendi
            Priority: Trivial

Please use
{code}
if (LOG.isInfoEnabled()) LOG.info(...)
{code}
For instance, line 95 of ThreadedEntityProcessorWrapper creates huge log output which is impossible to manage via logging properties:
{code}
LOG.info("arow : " + arow);
{code}
This line (in a loop) will output the results of all SQL queries against the database (and will slow down Solr's performance). It's even better to use LOG.debug instead of LOG.info, since INFO is enabled by default.
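A sketch of the options mentioned above, assuming SLF4J as the logging facade (class and method names are illustrative):

{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class RowLogging {
  private static final Logger LOG = LoggerFactory.getLogger(RowLogging.class);

  void process(Object arow) {
    // 1) Guarded: the string is only built when INFO is actually enabled.
    if (LOG.isInfoEnabled()) {
      LOG.info("arow : " + arow);
    }

    // 2) Lower level: DEBUG is off by default, so production imports stay quiet.
    if (LOG.isDebugEnabled()) {
      LOG.debug("arow : " + arow);
    }

    // 3) SLF4J parameterized form: no concatenation unless the level is enabled.
    LOG.debug("arow : {}", arow);
  }
}
{code}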
[jira] Created: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                 Key: SOLR-2233
                 URL: https://issues.apache.org/jira/browse/SOLR-2233
             Project: Solr
          Issue Type: Bug
    Affects Versions: 1.5
            Reporter: Fuad Efendi

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] Updated: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: FE-patch.txt

I need to test it, but the changes are obvious. The JDBC API javadoc says:
{code}
 * <strong>Note:</strong> Support for the <code>isLast</code> method
 * is optional for <code>ResultSet</code>s with a result
 * set type of <code>TYPE_FORWARD_ONLY</code>
{code}
but I am sure everyone supports this.

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] Commented: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931168#action_12931168 ]

Fuad Efendi commented on SOLR-2233:
-----------------------------------

*Performance Tuning*

I have extremely sophisticated SQL; the root entity runs 10-15 subqueries, and I am unable to use {{CachedSqlEntityProcessor}}. That's why I am looking into multithreading. Unfortunately, with the existing approach the connection is closed after each use, and for most databases _creating a connection (authentication, resource allocation) is extremely expensive_. The best approach is to use a container resource (JNDI, connection pooling), but I'll try to find what else can be improved.

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] Updated: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: SOLR-2233-JdbcDataSource.patch

- Connection moved to the top-level class
- a DataSource should be used in a thread-safe manner; multiple threads can use multiple DataSource instances (one per item)
- the Connection should be closed at the end of the import in any case...

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] Commented: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931187#action_12931187 ]

Fuad Efendi commented on SOLR-2233:
-----------------------------------

This is the exception I was talking about (threads=16, 12 sub-entities, with the existing trunk version); note *The connection is closed*:

{code}
org.apache.solr.handler.dataimport.DataImportHandlerException: com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
	at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
	at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:337)
	at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$600(JdbcDataSource.java:226)
	at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:260)
	at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:75)
	at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
	at org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper.nextRow(ThreadedEntityProcessorWrapper.java:84)
	at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:433)
	at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.run(DocBuilder.java:386)
	at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:453)
	at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.access$000(DocBuilder.java:340)
	at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner$1.run(DocBuilder.java:393)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:619)
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
	at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:171)
	at com.microsoft.sqlserver.jdbc.SQLServerConnection.checkClosed(SQLServerConnection.java:319)
	at com.microsoft.sqlserver.jdbc.SQLServerStatement.checkClosed(SQLServerStatement.java:956)
	at com.microsoft.sqlserver.jdbc.SQLServerResultSet.checkClosed(SQLServerResultSet.java:348)
	at com.microsoft.sqlserver.jdbc.SQLServerResultSet.next(SQLServerResultSet.java:915)
	at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:329)
	... 13 more
{code}

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] Updated: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: SOLR-2233-JdbcDataSource.patch

Use {{resultSet.next()}}: the Microsoft JDBC driver doesn't support isLast() for FORWARD_ONLY result sets.

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
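A sketch of iterating a TYPE_FORWARD_ONLY ResultSet with a one-row lookahead so that isLast() is never needed (illustrative, not the patch code):

{code}
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Wraps a forward-only ResultSet as an Iterator using only next():
 *  hasNext() advances once and caches the answer, so the optional
 *  isLast() method is never called. */
public class ResultSetRowIterator implements Iterator<ResultSet> {
  private final ResultSet rs;
  private Boolean hasNextRow; // null = lookahead not fetched yet

  public ResultSetRowIterator(ResultSet rs) {
    this.rs = rs;
  }

  public boolean hasNext() {
    if (hasNextRow == null) {
      try {
        hasNextRow = rs.next();
      } catch (SQLException e) {
        throw new RuntimeException(e);
      }
    }
    return hasNextRow;
  }

  public ResultSet next() {
    if (!hasNext()) throw new NoSuchElementException();
    hasNextRow = null; // consume the cached lookahead
    return rs;         // caller reads columns from the current row
  }

  public void remove() {
    throw new UnsupportedOperationException();
  }
}
{code}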
[jira] Commented: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931225#action_12931225 ] Fuad Efendi commented on SOLR-2233: --- And a real-life test: the root entity contains 10 sub-entities, 16 threads allocated. *before*
{code}
<str name="Time Elapsed">0:1:0.322</str>
<str name="Total Requests made to DataSource">7296</str>
<str name="Total Rows Fetched">8061</str>
<str name="Total Documents Processed">729</str>
{code}
*after*
{code}
<str name="Time Elapsed">0:1:1.184</str>
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">29247</str>
<str name="Total Documents Processed">2639</str>
{code}
Look at that: it seems we no longer close the connection unnecessarily! *Total Requests made to DataSource: 0*
[jira] Updated: (SOLR-792) Pivot (ie: Decision Tree) Faceting Component
[ https://issues.apache.org/jira/browse/SOLR-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated SOLR-792: - Comment: was deleted (was: I believe the recent patch (2010-10-19) causes problems... I get these errors now:
{code}
<lst name="facet_pivot">
  <arr name="ChannelID,ClassificationID">
    <lst>
      <str name="field">ChannelID</str>
      <str name="value">ERROR:SCHEMA-INDEX-MISMATCH,stringValue=`&#8;&#0;&#0;&#0;&#5;</str>
      <int name="count">4491</int>
{code}
And those xxxID fields are of type int, not String... )

> Pivot (ie: Decision Tree) Faceting Component
> --------------------------------------------
>
>                 Key: SOLR-792
>                 URL: https://issues.apache.org/jira/browse/SOLR-792
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Erik Hatcher
>            Assignee: Ryan McKinley
>            Priority: Minor
>         Attachments: SOLR-792-as-helper-class.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-raw-type.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch
>
> A component to do multi-level faceting.
[jira] Commented: (SOLR-792) Tree Faceting Component
[ https://issues.apache.org/jira/browse/SOLR-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914814#action_12914814 ] Fuad Efendi commented on SOLR-792: -- The default value (as seen in the code) is facet.pivot.mincount=1. This confused me during simple tests (it showed wrong results); finally I found I need to explicitly add facet.pivot.mincount=0.
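For illustration, a request that also returns zero-count pivot values might look like the following; the host, core, and the reuse of the ChannelID/ClassificationID fields from the earlier comment are assumptions:
{code}
http://localhost:8983/solr/select?q=*:*&rows=0
    &facet=true
    &facet.pivot=ChannelID,ClassificationID
    &facet.pivot.mincount=0
{code}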
[jira] Commented: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833010#action_12833010 ] Fuad Efendi commented on LUCENE-2230: - LUCENE-2089 - extremely good stuff (the Lucene flex branch; applicable to wildcard queries, regex, and fuzzy search). BKTree improves performance if the distance is 2; otherwise it is almost a full-term scan. Some borrowed links:
http://en.wikipedia.org/wiki/Deterministic_finite-state_machine
http://rcmuir.wordpress.com/2009/12/04/finite-state-queries-for-lucene/
http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198

> Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
> -----------------------------------------------------------------
>
>                 Key: LUCENE-2230
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2230
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.0
>         Environment: Lucene currently uses a brute-force full-terms scanner and calculates the distance for each term. The new BKTree structure improves performance on average 20 times when the distance is 1, and 3 times when the distance is 3. I tested with an index of several million docs and 250,000 terms. The new algo uses integer distances between objects.
>            Reporter: Fuad Efendi
>         Attachments: BKTree.java, Distance.java, DistanceImpl.java, FuzzyTermEnumNEW.java, FuzzyTermEnumNEW.java
>
>   Original Estimate: 0.02h
>  Remaining Estimate: 0.02h
>
> W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973 http://portal.acm.org/citation.cfm?doid=362003.362025
> I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick Johnson, Google).
> Additionally, the simplified algorithm at http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more logically correct than Levenshtein distance, and it is 3-5 times faster (isolated tests).
> Big list of distance implementations: http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm
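For reference, the Burkhard-Keller idea behind the attached BKTree.java: children are keyed by their integer distance to the parent, and the triangle inequality lets a lookup with threshold k skip every edge outside [d-k, d+k]. A minimal sketch of the structure, not the attached implementation:
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntBiFunction;

/** Minimal BK-tree over strings with a pluggable integer metric. */
final class BkTree {
  private static final class Node {
    final String term;
    final Map<Integer, Node> children = new HashMap<>();
    Node(String term) { this.term = term; }
  }

  private final ToIntBiFunction<String, String> metric;
  private Node root;

  BkTree(ToIntBiFunction<String, String> metric) { this.metric = metric; }

  void add(String term) {
    if (root == null) { root = new Node(term); return; }
    Node node = root;
    while (true) {
      int d = metric.applyAsInt(term, node.term);
      if (d == 0) return;                        // already present
      Node child = node.children.get(d);
      if (child == null) { node.children.put(d, new Node(term)); return; }
      node = child;
    }
  }

  List<String> search(String query, int k) {
    List<String> out = new ArrayList<>();
    if (root != null) collect(root, query, k, out);
    return out;
  }

  private void collect(Node node, String query, int k, List<String> out) {
    int d = metric.applyAsInt(query, node.term);
    if (d <= k) out.add(node.term);
    for (int i = d - k; i <= d + k; i++) {       // triangle-inequality pruning
      Node child = node.children.get(i);
      if (child != null) collect(child, query, k, out);
    }
  }
}
{code}
The pruning interval is what gives the speedup at small k; at k=4 the interval [d-4, d+4] covers most edges, which matches the observation in this thread that larger thresholds degrade to a near full scan.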
[jira] Commented: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832027#action_12832027 ] Fuad Efendi commented on LUCENE-2230: - Hi Uwe, Thanks for the analysis! I spent only a few days on this basic PoC. I also need the IndexReader (index version number, etc.) to rewarm the cache; if a term disappeared from the index we can still leave it in the BKTree (not a problem; we can't remove it!), and if we have a new term we simply call {code}public void add(E term){code} Synchronization should be significantly improved... Cache warming takes 10-15 seconds in my environment, for about 250k tokens, and I use a TreeSet internally for fast lookup. I also believe the main performance issue is related to the Levenshtein algo (which is significantly improved in trunk; plus synchronization is removed from FuzzySearch: LUCENE-2258). Regarding memory requirements: BKTree is not heavy... I should use {code}StringHelper.intern(fld);{code} - it's already in memory... and FuzzyTermEnum uses almost the same amount of memory for processing as BKTree. I'll check FieldCache. The BKTree approach can be significantly improved.
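The rewarming idea mentioned in this thread (a background thread that builds a new BKTree instance without hurting end users) can be sketched as an atomic swap, reusing the BkTree class from the sketch above. Illustrative only; none of these names are from the attached code:
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.ToIntBiFunction;

/** Readers always see a fully built tree; one background task rebuilds
 *  from the current term dictionary and swaps the reference atomically. */
final class BkTreeHolder {
  private final AtomicReference<BkTree> current = new AtomicReference<>();
  private final ScheduledExecutorService rebuilder =
      Executors.newSingleThreadScheduledExecutor();

  void start(Iterable<String> termSource,
             ToIntBiFunction<String, String> metric, long periodSeconds) {
    rebuilder.scheduleWithFixedDelay(() -> {
      BkTree fresh = new BkTree(metric);       // build off to the side
      for (String term : termSource) fresh.add(term);
      current.set(fresh);                      // searches switch atomically
    }, 0, periodSeconds, TimeUnit.SECONDS);
  }

  BkTree tree() { return current.get(); }      // null until the first build completes
}
{code}
Because the swap replaces the whole tree, no locking is needed on the read path, which addresses the synchronization concern above.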
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832033#action_12832033 ] Fuad Efendi commented on LUCENE-2089: - Downloadable article (PDF): http://www.mitpressjournals.org/doi/pdf/10.1162/0891201042544938?cookieSet=1

> explore using automaton for fuzzyquery
> --------------------------------------
>
>                 Key: LUCENE-2089
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2089
>             Project: Lucene - Java
>          Issue Type: Wish
>          Components: Search
>            Reporter: Robert Muir
>            Assignee: Mark Miller
>            Priority: Minor
>         Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java
>
> Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is itching to write that nasty algorithm) we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
> * up front, calculate the maximum required K edits needed to match the users supplied float threshold.
> * for at least small common E up to some max K (1,2,3, etc) we should create a DFA for each E. if the required E is above our supported max, we use dumb mode at first (no seeking, no DFA, just brute force like now). As the pq fills, we swap progressively lower DFAs into the enum, based upon the lowest score in the pq. This should work well on avg, at high E, you will typically fill the pq very quickly since you will match many terms. This not only provides a mechanism to switch to more efficient DFAs during enumeration, but also to switch from dumb mode to smart mode.
> i modified my wildcard benchmark to generate random fuzzy queries.
> * Pattern: 7N stands for NNN, etc.
> * AvgMS_DFA: this is the time spent creating the automaton (constructor)
> ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
> |7N|10|64.0|4155.9|38.6|20.3|
> |14N|10|0.0|2511.6|46.0|37.9|
> |28N|10|0.0|2506.3|93.0|86.6|
> |56N|10|0.0|2524.5|304.4|298.5|
> as you can see, this prototype is no good yet, because it creates the DFA in a slow way. right now it creates an NFA, and all this wasted time is in NFA-DFA conversion. So, for a very long string, it just gets worse and worse. This has nothing to do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), there is no problem there. instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 we can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok. the paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization, if someone wants to implement this they should not worry about minimization. in fact, we need to at some point determine if AutomatonQuery should even minimize FSM's at all, or if it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes minimal DFA is the Dumb vs Smart heuristic and this can be rewritten as a summation easily). we need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out if minimization is even helping right now.
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832049#action_12832049 ] Fuad Efendi commented on LUCENE-2089: - Ok; I am trying to study DFA/NFA and to compare with LUCENE-2230 (the BKTree size is fixed, without dependency on distance, but we need to hard-cache it...). What I found is that the classic Levenshtein algo eats 75% of the CPU, and the classic brute-force TermEnum 25%... The distance (submitted by the end user) must be an integer...
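For context, this is the classic dynamic-programming Levenshtein distance that dominates the profile; the integer threshold k with a row-minimum early exit is an illustrative addition, not Lucene's implementation:
{code}
/** Classic DP Levenshtein with a row-minimum early exit once the
 *  distance provably exceeds the integer threshold k. */
final class Levenshtein {
  static int distance(String a, String b, int k) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      int rowMin = curr[0];
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                    prev[j] + 1),      // deletion
                           prev[j - 1] + cost);        // substitution
        rowMin = Math.min(rowMin, curr[j]);
      }
      if (rowMin > k) return k + 1;            // cannot get back under k
      int[] tmp = prev; prev = curr; curr = tmp;
    }
    return prev[b.length()];
  }
}
{code}
The early exit is sound because the minimum entry of each DP row never decreases from one row to the next, so once it exceeds k the final distance must as well.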
[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832096#action_12832096 ] Fuad Efendi edited comment on LUCENE-2230 at 2/10/10 5:56 PM: -- Hi Uwe, I am trying to study LUCENE-2258 right now...
bq. BKTree contains terms no longer available
BKTree contains objects, not terms; in my sample it contains Strings: new BKTree<String>(new Distance()). It is a structure for fast lookup of close objects from a set of objects, with a predefined distance algorithm. It won't hurt if a String appears in the BKTree structure while the corresponding Term disappeared from the Index; search results will be the same. Simply, a search for DisappearedTerm OR AnotherTerm is the same as a search for AnotherTerm. At least, we can run a background thread which will create a new BKTree instance, without hurting end users. Yes, Term-to-String conversion is another thing to do... I recreate fake terms in the TermEnum... BKTree allows iterating over only about 5-10% of the whole structure to find the closest matches, but only if the distance threshold is small (2). If it is 4, there is almost no improvement. And the classic Levenshtein distance is slow...
[jira] Commented: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832096#action_12832096 ] Fuad Efendi commented on LUCENE-2230: - Hi Uwe, I am trying to study LUCENE-2258 right now...
bq. BKTree contains terms no longer available
BKTree contains objects, not terms; in my sample it contains Strings: new BKTree<String>(new Distance()). It is a structure for fast lookup of close objects from a set of objects, with a predefined distance algorithm. It won't hurt if a String appears in the BKTree structure while the corresponding Term disappeared from the Index; search results will be the same. Simply, a search for DisappearedTerm OR AnotherTerm is the same as a search for AnotherTerm. At least, we can run a background thread which will create a new BKTree instance, without hurting end users. Yes, Term-to-String conversion is another thing to do... I recreate fake terms in the TermEnum... BKTree allows iterating over only about 5-10% of the whole structure to find the closest matches, but only if the distance threshold is small (2). If it is 4, there is almost no improvement. And the classic Levenshtein distance is slow...
[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832096#action_12832096 ] Fuad Efendi edited comment on LUCENE-2230 at 2/10/10 6:22 PM: -- Hi Uwe, I am trying to study LUCENE-2258 right now...
bq. BKTree contains terms no longer available
BKTree contains objects, not terms; in my sample it contains Strings: new BKTree<String>(new Distance()). It is a structure for fast lookup of close objects from a set of objects, with a predefined distance algorithm. It won't hurt if a String appears in the BKTree structure while the corresponding Term disappeared from the Index; search results will be the same. Simply, a search for DisappearedTerm OR AnotherTerm is the same as a search for AnotherTerm. At least, we can run a background thread which will create a new BKTree instance, without hurting end users. Yes, Term-to-String conversion is another thing to do... I recreate fake terms in the TermEnum... BKTree allows iterating over only about 5-10% of the whole structure to find the closest matches, but only if the distance threshold is small (2). If it is 4, there is almost no improvement. And the classic Levenshtein distance is slow... Edited: trying to study LUCENE-2089...
[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832049#action_12832049 ] Fuad Efendi edited comment on LUCENE-2089 at 2/10/10 6:25 PM: -- Ok; I am trying to study DFA/NFA and to compare with LUCENE-2230 (the BKTree size is fixed, without dependency on distance, but we need to hard-cache it...). What I found is that the classic Levenshtein algo eats 75% of the CPU, and the classic brute-force TermEnum 25%... The distance (submitted by the end user) must be an integer... Edited: BKTree memory requirements don't depend on the distance threshold etc.; but BKTree can help only if the threshold is small, otherwise it is similar to a full scan.
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832130#action_12832130 ] Fuad Efendi commented on LUCENE-2089: - What about this: http://www.catalysoft.com/articles/StrikeAMatch.html - it seems logically more appropriate for (human-entered) text objects than Levenshtein distance, and it is (in theory) extremely fast; is the DFA-based distance faster?
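The StrikeAMatch measure from that article is a Dice coefficient over adjacent letter pairs; a compact sketch of the idea (not the article's exact code):
{code}
import java.util.ArrayList;
import java.util.List;

/** Similarity in [0,1]: twice the number of shared adjacent letter
 *  pairs divided by the total number of pairs in both strings. */
final class StrikeAMatch {
  private static List<String> pairs(String s) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i + 1 < s.length(); i++) out.add(s.substring(i, i + 2));
    return out;
  }

  static double similarity(String a, String b) {
    List<String> p1 = pairs(a.toUpperCase());
    List<String> p2 = pairs(b.toUpperCase());
    int union = p1.size() + p2.size();
    if (union == 0) return 0.0;
    int hits = 0;
    for (String pair : p1) {
      int j = p2.indexOf(pair);
      if (j >= 0) { hits++; p2.remove(j); }    // each pair matches at most once
    }
    return (2.0 * hits) / union;
  }
}
{code}
For example, similarity("FRANCE", "FRENCH") = 2*2/(5+5) = 0.4, since only the pairs FR and NC are shared.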
[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832130#action_12832130 ] Fuad Efendi edited comment on LUCENE-2089 at 2/10/10 7:09 PM: -- What about this: http://www.catalysoft.com/articles/StrikeAMatch.html It seems logically more appropriate for (human-entered) text objects than Levenshtein distance, and it is (in theory) extremely fast; is the DFA-based distance faster?
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832143#action_12832143 ] Fuad Efendi commented on LUCENE-2089: - Hi Robert, Yes, I agree; we need to stick with Levenshtein distance, also to isolate performance comparisons: same distance, but FuzzyTermEnum with a full scan vs. the DFA-based approach; and we need to be able to compare the old relevance with the new one (with an integer distance threshold it is not the same as with the classic floating point...). Thanks for the link to your article! What if we could store some precounted values in the index... such as storing similar terms in an additional field... or some pieces of the DFA (which I still need to learn...)
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832173#action_12832173 ] Fuad Efendi commented on LUCENE-2089: - For LUCENE-2230 I did a lot of long-run load-stress tests (against SOLR), but before doing that I created a baseline for the static admin screen in SOLR: 1500 TPS. And I reached 220 TPS with Fuzzy Search... What I am trying to say is this: can the DFA with Levenshtein reach 250 TPS (in a real-world multi-tier web environment)? The baseline for a static page is 1500. Also, is the DFA mostly CPU-bound? Can we improve it by offloading some of the work to I/O? Just joking ;)
[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832173#action_12832173 ] Fuad Efendi edited comment on LUCENE-2089 at 2/10/10 8:27 PM: -- For LUCENE-2230 I did a lot of long-run load-stress tests (against SOLR), but before doing that I created a baseline for the static admin screen in SOLR: 1500 TPS. And I reached 220 TPS with Fuzzy Search... What I am trying to say is this: can the DFA with Levenshtein reach 250 TPS (in a real-world multi-tier web environment)? The baseline for a static page is 1500. Also, is the DFA mostly CPU-bound? Can we improve it by offloading some of the work to I/O? Just joking ;) I explicitly used a distance threshold of 2; that's why 220 TPS... with a threshold of 5 it would be 50 TPS or maybe less... If the DFA has no dependency on the threshold, it is the way to go.
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832194#action_12832194 ] Fuad Efendi commented on LUCENE-2089: - Ok, now I understand what AutomatonQuery is... Frankly, I had this idea: create a small dictionary of similar words, create terms from those words, and execute the query Word1 OR Word2 OR ... instead of scanning the whole term dictionary; but how small will such a dictionary be in the case of, for instance, dogs... is the size of the dictionary (in the case of ASCII characters) 26*26*26*26? Or 65536*65536*65536*65536 in the case of Unicode? Simple. Is it so simple?
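The dictionary sketched above does not need the whole alphabet cross-product: for a term of length n over an alphabet of size A, the words within one edit number only O(A*n), since each candidate differs by a single deletion, substitution, or insertion. An illustrative generator (hypothetical, not from the patches):
{code}
import java.util.LinkedHashSet;
import java.util.Set;

/** All strings within Levenshtein distance 1 of a term, over a fixed
 *  alphabet: deletions, substitutions, and insertions. */
final class Neighborhood {
  static Set<String> withinOneEdit(String term, char[] alphabet) {
    Set<String> out = new LinkedHashSet<>();
    for (int i = 0; i < term.length(); i++)              // deletions
      out.add(term.substring(0, i) + term.substring(i + 1));
    for (int i = 0; i < term.length(); i++)              // substitutions
      for (char c : alphabet)
        out.add(term.substring(0, i) + c + term.substring(i + 1));
    for (int i = 0; i <= term.length(); i++)             // insertions
      for (char c : alphabet)
        out.add(term.substring(0, i) + c + term.substring(i));
    out.remove(term);                                    // distance 1, not 0
    return out;
  }
}
{code}
For "dogs" over a 26-letter alphabet this is at most 4 + 4*26 + 5*26 = 238 candidate strings before deduplication, not 26*26*26*26.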
[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832194#action_12832194 ] Fuad Efendi edited comment on LUCENE-2089 at 2/10/10 9:22 PM: -- OK, now I understand what AutomatonQuery is... Frankly, I had the same idea: create a small dictionary of similar words, create terms from those words, and execute the query Word1 OR Word2 OR ... instead of scanning the whole term dictionary. But how small would such a dictionary be for, say, dogs? Is the dictionary size (in the case of ASCII characters) 26*26*26*26? Or 65536*65536*65536*65536 in the case of Unicode? Simple. Is it really so simple? Even with Unicode, we can precount the set of characters actually used by a specific field instance; it might be only 36 characters. Then we issue a query like dogs OR aaabdogs OR ... OR dogs, and, if we can quickly find the intersection of the custom dictionary with the terms dictionary, then build the query... Am I on the correct path with this understanding?
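If the candidate dictionary and the index's term dictionary are both kept in sorted order (Lucene's term dictionary is), the intersection asked about above can be computed in one merge pass instead of one lookup per candidate. A minimal, library-free Java sketch under that assumption (the names are illustrative, not Lucene API):

{code}
import java.util.ArrayList;
import java.util.List;

public class SortedIntersection {
    // Merge-style intersection of two sorted, de-duplicated string lists:
    // O(|candidates| + |terms|) comparisons, no per-candidate dictionary seek.
    static List<String> intersect(List<String> candidates, List<String> terms) {
        List<String> hits = new ArrayList<String>();
        int i = 0, j = 0;
        while (i < candidates.size() && j < terms.size()) {
            int cmp = candidates.get(i).compareTo(terms.get(j));
            if (cmp == 0) { hits.add(candidates.get(i)); i++; j++; }
            else if (cmp < 0) i++;
            else j++;
        }
        return hits; // these become the OR-ed terms of the rewritten query
    }
}
{code}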
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832360#action_12832360 ] Fuad Efendi commented on LUCENE-2089: - Levenshtein distance is good for the spelling-corrections use case (even the terminology is the same: insert, delete, replace...), but it is not good for more generic similarity: the distance between RunAutomation and AutomationRun is huge (6!). That is really a two-word combination, though, and I don't know a good one-(human)-word use case where Levenshtein distance is not good (or natural). From another viewpoint, I can't see any use case where StrikeAMatch (counts of 2-char similarities) is bad, although it is not spelling correction. And, from a third viewpoint, if we totally forget that this is human-generated input and implement Levenshtein distance on raw bitsets instead of Unicode characters (the end user clicks on a keyboard)... we will get totally unacceptable results... I believe such distance algorithms were initially designed many years ago, before the Internet (and search), to allow auto-recovery during data transmission (the first astronauts...), but auto-recovery was based on the fact that an (acceptably) corrupted code has one and only one closest match in the dictionary, so it was extremely fast (50 years ago). And now we are using an old algorithm designed for a completely different use case (fixed-size bitset transmission) for spelling corrections... What if we focus on a keyboard (101 keys?) instead of Unicode... For spelling corrections, 20 ms is not good; that is only 50 TPS (on a single core)...
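For reference, a compact Java sketch of the letter-pair similarity ("StrikeAMatch", after the catalysoft.com article cited elsewhere in this thread), simplified here to whole-string bigrams rather than the article's per-word pairs; this is a paraphrase of the published idea, not the issue's attached code. It scores RunAutomation vs AutomationRun highly even though their Levenshtein distance is 6:

{code}
import java.util.ArrayList;
import java.util.List;

public class LetterPairSimilarity {
    static List<String> pairs(String s) {
        List<String> ps = new ArrayList<String>();
        for (int i = 0; i < s.length() - 1; i++) {
            ps.add(s.substring(i, i + 2));
        }
        return ps;
    }

    // 2 * |shared bigrams| / (|bigrams(a)| + |bigrams(b)|), in [0, 1].
    static double similarity(String a, String b) {
        List<String> pa = pairs(a.toLowerCase()), pb = pairs(b.toLowerCase());
        int total = pa.size() + pb.size(), shared = 0;
        for (String p : pa) {
            if (pb.remove(p)) shared++;   // multiset intersection
        }
        return total == 0 ? 0.0 : (2.0 * shared) / total;
    }

    public static void main(String[] args) {
        System.out.println(similarity("RunAutomation", "AutomationRun")); // ~0.9
    }
}
{code}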
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanelfocusedCommentId=12832368#action_12832368 ] Fuad Efendi commented on LUCENE-2089: - Another idea (similar to the 50-year-old auto-recovery); it doesn't let me sleep :) What if we do all distance calculations (and other types of calculations) at indexing time instead of at query time? For instance, we may have an index structure like {Term, List[MisspelledTerm, Distance]}, and we can query this structure by {MisspelledTerm, Distance}. We mentioned it here already, in LUCENE-1513, but our use case is very specific... And why allow 5 spelling mistakes in Unicode if the user's input contains only 3 Latin-1 characters? We should hardcode some constraints.
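A minimal sketch of the proposed {Term, List[MisspelledTerm, Distance]} structure, with a plain in-memory map standing in for the index and the variant generator left abstract (all names here are placeholders, not Lucene API):

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MisspellingIndex {
    // variant -> list of (canonical term, distance) entries
    private final Map<String, List<Object[]>> index = new HashMap<String, List<Object[]>>();

    // Indexing time: register precomputed variants of 'term' and their distances.
    void add(String term, Map<String, Integer> variantToDistance) {
        for (Map.Entry<String, Integer> e : variantToDistance.entrySet()) {
            List<Object[]> hits = index.get(e.getKey());
            if (hits == null) {
                hits = new ArrayList<Object[]>();
                index.put(e.getKey(), hits);
            }
            hits.add(new Object[] { term, e.getValue() });
        }
    }

    // Query time: a single hash lookup replaces the term-dictionary scan.
    List<Object[]> lookup(String misspelled, int maxDistance) {
        List<Object[]> out = new ArrayList<Object[]>();
        List<Object[]> hits = index.get(misspelled);
        if (hits != null) {
            for (Object[] h : hits) {
                if ((Integer) h[1] <= maxDistance) out.add(h);
            }
        }
        return out;
    }
}
{code}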
[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832368#action_12832368 ] Fuad Efendi edited comment on LUCENE-2089 at 2/11/10 3:05 AM: -- Another idea (similar to the 50-year-old auto-recovery); it doesn't let me sleep :) What if we do all distance calculations (and other types of calculations) at indexing time instead of at query time? For instance, we may have an index structure like {Term, List[MisspelledTerm, Distance]}, and we can query this structure by {MisspelledTerm, Distance}. We mentioned it here already, in LUCENE-1513, but our use case is very specific... And why allow 5 spelling mistakes in Unicode if the user's input contains only 3 Latin-1 characters? We should hardcode some constraints. Yes, memory requirements... in the case of dogs it could be at least a few million additional misspelled terms for this one specific term... but it doesn't grow linearly... and we can limit such a structure to distance=2 and use additional query-time processing if we need distance=3. It's just a (naive) idea: precalculate similar terms at indexing time...
[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829163#action_12829163 ] Fuad Efendi edited comment on LUCENE-2230 at 2/9/10 9:17 PM: - After long-run load-stress tests... I used 2 boxes, one with SOLR, the other with a simple multithreaded stress simulator (with randomly generated fuzzy query samples); each box is 2x AMD Opteron 2350 (8 cores per box), 64-bit. I disabled all SOLR caches except the Document Cache (I want isolated tests; I want to ignore the time taken by disk I/O to load documents). Performance rose with the number of load-stress threads (on the client computer), then dropped:

9 Threads: TPS: 200 - 210, Response: 45 - 50 (ms)
10 Threads: TPS: 200 - 215, Response: 45 - 55 (ms)
12 Threads: TPS: 180 - 220, Response: 50 - 90 (ms)
16 Threads: TPS: 60 - 65, Response: 230 - 260 (ms)

This can be explained by CPU-bound processing and the 8 available cores; the top command on the SOLR instance showed 750% - 790% CPU time (8-core) at the 3rd step (12 stressing threads) and 200% at the 4th step (16 stressing threads), probably due to network I/O, Tomcat internals, etc. It's better to have Apache HTTPD in front of SOLR in production, with proxy_ajp (persistent connections) and HTTP caching enabled, and to fine-tune Tomcat threads according to the use case. BTW, my best counters for default SOLR/Lucene were: TPS: 12, Response: 750 ms. Fuzzy queries were tuned such that the distance threshold was less than or equal to two. I used the StrikeAMatch distance... Thanks, http://www.tokenizer.ca +1 416-993-2060(cell) P.S. Before performing the load-stress tests, I established a baseline in my environment: 1500 TPS by pinging http://x.x.x.x:8080/apache-solr-1.4/admin/ (a static JSP). And I reached 220 TPS for fuzzy search, starting from 12-15 TPS (default Lucene/SOLR)...
[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829163#action_12829163 ] Fuad Efendi edited comment on LUCENE-2230 at 2/9/10 9:35 PM: - After long-run load-stress tests... I used 2 boxes, one with SOLR, the other with a simple multithreaded stress simulator (with randomly generated fuzzy query samples); each box is 2x AMD Opteron 2350 (8 cores per box), 64-bit. I disabled all SOLR caches except the Document Cache (I want isolated tests; I want to ignore the time taken by disk I/O to load documents). Performance rose with the number of load-stress threads (on the client computer), then dropped:

9 Threads: TPS: 200 - 210, Response: 45 - 50 (ms)
10 Threads: TPS: 200 - 215, Response: 45 - 55 (ms)
12 Threads: TPS: 180 - 220, Response: 50 - 90 (ms)
16 Threads: TPS: 60 - 65, Response: 230 - 260 (ms)

This can be explained by CPU-bound processing and the 8 available cores; the top command on the SOLR instance showed 750% - 790% CPU time (8-core) at the 3rd step (12 stressing threads) and 200% at the 4th step (16 stressing threads), probably due to network I/O, Tomcat internals, etc. It's better to have Apache HTTPD in front of SOLR in production, with proxy_ajp (persistent connections) and HTTP caching enabled, and to fine-tune Tomcat threads according to the use case. BTW, my best counters for default SOLR/Lucene were: TPS: 12, Response: 750 ms. Fuzzy queries were tuned such that the distance threshold was less than or equal to two. I used the StrikeAMatch distance... Thanks, http://www.tokenizer.ca +1 416-993-2060(cell) P.S. Before performing the load-stress tests, I established a baseline in my environment: 1500 TPS by pinging http://x.x.x.x:8080/apache-solr-1.4/admin/ (a static JSP). And I reached 220 TPS for fuzzy search, starting from 12-15 TPS (default Lucene/SOLR)... P.P.S. A distance function must follow 3 'axioms':

{code}
D(a,a) = 0
D(a,b) = D(b,a)
D(a,b) + D(b,c) >= D(a,c)
{code}

And the function must return an integer value; otherwise, BKTree will produce wrong results. Also, it's mentioned somewhere in the Levenshtein algo Javadocs (in the contrib folder, I believe) that an instance method runs faster than a static method; need to test with Java 6... most probably 'yes', it depends on the JVM implementation; I can only guess that CPU internals are better optimized for instance methods...
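For readers unfamiliar with the structure, a minimal BK-tree sketch in Java, assuming an integer metric satisfying the axioms above (this is the classic Burkhard-Keller structure, not the BKTree.java attached to this issue). The triangle inequality is exactly what lets search() skip every child whose edge label lies outside [d - k, d + k]:

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BKTreeSketch {
    interface Metric { int distance(String a, String b); }

    private static class Node {
        final String term;
        final Map<Integer, Node> children = new HashMap<Integer, Node>();
        Node(String term) { this.term = term; }
    }

    private final Metric metric;
    private Node root;

    BKTreeSketch(Metric metric) { this.metric = metric; }

    void add(String term) {
        if (root == null) { root = new Node(term); return; }
        Node node = root;
        while (true) {
            int d = metric.distance(term, node.term);
            if (d == 0) return;                       // already present
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(term)); return; }
            node = child;
        }
    }

    // All terms within distance k of the query; the triangle inequality lets
    // us skip every child whose edge label is outside [d - k, d + k].
    List<String> search(String query, int k) {
        List<String> hits = new ArrayList<String>();
        if (root != null) search(root, query, k, hits);
        return hits;
    }

    private void search(Node node, String query, int k, List<String> hits) {
        int d = metric.distance(query, node.term);
        if (d <= k) hits.add(node.term);
        for (Map.Entry<Integer, Node> e : node.children.entrySet()) {
            if (e.getKey() >= d - k && e.getKey() <= d + k) {
                search(e.getValue(), query, k, hits);
            }
        }
    }
}
{code}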
[jira] Commented: (SOLR-1764) While indexing a java.lang.IllegalStateException: Can't overwrite cause exception is thrown
[ https://issues.apache.org/jira/browse/SOLR-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831458#action_12831458 ] Fuad Efendi commented on SOLR-1764: --- Funny; it might happen that this is not a problem with JDK 1.6.0_9, or maybe with the latest JDK. As a quick workaround... also, you may try to use SolrJ with the binary format... I'll try to check that <element>word&amp;word</element> doesn't cause a problem...

While indexing a java.lang.IllegalStateException: Can't overwrite cause exception is thrown - Key: SOLR-1764 URL: https://issues.apache.org/jira/browse/SOLR-1764 Project: Solr Issue Type: Bug Components: clients - java Affects Versions: 1.4 Environment: Windows XP, JBoss 4.2.3 GA Reporter: Michael McGowan Priority: Blocker

I get an exception while indexing. It seems that I'm unable to see the root cause of the exception because it is masked by another java.lang.IllegalStateException: Can't overwrite cause exception. Here is the stacktrace:

16:59:04,292 ERROR [STDERR] Feb 8, 2010 4:59:04 PM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {} 0 15
16:59:04,292 ERROR [STDERR] Feb 8, 2010 4:59:04 PM org.apache.solr.common.SolrException log SEVERE: java.lang.IllegalStateException: Can't overwrite cause
at java.lang.Throwable.initCause(Throwable.java:320)
at com.ctc.wstx.compat.Jdk14Impl.setInitCause(Jdk14Impl.java:70)
at com.ctc.wstx.exc.WstxException.<init>(WstxException.java:46)
at com.ctc.wstx.exc.WstxIOException.<init>(WstxIOException.java:16)
at com.ctc.wstx.stax.WstxInputFactory.doCreateSR(WstxInputFactory.java:536)
at com.ctc.wstx.stax.WstxInputFactory.createSR(WstxInputFactory.java:592)
at com.ctc.wstx.stax.WstxInputFactory.createSR(WstxInputFactory.java:648)
at com.ctc.wstx.stax.WstxInputFactory.createXMLStreamReader(WstxInputFactory.java:319)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:68)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:182)
at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:446)
at java.lang.Thread.run(Thread.java:619)
16:59:04,292 ERROR [STDERR] Feb 8, 2010 4:59:04 PM org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/update params={wt=xml&version=2.2} status=500 QTime=15
16:59:04,292 ERROR [STDERR] Feb 8, 2010 4:59:04 PM org.apache.solr.common.SolrException log SEVERE: java.lang.IllegalStateException: Can't overwrite cause
at java.lang.Throwable.initCause(Throwable.java:320)
at com.ctc.wstx.compat.Jdk14Impl.setInitCause(Jdk14Impl.java:70) at
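The "SolrJ with binary format" workaround could look roughly like this with the Solr 1.4-era SolrJ API (a hedged sketch: the URL and field names are placeholders; sending updates as javabin bypasses the XML/Woodstox code path in the stacktrace above):

{code}
import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BinaryIndexing {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8080/solr"); // placeholder URL
        // Send updates as javabin instead of XML, so no XML parsing happens
        // on the server side for the update request.
        server.setRequestWriter(new BinaryRequestWriter());

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");                 // placeholder fields
        doc.addField("text", "word&word");       // no XML escaping involved
        server.add(doc);
        server.commit();
    }
}
{code}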
[jira] Commented: (SOLR-1764) While indexing a java.lang.IllegalStateException: Can't overwrite cause exception is thrown
[ https://issues.apache.org/jira/browse/SOLR-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831268#action_12831268 ] Fuad Efendi commented on SOLR-1764: --- Michael, which version of Java are you using? I believe something is wrong with the XML (upload) file, and specific Java version classes conflict with Woodstox, although SOLR may need improvement too: http://forums.sun.com/thread.jspa?threadID=5150576
[jira] Issue Comment Edited: (SOLR-1764) While indexing a java.lang.IllegalStateException: Can't overwrite cause exception is thrown
[ https://issues.apache.org/jira/browse/SOLR-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831268#action_12831268 ] Fuad Efendi edited comment on SOLR-1764 at 2/9/10 2:48 AM: --- Michael, which version of Java are you using? I believe something is wrong with the XML (upload) file, and specific Java version classes conflict with Woodstox, although SOLR may need improvement too: http://forums.sun.com/thread.jspa?threadID=5150576 It says that text nodes such as <prim name="y">[-A-Z0-9.,()/='+:?!%&amp;*; ]</prim> can be split (for instance, to process entities), depending on the implementation, and, to be safe, SOLR needs something like

{code}
while (reader.isCharacters()) {
    sb.append(reader.getText());
    reader.next();
}
{code}
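Instead of hand-coalescing adjacent character events, the StAX factory can be asked to do it, assuming Solr could set this property where it creates the factory (hypothetical placement; the property itself is standard StAX):

{code}
import javax.xml.stream.XMLInputFactory;

public class CoalescingFactory {
    static XMLInputFactory newFactory() {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the StAX implementation (Woodstox here) to merge adjacent
        // CHARACTERS/CDATA events, so entity expansion never splits a text node.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        return factory;
    }
}
{code}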
[jira] Issue Comment Edited: (SOLR-1764) While indexing a java.lang.IllegalStateException: Can't overwrite cause exception is thrown
[ https://issues.apache.org/jira/browse/SOLR-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831268#action_12831268 ] Fuad Efendi edited comment on SOLR-1764 at 2/9/10 2:51 AM: --- Michael, which version of Java are you using? I believe something is wrong with the XML (upload) file, and specific Java version classes conflict with Woodstox, although SOLR may need improvement too: http://forums.sun.com/thread.jspa?threadID=5150576 It says that text nodes such as <prim name="y">[-A-Z0-9.,()/='+:?!%&amp;*; ]</prim> can be split (for instance, to process entities), depending on the implementation, and, to be safe, SOLR needs something like

{code}
while (reader.isCharacters()) {
    sb.append(reader.getText());
    reader.next();
}
{code}
[jira] Commented: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829163#action_12829163 ] Fuad Efendi commented on LUCENE-2230: - After long-run load-stress tests... I used 2 boxes, one with SOLR, the other with a simple multithreaded stress simulator (with randomly generated fuzzy query samples); each box is 2x AMD Opteron 2350 (8 cores per box), 64-bit. I disabled all SOLR caches except the Document Cache (I want isolated tests; I want to ignore the time taken by disk I/O to load documents). Performance rose with the number of load-stress threads (on the client computer), then dropped:

9 Threads: TPS: 200 - 210, Response: 45 - 50 (ms)
10 Threads: TPS: 200 - 215, Response: 45 - 55 (ms)
12 Threads: TPS: 180 - 220, Response: 50 - 90 (ms)
16 Threads: TPS: 60 - 65, Response: 230 - 260 (ms)

This can be explained by CPU-bound processing and the 8 available cores; the top command on the SOLR instance showed 750% - 790% CPU time (8-core) at the 3rd step (12 stressing threads) and 200% at the 4th step (16 stressing threads), probably due to network I/O, Tomcat internals, etc. It's better to have Apache HTTPD in front of SOLR in production, with proxy_ajp (persistent connections) and HTTP caching enabled, and to fine-tune Tomcat threads according to the use case. BTW, my best counters for default SOLR/Lucene were: TPS: 12, Response: 750 ms. Fuzzy queries were tuned such that the distance threshold was less than or equal to two. I used the StrikeAMatch distance... Thanks, http://www.tokenizer.ca +1 416-993-2060(cell)
[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804001#action_12804001 ] Fuad Efendi edited comment on LUCENE-2230 at 1/24/10 12:46 AM: --- Minor bug fixed (with cache warm-up)... Don't forget to disable the QueryResultsCache and to enable a large DocumentCache (if you are using SOLR); otherwise you won't see the difference. (SOLR users: there are some other tricks!)

With Lucene 2.9.1: 800 ms
With BKTree and the Levenshtein algo: 200 ms
With BKTree and http://www.catalysoft.com/articles/StrikeAMatch.html: 70 ms

Average timing after many hours of tests. We may consider an integer distance instead of a float for Lucene Query if we accept this algo; I tried my best to keep it close to the float distance. The BKTree is cached in FuzzyTermEnumNEW. It needs warm-up if the index changed; the simplest way is to recalculate it at night (a separate thread will do it). P.S. FuzzyQuery constructs an instance of FuzzyTermEnum and passes an instance of IndexReader to the constructor. This is the way... If the IndexReader changed (new instance) we can simply repopulate the BKTree (or, for instance, compare the list of terms and simply add the terms missing from the BKTree)...
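The repopulate-on-new-IndexReader idea in the P.S. could be sketched by keying the prebuilt tree on the reader instance, e.g. with a WeakHashMap so a tree is dropped once its reader is garbage-collected (illustrative only, reusing the BKTreeSketch class from the earlier sketch; not the attached implementation):

{code}
import java.util.Map;
import java.util.WeakHashMap;

public class BKTreeCache {
    // One tree per live IndexReader; entries vanish once a reader is GC'd.
    private final Map<Object, BKTreeSketch> cache =
        new WeakHashMap<Object, BKTreeSketch>();

    synchronized BKTreeSketch treeFor(Object indexReader, Iterable<String> terms,
                                      BKTreeSketch.Metric metric) {
        BKTreeSketch tree = cache.get(indexReader);
        if (tree == null) {                 // new reader: (re)populate once
            tree = new BKTreeSketch(metric);
            for (String t : terms) tree.add(t);
            cache.put(indexReader, tree);
        }
        return tree;
    }
}
{code}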
[jira] Updated: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated LUCENE-2230: Attachment: FuzzyTermEnumNEW.java Minor bug fixed (with cache warm-up)... Don't forget to disable the QueryResultsCache and to enable a large DocumentCache (if you are using SOLR); otherwise you won't see the difference. (SOLR users: there are some other tricks!)

With Lucene 2.9.1: 800 ms
With BKTree and the Levenshtein algo: 200 ms
With BKTree and http://www.catalysoft.com/articles/StrikeAMatch.html: 70 ms

Average timing after many hours of tests. We may consider an integer distance instead of a float for Lucene Query if we accept this algo; I tried my best to keep it close to the float distance. The BKTree is cached in FuzzyTermEnumNEW. It needs warm-up if the index changed; the simplest way is to recalculate it at night (a separate thread will do it). Thanks, Fuad +1 416-993-2060(cell)
[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804001#action_12804001 ] Fuad Efendi edited comment on LUCENE-2230 at 1/23/10 2:49 AM: -- Minor bug fixed (with cache warm-up)... Don't forget to disable the QueryResultsCache and to enable a large DocumentCache (if you are using SOLR); otherwise you won't see the difference. (SOLR users: there are some other tricks!)

With Lucene 2.9.1: 800 ms
With BKTree and the Levenshtein algo: 200 ms
With BKTree and http://www.catalysoft.com/articles/StrikeAMatch.html: 70 ms

Average timing after many hours of tests. We may consider an integer distance instead of a float for Lucene Query if we accept this algo; I tried my best to keep it close to the float distance. The BKTree is cached in FuzzyTermEnumNEW. It needs warm-up if the index changed; the simplest way is to recalculate it at night (a separate thread will do it). Thanks, Fuad +1 416-993-2060(cell)
[jira] Created: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times. Key: LUCENE-2230 URL: https://issues.apache.org/jira/browse/LUCENE-2230 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.0 Environment: Lucene currently uses a brute-force full-terms scanner and calculates the distance for each term. The new BKTree structure improves performance on average 20 times when the distance is 1, and 3 times when the distance is 3. I tested with an index of several million docs and 250,000 terms. The new algo uses integer distances between objects. Reporter: Fuad Efendi

W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973. http://portal.acm.org/citation.cfm?doid=362003.362025 I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick Johnson, Google). Additionally, the simplified algorithm at http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more logically correct than Levenshtein distance, and it is 3-5 times faster (isolated tests). Big list of distance implementations: http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm
[jira] Updated: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated LUCENE-2230: Attachment: DistanceImpl.java Distance.java BKTree.java First version of BKTree.java
[jira] Updated: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated LUCENE-2230: Attachment: FuzzyTermEnumNEW.java FuzzyTermEnumNEW.java In order to use it with Lucene 2.9.1, compile all files (4 files) into a separate JAR file (Java 6). In a source distribution of Lucene 2.9.1, modify a single method of FuzzyQuery: protected FilteredTermEnum getEnum(IndexReader reader) throws IOException { return new FuzzyTermEnumNEW(reader, getTerm(), minimumSimilarity, prefixLength); } - and compile it (using default settings such as Java 1.4 compatibility); ant jar-core will do it. Use the 2 new jars instead of lucene-core-2.9.1.jar Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times. Key: LUCENE-2230 URL: https://issues.apache.org/jira/browse/LUCENE-2230 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.0 Environment: Lucene currently uses a brute-force full-terms scanner and calculates the distance for each term. The new BKTree structure improves performance on average 20 times when the distance is 1, and 3 times when the distance is 3. I tested with an index of several million docs and 250,000 terms. The new algo uses integer distances between objects. Reporter: Fuad Efendi Attachments: BKTree.java, Distance.java, DistanceImpl.java, FuzzyTermEnumNEW.java Original Estimate: 0.02h Remaining Estimate: 0.02h W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973 http://portal.acm.org/citation.cfm?doid=362003.362025 I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick Johnson, Google). Additionally, the simplified algorithm at http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more logically correct than Levenshtein distance, and it is 3-5 times faster (isolated tests). Big list of distance implementations: http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1990) Add unsigned packed int impls in oal.util
[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777054#action_12777054 ] Fuad Efendi commented on LUCENE-1990: - Suttiwat sent me a link: http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/ This is vendor-specific and may cause unexpected problems, but we can try it in some specific cases: Compressed Oops have been included (but disabled by default) in the performance release JDK6u6p (requires you to fill a survey), so I decided to try it in an internal application and compare it with 64-bit mode and 32-bit mode. -XX:+UseCompressedOops There are other vendors around too, such as Oracle JRockit, which is much faster server-side... Add unsigned packed int impls in oal.util - Key: LUCENE-1990 URL: https://issues.apache.org/jira/browse/LUCENE-1990 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Priority: Minor There are various places in Lucene that could take advantage of an efficient packed unsigned int/long impl. EG the terms dict index in the standard codec in LUCENE-1458 could substantially reduce its RAM usage. FieldCache.StringIndex could as well. And I think load-into-RAM codecs like the one in TestExternalCodecs could use this too. I'm picturing something very basic like: {code} interface PackedUnsignedLongs { long get(long index); void set(long index, long value); } {code} Plus maybe an iterator for getting and maybe also for setting. If it helps, most of the usages of this inside Lucene will be write once so eg the set could make that an assumption/requirement. And a factory somewhere: {code} PackedUnsignedLongs create(int count, long maxValue); {code} I think we should simply autogen the code (we can start from the autogen code in LUCENE-1410), or, if there is a good existing impl that has a compatible license that'd be great. I don't have time near-term to do this... so if anyone has the itch, please jump! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
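For context, the option quoted above is passed on the JVM command line at startup; an illustrative (not prescriptive) invocation for a 64-bit JVM, with heap size and other flags purely as examples:
{code}
# illustrative only: enable compressed ordinary object pointers on a 64-bit JVM
JAVA_OPTS="-server -Xmx4g -XX:+UseCompressedOops"
{code}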
[jira] Issue Comment Edited: (LUCENE-1990) Add unsigned packed int impls in oal.util
[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12775420#action_12775420 ] Fuad Efendi edited comment on LUCENE-1990 at 11/10/09 2:01 PM: --- Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go horizontally we will end up with 8 arrays of int[]. What if we go vertically? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only 1. And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) was (Author: funtick): Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 10 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go horizontally we will end up with 8 arrays of int[]. What if we go vertically? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only 1. And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) Add unsigned packed int impls in oal.util - Key: LUCENE-1990 URL: https://issues.apache.org/jira/browse/LUCENE-1990 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Priority: Minor There are various places in Lucene that could take advantage of an efficient packed unsigned int/long impl. EG the terms dict index in the standard codec in LUCENE-1458 could subsantially reduce it's RAM usage. FieldCache.StringIndex could as well. And I think load into RAM codecs like the one in TestExternalCodecs could use this too. I'm picturing something very basic like: {code} interface PackedUnsignedLongs { long get(long index); void set(long index, long value); } {code} Plus maybe an iterator for getting and maybe also for setting. If it helps, most of the usages of this inside Lucene will be write once so eg the set could make that an assumption/requirement. And a factory somewhere: {code} PackedUnsignedLongs create(int count, long maxValue); {code} I think we should simply autogen the code (we can start from the autogen code in LUCENE-1410), or, if there is an good existing impl that has a compatible license that'd be great. I don't have time near-term to do this... so if anyone has the itch, please jump! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1990) Add unsigned packed int impls in oal.util
[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12775420#action_12775420 ] Fuad Efendi commented on LUCENE-1990: - Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 10 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go horizontally we will end up with 8 arrays of int[]. What if we go vertically? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only 1. And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) Add unsigned packed int impls in oal.util - Key: LUCENE-1990 URL: https://issues.apache.org/jira/browse/LUCENE-1990 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Priority: Minor There are various places in Lucene that could take advantage of an efficient packed unsigned int/long impl. EG the terms dict index in the standard codec in LUCENE-1458 could subsantially reduce it's RAM usage. FieldCache.StringIndex could as well. And I think load into RAM codecs like the one in TestExternalCodecs could use this too. I'm picturing something very basic like: {code} interface PackedUnsignedLongs { long get(long index); void set(long index, long value); } {code} Plus maybe an iterator for getting and maybe also for setting. If it helps, most of the usages of this inside Lucene will be write once so eg the set could make that an assumption/requirement. And a factory somewhere: {code} PackedUnsignedLongs create(int count, long maxValue); {code} I think we should simply autogen the code (we can start from the autogen code in LUCENE-1410), or, if there is an good existing impl that has a compatible license that'd be great. I don't have time near-term to do this... so if anyone has the itch, please jump! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
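To make the vertical 3-bit encoding concrete, here is a minimal sketch of a packed array (assumptions: exactly 8 distinct field values, and 21 values stored per 64-bit word so that no value crosses a word boundary; this is illustrative, not Lucene's packed-ints code):
{code}
// Sketch: pack one 3-bit value ordinal per document into long[] blocks.
public final class Packed3BitArray {
    private static final int BITS = 3;
    private static final int PER_LONG = 64 / BITS;      // 21 values per long, 1 bit wasted
    private static final long MASK = (1L << BITS) - 1;  // 0b111

    private final long[] blocks;

    public Packed3BitArray(int count) {                 // count = IndexReader.maxDoc()
        blocks = new long[(count + PER_LONG - 1) / PER_LONG];
    }

    public void set(int index, int value) {             // value in [0, 7]
        int block = index / PER_LONG;
        int shift = (index % PER_LONG) * BITS;
        blocks[block] = (blocks[block] & ~(MASK << shift)) | ((long) value << shift);
    }

    public int get(int index) {
        int block = index / PER_LONG;
        int shift = (index % PER_LONG) * BITS;
        return (int) ((blocks[block] >>> shift) & MASK);
    }
}
{code}
At 3 bits per document this stores the value ordinal for 10 million docs in roughly 3.75 MB, versus 40 MB for a plain int[].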
[jira] Issue Comment Edited: (LUCENE-1990) Add unsigned packed int impls in oal.util
[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12775420#action_12775420 ] Fuad Efendi edited comment on LUCENE-1990 at 11/10/09 4:10 PM: --- Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go horizontally we will end up with 8 arrays of int[]. What if we go vertically? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only 1. And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) P.S. Of course each solution has pros and cons, I am trying to focus on FieldCache specific use cases. 1. For a given document ID, find a value for a field 2. For a given query results, sort it by a field values 3. For a given query results, count facet for each field value I don't think such naive compression is slower than abstract int[] arrays... and we need to change public API of field cache too: if method returns int[] we are not saving any RAM. Better is to compare with SOLR use cases and to make API closer to real requirements; SOLR operates with some bitsets instead of arrays... was (Author: funtick): Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go horizontally we will end up with 8 arrays of int[]. What if we go vertically? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only 1. And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) Add unsigned packed int impls in oal.util - Key: LUCENE-1990 URL: https://issues.apache.org/jira/browse/LUCENE-1990 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Priority: Minor There are various places in Lucene that could take advantage of an efficient packed unsigned int/long impl. EG the terms dict index in the standard codec in LUCENE-1458 could subsantially reduce it's RAM usage. FieldCache.StringIndex could as well. And I think load into RAM codecs like the one in TestExternalCodecs could use this too. I'm picturing something very basic like: {code} interface PackedUnsignedLongs { long get(long index); void set(long index, long value); } {code} Plus maybe an iterator for getting and maybe also for setting. 
If it helps, most of the usages of this inside Lucene will be write once so eg the set could make that an assumption/requirement. And a factory somewhere: {code} PackedUnsignedLongs create(int count, long maxValue); {code} I think we should simply autogen the code (we can start from the autogen code in LUCENE-1410), or, if there is an good existing impl that has a compatible license that'd be great. I don't have time near-term to do this... so if anyone has the itch, please jump! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1990) Add unsigned packed int impls in oal.util
[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12775420#action_12775420 ] Fuad Efendi edited comment on LUCENE-1990 at 11/10/09 4:11 PM: --- Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go horizontally we will end up with 8 arrays of int[]. What if we go vertically? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only 1. And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) P.S. Of course each solution has pros and cons, I am trying to focus on FieldCache specific use cases. 1. For a given document ID, find a value for a field 2. For a given query results, sort it by a field values 3. For a given query results, count facet for each field value I don't think such naive compression is slower than abstract int[] arrays... and we need to change public API of field cache too: if method returns int[] we are not saving any RAM. Better is to compare with SOLR use cases and to make API closer to real requirements; SOLR operates with some bitsets instead of arrays... was (Author: funtick): Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go horizontally we will end up with 8 arrays of int[]. What if we go vertically? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only 1. And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) P.S. Of course each solution has pros and cons, I am trying to focus on FieldCache specific use cases. 1. For a given document ID, find a value for a field 2. For a given query results, sort it by a field values 3. For a given query results, count facet for each field value I don't think such naive compression is slower than abstract int[] arrays... and we need to change public API of field cache too: if method returns int[] we are not saving any RAM. Better is to compare with SOLR use cases and to make API closer to real requirements; SOLR operates with some bitsets instead of arrays... 
Add unsigned packed int impls in oal.util - Key: LUCENE-1990 URL: https://issues.apache.org/jira/browse/LUCENE-1990 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Priority: Minor There are various places in Lucene that could take advantage of an efficient packed unsigned int/long impl. EG the terms dict index in the standard codec in LUCENE-1458 could subsantially reduce it's RAM usage. FieldCache.StringIndex could as well. And I think load into RAM codecs like the one in TestExternalCodecs could use this too. I'm picturing something very basic like: {code} interface PackedUnsignedLongs { long get(long index); void set(long index, long value); } {code} Plus maybe an iterator for getting and maybe also for setting. If it helps, most of the usages of this inside Lucene will be write once so eg the set could make that an assumption/requirement. And a factory somewhere: {code} PackedUnsignedLongs create(int count, long maxValue); {code} I think we should simply autogen the code (we can start from the autogen code in LUCENE-1410), or, if there is an good existing impl that has
[jira] Issue Comment Edited: (LUCENE-1990) Add unsigned packed int impls in oal.util
[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12775420#action_12775420 ] Fuad Efendi edited comment on LUCENE-1990 at 11/10/09 4:11 PM: --- Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go horizontally we will end up with 8 arrays of int[]. What if we go vertically? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only 1. And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) P.S. Of course each solution has pros and cons, I am trying to focus on FieldCache specific use cases. 1. For a given document ID, find a value for a field 2. For a given query results, sort it by a field values 3. For a given query results, count facet for each field value I don't think such naive compression is slower than abstract int[] arrays... and we need to change public API of field cache too: if method returns int[] we are not saving any RAM. Better is to compare with SOLR use cases and to make API closer to real requirements; SOLR operates with some bitsets instead of arrays... was (Author: funtick): Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go horizontally we will end up with 8 arrays of int[]. What if we go vertically? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only 1. And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) P.S. Of course each solution has pros and cons, I am trying to focus on FieldCache specific use cases. 1. For a given document ID, find a value for a field 2. For a given query results, sort it by a field values 3. For a given query results, count facet for each field value I don't think such naive compression is slower than abstract int[] arrays... and we need to change public API of field cache too: if method returns int[] we are not saving any RAM. Better is to compare with SOLR use cases and to make API closer to real requirements; SOLR operates with some bitsets instead of arrays... 
Add unsigned packed int impls in oal.util - Key: LUCENE-1990 URL: https://issues.apache.org/jira/browse/LUCENE-1990 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Priority: Minor There are various places in Lucene that could take advantage of an efficient packed unsigned int/long impl. EG the terms dict index in the standard codec in LUCENE-1458 could subsantially reduce it's RAM usage. FieldCache.StringIndex could as well. And I think load into RAM codecs like the one in TestExternalCodecs could use this too. I'm picturing something very basic like: {code} interface PackedUnsignedLongs { long get(long index); void set(long index, long value); } {code} Plus maybe an iterator for getting and maybe also for setting. If it helps, most of the usages of this inside Lucene will be write once so eg the set could make that an assumption/requirement. And a factory somewhere: {code} PackedUnsignedLongs create(int count, long maxValue); {code} I think we should simply autogen the code (we can start from the autogen code in LUCENE-1410), or, if there is an good existing impl that has a
[jira] Commented: (LUCENE-1995) ArrayIndexOutOfBoundsException during indexing
[ https://issues.apache.org/jira/browse/LUCENE-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769699#action_12769699 ] Fuad Efendi commented on LUCENE-1995: - I recall a bug in Arrays.sort() (Joshua Bloch's) which was fixed only after 9 years; signed instead of unsigned arithmetic... ArrayIndexOutOfBoundsException during indexing -- Key: LUCENE-1995 URL: https://issues.apache.org/jira/browse/LUCENE-1995 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Yonik Seeley Assignee: Michael McCandless Fix For: 2.9.1 http://search.lucidimagination.com/search/document/f29fc52348ab9b63/arrayindexoutofboundsexception_during_indexing -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1995) ArrayIndexOutOfBoundsException during indexing
[ https://issues.apache.org/jira/browse/LUCENE-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769715#action_12769715 ] Fuad Efendi commented on LUCENE-1995: - Joshua writes in his Google Research Blog: The version of binary search that I wrote for the JDK contained the same bug. It was reported to Sun recently when it broke someone's program, after lying in wait for nine years or so. http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html Anyway, this is the reporter's specific use case; I didn't have ANY problems with ramBufferSizeMB: 8192 during a month (at least) of constant updates (5000/sec)... Yes, I am using term vectors (as Michael noticed, it plays a role)... And what exactly causes the problem is unclear; the explicit check for 2048 is just a workaround... a quick shortcut... ArrayIndexOutOfBoundsException during indexing -- Key: LUCENE-1995 URL: https://issues.apache.org/jira/browse/LUCENE-1995 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Yonik Seeley Assignee: Michael McCandless Fix For: 2.9.1 http://search.lucidimagination.com/search/document/f29fc52348ab9b63/arrayindexoutofboundsexception_during_indexing -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
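The nine-year bug being quoted is the classic signed midpoint overflow; a minimal illustration of the failure mode Bloch describes (this is a reproduction of the arithmetic, not the JDK source itself):
{code}
public class MidpointOverflow {
    public static void main(String[] args) {
        int low = 2, high = Integer.MAX_VALUE - 1;
        System.out.println((low + high) / 2);       // -1073741824: low + high overflowed int
        System.out.println(low + (high - low) / 2); // 1073741824: the standard fix
        System.out.println((low + high) >>> 1);     // 1073741824: unsigned shift also works
    }
}
{code}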
[jira] Commented: (LUCENE-1995) ArrayIndexOutOfBoundsException during indexing
[ https://issues.apache.org/jira/browse/LUCENE-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769725#action_12769725 ] Fuad Efendi commented on LUCENE-1995: - But who introduced the bug? Joshua writes that it was him :) - based on others' famous findings and books... === it just contains a few lines of code that calculates a double value from two document fields and then stores that value in one of these dynamic fields === And the problem happens when he indexes document number 15,000,000... - I am guessing he is indexing a double... (type=tdouble, indexed=t, stored=f)... Why would we ever need to index a multi-valued double field? Cardinality is the highest possible... I don't know Lucene internals; I am thinking that (double, docID) will occupy 12 bytes, and with a multivalued (or dynamic) field we may need a lot of RAM for 15 million docs... especially if we are trying to put objects into buckets using a hash of the double... ArrayIndexOutOfBoundsException during indexing -- Key: LUCENE-1995 URL: https://issues.apache.org/jira/browse/LUCENE-1995 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Yonik Seeley Assignee: Michael McCandless Fix For: 2.9.1 http://search.lucidimagination.com/search/document/f29fc52348ab9b63/arrayindexoutofboundsexception_during_indexing -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1995) ArrayIndexOutOfBoundsException during indexing
[ https://issues.apache.org/jira/browse/LUCENE-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769747#action_12769747 ] Fuad Efendi commented on LUCENE-1995: - bq. He took it, and the bug with it, from elsewhere. He didn't do the bug either. He just propagated it. This is even worse, especially for such a classic case as Arrays.sort(). Propagating bugs... * The sorting algorithm is a tuned quicksort, adapted from Jon * L. Bentley and M. Douglas McIlroy's Engineering a Sort Function, * Software-Practice and Experience, Vol. 23(11) P. 1249-1265 (November * 1993). This algorithm offers n*log(n) performance on many data sets * that cause other quicksorts to degrade to quadratic performance. bq. If your usage actually went above 2GB, you would have had a problem. 8192 is not a valid value, we don't support it, and now we'll throw an exception if it's over 2048. Now I think my actual usage was below 2Gb... bq. No, we only support a max of 2GB ram buffer, by design currently. Thanks for the confirmation... However, the JavaDoc didn't mention that explicitly, and "by design" is unclear wording... it has already been "by design" for several years... bq. 2048 probably won't be safe, because a large doc just as the buffer is filling up could still overflow. (Though, RAM is also used eg for norms, so you might squeak by). - Uncertainty... ArrayIndexOutOfBoundsException during indexing -- Key: LUCENE-1995 URL: https://issues.apache.org/jira/browse/LUCENE-1995 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Yonik Seeley Assignee: Michael McCandless Fix For: 2.9.1 http://search.lucidimagination.com/search/document/f29fc52348ab9b63/arrayindexoutofboundsexception_during_indexing -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1995) ArrayIndexOutOfBoundsException during indexing
[ https://issues.apache.org/jira/browse/LUCENE-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769749#action_12769749 ] Fuad Efendi commented on LUCENE-1995: - bq. bq. If your usage actually went above 2GB, you would have had a problem. 8192 is not a valid value, we don't support it, and now we'll throw an exception if it's over 2048. bq. Now I think my actual usage was below 2Gb... How was I below 2048 if I had a few segments created by IndexWriter during a day, without any SOLR commit?.. ArrayIndexOutOfBoundsException during indexing -- Key: LUCENE-1995 URL: https://issues.apache.org/jira/browse/LUCENE-1995 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Yonik Seeley Assignee: Michael McCandless Fix For: 2.9.1 http://search.lucidimagination.com/search/document/f29fc52348ab9b63/arrayindexoutofboundsexception_during_indexing -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1995) ArrayIndexOutOfBoundsException during indexing
[ https://issues.apache.org/jira/browse/LUCENE-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769749#action_12769749 ] Fuad Efendi edited comment on LUCENE-1995 at 10/25/09 3:14 AM: --- bq. bq. If your usage actually went above 2GB, you would have had a problem. 8192 is not a valid value, we don't support it, and now we'll throw an exception if it's over 2048. bq. Now I think my actual usage was below 2Gb... How was I below 2048 if I had a few segments created by IndexWriter during a day, without any SOLR commit?.. Maybe I am wrong; it was a few weeks ago... I am currently using 1024 because I need memory for other stuff too, and I don't want to try again... was (Author: funtick): bq. bq. If your usage actually went above 2GB, you would have had a problem. 8192 is not a valid value, we don't support it, and now we'll throw an exception if it's over 2048. bq. Now I think my actual usage was below 2Gb... How was I below 2048 if I had a few segments created by IndexWriter during a day, without any SOLR commit?.. ArrayIndexOutOfBoundsException during indexing -- Key: LUCENE-1995 URL: https://issues.apache.org/jira/browse/LUCENE-1995 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Yonik Seeley Assignee: Michael McCandless Fix For: 2.9.1 http://search.lucidimagination.com/search/document/f29fc52348ab9b63/arrayindexoutofboundsexception_during_indexing -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
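For reference, the setting under discussion lives in solrconfig.xml; given the 2048 cap and the overflow caveat above, a conservative value (the one the comment itself settles on) would look like:
{code}
<!-- solrconfig.xml: Lucene caps this at 2048 MB, and values near the cap can still overflow -->
<ramBufferSizeMB>1024</ramBufferSizeMB>
{code}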
[jira] Closed: (SOLR-711) SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors
[ https://issues.apache.org/jira/browse/SOLR-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi closed SOLR-711. Resolution: Fixed Thanks Shalin for pointing to SOLR-475, which is a very advanced solution to the term-counting approach. SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors -- Key: SOLR-711 URL: https://issues.apache.org/jira/browse/SOLR-711 Project: Solr Issue Type: Improvement Components: search Affects Versions: 1.3 Reporter: Fuad Efendi Fix For: 1.4 Original Estimate: 1680h Remaining Estimate: 1680h From [http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html]: Scenario: - 10,000,000 documents in the index; - 5-10 terms per document; - 200,000 unique terms for a tokenized field. _Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._ Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario). See SimpleFacets.java: {code} public NamedList getFacetTermEnumCounts( SolrIndexSearcher searcher, DocSet docs, ... {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-667) Alternate LRUCache implementation
[ https://issues.apache.org/jira/browse/SOLR-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12635221#action_12635221 ] Fuad Efendi commented on SOLR-667: -- Paul, Yonik, thanks for your efforts; BTW 'Concurrent'HashMap uses spinloops for 'safe' updates in order to avoid synchronization (instead of giving up CPU cycles); there are always cases when it is not faster than a simple HashMap with synchronization. LingPipe uses a different approach; see the last comment at SOLR-665. Also, why are you stuck on LRU? LFU is logically better. +1 and thanks for sharing. Alternate LRUCache implementation - Key: SOLR-667 URL: https://issues.apache.org/jira/browse/SOLR-667 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.3 Reporter: Noble Paul Fix For: 1.4 Attachments: ConcurrentLRUCache.java, ConcurrentLRUCache.java, ConcurrentLRUCache.java, SOLR-667.patch, SOLR-667.patch, SOLR-667.patch, SOLR-667.patch The only available SolrCache, i.e. LRUCache, is based on _LinkedHashMap_, which has _get()_ synchronized as well. This can cause severe bottlenecks for faceted search. Any alternate implementation which can be faster/better must be considered. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
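A small sketch of the synchronization point in question (illustrative, not Solr's code): an access-ordered LinkedHashMap relinks entries on every get(), so even reads need the map-wide lock, while ConcurrentHashMap reads take no lock but provide no eviction ordering of their own:
{code}
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CacheLockDemo {
    public static void main(String[] args) {
        // access-order=true (LRU): get() mutates the linked list, hence the global lock
        Map<String, Object> lru = Collections.synchronizedMap(
            new LinkedHashMap<String, Object>(1024, 0.75f, true));
        // lock-free reads, but no LRU/LFU ordering on its own
        Map<String, Object> concurrent = new ConcurrentHashMap<String, Object>();
        lru.put("q", 1);
        concurrent.put("q", 1);
        System.out.println(lru.get("q") + " / " + concurrent.get("q"));
    }
}
{code}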
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12624835#action_12624835 ] Fuad Efendi commented on LUCENE-831: It would be nice to have a TermVectorCache (if term vectors are stored in the index). Complete overhaul of FieldCache API/Implementation -- Key: LUCENE-831 URL: https://issues.apache.org/jira/browse/LUCENE-831 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Fix For: 3.0 Attachments: fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff, fieldcache-overhaul.diff, LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch Motivation: 1) Complete overhaul of the API/implementation of FieldCache type things... a) eliminate global static map keyed on IndexReader (thus eliminating synch block between completely independent IndexReaders) b) allow more customization of cache management (ie: use expiration/replacement strategies, disk backed caches, etc) c) allow people to define custom cache data logic (ie: custom parsers, complex datatypes, etc... anything tied to a reader) d) allow people to inspect what's in a cache (list of CacheKeys) for an IndexReader so a new IndexReader can be likewise warmed. e) Lend support for smarter cache management if/when IndexReader.reopen is added (merging of cached data from subReaders). 2) Provide backwards compatibility to support existing FieldCache API with the new implementation, so there is no redundant caching as client code migrates to the new API. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (SOLR-711) SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors
SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors -- Key: SOLR-711 URL: https://issues.apache.org/jira/browse/SOLR-711 Project: Solr Issue Type: Improvement Components: search Affects Versions: 1.3 Reporter: Fuad Efendi Fix For: 1.4 From [url]http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html[/url]: Scenario: - 10,000,000 documents in the index; - 5-10 terms per document; - 200,000 unique terms for a tokenized field. _Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._ Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario). See SimpleFacets: {{ public NamedList getFacetTermEnumCounts( SolrIndexSearcher searcher, DocSet docs, String field, int offset, int limit, int mincount, boolean missing, boolean sort, String prefix) throws IOException {...} }} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
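A sketch of the proposed counting path using the Lucene 2.x-era term vector API (class, method, and variable names here are illustrative, not the SOLR-711 patch): walk each hit's term vector and count its terms, instead of intersecting a cached filter per unique term:
{code}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocSet;

public class TermVectorFacetSketch {
    // For small DocSets: O(|docs| * termsPerDoc) instead of O(uniqueTerms) intersections.
    public NamedList countViaTermVectors(IndexReader reader, DocSet docs, String field)
            throws IOException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        DocIterator it = docs.iterator();
        while (it.hasNext()) {
            int docId = it.nextDoc();
            TermFreqVector tfv = reader.getTermFreqVector(docId, field); // null unless vectors are stored
            if (tfv == null) continue;
            for (String term : tfv.getTerms()) {
                Integer c = counts.get(term);
                counts.put(term, c == null ? 1 : c + 1);
            }
        }
        NamedList res = new NamedList();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            res.add(e.getKey(), e.getValue());
        }
        return res;
    }
}
{code}
This matches the scenario in the description: with 10-20,000 matching documents and 5-10 terms each, the loop touches at most ~200,000 term occurrences, versus 200,000 full DocSet intersections.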
[jira] Updated: (SOLR-711) SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors
[ https://issues.apache.org/jira/browse/SOLR-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated SOLR-711: - Comment: was deleted SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors -- Key: SOLR-711 URL: https://issues.apache.org/jira/browse/SOLR-711 Project: Solr Issue Type: Improvement Components: search Affects Versions: 1.3 Reporter: Fuad Efendi Fix For: 1.4 Original Estimate: 1680h Remaining Estimate: 1680h From [http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html]: Scenario: - 10,000,000 documents in the index; - 5-10 terms per document; - 200,000 unique terms for a tokenized field. _Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._ Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario). See SimpleFacets: {code:title=SimpleFacets.java|borderStyle=solid} public NamedList getFacetTermEnumCounts( SolrIndexSearcher searcher, DocSet docs, ... {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-711) SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors
[ https://issues.apache.org/jira/browse/SOLR-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated SOLR-711: - Description: From [http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html]: Scenario: - 10,000,000 documents in the index; - 5-10 terms per document; - 200,000 unique terms for a tokenized field. _Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._ Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario). See SimpleFacets.java: {code} public NamedList getFacetTermEnumCounts( SolrIndexSearcher searcher, DocSet docs, ... {code} was: From [http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html]: Scenario: - 10,000,000 documents in the index; - 5-10 terms per document; - 200,000 unique terms for a tokenized field. _Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._ Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario). See SimpleFacets: {code:title=SimpleFacets.java|borderStyle=solid} public NamedList getFacetTermEnumCounts( SolrIndexSearcher searcher, DocSet docs, ... {code} SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors -- Key: SOLR-711 URL: https://issues.apache.org/jira/browse/SOLR-711 Project: Solr Issue Type: Improvement Components: search Affects Versions: 1.3 Reporter: Fuad Efendi Fix For: 1.4 Original Estimate: 1680h Remaining Estimate: 1680h From [http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html]: Scenario: - 10,000,000 documents in the index; - 5-10 terms per document; - 200,000 unique terms for a tokenized field. _Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._ Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario). See SimpleFacets.java: {code} public NamedList getFacetTermEnumCounts( SolrIndexSearcher searcher, DocSet docs, ... {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-711) SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors
[ https://issues.apache.org/jira/browse/SOLR-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated SOLR-711: - Description: From [http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html]: Scenario: - 10,000,000 documents in the index; - 5-10 terms per document; - 200,000 unique terms for a tokenized field. _Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._ Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario). See SimpleFacets: {code:title=SimpleFacets.java|borderStyle=solid} public NamedList getFacetTermEnumCounts( SolrIndexSearcher searcher, DocSet docs, ... {code} was: From [url]http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html[/url]: Scenario: - 10,000,000 documents in the index; - 5-10 terms per document; - 200,000 unique terms for a tokenized field. _Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._ Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario). See SimpleFacets: {{ public NamedList getFacetTermEnumCounts( SolrIndexSearcher searcher, DocSet docs, String field, int offset, int limit, int mincount, boolean missing, boolean sort, String prefix) throws IOException {...} }} trivial formatting SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors -- Key: SOLR-711 URL: https://issues.apache.org/jira/browse/SOLR-711 Project: Solr Issue Type: Improvement Components: search Affects Versions: 1.3 Reporter: Fuad Efendi Fix For: 1.4 Original Estimate: 1680h Remaining Estimate: 1680h From [http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html]: Scenario: - 10,000,000 documents in the index; - 5-10 terms per document; - 200,000 unique terms for a tokenized field. _Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._ Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario). See SimpleFacets: {code:title=SimpleFacets.java|borderStyle=solid} public NamedList getFacetTermEnumCounts( SolrIndexSearcher searcher, DocSet docs, ... {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-671) Range queries with 'slong' field type do not retrieve correct results
Range queries with 'slong' field type do not retrieve correct results - Key: SOLR-671 URL: https://issues.apache.org/jira/browse/SOLR-671 Project: Solr Issue Type: Bug Environment: SOLR-1.3-DEV Schema: <!-- Numeric field types that manipulate the value into a string value that isn't human-readable in its internal form, but with a lexicographic ordering the same as the numeric ordering, so that range queries work correctly. --> <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/> <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/> <field name="timestamp" type="slong" indexed="true" stored="true"/> Reporter: Fuad Efendi Range queries always return all results (do not filter): timestamp:[1019386401114 TO 1219386401114] <lst name="debug"> <str name="rawquerystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="querystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery_toString">timestamp:[&#8;&#0;εごᅚ TO &#8;&#0;ѯ刯慚]</str> ... <str name="QParser">OldLuceneQParser</str> -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-671) Range queries with 'slong' field type do not retrieve correct results
[ https://issues.apache.org/jira/browse/SOLR-671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated SOLR-671: - Priority: Blocker (was: Major) Affects Version/s: 1.3 Range queries with 'slong' field type do not retrieve correct results - Key: SOLR-671 URL: https://issues.apache.org/jira/browse/SOLR-671 Project: Solr Issue Type: Bug Affects Versions: 1.3 Environment: SOLR-1.3-DEV Schema: <!-- Numeric field types that manipulate the value into a string value that isn't human-readable in its internal form, but with a lexicographic ordering the same as the numeric ordering, so that range queries work correctly. --> <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/> <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/> <field name="timestamp" type="slong" indexed="true" stored="true"/> Reporter: Fuad Efendi Priority: Blocker Original Estimate: 168h Remaining Estimate: 168h Range queries always return all results (do not filter): timestamp:[1019386401114 TO 1219386401114] <lst name="debug"> <str name="rawquerystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="querystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery_toString">timestamp:[&#8;&#0;εごᅚ TO &#8;&#0;ѯ刯慚]</str> ... <str name="QParser">OldLuceneQParser</str> -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-671) Range queries with 'slong' field type do not retrieve correct results
[ https://issues.apache.org/jira/browse/SOLR-671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated SOLR-671: - Priority: Trivial (was: Blocker) Issue Type: Test (was: Bug) I executed another query, which works fine: timestamp:[* TO 1000] - 0 results. Finally found that it works... Please close. Range queries with 'slong' field type do not retrieve correct results - Key: SOLR-671 URL: https://issues.apache.org/jira/browse/SOLR-671 Project: Solr Issue Type: Test Affects Versions: 1.3 Environment: SOLR-1.3-DEV Schema: <!-- Numeric field types that manipulate the value into a string value that isn't human-readable in its internal form, but with a lexicographic ordering the same as the numeric ordering, so that range queries work correctly. --> <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/> <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/> <field name="timestamp" type="slong" indexed="true" stored="true"/> Reporter: Fuad Efendi Priority: Trivial Original Estimate: 168h Remaining Estimate: 168h Range queries always return all results (do not filter): timestamp:[1019386401114 TO 1219386401114] <lst name="debug"> <str name="rawquerystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="querystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery_toString">timestamp:[&#8;&#0;εごᅚ TO &#8;&#0;ѯ刯慚]</str> ... <str name="QParser">OldLuceneQParser</str> -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-671) Range queries with 'slong' field type do not retrieve correct results
[ https://issues.apache.org/jira/browse/SOLR-671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated SOLR-671: - Priority: Major (was: Trivial) Issue Type: Bug (was: Test) Here is a test case, similar to the Arrays.sort() bug (unsigned...): {code} long time1 = System.currentTimeMillis() - 30*24*3600*1000; long time2 = 30*24*3600*1000; System.out.println(time1); System.out.println(time1-time2); Output: 1219389000674 1221091967970 {code} (time1-time2) > time1! What happens inside SOLR slong for such queries? Range queries with 'slong' field type do not retrieve correct results - Key: SOLR-671 URL: https://issues.apache.org/jira/browse/SOLR-671 Project: Solr Issue Type: Bug Affects Versions: 1.3 Environment: SOLR-1.3-DEV Schema: <!-- Numeric field types that manipulate the value into a string value that isn't human-readable in its internal form, but with a lexicographic ordering the same as the numeric ordering, so that range queries work correctly. --> <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/> <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/> <field name="timestamp" type="slong" indexed="true" stored="true"/> Reporter: Fuad Efendi Original Estimate: 168h Remaining Estimate: 168h Range queries always return all results (do not filter): timestamp:[1019386401114 TO 1219386401114] <lst name="debug"> <str name="rawquerystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="querystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery_toString">timestamp:[&#8;&#0;εごᅚ TO &#8;&#0;ѯ刯慚]</str> ... <str name="QParser">OldLuceneQParser</str> -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Issue Comment Edited: (SOLR-671) Range queries with 'slong' field type do not retrieve correct results
[ https://issues.apache.org/jira/browse/SOLR-671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12619223#action_12619223 ] funtick edited comment on SOLR-671 at 8/2/08 7:12 AM: -- Here is a test case, similar to the Arrays.sort() bug (unsigned...): {code} long time1 = System.currentTimeMillis(); long time2 = 30*24*3600*1000; System.out.println(time1); System.out.println(time1-time2); Output: 1219389000674 1221091967970 {code} (time1-time2) > time1! What happens inside SOLR slong for such queries? was (Author: funtick): Here is a test case, similar to the Arrays.sort() bug (unsigned...): {code} long time1 = System.currentTimeMillis() - 30*24*3600*1000; long time2 = 30*24*3600*1000; System.out.println(time1); System.out.println(time1-time2); Output: 1219389000674 1221091967970 {code} (time1-time2) > time1! What happens inside SOLR slong for such queries? Range queries with 'slong' field type do not retrieve correct results - Key: SOLR-671 URL: https://issues.apache.org/jira/browse/SOLR-671 Project: Solr Issue Type: Bug Affects Versions: 1.3 Environment: SOLR-1.3-DEV Schema: <!-- Numeric field types that manipulate the value into a string value that isn't human-readable in its internal form, but with a lexicographic ordering the same as the numeric ordering, so that range queries work correctly. --> <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/> <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/> <field name="timestamp" type="slong" indexed="true" stored="true"/> Reporter: Fuad Efendi Original Estimate: 168h Remaining Estimate: 168h Range queries always return all results (do not filter): timestamp:[1019386401114 TO 1219386401114] <lst name="debug"> <str name="rawquerystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="querystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery_toString">timestamp:[&#8;&#0;εごᅚ TO &#8;&#0;ѯ刯慚]</str> ... <str name="QParser">OldLuceneQParser</str> -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-671) Range queries with 'slong' field type do not retrieve correct results
[ https://issues.apache.org/jira/browse/SOLR-671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12619227#action_12619227 ] Fuad Efendi commented on SOLR-671: -- {code} long time1 = System.currentTimeMillis(); long time2 = 30*24*3600*1000; long time3 = time1 - time2; System.out.println("Time1: " + time1); System.out.println("Time2: " + time2); System.out.println("Time3: " + time3); Time1: 1217686478242 Time2: -1702967296 Time3: 1219389445538 {code} The bug is obvious... {code} long time1 = System.currentTimeMillis(); long time2 = 30*24*3600*1000L; long time3 = time1 - time2; System.out.println("Time1: " + time1); System.out.println("Time2: " + time2); System.out.println("Time3: " + time3); Time1: 1217686559557 Time2: 2592000000 Time3: 1215094559557 {code} Close it... Range queries with 'slong' field type do not retrieve correct results - Key: SOLR-671 URL: https://issues.apache.org/jira/browse/SOLR-671 Project: Solr Issue Type: Bug Environment: SOLR-1.3-DEV Schema: <!-- Numeric field types that manipulate the value into a string value that isn't human-readable in its internal form, but with a lexicographic ordering the same as the numeric ordering, so that range queries work correctly. --> <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/> <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/> <field name="timestamp" type="slong" indexed="true" stored="true"/> Reporter: Fuad Efendi Original Estimate: 168h Remaining Estimate: 168h Range queries always return all results (do not filter): timestamp:[1019386401114 TO 1219386401114] <lst name="debug"> <str name="rawquerystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="querystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery_toString">timestamp:[&#8;&#0;εごᅚ TO &#8;&#0;ѯ刯慚]</str> ... <str name="QParser">OldLuceneQParser</str> -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-671) Range queries with 'slong' field type do not retrieve correct results
[ https://issues.apache.org/jira/browse/SOLR-671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated SOLR-671:
------------------------------
             Priority: Trivial  (was: Major)
           Issue Type: Test  (was: Bug)
    Affects Version/s:     (was: 1.3)

> Range queries with 'slong' field type do not retrieve correct results
> ----------------------------------------------------------------------
>
>                 Key: SOLR-671
>                 URL: https://issues.apache.org/jira/browse/SOLR-671
>             Project: Solr
>          Issue Type: Test
>         Environment: SOLR-1.3-DEV
> Schema:
> <!-- Numeric field types that manipulate the value into a string value that isn't human-readable in its internal form, but with a lexicographic ordering the same as the numeric ordering, so that range queries work correctly. -->
> <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
> <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/>
> <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/>
> <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/>
> <field name="timestamp" type="slong" indexed="true" stored="true"/>
>            Reporter: Fuad Efendi
>            Priority: Trivial
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Range queries always return all results (do not filter): timestamp:[1019386401114 TO 1219386401114]
> <lst name="debug">
> <str name="rawquerystring">timestamp:[1019386401114 TO 1219386401114]</str>
> <str name="querystring">timestamp:[1019386401114 TO 1219386401114]</str>
> <str name="parsedquery">timestamp:[1019386401114 TO 1219386401114]</str>
> <str name="parsedquery_toString">timestamp:[&#8;&#0;εごᅚ TO &#8;&#0;ѯ刯慚]</str>
> ...
> <str name="QParser">OldLuceneQParser</str>
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-665) FIFO Cache (Unsynchronized): 9x times performance boost
[ https://issues.apache.org/jira/browse/SOLR-665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619058#action_12619058 ] Fuad Efendi commented on SOLR-665:
-----------------------------------
The guys at LingPipe (Natural Language Processing, http://alias-i.com/) are using excellent Map implementations with an optimistic concurrency strategy:
http://alias-i.com/lingpipe/docs/api/com/aliasi/util/FastCache.html
http://alias-i.com/lingpipe/docs/api/com/aliasi/util/HardFastCache.html

> FIFO Cache (Unsynchronized): 9x times performance boost
> --------------------------------------------------------
>
>                 Key: SOLR-665
>                 URL: https://issues.apache.org/jira/browse/SOLR-665
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 1.3
>         Environment: JRockit R27 (Java 6)
>            Reporter: Fuad Efendi
>         Attachments: ConcurrentFIFOCache.java, ConcurrentFIFOCache.java, ConcurrentLRUCache.java, ConcurrentLRUWeakCache.java, FIFOCache.java, SimplestConcurrentLRUCache.java
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Attached is a modified version of LRUCache where
> 1. map = new LinkedHashMap(initialSize, 0.75f, false) - so that reordering/true (the performance bottleneck of LRU) is replaced with insertion-order/false (so that it becomes FIFO)
> 2. Almost all (absolutely unnecessary) synchronized statements are commented out
> See discussion at http://www.nabble.com/LRUCache---synchronized%21--td16439831.html
> Performance metrics (taken from SOLR Admin):
> LRU:
> Requests: 7638
> Average Time-Per-Request: 15300
> Average Request-per-Second: 0.06
> FIFO:
> Requests: 3355
> Average Time-Per-Request: 1610
> Average Request-per-Second: 0.11
> Performance increased 9 times, which roughly corresponds to the number of CPUs in the system, http://www.tokenizer.org/ (Shopping Search Engine at Tokenizer.org)
> Current number of documents: 7494689
> name: filterCache
> class: org.apache.solr.search.LRUCache
> version: 1.0
> description: LRU Cache(maxSize=1000, initialSize=1000)
> stats: lookups : 15966954582
> hits : 16391851546
> hitratio : 0.102
> inserts : 4246120
> evictions : 0
> size : 2668705
> cumulative_lookups : 16415839763
> cumulative_hits : 16411608101
> cumulative_hitratio : 0.99
> cumulative_inserts : 4246246
> cumulative_evictions : 0
> Thanks
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
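The FIFO variant described in the issue above hinges on LinkedHashMap's third constructor argument. Here is a minimal sketch of that idea, assuming a simple size bound; the FifoCache class name is hypothetical, and this is not the attached FIFOCache.java:
{code}
import java.util.LinkedHashMap;
import java.util.Map;

// The third constructor argument picks the ordering the issue description
// refers to: true = access order (LRU), false = insertion order (FIFO).
// In insertion-order mode get() never relinks entries, which is why the
// LRU reordering bottleneck disappears.
public class FifoCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    public FifoCache(int initialSize, int maxSize) {
        super(initialSize, 0.75f, false); // false => insertion order => FIFO
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the oldest insertion, not the least recently used entry.
        return size() > maxSize;
    }
}
{code}
Worth noting: even in insertion-order mode a LinkedHashMap is not safe under concurrent writers, so removing synchronization, as the attached patch reportedly does, trades strict safety for throughput on read-mostly workloads.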
[jira] Commented: (SOLR-667) Alternate LRUCache implementation
[ https://issues.apache.org/jira/browse/SOLR-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12618750#action_12618750 ] Fuad Efendi commented on SOLR-667:
-----------------------------------
bq. ...safety, where nothing bad ever happens to an object.
When _SOLR_ adds an object to the cache or removes it from the cache, it does not change the object; it manipulates internal arrays of pointers to objects (which are probably atomic, but I don't know such JVM/GC internals in depth...)
Looks heavy with TreeSet...

> Alternate LRUCache implementation
> ---------------------------------
>
>                 Key: SOLR-667
>                 URL: https://issues.apache.org/jira/browse/SOLR-667
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Noble Paul
>         Attachments: ConcurrentLRUCache.java
>
> The only available SolrCache, i.e. LRUCache, is based on _LinkedHashMap_ which has _get()_ also synchronized. This can cause severe bottlenecks for faceted search. Any alternate implementation which can be faster/better must be considered.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-665) FIFO Cache (Unsynchronized): 9x times performance boost
[ https://issues.apache.org/jira/browse/SOLR-665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12618766#action_12618766 ] Fuad Efendi commented on SOLR-665:
-----------------------------------
I don't think ConcurrentHashMap will improve performance, and ConcurrentMap is not what SOLR needs:
{code}
V putIfAbsent(K key, V value);
V replace(K key, V value);
boolean replace(K key, V oldValue, V newValue);
{code}
There is also some(...) overhead with _oldValue_ and _the state of the hash table at some point_; additional memory requirements; etc... Can we design something plainly simpler, focused on SOLR-specific requirements, without all the functionality of Map etc.?

> FIFO Cache (Unsynchronized): 9x times performance boost
> --------------------------------------------------------
>
>                 Key: SOLR-665
>                 URL: https://issues.apache.org/jira/browse/SOLR-665
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 1.3
>         Environment: JRockit R27 (Java 6)
>            Reporter: Fuad Efendi
>         Attachments: ConcurrentFIFOCache.java, ConcurrentFIFOCache.java, ConcurrentLRUCache.java, ConcurrentLRUWeakCache.java, FIFOCache.java, SimplestConcurrentLRUCache.java
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Attached is a modified version of LRUCache where
> 1. map = new LinkedHashMap(initialSize, 0.75f, false) - so that reordering/true (the performance bottleneck of LRU) is replaced with insertion-order/false (so that it becomes FIFO)
> 2. Almost all (absolutely unnecessary) synchronized statements are commented out
> See discussion at http://www.nabble.com/LRUCache---synchronized%21--td16439831.html
> Performance metrics (taken from SOLR Admin):
> LRU:
> Requests: 7638
> Average Time-Per-Request: 15300
> Average Request-per-Second: 0.06
> FIFO:
> Requests: 3355
> Average Time-Per-Request: 1610
> Average Request-per-Second: 0.11
> Performance increased 9 times, which roughly corresponds to the number of CPUs in the system, http://www.tokenizer.org/ (Shopping Search Engine at Tokenizer.org)
> Current number of documents: 7494689
> name: filterCache
> class: org.apache.solr.search.LRUCache
> version: 1.0
> description: LRU Cache(maxSize=1000, initialSize=1000)
> stats: lookups : 15966954582
> hits : 16391851546
> hitratio : 0.102
> inserts : 4246120
> evictions : 0
> size : 2668705
> cumulative_lookups : 16415839763
> cumulative_hits : 16411608101
> cumulative_hitratio : 0.99
> cumulative_inserts : 4246246
> cumulative_evictions : 0
> Thanks
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
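One way to picture the "plainly simpler" cache asked for above is an API stripped down to get/put with last-write-wins semantics. A hypothetical sketch follows; SimpleCache is an illustrative name, not Solr code, and the backing ConcurrentHashMap is chosen only to keep the example runnable:
{code}
import java.util.concurrent.ConcurrentHashMap;

// Only the two operations a recomputable cache needs: no ConcurrentMap
// contract exposed, no putIfAbsent/replace semantics to reason about.
public final class SimpleCache<K, V> {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();

    public V get(K key) {
        return map.get(key); // non-blocking read path
    }

    public void put(K key, V value) {
        // If two threads race on the same key, the last write wins; that is
        // acceptable for a cache whose entries can always be recomputed.
        map.put(key, value);
    }
}
{code}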
[jira] Commented: (SOLR-667) Alternate LRUCache implementation
[ https://issues.apache.org/jira/browse/SOLR-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12618805#action_12618805 ] Fuad Efendi commented on SOLR-667:
-----------------------------------
Paul, I have never ever suggested using 'volatile' 'to avoid synchronization' for concurrent programming. I only noticed some extremely stupid code where SOLR uses _double_ synchronization with an AtomicLong inside:
{code}
public synchronized Object put(Object key, Object value) {
  if (state == State.LIVE) {
    stats.inserts.incrementAndGet();
  }
  synchronized (map) {
    // increment local inserts regardless of state???
    // it does make it more consistent with the current size...
    inserts++;
    return map.put(key, value);
  }
}
{code}
Each tool has an area of applicability, and even ConcurrentHashMap only slightly intersects with SOLR's needs; SOLR does not need a 'consistent view at a point in time' on cached objects. 'volatile' is part of the Java specs, and implemented differently by different vendors. I use volatile (instead of the more expensive AtomicLong) only and only to prevent the JVM HotSpot optimizer from doing some _not-applicable_ stuff...

> Alternate LRUCache implementation
> ---------------------------------
>
>                 Key: SOLR-667
>                 URL: https://issues.apache.org/jira/browse/SOLR-667
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Noble Paul
>         Attachments: ConcurrentLRUCache.java
>
> The only available SolrCache, i.e. LRUCache, is based on _LinkedHashMap_ which has _get()_ also synchronized. This can cause severe bottlenecks for faceted search. Any alternate implementation which can be faster/better must be considered.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
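The point being made above, restated: the outer synchronized on the method already serializes callers, so the inner synchronized(map) block and the AtomicLong add cost without adding safety, provided every other access to the map and the counters takes the same monitor. A hypothetical single-lock rewrite under that assumption (class and field names are illustrative, not a Solr patch):
{code}
import java.util.HashMap;
import java.util.Map;

// Once every access to 'map' and the counters is guarded by the same 'this'
// monitor, plain long fields suffice and both the nested lock and the
// AtomicLong become redundant.
public class SingleLockCache {
    enum State { CREATED, LIVE }

    private final Map<Object, Object> map = new HashMap<>();
    private State state = State.LIVE;
    private long statsInserts; // was stats.inserts.incrementAndGet() on an AtomicLong
    private long inserts;      // local counter, same monitor

    public synchronized Object put(Object key, Object value) {
        if (state == State.LIVE) {
            statsInserts++;
        }
        inserts++;
        return map.put(key, value);
    }

    public synchronized Object get(Object key) {
        // Reads must take the same monitor for the single-lock argument to hold.
        return map.get(key);
    }
}
{code}
This is a sketch of the argument, not a drop-in replacement: it only holds if get() and every other accessor synchronize on the same object.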