[jira] [Commented] (LUCENE-2605) queryparser parses on whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325862#comment-15325862 ]

Fuad Efendi commented on LUCENE-2605:
-------------------------------------

This was a really painful problem (unexpected "tokenization" by the query parser!). Thank you for fixing it!

> queryparser parses on whitespace
> --------------------------------
>
>                 Key: LUCENE-2605
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2605
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/queryparser
>            Reporter: Robert Muir
>            Assignee: Steve Rowe
>             Fix For: 4.9, 6.0
>
>         Attachments: LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch
>
> The queryparser parses input on whitespace, and sends each whitespace-separated term to its own independent token stream.
> This breaks the following at query time, because they can't see across whitespace boundaries:
> * n-gram analysis
> * shingles
> * synonyms (especially multi-word, for whitespace-separated languages)
> * languages where a 'word' can contain whitespace (e.g. Vietnamese)
> It's also rather unexpected: users think their charfilters/tokenizers/tokenfilters will do the same thing at index and query time, but in many cases they can't. Instead, preferably the queryparser would parse around only real 'operators'.
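As a quick illustration of the problem being fixed here, a minimal sketch against the Lucene 6.x-era classic QueryParser API (the field name and analyzer choice are assumptions, not taken from the issue):

{code}
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

// The parser splits "wi fi" on whitespace *before* analysis, so each
// term goes through the analysis chain alone: a multi-word synonym or
// shingle filter never sees both tokens together.
public class WhitespaceSplitDemo {
  public static void main(String[] args) throws Exception {
    QueryParser qp = new QueryParser("body", new WhitespaceAnalyzer());
    Query q = qp.parse("wi fi");
    // Prints something like "body:wi body:fi" -- two independent
    // TermQuerys, not one analyzed phrase.
    System.out.println(q);
  }
}
{code}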
[jira] [Commented] (SOLR-2357) Thread Local memory leaks on restart
[ https://issues.apache.org/jira/browse/SOLR-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189540#comment-15189540 ]

Fuad Efendi commented on SOLR-2357:
-----------------------------------

- Tomcat leaks memory when a web application uses custom ThreadLocal instances (as key and value) and fails to remove them.
- Tomcat 7.0.6 and later work around the problem by renewing the threads in the pool. Please see http://wiki.apache.org/tomcat/MemoryLeakProtection for details.

Can we close this issue now? Thanks,

> Thread Local memory leaks on restart
> ------------------------------------
>
>                 Key: SOLR-2357
>                 URL: https://issues.apache.org/jira/browse/SOLR-2357
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - Solr Cell (Tika extraction), search
>    Affects Versions: 1.4.1
>         Environment: Windows Server 2008, Apache Tomcat 7.0.8, Java 1.6.23
>            Reporter: Gus Heck
>              Labels: memory_leak, threadlocal
>
> Restarting Solr (via a change to a watched resource, or via the manager app, for example) after submitting documents with Solr Cell gives the following message (many, many times) and causes Tomcat to shut down completely.
> SEVERE: The web application [/solr] created a ThreadLocal with key of type [org.apache.solr.common.util.DateUtil.ThreadLocalDateFormat] (value [org.apache.solr.common.util.DateUtil$ThreadLocalDateFormat@dc30dfa]) and a value of type [java.text.SimpleDateFormat] (value [java.text.SimpleDateFormat@5af7aed5]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
> Feb 10, 2011 7:17:53 AM org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
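For readers hitting the same warning in their own webapps, a minimal sketch of the kind of cleanup Tomcat is asking for (hypothetical class, not Solr's actual fix):

{code}
import java.text.SimpleDateFormat;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

// A ThreadLocal held in a static field must be removed on shutdown,
// or the webapp classloader can never be garbage collected.
public class DateFormatHolder implements ServletContextListener {

  static final ThreadLocal<SimpleDateFormat> FORMAT =
      new ThreadLocal<SimpleDateFormat>() {
        @Override
        protected SimpleDateFormat initialValue() {
          return new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        }
      };

  public void contextInitialized(ServletContextEvent sce) { }

  public void contextDestroyed(ServletContextEvent sce) {
    // Best effort: remove() clears the value for the *current* thread only.
    // Values parked on other pool threads are exactly why Tomcat renews
    // its worker threads after the webapp stops.
    FORMAT.remove();
  }
}
{code}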
[jira] [Commented] (LUCENE-2605) queryparser parses on whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15188046#comment-15188046 ]

Fuad Efendi commented on LUCENE-2605:
-------------------------------------

Is this resolved? Is anyone working on it? Thanks

> queryparser parses on whitespace
> --------------------------------
>
>                 Key: LUCENE-2605
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2605
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/queryparser
>            Reporter: Robert Muir
>             Fix For: 4.9, master
>
> The queryparser parses input on whitespace, and sends each whitespace-separated term to its own independent token stream.
> This breaks the following at query time, because they can't see across whitespace boundaries:
> * n-gram analysis
> * shingles
> * synonyms (especially multi-word, for whitespace-separated languages)
> * languages where a 'word' can contain whitespace (e.g. Vietnamese)
> It's also rather unexpected: users think their charfilters/tokenizers/tokenfilters will do the same thing at index and query time, but in many cases they can't. Instead, preferably the queryparser would parse around only real 'operators'.
[jira] [Updated] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: SOLR-2233.patch

Revised version of the old patch (11-Nov-2010); the previous version was hard to read ;-)

Main changes:
- the connection is no longer closed and reopened after a timeout
- the connection can no longer be closed by a second thread unexpectedly, behind the first thread's back (the initial bug is fixed)

Please note it works fine with MS SQL Server. However, running concurrent statements (in concurrent threads) over the same connection object is tricky; a JDBC driver may or may not support it (the JDBC-ODBC bridge, for instance).

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
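A minimal sketch of the one-connection-per-thread idea the patch describes (illustrative names; this is not the patch code itself):

{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/** Sketch: one Connection per worker thread, so thread B can never
 *  close a connection that thread A is still iterating over. */
public class PerThreadConnections {

  private final String url, user, password;

  public PerThreadConnections(String url, String user, String password) {
    this.url = url;
    this.user = user;
    this.password = password;
  }

  private final ThreadLocal<Connection> conn = new ThreadLocal<Connection>() {
    @Override
    protected Connection initialValue() {
      try {
        return DriverManager.getConnection(url, user, password);
      } catch (SQLException e) {
        throw new RuntimeException("could not open connection", e);
      }
    }
  };

  public Connection get() {
    return conn.get();
  }

  /** Each worker thread must call this when the import finishes. */
  public void closeCurrent() throws SQLException {
    Connection c = conn.get();
    conn.remove();
    c.close();
  }
}
{code}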
[jira] [Commented] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044460#comment-13044460 ]

Fuad Efendi commented on SOLR-2233:
-----------------------------------

Note that with this implementation the connection is closed only when the main instance of the main class is finalized, i.e. the connection is effectively never closed; so the code is still naive (the server can close the connection; how will we know that?). Fortunately that hasn't happened in my specific case over a few months of nightly imports...

We should use connection pooling; that would be the next improvement. conn.close() would then return the connection to the pool (without closing it), and the pool is responsible for testing connections for liveness.

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
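A sketch of the pooling approach suggested here, assuming Apache Commons DBCP 2 as the pool (DIH does not ship one; the URL, credentials, and query are hypothetical):

{code}
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.apache.commons.dbcp2.BasicDataSource;

public class PooledImport {
  public static void main(String[] args) throws SQLException {
    BasicDataSource pool = new BasicDataSource();
    pool.setUrl("jdbc:mysql://localhost/solrimport"); // hypothetical URL
    pool.setUsername("solr");
    pool.setPassword("secret");
    pool.setMaxTotal(16);                 // one connection per import thread
    pool.setValidationQuery("SELECT 1");  // liveness test on checkout

    Connection c = pool.getConnection();  // borrowed from the pool
    try {
      Statement st = c.createStatement();
      ResultSet rs = st.executeQuery("SELECT id FROM item"); // hypothetical
      while (rs.next()) {
        // ... feed the row to the import ...
      }
    } finally {
      // Returns the connection to the pool; does not close the socket.
      c.close();
    }
  }
}
{code}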
[jira] [Updated] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: SOLR-2233.patch

- fixed a small bug with closeResources()
- each ResultSetIterator now has its own (separate) Connection instance: extremely good for performance (multithreading), but not transactional (different connections can return different results); we are optimistic

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch, SOLR-2233.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Updated] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: SOLR-2233-001.patch

To avoid mistakes I added a version number: SOLR-2233-001.patch (the previous attachment was wrong).

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-001.patch, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch, SOLR-2233.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Updated] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: (was: SOLR-2233.patch)

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Updated] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: (was: FE-patch.txt)

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Updated] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: (was: SOLR-2233-JdbcDataSource.patch)

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Updated] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: SOLR-2233-001.patch

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: SOLR-2233-001.patch, SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Commented] (SOLR-304) Dynamic fields cause IsValidUpdateIndexDocument to fail
[ https://issues.apache.org/jira/browse/SOLR-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044465#comment-13044465 ]

Fuad Efendi commented on SOLR-304:
----------------------------------

Such an old bug report, and no watchers; case closed, cannot reproduce ;-)

Dynamic fields cause IsValidUpdateIndexDocument to fail
-------------------------------------------------------

                Key: SOLR-304
                URL: https://issues.apache.org/jira/browse/SOLR-304
            Project: Solr
         Issue Type: Bug
         Components: clients - C#
   Affects Versions: 1.2
           Reporter: Jeff Rodenburg
           Assignee: Jeff Rodenburg

I am using solrsharp-1.2-07082007. I have a dynamicField declared in my schema.xml file as
<dynamicField name="*_demo" type="text_ws" indexed="true" stored="true"/>
but if I try to add a field from my VB.NET application with doc.Add("id_demo", s), where s is a string value, the document fails solrSearcher.SolrSchema.IsValidUpdateIndexDocument(doc).
[jira] [Updated] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Affects Version/s: 1.4
                       1.4.1
                       3.1
                       3.2

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.4, 1.4.1, 1.5, 3.1, 3.2
           Reporter: Fuad Efendi
        Attachments: SOLR-2233-001.patch, SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Commented] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041788#comment-13041788 ]

Fuad Efendi commented on SOLR-2233:
-----------------------------------

Hi Frank, yes, correct; although it's hard to recall what I did... unfortunately it got reformatted... I can resubmit (generate the patch in the Lucene style), but it's better to redo it from scratch. The existing code doesn't run multithreaded, and it is slow even single-threaded (inappropriate JDBC usage).

I completely removed this code:
{code}
-  private Connection getConnection() throws Exception {
-    long currTime = System.currentTimeMillis();
-    if (currTime - connLastUsed > CONN_TIME_OUT) {
-      synchronized (this) {
-        Connection tmpConn = factory.call();
-        closeConnection();
-        connLastUsed = System.currentTimeMillis();
-        return conn = tmpConn;
-      }
-    } else {
-      connLastUsed = currTime;
-      return conn;
-    }
-  }
{code}

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Commented] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041811#comment-13041811 ]

Fuad Efendi commented on SOLR-2233:
-----------------------------------

The existing implementation uses a single Connection for a 10-second time interval, and even shares this object with other threads (if you run multithreaded). So the problem becomes vendor specific: to open a new connection to Oracle 10g, for instance, we need to authenticate, and in dedicated-server mode that can take a long time, plus dedicated resources for each connection; the server can get overloaded. MySQL, on the other hand, does not close the connection internally (even if you call conn.close() in your code); the connection is simply returned to a pool of connection objects. And what if something goes wrong (what if MySQL or Oracle internals need additional time for closing or opening)? We might even hit problems like "too many connections". Modern apps don't see that because they use managed connection pooling instead of close-open...

I need to verify this patch; it was a quick solution to make the threads=... attribute work, and it currently runs in a production system (MS SQL).

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Commented] (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041884#comment-13041884 ]

Fuad Efendi commented on SOLR-2233:
-----------------------------------

Hi Frank, thanks for the patch; unfortunately it is not thread safe... If you don't mind, let me continue working on this; I want to use an internal connection pool (if a JNDI data source is not available)...

My initial patch already contains *too much*; the new one will remove ResultSetIterator and make the code much simpler to understand (and multithreaded). The code shouldn't have any dependency on rare *optionally supported* patterns such as ResultSet.TYPE_FORWARD_ONLY; READ_ONLY should be managed differently (and it is hard to manage if the data size is huge and the data is concurrently updated while we are importing it).

A possible solution could be connection.close() after reading each single record (with the initial query returning the PKs of the records), but that would be a next step... I wrote the initial patch for a production system where complex 10-query-based documents (about 500k docs) took many hours to import (and now it takes only about 40 minutes). (And what happens if we have a network problem while we are in the middle of an Iterator?)

Thanks

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch, SOLR-2233.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] [Commented] (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034999#comment-13034999 ]

Fuad Efendi commented on LUCENE-2230:
-------------------------------------

I believe this issue should be closed due to the significant performance improvements related to LUCENE-2089 and LUCENE-2258. I don't think there is any interest from the community in continuing with this (BK-tree and Strike a Match) naive approach, although some people found it useful. Of course we might add a few more distance implementations as a separate improvement. Please close it. Thanks

Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
----------------------------------------------------------------

                Key: LUCENE-2230
                URL: https://issues.apache.org/jira/browse/LUCENE-2230
            Project: Lucene - Java
         Issue Type: Improvement
         Components: core/search
   Affects Versions: 3.0
        Environment: Lucene currently uses a brute-force full-term scanner and calculates the distance for each term. The new BKTree structure improves performance on average 20 times when the distance is 1, and 3 times when the distance is 3. I tested with an index of several million docs and 250,000 terms. The new algorithm uses integer distances between objects.
           Reporter: Fuad Efendi
        Attachments: BKTree.java, Distance.java, DistanceImpl.java, FuzzyTermEnumNEW.java, FuzzyTermEnumNEW.java

  Original Estimate: 1m
 Remaining Estimate: 1m

W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973
http://portal.acm.org/citation.cfm?doid=362003.362025
I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick Johnson, Google).
Additionally, the simplified algorithm at http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more logically correct than Levenshtein distance, and it is 3-5 times faster (isolated tests).
Big list of distance implementations: http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm
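For context, here is a minimal BK-tree sketch with a plain Levenshtein metric (an illustrative reimplementation, not the attached BKTree.java):

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal BK-tree over an integer string metric (Levenshtein here).
 *  Search visits only children whose edge distance d satisfies
 *  |d - dist(query, node)| <= maxDist, pruning most of the term set. */
public class BKTree {

  private String term;
  private final Map<Integer, BKTree> children = new HashMap<Integer, BKTree>();

  public void add(String s) {
    if (term == null) { term = s; return; }
    int d = levenshtein(s, term);
    BKTree child = children.get(d);
    if (child == null) {
      child = new BKTree();
      children.put(d, child);
    }
    child.add(s);
  }

  public List<String> search(String query, int maxDist) {
    List<String> hits = new ArrayList<String>();
    collect(query, maxDist, hits);
    return hits;
  }

  private void collect(String query, int maxDist, List<String> hits) {
    if (term == null) return;
    int d = levenshtein(query, term);
    if (d <= maxDist) hits.add(term);
    // Triangle inequality: only these edges can contain matches.
    for (int edge = d - maxDist; edge <= d + maxDist; edge++) {
      BKTree child = children.get(edge);
      if (child != null) child.collect(query, maxDist, hits);
    }
  }

  static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1], cur = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;
    for (int i = 1; i <= a.length(); i++) {
      cur[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
        cur[j] = Math.min(sub, Math.min(prev[j] + 1, cur[j - 1] + 1));
      }
      int[] t = prev; prev = cur; cur = t;
    }
    return prev[b.length()];
  }
}
{code}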
[jira] [Commented] (SOLR-2338) improved per-field similarity integration into schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024811#comment-13024811 ]

Fuad Efendi commented on SOLR-2338:
-----------------------------------

test-files/solr/conf/schema.xml contains a sample of the per-field definitions; example/solr/schema.xml doesn't have one yet.

improved per-field similarity integration into schema.xml
----------------------------------------------------------

                Key: SOLR-2338
                URL: https://issues.apache.org/jira/browse/SOLR-2338
            Project: Solr
         Issue Type: Improvement
         Components: Schema and Analysis
   Affects Versions: 4.0
           Reporter: Robert Muir
           Assignee: Robert Muir
            Fix For: 4.0
        Attachments: SOLR-2338.patch, SOLR-2338.patch, SOLR-2338.patch

Currently, since LUCENE-2236, we can enable Similarity per field, but in schema.xml there is only a 'global' factory for the SimilarityProvider. In my opinion this is too low-level, because to customize Similarity on a per-field basis you have to set your own CustomSimilarityProvider with <similarity class="..."/> and manage the per-field mapping yourself in Java code. Instead I think it would be better if you could just specify the Similarity in the FieldType, like after <analyzer>. As far as the example goes, one idea from LUCENE-1360 was to make a short_text or metadata_text used by the various metadata fields in the example that has better norm quantization for its shortness...
[jira] [Commented] (SOLR-792) Pivot (ie: Decision Tree) Faceting Component
[ https://issues.apache.org/jira/browse/SOLR-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015415#comment-13015415 ]

Fuad Efendi commented on SOLR-792:
----------------------------------

Hi, Jason Folk posted:
bq. facet.tree currently seems to bark at exclusion tags, I wouldn't mind trying to take a crack at this (as I currently do need it), but not really sure where to begin looking.

Is this resolved? My client currently uses pivot faceting in production, a few million records. If it's not resolved yet I can dig into it...

Pivot (ie: Decision Tree) Faceting Component
--------------------------------------------

                Key: SOLR-792
                URL: https://issues.apache.org/jira/browse/SOLR-792
            Project: Solr
         Issue Type: New Feature
           Reporter: Erik Hatcher
           Assignee: Yonik Seeley
           Priority: Minor
        Attachments: SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-as-helper-class.patch, SOLR-792-raw-type.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch

A component to do multi-level faceting.
[jira] Commented: (SOLR-2006) DataImportHandler creates multiple DB connections during a delta update
[ https://issues.apache.org/jira/browse/SOLR-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968866#action_12968866 ]

Fuad Efendi commented on SOLR-2006:
-----------------------------------

I believe it is resolved in SOLR-2233.

DataImportHandler creates multiple DB connections during a delta update
------------------------------------------------------------------------

                Key: SOLR-2006
                URL: https://issues.apache.org/jira/browse/SOLR-2006
            Project: Solr
         Issue Type: Improvement
         Components: contrib - DataImportHandler
   Affects Versions: 1.4, 1.4.1, 3.1, 4.0
           Reporter: Lance Norskog

The DataImportHandler code for delta updates creates a separate copy of each datasource for each entity in the document. This creates a separate JDBC connection for each entity. In some relational databases, connections are a heavyweight resource and their use should be limited. A JDBC pool would help avoid this problem, and also assist in doing multi-threaded DIH indexing jobs.
[jira] Commented: (SOLR-1828) DIH Handler separate connection for delta and full index
[ https://issues.apache.org/jira/browse/SOLR-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968869#action_12968869 ]

Fuad Efendi commented on SOLR-1828:
-----------------------------------

Related issue patch: SOLR-2233

DIH Handler separate connection for delta and full index
---------------------------------------------------------

                Key: SOLR-1828
                URL: https://issues.apache.org/jira/browse/SOLR-1828
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.4
        Environment: Linux
           Reporter: Bill Bell

We would like to configure the DIH handler to use a SLAVE connection for FULL imports, and a MASTER connection for DELTA imports.
Use case:
1. The DIH full index slams the database pretty hard, and we would like those to run once a day on the SLAVE MySQL connection.
2. The DIH delta index does not hit the database very hard, and we would like that to run off the MASTER MySQL connection.
Currently the DIH handler does not allow a name="db-1" on the deltaQuery=; it is only at the entity level. Please add it to each delta, full, etc. as an option.
[jira] Issue Comment Edited: (SOLR-1828) DIH Handler separate connection for delta and full index
[ https://issues.apache.org/jira/browse/SOLR-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968869#action_12968869 ]

Fuad Efendi edited comment on SOLR-1828 at 12/7/10 1:55 PM:
------------------------------------------------------------

Performance-related issue patch: SOLR-2233

This seems to be wrong: MySQL is better optimized for read-mostly...? It shouldn't be like that... all reads should go to the slave...

was (Author: funtick):
Related issue patch: SOLR-2233

DIH Handler separate connection for delta and full index
---------------------------------------------------------

                Key: SOLR-1828
                URL: https://issues.apache.org/jira/browse/SOLR-1828
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.4
        Environment: Linux
           Reporter: Bill Bell

We would like to configure the DIH handler to use a SLAVE connection for FULL imports, and a MASTER connection for DELTA imports.
Use case:
1. The DIH full index slams the database pretty hard, and we would like those to run once a day on the SLAVE MySQL connection.
2. The DIH delta index does not hit the database very hard, and we would like that to run off the MASTER MySQL connection.
Currently the DIH handler does not allow a name="db-1" on the deltaQuery=; it is only at the entity level. Please add it to each delta, full, etc. as an option.
[jira] Issue Comment Edited: (SOLR-1828) DIH Handler separate connection for delta and full index
[ https://issues.apache.org/jira/browse/SOLR-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968869#action_12968869 ]

Fuad Efendi edited comment on SOLR-1828 at 12/7/10 1:56 PM:
------------------------------------------------------------

Performance-related issue patch: SOLR-2233

This seems to be wrong: MySQL-MASTER is better optimized for read-mostly?! It shouldn't be like that... all reads should go to the slave...

was (Author: funtick):
Performance-related issue patch: SOLR-2233

This seems to be wrong: MySQL is better optimized for read-mostly...? It shouldn't be like that... all reads should go to the slave...

DIH Handler separate connection for delta and full index
---------------------------------------------------------

                Key: SOLR-1828
                URL: https://issues.apache.org/jira/browse/SOLR-1828
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.4
        Environment: Linux
           Reporter: Bill Bell

We would like to configure the DIH handler to use a SLAVE connection for FULL imports, and a MASTER connection for DELTA imports.
Use case:
1. The DIH full index slams the database pretty hard, and we would like those to run once a day on the SLAVE MySQL connection.
2. The DIH delta index does not hit the database very hard, and we would like that to run off the MASTER MySQL connection.
Currently the DIH handler does not allow a name="db-1" on the deltaQuery=; it is only at the entity level. Please add it to each delta, full, etc. as an option.
[jira] Commented: (SOLR-1916) investigate DIH use of default locale
[ https://issues.apache.org/jira/browse/SOLR-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968889#action_12968889 ]

Fuad Efendi commented on SOLR-1916:
-----------------------------------

I had a similar issue: Microsoft SQL Server, DATETIME type. DIH stores the Date on the filesystem using Solr's default timezone and locale. Then delta import executes a query with WHERE last_update_date > '01.12.2010' (just as a sample); a localized string is used instead of a real date. And the timezone of the remote database is not necessarily the same as Solr's. Fortunately, it's easy to fix (without altering code).

investigate DIH use of default locale
--------------------------------------

                Key: SOLR-1916
                URL: https://issues.apache.org/jira/browse/SOLR-1916
            Project: Solr
         Issue Type: Task
         Components: contrib - DataImportHandler
   Affects Versions: 3.1, 4.0
           Reporter: Robert Muir
           Priority: Blocker
            Fix For: 3.1, 4.0

This is a spinoff from LUCENE-2466. In that issue I changed my locale to various locales and found some problems in Lucene/Solr triggered by use of the default Locale. I noticed some use of the default locale for Date operations in DIH (TimeZone.getDefault/Locale.getDefault) and, while no tests fail, I think it might be better to support a locale parameter for this. The wiki documents that numeric parsing can support localized numeric formats: http://wiki.apache.org/solr/DataImportHandler#NumberFormatTransformer
In both cases, I don't think we should ever use the default Locale. If no Locale is provided, I find that new Locale("") -- the Unicode root locale -- is a better default for a server situation in a lot of cases, as it won't change depending on the computer; or perhaps we just make Locale params mandatory for this. Finally, in both cases, if localized numbers/dates are explicitly supported, I think we should come up with a test strategy to ensure everything is working. One idea is to do something similar to, or make use of, Lucene's LocalizedTestCase.
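A small demonstration of the failure mode described above (both date patterns are assumptions, for illustration only):

{code}
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// The same Date renders differently depending on the JVM's default
// locale/timezone, so a delta-import WHERE clause built from it may
// not parse correctly on the database side.
public class LocaleDateDemo {
  public static void main(String[] args) {
    Date d = new Date(0L); // the epoch, just for illustration

    // Locale/timezone dependent: whatever the server JVM happens to use.
    SimpleDateFormat def = new SimpleDateFormat("dd MMM yyyy HH:mm:ss");
    System.out.println(def.format(d));

    // Deterministic: root locale and explicit UTC, safe to embed in SQL
    // (assuming the database expects this pattern).
    SimpleDateFormat fixed =
        new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ROOT);
    fixed.setTimeZone(TimeZone.getTimeZone("UTC"));
    System.out.println(fixed.format(d));
  }
}
{code}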
[jira] Commented: (SOLR-2186) DataImportHandler multi-threaded option throws exception
[ https://issues.apache.org/jira/browse/SOLR-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968890#action_12968890 ]

Fuad Efendi commented on SOLR-2186:
-----------------------------------

I resolved this issue for SQL in SOLR-2233; it was related to 'thread A closes the connection needed by thread B'.

DataImportHandler multi-threaded option throws exception
---------------------------------------------------------

                Key: SOLR-2186
                URL: https://issues.apache.org/jira/browse/SOLR-2186
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
           Reporter: Lance Norskog
           Assignee: Grant Ingersoll
        Attachments: TikaResolver.patch

The multi-threaded option for the DataImportHandler throws an exception and the entire operation fails. This is true even if only 1 thread is configured via *threads='1'*.
[jira] Updated: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Component/s: contrib - DataImportHandler

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
         Components: contrib - DataImportHandler
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] Commented: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931839#action_12931839 ]

Fuad Efendi commented on SOLR-2233:
-----------------------------------

It is 3 times faster after I applied the changes:

Before: 729 documents/minute
After: 2639 documents/minute

In my test there are 10 sub-entities, some of them multi-valued (and it's hard to use CachedJdbcDataSource with composite PKs). I can't explain it by the threads=16 option alone (which this patch makes possible). It is probably the connection close / connection open overhead, which is very expensive for SQL Server (unlike the MySQL JDBC driver, which internally uses connection pooling).

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] Commented: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931401#action_12931401 ]

Fuad Efendi commented on SOLR-2233:
-----------------------------------

The only remaining problem is what to do if the database server closes/drops the connection (for instance, due to timeout settings on the database, heavy load, or a network problem). The more time the indexing takes, the more frequent such problems become. Even a connection pool (accessed via JNDI) won't help, because the existing (and new) code tries to keep the same connection for a long time, without any logic to check that the connection is still alive. What do we do if the database drops the connection while we are in the middle of a RecordSet?

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
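One possible liveness check, sketched with the JDBC 4 Connection.isValid() API (a hypothetical holder class; it cannot rescue a ResultSet that dies mid-iteration, the query would have to be re-run from the last processed key):

{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Validate before each reuse; reconnect if the server dropped the
// connection behind our back.
public class LiveConnectionHolder {
  private final String url;
  private Connection conn;

  public LiveConnectionHolder(String url) {
    this.url = url;
  }

  public synchronized Connection get() throws SQLException {
    if (conn == null || conn.isClosed() || !conn.isValid(5 /* seconds */)) {
      conn = DriverManager.getConnection(url);
    }
    return conn;
  }
}
{code}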
[jira] Created: (SOLR-2231) DataImportHandler - MultiThreaded - Logging
DataImportHandler - MultiThreaded - Logging
-------------------------------------------

                 Key: SOLR-2231
                 URL: https://issues.apache.org/jira/browse/SOLR-2231
             Project: Solr
          Issue Type: Improvement
    Affects Versions: 1.5
            Reporter: Fuad Efendi
            Priority: Trivial

Please use
{code}
if (LOG.isInfoEnabled()) LOG.info(...)
{code}
For instance, line 95 of ThreadedEntityProcessorWrapper creates huge log output which is impossible to manage via logging properties:
{code}
LOG.info("arow : " + arow);
{code}
[jira] Updated: (SOLR-2231) DataImportHandler - MultiThreaded - Logging
[ https://issues.apache.org/jira/browse/SOLR-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2231:
------------------------------

    Description:
Please use
{code}
if (LOG.isInfoEnabled()) LOG.info(...)
{code}
For instance, line 95 of ThreadedEntityProcessorWrapper creates huge log output which is impossible to manage via logging properties:
{code}
LOG.info("arow : " + arow);
{code}
This line (in a loop) will output the results of all SQL queries against the database (and will slow down Solr's performance). It's even better to use LOG.debug instead of LOG.info, since INFO is enabled by default.

  was:
Please use
{code}
if (LOG.isInfoEnabled()) LOG.info(...)
{code}
For instance, line 95 of ThreadedEntityProcessorWrapper creates huge log output which is impossible to manage via logging properties:
{code}
LOG.info("arow : " + arow);
{code}

DataImportHandler - MultiThreaded - Logging
-------------------------------------------

                 Key: SOLR-2231
                 URL: https://issues.apache.org/jira/browse/SOLR-2231
             Project: Solr
          Issue Type: Improvement
    Affects Versions: 1.5
            Reporter: Fuad Efendi
            Priority: Trivial

Please use
{code}
if (LOG.isInfoEnabled()) LOG.info(...)
{code}
For instance, line 95 of ThreadedEntityProcessorWrapper creates huge log output which is impossible to manage via logging properties:
{code}
LOG.info("arow : " + arow);
{code}
This line (in a loop) will output the results of all SQL queries against the database (and will slow down Solr's performance). It's even better to use LOG.debug instead of LOG.info, since INFO is enabled by default.
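A sketch of the options mentioned above, assuming SLF4J as the logging facade (class and method names are illustrative):

{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class RowLogging {
  private static final Logger LOG = LoggerFactory.getLogger(RowLogging.class);

  void process(Object arow) {
    // 1) Guarded: the string is only built when INFO is actually enabled.
    if (LOG.isInfoEnabled()) {
      LOG.info("arow : " + arow);
    }

    // 2) Lower level: DEBUG is off by default, so production imports stay quiet.
    if (LOG.isDebugEnabled()) {
      LOG.debug("arow : " + arow);
    }

    // 3) SLF4J parameterized form: no concatenation unless the level is enabled.
    LOG.debug("arow : {}", arow);
  }
}
{code}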
[jira] Created: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                 Key: SOLR-2233
                 URL: https://issues.apache.org/jira/browse/SOLR-2233
             Project: Solr
          Issue Type: Bug
    Affects Versions: 1.5
            Reporter: Fuad Efendi

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] Updated: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: FE-patch.txt

I need to test it, but the changes are obvious. The JDBC API javadoc says:
{code}
 * <strong>Note:</strong> Support for the <code>isLast</code> method
 * is optional for <code>ResultSet</code>s with a result
 * set type of <code>TYPE_FORWARD_ONLY</code>
{code}
but I am sure everyone supports this.

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] Commented: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931168#action_12931168 ]

Fuad Efendi commented on SOLR-2233:
-----------------------------------

*Performance Tuning*

I have extremely sophisticated SQL; the root entity runs 10-15 subqueries, and I am unable to use {{CachedSqlEntityProcessor}}. That's why I am looking into multithreading. Unfortunately, with the existing approach the connection is closed after each use, and for most databases _creating a connection (authentication, resource allocation) is extremely expensive_. The best approach is to use a container resource (JNDI, connection pooling), but I'll try to find what else can be improved.

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] Updated: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: SOLR-2233-JdbcDataSource.patch

- Connection moved to the top-level class
- a DataSource should be used in a thread-safe manner; multiple threads can use multiple DataSource instances (one per item)
- the Connection should be closed at the end of the import in any case...

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] Commented: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931187#action_12931187 ]

Fuad Efendi commented on SOLR-2233:
-----------------------------------

This is the exception I was talking about (threads=16, 12 sub-entities, with the existing trunk version); note *The connection is closed*:

{code}
org.apache.solr.handler.dataimport.DataImportHandlerException: com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
	at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
	at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:337)
	at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$600(JdbcDataSource.java:226)
	at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:260)
	at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:75)
	at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
	at org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper.nextRow(ThreadedEntityProcessorWrapper.java:84)
	at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:433)
	at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.run(DocBuilder.java:386)
	at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:453)
	at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.access$000(DocBuilder.java:340)
	at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner$1.run(DocBuilder.java:393)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:619)
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
	at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:171)
	at com.microsoft.sqlserver.jdbc.SQLServerConnection.checkClosed(SQLServerConnection.java:319)
	at com.microsoft.sqlserver.jdbc.SQLServerStatement.checkClosed(SQLServerStatement.java:956)
	at com.microsoft.sqlserver.jdbc.SQLServerResultSet.checkClosed(SQLServerResultSet.java:348)
	at com.microsoft.sqlserver.jdbc.SQLServerResultSet.next(SQLServerResultSet.java:915)
	at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:329)
	... 13 more
{code}

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
[jira] Updated: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated SOLR-2233:
------------------------------

    Attachment: SOLR-2233-JdbcDataSource.patch

Use {{resultSet.next()}}: the Microsoft JDBC driver doesn't support isLast() for FORWARD_ONLY result sets.

DataImportHandler - JdbcDataSource is not thread safe
-----------------------------------------------------

                Key: SOLR-2233
                URL: https://issues.apache.org/jira/browse/SOLR-2233
            Project: Solr
         Issue Type: Bug
   Affects Versions: 1.5
           Reporter: Fuad Efendi
        Attachments: FE-patch.txt, SOLR-2233-JdbcDataSource.patch, SOLR-2233-JdbcDataSource.patch

Whenever Thread A spends more than 10 seconds on a Connection (by retrieving records in a batch), Thread B will close the connection. Related exceptions happen when we use the threads= attribute on an entity; the exception stack usually contains the message "connection already closed". It shouldn't happen with a JNDI data source, where Connection.close() simply returns the Connection to a pool of available connections, but we might get different errors.
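A sketch of iterating a TYPE_FORWARD_ONLY ResultSet with a one-row lookahead so that isLast() is never needed (illustrative, not the patch code):

{code}
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Wraps a forward-only ResultSet as an Iterator using only next():
 *  hasNext() advances once and caches the answer, so the optional
 *  isLast() method is never called. */
public class ResultSetRowIterator implements Iterator<ResultSet> {
  private final ResultSet rs;
  private Boolean hasNextRow; // null = lookahead not fetched yet

  public ResultSetRowIterator(ResultSet rs) {
    this.rs = rs;
  }

  public boolean hasNext() {
    if (hasNextRow == null) {
      try {
        hasNextRow = rs.next();
      } catch (SQLException e) {
        throw new RuntimeException(e);
      }
    }
    return hasNextRow;
  }

  public ResultSet next() {
    if (!hasNext()) throw new NoSuchElementException();
    hasNextRow = null; // consume the cached lookahead
    return rs;         // caller reads columns from the current row
  }

  public void remove() {
    throw new UnsupportedOperationException();
  }
}
{code}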
[jira] Commented: (SOLR-2233) DataImportHandler - JdbcDataSource is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931225#action_12931225 ] Fuad Efendi commented on SOLR-2233: --- And a real-life test: the root entity contains 10 sub-entities, 16 threads allocated. *before*
{code}
<str name="Time Elapsed">0:1:0.322</str>
<str name="Total Requests made to DataSource">7296</str>
<str name="Total Rows Fetched">8061</str>
<str name="Total Documents Processed">729</str>
{code}
*after*
{code}
<str name="Time Elapsed">0:1:1.184</str>
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">29247</str>
<str name="Total Documents Processed">2639</str>
{code}
Look at that: it seems we no longer close the connection unnecessarily! *Total Requests made to DataSource: 0*
[jira] Updated: (SOLR-792) Pivot (ie: Decision Tree) Faceting Component
[ https://issues.apache.org/jira/browse/SOLR-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated SOLR-792: - Comment: was deleted (was: I believe the recent patch (2010-10-19) causes problems... I get these errors now:
{code}
<lst name="facet_pivot">
  <arr name="ChannelID,ClassificationID">
    <lst>
      <str name="field">ChannelID</str>
      <str name="value">ERROR:SCHEMA-INDEX-MISMATCH,stringValue=`&#8;&#0;&#0;&#0;&#5;</str>
      <int name="count">4491</int>
{code}
And those xxxID fields are of type int, not String... )

> Pivot (ie: Decision Tree) Faceting Component
> --------------------------------------------
>
>                 Key: SOLR-792
>                 URL: https://issues.apache.org/jira/browse/SOLR-792
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Erik Hatcher
>            Assignee: Ryan McKinley
>            Priority: Minor
>         Attachments: SOLR-792-as-helper-class.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-raw-type.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch
>
> A component to do multi-level faceting.
[jira] Commented: (SOLR-792) Tree Faceting Component
[ https://issues.apache.org/jira/browse/SOLR-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914814#action_12914814 ] Fuad Efendi commented on SOLR-792: -- The default value (as seen in the code) is facet.pivot.mincount=1. This confused me during simple tests (it showed wrong results); finally I found I need to explicitly add facet.pivot.mincount=0.
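For illustration, a request that also returns zero-count pivot values might look like the following; the host, core, and the reuse of the ChannelID/ClassificationID fields from the earlier comment are assumptions:
{code}
http://localhost:8983/solr/select?q=*:*&rows=0
    &facet=true
    &facet.pivot=ChannelID,ClassificationID
    &facet.pivot.mincount=0
{code}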
[jira] Commented: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833010#action_12833010 ] Fuad Efendi commented on LUCENE-2230: - LUCENE-2089 - extremely good stuff (the Lucene flex branch; applicable to wildcard queries, regex, and fuzzy search). BKTree improves performance if the distance is 2; otherwise it is almost a full-term scan. Some borrowed links:
http://en.wikipedia.org/wiki/Deterministic_finite-state_machine
http://rcmuir.wordpress.com/2009/12/04/finite-state-queries-for-lucene/
http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198

> Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
> -----------------------------------------------------------------
>
>                 Key: LUCENE-2230
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2230
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.0
>         Environment: Lucene currently uses a brute-force full-terms scanner and calculates the distance for each term. The new BKTree structure improves performance on average 20 times when the distance is 1, and 3 times when the distance is 3. I tested with an index of several million docs and 250,000 terms. The new algo uses integer distances between objects.
>            Reporter: Fuad Efendi
>         Attachments: BKTree.java, Distance.java, DistanceImpl.java, FuzzyTermEnumNEW.java, FuzzyTermEnumNEW.java
>
>   Original Estimate: 0.02h
>  Remaining Estimate: 0.02h
>
> W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973 http://portal.acm.org/citation.cfm?doid=362003.362025
> I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick Johnson, Google).
> Additionally, the simplified algorithm at http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more logically correct than Levenshtein distance, and it is 3-5 times faster (isolated tests).
> Big list of distance implementations: http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm
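For reference, the Burkhard-Keller idea behind the attached BKTree.java: children are keyed by their integer distance to the parent, and the triangle inequality lets a lookup with threshold k skip every edge outside [d-k, d+k]. A minimal sketch of the structure, not the attached implementation:
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntBiFunction;

/** Minimal BK-tree over strings with a pluggable integer metric. */
final class BkTree {
  private static final class Node {
    final String term;
    final Map<Integer, Node> children = new HashMap<>();
    Node(String term) { this.term = term; }
  }

  private final ToIntBiFunction<String, String> metric;
  private Node root;

  BkTree(ToIntBiFunction<String, String> metric) { this.metric = metric; }

  void add(String term) {
    if (root == null) { root = new Node(term); return; }
    Node node = root;
    while (true) {
      int d = metric.applyAsInt(term, node.term);
      if (d == 0) return;                        // already present
      Node child = node.children.get(d);
      if (child == null) { node.children.put(d, new Node(term)); return; }
      node = child;
    }
  }

  List<String> search(String query, int k) {
    List<String> out = new ArrayList<>();
    if (root != null) collect(root, query, k, out);
    return out;
  }

  private void collect(Node node, String query, int k, List<String> out) {
    int d = metric.applyAsInt(query, node.term);
    if (d <= k) out.add(node.term);
    for (int i = d - k; i <= d + k; i++) {       // triangle-inequality pruning
      Node child = node.children.get(i);
      if (child != null) collect(child, query, k, out);
    }
  }
}
{code}
The pruning interval is what gives the speedup at small k; at k=4 the interval [d-4, d+4] covers most edges, which matches the observation in this thread that larger thresholds degrade to a near full scan.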
[jira] Commented: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832027#action_12832027 ] Fuad Efendi commented on LUCENE-2230: - Hi Uwe, Thanks for the analysis! I spent only a few days on this basic PoC. I also need the IndexReader (index version number, etc.) to rewarm the cache; if a term disappeared from the index we can still leave it in the BKTree (not a problem; we can't remove it!), and if we have a new term we simply call {code}public void add(E term){code} Synchronization should be significantly improved... Cache warming takes 10-15 seconds in my environment, for about 250k tokens, and I use a TreeSet internally for fast lookup. I also believe the main performance issue is related to the Levenshtein algo (which is significantly improved in trunk; plus synchronization is removed from FuzzySearch: LUCENE-2258). Regarding memory requirements: BKTree is not heavy... I should use {code}StringHelper.intern(fld);{code} - it's already in memory... and FuzzyTermEnum uses almost the same amount of memory for processing as BKTree. I'll check FieldCache. The BKTree approach can be significantly improved.
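The rewarming idea mentioned in this thread (a background thread that builds a new BKTree instance without hurting end users) can be sketched as an atomic swap, reusing the BkTree class from the sketch above. Illustrative only; none of these names are from the attached code:
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.ToIntBiFunction;

/** Readers always see a fully built tree; one background task rebuilds
 *  from the current term dictionary and swaps the reference atomically. */
final class BkTreeHolder {
  private final AtomicReference<BkTree> current = new AtomicReference<>();
  private final ScheduledExecutorService rebuilder =
      Executors.newSingleThreadScheduledExecutor();

  void start(Iterable<String> termSource,
             ToIntBiFunction<String, String> metric, long periodSeconds) {
    rebuilder.scheduleWithFixedDelay(() -> {
      BkTree fresh = new BkTree(metric);       // build off to the side
      for (String term : termSource) fresh.add(term);
      current.set(fresh);                      // searches switch atomically
    }, 0, periodSeconds, TimeUnit.SECONDS);
  }

  BkTree tree() { return current.get(); }      // null until the first build completes
}
{code}
Because the swap replaces the whole tree, no locking is needed on the read path, which addresses the synchronization concern above.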
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832033#action_12832033 ] Fuad Efendi commented on LUCENE-2089: - Downloadable article (PDF): http://www.mitpressjournals.org/doi/pdf/10.1162/0891201042544938?cookieSet=1

> explore using automaton for fuzzyquery
> --------------------------------------
>
>                 Key: LUCENE-2089
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2089
>             Project: Lucene - Java
>          Issue Type: Wish
>          Components: Search
>            Reporter: Robert Muir
>            Assignee: Mark Miller
>            Priority: Minor
>         Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java
>
> Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is itching to write that nasty algorithm) we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
> * up front, calculate the maximum required K edits needed to match the users supplied float threshold.
> * for at least small common E up to some max K (1,2,3, etc) we should create a DFA for each E. if the required E is above our supported max, we use dumb mode at first (no seeking, no DFA, just brute force like now). As the pq fills, we swap progressively lower DFAs into the enum, based upon the lowest score in the pq. This should work well on avg, at high E, you will typically fill the pq very quickly since you will match many terms. This not only provides a mechanism to switch to more efficient DFAs during enumeration, but also to switch from dumb mode to smart mode.
> i modified my wildcard benchmark to generate random fuzzy queries.
> * Pattern: 7N stands for NNN, etc.
> * AvgMS_DFA: this is the time spent creating the automaton (constructor)
> ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
> |7N|10|64.0|4155.9|38.6|20.3|
> |14N|10|0.0|2511.6|46.0|37.9|
> |28N|10|0.0|2506.3|93.0|86.6|
> |56N|10|0.0|2524.5|304.4|298.5|
> as you can see, this prototype is no good yet, because it creates the DFA in a slow way. right now it creates an NFA, and all this wasted time is in NFA-DFA conversion. So, for a very long string, it just gets worse and worse. This has nothing to do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), there is no problem there. instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 we can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok. the paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization, if someone wants to implement this they should not worry about minimization. in fact, we need to at some point determine if AutomatonQuery should even minimize FSM's at all, or if it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes minimal DFA is the Dumb vs Smart heuristic and this can be rewritten as a summation easily). we need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out if minimization is even helping right now.
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832049#action_12832049 ] Fuad Efendi commented on LUCENE-2089: - Ok; I am trying to study DFA/NFA and to compare with LUCENE-2230 (the BKTree size is fixed, without dependency on distance, but we need to hard-cache it...). What I found is that the classic Levenshtein algo eats 75% of the CPU, and the classic brute-force TermEnum 25%... The distance (submitted by the end user) must be an integer...
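For context, this is the classic dynamic-programming Levenshtein distance that dominates the profile; the integer threshold k with a row-minimum early exit is an illustrative addition, not Lucene's implementation:
{code}
/** Classic DP Levenshtein with a row-minimum early exit once the
 *  distance provably exceeds the integer threshold k. */
final class Levenshtein {
  static int distance(String a, String b, int k) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      int rowMin = curr[0];
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                    prev[j] + 1),      // deletion
                           prev[j - 1] + cost);        // substitution
        rowMin = Math.min(rowMin, curr[j]);
      }
      if (rowMin > k) return k + 1;            // cannot get back under k
      int[] tmp = prev; prev = curr; curr = tmp;
    }
    return prev[b.length()];
  }
}
{code}
The early exit is sound because the minimum entry of each DP row never decreases from one row to the next, so once it exceeds k the final distance must as well.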
[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832096#action_12832096 ] Fuad Efendi edited comment on LUCENE-2230 at 2/10/10 5:56 PM: -- Hi Uwe, I am trying to study LUCENE-2258 right now...
bq. BKTree contains terms no longer available
BKTree contains objects, not terms; in my sample it contains Strings: new BKTree<String>(new Distance()). It is a structure for fast lookup of close objects from a set of objects, with a predefined distance algorithm. It won't hurt if a String appears in the BKTree structure while the corresponding Term disappeared from the Index; search results will be the same. Simply, a search for DisappearedTerm OR AnotherTerm is the same as a search for AnotherTerm. At least, we can run a background thread which will create a new BKTree instance, without hurting end users. Yes, Term-to-String conversion is another thing to do... I recreate fake terms in the TermEnum... BKTree allows iterating over only about 5-10% of the whole structure to find the closest matches, but only if the distance threshold is small (2). If it is 4, there is almost no improvement. And the classic Levenshtein distance is slow...
[jira] Commented: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832096#action_12832096 ] Fuad Efendi commented on LUCENE-2230: - Hi Uwe, I am trying to study LUCENE-2258 right now...
bq. BKTree contains terms no longer available
BKTree contains objects, not terms; in my sample it contains Strings: new BKTree<String>(new Distance()). It is a structure for fast lookup of close objects from a set of objects, with a predefined distance algorithm. It won't hurt if a String appears in the BKTree structure while the corresponding Term disappeared from the Index; search results will be the same. Simply, a search for DisappearedTerm OR AnotherTerm is the same as a search for AnotherTerm. At least, we can run a background thread which will create a new BKTree instance, without hurting end users. Yes, Term-to-String conversion is another thing to do... I recreate fake terms in the TermEnum... BKTree allows iterating over only about 5-10% of the whole structure to find the closest matches, but only if the distance threshold is small (2). If it is 4, there is almost no improvement. And the classic Levenshtein distance is slow...
[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832096#action_12832096 ] Fuad Efendi edited comment on LUCENE-2230 at 2/10/10 6:22 PM: -- Hi Uwe, I am trying to study LUCENE-2258 right now...
bq. BKTree contains terms no longer available
BKTree contains objects, not terms; in my sample it contains Strings: new BKTree<String>(new Distance()). It is a structure for fast lookup of close objects from a set of objects, with a predefined distance algorithm. It won't hurt if a String appears in the BKTree structure while the corresponding Term disappeared from the Index; search results will be the same. Simply, a search for DisappearedTerm OR AnotherTerm is the same as a search for AnotherTerm. At least, we can run a background thread which will create a new BKTree instance, without hurting end users. Yes, Term-to-String conversion is another thing to do... I recreate fake terms in the TermEnum... BKTree allows iterating over only about 5-10% of the whole structure to find the closest matches, but only if the distance threshold is small (2). If it is 4, there is almost no improvement. And the classic Levenshtein distance is slow... Edited: trying to study LUCENE-2089...
[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832049#action_12832049 ] Fuad Efendi edited comment on LUCENE-2089 at 2/10/10 6:25 PM: -- Ok; I am trying to study DFA/NFA and to compare with LUCENE-2230 (the BKTree size is fixed, without dependency on distance, but we need to hard-cache it...). What I found is that the classic Levenshtein algo eats 75% of the CPU, and the classic brute-force TermEnum 25%... The distance (submitted by the end user) must be an integer... Edited: BKTree memory requirements don't depend on the distance threshold etc.; but BKTree can help only if the threshold is small, otherwise it is similar to a full scan.
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832130#action_12832130 ] Fuad Efendi commented on LUCENE-2089: - What about this: http://www.catalysoft.com/articles/StrikeAMatch.html - it seems logically more appropriate for (human-entered) text objects than Levenshtein distance, and it is (in theory) extremely fast; is the DFA-based distance faster?
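The StrikeAMatch measure from that article is a Dice coefficient over adjacent letter pairs; a compact sketch of the idea (not the article's exact code):
{code}
import java.util.ArrayList;
import java.util.List;

/** Similarity in [0,1]: twice the number of shared adjacent letter
 *  pairs divided by the total number of pairs in both strings. */
final class StrikeAMatch {
  private static List<String> pairs(String s) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i + 1 < s.length(); i++) out.add(s.substring(i, i + 2));
    return out;
  }

  static double similarity(String a, String b) {
    List<String> p1 = pairs(a.toUpperCase());
    List<String> p2 = pairs(b.toUpperCase());
    int union = p1.size() + p2.size();
    if (union == 0) return 0.0;
    int hits = 0;
    for (String pair : p1) {
      int j = p2.indexOf(pair);
      if (j >= 0) { hits++; p2.remove(j); }    // each pair matches at most once
    }
    return (2.0 * hits) / union;
  }
}
{code}
For example, similarity("FRANCE", "FRENCH") = 2*2/(5+5) = 0.4, since only the pairs FR and NC are shared.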
[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832130#action_12832130 ] Fuad Efendi edited comment on LUCENE-2089 at 2/10/10 7:09 PM: -- What about this: http://www.catalysoft.com/articles/StrikeAMatch.html It seems logically more appropriate for (human-entered) text objects than Levenshtein distance, and it is (in theory) extremely fast; is the DFA-based distance faster?
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832143#action_12832143 ] Fuad Efendi commented on LUCENE-2089: - Hi Robert, Yes, I agree; we need to stick with Levenshtein distance, also to isolate performance comparisons: same distance, but FuzzyTermEnum with a full scan vs. the DFA-based approach; and we need to be able to compare the old relevance with the new one (with an integer distance threshold it is not the same as with the classic floating point...). Thanks for the link to your article! What if we could store some precounted values in the index... such as storing similar terms in an additional field... or some pieces of the DFA (which I still need to learn...)
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832173#action_12832173 ] Fuad Efendi commented on LUCENE-2089: - For LUCENE-2230 I did a lot of long-run load-stress tests (against SOLR), but before doing that I created a baseline for the static admin screen in SOLR: 1500 TPS. And I reached 220 TPS with Fuzzy Search... What I am trying to say is this: can the DFA with Levenshtein reach 250 TPS (in a real-world multi-tier web environment)? The baseline for a static page is 1500. Also, is the DFA mostly CPU-bound? Can we improve it by offloading some of the work to I/O? Just joking ;)
[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832173#action_12832173 ] Fuad Efendi edited comment on LUCENE-2089 at 2/10/10 8:27 PM: -- For LUCENE-2230 I did a lot of long-run load-stress tests (against SOLR), but before doing that I created a baseline for the static admin screen in SOLR: 1500 TPS. And I reached 220 TPS with Fuzzy Search... What I am trying to say is this: can the DFA with Levenshtein reach 250 TPS (in a real-world multi-tier web environment)? The baseline for a static page is 1500. Also, is the DFA mostly CPU-bound? Can we improve it by offloading some of the work to I/O? Just joking ;) I explicitly used a distance threshold of 2; that's why 220 TPS... with a threshold of 5 it would be 50 TPS or maybe less... If the DFA has no dependency on the threshold, it is the way to go.
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832194#action_12832194 ] Fuad Efendi commented on LUCENE-2089: - Ok, now I understand what AutomatonQuery is... Frankly, I had this idea: create a small dictionary of similar words, create terms from those words, and execute the query Word1 OR Word2 OR ... instead of scanning the whole term dictionary; but how small will such a dictionary be in the case of, for instance, dogs... is the size of the dictionary (in the case of ASCII characters) 26*26*26*26? Or 65536*65536*65536*65536 in the case of Unicode? Simple. Is it so simple?
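The dictionary sketched above does not need the whole alphabet cross-product: for a term of length n over an alphabet of size A, the words within one edit number only O(A*n), since each candidate differs by a single deletion, substitution, or insertion. An illustrative generator (hypothetical, not from the patches):
{code}
import java.util.LinkedHashSet;
import java.util.Set;

/** All strings within Levenshtein distance 1 of a term, over a fixed
 *  alphabet: deletions, substitutions, and insertions. */
final class Neighborhood {
  static Set<String> withinOneEdit(String term, char[] alphabet) {
    Set<String> out = new LinkedHashSet<>();
    for (int i = 0; i < term.length(); i++)              // deletions
      out.add(term.substring(0, i) + term.substring(i + 1));
    for (int i = 0; i < term.length(); i++)              // substitutions
      for (char c : alphabet)
        out.add(term.substring(0, i) + c + term.substring(i + 1));
    for (int i = 0; i <= term.length(); i++)             // insertions
      for (char c : alphabet)
        out.add(term.substring(0, i) + c + term.substring(i));
    out.remove(term);                                    // distance 1, not 0
    return out;
  }
}
{code}
For "dogs" over a 26-letter alphabet this is at most 4 + 4*26 + 5*26 = 238 candidate strings before deduplication, not 26*26*26*26.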
[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832194#action_12832194 ] Fuad Efendi edited comment on LUCENE-2089 at 2/10/10 9:22 PM: -- OK, now I understand what AutomatonQuery is... Frankly, I had the same idea: create a small dictionary of similar words, create terms from those words, and execute the query Word1 OR Word2 OR ... instead of scanning the whole term dictionary. But how small would such a dictionary be for, say, dogs? Is the dictionary size (in the case of ASCII characters) 26*26*26*26? Or 65536*65536*65536*65536 in the case of Unicode? Simple. Is it really so simple? Even with Unicode, we can precount the set of characters actually used by a specific field instance; it might be only 36 characters. Then we issue a query like dogs OR aaabdogs OR ... OR dogs, and, if we can quickly find the intersection of the custom dictionary with the terms dictionary, then build the query... Am I on the correct path with this understanding?
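If the candidate dictionary and the index's term dictionary are both kept in sorted order (Lucene's term dictionary is), the intersection asked about above can be computed in one merge pass instead of one lookup per candidate. A minimal, library-free Java sketch under that assumption (the names are illustrative, not Lucene API):

{code}
import java.util.ArrayList;
import java.util.List;

public class SortedIntersection {
    // Merge-style intersection of two sorted, de-duplicated string lists:
    // O(|candidates| + |terms|) comparisons, no per-candidate dictionary seek.
    static List<String> intersect(List<String> candidates, List<String> terms) {
        List<String> hits = new ArrayList<String>();
        int i = 0, j = 0;
        while (i < candidates.size() && j < terms.size()) {
            int cmp = candidates.get(i).compareTo(terms.get(j));
            if (cmp == 0) { hits.add(candidates.get(i)); i++; j++; }
            else if (cmp < 0) i++;
            else j++;
        }
        return hits; // these become the OR-ed terms of the rewritten query
    }
}
{code}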
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832360#action_12832360 ] Fuad Efendi commented on LUCENE-2089: - Levenshtein distance is good for the spelling-corrections use case (even the terminology is the same: insert, delete, replace...), but it is not good for more generic similarity: the distance between RunAutomation and AutomationRun is huge (6!). That is really a two-word combination, though, and I don't know a good one-(human)-word use case where Levenshtein distance is not good (or natural). From another viewpoint, I can't see any use case where StrikeAMatch (counts of 2-char similarities) is bad, although it is not spelling correction. And, from a third viewpoint, if we totally forget that this is human-generated input and implement Levenshtein distance on raw bitsets instead of Unicode characters (the end user clicks on a keyboard)... we will get totally unacceptable results... I believe such distance algorithms were initially designed many years ago, before the Internet (and search), to allow auto-recovery during data transmission (the first astronauts...), but auto-recovery was based on the fact that an (acceptably) corrupted code has one and only one closest match in the dictionary, so it was extremely fast (50 years ago). And now we are using an old algorithm designed for a completely different use case (fixed-size bitset transmission) for spelling corrections... What if we focus on a keyboard (101 keys?) instead of Unicode... For spelling corrections, 20 ms is not good; that is only 50 TPS (on a single core)...
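For reference, a compact Java sketch of the letter-pair similarity ("StrikeAMatch", after the catalysoft.com article cited elsewhere in this thread), simplified here to whole-string bigrams rather than the article's per-word pairs; this is a paraphrase of the published idea, not the issue's attached code. It scores RunAutomation vs AutomationRun highly even though their Levenshtein distance is 6:

{code}
import java.util.ArrayList;
import java.util.List;

public class LetterPairSimilarity {
    static List<String> pairs(String s) {
        List<String> ps = new ArrayList<String>();
        for (int i = 0; i < s.length() - 1; i++) {
            ps.add(s.substring(i, i + 2));
        }
        return ps;
    }

    // 2 * |shared bigrams| / (|bigrams(a)| + |bigrams(b)|), in [0, 1].
    static double similarity(String a, String b) {
        List<String> pa = pairs(a.toLowerCase()), pb = pairs(b.toLowerCase());
        int total = pa.size() + pb.size(), shared = 0;
        for (String p : pa) {
            if (pb.remove(p)) shared++;   // multiset intersection
        }
        return total == 0 ? 0.0 : (2.0 * shared) / total;
    }

    public static void main(String[] args) {
        System.out.println(similarity("RunAutomation", "AutomationRun")); // ~0.9
    }
}
{code}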
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanelfocusedCommentId=12832368#action_12832368 ] Fuad Efendi commented on LUCENE-2089: - Another idea (similar to the 50-year-old auto-recovery); it doesn't let me sleep :) What if we do all distance calculations (and other types of calculations) at indexing time instead of at query time? For instance, we may have an index structure like {Term, List[MisspelledTerm, Distance]}, and we can query this structure by {MisspelledTerm, Distance}. We mentioned it here already, in LUCENE-1513, but our use case is very specific... And why allow 5 spelling mistakes in Unicode if the user's input contains only 3 Latin-1 characters? We should hardcode some constraints.
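A minimal sketch of the proposed {Term, List[MisspelledTerm, Distance]} structure, with a plain in-memory map standing in for the index and the variant generator left abstract (all names here are placeholders, not Lucene API):

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MisspellingIndex {
    // variant -> list of (canonical term, distance) entries
    private final Map<String, List<Object[]>> index = new HashMap<String, List<Object[]>>();

    // Indexing time: register precomputed variants of 'term' and their distances.
    void add(String term, Map<String, Integer> variantToDistance) {
        for (Map.Entry<String, Integer> e : variantToDistance.entrySet()) {
            List<Object[]> hits = index.get(e.getKey());
            if (hits == null) {
                hits = new ArrayList<Object[]>();
                index.put(e.getKey(), hits);
            }
            hits.add(new Object[] { term, e.getValue() });
        }
    }

    // Query time: a single hash lookup replaces the term-dictionary scan.
    List<Object[]> lookup(String misspelled, int maxDistance) {
        List<Object[]> out = new ArrayList<Object[]>();
        List<Object[]> hits = index.get(misspelled);
        if (hits != null) {
            for (Object[] h : hits) {
                if ((Integer) h[1] <= maxDistance) out.add(h);
            }
        }
        return out;
    }
}
{code}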
[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832368#action_12832368 ] Fuad Efendi edited comment on LUCENE-2089 at 2/11/10 3:05 AM: -- Another idea (similar to the 50-year-old auto-recovery); it doesn't let me sleep :) What if we do all distance calculations (and other types of calculations) at indexing time instead of at query time? For instance, we may have an index structure like {Term, List[MisspelledTerm, Distance]}, and we can query this structure by {MisspelledTerm, Distance}. We mentioned it here already, in LUCENE-1513, but our use case is very specific... And why allow 5 spelling mistakes in Unicode if the user's input contains only 3 Latin-1 characters? We should hardcode some constraints. Yes, memory requirements... in the case of dogs it could be at least a few million additional misspelled terms for this one specific term... but it doesn't grow linearly... and we can limit such a structure to distance=2 and use additional query-time processing if we need distance=3. It's just a (naive) idea: precalculate similar terms at indexing time...
[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829163#action_12829163 ] Fuad Efendi edited comment on LUCENE-2230 at 2/9/10 9:17 PM: - After long-run load-stress tests... I used 2 boxes, one with SOLR, the other with a simple multithreaded stress simulator (with randomly generated fuzzy query samples); each box is 2x AMD Opteron 2350 (8 cores per box), 64-bit. I disabled all SOLR caches except the Document Cache (I want isolated tests; I want to ignore the time taken by disk I/O to load documents). Performance rose with the number of load-stress threads (on the client computer), then dropped:

9 Threads: TPS: 200 - 210, Response: 45 - 50 (ms)
10 Threads: TPS: 200 - 215, Response: 45 - 55 (ms)
12 Threads: TPS: 180 - 220, Response: 50 - 90 (ms)
16 Threads: TPS: 60 - 65, Response: 230 - 260 (ms)

This can be explained by CPU-bound processing and the 8 available cores; the top command on the SOLR instance showed 750% - 790% CPU time (8-core) at the 3rd step (12 stressing threads) and 200% at the 4th step (16 stressing threads), probably due to network I/O, Tomcat internals, etc. It's better to have Apache HTTPD in front of SOLR in production, with proxy_ajp (persistent connections) and HTTP caching enabled, and to fine-tune Tomcat threads according to the use case. BTW, my best counters for default SOLR/Lucene were: TPS: 12, Response: 750 ms. Fuzzy queries were tuned such that the distance threshold was less than or equal to two. I used the StrikeAMatch distance... Thanks, http://www.tokenizer.ca +1 416-993-2060(cell) P.S. Before performing the load-stress tests, I established a baseline in my environment: 1500 TPS by pinging http://x.x.x.x:8080/apache-solr-1.4/admin/ (a static JSP). And I reached 220 TPS for fuzzy search, starting from 12-15 TPS (default Lucene/SOLR)...
[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829163#action_12829163 ] Fuad Efendi edited comment on LUCENE-2230 at 2/9/10 9:35 PM: - After long-run load-stress tests... I used 2 boxes, one with SOLR, the other with a simple multithreaded stress simulator (with randomly generated fuzzy query samples); each box is 2x AMD Opteron 2350 (8 cores per box), 64-bit. I disabled all SOLR caches except the Document Cache (I want isolated tests; I want to ignore the time taken by disk I/O to load documents). Performance rose with the number of load-stress threads (on the client computer), then dropped:

9 Threads: TPS: 200 - 210, Response: 45 - 50 (ms)
10 Threads: TPS: 200 - 215, Response: 45 - 55 (ms)
12 Threads: TPS: 180 - 220, Response: 50 - 90 (ms)
16 Threads: TPS: 60 - 65, Response: 230 - 260 (ms)

This can be explained by CPU-bound processing and the 8 available cores; the top command on the SOLR instance showed 750% - 790% CPU time (8-core) at the 3rd step (12 stressing threads) and 200% at the 4th step (16 stressing threads), probably due to network I/O, Tomcat internals, etc. It's better to have Apache HTTPD in front of SOLR in production, with proxy_ajp (persistent connections) and HTTP caching enabled, and to fine-tune Tomcat threads according to the use case. BTW, my best counters for default SOLR/Lucene were: TPS: 12, Response: 750 ms. Fuzzy queries were tuned such that the distance threshold was less than or equal to two. I used the StrikeAMatch distance... Thanks, http://www.tokenizer.ca +1 416-993-2060(cell) P.S. Before performing the load-stress tests, I established a baseline in my environment: 1500 TPS by pinging http://x.x.x.x:8080/apache-solr-1.4/admin/ (a static JSP). And I reached 220 TPS for fuzzy search, starting from 12-15 TPS (default Lucene/SOLR)... P.P.S. A distance function must follow 3 'axioms':

{code}
D(a,a) = 0
D(a,b) = D(b,a)
D(a,b) + D(b,c) >= D(a,c)
{code}

And the function must return an integer value; otherwise, BKTree will produce wrong results. Also, it's mentioned somewhere in the Levenshtein algo Javadocs (in the contrib folder, I believe) that an instance method runs faster than a static method; need to test with Java 6... most probably 'yes', it depends on the JVM implementation; I can only guess that CPU internals are better optimized for instance methods...
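For readers unfamiliar with the structure, a minimal BK-tree sketch in Java, assuming an integer metric satisfying the axioms above (this is the classic Burkhard-Keller structure, not the BKTree.java attached to this issue). The triangle inequality is exactly what lets search() skip every child whose edge label lies outside [d - k, d + k]:

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BKTreeSketch {
    interface Metric { int distance(String a, String b); }

    private static class Node {
        final String term;
        final Map<Integer, Node> children = new HashMap<Integer, Node>();
        Node(String term) { this.term = term; }
    }

    private final Metric metric;
    private Node root;

    BKTreeSketch(Metric metric) { this.metric = metric; }

    void add(String term) {
        if (root == null) { root = new Node(term); return; }
        Node node = root;
        while (true) {
            int d = metric.distance(term, node.term);
            if (d == 0) return;                       // already present
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(term)); return; }
            node = child;
        }
    }

    // All terms within distance k of the query; the triangle inequality lets
    // us skip every child whose edge label is outside [d - k, d + k].
    List<String> search(String query, int k) {
        List<String> hits = new ArrayList<String>();
        if (root != null) search(root, query, k, hits);
        return hits;
    }

    private void search(Node node, String query, int k, List<String> hits) {
        int d = metric.distance(query, node.term);
        if (d <= k) hits.add(node.term);
        for (Map.Entry<Integer, Node> e : node.children.entrySet()) {
            if (e.getKey() >= d - k && e.getKey() <= d + k) {
                search(e.getValue(), query, k, hits);
            }
        }
    }
}
{code}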
[jira] Commented: (SOLR-1764) While indexing a java.lang.IllegalStateException: Can't overwrite cause exception is thrown
[ https://issues.apache.org/jira/browse/SOLR-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831458#action_12831458 ] Fuad Efendi commented on SOLR-1764: --- Funny; it might happen that this is not a problem with JDK 1.6.0_9, or maybe with the latest JDK. As a quick workaround... also, you may try to use SolrJ with the binary format... I'll try to check that <element>word&amp;word</element> doesn't cause a problem...

While indexing a java.lang.IllegalStateException: Can't overwrite cause exception is thrown - Key: SOLR-1764 URL: https://issues.apache.org/jira/browse/SOLR-1764 Project: Solr Issue Type: Bug Components: clients - java Affects Versions: 1.4 Environment: Windows XP, JBoss 4.2.3 GA Reporter: Michael McGowan Priority: Blocker

I get an exception while indexing. It seems that I'm unable to see the root cause of the exception because it is masked by another java.lang.IllegalStateException: Can't overwrite cause exception. Here is the stacktrace:

16:59:04,292 ERROR [STDERR] Feb 8, 2010 4:59:04 PM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {} 0 15
16:59:04,292 ERROR [STDERR] Feb 8, 2010 4:59:04 PM org.apache.solr.common.SolrException log SEVERE: java.lang.IllegalStateException: Can't overwrite cause
at java.lang.Throwable.initCause(Throwable.java:320)
at com.ctc.wstx.compat.Jdk14Impl.setInitCause(Jdk14Impl.java:70)
at com.ctc.wstx.exc.WstxException.<init>(WstxException.java:46)
at com.ctc.wstx.exc.WstxIOException.<init>(WstxIOException.java:16)
at com.ctc.wstx.stax.WstxInputFactory.doCreateSR(WstxInputFactory.java:536)
at com.ctc.wstx.stax.WstxInputFactory.createSR(WstxInputFactory.java:592)
at com.ctc.wstx.stax.WstxInputFactory.createSR(WstxInputFactory.java:648)
at com.ctc.wstx.stax.WstxInputFactory.createXMLStreamReader(WstxInputFactory.java:319)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:68)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:182)
at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:446)
at java.lang.Thread.run(Thread.java:619)
16:59:04,292 ERROR [STDERR] Feb 8, 2010 4:59:04 PM org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/update params={wt=xml&version=2.2} status=500 QTime=15
16:59:04,292 ERROR [STDERR] Feb 8, 2010 4:59:04 PM org.apache.solr.common.SolrException log SEVERE: java.lang.IllegalStateException: Can't overwrite cause
at java.lang.Throwable.initCause(Throwable.java:320)
at com.ctc.wstx.compat.Jdk14Impl.setInitCause(Jdk14Impl.java:70) at
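The "SolrJ with binary format" workaround could look roughly like this with the Solr 1.4-era SolrJ API (a hedged sketch: the URL and field names are placeholders; sending updates as javabin bypasses the XML/Woodstox code path in the stacktrace above):

{code}
import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BinaryIndexing {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8080/solr"); // placeholder URL
        // Send updates as javabin instead of XML, so no XML parsing happens
        // on the server side for the update request.
        server.setRequestWriter(new BinaryRequestWriter());

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");                 // placeholder fields
        doc.addField("text", "word&word");       // no XML escaping involved
        server.add(doc);
        server.commit();
    }
}
{code}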
[jira] Commented: (SOLR-1764) While indexing a java.lang.IllegalStateException: Can't overwrite cause exception is thrown
[ https://issues.apache.org/jira/browse/SOLR-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831268#action_12831268 ] Fuad Efendi commented on SOLR-1764: --- Michael, which version of Java are you using? I believe something is wrong with the XML (upload) file, and specific Java version classes conflict with Woodstox, although SOLR may need improvement too: http://forums.sun.com/thread.jspa?threadID=5150576
[jira] Issue Comment Edited: (SOLR-1764) While indexing a java.lang.IllegalStateException: Can't overwrite cause exception is thrown
[ https://issues.apache.org/jira/browse/SOLR-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831268#action_12831268 ] Fuad Efendi edited comment on SOLR-1764 at 2/9/10 2:48 AM: --- Michael, which version of Java are you using? I believe something is wrong with the XML (upload) file, and specific Java version classes conflict with Woodstox, although SOLR may need improvement too: http://forums.sun.com/thread.jspa?threadID=5150576 It says that text nodes such as <prim name="y">[-A-Z0-9.,()/='+:?!%&amp;*; ]</prim> can be split (for instance, to process entities), depending on the implementation, and, to be safe, SOLR needs something like

{code}
while (reader.isCharacters()) {
    sb.append(reader.getText());
    reader.next();
}
{code}
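Instead of hand-coalescing adjacent character events, the StAX factory can be asked to do it, assuming Solr could set this property where it creates the factory (hypothetical placement; the property itself is standard StAX):

{code}
import javax.xml.stream.XMLInputFactory;

public class CoalescingFactory {
    static XMLInputFactory newFactory() {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the StAX implementation (Woodstox here) to merge adjacent
        // CHARACTERS/CDATA events, so entity expansion never splits a text node.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        return factory;
    }
}
{code}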
[jira] Issue Comment Edited: (SOLR-1764) While indexing a java.lang.IllegalStateException: Can't overwrite cause exception is thrown
[ https://issues.apache.org/jira/browse/SOLR-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831268#action_12831268 ] Fuad Efendi edited comment on SOLR-1764 at 2/9/10 2:51 AM: --- Michael, which version of Java are you using? I believe something is wrong with the XML (upload) file, and specific Java version classes conflict with Woodstox, although SOLR may need improvement too: http://forums.sun.com/thread.jspa?threadID=5150576 It says that text nodes such as <prim name="y">[-A-Z0-9.,()/='+:?!%&amp;*; ]</prim> can be split (for instance, to process entities), depending on the implementation, and, to be safe, SOLR needs something like

{code}
while (reader.isCharacters()) {
    sb.append(reader.getText());
    reader.next();
}
{code}
[jira] Commented: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829163#action_12829163 ] Fuad Efendi commented on LUCENE-2230: - After long-run load-stress tests... I used 2 boxes, one with SOLR, the other with a simple multithreaded stress simulator (with randomly generated fuzzy query samples); each box is 2x AMD Opteron 2350 (8 cores per box), 64-bit. I disabled all SOLR caches except the Document Cache (I want isolated tests; I want to ignore the time taken by disk I/O to load documents). Performance rose with the number of load-stress threads (on the client computer), then dropped:

9 Threads: TPS: 200 - 210, Response: 45 - 50 (ms)
10 Threads: TPS: 200 - 215, Response: 45 - 55 (ms)
12 Threads: TPS: 180 - 220, Response: 50 - 90 (ms)
16 Threads: TPS: 60 - 65, Response: 230 - 260 (ms)

This can be explained by CPU-bound processing and the 8 available cores; the top command on the SOLR instance showed 750% - 790% CPU time (8-core) at the 3rd step (12 stressing threads) and 200% at the 4th step (16 stressing threads), probably due to network I/O, Tomcat internals, etc. It's better to have Apache HTTPD in front of SOLR in production, with proxy_ajp (persistent connections) and HTTP caching enabled, and to fine-tune Tomcat threads according to the use case. BTW, my best counters for default SOLR/Lucene were: TPS: 12, Response: 750 ms. Fuzzy queries were tuned such that the distance threshold was less than or equal to two. I used the StrikeAMatch distance... Thanks, http://www.tokenizer.ca +1 416-993-2060(cell)
[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804001#action_12804001 ] Fuad Efendi edited comment on LUCENE-2230 at 1/24/10 12:46 AM: --- Minor bug fixed (with cache warm-up)... Don't forget to disable the QueryResultsCache and to enable a large DocumentCache (if you are using SOLR); otherwise you won't see the difference. (SOLR users: there are some other tricks!)

With Lucene 2.9.1: 800 ms
With BKTree and the Levenshtein algo: 200 ms
With BKTree and http://www.catalysoft.com/articles/StrikeAMatch.html: 70 ms

Average timing after many hours of tests. We may consider an integer distance instead of a float for Lucene Query if we accept this algo; I tried my best to keep it close to the float distance. The BKTree is cached in FuzzyTermEnumNEW. It needs warm-up if the index changed; the simplest way is to recalculate it at night (a separate thread will do it). P.S. FuzzyQuery constructs an instance of FuzzyTermEnum and passes an instance of IndexReader to the constructor. This is the way... If the IndexReader changed (new instance) we can simply repopulate the BKTree (or, for instance, compare the list of terms and simply add the terms missing from the BKTree)...
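The repopulate-on-new-IndexReader idea in the P.S. could be sketched by keying the prebuilt tree on the reader instance, e.g. with a WeakHashMap so a tree is dropped once its reader is garbage-collected (illustrative only, reusing the BKTreeSketch class from the earlier sketch; not the attached implementation):

{code}
import java.util.Map;
import java.util.WeakHashMap;

public class BKTreeCache {
    // One tree per live IndexReader; entries vanish once a reader is GC'd.
    private final Map<Object, BKTreeSketch> cache =
        new WeakHashMap<Object, BKTreeSketch>();

    synchronized BKTreeSketch treeFor(Object indexReader, Iterable<String> terms,
                                      BKTreeSketch.Metric metric) {
        BKTreeSketch tree = cache.get(indexReader);
        if (tree == null) {                 // new reader: (re)populate once
            tree = new BKTreeSketch(metric);
            for (String t : terms) tree.add(t);
            cache.put(indexReader, tree);
        }
        return tree;
    }
}
{code}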
[jira] Updated: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated LUCENE-2230: Attachment: FuzzyTermEnumNEW.java Minor bug fixed (with cache warm-up)... Don't forget to disable the QueryResultsCache and to enable a large DocumentCache (if you are using SOLR); otherwise you won't see the difference. (SOLR users: there are some other tricks!)

With Lucene 2.9.1: 800 ms
With BKTree and the Levenshtein algo: 200 ms
With BKTree and http://www.catalysoft.com/articles/StrikeAMatch.html: 70 ms

Average timing after many hours of tests. We may consider an integer distance instead of a float for Lucene Query if we accept this algo; I tried my best to keep it close to the float distance. The BKTree is cached in FuzzyTermEnumNEW. It needs warm-up if the index changed; the simplest way is to recalculate it at night (a separate thread will do it). Thanks, Fuad +1 416-993-2060(cell)
[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804001#action_12804001 ] Fuad Efendi edited comment on LUCENE-2230 at 1/23/10 2:49 AM: -- Minor bug fixed (with cache warm-up)... Don't forget to disable the QueryResultsCache and to enable a large DocumentCache (if you are using SOLR); otherwise you won't see the difference. (SOLR users: there are some other tricks!)

With Lucene 2.9.1: 800 ms
With BKTree and the Levenshtein algo: 200 ms
With BKTree and http://www.catalysoft.com/articles/StrikeAMatch.html: 70 ms

Average timing after many hours of tests. We may consider an integer distance instead of a float for Lucene Query if we accept this algo; I tried my best to keep it close to the float distance. The BKTree is cached in FuzzyTermEnumNEW. It needs warm-up if the index changed; the simplest way is to recalculate it at night (a separate thread will do it). Thanks, Fuad +1 416-993-2060(cell)
[jira] Created: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times. Key: LUCENE-2230 URL: https://issues.apache.org/jira/browse/LUCENE-2230 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.0 Environment: Lucene currently uses a brute-force full-terms scanner and calculates the distance for each term. The new BKTree structure improves performance on average 20 times when the distance is 1, and 3 times when the distance is 3. I tested with an index of several million docs and 250,000 terms. The new algo uses integer distances between objects. Reporter: Fuad Efendi

W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973. http://portal.acm.org/citation.cfm?doid=362003.362025 I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick Johnson, Google). Additionally, the simplified algorithm at http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more logically correct than Levenshtein distance, and it is 3-5 times faster (isolated tests). Big list of distance implementations: http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm
[jira] Updated: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated LUCENE-2230: Attachment: DistanceImpl.java Distance.java BKTree.java First version of BKTree.java
[jira] Updated: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated LUCENE-2230: Attachment: FuzzyTermEnumNEW.java FuzzyTermEnumNEW.java In order to use it with Lucene 2.9.1, compile all files (4 files) into a separate JAR file (Java 6). In a source distribution of Lucene 2.9.1, modify a single method of FuzzyQuery: protected FilteredTermEnum getEnum(IndexReader reader) throws IOException { return new FuzzyTermEnumNEW(reader, getTerm(), minimumSimilarity, prefixLength); } - and compile it (using default settings such as Java 1.4 compatibility); ant jar-core will do it. Use the 2 new jars instead of lucene-core-2.9.1.jar Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times. Key: LUCENE-2230 URL: https://issues.apache.org/jira/browse/LUCENE-2230 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.0 Environment: Lucene currently uses a brute-force full-terms scanner and calculates the distance for each term. The new BKTree structure improves performance on average 20 times when the distance is 1, and 3 times when the distance is 3. I tested with an index of several million docs and 250,000 terms. The new algo uses integer distances between objects. Reporter: Fuad Efendi Attachments: BKTree.java, Distance.java, DistanceImpl.java, FuzzyTermEnumNEW.java Original Estimate: 0.02h Remaining Estimate: 0.02h W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973 http://portal.acm.org/citation.cfm?doid=362003.362025 I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick Johnson, Google). Additionally, the simplified algorithm at http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more logically correct than Levenshtein distance, and it is 3-5 times faster (isolated tests). Big list of distance implementations: http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1990) Add unsigned packed int impls in oal.util
[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777054#action_12777054 ] Fuad Efendi commented on LUCENE-1990: - Suttiwat sent me a link: http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/ This is vendor-specific and may cause unexpected problems, but we can try it in some specific cases: Compressed Oops have been included (but disabled by default) in the performance release JDK6u6p (requires you to fill a survey), so I decided to try it in an internal application and compare it with 64-bit mode and 32-bit mode. -XX:+UseCompressedOops There are other vendors around too, such as Oracle JRockit, which is much faster server-side... Add unsigned packed int impls in oal.util - Key: LUCENE-1990 URL: https://issues.apache.org/jira/browse/LUCENE-1990 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Priority: Minor There are various places in Lucene that could take advantage of an efficient packed unsigned int/long impl. EG the terms dict index in the standard codec in LUCENE-1458 could substantially reduce its RAM usage. FieldCache.StringIndex could as well. And I think load-into-RAM codecs like the one in TestExternalCodecs could use this too. I'm picturing something very basic like: {code} interface PackedUnsignedLongs { long get(long index); void set(long index, long value); } {code} Plus maybe an iterator for getting and maybe also for setting. If it helps, most of the usages of this inside Lucene will be write once so eg the set could make that an assumption/requirement. And a factory somewhere: {code} PackedUnsignedLongs create(int count, long maxValue); {code} I think we should simply autogen the code (we can start from the autogen code in LUCENE-1410), or, if there is a good existing impl that has a compatible license that'd be great. I don't have time near-term to do this... so if anyone has the itch, please jump! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
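For context, the option quoted above is passed on the JVM command line at startup; an illustrative (not prescriptive) invocation for a 64-bit JVM, with heap size and other flags purely as examples:
{code}
# illustrative only: enable compressed ordinary object pointers on a 64-bit JVM
JAVA_OPTS="-server -Xmx4g -XX:+UseCompressedOops"
{code}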
[jira] Issue Comment Edited: (LUCENE-1990) Add unsigned packed int impls in oal.util
[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12775420#action_12775420 ] Fuad Efendi edited comment on LUCENE-1990 at 11/10/09 2:01 PM: --- Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go horizontally we will end up with 8 arrays of int[]. What if we go vertically? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only 1. And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) was (Author: funtick): Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 10 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go horizontally we will end up with 8 arrays of int[]. What if we go vertically? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only 1. And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) Add unsigned packed int impls in oal.util - Key: LUCENE-1990 URL: https://issues.apache.org/jira/browse/LUCENE-1990 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Priority: Minor There are various places in Lucene that could take advantage of an efficient packed unsigned int/long impl. EG the terms dict index in the standard codec in LUCENE-1458 could subsantially reduce it's RAM usage. FieldCache.StringIndex could as well. And I think load into RAM codecs like the one in TestExternalCodecs could use this too. I'm picturing something very basic like: {code} interface PackedUnsignedLongs { long get(long index); void set(long index, long value); } {code} Plus maybe an iterator for getting and maybe also for setting. If it helps, most of the usages of this inside Lucene will be write once so eg the set could make that an assumption/requirement. And a factory somewhere: {code} PackedUnsignedLongs create(int count, long maxValue); {code} I think we should simply autogen the code (we can start from the autogen code in LUCENE-1410), or, if there is an good existing impl that has a compatible license that'd be great. I don't have time near-term to do this... so if anyone has the itch, please jump! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1990) Add unsigned packed int impls in oal.util
[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12775420#action_12775420 ] Fuad Efendi commented on LUCENE-1990: - Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 10 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go horizontally we will end up with 8 arrays of int[]. What if we go vertically? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only 1. And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) Add unsigned packed int impls in oal.util - Key: LUCENE-1990 URL: https://issues.apache.org/jira/browse/LUCENE-1990 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Priority: Minor There are various places in Lucene that could take advantage of an efficient packed unsigned int/long impl. EG the terms dict index in the standard codec in LUCENE-1458 could subsantially reduce it's RAM usage. FieldCache.StringIndex could as well. And I think load into RAM codecs like the one in TestExternalCodecs could use this too. I'm picturing something very basic like: {code} interface PackedUnsignedLongs { long get(long index); void set(long index, long value); } {code} Plus maybe an iterator for getting and maybe also for setting. If it helps, most of the usages of this inside Lucene will be write once so eg the set could make that an assumption/requirement. And a factory somewhere: {code} PackedUnsignedLongs create(int count, long maxValue); {code} I think we should simply autogen the code (we can start from the autogen code in LUCENE-1410), or, if there is an good existing impl that has a compatible license that'd be great. I don't have time near-term to do this... so if anyone has the itch, please jump! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
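To make the vertical 3-bit encoding concrete, here is a minimal sketch of a packed array (assumptions: exactly 8 distinct field values, and 21 values stored per 64-bit word so that no value crosses a word boundary; this is illustrative, not Lucene's packed-ints code):
{code}
// Sketch: pack one 3-bit value ordinal per document into long[] blocks.
public final class Packed3BitArray {
    private static final int BITS = 3;
    private static final int PER_LONG = 64 / BITS;      // 21 values per long, 1 bit wasted
    private static final long MASK = (1L << BITS) - 1;  // 0b111

    private final long[] blocks;

    public Packed3BitArray(int count) {                 // count = IndexReader.maxDoc()
        blocks = new long[(count + PER_LONG - 1) / PER_LONG];
    }

    public void set(int index, int value) {             // value in [0, 7]
        int block = index / PER_LONG;
        int shift = (index % PER_LONG) * BITS;
        blocks[block] = (blocks[block] & ~(MASK << shift)) | ((long) value << shift);
    }

    public int get(int index) {
        int block = index / PER_LONG;
        int shift = (index % PER_LONG) * BITS;
        return (int) ((blocks[block] >>> shift) & MASK);
    }
}
{code}
At 3 bits per document this stores the value ordinal for 10 million docs in roughly 3.75 MB, versus 40 MB for a plain int[].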
[jira] Issue Comment Edited: (LUCENE-1990) Add unsigned packed int impls in oal.util
[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12775420#action_12775420 ] Fuad Efendi edited comment on LUCENE-1990 at 11/10/09 4:10 PM: --- Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go horizontally we will end up with 8 arrays of int[]. What if we go vertically? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only 1. And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) P.S. Of course each solution has pros and cons, I am trying to focus on FieldCache specific use cases. 1. For a given document ID, find a value for a field 2. For a given query results, sort it by a field values 3. For a given query results, count facet for each field value I don't think such naive compression is slower than abstract int[] arrays... and we need to change public API of field cache too: if method returns int[] we are not saving any RAM. Better is to compare with SOLR use cases and to make API closer to real requirements; SOLR operates with some bitsets instead of arrays... was (Author: funtick): Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go horizontally we will end up with 8 arrays of int[]. What if we go vertically? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only 1. And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) Add unsigned packed int impls in oal.util - Key: LUCENE-1990 URL: https://issues.apache.org/jira/browse/LUCENE-1990 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Priority: Minor There are various places in Lucene that could take advantage of an efficient packed unsigned int/long impl. EG the terms dict index in the standard codec in LUCENE-1458 could subsantially reduce it's RAM usage. FieldCache.StringIndex could as well. And I think load into RAM codecs like the one in TestExternalCodecs could use this too. I'm picturing something very basic like: {code} interface PackedUnsignedLongs { long get(long index); void set(long index, long value); } {code} Plus maybe an iterator for getting and maybe also for setting. 
If it helps, most of the usages of this inside Lucene will be write once so eg the set could make that an assumption/requirement. And a factory somewhere: {code} PackedUnsignedLongs create(int count, long maxValue); {code} I think we should simply autogen the code (we can start from the autogen code in LUCENE-1410), or, if there is an good existing impl that has a compatible license that'd be great. I don't have time near-term to do this... so if anyone has the itch, please jump! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1990) Add unsigned packed int impls in oal.util
[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12775420#action_12775420 ] Fuad Efendi edited comment on LUCENE-1990 at 11/10/09 4:11 PM: --- Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go horizontally we will end up with 8 arrays of int[]. What if we go vertically? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only 1. And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) P.S. Of course each solution has pros and cons, I am trying to focus on FieldCache specific use cases. 1. For a given document ID, find a value for a field 2. For a given query results, sort it by a field values 3. For a given query results, count facet for each field value I don't think such naive compression is slower than abstract int[] arrays... and we need to change public API of field cache too: if method returns int[] we are not saving any RAM. Better is to compare with SOLR use cases and to make API closer to real requirements; SOLR operates with some bitsets instead of arrays... was (Author: funtick): Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go horizontally we will end up with 8 arrays of int[]. What if we go vertically? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only 1. And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) P.S. Of course each solution has pros and cons, I am trying to focus on FieldCache specific use cases. 1. For a given document ID, find a value for a field 2. For a given query results, sort it by a field values 3. For a given query results, count facet for each field value I don't think such naive compression is slower than abstract int[] arrays... and we need to change public API of field cache too: if method returns int[] we are not saving any RAM. Better is to compare with SOLR use cases and to make API closer to real requirements; SOLR operates with some bitsets instead of arrays... 
Add unsigned packed int impls in oal.util - Key: LUCENE-1990 URL: https://issues.apache.org/jira/browse/LUCENE-1990 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Priority: Minor There are various places in Lucene that could take advantage of an efficient packed unsigned int/long impl. EG the terms dict index in the standard codec in LUCENE-1458 could subsantially reduce it's RAM usage. FieldCache.StringIndex could as well. And I think load into RAM codecs like the one in TestExternalCodecs could use this too. I'm picturing something very basic like: {code} interface PackedUnsignedLongs { long get(long index); void set(long index, long value); } {code} Plus maybe an iterator for getting and maybe also for setting. If it helps, most of the usages of this inside Lucene will be write once so eg the set could make that an assumption/requirement. And a factory somewhere: {code} PackedUnsignedLongs create(int count, long maxValue); {code} I think we should simply autogen the code (we can start from the autogen code in LUCENE-1410), or, if there is an good existing impl that has
[jira] Issue Comment Edited: (LUCENE-1990) Add unsigned packed int impls in oal.util
[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12775420#action_12775420 ] Fuad Efendi edited comment on LUCENE-1990 at 11/10/09 4:11 PM: --- Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go horizontally we will end up with 8 arrays of int[]. What if we go vertically? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only 1. And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) P.S. Of course each solution has pros and cons, I am trying to focus on FieldCache specific use cases. 1. For a given document ID, find a value for a field 2. For a given query results, sort it by a field values 3. For a given query results, count facet for each field value I don't think such naive compression is slower than abstract int[] arrays... and we need to change public API of field cache too: if method returns int[] we are not saving any RAM. Better is to compare with SOLR use cases and to make API closer to real requirements; SOLR operates with some bitsets instead of arrays... was (Author: funtick): Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go horizontally we will end up with 8 arrays of int[]. What if we go vertically? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only 1. And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) P.S. Of course each solution has pros and cons, I am trying to focus on FieldCache specific use cases. 1. For a given document ID, find a value for a field 2. For a given query results, sort it by a field values 3. For a given query results, count facet for each field value I don't think such naive compression is slower than abstract int[] arrays... and we need to change public API of field cache too: if method returns int[] we are not saving any RAM. Better is to compare with SOLR use cases and to make API closer to real requirements; SOLR operates with some bitsets instead of arrays... 
Add unsigned packed int impls in oal.util - Key: LUCENE-1990 URL: https://issues.apache.org/jira/browse/LUCENE-1990 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Priority: Minor There are various places in Lucene that could take advantage of an efficient packed unsigned int/long impl. EG the terms dict index in the standard codec in LUCENE-1458 could subsantially reduce it's RAM usage. FieldCache.StringIndex could as well. And I think load into RAM codecs like the one in TestExternalCodecs could use this too. I'm picturing something very basic like: {code} interface PackedUnsignedLongs { long get(long index); void set(long index, long value); } {code} Plus maybe an iterator for getting and maybe also for setting. If it helps, most of the usages of this inside Lucene will be write once so eg the set could make that an assumption/requirement. And a factory somewhere: {code} PackedUnsignedLongs create(int count, long maxValue); {code} I think we should simply autogen the code (we can start from the autogen code in LUCENE-1410), or, if there is an good existing impl that has a
[jira] Commented: (LUCENE-1995) ArrayIndexOutOfBoundsException during indexing
[ https://issues.apache.org/jira/browse/LUCENE-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769699#action_12769699 ] Fuad Efendi commented on LUCENE-1995: - I recall a bug in Arrays.sort() (Joshua Bloch's) which was fixed only after 9 years; signed instead of unsigned arithmetic... ArrayIndexOutOfBoundsException during indexing -- Key: LUCENE-1995 URL: https://issues.apache.org/jira/browse/LUCENE-1995 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Yonik Seeley Assignee: Michael McCandless Fix For: 2.9.1 http://search.lucidimagination.com/search/document/f29fc52348ab9b63/arrayindexoutofboundsexception_during_indexing -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1995) ArrayIndexOutOfBoundsException during indexing
[ https://issues.apache.org/jira/browse/LUCENE-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769715#action_12769715 ] Fuad Efendi commented on LUCENE-1995: - Joshua writes in his Google Research Blog: The version of binary search that I wrote for the JDK contained the same bug. It was reported to Sun recently when it broke someone's program, after lying in wait for nine years or so. http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html Anyway, this is the reporter's specific use case; I didn't have ANY problems with ramBufferSizeMB: 8192 during a month (at least) of constant updates (5000/sec)... Yes, I am using term vectors (as Michael noticed, it plays a role)... And what exactly causes the problem is unclear; the explicit check for 2048 is just a workaround... a quick shortcut... ArrayIndexOutOfBoundsException during indexing -- Key: LUCENE-1995 URL: https://issues.apache.org/jira/browse/LUCENE-1995 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Yonik Seeley Assignee: Michael McCandless Fix For: 2.9.1 http://search.lucidimagination.com/search/document/f29fc52348ab9b63/arrayindexoutofboundsexception_during_indexing -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
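The nine-year bug being quoted is the classic signed midpoint overflow; a minimal illustration of the failure mode Bloch describes (this is a reproduction of the arithmetic, not the JDK source itself):
{code}
public class MidpointOverflow {
    public static void main(String[] args) {
        int low = 2, high = Integer.MAX_VALUE - 1;
        System.out.println((low + high) / 2);       // -1073741824: low + high overflowed int
        System.out.println(low + (high - low) / 2); // 1073741824: the standard fix
        System.out.println((low + high) >>> 1);     // 1073741824: unsigned shift also works
    }
}
{code}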
[jira] Commented: (LUCENE-1995) ArrayIndexOutOfBoundsException during indexing
[ https://issues.apache.org/jira/browse/LUCENE-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769725#action_12769725 ] Fuad Efendi commented on LUCENE-1995: - But who introduced the bug? Joshua writes that it was him :) - based on others' famous findings and books... === it just contains a few lines of code that calculates a double value from two document fields and then stores that value in one of these dynamic fields === And the problem happens when he indexes document number 15,000,000... - I am guessing he is indexing a double... (type=tdouble, indexed=t, stored=f)... Why would we ever need to index a multi-valued double field? Cardinality is the highest possible... I don't know Lucene internals; I am thinking that (double, docID) will occupy 12 bytes, and with a multivalued (or dynamic) field we may need a lot of RAM for 15 million docs... especially if we are trying to put objects into buckets using a hash of the double... ArrayIndexOutOfBoundsException during indexing -- Key: LUCENE-1995 URL: https://issues.apache.org/jira/browse/LUCENE-1995 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Yonik Seeley Assignee: Michael McCandless Fix For: 2.9.1 http://search.lucidimagination.com/search/document/f29fc52348ab9b63/arrayindexoutofboundsexception_during_indexing -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1995) ArrayIndexOutOfBoundsException during indexing
[ https://issues.apache.org/jira/browse/LUCENE-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769747#action_12769747 ] Fuad Efendi commented on LUCENE-1995: - bq. He took it, and the bug with it, from elsewhere. He didn't do the bug either. He just propagated it. This is even worse, especially for such a classic case as Arrays.sort(). Propagating bugs... * The sorting algorithm is a tuned quicksort, adapted from Jon * L. Bentley and M. Douglas McIlroy's Engineering a Sort Function, * Software-Practice and Experience, Vol. 23(11) P. 1249-1265 (November * 1993). This algorithm offers n*log(n) performance on many data sets * that cause other quicksorts to degrade to quadratic performance. bq. If your usage actually went above 2GB, you would have had a problem. 8192 is not a valid value, we don't support it, and now we'll throw an exception if it's over 2048. Now I think my actual usage was below 2Gb... bq. No, we only support a max of 2GB ram buffer, by design currently. Thanks for the confirmation... However, the JavaDoc didn't mention that explicitly, and "by design" is unclear wording... it has already been "by design" for several years... bq. 2048 probably won't be safe, because a large doc just as the buffer is filling up could still overflow. (Though, RAM is also used eg for norms, so you might squeak by). - Uncertainty... ArrayIndexOutOfBoundsException during indexing -- Key: LUCENE-1995 URL: https://issues.apache.org/jira/browse/LUCENE-1995 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Yonik Seeley Assignee: Michael McCandless Fix For: 2.9.1 http://search.lucidimagination.com/search/document/f29fc52348ab9b63/arrayindexoutofboundsexception_during_indexing -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1995) ArrayIndexOutOfBoundsException during indexing
[ https://issues.apache.org/jira/browse/LUCENE-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769749#action_12769749 ] Fuad Efendi commented on LUCENE-1995: - bq. bq. If your usage actually went above 2GB, you would have had a problem. 8192 is not a valid value, we don't support it, and now we'll throw an exception if it's over 2048. bq. Now I think my actual usage was below 2Gb... How was I below 2048 if I had a few segments created by IndexWriter during a day, without any SOLR commit?.. ArrayIndexOutOfBoundsException during indexing -- Key: LUCENE-1995 URL: https://issues.apache.org/jira/browse/LUCENE-1995 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Yonik Seeley Assignee: Michael McCandless Fix For: 2.9.1 http://search.lucidimagination.com/search/document/f29fc52348ab9b63/arrayindexoutofboundsexception_during_indexing -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1995) ArrayIndexOutOfBoundsException during indexing
[ https://issues.apache.org/jira/browse/LUCENE-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769749#action_12769749 ] Fuad Efendi edited comment on LUCENE-1995 at 10/25/09 3:14 AM: --- bq. bq. If your usage actually went above 2GB, you would have had a problem. 8192 is not a valid value, we don't support it, and now we'll throw an exception if it's over 2048. bq. Now I think my actual usage was below 2Gb... How was I below 2048 if I had a few segments created by IndexWriter during a day, without any SOLR commit?.. Maybe I am wrong; it was a few weeks ago... I am currently using 1024 because I need memory for other stuff too, and I don't want to try again... was (Author: funtick): bq. bq. If your usage actually went above 2GB, you would have had a problem. 8192 is not a valid value, we don't support it, and now we'll throw an exception if it's over 2048. bq. Now I think my actual usage was below 2Gb... How was I below 2048 if I had a few segments created by IndexWriter during a day, without any SOLR commit?.. ArrayIndexOutOfBoundsException during indexing -- Key: LUCENE-1995 URL: https://issues.apache.org/jira/browse/LUCENE-1995 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Yonik Seeley Assignee: Michael McCandless Fix For: 2.9.1 http://search.lucidimagination.com/search/document/f29fc52348ab9b63/arrayindexoutofboundsexception_during_indexing -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
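For reference, the setting under discussion lives in solrconfig.xml; given the 2048 cap and the overflow caveat above, a conservative value (the one the comment itself settles on) would look like:
{code}
<!-- solrconfig.xml: Lucene caps this at 2048 MB, and values near the cap can still overflow -->
<ramBufferSizeMB>1024</ramBufferSizeMB>
{code}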
[jira] Closed: (SOLR-711) SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors
[ https://issues.apache.org/jira/browse/SOLR-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi closed SOLR-711. Resolution: Fixed Thanks Shalin for pointing to SOLR-475, which is a very advanced solution to the term-counting approach. SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors -- Key: SOLR-711 URL: https://issues.apache.org/jira/browse/SOLR-711 Project: Solr Issue Type: Improvement Components: search Affects Versions: 1.3 Reporter: Fuad Efendi Fix For: 1.4 Original Estimate: 1680h Remaining Estimate: 1680h From [http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html]: Scenario: - 10,000,000 documents in the index; - 5-10 terms per document; - 200,000 unique terms for a tokenized field. _Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._ Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario). See SimpleFacets.java: {code} public NamedList getFacetTermEnumCounts( SolrIndexSearcher searcher, DocSet docs, ... {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-667) Alternate LRUCache implementation
[ https://issues.apache.org/jira/browse/SOLR-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12635221#action_12635221 ] Fuad Efendi commented on SOLR-667: -- Paul, Yonik, thanks for your efforts; BTW 'Concurrent'HashMap uses spinloops for 'safe' updates in order to avoid synchronization (instead of giving up CPU cycles); there are always cases when it is not faster than a simple HashMap with synchronization. LingPipe uses a different approach; see the last comment at SOLR-665. Also, why are you stuck on LRU? LFU is logically better. +1 and thanks for sharing. Alternate LRUCache implementation - Key: SOLR-667 URL: https://issues.apache.org/jira/browse/SOLR-667 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.3 Reporter: Noble Paul Fix For: 1.4 Attachments: ConcurrentLRUCache.java, ConcurrentLRUCache.java, ConcurrentLRUCache.java, SOLR-667.patch, SOLR-667.patch, SOLR-667.patch, SOLR-667.patch The only available SolrCache, i.e. LRUCache, is based on _LinkedHashMap_, which has _get()_ synchronized as well. This can cause severe bottlenecks for faceted search. Any alternate implementation which can be faster/better must be considered. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
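A small sketch of the synchronization point in question (illustrative, not Solr's code): an access-ordered LinkedHashMap relinks entries on every get(), so even reads need the map-wide lock, while ConcurrentHashMap reads take no lock but provide no eviction ordering of their own:
{code}
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CacheLockDemo {
    public static void main(String[] args) {
        // access-order=true (LRU): get() mutates the linked list, hence the global lock
        Map<String, Object> lru = Collections.synchronizedMap(
            new LinkedHashMap<String, Object>(1024, 0.75f, true));
        // lock-free reads, but no LRU/LFU ordering on its own
        Map<String, Object> concurrent = new ConcurrentHashMap<String, Object>();
        lru.put("q", 1);
        concurrent.put("q", 1);
        System.out.println(lru.get("q") + " / " + concurrent.get("q"));
    }
}
{code}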
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12624835#action_12624835 ] Fuad Efendi commented on LUCENE-831: It would be nice to have a TermVectorCache (if term vectors are stored in the index). Complete overhaul of FieldCache API/Implementation -- Key: LUCENE-831 URL: https://issues.apache.org/jira/browse/LUCENE-831 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Fix For: 3.0 Attachments: fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff, fieldcache-overhaul.diff, LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch Motivation: 1) Complete overhaul of the API/implementation of FieldCache type things... a) eliminate global static map keyed on IndexReader (thus eliminating synch block between completely independent IndexReaders) b) allow more customization of cache management (ie: use expiration/replacement strategies, disk backed caches, etc) c) allow people to define custom cache data logic (ie: custom parsers, complex datatypes, etc... anything tied to a reader) d) allow people to inspect what's in a cache (list of CacheKeys) for an IndexReader so a new IndexReader can be likewise warmed. e) Lend support for smarter cache management if/when IndexReader.reopen is added (merging of cached data from subReaders). 2) Provide backwards compatibility to support existing FieldCache API with the new implementation, so there is no redundant caching as client code migrates to the new API. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (SOLR-711) SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors
SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors -- Key: SOLR-711 URL: https://issues.apache.org/jira/browse/SOLR-711 Project: Solr Issue Type: Improvement Components: search Affects Versions: 1.3 Reporter: Fuad Efendi Fix For: 1.4 From [url]http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html[/url]: Scenario: - 10,000,000 documents in the index; - 5-10 terms per document; - 200,000 unique terms for a tokenized field. _Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._ Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario). See SimpleFacets: {{ public NamedList getFacetTermEnumCounts( SolrIndexSearcher searcher, DocSet docs, String field, int offset, int limit, int mincount, boolean missing, boolean sort, String prefix) throws IOException {...} }} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
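A sketch of the proposed counting path using the Lucene 2.x-era term vector API (class, method, and variable names here are illustrative, not the SOLR-711 patch): walk each hit's term vector and count its terms, instead of intersecting a cached filter per unique term:
{code}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocSet;

public class TermVectorFacetSketch {
    // For small DocSets: O(|docs| * termsPerDoc) instead of O(uniqueTerms) intersections.
    public NamedList countViaTermVectors(IndexReader reader, DocSet docs, String field)
            throws IOException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        DocIterator it = docs.iterator();
        while (it.hasNext()) {
            int docId = it.nextDoc();
            TermFreqVector tfv = reader.getTermFreqVector(docId, field); // null unless vectors are stored
            if (tfv == null) continue;
            for (String term : tfv.getTerms()) {
                Integer c = counts.get(term);
                counts.put(term, c == null ? 1 : c + 1);
            }
        }
        NamedList res = new NamedList();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            res.add(e.getKey(), e.getValue());
        }
        return res;
    }
}
{code}
This matches the scenario in the description: with 10-20,000 matching documents and 5-10 terms each, the loop touches at most ~200,000 term occurrences, versus 200,000 full DocSet intersections.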
[jira] Updated: (SOLR-711) SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors
[ https://issues.apache.org/jira/browse/SOLR-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated SOLR-711: - Comment: was deleted SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors -- Key: SOLR-711 URL: https://issues.apache.org/jira/browse/SOLR-711 Project: Solr Issue Type: Improvement Components: search Affects Versions: 1.3 Reporter: Fuad Efendi Fix For: 1.4 Original Estimate: 1680h Remaining Estimate: 1680h From [http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html]: Scenario: - 10,000,000 documents in the index; - 5-10 terms per document; - 200,000 unique terms for a tokenized field. _Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._ Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario). See SimpleFacets: {code:title=SimpleFacets.java|borderStyle=solid} public NamedList getFacetTermEnumCounts( SolrIndexSearcher searcher, DocSet docs, ... {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-711) SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors
[ https://issues.apache.org/jira/browse/SOLR-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated SOLR-711: - Description: From [http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html]: Scenario: - 10,000,000 documents in the index; - 5-10 terms per document; - 200,000 unique terms for a tokenized field. _Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._ Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario). See SimpleFacets.java: {code} public NamedList getFacetTermEnumCounts( SolrIndexSearcher searcher, DocSet docs, ... {code} was: From [http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html]: Scenario: - 10,000,000 documents in the index; - 5-10 terms per document; - 200,000 unique terms for a tokenized field. _Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._ Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario). See SimpleFacets: {code:title=SimpleFacets.java|borderStyle=solid} public NamedList getFacetTermEnumCounts( SolrIndexSearcher searcher, DocSet docs, ... {code} SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors -- Key: SOLR-711 URL: https://issues.apache.org/jira/browse/SOLR-711 Project: Solr Issue Type: Improvement Components: search Affects Versions: 1.3 Reporter: Fuad Efendi Fix For: 1.4 Original Estimate: 1680h Remaining Estimate: 1680h From [http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html]: Scenario: - 10,000,000 documents in the index; - 5-10 terms per document; - 200,000 unique terms for a tokenized field. _Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._ Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario). See SimpleFacets.java: {code} public NamedList getFacetTermEnumCounts( SolrIndexSearcher searcher, DocSet docs, ... {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-711) SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors
[ https://issues.apache.org/jira/browse/SOLR-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated SOLR-711: - Description: From [http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html]: Scenario: - 10,000,000 documents in the index; - 5-10 terms per document; - 200,000 unique terms for a tokenized field. _Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._ Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario). See SimpleFacets: {code:title=SimpleFacets.java|borderStyle=solid} public NamedList getFacetTermEnumCounts( SolrIndexSearcher searcher, DocSet docs, ... {code} was: From [url]http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html[/url]: Scenario: - 10,000,000 documents in the index; - 5-10 terms per document; - 200,000 unique terms for a tokenized field. _Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._ Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario). See SimpleFacets: {{ public NamedList getFacetTermEnumCounts( SolrIndexSearcher searcher, DocSet docs, String field, int offset, int limit, int mincount, boolean missing, boolean sort, String prefix) throws IOException {...} }} trivial formatting SimpleFacets: Performance Boost for Tokenized Fields for smaller DocSet using Term Vectors -- Key: SOLR-711 URL: https://issues.apache.org/jira/browse/SOLR-711 Project: Solr Issue Type: Improvement Components: search Affects Versions: 1.3 Reporter: Fuad Efendi Fix For: 1.4 Original Estimate: 1680h Remaining Estimate: 1680h From [http://www.nabble.com/SimpleFacets%3A-Performance-Boost-for-Tokenized-Fields-td19033760.html]: Scenario: - 10,000,000 documents in the index; - 5-10 terms per document; - 200,000 unique terms for a tokenized field. _Obviously calculating sizes of 200,000 intersections with FilterCache is 100 times slower than traversing 10 - 20,000 documents for smaller DocSets and counting frequencies of Terms._ Not applicable if size of DocSet is close to total number of unique tokens (200,000 in our scenario). See SimpleFacets: {code:title=SimpleFacets.java|borderStyle=solid} public NamedList getFacetTermEnumCounts( SolrIndexSearcher searcher, DocSet docs, ... {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-671) Range queries with 'slong' field type do not retrieve correct results
Range queries with 'slong' field type do not retrieve correct results - Key: SOLR-671 URL: https://issues.apache.org/jira/browse/SOLR-671 Project: Solr Issue Type: Bug Environment: SOLR-1.3-DEV Schema: <!-- Numeric field types that manipulate the value into a string value that isn't human-readable in its internal form, but with a lexicographic ordering the same as the numeric ordering, so that range queries work correctly. --> <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/> <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/> <field name="timestamp" type="slong" indexed="true" stored="true"/> Reporter: Fuad Efendi Range queries always return all results (do not filter): timestamp:[1019386401114 TO 1219386401114] <lst name="debug"> <str name="rawquerystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="querystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery_toString">timestamp:[&#8;&#0;εごᅚ TO &#8;&#0;ѯ刯慚]</str> ... <str name="QParser">OldLuceneQParser</str> -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-671) Range queries with 'slong' field type do not retrieve correct results
[ https://issues.apache.org/jira/browse/SOLR-671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated SOLR-671: - Priority: Blocker (was: Major) Affects Version/s: 1.3 Range queries with 'slong' field type do not retrieve correct results - Key: SOLR-671 URL: https://issues.apache.org/jira/browse/SOLR-671 Project: Solr Issue Type: Bug Affects Versions: 1.3 Environment: SOLR-1.3-DEV Schema: <!-- Numeric field types that manipulate the value into a string value that isn't human-readable in its internal form, but with a lexicographic ordering the same as the numeric ordering, so that range queries work correctly. --> <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/> <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/> <field name="timestamp" type="slong" indexed="true" stored="true"/> Reporter: Fuad Efendi Priority: Blocker Original Estimate: 168h Remaining Estimate: 168h Range queries always return all results (do not filter): timestamp:[1019386401114 TO 1219386401114] <lst name="debug"> <str name="rawquerystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="querystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery_toString">timestamp:[&#8;&#0;εごᅚ TO &#8;&#0;ѯ刯慚]</str> ... <str name="QParser">OldLuceneQParser</str> -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-671) Range queries with 'slong' field type do not retrieve correct results
[ https://issues.apache.org/jira/browse/SOLR-671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated SOLR-671: - Priority: Trivial (was: Blocker) Issue Type: Test (was: Bug) I executed another query, which works fine: timestamp:[* TO 1000] - 0 results. Finally found that it works... Please close. Range queries with 'slong' field type do not retrieve correct results - Key: SOLR-671 URL: https://issues.apache.org/jira/browse/SOLR-671 Project: Solr Issue Type: Test Affects Versions: 1.3 Environment: SOLR-1.3-DEV Schema: <!-- Numeric field types that manipulate the value into a string value that isn't human-readable in its internal form, but with a lexicographic ordering the same as the numeric ordering, so that range queries work correctly. --> <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/> <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/> <field name="timestamp" type="slong" indexed="true" stored="true"/> Reporter: Fuad Efendi Priority: Trivial Original Estimate: 168h Remaining Estimate: 168h Range queries always return all results (do not filter): timestamp:[1019386401114 TO 1219386401114] <lst name="debug"> <str name="rawquerystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="querystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery_toString">timestamp:[&#8;&#0;εごᅚ TO &#8;&#0;ѯ刯慚]</str> ... <str name="QParser">OldLuceneQParser</str> -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-671) Range queries with 'slong' field type do not retrieve correct results
[ https://issues.apache.org/jira/browse/SOLR-671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated SOLR-671: - Priority: Major (was: Trivial) Issue Type: Bug (was: Test) Here is a test case, similar to the Arrays.sort() bug (unsigned...): {code} long time1 = System.currentTimeMillis() - 30*24*3600*1000; long time2 = 30*24*3600*1000; System.out.println(time1); System.out.println(time1-time2); Output: 1219389000674 1221091967970 {code} (time1-time2) > time1! What happens inside SOLR slong for such queries? Range queries with 'slong' field type do not retrieve correct results - Key: SOLR-671 URL: https://issues.apache.org/jira/browse/SOLR-671 Project: Solr Issue Type: Bug Affects Versions: 1.3 Environment: SOLR-1.3-DEV Schema: <!-- Numeric field types that manipulate the value into a string value that isn't human-readable in its internal form, but with a lexicographic ordering the same as the numeric ordering, so that range queries work correctly. --> <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/> <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/> <field name="timestamp" type="slong" indexed="true" stored="true"/> Reporter: Fuad Efendi Original Estimate: 168h Remaining Estimate: 168h Range queries always return all results (do not filter): timestamp:[1019386401114 TO 1219386401114] <lst name="debug"> <str name="rawquerystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="querystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery_toString">timestamp:[&#8;&#0;εごᅚ TO &#8;&#0;ѯ刯慚]</str> ... <str name="QParser">OldLuceneQParser</str> -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Issue Comment Edited: (SOLR-671) Range queries with 'slong' field type do not retrieve correct results
[ https://issues.apache.org/jira/browse/SOLR-671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12619223#action_12619223 ] funtick edited comment on SOLR-671 at 8/2/08 7:12 AM: -- Here is a test case, similar to the Arrays.sort() bug (unsigned...): {code} long time1 = System.currentTimeMillis(); long time2 = 30*24*3600*1000; System.out.println(time1); System.out.println(time1-time2); Output: 1219389000674 1221091967970 {code} (time1-time2) > time1! What happens inside SOLR slong for such queries? was (Author: funtick): Here is a test case, similar to the Arrays.sort() bug (unsigned...): {code} long time1 = System.currentTimeMillis() - 30*24*3600*1000; long time2 = 30*24*3600*1000; System.out.println(time1); System.out.println(time1-time2); Output: 1219389000674 1221091967970 {code} (time1-time2) > time1! What happens inside SOLR slong for such queries? Range queries with 'slong' field type do not retrieve correct results - Key: SOLR-671 URL: https://issues.apache.org/jira/browse/SOLR-671 Project: Solr Issue Type: Bug Affects Versions: 1.3 Environment: SOLR-1.3-DEV Schema: <!-- Numeric field types that manipulate the value into a string value that isn't human-readable in its internal form, but with a lexicographic ordering the same as the numeric ordering, so that range queries work correctly. --> <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/> <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/> <field name="timestamp" type="slong" indexed="true" stored="true"/> Reporter: Fuad Efendi Original Estimate: 168h Remaining Estimate: 168h Range queries always return all results (do not filter): timestamp:[1019386401114 TO 1219386401114] <lst name="debug"> <str name="rawquerystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="querystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery_toString">timestamp:[&#8;&#0;εごᅚ TO &#8;&#0;ѯ刯慚]</str> ... <str name="QParser">OldLuceneQParser</str> -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-671) Range queries with 'slong' field type do not retrieve correct results
[ https://issues.apache.org/jira/browse/SOLR-671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12619227#action_12619227 ] Fuad Efendi commented on SOLR-671: -- {code} long time1 = System.currentTimeMillis(); long time2 = 30*24*3600*1000; long time3 = time1 - time2; System.out.println("Time1: " + time1); System.out.println("Time2: " + time2); System.out.println("Time3: " + time3); Time1: 1217686478242 Time2: -1702967296 Time3: 1219389445538 {code} The bug is obvious... {code} long time1 = System.currentTimeMillis(); long time2 = 30*24*3600*1000L; long time3 = time1 - time2; System.out.println("Time1: " + time1); System.out.println("Time2: " + time2); System.out.println("Time3: " + time3); Time1: 1217686559557 Time2: 2592000000 Time3: 1215094559557 {code} Close it... Range queries with 'slong' field type do not retrieve correct results - Key: SOLR-671 URL: https://issues.apache.org/jira/browse/SOLR-671 Project: Solr Issue Type: Bug Environment: SOLR-1.3-DEV Schema: <!-- Numeric field types that manipulate the value into a string value that isn't human-readable in its internal form, but with a lexicographic ordering the same as the numeric ordering, so that range queries work correctly. --> <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/> <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/> <field name="timestamp" type="slong" indexed="true" stored="true"/> Reporter: Fuad Efendi Original Estimate: 168h Remaining Estimate: 168h Range queries always return all results (do not filter): timestamp:[1019386401114 TO 1219386401114] <lst name="debug"> <str name="rawquerystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="querystring">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery">timestamp:[1019386401114 TO 1219386401114]</str> <str name="parsedquery_toString">timestamp:[&#8;&#0;εごᅚ TO &#8;&#0;ѯ刯慚]</str> ... <str name="QParser">OldLuceneQParser</str> -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-671) Range queries with 'slong' field type do not retrieve correct results
[ https://issues.apache.org/jira/browse/SOLR-671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fuad Efendi updated SOLR-671:
------------------------------
             Priority: Trivial  (was: Major)
           Issue Type: Test  (was: Bug)
    Affects Version/s:     (was: 1.3)

> Range queries with 'slong' field type do not retrieve correct results
> ----------------------------------------------------------------------
>
>                 Key: SOLR-671
>                 URL: https://issues.apache.org/jira/browse/SOLR-671
>             Project: Solr
>          Issue Type: Test
>         Environment: SOLR-1.3-DEV
> Schema:
> <!-- Numeric field types that manipulate the value into a string value that isn't human-readable in its internal form, but with a lexicographic ordering the same as the numeric ordering, so that range queries work correctly. -->
> <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
> <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/>
> <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/>
> <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/>
> <field name="timestamp" type="slong" indexed="true" stored="true"/>
>            Reporter: Fuad Efendi
>            Priority: Trivial
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Range queries always return all results (do not filter): timestamp:[1019386401114 TO 1219386401114]
> <lst name="debug">
> <str name="rawquerystring">timestamp:[1019386401114 TO 1219386401114]</str>
> <str name="querystring">timestamp:[1019386401114 TO 1219386401114]</str>
> <str name="parsedquery">timestamp:[1019386401114 TO 1219386401114]</str>
> <str name="parsedquery_toString">timestamp:[&#8;&#0;εごᅚ TO &#8;&#0;ѯ刯慚]</str>
> ...
> <str name="QParser">OldLuceneQParser</str>
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-665) FIFO Cache (Unsynchronized): 9x times performance boost
[ https://issues.apache.org/jira/browse/SOLR-665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619058#action_12619058 ] Fuad Efendi commented on SOLR-665:
-----------------------------------
The guys at LingPipe (Natural Language Processing, http://alias-i.com/) are using excellent Map implementations with an optimistic concurrency strategy:
http://alias-i.com/lingpipe/docs/api/com/aliasi/util/FastCache.html
http://alias-i.com/lingpipe/docs/api/com/aliasi/util/HardFastCache.html

> FIFO Cache (Unsynchronized): 9x times performance boost
> --------------------------------------------------------
>
>                 Key: SOLR-665
>                 URL: https://issues.apache.org/jira/browse/SOLR-665
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 1.3
>         Environment: JRockit R27 (Java 6)
>            Reporter: Fuad Efendi
>         Attachments: ConcurrentFIFOCache.java, ConcurrentFIFOCache.java, ConcurrentLRUCache.java, ConcurrentLRUWeakCache.java, FIFOCache.java, SimplestConcurrentLRUCache.java
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Attached is a modified version of LRUCache where
> 1. map = new LinkedHashMap(initialSize, 0.75f, false) - so that reordering/true (the performance bottleneck of LRU) is replaced with insertion-order/false (so that it becomes FIFO)
> 2. Almost all (absolutely unnecessary) synchronized statements are commented out
> See discussion at http://www.nabble.com/LRUCache---synchronized%21--td16439831.html
> Performance metrics (taken from SOLR Admin):
> LRU:
> Requests: 7638
> Average Time-Per-Request: 15300
> Average Request-per-Second: 0.06
> FIFO:
> Requests: 3355
> Average Time-Per-Request: 1610
> Average Request-per-Second: 0.11
> Performance increased 9 times, which roughly corresponds to the number of CPUs in the system, http://www.tokenizer.org/ (Shopping Search Engine at Tokenizer.org)
> Current number of documents: 7494689
> name: filterCache
> class: org.apache.solr.search.LRUCache
> version: 1.0
> description: LRU Cache(maxSize=1000, initialSize=1000)
> stats: lookups : 15966954582
> hits : 16391851546
> hitratio : 0.102
> inserts : 4246120
> evictions : 0
> size : 2668705
> cumulative_lookups : 16415839763
> cumulative_hits : 16411608101
> cumulative_hitratio : 0.99
> cumulative_inserts : 4246246
> cumulative_evictions : 0
> Thanks
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
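The FIFO variant described in the issue above hinges on LinkedHashMap's third constructor argument. Here is a minimal sketch of that idea, assuming a simple size bound; the FifoCache class name is hypothetical, and this is not the attached FIFOCache.java:
{code}
import java.util.LinkedHashMap;
import java.util.Map;

// The third constructor argument picks the ordering the issue description
// refers to: true = access order (LRU), false = insertion order (FIFO).
// In insertion-order mode get() never relinks entries, which is why the
// LRU reordering bottleneck disappears.
public class FifoCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    public FifoCache(int initialSize, int maxSize) {
        super(initialSize, 0.75f, false); // false => insertion order => FIFO
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the oldest insertion, not the least recently used entry.
        return size() > maxSize;
    }
}
{code}
Worth noting: even in insertion-order mode a LinkedHashMap is not safe under concurrent writers, so removing synchronization, as the attached patch reportedly does, trades strict safety for throughput on read-mostly workloads.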
[jira] Commented: (SOLR-667) Alternate LRUCache implementation
[ https://issues.apache.org/jira/browse/SOLR-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12618750#action_12618750 ] Fuad Efendi commented on SOLR-667:
-----------------------------------
bq. ...safety, where nothing bad ever happens to an object.
When _SOLR_ adds an object to the cache or removes it from the cache, it does not change the object; it manipulates internal arrays of pointers to objects (which are probably atomic, but I don't know such JVM/GC internals in depth...)
Looks heavy with TreeSet...

> Alternate LRUCache implementation
> ---------------------------------
>
>                 Key: SOLR-667
>                 URL: https://issues.apache.org/jira/browse/SOLR-667
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Noble Paul
>         Attachments: ConcurrentLRUCache.java
>
> The only available SolrCache, i.e. LRUCache, is based on _LinkedHashMap_ which has _get()_ also synchronized. This can cause severe bottlenecks for faceted search. Any alternate implementation which can be faster/better must be considered.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-665) FIFO Cache (Unsynchronized): 9x times performance boost
[ https://issues.apache.org/jira/browse/SOLR-665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12618766#action_12618766 ] Fuad Efendi commented on SOLR-665:
-----------------------------------
I don't think ConcurrentHashMap will improve performance, and ConcurrentMap is not what SOLR needs:
{code}
V putIfAbsent(K key, V value);
V replace(K key, V value);
boolean replace(K key, V oldValue, V newValue);
{code}
There is also some(...) overhead with _oldValue_ and _the state of the hash table at some point_; additional memory requirements; etc... Can we design something plainly simpler, focused on SOLR-specific requirements, without all the functionality of Map etc.?

> FIFO Cache (Unsynchronized): 9x times performance boost
> --------------------------------------------------------
>
>                 Key: SOLR-665
>                 URL: https://issues.apache.org/jira/browse/SOLR-665
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 1.3
>         Environment: JRockit R27 (Java 6)
>            Reporter: Fuad Efendi
>         Attachments: ConcurrentFIFOCache.java, ConcurrentFIFOCache.java, ConcurrentLRUCache.java, ConcurrentLRUWeakCache.java, FIFOCache.java, SimplestConcurrentLRUCache.java
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Attached is a modified version of LRUCache where
> 1. map = new LinkedHashMap(initialSize, 0.75f, false) - so that reordering/true (the performance bottleneck of LRU) is replaced with insertion-order/false (so that it becomes FIFO)
> 2. Almost all (absolutely unnecessary) synchronized statements are commented out
> See discussion at http://www.nabble.com/LRUCache---synchronized%21--td16439831.html
> Performance metrics (taken from SOLR Admin):
> LRU:
> Requests: 7638
> Average Time-Per-Request: 15300
> Average Request-per-Second: 0.06
> FIFO:
> Requests: 3355
> Average Time-Per-Request: 1610
> Average Request-per-Second: 0.11
> Performance increased 9 times, which roughly corresponds to the number of CPUs in the system, http://www.tokenizer.org/ (Shopping Search Engine at Tokenizer.org)
> Current number of documents: 7494689
> name: filterCache
> class: org.apache.solr.search.LRUCache
> version: 1.0
> description: LRU Cache(maxSize=1000, initialSize=1000)
> stats: lookups : 15966954582
> hits : 16391851546
> hitratio : 0.102
> inserts : 4246120
> evictions : 0
> size : 2668705
> cumulative_lookups : 16415839763
> cumulative_hits : 16411608101
> cumulative_hitratio : 0.99
> cumulative_inserts : 4246246
> cumulative_evictions : 0
> Thanks
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
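One way to picture the "plainly simpler" cache asked for above is an API stripped down to get/put with last-write-wins semantics. A hypothetical sketch follows; SimpleCache is an illustrative name, not Solr code, and the backing ConcurrentHashMap is chosen only to keep the example runnable:
{code}
import java.util.concurrent.ConcurrentHashMap;

// Only the two operations a recomputable cache needs: no ConcurrentMap
// contract exposed, no putIfAbsent/replace semantics to reason about.
public final class SimpleCache<K, V> {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();

    public V get(K key) {
        return map.get(key); // non-blocking read path
    }

    public void put(K key, V value) {
        // If two threads race on the same key, the last write wins; that is
        // acceptable for a cache whose entries can always be recomputed.
        map.put(key, value);
    }
}
{code}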
[jira] Commented: (SOLR-667) Alternate LRUCache implementation
[ https://issues.apache.org/jira/browse/SOLR-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12618805#action_12618805 ] Fuad Efendi commented on SOLR-667:
-----------------------------------
Paul, I have never ever suggested using 'volatile' 'to avoid synchronization' for concurrent programming. I only noticed some extremely stupid code where SOLR uses _double_ synchronization with an AtomicLong inside:
{code}
public synchronized Object put(Object key, Object value) {
  if (state == State.LIVE) {
    stats.inserts.incrementAndGet();
  }
  synchronized (map) {
    // increment local inserts regardless of state???
    // it does make it more consistent with the current size...
    inserts++;
    return map.put(key, value);
  }
}
{code}
Each tool has an area of applicability, and even ConcurrentHashMap only slightly intersects with SOLR's needs; SOLR does not need a 'consistent view at a point in time' on cached objects. 'volatile' is part of the Java specs, and implemented differently by different vendors. I use volatile (instead of the more expensive AtomicLong) only and only to prevent the JVM HotSpot optimizer from doing some _not-applicable_ stuff...

> Alternate LRUCache implementation
> ---------------------------------
>
>                 Key: SOLR-667
>                 URL: https://issues.apache.org/jira/browse/SOLR-667
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Noble Paul
>         Attachments: ConcurrentLRUCache.java
>
> The only available SolrCache, i.e. LRUCache, is based on _LinkedHashMap_ which has _get()_ also synchronized. This can cause severe bottlenecks for faceted search. Any alternate implementation which can be faster/better must be considered.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
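The point being made above, restated: the outer synchronized on the method already serializes callers, so the inner synchronized(map) block and the AtomicLong add cost without adding safety, provided every other access to the map and the counters takes the same monitor. A hypothetical single-lock rewrite under that assumption (class and field names are illustrative, not a Solr patch):
{code}
import java.util.HashMap;
import java.util.Map;

// Once every access to 'map' and the counters is guarded by the same 'this'
// monitor, plain long fields suffice and both the nested lock and the
// AtomicLong become redundant.
public class SingleLockCache {
    enum State { CREATED, LIVE }

    private final Map<Object, Object> map = new HashMap<>();
    private State state = State.LIVE;
    private long statsInserts; // was stats.inserts.incrementAndGet() on an AtomicLong
    private long inserts;      // local counter, same monitor

    public synchronized Object put(Object key, Object value) {
        if (state == State.LIVE) {
            statsInserts++;
        }
        inserts++;
        return map.put(key, value);
    }

    public synchronized Object get(Object key) {
        // Reads must take the same monitor for the single-lock argument to hold.
        return map.get(key);
    }
}
{code}
This is a sketch of the argument, not a drop-in replacement: it only holds if get() and every other accessor synchronize on the same object.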