[jira] [Commented] (LUCENE-5938) New DocIdSet implementation with random write access

2014-09-11 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129950#comment-14129950
 ] 

Eks Dev commented on LUCENE-5938:
-

Just a crazy idea.   Do you need to store words with all bits set? I did not look 
into the implementation, but from your description it sounds like it might well be 
possible not to store them without adding too many ifs on the execution path. 
This way, it would also work better for dense bit sets (like the implicit inverting 
trick), and for all intermediate cases where you have some partial sorting (some 
sort of run-length encoding)? 
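
Purely to illustrate the general direction (a sketch, not the actual LUCENE-5938 
patch; the class and its layout are made up):

{code:Java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: run-length encode all-zero / all-one 64-bit words of a
// bit set instead of storing them verbatim. Names and layout are illustrative.
public class WordRunEncoder {

  /** A run of 'count' identical words, or a single literal word when count == 1. */
  static final class Run {
    final long word;
    final int count;
    Run(long word, int count) { this.word = word; this.count = count; }
  }

  /** Collapses consecutive all-zero (0L) and all-one (~0L) words into runs. */
  static List<Run> encode(long[] words) {
    List<Run> runs = new ArrayList<>();
    int i = 0;
    while (i < words.length) {
      long w = words[i];
      if (w == 0L || w == ~0L) {          // empty or dense word: count the run
        int j = i;
        while (j < words.length && words[j] == w) {
          j++;
        }
        runs.add(new Run(w, j - i));
        i = j;
      } else {                            // mixed word: store it literally
        runs.add(new Run(w, 1));
        i++;
      }
    }
    return runs;
  }
}
{code}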


 New DocIdSet implementation with random write access
 

 Key: LUCENE-5938
 URL: https://issues.apache.org/jira/browse/LUCENE-5938
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
 Attachments: LUCENE-5938.patch


 We have a great cost API that is supposed to help make decisions about how to 
 best execute queries. However, due to the fact that several of our filter 
 implementations (eg. TermsFilter and BooleanFilter) return FixedBitSets, 
 either we use the cost API and make bad decisions, or we need to fall back to 
 heuristics which are not as good, such as 
 RandomAccessFilterStrategy.useRandomAccess, which decides that random access 
 should be used if the first doc in the set is less than 100.
 On the other hand, we also have some nice compressed and cacheable DocIdSet 
 implementations, but we cannot make use of them because TermsFilter requires a 
 DocIdSet that has random write access, and FixedBitSet is the only DocIdSet 
 that we have that supports random access.
 I think it would be nice to replace FixedBitSet in those filters with another 
 DocIdSet that would also support random write access but would have a better 
 cost?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-09-03 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119977#comment-14119977
 ] 

Eks Dev commented on LUCENE-5914:
-

Lovely, thanks for explaining. I expected something like this but was not 100% 
sure without looking into the code. 
Simply put, I see absolutely nothing one might wish for from general, OOTB 
compression support... 

In theory...
The only meaningful enhancements beyond the standard approach can come from 
modelling the semantics of the data (the user must know quite a bit about the 
distribution of the data) to improve compression/speed, but this cannot be 
provided by the core (Lucene is rightly content-agnostic); at most the core 
APIs might make it more or less comfortable, but imo nothing more. 

For example (contrived, as LZ4 would deal with it quite ok, just to illustrate): 
if I know that my field contains up to 5 distinct string values, I might add 
simple dictionary coding to use at most one byte, without even going to the codec 
level. The only place where I see a theoretical need to go down and dirty is if I 
want to reach sub-byte representations (3 bits per value in this example), but 
this is rarely needed, hard to beat default LZ4/deflate with, and even harder not 
to make slow. At the end of the day, someone who needs this type of 
specialisation should be able to write their own per-field codec.
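
As a minimal sketch of that "dictionary coding above the codec level" idea 
(nothing Lucene-specific; the class name and the 255-value limit are just 
assumptions for the example):

{code:Java}
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of per-field dictionary coding done in application code,
// before the value ever reaches the codec. Purely illustrative.
public class TinyFieldDictionary {

  private final String[] byCode;                       // code -> value
  private final Map<String, Byte> byValue = new HashMap<>();

  TinyFieldDictionary(String... knownValues) {         // e.g. up to 5 colors
    if (knownValues.length > 255) {
      throw new IllegalArgumentException("max 255 values");
    }
    byCode = knownValues.clone();
    for (int i = 0; i < knownValues.length; i++) {
      byValue.put(knownValues[i], (byte) i);
    }
  }

  byte encode(String value) {                          // store this single byte
    Byte code = byValue.get(value);
    if (code == null) {
      throw new IllegalArgumentException("unknown value: " + value);
    }
    return code;
  }

  String decode(byte code) {                           // restore on retrieval
    return byCode[code & 0xFF];
  }
}
{code}

Usage would be something like new TinyFieldDictionary("red", "green", 
"blue").encode("green"), storing the returned byte in a binary stored field.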

Great work, and thanks again!

 

 More options for stored fields compression
 --

 Key: LUCENE-5914
 URL: https://issues.apache.org/jira/browse/LUCENE-5914
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
 Fix For: 4.11

 Attachments: LUCENE-5914.patch


 Since we added codec-level compression in Lucene 4.1 I think I got about the 
 same number of users complaining that compression was too aggressive and that 
 compression was too light.
 I think it is due to the fact that we have users that are doing very 
 different things with Lucene. For example if you have a small index that fits 
 in the filesystem cache (or is close to), then you might never pay for actual 
 disk seeks, and in such a case the fact that the current stored fields format 
 needs to over-decompress data can noticeably slow down search on cheap queries.
 On the other hand, it is more and more common to use Lucene for things like 
 log analytics, and in that case you have huge amounts of data for which you 
 don't care much about stored fields performance. However it is very 
 frustrating to notice that the data that you store takes several times less 
 space when you gzip it compared to your index, although Lucene claims to 
 compress stored fields.
 For that reason, I think it would be nice to have some kind of option that 
 would allow trading speed for compression in the default codec.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-09-02 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117986#comment-14117986
 ] 

Eks Dev commented on LUCENE-5914:
-

bq. Do you have pointers to emails/irc logs describing such issues?

I do not know what the gold-standard Lucene usage is, but I can describe at least 
one use case; maybe it helps. I am not proposing anything here, just sharing 
experience. 

Think about the (typical Lucene?) usage with structured data (e.g. indexing a 
relational db, like a product catalog or such) with many smallish fields, and then 
retrieving 2k such documents to post-process them: classify, cluster them or 
whatnot (e.g. Mahout and co.). 

- Default compression with CHUNK_SIZE makes it decompress 2k * CHUNK_SIZE/2 
bytes on average in order to retrieve 2k documents. 
- Reducing chunk_size helps a lot, but there is a sweet spot: if you reduce 
it too much, you will not see enough compression, and then your index does not 
fit into the cache, so you get hurt on IO. 

Ideally we would be able to use a biggish chunk_size during compression to 
improve the ratio, and decompress only a single document (independent of 
chunk_size), just like you proposed here (if I understood it correctly?).
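
A back-of-the-envelope illustration of that trade-off (the chunk size and 
average document size are made-up numbers, just to show the arithmetic):

{code:Java}
// Rough cost model: retrieving N documents from chunk-compressed stored fields
// decompresses on average half a chunk per document, vs. roughly one document
// worth of bytes with per-document random access. Numbers are illustrative only.
public class ChunkCostSketch {
  public static void main(String[] args) {
    int docsToRetrieve = 2_000;
    int chunkSizeBytes = 16 * 1024;     // hypothetical chunk_size
    int avgDocSizeBytes = 512;          // hypothetical average stored-document size

    long chunkedCost = (long) docsToRetrieve * chunkSizeBytes / 2;  // ~16 MB decompressed
    long perDocCost  = (long) docsToRetrieve * avgDocSizeBytes;     // ~1 MB decompressed

    System.out.println("chunk-based decompression: ~" + chunkedCost + " bytes");
    System.out.println("per-document decompression: ~" + perDocCost + " bytes");
  }
}
{code}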

Usually, such data is highly compressible (imagine all these low cardinality 
fields like color of something...) and even some basic compression does the 
magic.

What did we do?
- Reduced chunk_size.
- As a bonus to improve compression, added plain static dictionary compression 
for a few fields in the update chain (we store analysed fields).
- When applicable, we periodically pre-sort the collection before indexing (on 
low-cardinality fields first); this old db-admin secret weapon helps a lot.

Conclusion: compression is great, and anything that helps tweak this balance 
(CPU effort / IO effort) smoothly in the different phases (indexing/retrieving) 
makes Lucene's use-case coverage broader (e.g. I want to afford more CPU 
during indexing and less CPU during retrieval, a static coder being the extreme 
case of this...).

I am not sure I figured out exactly if and how this patch is going to help in 
such cases (how do we achieve reasonable compression if we do per-document 
compression for small documents? Reusing dictionaries from previous chunks? 
Static dictionaries?...). 

In any case, thanks for doing the heavy lifting here! I think you already did a 
really great job with compression in Lucene. 

PS: Ages ago, before Lucene, when memory was really expensive, we had our own 
serialization (not in Lucene) that simply had one static Huffman coder per 
field (with byte or word symbols), with the code table populated offline. That 
was a great, simple option, as it enabled reasonable compression for slowly 
changing collections and really fast random access.  
 

 More options for stored fields compression
 --

 Key: LUCENE-5914
 URL: https://issues.apache.org/jira/browse/LUCENE-5914
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
 Fix For: 4.11

 Attachments: LUCENE-5914.patch


 Since we added codec-level compression in Lucene 4.1 I think I got about the 
 same number of users complaining that compression was too aggressive and that 
 compression was too light.
 I think it is due to the fact that we have users that are doing very 
 different things with Lucene. For example if you have a small index that fits 
 in the filesystem cache (or is close to), then you might never pay for actual 
 disk seeks, and in such a case the fact that the current stored fields format 
 needs to over-decompress data can noticeably slow down search on cheap queries.
 On the other hand, it is more and more common to use Lucene for things like 
 log analytics, and in that case you have huge amounts of data for which you 
 don't care much about stored fields performance. However it is very 
 frustrating to notice that the data that you store takes several times less 
 space when you gzip it compared to your index, although Lucene claims to 
 compress stored fields.
 For that reason, I think it would be nice to have some kind of option that 
 would allow trading speed for compression in the default codec.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Parquet dictionary encoding bit packing

2013-09-16 Thread eks dev
Indeed, I did look at Parquet and had the same feeling as Otis: some
striking similarity with the terminology used around stored fields.

If I got it right, Parquet stores sets of documents in chunks, just
like Lucene does, but each chunk is column-stride.
Maybe it is possible to apply this idea to compressing stored fields (chunks in
column-stride fashion)?
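
A toy sketch of what column-stride chunks could mean for stored fields (this is 
not Parquet's or Lucene's actual format; the layout and the use of Deflater are 
just assumptions for illustration):

{code:Java}
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.Deflater;

// Toy column-stride layout: concatenate each field's values across the chunk
// and compress every column separately, so similar values sit next to each other.
public class ColumnStrideChunk {

  static Map<String, byte[]> compressChunk(List<Map<String, String>> docs) {
    // 1. Regroup by field (column-stride) instead of by document.
    Map<String, StringBuilder> columns = new LinkedHashMap<>();
    for (Map<String, String> doc : docs) {
      for (Map.Entry<String, String> e : doc.entrySet()) {
        columns.computeIfAbsent(e.getKey(), k -> new StringBuilder())
               .append(e.getValue()).append('\n');
      }
    }
    // 2. Compress each column on its own.
    Map<String, byte[]> compressed = new LinkedHashMap<>();
    for (Map.Entry<String, StringBuilder> e : columns.entrySet()) {
      byte[] raw = e.getValue().toString().getBytes(StandardCharsets.UTF_8);
      compressed.put(e.getKey(), deflate(raw));
    }
    return compressed;
  }

  private static byte[] deflate(byte[] input) {
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    deflater.setInput(input);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    return out.toByteArray();
  }
}
{code}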





On Sun, Sep 15, 2013 at 11:17 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Hi,

 I was reading the Parquet announcement from July:

 https://blog.twitter.com/2013/announcing-parquet-10-columnar-storage-for-hadoop

 And a few things caught my attention - Dictionary encoding and
 (dynamic) bit packing.  This smells like something Adrien likes to eat
 for breakfast.

 Over in the Hadoop ecosystem Parquet interest has picked up:
 http://search-hadoop.com/?q=parquet

 I thought I'd point it out as I haven't seen anyone bring this up.  I
 imagine there are ideas to be borrowed there.

 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-24 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718785#comment-13718785
 ] 

Eks Dev commented on SOLR-5069:
---

Wow, this is getting pretty close to collection clustering and other candies; 
somehow plug in Mahout and it's there.

Great job and great direction for Solr. End applications not only need to find 
things, they often want to do something with them as well :)

Thanks!   

 MapReduce for SolrCloud
 ---

 Key: SOLR-5069
 URL: https://issues.apache.org/jira/browse/SOLR-5069
 Project: Solr
  Issue Type: New Feature
  Components: SolrCloud
Reporter: Noble Paul
Assignee: Noble Paul

 Solr currently does not have a way to run long-running computational tasks 
 across the cluster. We can piggyback on the MapReduce paradigm so that users 
 have a smooth learning curve.
  * The mapreduce component will be written as a RequestHandler in Solr
  * Works only in SolrCloud mode. (No support for standalone mode) 
  * Users can write MapReduce programs in Javascript or Java. First cut would 
 be JS ( ? )
 h1. sample word count program
 h2. how to invoke?
 http://host:port/solr/collection-x/mapreduce?map=map-script&reduce=reduce-script&sink=collectionX
 h3. params 
 * map :  A javascript implementation of the map program
 * reduce : a Javascript implementation of the reduce program
 * sink : The collection to which the output is written. If no sink is passed, 
 the request will wait till completion and respond with the output of the 
 reduce program, emitted as a standard Solr response; the request is redirected 
 to the reduce node, where it waits till the process is complete. If the sink 
 param is passed, the response will contain an id of the run which can be used 
 to query the status in another command.
 * reduceNode : Node name where the reduce is run. If not passed, an arbitrary 
 node is chosen.
 The node which received the command would first identify one replica from 
 each slice where the map program is executed. It will also identify another 
 node from the same collection where the reduce program is run. Each 
 run is given an id, and the details of the nodes participating in the run will 
 be written to ZK (as an ephemeral node). 
 h4. map script 
 {code:JavaScript}
 var res = $.streamQuery("*:*"); // this is not run across the cluster, only on this index
 while(res.hasMore()){
   var doc = res.next();
   var txt = doc.get("txt"); // the field on which word count is performed
   var words = txt.split(" ");
   for(i = 0; i < words.length; i++){
     $.map(words[i], {'count':1}); // this will send the map over to the reduce host
   }
 }
 {code}
 Essentially two threads are created in the 'map' hosts: one for running the 
 program and the other for coordinating with the 'reduce' host. The maps 
 emitted are streamed live over an http connection to the reduce program.
 h4. reduce script
 This script is run on one node. This node accepts http connections from map 
 nodes, and the 'maps' that are sent are collected in a queue which will be 
 polled and fed into the reduce program. This also keeps the 'reduced' data in 
 memory till the whole run is complete. It expects a done message from all 
 'map' nodes before it declares the tasks are complete. After the reduce program 
 is executed for all the input, it proceeds to write out the result to the 
 'sink' collection, or it is written straight out to the response.
 {code:JavaScript}
 var pair = $.nextMap();
 var reduced = $.getCtx().getReducedMap();// a hashmap
 var count = reduced.get(pair.key());
 if(count === null) {
   count = {"count":0};
   reduced.put(pair.key(), count);
 }
 count.count += pair.val().count;
 {code}
 h4.example output
 {code:JavaScript}
 {
   "result": [
     {"wordx": { "count": 15876765 }},
     {"wordy": { "count": 24657654 }}
   ]
 }
 {code}
 TBD
 * The format in which the output is written to the target collection, I 
 assume the reducedMap will have values mapping to the schema of the collection
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4872) BooleanWeight should decide how to execute minNrShouldMatch

2013-03-27 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615006#comment-13615006
 ] 

Eks Dev commented on LUCENE-4872:
-

The same pattern as Simon here, just having these terms wrapped in 
fuzzy/prefix queries, often as a dismax query. 

For example:
BQ(boo* OR hoo* OR whatever) with e.g. minShouldMatch = 2  

So the only diff to Simon's case is that the single boolean clauses are often more 
complicated than a simple TermQuery. 
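
For a concrete picture of that query shape, a sketch against the Lucene 4.x API 
(the field name and prefixes are made up):

{code:Java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PrefixQuery;

// BQ(boo* OR hoo* OR what*) with minShouldMatch = 2, where each clause is a
// multi-term (prefix) query rather than a plain TermQuery.
public class MinShouldMatchExample {
  public static BooleanQuery build() {
    BooleanQuery bq = new BooleanQuery();
    bq.add(new PrefixQuery(new Term("body", "boo")), Occur.SHOULD);
    bq.add(new PrefixQuery(new Term("body", "hoo")), Occur.SHOULD);
    bq.add(new PrefixQuery(new Term("body", "what")), Occur.SHOULD);
    bq.setMinimumNumberShouldMatch(2); // at least two SHOULD clauses must match
    return bq;
  }
}
{code}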


 BooleanWeight should decide how to execute minNrShouldMatch
 ---

 Key: LUCENE-4872
 URL: https://issues.apache.org/jira/browse/LUCENE-4872
 Project: Lucene - Core
  Issue Type: Sub-task
  Components: core/search
Reporter: Robert Muir
 Fix For: 5.0, 4.3

 Attachments: crazyMinShouldMatch.tasks


 LUCENE-4571 adds a dedicated document-at-time scorer for minNrShouldMatch 
 which can use advance() behind the scenes. 
 In cases where you have some really common terms and some rare ones this can 
 be a huge performance improvement.
 On the other hand BooleanScorer might still be faster in some cases.
 We should think about what the logic should be here: one simple thing to do 
 is to always use the new scorer when minShouldMatch is set: that's where I'm 
 leaning. 
 But maybe we could have a smarter heuristic too, perhaps based on cost()

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3918) Port index sorter to trunk APIs

2013-02-04 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570663#comment-13570663
 ] 

Eks Dev commented on LUCENE-3918:
-

This is the right way to give some really good meaning to the venerable optimize 
call :)

We were, and are, sorting our data before indexing just to achieve exactly this: 
an improvement in locality of reference. Depending on the data (it has to be 
somehow sortable, e.g. hierarchical structure, on url...), the speedup (and 
likely the compression Adrien made) gains are sometimes unbelievable...  


 Port index sorter to trunk APIs
 ---

 Key: LUCENE-3918
 URL: https://issues.apache.org/jira/browse/LUCENE-3918
 Project: Lucene - Core
  Issue Type: Task
  Components: modules/other
Affects Versions: 4.0-ALPHA
Reporter: Robert Muir
 Fix For: 4.2, 5.0

 Attachments: LUCENE-3918.patch


 LUCENE-2482 added an IndexSorter to 3.x, but we need to port this
 functionality to 4.0 apis.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4117) IO error while trying to get the size of the Directory

2012-11-28 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505530#comment-13505530
 ] 

Eks Dev commented on SOLR-4117:
---

FWIW, we *think* we observed the following problem in a simple master-slave setup 
with NRTCachingDirectory... I am not sure it has something to do with this issue, 
because we did not see this exception, anyhow:

On replication, the slave gets the index from the master and works fine; then on:
1. graceful restart, the world looks fine 
2. kill -9 or such, Solr does not start because the index gets corrupt (which 
should actually not happen)

We speculate that Solr now does replication directly to the Directory 
implementation and does not ensure that replicated files get fsck-ed completely 
after replication. As far as I remember, replication used to go to /temp (disk) 
and then move the files if all went ok, working under the assumption that 
everything is already persisted. Maybe this invariant does not hold any more and 
some explicit fsck is needed for caching directories? 

I might be completely wrong; we just observed the symptoms in a not really 
debug-friendly environment.



 

 IO error while trying to get the size of the Directory
 --

 Key: SOLR-4117
 URL: https://issues.apache.org/jira/browse/SOLR-4117
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 5.0
 Environment: 5.0.0.2012.11.28.10.42.06
 Debian Squeeze, Tomcat 6, Sun Java 6, 10 nodes, 10 shards, rep. factor 2.
Reporter: Markus Jelsma
Assignee: Mark Miller
Priority: Minor
 Fix For: 5.0


 With SOLR-4032 fixed we see other issues when randomly taking down nodes 
 (nicely via tomcat restart) while indexing a few million web pages from 
 Hadoop. We do make sure that at least one node is up for a shard but due to 
 recovery issues it may not be live.
 One node seems to work but generates IO errors in the log and 
 a ZooKeeperException in the GUI. In the GUI we only see:
 {code}
 SolrCore Initialization Failures
 openindex_f: 
 org.apache.solr.common.cloud.ZooKeeperException:org.apache.solr.common.cloud.ZooKeeperException:
  
 Please check your logs for more information
 {code}
 and in the log we only see the following exception:
 {code}
 2012-11-28 11:47:26,652 ERROR [solr.handler.ReplicationHandler] - 
 [http-8080-exec-28] - : IO error while trying to get the size of the 
 Directory:org.apache.lucene.store.NoSuchDirectoryException: directory 
 '/opt/solr/cores/shard_f/data/index' does not exist
 at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:217)
 at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:240)
 at 
 org.apache.lucene.store.NRTCachingDirectory.listAll(NRTCachingDirectory.java:132)
 at 
 org.apache.solr.core.DirectoryFactory.sizeOfDirectory(DirectoryFactory.java:146)
 at 
 org.apache.solr.handler.ReplicationHandler.getIndexSize(ReplicationHandler.java:472)
 at 
 org.apache.solr.handler.ReplicationHandler.getReplicationDetails(ReplicationHandler.java:568)
 at 
 org.apache.solr.handler.ReplicationHandler.handleRequestBody(ReplicationHandler.java:213)
 at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
 at 
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1830)
 at 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:476)
 at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
 at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
 at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
 at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
 at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
 at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
 at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
 at 
 org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
 at 
 org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
 at 
 org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274)
 at 
 java.util.concurrent.ThreadPoolExecutor

[jira] [Comment Edited] (SOLR-4117) IO error while trying to get the size of the Directory

2012-11-28 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505530#comment-13505530
 ] 

Eks Dev edited comment on SOLR-4117 at 11/28/12 3:27 PM:
-

FWIW, we *think* we observed the following problem in a simple master-slave setup 
with NRTCachingDirectory... I am not sure it has something to do with this issue, 
because we did not see this exception, anyhow:

On replication, the slave gets the index from the master and works fine; then on:
1. graceful restart, the world looks fine 
2. kill -9 or such, Solr does not start because the index gets corrupt (which 
should actually not happen)

We speculate that Solr now does replication directly to the Directory 
implementation and does not ensure that replicated files get fsck-ed completely 
after replication. As far as I remember, replication used to go to /temp (disk) 
and then move the files if all went ok, working under the assumption that 
everything is already persisted. Maybe this invariant does not hold any more and 
some explicit fsck is needed for caching directories? 

I might be completely wrong; we just observed the symptoms in a not really 
debug-friendly environment.

Here is the exception after a hard restart:

Caused by: org.apache.solr.common.SolrException: Error opening new searcher
   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:804)
   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:618)
   at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:973)
   at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1003)
   ... 10 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
   at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1441)
   at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1553)
   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:779)
   ... 13 more
Caused by: java.io.FileNotFoundException: ...\core0\data\index\segments_1 (The 
system cannot find the file specified)
   at java.io.RandomAccessFile.open(Native Method)
   at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
   at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:222)
   at 
org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:232)
   at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:281)
   at 
org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:56)
   at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:668)
   at 
org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
   at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:87)
   at 
org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:34)
   at 
org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:120)
   at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1417)

 

  was (Author: eksdev):
FWIW, we *think* we observed the following problem in a simple master-slave 
setup with NRTCachingDirectory... I am not sure it has something to do with this 
issue, because we did not see this exception, anyhow:

On replication, the slave gets the index from the master and works fine; then on:
1. graceful restart, the world looks fine 
2. kill -9 or such, Solr does not start because the index gets corrupt (which 
should actually not happen)

We speculate that Solr now does replication directly to the Directory 
implementation and does not ensure that replicated files get fsck-ed completely 
after replication. As far as I remember, replication used to go to /temp (disk) 
and then move the files if all went ok, working under the assumption that 
everything is already persisted. Maybe this invariant does not hold any more and 
some explicit fsck is needed for caching directories? 

I might be completely wrong; we just observed the symptoms in a not really 
debug-friendly environment.



 
  
 IO error while trying to get the size of the Directory
 --

 Key: SOLR-4117
 URL: https://issues.apache.org/jira/browse/SOLR-4117
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 5.0
 Environment: 5.0.0.2012.11.28.10.42.06
 Debian Squeeze, Tomcat 6, Sun Java 6, 10 nodes, 10 shards, rep. factor 2.
Reporter: Markus Jelsma
Assignee: Mark Miller
Priority: Minor
 Fix For: 5.0


 With SOLR-4032 fixed we see other issues when randomly taking down nodes 
 (nicely via tomcat restart) while indexing a few million web pages from 
 Hadoop. We do make sure that at least one node is up for a shard but due to 
 recovery issues it may not be live.
 One node seems to work but generates IO errors in the log and 
 a ZooKeeperException in the GUI

[jira] [Commented] (SOLR-4117) IO error while trying to get the size of the Directory

2012-11-28 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1350#comment-1350
 ] 

Eks Dev commented on SOLR-4117:
---

fsync of course, fsck was intended for my terminal window :) 

 IO error while trying to get the size of the Directory
 --

 Key: SOLR-4117
 URL: https://issues.apache.org/jira/browse/SOLR-4117
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 5.0
 Environment: 5.0.0.2012.11.28.10.42.06
 Debian Squeeze, Tomcat 6, Sun Java 6, 10 nodes, 10 shards, rep. factor 2.
Reporter: Markus Jelsma
Assignee: Mark Miller
Priority: Minor
 Fix For: 5.0


 With SOLR-4032 fixed we see other issues when randomly taking down nodes 
 (nicely via tomcat restart) while indexing a few million web pages from 
 Hadoop. We do make sure that at least one node is up for a shard but due to 
 recovery issues it may not be live.
 One node seems to work but generates IO errors in the log and 
 a ZooKeeperException in the GUI. In the GUI we only see:
 {code}
 SolrCore Initialization Failures
 openindex_f: 
 org.apache.solr.common.cloud.ZooKeeperException:org.apache.solr.common.cloud.ZooKeeperException:
  
 Please check your logs for more information
 {code}
 and in the log we only see the following exception:
 {code}
 2012-11-28 11:47:26,652 ERROR [solr.handler.ReplicationHandler] - 
 [http-8080-exec-28] - : IO error while trying to get the size of the 
 Directory:org.apache.lucene.store.NoSuchDirectoryException: directory 
 '/opt/solr/cores/shard_f/data/index' does not exist
 at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:217)
 at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:240)
 at 
 org.apache.lucene.store.NRTCachingDirectory.listAll(NRTCachingDirectory.java:132)
 at 
 org.apache.solr.core.DirectoryFactory.sizeOfDirectory(DirectoryFactory.java:146)
 at 
 org.apache.solr.handler.ReplicationHandler.getIndexSize(ReplicationHandler.java:472)
 at 
 org.apache.solr.handler.ReplicationHandler.getReplicationDetails(ReplicationHandler.java:568)
 at 
 org.apache.solr.handler.ReplicationHandler.handleRequestBody(ReplicationHandler.java:213)
 at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
 at 
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1830)
 at 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:476)
 at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
 at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
 at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
 at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
 at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
 at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
 at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
 at 
 org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
 at 
 org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
 at 
 org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4032) Unable to replicate between nodes ( read past EOF)

2012-11-27 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504859#comment-13504859
 ] 

Eks Dev commented on SOLR-4032:
---

We see it as well.

It looks like it only happens with NRTCachingDirectory, but take this statement 
with healthy suspicion; it only went ok once without NRTCachingDirectory. 




 Unable to replicate between nodes ( read past EOF)
 --

 Key: SOLR-4032
 URL: https://issues.apache.org/jira/browse/SOLR-4032
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 4.0
 Environment: 5.0-SNAPSHOT 1366361:1404534M - markus - 2012-11-01 
 12:37:38
 Debian Squeeze, Tomcat 6, Sun Java 6, 10 nodes, 10 shards, rep. factor 2.
Reporter: Markus Jelsma
Assignee: Mark Miller
 Fix For: 4.1, 5.0


 Please see: 
 http://lucene.472066.n3.nabble.com/trunk-is-unable-to-replicate-between-nodes-Unable-to-download-completely-td4017049.html
  and 
 http://lucene.472066.n3.nabble.com/Possible-memory-leak-in-recovery-td4017833.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4548) BooleanFilter should optionally pass down further restricted acceptDocs in the MUST case (and acceptDocs in general)

2012-11-11 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494858#comment-13494858
 ] 

Eks Dev commented on LUCENE-4548:
-

...would be to nuke Filters completely from Lucene ...

User +1

A Filter is conceptually nothing more than no scoring plus the possibility to 
have an implementation that can be cached. 

From the user-API point of view, there is really no need to bother users with 
the Filter abstraction. Both of these are just attributes of the query (do you 
need to score this clause, and would you like to have it cached?). 
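
To make the "just attributes of the query" point concrete, a sketch using the 
existing Lucene 4.x classes (the query contents are arbitrary):

{code:Java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.FilteredQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;

// The "filter" part is just a clause that (a) does not contribute to the score
// and (b) can be cached across searches, expressed here with existing classes.
public class FilterAsAttributes {
  public static Query build() {
    Query main = new TermQuery(new Term("body", "lucene"));        // scored part
    Filter restriction = new CachingWrapperFilter(                  // cached, not scored
        new QueryWrapperFilter(new TermQuery(new Term("status", "active"))));
    return new FilteredQuery(main, restriction);
  }
}
{code}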

 BooleanFilter should optionally pass down further restricted acceptDocs in 
 the MUST case (and acceptDocs in general)
 

 Key: LUCENE-4548
 URL: https://issues.apache.org/jira/browse/LUCENE-4548
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Uwe Schindler
 Attachments: LUCENE-4548.patch


 Spin-off from dev@lao:
 {quote}
 bq. I am about to write a Filter that only operates on a set of documents 
 that have already passed other filter(s).  It's rather expensive, since it 
 has to use DocValues to examine a value and then determine if its a match.  
 So it scales O(n) where n is the number of documents it must see.  The 2nd 
 arg of getDocIdSet is Bits acceptDocs.  Unfortunately Bits doesn't have an 
 int iterator but I can deal with that seeing if it extends DocIdSet.
 bq. I'm looking at BooleanFilter which I want to use and I notice that it 
 passes null to filter.getDocIdSet for acceptDocs, and it justifies this with 
 the following comment:
 bq. // we dont pass acceptDocs, we will filter at the end using an additional 
 filter
 the idea of passing the already built bits for the MUST case is a good idea and 
 can be implemented easily.
 The reason why the acceptDocs were not passed down is the new way filters 
 work in Lucene 4.0 and to optimize caching. Because accept docs are the only 
 thing that changes when deletions are applied and filters are required to 
 handle them separately:  whenever something is able to cache (e.g. 
 CachingWrapperFilter), the acceptDocs are not cached, so the underlying 
 filters get a null acceptDocs to produce the full bitset and the filtering is 
 done when CachingWrapperFilter gets the “uptodate” acceptDocs. But for this 
 case this does not matter if the first filter clause does not get acceptdocs, 
 but later MUST clauses of course can get them (they are not 
 deletion-specific)!
 Can you open issue to optimize the MUST case (possibly MUST_NOT, too)?
 Another thing that could help here: You can stop using BooleanFilter if you 
 can apply the filters sequentially (only MUST clauses) by wrapping with 
 multiple FilteredQuery: new FilteredQuery(new FilteredQuery(originalQuery, 
 clause1), clause2). If the DocIdSets enable bits() and the FilteredQuery 
 autodetection decides to use random access filters, the acceptdocs are also 
 passed down from the outside to the inner, removing the documents filtered 
 out.
 {quote}
 Maybe BooleanFilter should have 2 modes (Boolean ctor argument): Passing down 
 the acceptDocs to every filter (for the case where Filter calculation is 
 expensive and accept docs help to limit the calculations) or not passing down 
 (if the filter is cheap and the multiple acceptDocs bit checks for every 
 single filter is more expensive – which is then more effective, e.g. when the 
 Filter is only a cached bitset). The first mode would also optimize the 
 MUST/MUST_NOT case to pass down the further restricted acceptDocs on later 
 filters (just like FilteredQuery does).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-08-29 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443897#comment-13443897
 ] 

Eks Dev commented on LUCENE-4226:
-

bq. but I removed the ability to select the compression algorithm on a 
per-field basis in order to make the patch simpler and to handle cross-field 
compression.

Maybe it is worth keeping it there for really short fields. Those general 
compression algorithms are great for bigger amounts of data, but for really 
short fields there is nothing like per-field compression.   
Think about database usage, e.g. fields with low cardinality, or fields with a 
restricted symbol set (only digits in a long UID field, for example). Say a zip 
code or a product color... these are perfectly compressed using a static 
dictionary approach (a static Huffman coder with escape symbols, at bit level, 
or a plain vanilla dictionary lookup), and both of them are insanely fast and 
compress heavily. 

Even a trivial utility for users is easily doable: index the data without 
compression, get the frequencies from the term dictionary, estimate e.g. a 
static Huffman code table and reindex with this dictionary. 
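
A sketch of the frequency-gathering half of that utility (the term-dictionary 
calls are the Lucene 4.x API; building the actual Huffman or dictionary code 
table from the returned map is left to the user):

{code:Java}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Collect symbol frequencies for one field straight from the term dictionary;
// a static code table can then be estimated offline and used on the next reindex.
public class FieldFrequencies {
  public static Map<String, Long> collect(IndexReader reader, String field) throws IOException {
    Map<String, Long> freqs = new HashMap<>();
    Terms terms = MultiFields.getTerms(reader, field);
    if (terms == null) {
      return freqs;
    }
    TermsEnum te = terms.iterator(null);           // Lucene 4.x signature
    for (BytesRef term = te.next(); term != null; term = te.next()) {
      freqs.put(term.utf8ToString(), te.totalTermFreq());
    }
    return freqs;
  }
}
{code}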


 Efficient compression of small to medium stored fields
 --

 Key: LUCENE-4226
 URL: https://issues.apache.org/jira/browse/LUCENE-4226
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: Adrien Grand
Priority: Trivial
 Attachments: CompressionBenchmark.java, CompressionBenchmark.java, 
 LUCENE-4226.patch, LUCENE-4226.patch, SnappyCompressionAlgorithm.java


 I've been doing some experiments with stored fields lately. It is very common 
 for an index with stored fields enabled to have most of its space used by the 
 .fdt index file. To prevent this .fdt file from growing too much, one option 
 is to compress stored fields. Although compression works rather well for 
 large fields, this is not the case for small fields and the compression ratio 
 can be very close to 100%, even with efficient compression algorithms.
 In order to improve the compression ratio for small fields, I've written a 
 {{StoredFieldsFormat}} that compresses several documents in a single chunk of 
 data. To see how it behaves in terms of document deserialization speed and 
 compression ratio, I've run several tests with different index compression 
 strategies on 100,000 docs from Mike's 1K Wikipedia articles (title and text 
 were indexed and stored):
  - no compression,
  - docs compressed with deflate (compression level = 1),
  - docs compressed with deflate (compression level = 9),
  - docs compressed with Snappy,
  - using the compressing {{StoredFieldsFormat}} with deflate (level = 1) and 
 chunks of 6 docs,
  - using the compressing {{StoredFieldsFormat}} with deflate (level = 9) and 
 chunks of 6 docs,
  - using the compressing {{StoredFieldsFormat}} with Snappy and chunks of 6 
 docs.
 For those who don't know Snappy, it is a compression algorithm from Google 
 which does not reach very high compression ratios, but compresses and 
 decompresses data very quickly.
 {noformat}
 Format            Compression ratio   IndexReader.document time
 
 uncompressed      100%                100%
 doc/deflate 1      59%                616%
 doc/deflate 9      58%                595%
 doc/snappy         80%                129%
 index/deflate 1    49%                966%
 index/deflate 9    46%                938%
 index/snappy       65%                264%
 {noformat}
 (doc = doc-level compression, index = index-level compression)
 I find it interesting because it allows trading speed for space (with 
 deflate, the .fdt file shrinks by a factor of 2, much better than with 
 doc-level compression). One other interesting thing is that {{index/snappy}} 
 is almost as compact as {{doc/deflate}} while it is more than 2x faster at 
 retrieving documents from disk.
 These tests have been done on a hot OS cache, which is the worst case for 
 compressed fields (one can expect better results for formats that have a high 
 compression ratio since they probably require fewer read/write operations 
 from disk).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3684) Frequently full gc while do pressure index

2012-08-07 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429985#comment-13429985
 ] 

Eks Dev commented on SOLR-3684:
---

We did it a long time ago on Tomcat; as we use particularly expensive 
analyzers, even for searching the optimum is around the number of cores. 
Actually, that was the only big problem with Solr we had.  
 
Actually, anything that keeps insane thread churn low helps. Not only the max 
number of threads, but also the TTL for idle threads should somehow be 
increased. The longer threads live, the better. Solr is completely safe due to 
core reloading and smart index management; there is no point in renewing threads.   

If one needs to queue requests, that is just another problem, but for this 
there is no need to up max worker threads to more than the number of cores plus 
some smallish constant.

What we would like to achieve is to keep separate thread pools for searching, 
indexing and the rest... but we never managed to figure out how to do it. 
Even benign /ping, /status and whatever requests increase thread churn... If we 
were able to configure separate pools, we could keep a small number of 
long-living threads for searching, an even smaller number for indexing and one 
who-cares pool for the rest. It is somehow possible on Tomcat; if someone 
knows how to do it, please share. 

 Frequently full gc while do pressure index
 --

 Key: SOLR-3684
 URL: https://issues.apache.org/jira/browse/SOLR-3684
 Project: Solr
  Issue Type: Improvement
  Components: multicore
Affects Versions: 4.0-ALPHA
 Environment: System: Linux
 Java process: 4G memory
 Jetty: 1000 threads 
 Index: 20 field
 Core: 5
Reporter: Raintung Li
Priority: Critical
  Labels: garbage, performance
 Fix For: 4.0

 Attachments: patch.txt

   Original Estimate: 168h
  Remaining Estimate: 168h

 Recently we tested the Solr index throughput and performance, configured 20 
 fields for the test (the field type is the normal text_general), started 1000 
 threads for Jetty, and defined 5 cores.
 After the test had run for some time, the Solr process throughput dropped very 
 quickly. Checking the root cause, we found the Java process constantly doing 
 full GC. 
 Checking the heap dump, the main object is StandardTokenizer; it is saved in 
 the CloseableThreadLocal by IndexSchema.SolrIndexAnalyzer.
 Solr uses the PerFieldReuseStrategy as the default reuse 
 component strategy; that means each field has its own StandardTokenizer if it 
 uses the standard analyzer, and a StandardTokenizer occupies 32KB of memory 
 because of its zzBuffer char array.
 The worst case: Total memory = live threads*cores*fields*32KB
 In the test case, the memory is 1000*5*20*32KB = 3.2G for StandardTokenizer, 
 and those objects can only be released when the thread dies.
 Suggestion:
 Every request is handled by only one thread, which means one document is only 
 analysed by one thread. That thread parses the document's fields step 
 by step, so fields of the same type can use the same reused component. When the 
 thread switches to another field of the same type, it only resets the input 
 stream of the same component; this can save a lot of memory for fields of the 
 same type.
 Total memory will be = live threads*cores*(different field types)*32KB
 The source code modification is simple; I can provide the modification 
 patch for IndexSchema.java: 
 private class SolrIndexAnalyzer extends AnalyzerWrapper {
 
   private class SolrFieldReuseStrategy extends ReuseStrategy {
     /**
      * {@inheritDoc}
      */
     @SuppressWarnings("unchecked")
     public TokenStreamComponents getReusableComponents(String fieldName) {
       Map<Analyzer, TokenStreamComponents> componentsPerField =
           (Map<Analyzer, TokenStreamComponents>) getStoredValue();
       return componentsPerField != null ?
           componentsPerField.get(analyzers.get(fieldName)) : null;
     }
     /**
      * {@inheritDoc}
      */
     @SuppressWarnings("unchecked")
     public void setReusableComponents(String fieldName,
         TokenStreamComponents components) {
       Map<Analyzer, TokenStreamComponents> componentsPerField =
           (Map<Analyzer, TokenStreamComponents>) getStoredValue();
       if (componentsPerField == null) {
         componentsPerField = new HashMap<Analyzer, TokenStreamComponents>();
         setStoredValue(componentsPerField);
       }
       componentsPerField.put(analyzers.get(fieldName), components);
     }
   }
 
   protected final static HashMap<String, Analyzer> analyzers;
   /**
    * Implementation of {@link ReuseStrategy} that reuses components per-field by
    * maintaining a Map

[jira] [Commented] (LUCENE-3312) Break out StorableField from IndexableField

2012-06-01 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287213#comment-13287213
 ] 

Eks Dev commented on LUCENE-3312:
-

bq. My assumption is that StoredField-s will never be used anymore as potential 
sources of token streams?

One case where it might make sense is scenarios where a user wants to store the 
analyzed field (not the original) and later read it back as a TokenStream. Kind 
of a TermVector without tf. I think I remember seeing a great patch with an 
indexable-storable field (with serialization and deserialization).

A user can do it in two passes, but sometimes it is not cheap to analyze twice.
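
A rough sketch of the single-pass variant (plain Lucene analysis API; keeping 
the tokens in a list and storing e.g. a whitespace-joined form is just one 
made-up option, not what the patch did):

{code:Java}
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Analyze once, keep the tokens: index them and also store a serialized form
// (e.g. whitespace-joined) that can later be re-tokenized cheaply.
public class AnalyzeOnce {
  public static List<String> analyze(Analyzer analyzer, String field, String text)
      throws IOException {
    List<String> tokens = new ArrayList<>();
    TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      tokens.add(termAtt.toString());
    }
    ts.end();
    ts.close();
    return tokens;   // index these, and store e.g. the tokens joined with spaces
  }
}
{code}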



 Break out StorableField from IndexableField
 ---

 Key: LUCENE-3312
 URL: https://issues.apache.org/jira/browse/LUCENE-3312
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Reporter: Michael McCandless
Assignee: Nikola Tankovic
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: Field Type branch

 Attachments: lucene-3312-patch-01.patch, lucene-3312-patch-02.patch, 
 lucene-3312-patch-03.patch, lucene-3312-patch-04.patch


 In the field type branch we have strongly decoupled
 Document/Field/FieldType impl from the indexer, by having only a
 narrow API (IndexableField) passed to IndexWriter.  This frees apps up to
 use their own documents instead of the user-space impls we provide
 in oal.document.
 Similarly, with LUCENE-3309, we've done the same thing on the
 doc/field retrieval side (from IndexReader), with the
 StoredFieldsVisitor.
 But, maybe we should break out StorableField from IndexableField,
 such that when you index a doc you provide two Iterables -- one for the
 IndexableFields and one for the StorableFields.  Either can be null.
 One downside is a possible perf hit for fields that are both indexed &
 stored (ie, we visit them twice, lookup their name in a hash twice,
 etc.).  But the upside is a cleaner separation of concerns in API

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3846) Fuzzy suggester

2012-03-05 Thread Eks Dev (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1328#comment-1328
 ] 

Eks Dev commented on LUCENE-3846:
-

Robert, 
I am not talking from some abstract-theoretical point of view; I base this on my 
own experience with nontrivial Lucene datasets that are unfortunately not for 
sharing. Having the possibility to train cost matrices per edit operation brings 
a lot, but you may have had another experience (different problems, different 
data...).  

Without specifying a concrete task (annotated data), there is no notion of 
"better", so this argument simply does not help ("show me it is better", "no, 
you show me an all-ones matrix is better than any other", no, no...). It is 
simply about the experience we made in the past, different opinions.   

I personally would not try this argument with molecular biology teams, and tell 
them their PAM and BLOSUM matrices are worthless, or to someone in the record 
linkage community (Lucene was used in this context a lot) or ... 




 Fuzzy suggester
 ---

 Key: LUCENE-3846
 URL: https://issues.apache.org/jira/browse/LUCENE-3846
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.6, 4.0

 Attachments: LUCENE-3846.patch


 Would be nice to have a suggester that can handle some fuzziness (like spell 
 correction) so that it's able to suggest completions that are near what you 
 typed.
 As a first go at this, I implemented 1T (ie up to 1 edit, including a 
 transposition), except the first letter must be correct.
 But there is a penalty, ie, the corrected suggestion needs to have a much 
 higher freq than the exact match suggestion before it can compete.
 Still tons of nocommits, and somehow we should merge this / make it work with 
 analyzing suggester too (LUCENE-3842).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [jira] [Commented] (LUCENE-3846) Fuzzy suggester

2012-03-05 Thread eks dev
For that matter, I am worried not to offend anyone; I am just that type of person :)

But expressing one's opinion, just as we did here, has nothing to do with
it. Hope you did not read my comments as offending; this was by no
means my intention.

Just do not complain later, I warned you: molecular biologists can be
mean if you touch their matrices :)


(I took it to the mailing list, as it only adds noise to Jira)


On Mon, Mar 5, 2012 at 1:38 PM, Robert Muir (Commented) (JIRA)
j...@apache.org wrote:

    [ 
 https://issues.apache.org/jira/browse/LUCENE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222303#comment-13222303
  ]

 Robert Muir commented on LUCENE-3846:
 -

 {quote}
 I personally would not try this argument with molecular biology teams, and 
 tell them their PAM and BLOSUM matrices are worthless, or to someone in the record 
 linkage community (Lucene was used in this context a lot) or ...
 {quote}

 Thats ok, I would :)

 I don't think we should complicate already-complicated things unless there is 
 some clear benefit.

 I'm not worried about offending anyone.


 Fuzzy suggester
 ---

                 Key: LUCENE-3846
                 URL: https://issues.apache.org/jira/browse/LUCENE-3846
             Project: Lucene - Java
          Issue Type: Improvement
            Reporter: Michael McCandless
            Assignee: Michael McCandless
             Fix For: 3.6, 4.0

         Attachments: LUCENE-3846.patch


 Would be nice to have a suggester that can handle some fuzziness (like spell 
 correction) so that it's able to suggest completions that are near what 
 you typed.
 As a first go at this, I implemented 1T (ie up to 1 edit, including a 
 transposition), except the first letter must be correct.
 But there is a penalty, ie, the corrected suggestion needs to have a much 
 higher freq than the exact match suggestion before it can compete.
 Still tons of nocommits, and somehow we should merge this / make it work 
 with analyzing suggester too (LUCENE-3842).

 --
 This message is automatically generated by JIRA.
 If you think it was sent incorrectly, please contact your JIRA 
 administrators: 
 https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
 For more information on JIRA, see: http://www.atlassian.com/software/jira



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3846) Fuzzy suggester

2012-03-04 Thread Eks Dev (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221961#comment-13221961
 ] 

Eks Dev commented on LUCENE-3846:
-

Awesome! FST/A went a long way.

Just a few random thoughts, triggered by "... corrected suggestion needs to 
have a much higher freq than the exact match...". 

Frequency influence is normally slightly more complicated than only "more 
popular", depending on the search task the user is facing. "Only more popular" 
helps if we assume the user types it wrong and our suggestion dictionary is 
always right. But in cases where the user types it correctly, and the collection 
contains errors, you would cut off all such documents with fuzzy. 

What I found works pretty well is considering this problem to be of the nearest 
neighbor type. Namely, the 
task is to find the closest matches to the query; some are more and some less 
popular. Take for example a case where the user types "black dog" and our 
collection contains the document "blaKC dog"; with the frequency of "blakc" much 
lower than "black", "only more popular" would miss this document.

What works pretty well out of the box is comparing the frequency of the query 
word and the candidate to some reasonable cut-off and classifying them into HF/LF 
(high/low frequency) terms. It is based on the fact that typos are normally 
very seldom (if not, they should be treated as synonyms!). So if the user types 
an LF token, the fuzzy candidate is probably HF, and the other way around. 

But as said, it depends on what the task is.


The next level for fuzzy* in Lucene is specifying separate costs for 
inserts/deletes, swaps and transpositions at character (byte) level and 
optionally considering the position of the edit. This brings precision++ if used 
properly, like in: 
- inserting/deleting a silent h should cost less than other letters (thomas vs 
tomas)  
- phonetics: swapping c and k is less evil than the default
- inserting s at the end... bug vs bugs

Apart from that, I see absolutely nothing more one on earth can do better :)
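
For illustration of what "separate costs per edit operation" could look like (a 
generic weighted Damerau-Levenshtein sketch, not Lucene's automaton-based fuzzy 
matching; the concrete costs are arbitrary placeholders):

{code:Java}
// Weighted edit distance with separate costs for insert, delete, substitute and
// adjacent transposition. A real setup would learn or hand-tune the costs.
public class WeightedEditDistance {

  static double insertCost(char c)             { return c == 'h' ? 0.5 : 1.0; }
  static double deleteCost(char c)             { return c == 'h' ? 0.5 : 1.0; }
  static double substituteCost(char a, char b) {
    if (a == b) return 0.0;
    if ((a == 'c' && b == 'k') || (a == 'k' && b == 'c')) return 0.5;
    return 1.0;
  }
  static double transposeCost(char a, char b)  { return 0.8; }

  public static double distance(String s, String t) {
    int n = s.length(), m = t.length();
    double[][] d = new double[n + 1][m + 1];
    for (int i = 1; i <= n; i++) d[i][0] = d[i - 1][0] + deleteCost(s.charAt(i - 1));
    for (int j = 1; j <= m; j++) d[0][j] = d[0][j - 1] + insertCost(t.charAt(j - 1));
    for (int i = 1; i <= n; i++) {
      for (int j = 1; j <= m; j++) {
        double cost = Math.min(
            d[i - 1][j - 1] + substituteCost(s.charAt(i - 1), t.charAt(j - 1)),
            Math.min(d[i - 1][j] + deleteCost(s.charAt(i - 1)),
                     d[i][j - 1] + insertCost(t.charAt(j - 1))));
        if (i > 1 && j > 1
            && s.charAt(i - 1) == t.charAt(j - 2)
            && s.charAt(i - 2) == t.charAt(j - 1)) {
          cost = Math.min(cost,
              d[i - 2][j - 2] + transposeCost(s.charAt(i - 2), s.charAt(i - 1)));
        }
        d[i][j] = cost;
      }
    }
    return d[n][m];  // e.g. distance("black", "blakc") < distance("black", "block")
  }
}
{code}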


Sorry again for just shooting around with wish lists at you guys; my 
time schedule really does not permit any serious work in the form of patches. 

 Fuzzy suggester
 ---

 Key: LUCENE-3846
 URL: https://issues.apache.org/jira/browse/LUCENE-3846
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.6, 4.0

 Attachments: LUCENE-3846.patch


 Would be nice to have a suggester that can handle some fuzziness (like spell 
 correction) so that it's able to suggest completions that are near what you 
 typed.
 As a first go at this, I implemented 1T (ie up to 1 edit, including a 
 transposition), except the first letter must be correct.
 But there is a penalty, ie, the corrected suggestion needs to have a much 
 higher freq than the exact match suggestion before it can compete.
 Still tons of nocommits, and somehow we should merge this / make it work with 
 analyzing suggester too (LUCENE-3842).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org




[jira] [Commented] (LUCENE-3846) Fuzzy suggester

2012-03-04 Thread Eks Dev (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13221974#comment-13221974
 ] 

Eks Dev commented on LUCENE-3846:
-

Sure as hell, re-ranking covers most of the cases. If you are not saturating 
the top-N depth, it works just perfectly, but if you are saturating top-N, you have 
to increase the depth / number of allowed edits, which in turn hurts performance... 
 

{quote}
rather than trying to complicate the actual intersection algorithm 
{quote}

The logic in the intersection algorithm would not have to know anything about the 
language specifics; they would be defined in the cost matrix. But supporting a cost 
matrix per edit operation deep down can be complex. You would simply reduce 
language/domain parametrization to the configuration of costs in the matrix (see the sketch below).
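
A hypothetical shape for that decoupling (these interfaces do not exist in Lucene; the names are mine): the intersection code only ever asks the matrix for a cost, and all language/domain knowledge lives in the implementation.

{code}
// Hypothetical interface: the intersection algorithm stays language-agnostic,
// all domain knowledge is pushed into the cost matrix implementation.
public interface EditCostMatrix {
  double insertCost(char inserted, int position);
  double deleteCost(char deleted, int position);
  double substituteCost(char from, char to, int position);
  double transposeCost(char first, char second, int position);
}

// A trivial implementation equivalent to plain Levenshtein with transpositions.
final class UnitCosts implements EditCostMatrix {
  public double insertCost(char inserted, int position) { return 1.0; }
  public double deleteCost(char deleted, int position) { return 1.0; }
  public double substituteCost(char from, char to, int position) { return from == to ? 0.0 : 1.0; }
  public double transposeCost(char first, char second, int position) { return 1.0; }
}
{code}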

 

 Fuzzy suggester
 ---

 Key: LUCENE-3846
 URL: https://issues.apache.org/jira/browse/LUCENE-3846
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.6, 4.0

 Attachments: LUCENE-3846.patch


 Would be nice to have a suggester that can handle some fuzziness (like spell 
 correction) so that it's able to suggest completions that are near what you 
 typed.
 As a first go at this, I implemented 1T (ie up to 1 edit, including a 
 transposition), except the first letter must be correct.
 But there is a penalty, ie, the corrected suggestion needs to have a much 
 higher freq than the exact match suggestion before it can compete.
 Still tons of nocommits, and somehow we should merge this / make it work with 
 analyzing suggester too (LUCENE-3842).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3846) Fuzzy suggester

2012-03-04 Thread Eks Dev (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13222014#comment-13222014
 ] 

Eks Dev commented on LUCENE-3846:
-

{quote}
feel free to show me evidence they do
{quote}

Even here they help a lot, do not underestimate error model! (as in noisy 
channel, see http://norvig.com/spell-correct.html for a nice overview).

Examples, off the top of my head:
in case you search for Carin in a set {Karin, Marin, Darin} (all valid 
names, at edit distance one), you would prefer to see Karin as the highest (if not the 
only) ranked fuzzy suggestion (close consonants).

Or a discount on swap(vowel, vowel) vs swap(vowel/consonant, consonant): 
mistaking one vowel for another is more probable than mistaking two consonants, 
or a consonant and a vowel (as long as humans type). 

Books scanned using OCR have no problems with phonetics, but other error types...

Context is important: in-word context as part of the error model (character-level 
context, like the previous character), but even more important is the context from 
the language model, which normally dominates. 

I could look for some interesting papers in my archives if you are not 
convinced yet :)
This one is worth reading (http://acl.ldc.upenn.edu/P/P00/P00-1037.pdf); 
it tackles, among other things, exactly this topic. 
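
For illustration, a minimal noisy-channel ranking sketch in the spirit of the Norvig article (my own names, not a Lucene API): the language model supplies P(candidate), the error model supplies P(typed | candidate), and candidates are ranked by the product.

{code}
// Minimal noisy-channel ranking sketch; the 0.01 base is an arbitrary placeholder
// for a real channel model, and editCost could be any weighted edit distance.
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

public final class NoisyChannelRanker {
  private final Map<String, Double> unigramProbs;            // language model: P(candidate)
  private final ToDoubleBiFunction<String, String> editCost; // error model input

  public NoisyChannelRanker(Map<String, Double> unigramProbs,
                            ToDoubleBiFunction<String, String> editCost) {
    this.unigramProbs = unigramProbs;
    this.editCost = editCost;
  }

  /** Crude channel model: each unit of edit cost drops the probability by two orders of magnitude. */
  private double channelProb(String typed, String candidate) {
    return Math.pow(0.01, editCost.applyAsDouble(typed, candidate));
  }

  /** Sort candidates by languageModel * errorModel, best first. */
  public void rank(String typed, List<String> candidates) {
    candidates.sort(Comparator.comparingDouble(
        (String c) -> unigramProbs.getOrDefault(c, 1e-9) * channelProb(typed, c)).reversed());
  }
}
{code}

The editCost function could be the weighted edit distance sketched earlier in this thread.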

{quote}
it's easy to use a custom cost matrix. The cost can also be context-dependent 
too (based on past matched characters, though not [easily] future ones).
{quote}
 
Great to hear that!  
Prefix-based context is the only sub-word-level context I have ever used. I doubt 
lookahead brings much. 


 Fuzzy suggester
 ---

 Key: LUCENE-3846
 URL: https://issues.apache.org/jira/browse/LUCENE-3846
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.6, 4.0

 Attachments: LUCENE-3846.patch


 Would be nice to have a suggester that can handle some fuzziness (like spell 
 correction) so that it's able to suggest completions that are near what you 
 typed.
 As a first go at this, I implemented 1T (ie up to 1 edit, including a 
 transposition), except the first letter must be correct.
 But there is a penalty, ie, the corrected suggestion needs to have a much 
 higher freq than the exact match suggestion before it can compete.
 Still tons of nocommits, and somehow we should merge this / make it work with 
 analyzing suggester too (LUCENE-3842).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3846) Fuzzy suggester

2012-03-04 Thread Eks Dev (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13222022#comment-13222022
 ] 

Eks Dev commented on LUCENE-3846:
-

Sure, give me enough annotated data and I can give you a close-to-optimal cost 
matrix. There are (rather simple) ways to estimate these costs.  
 
Or are you arguing that there is no cost table better than the one filled 
with ones?
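
For what it is worth, one simple way to estimate such costs from annotated (typed, intended) pairs is to count character confusions and turn smoothed relative frequencies into costs; a rough sketch (the names and the normalization are mine):

{code}
// Sketch: estimate substitution costs from annotated (typed, intended) character pairs.
// Frequent confusions get low costs, unseen ones keep the default cost of 1.0.
import java.util.HashMap;
import java.util.Map;

public final class CostEstimator {
  private final Map<String, Integer> confusionCounts = new HashMap<>();
  private int total = 0;

  /** Record one observed confusion, e.g. ('c' typed where 'k' was intended). */
  public void addObservation(char typed, char intended) {
    confusionCounts.merge(typed + "->" + intended, 1, Integer::sum);
    total++;
  }

  public double substitutionCost(char typed, char intended) {
    if (typed == intended) return 0.0;
    int count = confusionCounts.getOrDefault(typed + "->" + intended, 0);
    double pSeen   = (count + 1.0) / (total + 2.0); // add-one smoothed relative frequency
    double pUnseen = 1.0 / (total + 2.0);           // reference: a never-observed confusion
    return Math.log(pSeen) / Math.log(pUnseen);     // 1.0 for unseen, approaching 0 for frequent
  }
}
{code}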

 Fuzzy suggester
 ---

 Key: LUCENE-3846
 URL: https://issues.apache.org/jira/browse/LUCENE-3846
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.6, 4.0

 Attachments: LUCENE-3846.patch


 Would be nice to have a suggester that can handle some fuzziness (like spell 
 correction) so that it's able to suggest completions that are near what you 
 typed.
 As a first go at this, I implemented 1T (ie up to 1 edit, including a 
 transposition), except the first letter must be correct.
 But there is a penalty, ie, the corrected suggestion needs to have a much 
 higher freq than the exact match suggestion before it can compete.
 Still tons of nocommits, and somehow we should merge this / make it work with 
 analyzing suggester too (LUCENE-3842).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3841) CloseableThreadLocal does not work well with Tomcat thread pooling

2012-03-03 Thread Eks Dev (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13221629#comment-13221629
 ] 

Eks Dev commented on LUCENE-3841:
-

This is indeed a problem. Recently we moved to solr on tomcat and we hit it in a 
slightly different form. 

The nature of the problem is the high thread churn on tomcat; when combined 
with expensive analyzers it wreaks gc() havoc (*even without stale 
CloseableThreadLocals from this issue*). We are currently attacking this problem 
by reducing maxThreads and increasing minSpareThreads (also reducing the time to 
forced thread renewal). The goal is to increase the lifetime of threads, and to 
contain them to reasonable limits. I would appreciate any tips in this 
direction.

The problem with this strategy is if some cheap requests, not really related to 
search, saturate the smallish thread pool... I am looking for a way to define 
separate thread pools for search/update requests and one for the rest, as it 
does not make sense to have 100 search threads searching lucene on a dual-core 
box. Not really experienced with tomcat... 

Of course, keeping Analyzer creation cheap helps (e.g. make the expensive 
background structures thread-safe so they can be shared, with only a thin Analyzer 
using them; see the sketch below). But this is not always easy.
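
A minimal sketch of that pattern (illustration only, no Lucene classes): the heavy, read-only structure is built once and shared across threads, while the per-request object only holds a reference to it, so thread churn does not force rebuilding the expensive part.

{code}
// The expensive, shared part: built once, immutable, safe to share between threads.
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public final class SharedDictionary {
  private final Map<String, String> entries;

  private SharedDictionary(Map<String, String> entries) {
    this.entries = Collections.unmodifiableMap(entries);
  }

  public static SharedDictionary loadOnce() {
    Map<String, String> m = new HashMap<String, String>();
    // ... expensive loading happens exactly once, at startup ...
    return new SharedDictionary(m);
  }

  public String lookup(String token) {
    return entries.get(token);
  }
}

// The per-request "thin" part: cheap to create and to discard with the thread.
final class ThinTokenProcessor {
  private final SharedDictionary dict;

  ThinTokenProcessor(SharedDictionary dict) {
    this.dict = dict;
  }

  String normalize(String token) {
    String mapped = dict.lookup(token);
    return mapped != null ? mapped : token;
  }
}
{code}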

Just sharing experience here, maybe someone finds it helpful. Hints always 
welcome :)


 

 CloseableThreadLocal does not work well with Tomcat thread pooling
 --

 Key: LUCENE-3841
 URL: https://issues.apache.org/jira/browse/LUCENE-3841
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/other
Affects Versions: 3.5
 Environment: Lucene/Tika/Snowball running in a Tomcat web application
Reporter: Matthew Bellew

 We tracked down a large memory leak (effectively a leak anyway) caused
 by how Analyzer uses CloseableThreadLocal.
 CloseableThreadLocal.hardRefs holds references to Thread objects as
 keys.  The problem is that it only frees these references in the set()
 method, and SnowballAnalyzer will only call set() when it is used by a
 NEW thread.
 The problem scenario is as follows:
 The server experiences a spike in usage (say by robots or whatever)
 and many threads are created and referenced by
 CloseableThreadLocal.hardRefs.  The server quiesces and lets many of
 these threads expire normally.  Now we have a smaller, but adequate
 thread pool.  So CloseableThreadLocal.set() may not be called by
 SnowBallAnalyzer (via Analyzer) for a _long_ time.  The purge code is
 never called, and these threads along with their thread local storage
 (lucene related or not) is never cleaned up.
 I think calling the purge code in both get() and set() would have
 avoided this problem, but is potentially expensive.  Perhaps using 
 WeakHashMap instead of HashMap may also have helped.  WeakHashMap 
 purges on get() and set().  So this might be an efficient way to
 clean up threads in get(), while set() might do the more expensive
 Map.keySet() iteration.
 Our current work around is to not share SnowBallAnalyzer instances
 among HTTP searcher threads.  We open and close one on every request.
 Thanks,
 Matt
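
As an illustration of the WeakHashMap suggestion in the description above (not the actual CloseableThreadLocal code): weakly referencing the Thread keys lets entries of expired threads be purged during ordinary get()/set() calls instead of piling up.

{code}
// Sketch of the idea only: a per-thread holder whose Thread keys are weak references.
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

public final class WeakThreadLocal<T> {
  // WeakHashMap is not thread-safe, so wrap it; keys are weak, values are strong.
  private final Map<Thread, T> perThread =
      Collections.synchronizedMap(new WeakHashMap<Thread, T>());

  public T get() {
    return perThread.get(Thread.currentThread());
  }

  public void set(T value) {
    perThread.put(Thread.currentThread(), value);
  }
}
{code}

A value that itself references its Thread would still prevent collection, so this is only a sketch of the purge-on-access idea, not a drop-in replacement.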

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [jira] [Commented] (LUCENE-2632) FilteringCodec, TeeCodec, TeeDirectory

2012-02-14 Thread eks dev
cool indeed!

Now I can easily create a full-blown index on the master and search (or
replicate) only the subset I need to search.

New use cases possible with this:
- Today one has to blow up the term dictionary with XXXMio unique ids
just to support deletions, often something only needed during indexing.
For search-only slaves it is often sufficient to have the uid as a stored
field (if at all), so the term dictionary does not get bloated.

- the possibility to simply store original documents in one index (a kind of
key-value store), but to search/distribute a much smaller index. This
enables many new scenarios where Lucene takes storage responsibility
(Lucene takes over the database role in many cases).


On Tue, Feb 14, 2012 at 8:45 AM, Uwe Schindler (Commented) (JIRA)
j...@apache.org wrote:

    [ 
 https://issues.apache.org/jira/browse/LUCENE-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13207566#comment-13207566
  ]

 Uwe Schindler commented on LUCENE-2632:
 ---

 Hey cool, sounds like this makes the unmaintainable ParallelReaders obsolete by doing 
 the splitting to several directories/parallel fields in the codec - so 
 merging automatically works correctly with every MP?

 FilteringCodec, TeeCodec, TeeDirectory
 --

                 Key: LUCENE-2632
                 URL: https://issues.apache.org/jira/browse/LUCENE-2632
             Project: Lucene - Java
          Issue Type: New Feature
          Components: core/index
    Affects Versions: 4.0
            Reporter: Andrzej Bialecki
         Attachments: LUCENE-2632.patch, LUCENE-2632.patch


 This issue adds two new Codec implementations:
 * TeeCodec: there have been attempts in the past to implement parallel 
 writing to multiple indexes so that they are all synchronized. This was 
 however complicated due to the complexity of IndexWriter/SegmentMerger 
 logic. The solution presented here offers a similar functionality but 
 working on a different level - as the name suggests, the TeeCodec duplicates 
 index data into multiple output Directories.
 * TeeDirectory (used also in TeeCodec) is a simple abstraction to perform 
 Directory operations on several directories in parallel (effectively 
 mirroring their data). Optionally it's possible to specify a set of suffixes 
 of files that should be mirrored so that non-matching files are skipped.
 * FilteringCodec is related in a remote way to the ideas of index pruning 
 presented in LUCENE-1812 and the concept of tiered search. Since we can use 
 TeeCodec to write to multiple output Directories in a synchronized way, we 
 could also filter out or modify some of the data that is being written. The 
 FilteringCodec provides this functionality, so that you can use like this:
 {code}
 IndexWriter -- TeeCodec
                  |  |
                  |  +-- StandardCodec -- Directory1
                  +-- FilteringCodec -- StandardCodec -- Directory2
 {code}
 The end result of this chain is two indexes that are kept in sync - one is 
 the full regular index, and the other one is a filtered index.

 --
 This message is automatically generated by JIRA.
 If you think it was sent incorrectly, please contact your JIRA 
 administrators: 
 https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
 For more information on JIRA, see: http://www.atlassian.com/software/jira



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3760) Cleanup DR.getCurrentVersion/DR.getUserData/DR.getIndexCommit().getUserData()

2012-02-09 Thread Eks Dev (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13204634#comment-13204634
 ] 

Eks Dev commented on LUCENE-3760:
-

I have a use case for CommitUserData and I think the standard solr DIH could 
benefit from it as well. 
I use it to persist the current status of the max document id (a user id, not the 
lucene docid) to know what I have indexed so far (all update commands are stored in 
the database and have a simple incrementing counter). This makes the incremental 
update process restart- and rollback-safe, as it gets written on lucene 
commit and read on startup. I do not index this field (so as not to pollute the term 
dictionary) and I only need to keep the max value of it. 

I find it hugely useful, but if you have better ideas on how to safely persist 
the max/min value of a field I am all ears. 
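
For reference, a minimal sketch of this use case against the 3.x-era API discussed around these issues (IndexWriter.commit(Map) to write, getIndexCommit().getUserData() to read); error handling is omitted and the exact method names may differ between versions:

{code}
// Sketch: persist the highest processed user id atomically with the Lucene commit.
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public final class MaxUidTracker {

  /** Written together with the commit, so it can never get out of sync with the index. */
  public static void commitWithMaxUid(IndexWriter writer, long maxUid) throws Exception {
    Map<String, String> userData = new HashMap<String, String>();
    userData.put("maxUid", Long.toString(maxUid));
    writer.commit(userData);
  }

  /** On startup, read it back from the commit point the reader was opened on. */
  public static long readMaxUid(Directory dir) throws Exception {
    IndexReader reader = IndexReader.open(dir);
    try {
      String value = reader.getIndexCommit().getUserData().get("maxUid");
      return value == null ? -1L : Long.parseLong(value);
    } finally {
      reader.close();
    }
  }
}
{code}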

Last time I checked, solr DIH used its own file in the cfg directory to persist 
max(timestamp), which is kind of risky as it is not in sync with the lucene commit 
point under all scenarios. 

I think I even opened an issue on the solr jira to expose the user commit data 
feature to solr, but I am missing good ideas on how to expose it to solr users 
(max/min/avg field tracking maybe)...

Cheers, eks

 Cleanup DR.getCurrentVersion/DR.getUserData/DR.getIndexCommit().getUserData()
 -

 Key: LUCENE-3760
 URL: https://issues.apache.org/jira/browse/LUCENE-3760
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.6, 4.0

 Attachments: LUCENE-3760.patch, LUCENE-3760.patch


 Spinoff from Ryan's dev thread DR.getCommitUserData() vs 
 DR.getIndexCommit().getUserData()... these methods are confusing/dups right 
 now.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3760) Cleanup DR.getCurrentVersion/DR.getUserData/DR.getIndexCommit().getUserData()

2012-02-09 Thread Eks Dev (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13204796#comment-13204796
 ] 

Eks Dev commented on LUCENE-3760:
-

whoops, before putting mouth to action, one should use brain... I just quickly 
skimmed over this issue and stumbled on "...If no one speaks up for a whole 
release cycle, they are gone..." out of context, so I concluded the user data is 
gone.

Of course, they do not have to be static; I read it only on restart, so even if 
I do not need an open IR, it is not an issue to open it once... 
sorry for the noise

 Cleanup DR.getCurrentVersion/DR.getUserData/DR.getIndexCommit().getUserData()
 -

 Key: LUCENE-3760
 URL: https://issues.apache.org/jira/browse/LUCENE-3760
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.6, 4.0

 Attachments: LUCENE-3760.patch, LUCENE-3760.patch


 Spinoff from Ryan's dev thread DR.getCommitUserData() vs 
 DR.getIndexCommit().getUserData()... these methods are confusing/dups right 
 now.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: DUH2 getStatistics() ok?

2011-10-05 Thread eks dev
sure, sorry for not providing a patch. I was focused on something
completely different and this just accidentally caught my eye.

On Wed, Oct 5, 2011 at 1:06 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : Subject: DUH2 getStatistics() ok?

 eks: to close the loop, i read your message yesterday and asked miller
 about it on IRC, and that lead him to commiting r1178632.

 thank you for catching that.

 https://svn.apache.org/viewvc?view=revisionrevision=1178632


 -Hoss

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



DUH2 getStatistics() ok?

2011-09-22 Thread eks dev
I was looking into DUH2 changes (for SOLR-2701) and noticed
DUH2.getStatistics might have a bug

 public NamedList getStatistics() {
   NamedList lst = new SimpleOrderedMap();
   lst.add("commits", commitCommands.get());
   if (commitTracker.getTimeUpperBound() > 0) { // isn't this getDocsUpperBound()?
     lst.add("autocommit maxDocs", commitTracker.getDocsUpperBound());
   }


cheers,
e.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Regarding Transaction logging

2011-09-09 Thread eks dev
+1
indeed! All possibilities are needed.

One might do wild things if it is somehow typed. For example,
dictionary compression for fields that are tokenized (not only
stored), as we already have the term dictionary supporting ord-s. Keeping
just a map Token -> ord with the transaction log...




On Fri, Sep 9, 2011 at 11:19 AM, Andrzej Bialecki a...@getopt.org wrote:
 On 09/09/2011 11:00, Simon Willnauer wrote:

 I created LUCENE-3424 for this. But I still would like to keep the
 discussion open here rather than moving this entirely to an issue.
 There is more about this than only the seq. ids.

 I'm concerned also about the content of the transaction log. In Solr it uses
 javabin-encoded UpdateCommand-s (either SolrInputDocuments or Delete/Commit
 commands). Documents in the log are raw documents, i.e. before analysis.

 This may have some merits for Solr (e.g. you could imagine having different
 analysis chains on the Solr slaves), but IMHO it's more of a hassle for
 Lucene, because it means that the analysis has to be repeated over and over
 again on all clients. If the analysis chain is costly (e.g. NLP) then it
 would make sense to have an option to log documents post-analysis, i.e. as
 correctly typed stored values (e.g. string - numeric) AND the resulting
 TokenStream-s. This has also the advantage of moving us towards the dumb
 IndexWriter concept, i.e. separating analysis from the core inverted index
 functionality.

 So I'd argue for recording post-analysis docs in the tlog, either
 exclusively or as a default option.

 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com


 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Regarding Transaction logging

2011-09-09 Thread eks dev
I didn't think, it was just a spontaneous reaction :)

At the moment I am using static dictionaries to at least get a grip on
the size of stored fields (escaping encoded terms).

Re: Global
Maybe the trick would be to somehow use the term dictionary, as it must be
*eventually* updated? An idea is to write the raw token stream for
atomicity and reduce it later in a compaction phase (e.g. on lucene
commit())... no matter what we plan to do, TL compaction is going to be
needed?

It is a slightly moving-target problem (the TL chases the term dictionary),
but I am sure the benefits can be huge. A compacted TL entry would need to
have a pointer to the Term[] used to encode it, but this is by all means
doable, just a simple Term[].

It surely does not make much sense for high-cardinality fields, but if you
have something with low cardinality (indexed and stored) on a big
(100Mio) collection, this reduces space by exorbitant amounts (the sketch below illustrates the idea).
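
A sketch of that per-entry dictionary coding (my own names, no Lucene/Solr classes): the entry stores the small Term[]-like dictionary once and the token stream as ordinals into it.

{code}
// Illustration only: encode a tokenized field as ords against a per-entry dictionary.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class DictCodedEntry {
  private final String[] dictionary; // the "pointer to Term[]" part, stored once per entry
  private final int[] ords;          // the token stream, encoded as ordinals into the dictionary

  private DictCodedEntry(String[] dictionary, int[] ords) {
    this.dictionary = dictionary;
    this.ords = ords;
  }

  public static DictCodedEntry encode(List<String> tokens) {
    Map<String, Integer> ordOf = new LinkedHashMap<String, Integer>();
    int[] ords = new int[tokens.size()];
    for (int i = 0; i < tokens.size(); i++) {
      Integer ord = ordOf.get(tokens.get(i));
      if (ord == null) {
        ord = ordOf.size();
        ordOf.put(tokens.get(i), ord);
      }
      ords[i] = ord;
    }
    return new DictCodedEntry(ordOf.keySet().toArray(new String[0]), ords);
  }

  public List<String> decode() {
    List<String> tokens = new ArrayList<String>(ords.length);
    for (int ord : ords) {
      tokens.add(dictionary[ord]);
    }
    return tokens;
  }
}
{code}

For a low-cardinality field the per-entry dictionary stays tiny, so its overhead is small compared to the savings on the repeated tokens.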


I do not know, I am just trying to build upon the fact that we have the term
dictionary updated in any case...


This works not only for transaction logging, but also for
(analyzed) {stored, indexed} fields. By the way, I never looked at how
our term vectors work - do they keep a reference to the token or a verbatim term
copy?





On Fri, Sep 9, 2011 at 12:31 PM, Andrzej Bialecki a...@getopt.org wrote:
 On 09/09/2011 12:07, eks dev wrote:

 +1
 indeed! All possibilities are are needed.

 One might do wild things if it is somehow  typed. For example,
 dictionary compression for fields that are tokenized (not only
 stored), as we already have Term dictionary supporting ord-s. Keeping
 just a map Token -> ord with the transaction log...

 Hmm, you mean a per-doc map? because a global map would have to be updated
 as we add new docs, which would make the writing process non-atomic, which
 is the last thing you want from a transaction log :)

 As a per-doc compression, sure. In fact, what you describe is essentially a
 single doc mini-index, because the map is a term dict, the token streams
 with ords are postings, etc.

 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com


 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



new AutomatonQuery(RunAutomaton) ?

2011-08-31 Thread eks dev
At the moment it is not possible (?) to construct an AutomatonQuery with a
RunAutomaton.
Would it make sense to add this possibility? Is it doable at all?

I have to keep a collection of RunAutomaton-s for other purposes (after-search
feature extraction) and it would be handy to feed them directly
to AutomatonQuery.
I could as well keep cached AutomatonQuery objects (the field name does
not change), but then I would need to get the (Run)Automaton back from the
Query...

Thanks,
eks.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: new AutomatonQuery(RunAutomaton) ?

2011-08-31 Thread eks dev
Thanks Robert, this is what I expected after looking into CompiledAutomaton ..

On Wed, Aug 31, 2011 at 2:00 PM, Robert Muir rcm...@gmail.com wrote:
 On Wed, Aug 31, 2011 at 3:51 AM, eks dev eks...@yahoo.co.uk wrote:
 At the moment it is not possible (?) to construct AutomatonQuery with
 RunAutomaton.
 Would it make sense to add this possibility? Is it doable at all?

 Its not doable, we need more information than the runautomaton, its not 
 enough.

 --
 lucidimagination.com

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: new AutomatonQuery(RunAutomaton) ?

2011-08-31 Thread eks dev
I do not think it will be expensive, it is just an attempt to keep
code smaller, simpler and marginally faster :)

These are a lot (ca. 1000) of small prefix-based regexes with a limited
alphabet, compiled as RunAutomaton, which I load on startup and look up from
some RunAutomaton[] on request...

they look like Regex(((123)|(124)|(401)|(777)|(351))[0-9]{0,2})

By the way, what will AutomatonQuery prefer (XXX)[0-9]{0,2} or
(XXX)[0-9]* or (XXX).* ? Any performance difference?

Semantically they are the same, as I know that my content is only 5 digits.

I need them to
1. formulate a complex BooleanQuery, where AutomatonQuery gets one clause
2. do post-processing (a lot of hits) of the query against the hits, and
this has to be fast.

I guess I will switch to keeping only Automaton[] and build the
RunAutomaton on the fly (per request) for the fast query-vs-hits check; this is
done only once per request, but then I need to keep the state of the
RunAutomaton per query... makes things slightly more verbose...
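
A rough sketch of that plan (against the trunk-era automaton API; exact constructors may differ between versions): keep only the Automaton[] cached and build the per-request CharacterRunAutomaton for the post-processing check.

{code}
// Sketch only: cache Automaton[], derive query clauses and per-request matchers from it.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.AutomatonQuery;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.CharacterRunAutomaton;
import org.apache.lucene.util.automaton.RegExp;

public final class CachedAutomata {
  private final Automaton[] automata;
  private final String field;

  public CachedAutomata(String field, String[] regexes) {
    this.field = field;
    this.automata = new Automaton[regexes.length];
    for (int i = 0; i < regexes.length; i++) {
      automata[i] = new RegExp(regexes[i]).toAutomaton(); // e.g. "((123)|(124))[0-9]{0,2}"
    }
  }

  /** One clause for the BooleanQuery. */
  public AutomatonQuery newQuery(int i) {
    return new AutomatonQuery(new Term(field, ""), automata[i]);
  }

  /** Built once per request for the fast "does this stored value match" check on hits. */
  public CharacterRunAutomaton newMatcher(int i) {
    return new CharacterRunAutomaton(automata[i]);
  }
}
{code}

A hit's stored field value can then be checked with newMatcher(i).run(storedValue).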








On Wed, Aug 31, 2011 at 5:06 PM, Robert Muir rcm...@gmail.com wrote:
 Can you provide more information about your automaton and why
 'recompiling' it might be expensive?

 E.g. #states/#transitions, is it finite or infinite, etc.

 On Wed, Aug 31, 2011 at 10:56 AM, eks dev eks...@yahoo.co.uk wrote:
 Thanks Robert, this is what I expected after looking into CompiledAutomaton 
 ..

 On Wed, Aug 31, 2011 at 2:00 PM, Robert Muir rcm...@gmail.com wrote:
 On Wed, Aug 31, 2011 at 3:51 AM, eks dev eks...@yahoo.co.uk wrote:
 At the moment it is not possible (?) to construct AutomatonQuery with
 RunAutomaton.
 Would it make sense to add this possibility? Is it doable at all?

 Its not doable, we need more information than the runautomaton, its not 
 enough.

 --
 lucidimagination.com

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org





 --
 lucidimagination.com

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: new AutomatonQuery(RunAutomaton) ?

2011-08-31 Thread eks dev
bytes are good, I am in byte range on this data, and even simpler is good :)

It is simple: I just need to know whether the automaton I used for
the AutomatonQuery accepts one stored field. So yes, it is the same
information as in the Term, but I need to run over it once more because my
query is not filtering on the AutomatonQuery alone:

((AutomatonQuery(A)) OR (OtherQuery))+

So I get back documents not matched by this Automaton, and I do not
know which ones are there due to the OtherQuery.

Running the search in 2 passes, with and without the automaton, is not practicable.




On Wed, Aug 31, 2011 at 8:45 PM, Robert Muir rcm...@gmail.com wrote:
 On Wed, Aug 31, 2011 at 2:37 PM, eks dev eks...@yahoo.co.uk wrote:
 Keeping AutomatonQuery around came to me as an option, but do not
 forget, I need Automaton (RunAutomaton) for post processing... There
 is no way to get Automaton back from the AutomatonQuery?


 The compiled automaton is not always a RunAutomaton, sometimes its
 internal representation is something even simpler :)
 Additionally, when it is a RunAutomaton, its a UTF-8 one, for
 operating directly on bytes...

 Can you describe a little bit about what 'post processing' you need to
 do? I imagine its post processing on something other than the terms?

 --
 lucidimagination.com

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: LevenshteinAutomata challenge

2011-08-10 Thread eks dev
Thanks Dawid,
That would be super duper great!

Regex1LEVRegex2...

the trick with the smart skipping terms enum is not changing with the
automaton construction? These two steps are independent... if you
build an infinite Automaton like Regex(.*), your problem :)

This cannot be better at the moment. One day, the TermDict will become
an automaton as well... but for now we live with this brilliant idea of
using the sorted list of terms as the other automaton for matching. But even
then, Regex(.*) should match everything :)


My proposal is about performance and flexibility.
Compare a simple use case:

in order to support a transposition in the first two characters like in
my example, you would need a Lev. Automaton with maxDistance 2, so
it would waste time scanning *much, much more* than needed for the given
constraint. It does not matter whether you match against a sorted list or
another automaton.

An alternative would be to run two fuzzy enums with Lev automata with
maxDistance = 1 and filter them out on the prefix later. But these are
two separate term enum scans.

With concatenation, the composite automaton would scan the prefix part like
normal finite regex prefixes (fast!) and the rest with a Lev. automaton
with maxDistance 1, using whatever is possible at the moment (a sorted
list, or another automaton like Mike's FST TermDictionary).

Another example would be to add some phonetic variations here and there...

CARIN -> Regex(C|K) + Lev(ARIN)

The point being, this enables us to restrict the search space of the Lev automaton
significantly (which is normally huge, as a simple transposition needs
minDistance 2...) to achieve what we want (e.g. most of the CARIN
matches with LEV(2) are nonsense).

One could also restrict suffixes easily; with supported intersection,
one could do other wild things...
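
To illustrate the kind of composition meant here, a sketch assuming the brics-derived automaton classes of that era (RegExp, LevenshteinAutomata, BasicOperations, AutomatonQuery); the exact signatures may differ between versions:

{code}
// Sketch: a fixed prefix as a regular automaton, concatenated with a Levenshtein
// automaton for the rest, wrapped in an AutomatonQuery for the terms enum.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.AutomatonQuery;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.BasicOperations;
import org.apache.lucene.util.automaton.LevenshteinAutomata;
import org.apache.lucene.util.automaton.RegExp;

public final class ComposedFuzzy {

  /** e.g. prefixRegex = "[CK]", rest = "ARIN", maxEdits = 1 for the CARIN example above. */
  public static AutomatonQuery build(String field, String prefixRegex, String rest, int maxEdits) {
    Automaton prefix = new RegExp(prefixRegex).toAutomaton();
    Automaton fuzzyRest = new LevenshteinAutomata(rest).toAutomaton(maxEdits);
    Automaton composed = BasicOperations.concatenate(prefix, fuzzyRest);
    return new AutomatonQuery(new Term(field, ""), composed);
  }
}
{code}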

I'll definitely have to look at this scary code as Mike indicated in his blog

Cheers, eks



On Wed, Aug 10, 2011 at 11:07 AM, Dawid Weiss
dawid.we...@cs.put.poznan.pl wrote:

 The thing you're describing is a regular composition of automata (as it
 exists, for example, when composing clauses of a regular expression). If I
 recall right the Levenshtein automaton in Lucene is built on modified brics
 code... if so then this should be not a problem. The problem may be that
 currently automatons are used in enums in a way that skips from one accepted
 sequence to another accepted sequence (if possible). If the automaton has *
 operators then there is no way to establish these and everything falls back
 to full matching strategy.
 Dawid

 On Wed, Aug 10, 2011 at 10:54 AM, eks dev eks...@yahoo.co.uk wrote:

 Hi Robert, Mike  other FS(A|T) gurus,

 a challenge for you ;)

 Would it be possible to combine these brilliant pieces of functionality
 with normal Automaton somehow...

 Example to illustrate.
 DirectSpellChecker:
 - where instead of minPrefix, we would specify Regex (other Automaton)
 pfxAutomaton = Regex((AB)|(BA)) // e.g. saying,
 levAutomaton = LevenshteinAutomata(XYZ)

 spell(pfxAutomaton, levAutomaton);

 would match terms that start with AB or BA and suffix part are normal
 edit distance matches, like ABXY, with one delete
 This would support wild things, like enable only transpositions in first
 three characters... In order to gat these matches today, you need to make
 Lev. Automata with maxDistance = 2 (which is then HUGE space to search
 without prefix)... Or generate more Lev. automata and make union of results
 (expensive to iterate)

 Other good use cases are simple to construct...

 The most general question, can we support at least concatenation between
 LevenshteinAutomata  and normal Automata. Intersection/union would be crazy
 thing as well? Where we would have:
 FilteringAutomata.intersect(LevenshteinAutomata)... but I guess I am
 dreaming with this one, but concatenation sounds  doable (at least prefix
 side)

 Cheers,
 Eks




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: LevenshteinAutomata challenge

2011-08-10 Thread eks dev
Thanks Dawid,

I did not know I can mix Automaton with LevenshteinAutomaton.

What you are saying is that Automaton.concatenate(LevenshteinAutomaton),
intersect and union would work.

This is simply fantastic!

Cheers, eks

On Wed, Aug 10, 2011 at 11:07 AM, Dawid Weiss
dawid.we...@cs.put.poznan.pl wrote:

 The thing you're describing is a regular composition of automata (as it
 exists, for example, when composing clauses of a regular expression). If I
 recall right the Levenshtein automaton in Lucene is built on modified brics
 code... if so then this should be not a problem. The problem may be that
 currently automatons are used in enums in a way that skips from one accepted
 sequence to another accepted sequence (if possible). If the automaton has *
 operators then there is no way to establish these and everything falls back
 to full matching strategy.
 Dawid

 On Wed, Aug 10, 2011 at 10:54 AM, eks dev eks...@yahoo.co.uk wrote:

 Hi Robert, Mike  other FS(A|T) gurus,

 a challenge for you ;)

 Would it be possible to combine these brilliant pieces of functionality
 with normal Automaton somehow...

 Example to illustrate.
 DirectSpellChecker:
 - where instead of minPrefix, we would specify Regex (other Automaton)
 pfxAutomaton = Regex((AB)|(BA)) // e.g. saying,
 levAutomaton = LevenshteinAutomata(XYZ)

 spell(pfxAutomaton, levAutomaton);

 would match terms that start with AB or BA and suffix part are normal
 edit distance matches, like ABXY, with one delete
 This would support wild things, like enable only transpositions in first
 three characters... In order to gat these matches today, you need to make
 Lev. Automata with maxDistance = 2 (which is then HUGE space to search
 without prefix)... Or generate more Lev. automata and make union of results
 (expensive to iterate)

 Other good use cases are simple to construct...

 The most general question, can we support at least concatenation between
 LevenshteinAutomata  and normal Automata. Intersection/union would be crazy
 thing as well? Where we would have:
 FilteringAutomata.intersect(LevenshteinAutomata)... but I guess I am
 dreaming with this one, but concatenation sounds  doable (at least prefix
 side)

 Cheers,
 Eks




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: LevenshteinAutomata challenge

2011-08-10 Thread eks dev
Hi Robert, 
indeed, transpositions can be embedded into the Lev. Automaton. I share your opinion, 
even more strongly: I think transpositions are fundamental for spell-check 
types of apps. 


But, even if implemented, it would only slightly change the point of my example. 
(To support it efficiently...)

Lev(ABXYZ, 1) // assuming a transposition counts as one, still has much more 
to do (bigger search space, or whatever this is called) 

than this composition
RegEx((AB)|(BA)) + Lev(XYZ, 1)


Lev(CARIN, 1)
vs
RegEx([CK]) + Lev(ARIN, 1)

And so on and so on...

So yes, transpositions would be great, but they are orthogonal to the regular 
composition of automata (Lev. and regex, however one constructs them). 


This composition of automata enables us to restrict the space significantly, and 
to add heuristics that help improve match quality (precision).
In my experience, plain edit distance without heuristics is rarely useful, 
especially on terms shorter than, say, 7 chars. It generates a lot of candidates and 
one needs a re-scoring phase, which is in turn slow if there are too many candidates. 
So we come back to automata composition: with it, you could push some 
of these heuristics back into the candidate-fetching phase to increase the 
precision of edit-distance candidate fetching... 


A theoretical alternative, less flexible/clean, would be to somehow add parameters 
to the Lev. automata builder, like "I want to have max distance = 1 from the 3rd 
character on, but only transpositions without deletes/inserts on the first two 
characters"... which looks ugly. 


But Mike and Dawid said this is already there and works... only some API sugar is 
needed! 

Lovely :)


Re: "...though if you don't want to implement it, you could let him know you are 
interested..."

I would really love to, but finding that much time to understand this hairy code 
is at the moment impossible. 


Thanks to all, it turned out that the answer to my challenge was simple: 
done, we have it already! ;)



 in order to support transposition in the first two characters like in
 my example, you would need Lev. Automaton that has maxDistance 2,

Actually this isn't true: we can implement the variant where
transpositions are a basic edit operation:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 (Chapter 7)

I think this transpositions variant would really be more ideal for
spellchecking.
I think it would actually be best/easiest if this was implemented in
python in moman: https://bitbucket.org/jpbarrette/moman/

Jean-Philippe Barrette-LaPierre has told me before he was interested
in implementing it, but I think his time is limited, though if you
don't want to implement it, you could let him know you are interested.

Re: LevenshteinAutomata challenge

2011-08-10 Thread eks dev
Re:
For the regexp syntax you discuss, you can actually already do this.
This is one reason why RegexpQuery has a constructor that takes

I did not try to suggest a new syntax, it was just an attempt to describe the 
question :) The sugar part later is easy...


Lucene has come a really long way! I have followed it more or less intensively since 
the times when everybody asked "what does the name Lucene mean?". That was a long time 
ago, when the project had one member, Doug. In the meantime many projects came and 
went, but Lucene progressed. These mind-blowing additions on trunk expected in 
4.0 are a more than clear indicator of health. 


The Lucene core team that made all these goodies has enormous human qualities and is, 
without doubt, unprecedentedly professional. But it is the former that makes a team 
more than a group of individuals. Yeah, Lucene rocks, as long as chaps of that sort 
stick around. 


 I just wanted to say,... thank you!

Cheers, Eks.


[jira] [Updated] (SOLR-2701) Expose IndexWriter.commit(Map<String,String> commitUserData) to solr

2011-08-08 Thread Eks Dev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated SOLR-2701:
--

Attachment: SOLR-2701.patch

A rather simplistic approach: adding userCommitData to CommitUpdateCommand.

So we at least have a vehicle to pass it to IndexWriter.

No advanced machinery to make it available to non-expert users. At least it is 
not wrong to have it there?

Eclipse also removed some unused imports from DUH2.

 Expose IndexWriter.commit(Map<String,String> commitUserData) to solr 
 -

 Key: SOLR-2701
 URL: https://issues.apache.org/jira/browse/SOLR-2701
 Project: Solr
  Issue Type: New Feature
  Components: update
Affects Versions: 4.0
Reporter: Eks Dev
Priority: Minor
  Labels: commit, update
 Attachments: SOLR-2701.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 At the moment, there is no feature that enables associating user information 
 to the commit point.
  
 Lucene supports this possibility and it should be exposed to solr as well, 
 probably via beforeCommit Listener (analogous to prepareCommit in Lucene).
 Most likely home for this Map to live is UpdateHandler.
 Example use case would be an atomic tracking of sequence numbers or 
 timestamps for incremental updates.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [jira] [Updated] (SOLR-2700) transaction logging

2011-08-07 Thread eks dev
Just a casual comment...
This issue marks another big milestone in the solr/lucene evolution; it
moves in a new direction of being not only a search library, but rather
a full data storage/manipulation solution. Who needs sql and nosql db-s,
they cannot search without painful integration :)

Imo, this issue is symbolically just as important for us users as flex
indexing was.

Flex indexing and column stride fields are great infrastructure to
build upon, but they also started with one small step by making the
omitTf hack :) Mike is great with his "progress, not perfection".

cheers, eks



On Sun, Aug 7, 2011 at 7:44 PM, Yonik Seeley (JIRA) j...@apache.org wrote:

     [ 
 https://issues.apache.org/jira/browse/SOLR-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
  ]

 Yonik Seeley updated SOLR-2700:
 ---

    Attachment: SOLR-2700.patch

 Here's an update that handles delete-by-id and also makes lookups concurrent 
 (no synchronization on the file reads so multiple can proceed at once).

 transaction logging
 ---

                 Key: SOLR-2700
                 URL: https://issues.apache.org/jira/browse/SOLR-2700
             Project: Solr
          Issue Type: New Feature
            Reporter: Yonik Seeley
         Attachments: SOLR-2700.patch, SOLR-2700.patch


 A transaction log is needed for durability of updates, for a more performant 
 realtime-get, and for replaying updates to recovering peers.

 --
 This message is automatically generated by JIRA.
 For more information on JIRA, see: http://www.atlassian.com/software/jira



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



IndexReader.maxDoc() and other

2011-08-06 Thread eks dev
Assuming there are no deletes, would the following work as a way to load the *last 
added document*, surviving optimize as well? 
The order of documentId-s in Lucene survives optimize, as far as I remember? 

IndexReader ir = ...;
int maxDoc = ir.maxDoc() - 1;
if (maxDoc >= 0) // what is the return value of maxDoc() on an empty index, 0 or 1?
  Document d = ir.document(maxDoc); // IndexReader's accessor is document(int)

Would this correspond to the last committed document (at the commit point where 
the index reader was opened), or the last added document, including 
pending/uncommitted ones? (I am not getting the IndexReader from the IndexWriter, 
no NRT yet...)


The problem I am trying to solve is incremental updates (there are no 
deletions). Having a unique, numerical uid stored in the index that increases 
with every add, I just need a way to find max(uid) at the last commit to get my 
delta from the database.

The above solution was one of the options. 
2. The second would be to iterate a TermsEnum for the uid field until I hit the end, 
but this sounds slow (even if I start skipping around like a monkey)? 

3. The third option would be to index a reversed uid (HUGE_CONSTANT - uid), so it 
gets to the top of the terms dictionary?  

4. And finally, the last option I am thinking of would be to track max(UID) and 
write it as a user parameter with IndexWriter.commit(Map...), so I could read 
it easily (piggy-backing on the lucene commit is as safe as it gets, better than 
persisting my own files...). 

I like the last option, but I have no idea how to create a beforeCommit listener in 
solr?   


The most robust is 2/3, but maybe slow-ish (there are 100-200Mio documents/UIDs)

Any better ideas? (and no, DIH wall clock timestamp is not good enough)

I am talking about solr/lucene 4 trunk, we decided to take a risk :) 
 
Thanks, 
eks

Re: IndexReader.maxDoc() and other

2011-08-06 Thread eks dev
Thanks Yonik, 

assuming I am not going to index the ID, then only option 4 remains so far. I 
have no other ideas, and a Log* merge policy would mean all the 4.0 indexing magic 
went for nothing :)

Could the following then do the job? 
Clone DefaultIndexWriterProvider into my codebase (ugly, has to be kept in sync, but 
doable) and make it provide 
EnhancedSolrIndexWriter extends SolrIndexWriter

@Override
commit(...) {
   super.commit(Core.getUserMap()); // a Map<String, String> held by the core
}

the same with close(...)


If yes, is this feature something solr could use? A Map<String, String> userParams 
somewhere in Core that gets committed with whatever it has at commit time. I 
could then wrap up a patch by modifying SolrIndexWriter directly?

The nice thing about it: one could keep a small map of key-value 
pairs in sync with commit points, with all the goodies of TwoPhaseCommit... so there 
is no way for things to get out of sync, as in my use case below... I imagine DIH 
could use it as well.



-

No longer... the default merge policy can now merge non-contiguous segments.
You can of course still select a Log* merge policy, which never
reorders ids with respect to each other.

-Yonik
http://www.lucidimagination.com




From: eks dev eks...@yahoo.co.uk
To: dev@lucene.apache.org
Sent: Sat, 6 August, 2011 20:47:09
Subject: IndexReader.maxDoc()  and other


Assuming there are no deletes,  would the following work as a way to load *last 
added document*, surviving optimize as well? 
Order of documentId-s in Lucene survives optimize as far as I remember? 

IndexReader ir...
int maxDoc = ir.maxDoc() -  1;
if(maxDoc0) //? What is the return value on empty index, 0 or 1?
Document d = ir.getDocument(maxDoc);

Would this correspond to the last committed document (at commit point where 
index reader was opened)

Or last added document, including pending/uncommitted (I am not getting 
IndexReader from the IndexWriter, no nrt yet...)


The problem I am trying to solve are incremental updates (there are no 
deletions). Having unique, numerical uid stored in index that is increasing 
with 
every add, I just need a way to find max(uid) on the last commit to get my 
delta 
from the database.

Above solution was one of the options. 
2.The second would be to iterate TermsEnum for uid field until I hit an end, 
but 
this sounds slow (even if I start skipping around like a monkey)? 

3.Third option would be to index reverse uid  (HUGE_CONSTANT - uid), so it gets 
on top in terms dictionary?  

4. And finally, the last option I am thinking of would be to track max(UID) and 
write it as a user Parameter with  IndexWriter.commit(Map...), so I could read 
it easily (piggy-back on lucene commit is as safe as it gets, better then 
persisting own files...) 

I like the last option, but have no idea how to create beforeCommitListener in 
solr?


The most robust is 2/3, but maybe slow-ish (there are 100-200Mio documents/UIDs)

Any better ideas? (and no, DIH wall clock timestamp is not good enough)

I am talking about solr/lucene 4 trunk, we decided to take a risk :) 
 
Thanks, 
eks

[jira] [Commented] (SOLR-2701) Expose IndexWriter.commit(Map<String,String> commitUserData) to solr

2011-08-06 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080474#comment-13080474
 ] 

Eks Dev commented on SOLR-2701:
---

One hook for users to update the content of this map would be to add beforeCommit 
callbacks. This looks simple enough in the UpdateHandler2.commit() call, but there 
is a catch: 

we need to invoke the listeners before we close() for implicit commits... having 
decref-ed the IndexWriter, the question is whether we want to run beforeCommit 
listeners even if the IW does not really get closed (the user updates the map more 
often than needed). 

IMO, this should not be a problem, invoking callbacks a little bit more often 
than needed. 

Another place where we have an implicit commit is newIndexWriter(); 
here we only need to add IndexWriterProvider.isIndexWriterNull() to check whether we 
need the callbacks.

A solution for close() would also be simple: add 
IndexWriterProvider.isIndexGoingToCloseOnNextDecref() before invoking decref() 
to condition the callbacks.

Any better solution? Are callbacks a good approach to provide user hooks for 
this? (A minimal sketch of such a hook follows below.)
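
To make the callback idea concrete, a minimal, hypothetical sketch (these interfaces do not exist in Solr; the names are mine):

{code}
// Hypothetical beforeCommit hook: listeners fill the user commit data map
// right before the update handler calls IndexWriter.commit(Map).
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface BeforeCommitListener {
  /** Called right before commit; implementations may add or overwrite entries. */
  void beforeCommit(Map<String, String> commitUserData);
}

final class CommitUserDataHolder {
  private final List<BeforeCommitListener> listeners = new ArrayList<BeforeCommitListener>();

  void addListener(BeforeCommitListener l) {
    listeners.add(l);
  }

  /** Invoked from the update handler just before the commit (including implicit ones). */
  Map<String, String> collect() {
    Map<String, String> userData = new HashMap<String, String>();
    for (BeforeCommitListener l : listeners) {
      l.beforeCommit(userData);
    }
    return userData;
  }
}
{code}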

---
Another approach would be to get beforeCommit callbacks at the lucene level and 
piggy-back the solr callbacks there.
We would only need to change IndexWriter.commit(Map..) and close(), but commit 
is final...


Notice: I am very rusty regarding the solr/lucene codebase, so any help would be 
appreciated. The last patch I made here was ages ago :)


 Expose IndexWriter.commit(Map<String,String> commitUserData) to solr 
 -

 Key: SOLR-2701
 URL: https://issues.apache.org/jira/browse/SOLR-2701
 Project: Solr
  Issue Type: New Feature
  Components: update
Affects Versions: 4.0
Reporter: Eks Dev
Priority: Minor
  Labels: commit, update
   Original Estimate: 8h
  Remaining Estimate: 8h

 At the moment, there is no feature that enables associating user information 
 to the commit point.
  
 Lucene supports this possibility and it should be exposed to solr as well, 
 probably via beforeCommit Listener (analogous to prepareCommit in Lucene).
 Most likely home for this Map to live is UpdateHandler.
 Example use case would be an atomic tracking of sequence numbers or 
 timestamps for incremental updates.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-1879) Parallel incremental indexing

2011-08-01 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13073462#comment-13073462
 ] 

Eks Dev commented on LUCENE-1879:
-

The user mentioned above in the comment was me, I guess. Commenting here just to 
add an interesting use case that would be perfectly solved by this issue.  

Imagine a solr Master - Slave setup; the full document contains CONTENT and ID 
fields, e.g. a 200Mio+ collection. On the master, we need the ID field indexed in order 
to process delete/update commands. On the slave, we do not need lookup on ID and 
would like to keep our TermsDictionary small, without exploding the TermsDictionary 
with 200Mio+ unique ID terms (ouch, this is a lot compared to 5Mio unique terms 
in CONTENT, with or without pulsing). 

With this issue, this could be natively achieved by modifying the solr 
UpdateHandler not to transfer the ID index to slaves at all.

There are other ways to fix it, but this would be the best. (I am currently 
investigating an option to transfer the full index on update, but to filter out the 
TermsDictionary at the IndexReader level (it remains on disk, but this part never 
gets accessed on slaves). I do not know yet if this is possible at all in 
general, e.g. once an FST based term dictionary is already built (a prefix compressed 
TermDict would be doable).)
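
To make the filter-out idea concrete, something like this is what I mean (only a 
sketch, class name made up; written against the 4.x FilterAtomicReader/FilterFields 
API, the exact wrapper classes on trunk may differ):

{code}
import java.io.IOException;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.FilterAtomicReader;
import org.apache.lucene.index.Terms;

// Wraps the slave's reader so the ID field's term dictionary is never exposed;
// the index files stay untouched, that part simply never gets accessed.
public class HideFieldReader extends FilterAtomicReader {
  private final String hiddenField;

  public HideFieldReader(AtomicReader in, String hiddenField) {
    super(in);
    this.hiddenField = hiddenField;
  }

  @Override
  public Fields fields() throws IOException {
    Fields f = super.fields();
    if (f == null) return null;
    return new FilterFields(f) {
      @Override
      public Terms terms(String field) throws IOException {
        // pretend the ID field has no terms on the slave
        return hiddenField.equals(field) ? null : super.terms(field);
      }
    };
  }
}
{code}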

 Parallel incremental indexing
 -

 Key: LUCENE-1879
 URL: https://issues.apache.org/jira/browse/LUCENE-1879
 Project: Lucene - Java
  Issue Type: New Feature
  Components: core/index
Reporter: Michael Busch
Assignee: Michael Busch
 Fix For: 4.0

 Attachments: parallel_incremental_indexing.tar


 A new feature that allows building parallel indexes and keeping them in sync 
 on a docID level, independent of the choice of the MergePolicy/MergeScheduler.
 Find details on the wiki page for this feature:
 http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing 
 Discussion on java-dev:
 http://markmail.org/thread/ql3oxzkob7aqf3jd

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3289) FST should allow controlling how hard builder tries to share suffixes

2011-07-08 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13061804#comment-13061804
 ] 

Eks Dev commented on LUCENE-3289:
-

bq. The strings are extremely long (more like short documents) and probably 
need to be compressed in some different datastructure, e.g. a word-based one?

That would indeed be cool, e.g. an FST with words (ngrams?) as symbols. Ages ago 
we used one trie for all unique terms, to get prefix/edit distance on words, and 
one word-trie (symbols were words via a symbol table) for documents. I am sure 
this would cut memory requirements significantly for multiword cases when 
compared to a char level FST.
E.g. a TermDictionary that supports ord() could be used as a symbol table.
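
Toy illustration of just the symbol-table part (plain Java, not the FST Builder API; 
a real version would use the term dict ord() as the table):

{code}
import java.util.HashMap;
import java.util.Map;

// Each distinct word gets an int symbol; a title becomes a short int[] sequence
// that a word-level FST/trie could then share prefixes/suffixes on.
public class WordSymbolTable {
  private final Map<String, Integer> symbols = new HashMap<String, Integer>();

  public int[] encode(String title) {
    String[] words = title.split("\\s+");
    int[] out = new int[words.length];
    for (int i = 0; i < words.length; i++) {
      Integer sym = symbols.get(words[i]);
      if (sym == null) {
        sym = symbols.size();       // next free symbol id
        symbols.put(words[i], sym);
      }
      out[i] = sym;
    }
    return out;
  }
}
{code}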






 FST should allow controlling how hard builder tries to share suffixes
 -

 Key: LUCENE-3289
 URL: https://issues.apache.org/jira/browse/LUCENE-3289
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3289.patch, LUCENE-3289.patch


 Today we have a boolean option to the FST builder telling it whether
 it should share suffixes.
 If you turn this off, building is much faster, uses much less RAM, and
 the resulting FST is a prefix trie.  But, the FST is larger than it
 needs to be.  When it's on, the builder maintains a node hash holding
 every node seen so far in the FST -- this uses up RAM and slows things
 down.
 On a dataset that Elmer (see java-user thread Autocompletion on large
 index on Jul 6 2011) provided (thank you!), which is 1.32 M titles
 avg 67.3 chars per title, building with suffix sharing on took 22.5
 seconds, required 1.25 GB heap, and produced 91.6 MB FST.  With suffix
 sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST.
 I think we should allow this boolean to be shade-of-gray instead:
 usually, how well suffixes can share is a function of how far they are
 from the end of the string, so, by adding a tunable N to only share
 when suffix length < N, we can let caller make reasonable tradeoffs. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3135) backport suggest module to branch 3.x

2011-05-24 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038418#comment-13038418
 ] 

Eks Dev commented on LUCENE-3135:
-

 if we can backport the FST-based functionality
+1

 backport suggest module to branch 3.x
 -

 Key: LUCENE-3135
 URL: https://issues.apache.org/jira/browse/LUCENE-3135
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Reporter: Robert Muir

 It would be nice to develop a plan to expose the autosuggest functionality to 
 Lucene users in 3.x
 There are some complications, such as seeing if we can backport the FST-based 
 functionality,
 which might require a good bit of work. But I think this would be well-worth 
 it.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: An IDF variation with penalty for very rare terms

2011-04-15 Thread eks dev
indeed, frequency usage is collection and use case dependent...
Not directly your case, but the idea is the same.

We used this information in a spell/typo-variations context to
boost/penalize similarity, by dividing terms into a couple of freq
based segments.

Take an example:
Maria - Very High Freq
Marina - Very High Freq
Mraia - Very Low Freq

similarity(Maria, Marina) is, by string distance measures, very high,
practically the same as similarity(Maria, Mraia), but the likelihood that you
mistyped Mraia is an order of magnitude higher than if you hit a VHF-VHF
pair.

Point being, frequency hides a lot of semantics, and how you tune it,
as Marvin said, does not really matter, if it works.

We also never found theory that formalizes this, but it was logical,
and it worked in practice.
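
Roughly like this (toy sketch, thresholds and scaling factors completely made up):

enum FreqBand { VERY_HIGH, HIGH, LOW, VERY_LOW }

class FreqAwareSimilarity {

  static FreqBand band(int docFreq) {
    if (docFreq > 100000) return FreqBand.VERY_HIGH;
    if (docFreq > 10000)  return FreqBand.HIGH;
    if (docFreq > 100)    return FreqBand.LOW;
    return FreqBand.VERY_LOW;
  }

  // Maria (VHF) vs Mraia (VLF): likely a typo, boost the string similarity a bit.
  // Maria (VHF) vs Marina (VHF): two real names, damp it.
  static float adjust(float stringSim, FreqBand query, FreqBand candidate) {
    if (query == FreqBand.VERY_HIGH && candidate == FreqBand.VERY_HIGH) {
      return stringSim * 0.8f;
    }
    if (query == FreqBand.VERY_HIGH && candidate == FreqBand.VERY_LOW) {
      return stringSim * 1.1f;
    }
    return stringSim;
  }
}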

What you said makes sense to me, especially for very big collections
(or specialized domains with limited vocabulary...): the bigger the
collection, the bigger the garbage density in the VLF domain (above a certain
size of the collection). If the vocabulary in your collection is
somehow limited, there is a size limit where most new terms (VLF)
are crap-terms. One could try to estimate how saturated a
collection is...


cheers,
eks


On Wed, Apr 13, 2011 at 9:36 PM, Marvin Humphrey mar...@rectangular.com wrote:
 On Wed, Apr 13, 2011 at 01:01:09AM +0400, Earwin Burrfoot wrote:
 Excuse me for somewhat of an offtopic, but have anybody ever seen/used 
 -subj- ?
 Something that looks like like http://dl.dropbox.com/u/920413/IDFplusplus.png
 Traditional log(N/x) tail, but when nearing zero freq, instead of
 going to +inf you do a nice round bump (with controlled
 height/location/sharpness) and drop down to -inf (or zero).

 I haven't used that technique, nor can I quote academic literature blessing
 it.  Nevertheless, what you're doing makes sense makes sense to me.

 Rationale is that - most good, discriminating terms are found in at
 least a certain percentage of your documents, but there are lots of
 mostly unique crapterms, which at some collection sizes stop being
 strictly unique and with IDF's help explode your scores.

 So you've designed a heuristic that allows you to filter a certain kind of
 noise.  It sounds a lot like how people tune length normalization to adapt to
 their document collections.  Many tuning techniques are corpus-specific.
 Whatever works, works!

 Marvin Humphrey


 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-2691) Consolidate Near Real Time and Reopen API semantics

2010-11-25 Thread eks dev
Earwin, I used MMAP a lot, it is quite nice, it has its place under the sun,
but it is not a silver bullet, it has its quirks... the same goes for
RAMDirectory.

bq. There is zero need for any such signal. ...non-existing file ...

Why would the IndexReader ever try to read a non-existing file? The IR is going to see
its RAMDirectory point-in-time snapshot of an index until you somehow try to
reload the updated index image from disk.




On Thu, Nov 25, 2010 at 6:00 PM, Earwin Burrfoot (JIRA) j...@apache.orgwrote:


[
 https://issues.apache.org/jira/browse/LUCENE-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935794#action_12935794]

 Earwin Burrfoot commented on LUCENE-2691:
 -

 {quote}
 bq. You're still okay with an API that allows you to reopen IRs on
 different directories?
 Well, that's no good - we can catch this and throw an exc?
 {quote}
 I don't understand why should we bother with checking and throwing
 exceptions, when we can prevent such things from compiling at all.
 By using an API, that doesn't support reopening on anything different from
 original source.

 bq. Really, there are two separate things open/reopen needs:
 That's not true. Take a look at my WriterBackedReader above (or
 DirectoryReader in trunk). It requires writer at least to call
 deleteUnusedFiles(), nrtIsCurrent().
 So you can't easily reopen between Directory-backed and Writer-backed
 readers without much switching and checking.

 bq. r_ram.reload(); //Here we want to reload from the FSDirecotory?
 Use MMapDirectory? It's only a bit slower for searches, while not raping
 your GC on big indexes.
 Also check this out - https://gist.github.com/715617 , it is a
 RAMDirectory offspring that wraps any other given directory and basically
 does what you want (if I guessed right).
 It doesn't use blocking for files, so file size limit is 2Gb, but this can
 be easily fixed. On the up side - it reads file into memory only after the
 size is known (unlike RAMDir), which allows you to use huge precisely-sized
 blocks, lessening GC pressure.
 I used it for a long time, but then my indexes grew, heaps followed, VM
 exploded and I switched to MMapDirectory (with minor patches).

 bq. What is missing is a signal from IR.reload() to RAMdirectory to slurp
 fresh information from FSDirecory?
 There is zero need for any such signal. If a reader requests non-existing
 file from RAMDirectory, it should check backing dir before throwing
 exception. If backing dir does have the file - it is loaded and opened.
 Why do you people love complicating things that much? :)

  Consolidate Near Real Time and Reopen API semantics
  ---
 
  Key: LUCENE-2691
  URL: https://issues.apache.org/jira/browse/LUCENE-2691
  Project: Lucene - Java
   Issue Type: Improvement
 Reporter: Grant Ingersoll
 Assignee: Grant Ingersoll
 Priority: Minor
  Fix For: 4.0
 
  Attachments: LUCENE-2691.patch, LUCENE-2691.patch
 
 
  We should consolidate the IndexWriter.getReader and the
 IndexReader.reopen semantics, since most people are already using the
 IR.reopen() method, we should simply add::
  {code}
  IR.reopen(IndexWriter)
  {code}
  Initially, it could just call the IW.getReader(), but it probably should
 switch to just using package private methods for sharing the internals

 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.


 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




Re: Polymorphic Index

2010-10-22 Thread eks dev
Sure, it all would work and would be better than a naive indexed UID. 
Mapping more UIDs to one term permits a compromise: number of unique terms in the 
term dict against CPU during update to resolve collisions. 


I like Paul's idea with more fields, it reduces the number of UIDs in the term 
dictionary, but increases the density of the postings lists for these terms. It 
simplifies update as no collisions are possible, it just makes it slower.  


It is all too fiddly and suboptimal, one needs to tune to find an optimum here, 
but hey, better than the naive approach. 


Both of these solutions are just a better way to do it wrong :) The real solution 
is definitely somewhere around ParallelReader usage.

Ideally, one should be able to say when opening an index which parts of the index he 
is going to be using. One way to do it is to create parallel indexes; the searching 
part is fully functional and already there. 


Anyone using ParallelReader, any tips on creating parallel indexes?

In my particular case, ParallelReader is not strictly necessary, because I 
only need to filter out one Field from the termDictionary and its Postings during 
RAMDisk loading. One has some flexibility to do a lot with SwitchDirectory, but 
postings for one field are not in separate files...


Thanks for the good tips, we found two better solutions for our UID use-case 
toolbox.

Cheers, eks







  





- Original Message 
 From: Toke Eskildsen t...@statsbiblioteket.dk
 To: dev@lucene.apache.org dev@lucene.apache.org
 Sent: Fri, 22 October, 2010 0:32:04
 Subject: RE: Polymorphic Index
 
 From: Mark Harwood [markharw...@yahoo.co.uk]
  Good  point, Toke. Forgot about that. Of course doubling the number
  of hash algos used to 4 increases the space massively.
 
 Maybe your hashing-idea  could work even with collisions?
 
 Using your original two-hash suggestion,  we're just about sure to get 
collisions. However, we are still able to uniquely  identify the right 
document 
as the UID is also stored (search for the hashes,  iterate over the results 
and 
get the UID for each). When an update is requested  for an existing document, 
the indexer extracts the UIDs from all the documents  that matches the hash. 
Then it performs a delete of the hash-terms and  re-indexes all the documents 
that had false collisions. As the number of  unique hash-values as well as 
hash-function can be adjusted, this could be a  nicely tweakable 
performance-vs-space trade off.
 
 This will only work if  it is possible to re-create the documents from stored 
terms or by requesting the  data from outside of Lucene by UID. Is this 
possible 
with your setup, eks  dev?
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For  additional commands, e-mail: dev-h...@lucene.apache.org
 
 




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Polymorphic Index

2010-10-22 Thread eks dev
I do not think deletes are really a problem, there are probably many ways to fix 
that. The problem is to create and keep in sync two (or more) indexes with 
parallel docIDs. 


For example: use DeleteByQuery and feed it with a Filter with set bits on docID 
positions? 




- Original Message 
 From: Toke Eskildsen t...@statsbiblioteket.dk
 To: dev@lucene.apache.org dev@lucene.apache.org
 Sent: Fri, 22 October, 2010 14:27:45
 Subject: Re: Polymorphic Index
 
 On Fri, 2010-10-22 at 11:23 +0200, eks dev wrote:
  Both of these  solutions are just  better way to do it wrong :) The real 
solution 

   is definitely somewhere around ParallelReader usage.
 
 The problem with  parallel is with updates of documents. The IndexWriter
 takes terms and  queries for deletions and that does not work with the
 parallel approach as  there must be separate IndexWriters for each index.
 There will be no indexed  UIDs in the searcher-oriented index, so there
 is no way to perform the  deletions.
 
  Ideally, one should be able to say by opening index which  parts of index 
  he 
is 

  going to be using. One way to do it is to to  create Parallel Indexes, 
searching 

  part is fully functional and already  there. 
 
 That is correct. If IndexWriter accepted docIDs for deletions,  the
 parallel approach would work (get the docID by searching the  parallel
 index, then use it for deletions in both indexes). Unfortunately it  does
 not so you'll need to tweak the IndexWriter. I don't know how hard  that
 is.
 
  Anyone using ParallelReader, any tips on creating  parallel indexes?
 
 I would suggest that you make sure that you can solve  the delete-problem
 first, before you start creating parallel  indexes.
 
 
 -
 To  unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For  additional commands, e-mail: dev-h...@lucene.apache.org
 
 




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Polymorphic Index

2010-10-22 Thread eks dev
Thanks Grant, 
this sounds good.

https://issues.apache.org/jira/browse/LUCENE-1812
and 
https://issues.apache.org/jira/browse/LUCENE-2632

I didn't notice them before; with the high_volume high_quality traffic here in the 
lucene world, one cannot keep up :)  


Will have to look into it in detail. 

With pruning, the problem is going to be to somehow preserve the write-once 
benefit for slave updates (copy deltas and reload()).
Update the full index by adding/deleting a few docs - commit - prune - update 
slaves incrementally? Will that work? 


I will have to check what this pruning codec produces (one merge on the way and 
I need a full update of slaves...)

and these TeeSinkCodec and FilteringCodec look, from the JIRA description, just 
exactly like a solution! Sounds too good.


Thanks again!
Eks




- Original Message 
 From: Grant Ingersoll gsing...@apache.org
 To: dev@lucene.apache.org
 Sent: Fri, 22 October, 2010 15:26:31
 Subject: Re: Polymorphic Index
 
 
 On Oct 21, 2010, at 3:44 PM, eks dev wrote:
 
  Hi All, 
  I  am trying to figure out a way to implement following use case with 
  lucene/solr. 
  
  
  In order to support simple incremental  updates (master) I need to index  
  and 

  store UID Field on 300Mio  collection. (My UID is a 32 byte  sequence). But 
  I 
do 

  not need  indexed (only stored) it during normal  searching (slaves). 
  
  
  The problem is that my term dictionary gets blown away with  sheer number  
  of 

  unique IDs. Number of unique terms on this  collection, excluding UID  is 
less 

  than 7Mio.
  I can  tolerate resources hit on Updater (big hardware, on disk index...).
  
  This is a master slave setup, where searchers run from RAMDisk  and  having 
  300Mio * 32 (give or take prefix compression) plus  pointers to  postings 
  and 

  postings is something I would really  love to avoid as this  is significant 
  compared to really small  documents I have. 
  
  
  Cutting to the chase:
  How I  can have Indexed UID field, and when done with indexing:
  1) Load  searchable index into ram from such an index on disk without one 
   field? 
 
 That doesn't seem like it would be all that hard to do in Lucene  with a few 
edits to the appropriate low level classes to simply not load the  term 
dictionary for a particular set of fields (pass in a set?).  This sort  of 
masking even seems like a generally useful performance gain in the typical  
master/worker replicated environment.
 
  
  2) create 2 Indices  in sync on docIDs, One containing only indexed UID
 
 Kind of reminds me of Andrzej's pruning codec stuff.  Perhaps the new Flex 
stuff helps  here?
 
  3) somehow transform index with indexed UID by dropingUID  field, 
  preserving 

  docIs. Kind of tool smart index-editing tool. 
 
 Again, take a look at Andrzej's pruning codec.
 
 -Grant
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For  additional commands, e-mail: dev-h...@lucene.apache.org
 
 




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Polymorphic Index

2010-10-21 Thread eks dev
Hi All, 
I am trying to figure out a way to implement the following use case with 
lucene/solr. 


In order to support simple incremental updates (master) I need to index and 
store a UID Field on a 300Mio collection. (My UID is a 32 byte sequence.) But I do 
not need it indexed (only stored) during normal searching (slaves). 


The problem is that my term dictionary gets blown away by the sheer number of 
unique IDs. The number of unique terms in this collection, excluding UID, is less 
than 7Mio.
 I can tolerate a resources hit on the Updater (big hardware, on-disk index...).

This is a master-slave setup, where searchers run from RAMDisk, and having 
300Mio * 32 (give or take prefix compression) plus pointers to postings and 
the postings themselves is something I would really love to avoid, as it is 
significant compared to the really small documents I have. 


Cutting to the chase:
How can I have an indexed UID field, and when done with indexing:
1) Load a searchable index into RAM from such an index on disk, without one 
field? 

2) Create 2 indices in sync on docIDs, one containing only the indexed UID
3) Somehow transform the index with the indexed UID by dropping the UID field, 
preserving docIDs. Some kind of smart index-editing tool. 

Something else already there that I do not know about?

Preserving docIds is crucial, as I need support for lovely incremental updates 
(like in the solr master-slave update). Also the stored field should remain!
I am not looking for "use an MMAPed index and let the OS deal with it" advice... 
I do not mind doing it with the flex branch 4.0, not being in a hurry.

Thanks in advance, 
Eks 




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Using long instead of int for docIds

2010-10-12 Thread eks dev
--- the practical limit for a single lucene index is ~100M docs anyway ---

I do not see it that way: there are very practical cases (short documents)
with 250M docs and sub-second response times :)
And I believe it can be pushed even further, especially when the flex branch
stabilizes.

Changes nothing on your int/long point, just doing justice to Lucene.

Cheers,
eks

On Tue, Oct 12, 2010 at 1:01 PM, Israel Ekpo israele...@gmail.com wrote:

 Thanks Yonik for responding.

 This clarifies a lot.

 On Mon, Oct 11, 2010 at 11:11 PM, Yonik Seeley yo...@lucidimagination.com
  wrote:

 I think ints instead of longs for docids is still the best practical
 choice for today.
 - longs double the size it takes to store collected ids
 - Java native arrays are indexed by int (hence we couldn't collect
 more than 2B matches easily anyway)
 - the practical limit for a single lucene index is ~100M docs anyway

 But, perhaps MultiSearcher (or a new class called BigMultiSearcher)
 should start using longs.

 -Yonik

 On Mon, Oct 11, 2010 at 1:24 AM, Israel Ekpo israele...@gmail.com
 wrote:
  Hi Solr Devs,
 
  I have always had this question at the back of my mind and I would love
 to
  know the answers to a couple of questions.
 
  1. Does using int for document ids place any restrictions on the number
 of
  documents that can be stored in a single index? I am assuming we cannot
 go
  beyond 2 to power 31 minus 1 documents but I have not actually test this
  yet.
 
  2. What would it take to change the core to use long instead of int for
  document ids?
 
  3. Would there be any practical gains or benefits of making such a
 change?
 
  I initially wanted to send this question to the Stomp the Chomp
 challenge
  but I figured it would be better to open it to all.
 
  Any useful feedbacks will be highly appreciated.
 
  --
  °O°
  Good Enough is not good enough.
  To give anything less than your best is to sacrifice the gift.
  Quality First. Measure Twice. Cut Once.
  http://www.israelekpo.com/
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




 --
 °O°
 Good Enough is not good enough.
 To give anything less than your best is to sacrifice the gift.
 Quality First. Measure Twice. Cut Once.
 http://www.israelekpo.com/



[jira] Commented: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

2010-07-26 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892341#action_12892341
 ] 

Eks Dev commented on LUCENE-2557:
-

It looks like we have one invariant:
IDF(QueryTerm) >= IDF(ExpansionTerm) // preventing documents with an ET from scoring 
better than documents with an exact match on the QT.

Fixing all expansions to IDF(QT) would remove the dynamics of the score, making the 
contribution to the score for all expansions identical. Maybe proportionally 
scaling the IDF of all expansions to preserve their mutual IDF dynamics (relative to 
IDF(QT), to keep up with the invariant) would work better?

In the case when there is no matching QueryTerm, why not simply preserve the 
expansion term IDF; what is averaging good for, performance?
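
In code, the two variants I mean look roughly like this (sketch only, plain floats, 
nothing Lucene specific):

{code}
public class ExpansionIdf {

  // hard cap: idf(ET) := min(idf(ET), idf(QT)), enforcing the invariant directly
  public static float capped(float idfQuery, float idfExpansion) {
    return Math.min(idfExpansion, idfQuery);
  }

  // proportional variant: shrink all expansions by one factor so the largest one
  // lands exactly at idf(QT); relative ordering (dynamics) of expansions survives
  public static float[] scaled(float idfQuery, float[] idfExpansions) {
    float max = 0f;
    for (float idf : idfExpansions) max = Math.max(max, idf);
    float factor = (max > idfQuery && max > 0f) ? idfQuery / max : 1f;
    float[] out = new float[idfExpansions.length];
    for (int i = 0; i < idfExpansions.length; i++) {
      out[i] = idfExpansions[i] * factor;
    }
    return out;
  }
}
{code}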

 FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches
 --

 Key: LUCENE-2557
 URL: https://issues.apache.org/jira/browse/LUCENE-2557
 Project: Lucene - Java
  Issue Type: Bug
  Components: Query/Scoring
Affects Versions: 3.0.2
Reporter: Jingkei Ly
 Attachments: idf-scoring-test-case.patch, LUCENE-2557.patch


 The FuzzyQuery often causes misspellings to be ranked higher than the exact 
 match, which seems to be an undesirable property generally. 
 For example, in an index of surnames, if I search using a FuzzyQuery for 
 smith, the misspellings such as smiith, or smiht would appear near the 
 top of the search results ahead of documents that match smith.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Brics Automaton version

2010-06-21 Thread eks dev
I have been trying to use the automaton library from Lucene (instead of a direct 
import of the brics lib), 
and noticed some methods I need are not there (e.g. getShortestExample). 

Looking at the change log of the brics automaton 
(http://www.brics.dk/automaton/ChangeLog):
1.3-1 -> 1.3-2
==
- added Automaton methods:
- getShortestExample
- setMinimizeAlways


current version is 1.11-2, many bugs fixed in meantime...

Any plans to upgrade? 




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Brics Automaton version

2010-06-21 Thread eks dev
ok, that explains it, but I didn't expect it, considering the small size of the
library.

I would even argue it makes sense to keep some (all?) of these methods,
especially if the intended use of the Automaton code gets expanded to Analyzer
chains. This particular method has a use in our code for optimizing matching
based on the minimum possible length that can get accepted.

I would really try to avoid having two 99% identical tools in the code, or to
specialize the Automaton & co classes to do what they did in the first place.
Could get confusing.

Also, having the full library (or at least the imported classes) makes upgrades
easier. 1.11.3 will come one day...

Whichever way, I would just appreciate a final statement on this?

Thanks... and kudos for the new Fuzzy and Regex Query... looks impressive




On Mon, Jun 21, 2010 at 6:01 PM, Robert Muir rcm...@gmail.com wrote:

 we are based on the latest version (1.11.2)

 getShortestExample (among other methods) are not available because we don't
 have anything using them in lucene... we only have the stuff we need.

 On Mon, Jun 21, 2010 at 11:22 AM, eks dev eks...@yahoo.co.uk wrote:

 I have been trying to use automaton library from Lucene, (instead of
 direct import of the brics lib),
 and noticed some methods I need are not there (e.g. getShortestExample)

 Looking at the change log of the brics automaton (
 http://www.brics.dk/automaton/ChangeLog):
 1.3-1 - 1.3-2
 ==
 - added Automaton methods:
- getShortestExample
- setMinimizeAlways


 current version is 1.11-2, many bugs fixed in meantime...

 Any plans to upgrade?




 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




 --
 Robert Muir
 rcm...@gmail.com



[jira] Commented: (LUCENE-2482) Index sorter

2010-05-27 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12872386#action_12872386
 ] 

Eks Dev commented on LUCENE-2482:
-

Re: I'm not sure if I follow your use case though

Simple case: you have 100Mio docs with 2 fields, CITY and TEXT.

Sorting on CITY makes the postings look like: 
Orlando:  one contiguous run of docIDs
New York: the next contiguous run
perfectly compressible, 

without really affecting the distribution (compressibility) of terms from the TEXT 
field.

If CITY remained in unsorted order (e.g. uniform distribution), you would deal 
with very large postings for all terms coming from this field.  

Sorting on many fields helps often, e.g. if you have hierarchical compositions 
like 1 CITY with many ZIP_CODES... philosophically, sorting always increases 
compressibility and improves locality of reference... but sure, you need to 
know what you want
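
Just to make the "perfectly compressible" point concrete (a sketch, not a codec): 
after sorting, each CITY term's postings collapse to a handful of [start, end] docID 
runs, so two ints per run are enough instead of one bit or one VInt per doc:

{code}
import java.util.ArrayList;
import java.util.List;

public class IntervalEncoder {

  /** Collapse a sorted docID list into inclusive [start, end] runs. */
  public static List<int[]> toRuns(int[] sortedDocIds) {
    List<int[]> runs = new ArrayList<int[]>();
    if (sortedDocIds.length == 0) return runs;
    int start = sortedDocIds[0];
    int prev = sortedDocIds[0];
    for (int i = 1; i < sortedDocIds.length; i++) {
      int doc = sortedDocIds[i];
      if (doc != prev + 1) {            // gap found, close the current run
        runs.add(new int[] { start, prev });
        start = doc;
      }
      prev = doc;
    }
    runs.add(new int[] { start, prev }); // close the last run
    return runs;
  }
}
{code}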

 Index sorter
 

 Key: LUCENE-2482
 URL: https://issues.apache.org/jira/browse/LUCENE-2482
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 3.1
Reporter: Andrzej Bialecki 
 Fix For: 3.1

 Attachments: indexSorter.patch


 A tool to sort index according to a float document weight. Documents with 
 high weight are given low document numbers, which means that they will be 
 first evaluated. When using a strategy of early termination of queries (see 
 TimeLimitedCollector) such sorting significantly improves the quality of 
 partial results.
 (Originally this tool was created by Doug Cutting in Nutch, and used norms as 
 document weights - thus the ordering was limited by the limited resolution of 
 norms. This is a pure Lucene version of the tool, and it uses arbitrary 
 floats from a specified stored field).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833860#action_12833860
 ] 

Eks Dev commented on LUCENE-329:


{quote}
query for John~ Patitucci~ I'm probably more interested in a partial match on 
the rarer surname than a partial match on the common forename. 
{quote}


as a matter of fact, we have not only one frequency to consider, but rather two 
term frequencies!

consider the simpler case
Query term: Johan // would be a High frequency term
gives:
Fuzzy expanded term1 Johana // High frequency
Fuzzy expanded term2 Joahn // Low freq

I guess you would like to score the second term higher, meaning lower frequency 
(higher IDF)... So far so good. 

Now turn it upside down and search for the LF typo Joahn... in that case you 
would prefer the HF term Johan from the expanded list to score higher...

Point being, this situation is just not complete without taking both 
frequencies into consideration (Query Term and Expanded Term). In my 
experience, some simple nonlinear hints based on these two freqs bring some 
easy precision points (HF-LF pairs are much more likely to be typos than two 
HF-HF...  ). 


 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Priority: Minor
 Attachments: patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-12 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832911#action_12832911
 ] 

Eks Dev commented on LUCENE-2089:
-

{quote}
...Aaron i think generation may pose a problem for a full unicode alphabet...
{quote}

I wouldn't discount Aaron's approach so quickly! There is one *really smart* 
way to approach generation of the distance neighborhood. Have a look at FastSS 
http://fastss.csg.uzh.ch/  The trick is to delete, not to generate variations 
over the complete alphabet! They call it the deletion neighborhood.  It also generates 
many fewer variation terms, reducing pressure on the binary search in the TermDict!

You do not get all the goodies of a weighted distance implementation, but 
the solution is much simpler. It would work similarly to the current spellchecker 
(just lookup on variations), only faster. They even have some example code 
showing how they generate deletions 
(http://fastss.csg.uzh.ch/FastSimilarSearch.java).
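
The whole trick fits in a few lines (a sketch; real FastSS also records the deletion 
positions so candidates can be verified cheaply):

{code}
import java.util.HashSet;
import java.util.Set;

public class DeletionNeighborhood {

  // All strings obtainable from term by deleting up to k characters.
  // Two terms within edit distance k share at least one such variant,
  // so indexing/looking up the variants finds candidates without ever
  // enumerating the alphabet.
  public static Set<String> variants(String term, int k) {
    Set<String> result = new HashSet<String>();
    result.add(term);
    if (k == 0) return result;
    for (int i = 0; i < term.length(); i++) {
      String deleted = term.substring(0, i) + term.substring(i + 1);
      result.addAll(variants(deleted, k - 1));   // recurse for further deletions
    }
    return result;
  }
}
// variants("smith", 1) = {smith, mith, sith, smth, smih, smit}
{code}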

{quote}
but the more intelligent stuff you speak of could be really cool esp. for 
spellchecking, sure you dont want to rewrite our spellchecker?

btw its not clear to me yet, could you implement that stuff on top of ghetto 
DFA (the sorted terms dict we have now) or is something more sophisticated 
needed? its a lot easier to write this stuff now with the flex MTQ apis 
{quote}

I really would love to, but I was paid before to work on this. 

I guess the ghetto DFA would not work, at least not fast enough (I didn't think 
about it really). Practically you would need to know which characters extend 
the current character in your dictionary, or in DFA parlance, all outgoing 
transitions from the current state. The ghetto DFA cannot do that efficiently?

What would be an idea with flex is to implement this stuff with an in-memory 
trie (full trie or TST), before jumping into the noisy channel (this is easy to add 
later) and a persistent trie-dictionary.  The traversal part is identical, and it 
would make a nice contrib with a useful use case, as the majority of folks have 
enough memory to slurp the complete termDict into memory... It would serve as a proof 
of concept for flex and fuzzyQ, and help you understand the magic of calculating 
edit distance against trie structures. Once you have a trie structure, the sky is 
the limit: prefix, regex... If I remember correctly, there were some trie 
implementations floating around; with them you need just one extra traversal 
method to find all terms at distance N. You can have a look at the 
http://jaspell.sourceforge.net/ TST implementation, class 
TernarySearchTrie.matchAlmost(...) methods. Just as an illustration of what goes 
on there, it is a simple recursive traversal of all terms at a max distance of N.
Later we could tweak memory demand, switch to some more compact trie... and at 
the end add weighted distance and convince Mike to make a blazing fast persistent 
trie :)... in the meantime, the folks with enough memory would have really really 
fast fuzzy, prefix... better distances... 



So much for the theory :) I hope you find these comments useful, even without patches



 


 explore using automaton for fuzzyquery
 --

 Key: LUCENE-2089
 URL: https://issues.apache.org/jira/browse/LUCENE-2089
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor
 Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java


 Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
 itching to write that nasty algorithm)
 we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
 * up front, calculate the maximum required K edits needed to match the users 
 supplied float threshold.
 * for at least small common E up to some max K (1,2,3, etc) we should create 
 a DFA for each E. 
 if the required E is above our supported max, we use dumb mode at first (no 
 seeking, no DFA, just brute force like now).
 As the pq fills, we swap progressively lower DFAs into the enum, based upon 
 the lowest score in the pq.
 This should work well on avg, at high E, you will typically fill the pq very 
 quickly since you will match many terms. 
 This not only provides a mechanism to switch to more efficient DFAs during 
 enumeration, but also to switch from dumb mode to smart mode.
 i modified my wildcard benchmark to generate random fuzzy queries.
 * Pattern: 7N stands for NNN, etc.
 * AvgMS_DFA: this is the time spent creating the automaton (constructor)
 ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
 |7N|10|64.0|4155.9|38.6|20.3|
 |14N|10|0.0|2511.6|46.0|37.9| 
 |28N|10|0.0|2506.3|93.0|86.6|
 |56N|10|0.0|2524.5|304.4|298.5|
 as you can see, this prototype is no good yet, because it creates the DFA in 
 a slow way. right now it creates an NFA

[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-11 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832424#action_12832424
 ] 

Eks Dev commented on LUCENE-2089:
-

{quote}
 What about this,
http://www.catalysoft.com/articles/StrikeAMatch.html
it seems logically more appropriate to (human-entered) text objects than 
Levenshtein distance, and it is (in theory) extremely fast; is DFA-distance 
faster? 
{quote}

Is it only me who sees plain, vanilla bigram distance here? What is new or 
better in StrikeAMatch compared to the first phase of the current SpellChecker 
(feeding a PriorityQueue with candidates)? 

If you need to use this, nothing simpler: you do not even need pair comparison 
(aka traversal), just index terms split into bigrams and search with a standard 
Query. 


The Automaton trick is a neat one. Imo, the only thing that would work better is to 
make the term dictionary a real trie (ternary, n-ary, dfa, makes no big diff). Making 
the TermDict some sort of trie/dfa would permit smart beam-search, even without 
compiling the query DFA. Beam search also makes implementation of better distances 
possible (weighted edit distance without the metric constraint). I guess this is 
going to be possible with Flex, Mike was already talking about a DFA Dictionary 
:)

It took a while to figure out the trick Robert pulled here, treating the term 
dictionary as another DFA due to its sortedness, nice. 

 explore using automaton for fuzzyquery
 --

 Key: LUCENE-2089
 URL: https://issues.apache.org/jira/browse/LUCENE-2089
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor
 Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java


 Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
 itching to write that nasty algorithm)
 we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
 * up front, calculate the maximum required K edits needed to match the users 
 supplied float threshold.
 * for at least small common E up to some max K (1,2,3, etc) we should create 
 a DFA for each E. 
 if the required E is above our supported max, we use dumb mode at first (no 
 seeking, no DFA, just brute force like now).
 As the pq fills, we swap progressively lower DFAs into the enum, based upon 
 the lowest score in the pq.
 This should work well on avg, at high E, you will typically fill the pq very 
 quickly since you will match many terms. 
 This not only provides a mechanism to switch to more efficient DFAs during 
 enumeration, but also to switch from dumb mode to smart mode.
 i modified my wildcard benchmark to generate random fuzzy queries.
 * Pattern: 7N stands for NNN, etc.
 * AvgMS_DFA: this is the time spent creating the automaton (constructor)
 ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
 |7N|10|64.0|4155.9|38.6|20.3|
 |14N|10|0.0|2511.6|46.0|37.9| 
 |28N|10|0.0|2506.3|93.0|86.6|
 |56N|10|0.0|2524.5|304.4|298.5|
 as you can see, this prototype is no good yet, because it creates the DFA in 
 a slow way. right now it creates an NFA, and all this wasted time is in 
 NFA-DFA conversion.
 So, for a very long string, it just gets worse and worse. This has nothing to 
 do with lucene, and here you can see, the TermEnum is fast (AvgMS - 
 AvgMS_DFA), there is no problem there.
 instead we should just build a DFA to begin with, maybe with this paper: 
 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
 we can precompute the tables with that algorithm up to some reasonable K, and 
 then I think we are ok.
 the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
 linear minimization, if someone wants to implement this they should not worry 
 about minimization.
 in fact, we need to at some point determine if AutomatonQuery should even 
 minimize FSM's at all, or if it is simply enough for them to be deterministic 
 with no transitions to dead states. (The only code that actually assumes 
 minimal DFA is the Dumb vs Smart heuristic and this can be rewritten as a 
 summation easily). we need to benchmark really complex DFAs (i.e. write a 
 regex benchmark) to figure out if minimization is even helping right now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-11 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832741#action_12832741
 ] 

Eks Dev commented on LUCENE-2089:
-

{quote}
I assume you mean by weighted edit distance that the transitions in the state 
machine would have costs?
{quote}

Yes, kind of; not embedded in the trie, just defined externally.

What I am talking about is part of the noisy channel approach, modeling only 
the channel distribution. Have a look at http://norvig.com/spell-correct.html 
for the basic theory. I am suggesting almost the same, just applied at character 
level and without the language model part. It is rather easy once you have your 
dictionary in some sort of tree structure.


You guide your traversal over the trie by iterating on each char in your 
search term, accumulating log probabilities of single transformations 
(recycling the prefix part). When you hit a leaf, insert into a PriorityQueue of 
appropriate depth. The probabilities of single transformations I mean are 
defined as:
insertion(char a) // map char -> log probability (think of it as a kind of cost 
of inserting this particular character)
deletion(char a) // map char -> log probability...
transposition(char a, char b)
replacement(char a, char b) // 2D matrix (char, char) -> probability (cost)
If you wish, you could even add some positional information, boosting matches at 
the start/end of the string.

I avoided the tricky mechanics of traversal, insertion, deletion, but on a trie you 
can do it by following different paths... 
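
Ignoring the trie part (which only recycles the shared prefix of the computation), 
the cost maps above plug into the usual DP like this (a sketch; costs are negative 
log probabilities, lower = more likely):

{code}
import java.util.HashMap;
import java.util.Map;

public class WeightedEditDistance {
  final Map<Character, Double> insertCost  = new HashMap<Character, Double>();
  final Map<Character, Double> deleteCost  = new HashMap<Character, Double>();
  final Map<String, Double>    replaceCost = new HashMap<String, Double>(); // key "ab"
  double defaultCost = 1.0;

  double ins(char c) { Double d = insertCost.get(c); return d != null ? d : defaultCost; }
  double del(char c) { Double d = deleteCost.get(c); return d != null ? d : defaultCost; }
  double rep(char a, char b) {
    if (a == b) return 0.0;
    Double d = replaceCost.get("" + a + b);
    return d != null ? d : defaultCost;
  }

  public double distance(String s, String t) {
    double[][] dp = new double[s.length() + 1][t.length() + 1];
    for (int i = 1; i <= s.length(); i++) dp[i][0] = dp[i - 1][0] + del(s.charAt(i - 1));
    for (int j = 1; j <= t.length(); j++) dp[0][j] = dp[0][j - 1] + ins(t.charAt(j - 1));
    for (int i = 1; i <= s.length(); i++) {
      for (int j = 1; j <= t.length(); j++) {
        dp[i][j] = Math.min(
            dp[i - 1][j - 1] + rep(s.charAt(i - 1), t.charAt(j - 1)),  // replace/match
            Math.min(dp[i - 1][j] + del(s.charAt(i - 1)),              // delete from s
                     dp[i][j - 1] + ins(t.charAt(j - 1))));            // insert into s
      }
    }
    return dp[s.length()][t.length()];
  }
}
{code}

With a cheap deletion cost for 'h', distance("Thomas", "Tomas") drops well below the 
uniform-cost value of 1, which is exactly the Thomas/Tomas point below.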

The only good implementation (in memory) out there I know of is in the LingPipe 
spell checker (they implement the full noisy channel, with the language model driving 
the traversal)... it has huge educational value, Bob is really great at explaining 
things. The code itself is proprietary. 
I would suggest you peek into this code to see this 2-minute rambling I 
wrote here properly explained :) Just ignore the language model part and assume 
you have a NULL language model (all chars in the language are equally probable), 
doing a full traversal over the trie. 

{quote}
If this is the case couldn't we even define standard levenshtein very easily 
(instead of nasty math), and would the beam search technique enumerate 
efficiently for us?
{quote}
Standard Levenshtein is trivially configured once you have this, it is just setting 
all these costs to 1 (delete, insert... in the log domain)... But who would use 
the standard distance with such a beast, when you can reduce the impact of 
inserting/deleting a silent "h" as in Thomas/Tomas... 
Enumeration is trie traversal, practically calculating the distance against all 
terms at the same time and collecting the N best along the way. The place where 
you save time is recycling the prefix part of this calculation. Enumeration is 
optimal as the trie contains only the terms from the termDict, you are not 
trying all possible alphabet characters, and you can implement early path 
abandoning easily, either by cost (log probability) and/or by limiting the 
number of successive insertions.

If you are interested in really in-depth things, look at 
http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198
Great book (another great tip from b...@lingpipe). A bit strange with 
terminology (at least to me), but once you get used to it, it is really worth the 
time you spend trying to grasp it.




 

 explore using automaton for fuzzyquery
 --

 Key: LUCENE-2089
 URL: https://issues.apache.org/jira/browse/LUCENE-2089
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor
 Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java


 Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
 itching to write that nasty algorithm)
 we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
 * up front, calculate the maximum required K edits needed to match the users 
 supplied float threshold.
 * for at least small common E up to some max K (1,2,3, etc) we should create 
 a DFA for each E. 
 if the required E is above our supported max, we use dumb mode at first (no 
 seeking, no DFA, just brute force like now).
 As the pq fills, we swap progressively lower DFAs into the enum, based upon 
 the lowest score in the pq.
 This should work well on avg, at high E, you will typically fill the pq very 
 quickly since you will match many terms. 
 This not only provides a mechanism to switch to more efficient DFAs during 
 enumeration, but also to switch from dumb mode to smart mode.
 i modified my wildcard benchmark to generate random fuzzy queries.
 * Pattern: 7N stands for NNN, etc.
 * AvgMS_DFA: this is the time spent creating the automaton (constructor)
 ||Pattern||Iter||AvgHits||AvgMS(old

[jira] Commented: (LUCENE-1410) PFOR implementation

2009-10-06 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762742#action_12762742
 ] 

Eks Dev commented on LUCENE-1410:
-

Mike, 
That is definitely the way to go: distribution dependent encoding, where every 
Term gets individual treatment.
  
Take for example a simple, but not all that rare case where the index gets sorted 
on some of the indexed fields (we use it really extensively, e.g. a presorted doc 
collection on user_rights/zip/city, all indexed). There you get perfectly 
compressible postings by simply managing intervals of set bits. Updates 
distort this picture, but we rebuild the index periodically and all gets good 
again.  At the moment we load them into RAM as Filters in IntervalSets; if that 
were possible in lucene, we wouldn't bother with Filters (VInt decoding on 
such super dense fields was killing us, even in RAMDirectory)...  

Thinking about your comments, isn't pulsing somewhat orthogonal to the packing 
method? For example, if you load the index into RAMDirectory, one could avoid one 
indirection level and inline all postings.

Flex Indexing rocks, that is going to be the most important addition to lucene 
since it started (imo)... I would even bet on double search speed in a first 
attempt for average queries :)

Cheers, 
eks 

 PFOR implementation
 ---

 Key: LUCENE-1410
 URL: https://issues.apache.org/jira/browse/LUCENE-1410
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Reporter: Paul Elschot
Priority: Minor
 Attachments: autogen.tgz, LUCENE-1410-codecs.tar.bz2, 
 LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, 
 LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, 
 TestPFor2.java

   Original Estimate: 21840h
  Remaining Estimate: 21840h

 Implementation of Patched Frame of Reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-1410) PFOR implementation

2009-10-06 Thread eks dev
Paul,
the point I was trying to make with this example was extreme, but realistic. 
Imagine 100Mio docs, sorted on the field user_rights; a term user_rights:XX 
selects 40Mio of them (user rights...). To encode this, you need a format with 
two integers (for more such intervals you would need slightly more, but 
nevertheless much less than for OpenBitSet, VInts, PFor...). Strictly 
speaking this term is dense, but it is highly compressible and could be inlined with 
the pulsing trick...
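
Something like this is all the decoding such a term would ever need (rough sketch, 
not lucene code; follows the 2.9/3.x DocIdSetIterator contract, i.e. no cost() etc.):

import org.apache.lucene.search.DocIdSetIterator;

// iterates one dense interval [start, end), the whole "two integers" encoding
class RangeDocIdSetIterator extends DocIdSetIterator {
  private final int start, end;   // end is exclusive
  private int doc = -1;

  RangeDocIdSetIterator(int start, int end) {
    this.start = start;
    this.end = end;
  }

  @Override
  public int docID() { return doc; }

  @Override
  public int nextDoc() {
    if (doc == NO_MORE_DOCS) return NO_MORE_DOCS;
    doc = (doc < start) ? start : doc + 1;
    if (doc >= end) doc = NO_MORE_DOCS;
    return doc;
  }

  @Override
  public int advance(int target) {
    if (doc == NO_MORE_DOCS) return NO_MORE_DOCS;
    doc = Math.max(target, start);
    if (doc >= end) doc = NO_MORE_DOCS;
    return doc;
  }
}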

cheers, eks  





From: Paul Elschot paul.elsc...@xs4all.nl
To: java-dev@lucene.apache.org
Sent: Tuesday, 6 October, 2009 23:33:03
Subject: Re: [jira] Commented: (LUCENE-1410) PFOR implementation

Eks,


 
 [ 
 https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762742#action_12762742
  ] 
 
 Eks Dev commented on LUCENE-1410:
 -
 
 Mike, 
 That is definitely the way to go, distribution dependent encoding, where 
 every Term gets individual treatment.
 
 Take for an example simple, but not all that rare case where Index gets 
 sorted on some of the indexed fields (we use it really extensively, e.g. 
 presorted doc collection on user_rights/zip/city, all indexed). There you 
 get perfectly compressible  postings by simply managing intervals of set 
 bits. Updates distort this picture, but we rebuild index periodically and 
 all gets good again.  At the moment we load them into RAM as Filters in 
 IntervalSets. if that would be possible in lucene, we wouldn't bother with 
 Filters (VInt decoding on such super dense fields was killing us, even in 
 RAMDirectory) ... 


You could try switching the Filter to OpenBitSet when that takes fewer bytes 
than SortedVIntList.


Regards,
Paul Elschot





  

Re: [jira] Commented: (LUCENE-1410) PFOR implementation

2009-10-06 Thread eks dev
If you drive this example further, in combination with flex-indexing 
permitting a per-term postings format, I could imagine some nice tools for 
optimizeHard(): normal index construction works with the defaults, as planned 
for the solid mixed-performance case, and at the end you run optimizeHard(), where 
postings get re-sorted on such fields (basically enabling RLE encoding to work) 
and at the same time all other terms get the optimal encoding format for their 
postings... perfect for read-only indexes where you want to max performance and 
reduce index size



From: eks dev eks...@yahoo.co.uk
To: java-dev@lucene.apache.org
Sent: Tuesday, 6 October, 2009 23:59:12
Subject: Re: [jira] Commented: (LUCENE-1410) PFOR implementation


Paul,
the point I was trying to make with this example was extreme,  but realistic. 
Imagine 100Mio docs, sorted on field user_rights,  a term user_rights:XX 
selects 40Mio of them (user rights...). To encode this, you need format with  
two integers (for more of such intervals you would need slightly more, but 
nevertheless, much less than for OpenBitSet, VInts, PFor...  ). Strictly 
speaking this term is dense, but highly compressible and could be inlined with 
pulsing trick...

cheers, eks  





From: Paul Elschot paul.elsc...@xs4all.nl
To: java-dev@lucene.apache.org
Sent: Tuesday, 6 October, 2009 23:33:03
Subject: Re: [jira] Commented: (LUCENE-1410) PFOR implementation

Eks,


 
 [ 
 https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762742#action_12762742
  ] 
 
 Eks Dev commented on LUCENE-1410:
 -
 
 Mike, 
 That is definitely the way to go, distribution dependent encoding, where 
 every Term gets individual treatment.
 
 Take for an example simple, but not all that rare case where Index gets 
 sorted on some of the indexed fields (we use it really extensively, e.g. 
 presorted doc collection on user_rights/zip/city, all indexed). There you 
 get perfectly compressible  postings by simply managing intervals of 
 set bits. Updates distort this picture, but we rebuild index periodically 
 and all gets good again.  At the moment we load them into RAM as Filters 
 in IntervalSets. if that would be possible in lucene, we wouldn't bother 
 with Filters (VInt decoding on such super dense fields was killing us, 
 even in RAMDirectory) ... 


You could try switching the Filter to OpenBitSet when that takes fewer bytes 
than SortedVIntList.


Regards,
Paul Elschot






  

[jira] Commented: (LUCENE-1762) Slightly more readable code in TermAttributeImpl

2009-07-27 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12735809#action_12735809
 ] 

Eks Dev commented on LUCENE-1762:
-

cool, thanks for the review.   

 Slightly more readable code in TermAttributeImpl 
 -

 Key: LUCENE-1762
 URL: https://issues.apache.org/jira/browse/LUCENE-1762
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Eks Dev
Assignee: Uwe Schindler
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1762.patch, LUCENE-1762.patch, LUCENE-1762.patch, 
 LUCENE-1762.patch


 No big deal. 
 growTermBuffer(int newSize) was using correct, but slightly hard to follow 
 code. 
 the method was returning null as a hint that the current termBuffer has 
 enough space to the upstream code or reallocated buffer.
 this patch simplifies logic   making this method to only reallocate buffer, 
 nothing more.  
 It reduces number of if(null) checks in a few methods and reduces amount of 
 code. 
 all tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1762) Slightly more readable code in TermAttributeImpl

2009-07-25 Thread Eks Dev (JIRA)
Slightly more readable code in TermAttributeImpl 
-

 Key: LUCENE-1762
 URL: https://issues.apache.org/jira/browse/LUCENE-1762
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial


No big deal. 

growTermBuffer(int newSize) was using correct, but slightly hard to follow, 
code. 

The method was returning to the upstream code either null, as a hint that the 
current termBuffer already has enough space, or the reallocated buffer.

This patch simplifies the logic, making the method only reallocate the buffer, 
nothing more. 
It reduces the number of if(null) checks in a few methods and reduces the amount 
of code. 
All tests pass.
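
A rough before/after sketch of what that means in code (hypothetical, not the actual 
patch; nextSize() stands in for ArrayUtil.getNextSize(), and copying of the old buffer 
contents is omitted):

final class TermBufferHolder {
  private char[] termBuffer;

  // Before: null is returned as a "nothing to do" hint, forcing if (null) checks upstream.
  char[] growTermBufferOld(int newSize) {
    if (termBuffer != null && termBuffer.length >= newSize) {
      return null;
    }
    return new char[nextSize(newSize)];
  }

  // After: the method only (re)allocates; callers simply use termBuffer afterwards.
  void growTermBufferNew(int newSize) {
    if (termBuffer == null || termBuffer.length < newSize) {
      termBuffer = new char[nextSize(newSize)];
    }
  }

  private static int nextSize(int minSize) {
    int size = 16;
    while (size < minSize) size *= 2; // simple doubling growth policy
    return size;
  }
}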

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1762) Slightly more readable code in TermAttributeImpl

2009-07-25 Thread Eks Dev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1762:


Attachment: LUCENE-1762.patch

 Slightly more readable code in TermAttributeImpl 
 -

 Key: LUCENE-1762
 URL: https://issues.apache.org/jira/browse/LUCENE-1762
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Attachments: LUCENE-1762.patch


 No big deal. 
 growTermBuffer(int newSize) was using correct, but slightly hard to follow 
 code. 
 the method was returning null as a hint that the current termBuffer has 
 enough space to the upstream code or reallocated buffer.
 this patch simplifies logic   making this method to only reallocate buffer, 
 nothing more.  
 It reduces number of if(null) checks in a few methods and reduces amount of 
 code. 
 all tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1762) Slightly more readable code in TermAttributeImpl

2009-07-25 Thread Eks Dev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1762:


Attachment: LUCENE-1762.patch

made the changes in Token along the same lines, 

- had to change one constant in TokenTest as I have changed the initial allocation 
policy of termBuffer to be consistent with ArrayUtil.getNextSize()

if (termBuffer == null)

NEW:
termBuffer = new char[ArrayUtil.getNextSize(newSize < MIN_BUFFER_SIZE ? 
MIN_BUFFER_SIZE : newSize)]; 

OLD:
termBuffer = new char[newSize < MIN_BUFFER_SIZE ? MIN_BUFFER_SIZE : newSize]; 

not sure if this is better, but it looks more consistent to me (the buffer size is 
always determined via getNextSize())

Uwe, 
setOnlyUseNewAPI(false) does not exist, it was removed with one of the recent 
patches. It gets automatically detected via reflection?



 Slightly more readable code in TermAttributeImpl 
 -

 Key: LUCENE-1762
 URL: https://issues.apache.org/jira/browse/LUCENE-1762
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Assignee: Uwe Schindler
Priority: Trivial
 Attachments: LUCENE-1762.patch, LUCENE-1762.patch


 No big deal. 
 growTermBuffer(int newSize) was using correct, but slightly hard to follow 
 code. 
 the method was returning null as a hint that the current termBuffer has 
 enough space to the upstream code or reallocated buffer.
 this patch simplifies logic   making this method to only reallocate buffer, 
 nothing more.  
 It reduces number of if(null) checks in a few methods and reduces amount of 
 code. 
 all tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1762) Slightly more readable code in TermAttributeImpl

2009-07-25 Thread Eks Dev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1762:


Attachment: LUCENE-1762.patch

- made the allocation in initTermBuffer() consistent with 
ArrayUtil.getNextSize(int) - it is ok not to start with MIN_BUFFER_SIZE, but 
rather with ArrayUtil.getNextSize(MIN_BUFFER_SIZE)... e.g. in case getNextSize gets 
very sensitive to initial conditions one day...
 
- null-ed termText on the switch to termBuffer in resizeTermBuffer (as it was 
before!). This was a bug in the previous patch.

 Slightly more readable code in TermAttributeImpl 
 -

 Key: LUCENE-1762
 URL: https://issues.apache.org/jira/browse/LUCENE-1762
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Assignee: Uwe Schindler
Priority: Trivial
 Attachments: LUCENE-1762.patch, LUCENE-1762.patch, LUCENE-1762.patch


 No big deal. 
 growTermBuffer(int newSize) was using correct, but slightly hard to follow 
 code. 
 the method was returning null as a hint that the current termBuffer has 
 enough space to the upstream code or reallocated buffer.
 this patch simplifies logic   making this method to only reallocate buffer, 
 nothing more.  
 It reduces number of if(null) checks in a few methods and reduces amount of 
 code. 
 all tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Java caching of low-level index data?

2009-07-22 Thread eks dev
imo, it is too low level to do better than the OS does. I agree, the cache-unloading 
effect would be prevented with it, but I am not sure it brings a net-net 
benefit: you would get this problem fixed, but the OS would probably kill you 
anyhow (you took valuable memory from the OS) on queries that miss your internal 
cache...

We could try to do better if we put more focus on higher levels and do the 
caching there... maybe even cache somehow some CPU work, e.g. keep dense 
postings in a faster, less compressed format, load the TermDictionary into a 
RAMDirectory and keep the rest on disk... Ideas in that direction have a better 
chance of bringing us forward. Take for example FuzzyQuery: there you can do some 
LRU caching at the Term level and save huge amounts of IO and CPU... 
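
For the FuzzyQuery case, the term-level LRU cache can be as small as this (a generic 
sketch, not tied to any Lucene class; not thread-safe as written, so a shared instance 
would need to be wrapped with Collections.synchronizedMap):

import java.util.LinkedHashMap;
import java.util.Map;

/** Evicts the least-recently-used entry once maxEntries is exceeded. */
final class TermLruCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  TermLruCache(int maxEntries) {
    super(16, 0.75f, true); // accessOrder = true gives LRU iteration order
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxEntries;
  }
}

Keyed on the fuzzy source term, the value could hold the expanded terms or their doc 
ids per reader; it would have to be invalidated whenever the reader is reopened.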






From: Shai Erera ser...@gmail.com
To: java-dev@lucene.apache.org
Sent: Wednesday, 22 July, 2009 17:32:34
Subject: Re: Java caching of low-level index data?


That's an interesting idea.

I always wonder however how much exactly would we gain, vs. the effort spent 
to develop, debug and maintain it. Just some thoughts that we should consider 
regarding this:

* For very large indices, where we think this will generally be good for, I 
believe it's reasonable to assume that the search index will sit on its own 
machine, or set of CPUs, RAM and HD. Therefore given that very few will run on 
the OS other than the search index, I assume the OS cache will be enough (if 
not better)?

* In other cases, where the search app runs together w/ other apps, I'm not 
sure how much we'll gain. I can assume such apps will use a smaller index, or 
will not need to support high query load? If so, will they really care if we 
cache their data, vs. the OS?

Like I said, these are just thoughts. I don't mean to cancel the idea w/ them, 
just to think how much will it improve performance (vs. maybe even hurt it?). 
Often I find it that some optimizations that are done will benefit very large 
indices. But these usually get their decent share of resources, and the JVM 
itself is run w/ larger heap etc. So these optimizations turn out to not 
affect such indices much after all. And for smaller indices, performance is 
usually not a problem (well ... they might just fit entirely in RAM).

Shai


On Wed, Jul 22, 2009 at 6:21 PM, Nigel nigelspl...@gmail.com wrote:

In discussions of Lucene search performance, the importance of OS caching of 
index data is frequently mentioned.  The typical recommendation is to keep 
plenty of unallocated RAM available (e.g. don't gobble it all up with your 
JVM heap) and try to avoid large I/O operations that would purge the OS 
cache.

I'm curious if anyone has thought about (or even tried) caching the low-level 
index data in Java, rather than in the OS.  For example, at the IndexInput 
level there could be an LRU cache of byte[] blocks, similar to how a RDBMS 
caches index pages.  (Conveniently, BufferedIndexInput already reads in 1k 
chunks.) You would reverse the advice above and instead make your JVM heap as 
large as possible (or at least large enough to achieve a desired speed/space 
tradeoff). 

This approach seems like it would have some advantages:

- Explicit control over how much you want cached (adjust your JVM heap and 
cache settings as desired)
- Cached index data won't be purged by the OS doing other things

- Index warming might be faster, or at least more predictable

The obvious disadvantage for some situations is that more RAM would now be 
tied up by the JVM, rather than managed dynamically by the OS.

Any thoughts?  It seems like this would be pretty easy to implement (subclass 
FSDirectory, return subclass of FSIndexInput that checks the cache before 
reading, cache keyed on filename + position), but maybe I'm oversimplifying, 
and for that matter a similar implementation may already exist somewhere for 
all I know.
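
The check-the-cache-before-reading path with a filename + position key could look 
roughly like this (hypothetical sketch, not FSDirectory/FSIndexInput code; the map is 
deliberately unbounded here and would need an eviction policy such as LRU in practice):

import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class BlockCache {
  static final int BLOCK_SIZE = 1024; // matches the 1k chunks mentioned above

  static final class Key {
    final String fileName;
    final long blockIndex;
    Key(String fileName, long blockIndex) { this.fileName = fileName; this.blockIndex = blockIndex; }
    @Override public boolean equals(Object o) {
      return o instanceof Key && ((Key) o).blockIndex == blockIndex
          && ((Key) o).fileName.equals(fileName);
    }
    @Override public int hashCode() { return Objects.hash(fileName, blockIndex); }
  }

  private final Map<Key, byte[]> blocks = new ConcurrentHashMap<Key, byte[]>();

  /** Returns the cached block, or loads it via the supplied reader and caches it. */
  byte[] get(String fileName, long filePointer, Function<Long, byte[]> readBlock) {
    long blockIndex = filePointer / BLOCK_SIZE;
    return blocks.computeIfAbsent(new Key(fileName, blockIndex),
        k -> readBlock.apply(k.blockIndex * BLOCK_SIZE));
  }
}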

Thanks,
Chris




  

Re: Java caching of low-level index data?

2009-07-22 Thread eks dev

this should not be all that difficult to try. I accept it makes sense in some 
cases ... but which ones?
Background: all my attempts to fight the OS went bad :(

Let us think again about what Mike's example means.

You are explicitly deciding that Lucene should get a bigger share of RAM. The OS 
will unload these pages 
 if it needs Lucene's RAM for something else and you are not using them. Right?

If something else should get fewer resources, we are on target, but that is the 
end result. For any shared setup where many things run, this decision has its 
consequences: something else is going to be starved. 

In the other case, where only Lucene runs, what is the difference whether we evict 
unused pages or the OS does it (better control is the only benefit we get)? 
This is the case where you are anyhow not really in a comfortable situation for 
real caching; otherwise even greedy OSs wouldn't swap (at least in my 
experience with reasonably configured OSs)... 

After thinking about it again, I would say: yes, there are for sure some cases 
where it helps, but not many, and even in those cases the benefit will be 
small.

I guess :)






- Original Message 
 From: Michael McCandless luc...@mikemccandless.com
 To: java-dev@lucene.apache.org
 Sent: Wednesday, 22 July, 2009 18:37:19
 Subject: Re: Java caching of low-level index data?
 
 I think it's a neat idea!
 
 But you are in fact fighting the OS so I'm not sure how well this'll
 work in practice.
 
 EG the OS will happily swap out pages from your process if it thinks
 you're not using them, so it'd easily swap out your cache in favor of
 its own IO cache (this is the swappiness configuration on Linux),
 which would then kill performance (take a page hit when you finally
 did need to use your cache).  In C (possibly requiring root) you could
 wire the pages, but we can't do that from javaland, so it's already
 not a fair fight.
 
 Mike
 
  On Wed, Jul 22, 2009 at 11:56 AM, eks dev wrote:
  imo, it is too low level to do it better than OSs. I agree, cache unloading
  effect would be prevented with it, but I am not sure if it brings net-net
  benefit, you would get this problem fixed, but probably OS would kill you
  anyhow (you took valuable memory from OS) on queries that miss your internal
  cache...
 
  We could try to do better if we put more focus on higher levels and do the
  caching there... maybe even cache somhow some CPU work, e.g.  keep dense
  Postings in faster, less compressed format, load TermDictionary into
  RAMDirectory and keep the rest on disk.. Ideas in that direction have better
  chance to bring us forward. Take for example FuzzyQuery, there you can do
  some LRU caching at Term level and and save huge amounts of IO and CPU...
 
 
 
 
  From: Shai Erera 
  To: java-dev@lucene.apache.org
  Sent: Wednesday, 22 July, 2009 17:32:34
  Subject: Re: Java caching of low-level index data?
 
  That's an interesting idea.
 
  I always wonder however how much exactly would we gain, vs. the effort spent
  to develop, debug and maintain it. Just some thoughts that we should
  consider regarding this:
 
  * For very large indices, where we think this will generally be good for, I
  believe it's reasonable to assume that the search index will sit on its own
  machine, or set of CPUs, RAM and HD. Therefore given that very few will run
  on the OS other than the search index, I assume the OS cache will be enough
  (if not better)?
 
  * In other cases, where the search app runs together w/ other apps, I'm not
  sure how much we'll gain. I can assume such apps will use a smaller index,
  or will not need to support high query load? If so, will they really care if
  we cache their data, vs. the OS?
 
  Like I said, these are just thoughts. I don't mean to cancel the idea w/
  them, just to think how much will it improve performance (vs. maybe even
  hurt it?). Often I find it that some optimizations that are done will
  benefit very large indices. But these usually get their decent share of
  resources, and the JVM itself is run w/ larger heap etc. So these
  optimizations turn out to not affect such indices much after all. And for
  smaller indices, performance is usually not a problem (well ... they might
  just fit entirely in RAM).
 
  Shai
 
  On Wed, Jul 22, 2009 at 6:21 PM, Nigel wrote:
 
  In discussions of Lucene search performance, the importance of OS caching
  of index data is frequently mentioned.  The typical recommendation is to
  keep plenty of unallocated RAM available (e.g. don't gobble it all up with
  your JVM heap) and try to avoid large I/O operations that would purge the 
  OS
  cache.
 
  I'm curious if anyone has thought about (or even tried) caching the
  low-level index data in Java, rather than in the OS.  For example, at the
  IndexInput level there could be an LRU cache of byte[] blocks, similar to
  how a RDBMS caches index pages.  (Conveniently, BufferedIndexInput already
  reads in 1k chunks.) You would 

[jira] Commented: (LUCENE-1743) MMapDirectory should only mmap large files, small files should be opened using SimpleFS/NIOFS

2009-07-14 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731085#action_12731085
 ] 

Eks Dev commented on LUCENE-1743:
-

indeed! obvious idea, 

the only thing I do not like about it is making these hidden, deceptive 
decisions: I said I want MMapDirectory and someone else decided something else 
for me... it does not matter if we have consensus here now, it may change 
tomorrow.

A probably better way would be to turbo-charge FileSwitchDirectory with sexy 
parametrization options, 
MMapDirectory - F(fileExtension, minSize, maxSize) // if the file has this 
extension and its size is less than maxSize and greater than minSize, then open 
the file with MMapDirectory... then go on to the next rule... (it can be designed 
upside down as well... changes nothing in the idea)

the same for RAMDir, NIO, FS... 

With this, we can make UwesBestOfMMapDirectoryFor32BitOSs (your proposal here) 
or 
HighlyConcurentForWindows64WithTermDictionaryInRamAndStoredFieldsOnDiskDirectory
 just for me :) 

So most of the end users take some smart defaults we provide in core, and 
freaks (Expert users in official lingo :) have their job easy: just 
configure TurboChargedFileSwitchDirectory.

It should be easy to come up with a clean design for these concrete Directory 
selection rules while keeping the concrete Directories pure
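
A sketch of what such selection rules could look like, with the concrete Directories 
kept out of the rule objects (hypothetical names, not an existing Lucene API):

interface DirectorySelectionRule {
  boolean applies(String fileName, long fileSizeBytes);
  String backendName(); // e.g. "mmap", "nio", "ram" -- just a label in this sketch
}

final class ExtensionAndSizeRule implements DirectorySelectionRule {
  private final String extension;
  private final long minSize, maxSize;
  private final String backend;

  ExtensionAndSizeRule(String extension, long minSize, long maxSize, String backend) {
    this.extension = extension;
    this.minSize = minSize;
    this.maxSize = maxSize;
    this.backend = backend;
  }

  public boolean applies(String fileName, long fileSizeBytes) {
    return fileName.endsWith(extension) && fileSizeBytes >= minSize && fileSizeBytes <= maxSize;
  }

  public String backendName() { return backend; }
}

A switch directory would simply walk an ordered list of such rules per file and fall 
through to a default backend.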

Cheers, Eks 




 MMapDirectory should only mmap large files, small files should be opened 
 using SimpleFS/NIOFS
 -

 Key: LUCENE-1743
 URL: https://issues.apache.org/jira/browse/LUCENE-1743
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1


 This is a followup to LUCENE-1741:
 Javadocs state (in FileChannel#map): For most operating systems, mapping a 
 file into memory is more expensive than reading or writing a few tens of 
 kilobytes of data via the usual read and write methods. From the standpoint 
 of performance it is generally only worth mapping relatively large files into 
 memory.
 MMapDirectory should get a user-configureable size parameter that is a lower 
 limit for mmapping files. All files with a size < limit should be opened using 
 a conventional IndexInput from SimpleFS or NIO (another configuration option 
 for the fallback?).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1743) MMapDirectory should only mmap large files, small files should be opened using SimpleFS/NIOFS

2009-07-14 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731104#action_12731104
 ] 

Eks Dev commented on LUCENE-1743:
-

right, it is not all about reading the index, you have to write it as well...

why not make it an abstract class with 
abstract Directory getDirectory(String file, int minSize, int maxSize, String 
[read/write/append], String context);
String getName(); // for logging
   
What do you understand under context? Something along the lines of Give me a 
directory for segment merges, read-only for search. 
...Maybe one day we will have the possibility not to kill the OS cache by merging.



 MMapDirectory should only mmap large files, small files should be opened 
 using SimpleFS/NIOFS
 -

 Key: LUCENE-1743
 URL: https://issues.apache.org/jira/browse/LUCENE-1743
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1


 This is a followup to LUCENE-1741:
 Javadocs state (in FileChannel#map): For most operating systems, mapping a 
 file into memory is more expensive than reading or writing a few tens of 
 kilobytes of data via the usual read and write methods. From the standpoint 
 of performance it is generally only worth mapping relatively large files into 
 memory.
 MMapDirectory should get a user-configureable size parameter that is a lower 
 limit for mmapping files. All files with a size < limit should be opened using 
 a conventional IndexInput from SimpleFS or NIO (another configuration option 
 for the fallback?).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Updated: (LUCENE-1741) Make MMapDirectory.MAX_BBUF user configureable to support chunking the index files in smaller parts

2009-07-13 Thread eks dev

I have no test data which size is good, it is just trying out

Sure, for this you need bad OS and large index, you are not as lucky as I am to 
have it  :)

Anyhow, I would argue against a default value. The algorithm is quite simple: 
if you hit OOM on map(), reduce this value until it fits :)
no need to touch it if it works...




- Original Message 
 From: Uwe Schindler (JIRA) j...@apache.org
 To: java-dev@lucene.apache.org
 Sent: Monday, 13 July, 2009 17:21:15
 Subject: [jira] Updated: (LUCENE-1741) Make MMapDirectory.MAX_BBUF user 
 configureable to support chunking the index files in smaller parts
 
 
  [ 
 https://issues.apache.org/jira/browse/LUCENE-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
  
 ]
 
 Uwe Schindler updated LUCENE-1741:
 --
 
 Attachment: LUCENE-1741.patch
 
 Attached is a patch using the JRE_IS_64BIT in Constants. I set the default to 
 256 MiBytes (128 seems to small for large indexes, if the index is e.g. about 
 1.5 GiBytes, you would get 6 junks.
 
 I have no test data which size is good, it is just trying out (and depends 
 e.g. 
 on how often you reboot Windows, as Eks said).
 
  Make MMapDirectory.MAX_BBUF user configureable to support chunking the 
  index 
 files in smaller parts
  
 ---
 
  Key: LUCENE-1741
  URL: https://issues.apache.org/jira/browse/LUCENE-1741
  Project: Lucene - Java
   Issue Type: Improvement
 Affects Versions: 2.9
 Reporter: Uwe Schindler
 Assignee: Uwe Schindler
 Priority: Minor
  Fix For: 2.9
 
  Attachments: LUCENE-1741.patch, LUCENE-1741.patch
 
 
  This is a followup for java-user thred: 
 http://www.lucidimagination.com/search/document/9ba9137bb5d8cb78/oom_with_2_9#9bf3b5b8f3b1fb9b
  It is easy to implement, just add a setter method for this parameter to 
 MMapDir.
 
 -- 
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.
 
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1741) Make MMapDirectory.MAX_BBUF user configureable to support chunking the index files in smaller parts

2009-07-13 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12730560#action_12730560
 ] 

Eks Dev commented on LUCENE-1741:
-

Uwe, you convinced me, I looked at the code, and indeed, there is no performance 
penalty for this. 

what helped me was 1.1G... (I've tried to find the maximum); the max file size is 1.4G 
... but 1.1 is just an OS coincidence, no magic about it. 

I guess 512MB makes a good value; if memory is so fragmented that you cannot 
allocate 0.5G, you definitely have some other problems around. We are 
talking here about VM memory, and even on Windows having 512MB in one block is not 
an issue (or better said, I have never seen problems with this value).

@Paul: It is a misunderstanding, my algorithm was meant to be manual... no 
catching OOM and retrying (I've already burned my fingers on catching 
RuntimeException; do it only when absolutely desperate :). Uwe made this value 
user-settable anyhow.

Thanks Uwe!

  
   

 Make MMapDirectory.MAX_BBUF user configureable to support chunking the index 
 files in smaller parts
 ---

 Key: LUCENE-1741
 URL: https://issues.apache.org/jira/browse/LUCENE-1741
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1741.patch, LUCENE-1741.patch


 This is a followup for java-user thred: 
 http://www.lucidimagination.com/search/document/9ba9137bb5d8cb78/oom_with_2_9#9bf3b5b8f3b1fb9b
 It is easy to implement, just add a setter method for this parameter to 
 MMapDir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: A Comparison of Open Source Search Engines

2009-07-06 Thread eks dev

 Anybody knows other interesting open-source search engines?

 Minion (https://minion.dev.java.net/)



- Original Message 
 From: Earwin Burrfoot ear...@gmail.com
 To: java-dev@lucene.apache.org
 Sent: Monday, 6 July, 2009 23:01:52
 Subject: Re: A Comparison of Open Source Search Engines
 
 I'd say out of these libraries only Lucene and Sphinx are worth mentioning.
 
 There's also MG4J, which wasn't covered and has a nice algorithmic background.
 Anybody knows other interesting open-source search engines?
 
  On Tue, Jul 7, 2009 at 00:39, John Wang wrote:
  Vik did a very nice job.
  One thing the experiment did not mention is that Lucene handles incremental
  updates, whereas many of the other competitors do not. So the indexing
  performance comparison is not really fair.
  -John
 
  On Mon, Jul 6, 2009 at 8:06 AM, Sean Owen wrote:
 
 
  
 http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/
 
  I imagine many of you already saw this -- Lucene does pretty well in
  this shootout.
  The only area it tended to lag, it seems, is memory usage and speed in
  some cases.
 
  -
  To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-dev-h...@lucene.apache.org
 
 
 
 
 
 -- 
 Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
 Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
 ICQ: 104465785
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-06-29 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725168#action_12725168
 ] 

Eks Dev commented on LUCENE-1720:
-

it's a bit late for this issue, but maybe worth thinking about. We could change 
the semantics of this problem completely. Imo, the problem can be reformulated as: 
provide the possibility to cancel running queries on a best-effort basis, with or 
without returning the results collected so far.

That would leave timer management to the end users and keep the issue focused on 
the Lucene core ... Timeout management could then be provided as an example 
somewhere: How to implement timeout management using ...
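
The best-effort part could be as simple as a cooperative flag that long-running 
reader/collector code polls (hypothetical helper, not one of the attached classes):

final class CancellationToken {
  private volatile boolean cancelled;

  void cancel() { cancelled = true; }

  /** Called periodically from hot loops; throws if someone requested cancellation. */
  void checkPoint() {
    if (cancelled) {
      throw new RuntimeException("activity cancelled");
    }
  }
}

Timeout management then becomes a user-side Timer that calls cancel() after N milliseconds.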








 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-06-29 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725182#action_12725182
 ] 

Eks Dev commented on LUCENE-1720:
-

Sure, I just wanted to sharpen the definition of what is a Lucene core issue and what 
we can leave to end users. It is not only about time, but rather about 
canceling search requests (or even better, general activities). 

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Improving TimeLimitedCollector

2009-06-24 Thread eks dev
Re: I think such a parameter should not exist on individual search methods
since it's more of a global setting (i.e., I want my searches to be
limited to 5 seconds, always, not just for a particular query). Right?

I am not sure about this one; we had cases where one physical index served two 
logical indices with different requirements for clients. Having the timeout 
settable per Query is nice to have. 

At the end of the day, with such a timeout you support quality/time compromise 
settings:
if you need all results, be ready to wait longer and set a longer timeout;
if you need SOME results quickly, then reduce this timeout.

That should ideally be a user decision.





From: Shai Erera ser...@gmail.com
To: java-dev@lucene.apache.org
Sent: Wednesday, 24 June, 2009 10:55:50
Subject: Re: Improving TimeLimitedCollector


But TimeLimitingCollector's logic is coded in its collect() method. The top 
scorer calls nextDoc() or advance() on all its sub-scorers, and only when a 
match is found it calls collect().

If we want the sub-scorers to check whether they should abort, we'd need to 
revamp (liked the word :)) TimeLimitingCollector, to be something like 
CheckAbort SegmentMerger uses. I.e., the top scorer will pass such an instance 
to its sub scorers, which will call a TimeLimit.check() or something and if the 
time limit has expired this call will throw a TimeExceededException (like TLC).
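
Something along these lines, for the check() object handed down to the sub-scorers 
(hypothetical names, not the actual TimeLimitingCollector/CheckAbort code):

final class TimeLimit {
  static final class TimeExceededException extends RuntimeException {}

  private final long deadlineMillis;

  TimeLimit(long budgetMillis) {
    this.deadlineMillis = System.currentTimeMillis() + budgetMillis;
  }

  /** Cheap enough to call from nextDoc()/advance() loops every few hundred docs. */
  void check() {
    if (System.currentTimeMillis() > deadlineMillis) {
      throw new TimeExceededException();
    }
  }
}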

We can enable this by adding another parameter to IndexSearcher whether 
searches should be limited by time, and what's the time limit. It will then 
instantiate that object and pass it to its Scorer and so on. I think such a 
parameter should not exist on individual search methods since it's more of a 
global setting (i.e., I want my searches to be limited to 5 seconds, always, 
not just for a particular query). Right?

Another option would be to add a setTimeout method on Query, which will use it 
when it constructs its Scorer. The shortcoming of this is that if I want to use 
someone else's query which did not implement setTimeout, then I'll need to 
build a TimeOutQueryWrapper that will wrap a Query, and implement the timeout 
logic, but that's get complicated.

I think the Collector approach makes the most sense to me, since it's the only 
object I fully control in the search process. I cannot control Query 
implementations, and I cannot control the decisions made by IndexSearcher. But 
I can always wrap someone else's Collector with TLC and pass it to search().

Shai


On Wed, Jun 24, 2009 at 12:26 AM, Jason Rutherglen jason.rutherg...@gmail.com 
wrote:

As we're revamping collectors, weights, and scorers, perhaps we
can push time limiting into the individual subscorers? Currently
on a boolean query, we're timing out the query at the top level
which doesn't work well if the subqueries exceed the time limit.


  

Re: Fuzzy search change

2009-06-18 Thread eks dev

what would be the difference/benefit compared to the standard Lucene SpellChecker? 

If I am not wrong:
- Lucene SpellChecker uses a standard Lucene index as the storage for tokens 
instead of QDBM... meaning a full inverted index with arbitrary N-gram lengths, 
with tf/idf/norms... not only a HashMap<trigram, wordList> 

- SC uses the paradigm give me the N best candidates (by similarity), not only all 
above a cutoff... this similarity depends (standard Lucene Similarity) on N-gram 
frequency (one could even use some sexy norms to fine-tune words...)...  

If I've read your proposal correctly and did not miss something important, my 
suggestion would be to have a look at the Lucene SC 
(http://lucene.apache.org/java/2_3_2/api/contrib-spellchecker/org/apache/lucene/search/spell/SpellChecker.html)
 before you start.
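
For reference, the candidates-then-edit-distance pipeline that both approaches share 
looks roughly like this (toy in-memory sketch; the actual SpellChecker builds the 
n-gram index as a Lucene index and ranks candidates rather than applying a hard cutoff):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TrigramSpell {
  private final Map<String, Set<String>> trigramToWords = new HashMap<String, Set<String>>();

  void add(String word) {
    for (String t : trigrams(word)) {
      Set<String> words = trigramToWords.get(t);
      if (words == null) {
        words = new HashSet<String>();
        trigramToWords.put(t, words);
      }
      words.add(word);
    }
  }

  /** The trigram index narrows the candidate set; edit distance does the final check. */
  List<String> suggest(String query, int maxEditDistance) {
    Set<String> candidates = new HashSet<String>();
    for (String t : trigrams(query)) {
      Set<String> words = trigramToWords.get(t);
      if (words != null) candidates.addAll(words);
    }
    List<String> result = new ArrayList<String>();
    for (String c : candidates) {
      if (editDistance(query, c) <= maxEditDistance) result.add(c);
    }
    return result;
  }

  private static List<String> trigrams(String s) {
    List<String> out = new ArrayList<String>();
    for (int i = 0; i + 3 <= s.length(); i++) out.add(s.substring(i, i + 3));
    return out;
  }

  private static int editDistance(String a, String b) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i;
    for (int j = 0; j <= b.length(); j++) d[0][j] = j;
    for (int i = 1; i <= a.length(); i++) {
      for (int j = 1; j <= b.length(); j++) {
        d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
            d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
      }
    }
    return d[a.length()][b.length()];
  }
}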
 

have fun, 
eks



- Original Message 
 From: Michael McCandless luc...@mikemccandless.com
 To: java-dev@lucene.apache.org
 Sent: Thursday, 18 June, 2009 16:29:59
 Subject: Re: Fuzzy search change
 
 This would make an awesome addition to Lucene!
 
 This is similar to how Lucene's spellchecker identifies candidates, if
 I understand it right.
 
 Would you be able to port it to java?
 
 Mike
 
  On Thu, Jun 18, 2009 at 7:12 AM, Varun Dhussa wrote:
  Hi,
 
  I wrote on this a long time ago, but haven't followed it up. I just finished
  a C++ implementation of a spell check module in my software. I borrowed the
  idea from Xapian. It is to use a trigram index to filter results, and then
  use Edit Distance on the filtered set. Would such a solution be acceptable
  to the Lucene Community? The details of my implementation are as follows:
 
  1) QDBM data store hash map
  2) Trigram tokenizer on the input string
  3) Data store hash(key,value) = (trigram, keyword_id_list)
  4) Use trigram tokenizer and match with the trigram index
  5) Get the IDs within the input cutoff
  6) Run Edit Distance on the list and return
 
  In my tests on a Intel Core 2 Duo with 3 GB RAM and Windows XP 32 bit, it
  runs in 0.5 sec with a keyword record count of about 1,000,000 records.
  This is at least 3-4 times less than the current search times on Lucene.
 
  Since the results can be put in a thread safe hash table structure, the
  trigram search can be distributed over a thread pool also.
 
  Does this seem like a workable suggestion to the community?
 
  Regards
 
  --
  Varun Dhussa
  Product Architect
  CE InfoSystems (P) Ltd
  http://www.mapmyindia.com
 
 
  -
  To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-dev-h...@lucene.apache.org
 
 
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1594) Use source code specialization to maximize search performance

2009-05-07 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12707116#action_12707116
 ] 

Eks Dev commented on LUCENE-1594:
-

huh, it reduces hardware costs 2-3 times for larger setup! great

 Use source code specialization to maximize search performance
 -

 Key: LUCENE-1594
 URL: https://issues.apache.org/jira/browse/LUCENE-1594
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: FastSearchTask.java, LUCENE-1594.patch, 
 LUCENE-1594.patch, LUCENE-1594.patch


 Towards eeking absolute best search performance, and after seeing the
 Java ghosts in LUCENE-1575, I decided to build a simple prototype
 source code specializer for Lucene's searches.
 The idea is to write dynamic Java code, specialized to run a very
 specific query context (eg TermQuery, collecting top N by field, no
 filter, no deletions), compile that Java code, and run it.
 Here're the performance gains when compared to trunk:
  ||Query||Sort||Filt||Deletes||Scoring||Hits||QPS (base)||QPS (new)||%||
 |1|Date (long)|no|no|Track,Max|2561886|6.8|10.6|{color:green}55.9%{color}|
 |1|Date (long)|no|5%|Track,Max|2433472|6.3|10.5|{color:green}66.7%{color}|
 |1|Date (long)|25%|no|Track,Max|640022|5.2|9.9|{color:green}90.4%{color}|
 |1|Date (long)|25%|5%|Track,Max|607949|5.3|10.3|{color:green}94.3%{color}|
 |1|Date (long)|10%|no|Track,Max|256300|6.7|12.3|{color:green}83.6%{color}|
 |1|Date (long)|10%|5%|Track,Max|243317|6.6|12.6|{color:green}90.9%{color}|
 |1|Relevance|no|no|Track,Max|2561886|11.2|17.3|{color:green}54.5%{color}|
 |1|Relevance|no|5%|Track,Max|2433472|10.1|15.7|{color:green}55.4%{color}|
 |1|Relevance|25%|no|Track,Max|640022|6.1|14.1|{color:green}131.1%{color}|
 |1|Relevance|25%|5%|Track,Max|607949|6.2|14.4|{color:green}132.3%{color}|
 |1|Relevance|10%|no|Track,Max|256300|7.7|15.6|{color:green}102.6%{color}|
 |1|Relevance|10%|5%|Track,Max|243317|7.6|15.9|{color:green}109.2%{color}|
 |1|Title (string)|no|no|Track,Max|2561886|7.8|12.5|{color:green}60.3%{color}|
 |1|Title (string)|no|5%|Track,Max|2433472|7.5|11.1|{color:green}48.0%{color}|
 |1|Title (string)|25%|no|Track,Max|640022|5.7|11.2|{color:green}96.5%{color}|
 |1|Title (string)|25%|5%|Track,Max|607949|5.5|11.3|{color:green}105.5%{color}|
 |1|Title (string)|10%|no|Track,Max|256300|7.0|12.7|{color:green}81.4%{color}|
 |1|Title (string)|10%|5%|Track,Max|243317|6.7|13.2|{color:green}97.0%{color}|
 Those tests were run on a 19M doc wikipedia index (splitting each
 Wikipedia doc @ ~1024 chars), on Linux, Java 1.6.0_10
 But: it only works with TermQuery for now; it's just a start.
 It should be easy for others to run this test:
   * apply patch
   * cd contrib/benchmark
   * run python -u bench.py -delindex /path/to/index/with/deletes
 -nodelindex /path/to/index/without/deletes
 (You can leave off one of -delindex or -nodelindex and it'll skip
 those tests).
 For each test, bench.py generates a single Java source file that runs
 that one query; you can open
 contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/FastSearchTask.java
 to see it.  I'll attach an example.  It writes results.txt, in Jira
 table format, which you should be able to copy/paste back here.
 The specializer uses pretty much every search speedup I can think of
 -- the ones from LUCENE-1575 (to score or not, to maxScore or not),
 the ones suggested in the spinoff LUCENE-1593 (pre-fill w/ sentinels,
 don't use docID for tie breaking), LUCENE-1536 (random access
 filters).  It bypasses TermDocs and interacts directly with the
 IndexInput, and with BitVector for deletions.  It directly folds in
 the collector, if possible.  A filter if used must be random access,
 and is assumed to pre-multiply-in the deleted docs.
 Current status:
   * I only handle TermQuery.  I'd like to add others over time...
   * It can collect by score, or single field (with the 3 scoring
 options in LUCENE-1575).  It can't do reverse field sort nor
 multi-field sort now.
   * The auto-gen code (gen.py) is rather hideous.  It could use some
 serious refactoring, etc.; I think we could get it to the point
 where each Query can gen its own specialized code, maybe.  It also
 needs to be eventually ported to Java.
   * The script runs old, then new, then checks that the topN results
 are identical, and aborts if not.  So I'm pretty sure the
 specialized code is working correctly, for the cases I'm testing.
   * The patch includes a few small changes to core, mostly to open up
 package protected APIs so I can access stuff
 I think this is an interesting effort for several reasons:
   * It gives us a best-case upper bound

[jira] Commented: (LUCENE-1518) Merge Query and Filter classes

2009-04-30 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704561#action_12704561
 ] 

Eks Dev commented on LUCENE-1518:
-

imo, it is really not all that important to make Filter and Query the same 
(that is just one alternative to achieve the goal). 

The basic problem we try to solve is adding a Filter directly to BooleanQuery, and 
making optimizations after that easier. Wrapping with CSQ just adds another 
layer between the Lucene search machinery and the Filter, making these 
optimizations harder.

On the other hand, I must accept that conceptually Filter and Query are the same, 
together supporting the following options:
1. Pure boolean model: you do not care about scores (today we can do it only 
via CSQ, as a Filter does not enter BooleanQuery)
2. Mixed boolean and ranked: you have to define the Filter's contribution to the 
documents (CSQ)
3. Pure ranked: no filters, everything gets scored (the same as 2.)

Ideally, as a user, I define only a Query (Filter-based or not) and for each 
clause in my Query define 
Query.setScored(true/false) or useConstantScore(double score); 

also I should be able to say, Dear Lucene, please materialize this 
Query_Filter for me as I would like to have it cached, and please store only 
DocIds (today's Filter). Maybe also open the possibility to cache the 
scores of the documents as well. 

One thing is the concept and another is optimization. From the optimization point 
of view, we have a couple of decisions to make:

- does the DocID set support random access, yes or no (my Materialized Query)
- decide whether a clause should / should not be scored, or should be constant

So, for each Query we need to decide/support:

- scoring {yes, no, constant}, and 
- an option to materialize the Query (that is how we create Filters today)
- these Materialized Queries (aka Filters) should be able to tell us whether they 
support random access, and whether they cache only doc ids or scores as well

nothing useful in this email, just thinking aloud, sometimes it helps :)
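
Thinking aloud in code, those decisions boil down to a couple of per-clause knobs 
(hypothetical names only, not a real Lucene API):

enum ScoringMode { SCORED, UNSCORED, CONSTANT }

final class ClauseOptions {
  final ScoringMode scoring;
  final float constantScore;  // used only when scoring == CONSTANT
  final boolean materialize;  // cache the matching doc ids (today's "Filter")
  final boolean randomAccess; // whether the materialized form supports random access

  ClauseOptions(ScoringMode scoring, float constantScore,
                boolean materialize, boolean randomAccess) {
    this.scoring = scoring;
    this.constantScore = constantScore;
    this.materialize = materialize;
    this.randomAccess = randomAccess;
  }
}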






 Merge Query and Filter classes
 --

 Key: LUCENE-1518
 URL: https://issues.apache.org/jira/browse/LUCENE-1518
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1518.patch


 This issue presents a patch, that merges Queries and Filters in a way, that 
 the new Filter class extends Query. This would make it possible, to use every 
 filter as a query.
 The new abstract filter class would contain all methods of 
 ConstantScoreQuery, deprecate ConstantScoreQuery. If somebody implements the 
 Filter's getDocIdSet()/bits() methods he has nothing more to do, he could 
 just use the filter as a normal query.
 I do not want to completely convert Filters to ConstantScoreQueries. The idea 
 is to combine Queries and Filters in such a way, that every Filter can 
 automatically be used at all places where a Query can be used (e.g. also 
 alone a search query without any other constraint). For that, the abstract 
 Query methods must be implemented and return a default weight for Filters 
 which is the current ConstantScore Logic. If the filter is used as a real 
 filter (where the API wants a Filter), the getDocIdSet part could be directly 
 used, the weight is useless (as it is currently, too). The constant score 
 default implementation is only used when the Filter is used as a Query (e.g. 
 as direct parameter to Searcher.search()). For the special case of 
 BooleanQueries combining Filters and Queries the idea is, to optimize the 
 BooleanQuery logic in such a way, that it detects if a BooleanClause is a 
 Filter (using instanceof) and then directly uses the Filter API and not take 
 the burden of the ConstantScoreQuery (see LUCENE-1345).
 Here some ideas how to implement Searcher.search() with Query and Filter:
 - User runs Searcher.search() using a Filter as the only parameter. As every 
 Filter is also a ConstantScoreQuery, the query can be executed and returns 
 score 1.0 for all matching documents.
 - User runs Searcher.search() using a Query as the only parameter: No change, 
 all is the same as before
 - User runs Searcher.search() using a BooleanQuery as parameter: If the 
 BooleanQuery does not contain a Query that is subclass of Filter (the new 
 Filter) everything as usual. If the BooleanQuery only contains exactly one 
 Filter and nothing else the Filter is used as a constant score query. If 
 BooleanQuery contains clauses with Queries and Filters the new algorithm 
 could be used: The queries are executed and the results filtered with the 
 filters.
 For the user this has the main advantage: That he can construct his query 
 using a simplified API without thinking about Filters oder Queries, you can 
 just combine clauses

[jira] Commented: (LUCENE-1518) Merge Query and Filter classes

2009-04-30 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704613#action_12704613
 ] 

Eks Dev commented on LUCENE-1518:
-

Shai, 
Regarding pure ranked, CSQ is really what we need, no? --- 

Yep, it would work for Filters, but why not make it possible to have a normal 
Query with a constant score? For these cases, I am just not sure this approach 
gets maximum performance (I have not looked at this code for quite a while).  

Imagine you have a Query and you are not interested in scoring at all; this can 
be accomplished with DocID iterator arithmetic only, ignoring score() totally.  
But that is only an optimization (maybe it is already there?)

Paul, 
How about materializing the DocIds _and_ the score values?
Exactly, that would open the full caching possibility (the original purpose of 
Filters). Think search-results caching ... that is practically another name 
for the search() method. It is easy to create this, but using it again would 
require some bigger changes :) 

Filter_on_Steroids materialize(boolean without_score); 



 Merge Query and Filter classes
 --

 Key: LUCENE-1518
 URL: https://issues.apache.org/jira/browse/LUCENE-1518
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1518.patch


 This issue presents a patch, that merges Queries and Filters in a way, that 
 the new Filter class extends Query. This would make it possible, to use every 
 filter as a query.
 The new abstract filter class would contain all methods of 
 ConstantScoreQuery, deprecate ConstantScoreQuery. If somebody implements the 
 Filter's getDocIdSet()/bits() methods he has nothing more to do, he could 
 just use the filter as a normal query.
 I do not want to completely convert Filters to ConstantScoreQueries. The idea 
 is to combine Queries and Filters in such a way, that every Filter can 
 automatically be used at all places where a Query can be used (e.g. also 
 alone a search query without any other constraint). For that, the abstract 
 Query methods must be implemented and return a default weight for Filters 
 which is the current ConstantScore Logic. If the filter is used as a real 
 filter (where the API wants a Filter), the getDocIdSet part could be directly 
 used, the weight is useless (as it is currently, too). The constant score 
 default implementation is only used when the Filter is used as a Query (e.g. 
 as direct parameter to Searcher.search()). For the special case of 
 BooleanQueries combining Filters and Queries the idea is, to optimize the 
 BooleanQuery logic in such a way, that it detects if a BooleanClause is a 
 Filter (using instanceof) and then directly uses the Filter API and not take 
 the burden of the ConstantScoreQuery (see LUCENE-1345).
 Here some ideas how to implement Searcher.search() with Query and Filter:
 - User runs Searcher.search() using a Filter as the only parameter. As every 
 Filter is also a ConstantScoreQuery, the query can be executed and returns 
 score 1.0 for all matching documents.
 - User runs Searcher.search() using a Query as the only parameter: No change, 
 all is the same as before
 - User runs Searcher.search() using a BooleanQuery as parameter: If the 
 BooleanQuery does not contain a Query that is subclass of Filter (the new 
 Filter) everything as usual. If the BooleanQuery only contains exactly one 
 Filter and nothing else the Filter is used as a constant score query. If 
 BooleanQuery contains clauses with Queries and Filters the new algorithm 
 could be used: The queries are executed and the results filtered with the 
 filters.
 For the user this has the main advantage: That he can construct his query 
 using a simplified API without thinking about Filters oder Queries, you can 
 just combine clauses together. The scorer/weight logic then identifies the 
 cases to use the filter or the query weight API. Just like the query 
 optimizer of a RDB.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1518) Merge Query and Filter classes

2009-04-30 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704618#action_12704618
 ] 

Eks Dev commented on LUCENE-1518:
-

Paul: ...The current patch at LUCENE-1345 does not need such a FilterWeight; 
the no-scoring case is handled by not asking for score values...

Me: ...Imagine you have a Query and you are not interested in scoring at all; 
this can be accomplished with DocID iterator arithmetic only, ignoring score() 
totally. But that is only an optimization (maybe it is already there?)...

I knew Paul would kick in at this place; he said exactly the same thing I did, 
but, as opposed to me, he made a formulation that executes :) 
Pfff, I feel bad :)





 Merge Query and Filter classes
 --

 Key: LUCENE-1518
 URL: https://issues.apache.org/jira/browse/LUCENE-1518
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1518.patch


 This issue presents a patch, that merges Queries and Filters in a way, that 
 the new Filter class extends Query. This would make it possible, to use every 
 filter as a query.
 The new abstract filter class would contain all methods of 
 ConstantScoreQuery, deprecate ConstantScoreQuery. If somebody implements the 
 Filter's getDocIdSet()/bits() methods he has nothing more to do, he could 
 just use the filter as a normal query.
 I do not want to completely convert Filters to ConstantScoreQueries. The idea 
 is to combine Queries and Filters in such a way, that every Filter can 
 automatically be used at all places where a Query can be used (e.g. also 
 alone a search query without any other constraint). For that, the abstract 
 Query methods must be implemented and return a default weight for Filters 
 which is the current ConstantScore Logic. If the filter is used as a real 
 filter (where the API wants a Filter), the getDocIdSet part could be directly 
 used, the weight is useless (as it is currently, too). The constant score 
 default implementation is only used when the Filter is used as a Query (e.g. 
 as direct parameter to Searcher.search()). For the special case of 
 BooleanQueries combining Filters and Queries the idea is, to optimize the 
 BooleanQuery logic in such a way, that it detects if a BooleanClause is a 
 Filter (using instanceof) and then directly uses the Filter API and not take 
 the burden of the ConstantScoreQuery (see LUCENE-1345).
 Here some ideas how to implement Searcher.search() with Query and Filter:
 - User runs Searcher.search() using a Filter as the only parameter. As every 
 Filter is also a ConstantScoreQuery, the query can be executed and returns 
 score 1.0 for all matching documents.
 - User runs Searcher.search() using a Query as the only parameter: No change, 
 all is the same as before
 - User runs Searcher.search() using a BooleanQuery as parameter: If the 
 BooleanQuery does not contain a Query that is subclass of Filter (the new 
 Filter) everything as usual. If the BooleanQuery only contains exactly one 
 Filter and nothing else the Filter is used as a constant score query. If 
 BooleanQuery contains clauses with Queries and Filters the new algorithm 
 could be used: The queries are executed and the results filtered with the 
 filters.
 For the user this has the main advantage: That he can construct his query 
 using a simplified API without thinking about Filters oder Queries, you can 
 just combine clauses together. The scorer/weight logic then identifies the 
 cases to use the filter or the query weight API. Just like the query 
 optimizer of a RDB.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: new TokenStream api Question

2009-04-28 Thread eks dev
Hi Michael,
Sure, the Interfaces are a solution to this. They define what the Lucene core expects 
from these entities and give people the freedom to provide any implementation 
they wish. E.g. users that do not need offset information can just provide a 
dummy implementation that returns constants... 

The only problem with Interfaces is the back-compatibility curse :)  

But!
The Offset attribute is a simple enough entity, so I do not believe there will ever 
be a need to change the interface; 
Term is just char[] with offset/length, the same. 

Having really simple (and keeping them simple) concepts behind them makes 
Interfaces possible... I see no danger. But as said, the concepts behind must 
remain simple.
  

And by the way, I like the new API.  

Cheers, Eks




From: Michael Busch busch...@gmail.com
To: java-dev@lucene.apache.org
Sent: Tuesday, 28 April, 2009 10:22:45
Subject: Re: new TokenStream api Question

Hi Eks Dev,

I actually started experimenting with changing the new API slightly to overcome 
one drawback: with the variables now distributed over various Attribute classes 
(vs. being in a single class Token previously), cloning a Token (i.e. calling 
captureState()) is more expensive. This slows down the CachingTokenFilter and 
Tee/Sink-TokenStreams.

So I was thinking about introducing interfaces for each of the Attributes. E.g. 
OffsetAttribute would then be an interface with all current methods, and 
OffsetAttributeImpl would be its implementation. The user would still use the 
API in exactly the same way as now, that is be e.g. calling 
addAttribute(OffsetAttribute.class), and the code takes care of instantiating 
the right class. However, there would then also be an API to pass in an actual 
instance, and this API would use reflection to find all interfaces that the 
instances implements. All of those interfaces that extend the Attribute 
interface would be added to the AttributeSource map, with the instance as the 
value.

Then the Token class would implement all six attribute interfaces. An expert 
user could decide to pass in a Token instance instead of calling 
addAttribute(TermAttribute.class), addAttribute(PayloadAttribute.class), ...
Then the attribute source would only contain a single instance that needs to be 
cloned in captureState(), making cloning much faster. And a (probably also 
expert) user could even implement an own class that implements exactly the 
necessary interfaces (maybe only 3 of the 6 provided), and make cloning faster 
than it is even with the old Token-based API.

And of course, in your case too, you could just create a different 
implementation of such an interface, right? I think what's nice about this 
change is that it doesn't make it more complicated to use the TokenStream API, 
and the indexing pipeline still uses it the same way too, yet it's more 
extensible for expert users and it is possible to achieve the same or even better 
cloning performance.
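
In code, the proposed split would look roughly like this (a sketch with assumed names, 
not the final patch):

interface Attribute {}

interface OffsetAttribute extends Attribute {
  int startOffset();
  int endOffset();
  void setOffset(int startOffset, int endOffset);
}

final class OffsetAttributeImpl implements OffsetAttribute {
  private int start, end;
  public int startOffset() { return start; }
  public int endOffset() { return end; }
  public void setOffset(int startOffset, int endOffset) {
    this.start = startOffset;
    this.end = endOffset;
  }
}

// A Token-like class could then implement OffsetAttribute, TermAttribute, etc. all at
// once, so captureState() has a single instance to clone instead of six.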

I will open a new Jira issue for this soon. But I'd be happy to hear feedback 
about the proposed changes, and especially whether you think these changes would 
help you with your use case.

-Michael

On 4/27/09 1:49 PM, eks dev wrote: 
 Should I create a patch with something like this? 

 With Expert javadoc, and explanation what is this good for should be a nice 
 addition to Attribute cases. Practically, it would enable specialization of 
 hard linked Attributes like TermAttribute. 

 The only preconditions are: 
 - Specialized Attribute must extend one of the hard linked ones, and provide class of it 
 - Must implement default constructor 
 - should extend by not introducing state (big majority of cases) (not to break captureState())
 The last one could be relaxed i guess, but I am not yet 100% familiar with this code.

 Use cases for this are along the lines of my example, smaller, easier user code 
 and performance (token filters mainly)

 - Original Message 
 From: Uwe Schindler u...@thetaphi.de
 To: java-dev@lucene.apache.org
 Sent: Sunday, 26 April, 2009 23:03:06
 Subject: RE: new TokenStream api Question

  There is one problem: if you extend TermAttribute, the class is different 
  (which is the key in the attributes list). So when you initialize the 
  TokenStream and do a

  YourClass termAtt = (YourClass) addAttribute(YourClass.class)

  ...you create a new attribute. So one possibility would be to also specify 
  the instance and save the attribute by class (as key), but with your 
  instance. If you are the first one that creates the attribute (if it is a 
  token stream and not a filter it is ok, you will be the first, it adding the 
  attribute in the ctor), everything is ok. Register the attribute by yourself 
  (maybe we should add a specialized addAttribute, that can specify a instance 
  as default)?:

  YourClass termAtt = new YourClass();
  attributes.put(TermAttribute.class, termAtt);

  In this case, for the indexer it is a standard TermAttribute, but you

[jira] Commented: (LUCENE-1619) TermAttribute.termLength() optimization

2009-04-28 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703543#action_12703543
 ] 

Eks Dev commented on LUCENE-1619:
-

thanks Mike

 TermAttribute.termLength() optimization
 ---

 Key: LUCENE-1619
 URL: https://issues.apache.org/jira/browse/LUCENE-1619
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1619.patch


public int termLength() {
  initTermBuffer(); // This patch removes this method call 
  return termLength;
}
 I see no reason to initTermBuffer() in termLength()... all tests pass, but I 
 could be wrong?
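
(For reference, per the comment in the snippet above, the method after the patch would simply be:)

public int termLength() {
  return termLength;   // the length is tracked directly; no need to materialize the buffer here
}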

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703085#action_12703085
 ] 

Eks Dev commented on LUCENE-1616:
-

I am ok with both options; removing the separate setters looks a bit better to me, as it 
forces users to think atomically about offset = {start, end}. 

If you separate the start and end offset too far in your code, the probability that you 
miss a mistake somewhere is higher than in the case where you manage 
start and end on your own (it is then rather explicit in your 
code)... 

But that is really something we should not think too much about :) We 
make no mistakes either way.
 
I can provide a new patch, if needed. 

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it
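
(A minimal sketch of what the atomic setter and its use in a tokenizer could look like; the field names below are assumptions for illustration, not the exact committed patch.)

  // In the OffsetAttribute implementation: one atomic setter instead of two separate calls.
  public void setOffset(int startOffset, int endOffset) {
    this.startOffset = startOffset;
    this.endOffset = endOffset;
  }

  // In CharTokenizer.incrementToken(), roughly:
  //   offsetAtt.setOffset(start, start + length);
  // instead of setting the start and end offsets through two separate calls.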

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread eks dev

Ok, I'll create another patch a bit later today


- Original Message 
 From: Michael McCandless (JIRA) j...@apache.org
 To: java-dev@lucene.apache.org
 Sent: Monday, 27 April, 2009 16:34:30
 Subject: [jira] Commented: (LUCENE-1616) add one setter for start and end 
 offset to OffsetAttribute
 
 
     [ 
 https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703144#action_12703144
  
 ] 
 
 Michael McCandless commented on LUCENE-1616:
 
 
 bq. removing the separate setters looks a bit better to me, as it forces users to think 
 atomically about offset = {start, end}.
 
 This is my thinking as well.
 
 And in general I prefer one clear way to do something (the Python way) instead 
 of providing various different ways to do the same thing (the Perl way).
 
  add one setter for start and end offset to OffsetAttribute
  --
 
                 Key: LUCENE-1616
                 URL: https://issues.apache.org/jira/browse/LUCENE-1616
             Project: Lucene - Java
           Issue Type: Improvement
           Components: Analysis
             Reporter: Eks Dev
             Priority: Trivial
             Fix For: 2.9
 
         Attachments: LUCENE-1616.patch
 
 
  add OffsetAttribute. setOffset(startOffset, endOffset);
  trivial change, no JUnit needed
  Changed CharTokenizer to use it
 
 -- 
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.
 
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Eks Dev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1616:


Attachment: LUCENE-1616.patch

whoops, this time it compiles :)

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703254#action_12703254
 ] 

Eks Dev commented on LUCENE-1616:
-

me too, sorry! 
Eclipse left me blind for some funny reason.
I am waiting for the tests to complete before I commit again ... 

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Eks Dev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1616:


Attachment: LUCENE-1616.patch

ok, maybe this time it will work; I hope I managed to clean it up (core build 
and tests pass). 

The only thing that fails is contrib, but I guess that has nothing to do with 
this patch? 


[javac] 
D:\Repository\SerachAndMatch\Lucene\lucene\java\trunk\contrib\highlighter\src\java\org\apache\lucene\search\highlight\WeightedSpanTermExtractor.java:306:
 cannot find symbol
[javac]   MemoryIndex indexer = new MemoryIndex();
[javac]   ^
[javac]   symbol:   class MemoryIndex
[javac]   location: class 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor
[javac] 
D:\Repository\SerachAndMatch\Lucene\lucene\java\trunk\contrib\highlighter\src\java\org\apache\lucene\search\highlight\WeightedSpanTermExtractor.java:306:
 cannot find symbol
[javac]   MemoryIndex indexer = new MemoryIndex();
[javac] ^
[javac]   symbol:   class MemoryIndex
[javac]   location: class 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 3 errors

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch, 
 LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703335#action_12703335
 ] 

Eks Dev commented on LUCENE-1616:
-

ant build-contrib 

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch, 
 LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1619) TermAttribute.termLength() optimization

2009-04-27 Thread Eks Dev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1619:


Attachment: LUCENE-1619.patch

 TermAttribute.termLength() optimization
 ---

 Key: LUCENE-1619
 URL: https://issues.apache.org/jira/browse/LUCENE-1619
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Attachments: LUCENE-1619.patch


public int termLength() {
  initTermBuffer(); // This patch removes this method call 
  return termLength;
}
 I see no reason to initTermBuffer() in termLength()... all tests pass, but I 
 could be wrong?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1619) TermAttribute.termLength() optimization

2009-04-27 Thread Eks Dev (JIRA)
TermAttribute.termLength() optimization
---

 Key: LUCENE-1619
 URL: https://issues.apache.org/jira/browse/LUCENE-1619
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Attachments: LUCENE-1619.patch


   public int termLength() {
 initTermBuffer(); // This patch removes this method call 
 return termLength;
   }

I see no reason to initTermBuffer() in termLength()... all tests pass, but I 
could be wrong?



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: new TokenStream api Question

2009-04-27 Thread eks dev

Should I create a patch with something like this? 

With expert javadoc and an explanation of what this is good for, it should be a nice 
addition to the Attribute cases.
Practically, it would enable specialization of hard-linked Attributes like 
TermAttribute. 

The only preconditions are: 

- The specialized Attribute must extend one of the hard-linked ones, and 
provide its class
- It must implement a default constructor 
- It should extend without introducing state (the big majority of cases), so as not to break 
captureState()

The last one could be relaxed I guess, but I am not yet 100% familiar with this 
code.

Use cases for this are along the lines of my example: smaller, easier user code 
and better performance (mainly token filters)
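
(A minimal sketch of the kind of specialized attribute meant here, together with the registration trick from Uwe's message quoted below; the class name and the startsWith helper are illustrative, and the attributes map access mirrors the quoted snippet.)

  // A TermAttribute specialization: no new state, default constructor, just a
  // convenience method on top of the existing public accessors.
  public final class PrefixTermAttribute extends TermAttribute {
    public PrefixTermAttribute() { super(); }

    // true if the current term starts with the given character
    public boolean startsWith(char c) {
      return termLength() > 0 && termBuffer()[0] == c;
    }
  }

  // Registration (per the quoted suggestion below): keyed by the hard-linked
  // class, but holding our specialized instance, so the indexer still sees a
  // plain TermAttribute while the owning stream can call startsWith().
  //
  //   PrefixTermAttribute termAtt = new PrefixTermAttribute();
  //   attributes.put(TermAttribute.class, termAtt);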



- Original Message 
 From: Uwe Schindler u...@thetaphi.de
 To: java-dev@lucene.apache.org
 Sent: Sunday, 26 April, 2009 23:03:06
 Subject: RE: new TokenStream api Question
 
 There is one problem: if you extend TermAttribute, the class is different
 (which is the key in the attributes list). So when you initialize the
 TokenStream and do a
 
 YourClass termAtt = (YourClass) addAttribute(YourClass.class)
 
 ...you create a new attribute. So one possibility would be to also specify
 the instance and save the attribute by class (as key), but with your
 instance. If you are the first one that creates the attribute (if it is a
 token stream and not a filter it is ok, you will be the first when adding the
 attribute in the ctor), everything is ok. Register the attribute yourself
 (maybe we should add a specialized addAttribute that can take an instance
 as default)?:
 
 YourClass termAtt = new YourClass();
 attributes.put(TermAttribute.class, termAtt);
 
 In this case, for the indexer it is a standard TermAttribute, but you can do
 more with it.
 
 Replacing TermAttribute with your own class (one that does not extend it) is not 
 possible, as the indexer will get a ClassCastException when using the instance 
 retrieved with getAttribute(TermAttribute.class).
 
 Uwe
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
  -Original Message-
  From: eks dev [mailto:eks...@yahoo.co.uk]
  Sent: Sunday, April 26, 2009 10:39 PM
  To: java-dev@lucene.apache.org
  Subject: new TokenStream api Question
  
  
  I am just looking into new TermAttribute usage and wonder what would be
  the best way to implement PrefixFilter that would filter out some Terms
  that have some prefix,
  
  something like this, where '-' represents my prefix:
  
public final boolean incrementToken() throws IOException {
  // the first word we found
  while (input.incrementToken()) {
int len = termAtt.termLength();
  
    if(len > 0 && termAtt.termBuffer()[0]!='-') //only length > 0 and non LFs
  return true;
// note: else we ignore it
  }
  // reached EOS
  return false;
}
  
  
  
  
  
  The question would be:
  
  can I extend TermAttribute and add boolean startsWith(char c);
  
  The point is speed and my code gets smaller.
  TermAttribute has one method called in termLength() and termBuffer() I do
  not understand (back compatibility, I guess)
public int termLength() {
  initTermBuffer(); // I'd like to avoid it...
  return termLength;
}
  
  
  I'd like to get rid of initTermBuffer(), the first option is to *extend*
  TermAttribute code (but fields are private, so no help there) or can I
  implement my own MyTermAttribute (will Indexer know how to deal with it?)
  
  Must I extend TermAttribute or I can add my own?
  
  thanks,
  eks
  
  
  
  
  -
  To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-dev-h...@lucene.apache.org
 
 
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



  

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-27 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703406#action_12703406
 ] 

Eks Dev commented on LUCENE-1618:
-

Maybe 
FileSwitchDirectory should have the possibility to get the list of files/extensions that 
should be loaded into RAM... making it maintenance free and pushing this decision 
to the end user... If and when we decide to support users in it, we could then 
maintain a static list in a separate place. Kind of separating execution from 
configuration.

I *think* I saw something similar that Ning Lee made quite a while ago, from the hadoop 
camp (indexing on hadoop or something...). But I cannot remember what it was :(
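
(A hypothetical sketch of such a configuration-driven setup; the FileSwitchDirectory constructor shown, taking the set of primary extensions, is an assumption for illustration rather than a committed API.)

  // Sketch: the user decides which file extensions live in RAM; Lucene itself
  // stays agnostic about the list.
  Set<String> inRam = new HashSet<String>(Arrays.asList("tii", "tis"));
  Directory dir = new FileSwitchDirectory(
      inRam,                                       // extensions routed to the primary dir
      new RAMDirectory(),                          // primary: kept in memory
      FSDirectory.getDirectory("/path/to/index"),  // secondary: everything else on disk
      true);                                       // close both when dir is closed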


  

 Allow setting the IndexWriter docstore to be a different directory
 --

 Key: LUCENE-1618
 URL: https://issues.apache.org/jira/browse/LUCENE-1618
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

   Original Estimate: 336h
  Remaining Estimate: 336h

 Add an IndexWriter.setDocStoreDirectory method that allows doc
 stores to be placed in a different directory than the IW default
 dir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1615) deprecated method used in fieldsReader / setOmitTf()

2009-04-26 Thread Eks Dev (JIRA)
deprecated method used in fieldsReader / setOmitTf()


 Key: LUCENE-1615
 URL: https://issues.apache.org/jira/browse/LUCENE-1615
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Eks Dev
Priority: Trivial


setOmitTf(boolean) is deprecated and should not be used by core classes. One 
place where it appears is FieldsReader; this patch fixes it. It was necessary 
to change Fieldable to AbstractField in two places, only local variables.   
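
(A sketch of the kind of change described, assuming setOmitTermFreqAndPositions() is the non-deprecated replacement available on AbstractField; the variable and field names are illustrative, not the actual FieldsReader code.)

  // Before: the local variable was typed as Fieldable, which only offers the
  // deprecated setter:
  //   Fieldable field = ...;
  //   field.setOmitTf(fi.omitTf);                 // deprecated
  //
  // After: only the declared type of the local variable changes:
  AbstractField field = new Field(fi.name, "", Field.Store.YES, Field.Index.NO);
  field.setOmitTermFreqAndPositions(fi.omitTermFreqAndPositions);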

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1615) deprecated method used in fieldsReader / setOmitTf()

2009-04-26 Thread Eks Dev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1615:


Attachment: LUCENE-1615.patch

 deprecated method used in fieldsReader / setOmitTf()
 

 Key: LUCENE-1615
 URL: https://issues.apache.org/jira/browse/LUCENE-1615
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Eks Dev
Priority: Trivial
 Attachments: LUCENE-1615.patch


 setOmitTf(boolean) is deprecated and should not be used by core classes. One 
 place where it appears is FieldsReader , this patch fixes it. It was 
 necessary to change Fieldable to AbstractField at two places, only local 
 variables.   

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1615) deprecated method used in fieldsReader / setOmitTf()

2009-04-26 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702901#action_12702901
 ] 

Eks Dev commented on LUCENE-1615:
-

sure, replacing Fieldable is good; I just noticed a quick win when cleaning up 
deprecations from our code base... one step at a time 

 deprecated method used in fieldsReader / setOmitTf()
 

 Key: LUCENE-1615
 URL: https://issues.apache.org/jira/browse/LUCENE-1615
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Eks Dev
Priority: Trivial
 Attachments: LUCENE-1615.patch


 setOmitTf(boolean) is deprecated and should not be used by core classes. One 
 place where it appears is FieldsReader , this patch fixes it. It was 
 necessary to change Fieldable to AbstractField at two places, only local 
 variables.   

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



new TokenStream api Question

2009-04-26 Thread eks dev

I am just looking into the new TermAttribute usage and wonder what would be the 
best way to implement a PrefixFilter that filters out Terms that have 
some prefix, 

something like this, where '-' represents my prefix:

  public final boolean incrementToken() throws IOException {
// the first word we found
while (input.incrementToken()) {
  int len = termAtt.termLength();
  
  if(len > 0 && termAtt.termBuffer()[0]!='-') //only length > 0 and non LFs 
return true;
  // note: else we ignore it
}
// reached EOS 
return false;
  }

 



The question would be:

can I extend TermAttribute and add boolean startsWith(char c)?

The point is speed, and my code gets smaller.  
TermAttribute has one method call in termLength() and termBuffer() that I do not 
understand (back compatibility, I guess):
  public int termLength() {
    initTermBuffer(); // I'd like to avoid it...
    return termLength;
  }


I'd like to get rid of initTermBuffer(). The first option is to *extend* the 
TermAttribute code (but its fields are private, so no help there), or can I 
implement my own MyTermAttribute (will the Indexer know how to deal with it?) 

Must I extend TermAttribute, or can I add my own?

thanks, 
eks




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org


