[jira] Commented: (LUCENE-1313) Ocean Realtime Search

Jason Rutherglen (JIRA) Tue, 02 Sep 2008 05:06:10 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627642#action_12627642
 ]


Jason Rutherglen commented on LUCENE-1313:
------------------------------------------

Hi Karl,

Thanks for taking a look at the code!  Yes, the methods need javadoc.  I was 
waiting until I had settled on them, and since I have started building new 
code on top, I guess they have settled, so I need to add javadoc to them.  

If you are using TransactionSystem then the getSearcher method would be called 
for each query.  I have developed OceanDatabase which makes the searching 
transparent and implements optimistic concurrency (version number stored in the 
document).  I believe most systems will want to use OceanDatabase, however the 
raw TransactionSystem which is more like IndexWriter will be left as well.  I 
have been working on OceanDatabase and have neglected the javadocs of 
TransactionSystem.  
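
As a rough illustration of the optimistic concurrency idea (a version number 
stored with the document), here is a minimal sketch.  It does not use the Ocean 
or Lucene classes: VersionedStore, Entry, add, get and update are hypothetical 
stand-ins, and the actual OceanDatabase API may look quite different.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a store that keeps a version number with each document.
// This is NOT the OceanDatabase API; it only illustrates the version check.
class VersionedStore {
  // One stored document: its contents plus the version it was written at.
  static final class Entry {
    final String contents;
    final long version;

    Entry(String contents, long version) {
      this.contents = contents;
      this.version = version;
    }
  }

  private final Map<String, Entry> docs = new HashMap<String, Entry>();

  // Adds a new document at version 1; returns false if the id already exists.
  synchronized boolean add(String id, String contents) {
    if (docs.containsKey(id)) {
      return false;
    }
    docs.put(id, new Entry(contents, 1L));
    return true;
  }

  // Returns the current entry for the id, or null if there is none.
  synchronized Entry get(String id) {
    return docs.get(id);
  }

  // Applies the update only if the caller read the current version.
  // Returns false and changes nothing when another writer got there first.
  synchronized boolean update(String id, long expectedVersion, String newContents) {
    Entry current = docs.get(id);
    if (current == null || current.version != expectedVersion) {
      return false;
    }
    docs.put(id, new Entry(newContents, expectedVersion + 1));
    return true;
  }
}
```

A caller reads a document, remembers its version, and passes that version back 
to update.  A second update made with the same, now stale, version returns false 
so the caller can re-read and retry.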

I modeled the searcherPolicy instanceof code on the MergeScheduler type of 
system where there is a marker interface that the subclasses implement.  I 
considered it a minor detail and admittedly did not spend much time on it, so 
I don't mind changing it, and you are welcome to change it as well.

The transaction log is replayed on a restart of the system.  It repopulates a 
RamIndex (uses RAMDirectory) on startup based on the max snapshot id of the 
existing indexes, and replays the transaction log from there.   
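
The recovery step described above can be sketched roughly as follows.  This is 
plain Java, not the Ocean code: ReplaySketch, LogRecord, maxSnapshotId and 
replay are hypothetical names, a list of strings stands in for the RamIndex, 
and the sketch assumes that log records with a transaction id greater than the 
max snapshot id are the ones not yet present in the on-disk indexes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of replaying a transaction log on restart.
class ReplaySketch {
  // Hypothetical log record: a transaction id and the document it added.
  static final class LogRecord {
    final long transactionId;
    final String document;

    LogRecord(long transactionId, String document) {
      this.transactionId = transactionId;
      this.document = document;
    }
  }

  // Highest snapshot id among the existing on-disk indexes, or 0 if there are none.
  static long maxSnapshotId(List<Long> indexSnapshotIds) {
    long max = 0L;
    for (long id : indexSnapshotIds) {
      max = Math.max(max, id);
    }
    return max;
  }

  // Returns the documents to re-add to the in-memory index: every log record
  // whose transaction id is greater than the max snapshot id (an assumption).
  static List<String> replay(List<LogRecord> log, List<Long> indexSnapshotIds) {
    long from = maxSnapshotId(indexSnapshotIds);
    List<String> ramIndex = new ArrayList<String>();
    for (LogRecord record : log) {
      if (record.transactionId > from) {
        ramIndex.add(record.document);
      }
    }
    return ramIndex;
  }
}
```

With on-disk indexes at snapshot ids 3 and 105, only log records with ids above 
105 would be replayed into memory.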

I looked at converting documents to a token stream.  The problem is that if the 
field is stored, it creates redundant storage of the data in the transaction 
log.  Ultimately I could not find anything to be gained from storing a token 
stream.  Also, if it was converted, what would happen with stored fields?  The 
issue with replaying the document later, though, is not having the Analyzer.  In 
the distributed object code patch LUCENE-1336 I made Analyzer Serializable.  I 
think it's best to serialize the Analyzer, or create a small database of 
serialized analyzers that can be called upon during the transaction log 
recovery process.  However, I am not entirely sure about the ramifications of 
serializing the Analyzer, for example, how much data a serialized Analyzer may 
have.  Perhaps others have some ideas or feedback about serializing analyzers.
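
One way to get a feel for the size question is to round-trip an object through 
standard Java serialization and look at the byte count.  The sketch below uses 
a hypothetical StopWordAnalyzer stand-in whose only state is a stop word set. 
It is not Lucene's Analyzer and not the LUCENE-1336 patch, and the class and 
method names are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: serialize an analyzer-like object and restore it later.
class AnalyzerSerializationSketch {
  // Stand-in for a Serializable analyzer; its only state is a stop word set.
  static final class StopWordAnalyzer implements Serializable {
    private static final long serialVersionUID = 1L;
    final Set<String> stopWords;

    StopWordAnalyzer(Set<String> stopWords) {
      this.stopWords = new HashSet<String>(stopWords);
    }
  }

  // Serializes the object; the array length is the size a log entry would carry.
  static byte[] toBytes(Serializable obj) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(obj);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Restores an object from bytes, as a recovery process might do.
  static Object fromBytes(byte[] data) {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
      return in.readObject();
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("could not deserialize", e);
    }
  }
}
```

Calling toBytes on an analyzer and checking the array length gives the cost of 
storing it, and fromBytes restores an equivalent instance for replay.  An 
analyzer that holds large dictionaries would serialize to correspondingly more 
data, which is where a shared store of serialized analyzers might help.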

In conclusion, I'll add more javadocs.  Please feel free to ask more questions!

Jason

> Ocean Realtime Search
> ---------------------
>
>                 Key: LUCENE-1313
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1313
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Jason Rutherglen
>         Attachments: lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, 
> lucene-1313.patch
>
>
> Provides realtime search using Lucene.  Conceptually, updates are divided 
> into discrete transactions.  The transaction is recorded to a transaction log 
> which is similar to the mysql bin log.  Deletes from the transaction are made 
> to the existing indexes.  Document additions are made to an in memory 
> InstantiatedIndex.  The transaction is then complete.  After each transaction 
> TransactionSystem.getSearcher() may be called which allows searching over the 
> index including the latest transaction.
> TransactionSystem is the main class.  Methods similar to IndexWriter are 
> provided for updating.  getSearcher returns a Searcher class. 
> - getSearcher()
> - addDocument(Document document)
> - addDocument(Document document, Analyzer analyzer)
> - updateDocument(Term term, Document document)
> - updateDocument(Term term, Document document, Analyzer analyzer)
> - deleteDocument(Term term)
> - deleteDocument(Query query)
> - commitTransaction(List<Document> documents, Analyzer analyzer, List<Term> 
> deleteByTerms, List<Query> deleteByQueries)
> Sample code:
> {code}
> // setup
> FSDirectoryMap directoryMap = new FSDirectoryMap(new File("/testocean"), 
> "log");
> LogDirectory logDirectory = directoryMap.getLogDirectory();
> TransactionLog transactionLog = new TransactionLog(logDirectory);
> TransactionSystem system = new TransactionSystem(transactionLog, new 
> SimpleAnalyzer(), directoryMap);
> // transaction
> Document d = new Document();
> d.add(new Field("contents", "hello world", Field.Store.YES, 
> Field.Index.TOKENIZED));
> system.addDocument(d);
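> // search
> // The query below is an illustrative assumption; the original sample left it undefined.
> Query query = new TermQuery(new Term("contents", "hello"));
> OceanSearcher searcher = system.getSearcher();
> ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
> System.out.println(hits.length + " total results");
> for (int i = 0; i < hits.length && i < 10; i++) {
>   // Use a distinct name here: "d" is already declared above.
>   Document doc = searcher.doc(hits[i].doc);
>   System.out.println(i + " " + hits[i].score + " " + doc.get("contents"));
> }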
> {code}
> There is a test class org.apache.lucene.ocean.TestSearch that was used for 
> basic testing.  
> A sample disk directory structure is as follows:
> |/snapshot_105_00.xml | XML file containing which indexes and their 
> generation numbers correspond to a snapshot.  Each transaction creates a new 
> snapshot file.  In this file the 105 is the snapshotid, also known as the 
> transactionid.  The 00 is the minor version of the snapshot corresponding to 
> a merge.  A merge is a minor snapshot version because the data does not 
> change, only the underlying structure of the index|
> |/3 | Directory containing an on disk Lucene index|
> |/log | Directory containing log files|
> |/log/log00000001.bin | Log file.  As new log files are created the suffix 
> number is incremented|

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


