[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
-------------------------------------

    Attachment: lucene-1313.patch

lucene-1313.patch

- Depends on LUCENE-1314
- OceanSegmentReader implements reuse of deletedDocs bytes in conjunction with 
LUCENE-1314 (a copy-on-write sketch of the idea follows this list)
- Snapshot logging happens to a rolling log file
- CRC32 checking added to the transaction log (a record-format sketch follows 
this list)
- Added a TestSystem test case that performs adds, updates, and deletes.  
TestSystem uses arbitrarily small configuration values to force the various 
background merges to happen within a minimal number of transactions
- Transactions with more than N documents are encoded as a segment (via 
RAMDirectory) in the transaction log rather than serialized as individual 
Documents (a segment-encoding sketch follows this list)
- Started wiki page http://wiki.apache.org/lucene-java/OceanRealtimeSearch 
linked from http://wiki.apache.org/lucene-java/LuceneResources.  Will place 
documentation there.
- Document fields with Reader or TokenStream values supported
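
The deletedDocs reuse above can be pictured as copy-on-write sharing of the 
deletion bits between a reader and its reopened successor.  The following is 
only a sketch of that idea, not code from the patch: BitSet stands in for 
Lucene's internal BitVector, and all names here are made up.

{code}
import java.util.BitSet;

// A reopened reader inherits the previous reader's deletion bits and only
// copies them when it needs to record new deletes (copy-on-write).
public class SharedDeletedDocs {
    private BitSet bits;
    private boolean shared;  // true while bits still belongs to the old reader

    public SharedDeletedDocs(BitSet inherited) {
        this.bits = inherited;
        this.shared = true;
    }

    public boolean isDeleted(int doc) {
        return bits.get(doc);
    }

    public void delete(int doc) {
        if (shared) {  // first write: make a private copy
            bits = (BitSet) bits.clone();
            shared = false;
        }
        bits.set(doc);
    }
}
{code}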
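
The CRC32 checking could use a record format along these lines, where each 
log record carries a length and a checksum so that truncated or corrupt 
records are detected on replay.  This is a hedged sketch; the patch's actual 
record layout may differ.

{code}
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.CRC32;

// Hypothetical record writer: length + CRC32 + payload.
public class LogRecordWriter {
    private final DataOutputStream out;

    public LogRecordWriter(OutputStream os) {
        this.out = new DataOutputStream(os);
    }

    public void writeRecord(byte[] payload) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        out.writeInt(payload.length);   // record length
        out.writeLong(crc.getValue());  // checksum over the payload
        out.write(payload);             // the serialized transaction
        out.flush();
    }
}
{code}

On replay, a reader recomputes the CRC32 over the payload bytes and drops the 
record (and everything after it) on a mismatch.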
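
Encoding an oversized transaction as a segment might look like the following: 
the batch is indexed into a RAMDirectory, and the resulting segment files, 
rather than N serialized Documents, are what get appended to the log.  The 
method and variable names here are assumptions, not the patch's API.

{code}
import java.io.IOException;
import java.util.List;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

// Hypothetical helper: index the batch in RAM so raw segment bytes can be
// copied into a log record instead of serializing each Document.
RAMDirectory encodeAsSegment(List<Document> documents) throws IOException {
    RAMDirectory ram = new RAMDirectory();
    IndexWriter writer = new IndexWriter(ram, new SimpleAnalyzer(), true);
    for (Document doc : documents) {
        writer.addDocument(doc);
    }
    writer.close();
    // ram.list() / ram.openInput(name) can now be used to copy the
    // segment files into the transaction log.
    return ram;
}
{code}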

Began work on LargeBatch functionality; it still needs a test case.  Large 
batches allow adding documents in bulk (and also performing deletes) in a 
transaction that goes straight to an index, bypassing the transaction log.  
This provides the same speed as using IndexWriter directly for bulk Document 
processing in Ocean.
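
For comparison, the plain IndexWriter bulk path whose speed LargeBatch aims 
to match might look like this (the directory path and field values are 
illustrative only):

{code}
import java.io.File;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

// Bulk load straight through IndexWriter: no transaction log in the path.
FSDirectory dir = FSDirectory.getDirectory(new File("/testocean/bulk"));
IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
for (int i = 0; i < 100000; i++) {
  Document doc = new Document();
  doc.add(new Field("id", Integer.toString(i), Field.Store.YES,
      Field.Index.UN_TOKENIZED));
  writer.addDocument(doc);
}
writer.close();
{code}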

Started OceanDatabase, which will offer a Java API inspired by GData.  It 
will offer optimistic concurrency (something required in a realtime search 
system) and dynamic object mapping (meaning types such as long, date, and 
double will be mapped to string terms using some Solr code).  A file sync is 
currently performed after each transaction; an option will be added to sync 
only after every N transactions, as MySQL does.  This will improve realtime 
update speeds.  (A sketch of the sync policy follows below.)
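
A minimal sketch of the planned sync-every-N option, assuming a simple 
counter-based policy (the class and method names are made up):

{code}
// Deferring the fsync amortizes its cost across N commits, at the price of
// possibly losing the last un-synced transactions on a crash, much like
// MySQL's relaxed log-flush settings.
public class SyncPolicy {
    private final int syncInterval;  // N; 1 means sync on every transaction
    private int sinceLastSync = 0;

    public SyncPolicy(int syncInterval) {
        this.syncInterval = syncInterval;
    }

    // Called after each committed transaction; the caller fsyncs the log
    // file whenever this returns true.
    public boolean shouldSync() {
        if (++sinceLastSync >= syncInterval) {
            sinceLastSync = 0;
            return true;
        }
        return false;
    }
}
{code}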

Future:

- Support for multiple servers by implementing a distributed API and 
replication using LUCENE-1336
- Test case akin to TestStressIndexing2, mainly to test threading
- Add a LargeBatch test to TestSystem
- Facets
- Looking at adding a GData-compatible XML over HTTP API.  Possibly the old 
Lucene GData code can be reused.
- Integrate the tag index when it's completed
- Add an LRU record cache to the transaction log, which will be useful for 
faster replication (a sketch follows this list)
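
The LRU record cache could be as simple as an access-ordered LinkedHashMap 
keyed by transaction id; this sketch assumes that approach and is not the 
planned implementation.

{code}
import java.util.LinkedHashMap;
import java.util.Map;

// Keeps the most recently used log records in memory so replicas fetching
// recent transactions avoid disk reads.
public class RecordCache extends LinkedHashMap<Long, byte[]> {
    private final int maxRecords;

    public RecordCache(int maxRecords) {
        super(16, 0.75f, true);  // accessOrder=true gives LRU behavior
        this.maxRecords = maxRecords;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
        return size() > maxRecords;  // evict the least recently used record
    }
}
{code}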


> Ocean Realtime Search
> ---------------------
>
>                 Key: LUCENE-1313
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1313
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Jason Rutherglen
>         Attachments: lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, 
> lucene-1313.patch
>
>
> Provides realtime search using Lucene.  Conceptually, updates are divided 
> into discrete transactions.  The transaction is recorded to a transaction log 
> which is similar to the mysql bin log.  Deletes from the transaction are made 
> to the existing indexes.  Document additions are made to an in memory 
> InstantiatedIndex.  The transaction is then complete.  After each transaction 
> TransactionSystem.getSearcher() may be called which allows searching over the 
> index including the latest transaction.
> TransactionSystem is the main class.  Methods similar to IndexWriter are 
> provided for updating.  getSearcher returns a Searcher class. 
> - getSearcher()
> - addDocument(Document document)
> - addDocument(Document document, Analyzer analyzer)
> - updateDocument(Term term, Document document)
> - updateDocument(Term term, Document document, Analyzer analyzer)
> - deleteDocument(Term term)
> - deleteDocument(Query query)
> - commitTransaction(List<Document> documents, Analyzer analyzer, List<Term> 
> deleteByTerms, List<Query> deleteByQueries)
> Sample code:
> {code}
> // setup
> FSDirectoryMap directoryMap = new FSDirectoryMap(new File("/testocean"), 
> "log");
> LogDirectory logDirectory = directoryMap.getLogDirectory();
> TransactionLog transactionLog = new TransactionLog(logDirectory);
> TransactionSystem system = new TransactionSystem(transactionLog, new 
> SimpleAnalyzer(), directoryMap);
> // transaction
> Document d = new Document();
> d.add(new Field("contents", "hello world", Field.Store.YES, 
> Field.Index.TOKENIZED));
> system.addDocument(d);
> // search
> Query query = new TermQuery(new Term("contents", "hello"));
> OceanSearcher searcher = system.getSearcher();
> ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
> System.out.println(hits.length + " total results");
> for (int i = 0; i < hits.length && i < 10; i++) {
>   Document hitDoc = searcher.doc(hits[i].doc);
>   System.out.println(i + " " + hits[i].score + " " + hitDoc.get("contents"));
> }
> {code}
> There is a test class org.apache.lucene.ocean.TestSearch that was used for 
> basic testing.  
> A sample disk directory structure is as follows:
> ||Path||Description||
> |/snapshot_105_00.xml|XML file listing which indexes and generation numbers correspond to a snapshot.  Each transaction creates a new snapshot file.  Here 105 is the snapshotid, also known as the transactionid, and 00 is the minor version of the snapshot, corresponding to a merge.  A merge is a minor snapshot version because the data does not change, only the underlying structure of the index.|
> |/3|Directory containing an on-disk Lucene index|
> |/log|Directory containing log files|
> |/log/log00000001.bin|Log file.  As new log files are created, the suffix number is incremented.|
