[ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632230#comment-14632230 ]
Shinichiro Abe commented on CONNECTORS-1219:
--------------------------------------------

It will work if we just create a new IndexSearcher with a new IndexReader that takes an HdfsDirectory. As for the searcher, it depends on whether we use near-real-time search or not.

(1) Writer and searcher coexist. This is an approach like Solr/SolrCloud or Elasticsearch. The IndexSearcher can search the documents the IndexWriter holds. Even if writing to HDFS is slow, the IndexSearcher can search the in-memory, uncommitted documents from the IndexWriter.

(2) Separate the writer side from the searcher side. This is an approach like Solr's legacy master (writer)-slave (searcher) architecture, so we can't use near-real-time search. The IndexSearcher searches the documents in HDFS, which holds only the documents committed by the IndexWriter.

Which one fits the MCF standard? In Solr, Elasticsearch, Oak and Sling, documents are searchable as soon as clients post them. Oak and Sling are content repositories with a search index built on a push model (a client posts a document, which is then stored in the repository and indexed simultaneously), although these are bound by the JCR standard. MCF, on the other hand, is a pull model: the search applications behind the output connector are responsible for whether documents become searchable soon. So according to the MCF standard, the Lucene connector will have to choose (2) with the plugin, but near-real-time search is lost. I intended (1) in the v0.3 patch.

By the way, Alfresco, Liferay and Drupal are also content repositories with pull-model crawls, as I heard from someone, but they differ from MCF's document version checking: they can index documents using something like transaction information about CRUD operations on documents, which is managed on the repository side, so documents are indexed soon and are searchable soon. MCF is bound by limitations on the repository side, e.g. concurrent access limits (shared drive, web, Alfresco, CMIS, SharePoint… almost all repositories?) or heavy CPU load on the repository side caused by multi-threaded access.
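To make the difference between approaches (1) and (2) concrete, here is a minimal sketch in plain Java. It is a toy model only and does not use the Lucene API: the class ToyIndex and its methods are hypothetical names invented for illustration. In Lucene itself, the corresponding distinction is roughly between opening a reader from the IndexWriter and opening one from the Directory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model (NOT the Lucene API) of which documents each approach can see. */
class ToyIndex {
    // Documents already written to durable storage (HDFS in the comment above).
    private final List<String> committed = new ArrayList<>();
    // Documents the writer has accepted but still holds in memory.
    private final List<String> buffered = new ArrayList<>();

    /** The writer accepts a document; it is not durable yet. */
    void add(String docId) {
        buffered.add(docId);
    }

    /** The writer flushes its buffer to durable storage. */
    void commit() {
        committed.addAll(buffered);
        buffered.clear();
    }

    /** Approach (1): the searcher shares the writer, so buffered documents are visible. */
    List<String> nearRealTimeView() {
        List<String> view = new ArrayList<>(committed);
        view.addAll(buffered);
        return Collections.unmodifiableList(view);
    }

    /** Approach (2): a separate searcher sees only what has been committed. */
    List<String> committedView() {
        return Collections.unmodifiableList(new ArrayList<>(committed));
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("doc-1");
        System.out.println("NRT view before commit:       " + index.nearRealTimeView());
        System.out.println("Committed view before commit: " + index.committedView());
        index.commit();
        System.out.println("Committed view after commit:  " + index.committedView());
    }
}
```

In this model a slow commit() only delays the committed view; the near-real-time view is unaffected, which is the trade-off described under (1) and (2).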
Unfortunately, I have sometimes heard from users that MCF crawls are slow. Of course I knew, and explained to them, that this is not something MCF can take care of, and then adjusted the repository side or customized the existing connectors. As my first approach to this, I had the idea of indexing documents to local disk using Lucene without any HTTP transport, and of using near-real-time search with the writer's buffered documents, i.e. approach (1). Currently I have no idea how to address the repository-side limitation, though.

> Lucene Output Connector
> -----------------------
>
>                 Key: CONNECTORS-1219
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1219
>             Project: ManifoldCF
>          Issue Type: New Feature
>            Reporter: Shinichiro Abe
>            Assignee: Shinichiro Abe
>         Attachments: CONNECTORS-1219-v0.1patch.patch,
> CONNECTORS-1219-v0.2.patch, CONNECTORS-1219-v0.3.patch
>
>
> A output connector for Lucene local index directly, not via remote search
> engine. It would be nice if we could use Lucene various API to the index
> directly, even though we could do the same thing to the Solr or Elasticsearch
> index. I assume we can do something to classification, categorization, and
> tagging, using e.g lucene-classification package.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)