[ 
https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632230#comment-14632230
 ] 

Shinichiro Abe commented on CONNECTORS-1219:
--------------------------------------------

It will work if we just create a new IndexSearcher with a new IndexReader that 
takes an HdfsDirectory. 

As for the searcher, it depends on whether we use near-realtime search or not.
(1) Writer and searcher coexist.
This is the approach used by Solr/SolrCloud and Elasticsearch.
The IndexSearcher can search the documents the IndexWriter holds.
Even if writing to HDFS is slow, the IndexSearcher can search the in-memory, 
uncommitted documents from the IndexWriter.
(2) Separate the writer side from the searcher side.
This is like Solr's legacy master (writer) / slave (searcher) architecture, 
so we can't use near-realtime search.
The IndexSearcher searches the index on HDFS, which holds only the documents 
committed by the IndexWriter.
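
To make the visibility difference between (1) and (2) concrete, here is a 
minimal toy sketch. It uses plain Java collections only and is not the Lucene 
API; ToyWriter, nearRealtimeView and committedView are hypothetical names made 
up for illustration. In real Lucene, (1) roughly corresponds to opening the 
reader from the IndexWriter and (2) to opening it from the Directory, but the 
exact calls depend on the Lucene version.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: plain Java collections, NOT the Lucene API. */
class VisibilitySketch {

    /** Hypothetical stand-in for an index writer with a buffer and a durable store. */
    static class ToyWriter {
        // Stands in for committed segments on HDFS.
        private final List<String> committed = new ArrayList<>();
        // Stands in for uncommitted documents buffered in memory.
        private final List<String> buffered = new ArrayList<>();

        void add(String doc) {
            buffered.add(doc);
        }

        void commit() {
            committed.addAll(buffered);
            buffered.clear();
        }

        /** Approach (1): a view taken from the writer sees committed and buffered docs. */
        List<String> nearRealtimeView() {
            List<String> view = new ArrayList<>(committed);
            view.addAll(buffered);
            return view;
        }

        /** Approach (2): a view taken from storage sees committed docs only. */
        List<String> committedView() {
            return new ArrayList<>(committed);
        }
    }

    /** Returns true if any document in the view contains the term. */
    static boolean search(List<String> view, String term) {
        for (String doc : view) {
            if (doc.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter();
        writer.add("hello hdfs");
        System.out.println(search(writer.nearRealtimeView(), "hdfs")); // true
        System.out.println(search(writer.committedView(), "hdfs"));    // false
        writer.commit();
        System.out.println(search(writer.committedView(), "hdfs"));    // true
    }
}
```

The point of the sketch is only that in (2) a document is invisible to the 
searcher until commit, however slow or fast the write to storage is.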

Which one fits the MCF standard?

In Solr, Elasticsearch, Oak and Sling, documents are searchable as soon as 
clients post them. Oak and Sling are content repositories with a search index 
built on a push model (a client posts a document, which is then stored in the 
repository and indexed at the same time), though they are bound by the JCR 
standard. MCF, on the other hand, is a pull model. The search application 
behind the output connector is responsible for how soon documents become 
searchable. So according to the MCF standard, the Lucene connector will have 
to choose (2) with the plugin, but near-realtime search is lost. I intended 
(1) in the v0.3 patch.

BTW, Alfresco, Liferay and Drupal are also content repositories with 
pull-model crawls, or so I have heard, but they differ from MCF's document 
version checking: they can index documents using something like transaction 
info about document CRUD operations, which is managed on the repository side, 
so documents are indexed and searchable soon. MCF is bound by the limitations 
of the repository side, e.g. concurrent access limits (shared drive, web, 
Alfresco, CMIS, SharePoint… almost all repositories?) or heavy CPU load on the 
repository from multi-threaded access. Unfortunately, some users have told me 
that MCF crawls are slow. Of course I knew, and explained to them, that this 
is outside MCF's control, and then adjusted the repository side or customized 
existing connectors. As my first approach to this, I had the idea of indexing 
documents to local disk with Lucene, without any HTTP transport, and using 
near-realtime search over the writer's buffered documents, i.e. approach (1). 
Currently I have no idea how to address the repository-side limitations, 
though.

> Lucene Output Connector
> -----------------------
>
>                 Key: CONNECTORS-1219
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1219
>             Project: ManifoldCF
>          Issue Type: New Feature
>            Reporter: Shinichiro Abe
>            Assignee: Shinichiro Abe
>         Attachments: CONNECTORS-1219-v0.1patch.patch, 
> CONNECTORS-1219-v0.2.patch, CONNECTORS-1219-v0.3.patch
>
>
> An output connector that writes to a local Lucene index directly, not via a 
> remote search engine. It would be nice if we could use Lucene's various APIs 
> on the index directly, even though we could do the same thing with a Solr or 
> Elasticsearch index. I assume we can do classification, categorization, and 
> tagging, using e.g. the lucene-classification package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
