Ninaad Joshi created NUTCH-2469:
-----------------------------------

             Summary: Documents not committed to Solr in server mode
                 Key: NUTCH-2469
                 URL: https://issues.apache.org/jira/browse/NUTCH-2469
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 2.3.1
            Reporter: Ninaad Joshi
            Priority: Blocker


I found a discrepancy between the execution paths Nutch takes in local 
standalone mode and in server mode.

In local standalone mode, once indexing finishes, the document and its fields 
are indexed and committed in Solr, and the document is returned if queried 
immediately. In server mode, the document is indexed but not committed in 
Solr, so it is not returned if queried immediately. Once Solr is restarted, 
the indexed document is returned when queried.

I browsed through IndexingJob.java to understand the cause and found the 
following:
# There are two different entry paths for the local standalone mode and the 
server mode
** Server mode entry point: public Map<String, Object> run(Map<String, Object> 
args)
** Standalone mode entry point: 
*** public int run(String[] args)
*** public void index(String batchId)
# The local standalone mode path does more work than the server mode path
** The public void index(String batchId) function first calls the server 
mode path: public Map<String, Object> run(Map<String, Object> args)
** It then does the following extra work
*** Gets IndexWriters
*** Uses IndexWriters to describe the writers
*** Uses IndexWriters to commit if COMMIT_INDEX=true is specified in the 
configuration
** None of this extra work is done in server mode
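
The discrepancy described above can be modelled with a small, self-contained 
sketch. This is not the Nutch source: FakeBackend, the "batch" argument key and 
the commitIndex flag are hypothetical stand-ins for the index backend, the 
batch id argument and the COMMIT_INDEX setting. Only the three method 
signatures are taken from the issue text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexingPathsSketch {

    /** Hypothetical in-memory stand-in for an index backend such as Solr. */
    public static class FakeBackend {
        public final List<String> pending = new ArrayList<>(); // indexed, not yet searchable
        public final List<String> visible = new ArrayList<>(); // committed, searchable

        public void add(String doc) {
            pending.add(doc);
        }

        public void commit() {
            visible.addAll(pending);
            pending.clear();
        }
    }

    private final FakeBackend backend;
    private final boolean commitIndex; // assumption: models COMMIT_INDEX=true

    public IndexingPathsSketch(FakeBackend backend, boolean commitIndex) {
        this.backend = backend;
        this.commitIndex = commitIndex;
    }

    /** Server mode entry point: indexes the document but never commits. */
    public Map<String, Object> run(Map<String, Object> args) {
        backend.add("doc-for-batch-" + args.get("batch"));
        return new HashMap<>();
    }

    /** Standalone path: calls run(Map), then performs the extra commit step. */
    public void index(String batchId) {
        Map<String, Object> args = new HashMap<>();
        args.put("batch", batchId);
        run(args);
        if (commitIndex) {
            backend.commit();
        }
    }

    /** Standalone mode entry point. */
    public int run(String[] args) {
        index(args[0]);
        return 0;
    }

    public static void main(String[] args) {
        FakeBackend standalone = new FakeBackend();
        new IndexingPathsSketch(standalone, true).run(new String[] {"b1"});
        System.out.println("standalone visible: " + standalone.visible);

        FakeBackend server = new FakeBackend();
        Map<String, Object> serverArgs = new HashMap<>();
        serverArgs.put("batch", "b1");
        new IndexingPathsSketch(server, true).run(serverArgs);
        System.out.println("server visible: " + server.visible
                + ", pending: " + server.pending);
    }
}
```

In this model the server path leaves the document in the pending list, which 
corresponds to a document that is indexed but not returned by queries.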

I think both modes should share the same execution path, and therefore 
propose to:
# Move the extra work done using IndexWriters in public void index(String 
batchId) to the end of the server mode execution path, i.e. the public 
Map<String, Object> run(Map<String, Object> args) function
# Call the public Map<String, Object> run(Map<String, Object> args) function 
directly from the standalone mode entry point: public int run(String[] args)
# public int run(String[] args) becomes redundant and can be safely removed.
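
A minimal sketch of the first two steps of this proposal, under stated 
assumptions, might look as follows. It is not the attached patch: FakeWriters 
is a hypothetical in-memory stand-in for IndexWriters, and the "batch" key and 
commitIndex flag are illustrative names for the batch id argument and the 
COMMIT_INDEX setting.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UnifiedIndexingSketch {

    /** Hypothetical stand-in for IndexWriters; not the real Nutch class. */
    public static class FakeWriters {
        public final List<String> pending = new ArrayList<>();
        public final List<String> visible = new ArrayList<>();

        public void write(String doc) {
            pending.add(doc);
        }

        public String describe() {
            return "FakeWriters (in-memory)";
        }

        public void commit() {
            visible.addAll(pending);
            pending.clear();
        }
    }

    private final FakeWriters writers;
    private final boolean commitIndex; // assumption: models COMMIT_INDEX=true

    public UnifiedIndexingSketch(FakeWriters writers, boolean commitIndex) {
        this.writers = writers;
        this.commitIndex = commitIndex;
    }

    /** Shared path for both modes: index, describe, then commit if configured. */
    public Map<String, Object> run(Map<String, Object> args) {
        writers.write("doc-for-batch-" + args.get("batch"));
        System.out.println(writers.describe());
        if (commitIndex) {
            writers.commit();
        }
        return new HashMap<>();
    }

    /** Standalone entry point: only translates arguments, then delegates. */
    public int run(String[] args) {
        Map<String, Object> argMap = new HashMap<>();
        argMap.put("batch", args[0]);
        run(argMap);
        return 0;
    }

    public static void main(String[] args) {
        FakeWriters writers = new FakeWriters();
        new UnifiedIndexingSketch(writers, true).run(new String[] {"b1"});
        System.out.println("visible after standalone run: " + writers.visible);
    }
}
```

Because the commit now lives in run(Map<String, Object> args), a caller that 
enters through the server mode method gets the same commit behaviour as the 
standalone entry point.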

I have attached the proposed patch to this issue. Please review and approve 
it.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
