Ninaad Joshi created NUTCH-2469:
-----------------------------------
Summary: Documents not committed to Solr in server mode
Key: NUTCH-2469
URL: https://issues.apache.org/jira/browse/NUTCH-2469
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 2.3.1
Reporter: Ninaad Joshi
Priority: Blocker
I found a discrepancy between the execution paths Nutch takes in local
standalone mode and in server mode.
In local standalone mode, once the indexing process is done, the document and
its fields are indexed and committed in Solr, and the document is returned if
queried immediately. When the same job is run through server mode, the document
is indexed but not committed in Solr, so it is not returned if queried
immediately. Only after Solr is restarted is the indexed document returned by a
query.
I browsed through IndexingJob.java to understand the cause and found the
following:
# There are two different entry paths for the local standalone mode and the
server mode
** Server mode entry point: public Map<String, Object> run(Map<String, Object>
args)
** Standalone mode entry point:
*** public int run(String[] args)
*** public void index(String batchId)
# The local standalone mode path does more work than the server mode path
** The public void index(String batchId) function first calls the server
mode path: public Map<String, Object> run(Map<String, Object> args)
** It then performs these additional steps:
*** Gets the IndexWriters
*** Describes the IndexWriters
*** Commits through the IndexWriters if COMMIT_INDEX=true is specified in the
configuration
** None of these additional steps is performed in server mode
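
The difference between the two paths can be illustrated with a small,
self-contained model. This is only a sketch and not the actual Nutch source:
the class CommitGapSketch, the nested Writer class, its pending and searchable
lists, and the "batch" argument key are all hypothetical stand-ins. Only the
three method signatures are taken from the description above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of the two entry paths; NOT the actual Nutch source. */
public class CommitGapSketch {

    /** Hypothetical stand-in for an index writer backed by Solr. */
    static class Writer {
        final List<String> pending = new ArrayList<>();    // sent, not yet visible
        final List<String> searchable = new ArrayList<>(); // visible to queries

        void write(String doc) {
            pending.add(doc);
        }

        void commit() {
            searchable.addAll(pending);
            pending.clear();
        }
    }

    private final Writer writer;
    private final boolean commitIndex; // models COMMIT_INDEX=true in the configuration

    CommitGapSketch(Writer writer, boolean commitIndex) {
        this.writer = writer;
        this.commitIndex = commitIndex;
    }

    /** Server mode entry point: indexes the batch but never commits. */
    public Map<String, Object> run(Map<String, Object> args) {
        writer.write("doc-for-batch-" + args.get("batch"));
        return new HashMap<>();
    }

    /** Standalone helper: calls the server mode path, then commits. */
    public void index(String batchId) {
        Map<String, Object> jobArgs = new HashMap<>();
        jobArgs.put("batch", batchId);
        run(jobArgs);
        if (commitIndex) {
            writer.commit();
        }
    }

    /** Standalone mode entry point. */
    public int run(String[] args) {
        if (args.length < 1) {
            return -1;
        }
        index(args[0]);
        return 0;
    }

    public static void main(String[] argv) {
        Writer standalone = new Writer();
        new CommitGapSketch(standalone, true).run(new String[] {"b1"});
        // prints "standalone searchable: 1"
        System.out.println("standalone searchable: " + standalone.searchable.size());

        Writer server = new Writer();
        Map<String, Object> jobArgs = new HashMap<>();
        jobArgs.put("batch", "b1");
        new CommitGapSketch(server, true).run(jobArgs);
        // prints "server searchable: 0"
        System.out.println("server searchable: " + server.searchable.size());
    }
}
```

In this model the server mode call leaves the document in the pending list
even though the commit flag is set, which mirrors the behaviour reported above.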
I believe both modes should share the same execution path, and therefore
propose to:
# Move the additional IndexWriters steps from public void index(String
batchId) to the end of the server mode execution path, i.e. the public
Map<String, Object> run(Map<String, Object> args) function
# Call public Map<String, Object> run(Map<String, Object> args) function
directly from Standalone mode entry point: public int run(String[] args)
# public void index(String batchId) then becomes redundant and can be safely
removed.
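
The first two steps of the proposal can be sketched as follows. This is a
simplified model, not the attached patch: the class UnifiedPathSketch, the
nested Writer class and the "batch" argument key are hypothetical, and only the
two run signatures come from the description above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of a single shared execution path; NOT the attached patch. */
public class UnifiedPathSketch {

    /** Hypothetical stand-in for an index writer backed by Solr. */
    static class Writer {
        final List<String> pending = new ArrayList<>();    // sent, not yet visible
        final List<String> searchable = new ArrayList<>(); // visible to queries

        void write(String doc) {
            pending.add(doc);
        }

        void commit() {
            searchable.addAll(pending);
            pending.clear();
        }
    }

    private final Writer writer;
    private final boolean commitIndex; // models COMMIT_INDEX=true in the configuration

    UnifiedPathSketch(Writer writer, boolean commitIndex) {
        this.writer = writer;
        this.commitIndex = commitIndex;
    }

    /** Server mode entry point, now ending with the commit step. */
    public Map<String, Object> run(Map<String, Object> args) {
        writer.write("doc-for-batch-" + args.get("batch"));
        if (commitIndex) {
            writer.commit();
        }
        return new HashMap<>();
    }

    /** Standalone mode entry point, calling the shared path directly. */
    public int run(String[] args) {
        if (args.length < 1) {
            return -1;
        }
        Map<String, Object> jobArgs = new HashMap<>();
        jobArgs.put("batch", args[0]);
        run(jobArgs);
        return 0;
    }
}
```

With the commit at the end of the shared method, both entry points leave the
document searchable in this model, and the commit flag is still honoured when
it is set to false.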
I have attached the proposed patch to this issue. Kindly review and approve
it.