[ https://issues.apache.org/jira/browse/NUTCH-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085055#comment-13085055 ]

Lewis John McGibbney commented on NUTCH-1004:
---------------------------------------------

No objections from me, Markus. Two neat little patches. What was your suggestion regarding the index-basic config option? I am curious.

> Do not index empty values for title field
> -----------------------------------------
>
>                 Key: NUTCH-1004
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1004
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3, 2.0
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1004-1.4.patch, NUTCH-1004-2.0.patch
>
>
> Tika can generate multiple values for the title field for some files such as
> certain PDFs. It seems parse-tika's DOMContentUtils.getTitle() and helper
> methods are responsible for this behaviour. We should add a check on this to
> prevent empty values for the title field.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira