Ah, you're right. There's an issue for this. You're welcome to submit a patch:

https://issues.apache.org/jira/browse/NUTCH-1140

I'll mark it for 1.5; it seems it isn't marked yet.
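
In the meantime, if you end up filtering the values yourself before they reach Solr, the idea is roughly the following. This is only a sketch in plain Python, not Nutch or Solr code: `pick_title` and the sample values are made up for illustration, and it assumes the parser's title is listed before the filename that comes from Content-Disposition.

```python
def pick_title(candidates):
    """Return the first non-empty title, or None if there is none.

    Assumption: the parser's own title comes first in the list, and the
    filename taken from the Content-Disposition header comes after it.
    """
    for value in candidates:
        title = value.strip()
        if title:  # skip empty or whitespace-only titles
            return title
    return None


# Hypothetical values, not taken from the real document.
candidates = ["", "Example document title", "example.pdf"]
print(pick_title(candidates))
```

Making the title field multiValued in the Solr schema, as suggested in the quoted messages below, avoids having to choose a single value at all.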

> Actually, it turns out it's a Nutch issue.  Tika outputs the correct title
> for the pdf.  However, the indexer-more plugin is adding in the filename
> due to the HTTP header "Content-Disposition".
> 
> Is there a way to turn this off while keeping the other functionality of
> the plugin?  I'd prefer not to have a bunch of tweaks in the Nutch code.
> 
> On Wed, Nov 2, 2011 at 10:11 AM, Markus Jelsma <[email protected]> wrote:
> > The output is a bit misleading indeed. The file has two valid titles and
> > two are being extracted. The title and the filename are both seen as
> > titles by Tika.
> > 
> > You can spot this behaviour better using the indexchecker tool.
> > 
> > Please consult the Tika wiki, docs or mailing list on how to proceed.
> > Either that or make your Solr schema field for title multiValued and
> > deal with it appropriately in your search front-end.
> > 
> > Cheers
> > 
> > On Wednesday 02 November 2011 15:02:11 Bai Shen wrote:
> > > Found it right after I asked. :)  BTW, the command is wrong on the
> > > wiki. I need to get around to making an account so I can fix things.
> > > 
> > > I ran it on the pdf url and it only gives me one title.  But it's
> > > pretty long.  Could that be the problem?
> > > 
> > > The url is
> > > http://www.sipri.org/yearbook/2011/files/SIPRIYB11summaryNL.pdf if you
> > > want to check yourself.
> > > 
> > > On Wed, Nov 2, 2011 at 9:18 AM, Markus Jelsma <[email protected]> wrote:
> > > > bin/nutch parsechecker <url>
> > > > 
> > > > see also:
> > > > http://wiki.apache.org/nutch/CommandLineOptions
> > > > 
> > > > On Wednesday 02 November 2011 14:16:10 Bai Shen wrote:
> > > > > Parsechecker tool?  Where do I find that?
> > > > > 
> > > > > On Tue, Nov 1, 2011 at 4:56 PM, Markus Jelsma <[email protected]> wrote:
> > > > > > > I'm running the latest version of 1.4.  We just rebuilt it last
> > > > > > > week. Is that patch included?
> > > > > > 
> > > > > > Yes, so you actually have more than one non-zero-length title
> > > > > > coming from your parser. Please try the parsechecker tool and
> > > > > > confirm, but I'm not sure it is capable of showing multiple titles.
> > > > > > 
> > > > > > > And where would it get multiple titles from?
> > > > > > 
> > > > > > Most likely from PDF or other document types. You can check with
> > > > > > a stand-alone Tika.
> > > > > > 
> > > > > > > How do I tell what the titles are so I can see if they're valid
> > > > > > > or not?
> > > > > > > 
> > > > > > > On Tue, Nov 1, 2011 at 4:33 PM, Markus Jelsma <[email protected]> wrote:
> > > > > > > > > This should work around the problem in most cases. The parser
> > > > > > > > > can output two titles, of which one is actually empty. This
> > > > > > > > > patch (in 1.4) skips empty titles.
> > > > > > > > > 
> > > > > > > > > If this doesn't work you really have two _valid_ titles
> > > > > > > > > coming from your document.
> > > > > > > > 
> > > > > > > > https://issues.apache.org/jira/browse/NUTCH-1004
> > > > > > > > 
> > > > > > > > > > It looks like the issue I'm encountering is the same one as
> > > > > > > > > > here.
> > > > > > > > > > http://lucene.472066.n3.nabble.com/multiple-values-encountered-for-non-multiValued-field-title-td1446817.html
> > > > > > > > > 
> > > > > > > > > > I'm not really sure what the linked bug is since that
> > > > > > > > > > involves the HTML parser and I'm seeing this problem with a
> > > > > > > > > > PDF file.
> > > > > > > > > 
> > > > > > > > > > On Tue, Nov 1, 2011 at 3:41 PM, Bai Shen <[email protected]> wrote:
> > > > > > > > > > I'm getting an exception when I try to commit to Solr.
> > > > > > > > > > Looking at the Solr log, it's showing that title is
> > > > > > > > > > getting multiple values when it's not a multivalue
> > > > > > > > > > field.  None of my code does anything with the title, so
> > > > > > > > > > I'm not sure why this is happening.
> > > > > > > > > > 
> > > > > > > > > > > How can I look at the pending commit and determine why
> > > > > > > > > > > and/or delete the extraneous values?  The document in
> > > > > > > > > > > question is a pdf if that makes a difference.
> > > > 
> > > > --
> > > > Markus Jelsma - CTO - Openindex
> > > > http://www.linkedin.com/in/markus17
> > > > 050-8536620 / 06-50258350
> > 
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
