The problem with that patch is that there's no way to disable the reset functionality without removing the entire plugin.
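To make concrete the kind of switch I mean, here is a rough sketch. It is only an illustration under stated assumptions: the property name `example.reset.title` and the function `choose_titles` are made up for this mail and are not Nutch's API, and it is written in Python rather than the plugin's Java to keep it short.

```python
import re

# Hypothetical property name, invented for this sketch (not a real Nutch setting).
RESET_TITLE_PROPERTY = "example.reset.title"


def choose_titles(parsed_title, content_disposition, conf):
    """Return the title values that would be indexed for one document."""
    titles = [parsed_title] if parsed_title else []
    # Only add the Content-Disposition filename when the switch is on.
    # Defaulting to True keeps the filename behaviour unless it is disabled.
    if conf.get(RESET_TITLE_PROPERTY, True) and content_disposition:
        match = re.search(r'filename="?([^";]+)"?', content_disposition)
        if match:
            titles.append(match.group(1))
    return titles
```

With the switch off, only the parsed title is kept, so a single-valued title field would receive one value.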
In our case, we want the pdf title, not the file name. Perhaps make it a property or split it out into its own plugin?

On Thu, Nov 3, 2011 at 3:00 PM, Markus Jelsma <[email protected]> wrote:

> Ah you're right. There's an issue for this. You're welcome to submit a
> patch:
>
> https://issues.apache.org/jira/browse/NUTCH-1140
>
> I'll mark it for 1.5, seems it isn't yet.
>
> > Actually, it turns out it's a Nutch issue. Tika outputs the correct title
> > for the pdf. However, the indexer-more plugin is adding in the filename
> > due to the HTTP header "Content-Disposition".
> >
> > Is there a way to turn this off while keeping the other functionality of
> > the plugin? I'd prefer not to have a bunch of tweaks in the Nutch code.
> >
> > On Wed, Nov 2, 2011 at 10:11 AM, Markus Jelsma <[email protected]> wrote:
> > > The output is a bit misleading indeed. The file has two valid titles and
> > > two are being extracted. The title and the filename are both seen as
> > > titles by Tika.
> > >
> > > You can spot this behaviour better using the indexchecker tool.
> > >
> > > Please consult the Tika wiki, docs or mailing list on how to proceed.
> > > Either that or make your Solr schema field for title multiValued and
> > > deal with it appropriately in your search front-end.
> > >
> > > Cheers
> > >
> > > On Wednesday 02 November 2011 15:02:11 Bai Shen wrote:
> > > > Found it right after I asked. :) BTW, the command is wrong on the
> > > > wiki. I need to get around to making an account so I can fix things.
> > > >
> > > > I ran it on the pdf url and it only gives me one title. But it's
> > > > pretty long. Could that be the problem?
> > > >
> > > > The url is
> > > > http://www.sipri.org/yearbook/2011/files/SIPRIYB11summaryNL.pdf if you
> > > > want to check yourself.
> > > >
> > > > On Wed, Nov 2, 2011 at 9:18 AM, Markus Jelsma <[email protected]> wrote:
> > > > > bin/nutch parsechecker <url>
> > > > >
> > > > > see also:
> > > > > http://wiki.apache.org/nutch/CommandLineOptions
> > > > >
> > > > > On Wednesday 02 November 2011 14:16:10 Bai Shen wrote:
> > > > > > Parsechecker tool? Where do I find that?
> > > > > >
> > > > > > On Tue, Nov 1, 2011 at 4:56 PM, Markus Jelsma <[email protected]> wrote:
> > > > > > > > I'm running the latest version of 1.4 We just rebuilt it last
> > > > > > > > week. Is that patch included?
> > > > > > >
> > > > > > > Yes, so you actually have more than one non-zero length titles
> > > > > > > coming from your parser. Please try the parsechecker tool and
> > > > > > > confirm, but i'm not sure it is capable of showing multiple
> > > > > > > titles.
> > > > > > >
> > > > > > > > And where would it get multiple titles from?
> > > > > > >
> > > > > > > Most likely from PDF or other document types. You can check with
> > > > > > > a stand-alone Tika.
> > > > > > >
> > > > > > > > How do I tell what the titles
> > > > > > > > are so I can see if they're valid or not?
> > > > > > > >
> > > > > > > > On Tue, Nov 1, 2011 at 4:33 PM, Markus Jelsma <[email protected]> wrote:
> > > > > > > > > This should work around the problem in most cases. The
> > > > > > > > > parser can output two titles of which one is actually
> > > > > > > > > empty. This patch (in 1.4) skips empty titles.
> > > > > > > > >
> > > > > > > > > If this doesn't work you really have two _valid_ titles
> > > > > > > > > coming from your document.
> > > > > > > > >
> > > > > > > > > https://issues.apache.org/jira/browse/NUTCH-1004
> > > > > > > > >
> > > > > > > > > > It looks like the issue I'm encountering is the same one
> > > > > > > > > > as here.
> > > > > > > > > > http://lucene.472066.n3.nabble.com/multiple-values-encountered-for-non-multiValued-field-title-td1446817.html
> > > > > > > > > >
> > > > > > > > > > I'm not really sure what the linked bug is since that
> > > > > > > > > > involves the HTML parser and I'm seeing this problem
> > > > > > > > > > with a PDF file.
> > > > > > > > > >
> > > > > > > > > > On Tue, Nov 1, 2011 at 3:41 PM, Bai Shen <[email protected]> wrote:
> > > > > > > > > > > I'm getting an exception when I try to commit to Solr.
> > > > > > > > > > > Looking at the Solr log, it's showing that title is
> > > > > > > > > > > getting multiple values when it's not a multivalue
> > > > > > > > > > > field. None of my code does anything with the title, so
> > > > > > > > > > > I'm not sure why this is happening.
> > > > > > > > > > >
> > > > > > > > > > > How can I look at the pending commit and determine why
> > > > > > > > > > > and/or delete the extraneous values? The document in
> > > > > > > > > > > question is a pdf if that makes a difference.
> > > > >
> > > > > --
> > > > > Markus Jelsma - CTO - Openindex
> > > > > http://www.linkedin.com/in/markus17
> > > > > 050-8536620 / 06-50258350
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350

