[ https://issues.apache.org/jira/browse/NUTCH-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746050#comment-16746050 ]
Sebastian Nagel commented on NUTCH-2687:
----------------------------------------

+1

Just for completeness, the HTTP headers for the given URL look like this:
{noformat}
HTTP/1.1 200 OK
...
Content-Transfer-Encoding: binary
Content-Disposition: inline; filename="Koopstra2016_Ontologically classifying ERP feature, the NEXT method_Final.pdf"; filename*=utf-8''Koopstra2016_Ontologically%20classifying%20ERP%20feature%2c%20the%20NEXT%20method_Final.pdf
Set-Cookie: .secureclient=si=fqoKPL_-eUqylz7xEh0lqw2; path=/; secure; HttpOnly
...
{noformat}

The Content-Disposition field contains further quotation marks after the quoted filename (the two apostrophes in filename*=utf-8''), so the regex should match only the substring up to the first closing quotation mark.

> Regex for reading title from Content-Disposition is wrong
> ---------------------------------------------------------
>
>                 Key: NUTCH-2687
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2687
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Major
>             Fix For: 1.16
>
>         Attachments: NUTCH-2687.patch
>
>
> Given URL:
> https://www.amuse-project.org/file/download/default/E6D0537647AF1204656076943F4729B0/Koopstra2016_5fOntologically%20classifying%20ERP%20feature,%20the%20NEXT%20method_5fFinal.pdf
> And regex: \\bfilename=['\"](.+)['\"]
> We get the following title:
> Koopstra2016_Ontologically classifying ERP feature, the NEXT method_Final.pdf"; filename*=utf-8'
> Changing the regex to \\bfilename=['\"]([^\"]+) fixes it.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
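
For reference, here is a minimal standalone sketch of the behaviour described in this issue. It is not the Nutch code or the attached patch. The class name ContentDispositionTitle and the helper extract() are hypothetical, and the sketch assumes the regex is applied to the Content-Disposition value shown above. The two patterns are the ones quoted in the issue, written as Java string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Nutch.
class ContentDispositionTitle {

  // Original regex from the issue: the greedy (.+) runs to the end of the
  // value and backtracks only to the LAST quote character it can find.
  static final Pattern GREEDY = Pattern.compile("\\bfilename=['\"](.+)['\"]");

  // Regex proposed in the issue: the capture stops before the first
  // double quote that follows the opening quote.
  static final Pattern FIXED = Pattern.compile("\\bfilename=['\"]([^\"]+)");

  // Hypothetical helper: returns capture group 1, or null if nothing matches.
  static String extract(Pattern pattern, String headerValue) {
    Matcher m = pattern.matcher(headerValue);
    return m.find() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    // Content-Disposition value copied from the HTTP response quoted above.
    String value = "inline; filename=\"Koopstra2016_Ontologically classifying ERP feature, "
        + "the NEXT method_Final.pdf\"; filename*=utf-8''Koopstra2016_Ontologically"
        + "%20classifying%20ERP%20feature%2c%20the%20NEXT%20method_Final.pdf";

    System.out.println("greedy: " + extract(GREEDY, value));
    System.out.println("fixed:  " + extract(FIXED, value));
  }
}
```

With the greedy pattern, the capture runs up to the first apostrophe of utf-8'' because the second apostrophe is the last quote character in the value. That reproduces the broken title reported in the issue. The fixed pattern yields only the quoted filename. Note that [^\"]+ stops only at a double quote, so a filename wrapped in single quotes would not be terminated by it.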