[jira] [Commented] (NUTCH-2482) index-geoip not to add null values to document fields
[ https://issues.apache.org/jira/browse/NUTCH-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939806#comment-16939806 ] ASF GitHub Bot commented on NUTCH-2482: --- lewismc commented on issue #476: NUTCH-2482 index-geoip not to add null values to document fields URL: https://github.com/apache/nutch/pull/476#issuecomment-536136377 A valuable addition @sebastian-nagel +1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > index-geoip not to add null values to document fields > - > > Key: NUTCH-2482 > URL: https://issues.apache.org/jira/browse/NUTCH-2482 > Project: Nutch > Issue Type: Bug > Components: indexer, plugin >Affects Versions: 1.13 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Labels: patch-available > Fix For: 1.16 > > > The plugin index-geoip may add null values to document fields which then > cause further errors, here a NPE in IndexingFiltersChecker when toString() is > called on null: > {noformat} > $ bin/nutch indexchecker -Dstore.ip.address=true > -Dindex.geoip.usage=cityDatabase \ > -Dplugin.includes="protocol-http|parse-html|index-(basic|geoip)" > http://www.example.com/ > ... > Exception in thread "main" java.lang.NullPointerException > at > org.apache.nutch.indexer.IndexingFiltersChecker.fetch(IndexingFiltersChecker.java:340) > at > org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:127) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at > org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:370) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (NUTCH-2482) index-geoip not to add null values to document fields
[ https://issues.apache.org/jira/browse/NUTCH-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2482: -- Assignee: Sebastian Nagel > index-geoip not to add null values to document fields > - > > Key: NUTCH-2482 > URL: https://issues.apache.org/jira/browse/NUTCH-2482 > Project: Nutch > Issue Type: Bug > Components: indexer, plugin >Affects Versions: 1.13 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Labels: patch-available > Fix For: 1.16 > > > The plugin index-geoip may add null values to document fields which then > cause further errors, here a NPE in IndexingFiltersChecker when toString() is > called on null: > {noformat} > $ bin/nutch indexchecker -Dstore.ip.address=true > -Dindex.geoip.usage=cityDatabase \ > -Dplugin.includes="protocol-http|parse-html|index-(basic|geoip)" > http://www.example.com/ > ... > Exception in thread "main" java.lang.NullPointerException > at > org.apache.nutch.indexer.IndexingFiltersChecker.fetch(IndexingFiltersChecker.java:340) > at > org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:127) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at > org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:370) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (NUTCH-2482) index-geoip not to add null values to document fields
[ https://issues.apache.org/jira/browse/NUTCH-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2482: --- Labels: patch-available (was: ) > index-geoip not to add null values to document fields > - > > Key: NUTCH-2482 > URL: https://issues.apache.org/jira/browse/NUTCH-2482 > Project: Nutch > Issue Type: Bug > Components: indexer, plugin >Affects Versions: 1.13 >Reporter: Sebastian Nagel >Priority: Minor > Labels: patch-available > Fix For: 1.16 > > > The plugin index-geoip may add null values to document fields which then > cause further errors, here a NPE in IndexingFiltersChecker when toString() is > called on null: > {noformat} > $ bin/nutch indexchecker -Dstore.ip.address=true > -Dindex.geoip.usage=cityDatabase \ > -Dplugin.includes="protocol-http|parse-html|index-(basic|geoip)" > http://www.example.com/ > ... > Exception in thread "main" java.lang.NullPointerException > at > org.apache.nutch.indexer.IndexingFiltersChecker.fetch(IndexingFiltersChecker.java:340) > at > org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:127) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at > org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:370) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-2482) index-geoip not to add null values to document fields
[ https://issues.apache.org/jira/browse/NUTCH-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939782#comment-16939782 ] ASF GitHub Bot commented on NUTCH-2482: --- sebastian-nagel commented on pull request #476: NUTCH-2482 index-geoip not to add null values to document fields URL: https://github.com/apache/nutch/pull/476 - also improve handling of errors when searching for and reading GeoIP database files - upgrade dependencies This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > index-geoip not to add null values to document fields > - > > Key: NUTCH-2482 > URL: https://issues.apache.org/jira/browse/NUTCH-2482 > Project: Nutch > Issue Type: Bug > Components: indexer, plugin >Affects Versions: 1.13 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 1.16 > > > The plugin index-geoip may add null values to document fields which then > cause further errors, here a NPE in IndexingFiltersChecker when toString() is > called on null: > {noformat} > $ bin/nutch indexchecker -Dstore.ip.address=true > -Dindex.geoip.usage=cityDatabase \ > -Dplugin.includes="protocol-http|parse-html|index-(basic|geoip)" > http://www.example.com/ > ... > Exception in thread "main" java.lang.NullPointerException > at > org.apache.nutch.indexer.IndexingFiltersChecker.fetch(IndexingFiltersChecker.java:340) > at > org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:127) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at > org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:370) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
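The change described above comes down to checking each lookup result before it is added to the document. The following is only a minimal sketch of that guard, not the plugin's actual code: `SimpleDoc`, `addIfNotNull`, and the field names are hypothetical stand-ins chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an index document: field name -> list of values. */
class SimpleDoc {
  private final Map<String, List<Object>> fields = new LinkedHashMap<>();

  /** Unconditional add; a null value would end up inside the field. */
  void add(String name, Object value) {
    fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
  }

  /** Guarded add: a null value is skipped, so no field ever holds null. */
  void addIfNotNull(String name, Object value) {
    if (value != null) {
      add(name, value);
    }
  }

  boolean hasField(String name) {
    return fields.containsKey(name);
  }

  List<Object> get(String name) {
    return fields.get(name);
  }
}

class NullGuardDemo {
  public static void main(String[] args) {
    SimpleDoc doc = new SimpleDoc();
    doc.addIfNotNull("countryName", "Germany"); // illustrative field name
    doc.addIfNotNull("cityName", null);         // lookup yielded nothing: skipped
    System.out.println("countryName present: " + doc.hasField("countryName"));
    System.out.println("cityName present: " + doc.hasField("cityName"));
  }
}
```

With such a guard, later code that calls `toString()` on every field value never sees a null, which is the failure mode shown in the stack trace of the issue description.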
[jira] [Updated] (NUTCH-685) Content-level redirect status lost in ParseSegment
[ https://issues.apache.org/jira/browse/NUTCH-685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-685: -- Fix Version/s: 1.17 > Content-level redirect status lost in ParseSegment > -- > > Key: NUTCH-685 > URL: https://issues.apache.org/jira/browse/NUTCH-685 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Andrzej Bialecki >Priority: Major > Fix For: 1.17 > > > When Fetcher runs in parsing mode, content-level redirects (HTML meta tag > "Refresh") are properly discovered and recorded in crawl_fetch under source > URL and target URL. If Fetcher runs in non-parsing mode, and ParseSegment is > run as a separate step, the content-level redirection data is used only to > add the new (target) URL, but the status of the original URL is not reset to > indicate a redirect. Consequently, status of the original URL will be > different depending on the way you run Fetcher, whereas it should be the same. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (NUTCH-2261) ParseSegment job does not pass metadata for content-level redirects
[ https://issues.apache.org/jira/browse/NUTCH-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2261: --- Fix Version/s: (was: 1.16) 1.17 > ParseSegment job does not pass metadata for content-level redirects > --- > > Key: NUTCH-2261 > URL: https://issues.apache.org/jira/browse/NUTCH-2261 > Project: Nutch > Issue Type: Bug > Components: metadata, parser >Affects Versions: 1.11, 1.12, 1.13 >Reporter: David Astle >Priority: Minor > Fix For: 1.17 > > > When Fetcher runs in parsing mode, CrawlDatum metadata is properly passed to > a new CrawlDatum for content-level redirects (HTML meta tag "Refresh"). If > Fetcher runs in non-parsing mode, and ParseSegment is run as a separate step, > then metadata other than "_repr_" is not passed to the new CrawlDatum. > This means that any filter relying on metadata, such as DepthScoringFilter > and URLMetaScoringFilter, will not work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (NUTCH-2710) Normalize outlinks before checking for internal or external links
[ https://issues.apache.org/jira/browse/NUTCH-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2710: --- Fix Version/s: (was: 1.16) 1.17 > Normalize outlinks before checking for internal or external links > - > > Key: NUTCH-2710 > URL: https://issues.apache.org/jira/browse/NUTCH-2710 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Major > Fix For: 1.17 > > Attachments: NUTCH-2710.patch > > > We have a normalizer that transforms external URLs back to internal URLs. But > those URLs are never passed to the normalizer, because they have already been > filtered out by internal and/or external host/domain checks in > parseOutputFormat.filterNormalize(). > This patch proposes to move the normalizers above the checks for > internal/external hosts/domains. -- This message was sent by Atlassian Jira (v8.3.4#803005)
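The effect of the proposed reordering can be illustrated with a small, self-contained sketch. Everything here is hypothetical (the class, the method names, the example hosts, and the normalizer rule); it is not the code of the Nutch method mentioned above, only a model of "check then normalize" versus "normalize then check".

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

class OutlinkOrderDemo {
  /** An outlink counts as internal if its host equals the source host. */
  static boolean isInternal(String url, String sourceHost) {
    String host = URI.create(url).getHost();
    return host != null && host.equalsIgnoreCase(sourceHost);
  }

  /** Old order: the host check runs first, so external URLs never reach the normalizer. */
  static List<String> checkThenNormalize(List<String> outlinks, String sourceHost,
      UnaryOperator<String> normalizer) {
    List<String> kept = new ArrayList<>();
    for (String url : outlinks) {
      if (isInternal(url, sourceHost)) {
        kept.add(normalizer.apply(url));
      }
    }
    return kept;
  }

  /** Proposed order: normalize first, then apply the host check to the result. */
  static List<String> normalizeThenCheck(List<String> outlinks, String sourceHost,
      UnaryOperator<String> normalizer) {
    List<String> kept = new ArrayList<>();
    for (String url : outlinks) {
      String normalized = normalizer.apply(url);
      if (isInternal(normalized, sourceHost)) {
        kept.add(normalized);
      }
    }
    return kept;
  }

  /** Hypothetical normalizer rule: map a mirror host back to the main host. */
  static final UnaryOperator<String> MIRROR_TO_MAIN =
      url -> url.replace("//mirror.example.org/", "//www.example.com/");

  public static void main(String[] args) {
    List<String> outlinks = new ArrayList<>();
    outlinks.add("http://www.example.com/a");
    outlinks.add("http://mirror.example.org/b");
    System.out.println(checkThenNormalize(outlinks, "www.example.com", MIRROR_TO_MAIN));
    System.out.println(normalizeThenCheck(outlinks, "www.example.com", MIRROR_TO_MAIN));
  }
}
```

In this model the mirror URL is dropped under the old order and kept, in its normalized form, under the proposed one.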
[jira] [Commented] (NUTCH-2732) Ignored and tracked configuration files by git
[ https://issues.apache.org/jira/browse/NUTCH-2732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939568#comment-16939568 ] ASF GitHub Bot commented on NUTCH-2732: --- sebastian-nagel commented on issue #475: NUTCH-2732: nutch-default.xml as a non-template file. URL: https://github.com/apache/nutch/pull/475#issuecomment-535999591 +1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Ignored and tracked configuration files by git > -- > > Key: NUTCH-2732 > URL: https://issues.apache.org/jira/browse/NUTCH-2732 > Project: Nutch > Issue Type: Improvement > Components: build >Affects Versions: 1.15 >Reporter: Roannel Fernández Hernández >Assignee: Roannel Fernández Hernández >Priority: Trivial > Fix For: 1.16 > > > In folder conf/ there are files that are ignored and tracked by git at the > same time. A way to solve this is creating {{*.template}} files for those > files. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-2732) Ignored and tracked configuration files by git
[ https://issues.apache.org/jira/browse/NUTCH-2732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939542#comment-16939542 ] ASF GitHub Bot commented on NUTCH-2732: --- r0ann3l commented on pull request #475: NUTCH-2732: nutch-default.xml as a non-template file. URL: https://github.com/apache/nutch/pull/475 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Ignored and tracked configuration files by git > -- > > Key: NUTCH-2732 > URL: https://issues.apache.org/jira/browse/NUTCH-2732 > Project: Nutch > Issue Type: Improvement > Components: build >Affects Versions: 1.15 >Reporter: Roannel Fernández Hernández >Assignee: Roannel Fernández Hernández >Priority: Trivial > Fix For: 1.16 > > > In folder conf/ there are files that are ignored and tracked by git at the > same time. A way to solve this is creating {{*.template}} files for those > files. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-2732) Ignored and tracked configuration files by git
[ https://issues.apache.org/jira/browse/NUTCH-2732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939537#comment-16939537 ] Roannel Fernández Hernández commented on NUTCH-2732: Hi [~snagel]. You may be right. But in this case we need to specify it in the .gitignore file to avoid ignoring this tracked file. > Ignored and tracked configuration files by git > -- > > Key: NUTCH-2732 > URL: https://issues.apache.org/jira/browse/NUTCH-2732 > Project: Nutch > Issue Type: Improvement > Components: build >Affects Versions: 1.15 >Reporter: Roannel Fernández Hernández >Assignee: Roannel Fernández Hernández >Priority: Trivial > Fix For: 1.16 > > > In folder conf/ there are files that are ignored and tracked by git at the > same time. A way to solve this is creating {{*.template}} files for those > files. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika
[ https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939516#comment-16939516 ] Tim Allison edited comment on NUTCH-2457 at 9/27/19 2:55 PM: - W00t! Default is to parse embedded, right? :D Wouldn't want to break backwards compatibility! Kidding...I'm kidding... Sorry, and thank you! was (Author: talli...@mitre.org): W00t! Default is to parse embedded, right? :D > Embedded documents likely not correctly parsed by Tika > -- > > Key: NUTCH-2457 > URL: https://issues.apache.org/jira/browse/NUTCH-2457 > Project: Nutch > Issue Type: Bug > Components: parser, plugin >Affects Versions: 1.14 >Reporter: Tim Allison >Priority: Major > Labels: patch-available > Fix For: 1.16 > > > While working on TIKA-2490, I think I found that Nutch's current method of > requesting a mime-specific parser for each file will fail to parse embedded > files, e.g. > https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx > The fix should be straightforward, and I'll submit a PR once I can get Nutch > up and running in my dev environment. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (NUTCH-2381) In some situations the class TextProfileSignature gives different signatures for the same text "profile" page.
[ https://issues.apache.org/jira/browse/NUTCH-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2381: --- Labels: patch-available signature (was: signature) > In some situations the class TextProfileSignature gives different signatures > for the same text "profile" page. > -- > > Key: NUTCH-2381 > URL: https://issues.apache.org/jira/browse/NUTCH-2381 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 1.13 >Reporter: Rodrigo Joni Sestari >Assignee: Sebastian Nagel >Priority: Major > Labels: patch-available, signature > Fix For: 1.16 > > > In some situations the class TextProfileSignature gives different signatures > for the same text "profile" page. > The method TextProfileSignature.calculate uses a HashMap to store the tokens; > after some processing, the tokens are sorted by decreasing frequency. > For some pages, like "http://curia.europa.eu/jcms/", the text "profile" is the > same but the signature differs between fetches. > This happens because the tokens are sorted only by decreasing frequency. > Tokens with the same frequency may not have the same order in different > fetches. > HashMap makes no guarantees as to the order of the map, nor that > the order will remain constant over time. > My suggestion is to change the method TokenComparator.compare to sort > by frequency and name. > Rodrigo -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika
[ https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2457: --- Component/s: plugin parser > Embedded documents likely not correctly parsed by Tika > -- > > Key: NUTCH-2457 > URL: https://issues.apache.org/jira/browse/NUTCH-2457 > Project: Nutch > Issue Type: Bug > Components: parser, plugin >Affects Versions: 1.14 >Reporter: Tim Allison >Priority: Major > Fix For: 1.16 > > > While working on TIKA-2490, I think I found that Nutch's current method of > requesting a mime-specific parser for each file will fail to parse embedded > files, e.g. > https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx > The fix should be straightforward, and I'll submit a PR once I can get Nutch > up and running in my dev environment. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika
[ https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939509#comment-16939509 ] Sebastian Nagel edited comment on NUTCH-2457 at 9/27/19 2:49 PM: - Thanks, [~talli...@apache.org], got it. Implemented solution 2: it works. Optionally, parsing of embedded documents can be turned off via the new property "tika.parse.embedded". was (Author: wastl-nagel): Thanks, [~talli...@apache.org], got it. Will implement solution 2. > Embedded documents likely not correctly parsed by Tika > -- > > Key: NUTCH-2457 > URL: https://issues.apache.org/jira/browse/NUTCH-2457 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Tim Allison >Priority: Major > Fix For: 1.16 > > > While working on TIKA-2490, I think I found that Nutch's current method of > requesting a mime-specific parser for each file will fail to parse embedded > files, e.g. > https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx > The fix should be straightforward, and I'll submit a PR once I can get Nutch > up and running in my dev environment. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (NUTCH-2732) Ignored and tracked configuration files by git
[ https://issues.apache.org/jira/browse/NUTCH-2732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reopened NUTCH-2732: Hi [~roannel], one point: the file conf/nutch-default.xml has also been moved to a template. As this file is mostly for documenting properties and is considered never to be changed (overrides should go to nutch-site.xml), I suggest keeping the old name to avoid confusion. What do you think? > Ignored and tracked configuration files by git > -- > > Key: NUTCH-2732 > URL: https://issues.apache.org/jira/browse/NUTCH-2732 > Project: Nutch > Issue Type: Improvement > Components: build >Affects Versions: 1.15 >Reporter: Roannel Fernández Hernández >Assignee: Roannel Fernández Hernández >Priority: Trivial > Fix For: 1.16 > > > In folder conf/ there are files that are ignored and tracked by git at the > same time. A way to solve this is creating {{*.template}} files for those > files. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-2381) In some situations the class TextProfileSignature gives different signatures for the same text "profile" page.
[ https://issues.apache.org/jira/browse/NUTCH-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939500#comment-16939500 ] Sebastian Nagel commented on NUTCH-2381: Add to conf/nutch-default.xml or *.template (cf. NUTCH-2732): {noformat} <property> <name>db.signature.text_profile.sec_sort_lex</name> <value>true</value> <description> Whether the TextProfileSignature class should sort words also lexicographically to avoid changing signatures due to unstable hash sorting. Default is `true`, set to `false` to ensure backward-compatibility with CrawlDbs written by Nutch 1.15 or prior, see also NUTCH-2381. </description> </property> {noformat} > In some situations the class TextProfileSignature gives different signatures > for the same text "profile" page. > -- > > Key: NUTCH-2381 > URL: https://issues.apache.org/jira/browse/NUTCH-2381 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 1.13 >Reporter: Rodrigo Joni Sestari >Assignee: Sebastian Nagel >Priority: Major > Labels: signature > Fix For: 1.16 > > > In some situations the class TextProfileSignature gives different signatures > for the same text "profile" page. > The method TextProfileSignature.calculate uses a HashMap to store the tokens; > after some processing, the tokens are sorted by decreasing frequency. > For some pages, like "http://curia.europa.eu/jcms/", the text "profile" is the > same but the signature differs between fetches. > This happens because the tokens are sorted only by decreasing frequency. > Tokens with the same frequency may not have the same order in different > fetches. > HashMap makes no guarantees as to the order of the map, nor that > the order will remain constant over time. > My suggestion is to change the method TokenComparator.compare to sort > by frequency and name. > Rodrigo -- This message was sent by Atlassian Jira (v8.3.4#803005)
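The suggested fix, a secondary sort key so that ties in frequency no longer depend on HashMap iteration order, can be sketched as follows. This is a self-contained illustration, not the TextProfileSignature code: the `Token` class, the comparator, and `sortedTokens` are hypothetical names, and quantization or hashing of the profile is left out entirely.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StableTokenSortDemo {
  /** Hypothetical token record: a word and how often it occurs. */
  static final class Token {
    final String text;
    final int count;

    Token(String text, int count) {
      this.text = text;
      this.count = count;
    }
  }

  /** Decreasing frequency first; ties are broken lexicographically by the word. */
  static final Comparator<Token> BY_FREQ_THEN_TEXT = (a, b) -> {
    int byCount = Integer.compare(b.count, a.count);
    return byCount != 0 ? byCount : a.text.compareTo(b.text);
  };

  /** Counts the words in a HashMap, then returns them in a deterministic order. */
  static List<String> sortedTokens(String[] words) {
    Map<String, Integer> counts = new HashMap<>();
    for (String w : words) {
      counts.merge(w, 1, Integer::sum);
    }
    List<Token> tokens = new ArrayList<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      tokens.add(new Token(e.getKey(), e.getValue()));
    }
    tokens.sort(BY_FREQ_THEN_TEXT);
    List<String> result = new ArrayList<>();
    for (Token t : tokens) {
      result.add(t.text);
    }
    return result;
  }

  public static void main(String[] args) {
    // "a" and "b" occur twice, "c" and "d" once; ties resolve alphabetically.
    System.out.println(sortedTokens(new String[] {"b", "a", "c", "a", "b", "d"}));
  }
}
```

Because the comparator defines a total order over distinct words, the result no longer depends on the order in which the map happens to iterate its entries.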
[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika
[ https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939478#comment-16939478 ] Tim Allison commented on NUTCH-2457: The issue is that the AutoDetectParser automatically/silently adds itself as a parser to the ParseContext. When an embedded document is parsed, there's a lookup for the embedded parser in the ParseContext. Because you weren't using the AutoDetectParser, there is no parser in ParseContext, and the embedded documents are not being parsed. So, you have two options (maybe more...): 1) use the AutoDetectParser; set https://tika.apache.org/1.17/api/org/apache/tika/metadata/TikaCoreProperties.html#CONTENT_TYPE_OVERRIDE to the mime, and you'll avoid a second detection for the container file 2) Use your current method, but add a cached AutoDetectParser to the ParseContext > Embedded documents likely not correctly parsed by Tika > -- > > Key: NUTCH-2457 > URL: https://issues.apache.org/jira/browse/NUTCH-2457 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Tim Allison >Priority: Major > Fix For: 1.16 > > > While working on TIKA-2490, I think I found that Nutch's current method of > requesting a mime-specific parser for each file will fail to parse embedded > files, e.g. > https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx > The fix should be straightforward, and I'll submit a PR once I can get Nutch > up and running in my dev environment. -- This message was sent by Atlassian Jira (v8.3.4#803005)
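The lookup mechanism described in this comment can be modelled with a few toy classes. The sketch below deliberately does not use Tika: `ToyContext`, `ToyDoc`, `ToyParser`, and `ToyContainerParser` are hypothetical stand-ins that only mimic the idea of a type-keyed context in which the container parser looks up the parser for embedded documents. In Tika itself, option 2 would presumably amount to registering an AutoDetectParser in the ParseContext under `Parser.class` before parsing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a parse context: a type-keyed registry of helper objects. */
class ToyContext {
  private final Map<Class<?>, Object> entries = new HashMap<>();

  <T> void set(Class<T> key, T value) {
    entries.put(key, value);
  }

  /** Returns the registered object, or null if nothing was registered. */
  <T> T get(Class<T> key) {
    return key.cast(entries.get(key));
  }
}

/** Toy document: some text plus embedded child documents. */
class ToyDoc {
  final String text;
  final List<ToyDoc> embedded = new ArrayList<>();

  ToyDoc(String text) {
    this.text = text;
  }

  ToyDoc embed(ToyDoc child) {
    embedded.add(child);
    return this;
  }
}

interface ToyParser {
  void parse(ToyDoc doc, ToyContext context, List<String> out);
}

/** Emits its own text, then hands embedded documents to the parser found in the context. */
class ToyContainerParser implements ToyParser {
  @Override
  public void parse(ToyDoc doc, ToyContext context, List<String> out) {
    out.add(doc.text);
    ToyParser embeddedParser = context.get(ToyParser.class);
    if (embeddedParser == null) {
      return; // nothing registered: embedded documents are silently skipped
    }
    for (ToyDoc child : doc.embedded) {
      embeddedParser.parse(child, context, out);
    }
  }
}

class EmbeddedLookupDemo {
  static List<String> run(boolean registerParser) {
    ToyDoc doc = new ToyDoc("outer").embed(new ToyDoc("embed_0"));
    ToyParser parser = new ToyContainerParser();
    ToyContext context = new ToyContext();
    if (registerParser) {
      context.set(ToyParser.class, parser); // the step that was missing
    }
    List<String> out = new ArrayList<>();
    parser.parse(doc, context, out);
    return out;
  }

  public static void main(String[] args) {
    System.out.println("without registration: " + run(false));
    System.out.println("with registration:    " + run(true));
  }
}
```

In this model the embedded text only shows up once a parser has been placed in the context, which mirrors the two options listed above: either use a parser that registers itself, or register one explicitly.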
[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika
[ https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939473#comment-16939473 ] Tim Allison commented on NUTCH-2457: Let me take a look at the code again...it has been a while... > Embedded documents likely not correctly parsed by Tika > -- > > Key: NUTCH-2457 > URL: https://issues.apache.org/jira/browse/NUTCH-2457 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Tim Allison >Priority: Major > Fix For: 1.16 > > > While working on TIKA-2490, I think I found that Nutch's current method of > requesting a mime-specific parser for each file will fail to parse embedded > files, e.g. > https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx > The fix should be straightforward, and I'll submit a PR once I can get Nutch > up and running in my dev environment. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika
[ https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939453#comment-16939453 ] Sebastian Nagel commented on NUTCH-2457: [~talli...@apache.org], using the AutoDetectParser instead of the CompositeParser fixes the issue, also when running the parser checker (i.e. parse-tika run in the encapsulated plugin class loader). The price is probably that MIME detection is done a second time for the outer document, although it might not be, because the MIME type is passed via Tika metadata. It should be checked whether there is a significant performance impact. > Embedded documents likely not correctly parsed by Tika > -- > > Key: NUTCH-2457 > URL: https://issues.apache.org/jira/browse/NUTCH-2457 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Tim Allison >Priority: Major > Fix For: 1.16 > > > While working on TIKA-2490, I think I found that Nutch's current method of > requesting a mime-specific parser for each file will fail to parse embedded > files, e.g. > https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx > The fix should be straightforward, and I'll submit a PR once I can get Nutch > up and running in my dev environment. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika
[ https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939430#comment-16939430 ] Sebastian Nagel edited comment on NUTCH-2457 at 9/27/19 1:12 PM: - Actually, Nutch calls {noformat} CompositeParser compositeParser = (CompositeParser) tikaConfig.getParser(); Parser parser = compositeParser.getParsers().get(MediaType.parse(mimeType)); {noformat} But this works for embedded documents only in a unit test I've just added (see PR). When running the parser checker, the embedded documents are not parsed: {noformat} > bin/nutch parsechecker -Dplugin.includes='protocol-file|parse-tika' -dumpText > file:/.../test_recursive_embedded.docx ... contentType: application/vnd.openxmlformats-officedocument.wordprocessingml.document ... embed_0 {noformat} [~talli...@apache.org], is this caused by the way the parser is called? It might also be because parse-tika (as a Nutch plugin) holds the tika-parsers jar in the plugin class loader while the tika-core jar is in the main class loader (because it is required for MIME detection). was (Author: wastl-nagel): Actually, Nutch calls {noformat} CompositeParser compositeParser = (CompositeParser) tikaConfig.getParser(); Parser parser = compositeParser.getParsers().get(MediaType.parse(mimeType)); {noformat} But it works for embedded documents only for a unit test, I've just added (see PR). Running the parser checker, the embedded documents are not parsed: {noformat} > parsechecker -Dplugin.includes='protocol-file|parse-tika' -dumpText > file:/.../test_recursive_embedded.docx ... contentType: application/vnd.openxmlformats-officedocument.wordprocessingml.document ... embed_0 {noformat} [~talli...@apache.org], is this caused by the way the parser is called? It might be also because of parse-tika (as a Nutch plugin) holds the tika-parsers jar in the plugin class loader while the tika-core jar is in the main class loader (because it is required for MIME detection).
> Embedded documents likely not correctly parsed by Tika > -- > > Key: NUTCH-2457 > URL: https://issues.apache.org/jira/browse/NUTCH-2457 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Tim Allison >Priority: Major > Fix For: 1.16 > > > While working on TIKA-2490, I think I found that Nutch's current method of > requesting a mime-specific parser for each file will fail to parse embedded > files, e.g. > https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx > The fix should be straightforward, and I'll submit a PR once I can get Nutch > up and running in my dev environment. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika
[ https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939431#comment-16939431 ] ASF GitHub Bot commented on NUTCH-2457: --- sebastian-nagel commented on issue #474: NUTCH-2457 Embedded documents likely not correctly parsed by Tika URL: https://github.com/apache/nutch/pull/474#issuecomment-535932438 Yes, please see my comment in [Jira](https://issues.apache.org/jira/browse/NUTCH-2457). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Embedded documents likely not correctly parsed by Tika > -- > > Key: NUTCH-2457 > URL: https://issues.apache.org/jira/browse/NUTCH-2457 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Tim Allison >Priority: Major > Fix For: 1.16 > > > While working on TIKA-2490, I think I found that Nutch's current method of > requesting a mime-specific parser for each file will fail to parse embedded > files, e.g. > https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx > The fix should be straightforward, and I'll submit a PR once I can get Nutch > up and running in my dev environment. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika
[ https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939428#comment-16939428 ] ASF GitHub Bot commented on NUTCH-2457: --- tballison commented on issue #474: NUTCH-2457 Embedded documents likely not correctly parsed by Tika URL: https://github.com/apache/nutch/pull/474#issuecomment-535932051 >Embedded documents likely not correctly parsed by Tika Can we help? > Embedded documents likely not correctly parsed by Tika > -- > > Key: NUTCH-2457 > URL: https://issues.apache.org/jira/browse/NUTCH-2457 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Tim Allison >Priority: Major > Fix For: 1.16 > > > While working on TIKA-2490, I think I found that Nutch's current method of > requesting a mime-specific parser for each file will fail to parse embedded > files, e.g. > https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx > The fix should be straightforward, and I'll submit a PR once I can get Nutch > up and running in my dev environment. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika
[ https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939430#comment-16939430 ] Sebastian Nagel commented on NUTCH-2457: Actually, Nutch calls {noformat} CompositeParser compositeParser = (CompositeParser) tikaConfig.getParser(); Parser parser = compositeParser.getParsers().get(MediaType.parse(mimeType)); {noformat} But parsing embedded documents works only in a unit test I've just added (see PR). When running the parser checker, the embedded documents are not parsed: {noformat} > parsechecker -Dplugin.includes='protocol-file|parse-tika' -dumpText > file:/.../test_recursive_embedded.docx ... contentType: application/vnd.openxmlformats-officedocument.wordprocessingml.document ... embed_0 {noformat} [~talli...@apache.org], is this caused by the way the parser is called? It might also be because parse-tika (as a Nutch plugin) holds the tika-parsers jar in the plugin class loader while the tika-core jar is in the main class loader (because it is required for MIME detection). > Embedded documents likely not correctly parsed by Tika > -- > > Key: NUTCH-2457 > URL: https://issues.apache.org/jira/browse/NUTCH-2457 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Tim Allison >Priority: Major > Fix For: 1.16 > > > While working on TIKA-2490, I think I found that Nutch's current method of > requesting a mime-specific parser for each file will fail to parse embedded > files, e.g. > https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx > The fix should be straightforward, and I'll submit a PR once I can get Nutch > up and running in my dev environment. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika
[ https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939422#comment-16939422 ] ASF GitHub Bot commented on NUTCH-2457: --- sebastian-nagel commented on pull request #474: NUTCH-2457 Embedded documents likely not correctly parsed by Tika URL: https://github.com/apache/nutch/pull/474 - add unit test for embedded documents > Embedded documents likely not correctly parsed by Tika > -- > > Key: NUTCH-2457 > URL: https://issues.apache.org/jira/browse/NUTCH-2457 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Tim Allison >Priority: Major > Fix For: 1.16 > > > While working on TIKA-2490, I think I found that Nutch's current method of > requesting a mime-specific parser for each file will fail to parse embedded > files, e.g. > https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx > The fix should be straightforward, and I'll submit a PR once I can get Nutch > up and running in my dev environment. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1403: --- Fix Version/s: (was: 1.17) 1.16 > Add default ScoringFilter for manipulating metadata > > > Key: NUTCH-1403 > URL: https://issues.apache.org/jira/browse/NUTCH-1403 > Project: Nutch > Issue Type: Improvement >Reporter: Julien Nioche >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.16 > > > This is currently done by the urlmeta plugin, which has too vague a name and > a redundant indexing filter now that we have the index-metadata plugin. This > scoring filter would help define which metadata to pass from: > - the crawl metadata to the content metadata > - the content metadata to the parse metadata > - the parse metadata to the crawldatum for the outlinks > I'd make this scoring filter available by default, i.e. not in a separate > plugin, as its functionality is commonly used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (NUTCH-1403) Add default ScoringFilter for manipulating metadata
[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-1403: -- Assignee: Sebastian Nagel > Add default ScoringFilter for manipulating metadata > > > Key: NUTCH-1403 > URL: https://issues.apache.org/jira/browse/NUTCH-1403 > Project: Nutch > Issue Type: Improvement >Reporter: Julien Nioche >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.17 > > > This is currently done by the urlmeta plugin, which has too vague a name and > a redundant indexing filter now that we have the index-metadata plugin. This > scoring filter would help define which metadata to pass from: > - the crawl metadata to the content metadata > - the content metadata to the parse metadata > - the parse metadata to the crawldatum for the outlinks > I'd make this scoring filter available by default, i.e. not in a separate > plugin, as its functionality is commonly used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-2381) In some situations the class TextProfileSignature gives different signatures for the same text "profile" page.
[ https://issues.apache.org/jira/browse/NUTCH-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939372#comment-16939372 ] ASF GitHub Bot commented on NUTCH-2381: --- sebastian-nagel commented on pull request #473: NUTCH-2381 In some situations the class TextProfileSignature gives different signatures for the same text "profile" page URL: https://github.com/apache/nutch/pull/473 - implement secondary sorting, similar to the patch provided by Rodrigo Joni Sestari - allow restoring the previous behavior by setting property `db.signature.text_profile.sec_sort_lex = false` > In some situations the class TextProfileSignature gives different signatures > for the same text "profile" page. > -- > > Key: NUTCH-2381 > URL: https://issues.apache.org/jira/browse/NUTCH-2381 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 1.13 >Reporter: Rodrigo Joni Sestari >Assignee: Sebastian Nagel >Priority: Major > Labels: signature > Fix For: 1.16 > > > In some situations the class TextProfileSignature gives different signatures > for the same text "profile" page. > The method TextProfileSignature.calculate uses a HashMap to store the tokens; > after some processing, the tokens are sorted by decreasing frequency. > For some pages like "http://curia.europa.eu/jcms/" the text "profile" is the > same but the signature differs for each fetch. > This happens because the tokens are sorted only by decreasing frequency. > Tokens with the same frequency may not have the same order in different > fetches. > HashMap makes no guarantees as to the order of the map, nor does it guarantee that > the order will remain constant over time. > My suggestion is to change the method TokenComparator.compare in order to sort > by frequency and name. > Rodrigo -- This message was sent by Atlassian Jira (v8.3.4#803005)
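The suggestion above, sorting by frequency and then by name, can be sketched as a comparator with a secondary key. This is only a minimal illustration under stated assumptions, not the actual Nutch code: the class `StableTokenOrder`, the `Token` holder, and the `sortTokens` helper are hypothetical names introduced here, and the real `TokenComparator` in `TextProfileSignature` may look different.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StableTokenOrder {

  /** Hypothetical token holder: a term and how often it occurred. */
  static final class Token {
    final String term;
    final int count;

    Token(String term, int count) {
      this.term = term;
      this.count = count;
    }

    @Override
    public String toString() {
      return term + ":" + count;
    }
  }

  /** Primary key: decreasing frequency. Secondary key: term, lexicographically. */
  static final Comparator<Token> BY_FREQ_THEN_TERM =
      Comparator.comparingInt((Token t) -> t.count)
          .reversed()
          .thenComparing((Token t) -> t.term);

  /** Copies the map entries into a list and sorts them deterministically. */
  public static List<Token> sortTokens(Map<String, Integer> counts) {
    List<Token> tokens = new ArrayList<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      tokens.add(new Token(e.getKey(), e.getValue()));
    }
    // Ties in frequency are broken by the term, so the result no longer
    // depends on the (unspecified) HashMap iteration order.
    tokens.sort(BY_FREQ_THEN_TERM);
    return tokens;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = new HashMap<>();
    counts.put("nutch", 3);
    counts.put("tika", 2);
    counts.put("crawl", 2);
    counts.put("index", 2);
    System.out.println(sortTokens(counts)); // prints [nutch:3, crawl:2, index:2, tika:2]
  }
}
```

With only the frequency as sort key, the three tokens with count 2 could appear in any order depending on how the map iterates. The secondary key makes the order, and therefore any signature computed from it, reproducible.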
[jira] [Assigned] (NUTCH-2381) In some situations the class TextProfileSignature gives different signatures for the same text "profile" page.
[ https://issues.apache.org/jira/browse/NUTCH-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2381: -- Assignee: Sebastian Nagel > In some situations the class TextProfileSignature gives different signatures > for the same text "profile" page. > -- > > Key: NUTCH-2381 > URL: https://issues.apache.org/jira/browse/NUTCH-2381 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 1.13 >Reporter: Rodrigo Joni Sestari >Assignee: Sebastian Nagel >Priority: Major > Labels: signature > Fix For: 1.16 > > > In some situations the class TextProfileSignature gives different signatures > for the same text "profile" page. > The method TextProfileSignature.calculate uses a HashMap to store the tokens; > after some processing, the tokens are sorted by decreasing frequency. > For some pages like "http://curia.europa.eu/jcms/" the text "profile" is the > same but the signature differs for each fetch. > This happens because the tokens are sorted only by decreasing frequency. > Tokens with the same frequency may not have the same order in different > fetches. > HashMap makes no guarantees as to the order of the map, nor does it guarantee that > the order will remain constant over time. > My suggestion is to change the method TokenComparator.compare in order to sort > by frequency and name. > Rodrigo -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-2381) In some situations the class TextProfileSignature gives different signatures for the same text "profile" page.
[ https://issues.apache.org/jira/browse/NUTCH-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939336#comment-16939336 ] Sebastian Nagel commented on NUTCH-2381: Good point: "A particular iteration order is not specified for {{HashMap}} objects - any code that depends on iteration order should be fixed." ([https://docs.oracle.com/javase/8/docs/technotes/guides/collections/changes8.html]) Will provide a fix. CAVEAT: while it makes the TextProfileSignature more reliable, it will change the signatures in an already existing CrawlDb. > In some situations the class TextProfileSignature gives different signatures > for the same text "profile" page. > -- > > Key: NUTCH-2381 > URL: https://issues.apache.org/jira/browse/NUTCH-2381 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 1.13 >Reporter: Rodrigo Joni Sestari >Priority: Major > Labels: signature > Fix For: 1.16 > > > In some situations the class TextProfileSignature gives different signatures > for the same text "profile" page. > The method TextProfileSignature.calculate uses a HashMap to store the tokens; > after some processing, the tokens are sorted by decreasing frequency. > For some pages like "http://curia.europa.eu/jcms/" the text "profile" is the > same but the signature differs for each fetch. > This happens because the tokens are sorted only by decreasing frequency. > Tokens with the same frequency may not have the same order in different > fetches. > HashMap makes no guarantees as to the order of the map, nor does it guarantee that > the order will remain constant over time. > My suggestion is to change the method TokenComparator.compare in order to sort > by frequency and name. > Rodrigo -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-2736) Upgrade Dockerfile to be based on recent Ubuntu LTS version
[ https://issues.apache.org/jira/browse/NUTCH-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939271#comment-16939271 ] Hudson commented on NUTCH-2736: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3646 (See [https://builds.apache.org/job/Nutch-trunk/3646/]) NUTCH-2736 Upgrade Dockerfile to be based on recent Ubuntu LTS version - (snagel: [https://github.com/apache/nutch/commit/c735ebb7a40dd3e6ab29583ada91a73378922874]) * (edit) docker/Dockerfile * (edit) docker/README.md > Upgrade Dockerfile to be based on recent Ubuntu LTS version > --- > > Key: NUTCH-2736 > URL: https://issues.apache.org/jira/browse/NUTCH-2736 > Project: Nutch > Issue Type: Improvement > Components: build, test >Affects Versions: 1.16 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.16 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2184: --- Fix Version/s: (was: 1.16) 1.17 > Enable IndexingJob to function with no crawldb > -- > > Key: NUTCH-2184 > URL: https://issues.apache.org/jira/browse/NUTCH-2184 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.17 > > Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch > > > Sometimes when working with distributed team(s), we have found that we can > 'lose' data structures which are currently considered critical, e.g. > crawldb, linkdb and/or segments. > In my current scenario I have a requirement to index segment data with no > accompanying crawldb or linkdb. > Absence of the latter is OK as linkdb is optional; however, currently in > [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java] > crawldb is mandatory. > This ticket should enhance the IndexerMapReduce code to support the use case > where you ONLY have segments and want to force an index for every record > present. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (NUTCH-2736) Upgrade Dockerfile to be based on recent Ubuntu LTS version
[ https://issues.apache.org/jira/browse/NUTCH-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2736. Resolution: Fixed > Upgrade Dockerfile to be based on recent Ubuntu LTS version > --- > > Key: NUTCH-2736 > URL: https://issues.apache.org/jira/browse/NUTCH-2736 > Project: Nutch > Issue Type: Improvement > Components: build, test >Affects Versions: 1.16 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.16 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-2736) Upgrade Dockerfile to be based on recent Ubuntu LTS version
[ https://issues.apache.org/jira/browse/NUTCH-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939246#comment-16939246 ] ASF GitHub Bot commented on NUTCH-2736: --- sebastian-nagel commented on pull request #472: NUTCH-2736 Upgrade Dockerfile to be based on recent Ubuntu LTS version URL: https://github.com/apache/nutch/pull/472 > Upgrade Dockerfile to be based on recent Ubuntu LTS version > --- > > Key: NUTCH-2736 > URL: https://issues.apache.org/jira/browse/NUTCH-2736 > Project: Nutch > Issue Type: Improvement > Components: build, test >Affects Versions: 1.16 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.16 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (NUTCH-2162) Nutch Webapp Crawl fails as it tries to index
[ https://issues.apache.org/jira/browse/NUTCH-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2162: --- Fix Version/s: (was: 1.16) 1.17 > Nutch Webapp Crawl fails as it tries to index > - > > Key: NUTCH-2162 > URL: https://issues.apache.org/jira/browse/NUTCH-2162 > Project: Nutch > Issue Type: Bug > Components: web gui >Affects Versions: 1.11 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.17 > > Attachments: nutch_webapp.log > > > Right now a crawl task fails on the trunk version of the WebApp due to it > attempting to index. No indexer is defined by default so this is a major bug. -- This message was sent by Atlassian Jira (v8.3.4#803005)