[jira] [Commented] (NUTCH-2305) generate.min.score doesn't work in 2.x
[ https://issues.apache.org/jira/browse/NUTCH-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431671#comment-15431671 ] Hudson commented on NUTCH-2305: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1568 (See [https://builds.apache.org/job/Nutch-nutchgora/1568/]) Fix for NUTCH-2305 generate.min.score doesn't work, contributed by (snagel: rev 5c3a381289f158f69b4f7ebe7b059cd7d9ba7638) * (edit) src/java/org/apache/nutch/crawl/GeneratorMapper.java * (edit) conf/nutch-default.xml > generate.min.score doesn't work in 2.x > -- > > Key: NUTCH-2305 > URL: https://issues.apache.org/jira/browse/NUTCH-2305 > Project: Nutch > Issue Type: Bug > Components: generator >Reporter: Kiyonari Harigae >Assignee: Sebastian Nagel > Fix For: 2.4 > > Attachments: NUTCH-2305.patch > > > The definition of "generate.min.score" exists in GeneratorJob, but > it does not work even if it is set in nutch-site.conf. > "generate.min.score" is also necessary in 2.x -- This message was sent by Atlassian JIRA (v6.3.4#6332)
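To illustrate what a minimum-score setting such as generate.min.score is meant to do, here is a minimal Java sketch that skips candidate URLs whose score falls below a configured threshold. It is not the committed GeneratorMapper change: the class MinScoreFilter, the use of java.util.Properties in place of Nutch's configuration, and the fallback value of 0 are assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-in for illustration; not the actual Nutch GeneratorMapper.
class MinScoreFilter {
    private final float minScore;

    MinScoreFilter(Properties conf) {
        // Assumption: when the property is unset, fall back to a minimum of 0.
        this.minScore = Float.parseFloat(conf.getProperty("generate.min.score", "0"));
    }

    // Returns the URLs whose score is at least the configured minimum.
    List<String> select(Map<String, Float> scoresByUrl) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Float> entry : scoresByUrl.entrySet()) {
            if (entry.getValue() >= minScore) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("generate.min.score", "0.5");

        Map<String, Float> scores = new LinkedHashMap<>();
        scores.put("http://example.com/a", 0.9f);
        scores.put("http://example.com/b", 0.1f);

        // Only the first URL passes the 0.5 threshold.
        System.out.println(new MinScoreFilter(conf).select(scores));
    }
}
```

The point of the sketch is that the threshold has to be read from the configuration and applied at selection time; a property that is defined but never consulted has no effect, which matches the symptom reported above.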
[jira] [Commented] (NUTCH-2302) RAMConfManager Could Be Constructed With Custom Configuration
[ https://issues.apache.org/jira/browse/NUTCH-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432095#comment-15432095 ] Hudson commented on NUTCH-2302: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1569 (See [https://builds.apache.org/job/Nutch-nutchgora/1569/]) NUTCH-2302 RAMConfManager Could Be Constructed With Custom Configuration (furkankamaci: rev fd722c896468fe047758891d75a58259c88289d8) * (edit) src/java/org/apache/nutch/api/impl/RAMConfManager.java > RAMConfManager Could Be Constructed With Custom Configuration > -- > > Key: NUTCH-2302 > URL: https://issues.apache.org/jira/browse/NUTCH-2302 > Project: Nutch > Issue Type: Improvement > Components: REST_api, web gui >Reporter: Furkan KAMACI >Assignee: Furkan KAMACI > Fix For: 2.4 > > > RAMConfManager is intended to hold different configurations which can be > accessed via a configuration id. However, it forces you to use a default > configuration with a default id when you construct it. When RAMConfManager is > used by other classes, they cannot set a custom configuration, which leads > to problems. E.g. test resources cannot be used when you test NutchServer because > it uses the default configuration forced by RAMConfManager.
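The design change described above can be sketched as follows. This is a hypothetical in-memory manager written only for illustration; the class name InMemoryConfManager, its constructors, and the plain string maps are assumptions and do not reproduce the real RAMConfManager API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration; not org.apache.nutch.api.impl.RAMConfManager.
class InMemoryConfManager {
    static final String DEFAULT_ID = "default";

    // Configurations keyed by configuration id.
    private final Map<String, Map<String, String>> configs = new ConcurrentHashMap<>();

    // Behaviour criticised in the issue: a default configuration is always forced.
    InMemoryConfManager() {
        this(DEFAULT_ID, new HashMap<>());
    }

    // Behaviour requested in the issue: the caller supplies the id and configuration.
    InMemoryConfManager(String confId, Map<String, String> initial) {
        configs.put(confId, new HashMap<>(initial));
    }

    // Returns the configuration stored under the id, or null if there is none.
    Map<String, String> get(String confId) {
        return configs.get(confId);
    }

    public static void main(String[] args) {
        Map<String, String> testConf = new HashMap<>();
        testConf.put("storage.data.store.class", "in-memory-test-store");

        InMemoryConfManager manager = new InMemoryConfManager("test", testConf);
        System.out.println(manager.get("test"));
        System.out.println(manager.get(DEFAULT_ID)); // null: no default was forced
    }
}
```

With a constructor like the second one, a test can hand its own resources to the manager instead of inheriting whatever the default configuration contains.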
[jira] [Commented] (NUTCH-2246) Refactor /seed endpoint for backward compatibility
[ https://issues.apache.org/jira/browse/NUTCH-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432400#comment-15432400 ] Hudson commented on NUTCH-2246: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3392 (See [https://builds.apache.org/job/Nutch-trunk/3392/]) Remove NUTCH-2246 from the 1.12 section of CHANGES.txt (fixed in 1.13) (snagel: rev 78e99092c6d1308e054f9a20e50b7a6eb6206784) * (edit) CHANGES.txt > Refactor /seed endpoint for backward compatibility > -- > > Key: NUTCH-2246 > URL: https://issues.apache.org/jira/browse/NUTCH-2246 > Project: Nutch > Issue Type: Sub-task > Components: REST_api >Affects Versions: 1.12 >Reporter: Sujen Shah >Assignee: Sujen Shah >Priority: Minor > Labels: memex > Fix For: 1.13 > > > Currently the seed endpoint allows you to create a seed list by providing a > list of urls passed as an argument. > After the first refactor (https://issues.apache.org/jira/browse/NUTCH-2090), > users could no longer provide a physical path to the seedlist. > Nutch should give both options to the user. > Additionally, once a seedlist is created by providing a list of urls (not a > physical file), Nutch should store it like it does for the configurations.
[jira] [Commented] (NUTCH-2242) lastModified not always set
[ https://issues.apache.org/jira/browse/NUTCH-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432469#comment-15432469 ] Hudson commented on NUTCH-2242: --- FAILURE: Integrated in Jenkins build Nutch-trunk #3393 (See [https://builds.apache.org/job/Nutch-trunk/3393/]) NUTCH-2164 NUTCH-2242 Inconsistent 'Modified Time' in crawl db / (snagel: rev 70622c3e18cee879f5a38d895f68dd0be69461e1) * (edit) src/java/org/apache/nutch/crawl/DefaultFetchSchedule.java * (edit) src/java/org/apache/nutch/protocol/ProtocolOutput.java * (edit) src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java * (edit) src/test/org/apache/nutch/crawl/TestCrawlDbStates.java > lastModified not always set > --- > > Key: NUTCH-2242 > URL: https://issues.apache.org/jira/browse/NUTCH-2242 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 1.11 >Reporter: Jurian Broertjes >Priority: Minor > Fix For: 1.13 > > Attachments: NUTCH-2242.patch > > > I observed two issues: > - When using the DefaultFetchSchedule, CrawlDatum's modifiedTime field is not > updated on the first successful fetch. > - When a document modification is detected (protocol- or signature-wise), the > modifiedTime isn't updated > I can provide a patch later today.
[jira] [Commented] (NUTCH-2164) Inconsistent 'Modified Time' in crawl db
[ https://issues.apache.org/jira/browse/NUTCH-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432468#comment-15432468 ] Hudson commented on NUTCH-2164: --- FAILURE: Integrated in Jenkins build Nutch-trunk #3393 (See [https://builds.apache.org/job/Nutch-trunk/3393/]) NUTCH-2164 NUTCH-2242 Inconsistent 'Modified Time' in crawl db / (snagel: rev 70622c3e18cee879f5a38d895f68dd0be69461e1) * (edit) src/java/org/apache/nutch/crawl/DefaultFetchSchedule.java * (edit) src/java/org/apache/nutch/protocol/ProtocolOutput.java * (edit) src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java * (edit) src/test/org/apache/nutch/crawl/TestCrawlDbStates.java > Inconsistent 'Modified Time' in crawl db > > > Key: NUTCH-2164 > URL: https://issues.apache.org/jira/browse/NUTCH-2164 > Project: Nutch > Issue Type: Improvement > Components: crawldb, fetcher >Affects Versions: 1.11 >Reporter: Thamme Gowda >Priority: Minor > Fix For: 1.13 > > > The 'Modified time' in crawldb is invalid. It is set to (0-Timezone > Difference) > *How to verify/reproduce:* > Run 'nutch readdb /path/to/crawldb -dump yy' and then inspect the content of > 'yy' > The following improvements can be made: > 1. Set modified time by DefaultFetchSchedule > 2. Set ProtocolStatus.lastModified if modified time is available in protocol > response headers > This issue is also discussed on the dev mailing list: > http://www.mail-archive.com/dev@nutch.apache.org/msg19803.html#
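One plausible reading of "(0-Timezone Difference)" is that the modified time is left at 0 milliseconds since the epoch and the dump renders that value in the local time zone. The Java sketch below shows the effect under that assumption; the class EpochZeroDemo and the chosen zones are illustrative only and are not taken from the Nutch code.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Illustrative only: shows how an unset (zero) timestamp looks in different zones.
class EpochZeroDemo {
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");

    // Formats an epoch-millisecond timestamp as local date and time in the given zone.
    static String format(long epochMillis, String zoneId) {
        return FORMAT.format(Instant.ofEpochMilli(epochMillis).atZone(ZoneId.of(zoneId)));
    }

    public static void main(String[] args) {
        // A modified time of 0 is the epoch itself in UTC ...
        System.out.println(format(0L, "UTC"));                 // 1970-01-01 00:00:00
        // ... but appears shifted by the zone offset when shown in local time.
        System.out.println(format(0L, "America/Los_Angeles")); // 1969-12-31 16:00:00
    }
}
```

A dumped value near 1 January 1970, offset by the local time zone, is therefore a hint that the field was never set rather than that the page was modified then.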
[jira] [Commented] (NUTCH-2303) NutchServer Could Be Able To Select a Configuration to Use
[ https://issues.apache.org/jira/browse/NUTCH-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433176#comment-15433176 ] Hudson commented on NUTCH-2303: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1570 (See [https://builds.apache.org/job/Nutch-nutchgora/1570/]) NUTCH-2303 NutchServer Could Be Able To Select a Configuration to Use (furkankamaci: rev 6227f3b171b67e790a089d6fee4d3c65de0e0ee1) * (edit) src/java/org/apache/nutch/api/NutchServer.java * (edit) src/java/org/apache/nutch/api/security/SecurityUtil.java > NutchServer Could Be Able To Select a Configuration to Use > -- > > Key: NUTCH-2303 > URL: https://issues.apache.org/jira/browse/NUTCH-2303 > Project: Nutch > Issue Type: Improvement > Components: REST_api, web gui >Reporter: Furkan KAMACI >Assignee: Furkan KAMACI > Fix For: 2.4 > > > RAMConfManager is intended to hold different configurations. However, > NutchServer currently uses the default config; it should be possible to set an > active configuration id when starting up a NutchServer.
[jira] [Commented] (NUTCH-2306) Id of Active Configuration Could Be Stored at NutchStatus and Exposed via REST API
[ https://issues.apache.org/jira/browse/NUTCH-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433178#comment-15433178 ] Hudson commented on NUTCH-2306: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1570 (See [https://builds.apache.org/job/Nutch-nutchgora/1570/]) NUTCH-2306 Id of Active Configuration Could Be Stored at NutchStatus and (furkankamaci: rev ed96b104ddf82bcb20557a29b251c3fd73eb146a) * (edit) src/java/org/apache/nutch/api/model/response/NutchStatus.java * (edit) src/java/org/apache/nutch/api/resources/AdminResource.java * (edit) src/java/org/apache/nutch/api/resources/AbstractResource.java > Id of Active Configuration Could Be Stored at NutchStatus and Exposed via > REST API > -- > > Key: NUTCH-2306 > URL: https://issues.apache.org/jira/browse/NUTCH-2306 > Project: Nutch > Issue Type: Improvement > Components: REST_api, web gui >Reporter: Furkan KAMACI >Assignee: Furkan KAMACI > Fix For: 2.4 > > > NutchStatus holds information about the configuration it uses. However, it should > also store the id of that configuration. Once NUTCH-2302 and NUTCH-2303 are > merged, we will be able to store the active configuration id and expose this > information via the REST API.
[jira] [Commented] (NUTCH-2301) Create Tests for Security Layer of NutchServer
[ https://issues.apache.org/jira/browse/NUTCH-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433679#comment-15433679 ] Hudson commented on NUTCH-2301: --- FAILURE: Integrated in Jenkins build Nutch-nutchgora #1571 (See [https://builds.apache.org/job/Nutch-nutchgora/1571/]) NUTCH-2301 Tests for Security Layer of NutchServer Are Created (furkankamaci: rev 3bc3d81e964aac59f61951740e848bd429a15b3c) * (add) src/test/org/apache/nutch/api/TestNutchAPI.java * (edit) src/test/nutch-site.xml * (add) src/test/nutch-ssl.keystore.jks * (add) src/test/org/apache/nutch/api/AbstractNutchAPITestBase.java * (delete) src/test/org/apache/nutch/api/TestAPI.java > Create Tests for Security Layer of NutchServer > -- > > Key: NUTCH-2301 > URL: https://issues.apache.org/jira/browse/NUTCH-2301 > Project: Nutch > Issue Type: Sub-task > Components: REST_api, web gui >Reporter: Furkan KAMACI >Assignee: Furkan KAMACI > Fix For: 2.4 > > > Create tests for security layer of NutchServer.
[jira] [Commented] (NUTCH-2308) Implement SSL Connection Test at TestNutchAPI
[ https://issues.apache.org/jira/browse/NUTCH-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15449662#comment-15449662 ] Hudson commented on NUTCH-2308: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1572 (See [https://builds.apache.org/job/Nutch-nutchgora/1572/]) NUTCH-2308 SSL connection test at TestNutchAPI is implemented. (furkankamaci: rev 75d846cf3998faeffa6edf5a7d7fec2d22c8d4d9) * (edit) src/java/org/apache/nutch/api/NutchServer.java * (add) src/test/testTrustKeyStore * (edit) src/test/org/apache/nutch/api/AbstractNutchAPITestBase.java * (add) src/test/nutch.cer * (edit) src/java/org/apache/nutch/api/resources/SeedResource.java * (edit) src/java/org/apache/nutch/api/resources/DbResource.java * (edit) src/java/org/apache/nutch/api/resources/ConfigResource.java * (edit) src/java/org/apache/nutch/api/resources/AdminResource.java * (edit) conf/nutch-default.xml * (add) src/java/org/apache/nutch/api/security/SecurityUtils.java * (delete) src/java/org/apache/nutch/api/security/SecurityUtil.java * (edit) src/test/nutch-ssl.keystore.jks * (edit) src/test/nutch-site.xml * (edit) src/test/org/apache/nutch/api/TestNutchAPI.java * (edit) src/java/org/apache/nutch/api/resources/JobResource.java > Implement SSL Connection Test at TestNutchAPI > - > > Key: NUTCH-2308 > URL: https://issues.apache.org/jira/browse/NUTCH-2308 > Project: Nutch > Issue Type: Improvement > Components: REST_api, web gui >Reporter: Furkan KAMACI >Assignee: Furkan KAMACI > Fix For: 2.4 > > > Currently, testing of SSL is ignored at TestNutchAPI. We should complete the > implementation.
[jira] [Commented] (NUTCH-2264) Check Forbidden APIs at Build
[ https://issues.apache.org/jira/browse/NUTCH-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15449661#comment-15449661 ] Hudson commented on NUTCH-2264: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1572 (See [https://builds.apache.org/job/Nutch-nutchgora/1572/]) NUTCH-2264 Forbidden APIs are Checked at Build (furkankamaci: rev a671540a94d8afafd72a09396c97d9ede43a7ea2) * (edit) src/java/org/apache/nutch/net/URLFilterChecker.java * (edit) src/plugin/subcollection/src/test/org/apache/nutch/collection/TestSubcollection.java * (edit) build.xml * (edit) src/test/org/apache/nutch/plugin/TestPluginSystem.java * (edit) src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java * (edit) src/java/org/apache/nutch/crawl/DbUpdaterJob.java * (edit) src/java/org/apache/nutch/host/HostDbUpdateReducer.java * (edit) src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/DOMContentUtilsTest.java * (edit) src/java/org/apache/nutch/util/Bytes.java * (edit) src/java/org/apache/nutch/host/HostInjectorJob.java * (edit) src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java * (edit) src/plugin/index-anchor/src/java/org/apache/nutch/indexer/anchor/AnchorIndexingFilter.java * (edit) src/plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java * (edit) ivy/ivy.xml * (edit) src/plugin/parse-metatags/src/java/org/apache/nutch/parse/metatags/MetaTagsParser.java * (edit) src/java/org/apache/nutch/parse/ParseUtil.java * (edit) src/java/org/apache/nutch/util/URLUtil.java * (edit) src/java/org/apache/nutch/tools/Benchmark.java * (edit) src/java/org/apache/nutch/tools/arc/ArcRecordReader.java * (edit) src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java * (edit) src/plugin/protocol-sftp/src/java/org/apache/nutch/protocol/sftp/Sftp.java * (edit) src/java/org/apache/nutch/fetcher/FetcherReducer.java * 
(edit) src/java/org/apache/nutch/parse/ParserJob.java * (edit) src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java * (edit) src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java * (edit) src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpBasicAuthentication.java * (edit) src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java * (edit) src/plugin/microformats-reltag/src/java/org/apache/nutch/microformats/reltag/RelTagParser.java * (edit) src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java * (edit) src/plugin/microformats-reltag/src/test/org/apache/nutch/microformats/reltag/TestRelTagParser.java * (edit) src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java * (edit) src/java/org/apache/nutch/api/impl/JobWorker.java * (edit) src/java/org/apache/nutch/api/resources/AdminResource.java * (edit) src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpRobotRulesParser.java * (edit) src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java * (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java * (edit) src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Client.java * (edit) src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestRobotsMetaProcessor.java * (edit) src/java/org/apache/nutch/util/domain/DomainStatistics.java * (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java * (edit) src/java/org/apache/nutch/tools/proxy/FakeHandler.java * (edit) src/java/org/apache/nutch/tools/ResolveUrls.java * (edit) src/java/org/apache/nutch/protocol/RobotRulesParser.java * (edit) src/plugin/parse-metatags/src/test/org/apache/nutch/parse/metatags/TestMetaTagsParser.java * (edit) src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java * (edit) 
src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpResponse.java * (edit) src/plugin/index-html/src/java/org/apache/nutch/indexer/html/HtmlIndexingFilter.java * (edit) src/java/org/apache/nutch/fetcher/FetcherJob.java * (edit) src/java/org/apache/nutch/protocol/Content.java * (edit) src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java * (edit) src/plugin/creativecommons/src/java/org/creativecommons/nutch/CCParseFilter.java * (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/HTMLMetaProcessor.java * (edit) src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestImageMetadata.java * (edit) src/test/org/apache/nutch/parse/TestSitemapParser.java * (edit) src/java/org/apache/nutch/tools/DmozParser.java * (edit) src/java/org/apache/nutch/webui/client/impl/RemoteCommand.java * (edit) src/plugin/language-identifier/src/test/org/apache/nutch/analysis/lang/T
[jira] [Commented] (NUTCH-2122) Implement Javadoc package-info.java for webui packages
[ https://issues.apache.org/jira/browse/NUTCH-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15450364#comment-15450364 ] Hudson commented on NUTCH-2122: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1573 (See [https://builds.apache.org/job/Nutch-nutchgora/1573/]) NUTCH-2122 Missing package-info.java classes for webui packages are (furkankamaci: rev 8ad3e44a37fb1146d135cbfaeabff118f573afce) * (add) src/java/org/apache/nutch/webui/client/impl/package-info.java * (add) src/java/org/apache/nutch/webui/client/model/package-info.java * (add) src/java/org/apache/nutch/webui/model/package-info.java * (add) src/java/org/apache/nutch/webui/pages/menu/package-info.java * (add) src/java/org/apache/nutch/webui/service/package-info.java * (add) src/java/org/apache/nutch/webui/config/package-info.java * (add) src/java/org/apache/nutch/webui/pages/crawls/package-info.java * (add) src/java/org/apache/nutch/webui/pages/seed/package-info.java * (add) src/java/org/apache/nutch/webui/service/impl/package-info.java * (add) src/java/org/apache/nutch/webui/pages/assets/package-info.java * (add) src/java/org/apache/nutch/webui/pages/instances/package-info.java * (add) src/java/org/apache/nutch/webui/pages/components/package-info.java * (add) src/java/org/apache/nutch/webui/pages/settings/package-info.java * (add) src/java/org/apache/nutch/webui/pages/package-info.java * (add) src/java/org/apache/nutch/webui/package-info.java * (add) src/java/org/apache/nutch/webui/client/package-info.java > Implement Javadoc package-info.java for webui packages > -- > > Key: NUTCH-2122 > URL: https://issues.apache.org/jira/browse/NUTCH-2122 > Project: Nutch > Issue Type: Improvement > Components: nutch server >Affects Versions: 1.10 >Reporter: Lewis John McGibbney >Assignee: Furkan KAMACI >Priority: Trivial > Fix For: 2.4 > > > [~sujenshah] I noticed that the Javadoc does not contain package.html > displaying package level introductory Javadoc as 
every other package does. > http://nutch.apache.org/apidocs/apidocs-1.10/index.html
[jira] [Commented] (NUTCH-2314) Use indexer-elastic2 Plugin for javadoc and eclipse Targets
[ https://issues.apache.org/jira/browse/NUTCH-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15460500#comment-15460500 ] Hudson commented on NUTCH-2314: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1574 (See [https://builds.apache.org/job/Nutch-nutchgora/1574/]) NUTCH-2314 indexer-elastic2 plugin is used for javadoc and eclipse (furkankamaci: rev 7dcc5fa69f3edd431b47d127048fd9f97b442fa6) * (edit) build.xml * (edit) default.properties > Use indexer-elastic2 Plugin for javadoc and eclipse Targets > --- > > Key: NUTCH-2314 > URL: https://issues.apache.org/jira/browse/NUTCH-2314 > Project: Nutch > Issue Type: Bug > Components: plugin >Reporter: Furkan KAMACI >Assignee: Furkan KAMACI > Fix For: 2.4 > > > The indexer-elastic2 plugin is used in the deploy and clean tasks of plugin/build.xml. > However, the indexer-elastic plugin is used instead of indexer-elastic2 for the > javadoc and eclipse tasks in build.xml, which causes an error.
[jira] [Commented] (NUTCH-2089) Move Nutch 2.x to compile on JDK 8
[ https://issues.apache.org/jira/browse/NUTCH-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468328#comment-15468328 ] Hudson commented on NUTCH-2089: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1575 (See [https://builds.apache.org/job/Nutch-nutchgora/1575/]) NUTCH-2089 Nutch 2.x is moved to compile on JDK 8 (furkankamaci: rev 0ea78907dee6b07058b66a99e395aea8cf623e92) * (edit) src/java/org/apache/nutch/plugin/PluginRepository.java * (edit) src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java * (edit) src/plugin/urlfilter-domain/src/java/org/apache/nutch/urlfilter/domain/DomainURLFilter.java * (edit) src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java * (edit) src/java/org/apache/nutch/indexer/IndexUtil.java * (edit) src/java/org/apache/nutch/net/URLNormalizers.java * (edit) src/java/org/apache/nutch/util/domain/TopLevelDomain.java * (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/XMLCharacterRecognizer.java * (edit) src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestRSSParser.java * (edit) src/java/org/apache/nutch/crawl/GeneratorJob.java * (edit) src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.java * (edit) src/java/org/apache/nutch/scoring/ScoringFilter.java * (edit) src/java/org/apache/nutch/util/Bytes.java * (edit) src/plugin/index-anchor/src/java/org/apache/nutch/indexer/anchor/AnchorIndexingFilter.java * (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/XMLCharacterRecognizer.java * (edit) src/plugin/urlfilter-prefix/src/java/org/apache/nutch/urlfilter/prefix/PrefixURLFilter.java * (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMBuilder.java * (edit) src/plugin/urlfilter-validator/src/java/org/apache/nutch/urlfilter/validator/UrlValidator.java * (edit) src/java/org/apache/nutch/util/MimeUtil.java * (edit) 
src/test/org/apache/nutch/crawl/TestGenerator.java * (edit) src/java/org/apache/nutch/util/NutchTool.java * (edit) src/plugin/urlfilter-domain/src/java/org/apache/nutch/urlfilter/domain/package-info.java * (edit) src/java/org/apache/nutch/parse/NutchSitemapParse.java * (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java * (edit) src/java/org/apache/nutch/util/TrieStringMatcher.java * (edit) src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java * (edit) src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/DummySSLProtocolSocketFactory.java * (edit) src/java/org/apache/nutch/util/EncodingDetector.java * (edit) src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java * (edit) src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java * (edit) src/java/org/apache/nutch/crawl/FetchSchedule.java * (edit) src/java/org/apache/nutch/parse/ParsePluginsReader.java * (edit) src/java/org/apache/nutch/fetcher/FetcherJob.java * (edit) src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/OPICScoringFilter.java * (edit) src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java * (edit) src/java/org/apache/nutch/parse/Parser.java * (edit) src/plugin/feed/src/java/org/apache/nutch/parse/feed/FeedParser.java * (edit) src/java/org/apache/nutch/util/NutchJob.java * (edit) src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java * (edit) src/test/org/apache/nutch/fetcher/TestFetcher.java * (edit) src/java/org/apache/nutch/util/URLUtil.java * (edit) src/java/org/apache/nutch/util/domain/DomainSuffix.java * (edit) src/plugin/parse-swf/src/java/org/apache/nutch/parse/swf/SWFParser.java * (edit) src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Client.java * (edit) src/java/org/apache/nutch/util/SuffixStringMatcher.java * (edit) 
src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java * (edit) src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpBasicAuthentication.java * (edit) src/java/org/apache/nutch/api/impl/RAMConfManager.java * (edit) src/java/org/apache/nutch/util/TableUtil.java * (edit) src/plugin/protocol-file/src/test/org/apache/nutch/protocol/file/TestProtocolFile.java * (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMBuilder.java * (edit) src/java/org/apache/nutch/crawl/SignatureFactory.java * (edit) src/plugin/scoring-link/src/java/org/apache/nutch/scoring/link/package-info.java * (edit) src/java/org/apache/nutch/storage/StorageUtils.java * (edit) src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java * (edit) src/java/org/apache/nutch/crawl/TextProfileSignature.java * (edit) src/java/org/apache/nutch/parse/ParserChecker.java * (edit) src/java/org/apache/nutch/api/NutchServer.java * (edit) src/java/org/apache/nutch/util/NodeWalker.java * (edit) src
[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471126#comment-15471126 ] Hudson commented on NUTCH-2132: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3395 (See [https://builds.apache.org/job/Nutch-trunk/3395/]) Fix for NUTCH-2132: Publisher/Subscriber model for Nutch to emit events, (sujen: rev e53b34b2322f2d071981a72577644a225642ecbc) * (add) src/plugin/publish-rabbitmq/build-ivy.xml * (add) src/plugin/publish-rabbitmq/src/java/org/apache/nutch/publisher/rabbitmq/package-info.java * (add) src/java/org/apache/nutch/fetcher/FetcherThreadPublisher.java * (edit) src/plugin/build.xml * (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java * (add) src/java/org/apache/nutch/publisher/NutchPublisher.java * (edit) src/plugin/nutch-extensionpoints/plugin.xml * (edit) conf/nutch-default.xml * (edit) ivy/ivy.xml * (add) src/java/org/apache/nutch/fetcher/FetcherThreadEvent.java * (add) src/plugin/publish-rabbitmq/plugin.xml * (edit) src/java/org/apache/nutch/metadata/Nutch.java * (add) src/plugin/publish-rabbitmq/src/java/org/apache/nutch/publisher/rabbitmq/RabbitMQPublisherImpl.java * (edit) build.xml * (add) src/plugin/publish-rabbitmq/build.xml * (add) src/java/org/apache/nutch/publisher/NutchPublishers.java * (add) src/plugin/publish-rabbitmq/ivy.xml > Publisher/Subscriber model for Nutch to emit events > > > Key: NUTCH-2132 > URL: https://issues.apache.org/jira/browse/NUTCH-2132 > Project: Nutch > Issue Type: New Feature > Components: fetcher, REST_api >Reporter: Sujen Shah >Assignee: Chris A. Mattmann > Labels: memex > Fix For: 1.13 > > Attachments: NUTCH-2132.patch, NUTCH-2132.v2.patch, > PubSub_routingkey.patch > > > It would be nice to have a Pub/Sub model in Nutch to emit certain events (ex- > Fetcher events like fetch-start, fetch-end, a fetch report which may contain > data like outlinks of the current fetched url, score, etc). 
> A consumer of this functionality could use this data to generate real-time > visualizations and statistics of the crawl without having to wait for > the fetch round to finish. > The REST API could contain an endpoint which would respond with a url to > which a client could subscribe to get the fetcher events.
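The publisher/subscriber idea described in this issue can be sketched in a few lines of Java. This is a hypothetical in-memory event bus, not the NutchPublisher interface or the publish-rabbitmq plugin added by the commit; the names FetchEventBus, FetchEvent and EventType are assumptions made for the example.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical in-memory sketch; not the NutchPublisher or RabbitMQ plugin code.
class FetchEventBus {
    enum EventType { START, END, REPORT }

    // Immutable event describing something that happened to one URL.
    static final class FetchEvent {
        final EventType type;
        final String url;

        FetchEvent(EventType type, String url) {
            this.type = type;
            this.url = url;
        }
    }

    // Thread-safe list so fetcher threads could publish concurrently.
    private final List<Consumer<FetchEvent>> subscribers = new CopyOnWriteArrayList<>();

    // Registers a consumer that receives every event published afterwards.
    void subscribe(Consumer<FetchEvent> subscriber) {
        subscribers.add(subscriber);
    }

    // Delivers the event to all current subscribers.
    void publish(FetchEvent event) {
        for (Consumer<FetchEvent> subscriber : subscribers) {
            subscriber.accept(event);
        }
    }

    public static void main(String[] args) {
        FetchEventBus bus = new FetchEventBus();
        bus.subscribe(e -> System.out.println(e.type + " " + e.url));
        bus.publish(new FetchEvent(EventType.START, "http://example.com/"));
        bus.publish(new FetchEvent(EventType.END, "http://example.com/"));
    }
}
```

A subscriber registered this way sees events as they are published, which is what would let a consumer build statistics without waiting for the fetch round to finish.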
[jira] [Commented] (NUTCH-2320) URLFilterChecker to run as TCP Telnet service
[ https://issues.apache.org/jira/browse/NUTCH-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15548786#comment-15548786 ] Hudson commented on NUTCH-2320: --- FAILURE: Integrated in Jenkins build Nutch-trunk #3396 (See [https://builds.apache.org/job/Nutch-trunk/3396/]) NUTCH-2320 URLFilterChecker to run as TCP Telnet service (markus: rev 836b2e01d1a4e0e9443601da755ea37de91b8c7d) * (edit) src/java/org/apache/nutch/net/URLFilterChecker.java > URLFilterChecker to run as TCP Telnet service > - > > Key: NUTCH-2320 > URL: https://issues.apache.org/jira/browse/NUTCH-2320 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.12 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.13 > > Attachments: NUTCH-2320.patch > > > Allow testing URL filters for web applications, just like the indexing filters > checker.
[jira] [Commented] (NUTCH-2327) Seeds injected in REST workflow must be ingested into HDFS
[ https://issues.apache.org/jira/browse/NUTCH-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605849#comment-15605849 ] Hudson commented on NUTCH-2327: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3399 (See [https://builds.apache.org/job/Nutch-trunk/3399/]) Fix for NUTCH-2327: Seeds injected in REST must be ingested into HDFS, (sujen: rev 24cc2aa9c68fa356e4e926b6bf86bac99d52e38c) * (edit) src/java/org/apache/nutch/service/resources/SeedResource.java > Seeds injected in REST workflow must be ingested into HDFS > -- > > Key: NUTCH-2327 > URL: https://issues.apache.org/jira/browse/NUTCH-2327 > Project: Nutch > Issue Type: Improvement > Components: injector, REST_api >Affects Versions: 1.12 >Reporter: Lewis John McGibbney >Assignee: Sujen Shah > Fix For: 1.13 > > > Right now when one uses the REST POST /seed/create API, a directory is > created within /var/some/path/here, which is fine if you are working locally > with the Nutch server, e.g. on one machine. It is however not suitable for > using the REST API in distributed deployments where seeds need to be present > within HDFS. More documentation on this topic is available at > https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI#Seed_List_creation > There are also various mailing list threads regarding use of the REST API, and > the injector url issue described above needs to be addressed. > [~sujenshah] CC for context. > http://www.mail-archive.com/user%40nutch.apache.org/msg14922.html > http://www.mail-archive.com/user%40nutch.apache.org/msg14921.html
[jira] [Commented] (NUTCH-2336) SegmentReader to implement Tool
[ https://issues.apache.org/jira/browse/NUTCH-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15711774#comment-15711774 ] Hudson commented on NUTCH-2336: --- FAILURE: Integrated in Jenkins build Nutch-trunk #3400 (See [https://builds.apache.org/job/Nutch-trunk/3400/]) NUTCH-2336 SegmentReader to implement Tool (contributed by Vincent (snagel: rev 6e051f2ccadba6c6bac60ee8708ced958a30cc8b) * (edit) src/java/org/apache/nutch/segment/SegmentReader.java > SegmentReader to implement Tool > --- > > Key: NUTCH-2336 > URL: https://issues.apache.org/jira/browse/NUTCH-2336 > Project: Nutch > Issue Type: Improvement > Components: segment >Affects Versions: 1.12 >Reporter: Vincent Slot >Priority: Minor > Labels: patch > Fix For: 1.13 > > Attachments: NUTCH-2336.patch > > > Let SegmentReader implement Tool for use on Hadoop
[jira] [Commented] (NUTCH-2337) urlnormalizer-basic to strip empty port
[ https://issues.apache.org/jira/browse/NUTCH-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745211#comment-15745211 ] Hudson commented on NUTCH-2337: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1576 (See [https://builds.apache.org/job/Nutch-nutchgora/1576/]) NUTCH-2337 urlnormalizer-basic to strip empty port - make sure that URLs (snagel: rev 6e3c34db16e385b0dadbe6444c2685283c863350) * (edit) src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java * (edit) src/plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java > urlnormalizer-basic to strip empty port > --- > > Key: NUTCH-2337 > URL: https://issues.apache.org/jira/browse/NUTCH-2337 > Project: Nutch > Issue Type: Bug > Components: plugin >Affects Versions: 2.3.1, 1.12 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 2.4, 1.13 > > > Basic URL normalizer should strip an empty port from the URL, that's not the > case at present: > {noformat} > echo "http://example.com:/"; \ >| nutch plugin urlnormalizer-basic > org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer > http://example.com:/ > {noformat} > The result should be {{http://example.com/}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2337) urlnormalizer-basic to strip empty port
[ https://issues.apache.org/jira/browse/NUTCH-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745226#comment-15745226 ] Hudson commented on NUTCH-2337: --- FAILURE: Integrated in Jenkins build Nutch-trunk #3402 (See [https://builds.apache.org/job/Nutch-trunk/3402/]) NUTCH-2337 urlnormalizer-basic to strip empty port, closes #160 - make (snagel: rev f351790d7f496561aeae5e214d1b33975ca34cf2) * (edit) src/plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java * (edit) src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java > urlnormalizer-basic to strip empty port > --- > > Key: NUTCH-2337 > URL: https://issues.apache.org/jira/browse/NUTCH-2337 > Project: Nutch > Issue Type: Bug > Components: plugin >Affects Versions: 2.3.1, 1.12 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 2.4, 1.13 > > > Basic URL normalizer should strip an empty port from the URL, that's not the > case at present: > {noformat} > echo "http://example.com:/"; \ >| nutch plugin urlnormalizer-basic > org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer > http://example.com:/ > {noformat} > The result should be {{http://example.com/}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
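For readers skimming the archive, the behaviour requested in NUTCH-2337 can be illustrated with a small, self-contained sketch. This is not the committed patch to BasicURLNormalizer; the class name `EmptyPortSketch`, the method `stripEmptyPort`, and the regular expression are assumptions made purely for illustration, and the pattern deliberately ignores URLs that carry user info.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: names and regex are hypothetical, not Nutch code.
final class EmptyPortSketch {

    // Matches "scheme://host" followed by a ':' that carries no port digits,
    // i.e. the colon is directly followed by '/', '?', '#' or the end of input.
    private static final Pattern EMPTY_PORT = Pattern.compile(
        "^([a-z][a-z0-9+.-]*://[^/?#:@]+):(?=[/?#]|$)",
        Pattern.CASE_INSENSITIVE);

    // Drops the dangling ':' and leaves every other URL untouched.
    static String stripEmptyPort(String url) {
        return EMPTY_PORT.matcher(url).replaceFirst("$1");
    }
}
```

With this sketch, `stripEmptyPort("http://example.com:/")` yields the `http://example.com/` form that the issue asks for, while a URL with an explicit port such as `http://example.com:8080/` is returned unchanged.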
[jira] [Commented] (NUTCH-2350) Add Missing activeConfId Field to NutchStatus Object
[ https://issues.apache.org/jira/browse/NUTCH-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828446#comment-15828446 ] Hudson commented on NUTCH-2350: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1577 (See [https://builds.apache.org/job/Nutch-nutchgora/1577/]) NUTCH-2350 Added missing activeConfId field to NutchStatus. (kamaci: rev 6e074fc0b61f421cb7bc516e92dea33c3ce23fd5) * (edit) src/java/org/apache/nutch/webui/client/model/NutchStatus.java > Add Missing activeConfId Field to NutchStatus Object > > > Key: NUTCH-2350 > URL: https://issues.apache.org/jira/browse/NUTCH-2350 > Project: Nutch > Issue Type: Bug > Components: web gui >Affects Versions: 2.3.1 >Reporter: Furkan KAMACI >Assignee: Furkan KAMACI > Fix For: 2.4 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2344) Authentication Support for Web GUI
[ https://issues.apache.org/jira/browse/NUTCH-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828445#comment-15828445 ] Hudson commented on NUTCH-2344: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1577 (See [https://builds.apache.org/job/Nutch-nutchgora/1577/]) NUTCH-2344 Authentication support for Web GUI (kamaci: rev def067735c5a6dc46d867c4c89cb176a275b1967) * (add) src/java/org/apache/nutch/webui/pages/auth/SignInPage.html * (add) src/java/org/apache/nutch/webui/pages/auth/SignInPage.java * (edit) ivy/ivy.xml * (edit) src/java/org/apache/nutch/webui/pages/assets/nutch-style.css * (add) src/java/org/apache/nutch/webui/pages/auth/SignInSession.java * (add) src/java/org/apache/nutch/webui/pages/auth/AuthenticatedWebPage.java * (add) src/java/org/apache/nutch/webui/pages/auth/package-info.java * (edit) conf/nutch-default.xml * (edit) src/java/org/apache/nutch/webui/pages/AbstractBasePage.java * (edit) src/java/org/apache/nutch/webui/NutchUiApplication.properties * (add) src/java/org/apache/nutch/webui/pages/auth/User.java * (edit) src/java/org/apache/nutch/webui/NutchUiApplication.java * (edit) src/java/org/apache/nutch/webui/pages/LogOutPage.java * (add) src/java/org/apache/nutch/webui/pages/auth/AuthorizationStrategy.java > Authentication Support for Web GUI > -- > > Key: NUTCH-2344 > URL: https://issues.apache.org/jira/browse/NUTCH-2344 > Project: Nutch > Issue Type: New Feature > Components: web gui >Affects Versions: 2.3.1 >Reporter: Furkan KAMACI >Assignee: Furkan KAMACI > Fix For: 2.4 > > Attachments: Firefox_Screenshot_2017-01-13T19-10-49.499Z.png > > > We should implement an authentication support for Web GUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2351) Log with Generic Class Name at Nutch 2.x
[ https://issues.apache.org/jira/browse/NUTCH-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830740#comment-15830740 ] Hudson commented on NUTCH-2351: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1578 (See [https://builds.apache.org/job/Nutch-nutchgora/1578/]) NUTCH-2351 Logging with generic class name. (snagel: rev 1a84334c115bfda16980cd822da31ba5ae401afe) * (edit) src/java/org/apache/nutch/fetcher/FetcherReducer.java * (edit) src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java * (edit) src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java * (edit) src/java/org/apache/nutch/util/domain/DomainStatistics.java * (edit) src/plugin/index-html/src/java/org/apache/nutch/indexer/html/HtmlIndexingFilter.java * (edit) src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java * (edit) src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/DOMContentUtilsTest.java * (edit) src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java * (edit) src/java/org/apache/nutch/tools/DmozParser.java * (edit) src/java/org/apache/nutch/webui/client/impl/CrawlingCycle.java * (edit) src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java * (edit) src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java * (edit) src/plugin/creativecommons/src/java/org/creativecommons/nutch/CCIndexingFilter.java * (edit) src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java * (edit) src/java/org/apache/nutch/tools/Benchmark.java * (edit) src/java/org/apache/nutch/util/MimeUtil.java * (edit) src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip/ZipParser.java * (edit) src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java * (edit) src/java/org/apache/nutch/host/HostDbReader.java * (edit) 
src/java/org/apache/nutch/tools/proxy/LogDebugHandler.java * (edit) src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java * (edit) src/plugin/parse-swf/src/java/org/apache/nutch/parse/swf/SWFParser.java * (edit) src/java/org/apache/nutch/net/URLNormalizers.java * (edit) src/java/org/apache/nutch/host/HostInjectorJob.java * (edit) src/java/org/apache/nutch/plugin/PluginDescriptor.java * (edit) src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java * (edit) src/java/org/apache/nutch/util/DomUtil.java * (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java * (edit) src/java/org/apache/nutch/parse/ParseUtil.java * (edit) src/java/org/apache/nutch/util/GZIPUtils.java * (edit) src/java/org/apache/nutch/host/HostDb.java * (edit) src/java/org/apache/nutch/indexer/IndexUtil.java * (edit) src/java/org/apache/nutch/indexer/IndexWriters.java * (edit) src/java/org/apache/nutch/host/HostDbUpdateJob.java * (edit) src/java/org/apache/nutch/webui/service/impl/NutchServiceImpl.java * (edit) src/plugin/index-anchor/src/java/org/apache/nutch/indexer/anchor/AnchorIndexingFilter.java * (edit) src/java/org/apache/nutch/api/resources/AdminResource.java * (edit) src/java/org/apache/nutch/api/impl/JobWorker.java * (edit) src/java/org/apache/nutch/plugin/PluginManifestParser.java * (edit) src/java/org/apache/nutch/webui/client/impl/RemoteCommandExecutor.java * (edit) src/java/org/apache/nutch/crawl/SignatureFactory.java * (edit) src/java/org/apache/nutch/parse/ParserJob.java * (edit) src/java/org/apache/nutch/parse/ParserFactory.java * (edit) src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java * (edit) src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java * (edit) src/java/org/apache/nutch/crawl/WebTableReader.java * (edit) src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java * 
(edit) src/plugin/urlfilter-domain/src/java/org/apache/nutch/urlfilter/domain/DomainURLFilter.java * (edit) src/java/org/apache/nutch/parse/ParserChecker.java * (edit) src/java/org/apache/nutch/api/NutchServer.java * (edit) src/java/org/apache/nutch/indexer/IndexingFilters.java * (edit) src/java/org/apache/nutch/util/ObjectCache.java * (edit) src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpRobotRulesParser.java * (edit) src/plugin/tld/src/java/org/apache/nutch/indexer/tld/TLDIndexingFilter.java * (edit) src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/DummySSLProtocolSocketFactory.java * (edit) src/java/org/apache/nutch/api/security/SecurityUtils.java * (edit) src/java/org/apache/nutch/indexer/solr/SolrDeleteDuplicates.java * (edit) src/java/org/apache/nutch/plugin/PluginRepository.java * (edit) src/java/org/apache/nutch/util/EncodingDetector.java * (edit) src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlPars
[jira] [Commented] (NUTCH-2352) Log with Generic Class Name at Nutch 1.x
[ https://issues.apache.org/jira/browse/NUTCH-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830758#comment-15830758 ] Hudson commented on NUTCH-2352: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3404 (See [https://builds.apache.org/job/Nutch-trunk/3404/]) NUTCH-2352 Logging with generic class name, closes #172 (snagel: rev 2b93a66f0472e93223c69053d5482dcbef26de6d) * (edit) src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/cosine/Model.java * (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java * (edit) src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java * (edit) src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java * (edit) src/java/org/apache/nutch/service/NutchServer.java * (edit) src/java/org/apache/nutch/fetcher/FetchItemQueue.java * (edit) src/plugin/urlmeta/src/java/org/apache/nutch/indexer/urlmeta/URLMetaIndexingFilter.java * (edit) src/java/org/apache/nutch/scoring/webgraph/WebGraph.java * (edit) src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/DummyX509TrustManager.java * (edit) src/java/org/apache/nutch/scoring/webgraph/ScoreUpdater.java * (edit) src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net/urlnormalizer/protocol/ProtocolURLNormalizer.java * (edit) src/java/org/apache/nutch/net/URLNormalizers.java * (edit) src/java/org/apache/nutch/crawl/CrawlDbMerger.java * (edit) src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java * (edit) src/java/org/apache/nutch/parse/ParseUtil.java * (edit) src/plugin/feed/src/java/org/apache/nutch/parse/feed/FeedParser.java * (edit) src/java/org/apache/nutch/crawl/CrawlDb.java * (edit) src/test/org/apache/nutch/tools/proxy/ProxyTestbed.java * (edit) src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java * (edit) 
src/java/org/apache/nutch/service/impl/JobWorker.java * (edit) src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java * (edit) src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip/ZipParser.java * (edit) src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java * (edit) src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpRobotRulesParser.java * (edit) src/java/org/apache/nutch/parse/ParseResult.java * (edit) src/java/org/apache/nutch/fetcher/QueueFeeder.java * (edit) src/java/org/apache/nutch/parse/ParseSegment.java * (edit) src/java/org/apache/nutch/hostdb/UpdateHostDbMapper.java * (edit) src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java * (edit) src/java/org/apache/nutch/util/domain/DomainSuffixesReader.java * (edit) src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java * (edit) src/java/org/apache/nutch/indexer/IndexingFilters.java * (edit) src/java/org/apache/nutch/util/DomUtil.java * (edit) src/plugin/urlmeta/src/java/org/apache/nutch/scoring/urlmeta/URLMetaScoringFilter.java * (edit) src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpAuthenticationFactory.java * (edit) src/java/org/apache/nutch/crawl/Injector.java * (edit) src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java * (edit) src/java/org/apache/nutch/util/domain/DomainStatistics.java * (edit) src/plugin/publish-rabbitmq/src/java/org/apache/nutch/publisher/rabbitmq/RabbitMQPublisherImpl.java * (edit) src/plugin/urlnormalizer-querystring/src/java/org/apache/nutch/net/urlnormalizer/querystring/QuerystringURLNormalizer.java * (edit) src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultClickAllAjaxLinksHandler.java * (edit) src/test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java * (edit) 
src/java/org/apache/nutch/tools/FileDumper.java * (edit) src/java/org/apache/nutch/segment/SegmentMergeFilters.java * (edit) src/java/org/apache/nutch/webui/service/impl/NutchServiceImpl.java * (edit) src/java/org/apache/nutch/hostdb/ReadHostDb.java * (edit) src/test/org/apache/nutch/service/TestNutchServer.java * (edit) src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java * (edit) src/plugin/urlfilter-domain/src/java/org/apache/nutch/urlfilter/domain/DomainURLFilter.java * (edit) src/plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java * (edit) src/java/org/apache/nutch/util/ProtocolStatusStatistics.java * (edit) src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/cosine/CosineSimilarity.java * (edit) src/test/org/apache/nutch/crawl/CrawlDbUpdateUtil.java * (edit) src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java * (edit) src/plugin/protocol-httpclient/src/java/org/apache/nu
[jira] [Commented] (NUTCH-2346) Check Types at Object Equality
[ https://issues.apache.org/jira/browse/NUTCH-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833122#comment-15833122 ] Hudson commented on NUTCH-2346: --- FAILURE: Integrated in Jenkins build Nutch-nutchgora #1579 (See [https://builds.apache.org/job/Nutch-nutchgora/1579/]) NUTCH-2346 Types are checked at object equality (kamaci: rev 170f8c1375c8826c6397de0eb80e2fa29d2bfe5f) * (edit) src/java/org/apache/nutch/crawl/GeneratorJob.java * (edit) src/java/org/apache/nutch/metadata/Metadata.java > Check Types at Object Equality > -- > > Key: NUTCH-2346 > URL: https://issues.apache.org/jira/browse/NUTCH-2346 > Project: Nutch > Issue Type: Bug > Components: generator, metadata >Affects Versions: 2.3.1 >Reporter: Furkan KAMACI >Assignee: Furkan KAMACI >Priority: Minor > Fix For: 2.4 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2346) Check Types at Object Equality
[ https://issues.apache.org/jira/browse/NUTCH-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15843707#comment-15843707 ] Hudson commented on NUTCH-2346: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1580 (See [https://builds.apache.org/job/Nutch-nutchgora/1580/]) NUTCH-2346v2 Check Types at Object Equality v2 (lewis.mcgibbney: rev 022ed5c03206fab821770f85c2711f7c01edb17e) * (edit) src/java/org/apache/nutch/metadata/Metadata.java > Check Types at Object Equality > -- > > Key: NUTCH-2346 > URL: https://issues.apache.org/jira/browse/NUTCH-2346 > Project: Nutch > Issue Type: Bug > Components: generator, metadata >Affects Versions: 2.3.1 >Reporter: Furkan KAMACI >Assignee: Furkan KAMACI >Priority: Minor > Fix For: 2.4 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2347) Use Logger Instead of Printing Throwable
[ https://issues.apache.org/jira/browse/NUTCH-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848272#comment-15848272 ] Hudson commented on NUTCH-2347: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1581 (See [https://builds.apache.org/job/Nutch-nutchgora/1581/]) NUTCH-2347 Logger is used instead of printing Throwable. (kamaci: rev 8dbf8083aa63fbd881c18fc8824981b4c84c9c02) * (edit) src/java/org/apache/nutch/protocol/RobotRulesParser.java * (edit) src/java/org/apache/nutch/parse/NutchSitemapParser.java * (edit) src/java/org/apache/nutch/util/URLUtil.java * (edit) src/java/org/apache/nutch/crawl/WebTableReader.java * (edit) src/java/org/apache/nutch/host/HostDbReader.java * (edit) src/java/org/apache/nutch/tools/DmozParser.java * (edit) src/java/org/apache/nutch/util/GenericWritableConfigurable.java * (edit) src/java/org/apache/nutch/parse/ParseUtil.java * (edit) src/java/org/apache/nutch/util/NutchTool.java > Use Logger Instead of Printing Throwable > > > Key: NUTCH-2347 > URL: https://issues.apache.org/jira/browse/NUTCH-2347 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.3.1 >Reporter: Furkan KAMACI >Assignee: Furkan KAMACI >Priority: Minor > Fix For: 2.4 > > > Loggers should be used instead of printing Throwable. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-2349) urlnormalizer-basic NPE for ill-formed URL "http:/"
[ https://issues.apache.org/jira/browse/NUTCH-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848273#comment-15848273 ] Hudson commented on NUTCH-2349: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1581 (See [https://builds.apache.org/job/Nutch-nutchgora/1581/]) NUTCH-2349 urlnormalizer-basic: NPE for URLs without authority - check (snagel: rev 700857d16c9e1517ddb9868ed41171d91e5c9116) * (edit) src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java * (edit) src/plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java > urlnormalizer-basic NPE for ill-formed URL "http:/" > --- > > Key: NUTCH-2349 > URL: https://issues.apache.org/jira/browse/NUTCH-2349 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.4, 1.13 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel > Fix For: 2.4, 1.13 > > > NUTCH-2337 introduced a potential (though rare) NullPointerException when an > ill-formed URL is normalized (just the protocol followed by "{{:}}", "{{:/}}", "{{:}}" > or even more slashes): > {noformat} > % echo "http:/"; \ > | runtime/local/bin/nutch org.apache.nutch.net.URLNormalizerChecker \ > -normalizer org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer > Checking URLNormalizer > org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer > Exception in thread "main" java.lang.NullPointerException > at > org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:120) > at > org.apache.nutch.net.URLNormalizerChecker.checkOne(URLNormalizerChecker.java:72) > at > org.apache.nutch.net.URLNormalizerChecker.main(URLNormalizerChecker.java:110) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-2349) urlnormalizer-basic NPE for ill-formed URL "http:/"
[ https://issues.apache.org/jira/browse/NUTCH-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848282#comment-15848282 ] Hudson commented on NUTCH-2349: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3405 (See [https://builds.apache.org/job/Nutch-trunk/3405/]) NUTCH-2349 urlnormalizer-basic: NPE for URLs without authority - check (snagel: rev 1a718e0cc9a0c381e40f4bf8351e26f73522) * (edit) src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java * (edit) src/plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java > urlnormalizer-basic NPE for ill-formed URL "http:/" > --- > > Key: NUTCH-2349 > URL: https://issues.apache.org/jira/browse/NUTCH-2349 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.4, 1.13 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel > Fix For: 2.4, 1.13 > > > NUTCH-2337 introduced a potential (though rare) NullPointerException when an > ill-formed URL is normalized (just the protocol followed by "{{:}}", "{{:/}}", "{{:}}" > or even more slashes): > {noformat} > % echo "http:/"; \ > | runtime/local/bin/nutch org.apache.nutch.net.URLNormalizerChecker \ > -normalizer org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer > Checking URLNormalizer > org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer > Exception in thread "main" java.lang.NullPointerException > at > org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:120) > at > org.apache.nutch.net.URLNormalizerChecker.checkOne(URLNormalizerChecker.java:72) > at > org.apache.nutch.net.URLNormalizerChecker.main(URLNormalizerChecker.java:110) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-2359) Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed
[ https://issues.apache.org/jira/browse/NUTCH-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865783#comment-15865783 ] Hudson commented on NUTCH-2359: --- FAILURE: Integrated in Jenkins build Nutch-trunk #3406 (See [https://builds.apache.org/job/Nutch-trunk/3406/]) NUTCH-2359 Parsefilter-regex raises IndexOutOfBoundsException when rules (markus: rev 9a9c4b32b9c1ab9c47583a217665e4694272d58a) * (add) src/plugin/parsefilter-regex/README.txt * (edit) src/plugin/parsefilter-regex/src/java/org/apache/nutch/parsefilter/regex/RegexParseFilter.java > Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed > > > Key: NUTCH-2359 > URL: https://issues.apache.org/jira/browse/NUTCH-2359 > Project: Nutch > Issue Type: Bug > Components: plugin >Affects Versions: 1.12 >Reporter: Laknath Semage >Assignee: Markus Jelsma >Priority: Minor > Labels: patch > Fix For: 1.13 > > > This patch covers: > 1) [Bug] Parsefilter-regex raises IndexOutOfBoundsException when rules are > ill-formed > 2) Rules are split using any whitespace character (\s) instead of tab (\t) > 3) A detailed README for the plugin -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-2355) Protocol plugins to set cookie if Cookie metadata field is present
[ https://issues.apache.org/jira/browse/NUTCH-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15875861#comment-15875861 ] Hudson commented on NUTCH-2355: --- FAILURE: Integrated in Jenkins build Nutch-trunk #3408 (See [https://builds.apache.org/job/Nutch-trunk/3408/]) NUTCH-2355 Protocol plugins to set cookie if Cookie metadata field is (markus: rev 217fad16bfdea0494390e8f170d9350cf06657ef) * (edit) src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java * (edit) src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java * (edit) src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java * (edit) conf/nutch-default.xml > Protocol plugins to set cookie if Cookie metadata field is present > -- > > Key: NUTCH-2355 > URL: https://issues.apache.org/jira/browse/NUTCH-2355 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.12 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.13 > > Attachments: NUTCH-2355.patch > > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-2364) http.agent.rotate: IllegalArgumentException / last element of agent names ignored
[ https://issues.apache.org/jira/browse/NUTCH-2364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898220#comment-15898220 ] Hudson commented on NUTCH-2364: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3410 (See [https://builds.apache.org/job/Nutch-trunk/3410/]) NUTCH-2364 http.agent.rotate: IllegalArgumentException / last element of (snagel: [https://github.com/apache/nutch/commit/e5e67028251e5cc1fdd10ed94103fadff0c41a4a]) * (edit) src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java > http.agent.rotate: IllegalArgumentException / last element of agent names > ignored > - > > Key: NUTCH-2364 > URL: https://issues.apache.org/jira/browse/NUTCH-2364 > Project: Nutch > Issue Type: Bug > Components: protocol >Affects Versions: 1.10, 1.11, 2.3.1, 1.12 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 2.4, 1.13 > > > With http.agent.rotate == true and a one-element agent name list, the > following exception is thrown: > {noformat} > % cat .../conf/agents.txt > my-test-crawler/Nutch-1.13 > % .../bin/nutch parsechecker -Dhttp.agent.rotate=true http://nutch.apache.org/ > ... > Fetch failed with protocol status: exception(16), lastModified=0: > java.lang.IllegalArgumentException: bound must be positive > % cat .../logs/hadoop.log > ... > 2017-03-03 11:17:19,750 ERROR http.Http - Failed to get protocol output > java.lang.IllegalArgumentException: bound must be positive > at > java.util.concurrent.ThreadLocalRandom.nextInt(ThreadLocalRandom.java:352) > at > org.apache.nutch.protocol.http.api.HttpBase.getUserAgent(HttpBase.java:379) > at > org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:180) > ... > {noformat} > Caused by > {code} > userAgentNames.get(ThreadLocalRandom.current().nextInt(userAgentNames.size()-1)); > {code} > but nextInt(...) is defined as: "Returns a pseudorandom int value between > zero (inclusive) and the specified bound (exclusive)." 
> With a one-element list the bound is therefore 0, which is rejected, and with > longer lists the last element can never be selected. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346)
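The off-by-one described in NUTCH-2364 can be illustrated with a minimal sketch. It assumes only what the report states, namely that the agent name is drawn with `ThreadLocalRandom.current().nextInt(bound)`; the class `AgentRotationSketch` and the method `pickAgent` are hypothetical names and do not reproduce the committed change to HttpBase.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch only: class and method names are hypothetical.
final class AgentRotationSketch {

    // nextInt(bound) returns a value in [0, bound), so the bound has to be
    // size(), not size() - 1, for every index (including the last) to be reachable.
    static String pickAgent(List<String> userAgentNames) {
        if (userAgentNames.isEmpty()) {
            throw new IllegalStateException("no agent names configured");
        }
        int index = ThreadLocalRandom.current().nextInt(userAgentNames.size());
        return userAgentNames.get(index);
    }
}
```

With a one-element list the bound is 1, so the only possible index is 0 and no exception is raised. With longer lists the last entry becomes selectable again.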
[jira] [Commented] (NUTCH-2357) Index metadata throw Exception because writable object cannot be cast to Text
[ https://issues.apache.org/jira/browse/NUTCH-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925276#comment-15925276 ] Hudson commented on NUTCH-2357: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3412 (See [https://builds.apache.org/job/Nutch-trunk/3412/]) NUTCH-2357 Index metadata throw Exception because writable object cannot (snagel: [https://github.com/apache/nutch/commit/439f1153991ec104acdb73420ddc816cd9c665e8]) * (edit) src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java > Index metadata throw Exception because writable object cannot be cast to Text > - > > Key: NUTCH-2357 > URL: https://issues.apache.org/jira/browse/NUTCH-2357 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.12 > Environment: It was detected using Linux mint 18. >Reporter: Eyeris Rodriguez Rueda >Assignee: Chris A. Mattmann >Priority: Minor > Fix For: 1.13 > > > The index-metadata plugin uses this property (see below) to take keys from the Datum > and index them. > > index.db.md > > > ... > > > When any value is set for this property, an exception is thrown. > The problem occurs because the Writable object cannot be cast to Text; see this > line: > https://github.com/apache/nutch/blob/master/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java#L58 > A small change will fix it. 
> This is the Exception: > ** > 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: digest dest: > digest > 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: tstamp dest: > tstamp > 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: > metatag.description dest: description > 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: > metatag.keywords dest: keywords > 2017-02-06 18:18:30,134 WARN mapred.LocalJobRunner - job_local1516_0001 > java.lang.Exception: java.lang.ClassCastException: > org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.Text > at > org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529) > Caused by: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable > cannot be cast to org.apache.hadoop.io.Text > at > org.apache.nutch.indexer.metadata.MetadataIndexer.filter(MetadataIndexer.java:58) > at > org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:51) > at > org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:330) > at > org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56) > at > org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392) > at > org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 2017-02-06 18:18:30,777 ERROR indexer.IndexingJob - Indexer: > java.io.IOException: Job failed! 
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836) > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145) > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237) > ** -- This message was sent by Atlassian JIRA (v6.3.15#6346)
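The ClassCastException in the trace above comes from downcasting a metadata value to Text when it is in fact an IntWritable. The sketch below shows the general pattern under loudly stated assumptions: `Writable`, `Text` and `IntWritable` are tiny hypothetical stand-ins for the Hadoop classes so the example runs without Hadoop on the classpath, and reading the value through `toString()` is one plausible fix, not necessarily the change committed for NUTCH-2357.

```java
// Illustrative sketch only: these nested types are hypothetical stand-ins
// for org.apache.hadoop.io.Writable, Text and IntWritable.
final class MetadataValueSketch {

    interface Writable {}

    static final class Text implements Writable {
        private final String value;
        Text(String value) { this.value = value; }
        @Override public String toString() { return value; }
    }

    static final class IntWritable implements Writable {
        private final int value;
        IntWritable(int value) { this.value = value; }
        @Override public String toString() { return Integer.toString(value); }
    }

    // Fragile: throws ClassCastException for any value that is not a Text.
    static String viaCast(Writable value) {
        return ((Text) value).toString();
    }

    // Robust: relies only on toString(), which every value type provides.
    static String viaToString(Writable value) {
        return value.toString();
    }
}
```

Calling `viaCast` with an `IntWritable` fails in the same way as the trace, whereas `viaToString` returns the decimal string for the integer and the unchanged text for a `Text` value.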
[jira] [Commented] (NUTCH-2366) Deprecated Job constructor in hostdb/ReadHostDb.java
[ https://issues.apache.org/jira/browse/NUTCH-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926093#comment-15926093 ] Hudson commented on NUTCH-2366: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3413 (See [https://builds.apache.org/job/Nutch-trunk/3413/]) NUTCH-2366 Deprecated Job constructor in hostdb/ReadHostDb.java\ (markus: [https://github.com/apache/nutch/commit/3926910e145df083ec9d42cd397c0cbd9b3a16da]) * (edit) src/java/org/apache/nutch/hostdb/ReadHostDb.java > Deprecated Job constructor in hostdb/ReadHostDb.java > > > Key: NUTCH-2366 > URL: https://issues.apache.org/jira/browse/NUTCH-2366 > Project: Nutch > Issue Type: Bug > Components: build >Affects Versions: 1.12 >Reporter: Omkar Reddy >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.13 > > Attachments: NUTCH-2366.patch > > > When we build Nutch using ant, we get the following warning: > warning: [deprecation] Job(Configuration,String) in Job has been deprecated >[javac] Job job = new Job(conf, "ReadHostDb"); > This is because the constructor Job(Configuration conf, String jobName) has > been deprecated; the reference can be found at [0]. > [0] > http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html#getInstance%28org.apache.hadoop.conf.Configuration%29 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-2367) Get single record from HostDB
[ https://issues.apache.org/jira/browse/NUTCH-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927826#comment-15927826 ] Hudson commented on NUTCH-2367: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3414 (See [https://builds.apache.org/job/Nutch-trunk/3414/]) NUTCH-2367 Get single record from HostDB (markus: [https://github.com/apache/nutch/commit/be3aea1410835b34cfacdff7c3def9fb01a83e76]) * (edit) src/java/org/apache/nutch/hostdb/ReadHostDb.java > Get single record from HostDB > - > > Key: NUTCH-2367 > URL: https://issues.apache.org/jira/browse/NUTCH-2367 > Project: Nutch > Issue Type: Improvement > Components: hostdb >Affects Versions: 1.12 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.13 > > Attachments: NUTCH-2367.patch > > > Introduces: > {code} > bin/nutch readhostdb crawl/hostdb/ -get www.apache.org > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-2068) Allow subcollection overrides via metadata
[ https://issues.apache.org/jira/browse/NUTCH-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927883#comment-15927883 ] Hudson commented on NUTCH-2068: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3415 (See [https://builds.apache.org/job/Nutch-trunk/3415/]) NUTCH-2068 Allow subcollection overrides via metadata (markus: [https://github.com/apache/nutch/commit/9fb7d6c2e61ce36375722b16842b694621f3b053]) * (edit) src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java > Allow subcollection overrides via metadata > -- > > Key: NUTCH-2068 > URL: https://issues.apache.org/jira/browse/NUTCH-2068 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.10 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Attachments: NUTCH-2068.patch > > > Similar to index-metadata but overrides subcollection. If both subcollection > and index-metadata are active, you will get two values for the field, possibly > causing multivalued field errors. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-2336) SegmentReader to implement Tool
[ https://issues.apache.org/jira/browse/NUTCH-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958730#comment-15958730 ] Hudson commented on NUTCH-2336: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3420 (See [https://builds.apache.org/job/Nutch-trunk/3420/]) Adapt NUTCH-2336 to NUTCH-2281 (snagel: [https://github.com/apache/nutch/commit/330532175f751e7c977fb8549c048fc9cf4bd10d]) * (edit) src/java/org/apache/nutch/segment/SegmentReader.java > SegmentReader to implement Tool > --- > > Key: NUTCH-2336 > URL: https://issues.apache.org/jira/browse/NUTCH-2336 > Project: Nutch > Issue Type: Improvement > Components: segment >Affects Versions: 1.12 >Reporter: Vincent Slot >Priority: Minor > Labels: patch > Fix For: 1.13 > > Attachments: NUTCH-2336.patch > > > Let SegmentReader implement Tool for use on Hadoop -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-2281) Support non-default FileSystem
[ https://issues.apache.org/jira/browse/NUTCH-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958729#comment-15958729 ] Hudson commented on NUTCH-2281: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3420 (See [https://builds.apache.org/job/Nutch-trunk/3420/]) NUTCH-2281 Support non-default FileSystem (snagel: [https://github.com/apache/nutch/commit/faed27af5b2c471610af93e2cb45f551615bd922]) * (edit) src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java * (edit) src/java/org/apache/nutch/segment/SegmentMerger.java * (edit) src/java/org/apache/nutch/crawl/LinkDbMerger.java * (edit) src/java/org/apache/nutch/hostdb/UpdateHostDb.java * (edit) src/java/org/apache/nutch/indexer/IndexingJob.java * (edit) src/java/org/apache/nutch/scoring/webgraph/NodeReader.java * (edit) src/java/org/apache/nutch/scoring/webgraph/ScoreUpdater.java * (edit) src/java/org/apache/nutch/crawl/Injector.java * (edit) src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java * (edit) src/java/org/apache/nutch/scoring/webgraph/LinkRank.java * (edit) src/java/org/apache/nutch/crawl/CrawlDb.java * (edit) src/java/org/apache/nutch/crawl/LinkDb.java * (edit) src/java/org/apache/nutch/indexer/IndexerMapReduce.java * (edit) src/java/org/apache/nutch/crawl/DeduplicationJob.java * (edit) src/java/org/apache/nutch/crawl/CrawlDbMerger.java * (edit) src/java/org/apache/nutch/parse/ParseSegment.java * (edit) src/java/org/apache/nutch/scoring/webgraph/WebGraph.java * (edit) src/java/org/apache/nutch/tools/FileDumper.java * (edit) src/java/org/apache/nutch/segment/SegmentReader.java * (edit) src/java/org/apache/nutch/crawl/Generator.java * (edit) src/java/org/apache/nutch/crawl/LinkDbReader.java * (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java * (edit) src/java/org/apache/nutch/util/LockUtil.java Adapt NUTCH-2336 to NUTCH-2281 (snagel: [https://github.com/apache/nutch/commit/330532175f751e7c977fb8549c048fc9cf4bd10d]) * (edit) 
src/java/org/apache/nutch/segment/SegmentReader.java NUTCH-2281 Support non-default file system - fix install of CrawlDb for (snagel: [https://github.com/apache/nutch/commit/5dcd7b13f450561a7b34bb6761041150c84bfdab]) * (edit) src/java/org/apache/nutch/crawl/Injector.java * (edit) src/java/org/apache/nutch/crawl/CrawlDb.java * (edit) src/java/org/apache/nutch/crawl/CrawlDbMerger.java * (edit) src/java/org/apache/nutch/crawl/Generator.java > Support non-default FileSystem > -- > > Key: NUTCH-2281 > URL: https://issues.apache.org/jira/browse/NUTCH-2281 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.12 >Reporter: Sebastian Nagel > Fix For: 1.14 > > > If a path (input or output) does not belong to the configured default > FileSystem various Nutch tools may raise an exception like > {noformat} > Exception in ... java.lang.IllegalArgumentException: Wrong FS: s3a://..., > expected: hdfs://... > {noformat} > This is fixed by getting a reference to the FileSystem from the Path object > {noformat} > FileSystem fs = path.getFileSystem(getConf()); > {noformat} > instead of > {noformat} > FileSystem fs = FileSystem.get(getConf()); > {noformat} > A given path (e.g., {{s3a://...}}) may not belong to the default file system > ({{hdfs://}} or {{file://}} in local mode) and simple checks such as > {{fs.exists(path)}} then will fail. Cf. > [FileSystem.checkPath(path)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#checkPath(org.apache.hadoop.fs.Path)], > and > [FileSystem.get(conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#get(org.apache.hadoop.conf.Configuration)] > vs. 
> [FileSystem.get(URI,conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#get(java.net.URI,%20org.apache.hadoop.conf.Configuration)] > which is called by > [Path.getFileSystem(conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/Path.html#getFileSystem%28org.apache.hadoop.conf.Configuration%29]. > > Note that the FileSystem for input and output may be different, e.g., read > from HDFS and write to S3. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
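The difference between the two lookups described in NUTCH-2281 can be sketched without Hadoop on the classpath. The class below is a hypothetical toy model, not the Hadoop `FileSystem` API: `ToyFileSystem`, `getDefault` and `forPath` are names assumed for illustration, and only the scheme check is modelled.

```java
import java.net.URI;

/** Toy stand-in for a file system bound to one URI scheme (NOT the Hadoop class). */
class ToyFileSystem {
  final String scheme;

  ToyFileSystem(String scheme) {
    this.scheme = scheme;
  }

  /** Plays the role of FileSystem.get(conf): always the configured default. */
  static ToyFileSystem getDefault(String defaultFs) {
    return new ToyFileSystem(URI.create(defaultFs).getScheme());
  }

  /** Plays the role of path.getFileSystem(conf): chosen from the path's own scheme. */
  static ToyFileSystem forPath(URI path, String defaultFs) {
    String s = path.getScheme();
    // A path without a scheme falls back to the default file system.
    return s == null ? getDefault(defaultFs) : new ToyFileSystem(s);
  }

  /** Rejects paths that belong to another file system, in the spirit of checkPath. */
  boolean exists(URI path) {
    String s = path.getScheme();
    if (s != null && !s.equals(scheme)) {
      throw new IllegalArgumentException(
          "Wrong FS: " + path + ", expected: " + scheme + "://");
    }
    return true; // the toy model treats every accepted path as existing
  }
}
```

With a default of `hdfs://namenode:8020/` (an assumed value), calling `ToyFileSystem.getDefault(...).exists(...)` on an `s3a://` path throws, while `ToyFileSystem.forPath(...)` returns a file system that accepts it. Resolving the file system per path also covers the case where input and output live on different file systems.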
[jira] [Commented] (NUTCH-2335) Injector not to filter and normalize existing URLs in CrawlDb
[ https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958788#comment-15958788 ] Hudson commented on NUTCH-2335: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3421 (See [https://builds.apache.org/job/Nutch-trunk/3421/]) NUTCH-2335 Injector not to filter and normalize existing items/URLs in (snagel: [https://github.com/apache/nutch/commit/5945db20de21c62795315c095ccf9ff4c61f3ebe]) * (edit) src/java/org/apache/nutch/crawl/Injector.java > Injector not to filter and normalize existing URLs in CrawlDb > - > > Key: NUTCH-2335 > URL: https://issues.apache.org/jira/browse/NUTCH-2335 > Project: Nutch > Issue Type: Improvement > Components: crawldb, injector >Affects Versions: 1.12 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel > Fix For: 1.14 > > > With NUTCH-1712 the behavior of the Injector has changed in case new URLs are > added to an existing CrawlDb: > - before only injected URLs were filtered and normalized > - now filters and normalizers are applied to all URLs including those already > in the CrawlDb > The default should be as before not to filter existing URLs. Filtering and > normalizing may take long for large CrawlDbs and/or complex URL filters. If > URL filter or normalizer rules are not changed there is no need to apply them > anew every time new URLs are added. Of course, injected URLs should be > filtered and normalized by default. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
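The behaviour requested in NUTCH-2335 can be illustrated with a small stand-alone sketch. This is not the Injector code: `InjectSketch`, `inject` and the `filterExisting` flag are hypothetical names, and URLs are modelled as plain strings rather than CrawlDb entries.

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

class InjectSketch {
  /**
   * Merges injected URLs into the existing ones. Injected URLs are always
   * normalized and filtered; existing URLs only when filterExisting is true.
   */
  static Set<String> inject(Collection<String> existing, Collection<String> injected,
      UnaryOperator<String> normalizer, Predicate<String> filter, boolean filterExisting) {
    Set<String> merged = new TreeSet<>();
    for (String url : existing) {
      if (filterExisting) {
        url = normalizer.apply(url);
        if (!filter.test(url)) {
          continue; // existing URL rejected by the current filter rules
        }
      }
      merged.add(url);
    }
    for (String url : injected) {
      String normalized = normalizer.apply(url);
      if (filter.test(normalized)) {
        merged.add(normalized);
      }
    }
    return merged;
  }
}
```

With `filterExisting` set to false, the cost of normalizing and filtering grows with the number of injected URLs only, which is the point made above for large CrawlDbs.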
[jira] [Commented] (NUTCH-2269) Clean not working after crawl
[ https://issues.apache.org/jira/browse/NUTCH-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958866#comment-15958866 ] Hudson commented on NUTCH-2269: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3422 (See [https://builds.apache.org/job/Nutch-trunk/3422/]) fix for NUTCH-2269 contributed by r0ann3l (snagel: [https://github.com/apache/nutch/commit/e040ace189aa0379b998c8852a09c1a1a2308d82]) * (edit) src/java/org/apache/nutch/indexer/CleaningJob.java > Clean not working after crawl > - > > Key: NUTCH-2269 > URL: https://issues.apache.org/jira/browse/NUTCH-2269 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.12 > Environment: Vagrant, Ubuntu, Java 8, Solr 4.10 >Reporter: Francesco Capponi >Assignee: Lewis John McGibbney > Fix For: 2.4, 1.14 > > > I have been having this problem for a while and I had to roll back to the > old solr clean instead of the newer version. > Once it inserts/updates every document correctly in Nutch, when it tries to > clean, it returns error 255: > {quote} > 2016-05-30 10:13:04,992 WARN output.FileOutputCommitter - Output Path is > null in setupJob() > 2016-05-30 10:13:07,284 INFO indexer.IndexWriters - Adding > org.apache.nutch.indexwriter.solr.SolrIndexWriter > 2016-05-30 10:13:08,114 INFO solr.SolrMappingReader - source: content dest: > content > 2016-05-30 10:13:08,114 INFO solr.SolrMappingReader - source: title dest: > title > 2016-05-30 10:13:08,114 INFO solr.SolrMappingReader - source: host dest: host > 2016-05-30 10:13:08,114 INFO solr.SolrMappingReader - source: segment dest: > segment > 2016-05-30 10:13:08,114 INFO solr.SolrMappingReader - source: boost dest: > boost > 2016-05-30 10:13:08,114 INFO solr.SolrMappingReader - source: digest dest: > digest > 2016-05-30 10:13:08,114 INFO solr.SolrMappingReader - source: tstamp dest: > tstamp > 2016-05-30 10:13:08,133 INFO solr.SolrIndexWriter - SolrIndexer: deleting > 15/15 documents > 2016-05-30 
10:13:08,919 WARN output.FileOutputCommitter - Output Path is > null in cleanupJob() > 2016-05-30 10:13:08,937 WARN mapred.LocalJobRunner - job_local662730477_0001 > java.lang.Exception: java.lang.IllegalStateException: Connection pool shut > down > at > org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529) > Caused by: java.lang.IllegalStateException: Connection pool shut down > at org.apache.http.util.Asserts.check(Asserts.java:34) > at > org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:169) > at > org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:202) > at > org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184) > at > org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415) > at > org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) > at > org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) > at > org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106) > at > org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57) > at > org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:480) > at > org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241) > at > org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230) > at > org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:150) > at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:483) > at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:464) > at > org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:190) > at > org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:178) > at 
org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115) > at > org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:120) > at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:237) > at > org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392) > at > org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExec
[jira] [Commented] (NUTCH-2193) Upgrade feed parser plugin to use rome 1.5
[ https://issues.apache.org/jira/browse/NUTCH-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958961#comment-15958961 ] Hudson commented on NUTCH-2193: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3423 (See [https://builds.apache.org/job/Nutch-trunk/3423/]) NUTCH-2193 Upgrade feed parser plugin to use rome 1.5.1 (snagel: [https://github.com/apache/nutch/commit/c1819539ba21a294c1afc12b876b83f74a1ce3e7]) * (edit) src/plugin/feed/ivy.xml * (edit) src/plugin/feed/src/java/org/apache/nutch/parse/feed/FeedParser.java * (edit) src/plugin/feed/plugin.xml > Upgrade feed parser plugin to use rome 1.5 > -- > > Key: NUTCH-2193 > URL: https://issues.apache.org/jira/browse/NUTCH-2193 > Project: Nutch > Issue Type: Improvement > Components: parser >Affects Versions: 1.11 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.14 > > Attachments: NUTCH-2193.patch > > > The class loader issue in the rome library (NUTCH-1494, [[rometools > #130|https://github.com/rometools/rome/issues/130]]) is fixed with rome 1.5. > Time to upgrade. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-2372) Javadocs build failing.
[ https://issues.apache.org/jira/browse/NUTCH-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969920#comment-15969920 ] Hudson commented on NUTCH-2372: --- FAILURE: Integrated in Jenkins build Nutch-trunk #3425 (See [https://builds.apache.org/job/Nutch-trunk/3425/]) NUTCH-2372 Fixing the errors in documentation (omkarreddy2008: [https://github.com/apache/nutch/commit/61985f17998d1deaa47d8e56b46136e0fc1f4108]) * (edit) src/java/org/apache/nutch/util/MimeUtil.java * (edit) src/plugin/index-anchor/src/java/org/apache/nutch/indexer/anchor/AnchorIndexingFilter.java * (edit) src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java * (edit) src/java/org/apache/nutch/util/TableUtil.java * (edit) src/java/org/apache/nutch/segment/SegmentMerger.java * (edit) src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java * (edit) src/java/org/apache/nutch/tools/arc/ArcRecordReader.java * (edit) src/java/org/apache/nutch/util/TrieStringMatcher.java * (edit) src/java/org/apache/nutch/util/TimingUtil.java * (edit) src/plugin/index-replace/src/java/org/apache/nutch/indexer/replace/ReplaceIndexer.java * (edit) src/java/org/apache/nutch/tools/CommonCrawlFormatFactory.java * (edit) src/java/org/apache/nutch/crawl/FetchSchedule.java * (edit) src/java/org/apache/nutch/tools/FileDumper.java * (edit) src/java/org/apache/nutch/tools/CommonCrawlFormatSimple.java * (edit) src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpBasicAuthentication.java * (edit) src/java/org/apache/nutch/segment/SegmentPart.java * (edit) src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java * (edit) src/java/org/apache/nutch/util/EncodingDetector.java * (edit) src/java/org/apache/nutch/crawl/Injector.java * (edit) src/java/org/apache/nutch/crawl/Generator.java * (edit) src/java/org/apache/nutch/service/impl/ConfManagerImpl.java * (edit) 
src/plugin/mimetype-filter/src/java/org/apache/nutch/indexer/filter/MimeTypeIndexingFilter.java * (edit) src/java/org/apache/nutch/util/LockUtil.java * (edit) src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Client.java * (edit) src/java/org/apache/nutch/hostdb/UpdateHostDbReducer.java * (edit) src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java * (edit) src/java/org/apache/nutch/tools/CommonCrawlFormat.java * (edit) src/java/org/apache/nutch/util/SuffixStringMatcher.java * (edit) src/plugin/urlnormalizer-querystring/src/java/org/apache/nutch/net/urlnormalizer/querystring/QuerystringURLNormalizer.java * (edit) src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip/ZipTextExtractor.java * (edit) src/java/org/apache/nutch/hostdb/UpdateHostDbMapper.java * (edit) src/plugin/urlmeta/src/java/org/apache/nutch/scoring/urlmeta/URLMetaScoringFilter.java * (edit) src/java/org/apache/nutch/plugin/PluginRepository.java * (edit) src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyIndexWriter.java * (edit) src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java * (edit) src/java/org/apache/nutch/net/URLNormalizers.java * (edit) src/java/org/apache/nutch/tools/AbstractCommonCrawlFormat.java * (edit) src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java * (edit) src/java/org/apache/nutch/util/PrefixStringMatcher.java * (edit) src/plugin/lib-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/HtmlUnitWebDriver.java * (edit) src/plugin/urlfilter-domain/src/java/org/apache/nutch/urlfilter/domain/DomainURLFilter.java * (edit) src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java * (edit) src/java/org/apache/nutch/parse/ParseResult.java * (edit) src/plugin/urlmeta/src/java/org/apache/nutch/indexer/urlmeta/URLMetaIndexingFilter.java * (edit) src/java/org/apache/nutch/util/URLUtil.java * (edit) 
src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java * (edit) src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java * (edit) src/plugin/index-geoip/src/java/org/apache/nutch/indexer/geoip/GeoIPIndexingFilter.java * (edit) src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java * (edit) src/java/org/apache/nutch/parse/ParserChecker.java * (edit) src/java/org/apache/nutch/hostdb/ReadHostDb.java * (edit) src/plugin/urlfilter-domainblacklist/src/java/org/apache/nutch/urlfilter/domainblacklist/DomainBlacklistURLFilter.java * (edit) src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/OPICScoringFilter.java * (edit) src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/util/LuceneTokenizer.java * (edit) src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java > Javadocs build failing. > --- >
[jira] [Commented] (NUTCH-2333) Indexer for RabbitMQ
[ https://issues.apache.org/jira/browse/NUTCH-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969918#comment-15969918 ] Hudson commented on NUTCH-2333: --- FAILURE: Integrated in Jenkins build Nutch-trunk #3425 (See [https://builds.apache.org/job/Nutch-trunk/3425/]) Fixes for NUTCH-2333: Added the lines for ant runtime task (gitRoann3l;fhdez: [https://github.com/apache/nutch/commit/5873a24d3845563bd1028f6a27e22438670b4063]) * (edit) src/plugin/build.xml * (edit) build.xml Fixes for NUTCH-2333: Added the logic for indexing process (gitRoann3l;fhdez: [https://github.com/apache/nutch/commit/62496aec84cbf889f14175dbf03f0e8a1200ac9c]) * (add) src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitDocument.java * (add) src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitMQConstants.java * (add) src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitMessage.java * (add) src/plugin/indexer-rabbit/plugin.xml * (add) src/plugin/indexer-rabbit/build-ivy.xml * (add) src/plugin/indexer-rabbit/build.xml * (add) src/plugin/indexer-rabbit/ivy.xml * (add) src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitIndexWriter.java Fixes for NUTCH-2333: Added the properties for RabbitMQ indexer. 
(gitRoann3l;fhdez: [https://github.com/apache/nutch/commit/594564b27258fbcca68e90e41db801a750d11426]) * (edit) conf/nutch-default.xml Fixes for NUTCH-2333: Added new properties to indexer (gitRoann3l;fhdez: [https://github.com/apache/nutch/commit/17886f722ff16da0aa29bd059953feca609a5165]) * (edit) conf/nutch-default.xml Fixes for NUTCH-2333: Corrected some comments in the configuration file (gitRoann3l;fhdez: [https://github.com/apache/nutch/commit/c0af89aeb0e5c9e2059192eac7514cea3825b7e2]) * (edit) conf/nutch-default.xml * (edit) src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitIndexWriter.java * (edit) src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitMQConstants.java > Indexer for RabbitMQ > > > Key: NUTCH-2333 > URL: https://issues.apache.org/jira/browse/NUTCH-2333 > Project: Nutch > Issue Type: New Feature > Components: indexer >Affects Versions: 1.12 >Reporter: Roannel Fernández Hernández >Priority: Minor > Fix For: 1.14 > > > A plugin to send the documents to a RabbitMQ server. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969919#comment-15969919 ] Hudson commented on NUTCH-2132: --- FAILURE: Integrated in Jenkins build Nutch-trunk #3425 (See [https://builds.apache.org/job/Nutch-trunk/3425/]) Fixes for NUTCH-2132: Added the library amqp (gitRoann3l;fhdez: [https://github.com/apache/nutch/commit/9d65ac6d6b33d83e1fc9ab387f29a3287b0b26b3]) * (edit) src/plugin/publish-rabbitmq/plugin.xml Fixes for NUTCH-2132: Added new properties (gitRoann3l;fhdez: [https://github.com/apache/nutch/commit/bc9a2c859c2b2036aa58d899c529cbf6282a41df]) * (edit) conf/nutch-default.xml * (edit) src/plugin/publish-rabbitmq/src/java/org/apache/nutch/publisher/rabbitmq/RabbitMQPublisherImpl.java Fixes for NUTCH-2132: Deleted empty comments (gitRoann3l;fhdez: [https://github.com/apache/nutch/commit/5eb77a9de6d20f31a5fd6759022b25552744cc16]) * (edit) src/plugin/publish-rabbitmq/src/java/org/apache/nutch/publisher/rabbitmq/RabbitMQPublisherImpl.java Fixes for NUTCH-2132: Fixed the default port (gitRoann3l;fhdez: [https://github.com/apache/nutch/commit/ee651752295468af75a14f6c98686c0d7c26136a]) * (edit) conf/nutch-default.xml * (edit) src/plugin/publish-rabbitmq/src/java/org/apache/nutch/publisher/rabbitmq/RabbitMQPublisherImpl.java > Publisher/Subscriber model for Nutch to emit events > > > Key: NUTCH-2132 > URL: https://issues.apache.org/jira/browse/NUTCH-2132 > Project: Nutch > Issue Type: New Feature > Components: fetcher, REST_api >Reporter: Sujen Shah >Assignee: Chris A. Mattmann > Labels: memex > Fix For: 1.13 > > Attachments: NUTCH-2132.patch, NUTCH-2132.v2.patch, > PubSub_routingkey.patch > > > It would be nice to have a Pub/Sub model in Nutch to emit certain events (ex- > Fetcher events like fetch-start, fetch-end, a fetch report which may contain > data like outlinks of the current fetched url, score, etc). 
> A consumer of this functionality could use this data to generate real-time > visualizations and statistics of the crawl without having to wait for > the fetch round to finish. > The REST API could contain an endpoint which would respond with a URL to > which a client could subscribe to get the fetcher events. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.
[ https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973288#comment-15973288 ] Hudson commented on NUTCH-2046: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3427 (See [https://builds.apache.org/job/Nutch-trunk/3427/]) fix for NUTCH-2046 contributed by jnioche (julien: [https://github.com/apache/nutch/commit/7b0103fe62c9b0e479bb03e7b9575522adcf68b8]) * (edit) src/bin/crawl > The crawl script should be able to skip an initial injection. > - > > Key: NUTCH-2046 > URL: https://issues.apache.org/jira/browse/NUTCH-2046 > Project: Nutch > Issue Type: Improvement > Components: crawldb, injector >Affects Versions: 1.10 >Reporter: Luis Lopez >Assignee: Julien Nioche > Labels: crawl, injection > Fix For: 1.14 > > Attachments: crawl.patch > > > When our crawl gets really big, a new injection takes considerable time because it > updates the crawldb. The crawl script should be able to skip the injection and go > directly to the generate call. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-2353) Create seed file with metadata using the REST API
[ https://issues.apache.org/jira/browse/NUTCH-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016992#comment-16016992 ] Hudson commented on NUTCH-2353: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3428 (See [https://builds.apache.org/job/Nutch-trunk/3428/]) Fix for NUTCH-2353 contributed by jorgelbg (jlbetancourt: [https://github.com/apache/nutch/commit/7deb576bc58bb74725cbb6c5d82d7b9244c6ad42]) * (edit) src/java/org/apache/nutch/webui/model/SeedUrl.java * (edit) src/java/org/apache/nutch/service/model/request/SeedUrl.java * (edit) src/java/org/apache/nutch/service/resources/SeedResource.java > Create seed file with metadata using the REST API > - > > Key: NUTCH-2353 > URL: https://issues.apache.org/jira/browse/NUTCH-2353 > Project: Nutch > Issue Type: Improvement > Components: injector, REST_api >Affects Versions: 1.12 >Reporter: Jorge Luis Betancourt Gonzalez >Assignee: Jorge Luis Betancourt Gonzalez >Priority: Minor > Labels: rest_api > Fix For: 1.14 > > > At the moment it is not possible to create a seed file and specify any metadata > when using the REST API. The file gets created but there is no option to add > any metadata to the seed URLs. > If we use a payload like this: > {code} > { > "name":"name-of-seedlist", > "seedUrls":[ > { > "url" : "http://example.com", > "metadata" : { > "key1" : "value1", > "key2" : "value2", > "key3" : "value3" > } > } > ] > } > {code} > It should be easy to specify the desired metadata. Also, this should keep > backward compatibility (BC) with the previous array syntax if we only want to specify the list of URLs > without any metadata at all. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
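One way to picture the outcome of NUTCH-2353 is a seed entry rendered as one line of a seed file. The sketch below is hypothetical and is not taken from `SeedResource`: it assumes the seed file uses the tab-separated `key=value` convention that the Injector reads for per-URL metadata, and `SeedLineSketch` and `toSeedLine` are names invented here.

```java
import java.util.Collections;
import java.util.Map;

class SeedLineSketch {
  /** Renders a URL plus its metadata as one seed-file line (tab-separated key=value pairs). */
  static String toSeedLine(String url, Map<String, String> metadata) {
    StringBuilder line = new StringBuilder(url);
    for (Map.Entry<String, String> entry : metadata.entrySet()) {
      line.append('\t').append(entry.getKey()).append('=').append(entry.getValue());
    }
    return line.toString();
  }

  /** Plain URL without metadata, matching the older list-of-URLs usage. */
  static String toSeedLine(String url) {
    return toSeedLine(url, Collections.emptyMap());
  }
}
```

Keeping the single-argument overload mirrors the backward-compatibility requirement: a seed list that only names URLs yields the same lines as before.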
[jira] [Commented] (NUTCH-2353) Create seed file with metadata using the REST API
[ https://issues.apache.org/jira/browse/NUTCH-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017310#comment-16017310 ] Hudson commented on NUTCH-2353: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3429 (See [https://builds.apache.org/job/Nutch-trunk/3429/]) Fix for NUTCH-2353 contributed by jorgelbg (snagel: [https://github.com/apache/nutch/commit/0312bae38c9e95d496336dc24133b15ebefd4d3c]) * (edit) src/java/org/apache/nutch/webui/model/SeedUrl.java * (edit) src/java/org/apache/nutch/service/model/request/SeedUrl.java * (edit) src/java/org/apache/nutch/service/resources/SeedResource.java > Create seed file with metadata using the REST API > - > > Key: NUTCH-2353 > URL: https://issues.apache.org/jira/browse/NUTCH-2353 > Project: Nutch > Issue Type: Improvement > Components: injector, REST_api >Affects Versions: 1.12 >Reporter: Jorge Luis Betancourt Gonzalez >Assignee: Jorge Luis Betancourt Gonzalez >Priority: Minor > Labels: rest_api > Fix For: 1.14 > > > At the moment it is not possible to create a seed file and specify any metadata > when using the REST API. The file gets created but there is no option to add > any metadata to the seed URLs. > If we use a payload like this: > {code} > { > "name":"name-of-seedlist", > "seedUrls":[ > { > "url" : "http://example.com", > "metadata" : { > "key1" : "value1", > "key2" : "value2", > "key3" : "value3" > } > } > ] > } > {code} > It should be easy to specify the desired metadata. Also, this should keep > backward compatibility (BC) with the previous array syntax if we only want to specify the list of URLs > without any metadata at all. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-2376) Improve configurability of HTTP Accept* header fields
[ https://issues.apache.org/jira/browse/NUTCH-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017309#comment-16017309 ] Hudson commented on NUTCH-2376: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3429 (See [https://builds.apache.org/job/Nutch-trunk/3429/]) NUTCH-2376 Improve configurability of HTTP Accept* header fields - (snagel: [https://github.com/apache/nutch/commit/af9d7a3e68002860fcc178e21b869d2f79c27dee]) * (edit) src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java * (edit) src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java * (edit) src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java * (edit) conf/nutch-default.xml > Improve configurability of HTTP Accept* header fields > - > > Key: NUTCH-2376 > URL: https://issues.apache.org/jira/browse/NUTCH-2376 > Project: Nutch > Issue Type: Improvement > Components: protocol >Affects Versions: 2.3.1, 1.13 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 2.4, 1.14 > > > There should be no differences between protocol-http and protocol-httpclient > whether the HTTP header fields {{Accept}}, {{Accept-Language}} and > {{Accept-Charset}} are configurable. The configured values should be used for > both plugins. In addition, > - it should be possible to unset the default values (overwrite with empty > value) so that no HTTP header field is sent > - default values should be contained in nutch-default.xml > Note: {{Accept-Encoding}} should not be configurable as the protocol plugins > must support the accepted compression codecs which may not be the case e.g. > for Brotli. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
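The "overwrite with empty value" rule from NUTCH-2376 can be sketched as follows. This is an illustration only, not the `HttpBase` code: the configuration is modelled as a plain `Map`, the class and method names are hypothetical, and the property names `http.accept`, `http.accept.language` and `http.accept.charset` are assumptions that should be checked against nutch-default.xml.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AcceptHeaderSketch {
  /** Builds the Accept* request headers, skipping any property that is unset or empty. */
  static Map<String, String> buildHeaders(Map<String, String> conf) {
    Map<String, String> headers = new LinkedHashMap<>();
    putIfConfigured(headers, "Accept", conf.get("http.accept"));
    putIfConfigured(headers, "Accept-Language", conf.get("http.accept.language"));
    putIfConfigured(headers, "Accept-Charset", conf.get("http.accept.charset"));
    return headers;
  }

  private static void putIfConfigured(Map<String, String> headers, String name, String value) {
    // An empty or whitespace-only value means "do not send this header field".
    if (value != null && !value.trim().isEmpty()) {
      headers.put(name, value.trim());
    }
  }
}
```

Because both protocol plugins would call the same builder, the configured values apply to either plugin, and `Accept-Encoding` is deliberately absent from the configurable set.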
[jira] [Commented] (NUTCH-2373) Indexer for Hbase
[ https://issues.apache.org/jira/browse/NUTCH-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16020250#comment-16020250 ] Hudson commented on NUTCH-2373: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1583 (See [https://builds.apache.org/job/Nutch-nutchgora/1583/]) NUTCH-2373 HBaseIndexWriter - indexer for hbase implemented (kaidulislam90: [https://github.com/apache/nutch/commit/bda007601cb84bff5ce44f3e7b5906d2803f2504]) * (edit) conf/nutch-default.xml * (add) src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/HBaseIndexWriter.java * (add) src/plugin/indexer-hbase/ivy.xml * (edit) ivy/ivy.xml * (edit) src/plugin/build.xml * (add) src/plugin/indexer-hbase/build.xml * (add) src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/package-info.java * (edit) build.xml * (add) src/plugin/indexer-hbase/plugin.xml * (add) src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/HBaseConstants.java * (add) src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/HBaseMappingReader.java NUTCH-2373 Multiple SLF4J bindings issue solved, unnecessary (ikaidul: [https://github.com/apache/nutch/commit/9541ad853f7e86d45028ce0c0e85603a3e988bee]) * (edit) src/plugin/indexer-hbase/plugin.xml * (edit) src/plugin/indexer-hbase/ivy.xml * (edit) ivy/ivy.xml * (add) conf/hbaseindex-mapping.xml NUTCH-2373 Extra newline removed (ikaidul: [https://github.com/apache/nutch/commit/42257b397b4699f8c7a4d33d366d35e67dd61c7d]) * (edit) src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/package-info.java NUTCH-2373 Code formatted using Nutch eclipse-codeformat.xml (ikaidul: [https://github.com/apache/nutch/commit/0f023e84367a4bae37f2695d7dd0891b578d62c5]) * (edit) src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/HBaseMappingReader.java * (edit) src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/HBaseIndexWriter.java * (edit) 
src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/HBaseConstants.java NUTCH-2373 getting mapped qualifier name from key issue solved (kaidulislam90: [https://github.com/apache/nutch/commit/7d6f3c3bb9761bae5ae75f48113cb2b59f1a]) * (edit) src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/HBaseIndexWriter.java NUTCH-2373 Boilerplate default column family removed, considering first (kaidulislam90: [https://github.com/apache/nutch/commit/3db369994286cd535a4cba39bc4c1d882ac7e203]) * (edit) src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/HBaseMappingReader.java NUTCH-2373 typo corrected in hbaseindex-mapping.xml (kaidulislam90: [https://github.com/apache/nutch/commit/0103f4d80d2b0f532d11d7c2451e58b276829419]) * (edit) conf/hbaseindex-mapping.xml NUTCH-2373 An issue on document counting fixed, default batch-size (kaidulislam90: [https://github.com/apache/nutch/commit/9dad864f806e2152efc1b29f3bab76c164f21da0]) * (edit) conf/nutch-default.xml * (edit) src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/HBaseIndexWriter.java > Indexer for Hbase > - > > Key: NUTCH-2373 > URL: https://issues.apache.org/jira/browse/NUTCH-2373 > Project: Nutch > Issue Type: New Feature > Components: indexer >Affects Versions: 2.3 >Reporter: Kaidul Islam >Assignee: Kaidul Islam > Fix For: 2.4 > > > Some use-case involves storing the documents in some sort of database other > than indexing search engines i.e. Solr, ElasticSearch. This is a plugin to > send the documents to Hbase storage. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-2388) bin/crawl indexing only webpages containing batchID instead of all in 2.x
[ https://issues.apache.org/jira/browse/NUTCH-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16021527#comment-16021527 ] Hudson commented on NUTCH-2388: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1584 (See [https://builds.apache.org/job/Nutch-nutchgora/1584/]) NUTCH-2388 bin/crawl indexing only webpages containing batchID instead (kaidulislam90: [https://github.com/apache/nutch/commit/32a57b52a67cd5c2cb637c6fbae2dfce5a2c27b5]) * (edit) src/bin/crawl > bin/crawl indexing only webpages containing batchID instead of all in 2.x > - > > Key: NUTCH-2388 > URL: https://issues.apache.org/jira/browse/NUTCH-2388 > Project: Nutch > Issue Type: Bug > Components: bin >Affects Versions: 2.3 >Reporter: Kaidul Islam >Assignee: Kaidul Islam >Priority: Trivial > Fix For: 2.4 > > Original Estimate: 24h > Remaining Estimate: 24h > > During each iteration, after generating, fetching, parsing and updating the > current batch into DB, the indexer is supposed to index the current batch > too. But it is currently indexing everything. > {code} > __bin_nutch index $commonOptions -D solr.server.url=$SOLRURL -all -crawlId > "$CRAWL_ID" > {code} > It should be like below, I guess: > {code} > __bin_nutch index $commonOptions -D solr.server.url=$SOLRURL $batchId > -crawlId "$CRAWL_ID" > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-2374) Upgrade Nutch 2.X to Gora 0.7
[ https://issues.apache.org/jira/browse/NUTCH-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051334#comment-16051334 ] Hudson commented on NUTCH-2374: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1586 (See [https://builds.apache.org/job/Nutch-nutchgora/1586/]) NUTCH-2374 Upgrade Nutch 2.X to Gora 0.7 (lakshmi: [https://github.com/apache/nutch/commit/b92aa37a8291892329d067c0d18a8ea808a22d13]) * (edit) ivy/ivy.xml * (edit) src/java/org/apache/nutch/host/HostDbUpdateJob.java * (edit) src/java/org/apache/nutch/util/domain/DomainStatistics.java * (edit) src/java/org/apache/nutch/crawl/WebTableReader.java * (edit) src/java/org/apache/nutch/storage/StorageUtils.java > Upgrade Nutch 2.X to Gora 0.7 > - > > Key: NUTCH-2374 > URL: https://issues.apache.org/jira/browse/NUTCH-2374 > Project: Nutch > Issue Type: Bug > Components: build, storage >Reporter: Lewis John McGibbney > Fix For: 2.4 > > > We should make the upgrades before we release Nutch 2.X. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2397) Parser to add paragraph line breaks
[ https://issues.apache.org/jira/browse/NUTCH-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074513#comment-16074513 ] Hudson commented on NUTCH-2397: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3432 (See [https://builds.apache.org/job/Nutch-trunk/3432/]) Fix for NUTCH-2397 (improved solution contributed by Vipul Behl, closes (snagel: [https://github.com/apache/nutch/commit/48c38b03f3cfb73402431f262990a6d091570e9a]) * (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java * (edit) src/plugin/parse-zip/src/test/org/apache/nutch/parse/zip/TestZipParser.java * (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java > Parser to add paragraph line breaks > --- > > Key: NUTCH-2397 > URL: https://issues.apache.org/jira/browse/NUTCH-2397 > Project: Nutch > Issue Type: Improvement > Components: parser >Affects Versions: 2.3.1, 1.13 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 2.4, 1.14 > > > (initially reported with patch/pull-request by Vipul Behl, see > [#190|https://github.com/apache/nutch/pull/190]) > The parser (parse-tika and parse-html) could be improved to add line breaks > between paragraphs, instead of writing the whole document into a single line. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2391) Spurious Duplications for MD5
[ https://issues.apache.org/jira/browse/NUTCH-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074988#comment-16074988 ] Hudson commented on NUTCH-2391: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1587 (See [https://builds.apache.org/job/Nutch-nutchgora/1587/]) NUTCH-2393 Fix for issue addressed in NUTCH-2391 (kaidulislam90: [https://github.com/apache/nutch/commit/ef33ba7db80d08d5ef56501bcc45baadfee14dfc]) * (edit) src/java/org/apache/nutch/crawl/MD5Signature.java > Spurious Duplications for MD5 > - > > Key: NUTCH-2391 > URL: https://issues.apache.org/jira/browse/NUTCH-2391 > Project: Nutch > Issue Type: Bug > Components: commoncrawl >Affects Versions: 1.11 >Reporter: David Johnson >Priority: Minor > Fix For: 1.14 > > > We're seeing some incidence of a large number of documents being marked as > duplicate in our crawl. > We traced it back to one of the crawl plugins returning an empty array for > the content field. > We'd like to propose changing the MD5 signature generation from: > {code} > public byte[] calculate(Content content, Parse parse) { > byte[] data = content.getContent(); > if (data == null) > data = content.getUrl().getBytes(); > return MD5Hash.digest(data).getDigest(); > } > {code} > to: > {code} > public byte[] calculate(Content content, Parse parse) { > byte[] data = content.getContent(); > if ((data == null) || (data.length == 0)) > data = content.getUrl().getBytes(); > return MD5Hash.digest(data).getDigest(); > } > {code} > to address the issue -- This message was sent by Atlassian JIRA (v6.4.14#64029)
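The proposed change above can be tried outside of Nutch with a small stand-alone sketch. It is only an illustration: the class name Md5FallbackSketch and the method signature(byte[], String) are made up for this example, and the JDK's MessageDigest stands in for Hadoop's MD5Hash, so this is not the code that was committed.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical stand-in for MD5Signature.calculate(); not the Nutch class.
final class Md5FallbackSketch {

  // Digest the content, or the URL when the content is null or empty.
  static byte[] signature(byte[] content, String url) {
    byte[] data = content;
    if (data == null || data.length == 0) {
      // Fall back to the URL so that empty documents do not all share one signature.
      data = url.getBytes(StandardCharsets.UTF_8);
    }
    try {
      return MessageDigest.getInstance("MD5").digest(data);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 is not available", e);
    }
  }

  public static void main(String[] args) {
    byte[] a = signature(new byte[0], "http://example.org/a");
    byte[] b = signature(new byte[0], "http://example.org/b");
    // With the length check, two empty documents get different signatures.
    System.out.println("same signature: " + Arrays.equals(a, b));
  }
}
```

Without the data.length == 0 check, both calls would digest an empty array and produce the same signature, which is the spurious duplication described in the report.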
[jira] [Commented] (NUTCH-2374) Upgrade Nutch 2.X to Gora 0.7
[ https://issues.apache.org/jira/browse/NUTCH-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074989#comment-16074989 ] Hudson commented on NUTCH-2374: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1587 (See [https://builds.apache.org/job/Nutch-nutchgora/1587/]) NUTCH-2374 Upgrade Nutch 2.X to Gora 0.7 (snagel: [https://github.com/apache/nutch/commit/63b58e8889297ca4afcff2de4c8b1f86d657dbf2]) * (edit) src/java/org/apache/nutch/storage/StorageUtils.java * (edit) ivy/ivy.xml * (edit) src/java/org/apache/nutch/host/HostDbUpdateJob.java * (edit) src/java/org/apache/nutch/util/domain/DomainStatistics.java * (edit) src/java/org/apache/nutch/crawl/WebTableReader.java > Upgrade Nutch 2.X to Gora 0.7 > - > > Key: NUTCH-2374 > URL: https://issues.apache.org/jira/browse/NUTCH-2374 > Project: Nutch > Issue Type: Bug > Components: build, storage >Reporter: Lewis John McGibbney > Fix For: 2.4 > > > We should make the upgrades before we release Nutch 2.X. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2393) 2.x patch for MD5 duplication issue addressed in NUTCH-2391
[ https://issues.apache.org/jira/browse/NUTCH-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074987#comment-16074987 ] Hudson commented on NUTCH-2393: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1587 (See [https://builds.apache.org/job/Nutch-nutchgora/1587/]) NUTCH-2393 Fix for issue addressed in NUTCH-2391 (kaidulislam90: [https://github.com/apache/nutch/commit/ef33ba7db80d08d5ef56501bcc45baadfee14dfc]) * (edit) src/java/org/apache/nutch/crawl/MD5Signature.java NUTCH-2393 checking buf.remaining() instead of buf.array().length == 0 (kaidulislam90: [https://github.com/apache/nutch/commit/3d0c1a765990f38a63172ff1016f3940325f5b59]) * (edit) src/java/org/apache/nutch/crawl/MD5Signature.java > 2.x patch for MD5 duplication issue addressed in NUTCH-2391 > --- > > Key: NUTCH-2393 > URL: https://issues.apache.org/jira/browse/NUTCH-2393 > Project: Nutch > Issue Type: Bug > Components: commoncrawl >Affects Versions: 2.3.1 >Reporter: Kaidul Islam >Assignee: Kaidul Islam >Priority: Minor > Fix For: 2.4 > > Original Estimate: 24h > Remaining Estimate: 24h > > Equivalent patch for 2.x for issue addressed in NUTCH-2391 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
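The second commit above switches the emptiness check from buf.array().length == 0 to buf.remaining(). The commit message does not say why, so the following is only a guess at the motivation, shown with plain java.nio.ByteBuffer. The class and method names are hypothetical and not taken from the Nutch source.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration; not the MD5Signature code from Nutch 2.x.
final class ByteBufferEmptyCheckSketch {

  // Looks at the whole backing array, regardless of position and limit.
  static boolean isEmptyByArray(ByteBuffer buf) {
    return buf.array().length == 0;
  }

  // Looks only at the bytes between position and limit.
  static boolean isEmptyByRemaining(ByteBuffer buf) {
    return buf.remaining() == 0;
  }

  public static void main(String[] args) {
    // A buffer that exposes zero readable bytes of a 16-byte backing array.
    ByteBuffer emptyView = ByteBuffer.wrap(new byte[16], 4, 0);
    System.out.println("array check says empty:     " + isEmptyByArray(emptyView));
    System.out.println("remaining check says empty: " + isEmptyByRemaining(emptyView));
  }
}
```

A buffer can have no readable bytes while its backing array is non-empty, so only the remaining() check reports such content as empty.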
[jira] [Commented] (NUTCH-2391) Spurious Duplications for MD5
[ https://issues.apache.org/jira/browse/NUTCH-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075010#comment-16075010 ] Hudson commented on NUTCH-2391: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3433 (See [https://builds.apache.org/job/Nutch-trunk/3433/]) NUTCH-2391 use URL for MD5 digest as fall-back if content is empty (snagel: [https://github.com/apache/nutch/commit/d35b433c397c03e78245c3e262ecaa31c78a564e]) * (edit) src/java/org/apache/nutch/crawl/MD5Signature.java > Spurious Duplications for MD5 > - > > Key: NUTCH-2391 > URL: https://issues.apache.org/jira/browse/NUTCH-2391 > Project: Nutch > Issue Type: Bug > Components: commoncrawl >Affects Versions: 1.11 >Reporter: David Johnson >Priority: Minor > Fix For: 1.14 > > > We're seeing some incidence of a large number of documents being marked as > duplicate in our crawl. > We traced it back to one of the crawl plugins returning an empty array for > the content field. > We'd like to propose changing the MD5 signature generation from: > {code} > public byte[] calculate(Content content, Parse parse) { > byte[] data = content.getContent(); > if (data == null) > data = content.getUrl().getBytes(); > return MD5Hash.digest(data).getDigest(); > } > {code} > to: > {code} > public byte[] calculate(Content content, Parse parse) { > byte[] data = content.getContent(); > if ((data == null) || (data.length == 0)) > data = content.getUrl().getBytes(); > return MD5Hash.digest(data).getDigest(); > } > {code} > to address the issue -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2398) Fetcher saving redirected robots.txt under redirect target URL
[ https://issues.apache.org/jira/browse/NUTCH-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089934#comment-16089934 ] Hudson commented on NUTCH-2398: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3434 (See [https://builds.apache.org/job/Nutch-trunk/3434/]) NUTCH-2398: Save content of redirected robots.txt under redirect target (snagel: [https://github.com/apache/nutch/commit/620b85df36d0c802f333a56ca1ef7021a7935360]) * (edit) src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java > Fetcher saving redirected robots.txt under redirect target URL > -- > > Key: NUTCH-2398 > URL: https://issues.apache.org/jira/browse/NUTCH-2398 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.13 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.14 > > > NUTCH-2300 lets the Fetcher store optionally the robots.txt response (content > and HTTP status). If the '.../robots.txt' is redirected, the redirected > content is also stored but with the redirect source URL as key. It should use > the redirect target URL instead. Otherwise one of the responses is > overwritten in the segments map file. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
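The overwrite described above can be illustrated with an ordinary map. This is a simplified model, not the HttpRobotRulesParser code: the class name, the example URLs and the stored strings are all invented, and a LinkedHashMap stands in for the segment's map file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of storing robots.txt responses keyed by URL.
final class RobotsRedirectKeySketch {

  // Stores the redirect response and the redirected content, then returns the "segment".
  static Map<String, String> store(boolean keyByTarget) {
    String source = "http://example.org/robots.txt";
    String target = "http://www.example.org/robots.txt";
    Map<String, String> segment = new LinkedHashMap<>();
    segment.put(source, "301 Moved Permanently");
    // Keying the redirected content by the source URL replaces the entry above.
    segment.put(keyByTarget ? target : source, "200 OK, robots rules");
    return segment;
  }

  public static void main(String[] args) {
    System.out.println("keyed by source: " + store(false).size() + " entry");
    System.out.println("keyed by target: " + store(true).size() + " entries");
  }
}
```

Keying by the redirect target keeps both responses, which is the behaviour the issue asks for.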
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092964#comment-16092964 ] Hudson commented on NUTCH-1465: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3435 (See [https://builds.apache.org/job/Nutch-trunk/3435/]) NUTCH-1465 (markus: [https://github.com/apache/nutch/commit/b58d6cd9111b2d25b8f6f009015ac214bac4006d]) * (edit) conf/log4j.properties * (add) src/java/org/apache/nutch/util/SitemapProcessor.java * (edit) ivy/ivy.xml * (edit) conf/nutch-default.xml * (edit) src/bin/nutch > Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Markus Jelsma > Fix For: 1.14 > > Attachments: NUTCH-1465.patch, NUTCH-1465.patch, NUTCH-1465.patch, > NUTCH-1465.patch, NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, > NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, > NUTCH-1465-trunk.v5.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. > [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2403) Nutch Selenium: Wrong documentation about PhantomJS
[ https://issues.apache.org/jira/browse/NUTCH-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099508#comment-16099508 ] Hudson commented on NUTCH-2403: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3438 (See [https://builds.apache.org/job/Nutch-trunk/3438/]) NUTCH-2403: Fix spelling of phantomJS configuration (moreno: [https://github.com/apache/nutch/commit/df5f96289097c44d4b4f405a2449b3352363d8e0]) * (edit) src/plugin/protocol-selenium/README.md > Nutch Selenium: Wrong documentation about PhantomJS > --- > > Key: NUTCH-2403 > URL: https://issues.apache.org/jira/browse/NUTCH-2403 > Project: Nutch > Issue Type: Bug > Components: documentation, plugin >Affects Versions: 1.13 >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher > Fix For: 1.14 > > > The Nutch Selenium documentation states that PhantomJS can be used as > {{phantomJS}} for {{selenium.driver}}. The correct value would be > {{phantomjs}} according to > https://github.com/apache/nutch/blob/master/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L124 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16102106#comment-16102106 ] Hudson commented on NUTCH-2368: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3446 (See [https://builds.apache.org/job/Nutch-trunk/3446/]) NUTCH-2368 Variable generate.max.count and fetcher.server.delay (markus: [https://github.com/apache/nutch/commit/44f7ad973f2017bacde2bf5277f846179eafc6dd]) * (edit) src/java/org/apache/nutch/fetcher/FetchItemQueue.java * (edit) conf/nutch-default.xml * (edit) src/java/org/apache/nutch/crawl/Generator.java > Variable generate.max.count and fetcher.server.delay > > > Key: NUTCH-2368 > URL: https://issues.apache.org/jira/browse/NUTCH-2368 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.12 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.14 > > Attachments: NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, > NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, > NUTCH-2368.patch, NUTCH-2368.patch > > > In some cases we need to use host specific characteristics in determining > crawl speed and bulk sizes because with our (Openindex) settings we can just > recrawl host with up to 800k urls. > This patch solves the problem by introducing the HostDB to the Generator and > providing powerful Jexl expressions. 
Check these two expressions added to the Generator:
> {code}
> -Dgenerate.max.count.expr='
> if (unfetched + fetched > 80) {
> return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1)
> } else {
> return conf.getDouble("generate.max.count", 300);
> }'
> -Dgenerate.fetch.delay.expr='
> if (unfetched + fetched > 80) {
> return (pct95._rs_ + 500);
> } else {
> return conf.getDouble("fetcher.server.delay", 1000)
> }'
> {code}
> For each large host: select as many records as it is possible to fetch, based
> on the number of threads, the 95th percentile response time and the fetch time
> limit. Or: queueMaxCount = (timelimit / responsetime) * numThreads.
> The second expression follows up on that, setting the crawlDelay of the
> fetch queue.
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
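To make the arithmetic in the two expressions concrete, here is a small Java sketch of the large-host branch. It is an illustration under assumptions: the class and method names are hypothetical, the 1500 ms response time is an invented input, and plain double arithmetic is used, which may differ from how Jexl evaluates integer operands.

```java
// Hypothetical re-statement of the two Jexl expressions for a large host.
final class GenerateExprSketch {

  // queueMaxCount = (timelimit / responsetime) * numThreads
  static double maxCount(double timeLimitMins, double pct95Ms, int threadsPerQueue) {
    double responseSecs = (pct95Ms + 500) / 1000; // padded 95th percentile, in seconds
    return (timeLimitMins * 60) / responseSecs * threadsPerQueue;
  }

  // Delay of the fetch queue in milliseconds.
  static double fetchDelayMs(double pct95Ms) {
    return pct95Ms + 500;
  }

  public static void main(String[] args) {
    // 12 minute time limit, assumed 1500 ms 95th percentile, 1 thread per queue.
    System.out.println("max count:   " + maxCount(12, 1500, 1));
    System.out.println("fetch delay: " + fetchDelayMs(1500) + " ms");
  }
}
```

With these inputs the host gets 720 s / 2 s = 360 URLs per fetch list and a delay of 2000 ms, so the generated list is sized to what one queue can fetch within the time limit.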
[jira] [Commented] (NUTCH-2389) Precise data parsing using Jsoup CSS selectors
[ https://issues.apache.org/jira/browse/NUTCH-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106551#comment-16106551 ] Hudson commented on NUTCH-2389: --- FAILURE: Integrated in Jenkins build Nutch-nutchgora #1588 (See [https://builds.apache.org/job/Nutch-nutchgora/1588/]) NUTCH-2389 jsoup-extractor with parse filter, indexing filter and unit (ikaidul: [https://github.com/apache/nutch/commit/f41735cb3c96650f6a51f1c5eb87566572bf1679]) * (add) src/plugin/jsoup-extractor/plugin.xml * (add) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/JsoupExtractorConstants.java * (add) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/JsoupDocumentReader.java * (add) src/plugin/jsoup-extractor/src/test/org/apache/nutch/parse/jsoup/extractor/TestJsoupParser.java * (add) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/package-info.java * (add) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/parse/JsoupHtmlParser.java * (edit) src/plugin/build.xml * (add) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/normalizer/package-info.java * (edit) conf/nutch-default.xml * (add) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/parse/package-info.java * (add) conf/jsoup-extractor-sample.xml * (edit) build.xml * (add) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/indexer/package-info.java * (add) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/normalizer/SimpleStringNormalizer.java * (add) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/indexer/JsoupIndexingFilter.java * (add) src/plugin/jsoup-extractor/src/test/org/apache/nutch/parse/jsoup/extractor/ViewCountNormalizer.java * (add) conf/jsoup-extractor.xml * (add) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/normalizer/Normalizable.java * (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/JsoupDocument.java * (add) src/plugin/jsoup-extractor/build.xml NUTCH-2389 jsoup-extractor/ivy.xml commited (ikaidul: [https://github.com/apache/nutch/commit/fe6997f30e4bcffe962da4d09ae73f379c026a76]) * (add) src/plugin/jsoup-extractor/ivy.xml NUTCH-2389 Unit test implemented but not passed (ikaidul: [https://github.com/apache/nutch/commit/17bd8f6e87f4fa4fd35c5aecfa09d8ef3bea6fd7]) * (delete) conf/jsoup-extractor-sample.xml * (edit) conf/jsoup-extractor.xml * (edit) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/JsoupDocumentReader.java * (edit) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/parse/JsoupHtmlParser.java * (edit) src/plugin/jsoup-extractor/plugin.xml * (edit) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/JsoupExtractorConstants.java * (edit) src/plugin/jsoup-extractor/src/test/org/apache/nutch/parse/jsoup/extractor/TestJsoupParser.java NUTCH-2389 package name changed (kaidulislam90: [https://github.com/apache/nutch/commit/52e785d6f8ebf6f57150b255df380510f6ebcf6b]) * (delete) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/indexer/JsoupIndexingFilter.java * (add) src/plugin/jsoup-extractor/src/java/org/apache/nutch/core/jsoup/extractor/JsoupExtractorConstants.java * (delete) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/package-info.java * (delete) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/parse/package-info.java * (edit) src/plugin/jsoup-extractor/plugin.xml * (add) src/plugin/jsoup-extractor/src/java/org/apache/nutch/core/jsoup/extractor/normalizer/Normalizable.java * (add) src/plugin/jsoup-extractor/src/java/org/apache/nutch/parse/jsoup/extractor/JsoupHtmlParser.java * (delete) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/JsoupDocument.java * (delete) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/parse/JsoupHtmlParser.java * (delete) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/JsoupExtractorConstants.java * (add) src/plugin/jsoup-extractor/src/java/org/apache/nutch/core/jsoup/extractor/JsoupDocument.java * (add) src/plugin/jsoup-extractor/src/java/org/apache/nutch/core/jsoup/extractor/normalizer/SimpleStringNormalizer.java * (delete) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/indexer/package-info.java * (add) src/plugin/jsoup-extractor/src/java/org/apache/nutch/core/jsoup/extractor/JsoupDocumentReader.java * (add) src/plugin/jsoup-extractor/src/java/org/apache/nutch/indexer/jsoup/extractor/package-info.java * (delete) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/normalizer/SimpleStringNormalizer.java * (delete) src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/normalizer/package-info.java
[jira] [Commented] (NUTCH-2404) Failed Jenkin Build #1588 error in unit test resolved
[ https://issues.apache.org/jira/browse/NUTCH-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16107675#comment-16107675 ] Hudson commented on NUTCH-2404: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1589 (See [https://builds.apache.org/job/Nutch-nutchgora/1589/]) NUTCH-2404 Fix for Failed Jenkin build #1588 after merging pull request (kaidulislam90: [https://github.com/apache/nutch/commit/a6870de1bdd518900d3546b3fe68d46d370db76c]) * (edit) src/plugin/jsoup-extractor/build.xml * (edit) src/plugin/jsoup-extractor/src/test/org/apache/nutch/parse/jsoup/extractor/TestJsoupHtmlParser.java > Failed Jenkin Build #1588 error in unit test resolved > - > > Key: NUTCH-2404 > URL: https://issues.apache.org/jira/browse/NUTCH-2404 > Project: Nutch > Issue Type: Bug > Components: test >Affects Versions: 2.4 >Reporter: Kaidul Islam >Assignee: Kaidul Islam > Fix For: 2.4 > > > Fix for Jenkin Build #1588 after merging pull request #192 (NUTCH-2389). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2405) jsoup-extractor structure correction, typo fixed
[ https://issues.apache.org/jira/browse/NUTCH-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120339#comment-16120339 ] Hudson commented on NUTCH-2405: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1590 (See [https://builds.apache.org/job/Nutch-nutchgora/1590/]) NUTCH-2405 1. Missed root tag added in jsoup-extractor.xml (kaidulislam90: [https://github.com/apache/nutch/commit/49ff77e83cc1e62cf10c377027c122e6a7d83128]) * (edit) conf/jsoup-extractor.xml * (edit) conf/jsoup-extractor-example.xml * (edit) src/plugin/jsoup-extractor/src/java/org/apache/nutch/parse/jsoup/extractor/JsoupHtmlParser.java > jsoup-extractor structure correction, typo fixed > > > Key: NUTCH-2405 > URL: https://issues.apache.org/jira/browse/NUTCH-2405 > Project: Nutch > Issue Type: Bug > Components: plugin >Affects Versions: 2.4 >Reporter: Kaidul Islam >Assignee: Kaidul Islam >Priority: Minor > Fix For: 2.4 > > > Several bugs faced during testing with my project have been fixed > 1. Missed root tag added in jsoup-extractor.xml like > jsoup-extractor-example.xml > 2. jsoup API text() used instead of ownText() to get full contents under CSS > selector > 3. => typo fixed -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2408) CrawlDb: allow update from unparsed segments
[ https://issues.apache.org/jira/browse/NUTCH-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127123#comment-16127123 ] Hudson commented on NUTCH-2408: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3448 (See [https://builds.apache.org/job/Nutch-trunk/3448/]) NUTCH-2408 CrawlDb: allow update from unparsed segments (snagel: [https://github.com/apache/nutch/commit/a7d0ac2724e1c16f8071a5d734f092d1bc03cac1]) * (edit) src/java/org/apache/nutch/crawl/CrawlDb.java > CrawlDb: allow update from unparsed segments > > > Key: NUTCH-2408 > URL: https://issues.apache.org/jira/browse/NUTCH-2408 > Project: Nutch > Issue Type: Improvement > Components: crawldb >Affects Versions: 1.13 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 1.14 > > > The command updatedb (class o.a.n.crawl.CrawlDb) does not allow to update the > CrawlDb with fetch status only (from segment subdirectory crawl_fetch) > without also reading crawl_parse (which contains outlinks but also scores, > signatures and meta data). > A workflow which does not require parsing of documents (e.g., because raw > HTML content is exported to WARC files) is then unable to update the CrawlDb > to store the fetch status. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2400) Solr 6.6.0 compatibility
[ https://issues.apache.org/jira/browse/NUTCH-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127517#comment-16127517 ] Hudson commented on NUTCH-2400: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3449 (See [https://builds.apache.org/job/Nutch-trunk/3449/]) NUTCH-2400 Solr 6.6.0 compatibility (lewis.mcgibbney: [https://github.com/apache/nutch/commit/1857e624db1c1671edeb58c6ea9e861cbb435440]) * (edit) conf/schema.xml NUTCH-2400 Solr 6.6.0 compatibility (lewis.mcgibbney: [https://github.com/apache/nutch/commit/4115dcaf6d2f0e55354fef88649f85c04bc7584b]) * (edit) conf/schema.xml > Solr 6.6.0 compatibility > > > Key: NUTCH-2400 > URL: https://issues.apache.org/jira/browse/NUTCH-2400 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.13 > Environment: Nutch 1.14-SNAPSHOT Solr 6.6.0 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Blocker > Fix For: 1.14 > > Attachments: managed-schema > > > This issue relates to following mailing list thread > http://www.mail-archive.com/user%40nutch.apache.org/msg15574.html > The schema.xml upgrade works with Solr 6.6.0, please try it out and let me > know how things go. > I've also updated the tutorial at https://wiki.apache.org/nutch/NutchTutorial > so please check that out as well. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2378) ChildFirst plugin classloader
[ https://issues.apache.org/jira/browse/NUTCH-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128751#comment-16128751 ] Hudson commented on NUTCH-2378: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3450 (See [https://builds.apache.org/job/Nutch-trunk/3450/]) NUTCH-2378 ChildFirst plugin classloader (contributed by Jurian (snagel: [https://github.com/apache/nutch/commit/e66d44d9c290c550e78edb425a43e010b861172c]) * (edit) src/plugin/indexer-solr/plugin.xml * (edit) src/plugin/parsefilter-naivebayes/plugin.xml * (edit) src/plugin/parse-tika/plugin.xml * (edit) src/java/org/apache/nutch/plugin/PluginClassLoader.java * (edit) src/plugin/parse-tika/ivy.xml > ChildFirst plugin classloader > - > > Key: NUTCH-2378 > URL: https://issues.apache.org/jira/browse/NUTCH-2378 > Project: Nutch > Issue Type: Improvement > Components: plugin >Affects Versions: 1.13 >Reporter: Jurian Broertjes >Assignee: Sebastian Nagel > Fix For: 2.4, 1.14 > > Attachments: NUTCH-2378-childfirst-plugin-classloader.patch > > > While working on upgrading the indexer-elastic plugin from 2.x to 5.x, I ran > into several nasty runtime dependency issues (both local and on Hadoop). > After seeking help on the mailing list, I still was unable to resolve these > issues and after digging further, decided to try a different plugin > classloader strategy. > The normal classloader delegates class loading requests to it's parent > classloader. This can cause all sorts of nasty runtime dependency version > conflicts (jar hell, version conflicts), since the plugin's own classloader > gets queried last. The child-first classloader approach tries to load a class > from the plugin's dependencies first and when unavailable, delegates to it's > parent classloader. This fixed the issues I had. 
> The new approach can give runtime LinkageErrors, but these are easily > resolvable (see the patch for a few examples) > I've tested the new loader a bit and am curious about others' findings. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
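The child-first strategy described above can be sketched as follows. This is a minimal illustration of the general technique, not the PluginClassLoader from the patch: the class name ChildFirstClassLoader and the decision to always delegate java.* classes to the parent are assumptions made for this example.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical child-first loader; the real PluginClassLoader may differ.
class ChildFirstClassLoader extends URLClassLoader {

  ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
    super(urls, parent);
  }

  @Override
  protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
    synchronized (getClassLoadingLock(name)) {
      Class<?> c = findLoadedClass(name);
      if (c == null) {
        if (name.startsWith("java.")) {
          // Core classes always come from the parent chain.
          c = super.loadClass(name, false);
        } else {
          try {
            // Look in the plugin's own jars first.
            c = findClass(name);
          } catch (ClassNotFoundException e) {
            // Not in the plugin: fall back to normal parent-first delegation.
            c = super.loadClass(name, false);
          }
        }
      }
      if (resolve) {
        resolveClass(c);
      }
      return c;
    }
  }

  public static void main(String[] args) throws Exception {
    // No plugin jars configured, so every request falls through to the parent.
    try (ChildFirstClassLoader loader =
        new ChildFirstClassLoader(new URL[0], ChildFirstClassLoader.class.getClassLoader())) {
      System.out.println(loader.loadClass("java.util.ArrayList") == java.util.ArrayList.class);
    }
  }
}
```

Because a class found in the plugin's jars wins over the parent's copy, the same class name can end up defined by two loaders, which is one way the LinkageErrors mentioned above can arise.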
[jira] [Commented] (NUTCH-2378) ChildFirst plugin classloader
[ https://issues.apache.org/jira/browse/NUTCH-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16132763#comment-16132763 ] Hudson commented on NUTCH-2378: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1591 (See [https://builds.apache.org/job/Nutch-nutchgora/1591/]) NUTCH-2378 ChildFirst plugin classloader (contributed by Jurian (snagel: [https://github.com/apache/nutch/commit/93fb5395478e982e45e8bebbf69435db1a8ce5e7]) * (edit) src/java/org/apache/nutch/plugin/PluginClassLoader.java * (edit) src/plugin/parse-tika/plugin.xml NUTCH-2378 ChildFirst plugin classloader - fix jsoup-extractor: all (snagel: [https://github.com/apache/nutch/commit/e1d9191158cc2519987c5646c64eaf5a11603089]) * (delete) src/plugin/jsoup-extractor/src/test/org/apache/nutch/parse/jsoup/extractor/ViewCountNormalizer.java * (edit) src/java/org/apache/nutch/plugin/Extension.java * (add) src/plugin/jsoup-extractor/src/java/org/apache/nutch/parse/jsoup/extractor/ViewCountNormalizer.java * (edit) src/plugin/jsoup-extractor/src/java/org/apache/nutch/core/jsoup/extractor/JsoupDocumentReader.java > ChildFirst plugin classloader > - > > Key: NUTCH-2378 > URL: https://issues.apache.org/jira/browse/NUTCH-2378 > Project: Nutch > Issue Type: Improvement > Components: plugin >Affects Versions: 1.13 >Reporter: Jurian Broertjes >Assignee: Sebastian Nagel > Fix For: 2.4, 1.14 > > Attachments: NUTCH-2378-childfirst-plugin-classloader.patch > > > While working on upgrading the indexer-elastic plugin from 2.x to 5.x, I ran > into several nasty runtime dependency issues (both local and on Hadoop). > After seeking help on the mailing list, I still was unable to resolve these > issues and after digging further, decided to try a different plugin > classloader strategy. > The normal classloader delegates class loading requests to it's parent > classloader. 
This can cause all sorts of nasty runtime dependency version > conflicts (jar hell, version conflicts), since the plugin's own classloader > gets queried last. The child-first classloader approach tries to load a class > from the plugin's dependencies first and when unavailable, delegates to its > parent classloader. This fixed the issues I had. > The new approach can give runtime LinkageErrors, but these are easily > resolvable (see the patch for a few examples) > I've tested the new loader a bit and am curious about others' findings. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
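The child-first strategy described in NUTCH-2378 can be sketched roughly as follows. This is a minimal illustration, not the code from the attached patch or from Nutch's PluginClassLoader: the class name ChildFirstClassLoader and the rule of always delegating "java." classes to the parent are assumptions made for the example.

```java
import java.net.URL;
import java.net.URLClassLoader;

/** Hypothetical child-first loader (illustrative only, not Nutch's PluginClassLoader). */
class ChildFirstClassLoader extends URLClassLoader {

  ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
    super(urls, parent);
  }

  @Override
  protected Class<?> loadClass(String name, boolean resolve)
      throws ClassNotFoundException {
    synchronized (getClassLoadingLock(name)) {
      // Reuse a class this loader has already defined.
      Class<?> c = findLoadedClass(name);
      if (c == null) {
        if (name.startsWith("java.")) {
          // Assumption: core JDK classes always come from the parent chain.
          c = super.loadClass(name, false);
        } else {
          try {
            // Child first: look in the plugin's own jars before asking the parent.
            c = findClass(name);
          } catch (ClassNotFoundException e) {
            // Not among the plugin's dependencies: fall back to normal delegation.
            c = super.loadClass(name, false);
          }
        }
      }
      if (resolve) {
        resolveClass(c);
      }
      return c;
    }
  }
}
```

With this lookup order, a class that exists both in a plugin jar and on the parent classpath is defined by the plugin loader. That is also how the LinkageErrors mentioned in the issue can arise: the same class name ends up defined by two different loaders.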
[jira] [Commented] (NUTCH-2413) Parsing fetcher to respect property "parse.filter.urls"
[ https://issues.apache.org/jira/browse/NUTCH-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16142732#comment-16142732 ] Hudson commented on NUTCH-2413: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3452 (See [https://builds.apache.org/job/Nutch-trunk/3452/]) fix for NUTCH-2413 contributed by maborec (marcos: [https://github.com/apache/nutch/commit/6c648633cecc158f409e3a4ec45cf33bc68b4b1d]) * (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java fix for NUTCH-2413 contributed by maborec (marcos: [https://github.com/apache/nutch/commit/5dc48f2fc2f7a6f9d039251b9133df12bee99d52]) * (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java NUTCH-2413 - Fix some styling. Prepare filters and normalizers in (marcos: [https://github.com/apache/nutch/commit/60af77262726e8a09202a2319add512c54e7a2f4]) * (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java > Parsing fetcher to respect property "parse.filter.urls" > --- > > Key: NUTCH-2413 > URL: https://issues.apache.org/jira/browse/NUTCH-2413 > Project: Nutch > Issue Type: Bug > Components: fetcher, parser >Affects Versions: 1.13 > Environment: Apache Nutch release 1.13. >Reporter: Marcos Bori >Assignee: Sebastian Nagel > Fix For: 1.14 > > > In a situation when we want to: > (1) Execute the fetch and parse together ("fetcher.parse" setting to "true") > (2) Avoid applying the URL filters when executing this phase. > Condition (2) can be configured when parsing is executed as a separate > process by setting "parse.filter.urls" to "false". > However, this setting ("parse.filter.urls") is ignored when we execute the > fetch and parse phases together. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2397) Parser to add paragraph line breaks
[ https://issues.apache.org/jira/browse/NUTCH-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161121#comment-16161121 ] Hudson commented on NUTCH-2397: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1592 (See [https://builds.apache.org/job/Nutch-nutchgora/1592/]) NUTCH-2397: Parser to add paragraph line breaks (snagel: [https://github.com/apache/nutch/commit/aaa8099c8fe3761869f4c881fb66b2c11a2e350b]) * (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java * (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java > Parser to add paragraph line breaks > --- > > Key: NUTCH-2397 > URL: https://issues.apache.org/jira/browse/NUTCH-2397 > Project: Nutch > Issue Type: Improvement > Components: parser >Affects Versions: 2.3.1, 1.13 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 2.4, 1.14 > > > (initially reported with patch/pull-request by Vipul Behl, see > [#190|https://github.com/apache/nutch/pull/190]) > The parser (parse-tika and parse-html) could be improved to add line breaks > between paragraphs, instead of writing the whole document into a single line. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2409) Injector: complete command-line help and counters
[ https://issues.apache.org/jira/browse/NUTCH-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161128#comment-16161128 ] Hudson commented on NUTCH-2409: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3453 (See [https://builds.apache.org/job/Nutch-trunk/3453/]) NUTCH-2409 Injector: complete command-line help and counters - add (snagel: [https://github.com/apache/nutch/commit/9b4a9df26c5b92c82029d030a9bf72cda043209c]) * (edit) src/java/org/apache/nutch/crawl/Injector.java > Injector: complete command-line help and counters > - > > Key: NUTCH-2409 > URL: https://issues.apache.org/jira/browse/NUTCH-2409 > Project: Nutch > Issue Type: Improvement > Components: injector >Affects Versions: 1.13 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Trivial > Fix For: 1.14 > > > See discussion in > [NUTCH-2335|https://issues.apache.org/jira/browse/NUTCH-2335?focusedCommentId=16130178&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16130178]: > - add counters for removed items from CrawlDb: > {noformat} > Injector: Total urls removed from CrawlDb by filters: 2 > Injector: Total urls with status gone removed from CrawlDb > (db.update.purge.404): 0 > {noformat} > - add {{-Ddb.update.purge.404=true}} to command-line help: > {noformat} > Usage: Injector [-D...] [-overwrite|-update] [-noFilter] > [-noNormalize] [-filterNormalizeAll] > ... > -D... set or overwrite configuration property (property=value) > -Ddb.update.purge.404=true > remove URLs with status gone (404) from CrawlDb > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2430) Complete plugin build configuration
[ https://issues.apache.org/jira/browse/NUTCH-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16179201#comment-16179201 ] Hudson commented on NUTCH-2430: --- FAILURE: Integrated in Jenkins build Nutch-trunk #3454 (See [https://builds.apache.org/job/Nutch-trunk/3454/]) Complete plugin build configuration (NUTCH-2430) - add missing plugin (snagel: [https://github.com/apache/nutch/commit/64fc5761ee1f04538426bb4a7d3eea140996a976]) * (edit) build.xml * (edit) src/plugin/build.xml * (delete) src/plugin/parse-replace/plugin.xml * (delete) src/plugin/parse-replace/sample/testParseReplace.html * (delete) src/plugin/parse-replace/README.txt * (delete) src/plugin/parse-replace/src/java/org/apache/nutch/parse/replace/ReplaceParser.java * (delete) src/plugin/parse-replace/src/test/org/apache/nutch/parse/replace/TestParseReplace.java * (edit) default.properties * (delete) src/plugin/parse-replace/src/java/org/apache/nutch/parse/replace/package-info.java * (delete) src/plugin/parse-replace/build.xml * (delete) src/plugin/parse-replace/ivy.xml > Complete plugin build configuration > --- > > Key: NUTCH-2430 > URL: https://issues.apache.org/jira/browse/NUTCH-2430 > Project: Nutch > Issue Type: Improvement > Components: build >Affects Versions: 1.13 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel > Fix For: 1.14 > > > The build configuration around plugins isn't complete > - missing plugin folders in the Eclipse target (see NUTCH-2135) > - not all plugins included API docs / javadoc -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2436) Remove empty comment, and redundant semicolon from CommandRunner
[ https://issues.apache.org/jira/browse/NUTCH-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184747#comment-16184747 ] Hudson commented on NUTCH-2436: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3456 (See [https://builds.apache.org/job/Nutch-trunk/3456/]) NUTCH-2436 Fix (kenneth: [https://github.com/apache/nutch/commit/4d67a77bde35a8af1b7d62e7cd281bdf13b11b80]) * (edit) src/java/org/apache/nutch/util/CommandRunner.java > Remove empty comment, and redundant semicolon from CommandRunner > > > Key: NUTCH-2436 > URL: https://issues.apache.org/jira/browse/NUTCH-2436 > Project: Nutch > Issue Type: Bug >Reporter: kenneth mcfarland >Assignee: kenneth mcfarland >Priority: Trivial > Fix For: 1.14 > > > CommandRunner has a set of empty comments and a redundant semicolon. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2433) Html Parser: keep htmltag where the outlinks are found
[ https://issues.apache.org/jira/browse/NUTCH-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16185784#comment-16185784 ] Hudson commented on NUTCH-2433: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3457 (See [https://builds.apache.org/job/Nutch-trunk/3457/]) NUTCH-2433 / Html Parser: keep htmltag where the outlinks are found (marcos: [https://github.com/apache/nutch/commit/7db11734f25a53cda15634071a47ff524a06002e]) * (edit) conf/nutch-default.xml * (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java > Html Parser: keep htmltag where the outlinks are found > -- > > Key: NUTCH-2433 > URL: https://issues.apache.org/jira/browse/NUTCH-2433 > Project: Nutch > Issue Type: New Feature > Components: parser >Affects Versions: 1.13 > Environment: Apache Nutch release 1.13. >Reporter: Marcos Bori > Labels: html, outlink > Fix For: 1.14 > > > When parsing HTML pages, I need to know in which HTML tag the outlinks were > found (for example, 'a', 'script', 'img', etc). > I propose to add a new configuration value, > "parser.html.outlinks.htmlnode_metadata_name". > If this configuration property is not empty, all found outlinks will be > assigned a metadata with the name indicated in this configuration property > with the html tag name where the outlink was found. > I will now send the pull request with my code implementation. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2437) gora mongodb mapping file error
[ https://issues.apache.org/jira/browse/NUTCH-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191807#comment-16191807 ] Hudson commented on NUTCH-2437: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1593 (See [https://builds.apache.org/job/Nutch-nutchgora/1593/]) fix for NUTCH-2437 contributed by tmzzngl (tulay.muezzinoglu: [https://github.com/apache/nutch/commit/93b879600683540069ed35799d8710f0739a766d]) * (edit) conf/gora-mongodb-mapping.xml > gora mongodb mapping file error > --- > > Key: NUTCH-2437 > URL: https://issues.apache.org/jira/browse/NUTCH-2437 > Project: Nutch > Issue Type: Bug > Components: storage >Affects Versions: 2.4 >Reporter: Tulay Muezzinoglu >Priority: Trivial > Labels: gora, mapping, mongo > Fix For: 2.4 > > > conf/gora-mongodb-mapping.xml > {code} > > {code} > should be > {code} > > {code} > Otherwise it is throwing exception. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-1763) Improving comments on the Injector Class
[ https://issues.apache.org/jira/browse/NUTCH-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211836#comment-16211836 ] Hudson commented on NUTCH-1763: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3458 (See [https://builds.apache.org/job/Nutch-trunk/3458/]) NUTCH-1763 Code comment Injector contributed by Diaa (snagel: [https://github.com/apache/nutch/commit/21d56a0c5626553a3bf5058588d9277e6844e00f]) * (edit) src/java/org/apache/nutch/crawl/Injector.java > Improving comments on the Injector Class > > > Key: NUTCH-1763 > URL: https://issues.apache.org/jira/browse/NUTCH-1763 > Project: Nutch > Issue Type: Improvement > Components: injector >Affects Versions: 1.9 >Reporter: Diaa >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.14 > > Attachments: Injector.java.patch, Injector.java.patch > > Original Estimate: 0h > Remaining Estimate: 0h > > I think the Injector class could use some improvements in the comments. > I am attaching a few improvements to that and will keep adding as I > understand it more. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2446) URLFiltersCheck fix
[ https://issues.apache.org/jira/browse/NUTCH-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214825#comment-16214825 ] Hudson commented on NUTCH-2446: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1594 (See [https://builds.apache.org/job/Nutch-nutchgora/1594/]) Fix for NUTCH-2446 by Kenneth McFarland (snagel: [https://github.com/apache/nutch/commit/72128eb5e863afea66ff5be7a7a2df824af688e8]) * (edit) src/java/org/apache/nutch/net/URLFilterChecker.java > URLFiltersCheck fix > --- > > Key: NUTCH-2446 > URL: https://issues.apache.org/jira/browse/NUTCH-2446 > Project: Nutch > Issue Type: Bug > Environment: master >Reporter: kenneth mcfarland >Assignee: kenneth mcfarland >Priority: Minor > Fix For: 2.4, 1.14 > > > Currently URLFilterChecker.checkAll() creates a URLFilters object repeatedly > when conf does not change. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2446) URLFiltersCheck fix
[ https://issues.apache.org/jira/browse/NUTCH-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214828#comment-16214828 ] Hudson commented on NUTCH-2446: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3459 (See [https://builds.apache.org/job/Nutch-trunk/3459/]) Fix for NUTCH-2446 by Kenneth McFarland (kennethpaulmcfarland: [https://github.com/apache/nutch/commit/19fdd6c6339efd08c7c77d3c4e87f464b7c3a038]) * (edit) src/java/org/apache/nutch/net/URLFilterChecker.java > URLFiltersCheck fix > --- > > Key: NUTCH-2446 > URL: https://issues.apache.org/jira/browse/NUTCH-2446 > Project: Nutch > Issue Type: Bug > Environment: master >Reporter: kenneth mcfarland >Assignee: kenneth mcfarland >Priority: Minor > Fix For: 2.4, 1.14 > > > Currently URLFilterChecker.checkAll() creates a URLFilters object repeatedly > when conf does not change. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2444) HostDB CSV dumper to emit field header by default
[ https://issues.apache.org/jira/browse/NUTCH-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215161#comment-16215161 ] Hudson commented on NUTCH-2444: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3460 (See [https://builds.apache.org/job/Nutch-trunk/3460/]) NUTCH-2444 HostDB CSV dumper to emit field header by default (markus: [https://github.com/apache/nutch/commit/d7e4046e6e725ed759d0c43e37c51c5c3122e006]) * (edit) src/java/org/apache/nutch/hostdb/ReadHostDb.java > HostDB CSV dumper to emit field header by default > - > > Key: NUTCH-2444 > URL: https://issues.apache.org/jira/browse/NUTCH-2444 > Project: Nutch > Issue Type: Bug > Components: hostdb >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.14 > > Attachments: NUTCH-2444.patch > > > Started to get annoyed by constantly having to look up HostDatum for the field > set. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2445) Fetcher following outlinks to keep track of already fetched items
[ https://issues.apache.org/jira/browse/NUTCH-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215248#comment-16215248 ] Hudson commented on NUTCH-2445: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3461 (See [https://builds.apache.org/job/Nutch-trunk/3461/]) NUTCH-2445 Fetcher following outlinks to keep track of already fetched (markus: [https://github.com/apache/nutch/commit/0cdd095c881eed52dc461e559ce6ae278e99157f]) * (edit) src/java/org/apache/nutch/fetcher/FetchItemQueue.java * (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java > Fetcher following outlinks to keep track of already fetched items > - > > Key: NUTCH-2445 > URL: https://issues.apache.org/jira/browse/NUTCH-2445 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.14 > > Attachments: NUTCH-2445.patch, NUTCH-2445.patch > > > When fetcher.follow.outlinks.depth is non-zero, fetcher follows outlinks. > This patch keeps track of already fetched URLs and thus avoids fetching the > same URL twice. > A Set is used to keep track of them, storing hash codes to reduce memory usage. This > is not used if fetcher doesn't follow outlinks. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
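The idea in NUTCH-2445 of remembering fetched URLs by hash code instead of by full string can be sketched like this. The class and method names are hypothetical and are not taken from FetchItemQueue or FetcherThread; the sketch only shows the data structure the description mentions.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical tracker for already fetched URLs (illustrative only). */
class FetchedUrlTracker {

  // Stores one Integer per URL instead of the URL string itself.
  private final Set<Integer> seenHashes = new HashSet<>();

  /**
   * Records the URL and reports whether it was new.
   * Returns true the first time a hash code is seen, false afterwards.
   */
  boolean markFetched(String url) {
    return seenHashes.add(url.hashCode());
  }
}
```

The trade-off of this design is that two different URLs with the same hash code are treated as one, so a small number of outlinks may be skipped in exchange for the lower memory footprint.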
[jira] [Commented] (NUTCH-2448) Allow Sending an empty http.agent.version
[ https://issues.apache.org/jira/browse/NUTCH-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217545#comment-16217545 ] Hudson commented on NUTCH-2448: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3462 (See [https://builds.apache.org/job/Nutch-trunk/3462/]) NUTCH-2448: Treat white-space http.agent.version as empty. (github: [https://github.com/apache/nutch/commit/9f54d5b3ec5a0fd36f91ec8af762e52859f4eeea]) * (edit) src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java > Allow Sending an empty http.agent.version > - > > Key: NUTCH-2448 > URL: https://issues.apache.org/jira/browse/NUTCH-2448 > Project: Nutch > Issue Type: Bug > Components: fetcher, protocol >Affects Versions: 1.13 >Reporter: Yossi Tamari >Priority: Minor > Fix For: 2.4, 1.14 > > > http.agent.version defaults in nutch-default.xml to Nutch-1.14-SNAPSHOT > (depending on the version of course). > If I want to override it to not send a version as part of the user-agent, > there is nothing I can do in nutch-site.xml, since putting an empty string > there causes the default to be taken, and putting any value there causes a > slash to be appended to the http.agent.name. > As far as I can see, the only way to override it is to remove the value in > nutch-default.xml, which is probably not the “correct” way, considering it > contains a comment saying “Do not modify this file directly”. > The suggested solution is to treat a white-space-only value as empty. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
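The suggested behaviour for NUTCH-2448 can be illustrated with a small sketch. The UserAgentSketch class and its build method are assumptions for this example; the real user-agent string assembled in HttpBase contains more parts than name and version.

```java
/** Hypothetical user-agent builder (illustrative only, not HttpBase). */
class UserAgentSketch {

  /**
   * Appends "/version" to the agent name only when the version contains
   * something other than white space.
   */
  static String build(String agentName, String agentVersion) {
    StringBuilder agent = new StringBuilder(agentName);
    if (agentVersion != null && !agentVersion.trim().isEmpty()) {
      agent.append('/').append(agentVersion.trim());
    }
    return agent.toString();
  }
}
```

Under this sketch, setting http.agent.version to a single space in nutch-site.xml would yield the bare agent name with no trailing slash.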
[jira] [Commented] (NUTCH-2394) Possible bugs in the source code
[ https://issues.apache.org/jira/browse/NUTCH-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219095#comment-16219095 ] Hudson commented on NUTCH-2394: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3464 (See [https://builds.apache.org/job/Nutch-trunk/3464/]) NUTCH-2394 Fix of bugs detected by static code analysis - String.trim() (snagel: [https://github.com/apache/nutch/commit/63037c71370cad1eba4152668f33b184c686d092]) * (edit) src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net/urlnormalizer/protocol/ProtocolURLNormalizer.java * (edit) src/java/org/apache/nutch/crawl/URLPartitioner.java * (edit) src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java * (edit) src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java * (edit) src/plugin/urlnormalizer-slash/src/java/org/apache/nutch/net/urlnormalizer/slash/SlashURLNormalizer.java * (edit) src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net/urlnormalizer/host/HostURLNormalizer.java * (edit) src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java > Possible bugs in the source code > > > Key: NUTCH-2394 > URL: https://issues.apache.org/jira/browse/NUTCH-2394 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.13 >Reporter: AppChecker > Labels: appchecker, static-analysis > Fix For: 1.14 > > > Hi! > I've checked your project with static analyzer > [AppChecker|https://npo-echelon.ru/en/solutions/appchecker.php] and it found > several suspicious code fragments: > 1) > [src/plugin/headings/src/java/org/apache/nutch/parse/headings/HeadingsParseFilter.java|https://github.com/apache/nutch/blob/e53b34b2322f2d071981a72577644a225642ecbc/src/plugin/headings/src/java/org/apache/nutch/parse/headings/HeadingsParseFilter.java#L56] > {code:java} > heading.trim(); > {code} > heading is not changed, because java.lang.String.trim returns a new string. 
> Probably, it should be: > {code:java} > heading = heading.trim(); > {code} > see also: > * > [src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net/urlnormalizer/host/HostURLNormalizer.java#L78|https://github.com/apache/nutch/blob/e53b34b2322f2d071981a72577644a225642ecbc/src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net/urlnormalizer/host/HostURLNormalizer.java#L78] > * > [src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java#L115|https://github.com/apache/nutch/blob/e53b34b2322f2d071981a72577644a225642ecbc/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java#L115] > * > [src/java/org/apache/nutch/net/urlnormalizer/protocol/ProtocolURLNormalizer.java#L76|https://github.com/apache/nutch/blob/e53b34b2322f2d071981a72577644a225642ecbc/src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net/urlnormalizer/protocol/ProtocolURLNormalizer.java#L76] > * > [src/java/org/apache/nutch/net/urlnormalizer/slash/SlashURLNormalizer.java#L78|https://github.com/apache/nutch/blob/e53b34b2322f2d071981a72577644a225642ecbc/src/plugin/urlnormalizer-slash/src/java/org/apache/nutch/net/urlnormalizer/slash/SlashURLNormalizer.java#L78] > * > [src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java#L326|https://github.com/apache/nutch/blob/e53b34b2322f2d071981a72577644a225642ecbc/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java#L326] > 2) > [src/java/org/apache/nutch/crawl/URLPartitioner.java#L84|https://github.com/apache/nutch/blob/2b93a66f0472e93223c69053d5482dcbef26de6d/src/java/org/apache/nutch/crawl/URLPartitioner.java#L84] > {code:java} > if (mode.equals(PARTITION_MODE_DOMAIN) && url != null) > ... > else if .. > ... > InetAddress address = InetAddress.getByName(url.getHost()); > ... 
> {code} > if url is null, method url.getHost() will be invoked, so a NullPointerException > will be thrown > 3) > [src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java#L346|https://github.com/apache/nutch/blob/e53b34b2322f2d071981a72577644a225642ecbc/src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java#L346] > {code:java} > String[] fullPathLevels = fullDir.split(File.separator); > {code} > Using File.separator in regular expressions may throw > java.util.regex.PatternSyntaxException, because it is "\" on > Windows-based systems. > Possible correction: > {code:java} > String[] fullPathLevels = fullDir.split(Pattern.quote(File.separator)); > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
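Findings 1) and 3) of NUTCH-2394 can be reproduced in isolation with the sketch below. The class and method names are made up for the example, and the separator is passed in as a parameter so the Windows case can be exercised on any platform.

```java
import java.util.regex.Pattern;

/** Hypothetical helpers illustrating two of the reported findings. */
class StaticAnalysisFixes {

  /** Strings are immutable, so the result of trim() has to be assigned. */
  static String normalizeHeading(String heading) {
    heading = heading.trim(); // "heading.trim();" alone would discard the result
    return heading;
  }

  /**
   * Splits a path on a literal separator. Pattern.quote keeps a backslash
   * separator from being interpreted as the start of a regex escape.
   */
  static String[] splitPath(String path, String separator) {
    return path.split(Pattern.quote(separator));
  }
}
```

Calling "C:\\data\\segments".split("\\") directly fails with a PatternSyntaxException because a single backslash is an incomplete regular expression, which is the Windows failure mode the report describes.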
[jira] [Commented] (NUTCH-2452) Problem retrieving encoded URLs via FTP?
[ https://issues.apache.org/jira/browse/NUTCH-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239741#comment-16239741 ] Hudson commented on NUTCH-2452: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3465 (See [https://builds.apache.org/job/Nutch-trunk/3465/]) NUTCH-2452 Allow nutch to retrieve Ftp URLs that contain UrlEncoded (snagel: [https://github.com/apache/nutch/commit/517dbdf3261d42e90883d07320b7991ff8e2bcf8]) * (edit) src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpResponse.java > Problem retrieving encoded URLs via FTP? > > > Key: NUTCH-2452 > URL: https://issues.apache.org/jira/browse/NUTCH-2452 > Project: Nutch > Issue Type: Bug > Components: protocol >Affects Versions: 1.13 > Environment: Ubuntu 16.04.3 LTS > OpenJDK 1.8.0_131 > nutch 1.14-SNAPSHOT > Synology RS816 >Reporter: Hiran Chaudhuri > Fix For: 1.14 > > > I tried running Nutch on my Synology NAS. As SMB protocol is not contained in > Nutch, I turned on FTP service on the NAS and configured Nutch to crawl > ftp://nas. > The experience gives me varying results which seem to point to problems > within Nutch. However this may need further evaluation. 
> As some files could not be downloaded and I could not see a good error > message I changed the method > org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to not > only return protocol status but send the full exception and stack trace to > the logs: > {{ } catch (Exception e) { > LOG.warn("Could not get {}", url, e); > return new ProtocolOutput(null, new ProtocolStatus(e)); > } > }} > With this modification I suddenly see such messages in the logfile: > {{2017-10-25 14:14:37,254 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching > ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/ > 2017-10-25 14:14:37,512 WARN org.apache.nutch.protocol.ftp.Ftp - Could not > get ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/ > org.apache.nutch.protocol.ftp.FtpError: Ftp Error: 404 > at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:151) > at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340) > }} > Please note that the URL was not configured by me. Instead it was obtained by > crawling my NAS. Also the URL looks perfectly fine to me. Even more, using > Firefox and the same authentication data on the same URL displays the > directory successfully. Therefore I suspect the FTP client is unable to > decode the URL such that the FTP server would understand it. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
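The suspicion in NUTCH-2452 is that the percent-encoded path is sent to the FTP server as-is. A minimal sketch of decoding the path before using it might look as follows; FtpPathDecoder is a hypothetical helper and is not the change made to FtpResponse.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical helper that decodes a URL path before it is sent to an FTP server. */
class FtpPathDecoder {

  /** Turns percent-escapes such as %20 back into the characters they stand for. */
  static String decodePath(String encodedPath) {
    try {
      return URLDecoder.decode(encodedPath, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      // UTF-8 is a charset every Java platform must support.
      throw new IllegalStateException(e);
    }
  }
}
```

One caveat with this choice: URLDecoder implements form decoding and therefore also turns "+" into a space, which may not be wanted for file names that contain a literal plus sign.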
[jira] [Commented] (NUTCH-2443) Extract links from the video tag with the parse-html plugin
[ https://issues.apache.org/jira/browse/NUTCH-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239740#comment-16239740 ] Hudson commented on NUTCH-2443: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3465 (See [https://builds.apache.org/job/Nutch-trunk/3465/]) NUTCH-2443 add source tag to the parse-html and parse-tika outlink (jorge-luis.betancourt: [https://github.com/apache/nutch/commit/d34a002b25a770369ad6a5a20475c7072d8fa02b]) * (edit) src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestDOMContentUtils.java * (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java * (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java * (edit) src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java > Extract links from the video tag with the parse-html plugin > --- > > Key: NUTCH-2443 > URL: https://issues.apache.org/jira/browse/NUTCH-2443 > Project: Nutch > Issue Type: Improvement > Components: parser, plugin >Affects Versions: 1.13 >Reporter: Jorge Luis Betancourt Gonzalez >Assignee: Jorge Luis Betancourt Gonzalez >Priority: Minor > Fix For: 1.14 > > > At the moment the {{parse-html}} extracts links from the tags {{a, area, > form}} (configurable){{, frame, iframe, script, link, img}}. Since we allow > extracting links to binary files (images) extracting links also from the > {{video}} tag should be supported. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2420) Bug in variable generate.max.count and fetcher.server.delay
[ https://issues.apache.org/jira/browse/NUTCH-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240651#comment-16240651 ] Hudson commented on NUTCH-2420: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3466 (See [https://builds.apache.org/job/Nutch-trunk/3466/]) NUTCH-2420 Bug in variable generate.max.count and fetcher.server.delay (markus: [https://github.com/apache/nutch/commit/6199492f5e1e8811022257c88dbf63f1e1c739d0]) * (edit) src/java/org/apache/nutch/crawl/Generator.java > Bug in variable generate.max.count and fetcher.server.delay > --- > > Key: NUTCH-2420 > URL: https://issues.apache.org/jira/browse/NUTCH-2420 > Project: Nutch > Issue Type: Bug > Components: generator >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.14 > > Attachments: NUTCH-2420.patch > > > Feature added by NUTCH-2368 does not work for multiple hosts. Once a > HostDatum has been read by getHostDatum(), the next host cannot be read. > Apparently I need to open and close the SequenceFile.Readers for every > HostDatum it needs. The Reader has no reset() method or anything similar. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb
[ https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240970#comment-16240970 ] Hudson commented on NUTCH-2442: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3467 (See [https://builds.apache.org/job/Nutch-trunk/3467/]) NUTCH-2442 Injector to stop if job fails to avoid loss of CrawlDb (omkarreddy2008: [https://github.com/apache/nutch/commit/2352f9a4f47693cd8ca653f0b0629d186593fc4a]) * (edit) src/java/org/apache/nutch/util/domain/DomainStatistics.java * (edit) src/java/org/apache/nutch/crawl/Injector.java * (edit) src/java/org/apache/nutch/util/CrawlCompletionStats.java * (edit) src/java/org/apache/nutch/util/ProtocolStatusStatistics.java * (edit) src/java/org/apache/nutch/util/SitemapProcessor.java * (edit) src/java/org/apache/nutch/hostdb/ReadHostDb.java > Injector to stop if job fails to avoid loss of CrawlDb > -- > > Key: NUTCH-2442 > URL: https://issues.apache.org/jira/browse/NUTCH-2442 > Project: Nutch > Issue Type: Bug > Components: injector >Affects Versions: 1.13 >Reporter: Sebastian Nagel >Priority: Critical > Fix For: 1.14 > > > Injector does not check whether the MapReduce job is successful. Even if the > job fails, the Injector > - installs the CrawlDb > -- move current/ to old/ > -- replace current/ with an empty or potentially incomplete version > - exits with code 0 so that scripts running the crawl workflow cannot detect > the failure -- if Injector is run a second time the CrawlDb is lost (both > current/ and old/ are empty or corrupted) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2458) TikaParser doesn't work with tika-config.xml set
[ https://issues.apache.org/jira/browse/NUTCH-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247521#comment-16247521 ] Hudson commented on NUTCH-2458: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3468 (See [https://builds.apache.org/job/Nutch-trunk/3468/]) NUTCH-2458 (markus: [https://github.com/apache/nutch/commit/c345618ec425f0e907a6e54565f2d0577139b45f]) * (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java > TikaParser doesn't work with tika-config.xml set > > > Key: NUTCH-2458 > URL: https://issues.apache.org/jira/browse/NUTCH-2458 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.14 > > Attachments: NUTCH-2458.patch > > > Well, it doesn't indeed. Thanks to Timothy Allison, it's solved. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2458) TikaParser doesn't work with tika-config.xml set
[ https://issues.apache.org/jira/browse/NUTCH-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268626#comment-16268626 ] Hudson commented on NUTCH-2458: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3469 (See [https://builds.apache.org/job/Nutch-trunk/3469/]) NUTCH-2458 (snagel: [https://github.com/apache/nutch/commit/c17dd1dd6bf914beb7b13528c95b487630f86905]) * (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java > TikaParser doesn't work with tika-config.xml set > > > Key: NUTCH-2458 > URL: https://issues.apache.org/jira/browse/NUTCH-2458 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.14 > > Attachments: NUTCH-2458.patch > > > Well, it doesn't indeed. Thanks to Timothy Allison, it's solved. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2463) Enable sampling CrawlDB
[ https://issues.apache.org/jira/browse/NUTCH-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268625#comment-16268625 ] Hudson commented on NUTCH-2463: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3469 (See [https://builds.apache.org/job/Nutch-trunk/3469/]) NUTCH-2463 - Enable sampling CrawlDB (github: [https://github.com/apache/nutch/commit/65651b5cce54736978356ba1a8dea8a10f405d3c]) * (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java > Enable sampling CrawlDB > --- > > Key: NUTCH-2463 > URL: https://issues.apache.org/jira/browse/NUTCH-2463 > Project: Nutch > Issue Type: Improvement > Components: crawldb >Reporter: Yossi Tamari >Priority: Minor > Fix For: 1.14 > > > CrawlDB can grow to contain billions of records. When that happens *readdb > -dump* is pretty useless, and *readdb -topN* can run for ages (and does not > provide a statistically correct sample). > We should add a parameter *-sample* to *readdb -dump* which is followed by a > number between 0 and 1, and only that fraction of records from the CrawlDB > will be processed. > The sample should be statistically random, and all the other filters should > be applied on the sampled records. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
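The sampling idea described in NUTCH-2463 can be illustrated with a minimal sketch. This is not the code committed to CrawlDbReader; `RecordSampler`, `sample`, and the `seed` parameter are hypothetical names chosen for the example. It assumes the simplest reading of "statistically random": each record is kept independently with probability equal to the given fraction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Nutch: Bernoulli sampling of records.
final class RecordSampler {
  private RecordSampler() {}

  /** Keeps each record independently with probability {@code fraction}. */
  static <T> List<T> sample(Iterable<T> records, double fraction, long seed) {
    // The negated range check also rejects NaN.
    if (!(fraction >= 0.0 && fraction <= 1.0)) {
      throw new IllegalArgumentException("fraction must be in [0, 1]: " + fraction);
    }
    Random random = new Random(seed);
    List<T> kept = new ArrayList<>();
    for (T record : records) {
      // nextDouble() is uniform in [0, 1), so this passes with probability `fraction`.
      if (random.nextDouble() < fraction) {
        kept.add(record);
      }
    }
    return kept;
  }
}
```

Any other filters would then be applied to the records that survive sampling, as the issue requests. The fixed seed is only there to make the sketch repeatable.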
[jira] [Commented] (NUTCH-2464) Plugin headings: Headers That Contain HTML Elements Are Not Parsed
[ https://issues.apache.org/jira/browse/NUTCH-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273291#comment-16273291 ] Hudson commented on NUTCH-2464: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3470 (See [https://builds.apache.org/job/Nutch-trunk/3470/]) Fix for NUTCH-2464 get textual content from nested items on heading (jorge-luis.betancourt: [https://github.com/apache/nutch/commit/b8580b3dd3d47c8f0157d9860f8dab8d1dc8607c]) * (edit) src/plugin/headings/src/java/org/apache/nutch/parse/headings/HeadingsParseFilter.java * (add) src/plugin/headings/src/test/org/apache/nutch/parse/headings/TestHeadingsParseFilter.java * (edit) src/plugin/headings/ivy.xml > Plugin headings: Headers That Contain HTML Elements Are Not Parsed > -- > > Key: NUTCH-2464 > URL: https://issues.apache.org/jira/browse/NUTCH-2464 > Project: Nutch > Issue Type: Bug > Components: plugin >Affects Versions: 1.13 > Environment: Internal development/test environments. >Reporter: Cass Pallansch >Assignee: Jorge Luis Betancourt Gonzalez > Fix For: 1.14 > > Attachments: NUTCH-2464-complex-header.html > > > Nutch does not appear to traverse the HTML elements that may be contained > within header elements (e.g., H1, H2, H3, etc. tags). Many times there are > anchors and/or tags within these elements that contain the actual text > nodes that should be picked up as the header value for indexing purposes. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
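The behaviour requested in NUTCH-2464, picking up text from elements nested inside a heading, can be sketched with the standard `org.w3c.dom` API. This is an illustration only and not the HeadingsParseFilter change itself; `HeadingText`, `textOf`, and `parseRoot` are hypothetical names, and the input is assumed to be well-formed XML so that the JDK parser can read it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch.
final class HeadingText {
  private HeadingText() {}

  /** Concatenates all text nodes below {@code node}, in document order. */
  static String textOf(Node node) {
    StringBuilder sb = new StringBuilder();
    collect(node, sb);
    // Collapse runs of whitespace left over from markup indentation.
    return sb.toString().replaceAll("\\s+", " ").trim();
  }

  private static void collect(Node node, StringBuilder sb) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      sb.append(node.getNodeValue());
      return;
    }
    // Recurse into anchors, spans and any other nested elements.
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collect(children.item(i), sb);
    }
  }

  /** Parses a well-formed fragment and returns its root element. */
  static Node parseRoot(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      return doc.getDocumentElement();
    } catch (Exception e) {
      throw new IllegalArgumentException("not well-formed: " + xml, e);
    }
  }
}
```

For a heading such as `<h1><a href="x">Nested <b>heading</b></a> text</h1>`, `textOf` returns the text of the anchor and its children together with the trailing text node.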
[jira] [Commented] (NUTCH-2465) Broken Eclipse project. Classpaths and interactiveselenium should be fixed.
[ https://issues.apache.org/jira/browse/NUTCH-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273292#comment-16273292 ] Hudson commented on NUTCH-2465: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3470 (See [https://builds.apache.org/job/Nutch-trunk/3470/]) fix of NUTCH-2465 broken Eclipse project. Classpaths and (semyon.semyonov: [https://github.com/apache/nutch/commit/01bdc70b52f64a0d8ee81823eb61e5854e3f6291]) * (edit) src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/InteractiveSeleniumHandler.java * (edit) build.xml * (edit) src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/HttpResponse.java * (edit) src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultHandler.java * (edit) src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefalultMultiInteractionHandler.java * (edit) src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultClickAllAjaxLinksHandler.java > Broken Eclipse project. Classpaths and interactiveselenium should be fixed. > --- > > Key: NUTCH-2465 > URL: https://issues.apache.org/jira/browse/NUTCH-2465 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Semyon Semyonov > Fix For: 1.14 > > > With the latest version of develop the Eclipse project doesn't work anymore. > There are two sets of problems: > 1) Classpath problems > 2) Incorrect usage of org.apache.nutch.protocol.interactiveselenium in the > code. Should be replaced by > org.apache.nutch.protocol.interactiveselenium.handlers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2456) Allow to index pages/URLs not contained in CrawlDb
[ https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278316#comment-16278316 ] Hudson commented on NUTCH-2456: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3471 (See [https://builds.apache.org/job/Nutch-trunk/3471/]) NUTCH-2456: Redirected documents are not indexed (snagel: [https://github.com/apache/nutch/commit/a7bc1a8c5a3a5ab9c72574afd98089a354bf0484]) * (edit) src/java/org/apache/nutch/indexer/IndexerMapReduce.java > Allow to index pages/URLs not contained in CrawlDb > -- > > Key: NUTCH-2456 > URL: https://issues.apache.org/jira/browse/NUTCH-2456 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.13 >Reporter: Yossi Tamari >Priority: Critical > Fix For: 1.14 > > > If http.redirect.max is set to a positive value, the Fetcher will follow > redirects, creating a new CrawlDatum. > If the redirected URL is fetched and parsed, during indexing for it we have a > special case: dbDatum is null. This means that in > [https://github.com/apache/nutch/blob/6199492f5e1e8811022257c88dbf63f1e1c739d0/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L259] > the document is not indexed, as it is assumed it only has inlinks (actually > it has everything but dbDatum). > I'm not sure what the correct fix is here. It seems to me the condition > should use AND instead of OR anyway, but I may not understand the original > intent. It is clear that it is too strict as is. > However, the code following that line assumes all 4 objects are not null, so > a patch would need to change more than just the condition. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2468) should filter out invalid URLs by default
[ https://issues.apache.org/jira/browse/NUTCH-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278378#comment-16278378 ] Hudson commented on NUTCH-2468: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1596 (See [https://builds.apache.org/job/Nutch-nutchgora/1596/]) NUTCH-2468 should filter out invalid URLs by default - enable plugin (snagel: [https://github.com/apache/nutch/commit/40da65bf3c55f802ae91b8d7424955450a7146ab]) * (edit) conf/nutch-default.xml > should filter out invalid URLs by default > - > > Key: NUTCH-2468 > URL: https://issues.apache.org/jira/browse/NUTCH-2468 > Project: Nutch > Issue Type: Improvement > Components: bin >Affects Versions: 1.12 >Reporter: Michael Coffey >Priority: Minor > Fix For: 2.4, 1.14 > > > Some Nutch components, by default, should reject invalid URLs. This was > recently discussed in the users mailing list and has affected my work for a > while. Although there may be some special-purpose needs to collect invalid > URLs, they are not generally useful for crawling. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2468) should filter out invalid URLs by default
[ https://issues.apache.org/jira/browse/NUTCH-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278393#comment-16278393 ] Hudson commented on NUTCH-2468: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3472 (See [https://builds.apache.org/job/Nutch-trunk/3472/]) NUTCH-2468 should filter out invalid URLs by default - enable plugin (snagel: [https://github.com/apache/nutch/commit/d8754b7f88e73949dadaa0412aedea4427207f25]) * (edit) conf/nutch-default.xml > should filter out invalid URLs by default > - > > Key: NUTCH-2468 > URL: https://issues.apache.org/jira/browse/NUTCH-2468 > Project: Nutch > Issue Type: Improvement > Components: bin >Affects Versions: 1.12 >Reporter: Michael Coffey >Priority: Minor > Fix For: 2.4, 1.14 > > > Some Nutch components, by default, should reject invalid URLs. This was > recently discussed in the users mailing list and has affected my work for a > while. Although there may be some special-purpose needs to collect invalid > URLs, they are not generally useful for crawling. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2451) protocol-ftp to resolve relative URL when following redirects
[ https://issues.apache.org/jira/browse/NUTCH-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278444#comment-16278444 ] Hudson commented on NUTCH-2451: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1597 (See [https://builds.apache.org/job/Nutch-nutchgora/1597/]) NUTCH-2451 protocol-ftp to resolve relative URL when following redirects (snagel: [https://github.com/apache/nutch/commit/fc586d4508dbd8f1f5d19fc943e3b43b9f6956ca]) * (edit) src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java > protocol-ftp to resolve relative URL when following redirects > - > > Key: NUTCH-2451 > URL: https://issues.apache.org/jira/browse/NUTCH-2451 > Project: Nutch > Issue Type: Bug > Components: protocol >Affects Versions: 1.13 > Environment: Ubuntu 16.04.3 LTS > OpenJDK 1.8.0_131 > nutch 1.14-SNAPSHOT > Synology RS816 >Reporter: Hiran Chaudhuri >Assignee: Sebastian Nagel > Fix For: 2.4, 1.14 > > > I tried running Nutch on my Synology NAS. As SMB protocol is not contained in > Nutch, I turned on FTP service on the NAS and configured Nutch to crawl > ftp://nas. > The experience gives me varying results which seem to point to problems > within Nutch. However this may need further evaluation. 
> As some files could not be downloaded and I could not see a good error > message I changed the method > org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to not > only return protocol status but send the full exception and stack trace to > the logs: > {{} catch (Exception e) { > LOG.warn("Could not get {}", url, e); > return new ProtocolOutput(null, new ProtocolStatus(e)); > } > }} > With this modification I suddenly see such messages in the logfile: > {{2017-10-25 22:09:31,865 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching > ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so > 2017-10-25 22:09:32,147 WARN org.apache.nutch.protocol.ftp.Ftp - Could not > get ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so > java.net.MalformedURLException > at java.net.URL.<init>(URL.java:627) > at java.net.URL.<init>(URL.java:490) > at java.net.URL.<init>(URL.java:439) > at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:145) > at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340) > Caused by: java.lang.NullPointerException > }} > Please note that the URL was not configured by me. Instead it was obtained by > crawling my NAS. Also the URL looks perfectly fine to me. Even if the file > did not exist I would not expect a MalformedURLException to occur. Moreover, > using Firefox and the same authentication data on the same URL retrieves the > file successfully. > How come Nutch cannot get the file? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
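The issue title refers to resolving a relative URL when following a redirect. The sketch below shows one way to do that with `java.net.URI`. It is not the change committed to Ftp.java; `RedirectResolver` and `resolve` are hypothetical names, and the explicit null check is an assumption about how a missing redirect target might be handled rather than being passed on to a URL constructor.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch.
final class RedirectResolver {
  private RedirectResolver() {}

  /**
   * Resolves a redirect target against the URL that produced it.
   * Absolute targets are returned unchanged; relative ones are
   * interpreted relative to {@code baseUrl}.
   */
  static String resolve(String baseUrl, String location) {
    if (baseUrl == null || location == null) {
      // Fail with a clear message instead of a nested NullPointerException.
      throw new IllegalArgumentException("base URL and redirect target must not be null");
    }
    return URI.create(baseUrl).resolve(location).toString();
  }
}
```

With the URL from the log as base, a relative target such as `target.so` resolves to a sibling in the same directory, and a target starting with `/` resolves against the host root.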
[jira] [Commented] (NUTCH-2469) Documents not commited to solr in Sever mode
[ https://issues.apache.org/jira/browse/NUTCH-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278445#comment-16278445 ] Hudson commented on NUTCH-2469: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1597 (See [https://builds.apache.org/job/Nutch-nutchgora/1597/]) NUTCH-2469 Documents not commited to solr in Sever mode - applied patch (snagel: [https://github.com/apache/nutch/commit/cc2f4abeb7b8326acbb00f9d10b46a092bbbe9a5]) * (edit) src/java/org/apache/nutch/indexer/IndexingJob.java > Documents not commited to solr in Sever mode > > > Key: NUTCH-2469 > URL: https://issues.apache.org/jira/browse/NUTCH-2469 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 2.3.1 >Reporter: Ninaad Joshi >Assignee: Sebastian Nagel >Priority: Blocker > Fix For: 2.4 > > Attachments: NinaadJoshi.IndexingJob.java.patch > > > I found there is a discrepancy in execution paths when running Nutch in local > standalone mode vis-à-vis server mode. > I observed, in local standalone mode, when the indexing process is done the > document along with its fields get indexed and committed in solr and is > returned if queried immediately. However, the same when done through server > mode, the document gets indexed but is not committed in solr, hence not > returned if queried immediately. When we restart solr the indexed document is > returned if queried. > I browsed through the IndexingJob.java file to understand the cause for this. 
> I found out: > # There are two different entry paths for the local standalone mode and the > server mode > ** Server mode entry point: public Map<String,Object> run(Map<String,Object> args) > ** Standalone mode entry point: > *** public int run(String[] args) > *** public void index(String batchId) > # The local standalone mode path does more work than the server mode > ** The public void index(String batchId) function initially calls the server > mode path: public Map<String,Object> run(Map<String,Object> args) > ** And then does this extra work > *** Gets IndexWriters > *** Using IndexWriters Describes > *** Using IndexWriters commits if COMMIT_INDEX=true is specified in the > configuration > *** The aforementioned extra work is not done in the server mode > I feel the execution paths for both the modes should be the same and hence > propose to: > # Move the extra work done using IndexWriters in public void index(String > batchId) to the end of the server mode execution path, i.e. the public Map<String,Object> run(Map<String,Object> args) function > # Call the public Map<String,Object> run(Map<String,Object> args) function > directly from the Standalone mode entry point: public int run(String[] args) > # public int run(String[] args) becomes redundant and can be safely removed. > I have attached the proposed patch along with this issue. Kindly go through > the same and approve. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
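The proposal above, putting the commit step on the path that both modes share, can be sketched as follows. Every name in the sketch (`Writer`, `IndexingSketch`, the `batch` key, the `committed` result entry) is a hypothetical stand-in; none of it is taken from IndexingJob.java or the attached patch. It only illustrates the shape of the change: the map-based entry point does the commit, and the command-line entry point delegates to it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an index writer; not the Nutch interface.
interface Writer {
  void commit();
}

// Hypothetical stand-in for the indexing job.
final class IndexingSketch {
  private final List<Writer> writers;
  private final boolean commitIndex;

  IndexingSketch(List<Writer> writers, boolean commitIndex) {
    this.writers = writers;
    this.commitIndex = commitIndex;
  }

  /** Server-mode entry point: index the batch, then commit on the shared path. */
  Map<String, Object> run(Map<String, Object> args) {
    Map<String, Object> results = new HashMap<>();
    // ... indexing of the batch named in args would happen here ...
    if (commitIndex) {
      for (Writer writer : writers) {
        writer.commit();
      }
    }
    results.put("committed", commitIndex);
    return results;
  }

  /** Standalone entry point: delegates, so both modes behave the same. */
  int run(String[] cliArgs) {
    Map<String, Object> args = new HashMap<>();
    if (cliArgs.length > 0) {
      args.put("batch", cliArgs[0]);
    }
    run(args);
    return 0;
  }
}
```

With this layout a caller using the map-based entry point sees committed documents without any extra step, which is the behaviour the reporter found missing in server mode.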
[jira] [Commented] (NUTCH-2451) protocol-ftp to resolve relative URL when following redirects
[ https://issues.apache.org/jira/browse/NUTCH-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278447#comment-16278447 ] Hudson commented on NUTCH-2451: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3473 (See [https://builds.apache.org/job/Nutch-trunk/3473/]) NUTCH-2451 protocol-ftp to resolve relative URL when following redirects (snagel: [https://github.com/apache/nutch/commit/5b3cf0e2028aed576d080be70fc9028796616b94]) * (edit) src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java > protocol-ftp to resolve relative URL when following redirects > - > > Key: NUTCH-2451 > URL: https://issues.apache.org/jira/browse/NUTCH-2451 > Project: Nutch > Issue Type: Bug > Components: protocol >Affects Versions: 1.13 > Environment: Ubuntu 16.04.3 LTS > OpenJDK 1.8.0_131 > nutch 1.14-SNAPSHOT > Synology RS816 >Reporter: Hiran Chaudhuri >Assignee: Sebastian Nagel > Fix For: 2.4, 1.14 > > > I tried running Nutch on my Synology NAS. As SMB protocol is not contained in > Nutch, I turned on FTP service on the NAS and configured Nutch to crawl > ftp://nas. > The experience gives me varying results which seem to point to problems > within Nutch. However this may need further evaluation. 
> As some files could not be downloaded and I could not see a good error > message I changed the method > org.apache.nutch.protocol.ftp.FTP.getProtocolOutput(Text, CrawlDatum) to not > only return protocol status but send the full exception and stack trace to > the logs: > {{} catch (Exception e) { > LOG.warn("Could not get {}", url, e); > return new ProtocolOutput(null, new ProtocolStatus(e)); > } > }} > With this modification I suddenly see such messages in the logfile: > {{2017-10-25 22:09:31,865 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching > ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so > 2017-10-25 22:09:32,147 WARN org.apache.nutch.protocol.ftp.Ftp - Could not > get ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so > java.net.MalformedURLException > at java.net.URL.(URL.java:627) > at java.net.URL.(URL.java:490) > at java.net.URL.(URL.java:439) > at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:145) > at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340) > Caused by: java.lang.NullPointerException > }} > Please mind the URL was not configured from me. Instead it was obtained by > crawling my NAS. Also the URL looks perfectly fine to me. Even if the file > did not exist I would not expect a MalformedURLException to occur. Even more, > using Firefox and the same authentication data on the same URL retrieves the > file successfully. > How come Nutch cannot get the file? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2470) CrawlDbReader -stats to show quantiles of score
[ https://issues.apache.org/jira/browse/NUTCH-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278496#comment-16278496 ] Hudson commented on NUTCH-2470: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3474 (See [https://builds.apache.org/job/Nutch-trunk/3474/]) NUTCH-2470 CrawlDbReader -stats to show quantiles of score - improve (snagel: [https://github.com/apache/nutch/commit/08c2fb9d024741425f57537c18dc706b1f861bdc]) * (edit) ivy/ivy.xml * (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java > CrawlDbReader -stats to show quantiles of score > --- > > Key: NUTCH-2470 > URL: https://issues.apache.org/jira/browse/NUTCH-2470 > Project: Nutch > Issue Type: Improvement > Components: crawldb >Affects Versions: 1.13 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.14 > > > The command "readdb -stats" shows the min., max. and average of the CrawlDatum > score. Median and quartiles (quantiles, in general) would complete the > statistics and give an impression of how scores are distributed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
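What median and quartiles add to min., max. and average can be shown with a small sketch. It computes exact quantiles by sorting an in-memory array with the nearest-rank definition, which is an assumption made for illustration; it is not how CrawlDbReader computes its statistics, and exact sorting would not be practical for a CrawlDb with billions of records. `ScoreQuantiles` and `quantile` are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Nutch.
final class ScoreQuantiles {
  private ScoreQuantiles() {}

  /** Nearest-rank quantile: the smallest value with at least q * n values at or below it. */
  static float quantile(float[] scores, double q) {
    if (scores.length == 0) {
      throw new IllegalArgumentException("no scores");
    }
    if (!(q >= 0.0 && q <= 1.0)) {
      throw new IllegalArgumentException("q must be in [0, 1]: " + q);
    }
    float[] sorted = scores.clone(); // leave the caller's array untouched
    Arrays.sort(sorted);
    int rank = (int) Math.ceil(q * sorted.length);
    // Rank 0 (q = 0) maps to the minimum.
    return sorted[Math.max(rank, 1) - 1];
  }
}
```

Calling `quantile` with 0.25, 0.5 and 0.75 gives the quartiles and the median; 0 and 1 reproduce the minimum and maximum already reported.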
[jira] [Commented] (NUTCH-2399) indexer-elastic does not index multi-value fields (only the first value is indexed)
[ https://issues.apache.org/jira/browse/NUTCH-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280135#comment-16280135 ] Hudson commented on NUTCH-2399: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3475 (See [https://builds.apache.org/job/Nutch-trunk/3475/]) NUTCH-2399 Add support for multivalue fields on indexer-elastic (jorge-luis.betancourt: [https://github.com/apache/nutch/commit/106a215cbd430a13e29ee590e948e198abf6445c]) * (edit) src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java > indexer-elastic does not index multi-value fields (only the first value is > indexed) > --- > > Key: NUTCH-2399 > URL: https://issues.apache.org/jira/browse/NUTCH-2399 > Project: Nutch > Issue Type: Bug > Components: indexer >Reporter: Yossi Tamari > Fix For: 1.14 > > > Currently, if there is a NutchField with multiple values, only the first > value is indexed (because this is what doc.getFieldValue returns). Pull > request #200 checks if the NutchField has multiple values, and if so, they > are added as an array (multivalue) field. > [https://github.com/apache/nutch/pull/200] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
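The rule described in NUTCH-2399, a single value stays a scalar while several values become an array, can be sketched with plain collections. This is not the code from pull request #200; `MultiValueSource` and `toSource` are hypothetical names, and a `Map<String, List<Object>>` stands in for the document's fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch.
final class MultiValueSource {
  private MultiValueSource() {}

  /** Builds an index source map: one value stays scalar, several become a list. */
  static Map<String, Object> toSource(Map<String, List<Object>> fields) {
    Map<String, Object> source = new LinkedHashMap<>();
    for (Map.Entry<String, List<Object>> entry : fields.entrySet()) {
      List<Object> values = entry.getValue();
      if (values == null || values.isEmpty()) {
        continue; // nothing to index for this field
      }
      if (values.size() > 1) {
        // Keep every value instead of only the first one.
        source.put(entry.getKey(), new ArrayList<>(values));
      } else {
        source.put(entry.getKey(), values.get(0));
      }
    }
    return source;
  }
}
```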
[jira] [Commented] (NUTCH-2438) Upgrade Nutch 2.X to Gora 0.8
[ https://issues.apache.org/jira/browse/NUTCH-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16289976#comment-16289976 ] Hudson commented on NUTCH-2438: --- SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1598 (See [https://builds.apache.org/job/Nutch-nutchgora/1598/]) fix for NUTCH-2438 contributed by tmzzngl (tulay.muezzinoglu: [https://github.com/apache/nutch/commit/95417aa9fbebcfdd516930dc4aa370b0a343994c]) * (edit) build.xml * (edit) ivy/ivy.xml NUTCH-2438 Upgrade Nutch 2.X to Gora 0.8 (lewis.mcgibbney: [https://github.com/apache/nutch/commit/5370102135d91e1054adf13c0345159873b4a2ef]) * (edit) src/java/org/apache/nutch/storage/WebPage.java * (edit) src/java/org/apache/nutch/storage/Host.java * (edit) ivy/ivy.xml > Upgrade Nutch 2.X to Gora 0.8 > - > > Key: NUTCH-2438 > URL: https://issues.apache.org/jira/browse/NUTCH-2438 > Project: Nutch > Issue Type: Improvement > Components: storage >Affects Versions: 2.4 >Reporter: Tulay Muezzinoglu >Assignee: Lewis John McGibbney > Fix For: 2.4 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2474) CrawlDbReader -stats fails with ClassCastException
[ https://issues.apache.org/jira/browse/NUTCH-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16291049#comment-16291049 ] Hudson commented on NUTCH-2474: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3477 (See [https://builds.apache.org/job/Nutch-trunk/3477/]) NUTCH-2474 CrawlDbReader -stats fails with ClassCastException - replace (snagel: [https://github.com/apache/nutch/commit/12e14ac0604298e09672287ba20ccb13a56d4fd7]) * (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java > CrawlDbReader -stats fails with ClassCastException > -- > > Key: NUTCH-2474 > URL: https://issues.apache.org/jira/browse/NUTCH-2474 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 1.14 > Environment: Java 8, distributed mode: Hadoop CDH 5.13.0 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Critical > Fix For: 1.14 > > > In distributed mode CrawlDbReader / readdb -stats fails with a > ClassCastException in the combiner: > {noformat} > 17/12/08 04:57:13 INFO mapreduce.Job: Task Id : > attempt_1512553291624_0022_m_39_0, Status : FAILED > Error: java.lang.ClassCastException: org.apache.hadoop.io.FloatWritable > cannot be cast to org.apache.hadoop.io.LongWritable > at > org.apache.nutch.crawl.CrawlDbReader$CrawlDbStatCombiner.reduce(CrawlDbReader.java:296) > at > org.apache.nutch.crawl.CrawlDbReader$CrawlDbStatCombiner.reduce(CrawlDbReader.java:222) > at > org.apache.hadoop.mapred.Task$OldCombinerRunner.combine(Task.java:1639) > at > org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1946) > at > org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1514) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:466) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) > {noformat} > FloatWritables are used since NUTCH-2470, so that's when this bug was > introduced. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2035) Regex filter using case sensitive rules.
[ https://issues.apache.org/jira/browse/NUTCH-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292802#comment-16292802 ] Hudson commented on NUTCH-2035: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3478 (See [https://builds.apache.org/job/Nutch-trunk/3478/]) NUTCH-2035 urlfilter-regex case insensitive rules (snagel: [https://github.com/apache/nutch/commit/df14c8a0a19e4f670d75ecd7ae2a22c3d8eeb0b6]) * (edit) conf/regex-urlfilter.txt.template > Regex filter using case sensitive rules. > > > Key: NUTCH-2035 > URL: https://issues.apache.org/jira/browse/NUTCH-2035 > Project: Nutch > Issue Type: Improvement > Components: plugin >Affects Versions: 1.10 >Reporter: Luis Lopez >Assignee: Sebastian Nagel >Priority: Minor > Labels: filters, regex, regex-urlfilter > Fix For: 2.4, 1.14 > > Attachments: regex-urlfilter.txt > > > Regex expressions are computationally expensive and having “EXE|exe|JPG|jpg” > etc. adds up if we use complex rules. > Regex filter should use case insensitive rules to make the rules more > readable and improve performance. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
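The point of NUTCH-2035 can be illustrated with `java.util.regex`: the embedded flag `(?i)` makes a pattern case-insensitive, so each suffix needs to be listed only once. The sketch is an assumption-laden example and does not reproduce the rules in regex-urlfilter.txt.template; `SuffixFilter`, `isBlocked`, and the two suffixes are chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
final class SuffixFilter {
  // (?i) turns on case-insensitive matching, replacing "EXE|exe|JPG|jpg".
  private static final Pattern BLOCKED = Pattern.compile("(?i)\\.(exe|jpg)$");

  private SuffixFilter() {}

  /** Returns true if the URL ends with one of the blocked suffixes, in any case. */
  static boolean isBlocked(String url) {
    return BLOCKED.matcher(url).find();
  }
}
```

Mixed-case suffixes such as `.Jpg` are covered as well, which the explicit upper/lower alternation would miss.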