[jira] [Commented] (NUTCH-2305) generate.min.score doesn't work in 2.x

2016-08-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431671#comment-15431671
 ] 

Hudson commented on NUTCH-2305:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1568 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1568/])
Fix for NUTCH-2305 generate.min.score doesn't work, contributed by (snagel: rev 
5c3a381289f158f69b4f7ebe7b059cd7d9ba7638)
* (edit) src/java/org/apache/nutch/crawl/GeneratorMapper.java
* (edit) conf/nutch-default.xml


> generate.min.score doesn't work in 2.x
> --
>
> Key: NUTCH-2305
> URL: https://issues.apache.org/jira/browse/NUTCH-2305
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Reporter: Kiyonari Harigae
> Assignee: Sebastian Nagel
> Fix For: 2.4
>
> Attachments: NUTCH-2305.patch
>
>
> "generate.min.score" is defined in GeneratorJob, but it has no effect
> even when it is set in nutch-site.xml.
> "generate.min.score" is also needed in 2.x.
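
A minimal sketch of the kind of check this property implies, assuming that pages scoring below the threshold are skipped and that an unset threshold (modelled here as NaN) disables the filter. The class and method names are hypothetical; they are not taken from GeneratorMapper or from the attached patch.

```java
// Hypothetical illustration only: names are not part of Nutch's API.
final class MinScoreFilter {
    private MinScoreFilter() {}

    /**
     * Returns true if a page with the given score should be selected.
     * A NaN threshold is treated as "no threshold configured".
     */
    static boolean accept(float score, float minScore) {
        // Float.isNaN is needed here: (minScore != Float.NaN) is always true.
        if (Float.isNaN(minScore)) {
            return true;
        }
        return score >= minScore;
    }

    public static void main(String[] args) {
        float minScore = 0.5f; // assumed value of generate.min.score
        System.out.println(accept(0.7f, minScore));  // score above the threshold
        System.out.println(accept(0.2f, minScore));  // score below the threshold
        System.out.println(accept(0.2f, Float.NaN)); // no threshold configured
    }
}
```

The point of the sketch is that the threshold only has an effect if the component that selects URLs actually reads it and applies a comparison like this one.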



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2302) RAMConfManager Could Be Constructed With Custom Configuration

2016-08-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432095#comment-15432095
 ] 

Hudson commented on NUTCH-2302:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1569 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1569/])
NUTCH-2302 RAMConfManager Could Be Constructed With Custom Configuration 
(furkankamaci: rev fd722c896468fe047758891d75a58259c88289d8)
* (edit) src/java/org/apache/nutch/api/impl/RAMConfManager.java


> RAMConfManager Could Be Constructed With Custom Configuration 
> --
>
> Key: NUTCH-2302
> URL: https://issues.apache.org/jira/browse/NUTCH-2302
> Project: Nutch
> Issue Type: Improvement
> Components: REST_api, web gui
> Reporter: Furkan KAMACI
> Assignee: Furkan KAMACI
> Fix For: 2.4
>
>
> RAMConfManager is intended to hold different configurations, each of which is
> accessible via a configuration id. However, it forces you to use a default
> configuration with a default id when you construct it. Classes that use
> RAMConfManager therefore cannot set a custom configuration, which causes
> problems. For example, test resources cannot be used when testing NutchServer,
> because it uses the default configuration forced by RAMConfManager.
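
The idea can be sketched as a manager with two constructors: one that keeps the default behaviour and one that accepts a caller-supplied configuration. This is an illustration under assumptions, not the code of RAMConfManager: the class name, the "default" id, and the use of plain string maps in place of Hadoop Configuration objects are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a configuration manager keyed by configuration id.
final class SketchConfManager {
    static final String DEFAULT_ID = "default"; // assumed default id

    private final Map<String, Map<String, String>> configs = new ConcurrentHashMap<>();

    /** Old behaviour: an empty default configuration under the default id. */
    SketchConfManager() {
        this(DEFAULT_ID, new HashMap<>());
    }

    /** New behaviour: the caller (for example a test) supplies id and configuration. */
    SketchConfManager(String confId, Map<String, String> initial) {
        configs.put(confId, new HashMap<>(initial)); // defensive copy
    }

    /** Returns true if a configuration is registered under the given id. */
    boolean has(String confId) {
        return configs.containsKey(confId);
    }

    /** Returns the property value, or null if the id or the property is unknown. */
    String get(String confId, String property) {
        Map<String, String> conf = configs.get(confId);
        return conf == null ? null : conf.get(property);
    }

    public static void main(String[] args) {
        Map<String, String> testConf = new HashMap<>();
        testConf.put("some.property", "test-value"); // illustrative property name
        SketchConfManager manager = new SketchConfManager("test", testConf);
        System.out.println(manager.get("test", "some.property"));
        System.out.println(manager.has(DEFAULT_ID));
    }
}
```

With the second constructor, a test can build the manager from its own resources instead of inheriting whatever the default configuration contains.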





[jira] [Commented] (NUTCH-2246) Refactor /seed endpoint for backward compatibility

2016-08-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432400#comment-15432400
 ] 

Hudson commented on NUTCH-2246:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3392 (See 
[https://builds.apache.org/job/Nutch-trunk/3392/])
Remove NUTCH-2246 from the 1.12 section of CHANGES.txt (fixed in 1.13) (snagel: 
rev 78e99092c6d1308e054f9a20e50b7a6eb6206784)
* (edit) CHANGES.txt


> Refactor /seed endpoint for backward compatibility
> --
>
> Key: NUTCH-2246
> URL: https://issues.apache.org/jira/browse/NUTCH-2246
> Project: Nutch
> Issue Type: Sub-task
> Components: REST_api
> Affects Versions: 1.12
> Reporter: Sujen Shah
> Assignee: Sujen Shah
> Priority: Minor
> Labels: memex
> Fix For: 1.13
>
>
> Currently the seed endpoint allows you to create a seed list by providing a
> list of URLs passed as an argument.
> After the first refactor (https://issues.apache.org/jira/browse/NUTCH-2090),
> users can no longer provide a physical path to the seed list.
> Nutch should give both options to the user.
> Additionally, once a seed list is created by providing a list of URLs (not a
> physical file), Nutch should store it like it does for the configurations.





[jira] [Commented] (NUTCH-2242) lastModified not always set

2016-08-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432469#comment-15432469
 ] 

Hudson commented on NUTCH-2242:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3393 (See 
[https://builds.apache.org/job/Nutch-trunk/3393/])
NUTCH-2164 NUTCH-2242 Inconsistent 'Modified Time' in crawl db / (snagel: rev 
70622c3e18cee879f5a38d895f68dd0be69461e1)
* (edit) src/java/org/apache/nutch/crawl/DefaultFetchSchedule.java
* (edit) src/java/org/apache/nutch/protocol/ProtocolOutput.java
* (edit) src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java
* (edit) src/test/org/apache/nutch/crawl/TestCrawlDbStates.java


> lastModified not always set
> ---
>
> Key: NUTCH-2242
> URL: https://issues.apache.org/jira/browse/NUTCH-2242
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 1.11
> Reporter: Jurian Broertjes
> Priority: Minor
> Fix For: 1.13
>
> Attachments: NUTCH-2242.patch
>
>
> I observed two issues:
> - When using the DefaultFetchSchedule, CrawlDatum's modifiedTime field is not 
> updated on the first successful fetch. 
> - When a document modification is detected (protocol- or signature-wise), the 
> modifiedTime isn't updated
> I can provide a patch later today.
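
One possible policy that covers both observations, sketched under assumptions: the modified time is set on the first successful fetch and whenever a modification is detected, preferring a timestamp reported by the protocol over the fetch time. The class, method, and parameter names are hypothetical, and this is not the logic of the committed change to DefaultFetchSchedule.

```java
// Hypothetical illustration only: not Nutch's FetchSchedule API.
final class ModifiedTimeSketch {
    private ModifiedTimeSketch() {}

    /**
     * Chooses the modified time to store after a successful fetch.
     *
     * @param storedModifiedTime   value currently stored, 0 if never set
     * @param protocolLastModified value reported by the protocol, 0 if absent
     * @param fetchTime            time of the current fetch
     * @param contentChanged       whether a modification was detected
     */
    static long resolve(long storedModifiedTime, long protocolLastModified,
                        long fetchTime, boolean contentChanged) {
        boolean firstFetch = storedModifiedTime <= 0;
        if (!firstFetch && !contentChanged) {
            return storedModifiedTime; // nothing new to record
        }
        // Prefer the protocol's timestamp; fall back to the fetch time.
        return protocolLastModified > 0 ? protocolLastModified : fetchTime;
    }

    public static void main(String[] args) {
        System.out.println(resolve(0L, 0L, 1000L, false));    // first fetch
        System.out.println(resolve(500L, 800L, 1000L, true)); // change detected
    }
}
```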





[jira] [Commented] (NUTCH-2164) Inconsistent 'Modified Time' in crawl db

2016-08-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432468#comment-15432468
 ] 

Hudson commented on NUTCH-2164:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3393 (See 
[https://builds.apache.org/job/Nutch-trunk/3393/])
NUTCH-2164 NUTCH-2242 Inconsistent 'Modified Time' in crawl db / (snagel: rev 
70622c3e18cee879f5a38d895f68dd0be69461e1)
* (edit) src/java/org/apache/nutch/crawl/DefaultFetchSchedule.java
* (edit) src/java/org/apache/nutch/protocol/ProtocolOutput.java
* (edit) src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java
* (edit) src/test/org/apache/nutch/crawl/TestCrawlDbStates.java


> Inconsistent 'Modified Time' in crawl db
> 
>
> Key: NUTCH-2164
> URL: https://issues.apache.org/jira/browse/NUTCH-2164
> Project: Nutch
> Issue Type: Improvement
> Components: crawldb, fetcher
> Affects Versions: 1.11
> Reporter: Thamme Gowda
> Priority: Minor
> Fix For: 1.13
>
>
> The 'Modified time' in the crawldb is invalid. It is set to
> (0 - timezone difference).
> *How to verify/reproduce:*
>   Run 'nutch readdb /path/to/crawldb -dump yy' and then inspect the content of
> 'yy'.
> The following improvements can be made:
> 1. Set the modified time in DefaultFetchSchedule
> 2. Set ProtocolStatus.lastModified if the modified time is available in the
> protocol response headers
> This issue is also discussed on the dev mailing list:
> http://www.mail-archive.com/dev@nutch.apache.org/msg19803.html#





[jira] [Commented] (NUTCH-2303) NutchServer Could Be Able To Select a Configuration to Use

2016-08-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433176#comment-15433176
 ] 

Hudson commented on NUTCH-2303:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1570 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1570/])
NUTCH-2303 NutchServer Could Be Able To Select a Configuration to Use 
(furkankamaci: rev 6227f3b171b67e790a089d6fee4d3c65de0e0ee1)
* (edit) src/java/org/apache/nutch/api/NutchServer.java
* (edit) src/java/org/apache/nutch/api/security/SecurityUtil.java


> NutchServer Could Be Able To Select a Configuration to Use
> --
>
> Key: NUTCH-2303
> URL: https://issues.apache.org/jira/browse/NUTCH-2303
> Project: Nutch
> Issue Type: Improvement
> Components: REST_api, web gui
> Reporter: Furkan KAMACI
> Assignee: Furkan KAMACI
> Fix For: 2.4
>
>
> RAMConfManager is intended to hold different configurations. However,
> NutchServer currently uses the default configuration. It should be possible
> to set an active configuration id when starting up a NutchServer.





[jira] [Commented] (NUTCH-2306) Id of Active Configuration Could Be Stored at NutchStatus and Exposed via REST API

2016-08-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433178#comment-15433178
 ] 

Hudson commented on NUTCH-2306:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1570 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1570/])
NUTCH-2306 Id of Active Configuration Could Be Stored at NutchStatus and 
(furkankamaci: rev ed96b104ddf82bcb20557a29b251c3fd73eb146a)
* (edit) src/java/org/apache/nutch/api/model/response/NutchStatus.java
* (edit) src/java/org/apache/nutch/api/resources/AdminResource.java
* (edit) src/java/org/apache/nutch/api/resources/AbstractResource.java


> Id of Active Configuration Could Be Stored at NutchStatus and Exposed via 
> REST API
> --
>
> Key: NUTCH-2306
> URL: https://issues.apache.org/jira/browse/NUTCH-2306
> Project: Nutch
> Issue Type: Improvement
> Components: REST_api, web gui
> Reporter: Furkan KAMACI
> Assignee: Furkan KAMACI
> Fix For: 2.4
>
>
> NutchStatus holds information about the configuration it uses. However, it
> should also store the id of that configuration. Once NUTCH-2302 and NUTCH-2303
> are merged, we will be able to store the active configuration id and expose
> this information via the REST API.





[jira] [Commented] (NUTCH-2301) Create Tests for Security Layer of NutchServer

2016-08-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433679#comment-15433679
 ] 

Hudson commented on NUTCH-2301:
---

FAILURE: Integrated in Jenkins build Nutch-nutchgora #1571 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1571/])
NUTCH-2301 Tests for Security Layer of NutchServer Are Created (furkankamaci: 
rev 3bc3d81e964aac59f61951740e848bd429a15b3c)
* (add) src/test/org/apache/nutch/api/TestNutchAPI.java
* (edit) src/test/nutch-site.xml
* (add) src/test/nutch-ssl.keystore.jks
* (add) src/test/org/apache/nutch/api/AbstractNutchAPITestBase.java
* (delete) src/test/org/apache/nutch/api/TestAPI.java


> Create Tests for Security Layer of NutchServer
> --
>
> Key: NUTCH-2301
> URL: https://issues.apache.org/jira/browse/NUTCH-2301
> Project: Nutch
> Issue Type: Sub-task
> Components: REST_api, web gui
> Reporter: Furkan KAMACI
> Assignee: Furkan KAMACI
> Fix For: 2.4
>
>
> Create tests for security layer of NutchServer.





[jira] [Commented] (NUTCH-2308) Implement SSL Connection Test at TestNutchAPI

2016-08-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15449662#comment-15449662
 ] 

Hudson commented on NUTCH-2308:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1572 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1572/])
NUTCH-2308 SSL connection test at TestNutchAPI is implemented. (furkankamaci: 
rev 75d846cf3998faeffa6edf5a7d7fec2d22c8d4d9)
* (edit) src/java/org/apache/nutch/api/NutchServer.java
* (add) src/test/testTrustKeyStore
* (edit) src/test/org/apache/nutch/api/AbstractNutchAPITestBase.java
* (add) src/test/nutch.cer
* (edit) src/java/org/apache/nutch/api/resources/SeedResource.java
* (edit) src/java/org/apache/nutch/api/resources/DbResource.java
* (edit) src/java/org/apache/nutch/api/resources/ConfigResource.java
* (edit) src/java/org/apache/nutch/api/resources/AdminResource.java
* (edit) conf/nutch-default.xml
* (add) src/java/org/apache/nutch/api/security/SecurityUtils.java
* (delete) src/java/org/apache/nutch/api/security/SecurityUtil.java
* (edit) src/test/nutch-ssl.keystore.jks
* (edit) src/test/nutch-site.xml
* (edit) src/test/org/apache/nutch/api/TestNutchAPI.java
* (edit) src/java/org/apache/nutch/api/resources/JobResource.java


> Implement SSL Connection Test at TestNutchAPI
> -
>
> Key: NUTCH-2308
> URL: https://issues.apache.org/jira/browse/NUTCH-2308
> Project: Nutch
> Issue Type: Improvement
> Components: REST_api, web gui
> Reporter: Furkan KAMACI
> Assignee: Furkan KAMACI
> Fix For: 2.4
>
>
> Currently, testing of SSL is ignored at TestNutchAPI. We should complete the 
> implementation.





[jira] [Commented] (NUTCH-2264) Check Forbidden APIs at Build

2016-08-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15449661#comment-15449661
 ] 

Hudson commented on NUTCH-2264:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1572 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1572/])
NUTCH-2264 Forbidden APIs are Checked at Build (furkankamaci: rev 
a671540a94d8afafd72a09396c97d9ede43a7ea2)
* (edit) src/java/org/apache/nutch/net/URLFilterChecker.java
* (edit) 
src/plugin/subcollection/src/test/org/apache/nutch/collection/TestSubcollection.java
* (edit) build.xml
* (edit) src/test/org/apache/nutch/plugin/TestPluginSystem.java
* (edit) 
src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java
* (edit) src/java/org/apache/nutch/crawl/DbUpdaterJob.java
* (edit) src/java/org/apache/nutch/host/HostDbUpdateReducer.java
* (edit) 
src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/DOMContentUtilsTest.java
* (edit) src/java/org/apache/nutch/util/Bytes.java
* (edit) src/java/org/apache/nutch/host/HostInjectorJob.java
* (edit) 
src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
* (edit) 
src/plugin/index-anchor/src/java/org/apache/nutch/indexer/anchor/AnchorIndexingFilter.java
* (edit) 
src/plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java
* (edit) ivy/ivy.xml
* (edit) 
src/plugin/parse-metatags/src/java/org/apache/nutch/parse/metatags/MetaTagsParser.java
* (edit) src/java/org/apache/nutch/parse/ParseUtil.java
* (edit) src/java/org/apache/nutch/util/URLUtil.java
* (edit) src/java/org/apache/nutch/tools/Benchmark.java
* (edit) src/java/org/apache/nutch/tools/arc/ArcRecordReader.java
* (edit) 
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* (edit) 
src/plugin/protocol-sftp/src/java/org/apache/nutch/protocol/sftp/Sftp.java
* (edit) src/java/org/apache/nutch/fetcher/FetcherReducer.java
* (edit) src/java/org/apache/nutch/parse/ParserJob.java
* (edit) src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java
* (edit) 
src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java
* (edit) 
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpBasicAuthentication.java
* (edit) 
src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
* (edit) 
src/plugin/microformats-reltag/src/java/org/apache/nutch/microformats/reltag/RelTagParser.java
* (edit) 
src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java
* (edit) 
src/plugin/microformats-reltag/src/test/org/apache/nutch/microformats/reltag/TestRelTagParser.java
* (edit) 
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
* (edit) src/java/org/apache/nutch/api/impl/JobWorker.java
* (edit) src/java/org/apache/nutch/api/resources/AdminResource.java
* (edit) 
src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpRobotRulesParser.java
* (edit) 
src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java
* (edit) 
src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
* (edit) 
src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Client.java
* (edit) 
src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestRobotsMetaProcessor.java
* (edit) src/java/org/apache/nutch/util/domain/DomainStatistics.java
* (edit) 
src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
* (edit) src/java/org/apache/nutch/tools/proxy/FakeHandler.java
* (edit) src/java/org/apache/nutch/tools/ResolveUrls.java
* (edit) src/java/org/apache/nutch/protocol/RobotRulesParser.java
* (edit) 
src/plugin/parse-metatags/src/test/org/apache/nutch/parse/metatags/TestMetaTagsParser.java
* (edit) 
src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java
* (edit) 
src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpResponse.java
* (edit) 
src/plugin/index-html/src/java/org/apache/nutch/indexer/html/HtmlIndexingFilter.java
* (edit) src/java/org/apache/nutch/fetcher/FetcherJob.java
* (edit) src/java/org/apache/nutch/protocol/Content.java
* (edit) 
src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
* (edit) 
src/plugin/creativecommons/src/java/org/creativecommons/nutch/CCParseFilter.java
* (edit) 
src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/HTMLMetaProcessor.java
* (edit) 
src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestImageMetadata.java
* (edit) src/test/org/apache/nutch/parse/TestSitemapParser.java
* (edit) src/java/org/apache/nutch/tools/DmozParser.java
* (edit) src/java/org/apache/nutch/webui/client/impl/RemoteCommand.java
* (edit) 
src/plugin/language-identifier/src/test/org/apache/nutch/analysis/lang/T

[jira] [Commented] (NUTCH-2122) Implement Javadoc package-info.java for webui packages

2016-08-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15450364#comment-15450364
 ] 

Hudson commented on NUTCH-2122:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1573 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1573/])
NUTCH-2122 Missing package-info.java classes for webui packages are 
(furkankamaci: rev 8ad3e44a37fb1146d135cbfaeabff118f573afce)
* (add) src/java/org/apache/nutch/webui/client/impl/package-info.java
* (add) src/java/org/apache/nutch/webui/client/model/package-info.java
* (add) src/java/org/apache/nutch/webui/model/package-info.java
* (add) src/java/org/apache/nutch/webui/pages/menu/package-info.java
* (add) src/java/org/apache/nutch/webui/service/package-info.java
* (add) src/java/org/apache/nutch/webui/config/package-info.java
* (add) src/java/org/apache/nutch/webui/pages/crawls/package-info.java
* (add) src/java/org/apache/nutch/webui/pages/seed/package-info.java
* (add) src/java/org/apache/nutch/webui/service/impl/package-info.java
* (add) src/java/org/apache/nutch/webui/pages/assets/package-info.java
* (add) src/java/org/apache/nutch/webui/pages/instances/package-info.java
* (add) src/java/org/apache/nutch/webui/pages/components/package-info.java
* (add) src/java/org/apache/nutch/webui/pages/settings/package-info.java
* (add) src/java/org/apache/nutch/webui/pages/package-info.java
* (add) src/java/org/apache/nutch/webui/package-info.java
* (add) src/java/org/apache/nutch/webui/client/package-info.java


> Implement Javadoc package-info.java for webui packages
> --
>
> Key: NUTCH-2122
> URL: https://issues.apache.org/jira/browse/NUTCH-2122
> Project: Nutch
> Issue Type: Improvement
> Components: nutch server
> Affects Versions: 1.10
> Reporter: Lewis John McGibbney
> Assignee: Furkan KAMACI
> Priority: Trivial
> Fix For: 2.4
>
>
> [~sujenshah] I noticed that the Javadoc does not contain package.html 
> displaying package level introductory Javadoc as every other package does.
> http://nutch.apache.org/apidocs/apidocs-1.10/index.html





[jira] [Commented] (NUTCH-2314) Use indexer-elastic2 Plugin for javadoc and eclipse Targets

2016-09-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15460500#comment-15460500
 ] 

Hudson commented on NUTCH-2314:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1574 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1574/])
NUTCH-2314 indexer-elastic2 plugin is used for javadoc and eclipse 
(furkankamaci: rev 7dcc5fa69f3edd431b47d127048fd9f97b442fa6)
* (edit) build.xml
* (edit) default.properties


> Use indexer-elastic2 Plugin for javadoc and eclipse Targets
> ---
>
> Key: NUTCH-2314
> URL: https://issues.apache.org/jira/browse/NUTCH-2314
> Project: Nutch
> Issue Type: Bug
> Components: plugin
> Reporter: Furkan KAMACI
> Assignee: Furkan KAMACI
> Fix For: 2.4
>
>
> The indexer-elastic2 plugin is used in the deploy and clean tasks of
> plugin/build.xml. However, the javadoc and eclipse tasks in build.xml use
> indexer-elastic instead of indexer-elastic2, which results in an error.





[jira] [Commented] (NUTCH-2089) Move Nutch 2.x to compile on JDK 8

2016-09-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468328#comment-15468328
 ] 

Hudson commented on NUTCH-2089:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1575 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1575/])
NUTCH-2089 Nutch 2.x is moved to compile on JDK 8 (furkankamaci: rev 
0ea78907dee6b07058b66a99e395aea8cf623e92)
* (edit) src/java/org/apache/nutch/plugin/PluginRepository.java
* (edit) 
src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java
* (edit) 
src/plugin/urlfilter-domain/src/java/org/apache/nutch/urlfilter/domain/DomainURLFilter.java
* (edit) 
src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java
* (edit) src/java/org/apache/nutch/indexer/IndexUtil.java
* (edit) src/java/org/apache/nutch/net/URLNormalizers.java
* (edit) src/java/org/apache/nutch/util/domain/TopLevelDomain.java
* (edit) 
src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/XMLCharacterRecognizer.java
* (edit) 
src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestRSSParser.java
* (edit) src/java/org/apache/nutch/crawl/GeneratorJob.java
* (edit) 
src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.java
* (edit) src/java/org/apache/nutch/scoring/ScoringFilter.java
* (edit) src/java/org/apache/nutch/util/Bytes.java
* (edit) 
src/plugin/index-anchor/src/java/org/apache/nutch/indexer/anchor/AnchorIndexingFilter.java
* (edit) 
src/plugin/parse-html/src/java/org/apache/nutch/parse/html/XMLCharacterRecognizer.java
* (edit) 
src/plugin/urlfilter-prefix/src/java/org/apache/nutch/urlfilter/prefix/PrefixURLFilter.java
* (edit) 
src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMBuilder.java
* (edit) 
src/plugin/urlfilter-validator/src/java/org/apache/nutch/urlfilter/validator/UrlValidator.java
* (edit) src/java/org/apache/nutch/util/MimeUtil.java
* (edit) src/test/org/apache/nutch/crawl/TestGenerator.java
* (edit) src/java/org/apache/nutch/util/NutchTool.java
* (edit) 
src/plugin/urlfilter-domain/src/java/org/apache/nutch/urlfilter/domain/package-info.java
* (edit) src/java/org/apache/nutch/parse/NutchSitemapParse.java
* (edit) 
src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
* (edit) src/java/org/apache/nutch/util/TrieStringMatcher.java
* (edit) 
src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java
* (edit) 
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/DummySSLProtocolSocketFactory.java
* (edit) src/java/org/apache/nutch/util/EncodingDetector.java
* (edit) 
src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java
* (edit) src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java
* (edit) src/java/org/apache/nutch/crawl/FetchSchedule.java
* (edit) src/java/org/apache/nutch/parse/ParsePluginsReader.java
* (edit) src/java/org/apache/nutch/fetcher/FetcherJob.java
* (edit) 
src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/OPICScoringFilter.java
* (edit) 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
* (edit) src/java/org/apache/nutch/parse/Parser.java
* (edit) src/plugin/feed/src/java/org/apache/nutch/parse/feed/FeedParser.java
* (edit) src/java/org/apache/nutch/util/NutchJob.java
* (edit) src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java
* (edit) src/test/org/apache/nutch/fetcher/TestFetcher.java
* (edit) src/java/org/apache/nutch/util/URLUtil.java
* (edit) src/java/org/apache/nutch/util/domain/DomainSuffix.java
* (edit) src/plugin/parse-swf/src/java/org/apache/nutch/parse/swf/SWFParser.java
* (edit) 
src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Client.java
* (edit) src/java/org/apache/nutch/util/SuffixStringMatcher.java
* (edit) 
src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java
* (edit) 
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpBasicAuthentication.java
* (edit) src/java/org/apache/nutch/api/impl/RAMConfManager.java
* (edit) src/java/org/apache/nutch/util/TableUtil.java
* (edit) 
src/plugin/protocol-file/src/test/org/apache/nutch/protocol/file/TestProtocolFile.java
* (edit) 
src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMBuilder.java
* (edit) src/java/org/apache/nutch/crawl/SignatureFactory.java
* (edit) 
src/plugin/scoring-link/src/java/org/apache/nutch/scoring/link/package-info.java
* (edit) src/java/org/apache/nutch/storage/StorageUtils.java
* (edit) src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java
* (edit) src/java/org/apache/nutch/crawl/TextProfileSignature.java
* (edit) src/java/org/apache/nutch/parse/ParserChecker.java
* (edit) src/java/org/apache/nutch/api/NutchServer.java
* (edit) src/java/org/apache/nutch/util/NodeWalker.java
* (edit) 
src

[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2016-09-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471126#comment-15471126
 ] 

Hudson commented on NUTCH-2132:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3395 (See 
[https://builds.apache.org/job/Nutch-trunk/3395/])
Fix for NUTCH-2132: Publisher/Subscriber model for Nutch to emit events, 
(sujen: rev e53b34b2322f2d071981a72577644a225642ecbc)
* (add) src/plugin/publish-rabbitmq/build-ivy.xml
* (add) 
src/plugin/publish-rabbitmq/src/java/org/apache/nutch/publisher/rabbitmq/package-info.java
* (add) src/java/org/apache/nutch/fetcher/FetcherThreadPublisher.java
* (edit) src/plugin/build.xml
* (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java
* (add) src/java/org/apache/nutch/publisher/NutchPublisher.java
* (edit) src/plugin/nutch-extensionpoints/plugin.xml
* (edit) conf/nutch-default.xml
* (edit) ivy/ivy.xml
* (add) src/java/org/apache/nutch/fetcher/FetcherThreadEvent.java
* (add) src/plugin/publish-rabbitmq/plugin.xml
* (edit) src/java/org/apache/nutch/metadata/Nutch.java
* (add) 
src/plugin/publish-rabbitmq/src/java/org/apache/nutch/publisher/rabbitmq/RabbitMQPublisherImpl.java
* (edit) build.xml
* (add) src/plugin/publish-rabbitmq/build.xml
* (add) src/java/org/apache/nutch/publisher/NutchPublishers.java
* (add) src/plugin/publish-rabbitmq/ivy.xml


> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, REST_api
> Reporter: Sujen Shah
> Assignee: Chris A. Mattmann
> Labels: memex
> Fix For: 1.13
>
> Attachments: NUTCH-2132.patch, NUTCH-2132.v2.patch,
> PubSub_routingkey.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events
> (e.g. fetcher events like fetch-start and fetch-end, or a fetch report which
> may contain data like the outlinks of the currently fetched URL, score, etc.).
> A consumer of this functionality could use this data to generate real-time
> visualizations and statistics of the crawl without having to wait for
> the fetch round to finish.
> The REST API could contain an endpoint which would respond with a URL to
> which a client could subscribe to get the fetcher events.
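
The publish/subscribe idea can be sketched with an in-memory event bus. This is a simplification under assumptions: the commit above adds a RabbitMQ-based plugin, whereas this example delivers events to local subscribers only, and the class names and event fields are hypothetical rather than those of NutchPublisher or FetcherThreadEvent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical in-memory event bus; not the Nutch publisher API.
final class FetchEventBus {

    /** Hypothetical event: a type such as "fetch-start" or "fetch-end" plus a URL. */
    static final class Event {
        final String type;
        final String url;

        Event(String type, String url) {
            this.type = type;
            this.url = url;
        }

        @Override
        public String toString() {
            return type + " " + url;
        }
    }

    // CopyOnWriteArrayList allows subscribing while events are being published.
    private final List<Consumer<Event>> subscribers = new CopyOnWriteArrayList<>();

    /** Registers a subscriber that receives every event published afterwards. */
    void subscribe(Consumer<Event> subscriber) {
        subscribers.add(subscriber);
    }

    /** Delivers the event synchronously to all current subscribers. */
    void publish(Event event) {
        for (Consumer<Event> subscriber : subscribers) {
            subscriber.accept(event);
        }
    }

    public static void main(String[] args) {
        FetchEventBus bus = new FetchEventBus();
        List<String> log = new ArrayList<>();
        bus.subscribe(e -> log.add(e.toString()));
        bus.publish(new Event("fetch-start", "http://example.com/"));
        bus.publish(new Event("fetch-end", "http://example.com/"));
        System.out.println(log);
    }
}
```

A consumer such as a dashboard would register a subscriber and update its statistics as each event arrives, instead of waiting for the fetch round to finish.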





[jira] [Commented] (NUTCH-2320) URLFilterChecker to run as TCP Telnet service

2016-10-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15548786#comment-15548786
 ] 

Hudson commented on NUTCH-2320:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3396 (See 
[https://builds.apache.org/job/Nutch-trunk/3396/])
NUTCH-2320 URLFilterChecker to run as TCP Telnet service (markus: rev 
836b2e01d1a4e0e9443601da755ea37de91b8c7d)
* (edit) src/java/org/apache/nutch/net/URLFilterChecker.java


> URLFilterChecker to run as TCP Telnet service
> -
>
> Key: NUTCH-2320
> URL: https://issues.apache.org/jira/browse/NUTCH-2320
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.12
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.13
>
> Attachments: NUTCH-2320.patch
>
>
> Allow testing URL filters for web applications, just like the indexing
> filters checker.





[jira] [Commented] (NUTCH-2327) Seeds injected in REST workflow must be ingested into HDFS

2016-10-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605849#comment-15605849
 ] 

Hudson commented on NUTCH-2327:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3399 (See 
[https://builds.apache.org/job/Nutch-trunk/3399/])
Fix for NUTCH-2327: Seeds injected in REST must be ingested into HDFS, (sujen: 
rev 24cc2aa9c68fa356e4e926b6bf86bac99d52e38c)
* (edit) src/java/org/apache/nutch/service/resources/SeedResource.java


> Seeds injected in REST workflow must be ingested into HDFS
> --
>
> Key: NUTCH-2327
> URL: https://issues.apache.org/jira/browse/NUTCH-2327
> Project: Nutch
> Issue Type: Improvement
> Components: injector, REST_api
> Affects Versions: 1.12
> Reporter: Lewis John McGibbney
> Assignee: Sujen Shah
> Fix For: 1.13
>
>
> Right now when one uses the REST POST /seed/create API, a directory is
> created within /var/some/path/here, which is great if you are working locally
> with the Nutch server, e.g. on one machine. It is however not suitable for
> using the REST API in distributed deployments, where seeds need to be present
> within HDFS. More documentation on this topic is available at
> https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI#Seed_List_creation
> There are also various mailing list threads regarding use of the REST API,
> and the injector URL issue described above needs to be addressed.
> [~sujenshah] CC for context.
> http://www.mail-archive.com/user%40nutch.apache.org/msg14922.html
> http://www.mail-archive.com/user%40nutch.apache.org/msg14921.html





[jira] [Commented] (NUTCH-2336) SegmentReader to implement Tool

2016-12-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15711774#comment-15711774
 ] 

Hudson commented on NUTCH-2336:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3400 (See 
[https://builds.apache.org/job/Nutch-trunk/3400/])
NUTCH-2336 SegmentReader to implement Tool (contributed by Vincent (snagel: rev 
6e051f2ccadba6c6bac60ee8708ced958a30cc8b)
* (edit) src/java/org/apache/nutch/segment/SegmentReader.java


> SegmentReader to implement Tool
> ---
>
> Key: NUTCH-2336
> URL: https://issues.apache.org/jira/browse/NUTCH-2336
> Project: Nutch
>  Issue Type: Improvement
>  Components: segment
>Affects Versions: 1.12
>Reporter: Vincent Slot
>Priority: Minor
>  Labels: patch
> Fix For: 1.13
>
> Attachments: NUTCH-2336.patch
>
>
> Let SegmentReader implement Tool for use on Hadoop





[jira] [Commented] (NUTCH-2337) urlnormalizer-basic to strip empty port

2016-12-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745211#comment-15745211
 ] 

Hudson commented on NUTCH-2337:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1576 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1576/])
NUTCH-2337 urlnormalizer-basic to strip empty port - make sure that URLs 
(snagel: rev 6e3c34db16e385b0dadbe6444c2685283c863350)
* (edit) 
src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
* (edit) 
src/plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java


> urlnormalizer-basic to strip empty port
> ---
>
> Key: NUTCH-2337
> URL: https://issues.apache.org/jira/browse/NUTCH-2337
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin
>Affects Versions: 2.3.1, 1.12
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 2.4, 1.13
>
>
> Basic URL normalizer should strip an empty port from the URL; that's not the 
> case at present:
> {noformat}
> echo "http://example.com:/" \
>    | nutch plugin urlnormalizer-basic \
>      org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
> http://example.com:/
> {noformat}
> The result should be {{http://example.com/}}
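
The expected behaviour described above can be sketched in a few lines of Java. This is only an illustration, not the code from the NUTCH-2337 commit: the class name EmptyPortSketch and the method stripEmptyPort are hypothetical, and the regular expression is a simplification that ignores user-info and IPv6 authorities.

```java
import java.util.regex.Pattern;

class EmptyPortSketch {
  // scheme://host followed by a ':' that has no port digits after it
  // (the next character is '/', '?', '#', or the end of the string).
  private static final Pattern EMPTY_PORT =
      Pattern.compile("^([a-zA-Z][a-zA-Z0-9+.-]*://[^/?#:]*):(?=[/?#]|$)");

  /** Removes the dangling ':' of an empty port; other URLs are returned unchanged. */
  static String stripEmptyPort(String url) {
    return EMPTY_PORT.matcher(url).replaceFirst("$1");
  }

  public static void main(String[] args) {
    System.out.println(stripEmptyPort("http://example.com:/"));
    System.out.println(stripEmptyPort("http://example.com:8080/"));
  }
}
```

A URL with a real port, such as http://example.com:8080/, does not match the pattern and passes through untouched.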





[jira] [Commented] (NUTCH-2337) urlnormalizer-basic to strip empty port

2016-12-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745226#comment-15745226
 ] 

Hudson commented on NUTCH-2337:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3402 (See 
[https://builds.apache.org/job/Nutch-trunk/3402/])
NUTCH-2337 urlnormalizer-basic to strip empty port, closes #160 - make (snagel: 
rev f351790d7f496561aeae5e214d1b33975ca34cf2)
* (edit) 
src/plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java
* (edit) 
src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java


> urlnormalizer-basic to strip empty port
> ---
>
> Key: NUTCH-2337
> URL: https://issues.apache.org/jira/browse/NUTCH-2337
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin
>Affects Versions: 2.3.1, 1.12
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 2.4, 1.13
>
>
> Basic URL normalizer should strip an empty port from the URL; that's not the 
> case at present:
> {noformat}
> echo "http://example.com:/" \
>    | nutch plugin urlnormalizer-basic \
>      org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
> http://example.com:/
> {noformat}
> The result should be {{http://example.com/}}





[jira] [Commented] (NUTCH-2350) Add Missing activeConfId Field to NutchStatus Object

2017-01-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828446#comment-15828446
 ] 

Hudson commented on NUTCH-2350:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1577 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1577/])
NUTCH-2350 Added missing activeConfId field to NutchStatus. (kamaci: rev 
6e074fc0b61f421cb7bc516e92dea33c3ce23fd5)
* (edit) src/java/org/apache/nutch/webui/client/model/NutchStatus.java


> Add Missing activeConfId Field to NutchStatus Object
> 
>
> Key: NUTCH-2350
> URL: https://issues.apache.org/jira/browse/NUTCH-2350
> Project: Nutch
>  Issue Type: Bug
>  Components: web gui
>Affects Versions: 2.3.1
>Reporter: Furkan KAMACI
>Assignee: Furkan KAMACI
> Fix For: 2.4
>
>






[jira] [Commented] (NUTCH-2344) Authentication Support for Web GUI

2017-01-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828445#comment-15828445
 ] 

Hudson commented on NUTCH-2344:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1577 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1577/])
NUTCH-2344 Authentication support for Web GUI (kamaci: rev 
def067735c5a6dc46d867c4c89cb176a275b1967)
* (add) src/java/org/apache/nutch/webui/pages/auth/SignInPage.html
* (add) src/java/org/apache/nutch/webui/pages/auth/SignInPage.java
* (edit) ivy/ivy.xml
* (edit) src/java/org/apache/nutch/webui/pages/assets/nutch-style.css
* (add) src/java/org/apache/nutch/webui/pages/auth/SignInSession.java
* (add) src/java/org/apache/nutch/webui/pages/auth/AuthenticatedWebPage.java
* (add) src/java/org/apache/nutch/webui/pages/auth/package-info.java
* (edit) conf/nutch-default.xml
* (edit) src/java/org/apache/nutch/webui/pages/AbstractBasePage.java
* (edit) src/java/org/apache/nutch/webui/NutchUiApplication.properties
* (add) src/java/org/apache/nutch/webui/pages/auth/User.java
* (edit) src/java/org/apache/nutch/webui/NutchUiApplication.java
* (edit) src/java/org/apache/nutch/webui/pages/LogOutPage.java
* (add) src/java/org/apache/nutch/webui/pages/auth/AuthorizationStrategy.java


> Authentication Support for Web GUI
> --
>
> Key: NUTCH-2344
> URL: https://issues.apache.org/jira/browse/NUTCH-2344
> Project: Nutch
>  Issue Type: New Feature
>  Components: web gui
>Affects Versions: 2.3.1
>Reporter: Furkan KAMACI
>Assignee: Furkan KAMACI
> Fix For: 2.4
>
> Attachments: Firefox_Screenshot_2017-01-13T19-10-49.499Z.png
>
>
> We should implement an authentication support for Web GUI.





[jira] [Commented] (NUTCH-2351) Log with Generic Class Name at Nutch 2.x

2017-01-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830740#comment-15830740
 ] 

Hudson commented on NUTCH-2351:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1578 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1578/])
NUTCH-2351 Logging with generic class name. (snagel: rev 
1a84334c115bfda16980cd822da31ba5ae401afe)
* (edit) src/java/org/apache/nutch/fetcher/FetcherReducer.java
* (edit) 
src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java
* (edit) src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java
* (edit) src/java/org/apache/nutch/util/domain/DomainStatistics.java
* (edit) 
src/plugin/index-html/src/java/org/apache/nutch/indexer/html/HtmlIndexingFilter.java
* (edit) 
src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
* (edit) 
src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/DOMContentUtilsTest.java
* (edit) src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* (edit) src/java/org/apache/nutch/tools/DmozParser.java
* (edit) src/java/org/apache/nutch/webui/client/impl/CrawlingCycle.java
* (edit) 
src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java
* (edit) 
src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java
* (edit) 
src/plugin/creativecommons/src/java/org/creativecommons/nutch/CCIndexingFilter.java
* (edit) 
src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
* (edit) src/java/org/apache/nutch/tools/Benchmark.java
* (edit) src/java/org/apache/nutch/util/MimeUtil.java
* (edit) src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip/ZipParser.java
* (edit) src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
* (edit) src/java/org/apache/nutch/host/HostDbReader.java
* (edit) src/java/org/apache/nutch/tools/proxy/LogDebugHandler.java
* (edit) 
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* (edit) src/plugin/parse-swf/src/java/org/apache/nutch/parse/swf/SWFParser.java
* (edit) src/java/org/apache/nutch/net/URLNormalizers.java
* (edit) src/java/org/apache/nutch/host/HostInjectorJob.java
* (edit) src/java/org/apache/nutch/plugin/PluginDescriptor.java
* (edit) 
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
* (edit) src/java/org/apache/nutch/util/DomUtil.java
* (edit) 
src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
* (edit) src/java/org/apache/nutch/parse/ParseUtil.java
* (edit) src/java/org/apache/nutch/util/GZIPUtils.java
* (edit) src/java/org/apache/nutch/host/HostDb.java
* (edit) src/java/org/apache/nutch/indexer/IndexUtil.java
* (edit) src/java/org/apache/nutch/indexer/IndexWriters.java
* (edit) src/java/org/apache/nutch/host/HostDbUpdateJob.java
* (edit) src/java/org/apache/nutch/webui/service/impl/NutchServiceImpl.java
* (edit) 
src/plugin/index-anchor/src/java/org/apache/nutch/indexer/anchor/AnchorIndexingFilter.java
* (edit) src/java/org/apache/nutch/api/resources/AdminResource.java
* (edit) src/java/org/apache/nutch/api/impl/JobWorker.java
* (edit) src/java/org/apache/nutch/plugin/PluginManifestParser.java
* (edit) src/java/org/apache/nutch/webui/client/impl/RemoteCommandExecutor.java
* (edit) src/java/org/apache/nutch/crawl/SignatureFactory.java
* (edit) src/java/org/apache/nutch/parse/ParserJob.java
* (edit) src/java/org/apache/nutch/parse/ParserFactory.java
* (edit) 
src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java
* (edit) 
src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
* (edit) src/java/org/apache/nutch/crawl/WebTableReader.java
* (edit) 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
* (edit) 
src/plugin/urlfilter-domain/src/java/org/apache/nutch/urlfilter/domain/DomainURLFilter.java
* (edit) src/java/org/apache/nutch/parse/ParserChecker.java
* (edit) src/java/org/apache/nutch/api/NutchServer.java
* (edit) src/java/org/apache/nutch/indexer/IndexingFilters.java
* (edit) src/java/org/apache/nutch/util/ObjectCache.java
* (edit) 
src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpRobotRulesParser.java
* (edit) 
src/plugin/tld/src/java/org/apache/nutch/indexer/tld/TLDIndexingFilter.java
* (edit) 
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/DummySSLProtocolSocketFactory.java
* (edit) src/java/org/apache/nutch/api/security/SecurityUtils.java
* (edit) src/java/org/apache/nutch/indexer/solr/SolrDeleteDuplicates.java
* (edit) src/java/org/apache/nutch/plugin/PluginRepository.java
* (edit) src/java/org/apache/nutch/util/EncodingDetector.java
* (edit) 
src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlPars

[jira] [Commented] (NUTCH-2352) Log with Generic Class Name at Nutch 1.x

2017-01-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830758#comment-15830758
 ] 

Hudson commented on NUTCH-2352:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3404 (See 
[https://builds.apache.org/job/Nutch-trunk/3404/])
NUTCH-2352 Logging with generic class name, closes #172 (snagel: rev 
2b93a66f0472e93223c69053d5482dcbef26de6d)
* (edit) 
src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/cosine/Model.java
* (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java
* (edit) 
src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java
* (edit) 
src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
* (edit) src/java/org/apache/nutch/service/NutchServer.java
* (edit) src/java/org/apache/nutch/fetcher/FetchItemQueue.java
* (edit) 
src/plugin/urlmeta/src/java/org/apache/nutch/indexer/urlmeta/URLMetaIndexingFilter.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
* (edit) 
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/DummyX509TrustManager.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/ScoreUpdater.java
* (edit) 
src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net/urlnormalizer/protocol/ProtocolURLNormalizer.java
* (edit) src/java/org/apache/nutch/net/URLNormalizers.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDbMerger.java
* (edit) 
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
* (edit) src/java/org/apache/nutch/parse/ParseUtil.java
* (edit) src/plugin/feed/src/java/org/apache/nutch/parse/feed/FeedParser.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDb.java
* (edit) src/test/org/apache/nutch/tools/proxy/ProxyTestbed.java
* (edit) 
src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java
* (edit) src/java/org/apache/nutch/service/impl/JobWorker.java
* (edit) 
src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
* (edit) src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip/ZipParser.java
* (edit) 
src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java
* (edit) 
src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpRobotRulesParser.java
* (edit) src/java/org/apache/nutch/parse/ParseResult.java
* (edit) src/java/org/apache/nutch/fetcher/QueueFeeder.java
* (edit) src/java/org/apache/nutch/parse/ParseSegment.java
* (edit) src/java/org/apache/nutch/hostdb/UpdateHostDbMapper.java
* (edit) 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
* (edit) src/java/org/apache/nutch/util/domain/DomainSuffixesReader.java
* (edit) 
src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
* (edit) src/java/org/apache/nutch/indexer/IndexingFilters.java
* (edit) src/java/org/apache/nutch/util/DomUtil.java
* (edit) 
src/plugin/urlmeta/src/java/org/apache/nutch/scoring/urlmeta/URLMetaScoringFilter.java
* (edit) 
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpAuthenticationFactory.java
* (edit) src/java/org/apache/nutch/crawl/Injector.java
* (edit) 
src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java
* (edit) src/java/org/apache/nutch/util/domain/DomainStatistics.java
* (edit) 
src/plugin/publish-rabbitmq/src/java/org/apache/nutch/publisher/rabbitmq/RabbitMQPublisherImpl.java
* (edit) 
src/plugin/urlnormalizer-querystring/src/java/org/apache/nutch/net/urlnormalizer/querystring/QuerystringURLNormalizer.java
* (edit) 
src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultClickAllAjaxLinksHandler.java
* (edit) src/test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java
* (edit) src/java/org/apache/nutch/tools/FileDumper.java
* (edit) src/java/org/apache/nutch/segment/SegmentMergeFilters.java
* (edit) src/java/org/apache/nutch/webui/service/impl/NutchServiceImpl.java
* (edit) src/java/org/apache/nutch/hostdb/ReadHostDb.java
* (edit) src/test/org/apache/nutch/service/TestNutchServer.java
* (edit) src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* (edit) 
src/plugin/urlfilter-domain/src/java/org/apache/nutch/urlfilter/domain/DomainURLFilter.java
* (edit) 
src/plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java
* (edit) src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
* (edit) 
src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/cosine/CosineSimilarity.java
* (edit) src/test/org/apache/nutch/crawl/CrawlDbUpdateUtil.java
* (edit) 
src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java
* (edit) 
src/plugin/protocol-httpclient/src/java/org/apache/nu

[jira] [Commented] (NUTCH-2346) Check Types at Object Equality

2017-01-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833122#comment-15833122
 ] 

Hudson commented on NUTCH-2346:
---

FAILURE: Integrated in Jenkins build Nutch-nutchgora #1579 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1579/])
NUTCH-2346 Types are checked at object equality (kamaci: rev 
170f8c1375c8826c6397de0eb80e2fa29d2bfe5f)
* (edit) src/java/org/apache/nutch/crawl/GeneratorJob.java
* (edit) src/java/org/apache/nutch/metadata/Metadata.java


> Check Types at Object Equality
> --
>
> Key: NUTCH-2346
> URL: https://issues.apache.org/jira/browse/NUTCH-2346
> Project: Nutch
>  Issue Type: Bug
>  Components: generator, metadata
>Affects Versions: 2.3.1
>Reporter: Furkan KAMACI
>Assignee: Furkan KAMACI
>Priority: Minor
> Fix For: 2.4
>
>






[jira] [Commented] (NUTCH-2346) Check Types at Object Equality

2017-01-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15843707#comment-15843707
 ] 

Hudson commented on NUTCH-2346:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1580 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1580/])
NUTCH-2346v2 Check Types at Object Equality v2 (lewis.mcgibbney: rev 
022ed5c03206fab821770f85c2711f7c01edb17e)
* (edit) src/java/org/apache/nutch/metadata/Metadata.java


> Check Types at Object Equality
> --
>
> Key: NUTCH-2346
> URL: https://issues.apache.org/jira/browse/NUTCH-2346
> Project: Nutch
>  Issue Type: Bug
>  Components: generator, metadata
>Affects Versions: 2.3.1
>Reporter: Furkan KAMACI
>Assignee: Furkan KAMACI
>Priority: Minor
> Fix For: 2.4
>
>






[jira] [Commented] (NUTCH-2347) Use Logger Instead of Printing Throwable

2017-02-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848272#comment-15848272
 ] 

Hudson commented on NUTCH-2347:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1581 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1581/])
NUTCH-2347 Logger is used instead of printing Throwable. (kamaci: rev 
8dbf8083aa63fbd881c18fc8824981b4c84c9c02)
* (edit) src/java/org/apache/nutch/protocol/RobotRulesParser.java
* (edit) src/java/org/apache/nutch/parse/NutchSitemapParser.java
* (edit) src/java/org/apache/nutch/util/URLUtil.java
* (edit) src/java/org/apache/nutch/crawl/WebTableReader.java
* (edit) src/java/org/apache/nutch/host/HostDbReader.java
* (edit) src/java/org/apache/nutch/tools/DmozParser.java
* (edit) src/java/org/apache/nutch/util/GenericWritableConfigurable.java
* (edit) src/java/org/apache/nutch/parse/ParseUtil.java
* (edit) src/java/org/apache/nutch/util/NutchTool.java


> Use Logger Instead of Printing Throwable
> 
>
> Key: NUTCH-2347
> URL: https://issues.apache.org/jira/browse/NUTCH-2347
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3.1
>Reporter: Furkan KAMACI
>Assignee: Furkan KAMACI
>Priority: Minor
> Fix For: 2.4
>
>
> Loggers should be used instead of printing Throwable.
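
As an illustration of the change requested above, the following sketch replaces a printed stack trace with a logger call. It uses java.util.logging only to stay self-contained; which logging API the Nutch commit uses is not shown in this message, and the class LoggingSketch and method parsePort are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class LoggingSketch {
  private static final Logger LOG = Logger.getLogger(LoggingSketch.class.getName());

  /** Parses a port number; returns -1 and logs the failure if the text is not a number. */
  static int parsePort(String text) {
    try {
      return Integer.parseInt(text);
    } catch (NumberFormatException e) {
      // Instead of e.printStackTrace(): pass a level, a message and the throwable
      // to the logger so the configured handlers decide where the record goes.
      LOG.log(Level.WARNING, "Invalid port: " + text, e);
      return -1;
    }
  }

  public static void main(String[] args) {
    System.out.println(parsePort("8080"));
    System.out.println(parsePort("not-a-port"));
  }
}
```

Routing the Throwable through the logger keeps the stack trace together with the log level and message, so it ends up in the same log file as the rest of the output.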



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NUTCH-2349) urlnormalizer-basic NPE for ill-formed URL "http:/"

2017-02-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848273#comment-15848273
 ] 

Hudson commented on NUTCH-2349:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1581 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1581/])
NUTCH-2349 urlnormalizer-basic: NPE for URLs without authority - check (snagel: 
rev 700857d16c9e1517ddb9868ed41171d91e5c9116)
* (edit) 
src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
* (edit) 
src/plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java


> urlnormalizer-basic NPE for ill-formed URL "http:/"
> ---
>
> Key: NUTCH-2349
> URL: https://issues.apache.org/jira/browse/NUTCH-2349
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.4, 1.13
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
> Fix For: 2.4, 1.13
>
>
> NUTCH-2337 introduced a potential (though rare) NullPointerException when an 
> ill-formed URL (just the protocol followed by "{{:}}", "{{:/}}", "{{://}}" 
> or even more slashes) is normalized:
> {noformat}
> % echo "http:/" \
>   | runtime/local/bin/nutch org.apache.nutch.net.URLNormalizerChecker \
>  -normalizer org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer 
> Checking URLNormalizer org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:120)
> at org.apache.nutch.net.URLNormalizerChecker.checkOne(URLNormalizerChecker.java:72)
> at org.apache.nutch.net.URLNormalizerChecker.main(URLNormalizerChecker.java:110)
> {noformat}
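
The commit summary mentions a check for URLs without an authority. A minimal sketch of such a guard is shown below; it is not the code from BasicURLNormalizer. It uses java.net.URI rather than whatever parsing the plugin performs, and the class AuthorityCheckSketch and method lowerCaseAuthority are hypothetical names.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

class AuthorityCheckSketch {
  /**
   * Returns the lower-cased authority of the URL, or null if the URL has no
   * authority (as in "http:/") or cannot be parsed at all.
   */
  static String lowerCaseAuthority(String url) {
    try {
      String authority = new URI(url).getAuthority();
      if (authority == null) {
        // Without this guard the toLowerCase call below would throw a NullPointerException.
        return null;
      }
      return authority.toLowerCase(Locale.ROOT);
    } catch (URISyntaxException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    System.out.println(lowerCaseAuthority("http://Example.COM/"));
    System.out.println(lowerCaseAuthority("http:/"));
  }
}
```

The point of the sketch is only the null check before the authority is used; what the normalizer should return for such input is a separate design decision.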





[jira] [Commented] (NUTCH-2349) urlnormalizer-basic NPE for ill-formed URL "http:/"

2017-02-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848282#comment-15848282
 ] 

Hudson commented on NUTCH-2349:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3405 (See 
[https://builds.apache.org/job/Nutch-trunk/3405/])
NUTCH-2349 urlnormalizer-basic: NPE for URLs without authority - check (snagel: 
rev 1a718e0cc9a0c381e40f4bf8351e26f73522)
* (edit) 
src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
* (edit) 
src/plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java


> urlnormalizer-basic NPE for ill-formed URL "http:/"
> ---
>
> Key: NUTCH-2349
> URL: https://issues.apache.org/jira/browse/NUTCH-2349
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.4, 1.13
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
> Fix For: 2.4, 1.13
>
>
> NUTCH-2337 introduced a potential (though rare) NullPointerException when an 
> ill-formed URL (just the protocol followed by "{{:}}", "{{:/}}", "{{://}}" 
> or even more slashes) is normalized:
> {noformat}
> % echo "http:/" \
>   | runtime/local/bin/nutch org.apache.nutch.net.URLNormalizerChecker \
>  -normalizer org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer 
> Checking URLNormalizer org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:120)
> at org.apache.nutch.net.URLNormalizerChecker.checkOne(URLNormalizerChecker.java:72)
> at org.apache.nutch.net.URLNormalizerChecker.main(URLNormalizerChecker.java:110)
> {noformat}





[jira] [Commented] (NUTCH-2359) Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed

2017-02-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865783#comment-15865783
 ] 

Hudson commented on NUTCH-2359:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3406 (See 
[https://builds.apache.org/job/Nutch-trunk/3406/])
NUTCH-2359 Parsefilter-regex raises IndexOutOfBoundsException when rules 
(markus: rev 9a9c4b32b9c1ab9c47583a217665e4694272d58a)
* (add) src/plugin/parsefilter-regex/README.txt
* (edit) 
src/plugin/parsefilter-regex/src/java/org/apache/nutch/parsefilter/regex/RegexParseFilter.java


> Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed
> 
>
> Key: NUTCH-2359
> URL: https://issues.apache.org/jira/browse/NUTCH-2359
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin
>Affects Versions: 1.12
>Reporter: Laknath Semage
>Assignee: Markus Jelsma
>Priority: Minor
>  Labels: patch
> Fix For: 1.13
>
>
> This patch fixes:
> 1) [Bug] Parsefilter-regex raises IndexOutOfBoundsException when rules are 
> ill-formed
> 2) Rules are split using any whitespace character (\s) instead of a tab (\t) 
> 3) A detailed Readme for the plugin
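
Points 1 and 2 can be illustrated with a small sketch: split a rule line on whitespace and reject lines with too few fields before indexing into the array. The three-field layout, the class RuleLineSketch and the method parseRule are assumptions made for the example, not the plugin's actual rule format or code.

```java
class RuleLineSketch {
  /**
   * Splits a rule line on runs of whitespace into at most expectedFields parts.
   * Returns null for ill-formed lines so callers never index past the array end.
   */
  static String[] parseRule(String line, int expectedFields) {
    // The limit keeps any whitespace inside the last field (e.g. a regex) intact.
    String[] parts = line.trim().split("\\s+", expectedFields);
    return parts.length < expectedFields ? null : parts;
  }

  public static void main(String[] args) {
    String[] ok = parseRule("name\thtml <p>(.*)</p>", 3);
    System.out.println(ok == null ? "skipped" : ok[2]);
    String[] bad = parseRule("name html", 3);
    System.out.println(bad == null ? "skipped" : bad[2]);
  }
}
```

Returning null (or logging and skipping the line) is one possible policy; the important part is that the field count is checked before any element is accessed.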





[jira] [Commented] (NUTCH-2355) Protocol plugins to set cookie if Cookie metadata field is present

2017-02-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15875861#comment-15875861
 ] 

Hudson commented on NUTCH-2355:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3408 (See 
[https://builds.apache.org/job/Nutch-trunk/3408/])
NUTCH-2355 Protocol plugins to set cookie if Cookie metadata field is (markus: 
rev 217fad16bfdea0494390e8f170d9350cf06657ef)
* (edit) 
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
* (edit) 
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* (edit) 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
* (edit) conf/nutch-default.xml


> Protocol plugins to set cookie if Cookie metadata field is present
> --
>
> Key: NUTCH-2355
> URL: https://issues.apache.org/jira/browse/NUTCH-2355
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.13
>
> Attachments: NUTCH-2355.patch
>
>






[jira] [Commented] (NUTCH-2364) http.agent.rotate: IllegalArgumentException / last element of agent names ignored

2017-03-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898220#comment-15898220
 ] 

Hudson commented on NUTCH-2364:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3410 (See 
[https://builds.apache.org/job/Nutch-trunk/3410/])
NUTCH-2364 http.agent.rotate: IllegalArgumentException / last element of 
(snagel: 
[https://github.com/apache/nutch/commit/e5e67028251e5cc1fdd10ed94103fadff0c41a4a])
* (edit) 
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java


> http.agent.rotate: IllegalArgumentException / last element of agent names 
> ignored
> -
>
> Key: NUTCH-2364
> URL: https://issues.apache.org/jira/browse/NUTCH-2364
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.10, 1.11, 2.3.1, 1.12
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 2.4, 1.13
>
>
> With http.agent.rotate == true and a one-element agent name list, the 
> following exception is thrown:
> {noformat}
> % cat .../conf/agents.txt
> my-test-crawler/Nutch-1.13
> % .../bin/nutch parsechecker -Dhttp.agent.rotate=true http://nutch.apache.org/
> ...
> Fetch failed with protocol status: exception(16), lastModified=0: 
> java.lang.IllegalArgumentException: bound must be positive
> % cat .../logs/hadoop.log
> ...
> 2017-03-03 11:17:19,750 ERROR http.Http - Failed to get protocol output
> java.lang.IllegalArgumentException: bound must be positive
> at java.util.concurrent.ThreadLocalRandom.nextInt(ThreadLocalRandom.java:352)
> at org.apache.nutch.protocol.http.api.HttpBase.getUserAgent(HttpBase.java:379)
> at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:180)
> ...
> {noformat}
> Caused by
> {code}
> userAgentNames.get(ThreadLocalRandom.current().nextInt(userAgentNames.size()-1));
> {code}
> but nextInt(...) is defined as: "Returns a pseudorandom int value between 
> zero (inclusive) and the specified bound (exclusive)."
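
Both symptoms in the issue title follow from the exclusive upper bound of nextInt. The sketch below contrasts the reported call with a bound of size(); the class AgentRotation and its methods are hypothetical and are not the code in HttpBase.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class AgentRotation {
  /** nextInt's bound is exclusive, so size() covers every index 0..size()-1. */
  static String pickAgent(List<String> agents) {
    return agents.get(ThreadLocalRandom.current().nextInt(agents.size()));
  }

  /** The bound from the report: size() - 1 never yields the last index and is 0 for one element. */
  static String pickAgentBuggy(List<String> agents) {
    return agents.get(ThreadLocalRandom.current().nextInt(agents.size() - 1));
  }

  public static void main(String[] args) {
    List<String> one = Collections.singletonList("my-test-crawler/Nutch-1.13");
    System.out.println(pickAgent(one));
    try {
      pickAgentBuggy(one);
    } catch (IllegalArgumentException e) {
      System.out.println("IllegalArgumentException: " + e.getMessage());
    }
    // With two agents the faulty bound is 1, so only index 0 can ever be chosen.
    System.out.println(pickAgentBuggy(Arrays.asList("agent-a", "agent-b")));
  }
}
```

With a one-element list the faulty bound is 0, which nextInt rejects; with longer lists the call succeeds but the last agent name is never selected.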



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NUTCH-2357) Index metadata throw Exception because writable object cannot be cast to Text

2017-03-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925276#comment-15925276
 ] 

Hudson commented on NUTCH-2357:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3412 (See 
[https://builds.apache.org/job/Nutch-trunk/3412/])
NUTCH-2357 Index metadata throw Exception because writable object cannot 
(snagel: 
[https://github.com/apache/nutch/commit/439f1153991ec104acdb73420ddc816cd9c665e8])
* (edit) 
src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java


> Index metadata throw Exception because writable object cannot be cast to Text
> -
>
> Key: NUTCH-2357
> URL: https://issues.apache.org/jira/browse/NUTCH-2357
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.12
> Environment: It was detected using Linux mint 18.
>Reporter: Eyeris Rodriguez Rueda
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.13
>
>
> The Index Metadata plugin uses this property (see below) to take keys from the 
> datum and index them.
> <property>
>   <name>index.db.md</name>
>   <value></value>
>   <description>
> ...
>   </description>
> </property>
> With any value set for this property, an exception is thrown.
> The problem occurs because a Writable object cannot be cast to Text; see this 
> line:
> https://github.com/apache/nutch/blob/master/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java#L58
> A small change will fix it.
> This is the exception:
> **
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: digest dest: digest
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: metatag.description dest: description
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: metatag.keywords dest: keywords
> 2017-02-06 18:18:30,134 WARN  mapred.LocalJobRunner - job_local1516_0001
> java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.Text
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.Text
>   at org.apache.nutch.indexer.metadata.MetadataIndexer.filter(MetadataIndexer.java:58)
>   at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:51)
>   at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:330)
>   at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56)
>   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 2017-02-06 18:18:30,777 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>   at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
>   at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
> **
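
The cast failure reported above can be reproduced without Hadoop on the classpath. The sketch below uses plain Java objects as stand-ins for Writable values (an Integer in the role of IntWritable, a String in the role of Text). The class and method names are hypothetical, and this is not the committed patch; it only illustrates why converting with toString() is safer than casting every metadata value to one concrete type.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for datum metadata: values may be of different types,
// just as Writable values may be Text, IntWritable, and so on.
class MetadataValueDemo {

  // Fragile: assumes every value is a String (analogous to casting to Text).
  static String byCast(Map<String, Object> meta, String key) {
    return (String) meta.get(key);
  }

  // Safer: relies only on toString(), which every value provides.
  static String byToString(Map<String, Object> meta, String key) {
    Object value = meta.get(key);
    return value == null ? null : value.toString();
  }

  public static void main(String[] args) {
    Map<String, Object> meta = new LinkedHashMap<>();
    meta.put("title", "Apache Nutch");
    meta.put("retries", 3); // boxed to Integer, not a String

    System.out.println(byToString(meta, "retries"));
    try {
      byCast(meta, "retries");
    } catch (ClassCastException e) {
      System.out.println("cast failed: " + e.getClass().getSimpleName());
    }
  }
}
```

Whether the real fix should use toString() or a type check depends on how the indexed field is meant to be typed; the sketch does not settle that.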



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NUTCH-2366) Deprecated Job constructor in hostdb/ReadHostDb.java

2017-03-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926093#comment-15926093
 ] 

Hudson commented on NUTCH-2366:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3413 (See 
[https://builds.apache.org/job/Nutch-trunk/3413/])
NUTCH-2366 Deprecated Job constructor in hostdb/ReadHostDb.java\ (markus: 
[https://github.com/apache/nutch/commit/3926910e145df083ec9d42cd397c0cbd9b3a16da])
* (edit) src/java/org/apache/nutch/hostdb/ReadHostDb.java


> Deprecated Job constructor in hostdb/ReadHostDb.java
> 
>
> Key: NUTCH-2366
> URL: https://issues.apache.org/jira/browse/NUTCH-2366
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.12
>Reporter: Omkar Reddy
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.13
>
> Attachments: NUTCH-2366.patch
>
>
> When we try to build Nutch using ant we get the following warning: 
> warning: [deprecation] Job(Configuration,String) in Job has been deprecated
>[javac] Job job = new Job(conf, "ReadHostDb");
> This is because the constructor Job(Configuration conf, String jobName) has 
> been deprecated; its replacement, Job.getInstance, is documented at [0].
> [0] 
> http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html#getInstance%28org.apache.hadoop.conf.Configuration%29





[jira] [Commented] (NUTCH-2367) Get single record from HostDB

2017-03-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927826#comment-15927826
 ] 

Hudson commented on NUTCH-2367:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3414 (See 
[https://builds.apache.org/job/Nutch-trunk/3414/])
NUTCH-2367 Get single record from HostDB (markus: 
[https://github.com/apache/nutch/commit/be3aea1410835b34cfacdff7c3def9fb01a83e76])
* (edit) src/java/org/apache/nutch/hostdb/ReadHostDb.java


> Get single record from HostDB
> -
>
> Key: NUTCH-2367
> URL: https://issues.apache.org/jira/browse/NUTCH-2367
> Project: Nutch
>  Issue Type: Improvement
>  Components: hostdb
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.13
>
> Attachments: NUTCH-2367.patch
>
>
> Introduces:
> {code}
> bin/nutch readhostdb crawl/hostdb/ -get www.apache.org
> {code}





[jira] [Commented] (NUTCH-2068) Allow subcollection overrides via metadata

2017-03-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927883#comment-15927883
 ] 

Hudson commented on NUTCH-2068:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3415 (See 
[https://builds.apache.org/job/Nutch-trunk/3415/])
NUTCH-2068 Allow subcollection overrides via metadata (markus: 
[https://github.com/apache/nutch/commit/9fb7d6c2e61ce36375722b16842b694621f3b053])
* (edit) 
src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java


> Allow subcollection overrides via metadata
> --
>
> Key: NUTCH-2068
> URL: https://issues.apache.org/jira/browse/NUTCH-2068
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.10
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: NUTCH-2068.patch
>
>
> Similar to index-metadata but overrides subcollection. If both subcollection 
> and index-metadata are active, you will get two values for the field, possibly 
> causing multivalued field errors.





[jira] [Commented] (NUTCH-2336) SegmentReader to implement Tool

2017-04-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958730#comment-15958730
 ] 

Hudson commented on NUTCH-2336:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3420 (See 
[https://builds.apache.org/job/Nutch-trunk/3420/])
Adapt NUTCH-2336 to NUTCH-2281 (snagel: 
[https://github.com/apache/nutch/commit/330532175f751e7c977fb8549c048fc9cf4bd10d])
* (edit) src/java/org/apache/nutch/segment/SegmentReader.java


> SegmentReader to implement Tool
> ---
>
> Key: NUTCH-2336
> URL: https://issues.apache.org/jira/browse/NUTCH-2336
> Project: Nutch
>  Issue Type: Improvement
>  Components: segment
>Affects Versions: 1.12
>Reporter: Vincent Slot
>Priority: Minor
>  Labels: patch
> Fix For: 1.13
>
> Attachments: NUTCH-2336.patch
>
>
> Let SegmentReader implement Tool for use on Hadoop





[jira] [Commented] (NUTCH-2281) Support non-default FileSystem

2017-04-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958729#comment-15958729
 ] 

Hudson commented on NUTCH-2281:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3420 (See 
[https://builds.apache.org/job/Nutch-trunk/3420/])
NUTCH-2281 Support non-default FileSystem (snagel: 
[https://github.com/apache/nutch/commit/faed27af5b2c471610af93e2cb45f551615bd922])
* (edit) src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java
* (edit) src/java/org/apache/nutch/segment/SegmentMerger.java
* (edit) src/java/org/apache/nutch/crawl/LinkDbMerger.java
* (edit) src/java/org/apache/nutch/hostdb/UpdateHostDb.java
* (edit) src/java/org/apache/nutch/indexer/IndexingJob.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/NodeReader.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/ScoreUpdater.java
* (edit) src/java/org/apache/nutch/crawl/Injector.java
* (edit) src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/LinkRank.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDb.java
* (edit) src/java/org/apache/nutch/crawl/LinkDb.java
* (edit) src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* (edit) src/java/org/apache/nutch/crawl/DeduplicationJob.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDbMerger.java
* (edit) src/java/org/apache/nutch/parse/ParseSegment.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
* (edit) src/java/org/apache/nutch/tools/FileDumper.java
* (edit) src/java/org/apache/nutch/segment/SegmentReader.java
* (edit) src/java/org/apache/nutch/crawl/Generator.java
* (edit) src/java/org/apache/nutch/crawl/LinkDbReader.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java
* (edit) src/java/org/apache/nutch/util/LockUtil.java
Adapt NUTCH-2336 to NUTCH-2281 (snagel: 
[https://github.com/apache/nutch/commit/330532175f751e7c977fb8549c048fc9cf4bd10d])
* (edit) src/java/org/apache/nutch/segment/SegmentReader.java
NUTCH-2281 Support non-default file system - fix install of CrawlDb for 
(snagel: 
[https://github.com/apache/nutch/commit/5dcd7b13f450561a7b34bb6761041150c84bfdab])
* (edit) src/java/org/apache/nutch/crawl/Injector.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDb.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDbMerger.java
* (edit) src/java/org/apache/nutch/crawl/Generator.java


> Support non-default FileSystem
> --
>
> Key: NUTCH-2281
> URL: https://issues.apache.org/jira/browse/NUTCH-2281
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.12
>Reporter: Sebastian Nagel
> Fix For: 1.14
>
>
> If a path (input or output) does not belong to the configured default 
> FileSystem various Nutch tools may raise an exception like
> {noformat}
>   Exception in ... java.lang.IllegalArgumentException: Wrong FS: s3a://..., 
> expected: hdfs://...
> {noformat}
> This is fixed by getting a reference to the FileSystem from the Path object
> {noformat}
>   FileSystem fs = path.getFileSystem(getConf());
> {noformat}
> instead of
> {noformat}
>   FileSystem fs = FileSystem.get(getConf());
> {noformat}
> A given path (e.g., {{s3a://...}}) may not belong to the default file system 
> ({{hdfs://}} or {{file://}} in local mode) and simple checks such as 
> {{fs.exists(path)}} then will fail. Cf. 
> [FileSystem.checkPath(path)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#checkPath(org.apache.hadoop.fs.Path)],
>  and 
> [FileSystem.get(conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#get(org.apache.hadoop.conf.Configuration)]
>  vs. 
> [FileSystem.get(URI,conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#get(java.net.URI,%20org.apache.hadoop.conf.Configuration)]
>  which is called by 
> [Path.getFileSystem(conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/Path.html#getFileSystem%28org.apache.hadoop.conf.Configuration%29].
>   
> Note that the FileSystem for input and output may be different, e.g., read 
> from HDFS and write to S3.
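
The difference between the two lookups described above can be modelled in a few lines. The sketch below is a simplified stand-in, not Hadoop's FileSystem: the class SchemeBoundFs and its methods are hypothetical, and the check compares only the URI scheme. Under those assumptions it shows why a file system taken from the default configuration rejects an s3a:// path while one resolved from the path itself accepts it.

```java
import java.net.URI;

// Simplified stand-in for a file system bound to one URI scheme.
// It is NOT Hadoop's FileSystem; it only models the scheme check.
class SchemeBoundFs {
  private final String scheme;

  SchemeBoundFs(String scheme) {
    this.scheme = scheme;
  }

  // Analogous to FileSystem.get(conf): always the configured default.
  static SchemeBoundFs forDefault(URI defaultFs) {
    return new SchemeBoundFs(defaultFs.getScheme());
  }

  // Analogous to path.getFileSystem(conf): chosen from the path itself,
  // falling back to the default when the path has no scheme.
  static SchemeBoundFs forPath(URI path, URI defaultFs) {
    String s = path.getScheme();
    return new SchemeBoundFs(s != null ? s : defaultFs.getScheme());
  }

  // Rejects paths that carry a different scheme than this file system.
  void checkPath(URI path) {
    String s = path.getScheme();
    if (s != null && !s.equalsIgnoreCase(scheme)) {
      throw new IllegalArgumentException(
          "Wrong FS: " + path + ", expected: " + scheme + "://");
    }
  }

  public static void main(String[] args) {
    URI defaultFs = URI.create("hdfs://namenode:8020");
    URI output = URI.create("s3a://bucket/crawl/crawldb");

    forPath(output, defaultFs).checkPath(output); // passes
    try {
      forDefault(defaultFs).checkPath(output); // scheme mismatch
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Because input and output may live on different file systems, the lookup has to be repeated per path rather than cached once per job.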





[jira] [Commented] (NUTCH-2335) Injector not to filter and normalize existing URLs in CrawlDb

2017-04-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958788#comment-15958788
 ] 

Hudson commented on NUTCH-2335:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3421 (See 
[https://builds.apache.org/job/Nutch-trunk/3421/])
NUTCH-2335 Injector not to filter and normalize existing items/URLs in (snagel: 
[https://github.com/apache/nutch/commit/5945db20de21c62795315c095ccf9ff4c61f3ebe])
* (edit) src/java/org/apache/nutch/crawl/Injector.java


> Injector not to filter and normalize existing URLs in CrawlDb
> -
>
> Key: NUTCH-2335
> URL: https://issues.apache.org/jira/browse/NUTCH-2335
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb, injector
>Affects Versions: 1.12
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
> Fix For: 1.14
>
>
> With NUTCH-1712 the behavior of the Injector has changed in case new URLs are 
> added to an existing CrawlDb:
> - before only injected URLs were filtered and normalized
> - now filters and normalizers are applied to all URLs including those already 
> in the CrawlDb
> The default should be, as before, not to filter existing URLs. Filtering and 
> normalizing may take long for large CrawlDbs and/or complex URL filters. If 
> URL filter or normalizer rules are not changed there is no need to apply them 
> anew every time new URLs are added. Of course, injected URLs should be 
> filtered and normalized by default.
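
The proposed default can be sketched as a merge with a switch. Everything in the block below is hypothetical (the class InjectMergeSketch, the urlCheck function, the filterExisting flag); it is not the Injector's actual code, only a model of the behavior described: injected URLs are always normalized and filtered, existing ones only on request.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical model of merging injected URLs into an existing URL set.
class InjectMergeSketch {

  // urlCheck normalizes a URL and returns empty if the URL is filtered out.
  static TreeSet<String> merge(List<String> existing, List<String> injected,
      Function<String, Optional<String>> urlCheck, boolean filterExisting) {
    TreeSet<String> result = new TreeSet<>();
    for (String url : existing) {
      if (filterExisting) {
        urlCheck.apply(url).ifPresent(result::add);
      } else {
        result.add(url); // kept untouched: the proposed default
      }
    }
    for (String url : injected) {
      urlCheck.apply(url).ifPresent(result::add); // always checked
    }
    return result;
  }

  public static void main(String[] args) {
    // Toy rule: lower-case the URL and reject anything that is not http(s).
    Function<String, Optional<String>> check = u -> {
      String n = u.toLowerCase();
      return n.startsWith("http") ? Optional.of(n) : Optional.<String>empty();
    };
    List<String> existing =
        Arrays.asList("ftp://old.example.org/", "http://a.example.org/");
    List<String> injected =
        Arrays.asList("HTTP://B.example.org/", "ftp://new.example.org/");

    System.out.println(merge(existing, injected, check, false));
    System.out.println(merge(existing, injected, check, true));
  }
}
```

With the flag off, the cost of the check grows with the number of injected URLs rather than with the size of the whole CrawlDb, which is the point of the change.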





[jira] [Commented] (NUTCH-2269) Clean not working after crawl

2017-04-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958866#comment-15958866
 ] 

Hudson commented on NUTCH-2269:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3422 (See 
[https://builds.apache.org/job/Nutch-trunk/3422/])
fix for NUTCH-2269 contributed by r0ann3l (snagel: 
[https://github.com/apache/nutch/commit/e040ace189aa0379b998c8852a09c1a1a2308d82])
* (edit) src/java/org/apache/nutch/indexer/CleaningJob.java


> Clean not working after crawl
> -
>
> Key: NUTCH-2269
> URL: https://issues.apache.org/jira/browse/NUTCH-2269
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.12
> Environment: Vagrant, Ubuntu, Java 8, Solr 4.10
>Reporter: Francesco Capponi
>Assignee: Lewis John McGibbney
> Fix For: 2.4, 1.14
>
>
> I have been having this problem for a while and had to roll back to the 
> old Solr clean instead of the newer version. 
> After it correctly inserts/updates every document in Nutch, it returns 
> error 255 when it tries to clean:
> {quote}
> 2016-05-30 10:13:04,992 WARN  output.FileOutputCommitter - Output Path is null in setupJob()
> 2016-05-30 10:13:07,284 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: content dest: content
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: title dest: title
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: host dest: host
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: segment dest: segment
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: boost dest: boost
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: digest dest: digest
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
> 2016-05-30 10:13:08,133 INFO  solr.SolrIndexWriter - SolrIndexer: deleting 15/15 documents
> 2016-05-30 10:13:08,919 WARN  output.FileOutputCommitter - Output Path is null in cleanupJob()
> 2016-05-30 10:13:08,937 WARN  mapred.LocalJobRunner - job_local662730477_0001
> java.lang.Exception: java.lang.IllegalStateException: Connection pool shut down
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.IllegalStateException: Connection pool shut down
>   at org.apache.http.util.Asserts.check(Asserts.java:34)
>   at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:169)
>   at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:202)
>   at org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184)
>   at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
>   at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>   at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>   at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
>   at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>   at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:480)
>   at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
>   at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
>   at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:150)
>   at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:483)
>   at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:464)
>   at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:190)
>   at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:178)
>   at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
>   at org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:120)
>   at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:237)
>   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExec

[jira] [Commented] (NUTCH-2193) Upgrade feed parser plugin to use rome 1.5

2017-04-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958961#comment-15958961
 ] 

Hudson commented on NUTCH-2193:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3423 (See 
[https://builds.apache.org/job/Nutch-trunk/3423/])
NUTCH-2193 Upgrade feed parser plugin to use rome 1.5.1 (snagel: 
[https://github.com/apache/nutch/commit/c1819539ba21a294c1afc12b876b83f74a1ce3e7])
* (edit) src/plugin/feed/ivy.xml
* (edit) src/plugin/feed/src/java/org/apache/nutch/parse/feed/FeedParser.java
* (edit) src/plugin/feed/plugin.xml


> Upgrade feed parser plugin to use rome 1.5
> --
>
> Key: NUTCH-2193
> URL: https://issues.apache.org/jira/browse/NUTCH-2193
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2193.patch
>
>
> The class loader issue in the rome library (NUTCH-1494, [[rometools 
> #130|https://github.com/rometools/rome/issues/130]]) is fixed with rome 1.5. 
> Time to upgrade.





[jira] [Commented] (NUTCH-2372) Javadocs build failing.

2017-04-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969920#comment-15969920
 ] 

Hudson commented on NUTCH-2372:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3425 (See 
[https://builds.apache.org/job/Nutch-trunk/3425/])
NUTCH-2372 Fixing the errors in documentation (omkarreddy2008: 
[https://github.com/apache/nutch/commit/61985f17998d1deaa47d8e56b46136e0fc1f4108])
* (edit) src/java/org/apache/nutch/util/MimeUtil.java
* (edit) 
src/plugin/index-anchor/src/java/org/apache/nutch/indexer/anchor/AnchorIndexingFilter.java
* (edit) src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java
* (edit) src/java/org/apache/nutch/util/TableUtil.java
* (edit) src/java/org/apache/nutch/segment/SegmentMerger.java
* (edit) 
src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java
* (edit) src/java/org/apache/nutch/tools/arc/ArcRecordReader.java
* (edit) src/java/org/apache/nutch/util/TrieStringMatcher.java
* (edit) src/java/org/apache/nutch/util/TimingUtil.java
* (edit) 
src/plugin/index-replace/src/java/org/apache/nutch/indexer/replace/ReplaceIndexer.java
* (edit) src/java/org/apache/nutch/tools/CommonCrawlFormatFactory.java
* (edit) src/java/org/apache/nutch/crawl/FetchSchedule.java
* (edit) src/java/org/apache/nutch/tools/FileDumper.java
* (edit) src/java/org/apache/nutch/tools/CommonCrawlFormatSimple.java
* (edit) 
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpBasicAuthentication.java
* (edit) src/java/org/apache/nutch/segment/SegmentPart.java
* (edit) 
src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java
* (edit) src/java/org/apache/nutch/util/EncodingDetector.java
* (edit) src/java/org/apache/nutch/crawl/Injector.java
* (edit) src/java/org/apache/nutch/crawl/Generator.java
* (edit) src/java/org/apache/nutch/service/impl/ConfManagerImpl.java
* (edit) 
src/plugin/mimetype-filter/src/java/org/apache/nutch/indexer/filter/MimeTypeIndexingFilter.java
* (edit) src/java/org/apache/nutch/util/LockUtil.java
* (edit) 
src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Client.java
* (edit) src/java/org/apache/nutch/hostdb/UpdateHostDbReducer.java
* (edit) 
src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java
* (edit) src/java/org/apache/nutch/tools/CommonCrawlFormat.java
* (edit) src/java/org/apache/nutch/util/SuffixStringMatcher.java
* (edit) 
src/plugin/urlnormalizer-querystring/src/java/org/apache/nutch/net/urlnormalizer/querystring/QuerystringURLNormalizer.java
* (edit) 
src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip/ZipTextExtractor.java
* (edit) src/java/org/apache/nutch/hostdb/UpdateHostDbMapper.java
* (edit) 
src/plugin/urlmeta/src/java/org/apache/nutch/scoring/urlmeta/URLMetaScoringFilter.java
* (edit) src/java/org/apache/nutch/plugin/PluginRepository.java
* (edit) 
src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyIndexWriter.java
* (edit) 
src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
* (edit) src/java/org/apache/nutch/net/URLNormalizers.java
* (edit) src/java/org/apache/nutch/tools/AbstractCommonCrawlFormat.java
* (edit) 
src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
* (edit) src/java/org/apache/nutch/util/PrefixStringMatcher.java
* (edit) 
src/plugin/lib-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/HtmlUnitWebDriver.java
* (edit) 
src/plugin/urlfilter-domain/src/java/org/apache/nutch/urlfilter/domain/DomainURLFilter.java
* (edit) 
src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java
* (edit) src/java/org/apache/nutch/parse/ParseResult.java
* (edit) 
src/plugin/urlmeta/src/java/org/apache/nutch/indexer/urlmeta/URLMetaIndexingFilter.java
* (edit) src/java/org/apache/nutch/util/URLUtil.java
* (edit) 
src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
* (edit) 
src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java
* (edit) 
src/plugin/index-geoip/src/java/org/apache/nutch/indexer/geoip/GeoIPIndexingFilter.java
* (edit) 
src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java
* (edit) src/java/org/apache/nutch/parse/ParserChecker.java
* (edit) src/java/org/apache/nutch/hostdb/ReadHostDb.java
* (edit) 
src/plugin/urlfilter-domainblacklist/src/java/org/apache/nutch/urlfilter/domainblacklist/DomainBlacklistURLFilter.java
* (edit) 
src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/OPICScoringFilter.java
* (edit) 
src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/util/LuceneTokenizer.java
* (edit) 
src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java


> Javadocs build failing.
> ---
>

[jira] [Commented] (NUTCH-2333) Indexer for RabbitMQ

2017-04-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969918#comment-15969918
 ] 

Hudson commented on NUTCH-2333:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3425 (See 
[https://builds.apache.org/job/Nutch-trunk/3425/])
Fixes for NUTCH-2333: Added the lines for ant runtime task (gitRoann3l;fhdez: 
[https://github.com/apache/nutch/commit/5873a24d3845563bd1028f6a27e22438670b4063])
* (edit) src/plugin/build.xml
* (edit) build.xml
Fixes for NUTCH-2333: Added the logic for indexing process (gitRoann3l;fhdez: 
[https://github.com/apache/nutch/commit/62496aec84cbf889f14175dbf03f0e8a1200ac9c])
* (add) 
src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitDocument.java
* (add) 
src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitMQConstants.java
* (add) 
src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitMessage.java
* (add) src/plugin/indexer-rabbit/plugin.xml
* (add) src/plugin/indexer-rabbit/build-ivy.xml
* (add) src/plugin/indexer-rabbit/build.xml
* (add) src/plugin/indexer-rabbit/ivy.xml
* (add) 
src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitIndexWriter.java
Fixes for NUTCH-2333: Added the properties for RabbitMQ indexer. 
(gitRoann3l;fhdez: 
[https://github.com/apache/nutch/commit/594564b27258fbcca68e90e41db801a750d11426])
* (edit) conf/nutch-default.xml
Fixes for NUTCH-2333: Added new properties to indexer (gitRoann3l;fhdez: 
[https://github.com/apache/nutch/commit/17886f722ff16da0aa29bd059953feca609a5165])
* (edit) conf/nutch-default.xml
Fixes for NUTCH-2333: Corrected some comments in the configuration file 
(gitRoann3l;fhdez: 
[https://github.com/apache/nutch/commit/c0af89aeb0e5c9e2059192eac7514cea3825b7e2])
* (edit) conf/nutch-default.xml
* (edit) 
src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitIndexWriter.java
* (edit) 
src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitMQConstants.java


> Indexer for RabbitMQ
> 
>
> Key: NUTCH-2333
> URL: https://issues.apache.org/jira/browse/NUTCH-2333
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: 1.12
>Reporter: Roannel Fernández Hernández
>Priority: Minor
> Fix For: 1.14
>
>
> A plugin to send the documents to a RabbitMQ server.





[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2017-04-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969919#comment-15969919
 ] 

Hudson commented on NUTCH-2132:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3425 (See 
[https://builds.apache.org/job/Nutch-trunk/3425/])
Fixes for NUTCH-2132: Added the library amqp (gitRoann3l;fhdez: 
[https://github.com/apache/nutch/commit/9d65ac6d6b33d83e1fc9ab387f29a3287b0b26b3])
* (edit) src/plugin/publish-rabbitmq/plugin.xml
Fixes for NUTCH-2132: Added new properties (gitRoann3l;fhdez: 
[https://github.com/apache/nutch/commit/bc9a2c859c2b2036aa58d899c529cbf6282a41df])
* (edit) conf/nutch-default.xml
* (edit) 
src/plugin/publish-rabbitmq/src/java/org/apache/nutch/publisher/rabbitmq/RabbitMQPublisherImpl.java
Fixes for NUTCH-2132: Deleted empty comments (gitRoann3l;fhdez: 
[https://github.com/apache/nutch/commit/5eb77a9de6d20f31a5fd6759022b25552744cc16])
* (edit) 
src/plugin/publish-rabbitmq/src/java/org/apache/nutch/publisher/rabbitmq/RabbitMQPublisherImpl.java
Fixes for NUTCH-2132: Fixed the default port (gitRoann3l;fhdez: 
[https://github.com/apache/nutch/commit/ee651752295468af75a14f6c98686c0d7c26136a])
* (edit) conf/nutch-default.xml
* (edit) 
src/plugin/publish-rabbitmq/src/java/org/apache/nutch/publisher/rabbitmq/RabbitMQPublisherImpl.java


> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.13
>
> Attachments: NUTCH-2132.patch, NUTCH-2132.v2.patch, 
> PubSub_routingkey.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events 
> (e.g., Fetcher events like fetch-start, fetch-end, or a fetch report which 
> may contain data like the outlinks of the currently fetched URL, score, etc.). 
> A consumer of this functionality could use this data to generate real-time 
> visualizations and statistics of the crawl without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a URL to 
> which a client could subscribe to get the fetcher events. 
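
The publish/subscribe idea in the description can be pictured with a minimal in-process sketch. All names below (FetchEventBus, FetchEvent, subscribe, publish) are hypothetical and are not taken from the Nutch code base; the publish-rabbitmq plugin named in the commits above sends events to an external broker, which this sketch does not attempt to model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical in-process event bus illustrating publish/subscribe.
class FetchEventBus {

  // Minimal event: a type such as "fetch-start" plus the URL concerned.
  static final class FetchEvent {
    final String type;
    final String url;

    FetchEvent(String type, String url) {
      this.type = type;
      this.url = url;
    }
  }

  private final List<Consumer<FetchEvent>> subscribers = new ArrayList<>();

  void subscribe(Consumer<FetchEvent> subscriber) {
    subscribers.add(subscriber);
  }

  // Delivers the event to every subscriber, in subscription order.
  void publish(FetchEvent event) {
    for (Consumer<FetchEvent> s : subscribers) {
      s.accept(event);
    }
  }

  public static void main(String[] args) {
    FetchEventBus bus = new FetchEventBus();
    bus.subscribe(e -> System.out.println(e.type + " " + e.url));
    bus.publish(new FetchEvent("fetch-start", "http://example.com/"));
    bus.publish(new FetchEvent("fetch-end", "http://example.com/"));
  }
}
```

A consumer that builds statistics would subscribe once and update its counters per event, instead of waiting for the fetch round to finish.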





[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2017-04-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973288#comment-15973288
 ] 

Hudson commented on NUTCH-2046:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3427 (See 
[https://builds.apache.org/job/Nutch-trunk/3427/])
fix for NUTCH-2046 contributed by jnioche (julien: 
[https://github.com/apache/nutch/commit/7b0103fe62c9b0e479bb03e7b9575522adcf68b8])
* (edit) src/bin/crawl


> The crawl script should be able to skip an initial injection.
> -
>
> Key: NUTCH-2046
> URL: https://issues.apache.org/jira/browse/NUTCH-2046
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb, injector
>Affects Versions: 1.10
>Reporter: Luis Lopez
>Assignee: Julien Nioche
>  Labels: crawl, injection
> Fix For: 1.14
>
> Attachments: crawl.patch
>
>
> When our crawl gets really big, a new injection takes considerable time as it 
> updates the crawldb. The crawl script should be able to skip the injection 
> and go directly to the generate call.





[jira] [Commented] (NUTCH-2353) Create seed file with metadata using the REST API

2017-05-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016992#comment-16016992
 ] 

Hudson commented on NUTCH-2353:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3428 (See 
[https://builds.apache.org/job/Nutch-trunk/3428/])
Fix for NUTCH-2353 contributed by jorgelbg (jlbetancourt: 
[https://github.com/apache/nutch/commit/7deb576bc58bb74725cbb6c5d82d7b9244c6ad42])
* (edit) src/java/org/apache/nutch/webui/model/SeedUrl.java
* (edit) src/java/org/apache/nutch/service/model/request/SeedUrl.java
* (edit) src/java/org/apache/nutch/service/resources/SeedResource.java


> Create seed file with metadata using the REST API
> -
>
> Key: NUTCH-2353
> URL: https://issues.apache.org/jira/browse/NUTCH-2353
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector, REST_api
>Affects Versions: 1.12
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: rest_api
> Fix For: 1.14
>
>
> At the moment it is not possible to create a seed file and specify any 
> metadata when using the REST API. The file gets created but there is no 
> option to add any metadata to the seed URLs.
> If we use a payload like this:
> {code}
> {
>   "name": "name-of-seedlist",
>   "seedUrls": [
>     {
>       "url": "http://example.com",
>       "metadata": {
>         "key1": "value1",
>         "key2": "value2",
>         "key3": "value3"
>       }
>     }
>   ]
> }
> {code}
> It should be easy to specify the desired metadata. This should also keep 
> backward compatibility (BC) with the previous array syntax if we only want to 
> specify the list of URLs without any metadata at all.

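
To make the request concrete, the sketch below renders one seed entry as a line of a seed file, using the Injector's convention of a URL followed by tab-separated key=value pairs. The helper SeedLineSketch.toSeedLine is hypothetical and is not the code added to SeedResource; it only shows that the metadata form and the bare-URL form can share one code path.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: renders one seed entry as a line of a seed file,
// in the form url<TAB>key=value<TAB>key=value.
class SeedLineSketch {

  static String toSeedLine(String url, Map<String, String> metadata) {
    StringBuilder line = new StringBuilder(url);
    if (metadata != null) {
      for (Map.Entry<String, String> e : metadata.entrySet()) {
        line.append('\t').append(e.getKey()).append('=').append(e.getValue());
      }
    }
    return line.toString();
  }

  public static void main(String[] args) {
    Map<String, String> meta = new LinkedHashMap<>();
    meta.put("key1", "value1");
    meta.put("key2", "value2");

    // New form: URL plus metadata.
    System.out.println(toSeedLine("http://example.com", meta));
    // Old form: a bare URL still yields a valid line.
    System.out.println(toSeedLine("http://example.com", null));
  }
}
```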




[jira] [Commented] (NUTCH-2353) Create seed file with metadata using the REST API

2017-05-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017310#comment-16017310
 ] 

Hudson commented on NUTCH-2353:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3429 (See 
[https://builds.apache.org/job/Nutch-trunk/3429/])
Fix for NUTCH-2353 contributed by jorgelbg (snagel: 
[https://github.com/apache/nutch/commit/0312bae38c9e95d496336dc24133b15ebefd4d3c])
* (edit) src/java/org/apache/nutch/webui/model/SeedUrl.java
* (edit) src/java/org/apache/nutch/service/model/request/SeedUrl.java
* (edit) src/java/org/apache/nutch/service/resources/SeedResource.java


> Create seed file with metadata using the REST API
> -
>
> Key: NUTCH-2353
> URL: https://issues.apache.org/jira/browse/NUTCH-2353
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector, REST_api
>Affects Versions: 1.12
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: rest_api
> Fix For: 1.14
>
>
> At the moment it's not possible to create a seed file and specify any metadata 
> when using the REST API. The file gets created but there is no option to add 
> any metadata to the seed URLs.
> If we use a payload like this:
> {code}
> {
>   "name": "name-of-seedlist",
>   "seedUrls": [
>     {
>       "url": "http://example.com",
>       "metadata": {
>         "key1": "value1",
>         "key2": "value2",
>         "key3": "value3"
>       }
>     }
>   ]
> }
> {code}
> It should be easy to specify the desired metadata. This should also stay 
> backward compatible with the previous array syntax, for when we only want to 
> specify the list of URLs without any metadata at all.
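
As a rough illustration of the request above, the sketch below turns one entry of the payload into a seed-file line, assuming the usual Nutch 1.x seed format of a URL followed by tab-separated key=value pairs. The class and method names are hypothetical and are not taken from the SeedUrl or SeedResource changes in the commit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not the actual SeedUrl/SeedResource code from the commit.
final class SeedLineSketch {

  /** Formats one seed-file line: the URL, then tab-separated key=value metadata. */
  static String toSeedLine(String url, Map<String, String> metadata) {
    StringBuilder line = new StringBuilder(url);
    if (metadata != null) {
      for (Map.Entry<String, String> e : metadata.entrySet()) {
        line.append('\t').append(e.getKey()).append('=').append(e.getValue());
      }
    }
    return line.toString();
  }

  public static void main(String[] args) {
    Map<String, String> metadata = new LinkedHashMap<>();
    metadata.put("key1", "value1");
    metadata.put("key2", "value2");
    System.out.println(toSeedLine("http://example.com", metadata));
    // Previous array syntax: a bare URL without metadata still yields a valid line.
    System.out.println(toSeedLine("http://example.org", null));
  }
}
```

Passing `null` (or an empty map) for the metadata models the backward-compatible case in which only a list of URLs is given.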





[jira] [Commented] (NUTCH-2376) Improve configurability of HTTP Accept* header fields

2017-05-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017309#comment-16017309
 ] 

Hudson commented on NUTCH-2376:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3429 (See 
[https://builds.apache.org/job/Nutch-trunk/3429/])
NUTCH-2376 Improve configurability of HTTP Accept* header fields - (snagel: 
[https://github.com/apache/nutch/commit/af9d7a3e68002860fcc178e21b869d2f79c27dee])
* (edit) 
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* (edit) 
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
* (edit) 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
* (edit) conf/nutch-default.xml


> Improve configurability of HTTP Accept* header fields
> -
>
> Key: NUTCH-2376
> URL: https://issues.apache.org/jira/browse/NUTCH-2376
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 2.3.1, 1.13
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 2.4, 1.14
>
>
> There should be no differences between protocol-http and protocol-httpclient 
> whether the HTTP header fields {{Accept}}, {{Accept-Language}} and 
> {{Accept-Charset}} are configurable. The configured values should be used for 
> both plugins. In addition,
> - it should be possible to unset the default values (overwrite with empty 
> value) so that no HTTP header field is sent
> - default values should be contained in nutch-default.xml
> Note: {{Accept-Encoding}} should not be configurable as the protocol plugins 
> must support the accepted compression codecs which may not be the case e.g. 
> for Brotli.
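
The behaviour described above (shared configuration, with an empty value suppressing the header) could look roughly like the following. This is only a sketch: the property names `http.accept`, `http.accept.language` and `http.accept.charset` and the plain `Map` standing in for the Nutch configuration are assumptions, and the values in `main` are placeholders rather than the defaults from nutch-default.xml.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: property names are assumptions, not copied from nutch-default.xml.
final class AcceptHeaderSketch {

  /** Builds the Accept* request headers, skipping any field configured as empty. */
  static Map<String, String> acceptHeaders(Map<String, String> conf) {
    Map<String, String> headers = new LinkedHashMap<>();
    putIfSet(headers, "Accept", conf.get("http.accept"));
    putIfSet(headers, "Accept-Language", conf.get("http.accept.language"));
    putIfSet(headers, "Accept-Charset", conf.get("http.accept.charset"));
    return headers;
  }

  private static void putIfSet(Map<String, String> headers, String name, String value) {
    // A missing or blank value means: do not send this header field at all.
    if (value != null && !value.trim().isEmpty()) {
      headers.put(name, value.trim());
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = new LinkedHashMap<>();
    conf.put("http.accept", "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8");
    conf.put("http.accept.language", "en-us,en;q=0.7");
    conf.put("http.accept.charset", ""); // overwritten with an empty value
    System.out.println(acceptHeaders(conf));
  }
}
```

Using one helper for both protocol plugins would keep their behaviour identical, which is the point of the issue. `Accept-Encoding` is deliberately left out, in line with the note above.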





[jira] [Commented] (NUTCH-2373) Indexer for Hbase

2017-05-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16020250#comment-16020250
 ] 

Hudson commented on NUTCH-2373:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1583 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1583/])
NUTCH-2373 HBaseIndexWriter - indexer for hbase implemented (kaidulislam90: 
[https://github.com/apache/nutch/commit/bda007601cb84bff5ce44f3e7b5906d2803f2504])
* (edit) conf/nutch-default.xml
* (add) 
src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/HBaseIndexWriter.java
* (add) src/plugin/indexer-hbase/ivy.xml
* (edit) ivy/ivy.xml
* (edit) src/plugin/build.xml
* (add) src/plugin/indexer-hbase/build.xml
* (add) 
src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/package-info.java
* (edit) build.xml
* (add) src/plugin/indexer-hbase/plugin.xml
* (add) 
src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/HBaseConstants.java
* (add) 
src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/HBaseMappingReader.java
NUTCH-2373 Multiple SLF4J bindings issue solved, unnecessary (ikaidul: 
[https://github.com/apache/nutch/commit/9541ad853f7e86d45028ce0c0e85603a3e988bee])
* (edit) src/plugin/indexer-hbase/plugin.xml
* (edit) src/plugin/indexer-hbase/ivy.xml
* (edit) ivy/ivy.xml
* (add) conf/hbaseindex-mapping.xml
NUTCH-2373 Extra newline removed (ikaidul: 
[https://github.com/apache/nutch/commit/42257b397b4699f8c7a4d33d366d35e67dd61c7d])
* (edit) 
src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/package-info.java
NUTCH-2373 Code formatted using Nutch eclipse-codeformat.xml (ikaidul: 
[https://github.com/apache/nutch/commit/0f023e84367a4bae37f2695d7dd0891b578d62c5])
* (edit) 
src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/HBaseMappingReader.java
* (edit) 
src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/HBaseIndexWriter.java
* (edit) 
src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/HBaseConstants.java
NUTCH-2373 getting mapped qualifier name from key issue solved (kaidulislam90: 
[https://github.com/apache/nutch/commit/7d6f3c3bb9761bae5ae75f48113cb2b59f1a])
* (edit) 
src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/HBaseIndexWriter.java
NUTCH-2373 Boilerplate default column family removed, considering first 
(kaidulislam90: 
[https://github.com/apache/nutch/commit/3db369994286cd535a4cba39bc4c1d882ac7e203])
* (edit) 
src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/HBaseMappingReader.java
NUTCH-2373 typo corrected in hbaseindex-mapping.xml (kaidulislam90: 
[https://github.com/apache/nutch/commit/0103f4d80d2b0f532d11d7c2451e58b276829419])
* (edit) conf/hbaseindex-mapping.xml
NUTCH-2373 An issue on document counting fixed, default batch-size 
(kaidulislam90: 
[https://github.com/apache/nutch/commit/9dad864f806e2152efc1b29f3bab76c164f21da0])
* (edit) conf/nutch-default.xml
* (edit) 
src/plugin/indexer-hbase/src/java/org/apache/nutch/indexwriter/hbase/HBaseIndexWriter.java


> Indexer for Hbase
> -
>
> Key: NUTCH-2373
> URL: https://issues.apache.org/jira/browse/NUTCH-2373
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: 2.3
>Reporter: Kaidul Islam
>Assignee: Kaidul Islam
> Fix For: 2.4
>
>
> Some use cases involve storing the documents in a database rather than in an 
> indexing search engine such as Solr or ElasticSearch. This is a plugin to 
> send the documents to HBase storage.





[jira] [Commented] (NUTCH-2388) bin/crawl indexing only webpages containing batchID instead of all in 2.x

2017-05-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16021527#comment-16021527
 ] 

Hudson commented on NUTCH-2388:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1584 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1584/])
NUTCH-2388 bin/crawl indexing only webpages containing batchID instead 
(kaidulislam90: 
[https://github.com/apache/nutch/commit/32a57b52a67cd5c2cb637c6fbae2dfce5a2c27b5])
* (edit) src/bin/crawl


> bin/crawl indexing only webpages containing batchID instead of all in 2.x
> -
>
> Key: NUTCH-2388
> URL: https://issues.apache.org/jira/browse/NUTCH-2388
> Project: Nutch
>  Issue Type: Bug
>  Components: bin
>Affects Versions: 2.3
>Reporter: Kaidul Islam
>Assignee: Kaidul Islam
>Priority: Trivial
> Fix For: 2.4
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> During each iteration, after generating, fetching, parsing and updating the 
> current batch into DB, the indexer is supposed to index the current batch 
> too. But currently it indexes everything.
> {code}
> __bin_nutch index $commonOptions -D solr.server.url=$SOLRURL -all -crawlId 
> "$CRAWL_ID"
> {code}
> It should be like below, I guess:
> {code}
> __bin_nutch index $commonOptions -D solr.server.url=$SOLRURL $batchId 
> -crawlId "$CRAWL_ID"
> {code}





[jira] [Commented] (NUTCH-2374) Upgrade Nutch 2.X to Gora 0.7

2017-06-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051334#comment-16051334
 ] 

Hudson commented on NUTCH-2374:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1586 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1586/])
NUTCH-2374 Upgrade Nutch 2.X to Gora 0.7 (lakshmi: 
[https://github.com/apache/nutch/commit/b92aa37a8291892329d067c0d18a8ea808a22d13])
* (edit) ivy/ivy.xml
* (edit) src/java/org/apache/nutch/host/HostDbUpdateJob.java
* (edit) src/java/org/apache/nutch/util/domain/DomainStatistics.java
* (edit) src/java/org/apache/nutch/crawl/WebTableReader.java
* (edit) src/java/org/apache/nutch/storage/StorageUtils.java


> Upgrade Nutch 2.X to Gora 0.7
> -
>
> Key: NUTCH-2374
> URL: https://issues.apache.org/jira/browse/NUTCH-2374
> Project: Nutch
>  Issue Type: Bug
>  Components: build, storage
>Reporter: Lewis John McGibbney
> Fix For: 2.4
>
>
> We should make the upgrades before we release Nutch 2.X.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2397) Parser to add paragraph line breaks

2017-07-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074513#comment-16074513
 ] 

Hudson commented on NUTCH-2397:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3432 (See 
[https://builds.apache.org/job/Nutch-trunk/3432/])
Fix for NUTCH-2397 (improved solution contributed by Vipul Behl, closes 
(snagel: 
[https://github.com/apache/nutch/commit/48c38b03f3cfb73402431f262990a6d091570e9a])
* (edit) 
src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
* (edit) 
src/plugin/parse-zip/src/test/org/apache/nutch/parse/zip/TestZipParser.java
* (edit) 
src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java


> Parser to add paragraph line breaks
> ---
>
> Key: NUTCH-2397
> URL: https://issues.apache.org/jira/browse/NUTCH-2397
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 2.3.1, 1.13
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 2.4, 1.14
>
>
> (initially reported with patch/pull-request by Vipul Behl, see 
> [#190|https://github.com/apache/nutch/pull/190])
> The parser (parse-tika and parse-html) could be improved to add line breaks 
> between paragraphs, instead of writing the whole document into a single line.
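
To illustrate the idea of the improvement, here is a minimal sketch that walks a DOM tree and emits a line break after each block-level element. It uses the JDK's XML parser on well-formed XHTML with lowercase tag names. The class name and the set of block elements are assumptions for illustration, and this is not the DOMContentUtils code touched by the commit.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch; the real parser plugins work on their own DOM and element lists.
final class ParagraphBreakSketch {

  // Assumed set of block-level elements (lowercase names only).
  private static final Set<String> BLOCK =
      new HashSet<>(Arrays.asList("p", "div", "br", "li", "h1", "h2", "h3"));

  /** Extracts the text of a well-formed XHTML string, one line per block element. */
  static String extractText(String xhtml) {
    try {
      Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml))).getDocumentElement();
      StringBuilder sb = new StringBuilder();
      walk(root, sb);
      return sb.toString();
    } catch (Exception e) {
      throw new IllegalArgumentException("input is not well-formed XHTML", e);
    }
  }

  private static void walk(Node node, StringBuilder sb) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      sb.append(node.getNodeValue());
      return;
    }
    for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
      walk(child, sb);
    }
    // Closing a block element ends the current line.
    if (BLOCK.contains(node.getNodeName())) {
      sb.append('\n');
    }
  }

  public static void main(String[] args) {
    System.out.print(extractText("<html><body><p>one</p><p>two</p></body></html>"));
  }
}
```

Without the check on block elements, the two paragraphs would be concatenated into the single line "onetwo".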





[jira] [Commented] (NUTCH-2391) Spurious Duplications for MD5

2017-07-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074988#comment-16074988
 ] 

Hudson commented on NUTCH-2391:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1587 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1587/])
NUTCH-2393 Fix for issue addressed in NUTCH-2391 (kaidulislam90: 
[https://github.com/apache/nutch/commit/ef33ba7db80d08d5ef56501bcc45baadfee14dfc])
* (edit) src/java/org/apache/nutch/crawl/MD5Signature.java


> Spurious Duplications for MD5
> -
>
> Key: NUTCH-2391
> URL: https://issues.apache.org/jira/browse/NUTCH-2391
> Project: Nutch
>  Issue Type: Bug
>  Components: commoncrawl
>Affects Versions: 1.11
>Reporter: David Johnson
>Priority: Minor
> Fix For: 1.14
>
>
> We're seeing some incidence of a large number of documents being marked as 
> duplicate in our crawl.
> We traced it back to one of the crawl plugins returning an empty array for 
> the content field.
> We'd like to propose changing the MD5 signature generation from:
> {code}
> public byte[] calculate(Content content, Parse parse) {
>   byte[] data = content.getContent();
>   if (data == null)
>     data = content.getUrl().getBytes();
>   return MD5Hash.digest(data).getDigest();
> }
> {code}
> to:
> {code}
> public byte[] calculate(Content content, Parse parse) {
>   byte[] data = content.getContent();
>   if ((data == null) || (data.length == 0))
>     data = content.getUrl().getBytes();
>   return MD5Hash.digest(data).getDigest();
> }
> {code}
> to address the issue
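
The effect of the proposed change can be reproduced outside of Nutch. The sketch below is a stand-alone approximation: it uses `java.security.MessageDigest` instead of Hadoop's `MD5Hash`, takes the URL as a plain string, and encodes it as UTF-8 (the snippet above uses the platform default charset). The class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Stand-alone sketch using java.security.MessageDigest instead of Hadoop's MD5Hash.
final class Md5FallbackSketch {

  static byte[] md5(byte[] data) {
    try {
      return MessageDigest.getInstance("MD5").digest(data);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  /** Signature with the proposed fall-back: hash the URL when content is null or empty. */
  static byte[] signature(byte[] content, String url) {
    byte[] data = content;
    if (data == null || data.length == 0) {
      data = url.getBytes(StandardCharsets.UTF_8);
    }
    return md5(data);
  }

  public static void main(String[] args) {
    byte[] empty = new byte[0];
    // Without the fall-back, every empty document hashes to the same digest.
    System.out.println(Arrays.equals(md5(empty), md5(new byte[0])));
    // With the fall-back, empty documents at different URLs get different signatures.
    System.out.println(Arrays.equals(
        signature(empty, "http://example.com/a"),
        signature(empty, "http://example.com/b")));
  }
}
```

Because the MD5 digest of zero bytes is a constant, every page whose content comes back as an empty array ends up with the same signature. That is consistent with the spurious duplicates described above.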





[jira] [Commented] (NUTCH-2374) Upgrade Nutch 2.X to Gora 0.7

2017-07-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074989#comment-16074989
 ] 

Hudson commented on NUTCH-2374:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1587 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1587/])
NUTCH-2374 Upgrade Nutch 2.X to Gora 0.7 (snagel: 
[https://github.com/apache/nutch/commit/63b58e8889297ca4afcff2de4c8b1f86d657dbf2])
* (edit) src/java/org/apache/nutch/storage/StorageUtils.java
* (edit) ivy/ivy.xml
* (edit) src/java/org/apache/nutch/host/HostDbUpdateJob.java
* (edit) src/java/org/apache/nutch/util/domain/DomainStatistics.java
* (edit) src/java/org/apache/nutch/crawl/WebTableReader.java


> Upgrade Nutch 2.X to Gora 0.7
> -
>
> Key: NUTCH-2374
> URL: https://issues.apache.org/jira/browse/NUTCH-2374
> Project: Nutch
>  Issue Type: Bug
>  Components: build, storage
>Reporter: Lewis John McGibbney
> Fix For: 2.4
>
>
> We should make the upgrades before we release Nutch 2.X.





[jira] [Commented] (NUTCH-2393) 2.x patch for MD5 duplication issue addressed in NUTCH-2391

2017-07-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074987#comment-16074987
 ] 

Hudson commented on NUTCH-2393:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1587 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1587/])
NUTCH-2393 Fix for issue addressed in NUTCH-2391 (kaidulislam90: 
[https://github.com/apache/nutch/commit/ef33ba7db80d08d5ef56501bcc45baadfee14dfc])
* (edit) src/java/org/apache/nutch/crawl/MD5Signature.java
NUTCH-2393 checking buf.remaining() instead of buf.array().length == 0 
(kaidulislam90: 
[https://github.com/apache/nutch/commit/3d0c1a765990f38a63172ff1016f3940325f5b59])
* (edit) src/java/org/apache/nutch/crawl/MD5Signature.java


> 2.x patch for MD5 duplication issue addressed in NUTCH-2391
> ---
>
> Key: NUTCH-2393
> URL: https://issues.apache.org/jira/browse/NUTCH-2393
> Project: Nutch
>  Issue Type: Bug
>  Components: commoncrawl
>Affects Versions: 2.3.1
>Reporter: Kaidul Islam
>Assignee: Kaidul Islam
>Priority: Minor
> Fix For: 2.4
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Equivalent patch for 2.x for issue addressed in NUTCH-2391
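
The second commit message above mentions checking `buf.remaining()` instead of `buf.array().length == 0`, which suggests the 2.x code works on a `java.nio.ByteBuffer`. A minimal sketch of why the two checks can disagree follows; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical helper illustrating the two emptiness checks on a ByteBuffer.
final class ByteBufferEmptinessSketch {

  /** True when the buffer has no readable bytes left, whatever its backing array holds. */
  static boolean isEmpty(ByteBuffer buf) {
    return buf == null || buf.remaining() == 0;
  }

  public static void main(String[] args) {
    // A buffer with no readable bytes over a non-empty backing array.
    ByteBuffer buf = ByteBuffer.wrap(new byte[] {1, 2, 3}, 3, 0);
    System.out.println(buf.array().length); // length of the backing array
    System.out.println(buf.remaining());    // number of readable bytes
    System.out.println(isEmpty(buf));
  }
}
```

A buffer can share or reuse a larger backing array, so the array length says little about how much content is actually readable. `remaining()` reflects the current position and limit.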





[jira] [Commented] (NUTCH-2391) Spurious Duplications for MD5

2017-07-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075010#comment-16075010
 ] 

Hudson commented on NUTCH-2391:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3433 (See 
[https://builds.apache.org/job/Nutch-trunk/3433/])
NUTCH-2391 use URL for MD5 digest as fall-back if content is empty (snagel: 
[https://github.com/apache/nutch/commit/d35b433c397c03e78245c3e262ecaa31c78a564e])
* (edit) src/java/org/apache/nutch/crawl/MD5Signature.java


> Spurious Duplications for MD5
> -
>
> Key: NUTCH-2391
> URL: https://issues.apache.org/jira/browse/NUTCH-2391
> Project: Nutch
>  Issue Type: Bug
>  Components: commoncrawl
>Affects Versions: 1.11
>Reporter: David Johnson
>Priority: Minor
> Fix For: 1.14
>
>
> We're seeing some incidence of a large number of documents being marked as 
> duplicate in our crawl.
> We traced it back to one of the crawl plugins returning an empty array for 
> the content field.
> We'd like to propose changing the MD5 signature generation from:
> {code}
> public byte[] calculate(Content content, Parse parse) {
>   byte[] data = content.getContent();
>   if (data == null)
>     data = content.getUrl().getBytes();
>   return MD5Hash.digest(data).getDigest();
> }
> {code}
> to:
> {code}
> public byte[] calculate(Content content, Parse parse) {
>   byte[] data = content.getContent();
>   if ((data == null) || (data.length == 0))
>     data = content.getUrl().getBytes();
>   return MD5Hash.digest(data).getDigest();
> }
> {code}
> to address the issue





[jira] [Commented] (NUTCH-2398) Fetcher saving redirected robots.txt under redirect target URL

2017-07-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089934#comment-16089934
 ] 

Hudson commented on NUTCH-2398:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3434 (See 
[https://builds.apache.org/job/Nutch-trunk/3434/])
NUTCH-2398: Save content of redirected robots.txt under redirect target 
(snagel: 
[https://github.com/apache/nutch/commit/620b85df36d0c802f333a56ca1ef7021a7935360])
* (edit) 
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java


> Fetcher saving redirected robots.txt under redirect target URL
> --
>
> Key: NUTCH-2398
> URL: https://issues.apache.org/jira/browse/NUTCH-2398
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.14
>
>
> NUTCH-2300 lets the Fetcher optionally store the robots.txt response (content 
> and HTTP status). If the '.../robots.txt' is redirected, the redirected 
> content is also stored but with the redirect source URL as key. It should use 
> the redirect target URL instead. Otherwise one of the responses is 
> overwritten in the segments map file.
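
The overwrite described above is easy to reproduce with any key-value store. In the sketch below a plain `LinkedHashMap` stands in for the segment's map file, and the class, method and parameter names are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model: a map keyed by URL stands in for the segment map file.
final class RobotsRedirectKeySketch {

  /** Stores the redirect response and the redirected content of a robots.txt fetch. */
  static Map<String, String> store(String sourceUrl, String redirectResponse,
      String targetUrl, String targetContent, boolean keyByTarget) {
    Map<String, String> segment = new LinkedHashMap<>();
    segment.put(sourceUrl, redirectResponse);
    // Keyed by the source URL, this second put overwrites the redirect response.
    segment.put(keyByTarget ? targetUrl : sourceUrl, targetContent);
    return segment;
  }

  public static void main(String[] args) {
    String source = "http://example.com/robots.txt";
    String target = "http://www.example.com/robots.txt";
    System.out.println(store(source, "301 -> " + target, target, "User-agent: *", false).size());
    System.out.println(store(source, "301 -> " + target, target, "User-agent: *", true).size());
  }
}
```

With `keyByTarget` set to `false` only one entry survives. With `true` both responses are kept, which is the behaviour the issue asks for.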





[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092964#comment-16092964
 ] 

Hudson commented on NUTCH-1465:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3435 (See 
[https://builds.apache.org/job/Nutch-trunk/3435/])
NUTCH-1465 (markus: 
[https://github.com/apache/nutch/commit/b58d6cd9111b2d25b8f6f009015ac214bac4006d])
* (edit) conf/log4j.properties
* (add) src/java/org/apache/nutch/util/SitemapProcessor.java
* (edit) ivy/ivy.xml
* (edit) conf/nutch-default.xml
* (edit) src/bin/nutch


> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-1465.patch, NUTCH-1465.patch, NUTCH-1465.patch, 
> NUTCH-1465.patch, NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html





[jira] [Commented] (NUTCH-2403) Nutch Selenium: Wrong documentation about PhantomJS

2017-07-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099508#comment-16099508
 ] 

Hudson commented on NUTCH-2403:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3438 (See 
[https://builds.apache.org/job/Nutch-trunk/3438/])
NUTCH-2403: Fix spelling of phantomJS configuration (moreno: 
[https://github.com/apache/nutch/commit/df5f96289097c44d4b4f405a2449b3352363d8e0])
* (edit) src/plugin/protocol-selenium/README.md


> Nutch Selenium: Wrong documentation about PhantomJS
> ---
>
> Key: NUTCH-2403
> URL: https://issues.apache.org/jira/browse/NUTCH-2403
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation, plugin
>Affects Versions: 1.13
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
> Fix For: 1.14
>
>
> The Nutch Selenium documentation states that PhantomJS can be used as 
> {{phantomJS}} for {{selenium.driver}}. The correct value would be 
> {{phantomjs}} according to 
> https://github.com/apache/nutch/blob/master/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L124





[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-07-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16102106#comment-16102106
 ] 

Hudson commented on NUTCH-2368:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3446 (See 
[https://builds.apache.org/job/Nutch-trunk/3446/])
NUTCH-2368 Variable generate.max.count and fetcher.server.delay (markus: 
[https://github.com/apache/nutch/commit/44f7ad973f2017bacde2bf5277f846179eafc6dd])
* (edit) src/java/org/apache/nutch/fetcher/FetchItemQueue.java
* (edit) conf/nutch-default.xml
* (edit) src/java/org/apache/nutch/crawl/Generator.java


> Variable generate.max.count and fetcher.server.delay
> 
>
> Key: NUTCH-2368
> URL: https://issues.apache.org/jira/browse/NUTCH-2368
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, 
> NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, 
> NUTCH-2368.patch, NUTCH-2368.patch
>
>
> In some cases we need to use host-specific characteristics in determining 
> crawl speed and bulk sizes because with our (Openindex) settings we can just 
> recrawl hosts with up to 800k URLs.
> This patch solves the problem by introducing the HostDB to the Generator and 
> providing powerful Jexl expressions. Check these two expressions added to the 
> Generator:
> {code}
> -Dgenerate.max.count.expr='
> if (unfetched + fetched > 80) {
>   return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 
> 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1)
> } else {
>   return conf.getDouble("generate.max.count", 300);
> }'
> -Dgenerate.fetch.delay.expr='
> if (unfetched + fetched > 80) {
>   return (pct95._rs_ + 500);
> } else {
>   return conf.getDouble("fetcher.server.delay", 1000)
> }'
> {code}
> For each large host: select as many records as can be fetched within the 
> fetch time limit, based on the number of threads and the 95th percentile 
> response time. Or: queueMaxCount = (timelimit / responsetime) * numThreads.
> The second expression just follows up on that, setting the crawlDelay of the 
> fetch queue.
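
As a worked example of the formula above, the following sketch recomputes both expressions in plain Java. It assumes floating-point arithmetic throughout (Jexl's own number handling may differ) and treats `pct95._rs_` as the 95th percentile response time in milliseconds. The class, method and parameter names are hypothetical.

```java
// Hypothetical re-implementation of the two Jexl expressions, for illustration only.
final class HostQuotaSketch {

  // From the expressions: hosts with unfetched + fetched > 80 count as large.
  static final long LARGE_HOST_THRESHOLD = 80;

  /** queueMaxCount = (timelimit / responsetime) * numThreads for large hosts. */
  static double maxCount(long unfetched, long fetched, int timeLimitMins,
      double pct95ResponseMs, int threadsPerQueue, double defaultMaxCount) {
    if (unfetched + fetched > LARGE_HOST_THRESHOLD) {
      double timeLimitSecs = timeLimitMins * 60.0;
      double secsPerFetch = (pct95ResponseMs + 500.0) / 1000.0; // 500 ms added as in the expression
      return timeLimitSecs / secsPerFetch * threadsPerQueue;
    }
    return defaultMaxCount;
  }

  /** Fetch delay in milliseconds, following the second expression. */
  static double fetchDelayMs(long unfetched, long fetched,
      double pct95ResponseMs, double defaultDelayMs) {
    return (unfetched + fetched > LARGE_HOST_THRESHOLD)
        ? pct95ResponseMs + 500.0
        : defaultDelayMs;
  }

  public static void main(String[] args) {
    // 12 min = 720 s; (1500 ms + 500 ms) = 2 s per fetch; 720 / 2 * 1 thread = 360 URLs.
    System.out.println(maxCount(1000, 500, 12, 1500.0, 1, 300.0));
    System.out.println(fetchDelayMs(1000, 500, 1500.0, 1000.0));
  }
}
```

Small hosts fall through to the configured defaults (300 URLs and a 1000 ms delay in the expressions above), so only hosts with more than 80 known URLs get a quota derived from their measured response time.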





[jira] [Commented] (NUTCH-2389) Precise data parsing using Jsoup CSS selectors

2017-07-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106551#comment-16106551
 ] 

Hudson commented on NUTCH-2389:
---

FAILURE: Integrated in Jenkins build Nutch-nutchgora #1588 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1588/])
NUTCH-2389 jsoup-extractor with parse filter, indexing filter and unit 
(ikaidul: 
[https://github.com/apache/nutch/commit/f41735cb3c96650f6a51f1c5eb87566572bf1679])
* (add) src/plugin/jsoup-extractor/plugin.xml
* (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/JsoupExtractorConstants.java
* (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/JsoupDocumentReader.java
* (add) 
src/plugin/jsoup-extractor/src/test/org/apache/nutch/parse/jsoup/extractor/TestJsoupParser.java
* (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/package-info.java
* (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/parse/JsoupHtmlParser.java
* (edit) src/plugin/build.xml
* (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/normalizer/package-info.java
* (edit) conf/nutch-default.xml
* (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/parse/package-info.java
* (add) conf/jsoup-extractor-sample.xml
* (edit) build.xml
* (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/indexer/package-info.java
* (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/normalizer/SimpleStringNormalizer.java
* (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/indexer/JsoupIndexingFilter.java
* (add) 
src/plugin/jsoup-extractor/src/test/org/apache/nutch/parse/jsoup/extractor/ViewCountNormalizer.java
* (add) conf/jsoup-extractor.xml
* (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/normalizer/Normalizable.java
* (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/JsoupDocument.java
* (add) src/plugin/jsoup-extractor/build.xml
NUTCH-2389 jsoup-extractor/ivy.xml commited (ikaidul: 
[https://github.com/apache/nutch/commit/fe6997f30e4bcffe962da4d09ae73f379c026a76])
* (add) src/plugin/jsoup-extractor/ivy.xml
NUTCH-2389 Unit test implemented but not passed (ikaidul: 
[https://github.com/apache/nutch/commit/17bd8f6e87f4fa4fd35c5aecfa09d8ef3bea6fd7])
* (delete) conf/jsoup-extractor-sample.xml
* (edit) conf/jsoup-extractor.xml
* (edit) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/JsoupDocumentReader.java
* (edit) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/parse/JsoupHtmlParser.java
* (edit) src/plugin/jsoup-extractor/plugin.xml
* (edit) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/JsoupExtractorConstants.java
* (edit) 
src/plugin/jsoup-extractor/src/test/org/apache/nutch/parse/jsoup/extractor/TestJsoupParser.java
NUTCH-2389 package name changed (kaidulislam90: 
[https://github.com/apache/nutch/commit/52e785d6f8ebf6f57150b255df380510f6ebcf6b])
* (delete) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/indexer/JsoupIndexingFilter.java
* (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/core/jsoup/extractor/JsoupExtractorConstants.java
* (delete) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/package-info.java
* (delete) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/parse/package-info.java
* (edit) src/plugin/jsoup-extractor/plugin.xml
* (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/core/jsoup/extractor/normalizer/Normalizable.java
* (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/parse/jsoup/extractor/JsoupHtmlParser.java
* (delete) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/JsoupDocument.java
* (delete) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/parse/JsoupHtmlParser.java
* (delete) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/JsoupExtractorConstants.java
* (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/core/jsoup/extractor/JsoupDocument.java
* (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/core/jsoup/extractor/normalizer/SimpleStringNormalizer.java
* (delete) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/indexer/package-info.java
* (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/core/jsoup/extractor/JsoupDocumentReader.java
* (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/indexer/jsoup/extractor/package-info.java
* (delete) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/normalizer/SimpleStringNormalizer.java
* (delete) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/jsoup/extractor/core/normalizer/package-info.java

[jira] [Commented] (NUTCH-2404) Failed Jenkin Build #1588 error in unit test resolved

2017-07-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16107675#comment-16107675
 ] 

Hudson commented on NUTCH-2404:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1589 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1589/])
NUTCH-2404 Fix for Failed Jenkin build #1588 after merging pull request 
(kaidulislam90: 
[https://github.com/apache/nutch/commit/a6870de1bdd518900d3546b3fe68d46d370db76c])
* (edit) src/plugin/jsoup-extractor/build.xml
* (edit) 
src/plugin/jsoup-extractor/src/test/org/apache/nutch/parse/jsoup/extractor/TestJsoupHtmlParser.java


> Failed Jenkin Build #1588 error in unit test resolved
> -
>
> Key: NUTCH-2404
> URL: https://issues.apache.org/jira/browse/NUTCH-2404
> Project: Nutch
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.4
>Reporter: Kaidul Islam
>Assignee: Kaidul Islam
> Fix For: 2.4
>
>
> Fix for Jenkin Build #1588 after merging pull request #192 (NUTCH-2389).





[jira] [Commented] (NUTCH-2405) jsoup-extractor structure correction, typo fixed

2017-08-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120339#comment-16120339
 ] 

Hudson commented on NUTCH-2405:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1590 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1590/])
NUTCH-2405 1. Missed root tag  added in jsoup-extractor.xml 
(kaidulislam90: 
[https://github.com/apache/nutch/commit/49ff77e83cc1e62cf10c377027c122e6a7d83128])
* (edit) conf/jsoup-extractor.xml
* (edit) conf/jsoup-extractor-example.xml
* (edit) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/parse/jsoup/extractor/JsoupHtmlParser.java


> jsoup-extractor structure correction, typo fixed
> 
>
> Key: NUTCH-2405
> URL: https://issues.apache.org/jira/browse/NUTCH-2405
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin
>Affects Versions: 2.4
>Reporter: Kaidul Islam
>Assignee: Kaidul Islam
>Priority: Minor
> Fix For: 2.4
>
>
> Several bugs encountered while testing with my project have been fixed:
> 1. Missed root tag  added in jsoup-extractor.xml like 
> jsoup-extractor-example.xml
> 2. jsoup API text() used instead of ownText() to get full contents under CSS 
> selector
> 3.  =>  typo fixed





[jira] [Commented] (NUTCH-2408) CrawlDb: allow update from unparsed segments

2017-08-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127123#comment-16127123
 ] 

Hudson commented on NUTCH-2408:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3448 (See 
[https://builds.apache.org/job/Nutch-trunk/3448/])
NUTCH-2408 CrawlDb: allow update from unparsed segments (snagel: 
[https://github.com/apache/nutch/commit/a7d0ac2724e1c16f8071a5d734f092d1bc03cac1])
* (edit) src/java/org/apache/nutch/crawl/CrawlDb.java


> CrawlDb: allow update from unparsed segments
> 
>
> Key: NUTCH-2408
> URL: https://issues.apache.org/jira/browse/NUTCH-2408
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.14
>
>
> The command updatedb (class o.a.n.crawl.CrawlDb) does not allow updating the 
> CrawlDb with fetch status only (from segment subdirectory crawl_fetch) 
> without also reading crawl_parse (which contains outlinks but also scores, 
> signatures and meta data). 
> A workflow which does not require parsing of documents (e.g., because raw 
> HTML content is exported to WARC files) is then unable to update the CrawlDb 
> to store the fetch status.





[jira] [Commented] (NUTCH-2400) Solr 6.6.0 compatibility

2017-08-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127517#comment-16127517
 ] 

Hudson commented on NUTCH-2400:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3449 (See 
[https://builds.apache.org/job/Nutch-trunk/3449/])
NUTCH-2400 Solr 6.6.0 compatibility (lewis.mcgibbney: 
[https://github.com/apache/nutch/commit/1857e624db1c1671edeb58c6ea9e861cbb435440])
* (edit) conf/schema.xml
NUTCH-2400 Solr 6.6.0 compatibility (lewis.mcgibbney: 
[https://github.com/apache/nutch/commit/4115dcaf6d2f0e55354fef88649f85c04bc7584b])
* (edit) conf/schema.xml


> Solr 6.6.0 compatibility
> 
>
> Key: NUTCH-2400
> URL: https://issues.apache.org/jira/browse/NUTCH-2400
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.13
> Environment: Nutch 1.14-SNAPSHOT Solr 6.6.0
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.14
>
> Attachments: managed-schema
>
>
> This issue relates to following mailing list thread 
> http://www.mail-archive.com/user%40nutch.apache.org/msg15574.html
> The schema.xml upgrade works with Solr 6.6.0, please try it out and let me 
> know how things go.
> I've also updated the tutorial at https://wiki.apache.org/nutch/NutchTutorial 
> so please check that out as well.





[jira] [Commented] (NUTCH-2378) ChildFirst plugin classloader

2017-08-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128751#comment-16128751
 ] 

Hudson commented on NUTCH-2378:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3450 (See 
[https://builds.apache.org/job/Nutch-trunk/3450/])
NUTCH-2378 ChildFirst plugin classloader (contributed by Jurian (snagel: 
[https://github.com/apache/nutch/commit/e66d44d9c290c550e78edb425a43e010b861172c])
* (edit) src/plugin/indexer-solr/plugin.xml
* (edit) src/plugin/parsefilter-naivebayes/plugin.xml
* (edit) src/plugin/parse-tika/plugin.xml
* (edit) src/java/org/apache/nutch/plugin/PluginClassLoader.java
* (edit) src/plugin/parse-tika/ivy.xml


> ChildFirst plugin classloader
> -
>
> Key: NUTCH-2378
> URL: https://issues.apache.org/jira/browse/NUTCH-2378
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Assignee: Sebastian Nagel
> Fix For: 2.4, 1.14
>
> Attachments: NUTCH-2378-childfirst-plugin-classloader.patch
>
>
> While working on upgrading the indexer-elastic plugin from 2.x to 5.x, I ran 
> into several nasty runtime dependency issues (both local and on Hadoop). 
> After seeking help on the mailing list, I still was unable to resolve these 
> issues and after digging further, decided to try a different plugin 
> classloader strategy. 
> The normal classloader delegates class loading requests to its parent 
> classloader. This can cause all sorts of nasty runtime dependency version 
> conflicts (jar hell, version conflicts), since the plugin's own classloader 
> gets queried last. The child-first classloader approach tries to load a class 
> from the plugin's dependencies first and when unavailable, delegates to its 
> parent classloader. This fixed the issues I had.
> The new approach can give runtime LinkageErrors, but these are easily 
> resolvable (see the patch for a few examples)
> I've tested the new loader a bit and am curious about others' findings.
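
The child-first strategy described above can be sketched roughly as follows. This is a minimal illustration and not the code from the attached patch: the class name ChildFirstClassLoader is hypothetical, and the rule that "java." classes always go to the parent is an assumption made here to keep core classes consistent.

```java
import java.net.URL;
import java.net.URLClassLoader;

/** Hypothetical child-first loader; not the PluginClassLoader from the patch. */
final class ChildFirstClassLoader extends URLClassLoader {

  ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
    super(urls, parent);
  }

  @Override
  protected Class<?> loadClass(String name, boolean resolve)
      throws ClassNotFoundException {
    synchronized (getClassLoadingLock(name)) {
      // Reuse a class this loader has already defined.
      Class<?> c = findLoadedClass(name);
      if (c == null) {
        if (name.startsWith("java.")) {
          // Assumption: core classes always come from the parent chain.
          c = super.loadClass(name, false);
        } else {
          try {
            // Child first: search the plugin's own URLs before the parent.
            c = findClass(name);
          } catch (ClassNotFoundException e) {
            // Not shipped with the plugin: use normal parent delegation.
            c = super.loadClass(name, false);
          }
        }
      }
      if (resolve) {
        resolveClass(c);
      }
      return c;
    }
  }
}
```

With this ordering a plugin's bundled dependency wins over a different version on the parent classpath. The same ordering is what can produce LinkageErrors when a class is loaded by both the child and the parent.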





[jira] [Commented] (NUTCH-2378) ChildFirst plugin classloader

2017-08-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16132763#comment-16132763
 ] 

Hudson commented on NUTCH-2378:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1591 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1591/])
NUTCH-2378 ChildFirst plugin classloader (contributed by Jurian (snagel: 
[https://github.com/apache/nutch/commit/93fb5395478e982e45e8bebbf69435db1a8ce5e7])
* (edit) src/java/org/apache/nutch/plugin/PluginClassLoader.java
* (edit) src/plugin/parse-tika/plugin.xml
NUTCH-2378 ChildFirst plugin classloader - fix jsoup-extractor: all (snagel: 
[https://github.com/apache/nutch/commit/e1d9191158cc2519987c5646c64eaf5a11603089])
* (delete) 
src/plugin/jsoup-extractor/src/test/org/apache/nutch/parse/jsoup/extractor/ViewCountNormalizer.java
* (edit) src/java/org/apache/nutch/plugin/Extension.java
* (add) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/parse/jsoup/extractor/ViewCountNormalizer.java
* (edit) 
src/plugin/jsoup-extractor/src/java/org/apache/nutch/core/jsoup/extractor/JsoupDocumentReader.java


> ChildFirst plugin classloader
> -
>
> Key: NUTCH-2378
> URL: https://issues.apache.org/jira/browse/NUTCH-2378
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Assignee: Sebastian Nagel
> Fix For: 2.4, 1.14
>
> Attachments: NUTCH-2378-childfirst-plugin-classloader.patch
>
>
> While working on upgrading the indexer-elastic plugin from 2.x to 5.x, I ran 
> into several nasty runtime dependency issues (both local and on Hadoop). 
> After seeking help on the mailing list, I still was unable to resolve these 
> issues and after digging further, decided to try a different plugin 
> classloader strategy. 
> The normal classloader delegates class loading requests to its parent 
> classloader. This can cause all sorts of nasty runtime dependency version 
> conflicts (jar hell, version conflicts), since the plugin's own classloader 
> gets queried last. The child-first classloader approach tries to load a class 
> from the plugin's dependencies first and when unavailable, delegates to its 
> parent classloader. This fixed the issues I had.
> The new approach can give runtime LinkageErrors, but these are easily 
> resolvable (see the patch for a few examples)
> I've tested the new loader a bit and am curious about others' findings.





[jira] [Commented] (NUTCH-2413) Parsing fetcher to respect property "parse.filter.urls"

2017-08-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16142732#comment-16142732
 ] 

Hudson commented on NUTCH-2413:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3452 (See 
[https://builds.apache.org/job/Nutch-trunk/3452/])
fix for NUTCH-2413 contributed by maborec (marcos: 
[https://github.com/apache/nutch/commit/6c648633cecc158f409e3a4ec45cf33bc68b4b1d])
* (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java
fix for NUTCH-2413 contributed by maborec (marcos: 
[https://github.com/apache/nutch/commit/5dc48f2fc2f7a6f9d039251b9133df12bee99d52])
* (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java
NUTCH-2413 - Fix some styling. Prepare filters and normalizers in (marcos: 
[https://github.com/apache/nutch/commit/60af77262726e8a09202a2319add512c54e7a2f4])
* (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java


> Parsing fetcher to respect property "parse.filter.urls"
> ---
>
> Key: NUTCH-2413
> URL: https://issues.apache.org/jira/browse/NUTCH-2413
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher, parser
>Affects Versions: 1.13
> Environment: Apache Nutch release 1.13.
>Reporter: Marcos Bori
>Assignee: Sebastian Nagel
> Fix For: 1.14
>
>
> In a situation when we want to:
> (1) Execute the fetch and parse together ("fetcher.parse" setting to "true")
> (2) Avoid applying the URL filters when executing this phase.
> Condition (2) can be configured when parsing is executed as a separate 
> process by setting "parse.filter.urls" to "false".
> However, this setting ("parse.filter.urls") is ignored when we execute the 
> fetch and parse phases together. 





[jira] [Commented] (NUTCH-2397) Parser to add paragraph line breaks

2017-09-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161121#comment-16161121
 ] 

Hudson commented on NUTCH-2397:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1592 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1592/])
NUTCH-2397: Parser to add paragraph line breaks (snagel: 
[https://github.com/apache/nutch/commit/aaa8099c8fe3761869f4c881fb66b2c11a2e350b])
* (edit) 
src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
* (edit) 
src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java


> Parser to add paragraph line breaks
> ---
>
> Key: NUTCH-2397
> URL: https://issues.apache.org/jira/browse/NUTCH-2397
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 2.3.1, 1.13
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 2.4, 1.14
>
>
> (initially reported with patch/pull-request by Vipul Behl, see 
> [#190|https://github.com/apache/nutch/pull/190])
> The parser (parse-tika and parse-html) could be improved to add line breaks 
> between paragraphs, instead of writing the whole document into a single line.





[jira] [Commented] (NUTCH-2409) Injector: complete command-line help and counters

2017-09-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161128#comment-16161128
 ] 

Hudson commented on NUTCH-2409:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3453 (See 
[https://builds.apache.org/job/Nutch-trunk/3453/])
NUTCH-2409 Injector: complete command-line help and counters - add (snagel: 
[https://github.com/apache/nutch/commit/9b4a9df26c5b92c82029d030a9bf72cda043209c])
* (edit) src/java/org/apache/nutch/crawl/Injector.java


> Injector: complete command-line help and counters
> -
>
> Key: NUTCH-2409
> URL: https://issues.apache.org/jira/browse/NUTCH-2409
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.14
>
>
> See discussion in 
> [NUTCH-2335|https://issues.apache.org/jira/browse/NUTCH-2335?focusedCommentId=16130178&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16130178]:
> - add counters for removed items from CrawlDb:
> {noformat}
> Injector: Total urls removed from CrawlDb by filters: 2
> Injector: Total urls with status gone removed from CrawlDb 
> (db.update.purge.404): 0
> {noformat}
> - add {{-Ddb.update.purge.404=true}} to command-line help:
> {noformat}
> Usage: Injector [-D...]   [-overwrite|-update] [-noFilter] 
> [-noNormalize] [-filterNormalizeAll]
> ...
>  -D...  set or overwrite configuration property (property=value)
>  -Ddb.update.purge.404=true
> remove URLs with status gone (404) from CrawlDb
> {noformat}





[jira] [Commented] (NUTCH-2430) Complete plugin build configuration

2017-09-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16179201#comment-16179201
 ] 

Hudson commented on NUTCH-2430:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3454 (See 
[https://builds.apache.org/job/Nutch-trunk/3454/])
Complete plugin build configuration (NUTCH-2430) - add missing plugin (snagel: 
[https://github.com/apache/nutch/commit/64fc5761ee1f04538426bb4a7d3eea140996a976])
* (edit) build.xml
* (edit) src/plugin/build.xml
* (delete) src/plugin/parse-replace/plugin.xml
* (delete) src/plugin/parse-replace/sample/testParseReplace.html
* (delete) src/plugin/parse-replace/README.txt
* (delete) 
src/plugin/parse-replace/src/java/org/apache/nutch/parse/replace/ReplaceParser.java
* (delete) 
src/plugin/parse-replace/src/test/org/apache/nutch/parse/replace/TestParseReplace.java
* (edit) default.properties
* (delete) 
src/plugin/parse-replace/src/java/org/apache/nutch/parse/replace/package-info.java
* (delete) src/plugin/parse-replace/build.xml
* (delete) src/plugin/parse-replace/ivy.xml


> Complete plugin build configuration
> ---
>
> Key: NUTCH-2430
> URL: https://issues.apache.org/jira/browse/NUTCH-2430
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
> Fix For: 1.14
>
>
> The build configuration around plugins isn't complete
> - missing plugin folders in the Eclipse target (see NUTCH-2135)
> - not all plugins included API docs / javadoc





[jira] [Commented] (NUTCH-2436) Remove empty comment, and redundant semicolon from CommandRunner

2017-09-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184747#comment-16184747
 ] 

Hudson commented on NUTCH-2436:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3456 (See 
[https://builds.apache.org/job/Nutch-trunk/3456/])
NUTCH-2436 Fix (kenneth: 
[https://github.com/apache/nutch/commit/4d67a77bde35a8af1b7d62e7cd281bdf13b11b80])
* (edit) src/java/org/apache/nutch/util/CommandRunner.java


> Remove empty comment, and redundant semicolon from CommandRunner
> 
>
> Key: NUTCH-2436
> URL: https://issues.apache.org/jira/browse/NUTCH-2436
> Project: Nutch
>  Issue Type: Bug
>Reporter: kenneth mcfarland
>Assignee: kenneth mcfarland
>Priority: Trivial
> Fix For: 1.14
>
>
> CommandRunner has a set of empty comments and a redundant semicolon. 





[jira] [Commented] (NUTCH-2433) Html Parser: keep htmltag where the outlinks are found

2017-09-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16185784#comment-16185784
 ] 

Hudson commented on NUTCH-2433:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3457 (See 
[https://builds.apache.org/job/Nutch-trunk/3457/])
NUTCH-2433 / Html Parser: keep htmltag where the outlinks are found (marcos: 
[https://github.com/apache/nutch/commit/7db11734f25a53cda15634071a47ff524a06002e])
* (edit) conf/nutch-default.xml
* (edit) 
src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java


> Html Parser: keep htmltag where the outlinks are found
> --
>
> Key: NUTCH-2433
> URL: https://issues.apache.org/jira/browse/NUTCH-2433
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.13
> Environment: Apache Nutch release 1.13.
>Reporter: Marcos Bori
>  Labels: html, outlink
> Fix For: 1.14
>
>
> When parsing HTML pages, I need to know in which HTML tag the outlinks were 
> found (for example, 'a', 'script', 'img', etc).
> I propose to add a new configuration value, 
> "parser.html.outlinks.htmlnode_metadata_name".
> If this configuration property is not empty, all found outlinks will be 
> assigned a metadata with the name indicated in this configuration property 
> with the html tag name where the outlink was found.
> I will now send the pull request with my code implementation.





[jira] [Commented] (NUTCH-2437) gora mongodb mapping file error

2017-10-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191807#comment-16191807
 ] 

Hudson commented on NUTCH-2437:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1593 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1593/])
fix for NUTCH-2437 contributed by tmzzngl (tulay.muezzinoglu: 
[https://github.com/apache/nutch/commit/93b879600683540069ed35799d8710f0739a766d])
* (edit) conf/gora-mongodb-mapping.xml


> gora mongodb mapping file error
> ---
>
> Key: NUTCH-2437
> URL: https://issues.apache.org/jira/browse/NUTCH-2437
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.4
>Reporter: Tulay Muezzinoglu
>Priority: Trivial
>  Labels: gora, mapping, mongo
> Fix For: 2.4
>
>
> conf/gora-mongodb-mapping.xml
> {code}
>  
> {code}
> should be
>  {code} 
>  
> {code}
>  Otherwise it throws an exception.





[jira] [Commented] (NUTCH-1763) Improving comments on the Injector Class

2017-10-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211836#comment-16211836
 ] 

Hudson commented on NUTCH-1763:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3458 (See 
[https://builds.apache.org/job/Nutch-trunk/3458/])
NUTCH-1763 Code comment Injector contributed by Diaa (snagel: 
[https://github.com/apache/nutch/commit/21d56a0c5626553a3bf5058588d9277e6844e00f])
* (edit) src/java/org/apache/nutch/crawl/Injector.java


> Improving comments on the Injector Class
> 
>
> Key: NUTCH-1763
> URL: https://issues.apache.org/jira/browse/NUTCH-1763
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 1.9
>Reporter: Diaa
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.14
>
> Attachments: Injector.java.patch, Injector.java.patch
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> I think the Injector class could use some improvements in the comments.
> I am attaching a few improvements to that and will keep adding as I 
> understand it more.





[jira] [Commented] (NUTCH-2446) URLFiltersCheck fix

2017-10-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214825#comment-16214825
 ] 

Hudson commented on NUTCH-2446:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1594 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1594/])
Fix for NUTCH-2446 by Kenneth McFarland (snagel: 
[https://github.com/apache/nutch/commit/72128eb5e863afea66ff5be7a7a2df824af688e8])
* (edit) src/java/org/apache/nutch/net/URLFilterChecker.java


> URLFiltersCheck fix
> ---
>
> Key: NUTCH-2446
> URL: https://issues.apache.org/jira/browse/NUTCH-2446
> Project: Nutch
>  Issue Type: Bug
> Environment: master
>Reporter: kenneth mcfarland
>Assignee: kenneth mcfarland
>Priority: Minor
> Fix For: 2.4, 1.14
>
>
> Currently URLFilterChecker.checkAll() creates a URLFilters object repeatedly 
> when conf does not change.





[jira] [Commented] (NUTCH-2446) URLFiltersCheck fix

2017-10-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214828#comment-16214828
 ] 

Hudson commented on NUTCH-2446:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3459 (See 
[https://builds.apache.org/job/Nutch-trunk/3459/])
Fix for NUTCH-2446 by Kenneth McFarland (kennethpaulmcfarland: 
[https://github.com/apache/nutch/commit/19fdd6c6339efd08c7c77d3c4e87f464b7c3a038])
* (edit) src/java/org/apache/nutch/net/URLFilterChecker.java


> URLFiltersCheck fix
> ---
>
> Key: NUTCH-2446
> URL: https://issues.apache.org/jira/browse/NUTCH-2446
> Project: Nutch
>  Issue Type: Bug
> Environment: master
>Reporter: kenneth mcfarland
>Assignee: kenneth mcfarland
>Priority: Minor
> Fix For: 2.4, 1.14
>
>
> Currently URLFilterChecker.checkAll() creates a URLFilters object repeatedly 
> when conf does not change.





[jira] [Commented] (NUTCH-2444) HostDB CSV dumper to emit field header by default

2017-10-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215161#comment-16215161
 ] 

Hudson commented on NUTCH-2444:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3460 (See 
[https://builds.apache.org/job/Nutch-trunk/3460/])
NUTCH-2444 HostDB CSV dumper to emit field header by default (markus: 
[https://github.com/apache/nutch/commit/d7e4046e6e725ed759d0c43e37c51c5c3122e006])
* (edit) src/java/org/apache/nutch/hostdb/ReadHostDb.java


> HostDB CSV dumper to emit field header by default
> -
>
> Key: NUTCH-2444
> URL: https://issues.apache.org/jira/browse/NUTCH-2444
> Project: Nutch
>  Issue Type: Bug
>  Components: hostdb
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2444.patch
>
>
> Started to get annoyed by constantly having to look up HostDatum for the field 
> set.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2445) Fetcher following outlinks to keep track of already fetched items

2017-10-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215248#comment-16215248
 ] 

Hudson commented on NUTCH-2445:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3461 (See 
[https://builds.apache.org/job/Nutch-trunk/3461/])
NUTCH-2445 Fetcher following outlinks to keep track of already fetched (markus: 
[https://github.com/apache/nutch/commit/0cdd095c881eed52dc461e559ce6ae278e99157f])
* (edit) src/java/org/apache/nutch/fetcher/FetchItemQueue.java
* (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java


> Fetcher following outlinks to keep track of already fetched items
> -
>
> Key: NUTCH-2445
> URL: https://issues.apache.org/jira/browse/NUTCH-2445
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2445.patch, NUTCH-2445.patch
>
>
> When fetcher.follow.outlinks.depth is non-zero, fetcher follows outlinks. 
> This patch keeps track of already fetched URLs and thus avoids fetching the 
> same URL twice.
> A Set is used to keep track of them, hashcodes to reduce memory usage. This 
> is not used if fetcher doesn't follow outlinks.
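
The bookkeeping described above, a Set of hash codes instead of full URL strings, might look roughly like this. It is a sketch under assumptions rather than the patch itself: FetchedUrlTracker and markIfNew are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical tracker that stores URL hash codes instead of URL strings. */
final class FetchedUrlTracker {
  private final Set<Integer> seen = new HashSet<>();

  /** Returns true the first time a URL is offered, false on repeats. */
  synchronized boolean markIfNew(String url) {
    // Set.add returns false if the hash code is already present.
    return seen.add(url.hashCode());
  }
}
```

Storing only the int hash keeps memory use low. The trade-off is that two different URLs with the same hash code are treated as one, so a rare URL may be skipped.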





[jira] [Commented] (NUTCH-2448) Allow Sending an empty http.agent.version

2017-10-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217545#comment-16217545
 ] 

Hudson commented on NUTCH-2448:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3462 (See 
[https://builds.apache.org/job/Nutch-trunk/3462/])
NUTCH-2448: Treat white-space http.agent.version as empty. (github: 
[https://github.com/apache/nutch/commit/9f54d5b3ec5a0fd36f91ec8af762e52859f4eeea])
* (edit) 
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java


> Allow Sending an empty http.agent.version
> -
>
> Key: NUTCH-2448
> URL: https://issues.apache.org/jira/browse/NUTCH-2448
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher, protocol
>Affects Versions: 1.13
>Reporter: Yossi Tamari
>Priority: Minor
> Fix For: 2.4, 1.14
>
>
> http.agent.version defaults in nutch-default.xml to Nutch-1.14-SNAPSHOT 
> (depending on the version of course).
> If I want to override it to not send a version as part of the user-agent, 
> there is nothing I can do in nutch-site.xml, since putting an empty string 
> there causes the default to be taken, and putting any value there causes a 
> slash to be appended to the http.agent.name.
> As far as I can see, the only way to override it is to remove the value in 
> nutch-default.xml, which is probably not the “correct” way, considering it 
> contains a comment saying “Do not modify this file directly”.
> The suggested solution is to treat a white-space-only value as empty.
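
The suggested behaviour can be sketched as follows. This is an illustration only, not the code in HttpBase: the class AgentString and its two-argument build method are hypothetical.

```java
/** Hypothetical helper showing the "white-space-only means empty" rule. */
final class AgentString {

  static String build(String name, String version) {
    StringBuilder sb = new StringBuilder(name);
    // A null, empty or white-space-only version adds nothing, not even the slash.
    if (version != null && !version.trim().isEmpty()) {
      sb.append('/').append(version.trim());
    }
    return sb.toString();
  }
}
```

Under this rule a single space as the property value in nutch-site.xml would be enough to suppress the version, since an empty value falls back to the default.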





[jira] [Commented] (NUTCH-2394) Possible bugs in the source code

2017-10-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219095#comment-16219095
 ] 

Hudson commented on NUTCH-2394:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3464 (See 
[https://builds.apache.org/job/Nutch-trunk/3464/])
NUTCH-2394 Fix of bugs detected by static code analysis - String.trim() 
(snagel: 
[https://github.com/apache/nutch/commit/63037c71370cad1eba4152668f33b184c686d092])
* (edit) 
src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net/urlnormalizer/protocol/ProtocolURLNormalizer.java
* (edit) src/java/org/apache/nutch/crawl/URLPartitioner.java
* (edit) src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java
* (edit) 
src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
* (edit) 
src/plugin/urlnormalizer-slash/src/java/org/apache/nutch/net/urlnormalizer/slash/SlashURLNormalizer.java
* (edit) 
src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net/urlnormalizer/host/HostURLNormalizer.java
* (edit) 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java


> Possible bugs in the source code
> 
>
> Key: NUTCH-2394
> URL: https://issues.apache.org/jira/browse/NUTCH-2394
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.13
>Reporter: AppChecker
>  Labels: appchecker, static-analysis
> Fix For: 1.14
>
>
> Hi!
> I've checked your project with static analyzer 
> [AppChecker|https://npo-echelon.ru/en/solutions/appchecker.php] and it found 
> several suspicious code fragments:
> 1) 
> [src/plugin/headings/src/java/org/apache/nutch/parse/headings/HeadingsParseFilter.java|https://github.com/apache/nutch/blob/e53b34b2322f2d071981a72577644a225642ecbc/src/plugin/headings/src/java/org/apache/nutch/parse/headings/HeadingsParseFilter.java#L56]
> {code:java}
> heading.trim();
> {code}
> heading is not changed, because java.lang.String.trim returns new string.
> Probably, it should be:
> {code:java}
> heading = heading.trim();
> {code}
> see also:
> * 
> [src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net/urlnormalizer/host/HostURLNormalizer.java#L78|https://github.com/apache/nutch/blob/e53b34b2322f2d071981a72577644a225642ecbc/src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net/urlnormalizer/host/HostURLNormalizer.java#L78]
> * 
> [src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java#L115|https://github.com/apache/nutch/blob/e53b34b2322f2d071981a72577644a225642ecbc/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java#L115]
> * 
> [src/java/org/apache/nutch/net/urlnormalizer/protocol/ProtocolURLNormalizer.java#L76|https://github.com/apache/nutch/blob/e53b34b2322f2d071981a72577644a225642ecbc/src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net/urlnormalizer/protocol/ProtocolURLNormalizer.java#L76]
> * 
> [src/java/org/apache/nutch/net/urlnormalizer/slash/SlashURLNormalizer.java#L78|https://github.com/apache/nutch/blob/e53b34b2322f2d071981a72577644a225642ecbc/src/plugin/urlnormalizer-slash/src/java/org/apache/nutch/net/urlnormalizer/slash/SlashURLNormalizer.java#L78]
> * 
> [src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java#L326|https://github.com/apache/nutch/blob/e53b34b2322f2d071981a72577644a225642ecbc/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java#L326]
> 2) 
> [src/java/org/apache/nutch/crawl/URLPartitioner.java#L84|https://github.com/apache/nutch/blob/2b93a66f0472e93223c69053d5482dcbef26de6d/src/java/org/apache/nutch/crawl/URLPartitioner.java#L84]
> {code:java}
> if (mode.equals(PARTITION_MODE_DOMAIN) && url != null)
>   ...
> else if ..
>   ...
>   InetAddress address = InetAddress.getByName(url.getHost());
>   ...
> {code}
> if url is null, method url.getHost() will be invoked, so a NullPointerException 
> will be thrown
> 3) 
> [src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java#L346|https://github.com/apache/nutch/blob/e53b34b2322f2d071981a72577644a225642ecbc/src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java#L346]
> {code:java}
> String[] fullPathLevels = fullDir.split(File.separator);
> {code}
> Using File.separator in regular expressions may throw 
> java.util.regex.PatternSyntaxException, because it is "\" on 
> Windows-based systems.
> Possible correction:
> {code:java}
> String[] fullPathLevels = fullDir.split(Pattern.quote(File.separator));
> {code}
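
The three findings above can be reproduced in isolation. The snippet below is a standalone sketch, not the committed changes: the class StaticAnalysisFixes and its method names are hypothetical.

```java
import java.net.URL;
import java.util.regex.Pattern;

/** Hypothetical helpers mirroring the three reported patterns. */
final class StaticAnalysisFixes {

  /** String is immutable, so the result of trim() has to be kept. */
  static String normalizeHeading(String heading) {
    return heading.trim();
  }

  /** Guards against a null URL before calling getHost(). */
  static String hostOrNull(URL url) {
    return url == null ? null : url.getHost();
  }

  /** Quotes the separator so "\" is matched literally, not parsed as a regex escape. */
  static String[] splitPath(String path, String separator) {
    return path.split(Pattern.quote(separator));
  }
}
```

For example, splitPath("a\\b\\c", "\\") yields three path levels, and the same call works unchanged with "/" as the separator.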





[jira] [Commented] (NUTCH-2452) Problem retrieving encoded URLs via FTP?

2017-11-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239741#comment-16239741
 ] 

Hudson commented on NUTCH-2452:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3465 (See 
[https://builds.apache.org/job/Nutch-trunk/3465/])
NUTCH-2452 Allow nutch to retrieve Ftp URLs that contain UrlEncoded (snagel: 
[https://github.com/apache/nutch/commit/517dbdf3261d42e90883d07320b7991ff8e2bcf8])
* (edit) 
src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpResponse.java


> Problem retrieving encoded URLs via FTP?
> 
>
> Key: NUTCH-2452
> URL: https://issues.apache.org/jira/browse/NUTCH-2452
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.13
> Environment: Ubuntu 16.04.3 LTS
> OpenJDK 1.8.0_131
> nutch 1.14-SNAPSHOT
> Synology RS816
>Reporter: Hiran Chaudhuri
> Fix For: 1.14
>
>
> I tried running Nutch on my Synology NAS. As SMB protocol is not contained in 
> Nutch, I turned on FTP service on the NAS and configured Nutch to crawl 
> ftp://nas.
> The experience gives me varying results which seem to point to problems 
> within Nutch. However this may need further evaluation.
> As some files could not be downloaded and I could not see a good error 
> message I changed the method 
> org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to not 
> only return protocol status but send the full exception and stack trace to 
> the logs:
> {{ } catch (Exception e) {
> LOG.warn("Could not get {}", url, e);
> return new ProtocolOutput(null, new ProtocolStatus(e));
> }
> }}
> With this modification I suddenly see such messages in the logfile:
> {{2017-10-25 14:14:37,254 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching 
> ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/
> 2017-10-25 14:14:37,512 WARN  org.apache.nutch.protocol.ftp.Ftp - Could not 
> get ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/
> org.apache.nutch.protocol.ftp.FtpError: Ftp Error: 404
> at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:151)
> at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> }}
> Please note that the URL was not configured by me. Instead it was obtained by 
> crawling my NAS. Also the URL looks perfectly fine to me. Even more, using 
> Firefox and the same authentication data on the same URL displays the 
> directory successfully. Therefore I suspect the FTP client is unable to 
> decode the URL such that the FTP server would understand it.
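
The suspicion above (a percent-encoded path being sent to the FTP server as-is) can be illustrated with a minimal sketch. It is not the committed change to FtpResponse.java, whose contents are not shown in this message; the class and method names below are hypothetical, and only java.net.URI from the JDK is used.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch: shows the difference between the
// raw (percent-encoded) and the decoded path of an FTP URL.
class FtpPathDecoding {

  /** Returns the path with percent-escapes decoded, e.g. "%20" becomes " ". */
  static String decodedPath(String url) {
    // URI.getPath() decodes escaped octets; URI.getRawPath() keeps them.
    return URI.create(url).getPath();
  }

  /** Returns the path exactly as written in the URL. */
  static String rawPath(String url) {
    return URI.create(url).getRawPath();
  }

  public static void main(String[] args) {
    String url = "ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/";
    System.out.println(rawPath(url));
    System.out.println(decodedPath(url));
  }
}
```

An FTP server has no notion of URL escaping, so a directory named "Kenya Pics" would presumably have to be requested by its decoded name.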



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2443) Extract links from the video tag with the parse-html plugin

2017-11-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239740#comment-16239740
 ] 

Hudson commented on NUTCH-2443:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3465 (See 
[https://builds.apache.org/job/Nutch-trunk/3465/])
NUTCH-2443 add source tag to the parse-html and parse-tika outlink 
(jorge-luis.betancourt: 
[https://github.com/apache/nutch/commit/d34a002b25a770369ad6a5a20475c7072d8fa02b])
* (edit) 
src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestDOMContentUtils.java
* (edit) 
src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
* (edit) 
src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
* (edit) 
src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java


> Extract links from the video tag with the parse-html plugin
> ---
>
> Key: NUTCH-2443
> URL: https://issues.apache.org/jira/browse/NUTCH-2443
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, plugin
>Affects Versions: 1.13
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
> Fix For: 1.14
>
>
> At the moment the {{parse-html}} extracts links from the tags {{a, area, 
> form}} (configurable){{, frame, iframe, script, link, img}}. Since we allow 
> extracting links to binary files (images) extracting links also from the 
> {{video}} tag should be supported.





[jira] [Commented] (NUTCH-2420) Bug in variable generate.max.count and fetcher.server.delay

2017-11-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240651#comment-16240651
 ] 

Hudson commented on NUTCH-2420:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3466 (See 
[https://builds.apache.org/job/Nutch-trunk/3466/])
NUTCH-2420 Bug in variable generate.max.count and fetcher.server.delay (markus: 
[https://github.com/apache/nutch/commit/6199492f5e1e8811022257c88dbf63f1e1c739d0])
* (edit) src/java/org/apache/nutch/crawl/Generator.java


> Bug in variable generate.max.count and fetcher.server.delay
> ---
>
> Key: NUTCH-2420
> URL: https://issues.apache.org/jira/browse/NUTCH-2420
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2420.patch
>
>
> Feature added by NUTCH-2368 does not work for multiple hosts. Once a 
> HostDatum has been read by getHostDatum(), the next host cannot be read. 
> Apparently I need to open and close the SequenceFile.Readers for every 
> HostDatum it needs. The Reader has no reset() method or anything similar.





[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240970#comment-16240970
 ] 

Hudson commented on NUTCH-2442:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3467 (See 
[https://builds.apache.org/job/Nutch-trunk/3467/])
NUTCH-2442 Injector to stop if job fails to avoid loss of CrawlDb 
(omkarreddy2008: 
[https://github.com/apache/nutch/commit/2352f9a4f47693cd8ca653f0b0629d186593fc4a])
* (edit) src/java/org/apache/nutch/util/domain/DomainStatistics.java
* (edit) src/java/org/apache/nutch/crawl/Injector.java
* (edit) src/java/org/apache/nutch/util/CrawlCompletionStats.java
* (edit) src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
* (edit) src/java/org/apache/nutch/util/SitemapProcessor.java
* (edit) src/java/org/apache/nutch/hostdb/ReadHostDb.java


> Injector to stop if job fails to avoid loss of CrawlDb
> --
>
> Key: NUTCH-2442
> URL: https://issues.apache.org/jira/browse/NUTCH-2442
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.14
>
>
> Injector does not check whether the MapReduce job is successful. Even if the 
> job fails, it
> - installs the CrawlDb
> -- move current/ to old/
> -- replace current/ with an empty or potentially incomplete version
> - exits with code 0 so that scripts running the crawl workflow cannot detect 
> the failure -- if Injector is run a second time the CrawlDb is lost (both 
> current/ and old/ are empty or corrupted)
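
The failure mode listed above can be sketched without any Hadoop dependency. The names below are hypothetical stand-ins (a BooleanSupplier for "run the MapReduce job and report success", a Runnable for "install the CrawlDb"); this is an illustration of the guard being asked for, not the code committed to Injector.java.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch: only install the new CrawlDb when the job succeeded,
// and report failure through a non-zero exit code.
class InstallOnlyOnSuccess {

  /**
   * @param job     runs the job and returns true on success (assumed stand-in)
   * @param install moves current/ to old/ and installs the new version (assumed stand-in)
   * @return 0 on success, -1 if the job failed and nothing was installed
   */
  static int runAndInstall(BooleanSupplier job, Runnable install) {
    if (!job.getAsBoolean()) {
      // Leave current/ and old/ untouched so the existing CrawlDb survives.
      return -1;
    }
    install.run();
    return 0;
  }

  public static void main(String[] args) {
    int code = runAndInstall(() -> false, () -> System.out.println("installing"));
    System.out.println("exit code: " + code);
  }
}
```

With a guard of this kind a wrapper script can test the exit code and stop the crawl workflow before a second run overwrites old/.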





[jira] [Commented] (NUTCH-2458) TikaParser doesn't work with tika-config.xml set

2017-11-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247521#comment-16247521
 ] 

Hudson commented on NUTCH-2458:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3468 (See 
[https://builds.apache.org/job/Nutch-trunk/3468/])
NUTCH-2458 (markus: 
[https://github.com/apache/nutch/commit/c345618ec425f0e907a6e54565f2d0577139b45f])
* (edit) 
src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java


> TikaParser doesn't work with tika-config.xml set
> 
>
> Key: NUTCH-2458
> URL: https://issues.apache.org/jira/browse/NUTCH-2458
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2458.patch
>
>
> Well, it doesn't indeed. Thanks to Timothy Allison, it's solved.





[jira] [Commented] (NUTCH-2458) TikaParser doesn't work with tika-config.xml set

2017-11-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268626#comment-16268626
 ] 

Hudson commented on NUTCH-2458:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3469 (See 
[https://builds.apache.org/job/Nutch-trunk/3469/])
NUTCH-2458 (snagel: 
[https://github.com/apache/nutch/commit/c17dd1dd6bf914beb7b13528c95b487630f86905])
* (edit) 
src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java


> TikaParser doesn't work with tika-config.xml set
> 
>
> Key: NUTCH-2458
> URL: https://issues.apache.org/jira/browse/NUTCH-2458
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2458.patch
>
>
> Well, it doesn't indeed. Thanks to Timothy Allison, it's solved.





[jira] [Commented] (NUTCH-2463) Enable sampling CrawlDB

2017-11-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268625#comment-16268625
 ] 

Hudson commented on NUTCH-2463:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3469 (See 
[https://builds.apache.org/job/Nutch-trunk/3469/])
NUTCH-2463 - Enable sampling CrawlDB (github: 
[https://github.com/apache/nutch/commit/65651b5cce54736978356ba1a8dea8a10f405d3c])
* (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java


> Enable sampling CrawlDB
> ---
>
> Key: NUTCH-2463
> URL: https://issues.apache.org/jira/browse/NUTCH-2463
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Reporter: Yossi Tamari
>Priority: Minor
> Fix For: 1.14
>
>
> CrawlDB can grow to contain billions of records. When that happens *readdb 
> -dump* is pretty useless, and *readdb -topN* can run for ages (and does not 
> provide a statistically correct sample).
> We should add a parameter *-sample* to *readdb -dump* which is followed by a 
> number between 0 and 1, and only that fraction of records from the CrawlDB 
> will be processed.
> The sample should be statistically random, and all the other filters should 
> be applied on the sampled records.
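
A minimal sketch of the sampling idea described above, assuming independent per-record selection with probability equal to the requested fraction. The class name, the seed parameter and the use of an in-memory list are assumptions made for illustration; the committed change to CrawlDbReader.java may work differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of "-sample <fraction>": each record is kept
// independently with probability `fraction`.
class CrawlDbSampleSketch {

  static <T> List<T> sample(List<T> records, double fraction, long seed) {
    if (fraction < 0.0 || fraction > 1.0) {
      throw new IllegalArgumentException("fraction must be between 0 and 1");
    }
    Random random = new Random(seed);
    List<T> kept = new ArrayList<>();
    for (T record : records) {
      // nextDouble() is uniform in [0, 1), so fraction 0 keeps nothing
      // and fraction 1 keeps everything.
      if (random.nextDouble() < fraction) {
        kept.add(record);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> urls = List.of("http://a/", "http://b/", "http://c/", "http://d/");
    System.out.println(sample(urls, 0.5, 42L));
  }
}
```

The size of the result is only approximately fraction times the input size, which is the usual trade-off of per-record sampling: it needs a single pass and no knowledge of the total record count.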





[jira] [Commented] (NUTCH-2464) Plugin headings: Headers That Contain HTML Elements Are Not Parsed

2017-11-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273291#comment-16273291
 ] 

Hudson commented on NUTCH-2464:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3470 (See 
[https://builds.apache.org/job/Nutch-trunk/3470/])
Fix for NUTCH-2464 get textual content from nested items on heading 
(jorge-luis.betancourt: 
[https://github.com/apache/nutch/commit/b8580b3dd3d47c8f0157d9860f8dab8d1dc8607c])
* (edit) 
src/plugin/headings/src/java/org/apache/nutch/parse/headings/HeadingsParseFilter.java
* (add) 
src/plugin/headings/src/test/org/apache/nutch/parse/headings/TestHeadingsParseFilter.java
* (edit) src/plugin/headings/ivy.xml


> Plugin headings: Headers That Contain HTML Elements Are Not Parsed
> --
>
> Key: NUTCH-2464
> URL: https://issues.apache.org/jira/browse/NUTCH-2464
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin
>Affects Versions: 1.13
> Environment: Internal development/test environments.
>Reporter: Cass Pallansch
>Assignee: Jorge Luis Betancourt Gonzalez
> Fix For: 1.14
>
> Attachments: NUTCH-2464-complex-header.html
>
>
> Nutch does not appear to traverse the HTML elements that may be contained 
> within header elements (e.g., H1, H2, H3, etc. tags).  Many times there are 
> anchors and/or  tags within these elements that contain the actual text 
> nodes that should be picked up as the header value for indexing purposes.
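
The traversal the description asks for can be sketched with the JDK's own DOM classes. This is an illustration only, not the code committed to HeadingsParseFilter.java; the class and method names are hypothetical, and the parse helper expects well-formed XML purely to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: collect the text of a heading element including the
// text nodes of nested elements such as anchors.
class HeadingTextSketch {

  /** Returns the concatenated text of all text nodes below the given node. */
  static String textOf(Node node) {
    StringBuilder out = new StringBuilder();
    collect(node, out);
    return out.toString().trim();
  }

  private static void collect(Node node, StringBuilder out) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      out.append(node.getNodeValue());
      return;
    }
    // Descend into child elements instead of looking at direct children only.
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collect(children.item(i), out);
    }
  }

  /** Parses a well-formed XML fragment and returns its root element. */
  static Node parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml))).getDocumentElement();
    } catch (Exception e) {
      throw new IllegalArgumentException("not well-formed: " + xml, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(textOf(parse("<h1><a href=\"#top\">Nested <b>heading</b></a></h1>")));
  }
}
```

Node.getTextContent() gives a similar result in one call; the explicit recursion is shown only to make the traversal visible.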





[jira] [Commented] (NUTCH-2465) Broken Eclipse project. Classpaths and interactiveselenium should be fixed.

2017-11-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273292#comment-16273292
 ] 

Hudson commented on NUTCH-2465:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3470 (See 
[https://builds.apache.org/job/Nutch-trunk/3470/])
fix of NUTCH-2465 broken Eclipse project. Classpaths and (semyon.semyonov: 
[https://github.com/apache/nutch/commit/01bdc70b52f64a0d8ee81823eb61e5854e3f6291])
* (edit) 
src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/InteractiveSeleniumHandler.java
* (edit) build.xml
* (edit) 
src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/HttpResponse.java
* (edit) 
src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultHandler.java
* (edit) 
src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefalultMultiInteractionHandler.java
* (edit) 
src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultClickAllAjaxLinksHandler.java


> Broken Eclipse project. Classpaths and interactiveselenium should be fixed.
> ---
>
> Key: NUTCH-2465
> URL: https://issues.apache.org/jira/browse/NUTCH-2465
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Semyon Semyonov
> Fix For: 1.14
>
>
> With the latest version of develop the Eclipse project doesn't work anymore.
> There are two sets of problems:
> 1) Classpath problems 
> 2) Incorrect usage of org.apache.nutch.protocol.interactiveselenium in the 
> code. Should be replaced by 
> org.apache.nutch.protocol.interactiveselenium.handlers 





[jira] [Commented] (NUTCH-2456) Allow to index pages/URLs not contained in CrawlDb

2017-12-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278316#comment-16278316
 ] 

Hudson commented on NUTCH-2456:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3471 (See 
[https://builds.apache.org/job/Nutch-trunk/3471/])
NUTCH-2456: Redirected documents are not indexed (snagel: 
[https://github.com/apache/nutch/commit/a7bc1a8c5a3a5ab9c72574afd98089a354bf0484])
* (edit) src/java/org/apache/nutch/indexer/IndexerMapReduce.java


> Allow to index pages/URLs not contained in CrawlDb
> --
>
> Key: NUTCH-2456
> URL: https://issues.apache.org/jira/browse/NUTCH-2456
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.13
>Reporter: Yossi Tamari
>Priority: Critical
> Fix For: 1.14
>
>
> If http.redirect.max is set to a positive value, the Fetcher will follow 
> redirects, creating a new CrawlDatum.
> If the redirected URL is fetched and parsed, during indexing for it we have a 
> special case: dbDatum is null. This means that in 
> [https://github.com/apache/nutch/blob/6199492f5e1e8811022257c88dbf63f1e1c739d0/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L259]
>  the document is not indexed, as it is assumed it only has inlinks (actually 
> it has everything but dbDatum).
> I'm not sure what the correct fix is here. It seems to me the condition 
> should use AND instead of OR anyway, but I may not understand the original 
> intent. It is clear that it is too strict as is.
> However, the code following that line assumes all 4 objects are not null, so 
> a patch would need to change more than just the condition.





[jira] [Commented] (NUTCH-2468) should filter out invalid URLs by default

2017-12-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278378#comment-16278378
 ] 

Hudson commented on NUTCH-2468:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1596 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1596/])
NUTCH-2468 should filter out invalid URLs by default - enable plugin (snagel: 
[https://github.com/apache/nutch/commit/40da65bf3c55f802ae91b8d7424955450a7146ab])
* (edit) conf/nutch-default.xml


> should filter out invalid URLs by default
> -
>
> Key: NUTCH-2468
> URL: https://issues.apache.org/jira/browse/NUTCH-2468
> Project: Nutch
>  Issue Type: Improvement
>  Components: bin
>Affects Versions: 1.12
>Reporter: Michael Coffey
>Priority: Minor
> Fix For: 2.4, 1.14
>
>
> Some Nutch components, by default, should reject invalid URLs. This was 
> recently discussed in the users mailing list and has affected my work for a 
> while. Although there may be some special-purpose needs to collect invalid 
> URLs, they are not generally useful for crawling.





[jira] [Commented] (NUTCH-2468) should filter out invalid URLs by default

2017-12-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278393#comment-16278393
 ] 

Hudson commented on NUTCH-2468:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3472 (See 
[https://builds.apache.org/job/Nutch-trunk/3472/])
NUTCH-2468 should filter out invalid URLs by default - enable plugin (snagel: 
[https://github.com/apache/nutch/commit/d8754b7f88e73949dadaa0412aedea4427207f25])
* (edit) conf/nutch-default.xml


> should filter out invalid URLs by default
> -
>
> Key: NUTCH-2468
> URL: https://issues.apache.org/jira/browse/NUTCH-2468
> Project: Nutch
>  Issue Type: Improvement
>  Components: bin
>Affects Versions: 1.12
>Reporter: Michael Coffey
>Priority: Minor
> Fix For: 2.4, 1.14
>
>
> Some Nutch components, by default, should reject invalid URLs. This was 
> recently discussed in the users mailing list and has affected my work for a 
> while. Although there may be some special-purpose needs to collect invalid 
> URLs, they are not generally useful for crawling.





[jira] [Commented] (NUTCH-2451) protocol-ftp to resolve relative URL when following redirects

2017-12-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278444#comment-16278444
 ] 

Hudson commented on NUTCH-2451:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1597 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1597/])
NUTCH-2451 protocol-ftp to resolve relative URL when following redirects 
(snagel: 
[https://github.com/apache/nutch/commit/fc586d4508dbd8f1f5d19fc943e3b43b9f6956ca])
* (edit) src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java


> protocol-ftp to resolve relative URL when following redirects
> -
>
> Key: NUTCH-2451
> URL: https://issues.apache.org/jira/browse/NUTCH-2451
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.13
> Environment: Ubuntu 16.04.3 LTS
> OpenJDK 1.8.0_131
> nutch 1.14-SNAPSHOT
> Synology RS816
>Reporter: Hiran Chaudhuri
>Assignee: Sebastian Nagel
> Fix For: 2.4, 1.14
>
>
> I tried running Nutch on my Synology NAS. As SMB protocol is not contained in 
> Nutch, I turned on FTP service on the NAS and configured Nutch to crawl 
> ftp://nas.
> The experience gives me varying results which seem to point to problems 
> within Nutch. However this may need further evaluation.
> As some files could not be downloaded and I could not see a good error 
> message I changed the method 
> org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to not 
> only return protocol status but send the full exception and stack trace to 
> the logs:
> {{} catch (Exception e) {
>   LOG.warn("Could not get {}", url, e);
>   return new ProtocolOutput(null, new ProtocolStatus(e));
> }
> }}
> With this modification I suddenly see such messages in the logfile:
> {{2017-10-25 22:09:31,865 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching 
> ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
> 2017-10-25 22:09:32,147 WARN  org.apache.nutch.protocol.ftp.Ftp - Could not 
> get ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
> java.net.MalformedURLException
>   at java.net.URL.<init>(URL.java:627)
>   at java.net.URL.<init>(URL.java:490)
>   at java.net.URL.<init>(URL.java:439)
>   at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:145)
>   at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> Caused by: java.lang.NullPointerException
> }}
> Please note that the URL was not configured by me. Instead it was obtained by 
> crawling my NAS. Also the URL looks perfectly fine to me. Even if the file 
> did not exist I would not expect a MalformedURLException to occur. Even more, 
> using Firefox and the same authentication data on the same URL retrieves the 
> file successfully.
> How come Nutch cannot get the file?
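
The issue title refers to resolving a relative URL when following redirects. What such a resolution looks like can be sketched with java.net.URI; the helper below is hypothetical and the target values are made up for illustration, since the message does not show what Ftp.java actually receives as a redirect target.

```java
import java.net.URI;

// Hypothetical sketch: resolve a possibly relative redirect target against
// the URL that was being fetched.
class FtpRedirectSketch {

  static String resolve(String baseUrl, String target) {
    // URI.resolve() returns `target` unchanged if it is already absolute,
    // otherwise it is interpreted relative to `baseUrl`.
    return URI.create(baseUrl).resolve(target).toString();
  }

  public static void main(String[] args) {
    String base = "ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so";
    System.out.println(resolve(base, "ARMSCII-8.so.1"));
    System.out.println(resolve(base, "/MediaPC/usr/lib/gconv/ARMSCII-8.so"));
  }
}
```

Constructing a URL from a bare relative target without a base is one way to end up with a MalformedURLException like the one in the log above, although the message does not confirm that this was the cause.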





[jira] [Commented] (NUTCH-2469) Documents not commited to solr in Sever mode

2017-12-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278445#comment-16278445
 ] 

Hudson commented on NUTCH-2469:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1597 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1597/])
NUTCH-2469 Documents not commited to solr in Sever mode - applied patch 
(snagel: 
[https://github.com/apache/nutch/commit/cc2f4abeb7b8326acbb00f9d10b46a092bbbe9a5])
* (edit) src/java/org/apache/nutch/indexer/IndexingJob.java


> Documents not commited to solr in Sever mode
> 
>
> Key: NUTCH-2469
> URL: https://issues.apache.org/jira/browse/NUTCH-2469
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 2.3.1
>Reporter: Ninaad Joshi
>Assignee: Sebastian Nagel
>Priority: Blocker
> Fix For: 2.4
>
> Attachments: NinaadJoshi.IndexingJob.java.patch
>
>
> I found there is a discrepancy in execution paths when running Nutch in local 
> standalone mode vis-à-vis server mode. 
> I observed, in local standalone mode, when the indexing process is done the 
> document along with its fields get indexed and committed in solr and is 
> returned if queried immediately. However, the same when done through server 
> mode, the document gets indexed but is not committed in solr, hence not 
> returned if queried immediately. When we restart solr the indexed document is 
> returned if queried.
> I browsed through the IndexingJob.java file to understand the cause for this. 
> I found out:
> # There are two different entry paths for the local standalone mode and the 
> server mode
> ** Server mode entry point: public Map<String, Object> run(Map<String, Object> args)
> ** Standalone mode entry point: 
> *** public int run(String[] args)
> *** public void index(String batchId)
> # The local standalone mode path does more than the server mode path
> ** The public void index(String batchId) function initially calls the server 
> mode path: public Map<String, Object> run(Map<String, Object> args)
> ** And then does this extra stuff
> *** Gets IndexWriters
> *** Using IndexWriters Describes 
> *** Using IndexWriters commits if COMMIT_INDEX=true is specified in the 
> configuration
> ** The aforementioned extra stuff is not done in the server mode
> I feel the execution paths for both modes should be the same and hence 
> propose to:
> # Move the extra stuff done using IndexWriters in public void index(String 
> batchId) to the end of the server mode execution path, i.e. the 
> public Map<String, Object> run(Map<String, Object> args) function 
> # Call the public Map<String, Object> run(Map<String, Object> args) function 
> directly from the standalone mode entry point: public int run(String[] args)
> # public int run(String[] args) becomes redundant and can be safely removed.
> I have attached the proposed patch along with this issue. Kindly go through 
> the same and approve.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2451) protocol-ftp to resolve relative URL when following redirects

2017-12-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278447#comment-16278447
 ] 

Hudson commented on NUTCH-2451:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3473 (See 
[https://builds.apache.org/job/Nutch-trunk/3473/])
NUTCH-2451 protocol-ftp to resolve relative URL when following redirects 
(snagel: 
[https://github.com/apache/nutch/commit/5b3cf0e2028aed576d080be70fc9028796616b94])
* (edit) src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java


> protocol-ftp to resolve relative URL when following redirects
> -
>
> Key: NUTCH-2451
> URL: https://issues.apache.org/jira/browse/NUTCH-2451
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.13
> Environment: Ubuntu 16.04.3 LTS
> OpenJDK 1.8.0_131
> nutch 1.14-SNAPSHOT
> Synology RS816
>Reporter: Hiran Chaudhuri
>Assignee: Sebastian Nagel
> Fix For: 2.4, 1.14
>
>
> I tried running Nutch on my Synology NAS. As SMB protocol is not contained in 
> Nutch, I turned on FTP service on the NAS and configured Nutch to crawl 
> ftp://nas.
> The experience gives me varying results which seem to point to problems 
> within Nutch. However this may need further evaluation.
> As some files could not be downloaded and I could not see a good error 
> message I changed the method 
> org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to not 
> only return protocol status but send the full exception and stack trace to 
> the logs:
> {{} catch (Exception e) {
>   LOG.warn("Could not get {}", url, e);
>   return new ProtocolOutput(null, new ProtocolStatus(e));
> }
> }}
> With this modification I suddenly see such messages in the logfile:
> {{2017-10-25 22:09:31,865 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching 
> ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
> 2017-10-25 22:09:32,147 WARN  org.apache.nutch.protocol.ftp.Ftp - Could not 
> get ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
> java.net.MalformedURLException
>   at java.net.URL.<init>(URL.java:627)
>   at java.net.URL.<init>(URL.java:490)
>   at java.net.URL.<init>(URL.java:439)
>   at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:145)
>   at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> Caused by: java.lang.NullPointerException
> }}
> Please note that the URL was not configured by me. Instead it was obtained by 
> crawling my NAS. Also the URL looks perfectly fine to me. Even if the file 
> did not exist I would not expect a MalformedURLException to occur. Even more, 
> using Firefox and the same authentication data on the same URL retrieves the 
> file successfully.
> How come Nutch cannot get the file?





[jira] [Commented] (NUTCH-2470) CrawlDbReader -stats to show quantiles of score

2017-12-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278496#comment-16278496
 ] 

Hudson commented on NUTCH-2470:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3474 (See 
[https://builds.apache.org/job/Nutch-trunk/3474/])
NUTCH-2470 CrawlDbReader -stats to show quantiles of score - improve (snagel: 
[https://github.com/apache/nutch/commit/08c2fb9d024741425f57537c18dc706b1f861bdc])
* (edit) ivy/ivy.xml
* (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java


> CrawlDbReader -stats to show quantiles of score
> ---
>
> Key: NUTCH-2470
> URL: https://issues.apache.org/jira/browse/NUTCH-2470
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.14
>
>
> The command "readdb -stats" shows for the CrawlDatum score min., max. and 
> average. Median and quartiles (quantiles, in general) would complete the 
> statistics to get an impression how scores are distributed.
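
To make "median and quartiles" concrete, here is a small exact computation using the nearest-rank definition on an in-memory array. It is a sketch under that assumption only: the class name is hypothetical, and a CrawlDb with billions of scores would likely need an approximate, mergeable data structure instead, which this example does not attempt.

```java
import java.util.Arrays;

// Hypothetical sketch: nearest-rank quantiles of a small set of scores.
class ScoreQuantiles {

  /** Returns the q-quantile (0 <= q <= 1) using the nearest-rank method. */
  static float quantile(float[] scores, double q) {
    if (scores.length == 0 || q < 0.0 || q > 1.0) {
      throw new IllegalArgumentException("need at least one score and 0 <= q <= 1");
    }
    float[] sorted = scores.clone();
    Arrays.sort(sorted);
    // Nearest rank: the smallest rank r (1-based) with r >= q * n.
    int rank = (int) Math.ceil(q * sorted.length);
    return sorted[Math.max(rank, 1) - 1];
  }

  public static void main(String[] args) {
    float[] scores = {4f, 1f, 3f, 2f, 5f};
    System.out.println("25%: " + quantile(scores, 0.25));
    System.out.println("50%: " + quantile(scores, 0.5));
    System.out.println("75%: " + quantile(scores, 0.75));
  }
}
```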



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2399) indexer-elastic does not index multi-value fields (only the first value is indexed)

2017-12-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280135#comment-16280135
 ] 

Hudson commented on NUTCH-2399:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3475 (See 
[https://builds.apache.org/job/Nutch-trunk/3475/])
NUTCH-2399 Add support for multivalue fields on indexer-elastic 
(jorge-luis.betancourt: 
[https://github.com/apache/nutch/commit/106a215cbd430a13e29ee590e948e198abf6445c])
* (edit) 
src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java


> indexer-elastic does not index multi-value fields (only the first value is 
> indexed)
> ---
>
> Key: NUTCH-2399
> URL: https://issues.apache.org/jira/browse/NUTCH-2399
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Reporter: Yossi Tamari
> Fix For: 1.14
>
>
> Currently, if there is a NutchField with multiple values, only the first 
> value is indexed (because this is what doc.getFieldValue returns). Pull 
> request #200 checks if the NutchField has multiple values, and if so, they 
> are added as an array (multivalue) field.
> [https://github.com/apache/nutch/pull/200]
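
The check described above (single value versus array) can be sketched independently of Nutch and Elasticsearch. The class and method below are hypothetical and operate on a plain list of values; they are not the code from pull request #200.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: decide what to hand to the index writer for a field.
class MultiValueFieldSketch {

  /**
   * Returns the single value for a one-element field, the whole list for a
   * multi-valued field, and null for an empty field.
   */
  static Object toIndexValue(List<?> values) {
    if (values.isEmpty()) {
      return null;
    }
    // Taking only the first element here is what drops the other values.
    return values.size() > 1 ? values : values.get(0);
  }

  public static void main(String[] args) {
    System.out.println(toIndexValue(Arrays.asList("news")));
    System.out.println(toIndexValue(Arrays.asList("news", "sports")));
  }
}
```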





[jira] [Commented] (NUTCH-2438) Upgrade Nutch 2.X to Gora 0.8

2017-12-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16289976#comment-16289976
 ] 

Hudson commented on NUTCH-2438:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1598 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1598/])
fix for NUTCH-2438 contributed by tmzzngl (tulay.muezzinoglu: 
[https://github.com/apache/nutch/commit/95417aa9fbebcfdd516930dc4aa370b0a343994c])
* (edit) build.xml
* (edit) ivy/ivy.xml
NUTCH-2438 Upgrade Nutch 2.X to Gora 0.8 (lewis.mcgibbney: 
[https://github.com/apache/nutch/commit/5370102135d91e1054adf13c0345159873b4a2ef])
* (edit) src/java/org/apache/nutch/storage/WebPage.java
* (edit) src/java/org/apache/nutch/storage/Host.java
* (edit) ivy/ivy.xml


> Upgrade Nutch 2.X to Gora 0.8
> -
>
> Key: NUTCH-2438
> URL: https://issues.apache.org/jira/browse/NUTCH-2438
> Project: Nutch
>  Issue Type: Improvement
>  Components: storage
>Affects Versions: 2.4
>Reporter: Tulay Muezzinoglu
>Assignee: Lewis John McGibbney
> Fix For: 2.4
>
>






[jira] [Commented] (NUTCH-2474) CrawlDbReader -stats fails with ClassCastException

2017-12-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16291049#comment-16291049
 ] 

Hudson commented on NUTCH-2474:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3477 (See 
[https://builds.apache.org/job/Nutch-trunk/3477/])
NUTCH-2474 CrawlDbReader -stats fails with ClassCastException - replace 
(snagel: 
[https://github.com/apache/nutch/commit/12e14ac0604298e09672287ba20ccb13a56d4fd7])
* (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java


> CrawlDbReader -stats fails with ClassCastException
> --
>
> Key: NUTCH-2474
> URL: https://issues.apache.org/jira/browse/NUTCH-2474
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.14
> Environment: Java 8, distributed mode: Hadoop CDH 5.13.0
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Critical
> Fix For: 1.14
>
>
> In distributed mode CrawlDbReader / readdb -stats fails with a 
> ClassCastException in the combiner:
> {noformat}
> 17/12/08 04:57:13 INFO mapreduce.Job: Task Id : 
> attempt_1512553291624_0022_m_39_0, Status : FAILED
> Error: java.lang.ClassCastException: org.apache.hadoop.io.FloatWritable 
> cannot be cast to org.apache.hadoop.io.LongWritable
> at 
> org.apache.nutch.crawl.CrawlDbReader$CrawlDbStatCombiner.reduce(CrawlDbReader.java:296)
> at 
> org.apache.nutch.crawl.CrawlDbReader$CrawlDbStatCombiner.reduce(CrawlDbReader.java:222)
> at 
> org.apache.hadoop.mapred.Task$OldCombinerRunner.combine(Task.java:1639)
> at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1946)
> at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1514)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:466)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
> {noformat}
> FloatWritables are used since NUTCH-2470, so that's when this bug was 
> introduced.





[jira] [Commented] (NUTCH-2035) Regex filter using case sensitive rules.

2017-12-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292802#comment-16292802
 ] 

Hudson commented on NUTCH-2035:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3478 (See 
[https://builds.apache.org/job/Nutch-trunk/3478/])
NUTCH-2035 urlfilter-regex case insensitive rules (snagel: 
[https://github.com/apache/nutch/commit/df14c8a0a19e4f670d75ecd7ae2a22c3d8eeb0b6])
* (edit) conf/regex-urlfilter.txt.template


> Regex filter using case sensitive rules.
> 
>
> Key: NUTCH-2035
> URL: https://issues.apache.org/jira/browse/NUTCH-2035
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.10
>Reporter: Luis Lopez
>Assignee: Sebastian Nagel
>Priority: Minor
>  Labels: filters, regex, regex-urlfilter
> Fix For: 2.4, 1.14
>
> Attachments: regex-urlfilter.txt
>
>
> Regular expressions are computationally expensive, and listing both cases 
> (“EXE|exe|JPG|jpg”, etc.) adds up if we use complex rules.
> Regex filter should use case insensitive rules to make the rules more 
> readable and improve performance.
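
The readability argument can be illustrated with java.util.regex, which supports the embedded flag (?i) for case-insensitive matching. The two patterns below are made-up examples, not the rules from conf/regex-urlfilter.txt.template, and the sketch makes no claim about the relative speed of the two forms.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: one suffix rule written with explicit case variants
// and once with the case-insensitive flag.
class CaseInsensitiveSuffixRule {

  // Every extension has to be listed in each spelling that should match.
  static final Pattern EXPLICIT = Pattern.compile("\\.(EXE|exe|JPG|jpg)$");

  // (?i) makes the rest of the pattern case-insensitive.
  static final Pattern INSENSITIVE = Pattern.compile("(?i)\\.(exe|jpg)$");

  static boolean matches(Pattern rule, String url) {
    return rule.matcher(url).find();
  }

  public static void main(String[] args) {
    String url = "http://example.com/photo.JpG";
    System.out.println("explicit:    " + matches(EXPLICIT, url));
    System.out.println("insensitive: " + matches(INSENSITIVE, url));
  }
}
```

Mixed-case spellings such as ".JpG" are covered by the second rule but not by the first, which is a side effect worth keeping in mind when converting existing rules.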



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

