[jira] [Commented] (NUTCH-2415) Create a JEXL based IndexingFilter

2017-12-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294076#comment-16294076
 ] 

ASF GitHub Bot commented on NUTCH-2415:
---

YossiTamari commented on a change in pull request #219: NUTCH-2415 : Create a 
JEXL based IndexingFilter
URL: https://github.com/apache/nutch/pull/219#discussion_r157364581
 
 

 ##
 File path: build.xml
 ##
 @@ -1042,6 +1042,8 @@
 
 
 
+
 
 Review comment:
   @sebastian-nagel regarding default.properties: Is that the package name? 
Because the package name is org.apache.nutch.indexer.filter, which is already 
there. Should I change the package name? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Create a JEXL based IndexingFilter
> --
>
> Key: NUTCH-2415
> URL: https://issues.apache.org/jira/browse/NUTCH-2415
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Affects Versions: 1.13
>Reporter: Yossi Tamari
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>
> Following on from NUTCH-2414 and NUTCH-2412, the requirement was raised for an 
> IndexingFilter plugin which decides whether to index a document based on 
> a JEXL expression.
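
For illustration, a minimal sketch of what such a filter could look like, assuming 
the Commons JEXL 3 API and the Nutch 1.x IndexingFilter interface. This is not the 
code under review in PR #219, and the property name index.jexl.filter.expr is 
purely illustrative.

{code}
import org.apache.commons.jexl3.JexlBuilder;
import org.apache.commons.jexl3.JexlContext;
import org.apache.commons.jexl3.JexlEngine;
import org.apache.commons.jexl3.JexlExpression;
import org.apache.commons.jexl3.MapContext;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

/** Sketch only: skips a document when a configured JEXL expression is not true. */
public class JexlIndexingFilter implements IndexingFilter {

  private Configuration conf;
  private JexlExpression expr;

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    JexlEngine jexl = new JexlBuilder().strict(false).create();
    // Hypothetical property, e.g. "doc.getFieldValue('host') != 'example.com'"
    expr = jexl.createExpression(conf.get("index.jexl.filter.expr", "true"));
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) {
    JexlContext ctx = new MapContext();
    ctx.set("doc", doc);
    ctx.set("url", url.toString());
    ctx.set("datum", datum);
    // Returning null tells the indexing job to drop this document.
    return Boolean.TRUE.equals(expr.evaluate(ctx)) ? doc : null;
  }
}
{code}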



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (NUTCH-2365) HTTP Redirects to SubDomains don't get crawled

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2365:
--

Assignee: Sebastian Nagel

> HTTP Redirects to SubDomains don't get crawled
> --
>
> Key: NUTCH-2365
> URL: https://issues.apache.org/jira/browse/NUTCH-2365
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.12
> Environment: Fedora 25
>Reporter: Sriram Nookala
>Assignee: Sebastian Nagel
> Fix For: 1.14
>
>
> When crawling http://www.mercenarytrader.com, which redirects to 
> https://members.mercenarytrader.com, the redirect does not get followed by 
> Nutch even though 'db.ignore.external.links' is set to 'true' and 
> 'db.ignore.external.links.mode' is set to 'byDomain'.
> The bug is in FetcherThread, where the comparison is by host and not by 
> domain:
> String origHost = new URL(urlString).getHost().toLowerCase();
> String newHost = new URL(newUrl).getHost().toLowerCase();
> if (ignoreExternalLinks) {
>   if (!origHost.equals(newHost)) {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug(" - ignoring redirect " + redirType + " from "
>           + urlString + " to " + newUrl
>           + " because external links are ignored");
>     }
>     return null;
>   }
> }
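
For illustration only, a hedged sketch of a domain-level comparison (not 
necessarily the change made for this issue), assuming Nutch's 
org.apache.nutch.util.URLUtil.getDomainName helper:

{code}
import java.net.MalformedURLException;
import java.net.URL;
import org.apache.nutch.util.URLUtil;

public class RedirectDomainCheck {
  /** True if both URLs share the same registered domain ("byDomain" semantics). */
  public static boolean sameDomain(String urlString, String newUrl)
      throws MalformedURLException {
    String origDomain = URLUtil.getDomainName(new URL(urlString)).toLowerCase();
    String newDomain = URLUtil.getDomainName(new URL(newUrl)).toLowerCase();
    // http://www.mercenarytrader.com and https://members.mercenarytrader.com
    // both map to "mercenarytrader.com", so the redirect would be kept.
    return origDomain.equals(newDomain);
  }
}
{code}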



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2365) HTTP Redirects to SubDomains don't get crawled

2017-12-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294078#comment-16294078
 ] 

ASF GitHub Bot commented on NUTCH-2365:
---

sebastian-nagel opened a new pull request #264: NUTCH-2365 Fetcher to respect 
db.ignore.external.links.mode for redirects
URL: https://github.com/apache/nutch/pull/264
 
 
   - restructure method handleRedirects: result of URL filters is checked early
   - simplify debug logging calls


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> HTTP Redirects to SubDomains don't get crawled
> --
>
> Key: NUTCH-2365
> URL: https://issues.apache.org/jira/browse/NUTCH-2365
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.12
> Environment: Fedora 25
>Reporter: Sriram Nookala
> Fix For: 1.14
>
>
> When crawling http://www.mercenarytrader.com, which redirects to 
> https://members.mercenarytrader.com, the redirect does not get followed by 
> Nutch even though 'db.ignore.external.links' is set to 'true' and 
> 'db.ignore.external.links.mode' is set to 'byDomain'.
> The bug is in FetcherThread, where the comparison is by host and not by 
> domain:
> String origHost = new URL(urlString).getHost().toLowerCase();
> String newHost = new URL(newUrl).getHost().toLowerCase();
> if (ignoreExternalLinks) {
>   if (!origHost.equals(newHost)) {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug(" - ignoring redirect " + redirType + " from "
>           + urlString + " to " + newUrl
>           + " because external links are ignored");
>     }
>     return null;
>   }
> }



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2365) HTTP Redirects to SubDomains don't get crawled

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2365:
---
Patch Info: Patch Available

> HTTP Redirects to SubDomains don't get crawled
> --
>
> Key: NUTCH-2365
> URL: https://issues.apache.org/jira/browse/NUTCH-2365
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.12
> Environment: Fedora 25
>Reporter: Sriram Nookala
>Assignee: Sebastian Nagel
> Fix For: 1.14
>
>
> When crawling http://www.mercenarytrader.com, which redirects to 
> https://members.mercenarytrader.com, the redirect does not get followed by 
> Nutch even though 'db.ignore.external.links' is set to 'true' and 
> 'db.ignore.external.links.mode' is set to 'byDomain'.
> The bug is in FetcherThread, where the comparison is by host and not by 
> domain:
> String origHost = new URL(urlString).getHost().toLowerCase();
> String newHost = new URL(newUrl).getHost().toLowerCase();
> if (ignoreExternalLinks) {
>   if (!origHost.equals(newHost)) {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug(" - ignoring redirect " + redirType + " from "
>           + urlString + " to " + newUrl
>           + " because external links are ignored");
>     }
>     return null;
>   }
> }



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2365) HTTP Redirects to SubDomains don't get crawled if

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2365:
---
Summary: HTTP Redirects to SubDomains don't get crawled if   (was: HTTP 
Redirects to SubDomains don't get crawled)

> HTTP Redirects to SubDomains don't get crawled if 
> --
>
> Key: NUTCH-2365
> URL: https://issues.apache.org/jira/browse/NUTCH-2365
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.12
> Environment: Fedora 25
>Reporter: Sriram Nookala
>Assignee: Sebastian Nagel
> Fix For: 1.14
>
>
> When crawling http://www.mercenarytrader.com, which redirects to 
> https://members.mercenarytrader.com, the redirect does not get followed by 
> Nutch even though 'db.ignore.external.links' is set to 'true' and 
> 'db.ignore.external.links.mode' is set to 'byDomain'.
> The bug is in FetcherThread, where the comparison is by host and not by 
> domain:
> String origHost = new URL(urlString).getHost().toLowerCase();
> String newHost = new URL(newUrl).getHost().toLowerCase();
> if (ignoreExternalLinks) {
>   if (!origHost.equals(newHost)) {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug(" - ignoring redirect " + redirType + " from "
>           + urlString + " to " + newUrl
>           + " because external links are ignored");
>     }
>     return null;
>   }
> }



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2365) HTTP Redirects to SubDomains don't get crawled if db.ignore.external.links.mode == byDomain

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2365:
---
Summary: HTTP Redirects to SubDomains don't get crawled if 
db.ignore.external.links.mode == byDomain  (was: HTTP Redirects to SubDomains 
don't get crawled if )

> HTTP Redirects to SubDomains don't get crawled if 
> db.ignore.external.links.mode == byDomain
> ---
>
> Key: NUTCH-2365
> URL: https://issues.apache.org/jira/browse/NUTCH-2365
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.12
> Environment: Fedora 25
>Reporter: Sriram Nookala
>Assignee: Sebastian Nagel
> Fix For: 1.14
>
>
> When crawling http://www.mercenarytrader.com, which redirects to 
> https://members.mercenarytrader.com, the redirect does not get followed by 
> Nutch even though 'db.ignore.external.links' is set to 'true' and 
> 'db.ignore.external.links.mode' is set to 'byDomain'.
> The bug is in FetcherThread, where the comparison is by host and not by 
> domain:
> String origHost = new URL(urlString).getHost().toLowerCase();
> String newHost = new URL(newUrl).getHost().toLowerCase();
> if (ignoreExternalLinks) {
>   if (!origHost.equals(newHost)) {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug(" - ignoring redirect " + redirType + " from "
>           + urlString + " to " + newUrl
>           + " because external links are ignored");
>     }
>     return null;
>   }
> }



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2239) Selenium Handlers for Ajax Patterns from Student submissions

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2239:
---
Fix Version/s: (was: 1.14)
   1.15

> Selenium Handlers for Ajax Patterns from Student submissions
> 
>
> Key: NUTCH-2239
> URL: https://issues.apache.org/jira/browse/NUTCH-2239
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher, protocol
>Reporter: Raghav Bharadwaj Jayasimha Rao
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.15
>
>
> - Refactor student submissions from USC class of CSCI 572 to obtain a 
> comprehensive set of selenium handlers for various Ajax Patterns



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2152) CommonCrawl dump via Service endpoint

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2152:
---
Fix Version/s: (was: 1.14)
   1.15

> CommonCrawl dump via Service endpoint
> -
>
> Key: NUTCH-2152
> URL: https://issues.apache.org/jira/browse/NUTCH-2152
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api
>Affects Versions: 1.12
>Reporter: Sujen Shah
>Assignee: Sujen Shah
>  Labels: memex
> Fix For: 1.15
>
> Attachments: NUTCH-2152.git.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-1749) Optionally exclude title from content field

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1749:
---
Fix Version/s: (was: 1.14)
   1.15

> Optionally exclude title from content field
> ---
>
> Key: NUTCH-1749
> URL: https://issues.apache.org/jira/browse/NUTCH-1749
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.7
>Reporter: Greg Padiasek
> Fix For: 1.15
>
> Attachments: DOMContentUtils.patch
>
>
> The HTML parser plugin inserts the document title into the document content. 
> Since the title alone can be retrieved via DOMContentUtils.getTitle() and the 
> content via DOMContentUtils.getText(), there is no need to duplicate the title 
> in the content. When the title is included in the content it becomes difficult 
> or impossible to extract the document body without it, which matters when a 
> user wants to index or display body and title separately.
> Attached is a patch which prevents the HTML parser plugin from including the 
> title in the document content.
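
For illustration, a hedged sketch of extracting title and body separately with 
DOMContentUtils, which is what the attached patch is meant to enable (the body no 
longer repeating the title). The method parameters stand in for HtmlParser state; 
this is not the patch itself.

{code}
import org.apache.nutch.parse.html.DOMContentUtils;
import org.w3c.dom.DocumentFragment;

public class TitleBodySketch {
  /** Returns {title, body} extracted from an already parsed DOM fragment. */
  public static String[] titleAndBody(DOMContentUtils utils, DocumentFragment root) {
    StringBuffer title = new StringBuffer();
    StringBuffer body = new StringBuffer();
    utils.getTitle(title, root); // <title> text only
    utils.getText(body, root);   // page text; with the patch, no duplicated title
    return new String[] { title.toString(), body.toString() };
  }
}
{code}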



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-1917) index.parse.md, index.content.md and index.db.md should support wildcard

2017-12-17 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294081#comment-16294081
 ] 

Sebastian Nagel commented on NUTCH-1917:


Also needs a hint in nutch-default.xml.

> index.parse.md, index.content.md and index.db.md should support wildcard
> 
>
> Key: NUTCH-1917
> URL: https://issues.apache.org/jira/browse/NUTCH-1917
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
> Fix For: 1.15
>
> Attachments: MetadataIndexer.java.patch
>
>
> Right now metatags.names supports the '*' character as a catch-all.
> I believe that the above index properties should also support a catch-all as a 
> mechanism for quickly building augmented data models from crawl data. 
> Identifying and manually including tags one by one is error-prone and 
> time-consuming.
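
For illustration, a hedged sketch of one possible catch-all behaviour for 
index.parse.md (this is not the attached MetadataIndexer.java.patch; the method 
and property handling are assumptions):

{code}
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.metadata.Metadata;

public class ParseMdWildcardSketch {
  /** Copies parse metadata into the document; "*" means every available key. */
  public static void addParseMeta(NutchDocument doc, Metadata parseMeta,
      String configured) {
    String[] keys = "*".equals(configured.trim())
        ? parseMeta.names()               // catch-all, like metatags.names today
        : configured.split("\\s*,\\s*");  // explicit comma-separated key list
    for (String key : keys) {
      for (String value : parseMeta.getValues(key)) {
        doc.add(key, value);
      }
    }
  }
}
{code}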



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-1917) index.parse.md, index.content.md and index.db.md should support wildcard

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1917:
---
Fix Version/s: (was: 1.14)
   1.15

> index.parse.md, index.content.md and index.db.md should support wildcard
> 
>
> Key: NUTCH-1917
> URL: https://issues.apache.org/jira/browse/NUTCH-1917
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
> Fix For: 1.15
>
> Attachments: MetadataIndexer.java.patch
>
>
> Right now metatags.names supports the '*' character as a catch-all.
> I believe that the above index properties should also support a catch-all as a 
> mechanism for quickly building augmented data models from crawl data. 
> Identifying and manually including tags one by one is error-prone and 
> time-consuming.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2415) Create a JEXL based IndexingFilter

2017-12-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294085#comment-16294085
 ] 

ASF GitHub Bot commented on NUTCH-2415:
---

sebastian-nagel commented on a change in pull request #219: NUTCH-2415 : Create 
a JEXL based IndexingFilter
URL: https://github.com/apache/nutch/pull/219#discussion_r157365100
 
 

 ##
 File path: build.xml
 ##
 @@ -1042,6 +1042,8 @@
 
 
 
+
 
 Review comment:
   Yes, of course, the package name should be descriptive, e.g. 
org.apache.nutch.indexer.jexl. Thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Create a JEXL based IndexingFilter
> --
>
> Key: NUTCH-2415
> URL: https://issues.apache.org/jira/browse/NUTCH-2415
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Affects Versions: 1.13
>Reporter: Yossi Tamari
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>
> Following on from NUTCH-2414 and NUTCH-2412, the requirement was raised for an 
> IndexingFilter plugin which decides whether to index a document based on 
> a JEXL expression.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2184:
---
Fix Version/s: (was: 1.14)
   1.15

> Enable IndexingJob to function with no crawldb
> --
>
> Key: NUTCH-2184
> URL: https://issues.apache.org/jira/browse/NUTCH-2184
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.15
>
> Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can 
> 'lose' data structures which are currently considered critical, e.g. 
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no 
> accompanying crawldb or linkdb. 
> Absence of the latter is OK as linkdb is optional; however, currently in 
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java]
>  crawldb is mandatory. 
> This ticket should enhance the IndexerMapReduce code to support the use case 
> where you ONLY have segments and want to force an index for every record 
> present.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2140) Atomic update and optimistic concurrency update using Solr

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2140:
---
Fix Version/s: (was: 1.14)
   1.15

> Atomic update and optimistic concurrency update using Solr
> --
>
> Key: NUTCH-2140
> URL: https://issues.apache.org/jira/browse/NUTCH-2140
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Affects Versions: 1.9
>Reporter: Roannel Fernández Hernández
> Fix For: 1.15
>
> Attachments: NUTCH-2140-v1.patch, NUTCH-2140-v2.patch
>
>
> The SOLRIndexWriter plugin allows documents to be indexed into a Solr server. 
> The plugin replaces documents that are already indexed in Solr. 
> Sometimes it is useful to replace only one field, or to add new fields and 
> keep the other values of the indexed document.
> Solr supports two approaches for this task: atomic update and optimistic 
> concurrency update. However, the SOLRIndexWriter plugin doesn't support these 
> approaches.
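
For reference, a minimal standalone SolrJ sketch of the two Solr features 
mentioned: an atomic update (a map with a "set" operation as the field value) and 
optimistic concurrency (the _version_ field). The core URL, field names and 
version value are assumptions, and this is independent of the attached patches.

{code}
import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrAtomicUpdateSketch {
  public static void main(String[] args) throws Exception {
    SolrClient solr =
        new HttpSolrClient.Builder("http://localhost:8983/solr/nutch").build();

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "http://www.example.com/");
    // Atomic update: replace only "title", keep the other stored field values.
    doc.addField("title", Collections.singletonMap("set", "New title"));
    // Optimistic concurrency: reject the update unless the stored _version_ matches.
    doc.addField("_version_", 1234567890123456789L);

    solr.add(doc);
    solr.commit();
    solr.close();
  }
}
{code}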



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2186) -addBinaryContent flag can cause "String length must be a multiple of four" error in IndexingJob

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2186:
---
Fix Version/s: (was: 1.14)
   1.15

> -addBinaryContent flag can cause "String length must be a multiple of four" 
> error in IndexingJob
> 
>
> Key: NUTCH-2186
> URL: https://issues.apache.org/jira/browse/NUTCH-2186
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.11
> Environment: Apache Nutch 1.12-SNAPSHOT (trunk as of date this issue 
> was opened)
> Apache Solr 4.10.2
>Reporter: Lewis John McGibbney
> Fix For: 1.15
>
>
> When using the following indexing command
> {code}
> ./runtime/local/bin/nutch index -crawldb 
> /usr/local/trunk_new1/esdswg_crawl/crawldb/ -linkdb 
> /usr/local/trunk_new1/esdswg_crawl/linkdb/ -segmentDir 
> /usr/local/trunk_new1/esdswg_crawl/segments -addBinaryContent -deleteGone
> {code}
> I am able to generate the following error in my Solr logs
> {code}
> msg=String length must be a multiple of four.
>   at 
> org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:178)
>   at 
> org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
>   at 
> org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:238)
>   at 
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
>   at 
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
>   at 
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
>   at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:926)
>   at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1080)
>   at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:692)
>   at 
> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>   at 
> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
>   at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
>   at 
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:99)
>   at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>   at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
>   at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>   at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
>   at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
>   at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>   at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>   at org.eclipse.jetty.server.Server.handle(Server.java:368)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
>   at 
> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
>   at org.eclipse.jetty.http.HttpParser.

[jira] [Updated] (NUTCH-2251) Make CommonCrawlFormatJackson instance reusable by properly handling object state

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2251:
---
Fix Version/s: (was: 1.14)
   1.15

> Make CommonCrawlFormatJackson instance reusable by properly handling object 
> state
> -
>
> Key: NUTCH-2251
> URL: https://issues.apache.org/jira/browse/NUTCH-2251
> Project: Nutch
>  Issue Type: Sub-task
>  Components: commoncrawl
>Reporter: Thamme Gowda
> Fix For: 1.15
>
>
> The class `CommonCrawlFormatJackson` keeps appending documents when it is 
> used to format more than one document. 
> This class shall be modified to handle its state such that the same instance 
> can be reused instead of creating a new one for each document being dumped.
> This suggestion was already mentioned in the previous fix related to the 
> format issue: https://github.com/apache/nutch/pull/103
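
For illustration, a hedged sketch of the suggested pattern in isolation, using 
plain Jackson rather than the actual CommonCrawlFormatJackson API: keep one 
formatter instance and reset its buffer at the start of each document instead of 
creating a new instance per document.

{code}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;

public class ReusableJsonFormatterSketch {
  private final JsonFactory factory = new JsonFactory();
  private final ByteArrayOutputStream out = new ByteArrayOutputStream();

  /** Formats one document; safe to call repeatedly on the same instance. */
  public String format(String url, String content) throws IOException {
    out.reset(); // drop whatever the previous document left behind
    try (JsonGenerator gen = factory.createGenerator(out)) {
      gen.writeStartObject();
      gen.writeStringField("url", url);
      gen.writeStringField("content", content);
      gen.writeEndObject();
    }
    return out.toString("UTF-8");
  }
}
{code}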



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2207) Remove class duplication and smarten-up scoring-similarity plugin

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2207:
---
Fix Version/s: (was: 1.14)
   1.15

> Remove class duplication and smarten-up scoring-similarity plugin
> -
>
> Key: NUTCH-2207
> URL: https://issues.apache.org/jira/browse/NUTCH-2207
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, scoring
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.15
>
>
> Right now it appears that DocumentVector.java is duplicated, and there is no 
> license header on 
> [ScoringFilterModel.java|https://github.com/apache/nutch/blob/trunk/src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/ScoringFilterModel.java].
>  I think I've also spotted a number of places where imports are not being 
> used. Finally, Javadoc is virtually non-existent for the scoring-similarity 
> plugin, so it would help to add some documentation. 
> It would be very helpful if the [SimilarityScoringFilter wiki 
> page|https://wiki.apache.org/nutch/SimilarityScoringFilter] were cited.
> We could also do with visiting the wiki page to ensure that all references are 
> present.
> CC [~sujenshah]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2380) indexer-elastic version bump

2017-12-17 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294091#comment-16294091
 ] 

Sebastian Nagel commented on NUTCH-2380:


[~jurian], I was able to apply the patch and unit tests pass. Can you confirm 
that the upgrade to Elasticsearch 5.3.0 works? I don't have the time to test 
it. Otherwise let's fix it in Nutch 1.15. Thanks!

> indexer-elastic version bump
> 
>
> Key: NUTCH-2380
> URL: https://issues.apache.org/jira/browse/NUTCH-2380
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2380-indexer-elastic-p0.patch
>
>
> The current version of the indexer-elastic plugin is not compatible with ES 
> 5.x. The patch bumps the ES lib version to 5.3 but also requires a Nutch 
> classloader fix (NUTCH-2378) due to runtime dependency issues. 
> I didn't test compatibility with ES 2.x, so not sure if that still works.
> Please let me know what you think of the provided patch.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2237) DeduplicationJob: Add extra order criteria based on slug

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2237:
---
Fix Version/s: (was: 1.14)
   1.15

> DeduplicationJob: Add extra order criteria based on slug
> 
>
> Key: NUTCH-2237
> URL: https://issues.apache.org/jira/browse/NUTCH-2237
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ron van der Vegt
> Fix For: 1.15
>
> Attachments: NUTCH-2237.patch, NUTCH-2237.patch
>
>
> Currently the user can select the main document, when signatures are the same, 
> based on score, URL length and fetch time. The quality of the slug, based 
> mainly on the number of meaningful characters, could give users more 
> flexibility to distinguish between slugified URLs and URLs based on a page id.
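
For illustration, a hedged sketch of one possible slug-quality measure (not the 
attached patch): count the letters in the last path segment, so a slugified URL 
like /why-nutch-rocks outranks an id-based URL like /node/48213 when the other 
criteria are equal.

{code}
import java.net.MalformedURLException;
import java.net.URL;

public class SlugQualitySketch {
  /** Higher value = more "meaningful" characters in the last path segment. */
  public static int slugQuality(String url) throws MalformedURLException {
    String path = new URL(url).getPath();
    String slug = path.substring(path.lastIndexOf('/') + 1);
    int letters = 0;
    for (char c : slug.toCharArray()) {
      if (Character.isLetter(c)) {
        letters++;
      }
    }
    return letters;
  }
}
{code}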



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2267) Solr indexer fails at the end of the job with a java error message

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2267:
---
Fix Version/s: (was: 1.14)
   1.15

> Solr indexer fails at the end of the job with a java error message
> --
>
> Key: NUTCH-2267
> URL: https://issues.apache.org/jira/browse/NUTCH-2267
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.12
> Environment: hadoop v2.7.2  solr6 in cloud configuration with 
> zookeeper 3.4.6. I use the master branch from github currently on commit 
> da252eb7b3d2d7b70   ( NUTCH - 2263 mingram and maxgram support for Unigram 
> Cosine Similarity Model is provided. )
>Reporter: kaveh minooie
>Assignee: Lewis John McGibbney
> Fix For: 1.15
>
>
> this is what I was getting first:
> 16/05/23 13:52:27 INFO mapreduce.Job:  map 100% reduce 100%
> 16/05/23 13:52:27 INFO mapreduce.Job: Task Id : 
> attempt_1462499602101_0119_r_00_0, Status : FAILED
> Error: Bad return type
> Exception Details:
>   Location:
> org/apache/solr/client/solrj/impl/HttpClientUtil.createClient(Lorg/apache/solr/common/params/SolrParams;Lorg/apache/http/conn/ClientConnectionManager;)Lorg/apache/http/impl/client/CloseableHttpClient;
>  @58: areturn
>   Reason:
> Type 'org/apache/http/impl/client/DefaultHttpClient' (current frame, 
> stack[0]) is not assignable to 
> 'org/apache/http/impl/client/CloseableHttpClient' (from method signature)
>   Current Frame:
> bci: @58
> flags: { }
> locals: { 'org/apache/solr/common/params/SolrParams', 
> 'org/apache/http/conn/ClientConnectionManager', 
> 'org/apache/solr/common/params/ModifiableSolrParams', 
> 'org/apache/http/impl/client/DefaultHttpClient' }
> stack: { 'org/apache/http/impl/client/DefaultHttpClient' }
>   Bytecode:
> 0x000: bb00 0359 2ab7 0004 4db2 0005 b900 0601
> 0x010: 0099 001e b200 05bb 0007 59b7 0008 1209
> 0x020: b600 0a2c b600 0bb6 000c b900 0d02 002b
> 0x030: b800 104e 2d2c b800 0f2d b0
>   Stackmap Table:
> append_frame(@47,Object[#143])
> 16/05/23 13:52:28 INFO mapreduce.Job:  map 100% reduce 0% 
> As you can see, the failed reducer gets re-spawned. Then I found this issue: 
> https://issues.apache.org/jira/browse/SOLR-7657 and updated my hadoop 
> config file. After that, the indexer seems to be able to finish (I got the 
> documents into Solr, it seems) but I still get the error message at the end 
> of the job:
> 16/05/23 16:39:26 INFO mapreduce.Job:  map 100% reduce 99%
> 16/05/23 16:39:44 INFO mapreduce.Job:  map 100% reduce 100%
> 16/05/23 16:39:57 INFO mapreduce.Job: Job job_1464045047943_0001 completed 
> successfully
> 16/05/23 16:39:58 INFO mapreduce.Job: Counters: 53
>   File System Counters
>   FILE: Number of bytes read=42700154855
>   FILE: Number of bytes written=70210771807
>   FILE: Number of read operations=0
>   FILE: Number of large read operations=0
>   FILE: Number of write operations=0
>   HDFS: Number of bytes read=8699202825
>   HDFS: Number of bytes written=0
>   HDFS: Number of read operations=537
>   HDFS: Number of large read operations=0
>   HDFS: Number of write operations=0
>   Job Counters 
>   Launched map tasks=134
>   Launched reduce tasks=1
>   Data-local map tasks=107
>   Rack-local map tasks=27
>   Total time spent by all maps in occupied slots (ms)=49377664
>   Total time spent by all reduces in occupied slots (ms)=32765064
>   Total time spent by all map tasks (ms)=3086104
>   Total time spent by all reduce tasks (ms)=1365211
>   Total vcore-milliseconds taken by all map tasks=3086104
>   Total vcore-milliseconds taken by all reduce tasks=1365211
>   Total megabyte-milliseconds taken by all map tasks=12640681984
>   Total megabyte-milliseconds taken by all reduce tasks=8387856384
>   Map-Reduce Framework
>   Map input records=25305474
>   Map output records=25305474
>   Map output bytes=27422869763
>   Map output materialized bytes=27489888004
>   Input split bytes=15225
>   Combine input records=0
>   Combine output records=0
>   Reduce input groups=16061459
>   Reduce shuffle bytes=27489888004
>   Reduce input records=25305474
>   Reduce output records=230
>   Spilled Records=54688613
>   Shuffled Maps =134
>   Failed Shuffles=0
>   Merged Map outputs=134
>   GC tim

[jira] [Commented] (NUTCH-2267) Solr indexer fails at the end of the job with a java error message

2017-12-17 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294096#comment-16294096
 ] 

Sebastian Nagel commented on NUTCH-2267:


Is this still an issue after the upgrade to Solr 6.6.0 (NUTCH-2400)?

> Solr indexer fails at the end of the job with a java error message
> --
>
> Key: NUTCH-2267
> URL: https://issues.apache.org/jira/browse/NUTCH-2267
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.12
> Environment: hadoop v2.7.2  solr6 in cloud configuration with 
> zookeeper 3.4.6. I use the master branch from github currently on commit 
> da252eb7b3d2d7b70   ( NUTCH - 2263 mingram and maxgram support for Unigram 
> Cosine Similarity Model is provided. )
>Reporter: kaveh minooie
>Assignee: Lewis John McGibbney
> Fix For: 1.15
>
>
> this is what I was getting first:
> 16/05/23 13:52:27 INFO mapreduce.Job:  map 100% reduce 100%
> 16/05/23 13:52:27 INFO mapreduce.Job: Task Id : 
> attempt_1462499602101_0119_r_00_0, Status : FAILED
> Error: Bad return type
> Exception Details:
>   Location:
> org/apache/solr/client/solrj/impl/HttpClientUtil.createClient(Lorg/apache/solr/common/params/SolrParams;Lorg/apache/http/conn/ClientConnectionManager;)Lorg/apache/http/impl/client/CloseableHttpClient;
>  @58: areturn
>   Reason:
> Type 'org/apache/http/impl/client/DefaultHttpClient' (current frame, 
> stack[0]) is not assignable to 
> 'org/apache/http/impl/client/CloseableHttpClient' (from method signature)
>   Current Frame:
> bci: @58
> flags: { }
> locals: { 'org/apache/solr/common/params/SolrParams', 
> 'org/apache/http/conn/ClientConnectionManager', 
> 'org/apache/solr/common/params/ModifiableSolrParams', 
> 'org/apache/http/impl/client/DefaultHttpClient' }
> stack: { 'org/apache/http/impl/client/DefaultHttpClient' }
>   Bytecode:
> 0x000: bb00 0359 2ab7 0004 4db2 0005 b900 0601
> 0x010: 0099 001e b200 05bb 0007 59b7 0008 1209
> 0x020: b600 0a2c b600 0bb6 000c b900 0d02 002b
> 0x030: b800 104e 2d2c b800 0f2d b0
>   Stackmap Table:
> append_frame(@47,Object[#143])
> 16/05/23 13:52:28 INFO mapreduce.Job:  map 100% reduce 0% 
> As you can see, the failed reducer gets re-spawned. Then I found this issue: 
> https://issues.apache.org/jira/browse/SOLR-7657 and updated my hadoop 
> config file. After that, the indexer seems to be able to finish (I got the 
> documents into Solr, it seems) but I still get the error message at the end 
> of the job:
> 16/05/23 16:39:26 INFO mapreduce.Job:  map 100% reduce 99%
> 16/05/23 16:39:44 INFO mapreduce.Job:  map 100% reduce 100%
> 16/05/23 16:39:57 INFO mapreduce.Job: Job job_1464045047943_0001 completed 
> successfully
> 16/05/23 16:39:58 INFO mapreduce.Job: Counters: 53
>   File System Counters
>   FILE: Number of bytes read=42700154855
>   FILE: Number of bytes written=70210771807
>   FILE: Number of read operations=0
>   FILE: Number of large read operations=0
>   FILE: Number of write operations=0
>   HDFS: Number of bytes read=8699202825
>   HDFS: Number of bytes written=0
>   HDFS: Number of read operations=537
>   HDFS: Number of large read operations=0
>   HDFS: Number of write operations=0
>   Job Counters 
>   Launched map tasks=134
>   Launched reduce tasks=1
>   Data-local map tasks=107
>   Rack-local map tasks=27
>   Total time spent by all maps in occupied slots (ms)=49377664
>   Total time spent by all reduces in occupied slots (ms)=32765064
>   Total time spent by all map tasks (ms)=3086104
>   Total time spent by all reduce tasks (ms)=1365211
>   Total vcore-milliseconds taken by all map tasks=3086104
>   Total vcore-milliseconds taken by all reduce tasks=1365211
>   Total megabyte-milliseconds taken by all map tasks=12640681984
>   Total megabyte-milliseconds taken by all reduce tasks=8387856384
>   Map-Reduce Framework
>   Map input records=25305474
>   Map output records=25305474
>   Map output bytes=27422869763
>   Map output materialized bytes=27489888004
>   Input split bytes=15225
>   Combine input records=0
>   Combine output records=0
>   Reduce input groups=16061459
>   Reduce shuffle bytes=27489888004
>   Reduce input records=25305474
>   Reduce output records=230
>   Spilled Records=54688613
>   Shuffled Maps =134
>   Failed Shuf

[jira] [Commented] (NUTCH-2478) // is not a valid base URL

2017-12-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294101#comment-16294101
 ] 

ASF GitHub Bot commented on NUTCH-2478:
---

sebastian-nagel closed pull request #263: NUTCH-2478 parser resolve base url
URL: https://github.com/apache/nutch/pull/263
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/java/org/apache/nutch/util/DomUtil.java 
b/src/java/org/apache/nutch/util/DomUtil.java
index e93477a43..b4f0eac82 100644
--- a/src/java/org/apache/nutch/util/DomUtil.java
+++ b/src/java/org/apache/nutch/util/DomUtil.java
@@ -31,7 +31,9 @@
 import javax.xml.transform.stream.StreamResult;
 
 import org.apache.xerces.parsers.DOMParser;
+import org.w3c.dom.DocumentFragment;
 import org.w3c.dom.Element;
+import org.w3c.dom.NodeList;
 import org.xml.sax.InputSource;
 import org.xml.sax.SAXException;
 
@@ -103,4 +105,11 @@ public static void saveDom(OutputStream os, Element e) {
   LOG.error("Error: ", ex);
 }
   }
+
+  public static void saveDom(OutputStream os, DocumentFragment doc) {
+NodeList docChildren = doc.getChildNodes();
+for (int i = 0; i < docChildren.getLength(); i++) {
+  saveDom(os, (Element) docChildren.item(i));
+}
+  }
 }
diff --git 
a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
 
b/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
index 4527dd7b4..1f1061d39 100644
--- 
a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
+++ 
b/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
@@ -254,7 +254,7 @@ public boolean getTitle(StringBuffer sb, Node node) {
   }
 
   /** If Node contains a BASE tag then it's HREF is returned. */
-  public URL getBase(Node node) {
+  public String getBase(Node node) {
 
 NodeWalker walker = new NodeWalker(node);
 
@@ -276,10 +276,7 @@ public URL getBase(Node node) {
   for (int i = 0; i < attrs.getLength(); i++) {
 Node attr = attrs.item(i);
 if ("href".equalsIgnoreCase(attr.getNodeName())) {
-  try {
-return new URL(attr.getNodeValue());
-  } catch (MalformedURLException e) {
-  }
+  return attr.getNodeValue();
 }
   }
 }
diff --git 
a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java 
b/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
index 7f60939ae..9ed9fa4ee 100644
--- a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
+++ b/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
@@ -207,11 +207,19 @@ public ParseResult getParse(Content content) {
 
 if (!metaTags.getNoFollow()) { // okay to follow links
   ArrayList l = new ArrayList(); // extract outlinks
-  URL baseTag = utils.getBase(root);
+  URL baseTag = base;
+  String baseTagHref = utils.getBase(root);
+  if (baseTagHref != null) {
+try {
+  baseTag = new URL(base, baseTagHref);
+} catch (MalformedURLException e) {
+  baseTag = base;
+}
+  }
   if (LOG.isTraceEnabled()) {
 LOG.trace("Getting links...");
   }
-  utils.getOutlinks(baseTag != null ? baseTag : base, l, root);
+  utils.getOutlinks(baseTag, l, root);
   outlinks = l.toArray(new Outlink[l.size()]);
   if (LOG.isTraceEnabled()) {
 LOG.trace("found " + outlinks.length + " outlinks in "
diff --git 
a/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java
 
b/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java
index 0b3920667..a4c820674 100644
--- 
a/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java
+++ 
b/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java
@@ -19,10 +19,12 @@
 
 import java.lang.invoke.MethodHandles;
 import java.nio.charset.Charset;
+import java.nio.charset.StandardCharsets;
 
 import org.apache.hadoop.conf.Configuration;
 import org.apache.nutch.metadata.Metadata;
 import org.apache.nutch.parse.html.HtmlParser;
+import org.apache.nutch.parse.Outlink;
 import org.apache.nutch.parse.Parse;
 import org.apache.nutch.parse.Parser;
 import org.apache.nutch.protocol.Content;
@@ -78,17 +80,26 @@
   { "HTML5, utf-16, BOM", "utf-16",
   "\ufeff\n\n\n" + encodingTestContent } };
 
+  private static final String resolveBaseUrlTestContent = //
+  "\n\n" + //
+  "  Test Resolve Base URLs (NUTCH-2478)\n" + //
+  "  \n" + //
+  "\n\n" + //
+  "  outlink\n" + //
+  "\n";
+
  

[jira] [Resolved] (NUTCH-2478) // is not a valid base URL

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2478.

Resolution: Fixed

Thanks, [~markus17]!

> // is not a valid base URL
> --
>
> Key: NUTCH-2478
> URL: https://issues.apache.org/jira/browse/NUTCH-2478
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
>
> This test fails:
> {code}
>   @Test
>   public void testBadResolver() throws Exception {
>     URL base = new URL("//www.example.org/");
>     String target = "index/produkt/kanaly/";
> 
>     URL abs = URLUtil.resolveURL(base, target);
>     Assert.assertEquals("http://www.example.org/index/produkt/kanaly/",
>         abs.toString());
>   }
> {code}
> and has to fail because of the invalid base URL, so the current URL is used. 
> If the current URL is not /, its path will be prepended, resulting in 404s 
> being crawled.
> This ticket must allow // as a base, and resolve the protocol.
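
The committed fix (see the HtmlParser hunk in PR #263 above) resolves the base 
href against the page URL. As a standalone illustration of why that handles a 
protocol-relative base, using only java.net.URL (the example URLs are arbitrary):

{code}
import java.net.URL;

public class ProtocolRelativeBaseSketch {
  public static void main(String[] args) throws Exception {
    URL page = new URL("http://www.example.com/some/page.html");
    // A network-path reference inherits the scheme of the context URL.
    URL base = new URL(page, "//www.example.org/");    // http://www.example.org/
    URL link = new URL(base, "index/produkt/kanaly/"); // http://www.example.org/index/produkt/kanaly/
    System.out.println(base);
    System.out.println(link);
  }
}
{code}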



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2459) Nutch cannot download/parse some files via FTP

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2459:
---
Fix Version/s: 1.15

> Nutch cannot download/parse some files via FTP
> --
>
> Key: NUTCH-2459
> URL: https://issues.apache.org/jira/browse/NUTCH-2459
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.13
> Environment: Ubuntu 16.04.3 LTS
> OpenJDK 1.8.0_131
> nutch 1.14-SNAPSHOT
> Synology RS816
>Reporter: Hiran Chaudhuri
> Fix For: 1.15
>
>
> I tried running Nutch on my Synology NAS. As the SMB protocol is not supported 
> by Nutch, I turned on the FTP service on the NAS and configured Nutch to crawl 
> ftp://nas.
> The experience gives me varying results which seem to point to problems 
> within Nutch. However, this may need further evaluation.
> As some files could not be downloaded and I could not see a good error 
> message, I changed the method 
> org.apache.nutch.protocol.ftp.FTP.getProtocolOutput(Text, CrawlDatum) to not 
> only return the protocol status but also send the full exception and stack 
> trace to the logs:
> {{ } catch (Exception e) {
>     LOG.warn("Could not get {}", url, e);
>     return new ProtocolOutput(null, new ProtocolStatus(e));
>   }
> }}
> With this modification I suddenly see such messages in the logfile:
> {{2017-11-09 23:44:56,135 WARN  org.apache.nutch.protocol.ftp.Ftp - Error: 
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
> at java.util.LinkedList.checkElementIndex(LinkedList.java:555)
> at java.util.LinkedList.get(LinkedList.java:476)
> at 
> org.apache.nutch.protocol.ftp.FtpResponse.getFileAsHttpResponse(FtpResponse.java:327)
> at 
> org.apache.nutch.protocol.ftp.FtpResponse.(FtpResponse.java:267)
> at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:133)
> at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> 2017-11-09 23:44:56,135 ERROR org.apache.nutch.protocol.ftp.Ftp - Could not 
> get protocol output for ftp://nas/MediaPC/boot/memtest86+.elf
> org.apache.nutch.protocol.ftp.FtpException: 
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>   at 
> org.apache.nutch.protocol.ftp.FtpResponse.(FtpResponse.java:309)
>   at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:133)
>   at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>   at java.util.LinkedList.checkElementIndex(LinkedList.java:555)
>   at java.util.LinkedList.get(LinkedList.java:476)
>   at 
> org.apache.nutch.protocol.ftp.FtpResponse.getFileAsHttpResponse(FtpResponse.java:327)
> }}
> I cannot tell what the URLs showing this problem have in common. They seem 
> to be regular files; however, a lot of other regular files can be fetched and 
> parsed successfully. As far as I understand the source code, at least one 
> outgoing link is expected:
> {{
> FTPFile ftpFile = (FTPFile) list.get(0);
> }}
> Can this be safely assumed for all files? Or should there rather be a check 
> whether outgoing links were found?
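
For illustration, a hedged sketch of the guard the reporter suggests (not a 
committed fix), assuming Commons Net's FTPFile listing as used by protocol-ftp:

{code}
import java.util.List;
import org.apache.commons.net.ftp.FTPFile;

public class FtpListingGuardSketch {
  /** Returns the first entry, or null when the server returned an empty listing. */
  public static FTPFile firstEntryOrNull(List<FTPFile> list) {
    if (list == null || list.isEmpty()) {
      // Lets the caller report a proper protocol error instead of
      // failing with an IndexOutOfBoundsException.
      return null;
    }
    return list.get(0);
  }
}
{code}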



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2478) // is not a valid base URL

2017-12-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294112#comment-16294112
 ] 

Hudson commented on NUTCH-2478:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3483 (See 
[https://builds.apache.org/job/Nutch-trunk/3483/])
NUTCH-2478 HTML parser should resolve base URL  - fix (snagel: 
[https://github.com/apache/nutch/commit/607e7d950a2f3399db161b5a6770b40bf1d60c1a])
* (edit) 
src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
* (edit) 
src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
* (edit) 
src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
* (edit) 
src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java
* (edit) 
src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
NUTCH-2478 HTML parser should resolve base URL  - finally 
(snagel: 
[https://github.com/apache/nutch/commit/2aec79f13b04e022f0c30830a5e621cfcfffc88d])
* (edit) 
src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
* (add) src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestHtmlParser.java
* (edit) 
src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java
* (edit) 
src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
* (edit) src/java/org/apache/nutch/util/DomUtil.java


> // is not a valid base URL
> --
>
> Key: NUTCH-2478
> URL: https://issues.apache.org/jira/browse/NUTCH-2478
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
>
> This test fails:
> {code}
>   @Test
>   public void testBadResolver() throws Exception {
>     URL base = new URL("//www.example.org/");
>     String target = "index/produkt/kanaly/";
> 
>     URL abs = URLUtil.resolveURL(base, target);
>     Assert.assertEquals("http://www.example.org/index/produkt/kanaly/",
>         abs.toString());
>   }
> {code}
> and has to fail because of the invalid base URL, so the current URL is used. 
> If the current URL is not /, its path will be prepended, resulting in 404s 
> being crawled.
> This ticket must allow // as a base, and resolve the protocol.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2477) Refactor *Checker classes to use base class for common code

2017-12-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294117#comment-16294117
 ] 

ASF GitHub Bot commented on NUTCH-2477:
---

sebastian-nagel commented on issue #256: fix for NUTCH-2477 (refactor checker 
classes) contributed by Jurian Broertjes
URL: https://github.com/apache/nutch/pull/256#issuecomment-352250750
 
 
   +1
   
   Thanks, looks good to me. I'm going to commit this to 1.14.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Refactor *Checker classes to use base class for common code
> ---
>
> Key: NUTCH-2477
> URL: https://issues.apache.org/jira/browse/NUTCH-2477
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
>  Labels: pull-request-available
>
> The various Checker class implementations have quite a bit of duplicated code 
> in them. This should be refactored for cleanliness and maintainability.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2477) Refactor *Checker classes to use base class for common code

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2477:
---
Fix Version/s: 1.14

> Refactor *Checker classes to use base class for common code
> ---
>
> Key: NUTCH-2477
> URL: https://issues.apache.org/jira/browse/NUTCH-2477
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.14
>
>
> The various Checker class implementations have quite a bit of duplicated code 
> in them. This should be refactored for cleanliness and maintainability.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2477) Refactor *Checker classes to use base class for common code

2017-12-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294137#comment-16294137
 ] 

ASF GitHub Bot commented on NUTCH-2477:
---

sebastian-nagel closed pull request #256: fix for NUTCH-2477 (refactor checker 
classes) contributed by Jurian Broertjes
URL: https://github.com/apache/nutch/pull/256
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java 
b/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
index 05caf5a56..549163830 100644
--- a/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
+++ b/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
@@ -17,23 +17,14 @@
 
 package org.apache.nutch.indexer;
 
-import java.io.BufferedReader;
-import java.io.InputStreamReader;
-import java.io.PrintWriter;
 import java.lang.invoke.MethodHandles;
-import java.net.ServerSocket;
-import java.net.Socket;
-import java.net.InetSocketAddress;
-import java.nio.charset.Charset;
 import java.util.HashMap;
 import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
 
-import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapred.JobConf;
-import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;
 import org.apache.nutch.crawl.CrawlDatum;
 import org.apache.nutch.crawl.Inlinks;
@@ -52,6 +43,7 @@
 import org.apache.nutch.scoring.ScoringFilters;
 import org.apache.nutch.util.NutchConfiguration;
 import org.apache.nutch.util.StringUtil;
+import org.apache.nutch.util.AbstractChecker;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
@@ -65,41 +57,33 @@
  * @author Julien Nioche
  **/
 
-public class IndexingFiltersChecker extends Configured implements Tool {
+public class IndexingFiltersChecker extends AbstractChecker {
 
   protected URLNormalizers normalizers = null;
   protected boolean dumpText = false;
   protected boolean followRedirects = false;
-  protected boolean keepClientCnxOpen = false;
   // used to simulate the metadata propagated from injection
   protected HashMap metadata = new HashMap<>();
-  protected int tcpPort = -1;
 
   private static final Logger LOG = LoggerFactory
   .getLogger(MethodHandles.lookup().lookupClass());
 
-  public IndexingFiltersChecker() {
-
-  }
-
   public int run(String[] args) throws Exception {
 String url = null;
-String usage = "Usage: IndexingFiltersChecker [-normalize] 
[-followRedirects] [-dumpText] [-md key=value] [-listen ] 
[-keepClientCnxOpen]";
+usage = "Usage: IndexingFiltersChecker [-normalize] [-followRedirects] 
[-dumpText] [-md key=value] (-stdin | -listen  [-keepClientCnxOpen])";
 
-if (args.length == 0) {
+// Print help when no args given
+if (args.length < 1) {
   System.err.println(usage);
-  return -1;
+  System.exit(-1);
 }
 
+int numConsumed;
 for (int i = 0; i < args.length; i++) {
   if (args[i].equals("-normalize")) {
 normalizers = new URLNormalizers(getConf(), 
URLNormalizers.SCOPE_DEFAULT);
-  } else if (args[i].equals("-listen")) {
-tcpPort = Integer.parseInt(args[++i]);
   } else if (args[i].equals("-followRedirects")) {
 followRedirects = true;
-  } else if (args[i].equals("-keepClientCnxOpen")) {
-keepClientCnxOpen = true;
   } else if (args[i].equals("-dumpText")) {
 dumpText = true;
   } else if (args[i].equals("-md")) {
@@ -112,104 +96,27 @@ public int run(String[] args) throws Exception {
 } else
   k = nextOne;
 metadata.put(k, v);
+  } else if ((numConsumed = super.parseArgs(args, i)) > 0) {
+i += numConsumed - 1;
   } else if (i != args.length - 1) {
+System.err.println("ERR: Not a recognized argument: " + args[i]);
 System.err.println(usage);
 System.exit(-1);
   } else {
-url =args[i];
+url = args[i];
   }
 }
 
-// In listening mode?
-if (tcpPort == -1) {
-  // No, just fetch and display
-  StringBuilder output = new StringBuilder();
-  int ret = fetch(url, output);
-  System.out.println(output);
-  return ret;
+if (url != null) {
+  return super.processSingle(url);
 } else {
-  // Listen on socket and start workers on incoming requests
-  listen();
-}
-
-return 0;
-  }
-  
-  protected void listen() throws Exception {
-ServerSocket server = null;
-
-try{
-  server = new ServerSocket();
-  server.bind(new InetSocketAddress(tcpPort));
-  LOG.info(server.toString());
-} catch (Exception e) {
-  LOG.error("Coul

[jira] [Updated] (NUTCH-2431) URLFilterchecker to implement Tool-interface

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2431:
---
Summary: URLFilterchecker to implement Tool-interface  (was: Filterchecker 
to implement Tool-interface)

> URLFilterchecker to implement Tool-interface
> 
>
> Key: NUTCH-2431
> URL: https://issues.apache.org/jira/browse/NUTCH-2431
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
>  Labels: urlfilter
> Attachments: NUTCH-2431.patch
>
>
> The current implementation of the URLFilterChecker does not allow for 
> command-line config overrides. It needs to implement the Tool interface for 
> this. 
> Please see the attached patch.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (NUTCH-2477) Refactor *Checker classes to use base class for common code

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2477.

Resolution: Fixed

Thanks, [~jurian]! PR merged. I've extended the command-line help for the URL 
filter and normalizer checkers and documented that the plugin name must now be 
given instead of the class name. Good idea: I always had to look up the correct 
class name; plugin names are easier to remember.

> Refactor *Checker classes to use base class for common code
> ---
>
> Key: NUTCH-2477
> URL: https://issues.apache.org/jira/browse/NUTCH-2477
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.14
>
>
> The various Checker class implementations have quite a bit of duplicated code 
> in them. This should be refactored for cleanliness and maintainability.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2320) URLFilterChecker to run as TCP Telnet service

2017-12-17 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294142#comment-16294142
 ] 

Sebastian Nagel commented on NUTCH-2320:


This is resolved for 1.14 with NUTCH-2477, correct?

> URLFilterChecker to run as TCP Telnet service
> -
>
> Key: NUTCH-2320
> URL: https://issues.apache.org/jira/browse/NUTCH-2320
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2320.patch, NUTCH-2320.patch
>
>
> Allow testing URL filters for web applications, just like the indexing filters 
> checker.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2338) URLNormalizerChecker to run as TCP Telnet service

2017-12-17 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294143#comment-16294143
 ] 

Sebastian Nagel commented on NUTCH-2338:


This is resolved for 1.14 with NUTCH-2477, correct?

> URLNormalizerChecker to run as TCP Telnet service
> -
>
> Key: NUTCH-2338
> URL: https://issues.apache.org/jira/browse/NUTCH-2338
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2338.patch
>
>
> Similar to NUTCH-2320, but for the normalizer checker.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2431) URLFilterchecker to implement Tool-interface

2017-12-17 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294144#comment-16294144
 ] 

Sebastian Nagel commented on NUTCH-2431:


This is resolved for 1.14 with NUTCH-2477, correct?

> URLFilterchecker to implement Tool-interface
> 
>
> Key: NUTCH-2431
> URL: https://issues.apache.org/jira/browse/NUTCH-2431
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
>  Labels: urlfilter
> Attachments: NUTCH-2431.patch
>
>
> The current implementation of the URLFilterChecker does not allow for 
> commandline config overrides. It needs to implement the Tool interface for 
> this. 
> Please see the attached patch



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2415) Create a JEXL based IndexingFilter

2017-12-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294146#comment-16294146
 ] 

ASF GitHub Bot commented on NUTCH-2415:
---

YossiTamari commented on issue #219: NUTCH-2415 : Create a JEXL based 
IndexingFilter
URL: https://github.com/apache/nutch/pull/219#issuecomment-352255599
 
 
   All above comments were addressed in the last commit.
   Note that due to package renaming the history of this PR is a bit confusing.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Create a JEXL based IndexingFilter
> --
>
> Key: NUTCH-2415
> URL: https://issues.apache.org/jira/browse/NUTCH-2415
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Affects Versions: 1.13
>Reporter: Yossi Tamari
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>
> Following on NUTCH-2414 and NUTCH-2412, the requirement was raised for an 
> IndexingFilter plugin which will decide whether to index a document based on 
> a JEXL expression.
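
A minimal sketch of how such a condition could be evaluated with Apache Commons 
JEXL (commons-jexl2). The field names and the expression below are made-up 
examples, not the plugin's actual configuration:

{code}
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.jexl2.Expression;
import org.apache.commons.jexl2.JexlContext;
import org.apache.commons.jexl2.JexlEngine;
import org.apache.commons.jexl2.MapContext;

public class JexlConditionSketch {
  public static void main(String[] args) {
    JexlEngine jexl = new JexlEngine();
    // Made-up example expression; a JEXL-based indexing filter would read its
    // expression from the plugin configuration instead.
    Expression expr = jexl.createExpression(
        "doc.lang == 'en' && doc.contentLength > 1000");

    Map<String, Object> doc = new HashMap<>();
    doc.put("lang", "en");
    doc.put("contentLength", 2048);

    JexlContext context = new MapContext();
    context.set("doc", doc);

    // Such a filter would return the document for indexing only if the
    // expression evaluates to true, and null (skip indexing) otherwise.
    Object result = expr.evaluate(context);
    System.out.println("index document? " + Boolean.TRUE.equals(result));
  }
}
{code}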



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2338) URLNormalizerChecker to run as TCP Telnet service

2017-12-17 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294148#comment-16294148
 ] 

Markus Jelsma commented on NUTCH-2338:
--

Yes! 

> URLNormalizerChecker to run as TCP Telnet service
> -
>
> Key: NUTCH-2338
> URL: https://issues.apache.org/jira/browse/NUTCH-2338
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2338.patch
>
>
> Similar to NUTCH-2320, but for the normalizer checker.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (NUTCH-2338) URLNormalizerChecker to run as TCP Telnet service

2017-12-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-2338.
--
Resolution: Duplicate

> URLNormalizerChecker to run as TCP Telnet service
> -
>
> Key: NUTCH-2338
> URL: https://issues.apache.org/jira/browse/NUTCH-2338
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2338.patch
>
>
> Similar to NUTCH-2320, but for the normalizer checker.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (NUTCH-2338) URLNormalizerChecker to run as TCP Telnet service

2017-12-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-2338.


> URLNormalizerChecker to run as TCP Telnet service
> -
>
> Key: NUTCH-2338
> URL: https://issues.apache.org/jira/browse/NUTCH-2338
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2338.patch
>
>
> Similar to NUTCH-2320, but for the normalizer checker.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (NUTCH-2478) // is not a valid base URL

2017-12-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-2478.


> // is not a valid base URL
> --
>
> Key: NUTCH-2478
> URL: https://issues.apache.org/jira/browse/NUTCH-2478
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
>
> This test fails:
> {code}
>   @Test
>   public void testBadResolver() throws Exception {
>     URL base = new URL("//www.example.org/");
>     String target = "index/produkt/kanaly/";
> 
>     URL abs = URLUtil.resolveURL(base, target);
>     Assert.assertEquals("http://www.example.org/index/produkt/kanaly/", abs.toString());
>   }
> {code}
> and it has to fail because of the invalid base URL, so the current URL is used 
> instead. If the current URL's path is not /, that path will be prepended, 
> resulting in 404s being crawled.
> This ticket must allow // as a base and resolve the protocol.
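
A minimal sketch of how a protocol-relative base could be handled before 
resolving, assuming the protocol of the referring page is known. This is only an 
illustration of the idea, not necessarily how URLUtil.resolveURL was changed:

{code}
import java.net.MalformedURLException;
import java.net.URL;

public class ProtocolRelativeBaseSketch {

  // Inherit the protocol of the referring page for a protocol-relative base.
  static URL resolve(URL currentUrl, String base, String target)
      throws MalformedURLException {
    if (base.startsWith("//")) {
      // "//www.example.org/" is not a valid java.net.URL on its own;
      // prepend the protocol of the page the base was found on.
      base = currentUrl.getProtocol() + ":" + base;
    }
    return new URL(new URL(base), target);
  }

  public static void main(String[] args) throws MalformedURLException {
    URL current = new URL("http://www.example.org/some/page.html");
    URL abs = resolve(current, "//www.example.org/", "index/produkt/kanaly/");
    System.out.println(abs); // http://www.example.org/index/produkt/kanaly/
  }
}
{code}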



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (NUTCH-2320) URLFilterChecker to run as TCP Telnet service

2017-12-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-2320.


Yes!

> URLFilterChecker to run as TCP Telnet service
> -
>
> Key: NUTCH-2320
> URL: https://issues.apache.org/jira/browse/NUTCH-2320
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2320.patch, NUTCH-2320.patch
>
>
> Allow testing URL filters for web applications, just like the indexing filters 
> checker.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2478) // is not a valid base URL

2017-12-17 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294150#comment-16294150
 ] 

Markus Jelsma commented on NUTCH-2478:
--

Thanks!

> // is not a valid base URL
> --
>
> Key: NUTCH-2478
> URL: https://issues.apache.org/jira/browse/NUTCH-2478
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
>
> This test fails:
> {code}
>   @Test
>   public void testBadResolver() throws Exception {
>     URL base = new URL("//www.example.org/");
>     String target = "index/produkt/kanaly/";
> 
>     URL abs = URLUtil.resolveURL(base, target);
>     Assert.assertEquals("http://www.example.org/index/produkt/kanaly/", abs.toString());
>   }
> {code}
> and it has to fail because of the invalid base URL, so the current URL is used 
> instead. If the current URL's path is not /, that path will be prepended, 
> resulting in 404s being crawled.
> This ticket must allow // as a base and resolve the protocol.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (NUTCH-2320) URLFilterChecker to run as TCP Telnet service

2017-12-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-2320.
--
Resolution: Duplicate

> URLFilterChecker to run as TCP Telnet service
> -
>
> Key: NUTCH-2320
> URL: https://issues.apache.org/jira/browse/NUTCH-2320
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2320.patch, NUTCH-2320.patch
>
>
> Allow testing URL filters for web applications, just like the indexing filters 
> checker.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2477) Refactor *Checker classes to use base class for common code

2017-12-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294159#comment-16294159
 ] 

Hudson commented on NUTCH-2477:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3484 (See 
[https://builds.apache.org/job/Nutch-trunk/3484/])
fix for NUTCH-2477 (refactor checker classes) contributed by Jurian (snagel: 
[https://github.com/apache/nutch/commit/9c684114337241bb7652619c11b7374b27204dcb])
* (edit) src/java/org/apache/nutch/net/URLFilters.java
* (add) src/java/org/apache/nutch/util/AbstractChecker.java
* (edit) src/java/org/apache/nutch/net/URLNormalizerChecker.java
* (edit) src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* (edit) src/java/org/apache/nutch/net/URLFilterChecker.java


> Refactor *Checker classes to use base class for common code
> ---
>
> Key: NUTCH-2477
> URL: https://issues.apache.org/jira/browse/NUTCH-2477
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.14
>
>
> The various Checker class implementations have quite a bit of duplicated code 
> in them. This should be refactored for cleanliness and maintainability.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (NUTCH-2322) URL not available for Jexl operations

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2322.

Resolution: Fixed

+1

Committed to 1.x 
([8b3412a8|https://github.com/apache/nutch/commit/8b3412a864b2bed1bdd633710ffa36b975d8e6bc]).
 Thanks!

> URL not available for Jexl operations
> -
>
> Key: NUTCH-2322
> URL: https://issues.apache.org/jira/browse/NUTCH-2322
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2322-1.11.patch, NUTCH-2322-1.11.patch, 
> NUTCH-2322.patch, NUTCH-2322.patch, NUTCH-2322.patch, NUTCH-2322.patch
>
>
> In CrawlDatum.evaluate(), the record's URL is just missing.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (NUTCH-2322) URL not available for Jexl operations

2017-12-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-2322.


Thanks!

> URL not available for Jexl operations
> -
>
> Key: NUTCH-2322
> URL: https://issues.apache.org/jira/browse/NUTCH-2322
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2322-1.11.patch, NUTCH-2322-1.11.patch, 
> NUTCH-2322.patch, NUTCH-2322.patch, NUTCH-2322.patch, NUTCH-2322.patch
>
>
> In CrawlDatum.evaluate(), the record's URL is just missing.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2370) FileDumper: save JSON mapping file -> URL

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2370:
---
Summary: FileDumper: save JSON mapping file -> URL  (was: Saving mapping of 
dumped file to URL)

> FileDumper: save JSON mapping file -> URL
> -
>
> Key: NUTCH-2370
> URL: https://issues.apache.org/jira/browse/NUTCH-2370
> Project: Nutch
>  Issue Type: Improvement
>  Components: dumpers
>Affects Versions: 1.14
>Reporter: Madhav Sharan
>Priority: Minor
> Fix For: 1.14
>
>
> - nutch dump [0] is a great tool to simply dump all the crawled files from 
> nutch segments.
> - After the dump we lose the information about the URL from which a file was 
> crawled. The URL is used to name the dumped file, but only as an MD5 hash.
> - With the `reverseUrlDirs` option one can figure out the URL by checking the 
> file path, but even that is more complicated than a simple mapping file.
> - With `flatdir` there is no way to know the actual URL.
> I am submitting a PR which edits [0] and saves a JSON file for each crawled 
> segment that maps each file path to its URL.
> [0] 
> https://github.com/apache/nutch/blob/3e2d3d456489bf52bc586dae0e2e71fb7aad8fe7/src/java/org/apache/nutch/tools/FileDumper.java



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2370) FileDumper: save JSON mapping file -> URL

2017-12-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294184#comment-16294184
 ] 

ASF GitHub Bot commented on NUTCH-2370:
---

sebastian-nagel closed pull request #180: fix for NUTCH-2370 contributed by 
msha...@usc.edu
URL: https://github.com/apache/nutch/pull/180
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/java/org/apache/nutch/tools/FileDumper.java 
b/src/java/org/apache/nutch/tools/FileDumper.java
index e8b0f46e8..31218bbb0 100644
--- a/src/java/org/apache/nutch/tools/FileDumper.java
+++ b/src/java/org/apache/nutch/tools/FileDumper.java
@@ -57,6 +57,7 @@
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
+import org.codehaus.jackson.map.ObjectMapper;
 /**
  * 
  * The file dumper tool enables one to reverse generate the raw content from
@@ -158,6 +159,7 @@ public void dump(File outputDir, File segmentRootDir, 
String[] mimeTypes, boolea
 for (File segment : segmentDirs) {
   LOG.info("Processing segment: [" + segment.getAbsolutePath() + "]");
   DataOutputStream doutputStream = null;
+  Map<String, String> filenameToUrl = new HashMap<String, String>();
 
   File segmentDir = new File(segment.getAbsolutePath(), Content.DIR_NAME);
   File[] partDirs = segmentDir.listFiles(file -> file.canRead() && 
file.isDirectory());
@@ -247,7 +249,7 @@ public void dump(File outputDir, File segmentRootDir, 
String[] mimeTypes, boolea
   } else {
 outputFullPath = String.format("%s/%s", fullDir, 
DumpFileUtil.createFileName(md5Ofurl, baseName, extension));
   }
-
+  filenameToUrl.put(outputFullPath, url);
   File outputFile = new File(outputFullPath);
 
   if (!outputFile.exists()) {
@@ -289,6 +291,10 @@ public void dump(File outputDir, File segmentRootDir, 
String[] mimeTypes, boolea
   }
 }
   }
+  // save filenameToUrl in a JSON file; for each segment there is one mapping file
+  String filenameToUrlFilePath = String.format("%s/%s_filenameToUrl.json", outputDir.getAbsolutePath(), segment.getName());
+  new ObjectMapper().writeValue(new File(filenameToUrlFilePath), filenameToUrl);
+  
 }
 LOG.info("Dumper File Stats: "
 + DumpFileUtil.displayFileTypes(typeCounts, filteredCounts));
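
A minimal sketch of how the per-segment mapping file written above could be read 
back with the same Jackson API. The file name below is an assumed example 
following the <segment>_filenameToUrl.json pattern:

{code}
import java.io.File;
import java.util.Map;

import org.codehaus.jackson.map.ObjectMapper;
import org.codehaus.jackson.type.TypeReference;

public class ReadFilenameToUrlSketch {
  public static void main(String[] args) throws Exception {
    // Assumed example path; adjust to the actual dump output directory and segment name.
    File mappingFile = new File("dump/20171217000000_filenameToUrl.json");
    Map<String, String> filenameToUrl = new ObjectMapper().readValue(
        mappingFile, new TypeReference<Map<String, String>>() {});
    for (Map.Entry<String, String> e : filenameToUrl.entrySet()) {
      System.out.println(e.getKey() + " <- " + e.getValue());
    }
  }
}
{code}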


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> FileDumper: save JSON mapping file -> URL
> -
>
> Key: NUTCH-2370
> URL: https://issues.apache.org/jira/browse/NUTCH-2370
> Project: Nutch
>  Issue Type: Improvement
>  Components: dumpers
>Affects Versions: 1.14
>Reporter: Madhav Sharan
>Priority: Minor
> Fix For: 1.14
>
>
> - nutch dump [0] is a great tool to simply dump all the crawled files from 
> nutch segments.
> - After the dump we lose the information about the URL from which a file was 
> crawled. The URL is used to name the dumped file, but only as an MD5 hash.
> - With the `reverseUrlDirs` option one can figure out the URL by checking the 
> file path, but even that is more complicated than a simple mapping file.
> - With `flatdir` there is no way to know the actual URL.
> I am submitting a PR which edits [0] and saves a JSON file for each crawled 
> segment that maps each file path to its URL.
> [0] 
> https://github.com/apache/nutch/blob/3e2d3d456489bf52bc586dae0e2e71fb7aad8fe7/src/java/org/apache/nutch/tools/FileDumper.java



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2370) FileDumper: save JSON mapping file -> URL

2017-12-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294183#comment-16294183
 ] 

ASF GitHub Bot commented on NUTCH-2370:
---

sebastian-nagel commented on issue #180: fix for NUTCH-2370 contributed by 
msha...@usc.edu
URL: https://github.com/apache/nutch/pull/180#issuecomment-352260567
 
 
   +1 LGTM! Thanks, @smadha!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> FileDumper: save JSON mapping file -> URL
> -
>
> Key: NUTCH-2370
> URL: https://issues.apache.org/jira/browse/NUTCH-2370
> Project: Nutch
>  Issue Type: Improvement
>  Components: dumpers
>Affects Versions: 1.14
>Reporter: Madhav Sharan
>Priority: Minor
> Fix For: 1.14
>
>
> - nutch dump [0] is a great tool to simply dump all the crawled files from 
> nutch segments.
> - After the dump we lose the information about the URL from which a file was 
> crawled. The URL is used to name the dumped file, but only as an MD5 hash.
> - With the `reverseUrlDirs` option one can figure out the URL by checking the 
> file path, but even that is more complicated than a simple mapping file.
> - With `flatdir` there is no way to know the actual URL.
> I am submitting a PR which edits [0] and saves a JSON file for each crawled 
> segment that maps each file path to its URL.
> [0] 
> https://github.com/apache/nutch/blob/3e2d3d456489bf52bc586dae0e2e71fb7aad8fe7/src/java/org/apache/nutch/tools/FileDumper.java



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2322) URL not available for Jexl operations

2017-12-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294186#comment-16294186
 ] 

Hudson commented on NUTCH-2322:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3485 (See 
[https://builds.apache.org/job/Nutch-trunk/3485/])
NUTCH-2322 URL not available for Jexl operations - apply patch (snagel: 
[https://github.com/apache/nutch/commit/8b3412a864b2bed1bdd633710ffa36b975d8e6bc])
* (edit) src/java/org/apache/nutch/crawl/Generator.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDatum.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java


> URL not available for Jexl operations
> -
>
> Key: NUTCH-2322
> URL: https://issues.apache.org/jira/browse/NUTCH-2322
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2322-1.11.patch, NUTCH-2322-1.11.patch, 
> NUTCH-2322.patch, NUTCH-2322.patch, NUTCH-2322.patch, NUTCH-2322.patch
>
>
> In CrawlDatum.evaluate(), the record's URL is just missing.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2321) Indexing filter checker leaks threads

2017-12-17 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294190#comment-16294190
 ] 

Sebastian Nagel commented on NUTCH-2321:


The patch does not apply anymore after NUTCH-2477. If it's still a problem, can 
you update the patch? Thanks!

> Indexing filter checker leaks threads
> -
>
> Key: NUTCH-2321
> URL: https://issues.apache.org/jira/browse/NUTCH-2321
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2321.patch
>
>
> Same issue as NUTCH-2320.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (NUTCH-2370) FileDumper: save JSON mapping file -> URL

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2370.

Resolution: Fixed

Thanks, [~msha...@usc.edu]!

> FileDumper: save JSON mapping file -> URL
> -
>
> Key: NUTCH-2370
> URL: https://issues.apache.org/jira/browse/NUTCH-2370
> Project: Nutch
>  Issue Type: Improvement
>  Components: dumpers
>Affects Versions: 1.14
>Reporter: Madhav Sharan
>Priority: Minor
> Fix For: 1.14
>
>
> - nutch dump [0] is a great tool to simply dump all the crawled files from 
> nutch segments.
> - After the dump we lose the information about the URL from which a file was 
> crawled. The URL is used to name the dumped file, but only as an MD5 hash.
> - With the `reverseUrlDirs` option one can figure out the URL by checking the 
> file path, but even that is more complicated than a simple mapping file.
> - With `flatdir` there is no way to know the actual URL.
> I am submitting a PR which edits [0] and saves a JSON file for each crawled 
> segment that maps each file path to its URL.
> [0] 
> https://github.com/apache/nutch/blob/3e2d3d456489bf52bc586dae0e2e71fb7aad8fe7/src/java/org/apache/nutch/tools/FileDumper.java



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (NUTCH-2370) FileDumper: save JSON mapping file -> URL

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2370:
--

Assignee: Sebastian Nagel

> FileDumper: save JSON mapping file -> URL
> -
>
> Key: NUTCH-2370
> URL: https://issues.apache.org/jira/browse/NUTCH-2370
> Project: Nutch
>  Issue Type: Improvement
>  Components: dumpers
>Affects Versions: 1.14
>Reporter: Madhav Sharan
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.14
>
>
> - nutch dump [0] is a great tool to simply dump all the crawled files from 
> nutch segments.
> - After the dump we lose the information about the URL from which a file was 
> crawled. The URL is used to name the dumped file, but only as an MD5 hash.
> - With the `reverseUrlDirs` option one can figure out the URL by checking the 
> file path, but even that is more complicated than a simple mapping file.
> - With `flatdir` there is no way to know the actual URL.
> I am submitting a PR which edits [0] and saves a JSON file for each crawled 
> segment that maps each file path to its URL.
> [0] 
> https://github.com/apache/nutch/blob/3e2d3d456489bf52bc586dae0e2e71fb7aad8fe7/src/java/org/apache/nutch/tools/FileDumper.java



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (NUTCH-2034) CrawlDB filtered documents counter.

2017-12-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2034.

Resolution: Fixed

Thanks, [~betolink]! Committed to 1.x 
([961c725a|https://github.com/apache/nutch/commit/961c725aba2d6013a343dca66f595d6f28293a7b]).

> CrawlDB filtered documents counter.
> ---
>
> Key: NUTCH-2034
> URL: https://issues.apache.org/jira/browse/NUTCH-2034
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.10
>Reporter: Luis Lopez
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: counters, crawldb, filter, info, regex
> Fix For: 1.14
>
>
> When we are doing big crawls we would like to know how many of the URLs are 
> being discarded by the regex filters; currently this is only reported by the 
> Inject class:
> Injector: Total number of urls rejected by filters: 0
> It would be nice to have a counter in the CrawlDb class so we know in every 
> round how many were discarded by our filters:
> CrawlDb update: Total number of URLs filtered by regex filters: 31415
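
A minimal sketch of how such a counter can be incremented in a map task using 
the classic Hadoop mapred API. The group and counter names, and the mapper's 
types, are illustrative assumptions, not necessarily what the committed change 
to CrawlDbFilter uses:

{code}
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class UrlFilterCounterSketch extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  @Override
  public void map(Text url, Text value, OutputCollector<Text, Text> output,
      Reporter reporter) throws IOException {
    String filtered = filter(url.toString());
    if (filtered == null) {
      // URL rejected by the filters: count it and drop the record
      reporter.incrCounter("CrawlDB filter", "URLs filtered", 1);
      return;
    }
    output.collect(new Text(filtered), value);
  }

  // Placeholder for the real URLFilters.filter(...) call.
  private String filter(String url) {
    return url.startsWith("http") ? url : null;
  }
}
{code}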



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2450) Remove FixMe in ParseOutputFormat

2017-12-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294200#comment-16294200
 ] 

ASF GitHub Bot commented on NUTCH-2450:
---

lewismc commented on issue #235: Fix for NUTCH-2450 by Kenneth McFarland
URL: https://github.com/apache/nutch/pull/235#issuecomment-352264301
 
 
   
[TestURLUtil](https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/util/TestURLUtil.java)
 testGetDomainName() or additional tests for that test class.
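
A hedged sketch of what such an additional test could look like; the expected 
values assume the default domain-suffix configuration shipped with Nutch:

{code}
import java.net.URL;

import org.junit.Assert;
import org.junit.Test;

import org.apache.nutch.util.URLUtil;

public class TestURLUtilSketch {

  @Test
  public void testGetDomainNameSketch() throws Exception {
    // expected values assume the default domain-suffixes configuration
    Assert.assertEquals("example.org",
        URLUtil.getDomainName(new URL("http://www.example.org/path")));
    Assert.assertEquals("example.co.uk",
        URLUtil.getDomainName(new URL("http://sub.example.co.uk/")));
  }
}
{code}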


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Remove FixMe in ParseOutputFormat
> -
>
> Key: NUTCH-2450
> URL: https://issues.apache.org/jira/browse/NUTCH-2450
> Project: Nutch
>  Issue Type: Bug
> Environment: master branch
>Reporter: Kenneth McFarland
>Assignee: Kenneth McFarland
>Priority: Minor
>
> ParseOutputFormat contains a few FixMe's that I've looked at. If a valid url 
> is created, it will always return valid results. There is a spot in the code 
> where the try catch is already done, so the predicate is satisfied and there 
> is no need to keep checking it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2034) CrawlDB filtered documents counter.

2017-12-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294209#comment-16294209
 ] 

Hudson commented on NUTCH-2034:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3486 (See 
[https://builds.apache.org/job/Nutch-trunk/3486/])
NUTCH-2034 CrawlDB update job to count documents in CrawlDb rejected by 
(snagel: 
[https://github.com/apache/nutch/commit/961c725aba2d6013a343dca66f595d6f28293a7b])
* (edit) src/java/org/apache/nutch/crawl/CrawlDb.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDbFilter.java


> CrawlDB filtered documents counter.
> ---
>
> Key: NUTCH-2034
> URL: https://issues.apache.org/jira/browse/NUTCH-2034
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.10
>Reporter: Luis Lopez
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: counters, crawldb, filter, info, regex
> Fix For: 1.14
>
>
> When we are doing big crawls we would like to know how many of the URLs are 
> being discarded by the regex filters; currently this is only reported by the 
> Inject class:
> Injector: Total number of urls rejected by filters: 0
> It would be nice to have a counter in the CrawlDb class so we know in every 
> round how many were discarded by our filters:
> CrawlDb update: Total number of URLs filtered by regex filters: 31415



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2370) FileDumper: save JSON mapping file -> URL

2017-12-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294208#comment-16294208
 ] 

Hudson commented on NUTCH-2370:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3486 (See 
[https://builds.apache.org/job/Nutch-trunk/3486/])
fix for NUTCH-2370 contributed by msha...@usc.edu (goyal.madhav: 
[https://github.com/apache/nutch/commit/fd6f20e6dfc9a4a7bbad3478e6af4469d9449cca])
* (edit) src/java/org/apache/nutch/tools/FileDumper.java


> FileDumper: save JSON mapping file -> URL
> -
>
> Key: NUTCH-2370
> URL: https://issues.apache.org/jira/browse/NUTCH-2370
> Project: Nutch
>  Issue Type: Improvement
>  Components: dumpers
>Affects Versions: 1.14
>Reporter: Madhav Sharan
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.14
>
>
> - nutch dump [0] is a great tool to simply dump all the crawled files from 
> nutch segments.
> - After dump we loose information about URL from which this file was crawled. 
> URL is used to name dumped file but that information is encrypted.
> - In `reverseUrlDirs` option one can figure out URL by checking the file path 
> but even accessing file path is little complicated than simple mapping file.
> - In `flatdir` there is no way to know actual URL.
> I am submitting a PR which edits [0] and saves a json for each crawled 
> segment which maps a file path to URL.
> [0] 
> https://github.com/apache/nutch/blob/3e2d3d456489bf52bc586dae0e2e71fb7aad8fe7/src/java/org/apache/nutch/tools/FileDumper.java



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2358) HostInjectorJob doesn't work

2017-12-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2358:

Fix Version/s: 2.4

> HostInjectorJob doesn't work
> 
>
> Key: NUTCH-2358
> URL: https://issues.apache.org/jira/browse/NUTCH-2358
> Project: Nutch
>  Issue Type: Bug
>  Components: hostdb
>Affects Versions: 2.3.1
>Reporter: Kiyonari Harigae
> Fix For: 2.4
>
> Attachments: NUTCH-2358.patch
>
>
> HostInjectorJob fails with an NPE because Host#getMetadata returns null when a 
> Host is instantiated with new Host().
> However, to run HostInjectorJob completely, GORA-503 also needs to be solved.
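
A defensive pattern that would avoid the NPE could look like the sketch below. 
The Host stand-in and the metadata key/value types are assumptions for 
illustration, not taken from the committed patch:

{code}
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

public class HostMetadataGuardSketch {

  // Stand-in for the Gora-generated Host bean used by HostInjectorJob.
  static class Host {
    private Map<CharSequence, ByteBuffer> metadata;
    Map<CharSequence, ByteBuffer> getMetadata() { return metadata; }
    void setMetadata(Map<CharSequence, ByteBuffer> m) { metadata = m; }
  }

  static void putMetadata(Host host, String key, byte[] value) {
    if (host.getMetadata() == null) {
      // new Host() leaves the metadata map null, which caused the NPE
      host.setMetadata(new HashMap<CharSequence, ByteBuffer>());
    }
    host.getMetadata().put(key, ByteBuffer.wrap(value));
  }

  public static void main(String[] args) {
    Host host = new Host();
    putMetadata(host, "_rs_", "1000".getBytes());
    System.out.println(host.getMetadata());
  }
}
{code}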



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2358) HostInjectorJob doesn't work

2017-12-17 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294236#comment-16294236
 ] 

Lewis John McGibbney commented on NUTCH-2358:
-

Thank you [~cloudysunny14], patch applied and committed to the 2.x branch.

> HostInjectorJob doesn't work
> 
>
> Key: NUTCH-2358
> URL: https://issues.apache.org/jira/browse/NUTCH-2358
> Project: Nutch
>  Issue Type: Bug
>  Components: hostdb
>Affects Versions: 2.3.1
>Reporter: Kiyonari Harigae
> Fix For: 2.4
>
> Attachments: NUTCH-2358.patch
>
>
> HostInjectorJob fails with an NPE because Host#getMetadata returns null when a 
> Host is instantiated with new Host().
> However, to run HostInjectorJob completely, GORA-503 also needs to be solved.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (NUTCH-2358) HostInjectorJob doesn't work

2017-12-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2358.
-
Resolution: Fixed

> HostInjectorJob doesn't work
> 
>
> Key: NUTCH-2358
> URL: https://issues.apache.org/jira/browse/NUTCH-2358
> Project: Nutch
>  Issue Type: Bug
>  Components: hostdb
>Affects Versions: 2.3.1
>Reporter: Kiyonari Harigae
> Fix For: 2.4
>
> Attachments: NUTCH-2358.patch
>
>
> HostInjectorJob fails with an NPE because Host#getMetadata returns null when a 
> Host is instantiated with new Host().
> However, to run HostInjectorJob completely, GORA-503 also needs to be solved.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2358) HostInjectorJob doesn't work

2017-12-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294244#comment-16294244
 ] 

Hudson commented on NUTCH-2358:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1600 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1600/])
NUTCH-2358 HostInjectorJob doesn't work (lewis.mcgibbney: 
[https://github.com/apache/nutch/commit/103f5c50faaa0c0cfbc05f6b75235f632bdc2ea6])
* (edit) src/java/org/apache/nutch/host/HostInjectorJob.java


> HostInjectorJob doesn't work
> 
>
> Key: NUTCH-2358
> URL: https://issues.apache.org/jira/browse/NUTCH-2358
> Project: Nutch
>  Issue Type: Bug
>  Components: hostdb
>Affects Versions: 2.3.1
>Reporter: Kiyonari Harigae
> Fix For: 2.4
>
> Attachments: NUTCH-2358.patch
>
>
> HostInjectorJob fails with an NPE because Host#getMetadata returns null when a 
> Host is instantiated with new Host().
> However, to run HostInjectorJob completely, GORA-503 also needs to be solved.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2483) Remove/replace indirect dependencies to org.json

2017-12-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294341#comment-16294341
 ] 

ASF GitHub Bot commented on NUTCH-2483:
---

sebastian-nagel opened a new pull request #265: NUTCH-2483 Remove/replace 
indirect dependencies to org.json
URL: https://github.com/apache/nutch/pull/265
 
 
   - exclude dependency for webarchive-commons: no classes requiring JSON are 
used
   - exclude dependency for wicket-bootstrap-extensions (transitive dep via 
closure-compiler):
 only one class is used from wicket-bootstrap-extensions which isn't 
related to JavaScript or JSON


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Remove/replace indirect dependencies to org.json
> 
>
> Key: NUTCH-2483
> URL: https://issues.apache.org/jira/browse/NUTCH-2483
> Project: Nutch
>  Issue Type: Bug
>  Components: deployment
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
> Fix For: 1.14
>
>
> As an indirect transitive dependency we ship with Nutch 1.x binary packages a 
> jar file of org.json whose [license|http://www.json.org/license.html] has been 
> for a year now among the [category 
> x|https://www.apache.org/legal/resolved.html#category-x] licenses (see also the 
> [license faq|https://www.apache.org/legal/resolved.html#json]).
> We should check whether the library is mandatory and then exclude or replace 
> it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2415) Create a JEXL based IndexingFilter

2017-12-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294346#comment-16294346
 ] 

ASF GitHub Bot commented on NUTCH-2415:
---

sebastian-nagel commented on issue #219: NUTCH-2415 : Create a JEXL based 
IndexingFilter
URL: https://github.com/apache/nutch/pull/219#issuecomment-352287161
 
 
   +1 looks good


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Create a JEXL based IndexingFilter
> --
>
> Key: NUTCH-2415
> URL: https://issues.apache.org/jira/browse/NUTCH-2415
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Affects Versions: 1.13
>Reporter: Yossi Tamari
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>
> Following on NUTCH-2414 and NUTCH-2412, the requirement was raised for an 
> IndexingFilter plugin which will decide whether to index a document based on 
> a JEXL expression.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)