date:20190927

[jira] [Commented] (NUTCH-2482) index-geoip not to add null values to document fields

2019-09-27 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939806#comment-16939806
 ] 

ASF GitHub Bot commented on NUTCH-2482:
---

lewismc commented on issue #476: NUTCH-2482 index-geoip not to add null values 
to document fields
URL: https://github.com/apache/nutch/pull/476#issuecomment-536136377
 
 
   A valuable addition @sebastian-nagel 
   +1
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> index-geoip not to add null values to document fields
> -
>
> Key: NUTCH-2482
> URL: https://issues.apache.org/jira/browse/NUTCH-2482
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer, plugin
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
>  Labels: patch-available
> Fix For: 1.16
>
>
> The plugin index-geoip may add null values to document fields which then 
> cause further errors, here an NPE in IndexingFiltersChecker when toString() is 
> called on null:
> {noformat}
> $ bin/nutch indexchecker -Dstore.ip.address=true 
> -Dindex.geoip.usage=cityDatabase \
>  -Dplugin.includes="protocol-http|parse-html|index-(basic|geoip)" 
> http://www.example.com/
> ...
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.fetch(IndexingFiltersChecker.java:340)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:127)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:370)
> {noformat}
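
For illustration only (this is not the patch from pull request #476): the
failure above comes from calling toString() on a field value that is null, so
one way to avoid it is to skip null values when fields are added. The class
NullSafeDocSketch and the field names "ip" and "cityName" below are
hypothetical stand-ins, not Nutch's NutchDocument API or the plugin's real
field names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an indexing document: field name -> list of values.
// This is NOT Nutch's NutchDocument class; it only illustrates the null guard.
class NullSafeDocSketch {
  private final Map<String, List<Object>> fields = new LinkedHashMap<>();

  /** Adds the value only if it is non-null; returns true if it was added. */
  boolean addIfNotNull(String name, Object value) {
    if (value == null) {
      return false; // skip: a stored null would later break value.toString()
    }
    fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    return true;
  }

  /** Renders all fields; safe because no null value can have been stored. */
  String dump() {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, List<Object>> e : fields.entrySet()) {
      for (Object v : e.getValue()) {
        sb.append(e.getKey()).append(" = ").append(v.toString()).append('\n');
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    NullSafeDocSketch doc = new NullSafeDocSketch();
    doc.addIfNotNull("ip", "192.0.2.1");   // documentation-only address
    doc.addIfNotNull("cityName", null);    // lookup found nothing: field omitted
    System.out.print(doc.dump());
  }
}
```

With a guard like this, a lookup that yields no city simply leaves the field
out of the document instead of storing null.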



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Assigned] (NUTCH-2482) index-geoip not to add null values to document fields

2019-09-27 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2482:
--

Assignee: Sebastian Nagel

> index-geoip not to add null values to document fields
> -
>
> Key: NUTCH-2482
> URL: https://issues.apache.org/jira/browse/NUTCH-2482
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer, plugin
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
>  Labels: patch-available
> Fix For: 1.16
>
>
> The plugin index-geoip may add null values to document fields which then 
> cause further errors, here an NPE in IndexingFiltersChecker when toString() is 
> called on null:
> {noformat}
> $ bin/nutch indexchecker -Dstore.ip.address=true 
> -Dindex.geoip.usage=cityDatabase \
>  -Dplugin.includes="protocol-http|parse-html|index-(basic|geoip)" 
> http://www.example.com/
> ...
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.fetch(IndexingFiltersChecker.java:340)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:127)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:370)
> {noformat}




[jira] [Updated] (NUTCH-2482) index-geoip not to add null values to document fields

2019-09-27 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2482:
---
Labels: patch-available  (was: )

> index-geoip not to add null values to document fields
> -
>
> Key: NUTCH-2482
> URL: https://issues.apache.org/jira/browse/NUTCH-2482
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer, plugin
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Priority: Minor
>  Labels: patch-available
> Fix For: 1.16
>
>
> The plugin index-geoip may add null values to document fields which then 
> cause further errors, here an NPE in IndexingFiltersChecker when toString() is 
> called on null:
> {noformat}
> $ bin/nutch indexchecker -Dstore.ip.address=true 
> -Dindex.geoip.usage=cityDatabase \
>  -Dplugin.includes="protocol-http|parse-html|index-(basic|geoip)" 
> http://www.example.com/
> ...
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.fetch(IndexingFiltersChecker.java:340)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:127)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:370)
> {noformat}




[jira] [Commented] (NUTCH-2482) index-geoip not to add null values to document fields

2019-09-27 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939782#comment-16939782
 ] 

ASF GitHub Bot commented on NUTCH-2482:
---

sebastian-nagel commented on pull request #476: NUTCH-2482 index-geoip not to 
add null values to document fields
URL: https://github.com/apache/nutch/pull/476
 
 
   - also improve handling of errors when searching for and reading GeoIP 
database files
   - upgrade dependencies
   
 



> index-geoip not to add null values to document fields
> -
>
> Key: NUTCH-2482
> URL: https://issues.apache.org/jira/browse/NUTCH-2482
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer, plugin
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> The plugin index-geoip may add null values to document fields which then 
> cause further errors, here an NPE in IndexingFiltersChecker when toString() is 
> called on null:
> {noformat}
> $ bin/nutch indexchecker -Dstore.ip.address=true 
> -Dindex.geoip.usage=cityDatabase \
>  -Dplugin.includes="protocol-http|parse-html|index-(basic|geoip)" 
> http://www.example.com/
> ...
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.fetch(IndexingFiltersChecker.java:340)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:127)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:370)
> {noformat}




[jira] [Updated] (NUTCH-685) Content-level redirect status lost in ParseSegment

2019-09-27 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-685:
--
Fix Version/s: 1.17

> Content-level redirect status lost in ParseSegment
> --
>
> Key: NUTCH-685
> URL: https://issues.apache.org/jira/browse/NUTCH-685
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Andrzej Bialecki
>Priority: Major
> Fix For: 1.17
>
>
> When Fetcher runs in parsing mode, content-level redirects (HTML meta tag 
> "Refresh") are properly discovered and recorded in crawl_fetch under source 
> URL and target URL. If Fetcher runs in non-parsing mode, and ParseSegment is 
> run as a separate step, the content-level redirection data is used only to 
> add the new (target) URL, but the status of the original URL is not reset to 
> indicate a redirect. Consequently, status of the original URL will be 
> different depending on the way you run Fetcher, whereas it should be the same.




[jira] [Updated] (NUTCH-2261) ParseSegment job does not pass metadata for content-level redirects

2019-09-27 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2261:
---
Fix Version/s: (was: 1.16)
   1.17

> ParseSegment job does not pass metadata for content-level redirects
> ---
>
> Key: NUTCH-2261
> URL: https://issues.apache.org/jira/browse/NUTCH-2261
> Project: Nutch
>  Issue Type: Bug
>  Components: metadata, parser
>Affects Versions: 1.11, 1.12, 1.13
>Reporter: David Astle
>Priority: Minor
> Fix For: 1.17
>
>
> When Fetcher runs in parsing mode, CrawlDatum metadata is properly passed to 
> a new CrawlDatum for content-level redirects (HTML meta tag "Refresh").  If 
> Fetcher runs in non-parsing mode, and ParseSegment is run as a separate step, 
> then metadata other than "_repr_" is not passed to the new CrawlDatum.
> This means that any filter relying on metadata, such as DepthScoringFilter 
> and URLMetaScoringFilter, will not work.




[jira] [Updated] (NUTCH-2710) Normalize outlinks before checking for internal or external links

2019-09-27 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2710:
---
Fix Version/s: (was: 1.16)
   1.17

> Normalize outlinks before checking for internal or external links
> -
>
> Key: NUTCH-2710
> URL: https://issues.apache.org/jira/browse/NUTCH-2710
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.17
>
> Attachments: NUTCH-2710.patch
>
>
> We have a normalizer that transforms external URLs back to internal URLs. But 
> those URLs are never passed to the normalizer, because they have already been 
> filtered out by internal and/or external host/domain checks in 
> parseOutputFormat.filterNormalize().
> This patch proposes to move the normalizers above the checks for 
> internal/external hosts/domains.
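
To make the ordering problem concrete, here is a minimal sketch, assuming a
simple same-host check and a normalizer that rewrites a mirror host back to the
main host. It is not the code from NUTCH-2710.patch; the class name, both
method names and the example hosts are hypothetical.

```java
import java.net.URI;
import java.util.function.UnaryOperator;

// Hypothetical illustration of "filter first" versus "normalize first".
class NormalizeBeforeFilterSketch {

  /** Old order: the host check runs first, so external links never reach the normalizer. */
  static String filterThenNormalize(String from, String to, UnaryOperator<String> normalizer) {
    if (!sameHost(from, to)) {
      return null; // dropped as external before normalization
    }
    return normalizer.apply(to);
  }

  /** Proposed order: normalize first, then decide whether the link is internal. */
  static String normalizeThenFilter(String from, String to, UnaryOperator<String> normalizer) {
    String normalized = normalizer.apply(to);
    return sameHost(from, normalized) ? normalized : null;
  }

  /** True if both URLs have the same (case-insensitive) host. */
  static boolean sameHost(String a, String b) {
    String hostA = URI.create(a).getHost();
    String hostB = URI.create(b).getHost();
    return hostA != null && hostA.equalsIgnoreCase(hostB);
  }

  public static void main(String[] args) {
    // Assumed normalizer: maps the mirror host back to the main host.
    UnaryOperator<String> normalizer =
        url -> url.replace("//mirror.example.org/", "//www.example.com/");
    String from = "http://www.example.com/index.html";
    String to = "http://mirror.example.org/page.html";
    System.out.println(filterThenNormalize(from, to, normalizer)); // null
    System.out.println(normalizeThenFilter(from, to, normalizer)); // http://www.example.com/page.html
  }
}
```

In the first variant the link is discarded before the normalizer can turn it
into an internal URL, which is the behaviour the issue describes.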




[jira] [Commented] (NUTCH-2732) Ignored and tracked configuration files by git

2019-09-27 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939568#comment-16939568
 ] 

ASF GitHub Bot commented on NUTCH-2732:
---

sebastian-nagel commented on issue #475: NUTCH-2732: nutch-default.xml as a 
non-template file.
URL: https://github.com/apache/nutch/pull/475#issuecomment-535999591
 
 
   +1
 



> Ignored and tracked configuration files by git
> --
>
> Key: NUTCH-2732
> URL: https://issues.apache.org/jira/browse/NUTCH-2732
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.15
>Reporter: Roannel Fernández Hernández
>Assignee: Roannel Fernández Hernández
>Priority: Trivial
> Fix For: 1.16
>
>
> In folder conf/ there are files that are ignored and tracked by git at the 
> same time. A way to solve this is creating {{*.template}} files for those 
> files.




[jira] [Commented] (NUTCH-2732) Ignored and tracked configuration files by git

2019-09-27 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939542#comment-16939542
 ] 

ASF GitHub Bot commented on NUTCH-2732:
---

r0ann3l commented on pull request #475: NUTCH-2732: nutch-default.xml as a 
non-template file.
URL: https://github.com/apache/nutch/pull/475
 
 
   
 



> Ignored and tracked configuration files by git
> --
>
> Key: NUTCH-2732
> URL: https://issues.apache.org/jira/browse/NUTCH-2732
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.15
>Reporter: Roannel Fernández Hernández
>Assignee: Roannel Fernández Hernández
>Priority: Trivial
> Fix For: 1.16
>
>
> In folder conf/ there are files that are ignored and tracked by git at the 
> same time. A way to solve this is creating {{*.template}} files for those 
> files.




[jira] [Commented] (NUTCH-2732) Ignored and tracked configuration files by git

2019-09-27 Thread Jira



[ 
https://issues.apache.org/jira/browse/NUTCH-2732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939537#comment-16939537
 ] 

Roannel Fernández Hernández commented on NUTCH-2732:


Hi [~snagel]. You may be right. But in this case we need to specify it in the 
.gitignore file to avoid ignoring this tracked file.

> Ignored and tracked configuration files by git
> --
>
> Key: NUTCH-2732
> URL: https://issues.apache.org/jira/browse/NUTCH-2732
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.15
>Reporter: Roannel Fernández Hernández
>Assignee: Roannel Fernández Hernández
>Priority: Trivial
> Fix For: 1.16
>
>
> In folder conf/ there are files that are ignored and tracked by git at the 
> same time. A way to solve this is creating {{*.template}} files for those 
> files.




[jira] [Comment Edited] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2019-09-27 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939516#comment-16939516
 ] 

Tim Allison edited comment on NUTCH-2457 at 9/27/19 2:55 PM:
-

W00t!  Default is to parse embedded, right? :D

Wouldn't want to break backwards compatibility!  

Kidding...I'm kidding...

Sorry, and thank you!


was (Author: talli...@mitre.org):
W00t!  Default is to parse embedded, right? :D

> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.14
>Reporter: Tim Allison
>Priority: Major
>  Labels: patch-available
> Fix For: 1.16
>
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 




[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2019-09-27 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939516#comment-16939516
 ] 

Tim Allison commented on NUTCH-2457:


W00t!  Default is to parse embedded, right? :D

> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.14
>Reporter: Tim Allison
>Priority: Major
>  Labels: patch-available
> Fix For: 1.16
>
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 




[jira] [Updated] (NUTCH-2381) In some situations the class TextProfileSignature gives different signatures for the same text "profile" page.

2019-09-27 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2381:
---
Labels: patch-available signature  (was: signature)

> In some situations the class TextProfileSignature gives different signatures 
> for the same text "profile" page.
> --
>
> Key: NUTCH-2381
> URL: https://issues.apache.org/jira/browse/NUTCH-2381
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.13
>Reporter: Rodrigo Joni Sestari
>Assignee: Sebastian Nagel
>Priority: Major
>  Labels: patch-available, signature
> Fix For: 1.16
>
>
> In some situations the class TextProfileSignature gives different signatures 
> for the same text "profile" page.
> The method TextProfileSignature.calculate uses a HashMap to save the tokens;
> after some processing, the tokens are sorted by decreasing frequency.
> For some pages like "http://curia.europa.eu/jcms/" the text "profile" is the
> same but the signature comes out different for each fetch.
> This happens because the tokens are sorted only by decreasing frequency.
> Tokens with the same frequency may not have the same order in different
> fetches.
> A HashMap makes no guarantees as to the order of the map, nor that
> the order will remain constant over time.
> My suggestion is to change the method TokenComparator.compare to sort
> by frequency and name.
> Rodrigo
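
The suggestion above can be sketched as a comparator with a secondary sort key.
This is only an illustration of the idea, not the actual change to
TextProfileSignature; the Token class and the example words below are
hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StableTokenOrderSketch {
  // Hypothetical token record; the real class has its own Token type.
  static final class Token {
    final String text;
    final int count;
    Token(String text, int count) { this.text = text; this.count = count; }
  }

  // Primary key: decreasing frequency. Secondary key: token text, ascending.
  // The secondary key makes the order independent of HashMap iteration order.
  static final Comparator<Token> BY_FREQ_THEN_TEXT = (a, b) -> {
    int byFreq = Integer.compare(b.count, a.count); // higher count first
    return byFreq != 0 ? byFreq : a.text.compareTo(b.text);
  };

  /** Returns the token texts in a deterministic order. */
  static List<String> orderedTokens(Map<String, Integer> counts) {
    List<Token> tokens = new ArrayList<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      tokens.add(new Token(e.getKey(), e.getValue()));
    }
    tokens.sort(BY_FREQ_THEN_TEXT);
    List<String> out = new ArrayList<>();
    for (Token t : tokens) {
      out.add(t.text);
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = new HashMap<>();
    counts.put("justice", 3);
    counts.put("court", 3);   // same frequency as "justice": text breaks the tie
    counts.put("europe", 5);
    System.out.println(orderedTokens(counts)); // [europe, court, justice]
  }
}
```

Because ties are broken by the token text, two fetches with the same token
counts produce the same ordering and therefore the same input to the hash.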




[jira] [Updated] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2019-09-27 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2457:
---
Component/s: plugin
 parser

> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.14
>Reporter: Tim Allison
>Priority: Major
> Fix For: 1.16
>
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 




[jira] [Comment Edited] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2019-09-27 Thread Sebastian Nagel (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939509#comment-16939509
 ] 

Sebastian Nagel edited comment on NUTCH-2457 at 9/27/19 2:49 PM:
-

Thanks, [~talli...@apache.org], got it. Implemented solution 2: it works. 
Optionally, parsing of embedded documents can be turned off via the new 
property "tika.parse.embedded".


was (Author: wastl-nagel):
Thanks, [~talli...@apache.org], got it. Will implement solution 2.

> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Tim Allison
>Priority: Major
> Fix For: 1.16
>
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 




[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2019-09-27 Thread Sebastian Nagel (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939509#comment-16939509
 ] 

Sebastian Nagel commented on NUTCH-2457:


Thanks, [~talli...@apache.org], got it. Will implement solution 2.

> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Tim Allison
>Priority: Major
> Fix For: 1.16
>
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 




[jira] [Reopened] (NUTCH-2732) Ignored and tracked configuration files by git

2019-09-27 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-2732:


Hi [~roannel], one point: the file conf/nutch-default.xml has also been moved 
to a template. As this file is mostly for documenting properties and is 
considered never to be changed (overrides should go to nutch-site.xml), I 
suggest keeping the old name to avoid confusion. What do you think?

> Ignored and tracked configuration files by git
> --
>
> Key: NUTCH-2732
> URL: https://issues.apache.org/jira/browse/NUTCH-2732
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.15
>Reporter: Roannel Fernández Hernández
>Assignee: Roannel Fernández Hernández
>Priority: Trivial
> Fix For: 1.16
>
>
> In folder conf/ there are files that are ignored and tracked by git at the 
> same time. A way to solve this is creating {{*.template}} files for those 
> files.




[jira] [Commented] (NUTCH-2381) In some situations the class TextProfileSignature gives different signatures for the same text "profile" page.

2019-09-27 Thread Sebastian Nagel (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939500#comment-16939500
 ] 

Sebastian Nagel commented on NUTCH-2381:


Add to conf/nutch-default.xml or *.template (cf. NUTCH-2732):
{noformat}
<property>
  <name>db.signature.text_profile.sec_sort_lex</name>
  <value>true</value>
  <description>
    Whether the TextProfileSignature class should sort words also
    lexicographically to avoid changing signatures due to unstable hash
    sorting. Default is `true`, set to `false` to ensure
    backward-compatibility with CrawlDbs written by Nutch 1.15 or prior,
    see also NUTCH-2381.
  </description>
</property>
{noformat}

> In some situations the class TextProfileSignature gives different signatures 
> for the same text "profile" page.
> --
>
> Key: NUTCH-2381
> URL: https://issues.apache.org/jira/browse/NUTCH-2381
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.13
>Reporter: Rodrigo Joni Sestari
>Assignee: Sebastian Nagel
>Priority: Major
>  Labels: signature
> Fix For: 1.16
>
>
> In some situations the class TextProfileSignature gives different signatures 
> for the same text "profile" page.
> The method TextProfileSignature.calculate uses a HashMap to save the tokens;
> after some processing, the tokens are sorted by decreasing frequency.
> For some pages like "http://curia.europa.eu/jcms/" the text "profile" is the
> same but the signature comes out different for each fetch.
> This happens because the tokens are sorted only by decreasing frequency.
> Tokens with the same frequency may not have the same order in different
> fetches.
> A HashMap makes no guarantees as to the order of the map, nor that
> the order will remain constant over time.
> My suggestion is to change the method TokenComparator.compare to sort
> by frequency and name.
> Rodrigo




[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2019-09-27 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939478#comment-16939478
 ] 

Tim Allison commented on NUTCH-2457:


The issue is that the AutoDetectParser automatically/silently adds itself as a 
parser to the ParseContext.  When an embedded document is parsed, there's a 
lookup for the embedded parser in the ParseContext.  Because you weren't using 
the AutoDetectParser, there is no parser in ParseContext, and the embedded 
documents are not being parsed.

So, you have two options (maybe more...):

1) use the AutoDetectParser; set 
https://tika.apache.org/1.17/api/org/apache/tika/metadata/TikaCoreProperties.html#CONTENT_TYPE_OVERRIDE
 to the mime, and you'll avoid a second detection for the container file

2) Use your current method, but add a cached AutoDetectParser to the 
ParseContext
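
The lookup described above can be modelled in a few lines. The classes below
are a simplified, hypothetical model and are NOT Tika's real ParseContext,
Parser or AutoDetectParser; they only show why embedded documents are skipped
when no parser is registered in the context, and how registering one (option 2)
changes the result. In Tika itself the registration would presumably be a call
along the lines of parseContext.set(Parser.class, autoDetectParser), but that
is an assumption to verify against the Tika version in use.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model only: none of these types are Tika classes.
class EmbeddedLookupSketch {
  interface Parser {
    void parse(Doc doc, Context ctx, List<String> parsed);
  }

  static final class Doc {
    final String name;
    final List<Doc> embedded = new ArrayList<>();
    Doc(String name) { this.name = name; }
  }

  // Type-keyed registry, loosely modelled on the role of a parse context.
  static final class Context {
    private final Map<Class<?>, Object> entries = new HashMap<>();
    <T> void set(Class<T> key, T value) { entries.put(key, value); }
    <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
  }

  // Records the document, then looks up a parser for its embedded documents.
  // If the context holds no parser, the embedded documents are silently skipped.
  static final class ContainerParser implements Parser {
    @Override
    public void parse(Doc doc, Context ctx, List<String> parsed) {
      parsed.add(doc.name);
      Parser forEmbedded = ctx.get(Parser.class);
      if (forEmbedded == null) {
        return;
      }
      for (Doc child : doc.embedded) {
        forEmbedded.parse(child, ctx, parsed);
      }
    }
  }

  /** Parses a small nested document and returns the names that were parsed. */
  static List<String> run(boolean registerParserInContext) {
    Doc outer = new Doc("container.docx");
    Doc inner = new Doc("embedded-1");
    inner.embedded.add(new Doc("embedded-2"));
    outer.embedded.add(inner);

    Parser parser = new ContainerParser();
    Context ctx = new Context();
    if (registerParserInContext) {
      ctx.set(Parser.class, parser); // corresponds to option 2 above
    }
    List<String> parsed = new ArrayList<>();
    parser.parse(outer, ctx, parsed);
    return parsed;
  }

  public static void main(String[] args) {
    System.out.println(run(false)); // [container.docx]
    System.out.println(run(true));  // [container.docx, embedded-1, embedded-2]
  }
}
```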

> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Tim Allison
>Priority: Major
> Fix For: 1.16
>
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 




[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2019-09-27 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939473#comment-16939473
 ] 

Tim Allison commented on NUTCH-2457:


Let me take a look at the code again...it has been a while...

> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Tim Allison
>Priority: Major
> Fix For: 1.16
>
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 




[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2019-09-27 Thread Sebastian Nagel (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939453#comment-16939453
 ] 

Sebastian Nagel commented on NUTCH-2457:


[~talli...@apache.org], using the AutoDetectParser instead of the 
CompositeParser fixes the issue also when running the parser checker (i.e. 
parse-tika run in the encapsulated plugin class loader). The price is probably 
that MIME detection is done a second time for the outer document, although it 
might not be, because the MIME type is passed via Tika metadata. It should be 
checked whether there is a significant performance impact.

> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Tim Allison
>Priority: Major
> Fix For: 1.16
>
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 




[jira] [Comment Edited] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2019-09-27 Thread Sebastian Nagel (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939430#comment-16939430
 ] 

Sebastian Nagel edited comment on NUTCH-2457 at 9/27/19 1:12 PM:
-

Actually, Nutch calls
{noformat}
 CompositeParser compositeParser = (CompositeParser) tikaConfig.getParser();
 Parser parser = compositeParser.getParsers().get(MediaType.parse(mimeType));
{noformat}

But it works for embedded documents only in a unit test I've just added (see 
PR). When running the parser checker, the embedded documents are not parsed:
{noformat}
> bin/nutch parsechecker -Dplugin.includes='protocol-file|parse-tika' -dumpText 
> file:/.../test_recursive_embedded.docx
...
contentType: 
application/vnd.openxmlformats-officedocument.wordprocessingml.document
...
embed_0
{noformat}

[~talli...@apache.org], is this caused by the way the parser is called? It 
might also be because parse-tika (as a Nutch plugin) holds the tika-parsers 
jar in the plugin class loader while the tika-core jar is in the main class 
loader (because it is required for MIME detection).


was (Author: wastl-nagel):
Actually, Nutch calls
{noformat}
 CompositeParser compositeParser = (CompositeParser) tikaConfig.getParser();
 Parser parser = compositeParser.getParsers().get(MediaType.parse(mimeType));
{noformat}

But it works for embedded documents only for a unit test, I've just added (see 
PR). Running the parser checker, the embedded documents are not parsed:
{noformat}
> parsechecker -Dplugin.includes='protocol-file|parse-tika' -dumpText 
> file:/.../test_recursive_embedded.docx
...
contentType: 
application/vnd.openxmlformats-officedocument.wordprocessingml.document
...
embed_0
{noformat}

[~talli...@apache.org], is this caused by the way the parser is called? It 
might be also because of parse-tika (as a Nutch plugin) holds the tika-parsers 
jar in the plugin class loader while the tika-core jar is in the main class 
loader (because it is required for MIME detection).

> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Tim Allison
>Priority: Major
> Fix For: 1.16
>
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2019-09-27 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939431#comment-16939431
 ] 

ASF GitHub Bot commented on NUTCH-2457:
---

sebastian-nagel commented on issue #474: NUTCH-2457 Embedded documents likely 
not correctly parsed by Tika
URL: https://github.com/apache/nutch/pull/474#issuecomment-535932438
 
 
   Yes, please see my comment in 
[Jira](https://issues.apache.org/jira/browse/NUTCH-2457).
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Tim Allison
>Priority: Major
> Fix For: 1.16
>
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 




[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2019-09-27 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939428#comment-16939428
 ] 

ASF GitHub Bot commented on NUTCH-2457:
---

tballison commented on issue #474: NUTCH-2457 Embedded documents likely not 
correctly parsed by Tika
URL: https://github.com/apache/nutch/pull/474#issuecomment-535932051
 
 
   >Embedded documents likely not correctly parsed by Tika
   
   Can we help?
 



> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Tim Allison
>Priority: Major
> Fix For: 1.16
>
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 




[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2019-09-27 Thread Sebastian Nagel (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939430#comment-16939430
 ] 

Sebastian Nagel commented on NUTCH-2457:


Actually, Nutch calls
{noformat}
 CompositeParser compositeParser = (CompositeParser) tikaConfig.getParser();
 Parser parser = compositeParser.getParsers().get(MediaType.parse(mimeType));
{noformat}

But it works for embedded documents only in a unit test I've just added (see 
PR). Running the parser checker, the embedded documents are not parsed:
{noformat}
> parsechecker -Dplugin.includes='protocol-file|parse-tika' -dumpText 
> file:/.../test_recursive_embedded.docx
...
contentType: 
application/vnd.openxmlformats-officedocument.wordprocessingml.document
...
embed_0
{noformat}

[~talli...@apache.org], is this caused by the way the parser is called? It 
might also be because parse-tika (as a Nutch plugin) holds the tika-parsers 
jar in the plugin class loader while the tika-core jar is in the main class 
loader (because it is required for MIME detection).

> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Tim Allison
>Priority: Major
> Fix For: 1.16
>
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 




[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2019-09-27 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939422#comment-16939422
 ] 

ASF GitHub Bot commented on NUTCH-2457:
---

sebastian-nagel commented on pull request #474: NUTCH-2457 Embedded documents 
likely not correctly parsed by Tika
URL: https://github.com/apache/nutch/pull/474
 
 
   - add unit test for embedded documents
 



> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Tim Allison
>Priority: Major
> Fix For: 1.16
>
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 




[jira] [Updated] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2019-09-27 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1403:
---
Fix Version/s: (was: 1.17)
   1.16

> Add default ScoringFilter for manipulating metadata 
> 
>
> Key: NUTCH-1403
> URL: https://issues.apache.org/jira/browse/NUTCH-1403
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Julien Nioche
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.16
>
>
> This is currently done by the urlmeta plugin, which has too vague a name and 
> a redundant indexing filter now that we have the index-metadata plugin. This 
> scoring filter would help define which metadata to pass from: 
> - the crawl metadata to the content metadata
> - the content metadata to the parse metadata
> - the parse metadata to the crawldatum for the outlinks
> I'd make this scoring filter available by default, i.e. not in a separate 
> plugin, as its functionality is commonly used.
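
The metadata hand-off described in the quoted issue can be pictured with a minimal sketch. It uses plain Java maps rather than any Nutch class; the class name `MetadataPassSketch`, the method `pass`, and the key names are hypothetical and only illustrate copying a configured set of keys from one metadata container to the next.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MetadataPassSketch {

  /** Copies only the listed keys from source to target; absent keys are skipped. */
  static void pass(Map<String, String> source, Map<String, String> target,
      List<String> keys) {
    for (String key : keys) {
      String value = source.get(key);
      if (value != null) { // never propagate null values
        target.put(key, value);
      }
    }
  }

  public static void main(String[] args) {
    // Hypothetical keys, standing in for crawl metadata.
    Map<String, String> crawlMeta = new HashMap<>();
    crawlMeta.put("seed.category", "news");
    crawlMeta.put("internal.flag", "x");

    Map<String, String> contentMeta = new HashMap<>();
    Map<String, String> parseMeta = new HashMap<>();
    List<String> keysToPass = Arrays.asList("seed.category");

    pass(crawlMeta, contentMeta, keysToPass);  // crawl metadata -> content metadata
    pass(contentMeta, parseMeta, keysToPass);  // content metadata -> parse metadata
    System.out.println(parseMeta);
  }
}
```

Only the configured key reaches the last map; how an actual scoring filter would be configured and wired into Nutch is not covered by this sketch.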




[jira] [Assigned] (NUTCH-1403) Add default ScoringFilter for manipulating metadata

2019-09-27 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-1403:
--

Assignee: Sebastian Nagel

> Add default ScoringFilter for manipulating metadata 
> 
>
> Key: NUTCH-1403
> URL: https://issues.apache.org/jira/browse/NUTCH-1403
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Julien Nioche
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
>
> This is currently done by the urlmeta plugin, which has too vague a name and 
> a redundant indexing filter now that we have the index-metadata plugin. This 
> scoring filter would help define which metadata to pass from: 
> - the crawl metadata to the content metadata
> - the content metadata to the parse metadata
> - the parse metadata to the crawldatum for the outlinks
> I'd make this scoring filter available by default, i.e. not in a separate 
> plugin, as its functionality is commonly used.




[jira] [Commented] (NUTCH-2381) In some situations the class TextProfileSignature gives different signatures for the same text "profile" page.

2019-09-27 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939372#comment-16939372
 ] 

ASF GitHub Bot commented on NUTCH-2381:
---

sebastian-nagel commented on pull request #473: NUTCH-2381 In some situations 
the class TextProfileSignature gives different signatures for the same text 
"profile" page
URL: https://github.com/apache/nutch/pull/473
 
 
   - implement secondary sorting, similar to the patch provided by Rodrigo Joni 
Sestari
   - allow restoring the previous behavior by setting the property 
`db.signature.text_profile.sec_sort_lex = false`
   
 



> In some situations the class TextProfileSignature gives different signatures 
> for the same text "profile" page.
> --
>
> Key: NUTCH-2381
> URL: https://issues.apache.org/jira/browse/NUTCH-2381
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.13
>Reporter: Rodrigo Joni Sestari
>Assignee: Sebastian Nagel
>Priority: Major
>  Labels: signature
> Fix For: 1.16
>
>
> In some situations the class TextProfileSignature gives different signatures 
> for the same text "profile" page.
> The method TextProfileSignature.calculate uses a HashMap to store the tokens; 
> after some processing, the tokens are sorted by decreasing frequency.
> For some pages like "http://curia.europa.eu/jcms/" the text "profile" is the 
> same but the signature is different for each fetch.
> This happens because the tokens are sorted only by decreasing frequency. 
> Tokens with the same frequency may not have the same order in different 
> fetches.
> HashMap makes no guarantees as to the order of the map and does not guarantee 
> that the order will remain constant over time.
> My suggestion is to change the method TokenComparator.compare in order to 
> sort by frequency and name.
> Rodrigo




[jira] [Assigned] (NUTCH-2381) In some situations the class TextProfileSignature gives different signatures for the same text "profile" page.

2019-09-27 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2381:
--

Assignee: Sebastian Nagel

> In some situations the class TextProfileSignature gives different signatures 
> for the same text "profile" page.
> --
>
> Key: NUTCH-2381
> URL: https://issues.apache.org/jira/browse/NUTCH-2381
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.13
>Reporter: Rodrigo Joni Sestari
>Assignee: Sebastian Nagel
>Priority: Major
>  Labels: signature
> Fix For: 1.16
>
>
> In some situations the class TextProfileSignature gives different signatures 
> for the same text "profile" page.
> The method TextProfileSignature.calculate uses a HashMap to store the tokens; 
> after some processing, the tokens are sorted by decreasing frequency.
> For some pages like "http://curia.europa.eu/jcms/" the text "profile" is the 
> same but the signature is different for each fetch.
> This happens because the tokens are sorted only by decreasing frequency. 
> Tokens with the same frequency may not have the same order in different 
> fetches.
> HashMap makes no guarantees as to the order of the map and does not guarantee 
> that the order will remain constant over time.
> My suggestion is to change the method TokenComparator.compare in order to 
> sort by frequency and name.
> Rodrigo




[jira] [Commented] (NUTCH-2381) In some situations the class TextProfileSignature gives different signatures for the same text "profile" page.

2019-09-27 Thread Sebastian Nagel (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939336#comment-16939336
 ] 

Sebastian Nagel commented on NUTCH-2381:


Good point: "A particular iteration order is not specified for {{HashMap}} 
objects - any code that depends on iteration order should be fixed." 
([https://docs.oracle.com/javase/8/docs/technotes/guides/collections/changes8.html])

Will provide fix.

CAVEAT: while it makes the TextProfileSignature more reliable, it will change 
the signatures in an already existing CrawlDb.
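
A minimal sketch of the idea discussed here: sort tokens by decreasing frequency and break ties lexicographically, so the result no longer depends on `HashMap` iteration order. The `Token` class, the comparator name, and the sample counts are hypothetical stand-ins for illustration, not the actual TextProfileSignature code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TokenOrderSketch {

  /** Hypothetical stand-in for a token and its frequency count. */
  static final class Token {
    final String text;
    final int count;

    Token(String text, int count) {
      this.text = text;
      this.count = count;
    }

    @Override
    public String toString() {
      return text + ":" + count;
    }
  }

  /** Decreasing frequency first; equal frequencies are ordered by token text. */
  static final Comparator<Token> BY_FREQ_THEN_TEXT =
      Comparator.comparingInt((Token t) -> t.count).reversed()
          .thenComparing(t -> t.text);

  /** Builds a token list from the counts and sorts it deterministically. */
  static List<Token> profile(Map<String, Integer> counts) {
    List<Token> tokens = new ArrayList<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      tokens.add(new Token(e.getKey(), e.getValue()));
    }
    tokens.sort(BY_FREQ_THEN_TEXT);
    return tokens;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = new HashMap<>();
    counts.put("court", 3);
    counts.put("justice", 3);
    counts.put("europe", 5);
    counts.put("case", 3);
    // prints [europe:5, case:3, court:3, justice:3]
    System.out.println(profile(counts));
  }
}
```

With the secondary key, the three tokens that share frequency 3 always come out in the same order, whatever order the map happens to iterate in.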

> In some situations the class TextProfileSignature gives different signatures 
> for the same text "profile" page.
> --
>
> Key: NUTCH-2381
> URL: https://issues.apache.org/jira/browse/NUTCH-2381
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.13
>Reporter: Rodrigo Joni Sestari
>Priority: Major
>  Labels: signature
> Fix For: 1.16
>
>
> In some situations the class TextProfileSignature gives different signatures 
> for the same text "profile" page.
> The method TextProfileSignature.calculate uses a HashMap to store the tokens; 
> after some processing, the tokens are sorted by decreasing frequency.
> For some pages like "http://curia.europa.eu/jcms/" the text "profile" is the 
> same but the signature is different for each fetch.
> This happens because the tokens are sorted only by decreasing frequency. 
> Tokens with the same frequency may not have the same order in different 
> fetches.
> HashMap makes no guarantees as to the order of the map and does not guarantee 
> that the order will remain constant over time.
> My suggestion is to change the method TokenComparator.compare in order to 
> sort by frequency and name.
> Rodrigo




[jira] [Commented] (NUTCH-2736) Upgrade Dockerfile to be based on recent Ubuntu LTS version

2019-09-27 Thread Hudson (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939271#comment-16939271
 ] 

Hudson commented on NUTCH-2736:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3646 (See 
[https://builds.apache.org/job/Nutch-trunk/3646/])
NUTCH-2736 Upgrade Dockerfile to be based on recent Ubuntu LTS version - 
(snagel: 
[https://github.com/apache/nutch/commit/c735ebb7a40dd3e6ab29583ada91a73378922874])
* (edit) docker/Dockerfile
* (edit) docker/README.md


> Upgrade Dockerfile to be based on recent Ubuntu LTS version
> ---
>
> Key: NUTCH-2736
> URL: https://issues.apache.org/jira/browse/NUTCH-2736
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, test
>Affects Versions: 1.16
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>





[jira] [Updated] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2019-09-27 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2184:
---
Fix Version/s: (was: 1.16)
   1.17

> Enable IndexingJob to function with no crawldb
> --
>
> Key: NUTCH-2184
> URL: https://issues.apache.org/jira/browse/NUTCH-2184
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.17
>
> Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can 
> 'lose' data structures which are currently considered critical, e.g. 
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no 
> accompanying crawldb or linkdb. 
> Absence of the latter is OK as linkdb is optional; however, currently in 
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java]
>  crawldb is mandatory. 
> This ticket should enhance the IndexerMapReduce code to support the use case 
> where you ONLY have segments and want to force an index for every record 
> present.




[jira] [Resolved] (NUTCH-2736) Upgrade Dockerfile to be based on recent Ubuntu LTS version

2019-09-27 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2736.

Resolution: Fixed

> Upgrade Dockerfile to be based on recent Ubuntu LTS version
> ---
>
> Key: NUTCH-2736
> URL: https://issues.apache.org/jira/browse/NUTCH-2736
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, test
>Affects Versions: 1.16
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>





[jira] [Commented] (NUTCH-2736) Upgrade Dockerfile to be based on recent Ubuntu LTS version

2019-09-27 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939246#comment-16939246
 ] 

ASF GitHub Bot commented on NUTCH-2736:
---

sebastian-nagel commented on pull request #472: NUTCH-2736 Upgrade Dockerfile 
to be based on recent Ubuntu LTS version
URL: https://github.com/apache/nutch/pull/472
 
 
   
 



> Upgrade Dockerfile to be based on recent Ubuntu LTS version
> ---
>
> Key: NUTCH-2736
> URL: https://issues.apache.org/jira/browse/NUTCH-2736
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, test
>Affects Versions: 1.16
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>





[jira] [Updated] (NUTCH-2162) Nutch Webapp Crawl fails as it tries to index

2019-09-27 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2162:
---
Fix Version/s: (was: 1.16)
   1.17

> Nutch Webapp Crawl fails as it tries to index
> -
>
> Key: NUTCH-2162
> URL: https://issues.apache.org/jira/browse/NUTCH-2162
> Project: Nutch
>  Issue Type: Bug
>  Components: web gui
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.17
>
> Attachments: nutch_webapp.log
>
>
> Right now a crawl task fails on the trunk version of the WebApp due to it 
> attempting to index. No indexer is defined by default so this is a major bug.



