[jira] [Updated] (NUTCH-2467) Sitemap type field can be null

2017-11-28 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2467: - Attachment: NUTCH-2467.patch Incredibly stupid patch but i did it because the sitemap.type thing

[jira] [Updated] (NUTCH-2467) Sitemap type field can be null

2017-11-28 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2467: - Patch Info: Patch Available > Sitemap type field can be n

[jira] [Created] (NUTCH-2467) Sitemap type field can be null

2017-11-28 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2467: Summary: Sitemap type field can be null Key: NUTCH-2467 URL: https://issues.apache.org/jira/browse/NUTCH-2467 Project: Nutch Issue Type: Bug Affects

[jira] [Updated] (NUTCH-2466) Sitemap processor to follow redirects

2017-11-28 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2466: - Patch Info: Patch Available > Sitemap processor to follow redire

[jira] [Updated] (NUTCH-2466) Sitemap processor to follow redirects

2017-11-28 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2466: - Attachment: NUTCH-2466.patch Patch for master! > Sitemap processor to follow redire

[jira] [Created] (NUTCH-2466) Sitemap processor to follow redirects

2017-11-28 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2466: Summary: Sitemap processor to follow redirects Key: NUTCH-2466 URL: https://issues.apache.org/jira/browse/NUTCH-2466 Project: Nutch Issue Type: Bug

[jira] [Updated] (NUTCH-2439) Upgrade to Apache Tika 1.17

2017-11-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2439: - Summary: Upgrade to Apache Tika 1.17 (was: Upgrade to Apache Tika 1.16) > Upgrade to Apache T

[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-11-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251361#comment-16251361 ] Markus Jelsma commented on NUTCH-2368: -- Ah, can you open a new issue for this? Thanks! > Varia

[jira] [Resolved] (NUTCH-2458) TikaParser doesn't work with tika-config.xml set

2017-11-10 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2458. -- Resolution: Fixed Committed to 9e4d9544..705686e7 master -> master > TikaParser doesn'

[jira] [Commented] (NUTCH-2458) TikaParser doesn't work with tika-config.xml set

2017-11-09 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16246018#comment-16246018 ] Markus Jelsma commented on NUTCH-2458: -- Will commit this one shortly.. > TikaParser doesn't w

[jira] [Updated] (NUTCH-2458) TikaParser doesn't work with tika-config.xml set

2017-11-09 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2458: - Attachment: NUTCH-2458.patch Patch for master! > TikaParser doesn't work with tika-config.

[jira] [Created] (NUTCH-2458) TikaParser doesn't work with tika-config.xml set

2017-11-09 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2458: Summary: TikaParser doesn't work with tika-config.xml set Key: NUTCH-2458 URL: https://issues.apache.org/jira/browse/NUTCH-2458 Project: Nutch Issue Type

[jira] [Commented] (NUTCH-2456) Allow to index pages/URLs not contained in CrawlDb

2017-11-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16244761#comment-16244761 ] Markus Jelsma commented on NUTCH-2456: -- What will this patch achieve then? Just the case of ignoring

[jira] [Commented] (NUTCH-2456) Allow to index pages/URLs not contained in CrawlDb

2017-11-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16244638#comment-16244638 ] Markus Jelsma commented on NUTCH-2456: -- It will have side-effects, the same as you already mentioned

[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-11-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16244230#comment-16244230 ] Markus Jelsma commented on NUTCH-2368: -- Hello - see NUTCH-2420 . > Variable generate.max.co

[jira] [Commented] (NUTCH-2420) Bug in variable generate.max.count and fetcher.server.delay

2017-11-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240491#comment-16240491 ] Markus Jelsma commented on NUTCH-2420: -- Committed and created NUTCH-2455 Thanks! > Bug in varia

[jira] [Updated] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay

2017-11-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2455: - Description: Citing Sebastian at NUTCH-2420: ??The correct solution would be to use <host,sc

[jira] [Created] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay

2017-11-06 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2455: Summary: Speed up the merging of HostDb entries for variable fetch delay Key: NUTCH-2455 URL: https://issues.apache.org/jira/browse/NUTCH-2455 Project: Nutch

[jira] [Resolved] (NUTCH-2420) Bug in variable generate.max.count and fetcher.server.delay

2017-11-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2420. -- Resolution: Fixed remote:517dbdf..6199492 6199492f5e1e8811022257c88dbf63f1e1c739d0

[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-11-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237641#comment-16237641 ] Markus Jelsma commented on NUTCH-2368: -- Could you open a new issue, this one has already been

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2017-10-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Description: Orphan scoring filter that determines whether a page has become orphaned, e.g

[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

2017-10-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218704#comment-16218704 ] Markus Jelsma commented on NUTCH-1932: -- Yes! Indeed! Go ahead and thanks for taking this one over

[jira] [Commented] (NUTCH-2394) Possible bugs in the source code

2017-10-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218583#comment-16218583 ] Markus Jelsma commented on NUTCH-2394: -- A crap, the trim() habit, a time where i was very

[jira] [Closed] (NUTCH-2386) BasicURLNormalizer does not encode curly braces

2017-10-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2386. Thanks Sebastian! > BasicURLNormalizer does not encode curly bra

[jira] [Resolved] (NUTCH-2386) BasicURLNormalizer does not encode curly braces

2017-10-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2386. -- Resolution: Fixed 4cfec6e3..bd8c8476 master -> master > BasicURLNormalize

[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

2017-10-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218558#comment-16218558 ] Markus Jelsma commented on NUTCH-1932: -- This looks fine to me! I am happy to see that you removed my

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2017-10-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Fix Version/s: 1.14 > Automatically remove orphaned pa

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2017-10-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Affects Version/s: 1.13 > Automatically remove orphaned pa

[jira] [Commented] (NUTCH-2448) Allow Sending an empty http.agent.version

2017-10-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215804#comment-16215804 ] Markus Jelsma commented on NUTCH-2448: -- Fine, go ahead! > Allow Sending an empty http.agent.vers

[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

2017-10-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215190#comment-16215190 ] Markus Jelsma commented on NUTCH-1932: -- I agree! I will try to make some time for it tomorrow

[jira] [Closed] (NUTCH-2445) Fetcher following outlinks to keep track of already fetched items

2017-10-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2445. Thanks Sebastian! > Fetcher following outlinks to keep track of already fetched it

[jira] [Resolved] (NUTCH-2445) Fetcher following outlinks to keep track of already fetched items

2017-10-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2445. -- Resolution: Fixed remote:3c21a6b..0cdd095 0cdd095c881eed52dc461e559ce6ae278e99157f

[jira] [Resolved] (NUTCH-2444) HostDB CSV dumper to emit field header by default

2017-10-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2444. -- Resolution: Fixed remote:602c663..3c21a6b 3c21a6b2abaa17ecc66a1c76d1239c213c56ba4e

[jira] [Closed] (NUTCH-2444) HostDB CSV dumper to emit field header by default

2017-10-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2444. Thanks! > HostDB CSV dumper to emit field header by defa

[jira] [Updated] (NUTCH-2445) Fetcher following outlinks to keep track of already fetched items

2017-10-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2445: - Attachment: NUTCH-2445.patch Updated patch! > Fetcher following outlinks to keep tr

[jira] [Updated] (NUTCH-2447) Work-around SSLProtocolException: handshake alert: unrecognized_name

2017-10-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2447: - Attachment: NUTCH-2447.patch Added comment indicating it's filthy code with reference to here

[jira] [Commented] (NUTCH-2447) Work-around SSLProtocolException: handshake alert: unrecognized_name

2017-10-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215057#comment-16215057 ] Markus Jelsma commented on NUTCH-2447: -- As a side note, also pay attention to this incredibly ugly

[jira] [Updated] (NUTCH-2447) Work-around SSLProtocolException: handshake alert: unrecognized_name

2017-10-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2447: - Attachment: NUTCH-2447.patch Patch for master! Keep in mind, this only works for protocol-http

[jira] [Comment Edited] (NUTCH-2447) Work-around SSLProtocolException: handshake alert: unrecognized_name

2017-10-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215055#comment-16215055 ] Markus Jelsma edited comment on NUTCH-2447 at 10/23/17 12:36 PM: - Patch

[jira] [Updated] (NUTCH-2447) Work-around SSLProtocolException: handshake alert: unrecognized_name

2017-10-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2447: - Description: Nutch is unable to crawl some websites, regardless of protocol plugin you are using

[jira] [Updated] (NUTCH-2447) Work-around SSLProtocolException: handshake alert: unrecognized_name

2017-10-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2447: - Description: {code} 2017-10-23 12:43:52,911 INFO api.HttpRobotRulesParser - Couldn't get

[jira] [Created] (NUTCH-2447) Work-around SSLProtocolException: handshake alert: unrecognized_name

2017-10-23 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2447: Summary: Work-around SSLProtocolException: handshake alert: unrecognized_name Key: NUTCH-2447 URL: https://issues.apache.org/jira/browse/NUTCH-2447 Project: Nutch

[jira] [Updated] (NUTCH-2445) Fetcher following outlinks to keep track of already fetched items

2017-10-20 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2445: - Attachment: NUTCH-2445.patch Correct patch! > Fetcher following outlinks to keep tr

[jira] [Updated] (NUTCH-2445) Fetcher following outlinks to keep track of already fetched items

2017-10-20 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2445: - Attachment: (was: NUTCH-2445.patch) > Fetcher following outlinks to keep track of alre

[jira] [Updated] (NUTCH-2445) Fetcher following outlinks to keep track of already fetched items

2017-10-20 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2445: - Attachment: NUTCH-2445.patch Patch! > Fetcher following outlinks to keep track of alre

[jira] [Created] (NUTCH-2445) Fetcher following outlinks to keep track of already fetched items

2017-10-20 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2445: Summary: Fetcher following outlinks to keep track of already fetched items Key: NUTCH-2445 URL: https://issues.apache.org/jira/browse/NUTCH-2445 Project: Nutch

[jira] [Commented] (NUTCH-2444) HostDB CSV dumper to emit field header by default

2017-10-20 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212397#comment-16212397 ] Markus Jelsma commented on NUTCH-2444: -- Will commit this one shortly unless objections > HostDB

[jira] [Updated] (NUTCH-2444) HostDB CSV dumper to emit field header by default

2017-10-20 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2444: - Attachment: NUTCH-2444.patch Patch! > HostDB CSV dumper to emit field header by defa

[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

2017-10-20 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212381#comment-16212381 ] Markus Jelsma commented on NUTCH-1932: -- Hello - i don't disagree, i merely updated to master

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2017-10-20 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch Updated patch for master. > Automatically remove orphaned pa

[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-10-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2411: - Attachment: NUTCH-2411.patch Crap, previous version was wrong too! > Index-metadata to supp

[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-10-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2411: - Attachment: NUTCH-2411.patch The patch was missing support

[jira] [Commented] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-10-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209124#comment-16209124 ] Markus Jelsma commented on NUTCH-2411: -- Will commit shortly unless objections > Index-metad

[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.16

2017-10-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208332#comment-16208332 ] Markus Jelsma commented on NUTCH-2439: -- No idea, but probably someone on Tika's user list will so i

[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-10-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2411: - Attachment: NUTCH-2411.patch Don't add empty fields. > Index-metadata to support index

[jira] [Updated] (NUTCH-2439) Upgrade to Apache Tika 1.16

2017-10-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2439: - Attachment: (was: NUTCH-2439.patch) > Upgrade to Apache Tika 1

[jira] [Updated] (NUTCH-2439) Upgrade to Apache Tika 1.16

2017-10-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2439: - Attachment: NUTCH-2439.patch updated patch > Upgrade to Apache Tika 1

[jira] [Updated] (NUTCH-2439) Upgrade to Apache Tika 1.16

2017-10-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2439: - Attachment: NUTCH-2439.patch > Upgrade to Apache Tika 1

[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.16

2017-10-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16200495#comment-16200495 ] Markus Jelsma commented on NUTCH-2439: -- Ah, i removed slf4j-api from plugin.xml and it works

[jira] [Updated] (NUTCH-2439) Upgrade to Apache Tika 1.16

2017-10-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2439: - Attachment: NUTCH-2439.patch Patch for master! Upgrade went fine, getting the libraries was also

[jira] [Created] (NUTCH-2439) Upgrade to Apache Tika 1.16

2017-10-11 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2439: Summary: Upgrade to Apache Tika 1.16 Key: NUTCH-2439 URL: https://issues.apache.org/jira/browse/NUTCH-2439 Project: Nutch Issue Type: Improvement

[jira] [Commented] (NUTCH-2418) NPE in org.apache.hadoop.io.Text from FetcherThread

2017-09-28 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16184069#comment-16184069 ] Markus Jelsma commented on NUTCH-2418: -- This was due to a custom parser emitting null for the title

[jira] [Closed] (NUTCH-2418) NPE in org.apache.hadoop.io.Text from FetcherThread

2017-09-28 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2418. > NPE in org.apache.hadoop.io.Text from FetcherThr

[jira] [Resolved] (NUTCH-2418) NPE in org.apache.hadoop.io.Text from FetcherThread

2017-09-28 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2418. -- Resolution: Not A Problem > NPE in org.apache.hadoop.io.Text from FetcherThr

[jira] [Updated] (NUTCH-2434) Option to reset parameters HTMLMetaTags

2017-09-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2434: - Attachment: NUTCH-2434.patch Patch for master > Option to reset parameters HTMLMetaT

[jira] [Created] (NUTCH-2434) Option to reset parameters HTMLMetaTags

2017-09-27 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2434: Summary: Option to reset parameters HTMLMetaTags Key: NUTCH-2434 URL: https://issues.apache.org/jira/browse/NUTCH-2434 Project: Nutch Issue Type: Task

[jira] [Updated] (NUTCH-2418) NPE in org.apache.hadoop.io.Text from FetcherThread

2017-09-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2418: - Description: {code} 2017-09-05 15:28:54,539 INFO [FetcherThread

[jira] [Updated] (NUTCH-2432) Protocol httpclient to disable cookies if http.enable.cookie.header is false

2017-09-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2432: - Attachment: NUTCH-2432.patch Patch for master! > Protocol httpclient to disable cook

[jira] [Created] (NUTCH-2432) Protocol httpclient to disable cookies if http.enable.cookie.header is false

2017-09-25 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2432: Summary: Protocol httpclient to disable cookies if http.enable.cookie.header is false Key: NUTCH-2432 URL: https://issues.apache.org/jira/browse/NUTCH-2432 Project

[jira] [Updated] (NUTCH-2420) Bug in variable generate.max.count and fetcher.server.delay

2017-09-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2420: - Attachment: NUTCH-2420.patch Patch for master! This calls open and reset each time a HostDatum

[jira] [Created] (NUTCH-2420) Bug in variable generate.max.count and fetcher.server.delay

2017-09-11 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2420: Summary: Bug in variable generate.max.count and fetcher.server.delay Key: NUTCH-2420 URL: https://issues.apache.org/jira/browse/NUTCH-2420 Project: Nutch

[jira] [Updated] (NUTCH-2419) Domain blacklist URL filter does not respect command-line override for file

2017-09-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2419: - Attachment: NUTCH-2419.patch Patch for trunk! > Domain blacklist URL filter does not resp

[jira] [Updated] (NUTCH-2417) Support for variable fetch delay via FreeGenerator

2017-09-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2417: - Attachment: (was: NUTCH-2417.patch) > Support for variable fetch delay via FreeGenera

[jira] [Commented] (NUTCH-2417) Support for variable fetch delay via FreeGenerator

2017-09-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155138#comment-16155138 ] Markus Jelsma commented on NUTCH-2417: -- No patch, wrong ticket! > Support for variable fetch de

[jira] [Updated] (NUTCH-2417) Support for variable fetch delay via FreeGenerator

2017-09-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2417: - Attachment: NUTCH-2417.patch Patch for trunk! > Support for variable fetch delay

[jira] [Created] (NUTCH-2419) Domain blacklist URL filter does not respect command-line override for file

2017-09-06 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2419: Summary: Domain blacklist URL filter does not respect command-line override for file Key: NUTCH-2419 URL: https://issues.apache.org/jira/browse/NUTCH-2419 Project

[jira] [Created] (NUTCH-2418) NPE in org.apache.hadoop.io.Text from FetcherThread

2017-09-05 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2418: Summary: NPE in org.apache.hadoop.io.Text from FetcherThread Key: NUTCH-2418 URL: https://issues.apache.org/jira/browse/NUTCH-2418 Project: Nutch Issue Type

[jira] [Commented] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-08-31 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148906#comment-16148906 ] Markus Jelsma commented on NUTCH-2411: -- Added param to explicitly list the fields that are supposed

[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-08-31 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2411: - Description: {code} index.metadata.separator Separator to use if you want to index

[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-08-31 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2411: - Attachment: NUTCH-2411-1.13.patch > Index-metadata to support indexing multiple val

[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-08-31 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2411: - Attachment: NUTCH-2411-1.13.patch Patch for 1.13 > Index-metadata to support indexing multi

[jira] [Commented] (NUTCH-2415) Create a JEXL based IndexingFilter

2017-08-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147994#comment-16147994 ] Markus Jelsma commented on NUTCH-2415: -- [~jorgelbg] good points, please feel free to assign this one

[jira] [Created] (NUTCH-2417) Support for variable fetch delay via FreeGenerator

2017-08-30 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2417: Summary: Support for variable fetch delay via FreeGenerator Key: NUTCH-2417 URL: https://issues.apache.org/jira/browse/NUTCH-2417 Project: Nutch Issue Type

[jira] [Updated] (NUTCH-2416) Fetcher to log thread ID

2017-08-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2416: - Attachment: NUTCH-2416.patch Didn't save changes at previous patch! > Fetcher to log thread

[jira] [Updated] (NUTCH-2416) Fetcher to log thread ID

2017-08-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2416: - Attachment: NUTCH-2416.patch > Fetcher to log thread

[jira] [Commented] (NUTCH-2416) Fetcher to log thread ID

2017-08-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146915#comment-16146915 ] Markus Jelsma commented on NUTCH-2416: -- It seems Thread.currentThread().toString() prints the same

[jira] [Created] (NUTCH-2416) Fetcher to log thread ID

2017-08-30 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2416: Summary: Fetcher to log thread ID Key: NUTCH-2416 URL: https://issues.apache.org/jira/browse/NUTCH-2416 Project: Nutch Issue Type: Improvement

[jira] [Updated] (NUTCH-2416) Fetcher to log thread ID

2017-08-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2416: - Attachment: NUTCH-2416.patch > Fetcher to log thread

[jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.

2017-08-28 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144237#comment-16144237 ] Markus Jelsma commented on NUTCH-2414: -- Although filtering on lang field is a good idea, i think we

[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-08-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2411: - Affects Version/s: 1.13 > Index-metadata to support indexing multiple values for a fi

[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-08-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2411: - Fix Version/s: 1.14 > Index-metadata to support indexing multiple values for a fi

[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-08-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2411: - Description: {code} index.metadata.separator Separator to use if you want to index

[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-08-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2411: - Attachment: NUTCH-2411.patch Patch! > Index-metadata to support indexing multiple val

[jira] [Created] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-08-22 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2411: Summary: Index-metadata to support indexing multiple values for a field Key: NUTCH-2411 URL: https://issues.apache.org/jira/browse/NUTCH-2411 Project: Nutch

RE: Styles

2017-08-18 Thread Markus Jelsma
Although it has not troubled me so far, if there is something you think there is to improve i would most certainly welcome it! Feel free to open an issue at our Jira, provide a patch or pull request so it can be dealt with. Regards, Markus -Original message- > From:kenneth mcfarland

[jira] [Commented] (NUTCH-2335) Injector not to filter and normalize existing URLs in CrawlDb

2017-08-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130206#comment-16130206 ] Markus Jelsma commented on NUTCH-2335: -- Ok, with this modification, it doesn't print with -noFilter

[jira] [Commented] (NUTCH-2335) Injector not to filter and normalize existing URLs in CrawlDb

2017-08-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130204#comment-16130204 ] Markus Jelsma commented on NUTCH-2335: -- I moved the sys.out to: {code} if (filters != null

[jira] [Commented] (NUTCH-2335) Injector not to filter and normalize existing URLs in CrawlDb

2017-08-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130191#comment-16130191 ] Markus Jelsma commented on NUTCH-2335: -- I see, will modify the println to the correct branch

[jira] [Commented] (NUTCH-2335) Injector not to filter and normalize existing URLs in CrawlDb

2017-08-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130110#comment-16130110 ] Markus Jelsma commented on NUTCH-2335: -- passing -noFilter does not change anything. I am staring

[jira] [Commented] (NUTCH-2335) Injector not to filter and normalize existing URLs in CrawlDb

2017-08-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130105#comment-16130105 ] Markus Jelsma commented on NUTCH-2335: -- Sebastian, there is a problem with either this patch
