[
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1932:
-
Patch Info: Patch Available
> Automatically remove orphaned pa
[
https://issues.apache.org/jira/browse/NUTCH-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083096#comment-15083096
]
Markus Jelsma commented on NUTCH-2178:
--
Will commit in a few if no further objections
[
https://issues.apache.org/jira/browse/NUTCH-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1449:
-
Patch Info: Patch Available
> Optionally delete documents skipped by IndexingFilt
[
https://issues.apache.org/jira/browse/NUTCH-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1186:
-
Patch Info: Patch Available
> FreeGenerator always normali
[
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2191:
-
Attachment: NUTCH-2191.patch
Patch for trunk! Although all dependencies are correctly listed
Markus Jelsma created NUTCH-2191:
Summary: Add protocol-htmlunit
Key: NUTCH-2191
URL: https://issues.apache.org/jira/browse/NUTCH-2191
Project: Nutch
Issue Type: New Feature
Markus Jelsma created NUTCH-2192:
Summary: Get rid of oro
Key: NUTCH-2192
URL: https://issues.apache.org/jira/browse/NUTCH-2192
Project: Nutch
Issue Type: Task
Reporter: Markus
[
https://issues.apache.org/jira/browse/NUTCH-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2192:
-
Attachment: NUTCH-2192.patch
Patch for trunk. OutlinkExtractor is done. JsParsefilter left
[
https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma resolved NUTCH-2189.
--
Resolution: Fixed
Committed to trunk in revision 1721615. Also updated CHANGES.txt for
NUTCH
[
https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma closed NUTCH-2189.
> Domain filter must deactivate if no rules are pres
[
https://issues.apache.org/jira/browse/NUTCH-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2190:
-
Attachment: NUTCH-2190.patch
Patch for trunk. Tests pass.
{code}
# format: host\tprotocol\n
[
https://issues.apache.org/jira/browse/NUTCH-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069421#comment-15069421
]
Markus Jelsma commented on NUTCH-2065:
--
Agreed!
> Domain URL filter to support protoc
[
https://issues.apache.org/jira/browse/NUTCH-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma closed NUTCH-2065.
Resolution: Won't Fix
> Domain URL filter to support protoc
Markus Jelsma created NUTCH-2190:
Summary: Protocol normalizer
Key: NUTCH-2190
URL: https://issues.apache.org/jira/browse/NUTCH-2190
Project: Nutch
Issue Type: New Feature
[
https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069422#comment-15069422
]
Markus Jelsma commented on NUTCH-2189:
--
Hello Sebastian - i do not think so actually
[
https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2189:
-
Patch Info: Patch Available
> Domain filter must deactivate if no rules are pres
Markus Jelsma created NUTCH-2189:
Summary: Domain filter must deactivate if no rules are present
Key: NUTCH-2189
URL: https://issues.apache.org/jira/browse/NUTCH-2189
Project: Nutch
Issue
[
https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2189:
-
Attachment: NUTCH-2189.patch
Patch for trunk. Test passes. If, for any reason, there are zero
[
https://issues.apache.org/jira/browse/NUTCH-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2065:
-
Attachment: NUTCH-2065.patch
Updated patch to contain NUTCH-2189. Tests pass.
> Domain
[
https://issues.apache.org/jira/browse/NUTCH-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063975#comment-15063975
]
Markus Jelsma commented on NUTCH-2188:
--
Ah yes. You would need to patch SolrUtils.java in the indexer
[
https://issues.apache.org/jira/browse/NUTCH-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061883#comment-15061883
]
Markus Jelsma commented on NUTCH-2188:
--
Solr has built-in security since just a few versions
[
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060036#comment-15060036
]
Markus Jelsma commented on NUTCH-2184:
--
Hello Lewis - you can use the indexer-dummy in unit tests
[
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059750#comment-15059750
]
Markus Jelsma commented on NUTCH-2184:
--
Hello Lewis - keep in mind the possible configurations
[
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059750#comment-15059750
]
Markus Jelsma edited comment on NUTCH-2184 at 12/16/15 9:43 AM:
Hello
[
https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma closed NUTCH-1995.
Resolution: Fixed
Closing again. It seems there was an older Nutch jar lying around. The plugin
[
https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051126#comment-15051126
]
Markus Jelsma edited comment on NUTCH-1995 at 12/10/15 3:36 PM:
Guys, we
[
https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma reopened NUTCH-1995:
--
Guys, we upgraded to 1.11 but got these curious exceptions when running the
crawler on Hadoop
[
https://issues.apache.org/jira/browse/NUTCH-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1449:
-
Attachment: NUTCH-1449.patch
Previous patch was wrong.
> Optionally delete documents skip
Nice!
-Original message-
From: lewis john mcgibbney
Sent: Tuesday 8th December 2015 2:34
To: annou...@apache.org; u...@nutch.apache.org; dev@nutch.apache.org
Subject: [RELEASE] Apache Nutch 1.11
Hello Folks,
07 December 2015 - Nutch 1.11 Release
The Apache Nutch
[
https://issues.apache.org/jira/browse/NUTCH-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1449:
-
Attachment: NUTCH-1449.patch
Patch for trunk again, 1.12
> Optionally delete documents skip
[
https://issues.apache.org/jira/browse/NUTCH-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma resolved NUTCH-2176.
--
Resolution: Fixed
Committed to trunk in rev. 1717622.
> Clean up of log4j.propert
[
https://issues.apache.org/jira/browse/NUTCH-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2176:
-
Summary: Clean up of log4j.properties (was: log4j.properties is a mess)
> Clean
[
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033557#comment-15033557
]
Markus Jelsma commented on NUTCH-2177:
--
+1
> Generator produces only one partition e
Markus Jelsma created NUTCH-2178:
Summary: DeduplicationJob to optionally group on host or domain
Key: NUTCH-2178
URL: https://issues.apache.org/jira/browse/NUTCH-2178
Project: Nutch
Issue
[
https://issues.apache.org/jira/browse/NUTCH-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2178:
-
Attachment: NUTCH-2178.patch
Patch for trunk
> DeduplicationJob to optionally group on h
Markus Jelsma created NUTCH-2176:
Summary: log4j.properties is a mess
Key: NUTCH-2176
URL: https://issues.apache.org/jira/browse/NUTCH-2176
Project: Nutch
Issue Type: Bug
[
https://issues.apache.org/jira/browse/NUTCH-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2176:
-
Attachment: NUTCH-2176.patch
Patch for trunk resolving above mentioned points. Anything else
[
https://issues.apache.org/jira/browse/NUTCH-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2176:
-
Affects Version/s: 1.10
Priority: Trivial (was: Major)
Fix Version/s: 1.11
[
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029015#comment-15029015
]
Markus Jelsma commented on NUTCH-2177:
--
There seems to be no value for mapred.job.tracker on our own
[
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029017#comment-15029017
]
Markus Jelsma commented on NUTCH-2177:
--
+1 for issue being blocker
> Generator produces only
[
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029054#comment-15029054
]
Markus Jelsma commented on NUTCH-2177:
--
On standard Apache Hadoop YARN 2.7.1 running in high
[
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15013513#comment-15013513
]
Markus Jelsma commented on NUTCH-2069:
--
Hi - looks good. One suggestion though. The patch mixes up
[
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15013616#comment-15013616
]
Markus Jelsma edited comment on NUTCH-2069 at 11/19/15 2:35 PM:
Ah, i see
[
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15013616#comment-15013616
]
Markus Jelsma commented on NUTCH-2069:
--
Ah, i see it now indeed. +1 for this patch
> Ignore exter
[
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012193#comment-15012193
]
Markus Jelsma commented on NUTCH-2069:
--
Hi J - i agree with the mode! Have it defaulted so it never
[
https://issues.apache.org/jira/browse/NUTCH-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002072#comment-15002072
]
Markus Jelsma commented on NUTCH-2120:
--
I'm fine with removing it, we're using Hadoop's MapWritable
[
https://issues.apache.org/jira/browse/NUTCH-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma reopened NUTCH-2058:
--
Reopening due to failing unit tests
[
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991421#comment-14991421
]
Markus Jelsma commented on NUTCH-2064:
--
It looks good to me, there are no immediate issues that come
[
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984806#comment-14984806
]
Markus Jelsma commented on NUTCH-2155:
--
By `remove current` and `not require current` you guys mean
[
https://issues.apache.org/jira/browse/NUTCH-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973930#comment-14973930
]
Markus Jelsma commented on NUTCH-2147:
--
Hello, we did something quite similar but used Jexl
[
https://issues.apache.org/jira/browse/NUTCH-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974675#comment-14974675
]
Markus Jelsma commented on NUTCH-2147:
--
boolean CrawlDatum.evaluate(Expression expr) is what you need
[
https://issues.apache.org/jira/browse/NUTCH-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974675#comment-14974675
]
Markus Jelsma edited comment on NUTCH-2147 at 10/26/15 6:00 PM:
Hello
University of Southern California, Los Angeles, CA 90089 USA
> ++++++
> -Original Message-
> From: Markus Jelsma <markus.jel...@openindex.io>
> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
> Date: Monday, October 19,
Hi - i think NUTCH-2064 is too important to miss another release. Everyone
using Nutch needs it, especially if you are using HTTPS since httpclient cannot
deal with unescaped URLs.
M.
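To illustrate the point about unescaped URLs, here is a minimal sketch. It uses java.net.URI from the JDK as a stand-in, not httpclient itself, and the class name, method names and example.org URL are made up for illustration. It shows a strict parser rejecting a URL with a raw space, and the same URL being accepted once the path is percent-encoded.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration only; this is not code from Nutch or httpclient.
class UnescapedUrlSketch {

    /** Returns true if the string parses as a URI without any escaping applied. */
    public static boolean parsesAsIs(String url) {
        try {
            new URI(url);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    /**
     * Builds a URI from its parts. The multi-argument URI constructor
     * percent-encodes characters that are illegal in the path.
     */
    public static String escapePath(String scheme, String host, String path) {
        try {
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parsesAsIs("https://example.org/a b"));
        System.out.println(escapePath("https", "example.org", "/a b"));
    }
}
```

A normalizer that escapes such characters before the fetcher sees the URL avoids this class of failure regardless of which HTTP client is used.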
-Original message-
> From:Mattmann, Chris A (3980)
> Sent: Sunday
[
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1932:
-
Attachment: NUTCH-1932.patch
Patch! Records with an orphan time greater than now > lastInlinkT
[
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963082#comment-14963082
]
Markus Jelsma commented on NUTCH-2144:
--
Hi - i like the purpose of this plugin. The patch, however
[
https://issues.apache.org/jira/browse/NUTCH-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963283#comment-14963283
]
Markus Jelsma commented on NUTCH-2145:
--
+1 for passing it through the normalizer.
> parse/in
[
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964110#comment-14964110
]
Markus Jelsma commented on NUTCH-2144:
--
Yes, this is much more readable indeed
[
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2064:
-
Summary: URLNormalizer basic to encode reserved chars and decode
non-reserved chars
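The idea in this summary can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the NUTCH-2064 patch: the class name, the method name and the exact set of characters left untouched are hypothetical. Percent-escapes of unreserved characters (letters, digits, '-', '.', '_', '~') are decoded, other escapes are kept with uppercase hex digits, and characters outside the kept set, such as spaces and square brackets, are percent-encoded.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the actual Nutch URL normalizer.
class PercentNormalizeSketch {

    private static boolean isUnreserved(int c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // Assumption of this sketch: these path delimiters are passed through unchanged.
    private static boolean isKept(int c) {
        return isUnreserved(c) || "/:@!$&'()*+,;=".indexOf(c) >= 0;
    }

    public static String normalizePath(String path) {
        StringBuilder out = new StringBuilder();
        byte[] bytes = path.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < bytes.length; i++) {
            int b = bytes[i] & 0xff;
            if (b == '%' && i + 2 < bytes.length) {
                int hi = Character.digit((char) (bytes[i + 1] & 0xff), 16);
                int lo = Character.digit((char) (bytes[i + 2] & 0xff), 16);
                if (hi >= 0 && lo >= 0) {
                    int decoded = hi * 16 + lo;
                    if (isUnreserved(decoded)) {
                        out.append((char) decoded);                   // decode unreserved escapes
                    } else {
                        out.append(String.format("%%%02X", decoded)); // keep escape, uppercase hex
                    }
                    i += 2;
                    continue;
                }
            }
            if (isKept(b)) {
                out.append((char) b);
            } else {
                out.append(String.format("%%%02X", b));               // encode everything else
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalizePath("/a%7Eb c[1]%2f"));
    }
}
```

With this sketch, "/a%7Eb c[1]%2f" becomes "/a~b%20c%5B1%5D%2F". A stray '%' that is not followed by two hex digits is encoded as %25.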
Very cool! This is probably going to be useful.
-Original message-
From: Julien Nioche
Sent: Wednesday 23rd September 2015 16:35
To: u...@nutch.apache.org; dev@nutch.apache.org
Subject: Webcast : Apache Nutch on EMR
Hi again,
I have uploaded a webcast
[
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1932:
-
Attachment: NUTCH-1932.patch
Updated patch. CrawlDatum now supports Jexl expressions on Long
[
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1932:
-
Attachment: NUTCH-1932.patch
Fixed bad long to int casting.
> Automatically remove orpha
[
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1932:
-
Attachment: NUTCH-1932.patch
Wrong default in code was used for markOrphanAfter. Config is ok
[
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1932:
-
Attachment: NUTCH-1932.patch
Uh, using long over int for time keeping makes no sense. Relies
Welcome!!
-Original message-
From: Sujen Shah
Sent: Wednesday 16th September 2015 0:58
To: dev@nutch.apache.org
Cc: u...@nutch.apache.org
Subject: Re: [ANNOUNCE] New Nutch committer and PMC - Sujen Shah
Hi Everyone,
I would like to thank the members of the Apache
[
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1932:
-
Attachment: NUTCH-1932.patch
Probably the final patch. It now includes:
* moving reducer code
[
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747316#comment-14747316
]
Markus Jelsma commented on NUTCH-2102:
--
Hello Julien! I believe this warc format is the updated arc
[
https://issues.apache.org/jira/browse/NUTCH-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma resolved NUTCH-2093.
--
Resolution: Fixed
Assignee: Markus Jelsma
Committed to trunk in revision 1703111
[
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744959#comment-14744959
]
Markus Jelsma commented on NUTCH-2064:
--
I think having it in CC makes sense indeed. I shall commit
[
https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744953#comment-14744953
]
Markus Jelsma commented on NUTCH-2097:
--
Interesting! What does 'Complete Ant + Ivy build system
[
https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744953#comment-14744953
]
Markus Jelsma edited comment on NUTCH-2097 at 9/15/15 6:50 AM:
---
Interesting
[
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1932:
-
Attachment: NUTCH-1932.patch
Eeh, patch with the scoring filter itself. Apparently it is possible
[
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1932:
-
Attachment: NUTCH-1932.patch
New and much simpler patch. This relies on a scoring filter to mark
[
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1932:
-
Attachment: NUTCH-1932.patch
> Automatically remove orphaned pa
[
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1932:
-
Attachment: NUTCH-1932.patch
First proper working patch. Tests pass
> Automatically rem
[
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1932:
-
Description: Orphan scoring filter that determines whether a page has
become orphaned, e.g
[
https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745322#comment-14745322
]
Markus Jelsma commented on NUTCH-2097:
--
Yes, having them as separate mapper and reducer class files
[
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746034#comment-14746034
]
Markus Jelsma commented on NUTCH-1932:
--
Hello Sebastian. I am not sure about that being on the list
Markus Jelsma created NUTCH-2093:
Summary: Indexing filters have no signature in CrawlDatum if
crawled via FreeGenerator
Key: NUTCH-2093
URL: https://issues.apache.org/jira/browse/NUTCH-2093
Project
[
https://issues.apache.org/jira/browse/NUTCH-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2093:
-
Attachment: NUTCH-2093.patch
Patch for trunk.
> Indexing filters have no signature in CrawlDa
Welcome!
-Original message-
> From:Sebastian Nagel
> Sent: Thursday 10th September 2015 0:01
> To: dev@nutch.apache.org
> Cc: u...@nutch.apache.org
> Subject: [ANNOUNCE] New Nutch committer and PMC - Asitang Mishra
>
> Dear all,
>
> on behalf of the Nutch
[
https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716809#comment-14716809
]
Markus Jelsma edited comment on NUTCH-1084 at 8/27/15 3:03 PM
[
https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716809#comment-14716809
]
Markus Jelsma commented on NUTCH-1084:
--
I am getting sad, setting
[
https://issues.apache.org/jira/browse/NUTCH-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma resolved NUTCH-2085.
--
Resolution: Fixed
Assignee: Markus Jelsma
Committed to trunk in rev 1697860.
Upgrade
[
https://issues.apache.org/jira/browse/NUTCH-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma resolved NUTCH-2084.
--
Resolution: Fixed
Assignee: Markus Jelsma
Committed to trunk in rev 1697858.
Track
Yes Julien, please commit. I do think
https://issues.apache.org/jira/browse/NUTCH-2064 should also be included. But i
have my hands full atm.
-Original message-
From: Julien Nioche <lists.digitalpeb...@gmail.com>
Sent: Wednesday 26th August 2015 13:51
To: dev@nutch.apache.org
Subject: Re:
[
https://issues.apache.org/jira/browse/NUTCH-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2084:
-
Description: When merging 1000's of segments, and one is corrupt, broken,
whatever, the merge
[
https://issues.apache.org/jira/browse/NUTCH-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2084:
-
Attachment: NUTCH-2084.patch
Patch for trunk.
Track changes in input dirs for SegmentMerger
[
https://issues.apache.org/jira/browse/NUTCH-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710828#comment-14710828
]
Markus Jelsma commented on NUTCH-2084:
--
Well, this immediately helped me track down
Markus Jelsma created NUTCH-2084:
Summary: Track changes in input dirs for SegmentMerger
Key: NUTCH-2084
URL: https://issues.apache.org/jira/browse/NUTCH-2084
Project: Nutch
Issue Type: Bug
Markus Jelsma created NUTCH-2085:
Summary: Upgrade Guava
Key: NUTCH-2085
URL: https://issues.apache.org/jira/browse/NUTCH-2085
Project: Nutch
Issue Type: Task
Affects Versions: 1.10
[
https://issues.apache.org/jira/browse/NUTCH-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2085:
-
Attachment: NUTCH-2085.patch
Patch for trunk. Tests pass except for ParserFactory, which fails
[
https://issues.apache.org/jira/browse/NUTCH-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2085:
-
Patch Info: Patch Available
Upgrade Guava
-
Key: NUTCH-2085
[
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646504#comment-14646504
]
Markus Jelsma commented on NUTCH-2069:
--
Fine with the feature but there's a lot
[
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1932:
-
Attachment: NUTCH-1932.patch
Updated patch for trunk. This still relies on a LinkDB and a CrawlDB
[
https://issues.apache.org/jira/browse/NUTCH-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2068:
-
Attachment: NUTCH-2068.patch
patch for trunk
Allow subcollection overrides via metadata
[
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2064:
-
Attachment: NUTCH-2064.patch
Quick and dirty patch where [ and ] are also encoded. I just
[
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642500#comment-14642500
]
Markus Jelsma commented on NUTCH-2064:
--
Also, spaces are now also escaped
[
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2064:
-
Attachment: NUTCH-1098.patch
Excellent! I have added both characters as a new test and it passes
Markus Jelsma created NUTCH-2065:
Summary: Domain URL filter to support protocols
Key: NUTCH-2065
URL: https://issues.apache.org/jira/browse/NUTCH-2065
Project: Nutch
Issue Type: Improvement