[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls

2012-01-17 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187927#comment-13187927 ] Andrzej Bialecki commented on NUTCH-1201: -- I agree that there are situations

[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

2012-01-14 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186212#comment-13186212 ] Andrzej Bialecki commented on NUTCH-1247: -- Indeed, line 264 increases the retry

[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

2012-01-13 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185908#comment-13185908 ] Andrzej Bialecki commented on NUTCH-1247: -- Originally the reason for a byte was

[jira] [Commented] (NUTCH-1139) Indexer to delete documents

2011-11-10 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147722#comment-13147722 ] Andrzej Bialecki commented on NUTCH-1139: -- I suggest renaming the option to

[jira] [Commented] (NUTCH-1061) Migrate MoreIndexingFilter from Apache ORO to java.util.regex

2011-11-10 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147723#comment-13147723 ] Andrzej Bialecki commented on NUTCH-1061: -- +1. Migrate

[jira] [Commented] (NUTCH-1196) Update job should impose an upper limit on the number of inlinks (nutchgora)

2011-11-04 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144226#comment-13144226 ] Andrzej Bialecki commented on NUTCH-1196: -- Very nicely done and useful patch! A

[jira] [Commented] (NUTCH-1135) Fix TestGoraStorage for Nutchgora

2011-10-14 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127427#comment-13127427 ] Andrzej Bialecki commented on NUTCH-1135: -- A few comments from the author of

[jira] [Commented] (NUTCH-1135) Fix TestGoraStorage for Nutchgora

2011-10-14 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127470#comment-13127470 ] Andrzej Bialecki commented on NUTCH-1135: -- bq. if you prefer to keep the old

[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-12 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125712#comment-13125712 ] Andrzej Bialecki commented on NUTCH-797: - That's unexpected :) I checked the patch

[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-10-12 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125916#comment-13125916 ] Andrzej Bialecki commented on NUTCH-1097: -- +1, the latest patch looks good.

[jira] [Commented] (NUTCH-1142) Normalization and filtering in WebGraph

2011-10-12 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125931#comment-13125931 ] Andrzej Bialecki commented on NUTCH-1142: -- +1, the patch looks good. (There is

[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-11 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124737#comment-13124737 ] Andrzej Bialecki commented on NUTCH-797: - The fixup code in Tika is still a

[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-11 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125016#comment-13125016 ] Andrzej Bialecki commented on NUTCH-797: - Uhh, sorry - I'll fix this in a moment.

[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-11 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125077#comment-13125077 ] Andrzej Bialecki commented on NUTCH-797: - I'm puzzled by the algorithm in

[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-10-11 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125414#comment-13125414 ] Andrzej Bialecki commented on NUTCH-1097: -- +1 the idea makes sense. Patch looks

[jira] [Commented] (NUTCH-1154) Upgrade to Tika 0.10

2011-10-10 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124428#comment-13124428 ] Andrzej Bialecki commented on NUTCH-1154: -- TIKA-748 has been fixed and is

[jira] [Commented] (NUTCH-1124) JUnit test for scoring-opic

2011-10-05 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120982#comment-13120982 ] Andrzej Bialecki commented on NUTCH-1124: -- Our implementation is most definitely