[
https://issues.apache.org/jira/browse/NUTCH-558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530755
]
Chris Schneider commented on NUTCH-558:
---
The reason that DomainStats does not use URLUtils is that (as
[
https://issues.apache.org/jira/browse/NUTCH-558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529749
]
Chris Schneider commented on NUTCH-558:
---
I made a comment in the source about this, but thinking about it
Need tool to retrieve domain statistics
---
Key: NUTCH-558
URL: https://issues.apache.org/jira/browse/NUTCH-558
Project: Nutch
Issue Type: New Feature
Affects Versions: 0.9.0
Reporter:
[
http://issues.apache.org/jira/browse/NUTCH-351?page=comments#action_12446424 ]
Chris Schneider commented on NUTCH-351:
---
I just noticed a bug in the patch above. I believe it's missing a return
sequence between the Host: host and
Server delay feature conflicts with maxThreadsPerHost
-
Key: NUTCH-385
URL: http://issues.apache.org/jira/browse/NUTCH-385
Project: Nutch
Issue Type: Bug
Components: fetcher
[
http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12441528 ]
Chris Schneider commented on NUTCH-385:
---
This comment was actually made by Andrzej in response to an email containing
the analysis above that I sent him
[
http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12441529 ]
Chris Schneider commented on NUTCH-385:
---
This comment was actually made by Ken Krugler, who was responding to Andrzej's
comment above:
[with respect to
[
http://issues.apache.org/jira/browse/NUTCH-351?page=comments#action_12438002 ]
Chris Schneider commented on NUTCH-351:
---
I would really appreciate it if Sami could explain in a little more detail what
this patch adds to the proxy support
[ http://issues.apache.org/jira/browse/NUTCH-371?page=all ]
Chris Schneider updated NUTCH-371:
--
Description:
DeleteDuplicates is supposed to delete documents with duplicate URLs (after
deleting documents with identical MD5 hashes), but this part is
[
http://issues.apache.org/jira/browse/NUTCH-273?page=comments#action_12430117 ]
Chris Schneider commented on NUTCH-273:
---
Another reason why it would be better to wait until the next segment to process
the target of the redirect is that
Generator is building fetch list using *lowest* scoring URLs
Key: NUTCH-348
URL: http://issues.apache.org/jira/browse/NUTCH-348
Project: Nutch
Issue Type: Bug
[
http://issues.apache.org/jira/browse/NUTCH-342?page=comments#action_12426039 ]
Chris Schneider commented on NUTCH-342:
---
I apologize for my confusion. I had been thinking that hadoop-env.sh was
getting sourced when a Nutch command was
Nutch commands log to nutch/logs/hadoop.logs by default
---
Key: NUTCH-342
URL: http://issues.apache.org/jira/browse/NUTCH-342
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8
[ http://issues.apache.org/jira/browse/NUTCH-342?page=all ]
Chris Schneider updated NUTCH-342:
--
Attachment: NUTCH-342.patch
Here's a patch that defaults NUTCH_LOG_DIR to $HADOOP_LOG_DIR and NUTCH_LOGFILE
to $HADOOP_LOG_FILE.
Nutch commands log to
[ http://issues.apache.org/jira/browse/NUTCH-336?page=all ]
Chris Schneider updated NUTCH-336:
--
Attachment: NUTCH-336.patch.txt
Here's a patch that fixes the problem. It separates a new injectionScore API
out from the initialScore API.
Harvested
Harvested links shouldn't get db.score.injected in addition to inbound
contributions
Key: NUTCH-336
URL: http://issues.apache.org/jira/browse/NUTCH-336
Project:
CommonGrams loads analysis.common.terms.file for each query
---
Key: NUTCH-301
URL: http://issues.apache.org/jira/browse/NUTCH-301
Project: Nutch
Type: Improvement
Components: searcher
Versions:
Indexer doesn't consider linkdb when calculating boost value
Key: NUTCH-267
URL: http://issues.apache.org/jira/browse/NUTCH-267
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
[
http://issues.apache.org/jira/browse/NUTCH-246?page=comments#action_12374253 ]
Chris Schneider commented on NUTCH-246:
---
As it turns out, this problem was due to a time synchronization between the
jobtracker and the tasktrackers. When the URLs were
[ http://issues.apache.org/jira/browse/NUTCH-246?page=all ]
Chris Schneider updated NUTCH-246:
--
Priority: Minor (was: Blocker)
segment size is never as big as topN or crawlDB size in a distributed
deployement
[
http://issues.apache.org/jira/browse/NUTCH-246?page=comments#action_12374049 ]
Chris Schneider commented on NUTCH-246:
---
A few more details:
Stefan and I were able to reproduce this problem using either an injection set
of 4500 URLs or a larger set
RPC call times out while indexing map task is computing splits
--
Key: NUTCH-195
URL: http://issues.apache.org/jira/browse/NUTCH-195
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev