Hey guys,
Sorry for the non-responsiveness here. I recently left my old employment and
have been packing for a cross-country move.
I agree that for 1.0 the best bet is what Sami has done. The code that I was
working on is available here:
http://github.com/toddlipcon/nutch/tree/nutch-669
But it
[
https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12665855#action_12665855
]
Todd Lipcon commented on NUTCH-676:
---
Have you run some full crawls yet? I wrote pretty
[
https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12665889#action_12665889
]
Todd Lipcon commented on NUTCH-676:
---
Hmm, I can't seem to find the bug I thought I
Hi Matt,
Nutch segments are stored as Hadoop SequenceFiles and MapFiles. A MapFile
is made up of multiple SequenceFiles (a sorted data file and an index
file). I'm not certain if the format is
documented anywhere, but the source is in org.apache.hadoop.io. I doubt
you'll find a PHP library for reading them, so you'll probably have
[
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12660382#action_12660382
]
Todd Lipcon commented on NUTCH-669:
---
Here's a further report on my progress:
- It turns
[
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12659857#action_12659857
]
Todd Lipcon commented on NUTCH-669:
---
Hey guys,
I tried it on production, but ran
Reporter: Todd Lipcon
Priority: Minor
The MapWritable implementation in o.a.n.crawl is written confusingly - it
maintains its own internal linked list which I think may have a bug somewhere
(I'm getting an NPE in certain cases in the code, though it's hard to track
down
[
https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon updated NUTCH-676:
--
Attachment: 0001-NUTCH-676-Replace-MapWritable-implementation-with-t.patch
NUTCH-676: Replace
[
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12659958#action_12659958
]
Todd Lipcon commented on NUTCH-669:
---
Found the exception in a screen log:
{noformat
[
https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12659961#action_12659961
]
Todd Lipcon commented on NUTCH-676:
---
Oops - please disregard above patch - it breaks
Reporter: Todd Lipcon
Priority: Trivial
In development it's handy to be able to run a single test case easily. You can
do it with ant -Dtestcase=foo test, but that's slow since it still checks all
the plugins for changes, rebuilds jars, etc.
This patch adds a command to bin/nutch to run
[
https://issues.apache.org/jira/browse/NUTCH-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon updated NUTCH-672:
--
Attachment: 0001-NUTCH-672-allow-junit-tests-to-be-run-from-bin-nutc.patch
allow unit tests to be run
[
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12655491#action_12655491
]
Todd Lipcon commented on NUTCH-669:
---
For those watching this issue: I pushed a couple more
[
https://issues.apache.org/jira/browse/NUTCH-670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12655513#action_12655513
]
Todd Lipcon commented on NUTCH-670:
---
Turns out this is actually a bit trickier if I'm
[
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12653844#action_12653844
]
Todd Lipcon commented on NUTCH-669:
---
Agreed on all fronts.
I spent several hours
[
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12653940#action_12653940
]
Todd Lipcon commented on NUTCH-669:
---
I've pushed the initial commit of this rewrite
[
https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12653412#action_12653412
]
Todd Lipcon commented on NUTCH-207:
---
Are both fetcher and fetcher2 supposed
Versions: 0.9.0
Reporter: Todd Lipcon
Priority: Minor
I'd like to consolidate a lot of the common code between Fetcher and
Fetcher2.java.
It seems to me like there are the following differences:
- Fetcher relies on the Protocol to obey robots.txt and crawl delay settings