[
https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1037:
-
Attachment: NUTCH-1037-1.4-3.patch
Of course. Patch doesn't lowercase anchors and added config
Add classes to bin/nutch
Key: NUTCH-1049
URL: https://issues.apache.org/jira/browse/NUTCH-1049
Project: Nutch
Issue Type: Improvement
Reporter: Markus Jelsma
The following classes should be added to
[
https://issues.apache.org/jira/browse/NUTCH-1049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1049:
-
Attachment: NUTCH-1049-1.4-1.patch
Patch for 1.4.
Add classes to bin/nutch
I see on the wiki that HTTPS is supported by protocol-httpclient but not
protocol-http.
However, protocol-httpclient is not recommended for use (
https://issues.apache.org/jira/browse/NUTCH-990).
Is there a plan for supporting HTTPS? Happy to help implement if possible :)
Thanks
Matt
Have been thinking about this again. We could make it so that the indexer does
not necessarily require a linkDB: some people are not particularly
interested in getting the anchors. At the moment you have to have a linkDB.
This would make it a bit simpler (and quicker) to index within a crawl
This is what I was thinking also previously :)
It would seem sensible to have the option. I definitely have use cases where
the links are not important.
On Thu, Jul 14, 2011 at 2:03 PM, Julien Nioche
lists.digitalpeb...@gmail.com wrote:
Have been thinking about this again. We could make so
Add segmentDir option to WebGraph
-
Key: NUTCH-1050
URL: https://issues.apache.org/jira/browse/NUTCH-1050
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.3
Reporter: Markus Jelsma
[
https://issues.apache.org/jira/browse/NUTCH-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1050:
-
Attachment: NUTCH-1050-1.4-1.patch
Patch for 1.4.
Add segmentDir option to WebGraph
[
https://issues.apache.org/jira/browse/NUTCH-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1050:
-
Description:
One must either merge segments first or script around the lack of a segmentDir
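Scripting around the missing option usually means enumerating the segment directories yourself. A minimal sketch of that step, assuming segments live as subdirectories of one parent directory; `list_segments` is a hypothetical helper, not part of Nutch:

```python
import os


def list_segments(segment_dir):
    """Return the sorted paths of every subdirectory of segment_dir.

    Hypothetical helper: it only lists directories and does not check
    that each one is a valid Nutch segment.
    """
    return sorted(
        os.path.join(segment_dir, name)
        for name in os.listdir(segment_dir)
        if os.path.isdir(os.path.join(segment_dir, name))
    )
```

Each returned path could then be handed to the job one at a time, which is the per-segment loop a segmentDir option would make unnecessary.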
[
https://issues.apache.org/jira/browse/NUTCH-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma reassigned NUTCH-1050:
Assignee: Markus Jelsma
Add segmentDir option to WebGraph
Hi,
If we filter and normalize hyperlinks in the parse job, we wouldn't have to
filter and normalize during all other jobs (perhaps except injector). This
would spare a lot of CPU time for updating crawl and link db. It would also, I
think, help the WebGraph as it operates on segments'
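The idea in this message, normalizing and filtering outlinks once at parse time so later jobs can reuse the result, can be sketched as follows. This is an illustration only, not Nutch code: `normalize`, `accept`, `parse_outlinks`, and the `.nl` suffix rule are all hypothetical stand-ins for the configured normalizers and filters.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Hypothetical normalizer: lowercase scheme and host, drop the fragment."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def accept(url, allowed_suffixes=(".nl",)):
    """Hypothetical filter: keep only hosts ending in an allowed suffix."""
    host = urlsplit(url).hostname or ""
    return host.endswith(allowed_suffixes)


def parse_outlinks(raw_links):
    """Normalize and filter once, at parse time, dropping duplicates."""
    cleaned = []
    for link in raw_links:
        url = normalize(link)
        if accept(url) and url not in cleaned:
            cleaned.append(url)
    return cleaned
```

Under this assumption, jobs that later read the parsed outlinks would not need to run the same normalizers and filters again.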
On Thursday 14 July 2011 15:03:34 Julien Nioche wrote:
Have been thinking about this again. We could make it so that the indexer does
not necessarily require a linkDB: some people are not particularly
interested in getting the anchors. At the moment you have to have a linkDB.
This would make
Are you sure we don't already filter and normalize at the end of the
parse? (not in front of code - sorry can't check)
On 14 July 2011 16:37, Markus Jelsma markus.jel...@openindex.io wrote:
Hi,
If we filter and normalize hyperlinks in the parse job, we wouldn't have to
filter and
To be honest, I am not. But when reasoning, why would we filter and normalize
everywhere when it's already done in parsing?
... tested..
I injected a .nl URL, generated and fetched. Then I modified the urlfilter to deny
everything, did a parse, and modified the filter again to allow .nl pages. I
how can i be deleted from this mailing list pls ?
tks
ciao
gioele
Gioele Zanzico
Senior Web Analyst
Vitec Group Imaging Staging Division
Direct Line: +39 042407
Vitec Group Imaging Staging Division, Via Sasso Rosso 19, I-36061 Bassano del
Grappa (VI), Italy
T +39 0424 555 855 F +39
http://www.google.co.uk/search?q=nutch+mailing+list - 1st result
On 14 July 2011 16:50, Zanzico Gioele gioele.zanz...@vitecgroup.com wrote:
how can i be deleted from this mailing list pls ?
tks
ciao
gioele
Gioele Zanzico
Senior Web Analyst
Vitec Group Imaging Staging Division
Direct
Dear Wiki user,
You have subscribed to a wiki page or wiki category on Nutch Wiki for change
notification.
The NutchTutorial page has been changed by JoeLencioni:
http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=35&rev2=36
+ 'This tutorial deals with Nutch 1.3. For older versions,
The NutchTutorialPre1.3 page has been changed by JoeLencioni:
http://wiki.apache.org/nutch/NutchTutorialPre1.3?action=diff&rev1=36&rev2=37
Comment:
Removing references to 1.3 or = 1.3
-
The NutchTutorial page has been changed by JoeLencioni:
http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=37&rev2=38
Comment:
changing bin/nutch index to bin/nutch solrindex
The NutchTutorial page has been changed by JoeLencioni:
http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=38&rev2=39
Comment:
adding crucial parse command
Now the database
[
https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Pease updated NUTCH-1052:
-
Summary: Multiple deletes of the same URL using SolrClean (was: Multiple
delete of the same URL using
Multiple delete of the same URL using SolrClean
---
Key: NUTCH-1052
URL: https://issues.apache.org/jira/browse/NUTCH-1052
Project: Nutch
Issue Type: Improvement
Components: indexer
See https://builds.apache.org/job/Nutch-trunk/1546/
--
[...truncated 985 lines...]
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A
src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A