[jira] [Updated] (NUTCH-1037) Deduplicate anchors before indexing

2011-07-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1037: - Attachment: NUTCH-1037-1.4-3.patch Of course. Patch doesn't lowercase anchors and added config
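The idea behind NUTCH-1037 can be illustrated with a minimal sketch. This is not the code in the attached patch: the function name `dedupe_anchors` and the `case_insensitive` flag are hypothetical, standing in for the configuration option the comment mentions. It only shows one way duplicate anchors might be dropped while the surviving anchors keep their original casing.

```python
def dedupe_anchors(anchors, case_insensitive=False):
    """Drop duplicate anchor texts, keeping first-seen order and original casing.

    Hypothetical helper for illustration; not part of Nutch.
    """
    seen = set()
    unique = []
    for anchor in anchors:
        # The comparison key may be lowercased, but the stored anchor never is.
        key = anchor.lower() if case_insensitive else anchor
        if key not in seen:
            seen.add(key)
            unique.append(anchor)
    return unique


if __name__ == "__main__":
    sample = ["Home", "home", "Home", "About"]
    print(dedupe_anchors(sample))
    print(dedupe_anchors(sample, case_insensitive=True))
```

Comparing on a lowercased key while storing the original string is a design choice of this sketch: it lets a setting control how aggressive deduplication is without altering what gets indexed.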

[jira] [Created] (NUTCH-1049) Add classes to bin/nutch

2011-07-14 Thread Markus Jelsma (JIRA)
Add classes to bin/nutch Key: NUTCH-1049 URL: https://issues.apache.org/jira/browse/NUTCH-1049 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma The following classes should be added to

[jira] [Updated] (NUTCH-1049) Add classes to bin/nutch

2011-07-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1049: - Attachment: NUTCH-1049-1.4-1.patch Patch for 1.4. Add classes to bin/nutch

HTTPS support

2011-07-14 Thread Matthew Painter
I see on the wiki that HTTPS is supported by protocol-httpclient but not protocol-http. However, protocol-httpclient is not recommended for use ( https://issues.apache.org/jira/browse/NUTCH-990). Is there a plan for supporting HTTPS? Happy to help implement if possible :) Thanks Matt

Re: Real-time Solr integration

2011-07-14 Thread Julien Nioche
Have been thinking about this again. We could make it so that the indexer does not necessarily require a linkDB: some people are not particularly interested in getting the anchors. At the moment you have to have a linkDB. This would make it a bit simpler (and quicker) to index within a crawl

Re: Real-time Solr integration

2011-07-14 Thread Matthew Painter
This is what I was thinking previously as well :) It would seem sensible to have the option. I definitely have use cases where the links are not important. On Thu, Jul 14, 2011 at 2:03 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Have been thinking about this again. We could make it so

[jira] [Created] (NUTCH-1050) Add segmentDir option to WebGraph

2011-07-14 Thread Markus Jelsma (JIRA)
Add segmentDir option to WebGraph - Key: NUTCH-1050 URL: https://issues.apache.org/jira/browse/NUTCH-1050 Project: Nutch Issue Type: Improvement Affects Versions: 1.3 Reporter: Markus Jelsma

[jira] [Updated] (NUTCH-1050) Add segmentDir option to WebGraph

2011-07-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1050: - Attachment: NUTCH-1050-1.4-1.patch Patch for 1.4. Add segmentDir option to WebGraph

[jira] [Updated] (NUTCH-1050) Add segmentDir option to WebGraph

2011-07-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1050: - Description: One must either merge segments first or script around the lack of a segmentDir

[jira] [Assigned] (NUTCH-1050) Add segmentDir option to WebGraph

2011-07-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-1050: Assignee: Markus Jelsma Add segmentDir option to WebGraph

Normalize and filter hyperlinks during parse

2011-07-14 Thread Markus Jelsma
Hi, If we filter and normalize hyperlinks in the parse job, we wouldn't have to filter and normalize during all other jobs (perhaps except the injector). This would spare a lot of CPU time when updating the crawl and link DBs. It would also, I think, help the WebGraph as it operates on segments'
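To make the proposal concrete, here is a rough sketch of normalizing and filtering outlinks once, at parse time, so that later jobs could consume them as-is. It does not use Nutch's actual URL normalizer or filter plugins: `normalize`, `accept`, `clean_outlinks` and the `ALLOW` pattern are hypothetical names and rules chosen only for illustration.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed filter rule for this sketch: only http(s) URLs pass.
ALLOW = re.compile(r"^https?://")


def normalize(url):
    """Lowercase scheme and host, default an empty path to '/', drop the fragment."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Simplification: lowercasing netloc would also lowercase any userinfo.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def accept(url):
    """Return True if the (already normalized) URL passes the filter."""
    return bool(ALLOW.match(url))


def clean_outlinks(outlinks):
    """Normalize, then filter, then deduplicate outlinks while preserving order."""
    cleaned = []
    seen = set()
    for url in outlinks:
        candidate = normalize(url)
        if accept(candidate) and candidate not in seen:
            seen.add(candidate)
            cleaned.append(candidate)
    return cleaned


if __name__ == "__main__":
    print(clean_outlinks([
        "HTTP://Example.NL/page#top",
        "ftp://example.nl/file",
        "http://example.nl",
    ]))
```

Normalizing before filtering matters in this sketch: a filter rule written against the canonical form would otherwise miss variants that differ only in case or fragment.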

Re: Real-time Solr integration

2011-07-14 Thread Markus Jelsma
On Thursday 14 July 2011 15:03:34 Julien Nioche wrote: Have been thinking about this again. We could make it so that the indexer does not necessarily require a linkDB: some people are not particularly interested in getting the anchors. At the moment you have to have a linkDB. This would make

Re: Normalize and filter hyperlinks during parse

2011-07-14 Thread Julien Nioche
Are you sure we don't already filter and normalize at the end of the parse? (Not in front of the code, sorry, can't check.) On 14 July 2011 16:37, Markus Jelsma markus.jel...@openindex.io wrote: Hi, If we filter and normalize hyperlinks in the parse job, we wouldn't have to filter and

Re: Normalize and filter hyperlinks during parse

2011-07-14 Thread Markus Jelsma
To be honest, I am not. But reasoning it through: why would we filter and normalize everywhere if it's already done in parsing? ... tested ... I injected a .nl URL, generated and fetched. Then I modified the urlfilter to deny everything, did a parse, and modified the filter again to allow .nl pages. I

R: HTTPS support

2011-07-14 Thread Zanzico Gioele
How can I be removed from this mailing list, please? Thanks, ciao, Gioele Gioele Zanzico Senior Web Analyst Vitec Group Imaging Staging Division Direct Line: +39  042407   Vitec Group Imaging Staging Division, Via Sasso Rosso 19, I-36061 Bassano del Grappa (VI), Italy T +39 0424 555 855 F +39

Re: HTTPS support

2011-07-14 Thread Julien Nioche
http://www.google.co.uk/search?q=nutch+mailing+list - 1st result On 14 July 2011 16:50, Zanzico Gioele gioele.zanz...@vitecgroup.com wrote: how can i be deleted from this mailing list pls ? tks ciao gioele Gioele Zanzico Senior Web Analyst Vitec Group Imaging Staging Division Direct

[Nutch Wiki] Update of NutchTutorial by JoeLencioni

2011-07-14 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The NutchTutorial page has been changed by JoeLencioni: http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=35&rev2=36 + 'This tutorial deals with Nutch 1.3. For older versions,

[Nutch Wiki] Update of NutchTutorialPre1.3 by JoeLencioni

2011-07-14 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The NutchTutorialPre1.3 page has been changed by JoeLencioni: http://wiki.apache.org/nutch/NutchTutorialPre1.3?action=diff&rev1=36&rev2=37 Comment: Removing references to 1.3 or = 1.3 -

[Nutch Wiki] Update of NutchTutorial by JoeLencioni

2011-07-14 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The NutchTutorial page has been changed by JoeLencioni: http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=37&rev2=38 Comment: changing bin/nutch index to bin/nutch solrindex

[Nutch Wiki] Update of NutchTutorial by JoeLencioni

2011-07-14 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The NutchTutorial page has been changed by JoeLencioni: http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=38&rev2=39 Comment: adding crucial parse command Now the database

[jira] [Updated] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

2011-07-14 Thread Tim Pease (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Pease updated NUTCH-1052: - Summary: Multiple deletes of the same URL using SolrClean (was: Multiple delete of the same URL using

[jira] [Created] (NUTCH-1052) Multiple delete of the same URL using SolrClean

2011-07-14 Thread Tim Pease (JIRA)
Multiple delete of the same URL using SolrClean --- Key: NUTCH-1052 URL: https://issues.apache.org/jira/browse/NUTCH-1052 Project: Nutch Issue Type: Improvement Components: indexer

Build failed in Jenkins: Nutch-trunk #1546

2011-07-14 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/1546/ -- [...truncated 985 lines...] A src/plugin/subcollection/src/java/org/apache/nutch/collection A src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java A