Build failed in Jenkins: Nutch-trunk #1546

2011-07-14 Thread Apache Jenkins Server
See -- [...truncated 985 lines...] A src/plugin/subcollection/src/java/org/apache/nutch/collection A src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java A

[jira] [Updated] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

2011-07-14 Thread Tim Pease (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Pease updated NUTCH-1052: - Summary: Multiple deletes of the same URL using SolrClean (was: Multiple delete of the same URL using So

[jira] [Created] (NUTCH-1052) Multiple delete of the same URL using SolrClean

2011-07-14 Thread Tim Pease (JIRA)
Multiple delete of the same URL using SolrClean --- Key: NUTCH-1052 URL: https://issues.apache.org/jira/browse/NUTCH-1052 Project: Nutch Issue Type: Improvement Components: indexer Af

[Nutch Wiki] Update of "NutchTutorial" by JoeLencioni

2011-07-14 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The "NutchTutorial" page has been changed by JoeLencioni: http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=38&rev2=39 Comment: adding crucial parse command Now the datab

[Nutch Wiki] Update of "NutchTutorial" by JoeLencioni

2011-07-14 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The "NutchTutorial" page has been changed by JoeLencioni: http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=37&rev2=38 Comment: changing bin/nutch index to bin/nutch solrindex

[Nutch Wiki] Update of "NutchTutorial" by JoeLencioni

2011-07-14 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The "NutchTutorial" page has been changed by JoeLencioni: http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=36&rev2=37 Comment: Removing pre 1.3 stuff Try the following c

[Nutch Wiki] Update of "NutchTutorialPre1.3" by JoeLencioni

2011-07-14 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The "NutchTutorialPre1.3" page has been changed by JoeLencioni: http://wiki.apache.org/nutch/NutchTutorialPre1.3?action=diff&rev1=36&rev2=37 Comment: Removing references to < 1.3 or >= 1

[Nutch Wiki] Update of "NutchTutorial" by JoeLencioni

2011-07-14 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The "NutchTutorial" page has been changed by JoeLencioni: http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=35&rev2=36 + 'This tutorial deals with Nutch 1.3. For older vers

Re: HTTPS support

2011-07-14 Thread Julien Nioche
http://www.google.co.uk/search?q=nutch+mailing+list -> 1st result On 14 July 2011 16:50, Zanzico Gioele wrote: > how can i be deleted from this mailing list pls ? > > tks > ciao > gioele > > Gioele Zanzico > Senior Web Analyst > Vitec Group Imaging & Staging Division > Direct Line: +39 0424

R: HTTPS support

2011-07-14 Thread Zanzico Gioele
how can i be deleted from this mailing list pls ? tks ciao gioele Gioele Zanzico Senior Web Analyst Vitec Group Imaging & Staging Division Direct Line: +39  042407   Vitec Group Imaging & Staging Division, Via Sasso Rosso 19, I-36061 Bassano del Grappa (VI), Italy T +39 0424 555 855 F +39

Re: Normalize and filter hyperlinks during parse

2011-07-14 Thread Markus Jelsma
To be honest, I am not. But when reasoning, why would we filter and normalize everywhere when it's already done in parsing. ... tested.. I injected a .nl url, generated and fetched. Then I modified urlfilter to deny everything, did a parse and modified filter again to allow .nl pages. I update

Re: Normalize and filter hyperlinks during parse

2011-07-14 Thread Julien Nioche
Are you sure we don't already filter and normalize at the end of the parse? (not in front of code - sorry can't check) On 14 July 2011 16:37, Markus Jelsma wrote: > Hi, > > If we filter and normalize hyperlinks in the parse job, we wouldn't have to > filter and normalize during all other jobs

[jira] [Commented] (NUTCH-914) Implement Apache Project Branding Requirements

2011-07-14 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065370#comment-13065370 ] Lewis John McGibbney commented on NUTCH-914: OK I have addressed points 2, 3 an

Re: Normalize and filter hyperlinks during parse

2011-07-14 Thread Markus Jelsma
There is a significant downside to filtering and normalizing in the parse job: you lose the original information. But for those to whom that's not important (i.e. you don't change filters or normalizers often) and for whom it takes a lot of CPU cycles, this would, I guess, be a very nice optional feature. On Thursd

Re: Normalize and filter hyperlinks during parse

2011-07-14 Thread lewis john mcgibbney
This is quite true, Markus. This had actually occurred to me whilst I was updating the command line options. Initially I was questioning why it would be necessary to pass -normalize arguments when trying to merge crawldb or segments. It would also provide more value when trying to create the linkdb

[jira] [Created] (NUTCH-1051) Export WebGraph node scores for solr.ExternalFileField

2011-07-14 Thread Markus Jelsma (JIRA)
Export WebGraph node scores for solr.ExternalFileField -- Key: NUTCH-1051 URL: https://issues.apache.org/jira/browse/NUTCH-1051 Project: Nutch Issue Type: Improvement Reporter:

Re: HTTPS support

2011-07-14 Thread Markus Jelsma
Well, the Protocol-httpclient code needs to be upgraded and rewritten. If someone can provide a patch we can test and hopefully include the code. On Thursday 14 July 2011 14:02:11 Matthew Painter wrote: > I see on the wiki that HTTPS is supported by protocol-httpclient but not > protocol-http. >

Re: Real-time Solr integration

2011-07-14 Thread Markus Jelsma
On Thursday 14 July 2011 15:03:34 Julien Nioche wrote: > Have been thinking about this again. We could make it so that the indexer does > not necessarily require a linkDB: some people are not particularly > interested in getting the anchors. At the moment you have to have a linkDB. > > This would

Normalize and filter hyperlinks during parse

2011-07-14 Thread Markus Jelsma
Hi, If we filter and normalize hyperlinks in the parse job, we wouldn't have to filter and normalize during all other jobs (perhaps except injector). This would spare a lot of CPU time for updating crawl and link db. It would also, I think, help the WebGraph as it operates on segments' ParseDat
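
A minimal sketch of the idea in this post, assuming a simplified model rather than Nutch's actual plugin interfaces: each outlink is normalized and filtered once, at parse time, so later jobs can consume the stored links as they are. The names `normalize`, `accept`, `parse_outlinks` and the `.nl` suffix rule are hypothetical illustrations (the suffix echoes the test described later in the thread); they are not Nutch APIs.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical default ports used to drop redundant ":80" / ":443" suffixes.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(url):
    """Simplified normalizer: lowercase scheme and host, drop the fragment
    and default ports, and make sure the path is never empty.
    User info and IPv6 bracket handling are deliberately left out."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def accept(url, allowed_suffixes=(".nl",)):
    """Simplified filter: keep only URLs whose host ends with an allowed suffix."""
    host = urlsplit(url).hostname or ""
    return host.endswith(allowed_suffixes)


def parse_outlinks(raw_links):
    """Normalize and filter outlinks once, during the (simulated) parse step."""
    normalized = (normalize(link) for link in raw_links)
    return [url for url in normalized if accept(url)]
```

Because only the processed form is stored, the raw links are no longer available afterwards, which is the trade-off raised in the replies to this thread.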

[jira] [Assigned] (NUTCH-1050) Add segmentDir option to WebGraph

2011-07-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-1050: Assignee: Markus Jelsma > Add segmentDir option to WebGraph > -

[jira] [Updated] (NUTCH-1050) Add segmentDir option to WebGraph

2011-07-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1050: - Description: One must either merge segments first or script around the lack of a segmentDir opti

[jira] [Updated] (NUTCH-1050) Add segmentDir option to WebGraph

2011-07-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1050: - Attachment: NUTCH-1050-1.4-1.patch Patch for 1.4. > Add segmentDir option to WebGraph >

[jira] [Created] (NUTCH-1050) Add segmentDir option to WebGraph

2011-07-14 Thread Markus Jelsma (JIRA)
Add segmentDir option to WebGraph - Key: NUTCH-1050 URL: https://issues.apache.org/jira/browse/NUTCH-1050 Project: Nutch Issue Type: Improvement Affects Versions: 1.3 Reporter: Markus Jelsma

Re: Real-time Solr integration

2011-07-14 Thread Matthew Painter
This is what I was thinking also previously :) It would seem sensible to have the option. I definitely have use cases where the links are not important. On Thu, Jul 14, 2011 at 2:03 PM, Julien Nioche < lists.digitalpeb...@gmail.com> wrote: > Have been thinking about this again. We could make it so

Re: Real-time Solr integration

2011-07-14 Thread Julien Nioche
Have been thinking about this again. We could make it so that the indexer does not necessarily require a linkDB: some people are not particularly interested in getting the anchors. At the moment you have to have a linkDB. This would make it a bit simpler (and quicker) to index within a crawl iterati
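
The proposal can be sketched roughly as follows, assuming a toy model in which the linkDB is a plain mapping from URL to anchor texts. `build_document` and its field names are hypothetical and are not the Nutch indexer API; the point is only that anchors are added when a linkDB is supplied and skipped otherwise.

```python
def build_document(url, parse_text, linkdb=None):
    """Build an index document; anchors are included only if a linkDB is given.

    `linkdb` is assumed to be a dict mapping a URL to a list of anchor texts.
    """
    doc = {"url": url, "content": parse_text}
    if linkdb is not None:
        anchors = linkdb.get(url, [])
        if anchors:
            doc["anchor"] = list(anchors)
    return doc
```

With the argument optional, a crawl iteration that does not care about anchors could index straight after parsing, without first building or updating the linkDB.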

HTTPS support

2011-07-14 Thread Matthew Painter
I see on the wiki that HTTPS is supported by protocol-httpclient but not protocol-http. However, protocol-httpclient is not recommended for use ( https://issues.apache.org/jira/browse/NUTCH-990). Is there a plan for supporting HTTPS? Happy to help implement if possible :) Thanks Matt

[jira] [Updated] (NUTCH-1049) Add classes to bin/nutch

2011-07-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1049: - Attachment: NUTCH-1049-1.4-1.patch Patch for 1.4. > Add classes to bin/nutch > -

[jira] [Created] (NUTCH-1049) Add classes to bin/nutch

2011-07-14 Thread Markus Jelsma (JIRA)
Add classes to bin/nutch Key: NUTCH-1049 URL: https://issues.apache.org/jira/browse/NUTCH-1049 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma The following classes should be added to the

[jira] [Updated] (NUTCH-1037) Deduplicate anchors before indexing

2011-07-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1037: - Attachment: NUTCH-1037-1.4-3.patch Of course. Patch doesn't lowercase anchors and added config op
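
As an illustration only, and not the code in the attached patch, deduplicating anchors without lowercasing them might look like the sketch below. The function name `dedup_anchors` is hypothetical, and the configuration option mentioned in the comment is not modelled because its details are cut off in this snippet.

```python
def dedup_anchors(anchors):
    """Drop exact duplicate anchor texts, keeping first-seen order and
    the original casing ("Home" and "home" are treated as different)."""
    seen = set()
    unique = []
    for anchor in anchors:
        if anchor not in seen:
            seen.add(anchor)
            unique.append(anchor)
    return unique
```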