Re: Nutch Site

2013-06-18 Thread Julien Nioche
Hi Lewis,

Brilliant! Thanks a lot

Julien


On 18 June 2013 05:32, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

 Hi All,
 @Julien,
 A while ago you mentioned changing the Nutch site to be more direct
 towards Downloads. I agreed with this but as I didn't deal with it then and
 there, it got put to the bottom of my TODO.
 Anyway, today I got around to it and our site is now more directly linked
 to the Downloads page.
 In time I will inevitably migrate our site and documentation over to the
 Apache CMS, but for the meantime this will do I suppose.
 You'll now notice that in the feed strip the Download link is right up
 there.
 Thanks for now
 Lewis

 --
 *Lewis*




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


[jira] [Updated] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-06-18 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1527:
-

Attachment: NUTCH-1527.patch

Ok, here's a new patch. If you set elastic.host (elastic.port defaults to 
9300), TransportClient will be used. elastic.cluster remains mandatory.

Properties are now also available in nutch-default.

Please comment and report issues.

Thanks
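
A minimal sketch of the selection rule described above, assuming the settings are read from a plain java.util.Properties object. The class ElasticSettingsSketch and the Mode enum are hypothetical names for illustration, not code from the attached patch. The message does not say which client is used when elastic.host is unset, so that case is simply labelled OTHER here.

```java
import java.util.Properties;

/** Hypothetical illustration only; not the code from NUTCH-1527.patch. */
class ElasticSettingsSketch {
  /** TRANSPORT when elastic.host is set; OTHER is a placeholder for the unset case. */
  enum Mode { TRANSPORT, OTHER }

  final String cluster;
  final String host; // null when elastic.host is unset
  final int port;
  final Mode mode;

  ElasticSettingsSketch(Properties conf) {
    // elastic.cluster is mandatory in every case.
    cluster = conf.getProperty("elastic.cluster");
    if (cluster == null || cluster.isEmpty()) {
      throw new IllegalArgumentException("elastic.cluster is mandatory");
    }
    String h = conf.getProperty("elastic.host");
    host = (h == null || h.isEmpty()) ? null : h;
    // elastic.port falls back to 9300 when it is not set.
    port = Integer.parseInt(conf.getProperty("elastic.port", "9300"));
    mode = (host != null) ? Mode.TRANSPORT : Mode.OTHER;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty("elastic.cluster", "nutch");
    conf.setProperty("elastic.host", "localhost");
    ElasticSettingsSketch s = new ElasticSettingsSketch(conf);
    System.out.println(s.mode + " " + s.host + ":" + s.port);
  }
}
```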

 Port nutch-elasticsearch-indexer to Nutch
 -

 Key: NUTCH-1527
 URL: https://issues.apache.org/jira/browse/NUTCH-1527
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Lewis John McGibbney
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 2.4

 Attachments: NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, 
 NUTCH-1527.patch, NUTCH-1527.patch


 The source repos for this can be found here [0].
 This issue should be in line with the work already done by Julien and others 
 over at NUTCH-1047.
 [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (NUTCH-1583) Headings does not support multiValued headings

2013-06-18 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-1583:


 Summary: Headings does not support multiValued headings
 Key: NUTCH-1583
 URL: https://issues.apache.org/jira/browse/NUTCH-1583
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.8


Headings can now support multiple values since NUTCH-1560 and NUTCH-1467.


[jira] [Updated] (NUTCH-1583) Headings does not support multiValued headings

2013-06-18 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1583:
-

Attachment: NUTCH-1583.patch

Patch for trunk. If headings.multivalued=true multiple values will be recorded 
and indexed. Default is false to preserve current behaviour.

Comments?
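
A rough sketch of what the headings.multivalued flag could mean for one heading tag. It assumes that the "current behaviour" keeps only the first heading found; that is an assumption, not something stated in this message. HeadingsSketch and select are hypothetical names, not taken from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration only; not the code from NUTCH-1583.patch. */
class HeadingsSketch {
  /**
   * Returns the heading values to record for one tag (for example h1).
   * With multiValued=false only the first value is kept (assumed current behaviour).
   */
  static List<String> select(List<String> found, boolean multiValued) {
    if (found.isEmpty()) {
      return Collections.emptyList();
    }
    if (multiValued) {
      return new ArrayList<>(found);
    }
    return Collections.singletonList(found.get(0));
  }

  public static void main(String[] args) {
    List<String> h2 = Arrays.asList("Intro", "Usage", "License");
    System.out.println(select(h2, false));
    System.out.println(select(h2, true));
  }
}
```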


[jira] [Updated] (NUTCH-1475) Index-More Plugin -- A better fall back value for date field

2013-06-18 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1475:
---

Attachment: NUTCH-1475-trunk-v1.patch

Why not rely first on CrawlDatum's modifiedTime? See patch.

 Index-More Plugin -- A better fall back value for date field
 

 Key: NUTCH-1475
 URL: https://issues.apache.org/jira/browse/NUTCH-1475
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1, 1.5.1
 Environment: All
Reporter: James Sullivan
Priority: Minor
  Labels: index-more, plugins
 Fix For: 2.3, 1.8

 Attachments: index-more-1xand2x.patch, index-more-2x.patch, 
 index-more-2x.patch, NUTCH-1475-trunk-v1.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 Among other fields, the index-more plugin for Nutch 2.x provides a last 
 modified and a date field for the Solr index. The last modified field holds 
 the last modified date from the HTTP headers if available; otherwise it is 
 left empty. Currently, the date field is the same as the last modified field 
 unless that field is empty, in which case getFetchTime is used as a fallback. 
 I think getFetchTime is not a good fallback, as it is the next fetch time and 
 often a month or more in the future, which doesn't make sense for the date 
 field. Users do not expect webpages/documents with future dates. A more 
 sensible fallback would be the current date at the time of indexing. 
 This is possible by simply changing line 97 of 
 https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
  from
 time = page.getFetchTime(); // use fetch time
 to
 time = new Date().getTime();
 Users interested in the getFetchTime value can still get it from the tstamp 
 field.
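
 The fallback order discussed in this issue can be sketched as below. This is
 an illustration under assumptions, not the committed change: DateFallbackSketch
 and chooseDate are hypothetical names, times are plain epoch milliseconds, and
 a value of 0 or less stands for "not available". The middle step (a stored
 modified time) follows the suggestion in the comment on this issue rather than
 the original description.

```java
/** Hypothetical illustration only; not the code committed for NUTCH-1475. */
class DateFallbackSketch {
  /**
   * Picks the value for the date field.
   * All arguments are epoch milliseconds; a value <= 0 means "not available".
   */
  static long chooseDate(long lastModifiedHeader, long modifiedTime, long now) {
    if (lastModifiedHeader > 0) {
      return lastModifiedHeader; // Last-Modified from the HTTP headers
    }
    if (modifiedTime > 0) {
      return modifiedTime; // modified time stored by the crawler, if any
    }
    return now; // current time at indexing, never a future fetch time
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    System.out.println(chooseDate(0L, 0L, now) == now);
  }
}
```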


[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-06-18 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686830#comment-13686830
 ] 

lufeng commented on NUTCH-1527:
---

Thanks Markus, I tried the patch and can index documents successfully. +1 for 
commit.


Re: Nutch Site

2013-06-18 Thread Mattmann, Chris A (398J)
Woot you da man Lewis

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-Original Message-
From: Julien Nioche lists.digitalpeb...@gmail.com
Reply-To: dev@nutch.apache.org dev@nutch.apache.org
Date: Tuesday, June 18, 2013 1:01 AM
To: dev@nutch.apache.org dev@nutch.apache.org
Subject: Re: Nutch Site

Hi Lewis,

Brilliant! Thanks a lot

Julien

On 18 June 2013 05:32, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:

Hi All,

@Julien,

A while ago you mentioned changing the Nutch site to be more direct
towards Downloads. I agreed with this but as I didn't deal with it then
and there, it got put to the bottom of my TODO.

Anyway, today I got around to it and our site is now more directly linked
to the Downloads page.

In time I will inevitably migrate our site and documentation over to the
Apache CMS, but for the meantime this will do I suppose.
You'll now notice that in the feed strip the Download link is right up
there.

Thanks for now
Lewis

-- 
Lewis

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


[jira] [Resolved] (NUTCH-1475) Index-More Plugin -- A better fall back value for date field

2013-06-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1475.
-

Resolution: Fixed

Committed @revision 1494234 in trunk.
Thank you [~wastl-nagel] for the final patch.


[jira] [Commented] (NUTCH-1475) Index-More Plugin -- A better fall back value for date field

2013-06-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13687079#comment-13687079
 ] 

Hudson commented on NUTCH-1475:
---

Integrated in Nutch-trunk #2245 (See 
[https://builds.apache.org/job/Nutch-trunk/2245/])
NUTCH-1475 Index-More Plugin -- A better fall back value for date field 
(Revision 1494234)

 Result = SUCCESS
lewismc : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1494234
Files : 
* /nutch/trunk/CHANGES.txt
* 
/nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java


Jenkins build is back to normal : Nutch-trunk #2245

2013-06-18 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/2245/changes



[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-06-18 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13687176#comment-13687176
 ] 

Lewis John McGibbney commented on NUTCH-1527:
-

Hi Markus, the attached patch also includes your boilerpipe stuff ;) I am 
reverting those parts on the patch and trying it out right now.


[jira] [Updated] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-06-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1527:


Attachment: NUTCH-1527v2.patch

New patch removing your Boilerpipe stuff Markus.
I am +1 for this to go in there. We keep index-solr as default for the time 
being and push the RC with this included.
It is a real nice addition to the release candidate.

 Port nutch-elasticsearch-indexer to Nutch
 -

 Key: NUTCH-1527
 URL: https://issues.apache.org/jira/browse/NUTCH-1527
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Lewis John McGibbney
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 2.4

 Attachments: NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, 
 NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527v2.patch


 The source repos for this can be found here [0].
 This issue should be in line with the work already done by Julien and others 
 over at NUTCH-1047.
 [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer


[jira] [Created] (NUTCH-1584) Port NUTCH-1405 Allow to overwrite CrawlDatum's with injected entries to 2.x

2013-06-18 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1584:
---

 Summary: Port NUTCH-1405 Allow to overwrite CrawlDatum's with 
injected entries to 2.x
 Key: NUTCH-1584
 URL: https://issues.apache.org/jira/browse/NUTCH-1584
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb, injector
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 2.3


I was recently curious about what happens in 2.x when we inject similar but not 
identical seed lists in order to bootstrap a system.
I started looking about and found NUTCH-1405.
I think it would be great to port this concept to 2.x.
This issue should do exactly that.


[jira] [Assigned] (NUTCH-1585) Ensure duplicate tags do not exist in microformat-reltag tag set.

2013-06-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-1585:
---

Assignee: Lewis John McGibbney

 Ensure duplicate tags do not exist in microformat-reltag tag set.
 -

 Key: NUTCH-1585
 URL: https://issues.apache.org/jira/browse/NUTCH-1585
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6, 2.2
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1585-2.x.patch, NUTCH-1585-trunk.patch


 A WebPage can have many, many embedded tags and other such markup.
 Creating huge tag lists containing many, many duplicates is counterproductive 
 to the process of parsing and extracting out such structure.
 We should add a mechanism to only include single tag occurrences for the 
 microformats-reltag parser.  


[jira] [Updated] (NUTCH-1585) Ensure duplicate tags do not exist in microformat-reltag tag set.

2013-06-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1585:


Attachment: NUTCH-1585-trunk.patch
NUTCH-1585-2.x.patch

Patches for trunk and 2.x.
Simply check if the tag exists in the set. If it doesn't, then add it.
I suppose this is difficult/expensive if the set is huge; however, by doing this 
check, the set is logically much, much smaller than it would be otherwise. 
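
The check described above can be sketched as follows. RelTagDedupSketch and uniqueTags are hypothetical names and the code is not taken from the attached patches. It assumes the tags are plain strings and uses a LinkedHashSet, whose membership test is a hash lookup, so the cost per tag stays small even when the set grows.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; not the code from the NUTCH-1585 patches. */
class RelTagDedupSketch {
  /** Collects each tag once, keeping the order of first occurrence. */
  static Set<String> uniqueTags(List<String> rawTags) {
    Set<String> tags = new LinkedHashSet<>();
    for (String tag : rawTags) {
      // Explicit check to mirror the description; Set.add alone would also skip duplicates.
      if (!tags.contains(tag)) {
        tags.add(tag);
      }
    }
    return tags;
  }

  public static void main(String[] args) {
    System.out.println(uniqueTags(Arrays.asList("nutch", "search", "nutch", "crawler", "search")));
  }
}
```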
