Re: "nutch-site.xml" not robust

2012-06-08 Thread Andy Xue
Hi Lewis: Sorry for the delay. Sure, I'll open a ticket in a bit. Regards Andy On 7 June 2012 21:28, Lewis John Mcgibbney wrote: > Hi Andy, > Even opening a ticket and getting it logged would be great. > Thanks > Lewis > > On Wed, Jun 6, 2012 at 3:53 AM, Andy Xue wrote: > > Hi Lewis: > > > > I

Re: publishedDate and feed plugin

2012-06-08 Thread Shameema Umer
Hi Lewis, things are clear, but I am upset that I cannot find a way to determine the age of a web page with Nutch. I thought publishedDate from the feed plugin would help. If I change the field name from publishedDate to *pubDate*, will this help? Thanks Shameema On Fri, Jun 8, 2012 at 6:48 PM, Lew

Re: Building Lucene index with Nutch 1.4

2012-06-08 Thread Emre Çelikten
Hello, I have done as you asked. I hope I have done it correctly as this was my first patch. Here's the issue: https://issues.apache.org/jira/browse/NUTCH-1382 Here's a tutorial for people that might be interested: http://cmusphinx.sourceforge.net/2012/06/building-a-java-application-with-apache-n

VOTE Apache Nutch 2.0 RC1

2012-06-08 Thread lewis john mcgibbney
Good Evening Everyone, A candidate for the Apache Nutch 2.0 RC1 is available at: http://people.apache.org/~lewismc/nutch-2.0 The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-2.0rc1 Further, a st

Re: [ANNOUNCE] Apache Nutch 1.5 Released

2012-06-08 Thread Mattmann, Chris A (388J)
Agreed thanks Lewis! Cheers, Chris On Jun 8, 2012, at 1:22 AM, Julien Nioche wrote: > Thanks Lewis! > > On 7 June 2012 17:52, lewis john mcgibbney wrote: > >> (apologies for cross posting...) >> >> Good Afternoon Everyone, >> >> The 1.5 release of Nutch is now available. This release includ

Re: URL filtering and normalization

2012-06-08 Thread Bai Shen
I'm attempting to filter during generation. I removed the noFilter and noNorm flags from my generate job. I have around 10M records in my crawldb. The generate job has been running for several days now. Is there a way to check its progress and/or make sure it's not hung? Also, is there a

Re: publishedDate and feed plugin

2012-06-08 Thread Lewis John Mcgibbney
Hi, No. This should not be necessary. The feed parser and accompanying indexingfilter should extract and send (to be indexed) the following metadata items: Author, Tags, Published date, Updated date and feed. There is a problem though... With many feeds, including the bbci one you provided in ano

Re: Building Lucene index with Nutch 1.4

2012-06-08 Thread Lewis John Mcgibbney
Hi Emre, Even if you were to open a Jira issue for this and submit a patch of your hack it would be excellent to have the code available to the community. All the best, oh and glad you got your application working. Lewis On Fri, Jun 8, 2012 at 4:22 AM, Emre Çelikten wrote: > Hello again, > > I

Re: [ANNOUNCE] Apache Nutch 1.5 Released

2012-06-08 Thread Julien Nioche
Thanks Lewis! On 7 June 2012 17:52, lewis john mcgibbney wrote: > (apologies for cross posting...) > > Good Afternoon Everyone, > > The 1.5 release of Nutch is now available. This release includes > several improvements including upgrades of several major components > including Tika 1.1 and Hado