Re: [VOTE] Apache Nutch 1.7 Release Candidate

2013-06-21 Thread Mattmann, Chris A (398J)
+1 from me: SIGS pass:
bash-3.2$ /Users/mattmann/bin/verify_gpg_sigs
Verifying Signature for file apache-nutch-1.7-bin.tar.gz.asc
gpg: Signature made Thu Jun 20 14:14:30 2013 PDT using RSA key ID BEF70CB4
gpg: Good signature from "Lewis John McGibbney (CODE SIGNING KEY)"
gpg: WARNING: This key i
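The contents of the verify_gpg_sigs script are not shown in the message, so the following is only a minimal sketch of how such a check could be scripted. It assumes detached `.asc` signatures sitting next to the artifacts and a `gpg` binary on the PATH; the function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def gpg_verify_command(signature: Path) -> List[str]:
    """Build the `gpg --verify` command for a detached .asc signature."""
    if signature.suffix != ".asc":
        raise ValueError(f"expected a .asc signature file, got {signature}")
    # Assumption: the signed artifact has the same name minus the .asc suffix.
    artifact = signature.with_suffix("")
    return ["gpg", "--verify", str(signature), str(artifact)]


def verify(signature: Path) -> bool:
    """Run gpg and report whether verification succeeded (exit code 0)."""
    result = subprocess.run(
        gpg_verify_command(signature), capture_output=True, text=True
    )
    return result.returncode == 0
```

Calling `verify(Path("apache-nutch-1.7-bin.tar.gz.asc"))` would run gpg against the release tarball, provided the signing key has already been imported.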

Re: Slow parse on hadoop

2013-06-21 Thread Lewis John Mcgibbney
Thanks Jason for posting this to user@ list. For those using Cassandra (1.1.2) and gora-cassandra please patch your copy of Nutch 2.x with Jason's patch. It would be real great if we could get some feedback on this as I am of the opinion that it certainly justifies a point oh release for 2.x. Thank

Slow parse on hadoop

2013-06-21 Thread Jason Howes
I've hit the same issue when testing Nutch with Cassandra. It appears that it's due to the way Nutch converts between ByteBuffers and Strings (or streams thereof). I filed the following JIRA issue and attached a patch to it. Hopefully this will resolve the issue for you as well. https://issues.

Re: confusion over fetch schedule

2013-06-21 Thread Joe Zhang
Thanks.
On Fri, Jun 21, 2013 at 8:52 PM, Tejas Patil wrote:
> I just checked the current code and it seems to me that lastModified (aka "Modified time" in CrawlDatum class) is not used for any further logic. If you want to customize the fetch interval for a subset of pages, do as Lewis s

Re: confusion over fetch schedule

2013-06-21 Thread Tejas Patil
I just checked the current code and it seems to me that lastModified (aka "Modified time" in the CrawlDatum class) is not used for any further logic. If you want to customize the fetch interval for a subset of pages, do as Lewis suggested, i.e. specify a customized fetch interval for the main pages in

Re: confusion over fetch schedule

2013-06-21 Thread Joe Zhang
Thanks, guys. So, just to confirm, lastModified is not used in the fetching logic at all. Ideally, it should take higher priority than the default interval. This is particularly important for sites such as cnn.com, where the leaf page doesn't really change, but the portal page is updated all the t

Re: confusion over fetch schedule

2013-06-21 Thread Lewis John Mcgibbney
Hi Joe, In 1.x Markus and Julien IIRC committed a real nice patch a while back which allows you to achieve what I think you are after. Please look at this thread http://www.mail-archive.com/user@nutch.apache.org/msg08738.html You will find piles of stuff on the user archive about this kinda granula

Re: confusion over fetch schedule

2013-06-21 Thread Tejas Patil
On Fri, Jun 21, 2013 at 7:07 PM, Joe Zhang wrote:
> Sorry, Nutch is certainly aware of page modification, and it does capture lastModified.
Nutch does capture the "last modified" field, but I am not sure whether its value is used afterwards. I remember that it was not being used for any logic in older v

Re: A bug in the crawl secript in Nutch 1.6

2013-06-21 Thread Tejas Patil
Thanks Joe for pointing it out. There was a JIRA issue [0] for this bug and the change is already present in the trunk.
[0] : https://issues.apache.org/jira/browse/NUTCH-1500
On Fri, Jun 21, 2013 at 7:11 PM, Joe Zhang wrote:
> The new crawl script is quite useful. Thanks for the addition.
> It com

A bug in the crawl secript in Nutch 1.6

2013-06-21 Thread Joe Zhang
The new crawl script is quite useful. Thanks for the addition. It comes with a bug, though:
Line 169:
$bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $SEGMENT
should be:
$bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segm

Re: confusion over fetch schedule

2013-06-21 Thread Joe Zhang
Sorry, Nutch is certainly aware of page modification, and it does capture lastModified. The real question is, can Nutch get lastModified of a page before fetching, and use it to make fetching decisions (e.g., whether or not to override the default interval)? On Fri, Jun 21, 2013 at 6:27 PM, Joe Z
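One generic way to learn whether a page changed without downloading it again is an HTTP conditional request. The sketch below is not Nutch code and makes no claim about what Nutch does; it only shows how an `If-Modified-Since` header could be built from a stored fetch time, so that a server supporting it can answer 304 Not Modified. The function name is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def conditional_request(url: str, last_fetch: datetime) -> Request:
    """Build a GET that lets the server reply 304 if the page is unchanged.

    `last_fetch` must be timezone-aware; it is converted to UTC because
    HTTP dates are always expressed in GMT.
    """
    if last_fetch.tzinfo is None:
        raise ValueError("last_fetch must be timezone-aware")
    stamp = format_datetime(last_fetch.astimezone(timezone.utc), usegmt=True)
    return Request(url, headers={"If-Modified-Since": stamp})
```

Opening the request with `urllib.request.urlopen` raises `HTTPError` with code 304 when the server reports no change, which a scheduler could treat as "keep the old copy".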

confusion over fetch schedule

2013-06-21 Thread Joe Zhang
If I don't change the default value of db.fetch.interval.default, which is 30 days, does it mean that the URL in the db won't be refetched before the due time even if it has been modified? In other words, is Nutch aware of page modification?
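The behaviour being asked about can be pictured with a small interval-only model. This is a sketch, not Nutch's scheduler: `CrawlEntry` and `is_due` are hypothetical names, and the 30-day figure is the default quoted in the question. The per-entry interval illustrates the idea of giving some pages a shorter interval than others.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The default interval mentioned in the question (30 days).
DEFAULT_FETCH_INTERVAL = timedelta(days=30)


@dataclass
class CrawlEntry:
    """Hypothetical stand-in for a crawl db record (not Nutch's CrawlDatum)."""
    url: str
    last_fetch: datetime
    fetch_interval: timedelta = DEFAULT_FETCH_INTERVAL


def is_due(entry: CrawlEntry, now: datetime) -> bool:
    """In this model a URL is refetched only once its interval has elapsed,
    regardless of whether the page changed in the meantime."""
    return now >= entry.last_fetch + entry.fetch_interval
```

A portal page could be given `fetch_interval=timedelta(hours=1)` while leaf pages keep the default, which makes it due again long before the 30 days are up.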

Re: Inconsistencies in use of ParseStatus in 2.x

2013-06-21 Thread Lewis John Mcgibbney
Forget this. I am tripping, and the low counters were directly related to NUTCH-1591.
Sorry,
Lewis
On Wed, Jun 19, 2013 at 5:04 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> Hi,
> We define the structure of ParseStatus [0] in our WebPage JSON schema [1].
> All good so far.
> Wh

Re: Get HTML content generated by Javascript

2013-06-21 Thread Julien Nioche
One way around this is to have a custom protocol implementation and get it to fetch via Selenium.
J.
On 21 June 2013 19:54, Lewis John Mcgibbney wrote:
> Hi,
> Nearly all of this page is generated by JS right?
> Right now my answer is no. We fetch then parse page source... which in this case is

Re: [VOTE] Apache Nutch 1.7 Release Candidate

2013-06-21 Thread Julien Nioche
Hi Lewis, I don't think you got my comment. I was saying that "won't fix" is the right resolution for these issues, but that the report should not include them. In the case of these issues, people might get the wrong idea by looking at the report and think that we included the mongodb related stuff in the

Re: Get HTML content generated by Javascript

2013-06-21 Thread Lewis John Mcgibbney
Hi,
Nearly all of this page is generated by JS, right? Right now my answer is no. We fetch then parse page source... which in this case is mostly all JS. The magic happens in the browser. ...
Lewis
On Tue, Jun 18, 2013 at 10:59 PM, Deals Collect wrote:
> Hi all,
> Can Nutch get the HTML content

Re: [VOTE] Apache Nutch 1.7 Release Candidate

2013-06-21 Thread Lewis John Mcgibbney
Thanks Markus. This is good news.
On Fri, Jun 21, 2013 at 11:44 AM, Markus Jelsma wrote:
> Sigs checked out for tgz!
> -Original message-
> From: Lewis John Mcgibbney
> Sent: Friday 21st June 2013 20:41
> To: user@nutch.apache.org
> Cc: d...@nutch.apache.org
> Subject: Re: [VOTE] Apach

RE: [VOTE] Apache Nutch 1.7 Release Candidate

2013-06-21 Thread Markus Jelsma
Sigs checked out for tgz!
-Original message-
From: Lewis John Mcgibbney
Sent: Friday 21st June 2013 20:41
To: user@nutch.apache.org
Cc: d...@nutch.apache.org
Subject: Re: [VOTE] Apache Nutch 1.7 Release Candidate
Hi Julien, Done, thanks for the attention to detail. I wonder if you got to

Re: Nutch 2.x with HBase backend errors

2013-06-21 Thread Lewis John Mcgibbney
On Fri, Jun 21, 2013 at 11:40 AM, Tony Mullins wrote:
> Thanks guys for your help & support.
No hassle, great to have you poking around and using the software. We know there is work to be done. Thank you.
> I'll try it now with HBase 0.90.x.
Let us know how you get on.
> (In .net world

Re: [VOTE] Apache Nutch 1.7 Release Candidate

2013-06-21 Thread Lewis John Mcgibbney
Hi Julien,
Done, thanks for the attention to detail. I wonder if you got to check sigs as well? I have been dancing between machines and it would be excellent to verify. Thank you v much.
Lewis
On Fri, Jun 21, 2013 at 1:47 AM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
> Hi Lewis

Re: Nutch 2.x with HBase backend errors

2013-06-21 Thread Tony Mullins
Thanks guys for your help & support. I'll try it now with HBase 0.90.x. (In the .NET world latest is greatest; it seems that's not the case here :) )
thanks,
Tony
On Fri, Jun 21, 2013 at 11:11 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> Hi Tony,
> The second bullet point on the tut

Re: Nutch 2.x with HBase backend errors

2013-06-21 Thread Lewis John Mcgibbney
Hi Tony,
The second bullet point on the tutorial states that Gora works with the 0.90.X HBase branch (yes, this is old). It is known not to work with the 0.94.X branch. Please try with the 0.90 branch.
Thanks
Lewis
On Fri, Jun 21, 2013 at 8:12 AM, Tony Mullins wrote:
> Hi ,
> After getting some errors
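The compatibility note in this reply can be turned into a small pre-flight check before wiring Nutch 2.x to an HBase install. This is only a sketch: the two branch sets restate what the message says (0.90.x works, 0.94.x is known not to), anything else is reported as unknown, and the function names are hypothetical.

```python
# Branch status as stated in the message above; other branches are untested here.
SUPPORTED_BRANCHES = {"0.90"}
KNOWN_BROKEN_BRANCHES = {"0.94"}


def hbase_branch(version: str) -> str:
    """Reduce a full version such as '0.90.4' to its branch, '0.90'."""
    return ".".join(version.split(".")[:2])


def check_hbase(version: str) -> str:
    """Classify an HBase version as 'supported', 'known-broken' or 'unknown'."""
    branch = hbase_branch(version)
    if branch in SUPPORTED_BRANCHES:
        return "supported"
    if branch in KNOWN_BROKEN_BRANCHES:
        return "known-broken"
    return "unknown"
```

For example, `check_hbase("0.94.8")` reports the install described in the original question as known-broken, while `check_hbase("0.90.4")` reports supported.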

Re: Nutch 2.x with HBase backend errors

2013-06-21 Thread AC Nutch
Hi Tony,
See this thread. Also, I might politely add that you should do at least some basic searching before asking.
Alex
On Fri, Jun 21, 2013 at 2:01 PM, Tony Mullins wrote:
> In site
> http://wiki.apache

Re: Nutch 2.x with HBase backend errors

2013-06-21 Thread Tejas Patil
As mentioned in [0], use an older (0.90.x) version of HBase. Unfortunately, the HBase folks have removed the link from the downloads page. You can grab the source code from [1] and build it.
[0] : http://wiki.apache.org/nutch/Nutch2Tutorial
[1] : https://svn.apache.org/repos/asf/hbase/tags/0.90.4/
On

Re: Nutch 2.x with HBase backend errors

2013-06-21 Thread Tony Mullins
On the page http://wiki.apache.org/nutch/Nutch2Tutorial?action=show&redirect=GORA_HBase it is said that: N.B. It's possible to encounter the following exception: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration; this is caused by the fact that sometimes the hbase TEST jar is depl

Re: Why webPage.getContent().array() is returning html of all pages in seed.txt ?

2013-06-21 Thread Lewis John Mcgibbney
In short, yes, I think it is gora-cassandra that is the problem here. This is precisely the reason that I'm using the Cassandra backend, to try and root these bugs out.
On Friday, June 21, 2013, Jamshaid Ashraf wrote:
> Hi,
> I'm also facing the same issue with cassandra backend.
> Do you think

Nutch 2.x with HBase backend errors

2013-06-21 Thread Tony Mullins
Hi, After getting some errors with the Cassandra backend with Nutch 2.x, I am now trying HBase. I have installed HBase 0.94.8 and have also created a sample table in it. After following these links http://wiki.apache.org/nutch/RunNutchInEclipse http://wiki.apache.org/nutch/Nutch2Tutorial?action=show&re

Re: Synchronization & Consistency of data in ParseFilter and IndexingFIlter

2013-06-21 Thread Tony Mullins
OK, then I'll give HBase a try... but this is strange, as I think Lewis once said he is also using Cassandra in his Nutch setup.
Thanks,
Tony
On Fri, Jun 21, 2013 at 1:19 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
> HBase seems to be the one most widely used. I haven't followed

Re: Why webPage.getContent().array() is returning html of all pages in seed.txt ?

2013-06-21 Thread Jamshaid Ashraf
Hi,
I'm also facing the same issue with the Cassandra backend. Do you think that Cassandra is the reason for the repeated HTML returned in the parse job for the parsefilter plugin?
Regards,
Jamshaid
On Fri, Jun 21, 2013 at 1:18 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
> Tony,
> The plugin

Re: [VOTE] Apache Nutch 1.7 Release Candidate

2013-06-21 Thread Julien Nioche
Hi Lewis, The release notes [https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12323281] list issues marked as "won't fix", which is probably not a great idea. For instance, they list "Port nutch-mongodb-indexer to Nutch", which is a "won't fix", but people could get the impr

RE: [VOTE] Apache Nutch 1.7 Release Candidate

2013-06-21 Thread Markus Jelsma
Nice! Signatures are ok, everything builds and all tests pass. The pom does still point to 1.6-SNAPSHOT. Other than that: definite +1!
Cheers
-Original message-
From: lewis john mcgibbney
Sent: Friday 21st June 2013 0:33
To: d...@nutch.apache.org; user@nutch.apache.org
Subject: [VO

Re: Synchronization & Consistency of data in ParseFilter and IndexingFIlter

2013-06-21 Thread Julien Nioche
HBase seems to be the one most widely used. I haven't followed GORA lately but the MySQL one was unusable.
On 21 June 2013 07:17, Tony Mullins wrote:
> Then which backend is more stable & consistent with gora/nutch2.x?
> How about MySQL and HBase?
> Thanks,
> Tony
> On Fri, Jun 21, 2

Re: Why webPage.getContent().array() is returning html of all pages in seed.txt ?

2013-06-21 Thread Julien Nioche
Tony, The plugins directory contains quite a few examples of parsefilters e.g. http://svn.apache.org/viewvc/nutch/branches/2.1/src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java?view=markup I don't use 2.x and don't know how many people use Cassandra as