Re: Persistent Crawldb Checksum error

2011-12-05 Thread Lewis John Mcgibbney
Hi Danicela, Have a look here [1]. Although your problem is not directly linked to fetching, the symptoms and subsequent solution to the problem are the same. Unfortunately this is quite a messy one but will hopefully get you going in the right direction again. [1]

Re: problem with the Nutch and Hadoop Tutorial when starting to deploy Nutch to Single Machine

2011-12-05 Thread Lewis John Mcgibbney
Hi José, If you look at what is generated when you have built Nutch using ant runtime you will see correctly the runtime/local and runtime/deploy folders. To run in deploy mode, it is necessary to specify all of your nutch-site.xml (and any other configuration e.g. filters, plugins etc etc)

Re: error java.net.SocketTimeoutException: Read timed out in crawl with nutch?

2011-12-06 Thread Lewis John Mcgibbney
Hi Mina, You can pick this type of stuff up easier from the mailing lists [1]. It might save you some time rather than waiting for some replies from folks. Thanks [1] http://lucene.472066.n3.nabble.com/SocketTimeoutException-td604882.html On Mon, Dec 5, 2011 at 11:36 PM, mina

Re: The book Building Search Applications with Lucene and Nutch

2011-12-08 Thread Lewis John Mcgibbney
Not got a clue. One thing I must say is: be wary of any out-of-date code with these books. When reading about, I found the Lucene API to be somewhat different and outdated. I am not saying that it is the same with the book you quoted, but it definitely is with this one [1]. On the positive side,

Re: Trouble running solrindexer from Nutch 1.4

2011-12-08 Thread Lewis John Mcgibbney
Thanks Tim. In addition Chip, the tutorial has now been updated to include Tim's comments and to cover latest Nutch 1.4. Thanks Lewis On Wed, Dec 7, 2011 at 10:45 PM, Tim Pease tim.pe...@gmail.com wrote: On Dec 7, 2011, at 3:17 PM, Chip Calhoun wrote: This is probably just down to my not

Re: URLFilterChecker documentation

2011-12-09 Thread Lewis John Mcgibbney
Hi Remi Markus, Yeah, I can replicate this, good catch Remi. lewis@lewis-desktop:~/ASF/trunk/runtime/local$ bin/nutch org.apache.nutch.net.URLFilterChecker http://www.heraldscotland.com -filterName regex-urlfilter.txt Checking combination of all URLFilters available ^Z [2]+ Stopped

Re: URLFilterChecker documentation

2011-12-13 Thread Lewis John Mcgibbney
Hi, Can anyone confirm if this is an issue? If so I think we should log it before it goes unnoticed. Thanks Lewis On Fri, Dec 9, 2011 at 3:21 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: If you look at the output I posted, even when I specified a particular filter, the checkAll

Re: URLFilterChecker documentation

2011-12-13 Thread Lewis John Mcgibbney
confirm if this is an issue? If so I think we should log it before it goes unnoticed. Thanks Lewis On Fri, Dec 9, 2011 at 3:21 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: If you look at the output I posted, even when I specified a particular filter

Re: URLFilterChecker documentation

2011-12-13 Thread Lewis John Mcgibbney
, December 13, 2011, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Here's my output from URLFilterChecker [1] lewis@lewis-01:~/ASF/trunk/runtime/local$ bin/nutch org.apache.nutch.net.URLFilterChecker -filterName urlfilter-regex Exception in thread main java.lang.RuntimeException: Filter

Re: SolrIndex java.io.IOException: Job failed!

2011-12-14 Thread Lewis John Mcgibbney
Hi Remi, This is a compatibility issue with conflicting versions of Solrj [1] [1] http://lucene.472066.n3.nabble.com/Invalid-version-or-the-data-in-not-in-javabin-format-td1460495.html On Wed, Dec 14, 2011 at 1:57 PM, remi tassing tassingr...@gmail.com wrote: Hello guys, After crawling with

Re: Problem running Nutch on Win 7 + Cygwin

2011-12-15 Thread Lewis John Mcgibbney
Can anyone confirm this? If this is the case it would be great to get it fired on to the wiki for future reference. Thank you On Wed, Dec 14, 2011 at 10:48 PM, Whitman, John jwhit...@ea.com wrote: Hi Lewis – I believe I know what this issue was – I would bet that the user has set an

Re: Nutch Hadoop Optimization

2011-12-16 Thread Lewis John Mcgibbney
It looks like it's the parsing of these segments that is taking time... no? On Thu, Dec 15, 2011 at 9:57 PM, Bai Shen baishen.li...@gmail.com wrote: On Thu, Dec 15, 2011 at 12:47 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: This is overwhelmingly weighted towards Hadoop

Re: Crawl fails: Input path does not exist

2011-12-18 Thread Lewis John Mcgibbney
Hi Dean, What version are you on? On Sun, Dec 18, 2011 at 2:20 PM, Dean Pullen dean.pul...@gmail.com wrote: (can't access work email, so posting via this account!) I've tried absolutely everything to resolve this issue, and have scoured the web over the weekend in an attempt to rectify this

Re: Trouble building Nutch

2011-12-25 Thread Lewis John Mcgibbney
Hi Patrick, So you must remove the NUTCH_HOME/src which by default will be added as the src folder. Instead add NUTCH_HOME/src/java, every occurrence of NUTCH_HOME/src/plugin/plugin_name/src/java and NUTCH_HOME/src/plugin/plugin_name/src/test, then NUTCH_HOME/src/test/ hopefully you can follow the

Re: error in solrindex command in nutch 1.4

2011-12-27 Thread Lewis John Mcgibbney
Hi Mina, Can you please check out the page now, I've edited this and would like you to confirm this has been clarified. Thank you On Mon, Dec 26, 2011 at 9:08 PM, mina tahereganji...@gmail.com wrote: i can solve this problem. i read nutch doc for solrindex in:

Working with Twitter

2011-12-30 Thread Lewis John Mcgibbney
Hi, I'm interested in crawling twitter feeds and haven't tried any implementation yet. Does anyone know if this is possible? I haven't seen anything on our archives to suggest that people are having problems with this. Thanks and happy NY to everyone when it comes around. -- Lewis

Re: Working with Twitter

2011-12-31 Thread Lewis John Mcgibbney
Thanks guys. All the best when the bells come around. Lewis On Fri, Dec 30, 2011 at 7:06 PM, Ken Krugler kkrugler_li...@transpac.com wrote: Note that for polite and efficient fetching, you want to resolve shortened links first, then treat some set (e.g. over a 5-10 minute interval) as

Re: Download older versions of Nutch?

2012-01-04 Thread Lewis John Mcgibbney
Hi Guys, Just to confirm, this has been addressed and committed. You can see the changes here [1] [1] http://nutch.apache.org/old_downloads.html On Tue, Nov 29, 2011 at 6:29 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Tim, Thanks for your message. You can find

Re: parse data directory not found after merge

2012-01-06 Thread Lewis John Mcgibbney
John Mcgibbney wrote: Hi Dean, Depending on the size of the segments you're fetching, in most cases I would advise you to separate out fetching and parsing into individual steps. This becomes self-explanatory as your segments increase in size and the possibility of something going wrong

Re: parse data directory not found after merge

2012-01-06 Thread Lewis John Mcgibbney
Hi Dean, Without discussing any of your configuration properties can you please try 6) MERGE SEGMENTS: /opt/nutch_1_4/bin/nutch mergesegs /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir /opt/nutch_1_4/data/crawl/segments/* -filter -normalize paying attention to the wildcard /* in -dir

Re: parse data directory not found after merge

2012-01-06 Thread Lewis John Mcgibbney
: LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files And yes, your assumption was correct - it's a different segment directory each loop. Many thanks, Dean. On 06/01/2012 15:43, Lewis John Mcgibbney

Re: parse data directory not found after merge

2012-01-06 Thread Lewis John Mcgibbney
that the directories exist after fetching and parsing? On Fri, Jan 6, 2012 at 4:24 PM, Dean Pullen dean.pul...@semantico.com wrote: Good spot because all of that was meant to be removed! No, I'm afraid that's just a copy/paste problem. Dean On 06/01/2012 16:17, Lewis John Mcgibbney wrote: Ok then, How

Re: parse data directory not found after merge

2012-01-06 Thread Lewis John Mcgibbney
on ONE crawl also removes the parse_data dir etc! Dean. On 06/01/2012 16:28, Lewis John Mcgibbney wrote: How about merging segs after every subsequent iteration of the crawl cycle... surely this is a problem with producing the specific parse_data directory. If it doesn't work after two

Re: parse data directory not found after merge

2012-01-08 Thread Lewis John Mcgibbney
/nutch_invertlinks (However it is in the solrindex docs) Adding it makes no difference to invertlinks. I think the problem is definitely with mergesegs, as opposed to invertlinks etc. Thanks again, Dean. On 06/01/2012 17:53, Lewis John Mcgibbney wrote: OK so now I think we're at the bottom of it. If you

Re: parse data directory not found after merge

2012-01-09 Thread Lewis John Mcgibbney
/2012 14:26, Dean Pullen wrote: No Lewis, -linkdb was already being used for the solrindex command, so we still have the same problem. Many thanks, Dean On 08/01/2012 14:08, Lewis John Mcgibbney wrote: Hi dean is this sorted On Saturday, January 7, 2012, Dean Pullendean.pul

Re: parse data directory not found after merge

2012-01-09 Thread Lewis John Mcgibbney
How are you running Nutch local or deploy mode? Which hadoop versions are you using 0.20.2? This appears to be an open issue with this version [1]. Also please have a look here [2] for a similar frustrating situation. [1] https://issues.apache.org/jira/browse/HADOOP-6958 [2]

Re: Indexing specific metadata tags with urlmeta

2012-01-11 Thread Lewis John Mcgibbney
the update to solr when SolrIndexer runs. Matt Wilson -Original Message- From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent: Monday, September 26, 2011 3:04 PM To: user@nutch.apache.org Subject: Re: Indexing specific metadata tags with urlmeta Hi

Re: Indexing specific metadata tags with urlmeta

2012-01-12 Thread Lewis John Mcgibbney
/browse/NUTCH-809), hope this helps. On 11.01.2012 22:44, Dean Del Ponte wrote: Thank-you for your response. My goal is to get Nutch to index meta tags. It's been quite an adventure so far! On Wed, Jan 11, 2012 at 3:30 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Dean

Re: relative url problem with Nutch

2012-01-12 Thread Lewis John Mcgibbney
Hi Remi, WRT fixing Nutch 1.2 I can't comment, we do not support this version any longer and it is no longer actively maintained, however please keep an eye on the issue (and related issues mentioned on NUTCH-566 thread) and you may be able to back port some of the changes to Nutch-1.2 (fingers

Re: Fetching large files

2012-01-12 Thread Lewis John Mcgibbney
Is it possible for you to fetch smaller segments, parse them, then merge incrementally rather than attempting to merge several larger segments at once? Are you getting any IO problems when parsing the segments? If so this may be an early warning light to attack the problem from another angle. On

Re: nutch, oozie and elasticsearch

2012-01-13 Thread Lewis John Mcgibbney
Hi Bowen, I completely agree with Chris' comments, there have been a few guys popping up from time to time asking about ES therefore any contrib in this area would be excellent. In the meantime I'll check your code out on Github. Thanks for letting us in the loop. Lewis On Fri, Jan 13, 2012

Re: Call for Submission Berlin Buzzwords 2012all for Submission Berlin Buzzwords - http://berlinbuzzwords.de

2012-01-13 Thread Lewis John Mcgibbney
Is anyone planning on heading to Berlin Buzzwords this year? I missed last year, but was fortunate enough to catch up with a lot of the stuff online. Even if I don't get the opportunity to get something prepared, I would still really like to make the event. Anyone? On Fri, Jan 13, 2012 at 9:33 AM,

Re: Focused crawling with nutch

2012-01-13 Thread Lewis John Mcgibbney
Hi Vijith, We are happy to help however you need to be more specific in terms of what you want to achieve? What are you required to obtain from the Web or some domain... What nature of Nutch installation... WRT restricting/focussing your crawl, what for?... Thanks On Fri, Jan 13, 2012 at 10:45

Re: Indexing specific metadata tags with urlmeta

2012-01-13 Thread Lewis John Mcgibbney
I haven't been working on this, but how does your schema configure these fields? Have you configured it to store and index the new metadata field(s)? Also you may wish to set it to some kind of custom setting via conf/solr-mapping.xml Only thoughts so please ignore if out of context. Lewis On

Re: nutch, oozie and elasticsearch

2012-01-13 Thread Lewis John Mcgibbney
:19 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Bowen, I completely agree with Chris' comments, there have been a few guys popping up from time to time asking about ES therefore any contrib in this area would be excellent. In the meantime I'll check your code out on Github

Re: Start crawl from Java without bin/nutch script

2012-01-15 Thread Lewis John Mcgibbney
Mmmm, I am not using Nutch on Windows at all, generally don't know too much about configuring Cygwin and really hope there is some more help out there. The main problem here seems to be that the relative path to /cygdrive/c/server/nutch/urls is not being interpreted correctly. You mention {bq}

Re: Indexing specific metadata tags with urlmeta

2012-01-16 Thread Lewis John Mcgibbney
- meta name=keywords content=plugin / So I believe giving a query for 'plugin' should give me this page in results. (the page content is nothing related to plugins) Please correct me if I am wrong. On Fri, Jan 13, 2012 at 6:09 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: I

Re: incompatible neko and xerces versions?

2012-01-17 Thread Lewis John Mcgibbney
Hi Dennis, Would it be possible for you to open an issue on our Jira as this sounds like we need to document and catch it. Thanks very much for reporting. Kind Regards Lewis On Tue, Jan 17, 2012 at 3:16 PM, Dennis Spathis dspat...@gmail.com wrote: Hi, The Nutch 1.4 distribution includes

Re: invalid uri with three dots

2012-01-17 Thread Lewis John Mcgibbney
Hi Remi, This also looks like we need to document and address it. Can you log a Jira issue and we will try to get on to it. Can you also have a look through some of the existing issues in case there is something similar, possibly relate them. Thank you in advance Lewis On Tue, Jan 17, 2012 at

Re: Focused crawling with nutch

2012-01-18 Thread Lewis John Mcgibbney
, 2012 at 6:01 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Vijith, We are happy to help however you need to be more specific in terms of what you want to achieve? What are you required to obtain from the Web or some domain... What nature of Nutch

Re: How to exclude a specific URL from crawling

2012-01-18 Thread Lewis John Mcgibbney
Well this might be the file that you edit if you are using the urlfilter-automaton plugin for urlfiltering. Markus was indicating that you may wish to begin by looking at urlfilter-regex and subsequently regex-urlfilter.txt, I would go as far to say that this is the most commonly implemented
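The rule style that regex-urlfilter.txt is generally described as using can be sketched as follows. This is a minimal illustration in Python, not the urlfilter-regex plugin itself, and it assumes each rule line starts with `+` (accept) or `-` (reject) followed by a regular expression, that the first matching rule decides, and that a URL matching no rule is rejected. The rules and URLs below are hypothetical.

```python
import re

# Hypothetical rules written in the style of regex-urlfilter.txt.
# '+' accepts, '-' rejects; the example.com URL is made up for illustration.
RULES_TEXT = r"""
# exclude one specific page
-^http://www\.example\.com/private/page\.html$
# exclude some binary suffixes
-\.(gif|jpg|png|zip)$
# accept anything else
+.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return the decision of the first rule whose pattern is found in the URL."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is rejected


FILTER_RULES = parse_rules(RULES_TEXT)
for candidate in (
    "http://www.example.com/index.html",
    "http://www.example.com/private/page.html",
    "http://www.example.com/logo.png",
):
    print(candidate, "->", "accepted" if accepts(candidate, FILTER_RULES) else "rejected")
```

Because the first match wins in this sketch, the rule excluding a single URL has to sit above the catch-all `+.` line.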

Re: Partly remove already crawled urls

2012-01-19 Thread Lewis John Mcgibbney
It depends where you are wanting to remove the urls from... your Nutch crawldb or your Solr index? We offer and maintain quite a number of tools to enable you to maintain a healthy crawldb e.g. purge, filtering, etc, we also maintain some tools to help you maintain your Solr index e.g. delete

Re: Partly remove already crawled urls

2012-01-19 Thread Lewis John Mcgibbney
:00 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: It depends where you are wanting to remove the urls from... your Nutch crawldb or your Solr index? We offer and maintain quite a number of tools to enable you to maintain a healthy crawldb e.g. purge, filtering, etc, we also

Re: problem fetching pages = nutch + hadoop

2012-01-19 Thread Lewis John Mcgibbney
Hi Waleed, Can you please read through the following section on our wiki and post a more verbose description of your problem, I'm struggling to help at the moment due to lack of information http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Becoming_a_Nutch_Developer On Wed, Jan 18, 2012

Re: Partly remove already crawled urls

2012-01-19 Thread Lewis John Mcgibbney
, Jan 19, 2012 at 8:26 PM, remi tassing tassingr...@gmail.com wrote: The main purpose is to remove urls matching a certain pattern from the Nutch segments(or database). Remi On Thursday, January 19, 2012, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Maintenance tool for what? You

Re: Focused crawling with nutch

2012-01-20 Thread Lewis John Mcgibbney
in accordance with release - 1.4 (which i am using). Don't know whether i did it right. but it works with 1.4. here is the new patch file attached. On Wed, Jan 18, 2012 at 3:34 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Apply the patch from $NUTCH_HOME Use 828-2 (newest by date

Re: nutch-779 or nutch-809

2012-01-20 Thread Lewis John Mcgibbney
I believe some people have recently had success using 809, and Elisabeth has done an excellent job of providing comprehensive documentation which you will find linked to from the 809 issue. I can't comment fully on which one to use over the other but 809 seems to be working for people. Thanks On

Re: Extracting documents from nutch segments

2012-01-20 Thread Lewis John Mcgibbney
I'm not sure if I'm understanding you here. You are not wanting to index the documents, but merely wanting to have stored documents in your hard disk? What is the reasoning behind this? Thanks On Fri, Jan 20, 2012 at 9:48 AM, Adriana Farina adriana.farin...@gmail.com wrote: I forgot to write

Re: Partly remove already crawled urls

2012-01-20 Thread Lewis John Mcgibbney
Hi Marek, What happens with the data in the segments, I guess the data of the crawled urls are still in the segments after filtering out from the crawldb, aren't they? Well additionally you can merge several segments and again use urlfilters to get rid of urls you don't wish to have. Again

Re: Fetch time in crawldb

2012-01-20 Thread Lewis John Mcgibbney
Hi Marek, If you have a look at the FetchSchedule class, you'll see that fetchTime relates to the time date when a page was successfully fetched, subsequently we configure fetchInterval to the time date when we wish to revisit the successfully fetched URL. hth On Fri, Jan 20, 2012 at 4:07 PM,

Re: Strange timestamps in generators log

2012-01-20 Thread Lewis John Mcgibbney
On Fri, Jan 20, 2012 at 5:10 PM, Marek Bachmann m.bachm...@uni-kassel.de wrote: Hello again, I was inspecting the generator because it doesn't deliver all urls for the fetch list from the crawldb even if I set the addDays attribute to a value much higher than the max fetch interval. How

Re: nutch-779 or nutch-809

2012-01-20 Thread Lewis John Mcgibbney
I think it's been updated to work with trunk so unless you can back port it (if there have been changes to the codebase) then it will not. That being said, I haven't used it so trying won't hurt. Thanks On Fri, Jan 20, 2012 at 2:16 PM, abhayd ajdabhol...@hotmail.com wrote: thaks lewis Does

Re: Extracting documents from nutch segments

2012-01-21 Thread Lewis John Mcgibbney
on a hard disk, but at the moment I can't figure out what I could do. Can you help me? Thank you very much! 2012/1/20 Lewis John Mcgibbney lewis.mcgibb...@gmail.com I'm not sure if I'm understanding you here. You are not wanting to index the documents, but merely wanting to have stored

Re: Getting html pages through a Nutch crawl (for a dataset)

2012-01-22 Thread Lewis John Mcgibbney
The best method is to read or dump the contents of your crawldb and work based on this. Please have a look on the wiki for using the readdb tool. On Sun, Jan 22, 2012 at 10:51 AM, Sameendra Samarawickrama smsa...@googlemail.com wrote: Hi, I am using Nutch to generate a small dataset of web;

Re: Following .axd urls

2012-01-23 Thread Lewis John Mcgibbney
Hi Ian, What fetching depth are you using? Lewis On Mon, Jan 23, 2012 at 7:46 AM, Ian Piper ianpi...@tellura.co.uk wrote: Hi all, I'd appreciate some guidance... can't seem to find much useful stuff on the web on this. I have set up a Nutch and Solr service that is crawling a client's

Re: Delete Duplicates Error

2012-01-26 Thread Lewis John Mcgibbney
Hi Kaveh, I'm not sure if your problem is the same at all. Your problem stems from the solr mapping configuration used by AnchorIndexingFilter in the index-anchor plugin. If this works properly then you should see a list of all of the source -- destination field mappings, this unfortunately is

Re: From Nutch 1.2 to 1.4

2012-01-31 Thread Lewis John Mcgibbney
Hi Remi, 1. Are the segments backward compatible? I tried updatedb but I get skipping invalid segment In all honesty I've not tried this! Is it possible to use readseg -dump to get a text file then use freegen to generate new segments to fetch??? 2. With the same configuration, it

Re: Focused crawling with nutch

2012-02-01 Thread Lewis John Mcgibbney
-02-01 16:05:52,927 WARN mapred.LocalJobRunner - job_local_0006 java.lang.NumberFormatException: null at java.lang.Integer.parseInt(Integer.java:417) at java.lang.Integer.parseInt(Integer.java:499) ... what could be the problem ?? On Fri, Jan 20, 2012 at 4:38 PM, Lewis John Mcgibbney

Nutch 2.0 Webapp

2012-02-02 Thread Lewis John Mcgibbney
Hi, Off-list me and Chris have briefly discussed the possibility of hosting a competition between teams of PG students from the universities here in Glasgow and @ USC. The idea is to have a competition aimed at creating a new Nutch 2.0 webapp taking in to consideration the Jira tickets [1] [2].

Re: Failed fetching

2012-02-02 Thread Lewis John Mcgibbney
Looks like you're using an old version of Nutch here. Please try upgrading to 1.4 Dean hth On Thu, Feb 2, 2012 at 5:22 PM, Dean Pullen dean.pul...@semantico.com wrote: What I see in logs/userlogs/myfetchjobxx/syslog is: 2012-02-02 17:15:25,045 INFO org.apache.nutch.fetcher.Fetcher: fetch of

Re: Failed fetching

2012-02-03 Thread Lewis John Mcgibbney
There are no log files attached On Fri, Feb 3, 2012 at 10:06 AM, tiagorcs dasilva-ti...@mitsue.co.jp wrote: Forgot to mention I am using Nutch 1.4, and that I have no problems with the exact same setup for Nutch 1.3. -- View this message in context:

Re: Focused crawling with nutch

2012-02-03 Thread Lewis John Mcgibbney
Hi, On Fri, Feb 3, 2012 at 10:33 AM, Vijith vijithkv...@gmail.com wrote: OK. It worked. There were some typo mistakes in the patch code. Also I forgot to change the 'fetcher.parse' property to true. I will attach the updated patch to the issue once I have a complete check. Great Also Is

Re: Is it still possible to create a pure lucene index?

2012-02-03 Thread Lewis John Mcgibbney
Hi Marek, I really don't think so. We stripped all of the Lucene stuff @ 1.3 as you know. There was however an interesting thread initiated by Adriana [1] which began down the same route here... [1] http://www.mail-archive.com/user@nutch.apache.org/msg05268.html On Fri, Feb 3, 2012 at 6:55 PM,

Solandra Nutch [WAS] Re: Dump into Cassandra using Nutch 1.x

2012-02-08 Thread Lewis John Mcgibbney
Hey Peyman, Do you care to discuss your experiences using Solandra, what was required to get a Nutch -- Solandra workflow working? This is also a module I think would be great in Gora and pluggable to Nutch trunk/nutchgora. Thanks for any comments. Lewis On Wed, Feb 8, 2012 at 3:43 PM, Peyman

Re: Seed urls not getting crawled.

2012-02-10 Thread Lewis John Mcgibbney
Hi, On Thu, Feb 9, 2012 at 7:26 AM, Sudip Datta pid...@gmail.com wrote: While, this indicates that a reattempt will be made in 1 day, the 'url' never really gets the state db_fetched. On the other hand, if I set generate.max.count = -1, the page is indeed crawled but the crawl is

Re: Failed fetching

2012-02-10 Thread Lewis John Mcgibbney
In all honesty this is strange. We can assure you that 1.4 DOES work for protocol-http! Any cygwin users out there that can lend a hand? On Mon, Feb 6, 2012 at 4:37 AM, tiagorcs dasilva-ti...@mitsue.co.jp wrote: Also, this is what I got inside my *plugins* folder creativecommons

Re: Understanding NutchConfigration properly

2012-02-12 Thread Lewis John Mcgibbney
I see a Jira ticket coming up here ... I'll open one up. Thanks Lewis On Sat, Feb 11, 2012 at 10:58 PM, Markus Jelsma mar...@apache.org wrote: The xsl, xsd and dtd files are not used by Nutch anymore. Hi, When specifying configurations for Hadoop, we are actually for using

Re: Understanding NutchConfigration properly

2012-02-12 Thread Lewis John Mcgibbney
: Is it really worth bothering? On 12 February 2012 17:04, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: I see a Jira ticket coming up here ... I'll open one up. Thanks Lewis On Sat, Feb 11, 2012 at 10:58 PM, Markus Jelsma mar...@apache.org wrote: The xsl, xsd and dtd

Re: Build a pipeline using nutch

2012-02-14 Thread Lewis John Mcgibbney
Hi Puneet, On Tue, Feb 14, 2012 at 5:12 AM, Puneet Pandey puneet...@gmail.com wrote: I have started using nutch recently. As I understand nutch crawling is a cyclic process inject-generate-fetch-parse-update Yes this is typically what you would execute. 1. When does parse start when I
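The inject-generate-fetch-parse-update cycle named above can be laid out as a dry run. This is only a sketch in Python that builds and prints the command lines without executing anything. The `crawl` and `urls` directory names and the `SEGMENT` placeholder are assumptions for illustration; in a real run the segment path would be the timestamped directory that the generate step creates.

```python
def crawl_cycle_commands(crawl_dir="crawl", seed_dir="urls",
                         segment="crawl/segments/SEGMENT"):
    """Return one round of the crawl cycle as a list of argument lists.

    All paths are hypothetical; `segment` stands in for the directory
    produced by the generate step.
    """
    crawldb = crawl_dir + "/crawldb"
    segments = crawl_dir + "/segments"
    return [
        ["bin/nutch", "inject", crawldb, seed_dir],       # seed the crawldb
        ["bin/nutch", "generate", crawldb, segments],     # build a fetch list
        ["bin/nutch", "fetch", segment],                  # fetch that segment
        ["bin/nutch", "parse", segment],                  # parse what was fetched
        ["bin/nutch", "updatedb", crawldb, segment],      # fold results back in
    ]


# Dry run: show the steps in order instead of running them.
for command in crawl_cycle_commands():
    print(" ".join(command))
```

Keeping fetch and parse as separate entries mirrors the advice elsewhere in this listing to run them as individual steps for larger segments.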

Re: Dump into Cassandra using Nutch 1.x

2012-02-14 Thread Lewis John Mcgibbney
Hi, On Tue, Feb 14, 2012 at 3:13 PM, conta...@complexityintelligence.com wrote: IMHO the Nutch eco system requires a more integrated vision ;) In what respect? Can you please be more verbose? Thanks Lewis

Re: fetcher.max.crawl.delay = -1 doesn't work?

2012-02-14 Thread Lewis John Mcgibbney
Hi Danicela, Before I try this, have you configured any other overrides for generating or fetching in nutch-site.xml? Thanks On Tue, Feb 14, 2012 at 3:10 PM, Danicela nutch danicela-nu...@mail.com wrote: Hi, I have in my nutch-site.xml the value fetcher.max.crawl.delay = -1. When I try

Re: RSS parser

2012-02-14 Thread Lewis John Mcgibbney
John Mcgibbney wrote: Hi, On Wed, Feb 8, 2012 at 8:44 AM, Michael Kazekin Michael.Kazekin@mediainsight.**info michael.kaze...@mediainsight.info wrote: I tried your solution and got rid of doesn't claim to support contentType error indeed. Maybe we should submit a patch for this indeed

Re: tstamp vs. lastModified ...

2012-02-15 Thread Lewis John Mcgibbney
iirc time stamp represents when page was last fetched. Yes you should be able to specify this value in your schema and get it mapped to solr index. Last modified is when the actual page was last modified e.g. when there was a change to the page source or something. On Wed, Feb 15, 2012 at 1:26

Re: tstamp vs. lastModified ...

2012-02-15 Thread Lewis John Mcgibbney
Hi Remi, On Wed, Feb 15, 2012 at 1:51 PM, remi tassing tassingr...@gmail.com wrote: Thanks for the clarification! nb For tstamp, I can actually see it in Solr results (even though the format is weird) what is the format? How can I get Last-Modified value in Solr as well? Does Nutch

Re: tstamp vs. lastModified ...

2012-02-15 Thread Lewis John Mcgibbney
Hi, On Wed, Feb 15, 2012 at 4:00 PM, remi tassing tassingr...@gmail.com wrote: tstamp shows a string of digits like 20020123123212 This is OK. -mm-dd-hh-mm-ssZ It is however hellishly old ! Never heard of the plugin index-more and it's poorly documented. Well it's been included in
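A digit string like the one quoted above can be turned into a readable date. This is a small sketch in Python, assuming the 14 digits are laid out as year, month, day, hour, minute, second; that layout is an assumption read off the example value, since the format description in the snippet is incomplete. No time zone is attached to the result.

```python
from datetime import datetime


def parse_tstamp(value):
    """Parse a 14-digit timestamp, assuming a yyyyMMddHHmmss layout."""
    return datetime.strptime(value, "%Y%m%d%H%M%S")


fetched = parse_tstamp("20020123123212")
print(fetched.isoformat())  # 2002-01-23T12:32:12
```

Read this way, the quoted value falls in January 2002, which fits the remark that it is very old.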

Re: Re : Re: fetcher.max.crawl.delay = -1 doesn't work?

2012-02-15 Thread Lewis John Mcgibbney
On Wed, Feb 15, 2012 at 9:08 AM, Danicela nutch danicela-nu...@mail.com wrote: I don't think I configured such things, how can I be sure ? - Original message - From : Lewis John Mcgibbney Sent : 14.02.12 19:18 To : user@nutch.apache.org Subject : Re: fetcher.max.crawl.delay = -1

Re: Question regarding NutchHadoopTutorial

2012-02-16 Thread Lewis John Mcgibbney
Yes. You are correct. Is it possible for me to add you to the wiki admin group and you could update this for us? It is a long outstanding task... If you are OK to do this, then please register a username with the wiki and I'll get you added. Thank you in advance Lewis On Thu, Feb 16, 2012 at

Re: Trouble with checking Gora trunk from SVN

2012-02-17 Thread Lewis John Mcgibbney
Hi, Gora has now graduated to TLP so change your SVN url accordingly. Do you really want to be working from inside eclipse? Why not just operate from the cmdline? If you have cassandra installed it will be more straightforward to work from cmdline. On Fri, Feb 17, 2012 at 4:52 AM, apachenutch

Re: Nutch setup on Cassandra error

2012-02-17 Thread Lewis John Mcgibbney
Hi, I'm afraid you need to be more verbose about where and when you are using this from. It looks like this is taken from the cassandra CLI? This tutorial may be somewhat dated, it also only covers Nutchgora Gora Cassandra use within Eclipse IDE, which was required at this time because Gora was

Re: Nutch setup on Cassandra error

2012-02-17 Thread Lewis John Mcgibbney
Hi, cassandra-mapping.xml should already be in your $nutchgora/conf directory, as per here [1]. When you build the project it will be copied over to runtime/local/conf and will then be on your classpath. On Fri, Feb 17, 2012 at 6:53 PM, apachenutch poojasw...@gmail.com wrote: /Please place

Re: Some PDF contains is not readable when crawling with nutch

2012-02-18 Thread Lewis John Mcgibbney
Hi Hadi, Please see here http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Becoming_a_Nutch_Developer If you wish to post back with your question we will try and help. Thanks Lewis On Sat, Feb 18, 2012 at 1:37 PM, hadi md.anb...@gmail.com wrote: I have problem with some pdf,when i

Re: IOExeption when crawling with nutch in Fetching process

2012-02-18 Thread Lewis John Mcgibbney
Hi Hadi, On Sat, Feb 18, 2012 at 1:05 PM, hadi md.anb...@gmail.com wrote: -finishing thread FetcherThread, activeThreads=0 -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 -activeThreads=0 Fetcher: java.io.IOException: Job failed! at

Re: IOExeption when crawling with nutch in Fetching process

2012-02-19 Thread Lewis John Mcgibbney
Can you please paste how you have specified your hadoop temp dir. This seems to be the cause of such stack trace errors. Thanks On Sun, Feb 19, 2012 at 7:04 AM, hadi md.anb...@gmail.com wrote: yes,there is a hadoop log : i search this error but everyone says this error is about low space

Re: Nutch setup on Cassandra error

2012-02-19 Thread Lewis John Mcgibbney
Hi, On Sun, Feb 19, 2012 at 4:23 PM, apachenutch poojasw...@gmail.com wrote: I see the web page keyspace created, but I dont see any records after configuring the crawler :( How are you verifying this? Please be as verbose as possible when discussing Nutchgora branch. I performed a nutch

Re: Nutch setup on Cassandra error

2012-02-19 Thread Lewis John Mcgibbney
Hi, On Sun, Feb 19, 2012 at 8:44 PM, apachenutch poojasw...@gmail.com wrote: ERROR 12:37:59,770 Fatal configuration error org.apache.cassandra.config.ConfigurationException: localhost/ 127.0.0.1:7000 is in use by another process. Change listen_address:storage_port in cassandra.yaml to

Re: Error running Nutch 1.4 crawl on Amazon EMR using the S3 (s3n://) filesystem

2012-02-22 Thread Lewis John Mcgibbney
So maybe try hacking CrawlDb#createJob() So that when you create the new NutchJob object you pass in the uri parameter 124 JobConf job = new NutchJob(uri, config); as suggested in the thrown stack trace. Please get back to us with results. I've not been using anything like Amazon EMR and would

Re: http.redirect.max

2012-02-23 Thread Lewis John Mcgibbney
Hi, Can you post your nutch-site.xml and I will give it a spin. Thank you Lewis On Thu, Feb 23, 2012 at 5:07 AM, xuyuanme xuyua...@gmail.com wrote: Just checked the latest code in 1.4 but it's the same. See code line 138 in below link:

Re: index-basic and index-more cause multi-value on non-multi-value title field?

2012-02-23 Thread Lewis John Mcgibbney
*doc.removeField(title)*, just before *doc.add(title, result.group(1));*. Should a bug be opened, or am I misunderstanding the function of this plugin? ShlomiJ On Sun, Feb 19, 2012 at 2:06 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Shlomi, On Sun, Feb 19, 2012 at 10:15 AM

Re: http.redirect.max

2012-02-23 Thread Lewis John Mcgibbney
would usually not set it so) Lewis On Thu, Feb 23, 2012 at 3:18 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Additionally in your nutch-site.xml we don't maintain any query-(plugins), and there is no parse-text plugin either. On Thu, Feb 23, 2012 at 3:13 PM, Lewis John Mcgibbney

Re: Nutch data to Solr on HTTPS

2012-02-23 Thread Lewis John Mcgibbney
On Thu, Feb 23, 2012 at 1:59 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Christopher, I don't think Nutch 1.2 could be used with a Solr server running on basic https authentication. Markus committed a nice section of work which addresses this in 1.3 iirc, or maybe 1.4

Re: Nutch data to Solr on HTTPS

2012-02-24 Thread Lewis John Mcgibbney
Hi, On Thu, Feb 23, 2012 at 7:27 PM, Christopher Gross cogr...@gmail.com wrote: Unless -- is 1.2 able to crawl https sites? If it can't do that then I may have to upgrade You should be able to get https sites, yes; however, I'm not overly familiar with the protocol-httpclient plugin. If

Re: Nutch AND Solr? Nutch performance and features

2012-02-24 Thread Lewis John Mcgibbney
Hi James, On Fri, Feb 24, 2012 at 2:47 PM, Spadez james_will...@hotmail.com wrote: However, having found nutch, it seems like this might be something worth looking at. Firstly, is nutch simply a web scraper or does it integrate other aspects of lucene as well? I'm wondering if I would need to

Re: Tika with nutch

2012-02-26 Thread Lewis John Mcgibbney
Mmmm... this is really a Tika question, which probably explains why you have received very little response from the community, unfortunately. So the problem is that you are always getting back isEmpty, indicating that _nothing_ is being produced as an output from your parser. I would add in a try

Re: How to crowl AJAX populated pages

2012-02-28 Thread Lewis John Mcgibbney
Can you please provide one such URL so I can try. Thanks On Tue, Feb 28, 2012 at 9:02 AM, remi tassing tassingr...@gmail.com wrote: Same question here... I have similar issues where (redirection)links are given through JavaScript I hope I haven't hijacked your post as I see these issues

Re: How to crowl AJAX populated pages

2012-02-28 Thread Lewis John Mcgibbney
Tiny chunk of info on this topic https://developers.google.com/webmasters/ajax-crawling/ On Tue, Feb 28, 2012 at 9:39 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Can you please provide one such URL so I can try. Thanks On Tue, Feb 28, 2012 at 9:02 AM, remi tassing tassingr

Re: Query in nutch

2012-02-28 Thread Lewis John Mcgibbney
As far as I know, Elisabeth Adler contributed a patch exactly for this on NUTCH-585 [0]. If you wish to get cracking with it, please check out the latest trunk code [1] and patch it using the blacklist_whitelist_plugin.patch Elisabeth attached to the issue. Would be excellent if you could provide

Re: http.redirect.max

2012-03-02 Thread Lewis John Mcgibbney
issue is related to this specific site http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_BrowseDrugInitial=B lewis john mcgibbney wrote I've checked working with redirects and everything seems to work fine for me. The site I checked

Re: Only fetching initial seedlist

2012-03-02 Thread Lewis John Mcgibbney
Hi James, Your seed URLs are more than likely being filtered out by your settings in conf/regex-urlfilter.txt. Have a good read through the urlfilter documentation [0] and the basic examples that are provided in other urlfilters; also it might help to do a bit of reading regarding
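To see why a seed URL might be dropped, it can help to replay the filter rules by hand. The following is only a rough sketch of how rule files of this kind are commonly evaluated: each rule is a `+` (accept) or `-` (reject) followed by a regular expression, the first rule whose pattern is found in the URL decides, and a URL that matches no rule is rejected. The function names and the example rules are made up for illustration, and Python's `re` module stands in for Java regular expressions, so results can differ from Nutch itself on unusual patterns.

```python
import re

# Hypothetical rules for illustration only; not a shipped Nutch configuration.
EXAMPLE_RULES = r"""
# skip URLs containing characters that usually indicate queries
-[?*!@=]
# accept anything on example.org and its subdomains
+^https?://([a-z0-9-]+\.)*example\.org/
"""


def load_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first matching rule; reject if none match."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = load_rules(EXAMPLE_RULES)
    for url in ("http://www.example.org/page.html",
                "http://www.example.org/index.cfm?id=1",
                "http://other.com/"):
        print(url, "->", "kept" if accepts(rules, url) else "filtered out")
```

Because the first matching rule wins, a broad reject rule placed early in the file can hide seed URLs that a later accept rule was meant to keep.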

Re: Only fetching initial seedlist

2012-03-02 Thread Lewis John Mcgibbney
What makes you think that? On Fri, Mar 2, 2012 at 12:07 PM, James Ford simon.fo...@gmail.com wrote: But it seems that the solution to my problem is to set db.max.outlinks.per.page to 0? Bear in mind that it is pretty difficult to provide help if this is not mentioned initially.

Re: Crawling with Certs

2012-03-08 Thread Lewis John Mcgibbney
Hi Christopher, It appears that the page is being fetched successfully. What is not successful is the parser obtaining the page content... these fields appear to be returning empty values when, as you have stated, this is not the case. How large is the page content? Does your http.content.limit
