Re: extract data from html, help

2011-07-20 Thread Julien Nioche
Simply implement an HtmlParseFilter, which will receive a DOM representation from the tika|html parser. Look in existing plugins for examples or search the mailing list. On 20 July 2011 08:53, Cheng Li chen...@usc.edu wrote: Thank you. What do you mean by XPath? Could you explain a little bit
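
For readers wondering what XPath is: it is a small query language for selecting nodes in a DOM tree, which is the kind of structure a parse filter receives. The sketch below uses Python's standard library on a made-up, well-formed page snippet purely to show the idea. It is not Nutch code, the page content is invented, and ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample page (real-world HTML usually needs a
# tolerant parser first; this is only an illustration).
PAGE = """<html><body>
<h2 class="title">2005 Honda Civic</h2>
<span class="price">$4500</span>
<a href="/wst/ctd/123.html">details</a>
</body></html>"""

root = ET.fromstring(PAGE)

# ".//tag[@attr='value']" selects matching elements anywhere below the root.
title = root.find(".//h2[@class='title']").text
price = root.find(".//span[@class='price']").text
links = [a.get("href") for a in root.findall(".//a")]

print(title, price, links)
```

In a parse filter the same kind of expression would be evaluated against the DOM handed over by the parser, and the selected values stored as metadata.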

Re: skipping invalid segments nutch 1.3

2011-07-20 Thread Julien Nioche
Haven't you forgotten to call parse? On 19 July 2011 23:40, Leo Subscriptions llsub...@zudiewiener.com wrote: Hi Lewis, You are correct about the last post not showing any errors. I just wanted to show that I don't get any errors if I use 'crawl' and to prove that I do not have any faults

Re: Fetched pages has no content

2011-07-20 Thread Julien Nioche
protocol-httpclient is broken and needs replacing On 19 July 2011 23:10, Anders Rask anr...@gmail.com wrote: Hi guys! I experimented some more, and it seems I'm only getting these problems when using protocol-httpclient. It works fine when I use protocol-http. Could you please try and see

Re: FATAL fetcher.Fetcher: Fetcher: java.lang.NullPointerException

2011-07-20 Thread Julien Nioche
This has been fixed recently. Checkout 1.4 from SVN, it lives in a separate branch and is NOT in the trunk On 20 July 2011 02:58, Chance Callahan chance1calla...@gmail.com wrote: Whenever I start Nutch, I get the following error: 2011-07-20 01:40:49,744 INFO server Copying

help, src modify to optimize the crawl

2011-07-20 Thread Cheng Li
Hi, I tried to use Nutch to crawl Craigslist. The seeds I use are http://losangeles.craigslist.org/wst/ctd/ http://losangeles.craigslist.org/sfv/ctd/ http://losangeles.craigslist.org/lac/ctd/ http://losangeles.craigslist.org/sgv/ctd/ http://losangeles.craigslist.org/lgb/ctd/

Score Format

2011-07-20 Thread Mohammad Hassan Pandi
Hi everybody, how is nutch.score formatted? I use HBase + Nutch. For example, I have injected a URL with score 10 and what I see in HBase is value=A \x00\x00. How does A \x00\x00 represent 10?

Re: How to get the original html file that is crawled by Nutch?

2011-07-20 Thread Chris Alexander
One way I have seen this working is to edit the schema.xml file {SOLR_HOME}/conf/schema.xml. Modify the field with name content to have its stored parameter set to true. Something like this: <field name="content" type="text" stored="true" ... />. You will need to re-index pages (either by emptying solr

Re: Score Format

2011-07-20 Thread Mohammad Hassan Pandi
I have no problem with Nutch-Gora-HBase. All I want to know is how the value is formatted. How does A \x00\x00 mean 10? How does A\x10\x00\x00 mean 9? How does @\xE0\x00\x00 mean 7, and so on... On Wed, Jul 20, 2011 at 3:12 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Using 2.0 I gather? As you've
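
The three byte strings quoted in this thread are consistent with a 4-byte big-endian IEEE 754 single-precision float, with the HBase shell printing printable bytes as ASCII (0x41 is 'A', 0x20 is a space, 0x40 is '@') and the rest as \x escapes. A minimal Python sketch, assuming that encoding; the helper name decode_score is made up for illustration:

```python
import struct

def decode_score(raw: bytes) -> float:
    """Interpret 4 bytes as a big-endian IEEE 754 single-precision float."""
    return struct.unpack(">f", raw)[0]

# Byte strings as displayed in the thread, paired with the injected scores.
samples = {
    b"A \x00\x00": 10.0,     # 0x41 0x20 0x00 0x00
    b"A\x10\x00\x00": 9.0,   # 0x41 0x10 0x00 0x00
    b"@\xe0\x00\x00": 7.0,   # 0x40 0xE0 0x00 0x00
}

for raw, injected in samples.items():
    print(raw.hex(), decode_score(raw), injected)
```

Under that assumption all three values decode to exactly the scores mentioned in the question.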

Re: skipping invalid segments nutch 1.3

2011-07-20 Thread Cam Bazz
Hello, I think the documentation is misleading: it does not tell us that we have to parse. On Wed, Jul 20, 2011 at 11:42 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Haven't you forgotten to call parse? On 19 July 2011 23:40, Leo Subscriptions llsub...@zudiewiener.com wrote:

Solr frontend for nutch schema

2011-07-20 Thread Marek Bachmann
Hello, there isn't a Nutch-specific search frontend for Solr yet, am I right? (Like the standard browse page in the Solr example.) Thanks

Nutch not indexing full collection

2011-07-20 Thread Chip Calhoun
Hi, I'm using Nutch 1.3 to crawl a section of our website, and it doesn't seem to crawl the entire thing. I'm probably missing something simple, so I hope somebody can help me. My urls/nutch file contains a single URL: http://www.aip.org/history/ohilist/transcripts.html , which is an

Re: Nutch not indexing full collection

2011-07-20 Thread Julien Nioche
I'd have suspected db.max.outlinks.per.page but you seem to have set it up correctly. Are you running Nutch in runtime/local? In which case you modified nutch-site.xml in runtime/local/conf, right? nutch readdb -stats will give you the total number of pages known, etc. Julien On 20 July 2011

Re: Updating Tika in Nutch

2011-07-20 Thread Mattmann, Chris A (388J)
Sorry guys I'm nutters! :) Cheers, Chris On Jul 20, 2011, at 1:39 AM, Julien Nioche wrote: Glad you managed to get it to work. I don't know what Chris meant by that, can't see why we'd open a JIRA when we are already using the latest version Julien On 20 July 2011 08:19, Fernando Arreola

RE: Nutch not indexing full collection

2011-07-20 Thread Chip Calhoun
I've been working with $NUTCH_HOME/runtime/local/conf/nutch-site.xml, and I'm pretty sure that's the correct file. I run my commands while in $NUTCH_HOME/ , which means all of my commands begin with runtime/local/bin/nutch... . That means my urls directory is $NUTCH_HOME/urls/ and my crawl

Re: How to get the original html file that is crawled by Nutch?

2011-07-20 Thread Kelvin
I have found the solution for my problem; I'm posting it in case others are also stuck on this problem. :) Nutch can store the whole text content of the HTML pages. For Nutch 1.3, Step 1: In nutch/runtime/local/conf/nutch-site.xml add <property> <name>http.content.limit</name>

Re: FATAL fetcher.Fetcher: Fetcher: java.lang.NullPointerException

2011-07-20 Thread Chance Callahan
I am now having a new issue: 2011-07-20 18:45:54,480 INFO server Copying /user/hdfs/nutch-1.4.jar-/tmp/jobsub-0pXrwu/work/tmp.jar 2011-07-20 18:45:54,852 INFO server all_clusters: [hadoop.job_tracker.LiveJobTracker object at 0x94201ec, hadoop.fs.hadoopfs.HadoopFileSystem object at 0x92ab7ec]

Re: help, src modify to optimize the crawl

2011-07-20 Thread lewis john mcgibbney
I don't think this has anything to do with modifying the crawl src. It doesn't in fact have anything to do with optimization either. Try using your URLFilters, e.g. regex. It is important to try and understand what type of pages we can filter out from a Nutch crawl using the filters provided. HTH
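
As a rough illustration of what a regex URL filter does: rules of the form +pattern or -pattern are read from top to bottom, the first rule whose pattern matches decides whether the URL is kept, and a URL that matches no rule is dropped. The Python sketch below mimics that behaviour under stated assumptions. The rules, the function name accept, and the test URLs are hypothetical examples rather than anything posted in the thread, and the real filter uses Java regular expressions.

```python
import re

# Hypothetical rules in "+pattern / -pattern" style, checked in order.
RULES = [
    r"-\.(gif|jpg|png|css|js)$",                                           # skip static assets
    r"+^http://losangeles\.craigslist\.org/(wst|sfv|lac|sgv|lgb)/ctd/",   # keep the seed sections
    r"-.",                                                                 # reject everything else
]

def accept(url, rules=RULES):
    """Return True if the first matching rule starts with '+'."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treat the URL as rejected

for url in (
    "http://losangeles.craigslist.org/wst/ctd/",
    "http://losangeles.craigslist.org/about/",
    "http://losangeles.craigslist.org/wst/ctd/logo.png",
):
    print(url, accept(url))
```

Because rule order matters, the exclusion for static assets sits above the broader accept rule.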

Re: embedded google map in nutch query result page

2011-07-20 Thread lewis john mcgibbney
I don't know if you are still pursuing this, and as you haven't had any response I will give some tips. It sounds like you're using Nutch 1.2 or earlier, therefore unless you are comfortable working with JSPs I wouldn't bother with the hassle. It might be better to try and use Solr for indexing and

Re: embedded google map in nutch query result page

2011-07-20 Thread Cheng Li
Thank you. I'll try to use Solr to do the indexing and add the Google Map object. Do you know of some resources for Solr AJAX? Where should I add the Google Map JS code in Solr? Thanks again, On Wed, Jul 20, 2011 at 1:51 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: I don't know

Re: skipping invalid segments nutch 1.3

2011-07-20 Thread lewis john mcgibbney
There is no documentation for the individual commands used to run a Nutch 1.3 crawl, so I'm not sure where the documentation has been misleading. In the instance that this was required I would direct newer users to the legacy documentation for the time being. My comment to Leo was to understand whether he managed

Re: crawling in any depth until no new pages were found

2011-07-20 Thread lewis john mcgibbney
Hi Marek, As we're talking about automating the task, we're immediately looking at implementing a bash script. In the situation we have described, we wish Nutch to adopt a breadth-first search (BFS) behaviour when crawling. Between us can we suggest any methods for best practice relating to BFS? As
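
To make the breadth-first idea concrete: each crawl round processes the current frontier, collects outlinks that have not been seen before, and the loop stops as soon as a round yields nothing new. Below is a toy Python sketch over an in-memory link graph. The graph and the function name crawl_until_exhausted are invented for illustration; a real setup would instead wrap the Nutch crawl steps in a script and stop when a round produces no new URLs.

```python
def crawl_until_exhausted(seeds, outlinks):
    """Breadth-first traversal that stops when a round finds no new pages.

    seeds    -- iterable of starting URLs
    outlinks -- dict mapping a URL to the list of URLs it links to
    Returns the set of all pages discovered and the number of rounds run.
    """
    seen = set(seeds)
    frontier = list(seeds)
    rounds = 0
    while frontier:
        next_frontier = []
        for url in frontier:
            for link in outlinks.get(url, []):
                if link not in seen:       # only queue pages not seen before
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier           # the next round handles the new pages
        rounds += 1
    return seen, rounds

# Invented four-page site: a links to b and c, b to c and d, d back to a.
graph = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}
pages, rounds = crawl_until_exhausted(["a"], graph)
print(sorted(pages), rounds)
```

The last round is the one that discovers nothing new, which is the termination signal a wrapper script would look for.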

Re: Nutch not indexing full collection

2011-07-20 Thread lewis john mcgibbney
Hi Chip, I would try running your scripts after setting the environment variable $NUTCH_HOME to nutch/runtime/local/NUTCH_HOME On Wed, Jul 20, 2011 at 4:01 PM, Chip Calhoun ccalh...@aip.org wrote: I've been working with $NUTCH_HOME/runtime/local/conf/nutch-site.xml, and I'm pretty sure that's

Re: embedded google map in nutch query result page

2011-07-20 Thread lewis john mcgibbney
You can find Ajax Solr here [1]. As I said, this is only one option for doing this. The information you can return and display is really directly dependent on your requirements and your imagination. However, it should not be too hard implementing the maps you are looking for when you get to grips