Re: Nutch efficiency and multiple single URL crawls

2012-11-25 Thread Joe Zhang
what do you mean by the "job file"? On Sun, Nov 25, 2012 at 10:43 PM, AC Nutch wrote: > Hello, > > I am using Nutch 1.5.1 and I am looking to do something specific with it. I > have a few million base domains in a Solr index, so for example: > http://www.nutch.org, http://www.apache.org, http://

Nutch efficiency and multiple single URL crawls

2012-11-25 Thread AC Nutch
Hello, I am using Nutch 1.5.1 and I am looking to do something specific with it. I have a few million base domains in a Solr index, so for example: http://www.nutch.org, http://www.apache.org, http://www.whatever.com etc. I am trying to crawl each of these base domains in deploy mode and retrieve

Re: Indexing-time URL filtering again

2012-11-25 Thread Joe Zhang
OK. I'm testing it. But like I said, even when I reduce the patterns to the simplest form "-.", the problem still persists. On Sun, Nov 25, 2012 at 3:59 PM, Markus Jelsma wrote: > It's taking input from stdin, enter some URLs to test it. You can add an > issue with reproducible steps. > > -
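The behaviour Joe describes is consistent with how regex-urlfilter rules work: rules are tried top to bottom, the first matching rule wins, a leading `+` accepts and `-` rejects. A pattern file containing only "-." therefore rejects every URL, since `.` matches any character. A minimal bash sketch of that first-match-wins logic (the `url_filter` helper is hypothetical, for illustration only):

```shell
# Emulate the first-match-wins semantics of regex-urlfilter.txt.
# Rules file: lines beginning with '+' accept, '-' reject.
url_filter() {
  local url="$1" rules="$2" sign pattern line
  while IFS= read -r line; do
    case "$line" in
      [+-]*) sign=${line:0:1}; pattern=${line:1} ;;
      *) continue ;;   # skip comments / blank lines
    esac
    if printf '%s' "$url" | grep -Eq -- "$pattern"; then
      [ "$sign" = "+" ] && echo accept || echo reject
      return
    fi
  done < "$rules"
  echo reject   # no rule matched: reject by default
}

# A rules file containing only "-." rejects every URL,
# because "." matches any character of any URL.
printf -- '-.\n' > /tmp/rules.txt
url_filter "http://www.apache.org/" /tmp/rules.txt   # prints "reject"
```

This is why a lone "-." filters out everything at indexing time; an accepting `+` rule has to appear before it.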

RE: Indexing-time URL filtering again

2012-11-25 Thread Markus Jelsma
It's taking input from stdin, enter some URLs to test it. You can add an issue with reproducible steps. -Original message- > From:Joe Zhang > Sent: Sun 25-Nov-2012 23:49 > To: user@nutch.apache.org > Subject: Re: Indexing-time URL filtering again > > I ran the regex tester command yo

Re: Indexing-time URL filtering again

2012-11-25 Thread Joe Zhang
I ran the regex tester command you provided. It seems to be taking forever (15 min+ by now). On Sun, Nov 25, 2012 at 3:28 PM, Joe Zhang wrote: > you mean the content of my pattern file? > > well, even when I reduce it to simply "-.", the same problem still pops up. > > On Sun, Nov 25, 2012 at 3:30

Re: Indexing-time URL filtering again

2012-11-25 Thread Joe Zhang
you mean the content of my pattern file? well, even when I reduce it to simply "-.", the same problem still pops up. On Sun, Nov 25, 2012 at 3:30 PM, Markus Jelsma wrote: > You seem to have an NPE caused by your regex rules, for some weird > reason. If you can provide a way to reproduce you can fi

RE: Indexing-time URL filtering again

2012-11-25 Thread Markus Jelsma
You seem to have an NPE caused by your regex rules, for some weird reason. If you can provide a way to reproduce it you can file an issue in Jira. This NPE should also occur if you run the regex tester: nutch -Durlfilter.regex.file=path org.apache.nutch.net.URLFilterChecker -allCombined In the

Re: Indexing-time URL filtering again

2012-11-25 Thread Joe Zhang
the last few lines of hadoop.log: 2012-11-25 16:30:30,021 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter 2012-11-25 16:30:30,026 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer 2012-11-25 16:30:30,218 WARN mapre

RE: problem with text/html content type of documents appears application/xhtml+xml in solr index

2012-11-25 Thread Markus Jelsma
Hi - you need to enable mime-type mapping in the Nutch config and define your mappings. Enable it by setting the property moreIndexingFilter.mapMimeTypes to true, and add the following to your mapping config: cat conf/contenttype-mapping.txt # Target content type type1 [ type2 ...] text/html applic
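The mapping file above is cut off; presumably it continues along these lines. A sketch of a conf/contenttype-mapping.txt that folds application/xhtml+xml into text/html (columns are whitespace-separated, target type first; verify the exact layout against the comment line shipped with trunk):

```
# Target content type    type1 [ type2 ...]
text/html    application/xhtml+xml
```

With moreIndexingFilter.mapMimeTypes set to true, the more indexing filter would then index both mime types under text/html.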

RE: Indexing-time URL filtering again

2012-11-25 Thread Markus Jelsma
You should provide the log output. -Original message- > From:Joe Zhang > Sent: Sun 25-Nov-2012 17:27 > To: user@nutch.apache.org > Subject: Re: Indexing-time URL filtering again > > I actually checked out the most recent build from SVN, Release 1.6 - > 23/11/2012. > > The following co

Re: shouldFetch rejected

2012-11-25 Thread Sebastian Nagel
> But, i create a complete new crawl dir for every crawl. Then all should work as expected. > why does the crawler set a "page to fetch" to rejected? Because obviously > the crawler never saw this page before (because i deleted all the old crawl > dirs). > In the crawl log i see many page to fetch

RE: problem with text/html content type of documents appears application/xhtml+xml in solr index

2012-11-25 Thread Eyeris Rodriguez Rueda
Thanks a lot Markus for your answer. My English is not so good. I was reading but I don't know how to fix the problem yet. Could you explain the solution in detail, please? I was looking in the conf directory but I can't find how to map one mime type to another. I need to replace the index-more plug

Re: Indexing-time URL filtering again

2012-11-25 Thread Joe Zhang
I actually checked out the most recent build from SVN, Release 1.6 - 23/11/2012. The following command bin/nutch solrindex -Durlfilter.regex.file=.UrlFiltering.txt http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* -filter produced the following output: Solr

RE: How to generate directed graph image file using Nutch linkdb, webgraphdb etc.

2012-11-25 Thread A Geek
Hi, I would appreciate it if someone could give me some pointers on the following issue. Any pointers on how to use the Nutch webgraphdb, outlinks, inlinks etc. for generating a directed graph would be helpful. Thanks in advance. Thanks, DW > From: dw...@live.com > To: user@nutch.apache.org > Subject: Ho

Re: mime type text/plain

2012-11-25 Thread Sourajit Basak
DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.txt.TXTParser for mime-type text/plain The above indicates Tika is fired. But somehow I need to tell Tika to use HtmlParser for mime-type text/plain. Have to dig into Tika docs. Is it possible to do anything in Nutch ? On Sun, Nov
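One Nutch-side option (rather than digging into the Tika docs) is to route the mime type to a different parse plugin in conf/parse-plugins.xml, e.g. pointing text/plain at the parse-html plugin instead of parse-tika. A sketch, assuming parse-html is enabled in plugin.includes (the alias block below mirrors the one shipped in Nutch's default parse-plugins.xml; verify it against your own copy):

```xml
<parse-plugins>
  <!-- hand text/plain to Nutch's HTML parser instead of Tika -->
  <mimeType name="text/plain">
    <plugin id="parse-html" />
  </mimeType>
  <aliases>
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
  </aliases>
</parse-plugins>
```

Whether HtmlParser copes well with plain-text bodies would need testing, but this is the supported way to override which parser Nutch picks per mime type.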

Re: scoring (v1.5)

2012-11-25 Thread Sourajit Basak
You're saying that linkrank doesn't have any effect on the subsequent generate phase? On Sun, Nov 25, 2012 at 6:14 PM, parnab kumar wrote: > Hi Sourajit, > I don't know about nutch 1.5 but in nutch 1.4 the following happens, i > guess (probably the same applies for nutch 1.5 as well) : > >

Re: Indexing-time URL filtering again

2012-11-25 Thread Joe Zhang
How exactly do I get to trunk? I did download NUTCH-1300-1.5-1.patch, ran the patch command correctly, and re-built nutch. But the problem still persists... On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma wrote: > No, this is no bug. As I said, you need either to patch your Nutch or
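For reference, "getting trunk" at the time meant a Subversion checkout followed by an Ant build; a sketch (directory name is illustrative):

```
svn checkout https://svn.apache.org/repos/asf/nutch/trunk nutch-trunk
cd nutch-trunk
ant runtime    # build output lands in runtime/local and runtime/deploy
```

The trunk build already includes the -filter option for solrindex, so no patching is needed in that case.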

Re: scoring (v1.5)

2012-11-25 Thread parnab kumar
Hi Sourajit, I don't know about nutch 1.5, but in nutch 1.4 the following happens, I guess (probably the same applies for nutch 1.5 as well): To create the webgraph you run the webgraph command. Scoring is not affected here. Next you need to run linkRank (this will compute the link rank

RE: scoring (v1.5)

2012-11-25 Thread Markus Jelsma
Hi - Scoring filters can run in several stages but the webgraph and linkrank programs must be run separately. After the graph has been iterated over you can update your crawldb with the score from the graph using the scoreupdater program. -Original message- > From:Sourajit Basak > Sen
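The separate stages Markus describes correspond to three bin/nutch commands; a sketch of the sequence, with crawl/ paths assumed (see the NewScoringIndexingExample wiki page linked elsewhere in this digest for the full walkthrough):

```
bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb   # build the link graph
bin/nutch linkrank -webgraphdb crawl/webgraphdb                              # iterate link rank over the graph
bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb   # push graph scores into the crawldb
```

Only after scoreupdater has run do the graph-derived scores influence the next generate phase.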

RE: problem with text/html content type of documents appears application/xhtml+xml in solr index

2012-11-25 Thread Markus Jelsma
Hi - trunk's more indexing filter can map mime types to any target. With it you can map both (x)html mimes to text/html or to `web page`. https://issues.apache.org/jira/browse/NUTCH-1262 -Original message- > From:Eyeris Rodriguez Rueda > Sent: Sun 25-Nov-2012 00:48 > To: user@nutch.

RE: Indexing-time URL filtering again

2012-11-25 Thread Markus Jelsma
No, this is no bug. As I said, you need either to patch your Nutch or get the sources from trunk. The -filter parameter is not in your version. Check the patch manual if you don't know how it works. $ cd trunk ; patch -p0 < file.patch -Original message- > From:Joe Zhang > Sent: Sun 25

Re: How to extract fetched files(pdf)?

2012-11-25 Thread hudvin
I found a better solution - Heritrix :). It just works, except for the terrible Spring config.

How to generate directed graph image file using Nutch linkdb, webgraphdb etc.

2012-11-25 Thread A Geek
Hi All, I've been learning Nutch 1.5 over the last couple of weeks, so far using these links: http://wiki.apache.org/nutch/NutchTutorial and http://wiki.apache.org/nutch/NewScoringIndexingExample I'm able to crawl a list of sites with a seed list of 1000 urls. I created the webgraphdb using on
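One way to get a directed-graph image out of the link data is to dump the linkdb to text and convert the dump to Graphviz DOT. The awk sketch below assumes a hypothetical dump layout (a target-URL line ending in "Inlinks:" followed by indented "fromUrl:" lines); check the real layout produced by `bin/nutch readlinkdb crawl/linkdb -dump out` and adjust the patterns to match:

```shell
# Hypothetical linkdb dump excerpt, inlined here for illustration:
cat > linkdb-dump.txt <<'EOF'
http://www.nutch.org/ Inlinks:
 fromUrl: http://www.apache.org/ anchor: Nutch
EOF

# Convert "fromUrl -> target" pairs into a DOT digraph.
awk '
  BEGIN { print "digraph linkgraph {" }
  /Inlinks:$/ { target = $1; next }           # remember the current target URL
  /fromUrl:/  { for (i = 1; i <= NF; i++)     # emit one edge per inlink
                  if ($i == "fromUrl:")
                    printf "  \"%s\" -> \"%s\";\n", $(i + 1), target }
  END { print "}" }
' linkdb-dump.txt > linkgraph.dot

# Render the image with Graphviz:
# dot -Tpng linkgraph.dot -o linkgraph.png
```

The same approach works for the webgraphdb's outlink dumps; only the field patterns in the awk program change.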