What do you mean by the "job file"?
On Sun, Nov 25, 2012 at 10:43 PM, AC Nutch wrote:
Hello,
I am using Nutch 1.5.1 and I am looking to do something specific with it. I
have a few million base domains in a Solr index, so for example:
http://www.nutch.org, http://www.apache.org, http://www.whatever.com etc. I
am trying to crawl each of these base domains in deploy mode and retrieve
OK. I'm testing it. But like I said, even when I reduce the patterns to the
simplest form "-.", the problem still persists.
On Sun, Nov 25, 2012 at 3:59 PM, Markus Jelsma
wrote:
It's taking input from stdin; enter some URLs to test it. You can add an issue
with reproducible steps.
-Original message-
> From:Joe Zhang
> Sent: Sun 25-Nov-2012 23:49
> To: user@nutch.apache.org
> Subject: Re: Indexing-time URL filtering again
I ran the regex tester command you provided. It seems to be taking forever
(15 min + by now).
On Sun, Nov 25, 2012 at 3:28 PM, Joe Zhang wrote:
You mean the content of my pattern file?
Well, even when I reduce it to simply "-.", the same problem still pops up.
On Sun, Nov 25, 2012 at 3:30 PM, Markus Jelsma
wrote:
You seem to have an NPE caused by your regex rules, for some weird reason. If
you can provide a way to reproduce it, you can file an issue in Jira. This NPE
should also occur if you run the regex tester.
nutch -Durlfilter.regex.file=path org.apache.nutch.net.URLFilterChecker
-allCombined
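For anyone trying this, a sketch of feeding the checker test URLs on stdin (the rules-file path here is an assumption; adjust it to wherever your regex-urlfilter.txt actually lives):

```shell
# Pipe test URLs into the filter checker; it reads one URL per line
# from stdin and reports whether each is accepted or rejected.
# NOTE: conf/regex-urlfilter.txt is an assumed path, not from the thread.
printf 'http://www.apache.org/\nhttp://www.nutch.org/\n' | \
  bin/nutch -Durlfilter.regex.file=conf/regex-urlfilter.txt \
    org.apache.nutch.net.URLFilterChecker -allCombined
```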
In the
The last few lines of hadoop.log:
2012-11-25 16:30:30,021 INFO indexer.IndexingFilters - Adding
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2012-11-25 16:30:30,026 INFO indexer.IndexingFilters - Adding
org.apache.nutch.indexer.metadata.MetadataIndexer
2012-11-25 16:30:30,218 WARN mapre
Hi - you need to enable mime-type mapping in Nutch config and define your
mappings. Enable it with:
<property>
  <name>moreIndexingFilter.mapMimeTypes</name>
  <value>true</value>
</property>
and add the following to your mapping config:
cat conf/contenttype-mapping.txt
# Target content type type1 [ type2 ...]
text/html applic
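To illustrate the file format, a hypothetical mapping line (the source mime type on the right is my assumption for illustration, not taken from the thread):

```
# Target content type    type1 [ type2 ...]
text/html    application/xhtml+xml
```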
You should provide the log output.
-Original message-
> From:Joe Zhang
> Sent: Sun 25-Nov-2012 17:27
> To: user@nutch.apache.org
> Subject: Re: Indexing-time URL filtering again
> But I create a completely new crawl dir for every crawl.
Then all should work as expected.
> Why does the crawler set a "page to fetch" to rejected? Because obviously
> the crawler never saw this page before (because I deleted all the old crawl
> dirs).
> In the crawl log I see many page to fetch
Thanks a lot Markus for your answer. My English is not so good.
I was reading, but I don't know how to fix the problems yet. Could you explain
the solution to me in detail, please? I was looking in the conf directory but I
can't find how to map one mime type to another. I need to replace the index-more plug
I actually checked out the most recent build from SVN, Release 1.6 -
23/11/2012.
The following command
bin/nutch solrindex -Durlfilter.regex.file=.UrlFiltering.txt
http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/
crawl/segments/* -filter
produced the following output:
Solr
Hi, I would appreciate it if someone could give me some pointers on the
following issue. Any pointers on how to use the Nutch webgraphdb, outlinks,
inlinks, etc. for generating a directed graph would be helpful. Thanks in advance.
Thanks, DW
> From: dw...@live.com
> To: user@nutch.apache.org
> Subject: Ho
DEBUG tika.TikaParser - Using Tika parser
org.apache.tika.parser.txt.TXTParser for mime-type text/plain
The above indicates that Tika fired. But somehow I need to tell Tika to use the
HtmlParser for mime-type text/plain. I have to dig into the Tika docs.
Is it possible to do anything in Nutch?
On Sun, Nov
You're saying that linkrank doesn't have any effect on the subsequent
generate phase?
On Sun, Nov 25, 2012 at 6:14 PM, parnab kumar wrote:
How exactly do I get to trunk?
I did download NUTCH-1300-1.5-1.patch, ran the patch command
correctly, and rebuilt Nutch. But the problem still persists...
On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma
wrote:
Hi Sourajit,
I don't know about Nutch 1.5, but in Nutch 1.4 the following happens, I
guess (probably the same applies to Nutch 1.5 as well):
To create the webgraph you run the webgraph command. Scoring is not
affected here. Next you need to run linkrank (this will compute the link
rank
Hi - Scoring filters can run in several stages, but the webgraph and linkrank
programs must be run separately. After the graph has been iterated over, you can
update your crawldb with the score from the graph using the scoreupdater
program.
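A sketch of that sequence as shell commands (the directory names such as crawl/webgraphdb are assumptions for illustration; check the usage output of each command in your version for the exact flags):

```shell
# 1. Build the link graph from the fetched segments
bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb

# 2. Iterate LinkRank over the graph
bin/nutch linkrank -webgraphdb crawl/webgraphdb

# 3. Write the computed scores back into the crawldb
bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb
```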
-Original message-
> From:Sourajit Basak
> Sen
Hi - trunk's more indexing filter can map mime types to any target. With it you
can map both (x)html mimes to text/html or to `web page`.
https://issues.apache.org/jira/browse/NUTCH-1262
-Original message-
> From:Eyeris Rodriguez Rueda
> Sent: Sun 25-Nov-2012 00:48
> To: user@nutch.
No, this is no bug. As I said, you need to either patch your Nutch or get the
sources from trunk. The -filter parameter is not in your version. Check the
patch manual if you don't know how it works.
$ cd trunk ; patch -p0 < file.patch
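For the earlier question about how to get to trunk, a sketch of one way to do it (the SVN URL reflects the usual Apache repository layout and is my assumption; verify it against the Nutch site):

```shell
# Check out the development sources, apply the patch, and rebuild
svn checkout https://svn.apache.org/repos/asf/nutch/trunk nutch-trunk
cd nutch-trunk
patch -p0 < NUTCH-1300-1.5-1.patch
ant
```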
-Original message-
> From:Joe Zhang
> Sent: Sun 25
I found a better solution - Heritrix :). It just works, except for the terrible
Spring config.
Hi All, I've been learning Nutch 1.5 for the last couple of weeks, so far
using these links: http://wiki.apache.org/nutch/NutchTutorial and
http://wiki.apache.org/nutch/NewScoringIndexingExample. I'm able to crawl a list
of sites with a seed list of 1000 urls. I created the webgraphdb using on