both are in the list, but I guess since parse-html is listed first, it wins..
--
View this message in context:
http://lucene.472066.n3.nabble.com/Fetched-pages-has-no-content-tp3171881p3218585.html
Sent from the Nutch - User mailing list archive at Nabble.com.
!? I have been
wondering for a week or two what has changed between 1.2 and 1.3 that
would have caused such a problem.
Is there a JIRA open for the issue?
/test.php?id=123 link .../html..
but then it does nothing with the links here. I have tried changing my
filters multiple times and it just won't parse them. I also ran the
ParserChecker class and I get 0 outlinks.
protocol-httpclient is broken and needs replacing
On 19 July 2011 23:10, Anders Rask anr...@gmail.com wrote:
Hi guys!
I experimented some more, and it seems I'm only getting these problems when
using protocol-httpclient. It works fine when I use protocol-http.
Could you please try and see
Hi,
If you have a look at your regex-urlfilter.txt, it will by default be
rejecting URLs that contain ?. Please test with that line edited (or
commented out) and see if the problem fades.
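
To illustrate the point about the ? rule, here is a rough Python sketch of how a first-match-wins +/- regex filter would treat such a URL. It is not Nutch code: load_rules and accepts are hypothetical helpers, and the rule text is assumed to match the stock "-[?*!@=]" line, so check your own conf file before relying on it.

```python
import re


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments.

    Hypothetical helper: it only mimics the file format, it is not Nutch's parser.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # '+' means accept on match; anything else is treated as reject here.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides the outcome.

    A URL that matches no rule at all is rejected.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Assumed default: the rule that skips URLs with query-like characters.
DEFAULT_RULES = load_rules("""
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
""")

# Same file with the rule commented out, as suggested above.
EDITED_RULES = load_rules("""
# -[?*!@=]
+.
""")

URL = "http://www.uu.se/news/news_item.php?typ=pmid=1381"

if __name__ == "__main__":
    print("default rules accept:", accepts(DEFAULT_RULES, URL))
    print("edited rules accept: ", accepts(EDITED_RULES, URL))
```

With the rule in place the URL is dropped before it is ever fetched or parsed; with it commented out the catch-all "+." line accepts it.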
On Mon, Jul 18, 2011 at 10:11 AM, Anders Rask anr...@gmail.com wrote:
Hi Markus!
We are using a custom parser, but I
Judging from the segment, those URLs are fetched and parsed. I think maybe
some HTML parse APIs have changed between your 1.1 and 1.2 versions. If
ParserChecker shows the same issue then it's most likely a parse plugin problem
for the new version. Can you check?
Hi,
If you have a look at
As pointed out by Markus the logs show that the content has been properly
fetched. Moreover
./nutch org.apache.nutch.parse.ParserChecker 'http://www.uu.se/news/news_item.php?typ=pmid=1381'
works fine. Double check your custom parser, it is likely to be the source
of the problem.
BTW : what
Hi!
We are using Nutch to crawl a bunch of websites and index them to Solr. At
the moment we are in the process of upgrading from Nutch 1.1 to Nutch 1.3
and at the same time going from one server to two servers.
Unfortunately we are stuck with a problem which we haven't seen in the old