I am trying to test the integrity of the crawl before using Tomcat. I am using
Nutch 1.0 on Ubuntu 9.04. Could someone tell me why I am getting "Total
hits: 0"?
Regards,
Sanjeev
Here are the config files that I have configured.
conf/crawl-urlfilter.txt
Can you provide the HTTP headers and HEAD of the HTML of a Web page
for which Nutch fails? Perhaps there is an inconsistency between HTTP
and META headers or a misspelled codepage? Just a wild guess, but
believe me -- Java does convert fine between Cp1250, Iso8859-2 and
internal UTF-16 so there mus
Here's a recently announced event search engine:
http://searchengineland.com/what-where-when-travel-local-search-combine-goby-com-26395
Just heard of it today.
Brian Ulicny
On Wed, 23 Sep 2009 09:27 +0200, "Michael Wechner"
wrote:
> Mitia Notaras wrote:
> > Hi there,
> >
> > The two event se
Hi again Martina :)
Now I don't get the error when crawling, since I used
addIndexBackendOptions(Configuration conf) to add my fields. Thanks for
that. But still dont see an empty field (dc.subject) in my index when listing
it with Luke... I guess I have to write the parser to extract it and ad
I had the same little big problem - everything seemed OK:
- bin/nutch org.apache.nutch.searcher.NutchBean ... [in my
case search query = "apache"] in cygwin returns 62 Total hits on crawled
"+^http://([a-z0-9]*\.)*apache.org/"
- Nutch in Tomcat webapp after deploy seemed fine (no errors)
- I ha
Yes, I saw the differences, and I wrote my index-cutom like the index-more plugin
(nutch-1.0).
But I guess you're right! I didn't use the addFieldOptions method to add my custom
fields' information...
So I will add them in this method. As for the parser, I first have to see
how the htmlparser is made
Hi,
the howtos you're referring to are for Nutch 0.9. In Nutch 1.0 the indexing
system changed a little bit.
If you look at the index-basic or index-more plugin you see that the doc.add
method changed.
It's no longer doc.add(new Field("category", "puppies", false, true, false)) ->
here you cr
hi, thank you for your answer...
i was talking about this howto :
CreateNewFilter
Howto
add a category metadata to your index and be able to search for it. For
this, you need to write an indexing filter and a query filter.
Indexing your custom metadata
For the
indexing filter, copy the index-mo
Hi:
I'm following the steps to run the Nutch 1.0 release with Eclipse on Windows
described at this link:
http://wiki.apache.org/nutch/RunNutchInEclipse1.0
I'm trying to build it, but when I launch the war target I get this error:
C:\ECLIPSE321\workspace\nutch-1.0\build.xml:62: Specify at least one
so
At last someone answers.
Correct CP1250.
My pages look fine in the browsers of course, but it does not mean Nutch
handles them properly.
What I'm wondering is whether the Nutch HTML parser reads them properly,
because when I search for such characters it fails on pages in iso8859-2 or
cp1250, but
Exactly! sorry for being so confusing in my original question.
Jesse
int GetRandomNumber()
{
return 4; // Chosen by fair roll of dice
// Guaranteed to be random
} // xkcd.com
On Wed, Sep 23, 2009 at 4:45 AM, Alexander Aristov <
alexander.aris...@gmail.com> wrote:
> Ok, I
Polish Web sites use Cp1250 (windows-1250) or iso8859-2 (or UTF-8 of
course). Check if diacritics like these:
ęółąśćżń
look all right in the above encodings and use appropriately.
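
One way to run that check outside Nutch is a small round-trip test in plain Java. This is only a sketch, not anything taken from Nutch itself: the class name EncodingCheck and its helper methods are made up for illustration, and it assumes the JVM ships the windows-1250 and ISO-8859-2 charsets (Charset.forName throws otherwise). It also decodes windows-1250 bytes as ISO-8859-2 to mimic a page whose declared charset does not match its real one.

```java
import java.nio.charset.Charset;

public class EncodingCheck {
    // The Polish diacritics from the message, written as Unicode escapes so the
    // example does not depend on the encoding of this source file.
    public static final String DIACRITICS =
            "\u0119\u00f3\u0142\u0105\u015b\u0107\u017c\u0144";

    /** Encodes text with one charset and decodes the resulting bytes with another. */
    public static String transcode(String text, String encodeAs, String decodeAs) {
        byte[] bytes = text.getBytes(Charset.forName(encodeAs));
        return new String(bytes, Charset.forName(decodeAs));
    }

    /** True if the text survives an encode/decode round trip in the given charset. */
    public static boolean roundTrips(String text, String charsetName) {
        return text.equals(transcode(text, charsetName, charsetName));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"windows-1250", "ISO-8859-2", "UTF-8"}) {
            System.out.println(name + " round trip ok: " + roundTrips(DIACRITICS, name));
        }
        // Hypothetical mismatch: bytes written as windows-1250, read as ISO-8859-2.
        String misread = transcode(DIACRITICS, "windows-1250", "ISO-8859-2");
        System.out.println("mislabeled page decodes correctly: "
                + DIACRITICS.equals(misread));
    }
}
```

If every round trip succeeds but the mislabeled case does not, that points at the declared charset rather than at Java's converters: some of these letters (for example ą and ś) have different byte values in windows-1250 and ISO-8859-2.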
Dawid
On Wed, Sep 16, 2009 at 4:47 PM, MilleBii wrote:
> same thing when there is
> charset=ISO-8859-2
>
> 2009/9/
Ok, I will paraphrase the question.
Suppose I want to use distributed search with 3 servers: one primary and
two secondary nodes.
I create a single BIG index with a distributed crawler running on other computers.
Now I want to split this single BIG index into two parts to put on the search
nodes.
How ca
Mitia Notaras wrote:
Hi there,
The two event search engines I found are down :
betherebesquare.com
and
BusyTonight.com
I would like your advice :
Is it difficult to build one?
I guess it depends on the details of the requirements. Do you have a
requirements sheet?
I have knowledge of web