First I set conf/crawl-urlfilter.txt to this:
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
+.

I can crawl "http://guide.kapook.com", but I can't crawl "http://www.kapook.com". Some pages can't be crawled at all, and I want to know why.
Also, after the crawl the index is not complete: it doesn't have the segment data files, it only has

/user/nutch/crawld/indexes/part-00000/_0.fdt    <r 1>   365
/user/nutch/crawld/indexes/part-00000/_0.fdx    <r 1>   8
/user/nutch/crawld/indexes/part-00000/_0.fnm    <r 1>   66
/user/nutch/crawld/indexes/part-00000/_0.frq    <r 1>   370
/user/nutch/crawld/indexes/part-00000/_0.nrm    <r 1>   9
/user/nutch/crawld/indexes/part-00000/_0.prx    <r 1>   611
/user/nutch/crawld/indexes/part-00000/_0.tii    <r 1>   135
/user/nutch/crawld/indexes/part-00000/_0.tis    <r 1>   10553
/user/nutch/crawld/indexes/part-00000/index.done        <r 1>   0
/user/nutch/crawld/indexes/part-00000/segments.gen      <r 1>   20
/user/nutch/crawld/indexes/part-00000/segments_2        <r 1>   41

/user/nutch/crawld/indexes/part-00001/index.done        <r 1>   0
/user/nutch/crawld/indexes/part-00001/segments.gen      <r 1>   20
/user/nutch/crawld/indexes/part-00001/segments_1        <r 1>   20
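
From the listing it looks like part-00001 has only the commit files (index.done, segments.gen, segments_1) and no _N.* segment files, so I think that part of the index is empty. To confirm this I use the small helper below (my own class, not part of Nutch); it opens each index part with the Lucene version bundled with Nutch and prints the document count. It assumes the index directories were first copied out of HDFS to local disk, e.g. with bin/hadoop fs -copyToLocal /user/nutch/crawld/indexes indexes.

import org.apache.lucene.index.IndexReader;

public class CountDocs {
    public static void main(String[] args) throws Exception {
        // args: local index directories, e.g. indexes/part-00000 indexes/part-00001
        for (String path : args) {
            IndexReader reader = IndexReader.open(path); // Lucene 2.x API bundled with Nutch
            System.out.println(path + ": " + reader.numDocs() + " docs");
            reader.close();
        }
    }
}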

How do I solve this?