Hi,
My name is Mohamed, and I'm working on a project to integrate Nutch with
Heritrix.
I converted the *ARC* files (from Heritrix) into segments using
*ArcSegmentCreator*:

$ ./bin/nutch org.apache.nutch.tools.arc.ArcSegmentCreator <ArcFiles> \
    <ArcCrawlDir/segments>

=> However, this command prints the following messages:
Ignoring position: 22878
Ignoring position: 36616
Ignoring position: 152183
Ignoring position: 167752
Ignoring position: 293285
Ignoring position: 334078
...
Ignoring position: 54757983
Ignoring position: 54891832
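
Since the full output is long, the skipped positions can be counted from a
saved copy of it. This is only a minimal sketch, assuming the command output
was redirected to a file; the function name count_ignored and the log file
name arc.log are hypothetical and not part of Nutch:

```shell
# Hypothetical helper: count the "Ignoring position" lines in a saved log.
# Assumes the ArcSegmentCreator output was captured first, for example by
# appending `> arc.log 2>&1` to the command.
count_ignored() {
    # grep -c prints the number of matching lines (0 when there are none)
    grep -c '^Ignoring position: ' "$1"
}

# Example usage (arc.log is a placeholder name):
#   count_ignored arc.log
```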


=> In ArcCrawlDir I found all the expected files:
/nutch-1.0/ArcCrawlDir/segments/20090527165114$ ls -R

.:
content  crawl_fetch  crawl_parse  parse_data  parse_text

./content:
part-00000

./content/part-00000:
data  index

./crawl_fetch:
part-00000

./crawl_fetch/part-00000:
data  index

./crawl_parse:
part-00000

./parse_data:
part-00000

./parse_data/part-00000:
data  index

./parse_text:
part-00000

./parse_text/part-00000:
data  index

=> However, this directory is only 37.1 KB, while the ARC file is 60 MB,
which suggests that the segment content is empty.
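
The size comparison can be reproduced with standard tools. A minimal sketch,
assuming a POSIX shell with du and awk; the function name size_kb is
hypothetical and the paths in the usage lines are placeholders:

```shell
# Hypothetical helper: print the disk usage of a file or directory tree
# in kilobytes (du -s summarises the whole tree, -k reports 1024-byte units).
size_kb() {
    du -sk "$1" | awk '{print $1}'
}

# Example usage (replace the placeholders with the real paths):
#   size_kb ArcCrawlDir/segments/20090527165114
#   size_kb <ArcFiles>
```

If the segment is far smaller than the ARC input, as here, most of the
records were probably not written to it.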


Please, I need your help.
Thanks,
-- 


-=MBB=-
