Thanks for your reply, Lewis.
I don't think I made my question clear. In detail: I want to get the body
text of Japanese web pages, but they come in many different character
encodings, such as Shift_JIS, EUC-JP... Nutch now parses the web pages
correctly and saves the text in:
\segments\20130514095644\parse_text
Then I execute the command:
bin/nutch readseg -dump crawled/segments/20130514095644 segdb -nocontent
-noparsedata -nofetch -noparse -nogenerate 
and I get a text file named "dump" in the folder segdb. I am just
wondering what the specific format of the "dump" file is.
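
In case it helps to narrow the question: I have not confirmed which character
encoding the dump is written in. Below is a minimal Python sketch of how it
could be checked by trying a few candidate encodings in order. The function
name detect_encoding, the candidate list, and the path segdb/dump are only
assumptions for illustration, not anything taken from Nutch itself.

```python
from pathlib import Path

# Candidate encodings to try, in order. UTF-8 goes first because it is the
# strictest of the three; this list is an assumption, not a Nutch setting.
CANDIDATES = ("utf-8", "euc_jp", "shift_jis")


def detect_encoding(data: bytes, candidates=CANDIDATES):
    """Return the first candidate encoding that decodes data without error.

    Returns None if no candidate works. This is a heuristic: a clean decode
    does not prove that the file was really written in that encoding.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


if __name__ == "__main__":
    # Hypothetical location of the file produced by "readseg -dump".
    dump_path = Path("segdb") / "dump"
    if dump_path.exists():
        raw = dump_path.read_bytes()
        encoding = detect_encoding(raw)
        print("first matching encoding:", encoding)
        if encoding is not None:
            # Show the first few lines to inspect the record layout by eye.
            for line in raw.decode(encoding).splitlines()[:20]:
                print(line)
    else:
        print("no dump file found at", dump_path)
```

If the script reports an encoding, the first lines it prints should at least
show how the records in the file are laid out.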

Thank you so much,
suzhaolong



