Thanks for your reply, Lewis. I don't think I made my question clear. In detail: I want to get the body text of Japanese web pages, but they come in many different encodings, such as Shift_JIS and EUC-JP. Nutch now parses the pages correctly and saves the text in \segments\20130514095644\parse_text. I then run the command:

bin/nutch readseg -dump crawled/segments/20130514095644 segdb -nocontent -noparsedata -nofetch -noparse -nogenerate

and I get a text file named "dump" in the folder segdb. I am just wondering: what is the specific format of the file "dump"?
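To make the question concrete, here is a minimal sketch of how such a dump might be split into per-page records. It rests on assumptions that are not confirmed anywhere in this thread: that each record starts with a "Recno:: N" line followed by a "URL:: ..." line, that the page text follows a "ParseText::" line, and that the file is written as UTF-8 (it may instead use the platform default encoding, in which case pass a different `encoding`). The names `parse_dump` and `read_dump` are hypothetical helpers, not part of Nutch.

```python
import re

# Assumption: every record in the dump begins with a line like "Recno:: 0".
RECORD_START = re.compile(r"^Recno:: \d+$", re.MULTILINE)


def parse_dump(text):
    """Split dump text into a dict mapping URL -> parsed page text."""
    records = {}
    # The first piece is whatever precedes the first record marker; skip it.
    for chunk in RECORD_START.split(text)[1:]:
        # Assumption: the record's URL is on a line starting with "URL:: ".
        url_match = re.search(r"^URL:: (.*)$", chunk, re.MULTILINE)
        if url_match is None:
            continue
        # Assumption: the page text follows a "ParseText::" line.
        marker = "ParseText::\n"
        pos = chunk.find(marker)
        body = chunk[pos + len(marker):].strip() if pos != -1 else ""
        records[url_match.group(1).strip()] = body
    return records


def read_dump(path, encoding="utf-8"):
    """Read a dump file from disk; the encoding is an assumption, adjust as needed."""
    with open(path, encoding=encoding, errors="replace") as f:
        return parse_dump(f.read())
```

If the assumed layout holds, `read_dump("segdb/dump")` would return one entry per crawled URL. If the Japanese text comes out garbled, the encoding assumption is the first thing to check.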
Thank you so much,
suzhaolong

--
View this message in context: http://lucene.472066.n3.nabble.com/NUTCH1-2-the-specific-format-of-the-dump-text-file-tp4062845p4063132.html
Sent from the Nutch - User mailing list archive at Nabble.com.

