Re: NUTCH1.2 ,the specific format of the dump text file?

suzhaolong Mon, 13 May 2013 19:52:24 -0700

Thanks for your reply, Lewis.
I don't think I made my question easy to understand. In detail: I want to
get the body text of Japanese web pages, but they use many different
character encodings, such as Shift_JIS, EUC-JP... Nutch now parses the
web pages correctly, and it saves the text in:
\segments\20130514095644\parse_text,
then I run the command:
bin/nutch readseg -dump crawled/segments/20130514095644 segdb -nocontent
-noparsedata -nofetch -noparse -nogenerate 
and I get a text file named "dump" in the folder segdb. I am just
wondering: what is the specific format of the "dump" file?


Thank you so much,
suzhaolong


