Thanks for the welcome. The issue is due to the encoding of the file name. To fix it, I needed to make two changes in FileResponse.java in the protocol-file plugin.
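For illustration only, here is a minimal sketch of handling a file name with an explicit charset rather than the platform default. It is not the actual patch to FileResponse.java. The class FileNameCodec and its methods are hypothetical, and it assumes the names travel percent-encoded in a file: URL, which this message does not confirm.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper, not part of Nutch.
class FileNameCodec {
    // Hard-coded charset name, chosen here only for the sketch.
    static final String ENCODING = "UTF-8";

    // Decode a percent-encoded name using the explicit charset.
    // Note: URLDecoder also turns '+' into a space (form encoding rules).
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, ENCODING);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Percent-encode a name using the same explicit charset.
    static String encode(String name) {
        try {
            return URLEncoder.encode(name, ENCODING);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String name = "\u4e2d\u6587.txt"; // a CJK file name
        String encoded = encode(name);
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(name));
    }
}
```

Using the same charset on both sides is what makes the round trip reliable; relying on the platform default can give different results on different machines.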
The fixes were a temporary solution, so I hard-coded the encoding to "utf-8". It would be a better idea to read the encoding from the configuration. I noticed that there is a section for the protocol-file plugin, like

<property>
  <name>file.crawl.parent</name>

What is the guideline for adding properties to nutch-default.xml? I am thinking of using file.name.encoding.

Cheers,

Ye

On Fri, Aug 31, 2012 at 10:42 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

> Hi Ye,
>
> Please feel free to comment fully on any issue you find on the Nutch Jira.
> If you find other/additional bugs or improvements which are not already
> opened on the Jira instance then please feel free to open ones once
> you are sure they are not duplicates and/or can be resolved via the
> user@ list.
>
> As Markus has explained on NUTCH-968 if you could check out trunk and
> provide a patch against it, this would be excellent. Test cases are
> also very welcome as well.
>
> Thank you very much for your input.
>
> Lewis
>
> On Fri, Aug 31, 2012 at 3:15 PM, Ye T Thet <yethura.t...@gmail.com> wrote:
> > Hi Folks,
> >
> > There is an issue with the protocol-file plugin while fetching files that
> > contain CJK characters in the file name. JIRA Nutch 968
> >
> > After I checked the code, I discovered that the problem is due to the
> > encoding in the file name while fetching the directory. After changing a
> > couple of lines as discussed in the JIRA Nutch 968, the issue is resolved.
> >
> > I see the issue is still open in JIRA and the latest nutch release has no
> > fix in it yet. I would like to discuss the solution I have further here on
> > the list and submit the patch once it is fine.
> >
> > Anyone in for it? I can elaborate further on the fix.
> >
> > Cheers,
> >
> > Ye
>
> --
> Lewis
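The idea raised above, reading the encoding from the configuration instead of hard-coding it, could look roughly like the sketch below. It is only an assumption of how such a lookup might work: java.util.Properties stands in for the Nutch configuration object, the class FileNameEncodingConfig is hypothetical, and file.name.encoding is the property name proposed in this message, not an existing entry in nutch-default.xml.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper, not part of Nutch.
class FileNameEncodingConfig {
    // Property name proposed in the message; not an existing Nutch property.
    static final String KEY = "file.name.encoding";

    // Resolve the charset from the configuration, falling back to UTF-8
    // when the property is missing or names an unknown charset.
    static Charset resolve(Properties conf) {
        String name = conf.getProperty(KEY, "utf-8");
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(resolve(conf)); // property not set, so the default applies
        conf.setProperty(KEY, "ISO-8859-1");
        System.out.println(resolve(conf));
    }
}
```

Falling back to UTF-8 keeps the current hard-coded behaviour as the default, so existing setups would not need to change anything.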