Thanks for the welcome,

The issue is due to the encoding of the file name. To fix it, I needed to
make two changes in FileResponse.java in the protocol-file plugin.

The fixes were a temporary solution, so I hard-coded the encoding to "utf-8".
It would be a better idea to read the encoding from the configuration.
I noticed that there is a section for the protocol-file plugin, like

<property>
  <name>file.crawl.parent</name>

What is the guideline for adding properties to nutch-default.xml? I am
thinking of using file.name.encoding.
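
Roughly what I have in mind is the sketch below. It is only an illustration
under assumptions: java.util.Properties stands in for the real configuration
object, file.name.encoding is just the name proposed above (it does not exist
in nutch-default.xml today), and the class and method names are made up. It
is not the actual change to FileResponse.java.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.Properties;

class FileNameEncodingSketch {
    // Assumption: proposed property name from this thread, not an existing one.
    static final String ENCODING_KEY = "file.name.encoding";
    // Fallback used when the property is not set.
    static final String DEFAULT_ENCODING = "UTF-8";

    // Look up the file name encoding, falling back to UTF-8.
    static String encodingFrom(Properties conf) {
        return conf.getProperty(ENCODING_KEY, DEFAULT_ENCODING);
    }

    // Percent-encode a file name using the configured encoding.
    static String encodeName(String name, Properties conf)
            throws UnsupportedEncodingException {
        return URLEncoder.encode(name, encodingFrom(conf));
    }

    // Decode a percent-encoded file name using the configured encoding.
    static String decodeName(String raw, Properties conf)
            throws UnsupportedEncodingException {
        return URLDecoder.decode(raw, encodingFrom(conf));
    }

    public static void main(String[] args) throws Exception {
        Properties conf = new Properties();
        conf.setProperty(ENCODING_KEY, "UTF-8");

        String name = "\u4e2d\u6587.txt"; // a CJK file name
        String encoded = encodeName(name, conf);
        System.out.println(encoded); // prints %E4%B8%AD%E6%96%87.txt
        System.out.println(decodeName(encoded, conf).equals(name)); // prints true
    }
}
```

With the encoding taken from configuration, a site whose file names are not
UTF-8 could override the value instead of patching the plugin.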

Cheers,

Ye

On Fri, Aug 31, 2012 at 10:42 PM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi Ye,
>
> Please feel free to comment fully on any issue you find on the Nutch Jira.
> If you find other/additional bugs or improvements which are not already
> opened on the Jira instance then please feel free to open ones once
> you are sure they are not duplicates and/or can be resolved via the
> user@ list.
>
> As Markus has explained on NUTCH-968, if you could check out trunk and
> provide a patch against it, this would be excellent. Test cases are
> also very welcome.
>
> Thank you very much for your input.
>
> Lewis
>
> On Fri, Aug 31, 2012 at 3:15 PM, Ye T Thet <yethura.t...@gmail.com> wrote:
> > Hi Folks,
> >
> > There is an issue with the protocol-file plugin while fetching files that
> > contain CJK characters in the file name (JIRA NUTCH-968).
> >
> > After I checked the code, I discovered that the problem is due to the
> > encoding of the file name while fetching the directory. After changing a
> > couple of lines as discussed in JIRA NUTCH-968, the issue is resolved.
> >
> > I see the issue is still open in JIRA and the latest Nutch release has no
> > fix for it yet. I would like to discuss the solution I have further here
> > on the list and submit the patch once it is fine.
> >
> > Anyone in for it? I can elaborate further on the fix.
> >
> > Cheers,
> >
> > Ye
>
>
>
> --
> Lewis
>
