RE: readseg dump and non-ASCII characters

Yossi Tamari Thu, 14 Dec 2017 10:49:06 -0800

Hi Michael,

Not directly answering this question, but keep in mind that as mentioned in the 
issue Sebastian referenced, there are many more places in Nutch that have the 
same problem, so setting LC_ALL is probably a good idea in general (until that 
issue is fixed...).
If you're worried about other applications, I believe passing 
`-DLC_ALL=en_US.utf8` as a parameter to all Nutch jobs should also work.
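
For the source-code route Michael asks about in the quoted message, here is a minimal sketch of what passing an explicit charset to both constructors could look like. It is not the actual Nutch code or a tested patch: the class name `Utf8AppendSketch` is hypothetical, and plain `InputStream`/`OutputStream` arguments stand in for the Hadoop `FileSystem`/`Path` handling so the example runs on its own. The loop itself follows the quoted `append` function.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the append() logic quoted from SegmentReader.java.
// Assumption: the dump files are UTF-8, so both the reader and the writer are
// given the charset explicitly instead of relying on the platform default.
final class Utf8AppendSketch {

  static int append(InputStream in, OutputStream out, int currentRecordNumber)
      throws IOException {
    // Explicit UTF-8 on the writing side.
    PrintWriter writer = new PrintWriter(
        new OutputStreamWriter(out, StandardCharsets.UTF_8));
    // Explicit UTF-8 on the reading side; try-with-resources closes the reader.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line = reader.readLine();
      while (line != null) {
        // Renumber records exactly as the quoted code does.
        if (line.startsWith("Recno:: ")) {
          line = "Recno:: " + currentRecordNumber++;
        }
        writer.println(line);
        line = reader.readLine();
      }
    }
    writer.flush();
    return currentRecordNumber;
  }

  public static void main(String[] args) throws IOException {
    // A headline with curly quotes, the kind of character that showed up as '?'.
    String headline = "\u201cquoted\u201d headline";
    byte[] input = ("Recno:: 0\n" + headline + "\n")
        .getBytes(StandardCharsets.UTF_8);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int next = append(new ByteArrayInputStream(input), out, 7);
    String result = new String(out.toByteArray(), StandardCharsets.UTF_8);
    System.out.println("next record number: " + next);
    System.out.println("headline survived: " + result.contains(headline));
  }
}
```

Passing the string "UTF-8" instead of `StandardCharsets.UTF_8` also works with these constructors, at the cost of a checked `UnsupportedEncodingException`. Either way, the result no longer depends on the default encoding of whichever machine runs the job.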


        Yossi.


> -----Original Message-----
> From: Michael Coffey [mailto:mcof...@yahoo.com.INVALID]
> Sent: 14 December 2017 20:30
> To: user@nutch.apache.org
> Subject: Re: readseg dump and non-ASCII characters
> 
> Not sure it's practical to go around to all the hadoop machines and change
> their default encoding settings. Not sure it wouldn't break something else!
> 
> I'm wondering if there's a simple fix I could make to the source code to make
> nutch.segment.SegmentReader use utf-8 as a default when reading the segment
> data.
> 
> 
> 
> In SegmentReader.java, the only obvious file-reading code I see is in this
> append function.
>   private int append(FileSystem fs, Configuration conf, Path src,
>       PrintWriter writer, int currentRecordNumber) throws IOException {
>     BufferedReader reader = new BufferedReader(new InputStreamReader(
>         fs.open(src)));
>     try {
>       String line = reader.readLine();
>       while (line != null) {
>         if (line.startsWith("Recno:: ")) {
>           line = "Recno:: " + currentRecordNumber++;
>         }
>         writer.println(line);
>         line = reader.readLine();
>       }
>       return currentRecordNumber;
>     } finally {
>       reader.close();
>     }
>   }
> 
> 
> SegmentReader has three different lines that create an OutputStreamWriter.
> Two of those explicitly use "UTF-8", but the one that creates a PrintWriter
> implicitly uses default encoding.
> 
> If I insert a "UTF-8" arg into the InputStreamReader and OutputStreamWriter
> constructors, should that work? Is it likely to break something else?
> 
> ________________________________
> From: Sebastian Nagel <wastl.na...@googlemail.com>
> To: user@nutch.apache.org
> Sent: Wednesday, November 15, 2017 5:18 AM
> Subject: Re: readseg dump and non-ASCII characters
> 
> 
> 
> Hi Michael,
> 
> from the arguments I guess you're interested in the raw/binary HTML content,
> right?
> After a closer look I have no simple answer:
> 
> 1. HTML has no fixed encoding - it could be anything; pageA may have a
>     different encoding than pageB.
> 
> 2. That's different for parsed text: it's a Java String internally
> 
> 3. "readseg dump" converts all data to a Java String using the default
>     platform encoding. On Linux, with these locales installed, you may get
>     different results for:
>        LC_ALL=en_US.utf8  ./bin/nutch readseg -dump
>        LC_ALL=en_US       ./bin/nutch readseg -dump
>        LC_ALL=ru_RU       ./bin/nutch readseg -dump
>     If in doubt, try setting your platform encoding to UTF-8. Most pages
>     nowadays are UTF-8.
>     Btw., this behavior isn't ideal; it should be fixed as part of NUTCH-1807.
> 
> 4. A more reliable solution would require detecting the HTML encoding (the
>     code is available in Nutch) and then converting the byte[] content using
>     the right encoding.
> 
> Best,
> Sebastian
> 
> 
> 
> 
> On 11/15/2017 02:20 AM, Michael Coffey wrote:
> > Greetings Nutchlings,
> > I have been using readseg-dump successfully to retrieve content crawled by
> > nutch, but I have one significant problem: many non-ASCII characters appear
> > as '???' in the dumped text file. This happens fairly frequently in the
> > headlines of news sites that I crawl, for things like quotes, apostrophes,
> > and dashes.
> > Am I doing something wrong, or is this a known bug? I use a python utf8
> > decoder, so it would be nice if everything were UTF8.
> > Here is the command that I use to dump each segment (using nutch 1.12):
> >   bin/nutch readseg -dump segPath destPath -noparse -noparsedata -noparsetext -nogenerate
> > It is so close to working perfectly!
> >
