We have ingested a number of HTML documents into DSpace 3.1.

These documents contain a large number of nbsp (non-breaking space)
characters.

When these documents were presented in Discovery search result snippets,
the nbsp characters appeared as "?". The characters also appeared as "?"
in the extracted text stored in the TEXT bundle.
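
One way a literal "?" can end up in extracted text is at the encoding step: Java's String.getBytes substitutes the charset's replacement byte (usually "?") for any character the target charset cannot represent. The sketch below only illustrates that behavior. The class and method names are hypothetical, the charsets are chosen for demonstration, and nothing in this post confirms which default charset the server actually uses. StandardCharsets requires Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of DSpace.
final class UnmappableDemo {

    // Encode the text with the given charset, then decode it with the same charset.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String withNbsp = "a\u00A0b";        // nbsp, code 160
        String withReplacement = "a\uFFFDb"; // replacement character, code 65533

        // US-ASCII cannot represent nbsp, so it is encoded as '?'.
        System.out.println(roundTrip(withNbsp, StandardCharsets.US_ASCII));
        // ISO-8859-1 can represent nbsp but not U+FFFD.
        System.out.println(roundTrip(withReplacement, StandardCharsets.ISO_8859_1));
        // UTF-8 can represent both, so nothing is lost.
        System.out.println(roundTrip(withNbsp, StandardCharsets.UTF_8).equals(withNbsp));
    }
}
```

If this is what happens on the server, passing an explicit charset such as UTF-8 to getBytes would avoid the substitution, at the cost of having to read the TEXT bundle with the same charset.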

Curiously, inside the HTMLFilter code (running on the server), these
characters have the value (int)65533, which is U+FFFD, the Unicode
replacement character. If I download the HTML file, the characters are
ordinary nbsp characters (160).
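
The value 65533 is what Java decoders emit for bytes they cannot interpret, so a charset mismatch while reading the file is a plausible cause. The sketch below assumes, without confirmation from this post, that the HTML files store nbsp as the single byte 0xA0 (as in ISO-8859-1) and that the stream is decoded as UTF-8 on the server. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of DSpace.
final class NbspDecodeDemo {

    // Decode the raw bytes with the given charset and return the code of the first char.
    static int firstCharCode(byte[] bytes, Charset cs) {
        return new String(bytes, cs).charAt(0);
    }

    public static void main(String[] args) {
        byte[] latin1Nbsp = { (byte) 0xA0 };            // nbsp in an ISO-8859-1 file
        byte[] utf8Nbsp = { (byte) 0xC2, (byte) 0xA0 }; // nbsp in a UTF-8 file

        // Decoded with the matching charset: 160.
        System.out.println(firstCharCode(latin1Nbsp, StandardCharsets.ISO_8859_1));
        // A lone 0xA0 is malformed UTF-8, so the decoder substitutes U+FFFD: 65533.
        System.out.println(firstCharCode(latin1Nbsp, StandardCharsets.UTF_8));
        // A correctly encoded UTF-8 nbsp decodes to 160.
        System.out.println(firstCharCode(utf8Nbsp, StandardCharsets.UTF_8));
    }
}
```

If the mismatch is real, decoding the stream with the file's actual charset would preserve the nbsp, and mapping 65533 to a space would only be needed as a fallback. Note that U+FFFD can stand in for any undecodable byte, not only nbsp.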

I intend to customize the HTMLFilter code to translate these characters
into spaces.

Could you suggest a better approach to this problem?

Thanks, Terry

    public InputStream getDestinationStream(InputStream source)
            throws Exception
    {
        // try and read the document - set to ignore character set directive,
        // assuming that the input stream is already set properly (I hope)
        HTMLEditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();

        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        kit.read(source, doc, 0);

        char[] chars = doc.getText(0, doc.getLength()).toCharArray();

        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            int ic = (int) c;
            // handle nbsp in html file
            if (ic == 160) chars[i] = ' ';
            // handle nbsp (as delivered from server) in html file
            if (ic == 65533) chars[i] = ' ';
        }

        byte[] textBytes = new String(chars).getBytes();

        ByteArrayInputStream bais = new ByteArrayInputStream(textBytes);

        return bais;
    }


-- 
Terry Brady
Applications Programmer Analyst
Lauinger Information Technology
202-687-7053