Quoting Phlip <phlip2...@gmail.com>:
> 
> Jeffrey L. Taylor wrote:
> > I am parsing XML streams with ruby-libxml using the XML::Reader class.
> > Several have invalid UTF-8 characters.  I need a tutorial or at least some
> > hints on how to recover and continue the parsing.
> 
> Why not scrub them with Ruby's built-in iconv first?
> 
> And what are they doing to you and ruby-libxml? I have found libxml2 
> suspiciously forgiving, so far...
> 
It throws an exception.  It took a bunch of digging to find that line 835,
character 418 is truly not a UTF-8 character (octal 240, maybe a Latin-1
character?).  I'd like to delete it or replace it with a question mark and
continue parsing.  It is a rather large file, so I'd rather not read the whole
thing into memory to correct it.  I suppose I could wrap the read function in
a cleanup function, but it is messy trying to keep UTF-8 state across partial
reads.
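
For what it's worth, here is a rough sketch of what such a cleanup wrapper
might look like.  It assumes a Ruby new enough to have String#scrub (2.1 or
later) rather than iconv, and the names each_scrubbed_chunk and
incomplete_tail_length are made up for illustration.  The only state kept
between reads is the last few bytes, held back when they look like the start
of a multi-byte sequence that the next read may complete.  How the cleaned
chunks get handed to XML::Reader is left out.

```ruby
require 'stringio'

# Hypothetical helper: how many bytes at the end of +bytes+ form the start
# of a UTF-8 sequence that is still incomplete (0 if none).
def incomplete_tail_length(bytes)
  max_back = [3, bytes.bytesize].min
  1.upto(max_back) do |back|
    b = bytes.getbyte(-back)
    next if (b & 0xC0) == 0x80 # continuation byte, keep looking for the lead
    needed = case b
             when 0xC2..0xDF then 2
             when 0xE0..0xEF then 3
             when 0xF0..0xF4 then 4
             else 1 # ASCII or an invalid lead byte: nothing to hold back
             end
    return needed > back ? back : 0
  end
  0
end

# Hypothetical wrapper: reads +io+ in fixed-size chunks and yields UTF-8
# strings with every invalid byte sequence replaced by +replacement+.
def each_scrubbed_chunk(io, chunk_size = 64 * 1024, replacement = '?')
  carry = ''.b
  while (chunk = io.read(chunk_size))
    data = carry + chunk
    keep = incomplete_tail_length(data)
    carry = data.byteslice(data.bytesize - keep, keep)
    body  = data.byteslice(0, data.bytesize - keep)
    yield body.force_encoding(Encoding::UTF_8).scrub(replacement)
  end
  # Anything still held back at EOF is a truncated sequence; scrub it too.
  yield carry.force_encoding(Encoding::UTF_8).scrub(replacement) unless carry.empty?
end

if __FILE__ == $PROGRAM_NAME
  # Tiny chunk size so the multi-byte character is split across reads.
  input = StringIO.new("ab\xA0cd\xE2\x82\xACef".b)
  each_scrubbed_chunk(input, 3) { |s| print s }
  puts
end
```

Memory use stays bounded by the chunk size, and at most three bytes are
carried from one read to the next, so the "state" is less messy than I
feared.  It does not tell you whether a stray byte such as octal 240 was
meant as Latin-1; it just replaces it.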

I was hoping for something better.

Jeffrey

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Ruby 
on Rails: Talk" group.
To post to this group, send email to rubyonrails-talk@googlegroups.com
To unsubscribe from this group, send email to 
rubyonrails-talk+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/rubyonrails-talk?hl=en
-~----------~----~----~----~------~----~------~--~---
