On Mon, Apr 17, 2006 at 11:18:01AM +0200, Christian Boos wrote:
> Ok, so that you don't hold your breath for too long, I've looked at that
> code ...
>
> In
> http://trac-hacks.org/browser/reposearchplugin/0.9/tracreposearch/search.py#L112,
> you use node.get_content().read(), which is documented as:
>
>     Warning: `SubversionNode.get_content` returns an object from which one
>     can read a stream of bytes.
>     NO guarantees can be given about what that stream of bytes
>     represents.
>     It might be some text, encoded in some way or another.
>     SVN properties __might__ give some hints about the content,
>     but they actually only reflect the beliefs of whomever set
>     those properties...
>
> I should probably move this warning inside api.py, as docu for
> Node.get_content,
> as it's exactly the same for other backends.
>
> So, in short, when you access the content of a file from the repository,
> you access it as raw content (because it can be a binary object).
> If you decide it's some text, you should be prepared to handle /any/ kind
> of encoding: to that end, use `trac.util.to_unicode`:
>
>     to_unicode(node.get_content().read())
>
> (btw, I'm thinking of moving all text-related utilities into trac.util.text,
> what do you think?)
>
> There's an additional twist, as Subversion has some conventions for
> conveying charset information in node properties.
> That way, you can *maybe* (as the charset is not necessarily set
> in the svn:mime-type property) get a *hint* (as the charset which
> might be set is not necessarily the right one...) about the encoding
> actually used for that file.
>
> So one way to get a hint about the charset is to get it from the
> MIME type information. The second way to tell which charset is
> actually used is to try to detect it from the content.
> Mimeview.get_charset does both of the above (*).
>
> You can also tell `trac.util.to_unicode` to try to use this information:
>
>     raw_content = node.get_content().read()
>     charset = Mimeview(self.env).get_charset(raw_content,
>                                              node.get_content_type())
>     to_unicode(raw_content, charset)
>
> and `to_unicode` will do the right thing: decode the raw_content
> using the charset information if available and valid, but also fall back
> gracefully otherwise.
>
> There are a few shortcuts to achieve the above:
>
>     Mimeview(self.env).get_unicode(node.get_content().read(),
>                                    node.get_content_type())
>
> or even, using `trac.mimeview.api.content_to_unicode`:
>
>     content_to_unicode(self.env, node.get_content(), node.get_content_type())
>
> which copes with "readable" objects.
>
> -- Christian
>
> P.S: hey, that's more and more material for this UnicodeGuidelines page :)
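
The "use the charset hint if it is available and valid, otherwise fall back" behaviour described above can be illustrated without Trac. The following is only a standalone Python 3 sketch: `charset_from_mime_type` and `decode_with_fallback` are hypothetical names, and the fallback order (hint, then UTF-8, then ISO-8859-1) is an assumption made for illustration, not the documented behaviour of `to_unicode` or `Mimeview.get_charset`.

```python
def charset_from_mime_type(mime_type):
    """Return the charset parameter of a MIME type string, or None.

    Hypothetical helper: it only parses strings such as
    'text/plain; charset=iso-8859-1' and does no content sniffing.
    """
    if not mime_type:
        return None
    # Everything after the first ';' is a list of 'name=value' parameters.
    for param in mime_type.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            return value.strip().strip('"')
    return None


def decode_with_fallback(raw, charset=None):
    """Decode bytes to str, treating `charset` as a hint only.

    Assumed order: the hint (if any), then UTF-8, then ISO-8859-1.
    """
    candidates = [charset] if charset else []
    candidates.append("utf-8")
    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, ValueError):
            # LookupError: unknown or non-text codec name.
            # ValueError covers UnicodeDecodeError: the bytes do not fit.
            continue
    # ISO-8859-1 maps every byte to a code point, so this cannot fail,
    # although the result may not be what the author of the file intended.
    return raw.decode("iso-8859-1")


if __name__ == "__main__":
    hint = charset_from_mime_type("text/plain; charset=iso-8859-1")
    print(decode_with_fallback(b"caf\xe9", hint))
```

The point of the sketch is the same as in the quoted message: the charset found in the MIME type is treated as a hint that may be missing or wrong, so decoding never relies on it alone.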
Hehe :) This is all very useful stuff Christian, thanks. I'll update repo
search soon.

--
Evolution: Taking care of those too stupid to take care of themselves.
_______________________________________________
Trac-dev mailing list
http://lists.edgewall.com/mailman/listinfo/trac-dev
