On 02/28/2012 09:50 PM, Rob Crittenden wrote:
Petr Viktorin wrote:
On 02/28/2012 04:45 PM, Rob Crittenden wrote:
Petr Viktorin wrote:
On 02/28/2012 04:02 AM, Rob Crittenden wrote:
Petr Viktorin wrote:
On 02/27/2012 05:10 PM, Rob Crittenden wrote:
Rob Crittenden wrote:
Simo Sorce wrote:
On Mon, 2012-02-27 at 09:44 -0500, Rob Crittenden wrote:
We are pretty trusting that the data coming out of LDAP matches its
schema, but it is possible to stuff non-printable characters into most
attributes.

I've added a sanity checker to keep a value as a python str type
(treated as binary internally). This will result in a base64-encoded
blob being returned to the client.

Shouldn't you try to parse it as a unicode string and catch TypeError
to know when to return it as binary?

Simo.


What we do now is the equivalent of unicode(chr(0)), which returns
u'\x00' and is why we are failing now.
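In modern Python 3 terms (a hypothetical sketch, not the thread's actual Python 2 code), the underlying problem is that decoding only rejects invalid byte sequences, never control characters:

```python
# Hypothetical Python 3 illustration: decoding does not reject control
# characters, so a NUL byte stuffed into an attribute sails through.
raw = b'\x00'
text = raw.decode('utf-8')  # succeeds rather than raising
assert text == '\x00'       # the NUL survives the round trip
```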

I believe there is a unicode category module; we might be able to use
that if there is a category that defines non-printable characters.

rob

Like this:

import unicodedata

def contains_non_printable(val):
    for c in val:
        if unicodedata.category(unicode(c)) == 'Cc':
            return True
    return False
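A Python 3 equivalent of that sketch (hypothetical; the code in the thread is Python 2, hence the unicode() call) could be:

```python
import unicodedata

def contains_non_printable(val):
    # 'Cc' is the Unicode "Other, control" category: NUL, ESC, and also
    # tab/CR/LF, which is why this is stricter than the ord() check
    # that deliberately excluded those three.
    return any(unicodedata.category(c) == 'Cc' for c in val)
```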

This wouldn't have the exclusion of tab, CR and LF like using ord(),
but is probably more correct.

rob

_______________________________________________
Freeipa-devel mailing list
Freeipa-devel@redhat.com
https://www.redhat.com/mailman/listinfo/freeipa-devel

If you're protecting the XML-RPC calls, it'd probably be better to
look at the XML spec directly: http://www.w3.org/TR/xml/#charsets

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]

I'd say this is a good set for CLI as well.
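A hypothetical regex for that character class (a Python 3 sketch, assuming the text has already been decoded to a str):

```python
import re

# Anything NOT in the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML_ILLEGAL = re.compile(
    '[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)

def is_xml_safe(text):
    # True when every character is permitted in an XML 1.0 document.
    return _XML_ILLEGAL.search(text) is None
```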

And you can trap invalid UTF-8 sequences by catching the
UnicodeDecodeError from decode().
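For example (a Python 3 sketch; in the thread's Python 2 code the decode is applied to a plain str):

```python
def decode_utf8_or_keep(raw):
    """Return text if raw is valid UTF-8, otherwise keep the raw bytes."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw  # invalid sequence: treat the value as binary
```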


Replaced my ord function with a regex that looks for invalid
characters. Now catching that exception too and leaving the value as a
str type.

rob

You're matching a Unicode regex against a bytestring. It'd be better
to match after the decode.


The unicode regex was force of habit. I purposely do this comparison
before the decoding to save decoding something that is binary. Would
it be acceptable if I just remove the u?

rob

No: re.search('[\uDFFF]', u'\uDFFF'.encode('utf-8')) --> None
Actually I'm not sure what encoding is used behind the scenes here; I
wouldn't recommend mixing Unicode and bytestrings even if it did work on
my machine.
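Python 3 makes the mixing error explicit, which gives a small hypothetical sketch of why the match has to come after the decode:

```python
import re

pat = re.compile('[\u00e9]')        # a Unicode (str) pattern

raw = '\u00e9'.encode('utf-8')      # b'\xc3\xa9'
try:
    pat.search(raw)                 # str pattern on bytes: refused
except TypeError:
    pass                            # Python 3 raises TypeError here

# Decode first, then search the resulting text:
assert pat.search(raw.decode('utf-8')) is not None
```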

The data coming out of LDAP is a plain string. We encode to utf-8 if
it is not binary, as defined in _SYNTAX_MAPPING in
ipaserver/plugins/ldap2.py

Now we're just confusing each other, it seems.
By plain string you mean a bytestring in either UTF-8 or IA5 (which is
almost a UTF-8 subset), right? We convert that to a Unicode string if
possible.

Anyway, just dropping the u won't work. The \u escape only works in Unicode strings:
>>> list('\u1234')
['\\', 'u', '1', '2', '3', '4']

Also, match() only looks at the beginning of the string; use search().
Sorry for not catching that earlier.
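A quick illustration of the difference (with a hypothetical control-character pattern):

```python
import re

bad = re.compile('[\x00-\x08]')     # hypothetical "bad character" pattern

# match() is anchored at position 0, so a bad char later on is missed:
assert bad.match('abc\x00') is None
# search() scans the whole string:
assert bad.search('abc\x00') is not None
```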

good point.



OCD notes:
- the "matches" variable name is misleading; there's either one match
or None.
- use `is not None` instead of `!= None` as per PEP8.
- The contains_non_printable method seems a bit excessive now that it
essentially just contains one call. Unless it's meant to be
subclassed, in which case the docstring should probably look
different.


It doesn't need to be a public method. I broke it out mostly because
embedding the documentation on why it is there overwhelmed encode().

rob


--
Petr³
