Bugs item #956303, was opened at 2004-05-18 18:45 Message generated for change (Settings changed) made by akuchling You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=105470&aid=956303&group_id=5470
Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: Documentation Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: Allan Crooks (amc1) Assigned to: Nobody/Anonymous (nobody) >Summary: Update pickle docs to describe format of persistent IDs Initial Comment: There is a bug in save_pers in both the pickle and cPickle modules in Python. It occurs when someone uses a Pickler instance which is using an ASCII protocol and also has persistent_id defined which can return a persistent reference that can contain newline characters in. The current implementation of save_pers in the pickle module is as follows: ---- def save_pers(self, pid): # Save a persistent id reference if self.bin: self.save(pid) self.write(BINPERSID) else: self.write(PERSID + str(pid) + '\n') ---- The else clause assumes that the 'pid' will not be a string which one or more newline characters. If the pickler pickles a persistent ID which has a newline in it, then an unpickler with a corresponding persistent_load method will incorrectly unpickle the data - usually interpreting the character after the newline as a marker indicating what type of data should be expected (usually resulting in an exception being raised when the remaining data is not in the format expected). I have attached an example file which illustrates in what circumstances the error occurs. Workarounds for this bug are: 1) Use binary mode for picklers. 2) Modify subclass implementations of save_pers to ensure that newlines are not returned for persistent ID's. Although you may assume in general that this bug would only occur on rare occasions (due to the unlikely situation where someone would implement persistent_id so that it would return a string with a newline character embedded), it may occur more frequently if the subclass implementation of persistent_id uses a string which has been constructed using the marshal module. This bug was discovered when our code implemented the persistent_id method, which was returning the marshalled format of a tuple which contained strings. It occurred when one or more of the strings had a length of ten characters - the marshalled format of that string contains the string's length, where the byte used to represent the number 10 is the same as the one which represents the newline character: >>> marshal.dumps('a' * 10) 's\n\x00\x00\x00aaaaaaaaaa' >>> chr(10) '\n' I have replicated this bug on Python 1.5.2 and Python 2.3b1, and I believe it is present on all 2.x versions of Python. Many thanks to SourceForge user (and fellow colleague) SMST who diagnosed the bug and provided the test cases attached. ---------------------------------------------------------------------- Comment By: Martin v. Löwis (loewis) Date: 2006-07-03 08:41 Message: Logged In: YES user_id=21627 Also lowering the priority. amc1, if you are still interested, are you willing to provide a documentation patch? ---------------------------------------------------------------------- Comment By: Tim Peters (tim_one) Date: 2004-11-07 17:40 Message: Logged In: YES user_id=31435 Unassigned myself (I don't have time for it), but changed the Category to Documentation. (Changing what a persistent ID can be would need to be a new feature request.) ---------------------------------------------------------------------- Comment By: Allan Crooks (amc1) Date: 2004-05-19 11:30 Message: Logged In: YES user_id=39733 I would at least like the documentation modified to make it clearer that certain characters are not permitted for persistent ID's. I think the text which indicates the requirement of printable ASCII characters is too subtle - there should be something which makes the requirement more obvious, the use of a "must" or "should" would help get the point across (as would some text after the statement indicating that characters such as '\b', '\n', '\r' are not permitted). Perhaps it would be an idea for save_pers to do some argument checking before storing the persistent ID, perhaps using an assertion statement to verify that the ID doesn't contain non-permitted characters (or at least, checking for the presence of a '\n' character embedded in the string). I think it is preferable to have safeguards implemented in Pickler to prevent potentially dodgy data being stored - I would rather have an error raised when I'm trying to pickle something than have the data stored and corrupted, only to notice it when it is unpickled (when it is too late). Confusingly, the code in save_pers in the pickle module seems to indicate that it would happily accept non-String based persistent ID's: ---- else: self.write(PERSID + str(pid) + '\n') ---- I don't understand why we are using the str function if we are expecting pid to be a string in the first place. I would rather that this method would raise an error if it tried to perform string concatenation on something which isn't a string. I agree with SMST, I would like the restriction removed over what persistent ID's we can use, it seems somewhat arbitary - there's no reason, for example, why we can't use any simple data type which can be marshalled as an ID. Apart from the reason that it wouldn't be backwardly compatible, which is probably a good enough reason. :) ---------------------------------------------------------------------- Comment By: Steve Tregidgo (smst) Date: 2004-05-19 06:31 Message: Logged In: YES user_id=42335 I'd overlooked that note in the documentation before, and in fact developed the opposite view on what was allowed by seeing that the binary pickle format happens to allow persistent IDs containing non-printable ASCII characters. Given that the non-binary format can represent strings (containing any character, printable or not) by escaping them, it seems unfortunate that the same escaping was not applied to the saving of persistent IDs. It might be helpful if the documentation indicated that the acceptance by the binary pickle format of strings without restriction is not to be relied upon, underlining the fact that printable ASCII is all that's allowed by the format. Personally I'd like to see the restriction on persistent IDs lifted in a future version of the pickle module, but I don't have a compelling reason for it (other than it seeming to be unnecessary). On the other hand, it seems to be a limitation which hasn't caused much grief (if any) over the years... perhaps such a change (albeit a minor one) in the specifications should be left until another protocol is introduced. ---------------------------------------------------------------------- Comment By: Tim Peters (tim_one) Date: 2004-05-18 23:02 Message: Logged In: YES user_id=31435 The only documentation is the "Pickling and unpickling external objects" section of the Library Reference Manual, which says: """ Such objects are referenced by a ``persistent id'', which is just an arbitrary string of printable ASCII characters. """ A newline is universally considered to be a control character, not a printable character (e.g., try isprint('\n') under your local C compiler). So this is functioning as designed and as documented. If you don't find the docs clear, we should call this a documentation bug. If you think the semantics should change to allow more than printable characters, then this should become a feature request, and more is needed to define exactly which characters should be allowed. The current implementation is correct for persistent ids that meet the documented requirement. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=105470&aid=956303&group_id=5470 _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com