Bugs item #1636950, was opened at 2007-01-16 17:56
Message generated for change (Comment added) made by runedevik
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1636950&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Python Library
Group: Python 2.5
Status: Closed
Resolution: Invalid
Priority: 5
Private: No
Submitted By: Andy Monthei (amonthei)
Assigned to: Nobody/Anonymous (nobody)
Summary: Newline skipped in "for line in file"

Initial Comment:
When processing huge fixed-block files about 7000 bytes wide and several hundred thousand lines long, some pairs of lines get read as one long line with no line break when using "for line in file:".
The problem is even worse with the fileinput module: reading in five or six huge files consisting of 4.8 million records causes several hundred pairs of lines to be read as single lines. When a newline is skipped, it is usually followed by several more in the next few hundred lines.

I have not noticed any other characters being skipped, only the line break.

O.S. Windows (5, 1, 2600, 2, 'Service Pack 2')
Python 2.5

----------------------------------------------------------------------

Comment By: Rune Devik (runedevik)
Date: 2007-06-27 12:00

Message:
Logged In: YES
user_id=1212666
Originator: NO

Hi

I have the same problem with a huge file (8GB) containing long lines. Sometimes two lines are merged into one, and when I rerun the test script that reads the file, it is always the same lines that are merged. The merging also seems to happen more frequently towards the end of the file. I tried to reproduce with a smaller data set (the 10 lines before the two lines that get merged, the two merged lines themselves, and the 10 lines after them), but I was not able to reproduce it on this smaller data set.

However, if you open this huge file in "rb" mode instead of "r" mode, everything works as it should and no lines are merged at all!
If I copy the file over to Linux and rerun the test script, no lines are merged (regardless of whether the mode is "r" or "rb"), so this is Windows-specific and might have something to do with the adding of \r\n if only \n is found when you open the file in "r" mode, maybe? I have also reproduced it on both Python 2.3.5 and 2.5c1, on both Windows XP and Windows 2003.

More stats on the input file in both "r" mode and "rb" mode below:

Input file size: 8 695 828 KB

fp = open(file, "r"):
- total number of lines read: 668909
- length of the longest line: 13179792
- length of the shortest line: 89
- 56 lines contain the content of two lines
- Always just two lines that are merged into one!
- Always the same lines that are merged when rerunning the test on the same file.

open(file, "rb"):
- total number of lines read: 668965
- length of the longest line: 13179793
- length of the shortest line: 90
- no lines merged

Regards,
Rune Devik

----------------------------------------------------------------------

Comment By: Brett Cannon (bcannon)
Date: 2007-01-21 01:46

Message:
Logged In: YES
user_id=357491
Originator: NO

Well, with Andy saying he can't reproduce the problem, I am going to close this as invalid. Andy, if you ever happen to be able to upload data that triggers it, then please re-open this bug.

----------------------------------------------------------------------

Comment By: Andy Monthei (amonthei)
Date: 2007-01-20 23:53

Message:
Logged In: YES
user_id=1693612
Originator: YES

I have had no luck creating random data to reproduce the problem, which leads me to conclude that it was the data itself. Using a hex editor, I find no problem with the line breaks.

The data that triggers this bug is transferred several times before it gets to me. It originates on a Unix box, then goes to an IBM mainframe, then to my Windows machine, with many updates along the way. It may be an EBCDIC/ASCII conversion or possibly something to do with the mainframe to PC transfer.
Whatever it is, it's in the data itself. The only thing that bothers me is that Java somehow is not affected by this bad data.

----------------------------------------------------------------------

Comment By: Andy Monthei (amonthei)
Date: 2007-01-18 16:34

Message:
Logged In: YES
user_id=1693612
Originator: YES

I am using open() for reading the file, no other features. I have also had fileinput.input(fileList) compound the problem.

Each file that this has happened to is a fixed-block file either 6990 or 7700 bytes wide, but I think this is insignificant. When looking at the file in a hex editor, everything looks fine, and a small Java program using a buffered reader gives me the correct line count when Python does not.

I'm sure using something like fp.read(8192) might temporarily solve my problem, but I will keep working on getting a file I can upload.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2007-01-18 10:23

Message:
Logged In: YES
user_id=89016
Originator: NO

Are you using any of the Unicode reading features (i.e. codecs.EncodedFile etc.) or are you using plain open() for reading the file?

----------------------------------------------------------------------

Comment By: Mark Roberts (mark-roberts)
Date: 2007-01-18 08:12

Message:
Logged In: YES
user_id=1591633
Originator: NO

I don't know if this helps: I spent the last little while creating and reading random files that all (seemingly) matched the description you gave us. None of these files failed to read properly (i.e., they had the right number of rows, with line lengths that seemed correct, and definitely no doubled lines).

Perusing the file source code, I found a detailed discussion of fgets vs fgetc for finding the next line in the file. Have you tried reading the file with fp.read(8192) or similar?

Hopefully you're able to reproduce the bug with scrubbed data (because I couldn't construct random data to do so).

Good luck.
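
A minimal sketch of the two checks discussed in this thread: counting line breaks with fixed-size binary reads, as in the fp.read(8192) suggestion, and counting lines with "for line in file" in "r" versus "rb" mode, as in the statistics reported above. It assumes a current Python 3 interpreter rather than the Python 2.5 from the report; the function names and the example path are hypothetical. It only compares counts and does not claim to reproduce the bug.

```python
def count_lines_chunked(path, chunk_size=8192):
    """Count lines by scanning fixed-size binary chunks for b'\\n'."""
    newlines = 0
    last_byte = b""
    with open(path, "rb") as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                break
            newlines += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A final line with no trailing newline still counts as a line.
    if last_byte and last_byte != b"\n":
        newlines += 1
    return newlines


def count_lines_iter(path, mode="r"):
    """Count lines with "for line in file" in the given open() mode."""
    with open(path, mode) as fp:
        return sum(1 for _ in fp)


def compare_counts(path):
    """Return the three counts side by side so a mismatch is easy to spot."""
    return {
        "chunked": count_lines_chunked(path),
        "iter_r": count_lines_iter(path, "r"),
        "iter_rb": count_lines_iter(path, "rb"),
    }
```

Calling compare_counts("huge_file.dat") (a hypothetical path) on an affected file would show whether the text-mode count falls short of the binary counts, which is the discrepancy described in the reports above.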

----------------------------------------------------------------------

Comment By: Mark Roberts (mark-roberts)
Date: 2007-01-18 06:24

Message:
Logged In: YES
user_id=1591633
Originator: NO

What are the min and max widths of the lines? This problem is of particular interest to me.

----------------------------------------------------------------------

Comment By: Andy Monthei (amonthei)
Date: 2007-01-17 22:58

Message:
Logged In: YES
user_id=1693612
Originator: YES

I cannot upload the files that trigger this because of the data that is in them, but I am working on getting around that.

In my data, line 617391 in a fixed-block file 6990 bytes wide gets read in together with the line after it. The line break where the bug happens is 0d0a (same as the others), so I am wondering if it is a buffer issue where the line break falls at the edge of a buffer; however, no other characters are ever missed. The total file is 888420 lines and this happens in four spots.

I will hopefully have a file to send soon.

----------------------------------------------------------------------

Comment By: Brett Cannon (bcannon)
Date: 2007-01-16 23:33

Message:
Logged In: YES
user_id=357491
Originator: NO

Do you happen to have a sample you could upload that triggers the bug?

----------------------------------------------------------------------

_______________________________________________
Python-bugs-list mailing list