Re: [Tutor] Reading large bz2 Files

2010-02-22 Thread Stefan Behnel
Norman Rieß, 19.02.2010 13:42:
> I am trying to read a large bz2 file with this code:
>
> source_file = bz2.BZ2File(file, "r")
> for line in source_file:
>     print line.strip()
>
> But after 4311 lines, it stops without an error message. The bz2 file is
> much bigger, though.

Could you send in a copy of the unpacked bytes around the position where it
stops? I.e. a couple of lines before and after that position? Note that
bzip2 is a block compressor, so, depending on your data, you may have to
send enough lines to fill the block size.

Does it also stop if you parse only those lines from a bzip2 file, or is it
required that the file has at least the current amount of data before those
lines?

Based on this, could you please do a bit of poking around yourself to
figure out if it is a) the byte position, b) the data content or c) the
length of the file that induces this behaviour? I assume it's rather
impractical to share the entire file, so you will have to share hints and
information instead if you want this resolved.
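
One way to do that poking around is to record how many lines and how many decompressed bytes come out before the loop ends. The following is only a sketch: the helper name count_until_stop is made up for illustration, and it avoids the "with" statement and print so that it should run on both Python 2.6 and Python 3.

```python
import bz2


def count_until_stop(path):
    """Return (lines, decompressed_bytes) read before iteration ends."""
    lines = 0
    nbytes = 0
    source = bz2.BZ2File(path, "rb")
    try:
        for line in source:
            lines += 1
            nbytes += len(line)
    finally:
        # Close explicitly; BZ2File is not a context manager on Python 2.6.
        source.close()
    return lines, nbytes
```

Comparing the returned byte count with the output of "bzcat file | wc -c" shows whether the stop happens at a suspicious byte position or simply partway through the data.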

Stefan

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Reading large bz2 Files

2010-02-19 Thread Norman Rieß
On 19.02.2010 22:24, Lie Ryan wrote:
> On 02/20/10 07:49, Norman Rieß wrote:
>> On 19.02.2010 21:42, Lie Ryan wrote:
>>> On 02/19/10 23:42, Norman Rieß wrote:
>>>> Hello,
>>>>
>>>> I am trying to read a large bz2 file with this code:
>>>>
>>>> source_file = bz2.BZ2File(file, "r")
>>>> for line in source_file:
>>>>     print line.strip()
>>>>
>>>> But after 4311 lines, it stops without an error message. The bz2 file is
>>>> much bigger, though.
>>>> How can I read the whole file line by line?
>>>
>>> Is the bz2 file an archive[1]?
>>>
>>> [1] archive: contains more than one file
>>
>> No, it is a single file. But how could I check for sure? It extracts to
>> a single file...
>
> Use "bzip2 -dc" or "bunzip2" instead of "bzcat", since bzcat concatenates
> its output into a single file.

Yes, it is a single file.


Re: [Tutor] Reading large bz2 Files

2010-02-19 Thread Norman Rieß
On 19.02.2010 22:03, Kent Johnson wrote:
> On Fri, Feb 19, 2010 at 7:42 AM, Norman Rieß wrote:
>> Hello,
>>
>> I am trying to read a large bz2 file with this code:
>>
>> source_file = bz2.BZ2File(file, "r")
>> for line in source_file:
>>     print line.strip()
>>
>> But after 4311 lines, it stops without an error message. The bz2 file is
>> much bigger, though.
>> How can I read the whole file line by line?
>
> I wonder if it is dying after reading 2^31 or 2^32 bytes? It sounds a
> bit like this (fixed) bug:
> http://bugs.python.org/issue1215928
>
> Kent
>
./osmcut.py ../planet-100210.osm.bz2 > test.txt
sm...@loki ~/osm/osmcut $ ls -lh test.txt
-rw-r--r-- 1 871K 19. Feb 22:41 test.txt

Seems like far from it.

Norman


Re: [Tutor] Reading large bz2 Files

2010-02-19 Thread Lie Ryan
On 02/20/10 07:49, Norman Rieß wrote:
> On 19.02.2010 21:42, Lie Ryan wrote:
>> On 02/19/10 23:42, Norman Rieß wrote:
>>> Hello,
>>>
>>> I am trying to read a large bz2 file with this code:
>>>
>>> source_file = bz2.BZ2File(file, "r")
>>> for line in source_file:
>>>     print line.strip()
>>>
>>> But after 4311 lines, it stops without an error message. The bz2 file is
>>> much bigger, though.
>>> How can I read the whole file line by line?
>>
>> Is the bz2 file an archive[1]?
>>
>> [1] archive: contains more than one file
>
> No, it is a single file. But how could I check for sure? It extracts to
> a single file...

Use "bzip2 -dc" or "bunzip2" instead of "bzcat", since bzcat concatenates
its output into a single file.
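
Another way to check for sure is to count the compressed streams inside the file from Python. This is a rough sketch under some assumptions: it needs Python 3.3 or later, because it relies on the eof attribute of bz2.BZ2Decompressor, and the name count_streams is invented for this example. It decompresses the whole file, so on a file of several gigabytes it will take a long time.

```python
import bz2


def count_streams(path, chunk_size=1 << 20):
    """Count the complete bz2 streams concatenated in the file at path."""
    streams = 0
    decomp = bz2.BZ2Decompressor()
    with open(path, "rb") as raw:
        while True:
            chunk = raw.read(chunk_size)
            if not chunk:
                break
            while chunk:
                decomp.decompress(chunk)  # output is discarded, we only count
                chunk = b""
                if decomp.eof:
                    # One stream ended; bytes after it start the next stream.
                    streams += 1
                    chunk = decomp.unused_data
                    decomp = bz2.BZ2Decompressor()
    return streams
```

A result greater than 1 means the file is a concatenation of several streams, even if it unpacks to a single output file.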



Re: [Tutor] Reading large bz2 Files

2010-02-19 Thread Lie Ryan
On 02/20/10 07:42, Lie Ryan wrote:
> On 02/19/10 23:42, Norman Rieß wrote:
>> Hello,
>>
>> I am trying to read a large bz2 file with this code:
>>
>> source_file = bz2.BZ2File(file, "r")
>> for line in source_file:
>>     print line.strip()
>>
>> But after 4311 lines, it stops without an error message. The bz2 file is
>> much bigger, though.
>> How can I read the whole file line by line?
> 
> Is the bz2 file an archive[1]?
> 
> [1] archive: contains more than one file

Or, more clearly: does the bz2 file contain multiple files compressed using
the -c flag? The -c flag does a simple concatenation of multiple compressed
streams to stdout; the result is only decompressible with bzip2 0.9.0 or
later[1].

You cannot use bz2.BZ2File to open this; instead, use the stream
decompressor bz2.BZ2Decompressor.

A better approach is to use a real archiving format (e.g. tar).

[1] http://www.bzip.org/1.0.3/html/description.html
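
The BZ2Decompressor approach could look roughly like the sketch below. It is written for Python 3.3 or later (it uses the eof attribute, which the Python 2.6 module in this thread does not have), and the function name iter_lines_multistream is an assumption made for illustration. It starts a fresh decompressor whenever one stream ends, so reading continues into the next stream.

```python
import bz2


def iter_lines_multistream(path, chunk_size=1 << 20):
    """Yield lines (as bytes) from a bz2 file that may hold several streams."""
    decomp = bz2.BZ2Decompressor()
    pending = b""  # decompressed data not yet terminated by a newline
    with open(path, "rb") as raw:
        while True:
            chunk = raw.read(chunk_size)
            if not chunk:
                break
            while chunk:
                pending += decomp.decompress(chunk)
                chunk = b""
                if decomp.eof:
                    # Leftover input belongs to the next stream.
                    chunk = decomp.unused_data
                    decomp = bz2.BZ2Decompressor()
                lines = pending.split(b"\n")
                pending = lines.pop()  # last piece has no newline yet
                for line in lines:
                    yield line + b"\n"
    if pending:
        yield pending
```

On Python 3.3 and later, bz2.BZ2File and bz2.open read multi-stream files themselves, so a helper like this is mainly of interest when working at the decompressor level or porting the idea back to older versions.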



Re: [Tutor] Reading large bz2 Files

2010-02-19 Thread Kent Johnson
On Fri, Feb 19, 2010 at 7:42 AM, Norman Rieß wrote:
> Hello,
>
> I am trying to read a large bz2 file with this code:
>
> source_file = bz2.BZ2File(file, "r")
> for line in source_file:
>     print line.strip()
>
> But after 4311 lines, it stops without an error message. The bz2 file is
> much bigger, though.
> How can I read the whole file line by line?

I wonder if it is dying after reading 2^31 or 2^32 bytes? It sounds a
bit like this (fixed) bug:
http://bugs.python.org/issue1215928

Kent


Re: [Tutor] Reading large bz2 Files

2010-02-19 Thread Norman Rieß
On 19.02.2010 21:42, Lie Ryan wrote:
> On 02/19/10 23:42, Norman Rieß wrote:
>> Hello,
>>
>> I am trying to read a large bz2 file with this code:
>>
>> source_file = bz2.BZ2File(file, "r")
>> for line in source_file:
>>     print line.strip()
>>
>> But after 4311 lines, it stops without an error message. The bz2 file is
>> much bigger, though.
>> How can I read the whole file line by line?
>
> Is the bz2 file an archive[1]?
>
> [1] archive: contains more than one file

No, it is a single file. But how could I check for sure? It extracts to
a single file...



Re: [Tutor] Reading large bz2 Files

2010-02-19 Thread Lie Ryan
On 02/19/10 23:42, Norman Rieß wrote:
> Hello,
> 
> I am trying to read a large bz2 file with this code:
>
> source_file = bz2.BZ2File(file, "r")
> for line in source_file:
>     print line.strip()
>
> But after 4311 lines, it stops without an error message. The bz2 file is
> much bigger, though.
> How can I read the whole file line by line?

Is the bz2 file an archive[1]?

[1] archive: contains more than one file



Re: [Tutor] Reading large bz2 Files

2010-02-19 Thread Norman Rieß
On 19.02.2010 17:04, Steven D'Aprano wrote:
> My guess is one of two things:
> (1) You are mistaken that the file is bigger than 4311 lines.
>
> (2) You are using Windows, and somehow there is a Ctrl-Z (0x1A)
> character in the file, which Windows interprets as End Of File when
> reading files in text mode. Try changing the mode to "rb" and see if
> the behaviour goes away.

On 19.02.2010 17:15, Stefan Behnel wrote:
> What does "stops" mean here? Does it crash? Does it exit from the loop? Is
> the above code exactly what you used for testing? Are you passing a
> filename? What platform is this on?
>
>
> How many lines does it have? How did you count them? Did you make sure that
> you are reading from the right file?
>
>   

Hello,

I took the liberty of copying your mails together, so I do not have to
repeat things.
How big is the file, and how did I count that?

sm...@loki ~/osm $ bzcat planet-100210.osm.bz2 | wc -l
1717362770
(this took a looong time ;-))
sm...@loki ~/osm $ du -h planet-100210.osm.bz2
8,0G    planet-100210.osm.bz2

So as you can see, the file really is bigger.
I am not using Windows, and the next character would be a period.

sm...@loki ~/osm/osmcut $ ./osmcut.py ../planet-100210.osm.bz2
[...]



I did set the mode to "rb", with the same result.
I also edited the code to see whether the loop was exited or the program
crashed.
As you can see, there is no error; the loop just exits.
This is the _exact_ code I use:

source_file = bz2.BZ2File(osm_file, "r")
for line in source_file:
    print line.strip()

print "Exiting"
print "I used file: " + osm_file

As you can see above, the loop exits, the prints are executed and the
right file is used. The content of the file is really distinctive, so
there is no doubt that it is the right file.
Here is my platform information:
Python 2.6.4
Linux 2.6.32.8 #1 SMP Fri Feb 12 13:29:10 CET 2010 x86_64 Intel(R)
Core(TM)2 Duo CPU U9400 @ 1.40GHz GenuineIntel GNU/Linux
Note: This symptom shows on another platform (SuSE 11.1) with different
software versions as well.

Is there a possibility that the bz2 module reads only into a limited
buffer and no further? If so, that would explain both the same behaviour
on the two independent systems and why it works in Steven's smaller example.
How could I avoid that?

Oh, and the content of the file is free, so I do not get into legal
issues by exposing it.

Thanks.
Regards,

Norman



Re: [Tutor] Reading large bz2 Files

2010-02-19 Thread Stefan Behnel
Norman Rieß, 19.02.2010 13:42:
> I am trying to read a large bz2 file with this code:
>
> source_file = bz2.BZ2File(file, "r")
> for line in source_file:
>     print line.strip()
>
> But after 4311 lines, it stops without an error message.

What does "stops" mean here? Does it crash? Does it exit from the loop? Is
the above code exactly what you used for testing? Are you passing a
filename? What platform is this on?


> The bz2 file is much bigger though.

How many lines does it have? How did you count them? Did you make sure that
you are reading from the right file?


> How can I read the whole file line by line?

Just as you do above, and it works for me. So the problem is likely elsewhere.

Stefan



Re: [Tutor] Reading large bz2 Files

2010-02-19 Thread Steven D'Aprano
On Fri, 19 Feb 2010 11:42:07 pm Norman Rieß wrote:
> Hello,
>
> I am trying to read a large bz2 file with this code:
>
> source_file = bz2.BZ2File(file, "r")
> for line in source_file:
>     print line.strip()
>
> But after 4311 lines, it stops without an error message. The bz2 file
> is much bigger, though.
>
> How can I read the whole file line by line?

"for line in file" works for me:


>>> import bz2
>>>
>>> writer = bz2.BZ2File('file.bz2', 'w')
>>> for i in xrange(2):
...     # write some variable text to a line
...     writer.write('abc'*(i % 5) + '\n')
...
>>> writer.close()
>>> reader = bz2.BZ2File('file.bz2', 'r')
>>> i = 0
>>> for line in reader:
...     i += 1
...
>>> reader.close()
>>> i
2


My guess is one of two things:

(1) You are mistaken that the file is bigger than 4311 lines.

(2) You are using Windows, and somehow there is a Ctrl-Z (0x1A)
character in the file, which Windows interprets as End Of File when
reading files in text mode. Try changing the mode to "rb" and see if
the behaviour goes away.




-- 
Steven D'Aprano


[Tutor] Reading large bz2 Files

2010-02-19 Thread Norman Rieß

Hello,

I am trying to read a large bz2 file with this code:

source_file = bz2.BZ2File(file, "r")
for line in source_file:
    print line.strip()

But after 4311 lines, it stops without an error message. The bz2 file is
much bigger, though.

How can I read the whole file line by line?

Thank you.

Regards,
Norman