date:20120719

Re: [Tutor] Problem When Iterating Over Large Test Files

2012-07-19 Thread Steven D'Aprano

On Wed, Jul 18, 2012 at 04:33:20PM -0700, Ryan Waples wrote:

 I've included 20 consecutive lines of input and output.  Each of these
 5 'records' should have been selected and printed to the output file.

I count only 19 lines. The first group has only three lines. See below.

There is a blank line, which I take as NOT part of the input but just a 
spacer. Then:

1) Line starting with @
2) Line of bases CGCGT ...
3) Plus sign
4) Line starting with @@@
5) Line starting with @
6) Line of bases TTCTA ...
7) Plus sign

and so on. There are TWO lines before the first +, and three before each 
of the others.



 __EXAMPLE RAW DATA FILE REGION__
 
 @HWI-ST0747:167:B02DEACXX:8:1101:3182:167088 1:N:0:
 CGCGTGTGCAGGTTTATAGAACCAGCTGCAGATTAGTAGCAGCGCACGGAGAGGTGTGTCTGTTTATTGTCCTCAGCAGGCAGACATGTTTGTGGTC
 +
 @@@DDADDHB9+2A??:?G9+C)???G@DB@@DGFB0*?FF?0F:@/54'-;;?B;;6(5@CDAC(5(5:5,(8?88?BC@#
 @HWI-ST0747:167:B02DEACXX:8:1101:3134:167090 1:N:0:
 TTCTAGTGCAGGGCGACAGCGTTGCGGAGCCGGTCCGAGTCTGCTGGGTCAGTCATGGCTAGTTGGTACTATAACGACACAGGGCGAGACCCAGATGCAAA
 +
 @CCFFFDFHJJIJHHIIIJHGHIJI@GFFDDDFDDCEEEDCCBDCCCCCB@C(4@ADCA?BBBDDABB055-?AB1:@ACC:
 @HWI-ST0747:167:B02DEACXX:8:1101:3002:167092 1:N:0:
 CTTTGCTGCAGGCTCATCCTGACATGACCCTCCAGCATGACAATGCCACCAGCCATACTGCTCGTTCTGTGTGTGATTTCCAGCAAGTAAATATGTA
 +
 CCCFHIJIEHIH@AHFAGHIGIIGGEIJGIJIIIGIIIGEHGEHIIJIEHH@FHGH@=ACEHHFBFFCE@AACCACDB;;B?C3AADBA
 @HWI-ST0747:167:B02DEACXX:8:1101:3022:167094 1:N:0:
 ATTCCGTGCAGGCCAACTCCCGACGGACATCCTTGCTCAGACTGCAGCGATAGTGGTCGATCAGGGCCCTGTTGTTCCATCCCACTCCGGCGACCAGGTTC
 +
 CCCFHIDHJIIHIIIJIJIIGGIIFHJIIIIEIFHFFCBAECBDDDC:??B=AAACD?8@:C@?8CBDDD@D99B@3884A
 @HWI-ST0747:167:B02DEACXX:8:1101:3095:167100 1:N:0:
 CGTGATTGCAGGGACGTTACAGAGACGTTACAGGGATGTTACAGGGACGTTACAGAGACGTTAAAGAGATGTTACAGGGATGTTACAGACAGAGACGTTAC
 +

Your code says that the first line in each group should start with an @ 
sign. That is clearly not the case for the last two groups.

I suggest that your data files have been corrupted.

 __PYTHON CODE __

I have re-written your code slightly, to be a little closer to best 
practice, or at least modern practice. If there is anything you don't 
understand, please feel free to ask.

I haven't tested this code, but it should run fine on Python 2.7.

It will be interesting to see if you get different results with this.



from __future__ import print_function  # print() behaves the same on 2.7 and 3.x

import glob

def four_lines(file_object):
    """Yield lines from file_object grouped into batches of four.

    If the file has fewer than four lines remaining, pad the batch
    with 1-3 empty strings.

    Lines are stripped of leading and trailing whitespace.
    """
    while True:
        # Get the first line. If there is no first line, we are at EOF,
        # so end the generator. (Letting StopIteration escape from a
        # generator is an error on Python 3.7 and later.)
        try:
            line1 = next(file_object).strip()
        except StopIteration:
            return
        # Get the next three lines, padding if needed.
        line2 = next(file_object, '').strip()
        line3 = next(file_object, '').strip()
        line4 = next(file_object, '').strip()
        yield (line1, line2, line3, line4)


my_in_files = glob.glob('E:/PINK/Paired_End/raw/gzip/*.fastq')
for each in my_in_files:
    out = each.replace('/gzip', '/rem_clusters2')
    print("Reading File: " + each)
    print("Writing File: " + out)
    INFILE = open(each, 'r')
    OUTFILE = open(out, 'w')
    reads = 0
    writes = 0

    # enumerate() supplies the running count of groups, starting at 1.
    for reads, lines in enumerate(four_lines(INFILE), 1):
        ID_Line_1, Seq_Line, ID_Line_2, Quality_Line = lines
        # Check that ID_Line_1 starts with @
        if not ID_Line_1.startswith('@'):
            print("**ERROR**")
            print("expected ID_Line to start with @")
            print(lines)
            print("Read Number " + str(reads))
            break
        # Check that the third line of the group is the + separator
        elif ID_Line_2 != '+':
            print("**ERROR**")
            print("expected ID_Line_2 = +")
            print(lines)
            print("Read Number " + str(reads))
            break
        # Select Reads that I want to keep
        ID = ID_Line_1.partition(' ')
        if (ID[2] == "1:N:0:" or ID[2] == "2:N:0:"):
            # Write to file, maintaining group of 4
            OUTFILE.write(ID_Line_1 + "\n")
            OUTFILE.write(Seq_Line + "\n")
            OUTFILE.write(ID_Line_2 + "\n")
            OUTFILE.write(Quality_Line + "\n")
            writes += 1
    # End of file reached, print update
    print("Saw", reads, "groups of four lines")
    print("Wrote", writes, "groups of four lines")
    INFILE.close()
    OUTFILE.close()





-- 
Steven
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:

Re: [Tutor] string to binary and back... Python 3

2012-07-19 Thread Mark Lawrence


On 19/07/2012 06:41, wolfrage8...@gmail.com wrote:

On Thu, Jul 19, 2012 at 12:16 AM, Dave Angel d...@davea.name wrote:


  On 07/18/2012 05:07 PM, Jordan wrote:

OK so I have been trying for a couple days now and I am throwing in the
towel, Python 3 wins this one.
I want to convert a string to binary and back again like in this
question: Stack Overflow: Convert Binary to ASCII and vice versa
(Python)


http://stackoverflow.com/questions/7396849/convert-binary-to-ascii-and-vice-versa-python


But in Python 3 I consistently get  some sort of error relating to the
fact that nothing but bytes and bytearrays support the buffer interface
or I get an overflow error because something is too large to be
converted to bytes.
Please help me and then explain what I am not getting that is new in
Python 3. I would like to point out I realize that binary, hex, and
encodings are all a very complex subject and so I do not expect to
master it but I do hope that I can gain a deeper insight. Thank you all.

test_script.py:
import binascii

test_int = 109

test_int = int(str(test_int) + '45670')
data = 'Testing XOR Again!'

while sys.getsizeof(data) > test_int.bit_length():

test_int = int(str(test_int) + str(int.from_bytes(os.urandom(1), 'big')))

print('Bit Length: ' + str(test_int.bit_length()))

key = test_int # Yes I know this is an unnecessary step...

data = bin(int(binascii.hexlify(bytes(data, 'UTF-8')), 16))

print(data)

data = int(data, 2)

print(data)

data = binascii.unhexlify('%x' % data)



I don't get the same error you did.  I get:

  File "jordan.py", line 13
 test_int = int(str(test_int) + str(int.from_bytes(os.urandom(1),
'big')))
^


test_int = int(str(test_int) + str(int.from_bytes(os.urandom(1), \
 'big')))
# That was probably just due to the copy and paste.


IndentationError: expected an indented block


Please post it again, with correct indentation.  If you used tabs, then
expand them to spaces before pasting it into your text-mode mail editor.

I only use spaces, and this program did not require any indentation until
it was pasted and the one line above became split across two lines. Really,
though, that was a trivial error to correct.


Really?  Are you using a forked version of Python that doesn't need 
indentation after a while loop, or are you speaking with a forked 
tongue? :)  Strangely I believe the latter, so please take note of what 
Dave Angel has told you and post with the correct indentation.






I'd also recommend you remove a lot of the irrelevant details there.  If
you have a problem with hexlify and/or unhexlify, then give a simple byte
string that doesn't work for you, and somebody can probably identify why
not.  And if you want people to run your code, include the imports as well.
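A minimal reproduction of the kind requested might look like the sketch below. It is written for Python 3 and is not taken from the thread; the `b"\x0a"` test value is chosen here only to show one way the round trip through an integer can fail (a leading zero digit is dropped, leaving an odd number of hex digits).

```python
import binascii

raw = "Testing XOR Again!".encode("utf-8")   # str -> bytes

hexed = binascii.hexlify(raw)                # bytes -> hex digits (as bytes)
round_trip = binascii.unhexlify(hexed)       # hex digits -> original bytes
print(round_trip.decode("utf-8"))

# Going through an int can lose a leading zero digit, leaving an odd
# number of hex digits, which unhexlify() refuses.
short_hex = "%x" % int(binascii.hexlify(b"\x0a"), 16)   # 'a', not '0a'
try:
    binascii.unhexlify(short_hex.encode("ascii"))
    odd_length_rejected = False
except binascii.Error:
    odd_length_rejected = True
print(short_hex, odd_length_rejected)
```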

My problem is not specific to hexlify and unhexlify; my problem is trying
to convert from string to binary and back. That is why I included all of the
details, to show I have tried on my own.
Sorry that I forgot to include sys and os for imports.



As it is, you're apparently looping, comparing the byte memory size of a
string (which is typically 4 bytes per character) with the number of
significant bits in an unrelated number.

I suspect what you want is something resembling (untested):

 mybytes = bytes("%x" % data, "ascii")
 newdata = binascii.unhexlify(mybytes)
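For comparison, here is one possible way to do the string-to-binary round trip in Python 3 without going through a single large integer. It is only a sketch: the function names `text_to_bits` and `bits_to_text` are made up for illustration, and UTF-8 is assumed as the encoding. Formatting each byte separately keeps leading zeros, and `len(text.encode())` gives the byte count directly, whereas `sys.getsizeof` reports the memory footprint of the Python object.

```python
def text_to_bits(text, encoding="utf-8"):
    """Return text as a string of '0'/'1' characters, 8 per encoded byte."""
    raw = text.encode(encoding)
    # Formatting each byte on its own preserves leading zeros.
    return "".join(format(byte, "08b") for byte in raw)


def bits_to_text(bits, encoding="utf-8"):
    """Inverse of text_to_bits: rebuild the bytes, then decode them."""
    raw = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return raw.decode(encoding)


bits = text_to_bits("Testing XOR Again!")
print(len(bits))           # 8 bits for each of the 18 encoded bytes
print(bits_to_text(bits))
```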

I was comparing them, but I think I understand how to compare them well;
now I want to convert them both to binary so that I can XOR them together.
Thank you for your time and help Dave, now I need to reply to Ramit.
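If the goal is only to XOR the data with a key, Python 3 can do that on bytes directly, with no binary-string step. The sketch below assumes the key is random and the same length as the message; `xor_bytes` is a name invented here, not something from the thread.

```python
import os


def xor_bytes(data, key):
    """XOR two equal-length bytes objects, byte by byte."""
    if len(data) != len(key):
        raise ValueError("data and key must have the same length")
    # Iterating over bytes yields ints, so ^ works on each pair.
    return bytes(d ^ k for d, k in zip(data, key))


message = "Testing XOR Again!".encode("utf-8")
key = os.urandom(len(message))          # random key, same length as message

scrambled = xor_bytes(message, key)
restored = xor_bytes(scrambled, key)    # XOR with the same key undoes it
print(restored.decode("utf-8"))
```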



--
DaveA









--
Cheers.

Mark Lawrence.




Re: [Tutor] Problem When Iterating Over Large Test Files

2012-07-19 Thread Ryan Waples


 If you copy those files to a different device (one that has just been 
 scrubbed and reformatted), then copy them back and get different results 
 with your application, you've found your problem.

 -Bill

 Thanks for the insistence,  I'll check this out.  If you have any
 guidance on how to do so let me know.  I knew my system wasn't
 particularly well suited to the task at hand, but I haven't seen how
 it would actually cause problems.

 -Ryan
 ___
 The last two lines in my MSG pretty much would be the test. Get another
 flash drive, format it as FAT-32 (I assume that's what you are using), then
 copy a couple of files to it.  Then copy them back to your current device
 and run your program again. If you get DIFFERENT, but still wrong results,
 you've found the problem. The largest value a 32-bit unsigned binary
 number can represent is 2^32 - 1, which is just under 4 GiB.  I'm no expert
 on Windows files, but I'd be very surprised if when the FAT-32 file system
 was being designed, anyone considered the case where a single file could be
 that large.

 -Bill
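The copy-and-compare test described above can also be run without the filtering program, by hashing the original file and the copy and comparing the digests. This is a sketch only: the helper names `file_digest` and `same_contents` are invented here, SHA-256 is an arbitrary choice of hash, and the demonstration uses a small throwaway file rather than the real data.

```python
import hashlib
import os
import shutil
import tempfile


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat for multi-gigabyte files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(path_a, path_b):
    """True if both files hash to the same digest."""
    return file_digest(path_a) == file_digest(path_b)


# Demonstration on a throwaway file; real paths would replace these.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.fastq")
    copy = os.path.join(tmp, "copy.fastq")
    with open(original, "w") as handle:
        handle.write("@id 1:N:0:\nACGT\n+\nIIII\n")
    shutil.copyfile(original, copy)
    copies_match = same_contents(original, copy)

print(copies_match)
```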


The hard drive is formatted as NTFS because, as you say, I'm up against
the file size limit of FAT32. Do you think this could still be the issue?

Re: [Tutor] Problem When Iterating Over Large Test Files

2012-07-19 Thread Ryan Waples

I count only 19 lines.

Yep, you are right. My bad, I think I missed copying line 20.

The first group has only three lines. See below.

Not so: the first group is actually the first four lines listed below.
Lines 1-4 serve as one group. For what it is worth, line four should
have one character for each character in line 2; the first line is much
shorter, contains a space, and for this file always ends in either
"1:N:0:" (keep) or "1:Y:0:" (remove). The EXAMPLE data is correctly
formatted as it should be, but I'm missing line 20.
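The keep/remove rule can be sketched with `str.partition`, which splits the ID line at the first space. The function name `keep_read` is made up for illustration, the accepted flags follow the rewritten code earlier in the thread, and the second header below is a hypothetical variant of the first with the flag changed.

```python
def keep_read(id_line):
    """Return True if the text after the first space marks a read to keep."""
    # partition() returns (before, separator, after); only 'after' matters here.
    name, _space, flags = id_line.partition(" ")
    return flags in ("1:N:0:", "2:N:0:")


kept = keep_read("@HWI-ST0747:167:B02DEACXX:8:1101:3182:167088 1:N:0:")
dropped = keep_read("@HWI-ST0747:167:B02DEACXX:8:1101:3182:167088 1:Y:0:")
print(kept, dropped)
```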

There is a blank line, which I take as NOT part of the input but just a
spacer. Then:

1) Line starting with @
2) Line of bases CGCGT ...
3) Plus sign
4) Line starting with @@@
5) Line starting with @
6) Line of bases TTCTA ...
7) Plus sign

and so on. There are TWO lines before the first +, and three before each
of the others.

I think you are just reading one frame shifted; it's not a well
designed format because the required start character @, can appear
other places as well
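A small illustration of that ambiguity, using shortened, made-up records (the second one has a quality line that begins with @): counting lines that start with @ overcounts the records, while taking every fourth line by position does not.

```python
# Hypothetical, shortened records; the second quality line starts with '@'.
records = [
    "@read1 1:N:0:", "ACGTAC", "+", "IIIIII",
    "@read2 1:N:0:", "TTGACA", "+", "@@@DDA",
    "@read3 1:N:0:", "CCGTTA", "+", "CCCFHI",
]

# Naive approach: treat every line that starts with '@' as a record header.
naive_headers = [line for line in records if line.startswith("@")]

# Positional approach: every fourth line, counting from the first, is a header.
positional_headers = records[0::4]

print(len(naive_headers))       # 4: the '@@@DDA' quality line is miscounted
print(len(positional_headers))  # 3
```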

__EXAMPLE RAW DATA FILE REGION__

@HWI-ST0747:167:B02DEACXX:8:1101:3182:167088 1:N:0:
CGCGTGTGCAGGTTTATAGAACCAGCTGCAGATTAGTAGCAGCGCACGGAGAGGTGTGTCTGTTTATTGTCCTCAGCAGGCAGACATGTTTGTGGTC
+
@@@DDADDHB9+2A??:?G9+C)???G@DB@@DGFB0*?FF?0F:@/54'-;;?B;;6(5@CDAC(5(5:5,(8?88?BC@#
@HWI-ST0747:167:B02DEACXX:8:1101:3134:167090 1:N:0:
TTCTAGTGCAGGGCGACAGCGTTGCGGAGCCGGTCCGAGTCTGCTGGGTCAGTCATGGCTAGTTGGTACTATAACGACACAGGGCGAGACCCAGATGCAAA
+
@CCFFFDFHJJIJHHIIIJHGHIJI@GFFDDDFDDCEEEDCCBDCCCCCB@C(4@ADCA?BBBDDABB055-?AB1:@ACC:
@HWI-ST0747:167:B02DEACXX:8:1101:3002:167092 1:N:0:
CTTTGCTGCAGGCTCATCCTGACATGACCCTCCAGCATGACAATGCCACCAGCCATACTGCTCGTTCTGTGTGTGATTTCCAGCAAGTAAATATGTA
+
CCCFHIJIEHIH@AHFAGHIGIIGGEIJGIJIIIGIIIGEHGEHIIJIEHH@FHGH@=ACEHHFBFFCE@AACCACDB;;B?C3AADBA
@HWI-ST0747:167:B02DEACXX:8:1101:3022:167094 1:N:0:
ATTCCGTGCAGGCCAACTCCCGACGGACATCCTTGCTCAGACTGCAGCGATAGTGGTCGATCAGGGCCCTGTTGTTCCATCCCACTCCGGCGACCAGGTTC
+
CCCFHIDHJIIHIIIJIJIIGGIIFHJIIIIEIFHFFCBAECBDDDC:??B=AAACD?8@:C@?8CBDDD@D99B@3884A
@HWI-ST0747:167:B02DEACXX:8:1101:3095:167100 1:N:0:
CGTGATTGCAGGGACGTTACAGAGACGTTACAGGGATGTTACAGGGACGTTACAGAGACGTTAAAGAGATGTTACAGGGATGTTACAGACAGAGACGTTAC
+

Your code says that the first line in each group should start with an @
sign. That is clearly not the case for the last two groups.

I suggest that your data files have been corrupted.

I'm pretty sure that my raw IN files are all good. It's hard to be sure
with such a large file, but the very picky downstream analysis program
takes every single raw file just fine (30 of them), and gags on my
filtered files, at regions that don't conform to the correct
formatting.

__PYTHON CODE __

I have re-written your code slightly, to be a little closer to best
practice, or at least modern practice. If there is anything you don't
understand, please feel free to ask.

I haven't tested this code, but it should run fine on Python 2.7.

It will be interesting to see if you get different results with this.

--CODE REMOVED--

Thanks for the suggestions. I've never really felt super comfortable
using objects at all, but it's what I want to learn next. This will be
helpful and useful.

    for reads, lines in four_lines( INFILE ):
        ID_Line_1, Seq_Line, ID_Line_2, Quality_Line = lines

Can you explain what is going on here, or point me in the right
direction? I see that the parts of 'lines' get assigned, but I'm
missing how the file gets iterated over and how reads gets
incremented.
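A rough sketch of the mechanics, using an in-memory file so it can be run anywhere. Two assumptions are made here: the generator is wrapped in `enumerate()`, since the loop as quoted would otherwise try to unpack each four-line tuple into just two names, and the helper name `batches_of_four` plus the sample lines are invented for illustration. A file object is its own iterator, so every `next()` call inside the generator consumes one line, and `enumerate()` is what produces the running count.

```python
import io


def batches_of_four(file_object):
    """Yield stripped lines from file_object in tuples of four."""
    while True:
        try:
            first = next(file_object).strip()
        except StopIteration:
            return  # no more lines: end the generator
        # next() with a default pads a short final batch with ''.
        rest = [next(file_object, '').strip() for _ in range(3)]
        yield (first, rest[0], rest[1], rest[2])


sample = io.StringIO("@id1 1:N:0:\nACGT\n+\nIIII\n@id2 1:Y:0:\nTTGA\n+\nHHHH\n")

seen = []
# enumerate() supplies the count; the four-line tuple is unpacked separately.
for reads, lines in enumerate(batches_of_four(sample), 1):
    id_line_1, seq_line, plus_line, quality_line = lines
    seen.append((reads, id_line_1, plus_line))

print(seen)
```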
