That opening-the-entire-file inefficiency in our sam_to_bam.py was
fixed quite some time ago (4/30/2010, in 3724:007f175c8b88). It has
used getsize since then. It's really just intended to catch weird
errors that don't throw an actual error (also, I think that if the sam
file has no header, no info would be output into the bam file). It
seems somewhat more informative to tell the user the file is empty
rather than just "successfully" output an empty file.
Kelly
On Apr 7, 2011, at 1:05 AM, Assaf Gordon wrote:
Just another example why python's misleadingly simple idioms are
quite dangerous in production code (couldn't help myself from
teasing about python... sorry about that).
Seems like line 150 in "sam_to_bam.py" tries to read the entire BAM
file into memory just to find out if it's empty or not...
As a stop gap solution with minimal changes, change line 150 from:
if len( open( tmp_aligns_file_name ).read() ) == 0:
to
if len( open( tmp_aligns_file_name ).read(10) ) == 0:
Which will read up to the first 10 bytes (instead of the entire file).
A slightly better (but still wrong) solution is to simply check the
file size, with:
if os.path.getsize(tmp_aligns_file_name) == 0:
But it's still wrong because even an invalid sam file will create a
non-empty BAM file (when using "samtools view -bt") - the BAM file
will still contain the chromosome names and sizes.
Example:
========
$ cat mm9.fa.fai
chr1 197195432 6 50 51
chr10 129993255 201139354 50 51
chr11 121843856 333732482 50 51
chr12 121257530 458013223 50 51
chr13 120284312 581695911 50 51
chr13_random 400311 704385924 50 51
chr14 125194864 704794249 50 51
chr15 103494974 832493018 50 51
...
...
$ cat 1.sam
Hello World
This is not a SAM file
$ samtools view -bt mm9.fa.fai -o 1.bam 1.sam
[sam_header_read2] 35 sequences loaded.
[sam_read1] reference 'This is not a SAM file' is recognized as '*'.
[main_samview] truncated file.
$ ls -l 1.*
-rw-r--r-- 1 gordon hannon 348 Apr 7 00:57 1.bam
-rw-r--r-- 1 gordon hannon 35 Apr 7 00:57 1.sam
========
So in short, this whole sam-to-bam wrapper tool is not suitable for
large SAM files (if they don't fit entirely in memory), and not for
error checking of invalid SAM files.
-gordon
On 04/07/2011 12:30 AM, Ryan Golhar wrote:
Here's what I get:
(galaxy_env)[galaxy@vail pbs]$ sh ./big.sh
Samtools Version: 0.1.14 (r933:170)
Traceback (most recent call last):
File "/home/galaxy/galaxy-dist/tools/samtools/sam_to_bam.py", line
150,
in __main__
if len( open( tmp_aligns_file_name ).read() ) == 0:
MemoryError
Error extracting alignments from
(/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat),
(galaxy_env)[galaxy@vail pbs]$
On 4/6/11 7:29 PM, Assaf Gordon wrote:
Ryan,
Since we're shooting in the dark here, best to try and understand
what's the exception.
Add the following line to the beginning of "sam_to_bam.py":
import traceback
and add the following line to "sam_to_bam.py" line 156 (before the
call to "stop_err"):
traceback.print_exc()
Hopefully this will print out which exception you're getting, and
where is it thrown from.
-gordon
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client. To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
http://lists.bx.psu.edu/
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client. To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
http://lists.bx.psu.edu/