Thank you all for your help. I'll keep an eye on 
https://github.com/samtools/htslib/issues/45
Stathis

-----Original Message-----
From: John Marshall [mailto:[email protected]] 
Sent: 14 May 2015 11:46
To: Kanterakis, Efstathios
Cc: [email protected]
Subject: Re: [Samtools-help] tabix bug on cat'ed vcf.gz

On 13 May 2015, at 12:17, Kanterakis, Efstathios <[email protected]> 
wrote:
> bgzip chr1_h.vcf
> bgzip chr2.vcf
> cat chr1_h.vcf.gz chr2.vcf.gz > test.vcf.gz

...i.e., constructs test.vcf.gz with many BGZF blocks, including an EOF trailer 
block from each of chr1_h.vcf.gz (in the middle of test.vcf.gz) and chr2.vcf.gz 
(at the end of test.vcf.gz).
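
For readers who want to see those EOF blocks for themselves, here is a minimal Python sketch that walks the BGZF blocks in a buffer and reports where the empty EOF marker blocks sit. It is not htslib code. The helper names (make_bgzf_block, iter_bgzf_blocks, eof_block_offsets) are made up for illustration, and the block layout follows my reading of the BGZF section of the SAM/BAM specification.

```python
import struct
import zlib

# The 28-byte empty BGZF block that bgzip appends as an EOF marker.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43"
    " 02 00 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def make_bgzf_block(data: bytes) -> bytes:
    """Build one BGZF block holding data (hypothetical helper for the demo)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate stream
    cdata = comp.compress(data) + comp.flush()
    bsize = 18 + len(cdata) + 8 - 1  # total block length minus one
    # Fixed gzip header: magic, CM=8, FLG=4 (extra field), MTIME, XFL, OS, XLEN=6
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    # 'BC' extra subfield carrying BSIZE
    extra = struct.pack("<BBHH", 0x42, 0x43, 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(data), len(data))
    return header + extra + cdata + trailer


def iter_bgzf_blocks(buf: bytes):
    """Yield (offset, block_length, uncompressed_size) for each block in buf."""
    pos = 0
    while pos < len(buf):
        if buf[pos:pos + 4] != b"\x1f\x8b\x08\x04":
            raise ValueError(f"not a BGZF block at offset {pos}")
        xlen = struct.unpack_from("<H", buf, pos + 10)[0]
        # Scan the gzip extra subfields for the 'BC' entry that holds BSIZE.
        extra_pos, extra_end, bsize = pos + 12, pos + 12 + xlen, None
        while extra_pos + 4 <= extra_end:
            si1, si2, slen = struct.unpack_from("<BBH", buf, extra_pos)
            if (si1, si2, slen) == (0x42, 0x43, 2):
                bsize = struct.unpack_from("<H", buf, extra_pos + 4)[0]
            extra_pos += 4 + slen
        if bsize is None:
            raise ValueError(f"no BC subfield in block at offset {pos}")
        block_len = bsize + 1
        isize = struct.unpack_from("<I", buf, pos + block_len - 4)[0]
        yield pos, block_len, isize
        pos += block_len


def eof_block_offsets(buf: bytes):
    """Return the offsets of all blocks identical to the EOF marker."""
    return [off for off, size, _ in iter_bgzf_blocks(buf)
            if buf[off:off + size] == BGZF_EOF]


# Mimic `cat chr1_h.vcf.gz chr2.vcf.gz`: each input ends in its own EOF block.
first = make_bgzf_block(b"chr1\t100\n") + BGZF_EOF
second = make_bgzf_block(b"chr2\t200\n") + BGZF_EOF
combined = first + second
print(eof_block_offsets(combined))
```

On a real file one would read the bytes from disk instead of building them in memory; the point is only that the concatenated file carries one EOF block per input file, and the first of them sits in the middle.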

> tabix test.vcf.gz    # <--
> tabix test.vcf.gz chr2 # blank
> tabix test.vcf.gz chr1 # works
> [...]
> I was under the impression that bgzipped files are directly cat'able. Is this 
> a bug?

As Len suspected, the tabix index command (marked <--) is stopping at the EOF 
trailer block at the end of chr1_h.vcf.gz.  This is an htslib bug: 
https://github.com/samtools/htslib/issues/45 .
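
As a rough illustration of the failure mode (not htslib's actual logic), the Python sketch below decompresses concatenated gzip members either to the true end of the data or only up to the first empty member. Plain gzip members stand in for BGZF blocks here, which is an assumption made to keep the example short; an empty member plays the role of the EOF trailer block, and read_members is a made-up name.

```python
import gzip
import zlib


def read_members(buf: bytes, stop_at_empty: bool) -> bytes:
    """Decompress concatenated gzip members from buf.

    With stop_at_empty=True the loop gives up at the first member that
    decompresses to nothing, mimicking a reader that treats an empty
    EOF block as the end of the file.
    """
    out = []
    while buf:
        member = zlib.decompressobj(31)  # wbits=31 selects the gzip container
        chunk = member.decompress(buf)
        if stop_at_empty and not chunk:
            break
        out.append(chunk)
        buf = member.unused_data  # bytes following the member just decoded
    return b"".join(out)


# Each "file" is one data member followed by an empty EOF-like member.
chr1 = gzip.compress(b"chr1\t100\n") + gzip.compress(b"")
chr2 = gzip.compress(b"chr2\t200\n") + gzip.compress(b"")
combined = chr1 + chr2

print(read_members(combined, stop_at_empty=False))
print(read_members(combined, stop_at_empty=True))
```

The reader that stops early never sees the chr2 records, which matches the blank result of the chr2 query above.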

See http://sourceforge.net/p/samtools/mailman/message/33493929/ for further 
background.  Nobody considered these EOF blocks and concatenation of bgzipped 
files until rather late in the piece, and both htslib/samtools and 
htsjdk/Picard still have bugs that mean they stop reading at these EOF blocks 
in various circumstances.  The fact that this doesn't cause chaos shows how 
rare this is in practice, and is a large part of the reason why these bugs have 
not been fixed.

Thanks for the IMHO rather plausible use case!  I mostly fixed this in htslib a 
while back, but stopped as the expected utility did not seem to outweigh the 
risk of screwing up error handling in the code in question.  Plausible use 
cases change that calculus.

On 13 May 2015, at 13:52, Peter Cock <[email protected]> wrote:
> Second, some tools fail
> to cope with concatenated gzip blocks (e.g. some Java libraries break).

This is a separate concern and is not in play here.  Any sizeable bgzipped file 
is already itself a bunch of concatenated gzip/BGZF blocks, so catting two of 
them together makes no difference to the Java library problem.

    John

