On 16 Apr 2018, at 15:55, Heredia Genestar, Txema <[email protected]> wrote:
> HTSLIB 1.5 and 1.8 both have a bug in tabix. It ignores the option for
> 0-based input files.
You are right that there is awkwardness here, but the full story is rather more
involved.
The -0 option is functioning as designed. Like the other options specifying the
columns of interest within the data file to be indexed (-b, -e, -p, etc), -0 is
only effective when tabix is being used to create an index for a data file, not
when querying via an existing index.
It's fairly clear in the documentation that -0 doesn't affect the
interpretation of *query* coordinates, whether in a -R file or as command-line
arguments.
> Using tabix -0 -R returns 3 entries. One snp (17491115) for the first bed
> line (17491114-17491115) and two (17491115 & 17491116) for the second bed
> line:
>
> $tabix -0 vcf -R <(echo -e
> "chr22\t17491114\t17491115\nchr22\t17491115\t17491116") | awk -v OFS="\t"
> '{print $1,$2,$3,$4,$5,$6,$7}'
> chr22 17491115 . C T 1189.33 PASS
> chr22 17491115 . C T 1189.33 PASS
> chr22 17491116 . G A 16942.9 PASS
Here -0 is ignored as it is an indexing option (so as tabix currently stands,
it would be good if trying to use -0 or other indexing-specific options during
querying produced an error message).
However if you store your -R coordinates in a file named test.bed, then "tabix
-R test.bed vcf" will work as you expect and return two lines. This is as
documented, as -R interprets the file's contents as BED or as 1-based
TAB-delimited depending on the filename extension. But there's no way to
effectively control the filename (if there even is any!) when you're using <()
or pipes or etc.
This is a pretty good demonstration of why having tools decide how to interpret
ambiguous [1] file contents based solely on filename extensions is unwise. :-)
So for your use case, it would be good if tabix querying had an option to force
interpretation of the -R file as BED. (Or if BED were the default here, but
that ship has sadly sailed.) It would probably be least confusing if the name
for that option was also -0. So the awkwardness you've encountered could be
fixed by having tabix also use -0 as a querying option:
-0, --zero-based, --bed treat -R (and -T?) file as a BED file
Thus, as an addition feature, tabix would also understand -0 when querying with
a -R file.
> The second position returns two variants with/without -0, even when giving
> the positions as string:
>
> $tabix -0 vcf chr22:17491115-17491116 | awk -v OFS="\t" '{print
> $1,$2,$3,$4,$5,$6,$7}'
> chr22 17491115 . C T 1189.33 PASS
> chr22 17491116 . G A 16942.9 PASS
>
> $ tabix vcf chr22:17491115-17491116 | awk -v OFS="\t" '{print
> $1,$2,$3,$4,$5,$6,$7}'
> chr22 17491115 . C T 1189.33 PASS
> chr22 17491116 . G A 16942.9 PASS
OTOH throughout samtools and bcftools this SEQ:START-END way of specifying a
range of coordinates is always used as a human-readable textual range, so it
would be confusing to interpret it other than as a 1-based inclusive range. So
I would not recommend having -0 affect the interpretation of this
human-readable syntax.
John
[1] BED and "1-based TAB-delimited" look identical.
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help