I'm currently looking at htslib issues #375 and #380, which both involve 
how htslib downloads remote index files.  For example, running:

samtools faidx \
  ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa \
  chr1:1000000-1000100

results in GRCh38_full_analysis_set_plus_decoy_hla.fa.fai being downloaded 
to the current directory.

I'd like to remove this caching behaviour.  Would anyone object to this? 
For those interested, the reasoning for doing this follows...

There are problems with the current implementation:

* There is a race condition if two processes try to get the index at the 
same time.  The second may read an empty or incomplete index.

* There is no way of checking that the cached index actually corresponds 
to the remote FASTA file.

* There is no easy way of controlling where the file goes (apart from 
changing directory).

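* The file name chosen is just the part of the URL following the last 
slash, which can be a bad choice: two different remote files can map to 
the same local name, and a URL ending in a query string gives an 
unhelpful one.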

* Leaving randomly-named files lying around is inconvenient for some 
users.
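
To make the naming problem concrete, here is a rough sketch of the 
"everything after the last slash" rule described above.  It is an 
illustration in Python, not htslib's actual code; the helper name and the 
example.org URLs are made up for the example.

```python
def cache_name_from_url(url: str) -> str:
    """Return the part of the URL after the last '/' (hypothetical helper)."""
    return url.rsplit("/", 1)[-1]


# Made-up URLs, chosen only to show the two failure modes.
urls = [
    "ftp://example.org/build38/genome.fa.fai",
    "ftp://example.org/build37/genome.fa.fai",  # different file, same local name
    "http://example.org/get.cgi?id=42",         # query string becomes the name
]

for url in urls:
    print(url, "->", cache_name_from_url(url))
```

The first two URLs both map to genome.fa.fai, so whichever index was 
cached first would silently be reused for the other reference.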

Fixing all of the above could be tricky, especially if processes run on 
different hosts and share the same cache via networked file systems. 
It's also now easier for htslib to read the index remotely via its hFILE 
interface.  Given this, the simplest solution would be to make faidx (and 
possibly also tabix) always access the index remotely and not try to save 
a copy for use later.
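
As a sketch of what "read the index remotely and don't keep a copy" means 
in practice, the Python below fetches a .fai file into memory, parses its 
tab-separated columns (name, length, offset, bases per line, bytes per 
line) and works out where a base lives in the FASTA file.  This is not how 
htslib does it (htslib would go through hFILE in C); the function names 
are mine and it assumes the index sits at the FASTA URL plus ".fai".

```python
import urllib.request
from typing import Dict, NamedTuple


class FaiEntry(NamedTuple):
    length: int      # sequence length in bases
    offset: int      # byte offset of the first base in the FASTA file
    line_bases: int  # bases per line
    line_width: int  # bytes per line, including the line terminator


def parse_fai(text: str) -> Dict[str, FaiEntry]:
    """Parse the contents of a .fai file into a dict keyed by sequence name."""
    index = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        length, offset, line_bases, line_width = map(int, fields[1:5])
        index[fields[0]] = FaiEntry(length, offset, line_bases, line_width)
    return index


def load_remote_fai(fasta_url: str) -> Dict[str, FaiEntry]:
    """Fetch <fasta_url>.fai into memory; nothing is written to disk."""
    with urllib.request.urlopen(fasta_url + ".fai") as response:
        return parse_fai(response.read().decode("utf-8"))


def byte_offset(entry: FaiEntry, pos: int) -> int:
    """Byte offset in the FASTA file of the 1-based base position pos."""
    zero_based = pos - 1
    full_lines, within_line = divmod(zero_based, entry.line_bases)
    return entry.offset + full_lines * entry.line_width + within_line
```

With the offsets in hand, a client only needs a ranged read of the FASTA 
itself, so there is no local file to race on, go stale, or clean up.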

Local caching of indexes obtained via HTTP or FTP could be done better 
using an ordinary web proxy.  As part of the work for htslib#375, we could 
also add an option to programs that read .fai files to say where the index 
is.  That would allow users who want to keep local copies to do so, but 
would make them responsible for downloading and naming the file.
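
The lookup rule I have in mind for such an option is simple enough to 
sketch.  The function and parameter names below are hypothetical (no such 
option exists yet, and its eventual name and form are open); the point is 
only that an explicit location wins and the default is the remote 
"<file>.fai", with nothing saved locally.

```python
from typing import Optional


def resolve_index_location(fasta: str, explicit_index: Optional[str] = None) -> str:
    """Decide where to read the .fai from (hypothetical helper).

    If the user named an index, use it as given; they are responsible for
    having downloaded and named it.  Otherwise fall back to "<fasta>.fai",
    which for a remote FASTA means reading the index remotely.
    """
    if explicit_index is not None:
        return explicit_index
    return fasta + ".fai"


# Default: index is read from next to the (possibly remote) FASTA.
print(resolve_index_location("ftp://example.org/ref/genome.fa"))
# User-managed local copy.
print(resolve_index_location("ftp://example.org/ref/genome.fa", "/data/idx/genome.fa.fai"))
```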

Please reply with any comments you have.  The results of the discussion 
will go into the next release of htslib.

The two issues on GitHub are:
https://github.com/samtools/htslib/issues/375
https://github.com/samtools/htslib/issues/380

Rob Davies              [email protected]
The Sanger Institute    http://www.sanger.ac.uk/
Hinxton, Cambs.,        Tel. +44 (1223) 834244
CB10 1SA, U.K.          Fax. +44 (1223) 494919


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help
