I'm currently looking at htslib issues #375 and #380, which both involve how htslib downloads remote index files. For example, running:
    samtools faidx ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa chr1:1000000-1000100

results in GRCh38_full_analysis_set_plus_decoy_hla.fa.fai being downloaded to the current directory.

I'd like to remove this caching behaviour. Would anyone object to this?

For those interested, the reasoning for doing this follows...

There are problems with the current implementation:

* There is a race condition if two processes try to get the index at the same time. The second may read an empty or incomplete index.

* There is no way of checking that the cached index actually corresponds to the remote fasta file.

* There is no easy way of controlling where the file goes (apart from changing directory).

* The file name chosen is just the part of the URL following the last slash, which can sometimes be a bad choice.

* Leaving randomly-named files lying around is inconvenient for some users.

Fixing all of the above could be tricky, especially if processes run on different hosts and share the same cache via networked file systems. It's also now easier for htslib to read the index remotely via its hFILE interface. Given this, the simplest solution would be to make faidx (and possibly also tabix) always access the index remotely and not try to save a copy for use later. Local caching of indexes obtained via http or ftp could be done better using an ordinary web proxy.

As part of the work for htslib#375, we could also add an option to programs that read .fai files to say where the index is. That would allow users who want to keep local copies to do so, but would make them responsible for downloading and naming the file.

Please reply with any comments you have. The results of the discussion will go into the next release of htslib.
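To make the file-naming problem concrete, here is a minimal sketch in Python of the rule described above (keep only the part of the URL after the last slash). It is not htslib's code: the helper `naive_cache_name` and the example.org URLs are made up for illustration. It shows two different remote references mapping to the same local index name, so one cached index could silently be reused for the other.

```python
def naive_cache_name(fasta_url: str) -> str:
    """Hypothetical stand-in for the naming rule: take everything after
    the last '/' in the URL and append the '.fai' index suffix."""
    return fasta_url.rsplit("/", 1)[-1] + ".fai"


if __name__ == "__main__":
    # Both URLs are invented; they differ only in the directory part.
    urls = [
        "ftp://example.org/GRCh37/genome.fa",
        "ftp://example.org/GRCh38/genome.fa",
    ]
    names = [naive_cache_name(u) for u in urls]
    for url, name in zip(urls, names):
        print(f"{url} -> ./{name}")
    if len(set(names)) < len(names):
        print("collision: different remote files share one cached index name")
```

Nothing in the name records which remote file the index came from, which is also why the cached copy cannot be checked against the remote fasta file.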
The two issues on GitHub are:

https://github.com/samtools/htslib/issues/375
https://github.com/samtools/htslib/issues/380

Rob Davies             [email protected]
The Sanger Institute   http://www.sanger.ac.uk/
Hinxton, Cambs.,       Tel. +44 (1223) 834244
CB10 1SA, U.K.         Fax. +44 (1223) 494919
