On Mon, Dec 07, 2015 at 10:47:37PM +0100, Claudio Alberti wrote:
> I am implementing a parser that is able to read the BAM file in
> pairs so whenever I read a record where pos < mpos I search for the
> mate and I create a pair structure. Once I find the mate I have to
> roll back to the second read and continue building the pairs. I
> could create a circular array of BAM records to be filled
> sequentially and used then to build pairs, but I cannot be 100% sure
> that the array is large enough to contain the mate record that might
> be very far in the sorted BAM.

That sounds rather like bamcollate from biobambam in intention, but
not in implementation.

I'd be afraid of the huge amount of seeking and repeated decoding
involved with your method. On deep data your insert size distribution
may mean, for example, that you average something like 100 records
between the two ends of a pair. If that's the case, then surely you
end up decoding 100x more data? The extreme cases are nasty too,
e.g. where a single template has one end in chr1 and one in chr22.
Unless I misunderstand the strategy, it doesn't sound any more
efficient than a full sort.

bamcollate uses a hash table to collate records together, emitting
both once the second has been found. As you say, you can't be sure of
the memory size needed. You can either just keep growing/shrinking as
required or spill to disk if it gets too big.

A bigger problem with both methods though is the assumption that
there are only ever 2 reads involved. With supplementary reads, this
is no longer guaranteed (it never really was, but now it's
commonplace). Possibly we could decode the SA tags etc, but it's all
a bit cumbersome and they're not mandatory either. IMO it should be
mandatory for all aligners to fill out the TC aux field indicating
the number of fragments they emitted if it's not 2 (or 1 for a
non-paired library).

James

PS All that said, the API *ought* to permit people to write
algorithms involving seeking back and forth.

-- 
James Bonfield ([email protected]) | Hora aderat briligi. Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
  A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi.
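
For illustration, the hash-table collation described above might be
sketched roughly as follows. This is not bamcollate's actual code: the
Record type and the collate() function are hypothetical stand-ins for
decoded BAM records, the expected fragment count is taken from the TC
tag when present and otherwise assumed to be 2 (or 1 for unpaired
reads), and spilling to disk is left out entirely.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple, Optional


class Record(NamedTuple):
    # Hypothetical stand-in for a decoded BAM record; not an htslib type.
    qname: str
    pos: int
    paired: bool = True
    tc: Optional[int] = None  # TC aux tag value, if the aligner wrote one


def collate(records: Iterable[Record]) -> Iterator[List[Record]]:
    """Group records by read name in a single pass, with no seeking."""
    pending: Dict[str, List[Record]] = {}
    for rec in records:
        group = pending.setdefault(rec.qname, [])
        group.append(rec)
        # Assumption: trust TC if set, else 2 for paired and 1 for unpaired.
        if rec.tc is not None:
            expected = rec.tc
        else:
            expected = 2 if rec.paired else 1
        if len(group) >= expected:
            # All expected fragments seen: emit and free the memory.
            yield pending.pop(rec.qname)
    # Whatever is left never found its mates (orphans or missing TC).
    for group in pending.values():
        yield group


if __name__ == "__main__":
    reads = [Record("a", 10), Record("b", 20), Record("b", 120), Record("a", 500)]
    for template in collate(reads):
        print([(r.qname, r.pos) for r in template])
```

The table only holds templates whose mates have not yet been seen, so
memory use follows the insert size distribution rather than the file
size; the chr1/chr22 case is what makes the worst case unbounded.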
