Thanks James,

efficiency is not a huge issue, as I'm implementing a transcoder to a more efficient format. So this will be done only once per file, and afterwards I will work on the new format. The final goal is to show that with a more efficient format you can greatly speed up manipulation and add plenty of features that are currently hindered by a non-optimal file format. I can create a "very" large array of decoded reads and read/write it circularly. If a read has a very distant mate, it can be treated separately. Anyway, I agree that
All that said, the API *ought* to permit people to write algorithms involving seeking back and forth.

Claudio

On 08/12/2015 10:42, James Bonfield wrote:
> On Mon, Dec 07, 2015 at 10:47:37PM +0100, Claudio Alberti wrote:
>> I am implementing a parser that is able to read the BAM file in
>> pairs so whenever I read a record where pos < mpos I search for the
>> mate and I create a pair structure. Once I find the mate I have to
>> roll back to the second read and continue building the pairs. I
>> could create a circular array of BAM records to be filled
>> sequentially and used then to build pairs, but I cannot be 100% sure
>> that the array is large enough to contain the mate record that might
>> be very far in the sorted BAM.
> That sounds rather like bamcollate from biobambam in intention, but
> not implementation. I'd be afraid of the huge amount of seeking and
> repeated decoding involved with your method. On deep data your insert
> size distribution may mean, for example, that you average something
> like 100 records between the two ends of a pair. If that's the case,
> then you end up decoding 100x more data surely? The extreme cases are
> nasty too, e.g. where a single template has one end in chr1 and one in
> chr22. Unless I misunderstand the strategy, it doesn't sound any more
> efficient than a full sort.
>
> bamcollate uses a hash table to collate records together, emitting
> both once the second has been found. As you say you can't be sure on
> the memory size needed. You can either just keep growing/shrinking as
> required or spill to disk if it gets too big.
>
> A bigger problem with both methods though is the assumption that there
> are only ever 2 reads involved. With supplementary reads, this is no
> longer guaranteed (it never really was, but now it's commonplace).
> Possibly we could decode the SA tags etc, but it's all a bit
> cumbersome and it's not mandatory either. IMO it should be mandatory
> for all aligners to fill out the TC aux field indicating the number of
> fragments they emitted if it's not 2 (or 1 for a non-paired library).
>
> James
>
> PS All that said, the API *ought* to permit people to write algorithms
> involving seeking back and forth.
>

-- 
Claudio Alberti
----------------------------------------------
http://gramm.epfl.ch
EPFL SCI STI MM ELG 140 (ELG Building)
Station 11
CH-1015 Lausanne - Switzerland
Tel. +41 21 6936869
----------------------------------------------

_______________________________________________
Samtools-help mailing list
https://lists.sourceforge.net/lists/listinfo/samtools-help
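
For illustration, the hash-table collation described in the quoted reply (hold each read until a second read with the same name arrives, then emit both) could be sketched roughly as below. This is not bamcollate's actual code: the `Read` tuple and the `collate_pairs` function are hypothetical names, records are reduced to three fields, and the sketch assumes exactly two reads per template, which, as noted in the thread, does not hold once supplementary reads are present.

```python
from collections import namedtuple

# Hypothetical minimal record; a real BAM record carries many more fields.
Read = namedtuple("Read", ["name", "pos", "mpos"])


def collate_pairs(records):
    """Yield (first, second) pairs from a stream of coordinate-sorted reads.

    Each read waits in the `pending` dict, keyed by read name, until a
    second read with the same name is seen. Assumes two reads per name.
    """
    pending = {}
    for rec in records:
        mate = pending.pop(rec.name, None)
        if mate is None:
            # First end of the template: remember it and move on.
            pending[rec.name] = rec
        else:
            # Second end found: emit both, freeing the table entry.
            yield mate, rec
    # Reads whose mate never appeared are emitted alone.
    for rec in pending.values():
        yield rec, None
```

The table only holds reads whose mate has not yet been seen, so its size follows the insert size distribution rather than the file size. A template with ends on distant chromosomes simply stays in the table longer, and no seeking back is needed.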
