Thanks James,
Efficiency is not a huge issue, as I'm implementing a transcoder to a 
more efficient format. So this shall be done only once per file and then 
I will work on the new format.
The final goal is to show that with a more efficient format you can 
greatly speed up manipulation and add plenty of features that are now 
hindered by a non-optimal file format.
I can create a "very" large array of decoded reads and read/write it 
circularly. If a read has a very far mate, it can be treated separately.
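
A minimal sketch of this circular-array idea, in Python with plain tuples
instead of real BAM records. The names Read and pair_reads and the size
parameter are hypothetical, chosen only for illustration. It assumes exactly
two reads per template name, which supplementary alignments can violate. A
read that is overwritten before its mate arrives is moved to a separate
dictionary of far mates.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple, Optional, Tuple


class Read(NamedTuple):
    """Hypothetical stand-in for a decoded BAM record."""
    name: str   # template (query) name shared by both mates
    pos: int    # position of this read
    mpos: int   # position of its mate


def pair_reads(reads: Iterable[Read], size: int = 4) -> Iterator[Tuple[Read, Read]]:
    """Yield (first, second) mates from a position-sorted stream of reads.

    Unmatched reads wait in a fixed-size circular array. When a slot has to
    be reused, the read it held is treated separately as a far mate.
    Reads whose mate never appears are silently dropped in this sketch.
    """
    ring: List[Optional[Read]] = [None] * size  # circular array of waiting reads
    slot_of: Dict[str, int] = {}                # read name -> slot index in ring
    far: Dict[str, Read] = {}                   # reads evicted from the ring
    head = 0                                    # next slot to write

    for read in reads:
        slot = slot_of.pop(read.name, None)
        if slot is not None:
            # Mate is still in the ring: emit the pair and free the slot.
            mate = ring[slot]
            ring[slot] = None
            yield mate, read
        elif read.name in far:
            # Mate was evicted earlier because it was very far away.
            yield far.pop(read.name), read
        else:
            # First read of the template: store it, evicting if needed.
            old = ring[head]
            if old is not None:
                del slot_of[old.name]
                far[old.name] = old
            ring[head] = read
            slot_of[read.name] = head
            head = (head + 1) % size
```

With a small ring, a pair whose mates are many records apart is still
emitted, just later and via the far-mate dictionary. The memory used by that
dictionary is unbounded, so on real data it would need to grow or spill to
disk.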
Anyway, I agree that:

> All that said, the API *ought* to permit people to write algorithms
> involving seeking back and forth.

Claudio

On 08/12/2015 10:42, James Bonfield wrote:
> On Mon, Dec 07, 2015 at 10:47:37PM +0100, Claudio Alberti wrote:
>> I am implementing a parser that is able to read the BAM file in
>> pairs so whenever I read a record where pos < mpos I search for the
>> mate and I create a pair structure.  Once I find the mate I have to
>> roll back to the second read and continue building the pairs.  I
>> could create a circular array of BAM records to be filled
>> sequentially and used then to build pairs, but I cannot be 100% sure
>> that the array is large enough to contain the mate record that might
>> be very far in the sorted BAM.
> That sounds rather like bamcollate from biobambam in intention, but
> not implementation.  I'd be afraid of the huge amount of seeking and
> repeated decoding involved with your method. On deep data your insert
> size distribution may mean, for example, that you average something
> like 100 records between the two ends of a pair.  If that's the case,
> then you end up decoding 100x more data surely?  The extreme cases are
> nasty too, eg where a single template has one end in chr1 and one in
> chr22.  Unless I misunderstand the strategy, it doesn't sound any more
> efficient than a full sort.
>
> bamcollate uses a hash table to collate records together, emitting
> both once the second has been found.  As you say you can't be sure on
> the memory size needed.  You can either just keep growing/shrinking as
> required or spill to disk if it gets too big.
>
> A bigger problem with both methods though is the assumption that there
> are only ever 2 reads involved.  With supplementary reads, this is no
> longer guaranteed (it never really was, but now it's commonplace).
> Possibly we could decode the SA tags etc, but it's all a bit
> cumbersome and it's not mandatory either. IMO it should be mandatory
> for all aligners to fill out the TC aux field indicating the number of
> fragments they emitted if it's not 2 (or 1 for a non-paired library).
>
> James
>
> PS All that said, the API *ought* to permit people to write algorithms
> involving seeking back and forth.
>

-- 
Claudio Alberti
----------------------------------------------
http://gramm.epfl.ch
EPFL SCI STI MM
ELG 140 (ELG Building)
Station 11
CH-1015 Lausanne - Switzerland
Tel. +41 21 6936869
----------------------------------------------


_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help
