Package: wnpp
Severity: wishlist
Subject: ITP: pufferfish -- An efficient index for the colored, compacted, de Bruijn graph
Owner: Michael R. Crusoe <michael.cru...@gmail.com>

* Package name    : pufferfish
  Version         : 1.0.0
  Upstream Author : 2016 Rob Patro, Avi Srivastava, Hirak Sarkar
* URL             : https://github.com/COMBINE-lab/pufferfish
* License         : GPL-3+
  Programming Lang: C
  Description     : An efficient index for the colored, compacted, de Bruijn graph

 Pufferfish is a new time- and memory-efficient data structure for indexing
 a compacted, colored de Bruijn graph (ccdBG).
 .
 Though the de Bruijn Graph (dBG) has enjoyed tremendous popularity as an
 assembly and sequence comparison data structure, it has only relatively
 recently begun to see use as an index of the reference sequences (e.g.
 deBGA, kallisto). In particular, these tools index the compacted dBG
 (cdBG), in which all non-branching paths are collapsed into individual
 nodes and labeled with the string they spell out. This data structure is
 particularly well-suited for representing repetitive reference sequences,
 since a single contig in the cdBG represents all occurrences of the
 repeated sequence. The original positions in the reference can be
 recovered with the help of an auxiliary "contig table" that maps each
 contig to the reference sequence, position, and orientation where it
 appears as a substring. The deBGA paper has a nice description of how this
 kind of index looks (they call it a unipath index, because the contigs we
 index are unitigs in the cdBG), and of how all the pieces fit together to
 resolve the queries we care about.
 .
 Moreover, the cdBG can be built on multiple reference sequences
 (transcripts, chromosomes, genomes), where each reference is given a
 distinct color (or colour, if you're of the British persuasion). The
 resulting structure, which also encodes the relationships between the
 cdBGs of the underlying reference sequences, is called the compacted,
 colored de Bruijn graph (ccdBG).
 .
 This is not, of course, the only variant of the dBG that has proven useful
 from an indexing perspective.
 The (pruned) dBG has also proven useful as a graph upon which to build a
 path index of arbitrary variation / sequence graphs, which has enabled
 very interesting and clever indexing schemes like that adopted in GCSA2.
 Also, thinking about sequence search in terms of the dBG has led to
 interesting representations for variation-aware sequence search backed by
 indexes like the vBWT (implemented in the excellent gramtools package).

Remark: This package is maintained by Debian Med Packaging Team at
   https://salsa.debian.org/med-team/pufferfish

This package will be team maintained by Debian-Med
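
For readers unfamiliar with the "contig table" mentioned in the description, here is a rough toy sketch of the idea in Python. It is not pufferfish's implementation or on-disk format. The function names, the sample contig and the two reference names (tx1, tx2) are made up for illustration, and the contig is simply assumed to be a unitig rather than derived from a real compacted dBG. The table records, for each contig, every reference, position and orientation where it occurs, and a hit inside a contig is then projected back to reference coordinates.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_all(text, pattern):
    """Yield every (possibly overlapping) start offset of pattern in text."""
    start = text.find(pattern)
    while start != -1:
        yield start
        start = text.find(pattern, start + 1)


def build_contig_table(contigs, references):
    """Map each contig id to a list of (reference, position, orientation).

    Hypothetical helper: a real index derives this while building the
    graph instead of scanning every reference for every contig.
    """
    table = defaultdict(list)
    for contig_id, contig in contigs.items():
        rc = reverse_complement(contig)
        for ref_name, ref_seq in references.items():
            for pos in find_all(ref_seq, contig):
                table[contig_id].append((ref_name, pos, "+"))
            if rc != contig:  # avoid reporting palindromic contigs twice
                for pos in find_all(ref_seq, rc):
                    table[contig_id].append((ref_name, pos, "-"))
    return dict(table)


def locate(contig_id, offset, length, contigs, table):
    """Project a hit of `length` bases at `offset` in a contig onto the references."""
    contig_len = len(contigs[contig_id])
    hits = []
    for ref_name, pos, orientation in table.get(contig_id, []):
        if orientation == "+":
            hits.append((ref_name, pos + offset, "+"))
        else:
            # On the reverse strand the contig is read right to left.
            hits.append((ref_name, pos + contig_len - offset - length, "-"))
    return hits


# Toy data (assumed, not taken from any real dataset).
contigs = {0: "GATTACA"}
references = {"tx1": "CCGATTACATT", "tx2": "AATGTAATCGG"}

table = build_contig_table(contigs, references)
# table == {0: [("tx1", 2, "+"), ("tx2", 2, "-")]}

# The 4-mer "TACA" sits at offset 3 of contig 0.
hits = locate(0, 3, 4, contigs, table)
# hits == [("tx1", 5, "+"), ("tx2", 2, "-")]
```

The single contig stands in for both occurrences of the repeated sequence, which is the property the description highlights. Only the table grows with the number of occurrences.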