Package: wnpp
Severity: wishlist
Subject: ITP: clark -- accurate and versatile classification of biological sequences
Owner: Steffen Moeller <moel...@debian.org>

* Package name    : clark
  Version         : 1.2.6.1
  Upstream Author : Rachid Ounit <rouni...@cs.ucr.edu>
* URL             : http://clark.cs.ucr.edu/
* License         : GPL-3.0+
  Programming Lang: (C, C++, C#, Perl, Python, etc.)
  Description     : accurate and versatile classification of biological sequences

 The problem of DNA sequence classification is central to several
 application domains in molecular biology, genomics, metagenomics and
 genetics. Although several software tools have been developed for this
 problem, it is still computationally challenging due to the size of
 datasets generated by modern sequencing instruments and the growing
 size of reference sequence databases.
 .
 CLARK performs supervised sequence classification using discriminative
 k-mers. It addresses two specific classification problems (see the
 article for details), namely (1) the taxonomic classification of
 metagenomic reads to known bacterial genomes, and (2) the assignment of
 BAC clones and transcripts to chromosome arms/centromeres (in the
 absence of a finished assembly for the reference genome). On both,
 CLARK aims to outperform the best state-of-the-art methods in
 classification speed and precision.
 .
 Three classifiers from the CLARK framework are provided:
 .
  * CLARK (default): designed for powerful workstations, it can require
    a significant amount of RAM to run with a large database (e.g., all
    bacterial genomes from NCBI/RefSeq). This classifier is the standard
    in the CLARK tool series. It builds discriminative k-mers from all
    k-mers in the targets, queries k-mers with exact matching, and, in
    its fastest mode, classifies 1 million short reads in a few seconds.
  * CLARK-l: designed for workstations with limited memory ("l" for
    light), this tool provides precise classification on small
    metagenomes. For metagenomics analysis, CLARK-l works with a sparse
    or "light" database (up to 4 GB of RAM) while still producing fast
    and accurate results. This classifier builds discriminative k-mers
    from non-overlapping and distant k-mers in the targets and queries
    k-mers with exact matching.
  * CLARK-S: designed for powerful workstations and exploiting spaced
    k-mers ("S" for spaced), this classifier requires more RAM than
    CLARK or CLARK-l, but it offers a higher sensitivity than CLARK at
    the species level (see the peer-reviewed publication in
    Bioinformatics). CLARK-S completes the series of classifiers from
    the CLARK framework.
 .
 Other applications of CLARK include, for example, the detection of
 contaminants and the identification of chimerism and vector
 contamination in sequenced BACs (cf. the "Overview" tab).

Remark: This package is maintained by Steffen Moeller at
   https://salsa.debian.org/med-team/clark
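
For readers unfamiliar with the approach, the idea behind the default classifier described above (discriminative k-mers built from all k-mers in the targets, then queried with exact matching) can be illustrated with a minimal Python sketch. This is not CLARK's implementation: the function names, the toy sequences and the simple majority vote are assumptions made for illustration, and the sketch ignores reverse complements, confidence scores and any memory-efficient storage.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_discriminative_index(targets, k):
    """Map each k-mer that occurs in exactly one target to that target's label.

    `targets` is a dict {label: reference sequence} (hypothetical input format).
    """
    owner = {}      # k-mer -> label, for k-mers seen in a single target so far
    shared = set()  # k-mers seen in two or more targets (not discriminative)
    for label, seq in targets.items():
        for kmer in set(kmers(seq, k)):
            if kmer in shared:
                continue
            if kmer in owner and owner[kmer] != label:
                # Seen in another target already: no longer discriminative.
                del owner[kmer]
                shared.add(kmer)
            else:
                owner[kmer] = label
    return owner


def classify(read, index, k):
    """Assign a read to the target with the most exact k-mer hits.

    Returns None when no k-mer of the read is discriminative.
    Ties go to the label that was hit first.
    """
    hits = Counter(index[km] for km in kmers(read, k) if km in index)
    if not hits:
        return None
    return hits.most_common(1)[0][0]


# Toy example with two made-up reference sequences.
targets = {"A": "ACGTACGTTTGA", "B": "GGCATTTGACCA"}
index = build_discriminative_index(targets, k=4)

print(classify("GTACGT", index, k=4))
print(classify("CATTTGAC", index, k=4))
print(classify("TTTGA", index, k=4))
```

In this toy example the k-mers TTTG and TTGA occur in both references, so they are dropped from the index; a read made up only of such shared k-mers stays unclassified.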