Hi,

Could someone who knows more about sequences please have a look and
fix the autopkgtest for kraken and kraken2 (including providing proper
test sequences)?

Thanks a lot

      Andreas.

----- Forwarded message from Derrick Wood <notificati...@github.com> -----

Date: Thu, 03 Oct 2019 09:42:35 -0700
From: Derrick Wood <notificati...@github.com>
To: DerrickWood/kraken2 <krak...@noreply.github.com>
Cc: Andreas Tille <ti...@debian.org>, Mention <ment...@noreply.github.com>
Subject: Re: [DerrickWood/kraken2] Problems opening usual fasta files (#141)

Hi Andreas,

Looking at the code and the example files, it does not appear that this is a 
FASTA parsing issue, but rather an issue with parsing individual items of the 
sequence ID header and their suitability within Kraken. It looks as if the 
FASTA headers only have GI numbers in them. These are no longer acceptable for 
use in Kraken (or Kraken 2) for aiding taxonomy lookups, due to NCBI's move 
away from GI numbers. The patch you used caused the sequence ID to become only 
a number (e.g., "441431932"), which was interpreted as a taxid by the 
kraken2lib::check_seqid() subroutine. Had the test actually tried to classify 
the reads, it would have found the taxids to be incorrect.

The error I'm seeing when I try to use the scan_fasta_file.pl script to examine 
your test FASTAs is:

    scan_fasta_file.pl: unable to determine taxonomy ID for sequence gi|441431932|

The sequence ID is being parsed correctly by the script, but with the move away 
from GI numbers, the sequence ID now lacks any viable token for aiding taxonomy 
ID lookup. Acceptable replacements are either an explicit taxid in the sequence 
ID (e.g., `>9606` or `>humanseq|kraken:taxid|9606`) or an accession number 
(e.g., `>NC_230938.1`).
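
For anyone preparing replacement test sequences, the header forms above can be checked with a small script before running the build. This is only a rough sketch: the regular expressions are illustrative approximations of the forms listed here, not a copy of what `kraken2lib::check_seqid()` does, and the helper names `classify_header` and `tag_with_taxid` are made up for this example. The taxid passed to `tag_with_taxid` has to come from you; the script does not look anything up.

```python
import re

# Illustrative patterns only (assumptions): they approximate the header forms
# named in this message and are not taken from the Kraken 2 source code.
TAXID_TOKEN = re.compile(r"(?:^|\|)kraken:taxid\|(\d+)(?:\||$)")
BARE_NUMBER = re.compile(r"^\d+$")
GI_ONLY = re.compile(r"^gi\|\d+\|?$")
ACCESSION = re.compile(r"^[A-Za-z]+_?[A-Za-z0-9]+\.\d+$")


def classify_header(header_line):
    """Return a rough label for the sequence ID on a FASTA header line."""
    fields = header_line.lstrip(">").split()
    seq_id = fields[0] if fields else ""
    if TAXID_TOKEN.search(seq_id) or BARE_NUMBER.match(seq_id):
        # A bare number is read as a taxid, which is why a header reduced
        # to "441431932" was silently accepted with the wrong meaning.
        return "explicit taxid"
    if GI_ONLY.match(seq_id):
        return "gi only"
    if ACCESSION.match(seq_id):
        return "accession"
    return "unknown"


def tag_with_taxid(header_line, taxid):
    """Append a kraken:taxid token to the sequence ID, keeping any description."""
    parts = header_line.lstrip(">").rstrip("\n").split(maxsplit=1)
    if not parts:
        raise ValueError("empty FASTA header")
    seq_id = parts[0].rstrip("|")
    rest = " " + parts[1] if len(parts) > 1 else ""
    return f">{seq_id}|kraken:taxid|{int(taxid)}{rest}"


if __name__ == "__main__":
    for line in (">9606", ">humanseq|kraken:taxid|9606",
                 ">NC_230938.1", ">gi|441431932|"):
        print(line, "->", classify_header(line))
```

Headers reported as "gi only" are the ones that need either an accession number or an explicit taxid, for example via `tag_with_taxid(">humanseq", 9606)`, which yields `>humanseq|kraken:taxid|9606`.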

In short, the test is failing because it is no longer appropriate. The Kraken 2 
test should also have the `--minimizer-len 5` removed because minimizer 
behavior is different in Kraken 2 vs. Kraken 1. In K1, the length of minimizers 
governed the size of the `database.idx` file, which would be rather large (8 
GB) by default, so changing the minimizer length for a test like this made 
sense. In Kraken 2, no such index exists. Removing the `--minimizer-len 5` 
should allow the K2 test to work without needing to comment out the 
build/classification commands.

-- 
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
https://github.com/DerrickWood/kraken2/issues/141#issuecomment-538026447

----- End forwarded message -----

-- 
http://fam-tille.de
