Re: [R-sig-phylo] Extracting only FASTA sequences from a FASTQ file

Emmanuel Paradis Mon, 11 Mar 2024 21:32:37 -0700

Hi Jarrett,

First, there are a lot of tools in the bioinformatics community that can 
convert FATSQ files to FASTA far more efficiently than ape does (I think 
cutadapt can do that, but surely many others too). But with a small file like 
this one, ape can do the job :)


See below for specific answers.

----- Le 12 Mar 24, à 10:34, Jarrett Phillips phillipsjarre...@gmail.com a 
écrit :
> Hello,
> 
> I have a FASTQ file from which I would like to extract only the FASTA
> sequences (and not the associated PHRED quality scores).
> 
> For instance, using the file mentioned in the R documentation for
> read.fastq():
> 
> library(ape)
> 
> a <- "
> https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/
> "
> 
> file <- "SRR062641.filt.fastq.gz"
> URL <- paste0(a, file)
> download.file(URL, file)
> 
> fastq <- read.fastq(file) # import FASTQ file
> 
> One can easily extract the quality scores via
> 
> qual <- attr(fastq, "QUAL")
> 
> but there doesn't appear to be an attribute associated with the DNA
> sequences themselves (without the scores appended); only a "class"
> attribute, which evaluates to "DNAbin", is available.

The sequences are in the object; try the command:

base.freq(fastq, TRUE, TRUE)

> I've examined the underlying code for read.fastq() and there is a variable
> called DNA that is created, but it's not what I need:
> 
> function (file, offset = -33)
> {
>    Z <- scan(file, "", sep = "\n", quiet = TRUE)
>    tmp <- Z[c(TRUE, TRUE, FALSE, FALSE)]
>    sel <- c(TRUE, FALSE)
>    tmp[sel] <- gsub("^@", ">", tmp[sel])
>    fl <- tempfile()
>    cat(tmp, file = fl, sep = "\n")
>    DNA <- read.FASTA(fl)
>    tmp <- Z[c(FALSE, FALSE, FALSE, TRUE)]
>    QUAL <- lapply(tmp, function(x) as.integer(charToRaw(x)))
>    if (offset)
>        QUAL <- lapply(QUAL, "+", offset)
>    names(QUAL) <- names(DNA)
>    attr(DNA, "QUAL") <- QUAL
>    DNA
> }
> 
> I then tried the following:
> 
> x <- as.character.DNAbin(fastq) # extract DNA sequences
> y <- as.alignment(x)
> 
> seqs <- y$seq

If you want to do these extra steps, you can write 'y' into a FASTA file with 
(of course you can adapt the file name):

fl <- "titi.fas"
## avoid many copies of the data if you repeat the next command:
if (file.exists(fl)) unlink(fl)
for (i in 1:y$nb)
    cat(">", y$nam[i], "\n", y$seq[i], "\n", sep = "",
        file = fl, append = TRUE)

There is also the function seqinr::write.fasta() to de the same (a bit less 
efficiently I think).

> but 'seqs' can't be written to a file using write.FASTA().

No, instead write directly the 'fastq' object:

write.FASTA(fastq, "toto.fas")

And you can check that both files have the same sequences:

identical(read.FASTA("toto.fas"), read.FASTA("titi.fas"))

For the record, the first file has the sequences in uppercase, the 2nd one in 
lowercase.

Cheers,

Emmanuel

> Any assistance is greatly appreciated.
> 
> Thanks!
> 
> Cheers,
> 
> Jarrett
> 
>       [[alternative HTML version deleted]]
> 
> _______________________________________________
> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Re: [R-sig-phylo] Extracting only FASTA sequences from a FASTQ file

Reply via email to