Hi Jarrett, First, there are a lot of tools in the bioinformatics community that can convert FATSQ files to FASTA far more efficiently than ape does (I think cutadapt can do that, but surely many others too). But with a small file like this one, ape can do the job :)
See below for specific answers. ----- Le 12 Mar 24, à 10:34, Jarrett Phillips phillipsjarre...@gmail.com a écrit : > Hello, > > I have a FASTQ file from which I would like to extract only the FASTA > sequences (and not the associated PHRED quality scores). > > For instance, using the file mentioned in the R documentation for > read.fastq(): > > library(ape) > > a <- " > https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/ > " > > file <- "SRR062641.filt.fastq.gz" > URL <- paste0(a, file) > download.file(URL, file) > > fastq <- read.fastq(file) # import FASTQ file > > One can easily extract the quality scores via > > qual <- attr(fastq, "QUAL") > > but there doesn't appear to be an attribute associated with the DNA > sequences themselves (without the scores appended); only a "class" > attribute, which evaluates to "DNAbin", is available. The sequences are in the object; try the command: base.freq(fastq, TRUE, TRUE) > I've examined the underlying code for read.fastq() and there is a variable > called DNA that is created, but it's not what I need: > > function (file, offset = -33) > { > Z <- scan(file, "", sep = "\n", quiet = TRUE) > tmp <- Z[c(TRUE, TRUE, FALSE, FALSE)] > sel <- c(TRUE, FALSE) > tmp[sel] <- gsub("^@", ">", tmp[sel]) > fl <- tempfile() > cat(tmp, file = fl, sep = "\n") > DNA <- read.FASTA(fl) > tmp <- Z[c(FALSE, FALSE, FALSE, TRUE)] > QUAL <- lapply(tmp, function(x) as.integer(charToRaw(x))) > if (offset) > QUAL <- lapply(QUAL, "+", offset) > names(QUAL) <- names(DNA) > attr(DNA, "QUAL") <- QUAL > DNA > } > > I then tried the following: > > x <- as.character.DNAbin(fastq) # extract DNA sequences > y <- as.alignment(x) > > seqs <- y$seq If you want to do these extra steps, you can write 'y' into a FASTA file with (of course you can adapt the file name): fl <- "titi.fas" ## avoid many copies of the data if you repeat the next command: if (file.exists(fl)) unlink(fl) for (i in 1:y$nb) cat(">", y$nam[i], "\n", y$seq[i], "\n", sep = "", file = fl, append = TRUE) There is also the function seqinr::write.fasta() to de the same (a bit less efficiently I think). > but 'seqs' can't be written to a file using write.FASTA(). No, instead write directly the 'fastq' object: write.FASTA(fastq, "toto.fas") And you can check that both files have the same sequences: identical(read.FASTA("toto.fas"), read.FASTA("titi.fas")) For the record, the first file has the sequences in uppercase, the 2nd one in lowercase. Cheers, Emmanuel > Any assistance is greatly appreciated. > > Thanks! > > Cheers, > > Jarrett > > [[alternative HTML version deleted]] > > _______________________________________________ > R-sig-phylo mailing list - R-sig-phylo@r-project.org > https://stat.ethz.ch/mailman/listinfo/r-sig-phylo > Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/ _______________________________________________ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/