Re: [Bioc-devel] MRD measurements in Leukemic patients using NGS data in r

Mulder, R Sun, 08 Mar 2020 05:53:12 -0700

Dear Vincent,

Maybe we can move the discussion retrospectively? At the moment it seems we are 
tweaking a few things of which the for last obstacle good be tackled thanks to 
you. The only remaining issue is that I cant get results for my genes of 
interest. In stead I keep getting the same results over and over again. Would 
it be possible to help me a bit further until it works?In addition, I think 
incoporating the solution and later on posting it would be very interesting too.

Kind regards,

Rene

________________________________
Van: Vincent Carey <st...@channing.harvard.edu>
Verzonden: zondag 8 maart 2020 12:00:41
Aan: Mulder, R
CC: Tim Triche, Jr.; bioc-devel@r-project.org
Onderwerp: Re: [Bioc-devel] MRD measurements in Leukemic patients using NGS 
data in r

This has been a very interesting and illuminating dialogue but it really should 
be moved to
support.bioconductor.org<https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fsupport.bioconductor.org%2F&data=02%7C01%7Cr.mulder01%40umcg.nl%7Ca31511f1d00b42bc40bd08d7c34ff8a3%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192620564238716&sdata=uRk3R%2Bs38f52BZQoZWYQokcv%2BOfyksDOwJf9vXwRCDM%3D&reserved=0>
 so that a record is available to the general user community.

The solution here
> library(Homo.sapiens)
Error in library(Homo.sapiens) :
  there is no package called ‘Homo.sapiens’

is BiocManager::install("Homo.sapiens")

On Sun, Mar 8, 2020 at 6:22 AM Mulder, R 
<r.mulde...@umcg.nl<mailto:r.mulde...@umcg.nl>> wrote:
Dear Tim,

Thanks again for your (quick) reply.

I just ran it on my computer and it gave me some results but also several 
errors.

The first part went perfect

> BamFiles <- BamFileList(lapply(list.files(pattern=".bam$"), BamFile))
> names(BamFiles) <- sapply(strsplit(list.files(pattern=".bam$"), "\\."), `[`, 
> 1)
> show(BamFiles)
BamFileList of length 2
names(2): here were my 2 files named without bam extension

However the second part did give results but also several errors

> library(Homo.sapiens)
Error in library(Homo.sapiens) :
  there is no package called ‘Homo.sapiens’
> gxs <- genes(Homo.sapiens)
Error in genes(Homo.sapiens) : could not find function "genes"
> names(gxs)  <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID")
Error in mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") :
  could not find function "mapIds"
> SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ]
Error: object 'gxs' not found
>
> sbparam <- ScanBamParam(which=SomeGenes)
Error in ScanBamParam(which = SomeGenes) : object 'SomeGenes' not found
> puparam <- PileupParam(max_depth=1e3, min_nucleotide_depth=5,
+                        include_insertions=TRUE, min_minor_allele_depth=1)
> piles <- lapply(BamFiles, pileup, ScanBamParam=sbparam, PileupParam=puparam)
>
> filterInvariant <- function(x) {
+     positions <- subset(x, duplicated(pos))$pos
+     subset(x, pos %in% positions)
+ }
> show(do.call(rbind, lapply(lapply(piles, filterInvariant), head, 2)))
              seqnames      pos strand nucleotide count
myfile1        1 36931671      +          A   251
myfile1         1 36931671      +          G     2
myfile2       1 36931673      +          -     1
myfile2               1 36931673      +          A   253

Error in library(Homo.sapiens) :
  there is no package called ‘Homo.sapiens’

For this error I installed the package BSgenome.Hsapiens.UCSC.hg19 and ran it 
again, but the errors at the next operations remained.

> gxs <- genes(Homo.sapiens)
Error in genes(Homo.sapiens) : could not find function "genes"

Renaming Homo.sapiens as BSgenome.Hsapiens.UCSC.hg19 did not work. Should I 
look up how this genome is called?

> names(gxs)  <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID")
Error in mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") :
  could not find function "mapIds"

Where can I find this function?

> SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ]
Error: object 'gxs' not found

Does this have to do with the fact that IGF2 is not included in my panel?

> sbparam <- ScanBamParam(which=SomeGenes)
Error in ScanBamParam(which = SomeGenes) : object 'SomeGenes' not found

As a result of the previous error, right?

Interestingly enough, I did get results. Why?

________________________________
Van: Tim Triche, Jr. <tim.tri...@gmail.com<mailto:tim.tri...@gmail.com>>
Verzonden: zaterdag 7 maart 2020 22:28
Aan: Mulder, R
CC: bioc-devel@r-project.org<mailto:bioc-devel@r-project.org>
Onderwerp: Re: [Bioc-devel] MRD measurements in Leukemic patients using NGS 
data in r

something like this perhaps?

library(Rsamtools)
BamFiles <- BamFileList(lapply(list.files(pattern=".bam$"), BamFile))
names(BamFiles) <- sapply(strsplit(list.files(pattern=".bam$"), "\\."), `[`, 1)
show(BamFiles)
#BamFileList of length 14
#names(14): Normal_0813_S19_R1_001 Normal_2013_S18_R1_001 ... REMC_CD19 REMC_CD3

library(Homo.sapiens)
gxs <- genes(Homo.sapiens)
names(gxs)  <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID")
SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ]

sbparam <- ScanBamParam(which=SomeGenes)
puparam <- PileupParam(max_depth=1e3, min_nucleotide_depth=5,
                       include_insertions=TRUE, min_minor_allele_depth=1)
piles <- lapply(BamFiles, pileup, ScanBamParam=sbparam, PileupParam=puparam)

filterInvariant <- function(x) {
  positions <- subset(x, duplicated(pos))$pos
  subset(x, pos %in% positions)
}
show(do.call(rbind, lapply(lapply(piles, filterInvariant), head, 2)))
#                           seqnames     pos strand nucleotide count
# Normal_0813_S19_R1_001.43    chr11 1979973      -          C     1
# Normal_0813_S19_R1_001.44    chr11 1979973      +          T     1
# Normal_2013_S18_R1_001.16    chr11 1979889      +          T     3
# Normal_2013_S18_R1_001.17    chr11 1979889      -          T     1
# Normal_2714_S20_R1_001.4     chr11 1979933      +          A     1
# Normal_2714_S20_R1_001.5     chr11 1979933      -          A     1
# P01-00020-T06-P-cfDNA.21     chr11 1980045      +          G     1
# P01-00020-T06-P-cfDNA.22     chr11 1980045      -          G     1
# P01-00020-T06-P-cfDNA2.11    chr11 1979884      +          T     1
# P01-00020-T06-P-cfDNA2.12    chr11 1979884      -          T     1
# P01-00021-T06-P-cfDNA.5      chr11 1979879      +          C     1
# P01-00021-T06-P-cfDNA.6      chr11 1979879      -          C     1
# P01-00024-T06-P-cfDNA.9      chr11 1979990      +          C     1
# P01-00024-T06-P-cfDNA.10     chr11 1979990      +          T     2
# P01-00028-T06-N-P01.3        chr11 1979851      +          A     1
# P01-00028-T06-N-P01.4        chr11 1979851      +          G     1
# P01-00028-T06-P-cfDNA.10     chr11 1979948      +          A     1
# P01-00028-T06-P-cfDNA.11     chr11 1979948      +          G     1
# P01-00028-T06-T-P01.2        chr11 1979850      +          C     3
# P01-00028-T06-T-P01.3        chr11 1979850      -          C     2
# P01-00028-T06-T-P02.3        chr11 1979851      +          A     1
# P01-00028-T06-T-P02.4        chr11 1979851      +          C     1

Or whatever you like (GAlignments would get you the reads, GAlignmentPairs the 
read pairs, and so on)

--t

On Sat, Mar 7, 2020 at 3:51 PM Mulder, R 
<r.mulde...@umcg.nl<mailto:r.mulde...@umcg.nl><mailto:r.mulde...@umcg.nl<mailto:r.mulde...@umcg.nl>>>
 wrote:

Dear Tim,

I installed Rsamtools and ran it on a bamfile using the following command to 
get nucleotide count for 2 hotspot regions.

library(Rsamtools)
bamfile <- "test.bam"
bf <- BamFile(bamfile)
param <- ScanBamParam(which=GRanges(c("4", "9"), IRanges(start=c(55599321, 
5073770), end=c(55599321, 5073770))))

Now, I want to perform this on more than 1 bam file at once and on more region.

Do you perhaps have ideas on how I could do this?

Best,

Rene

________________________________
Van: Tim Triche, Jr. 
<tim.tri...@gmail.com<mailto:tim.tri...@gmail.com><mailto:tim.tri...@gmail.com<mailto:tim.tri...@gmail.com>>>
Verzonden: vrijdag 6 maart 2020 13:57:10
Aan: Mulder, R
CC: 
bioc-devel@r-project.org<mailto:bioc-devel@r-project.org><mailto:bioc-devel@r-project.org<mailto:bioc-devel@r-project.org>>
Onderwerp: Re: [Bioc-devel] MRD measurements in Leukemic patients using NGS 
data in r

Oh hey, one last thing — if all you want is to get nucleotide counts per region 
of interest, just use pileup() in Rsamtools, with bamWhich(GRanges) holding a 
GRanges (Genomic Ranges) of your regions added to scanBamParams for each BAM. 
It sounds awkward but in practice it is super fast and will give you all the 
nucleotide and read level information you could want. One of my interns 
implemented this for mitochondrial variant calling in MTseeker when we got sick 
of using gmapR and being flagged for errors on not-Linux. (We gutted the entire 
package recently and have new, insanely deep examples from Oxford Nanopore 
direct RNA sequencing and from large single cell datasets; I need to add those 
and get the package back out of purgatory).

That said, in the end you will want a LOT of validation material so this is 
very much just a starting point. But still, it’s your starting point, in R at 
least. And truthfully I much prefer R/Bioconductor idioms to (say) pysam or the 
like. htsnim is nice but then you’ll be implementing the ML bits from scratch, 
so I think your instincts to try R first are sensible.

Good luck! Even if you use this for something else besides MRD, I think it will 
become a useful exercise.

--t

On Mar 5, 2020, at 4:36 PM, Tim Triche, Jr. 
<tim.tri...@gmail.com<mailto:tim.tri...@gmail.com><mailto:tim.tri...@gmail.com<mailto:tim.tri...@gmail.com>>>
 wrote:

a few thoughts:

1) look into Shearwater 
(https://bioconductor.org/packages/release/bioc/html/deepSNV.html<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbioconductor.org%2Fpackages%2Frelease%2Fbioc%2Fhtml%2FdeepSNV.html&data=02%7C01%7Cr.mulder01%40umcg.nl%7Ca31511f1d00b42bc40bd08d7c34ff8a3%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192620564238716&sdata=c8b%2FREMotc72%2BZt9XENI7nxiP1Ww0xQymAqX524XJoY%3D&reserved=0><https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbioconductor.org%2Fpackages%2Frelease%2Fbioc%2Fhtml%2FdeepSNV.html&data=02%7C01%7Cr.mulder01%40umcg.nl%7C5dfaa267d461487c892408d7c2de972a%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192133587503301&sdata=3yQeLWUNk%2FLjDAOquRQf6wgbGvn06KZGUUzhIw%2BBUMk%3D&reserved=0<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbioconductor.org%2Fpackages%2Frelease%2Fbioc%2Fhtml%2FdeepSNV.html&data=02%7C01%7Cr.mulder01%40umcg.nl%7Ca31511f1d00b42bc40bd08d7c34ff8a3%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192620564248707&sdata=y%2BxXbPrxep4RNyrHK9IJ6bnxMNKS9HqPufIAu3bkl7I%3D&reserved=0>>),
 then

2) talk to Todd Druley @ WashU, Elli Pappaemanuil @ MSKCC, Ruud & Bob @ 
Erasmus, the usual suspects

3) plan to validate w/ddPCR (at the absolute very least) and be aware that most 
MRD in leukemia is done by a combination of BCR/TCR + breakpoint PCR 
(lymphoid/fusion-driven) or DFN flow (myeloid + normal cyto)

not saying that ML-based methods might not help, but if you've got a 30x-100x 
genome (or even 1000x FM1) and are trying to compete with existing standard 
approaches that can detect molecules at 1e-6, it'll be rough.  An alternative 
approach (that has been used repeatedly) is to throw caution to the wind, 
generate primers for numerous subject-specific somatic variants, and use the 
ensemble to try and model MRD (speaking of ML). On the one hand, that could 
give the clinic a "customer for life"; on the other hand, it's not conducive to 
large-scale automation & deployment. As far as I know, it never got much 
traction, in leukemia or anywhere else.  (Consider that flow cytometry is 
capable of detecting 1-in-10K to 1-in-a-million cells in most clinical flow 
labs.)

Best of luck! (and if you're not already working with UMI-tagged reads, please 
talk to the people in #2 above; the reason most people won't go below 5% VAF is 
that you get thwacked by error rates at that level, and the reason most 
NGS-based MRD is based on UMIs is that existing PCR-based methods have 6 logs 
sensitivity.)

--t

On Thu, Mar 5, 2020 at 4:08 PM Mulder, R 
<r.mulde...@umcg.nl<mailto:r.mulde...@umcg.nl><mailto:r.mulde...@umcg.nl<mailto:r.mulde...@umcg.nl>>>
 wrote:
Hi,

I was wondering if anyone could help me with a script and support for the above 
mentioned goal.

For this I have several BAM files for which I want to determine de nucleotide 
count per region of interest. The latter could be several hotspot mutation 
sites. I would like to get an overall overview of all the BAM files. Next I 
want to use these counts to determine for any new BAM file if the count for a 
particular genomic position is higher than the allowable range, hence could 
indicate if a mutation is present. For this I would like to use the modified 
Thompson Tau test. I think machine learning could be used for this. So, why do 
I want to do all this? Well, normal NGS pipelines only deal with variants at a 
frequency of 5%. Mutatios below this frequency are often missed. To know if a 
mutation is present below this level, you showed dive into the alignment and 
most often manually investigate the base calls. I know that this races some 
questions regarding call qualities, but then again our conventional assays have 
actually confirmed some of these low mutations. In addition, NGS can
 be used to determine the presence of low frequent mutation which is of great 
importance for determining the measurable residual disease after treatment.

I am new to r and bioconductor so I would be very thankful if someone could 
help me in my mission to setting up a script for this purpose.

Thanks,

Rene Mulder

Laboratory Medicine

University Medical Center Groningen

The Netherlands

________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bes...{{dropped:15}}

_______________________________________________
Bioc-devel@r-project.org<mailto:Bioc-devel@r-project.org><mailto:Bioc-devel@r-project.org<mailto:Bioc-devel@r-project.org>>
 mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fbioc-devel&data=02%7C01%7Cr.mulder01%40umcg.nl%7Ca31511f1d00b42bc40bd08d7c34ff8a3%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192620564248707&sdata=QdAVedMFqBqJP%2FZPcDR6Vuicw%2BDvqHC87BKsswUeERA%3D&reserved=0><https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fbioc-devel&data=02%7C01%7Cr.mulder01%40umcg.nl%7C5dfaa267d461487c892408d7c2de972a%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192133587503301&sdata=jDiihphF1FySPCdvLtXHMsmlkQ7ta6Fo0AjBvMa5VIA%3D&reserved=0<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fbioc-devel&data=02%7C01%7Cr.mulder01%40umcg.nl%7Ca31511f1d00b42bc40bd08d7c34ff8a3%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192620564258701&sdata=hHNQn2AMqEkr2LtrD2G3lEmEQKc%2BfHSlb5ydAa%2FdWuo%3D&reserved=0>>
________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bestemd voor de 
geadresseerde(n). Anderen dan de geadresseerde(n) mogen geen gebruik maken van 
dit bericht, het niet openbaar maken of op enige wijze verspreiden of 
vermenigvuldigen. Het UMCG kan niet aansprakelijk gesteld worden voor een 
incomplete aankomst of vertraging van dit verzonden bericht.

The contents of this message are confidential and only intended for the eyes of 
the addressee(s). Others than the addressee(s) are not allowed to use this 
message, to make it public or to distribute or multiply this message in any 
way. The UMCG cannot be held responsible for incomplete reception or delay of 
this transferred message.
________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bestemd voor de 
geadresseerde(n). Anderen dan de geadresseerde(n) mogen geen gebruik maken van 
dit bericht, het niet openbaar maken of op enige wijze verspreiden of 
vermenigvuldigen. Het UMCG kan niet aansprakelijk gesteld worden voor een 
incomplete aankomst of vertraging van dit verzonden bericht.

The contents of this message are confidential and only intended for the eyes of 
the addressee(s). Others than the addressee(s) are not allowed to use this 
message, to make it public or to distribute or multiply this message in any 
way. The UMCG cannot be held responsible for incomplete reception or delay of 
this transferred message.

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org<mailto:Bioc-devel@r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fbioc-devel&data=02%7C01%7Cr.mulder01%40umcg.nl%7Ca31511f1d00b42bc40bd08d7c34ff8a3%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192620564258701&sdata=hHNQn2AMqEkr2LtrD2G3lEmEQKc%2BfHSlb5ydAa%2FdWuo%3D&reserved=0>

The information in this e-mail is intended only for the person to whom it is
addressed. If you believe this e-mail was sent to you in error and the e-mail
contains patient information, please contact the Partners Compliance HelpLine at
http://www.partners.org/complianceline<https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.partners.org%2Fcomplianceline&data=02%7C01%7Cr.mulder01%40umcg.nl%7Ca31511f1d00b42bc40bd08d7c34ff8a3%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192620564268692&sdata=IBbiS%2Bc2PiRWUHVRCbFgGgC41T729Pi2HOw5Sn3guPQ%3D&reserved=0>
 . If the e-mail was sent to you in error
but does not contain patient information, please contact the sender and properly
dispose of the e-mail.
________________________________
De inhoud van dit bericht is vertrouwelijk en alleen bestemd voor de 
geadresseerde(n). Anderen dan de geadresseerde(n) mogen geen gebruik maken van 
dit bericht, het niet openbaar maken of op enige wijze verspreiden of 
vermenigvuldigen. Het UMCG kan niet aansprakelijk gesteld worden voor een 
incomplete aankomst of vertraging van dit verzonden bericht.

The contents of this message are confidential and only intended for the eyes of 
the addressee(s). Others than the addressee(s) are not allowed to use this 
message, to make it public or to distribute or multiply this message in any 
way. The UMCG cannot be held responsible for incomplete reception or delay of 
this transferred message.

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] MRD measurements in Leukemic patients using NGS data in r

Reply via email to