Hi Sabrina.

The Unit_ID can be any "transcript cluster" identifier of your choice. The easiest may be to use the Affymetrix transcript cluster identifier itself ... available from:

http://www.affymetrix.com/analysis/downloads/current_exon/MoEx-1_0-st-v1.mm9.probeset.csv.zip

See the 'transcript_cluster_id' column. Perhaps only take the "core" probes, as defined in the 'level' column?

Note: we used Ensembl in that flat2Cdf() example since we were using a custom organization (i.e. non-Affy) of the probesets.

Cheers,
Mark

On 11/06/2009, at 10:58 PM, sabrina wrote:

> Hi, Mark:
> for the Unit_id, does it have to be an Ensembl gene ID like ENSMUSGxxxx?
> Lots of genes do not have an Ensembl assignment in the Affy annotation
> file. There are lots of missing annotations, and I still have not found
> any good way to deal with them. Do you have any suggestions?
>
> Thanks
>
> Sabrina
>
> On Jun 10, 12:32 am, Mark Robinson <mrobin...@wehi.edu.au> wrote:
>> Hi Sabrina.
>>
>> How about you try and create a 'flat' file like the one described
>> at: http://groups.google.com/group/aroma-affymetrix/web/creating-cdf-file
>> ...
>>
>> Presumably, you will be comfortable with the Exon Array's 'probetab'
>> file by now and possibly the Affymetrix annotation CSV file, and so
>> you should have access to all this information.
>>
>> For example, from the following table:
>>
>> mac1618:HuEx-1_0-st-v2.probe.tab mrobinson$ head HuEx-1_0-st-v2.probe.tab
>> Probe ID  Probe Set ID  probe x  probe y  assembly       seqname  start  stop  strand  probe sequence             target strandedness  category
>> 494998    2315101       917      193      build-34/hg16  chr1     1788   1812  +       CACGGGAAGTCTGGGCTAAGAGACA  Sense                main
>> 1734213   2315101       1092     677      build-34/hg16  chr1     1973   1997  +       ACAGGGGCCAGAAGATGAACAATGG  Sense                main
>> 4767517   2315101       796      1862     build-34/hg16  chr1     1992   2016  +       ATTAAGTTACATGCAGACAACAGGG  Sense                main
>> 4286427   2315101       986      1674     build-34/hg16  chr1     2006   2030  +       TGCCTGGTTGTGGTATTAAGTTACA  Sense                main
>> 5760145   2315102       144      2250     build-34/hg16  chr1     2520   2544  +       TCGGCCGTCGTCTTCTGCAGCTCTG  Sense                main
>> 671410    2315102       689      262      build-34/hg16  chr1     2523   2547  +       AAGTCGGCCGTCGTCTTCTGCAGCT  Sense                main
>> 4275780   2315102       579      1670     build-34/hg16  chr1     2526   2550  +       TCCAAGTCGGCCGTCGTCTTCTGCA  Sense                main
>> 4293462   2315102       341      1677     build-34/hg16  chr1     2531   2555  +       TGTGATCCAAGTCGGCCGTCGTCTT  Sense                main
>> 5388      2315103       267      2        build-34/hg16  chr1     2927   2951  +       CTGTCTGTCGACCCAGCTGGAGGCA  Sense                main
>> [snip]
>>
>> ... you see the second column is the probeset_id, which would be used
>> as the "Group_ID" column for your flat file. Depending on whether you
>> are using the Ensembl CDF or the Affymetrix annotation, you would need
>> to create a mapping to get the transcript cluster id column (here, the
>> "Unit_ID"). Everything else you need (Probe_Sequence, X, Y, Probe_ID)
>> is within the table above.
>>
>> Then, it would be just a matter of filtering OUT those probes that
>> overlap a SNP, which based on your mapping exercise, you must have a
>> list of. Then, make a call to the flat2Cdf() script and hopefully
>> you'll be off and running.
>>
>> Let me know how you go.
>>
>> Cheers,
>> Mark
>>
>> On 10/06/2009, at 1:00 PM, sabrina wrote:
>>
>>> Thanks, Mark!
>>> Can you show me / walk me through how to get a new SNP-free CDF? I
>>> finally got the right version of the SNP and probe mapping, so I am
>>> ready to try it out!
>>>
>>> Sabrina
>>>
>>> On Jun 6, 3:14 am, Mark Robinson <mrobin...@wehi.edu.au> wrote:
>>>> Hi Sabrina.
>>>>
>>>> Comments below.
>>>>
>>>> On 06/06/2009, at 1:57 AM, sabrina wrote:
>>>>
>>>>> Hi, Mark:
>>>>> I finally found the SNP data set that is suitable for my case. As I
>>>>> understand, aroma used RMA to estimate gene level and exon level
>>>>> intensities. After I estimate gene level (transcript level), I can
>>>>> use FIRMA to estimate the residual for each exon and compose a score
>>>>> as described in the paper. My question is: if there is a SNP
>>>>> difference between two strains within one exon, should I exclude
>>>>> that exon from estimating the transcript level value? My guess is
>>>>> probably no.
>>>>
>>>> If the SNP affects only 1 probe in an entire transcript, I would
>>>> expect it to have very little impact on the gene-level summary,
>>>> especially if there are a large number of total probes for that
>>>> gene. It may have a noticeable effect on the estimated probe effect.
>>>>
>>>>> So would it be a good idea to exclude that exon after I calculate
>>>>> all FIRMA scores, or should I exclude these exons after I estimate
>>>>> residuals, and use only the residuals not affected by SNPs for
>>>>> FIRMA score estimation? Thanks
>>>>
>>>> Keep in mind the residuals are calculated at the probe level, not
>>>> the probeset level. The FIRMA score is then a summary of all the
>>>> residuals for a probeset.
>>>>
>>>> I think you have (at least) 3 choices:
>>>>
>>>> 1. (preferred, I would think) you could remove all affected *probes*
>>>> (via the creation of a SNP-affected-probe-free CDF) in advance, then
>>>> run FIRMA as normal. I can help with this if you tell me which
>>>> probes are affected.
>>>>
>>>> 2. remove the affected *probesets* afterwards, since you may not
>>>> believe the FIRMA scores for these probesets.
>>>>
>>>> 3. as you suggested, only calculate FIRMA scores from unaffected
>>>> residuals. But the information you require to do this is the same
>>>> information required to do #1, and it would seem that #1 is
>>>> preferred.
>>>>
>>>> The good thing about option #1 is you would still have some ability
>>>> to detect differential splicing for the probeset (instead of tossing
>>>> it away), albeit with the smaller number of remaining unaffected
>>>> probes.
>>>>
>>>> Cheers,
>>>> Mark
>>>>
>>>>> Sabrina
>>>>>
>>>>> On Apr 30, 3:46 am, Mark Robinson <mrobin...@wehi.edu.au> wrote:
>>>>>> Hi Sabrina.
>>>>>>
>>>>>> I have not had to deal with this myself, but I do know that the
>>>>>> problem exists and I can at least suggest a possible route to
>>>>>> exclude affected exons.
>>>>>>
>>>>>> Presumably, there is a database (dbSNP?) that tells you the genome
>>>>>> locations of each SNP for your strains. There is also a probe.tab
>>>>>> file from Affymetrix that gives you the mapped genome locations of
>>>>>> each probe (or you could take the sequences from the same file and
>>>>>> map them yourself with a tool like BLAT). It is then just a matter
>>>>>> of checking whether each probe maps to a location on the genome
>>>>>> that overlaps a SNP. There is probably a Bioconductor tool for
>>>>>> this, or you could create a hash, etc.
>>>>>>
>>>>>> There are a couple of levels at which you might introduce this to
>>>>>> your analysis. You could remove individual probes that are
>>>>>> affected. On the aroma.affymetrix side, this would require
>>>>>> creating a new CDF with those affected probes not included (a bit
>>>>>> tricky but doable). Or, you could simply post-process your
>>>>>> existing results and remove probesets that have an affected probe
>>>>>> (easier but not as elegant).
>>>>>>
>>>>>> You might've also seen:
>>>>>>
>>>>>> Duan S, Zhang W, Bleibel WK, Cox NJ, Dolan ME: SNPinProbe 1.0: A
>>>>>> database for filtering out probes in the Affymetrix GeneChip(R)
>>>>>> Human Exon 1.0 ST array potentially affected by SNPs.
>>>>>> Bioinformation 2008, 2(10):469-470.
>>>>>>
>>>>>> Hope that gets you started.
>>>>>>
>>>>>> Cheers,
>>>>>> Mark
>>>>>>
>>>>>> On 30/04/2009, at 6:07 AM, sabrina wrote:
>>>>>>
>>>>>>> Hi, all:
>>>>>>> I am using Aroma for detecting exon skipping events between two
>>>>>>> groups (two different strains). I found out that several of my
>>>>>>> top hits indeed include at least one SNP between the two strains.
>>>>>>> I wonder if anyone has a suggestion about how to deal with this
>>>>>>> situation. If I need to remove all affected exons from the
>>>>>>> analysis, how can I do it? I have never worked with SNP data
>>>>>>> before; can anyone give me a hint? Thanks a lot!
>>>>>>>
>>>>>>> Sabrina
>>>>>>
>>>>>> ------------------------------
>>>>>> Mark Robinson
>>>>>> Epigenetics Laboratory, Garvan
>>>>>> Bioinformatics Division, WEHI
>>>>>> e: m.robin...@garvan.org.au
>>>>>> e: mrobin...@wehi.edu.au
>>>>>> p: +61 (0)3 9345 2628
>>>>>> f: +61 (0)3 9347 0852
>>>>>> ------------------------------

------------------------------
Mark Robinson, PhD (Melb)
Epigenetics Laboratory, Garvan
Bioinformatics Division, WEHI
e: m.robin...@garvan.org.au
e: mrobin...@wehi.edu.au
p: +61 (0)3 9345 2628
f: +61 (0)3 9347 0852
------------------------------
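
For anyone following the thread, the recipe discussed here (use the probeset id as Group_ID, map it to a transcript cluster id for Unit_ID, drop probes that overlap a SNP, then write the flat file) might be sketched roughly as below. This is only a minimal Python sketch under several assumptions, not the flat2Cdf() script itself: the probe.tab file is assumed to be tab-delimited with the header shown in the table quoted in this thread, start/stop are treated as inclusive coordinates, and the probeset-to-transcript-cluster mapping (`demo_units`), the SNP positions (`demo_snps`), the `TC_demo_1` identifier and the output column order are all hypothetical placeholders. Please check the column layout that flat2Cdf() actually expects before using anything like this.

```python
import csv
import io

# Flat-file column names taken from this thread; their ORDER is an assumption.
FLAT_COLUMNS = ["Probe_ID", "Unit_ID", "Group_ID", "X", "Y", "Probe_Sequence"]


def overlaps_snp(seqname, start, stop, snp_positions):
    """Return True if any SNP on `seqname` lies within [start, stop] (inclusive)."""
    return any(start <= pos <= stop for pos in snp_positions.get(seqname, ()))


def build_flat_rows(probe_tab, probeset_to_unit, snp_positions):
    """Yield flat-file rows for probes that have a Unit_ID and overlap no SNP."""
    for row in csv.DictReader(probe_tab, delimiter="\t"):
        group_id = row["Probe Set ID"]
        unit_id = probeset_to_unit.get(group_id)
        if unit_id is None:
            continue  # probeset has no transcript cluster mapping: skip it
        if overlaps_snp(row["seqname"], int(row["start"]), int(row["stop"]),
                        snp_positions):
            continue  # probe overlaps a known SNP: filter it OUT
        yield {
            "Probe_ID": row["Probe ID"],
            "Unit_ID": unit_id,
            "Group_ID": group_id,
            "X": row["probe x"],
            "Y": row["probe y"],
            "Probe_Sequence": row["probe sequence"],
        }


def write_flat_file(rows, out):
    """Write the rows as a tab-delimited table with a header line."""
    writer = csv.DictWriter(out, fieldnames=FLAT_COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Small in-memory demo using a few rows from the table quoted in this thread.
header = ["Probe ID", "Probe Set ID", "probe x", "probe y", "assembly",
          "seqname", "start", "stop", "strand", "probe sequence",
          "target strandedness", "category"]
records = [
    ["494998", "2315101", "917", "193", "build-34/hg16", "chr1", "1788",
     "1812", "+", "CACGGGAAGTCTGGGCTAAGAGACA", "Sense", "main"],
    ["1734213", "2315101", "1092", "677", "build-34/hg16", "chr1", "1973",
     "1997", "+", "ACAGGGGCCAGAAGATGAACAATGG", "Sense", "main"],
    ["4767517", "2315101", "796", "1862", "build-34/hg16", "chr1", "1992",
     "2016", "+", "ATTAAGTTACATGCAGACAACAGGG", "Sense", "main"],
    ["5760145", "2315102", "144", "2250", "build-34/hg16", "chr1", "2520",
     "2544", "+", "TCGGCCGTCGTCTTCTGCAGCTCTG", "Sense", "main"],
]
demo_tab = io.StringIO("\n".join("\t".join(r) for r in [header] + records) + "\n")

demo_units = {"2315101": "TC_demo_1"}  # hypothetical probeset -> Unit_ID mapping
demo_snps = {"chr1": {2000}}           # hypothetical SNP position on chr1

rows = list(build_flat_rows(demo_tab, demo_units, demo_snps))
out = io.StringIO()
write_flat_file(rows, out)
print(out.getvalue())
```

In the demo, probe 4767517 is dropped because the made-up SNP at chr1:2000 falls inside its 1992-2016 interval, and probe 5760145 is dropped because its probeset has no entry in the made-up mapping. For a real array you would read the probe.tab file from disk and build the mapping from the 'transcript_cluster_id' column of the Affymetrix annotation CSV; for a large SNP list, a sorted array with binary search (or a Bioconductor interval tool, as suggested in the thread) would scale better than the linear scan used here.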