Hi Tengfei,

Ugh.  It seems that they have cooked up yet another new way to represent 
that same kind of data inside of a gff file.  :(    I am sad to say that 
this is exactly the sort of thing that I was worried about.

If you can't specify a field from your gff attribute field that contains 
the exon rank information (and in this case it looks like you can't), 
then the software will try to infer it for you (and it will warn you 
that it is being forced to do this).  But the inference is not magic, of 
course: it just does the simplest possible thing and assumes that the 
order of the exons along the chromosome is the correct rank.  So for 
something like soybeans, I definitely think you should extract those 
exon ranks and use them instead...
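
To make the concern concrete, here is a minimal sketch using the two 
exon rows from your example (coordinates as I read them from your 
message; the data frame and its column names are made up for 
illustration).  It only compares a naive ordering by start coordinate 
with the numbers embedded in the IDs, and is not meant to reproduce what 
the package does internally:

```r
# Two exon rows from the example transcript (minus strand).
# Column names here are illustrative, not a GenomicFeatures structure.
exons <- data.frame(
  id     = c("PAC:26325839.exon.1", "PAC:26325839.exon.2"),
  start  = c(27913L, 27643L),
  end    = c(27977L, 27811L),
  strand = c("-", "-"),
  stringsAsFactors = FALSE
)

# Rank embedded in the ID: the digits after ".exon."
exons$idRank <- as.integer(sub("^.*\\.exon\\.([0-9]+)$", "\\1", exons$id))

# Naive rank: position of each exon when sorted by start coordinate
exons$positionRank <- as.integer(rank(exons$start))

print(exons[, c("id", "start", "idRank", "positionRank")])
```

For this transcript the two columns disagree (exon.1 has the larger 
start coordinate), which is why I would rather trust the ranks encoded 
in the file.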

But how best to proceed with this very weird file?

If I were in your shoes I would probably look at doing a substitution.  
You could use a regular expression to convert attributes (things in the 
final column) that look like ".exon.1" into things that look like 
".exon.1;exonRank=1", capturing the "1" so that it is preserved in the 
output.  A couple of global substitutions like this (one for exon rows, 
one for CDS rows) would effectively add an attribute to the file for all 
the rows that contain a CDS or exon.  You could do this substitution in 
R, for example, and then save out a modified file.  Then you could just 
feed that modified file right into the makeTranscriptDbFromGFF() function 
and pass "exonRank" as the argument to exonRankAttributeName...
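
A rough sketch of that substitution in R, in case it helps.  The helper 
name addExonRank and the output file name are hypothetical (neither is 
part of GenomicFeatures), and I have folded the exon and CDS cases into 
one pattern rather than running two separate substitutions.  It assumes 
every exon/CDS row has an ID of the form "<something>.exon.N" or 
"<something>.CDS.N", as in your example:

```r
# Hypothetical helper (not part of GenomicFeatures): append an exonRank
# attribute to GFF lines whose ID ends in ".exon.N" or ".CDS.N".
addExonRank <- function(lines) {
  # Group 1 is the whole ID attribute, group 3 is the trailing number.
  # [^;]* keeps the match inside the ID attribute itself.
  sub("(ID=[^;]*\\.(exon|CDS)\\.([0-9]+))", "\\1;exonRank=\\3", lines)
}

# Small in-memory test using rows modeled on the example file
gffLines <- c(
  "Gm01\tphytozome8_0\tmRNA\t27643\t27977\t.\t-\t.\tID=PAC:26325839;Name=Glyma01g00210.1;pacid=26325839;longest=1;Parent=Glyma01g00210",
  "Gm01\tphytozome8_0\texon\t27913\t27977\t.\t-\t.\tID=PAC:26325839.exon.1;Parent=PAC:26325839;pacid=26325839",
  "Gm01\tphytozome8_0\tCDS\t27913\t27977\t.\t-\t0\tID=PAC:26325839.CDS.1;Parent=PAC:26325839;pacid=26325839"
)
fixed <- addExonRank(gffLines)
cat(fixed, sep = "\n")

# On the real file (output file name is just an example):
# lines <- readLines("Gmax_189_gene_exons.gff3")
# writeLines(addExonRank(lines), "Gmax_189_gene_exons.exonRank.gff3")
# txdb <- makeTranscriptDbFromGFF("Gmax_189_gene_exons.exonRank.gff3",
#                                 format = "gff3",
#                                 exonRankAttributeName = "exonRank")
```

Rows without a matching ID (gene and mRNA rows here) pass through 
unchanged, so it should be safe to run over the whole file.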

Also, I am just now checking in a solution to the other inconvenience 
that you reported earlier (to the devel branch).  So look for an update 
to appear very soon (or DL it from svn if you are impatient).  Please 
let me know if there are any more snags with this.


On 02/11/2013 12:01 PM, Tengfei Yin wrote:
> Hi Marc,
> Thanks a lot for your advice.
> As far as I know, the gff3 file is the only way I can get Gmax's 
> latest build for annotation from 
> phytozome (http://www.phytozome.net/). It is now publicly available at
> ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Gmax/annotation/Gmax_189_gene_exons.gff3.gz
> The reason I didn't provide 'exonRankAttributeName' is that 
> there are no explicit numbers that indicate the exon rank 
> directly in that gff3 file. Examples look like this:
> Gm01  phytozome8_0  gene  27643  27977  .  -  .  ID=Glyma01g00210;Name=Glyma01g00210
> Gm01  phytozome8_0  mRNA  27643  27977  .  -  .  ID=PAC:26325839;Name=Glyma01g00210.1;pacid=26325839;longest=1;Parent=Glyma01g00210
> Gm01  phytozome8_0  exon  27913  27977  .  -  .  ID=PAC:26325839.exon.1;Parent=PAC:26325839;pacid=26325839
> Gm01  phytozome8_0  CDS   27913  27977  .  -  0  ID=PAC:26325839.CDS.1;Parent=PAC:26325839;pacid=26325839
> Gm01  phytozome8_0  exon  27643  27811  .  -  .  ID=PAC:26325839.exon.2;Parent=PAC:26325839;pacid=26325839
> Gm01  phytozome8_0  CDS   27643  27811  .  -  1  ID=PAC:26325839.CDS.2;Parent=PAC:26325839;pacid=26325839
> The ID attribute looks like it has information about the rank (I see 
> *.exon.1, *.exon.2), so I guess I can extract that information as an 
> extra column manually and specify it in the call to 
> 'makeTranscriptDbFromGFF'.
> By the way, is this required? It looks like GenomicFeatures tries to 
> infer the exon rank if I don't provide that information, so I thought 
> 'exonRankAttributeName' was optional at first.
> Thanks again
> Tengfei
> On Fri, Feb 8, 2013 at 6:08 PM, Marc Carlson <mcarl...@fhcrc.org> wrote:
>     Hi Tengfei,
>     Yes that looks like an oversight.  Thanks for reporting that!  I
>     will extend makeTxDbPackage so that it's more accommodating of
>     these newer transcriptDbs.  If you want to help me out, you could
>     call saveDb() on your gmax189 object and send me the .sqlite file
>     that you save it to.
>     Also, if you have any alternate options for importing your data
>     (other than using GFF or GTF): I think you probably should
>     consider it.  The file specifications for these filetypes are
>     missing key details and so you can very easily get a "legal" GFF
>     or GTF file that is actually missing important details from its
>     contents.  For example, they can commonly lack information about
>     the order of the exons for a given transcript, which can render
>     them difficult (or impossible) to use for transcript work.   But
>     for these specifications, that information is "optional".
>       Marc
>     On 02/06/2013 09:46 PM, Tengfei Yin wrote:
>         Dear all,
>         I am trying to build a txdb object from gff3 for soybean data
>         and make it into a package. The code I used is:
>         gmax189 <- makeTranscriptDbFromGFF("~/Gmax_189_gene_exons.gff3",
>                                            format = "gff3",
>                                            species = "Glycine max",
>                                            dataSource = "http://www.phytozome.org/")
>         makeTxDbPackage(txdb = gmax189,
>                          version = "0.9.1",
>                          maintainer = "Tengfei Yin",
>                          author = "Tengfei Yin",
>                          destDir=".",
>                          license="Artistic-2.0")
>         Error message:
>         Error in gsub("_", "", pkgName) :
>            error in evaluating the argument 'x' in selecting a method for function 'gsub': Error: object 'pkgName' not found
>         Looks like my dataSource should be either BioMart or UCSC,
>         otherwise no pkgname will be produced in function .makePackageName?
>         Or should I build annotation package in some other ways?
>         Thanks a lot
>         Tengfei
>         my sessionInfo
>             sessionInfo()
>         R Under development (unstable) (2013-01-21 r61728)
>         Platform: x86_64-unknown-linux-gnu (64-bit)
>         locale:
>           [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>           [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>           [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>           [7] LC_PAPER=C                 LC_NAME=C
>           [9] LC_ADDRESS=C               LC_TELEPHONE=C
>         attached base packages:
>         [1] parallel  stats     graphics  grDevices utils     datasets  methods
>         [8] base
>         other attached packages:
>         [1] GenomicFeatures_1.11.8 AnnotationDbi_1.21.10  Biobase_2.19.2
>         [4] GenomicRanges_1.11.28  IRanges_1.17.31        BiocGenerics_0.5.6
>         loaded via a namespace (and not attached):
>          [1] biomaRt_2.15.0     Biostrings_2.27.10 bitops_1.0-5       BSgenome_1.27.1
>          [5] DBI_0.2-5          RCurl_1.95-3       Rsamtools_1.11.15  RSQLite_0.11.2
>          [9] rtracklayer_1.19.9 stats4_3.0.0       tools_3.0.0        XML_3.95-0.1
>         [13] zlibbioc_1.5.0
>     _______________________________________________
>     Bioc-devel@r-project.org mailing list
>     https://stat.ethz.ch/mailman/listinfo/bioc-devel
> -- 
> Tengfei Yin
> MCDB PhD student
> 1620 Howe Hall, 2274,
> Iowa State University
> Ames, IA,50011-2274

