Hi Paolo and Simon,

Would it be possible to update the documentation in the tutorial so other users can see how to retrieve features as well? Also, could you perhaps file a ticket for the missing fields (KEYWORDS, etc.) so this does not get lost?
https://github.com/biojava/biojava-tutorial/blob/master/genomics/genebank.md

Thanks!

Andreas

On Thu, Jun 4, 2015 at 2:57 AM, simon rayner <[email protected]> wrote:

> We resolved things, sort of, but at some point we fell off the mailing
> list. Here is the full message chain.
>
> Thanks again to all for the help.
>
> Andreas, repeating my question here: would it be any use if I added a
> more complete code sample to the tutorial to show how to pull the Feature
> information out of a GenBank file?
>
> cheers
>
> Simon
>
>
> ---------- Forwarded message ----------
> From: Paolo Pavan <[email protected]>
> Date: Wed, Jun 3, 2015 at 5:34 PM
> Subject: Re: [Biojava-l] GenBank parsing
> To: simon rayner <[email protected]>
>
>
> Oh, I'm realizing now that we went outside of the mailing list.
> You can forward the whole conversation to the list and ask for Andreas there.
>
> Paolo
>
> 2015-06-03 17:29 GMT+02:00 Paolo Pavan <[email protected]>:
>
>> Simon,
>> As far as I have read on the mailing list, Andreas Prlic is
>> interested in this kind of collaboration. I think he will answer you
>> shortly.
>>
>> Bye bye!
>>
>> 2015-06-03 17:14 GMT+02:00 simon rayner <[email protected]>:
>>
>>> Hi Paolo
>>>
>>> I think it's okay. For now, perhaps it would be good to clarify this
>>> somewhere (perhaps in the tutorial sample?). And would it be any use if I
>>> added a more complete code sample to the tutorial to show how to pull the
>>> Feature information out of a GenBank file?
>>>
>>> Simon
>>>
>>> On Wed, Jun 3, 2015 at 5:11 PM, Paolo Pavan <[email protected]>
>>> wrote:
>>>
>>>> Hi Simon,
>>>> Now I see what you mean, and unfortunately I must say that those
>>>> retrievals are not supported yet. They aren't in the section I worked
>>>> on, and I must say that I wasn't actually aware of that.
>>>>
>>>> The file responsible for this behaviour is GenbankSequenceParser.java.
>>>> I don't know if any of the original authors are out there and
>>>> can add something.
>>>>
>>>> You are unlucky; let me know if I can be of any more help.
>>>> Paolo
>>>>
>>>> 2015-06-03 15:55 GMT+02:00 simon rayner <[email protected]>:
>>>>
>>>>> Hi Paolo
>>>>>
>>>>> sequence.getFeaturesByType("source");
>>>>>
>>>>> will return the 'source' entry at the top of the FEATURE tree, but it
>>>>> won't help me retrieve anything outside the FEATURE tree (from the top of
>>>>> the file and at the bottom before the sequence).
>>>>>
>>>>> For example, in the following GenBank file
>>>>>
>>>>> LOCUS       AY102993                 400 bp    mRNA    linear   VRL 22-FEB-2006
>>>>> DEFINITION  Rabies virus isolate RV61 nucleoprotein mRNA, partial cds.
>>>>> ACCESSION   AY102993 AY247649
>>>>> VERSION     AY102993.2  GI:34099643
>>>>> KEYWORDS    .
>>>>> SOURCE      Rabies virus
>>>>>   ORGANISM  Rabies virus
>>>>> <http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=11292>
>>>>>             Viruses; ssRNA viruses; ssRNA negative-strand viruses;
>>>>>             Mononegavirales; Rhabdoviridae; Lyssavirus.
>>>>> REFERENCE   1  (bases 1 to 400)
>>>>>   AUTHORS   Smith,J., McElhinney,L., Parsons,G., Brink,N., Doherty,T.,
>>>>>             Agranoff,D., Miranda,M.E. and Fooks,A.R.
>>>>>   TITLE     Case report: rapid ante-mortem diagnosis of a human case of
>>>>>             rabies imported into the UK from the Philippines
>>>>>   JOURNAL   J. Med. Virol. 69 (1), 150-155 (2003)
>>>>>    PUBMED   12436491 <http://www.ncbi.nlm.nih.gov/pubmed/12436491>
>>>>> REFERENCE   2  (bases 1 to 400)
>>>>>
>>>>> .
>>>>> .
>>>>> .
>>>>>
>>>>> COMMENT     On Aug 22, 2003 this sequence version replaced gi:25986720
>>>>> <http://www.ncbi.nlm.nih.gov/nuccore/25986720>.
>>>>> FEATURES             Location/Qualifiers
>>>>>      source          1..400
>>>>>                      /organism="Rabies virus"
>>>>>                      /mol_type="mRNA"
>>>>>                      /isolate="RV61"
>>>>>                      /host="Homo sapiens"
>>>>>                      /db_xref="taxon:11292
>>>>> <http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=11292>"
>>>>>                      /country="United Kingdom"
>>>>>                      /note="isolated in 1987"
>>>>>      CDS
>>>>> <http://www.ncbi.nlm.nih.gov/nuccore/34099643?from=1&to=400&sat=4&sat_key=38832925>
>>>>>                      1..>400
>>>>>
>>>>>
>>>>> sequence.getFeaturesByType("source");
>>>>>
>>>>> will return the portion
>>>>>
>>>>>      source          1..400
>>>>>                      /organism="Rabies virus"
>>>>>                      /mol_type="mRNA"
>>>>>                      /isolate="RV61"
>>>>>                      /host="Homo sapiens"
>>>>>                      /db_xref="taxon:11292
>>>>> <http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=11292>"
>>>>>                      /country="United Kingdom"
>>>>>                      /note="isolated in 1987"
>>>>>
>>>>>
>>>>> which is important data, but what about the KEYWORDS, SOURCE and
>>>>> REFERENCE information at the top and COMMENT at the bottom?
>>>>>
>>>>> I can use the following calls to get some information:
>>>>>
>>>>> getOriginalHeader() -> LOCUS
>>>>> getDescription()    -> DEFINITION
>>>>> getAccession()      -> ACCESSION
>>>>>
>>>>> What am I missing here?
>>>>>
>>>>> thanks
>>>>>
>>>>> Simon
>>>>>
>>>>> On Wed, Jun 3, 2015 at 3:22 PM, Paolo Pavan <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Can't you find that information in the "source" feature? Check this
>>>>>> list:
>>>>>> List l = sequence.getFeaturesByType("source");
>>>>>>
>>>>>> This comes from the fact that in the new version of the GenBank file,
>>>>>> source is a compulsory feature, and much of the information was moved
>>>>>> from the top-level "Features tag" into "Source" tag qualifiers.
>>>>>>
>>>>>> Let us know,
>>>>>> Paolo
>>>>>>
>>>>>>
>>>>>> 2015-06-03 14:29 GMT+02:00 simon rayner <[email protected]>:
>>>>>>
>>>>>>> Thanks to all for taking the time to answer.
>>>>>>>
>>>>>>> I had already got as far as parsing out the feature information
>>>>>>> using something like
>>>>>>>
>>>>>>> LinkedHashMap<String, DNASequence> dnaSequences =
>>>>>>>         GenbankReaderHelper.readGenbankDNASequence(dnaFile);
>>>>>>> for (DNASequence sequence : dnaSequences.values()) {
>>>>>>>
>>>>>>>     List<FeatureInterface<AbstractSequence<NucleotideCompound>,
>>>>>>>             NucleotideCompound>> fl = sequence.getFeatures();
>>>>>>>     for (FeatureInterface fi : fl) {
>>>>>>>
>>>>>>>         HashMap<String, Qualifier> quals = fi.getQualifiers();
>>>>>>>         for (Map.Entry<String, Qualifier> entry : quals.entrySet()) {
>>>>>>>             logger.info("--\t" + entry.getKey()
>>>>>>>                     + "\t|\t" + entry.getValue().getName()
>>>>>>>                     + " / " + entry.getValue().getValue()
>>>>>>>                     + "\\" + entry.getValue().toString());
>>>>>>>         }
>>>>>>>         logger.info("SHORT\t" + fi.getShortDescription());
>>>>>>>         logger.info("SOURCE\t" + fi.getSource());
>>>>>>>         logger.info("TYPE\t" + fi.getType());
>>>>>>>         logger.info("HASHCODE\t" + fi.hashCode());
>>>>>>>         logger.info("-");
>>>>>>>     }
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>> But I am still stumped as to how to access the annotation
>>>>>>> information at the top of a GenBank file.
>>>>>>>
>>>>>>> For example, getAccession gets me the accession number of the
>>>>>>> sequence, but what about all the other data that is there (e.g. the
>>>>>>> pubmed records)?
>>>>>>>
>>>>>>> In BJ3, there was a RichAnnotation class, but I don't see anything
>>>>>>> equivalent in BJ4.
>>>>>>>
>>>>>>> cheers
>>>>>>>
>>>>>>> Simon
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 3, 2015 at 12:39 PM, Paolo Pavan <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Simon,
>>>>>>>> I took care about last updates to the Genbank parser (reader). At
>>>>>>>> the state of the art, there are two ways to read annotated Genbank
>>>>>>>> files: via GenbankReader and via GenbankProxySequenceReader.
>>>>>>>>
>>>>>>>> The first one:
>>>>>>>> GenbankReader<ProteinSequence, AminoAcidCompound> GenbankProtein
>>>>>>>>         = new GenbankReader<ProteinSequence, AminoAcidCompound>(
>>>>>>>>             inStream,
>>>>>>>>             new GenericGenbankHeaderParser<ProteinSequence, AminoAcidCompound>(),
>>>>>>>>             new ProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet())
>>>>>>>>         );
>>>>>>>> LinkedHashMap<String, ProteinSequence> proteinSequences =
>>>>>>>>         GenbankProtein.process();
>>>>>>>> inStream.close();
>>>>>>>>
>>>>>>>>
>>>>>>>> The second one is:
>>>>>>>>
>>>>>>>> GenbankProxySequenceReader<AminoAcidCompound> genbankProteinReader
>>>>>>>>         = new GenbankProxySequenceReader<AminoAcidCompound>(
>>>>>>>>             "/my_directory",
>>>>>>>>             "NP_000257",
>>>>>>>>             AminoAcidCompoundSet.getAminoAcidCompoundSet());
>>>>>>>> ProteinSequence proteinSequence =
>>>>>>>>         new ProteinSequence(genbankProteinReader);
>>>>>>>>
>>>>>>>>
>>>>>>>> Just keep in mind to use NucleotideCompound and a
>>>>>>>> DNASequenceCreator(DNACompoundSet.getDNACompoundSet()) if you need to
>>>>>>>> parse GenBank nucleotide files.
>>>>>>>>
>>>>>>>> You can access the stored annotation via the getFeatures() family of
>>>>>>>> methods on the sequence object you have read. Also note that features
>>>>>>>> have qualifiers (those starting with / in the GenBank file), and they
>>>>>>>> must be accessed from the feature object with getQualifiers().
>>>>>>>> Also note that a feature can have a complex location (rare, but
>>>>>>>> present); in this case you will find nested locations in the feature
>>>>>>>> retrieved.
>>>>>>>>
>>>>>>>> Does this answer your question?
>>>>>>>> Bye bye,
>>>>>>>> Paolo
>>>>>>>>
>>>>>>>>
>>>>>>>> 2015-06-03 10:27 GMT+02:00 Jose Manuel Duarte <[email protected]>:
>>>>>>>>
>>>>>>>>> I can't offer much help regarding GenBank parsing itself, but I
>>>>>>>>> would at least like to clarify the situation with the different
>>>>>>>>> (indeed confusing) versions:
>>>>>>>>>
>>>>>>>>> BJ4 is the current release, well maintained and under development.
>>>>>>>>> BJ3 has been completely superseded by BJ4. That means that BJ4 does
>>>>>>>>> everything that BJ3 did. In the cookbook and tutorials everything that
>>>>>>>>> refers to BJ3 should work in BJ4, with the only difference that the
>>>>>>>>> namespace of packages has changed from org.biojava.bio/org.biojava3 to
>>>>>>>>> org.biojava.nbio.
>>>>>>>>>
>>>>>>>>> BJ1 and BJX are both legacy projects, with some maintenance but
>>>>>>>>> not much active development. I believe that some of the features in
>>>>>>>>> them were not ported to BJ3+.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> Jose
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 02.06.2015 11:40, Simon Rayner wrote:
>>>>>>>>>
>>>>>>>>>> Hi
>>>>>>>>>>
>>>>>>>>>> I'm coming back to BioJava (BJ) after a couple of years away and
>>>>>>>>>> am somewhat confused by the current collection of cookbooks,
>>>>>>>>>> tutorials and APIs. There appear to be a few examples for handling
>>>>>>>>>> protein structure data, but relatively little for more mainstream
>>>>>>>>>> stuff such as parsing Genbank files, which I first need to get the
>>>>>>>>>> information I want to investigate protein structure. But when I look
>>>>>>>>>> at the relevant code samples to do this, they refer back to BJ3, BJ1,
>>>>>>>>>> or even BJX. Even the Wiki page still refers to BJ3 despite the
>>>>>>>>>> release of BJ4 back in Feb 2015.
>>>>>>>>>>
>>>>>>>>>> I have everything working for parsing GenBank data, but I'm still
>>>>>>>>>> trying to get the Annotation information out of the top of a GenBank
>>>>>>>>>> file, and can't find any way of doing this using BJ4 - the BJ4 API
>>>>>>>>>> appears to refer to the RichAnnotation type in the BJX release. Can
>>>>>>>>>> anyone clarify what you are supposed to do here? Start mixing in some
>>>>>>>>>> BJX? (And is BJX still active?) Or should I still be using BJ3 until
>>>>>>>>>> BJ4 stabilizes? I realise this is an open source project, but some
>>>>>>>>>> clarification on the current status of things would be handy if the
>>>>>>>>>> project is going to appeal to a larger community :)
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Biojava-l mailing list - [email protected]
>>>>>>>>>> http://mailman.open-bio.org/mailman/listinfo/biojava-l
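
Since the thread concludes that the header fields outside the FEATURES table (KEYWORDS, SOURCE, REFERENCE, COMMENT) are not yet exposed by the parser, one possible stopgap is to read those sections straight from the flat file. The following is only a minimal sketch in plain Java and is not part of the BioJava API: the class name GenbankHeaderSketch and the method parseHeader are hypothetical names chosen for illustration. It assumes a top-level keyword starts in the first column, that indented lines (continuations and sub-keywords such as ORGANISM or PUBMED) belong to the preceding keyword, and that the header ends at FEATURES.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of BioJava: collects the header sections
// of one GenBank record from its raw lines.
class GenbankHeaderSketch {

    // Returns keyword -> list of section texts. A list is used because a
    // keyword such as REFERENCE can occur more than once in a record.
    static Map<String, List<String>> parseHeader(List<String> lines) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        String keyword = null;
        StringBuilder text = new StringBuilder();
        for (String line : lines) {
            // Assumption: the header stops where the feature table,
            // the sequence, or the record terminator begins.
            if (line.startsWith("FEATURES") || line.startsWith("ORIGIN")
                    || line.startsWith("//")) {
                break;
            }
            if (line.trim().isEmpty()) {
                continue;
            }
            if (!Character.isWhitespace(line.charAt(0))) {
                // A keyword in column one opens a new section.
                store(sections, keyword, text);
                String[] parts = line.split("\\s+", 2);
                keyword = parts[0];
                text = new StringBuilder(parts.length > 1 ? parts[1].trim() : "");
            } else if (keyword != null) {
                // Indented line: continuation or sub-keyword of the current section.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(line.trim());
            }
        }
        store(sections, keyword, text);
        return sections;
    }

    private static void store(Map<String, List<String>> sections,
                              String keyword, StringBuilder text) {
        if (keyword != null) {
            sections.computeIfAbsent(keyword, k -> new ArrayList<>())
                    .add(text.toString());
        }
    }

    public static void main(String[] args) {
        // A few header lines taken from the record quoted in this thread.
        List<String> record = Arrays.asList(
                "KEYWORDS    .",
                "SOURCE      Rabies virus",
                "  ORGANISM  Rabies virus",
                "REFERENCE   1  (bases 1 to 400)",
                "   PUBMED   12436491",
                "FEATURES             Location/Qualifiers");
        for (Map.Entry<String, List<String>> e : parseHeader(record).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

In practice the lines could come from java.nio.file.Files.readAllLines on the same file that is passed to GenbankReaderHelper, so the features come from BioJava and the header text from this helper. Sub-keywords are kept inline in the section text rather than split out, which keeps the sketch short but leaves further parsing to the caller.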
