Re: [Biojava-l] Fwd: GenBank parsing

Andreas Prlic Thu, 04 Jun 2015 05:52:16 -0700

Hi Paolo and Simon,

Would it be possible to update the documentation in the tutorial, so other
users can see how to retrieve features as well? Also, perhaps file a ticket
for the missing fields (keywords, etc.), so this does not get lost?


https://github.com/biojava/biojava-tutorial/blob/master/genomics/genebank.md

Thanks!

Andreas

On Thu, Jun 4, 2015 at 2:57 AM, simon rayner <[email protected]>
wrote:

> We resolved things, sort of, but at some point we fell off the mailing
> list. Here is the full message chain.
>
> thanks again to all for the help
>
> Andreas, repeating my question here: would it be any use if I added a
> more complete code sample to the tutorial showing how to pull the Feature
> information out of a GenBank file?
>
> cheers
>
> Simon
>
>
> ---------- Forwarded message ----------
> From: Paolo Pavan <[email protected]>
> Date: Wed, Jun 3, 2015 at 5:34 PM
> Subject: Re: [Biojava-l] GenBank parsing
> To: simon rayner <[email protected]>
>
>
> Oh, I've just realized that we went off the mailing list.
> You can forward the whole conversation to the list and ask Andreas there.
>
> Paolo
>
> 2015-06-03 17:29 GMT+02:00 Paolo Pavan <[email protected]>:
>
>> Simon,
>> As far as I have read on the mailing list, Andreas Prlic is
>> interested in this kind of collaboration. I think he will answer you
>> shortly.
>>
>> Bye bye!
>>
>> 2015-06-03 17:14 GMT+02:00 simon rayner <[email protected]>:
>>
>>> Hi Paolo
>>>
>>> I think it's okay. For now, perhaps it would be good to clarify this
>>> somewhere (perhaps in the tutorial sample?). And would it be any use if I
>>> added a more complete code sample to the tutorial showing how to pull the
>>> Feature information out of a GenBank file?
>>>
>>> Simon
>>>
>>> On Wed, Jun 3, 2015 at 5:11 PM, Paolo Pavan <[email protected]>
>>> wrote:
>>>
>>>> Hi Simon,
>>>> Now I see what you mean, and unfortunately I must say that retrieving
>>>> those fields is not supported yet. They aren't in the section I worked
>>>> on, and I wasn't actually aware of that.
>>>>
>>>> The file responsible for this behaviour is GenbankSequenceParser.java.
>>>> I don't know if any of the original authors are out there and can add
>>>> something.
>>>>
>>>> You are unlucky; let me know if I can be of any more help.
>>>> Paolo
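
Until the parser exposes these fields, one possible stopgap is to read the header section of the flat file directly. The following is a minimal sketch under that assumption; `GenbankHeaderSketch` and `parseHeader` are hypothetical names, not part of BioJava, and the code only relies on the flat-file layout visible in the record quoted in this thread (top-level keywords start in the first column, sub-keywords and continuation lines are indented).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a BioJava class): collects GenBank header fields by hand.
class GenbankHeaderSketch {

    // Returns top-level keyword -> list of values (REFERENCE can occur several times).
    // Sub-keywords such as ORGANISM or PUBMED and continuation lines are folded
    // into the value of the enclosing top-level keyword.
    static Map<String, List<String>> parseHeader(String flatFile) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        String key = null;
        StringBuilder value = new StringBuilder();
        for (String line : flatFile.split("\r?\n")) {
            if (line.startsWith("FEATURES") || line.startsWith("ORIGIN") || line.startsWith("//")) {
                break; // the header ends where the feature table or the sequence starts
            }
            if (line.trim().isEmpty()) {
                continue;
            }
            if (!Character.isWhitespace(line.charAt(0))) {
                store(fields, key, value); // a new top-level keyword closes the previous one
                String[] parts = line.trim().split("\\s+", 2);
                key = parts[0];
                value = new StringBuilder(parts.length > 1 ? parts[1] : "");
            } else if (key != null) {
                value.append(' ').append(line.trim());
            }
        }
        store(fields, key, value);
        return fields;
    }

    // Saves the collected value with runs of whitespace collapsed to single spaces.
    private static void store(Map<String, List<String>> fields, String key, StringBuilder value) {
        if (key != null) {
            String clean = value.toString().trim().replaceAll("\\s+", " ");
            fields.computeIfAbsent(key, k -> new ArrayList<>()).add(clean);
        }
    }

    public static void main(String[] args) {
        String record = String.join("\n",
            "KEYWORDS    .",
            "SOURCE      Rabies virus",
            "  ORGANISM  Rabies virus",
            "REFERENCE   1  (bases 1 to 400)",
            "   PUBMED   12436491",
            "REFERENCE   2  (bases 1 to 400)",
            "COMMENT     On Aug 22, 2003 this sequence version replaced gi:25986720.",
            "FEATURES             Location/Qualifiers");
        parseHeader(record).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Each keyword maps to a list because REFERENCE can appear more than once. Sub-keywords are deliberately left inline in the value to keep the sketch short; splitting them out would need a second level of parsing.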
>>>>
>>>> 2015-06-03 15:55 GMT+02:00 simon rayner <[email protected]>:
>>>>
>>>>> Hi Paolo
>>>>>
>>>>>  sequence.getFeaturesByType("source");
>>>>>
>>>>> will return the 'source' entry at the top of the FEATURE tree, but it
>>>>> won't help me retrieve anything outside the FEATURE tree (from the top of
>>>>> the file and at the bottom before the sequence)
>>>>>
>>>>> For example, in the following GenBank file
>>>>>
>>>>> LOCUS       AY102993                 400 bp    mRNA    linear   VRL 22-FEB-2006
>>>>> DEFINITION  Rabies virus isolate RV61 nucleoprotein mRNA, partial cds.
>>>>> ACCESSION   AY102993 AY247649
>>>>> VERSION     AY102993.2  GI:34099643
>>>>> KEYWORDS    .
>>>>> SOURCE      Rabies virus
>>>>>   ORGANISM  Rabies virus
>>>>>             Viruses; ssRNA viruses; ssRNA negative-strand viruses;
>>>>>             Mononegavirales; Rhabdoviridae; Lyssavirus.
>>>>> REFERENCE   1  (bases 1 to 400)
>>>>>   AUTHORS   Smith,J., McElhinney,L., Parsons,G., Brink,N., Doherty,T.,
>>>>>             Agranoff,D., Miranda,M.E. and Fooks,A.R.
>>>>>   TITLE     Case report: rapid ante-mortem diagnosis of a human case of rabies
>>>>>             imported into the UK from the Philippines
>>>>>   JOURNAL   J. Med. Virol. 69 (1), 150-155 (2003)
>>>>>    PUBMED   12436491
>>>>> REFERENCE   2  (bases 1 to 400)
>>>>>
>>>>>      .
>>>>>      .
>>>>>      .
>>>>>
>>>>> COMMENT     On Aug 22, 2003 this sequence version replaced gi:25986720.
>>>>> FEATURES             Location/Qualifiers
>>>>>      source          1..400
>>>>>                      /organism="Rabies virus"
>>>>>                      /mol_type="mRNA"
>>>>>                      /isolate="RV61"
>>>>>                      /host="Homo sapiens"
>>>>>                      /db_xref="taxon:11292"
>>>>>                      /country="United Kingdom"
>>>>>                      /note="isolated in 1987"
>>>>>      CDS             1..>400
>>>>>
>>>>>
>>>>> sequence.getFeaturesByType("source");
>>>>>
>>>>> will return the portion
>>>>>
>>>>>      source          1..400
>>>>>                      /organism="Rabies virus"
>>>>>                      /mol_type="mRNA"
>>>>>                      /isolate="RV61"
>>>>>                      /host="Homo sapiens"
>>>>>                      /db_xref="taxon:11292"
>>>>>                      /country="United Kingdom"
>>>>>                      /note="isolated in 1987"
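
The qualifier lines in the portion above are plain `/key="value"` pairs, which is also how they can be picked out of the raw text when needed. The sketch below is only an illustration of that structure: `QualifierSketch` and `parseQualifiers` are hypothetical names, not BioJava classes, and the code makes two simplifying assumptions (repeated keys overwrite each other, and value-less qualifiers are skipped).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a BioJava class): reads the qualifiers of one feature block.
class QualifierSketch {

    // Matches /key="value" (quoted) or /key=value (unquoted, for example /codon_start=1).
    private static final Pattern QUALIFIER =
            Pattern.compile("/(\\w+)=(?:\"([^\"]*)\"|(\\S+))");

    // Returns qualifier name -> value in file order.
    // Simplifications: a repeated key keeps only its last value,
    // and qualifiers without a value are ignored.
    static Map<String, String> parseQualifiers(String featureBlock) {
        Map<String, String> qualifiers = new LinkedHashMap<>();
        Matcher m = QUALIFIER.matcher(featureBlock);
        while (m.find()) {
            String value = m.group(2) != null ? m.group(2) : m.group(3);
            // Collapse the line breaks and indentation of values wrapped over several lines.
            qualifiers.put(m.group(1), value.replaceAll("\\s+", " ").trim());
        }
        return qualifiers;
    }

    public static void main(String[] args) {
        String source = String.join("\n",
            "     source          1..400",
            "                     /organism=\"Rabies virus\"",
            "                     /mol_type=\"mRNA\"",
            "                     /isolate=\"RV61\"",
            "                     /host=\"Homo sapiens\"",
            "                     /db_xref=\"taxon:11292\"",
            "                     /country=\"United Kingdom\"",
            "                     /note=\"isolated in 1987\"");
        parseQualifiers(source).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```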
>>>>>
>>>>>
>>>>>
>>>>> which is important data, but what about the KEYWORDS, SOURCE and
>>>>> REFERENCE information at the top and COMMENT at the bottom?
>>>>>
>>>>> I can use the following calls to get some information
>>>>>
>>>>> getOriginalHeader() -> LOCUS
>>>>> getDescription() -> DEFINITION
>>>>> getAccession() -> ACCESSION
>>>>>
>>>>> What am I missing here?
>>>>>
>>>>> thanks
>>>>>
>>>>> Simon
>>>>>
>>>>> On Wed, Jun 3, 2015 at 3:22 PM, Paolo Pavan <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Can't you find that information in the "source" feature? Check this
>>>>>> list:
>>>>>> List l = sequence.getFeaturesByType("source");
>>>>>>
>>>>>> This comes from the fact that in newer versions of the GenBank format,
>>>>>> source is a compulsory feature, and a lot of information was moved from
>>>>>> the top-level "Features tag" into the "Source" tag qualifiers.
>>>>>>
>>>>>> Let us know,
>>>>>> Paolo
>>>>>>
>>>>>>
>>>>>> 2015-06-03 14:29 GMT+02:00 simon rayner <[email protected]>:
>>>>>>
>>>>>>> Thanks to all for taking the time to answer.
>>>>>>>
>>>>>>> I had already got as far as parsing out the feature information
>>>>>>> using something like
>>>>>>>
>>>>>>> LinkedHashMap<String, DNASequence> dnaSequences =
>>>>>>>         GenbankReaderHelper.readGenbankDNASequence(dnaFile);
>>>>>>> for (DNASequence sequence : dnaSequences.values()) {
>>>>>>>     List<FeatureInterface<AbstractSequence<NucleotideCompound>, NucleotideCompound>> fl =
>>>>>>>             sequence.getFeatures();
>>>>>>>     for (FeatureInterface<AbstractSequence<NucleotideCompound>, NucleotideCompound> fi : fl) {
>>>>>>>         // Qualifiers are the /key="value" lines listed under each feature.
>>>>>>>         HashMap<String, Qualifier> quals = fi.getQualifiers();
>>>>>>>         for (Map.Entry<String, Qualifier> entry : quals.entrySet()) {
>>>>>>>             logger.info("--\t" + entry.getKey() + "\t|\t" + entry.getValue().getName()
>>>>>>>                     + "  /  " + entry.getValue().getValue() + "\\" + entry.getValue().toString());
>>>>>>>         }
>>>>>>>         logger.info("SHORT\t" + fi.getShortDescription());
>>>>>>>         logger.info("SOURCE\t" + fi.getSource());
>>>>>>>         logger.info("TYPE\t" + fi.getType());
>>>>>>>         logger.info("HASHCODE\t" + fi.hashCode());
>>>>>>>         logger.info("-");
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> But I am still stumped as to how to access the annotation
>>>>>>> information at the top of a GenBank file.
>>>>>>>
>>>>>>> For example, getAccession gets me the accession number of the
>>>>>>> sequence, but what about all the other data that is there (e.g. the
>>>>>>> PubMed records)?
>>>>>>>
>>>>>>> In BJ3, there was a RichAnnotation class, but I don't see anything
>>>>>>> equivalent in BJ4.
>>>>>>>
>>>>>>> cheers
>>>>>>>
>>>>>>> Simon
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 3, 2015 at 12:39 PM, Paolo Pavan <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Simon,
>>>>>>>> I took care of the latest updates to the GenBank parser (reader).
>>>>>>>> Currently there are two ways to read annotated GenBank files: via
>>>>>>>> GenbankReader and via GenbankProxySequenceReader.
>>>>>>>>
>>>>>>>> The first one:
>>>>>>>> GenbankReader<ProteinSequence, AminoAcidCompound> genbankProtein
>>>>>>>>         = new GenbankReader<ProteinSequence, AminoAcidCompound>(
>>>>>>>>                 inStream,
>>>>>>>>                 new GenericGenbankHeaderParser<ProteinSequence, AminoAcidCompound>(),
>>>>>>>>                 new ProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet())
>>>>>>>>         );
>>>>>>>> LinkedHashMap<String, ProteinSequence> proteinSequences = genbankProtein.process();
>>>>>>>> inStream.close();
>>>>>>>>
>>>>>>>>
>>>>>>>> The second one is:
>>>>>>>>
>>>>>>>> GenbankProxySequenceReader<AminoAcidCompound> genbankProteinReader
>>>>>>>>         = new GenbankProxySequenceReader<AminoAcidCompound>(
>>>>>>>>                 "/my_directory",
>>>>>>>>                 "NP_000257",
>>>>>>>>                 AminoAcidCompoundSet.getAminoAcidCompoundSet());
>>>>>>>> ProteinSequence proteinSequence = new ProteinSequence(genbankProteinReader);
>>>>>>>>
>>>>>>>>
>>>>>>>> Just keep in mind to use NucleotideCompound and a
>>>>>>>> DNASequenceCreator(DNACompoundSet.getDNACompoundSet()) if you need to
>>>>>>>> parse GenBank nucleotide files.
>>>>>>>>
>>>>>>>> You can access the stored annotation via the getFeatures() family of
>>>>>>>> methods on the sequence object you have read. Also note that features
>>>>>>>> have qualifiers (the lines starting with / in the GenBank file), and
>>>>>>>> they must be accessed from the feature object with getQualifiers().
>>>>>>>> Also note that a feature can have a complex location (rare, but
>>>>>>>> present); in this case you will find nested locations in the feature
>>>>>>>> retrieved.
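
To make the idea of nested locations concrete, here is a minimal sketch of how a GenBank location string such as `join(complement(1..100),200..300)` breaks down into a tree. It does not use BioJava's own location classes; `LocationSketch` and its fields are hypothetical names, and only operators written as `name(...)` with comma-separated parts are handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a BioJava class): parses a GenBank location string into a tree.
class LocationSketch {

    final String operator;  // e.g. "join" or "complement"; "range" marks a leaf
    final String range;     // raw leaf text such as "1..>400"; null for operator nodes
    final List<LocationSketch> parts = new ArrayList<>();

    private LocationSketch(String operator, String range) {
        this.operator = operator;
        this.range = range;
    }

    static LocationSketch parse(String text) {
        String s = text.trim();
        int open = s.indexOf('(');
        if (open < 0 || !s.endsWith(")")) {
            return new LocationSketch("range", s); // plain range, no nesting
        }
        LocationSketch node = new LocationSketch(s.substring(0, open), null);
        String inner = s.substring(open + 1, s.length() - 1);
        int depth = 0;
        int start = 0;
        for (int i = 0; i < inner.length(); i++) {
            char c = inner.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            } else if (c == ',' && depth == 0) {
                // Only commas outside nested parentheses separate the parts of this node.
                node.parts.add(parse(inner.substring(start, i)));
                start = i + 1;
            }
        }
        node.parts.add(parse(inner.substring(start)));
        return node;
    }

    // Prints the tree with one level of indentation per nesting level.
    static void print(LocationSketch node, String indent) {
        System.out.println(indent + (node.range != null ? node.range : node.operator));
        for (LocationSketch part : node.parts) {
            print(part, indent + "  ");
        }
    }

    public static void main(String[] args) {
        print(parse("join(complement(1..100),200..300)"), "");
    }
}
```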
>>>>>>>>
>>>>>>>> Does this answer your question?
>>>>>>>> Bye bye,
>>>>>>>> Paolo
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2015-06-03 10:27 GMT+02:00 Jose Manuel Duarte <[email protected]>:
>>>>>>>>
>>>>>>>>> I can't offer much help regarding GenBank parsing itself, but I
>>>>>>>>> would at least like to clarify the situation with the different
>>>>>>>>> (indeed confusing) versions:
>>>>>>>>>
>>>>>>>>> BJ4 is the current release, well maintained and under development.
>>>>>>>>> BJ3 has been completely superseded by BJ4. That means that BJ4 does
>>>>>>>>> everything that BJ3 did. In the cookbook and tutorials everything that
>>>>>>>>> refers to BJ3 should work in BJ4, with the only difference that the
>>>>>>>>> namespace of packages has changed from org.biojava.bio/org.biojava3 to
>>>>>>>>> org.biojava.nbio.
>>>>>>>>>
>>>>>>>>> BJ1 and BJX are both legacy projects, with some maintenance but
>>>>>>>>> not much active development. I believe that some of the features in
>>>>>>>>> them were not ported to BJ3+.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> Jose
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 02.06.2015 11:40, Simon Rayner wrote:
>>>>>>>>>
>>>>>>>>>> Hi
>>>>>>>>>>
>>>>>>>>>> I'm coming back to BioJava (BJ) after a couple of years away and
>>>>>>>>>> am somewhat confused by the current collection of cookbooks,
>>>>>>>>>> tutorials and APIs. There appear to be a few examples for handling
>>>>>>>>>> protein structure data, but relatively little for more mainstream
>>>>>>>>>> stuff such as parsing GenBank files, which I first need to get the
>>>>>>>>>> information I want to investigate protein structure. But when I
>>>>>>>>>> look at the relevant code samples to do this, they refer back to
>>>>>>>>>> BJ3, BJ1, or even BJX. Even the Wiki page still refers to BJ3
>>>>>>>>>> despite the release of BJ4 back in Feb 2015.
>>>>>>>>>>
>>>>>>>>>> I have everything working for parsing GenBank data, but I'm still
>>>>>>>>>> trying to get the Annotation information out of the top of a
>>>>>>>>>> GenBank file, and can't find any way of doing this using BJ4 - the
>>>>>>>>>> BJ4 API appears to refer to the RichAnnotation type in the BJX
>>>>>>>>>> release. Can anyone clarify what you are supposed to do here?
>>>>>>>>>> Start mixing in some BJX (and is BJX still active?), or should I
>>>>>>>>>> still be using BJ3 until BJ4 stabilizes? I realise this is an open
>>>>>>>>>> source project, but some clarification on the current status of
>>>>>>>>>> things would be handy if the project is going to appeal to a
>>>>>>>>>> larger community :)
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Biojava-l mailing list  -  [email protected]
>>>>>>>>>> http://mailman.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
>

