I updated my sources a few minutes ago and everything works fine now (both the annotation and the split-on-equals problems).
I've tested both the EBI file and Ensembl file.

Thanks for fixing the problems!!

Cheers,

Morgane

Jolyon Holdstock wrote:
> No, I'll update my source.
>
> Thanks,
>
> Jolyon
>
>
> -----Original Message-----
> From: Richard Holland [mailto:[EMAIL PROTECTED]
> Sent: 20 April 2006 13:16
> To: Jolyon Holdstock
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: RE: [Biojava-l] [biojavax] EMBL parser : features parsing[Scanned]
>
> Did you use the latest CVS version? (I committed a change that I think
> should have fixed that about 1 minute before my previous email).
>
>
> On Thu, 2006-04-20 at 13:08 +0100, Jolyon Holdstock wrote:
>
>> I've run the sequence through the parser and it seems to work OK. I
>> iterate through the features and then iterate through the annotations of
>> that feature
>>
>> Based on the input....
>>
>> FT   source          1..118
>> FT                   /organism="Triturus helveticus"
>> FT                   /mol_type="genomic DNA"
>> FT                   /clone="Thel.b9"
>> FT                   /db_xref="taxon:256425"
>> FT   gene            <1..>118
>> FT                   /gene="Hoxb9"
>> FT                   /note="Hoxb-9"
>> FT   mRNA            <1..>118
>> FT                   /gene="Hoxb9"
>> FT                   /product="HOXB9"
>> FT   CDS             <1..>118
>> FT                   /codon_start=2
>> FT                   /gene="Hoxb9"
>> FT                   /product="HOXB9"
>> FT                   /db_xref="UniProtKB/TrEMBL:Q2LK47"
>> FT                   /protein_id="ABA39736.1"
>> FT                   /translation="KYQTLELEKEFLFNMYLTRDRRHEVARLLNLSERQVKIW"
>>
>> The output is....
>>
>> ========================================
>> Feature: (#0) lcl:DQ158013/DQ158013.1:source,EMBL(1..118)
>> Note: (#0) biojavax:mol_type: genomic DNA
>> Note: (#1) biojavax:clone: Thel.b9
>> ========================================
>> Feature: (#1) lcl:DQ158013/DQ158013.1:gene,EMBL(<1..118>)
>> Note: (#2) biojavax:gene: Hoxb9
>> Note: (#3) biojavax:note: Hoxb-9
>> ========================================
>> Feature: (#2) lcl:DQ158013/DQ158013.1:mRNA,EMBL(<1..118>)
>> Note: (#4) biojavax:gene: Hoxb9
>> Note: (#5) biojavax:product: HOXB9
>> ========================================
>> Feature: (#3) lcl:DQ158013/DQ158013.1:CDS,EMBL(<1..118>)
>> Note: (#6) biojavax:codon_start: 2
>> Note: (#7) biojavax:gene: Hoxb9
>> Note: (#8) biojavax:product: HOXB9
>> Note: (#9) biojavax:protein_id: ABA39736.1
>> Note: (#10) biojavax:translation: KYQTLELEKEFLFNMYLTRDRRHEVARLLNLSERQVKIW
>> Note: (#11) biojavax:translation: KYQTLELEKEFLFNMYLTRDRRHEVARLLNLSERQVKIW
>> =============================================
>>
>> This looks OK, the one thing I've just noticed is that the last piece of
>> annotation of the last feature is assigned twice.
>>
>> Jolyon
>>
>>
>> -----Original Message-----
>> From: Richard Holland [mailto:[EMAIL PROTECTED]
>> Sent: 20 April 2006 13:05
>> To: [EMAIL PROTECTED]
>> Cc: Jolyon Holdstock; [EMAIL PROTECTED]
>> Subject: Re: [Biojava-l] [biojavax] EMBL parser : features parsing[Scanned]
>>
>> Hi.
>>
>> I made some small changes to the code, although nothing that would fix
>> this kind of problem, committed it back to CVS, checked it out again,
>> compiled, and ran a test program that read in an EMBL file with the
>> feature table you describe below, and output it in EMBL format to
>> another file. I then compared the two files... and found no differences!
>> The split-on-equals problem didn't occur, and all notes appeared
>> alongside their correct features.
>>
>> Could there be a problem maybe with the script you are using?
>>
>> I've really no idea what the problem is as I can't reproduce it based on
>> the current CVS contents!
>>
>> cheers,
>> Richard
>>
>> On Thu, 2006-04-20 at 11:35 +0200, Morgane THOMAS-CHOLLIER wrote:
>>
>>> Hi,
>>>
>>> I have tested today's version from CVS.
>>>
>>> Both EBI and Ensembl files now react the same way.
>>> The last annotation of a feature is nevertheless related to its
>>> immediate following feature.
>>> e.g. :
>>>
>>> FT   gene            <1..>118
>>> FT                   /gene="Hoxb9"
>>> FT                   /note="Hoxb-9"
>>> FT   mRNA            <1..>118
>>> FT                   /gene="Hoxb9"
>>> FT                   /product="HOXB9"
>>> FT   CDS             <1..>118
>>>
>>> /note="Hoxb-9" is related to mRNA
>>> /product="HOXB9" is related to CDS
>>>
>>> Concerning the split-on-equals problem, I still observe the problem :
>>> [(#2) biojavax:note: transcript_i]
>>>
>>> for this annotation : /note="transcript_id=ENSMUST00000048680"
>>>
>>> Thanks for helping,
>>>
>>> Cheers,
>>>
>>> Morgane.
>>>
>>> Richard Holland wrote:
>>>
>>>> I have committed an UNTESTED patch based on Jolyon's suggestion, and
>>>> also attempted to fix the split-on-equals problem Morgane observed.
>>>> Please let me know if there are any problems with it.
>>>>
>>>> As this problem affected the UniProt parser in a similar manner (much of
>>>> the code is identical), the same fixes were applied there too.
>>>>
>>>> cheers,
>>>> Richard
>>>>
>>>> On Thu, 2006-04-13 at 17:42 +0100, Jolyon Holdstock wrote:
>>>>
>>>>
>>>>> Hi Morgane,
>>>>>
>>>>> I have amended the EmblFormat readSection method as below and the
>>>>> parsing seems to work; please test it.
>>>>>
>>>>> I think that the last bit of annotation is carried over into the next
>>>>> feature so before adding the new feature I dump the annotation and reset
>>>>> currentTag and currentVal.
>>>>>
>>>>> if (!line.startsWith(" ")) {
>>>>>     //--------- new code starts ---------------------------
>>>>>     if (currentTag!=null) {
>>>>>         section.add(new String[]{currentTag,currentVal.toString()});
>>>>>         currentTag = null;
>>>>>         currentVal = null;
>>>>>     }
>>>>>     //--------- new code ends -----------------------------
>>>>>     // case 1 : word value - splits into key-value on its own
>>>>>     section.add(line.split("\\s+"));
>>>>> }
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Jolyon
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: [EMAIL PROTECTED]
>>>>> [mailto:[EMAIL PROTECTED] On Behalf Of Morgane
>>>>> THOMAS-CHOLLIER
>>>>> Sent: 12 April 2006 09:35
>>>>> To: [EMAIL PROTECTED]
>>>>> Subject: [Biojava-l] [biojavax] EMBL parser : features parsing[Scanned]
>>>>>
>>>>> Hello again,
>>>>>
>>>>> I am currently using biojavax to parse EMBL files exported from the
>>>>> Ensembl website.
>>>>>
>>>>> Compared to the EBI files I have, they show a difference in the
>>>>> Features lines:
>>>>>
>>>>> sometimes only one "/word" is present, i.e.:
>>>>>
>>>>> EBI file:
>>>>>
>>>>> FT   gene            <1..>118
>>>>> FT                   /gene="Hoxb9"
>>>>> FT                   /note="Hoxb-9"
>>>>>
>>>>> Ensembl file:
>>>>>
>>>>> FT   gene            complement(1..3218)
>>>>> FT                   /gene="ENSMUSG00000038227"
>>>>>
>>>>> The problem I encounter is that the parser correctly converts the
>>>>> "/word" into a Note, but the Note is then attached to the immediately
>>>>> following feature (i.e. mRNA).
>>>>> The current gene feature thus has no annotation.
>>>>>
>>>>> This behavior is reproducible when removing one "/word" from an EBI
>>>>> file.
>>>>>
>>>>> Apart from this issue, I noted that Ensembl EMBL files use "=" inside a
>>>>> feature (i.e. /note="transcript_id=ENSMUST00000048680"), which ends up
>>>>> as an incomplete Note, as the parser seems to split on "=" to separate
>>>>> the Key and the Value.
>>>>>
>>>>> Thanks for your help,
>>>>>
>>>>> Morgane.
>>>>>
>>>>>
>>>>>

--
**********************************************************
Morgane THOMAS-CHOLLIER, PHD Student ([EMAIL PROTECTED])
Vrije Universiteit Brussels (VUB)
Laboratory of Cell Genetics
Pleinlaan 2
1050 Brussels
Belgium
_______________________________________________
Biojava-l mailing list - [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l
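
For anyone reading this thread in the archive, the two fixes discussed above can be sketched in isolation. This is only a minimal illustration under stated assumptions, not the EmblFormat code that was committed to CVS: the class FeatureTableSketch and the helpers splitQualifier and unquote are hypothetical names, and the input is assumed to have the leading "FT" columns already stripped, so that a feature key starts in column 0 and qualifier lines start with a space. The sketch flushes the pending tag before a new feature is added (as in Jolyon's suggestion), flushes once more at the end of the section, and splits each qualifier on the first "=" only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is NOT the biojavax EmblFormat source.
final class FeatureTableSketch {

    // Split "/key=value" on the FIRST '=' only, so any '=' inside the value survives.
    static String[] splitQualifier(String qualifier) {
        String body = qualifier.startsWith("/") ? qualifier.substring(1) : qualifier;
        String[] parts = body.split("=", 2);
        return new String[]{parts[0], parts.length > 1 ? parts[1] : ""};
    }

    // Remove one pair of surrounding double quotes, if present.
    static String unquote(String value) {
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            return value.substring(1, value.length() - 1);
        }
        return value;
    }

    // Assumption: the "FT" prefix was stripped, so feature keys start in column 0
    // and qualifier or continuation lines start with a space.
    static List<String[]> readSection(List<String> lines) {
        List<String[]> section = new ArrayList<String[]>();
        String currentTag = null;
        StringBuilder currentVal = null;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            boolean newFeature = !line.startsWith(" ");
            boolean newQualifier = !newFeature && trimmed.startsWith("/");
            if ((newFeature || newQualifier) && currentTag != null) {
                // Flush the pending qualifier so it stays with its own feature.
                section.add(new String[]{currentTag, unquote(currentVal.toString())});
                currentTag = null;
                currentVal = null;
            }
            if (newFeature) {
                // Feature key and location, e.g. {"gene", "<1..>118"}.
                section.add(trimmed.split("\\s+"));
            } else if (newQualifier) {
                String[] kv = splitQualifier(trimmed);
                currentTag = kv[0];
                currentVal = new StringBuilder(kv[1]);
            } else if (currentTag != null) {
                // Continuation of a wrapped value; appended without a separator
                // (a simplification of how real values are re-joined).
                currentVal.append(trimmed);
            }
        }
        if (currentTag != null) {
            // Flush the last qualifier of the section once.
            section.add(new String[]{currentTag, unquote(currentVal.toString())});
        }
        return section;
    }

    public static void main(String[] args) {
        List<String[]> section = readSection(Arrays.asList(
                "gene            <1..>118",
                "                /gene=\"Hoxb9\"",
                "                /note=\"Hoxb-9\"",
                "mRNA            <1..>118",
                "                /gene=\"Hoxb9\"",
                "                /product=\"HOXB9\""));
        for (String[] entry : section) {
            System.out.println(Arrays.toString(entry));
        }
        String[] kv = splitQualifier("/note=\"transcript_id=ENSMUST00000048680\"");
        System.out.println(kv[0] + " -> " + unquote(kv[1]));
    }
}
```

With the feature table from the thread, each qualifier ends up directly after the feature line it belongs to, and the Ensembl-style note keeps the text after its inner "=". Quotes are removed only at flush time so that a value wrapped over several lines is unquoted as a whole.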
