Dear Drs. Johnson and Yoo,

For the past week I have been working on parsing
the elements from a set of XML files, following your
suggestions.

So far I have been unsuccessful, and I have no clue why
I am failing.

I have ~16K XML files. The data were obtained from Johns
Hopkins University (these are public data, and their use
is allowed for teaching and non-commercial purposes).


>>> from elementtree.ElementTree import ElementTree
>>> mydata = ElementTree(file='00004.xml')
>>> for process in mydata.findall('//biological_process'):
        print process.text

        
>>> for proc in mydata.findall('functions'):
        print proc

        
>>> 



I do not understand why I am unable to parse this
file. I wondered whether the file is not well-formed,
but it looks properly structured to me, and yet it is
unparsable.
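
On the well-formedness question: one way to settle it is to let an XML
parser decide rather than judging by eye. Below is a minimal sketch,
assuming the standard-library xml.etree.ElementTree module (the later
home of the elementtree package used above) and a helper name,
is_well_formed, that is made up for illustration. Well-formedness only
rules out syntax errors; a file can pass this check and findall() can
still return nothing.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses without an XML syntax error.

    Hypothetical helper for illustration; it checks syntax only and
    does not validate against a DTD.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# A properly nested document passes; an unclosed <b> tag does not.
print(is_well_formed("<a><b/></a>"))
print(is_well_formed("<a><b></a>"))
```

For a file on disk, the same idea applies with ET.parse(filename) in
place of ET.fromstring(xml_text).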


Would you please help me understand what the problem
is? Apologies if I am overlooking something obvious.
 

PS: Attached is the XML file that I am using. 

--- Kent Johnson <[EMAIL PROTECTED]> wrote:

> ps python wrote:
> >  Kent and Dany, 
> > Thanks for your replies.  
> > 
> > Here fromstring() assumes that the input is in a
> > kind of text format.
> 
> Right, that is for the sake of a simple example.
> > 
> > what should be the case when I am reading files
> > directly. 
> > 
> > I am using the following :
> > 
> > from elementtree.ElementTree import ElementTree
> > mydata = ElementTree(file='00001.xml')
> > iter = mydata.getiterator()
> > 
> > Here the whole XML document is loaded as an element tree.
> > How should I get this iter into a format where I can
> > apply the findall() method?
> 
> Call findall() directly on mydata, e.g.
> for process in mydata.findall('//biological_process'):
>    print process.text
> 
> The path //biological_process means find any
> biological_process element at any depth from the
> root element.
> 
> Kent
> 
> _______________________________________________
> Tutor maillist  -  Tutor@python.org
> http://mail.python.org/mailman/listinfo/tutor
> 

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE HPRDr2 SYSTEM "hprdr2.dtd">
<HPRDr2 xmlns="org:hprd:dtd:hprdr2">
  <protein isoform="1" version="1" id="HPRD_00004">
   <title>Aldehyde dehydrogenase 3</title>
    
        <alt_title>Aldehyde dehydrogenase family 3 subfamily A, member 1</alt_title>
    
    
        <alt_title>ALDH3</alt_title>
    
    
        <alt_title>Acetaldehyde dehydrogenase 3</alt_title>
    
    
        <alt_title>ALDH, Stomach type</alt_title>
    
    
        <alt_title>ALDHIII</alt_title>
    
    <omim>100660</omim> 
    <gene_symbol>ALDH3A1</gene_symbol>
    <gene_map_locus>
      <title>17p11.2</title>
      
          <pubmed>7774944</pubmed>
      
    </gene_map_locus>
    <seq_entry source="Ref-Seq">
      <entry_cdna>NM_000691.3</entry_cdna>
      <entry_protein>NP_000682.3</entry_protein>
    </seq_entry>
    <molecular_weight>50398</molecular_weight>
    <!-- Large sections begin here. -->
    <entry_sequence>
      <cdna length="1722">ccaggagccc cagttaccgg gagaggctgt
gtcaaaggcg ccatgagcaa gatcagcgag
gccgtgaagc gcgcccgcgc cgccttcagc
tcgggcagga cccgtccgct gcagttccgg
atccagcagc tggaggcgct gcagcgcctg
atccaggagc aggagcagga gctggtgggc
gcgctggccg cagacctgca caagaatgaa
tggaacgcct actatgagga ggtggtgtac
gtcctagagg agatcgagta catgatccag
aagctccctg agtgggccgc ggatgagccc
gtggagaaga cgccccagac tcagcaggac
gagctctaca tccactcgga gccactgggc
gtggtcctcg tcattggcac ctggaactac
cccttcaacc tcaccatcca gcccatggtg
ggcgccatcg ctgcagggaa ctcagtggtc
ctcaagccct cggagctgag tgagaacatg
gcgagcctgc tggctaccat catcccccag
tacctggaca aggatctgta cccagtaatc
aatgggggtg tccctgagac cacggagctg
ctcaaggaga ggttcgacca tatcctgtac
acgggcagca cgggggtggg gaagatcatc
atgacggctg ctgccaagca cctgacccct
gtcacgctgg agctgggagg gaagagtccc
tgctacgtgg acaagaactg tgacctggac
gtggcctgcc gacgcatcgc ctgggggaaa
ttcatgaaca gtggccagac ctgcgtggcc
cctgactaca tcctctgtga cccctcgatc
cagaaccaaa ttgtggagaa gctcaagaag
tcactgaaag agttctacgg ggaagatgct
aagaaatccc gggactatgg aagaatcatt
agtgcccggc acttccagag ggtgatgggc
ctgattgagg gccagaaggt ggcttatggg
ggcaccgggg atgccgccac tcgctacata
gcccccacca tcctcacgga cgtggacccc
cagtccccgg tgatgcaaga ggagatcttc
gggcctgtgc tgcccatcgt gtgcgtgcgc
agcctggagg aggccatcca gttcatcaac
cagcgtgaga agcccctggc cctctacatg
ttctccagca acgacaaggt gattaagaag
atgattgcag agacatccag tggtggggtg
gcggccaacg atgtcatcgt ccacatcacc
ttgcactctc tgcccttcgg gggcgtgggg
aacagcggca tgggatccta ccatggcaag
aagagcttcg agactttctc tcaccgccgc
tcttgcctgg tgaggcctct gatgaatgat
gaaggcctga aggtcagata ccccccgagc
ccggccaaga tgacccagca ctgaggaggg
gttgctccgc ctggcctggc catactgtgt
cccatcggag tgcggaccac cctcactggc
tctcctggcc ctgggagaat cgctcctgca
gccccagccc agccccactc ctctgctgac
ctgctgacct gtgcacaccc cactcccaca
tgggcccagg cctcaccatt ccaagtctcc
acccctttct agaccaataa agagacgaat
acaattttct aactcagcaa aaaaaaaaaa
aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa
aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa
aaaaaaaaaa aa
</cdna>
      <cdna_utr5 start="1" end="42"/>
      <cdna_coding start="43" end="1404"/>
      <cdna_utr3 start="1405" end="1722"/>
      <protein_sequence length="453">mskiseavkr araafssgrt rplqfriqql
ealqrliqeq eqelvgalaa dlhknewnay
yeevvyvlee ieymiqklpe waadepvekt
pqtqqdelyi hseplgvvlv igtwnypfnl
tiqpmvgaia agnsvvlkps elsenmasll
atiipqyldk dlypvinggv pettellker
fdhilytgst gvgkiimtaa akhltpvtle
lggkspcyvd kncdldvacr riawgkfmns
gqtcvapdyi lcdpsiqnqi veklkkslke
fygedakksr dygriisarh fqrvmglieg
qkvayggtgd aatryiapti ltdvdpqspv
mqeeifgpvl pivcvrslee aiqfinqrek
plalymfssn dkvikkmiae tssggvaand
vivhitlhsl pfggvgnsgm gsyhgkksfe
tfshrrsclv rplmndeglk vryppspakm
tqh
</protein_sequence>
    </entry_sequence>
    <protein_domain_architecture>
      <domain domain_source="smart" end="50" type="motif"
              start="23">
        <title>CC</title>
        
             
        
      </domain>
    </protein_domain_architecture>
    <expressions>
      <expression>
        <title>Stomach</title>
        
            <pubmed>1737758</pubmed>
        
      </expression>
      <expression>
        <title>Lung</title>
        
            <pubmed>4073832</pubmed>
        
      </expression>
      <expression>
        <title>Hair</title>
        
            <pubmed>7625577</pubmed>
        
      </expression>
      <expression>
        <title>Saliva </title>
        
            <pubmed>7625577</pubmed>
        
      </expression>
      <expression>
        <title>Liver</title>
        
            <pubmed>1737758</pubmed>
        
      </expression>
      <expression>
        <title>Oesophagus</title>
        
            <pubmed>1737758</pubmed>
        
      </expression>
      <expression>
        <title>Kidney</title>
        
            <pubmed>1737758</pubmed>
        
      </expression>
    </expressions>
    <functions>
      <molecular_class>Enzyme: Dehydrogenase</molecular_class>
      
         <molecular_function>
            <title>Catalytic activity</title>
            <goid>0003824</goid>
         </molecular_function>
      
      <biological_processes>
         <biological_process>
            <title>Metabolism</title>
            <goid>0008152</goid>
         </biological_process>
         <biological_process>
            <title>Energy pathways</title>
            <goid>0006091</goid>
         </biological_process>
      </biological_processes>
    </functions>
    <cellular_component>
      
          <primary>
              <title>cytoplasm</title>
              <go_id>GO:0005737</go_id>
              <go_abbreviation>TAS</go_abbreviation>
              
                 <pubmed>9514081</pubmed>
              
              
                 <pubmed>1306115</pubmed>
              
          </primary>
      
           
          
      
    </cellular_component>
    
    
    <interactions>
<entrySet xmlns="net:sf:psidev:mi"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="net:sf:psidev:mi http://psidev.sourceforge.net/mi/xml/src/MIF.xsd"
          level="1" version="1">
    <entry>
        <source>
            <names>
                <shortLabel>HPRD</shortLabel>
                <fullName>Human Protein Reference Database</fullName>
            </names>
            <bibref>
                <xref>
                     <primaryRef db="PubMed" id="14525934"/>
                     <secondaryRef db="PubMed" id="14681466"/>
                </xref>
            </bibref>
        </source>
        <availabilityList>
            <availability id="copyright">
                This data is copyrighted by Johns Hopkins University.
                Commercial entities may not use this without prior licensing
                authorization. Other databases must agree to enforce the same
                licensing guidelines before making this data public on their website.
            </availability>
        </availabilityList>

        <interactorList>
          
            <proteinInteractor id="ID_HPRD_00004">
                <names>
                    <shortLabel>Aldehyde dehydrogenase 3</shortLabel>
                </names>
                <xref>
                    <primaryRef db="HPRD" id="HPRD_00004"/>
                     
                       <secondaryRef db="PubMed"
    id="7774944"/>
                    
                    <secondaryRef version="3" db="Ref-Seq"
    id="NP_000682"/>
                    
                         
                    
                    <secondaryRef db="Locus-Link" id="218"/> 
                    
                        <secondaryRef db="Unigene" id="575"/> 
                    
                    
                          
                    
                </xref>
                <organism ncbiTaxId="9606">
                   <names>
                      <shortLabel>Human</shortLabel>
                      <fullName>Homo sapiens</fullName>
                   </names>
                </organism>
                <sequence>mskiseavkr araafssgrt rplqfriqql
ealqrliqeq eqelvgalaa dlhknewnay
yeevvyvlee ieymiqklpe waadepvekt
pqtqqdelyi hseplgvvlv igtwnypfnl
tiqpmvgaia agnsvvlkps elsenmasll
atiipqyldk dlypvinggv pettellker
fdhilytgst gvgkiimtaa akhltpvtle
lggkspcyvd kncdldvacr riawgkfmns
gqtcvapdyi lcdpsiqnqi veklkkslke
fygedakksr dygriisarh fqrvmglieg
qkvayggtgd aatryiapti ltdvdpqspv
mqeeifgpvl pivcvrslee aiqfinqrek
plalymfssn dkvikkmiae tssggvaand
vivhitlhsl pfggvgnsgm gsyhgkksfe
tfshrrsclv rplmndeglk vryppspakm
tqh
</sequence>
            </proteinInteractor>
          
        </interactorList>
        <interactionList>
           
            <interaction>
                <availabilityRef ref="copyright"/>
                <experimentList>
                  
                    <experimentDescription
    id="I10654197548104_vt">
                        <bibref>
                            <xref>
                               
                                  
                                       <primaryRef
    db="PubMed" id="12081471"/>
                                       
                                  
                                
                            </xref>
                        </bibref>
                        <interactionDetection>
                            <names>
                                <shortLabel>vt</shortLabel> 
                                <fullName>in vitro</fullName>
                            </names>
                            <xref>
                                <primaryRef db="IOB"
    id="IOB:0002"/> 
                            </xref>
                        </interactionDetection>
                    </experimentDescription>
                  
                </experimentList>
                <participantList>
                    <proteinParticipant>
                        <proteinInteractorRef
    ref="ID_HPRD_00004"/>
                    </proteinParticipant>
                    <proteinParticipant>
                        <proteinInteractorRef
    ref="ID_HPRD_00004"/>
                    </proteinParticipant>
                </participantList>
                <xref>
                    <primaryRef db="HPRD" id="HPRD_00004"/>
                </xref>
                <attributeList>
                   
                      <attribute name="HPRD Author">gopa</attribute>
                      
                   <attribute name="last_updated">2_20_2005</attribute>             
                </attributeList>
            </interaction>
             
        </interactionList>
    </entry>
</entrySet>
</interactions>

    <EXTERNAL_LINKS>
      <SwissProt>None</SwissProt>
      <locusLink>218</locusLink>
      <unigene>Hs.575</unigene>
      <otherResources>None</otherResources>
      <PDB></PDB>
    </EXTERNAL_LINKS>

    <author>
       <annotator>gopa</annotator>
       
    </author>
    <last_updated>2_20_2005</last_updated>

   </protein>
</HPRDr2>
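
A note on the attached file: its root element declares a default
namespace (xmlns="org:hprd:dtd:hprdr2"), and ElementTree matches
elements by their namespace-qualified names, written as {uri}tag. That
may be why the unqualified findall() calls earlier in this message
print nothing. The sketch below is not a definitive fix. It assumes the
standard-library xml.etree.ElementTree module and uses a cut-down
stand-in for the record (SAMPLE) plus two hypothetical helpers,
process_titles and unqualified_matches. It reads the <title> child
because the text of a biological_process element itself is only
whitespace.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the attached record. Only the namespace and the
# functions/biological_processes nesting are copied from the real file.
SAMPLE = """<HPRDr2 xmlns="org:hprd:dtd:hprdr2">
  <protein id="HPRD_00004">
    <functions>
      <biological_processes>
        <biological_process>
          <title>Metabolism</title>
          <goid>0008152</goid>
        </biological_process>
        <biological_process>
          <title>Energy pathways</title>
          <goid>0006091</goid>
        </biological_process>
      </biological_processes>
    </functions>
  </protein>
</HPRDr2>"""

# ElementTree spells a namespaced tag as "{namespace-uri}localname".
NS = "{org:hprd:dtd:hprdr2}"


def process_titles(xml_text):
    """Return the <title> text of every biological_process element."""
    root = ET.fromstring(xml_text)
    titles = []
    # ".//" searches all descendants of the root element.
    for process in root.findall(".//" + NS + "biological_process"):
        titles.append(process.findtext(NS + "title"))
    return titles


def unqualified_matches(xml_text):
    """Run the same search without the namespace, for comparison."""
    root = ET.fromstring(xml_text)
    return root.findall(".//biological_process")


print(process_titles(SAMPLE))            # ['Metabolism', 'Energy pathways']
print(len(unqualified_matches(SAMPLE)))  # 0
```

For a file on disk, ET.parse('00004.xml').getroot() would take the
place of ET.fromstring(SAMPLE).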

