Dear Drs. Johnson and Yoo,

For the past week I have been working on parsing elements from a set of XML files, following your suggestions.
So far I have been unsuccessful, and I have no clue why I am failing. I have ~16K XML files. The data were obtained from Johns Hopkins University (they are public and may be used for teaching and non-commercial purposes).

    >>> from elementtree.ElementTree import ElementTree
    >>> mydata = ElementTree(file='00004.xml')
    >>> for process in mydata.findall('//biological_process'):
    ...     print process.text
    ...
    >>> for proc in mydata.findall('functions'):
    ...     print proc
    ...
    >>>

I do not understand why I am unable to parse this file. I wondered whether the file is not well-formed, but it looks properly structured to me, and yet I cannot parse it. Would you please help me and guide me to what the problem is? Apologies if I am overlooking something obvious.

PS: Attached is the XML file that I am using.

--- Kent Johnson <[EMAIL PROTECTED]> wrote:

> ps python wrote:
> > Kent and Dany,
> > Thanks for your replies.
> >
> > Here fromstring() assuming that the input is in a kind
> > of text format.
>
> Right, that is for the sake of a simple example.
> >
> > what should be the case when I am reading files
> > directly.
> >
> > I am using the following :
> >
> > from elementtree.ElementTree import ElementTree
> > mydata = ElementTree(file='00001.xml')
> > iter = root.getiterator()
> >
> > Here the whole XML document is loaded as element tree
> > and how should this iter into a format where I can
> > apply findall() method.
>
> Call findall() directly on mydata, e.g.
> for process in mydata.findall('//biological_process'):
>     print process.text
>
> The path //biological_process means find any biological_process element
> at any depth from the root element.
>
> Kent
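One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: the root element of the attached file declares a default namespace (xmlns="org:hprd:dtd:hprdr2"), and ElementTree matches tags by their namespace-qualified names, so an unqualified path such as //biological_process may match nothing. A minimal sketch, using the standard-library xml.etree.ElementTree instead of the older elementtree package and a trimmed inline copy of the file (the names NS and SAMPLE are illustrative only):

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns attribute on the <HPRDr2> root element.
NS = "org:hprd:dtd:hprdr2"

# Trimmed stand-in for 00004.xml: same root element and default namespace,
# with only the <functions> branch kept.
SAMPLE = """<HPRDr2 xmlns="org:hprd:dtd:hprdr2">
  <protein isoform="1" version="1" id="HPRD_00004">
    <functions>
      <biological_processes>
        <biological_process>
          <title>Metabolism</title>
          <goid>0008152</goid>
        </biological_process>
        <biological_process>
          <title>Energy pathways</title>
          <goid>0006091</goid>
        </biological_process>
      </biological_processes>
    </functions>
  </protein>
</HPRDr2>"""

root = ET.fromstring(SAMPLE)

# Unqualified tag name: no element has the bare tag "biological_process",
# because every tag is stored as "{namespace}localname".
unqualified = root.findall(".//biological_process")

# Qualified tag name: prefix the local name with the namespace in braces.
qualified = root.findall(".//{%s}biological_process" % NS)

# The <title> children live in the same namespace, so qualify them as well.
titles = [p.findtext("{%s}title" % NS) for p in qualified]

print(len(unqualified), titles)
```

For a file on disk, ET.parse('00004.xml').getroot() would take the place of ET.fromstring(SAMPLE). Note also that process.text on a biological_process element holds only the whitespace before its first child, which is why the sketch reads the title child instead.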
<?xml version="1.0" encoding="utf-8"?> <!DOCTYPE HPRDr2 SYSTEM "hprdr2.dtd"> <HPRDr2 xmlns="org:hprd:dtd:hprdr2"> <protein isoform="1" version="1" id="HPRD_00004"> <title>Aldehyde dehydrogenase 3</title> <alt_title>Aldehyde dehydrogenase family 3 subfamily A, member 1</alt_title> <alt_title>ALDH3</alt_title> <alt_title>Acetaldehyde dehydrogenase 3</alt_title> <alt_title>ALDH, Stomach type</alt_title> <alt_title>ALDHIII</alt_title> <omim>100660</omim> <gene_symbol>ALDH3A1</gene_symbol> <gene_map_locus> <title>17p11.2</title> <pubmed>7774944</pubmed> </gene_map_locus> <seq_entry source="Ref-Seq"> <entry_cdna>NM_000691.3</entry_cdna> <entry_protein>NP_000682.3</entry_protein> </seq_entry> <molecular_weight>50398</molecular_weight> <!-- Large sections begin here. --> <entry_sequence> <cdna length="1722">ccaggagccc cagttaccgg gagaggctgt gtcaaaggcg ccatgagcaa gatcagcgag gccgtgaagc gcgcccgcgc cgccttcagc tcgggcagga cccgtccgct gcagttccgg atccagcagc tggaggcgct gcagcgcctg atccaggagc aggagcagga gctggtgggc gcgctggccg cagacctgca caagaatgaa tggaacgcct actatgagga ggtggtgtac gtcctagagg agatcgagta catgatccag aagctccctg agtgggccgc ggatgagccc gtggagaaga cgccccagac tcagcaggac gagctctaca tccactcgga gccactgggc gtggtcctcg tcattggcac ctggaactac cccttcaacc tcaccatcca gcccatggtg ggcgccatcg ctgcagggaa ctcagtggtc ctcaagccct cggagctgag tgagaacatg gcgagcctgc tggctaccat catcccccag tacctggaca aggatctgta cccagtaatc aatgggggtg tccctgagac cacggagctg ctcaaggaga ggttcgacca tatcctgtac acgggcagca cgggggtggg gaagatcatc atgacggctg ctgccaagca cctgacccct gtcacgctgg agctgggagg gaagagtccc tgctacgtgg acaagaactg tgacctggac gtggcctgcc gacgcatcgc ctgggggaaa ttcatgaaca gtggccagac ctgcgtggcc cctgactaca tcctctgtga cccctcgatc cagaaccaaa ttgtggagaa gctcaagaag tcactgaaag agttctacgg ggaagatgct aagaaatccc gggactatgg aagaatcatt agtgcccggc acttccagag ggtgatgggc ctgattgagg gccagaaggt ggcttatggg ggcaccgggg atgccgccac tcgctacata gcccccacca tcctcacgga cgtggacccc cagtccccgg tgatgcaaga ggagatcttc gggcctgtgc tgcccatcgt gtgcgtgcgc 
agcctggagg aggccatcca gttcatcaac cagcgtgaga agcccctggc cctctacatg ttctccagca acgacaaggt gattaagaag atgattgcag agacatccag tggtggggtg gcggccaacg atgtcatcgt ccacatcacc ttgcactctc tgcccttcgg gggcgtgggg aacagcggca tgggatccta ccatggcaag aagagcttcg agactttctc tcaccgccgc tcttgcctgg tgaggcctct gatgaatgat gaaggcctga aggtcagata ccccccgagc ccggccaaga tgacccagca ctgaggaggg gttgctccgc ctggcctggc catactgtgt cccatcggag tgcggaccac cctcactggc tctcctggcc ctgggagaat cgctcctgca gccccagccc agccccactc ctctgctgac ctgctgacct gtgcacaccc cactcccaca tgggcccagg cctcaccatt ccaagtctcc acccctttct agaccaataa agagacgaat acaattttct aactcagcaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aa </cdna> <cdna_utr5 start="1" end="42"/> <cdna_coding start="43" end="1404"/> <cdna_utr3 start="1405" end="1722"/> <protein_sequence length="453">mskiseavkr araafssgrt rplqfriqql ealqrliqeq eqelvgalaa dlhknewnay yeevvyvlee ieymiqklpe waadepvekt pqtqqdelyi hseplgvvlv igtwnypfnl tiqpmvgaia agnsvvlkps elsenmasll atiipqyldk dlypvinggv pettellker fdhilytgst gvgkiimtaa akhltpvtle lggkspcyvd kncdldvacr riawgkfmns gqtcvapdyi lcdpsiqnqi veklkkslke fygedakksr dygriisarh fqrvmglieg qkvayggtgd aatryiapti ltdvdpqspv mqeeifgpvl pivcvrslee aiqfinqrek plalymfssn dkvikkmiae tssggvaand vivhitlhsl pfggvgnsgm gsyhgkksfe tfshrrsclv rplmndeglk vryppspakm tqh </protein_sequence> </entry_sequence> <protein_domain_architecture> <domain domain_source="smart" end="50" type="motif" start="23"> <title>CC</title> </domain> </protein_domain_architecture> <expressions> <expression> <title>Stomach</title> <pubmed>1737758</pubmed> </expression> <expression> <title>Lung</title> <pubmed>4073832</pubmed> </expression> <expression> <title>Hair</title> <pubmed>7625577</pubmed> </expression> <expression> <title>Saliva </title> <pubmed>7625577</pubmed> </expression> <expression> <title>Liver</title> <pubmed>1737758</pubmed> </expression> <expression> <title>Oesophagus</title> <pubmed>1737758</pubmed> 
</expression> <expression> <title>Kidney</title> <pubmed>1737758</pubmed> </expression> </expressions> <functions> <molecular_class>Enzyme: Dehydrogenase</molecular_class> <molecular_function> <title>Catalytic activity</title> <goid>0003824</goid> </molecular_function> <biological_processes> <biological_process> <title>Metabolism</title> <goid>0008152</goid> </biological_process> <biological_process> <title>Energy pathways</title> <goid>0006091</goid> </biological_process> </biological_processes> </functions> <cellular_component> <primary> <title>cytoplasm</title> <go_id>GO:0005737</go_id> <go_abbreviation>TAS</go_abbreviation> <pubmed>9514081</pubmed> <pubmed>1306115</pubmed> </primary> </cellular_component> <interactions> <entrySet xmlns="net:sf:psidev:mi" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="net:sf:psidev:mi http://psidev.sourceforge.net/mi/xml/src/MIF.xsd" level="1" version="1"> <entry> <source> <names> <shortLabel>HPRD</shortLabel> <fullName>Human Protein Reference Database</fullName> </names> <bibref> <xref> <primaryRef db="PubMed" id="14525934"/> <secondaryRef db="PubMed" id="14681466"/> </xref> </bibref> </source> <availabilityList> <availability id="copyright"> This data is copyrighted by Johns Hopkins University. Commercial entities may not use this without prior licensing authorization. Other databases must agree to enforce the same licensing guidelines before making this data public on their website. 
</availability> </availabilityList> <interactorList> <proteinInteractor id="ID_HPRD_00004"> <names> <shortLabel>Aldehyde dehydrogenase 3</shortLabel> </names> <xref> <primaryRef db="HPRD" id="HPRD_00004"/> <secondaryRef db="PubMed" id="7774944"/> <secondaryRef version="3" db="Ref-Seq" id="NP_000682"/> <secondaryRef db="Locus-Link" id="218"/> <secondaryRef db="Unigene" id="575"/> </xref> <organism ncbiTaxId="9606"> <names> <shortLabel>Human</shortLabel> <fullName>Homo sapiens</fullName> </names> </organism> <sequence>mskiseavkr araafssgrt rplqfriqql ealqrliqeq eqelvgalaa dlhknewnay yeevvyvlee ieymiqklpe waadepvekt pqtqqdelyi hseplgvvlv igtwnypfnl tiqpmvgaia agnsvvlkps elsenmasll atiipqyldk dlypvinggv pettellker fdhilytgst gvgkiimtaa akhltpvtle lggkspcyvd kncdldvacr riawgkfmns gqtcvapdyi lcdpsiqnqi veklkkslke fygedakksr dygriisarh fqrvmglieg qkvayggtgd aatryiapti ltdvdpqspv mqeeifgpvl pivcvrslee aiqfinqrek plalymfssn dkvikkmiae tssggvaand vivhitlhsl pfggvgnsgm gsyhgkksfe tfshrrsclv rplmndeglk vryppspakm tqh </sequence> </proteinInteractor> </interactorList> <interactionList> <interaction> <availabilityRef ref="copyright"/> <experimentList> <experimentDescription id="I10654197548104_vt"> <bibref> <xref> <primaryRef db="PubMed" id="12081471"/> </xref> </bibref> <interactionDetection> <names> <shortLabel>vt</shortLabel> <fullName>in vitro</fullName> </names> <xref> <primaryRef db="IOB" id="IOB:0002"/> </xref> </interactionDetection> </experimentDescription> </experimentList> <participantList> <proteinParticipant> <proteinInteractorRef ref="ID_HPRD_00004"/> </proteinParticipant> <proteinParticipant> <proteinInteractorRef ref="ID_HPRD_00004"/> </proteinParticipant> </participantList> <xref> <primaryRef db="HPRD" id="HPRD_00004"/> </xref> <attributeList> <attribute name="HPRD Author">gopa</attribute> <attribute name="last_updated">2_20_2005</attribute> </attributeList> </interaction> </interactionList> </entry> </entrySet> </interactions> <EXTERNAL_LINKS> 
<SwissProt>None</SwissProt> <locusLink>218</locusLink> <unigene>Hs.575</unigene> <otherResources>None</otherResources> <PDB></PDB> </EXTERNAL_LINKS> <author> <annotator>gopa</annotator> </author> <last_updated>2_20_2005</last_updated> </protein> </HPRDr2>
_______________________________________________ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor