Re: [Tutor] how to extract text by specifying an element using ElementTree
Danny Yoo wrote:
> On Wed, 21 Dec 2005, ps python wrote:
>
>> Dear Drs. Yoo and Johnson, thank you very much for your help. I
>> successfully parsed my GO annotation from all 16,000 files. Thanks again
>> for your kind help.
>
> I'm glad to hear that it's working for you now. Just as a clarification:
> I'm not a doctor. *grin* But I do work with bioinformaticians, so I
> recognize the Gene Ontology annotations you are working with.

No doctor here either. But I'll take it as a compliment!

Kent

___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] how to extract text by specifying an element using ElementTree
On Wed, 21 Dec 2005, ps python wrote:

> Dear Drs. Yoo and Johnson, thank you very much for your help. I
> successfully parsed my GO annotation from all 16,000 files. Thanks again
> for your kind help.

I'm glad to hear that it's working for you now. Just as a clarification: I'm not a doctor. *grin* But I do work with bioinformaticians, so I recognize the Gene Ontology annotations you are working with.
Re: [Tutor] how to extract text by specifying an element using ElementTree
Dear Drs. Yoo and Johnson,

Thank you very much for your help. I successfully parsed my GO annotation from all 16,000 files. Thanks again for your kind help.
Re: [Tutor] how to extract text by specifying an element using ElementTree
> >>> for m in mydata.findall('//functions'):
>         print m.get('molecular_class').text
>
> >>> for m in mydata.findall('//functions'):
>         print m.find('molecular_class').text.strip()
>
> >>> for process in mydata.findall('//biological_process'):
>         print process.get('title').text

Hello,

I believe we're running into XML namespace issues. If we look at all the tag names in the XML, we can see this:

##
>>> from elementtree import ElementTree
>>> tree = ElementTree.parse(open('4.xml'))
>>> for element in tree.getroot()[0]: print element.tag
...
{org:hprd:dtd:hprdr2}title
{org:hprd:dtd:hprdr2}alt_title
{org:hprd:dtd:hprdr2}alt_title
{org:hprd:dtd:hprdr2}alt_title
{org:hprd:dtd:hprdr2}alt_title
{org:hprd:dtd:hprdr2}alt_title
{org:hprd:dtd:hprdr2}omim
{org:hprd:dtd:hprdr2}gene_symbol
{org:hprd:dtd:hprdr2}gene_map_locus
{org:hprd:dtd:hprdr2}seq_entry
{org:hprd:dtd:hprdr2}molecular_weight
{org:hprd:dtd:hprdr2}entry_sequence
{org:hprd:dtd:hprdr2}protein_domain_architecture
{org:hprd:dtd:hprdr2}expressions
{org:hprd:dtd:hprdr2}functions
{org:hprd:dtd:hprdr2}cellular_component
{org:hprd:dtd:hprdr2}interactions
{org:hprd:dtd:hprdr2}EXTERNAL_LINKS
{org:hprd:dtd:hprdr2}author
{org:hprd:dtd:hprdr2}last_updated
##

(I'm just doing a quick view of the toplevel elements in the tree.)

As we can see, each element's tag is being prefixed with the namespace URI declared in the XML document. If we look in our XML document and search for the attribute 'xmlns', we'll see where this 'org:hprd:dtd:hprdr2' thing comes from.

So we may need to prepend the namespace to get the proper terms:

##
>>> for process in tree.find("//{org:hprd:dtd:hprdr2}biological_processes"):
...     print process.findtext("{org:hprd:dtd:hprdr2}title")
...
Metabolism
Energy pathways
##

To tell the truth, I don't quite understand how to work fluently with XML namespaces, so perhaps there's an easier way to do what you want. But the examples above should help you get started parsing all your Gene Ontology annotations.

Good luck!
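One possibly easier way to handle the namespace, sketched here for a current Python with the standard library's xml.etree.ElementTree (an assumption; the thread uses the older standalone elementtree package). The sample document is a hypothetical, heavily trimmed stand-in: only the namespace URI and the tag names come from the messages above, and the prefix 'h' is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for one file; only the namespace URI and the
# tag names are taken from the thread, the nesting is assumed.
SAMPLE = """
<entry xmlns="org:hprd:dtd:hprdr2">
  <biological_processes>
    <biological_process><title>Metabolism</title></biological_process>
    <biological_process><title>Energy pathways</title></biological_process>
  </biological_processes>
</entry>
"""

root = ET.fromstring(SAMPLE)

# Without the namespace nothing matches, and no error is raised.
unqualified = root.findall(".//biological_process")

# Option 1: spell the namespace out in every step of the path.
NS_URI = "org:hprd:dtd:hprdr2"
qualified = [
    process.findtext("{%s}title" % NS_URI)
    for process in root.findall(".//{%s}biological_process" % NS_URI)
]

# Option 2: map a prefix of our own choosing ('h') to the URI once.
NS = {"h": NS_URI}
prefixed = [
    process.findtext("h:title", namespaces=NS)
    for process in root.findall(".//h:biological_process", NS)
]

print(unqualified)
print(qualified)
print(prefixed)
```

The empty first result matches the "I get nothing. No error." symptom: an unqualified tag name simply never matches an element that lives in a namespace.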
Re: [Tutor] how to extract text by specifying an element using ElementTree
Thank you for your email, Dr. Johnson.

I need to print:

gene_symbol (from line ALDH3A1)
entry_cdna (from line NM_000691.3)
molecular_class (from line Enzyme:Dehydrogenase)
title (from tags Catalytic activity)
title (from tags section Metabolism)
title (from tags section cytoplasm)

This is how I tried:

from elementtree.ElementTree import ElementTree
mydata = ElementTree(file='4.xml')
>>> for process in mydata.findall('//biological_process'):
        print process.get('title').text

>>> for m in mydata.findall('//functions'):
        print m.get('molecular_class').text

>>> for m in mydata.findall('//functions'):
        print m.find('molecular_class').text.strip()

>>> for process in mydata.findall('//biological_process'):
        print process.get('title').text

>>> for m in mydata.findall('//functions'):
        print m.get('molecular_class').text

>>> for m in mydata.findall('//functions'):
        print m.get('title').text.strip()

>>> for m in mydata.findall('//biological_processes'):
        print m.get('title').text.strip()

>>>

Result: I get nothing. No error. I have no clue why it is not giving me the result.

I also tried this alternate way:

>>> strdata = """ Enzyme: Dehydrogenase Catalytic activity 0003824 Metabolism 0008152 Energy pathways 0006091 """
>>> from elementtree import ElementTree
>>> tree = ElementTree.fromstring(strdata)
>>> for m in tree.findall('//functions'):
        print m.find('molecular_class').text

Traceback (most recent call last):
  File "", line 1, in -toplevel-
    for m in tree.findall('//functions'):
  File "C:\Python23\Lib\site-packages\elementtree\ElementTree.py", line 352, in findall
    return ElementPath.findall(self, path)
  File "C:\Python23\Lib\site-packages\elementtree\ElementPath.py", line 195, in findall
    return _compile(path).findall(element)
  File "C:\Python23\Lib\site-packages\elementtree\ElementPath.py", line 173, in _compile
    p = Path(path)
  File "C:\Python23\Lib\site-packages\elementtree\ElementPath.py", line 74, in __init__
    raise SyntaxError("cannot use absolute path on element")
SyntaxError: cannot use absolute path on element
>>> for m in tree.findall('functions'):
        print m.find('molecular_class').text

>>> for m in tree.findall('functions'):
        print m.find('molecular_class').text.strip()

>>> for m in tree.findall('functions'):
        print m.get('molecular_class').text

Do you think the problem is with the XML files instead?

Thank you for your valuable suggestions.

kind regards,
M
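The "cannot use absolute path on element" error comes from calling findall() with a path that starts with '/' on an Element (which is what fromstring() returns) rather than on a whole tree. A minimal sketch of the difference, assuming a current Python with the standard library's xml.etree.ElementTree; the sample markup is hypothetical, with tag names borrowed from the thread and the nesting assumed:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element names follow the thread, nesting is assumed.
root = ET.fromstring(
    "<entries><entry><functions>"
    "<molecular_class>Enzyme: Dehydrogenase</molecular_class>"
    "</functions></entry></entries>"
)

# An Element rejects paths that start with '/'.
try:
    root.findall("//functions")
    absolute_path_error = None
except SyntaxError as exc:
    absolute_path_error = str(exc)

# './/functions' searches the whole subtree below this element.
classes = [f.findtext("molecular_class") for f in root.findall(".//functions")]

# A bare tag name only matches direct children, so this finds nothing here.
direct = root.findall("functions")

print(absolute_path_error)
print(classes)
print(direct)
```

The last line illustrates why a loop over findall('functions') can silently do nothing: an empty result is not an error.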
Re: [Tutor] how to extract text by specifying an element using ElementTree
ps python wrote:
> Dear Drs. Johnson and Yoo,
> for the last 1 week I have been working on parsing
> the elements from a bunch of XML files following your
> suggestions.
>
> from elementtree.ElementTree import ElementTree
>
> mydata = ElementTree(file='4.xml')
> for process in mydata.findall('//biological_process'):
>     print process.text

Looking at the data, neither biological_process nor functions elements directly contain text; they have children that contain text. Try
    print process.find('title').text
to print the title.

> for proc in mydata.findall('functions'):
>     print proc

I think you want findall('//functions') to find functions elements at any depth in the tree.

If this doesn't work please show the results you get and tell us what you expect.

Kent
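A short sketch of how get(), find() and findtext() differ, assuming a current Python with the standard library's xml.etree.ElementTree. The fragment is hypothetical: the 'id' attribute in particular is invented purely to have an attribute to look up.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: 'title' is a child element, 'id' is an attribute.
process = ET.fromstring(
    '<biological_process id="0008152">'
    "<title>Metabolism</title>"
    "</biological_process>"
)

attribute_value = process.get("id")       # attribute lookup
not_an_attribute = process.get("title")   # None: 'title' is a child, not an attribute
child_text = process.find("title").text   # text of the first matching child
same_text = process.findtext("title")     # shortcut for find(...).text
own_text = process.text                   # None: no text directly inside the parent

print(attribute_value, not_an_attribute, child_text, same_text, own_text)
```

Because get() returns None for a missing attribute, chaining .text onto it raises AttributeError, whereas findtext() returns None quietly when the child is missing.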
Re: [Tutor] how to extract text by specifying an element using ElementTree
Dear Drs. Johnson and Yoo,

For the last week I have been working on parsing the elements from a bunch of XML files following your suggestions. Until now I have been unsuccessful, and I have no clue why I am failing.

I have ~16K XML files. This data was obtained from Johns Hopkins University (of course these are public data, and their use is allowed for teaching and non-commercial purposes).

from elementtree.ElementTree import ElementTree
>>> mydata = ElementTree(file='4.xml')
>>> for process in mydata.findall('//biological_process'):
        print process.text

>>> for proc in mydata.findall('functions'):
        print proc

>>>

I do not understand why I am unable to parse this file. I wondered whether the file is not well-formed, but I feel it is properly structured and yet it is unparsable. Would you please help me and guide me as to what the problem is? Apologies if I am completely overlooking something.

PS: Attached is the XML file that I am using.

--- Kent Johnson <[EMAIL PROTECTED]> wrote:

> ps python wrote:
> > Kent and Dany,
> > Thanks for your replies.
> >
> > Here fromstring() assuming that the input is in a kind
> > of text format.
>
> Right, that is for the sake of a simple example.
> >
> > what should be the case when I am reading files
> > directly.
> >
> > I am using the following :
> >
> > from elementtree.ElementTree import ElementTree
> > mydata = ElementTree(file='1.xml')
> > iter = root.getiterator()
> >
> > Here the whole XML document is loaded as element tree
> > and how should this iter into a format where I can
> > apply findall() method.
>
> Call findall() directly on mydata, e.g.
> for process in mydata.findall('//biological_process'):
>     print process.text
>
> The path //biological_process means find any biological_process element
> at any depth from the root element.
> > Kent > > ___ > Tutor maillist - Tutor@python.org > http://mail.python.org/mailman/listinfo/tutor > Send instant messages to your online friends http://in.messenger.yahoo.com Aldehyde dehydrogenase 3 Aldehyde dehydrogenase family 3 subfamily A, member 1 ALDH3 Acetaldehyde dehydrogenase 3 ALDH, Stomach type ALDHIII 100660 ALDH3A1 17p11.2 7774944 NM_000691.3 NP_000682.3 50398 ccaggagccc cagttaccgg gagaggctgt gtcaaaggcg ccatgagcaa gatcagcgag gccgtgaagc gcgcccgcgc cgccttcagc tcgggcagga cccgtccgct gcagttccgg atccagcagc tggaggcgct gcagcgcctg atccaggagc aggagcagga gctggtgggc gcgctggccg cagacctgca caagaatgaa tggaacgcct actatgagga ggtggtgtac gtcctagagg agatcgagta catgatccag aagctccctg agtgggccgc ggatgagccc gtggagaaga cgagac tcagcaggac gagctctaca tccactcgga gccactgggc gtggtcctcg tcattggcac ctggaactac cccttcaacc tcaccatcca gcccatggtg ggcgccatcg ctgcagggaa ctcagtggtc ctcaagccct cggagctgag tgagaacatg gcgagcctgc tggctaccat catcag tacctggaca aggatctgta cccagtaatc aatgtg tccctgagac cacggagctg ctcaaggaga ggttcgacca tatcctgtac acgggcagca cgtggg gaagatcatc atgacggctg ctgccaagca cctgat gtcacgctgg agctgggagg gaagagtccc tgctacgtgg acaagaactg tgacctggac gtggcctgcc gacgcatcgc ctgaaa ttcatgaaca gtggccagac ctgcgtggcc cctgactaca tcctctgtga tcgatc cagaaccaaa ttgtggagaa gctcaagaag tcactgaaag agttctacgg ggaagatgct aagaaatccc gggactatgg aagaatcatt agtgcccggc acttccagag ggtgatgggc ctgattgagg gccagaaggt ggcttatggg ggcacc atgccgccac tcgctacata gcacca tcctcacgga cgtgga cagtgg tgatgcaaga ggagatcttc gggcctgtgc tgcccatcgt gtgcgtgcgc agcctggagg aggccatcca gttcatcaac cagcgtgaga agtggc cctctacatg ttctccagca acgacaaggt gattaagaag atgattgcag agacatccag tggttg gcggccaacg atgtcatcgt ccacatcacc ttgcactctc tgcccttcgg gggcgt aacagcggca tgggatccta ccatggcaag aagagcttcg agactttctc tcaccgccgc tcttgcctgg tgaggcctct gatgaatgat gaaggcctga aggtcagata ccgagc ccggccaaga tgacccagca ctgaggaggg gttgctccgc ctggcctggc catactgtgt cccatcggag tgcggaccac cctcactggc tctcctggcc ctgggagaat cgctcctgca gagccc agactc ctctgctgac 
ctgctgacct gtgcacaccc cactcccaca tgggcccagg cctcaccatt ccaagtctcc atttct agaccaataa agagacgaat acaact aactcagcaa aa aa aa aa aa aa aa aa aa mskiseavkr araafssgrt rplqfriqql ealqrliqeq eqelvgalaa dlhknewnay yeevvyvlee ieymiqklpe waadepvekt pqtqqdelyi hseplgvvlv igtwnypfnl tiqpmvgaia agnsvvlkps elsenmasll atiipqyldk dlypvinggv pettellker fdhilytgst gvgkiimtaa akhltpvtle lggkspcyvd kncdldvacr riawgkfmns gqtcvapdyi lcdpsiqnqi veklkkslke fygedakksr dygriisarh fqrvmglieg qkvayggtgd aatryiapti ltdvdpqspv mqeeifgpvl pivcvrslee aiqfinqrek plalymfssn dkvikkmiae tssggvaand vivhitlhsl pfggvgnsgm gsyhgkksfe tfshrrsclv rplmndeglk vryppspakm tqh CC
Re: [Tutor] how to extract text by specifying an element using ElementTree
Srinivas Iyyer wrote:
> Hi group,
> I just have another question about parsing XML files. I
> found it very easy to parse XML files with Kent and
> Danny's help.
>
> I realized that all my XML files have '\t' and '\n'
> and whitespace. These extra characters make it very
> difficult to extract the text data from the XML files.
> I can make the XML parser work when I
> remove '\n' and '\t' from the XML files.
>
> Is there a way to get rid of '\n' and '\t' characters
> from XML files easily?

Did you see how I did this in my original example? I called strip() on the text part of the element. This removes leading and trailing whitespace. Is that what you need?

Kent
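A small sketch of the strip() approach, plus one way to also collapse tabs and newlines that sit inside the text, assuming a current Python with the standard library's xml.etree.ElementTree. The element and the helper name clean_text are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical element with tabs and newlines around and inside the text.
element = ET.fromstring("<title>\n\t  Energy\n\t  pathways\n</title>")

# strip() trims whitespace at both ends only.
stripped = element.text.strip()

# split() + join() also collapses whitespace runs inside the text.
collapsed = " ".join(element.text.split())


def clean_text(elem):
    """Return elem's text with whitespace collapsed; '' if it has no text."""
    return " ".join((elem.text or "").split())


print(repr(stripped))
print(repr(collapsed))
print(repr(clean_text(ET.fromstring("<empty/>"))))
```

Cleaning the text after parsing avoids editing the XML files themselves, which matters when there are thousands of them.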
Re: [Tutor] how to extract text by specifying an element using ElementTree
Hi group,

I just have another question about parsing XML files. I found it very easy to parse XML files with Kent and Danny's help.

I realized that all my XML files have '\t' and '\n' and whitespace. These extra characters make it very difficult to extract the text data from the XML files. I can make the XML parser work when I remove '\n' and '\t' from the XML files.

Is there a way to get rid of '\n' and '\t' characters from XML files easily?

Thank you very much.
MDan
Re: [Tutor] how to extract text by specifying an element using ElementTree
ps python wrote:
> Kent and Dany,
> Thanks for your replies.
>
> Here fromstring() assuming that the input is in a kind
> of text format.

Right, that is for the sake of a simple example.

> what should be the case when I am reading files
> directly.
>
> I am using the following :
>
> from elementtree.ElementTree import ElementTree
> mydata = ElementTree(file='1.xml')
> iter = root.getiterator()
>
> Here the whole XML document is loaded as element tree
> and how should this iter into a format where I can
> apply findall() method.

Call findall() directly on mydata, e.g.

for process in mydata.findall('//biological_process'):
    print process.text

The path //biological_process means find any biological_process element at any depth from the root element.

Kent
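The same idea, reading from a file, sketched for a current Python with the standard library's xml.etree.ElementTree (an assumption; the thread uses the standalone elementtree package). The file contents are a hypothetical stand-in written to a temporary file so the example runs on its own. The path is written as './/biological_process', which works on both trees and elements; as far as I know, the standard-library version only accepts the leading '//' form on a tree, and then with a warning.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical stand-in for one of the XML files; the structure is assumed.
SAMPLE = """<entry>
  <biological_processes>
    <biological_process><title>Metabolism</title></biological_process>
    <biological_process><title>Energy pathways</title></biological_process>
  </biological_processes>
</entry>
"""

# Write the sample to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as handle:
    handle.write(SAMPLE)
    path = handle.name

try:
    tree = ET.parse(path)  # ElementTree object for the whole file
    # './/tag' searches every level below the root element.
    titles = [
        process.findtext("title")
        for process in tree.findall(".//biological_process")
    ]
finally:
    os.remove(path)

print(titles)
```

There is no need for getiterator() here: the tree object returned by parse() has findall() itself.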
Re: [Tutor] how to extract text by specifying an element using ElementTree
Kent and Dany,

Thanks for your replies.

Here fromstring() assumes that the input is in a kind of text format. What should I do when I am reading files directly?

I am using the following:

from elementtree.ElementTree import ElementTree
mydata = ElementTree(file='1.xml')
iter = root.getiterator()

Here the whole XML document is loaded as an element tree. How should I turn this iter into a format where I can apply the findall() method?

thanks
mdan
Re: [Tutor] how to extract text by specifying an element using ElementTree
ps python wrote:
> Hi,
>
> using ElementTree, how can I extract text of a
> particular element, or a child node.
>
> For example:
>
> Signal transduction
> Energy process
>
> In the case where I already know which element tags
> have the information that I need, in such case how do
> i get that specific text.

Use find() to get the nodes of interest. The text attribute of the node contains the text. For example:

data = ''' Signal transduction Energy process '''

from elementtree import ElementTree

tree = ElementTree.fromstring(data)

for process in tree.findall('biological_process'):
    print process.text.strip()

prints
Signal transduction
Energy process

You will have to modify the path in the findall to match your actual data, assuming what you have shown is just a snippet.

I stripped whitespace from the text because otherwise it includes the newlines and indents exactly as in the original.

Kent
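A runnable sketch of the same approach for a current Python, using the standard library's xml.etree.ElementTree (an assumption; the thread uses the standalone elementtree package). The markup is a guess at what such a snippet might look like: the biological_process tag comes from the findall() call above, while the wrapper tag biological_processes and the indentation are assumed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the wrapper tag name and layout are assumptions.
DATA = """
<biological_processes>
    <biological_process>
        Signal transduction
    </biological_process>
    <biological_process>
        Energy process
    </biological_process>
</biological_processes>
"""

root = ET.fromstring(DATA)

# .text keeps the surrounding newlines and indentation...
raw = [p.text for p in root.findall("biological_process")]
# ...so strip() is applied to get just the words.
clean = [p.text.strip() for p in root.findall("biological_process")]

# find() returns only the first match, or None when nothing matches.
first = root.find("biological_process")
missing = root.find("no_such_tag")

print(clean)
print(repr(raw[0]))
```

Checking the result of find() against None before touching .text avoids an AttributeError when a file lacks the expected element.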
Re: [Tutor] how to extract text by specifying an element using ElementTree
> For example:
>
> Signal transduction
> Energy process
>
> I looked at some tutorials (eg. Ogbuji). Those
> examples describe how to extract all the text of nodes and
> child nodes.

Hi Mdan,

The following might help:

    http://article.gmane.org/gmane.comp.python.tutor/24986
    http://mail.python.org/pipermail/tutor/2005-December/043817.html

The second post shows how we can use the findtext() method from an ElementTree.

Here's another example that demonstrates how we can treat elements as sequences of their subelements:

##
from elementtree import ElementTree
from StringIO import StringIO

text = """ skywalker luke valentine faye reynolds mal """

people = ElementTree.fromstring(text)
for person in people:
    print "here's a person:",
    print person.findtext("firstName"), person.findtext('lastName')
##

Does this make sense? The API allows us to treat an element as a sequence that we can march across, and the loop above marches across every person subelement in people.

Another way we could have written the loop above would be:

###
>>> for person in people.findall('person'):
...     print person.find('firstName').text,
...     print person.find('lastName').text
...
luke skywalker
faye valentine
mal reynolds
###

Or we might go a little funkier, and just get the first names anywhere in people:

###
>>> for firstName in people.findall('.//firstName'):
...     print firstName.text
...
luke
faye
mal
###

where the subelement "tag" that we're giving findall is really an XPath query. ".//firstName" is a query in XPath format that says "Give me all the firstName elements anywhere within the current element."

The documentation in:

    http://effbot.org/zone/element.htm#searching-for-subelements

should also be helpful. If you have more questions, please feel free to ask.
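For readers on a current Python, the same three idioms can be sketched with the standard library's xml.etree.ElementTree (an assumption; the message above uses the standalone elementtree package). The sample markup is illustrative: the person, firstName and lastName tags come from the code above, while the people wrapper tag is assumed.

```python
import xml.etree.ElementTree as ET

# Illustrative data; the <people> wrapper tag is an assumption.
TEXT = """
<people>
  <person><lastName>skywalker</lastName><firstName>luke</firstName></person>
  <person><lastName>valentine</lastName><firstName>faye</firstName></person>
  <person><lastName>reynolds</lastName><firstName>mal</firstName></person>
</people>
"""

people = ET.fromstring(TEXT)

# 1. An element iterates over its direct children.
full_names = [
    (person.findtext("firstName"), person.findtext("lastName"))
    for person in people
]

# 2. findall() with a plain tag name matches direct children only.
last_names = [p.find("lastName").text for p in people.findall("person")]

# 3. './/tag' matches at any depth below the current element.
first_names = [e.text for e in people.findall(".//firstName")]

print(full_names)
print(last_names)
print(first_names)
```

findtext() is the most forgiving of the three when a child may be absent, since it returns None instead of raising an error.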