I tried to apply the scheme you suggested to the web page at "http://mirecords.umn.edu/miRecords/index.php" and got the following:

> result <- postForm("http://mirecords.umn.edu/miRecords/index.php",
+     searchType="miRNA", species="Homo sapiens",
+     searchBox="hsa-let-7a", submitButton="Search")
> html <- htmlTreeParse(result, asText=TRUE, useInternalNodes=TRUE)
Unexpected end tag : a
error parsing attribute name
Opening and ending tag mismatch: strong and font
htmlParseStartTag: invalid element name
Unexpected end tag : a
> html <- htmlTreeParse(result, asText=FALSE, useInternalNodes=TRUE)
Error in htmlTreeParse(result, asText = FALSE, useInternalNodes = TRUE) :
  File <html><!-- InstanceBegin template="/Templates/admin.dwt" codeOutsideHTMLIsLocked="false" -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link href="style/link.css" rel="stylesheet" type="text/css">
<!-- InstanceParam name="nav_1" type="boolean" value="true" -->
<title>miRecords</title>
</head>
<body bgcolor="#FFFFFF" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0">
<table width="80" border="0" cellspacing="0" cellpadding="0">
<tr>
<td colspan="3"><img src="images/title.jpg" alt="" width=900 height=79 border="0"></a></td>
</tr>
<tr>
<td width="131" valign="bottom" bgcolor="#CCCCCC"menu""></td>
<td width="769" align="right" valign="middle" bgcolor="#CCCCCC"><a href="redirect.php?s=l" class="menu">Validated Targets </a> | <a href="redirect.php?s=p" class="menu">Predicted Targets </a> | <a href="download.php" class="menu">Download Validated Targets </a> | <a href="submit.php" class="m
>

I am lost about how to proceed from the above. My goal is still to get the VALIDATED miRNA identifier and string, followed by its target gene's 3'UTR sequence.

Thank you in advance,
Maura

P.S. BioMart has been working fine since yesterday.

-----Original Message-----
From: Martin Morgan [mailto:mtmor...@fhcrc.org]
Sent: Wed 01/07/2009 17.51
To: mau...@alice.it
Cc: r-h...@stat.math.ethz.ch
Subject: Re: [R] Is there a way to extract some fields data from HTML pages through any R function ?

Hi Maura --

mau...@alice.it wrote:
> I deal with a huge amount of biology data stored in different databases.
> The databases belonging to the Bioconductor organization can be accessed
> through Bioconductor packages.
> Unluckily, some useful data is stored in databases such as miRDB,
> miRecords, etc., which offer just an interactive HTML interface.
> See for instance
> http://mirdb.org/cgi-bin/search.cgi,
> http://mirecords.umn.edu/miRecords/interactions.php?species=Homo+sapiens&mirna_acc=Any&targetgene_type=refseq_acc&targetgene_info=&v=yes&search_int=Search
>
> Downloading data manually from the web pages is a painstaking,
> time-consuming and error-prone activity.
> I came across a Python script that downloads (dumps) whole web pages into a
> text file that is then parsed.
> This is possible because Python has a library to access web pages.
> But I have no experience with Python programming, nor do I like a
> programming language whose syntax is indentation-sensitive.
>
> I am *hoping* that there exists some sort of web page / HTML connection from
> R ... is there ??

Tools in R for this are the RCurl package and the XML package.

  library(RCurl)
  library(XML)

Typically this involves manual exploration of the web form. Then you might
query the web form

  result <- postForm("http://mirdb.org/cgi-bin/search.cgi",
      searchType="miRNA", species="Human",
      searchBox="hsa-let-7a", submitButton="Go")

and parse the results into a convenient structure

  html <- htmlTreeParse(result, asText=TRUE, useInternalNodes=TRUE)

You can then use XPath (http://www.w3.org/TR/xpath, especially section 2.5)
to explore and extract information, e.g.,

  ## second table, first row
  getNodeSet(html, "//table[2]/tr[1]")

  ## second table, makes subsequent paths shorter
  tbl <- getNodeSet(html, "//table[2]")[[1]]

  ## a helper function: text of all nodes matching 'path';
  ## [-1] drops the first match (the header row)
  xget <- function(xml, path)
      unlist(xpathApply(xml, path, xmlValue))[-1]

  df <- data.frame(TargetRank=as.numeric(xget(tbl, "./tr/td[2]")),
                   TargetScore=as.numeric(xget(tbl, "./tr/td[3]")),
                   miRNAName=xget(tbl, "./tr/td[4]"),
                   GeneSymbol=xget(tbl, "./tr/td[5]"),
                   GeneDescription=xget(tbl, "./tr/td[6]"))

There are many ways through this latter part, probably some much cleaner
than presented above. There are fairly extensive examples on each of the
relevant help pages, e.g., ?postForm.

Martin

> Thank you very much for any suggestion.
> Maura

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
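
For reference, here is a minimal offline sketch of the parse-and-extract step described above, written against a small hand-made page rather than the real miRecords or miRDB output. The table layout, the column names, and the gene names (GENE1, GENE2) are hypothetical placeholders. It also illustrates the asText point: with asText=TRUE the first argument is treated as HTML content, whereas with asText=FALSE it is taken as a file name or URL, which would be consistent with the "Error ... File <html>..." message in the session quoted earlier.

```r
library(XML)

## Hypothetical stand-in for a search-result page (NOT real database output).
page <- '<html><body>
<table><tr><td>navigation</td></tr></table>
<table>
<tr><td>Rank</td><td>Score</td><td>miRNA</td><td>Gene</td></tr>
<tr><td>1</td><td>99</td><td>hsa-let-7a</td><td>GENE1</td></tr>
<tr><td>2</td><td>95</td><td>hsa-let-7a</td><td>GENE2</td></tr>
</table>
</body></html>'

## asText=TRUE: 'page' holds the HTML itself, not a file name or URL
html <- htmlTreeParse(page, asText=TRUE, useInternalNodes=TRUE)

## second table under its parent; later paths are relative to it
tbl <- getNodeSet(html, "//table[2]")[[1]]

## text of every node matching 'path'; [-1] drops the header row
xget <- function(node, path)
    unlist(xpathApply(node, path, xmlValue))[-1]

df <- data.frame(Rank=as.numeric(xget(tbl, "./tr/td[1]")),
                 Score=as.numeric(xget(tbl, "./tr/td[2]")),
                 miRNA=xget(tbl, "./tr/td[3]"),
                 Gene=xget(tbl, "./tr/td[4]"),
                 stringsAsFactors=FALSE)
print(df)
```

The messages printed under asText=TRUE in the quoted session look like parser complaints about malformed markup rather than a failed parse, so the resulting html object may well be usable with getNodeSet() as it is. The XPath expressions would need adjusting to whatever tables the miRecords page actually contains.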