[R] R: Is there a way to extract some fields data from HTML pages through any R function ?

mauede Sun, 05 Jul 2009 12:15:35 -0700

I tried to apply the scheme you suggested to open the web page at
"http://mirecords.umn.edu/miRecords/index.php" and got the following:


> result <- postForm("http://mirecords.umn.edu/miRecords/index.php",
+ searchType="miRNA", species="Homo sapiens",
+  searchBox="hsa-let-7a", submitButton="Search")
>  html <- htmlTreeParse(result, asText=TRUE, useInternalNodes=TRUE)
Unexpected end tag : a
error parsing attribute name
Opening and ending tag mismatch: strong and font
htmlParseStartTag: invalid element name
Unexpected end tag : a
>  html <- htmlTreeParse(result, asText=FALSE, useInternalNodes=TRUE)
Error in htmlTreeParse(result, asText = FALSE, useInternalNodes = TRUE) : 
  File <html><!-- InstanceBegin template="/Templates/admin.dwt" 
codeOutsideHTMLIsLocked="false" -->

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

<link href="style/link.css" rel="stylesheet" type="text/css">

<!-- InstanceParam name="nav_1" type="boolean" value="true" -->

<title>miRecords</title>

</head>

<body bgcolor="#FFFFFF" leftmargin="0" topmargin="0"  marginwidth="0" 
marginheight="0">







 

<table width="80" border="0" cellspacing="0" cellpadding="0">

  <tr> 

    <td colspan="3"><img src="images/title.jpg" alt="" width=900 height=79 
border="0"></a></td>

  </tr>

  <tr> 

    <td width="131" valign="bottom" bgcolor="#CCCCCC"menu""></td>

    <td width="769" align="right" valign="middle" bgcolor="#CCCCCC"><a 
href="redirect.php?s=l" class="menu">Validated Targets </a>  | <a 
href="redirect.php?s=p" class="menu">Predicted Targets </a> | <a 
href="download.php"  class="menu">Download Validated Targets </a> | <a 
href="submit.php" class="m
> 



I am lost about how to proceed from the above.
My goal is still to get each VALIDATED miRNA identifier and string, followed by
its target gene's 3' UTR sequence.

Thank you in advance,
Maura 

P.S. BioMart has been working fine since yesterday.

-----Original Message-----
From: Martin Morgan [mailto:mtmor...@fhcrc.org]
Sent: Wed 01/07/2009 17.51
To: mau...@alice.it
Cc: r-h...@stat.math.ethz.ch
Subject: Re: [R] Is there a way to extract some fields data from HTML pages 
through any R function ?
 
Hi Maura --

mau...@alice.it wrote:
> I deal with a huge amount of biological data stored in different databases.
> The databases belonging to the Bioconductor project can be accessed through 
> Bioconductor packages.
> Unfortunately, some useful data is stored in databases such as miRDB and 
> miRecords, which offer only an
> interactive HTML interface. See for instance
>  http://mirdb.org/cgi-bin/search.cgi, 
>  
> http://mirecords.umn.edu/miRecords/interactions.php?species=Homo+sapiens&mirna_acc=Any&targetgene_type=refseq_acc&targetgene_info=&v=yes&search_int=Search
> 
> Downloading data manually from the web pages is a painstaking, time-consuming 
> and error-prone activity.
> I came across a Python script that downloads (dumps) whole web pages into a 
> text file that is then parsed.
> This is possible because Python has a library to access web pages.
> But I have no experience with Python programming, nor do I like a 
> programming language whose syntax is indentation-sensitive.
> 
> I am *hoping* that there exists some sort of web page or HTML connection from 
> R. Is there?

Tools in R for this are the RCurl package and the XML package.

  library(RCurl)
  library(XML)

Typically this involves manual exploration of the web form. Then you
might query the web form

  result <- postForm("http://mirdb.org/cgi-bin/search.cgi",
                     searchType="miRNA", species="Human",
                     searchBox="hsa-let-7a", submitButton="Go")

and parse the results into a convenient structure

  html <- htmlTreeParse(result, asText=TRUE, useInternalNodes=TRUE)
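
A note on asText, as a minimal sketch under assumptions: postForm typically
returns the page as a character string, so htmlTreeParse needs asText=TRUE.
With asText=FALSE the string is taken as a file name, which is what gives the
"File <html>..." error shown at the top of this thread. Parser messages such
as "Unexpected end tag : a" come from libxml2 recovering from malformed
markup, and a usable tree is normally still returned. The inline page below is
a made-up stand-in, not real miRecords output.

```r
library(XML)

# Hypothetical, deliberately malformed page (stray </a>), not real site output.
page <- "<html><body><table><tr><td>hsa-let-7a</a></td></tr></table></body></html>"

# asText = TRUE: the string itself is parsed as HTML.
doc <- htmlTreeParse(page, asText = TRUE, useInternalNodes = TRUE)
cell <- xpathSApply(doc, "//td", xmlValue)

# asText = FALSE: the string is looked up as a file name, so this call fails.
failed <- tryCatch(
  htmlTreeParse(page, asText = FALSE, useInternalNodes = TRUE),
  error = function(e) e)

print(cell)
print(inherits(failed, "error"))
```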

You can then use XPath (http://www.w3.org/TR/xpath, especially section
2.5) to explore and extract information, e.g.,

  ## second table, first row
  getNodeSet(html, "//table[2]/tr[1]")
  ## second table, makes subsequent paths shorter
  tbl <- getNodeSet(html, "//table[2]")[[1]]
  xget <- function(xml, path) # a helper function
      unlist(xpathApply(xml, path, xmlValue))[-1]
  df <- data.frame(TargetRank=as.numeric(xget(tbl, "./tr/td[2]")),
                   TargetScore=as.numeric(xget(tbl, "./tr/td[3]")),
                   miRNAName=xget(tbl, "./tr/td[4]"),
                   GeneSymbol=xget(tbl, "./tr/td[5]"),
                   GeneDescription=xget(tbl, "./tr/td[6]"))
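
The same steps can be tried offline on a small inline table. This is only a
sketch: the table contents and the column names Rank, Score and Gene are
hypothetical, and the real miRDB markup may be laid out differently. It reuses
the xget helper from above and shows that the trailing [-1] drops the value
taken from the first (header) row.

```r
library(XML)

# Hypothetical stand-in for a results page; the real markup may differ.
page <- paste0(
  "<html><body><table>",
  "<tr><td>Rank</td><td>Score</td><td>Gene</td></tr>",
  "<tr><td>1</td><td>100</td><td>GENE_A</td></tr>",
  "<tr><td>2</td><td>98</td><td>GENE_B</td></tr>",
  "</table></body></html>")

doc <- htmlTreeParse(page, asText = TRUE, useInternalNodes = TRUE)
tbl <- getNodeSet(doc, "//table[1]")[[1]]

# Same helper as in the message: one value per row, minus the header row.
xget <- function(xml, path)
  unlist(xpathApply(xml, path, xmlValue))[-1]

df <- data.frame(Rank  = as.numeric(xget(tbl, "./tr/td[1]")),
                 Score = as.numeric(xget(tbl, "./tr/td[2]")),
                 Gene  = xget(tbl, "./tr/td[3]"),
                 stringsAsFactors = FALSE)
print(df)
```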

There are many ways through this latter part, probably some much cleaner
than presented above. There are fairly extensive examples on each of the
relevant help pages, e.g., ?postForm.
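
One possibly cleaner route for the table step is readHTMLTable from the XML
package, assuming the installed version provides it. The sketch below uses a
hypothetical inline table whose header cells are <th> elements; whether the
real result pages are regular enough for this is an assumption that would need
checking against the actual HTML.

```r
library(XML)

# Hypothetical results table; header cells use <th>, data cells use <td>.
page <- paste0(
  "<html><body><table>",
  "<tr><th>Rank</th><th>Score</th><th>Gene</th></tr>",
  "<tr><td>1</td><td>100</td><td>GENE_A</td></tr>",
  "<tr><td>2</td><td>98</td><td>GENE_B</td></tr>",
  "</table></body></html>")

doc <- htmlTreeParse(page, asText = TRUE, useInternalNodes = TRUE)

# which = 1 selects the first <table>; columns come back as character here.
tab <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)
tab$Score <- as.numeric(tab$Score)
print(tab)
```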

Martin


> Thank you very much for any suggestion.
> Maura
> 
> 
> 
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.




