How to parse html wild?

Garry_Galler Wed, 05 Oct 2016 17:00:03 +0200

I can not parse HTML wild.**findall** doesn't return all elements from the div 
element with the class _documents_
    
    
    import httpclient
    import htmlparser
    import tables
    import strutils
    import streams
    import xmltree
    import strtabs
    
    var main_page = "http://old.minjust.gov.ua/19612";
    var headers_dict = {
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, 
like Gecko) Chrome/53.0.2785.101 Safari/537.36 OPR/40.0.2308.62",
        "Accept": 
"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "DNT": "1",
        "Referer": main_page,
        "Accept-Encoding": "gzip, deflate, lzma, sdch",
        "Accept-Language": "ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4"
    }.toTable
    
    var headers = ""
    for k,v in headers_dict:
        headers&= k & ":" & v & "\c\L"
    
    var resp   = httpclient.get(main_page,headers)
    var stream = newStringStream(resp.body)
    var html   = htmlparser.parseHtml(stream)
    
    
    var cnt=0
    for elem in html.findall("div"):
      if elem.attr("class") == "document":
        for a in elem.findall("a"):
          cnt+=1
          echo cnt,"|", a.attrs["href"], "|", a.innerText
        break


It should be 809 instead of 327. Here on this element ends up:

<li><a href="/file/1495">Крівова проти України - <b>19.12.2012</b></li></a>

He was wrong. But what about me? PS: Nim version 14.

How to parse html wild?

Reply via email to