RE: Indexing HTML

Digy Thu, 03 Apr 2008 09:33:39 -0700

For example:

        // Requires: using System.IO; using System.Windows.Forms;
        // Set by wb_DocumentCompleted once the document has finished loading.
        bool read = false;

        void GetMetaData()
        {
            WebBrowser wb = new WebBrowser();
            wb.DocumentCompleted +=
                new WebBrowserDocumentCompletedEventHandler(wb_DocumentCompleted);
            wb.ScriptErrorsSuppressed = true;

            read = false;
            using (StreamReader rdr = new StreamReader(@"D:\Downloads\explorer_tips.htm"))
            {
                wb.DocumentText = rdr.ReadToEnd();
            }

            // Pump the message loop until DocumentCompleted fires.
            while (read == false) Application.DoEvents();

            HtmlElementCollection coll = wb.Document.GetElementsByTagName("meta"); // or other tags
            // HtmlElementCollection coll = wb.Document.All; // or all elements
            foreach (HtmlElement elem in coll)
            {
                richTextBox1.AppendText(elem.TagName + ":");
                richTextBox1.AppendText(elem.GetAttribute("content") + "\n"); // or elem.InnerText
            }
        }

        void wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            read = true;
        }

PS: Loading web content into the WebBrowser control is very slow.
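
If that load time is a problem, one lighter-weight option is to skip the WebBrowser control and pull the meta tags out of the raw text with regular expressions. The following is only a rough sketch of that alternative, not the approach shown above: the class name MetaTagSketch and the method Extract are hypothetical, and it assumes quoted attribute values and no ">" inside them, so it will miss tags that a real HTML parser would handle.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Lucene.Net or Seekafile).
static class MetaTagSketch
{
    // Assumption: a meta tag contains no '>' inside its attribute values.
    static readonly Regex MetaTag =
        new Regex(@"<meta\b[^>]*>", RegexOptions.IgnoreCase);

    // Assumption: attribute values are wrapped in double or single quotes.
    static readonly Regex Attr = new Regex(
        @"(?<name>[\w-]+)\s*=\s*(?:""(?<val>[^""]*)""|'(?<val>[^']*)')");

    // Returns the content of each <meta name="..." content="..."> tag,
    // keyed by the (case-insensitive) value of its name attribute.
    public static Dictionary<string, string> Extract(string html)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match tag in MetaTag.Matches(html))
        {
            string name = null;
            string content = null;
            foreach (Match a in Attr.Matches(tag.Value))
            {
                string key = a.Groups["name"].Value.ToLowerInvariant();
                if (key == "name") name = a.Groups["val"].Value;
                else if (key == "content") content = a.Groups["val"].Value;
            }
            // Tags without both attributes (e.g. http-equiv only) are skipped.
            if (name != null && content != null) result[name] = content;
        }
        return result;
    }
}
```

Calling MetaTagSketch.Extract(File.ReadAllText(path)) gives a dictionary whose values could then be added to the index as separate fields alongside the body text.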

DIGY


-----Original Message-----
From: Jamie Vickers [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 03, 2008 5:19 PM
To: lucene-net-user@incubator.apache.org
Subject: Indexing HTML

Dear All,

 

Sorry about the re-post of this question, but so far no-one has been
able to help me, and I am completely stuck!

 

I am using Lucene in conjunction with a slightly adapted version of the
Seekafile server from CodeProject. All of my indexing and searching is
fine, except for HTML documents, where only the content of the <BODY>
tag is being indexed. Debugging back, the parser returned by the server
is always a DefaultParser, and when this runs, it calls the MS
query.dll. The resulting text output is always stripped of everything
except the content between the body tags, which is strange.

 

From reading around, it seems that HTML files are better indexed using
nlhtml.dll, but I have no idea how to get Lucene to use it. I have
tried it as a plugin, which fails (not unexpectedly), and I have tried
using it as an alternative parser (via DllImport), but it does not even
attempt to load it.

 

Has anyone successfully retrieved HTML content other than the content of
the body? Or can anyone think why I am unable to do so?

 

Thanks in advance,

 

James.
