For example:

bool read = false; // set by wb_DocumentCompleted once the document has loaded

void GetMetaData()
{
    WebBrowser wb = new WebBrowser();
    wb.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(wb_DocumentCompleted);
    wb.ScriptErrorsSuppressed = true;

    read = false;
    using (StreamReader rdr = new StreamReader(@"D:\Downloads\explorer_tips.htm"))
    {
        wb.DocumentText = rdr.ReadToEnd();
    }

    // Pump the message loop until the document has finished loading
    while (read == false)
        Application.DoEvents();

    HtmlElementCollection coll = wb.Document.GetElementsByTagName("meta"); // or other tags
    //HtmlElementCollection coll = wb.Document.All; // or all elements
    foreach (HtmlElement elem in coll)
    {
        richTextBox1.AppendText(elem.TagName + ":");
        richTextBox1.AppendText(elem.GetAttribute("content") + "\n"); // or elem.InnerText
    }
}

void wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    read = true;
}

PS: Loading web content into the WebBrowser control is very slow.

DIGY

-----Original Message-----
From: Jamie Vickers [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 03, 2008 5:19 PM
To: lucene-net-user@incubator.apache.org
Subject: Indexing HTML

Dear All,

Sorry about the re-post of this question, but so far no one has been able to help me, and I am completely stuck!

I am using Lucene in conjunction with a slightly adapted version of the Seekafile server from Codeproject. All of my indexing and searching is fine, except for HTML documents, where only the content of the <BODY> tag is being indexed.

Debugging back, the results returned from the server are always of a DefaultParser, and when this runs, it calls the MS query.dll. The subsequent text output is always stripped of everything except the content between the body tags, which is strange.

From reading around, it seems that HTML files are better indexed using nlhtml.dll, but I have no idea how to get Lucene to use this. I have tried it as a plugin, and it fails (not unexpectedly), and I have tried using it as an alternative parser (via DLLImport), but it does not even attempt to load it.

Has anyone successfully retrieved HTML content other than the content of the body? Or can anyone think why I am unable to do so?

Thanks in advance,
James.