Hello Shad,

It was just a stupid idea, based on what I saw in EnwikiContentSource.Parser 
and medication-induced drowsiness. In addition, I didn't have my second cup of 
coffee.

I noticed all these HTML documents are really XML documents. In my simple mind, 
I can't see why we should drag thousands of lines of code just to read an XML 
document sequentially. There's already a type for that in .NET: XmlReader. 
(lives in System.Xml)

So I would stop EnwikiContentSource.Parser to inherit from DefaultHandler, yank 
out the Sax and Tagsoup stuff and replace the Run method by this:

        public void Run()
        {
                try
                {
                        while (!stopped)
                        {
                                Stream localFileIS = outerInstance.@is;
                                if (localFileIS != null)
                                { // null means fileIS was closed on us 
                                        try
                                        {
                                                using (var reader = 
XmlReader.Create(localFileIS))
                                                        while (!stopped && 
reader.Read())
                                                        {
                                                                
switch(reader.NodeType)
                                                                {
                                                                case 
XmlNodeType.Element:
                                                                        {
                                                                                
var namespaceURI = reader.NamespaceURI;
                                                                                
var localName = reader.LocalName;
                                                                                
var name = reader.Name;
                                                                                
var attributes = new Dictionary<string, string>();
                                                                                
if(reader.MoveToFirstAttribute())
                                                                                
        do
                                                                                
                attributes.Add(reader.Name, reader.Value);
                                                                                
        while (reader.MoveToNextAttribute());
                                                                                
StartElement(namespaceURI, localName, name, attributes);
                                                                        }
                                                                        break;
                                                                case 
XmlNodeType.EndElement:
                                                                        
EndElement(reader.NamespaceURI, reader.LocalName, reader.Name);
                                                                        break;
                                                                case 
XmlNodeType.SignificantWhitespace:
                                                                case 
XmlNodeType.Text:
                                                                        
Characters(reader.Value);
                                                                        break;
                                                                // everything 
else is irrelevant in this context
                                                                }
                                                        }
                                        }
                                        catch (IOException ioe)
                                        {
                                                lock (outerInstance)
                                                {
                                                        if (localFileIS != 
outerInstance.@is)
                                                        {
                                                                // fileIS was 
closed on us, so, just fall through
                                                        }
                                                        else
                                                                // Exception is 
real
                                                                throw ioe;
                                                }
                                        }
                                }
                                lock (this)
                                {
                                        if (stopped || !outerInstance.m_forever)
                                        {
                                                nmde = new 
NoMoreDataException();
                                                Monitor.Pulse(this); //notify();
                                                return;
                                        }
                                        else if (localFileIS == 
outerInstance.@is)
                                        {
                                                // If file is not already 
re-opened then re-open it now
                                                outerInstance.@is = 
outerInstance.OpenInputStream();
                                        }
                                }
                        }
                }
                catch (IOException ioe)
                {
                        throw new Exception(ioe.ToString(), ioe);
                }
                finally
                {
                        lock (this)
                        {
                                threadDone = true;
                                Monitor.Pulse(this); //Notify();
                        }
                }
        }

... and of course the signatures of the (previously overridden) methods change 
in a trivial way:

            public void Characters(string value)
            {
                contents.Append(value);
            }

            public void EndElement(string @namespace, string simple, string 
qualified);

            public void StartElement(string @namespace, string simple, string 
qualified,
                                     IDictionary<string,string> attributes)

As an added bonus, the last test (TestForever()) runs without error.

Of course, that only works for html documents that are well-formed XML 
documents. It will fail miserably in tests involved with DemoHTMLParser, since 
some of these tests are incorrect XML (unclosed/unbalanced tags, mixed 
upper/lowercase names, and the like) even though they are valid HTML (albeit 
quirky).
Normally, the only problem you have with XHTML and XmlReader is with entities, 
but this is easily solved by creating the instance in a slightly different way:

        XmlNameTable nt = new NameTable();
        XmlNamespaceManager nsmgr = new XmlNamespaceManager(nt);
        XmlParserContext context = new XmlParserContext(null, nsmgr, null, 
XmlSpace.None)
        {
                DocTypeName = "html",
                PublicId = "-//W3C//DTD XHTML 1.0 Strict//EN",
                SystemId = "xhtml1-strict.dtd",
        };
        XmlParserContext xhtmlContext = context;

        using (var reader = XmlReader.Create(source, new XmlReaderSettings { 
DtdProcessing = DtdProcessing.Parse, ValidationType=ValidationType.DTD, 
XmlResolver=new XmlPreloadedResolver(XmlKnownDtds.All) }, xhtmlContext))

Sadly, this won't work for HTML documents which already contain a DTD 
reference. And it doesn't parse HTML document that are not well-formed XML. 

I've used HTML agility pack in the past with great success. I've noticed 
https://github.com/MindTouch/SGMLReader as something that closely resembles 
XmlReader, but for HTML. The project hasn't been maintained for a while, and I 
haven't used it.

I'll have my second cup of coffee now.

Vincent





-----Original Message-----
From: Shad Storhaug [mailto:[email protected]] 
Sent: Tuesday, August 01, 2017 7:46 AM
To: Van Den Berghe, Vincent <[email protected]>
Cc: [email protected]
Subject: RE: Benchmark Concurrency Bug

Vincent,

I am curious to know what you meant by "something more lightweight". 

After taking a deeper dive into HTML Agility Pack, it is DOM based (it can be 
read from a stream or TextWriter into the DOM, though). Although one might 
argue that HTML documents are never going to be very large anyway, it feels 
inherently wrong to me to put together a solution that you know in advance 
isn't going to scale. In this instance one could even argue that we don't have 
true apples-to-apples performance comparison with Lucene because the loading of 
the document takes place before the parsing begins (which is the only place 
that is concurrent).

Do you know of an alternative stream-based HTML parsing solution than TagSoup?

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Shad Storhaug [mailto:[email protected]] 
Sent: Monday, July 31, 2017 9:36 PM
To: Van Den Berghe, Vincent
Cc: [email protected]
Subject: RE: Benchmark Concurrency Bug

Thanks!

I knew it had to be something simple. It looks like that error is happening 
because of a race condition. Oh well, it probably isn't worth the effort 
considering the intended audience of the tool.

I hope you are feeling better soon.

From: Van Den Berghe, Vincent [mailto:[email protected]]
Sent: Monday, July 31, 2017 8:07 PM
To: Shad Storhaug
Cc: [email protected]
Subject: RE: Benchmark Concurrency Bug

Hello Shad,

There are 2 causes for the tests TestOneDocument/TestTwoDocuments never to 
terminate:

Cause 1: the Parser.Run() method is never called. In the Java code, this type 
implements IRunnable, but here it doesn't. The thread is supposed to be started 
at the first call to Parser.Next() but does absolutely nothing:

                if (t == null)
                {
                    threadDone = false;
                    t = new ThreadClass(/*this*/);
                    t.SetDaemon(true);
                    t.Start();
                }

The minimal solution is to define a new class:

              private class MyThreadClass: ThreadClass
              {
                     private readonly Action m_Run;

                     public MyThreadClass(Action run)
                     {
                           m_Run = run;
                     }

                     public override void Run()
                     {
                           m_Run();
                     }
              }


And change  the above code to:

                if (t == null)
                {
                    threadDone = false;
                     t = new MyThreadClass(Run);
                    t.SetDaemon(true);
                    t.Start();
                }

This will cause progress, but the tests will still fail. The reason is that the 
code to create the XmlReader:

                    Sax.Net.IXmlReader reader = 
XmlReaderFactory.Current.CreateXmlReader(); 
//XMLReaderFactory.createXMLReader();


... fails becasuse XmlReaderFactory.Current expects the reader type to be 
loaded from configuration files. Alas, something happens on its way to the 
forum and you get a "null reference exception" preceded by a "thread abort 
exception, causing the tests to fail because the reader is never created.

I had half a mind to replace the Sax parser (which is an idiom that is not 
implemented in .NET) by something more lightweight, but since I'm feeling a bit 
under the weather, I just changed the line to:

Sax.Net.IXmlReader reader = new 
TagSoup.Net.XmlReaderFactory().CreateXmlReader();

...and be done with it. And on my machine, the tests pass now. I hope they do 
too on your special machine <g>


The test TestForever() works as well, but ends with an exception (which is 
swallowed):

System.ObjectDisposedException: Cannot access a closed Stream.
   at System.IO.__Error.StreamIsClosed()
   at System.IO.MemoryStream.Read(Byte[] buffer, Int32 offset, Int32 count)
   at System.IO.StreamReader.ReadBuffer()
   at System.IO.StreamReader.Read()
   at TagSoup.Net.HTMLScanner.Scan(TextReader r, IScanHandler h)

The reason is that the parse call:

      reader.Parse(new InputSource(IOUtils.GetDecodingReader(localFileIS, 
Encoding.UTF8)));


... seems to want the StreamReader (and by default, the memory stream), after 
the source.Dispose() is called, Since the test passes, I'll pretend the problem 
doesn't exist.

Vincent


From: Shad Storhaug [mailto:[email protected]]
Sent: Monday, July 31, 2017 10:34 AM
To: Van Den Berghe, Vincent 
<[email protected]<mailto:[email protected]>>
Cc: [email protected]<mailto:[email protected]>
Subject: Benchmark Concurrency Bug

Vincent,

I have pushed Benchmark to my branch here: 
https://github.com/NightOwl888/lucenenet/tree/benchmark. There are 106/109 
tests passing, but there are 3 tests here that never finish: 
https://github.com/NightOwl888/lucenenet/blob/benchmark/src/Lucene.Net.Tests.Benchmark/ByTask/Feeds/EnwikiContentSourceTest.cs#L29

There is also still one unfinished matter in that TagSoup/Sax.Net doesn't 
support .NET Standard. It is a close match for Java's SAX parser, but so far 
the owner of the project has not replied to my query whether he would be open 
to a PR. So, I have my eye on using the HTML Agility Pack instead: 
https://www.nuget.org/packages/HtmlAgilityPack. If the concurrency bug happens 
to have something to do with Sax.Net, feel free to replace it with the HTML 
Agility Pack.

I would appreciate if you could have a look at this when you have a chance.

Thanks,
Shad Storhaug (NightOwl888)

Reply via email to