Re: XFastParser - next steps ...

2016-08-01 Thread Michael Meeks
Hi Mohammed,

On Mon, 2016-08-01 at 12:09 +0530, Mohammed Abdul Azeem wrote:

> A. optimize clearing the pending events - unlikely to
> give
>a big win, but nice.
>
> This is done.

Great.

> B. merge the legacyfastparser pieces into SvXMLImport
>
> If we do this, it will be
> XParser -> XFastParser -> unknown elements -> callbackDocumentHandler
> -> SvXMLImport -> tokenize (SvXMLNamespaceMap) -> FastContexts

As a first cut, then yes we will tokenize and de-tokenize and
re-tokenize ;-) but the de-tokenize is just looking up in array:

OUString aTokens[128];

aTokens[nTokenIndex]

which is quick - the rest is done in another thread. Also (obviously)
we want to only tokenize once and in the thread and share that moving
ahead.

> C. consider how to allow XFastParser tokenization
> selectively
>just for the elements eg. ScXMLTableRowCellContext
> that
>can get the maximum benefit in the short-run.
...
> So, then we will somehow selectively tokenize elements and attributes
> belonging to  ScXMLTableRowCellContext, so as to avoid
> SvXMLNamespaceMap pieces.

Hmm; - I think we need to tokenize them all - but lets get there first.
Lets get all of the tokens mapped to and fro; and then lets see if we
can't look at the profile, and work out how to tunnel through a few of
these contexts to use the fast-parser directly =)

So - eg. currently we have:

SvXMLImport::startElement

which calls CreateContext(...) - to create the handler for the next
element down the tree.

We could have a virtual FastContext *CreateFastContext(...) method that
returned a distinct FastParser context and if it is not there we fall
back to the old / dummy methods there =) Using that we could convert the
XML tree context handlers from the leaves upwards. Which would save a
lot of bother. There is already some partial attempt to integrate the
FastParser into xmloff/ that looks unlikely to do anything at all (to
me) =) worth not getting confused / tangled up in that though possibly
worth reading that through 'git grep -5 startFastElement' inside xmloff.

> For this I still have some questions in mind. Are we going to tokenize
> everything from FastParser and then de-tokenize in the callback
> handler (token based startElement)? That's the idea here?

Initially - yes; it sounds mad, but then allocation is also expensive -
and there is no 'free' for integer tokens ;-)

> If I've got anything wrong or you have some insights, please share it
> here. :)

HTH,

Michael.


-- 
michael.me...@collabora.com <><, GM Collabora Productivity
 Skype: mmeeks, Google Hangout: mejme...@gmail.com
 (M) +44 7795 666 147 - timezone usually UK / Europe

___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/libreoffice


Re: XFastParser - next steps ...

2016-07-28 Thread Michael Meeks
Hi Mohammed,

Summarizing where we're at:

 No. Of Records in ods sax_expat  #1  #2
 1 million 8s 20.5s   11s
 50 million23.4s  26.3s   22s

#1 - first cut of XParser implemented on FastParser
#2 - second cut avoiding bogus tokenizer calls

...
> After this change, load time for ods file with 1million entries has
> come down to around 11s and with 50million entries to around 22s.

Great; so we're winning for 50m cells =)

> I've included complete traces of when a ods file with 1million entries
> is imported for both xparser and legacyfastparser. 

Lovely.

> I'm looking into the suggestions you gave in the irc channel.

Let me log that here:

 Azorpid: so the maUserEvents are re-used; but when they are
re-used it is necessary to clear the attributes on any elements in there
- which takes time.
 Azorpid: so we could clear them instead en-masse here:
 if (!consume(pEventList))
 done = true;
 aGuard.reset(); // lock
 rEntity.maUsedEvents.push(pEventList);
 .
 Azorpid: IIRC we used to have an ambition to have code to work
out if the producer or consumer had more time,
 Azorpid: and do that work in whichever one was getting ahead of
the other ;-) still not a bad idea IMHO.
 Azorpid: something like if (rEntity.maPendingEvents.size() <=
rEntity.mnEventLowWater) ... do the work in the consumer thread of
cleaning those attributes up.

Lets get that done - but reading the code - I don't think this is going
to give us a huge win - although it is useful to have that done anyway,
I'm sure in some cases it will really help.

Particularly after this next optimization.

I think we need to move something based on your legacyfastparser.cxx -
which is now working nicely into xmloff/ itself. It is a bit inefficient
re-constructing UNO APIs with Attributes with re-constructed namespace
pieces etc. here I think.

Ultimately, we want to fuse xmloff/ into using XFastParser directly -
so, I think the first step is to move a copy of the legacyfastparser.cxx
code into xmloff/ itself.

Checkout the CallbackDocumentHandler::startUnknown method in the 2nd
profile to see what that can look like: we're swamped with allocation
and de-allocation per element / attribute =)

Can you implement the XFastParser interface and the proxying onwards to
the legacy interface inside the SvXMLImport class in
xmloff/source/core/xmlimp.cxx ?

After that works - I think we need to consider how we can convert the
ScXMLTableRowCellContext to have a fast-path that tokenizes its input in
the parsing thread: thus avoiding all of the SvXMLNamespaceMap pieces,
and also avoiding all of the string allocation, free and copying
associated with that.

How does that sound ? I think the next steps are:

A. optimize clearing the pending events - unlikely to give
   a big win, but nice.

B. merge the legacyfastparser pieces into SvXMLImport

C. consider how to allow XFastParser tokenization selectively
   just for the elements eg. ScXMLTableRowCellContext that
   can get the maximum benefit in the short-run.
1. this will involve slowing us down again by
   adding a tokenizer.
2. in fact this may speed things up by avoiding
   lots of allocations.

C.2. is quite exciting; we'll need to implement a nice getTokenDirect
for ODF - and we will need to be able (on demand) to switch those tokens
back into OUStrings. In fact - this seems like a great idea anyway IMHO
- surely much faster than allocating all of those strings. Perhaps worth
trying C.2. after A ;-)

For C.2. you'll want to checkout the:

$ git log -u feature/fastparser

branch that Daniel Siekler worked on. I'll send you an archive of his
xmloff/ fastparser pieces - but it would be great to cherry-pick some of
that work across to master. My -hope- is that as/when we have an
incremental approach working here - we can pick and test his patches
across one by one and enable fast-parsing for each of those contexts. 

How does that sound ?

ATB,

Michael.

-- 
michael.me...@collabora.com <><, GM Collabora Productivity
 Skype: mmeeks, Google Hangout: mejme...@gmail.com
 (M) +44 7795 666 147 - timezone usually UK / Europe

___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/libreoffice


Re: XFastParser - next steps ...

2016-06-28 Thread Michael Meeks
Hi Michael,

Thanks for your mail.

On Tue, 2016-06-28 at 20:53 +0200, Michael Stahl wrote:
> > Then (I guess) we could pass namespace prefixed names through for the
> > unknown attributes so eg. "office:foo" - and still have the information
> > we need to properly resolve them.
> 
> uhm... how many bad comparisons like "if (attribute == "office:foo") exist?

Apparently enough of them to stop libreoffice even starting if we don't
pass through namespace names un-modified ;-)

> a quick git grep finds only a handful, mostly in base code - wouldn't it
> be easier to just fix those?

Would be great; but I'm wary of pulling too much un-related string into
the GSOC project; Mohammed had a number of other cases I think where
this caused grief. A quick poke around shows:

$ git grep '"toolbar:'
$ git grep '"menu:'
 
Are rather suggestive of problems; though no idea what parser they use.
Ideally we'd move everyone to the XFastParser - it's faster ;-) in
due-course, but again - I'd rather get some of the threaded parsing
XFastParser wins for ODF - than drain this swap as part of the project.

Volunteers welcome to hunt & kill broken NS handling though !

ATB,

Michael.


-- 
michael.me...@collabora.com <><, GM Collabora Productivity
 Skype: mmeeks, Google Hangout: mejme...@gmail.com
 (M) +44 7795 666 147 - timezone usually UK / Europe

___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/libreoffice


Re: XFastParser - next steps ...

2016-06-28 Thread Michael Stahl
On 28.06.2016 18:28, Michael Meeks wrote:
> Hi Mohammed,
> 
>   Let me CC the dev list on the fag-end of this conversation; hopefully
> it will get more interesting over time =)
> 
> On Mon, 2016-06-27 at 22:01 +0530, Mohammed Abdul Azeem wrote:
>> I'm looking into the code paths which misuses defined namespaces
>> without resolving them. I will make test cases to cover them. 
> 
>   Ah - right =)
>
>> Sure we can do this, but it would only account for element's namespace
>> and not attributes namespaces. Also some of the implementations of
>> XDocumentHandler expects namespace declaration( looking for "xmlns" )
>> and tries to resolve them, and I think this approach wouldn't cover
>> all the namespace declaration.
> 
>   Ah ! fair enough - then (I guess) we need to implement a new
> XFastNamespaceHandler which we can register with a setNamespaceHandler()
> call on XFastParser - and which can be NULL for all the interesting
> cases where we need to be truly fast =)
> 
>   Then (I guess) we could pass namespace prefixed names through for the
> unknown attributes so eg. "office:foo" - and still have the information
> we need to properly resolve them.

uhm... how many bad comparisons like "if (attribute == "office:foo") exist?

a quick git grep finds only a handful, mostly in base code - wouldn't it
be easier to just fix those?


___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/libreoffice


Re: XFastParser - next steps ...

2016-06-28 Thread Michael Meeks
Hi Mohammed,

Let me CC the dev list on the fag-end of this conversation; hopefully
it will get more interesting over time =)

On Mon, 2016-06-27 at 22:01 +0530, Mohammed Abdul Azeem wrote:
> I'm looking into the code paths which misuses defined namespaces
> without resolving them. I will make test cases to cover them. 

Ah - right =)

> Sure we can do this, but it would only account for element's namespace
> and not attributes namespaces. Also some of the implementations of
> XDocumentHandler expects namespace declaration( looking for "xmlns" )
> and tries to resolve them, and I think this approach wouldn't cover
> all the namespace declaration.

Ah ! fair enough - then (I guess) we need to implement a new
XFastNamespaceHandler which we can register with a setNamespaceHandler()
call on XFastParser - and which can be NULL for all the interesting
cases where we need to be truly fast =)

Then (I guess) we could pass namespace prefixed names through for the
unknown attributes so eg. "office:foo" - and still have the information
we need to properly resolve them.

Failing that (I guess) - we could as you've done push xmlns: statements
through in attributes - it would best match the css::xml::Attribute
approach that expat_wrap used in the past - and would simplify things
for us.

In this case - I'd want to see:

a) comprehensive test cases, with assertions for all
   code-paths.
b) never creating these or slowing code-paths where
   the XFastParser is being used and tokenizing correctly
   and really needs to be fast.

I fear that b) is not met by the previous patch.

Thoughts ? =)

ATB,

Michael.

-- 
michael.me...@collabora.com <><, GM Collabora Productivity
 Skype: mmeeks, Google Hangout: mejme...@gmail.com
 (M) +44 7795 666 147 - timezone usually UK / Europe

___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/libreoffice