RE: [doc]proposal to improve and expand website

Morozova, Nadezhda Thu, 03 Aug 2006 04:42:17 -0700

Tim, Geir, all,

I've been experimenting with porting the externals/ stuff to the site
itself. For an API package doc (used Regexp.html), it took me: 
~2 hours to create and brush up a valid XML doc that ant parses with no
errors (used code sweepers and manual replacements)
~2 more hours to beautify the code (trim white space, remove redundant
anchors, repair links to reference section names instead of anchors,
etc.)
The resulting XML doc is attached. 
In a while, I might as well port the other docs using the same algorithm
unless we find a better one :)

Questions on the process: 

1. Each section and subsection ends with too much white space. It looks
ok on some other pages, but on this one it's ugly. I guess this is part
of stylesheet applied during processing; could we change the style
sheet? 
2. Where do we place images that we reference from these docs?
Currently, the only images/ folder is top-level docs/ folder with logos.
Do you think we can have images/ subfolder in subcomponents? Or in
subcomponents/classlib? 
3. The resulting XML doc is simplified in terms of style decoration.
Sometimes, it is good, but at other times it is not. For example, having
NOTE and WARNING appear in red might be a good idea. Can we extend the
set of styles currently applied to the files to make these and similar
adjustments? Alternatively, we can use harmony.css as the basis and
adjust it as we need. 

Best regards, 
Nadya Morozova, 
i-net 8-291-3310

-----Original Message-----
From: Tim Ellison [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 02, 2006 8:46 PM
To: [email protected]
Subject: Re: [doc]proposal to improve and expand website

Geir Magnusson Jr wrote:
> 
> Morozova, Nadezhda wrote:
> 
>> I scanned the svn repository for classlib packages, and found a few
more
>> that had docs with them, namely the JNDI DNS Service Provider, the
AWT
>> framework and the Java2D implementation. I could move their docs to
the
>> externals/ directory as well. 
>> Now, this makes multiple docs in one folder, so I thought I'd group
all
>> classlib package docs into a classlib/ subfolder and all DRLVM docs
into
>> the drlvm/ subfolder. Shared files, like the style sheet, could be
>> placed in those folders to avoid duplication. 
> 
> I wonder if it's time we pull that stuff out of classlib/ and move to
> the site itself...

Yes it is.

Tim

---------------------------------------------------------------------
Terms of use : http://incubator.apache.org/harmony/mailing.html
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

<?xml version="1.0" encoding="ISO-8859-1"?>	
    <!--

    Copyright 2006 The Apache Software Foundation

    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.
-->
<document>
	<properties>
<title>
	Apache Harmony
</title>
<author email="[email protected]">
	Harmony Documentation Team
</author>
	</properties>
	<body>
<section name="Regex Processing Framework">
	<p>
    <A href="#Revision History">
    	Revision History
    </A>
	</p>
	<p>
    <A href="#About This Document">
    	About This Document
    </A>
	</p>
	<blockquote>
    <A href="#Purpose">
    	Purpose
    </A>
	<br />
    <A href="#Intended Audience">
    	Intended Audience
    </A>
	<br />
    <A href="#Documentation Conventions">
    	Documentation Conventions
    </A>
	</blockquote>
	<p>
    <a href="#Introduction to Pattern Matching">
    	Introduction to Pattern Matching
    </a>
	</p>
	<p>
    <a href="#AbstractSet Interface">
    	AbstractSet Interface
    </a>
	</p>
	<blockquote>
    <a href="#Characteristics">
    	Characteristics
    </a>
	<br />
    <a href="#Methods Used">
    	Methods Used
    </a>
	<br />
    <a href="#Class Hierarchy">
    	Class Hierarchy
    </a>
	<br />
    <a href="#Optimizations">
    	Optimizations
    </a>
	</blockquote>
	<p>
    <a href="#Usage Examples">
    	Usage Examples
    </a>
	</p>
    <blockquote>
    <a href="#Example 1">
        Example 1
    </a>
    <br />
    <a href="#Example 2">
        Example 2
    </a>
    <br />
    <a href="#Example 3">
        Example 3
    </a>
    </blockquote>
	<p>
    <A href="#References">
    	References
    </A>
	</p>
</section>

<section name="Revision History">

	<TABLE>
        <TR>
        	<th width="20%">
            Version
        	</th>
        	<th width="50%">
            Version Information
        	</th>
        	<th width="30%">
            Date
        	</th>
        </TR>
        <TR>
        	<TD class="TableCell">
            Initial version
        	</TD>
        	<TD class="TableCell">
            Nadya Morozova, Nikolay Kuznetsov: document created.
        	</TD>
        	<TD class="TableCell">
            December 16, 2005
        	</TD>
        </TR>
	</TABLE>
</section>
<section name="About This Document">

	<subsection name="Purpose">
           
             <p>
        	This document describes the
        	<code>java.util.regex</code> package delivered as part of 
        	    the Harmony classlibrary project. This document provides an overview
        	    of the overall architecture with focus on the aspects improving performance.
        </p>
	</subsection>
	<subsection name="Intended Audience">
        <a name="Intended_Audience">
        </a>
    <p>
	The target audience for the document includes a wide community of engineers 
	  interested in using regular expression package and in further work with the 
	  product to contribute to its development. The document assumes that readers 
	  are familiar with Regular expressions [
	<a href="#Fri02">1</a>, <a href="#MY60">2</a>,<a href="#THO68">3</a>], 
    finite automata theory [<a href="#ASU86">4</a>], basic compiler techniques 
    [<a href="#ASU86">4</a>] and the Java <a href="#*">*</a>
    programming language [<a href="#JavaSite">5</a>].
    </p>
	</subsection>
    <subsection name="Documentation Conventions">

    <p>
	This document uses the
	<a href="../conventions.html" target="_blank">
    unified conventions</a> for the Harmony documentation kit.
    </p>
    <p><A href="#Regex Processing Framework">Back to Top</A>
    </p></subsection></section>
<section name="Introduction to Pattern Matching">
	<p>
    To analyze text in search of sequences matching preset patterns, you can choose 
      from a variety of techniques, such as the wide-spread
    <i>regular expressions</i>
    (RE) [<a href="#Fri02">1</a>],<em>exact string matching</em>, for example, the Boyer-Moore algorithm 
      (BM) [<a href="#BM77">6</a>], and others. However, the RE engine is rather complex, and significantly 
      impacts performance, whereas the exact string matching techniques have a limited 
      application.
    	</p>
	<p>
    For example, the regular expression
    <code>
    	.*world
    </code>
    is used to find the 
      last instance of
    <i>
    	world
    </i>
    in a sentence. The BM string searching algorithm is more 
      appropriate for solving the task than the RE technique, and is more effective 
      from the performance perspective. To optimize using pattern matching techniques 
      with different tasks, Harmony provides a unified interface that applies to any part
      of a regular expression, including the whole expression.
	</p>
	<p>
    In terms of the Finite Automata theory, which is the basis of regular expression 
      processing, a part of regular expression is a
    <i>node</i>. The Harmony regex framework
      treats every distinctive part of a regular expression and the whole expression 
      as a node. Each node implements the unified interface adjusted for a specific 
      technique. For instance, for the regular expression in the example, the Harmony
      framework includes a special class SequenceSet, which has a unified interface 
      called
    <code>
    	AbstractSet
    </code>
    , and implements the Boyer-Moore algorithm in 
      searching for a word.
	</p>
	<p>
<A href="#Regex Processing Framework">
	Back to Top
</A>
	</p>
</section>
<section name="AbstractSet Interface">	
	<p>
    The key feature of the Harmony regex framework the single super interface, 
    <code>AbstractSet</code>, shared by all types of nodes. Individual nodes are independent, so that every 
      node of the automaton incorporates its own find-match algorithm optimized for 
      a specific type of nodes. You can use methods implemented in the
    <code>AbstractSet</code> interface subclasses or override these methods to use your own more efficient 
      algorithm.
	</p>
	<subsection name="Characteristics">

    <p>
    	The
    	<code>
        AbstractSet
    	</code>
    	interface has the following key characteristics:
    </p>
    <ul>
    	<li>
    <b>
    	Unified structure:
    </b>
    General contract for all nodes including node 
        chains. This enables the framework to support new features and optimizations 
        for a specific construction of RE without impacting other parts of the framework.
    	</li>
    	<li>
    <b>
    	Simplicity:
    </b>
    Every node of an automaton is simple and handles a specific 
        optimization or functionality. The resulting automaton is straight-forward 
        and includes no multi-purpose functionality. To enable support of new functionality, 
        create a new class.
    <p class="note">
    	NOTE:
    </p>
    <p class="notetext">
    	Because new features and optimizations are implemented 
    	      independently, expansion of the framework has no negative impact on the 
    	      overall performance.
    </p>
	</li>
	<li>
    <b>
    	Independence
    </b>
    :The matching and finding procedures are incorporated 
        into the interface and require no external logic. The interface allows using 
        the same find-match methods for all types of nodes, and at the same time enables 
        optimizing the find-match algorithm for individual node types to improve performance.
	</li>
</ul>
<p>
	<A href="#Regex Processing Framework">
    Back to Top
	</A>
</p>
	</subsection>
	<subsection name="Methods Used">
<p>
	The
	<code>AbstractSet</code> interface defines the matching and finding methods, 
	  and the service methods supporting them, as follows.
</p>
<p><strong>
	Service Methods
    </strong>
</p>
<ul>
	<li><strong>
accept()</strong>
	</li>
	<blockquote>
Checks whether the current node matches the current string symbol(s) to 
    ensure that the automaton can proceed with the current state. The method returns 
    the number of accepted symbols. This method is designed for nodes representing 
    tokens, which mostly use the default match procedure. To optimize the match 
    procedure for a specific type of token, you only need to override the
<code>
	accept()
</code>
method, see <a href="#Example1">Example 1</a>.
	</blockquote>
	<li><strong>
first()</strong>
	</li>
	<blockquote>
Checks whether the first symbol of the current node intersects with previous 
    one. If not, then the previous node is quantified possessively without
<i>
	backtrack
</i>
(the option of restarting the search from the previous position).
	</blockquote>
	<li><strong>
next()</strong>
	</li>
	<blockquote>
Returns the next node of the automaton.
	</blockquote>
</ul>
<p><strong>
	Matching and Finding Methods
    </strong>
</p>
<ul>
	<li><strong>
matches()</strong>
	</li>
	<blockquote>
Runs the match procedure starting from the current state. This is the only 
    method you need to override in order to introduce new functionality. By default, 
    the <code>matches()</code> method does the following when working with terms (see
<a href="#Class Hierarchy">Class Hierarchy</a>):
<ol>
	<li>
Calls the
<code>
	accept()
</code>
method and proceeds if the method returns 
        a positive value.
	</li>
	<li>
Calls the match of the next node with the shift obtained from the
<code>
	accept()
</code>
method.
	</li>
	<li>
Returns
<code>
	TRUE
</code>
in case of a successful match.
	</li>
</ol>
	</blockquote>
	<li><strong>
find()</strong>
	</li>
	<blockquote>
Checks whether the pattern can match any part of the string in the following 
    way:
<ol>
	<li>
Finds the next position in the input string, for which the
<code>
	accept()
</code>
method returns a positive value. If no matches are found, the method terminates 
        and returns a negative value.
	</li>
	<li>
From this position, runs the
<code>
	matches()
</code>
method.
	</li>
</ol>
	</blockquote>
	<li><strong>
findBack()</strong>
	</li>
	<blockquote>
Does the same as the
<code>
	find()
</code>
method but starts 
    from the end of the string or from the nearest new-line symbol. This method 
    optimizes the find procedure when the current node of the pattern fits the 
    rest of the string. In such cases, it is necessary to find the last occurrence 
    of the pattern represented by the next node, as in the case of the
<code>
	.*world
</code>
pattern.
	</blockquote>
</ul>
<p>
	<A href="#Regex Processing Framework">
Back to Top
	</A>
</p>
	</subsection>
	<subsection name="Class Hierarchy">
<p>
	<a href="#figure1">
Figure 1
	</a>
	shows the class hierarchy based on the
	<code>
AbstractSet
	</code>
	interface. As mentioned in its
	<a href="#AbstractSetIntro">
description
	</a>
	, 
	  the whole hierarchy is based on this interface. The other classes of the framework 
	  are divided into the following categories:
</p>
<ul>
	<li>
<b>
	Tokens or Leafs
</b>
(
<code>
	LeafSet
</code>
in the figure): nodes representing 
    ordinal symbols, substrings, ranges, character classes, and other basic units 
    of regular expressions.
	</li>
	<li>
<b>
	Quantifiers
</b>
(
<code>
	QuantifierSet
</code>
in the figure): set of nodes responsible 
    for quantification of terms. Terms are simple and usually represent only one 
    symbol. Therefore, terms are quantified without recursion, and backtracks are 
    processed on the basis of the underlying pattern length.
	</li>
	<li>
<b>
	Groups
</b>
(
<code>
	JointSet
</code>
in the figure): sets derived from parenthetic constructions. 
    These nodes represent alternations of other sets.
	</li>
	<li>
<b>
	Group Quantifiers
</b>
(
<code>
	GroupQuantifier
</code>
in the figure): special 
    quantifiers over Groups. Because the length of a groups can vary, backtracks 
    require recursion.
	</li>
</ul>
<p align="center">
	<img border="0" alt="Hierarchy of classes in regexp framework" src="images/picture1.gif"/>
</p>
<p class="special">
	<a name="figure1">
	</a>
	Figure 1. Class Hierarchy
</p>
<p>
	Figure 1 displays only the basics of the regex framework. This framework also 
	  includes other nodes responsible for different optimizations. For more details, 
	  see the comments inlined in code.
</p>
<p>
	<A href="#Regex Processing Framework">
Back to Top
	</A>
</p>
	</subsection>
	<subsection name="Optimizations">
<p>
	In the current implementation, most optimizations are based on the node representation 
	  to improve efficiency of the framework. It is noteworthy that an optimal finite 
	  automaton is built during compile time, so that the constructed automaton does 
	  not spend additional time on decision-making overhead during match time. The 
	  regex framework optimizations improve different aspects, such as:
</p>
<ul>
	<li>
<b>
	Character or String find methods:
</b>
The methods
<code>
	find()
</code>
and
<code>
	findBack()
</code>
of certain nodes use exact pattern matching algorithms, 
    specifically, the Boyer-Moore fast string search algorithm.
	</li>
	<li>
<b>
	Quantified dot term:
</b>
This term (
<code>
	.*
</code>
) covers the rest 
    of the string and can therefore run
<code>
	findBack()
</code>
from the end of 
    the string. If the
<code>
	findBack()
</code>
method of the following node implements 
    a special algorithm, the pattern matching speed for this chain increases, 
    otherwise the operation runs at the same speed.
	</li>
	<li>
<b>
	Quantified non-intersecting states:
</b>
If the quantified node does 
    not intersect with the first character of the next one, the framework treats 
    the quantifier as the possessive quantifier because no backtracks can occur. 
    Possessive quantification is based on loops instead of recursion and works 
    faster.
	</li>
</ul>
<p>
	<A href="#Regex Processing Framework">
Back to Top
	</A>
</p>
	</subsection>
</section>
<section name="Usage Examples">
	<a name="UsageExamples">
	</a>
	<p>
    This part on the document illustrates usage of the Harmony regex framework.
	</p>

<subsection name="Example 1">

	<p>
    This example illustrates using the
    <code>
    	CharSet
    </code>
    class, which includes all nodes accepting characters to create a new type of tokens. For that, the class uses the
    <code>
    	LeafSet
    </code>
    class, which is the basic class for tokens. This example shows that the only method you 
                            need to override in order to present a new type of tokens is the
    <code>
    	accept()
    </code>
    method.
	</p>
	<pre>
class CharSet extends LeafSet {
    ...
    public int accept(int strIndex, CharSequence testString) {
        // checks that the current symbol of the input string 
        // is the one this instance represents and returns 1
        // (the number of accepted symbols) or -1 if accept fails:
        return (this.ch == testString.charAt(strIndex)) ? 1 : -1;
    }
    ...
}
	</pre>
  </subsection>
<subsection name="Example 2">


	<p>
        The following example demonstrates independent implementation of the
        <code>
        	find()
        </code>
        method for
        <code>
        	SequenceSet
        </code>
        .
	</p>
	<p class="note">
    Note:
	</p>
	<p class="notetext">
    This changes the find procedure for nodes representing character 
      sequences and at the same time does not affect the find-match algorithms of 
      other types of nodes.
	</p>
	<pre>
    class SequenceSet extends LeafSet {
        ...
        protected int indexOf(CharSequence str, int from) {
            // Boyer-Moore algorithm 
            ...
        }
        
        public int find(int strIndex, CharSequence testString,
                MatchResultImpl matchResult) {
            ...
            while (strIndex &lt;= strLength) {
                // call the fast search method instead of default implementation
                strIndex = indexOf(testStr, strIndex);
                if (strIndex &lt; 0) {
                    return -1;
                }
                if (next.matches(strIndex + charCount, testString, matchResult)&gt;= 0) {
                    return strIndex;
                }
                strIndex++;
            }
            return -1;
        }
        ...
    }
	</pre>
</subsection>
<subsection name="Example 3">
	<p>
    This example illustrates how to turn the match procedure of the
    <code>
    	.*
    </code>
    patterns into the find procedure, which is potentially faster.
    </p>
    	<pre>
    class DotQuantifierSet extends LeafQuantifierSet {
        ...     
        public int matches(int stringIndex, CharSequence testString,
                MatchResultImpl matchResult) {
            ...
            // find line terminator, since .* represented by this node 
            // accepts all characters except line terminators 
            int startSearch = findLineTerminator(stringIndex, strLength, testString);
            ...
            // run findBack method of the next node, because the previous 
            // part of the string is accepted by .* and no matter where
            // the next part of pattern is found, the procedure works OK.  
            return next.findBack(stringIndex, startSearch, testString, matchResult);
        }
    }
    	</pre>
	<p>
<A href="#Regex Processing Framework">
	Back to Top
</A>
	</p>
</subsection>
</section>
<section name="References">
	<a name="References">
	</a>
	<p>
[
<a name="Fri02">
</a>
1] Jeffrey E. F. Friedl.,
<em>
	Mastering regular expressions
</em>
, 
  2nd Edition., July 2002, O'Reilly, ISBN 0-596-00289-0
	</p>
	<p>
[
<a name="MY60">
</a>
2] McNaughton, R. and Yamada, H.
<em>
	Regular Expressions 
	  and State Graphs for Automata
</em>
, IRA Trans. on Electronic Computers, Vol. 
  EC-9, No. 1, Mar. 1960, pp 39-47.
	</p>
	<p>
[
<a name="THO68">
</a>
3] Thompson, K.,
<em>
	Regular Expression search Algorithm
</em>
, 
  Communication ACM 11:6 (1968), pp. 419-422.
	</p>
	<p>
[
<a name="ASU86">
</a>
4] Alfred V. Aho, Ravi Sethi, Jeffrey 
  D. Ullman,
<em>
	Compilers, Principles Techniques and Tools
</em>
, Addison-Wesley 
  Publishing Company, Inc., 1985, ISBN 0-201-10088-6
	</p>
	<p>
[
<a name="JavaSite">
</a>
5] Java Technology site,
<a href="http://java.sun.com"; target="_blank">
	http://java.sun.com
</a>
	</p>
	<p>
[
<a name="BM77">
</a>
6] R. Boyer and S. Moore. A fast string searching algorithm. C. ACM, 20:762-772, 1977.
	</p>
	<p>
[
<a name="KMP77">
</a>
7] D.E. Knuth, .I. Morris, and V. Pratt. Fast pattern 
  matching in strings. SIAM J on Computing, 6:323-350, 1977.
	</p>
	<p>
<A href="#Regex Processing Framework">
	Back to Top
</A>
	</p>
	<p>
	</p>
	<p>
<A name="*">
	*
</A>
Other brands and names are the property of their respective 
  owners.
	</p>
</section>
	</body>
</document>

---------------------------------------------------------------------
Terms of use : http://incubator.apache.org/harmony/mailing.html
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: [doc]proposal to improve and expand website

Reply via email to