[ 
https://issues.apache.org/jira/browse/LUCENE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989965#comment-12989965
 ] 

Hsiu Wang edited comment on LUCENE-2208 at 2/3/11 5:56 AM:
-----------------------------------------------------------

added patch(LUCENE-2208.patch) to fix 
org.apache.lucene.search.highlight.InvalidTokenOffsetsException.

The exception is caused by HTML escape characters (e.g., &, &  
) which are counted as 1 character in text.length() in 
Highlighter.getBestTextFragments, but in HTMLStripCharfilter, they are counted 
as N characters(& counted as 5). 

In the patch, I commented out an incorrect test case in 
HTMLStripCharFilterTest.testOffset()("X & X ( X < > X"). The 
commented out test case is covered by Robert's test patch. 


      was (Author: hwang):
    added patch(LUCENE-2208.patch) to fix 
org.apache.lucene.search.highlight.InvalidTokenOffsetsException.

The exception is caused by HTML escape characters (e.g., &, &) 
which are counted as 1 character in text.length() in 
Highlighter.getBestTextFragments, but in HTMLStripCharfilter, they are counted 
as N characters(& counted as 5). 

In the patch, I commented out an incorrect test case in 
HTMLStripCharFilterTest.testOffset()("X & X ( X < > X"). The 
commented out test case is covered by Robert's test patch. 

  
> Token div exceeds length of provided text sized 4114
> ----------------------------------------------------
>
>                 Key: LUCENE-2208
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2208
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/highlighter
>    Affects Versions: 3.0
>         Environment:  diagnostics = {os.version=5.1, os=Windows XP, 
> lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=flush, os.arch=x86, 
> java.version=1.6.0_12, java.vendor=Sun Microsystems Inc.}
>    
>            Reporter: Ramazan VARLIKLI
>         Attachments: LUCENE-2208.patch, LUCENE-2208_test.patch
>
>
> I have a doc which contains html codes. I want to strip html tags and make 
> the test clear after then apply highlighter on the clear text . But 
> highlighter throws an exceptions if I strip out the html characters  , if i 
> don't strip out , it works fine. It just confuses me at the moment 
> I copy paste 3 thing here from the console as it may contain special 
> characters which might cause the problem.
> 1 -) Here is the html text 
>           <h2>Starter</h2>
>           <div id="tab1-content" class="tabContent selected">
>             <div class="head"></div>
>             <div class="body">
>              <div class="subject-header">Learning path: History</div>
>               <h3>Key question</h3>
>               <p>Did transport fuel the industrial revolution?</p>
>               <h3>Learning Objective</h3>
>             <ul>
>               <li>To categorise points as for or against an argument</li>
>               </ul>
>             <p>
>               <h3>What to do?</h3>
>               <ul>
>                 <li>Watch the clip: <em>Transport fuelled the industrial 
> revolution.</em></li>
>               </ul>
>               <p>The clips claims that transport fuelled the industrial 
> revolution. Some historians argue that the industrial revolution only 
> happened because of developments in transport.</p>
>                         <ul>
>                               <li>Read the statements below and decide which 
> points are <em>for</em> and which points are <em>against</em> the argument 
> that industry expanded in the 18th and 19th centuries because of developments 
> in transport.</li>
>                       </ul>
>                       
>                       <ol type="a">
>                               <li>Industry expanded because of inventions and 
> the discovery of steam power.</li>
>                               <li>Improvements in transport allowed goods to 
> be sold all over the country and all over the world so there were more 
> customers to develop industry for.</li>
>                               <li>Developments in transport allowed 
> resources, such as coal from mines and cotton from America to come together 
> to manufacture products.</li>
>                               <li>Transport only developed because industry 
> needed it. It was slow to develop as money was spent on improving roads, then 
> building canals and the replacing them with railways in order to keep up with 
> industry.</li>
>                       </ol>
>                       
>                       <p>Now try to think of 2 more statements of your 
> own.</p>
>                       
>             </div>
>             <div class="foot"></div>
>           </div>
>           <h2>Main activity</h2>
>           <div id="tab2-content" class="tabContent">
>             <div class="head"></div>
>             <div class="body"><div class="subject-header">Learning path: 
> History</div>
>               <h3>Learning Objective</h3>
>               <ul>
>                 <li>To select evidence to support points</li>
>               </ul>
>               <h3>What to do?</h3>
>               <!--<ul>
>                 <li>Watch the clip: <em>Windmill and water mill</em></li>
>               </ul>-->
>               <ul><li>Choose the 4 points that you think are most important - 
> try to be balanced by having two <strong>for</strong> and two 
> <strong>against</strong>.</li>
>                         <li>Write one in each of the point boxes of the 
> paragraphs on the sheet <a href="lp_history_industry_transport_ws1.html" 
> class="link-internal">Constructing a balanced argument</a>.</li></ul> <p>You 
> might like to re write the points in your own words and use connectives to 
> link the paragraphs.</p>
>               
>                         <p>In history and in any argument, you need evidence 
> to support your points.</p>
>                         <ul><li>Find evidence from these sources and from 
> your own knowledge to support each of your points:</li></ul>
>                         <ol>
>                 <li><a 
> href="../servlet/link?template=vid&macro=setResource&resourceID=2044" 
> class="link-internal">At a toll gate</a></li>
>                 <li><a 
> href="../servlet/link?macro=setResource&template=vid&resourceID=2046" 
> class="link-internal">Canals</a></li>
>                 <li><a 
> href="../servlet/link?macro=setResource&template=vid&resourceID=2043" 
> class="link-internal">Growing cities: traffic</a></li>
>                               <li><a 
> href="../servlet/link?macro=setResource&template=vid&resourceID=2047" 
> class="link-internal">Impact of the railway</a> </li>
>                               <li><a 
> href="../servlet/link?macro=setResource&template=vid&resourceID=2048" 
> class="link-internal">Sailing ships</a> </li>
>                               <li><a 
> href="../servlet/link?macro=setResource&template=vid&resourceID=2050" 
> class="link-internal">Liverpool: Capital of Culture</a> </li>
>               </ol>
>                         <p>Try to be specific in your evidence - use named 
> examples of places or people. Use dates if you can.</p>
>             </div>
>             <div class="foot"></div>
>           </div>
>           <h2>Plenary</h2>
>           <div id="tab3-content" class="tabContent">
>             <div class="head"></div>
>             <div class="body"><div class="subject-header">Learning path: 
> History</div>
>               <h3>Learning Objective</h3>
>               <ul>
>                 <li>To judge which of the arguments is most valid</li>
>               </ul>
>               <h3>What to do?</h3>
> <!--              <ul>
>                 <li>Watch the clip: <em>Food of the rich</em></li>
>               </ul>-->
>               <p>In order to be a good historian, and get good marks in 
> exams, you need to show your evaluation skills and make a judgement. Having 
> been through the evidence which point do you think is most important? Why? Is 
> there more evidence? Is the evidence more convincing?</p>
>                         <ul><li>In the final box on your worksheet write a 
> conclusion explaining whether on balance the evidence is enough to convince 
> you that transport fuelled the industrial revolution.</li></ul>
>             </div>
>             <div class="foot"></div>
>           </div>
>           <h2>Extension</h2>
>           <div id="tab4-content" class="tabContent">
>             <div class="head"></div>
>             <div class="body"><div class="subject-header">Learning path: 
> History</div>
>               <h3>What to do?</h3>
>               <p>Watch the clip <em>Stress in a ski resort</em></p>
>                         <p>New industries, such as tourism, can now be said 
> to be fuelled by transport improvements.</p>
>               <ul><li>Search Clipbank, using the Related clip lists as well 
> as the search function, to find examples from around the world of how 
> transport has helped industry.</li></ul>              
>             </div>
>             <div class="foot"></div>
>           </div>
>           
>           
> 2-) here is the text after stripped html tags  out 
>            Starter 
>            
>               
>              
>               Learning path: History 
>                Key question 
>                Did transport fuel the industrial revolution? 
>                Learning Objective 
>              
>                To categorise points as for or against an argument 
>                
>              
>                What to do? 
>                
>                  Watch the clip:  Transport fuelled the industrial 
> revolution.  
>                
>                The clips claims that transport fuelled the industrial 
> revolution. Some historians argue that the industrial revolution only 
> happened because of developments in transport. 
>                          
>                                Read the statements below and decide which 
> points are  for  and which points are  against  the argument that industry 
> expanded in the 18th and 19th centuries because of developments in transport. 
>                        
>                       
>                        
>                                Industry expanded because of inventions and 
> the discovery of steam power. 
>                                Improvements in transport allowed goods to be 
> sold all over the country and all over the world so there were more customers 
> to develop industry for. 
>                                Developments in transport allowed resources, 
> such as coal from mines and cotton from America to come together to 
> manufacture products. 
>                                Transport only developed because industry 
> needed it. It was slow to develop as money was spent on improving roads, then 
> building canals and the replacing them with railways in order to keep up with 
> industry. 
>                        
>                       
>                        Now try to think of 2 more statements of your own. 
>                       
>              
>               
>            
>            Main activity 
>            
>               
>               Learning path: History 
>                Learning Objective 
>                
>                  To select evidence to support points 
>                
>                What to do? 
>                
>                 Choose the 4 points that you think are most important - try 
> to be balanced by having two  for  and two  against . 
>                          Write one in each of the point boxes of the 
> paragraphs on the sheet  Constructing a balanced argument .    You might like 
> to re write the points in your own words and use connectives to link the 
> paragraphs. 
>               
>                          In history and in any argument, you need evidence to 
> support your points. 
>                           Find evidence from these sources and from your own 
> knowledge to support each of your points:  
>                          
>                   At a toll gate  
>                   Canals  
>                   Growing cities: traffic  
>                                 Impact of the railway   
>                                 Sailing ships   
>                                 Liverpool: Capital of Culture   
>                
>                          Try to be specific in your evidence - use named 
> examples of places or people. Use dates if you can. 
>              
>               
>            
>            Plenary 
>            
>               
>               Learning path: History 
>                Learning Objective 
>                
>                  To judge which of the arguments is most valid 
>                
>                What to do? 
>  
>                In order to be a good historian, and get good marks in exams, 
> you need to show your evaluation skills and make a judgement. Having been 
> through the evidence which point do you think is most important? Why? Is 
> there more evidence? Is the evidence more convincing? 
>                           In the final box on your worksheet write a 
> conclusion explaining whether on balance the evidence is enough to convince 
> you that transport fuelled the industrial revolution.  
>              
>               
>            
>            Extension 
>            
>               
>               Learning path: History 
>                What to do? 
>                Watch the clip  Stress in a ski resort  
>                          New industries, such as tourism, can now be said to 
> be fuelled by transport improvements. 
>                 Search Clipbank, using the Related clip lists as well as the 
> search function, to find examples from around the world of how transport has 
> helped industry.                
>              
>               
>            
>           
>          3-) here is the exception I get
> org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token div 
> exceeds length of provided text sized 4114
>       at 
> org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:228)
>       at 
> org.apache.lucene.search.highlight.Highlighter.getBestFragments(Highlighter.java:158)
>       at 
> org.apache.lucene.search.highlight.Highlighter.getBestFragments(Highlighter.java:462)

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to