Re: which HTML parser is better?

Karl Koch Thu, 03 Feb 2005 02:12:24 -0800

I am using Java 1.1 on a Sharp Zaurus PDA. I have very tight memory
constraints. I do not think CPU performance is a big issue, though. But I
have other parts in my application which use quite a lot of memory and
sometimes run short. I am therefore not looking at solutions which build up
tag trees etc. I want something more like a solution that reads a stream of
HTML and transforms it into a stream of text.
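
A minimal sketch of that stream-to-stream idea, assuming the HTML is
well-formed and contains no '>' inside attribute values, no comments and no
script blocks. The class name HtmlTextStream and the method strip are
hypothetical, not from any library mentioned in this thread. It uses only
java.io classes that already exist in Java 1.1, keeps no tree in memory, and
does not decode entities such as &amp;.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper (not from the thread): copies characters from an HTML
// stream to a text stream and drops everything between '<' and '>'.
class HtmlTextStream {

    // Reads one character at a time, so memory use does not grow with
    // the size of the document.
    static void strip(Reader in, Writer out) throws IOException {
        boolean inTag = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<') {
                inTag = true;            // start skipping tag content
            } else if (c == '>' && inTag) {
                inTag = false;
                out.write(' ');          // keep words on either side of a tag apart
            } else if (!inTag) {
                out.write(c);            // ordinary text is passed through unchanged
            }
        }
        out.flush();
    }
}
```

Wrapping the input in a BufferedReader would probably help when reading from
a file, at the cost of one small buffer.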


I see your point about using an external program. I am, however, not entirely
sure whether this is available. Also, it would be much simpler to have a 3-5 kB
solution in Java, perhaps encapsulated in a class which does the job without
the need for advanced libraries which need 100-200 kB on my internal
storage.
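
One way to find out whether the external-program route is available at all
is to try starting the program and see whether the runtime refuses. This is
only a sketch: the class name ExternalToolProbe and the method canRun are
hypothetical, and whether a given Java 1.1 runtime on the PDA permits
starting processes is an assumption that would have to be checked on the
device.

```java
import java.io.IOException;

// Hypothetical probe (not from the thread): reports whether a command can be
// started through Runtime.exec on the current platform.
class ExternalToolProbe {

    static boolean canRun(String[] command) {
        try {
            Process p = Runtime.getRuntime().exec(command);
            p.destroy();          // we only wanted to know whether it starts
            return true;
        } catch (IOException e) {
            return false;         // program missing or not executable
        } catch (SecurityException e) {
            return false;         // runtime does not allow starting processes
        }
    }
}
```

If the probe returns false for the converter command, the pure-Java route is
the only one left.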

I hope this clarifies my situation.

Cheers,
Karl 

> Karl Koch wrote:
> 
> >Hello Sergiu,
> >
> >thank you for your help so far. I appreciate it.
> >
> >I am working with Java 1.1 which does not include regular expressions.
> >  
> >
> Why are you using Java 1.1? Are you that limited in resources?
> What operating system do you use?
> I assume that you just need to index the HTML files, and that you need an
> html2txt conversion.
> If an external converter is a solution for you, you can use
> Runtime.getRuntime().exec(...) to run a converter that extracts the
> information from your HTML files
> and generates a .txt file. Then you can use a reader to index the .txt file.
> 
> As I told you before, the best solution depends on your constraints 
> (time, effort, hardware, performance) and requirements :)
> 
>   Best,
> 
>   Sergiu
> 
> >Your turn ;-)
> >Karl 
> >
> >  
> >
> >>Karl Koch wrote:
> >>
> >>    
> >>
> >>>I am in control of the HTML, which means it is well-formed HTML. I use
> >>>only HTML files which I have transformed from XML. No external HTML
> >>>(e.g. the web).
> >>>
> >>>Are there any very short solutions for that?
> >>> 
> >>>
> >>>      
> >>>
> >>If you are using only correctly formatted HTML pages and you are in control
> >>of these pages,
> >>you can use a regular expression to remove the tags.
> >>
> >>Something like
> >>replaceAll("<[^>]*>","");
> >>
> >>This is the idea behind the operation. If you search on Google you
> >>will find a more robust
> >>regular expression.
> >>
> >>Using a simple regular expression is a very cheap solution that
> >>can cause you a lot of problems in the future.
> >> 
> >> It's up to you to use it ....
> >>
> >> Best,
> >> 
> >> Sergiu
> >>
> >>    
> >>
> >>>Karl
> >>>
> >>> 
> >>>
> >>>      
> >>>
> >>>>Karl Koch wrote:
> >>>>
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>Hi,
> >>>>>
> >>>>>yes, but the library you are using is quite big. I was thinking that a
> >>>>>5 kB code could actually do that. That SourceForge project is doing much
> >>>>>more than that, but I do not need it.
> >>>>>
> >>>>You need just the htmlparser.jar (200k).
> >>>>... you know ... the functionality is strongly correlated with the
> >>>>size.
> >>>>
> >>>>You can use 3 lines of code with a good regular expression to eliminate
> >>>>the HTML tags,
> >>>>but this won't give you any guarantee that the text from badly
> >>>>formatted HTML files will be
> >>>>correctly extracted...
> >>>>
> >>>> Best,
> >>>>
> >>>> Sergiu
> >>>>
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>Karl
> >>>>>
> >>>>>
> >>>>>
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>>>>>Hi Karl,
> >>>>>>
> >>>>>>I already submitted a piece of code that removes the HTML tags.
> >>>>>>Search for my previous answer in this thread.
> >>>>>>
> >>>>>>Best,
> >>>>>>
> >>>>>> Sergiu
> >>>>>>
> >>>>>>Karl Koch wrote:
> >>>>>>
> >>>>>>  
> >>>>>>
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>>>>Hello,
> >>>>>>>
> >>>>>>>I have been following this thread and have another question.
> >>>>>>>
> >>>>>>>Is there a piece of source code (preferably very short and simple
> >>>>>>>(KISS)) which allows removing all HTML tags from HTML content? HTML 3.2
> >>>>>>>would be enough... also no frames, CSS, etc.
> >>>>>>>
> >>>>>>>I do not need the HTML structure tree or any other structure, but
> >>>>>>>need a facility to clean up HTML into its normal underlying content
> >>>>>>>before indexing that content as a whole.
> >>>>>>>
> >>>>>>>Karl
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>    
> >>>>>>>
> >>>>>>>         
> >>>>>>>
> >>>>>>>              
> >>>>>>>
> >>>>>>>>I think that depends on what you want to do.  The Lucene demo parser
> >>>>>>>>does simple mapping of HTML files into Lucene Documents; it does not
> >>>>>>>>give you a parse tree for the HTML doc.  CyberNeko is an extension of
> >>>>>>>>Xerces (uses the same API; will likely become part of Xerces), and so
> >>>>>>>>maps an HTML document into a full DOM that you can manipulate easily
> >>>>>>>>for a wide range of purposes.  I haven't used JTidy at an API level
> >>>>>>>>and so don't know it as well -- based on its UI, it appears to be
> >>>>>>>>focused primarily on HTML validation and error detection/correction.
> >>>>>>>>
> >>>>>>>>I use CyberNeko for a range of operations on HTML documents that go
> >>>>>>>>beyond indexing them in Lucene, and really like it.  It has been
> >>>>>>>>robust for me so far.
> >>>>>>>>
> >>>>>>>>Chuck
> >>>>>>>>
> >>>>>>>>           
> >>>>>>>>
> >>>>>>>>                
> >>>>>>>>
> >>>>>>>>>-----Original Message-----
> >>>>>>>>>From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
> >>>>>>>>>Sent: Tuesday, February 01, 2005 1:15 AM
> >>>>>>>>>To: lucene-user@jakarta.apache.org
> >>>>>>>>>Subject: which HTML parser is better?
> >>>>>>>>>
> >>>>>>>>>Three HTML parsers (Lucene web application
> >>>>>>>>>demo, CyberNeko HTML Parser, JTidy) are mentioned in
> >>>>>>>>>Lucene FAQ
> >>>>>>>>>1.3.27. Which is the best? Can it filter tags that are
> >>>>>>>>>auto-created by the MS Word 'Save As HTML files' function?
> >>>>>>>>>
> >>>>>>>>>---------------------------------------------------------------------
> >>>>>>>>>To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>>>>>>>>For additional commands, e-mail: [EMAIL PROTECTED]
