[sqlite] HTML Tokenizer

2014-02-13 Thread Wang, Baoping
New to Sqlite, anybody knows is there a HTML tokenizer for full text search, Or do I need to implement my own? Thanks

Re: [sqlite] HTML Tokenizer

2014-02-13 Thread David King
> New to Sqlite, anybody knows is there a HTML tokenizer for full text search, > Or do I need to implement my own? There isn't an HTML tokeniser. But the default tokeniser considers punctuation like <> to be word breaks so it may already work for you with the down side that things like Hello! wi
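A small sketch of the behaviour described above, using Python's standard sqlite3 module. It assumes the linked SQLite library was built with FTS4 and that the table uses the default ("simple") tokenizer; the table name docs, the column name body and the helper matches are made up for illustration. Because < and > act as separators, the words inside the markup are searchable, but the tag names end up in the index as ordinary terms too.

```python
import sqlite3

# In-memory database; assumes this SQLite build includes FTS4.
conn = sqlite3.connect(":memory:")

# No tokenize= option, so the default "simple" tokenizer is used.
conn.execute("CREATE VIRTUAL TABLE docs USING fts4(body)")
conn.execute(
    "INSERT INTO docs(body) VALUES (?)",
    ("<p>Hello <b>world</b>!</p>",),
)


def matches(term):
    """Return True if at least one row matches the full-text query."""
    row = conn.execute(
        "SELECT count(*) FROM docs WHERE body MATCH ?", (term,)
    ).fetchone()
    return row[0] > 0


print(matches("world"))   # a word inside the markup
print(matches("b"))       # a tag name, indexed as if it were a word
print(matches("sqlite"))  # a term that does not occur in the row
```

Whether the extra tag-name terms matter depends on the data; for short documents with heavy markup they can noticeably pollute search results.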

Re: [sqlite] HTML Tokenizer

2014-02-13 Thread Petite Abeille
On Feb 13, 2014, at 8:48 PM, Wang, Baoping wrote: > New to Sqlite, anybody knows is there a HTML tokenizer for full text search, No. > Or do I need to implement my own? If you feel the urge. Otherwise, try lynx -dump. For example: curl -s http://www.sqlite.org | lynx -nolist -stdin -dump _
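For anyone who would rather do the same stripping step in-process instead of shelling out to lynx, here is a rough sketch using Python's standard html.parser. This is not what lynx -dump does internally, and the names TextExtractor and html_to_text are invented for this example; it only collects text nodes, skips script and style content, and collapses whitespace, so the result can be handed to the normal full-text tokenizer.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of script/style elements."""

    SKIPPED = ("script", "style")

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the chunks with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML fragment as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(html_to_text("<p>Hello <b>world</b>!</p><script>var x = 1;</script>"))
```

The extracted string can then be inserted into the full-text table in place of the raw HTML, so that tag names never reach the index.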

Re: [sqlite] HTML Tokenizer

2014-02-13 Thread Scott Robison
My current project needed to tokenize the text in HTML without the tags. The easy solution for us was to license a library from Chilkat that supported text extraction then tokenize that. I'm on my phone at the moment but could supply more details later if desired. SDR On Feb 13, 2014 1:02 PM, "Dav

Re: [sqlite] HTML Tokenizer

2014-02-13 Thread Petite Abeille
On Feb 13, 2014, at 9:08 PM, Petite Abeille wrote: > curl -s http://www.sqlite.org | lynx -nolist -stdin -dump While we are at it, www.sqlite.org exhibits many validation errors: http://validator.w3.org/check?uri=http%3A%2F%2Fwww.sqlite.org%2F&charset=%28detect+automatically%29&doctype=Inline&

Re: [sqlite] HTML Tokenizer

2014-02-13 Thread Jan Nijtmans
2014-02-13 21:35 GMT+01:00 Petite Abeille : > > On Feb 13, 2014, at 9:08 PM, Petite Abeille wrote: > >> curl -s http://www.sqlite.org | lynx -nolist -stdin -dump > > While we are at it, www.sqlite.org exhibits many validation errors: > > http://validator.w3.org/check?uri=http%3A%2F%2Fwww.sqlite.or

Re: [sqlite] HTML Tokenizer

2014-02-13 Thread Petite Abeille
On Feb 13, 2014, at 9:52 PM, Jan Nijtmans wrote: > But if you put the validator in HTML5 mode, there are many less errors: Possibly. But it says 'HTML 4.01 Strict' on the tin: http://www.w3.org/TR/html4/strict.dtd"> Either way, a bunch of errors.

Re: [sqlite] HTML Tokenizer

2014-02-13 Thread RSmith
On 2014/02/13 22:35, Petite Abeille wrote: While we are at it, www.sqlite.org exhibits many validation errors: http://validator.w3.org/check?uri=http%3A%2F%2Fwww.sqlite.org%2F&charset=%28detect+automatically%29&doctype=Inline&group=0&user-agent=W3C_Validator%2F1.3+http%3A%2F%2Fvalidator.w3.org%2