Re: PDF text extracted without spaces

2010-12-03 Thread Hans Merkl
-- Hans Merkl Right On Point, LLC 215 Victor Parkway, Suite E Annapolis, MD 21403 Phone: (443) 951-4324 E-mail: hme...@rightonpoint.us

Index strategy for tagged documents where tags can change often

2010-07-28 Thread Hans Merkl
Hi, In addition to text content, my documents have tags, which can be searched too. The problem is that the tags change quite often, and every time a tag is added or removed I have to call UpdateDocument, which is quite slow when done for hundreds of documents. Are there any well-performing

Re: arguments in favour of lucene over commercial competition

2010-06-23 Thread Hans Merkl
Just curious. What commercial alternatives are out there? On Wed, Jun 23, 2010 at 04:01, jm jmugur...@gmail.com wrote: Hi, I am trying to compile some arguments in favour of Lucene, as management is deciding whether to standardize on Lucene or a competing commercial product (we have a couple

Re: Are there any tokenizers that ignore HTML tags but keep the offsets so they can be used for highlighting in the original document?

2010-06-08 Thread Hans Merkl
Hi Ahmet, I am using Lucene.NET with C#, so I can't test this quickly. Will HTMLStripCharFilter maintain the character offsets, or does it just extract the plain text? Hans You can use org.apache.solr.analysis.HTMLStripCharFilter. It is possible to add one or more

Are there any tokenizers that ignore HTML tags but keep the offsets so they can be used for highlighting in the original document?

2010-06-07 Thread Hans Merkl
Hi, I need to index HTML documents, and one of the requirements is to highlight documents while maintaining all of the original formatting. The documents are relatively simple HTML, meaning no JavaScript code that changes elements at runtime and no overly fancy CSS styling. I think it should be possible
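
To make the requirement above concrete, here is a minimal sketch in plain Java (no Lucene dependency) of a tokenizer that skips everything between '<' and '>' but records each token's start and end offsets in the original HTML string, so matches can be highlighted in place. The names HtmlOffsetTokenizer, Token, tokenize and highlight are hypothetical and made up for this illustration; they are not Lucene or Solr APIs. It assumes simple markup only: no entities, comments, script blocks, or '>' inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Lucene or Solr. */
final class HtmlOffsetTokenizer {

    /** A token plus its [start, end) character offsets in the ORIGINAL HTML. */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on non-alphanumeric characters and ignores anything inside <...>. */
    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        boolean inTag = false;
        int tokenStart = -1; // -1 means "not currently inside a token"
        for (int i = 0; i <= html.length(); i++) {
            // A trailing space acts as a sentinel that flushes the last token.
            char c = i < html.length() ? html.charAt(i) : ' ';
            boolean wordChar = !inTag && Character.isLetterOrDigit(c);
            if (wordChar) {
                if (tokenStart < 0) {
                    tokenStart = i;
                }
            } else {
                if (tokenStart >= 0) {
                    tokens.add(new Token(html.substring(tokenStart, i), tokenStart, i));
                    tokenStart = -1;
                }
                if (c == '<') {
                    inTag = true;
                } else if (c == '>') {
                    inTag = false;
                }
            }
        }
        return tokens;
    }

    /** Wraps every token equal to term (ignoring case) in <mark> tags, leaving all other markup untouched. */
    static String highlight(String html, String term) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Token t : tokenize(html)) {
            if (t.text.equalsIgnoreCase(term)) {
                out.append(html, last, t.start)
                   .append("<mark>").append(t.text).append("</mark>");
                last = t.end;
            }
        }
        return out.append(html.substring(last)).toString();
    }
}
```

For example, highlight("<p>Hello <b>big</b> world</p>", "big") returns "<p>Hello <b><mark>big</mark></b> world</p>": the original tags stay where they were because the token offsets (12 to 15 here) point into the unmodified document. A real analyzer would also need to handle entities and the other cases excluded above.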