Re: PDF Text extraction

2003-01-02 Thread Karl Øie
to get the string value of an inputstream you can copy it into a ByteArrayOutputStream and get the content from that; ByteArrayOutputStream baos = new ByteArrayOutputStream(); int b; while ((b = inputstream.read()) != -1) baos.write(b); System.out.println( new String(baos.toByteArray()) ); mvh karl øie On Friday, Dec 27, 2002, at 07:34 Europe/Oslo
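
A minimal sketch of the same idea on a current JDK. It assumes the stream holds UTF-8 text (the post itself relied on the platform default charset), and the class and method names are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StreamToString {
    // Drains the stream into a growable byte buffer, then decodes the bytes.
    // UTF-8 is an assumption; use the charset the data was written in.
    static String readAll(InputStream in) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = "karl \u00f8ie".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(data)));
    }
}
```

Reading in chunks rather than byte by byte only matters for speed; the result is the same.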

Re: Trying to index .doc

2002-12-17 Thread Karl Øie
you also want to check out the POI project http://jakarta.apache.org/poi/index.html it has office readers that can extract the content as text. mvh karl øie On Tuesday, Dec 17, 2002, at 16:00 Europe/Oslo, Diego Gutierrez Alonso wrote: Hi, i'd like to index .doc files, but i don't know how

Re: problems with search on Russian content

2002-11-22 Thread Karl Øie
Sorry, my bad! Didn't read this informative post :-) mvh karl øie On Thursday, Nov 21, 2002, at 16:35 Europe/Oslo, Otis Gospodnetic wrote: Look at CHANGES.txt document in CVS - there is some new stuff in org.apache.lucene.analysis.ru package that you will want to use. Get the Lucene from the

Re: problems with search on Russian content

2002-11-22 Thread Karl Øie
Hi, i took a look at Andrey Grishin's russian character problem and found something strange happening while we tried to debug it. It seems that he has avoided the usual "querying with different encoding than indexed" problem as he can dump out correctly encoded russian at all points in his applica

Re: Stress/scalability testing Lucene

2002-11-21 Thread Karl Øie
I have an index that is built each night from 1.3 GB of XML data, which results in a 1.4 GB index. The index takes about 11 hours to build on a dual 700 MHz Xeon machine with 768 MB of RAM. The index contains 4,388,730 documents and 953,632 terms. Mvh karl øie On Thursday, Nov 21

Re: Help on creating and maintaining an index that changes

2002-11-21 Thread Karl Øie
irectory? the org.apache.lucene.index.IndexReader class contains a delete() function to delete documents from lucene. But as said before, if your index is big it's best not to delete the documents just because a client goes offline, it's better to filter out the hits. mvh karl øie
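
Filtering the hits instead of deleting documents could look roughly like this. It is a plain-Java sketch that does not use the Lucene API; the Hit class and its "client" field are hypothetical stand-ins for a search hit and a stored field naming the owning client:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HitFilter {
    // Hypothetical stand-in for a search hit with a stored "client" field.
    static final class Hit {
        final String title;
        final String client;

        Hit(String title, String client) {
            this.title = title;
            this.client = client;
        }
    }

    // Keeps only the hits whose owning client is not currently offline.
    static List<Hit> onlineOnly(List<Hit> hits, Set<String> offlineClients) {
        List<Hit> visible = new ArrayList<>();
        for (Hit hit : hits) {
            if (!offlineClients.contains(hit.client)) {
                visible.add(hit);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(new Hit("report", "alice"), new Hit("notes", "bob"));
        Set<String> offline = new HashSet<>(Arrays.asList("bob"));
        for (Hit hit : onlineOnly(hits, offline)) {
            System.out.println(hit.title);
        }
    }
}
```

The index stays untouched, so a client coming back online costs nothing more than removing it from the offline set.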

Re: Indexing of documents in memory

2002-11-17 Thread Karl Øie
s you should store them into a byte or char array in a file or database. mvh karl øie On Monday, Nov 18, 2002, at 03:24 Europe/Oslo, Vinay Kakade wrote: Hi I am trying to use RAMDirectory to store the input HTML documents which are used to create index by the IndexHTML demo program, but I am f

Re: Indexing distant web sites

2002-11-04 Thread Karl Øie
cocoon/components/search/ Crawler implementation: > http://cvs.apache.org/viewcvs.cgi/xml-cocoon2/src/java/org/apache/cocoon/components/crawler/ This impl is indexing XML, but the principle is the same... mvh karl øie On Monday, Nov 4, 2002, at 14:29 Europe/Oslo, Friaa Nafaa wrote:

Re: How to include strange characters??

2002-10-14 Thread karl øie
ring = new String(querystring.getBytes("ISO-8859-1")); ... mvh karl øie On Sunday, Oct 13, 2002, at 14:15 Europe/Oslo, Chris Davis wrote: > To Dominator, > > Were you able to solve the display problem as well? I am having a > similar problem with documents that co

Re: Multithread searching problem on Linux

2002-10-14 Thread karl øie
if you still have problems, take a look at this note found in the newest tomcat release... it might help. mvh karl øie > --- > Linux and Sun JDK 1.2.x - 1.3.x: > --- > > Virtual machine crashes can be experienced whe

Re: How to include strange characters??

2002-10-07 Thread karl øie
to re-encode the query in UTF-8/16: String querystring = argv[0]; ' String querystring = httprequest.getParameter("query"); querystring = new String(querystring.getBytes("UTF-8")); ... this fixed my Norwegian/Sami problems... mvh karl øie On Monday, Oct 7, 2002, at 13:

Re: Multithread searching problem on Linux

2002-10-02 Thread karl øie
hen it comes to thread performance? there is also a 1.3 jvm from a group called "blackdown" that is free and optimized for linux. there was some talk in the news about it being very good at threading... you could try it.. ( http://www.blackdown.org/ ) mvh karl øie On Wednesday, Oct

Re: Multithread searching problem on Linux

2002-10-01 Thread karl øie
Try to run your vm in classic mode "java -classic" to disable the hotspot features... mvh karl øie On Tuesday, Oct 1, 2002, at 18:16 Europe/Oslo, Stas Chetvertkov wrote: > Hi All, > > I am building a search engine based on Lucene. Recently I created a > test >

Re: Problems with exact matces on non-tokenized fields...

2002-10-01 Thread karl øie
it works :-) when i see this i understand that the term being parsed by the queryparser is sent through the analyzer as well... thanks! mvh karl øie On Thursday, Sep 26, 2002, at 18:44 Europe/Oslo, Doug Cutting wrote: > karl øie wrote: >> I have a Lucene Document with a field named

Re: Problems with exact matces on non-tokenized fields...

2002-09-26 Thread karl øie
Hm.. a misunderstanding: i don't create the field with the value "POST?" i create it with "POST". "element:POST?" or "element:POST*" are the strings i send to the QueryParser for searching. mvh Karl Øie On Thursday, Sep 26, 2002, at 14:13 Europe/

Problems with exact matces on non-tokenized fields...

2002-09-26 Thread karl øie
ering "element:POST?" or "element:POST*" in the QueryParser class. Has anyone here run into this problem? I am using the 1.2 release version of Lucene. Mvh Karl Øie -- To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

RE: Problems understanding RangeQuery...

2002-08-12 Thread Karl Øie
thank you, that works! :-) and saves my day! mvh karl øie -Original Message- From: Terry Steichen [mailto:[EMAIL PROTECTED]] Sent: 10 August 2002 18:29 To: Lucene Users List; [EMAIL PROTECTED] Subject: Re: Problems understanding RangeQuery... Hi Karl, I have discovered that with

Problems understanding RangeQuery...

2002-08-10 Thread Karl Øie
Hi, i have a problem with understanding RangeQueries in Lucene-1.2: I have created an index with posts that have the field W_PUBLISHING_YEAR which contains the year of publishing. After indexing i loop through the terms and find that i have the following terms present in the index: 1923,192

Re: Crash / Recovery Scenario

2002-07-09 Thread Karl Øie
your implementation of this as i find this area to be the only weak point in lucene. mvh karl øie

Re: Crash / Recovery Scenario

2002-07-09 Thread Karl Øie
ld think crash/recovery/rollback functionality would benefit lucene greatly. I have indexes that take 5 days to build, and it's really bad to receive exceptions during a long index run, and no recovery/rollback functionality. Mvh Karl Øie

Re: SearchBean Persistence

2002-07-03 Thread Karl Øie
oh, i see. i was misled by the Bean part of the SearchBean... i'm sorry! :-) Anyhow, if it is not a Stateful SessionBean you are not restricted by EJB rules and can thus serialize everything you want to disk or db... mvh karl øie On Wednesday 03 July 2002 17:20, Otis Gospodnetic wrote

Re: SearchBean Persistence

2002-07-03 Thread Karl Øie
; persistence, there should be no problem storing it in memory to serve > subsequent requests. I just can't figure out how to modify the SearchBean > code to do this. It seemed like it would be simple, but try as I might, > nothing has so far worked. > > Regards, > >

Re: SearchBean Persistence

2002-07-03 Thread Karl Øie
if the array is of a serializable sort, just store it in a sql table !?! mvh karl øie On Wednesday 03 July 2002 16:22, Terry Steichen wrote: > I'm using Peter's SearchBean code to sort search results. It works fine, > but it creates the sorting field array from scratch with

Wildcard searching

2002-07-02 Thread Karl Øie
in the end: is there a reason why lucene doesn't use java interfaces for eh. interfaces like the Query class? mvh karl øie

Re: MS Word Search ??

2002-05-29 Thread Karl Øie
under development mvh karl øie On Wednesday 29 May 2002 11:48, Rama Krishna wrote: > Hi, > > I am trying to build a search engine which search in MS Word, excel, ppt > and adobe pdf. I am not sure whether i can use Lucene for this or not. pl. > help me out in this regard. > >

Re: UNICODE

2002-05-28 Thread Karl Øie
you better test it, it does not handle Slavic and Ugric characters well, but i don't know where the problem lies mvh karl øie On Tuesday 28 May 2002 10:52, jamin rubio wrote: > Hello, > > I have a newbie question ? Is lucene fully unicode compliant ? > > Thanks

Re: ->Lucene Index file names

2002-05-22 Thread Karl Øie
even better, remove this standard scare-monger from the bottom of your emails, (sic) corporate bullshit... mvh karl øie

->Lucene Index file names

2002-05-22 Thread Karl Øie
please note that the "license" in the email from the symbian employee actually tries to incriminate you just by replying to him!!! mvh karl øie

Re: Search on XML files

2002-05-13 Thread Karl Øie
es so it should perform well anyhow happy hacking! mvh karl øie

Re: Searching UNICODE

2002-05-02 Thread Karl Øie
what language are you trying to use lucene with? mh karl øie On Tuesday 30 April 2002 18:57, Hyong Ko wrote: > Hello, > > I think there's something wrong with the QueryParser.jj file. I downloaded > lucene-1.2-rc4-src and compiled successfully with JAVA_UNICODE_

Re: Lucene index integrity... or lack of :-(

2002-04-26 Thread Karl Øie
memory while indexing and merging, so checking the system's free memory is easier than trying to calculate memory usage mvh karl øie
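
One way to sketch such a free-memory check in Java. This looks at the JVM heap via Runtime, which is an assumption about what "free memory" meant in the post; the class name, method names and the 64 MB threshold are illustrative only:

```java
class MemoryCheck {
    // Heap the JVM could still hand out: the configured maximum minus what is in use.
    static long availableBytes() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }

    // True when less than minFreeBytes is left, i.e. time to flush the in-memory batch to disk.
    static boolean shouldFlush(long minFreeBytes) {
        return availableBytes() < minFreeBytes;
    }

    public static void main(String[] args) {
        long threshold = 64L * 1024 * 1024; // 64 MB, an arbitrary example value
        System.out.println("flush now? " + shouldFlush(threshold));
    }
}
```

Calling shouldFlush after each added document, and merging the in-memory batch into the on-disk index when it returns true, avoids having to estimate how large each document is in memory.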

Re: Lucene index integrity... or lack of :-(

2002-04-26 Thread Karl Øie
never experienced a failure while merging a RAMDir into a FSDir regardless of size, so it's my systems memory that is the problem mvh karl øie On Friday 26 April 2002 15:33, petite_abeille wrote: > >> Thanks. What's is your heuristic to flush the RAMDirectory? > >

Re: Lucene index integrity... or lack of :-(

2002-04-26 Thread Karl Øie
lush the RAMDirectory? please explain this because i don't understand english that good :-( mvh karl øie On Friday 26 April 2002 14:23, petite_abeille wrote: > > using a RAMDir as a middle man solved my problems... > > Thanks. What's is your heuristic to flush the RAMDirector

Re: Lucene index integrity... or lack of :-(

2002-04-26 Thread Karl Øie
xing large documents with many fields using a RAMDir as a middle man solved my problems... mvh karl øie On Friday 26 April 2002 13:54, petite_abeille wrote: > Hello, > > I'm starting to wonder how "bullet proof" are Lucene indexes? Do they > get corrupted easily?

Re: delete document

2002-04-24 Thread Karl Øie
it's actually the IndexReader, not the IndexWriter... happy hacking! On Wednesday 24 April 2002 15:27, Tim Tschampel wrote: > How do you delete a document from the index? > I see in the FAQ to use IndexWriter.delete(Term), however I don't see > this in the current API JavaDocs, and don't hav

Re: Italian web sites

2002-04-24 Thread Karl Øie
hm... this looks very interesting! if it is a perl exe you can just copy the text into a temp file and run the perl exe on that file and redirect the output to another tmp file. then read the file and use the result in a lucene keyword. mvh karl øie On Wednesday 24 April 2002 13:46, [EMAIL

Re: Italian web sites

2002-04-24 Thread Karl Øie
combined with that you could use an italian stop-word list to run statistics on a page :-) ?!? On Wednesday 24 April 2002 11:02, [EMAIL PROTECTED] wrote: > Hi all, > > I'm using Jobo for spidering web sites and lucene for indexing. The > problem is that I'd like spidering only Italian web site
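
A rough sketch of the stop-word statistics idea: count what fraction of a page's words appear in an Italian stop-word list and treat a high fraction as a hint that the page is Italian. The word list below is a tiny illustrative sample, not a complete list, and the class and method names are made up:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class StopWordRatio {
    // A tiny illustrative sample, not a complete Italian stop-word list.
    static final Set<String> ITALIAN = new HashSet<>(Arrays.asList(
            "il", "la", "di", "che", "e", "per", "un", "una", "in", "non", "con", "del"));

    // Fraction of the words in the text that appear in the stop-word set.
    static double ratio(String text, Set<String> stopWords) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        int total = 0;
        int hits = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token when the text starts with a separator
            }
            total++;
            if (stopWords.contains(token)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) {
        System.out.println(ratio("il gatto e il cane non sono in casa", ITALIAN));
        System.out.println(ratio("the cat and the dog are not at home", ITALIAN));
    }
}
```

Where to put the cut-off is a judgment call that would need testing on real pages; short pages in particular give noisy ratios.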

Re: Some questions

2002-04-19 Thread Karl Øie
tract the links and process each of these links in the same manner. for this you will need an html parser.. happy hacking! mvh karl øie

Re: too many open files in system

2002-04-09 Thread Karl Øie
n for this is that i have never encountered "Too many open files" when indexing clean text into one large field, but when creating many-many fields as required by indexing xml i got a "Too many open files" until i had to use a ram-dir to index document batches into.. mvh karl

Re: Read only filesystem

2002-04-05 Thread Karl Øie
thank you! i actually ran into this today when i built an index with crond as root and found that even though my own user could read the index, lucene couldn't. :-D mvh karl øie On Friday 05 April 2002 15:15, you wrote: > Hi, > after some trial with Lucene, I discovered it doesn't

Re: (ZipDirectory) RE: storing index in third party database.

2002-04-04 Thread Karl Øie
could you put up the source? i would really appreciate it. mvh karl øie On Wednesday 03 April 2002 21:27, you wrote: > I am doing some testing on managing the underlying data in a zip archive > and found that there is about a 15ms hit to use a zip vs. grabbing directly > from fi

Re: storing index in third party database.

2002-04-03 Thread Karl Øie
generate this stream, equally on insert we must accept the stream and break it up into keys... is it possible to "intercept" lucene's work at the key-handling point? or would this require a larger rewrite? mvh karl øie On Wednesday 03 April 2002 16:55, you wrote: > >

Re: storing index in third party database.

2002-04-03 Thread Karl Øie
indexes i have experienced that BerkeleyDB runs circles around any SQL database (including db2 and oracle!!!). Berkeley has a java-api and a b-tree record type that could be a very good match for a key-based searchtree, and it's free. take a look at it! mvh karl øie (ps: i am not paid by

Re: Speed of indexing

2002-03-27 Thread Karl Øie
reads that indexes into its own separate ramdir, then flushes these ramdirs into each separate fsdir (hence i have a fsdir for each workerthread), this because you can only write to a dir by one thread. in the end this improved my indexing time a lot... hope some of this can help you! mvh kar

RE: optimizing index - too many open files

2002-03-01 Thread Karl Øie
100 files. This made me work around both "out of memory" and "too many files" exceptions... mvh karl øie -Original Message- From: Paul Friedman [mailto:[EMAIL PROTECTED]] Sent: 28 February 2002 21:38 To: Lucene Users List Subject: Re: optimizing index - too many

RE: How to do web searching

2002-02-19 Thread Karl Øie
.org/viewcvs/jakarta-lucene/src/demo/org/apache/lucene/demo/SearchFiles.java?rev=1.1&content-type=text/vnd.viewcvs-markup mvh karl øie -Original Message- From: Parag Dharmadhikari [mailto:[EMAIL PROTECTED]] Sent: 19 February 2002 10:12 To: lucene-user Subject: How to do web sea

RE: does lucene support OR and AND queries ?

2002-02-19 Thread Karl Øie
Yes it does support boolean queries, you can read about its features here: http://jakarta.apache.org/lucene/docs/index.html mvh karl -Original Message- From: Biswas, Goutam_Kumar [mailto:[EMAIL PROTECTED]] Sent: 19 February 2002 14:18 To: Lucene-User (E-mail) Subject: does lucene suppo

RE: strange search problems(SOLVED!)

2002-01-28 Thread Karl Øie
ow!... sorry to bother you! mvh karl øie -Original Message- From: Karl Øie [mailto:[EMAIL PROTECTED]] Sent: 28 January 2002 18:16 To: [EMAIL PROTECTED] Subject: strange search problems(cannot query for more than the first 1 words!?!) I have created a testclass for working with An

RE: Strange Results with German Analyzer

2001-12-20 Thread Karl Øie
urns 22 i can not say... mvh karl øie -Original Message- From: Jan Stövesand [mailto:[EMAIL PROTECTED]] Sent: 20 December 2001 12:36 To: Lucene Users List Subject: Strange Results with German Analyzer Hi, I used a German Analyzer for Indexing and Searching. afaik, the search is

RE: HTML Parser

2001-12-18 Thread Karl Øie
*.jj files are compiled with javacc, there is a javacc.zip file in your lib directory, but you should download the compiler set. mvh karl øie -Original Message- From: Christophe GOGUYER DESSAGNES [mailto:[EMAIL PROTECTED]] Sent: 17 December 2001 17:32 To: [EMAIL PROTECTED] Subject: HTML

RE: searching words starting with accent characters using UTF-8

2001-12-10 Thread Karl Øie
a/org/apache/lucene/queryParser/ mvh karl øie -Original Message- From: Kiran Kumar K.G [mailto:[EMAIL PROTECTED]] Sent: 8 December 2001 12:43 To: [EMAIL PROTECTED] Subject: searching words starting with accent characters using UTF-8 I am trying to search for words starting with a

RE: Installation notes

2001-12-06 Thread Karl Øie
you will need javacc.zip in your classpath to compile lucene. it can be found in the jakarta-lucene-1.2-rc2/lib/ directory. mvh karl øie -Original Message- From: Patrick Codere [mailto:[EMAIL PROTECTED]] Sent: 5 December 2001 16:00 To: '[EMAIL PROTECTED]' Subject: FW: In

RE: Filter and stop-words

2001-12-03 Thread Karl Øie
/portuguese/stemmer.html mvh karl øie -Original Message- From: Bizu de Anúncio [mailto:[EMAIL PROTECTED]] Sent: 3 December 2001 13:22 To: [EMAIL PROTECTED] Subject: Filter and stop-words I'm new to Lucene. First of all I would like to know if there is a search archive like "su

RE: scandinavian characters.

2001-11-27 Thread Karl Øie
e from the browser to utf-8 and it worked (guess the browser sent the string as ascii!!! i'm so happy and thanks to you both jonas and david!! String query = this.request.getParameter( "query" ); if( query!=null ) { query = new String( query.getBytes(), "UTF-8" );
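
What the snippet above works around can be sketched on a current JDK as follows. It assumes the servlet container decoded the UTF-8 bytes sent by the browser as ISO-8859-1, which is a guess about the setup rather than something the post states; the class and method names are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

class QueryRecoder {
    // Undoes a wrong ISO-8859-1 decode: recover the raw bytes, then decode them as UTF-8.
    static String latin1ToUtf8(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "fj\u00c3\u00b8s" is what "fj\u00f8s" looks like after the wrong decode.
        String fixed = latin1ToUtf8("fj\u00c3\u00b8s");
        System.out.println(fixed.equals("fj\u00f8s"));
    }
}
```

Naming both charsets explicitly keeps the result independent of the platform default, which getBytes() without an argument depends on.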

RE: scandinavian characters.

2001-11-27 Thread Karl Øie
after i had replaced "QueryParser.jj" with the newest version from cvs the queryparser accepts my query, and i can now perform ø/æ/å searches from commandline, then i guess there is something wrong with my search servlets unicode handling :-) thank you very much! karl øie

RE: scandinavian characters.

2001-11-27 Thread Karl Øie
it's still translated into ä ?!? the strange thing is that the cvs version actually already has this in its code.. perhaps I should try a full rebuild from the cvs version... could you send me your "QueryParser.jj" so i could have a look at it? btw: thanks for the tips! mvh

RE: scandinavian characters.

2001-11-27 Thread Karl Øie
changed on the way in. if i search for >"fjøs" (fjøs) i get the swedish "fjä" (fjÄ). Where ø is >changed to Ä and 's' is removed. > >is the querystring translated some where? > >mvh karl øie > -Original Message- > From: David Bonilla

RE: scandinavian characters.

2001-11-27 Thread Karl Øie
i tried the SimpleAnalyzer and got the same result. but i forgot to provide the stacktrace; org.apache.lucene.queryParser.TokenMgrError: Lexical error at line 1, column 1. Encountered: "\u00c3" (195), after : "" at org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(Unknow
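
The character in the error, "\u00c3" (195), matches the first UTF-8 byte of the letters æ, å and ø, which would fit a query whose UTF-8 bytes were read one byte per character. That reading is an inference, not something the trace states. A small sketch that prints those bytes (class and method names are illustrative):

```java
import java.nio.charset.StandardCharsets;

class Utf8Bytes {
    // Returns the UTF-8 bytes of the text as unsigned values (0..255).
    static int[] utf8Values(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        int[] values = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            values[i] = bytes[i] & 0xFF;
        }
        return values;
    }

    public static void main(String[] args) {
        // ae-ligature, a-ring, o-slash written as escapes to stay independent of source encoding
        for (String letter : new String[] {"\u00e6", "\u00e5", "\u00f8"}) {
            int[] v = utf8Values(letter);
            System.out.println(v[0] + " " + v[1]);
        }
    }
}
```

Each of the three letters encodes to two bytes, and the first one is 195 in every case.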

RE: scandinavian characters.

2001-11-27 Thread Karl Øie
no it's even stranger than that, i have decoded the querystring, the problem is that it seems like something is changed on the way in. if i search for "fjøs" (fjøs) i get the swedish "fjä" (fjÄ). Where ø is changed to Ä and 's' is removed. is the querystring t

scandinavian characters.

2001-11-27 Thread Karl Øie
Hi, i got a problem with scandinavian characters (æåø), when i insert text with scand-chars it passes the analyzer correctly, but the QueryParser chokes when i try to search for the same characters. anyone know anything about how i can fix this? karl øie/gan meida