file formats: MacRoman and UTF-8...

2011-03-28 Thread Patrick Diviacco
When I run my Lucene app and parse an XML file, I get the following error due to some characters such as é in the text file. If I save the text file as UTF-8 with my text editor I don't have this issue, but when I create it with a Java app, it is saved as MacRoman. How can I specify a

Re: file formats: MacRoman and UTF-8...

2011-03-28 Thread Paul Libbrecht
java -Dfile.encoding=utf-8 should do the trick. Or... which java app are you using? paul

RE: file formats: MacRoman and UTF-8...

2011-03-28 Thread Uwe Schindler
Hi, You have to specify the charset when creating the Writer. If you give no charset, Java uses the platform default. This question has nothing to do with Lucene; it is better suited to a general XML or Java forum. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen

Re: file formats: MacRoman and UTF-8...

2011-03-28 Thread Patrick Diviacco
hi, I'm using my own code:
    Writer writer = null;
    try {
        //File fileOutput = new File("output.trectext");
        File fileOutput = new File(args[1]);
        writer = new BufferedWriter(new FileWriter(fileOutput));
        writer.write(contents.toString());
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }

RE: file formats: MacRoman and UTF-8...

2011-03-28 Thread Uwe Schindler
Hi, Replace the stupid:
    writer = new BufferedWriter(new FileWriter(fileOutput));
by:
    writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(fileOutput), "UTF-8"));
Unfortunately, you cannot give a charset to FileWriter itself.
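A self-contained sketch of the fix described above, for anyone who wants to try it in isolation. The class name Utf8WriteExample and the helper writeUtf8 are hypothetical, and it uses StandardCharsets.UTF_8 and try-with-resources, which need Java 7 or later; on older JVMs the string "UTF-8" does the same job, as in the reply above.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class Utf8WriteExample {

    // Hypothetical helper: writes the text as UTF-8 regardless of the
    // platform default encoding (MacRoman, windows-1252, ...).
    static void writeUtf8(File file, String contents) throws IOException {
        try (Writer writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8))) {
            writer.write(contents);
        }
    }

    public static void main(String[] args) throws IOException {
        File out = File.createTempFile("utf8-example", ".txt");
        out.deleteOnExit();

        // "caf\u00e9" is "café": three ASCII letters plus é.
        writeUtf8(out, "caf\u00e9");

        // In UTF-8, é is encoded as two bytes (0xC3 0xA9), so the file is 5 bytes long.
        byte[] bytes = Files.readAllBytes(out.toPath());
        System.out.println(bytes.length); // prints 5
    }
}
```

The same idea applies when reading the file back: wrap a FileInputStream in an InputStreamReader with an explicit charset rather than using FileReader.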

Re: file formats: MacRoman and UTF-8...

2011-03-28 Thread Patrick Diviacco
thanks, solved

comparing lucene scores across queries

2011-03-28 Thread Patrick Diviacco
Hi, sorry, I already asked this a few days ago, but I got no reply and I really need some help. I'm running several queries against a document collection. The queries are documents of the collection itself; I need to measure how similar each document is to the rest of the collection. Now, Lucene

RE: comparing lucene scores across queries

2011-03-28 Thread Uwe Schindler
No, scores are in general not comparable between different queries. The problem lies in many things:
- Each query has a norm factor that makes it more comparable if they are sub-clauses of a BooleanQuery. But you are right, this norm factor should be the same.
- Some queries like FuzzyQuery rely

Re: comparing lucene scores across queries

2011-03-28 Thread Patrick Diviacco
Hi, thanks for the reply. Yeah, I've read the Similarity class documentation several times, but I need some tips. My queries are BooleanQueries, but they always have the same structure (the same structure as the docs; they are actually docs from the collection): 3 fields. What if I simplify the

RE: comparing lucene scores across queries

2011-03-28 Thread Uwe Schindler
Hi Patrick, You can disable the coord factor in the constructor of BooleanQuery. Uwe

Re: comparing lucene scores across queries

2011-03-28 Thread Patrick Diviacco
Cool, so just to be sure: if I disable the coord factor, can I finally compare my BooleanQuery results?

Re: comparing lucene scores across queries

2011-03-28 Thread Patrick Diviacco
One more thing: instead of extending the BooleanQuery class to remove the coord factor, can I also extend the Similarity class to do it? The other question is still open: just to be sure, if I disable the coord factor, can I finally compare my BooleanQuery results? Thanks

RE: comparing lucene scores across queries

2011-03-28 Thread Uwe Schindler
Hi, You don't need to extend BooleanQuery, you can just pass true in its ctor, see: http://s.apache.org/QvK Of course you can also subclass DefaultSimilarity and return 1 as coord, but that is more work than passing true to a ctor. For your type of queries, disabling coord should be enough, but
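To make the coord factor concrete, here is a minimal plain-Java sketch that does not use Lucene itself. It assumes the default coord is the fraction of query clauses that matched (overlap / maxOverlap), which is how DefaultSimilarity is documented for Lucene 3.x; the class CoordSketch and its methods are hypothetical and only illustrate the arithmetic.

```java
public class CoordSketch {

    // Assumed default behaviour: scale the score by the fraction of
    // query clauses that the document actually matched.
    static float defaultCoord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // What "disabling coord" amounts to: the factor is always 1,
    // so partial matches are no longer scaled down.
    static float disabledCoord(int overlap, int maxOverlap) {
        return 1.0f;
    }

    public static void main(String[] args) {
        float rawScore = 2.0f; // hypothetical sum of matching clause scores

        // Document matches 1 of 2 clauses.
        System.out.println(rawScore * defaultCoord(1, 2));
        System.out.println(rawScore * disabledCoord(1, 2));
    }
}
```

With the default factor, two documents with the same raw clause scores end up with different final scores depending on how many clauses matched; with the factor fixed at 1, that source of variation is removed.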

Re: comparing lucene scores across queries

2011-03-28 Thread Patrick Diviacco
ok thanks, I will pass well I dunno how to verify it. Even if I try then I get some scores, but I dunno if comparing them is reliable.

RE: comparing lucene scores across queries

2011-03-28 Thread Uwe Schindler
Hi, As you seem to want to do very specific things, it might still be interesting to provide a modified Similarity (by subclassing DefaultSimilarity). You could then, e.g., also return 1.0 from queryNorm() to disable it, which may also be a problem (but it isn't for your queries). Theoretically, you can

Re: comparing lucene scores across queries

2011-03-28 Thread Patrick Diviacco
I see. Well, if you say the norm isn't a problem in my case, I will just disable the coord factor by constructing the query with BooleanQuery(true), and I should be done. If this is not correct, please let me know, anybody.

Re: comparing lucene scores across queries

2011-03-28 Thread Chris Hostetter
: I see, well if you say the norm isn't a problem for my case, I will just
: disable the coord factor by initializing BooleanQuery(true); and I should be
: done.
queryNorm shouldn't be a problem (since your BooleanQueries all have the same structure, and don't use query boosts ... I assume) but

a faster way to addDocument and get the ID just added?

2011-03-28 Thread Trejkaz
Hi all. I'm trying to parallelise writing documents into an index. Let's set aside the fact that 3.1 is much better at this than 3.0.x... but I'm using 3.0.3. One of the things I need to know is the doc ID of each document added so that we can add them into auxiliary database tables which are