Thank you all.
@Muir
Thanks for sharing your views. I'd like some more details on the
process you mentioned, as I have absolutely no idea about this highlighting
stuff and could not make much out of your mail. Could you point me to some
tutorials or good write-ups on the topic, if you have any?
Hi,
I may be missing something obvious, but how do I get the payloads for
the specific token positions that were matched by a query?
For example, if I have a phrase query like "A keyword B" that matches
the field "A keyword B A", I can get the payloads for A and B with
IndexReader.termPositions()
http://vtd-xml.sf.net
- Original Message -
From: "Sithu D. Sudarsan"
To: java-user@lucene.apache.org
Sent: Thursday, May 21, 2009 7:42:59 AM GMT -08:00 US/Canada Pacific
Subject: Parsing large xml files
Hi,
While trying to parse xml documents of about 50MB size, we run into
Thanks Mike. In the meantime I'll just not close them. :)
On Thu, May 21, 2009 at 12:19 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:
> You're right, SegmentTermDocs/TermEnum.close calls close on its
> IndexInputs, but those IndexInputs were obtained by calling clone() on
> the "real
Thanks for the response ! Will post my findings.
Thx,
~preetham
Michael McCandless wrote:
Alas, Lucene in general does not do such structural optimization (and
I agree, we should). EG we could do it during Query.rewrite().
There are certain corner cases that are handled, eg a BooleanQuery
wit
>>>>>> Can you post your indexReader/Searcher initialization code from your
>>>>>> standalone app, as well as your webapp.
>>>>>>
>>>>>> Could you further post your Analyzer Setup/Query Building code from
>>>>>> both
On Thu, May 21, 2009 at 3:09 PM, Max Lynch wrote:
> Sorry, the following code is in python, but I can hack a Java thing together
> if necessary.
I'm a big Python fan :)
> HighlighterSpanScorer is the SpanScorer from the highlight
> package just renamed to avoid conflict with the other SpanScorer
On Thu, Apr 30, 2009 at 5:16 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:
> On Thu, Apr 30, 2009 at 12:15 AM, Max Lynch wrote:
> > You should switch to the SpanScorer (in o.a.l.search.highlighter).
> >> That fragment scorer should only match true phrase matches.
> >>
> >> Mike
> >>
Alas, Lucene in general does not do such structural optimization (and
I agree, we should). EG we could do it during Query.rewrite().
There are certain corner cases that are handled, eg a BooleanQuery
with a single BooleanClause, or BooleanQuery where
minimumNumberShouldMatch exceeds the number of
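A rough sketch of the single-clause corner case, using a toy query tree (the Query/Bool/Term classes here are hypothetical stand-ins for illustration, not Lucene's actual classes):

```java
import java.util.ArrayList;
import java.util.List;

public class FlattenDemo {
    // Toy query tree illustrating the corner case mentioned above:
    // a boolean node with a single child can be replaced by that child.
    interface Query {}
    static class Term implements Query {
        final String field, text;
        Term(String field, String text) { this.field = field; this.text = text; }
    }
    static class Bool implements Query {
        final List<Query> clauses = new ArrayList<>();
    }

    // Recursively rewrite the tree, collapsing single-clause Bool nodes.
    static Query rewrite(Query q) {
        if (q instanceof Bool) {
            Bool b = (Bool) q;
            List<Query> rewritten = new ArrayList<>();
            for (Query c : b.clauses) rewritten.add(rewrite(c));
            if (rewritten.size() == 1) return rewritten.get(0); // collapse
            Bool out = new Bool();
            out.clauses.addAll(rewritten);
            return out;
        }
        return q;
    }
}
```

A more general structural optimizer would also merge nested Bool nodes with compatible occurrence flags, which is the part Lucene does not do automatically.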
Hello,
Perhaps the following will help:
asf-lucene/contrib$ ff HighFreq*java
./miscellaneous/src/java/org/apache/lucene/misc/HighFreqTerms.java
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Ridzwan Aminuddin
> To: java-user@lucene.apa
Darned that Google; they need to do better ;)
Here's the entry from CHANGES.txt on Lucene's trunk:
2. LUCENE-1382: Add an optional arbitrary String "commitUserData" to
IndexWriter.commit(), which is stored in the segments file and is
then retrievable via IndexReader.getCommitUserData ins
On Thu, May 21, 2009 at 1:12 PM, Michael McCandless
wrote:
> Sorry for the slow response.
>
> It's really not clear when 2.9 will be released. We have accumulated
> a number of good improvements -- higher performance field sorting, new
> higher performance Collector (replaces HitCollector) API,
>
Sorry for the slow response.
It's really not clear when 2.9 will be released. We have accumulated
a number of good improvements -- higher performance field sorting, new
higher performance Collector (replaces HitCollector) API,
segment-based searching, attaching a String label to each commit from
Hi,
I am wondering if Lucene internally rewrites/optimizes a Query. I am
programmatically generating Queries based on various user options, and
quite often I have BooleanQueries wrapped inside BooleanQueries, etc.
Like,
((Src:Testing Dst:Test) (Src:Test2 Port:http)).
In this case, would Lucene optim
This is often requested, but Lucene doesn't make it easy. I'd love
for someone to come up and build this feature :)
Do you need term freqs for just the top N that were collected? Or for
all docs that matched the query?
Mike
On Thu, May 21, 2009 at 6:34 AM, Robert Young wrote:
> Hi,
> I would
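For the "summary of term frequencies of the result" question above, one common approach is to aggregate per-document term counts over the collected hits. The sketch below assumes you have already extracted each top-N document's tokens (in Lucene you might get them from stored term vectors or by re-analyzing a stored field); the class and method names are illustrative only:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermFreqSummary {
    // Aggregates term frequencies across the token streams of the top-N
    // hits. Obtaining the per-document tokens (e.g. from term vectors)
    // is assumed to have happened already.
    public static Map<String, Integer> summarize(List<List<String>> docTokens) {
        Map<String, Integer> freqs = new HashMap<>();
        for (List<String> tokens : docTokens)
            for (String t : tokens)
                freqs.merge(t, 1, Integer::sum); // add 1 per occurrence
        return freqs;
    }
}
```

For all docs that matched the query (rather than just the top N), you would instead run this aggregation inside a collector as hits stream by, which is why the answer above distinguishes the two cases.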
You're right, SegmentTermDocs/TermEnum.close calls close on its
IndexInputs, but those IndexInputs were obtained by calling clone() on
the "real" IndexInputs and so for NIOFSDirectory, FSDirectory and
RAMDirectory at least, when a clone's close() is called, that's a
no-op.
I think there are many p
Does anyone set that property in order to customize the SegmentReader
class that Lucene uses?
A while back, this was added for GCJ specific code (appears under
src/gcj/* in a source checkout), but that code hasn't kept up w/
recent changes to Lucene (eg readOnly IndexReader) and won't work
out-of-
TrieRangeQuery - thanks for the tip.
-Original Message-
From: Michael McCandless
Reply-To: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: Does Lucene fail fast on boolean queries?
Date: Thu, 21 May 2009 11:39:23 -0400
On Thu, May 21, 2009 at 10:58 AM, Joel Halbert wrote:
On Thu, May 21, 2009 at 10:58 AM, Joel Halbert wrote:
> Thx. We're not relying on the internal implementation, but I was
> wondering how efficient it is at doing a boolean AND query.
>
> i.e. does clause precedence affect the efficiency of the query - so is X
> && Y
Thanks, I'll try that and get back to you
Sincerely,
Sithu D Sudarsan
-Original Message-
From: Michael Barbarelli [mailto:mbarbare...@gmail.com]
Sent: Thursday, May 21, 2009 10:52 AM
To: java-user@lucene.apache.org
Subject: Re: Parsing large xml files
Why not use an XML pull parser?
What fails and what is the stack trace? Have you tried just
parsing the XML in a stand-alone program independent of
indexing?
You should easily be able to parse a 50MB file with that much
memory. I suspect something else is going on here. Perhaps you're
not *really* allocating that much memory to
try http://piccolo.sourceforge.net/ - it is small and fast.
-Original Message-
From: Michael Barbarelli
Reply-To: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: Parsing large xml files
Date: Thu, 21 May 2009 15:52:00 +0100
Why not use an XML pull parser? I recommend against using an in-memory parser.
Thx. We're not relying on the internal implementation, but I was
wondering how efficient it is at doing a boolean AND query.
i.e. does clause precedence affect the efficiency of the query - so is X
&& Y faster than Y && X if there are fewer hits for X. From how you
de
Why not use an XML pull parser? I recommend against using an in-memory
parser.
On Thu, May 21, 2009 at 3:42 PM, Sudarsan, Sithu D. <
sithu.sudar...@fda.hhs.gov> wrote:
>
> Hi,
>
> While trying to parse xml documents of about 50MB size, we run into
> OutOfMemoryError due to java heap space. Incre
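As a concrete alternative to an in-memory parser, the JDK's built-in StAX pull parser (javax.xml.stream) streams the document with flat memory use regardless of file size. A minimal sketch (the `doc` element name is just an example, not anything from the original mails):

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.Reader;
import java.io.StringReader;

public class PullParseDemo {
    // Counts <doc> elements without building a DOM: the reader is pulled
    // event by event, so memory use does not grow with document size.
    public static int countDocs(Reader xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        int count = 0;
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT
                    && "doc".equals(r.getLocalName())) {
                count++;
            }
        }
        r.close();
        return count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><doc>a</doc><doc>b</doc></root>";
        System.out.println(countDocs(new StringReader(xml))); // prints 2
    }
}
```

For indexing, you would build and add a Lucene Document each time the end tag of a record element is seen, instead of just counting.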
Hi,
While trying to parse XML documents of about 50MB in size, we run into an
OutOfMemoryError due to Java heap space. Increasing the JVM heap to close
to 2GB (the max) does not help. Is there any API that could be used to
handle such large single XML files?
If Lucene is not the right place, please l
Greetings all,
I currently have a FieldExistsFilter which returns all documents that
contain a particular field. I'm in the process of converting my custom
filters to be DocIdSet based rather than BitSet based. This filter, however,
requires the use of a TermDocs object to iterate over terms and
D
Well... scoring of AND queries currently is done doc-at-once.
So Lucene will first step to doc 1 for Name, then ask age to skip to
doc >= 1, will see that both have doc=1 and collect it. The same
thing happens for doc=2. Then, Lucene will ask for the next doc of
Name, which returns "false" (end
Thx. So, just to clarify, in the example I gave below...
Lucene will search for documents matching on Name and find doc 1 and doc
2.
Then it will search age and find docs 1, 2 and then break. It will not
go on to seek 5 and 10...?
-Original Message-
From: Michael McCandless
Reply-To: jav
>>>>>> Could you further post your Analyzer Setup/Query Building code from
>>>>>> both apps.
>>>>>>
>>>>>> Could you further post the document creation code used at indexing
>>>>>> time? (Which analyzer, and which fields are index
It's definitely an area in Lucene that could use some improvement.
My recommendation for multilingual text is to apply the Unicode "default"
algorithms:
Tokenize text according to UAX #29: Unicode text segmentation
Apply full case folding (Unicode ch. 3.13) with FC_NFKC closure
Apply UAX #15: Unicode normalization
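The JDK alone cannot do all of the above exactly (it has no full UAX #29 segmentation or full case folding; ICU is the usual library for that), but a rough approximation of the pipeline, using BreakIterator for word segmentation, lowercasing as a stand-in for case folding, and NFKC via java.text.Normalizer, looks like this:

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class UnicodeNormDemo {
    // JDK-only approximation of the steps above: NFKC normalization,
    // lowercasing (a rough substitute for full case folding), and
    // BreakIterator word segmentation (close to, but not exactly, UAX #29).
    public static List<String> tokens(String text) {
        String norm = Normalizer.normalize(text, Normalizer.Form.NFKC)
                                .toLowerCase(Locale.ROOT);
        BreakIterator bi = BreakIterator.getWordInstance(Locale.ROOT);
        bi.setText(norm);
        List<String> out = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String t = norm.substring(start, end).trim();
            // Keep only segments that begin with a letter or digit.
            if (!t.isEmpty() && Character.isLetterOrDigit(t.codePointAt(0))) out.add(t);
        }
        return out;
    }
}
```

NFKC folds compatibility forms, so for example the "fi" ligature and fullwidth Latin letters end up as plain ASCII before tokenization.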
Yes.
As soon as Lucene sees that the Name docID iteration has ended, the
search will break.
Mike
On Thu, May 21, 2009 at 8:44 AM, Joel Halbert wrote:
> Hi,
>
> When Lucene performs a Boolean query, say:
>
> Field Name = Male
> AND
> Field Age = 30
>
> assuming the resultant docs for each portio
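The doc-at-once AND evaluation described above can be simulated over two sorted docID lists. This toy version stops as soon as either list is exhausted, so with Name = [1, 2] and Age = [1, 2, 5, 10], docs 5 and 10 are never visited:

```java
import java.util.ArrayList;
import java.util.List;

public class ConjunctionDemo {
    // Simulates doc-at-once AND scoring: advance one iterator, let the
    // other skip to that doc, and break as soon as one side ends.
    public static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { hits.add(a[i]); i++; j++; } // both match: collect
            else if (a[i] < b[j]) i++;  // like skipTo(b[j]) on the first list
            else j++;                   // like skipTo(a[i]) on the second list
        }
        // Once the shorter list (e.g. Name) is exhausted, the loop exits
        // without touching the remaining docs of the longer list.
        return hits;
    }
}
```

This also shows why clause order barely matters for correctness: the intersection is symmetric, and the skipping means cost is driven mostly by the rarer term.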
I suspect that your boost values are too small to really influence the
scores very much. Have you tried using boost values of, say,
d:5^100 OR uid:10^10 OR lang:lisp ?
But if you have specific documents that you *know* you want in
specific places, why play around with boosting at all? You can use
s
Hello, your example (Hindi) is probably suffering from a number of search
issues:
I don't recommend StandardAnalyzer for this example, as it will break words
around dependent vowels, the nukta dot, etc.
WhitespaceAnalyzer might be a good start.
Also, is it possible to apply Unicode normalization
Hi All,
I've indexed some docs [non-English] in Unicode UTF-8 format. For both
indexing and searching/querying I'm using SimpleAnalyzer. For English
texts, when I tried with single words it worked; then I thought of trying
non-English texts. So I wrote those words [multiple words] in babe
> If I index English pages
> with the same indexer, it will not take care of stemming and stop word
> removal?
Correct.
> Can't we have a single indexer that handles non-English and English
> equally well?
You can have a single indexer, but, if you wanted to use one Analyzer for
English docume
It's been a few days and we haven't heard back about this issue - may we
assume that you fixed it by using fully qualified paths?
Matt
Ian Lea wrote:
Marco
You haven't answered Matt's question about where you are running it
from. Tomcat's default directory may well not be the same as y
Initially I was using StandardAnalyzer, but I switched to SimpleAnalyzer,
which I guess does not do much more than tokenizing, and I think it does
not do stemming - which I don't/can't do because I have no stemmer for the
languages I'm indexing.
For indexing and querying I'm using the sam
The highlighter should be language independent, so long as you are
consistent with your use of Analyzer between
indexing/query/highlighting.
As for the most appropriate Analyzer to use for your local language, this
is a separate question - especially if you are using stop word and
stemming filters
Hi,
When Lucene performs a Boolean query, say:
Field Name = Male
AND
Field Age = 30
assuming the resultant docs for each portion of the query were:
Matching docs for: Name = 1,2
Matching docs for: Age = 1,2,5,10
Will Lucene stop searching for documents matching the Age term once it
has found
Thank you very much. As you told me, I just added a single line in the JSP
page setting the charset to UTF-8 and it worked like a charm. Thank you.
KK
On Thu, May 21, 2009 at 5:47 PM, Uwe Schindler wrote:
> If you print the result e.g. to a webpage through the servlet API, the
> output is don
Hi All,
I was looking for various ways of implementing hit highlighting in Lucene
and found some standard classes that do support highlighting, like this:
lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/highlight/package-summary.html
Ok, but what I believe is that this is only for
Hi KK,
> right? and remove this conversion that I'm doing later ,
>
> byte [] utfEncodeByteArray = textOnly.getBytes();
> String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-
> 8"));
>
> This will make sure I'm not depending on the platform encoding, right?
In principle, yes
If you print the result, e.g. to a webpage through the servlet API, the
output is done with ISO-8859-1 (which is the default for HTTP). If you want
to change this, you must tell the servlet layer the encoding before getting
a PrintWriter (response.setCharacterEncoding(), response.setContentType("text/html;
ch
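A quick way to see why the response encoding matters: interpreting UTF-8 bytes as ISO-8859-1 (the HTTP default mentioned above) garbles every non-ASCII character. A self-contained sketch:

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Encodes a string as UTF-8 bytes, then decodes those bytes as
    // ISO-8859-1, reproducing the classic mojibake a browser shows when
    // the servlet layer is not told the page is UTF-8.
    public static String misread(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(misread("\u00e9")); // "é" comes out as "Ã©"
    }
}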
See http://www.lucidimagination.com/search/document/7fe40486bc935ce4/get_term_neighbours
(although I think you can do better than the code in the third reply
by using a TermVectorMapper such that you can process the TermVector
as it comes from disk.)
Essentially, you need to use a combinati
Hi,
I would like to perform a query and then get a summary of the term
frequencies of the result. Is this possible?
Thanks
Rob
I did all the changes but no improvement. The data is getting indexed
properly, I think, because I'm able to see the results through Luke, and
Luke has an option for seeing the results in both UTF-8 encoding and the
default string encoding. I tried both but saw no difference. In both cases
I'm able t
Thanks @Uwe.
# To answer your last mail's query: textOnly is the output of the method
downloadPage(), the complete text including all HTML tags etc.
# Instead of doing the encode/decode later, what I should do is, when
downloading the page through a BufferedReader, set the charset to UTF-8 as
you me
I forgot:
> byte [] utfEncodeByteArray = textOnly.getBytes();
> String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-
> 8"));
>
> here textonly is the text extracted from the downloaded page
What is textOnly here? A String? If so, why decode and then re-encode it?
The impor
Hello KK,
> Thanks for your quick response. Let me explain the whole thing.
> I'm downloading the pages for given URLs and then extracting text and
> converting that to Unicode UTF-8 this way:
>
> byte [] utfEncodeByteArray = textOnly.getBytes();
> String utfString = new String(utfEncodeByteArray
Thanks for your quick response. Let me explain the whole thing.
I'm downloading the pages for given URLs and then extracting text and
converting that to Unicode UTF-8 this way:
byte [] utfEncodeByteArray = textOnly.getBytes();
String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-8