> I guess that the obvious question is "Which characters are
> considered 'punctuation characters'?".
Punctuation = ("_"|"-"|"/"|"."|",")
> In particular, does the analyzer consider "=" (equal) and
> ":" (colon) to be punctuation characters?
":" is a special character in QueryParser (if you are
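
Read literally, the Punctuation rule quoted above lists five characters, so neither "=" nor ":" is among them. A minimal sketch of that membership check in plain Java follows; the class and method names are made up for illustration, and this is not the analyzer's own code.

```java
public class PunctuationSketch {
    // The five characters from the Punctuation rule quoted above.
    private static final String PUNCTUATION = "_-/.,";

    /** Returns true if c is one of the five characters in the quoted rule. */
    static boolean isPunctuation(char c) {
        return PUNCTUATION.indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        // Check the five listed characters plus the two asked about.
        for (char c : new char[] {'_', '-', '/', '.', ',', '=', ':'}) {
            System.out.println("'" + c + "' -> " + isPunctuation(c));
        }
    }
}
```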
Given a term say "apache", I want to look up the lucene index
programmatically to find out its frequency in the corpus.
On Fri, Jul 31, 2009 at 12:23 AM, wrote:
>
> prashant ullegaddi wrote:
> > How to get the number of times a term occurs in the Lucene index?
> >
> > Regards,
> > Prashant
> Given a term say "apache", I want to look up the lucene index
> programmatically to find out its frequency in the corpus.
I think you are asking about the collection frequency of a term. Term
frequency is defined between a document and a term, and is printed in the
loop in the following code. And at
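
To make the distinction concrete, here is a small sketch over an in-memory corpus rather than a real index. It is only an illustration of the three counts; all names in it are hypothetical. Against a Lucene 2.x index, the corresponding calls would be IndexReader.docFreq(Term) for the document count and TermDocs.freq() for the per-document count, summed in a loop to get the collection frequency.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TermStatsSketch {
    /** Term frequency: occurrences of the term inside one document. */
    static int termFrequency(String term, List<String> doc) {
        return Collections.frequency(doc, term);
    }

    /** Document frequency: number of documents containing the term at least once. */
    static int documentFrequency(String term, List<List<String>> corpus) {
        int n = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                n++;
            }
        }
        return n;
    }

    /** Collection frequency: term frequency summed over every document. */
    static long collectionFrequency(String term, List<List<String>> corpus) {
        long total = 0;
        for (List<String> doc : corpus) {
            total += termFrequency(term, doc);
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy corpus of already-tokenized documents (an assumption of this sketch).
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("apache", "lucene", "apache"),
                Arrays.asList("lucene"),
                Arrays.asList("apache"));
        System.out.println("tf in doc 0: " + termFrequency("apache", corpus.get(0)));
        System.out.println("doc freq:    " + documentFrequency("apache", corpus));
        System.out.println("coll freq:   " + collectionFrequency("apache", corpus));
    }
}
```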
Hmm... this doesn't sound right.
That example (ThreadedIndexWriter) is meant to be a drop-in
replacement, wherever you use an IndexWriter, that keeps an
under-the-hood thread pool (using java.util.concurrent.*) to
add/update documents with multiple threads.
It should not result in a smaller index
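
For readers who have not seen the example, the general shape of such a wrapper can be sketched as below. This is not the ThreadedIndexWriter source: DocSink is a hypothetical stand-in for the one IndexWriter method used, and the constructor parameters are assumptions chosen to mirror a thread count and a bounded queue size.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ThreadedWriterSketch {
    /** Hypothetical stand-in for the add-document call of a real writer. */
    interface DocSink {
        void addDocument(String doc) throws Exception;
    }

    private final DocSink delegate;
    private final ExecutorService pool;

    ThreadedWriterSketch(DocSink delegate, int numThreads, int maxQueueSize) {
        this.delegate = delegate;
        // Bounded queue; when it is full, the calling thread runs the job itself.
        this.pool = new ThreadPoolExecutor(
                numThreads, numThreads, 0L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<Runnable>(maxQueueSize),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    /** Queues the document; a pool thread (or the caller) hands it to the delegate. */
    void addDocument(String doc) {
        pool.execute(() -> {
            try {
                delegate.addDocument(doc);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
    }

    /** Waits for every queued job to finish before returning. */
    void close() {
        pool.shutdown();
        try {
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The point of waiting in close() is that no queued document is dropped; the same documents reach the underlying writer, only from several threads.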
Hi, new here.
I recently started using Lucene and have encountered a problem. I crawl and
index a number of documents.
When I perform a search, let's say "tall fat", by rights the results that
match all the keywords should be on top and displayed first.
But in my search results, some of the document
Thanks Ahmet. This answers my question.
On Fri, Jul 31, 2009 at 1:30 PM, AHMET ARSLAN wrote:
>
>
> > Given a term say "apache", I want to look up the lucene index
> > programmatically to find out its frequency in the corpus.
>
> I think you are asking collection frequency of a term. Term Frequen
> When I perform a search, let's say "tall fat", by rights the
> results that match all the keywords should be on top and displayed first.
The answer to your question lies at the end of this thread:
http://www.nabble.com/Generating-Query-for-Multiple-Clauses-in-a-Single-Field-td24694748.html
Hi
It's not quite that simple. Other things being equal, results that
match all keywords are likely to come first but there are other
factors such as term frequency and the length of the document.
Searcher.explain() will give you the gory details. Luke will let you
see what is in your index.
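
As a rough illustration of why a document matching only one keyword can still come first, here is a toy scorer loosely in the spirit of Lucene's default similarity: square root of term frequency, an idf factor, a coordination factor for how many query terms matched, and a 1/sqrt(length) normalization. Query normalization, boosts, and the precision loss in stored norms are left out, so these numbers will not match what Searcher.explain() reports. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class ScoreSketch {
    /** Inverse document frequency: rarer terms get a larger weight. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Simplified score of one tokenized document for a list of query terms. */
    static double score(String[] queryTerms, String[] docTokens, List<String[]> corpus) {
        int matched = 0;
        double sum = 0.0;
        for (String q : queryTerms) {
            int freq = 0;
            for (String t : docTokens) {
                if (t.equals(q)) {
                    freq++;
                }
            }
            if (freq == 0) {
                continue; // this query term does not occur in the document
            }
            matched++;
            int docFreq = 0;
            for (String[] d : corpus) {
                if (Arrays.asList(d).contains(q)) {
                    docFreq++;
                }
            }
            double w = idf(docFreq, corpus.size());
            sum += Math.sqrt(freq) * w * w;
        }
        double coord = (double) matched / queryTerms.length; // fraction of terms matched
        double lengthNorm = 1.0 / Math.sqrt(docTokens.length); // shorter docs score higher
        return coord * lengthNorm * sum;
    }

    public static void main(String[] args) {
        // Long document containing both keywords once each.
        String[] longDoc = new String[100];
        Arrays.fill(longDoc, "filler");
        longDoc[0] = "tall";
        longDoc[1] = "fat";
        // Short document containing only one keyword, repeated.
        String[] shortDoc = {"tall", "tall", "tall", "tall"};
        List<String[]> corpus = Arrays.asList(longDoc, shortDoc);
        String[] query = {"tall", "fat"};
        System.out.println("both keywords, long doc: " + score(query, longDoc, corpus));
        System.out.println("one keyword, short doc:  " + score(query, shortDoc, corpus));
    }
}
```

In this toy corpus the short one-keyword document ends up with the higher score, because its length normalization and term frequency outweigh the penalty for missing a query term.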
Hi All,
I am new to Lucene and I am working on a search application.
My application needs dynamic data retrieval from the database. That means,
based on my previous step output, I need to retrieve entries from the DB for
the next step.
For example, if my search query contains "Name" field entry,
It might be because there are hardly any documents containing both the
words.
Try exact search: "\"tall fat\""
On Fri, Jul 31, 2009 at 3:31 PM, bourne71 wrote:
>
> Hi, new here.
>
> I recently started using Lucene and have encountered a problem. I crawl and
> index a number of documents.
> When I pe
Is there any difference between using QueryParser and
MultiFieldQueryParser when there is a single default search field?
Depending on how many default search fields I have when searching an index, I
select which of the two QueryParsers to use, but does it matter if I just
use MultiFieldQueryParser all the
And to address the stop word issue, you can override the stop word list
that it uses.
Most analyzers that use stop words (Standard included) have an option to
pass in an arbitrary list of stop words, which will override the defaults.
You could also just roll your own (which is what you are goin
I'd guess there wouldn't be any difference, but haven't tried it. Try
it out and see what query.toString() says in each case.
--
Ian.
On Fri, Jul 31, 2009 at 1:37 PM, Paul Taylor wrote:
> Is there any difference between using QueryParser and MultiFieldQueryParser
> when have single default sea
Is Lucene capable of handling UCS4 data natively?
Thanks,
Mike
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
This would not make much of a difference. I would guess that you have
one additional "wrapping" boolean query if you use
MultiFieldQueryParser. For query "foo AND bar" the MFQueryParser
creates +(fname:foo) +(fname:bar) and QueryParser would create
+fname:foo +fname:bar so in this case one level of
In MultiFieldQueryParser, you can specify the different fields of the
document to be searched.
E.g. if you index different fields of the document's contents, such as URL,
BOLD, ITALIC, you can search over all of them.
Additionally, there is a provision to boost one field over another as well.
If I understand you correctly you are asking if Lucene can deal with
encodings that use more than 16 bits. Well yes and no but mainly no.
The support for Unicode 4.0 was introduced in Java 1.5 and Lucene core
still has back-compat requirements for Java 1.4. Lucene's analyzers
make use of char[] all
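
The underlying issue can be seen in plain Java, independent of Lucene: a character outside the 16-bit range occupies two char values (a surrogate pair), so code that walks a char[] one element at a time sees two units where there is one code point. The sketch below uses only the java.lang code point methods added in Java 1.5; the class name is made up for illustration.

```java
public class CodePointSketch {
    /** Decodes a string into Unicode code points, joining surrogate pairs. */
    static int[] codePoints(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int i = 0;
        int k = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            out[k++] = cp;
            i += Character.charCount(cp); // 2 for supplementary characters, else 1
        }
        return out;
    }

    public static void main(String[] args) {
        // U+1D11E (musical symbol G clef) is stored as the pair D834 DD1E.
        String s = "a\uD834\uDD1Eb";
        System.out.println("char units:  " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        for (int cp : codePoints(s)) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```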
Thanks Matt. Thanks Paul. I'm up early (PST) and ready for a major
rewrite of my indexer. I think these changes are going to make a huge
difference.
Cheers,
Phil
On Fri, Jul 31, 2009 at 5:52 AM, Matthew Hall wrote:
> And to address the stop word issue, you can override the stop word list that
> i
Thanks for your quick response!
Mike
On Fri, Jul 31, 2009 at 10:25 AM, Simon
Willnauer wrote:
> If I understand you correctly you are asking if lucene can deal with
> encodings that use more than 16 bit. Well yes and no but mainly no.
> The support for unicode 4.0 was introduced in Java 1.5 and l
Hi Ahmet,
Thanks for the clarification and information! That was exactly what I was
looking for.
Jim
AHMET ARSLAN wrote:
>
> > I guess that the obvious question is "Which characters are
> > considered 'punctuation characters'?".
>
> Punctuation = ("_"|"-"|"/"|"."|",")
>
> > In part
Simon Willnauer wrote:
This would not make much of a difference. I would guess that you have
one additional "wrapping" boolean query if you use
MultiFieldQueryParser. For query "foo AND bar" the MFQueryParser
creates +(fname:foo) +(fname:bar) and QueryParser would create
+fname:foo +fname:bar so
On Fri, Jul 31, 2009 at 5:00 PM, wrote:
> Hi Ahmet,
>
> Thanks for the clarification and information! That was exactly what I was
> looking for.
>
> Jim
>
>
> AHMET ARSLAN wrote:
>>
>> > I guess that the obvious question is "Which characters are
>> > considered 'punctuation characters'?".
Michael, as Simon mentioned I created an issue describing where you
might run into trouble, at least in lucene core.
The low-level Lucene stuff treats these just fine (as surrogate pairs).
But most analyzers run into some trouble (things like
WhitespaceAnalyzer are OK).
Also wildcard queries
Hey Robert, good to see that you found the link :)
On Fri, Jul 31, 2009 at 6:06 PM, Robert Muir wrote:
> Michael, as Simon mentioned I created an issue describing where you
> might run into trouble, at least in lucene core.
>
> The low-level lucene stuff, it treats these just fine (as surrogate pa
Michael, just out of curiosity, did you have a particular Analyzer in
mind you were planning on using, or rather certain features in Lucene
you were concerned would work with these codepoints?
On Fri, Jul 31, 2009 at 12:19 PM, Simon
Willnauer wrote:
> Hey Robert, good to see that you found the lin
Hi,
I still am new to Lucene, but I think I have an initial indexer app (based on
the demo IndexFiles app) working, and also have a web app, based on the demo
luceneweb web app working.
I'm still busy tweaking both, but am starting to think ahead, about operational
type issues, esp. updating
You're pretty much spot on. Read the FAQ entry "Does Lucene allow
searching and indexing simultaneously?" for one of your questions (the
answer is yes btw). With only a single update app running there won't
be any locking issues. When the updater code opens the index you'll
need to ensure that i
Hi Jim,
There should not be much difference from the lucene end between a new
index and index you want to update (add more documents to). As stated
in the Lucene docs IndexWriter will create the index "if it does not
already exist".
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/in
Not really. At this point, I just needed to know where the UCS4
support stands. I'm reasonably familiar with the various analyzers and
what they can do. It's just the state of UCS4 support that might be an
issue for us.
Thanks,
Mike
On Fri, Jul 31, 2009 at 12:25 PM, Robert Muir wrote:
> Michael
Michael, makes sense. Most of the issues probably have some
workaround, so reply back if you need.
Thanks for your feedback though, it is helpful to know that it's important!
On Fri, Jul 31, 2009 at 1:36 PM, Michael Thomsen wrote:
> Not really. At this point, I just needed to know where the UCS4
>
The number of docs in the index is the same in both cases (200,000).
I haven't altered the benchmark/ code, but I used a profiler to verify
that the Benchmark main thread is closed only after all other threads
are closed.
Thanks,
-Jibo
On Jul 31, 2009, at 2:34 AM, Michael McCandless wrote:
@Michael: add yourself as a Watcher for the issue.
@Robert: I can start working on this within the next weeks - can you help too?
simon
On Fri, Jul 31, 2009 at 7:49 PM, Robert Muir wrote:
> Michael, makes sense. most of the issues probably have some
> workaround, so reply back if you need.
>
> Th
Hi Jibo,
Have you tried optimizing indexes? I do not know anything about the
implementation of ThreadedIndexWriter, but if they both optimize down
to the same size, it could just mean that ThreadedIndexWriter is not
as optimized.
Thanks,
Phil
On Fri, Jul 31, 2009 at 11:38 AM, Jibo John wrote:
>
Simon, no problem. I am looking at it now. I will just post my
approach and let people tear it apart / get things moving :)
On Fri, Jul 31, 2009 at 2:45 PM, Simon
Willnauer wrote:
> @Michael: add yourself as a Watcher for the issue.
> @Robert: I can start working on this within the next weeks - ca
Hi,
Phil and Ian,
Thanks for the responses and confirmations about this.
Assuming that our requirements (as I described earlier) don't change, it looks
like this updating/inserting thing should be pretty easy :)!
Later, and have a great weekend!
Jim
Phil Whelan wrote:
> Hi Jim,
>
Hi,
Sorry to jump in, but I've been following this thread with interest
:)...
Am I misunderstanding your original observation, that
ThreadedIndexWriter produced a smaller index? Did the ThreadedIndexWriter
also finish faster (I'm assuming that it should)?
If the index is smaller, and everyt
Hmmm... can you run CheckIndex on both indexes and post the results?
java org.apache.lucene.index.CheckIndex /path/to/index
Mike
On Fri, Jul 31, 2009 at 2:38 PM, Jibo John wrote:
> Number of docs are the same in the index for both the cases (200,000).
> I haven't altered the benchmark/ code, b
Tried with a larger set of documents (2,000,000 ) this time.
ThreadedIndexWriter
---
Size - 1.4 G
optimized - yes (as suggested by Phil)
Number of documents - 1,999,924 (No idea where the 76 documents
vanished...)
Number of terms - 3,638,801
IndexWriter
Mike,
Here you go:
IndexWriter:
$ java -classpath /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar org.apache.lucene.index.CheckIndex /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index
NOTE: testing will be more thorough if y
Hi Jibo,
Your mergeFactor is different, and the resulting numFiles (segment
files) is different. Maybe each thread is responsible for a segment
file. Just curious - do you have 3 threads?
Phil
Hi,
I don't know the answer to your questions, but I'm guessing that the answer to
#3 is probably because of the answers to #1 and #2.
Did you try to look at the indexes using Luke? That shows the top 50 terms
when it starts, so it might be obvious what the differences are, and that might
give
Hi,
I know you can use Field.Store.YES, but I want to inspect the terms /
tokens and their order related to the field name at search time. Is
this possible? Obviously this information is stored in the index, but
I can not find any API to access it. I'm guessing the answer might be
that Terms point
Hi Phil,
It's 5 threads for IndexWriter.
For ThreadedIndexWriter, I used:
writer.num.threads=16
writer.max.thread.queue.size=80
Thanks,
-Jibo
On Jul 31, 2009, at 5:01 PM, Phil Whelan wrote:
Hi Jibo,
Your mergeFactor is different, and the resulting numFiles (segment
files) is different. May
Hi,
Is there any tutorial on how to store a Lucene index in S3? How do we access the
index from S3? Are there any wrappers for Amazon S3?
The other question is how do I store and access an existing Lucene index on Google
App Engine?
Thanks in advance.
Warm Regards,
Allahbaksh
See the Term Vector capability. http://www.lucidimagination.com/search/?q=term+vectors#/p:lucene
By default the information is _not_ stored in the index. You will
need to add Field.TermVector.YES to your indexing in order for this
information to be available.
-Grant
On Jul 31, 2009, at