They're quite different beasts to use. SOLR will have you up and running
with some configuration very quickly, and if you're comfortable with servlet
containers, it'll be even faster. It has a DataImportHandler (DIH) which will
index data from a database (again, with some configuration, but not
necessarily programming). SOLR has, out of the box, support for sharding,
replication, etc.
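For the database case, DIH is driven by a small config file rather than code.
A minimal data-config.xml sketch is below; the JDBC driver, URL, table, and
column names are invented for illustration, so adjust them to your setup:

```xml
<dataConfig>
  <!-- JDBC connection details are placeholders -->
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="solr" password="secret"/>
  <document>
    <!-- one Solr document per row returned by the query -->
    <entity name="comment"
            query="SELECT id, text_comment FROM comments">
      <field column="id" name="id"/>
      <field column="text_comment" name="text_comment"/>
    </entity>
  </document>
</dataConfig>
```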

Lucene is a pure Java library that you have to write infrastructure for.
An understanding of Lucene, which SOLR uses under the covers, can be
quite valuable.
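To give a feel for the difference: even a trivial Lucene setup means writing
the plumbing yourself. A rough sketch against the 3.x-era API (signatures
vary by Lucene version, so treat this as illustrative rather than definitive):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        // You own the Directory, the writer, the analyzer choice...
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        // Indexed for search AND stored for retrieval -- two separate decisions.
        doc.add(new Field("text_comment", "this is a test",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // ...and the searcher lifecycle.
        IndexSearcher searcher = new IndexSearcher(dir);
        TopDocs top = searcher.search(
                new TermQuery(new Term("text_comment", "test")), 10);
        // top.scoreDocs[i].doc is the internal int ID; doc() fetches stored fields.
        System.out.println(searcher.doc(top.scoreDocs[0].doc).get("text_comment"));
        searcher.close();
    }
}
```

SOLR wraps all of this behind HTTP and configuration, which is why starting
there is usually faster.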

But from what you've described, I suspect you'll be better off starting
with SOLR. You can add custom bits to SOLR if you need to, but it'll almost
certainly be some time before you do, if you ever do. And it won't be as
likely to be throw-away work as it would be if you started with Lucene and
then migrated to SOLR.

Nutch is a web-crawler/indexer, so from what you've described Nutch isn't
a good match for what you're trying to do.

HTH
Erick


On Mon, Jun 21, 2010 at 3:29 AM, Victor Kabdebon
<victor.kabde...@gmail.com>wrote:

> Hi Erick,
>
> Thank you very much for your explanations. 588 years is a rather long way
> off, so you're right, maybe I don't need to worry about that problem at
> the moment.
> To answer your final question: no, indeed, I won't need to store a lot of
> data, just some keys in order to find the data in Cassandra later on.
>
> If you don't mind, please let me ask you another question :
>
> Is it really worthwhile to begin with Lucene rather than directly with
> SOLR (or Nutch)? What I mean by that is: is it the same difficulty to
> implement a search with SOLR and stay with it, instead of first
> implementing a search with Lucene and then, when the project becomes very
> big, changing to a new system?
> My goal is to have a system that can evolve over time, even if I have 1
> million documents added daily.
>
> Thank you,
> Victor
>
> 2010/6/21 Erick Erickson <erickerick...@gmail.com>
>
> > By and large, you won't ever actually be interested in very many
> > documents; what's returned in the TopDocs structure is the internal
> > document ID and score, in score order. But retrieval by document ID is
> > quite efficient; it's not a search. I'm quite sure this won't be a
> > problem.
> >
> > Adding 10,000 documents a day means that in 588 years you'll exceed a
> > 31-bit number. I don't think you really need to worry about that either.
> > And that's the worst case, assuming the ints are signed. And I believe
> > that they're unsigned anyway.
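That back-of-the-envelope math is easy to check directly (a quick sketch
using the thread's figure of 10,000 documents per day):

```java
public class DocIdHeadroom {
    public static void main(String[] args) {
        long maxDoc = Integer.MAX_VALUE; // 2,147,483,647 internal doc IDs
        long perDay = 10_000;            // documents added per day
        long days = maxDoc / perDay;     // ~214,748 days of headroom
        long years = days / 365;         // ~588 years
        System.out.println(years);       // prints 588
    }
}
```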
> >
> > What you will have to worry about is the time to get the top N
> > highest-scoring documents. That is, IndexSearcher.search() will be your
> > limiting factor long before you reach these numbers. By that time,
> > though, you'll have moved to SOLR or some other distributed search
> > mechanism.
> >
> > Performance is influenced by the complexity of the queries and the
> > structure
> > and size of your index. The time spent retrieving the top few matches is
> > completely dwarfed by the search time for an index of any size.
> >
> > All this may be irrelevant if you really want to retrieve a very large
> > number of documents rather than, say, the top 100. But the use case would
> > have to be very interesting for it to be a requirement to return, say,
> > 100,000 documents to a user.
> >
> > But do be aware that you're not retrieving the *original* text with
> > IndexSearcher. Typically, the relevant data is indexed but not stored.
> > These two concepts are confusing when you start using Lucene, especially
> > since they're specified in the same call. Indexing a field splits it up
> > into tokens and normalizes it (e.g. lowercases, stems, puts in synonyms,
> > etc.). The indexed data is the part that's searched. You can also store
> > the input verbatim, but the stored part is just a copy that's never
> > searched but is available for retrieval.
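In code, that indexed-versus-stored choice is made per field when you build
the Document. A fragment in the 3.x-era API (the field names and variables
here are just examples, not anything from your application):

```java
Document doc = new Document();
// Searchable but not retrievable: analyzed into the index, never stored.
doc.add(new Field("text_comment", commentText,
        Field.Store.NO, Field.Index.ANALYZED));
// Retrievable but never searched: stored verbatim (e.g. a Cassandra row key).
doc.add(new Field("cassandra_key", rowKey,
        Field.Store.YES, Field.Index.NO));
```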
> >
> > Which brings up one of the central decisions you need to make. Are you,
> > indeed, going to store all the data for retrieval in your index, or just
> > index the relevant text to be searched along with some locator
> > information to the original document? You mention Cassandra, which leads
> > me to speculate that it's the latter.
> >
> > HTH
> > Erick
> >
> >
> > On Sun, Jun 20, 2010 at 4:04 PM, Victor Kabdebon
> > <victor.kabde...@gmail.com>wrote:
> >
> > > Hello Simon,
> > >
> > > As I told you, I am quite new to Lucene, so there are many things that
> > > might be wrong.
> > > I'm using Lucene to build a search service for a website that receives
> > > a large amount of information daily. This information is directly
> > > available as text in a Cassandra database.
> > > There might be as many as 10,000 new documents added daily, and yes, my
> > > concern is: is it possible to retrieve more documents than the integer
> > > max value?
> > > I also don't really see how IndexSearcher.doc() works, because it seems
> > > like we give this method an ID and it then searches in the indexed
> > > documents. So what exactly does IndexSearcher.doc(int) do?
> > >
> > > *Or are you concerned about retrieving all documents containing term
> > > "XY" if the number of documents matching is large?*
> > >
> > > I'm also concerned about this problem, yes.
> > >
> > > Could you explain to me a little bit how it works, and how Lucene
> > > enables one to retrieve a very large number of documents even though
> > > it uses int?
> > >
> > > Thank you for your answers,
> > > Victor
> > >
> > > 2010/6/20 Simon Willnauer <simon.willna...@googlemail.com>
> > >
> > > > Hi, maybe I don't understand your question correctly. Are you asking
> > > > if you could run into problems if you retrieve more documents than
> > > > integer max value? Or are you concerned about retrieving all
> documents
> > > > containing term "XY" if the number of documents matching is large? If
> > > > you are afraid of loading all documents matched from a stored field I
> > > > guess you are doing something wrong.
> > > > What are you using lucene for?
> > > >
> > > > simon
> > > >
> > > > On Sun, Jun 20, 2010 at 8:00 PM, Victor Kabdebon
> > > > <victor.kabde...@gmail.com> wrote:
> > > > > Hello everybody,
> > > > >
> > > > > I am new to Apache Lucene and it seems to fit my application's
> > > > > needs perfectly.
> > > > > However, I'm a little concerned about something (pardon me if it's
> > > > > a recurrent question; I've searched the archives but didn't find
> > > > > anything about it).
> > > > >
> > > > > So here is my case :
> > > > >
> > > > > I have indexed a few files (around 10) and I'm trying to search
> > > > > for something simple in them: the word "test". So after opening
> > > > > everything etc. (assuming that also works), I do this:
> > > > >
> > > > > Term test = new Term("text_comment", "test");
> > > > > Query query = new TermQuery(test);
> > > > > TopDocs top = searcher.search(query, 10);
> > > > >
> > > > > I want to retrieve the first document (I have 2 documents in
> > > > > TopDocs), so I do:
> > > > >
> > > > > searcher.doc(top.scoreDocs[0].doc)
> > > > >
> > > > > I searched a little bit in the javadoc and I saw that this method
> > > > > uses "int" as a parameter.
> > > > > I'm a little bit concerned about this... At the moment I have 10
> > > > > documents, so that's OK, but if I want to index, let's say, 20,000
> > > > > documents, how will IndexSearcher.doc(int) be able to retrieve
> > > > > them?
> > > > > Same problem if 100,000 files have the word "test" in
> > > > > "text_comment": will I still be able to get these 100,000
> > > > > documents, or is it going to be a problem?
> > > > >
> > > > > Thank you very much.
> > > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > >
> > > >
> > >
> >
>
