Re: A Presentation on Building a Hadoop + Lucene System Architecture

2009-08-04 Thread m.harig
Hello, do you have any idea about the integration of Lucene with Hadoop? BrickMcLargeHuge wrote: > > Hey all, > > I just wanted to send a link to a presentation I made on how my > company is building its entire core BI infrastructure around Hadoop, > HBase, Lucene, and more. It fea

Re: Searching doubt

2009-08-04 Thread m.harig
Thanks all, but how does Nutch handle this problem? I am aware of Nutch, but not in depth. If I search for the keyword "about us", Nutch gives me exactly what I want. Are there any scoring techniques? Please let me know. -- View this message in context: http://www.nabble.com/Searching-doubt-tp2

A Presentation on Building a Hadoop + Lucene System Architecture

2009-08-04 Thread Bradford Stephens
Hey all, I just wanted to send a link to a presentation I made on how my company is building its entire core BI infrastructure around Hadoop, HBase, Lucene, and more. It features a decent amount of practical advice: from rules for approaching scalability problems, to why we chose certain aspects o

Re: Searching doubt

2009-08-04 Thread Phil Whelan
(sorry, tangent. I'll be quick) On Tue, Aug 4, 2009 at 8:42 AM, Shai Erera wrote: > Interesting ... I don't have access to a Japanese dictionary, so I just > extract bi-grams. Shai - if you're interested in parsing Japanese, check out Kakasi. It can split into words and convert Kanji->Katakana/Hi

Re: Nightly build link is broken

2009-08-04 Thread Michael McCandless
Hmmm... that link is old. The right one is: http://hudson.zones.apache.org/hudson/view/Lucene/job/Lucene-trunk/ Which page did you find that link on? Mike On Tue, Aug 4, 2009 at 5:40 PM, Adriano Crestani wrote: > Hi, > > I was trying to download a nightly build jar, so I went to Lucene websi

Nightly build link is broken

2009-08-04 Thread Adriano Crestani
Hi, I was trying to download a nightly build jar, so I went to Lucene website and clicked on the link that redirected to: http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/ and I got a "Firefox can't establish a connection to the server at lucene.zones.apache.org:8080". Is the link

Re: Searching doubt

2009-08-04 Thread Shai Erera
I had suggested that in my first response, but I think Harig's problem is that those words are not known in advance. Therefore, facing the query "about us" and converting it to "aboutus" is simple, but what about queries like "united states", or "united states of america"? Should they be 'grouped'

Re: How do you Parse a query to convert numbers to strings

2009-08-04 Thread Luis Alves
Hi Paul, In 2.9, you can use the "new query parser" in contrib. You should look at: original.config.FieldBoostMapAttribute original.config.FieldBoostMapFCListener original.processors.BoostQueryNodeProcessor original.builders.BoostQueryNodeBuilder this code implements boost

Re: score from spans

2009-08-04 Thread Grant Ingersoll
A SpanQuery is a Query, so if you do a search for it, you will get scores. However, the mechanism is a bit complicated, b/c actually getting the Spans is separate from doing the query. I agree there could be tighter integration. However, what you could do is use Spans.skipTo to move to t

Re: Slightly Off-topic: How to decide whether or not to add a document?

2009-08-04 Thread Amin Mohammed-Coleman
I've been working on an indexing solution using Spring Integration and Lucene. The example project uses JMS to create work items (index add or update) and then a service that polls for work to do. I should have this complete soon and will be putting it on Google Code. Not much of help right now but

Re: Slightly Off-topic: How to decide whether or not to add a document?

2009-08-04 Thread ohaya
Hi Ian, Ok, thanks for the additional info. I've implemented a check for both file.lastModified() and file.length(), and it seems to work in my dev environment (Windows), so I'll have to test on a "real" system. Thanks again, Jim Ian Lea wrote: > Jim > > > The sleep is simply > >
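
The check described above (compare file.length() and file.lastModified() before and after a short sleep) might look roughly like the following. This is only a sketch: the class name FileStabilityCheck, the method isStable, the waitMillis parameter and the "input.txt" default are all made up for illustration and are not from Jim's actual indexer.

```java
import java.io.File;

/** Sketch: treat a file as "complete" once its size and timestamp stop changing. */
class FileStabilityCheck {

    /**
     * Returns true if the file exists and neither its length nor its
     * last-modified time changed while we waited. waitMillis is a
     * hypothetical tuning knob, not a value taken from the thread.
     */
    static boolean isStable(File file, long waitMillis) {
        if (!file.isFile()) {
            return false; // missing files and directories are never "ready"
        }
        long sizeBefore = file.length();
        long modifiedBefore = file.lastModified();
        try {
            Thread.sleep(waitMillis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to callers
            return false;
        }
        return file.length() == sizeBefore && file.lastModified() == modifiedBefore;
    }

    public static void main(String[] args) {
        // "input.txt" is only a placeholder path for trying the sketch out.
        File candidate = new File(args.length > 0 ? args[0] : "input.txt");
        System.out.println(candidate + " stable: " + isStable(candidate, 500));
    }
}
```

A file that is written in slow bursts can still look stable for one interval, so the wait would probably need tuning per environment before a document is handed to the indexer.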

Re: Slightly Off-topic: How to decide whether or not to add a document?

2009-08-04 Thread Ian Lea
Jim The sleep is simply try { Thread.sleep(millis); } catch (InterruptedException ie) { } No threading issues that I'm aware of, despite the method living in the Thread class. But you're right about it possibly impacting performance, if you've got to sleep for a reasona

Re: Searching doubt

2009-08-04 Thread N Hira
Good summary, Shai. I've missed some of this thread as well, but does anyone know what happened to the suggestion about query manipulation? e.g., query (about us) => query("about us", "aboutus") query(credit card) => query("credit card", "creditcard") Regards, -h - Original Message
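
For what it's worth, the query manipulation mentioned above can be sketched in plain Java, without touching the Lucene API at all. The class and method names (QueryVariants, expand, toOrQuery) are hypothetical; the idea is simply to search for both the phrase and its space-free form.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: turn a multi-word query into the phrase plus its concatenated form. */
class QueryVariants {

    /** "about us" -> ["about us", "aboutus"]; single words are returned unchanged. */
    static List<String> expand(String query) {
        List<String> variants = new ArrayList<>();
        String normalized = query.trim().replaceAll("\\s+", " ");
        if (normalized.isEmpty()) {
            return variants;
        }
        variants.add(normalized);
        String joined = normalized.replace(" ", "");
        if (!joined.equals(normalized)) {
            variants.add(joined); // only add the glued form when it differs
        }
        return variants;
    }

    /** Renders the variants as a quoted OR expression for a query parser. */
    static String toOrQuery(String query) {
        List<String> quoted = new ArrayList<>();
        for (String variant : expand(query)) {
            quoted.add('"' + variant + '"');
        }
        return String.join(" OR ", quoted);
    }

    public static void main(String[] args) {
        System.out.println(toOrQuery("about us"));
        System.out.println(toOrQuery("credit card"));
    }
}
```

This only handles the direction where the query has a space and the document does not; queries of three or more words would be glued into a single token, which may or may not be what is wanted.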

Re: Searching doubt

2009-08-04 Thread Matthew Hall
Well.. search on both anyhow. "about us" OR "aboutus" should hit the spot I think. Matt Ian Lea wrote: The question was, how given a string "aboutus" in a document, you can return that document as a result to the query "about us" (note the space). So we're mostly discussing how to detect and t

Re: Slightly Off-topic: How to decide whether or not to add a document?

2009-08-04 Thread ohaya
Ian, One question about the 4th alternative: I was wondering how you implemented the sleep() in Java, esp. in such a way as not to mess up any of the Lucene stuff (in case there's threading)? Right now, my indexer/inserter app doesn't explicitly do any threading stuff. Thanks, Jim oh..

Re: Slightly Off-topic: How to decide whether or not to add a document?

2009-08-04 Thread ohaya
Hi Ian, Thanks for the quick response. I forgot to mention, but in our case, the "producers" are part of a commercial package, so we don't have a way to get them to change anything, so I think the first 3 suggestions are not feasible for us. I have considered something like the 4th suggestion (ch

Re: Searching doubt

2009-08-04 Thread Ian Lea
> The question was, how given a string "aboutus" in a document, you can return > that document as a result to the query "about us" (note the space). So we're > mostly discussing how to detect and then break the word "aboutus" to two > words. I haven't really been following this thread so apologies

Re: Slightly Off-topic: How to decide whether or not to add a document?

2009-08-04 Thread Ian Lea
A few suggestions: . Queue the docs once they are complete using something like JMS. . Get the document producers to write to e.g. xxx.tmp and rename to e.g. xxx.txt at the end . Get the document producers to write to a tmp folder and move to e.g. input/ when done . Find a file, store size, sle

Re: Searching doubt

2009-08-04 Thread Shai Erera
Interesting ... I don't have access to a Japanese dictionary, so I just extract bi-grams. But I guess that in this case, if one can access an English dictionary (are you aware of an "open-source" one, or free one BTW?), one can use the method you mention. But still, doing this for every Token you
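
As a rough illustration of the bi-gram extraction mentioned above, here is a plain-Java sketch that emits overlapping two-character grams. BigramExtractor is a hypothetical name, and this is not the code Shai actually uses; it works on Unicode code points so that characters outside the basic plane are not cut in half.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: overlapping character bi-grams, e.g. "abcd" -> ab, bc, cd. */
class BigramExtractor {

    static List<String> bigrams(String text) {
        int[] codePoints = text.codePoints().toArray();
        List<String> result = new ArrayList<>();
        // Slide a window of two code points over the text, one step at a time.
        for (int i = 0; i + 1 < codePoints.length; i++) {
            result.add(new String(codePoints, i, 2));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("abcd"));
    }
}
```

Text shorter than two characters yields no grams in this sketch, so a real analyzer would likely need a fallback for single-character tokens.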

Slightly Off-topic: How to decide whether or not to add a document?

2009-08-04 Thread ohaya
Hi, I have an app to initially create a Lucene index, and to populate it with documents. I'm now working on that app to insert new documents into that Lucene index. In general, this new app, which is based loosely on the demo apps (e.g., IndexFiles.java), is working, i.e., I can run it with a

Re: Searching doubt

2009-08-04 Thread Phil Whelan
On Tue, Aug 4, 2009 at 8:31 AM, Shai Erera wrote: > Hi Darren, > > The question was, how given a string "aboutus" in a document, you can return > that document as a result to the query "about us" (note the space). So we're > mostly discussing how to detect and then break the word "aboutus" to two >

Re: Searching doubt

2009-08-04 Thread darren
Ah, ok. Interesting problem there as well. I'll think on that one some too! cheers. > Hi Darren, > > The question was, how given a string "aboutus" in a document, you can > return > that document as a result to the query "about us" (note the space). So > we're > mostly discussing how to detec

Re: Searching doubt

2009-08-04 Thread Shai Erera
Hi Darren, The question was, how given a string "aboutus" in a document, you can return that document as a result to the query "about us" (note the space). So we're mostly discussing how to detect and then break the word "aboutus" to two words. What you wrote though seems interesting as well, onl

Re: Searching doubt

2009-08-04 Thread darren
Just catching this thread, but if I understand what is being asked I can share how I do multi-word phrase matching. If that's not what's wanted, pardons! Ok, I load an entire dictionary into a lucene index, phrases and all. When I'm scanning some text, I do lookups in this dictionary index using

Re: Searching doubt

2009-08-04 Thread Phil Whelan
On Tue, Aug 4, 2009 at 3:56 AM, Shai Erera wrote: > 2) Use a dictionary (real dictionary), and search it for every substring, > e.g. "a", "ab", "abo" ... "about" etc. If you find a match, split it there. > This needs some fine tuning, like checking if the rest is also a word and if > the full strin
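
The dictionary approach quoted above could be sketched like this in plain Java. It is a minimal sketch under several assumptions: the class name DictionarySplitter is made up, the dictionary is just an in-memory Set, a token that is itself a dictionary word is left alone, and only two-way splits where both halves are words are accepted.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: split a glued token such as "aboutus" using a word list. */
class DictionarySplitter {

    private final Set<String> words;

    DictionarySplitter(Set<String> words) {
        this.words = words;
    }

    /** Returns [head, tail] for the first split where both parts are words, else [token]. */
    List<String> split(String token) {
        if (words.contains(token)) {
            return Arrays.asList(token); // assumption: real words are never split
        }
        for (int i = 1; i < token.length(); i++) {
            String head = token.substring(0, i);
            String tail = token.substring(i);
            if (words.contains(head) && words.contains(tail)) {
                return Arrays.asList(head, tail);
            }
        }
        return Arrays.asList(token); // no valid split found
    }

    public static void main(String[] args) {
        // Tiny hypothetical dictionary, just enough for the thread's example.
        Set<String> dictionary = new HashSet<>(Arrays.asList("a", "about", "us"));
        System.out.println(new DictionarySplitter(dictionary).split("aboutus"));
    }
}
```

Requiring the tail to be a word as well is what keeps "a" + "boutus" from being accepted. Doing this lookup for every token at index time costs extra, so it may only be worth it for fields where glued words are common.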

Is it possible to receive a score when using span queries?

2009-08-04 Thread Eran Sevi
Hi, Does anyone know how to retrieve a score for span queries of any kind (especially SpanNearQuery)? Thanks, Eran.

Re: ParallelMultiSearcher and idf

2009-08-04 Thread Christian Reuschling
Hi Otis, thanks for the answer - I'm aware of Solr, but it seems this is - according to its abstraction level - too generalized for us. Solr seems to be nice in the case you want to use the black box, and won't be aware of 'what is under the hood'. But maybe I'm totally wrong. At least, it would be

Re: question about indexing/searching using standardanalyzer for KEYWORD field that contains alphanumeric data

2009-08-04 Thread Otis Gospodnetic
Leonard, Make sure the "key" or "id" fields are not analyzed and that should solve your problems. You are using some older version of Lucene? Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Mess

Re: ParallelMultiSearcher and idf

2009-08-04 Thread Otis Gospodnetic
Hi Christian, You didn't mention Solr, so I'm not sure if you are aware of it. Maybe Solr meets your needs? Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message > From: Christian Reusc

Re: How to improve search time?

2009-08-04 Thread Shashi Kant
To add to all these excellent suggestions: I would suggest creating a "baby index" out of the master index - pull out say 1000 docs into a test index and query. Helps in narrowing down the problem. On Tue, Aug 4, 2009 at 8:55 AM, Matthew Hall wrote: > Also, how long does it take Luke to do a sea

Re: How to improve search time?

2009-08-04 Thread Matthew Hall
Also, how long does it take Luke to do a search against the same index? That way you can remove any of the timing that your application is adding into the mix. If Luke doesn't take the minimum of 8 seconds... then you know it's an issue with your app. (or at least a large part of it) Matt

Re: How to improve search time?

2009-08-04 Thread Ian Lea
Still surprising that your searches are taking so long. Have you worked through everything on http://wiki.apache.org/lucene-java/ImproveSearchingSpeed, suggested by someone earlier in this thread? Are you sure that the problem is really with lucene? Is it the search itself that takes a long time,

Re: How to improve search time?

2009-08-04 Thread prashant ullegaddi
Shashi, Our queries are free text queries. But they will be expanded into: Multifield, Boolean. We are also expanding the original query using SynExpand of Lucene. A simple query gets expanded to say a query of page size. And we are not storing any other fields except key (document IDs), target UR

Re: Searching doubt

2009-08-04 Thread Shai Erera
If you don't know which tokens you'll face, then it's really a much harder problem. If you know where the token is, e.g. it's always in http://some.example.site/a/b//index.html, then it eases the task a bit. Otherwise you'll need to search every single token produced. I can think of several ways to

Re: Searching doubt

2009-08-04 Thread m.harig
Thanks, I've noticed that, but the code is for known tokens. How do I do it for dynamic tokens? Meaning, I don't know the URLs; someone picks up the URLs and I index them. Is there any technique to use while indexing? I am using Lucene version 2.4.0. Please advise. -- Vie

ParallelMultiSearcher and idf

2009-08-04 Thread Christian Reuschling
Hello, when searching over multiple indices, we create one IndexReader for each index, and wrap them into a MultiReader, that we use for IndexSearcher creation. This is fine for searching multiple indices on one machine, but in the case the indices are distributed over the (intra)net, this scenar

Indexed Field impact on Memory

2009-08-04 Thread Ganesh
Hello all, I have an indexed field. If I am not using this field in any search query, does it still consume memory? If this field is part of a filter query, would there be any impact on memory consumption? I am going to break / shorten the Date Time field and one field might be

Re: How to improve search time?

2009-08-04 Thread Ganesh
Hello Shashi, Could you please provide me your DB-related information: how big the DB is, memory, etc.? I currently have 100 million records split across 10 indexes on the same system. I am using ParallelSearcher and search speed is also good. Regards Ganesh - Original Message

Re: Searching doubt

2009-08-04 Thread Shai Erera
Well, if you have more cases like "aboutus", then I think the TokenFilter approach will help you. You should create your own Analyzer which receives another Analyzer as argument, and implement its tokenStream() like this (it's the general idea): public TokenStream tokenStream(String fld, Reader reader

Re: How to improve search time?

2009-08-04 Thread Shashi Kant
Prashant, I have had better luck with even larger sized indices on similar platforms. Could you elaborate what types of queries you are running, Multifield? Boolean? combinations? etc. Also you might want to remove unnecessary stored fields from the index and move them to a relational db to squeeze

Re: Searching doubt

2009-08-04 Thread m.harig
Thanks for your reply, my original code snippet is IndexSearcher searcher = new IndexSearcher(indexDir); Analyzer analyzer = new StopAnalyzer(); BooleanClause.Occur[] flags = { BooleanClause.Occur.SHOULD, Boolea

Re: How to improve search time?

2009-08-04 Thread prashant ullegaddi
I did that as well. Actually, we had 32 indexes initially. We searched them. It was even horrible. After that I merged them into 4 indexes. And did the same. No gain! Then, I had to merge 32 indexes into one. On Tue, Aug 4, 2009 at 10:48 AM, Anshum wrote: > Hi Prashant, > 8 seconds as the minim