Hello
Do you have any idea about integrating Lucene with Hadoop?
BrickMcLargeHuge wrote:
>
> Hey all,
>
> I just wanted to send a link to a presentation I made on how my
> company is building its entire core BI infrastructure around Hadoop,
> HBase, Lucene, and more. It fea
Thanks all,
but how does Nutch handle this problem? I am aware of Nutch, but not in
depth. If I search for the keyword "about us", Nutch gives me exactly what I
want. Are there any scoring techniques? Please let me know.
--
View this message in context:
http://www.nabble.com/Searching-doubt-tp2
Hey all,
I just wanted to send a link to a presentation I made on how my
company is building its entire core BI infrastructure around Hadoop,
HBase, Lucene, and more. It features a decent amount of practical
advice: from rules for approaching scalability problems, to why we
chose certain aspects o
(sorry, tangent. I'll be quick)
On Tue, Aug 4, 2009 at 8:42 AM, Shai Erera wrote:
> Interesting ... I don't have access to a Japanese dictionary, so I just
> extract bi-grams.
Shai - if you're interested in parsing Japanese, check out Kakasi. It
can split into words and convert Kanji->Katakana/Hi
Hmmm... that link is old. The right one is:
http://hudson.zones.apache.org/hudson/view/Lucene/job/Lucene-trunk/
Which page did you find that link on?
Mike
On Tue, Aug 4, 2009 at 5:40 PM, Adriano
Crestani wrote:
> Hi,
>
> I was trying to download a nightly build jar, so I went to Lucene websi
Hi,
I was trying to download a nightly build jar, so I went to Lucene website
and clicked on the link that redirected to:
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/
and I got a "Firefox can't establish a connection to the server at
lucene.zones.apache.org:8080".
Is the link
I had suggested that in my first response, but I think Harig's problem is
that those words are not known in advance. Therefore, facing the query
"about us" and converting it to "aboutus" is simple, but what about queries
like "united states", or "united states of america"? Should they be
'grouped'
Hi Paul,
In 2.9, you can use the "new query parser" in contrib.
You should look at:
original.config.FieldBoostMapAttribute
original.config.FieldBoostMapFCListener
original.processors.BoostQueryNodeProcessor
original.builders.BoostQueryNodeBuilder
this code implements boost
A SpanQuery is a Query, so if you do a search for it, you will get
scores. However, the mechanism is a bit complicated, b/c actually
getting the Spans is separate from doing the query. I agree there
could be tighter integration. However, what you could do is use
Spans.skipTo to move to t
I've been working on an indexing solution using Spring Integration and
Lucene. The example project uses JMS to create work items (index add or
update) and then a service that polls for work to do. I should have this
complete soon and will be putting it on Google Code. Not much help right
now but
Hi Ian,
Ok, thanks for the additional info.
I've implemented a check for both file.lastModified() and file.length(), and it
seems to work in my dev environment (Windows), so I'll have to test on a "real"
system.
Thanks again,
Jim
Ian Lea wrote:
> Jim
>
>
> The sleep is simply
>
>
Jim
The sleep is simply
try { Thread.sleep(millis); }
catch (InterruptedException ie) { }
No threading issues that I'm aware of, despite the method living in
the Thread class.
But you're right about it possibly impacting performance, if you've
got to sleep for a reasona
Good summary, Shai.
I've missed some of this thread as well, but does anyone know what happened to
the suggestion about query manipulation?
e.g., query(about us) => query("about us", "aboutus")
query(credit card) => query("credit card", "creditcard")
Regards,
-h
- Original Message
Well.. search on both anyhow.
"about us" OR "aboutus" should hit the spot I think.
Matt
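Matt's "search on both" suggestion could be wired up with a small helper that turns a multi-word phrase into a phrase-OR-concatenation query string. This is a sketch of my own; the class and method names are hypothetical, not from the thread, and a real application would feed the result to Lucene's query parser:

```java
// Hypothetical helper: build the suggested query string
// "<phrase>" OR <concatenated> from a multi-word phrase.
public class ConcatQueryBuilder {

    // e.g. "about us" -> "\"about us\" OR aboutus"
    public static String buildQuery(String phrase) {
        // Strip all whitespace to form the concatenated variant.
        String concatenated = phrase.replaceAll("\\s+", "");
        return "\"" + phrase + "\" OR " + concatenated;
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("about us"));
        System.out.println(buildQuery("credit card"));
    }
}
```

The same transformation could of course be done directly on a BooleanQuery instead of a query string.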
Ian Lea wrote:
The question was, how given a string "aboutus" in a document, you can return
that document as a result to the query "about us" (note the space). So we're
mostly discussing how to detect and t
Ian,
One question about the 4th alternative: I was wondering how you implemented
the sleep() in Java, esp. in such a way as not to mess up any of the Lucene
stuff (in case there's threading)?
Right now, my indexer/inserter app doesn't explicitly do any threading stuff.
Thanks,
Jim
oh..
Hi Ian,
Thanks for the quick response.
I forgot to mention, but in our case, the "producers" are part of a commercial
package, so we don't have a way to get them to change anything, so I think the
first 3 suggestions are not feasible for us.
I have considered something like the 4th suggestion (ch
> The question was, how given a string "aboutus" in a document, you can return
> that document as a result to the query "about us" (note the space). So we're
> mostly discussing how to detect and then break the word "aboutus" to two
> words.
I haven't really been following this thread so apologies
A few suggestions:
. Queue the docs once they are complete using something like JMS.
. Get the document producers to write to e.g. xxx.tmp and rename to
e.g. xxx.txt at the end
. Get the document producers to write to a tmp folder and move to e.g.
input/ when done
. Find a file, store size, sle
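The fourth suggestion (find a file, store its size, sleep, then re-check) might look something like the sketch below. The message is truncated, so the details here are assumed, and the class and method names are my own:

```java
import java.io.File;

// Sketch of the "store size, sleep, compare" idea (details assumed):
// a file is treated as complete only if neither its size nor its
// modification time changed while we slept.
public class FileStabilityCheck {

    public static boolean isStable(File f, long millis) throws InterruptedException {
        long size = f.length();
        long modified = f.lastModified();
        Thread.sleep(millis);
        // Unchanged size and timestamp suggest the producer has finished writing.
        return f.length() == size && f.lastModified() == modified;
    }
}
```

A poller would skip files for which isStable() returns false and retry them on the next pass.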
Interesting ... I don't have access to a Japanese dictionary, so I just
extract bi-grams. But I guess that in this case, if one can access an
English dictionary (are you aware of an "open-source" one, or free one
BTW?), one can use the method you mention.
But still, doing this for every Token you
Hi,
I have an app to initially create a Lucene index, and to populate it with
documents. I'm now working on that app to insert new documents into that
Lucene index.
In general, this new app, which is based loosely on the demo apps (e.g.,
IndexFiles.java), is working, i.e., I can run it with a
On Tue, Aug 4, 2009 at 8:31 AM, Shai Erera wrote:
> Hi Darren,
>
> The question was, how given a string "aboutus" in a document, you can return
> that document as a result to the query "about us" (note the space). So we're
> mostly discussing how to detect and then break the word "aboutus" to two
>
Ah, ok. Interesting problem there as well.
I'll think on that one some too!
cheers.
> Hi Darren,
>
> The question was, how given a string "aboutus" in a document, you can
> return
> that document as a result to the query "about us" (note the space). So
> we're
> mostly discussing how to detec
Hi Darren,
The question was, how given a string "aboutus" in a document, you can return
that document as a result to the query "about us" (note the space). So we're
mostly discussing how to detect and then break the word "aboutus" to two
words.
What you wrote though seems interesting as well, onl
Just catching this thread, but if I understand what is being asked I can
share how I do multi-word phrase matching. If that's not what's wanted,
pardons!
Ok, I load an entire dictionary into a Lucene index, phrases and all.
When I'm scanning some text, I do lookups in this dictionary index using
On Tue, Aug 4, 2009 at 3:56 AM, Shai Erera wrote:
> 2) Use a dictionary (real dictionary), and search it for every substring,
> e.g. "a", "ab", "abo" ... "about" etc. If you find a match, split it there.
> This needs some fine tuning, like checking if the rest is also a word and if
> the full strin
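Shai's substring-dictionary idea above could be sketched roughly as follows. This is a hypothetical, simplified version (my own names, an in-memory Set standing in for a real dictionary); in practice the check would run inside an analyzer and handle the fine-tuning he mentions:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of dictionary-based compound splitting:
// try every split point; if both halves are dictionary words,
// split the token there.
public class CompoundSplitter {

    public static String split(String token, Set<String> dictionary) {
        for (int i = 1; i < token.length(); i++) {
            String head = token.substring(0, i);
            String tail = token.substring(i);
            if (dictionary.contains(head) && dictionary.contains(tail)) {
                return head + " " + tail;
            }
        }
        return token; // no valid split found; keep the original token
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("about", "us"));
        System.out.println(split("aboutus", dict));
    }
}
```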
Hi,
Does anyone know how to retrieve such a score for any kind of span query
(especially SpanNearQuery)?
Thanks,
Eran.
Hi Otis,
thanks for the answer - I'm aware of Solr, but it seems it is - given its
abstraction level - too generalized for us. Solr seems to be nice in the
case where you want to use the black box and don't need to know 'what is under
the hood'.
But maybe I'm totally wrong. At least, it would be
Leonard,
Make sure the "key" or "id" fields are not analyzed and that should solve your
problems.
Are you using some older version of Lucene?
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
- Original Mess
Hi Christian,
You didn't mention Solr, so I'm not sure if you are aware of it. Maybe Solr
meets your needs?
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
- Original Message
> From: Christian Reusc
To add to all these excellent suggestions: I would suggest creating a
"baby index" out of the master index - pull out, say, 1000 docs into a
test index and query that. It helps in narrowing down the problem.
On Tue, Aug 4, 2009 at 8:55 AM, Matthew Hall wrote:
> Also, how long does it take Luke to do a sea
Also, how long does it take Luke to do a search against the same index?
That way you can remove any of the timing that your application is
adding into the mix.
If Luke doesn't take a minimum of 8 seconds... then you know it's an
issue with your app (or at least a large part of it).
Matt
Still surprising that your searches are taking so long.
Have you worked through everything on
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed, suggested by
someone earlier in this thread? Are you sure that the problem is
really with lucene? Is it the search itself that takes a long time,
Shahi,
Our queries are free text queries. But they will be expanded into:
Multifield, Boolean.
We are also expanding the original query using Lucene's SynExpand. A simple
query gets expanded to, say, a query of page size.
And we are not storing any other fields except key (document IDs), target
UR
If you don't know which tokens you'll face, then it's really a much harder
problem. If you know where the token is, e.g. it's always in
http://some.example.site/a/b//index.html,
then it eases the task a bit. Otherwise you'll need to search every single
token produced. I can think of several ways to
Thanks,
I've noticed that, but the code is for known tokens. How do I do it for
dynamic tokens? Meaning, I don't know the URLs; someone picked up the URLs
and I'll index them. Is there any technique to use while indexing? I am
using Lucene version 2.4.0. Please suggest.
Hello,
when searching over multiple indices, we create one IndexReader for each index
and wrap them in a MultiReader, which we use to create the IndexSearcher.
This is fine for searching multiple indices on one machine, but in the case the
indices are distributed over the (intra)net, this scenar
Hello all,
I have an indexed field. If I am not using this field in any search query,
does the field still consume memory?
If this field is part of a filter query, would there be any impact on memory
consumption?
I am going to break / shorten the Date Time field and one field might be
Hello Shashi,
Could you please provide your DB-related information: how big is the DB,
how much memory, etc.?
I currently have 100 million records split across 10 indexes on the same
system. I am using ParallelSearcher and search speed is also good.
Regards
Ganesh
- Original Message
Well, if you have more cases like "aboutus", then I think the TokenFilter
approach will help you. You should create your own Analyzer which receives
another Analyzer as an argument, and implement its tokenStream() like this (it's
the general idea):
public TokenStream tokenStream(String fld, Reader reader
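The snippet above is cut off. As a library-free sketch of the same wrapping idea (all class and method names below are hypothetical, not Lucene's API), a filter can wrap an underlying token stream and split compound tokens like "aboutus" on the fly:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical, Lucene-free sketch of the TokenFilter pattern:
// wrap an underlying token iterator and emit "aboutus" as "about", "us".
public class SplittingFilter implements Iterator<String> {
    private final Iterator<String> input;
    private final Set<String> dictionary;
    private final List<String> pending = new ArrayList<>();

    public SplittingFilter(Iterator<String> input, Set<String> dictionary) {
        this.input = input;
        this.dictionary = dictionary;
    }

    public boolean hasNext() {
        return !pending.isEmpty() || input.hasNext();
    }

    public String next() {
        // Emit any token half left over from a previous split first.
        if (!pending.isEmpty()) return pending.remove(0);
        String token = input.next();
        // Try every split point; if both halves are dictionary words,
        // emit them as two separate tokens.
        for (int i = 1; i < token.length(); i++) {
            String head = token.substring(0, i);
            String tail = token.substring(i);
            if (dictionary.contains(head) && dictionary.contains(tail)) {
                pending.add(tail);
                return head;
            }
        }
        return token;
    }
}
```

In real Lucene code the same logic would live in a TokenFilter returned by the wrapping Analyzer's tokenStream().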
Prashant, I have had better luck with even larger sized indices on
similar platforms. Could you elaborate what types of queries you are
running, Multifield? Boolean? combinations? etc. Also you might want
to remove unnecessary stored fields from the index and move them to a
relational db to squeeze
Thanks for your reply,
my original code snippet is
IndexSearcher searcher = new IndexSearcher(indexDir);
Analyzer analyzer = new StopAnalyzer();
BooleanClause.Occur[] flags = { BooleanClause.Occur.SHOULD,
Boolea
I did that as well. Actually, we had 32 indexes initially. We searched them.
It was horrible.
After that I merged them into 4 indexes and did the same. No gain!
Then I had to merge the 32 indexes into one.
On Tue, Aug 4, 2009 at 10:48 AM, Anshum wrote:
> Hi Prashant,
> 8 seconds as the minim