Re: Getting Payload data from BooleanQuery results

2009-09-24 Thread Mark Miller
I should beef up that spans extractor - it can actually work on the constant-score multi-term queries (the base ones that now have a constant-score mode in 2.9), just like the Highlighter does. That class probably belongs in contrib. You can use the filter and the spanquery to get the result

Re: Seattle / PNW Hadoop/Lucene/HBase Meetup, Wed Sep 30th

2009-09-24 Thread Bradford Stephens
Friendly Reminder! One week to go. On Mon, Sep 14, 2009 at 11:35 AM, Bradford Stephens < bradfordsteph...@gmail.com> wrote: > Greetings, > > It's time for another Hadoop/Lucene/Apache "Cloud" Stack meetup! > This month it'll be on Wednesday, the 30th, at 6:45 pm. > > We should have a few interest

Re: Free Webinar - Apache Lucene 2.9: Technical Overview of New Features

2009-09-24 Thread Michael Masters
Has anyone received a link with the slides from the presentation yet? -Mike On Fri, Sep 18, 2009 at 3:56 PM, Erik Hatcher wrote: > Free Webinar: Apache Lucene 2.9: Discover the Powerful New Features > --- > > Join us for a free an

RE: metrics for index ~100M docs ... Correction

2009-09-24 Thread Sudarsan, Sithu D.
Hi Joel, A couple of quick points. 1. The metric is for indexing only. 2. It is 1000 docs/minute (sorry for the earlier 1000/sec goof-up). 3. Regarding search/query, it depends on many parameters... (similarity, proximity, synonym lookup, etc.) Sincerely, Sithu D Sudarsan -Original Message-
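As a rough illustration of what the corrected figure implies, here is a back-of-the-envelope estimate in Java. It assumes a single machine and a constant rate of 1000 docs/minute applied to 100M documents. Both are assumptions: the rate was reported for documents of roughly 100K each, including PDF text extraction, so it may not carry over to other document sizes. The class and method names are hypothetical.

```java
import java.time.Duration;

final class BuildTimeEstimate {
    /** Minutes needed to index totalDocs at a constant docsPerMinute, rounded up. */
    static long minutesToIndex(long totalDocs, long docsPerMinute) {
        if (docsPerMinute <= 0) {
            throw new IllegalArgumentException("docsPerMinute must be positive");
        }
        return (totalDocs + docsPerMinute - 1) / docsPerMinute;
    }

    public static void main(String[] args) {
        // Assumed inputs: 100M documents at the reported 1000 docs/minute.
        long minutes = minutesToIndex(100_000_000L, 1_000L);
        Duration d = Duration.ofMinutes(minutes);
        // Prints: 100000 minutes, about 69 days
        System.out.println(minutes + " minutes, about " + d.toDays() + " days");
    }
}
```

Under these assumptions a single indexer would need a little over two months, which is why spreading the work across several machines comes up in this kind of sizing discussion.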

Re: Getting Payload data from BooleanQuery results

2009-09-24 Thread Christopher Tignor
Thanks for the tip. I don't see a way to integrate the QueryWrapperFilter (or any Filter) into SpanTermQuery.getSpans(indexReader) however. I can use a SpanQuery with an IndexSearcher as per usual but that leaves me back where I started. Any thoughts? Also, I will need to sort these results by

help with DictionaryFilter replacing an acronym token with multiple tokens

2009-09-24 Thread Allen Atamer
Hello List members, Please help me to fix a problem in my DictionaryFilter class. It is used to map acronyms, abbreviations, synonyms, etc. to one common root word/phrase for easy searching. For example, "temp" is an abbreviation for "temperature". One-to-one substitutions work without probl
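The general technique for one-to-many replacement in a pull-based token stream is to buffer the extra tokens and hand them out on subsequent calls. The sketch below shows that idea in plain Java. It is not Lucene's TokenFilter API and not the poster's DictionaryFilter; the class name, the dictionary layout, and the "hvac" entry are hypothetical, and token positions and offsets are ignored.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

final class ExpandingTokenStream {
    private final Iterator<String> input;
    private final Map<String, List<String>> dictionary;
    // Tokens produced by a one-to-many replacement that have not been emitted yet.
    private final Deque<String> pending = new ArrayDeque<>();

    ExpandingTokenStream(Iterator<String> input, Map<String, List<String>> dictionary) {
        this.input = input;
        this.dictionary = dictionary;
    }

    /** Returns the next output token, or null when the stream is exhausted. */
    String next() {
        if (!pending.isEmpty()) {
            return pending.poll();          // finish the current expansion first
        }
        if (!input.hasNext()) {
            return null;
        }
        String token = input.next();
        List<String> replacement = dictionary.get(token);
        if (replacement == null || replacement.isEmpty()) {
            return token;                   // no mapping: pass the token through
        }
        // Emit the first replacement token now and queue the rest.
        pending.addAll(replacement.subList(1, replacement.size()));
        return replacement.get(0);
    }

    /** Collects every remaining token into a list. */
    static List<String> drain(ExpandingTokenStream stream) {
        List<String> out = new ArrayList<>();
        for (String t = stream.next(); t != null; t = stream.next()) {
            out.add(t);
        }
        return out;
    }
}
```

With a dictionary mapping "temp" to ["temperature"] and the hypothetical "hvac" to ["heating", "ventilation"], the input tokens hvac, temp, ok come out as heating, ventilation, temperature, ok.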

Re: Getting Payload data from BooleanQuery results

2009-09-24 Thread Chris Hostetter
: But alas, I cannot seem to get access to any TermPositions from my above : BooleanQuery. I would suggest refactoring your "date" restriction into a Filter (there's a fairly easy-to-use Filter that wraps a Query) and then executing a SpanTermQuery just as you describe. -Hoss --

Re: Filtering question/advice

2009-09-24 Thread Amin Mohammed-Coleman
Hi, sorry for not getting back to you. Been swamped with stuff at work and home. Just managed to check my Lucene emails! You are right, I made some silly mistakes with the test case and have updated it accordingly. The test is still failing but the properties are set correctly: public class Underwr

Re: Search with wild-cards by words with forward-slash ("/")

2009-09-24 Thread Erick Erickson
First, really think about getting a copy of Luke to help you investigate what's actually in your index; it's invaluable. It'll also let you try running queries through different analyzers and seeing the results. But I think you're a bit fuzzy on what analyzers do. Their primary purpose is to break

Search with wild-cards by words with forward-slash ("/")

2009-09-24 Thread coldserenity
Hello, I've been searching the forum and found several more or less relevant topics, listed below. http://www.nabble.com/Parsing-text-containing-forward-slash-and-wildcard-td13541503.html#a13541503 http://www.nabble.com/Parsing-text-containing-forward-slash-and-wildcard-td13541503.ht

RE: metrics for index ~100M docs

2009-09-24 Thread Sudarsan, Sithu D.
Hi Joel, With approx. 100K doc size, on a dual quad-core machine (3.0 GHz), Windows platform, we average 1000 docs/sec. This includes text extraction from PDF docs. Hope this helps. Sincerely, Sithu D Sudarsan -Original Message- From: Joel Halbert [mailto:j...@su3analytics.co

Getting Payload data from BooleanQuery results

2009-09-24 Thread Christopher Tignor
Hello, I have indexed documents with two fields, "ARTICLE" for an article of text and "PUB_DATE" for the article's publication date. Given a specific single word, I want to search my index for all documents that contain this word within the last two weeks, and have them sorted by date: TermQuery

Re: metrics for index ~100M docs

2009-09-24 Thread Joel Halbert
I found this thread pretty useful: http://markmail.org/search/?q=Re%3A+Scaling+out%2Fup+or+a+mix#query:Re%3A%20Scaling%20out%2Fup%20or%20a%20mix+page:1+mid:x4ymuplegomuth7n+state:results -Original Message- From: Erick Erickson Reply-To: java-user@lucene.apache.org To: java-user@lucene.

Re: metrics for index ~100M docs

2009-09-24 Thread Erick Erickson
It's really hard to say anything meaningful here. How many fields? What kind of sorting do you intend to do? How complex are the queries you expect? And even if you have meaningful answers to the above, then "it depends" (tm). Then you could go to SOLR (which is built on Lucene) to handle distribu

metrics for index ~100M docs

2009-09-24 Thread Joel Halbert
Hi, Does anyone know of any recent metrics & stats on building out an index of ~100M documents (each doc approx 5k)? I'm looking for approx stats on time to build, time to query and infrastructure requirements (number of machines & spec) to reasonably support an index of such a size. Thanks, J