Re: Help using ShingleFilter/NGramTokenizer: Could not find implementing class for org.apache.lucene.analysis.tokenattributes.OffsetAttribute

2014-01-24 Thread Koji Sekiguchi
Hi Russell, It seems the error message says that the implementing class for OffsetAttribute cannot be found on your classpath in the (Pig?) environment. There seem to be implementing classes OffsetAttributeImpl and Token, according to the Javadoc: http://lucene.apache.org/core/4_6_0/core/org/a

Re: Performance testing Lucene

2014-01-24 Thread Michael McCandless
Oh that's good to hear. Lucene's unit tests are quite stressful on a new Directory impl... Mike McCandless http://blog.mikemccandless.com On Thu, Jan 23, 2014 at 8:40 PM, Scott Schneider wrote: > Thanks! I ran this Directory subclass through the Lucene unit tests (and > found 3 race conditi

RE: Performance testing Lucene

2014-01-24 Thread Uwe Schindler
Hi Scott, the unit tests are also a good performance test. But to compare your directory with another one, be sure to: - use a defined directory instance to compare. The most performant Lucene one is: -Dtests.directory=MMapDirectory - so compare your results with that one. If you don't define a

Building term frequency matrix over 6 million documents...

2014-01-24 Thread Witdouck, Xavier
Hi all, We have over 6 million documents in our index, and would like to construct a term frequency matrix over all 6 million documents as quickly as possible. Each document has a numeric date field, so we would like to build a time series which contains values which are the sum of all frequen

Re: Building term frequency matrix over 6 million documents...

2014-01-24 Thread Marcio Napoli
Hi! I believe the approach below can help you. http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java Marcio http://numere.stela.org.br Go beyond Lucene™ features with Numere® 2014/1/24 Witdouck, Xavier > Hi all, > > We have over 6 m

(DocIds satisfying a query) -> (branch of the boolean query as a tree)

2014-01-24 Thread Olivier Binda
Hello. While searching a query, I guess that Lucene traverses a Field->Term->DocId structure, filters the docIds that satisfy the query, scores them and then sorts them. Given a resulting docId, I would like a way to find at least a valid path (or the first valid path or all valid paths) that ma

exporting a query to String with default operator = AND ?

2014-01-24 Thread Olivier Binda
Hello. I would like to serialize a query into a string (A) and then deserialize it back into a query (B). I guess that one solution is: A) query.toString(), B) StandardQueryParser().parse(query,""). It is suboptimal for me though, because my app already has a custom query parser (with leadingWi

Re: Problems with Lucene and Solr

2014-01-24 Thread Doug Turnbull
Hey Vishnu, I'm trying to understand what you're trying to accomplish (cc'ing Lucene user group to solicit additional advice). Are you trying to extract all the terms for a given document? If so, you might just want to enable term vectors to analyze the index terms for the document. -Doug On Fri

SnapshotDeletionPolicy API changes

2014-01-24 Thread Vitaly Funstein
I see that SnapshotDeletionPolicy no longer supports snapshotting by an app-supplied string id, as of Lucene 4.4. However, my use case relies on the policy's ability to maintain multiple snapshots simultaneously to provide index versioning semantics, of sorts. What is the new recommended way of doi

Re: exporting a query to String with default operator = AND ?

2014-01-24 Thread Erick Erickson
First of all, query.toString is not idempotent. You cannot count on feeding the results of query.toString back into a query parser and getting the same thing, so that's out. Not quite sure what the right solution is though. Best, Erick On Fri, Jan 24, 2014 at 11:29 AM, Olivier Binda wrote: > Hello. >

Re: SnapshotDeletionPolicy API changes

2014-01-24 Thread Michael McCandless
It added complexity, for Lucene to track the app-provided ID. And, it's something you can easily add back on top of the new API, if necessary. But, maintaining multiple snapshots is certainly still allowed: multiple snapshots referencing the same IndexCommit is fine. There is a ref count increme
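The idea of adding app-supplied IDs back on top of the new API can be sketched roughly as below. This is only an illustration under stated assumptions: NamedSnapshots is a hypothetical helper, not part of Lucene, and the two callbacks stand in for whatever "take a snapshot" and "release a snapshot" calls the deletion policy exposes. Checked exceptions and persistence of the ID map across restarts are left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not a Lucene class): keeps an app-side map from
 * string IDs to snapshot handles of type C (an index commit, for example).
 */
public class NamedSnapshots<C> {
    // Assumption: wraps the policy's "take a snapshot" call.
    private final Supplier<C> takeSnapshot;
    // Assumption: wraps the policy's "release this snapshot" call.
    private final Consumer<C> releaseSnapshot;
    private final Map<String, C> byId = new HashMap<>();

    public NamedSnapshots(Supplier<C> takeSnapshot, Consumer<C> releaseSnapshot) {
        this.takeSnapshot = takeSnapshot;
        this.releaseSnapshot = releaseSnapshot;
    }

    /** Takes a new snapshot and remembers it under the given ID. */
    public synchronized C snapshot(String id) {
        if (byId.containsKey(id)) {
            throw new IllegalStateException("snapshot id already in use: " + id);
        }
        C commit = takeSnapshot.get();
        byId.put(id, commit);
        return commit;
    }

    /** Releases the snapshot registered under the given ID. */
    public synchronized void release(String id) {
        C commit = byId.remove(id);
        if (commit == null) {
            throw new IllegalStateException("no snapshot with id: " + id);
        }
        releaseSnapshot.accept(commit);
    }

    /** Returns the handle for the ID, or null if none is registered. */
    public synchronized C get(String id) {
        return byId.get(id);
    }
}
```

Each ID triggers its own snapshot call here, so two IDs may end up pointing at the same commit; per the reply above, multiple snapshots referencing one commit are fine.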

Lucene performance

2014-01-24 Thread Hamed Ghavamnia
Hello, I searched a lot about Lucene's limits and performance, but I still don't know how much I can count on it. I'm storing logs and indexing them with Lucene. The event rate is 2000 per second. The format of each log is generally 'fieldname' : 'fieldvalue'. What search performance should I expect