Re: i'm Lucene beginner. help me

2012-06-26 Thread Adrien Grand
Hi kjysmu,

On Tue, Jun 26, 2012 at 11:22 AM, kjysmu  wrote:
> What i want with lucene is that i wanna get it's image ids for certain query
> (tag)
>
> how can i implement it using Lucene with Java?

I moved the discussion to java-user@lucene instead of dev@lucene since
your question is not related to Lucene development.
  http://people.apache.org/~hossman/#java-user

To understand how to use Lucene, you should start with the getting
started guide[1], which will help you get familiar with analysis,
indexing and searching with Lucene.

 1. http://lucene.apache.org/core/3_6_0/gettingstarted.html

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 4.0.0 - find term position.

2012-12-07 Thread Adrien Grand
Hi Vitaly,

On Fri, Dec 7, 2012 at 3:24 PM,  wrote:

> I try to use or  Terms tfvector = reader.getTermVector(docId, "contents");
> or  Fields fields = reader.getTermVectors(docId);
> but I get null from these calls.
> What is wrong?


These methods will always return null unless you turn term vectors on at
indexing time (see FieldType.setStoreTermVectors[1]
and FieldType.setStoreTermVectorPositions[2]).

 [1]
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/document/FieldType.html#setStoreTermVectors(boolean)
 [2]
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/document/FieldType.html#setStoreTermVectorPositions(boolean)
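
For reference, a minimal sketch (the field name, text value and writer
are illustrative) of turning term vectors on when indexing a field:

FieldType type = new FieldType(TextField.TYPE_STORED);
type.setStoreTermVectors(true);
type.setStoreTermVectorPositions(true);
type.freeze();
Document doc = new Document();
doc.add(new Field("contents", text, type));
writer.addDocument(doc);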

-- 
Adrien


Re: StoredFieldsFormat / documentation

2013-01-24 Thread Adrien Grand
Hi Bernd,

On Thu, Jan 24, 2013 at 11:55 AM, Bernd Müller  wrote:
> Hi Simon,
>
>> you mean where it is used? Look at the org.apache.lucene.codecs.Codec
>> class, it has a method:
>>
>>   public abstract StoredFieldsFormat storedFieldsFormat();
>>
>> which returns a stored fields format used to encode your stored fields
>> written by the index writer.
>
> Thanks for your quick reply. So I have to change the return value of
> the method storedFieldsFormat to a custom
> CompressingStoredFieldsFormat. Then, I set the codec in the
> IndexWriterConfig for the IndexWriter. If this is correct, my problem
> is solved.

This is correct. There are some explanations on how to register custom
codecs in the o.a.l.codecs package documentation:
http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/codecs/package-summary.html#package_description.
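
For illustration, a rough sketch of such a codec (the class name,
stored fields format name and chunk size are made up, and constructor
details may differ slightly across 4.x releases). The codec also needs
to be registered through SPI
(META-INF/services/org.apache.lucene.codecs.Codec) so that readers can
find it by name:

public final class MyCompressingCodec extends FilterCodec {
  private final StoredFieldsFormat storedFields =
      new CompressingStoredFieldsFormat("MyStoredFields", CompressionMode.FAST, 1 << 14);

  public MyCompressingCodec() {
    super("MyCompressingCodec", new Lucene41Codec());
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    return storedFields;
  }
}

// then: indexWriterConfig.setCodec(new MyCompressingCodec());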

Something you may need to consider if you plan to read this index with
future versions of Lucene is that CompressingStoredFieldsFormat is
still experimental, so we might break it in a non-backward-compatible
way in a future release. (But it is possible to work around this
problem: for example, if we break it in Lucene 4.2, all you need to do
is convert your index to the Lucene41 codec using Lucene 4.1 and then
re-convert it back to your custom codec using Lucene 4.2.)

> Next question that comes up: If I have different IndexWriters writing
> in the same index with different codecs, is the codec for the fields
> somehow resolved for an IndexReader? Or does every instance of an
> IndexWriter change the stored fields to its codec when committing and
> closing the index?

You should never have two IndexWriters writing to the same directory
at the same time. However there is no problem writing some segments
with a codec and then other segments with another codec. Your
DirectoryReader will just be composed of atomic readers which use
different codecs, which is fine, even for merging.

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Need help regarding understanding internals of Lucene Index.

2013-01-25 Thread Adrien Grand
Hi Vignesh,

This is a very broad question! The following links might help you:
 - Lucene documentation: http://lucene.apache.org/core/4_1_0/index.html
 - File formats:
http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/codecs/lucene41/package-summary.html#package_description
 - The block tree terms dictionary:
https://issues.apache.org/jira/browse/LUCENE-3030 (the data structure
used to look up terms)

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread Adrien Grand
Have you tried using the PDFParser [1] and the OfficeParser [2]
classes from Tika?

This question seems to be more appropriate for the Tika user mailing list [3]?

[1] 
http://tika.apache.org/1.3/api/org/apache/tika/parser/pdf/PDFParser.html#parse(java.io.InputStream,
org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata,
org.apache.tika.parser.ParseContext)
[2] 
http://tika.apache.org/1.3/api/org/apache/tika/parser/microsoft/OfficeParser.html#parse(java.io.InputStream,
org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata,
org.apache.tika.parser.ParseContext)
[3] http://tika.apache.org/mail-lists.html
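
For what it's worth, a minimal sketch of extracting text with Tika
(assuming Tika 1.x; the file name is illustrative and BodyContentHandler
truncates very large documents by default):

InputStream in = new FileInputStream("example.pdf");
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
new PDFParser().parse(in, handler, metadata, new ParseContext());
String text = handler.toString();
in.close();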

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: CompressingStoredFieldsFormat doesn't show improvement

2013-01-29 Thread Adrien Grand
Arun,

Lucene uses a very light compression algorithm so I'm a little
surprised it can make indexing 2x slower. Could you run indexing under
a profiler to make sure it really is what makes indexing slower?

Thanks!

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: CompressingStoredFieldsFormat doesn't show improvement

2013-01-30 Thread Adrien Grand
On Wed, Jan 30, 2013 at 8:08 AM, arun k  wrote:
> Adrein,
>
> I have created an index of size 370M of 1 million docs of 40 fields of 40
> chars and did the profiling.
> I see that the indexing and in particular the addDocument &
> ConcurrentMergeScheduler in 4.1 takes double the time compared to 3.0.2.

Can you provide me with the detailed profiles?

> Looks like CompressionStoredFieldsFormat is of little use in my scenario.

You can disable stored fields compression and use another
StoredFieldsFormat by defining a custom codec
(http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/codecs/package-summary.html#package_description).

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Example settings for TieredMergePolicy : Lucene 4.0

2013-02-01 Thread Adrien Grand
Hi,

On Fri, Feb 1, 2013 at 6:51 PM, saisantoshi  wrote:
> Prior to 4.0, there was an optimize() in the IndexWriter which was merging
> the index files. Is there any settings that can be done on the
> TieredMergePolicy so that I want to limit the number of files produced
> during the indexing.

Segments can be merged by running IndexWriter.forceMerge(1). You can
read more about this command and why its use is no longer recommended
at
http://www.searchworkings.org/blog/-/blogs/simon-says%3A-optimize-is-bad-for-you.
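
If the goal is only to limit the number of segments (and therefore
files) produced during indexing, the TieredMergePolicy knobs can also
be tightened; a rough sketch (the values are illustrative, and lower
values mean more merging work):

TieredMergePolicy mergePolicy = new TieredMergePolicy();
mergePolicy.setSegmentsPerTier(5.0);
mergePolicy.setMaxMergeAtOnce(5);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
config.setMergePolicy(mergePolicy);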

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: updateDocument question

2013-02-06 Thread Adrien Grand
Hi Thomas,

On Wed, Feb 6, 2013 at 2:50 PM, Becker, Thomas  wrote:
> I've built a search prototype feature for my application using Lucene, and it 
> works great.  The application monitors a remote system and currently indexes 
> just a few core attributes of the objects on that system.  I get 
> notifications when objects change, and I then update the Lucene index to keep 
> things in sync.   The thing is that even when objects on the remote system 
> are updated, it's relatively unlikely that the specific attributes I'm 
> indexing (like name) were changed.  From what I can see, 
> IndexWriter.updateDocument() makes no effort to determine if the existing 
> document is actually dirty compared to the provided one.  My questions are:
>
> Is this true that documents are assumed to be changed and not actually 
> checked before replacement?

Yes, it's true.

> Has such a feature been considered?

I'm not sure, but I see several issues: for example, if you reindex the
exact same document with a different analyzer, the indexed
terms/positions/offsets/payloads might be different. Moreover, one can
only perform such a comparison if the document is stored, which is
something that Lucene doesn't enforce.

> Is it worth it to query for the document, manually dirty check it and then 
> delete/re-add only if it's different if changes to the indexed fields are 
> relatively uncommon?  My concern is that I'm inadvertently causing a lot of 
> segment churn for things that aren't actually changing.

You could try to do it, but maybe it is just fine the way it is: as
segments get merged deleted docs eventually get expunged.

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: updateDocument question

2013-02-07 Thread Adrien Grand
On Thu, Feb 7, 2013 at 1:54 PM, Becker, Thomas  wrote:
> Thanks for the response Adrien.  I guess I'll just leave things as they are 
> for now.  To be clear though, do merged segments get cleaned up completely 
> even if the IndexWriter is never closed?

The way it works is that indexing data creates segments (a single
segment usually contains a large number of documents), but the number
of segments is bounded, so whenever this limit is reached, a merge is
triggered. When merging segments, deleted documents are skipped, which
is how deletions get expunged.

> Currently I'm using NRT search with a single writer that stays open for the 
> lifetime of the application.   This product will be shipped to customers, so 
> I need the index to be entirely self-managing.

Sounds good: having a long-living IndexWriter is the best way to index
documents. It will take care of merging segments whenever necessary.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing directly from stdin in lucene 3.5

2013-02-19 Thread Adrien Grand
Hi,

On Tue, Feb 19, 2013 at 11:04 AM, A. L. Benhenni  wrote:
> I am currently writing an indexer class to index texts from stdin. I also
> need the text to be tokenized and stored to access the termvector of the
> document.

Actually, you don't need to store documents to access their term
vectors; these are two different options. The stored fields allow you
to retrieve data as you provided it to the IndexWriter, while term
vectors return a single-document inverted index of your document
(mapping every unique term to its frequency, the positions where it
appeared in the original document, etc.).

> 1/ Is there a more appropriate way of handling the indexing of an incoming
> stream ?

Actually, your example is very strange since (if I'm not mistaken)
each iteration of the loop overwrites the previous line with the
current one (because path_field_name doesn't change).

If you want your document to be stored (Store.YES), you need to buffer
everything into a String before feeding Lucene with it.
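
A minimal sketch of that buffering (assuming Lucene 3.5; the field name
and the choice to index the whole stream as a single document are only
for illustration):

BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
StringBuilder buffer = new StringBuilder();
for (String line; (line = stdin.readLine()) != null; ) {
  buffer.append(line).append('\n');
}
Document doc = new Document();
doc.add(new Field("contents", buffer.toString(), Field.Store.YES,
    Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
writer.addDocument(doc);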

> 2/ Is there an easy way to clean the index ?

IndexWriter.deleteAll?
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/core/org/apache/lucene/index/IndexWriter.html#deleteAll()

> And a subsidiary 3/ Why a field can't store a reader ?

Lucene stores string fields by first writing their length, followed by
their bytes, so it couldn't even start serializing a Reader before
having consumed and buffered it fully (to know its length). Lucene
doesn't allow the creation of stored fields from a Reader because it
would give the impression of being lightweight (no need to load
everything into memory at once) although it wouldn't be. On the
contrary, you can provide a Reader to a field which is indexed and has
term vectors turned on, and Lucene will manage to consume it in a truly
streaming fashion.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: [ANNOUNCE] Wiki editing change

2013-03-25 Thread Adrien Grand
Hi Steve,

On Mon, Mar 25, 2013 at 4:16 AM, Steve Rowe  wrote:
> Please request either on the java-user@lucene.apache.org or on 
> d...@lucene.apache.org to have your wiki username added to the 
> ContributorsGroup page - this is a one-time step.

Can you add 'jpountz' to the ContributorsGroup? Thank you!

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Beginner's questions

2013-03-27 Thread Adrien Grand
Hi Paul,

On Wed, Mar 27, 2013 at 1:58 PM, Paul Bell  wrote:
> As to the ideas raised in the links you pointed me to: the first link shows
> the instantiation of a Term object via
>
>writer.UpdateDocument(new Term("IDField", *id*), doc);
>
> yet in the 4.2.0 docs I see no Term constructor that allows this "id"
> field.

I think this is this one:
http://lucene.apache.org/core/4_2_0/core/org/apache/lucene/index/Term.html#Term(java.lang.String,
java.lang.String)

> But this raises an interesting question: is it possible to tell Lucene that
> the Document I've given it to index has a specific identifier? Here's an
> example of what I mean. Suppose that the DB in question is a NoSQL type of
> the graph flavor. I add a vertex to that graph. The vertex contains some
> properties, e.g., name and type, whose values are text strings. I want
> Lucene to index these data AND I want to know some kind of identifier for
> that vertex Document. I would prefer to give Lucene that ID, though I might
> be able to tolerate it giving it to me.

Lucene has no schema that would allow you to specify a primary key,
but there is the IndexWriter.updateDocument method that allows for
atomic updates of documents:
http://lucene.apache.org/core/4_2_0/core/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,
java.lang.Iterable)

You just need to pass a term where the field name is the name of your
primary key field and the value is the actual ID.

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Beginner's questions

2013-03-27 Thread Adrien Grand
On Wed, Mar 27, 2013 at 9:04 PM, Paul Bell  wrote:
> Thanks Adrien.
>
> I've scraped together a simple program in the Lucene 4.2 idiom (see below).
> Does this illustrate what you meant by your last sentence?
>
> The code adds/indexes 5 documents all of whose content is identical, but
> whose 'id' field is unique ("v1" through "v5"). It then queries the 'id'
> field for the pattern "v*".

Even if your program works, there is something "dangerous" in it: you
index your id field with a String field, meaning that the field is not
analyzed, but then you query it using a query parser, which analyzes
the data it is given. So if you gave any of your documents the id
"ABC", you would never be able to find it since StandardAnalyzer
filters tokens with a LowerCaseFilter. You could simply create the
query manually:

Query query = new PrefixQuery(new Term("id", "v" + id));

without help from a query parser.

To ensure that your id field is unique across documents, you could replace

    writer.addDocument(createDocument("This is a test; for the next 60 seconds..."));

with

    Document doc = createDocument("This is a test; for the next 60 seconds...");
    writer.updateDocument(new Term("id", doc.get("id")), doc);

> While we're at it, what method should I be using to obtain merely the
> original document itself after a query?

You can println document.get("id").

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing Term Frequency Vectors

2013-03-28 Thread Adrien Grand
Hi,

On Thu, Mar 28, 2013 at 8:25 PM, Sharon Tam  wrote:
> I believe that when Lucene indexes documents, it generates counts for a
> term by counting how many times the term appears in a particular document.
> Instead of having Lucene do the counting, I want to do my own counting and
> feed a term-frequency vector representation of a document directly into the
> indexer which will take my counts and proceed to do the other processing
> such as generating inverse document frequency.  My term-frequencies may not
> all be integers.  Is there a way to do this?

You could provide the indexer with arbitrary frequencies by creating a
handcrafted TokenStream that repeats terms ${termFreq} times, but
unfortunately, frequencies need to be strictly positive (> 0)
integers.
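
For illustration only, a rough sketch of such a token stream: a
TokenFilter that emits every incoming token a fixed number of times (in
practice the repeat count would come from your own per-term counts, and
the class name is made up):

final class RepeatFilter extends TokenFilter {
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);
  private final int freq;
  private int remaining = 0;

  RepeatFilter(TokenStream input, int freq) {
    super(input);
    this.freq = freq;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (remaining > 0) {
      // re-emit the previous term at the same position
      remaining--;
      posIncAtt.setPositionIncrement(0);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    remaining = freq - 1;
    return true;
  }
}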

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Storing Documents in Lucene

2013-03-28 Thread Adrien Grand
On Thu, Mar 28, 2013 at 11:06 PM, Paul  wrote:
> Hi,

Hi Paul,

> Some of the stuff I've read suggests that Lucene is not especially 
> well-suited to storing the documents. It's supposed to be great at indexing 
> those documents, but not so great at storing the docs themselves.
>
> Can someone shed some light on this?

I'd say that it is the same problem as with other databases: The
problem with large stored fields is that they might make the I/O cache
of your operating system go crazy and make search slower. However, if
your fields are small (i.e. not high-resolution photos), I think it is
reasonable to store them in the Lucene index, especially now that
Lucene compresses stored fields.

> If this is true, then am I right to think that the typical Lucene use case is 
> to
>  a. Index a document
>  b. Store in the index some kind of unique document identifier that 
> is meaningful to the
>   "native" application
>  c. Search the index, obtain this ID, and present it to the native 
> app to fetch the original
>document?

If you need to store your documents somewhere else anyway, this
approach is good. But you could use Lucene as your primary store as
well.

> This came up in the context of trying to compare MongoDB and Lucene. But as I 
> dug into it I began to think that this might be an apples to oranges 
> comparison. MongoDB builds indices as you insert documents, but it seems like 
> Lucene is more about the indexing and less about storing documents.

Lucene being only a library, you might be interested in checking out
Solr or Elasticsearch, which are more comparable with MongoDB than
Lucene is.

I hope this helps!

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Beginner's questions

2013-03-29 Thread Adrien Grand
Hi Paul,

On Fri, Mar 29, 2013 at 1:38 PM, Paul Bell  wrote:
> Last night reading in "Lucene in Action, 2nd edition," I came upon this
> about addDocument(Document, Analyzer): "Adds the document using the
> provided analyzer for tokenization. But be careful! In order for searches
> to work correctly, you need the analyzer used at search time to "match" the
> tokens produced by the analyzers at indexing time."
>
> Is this warning from the author of a piece with what you're warning me
> about?

Exactly. This is a very common source of errors.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Discrepancies between search results and reader.document(i).get("path")

2013-03-29 Thread Adrien Grand
Hi,

On Fri, Mar 29, 2013 at 10:23 AM, Bushman, Lamont  wrote:
> This snippet of one of my classes looks at all of my documents and displays 
> their file path.
> 
> Directory dir = FSDirectory.open(mIndexFolder);
> IndexReader reader = DirectoryReader.open(dir);
> int numDocs = reader.numDocs();
> filesToDelete = new HashMap();
>
> for (int i = 0; i < numDocs; i++)
> {
> File file = new File(reader.document(i).get("path"));
> System.out.println("Files: " + file);

This is not correct if there are deleted documents. You must iterate
from 0 to maxDoc() and skip deleted documents (using
reader.liveDocs()).
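
A minimal sketch of the corrected loop (as clarified in the follow-up
below, a composite reader exposes its live docs through
MultiFields.getLiveDocs):

Bits liveDocs = MultiFields.getLiveDocs(reader); // null if nothing was deleted
for (int i = 0; i < reader.maxDoc(); i++) {
  if (liveDocs != null && !liveDocs.get(i)) {
    continue; // skip deleted documents
  }
  File file = new File(reader.document(i).get("path"));
  System.out.println("Files: " + file);
}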

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Discrepancies between search results and reader.document(i).get("path")

2013-03-29 Thread Adrien Grand
On Sat, Mar 30, 2013 at 12:39 AM, Bushman, Lamont  wrote:
> However, with your response, especially if I come across problems later.   
> reader.liveDocs() is not found in IndexWriter.  I am guessing you are 
> referring to the TermsEnum class.  I assume numDocs() returns the amount of 
> documents that are left to search and maxDoc() is the greatest id that still 
> exists.  I believe my program is working fine for me now because I am using 
> the forceMergeDeletes() method, so that numDocs() will always be the same as 
> maxDoc().  Am I right on my assumptions?

When you delete documents, Lucene doesn't delete them in-place. They
are first just marked as deleted but still present in the index, which
is why AtomicReader[1] has 3 methods:
 - numDocs(), which returns the number of non-deleted documents
 - maxDoc(), which returns the greatest ID that exists plus one (so if
no document is deleted, numDocs() == maxDoc())
 - getLiveDocs(), which returns a bitmap of the documents that exist in
your index. A document docID is not deleted if getLiveDocs() is null
or if getLiveDocs().get(docID) returns true.

This third method is not present on the IndexReader class that you are
manipulating. This is because this reader is not atomic, but you could
still get its live docs by calling MultiFields.getLiveDocs(reader). It
is however rather uncommon to use this method in high-level code
because when you run queries against an IndexSearcher, which is the
most common way to retrieve doc IDs from Lucene, Lucene has already
taken care of filtering out deleted documents, even if they matched.

I hope this helps.

[1] 
http://lucene.apache.org/core/4_2_0/core/org/apache/lucene/index/AtomicReader.html

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: "4.1 consuming more memory than 3.0.2 while Indexing"

2013-04-01 Thread Adrien Grand
On Mon, Apr 1, 2013 at 1:56 PM, Arun Kumar K  wrote:
> Hi Guys,

Hi,

> I have been finding out the heap space requirement for indexing and
> searching with 3.0.2 vs 4.1 (with BlockPostings Format).
>
> I have a 2GB index with 1 million docs with around 42 fields with 40 fields
> being random strings.
>
> I have seen that memory for search has reduced by 5X with 4.1 (with
> BlockPostings Format) but the memory usage during indexing with 4.1 is
> around 800MB~1.7 GB whereas for 3.0.2 it is 300~600MB.
> But indexing time is almost same with both versions.

How did you measure memory usage? Operating systems may report high
memory usage because MMapDirectory became the new default directory
implementation and memory-mapped files are taken into account in the
virtual memory of your processes. See
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
for more information.

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Term vector Lucene 4.2

2013-04-02 Thread Adrien Grand
Hi Andi,

Here is how you could retrieve positions from your document:

Terms termVector = indexReader.getTermVector(docId, fieldName);
TermsEnum reuse = null;
TermsEnum iterator = termVector.iterator(reuse);
BytesRef ref = null;
DocsAndPositionsEnum docsAndPositions = null;
while ((ref = iterator.next()) != null) {
  docsAndPositions = iterator.docsAndPositions(null, docsAndPositions);
  // beware that docsAndPositions will be null if you didn't index positions
  if (docsAndPositions.nextDoc() != 0) {
    // you need to call nextDoc() to have the enum positioned
    throw new AssertionError();
  }
  final int freq = docsAndPositions.freq(); // number of occurrences of the term
  for (int i = 0; i < freq; ++i) {
    final int position = docsAndPositions.nextPosition();
    // 'position' is the i-th position of the current term in the document
  }
}

I hope this helps.

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Term vector Lucene 4.2

2013-04-02 Thread Adrien Grand
On Tue, Apr 2, 2013 at 12:45 PM, andi rexha  wrote:
> Hi Adrien,
> Thank you very much for the reply.
>
> I have two other small question about this:
> 1) Is  "final int freq = docsAndPositions.freq();" the same with 
> "iterator.totalTermFreq()" ? In my tests it returns the same result and from 
> the documentation it seems that the result should be the same.

In case of term vectors, the docs enums contain only one document so
iterator.totalTermFreq() and docsAndPositions.freq() are equal. This
would not be true if you consumed AtomicReader.fields() (since the
docs enums would have several documents).

> 2) How do I get the offsets for the term vector? I have tried to iterate over 
> the docsAndPositions but I get the following exception:
>
> Exception in thread "main" java.lang.IllegalStateException: Position enum not 
> started

You need to call startOffset and endOffset just after nextPosition:

for (int i = 0; i < freq; ++i) {
  final int position = docsAndPositions.nextPosition();
  // 'position' is the i-th position of the current term in the document
  final int startOffset = docsAndPositions.startOffset();
  final int endOffset = docsAndPositions.endOffset();
  // offsets of the i-th term
}

Beware that these methods will return -1 if you did not index offsets
(see FieldType.setIndexOptions and
IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS).

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to use concurrency efficiently

2013-04-02 Thread Adrien Grand
On Tue, Apr 2, 2013 at 2:29 PM, Igor Shalyminov
 wrote:
> Hello!

Hi Igor,

> I have a ~20GB index and try to make a concurrent search over it.
> The index has 16 segments, I run SpanQuery.getSpans() on each segment 
> concurrently.
> I see really small performance improvement of searching concurrently. I 
> suppose, the reason is that the sizes of the segments are very non-uniform (3 
> segments have ~20 000 docs each, and the others have less than 1 000 each).
> How to make more uniformly sized segments (I now use just 
> writer.forceMerge(16)), and are multiple index segments the most important 
> thing in Lucene concurrency?

Segments have non-uniform sizes by design. A segment is generated
every time a flush happens (when the RAM buffer is full or if you
explicitly call commit). When there are too many segments, Lucene
merges some of them while new segments keep being generated as you add
data. So the "flush" segments will always be small while segments
resulting from a merge will be much larger since they contain data
from several other segments.

Even if segments are collected concurrently, IndexSearcher needs to
merge the results of the collection of each segment in the end. Since
your segments are very small (2 docs), maybe the cost of
initialization/merge is not negligible compared to single-segment
collection.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to use concurrency efficiently

2013-04-02 Thread Adrien Grand
On Tue, Apr 2, 2013 at 4:39 PM, Igor Shalyminov
 wrote:
> Yes, the number of documents is not too large (about 90 000), but the queries 
> are very hard. Although they're just boolean, a typical query can produce a 
> result with tens of millions of hits.

How can there be tens of millions of hits with only 90 000 docs?

> Single-threadedly such a query runs ~20 seconds, which is too slow. 
> therefore, multithreading is vital for this task.

Indeed, that's super slow. Multithreading could help a little, but
maybe there is something to do to better index your data so that
queries get faster?

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing Term Frequency Vectors

2013-04-02 Thread Adrien Grand
On Tue, Apr 2, 2013 at 4:10 PM, Sharon  W Tam  wrote:
>  Are there any other ideas?

Since scoring seems to be what you are interested in, you could have a
look at payloads: they can store arbitrary data and can be used to
score matches.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: DocValues questions

2013-04-04 Thread Adrien Grand
Hi,

On Thu, Apr 4, 2013 at 10:30 AM, Wei Wang  wrote:
> A few quick questions about DocValues:
>
> 1. If only small number of documents have a ShortDocValueField defined,
> should each document in the index has this field filled with some value?
> The add() function of Document seems not enforce a DocValues field is
> always added to each document.

Given the name of the field you are referring to, I assume that you are
using Lucene 4.0 or 4.1. I would highly recommend upgrading to Lucene
4.2 since the API has been completely refactored (but the disk format
is compatible) and should hopefully be a little clearer.

You are right that there is nothing that enforces that every document
has a value: Lucene will give a default value to such documents: 0 for
numeric doc values and an empty byte array for binary doc values.

> 2. Is there any examples to show how DocValues are stored and retrieved? It
> seems JavaDoc only shows how to add it, and no complete examples are out
> there.

This should be transparent if you use doc values for eg. sorting.
Otherwise, just call getNumericDocValues(field), getBinaryDocValues or
getSortedDocValues on an AtomicReader.
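
For example, a minimal sketch of reading numeric doc values segment by
segment (the field name and the directoryReader variable are
illustrative):

for (AtomicReaderContext context : directoryReader.leaves()) {
  NumericDocValues values = context.reader().getNumericDocValues("my_field");
  if (values == null) {
    continue; // no doc values for this field in this segment
  }
  for (int docID = 0; docID < context.reader().maxDoc(); docID++) {
    long value = values.get(docID);
    // use the value
  }
}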

I hope this helps.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: DocValues questions

2013-04-04 Thread Adrien Grand
On Thu, Apr 4, 2013 at 11:03 PM, Wei Wang  wrote:
> Given the new Lucene 4.2 DocValues API, it seems no matter it is byte,
> short, int, or long, they are all stored as NumericDocValuesField. Does
> this mean "long" values are always stored regardless of the initial type?
> If so, do we still save space if the value range is small? Do we need to
> give some hint to NumericDocValuesField to save space?

Space savings are codec-dependent, but the default codecs use bit
packing to save space. For example:
 - if all your values are between 0 and 255, Lucene will only use 8
bits per value on average,
 - if your documents only take three distinct values (no matter how
large they are), Lucene will detect that this is a low-cardinality
field and only use 2 bits per value on average.

This makes doc values storage-efficient, and much more
memory-efficient than FieldCache, which people had to use until Lucene
4.0.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: DocValues questions

2013-04-05 Thread Adrien Grand
On Fri, Apr 5, 2013 at 4:05 AM, Wei Wang  wrote:
> Do we need to use setLongValue() all the time?

Yes.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: DocValues space usage

2013-04-09 Thread Adrien Grand
Hi,

On Tue, Apr 9, 2013 at 5:22 PM, Wei Wang  wrote:
> DocValues makes fast per doc value lookup possible, which is nice. But it
> brings other interesting issues.
>
> Assume there are 100M docs and 200 NumericDocValuesFields, this ends up
> with huge number of disk and memory usage, even if there are just thousands
> of values for each field. I guess this is because Lucene stores a value for
> each DocValues field of each document, with variable-length codec.

The default codec stores numeric doc values in blocks of 4096 values
that have independent numbers of bits per value. If most of these
blocks end up empty, doc values will require little space, but in a
worst-case scenario where each block contains a single value, it is
true that memory and disk usage will be very inefficient.

> So in such scenario, is it possible only store values for the DocValues
> field of the docment that actually has a value for that field? Or does
> Lucene has a column storage mechanism sort of like hash map for DocValues:
>
> key: the docId that has a value for the DocValues field
> value: the value of the DocValues field

Lucene doesn't have a HashMap-like storage for doc values, although it
would be doable to build a DocValuesFormat that works this way.

However, for your problem, I would recommend that you encode your
numeric data on top of BinaryDocValues. Unlike NumericDocValues,
BinaryDocValues require very little space for missing values. All you
need is to have conversion methods between your numeric data and byte
arrays.
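
For example, a rough sketch (the field name and the string encoding are
illustrative; a more compact variable-length binary encoding would work
just as well):

long value = 1234L;
doc.add(new BinaryDocValuesField("price", new BytesRef(Long.toString(value))));
// documents without a value simply don't add the field and will expose
// an empty BytesRef at search time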

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing Term Frequency Vectors

2013-04-09 Thread Adrien Grand
Hi,

On Tue, Apr 9, 2013 at 5:24 PM, Sharon Tam  wrote:
> I tried following following this payloads tutorial to attach the term
> frequencies as payloads:
> http://searchhub.org/2009/08/05/getting-started-with-payloads/
>
> But I'm confused as to where I need to override the term frequency counter
> so that it uses my term frequencies.  I think term frequency counts are
> calculated during indexing, so I don't think I can just write my own
> Similarity class?

This is correct: frequencies are computed at indexing time. I just
wanted to mention that you can influence scores based on payloads:
http://lucene.apache.org/core/4_2_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html#scorePayload(int,
int, int, org.apache.lucene.util.BytesRef)

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IntField question

2013-04-10 Thread Adrien Grand
Hi,

On Wed, Apr 10, 2013 at 9:34 AM, Wei Wang  wrote:
> IntField inherits from Field class a function called setByteValue().
> However, if we call it, it gives an error message:
>
> java.lang.IllegalArgumentException: cannot change value type from Integer
> to Byte
>
> 1. If this not allowed for IntField, and there is no ByteField, how will
> function setByteValue() be used?

The rule is that if your Field instances wrap an object whose type is
XXX, you should only use the setXXXValue setter. Other setters will
throw an exception instead of performing automatic type conversions in
order to detect programming errors. This is why setByteValue threw an
exception on your IntField.

> 2. Will IntField automatically detect value range is small and use less
> space? I understand DocValuesField can save space by using variable length
> codec, but not sure about IntField.

They are very different:
 - A DocValues field stores one value per document ID.
 - An indexed field only stores distinct values, and associates with
every distinct value the list of document IDs that contain this value
(this is called a postings list).

Indexed values are not compressed but the postings lists are, and the
compression ratio is better when postings lists are dense (with the
current default postings format at least). This makes indexed fields
(such as IntField) use less space when the number of distinct values
is small.

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IntField question

2013-04-10 Thread Adrien Grand
Hi,

On Wed, Apr 10, 2013 at 4:59 PM, Wei Wang  wrote:
> Okay. Since there is no ByteField, setByteValue will never by used. It
> seems like a dead function.

Right, Lucene doesn't have byte or short fields.

> That makes sense. If we don't need positional info (virtually all terms are
> at the same position), can we control this for IntField or any other Field?

You can configure a FieldType so that its postings lists only contain
matching documents without any positional information[1]. This is the
case by default on numeric fields (in particular IntField).

[1] 
https://lucene.apache.org/core/4_2_0/core/org/apache/lucene/index/FieldInfo.IndexOptions.html#DOCS_ONLY
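
For example, a minimal sketch of such a FieldType (the field name and
value are illustrative):

FieldType type = new FieldType();
type.setIndexed(true);
type.setTokenized(false);
type.setIndexOptions(FieldInfo.IndexOptions.DOCS_ONLY);
type.freeze();
doc.add(new Field("category", "books", type));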

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Update a bunch of documents

2013-04-12 Thread Adrien Grand
Hi,

On Thu, Apr 11, 2013 at 5:46 PM, Carsten Schnober
 wrote:
> This is limited to one
> field only (not the one on which the query is typically performed!),
> shouldn't that help?

Unfortunately not. Lucene doesn't support in-place updates so updating
a document is equivalent to deleting the old version and adding back
the new one (although Lucene can do it atomically).

For your information, there are efforts to try to improve document
updates[1][2][3].

[1] https://issues.apache.org/jira/browse/LUCENE-4258
[2] https://issues.apache.org/jira/browse/LUCENE-3837
[3] https://issues.apache.org/jira/browse/LUCENE-4272

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: DiskDocValuesFormat

2013-04-13 Thread Adrien Grand
Hi Wei,

On Sat, Apr 13, 2013 at 7:44 AM, Wei Wang  wrote:
> I am trying to use DiskDocValuesFormat for a particular
> BinaryDocValuesField. It seems there is no good examples showing how to do
> this. The only hint I got from various docs and forums is set some codec in
> IndexWriter. Could someone give a few lines of code snippet and show how to
> set DiskDocValuesFormat?

Lucene42Codec can be extended to specify the doc values format to use
on a per-field basis. For example:

final Codec codec = new Lucene42Codec() {
  final Lucene42DocValuesFormat memoryDVFormat = new Lucene42DocValuesFormat();
  final DiskDocValuesFormat diskDVFormat = new DiskDocValuesFormat();

  @Override
  public DocValuesFormat getDocValuesFormatForField(String field) {
    if ("dv_mem".equals(field)) {
      // use Lucene42 for "dv_mem"
      return memoryDVFormat;
    } else {
      // use Disk otherwise
      return diskDVFormat;
    }
  }
};

Then just pass this Codec instance to your IndexWriterConfig.
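
For example (assuming an existing Directory named directory and an
IndexWriterConfig named config):

config.setCodec(codec);
IndexWriter writer = new IndexWriter(directory, config);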

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Please explain the example

2013-04-21 Thread Adrien Grand
Hi,

On Thu, Apr 18, 2013 at 3:46 PM, Gaurav Ranjan
 wrote:
> I am a student and studying the functionality of Lucene for my project work.
> The DocDelta example on this link is not clear
> http://lucene.apache.org/core/4_2_0/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html?is-external=true
> ,
>
> Please explain the first part how we are getting 15,8,3 as the TermFreqs
> for the example.

The term appears once in doc 7 and 3 times in doc 11. In real-world
cases, freqs are very often equal to 1, so Lucene40PostingsFormat
tries to use as little data as possible (one bit here) when the freq
is 1. Here are the steps performed:

1. Raw doc IDs and freqs -> 7, 1, 11, 3
2. Delta-encoded doc IDs -> 7, 1, 4, 3 (11 - 7 = 4)
3. Multiply deltas by 2 -> 14, 1, 8, 3
4. When the frequency is 1, omit it and add one to the doc delta -> 15, 8, 3

To decode, just perform the steps in reverse order:
1. Encoded data -> 15, 8, 3
2. When the doc delta is odd, it means that the frequency is omitted
and equal to 1 -> 15, 1, 8, 3
3. Divide the doc deltas by 2 (dropping the low bit) -> 7, 1, 4, 3
4. Restore absolute doc IDs -> 7, 1, 11, 3

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Too many unique terms

2013-04-24 Thread Adrien Grand
Hi Manuel,

On Thu, Apr 25, 2013 at 12:29 AM, Manuel LeNormand
 wrote:
> Hi there,
> Looking at my index (about 1M docs) i see lot of unique terms, more
> than 8M which is a significant part of my total term count. These are very
> likely useless terms, binaries or other meaningless numbers that come with
> few of my docs.

If you are only interested in letters, one option is to change your
analysis chain to use LetterTokenizer. This tokenizer will split on
everything that is not a letter, filtering out numbers and binary
data.

> I am totally fine with deleting them so these terms would be unsearchable.
> Thinking about it i get that
> 1. It is impossible apriori knowing if it is unique term or not, so i
> cannot add them to my stop words.
> 2. I have a performance decrease cause my cached "hot spot" chuncks (4kb)
> do contain useless data. It's a problem for me as im short on memory.
>t
> Q:
> Assuming a constant index, is there a way of deleting all terms that are
> unique from at least the dictionary tim and tip files? Do i need to enter
> the source code for this, and if yes what par of it?

If frequencies are indexed, you can pull a TermsEnum, iterate through
the terms dictionary and delete terms that are less frequent than a
given threshold. As you said, this will however prevent your users
from searching for these terms anymore.

>  Will i get significant query time performance increase beside the better
> RAM use benefit?

This is hard to answer. Having fewer terms in the terms dictionary
should make search a little faster but I can't tell you by how much.
You should also try to disable features that you don't use. For
example, if you don't need positional information or frequencies,
IndexOptions.DOCS_ONLY will make your postings lists smaller.

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Distinction between AtomicReader and CompositeReader

2013-04-24 Thread Adrien Grand
Hi Paul

On Wed, Apr 24, 2013 at 1:35 PM, Paul Taylor  wrote:
> Trying to convert some Lucene 3 code to Lucene 4,
>
> I want to use termEnums.docs(ir.getLiveDocs()) to only return docs that have
> not been deleted for a particular term. However getLiveDocs() is only
> available for AtomicReaders, and although I just have a single index it is
> file based and uses DirectoryReader (which is a subclass of
> CompositeReader).
>
> So I guess I could use SlowCompositeReaderWrapper but the name deters me
> from this, but what I dont understand is that isnt almost everyone using
> indexes based on the filesystem, isnt almost everyone using CompositeReaders
> ?
>
> Yet the documentation seems to be implying we should be using AtomicReaders
> but I dont understand how I could possibly do this with a file based index,
> maybe if the file based index only had a single segment, but aren't segments
> created by Lucene as it requires them and not usually closely controlled by
> the end user application.

Lucene storage is based on segments, and Lucene 3.x used to expose
information from these segments in a consolidated manner. Although
elegant, this tends to be slow, which is why Lucene 4.x was changed to
execute everything on a per-segment basis instead. The
"SlowCompositeReaderWrapper" name tries to discourage people from
using this slow composite view.

On every IndexReader, there is a leaves() method that gives you access
to the atomic segment readers that your IndexReader is made of. If you
can manage to solve your problem by using the segment readers, then
you should definitely do it. Otherwise (for example, if you only need
to see every term once), you should use one of the static methods on
the MultiFields class (which SlowCompositeReaderWrapper uses under the
hood).
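
For example, a minimal sketch of the per-segment approach (the field
name, term and directoryReader variable are illustrative):

for (AtomicReaderContext context : directoryReader.leaves()) {
  AtomicReader leaf = context.reader();
  DocsEnum docs = MultiFields.getTermDocsEnum(
      leaf, leaf.getLiveDocs(), "field", new BytesRef("value"));
  if (docs == null) {
    continue; // field or term absent from this segment
  }
  for (int doc = docs.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = docs.nextDoc()) {
    // doc is segment-local; add context.docBase to get an index-wide doc ID
  }
}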

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: org.apache.lucene.classification - bug in SimpleNaiveBayesClassifier

2013-04-24 Thread Adrien Grand
Hi Alexey,

On Tue, Apr 23, 2013 at 3:28 PM, Alexey Anatolevitch
 wrote:
> I was trying it with 4.2.1 and SimpleNaiveBayesClassifier seems to have a
> bug - the local copy of BytesRef referenced by foundClass is affected by
> subsequent TermsEnum.iterator.next() calls as the shared BytesRef.bytes
> changes... I can provide a test case if that was not clear.
>
> I believe it's either BytesRef.clone() that needs to create a full copy of
> the underlying array, or a local fix SimpleNaiveBayesClassifier to actually
> copy bytes instead of clone()

Good catch Alexey. If you can open an issue in JIRA and provide a
patch, I'll be happy to review it!

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Too many unique terms

2013-04-29 Thread Adrien Grand
On Sat, Apr 27, 2013 at 8:41 PM, Manuel Le Normand
 wrote:
> Hi, real thanks for the previous reply.
> For now i'm not able to make a separation between these useless words,
> whether they contain words or digits.
> I liked the idea of iterating with TermsEnum. Will it also delete the
> occurances of these terms in the other file formats (termVectors etc.)?

Yes it will. But since Lucene only marks documents as deleted, you will
need to force a merge in order to expunge deletes.

> As i understand, the strField implementation is a kind of TrieField ordered
> by the leading char (as searches support wildcards), every term in the
> Dictionnary points to the inverted file (frq) to find the list (not bitmap)
> of the docs containing the term.

These details are codec-specific, but they are correct for the current
postings format. You can have a look at
https://builds.apache.org/job/Lucene-Artifacts-trunk/javadoc/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Termdictionary
for more information.

> Let's say i query for the term "hello" many times within different queries,
> the O.S will load into memory the matching 4k chunk from the Dictionary and
> frq. If most of my terms are garbage, much of the Dictionnary chunk will be
> useless, whereas the frq chunk will be more efficiently used as it contains
> all the  list. Still i'm not sure a typical 
> chunk per term gets to 4k.

Postings lists are compressed and most terms are usually present in
only a few documents so most postings lists are likely much smaller
than 4kb.

> If my assumption's right, i should lower down the memory chunks (through
> the OS) to about the 0.9th percentile of the  chunk for
> a single term in the frq (neglecting for instance the use of prx and
> termVectors). Any cons to the idea? Do you have any estimation of the
> magnitude of a frq chunk for a N-times occuring term, or how can i check it
> on my own.

I've never tuned this myself. I guess the main issue is that it
could increase bookkeeping (to keep track of the pages) and thus CPU
usage.

Unfortunately the size of the postings lists is hard to predict
because it depends on the data. They compress better when they are
large and evenly distributed across all doc IDs. You could try to
compare the sum of your doc freqs with the total byte size of the
postings list to get a rough estimate.

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Too many unique terms

2013-04-29 Thread Adrien Grand
Hi,

On Mon, Apr 29, 2013 at 10:38 PM, Manuel Le Normand
 wrote:
> I want to make sure: iterating with the TermsEnum will not delete all the
> terms occuring in the same doc that includes the single term, but only the
> single term right?
> Going through the Class TermEnum i cannot find any "delete" method, how can
> i do this?

Sorry, I wasn't clear: deleting a term will delete all documents
that match this term.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: lucene and mongodb

2013-05-14 Thread Adrien Grand
Hi,

On Tue, May 14, 2013 at 10:35 AM, Rider Carrion Cleger
 wrote:
> - Can I store the lucene index in a mongodb database ?

I don't know whether it's possible, but even if it were, I would not
recommend it. Lucene works best on local filesystems, and even better
if the disk is an SSD. If your intention is to rely on MongoDB to
scale the index, it is better to handle distribution on top of Lucene,
as Solr and Elasticsearch do, rather than writing distributed
implementations of org.apache.lucene.store.Directory.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: lucene and mongodb

2013-05-14 Thread Adrien Grand
Hi,

On Tue, May 14, 2013 at 1:34 PM, Rider Carrion Cleger
 wrote:
> So, can I have for sure scalability and safety with a  distribution on top
> of Lucene like Solr ?

Yes, Solr can help you shard your index and add replicas, see
http://wiki.apache.org/solr/SolrCloud.

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: how to get max value of a long field?

2013-05-17 Thread Adrien Grand
Hi,

On Fri, May 17, 2013 at 11:10 AM, Hu Jing  wrote:
> I want to know the max value of a long field.
> I read lucene api , but don't find any api about this?
> does someone can supply any hits about how to implement this.

To do this efficiently, your field needs to have doc values[1].

First, iterate over your DirectoryReader leaves[2]. Then for every
AtomicReaderContext.reader(), get a NumericDocValues instance for your
field[3]. Then iterate over the values to compute the maximum value:

IndexReader rd;
long max = Long.MIN_VALUE;
for (AtomicReaderContext ctx : rd.leaves()) {
  final NumericDocValues longs = ctx.reader().getNumericDocValues("my_long_field");
  final Bits liveDocs = ctx.reader().getLiveDocs();
  for (int i = 0; i < ctx.reader().maxDoc(); ++i) {
    if (liveDocs != null || liveDocs.get(i)) {
      max = Math.max(max, longs.get(i));
    }
  }
}

[1] 
http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/document/NumericDocValuesField.html
[2] 
http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/index/IndexReader.html#leaves()
[3] 
http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/index/AtomicReader.html#getNumericDocValues(java.lang.String)

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: how to get max value of a long field?

2013-05-17 Thread Adrien Grand
On Fri, May 17, 2013 at 11:36 AM, Adrien Grand  wrote:
> if (liveDocs != null || liveDocs.get(i)) {

Sorry, I meant "if (liveDocs == null || liveDocs.get(i)) {".

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 4.2 DocValues

2013-05-28 Thread Adrien Grand
On Tue, May 28, 2013 at 4:48 PM, Arun Kumar K  wrote:
> Hi Guys,

Hi,

> I have been trying to understand DocValues and get some hands on and have
> observed few things.
>
> I have added LongDocValuesField to the documents like:
> doc.add(new LongDocValuesField("id",1));
>
> 1> In 4.0 i saw that there are two versions for docvalues,
>  RAM Resident(using Sources.getSOurces())  & On
> Disk(Sources.getDirectSources()).
>
>  But in 4.2 i get LongDocValues using
> "context.reader().getNumericDocValues(field) ". Which type is this ?
>  If this RAM based then is there any Disk-Based equivalent ?

Indeed, doc values have changed a lot between 4.1 and 4.2. The way doc
values are stored now depends on the DocValuesFormat. For example, the
default format (Lucene42DocValuesFormat) today stores data in memory
while we also have DiskDocValuesFormat (in lucene/codecs) which stores
data on disk.

> 2> Can DocValuesField be used for search ? I coudn't. Did i miss something?
>  "searcher.search(parser.parse("docvaluedfield:value"),100)"

Yes and no. The query parser can't deal with it, but for example, you
could use FieldCacheRangeFilter to build a range query (potentially
matching a single value) on top of doc values. (When a field has doc
values, Fieldcache will automatically use them instead of uninverting
the field). While this will likely be slower for thin ranges, this
should be very fast (probably even faster than a range query based on
the terms dictionary) for large ranges that match many documents.
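
A rough sketch of that approach (the field name and bounds are
illustrative):

Filter filter = FieldCacheRangeFilter.newLongRange("id", 42L, 42L, true, true);
TopDocs hits = searcher.search(new ConstantScoreQuery(filter), 100);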

>  I am able to use for sorting.
>  If possible i want to avoid having a stored field in index with same
> "name" & "Value" of DocValueField of same
>  document and perform search.

While you can do that, I don't recommend it. For example, if you have
5 fields, loading all fields from stored fields requires at most 1
disk seek while loading all fields from doc values requires at least 5
disk seeks for disk-based doc values.

> 3> I have a reader opened on DirectoryReader with the docBaseInParent value
> as 0 (first documents internal ID).
>  Even when i delete the first added document (with internal docID = 0)
> using some query the docBaseInParent is not
>  updated to 1(next documents internal ID). I have committed writer,
> forceMergeDeletes but it's the same.
>  I have also seen getLiveDocs().
>
> Just curious to know the reasons for not updating the docBase ?

Everything in Lucene is based on the fact that segments are immutable
up to deletes. Starting to mutate internal data such as
docBaseInParent would make the design much more complicated (hence
harder to reason about, to optimize, etc.).

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 4.2 DocValues

2013-05-28 Thread Adrien Grand
On Tue, May 28, 2013 at 8:55 PM, Arun Kumar K  wrote:
> Thanks for clarifying the things.
> I have some doubts regarding sorting :
>>
>> While you can do that, I don't recommend it. For example, if you have
>> 5 fields, loading all fields from stored fields requires at most 1
>> disk seek while loading all fields from doc values requires at least 5
>> disk seeks for disk-based doc values.
>
>
> 1> I am assuming those mentioned 5 fields are sortable fields upon which 
> sorting is done.
> In my understanding, loading stored fields takes 1 disk seek for finding file 
> pointer & 1 disk seek for getting all those fields.

This was correct until Lucene 4.0, but since 4.1, Lucene stores the
doc ID -> file pointer mapping in memory, ensuring at most 1 disk
seek.

> Since different file is maintained for a particular doc value field. We get 5 
> disk seeks + 1 disk seek for file pointer.

There is no general rule since this depends on the doc values type and
the codec implementation, but you got the idea.

> If we have only one sortable field , which could be better ? I guess no diff.

Just to make things clear, before Lucene had doc values, sorting was
performed based on the inverted index (which was uninverted and stored
in memory using FieldCache), not stored fields. Stored fields are bad
for sorting because they are usually large and don't play nice with
the file system cache.

Doc values are very similar to FieldCache except that the hard work is
done at indexing time instead of searching time. This is a good
trade-off because it allows for faster loading of indexes and for
off-loading data to disk. It is never a bad idea to use doc values
for sorting.

> Also, I vaguely remember that there is some performance loss for sorting 
> based on string in lucene 4.0
> Then, will the decision change for String field or based on type of field ?

I don't see why String sorting would be slower. However, it is true
that String sorting requires a lot of memory. If your field is a
number, you should definitely use a numeric field cache.

> 2> Also, In my understanding, if we need to use parser based queries for 
> docvalues, we need to have a storedfield for a doc with same name & value of 
> the doc's docvalue.
> Even term queries won't work. Am i right here?

QueryParser is completely unaware of your schema. If you want
QueryParser to use doc-values-based queries, you can override
QueryParser.newRangeQuery and/or QueryParser.newFieldQuery to return a
new ConstantScoreQuery that wraps a FieldCacheRangeFilter.
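For instance, something along these lines (a rough sketch; the long field name
"price" is an assumption, and the exact newRangeQuery signature changed a bit
across 4.x releases, so adapt it to your version):

QueryParser parser = new QueryParser(Version.LUCENE_42, "body", analyzer) {
  @Override
  protected Query newRangeQuery(String field, String part1, String part2,
                                boolean startInclusive, boolean endInclusive) {
    if ("price".equals(field)) {
      // range evaluated against the field cache / doc values, no term index needed
      return new ConstantScoreQuery(FieldCacheRangeFilter.newLongRange(
          field, Long.parseLong(part1), Long.parseLong(part2),
          startInclusive, endInclusive));
    }
    return super.newRangeQuery(field, part1, part2, startInclusive, endInclusive);
  }
};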

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: confirm subscribe to java-user@lucene.apache.org

2013-06-03 Thread Adrien Grand
Hi Manoj,

This may be related to the compression support which was added in Lucene
4.1. Although it improves performance on large indexes, it might prove to
be slightly slower on indexes that completely fit in the file-system cache,
especially if you fetch a large number of records at each request.
Compression can be disabled by using a custom codec but I don't recommend
it unless absolutely required.

-- 
Adrien


Re: Please add me as a wiki editor

2013-06-10 Thread Adrien Grand
Hi Lance,

On Mon, Jun 10, 2013 at 4:55 AM, Lance Norskog  wrote:
> I'm responsible for the OpenNLP wiki page:
> https://wiki.apache.org/solr/OpenNLP
>
> Please add me to the list of editors.

I just added you to the ContributorsGroup, please let me know if you
have trouble editing wiki pages.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: posting list traversal code

2013-06-13 Thread Adrien Grand
Hi,

On Thu, Jun 13, 2013 at 8:24 AM, Denis Bazhenov  wrote:
> Document id on the index level is offset of the document in the index. It can 
> change over time for the same document, for example when merging several 
> segments. They are also stored in order in posting lists. This allows fast 
> posting list intersection. Some Lucene API's explicitly state that they 
> operate on the document ids in order (like TermDocs), some allows out of 
> order processing (like Collector). So it really depends.
>
> In case of SortingAtomicReader, as far as I know, it calculate document 
> permutation, which allows to have sorted docIDs on the output. So, it 
> basically relabel documents.

This is correct. The org.apache.lucene.index.sorter.Sorter.sort method
computes a permutation of the doc IDs which makes doc IDs sorted
according to the sort order. SortingAtomicReader is just a view over
an AtomicReader which uses this permutation to relabel doc IDs and
give the impression that the index is sorted. But this class is not
very interesting by itself and can be very slow to decode postings:
for each term it needs to load all postings into memory and sort them
before returning an enumeration of the doc IDs (see the
SortingDocsEnum class). It is only useful to sort indices offline with
IndexWriter.addIndexes or online with SortingMergePolicy.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: posting list traversal code

2013-06-13 Thread Adrien Grand
On Thu, Jun 13, 2013 at 7:56 PM, Sriram Sankar  wrote:
> Thank you very much.  I think I need to play a bit with the code before
> asking more questions.  Here is the context for my questions:
>
> I was at Facebook until recently and worked extensively on the Unicorn
> search backend.  Unicorn allows documents to be ordered by a static rank in
> the posting lists and retrieval (based on a query) will therefore return
> documents in decreasing order of importance (by static rank).  We only
> retrieved a limited number of these documents - so static rank is the first
> filter.  These documents are scored which becomes the second filter.  This
> two level filtering ended up working very well.
>
> I am trying to see how we can do the same with Lucene.  It looks like we
> can build an index offline (say in Hadoop) and use SortingAtomicReader,
> etc. to get the posting lists in static rank order.  Then everything works
> great. However, as we update the index with new data, we end up with
> additional segments.

The use-case you are describing is why we built SortingMergePolicy:
this merge policy sorts segments at merging time. The consequence is
that segments resulting from a flush are unsorted while segments
resulting from a merge are sorted. This is an interesting property
because one can then early-terminate collection on the large segments
(which necessarily result from a merge) which are the most costly to
collect.

> To get the same functionality as Unicorn, it looks
> like we will have to sort these additional segments by the same static
> rank, and then traverse the posting lists in these multiple segments
> simultaneously (rather than segment by segment).  I am trying to see how
> easy this will be to achieve.

Traversing the postings lists simultaneously to collect documents by
strictly increasing rank is an option, but segment-by-segment collection
might be faster (depending on the number of segments to collect, the
likelihood of new documents having better ranks, etc.): although you
will over-collect some segments, you no longer need to make a
decision after each collected document about which segment to
collect next. I think both options are worth trying.

> To get more specific, the all documents in Unicorn have two attributes - an
> id (called fbid) which is unique and never changes - so for example the
> fbid for the Lucene page is 104032849634513 (so you can always get it as
> www.facebook.com/104032849634513), and a static rank.  The static rank may
> not be unique (multiple docs may share the same static rank, in which case
> we could order them arbitrarily so long as they are ordered the same way in
> all posting lists).

SortingAtomicReader and SortingMergePolicy ensure this as well by
using a stable sorting algorithm to sort documents. Documents which
have the same rank will remain in index order through the sorting
process.

> It looks like we will need an fbid to docid mapping in Lucene for each
> segment so that the simultaneous calls to advance() in each segment can be
> in sync.

Although I understand why you would need a docid -> fbid mapping
(likely through a stored field or a doc values field), I'm not sure to
understand why you need a fbid -> docid mapping.

> This is how far I've thought things out - let me play with the code a bit
> and I may have more questions in a couple of days.

You are welcome. Online sorting is rather new in Lucene so there are
likely many things that could be improved!

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: segments and sorting

2013-06-15 Thread Adrien Grand
Hi,

On Fri, Jun 14, 2013 at 11:24 PM, Sriram Sankar  wrote:
> For my use case of having all docs sorted by a static rank and being able
> to cut off retrieval after a certain number of docs, I have to sort all my
> docs using the static rank (and Lucene 4 has a way to do this).
>
> When an index has multiple segments, how does this sorting work?  Is each
> segment sorted independently?  Or is it possible for me to control this -
> and have a single segment?

You can sort each segment independently or have a single segment; both
options are available. To have a single segment, you just need to wrap
your top-level index reader with SlowCompositeReaderWrapper before
wrapping it again in a SortingAtomicReader and calling
IndexWriter.addIndexes.
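Roughly (a sketch against the 4.3-era sorter module; "rankSorter" stands for
whatever Sorter implementation you use, e.g. one based on a static-rank field):

DirectoryReader unsorted = DirectoryReader.open(sourceDir);
AtomicReader composite = new SlowCompositeReaderWrapper(unsorted); // single-segment view
AtomicReader sorted = SortingAtomicReader.wrap(composite, rankSorter);

IndexWriter writer = new IndexWriter(targetDir,
    new IndexWriterConfig(Version.LUCENE_43, analyzer));
writer.addIndexes(sorted); // writes one sorted segment into the target index
writer.close();
unsorted.close();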

> Assuming I have a single segment, are there any other constraints?  I read
> somewhere that FieldValue's have a limit of 2Gb per segment - is this true?

What do you mean with "FieldValue"? If you are referring to stored
fields, a single field value cannot be larger than 2B because the API
uses ints. But some codecs enforce lower limits, for example the
current default stored fields format enforces that the sum of the
sizes of all fields of a _single_ document is less than 2GB (which is
already much more than what typical users need). I think the major
limitation is that a single Lucene index cannot have more than 2
billion documents, but you can store your data into several physical
shards to work around this limitation and merge results at searching
time.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene pointing to existing DB Index

2013-06-15 Thread Adrien Grand
Hi,

On Sat, Jun 15, 2013 at 6:55 AM, Pradeep B  wrote:
> Hi
> I have just started out on lucene and experimenting with some possibilities.
> My goal is to try to exploit an existing database index (which in my case
> is an inverted index)  to serve as a Lucene Index.
> this helps me avoid need of additional indexing time and storage space.
>
> Is this possible ?

You might be able to write a custom codec to do it. But this would be
a very tedious task and the result would likely be much slower than
Lucene's default codec (because you would need to keep on doing
translations between Lucene's doc IDs and your database index internal
IDs), so I wouldn't recommend doing it.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: merging policy is not triggered behind the scene

2013-06-15 Thread Adrien Grand
Hi Lei,

On Fri, Jun 14, 2013 at 1:06 AM, Reg  wrote:
> I noticed if I do the merging in the following way,
> IndexWriter.mabyeMerge() is never triggered automatically by the merge
> scheduler.
>
>
> IndexWriter writer = ...;
>
> IndexReader[] readers = ...;
>
> writer.addIndexes(readers)
>
> writer.close();
>
>
> Is it a bug or by design?

It is by design. For example, if you are adding several batches of
IndexReaders into your index, your MergePolicy will be able to make a
much better decision about which segments to merge after all batches have
been added (by calling maybeMerge explicitly).
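Something like this (a sketch; "batches" is just a placeholder for however you
obtain your readers):

IndexWriter writer = new IndexWriter(dir, config);
for (IndexReader[] batch : batches) {
  writer.addIndexes(batch);
}
writer.maybeMerge(); // explicitly ask the MergePolicy to pick merges now
writer.close();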

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: segments and sorting

2013-06-18 Thread Adrien Grand
On Tue, Jun 18, 2013 at 1:05 AM, Sriram Sankar  wrote:
> I'm sorry - I meant "DocValue" not "FieldValue".  Slide 20 in the following
> deck talks about the 2Gb limit.

Doc values don't have this limit anymore. However, there is a limit of
~32kb per term, but this shouldn't be a problem with reasonable
use-cases for doc values.

These slides are talking about the pre-4.0 API, and the doc values API
has been completely refactored in 4.2. Although the concepts are the
same, it may be non-trivial to translate the code examples to the new
API.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Upgrading from 3.6.1 to 4.3.0 and Custom collector

2013-06-18 Thread Adrien Grand
Hi,

You didn't say specifically what your problem is so I assume it is
with the following method:

On Tue, Jun 18, 2013 at 4:37 AM, Peyman Faratin  wrote:
>   public void setNextReader(IndexReader reader, int docBase) 
> throws IOException{
> this.docBase = docBase;
> store = FieldCache.DEFAULT.getStrings(reader,"title");
>   }

setNextReader now takes an AtomicReaderContext as an argument and
FieldCache.getStrings is now FieldCache.getTerms, so this would give
something like

private BinaryDocValues store;

public void setNextReader(AtomicReaderContext ctx) throws IOException {
  this.docBase = ctx.docBase;
  this.store = FieldCache.DEFAULT.getTerms(ctx.reader(), "title");
}

public void collect(int doc) throws IOException {
  BytesRef page = new BytesRef();
  store.get(doc, page);
  if (page.bytes != BinaryDocValues.MISSING) {
    outLinks.add(page.utf8ToString());
  }
}

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: segments and sorting

2013-06-19 Thread Adrien Grand
Hi,

On Wed, Jun 19, 2013 at 12:16 AM, Sriram Sankar  wrote:
> Is it possible to do this more efficiently using a merge sort?  Assuming
> the individual segments are already sorted, is there a wrapper that I can
> use where I can pass the same sorting function?  I'm guessing the
> SlowCompositeReaderWrapper does not assume that the individual segments are
> already sorted and therefore would repeat the work?

Given that online sorting is rather new to Lucene, we tried to keep it
simple. Merging segments in parallel by maintaining a priority queue
is totally doable and is probably one of the next steps for online
sorting but it would require some non-trivial work to reimplement
merging for all formats (postings lists especially) and to be able to
plug a custom SegmentMerger into the IndexWriter.

For now, we just make sure that sorting a SlowCompositeReaderWrapper
which wraps several sorted segments is faster than sorting a random
AtomicReader by using TimSort to compute the mapping between the old
and the new doc IDs and to sort all individual postings lists.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Doing concurrent searches efficiently

2013-06-19 Thread Adrien Grand
Hi Roberto,

On Wed, Jun 19, 2013 at 12:57 PM, Roberto Ragusa  wrote:
> Hi,
>
> I would like an expert opinion about how to optimally do concurrent
> searches on the same index (let's suppose there are several threads
> doing searches). Consider these options:
>
> a) one IndexReader, all threads use it
> b) cloned IndexReader's, each thread uses a clone
> c) opened IndexReader's, each thread open and keeps its own IndexReader

The best option is a): Lucene will internally clone, per thread, the
data structures that don't support concurrent access.

> What I would like to achieve is:
> z) let each thread have an atomic view of the index, i.e. do a reopen when it 
> wants
> (and with no coordination with other threads)
>
> I think z) is impossible to achieve with a).

The best way to achieve this is to use SearcherManager[1][2].

[1] 
http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/SearcherManager.html
[2] 
http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html
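Typical usage looks roughly like this (a sketch; there is also a constructor
that takes an IndexWriter if you want near-real-time search):

SearcherManager manager = new SearcherManager(directory, new SearcherFactory());

// per search request, in any thread:
IndexSearcher searcher = manager.acquire();
try {
  TopDocs hits = searcher.search(query, 10);
  // ... render hits ...
} finally {
  manager.release(searcher); // never use the searcher after releasing it
}

// whenever a thread wants to see the latest changes:
manager.maybeRefresh();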

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: build of trunk hangs

2013-06-20 Thread Adrien Grand
Hi,

On Thu, Jun 20, 2013 at 5:59 PM, Tom Burton-West  wrote:
> I'm trying to build trunk and when I run "ant compile"
> the build hangs right after "Building replicator"  at the line
> "common.resolve:". (see below for more context)
>
> I'm not familiar with Ivy so I'm not too sure where to look for the problem.
> Can someone point me to the FAQ or the appropriate resource to figure out
> what is going on?

This is likely due to a zombie lock file in your Ivy cache. You should
look for files ending with .lck or .lock in your ~/.ivy2/cache and
remove them.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Payload Matching Query

2013-06-20 Thread Adrien Grand
Hi Michal,

Although payloads can be used at query time to customize scoring, they
can't be used for searching. Lucene only supports searching on terms.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Stored fields: decompression slows down in my scenario ... any idea for a workaround?

2013-06-24 Thread Adrien Grand
Hi,

On Sun, Jun 23, 2013 at 9:08 PM, Savia Beson  wrote:
> I think Mathias was talking about the case with many smallish fields that all 
> get read per document.  DV approach would mean seeking N times, while stored 
> fields, only once? Or you meant he should encode all his fields  into single 
> byte[]?
>
> Or did I get it all wrong about stored vs DV :)

No, this is correct. But in that particular case, I think the best
option depends on how data is queried: if all features are always used
together then it makes sense to encode them all in a single
BinaryDocValuesField. On the other hand, if queries are more likely to
require only a subset of the features, encoding each feature in a
different field makes more sense.

> What helped a lot in a similar case was to make own codec and reduce chunk 
> size to something smallish, depending on your average document size… there is 
> a sweet spot somewhere compression/speed.

This would indeed make decompression faster on an index that fits in
the file-system cache, but as Uwe said, stored fields should only be
used to display search results. So requiring 100µs to decompress data
per document is not a big deal since you are only going to load 20 or
50 documents (size of a page of results). It is more important to help
the file-system cache to prevent actual random accesses to happen as
they can easily take 10ms on magnetic storage.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Stored fields: decompression slows down in my scenario ... any idea for a workaround?

2013-06-24 Thread Adrien Grand
Hi,

On Mon, Jun 24, 2013 at 2:47 PM, Mathias Lux  wrote:
> Still, I've read that all the BinaryDocValues go directly to memory.
> Am I right with this?

It is true that the current default implementation stores them in
memory. However, disk doc values formats can be configured on a
per-field basis, so you could just write:

Codec codec = new Lucene42Codec() {

  final DiskDocValuesFormat diskDVF = new DiskDocValuesFormat();

  @Override
  public DocValuesFormat getDocValuesFormatForField(String fieldName) {
    return diskDVF;
  }

};

to store them on disk instead (add conditions on fieldName if you want
to have different behaviors based on the field name).
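Then plug the codec into your IndexWriterConfig, for example:

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_42, analyzer);
config.setCodec(codec); // newly flushed/merged segments will use the disk format
IndexWriter writer = new IndexWriter(directory, config);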

> I've also tried to change the codec, but I'm stuck with the
> IndexReader. It throws

This is because you defined a new custom codec (with a unique name to
identify it) without registering it in
META-INF/services/org.apache.lucene.codecs.Codec in your classpath. Note that
the example above doesn't require you to register a different codec
since it is fully compatible with Lucene42Codec and uses the same
name.

> Also I understand that the APIs are still experimental and in no way
> stable. As I'm quite a lazy programmer I'd like to hear you opinion on
> how stable the APIs for BinaryDocValues and Codec might be? :)

I can't predict the future :), but given the time and energy that has
been put into the doc values APIs for the 4.2 release (thanks
Robert!), I'd say that they shouldn't change much in the next months.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Stored fields: decompression slows down in my scenario ... any idea for a workaround?

2013-06-25 Thread Adrien Grand
Hi,

On Mon, Jun 24, 2013 at 6:13 PM, Mathias Lux  wrote:
> When searching for an image within memory I came down to 44ms.
> Therefore, 77ms is totally acceptable in these terms. My benchmarking
> of the BinaryDocValuesField showed that it'd come close to the 44ms,
> but I didn't go for a full evaluation as a lot of re-coding was
> needed.

Even though stored fields performance looks good with small blocks, I
encourage you to use doc values instead, which have been designed
exactly for the kind of use-case you are describing!

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Securing stored data using Lucene

2013-06-25 Thread Adrien Grand
On Tue, Jun 25, 2013 at 1:03 PM, Rafaela Voiculescu
 wrote:
> Hello,

Hi,

> I am sorry I was not a bit more explicit. I am trying to find an acceptable
> way to encrypt the data to prevent any access of it in any way unless the
> person who is trying to access it knows how to decrypt it. As I mentioned,
> I looked a bit through the patch, but I am not sure of its status.

You can encrypt stored fields, but there is no way to do it correctly
with fields that have positions indexed: attackers could infer the
actual terms based on the order of terms (the encrypted version must
sort the same way as the original terms), frequencies and positions.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: In memory index (current status in Lucene)

2013-07-04 Thread Adrien Grand
On Tue, Jul 2, 2013 at 10:09 AM, Toke Eskildsen  
wrote:
> I wonder if Java's ByteBuffer could be used to make a more GC-friendly
> RAMDirectory?

For the record, there is an open issue about it:
https://issues.apache.org/jira/browse/LUCENE-2292.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Please Help solve problem of bad read performance in lucene 4.2.1

2013-07-07 Thread Adrien Grand
Indeed, Lucene 4.1+ may be a bit slower for indices that completely
fit in your file-system cache. On the other hand, you should see
better performance with indices which are larger than the amount of
physical memory of your machine. Your reading benchmark only measures
IndexReader.document(int), which should only be used to display summary
results (that is, only called 10 or 20 times per displayed page). Most
of the time, the bottleneck is rather searching, which can be made more
efficient on small indices by switching to an in-memory postings
format.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NRT + static rank based sorting

2013-07-09 Thread Adrien Grand
Hi Sriram,

On Tue, Jul 9, 2013 at 5:06 AM, Sriram Sankar  wrote:
> I've finally got something running and will send you some performance
> numbers as promised shortly.  In the meanwhile, I've a question regarding
> the use of real time indexing along with ordering by static rank.  Before
> each search, I do the reopen as follows:
>
> public void refresh() throws IOException {
> DirectoryReader r = DirectoryReader.openIfChanged(reader);
> if (r != null) {
> reader.close();
> reader = r;
> this.live = SortingAtomicReader.wrap(
> new SlowCompositeReaderWrapper(reader),
> new StaticRankSorter());
> }
> }
>
> This works fine.  However, I believe the index is resorted everytime I
> reopen the index.  Ideally, it would be nice to do the sort more
> incrementally each time a new document gets added.  I assume that this is
> not easy - but just in case you have ideas, I'd like to hear them.

I think a good trade-off could be to fully collect the small segments
that come from incremental updates. Since they are small, collecting
them will be fast anyway. On the contrary, the bottleneck is likely
the collection of large segments. This is why we chose to tackle the
problem of online sorting using a merge policy (SortingMergePolicy).
Segments are only sorted when merging, meaning that small NRT
(flushed) segments won't be sorted but large (merged) segments will
be.

Then computing the top hits is just a matter of computing the best
hits on every segment and merging them into a single hit list:
 - for flushed segments, you need to fully collect them like Lucene
does by default,
 - for sorted segments, you can early-terminate collection on a
per-segment basis when enough matches have been collected (see the sketch below).
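As an illustration, here is a rough sketch of such an early-terminating
collector. It assumes you only wrap it around segments that you know are
sorted; flushed segments should be collected with the plain delegate:

final class EarlyTerminatingCollector extends Collector {
  private final Collector delegate;
  private final int numHits;
  private int collected;

  EarlyTerminatingCollector(Collector delegate, int numHits) {
    this.delegate = delegate;
    this.numHits = numHits;
  }

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    delegate.setScorer(scorer);
  }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    collected = 0; // reset the counter for every segment
    delegate.setNextReader(context);
  }

  @Override
  public void collect(int doc) throws IOException {
    delegate.collect(doc);
    if (++collected >= numHits) {
      // within a sorted segment, later docs can only have a worse static rank
      throw new CollectionTerminatedException();
    }
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return false; // we rely on in-order collection
  }
}

IndexSearcher catches CollectionTerminatedException and simply moves on to the
next segment, so the rest of the search is unaffected.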

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: posting list strings

2013-07-09 Thread Adrien Grand
Hi,

Lucene stores the string because it may need it to run prefix or range
queries. We don't have a hash-based terms dictionary right now but I
know some people wrote one since they don't need support for these
queries, see for instance the Earlybird paper[1]. Then if you can find
a perfect hashing function, you can just replace your terms by their
hash.

[1] http://www.umiacs.umd.edu/~jimmylin/publications/Busch_etal_ICDE2012.pdf

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Another question on sorting documents

2013-07-18 Thread Adrien Grand
Hi,

On Thu, Jul 18, 2013 at 7:15 AM, Sriram Sankar  wrote:
> The approach we have discussed in an earlier thread uses:
>
> writer.addIndexes(new SortingAtomicReader(...));
>
> I want to confirm (this is not absolutely clear to me yet) that the above
> call will not create multiple segments - i.e., the output will be optimized.

All the provided readers will be merged into a single segment but if
your index already has segments, it will have an additional one.

> We are also trying another approach - sorting the documents in Hadoop - so
> that we can repeatedly call writer.addDocument(...) providing documents in
> the correct order.
>
> How can we make sure that the final output contains documents in a single
> segment  and in the order in which they were added?

You can ensure that documents stay in the order in which they have
been added by using LogByteSizeMergePolicy or LogDocMergePolicy. However,
don't use TieredMergePolicy, which will happily merge non-adjacent
segments.

If this is an offline operation, you can just use LogByteSizeMergePolicy,
add documents in order and run forceMerge(1) when finished.
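For example (a sketch; "docsInStaticRankOrder" stands for your pre-sorted input):

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_44, analyzer);
config.setMergePolicy(new LogByteSizeMergePolicy()); // only merges adjacent segments
IndexWriter writer = new IndexWriter(dir, config);
for (Document doc : docsInStaticRankOrder) {
  writer.addDocument(doc);
}
writer.forceMerge(1); // end up with a single, ordered segment
writer.close();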

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Performance measurements

2013-07-24 Thread Adrien Grand
Hi,

On Wed, Jul 24, 2013 at 6:11 PM, Sriram Sankar  wrote:
> termA AND (termB1 OR termB2 OR ... OR termBn)

Maybe this comment is not appropriate for your use-case, but if you
don't actually need scoring from the disjunction on the right of the
query, a TermsFilter will be faster when n gets large.
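For example (a sketch; field and term names are placeholders):

Query scored = new TermQuery(new Term("field", "termA"));
Filter filter = new TermsFilter(
    new Term("field", "termB1"),
    new Term("field", "termB2"),
    new Term("field", "termBn"));
TopDocs hits = searcher.search(scored, filter, 10); // only termA contributes to the score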

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Query serialization/deserialization

2013-07-28 Thread Adrien Grand
Hi Denis,

Indeed, Query.toString() only tries to give a human-understandable
representation of what the query searches for and doesn't guarantee
that it can be parsed again into the same query. We don't
provide tools to serialize queries, but since query parsing is usually
lightweight compared to query execution, nodes can send queries to
each other in an unparsed representation and parse the query upon
reception.

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: getNumericDocValues

2013-07-29 Thread Adrien Grand
Hi,

On Mon, Jul 29, 2013 at 4:56 PM, Yonghui Zhao  wrote:
> I want to know what will be returned if  the input docID is not a valid id,
> for examples:
>
> 1. the docID beyonds the reader scope

In that case, the behavior is not defined: it might throw an exception
or return a random value. You should make sure you always call this
method with a valid document ID.

> 2. the doc with this docID is already deleted

It will return the same value it returned prior to the document
deletion. However, you shouldn't rely on this.

> 2. the doc with this docID doesn't has the field

In that case, the behavior is defined: it will return 0.

Numeric doc values are expected to be dense: all documents have a value. When
a document doesn't have a value, 0 will be assigned by default at
indexing time.

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Cache Field Lucene 3.6.0

2013-07-30 Thread Adrien Grand
Hi,

On Tue, Jul 30, 2013 at 4:09 PM, andi rexha  wrote:
> Hi, I have a stored and tokenized field, and I want to cache all the field 
> values.
>
> I have one document in the index, with the  "field.value"  =>   "hello world" 
>   and with tokens   => "hello", "world".
> I try to extract the fields content :
> String [] cachedFields  =  FieldCache.DEFAULT.getStrings(reader, 
> field.getName());
>
> The content of the cachedFields array is ["hello"].
>
> When I try to index other documents, I get also "null" as the value of the 
> field.
>
> Can somebody help me with that?

Lucene 3.6's field cache was created for sorting on single-valued
fields, which is why it only stores one token per field per document
and isn't suitable for multi-valued fields.

If you need such an API for multi-valued fields, I recommend upgrading
to Lucene 4 and using SortedSetDocValues[1].

[1] 
http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/index/AtomicReader.html#getSortedSetDocValues%28java.lang.String%29
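Reading the values back then looks roughly like this (a sketch, assuming the
field was indexed with SortedSetDocValuesField and that atomicReader/docId are
per-segment):

SortedSetDocValues dv = atomicReader.getSortedSetDocValues("tags");
if (dv != null) {
  dv.setDocument(docId); // docId is segment-local
  BytesRef scratch = new BytesRef();
  for (long ord = dv.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = dv.nextOrd()) {
    dv.lookupOrd(ord, scratch);
    String value = scratch.utf8ToString();
    // ... use value ...
  }
}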

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-07-30 Thread Adrien Grand
Hi,

On Tue, Jul 30, 2013 at 5:34 PM, Robert Muir  wrote:
> I'm not sure if there is a similar one for vectors.

There is, it has been done for stored fields and term vectors at the
same time[1].

[1] https://issues.apache.org/jira/browse/LUCENE-4928

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: sorting with lucene 4.3

2013-07-30 Thread Adrien Grand
Hi,

On Tue, Jul 30, 2013 at 8:19 PM, Nicolas Guyot  wrote:
> When sorting numerically, the search seems to take a bit of a while
> compared to the lexically sorted search.
> Also when sorting numerically the result is sorted within each page but no
> globally as opposed to the lexical sorted search.
>
> From my understanding, a SortedDocValuesField is sorted while indexing but
> not the NumericDocValuesField which is why we are facing those issues in
> our implementation. Is that correct ?

Sorted doc values are not exactly sorted, but Lucene computes all
unique values, sorts them and assigns an ordinal to every unique
value. These ordinals are then used at searching time to sort
documents. When comparing documents within the same segment, Lucene
directly uses the ordinals, while when there are documents from
different segments to compare, Lucene uses the values themselves
(slower).

I would expect sorting on a NumericDocValuesField to be faster since
longs can be used to directly compare documents across all segments.
Moreover, it is not normal that data is only sorted per page. Can you
write a small piece of code that reproduces the problem?

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Files greater than 20 MB not getting Indexed. No files generated except write.lock even after 8-9 minutes.

2013-08-30 Thread Adrien Grand
Ankit,

The stack traces you are showing only say there was an out of memory
error. In such cases, the stack trace is unfortunately not always
helpful since the allocation may fail on a small object because
another object is taking all the memory of the JVM. Can you come up
with a small piece of code that reproduces the error you are
encountering? This would help us see if there is something wrong in
the indexing code and try to debug it otherwise.

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Optimize Lucene 4.4 for CPU usage

2013-08-31 Thread Adrien Grand
Hi,

On Sat, Aug 31, 2013 at 6:55 AM, Rose, Stuart J  wrote:
> I've noticed that processes that were previously IO bound (in 3.5) are now 
> CPU bound (in 4.4) and I expect it is due to the compression/decompression of 
> term vector fields  in 4.4.
>
> It would be nice if users of 4.4 could turn the compression OFF entirely.

Even though the default Lucene codec just tries to make good
trade-offs regarding I/O vs. CPU usage for most use-cases, it is
possible that it is not optimal for your use-case. If this is a
problem, it is possible to change the trade-offs by writing a custom
codec.

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Making lucene indexing multi threaded

2013-09-02 Thread Adrien Grand
Hi,

Lucene's IndexWriter can safely accept updates coming from several
threads; just make sure to share the same IndexWriter instance across
all threads, no external locking is necessary (see the sketch below).
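A minimal sketch (the file parsing and the field names are assumptions):

final IndexWriter writer = new IndexWriter(dir,
    new IndexWriterConfig(Version.LUCENE_44, analyzer));
ExecutorService pool = Executors.newFixedThreadPool(4);
for (final File file : files) {
  pool.execute(new Runnable() {
    public void run() {
      try {
        Document doc = new Document();
        doc.add(new StringField("path", file.getPath(), Field.Store.YES));
        doc.add(new TextField("contents", new FileReader(file)));
        writer.addDocument(doc); // thread-safe, no external locking needed
      } catch (IOException e) {
        // log and skip this file
      }
    }
  });
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.HOURS); // wait for the queue to drain
writer.close();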

30 minutes sounds like a lot for 3 files unless they are large.
You can have a look at
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed which gives
good advice on how to improve Lucene indexing speed.

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene handling of duplicate terms

2013-09-05 Thread Adrien Grand
Hi,

On Thu, Sep 5, 2013 at 9:28 AM, Kristofer Karlsson  wrote:
> I have a use case where some of my documents have duplicate terms in
> various fields or within the same field.
>
> For an example, I may have a million documents with just the term "foo" in
> field A, and one particular document with the term "foo" in both field A
> and B, or have two terms "foo" in the same field.
>
> If I search for "foo foo" I would like to filter out all the documents with
> only one matching term - is this possible?

I don't think we have existing queries that allow for doing it
efficiently (if someone reads this and knows it is wrong, please
correct!). However, it should be doable to implement such a query
rather easily by iterating over the postings lists of the 'foo' term
in all the fields you are interested in, summing up frequencies (the
index must have been created with IndexOptions.DOCS_AND_FREQS or
higher) and only keeping documents whose sum of frequencies is at
least 2.
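A rough per-segment sketch of that idea (assuming two fields "A" and "B";
"reader" is an AtomicReader, and a real implementation would rather wrap this
logic in a Query/Scorer):

DocsEnum a = reader.termDocsEnum(new Term("A", "foo")); // null if the term is absent
DocsEnum b = reader.termDocsEnum(new Term("B", "foo"));
int docA = (a == null) ? DocIdSetIterator.NO_MORE_DOCS : a.nextDoc();
int docB = (b == null) ? DocIdSetIterator.NO_MORE_DOCS : b.nextDoc();
while (docA != DocIdSetIterator.NO_MORE_DOCS || docB != DocIdSetIterator.NO_MORE_DOCS) {
  int doc = Math.min(docA, docB);
  int totalFreq = 0;
  if (docA == doc) { totalFreq += a.freq(); docA = a.nextDoc(); }
  if (docB == doc) { totalFreq += b.freq(); docB = b.nextDoc(); }
  if (totalFreq >= 2) {
    // doc matches "foo foo"
  }
}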

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Strange performance of Lucene 4.4.0

2013-09-10 Thread Adrien Grand
Sort.INDEXORDER just lets you know about matching documents while by
default a score is computed and Lucene selects the top N matching
documents from your index.

On Mon, Sep 9, 2013 at 7:33 PM, Mirko Sertic  wrote:
> Ok, using Sort.INDEXORDER for default sorting is blazing fast. Just for my
> understanding, what is the difference between both methods? Is just
> unneccesary score computation the problem of the CPU peak?
>
> Thanks in advance
> Mirko
>
> Am 09.09.2013 13:43, schrieb Michael McCandless:
>
>> Sort.INDEXORDER, or, just make your own Collector, which should be
>> faster than INDEXORDER.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Mon, Sep 9, 2013 at 7:19 AM, Mirko Sertic  wrote:
>>>
>>> Dear Mike
>>>
>>> I need an API to disable Scoring without any sorting.
>>>
>>> Unfortunately every method in IndexSearcher where i can say doDocScores
>>> also require a not-null Sort instance. So what would be the best way to
>>> disable scoring and have no sorting, and archiving the same performance as
>>> the "Empty Sort()" bug? I should probably use Sort.INDEXORDER for sorting,
>>> right? It gives me the same result...
>>>
>>> Thanks in advance
>>> Mirko
>>>
>>>
>>>
>>> Gesendet: Montag, 09. September 2013 um 13:03 Uhr
>>> Von: "Michael McCandless" 
>>> An: "Lucene Users" 
>>> Betreff: Re: Re: Re: Strange performance of Lucene 4.4.0
>>> If new Sort() fails to sort by score, that's a bug! Can you please
>>> open a Jira issue?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Mon, Sep 9, 2013 at 5:21 AM, Mirko Sertic  wrote:

 Hi

 Basically i am running a load test. For every run i executed about 1
 million queries on the same index with the same query string, so it should
 be warmed up very well ;-) It performs about 8x faster with an empty Sort()
 instance than the first option. Still do not get it. An empty Sort instance
 should sort by score, according to the default constructor.

 Providing no sort instance finally invokes

 protected TopDocs search(Weight weight, ScoreDoc after, int nDocs)
 throws IOException

 while providing a Sort instance finally invokes

 protected TopFieldDocs search(Weight weight, FieldDoc after, int nDocs,
 Sort sort, boolean fillFields,
 boolean doDocScores, boolean doMaxScore) throws IOException

 with doDocScores and doMaxScore set to false.

 Seems like providing an empty Sort() instances should sort by score
 according to its default constructor. But no scoring is done by the
 IndexSearcher, so there is nothing so sort at all. So from this point of
 view the scoring computation does cause the slower queries.

 Regards
 Mirko

 Gesendet: Montag, 09. September 2013 um 09:55 Uhr
 Von: "Toke Eskildsen" 
 An: "java-user@lucene.apache.org" 
 Betreff: Re: Aw: Re: Strange performance of Lucene 4.4.0
 On Sun, 2013-09-08 at 15:15 +0200, Mirko Sertic wrote:
>
> I have to check, but my usecase does not require sorting or even
> scoring at all. I still do not get what the difference is...

 Please describe how you perform your measurements. How do you ensure
 that the index is warmed equally for the two cases?

 - Toke


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org

>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>



-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: possible latency increase from Lucene versions 4.1 to 4.4?

2013-09-16 Thread Adrien Grand
Hi John,

I just had a look at Mike's benchmarks[1][2], which don't show any
performance difference over approximately the last year. But they only test
a conjunction of two terms, so it might still be that latency worsened
for more complex queries.

[1] http://people.apache.org/~mikemccand/lucenebench/AndHighMed.html
[2] http://people.apache.org/~mikemccand/lucenebench/AndHighHigh.html

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A question about "seek past EOF: MMapIndexInput"

2013-09-18 Thread Adrien Grand
Hi,

This means that there is either a bug in Lucene or that your index is
corrupted. Can you reproduce this failure if you reindex data? The
output of CheckIndex would be interesting as well, see
http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/index/CheckIndex.html#main%28java.lang.String[]%29.

On Wed, Sep 18, 2013 at 6:39 AM, hao yan  wrote:
> Hi, folks
>
> I build lucene index using lucene-4.3.
>
> However, I found for a field, some terms are searchable while searching the
> others will throw the following exception:
>
> java.io.EOFException: seek past EOF:
> MMapIndexInput(path="/tmp/galeneTestData/input/base/index/_2ca_Lucene41_0.doc")
> at
> org.apache.lucene.store.ByteBufferIndexInput.seek(ByteBufferIndexInput.java:172)
> at
> org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.reset(Lucene41PostingsReader.java:407)
> at
> org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.docs(Lucene41PostingsReader.java:293)
> at
> org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docs(BlockTreeTermsReader.java:2188)
> at org.apache.lucene.index.TermsEnum.docs(TermsEnum.java:157)
> at
> org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:85)
> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:609)
> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:482)
> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:438)
> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:281)
> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:269)
> at
> com.senseidb.clue.commands.SearchCommand.execute(SearchCommand.java:59)
> at
> com.senseidb.clue.ClueApplication.handleCommand(ClueApplication.java:41)
> at com.senseidb.clue.ClueApplication.main(ClueApplication.java:76)
>
> Do you guys have any idea whey this happens!
>
> thanks a lot!
>
> hao



-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Position problems in 4.3.0

2013-09-18 Thread Adrien Grand
Hi,

This looks bad! Can you write a small test case that reproduces the
issue so that we can try to understand what happens here?

Thanks!

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IndexFileNameFilter

2013-09-18 Thread Adrien Grand
Hi,

Since Lucene 4.0 which introduced codecs, it is not possible anymore
to know based on filename extensions whether files have been created
by Lucene or not: every codec is free to use any file extension.

On Wed, Sep 18, 2013 at 1:03 PM, Yonghui Zhao  wrote:
> In lucene 4.3.0 there is no IndexFileNameFilter.
>
> And I find in org.apache.lucene.index.IndexFileNames the index file
> extensions have  only 3 types.
>
>
> public static final String INDEX_EXTENSIONS[] = new String[] {
> COMPOUND_FILE_EXTENSION,
> COMPOUND_FILE_ENTRIES_EXTENSION,
> GEN_EXTENSION,
>   };
>
>
> But there should be  many extensions  such as: fdt fdx fnm. ...
>
> I want to know if there is any elegant way to filter these extensions
> rather than list all extensions by myself.



-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to modify the Lucene 4 index?

2013-09-18 Thread Adrien Grand
Hi,

Are you talking about updating the content of the index or customizing
the file formats of the index?

On Tue, Sep 17, 2013 at 11:31 PM, Ralf Bierig  wrote:
> Hi all,
>
> is there any good documentation of how to change and modify the index of
> Lucene version 4 other than what is already on the website? Blogs, papers,
> reports etc. or just a report on experience in some form --- anything would
> be good.
>
> Based on an early-stage project, I would like to get first hand experience
> in order to get overview about what is possible and how difficult is it to
> do.
>
> Best,
> Ralf
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>



-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IndexFileNameFilter

2013-09-18 Thread Adrien Grand
Hi,

On Wed, Sep 18, 2013 at 1:39 PM, Yonghui Zhao  wrote:
> Got it. Currently I don't use any custom codecs.

Part of the problem is that even the current codec keeps evolving, and
file extensions that exist today might not be used anymore in 6 months
and vice-versa. I would strongly recommend not relying on a method
that would tell whether a file belongs to Lucene or not.

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to make good use of the multithreaded IndexSearcher?

2013-10-01 Thread Adrien Grand
Hi Benson,

On Mon, Sep 30, 2013 at 5:21 PM, Benson Margulies  wrote:
> The multithreaded index searcher fans out across segments. How aggressively
> does 'optimize' reduce the number of segments? If the segment count goes
> way down, is there some other way to exploit multiple cores?

forceMerge[1], formerly known as optimize, takes a parameter to
configure how many segments should remain in the index.

Regarding multi-core usage, if your query load is high enough to use
all your CPUs (there are always #cores queries running in parallel),
there is generally no need to use the multi-threaded IndexSearcher.
The multi-threaded index searcher can however help in case all CPU
power is not in use or if you care more about latency than throughput.
It leverages the fact that the index is split into segments
to parallelize query execution, so a fully merged index will actually
run the query in a single thread in any case.
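For reference, enabling it is just a matter of passing an ExecutorService to
the IndexSearcher constructor (a sketch; the pool must be shut down by the
caller when it is no longer needed):

ExecutorService pool =
    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
IndexSearcher searcher = new IndexSearcher(reader, pool);
TopDocs hits = searcher.search(query, 10); // fanned out across segments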

There is no way to make query execution efficiently use several cores
on a single-segment index so if you really want to parallelize query
execution, you will have to shard the index to do at the index level
what the multi-threaded IndexSearcher does at the segment level.

Side notes:
 - A single-segment index only runs terms-dictionary-intensive queries
more efficiently; it is generally discouraged to run
forceMerge on an index unless this index is read-only.
 - The multi-threaded index searcher only parallelizes query execution
in certain cases. In particular, it never parallelizes execution when
the method takes a collector. This means that if you want to use
TotalHitCountCollector to count matches, you will have to do the
parallelization by yourself.

[1] 
http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/index/IndexWriter.html#forceMerge%28int%29

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



[ANNOUNCE] Apache Lucene 4.5 released

2013-10-05 Thread Adrien Grand
October 2013, Apache Lucene™ 4.5 available

The Lucene PMC is pleased to announce the release of Apache Lucene 4.5

Apache Lucene is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for nearly
any application that requires full-text search, especially cross-platform.

This release contains numerous bug fixes, optimizations, and
improvements, some of which are highlighted below. The release
is available for immediate download at:
  http://lucene.apache.org/core/mirrors-core-latest-redir.html

See the CHANGES.txt file included with the release for a full list of
details.

Lucene 4.5 Release Highlights:

* Added support for missing values to DocValues fields through
AtomicReader.getDocsWithField.

* Lucene 4.5 has a new Lucene45Codec with Lucene45DocValues,
supporting missing values and with most datastructures residing
off-heap.

* New in-memory DocIdSet implementations which are especially better
than FixedBitSet on small sets: WAH8DocIdSet, PFORDeltaDocIdSet and
EliasFanoDocIdSet.

* CachingWrapperFilter now caches filters with WAH8DocIdSet by
default, which has the same memory usage as FixedBitSet in the worst
case but is smaller and faster on small sets.

* TokenStreams now set the position increment in end(), so we can
handle trailing holes.

* IndexWriter no longer clones the given IndexWriterConfig.

* Various bugfixes and optimizations since the 4.4 release.

Please read CHANGES.txt for a full list of new features.

Please report any feedback to the mailing lists
(http://lucene.apache.org/core/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases.  It is possible that the mirror you are using
may not have replicated the release yet.  If that is the case, please
try another mirror.  This also goes for Maven access.

-- 
Adrien Grand

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: optimal way to access many TermVectors

2013-10-08 Thread Adrien Grand
Hi,

On Mon, Oct 7, 2013 at 9:31 PM, Rose, Stuart J  wrote:
> Is there an optimal way to access many document TermVectors (in the same 
> chunk) consecutively when using the LZ4 termvector compression?
>
> I'm curious to know whether all TermVectors in a single compressed chunk are 
> decompressed and cached when one TermVector in the same chunk is accessed?

The main use-case for term vectors today being more-like-this and
highlighting, term vectors are generally accessed in no particular
order. This is why we don't cache the uncompressed chunk (it would
never get reused) so you need to decompress everytime you are
retrieving a document or its term vectors.

> Also wondering if there is a mapping of TermVector order to docID order? Or 
> is it always one to one? If docIds are dynamic, then presumably they are not 
> necessarily in the same order as their documents' corresponding term 
> vectors...

Term vectors are stored in doc ID order, meaning that for a given
segment, term vectors for document N are followed by term vectors for
document N+1.

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: external file stored field codec

2013-10-11 Thread Adrien Grand
On Fri, Oct 11, 2013 at 7:03 PM, Michael Sokolov
 wrote:
> I've been running some tests comparing storing large fields (documents, say
> 100K .. 10M) as files vs. storing them in Lucene as stored fields.  Initial
> results seem to indicate storing them externally is a win (at least for
> binary docs which don't compress, and presumably we can compress the
> external files if we want, too), which seems to make sense.  There will be
> some issues with huge directories, but that might be worth solving.
>
> So I'm wondering if there is a codec that does that?  I haven't seen one
> talked about anywhere.

I don't know about any codec that works this way, but such a codec
would quickly exceed the number of available file descriptors.

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: external file stored field codec

2013-10-13 Thread Adrien Grand
Hi Michael,

I'm not aware enough of operating system internals to know what
exactly happens when a file is opened, but it sounds like having
separate files per document or field adds levels of indirection when
loading stored fields, so I would be surprised if it actually proved
to be more efficient than storing everything in a single file.

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Retrieving values for a NumericDocValuesField [SEC=UNOFFICIAL]

2013-10-23 Thread Adrien Grand
Hi Stephen,

On Wed, Oct 23, 2013 at 9:29 AM, Stephen GRAY  wrote:
> UNOFFICIAL
> Hi everyone,
>
> I have a question about how to retrieve the values in a 
> NumericDocValuesField. I understand how to do this in situations where you 
> have an AtomicReaderContext available 
> (context.reader().getNumericDocValues(field)). However in a situation where I 
> have just done a search and only have a searcher available, and I want to get 
> the NumericDocValue in each document returned by the search, is there a way 
> to do this?

The purpose of doc values is usually scoring, sorting or faceting, but
here you actually want to use them as stored fields. I would
recommend storing this numeric field value twice, once as a
NumericDocValuesField and once as a StoredField.

Otherwise, what you want to do is still possible (but a bad
trade-off): for every document, you can find its AtomicReader by
calling ReaderUtil.subIndex on the IndexReader.leaves() of your index
reader (that you can get from the IndexSearcher by calling
IndexSearcher.getIndexReader).

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Merging ordered segments without re-sorting.

2013-10-23 Thread Adrien Grand
Hi,

On Wed, Oct 23, 2013 at 10:19 PM, Arvind Kalyan  wrote:
> Sorting is not an option for our case so we will most likely implement a
> variant that merges the segments in one pass. Using TimSort is great but in
> our case the 2 segments will be highly interspersed and would not benefit
> from the galloping in TimSort.

I think Shai didn't mention TimSort because of galloping, but because
it tries to discover the monotonic sequences in your data. In your
particular case, this means that the sorting would run in linear time
instead of O(n log(n)) given that the 2 segments to merge are sorted.

> In additional, if anyone else on the list has any inputs on implementing
> the merge (without sort) I'd appreciate it as well! More than likely I'll
> have followup questions if we decide to go this route.

The tricky part is postings merging. You need to be able to remap your
doc ids to their new value in the merged segment and this requires
sorting the doc ids.

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Retrieving values for a NumericDocValuesField [SEC=UNOFFICIAL]

2013-10-24 Thread Adrien Grand
Hi Stephen,

On Thu, Oct 24, 2013 at 1:18 AM, Stephen GRAY  wrote:
> I actually need to loop through a large number of documents (50,000 - 
> 100,000) calculating a number of statistics (min, max, sum) so I really need 
> the most efficient/fastest solution available. It sounds like it would be 
> best to just store the data in a stored field.

I see. For that many documents, doc values are actually the right
thing to use; sorry if I put you on the wrong track, I was assuming you
were only going to collect values from a few documents.

In your case the best option would be to split your doc ids according
to the segment they belong to, and then for each segment, get a
per-segment NumericDocValues instance and aggregate your statistics.
It is better than using MultiDocValues because MultiDocValues needs to
binary-search for the appropriate segment for every document.
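A sketch of what that could look like with a custom Collector (the field name
"myNumber" is an assumption, and segments that have no values for the field
need the null check):

final class StatsCollector extends Collector {
  long min = Long.MAX_VALUE, max = Long.MIN_VALUE, sum = 0, count = 0;
  private NumericDocValues values;

  @Override
  public void setScorer(Scorer scorer) {}

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    values = context.reader().getNumericDocValues("myNumber"); // per-segment instance
  }

  @Override
  public void collect(int doc) {
    if (values == null) {
      return;
    }
    long v = values.get(doc); // doc is already segment-local here
    min = Math.min(min, v);
    max = Math.max(max, v);
    sum += v;
    count++;
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true;
  }
}

Then run searcher.search(query, stats) with an instance you keep a reference
to, and read min/max/sum from it afterwards.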

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Merging ordered segments without re-sorting.

2013-10-24 Thread Adrien Grand
Hi,

On Thu, Oct 24, 2013 at 12:20 AM, Arvind Kalyan  wrote:
> I will benchmark the available approach itself then, in that case. Will
> revert back if the performance in unacceptable.

For the record, last time I checked, indexing was 2x slower on average
on a 10M document collection (see
https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13605896&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13605896)
and most of the time was not actually spent in sorting the doc IDs but
merging stored fields (because we bypass the specialized sequential
merging implementation which is usually used when merging segments without
sorting).

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene doubt

2014-02-17 Thread Adrien Grand
Hi Pedro,

Lucene indeed supports indexing data from several threads into a
single IndexWriter instance, and it will make use of all your I/O and
CPU. You can learn more about how it works at
http://blog.trifork.com/2011/05/03/lucene-indexing-gains-concurrency/

On Mon, Feb 17, 2014 at 3:54 PM, Pedro Cardoso  wrote:
> Good afternoon,
>
> I am using Lucene in developing a protect, however I was faced with a
> doubt.
>
> I wonder if a multi-thread system it is possible to write concurrently?
>
> Cumprimentos/ Best Regards
>
> *Pedro Cardoso*
>
>
> http://www.linkedin.com/pub/pedro-cardoso/54/243/60
>
>
>
> Sent with 
> MailTrack



-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Stored fields and OS file caching

2014-04-04 Thread Adrien Grand
Hi Vitaly,

Doc values are indeed well-suited for grouping and sorting. However
stored fields remain better at returning field values to users since
they guarantee a worst-case of one disk seek per document.

The filesystem cache typically caches data by blocks of 4KB. This
plays more nicely with doc values: given that they are stored in a
column-stride fashion, only the field values you actually read are
loaded into the filesystem cache. On the other hand, with stored fields,
data is stored sequentially in a very large file, so whenever you read a
single field value, a 4KB block of data gets loaded into the filesystem
cache that likely contains other fields' values that you are not
interested in.



On Sat, Apr 5, 2014 at 12:23 AM, Vitaly Funstein  wrote:
> I use stored fields to load values for the following use cases:
> - to return per-document values as is, requested by the user - similar to
> listing DB columns you are interested in, in a "select ..." clause.
> - to perform aggregate function calculations while forming the result set
> (if requested).
> - for group-by type queries (would like to switch to the native grouping
> API, but don't think it supports grouping on multiple fields, or aggregate
> functions).
> - and finally, as I mentioned - to sort search results, also when requested.
>
> Evidently, even for simple queries that don't require any of the
> post-processing above but ask for a set of values from each document,
> there's still a non-trivial amount of disk activity... hence, I started
> second-guessing the implementation.
>
>
> On Fri, Apr 4, 2014 at 3:00 PM, Uwe Schindler  wrote:
>
>> Hi,
>>
>> What are you doing with the stored fields? They are not deprecated and
>> also not really slow, unless you scan over millions of documents in random
access order. To display search results, DocValues are of no use.
>>
>> Uwe
>>
>> -
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>>
>> > -Original Message-
>> > From: Vitaly Funstein [mailto:vfunst...@gmail.com]
>> > Sent: Friday, April 04, 2014 9:44 PM
>> > To: java-user@lucene.apache.org
>> > Subject: Stored fields and OS file caching
>> >
>> > I have heard here that stored fields don't work well with OS file
>> caching.
>> > Could someone elaborate on why that is? I am using Lucene 4.6 and we do
>> > use stored fields but not doc values; it appears most of the benefit
>> from the
>> > latter comes as improvement in sorting performance, and I don't actually
>> use
>> > Lucene for sorting at all; rather, it's done on a post-processing basis,
>> based on
>> > stored field values (in a nutshell, the reason for this is Lucene's
>> inability to tell
>> > apart terms that are empty strings vs. a missing value, resulting in
>> unstable
>> > sort order on such fields).
>> >
>> > I am not sure if switching to using doc values fields from stored fields
>> entirely
>> > would help leverage OS file cache better... what worries me is that when
>> > processing queries requesting multiple values from the document, doc
>> value
>> > fields could cause multiple disk seeks to fetch values for each field, as
>> > opposed to just one with stored fields.
>> >
>> > Am I way off in my understanding of how this works? Any guidelines, as
>> > general as they may be, are appreciated.
>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>



-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Performance issues with the default field compression

2014-04-09 Thread Adrien Grand
Hi Alex,

Indeed, one or several documents (the number depends on their size)
need to be fully decompressed in order to read a single field of a
single document.

Regarding the stored fields visitor, the default one doesn't return
STOP when the field has been found because other fields with the same
name might be stored further in the stream of stored fields (in case
of a multivalued field). If you know that you have a single field
value, you can write your own field visitor that will return STOP
after the first value has been read. As you noted, this probably has
less impact on performance than the first point that you raised.
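
A minimal sketch of such a visitor could look like this (it assumes the
field is single-valued and stored as a string, like your ":path" field;
you would pass an instance to IndexReader.document(int, StoredFieldVisitor)
and then read getValue()):

  import java.io.IOException;
  import org.apache.lucene.index.FieldInfo;
  import org.apache.lucene.index.StoredFieldVisitor;

  // Sketch: stops visiting as soon as the single value of the target
  // field has been read.
  public class SingleStringFieldVisitor extends StoredFieldVisitor {
    private final String field;
    private String value;

    public SingleStringFieldVisitor(String field) {
      this.field = field;
    }

    @Override
    public Status needsField(FieldInfo fieldInfo) throws IOException {
      if (fieldInfo.name.equals(field)) {
        return Status.YES;                  // load this field's value
      }
      // nothing else to visit once the value has been found
      return value != null ? Status.STOP : Status.NO;
    }

    @Override
    public void stringField(FieldInfo fieldInfo, String value) throws IOException {
      this.value = value;
    }

    public String getValue() {
      return value;
    }
  }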

The default stored fields format is rather targeted at large indices
where compression helps save disk space and can also make stored
fields retrieval faster since a larger portion of the stored fields
can fit in the filesystem cache. However, if your index is small and
fully fits in the filesystem cache, this stored fields format might
indeed have non-negligible overhead.


On Wed, Apr 9, 2014 at 9:17 PM, Alex Parvulescu
 wrote:
> Hi,
>
> I was investigating some performance issues and during profiling I noticed
> that there is a significant amount of time being spent decompressing fields
> which are unrelated to the actual field I'm trying to load from the lucene
> documents. In our benchmark doing mostly a simple full-text search, 40% of
> the time was lost in these parts.
>
> My code does the following: reader.document(id, Set(":path")).get(":path"),
> and this is where the fun begins :)
> I noticed 2 things, please excuse the ignorance if some of the things I
> write here are not 100% correct:
>
>  - all the fields in the document are being decompressed prior to applying
> the field filter. We've noticed this because we have a lot of content
> stored in the index, so there is an important time lost around
> decompressing junk. At one point I tried adding the field first, thinking
> this will save some work, but it doesn't look like it's doing much.
> Reference code, the visitor is only used at the very end. [0]
>
>  - second, and probably of smaller impact, would be to have the
> DocumentStoredFieldVisitor return STOP when there are no more fields in the
> visitor to visit. I only have one, and it looks like it will #skip through
> a bunch of other stuff before finishing a document. [1]
>
> thanks in advance,
> alex
>
>
> [0]
> https://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressingStoredFieldsReader.java?view=markup#l364
>
> [1]
> https://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/document/DocumentStoredFieldVisitor.java?view=markup#l100



-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Reading a v2 index in v4

2014-06-09 Thread Adrien Grand
Hi,

It is not possible to read 2.x indices from Lucene 4, even with a
custom codec. For instance, Lucene 4 needs to hook into
SegmentInfos.read to detect old 3.x indices and force the use of the
Lucene3x codec since these indices don't expose what codec has been
used to write them.


On Mon, Jun 9, 2014 at 12:51 PM, Trejkaz  wrote:
> Hi all.
>
> The inability to read people's existing indexes is essentially the
> only thing stopping us upgrading to v4, so we're stuck indefinitely on
> v3.6 until we find a way around this issue.
>
> As I understand it, Lucene 4 added the notion of codecs which can
> precisely choose how to read and write the index content.
>
> So I wonder:
> 1. Is it technically possible to write a codec to read the v2 format?
> 2. If this is possible, has anyone already done it?
> (Surely we're not the first ones to want to do it.)
>
> TX
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>



-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: EarlyTerminatingSortingCollector help needed..

2014-06-21 Thread Adrien Grand
Hi Ravikumar,

On Fri, Jun 20, 2014 at 12:14 PM, Ravikumar Govindarajan
 wrote:
> If my "numDocsToCollect" = 50 and no.of. segments = 15, then
> collector.collect() will be called 750 times.

That is the worst case indeed. However, if some of your segments have
fewer than 50 matches, `collect` will only be called on those matches.

> When I use a SortField instead, then TopFieldDocs does the sorting for all
> segments and collector.collect() will be called only 50 times...

What do you mean by "When I use a SortField instead"? Unless you are
using early termination, Collector.collect is supposed to be called
for every matching document.

> Assuming a stored-field seek for every collector.collect(), will it be
> advisable to still persist with ETSC? Was it introduced as a trade-off b/n
> memory & disk?

I would not advise using the stored fields API, even in the context
of early termination. Doc values should be more efficient here.

The trade-off is not really about memory and disk. What it tries to
achieve is to make queries much faster provided that:
 - you can afford the merging overhead (ie. for heavy indexing
workloads, this might not be the best solution)
 - there is a single sort order that is used for most queries
 - you don't need any feature that requires to collect all documents
(like computing the total hit count or facets).
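
If it helps, here is roughly how the pieces fit together (a sketch only,
based on the 4.x misc module; please double-check the constructors against
the exact version you are on. "timestamp", analyzer, searcher and query
are placeholders):

  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.index.TieredMergePolicy;
  import org.apache.lucene.index.sorter.EarlyTerminatingSortingCollector;
  import org.apache.lucene.index.sorter.SortingMergePolicy;
  import org.apache.lucene.search.Sort;
  import org.apache.lucene.search.SortField;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.search.TopFieldCollector;
  import org.apache.lucene.util.Version;

  Sort sort = new Sort(new SortField("timestamp", SortField.Type.LONG));

  // Index time: segments are written and merged in the sort order.
  IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, analyzer);
  iwc.setMergePolicy(new SortingMergePolicy(new TieredMergePolicy(), sort));

  // Search time: stop collecting each segment after numDocsToCollect hits.
  int numDocsToCollect = 50;
  TopFieldCollector topCollector =
      TopFieldCollector.create(sort, numDocsToCollect, true, false, false, false);
  searcher.search(query,
      new EarlyTerminatingSortingCollector(topCollector, sort, numDocsToCollect));
  TopDocs top50 = topCollector.topDocs();   // globally ordered top hits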

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: EarlyTerminatingSortingCollector help needed..

2014-06-23 Thread Adrien Grand
On Sun, Jun 22, 2014 at 6:44 PM, Ravikumar Govindarajan
 wrote:
> For a normal sorting-query, on a top-level searcher, I execute
>
> TopDocs docs = searcher.search(query, 50, sortField)
>
> Then I can issue reader.document() for final list of exactly 50 docs, which
> gives me a global order across segments but at the obvious cost of memory...
>
> SortingMergePolicy + ETSC will make me do 50*N [N=no.of.segments] collects,
> which could increase cost of seeks when each segment collects considerable
> hits...

This is not correct. :) ETSC will collect segments one after another,
but in the end what you will get are the top hits across all segments.
This means that even though you have e.g. 15 segments, if you requested
50 documents, you will get the top 50 documents out of your
TopHitsCollector.

>  - you can afford the merging overhead (ie. for heavy indexing
>> workloads, this might not be the best solution)
>>  - there is a single sort order that is used for most queries
>>  - you don't need any feature that requires to collect all documents
>> (like computing the total hit count or facets).
>
>
> Our use-case fits perfectly on all these 3 points and that's why we wanted
> to explore this. But our final set of results must also be globally
> ordered. Maybe it's a mistake to assume that sorting can be entirely
> replaced with SMP + ETSC...

I don't think it is a mistake, this can help make the execution of
search requests significantly faster.

>> I would not advise using the stored fields API, even in the context
>> of early termination. Doc values should be more efficient here.
>
>
> I read your excellent blog on stored-fields compression, where you've
> mentioned that stored-fields now take only one random seek. [
> http://blog.jpountz.net/post/35667727458/stored-fields-compression-in-lucene-4-1
> ]
>
> If so, then what could make DocValues still a winner?

Yes. If you use e.g. 2 doc values fields to run your query, it is true
that the number of seeks in the worst case would be 2 for doc values
and only 1 for stored fields, so stored fields might look more
appropriate. However, doc values play much better with the operating
system thanks to column-stride storage since:
 - it allows for lightweight and efficient compression,
 - the filesystem cache doesn't get loaded on field values that you
are not interested in.

When wondering about stored fields vs doc values, the right trade-off
is usually to use:
 - stored fields when looking up several field values for a few documents,
 - doc values when loading a few field values for many documents.
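
Concretely, the first case usually looks like this (a sketch; "title" is a
placeholder field name and searcher/query are assumed to exist):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.search.TopDocs;

  // Stored fields: fetch a handful of values, e.g. to render one page of hits.
  TopDocs hits = searcher.search(query, 10);
  for (ScoreDoc sd : hits.scoreDocs) {
    Document doc = searcher.doc(sd.doc);    // at most one seek per document
    String title = doc.get("title");
    // ... render the hit
  }

The second case is the other way around: iterate over many documents and
read each value through a per-segment NumericDocValues instance.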


-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


