Index size increases 20% after switching from Lucene 4.4 to 4.5 or 4.8 with BinaryDocValuesField

2014-06-13 Thread Zhao, Gang

I used Lucene 4.4 to index some documents. One of the indexed fields is a 
BinaryDocValuesField. After I changed the dependency to Lucene 4.5, the index 
size for 1 million documents increased from 293MB to 357MB. Without the 
BinaryDocValuesField, the index size increased by only about 2%. I also 
tried Lucene 4.8; the index size is similar to the one produced by Lucene 4.5.

I am wondering what changed in how BinaryDocValuesField is handled between 4.4 
and 4.5 or 4.8.
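
To put the reported figures in perspective, here is a small plain-Java calculation (no Lucene dependency). It only restates the numbers above as a growth percentage and as extra bytes per document. The class and method names are made up for illustration, and it assumes that "MB" means 2^20 bytes.

```java
// Plain-Java arithmetic on the figures reported above; no Lucene classes involved.
final class IndexGrowth {
    // Assumption: "MB" in the report means 2^20 bytes.
    static final long BYTES_PER_MB = 1L << 20;

    /** Relative growth in percent when going from oldSize to newSize. */
    static double growthPercent(double oldSize, double newSize) {
        return (newSize - oldSize) / oldSize * 100.0;
    }

    /** Extra bytes per document implied by a size change given in MB. */
    static double extraBytesPerDoc(double oldMb, double newMb, long docCount) {
        return (newMb - oldMb) * BYTES_PER_MB / docCount;
    }

    public static void main(String[] args) {
        System.out.println("growth %: " + growthPercent(293, 357));
        System.out.println("extra bytes/doc: " + extraBytesPerDoc(293, 357, 1_000_000L));
    }
}
```

Under that assumption the growth is a little under 22%, which works out to roughly 67 additional bytes per document.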

Gang Zhao
Software Engineer - EA Digital Platform
207 Redwood Shores Parkway
Redwood City, CA 94065
Direct Line: 650-628-3719



Re: Facets in Lucene 4.7.2

2014-06-13 Thread Sandeep Khanzode
Hi Shai,
 
Thanks so much for the clear explanation.

I agree on the first question. Taxonomy Writer with a separate index would 
probably be my approach too.

For the second question:
I am a little new to the Facets API so I will try to figure out the approach 
that you outlined below.

However, the scenario is this: assume a document corpus that is indexed. For a 
user query, a document is returned and selected by the user for editing as part 
of some use case/workflow. The user then marks that document as historically 
interesting or not, financially relevant, specific to the media or entertainment 
domain, etc. So, essentially, the user is flagging the document with certain 
markers.
Another set of users may want to query on these markers. Say a second user comes 
along and wants to see the top documents belonging to one category, such as 
agriculture or farming. Since these markers are assigned at runtime, how can I 
use facets on them? I was envisioning the various markers as facets. But if I 
constantly re-index or update the documents whenever a marker changes, I believe 
it would not be very efficient.

Is there anything, facets or otherwise, in Lucene that can help me solve this 
use case? 

Please let me know. And, thanks!

---
Thanks n Regards,
Sandeep Ramesh Khanzode


On Friday, June 13, 2014 9:51 PM, Shai Erera  wrote:
 


Hi

You can check the demo code here:
https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_8/lucene/demo/src/java/org/apache/lucene/demo/facet/.
This code is updated with each release, so you always get working code
examples, even when the API changes.

If you don't mind managing the sidecar index, which I agree isn't such a
big deal, then yes - the taxonomy index currently performs the fastest. I
plan to explore porting the taxonomy-based approach from BinaryDocValues to
the new SortedNumericDocValues (coming out in 4.9) since it might perform
even faster.

I didn't quite get the marker/flag facet. Can you give an example? For
instance, if you can model that as a NumericDocValuesField added to
documents (w/ the different markers/flags translated to numbers), then you
can use Lucene's updatable numeric DocValues and write a custom Facets to
aggregate on that NumericDocValues field.

Shai



On Fri, Jun 13, 2014 at 11:48 AM, Sandeep Khanzode <
sandeep_khanz...@yahoo.com.invalid> wrote:

> Hi,
>
> I am evaluating Lucene Facets for a project. Since a lot has changed for
> Facets in 4.7.2, I am relying on the unit tests for reference. Please let
> me know if there are other sources of information.
>
> I have a couple of questions:
>
> 1.] All categories in my application are flat, not hierarchical. But it
> seems from a few sources that, even so, you would want to use a
> Taxonomy-based index for performance reasons. It is faster but uses
> more RAM. Or is the deterrent the fact that it is a separate
> data structure? If one can handle the life-cycle management of the extra
> index, should we go ahead with the taxonomy index for better performance
> across tens of millions of documents?
>
> Another note: I do not see a scenario in which I would want to re-index
> my collection over and over again; in other words, the changes would be
> spread over time.
>
> 2.] I need a type of dynamic facet that allows me to add a flag or marker
> to the document at runtime, since it will change every time a user
> modifies or adds to the list of markers. Is this possible with the
> current implementation? I ask because I believe that currently all
> faceting is done at indexing time.
>
>
> ---
> Thanks n Regards,
> Sandeep Ramesh Khanzode

JTRES 2014: Deadline extended to June 23

2014-06-13 Thread w...@dtu.dk
(Apologies if you receive multiple copies of this message.)

  DEADLINE EXTENDED TO JUNE 23, 2014
   
   The 12th International Workshop on Java Technologies for
 Real-time and Embedded Systems - JTRES 2014

 October 13th - 14th
Niagara Falls, NY, USA


   Call for Papers



MOTIVATION

Over 90% of all microprocessors are now used for real-time and
embedded applications. Embedded devices are deployed on a broad
diversity of distinct processor architectures and operating
systems. The application software for many embedded devices is custom
tailored if not written entirely from scratch. The size of typical
embedded system software applications is growing exponentially from
year to year, with many of today's embedded systems comprised of
multiple millions of lines of code. For all of these reasons, the
software portability, reuse, and modular composability benefits
offered by Java are especially valuable to developers of embedded
systems.

Both embedded and general purpose software frequently need to comply
with real-time constraints. Higher-level programming languages and
middleware are needed to robustly and productively design, implement,
compose, integrate, validate, and enforce memory and real-time
constraints along with conventional functional requirements for
reusable software components. The Java programming language has become
an attractive choice because of its safety, productivity, its
relatively low maintenance costs, and the availability of well trained
developers.

Although Java features good software engineering characteristics,
traditional Java virtual machine (JVM) implementations are unsuitable
for deploying real-time software due to under-specification of thread
scheduling and synchronization semantics, unclear demand and
utilization of memory and CPU resources, and unpredictable
interference associated with automatic garbage collection and adaptive
compilation.

GOAL

Interest in real-time Java by both the academic research community and
commercial industry has been motivated by the need to manage the
complexity and costs associated with continually expanding embedded
real-time software systems. The goal of the workshop is to gather
researchers working on real-time and embedded Java to identify the
challenging problems that still need to be solved in order to assure
the success of real-time Java as a technology and to report results
and experience gained by researchers.

The Java ecosystem has outgrown the combination of Java as programming
language and the JVM. For example, Android uses Java as source
language and the Dalvik virtual machine for execution. Languages such
as Scala are compiled to Java bytecode and executed on the JVM. JTRES
welcomes submissions that apply such approaches to embedded and/or
real-time systems.

TOPICS OF INTEREST

Topics of interest to this workshop include, but are not limited to:

- New real-time programming paradigms and language features
- Industrial experience and practitioner reports
- Open source solutions for real-time Java
- Real-time design patterns and programming idioms
- High-integrity and safety critical system support
- Java-based real-time operating systems and processors
- Extensions to the RTSJ and SCJ
- Real-time and embedded virtual machines and execution environments
- Memory management and real-time garbage collection
- Scheduling frameworks, feasibility analysis, and timing analysis
- Multiprocessor and distributed real-time Java
- Real-time solutions for Android
- Languages other than Java on real-time or embedded JVMs

SUBMISSION REQUIREMENTS

Participants are expected to submit a paper of at most 10 pages (ACM
Conference Format, i.e., two-columns, 10 point font). Industrial
experience and practitioner reports may be submitted as 4-page short
papers. Accepted papers will be published in the ACM International
Conference Proceedings Series via the ACM Digital Library and have to
be presented by one author at the JTRES.

LaTeX and Word templates can be found at:
http://www.acm.org/sigs/pubs/proceed/template.html

The ISBN number for JTRES 2014 is TBD.

Papers describing open source projects shall include a description how
to obtain the source and how to run the experiments in the appendix.

Papers should be submitted through Easychair. Please use the
submission link:
https://www.easychair.org/conferences/?conf=jtres2014

The best papers will be invited for submission to a special issue of
the Journal on Concurrency and Computation: Practice and Experience,
as determined by the program committee.

IMPORTANT DATES

- Paper Submission: extended to 23 June, 2014
- Notification of Acceptance: 27 July, 2014
- Camera Ready Paper Due: 24 August, 2014
- Workshop: 13-14 October, 2014

PROGRAM CHAIR

Wolfgang Puffitsch, Technical University of Denmark

WORKSHOP CHAIR

Lukasz Ziarek, SUNY Buffalo

PROGRAM CO

Re: Facets in Lucene 4.7.2

2014-06-13 Thread Shai Erera
Hi

You can check the demo code here:
https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_8/lucene/demo/src/java/org/apache/lucene/demo/facet/.
This code is updated with each release, so you always get working code
examples, even when the API changes.

If you don't mind managing the sidecar index, which I agree isn't such a
big deal, then yes - the taxonomy index currently performs the fastest. I
plan to explore porting the taxonomy-based approach from BinaryDocValues to
the new SortedNumericDocValues (coming out in 4.9) since it might perform
even faster.

I didn't quite get the marker/flag facet. Can you give an example? For
instance, if you can model that as a NumericDocValuesField added to
documents (w/ the different markers/flags translated to numbers), then you
can use Lucene's updatable numeric DocValues and write a custom Facets to
aggregate on that NumericDocValues field.
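
As a rough illustration of "markers/flags translated to numbers", the plain-Java sketch below packs a set of markers into one long per document and counts documents per marker. It is not Lucene code: in an index, each long would be the value of a NumericDocValuesField, changed through the updatable numeric DocValues mentioned above, and the counting loop stands in for what a custom Facets implementation would do. The marker names and the class itself are hypothetical.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

final class MarkerFlags {
    // Hypothetical marker set, loosely based on the examples in this thread.
    enum Marker { HISTORICAL, FINANCIAL, MEDIA, AGRICULTURE }

    /** Packs a set of markers into one long, one bit per marker. */
    static long encode(Set<Marker> markers) {
        long bits = 0L;
        for (Marker m : markers) {
            bits |= 1L << m.ordinal();
        }
        return bits;
    }

    /** Counts, per marker, how many per-document values have that bit set. */
    static Map<Marker, Integer> count(long[] perDocValues) {
        Map<Marker, Integer> counts = new EnumMap<>(Marker.class);
        for (Marker m : Marker.values()) {
            counts.put(m, 0);
        }
        for (long value : perDocValues) {
            for (Marker m : Marker.values()) {
                if ((value & (1L << m.ordinal())) != 0) {
                    counts.merge(m, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] docs = {
            encode(EnumSet.of(Marker.HISTORICAL, Marker.AGRICULTURE)),
            encode(EnumSet.of(Marker.AGRICULTURE)),
            encode(EnumSet.noneOf(Marker.class)),
        };
        // A user re-flags the third document: only that one number changes.
        docs[2] = encode(EnumSet.of(Marker.FINANCIAL));
        System.out.println(count(docs));
    }
}
```

The point of the encoding is that a marker change touches a single number per document rather than the whole document.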

Shai


On Fri, Jun 13, 2014 at 11:48 AM, Sandeep Khanzode <
sandeep_khanz...@yahoo.com.invalid> wrote:

> Hi,
>
> I am evaluating Lucene Facets for a project. Since a lot has changed for
> Facets in 4.7.2, I am relying on the unit tests for reference. Please let
> me know if there are other sources of information.
>
> I have a couple of questions:
>
> 1.] All categories in my application are flat, not hierarchical. But it
> seems from a few sources that, even so, you would want to use a
> Taxonomy-based index for performance reasons. It is faster but uses
> more RAM. Or is the deterrent the fact that it is a separate
> data structure? If one can handle the life-cycle management of the extra
> index, should we go ahead with the taxonomy index for better performance
> across tens of millions of documents?
>
> Another note: I do not see a scenario in which I would want to re-index
> my collection over and over again; in other words, the changes would be
> spread over time.
>
> 2.] I need a type of dynamic facet that allows me to add a flag or marker
> to the document at runtime, since it will change every time a user
> modifies or adds to the list of markers. Is this possible with the
> current implementation? I ask because I believe that currently all
> faceting is done at indexing time.
>
>
> ---
> Thanks n Regards,
> Sandeep Ramesh Khanzode


Re: [lucene 4.6] NPE when calling IndexReader#openIfChanged

2014-06-13 Thread Michael McCandless
On Fri, Jun 13, 2014 at 8:53 AM, Clemens Wyss DEV  wrote:
> Thanks a lot!
>>"large text fields"
> What is a good limit (in characters) to switch from StringField to TextField? 
> Do Analyzers (e.g. GermanAnalyzer)  help a lot in reducing the size 
> of an Index?

It's more based on your app's requirements.  StringField indexes
everything as a single token.
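
A toy illustration of the difference, in plain Java without Lucene: the first method treats the whole value as one exact token, the way StringField does, and the second lowercases and splits it, which is only a crude stand-in for what an analyzer does on a TextField. The splitting rule and all names here are assumptions made for the example, not Lucene's actual analysis chain.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class TokenSketch {
    /** StringField-like behaviour: the whole value is one exact token. */
    static List<String> asSingleToken(String value) {
        return Collections.singletonList(value);
    }

    /** Crude analyzer stand-in (NOT Lucene's): lowercase, split on non-alphanumerics. */
    static List<String> asAnalyzedTokens(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String value = "Lucene in Action";
        System.out.println(asSingleToken(value));    // one token: the full string
        System.out.println(asAnalyzedTokens(value)); // several lowercase tokens
    }
}
```

So the choice depends less on length than on whether a search for a single word inside the value needs to match: with one token per value, only the exact full string matches.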

>> Add XXXDocValuesField instead of e.g. StringField.
> Does this apply only for StringFields? Or for TextFields too?
>
>> Upgrade to the upcoming Lucene 4.9
> we have not yet transitioned to Java 7/8 ... hopefully soon ;)
>
>> and take a heap dump and see what's using RAM
> Find attached a snippet from MemoryAnalyzer

Does this say 59255872 bytes (i.e., ~56.5 MB) are being used by the
StandardDirectoryReader?

I'm a little confused because I don't see which structures sum up to
that total.  And I would expect the FST (terms index) to take more
RAM.
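
For checking such figures by hand, here is a small plain-Java helper. It parses MemoryAnalyzer-style grouped numbers, converts them to MiB, and computes what share of a total one entry represents. The two values used in `main` are the retained sizes of the StandardDirectoryReader and of the one SegmentReader that is expanded in the posted snippet. The helper names are made up for this sketch.

```java
final class HeapFigures {
    /** Parses a number with apostrophe grouping, e.g. 59'255'872. */
    static long parseGrouped(String text) {
        return Long.parseLong(text.replace("'", "").trim());
    }

    /** Converts bytes to MiB (2^20 bytes). */
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    /** Share of part in whole, in percent. */
    static double sharePercent(long part, long whole) {
        return 100.0 * part / whole;
    }

    public static void main(String[] args) {
        long reader = parseGrouped("59'255'872");   // StandardDirectoryReader, retained
        long segments = parseGrouped("59'190'960"); // SegmentReader[24], retained
        long expanded = parseGrouped("16'905'072"); // the one expanded SegmentReader
        System.out.println("reader MiB: " + toMiB(reader));
        System.out.println("expanded segment share %: " + sharePercent(expanded, segments));
    }
}
```

By this arithmetic the single expanded segment accounts for a bit under 29% of the retained size of the segment array; the remaining entries are not visible in the snippet.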

Mike McCandless

http://blog.mikemccandless.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



fuzzy/case insensitive AnalyzingSuggester

2014-06-13 Thread Clemens Wyss DEV
Looking for an AnalyzingSuggester which supports
- fuzziness
- case insensitivity
- a small (in-memory) footprint (*)

(*) I just tried to "hand" my big IndexReader (see other post "[lucene 4.6] NPE 
when calling IndexReader#openIfChanged") to JaspellLookup and got an OOM.
Is there any (Jaspell)Lookup implementation that can handle really big indexes 
(by swapping out part of the "lookup table")?





AW: [lucene 4.6] NPE when calling IndexReader#openIfChanged

2014-06-13 Thread Clemens Wyss DEV
Thanks a lot!
>"large text fields"
What is a good limit (in characters) to switch from StringField to TextField? 
Do Analyzers (e.g. GermanAnalyzer)  help a lot in reducing the size 
of an Index?

> Add XXXDocValuesField instead of e.g. StringField.
Does this apply only for StringFields? Or for TextFields too?

> Upgrade to the upcoming Lucene 4.9
we have not yet transitioned to Java 7/8 ... hopefully soon ;)

> and take a heap dump and see what's using RAM
Find attached a snippet from MemoryAnalyzer
Class Name                                                                                        | Shallow Heap | Retained Heap | Percentage
---------------------------------------------------------------------------------------------------------------------------------------------
org.apache.lucene.index.StandardDirectoryReader @ 0x783932460                                     |           72 |    59'255'872 |      3.04%
|- org.apache.lucene.index.SegmentReader[24] @ 0x794089ee0                                        |          112 |    59'190'960 |      3.03%
|  |- org.apache.lucene.index.SegmentReader @ 0x788820f40                                         |           72 |    16'905'072 |      0.87%
|  |  |- org.apache.lucene.index.SegmentCoreReaders @ 0x7910cacc8                                 |           56 |    16'895'576 |      0.87%
|  |  |  |- org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader @ 0x780661c50   |           24 |    16'864'864 |      0.86%
|  |  |  |  |- org.apache.lucene.codecs.BlockTreeTermsReader @ 0x7910cae50                        |           56 |    16'864'240 |      0.86%
|  |  |  |  |  |- java.util.TreeMap @ 0x783902738                                                 |           48 |    16'858'472 |      0.86%
|  |  |  |  |  |  '- java.util.TreeMap$Entry @ 0x77ec5f9f8                                        |           40 |    16'858'424 |      0.86%
|  |  |  |  |  | |- java.util.TreeMap$Entry @ 0x77ec5fa20                                         |           40 |    10'895'656 |      0.56%
|  |  |  |  |  | |- java.util.TreeMap$Entry @ 0x77ec5fa48                                         |           40 |     5'960'072 |      0.31%
|  |  |  |  |  | |  |- java.util.TreeMap$Entry @ 0x77ec5fa98                                      |           40 |     5'958'072 |      0.31%
|  |  |  |  |  | |  |  |- java.util.TreeMap$Entry @ 0x77fc09bf0                                   |           40 |     5'949'864 |      0.30%
|  |  |  |  |  | |  |  |- org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader @ 0x788820e20 |           72 |         8'168 |      0.00%
|  |  |  |  |  | |  |  '- Total: 2 entries
|  |  |  |  |  | |  |- java.util.TreeMap$Entry @ 0x77ec5fa70                                      |           40 |         1'000 |      0.00%
|  |  |  |  |  | |  |  '- org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader @ 0x78347fbc0 |           72 |           960 |      0.00%
|  |  |  |  |  | |  | |- org.apache.lucene.util.fst.FST @ 0x788fe34c8                             |          104 |           840 |      0.00%
|  |  |  |  |  | |  | |  |- org.apache.lucene.util.fst.FST$Arc[128] @ 0x7870932a0                 |          528 |           528 |      0.00%
|  |  |  |  |  | |  | |  |- org.apache.lucene.util.fst.BytesStore @ 0x77ec5fb60                   |           40 |           144 |      0.00%
|  |  |  |  |  | |  | |  |  '- java.util.ArrayList @ 0x780663b28                                  |           24 |           104 |      0.00%
|  |  |  |  |  | |  | |  |- org.apache.lucene.util.BytesRef @ 0x780663b10                         |           24 |            48 |      0.00%
|  |  |  |  |  | |  | |  |  '- byte[5] @ 0x780

Re: [lucene 4.6] NPE when calling IndexReader#openIfChanged

2014-06-13 Thread Michael McCandless
On Fri, Jun 13, 2014 at 3:02 AM, Clemens Wyss DEV  wrote:
>> limit how many fields have norms enabled
> We have one index for approx 7000 pdfs (24GB). Of course no content is STOREd 
> (but ANALYZEd). This very index occupies 4GB on disk and the corresponding 
> IndexReader is 60MB.
> Are norms enabled by default for org.apache.lucene.document.TextField?

Yes.  Norms are a good idea for "large text fields", e.g. body text or
a catch all field, but usually not a good idea for tiny fields (e.g.
title).
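
As a back-of-the-envelope aid for deciding which fields should keep norms, the plain-Java sketch below estimates an upper bound on norms memory. It rests on a stated assumption, namely that norms cost at most about one byte per document for each field that has them enabled; the actual footprint depends on the codec and version. The field counts are hypothetical and the class is not part of Lucene.

```java
final class NormsEstimate {
    // Assumption: at most ~1 byte per document per field with norms enabled.
    static final long ASSUMED_BYTES_PER_DOC_PER_FIELD = 1L;

    /** Rough upper bound on norms memory, in bytes. */
    static long estimateBytes(long maxDoc, int fieldsWithNorms) {
        return maxDoc * fieldsWithNorms * ASSUMED_BYTES_PER_DOC_PER_FIELD;
    }

    public static void main(String[] args) {
        // ~7000 documents as in this thread; 5 fields with norms is a made-up count.
        System.out.println(estimateBytes(7_000L, 5));
        // A hypothetical larger index: 10 million documents, 20 fields with norms.
        System.out.println(estimateBytes(10_000_000L, 20));
    }
}
```

Under this assumption norms only become significant when both the document count and the number of fields carrying norms are large.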

>> use disk-based doc values not field cache
> How is this done?

Add XXXDocValuesField instead of e.g. StringField.

>> etc.
> such as? ;)

Upgrade to the upcoming Lucene 4.9; there have been some improvements
e.g. to norms compression.  You can tune your terms index settings,
but terms index usually doesn't use much RAM.

You can fire up your app, get all searchers warmed, and take a heap
dump and see what's using RAM.  We can iterate from there.

Mike McCandless

http://blog.mikemccandless.com




Facets in Lucene 4.7.2

2014-06-13 Thread Sandeep Khanzode
Hi,
 
I am evaluating Lucene Facets for a project. Since a lot has changed for Facets 
in 4.7.2, I am relying on the unit tests for reference. Please let me know if 
there are other sources of information.

I have a couple of questions:

1.] All categories in my application are flat, not hierarchical. But it seems 
from a few sources that, even so, you would want to use a Taxonomy-based index 
for performance reasons. It is faster but uses more RAM. Or is the deterrent 
the fact that it is a separate data structure? If one can handle the life-cycle 
management of the extra index, should we go ahead with the taxonomy index for 
better performance across tens of millions of documents?

Another note: I do not see a scenario in which I would want to re-index my 
collection over and over again; in other words, the changes would be spread 
over time.

2.] I need a type of dynamic facet that allows me to add a flag or marker to 
the document at runtime, since it will change every time a user modifies or 
adds to the list of markers. Is this possible with the current implementation? 
I ask because I believe that currently all faceting is done at indexing time.

 
---
Thanks n Regards,
Sandeep Ramesh Khanzode

searching in hierarchical structures

2014-06-13 Thread Sascha Janz
we use lucene to search in hierarchical structures, like a folder structure in 
a filesystem.
 
the documents have an extra field, which specifies the location of the document.
 
so if you want to search documents under a specific folder you have to query a 
prefix in this field.
 
but if the documents are moved to another location, every document must be 
updated. in our case this is not a good option.
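
the situation described above can be sketched in plain java (no lucene involved): each document carries its location as a string, a folder search is a prefix match on that string, and moving a folder means rewriting the location of every document underneath it. the paths and names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PathPrefixSketch {
    /** Normalizes a folder so that "/a/b" does not also match "/a/bc". */
    private static String asPrefix(String folder) {
        return folder.endsWith("/") ? folder : folder + "/";
    }

    /** Returns the positions of all documents located under the folder. */
    static List<Integer> findUnder(List<String> locations, String folder) {
        String prefix = asPrefix(folder);
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < locations.size(); i++) {
            if (locations.get(i).startsWith(prefix)) {
                hits.add(i);
            }
        }
        return hits;
    }

    /** Moves a folder and returns how many documents had to be rewritten. */
    static int moveFolder(List<String> locations, String from, String to) {
        String oldPrefix = asPrefix(from);
        String newPrefix = asPrefix(to);
        int rewritten = 0;
        for (int i = 0; i < locations.size(); i++) {
            String location = locations.get(i);
            if (location.startsWith(oldPrefix)) {
                locations.set(i, newPrefix + location.substring(oldPrefix.length()));
                rewritten++;
            }
        }
        return rewritten;
    }

    public static void main(String[] args) {
        // Hypothetical document locations.
        List<String> locations = new ArrayList<>(
            Arrays.asList("/a/b/1.pdf", "/a/b/c/2.pdf", "/a/x/3.pdf"));
        System.out.println(findUnder(locations, "/a/b"));
        System.out.println(moveFolder(locations, "/a/b", "/z") + " documents rewritten");
    }
}
```

the cost of a move grows with the number of documents below the moved folder, which is the problem being asked about.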
 
are there any concepts for implementing hierarchical structures in lucene? does 
someone have a suggestion?
 
i know lucene is a full-text search engine and therefore primarily for flat structures.
 
greetings
sascha




AW: [lucene 4.6] NPE when calling IndexReader#openIfChanged

2014-06-13 Thread Clemens Wyss DEV
> limit how many fields have norms enabled
We have one index for approx 7000 pdfs (24GB). Of course no content is STOREd 
(but ANALYZEd). This very index occupies 4GB on disk and the corresponding 
IndexReader is 60MB.
Are norms enabled by default for org.apache.lucene.document.TextField?

> use disk-based doc values not field cache
How is this done?

> etc.
such as? ;)

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Wednesday, May 21, 2014 11:21
To: Lucene Users
Subject: Re: [lucene 4.6] NPE when calling IndexReader#openIfChanged

On Wed, May 21, 2014 at 3:17 AM, Clemens Wyss DEV  wrote:
>> Can you just decrease IW's ramBufferSizeMB to relieve the memory pressure?
> +1
> Is there something alike for IndexReaders?

No, although you can take steps during indexing to reduce the RAM required 
during searching, e.g. limit how many fields have norms enabled, use disk-based 
doc values not field cache, etc.

Mike McCandless

http://blog.mikemccandless.com
