Sort, Search & Facets

2014-07-08 Thread Sandeep Khanzode
Hi,
 
I am using Lucene 4.7.2 and my primary use case for Lucene is to do three 
things: (a) search, (b) sort by a number of fields for the search results, and 
(c) facet on probably an equal number of fields (probably the most standard use 
cases anyway).

Let us say I have a corpus of more than 100M docs, with each document having 
approx. 10-15 fields excluding the content (body), which will also be one of the 
fields. Out of those 10-15, I have a requirement to enable sorting on all of 
them, and faceting as well. That makes a total of approx. ~45 fields to be 
indexed, since each logical field is added three times: once as a 
String/Long/TextField, once as a SortedDocValuesField, and once as a FacetField. 
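To illustrate, indexing a single logical field three ways looks roughly like this (a sketch only; the field name is illustrative, and it assumes a FacetsConfig and DirectoryTaxonomyWriter are already set up as per the Lucene 4.7 facet module):

```java
// One logical field, added three times: once each for search, sort, and facet.
Document doc = new Document();
doc.add(new StringField("author", "lucene", Field.Store.NO));        // (a) search
doc.add(new SortedDocValuesField("author", new BytesRef("lucene"))); // (b) sort
doc.add(new FacetField("author", "lucene"));                         // (c) facet
// FacetsConfig.build() translates the FacetField into taxonomy index entries
indexWriter.addDocument(facetsConfig.build(taxoWriter, doc));
```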

What will be the impact of this on the indexing operation w.r.t. the time taken 
as well as the extra disk space required? Will it grow linearly with the 
increase in the number of fields?

What is the impact on the memory usage during search time?


I will attempt to benchmark some of these, but if you have any experience with 
this, I would request you to share the details. Thanks,

---
Thanks n Regards,
Sandeep Ramesh Khanzode

DrillSideways accepting FacetCollector parameter

2014-07-08 Thread Jigar Shah
Currently DrillSideways provides the following method:

public DrillSidewaysResult search(DrillDownQuery query, Collector
hitCollector);

Can the same class provide the following method?

public DrillSidewaysResult search(DrillDownQuery query, Collector
hitCollector, FacetsCollector facetCollector);

Currently, a

 FacetsCollector drillDownCollector = new FacetsCollector();

is created internally by the API method

public DrillSidewaysResult search(DrillDownQuery query, Collector
hitCollector) throws IOException

Could this collector be passed in as a parameter instead?

It would help the application to reuse the same FacetsCollector to fetch other
facets, i.e. non-sideways facets.

Thanks,
Jigar Shah.


re-use IndexWriter

2014-07-08 Thread Jason.H
Lately I've been trying every way to improve indexing performance. 
IndexWriter's close operation is really costly, and the Lucene docs suggest 
re-using the IndexWriter instance. I did that: I kept the IndexWriter instance 
and handed it to every request thread. But there is a big problem: I never see 
the index changes in my searches, because the changes are still in RAM. Maybe 
there's a way to flush all the changes to stable storage without closing the 
IndexWriter, so I could re-use it. Am I right on this point?

There are several points I don't quite understand:

1. What's the difference between commit and flush? I thought with these two 
methods I could see the changes in my Directory without closing the IndexWriter.

2. When should I close the writer? If I use it as a singleton (so I don't have 
to worry about a LockObtainFailedException), and I don't have to worry about the 
changes because commit and flush handle that, then I don't have to close it any 
more...

Re: re-use IndexWriter

2014-07-08 Thread Ian Lea
Read the javadocs to understand the difference between commit() and
flush().  You need commit(), or close().

There are no hard and fast rules and it depends on how much data you
are indexing, how fast, how many searches you're getting and how up to
date they need to be.  And how much you worry about losing indexed
data.

One option is to pick a value that makes sense to you and commit() the
writer every n seconds|minutes|hours|docs.  close() it when your
indexing job exits.  You'll need to reopen index searchers to pick up
changes.  See the javadocs for IndexSearcher.

Another option is to use lucene's near-real-time (NRT) features.  Also
see the IndexSearcher javadocs for a way in to that.
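For example, a SearcherManager gives you the reopen pattern over a live
writer without ever closing it (a rough sketch, not a drop-in; assumes
'writer' is your shared IndexWriter):

```java
// Open a manager over the writer; searchers can see uncommitted (NRT) changes.
SearcherManager manager = new SearcherManager(writer, true, new SearcherFactory());

// After indexing a batch:
writer.commit();         // durability: changes survive a crash
manager.maybeRefresh();  // visibility: newly acquired searchers see the changes

// On each search request:
IndexSearcher searcher = manager.acquire();
try {
    // run queries with this searcher
} finally {
    manager.release(searcher);
}
```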


--
Ian.



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: DrillSideways accepting FacetCollector parameter

2014-07-08 Thread Michael McCandless
We could do this, but what's the use case?

E.g. DrillSideways also hardwires the drill-sideways collectors it
creates ... do you want control over those as well?  Maybe we could add
methods to the DrillSideways class that you could override?

Mike McCandless

http://blog.mikemccandless.com






Re: Incremental Field Updates

2014-07-08 Thread Ravikumar Govindarajan
That's a cool patch. Thanks


On Thursday, July 3, 2014, Gopal Patwa gopalpa...@gmail.com wrote:

 Thanks Ravi, it is good to know the general problems with updatable fields.
 In our use-case we have a few fields which update more frequently than the
 main index. We are using this Solr join contrib patch with a DocTransformer
 for returning data from the join core. This approach has some performance
 impact; if that performance hit is acceptable for your use-case, you can
 give it a try if you are using Solr.

 https://issues.apache.org/jira/browse/SOLR-4787





 On Thu, Jul 3, 2014 at 3:22 AM, Ravikumar Govindarajan 
 ravikumar.govindara...@gmail.com wrote:

  In case of sorting, updatable DocValues may be what you are looking for.
 
  But updatable fields for searching is a different beast.
 
  A sample approach is documented at
 
 
 http://www.flax.co.uk/blog/2012/06/22/updating-individual-fields-in-lucene-with-a-redis-backed-codec/
 
  The general problems with an updatable postings-list AFAIK are
 
  1. Impossible to correctly score updated documents
  2. Segment merges could miss out on updates
  3. Might behave incorrectly with NRT
  4. Frequent updates could end up creating lots of files, because of the
  append-only nature of Lucene...
 
  Maybe if you are not too worried about scoring, correct NRT behavior etc.,
  you can attempt a solution like the RedisCodec stuff...
 
  Segregating static & dynamic fields into 2 separate indexes, as described
  here
 
  http://www.lucenerevolution.org/2013/Sidecar-Index-Solr-Components-for-Parallel-Index-Management
 
  may be of some use to you
 
  --
  Ravi
 
 
 
  On Wed, Jul 2, 2014 at 7:29 PM, Shai Erera ser...@gmail.com wrote:
 
   Using BinaryDocValues is not recommended for all scenarios. It is a
   catchall alternative to the other DocValues types. I would not use it
   unless it makes sense for your application, even if it means that you
  need
   to re-index a document in order to update a single field.
  
   DocValues are not good for search - by "search" I assume you mean: take
   a query such as "apache AND lucene" and find all documents which contain
   both terms under the same field. They are good for sorting and faceting,
   though.
  
   So I guess the answer to your question is "it depends" (it always is!) -
   I would use DocValues for sorting and faceting, but not for regular
   search queries. And I would use BinaryDocValues only when the other
   DocValues types don't match.
  
   Also, note that the current field-level update of DocValues is not
 always
   better than re-indexing the document, you can read here for more
 details:
  
 
 http://shaierera.blogspot.com/2014/04/benchmarking-updatable-docvalues.html
  
   Shai
  
  
   On Tue, Jul 1, 2014 at 9:17 PM, Sandeep Khanzode 
   sandeep_khanz...@yahoo.com.invalid wrote:
  
Hi Shai,
   
So one follow-up question.
   
    Assume that my use case is to have approx. ~50M documents indexed with
    each document having about ~10-15 indexed but not stored fields. These
    fields will never change, but there are another ~5-6 fields that will
    change and will continue to change after the index is written. These
    ~5-6 fields may also be multivalued. The size of this index turns out
    to be ~120GB.
   
    In this case, I would like to sort or facet or search on these ~5-6
    fields. Which approach do you suggest? Should I use BinaryDocValues and
    update using IndexWriter, or use either a ParallelReader or a join
    query?
   
---
Thanks n Regards,
Sandeep Ramesh Khanzode
   
   
    On Tuesday, July 1, 2014 9:53 PM, Shai Erera ser...@gmail.com wrote:
   
   
   
Except that Lucene now offers efficient numeric and binary DocValues
updates. See IndexWriter.updateNumeric/Binary...
   
    On Jul 1, 2014 5:51 PM, Erick Erickson erickerick...@gmail.com wrote:
   
 This JIRA is complicated, don't really expect it in 4.9 as it's
 been hanging around for quite a while. Everyone would like this,
 but it's not easy.

 Atomic updates will work, but you have to set stored=true for all
 source fields. Under the covers this actually reads the document
 out of the stored fields, deletes the old one, and adds it
 over again.

 FWIW,
 Erick

 On Tue, Jul 1, 2014 at 5:32 AM, Sandeep Khanzode
 sandeep_khanz...@yahoo.com.invalid wrote:
  Hi,
 
  I wanted to know of the best approach to follow if a few fields in my
  indexed documents are changing at run time (after index and before or
  during search), but a majority of them are created at index time.
 
  I could see the JIRA given below, but it is scheduled for Lucene 4.9,
  I believe.
 
  There are a few other approaches, like maintaining a separate index for
  the changing fields and using either a ParallelReader or a Join.
 
  Can everyone share their experience for 

Adding/removing a term from a document

2014-07-08 Thread Allen Kneser
Hi all,

I am trying to figure out how to easily remove or add a keyword from a
document's index (or equivalently, decrease/increase that keyword's
frequency in the document).

I know Lucene allows you to reindex a document using the
IndexWriter.updateDocument(docPath, doc) call, but that's too expensive for
my purposes. I already know the removed & added keywords from a previous
pass through the document, and I would like to avoid Lucene doing another
pass.

I am looking for an IndexWriter.adjustTermFreqInDoc(keyword, doc,
deltafreq) which would change the frequency of 'keyword' in 'doc' by
'deltafreq'. This could result in either adding or removing a keyword from
the document in the index.

Is there a way to do this? At first I thought adding term vectors to the
index could help with this but it seems like that will dramatically
increase the index size.

Cheers,
Alin


IndexSearcher.doc thread safe problem

2014-07-08 Thread 김선무
Hi all,

I know IndexSearcher is thread safe.
But maybe IndexSearcher.doc is not thread safe...

Here is what I tried:

First, I extract each docID from the index directory and add it to a
queue (ConcurrentLinkedQueue).

Second, after that extraction ends, I extract field values using docIDs
polled from this queue. This part of the work is done by multiple threads.
For this I used the following summarized code:
---
searcher.search(query, filter, new Collector() {
    @Override
    public void collect(int doc) {
        queue.add(docBase + doc);
    }
    // setScorer/setNextReader/acceptsDocsOutOfOrder omitted
});

Thread thread1 = new Thread(() -> {
    while (!queue.isEmpty()) {
        System.out.println(searcher.doc(queue.poll()).get("content"));
    }
});
Thread thread2 = new Thread(thread1);
thread1.start();
thread2.start();
---

The result was different on every execution.

Is my method wrong, or is it an IndexSearcher bug?

Please help me


Re: DrillSideways accepting FacetCollector parameter

2014-07-08 Thread Jigar Shah
Use case:

I perform a search with the code below:

DrillSideways drillSideWays = new DrillSideways(searcher, config,
engine.getTaxoReader());
DrillSidewaysResult result = drillSideWays.search(filterQuery, null, null,
first + limit, sort, true, true);

In the above code I don't have a reference to the FacetsCollector fc that
is used internally. Consider that I want to get LongRangeFacetCounts,
which is based on a NumericDocValuesField.

facets = new LongRangeFacetCounts(facetField.getQueryName(), fc,
longRanges.toArray(new LongRange[longRanges
.size()]));

If I use the below instead, I do get access to the current fc:

FacetsCollector fc = new FacetsCollector();
TopDocs topDocs = FacetsCollector.search(searcher, query, null, first +
limit, sort, true, true, fc);

The difference is that if I use 'FacetsCollector.search(searcher, query,
null, first + limit, sort, true, true, fc)' I can get at the
FacetsCollector. This is not true in the case of DrillSideways.

Let me know if there is already some other way provided.

Thanks,
Jigar Shah.






