Re: Data Import Handler Question

2014-04-27 Thread Yuval Dotan
Thanks Shawn

In your opinion, which is easier: writing the importer from scratch or
extending the DIH (for example, adding the state etc.)?


Yuval


On Thu, Apr 24, 2014 at 6:47 PM, Shawn Heisey s...@elyograg.org wrote:

 On 4/24/2014 9:24 AM, Yuval Dotan wrote:

 I want to use the DIH component in order to import data from old
 postgresql
 DB.
 I want to be able to recover from errors and crashes.
 If an error occurs I should be able to restart and continue indexing from
 where it stopped.
 Is the DIH good enough for my requirements ?
 If not is it possible to extend one of its classes in order to support the
 recovery?


 The entity in the Dataimport Handler (DIH) config has an onError
 attribute.

 http://wiki.apache.org/solr/DataImportHandler#Schema_for_the_data_config
 https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler#UploadingStructuredDataStoreDatawiththeDataImportHandler-EntityProcessors

 But honestly, if you want a really robust Java program that indexes to
 Solr and does precisely what you want, you may be better off writing it
 yourself using SolrJ and JDBC.  DIH is powerful and efficient, but when you
 write the program yourself, you can do anything you want with your data.

 You also have the possibility of resuming an import after a Solr crash.
  Because DIH is embedded in Solr and doesn't save any kind of state data
 about an import in progress, that's pretty much impossible with DIH.  With
 a SolrJ program, you'd have to handle that yourself, but it would be
 *possible*.

 https://cwiki.apache.org/confluence/display/solr/Using+SolrJ

 Thanks,
 Shawn
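
The "save state yourself" idea in the reply above can be sketched roughly as
follows. This is a minimal illustration, not DIH or SolrJ code: sqlite3 stands
in for the PostgreSQL source, the docs table and its id/body columns are
assumptions, and send_batch is a hypothetical callback that would post a batch
to Solr. Progress is recorded only after a batch succeeds, so a restart
continues from the last saved id.

```python
import json
import os
import sqlite3
import tempfile


def load_checkpoint(path):
    """Return the last successfully indexed primary key, or 0 if none was saved."""
    if not os.path.exists(path):
        return 0
    with open(path) as fh:
        return json.load(fh)["last_id"]


def save_checkpoint(path, last_id):
    """Persist progress: write a temp file, then rename it over the old one."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump({"last_id": last_id}, fh)
    os.replace(tmp, path)


def run_import(conn, send_batch, checkpoint_path, batch_size=2):
    """Index rows in primary-key order, saving a checkpoint after each batch."""
    last_id = load_checkpoint(checkpoint_path)
    while True:
        rows = conn.execute(
            "SELECT id, body FROM docs WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return last_id
        send_batch(rows)  # hypothetical: post the batch to Solr
        last_id = rows[-1][0]
        save_checkpoint(checkpoint_path, last_id)


def demo():
    """Simulate a crash on the second batch, then restart and finish."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO docs VALUES (?, ?)", [(i, f"doc {i}") for i in range(1, 6)]
    )
    checkpoint = os.path.join(tempfile.mkdtemp(), "import.ckpt")
    indexed = []
    calls = {"n": 0}

    def flaky_send(rows):
        calls["n"] += 1
        if calls["n"] == 2:
            raise RuntimeError("simulated crash")
        indexed.extend(row[0] for row in rows)

    try:
        run_import(conn, flaky_send, checkpoint)
    except RuntimeError:
        pass  # the process "died" here; the checkpoint still says id 2
    run_import(conn, flaky_send, checkpoint)  # restart resumes after id 2
    return indexed
```

If the process dies after a batch is sent but before the checkpoint is saved,
that batch is sent again on restart. With a uniqueKey defined in the schema,
re-adding a document replaces the earlier copy, so the repeat should be
harmless.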




Data Import Handler Question

2014-04-24 Thread Yuval Dotan
Hi
I want to use the DIH component in order to import data from old postgresql
DB.
I want to be able to recover from errors and crashes.
If an error occurs I should be able to restart and continue indexing from
where it stopped.
Is the DIH good enough for my requirements ?
If not is it possible to extend one of its classes in order to support the
recovery?
Thanks
Yuval


Re: distributed search is significantly slower than direct search

2013-11-18 Thread Yuval Dotan
Hi
Thanks very much for your answers :)
Manuel, if you have a patch I will be glad to test its performance
Yuval



On Mon, Nov 18, 2013 at 10:49 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Manuel, that sounds very interesting. Would you be willing to
 contribute this back to the community?

 On Mon, Nov 18, 2013 at 9:53 AM, Manuel Le Normand
 manuel.lenorm...@gmail.com wrote:
  In order to accelerate BinaryResponseWriter.write, we extended this
  writer class to implement the docid-to-id transformation using docValues
  (in memory), with no need to access the stored field to read the id or to
  lazy-load fields, which also has a cost. That should improve the read
  rate, as docValues are sequential and should avoid disk IO. This docValues
  implementation is accessed during both query stages (as mentioned above)
  if you ask for ids only, or only once, during the distributed search
  stage, if you intend to ask for stored fields other than id.
 
  We just started testing it for performance. I would love to hear any
  opinions or performance test results for this implementation.
 
  Manu



 --
 Regards,
 Shalin Shekhar Mangar.



Re: distributed search is significantly slower than direct search

2013-11-17 Thread Yuval Dotan
Hi,

I isolated the case

Installed on a new machine (2 x Xeon E5410 2.33GHz)

I have an environment with 12Gb of memory.

I assigned 6gb of memory to Solr and I’m not running any other memory
consuming process so no memory issues should arise.

Removed all indexes apart from two:

emptyCore – empty – used for routing

core1 – holds the stored data – has ~750,000 docs and size of 400Mb

Again this is a single machine that holds both indexes.

The query
http://localhost:8210/solr/emptyCore/select?rows=5000&q=*:*&shards=127.0.0.1:8210/solr/core1&wt=json
QTime takes ~3 seconds

and direct query
http://localhost:8210/solr/core1/select?rows=5000&q=*:*&wt=json QTime takes
~15 ms - a difference of two orders of magnitude.
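
For reproducing the comparison, the two requests can be built from the same
parameters so that the shards parameter is the only difference. A small
sketch; the host, port and core names are the ones quoted in this message, and
the helper name select_url is made up for illustration.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8210/solr"  # host and port as quoted in the message


def select_url(core, rows=5000, shards=None):
    """Build a /select URL; passing shards makes it a distributed request."""
    params = [("rows", rows), ("q", "*:*")]
    if shards:
        params.append(("shards", ",".join(shards)))
    params.append(("wt", "json"))
    # safe="*:/," keeps the Solr syntax readable instead of percent-encoding it
    return f"{BASE}/{core}/select?" + urlencode(params, safe="*:/,")


direct = select_url("core1")
distributed = select_url("emptyCore", shards=["127.0.0.1:8210/solr/core1"])
```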

I ran the long query several times and got an improvement of about a sec
(33%) but that’s it.

I need to better understand why this is happening.

I tried looking at Solr code and debugging the issue but with no success.

The one thing I did notice is that the getFirstMatch method, which receives
the doc id, searches the term dict and returns the internal id, takes most
of the time for some reason.

I am pretty stuck and would appreciate any ideas

My only solution for the moment is to bypass the distributed query,
implement code in my own app that directly queries the relevant cores and
handles the sorting etc..
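
A rough sketch of that workaround, assuming each core is queried directly and
returns its hits already sorted by descending score. The document layout (a
dict with id and score) and the function name merge_top are assumptions made
for the example, not Solr APIs.

```python
import heapq
from itertools import islice


def merge_top(per_core_results, rows, key=lambda doc: doc["score"]):
    """Merge per-core result lists, each already sorted by descending score,
    and keep only the first `rows` documents of the combined ranking."""
    merged = heapq.merge(*per_core_results, key=key, reverse=True)
    return list(islice(merged, rows))


core_a = [{"id": "a1", "score": 9.0}, {"id": "a2", "score": 4.0}]
core_b = [{"id": "b1", "score": 7.5}, {"id": "b2", "score": 1.0}]
top3 = merge_top([core_a, core_b], rows=3)
```

Because heapq.merge is lazy, only as many documents as requested are pulled
from the per-core lists.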

Thanks




On Sat, Nov 16, 2013 at 2:39 PM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 Did you say what the memory profile of your machine is?  How much memory,
 and how large are the shards? This is just a random guess, but it might be
 that if you are memory-constrained, there is a lot of thrashing caused by
 paging (swapping?) in and out the sharded indexes while a single index can
 be scanned linearly, even if it does need to be paged in.

 -Mike


 On 11/14/2013 8:10 AM, Elran Dvir wrote:

 Hi,

 We tried returning just the id field and got exactly the same performance.
 Our system is distributed but all shards are in a single machine so
 network issues are not a factor.
 The code we found where Solr is spending its time is on the shard and not
 on the routing core, again all shards are local.
 We investigated the getFirstMatch() method and noticed that the
 MultiTermEnum.reset (inside MultiTerm.iterator) and MultiTerm.seekExact
 take 99% of the time.
 Inside these methods, the call to BlockTreeTermsReader$
 FieldReader$SegmentTermsEnum$Frame.loadBlock  takes most of the time.
 Out of the 7 seconds  run these methods take ~5 and
 BinaryResponseWriter.write takes the rest(~ 2 seconds).

 We tried increasing cache sizes and got hits, but it only improved the
 query time by a second (~6), so no major effect.
 We are not indexing during our tests. The performance is similar.
 (How do we measure doc size? Is it important due to the fact that the
 performance is the same when returning only id field?)

 We still don't completely understand why the query takes this much longer
 although the cores are on the same machine.

 Is there a way to improve the performance (code, configuration, query)?

 -Original Message-
 From: idokis...@gmail.com [mailto:idokis...@gmail.com] On Behalf Of
 Manuel Le Normand
 Sent: Thursday, November 14, 2013 1:30 AM
 To: solr-user@lucene.apache.org
 Subject: Re: distributed search is significantly slower than direct search

 It's surprising that such a query takes a long time; I would assume that
 after running q=*:* repeatedly you should be getting cache hits and times
 should be faster. Try checking in the admin UI how your query/doc caches
 perform.
 Moreover, the query itself is just asking for the first 5000 docs that
 were indexed (returning the first [docid]), so it seems all this time is
 wasted on transfer. Out of these 7 secs, how much is spent in the above
 method? What do you return by default? How big is every doc you display in
 your results?
 It might be that both collections work on the same resources.
 Try elaborating on your use case.

 Anyway, it seems like you just made a test to see what will be the
 performance hit in a distributed environment so I'll try to explain some
 things we encountered in our benchmarks, with a case that has at least the
 similarity of the num of docs fetched.

 We retrieve 2000 docs every query, running over 40 shards. This means
 every shard is actually transferring 2000 docs to our frontend on every
 document-match request (the first you were referring to). Even if lazily
 loaded, reading 2000 ids (on 40 servers) and lazy loading the fields is a
 tough job. Waiting for the slowest shard to respond, then sorting the docs
 and reloading (lazy or not) the top 2000 docs might take a long time.

 Our times are 4-8 secs, but it's not possible to compare the cases. We've
 taken a few steps that improved it along the way, steps that led to others.
 These were our starters:

 1. Profile these queries from different servers and Solr instances, and
 try putting your finger on which collection is working hard and why. Check if
 you're 

Re: distributed search is significantly slower than direct search

2013-11-17 Thread Yuval Dotan
Hi Tomás
This is just a test environment meant only to reproduce the issue I am
currently investigating.
The number of documents should grow substantially (billions of docs).



On Sun, Nov 17, 2013 at 7:12 PM, Tomás Fernández Löbbe 
tomasflo...@gmail.com wrote:

 Hi Yuval, quick question. You say that your core has 750k docs and around
 400mb? Is this some kind of test dataset and you expect it to grow
 significantly? For an index of this size, I wouldn't use distributed
 search; a single shard should be fine.


 Tomás


 On Sun, Nov 17, 2013 at 6:50 AM, Yuval Dotan yuvaldo...@gmail.com wrote:

  Hi,
 
  I isolated the case
 
  Installed on a new machine (2 x Xeon E5410 2.33GHz)
 
  I have an environment with 12Gb of memory.
 
  I assigned 6gb of memory to Solr and I’m not running any other memory
  consuming process so no memory issues should arise.
 
  Removed all indexes apart from two:
 
  emptyCore – empty – used for routing
 
  core1 – holds the stored data – has ~750,000 docs and size of 400Mb
 
  Again this is a single machine that holds both indexes.
 
  The query
  http://localhost:8210/solr/emptyCore/select?rows=5000&q=*:*&shards=127.0.0.1:8210/solr/core1&wt=json
  QTime takes ~3 seconds
 
  and direct query
  http://localhost:8210/solr/core1/select?rows=5000&q=*:*&wt=json QTime
  takes ~15 ms - a difference of two orders of magnitude.
 
  I ran the long query several times and got an improvement of about a sec
  (33%) but that’s it.
 
  I need to better understand why this is happening.
 
  I tried looking at Solr code and debugging the issue but with no success.
 
  The one thing I did notice is that the getFirstMatch method which
 receives
  the doc id, searches the term dict and returns the internal id takes most
  of the time for some reason.
 
  I am pretty stuck and would appreciate any ideas
 
  My only solution for the moment is to bypass the distributed query,
  implement code in my own app that directly queries the relevant cores and
  handles the sorting etc..
 
  Thanks
 
 
 
 
  On Sat, Nov 16, 2013 at 2:39 PM, Michael Sokolov 
  msoko...@safaribooksonline.com wrote:
 
   Did you say what the memory profile of your machine is?  How much
 memory,
   and how large are the shards? This is just a random guess, but it might
  be
   that if you are memory-constrained, there is a lot of thrashing caused
 by
   paging (swapping?) in and out the sharded indexes while a single index
  can
   be scanned linearly, even if it does need to be paged in.
  
   -Mike
  
  
   On 11/14/2013 8:10 AM, Elran Dvir wrote:
  
   Hi,
  
   We tried returning just the id field and got exactly the same
  performance.
   Our system is distributed but all shards are in a single machine so
   network issues are not a factor.
   The code we found where Solr is spending its time is on the shard and
  not
   on the routing core, again all shards are local.
   We investigated the getFirstMatch() method and noticed that the
   MultiTermEnum.reset (inside MultiTerm.iterator) and
 MultiTerm.seekExact
   take 99% of the time.
   Inside these methods, the call to BlockTreeTermsReader$
   FieldReader$SegmentTermsEnum$Frame.loadBlock  takes most of the time.
   Out of the 7 seconds  run these methods take ~5 and
   BinaryResponseWriter.write takes the rest(~ 2 seconds).
  
   We tried increasing cache sizes and got hits, but it only improved the
   query time by a second (~6), so no major effect.
   We are not indexing during our tests. The performance is similar.
   (How do we measure doc size? Is it important due to the fact that the
   performance is the same when returning only id field?)
  
   We still don't completely understand why the query takes this much
  longer
   although the cores are on the same machine.
  
   Is there a way to improve the performance (code, configuration,
 query)?
  
   -Original Message-
   From: idokis...@gmail.com [mailto:idokis...@gmail.com] On Behalf Of
   Manuel Le Normand
   Sent: Thursday, November 14, 2013 1:30 AM
   To: solr-user@lucene.apache.org
   Subject: Re: distributed search is significantly slower than direct
  search
  
   It's surprising that such a query takes a long time; I would assume that
   after running q=*:* repeatedly you should be getting cache hits and times
   should be faster. Try checking in the admin UI how your query/doc caches
   perform.
   Moreover, the query itself is just asking for the first 5000 docs that
   were indexed (returning the first [docid]), so it seems all this time is
   wasted on transfer. Out of these 7 secs, how much is spent in the above
   method? What do you return by default? How big is every doc you display
   in your results?
   It might be that both collections work on the same resources.
   Try elaborating on your use case.
  
   Anyway, it seems like you just made a test to see what will be the
   performance hit in a distributed environment so I'll try to explain
 some
   things we encountered in our benchmarks, with a case that has

Re: Performance improvement for solr faceting on large index

2012-11-22 Thread Yuval Dotan
You could always try the fc facet method (facet.method=fc) and maybe
increase the filterCache size.
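
As an illustration only, this is how the facet request from the question
quoted below might look with the method switched from enum to fc. The helper
name facet_url is made up; the host, field names and parameter values are
taken from the quoted query, with the indent, start, fl and qt parameters left
out for brevity. Whether fc is actually faster for this field would have to be
measured.

```python
from urllib.parse import urlencode


def facet_url(site, method="fc", mincount=25):
    """Build the per-site facet request with a configurable facet.method."""
    params = [
        ("q", "*:*"),
        ("fq", f"site:{site}"),
        ("rows", 0),
        ("facet", "true"),
        ("facet.field", "autoSuggestContent"),
        ("facet.mincount", mincount),
        ("facet.limit", -1),
        ("facet.method", method),
        ("facet.sort", "index"),
    ]
    # safe="*:" keeps q=*:* and fq=site:... readable in the URL
    return "http://localhost:8080/solr/select?" + urlencode(params, safe="*:")


url = facet_url("www.abc.com")
```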

On Thu, Nov 22, 2012 at 2:53 PM, Pravin Agrawal 
pravin_agra...@persistent.co.in wrote:

 Hi All,

 We are using solr 3.4 with following schema fields.


 schema.xml---

 <fieldType name="autosuggest_text" class="solr.TextField"
     positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ShingleFilterFactory"
         maxShingleSize="5" outputUnigrams="true"/>
     <filter class="solr.PatternReplaceFilterFactory"
         pattern="^([0-9. ])*$" replacement=""
         replace="all"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>

 <field name="id" type="string" stored="true" indexed="true"/>
 <field name="autoSuggestContent" type="autosuggest_text" stored="true"
     indexed="true" multiValued="true"/>
 <copyField source="content" dest="autoSuggestContent"/>
 <copyField source="original_title" dest="autoSuggestContent"/>

 <field name="content" type="text" stored="true" indexed="true"/>
 <field name="original_title" type="text" stored="true" indexed="true"/>
 <field name="site" type="site" stored="false" indexed="true"/>


 /schema.xml---

 The index on the above schema is distributed across two Solr shards, each
 holding about 1.2 million documents and about 195GB on disk.

 We want to retrieve (site, autoSuggestContent term, frequency of the term)
 information from our above main solr index. The site is a field in document
 and contains name of site to which that document belongs. The terms are
 retrieved from multivalued field autoSuggestContent which is created using
 shingles from content and title of the web page.

 As of now, we are using facet query to retrieve (term, frequency of term)
  for each site. Below is a sample query (you may ignore initial part of
 query)


 http://localhost:8080/solr/select?indent=on&q=*:*&fq=site:www.abc.com&start=0&rows=0&fl=id&qt=dismax&facet=true&facet.field=autoSuggestContent&facet.mincount=25&facet.limit=-1&facet.method=enum&facet.sort=index

 The problem is that as the index has grown, this method has started
 taking a very long time. It used to take 7 minutes per site with an index
 size of 0.4 million docs but takes around 60-90 minutes for an index size
 of 2.5 million. At this speed, it will take around 5-6 days to index the
 complete 1500 sites. We also expect the index to grow with more documents
 and more sites, so the time to get the above information will increase
 further.

 Please let us know if there is any better way to extract (site, term,
 frequency) information compared to the current method.

 Thanks,
 Pravin Agrawal







Re: Questions about query times

2012-10-10 Thread Yuval Dotan
OK so I solved the question about the query that returns no results and
still takes time - I needed to add the facet.mincount=1 parameter and this
reduced the time to 200-300 ms instead of seconds.

I still couldn't figure out why a query that returns very few results (like
query number 2) still takes seconds to return even with
the facet.mincount=1 parameter.
I can't understand why the facet pivot takes so much time on 299 docs.

Does anyone have any idea?

Example Query:

(2)
q=*:*&fq=(trimTime:[2012-09-04T15:23:48Z TO *])&fq=(Severity:(High
Critical))&fq=(trimTime:[2012-09-04T15:23:48Z TO
*])&fq=(Confidence_Level:(N/A)) OR (Confidence_Level:(Medium-High)) OR
(Confidence_Level:(High))&f.product.facet.sort=index&f.product.facet.limit=-1&f.Severity.facet.sort=index&f.Severity.facet.limit=-1&f.trimTime.facet.sort=index&f.trimTime.facet.limit=-1&facet=true&f.product.facet.method=enum&facet.pivot=product,Severity,trimTime

NumFound: 299

Times(ms):
Qtime: 2,756 Query: 307 Facet: 2,449

On Thu, Sep 20, 2012 at 5:24 PM, Yuval Dotan yuvaldo...@gmail.com wrote:

 Hi,

 We have a system that inserts logs continuously (real-time).
 We have been using the Solr facet pivot feature for querying and have been
 experiencing slow query times and we were hoping to gain some insights with
 your help.
 schema and solrconfig are attached

 Here are our questions (data below):

1. Why is facet time so long in (3) and (5) - in cases where there are
0 or very few results?
2. We ran two queries that differ only in the time limit (for the
second query the time range is very small) - we got the same time for both
queries although the second one returned very few results - again, why is
that?
3. Is there a way to improve pivot facet time?

 System Data:

 Index size: 63 GB
 RAM:4Gb
 CPU: 2 x Xeon E5410 2.33GHz
 Num of Documents: 109,278,476


 query examples:

 -
 (1)
 Query:
 q=*:*&fq=(trimTime:[2012-09-04T14:29:24Z TO
 *])&fq=(trimTime:[2012-09-04T14:29:24Z TO
 *])&f.product.facet.sort=index&f.product.facet.limit=-1&f.Severity.facet.sort=index&f.Severity.facet.limit=-1&f.trimTime.facet.sort=index&f.trimTime.facet.limit=-1&facet=true&f.product.facet.method=enum&facet.pivot=product,Severity,trimTime

 NumFound:
 11,407,889

 Times (ms):
 Qtime: 3,239 Query: 353 Facet: 2,885
 -

 (2)
 Query:
 q=*:*&fq=(trimTime:[2012-09-04T15:23:48Z TO *])&fq=(Severity:(High
 Critical))&fq=(trimTime:[2012-09-04T15:23:48Z TO
 *])&fq=(Confidence_Level:(N/A)) OR (Confidence_Level:(Medium-High)) OR
 (Confidence_Level:(High))&f.product.facet.sort=index&f.product.facet.limit=-1&f.Severity.facet.sort=index&f.Severity.facet.limit=-1&f.trimTime.facet.sort=index&f.trimTime.facet.limit=-1&facet=true&f.product.facet.method=enum&facet.pivot=product,Severity,trimTime

 NumFound: 299

 Times(ms):
 Qtime: 2,756 Query: 307 Facet: 2,449

 -
 (3)
 Query:
 q=*:*&fq=(trimTime:[2012-09-11T12:55:00Z TO *])&fq=(Severity:(High
 Critical))&fq=(trimTime:[2012-09-04T15:23:48Z TO
 *])&fq=(Confidence_Level:(N/A)) OR (Confidence_Level:(Medium-High)) OR
 (Confidence_Level:(High))&f.product.facet.sort=index&f.product.facet.limit=-1&f.Severity.facet.sort=index&f.Severity.facet.limit=-1&f.trimTime.facet.sort=index&f.trimTime.facet.limit=-1&facet=true&f.product.facet.method=enum&facet.pivot=product,Severity,trimTime

 NumFound: 7

 Times(ms):
 Qtime: 2,798 Query: 312 Facet: 2,485

 -
 (4)
 Query:
 q=*:*&fq=(trimTime:[2012-09-04T15:43:16Z TO
 *])&fq=(trimTime:[2012-09-04T15:43:16Z TO *])&fq=(product:(Application
 Control)) OR (product:(URL
 Filtering))&f.appi_name.facet.sort=index&f.appi_name.facet.limit=-1&f.app_risk.facet.sort=index&f.app_risk.facet.limit=-1&f.matched_category.facet.sort=index&f.matched_category.facet.limit=-1&f.trimTime.facet.sort=index&f.trimTime.facet.limit=-1&facet=true&f.appi_name.facet.method=enum&facet.pivot=appi_name,app_risk,matched_category,trimTimeexf.trimTime.facet.limit=-1&facet=true&f.product.facet.method=enum&facet.pivot=product,Severity,trimTime

 NumFound: more than 30M

 Times(ms): Qtime: 23,288
 -

 (5)
 Query:
 q=*:*&fq=(trimTime:[2012-09-05T06:03:55Z TO *])&fq=(Severity:(High
 Critical))&fq=(trimTime:[2012-09-05T06:03:55Z TO *])&fq=(product:(IPS))
 OR (product:(SmartDefense))&fq=(action:(Detect)) OR
 (action:(mixed))&fq=(Confidence_Level:(Medium

Re: Partition Question

2012-05-09 Thread Yuval Dotan
Thanks Lance

There is already a clear partition - as you assumed, by date.

My requirement is for the best setup for:
1. A *single machine*
2. Quickly changing index - so I need to have the option to load and unload
partitions dynamically

Do you think that the sharding model that Solr offers is the most suitable
for this setup?
What about the Solr multi core model?

On Wed, May 9, 2012 at 12:23 AM, Lance Norskog goks...@gmail.com wrote:

 Lucene does not support more than 2^31 documents in a single index (about
 2.1 billion), so you need to partition. In Solr this is done with
 Distributed Search:

 http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/DistributedSearch

 First, you have to decide a policy for which documents go to which
 'shard'. It is common to make a hash code as the unique id, then
 distribute the documents modulo this value. This gives a roughly equal
 distribution of documents. If there is already a clear partition, like
 the date of the document (like newspaper articles) you could use that
 also.
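
The two policies described above could look roughly like this. It is a sketch
under assumptions: the function names are made up, and the one-partition-per-
month granularity for the date policy is only an example. A hashlib digest is
used because Python's built-in hash() of a string changes from one process to
the next, which would send the same id to different shards after a restart.

```python
import hashlib
from datetime import date


def shard_by_hash(doc_id, num_shards):
    """Hash the unique id with a stable digest and take it modulo the shard count."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_by_date(doc_date):
    """Date-based policy: one partition per calendar month (assumed granularity)."""
    return f"{doc_date.year:04d}-{doc_date.month:02d}"


# Every indexer that applies the same function sends a given id to the same shard.
counts = [0, 0, 0, 0]
for n in range(1000):
    counts[shard_by_hash(f"doc-{n}", 4)] += 1
```

With a hash policy every query has to visit every shard, while a date policy
lets time-limited queries visit only the partitions that overlap the requested
range.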

 You have new documents and existing documents. For new documents you
 need code for this policy to get all new documents to the right index.
 This could be one master program that passes them out, or each indexer
 could know which documents it gets.

 If you want to split up your current index, that's different. I have
 done this: for each shard, make a copy of the full index,
 delete-by-query all of the documents that are NOT in that shard, and
 optimize. We had to do this in sequence so it took a few days :) You
 don't need a full optimize. Use 'maxSegments=50' or '100' to suppress
 that last final giant merge.

 On Tue, May 8, 2012 at 12:02 AM, Yuval Dotan yuvaldo...@gmail.com wrote:
  Hi
  Can someone please guide me to the right way to partition the solr index?
 
  On Mon, May 7, 2012 at 11:41 AM, Yuval Dotan yuvaldo...@gmail.com
 wrote:
 
  Hi All
  Jan, thanks for the reply - answers for your questions are located below
  Please update me if you have ideas that can solve my problems.
 
  First, some corrections to my previous mail:
 
   Hi All
   We have an index of ~2,000,000,000 Documents and the query and facet
   times are too slow for us - our index in fact will be much larger
 
   Most of our queries will be limited by time, hence we want to partition
   the data by date/time - even when unlimited – which is mostly what will
   happen, we have results in the recent records and querying the whole
   dataset is redundant
 
   We want to partition the data because the index size is too big and
   doesn't fit into memory (80 Gb's) - our data actually continuously grows
   over time, it will never fit into memory, but has to be available for
   queries in case results are found in older records or a full facet is
   required
 
  
   1. Is multi core the best way to implement my requirement?
   2. I noticed there are some LOAD / UNLOAD actions on a core - should i
  use
   these action when managing my cores? if so how can i LOAD a core that
 i
   have unloaded
   for example:
   I have 7 partitions / cores - one for each day of the week - we might
  have 2000 per day
 
   In most cases I will search for documents only on the last day core.
   Once every 1 queries I need documents from all cores.
   Question: Do I need to unload all of the old cores and then load them
 on
   demand (when i see i need data from these cores)?
   3. If the answer to the last question is no, how do I ensure that the
   only cores that are loaded into memory are the ones I want?
  
   Thanks
   Yuval
  *
  *
  *Answers to Jan:*
 
  Hi,
 
  First you need to investigate WHY faceting and querying is too slow.
  What exactly do you mean by slow? Can you please tell us more about your
  setup?
 
  * How large documents and how many fields?
  small records ~200bytes, 20 fields avg most of them are not stored -
  attached schema and config file
 
  * What kind of queries? How many hits? How many facets? Have you studied
  debugQuery=true output?
  problem is not with queries being slow per se, it is with getting 50
  matches out of billions of matching docs
 
  * Do you use filter queries (fq) extensively?
  user generated queries, fq would not reduce the dataset for some of our
  usecases
 
  * What data do you facet on? Many unique values per field? Text or
 ranges?
  What facet.method?
   problem is not just faceting, it’s with queries – let’s start there
 
  * What kind of hardware? RAM/CPU
  HP DL180G6 , 2 E5645 (12 core)
  48 GB RAM
   * How have you configured your JVM? How much memory? GC?
  java -Xms512M -Xmx40960M -jar start.jar
 
  As you see, you will have to provide a lot more information on your use
  case and setup in order for us to judge correct action to take. You
 might
  need to adjust your config, or to optimize your queries or caches, slim
  your schema, buy some more RAM, or an SSD :)
 
  Normally, going multi core on one box will not necessarily help in
 itself,
  as there is overhead in sharding

Re: Partition Question

2012-05-08 Thread Yuval Dotan
Hi
Can someone please guide me to the right way to partition the solr index?

On Mon, May 7, 2012 at 11:41 AM, Yuval Dotan yuvaldo...@gmail.com wrote:

 Hi All
 Jan, thanks for the reply - answers for your questions are located below
 Please update me if you have ideas that can solve my problems.

 First, some corrections to my previous mail:

  Hi All
  We have an index of ~2,000,000,000 Documents and the query and facet
  times are too slow for us - our index in fact will be much larger

  Most of our queries will be limited by time, hence we want to partition
  the data by date/time - even when unlimited – which is mostly what will
  happen, we have results in the recent records and querying the whole
  dataset is redundant

  We want to partition the data because the index size is too big and
  doesn't fit into memory (80 Gb's) - our data actually continuously grows
  over time, it will never fit into memory, but has to be available for
  queries in case results are found in older records or a full facet is
  required

 
  1. Is multi core the best way to implement my requirement?
  2. I noticed there are some LOAD / UNLOAD actions on a core - should i
 use
  these action when managing my cores? if so how can i LOAD a core that i
  have unloaded
  for example:
  I have 7 partitions / cores - one for each day of the week - we might
 have 2000 per day

  In most cases I will search for documents only on the last day core.
  Once every 1 queries I need documents from all cores.
  Question: Do I need to unload all of the old cores and then load them on
  demand (when i see i need data from these cores)?
  3. If the answer to the last question is no, how do I ensure that the
  only cores that are loaded into memory are the ones I want?
 
  Thanks
  Yuval
 *
 *
 *Answers to Jan:*

 Hi,

 First you need to investigate WHY faceting and querying is too slow.
 What exactly do you mean by slow? Can you please tell us more about your
 setup?

 * How large documents and how many fields?
 small records ~200bytes, 20 fields avg most of them are not stored -
 attached schema and config file

 * What kind of queries? How many hits? How many facets? Have you studied
 debugQuery=true output?
 problem is not with queries being slow per se, it is with getting 50
 matches out of billions of matching docs

 * Do you use filter queries (fq) extensively?
 user generated queries, fq would not reduce the dataset for some of our
 usecases

 * What data do you facet on? Many unique values per field? Text or ranges?
 What facet.method?
  problem is not just faceting, it’s with queries – let’s start there

 * What kind of hardware? RAM/CPU
 HP DL180G6 , 2 E5645 (12 core)
 48 GB RAM
  * How have you configured your JVM? How much memory? GC?
 java -Xms512M -Xmx40960M -jar start.jar

 As you see, you will have to provide a lot more information on your use
 case and setup in order for us to judge correct action to take. You might
 need to adjust your config, or to optimize your queries or caches, slim
 your schema, buy some more RAM, or an SSD :)

 Normally, going multi core on one box will not necessarily help in itself,
 as there is overhead in sharding multi cores as well. However, it COULD be
 a solution since you say that most of the time you only need to consider
 1/7 of your data. I would perhaps consider one hot core for last 24h, and
 one archive core for older data. You could then tune these differently
 regarding caches etc.
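
 The hot/archive split suggested above implies some routing logic in the
 client. A minimal sketch, assuming queries have a start time and run up to
 "now"; the core addresses below are hypothetical placeholders:

```python
from datetime import datetime, timedelta

HOT = "localhost:8983/solr/hot"          # hypothetical core holding the last 24h
ARCHIVE = "localhost:8983/solr/archive"  # hypothetical core holding older data


def shards_for(query_start, now):
    """Return the value of the shards parameter for a query whose time range
    runs from query_start up to now."""
    if query_start >= now - timedelta(hours=24):
        return HOT  # the whole range falls inside the hot core
    return ",".join([HOT, ARCHIVE])  # older data is needed as well


now = datetime(2012, 5, 7, 12, 0)
recent = shards_for(now - timedelta(hours=1), now)
older = shards_for(now - timedelta(days=3), now)
```

 Documents would also have to be moved from the hot core to the archive core
 as they age, which this sketch does not cover.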

 Can you get back with some more details?

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Solr Training - www.solrtraining.com




Partition Question

2012-05-06 Thread Yuval Dotan
Hi All
We have an index of ~2,000,000,000 Documents and the query and facet times
are too slow for us.
Before using the shards solution for improving performance, we thought
about using the multicore feature (our goal is to maximize performance for
a single machine).
Most of our queries will be limited by time, hence we want to partition the
data by date/time.
We want to partition the data because the index size is too big and doesn't
fit into memory (80 Gb's).

1. Is multi core the best way to implement my requirement?
2. I noticed there are some LOAD / UNLOAD actions on a core - should I use
these actions when managing my cores? If so, how can I LOAD a core that I
have unloaded?
For example:
I have 7 partitions / cores - one for each day of the week.
In most cases I will search for documents only on the last day core.
Once every 1 queries I need documents from all cores.
Question: Do I need to unload all of the old cores and then load them on
demand (when I see I need data from these cores)?
3. If the answer to the last question is no, how do I ensure that the only
cores that are loaded into memory are the ones I want?

Thanks
Yuval
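
To make the day-of-week layout concrete, here is a rough sketch of naming the
cores and building CoreAdmin requests for them. The host, port, admin path,
core naming scheme and instance directory are all assumptions. UNLOAD and
CREATE are CoreAdmin actions; using CREATE to bring back a core that was
unloaded earlier is shown here as one possible approach, not as a confirmed
answer to question 2.

```python
from datetime import date
from urllib.parse import urlencode

ADMIN = "http://localhost:8983/solr/admin/cores"  # assumed host, port and path
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def core_for(day):
    """Name of the core for a date: one core per weekday, e.g. core_sun."""
    return "core_" + DAYS[day.weekday()]


def unload_url(core):
    """CoreAdmin request that removes a core from the running Solr instance."""
    return ADMIN + "?" + urlencode({"action": "UNLOAD", "core": core})


def create_url(core, instance_dir):
    """CoreAdmin request that registers a core from an existing directory."""
    params = {"action": "CREATE", "name": core, "instanceDir": instance_dir}
    return ADMIN + "?" + urlencode(params, safe="/")


today_core = core_for(date(2012, 5, 6))  # the day this question was posted
```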


Re: Java out of memory - with fieldcache faceting

2012-04-30 Thread Yuval Dotan
Thanks for the fast answer
One more question:
Is there a way (some formula) to estimate how much memory I need for
these operations?

Thanks
Yuval

On Mon, Apr 30, 2012 at 11:50, Dan Tuffery dan.tuff...@gmail.com wrote:

 You need to add more memory to the JVM that is running Solr:

 http://wiki.apache.org/solr/SolrPerformanceFactors#OutOfMemoryErrors

 Dan

 On Mon, Apr 30, 2012 at 9:43 AM, Yuval Dotan yuvaldo...@gmail.com wrote:

  Hi Guys
  I have a problem and i need your assistance
  I get an exception when doing field cache faceting (the enum method works
  perfectly):
 
  */solr/select?q=*:*&facet=true&facet.field=src_ip_str&facet.limit=10*
 
  <lst name="error">
  <str name="msg">java.lang.OutOfMemoryError: Java heap space</str>
  <str name="trace">
  java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
      at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:449)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:277)
      at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
      at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
      at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
      at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
      at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
      at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
      at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
      at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
      at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
      at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
      at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
      at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
      at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
      at org.eclipse.jetty.server.Server.handle(Server.java:351)
      at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
      at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
      at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
      at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
      at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634)
      at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
      at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
      at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
      at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
      at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
      at java.lang.Thread.run(Thread.java:679)
  Caused by: java.lang.OutOfMemoryError: Java heap space
      at org.apache.lucene.util.packed.Direct16.<init>(Direct16.java:38)
      at org.apache.lucene.util.packed.PackedInts.getMutable(PackedInts.java:267)
      at org.apache.lucene.util.packed.GrowableWriter.set(GrowableWriter.java:81)
      at org.apache.lucene.search.FieldCacheImpl$DocTermsIndexCache.createValue(FieldCacheImpl.java:1178)
      at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:248)
      at org.apache.lucene.search.FieldCacheImpl.getTermsIndex(FieldCacheImpl.java:1081)
      at org.apache.lucene.search.FieldCacheImpl.getTermsIndex(FieldCacheImpl.java:1077)
      at org.apache.solr.request.SimpleFacets.getFieldCacheCounts(SimpleFacets.java:459)
      at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:310)
      at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:396)
      at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:205)
      at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:81)
      at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
      at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
      at org.apache.solr.core.SolrCore.execute(SolrCore.java:1541)
      at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
      at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
      at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
      at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119