Re: Use cases for ReplicationHandler's backup facility?

2009-09-25 Thread Chris Harris
2009/9/24 Noble Paul നോബിള്‍  नोब्ळ् noble.p...@corp.aol.com:
 On Fri, Sep 25, 2009 at 4:57 AM, Chris Harris rygu...@gmail.com wrote:
 The ReplicationHandler (http://wiki.apache.org/solr/SolrReplication)
 has support for backups, which can be triggered in one of two ways:

 1. in response to startup/commit/optimize events (specified through
 the backupAfter tag specified in the handler's requestHandler tag in
 solrconfig.xml)
 2. by manually hitting 
 http://master_host:port/solr/replication?command=backup

 These backups get placed in directories named, e.g.
 snapshot.20090924033521, inside the solr data directory.

 According to the docs, these backups are not necessary for replication
 to work. My question is: What use case *are* they meant to address?

 The first potential use case that came to mind was that maybe I would
 be able to restore my index from these snapshot directories should it
 ever become corrupted. (I could just do something like rm -r data; mv
 snapshot.20090924033521 data.) That appears not to be one of the
 intended use cases, though; if it were, then I imagine the snapshot
 directories would contain the entire index, whereas they seem to
 contain only deltas of one form or another.
 Yes, the only reason to take a backup should be for restoration/archival
 They should contain all the files required for the latest commit point.

To be clear, you'd have to write your own code to make any kind of
restore from these snapshot back directories possible, right? (That
is, the handler itself doesn't implement any kind of restore, nor
can you restore by using simple filesystem commands like cp -r or mv.)

For example, the most straightforward case would be if you limited
yourself to only doing backups after each optimize; that's
straightforward in that each snapshot directory should contain all the
segment files required for a particular point-in-time view of the
index. However, it still wouldn't contain the Lucene segments_N file,
and it seems like to implement an index restore you'd need to try to
reconstitute that somehow.
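For what it's worth, a quick way to test whether a given snapshot directory constitutes a complete commit point is to try opening it with Lucene directly. A minimal sketch, assuming Lucene 2.9 on the classpath (the class name and path argument are illustrative):

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    public class SnapshotCheck {
        public static void main(String[] args) throws Exception {
            // Opening fails with an exception if the directory holds no
            // segments_N file, i.e. no complete commit point.
            FSDirectory dir = FSDirectory.open(new File(args[0]));
            IndexReader reader = IndexReader.open(dir, true); // read-only
            System.out.println("numDocs=" + reader.numDocs());
            reader.close();
            dir.close();
        }
    }

If that opens cleanly, the snapshot is a self-contained index and a plain filesystem copy back into the data directory should in principle work; if it throws, the segments file really is missing.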


Unsubscribe from this mailing-list

2009-09-25 Thread Rafeek Raja
Unsubscribe from this mailing-list


Re: Solr http post performance seems slow - help?

2009-09-25 Thread Constantijn Visinescu
This may or may not help, but here goes :)

When I was running performance tests I took a look at the simple post tool
that comes with the Solr examples.

First I changed my schema.xml to fit my needs, and then I deleted the old
index so Solr created a blank one when I started up.
Then I had a process chew on my data and spit out XML files that are
formatted similarly to the XML files that the SimplePostTool example uses.
Next I used the simple post tool to post the XML files to Solr (60k-80k
records per XML file). Each file only took a couple of minutes to index this
way.
Commit and optimize after that (took less than 10 minutes), and after about
2.5 hrs I had indexed just under 8 million records.

This was on a 4 year old single core laptop using resin 3 as my servlet
container.

Hope this helps.


On Fri, Sep 25, 2009 at 3:51 AM, Lance Norskog goks...@gmail.com wrote:

 In top, press the '1' key. This will give a list of the CPUs and how
 much load is on each. The display is otherwise a little weird for
 multi-cpu machines. But don't be surprised when Solr is I/O bound. The
 biggest fanciest RAID is often a better investment than CPUs. On one
 project we bought low-end rack servers that come with 6-8 disk bays and
 filled them with 10k/15k RPM disks.

 On Wed, Sep 23, 2009 at 2:47 PM, Dan A. Dickey dan.dic...@savvis.net
 wrote:
  On Friday 11 September 2009 11:06:20 am Dan A. Dickey wrote:
  ...
  Our JBoss expert and I will be looking into why this might be occurring.
  Does anyone know of any JBoss related slowness with Solr?
  And does anyone have any other sort of suggestions to speed indexing
  performance?   Thanks for your help all!  I'll keep you up to date with
  further progress.
 
  Ok, further progress... just to keep any interested parties up to date
  and for the record...
 
  I'm finding that using the example jetty setup (will be switching very
  very soon to a real jetty installation) is about the fastest.  Using
  several processes to send posts to Solr helps a lot, and we're seeing
  about 80 posts a second this way.
 
  We also stripped down JBoss to the bare bones and the Solr in it
  is running nearly as fast - about 50 posts a second.  It was our previous
  JBoss configuration that was making it appear slow for some reason.
 
  We will be running more tests and spreading out the pre-index workload
  across more machines and more processes. In our case we were seeing
  the bottleneck being one machine running 18 processes.
  The 2 quad-core Xeon system is experiencing about a 25% CPU load.
  And I'm not certain, but I think this may actually be 25% of one of the 8
 cores.
  So, there's *lots* of room for Solr to be doing more work there.
 -Dan
 
  --
  Dan A. Dickey | Senior Software Engineer
 
  Savvis
  10900 Hampshire Ave. S., Bloomington, MN  55438
  Office: 952.852.4803 | Fax: 952.852.4951
  E-mail: dan.dic...@savvis.net
 



 --
 Lance Norskog
 goks...@gmail.com
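
A minimal SolrJ sketch of the parallel-posting approach described in this thread, assuming Solr 1.4's CommonsHttpSolrServer (the URL, field names, and counts are illustrative):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelPoster {
        public static void main(String[] args) throws Exception {
            // A single server instance is thread-safe; 8 threads stand in
            // for the "several processes" used above.
            final SolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            ExecutorService pool = Executors.newFixedThreadPool(8);
            for (int t = 0; t < 8; t++) {
                final int offset = t;
                pool.execute(new Runnable() {
                    public void run() {
                        try {
                            // Each thread posts an interleaved slice of docs.
                            for (int i = offset; i < 100000; i += 8) {
                                SolrInputDocument doc = new SolrInputDocument();
                                doc.addField("id", "doc" + i);
                                doc.addField("text", "body of document " + i);
                                server.add(doc);
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            server.commit(); // commit once at the end, as described above
        }
    }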



Re: Unsubscribe from this mailing-list

2009-09-25 Thread Avlesh Singh
You seem to be desperate to get out of the Solr mailing list :)
Send an email to solr-user-unsubscr...@lucene.apache.org

Cheers
Avlesh

On Fri, Sep 25, 2009 at 11:54 AM, Rafeek Raja rafeek.r...@gmail.com wrote:

 Unsubscribe from this mailing-list



Highlighting on text fields

2009-09-25 Thread Avlesh Singh
I am new to the whole highlighting API and have a few basic questions:
I have a text type field defined as underneath:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

And the schema field is associated as follows:
<field name="text_entity_name" type="text" indexed="true" stored="false"/>

My query, q=text_entity_name:(foo bar)&hl=true&hl.fl=text_entity_name,
works fine for the search part but not for highlighting. The highlight named
list is empty for each document returned.

I have a unique key defined. What am I missing? Do I need to store term
vectors for highlighting to work properly?

Cheers
Avlesh
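
For reference: the standard highlighter reads the stored value of the field at query time, so a field indexed with stored="false" (as above) yields an empty highlight list; term vectors are an optional speed-up, not a requirement. A minimal SolrJ sketch of the same request, assuming a stored copy of the field (the server URL is illustrative):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class HighlightDemo {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("text_entity_name:(foo bar)");
            q.setHighlight(true);
            q.addHighlightField("text_entity_name");
            QueryResponse rsp = server.query(q);
            // Map of document key -> (field -> snippets); empty per-field
            // entries usually mean the field is not stored.
            System.out.println(rsp.getHighlighting());
        }
    }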


Re: Showcase: Facetted Search for Wine using Solr

2009-09-25 Thread Marian Steinbach
Hi Grant!

Thanks for the advice, I added the link to the list.

Regards,

Marian


On Fri, Sep 25, 2009 at 5:14 AM, Grant Ingersoll gsing...@apache.org wrote:
 Hi Marian,

 Looks great!  Wish I could order some wine.  When you get a chance, please
 add the site to http://wiki.apache.org/solr/PublicServers!

 Cheers,
 Grant

 On Sep 24, 2009, at 11:51 AM, marian.steinbach wrote:

 Hello everybody!

 The purpose of this mail is to say thank you to the creators of Solr
 and to the community that supports it.

 We released our first project using Solr several weeks ago, after
 having tested Solr for several months.

 The project I'm talking about is a product search for an online wine
 shop (sorry, german user interface only):

  http://www.koelner-weinkeller.de/index.php?id=sortiment

 Our client offers about 3000 different wines and other related products.

 Before we introduced Solr, the products were searched via
 complicated and slow SQL statements, with all kinds of problems related
 to that: no full-text indexing, no stemming, etc.

 We are happy to make use of several built-in features which solve
 problems that bugged us: faceted search, German accents and stemming,
 and synonyms being the most important ones.

 The surrounding website is TYPO3 driven. We integrated Solr by
 creating our own frontend plugin which talks to the Solr webservice
 (and we're very happy about the PHP output type!).

 I'd be glad about your comments.

 Cheers,

 Marian

 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
 Solr/Lucene:
 http://www.lucidimagination.com/search




Using two Solr documents to represent one logical document/file

2009-09-25 Thread Peter Ledbrook

Hi,

I want to index both the contents of a document/file and metadata associated
with that document. Since I also want to update the content and metadata
indexes independently, I believe that I need to use two separate Solr
documents per real/logical document. The question I have is how do I merge
query results so that only one result is returned per real/logical document,
not per Solr document? In particular, I don't want to filter the results to
satisfy any max results constraint.

I have read that this can be achieved with a facet search. Is this the best
approach, or is there some alternative?

Thanks,

Peter
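
For reference, the facet-based approach amounts to something like the request below, where logical_id is a hypothetical field shared by the two Solr documents that make up one logical document; each facet value then stands for one logical document:

http://localhost:8983/solr/select?q=foo&facet=true&facet.field=logical_id&facet.mincount=1&rows=0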



What options would you recommend for the Sun JVM?

2009-09-25 Thread Jérôme Etévé
Hi solr addicts,

I know there's no one-size-fits-all set of options for the Sun JVM,
but I think it'd be useful to everyone to share your tips on using the
Sun JVM with Solr.

For instance, I recently figured out that setting the tenured
generation garbage collection to concurrent mark and sweep
( -XX:+UseConcMarkSweepGC ) has dramatically decreased the amount of
time Java hangs on tenured-generation garbage collecting. With my settings,
the old-generation garbage collection went from big time chunks of 1~2
seconds to multiple small slices of ~0.2 s.

As a result, the commits (hence the searcher drop/rebuild) are much
less painful from the application performance point of view.

What are the other options you would recommend?

Cheers!

Jerome.

-- 
Jerome Eteve.
http://www.eteve.net
jer...@eteve.net
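
For reference, one commonly used combination of Sun JVM flags along the lines described above (heap sizes are illustrative; the GC logging flags just help you measure the effect):

java -server -Xms4096m -Xmx4096m \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log \
     -jar start.jar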


DIH RSS 1.4 nightly 2009-09-25: full-import&clean=false always cleans and import command does nothing

2009-09-25 Thread Brahim Abdesslam

Hello everybody,

we are using Solr to index some RSS feeds for a news aggregator application.

We've got some difficulties with the publication date of each item
because each site uses a homemade date format.
The fact is that we want to have the exact amount of time between the
date of publication and now.


So we decided to use a timestamp that stores the index time for each item.

The problem is:

   * when I do a full-import&clean=false, the index is always cleaned.
   * when I do a simple import, nothing seems to be done.

Here is the configuration :

   * Apache Solr 1.4 Nightly 2009-09-25
   * java version : build 1.6.0_15-b03
   * Java HotSpot Client VM : build 14.1-b02, mixed mode, sharing

= data-config.xml

<?xml version="1.0" encoding="utf-8"?>
<dataConfig>
   <dataSource type="HttpDataSource" />
   <document>
   <entity name="flux_367"
           pk="link"
           url="http://www.capital.fr/rss2/feed/fil-bourse.xml"
           processor="XPathEntityProcessor"
           forEach="/rss/channel | /rss/channel/item"
           transformer="DateFormatTransformer, TemplateTransformer"
           onError="continue">
      <field column="source" template="368" commonField="true" />
      <field column="type" template="0" commonField="true" />

      <field column="title" xpath="/rss/channel/item/title" />
      <field column="link" xpath="/rss/channel/item/link" />
      <field column="description" xpath="/rss/channel/item/description" />
      <field column="date" xpath="/rss/channel/item/pubDate"
             dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss z" />
   </entity>
   </document>
</dataConfig>

= schema.xml

[...]
<fields>
  <field name="source" type="text" indexed="true" stored="true" />
  <field name="title" type="text" indexed="true" stored="true" />
  <field name="link" type="string" indexed="true" stored="true" />
  <field name="description" type="html" indexed="true" stored="true" />
  <field name="date" type="date" indexed="true" stored="true" default="NOW" />

  <field name="type" type="sint" indexed="true" stored="true" />
  <field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />

  <copyField source="source" dest="all_text" />
  <copyField source="title" dest="all_text" />
  <copyField source="description" dest="all_text" />
  <copyField source="date" dest="all_text" />
  <copyField source="type" dest="all_text" />

  <!-- Here, default is used to create a timestamp field indicating
       when each document was indexed.
  -->
  <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
</fields>

<uniqueKey>link</uniqueKey>

<defaultSearchField>all_text</defaultSearchField>

<solrQueryParser defaultOperator="OR"/>
[...]

- Tests:

= command=full-import&clean=false

25-Sep-2009 14:58:21 org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
25-Sep-2009 14:58:21 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=6
25-Sep-2009 14:58:21 org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
25-Sep-2009 14:58:21 org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
25-Sep-2009 14:58:21 org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
25-Sep-2009 14:58:21 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
    commit{dir=D:\srv\solr\index,segFN=segments_2s,version=1251453476028,generation=100,filenames=[segments_2s, _3u.cfs, _3u.cfx]
25-Sep-2009 14:58:21 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1251453476028
25-Sep-2009 14:58:22 org.apache.solr.handler.dataimport.DocBuilder finish
INFO: Import completed successfully

= command=import

25-Sep-2009 14:59:20 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/dataimport params={command=import} status=0 QTime=0
25-Sep-2009 14:59:20 org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties

INFO: Read dataimport.properties

Any idea or suggestion?
Thank you in advance!
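
For reference, clean is a separate request parameter, so the ampersand matters; a typical request looks like (host and port are illustrative):

http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true

Note also that the log above shows params={command=full-import} with no clean parameter at all, which would explain the index being cleaned (clean defaults to true for full-import), and that import does not appear to be a DIH command in this build (the handler knows full-import, delta-import, status, reload-config, and abort), which would explain the second request doing nothing.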
--

Brahim Abdesslam
Director of Operations

* Maecia - Web Development *
Mob : +33 (0)6 82 87 31 27
Tel  : +33 (0)9 54 99 29 59
Fax : +33 (0)9 59 99 29 59

http://www.maecia.com



RE: Alphanumeric Wild Card Search Question

2009-09-25 Thread Carr, Adrian
Hi Ken,
I am using the WordDelimiterFilterFactory. I thought I needed it because I 
thought that's what gave me the control over the options of how the words are 
split and indexed? I did try taking it out completely, but that didn't seem to 
help.

I'll try the analysis tool today. There has got to be a simple solution for 
this, but it is sure eluding me.
Thanks,
Adrian

-Original Message-
From: Ensdorf Ken [mailto:ensd...@zoominfo.com] 
Sent: Thursday, September 24, 2009 5:03 PM
To: solr-user@lucene.apache.org
Subject: RE: Alphanumeric Wild Card Search Question

 Here's my question:
 I have some products that I want to allow people to search for with 
 wild cards. For example, if my product is YBM354, I'd like for users 
 to be able to search on YBM*, YBM3*, YBM35* and for any of these 
 searches to return that product. I've found that I can search for 
 YBM* and get the product, just not the other combinations.

Are you using WordDelimiterFilterFactory?  That would explain this behavior.

If so, do you need it - for the queries you describe you don't need that kind 
of tokenization.

Also, have you played with the analysis tool on the admin page, it is a great 
help in debugging things like this.

-Ken


RE: Alphanumeric Wild Card Search Question

2009-09-25 Thread Carr, Adrian
In case it helps, here's what I have currently, but I've been messing with 
different options:

<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="0"
        generateNumberParts="0"
        catenateWords="1"
        catenateNumbers="1"
        catenateAll="1"
        splitOnNumerics="0"
        preserveOriginal="1"/>
 

-Original Message-
From: Carr, Adrian [mailto:adrian.c...@jtv.com] 
Sent: Friday, September 25, 2009 9:28 AM
To: solr-user@lucene.apache.org
Subject: RE: Alphanumeric Wild Card Search Question

Hi Ken,
I am using the WordDelimiterFilterFactory. I thought I needed it because I 
thought that's what gave me the control over the options of how the words are 
split and indexed? I did try taking it out completely, but that didn't seem to 
help.

I'll try the analysis tool today. There has got to be a simple solution for 
this, but it is sure eluding me.
Thanks,
Adrian

-Original Message-
From: Ensdorf Ken [mailto:ensd...@zoominfo.com]
Sent: Thursday, September 24, 2009 5:03 PM
To: solr-user@lucene.apache.org
Subject: RE: Alphanumeric Wild Card Search Question

 Here's my question:
 I have some products that I want to allow people to search for with 
 wild cards. For example, if my product is YBM354, I'd like for users 
 to be able to search on YBM*, YBM3*, YBM35* and for any of these 
 searches to return that product. I've found that I can search for 
 YBM* and get the product, just not the other combinations.

Are you using WordDelimiterFilterFactory?  That would explain this behavior.

If so, do you need it - for the queries you describe you don't need that kind 
of tokenization.

Also, have you played with the analysis tool on the admin page, it is a great 
help in debugging things like this.

-Ken


Re: OOM error during merge - index still ok?

2009-09-25 Thread Yonik Seeley
On Fri, Sep 25, 2009 at 8:20 AM, Phillip Farber pfar...@umich.edu wrote:
  Can I expect the index to be left in a usable state ofter an out of memory
 error during a merge or it it most likely to be corrupt?

It should be in the state it was after the last successful commit.

-Yonik
http://www.lucidimagination.com
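
If in doubt, Lucene's CheckIndex tool can verify the index offline; back up first, since its -fix option drops unreadable segments along with their documents (the jar name and path are illustrative):

java -cp lucene-core-2.9.0.jar org.apache.lucene.index.CheckIndex /path/to/solr/data/index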

  I'd really hate to
 have to start this index build again from square one.  Thanks.

 Thanks,

 Phil

 ---
 Exception in thread "http-8080-Processor2505" java.lang.OutOfMemoryError: Java heap space
 Exception in thread "RMI TCP Connection(131)-141.213.128.155" java.lang.OutOfMemoryError: Java heap space
 Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]]" java.lang.OutOfMemoryError: Java heap space
 Exception in thread "http-8080-Processor2537" java.lang.OutOfMemoryError: Java heap space
 Exception in thread "http-8080-Processor2483" Exception in thread "RMI Scheduler(0)" java.lang.OutOfMemoryError: Java heap space
 java.lang.OutOfMemoryError: Java heap space
 Exception in thread "Lucene Merge Thread #202" org.apache.lucene.index.MergePolicy$MergeException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:351)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:315)
 Caused by: java.lang.OutOfMemoryError: Java heap space
 Exception in thread "Lucene Merge Thread #266" org.apache.lucene.index.MergePolicy$MergeException: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot merge
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:351)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:315)
 Caused by: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot merge
    at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:4529)
    at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:4512)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4424)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:235)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:291)
 WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
 WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.
 WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
 WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.




Re: Can we point a Solr server to index directory dynamically at runtime..

2009-09-25 Thread Michael
Are you storing (in addition to indexing) your data?  Perhaps you could turn
off storage on data older than 7 days (requires reindexing), thus losing the
ability to return snippets but cutting down on your storage space and server
count.  I've experienced a 10x decrease in space requirements and a large
boost in speed after cutting extraneous storage from Solr -- the stored data
is mixed in with the index data and so it slows down searches.
You could also put all 200G of 7-day data onto one Solr instance rather than
10, and accept that those searches will be slower.

Michael

On Fri, Sep 25, 2009 at 1:34 AM, Silent Surfer silentsurfe...@yahoo.comwrote:

 Hi,

 Thank you Michael and Chris for the response.

 Today after the mail from Michael, we tested with the dynamic loading of
 cores and it worked well. So we need to go with the hybrid approach of
 Multicore and Distributed searching.

 As per our testing, we found that a Solr instance with 20 GB of
 index(single index or spread across multiple cores) can provide better
 performance when compared to having a Solr instance say 40 (or) 50 GB of
 index (single index or index spread across cores).

 So the 200 GB of index on day 1 will be spread across 200/20 = 10 Solr slave
 instances.

 On day 2 data, 10 more Solr slave servers are required; Cumulative Solr
 Slave instances = 200*2/20=20
 ...
 ..
 ..
 On day 30 data, 10 more Solr slave servers are required; Cumulative Solr
 Slave instances = 200*30/20=300

 So with the above approach, we may need ~300 Solr slave instances, which
 becomes very unmanageable.

 But we know that most of the queries are for the past week, i.e. we
 definitely need 70 Solr slaves containing the last 7 days' worth of data up
 and running.

 Now for the rest of the 230 Solr instances, do we need to keep them running
 for the odd query that can span the 30 days of data (30*200 GB = 6 TB of
 data), which may come up only a couple of times a day?
 This linear increase of Solr servers with the retention period doesn't
 seem to be a very scalable solution.

 So we are looking for a simpler approach to handle this
 scenario.

 Appreciate any further inputs/suggestions.

 Regards,
 sS

 --- On Fri, 9/25/09, Chris Hostetter hossman_luc...@fucit.org wrote:

  From: Chris Hostetter hossman_luc...@fucit.org
  Subject: Re: Can we point a Solr server to index directory dynamically
 at  runtime..
  To: solr-user@lucene.apache.org
  Date: Friday, September 25, 2009, 4:04 AM
  : Using a multicore approach, you
  could send a create a core named
  : 'core3weeksold' pointing to '/datadirs/3weeksold' 
  command to a live Solr,
  : which would spin it up on the fly.  Then you query
  it, and maybe keep it
  : spun up until it's not queried for 60 seconds or
  something, then send a
  : remove core 'core3weeksold'  command.
  : See http://wiki.apache.org/solr/CoreAdmin#CoreAdminHandler
  .
 
  something that seems implicit in the question is what to do
  when the
  request spans all of the data ... this is where (in theory)
  distributed
  searching could help you out.
 
  index each day's worth of data into its own core; that makes it really
  easy to expire the old data (just UNLOAD and delete an entire core once
  it's more than 30 days old). If your user is only searching current data
  then your app can directly query the core containing the most current data
  -- but if they want to query the last week, or last two weeks' worth of
  data, you do a distributed request for all of the shards needed to search
  the appropriate amount of data.
 
  Between the ALIAS and SWAP commands on the CoreAdmin screen it should
  be pretty easy to have cores with names like today, 1dayold, 2dayold, so
  that your app can configure simple shard params for all the permutations
  you'll need to query.
 
 
  -Hoss
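
For reference, such a distributed request amounts to something like the following, where the core names follow the scheme above and host/port are illustrative:

http://localhost:8983/solr/today/select?q=foo&shards=localhost:8983/solr/today,localhost:8983/solr/1dayold,localhost:8983/solr/2dayold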
 
 








Re: Parallel requests to Tomcat

2009-09-25 Thread Michael
Thank you Grant and Lance for your comments -- I've run into a separate snag
which puts this on hold for a bit, but I'll return to finish digging into
this and post my results. - Michael
On Thu, Sep 24, 2009 at 9:23 PM, Lance Norskog goks...@gmail.com wrote:

 Are you on Java 5, 6 or 7? Each release sees some tweaking of the Java
 multithreading model as well as performance improvements (and bug
 fixes) in the Sun HotSpot runtime.

 You may be tripping over the TCP/IP multithreaded connection manager.
 You might wish to create each client thread with a separate socket.

 Also, here is a standard bit of benchmarking advice: include think
 time. This means that instead of sending requests constantly, each
 thread should time out for a few seconds before sending the next
 request. This simulates a user stopping and thinking before clicking
 the mouse again. This helps simulate the quantity of threads, etc.
 which are stopped and waiting at each stage of the request pipeline.
 As it is, you are trying to simulate the throughput behaviour without
 simulating the horizontal volume. (Benchmarking is much harder than it
 looks.)

 On Wed, Sep 23, 2009 at 9:43 AM, Grant Ingersoll gsing...@apache.org
 wrote:
 
  On Sep 23, 2009, at 12:09 PM, Michael wrote:
 
  On Wed, Sep 23, 2009 at 12:05 PM, Yonik Seeley
  yo...@lucidimagination.comwrote:
 
  On Wed, Sep 23, 2009 at 11:47 AM, Michael solrco...@gmail.com wrote:
 
  If this were IO bound, wouldn't I see the same results when sending my
 8
  requests to 8 Tomcats?  There's only one disk (well, RAM) whether
 I'm
  querying 8 processes or 8 threads in 1 process, right?
 
  Right - I was thinking IO bound at the Lucene Directory level - which
  synchronized in the past and led to poor concurrency.  But your Solr
  version is recent enough to use the newer unsynchronized method by
  default (on non-windows)
 
 
  Ah, OK.  So it looks like comparing to Jetty is my only next step.
   Although
  I'm not sure what I'm going to do based on the result of that test -- if
  Jetty behaves differently, then I still don't know why the heck Tomcat
 is
  behaving badly! :)
 
 
  Have you done any profiling to see where hotspots are?  Have you looked
 at
  garbage collection?  Do you have any full collections occurring?  What
  garbage collector are you using?  How often are you updating/committing,
  etc?
 
 
  --
  Grant Ingersoll
  http://www.lucidimagination.com/
 
  Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
  Solr/Lucene:
  http://www.lucidimagination.com/search
 
 



 --
 Lance Norskog
 goks...@gmail.com
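
A minimal sketch of the think-time idea Lance describes, assuming the actual HTTP call is filled in by the test (the sendQuery method here is a stub, and thread/iteration counts are illustrative):

    import java.util.Random;

    public class ThinkTimeClient implements Runnable {
        private final Random rnd = new Random();

        public void run() {
            for (int i = 0; i < 1000; i++) {
                sendQuery(); // one request per loop iteration
                try {
                    // "think time": pause 1-5 seconds, like a user
                    // reading results before clicking again
                    Thread.sleep(1000 + rnd.nextInt(4000));
                } catch (InterruptedException e) {
                    return;
                }
            }
        }

        private void sendQuery() {
            // HTTP request to Solr goes here (stubbed out)
        }

        public static void main(String[] args) {
            for (int i = 0; i < 20; i++) {
                new Thread(new ThinkTimeClient()).start(); // 20 simulated users
            }
        }
    }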



RE: Mixed field types and boolean searching

2009-09-25 Thread Ensdorf Ken
 No- there are various analyzers. StandardAnalyzer is geared toward
 searching bodies of text for interesting words -  punctuation is
 ripped out. Other analyzers are more useful for concrete text. You
 may have to work at finding one that leaves punctuation in.
 

My problem is not with the StandardAnalyzer per se, but more as to how dismax 
style queries are handled by the query parser when the different fields have 
different sets of ignored tokens or stop words.

Say you want to use the contents of a text box in your app and query a field in 
Solr.  The user enters A and B, so you map this to f1:A and f1:B.  Now, if 
B is an ignored token in the f1 field for whatever reason, the query boils 
down to f1:A.  

Now imagine you want to allow the user's text to match multiple fields - as in 
any term can match any field, but all terms must match at least 1 field.  So 
now you map the user's query to (f1:A OR f2:A) AND (f1:B OR f2:B).  But if f2 
does not ignore B, the query boils down to (f1:A OR f2:A) AND (f2:B).  Now 
documents that could come back when you were only matching against the f1 field 
don't come back.  

This seems counter-intuitive - to be consistent, I would think the query should 
essentially be treated as (f1:A OR f2:A) AND (TRUE OR f2:B)  - and thus a 
term that is a stop word or ignored token for any of the fields would be 
ignored across the board.

So I guess what I'm asking is if there is a reason for the existing behavior, 
or is it just a fact-of-life of the query parser?  Thanks!

-Ken


Re: Faceted Search on Dynamic Fields?

2009-09-25 Thread danben

Also, here is the field definition in the schema

<dynamicField name="*&amp;STRING_NOT_ANALYZED_YES" type="string"
              indexed="true" stored="true" multiValued="true"/>





RE: Solr and Garbage Collection

2009-09-25 Thread cbennett
Hi,

Have you looked at tuning the garbage collection?

Take a look at the following articles

http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/
http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html

Changing to the concurrent or throughput collector should help with the long
pauses.


Colin.

-Original Message-
From: Jonathan Ariel [mailto:ionat...@gmail.com] 
Sent: Friday, September 25, 2009 11:37 AM
To: solr-user@lucene.apache.org; yo...@lucidimagination.com
Subject: Re: Solr and Garbage Collection

Right, now I'm giving it 12GB of heap memory.
If I give it less (10GB) it throws the following exception:

Sep 5, 2009 7:18:32 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:361)
        at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
        at org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:352)
        at org.apache.solr.request.SimpleFacets.getFieldCacheCounts(SimpleFacets.java:267)
        at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:185)
        at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:207)
        at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:104)
        at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:70)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:169)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        at org.mortbay.jetty.Server.handle(Server.java:285)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

On Fri, Sep 25, 2009 at 10:55 AM, Yonik Seeley
yo...@lucidimagination.comwrote:

 On Fri, Sep 25, 2009 at 9:30 AM, Jonathan Ariel ionat...@gmail.com
 wrote:
  Hi to all!
  Lately my solr servers seem to stop responding once in a while. I'm using
  solr 1.3.
  Of course I'm having more traffic on the servers.
  So I logged the Garbage Collection activity to check if it's because of
  that. It seems like 11% of the time the application runs, it is stopped
  because of GC. And sometimes the GC takes up to 10 seconds!
  Is this normal? My instances run on 16GB RAM, Dual Quad Core Intel Xeon
  servers. My index is around 10GB and I'm giving the instances 10GB of
  RAM.

 Bigger heaps lead to bigger GC pauses in general.
 Do you mean that you are giving the JVM a 10GB heap?  Were you getting
 OOM exceptions with a smaller heap?

 -Yonik
 http://www.lucidimagination.com






RE: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
Give it even more memory.

Lucene FieldCache is used to store non-tokenized single-value non-boolean
(DocumentId - FieldValue) pairs, and it is used (in-full!) for instance for
sorting query results.

So if you have 100,000,000 documents with specific heavily distributed
field values (cardinality is high! size is 100 bytes!), you need
10,000,000,000 bytes for just this instance of FieldCache.

GC does not play any role. FieldCache won't be GC-collected.


-Fuad
http://www.linkedin.com/in/liferay



 -Original Message-
 From: Jonathan Ariel [mailto:ionat...@gmail.com]
 Sent: September-25-09 11:37 AM
 To: solr-user@lucene.apache.org; yo...@lucidimagination.com
 Subject: Re: Solr and Garbage Collection
 
 Right, now I'm giving it 12GB of heap memory.
 If I give it less (10GB) it throws the following exception:
 
 Sep 5, 2009 7:18:32 PM org.apache.solr.common.SolrException log
 SEVERE: java.lang.OutOfMemoryError: Java heap space
         at org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:361)
         at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
         at org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:352)
         at org.apache.solr.request.SimpleFacets.getFieldCacheCounts(SimpleFacets.java:267)
         at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:185)
         at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:207)
         at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:104)
         at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:70)
         at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:169)
         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
         at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
         at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
         at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
         at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
         at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
         at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
         at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
         at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
         at org.mortbay.jetty.Server.handle(Server.java:285)
         at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
         at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
         at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
         at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
         at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
         at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
         at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
 
 On Fri, Sep 25, 2009 at 10:55 AM, Yonik Seeley
 yo...@lucidimagination.comwrote:
 
  On Fri, Sep 25, 2009 at 9:30 AM, Jonathan Ariel ionat...@gmail.com wrote:
   Hi to all!
   Lately my solr servers seem to stop responding once in a while. I'm using
   solr 1.3.
   Of course I'm having more traffic on the servers.
   So I logged the Garbage Collection activity to check if it's because of
   that. It seems like 11% of the time the application runs, it is stopped
   because of GC. And sometimes the GC takes up to 10 seconds!
   Is this normal? My instances run on 16GB RAM, Dual Quad Core Intel Xeon
   servers. My index is around 10GB and I'm giving the instances 10GB of
   RAM.
 
  Bigger heaps lead to bigger GC pauses in general.
  Do you mean that you are giving the JVM a 10GB heap?  Were you getting
  OOM exceptions with a smaller heap?
 
  -Yonik
  http://www.lucidimagination.com
 




RE: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
 You are saying that I should give more memory than 12GB?


Yes. Look at this:

  SEVERE: java.lang.OutOfMemoryError: Java heap space
          at org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:361)



It can't find a few (!!!) contiguous bytes for .createValue(...)

It can't add a (Field Value, Document ID) pair to an array.

GC tuning won't help in this specific case...

Maybe SOLR/Lucene core developers may warm FieldCache at IndexReader
opening time in the future... to get an early OOM...


Avoiding faceting (and sorting) on such field will only postpone OOM to
unpredictable date/time...


-Fuad
http://www.linkedin.com/in/liferay





Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
It won't really - it will just keep the JVM from wasting time resizing
the heap on you. Since you know you need so much RAM anyway, no reason
not to just pin it at what you need.
Not going to help you much with GC though.

Jonathan Ariel wrote:
 BTW why making them equal will lower the frequency of GC?

 On 9/25/09, Fuad Efendi f...@efendi.ca wrote:
   
 Bigger heaps lead to bigger GC pauses in general.
   
 Opposite viewpoint:
 1sec GC happening once an hour is MUCH BETTER than 30ms GC once-per-second.

 To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)

 Use -server option.

 -server option of JVM is 'native CPU code', I remember WebLogic 7 console
 with SUN JVM 1.3 not showing any GC (just horizontal line).

 -Fuad
 http://www.linkedin.com/in/liferay




 


-- 
- Mark

http://www.lucidimagination.com





Re: Faceted Search on Dynamic Fields?

2009-09-25 Thread Yonik Seeley
On Fri, Sep 25, 2009 at 12:19 PM, Avlesh Singh avl...@gmail.com wrote:
 Faceting, as of now, can only be done on definitive field names.

To further clarify, the fields you can facet on can include those
defined by dynamic fields.  You just must specify the exact field name
when you facet.

   <dynamicField name="*&amp;STRING_NOT_ANALYZED_YES" type="string"
                 indexed="true" stored="true" multiValued="true"/>

Did you really mean for the ampersand to be in the dynamic field name?
 I'd advise against this, and it could be the source of your problems
(escaping the ampersand in your request, etc).

What is the exact facet request you are sending?


-Yonik
http://www.lucidimagination.com
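
For reference, if the ampersand stays in the field name it has to be URL-escaped as %26 in the facet request, along the lines of (host and field prefix are illustrative):

http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=color%26STRING_NOT_ANALYZED_YES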


Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
-server option of JVM is 'native CPU code', I remember WebLogic 7 console
with SUN JVM 1.3 not showing any GC (just horizontal line).

Not sure what that is all about either. -server and -client are just two
different versions of hotspot.
The -server version is optimized for long running applications - it
starts slower, and over time, it learns
about your app and makes good throughput optimizations.

The -client hotspot version gets up to speed quicker, and concentrates
more on response time than throughput.
Better for desktop apps. -server is better for long-lived server apps.
Generally.

Mark Miller wrote:
 It won't really - it will just keep the JVM from wasting time resizing
 the heap on you. Since you know you need so much RAM anyway, no reason
 not to just pin it at what you need.
 Not going to help you much with GC though.

 Jonathan Ariel wrote:
   
 BTW why making them equal will lower the frequency of GC?

 On 9/25/09, Fuad Efendi f...@efendi.ca wrote:
   
 
 Bigger heaps lead to bigger GC pauses in general.
   
 
 Opposite viewpoint:
 1sec GC happening once an hour is MUCH BETTER than 30ms GC once-per-second.

 To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)

 Use -server option.

 -server option of JVM is 'native CPU code', I remember WebLogic 7 console
 with SUN JVM 1.3 not showing any GC (just horizontal line).

 -Fuad
 http://www.linkedin.com/in/liferay




 
   


   


-- 
- Mark

http://www.lucidimagination.com





RE: Solr and Garbage Collection

2009-09-25 Thread Walter Underwood
30ms is not better or worse than 1s until you look at the service
requirements. For many applications, it is worth dedicating 10% of your
processing time to GC if that makes the worst-case pause short.

On the other hand, my experience with the IBM JVM was that the maximum query
rate was 2-3X better with the concurrent generational GC compared to any of
their other GC algorithms, so we got the best throughput along with the
shortest pauses.

Solr garbage generation (for queries) seems to have two major components:
per-request garbage and cache evictions. With a generational collector,
these two are handled by separate parts of the collector. Per-request
garbage should completely fit in the short-term heap (nursery), so that it
can be collected rapidly and returned to use for further requests. If the
nursery is too small, the per-request allocations will be made in tenured
space and sit there until the next major GC. Cache evictions are almost
always in long-term storage (tenured space) because an LRU algorithm
guarantees that the garbage will be old.

Check the growth rate of tenured space (under constant load, of course)
while increasing the size of the nursery. That rate should drop when the
nursery gets big enough, then not drop much further as it is increased more.

After that, reduce the size of tenured space until major GCs start happening
too often (a judgment call). A bigger tenured space means longer major GCs
and thus longer pauses, so you don't want it oversized by too much.

Also check the hit rates of your caches. If the hit rate is low, say 20% or
less, make that cache much bigger or set it to zero. Either one will reduce
the number of cache evictions. If you have an HTTP cache in front of Solr,
zero may be the right choice, since the HTTP cache is cherry-picking the
easily cacheable requests.

Note that a commit nearly doubles the memory required, because you have two
live Searcher objects with all their caches. Make sure you have headroom for
a commit.

If you want to test the tenured space usage, you must test with real world
queries. Those are the only way to get accurate cache eviction rates.

wunder
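
For the Sun JVM, the nursery sizing described above maps to flags like these (sizes are illustrative; -Xmn fixes the young-generation size so you can test it in isolation):

java -server -Xms10g -Xmx10g -Xmn2g -XX:+UseConcMarkSweepGC \
     -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log -jar start.jar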

-Original Message-
From: Jonathan Ariel [mailto:ionat...@gmail.com] 
Sent: Friday, September 25, 2009 9:34 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr and Garbage Collection

BTW why making them equal will lower the frequency of GC?

On 9/25/09, Fuad Efendi f...@efendi.ca wrote:
 Bigger heaps lead to bigger GC pauses in general.

 Opposite viewpoint:
 1sec GC happening once an hour is MUCH BETTER than 30ms GC
once-per-second.

 To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)

 Use -server option.

 -server option of JVM is 'native CPU code', I remember WebLogic 7 console
 with SUN JVM 1.3 not showing any GC (just horizontal line).

 -Fuad
 http://www.linkedin.com/in/liferay








Re: download pre-release nightly solr 1.4

2009-09-25 Thread michael8



markrmiller wrote:
 
 michael8 wrote:
 Hi,

 I know Solr 1.4 is going to be released any day now pending Lucene 2.9
 release.  Is there anywhere where one can download a pre-released nighly
 build of Solr 1.4 just for getting familiar with new features (e.g. field
 collapsing)?

 Thanks,
 Michael
   
  You can download nightlies here:
  http://people.apache.org/builds/lucene/solr/nightly/
 
 field collapsing won't be in 1.4 though. You have to build from svn
 after applying the patch for that.
 
 -- 
 - Mark
 
 http://www.lucidimagination.com
 
 
 
 
 

Thanks for the info Mark.  If field collapsing is a patch, can I apply the
patch against 1.3 then?  Thanks again.

Michael




Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
Walter Underwood wrote:
 30ms is not better or worse than 1s until you look at the service
 requirements. For many applications, it is worth dedicating 10% of your
 processing time to GC if that makes the worst-case pause short.

 On the other hand, my experience with the IBM JVM was that the maximum query
 rate was 2-3X better with the concurrent generational GC compared to any of
 their other GC algorithms, so we got the best throughput along with the
 shortest pauses.
   
With which collector? Since the very early JVM's, all GC is generational.
Most of the collectors (other than the Serial Collector) also work
concurrently.
By default, they are concurrent on different generations, but you can
add concurrency
to the other generation with each now too.
 Solr garbage generation (for queries) seems to have two major components:
 per-request garbage and cache evictions. With a generational collector,
 these two are handled by separate parts of the collector.
Different parts of the collector? It's a different collector depending on
the generation.
The young generation is collected with a copy collector. This is because
almost all the objects
in the young generation are likely dead, and a copy collector only needs
to visit live objects. So
its very efficient. The tenured generation uses something more along the
lines of mark and sweep or mark
and compact.
  Per-request
 garbage should completely fit in the short-term heap (nursery), so that it
 can be collected rapidly and returned to use for further requests. If the
 nursery is too small, the per-request allocations will be made in tenured
 space and sit there until the next major GC. Cache evictions are almost
 always in long-term storage (tenured space) because an LRU algorithm
 guarantees that the garbage will be old.

 Check the growth rate of tenured space (under constant load, of course)
 while increasing the size of the nursery. That rate should drop when the
 nursery gets big enough, then not drop much further as it is increased more.

 After that, reduce the size of tenured space until major GCs start happening
 too often (a judgment call). A bigger tenured space means longer major GCs
 and thus longer pauses, so you don't want it oversized by too much.
   
With the concurrent low pause collector, the goal is to avoid major
collections,
by collecting *before* the tenured space is filled. If you are
getting major collections,
you need to tune your settings - the whole point of that collector is to
avoid major
collections, and do almost all of the work while your application is not
paused. There are
still 2 brief pauses during the collection, but they should not be
significant at all.
 Also check the hit rates of your caches. If the hit rate is low, say 20% or
 less, make that cache much bigger or set it to zero. Either one will reduce
 the number of cache evictions. If you have an HTTP cache in front of Solr,
 zero may be the right choice, since the HTTP cache is cherry-picking the
 easily cacheable requests.

 Note that a commit nearly doubles the memory required, because you have two
 live Searcher objects with all their caches. Make sure you have headroom for
 a commit.

 If you want to test the tenured space usage, you must test with real world
 queries. Those are the only way to get accurate cache eviction rates.

 wunder

 -Original Message-
 From: Jonathan Ariel [mailto:ionat...@gmail.com] 
 Sent: Friday, September 25, 2009 9:34 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr and Garbage Collection

 BTW why making them equal will lower the frequency of GC?

 On 9/25/09, Fuad Efendi f...@efendi.ca wrote:
   
 Bigger heaps lead to bigger GC pauses in general.
   
 Opposite viewpoint:
 1sec GC happening once an hour is MUCH BETTER than 30ms GC
 
 once-per-second.
   
 To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)

 Use -server option.

 -server option of JVM is 'native CPU code', I remember WebLogic 7 console
 with SUN JVM 1.3 not showing any GC (just horizontal line).

 -Fuad
 http://www.linkedin.com/in/liferay




 


   


-- 
- Mark

http://www.lucidimagination.com





RE: Solr and Garbage Collection

2009-09-25 Thread Walter Underwood
As I said, I was using the IBM JVM, not the Sun JVM. The concurrent low
pause collector is only in the Sun JVM.

I just found this excellent article about the various IBM GC options for a
Lucene application with a 100GB heap:

http://www.nearinfinity.com/blogs/aaron_mccurry/tuning_the_ibm_jvm_for_large_h.html

wunder

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Friday, September 25, 2009 10:03 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr and Garbage Collection

Walter Underwood wrote:
 30ms is not better or worse than 1s until you look at the service
 requirements. For many applications, it is worth dedicating 10% of your
 processing time to GC if that makes the worst-case pause short.

 On the other hand, my experience with the IBM JVM was that the maximum
query
 rate was 2-3X better with the concurrent generational GC compared to any
of
 their other GC algorithms, so we got the best throughput along with the
 shortest pauses.
   
With which collector? Since the very early JVM's, all GC is generational.
Most of the collectors (other than the Serial Collector) also work
concurrently.
By default, they are concurrent on different generations, but you can
add concurrency
to the other generation with each now too.
 Solr garbage generation (for queries) seems to have two major components:
 per-request garbage and cache evictions. With a generational collector,
 these two are handled by separate parts of the collector.
Different parts of the collector? It's a different collector depending on
the generation.
The young generation is collected with a copy collector. This is because
almost all the objects
in the young generation are likely dead, and a copy collector only needs
to visit live objects. So
it's very efficient. The tenured generation uses something more along the
lines of mark and sweep or mark
and compact.
  Per-request
 garbage should completely fit in the short-term heap (nursery), so that it
 can be collected rapidly and returned to use for further requests. If the
 nursery is too small, the per-request allocations will be made in tenured
 space and sit there until the next major GC. Cache evictions are almost
 always in long-term storage (tenured space) because an LRU algorithm
 guarantees that the garbage will be old.

 Check the growth rate of tenured space (under constant load, of course)
 while increasing the size of the nursery. That rate should drop when the
 nursery gets big enough, then not drop much further as it is increased
more.

 After that, reduce the size of tenured space until major GCs start
happening
 too often (a judgment call). A bigger tenured space means longer major
GCs
 and thus longer pauses, so you don't want it oversized by too much.
   
With the concurrent low pause collector, the goal is to avoid major
collections,
by collecting *before* the tenured space is filled. If you are
getting major collections,
you need to tune your settings - the whole point of that collector is to
avoid major
collections, and do almost all of the work while your application is not
paused. There are
still 2 brief pauses during the collection, but they should not be
significant at all.
 Also check the hit rates of your caches. If the hit rate is low, say 20%
or
 less, make that cache much bigger or set it to zero. Either one will
reduce
 the number of cache evictions. If you have an HTTP cache in front of Solr,
 zero may be the right choice, since the HTTP cache is cherry-picking the
 easily cacheable requests.

 Note that a commit nearly doubles the memory required, because you have
two
 live Searcher objects with all their caches. Make sure you have headroom
for
 a commit.

 If you want to test the tenured space usage, you must test with real world
 queries. Those are the only way to get accurate cache eviction rates.

 wunder

 -Original Message-
 From: Jonathan Ariel [mailto:ionat...@gmail.com] 
 Sent: Friday, September 25, 2009 9:34 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr and Garbage Collection

 BTW why making them equal will lower the frequency of GC?

 On 9/25/09, Fuad Efendi f...@efendi.ca wrote:
   
 Bigger heaps lead to bigger GC pauses in general.
   
 Opposite viewpoint:
 1sec GC happening once an hour is MUCH BETTER than 30ms GC
 
 once-per-second.
   
 To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)

 Use -server option.

 -server option of JVM is 'native CPU code', I remember WebLogic 7 console
 with SUN JVM 1.3 not showing any GC (just horizontal line).

 -Fuad
 http://www.linkedin.com/in/liferay




 


   


-- 
- Mark

http://www.lucidimagination.com






Re: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
Ok. I will try with the concurrent low pause collector and let you know
the results.
On Fri, Sep 25, 2009 at 2:23 PM, Walter Underwood wun...@wunderwood.orgwrote:

 As I said, I was using the IBM JVM, not the Sun JVM. The concurrent low
 pause collector is only in the Sun JVM.

 I just found this excellent article about the various IBM GC options for a
 Lucene application with a 100GB heap:


  http://www.nearinfinity.com/blogs/aaron_mccurry/tuning_the_ibm_jvm_for_large_h.html

 wunder

 -Original Message-
 From: Mark Miller [mailto:markrmil...@gmail.com]
 Sent: Friday, September 25, 2009 10:03 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr and Garbage Collection

 Walter Underwood wrote:
  30ms is not better or worse than 1s until you look at the service
  requirements. For many applications, it is worth dedicating 10% of your
  processing time to GC if that makes the worst-case pause short.
 
  On the other hand, my experience with the IBM JVM was that the maximum
 query
  rate was 2-3X better with the concurrent generational GC compared to any
 of
  their other GC algorithms, so we got the best throughput along with the
  shortest pauses.
 
 With which collector? Since the very early JVM's, all GC is generational.
 Most of the collectors (other than the Serial Collector) also work
 concurrently.
 By default, they are concurrent on different generations, but you can
 add concurrency
 to the other generation with each now too.
  Solr garbage generation (for queries) seems to have two major components:
  per-request garbage and cache evictions. With a generational collector,
  these two are handled by separate parts of the collector.
  Different parts of the collector? It's a different collector depending on
 the generation.
 The young generation is collected with a copy collector. This is because
 almost all the objects
 in the young generation are likely dead, and a copy collector only needs
 to visit live objects. So
 its very efficient. The tenured generation uses something more along the
 lines of mark and sweep or mark
 and compact.
   Per-request
  garbage should completely fit in the short-term heap (nursery), so that
 it
  can be collected rapidly and returned to use for further requests. If the
  nursery is too small, the per-request allocations will be made in tenured
  space and sit there until the next major GC. Cache evictions are almost
  always in long-term storage (tenured space) because an LRU algorithm
  guarantees that the garbage will be old.
 
  Check the growth rate of tenured space (under constant load, of course)
  while increasing the size of the nursery. That rate should drop when the
  nursery gets big enough, then not drop much further as it is increased
 more.
 
  After that, reduce the size of tenured space until major GCs start
 happening
  too often (a judgment call). A bigger tenured space means longer major
 GCs
  and thus longer pauses, so you don't want it oversized by too much.
 
 With the concurrent low pause collector, the goal is to avoid major
 collections,
 by collecting *before* the tenured space is filled. If you you are
 getting major collections,
 you need to tune your settings - the whole point of that collector is to
 avoid major
 collections, and do almost all of the work while your application is not
 paused. There are
 still 2 brief pauses during the collection, but they should not be
 significant at all.
  Also check the hit rates of your caches. If the hit rate is low, say 20%
 or
  less, make that cache much bigger or set it to zero. Either one will
 reduce
  the number of cache evictions. If you have an HTTP cache in front of
 Solr,
  zero may be the right choice, since the HTTP cache is cherry-picking the
  easily cacheable requests.
 
  Note that a commit nearly doubles the memory required, because you have
 two
  live Searcher objects with all their caches. Make sure you have headroom
 for
  a commit.
 
  If you want to test the tenured space usage, you must test with real
 world
  queries. Those are the only way to get accurate cache eviction rates.
 
  wunder
 
  -Original Message-
  From: Jonathan Ariel [mailto:ionat...@gmail.com]
  Sent: Friday, September 25, 2009 9:34 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Solr and Garbage Collection
 
  BTW why making them equal will lower the frequency of GC?
 
  On 9/25/09, Fuad Efendi f...@efendi.ca wrote:
 
  Bigger heaps lead to bigger GC pauses in general.
 
  Opposite viewpoint:
  1sec GC happening once an hour is MUCH BETTER than 30ms GC
 
  once-per-second.
 
  To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)
 
  Use -server option.
 
  -server option of JVM is 'native CPU code', I remember WebLogic 7
 console
  with SUN JVM 1.3 not showing any GC (just horizontal line).
 
  -Fuad
  http://www.linkedin.com/in/liferay
 
 
 
 
 
 
 
 


 --
 - Mark

 http://www.lucidimagination.com







Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
My bad - later, it looks as if you're giving general advice, and that's
what I took issue with.

Any Collector that is not doing generational collection is essentially
from the dark ages and shouldn't be used.

Any Collector that doesn't have concurrent options - unless you're possibly
running a tiny app (under 100MB of RAM), or only have a single CPU - is
also dark ages, and not fit for a server environment.

I haven't kept up with IBM's JVM, but it sounds like they are well behind
Sun in GC then.

- Mark

Walter Underwood wrote:
 As I said, I was using the IBM JVM, not the Sun JVM. The concurrent low
 pause collector is only in the Sun JVM.

 I just found this excellent article about the various IBM GC options for a
 Lucene application with a 100GB heap:

 http://www.nearinfinity.com/blogs/aaron_mccurry/tuning_the_ibm_jvm_for_large
 _h.html

 wunder


8 for 1.4

2009-09-25 Thread Grant Ingersoll

Y'all,

We're down to 8 open issues:
https://issues.apache.org/jira/secure/BrowseVersion.jspa?id=12310230&versionId=12313351&showOpenIssuesOnly=true

2 are packaging related, one is dependent on the official 2.9 release  
(so should be taken care of today or tomorrow I suspect) and then we  
have a few others.


The only somewhat major ones are S-1458, S-1294 (more on this in a mo')
and S-1449.


On S-1294, the SolrJS patch, I yet again have concerns about even  
including this, given the lack of activity (from Matthias, the  
original author and others) and the fact that some in the Drupal  
community have already forked this to fix the various bugs in it  
instead of just submitting patches.  While I really like the idea of  
this library (jQuery is awesome), I have yet to see interest in the  
community to maintain it (unless you count someone forking it and  
fixing the bugs in the fork as maintenance) and I'll be upfront in  
admitting I have neither the time nor the patience to debug Javascript  
across the gazillions of browsers out there (I don't even have IE on  
my machine unless you count firing up a VM w/ XP on it) in the wild.   
Given what I know of most of the other committers here, I suspect that  
is true for others too.  At a minimum, I think S-1294 should be pushed  
to 1.5.  Next up, I think we consider pulling SolrJS from the release,  
but keeping it in trunk and officially releasing it with either 1.5 or  
1.4.1, assuming it's gotten some love in the meantime.  If by then it  
has no love, I vote we remove it and let the fork maintain it and  
point people there.


-Grant


RE: Solr and Garbage Collection

2009-09-25 Thread Walter Underwood
For batch-oriented computing, like Hadoop, the most efficient GC is probably
a non-concurrent, non-generational GC. I doubt that there are many
batch-oriented applications of Solr, though.

The rest of the advice is intended to be general and it sounds like we agree
about sizing. If the nursery is not big enough, the tenured space will be
used for allocations that have a short lifetime and that will increase the
length and/or frequency of major collections.

Cache evictions are the interesting part, because they cause a constant rate
of tenured space garbage. In most servers, you can get a big enough
nursery that major collections are very rare. That won't happen in Solr
because of cache evictions.

The IBM JVM is excellent. Their concurrent generational GC policy is
gencon.
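
(For reference, selecting that policy is a command-line switch on the IBM
JVM; a sketch, with heap and nursery sizes as placeholder values:

    java -Xgcpolicy:gencon -Xms4g -Xmx4g -Xmn512m -jar start.jar
)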

wunder

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Friday, September 25, 2009 10:31 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr and Garbage Collection

My bad - later, it looks as if you're giving general advice, and that's
what I took issue with.

Any Collector that is not doing generational collection is essentially
from the dark ages and shouldn't be used.

Any Collector that doesn't have concurrent options - unless you're possibly
running a tiny app (under 100MB of RAM), or only have a single CPU - is
also dark ages, and not fit for a server environment.

I haven't kept up with IBM's JVM, but it sounds like they are well behind
Sun in GC then.

- Mark


Solr + Jboss + Custom Transformers

2009-09-25 Thread Papiya Misra

Hi

I am trying to use a custom transformer that extends
org.apache.solr.handler.dataimport.Transformer.

I have the CustomTransformer.jar and DataImportHandler.jar in
JBOSS/server/default/lib. I have the solr.war (as is from the distro) in
the JBOSS/server/default/deploy.

org.apache.solr.handler.dataimport.EntityProcessorWrapper (line 110)
returns false for the following code
 clazz.newInstance() instanceof Transformer


This happens because the CustomTransformer uses the Transformer from a
different ClassLoader than the Solr web application.
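
(To illustrate the failure mode - a self-contained sketch with a
hypothetical jar path; it shows that the same class loaded by two unrelated
loaders yields two incompatible Class objects, which is exactly why the
instanceof check returns false:

    import java.net.URL;
    import java.net.URLClassLoader;

    public class ClassLoaderDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical path; any jar containing the class would do.
            URL[] jar = { new URL("file:/opt/jboss/server/default/lib/DataImportHandler.jar") };
            // Two loaders with no shared parent (parent = null).
            ClassLoader a = new URLClassLoader(jar, null);
            ClassLoader b = new URLClassLoader(jar, null);
            Class<?> ca = a.loadClass("org.apache.solr.handler.dataimport.Transformer");
            Class<?> cb = b.loadClass("org.apache.solr.handler.dataimport.Transformer");
            System.out.println(ca == cb);                // false: distinct Class objects
            System.out.println(ca.isAssignableFrom(cb)); // false: so instanceof fails too
        }
    }
)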

I could use the source code to create a solr.war that includes the
CustomTransformer class. Is there any other option - one that preferably
does not involve re-packaging solr.war?

Thanks
Papiya



Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
Walter Underwood wrote:
 For batch-oriented computing, like Hadoop, the most efficient GC is probably
 a non-concurrent, non-generational GC. 
Okay - for batch we somewhat agree I guess - if you can stand any length
of pausing, non concurrent can be nice, because you don't pay for thread
sync communication. Only with a small heap size though (less than 100MB
is what I've seen). You would pause the batch job while GC takes place.
If you have 8 processors, and you are pausing all of them to collect a
large heap using only 1 processor, that doesn't make much sense to me.
The thread communication pain will be far outweighed by using more
processors to do the collection faster, and not stopping the world for
your batch job for so long. Stopping your application dead in its tracks,
and then only using one of the available processors to collect a large
heap, while the rest sit idle, doesn't make much sense.

I also don't agree it ever really makes sense not to do generational
collection. What is your argument here? Generational collection is
**way** more efficient for short lived objects, which tend to be up to
98% of the objects in most applications. The only way I see that making
sense is if you have almost no short lived objects (which occurs in
what, .0001% of apps if at all?). The Sun JVM doesn't even offer a non
generational approach anymore. It's just standard GC practice.
 I doubt that there are many
 batch-oriented applications of Solr, though.

 The rest of the advice is intended to be general and it sounds like we agree
 about sizing. If the nursery is not big enough, the tenured space will be
 used for allocations that have a short lifetime and that will increase the
 length and/or frequency of major collections.
   
Yes - I wasn't arguing with every point - I was picking and choosing :)
After the heap size, the size of the young generation is the most
important factor.
 Cache evictions are the interesting part, because they cause a constant rate
 of tenured space garbage. In most many servers, you can get a big enough
 nursery that major collections are very rare. That won't happen in Solr
 because of cache evictions.

 The IBM JVM is excellent. Their concurrent generational GC policy is
 gencon.
   
Yeah, I actually know very little about the IBM JVM, so I wasn't really
commenting. But from the info I gleaned here and from a couple quick web
searches, I'm not too impressed by its GC.
 wunder


Re: FW: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
Fuad, you didn't read the thread right.

He is not having a problem with OOM. He got the OOM because he lowered
the heap to try and help GC.

He normally runs with a heap that can handle his FC.

Please re-read the thread. You are confusing the thread.

- Mark

Fuad Efendi wrote:
 Guys, thanks for the GC discussion; but the root of the problem is FieldCache
 internals.

 Not enough RAM for FieldCache will cause unpredictable OOM, and it does not
 depend on GC. How much RAM does FieldCache need in the case of 2 different
 values for a Field, 200 bytes each (Unicode), and 100M documents? What if we
 have 100 such non-tokenized fields in a schema?


 SOLR has an option to warm up caches on startup which might help
 troubleshooting.


 JRockit JVM has 'realtime' version if you are interested in predictable GC
 (without delaying 'transaction')...

 GC will happen frequently even if RAM is more than enough, if the heap is
 heavily fragmented... so have even more RAM!





-- 
- Mark

http://www.lucidimagination.com





FW: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
Guys, thanks for the GC discussion; but the root of the problem is FieldCache
internals.

Not enough RAM for FieldCache will cause unpredictable OOM, and it does not
depend on GC. How much RAM does FieldCache need in the case of 2 different
values for a Field, 200 bytes each (Unicode), and 100M documents? What if we
have 100 such non-tokenized fields in a schema?
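
(As a rough back-of-the-envelope sketch - assuming the Lucene FieldCache
layout of the time, one int ord per document plus the distinct values:
100M docs x 4 bytes = ~400MB of ords per field, before counting the values
themselves; 100 such fields would be on the order of 40GB. The exact layout
is version-dependent, so treat the numbers as illustrative.)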


SOLR has an option to warm up caches on startup which might help
troubleshooting.


JRockit JVM has 'realtime' version if you are interested in predictable GC
(without delaying 'transaction')...

GC will happen frequently even if RAM is more than enough, if the heap is
heavily fragmented... so have even more RAM!



-Original Message-
From: Fuad Efendi [mailto:f...@efendi.ca] 
Sent: September-25-09 12:17 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr and Garbage Collection

 You are saying that I should give more memory than 12GB?


Yes. Look at this:

  SEVERE: java.lang.OutOfMemoryError: Java heap space
  org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:361)



It can't find a few (!!!) contiguous bytes for .createValue(...)

It can't add a (Field Value, Document ID) pair to an array.

GC tuning won't help in this specific case...

Maybe SOLR/Lucene core developers could warm FieldCache at IndexReader
opening time in the future... to get an early OOM...


Avoiding faceting (and sorting) on such field will only postpone OOM to
unpredictable date/time...


-Fuad
http://www.linkedin.com/in/liferay







RE: FW: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
Mark,

what if a piece of code needs 10 contiguous KB to load a document field? How
are locked memory pieces optimized/moved (putting almost the whole
application on hold)?
Lowering the heap is a _bad_ idea; we will have extremely frequent GC
(compaction of live objects!!!) even if RAM is (theoretically) enough.

-Fuad


Fuad, you didn't read the thread right.
 
 He is not having a problem with OOM. He got the OOM because he lowered
 the heap to try and help GC.
 
 He normally runs with a heap that can handle his FC.
 
 Please re-read the thread. You are confusing the thread.
 
 - Mark
 


  GC will happen frequently even if RAM is more than enough, if the heap is
  heavily fragmented... so have even more RAM!
 -Fuad




Hierarchical Facet Field Prefix Not Working

2009-09-25 Thread Nasseam Elkarra

Hello all,

We are using the patch from SOLR-64
(http://issues.apache.org/jira/browse/SOLR-64) to implement hierarchical
facets for categories. We are trying to use facet.prefix to prevent all
categories from coming back. However, f.category.facet.prefix doesn't work.
Using facet.prefix works but prevents the other facets from coming back,
since it is a global option. Are per-facet options supported on hierarchical
facet fields? If not, how can I get a specific category and its children
without getting the surrounding categories?
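
(For reference, per-field facet parameters normally take this shape - a
sketch with a hypothetical host, field, and prefix value:

    http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=category&f.category.facet.prefix=Electronics
)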


Any help is much appreciated.

Thank you,

Nasseam Elkarra
http://bodukai.com/boutique/
The fastest possible shopping experience.



Re: FW: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
I'm not planning on lowering the heap. I just want to lower the time
wasted on GC, which is 11% right now. So what I'll try is changing the GC
to -XX:+UseConcMarkSweepGC
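
(For illustration, the full start command might then look something like
this - a sketch; the 10GB heap from earlier in the thread and the start.jar
launcher are assumptions:

    java -server -Xms10g -Xmx10g \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
         -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime \
         -jar start.jar
)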

On Fri, Sep 25, 2009 at 4:17 PM, Fuad Efendi f...@efendi.ca wrote:

 Mark,

 what if a piece of code needs 10 contiguous KB to load a document field? How
 are locked memory pieces optimized/moved (putting almost the whole
 application on hold)?
 Lowering the heap is a _bad_ idea; we will have extremely frequent GC
 (compaction of live objects!!!) even if RAM is (theoretically) enough.

 -Fuad







RE: FW: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
But again, GC is not just Garbage Collection as many in this thread
think... it is also memory defragmentation, which is much more costly than
collection just because it needs to move _live_objects_ somewhere (and
wait/lock till such objects get unlocked to be moved...) - obviously more
memory helps...

11% is extremely high.

 
-Fuad
http://www.linkedin.com/in/liferay


 -Original Message-
 From: Jonathan Ariel [mailto:ionat...@gmail.com]
 Sent: September-25-09 3:36 PM
 To: solr-user@lucene.apache.org
 Subject: Re: FW: Solr and Garbage Collection
 
  I'm not planning on lowering the heap. I just want to lower the time
  wasted on GC, which is 11% right now. So what I'll try is changing the GC
  to -XX:+UseConcMarkSweepGC
 




Re: FW: Solr and Garbage Collection

2009-09-25 Thread Yonik Seeley
On Fri, Sep 25, 2009 at 2:52 PM, Fuad Efendi f...@efendi.ca wrote:
 Lowering heap helps GC?

Yes.  In general, lowering the heap can help or hurt.

Hurt: if one is running very low on memory, GC will be working harder
all of the time trying to find more memory and the % of time that GC
takes can go up.

Help: if one has massive heaps, full GCs may not happen as frequently,
but when they do they can be larger and cause more of a problem.  For
many apps, a .2 second pause every minute is preferable to a 10 second
pause every hour.

And of course the other reason to lower the heap size *if* you don't
need it that big is to leave more memory for other stuff, and for the
OS itself to cache the index files.

-Yonik
http://www.lucidimagination.com


Re: FW: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
Maybe what's missing here is how I got the 11%. I just ran Solr with the
following JVM params: -XX:+PrintGCApplicationConcurrentTime and
-XX:+PrintGCApplicationStoppedTime; with those I can measure the amount of
time the application ran between collection pauses and the length of the
collection pauses, respectively.
I think that in this case the 11% is just for memory collection and not
defragmentation... but I'm not 100% sure.
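
(For illustration, summing the stopped time from such a log and dividing by
total runtime gives that percentage; a quick sketch, assuming the usual
HotSpot log phrasing and a gc.log file:

    grep 'Total time for which application threads were stopped' gc.log \
      | awk '{ sum += $9 } END { print sum " seconds stopped" }'
)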

On Fri, Sep 25, 2009 at 5:05 PM, Fuad Efendi f...@efendi.ca wrote:

 But again, GC is not just Garbage Collection as many in this thread
 think... it is also memory defragmentation, which is much more costly than
 collection just because it needs to move _live_objects_ somewhere (and
 wait/lock till such objects get unlocked to be moved...) - obviously more
 memory helps...

 11% is extremely high.

 -Fuad
 http://www.linkedin.com/in/liferay







Re: Can we point a Solr server to index directory dynamically at runtime..

2009-09-25 Thread Silent Surfer
Hi Michael,

We are storing all our data in addition to indexing it, as we need to display
those values to the user. So unfortunately we cannot go with the option
stored=false, which could have potentially solved our issue.
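
(For reference, that option is a per-field schema.xml attribute - a sketch
with a hypothetical field name:

    <field name="body" type="text" indexed="true" stored="false"/>
)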

Appreciate any other pointers/suggestions

Thanks,
sS

--- On Fri, 9/25/09, Michael solrco...@gmail.com wrote:

 From: Michael solrco...@gmail.com
 Subject: Re: Can we point a Solr server to index directory dynamically at  
 runtime..
 To: solr-user@lucene.apache.org
 Date: Friday, September 25, 2009, 2:00 PM
 Are you storing (in addition to indexing) your data?  Perhaps you could
 turn off storage on data older than 7 days (requires reindexing), thus
 losing the ability to return snippets but cutting down on your storage
 space and server count.  I've experienced a 10x decrease in space
 requirements and a large boost in speed after cutting extraneous storage
 from Solr -- the stored data is mixed in with the index data and so it
 slows down searches.
 You could also put all 200G onto one Solr instance rather than 10 for
 7 days' data, and accept that those searches will be slower.
 
 Michael
 
  On Fri, Sep 25, 2009 at 1:34 AM, Silent Surfer
  silentsurfe...@yahoo.com wrote:
 
  Hi,
 
  Thank you Michael and Chris for the response.
 
   Today, after the mail from Michael, we tested the dynamic loading of
   cores and it worked well. So we need to go with the hybrid approach of
   Multicore and Distributed searching.
 
   As per our testing, we found that a Solr instance with 20 GB of index
   (single index or spread across multiple cores) can provide better
   performance when compared to a Solr instance with, say, 40 or 50 GB of
   index (single index or index spread across cores).

   So the 200 GB of index on day 1 will be spread across 200/20 = 10 Solr
   slave instances.

   On day 2's data, 10 more Solr slave servers are required; cumulative
   Solr slave instances = 200*2/20 = 20
   ...
   On day 30's data, 10 more Solr slave servers are required; cumulative
   Solr slave instances = 200*30/20 = 300

   So with the above approach, we may need ~300 Solr slave instances, which
   becomes very unmanageable.

   But we know that most of the queries are for the past week, i.e. we
   definitely need 70 Solr slaves containing the last 7 days' worth of data
   up and running.

   Now, for the rest of the 230 Solr instances, do we need to keep them
   running for the odd query that can span across the 30 days of data
   (30*200 GB = 6 TB) and may come up only a couple of times a day?
   This linear increase of Solr servers with the retention period doesn't
   seem to be a very scalable solution.

   So we are looking for a simpler approach to handle this scenario.
 
  Appreciate any further inputs/suggestions.
 
  Regards,
  sS
 
  --- On Fri, 9/25/09, Chris Hostetter hossman_luc...@fucit.org
 wrote:
 
   From: Chris Hostetter hossman_luc...@fucit.org
   Subject: Re: Can we point a Solr server to index
 directory dynamically
  at  runtime..
   To: solr-user@lucene.apache.org
   Date: Friday, September 25, 2009, 4:04 AM
    : Using a multicore approach, you could send a "create a core named
    : 'core3weeksold' pointing to '/datadirs/3weeksold'" command to a live
    : Solr, which would spin it up on the fly.  Then you query it, and maybe
    : keep it spun up until it's not queried for 60 seconds or something,
    : then send a "remove core 'core3weeksold'" command.
    : See http://wiki.apache.org/solr/CoreAdmin#CoreAdminHandler .
  
    something that seems implicit in the question is what to do when the
    request spans all of the data ... this is where (in theory) distributed
    searching could help you out.

    index each day's worth of data into its own core; that makes it really
    easy to expire the old data (just UNLOAD and delete an entire core once
    it's more than 30 days old). if your user is only searching current data
    then your app can directly query the core containing the most current
    data -- but if they want to query the last week, or last two weeks'
    worth of data, you do a distributed request for all of the shards needed
    to search the appropriate amount of data.

    Between the ALIAS and SWAP commands on the CoreAdmin screen it should
    be pretty easy to have cores with names like today, 1dayold, 2dayold so
    that your app can configure simple shard params for all the permutations
    you'll need to query.
  
  
   -Hoss
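
    (For reference, such a distributed request is just the standard shards
    parameter - a sketch with hypothetical hosts and core names:

        http://solr1:8983/solr/today/select?q=foo&shards=solr1:8983/solr/today,solr1:8983/solr/1dayold,solr1:8983/solr/2dayold
    )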
  
  
 
 
 
 
 
 
 






Re: FW: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
When we talk about Collectors, we are not just talking about
collecting - whatever that means. There isn't really a collecting
phase - the whole algorithm is garbage collecting - hence calling the
different implementations collectors.

Usually, fragmentation is dealt with using a mark-compact collector (or
IBM has used a mark-sweep-compact collector).
Copying collectors are not only super efficient at collecting young
spaces, but they are also great for fragmentation - when you copy
everything to the new space, you can remove any fragmentation. At the
cost of double the space requirements though.

So mark-compact is a compromise. First you mark what's reachable, then
everything that's marked is copied/compacted to the bottom of the heap.
It's all part of a collection though.

Jonathan Ariel wrote:
 Maybe what's missing here is how I got the 11%. I just ran Solr with the
 following JVM params: -XX:+PrintGCApplicationConcurrentTime and
 -XX:+PrintGCApplicationStoppedTime; with those I can measure the amount of
 time the application ran between collection pauses and the length of the
 collection pauses, respectively.
 I think that in this case the 11% is just for memory collection and not
 defragmentation... but I'm not 100% sure.



-- 
- Mark

http://www.lucidimagination.com





solr home

2009-09-25 Thread Park, Michael
I already have a handful of Solr instances running.  However, I'm
trying to install Solr (1.4) on a new Linux server with Tomcat using a
context file (the same way I usually do):

 

<Context docBase="/opt/local/solr/apache-solr-1.4.war" debug="0"
crossContext="true">

   <Environment name="solr/home" type="java.lang.String"
value="/opt/local/solr/fedora_solr/" override="true"/>

</Context>

 

However it throws an exception due to the following:

SEVERE: Could not start SOLR. Check solr/home property

java.lang.RuntimeException: Can't find resource 'solrconfig.xml' in
classpath or 'solr/conf/', cwd=/opt/local/solr/fedora_solr

        at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:198)

        at org.apache.solr.core.SolrResourceLoader.openConfig(SolrResourceLoader.java:166)
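
(The loader is looking for conf/solrconfig.xml under solr/home, so the
expected layout - sketched - would be:

    /opt/local/solr/fedora_solr/
        conf/
            solrconfig.xml
            schema.xml
)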

 

Any ideas why this is happening?

 

Thanks, Mike



Re: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
Ok. I'll first change the GC and see if the time spent decreases. Then
I'll try increasing the heap as Fuad recommends.

On 9/25/09, Mark Miller markrmil...@gmail.com wrote:
 When we talk about Collectors, we are not just talking about
 collecting - whatever that means. There isn't really a collecting
 phase - the whole algorithm is garbage collecting - hence calling the
 different implementations collectors.

 Usually, fragmentation is dealt with using a mark-compact collector (or
 IBM has used a mark-sweep-compact collector).
 Copying collectors are not only super efficient at collecting young
 spaces, but they are also great for fragmentation - when you copy
 everything to the new space, you can remove any fragmentation. At the
 cost of double the space requirements though.

 So mark-compact is a compromise. First you mark what's reachable, then
 everything that's marked is copied/compacted to the bottom of the heap.
 It's all part of a collection though.







Re: FW: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
 or IBM has used a mark-sweep-compact collector

Never mind - Sun's is also sometimes referred to as mark-sweep-compact.
I've just seen it referred to as mark-compact before as well. In either
case though, without some sort of sweep phase, there is no reclamation
of memory :)

It's interesting though - in the days of the early JVMs Sun talked more
about compaction - but if you look at their recent info, they don't even
mention it, or give you params to mess with it. They just talk about
the mark and the sweep phases.

IBM is much more open about a compaction phase, and not only do they
give controls to tune it, they let you turn it off completely.

Not sure what Sun is doing with compaction these days - or if they just
work with fragmentation avoidance techniques instead - haven't seen any
info on it.


-- 
- Mark

http://www.lucidimagination.com





Re: Solr and Garbage Collection

2009-09-25 Thread Grant Ingersoll


On Sep 25, 2009, at 9:30 AM, Jonathan Ariel wrote:


Hi to all!
Lately my solr servers seem to stop responding once in a while. I'm using
solr 1.3.
Of course I'm having more traffic on the servers.
So I logged the Garbage Collection activity to check if it's because of
that. It seems like 11% of the time the application runs, it is stopped
because of GC. And sometimes the GC takes up to 10 seconds!
Is it normal? My instances run on 16GB RAM, Dual Quad Core Intel Xeon
servers. My index is around 10GB and I'm giving the instances 10GB of
RAM.

How can I check which GC is being used? If I'm right, JVM
Ergonomics should use the Throughput GC, but I'm not 100% sure. Do you have
any recommendation on this?



As I said in Eteve's thread on JVM settings, some extra time spent on  
application design/debugging will save a whole lot of headache in  
Garbage Collection and trying to tune the gazillion different options  
available.  Ask yourself:  What is on the heap and does it need to be  
there?  For instance, do you, if you have them, really need sortable  
ints?   If your servers seem to come to a stop, I'm going to bet you  
have major collections going on.  Major collections in a production  
system are very bad.  They tend to happen right after commits in  
poorly tuned systems, but can also happen in other places if you let  
things build up due to really large heaps and/or things like really  
large cache settings.  I would pull up jConsole and have a look at  
what is happening when the pauses occur.  Is it a major collection?   
If so, then hook up a heap analyzer or a profiler and see what is on  
the heap around those times.  Then have a look at your schema/config,  
etc. and see if there are things that are memory intensive (sorting,  
faceting, excessively large filter caches).


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



RE: FW: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
 Usually, fragmentation is dealt with using a mark-compact collector (or
 IBM has used a mark-sweep-compact collector).
 Copying collectors are not only super efficient at collecting young
 spaces, but they are also great for fragmentation - when you copy
 everything to the new space, you can remove any fragmentation. At the
 cost of double the space requirements though.


So if the memory size is optimized (application-specific!) no object
copy will ever happen, although it is server-loading-specific too
(application-usage-specific; what do they do most frequently?)
- just statistics; you need to monitor the JVM and make a decision.

A few years ago I had a hard time explaining to a client that a byte array
should be Base64 encoded instead of just <byte>123</byte>... instead of GC
tuning...

SOLR uses XML; try to upload a big XML file - each Element instance needs at
least 100 bytes... try to create an array of 20M Elements (the parser will
do!)... so any GC tuning is application-usage-specific too... RAM allocation
and GC tuning are usage-specific, not SOLR-specific...




Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
Jonathan Ariel wrote:
 How can I check which is the GC that it is being used? If I'm right JVM
 Ergonomics should use the Throughput GC, but I'm not 100% sure. Do you have
 any recommendation on this?

   
Just to straighten out this one too - Ergonomics doesn't use throughput
- throughput is the collector that allows Ergonomics ;)

And throughput is the default as long as your machine is detected as
server class.

But throughput is not great with large tenured spaces out of the box. It
only parallelizes the new space collection. You have to turn on an
option to get parallel tenured collection as well - which is essential
to scale to large heap sizes.
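
(Presumably that option is -XX:+UseParallelOldGC on the Sun JVM of the
era - a sketch:

    java -server -XX:+UseParallelGC -XX:+UseParallelOldGC -Xms10g -Xmx10g -jar start.jar
)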

-- 
- Mark

http://www.lucidimagination.com





Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
Mark Miller wrote:
 Jonathan Ariel wrote:
   
 How can I check which is the GC that it is being used? If I'm right JVM
 Ergonomics should use the Throughput GC, but I'm not 100% sure. Do you have
 any recommendation on this?

   
 
 Just to straighten out this one too - Ergonomics doesn't use throughput
 - throughput is the collector that allows Ergonomics ;)

 And throughput is the default as long as your machine is detected as
 server class.

 But throughput is not great with large tenured spaces out of the box. It
 only parallelizes the new space collection. You have to turn on an
 option to get parallel tenured collection as well - which is essential
 to scale to large heap sizes.

   
hmm - I'm not being totally accurate there - ergonomics is what detects
server and so makes throughput the default collector for a server
machine. But much of the GC ergonomics support only works with the
throughput collector. Kind of chicken and egg :)

-- 
- Mark

http://www.lucidimagination.com





Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
That's a good point too - if you can reduce your need for such a large
heap, by all means, do so.

However, considering you already need at least 10GB or you get an OOM, you
have a long way to go with that approach. Good luck :)

How many docs do you have? I'm guessing it's mostly FieldCache-type
stuff, and that's the type of thing you can't really sidestep, unless
you give up the functionality that's using it.



-- 
- Mark

http://www.lucidimagination.com





Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
One more point and I'll stop - I've hit my email quota for the day ;)

While it's a pain to have to juggle GC params and tune - when you require
a heap that's more than a gig or two, I personally believe it's essential
to do so for good performance. The (default settings / ergonomics with
throughput) just don't cut it. Sad fact of life :) Luckily, you don't
generally have to do that much to get things nice - the number of
options is not that staggering, and you don't usually need to get into
most of them. Choosing the right collector, and tweaking a setting or
two, can often be enough.

The most important thing to do with a large heap and the throughput
collector is to turn on parallel tenured collection. I've said it before,
but it really is key. At least if you have more than a processor or two -
which, for your sake, I hope you do :)

- Mark

Mark Miller wrote:
 Thats a good point too - if you can reduce your need for such a large
 heap, by all means, do so.

 However, considering you already need at least 10GB or you get OOM, you
 have a long way to go with that approach. Good luck :)

 How many docs do you have ? I'm guessing its mostly FieldCache type
 stuff, and thats the type of thing you can't really side step, unless
 you give up the functionality thats using it.

 Grant Ingersoll wrote:
   
 On Sep 25, 2009, at 9:30 AM, Jonathan Ariel wrote:

 
 Hi to all!
 Lately my solr servers seem to stop responding once in a while. I'm
 using
 solr 1.3.
 Of course I'm having more traffic on the servers.
 So I logged the Garbage Collection activity to check if it's because of
 that. It seems like 11% of the time the application runs, it is stopped
 because of GC. And sometimes the GC takes up to 10 seconds!
 Is this normal? My instances run on 16GB RAM, dual quad-core Intel Xeon
 servers. My index is around 10GB and I'm giving the instances 10GB of
 RAM.

 How can I check which GC is being used? If I'm right, JVM Ergonomics
 should use the Throughput GC, but I'm not 100% sure. Do you have any
 recommendations on this?
   
 As I said in Eteve's thread on JVM settings, some extra time spent on
 application design/debugging will save a whole lot of headache in
 Garbage Collection and trying to tune the gazillion different options
 available.  Ask yourself:  What is on the heap and does it need to be
 there?  For instance, do you, if you have them, really need sortable
 ints?   If your servers seem to come to a stop, I'm going to bet you
 have major collections going on.  Major collections in a production
 system are very bad.  They tend to happen right after commits in
 poorly tuned systems, but can also happen in other places if you let
 things build up due to really large heaps and/or things like really
 large cache settings.  I would pull up jConsole and have a look at
 what is happening when the pauses occur.  Is it a major collection? 
 If so, then hook up a heap analyzer or a profiler and see what is on
 the heap around those times.  Then have a look at your schema/config,
 etc. and see if there are things that are memory intensive (sorting,
 faceting, excessively large filter caches).

 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
 using Solr/Lucene:
 http://www.lucidimagination.com/search

 


   


-- 
- Mark

http://www.lucidimagination.com





Re: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
I have around 8M documents.
I set up my server to use a different collector, and the GC overhead seems
to have decreased from 11% to 4%. Of course, I need to wait a bit longer,
since the log is only an hour old, but it looks much better now.
I will tell you the results on Monday :)
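
(For anyone curious, the switch itself is just a startup flag. A sketch
with the Sun JVM's concurrent mark-sweep collector, which is one common
low-pause choice - not necessarily the exact flags I used:

  java -Xmx10g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -jar start.jar
)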

On Fri, Sep 25, 2009 at 6:07 PM, Mark Miller markrmil...@gmail.com wrote:

 That's a good point too - if you can reduce your need for such a large
 heap, by all means, do so.

 However, considering you already need at least 10GB or you get OOM, you
 have a long way to go with that approach. Good luck :)

 How many docs do you have? I'm guessing it's mostly FieldCache type
 stuff, and that's the type of thing you can't really sidestep, unless
 you give up the functionality that's using it.

 Grant Ingersoll wrote:
 
  On Sep 25, 2009, at 9:30 AM, Jonathan Ariel wrote:
 
  Hi to all!
  Lately my solr servers seem to stop responding once in a while. I'm
  using
  solr 1.3.
  Of course I'm having more traffic on the servers.
  So I logged the Garbage Collection activity to check if it's because of
  that. It seems like 11% of the time the application runs, it is stopped
  because of GC. And sometimes the GC takes up to 10 seconds!
  Is this normal? My instances run on 16GB RAM, dual quad-core Intel Xeon
  servers. My index is around 10GB and I'm giving the instances 10GB of
  RAM.
 
  How can I check which GC is being used? If I'm right, JVM Ergonomics
  should use the Throughput GC, but I'm not 100% sure. Do you have any
  recommendations on this?
 
 
  As I said in Eteve's thread on JVM settings, some extra time spent on
  application design/debugging will save a whole lot of headache in
  Garbage Collection and trying to tune the gazillion different options
  available.  Ask yourself:  What is on the heap and does it need to be
  there?  For instance, do you, if you have them, really need sortable
  ints?   If your servers seem to come to a stop, I'm going to bet you
  have major collections going on.  Major collections in a production
  system are very bad.  They tend to happen right after commits in
  poorly tuned systems, but can also happen in other places if you let
  things build up due to really large heaps and/or things like really
  large cache settings.  I would pull up jConsole and have a look at
  what is happening when the pauses occur.  Is it a major collection?
  If so, then hook up a heap analyzer or a profiler and see what is on
  the heap around those times.  Then have a look at your schema/config,
  etc. and see if there are things that are memory intensive (sorting,
  faceting, excessively large filter caches).
 
  --
  Grant Ingersoll
  http://www.lucidimagination.com/
 
  Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
  using Solr/Lucene:
  http://www.lucidimagination.com/search
 


 --
 - Mark

 http://www.lucidimagination.com






RE: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
Sorry for the off-topic:
Create a dummy Hello, World! JSP, use Tomcat, execute load-stress
simulator(s) from separate machine(s), and measure... don't forget to
allocate the necessary thread pools in Tomcat (if you have to)...
Although such a JSP doesn't use any memory, you will see how easily one can
reach 5000 TPS (or 'virtually' 5 concurrent users) on modern quad-cores
by simply allocating more memory (...GB) and more Tomcat threads. There is
a threshold too... repeat it with HTTPD workers (and threads), same result,
although it doesn't use any GC. More memory - more threads - more
keep-alives per TCP connection...

However, 'theoretically' you need only 64MB for Hello, World :)))
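
A sketch of the Tomcat side of that experiment (server.xml; the numbers
are only illustrative - tune them to your box):

  <Connector port="8080" protocol="HTTP/1.1"
             maxThreads="500" acceptCount="200"
             connectionTimeout="20000"/>

maxThreads is the pool your load generator actually exercises; acceptCount
is the backlog once all those threads are busy.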





Re: problem with HTMLStripStandardTokenizerFactory

2009-09-25 Thread Yonik Seeley
Can you give a small test file that demonstrates the problem?
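
For instance, something as small as this (a made-up minimal case, modeled
on Word's HTML output) ought to show it - after indexing, a search for
'mso' should only match text that sat inside the style element:

  <html>
    <head>
      <style type="text/css">
        p.MsoNormal { mso-style-parent:""; mso-font-charset:0; }
      </style>
    </head>
    <body><p class="MsoNormal">some visible body text</p></body>
  </html>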

-Yonik
http://www.lucidimagination.com



On Fri, Sep 25, 2009 at 5:34 AM, Kundig, Andreas
andreas.kun...@wipo.int wrote:
 Hello

 I can't get HTMLStripStandardTokenizerFactory to remove the content of the
 style tag, as the documentation says it should.

 A search for 'mso' returns a document where the search term only appears in
 the style tag (it's a Word document saved as HTML). Here is the highlight
 returned by Solr (by the way, the wrong word is highlighted):

 vetica;&#13;\n\tpanose-1:2 11 5 4 2 2 2 2 2
 4;<em>&#13;</em>\n\tmso-font-charset:0;<em>&#13;</em>\n\tmso-generic-font-family:swiss;<em>&#13;</em>

 I am using Solr 1.3. Here is how I configured the tokenizer in schema.xml:

    <fieldType name="text_en" class="solr.TextField"
      positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="1" catenateNumbers="1"
          catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
          protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="0" catenateNumbers="0"
          catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
          protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

 Am I doing something wrong?

 thank you
 Andréas Kündig




Problem changing the default MergePolicy/Scheduler

2009-09-25 Thread Jibo John

Hello,

It looks like Solr is not allowing me to change the default
MergePolicy/MergeScheduler classes.

Even if I change the defaults (LogByteSizeMergePolicy and
ConcurrentMergeScheduler) defined in solrconfig.xml to different ones
(LogDocMergePolicy and SerialMergeScheduler), my profiler shows the
default classes are still being loaded.


Also, if I use the default LogByteSizeMergePolicy, I can't seem to
override 'calibrateSizeByDeletes' to 'true' in solrconfig.xml using the
new syntax that was introduced this week (SOLR-1447).
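
For reference, here is roughly what I am putting in solrconfig.xml - this
is my reading of the SOLR-1447 syntax, so the nesting may well be where I
am going wrong:

  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
    <bool name="calibrateSizeByDeletes">true</bool>
  </mergePolicy>
  <mergeScheduler class="org.apache.lucene.index.SerialMergeScheduler"/>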


I'm using the version checked out from trunk yesterday.

Any pointers will be helpful.

Thanks,
-Jibo