One of three cores is missing userData and lastModified fields from /admin/cores

2015-03-24 Thread Aaron Daubman
Hey All,

On a Solr server running 4.10.2 with three cores, two return the expected
info from /solr/admin/cores?wt=json but the third is missing userData and
lastModified.

The first (artists) and third (tracks) cores from the linked screenshot are
the ones I care about. Unfortunately, the third (tracks) is the one missing
lastModified.

As far as I can see, that comes from:
https://github.com/apache/lucene-solr/blob/lucene_solr_4_10_2/solr/core/src/java/org/apache/solr/handler/admin/LukeRequestHandler.java#L568

I can't trace back to see what would possibly cause getUserData() to return
an empty Object, but that appears to be what is happening.

For these servers, indexes that are pre-optimized are shipped over to the
server and the server is re-started... nothing is actually ever committed
on these live servers. This should behave exactly the same for artists and
tracks, even though tracks is the one always missing lastModified.

Here's the output in img format, I'll paste the full JSON[1] below:
http://monosnap.com/image/XMyAfk5z3AvHgY39m0qAKAGlc3RACI.png

I'd like to be able to give clients access to the lastModified time
for both indices so that they can see how old/stale the data they are
getting results back from is...

...alternately, is there any other way to expose easily how old (last
modified time?) the index for a core is?
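For what it's worth, lastModified in that core-status output appears to be
derived from the commitTimeMSec entry in the commit userData (the two match
for the cores that have it in the JSON below), so another way to check what a
given index actually carries is to read the commit userData directly with
Lucene. A minimal sketch (Lucene 4.x API; the index path is just an example):
---snip---
import java.io.File;
import java.util.Map;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.FSDirectory;

public class IndexCommitAge {
  public static void main(String[] args) throws Exception {
    // Example path - point this at the core's index directory
    try (FSDirectory dir = FSDirectory.open(new File("/opt/solr/search/solr/tracks/index"));
         DirectoryReader reader = DirectoryReader.open(dir)) {
      Map<String, String> userData = reader.getIndexCommit().getUserData();
      String commitTime = userData.get("commitTimeMSec"); // written by Solr at commit time
      System.out.println(commitTime == null
          ? "no commitTimeMSec in commit userData"
          : "last commit: " + new java.util.Date(Long.parseLong(commitTime)));
    }
  }
}
---snip---
An index whose last commit never wrote that key will simply report an empty
map here.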

Thanks,
  Aaron

1: Full JSON
---snip---
{
  responseHeader: {
status: 0,
QTime: 10
  },
  defaultCoreName: collection1,
  initFailures: {
  },
  status: {
artists: {
  name: artists,
  isDefaultCore: false,
  instanceDir: /opt/solr/search/solr/artists/,
  dataDir: /opt/solr/search/solr/artists/,
  config: solrconfig.xml,
  schema: schema.xml,
  startTime: 2015-03-24T14:12:23.667Z,
  uptime: 7335696,
  index: {
numDocs: 3360380,
maxDoc: 3360380,
deletedDocs: 0,
indexHeapUsageBytes: 63366952,
version: 421,
segmentCount: 1,
current: true,
hasDeletions: false,
directory:
org.apache.lucene.store.MMapDirectory:MMapDirectory@/opt/solr/search/solr/artists/index
lockFactory=NativeFSLockFactory@/opt/solr/search/solr/artists/index,
userData: {
  commitTimeMSec: 1427133705908
},
lastModified: 2015-03-23T18:01:45.908Z,
sizeInBytes: 25341305528,
size: 23.6 GB
  }
},
banana-int: {
  name: banana-int,
  isDefaultCore: false,
  instanceDir: /opt/solr/search/solr/banana-int/,
  dataDir: /opt/solr/search/solr/banana-int/data/,
  config: solrconfig.xml,
  schema: schema.xml,
  startTime: 2015-03-24T14:12:22.895Z,
  uptime: 7336472,
  index: {
numDocs: 3,
maxDoc: 3,
deletedDocs: 0,
indexHeapUsageBytes: 17448,
version: 135,
segmentCount: 3,
current: true,
hasDeletions: false,
directory:
org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/opt/solr/search/solr/banana-int/data/index
lockFactory=NativeFSLockFactory@/opt/solr/search/solr/banana-int/data/index;
maxCacheMB=48.0 maxMergeSizeMB=4.0),
userData: {
  commitTimeMSec: 1412796723183
},
lastModified: 2014-10-08T19:32:03.183Z,
sizeInBytes: 16196,
size: 15.82 KB
  }
},
tracks: {
  name: tracks,
  isDefaultCore: false,
  instanceDir: /opt/solr/search/solr/tracks/,
  dataDir: /opt/solr/search/solr/tracks/,
  config: solrconfig.xml,
  schema: schema.xml,
  startTime: 2015-03-24T14:12:23.656Z,
  uptime: 7335713,
  index: {
numDocs: 53268126,
maxDoc: 53268126,
deletedDocs: 0,
indexHeapUsageBytes: 517650552,
version: 100,
segmentCount: 1,
current: true,
hasDeletions: false,
directory:
org.apache.lucene.store.MMapDirectory:MMapDirectory@/opt/solr/search/solr/tracks/index
lockFactory=NativeFSLockFactory@/opt/solr/search/solr/tracks/index,
userData: {
},
sizeInBytes: 122892905007,
size: 114.45 GB
  }
}
  }
}
---snip---


Re: Understanding fieldNorm differences between 3.6.1 and 4.9 solrs

2014-07-02 Thread Aaron Daubman
Wow - so apparently I have terrible recall and should re-read this thread I
started on the same topic when upgrading from 1.4 to 3.6 and hit a very
similar fieldNorm issue almost two years ago! =)
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201207.mbox/%3CCALyTvnpwZMj4zxPbK0abVpnyRJny=qauijdqmj7e3zgnv7u...@mail.gmail.com%3E

In the meantime, I'm still happy to hear any new thoughts / suggestions on
making similarity consistent across upgrades.

Thanks again,
   Aaron


On Tue, Jul 1, 2014 at 11:14 PM, Aaron Daubman daub...@gmail.com wrote:

 In trying to determine some subtle scoring differences (causing
 occasionally significant ordering differences) among search results, I
 wrote a parser to normalize debug.explain.structured JSON output.

 It appears that every score that is different comes down to a difference
 in fieldNorm, where the 3.6.1 solr is using  0.109375 as the fieldNorm, and
 the 4.9 solr is using 0.125 as the fieldNorm. [1]

 What would be causing the different versions to use different field norms
 (and rather infrequently, as the majority of scores are identical as
 desired)?

 Thanks,
   Aaron

 [1] Here's a snippet of the diff (of the output from my
 debug.explain.structured normalizer) for one such difference (apologies for
 the width):

 (differing lines show the 3.6.1 value | the 4.9 value; lines with a single
 value are identical in both)

 06808040cd523a296abaf26025148c85: {
   _value: 0.839616605 | 0.854748135
   description: product of:
   details: [
     {
       _value: 2.623802 | 2.67108801
       description: sum of:
       details: [
         {
           _value: 0.0644619693 | 0.0736708307
           description: weight(t_style:alternative ...
           details: [
             {
               _value: 0.0629802298
               description: queryWeight
               details: [
                 { _value: 4.18500798, description: idf(137871) }
               ]
             },
             {
               _value: 1.02352709 | 1.1697453
               description: fieldWeight
               details: [
                 { _value: 2.23606799, description: tf(freq=5) },
                 { _value: 4.18500798, description: idf(137871) },
                 { _value: 0.109375 | 0.125, description: fieldNorm }
               ]
             }
           ]
         }
       ]
     }
   ]
 }



Understanding fieldNorm differences between 3.6.1 and 4.9 solrs

2014-07-01 Thread Aaron Daubman
In trying to determine some subtle scoring differences (causing
occasionally significant ordering differences) among search results, I
wrote a parser to normalize debug.explain.structured JSON output.

It appears that every score that is different comes down to a difference in
fieldNorm, where the 3.6.1 solr is using  0.109375 as the fieldNorm, and
the 4.9 solr is using 0.125 as the fieldNorm. [1]

What would be causing the different versions to use different field norms
(and rather infrequently, as the majority of scores are identical as
desired)?
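For what it's worth, with the default similarity the fieldNorm is
1/sqrt(numTerms) squeezed into a single byte, so a small change in how many
tokens the analyzer emits for a field can snap the stored norm into a
different quantized bucket. A hedged sketch of that quantization (not taken
from either index; it just reuses Lucene's SmallFloat encoding that the
default similarity uses):
---snip---
import org.apache.lucene.util.SmallFloat;

public class NormQuantization {
  // Encode 1/sqrt(numTerms) into a single byte and decode it again,
  // the way the default similarity stores and reads fieldNorm.
  static float quantizedNorm(int numTerms) {
    float raw = (float) (1.0 / Math.sqrt(numTerms));
    return SmallFloat.byte315ToFloat(SmallFloat.floatToByte315(raw));
  }

  public static void main(String[] args) {
    // A difference of a few tokens (e.g. from analyzer changes between
    // versions) is enough to land in a different bucket such as
    // 0.125 vs 0.109375.
    for (int terms = 60; terms <= 90; terms += 5) {
      System.out.printf("numTerms=%d -> fieldNorm=%s%n", terms, quantizedNorm(terms));
    }
  }
}
---snip---
Index-time field boosts get folded into the same byte before encoding, so
boost differences surface through fieldNorm the same way.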

Thanks,
  Aaron

[1] Here's a snippet of the diff (of the output from my
debug.explain.structured normalizer) for one such difference (apologies for
the width):

(differing lines show the 3.6.1 value | the 4.9 value; lines with a single
value are identical in both)

06808040cd523a296abaf26025148c85: {
  _value: 0.839616605 | 0.854748135
  description: product of:
  details: [
    {
      _value: 2.623802 | 2.67108801
      description: sum of:
      details: [
        {
          _value: 0.0644619693 | 0.0736708307
          description: weight(t_style:alternative ...
          details: [
            {
              _value: 0.0629802298
              description: queryWeight
              details: [
                { _value: 4.18500798, description: idf(137871) }
              ]
            },
            {
              _value: 1.02352709 | 1.1697453
              description: fieldWeight
              details: [
                { _value: 2.23606799, description: tf(freq=5) },
                { _value: 4.18500798, description: idf(137871) },
                { _value: 0.109375 | 0.125, description: fieldNorm }
              ]
            }
          ]
        }
      ]
    }
  ]
}


Re: Range Queries performing differently on SortableIntField vs TrieField of type integer

2012-12-04 Thread Aaron Daubman
Hi Upayavira,

One small question - did you re-index in-between? The index structure
 will be different for each.


Yes, the Solr 1.4.1 (working) instance was built using the original schema
and that solr version.
The Solr 3.6.1 (not working) instance was re-built using the new schema and
Solr 3.6.1...

Thanks,
  Aaron


Re: Range Queries performing differently on SortableIntField vs TrieField of type integer

2012-12-04 Thread Aaron Daubman
I forgot a possibly important piece... Given the different Solr versions,
the schema version (and its related different defaults) is also a change:

Solr 1.4.1 Has:
<schema name="ourSchema" version="1.1">

Solr 3.6.1 Has:
<schema name="ourSchema" version="1.5">


 Solr 1.4.1 Relevant Schema Parts - Working as desired:

  <fieldType name="sint" class="solr.SortableIntField"
             sortMissingLast="true" omitNorms="true"/>
  ...
  <field name="i_yearStartSort" type="sint" indexed="true" stored="false"
         required="false" multiValued="true"/>
  <field name="i_yearStopSort" type="sint" indexed="true" stored="false"
         required="false" multiValued="true"/>

 Solr 3.6.1 Relevant Schema Parts - Not working as expected:

  <fieldType name="tint" class="solr.TrieField" type="integer"
             precisionStep="4" sortMissingLast="true" positionIncrementGap="0"
             omitNorms="true"/>
  ...
  <field name="i_yearStartSort" type="tint" indexed="true" stored="false"
         required="false" multiValued="false"/>
  <field name="i_yearStopSort" type="tint" indexed="true" stored="false"
         required="false" multiValued="false"/>


Re: Cannot run Solr4 from Intellij Idea

2012-12-04 Thread Aaron Daubman
Interestingly, I have run into this same (or very similar) issue when
attempting to run embedded solr. All of the solr.* classes that were
recently moved to lucene would not work with the solr.* shorthand - I had
to replace them with their fully qualified class names. As you found, these
shorthands in the same schema worked fine from within solr proper (webapp).

Is there a workaround for this? (It would be great to have a unified schema
between embedded and webapp solr instances)
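For context, the embedded side is just the stock SolrJ bootstrap; a minimal
sketch of that setup (Solr 3.x/4.0-era API, with placeholder solr home and
core name). The solr.* shorthand in schema.xml is resolved inside this JVM,
so the lucene analysis jars have to be on this process's classpath just as
they are inside the webapp:
---snip---
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.core.CoreContainer;

public class EmbeddedSolrExample {
  public static void main(String[] args) throws Exception {
    // Placeholder solr home containing solr.xml and the core's conf/ dir
    System.setProperty("solr.solr.home", "/path/to/solr/home");

    // Same bootstrap the webapp's dispatch filter performs internally
    CoreContainer.Initializer initializer = new CoreContainer.Initializer();
    CoreContainer coreContainer = initializer.initialize();

    EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, "mycore");
    System.out.println(server.ping().getStatus());

    coreContainer.shutdown();
  }
}
---snip---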

Thanks,
 Aaron


On Tue, Dec 4, 2012 at 7:37 AM, Artyom ice...@mail.ru wrote:

 After 2 days I have figured out how to open Solr 4 in IntelliJ IDEA 11.1.4
 on
 Tomcat 7. IntelliJ IDEA finds webapp/web/WEB-INF/web.xml and offers to make
 a facet from it and adds this facet to the parent module, from which an
 artifact can be created.

 The problem is that Solr cannot run properly. I get this message:

 SEVERE: Unable to create core: mycore
 org.apache.solr.common.SolrException: Plugin init failure for [schema.xml]
 fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer:
 Error loading class 'solr.StandardTokenizerFactory'
 at

 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
 at
 org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:369)
 at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:113)
 at
 org.apache.solr.core.CoreContainer.create(CoreContainer.java:846)
 at org.apache.solr.core.CoreContainer.load(CoreContainer.java:534)
 at org.apache.solr.core.CoreContainer.load(CoreContainer.java:356)
 at

 org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:308)
 at

 org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:107)
 at

 org.apache.catalina.core.ApplicationFilterConfig.initFilter(ApplicationFilterConfig.java:277)
 at

 org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:258)
 at

 org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:382)
 at

 org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:103)
 at

 org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4650)
 at

 org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5306)
 at
 org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
 at

 org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)
 at
 org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)
 at
 org.apache.catalina.core.StandardHost.addChild(StandardHost.java:618)
 at

 org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:650)
 at

 org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1582)
 at
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: org.apache.solr.common.SolrException: Plugin init failure for
 [schema.xml] analyzer/tokenizer: Error loading class
 'solr.StandardTokenizerFactory'
 at

 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
 at

 org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:344)
 at

 org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
 at

 org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
 at

 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
 ... 25 more
 Caused by: org.apache.solr.common.SolrException: Error loading class
 'solr.StandardTokenizerFactory'
 at

 org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:436)
 at

 org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:457)
 at

 org.apache.solr.util.plugin.AbstractPluginLoader.create(AbstractPluginLoader.java:89)
 at

 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
 ... 29 more
 Caused by: java.lang.ClassNotFoundException: solr.StandardTokenizerFactory
 at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
 at
 

Preventing accepting queries while custom QueryComponent starts up?

2012-11-08 Thread Aaron Daubman
Greetings,

I have several custom QueryComponents that have high one-time startup costs
(hashing things in the index, caching things from a RDBMS, etc...)

Is there a way to prevent solr from accepting connections before all
QueryComponents are ready?

Especially, since many of our instances are load-balanced (and
added-in/removed automatically based on admin/ping responses), preventing
ping from answering prior to all custom QueryComponents being ready would
be ideal...

Thanks,
 Aaron


Re: Preventing accepting queries while custom QueryComponent starts up?

2012-11-08 Thread Aaron Daubman
Amit,

I am using warming /firstSearcher queries to ensure this happens before any
external queries are received; however, unless I am misinterpreting the
logs, solr starts responding to admin/ping requests before firstSearcher
completes, the LB then puts the solr instance back in the pool, and it
starts accepting connections...


On Thu, Nov 8, 2012 at 4:24 PM, Amit Nithian anith...@gmail.com wrote:

 I think Solr does this by default and are you executing warming queries in
 the firstSearcher so that these actions are done before Solr is ready to
 accept real queries?


 On Thu, Nov 8, 2012 at 11:54 AM, Aaron Daubman daub...@gmail.com wrote:

  Greetings,
 
  I have several custom QueryComponents that have high one-time startup
 costs
  (hashing things in the index, caching things from a RDBMS, etc...)
 
  Is there a way to prevent solr from accepting connections before all
  QueryComponents are ready?
 
  Especially, since many of our instance are load-balanced (and
  added-in/removed automatically based on admin/ping responses) preventing
  ping from answering prior to all custom QueryComponents being ready would
  be ideal...
 
  Thanks,
   Aaron
 



Re: Preventing accepting queries while custom QueryComponent starts up?

2012-11-08 Thread Aaron Daubman
  (plus when I deploy, my deploy script
 runs some actual simple test queries to ensure they return before enabling
 the ping handler to return 200s) to avoid this problem.


What are you doing to programmatically disable/enable the ping handler?
This sounds like exactly what I should be doing as well...
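In case it helps anyone following along later: Solr 4.x's PingRequestHandler
can be pointed at a healthcheck file, and the handler then honors
enable/disable actions that create/remove that file. A rough sketch with
SolrJ (this assumes a ping handler configured with a healthcheckFile in
solrconfig.xml, and is not necessarily what the poster above is doing):
---snip---
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.SolrPing;

public class PingToggle {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");

    // Take the node out of the LB pool before warm-up / smoke tests...
    SolrPing disable = new SolrPing();
    disable.setActionDisable();   // removes the healthcheck file, so ping fails
    disable.process(solr);

    // ... run smoke-test queries here, then put it back:
    SolrPing enable = new SolrPing();
    enable.setActionEnable();     // recreates the healthcheck file
    enable.process(solr);
  }
}
---snip---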


Improving performance for use-case where large (200) number of phrase queries are used?

2012-10-24 Thread Aaron Daubman
Greetings,

We have a solr instance in use that gets some perhaps atypical queries
and suffers from poor (2 second) QTimes.

Documents (~2,350,000) in this instance are mainly comprised of
various descriptive fields, such as multi-word (phrase) tags - an
average document contains 200-400 phrases like this across several
different multi-valued field types.

A custom QueryComponent has been built that functions somewhat like a
very specific MoreLikeThis. A seed document is specified via the
incoming query, its terms are retrieved, boosted both by query
parameters as well as fields within the document that specify term
weighting, sorted by this custom boosting, and then a second query is
crafted by taking the top 200 (sorted by the custom boosting)
resulting field values paired with their fields and searching for
documents matching these 200 values.

For many searches, 25-50% of the documents match the query of 200
terms (so 600,000 to 1,200,000).

After doing some profiling, it seems that a majority of the QTime
comes from dealing with phrases and resulting term positions, since a
majority of the search terms are actually multi-word tokenized
phrases. (processing is dominated by ExactPhraseScorer on down,
particularly: SegmentTermPositions, readVInt)

I have thought of a few ways to improve performance for this use case,
and am looking for feedback as to which seems best, as well as any
insight into other ways to approach this problem that I haven't
considered (or things to look into to help better understand the slow
QTimes more fully):

1) Shard the index - since there is no key to really specify which
shard queries would go to, this would only be of benefit if scoring is
done in parallel. Is there documentation I have so far missed that
describes distributed searching for this case? (I haven't found
anything that really describes the differences in scoring for
distributed vs. non-distributed indices, aside from the warnings that
IDF doesn't work - which I don't think we really care about).

2) Implement Common Grams as described here:
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
It's not clear how many individual words in the phrases being used
are, in fact, common, but given that 25-50% of the documents in the
index match many queries, it seems this may be of value

3) Try and make mm (minimum terms should match) work for the custom
query. I haven't been able to figure out how exactly this parameter
works, but, my thinking is along the lines of if only 2 of those 200
terms match a document, it doesn't need to get scored. What I don't
currently understand is at what point failing the mm requirement
short-circuits - e.g. does the doc still get scored? If it does
short-circuit prior to scoring, this may help somewhat, although it's
not clear this would still prevent the many many gets against term
positions that is still killing QTime

4) Set a dynamic number (rather than the currently fixed 200) of terms
based on the custom boosting/weighting value - e.g. only use terms
whose calculated value is above some threshold. I'm not keen on this
since some documents may be dominated by many weak terms and not have
any great ones, so it might break for those (finding the sweet spot
cutoff would not be straightforward).

5) *This is my current favorite*: stop tokenizing/analyzing these
terms and just use KeywordTokenizer. Most of these phrases are
pre-vetted, and it may be possible to clean/process any others before
creating the docs. My main worry here is that, currently, if I
understand correctly, a document with the phrase "brazilian pop" would
still be returned as a match to a seed document containing only the
phrase "brazilian" (not the other way around, but that is not
necessary); however, with KeywordTokenizer, this would no longer be
the case. If I switched from the current dubious tokenize/stem/etc...
and just used Keyword, would this allow queries like "this used to be
a long phrase query" to match documents that have "this used to be a
long phrase query" as one of the multivalued values in the field
without having to pull term positions? (and thus significantly speed
up performance).
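To make the trade-off in 5) concrete, here is a hedged sketch of the two
query shapes (Lucene 3.x/4.x query API; the field names are made up). With
the tokenized field each tag match is a PhraseQuery that has to walk term
positions, whereas with a keyword-tokenized field the whole tag is a single
term and the match is a plain TermQuery:
---snip---
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

public class TagQueryShapes {
  public static void main(String[] args) {
    // Tokenized field: matching the tag means a phrase query, which needs
    // term positions for every candidate document (ExactPhraseScorer).
    PhraseQuery phrase = new PhraseQuery();
    phrase.add(new Term("t_style", "brazilian"));
    phrase.add(new Term("t_style", "pop"));

    // KeywordTokenizer'd field: the whole tag is indexed as a single term,
    // so the same match is a TermQuery and never touches positions.
    TermQuery keyword = new TermQuery(new Term("t_style_exact", "brazilian pop"));

    System.out.println(phrase);
    System.out.println(keyword);
  }
}
---snip---
The cost, as noted above, is losing the partial match of "brazilian" against
"brazilian pop".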

Thanks,
 Aaron


Re: Improving performance for use-case where large (200) number of phrase queries are used?

2012-10-24 Thread Aaron Daubman
Thanks for the ideas - some followup questions in-line below:


 * use shingles e.g. to turn two-word phrases into single terms (how
 long is your average phrase?).

Would this be different than what I was calling common grams? (other
than shingling every two words, rather than just common ones?)
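For reference, the practical difference is which positions get gram'd:
CommonGramsFilter only forms bigrams around the words in its common-words
list, while ShingleFilter forms them at every position. A minimal sketch of
a two-word shingle chain (Lucene 3.6-era analysis API; the field name and
sample text are made up):
---snip---
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ShingleExample {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new Analyzer() {
      @Override
      public TokenStream tokenStream(String field, Reader reader) {
        TokenStream ts = new StandardTokenizer(Version.LUCENE_36, reader);
        return new ShingleFilter(ts, 2, 2); // two-word shingles (plus unigrams by default)
      }
    };

    TokenStream ts = analyzer.tokenStream("t_style", new StringReader("more like this"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term); // more, "more like", like, "like this", this
    }
    ts.end();
    ts.close();
  }
}
---snip---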


 * in addition to the above, maybe for phrases with > 2 terms, consider
 just a boolean conjunction of the shingled phrases instead of a real
 phrase query: e.g. "more like this" - (more_like AND like_this). This
 would have some false positives.

This would definitely help, but, IIRC, we moved to phrase queries due
to too many false positives, it would be an interesting experiment to
see how many false positives were left when shingling and then just
doing conjunctive queries.


 * use a more aggressive stopwords list for your MorePhrasesLikeThis.
 * reduce this number 200, and instead work harder to prune out which
 phrases are the most descriptive from the seed document, e.g. based
 on some heuristics like their frequency or location within that seed
 document, so your query isnt so massive.

This is something I've been asking for (perform some sort of PCA /
feature selection on the actual terms used) but is of questionable
value and hard to do right so hasn't happened yet (it's not clear
that there will be terms that are very common that are not also very
descriptive, so the extent to which this would help is unknown).

Thanks again for the ideas!
 Aaron


Re: Improving performance for use-case where large (200) number of phrase queries are used?

2012-10-24 Thread Aaron Daubman
Hi Peter,

Thanks for the recommendation - I believe we are thinking along the
same lines, but wanted to check to make sure. Are you suggesting
something different than my #5 (below) or are we essentially
suggesting the same thing?

On Wed, Oct 24, 2012 at 1:20 PM, Peter Keegan peterlkee...@gmail.com wrote:
 Could you index your 'phrase tags' as single tokens? Then your phrase
 queries become simple TermQuerys.


 5) *This is my current favorite*: stop tokenizing/analyzing these
 terms and just use KeywordTokenizer. Most of these phrases are
 pre-vetted, and it may be possible to clean/process any others before
 creating the docs. My main worry here is that, currently, if I
 understand correctly, a document with the phrase "brazilian pop" would
 still be returned as a match to a seed document containing only the
 phrase "brazilian" (not the other way around, but that is not
 necessary); however, with KeywordTokenizer, this would no longer be
 the case. If I switched from the current dubious tokenize/stem/etc...
 and just used Keyword, would this allow queries like "this used to be
 a long phrase query" to match documents that have "this used to be a
 long phrase query" as one of the multivalued values in the field
 without having to pull term positions? (and thus significantly speed
 up performance).


Thanks again,
 Aaron


Why does SolrIndexSearcher.java enforce mutual exclusion of filter and filterList?

2012-10-21 Thread Aaron Daubman
Greetings,

I'm wondering if somebody would please explain why
SolrIndexSearcher.java enforces mutual exclusion of filter and
filterList
(e.g. see: 
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L2039
)

For a custom application we have been using this functionality
successfully and I have been maintaining patches against base releases
up from 1.4 through 3.6.1 and am now finally looking at 4.0. Since I
am yet-again revisiting this custom patch, I am wondering why this
functionality is prevented out of the box - for two reasons really:

1) It would be great if I didn't have to maintain a custom internal
branch of solr for this tiny little change
2) I am worried that the purposeful prevention of this functionality
implies there is a downside to doing this.

Is there a downside to utilizing both a DocSet based filter and query
based filterList?
If not, once I migrate this patch to 4.0 what would be the best way to
get this functionality incorporated into the base?

For additional info, you may find the now-2-year-old issue with
patches addressing this up through 3.6.1 here:
https://issues.apache.org/jira/browse/SOLR-2052

Any insight appreciated as always,
 Aaron


ScorerDocQueue.java's downHeap showing up as frequent hotspot in profiling - ideas why?

2012-10-16 Thread Aaron Daubman
Greetings,

In a recent batch of solr 3.6.1 slow response time queries, the
profiler highlighted downHeap (line 212) in ScorerDocQueue.java as
averaging more than 60ms across the 16 calls I was looking at and
showing it spiking up over 100ms - which, after looking at the code
(two int comparisons?!?), I am at a loss to explain:

Here's the source:
https://github.com/apache/lucene-solr/blob/6b8783bfa59351878c59e47deaa7739d95150a22/lucene/core/src/java/org/apache/lucene/util/ScorerDocQueue.java#L212

Here's the invocation trace of one of the many similar:
---snip---
Thread.run:722 (0ms self time, 416 ms total time)
 QueuedThreadPool$3.run:526 (0ms self time, 416 ms total time)
  QueuedThreadPool.runJob:595 (0ms self time, 416 ms total time)
   ExecutorCallback$ExecutorCallbackInvoker.run:130 (0ms self time,
416 ms total time)
ExecutorCallback$ExecutorCallbackInvoker.call:124 (0ms self time,
416 ms total time)
 AbstractConnection$1.onCompleted:63 (0ms self time, 416 ms total time)
  AbstractConnection$1.onCompleted:71 (0ms self time, 416 ms total time)
   HttpConnection.onFillable:253 (0ms self time, 416 ms total time)
HttpChannel.run:246 (0ms self time, 416 ms total time)
 Server.handle:403 (0ms self time, 416 ms total time)
  HandlerWrapper.handle:97 (0ms self time, 416 ms total time)
   IPAccessHandler.handle:204 (0ms self time, 416 ms total time)
HandlerCollection.handle:110 (0ms self time, 416 ms total time)
 ContextHandlerCollection.handle:258 (0ms self time, 416
ms total time)
  ScopedHandler.handle:136 (0ms self time, 416 ms total time)
   ContextHandler.doScope:973 (0ms self time, 416 ms total time)
SessionHandler.doScope:174 (0ms self time, 416 ms total time)
 ServletHandler.doScope:358 (0ms self time, 416 ms total time)
  ContextHandler.doHandle:1044 (0ms self time, 416 ms
total time)
   SessionHandler.doHandle:213 (0ms self time, 416 ms
total time)
SecurityHandler.handle:540 (0ms self time, 416 ms
total time)
 ScopedHandler.handle:138 (0ms self time, 416 ms total time)
  ServletHandler.doHandle:429 (0ms self time, 416
ms total time)
   ServletHandler$CachedChain.doFilter:1274 (0ms
self time, 416 ms total time)
SolrDispatchFilter.doFilter:260 (0ms self
time, 416 ms total time)
 SolrDispatchFilter.execute:365 (0ms self
time, 416 ms total time)
  SolrCore.execute:1376 (0ms self time, 416 ms
total time)
   RequestHandlerBase.handleRequest:129 (0ms
self time, 416 ms total time)
SearchHandler.handleRequestBody:186 (0ms
self time, 416 ms total time)
 QueryComponent.process:394 (0ms self
time, 416 ms total time)
  SolrIndexSearcher.search:375 (0ms self
time, 416 ms total time)
   SolrIndexSearcher.getDocListC:1176 (0ms
self time, 416 ms total time)
SolrIndexSearcher.getDocListNC:1296
(0ms self time, 416 ms total time)
 IndexSearcher.search:364 (0ms self
time, 416 ms total time)
  IndexSearcher.search:581 (0ms self
time, 416 ms total time)
   FilteredQuery$2.score:169 (0ms self
time, 416 ms total time)
BooleanScorer2.advance:320 (0ms
self time, 416 ms total time)
 ReqExclScorer.advance:112 (0ms
self time, 416 ms total time)
  DisjunctionSumScorer.advance:229
(52ms self time, 416 ms total time)

DisjunctionSumScorer.advanceAfterCurrent:171 (0ms self time, 308 ms
total time)

ScorerDocQueue.topNextAndAdjustElsePop:120 (0ms self time, 308 ms
total time)

ScorerDocQueue.checkAdjustElsePop:135 (0ms self time, 111 ms total
time)
  ScorerDocQueue.downHeap:212
(111ms self time, 111 ms total time)
---snip---

Any ideas on what is causing this seemingly inordinate amount of time
in downHeap? Is this symptomatic of anything in particular?

Thanks, as always!
 Aaron


Re: PriorityQueue:initialize consistently showing up as hot spot while profiling

2012-10-10 Thread Aaron Daubman
Hi Mikhail,

On Fri, Oct 5, 2012 at 7:15 AM, Mikhail Khludnev
mkhlud...@griddynamics.com wrote:
 okay. huge rows value is no.1 way to kill Lucene. It's not possible,
 absolutely. You need to rethink logic of your component. Check Solr's
 FieldCollapsing code, IIRC it makes second search to achieve similar goal.
 Also check PostFilter and DelegatingCollector classes, their approach can
 also be handy for your task.

This sounds like it could be a much saner way to handle the task,
however, I'm not sure what I should be looking at for the
'FieldCollapsing code' you mention - can you point me to a class?

Also, is there anything more you can say about PostFilter and
DelegatingCollector classes - I reviewed them but it was not obvious
to me what they were doing that would allow me to reduce the large
rows param we use to ensure all relevant docs are included in the
grouping and limiting occurs at the group level, rather than
pre-grouping...
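For anyone else trying to map those names onto code: a PostFilter supplies a
DelegatingCollector that runs after the main query and filters, deciding per
matching document whether to pass it down the collector chain - so expensive
per-document logic doesn't have to be expressed through a huge rows value. A
hedged sketch of the shape (Solr 4.x API; the per-document test is a
placeholder, not the actual grouping logic discussed here):
---snip---
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class ExamplePostFilter extends ExtendedQueryBase implements PostFilter {

  @Override
  public boolean getCache() { return false; }  // post filters are not cached

  @Override
  public int getCost() { return Math.max(super.getCost(), 100); } // cost >= 100 => run post-query

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      @Override
      public void collect(int doc) throws IOException {
        if (someTest(doc)) {
          super.collect(doc);  // forward the doc to the wrapped collector
        }
      }
    };
  }

  private boolean someTest(int doc) {
    return true; // placeholder for the real per-document check
  }
}
---snip---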

Thanks again,
  Aaron


Re: PriorityQueue:initialize consistently showing up as hot spot while profiling

2012-10-05 Thread Aaron Daubman
On Fri, Oct 5, 2012 at 4:33 AM, Mikhail Khludnev
mkhlud...@griddynamics.com wrote:
 what's the value of rows param
 http://wiki.apache.org/solr/CommonQueryParameters#rows ?

Very interesting question - so, for historic reasons lost to me, we
pass in a huge (1000?) number for rows and this hits our custom
component, which has its own internal maximum for real rows returned.
(This is a custom grouping component, so I am guessing the large
number of rows had to do with trying not to limit what got grouped?).

Is the value of rows what is used for that heap allocation?

Thanks,
 Aaron


 On Fri, Oct 5, 2012 at 6:56 AM, Aaron Daubman daub...@gmail.com wrote:

 Greetings,

 I've been seeing this call chain come up fairly frequently when
 debugging longer-QTime queries under Solr 3.6.1 but have not been able
 to understand from the code what is really going on - the call graph
 and code follow below.

 Would somebody please explain to me:
 1) Why this would show up frequently as a hotspot
 2) If it is expected to do so
 3) If there is anything I should look in to that may help performance
 where this frequently shows up as the long pole in the QTime tent
 4) What the code is doing and why heap is being allocated as an
 apparently giant object (which also is apparently not unheard of due
 to MAX_VALUE wrapping check)

 ---call-graph---
 Filter - SolrDispatchFilter:doFilter (method time = 12 ms, total time =
 487 ms)
  Filter - SolrDispatchFilter:execute:365 (method time = 0 ms, total
 time = 109 ms)
   org.apache.solr.core.SolrCore:execute:1376 (method time = 0 ms,
 total time = 109 ms)
org.apache.solr.handler.RequestHandlerBase:handleRequest:129
 (method time = 0 ms, total time = 109 ms)
 org.apache.solr.handler.component.SearchHandler:handleRequestBody:186
 (method time = 0 ms, total time = 109 ms)
  com.echonest.solr.component.EchoArtistGroupingComponent:process:188
 (method time = 0 ms, total time = 109 ms)
   org.apache.solr.search.SolrIndexSearcher:search:375 (method time
 = 0 ms, total time = 96 ms)
org.apache.solr.search.SolrIndexSearcher:getDocListC:1176
 (method time = 0 ms, total time = 96 ms)
 org.apache.solr.search.SolrIndexSearcher:getDocListNC:1209
 (method time = 0 ms, total time = 96 ms)
  org.apache.solr.search.SolrIndexSearcher:getProcessedFilter:796
 (method time = 0 ms, total time = 26 ms)
   org.apache.solr.search.BitDocSet:andNot:185 (method time = 0
 ms, total time = 13 ms)
org.apache.lucene.util.OpenBitSet:clone:732 (method time =
 13 ms, total time = 13 ms)
   org.apache.solr.search.BitDocSet:intersection:31 (method
 time = 0 ms, total time = 13 ms)
org.apache.solr.search.DocSetBase:intersection:90 (method
 time = 0 ms, total time = 13 ms)
 org.apache.lucene.util.OpenBitSet:and:808 (method time =
 13 ms, total time = 13 ms)
  org.apache.lucene.search.TopFieldCollector:create:916 (method
 time = 0 ms, total time = 46 ms)
   org.apache.lucene.search.FieldValueHitQueue:create:175
 (method time = 0 ms, total time = 46 ms)

  
 org.apache.lucene.search.FieldValueHitQueue$MultiComparatorsFieldValueHitQueue:init:111
 (method time = 0 ms, total time = 46 ms)
 org.apache.lucene.search.SortField:getComparator:409
 (method time = 0 ms, total time = 13 ms)

  org.apache.lucene.search.FieldComparator$FloatComparator:init:400
 (method time = 13 ms, total time = 13 ms)
 org.apache.lucene.util.PriorityQueue:initialize:108
 (method time = 33 ms, total time = 33 ms)
 ---snip---


 org.apache.lucene.util.PriorityQueue:initialize - hotspot is line 108:
 heap = (T[]) new Object[heapSize]; // T is unbounded type, so this
 unchecked cast works always

 ---PriorityQueue.java---
   /** Subclass constructors must call this. */
   @SuppressWarnings("unchecked")
   protected final void initialize(int maxSize) {
 size = 0;
 int heapSize;
 if (0 == maxSize)
   // We allocate 1 extra to avoid if statement in top()
   heapSize = 2;
 else {
   if (maxSize == Integer.MAX_VALUE) {
 // Don't wrap heapSize to -1, in this case, which
 // causes a confusing NegativeArraySizeException.
 // Note that very likely this will simply then hit
 // an OOME, but at least that's more indicative to
 // caller that this values is too big.  We don't +1
 // in this case, but it's very unlikely in practice
 // one will actually insert this many objects into
 // the PQ:
 heapSize = Integer.MAX_VALUE;
   } else {
 // NOTE: we add +1 because all access to heap is
 // 1-based not 0-based.  heap[0] is unused.
 heapSize = maxSize + 1;
   }
 }
 heap = (T[]) new Object[heapSize]; // T is unbounded type, so this
 unchecked cast works always
 this.maxSize = maxSize;

 // If sentinel objects are supported, populate the queue with them
 T sentinel

PriorityQueue:initialize consistently showing up as hot spot while profiling

2012-10-04 Thread Aaron Daubman
Greetings,

I've been seeing this call chain come up fairly frequently when
debugging longer-QTime queries under Solr 3.6.1 but have not been able
to understand from the code what is really going on - the call graph
and code follow below.

Would somebody please explain to me:
1) Why this would show up frequently as a hotspot
2) If it is expected to do so
3) If there is anything I should look in to that may help performance
where this frequently shows up as the long pole in the QTime tent
4) What the code is doing and why heap is being allocated as an
apparently giant object (which also is apparently not unheard of due
to MAX_VALUE wrapping check)

---call-graph---
Filter - SolrDispatchFilter:doFilter (method time = 12 ms, total time = 487 ms)
 Filter - SolrDispatchFilter:execute:365 (method time = 0 ms, total
time = 109 ms)
  org.apache.solr.core.SolrCore:execute:1376 (method time = 0 ms,
total time = 109 ms)
   org.apache.solr.handler.RequestHandlerBase:handleRequest:129
(method time = 0 ms, total time = 109 ms)
org.apache.solr.handler.component.SearchHandler:handleRequestBody:186
(method time = 0 ms, total time = 109 ms)
 com.echonest.solr.component.EchoArtistGroupingComponent:process:188
(method time = 0 ms, total time = 109 ms)
  org.apache.solr.search.SolrIndexSearcher:search:375 (method time
= 0 ms, total time = 96 ms)
   org.apache.solr.search.SolrIndexSearcher:getDocListC:1176
(method time = 0 ms, total time = 96 ms)
org.apache.solr.search.SolrIndexSearcher:getDocListNC:1209
(method time = 0 ms, total time = 96 ms)
 org.apache.solr.search.SolrIndexSearcher:getProcessedFilter:796
(method time = 0 ms, total time = 26 ms)
  org.apache.solr.search.BitDocSet:andNot:185 (method time = 0
ms, total time = 13 ms)
   org.apache.lucene.util.OpenBitSet:clone:732 (method time =
13 ms, total time = 13 ms)
  org.apache.solr.search.BitDocSet:intersection:31 (method
time = 0 ms, total time = 13 ms)
   org.apache.solr.search.DocSetBase:intersection:90 (method
time = 0 ms, total time = 13 ms)
org.apache.lucene.util.OpenBitSet:and:808 (method time =
13 ms, total time = 13 ms)
 org.apache.lucene.search.TopFieldCollector:create:916 (method
time = 0 ms, total time = 46 ms)
  org.apache.lucene.search.FieldValueHitQueue:create:175
(method time = 0 ms, total time = 46 ms)
   
org.apache.lucene.search.FieldValueHitQueue$MultiComparatorsFieldValueHitQueue:init:111
(method time = 0 ms, total time = 46 ms)
org.apache.lucene.search.SortField:getComparator:409
(method time = 0 ms, total time = 13 ms)
 org.apache.lucene.search.FieldComparator$FloatComparator:init:400
(method time = 13 ms, total time = 13 ms)
org.apache.lucene.util.PriorityQueue:initialize:108
(method time = 33 ms, total time = 33 ms)
---snip---


org.apache.lucene.util.PriorityQueue:initialize - hotspot is line 108:
heap = (T[]) new Object[heapSize]; // T is unbounded type, so this
unchecked cast works always

---PriorityQueue.java---
  /** Subclass constructors must call this. */
  @SuppressWarnings("unchecked")
  protected final void initialize(int maxSize) {
size = 0;
int heapSize;
if (0 == maxSize)
  // We allocate 1 extra to avoid if statement in top()
  heapSize = 2;
else {
  if (maxSize == Integer.MAX_VALUE) {
// Don't wrap heapSize to -1, in this case, which
// causes a confusing NegativeArraySizeException.
// Note that very likely this will simply then hit
// an OOME, but at least that's more indicative to
// caller that this values is too big.  We don't +1
// in this case, but it's very unlikely in practice
// one will actually insert this many objects into
// the PQ:
heapSize = Integer.MAX_VALUE;
  } else {
// NOTE: we add +1 because all access to heap is
// 1-based not 0-based.  heap[0] is unused.
heapSize = maxSize + 1;
  }
}
heap = (T[]) new Object[heapSize]; // T is unbounded type, so this
unchecked cast works always
this.maxSize = maxSize;

// If sentinel objects are supported, populate the queue with them
T sentinel = getSentinelObject();
if (sentinel != null) {
  heap[1] = sentinel;
  for (int i = 2; i < heap.length; i++) {
heap[i] = getSentinelObject();
  }
  size = maxSize;
}
  }
---snip---


Thanks, as always!
 Aaron


Re: Understanding fieldCache SUBREADER insanity

2012-10-02 Thread Aaron Daubman
Hi Yonik,

I've been attempting to fix the SUBREADER insanity in our custom
component, and have made perhaps some progress (or is this worse?) -
I've gone from SUBREADER to VALUEMISMATCH insanity:
---snip---
entries_count : 12
entry#0 : 
'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='f_normalizedTotalHotttnesss',class
org.apache.lucene.search.FieldCacheImpl$DocsWithFieldCache,null=org.apache.lucene.util.FixedBitSet#1387502754
entry#1 : 
'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='i_track_count',class
org.apache.lucene.search.FieldCacheImpl$DocsWithFieldCache,null=org.apache.lucene.util.Bits$MatchAllBits#233863705
entry#2 : 
'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='s_artistID',class
org.apache.lucene.search.FieldCache$StringIndex,null=org.apache.lucene.search.FieldCache$StringIndex#652215925
entry#3 : 
'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='s_artistID',class
java.lang.String,null=[Ljava.lang.String;#1036517187
entry#4 : 
'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='thingID',class
java.lang.String,null=[Ljava.lang.String;#357017445
entry#5 : 
'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='f_normalizedTotalHotttnesss',float,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_FLOAT_PARSER=[F#322888397
entry#6 : 
'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='f_normalizedTotalHotttnesss',float,org.apache.lucene.search.FieldCache.DEFAULT_FLOAT_PARSER=org.apache.lucene.search.FieldCache$CreationPlaceholder#1229311421
entry#7 : 
'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='f_normalizedTotalHotttnesss',float,null=[F#322888397
entry#8 : 
'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='i_collapse',int,org.apache.lucene.search.FieldCache.DEFAULT_INT_PARSER=org.apache.lucene.search.FieldCache$CreationPlaceholder#92920526
entry#9 : 
'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='i_collapse',int,null=[I#494669113
entry#10 : 
'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='i_collapse',int,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_INT_PARSER=[I#494669113
entry#11 : 
'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='i_track_count',int,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_INT_PARSER=[I#994584654
insanity_count : 1
insanity#0 : VALUEMISMATCH: Multiple distinct value objects for
MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)+s_artistID
'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='s_artistID',class
org.apache.lucene.search.FieldCache$StringIndex,null=org.apache.lucene.search.FieldCache$StringIndex#652215925
'MMapIndexInput(path=/io01/p/solr/playlist/c/playlist/index/_c2.frq)'='s_artistID',class
java.lang.String,null=[Ljava.lang.String;#1036517187
---snip---

Any suggestions on what the cause of this VALUEMISMATCH is, whether it is
the normal case, or how to fix it?

For anybody else with SUBREADER insanity issues, this is the change I
made to get this far (get the first leafReader, since we are using a
merged/optimized index):
---snip---
SolrIndexReader reader = searcher.getReader().getLeafReaders()[0];
collapseIDs = FieldCache.DEFAULT.getInts(reader, COLLAPSE_KEY_NAME);
hotnessValues = FieldCache.DEFAULT.getFloats(reader, HOTNESS_KEY_NAME);
artistIDs = FieldCache.DEFAULT.getStrings(reader, ARTIST_KEY_NAME);
---snip---

Thanks,
 Aaron

On Wed, Sep 19, 2012 at 4:54 PM, Yonik Seeley yo...@lucidworks.com wrote:
 already-optimized, single-segment index

 That part is interesting... if true, then the type of insanity you
 saw should be impossible, and either the insanity detection or
 something else is broken.

 -Yonik
 http://lucidworks.com


Solr Caching - how to tune, how much to increase, and any tips on using Solr with JDK7 and G1 GC?

2012-09-29 Thread Aaron Daubman
Greetings,

I've recently moved to running some of our Solr (3.6.1) instances
using JDK 7u7 with the G1 GC (playing with max pauses in the 20 to
100ms range). By and large, it has been working well (or, perhaps I
should say that without requiring much tuning it works much better in
general than my haphazard attempts to tune CMS).

I have two instances in particular, one with a heap size of 14G and
one with a heap size of 60G. I'm attempting to squeeze out additional
performance by increasing Solr's cache sizes (I am still seeing the
hit ratio go up as I increase the max size and decrease the number of
evictions), and am guessing this is the cause of some recent
situations where the 14G instance especially eventually (12-24 hrs
later, under 100s of queries per minute) makes it to 80%-90% of the
heap and then spirals into major-GC, long-pause territory.

I am wondering:
1) if anybody has experience tuning the G1 GC, especially for use with
Solr (what are decent max-pause times to use?)
2) how to better tune Solr's cache sizes - e.g. how to even tell the
actual amount of memory used by each cache (not # entries as the stats
show, but # bits)
3) if there are any guidelines on when increasing a cache's size (even
if it does continue to increase the hit ratio) runs into the law of
diminishing returns or even starts to hurt - e.g. if the document
cache has a current maxSize of 65536 and has seen 4409275 evictions,
and currently has a hit ratio of 0.74, should the max be increased
further? If so, how much ram needs to be added to the heap, and how
much larger should its max size be made?

I should mention that these solr instances are read-only (so cache is
probably more valuable than in other scenarios - we only invalidate
the searcher every 20-24hrs or so) and are also backed with indexes
(6G and 70G for the 14G and 60G heap sizes) on IODrives, so I'm not as
concerned about leaving RAM for linux to cache the index files (I'd
much rather actually cache the post-transformed values).

Thanks as always,
 Aaron


How to more gracefully handle field format exceptions?

2012-09-24 Thread Aaron Daubman
Greetings,

Is there a way to configure more graceful handling of field formatting
exceptions when indexing documents?

Currently, there is a field being generated in some documents that I
am indexing that is supposed to be a float but sometimes slips
through as an empty string. (I know, fix the docs, but sometimes bad
values slip through, and it would be nice to handle them in a more
forgiving manner).

Here's an example of the exception - when this happens, the entire doc
is thrown out due to the one malformed field:
---snip---
ERROR org.apache.solr.core.SolrCore -
org.apache.solr.common.SolrException: ERROR: [doc=docidstr] Error
adding field 'f_floatfield'=''
...
Caused by: java.lang.NumberFormatException: empty String

00:56:46,288 [SI] WARN  com.company.IndexerThread - BAD DOC:
a82a2f6a6a42ad3c98a05ddb3f2c382c
01:02:12,713 [SI] ERROR org.apache.solr.core.SolrCore -
org.apache.solr.common.SolrException: ERROR:
[doc=6ff90020f9ec0f6dd623e9879c3e024d] Error adding field
'f_afloatfield'=''
at 
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:333)
at 
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:157)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
at 
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:142)
at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:121)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:106)
at com.company.IndexerThread.run(IndexerThread.java:55)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.NumberFormatException: empty String
at 
sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1011)
at java.lang.Float.parseFloat(Float.java:452)
at org.apache.solr.schema.TrieField.createField(TrieField.java:410)
at org.apache.solr.schema.SchemaField.createField(SchemaField.java:103)
at 
org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:203)
at 
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:286)
... 12 more

01:02:12,713 [SI] WARN  com.company.IndexerThread - BAD DOC:
6ff90020f9ec0f6dd623e9879c3e024d
---snip---

In my thinking (and for this situation), it would be much better to
just ignore the malformed field and keep the doc - is there any way to
configure this or enable this behavior instead?

Thanks,
 Aaron


Re: How to more gracefully handle field format exceptions?

2012-09-24 Thread Aaron Daubman
Hi Otis,

I was just looking at how to implement that, but was hoping for a
cleaner method - it seems like I will have to actually parse the error
as text to find the field that caused it, then remove/mangle that
field and attempt re-adding the document - which seems less than
ideal.

I would think there would be a flag or an easy way to override the add
method that would just drop (or set to default value) any field that
didn't meet expectations.
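One server-side option that avoids parsing error text (not a built-in flag,
but cleaner than client-side retries) is a small custom update processor
that drops or defaults the offending field before the document reaches the
normal add path. A hedged sketch (Solr 3.6/4.x UpdateRequestProcessor API;
the f_ prefix check is just an example, and the factory still has to be
wired into an updateRequestProcessorChain in solrconfig.xml):
---snip---
import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

/** Hypothetical processor: drops f_* float fields whose values don't parse. */
public class DropBadFloatsProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        // Copy the names first so we can safely remove fields while looping
        for (String name : doc.getFieldNames().toArray(new String[0])) {
          if (name.startsWith("f_")) {
            Object v = doc.getFieldValue(name);
            try {
              if (v != null) Float.parseFloat(v.toString());
            } catch (NumberFormatException e) {
              doc.removeField(name);  // drop the malformed field, keep the doc
            }
          }
        }
        super.processAdd(cmd);  // continue down the chain to RunUpdateProcessor
      }
    };
  }
}
---snip---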

Thanks for the suggestion,
 Aaron

On Mon, Sep 24, 2012 at 9:24 PM, Otis Gospodnetic
otis.gospodne...@gmail.com wrote:
 Hi Aaron,

 You could catch the error on the client, fix/clean/remove, and retry, no?

 Otis
 --
 Search Analytics - http://sematext.com/search-analytics/index.html
 Performance Monitoring - http://sematext.com/spm/index.html


 On Mon, Sep 24, 2012 at 9:21 PM, Aaron Daubman daub...@gmail.com wrote:
 Greetings,

 Is there a way to configure more graceful handling of field formatting
 exceptions when indexing documents?

 Currently, there is a field being generated in some documents that I
 am indexing that is supposed to be a float but some times slips
 through as an empty string. (I know, fix the docs, but sometimes bad
 values slip through, and it would be nice to handle them in a more
 forgiving manner).

 Here's an example of the exception - when this happens, the entire doc
 is thrown out due to the one malformed field:
 ---snip---
 ERROR org.apache.solr.core.SolrCore -
 org.apache.solr.common.SolrException: ERROR: [doc=docidstr] Error
 adding field 'f_floatfield'=''
 ...
 Caused by: java.lang.NumberFormatException: empty String

 00:56:46,288 [SI] WARN  com.company.IndexerThread - BAD DOC:
 a82a2f6a6a42ad3c98a05ddb3f2c382c
 01:02:12,713 [SI] ERROR org.apache.solr.core.SolrCore -
 org.apache.solr.common.SolrException: ERROR:
 [doc=6ff90020f9ec0f6dd623e9879c3e024d] Error adding field
 'f_afloatfield'=''
 at 
 org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:333)
 at 
 org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
 at 
 org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:157)
 at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
 at 
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
 at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
 at 
 org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:142)
 at 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
 at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:121)
 at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:106)
 at com.company.IndexerThread.run(IndexerThread.java:55)
 at java.lang.Thread.run(Thread.java:722)
 Caused by: java.lang.NumberFormatException: empty String
 at 
 sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1011)
 at java.lang.Float.parseFloat(Float.java:452)
 at org.apache.solr.schema.TrieField.createField(TrieField.java:410)
 at 
 org.apache.solr.schema.SchemaField.createField(SchemaField.java:103)
 at 
 org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:203)
 at 
 org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:286)
 ... 12 more

 01:02:12,713 [SI] WARN  com.company.IndexerThread - BAD DOC:
 6ff90020f9ec0f6dd623e9879c3e024d
 ---snip---

 In my thinking (and for this situation), it would be much better to
 just ignore the malformed field and keep the doc - is there any way to
 configure this or enable this behavior instead?

 Thanks,
  Aaron


Re: Understanding fieldCache SUBREADER insanity

2012-09-21 Thread Aaron Daubman
Yonik, et al.

I believe I found the section of code pushing me into 'insanity' status:
---snip---
int[] collapseIDs = null;
float[] hotnessValues = null;
String[] artistIDs = null;
try {
collapseIDs =
FieldCache.DEFAULT.getInts(searcher.getIndexReader(),
COLLAPSE_KEY_NAME);
hotnessValues =
FieldCache.DEFAULT.getFloats(searcher.getIndexReader(),
HOTNESS_KEY_NAME);
artistIDs =
FieldCache.DEFAULT.getStrings(searcher.getIndexReader(),
ARTIST_KEY_NAME);
} ...
---snip---

Since it seems like this code is using the 'old-style' pre-Lucene 2.9
top-level indexReaders, is there any example code you can point me to
that could show how to convert to using the leaf level segmentReaders?
If the limited information I've been able to find is correct, this
could explain some of the significant memory usage I am seeing...
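For anyone else making the same change, here is a hedged sketch of the
per-segment approach (Solr 3.x API, using SolrIndexReader's
getLeafReaders()/getLeafOffsets(); it reuses the constant names from the
snippet above but is an illustrative fragment, not the actual component).
Each leaf gets its own FieldCache arrays, and a top-level docid is
translated by subtracting the leaf's docBase:
---snip---
// Illustrative only: per-leaf FieldCache arrays plus docBase translation.
private int[][] collapseIDs;
private float[][] hotnessValues;
private String[][] artistIDs;
private int[] leafOffsets;

void cachePerLeaf(SolrIndexSearcher searcher) throws IOException {
    SolrIndexReader topReader = searcher.getReader();
    SolrIndexReader[] leaves = topReader.getLeafReaders();
    leafOffsets = topReader.getLeafOffsets();   // docBase of each leaf
    collapseIDs = new int[leaves.length][];
    hotnessValues = new float[leaves.length][];
    artistIDs = new String[leaves.length][];
    for (int i = 0; i < leaves.length; i++) {
        // Per-segment entries line up with what sorting uses, avoiding the
        // duplicate top-level entries flagged as SUBREADER insanity.
        collapseIDs[i] = FieldCache.DEFAULT.getInts(leaves[i], COLLAPSE_KEY_NAME);
        hotnessValues[i] = FieldCache.DEFAULT.getFloats(leaves[i], HOTNESS_KEY_NAME);
        artistIDs[i] = FieldCache.DEFAULT.getStrings(leaves[i], ARTIST_KEY_NAME);
    }
}

// Reading a value for a top-level doc id: find its leaf, subtract its docBase.
int collapseIdFor(int topLevelDoc) {
    int leaf = leafOffsets.length - 1;
    while (leaf > 0 && leafOffsets[leaf] > topLevelDoc) leaf--;  // offsets are ascending
    return collapseIDs[leaf][topLevelDoc - leafOffsets[leaf]];
}
---snip---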

Thanks again,
 Aaron

On Wed, Sep 19, 2012 at 4:54 PM, Yonik Seeley yo...@lucidworks.com wrote:
 already-optimized, single-segment index

 That part is interesting... if true, then the type of insanity you
 saw should be impossible, and either the insanity detection or
 something else is broken.

 -Yonik
 http://lucidworks.com


Understanding fieldCache SUBREADER insanity

2012-09-19 Thread Aaron Daubman
Hi all,

In reviewing a solr instance with somewhat variable performance, I
noticed that its fieldCache stats show an insanity_count of 1 with the
insanity type SUBREADER:

---snip---
insanity_count : 1
insanity#0 : SUBREADER: Found caches for descendants of
ReadOnlyDirectoryReader(segments_k
_6h9(3.3):C17198463)+tf_normalizedTotalHotttnesss
'ReadOnlyDirectoryReader(segments_k
_6h9(3.3):C17198463)'='tf_normalizedTotalHotttnesss',float,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_FLOAT_PARSER=[F#1965982057
'ReadOnlyDirectoryReader(segments_k
_6h9(3.3):C17198463)'='tf_normalizedTotalHotttnesss',float,null=[F#1965982057
'MMapIndexInput(path=/io01/p/solr/playlist/a/playlist/index/_6h9.frq)'='tf_normalizedTotalHotttnesss',float,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_FLOAT_PARSER=[F#1308116426
---snip---

How can I decipher what this means and what, if anything, I should do
to fix/improve the insanity?

Thanks,
 Aaron


Re: Understanding fieldCache SUBREADER insanity

2012-09-19 Thread Aaron Daubman
Hi Tomás,

 This probably means that you are using the same field for faceting and for
 sorting (tf_normalizedTotalHotttnesss), sorting uses the segment level
 cache and faceting uses by default the global field cache. This can be a
 problem because the field is duplicated in cache, and then it uses twice
 the memory.

 One way to solve this would be to change the faceting method on that field
 to 'fcs', which uses segment level cache (but may be a little bit slower).

Thanks for explaining what the sparse wiki and javadoc mean - I had
read them but had no idea what the implications were ;-)

We are not doing any explicit faceting, and this index is also
supposed to be a read-only, already-optimized, single-segment index -
both of these seem to indicate to (very unknowledgeable about this) me
that this could be more of a problem - e.g. what am I doing to cause
this, since I don't think I need to be using segment-level anything
(it should be a single segment if I understand optimization and RO
indices) and I am not leveraging faceting?

Any pointers on where else to look for what might be causing this (one
issue I am currently troubleshooting is too-many-pauses caused by
too-frequent GC, so preventing this double-allocation could help)?

Thanks again,
 Aaron


Solr request/response lifecycle and logging full response time

2012-09-06 Thread Aaron Daubman
Greetings,

I'm looking to add some additional logging to a solr 3.6.0 setup to
allow us to determine actual time spent by Solr responding to a
request.

We have a custom QueryComponent that sometimes returns 1+ MB of data
and while QTime is always on the order of ~100ms, the response time at
the client can be longer than a second (as measured with JMeter
running on the same server using localhost).

The end goal is to be able to:
1) determine if this large variance in response time is due to Solr,
and if so where (to help determine if/how it can be optimized)
2) determine if the large variance is due to how jetty handles
connections, buffering, etc... (and if so, if/how we can optimize
there)
...or some combination of the two.

As it stands now, the second or so between when the actual query
finishes (as indicated by QTime), when solr gathers all the data to be
returned as requested by fl, and when the client actually receives the
data (even when the client is on localhost) is completely opaque.

My main question:
- Is there any documentation (a diagram / flowchart would be oh so
wonderful) on the lifecycle of a Solr request? So far I've attempted
to modify and rebuild solr, adding logging to SolrCore's execute()
method (this pretty much mirrors QTime), as well as add timing
calculations and logging to various different overridden methods in the
QueryComponent custom extension, all to no avail so far.

What I'm getting at is how to:
- start a stopwatch when solr receives the request from the client
- stop the stopwatch and log the elapsed time right before solr hands
the response body off to Jetty to be delivered back to the client.
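
(For illustration, one way to get exactly that stopwatch is a tiny servlet Filter registered ahead of SolrDispatchFilter in web.xml - a minimal sketch with hypothetical class and logger names, assuming the standard javax.servlet API:)

---snip---
import java.io.IOException;
import javax.servlet.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class RequestTimingFilter implements Filter {
    private static final Logger log = LoggerFactory.getLogger(RequestTimingFilter.class);

    public void init(FilterConfig config) {}

    public void doFilter(ServletRequest req, ServletResponse rsp, FilterChain chain)
            throws IOException, ServletException {
        long start = System.currentTimeMillis(); // "stopwatch" starts as the container hands us the request
        try {
            chain.doFilter(req, rsp);            // SolrDispatchFilter + response writing happen in here
        } finally {
            log.info("Container-side request time: " + (System.currentTimeMillis() - start) + " ms");
        }
    }

    public void destroy() {}
}
---snip---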

Thanks, as always!
 Aaron


Re: Solr request/response lifecycle and logging full response time

2012-09-06 Thread Aaron Daubman
I'd still love to see a query lifecycle flowchart, but, in case it
helps any future users or in case this is still incorrect, here's how
I'm tackling this:

1) Override default json responseWriter with my own in solrconfig.xml:
<queryResponseWriter name="json"
    class="com.mydomain.solr.component.JSONResponseWriterWithTiming"/>
2) Define JSONResponseWriterWithTiming as just extending
JSONResponseWriter and adding in a log statement:

public class JSONResponseWriterWithTiming extends JSONResponseWriter {
    private static final Logger logger =
            LoggerFactory.getLogger(JSONResponseWriterWithTiming.class);

    @Override
    public void write(Writer writer, SolrQueryRequest req,
                      SolrQueryResponse rsp) throws IOException {
        super.write(writer, req, rsp);
        if (logger.isInfoEnabled()) {
            final long st = req.getStartTime();
            // (rsp.getEndTime() - st) roughly mirrors QTime; the second value also includes response writing
            logger.info(String.format("Total solr time for query with QTime: %d is: %d",
                    (int) (rsp.getEndTime() - st),
                    (int) (System.currentTimeMillis() - st)));
        }
    }
}

Please advise if:
- Flowcharts for any solr/lucene-related lifecycles exist
- There is a better way of doing this

Thanks,
  Aaron

On Thu, Sep 6, 2012 at 9:16 PM, Aaron Daubman daub...@gmail.com wrote:
 Greetings,

 I'm looking to add some additional logging to a solr 3.6.0 setup to
 allow us to determine actual time spent by Solr responding to a
 request.

 We have a custom QueryComponent that sometimes returns 1+ MB of data
 and while QTime is always on the order of ~100ms, the response time at
 the client can be longer than a second (as measured with JMeter
 running on the same server using localhost).

 The end goal is to be able to:
 1) determine if this large variance in response time is due to Solr,
 and if so where (to help determine if/how it can be optimized)
 2) determine if the large variance is due to how jetty handles
 connections, buffering, etc... (and if so, if/how we can optimize
 there)
 ...or some combination of the two.

 As it stands now, the second or so between when the actual query
 finishes (as indicated by QTime), when solr gathers all the data to be
 returned as requested by fl, and when the client actually receives the
 data (even when the client is on localhost) is completely opaque.

 My main question:
 - Is there any documentation (a diagram / flowchart would be oh so
 wonderful) on the lifecycle of a Solr request? So far I've attempted
 to modify and rebuild solr, adding logging to SolrCore's execute()
 method (this pretty much mirrors QTime), as well as add timing
 calculations and logging to various different overridden methods in the
 QueryComponent custom extension, all to no avail so far.

 What I'm getting at is how to:
 - start a stopwatch when solr receives the request from the client
 - stop the stopwatch and log the elapsed time right before solr hands
 the response body off to Jetty to be delivered back to the client.

 Thanks, as always!
  Aaron


Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document

2012-07-19 Thread Aaron Daubman
Robert,

 I have a solr 1.4.1 instance and a solr 3.6.0 instance, both configured as
  identically as possible (given deprecations) and indexing the same
 document.

 Why did you do this? If you want the exact same scoring, use the exact
 same analysis.
 This means specifying luceneMatchVersion = 2.9, and the exact same
 analysis components (even if deprecated).

  I have taken the field values for the example below and run them
  through /admin/analysis.jsp on each solr instance. Even for the
 problematic
  docs/fields, the results are almost identical. For the example below, the
  t_tag values for the problematic doc:
  1.4.1: 162 values
  3.6.0: 164 values
 

 This is why: you changed your analysis.


Apologies if I didn't clearly state my goal/concern: I am not looking for
the exact same scoring - I am looking to explain scoring differences.
 Deprecated components will eventually go away, time moves on, etc...
etc... I would like to be able to run current code, and should be able to -
the part that is sticking is being able to *explain* the difference in
results.

As you can see from my email, after running the different analysis on the
input, the output does not demonstrate (in any way that I can see) why the
fieldNorm values would be so different. Even with the different analysis,
the results are almost identical - which *should* result in an almost
identical fieldNorm???

Again, the desire is not to be the same, it is to understand the difference.

Thanks,
 Aaron


Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document

2012-07-19 Thread Aaron Daubman
Robert,

So this is lossy: basically you can think of there being only 256
 possible values. So when you increased the number of terms only
 slightly by changing your analysis, this happened to bump you over the
 edge rounding you up to the next value.

 more information:
 http://lucene.apache.org/core/3_6_0/scoring.html

 http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html



Thanks - this was extremely helpful! I had read both sources before but
didn't grasp the magnitude of lossy-ness until your pointer and mention of
edge-case.
Just to help out anybody else who might run in to this, I hacked together a
little harness to demonstrate:
---
fieldLength: 160, computeNorm: 0.07905694, floatToByte315: 109,
byte315ToFloat: 0.078125
fieldLength: 161, computeNorm: 0.07881104, floatToByte315: 109,
byte315ToFloat: 0.078125
fieldLength: 162, computeNorm: 0.07856742, floatToByte315: 109,
byte315ToFloat: 0.078125
fieldLength: 163, computeNorm: 0.07832605, floatToByte315: 109,
byte315ToFloat: 0.078125
fieldLength: 164, computeNorm: 0.07808688, floatToByte315: 108,
byte315ToFloat: 0.0625
fieldLength: 165, computeNorm: 0.077849895, floatToByte315: 108,
byte315ToFloat: 0.0625
fieldLength: 166, computeNorm: 0.07761505, floatToByte315: 108,
byte315ToFloat: 0.0625
---
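
(For anyone curious, the harness boils down to something like the following - a sketch assuming lucene-core 3.6.x on the classpath, using org.apache.lucene.util.SmallFloat and the 1/sqrt(numTerms) lengthNorm of DefaultSimilarity:)

---snip---
import org.apache.lucene.util.SmallFloat;

public class NormEncodingDemo {
    public static void main(String[] args) {
        for (int fieldLength = 160; fieldLength <= 166; fieldLength++) {
            // DefaultSimilarity's lengthNorm (no index-time boost, no overlaps)
            float norm = (float) (1.0 / Math.sqrt(fieldLength));
            byte encoded = SmallFloat.floatToByte315(norm);   // the lossy 8-bit encoding
            float decoded = SmallFloat.byte315ToFloat(encoded);
            System.out.println("fieldLength: " + fieldLength
                    + ", computeNorm: " + norm
                    + ", floatToByte315: " + encoded
                    + ", byte315ToFloat: " + decoded);
        }
    }
}
---snip---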

So my takeaway is that these scores that vary significantly are caused by:
1) a field with lengths right on this boundary between the two analyzer
chains
2) the fact that we might be searching for matches from 50+ values to a
field with 150+ values, and so the overall score is repeatedly impacted by
the otherwise typically insignificant change in fieldNorm value

Thanks again,
 Aaron


Frustrating differences in fieldNorm between two different versions of solr indexing the same document

2012-07-18 Thread Aaron Daubman
Greetings,

I've been digging in to this for two days now and have come up short -
hopefully there is some simple answer I am just not seeing:

I have a solr 1.4.1 instance and a solr 3.6.0 instance, both configured as
identically as possible (given deprecations) and indexing the same document.

For most queries the results are very close (scoring within three
significant differences, almost identical positions in results).

However, for certain documents, the scores are very different (causing
these docs to be ranked +/- 25 positions different or more in the results)

In looking at debugQuery output, it seems like this is due to fieldNorm
values being lower for the 3.6.0 instance than the 1.4.1.

(note that for most docs, the fieldNorms are identical)

I have taken the field values for the example below and run them
through /admin/analysis.jsp on each solr instance. Even for the problematic
docs/fields, the results are almost identical. For the example below, the
t_tag values for the problematic doc:
1.4.1: 162 values
3.6.0: 164 values

note that 1/sqrt(162) = 0.07857 ~= fieldNorm for 1.4.1;
however, (1/0.0625)^2 = 256, which is nowhere near 164

Here is a particular example from 1.4.1:
1.6263733 = (MATCH) fieldWeight(t_tag:soul in 2066419), product of:
   3.8729835 = tf(termFreq(t_tag:soul)=15)
   5.3750753 = idf(docFreq=27619, maxDocs=2194294)
   0.078125 = fieldNorm(field=t_tag, doc=2066419)

And the same from 3.6.0:
1.3042576 = (MATCH) fieldWeight(t_tag:soul in 1977957), product of:
   3.8729835 = tf(termFreq(t_tag:soul)=15)
   5.388126 = idf(docFreq=27740, maxDocs=2232857)
   0.0625 = fieldNorm(field=t_tag, doc=1977957)


Here is the 1.4.1 config for the t_tag field and text type:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  </analyzer>
</fieldtype>
<dynamicField name="t_*" type="text" indexed="true" stored="true"
    required="false" multiValued="true" termVectors="true"/>


And 3.6.0 schema config for the t_tag field and text type:
<fieldtype name="text" class="solr.TextField"
    positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
<field name="t_tag" type="text" indexed="true" stored="true"
    required="false" multiValued="true"/>

I at first got distracted by this change between versions:
LUCENE-2286: Enabled DefaultSimilarity.setDiscountOverlaps by default. This
means that terms with a position increment gap of zero do not affect the
norms calculation by default.
However, this doesn't appear to be causing the issue as, according to
analysis.jsp there is no overlap for t_tag...

Can you point me to where these fieldNorm differences are coming from and
why they'd only be happening for a select few documents for which the content
doesn't stand out?

Thank you,
 Aaron


Debugging jetty IllegalStateException errors?

2012-07-04 Thread Aaron Daubman
Greetings,

I'm wondering if anybody has experienced (and found root cause) for errors
like this. We're running Solr 3.6.0 with latest stable Jetty 7
(7.6.4.v20120524).
I know this is likely due to a client (or the server) terminating the
connection unexpectedly, but we see these fairly frequently and can't
determine what the impact is or why they are happening (who is closing
early, why?)

Any tips/tricks on troubleshooting or what to do to possibly minimize or
help prevent these from happening (we are using a fairly old python client
to programmatically access this solr instance)?

---snip---
17:25:13,250 [qtp581536050-12] WARN  jetty.server.Response null - Committed
before 500 null

org.eclipse.jetty.io.EofException
at
org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:952)
at
org.eclipse.jetty.http.AbstractGenerator.flush(AbstractGenerator.java:438)
at org.eclipse.jetty.server.HttpOutput.flush(HttpOutput.java:94)
at
org.eclipse.jetty.server.AbstractHttpConnection$Output.flush(AbstractHttpConnection.java:1016)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:278)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
at org.apache.solr.common.util.FastWriter.flush(FastWriter.java:115)
at
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:353)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1332)
at
org.eclipse.jetty.servlets.UserAgentFilter.doFilter(UserAgentFilter.java:77)
at
org.eclipse.jetty.servlets.GzipFilter.doFilter(GzipFilter.java:247)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1332)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:477)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:225)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1031)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:406)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:186)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:965)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
at org.eclipse.jetty.server.Server.handle(Server.java:348)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:452)
at
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:894)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:948)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:851)
at
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at
org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:77)
at
org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:620)
at
org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:46)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:603)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:538)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.nio.channels.ClosedChannelException
at
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:137)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:359)
at java.nio.channels.SocketChannel.write(SocketChannel.java:360)
at
org.eclipse.jetty.io.nio.ChannelEndPoint.gatheringFlush(ChannelEndPoint.java:371)
at
org.eclipse.jetty.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:330)
at
org.eclipse.jetty.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:330)
at
org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:876)
... 37 more

17:25:13,250 [qtp581536050-12] WARN  jetty.servlet.ServletHandler null -
/solr/artists/select java.lang.IllegalStateException: Committed
at org.eclipse.jetty.server.Response.resetBuffer(Response.java:1087)
at 

Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

2012-06-11 Thread Aaron Daubman
While I look into doing some refactoring, as well as creating some new
UpdateRequestProcessors (and/or backporting), would you please point me to
some reading material on why you say the following:

In this day and age, a custom update handler is almost never the right
 answer to a problem -- nor is a custom request handler that does updates
 (those two things are actually different) ... my advice is always to
 start by trying to implement what you need as an UpdateRequestProcessor,
 and if that doesn't work out then refactor your code to be a Request
 Handler instead.


e.g. benefits of UpdateRequestProcessor over custom update handler?

Thanks again for the great pointers,
  Aaron


Re: What would cause: SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory cannot be cast to org.apache.solr.analysis.TokenizerFactory

2012-06-10 Thread Aaron Daubman
Jack,

Thanks - this was indeed the issue. I still don't understand exactly why
(the same local-nexus-hosted Solr jars were the ones being duplicated on
the classpath: included in my custom -with-dependencies jars as well as in
the solr war, which was built/distributed/hosted from the same nexus
repo used to host my jars) but shading solr from my -with-dependencies jars
fixed the issue.
(if anybody could point me to reading on why this happened - e.g. the
classes on the classpath would be duplicated but identical, in
my naive understanding of the classloader this should have still just
worked - it would be appreciated)

Thanks again,
 Aaron

On Sat, Jun 9, 2012 at 2:40 PM, Jack Krupansky j...@basetechnology.com wrote:

 Make sure there are no stray jars/classes in your jar, especially any that
 might contain BaseTokenizerFactory or TokenizerFactory. I notice that your
 jar name says -with-dependencies, raising a little suspicion. The
 exception is as if your class was referring to a BaseTokenizerFactory,
 which implements TokenizerFactory, coming from your jar (or a contained
 jar) rather than getting resolved to Solr 3.6's own BaseTokenizerFactory
 and TokenizerFactory.

 -- Jack Krupansky

 -Original Message- From: Aaron Daubman
 Sent: Saturday, June 09, 2012 12:03 AM
 To: solr-user@lucene.apache.org
 Subject: What would cause: SEVERE: java.lang.ClassCastException:
 com.company.MyCustomTokenizerFactory cannot be cast to
 org.apache.solr.analysis.TokenizerFactory


 Greetings,

 I am in the process of updating custom code and schema from Solr 1.4 to
 3.6.0 and have run into the following issue with our two custom Tokenizer
 and Token Filter components.

 I've been banging my head against this one for far too long, especially
 since it must be something obvious I'm missing.

 I have  custom Tokenizer and Token Filter components along with
 corresponding factories. The code for all looks very similar to the
 Tokenizer and TokenFilter (and Factory) code that is standard with 3.6.0
 (and I have also read through
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

 I have ensured my custom code is on the classpath, it is
 in ENSolrComponents-1.0-SNAPSHOT-jar-with-dependencies.jar:
 ---output snip---
 Jun 8, 2012 10:41:00 PM org.apache.solr.core.CoreContainer load
 INFO: loading shared library: /opt/test_artists_solr/jetty-solr/lib/en
 Jun 8, 2012 10:41:00 PM org.apache.solr.core.SolrResourceLoader
 replaceClassLoader
 INFO: Adding
 'file:/opt/test_artists_solr/jetty-solr/lib/en/ENSolrComponents-1.0-SNAPSHOT-jar-with-dependencies.jar'
 to classloader
 Jun 8, 2012 10:41:00 PM org.apache.solr.core.SolrResourceLoader
 replaceClassLoader
 INFO: Adding
 'file:/opt/test_artists_solr/jetty-solr/lib/en/ENUtil-1.0-SNAPSHOT-jar-with-dependencies.jar'
 to classloader
 Jun 8, 2012 10:41:00 PM org.apache.solr.core.CoreContainer create
 --snip---

 After successfully parsing the schema and creating many fields, etc.. the
 following is logged:
 ---snip---
 Jun 8, 2012 10:41:00 PM org.apache.solr.util.plugin.AbstractPluginLoader
 load
 INFO: created : com.company.MyCustomTokenizerFactory
 Jun 8, 2012 10:41:00 PM org.apache.solr.common.SolrException log
 SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory
 cannot be cast to org.apache.solr.analysis.TokenizerFactory
 at org.apache.solr.schema.IndexSchema$5.init(IndexSchema.java:966)
 at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:148)
 at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:986)
 at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:60)
 at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:453)
 at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:433)
 at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
 at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:490)
 at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:123)
 at org.apache.solr.core.CoreContainer.create(CoreContainer.java:481)
 at org.apache.solr.core.CoreContainer.load(CoreContainer.java:335)
 at org.apache.solr.core.CoreContainer.load(CoreContainer.java:219)
 at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:161)
 at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:96)
 at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:102)
 at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
 at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:748)
 at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:249)
 at
 org.eclipse.jetty.webapp

Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

2012-06-10 Thread Aaron Daubman
Hoss,

The new FieldValueSubsetUpdateProcessorFactory classes look phenomenal. I
haven't looked yet, but what are the chances these will be back-ported to
3.6 (or how hard would it be to backport them?)... I'll have to check out
the source in more detail.

If stuck on 3.6, what would be the best way to deal with this situation?
It's currently looking like it will have to be a custom update handler, but
I'd hate to have to go down this route if there are more future-proof
options.

Thanks again,
 Aaron

On Tue, Jun 5, 2012 at 6:53 PM, Chris Hostetter hossman_luc...@fucit.org wrote:


 : The real issue here is that the docs are created externally, and the
 : producer won't (yet) guarantee that fields that should appear once will
 : actually appear once. Because of this, I don't want to declare the field
 as
 : multiValued=false as I don't want to cause indexing errors. It would be
 : great for me (and apparently many others after searching) if there were
 an
 : option as simple as forceSingleValued=true - where some deterministic
 : behavior such as use first field encountered, ignore all others, would
 : occur.

 This will be trivial in Solr 4.0, using one of the new
 FieldValueSubsetUpdateProcessorFactory classes that are now available --
 just pick your rule...


 https://builds.apache.org/view/G-L/view/Lucene/job/Solr-trunk/javadoc/org/apache/solr/update/processor/FieldValueSubsetUpdateProcessorFactory.html
 Direct Known Subclasses:
FirstFieldValueUpdateProcessorFactory,
LastFieldValueUpdateProcessorFactory,
MaxFieldValueUpdateProcessorFactory,
MinFieldValueUpdateProcessorFactory
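
(Once on 4.0, wiring one of those into an update chain would look roughly like the following - a sketch only, with the chain name and field name assumed; see the javadoc above for the actual selector options:)

---snip---
<updateRequestProcessorChain name="keep-first-value">
  <processor class="solr.FirstFieldValueUpdateProcessorFactory">
    <str name="fieldName">f_normalizedValue</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
---snip---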

 -Hoss



What would cause: SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory cannot be cast to org.apache.solr.analysis.TokenizerFactory

2012-06-08 Thread Aaron Daubman
Greetings,

I am in the process of updating custom code and schema from Solr 1.4 to
3.6.0 and have run into the following issue with our two custom Tokenizer
and Token Filter components.

I've been banging my head against this one for far too long, especially
since it must be something obvious I'm missing.

I have  custom Tokenizer and Token Filter components along with
corresponding factories. The code for all looks very similar to the
Tokenizer and TokenFilter (and Factory) code that is standard with 3.6.0
(and I have also read through
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

I have ensured my custom code is on the classpath, it is
in ENSolrComponents-1.0-SNAPSHOT-jar-with-dependencies.jar:
---output snip---
Jun 8, 2012 10:41:00 PM org.apache.solr.core.CoreContainer load
INFO: loading shared library: /opt/test_artists_solr/jetty-solr/lib/en
Jun 8, 2012 10:41:00 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader
INFO: Adding
'file:/opt/test_artists_solr/jetty-solr/lib/en/ENSolrComponents-1.0-SNAPSHOT-jar-with-dependencies.jar'
to classloader
Jun 8, 2012 10:41:00 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader
INFO: Adding
'file:/opt/test_artists_solr/jetty-solr/lib/en/ENUtil-1.0-SNAPSHOT-jar-with-dependencies.jar'
to classloader
Jun 8, 2012 10:41:00 PM org.apache.solr.core.CoreContainer create
--snip---

After successfully parsing the schema and creating many fields, etc.. the
following is logged:
---snip---
Jun 8, 2012 10:41:00 PM org.apache.solr.util.plugin.AbstractPluginLoader
load
INFO: created : com.company.MyCustomTokenizerFactory
Jun 8, 2012 10:41:00 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory
cannot be cast to org.apache.solr.analysis.TokenizerFactory
at org.apache.solr.schema.IndexSchema$5.init(IndexSchema.java:966)
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:148)
at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:986)
at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:60)
at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:453)
at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:433)
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:490)
at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:123)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:481)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:335)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:219)
at
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:161)
at
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:96)
at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:102)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at
org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:748)
at
org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:249)
at
org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1222)
at
org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:676)
at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:455)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at
org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:36)
at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:183)
at
org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:491)
at
org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:138)
at
org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:142)
at
org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:53)
at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:604)
at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:535)
at org.eclipse.jetty.util.Scanner.scan(Scanner.java:398)
at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:332)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at
org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:118)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at
org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:552)
at
org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:227)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at
org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:63)
at
org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:53)
at

Re: What would cause: SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory cannot be cast to org.apache.solr.analysis.TokenizerFactory

2012-06-08 Thread Aaron Daubman
Just in case it is helpful, here are the relevant pieces of my schema.xml:

---snip--
<fieldtype name="customfield" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="com.company.MyCustomTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!--filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/-->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!--filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/-->
  </analyzer>
</fieldtype>
---snip---

and

---snip---
<fieldtype name="customterms" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="com.company.MyCustomFilterFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="false"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\-" replacement=" " replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="&amp;amp;" replacement="&amp;" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldtype>
---snip---

On Sat, Jun 9, 2012 at 12:03 AM, Aaron Daubman daub...@gmail.com wrote:

 Greetings,

 I am in the process of updating custom code and schema from Solr 1.4 to
 3.6.0 and have run into the following issue with our two custom Tokenizer
 and Token Filter components.

 I've been banging my head against this one for far too long, especially
 since it must be something obvious I'm missing.

 I have  custom Tokenizer and Token Filter components along with
 corresponding factories. The code for all looks very similar to the
 Tokenizer and TokenFilter (and Factory) code that is standard with 3.6.0
 (and I have also read through
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

 I have ensured my custom code is on the classpath, it is
 in ENSolrComponents-1.0-SNAPSHOT-jar-with-dependencies.jar:
 ---output snip---
 Jun 8, 2012 10:41:00 PM org.apache.solr.core.CoreContainer load
 INFO: loading shared library: /opt/test_artists_solr/jetty-solr/lib/en
 Jun 8, 2012 10:41:00 PM org.apache.solr.core.SolrResourceLoader
 replaceClassLoader
 INFO: Adding
 'file:/opt/test_artists_solr/jetty-solr/lib/en/ENSolrComponents-1.0-SNAPSHOT-jar-with-dependencies.jar'
 to classloader
 Jun 8, 2012 10:41:00 PM org.apache.solr.core.SolrResourceLoader
 replaceClassLoader
 INFO: Adding
 'file:/opt/test_artists_solr/jetty-solr/lib/en/ENUtil-1.0-SNAPSHOT-jar-with-dependencies.jar'
 to classloader
 Jun 8, 2012 10:41:00 PM org.apache.solr.core.CoreContainer create
 --snip---

 After successfully parsing the schema and creating many fields, etc.. the
 following is logged:
 ---snip---
 Jun 8, 2012 10:41:00 PM org.apache.solr.util.plugin.AbstractPluginLoader
 load
 INFO: created : com.company.MyCustomTokenizerFactory
 Jun 8, 2012 10:41:00 PM org.apache.solr.common.SolrException log
 SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory
 cannot be cast to org.apache.solr.analysis.TokenizerFactory
 at org.apache.solr.schema.IndexSchema$5.init(IndexSchema.java:966)
  at
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:148)
 at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:986)
  at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:60)
 at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:453)
  at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:433)
 at
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:490)
 at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:123)
  at org.apache.solr.core.CoreContainer.create(CoreContainer.java:481)
 at org.apache.solr.core.CoreContainer.load(CoreContainer.java:335)
  at org.apache.solr.core.CoreContainer.load(CoreContainer.java:219

Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

2012-06-05 Thread Aaron Daubman
Thanks for the responses,

By saying dirty data you imply that only one of the values is good or
 clean and that the others can be safely discarded/ignored, as opposed to
 true multi-valued data where each value is there for good reason and needs
 to be preserved. In any case, how do you know/decide which value should be
 used for sorting - and did you just get lucky that Solr happened to use the
 right one?


I haven't gone back and checked the old version's docs where this was
working, however, I suspect that either the field never ended up
appearing in docs more than once, or if it did, it had the same value
repeated...

The real issue here is that the docs are created externally, and the
producer won't (yet) guarantee that fields that should appear once will
actually appear once. Because of this, I don't want to declare the field as
multiValued=false as I don't want to cause indexing errors. It would be
great for me (and apparently many others after searching) if there were an
option as simple as forceSingleValued=true - where some deterministic
behavior such as use first field encountered, ignore all others, would
occur.


 The preferred technique would be to preprocess and clean the data before
 it is handed to Solr or SolrJ, even if the source must remain dirty.
 Barring that, a preprocessor or a custom update processor certainly.


I could write preprocessors (this is really what will eventually happen
when the producer cleans their data),  custom processors, etc... however,
for something this simple it would be great not to be producing more code
that would have to be maintained.



 Please clarify exactly how the data is being fed into Solr.


 I am using generic code to read from a key/value store and compose
documents. This is another reason fixing the data at this point would not
be desirable: the currently generic code would need to be made specific to
look for these particular fields and then coerce them to single values...

Thanks again,
  Aaron


Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

2012-06-04 Thread Aaron Daubman
Greetings,

I have dirty source data where some documents being indexed, although
unlikely, may contain multivalued fields that are also required for
sorting. In previous versions of Solr, sorting on this field worked fine
(possibly because few or no multivalued fields were ever encountered?),
however, as of 3.6.0, thanks to
https://issues.apache.org/jira/browse/SOLR-2339 attempting to sort on this
field now throws an error:

[2012-06-04 17:20:01,691] ERROR org.apache.solr.common.SolrException
org.apache.solr.common.SolrException: can not sort on multivalued field:
f_normalizedValue

The relevant bits of the schema.xml are:
<fieldType name="sfloat" class="solr.TrieFloatField" precisionStep="0"
    positionIncrementGap="0" sortMissingLast="true"/>
<dynamicField name="f_*" type="sfloat" indexed="true" stored="true"
    required="false" multiValued="true"/>

Assuming that the source documents being indexed cannot be changed (which,
at least for now, they cannot), what would be the next best way to allow
for both the possibility of multiple f_normalizedValue fields appearing in
indexed documents, as well as being able to sort by f_normalizedValue?

Thank you,
 Aaron


Re: Tips on creating a custom QueryCache?

2012-05-30 Thread Aaron Daubman
Hoss,


: 1) Any recommendations on which best to sub-class? I'm guessing, for this
 : scenario with rare batch puts and no evictions, I'd be looking for get
 : performance. This will also be on a box with many CPUs - so I wonder if
 the
 : older LRUCache would be preferable?

 i suspect you are correct ... the entire point of the other caches is
 dealing with faster replacement, so you really don't care.

 You might even find it worthwhile to write your own
 NoReplacementCache from scratch backed by a HashMap (instead of the
 LinkedHashMap used in LRUCache)


I really like this idea (roll-your-own cache using a simple HashMap).
However, as much searching as I've done, I've come up short on anything
that describes concurrency in Solr. The short question is, for such a
cache, do I need to worry about concurrent access (I'm guessing that the
firstSearcher QuerySenderListener process would be
single-threaded/non-concurrent, and thus writes would never be an issue -
is this correct?) - e.g. for my case, would I back the NoReplacementCache
with a HashMap or ? The bigger question is: what are the parallel task
execution paths in Solr and under what conditions are they possible?

Thanks again,
 Aaron


Example setup of using Solr 3.6.0 with Jetty 7 (7.6.3)?

2012-05-29 Thread Aaron Daubman
Greetings,

Has anybody gotten Solr 3.6.0 to work well with Jetty 7.6.3, and if so,
would you mind sharing your config files / directory structure / other
useful details?

Thanks,
 Aaron


Generating maven artifacts for 3.6.0 build - correct -Dversion to use?

2012-05-25 Thread Aaron Daubman
Greetings,

Following the directions here:
http://svn.apache.org/repos/asf/lucene/dev/trunk/dev-tools/maven/README.maven

for building Lucene/Solr with Maven, what is the correct -Dversion to pass
in to get-maven-poms?

This seems set up for building -SNAPSHOT, however, I would like to use
maven to build the 3.6.0 tag.

If I set version to 3.6.0, however, this causes issues with lucene, which
seems to really only want version 3.6 (without the trailing .0) and even
causes the version check test to fail.

What is the correct version to pass in to get-maven-poms for a 3.6.0
release build via maven?

Thanks,
  Aaron


Re: Tips on creating a custom QueryCache?

2012-05-24 Thread Aaron Daubman
Thanks for the reply,

Do you have any pointers to relevant Docs or Examples that show how this
should be chained together?

Thanks again,
 Aaron

On Thu, May 24, 2012 at 3:03 AM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:

 Perhaps this could be a custom SearchComponent that's run before the usual
 QueryComponent?
 This component would be responsible for loading queries, executing them,
 caching results, and for returning those results when these queries are
 encountered later on.
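
(For what it's worth, hooking such a component in ahead of QueryComponent is just solrconfig.xml plumbing - a sketch with an assumed component class and handler name:)

---snip---
<!-- hypothetical custom component class -->
<searchComponent name="staticQueryCache"
                 class="com.example.StaticQueryCacheComponent"/>

<requestHandler name="/select" class="solr.SearchHandler">
  <arr name="first-components">
    <str>staticQueryCache</str>
  </arr>
</requestHandler>
---snip---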

 Otis

 
  From: Aaron Daubman daub...@gmail.com
 Subject: Tips on creating a custom QueryCache?
 
 Greetings,
 
 I'm looking for pointers on where to start when creating a
 custom QueryCache.
 Our usage patterns are possibly a bit unique, so let me explain the
 desired
 use case:
 
 Our Solr index is read-only except for dedicated periods where it is
 updated and re-optimized.
 
 On startup, I would like to create a specific QueryCache that would cache
 the top ~20,000 (arbitrary but large) queries. This cache should never
  evict entries and, after the warming process that populates it, should never
 be added to either.
 
 The warming process would be to run through the (externally determined)
 list of anticipated top X (say 20,000) queries and cache these results.
 
 This cache would then be used for the duration of the solr run-time (until
 the period, perhaps daily, where the index is updated and re-optimized, at
 which point the cache would be re-created)
 
 Where should I begin looking to implement such a cache?
 
 The reason for this somewhat different approach to caching is that we may
 get any number of odd queries throughout the day for which performance
 isn't important, and we don't want any of these being added to the cache
 or
 evicting other entries from the cache. We need to ensure high performance
 for this pre-determined list of queries only (while still handling other
 arbitrary queries, if not as quickly)
 
 Thanks,
   Aaron



Re: Tips on creating a custom QueryCache?

2012-05-24 Thread Aaron Daubman
Hoss, brilliant as always - many thanks! =)

Subclassing the SolrCache class sounds like a good way to accomplish this.

Some questions:
1) Any recommendations on which best to sub-class? I'm guessing, for this
scenario with rare batch puts and no evictions, I'd be looking for get
performance. This will also be on a box with many CPUs - so I wonder if the
older LRUCache would be preferable?

2) Would I need to worry about auto warming at all? I'm still a little
foggy on the lifecycle of firstSearcher versus newSearcher (is firstSearcher
really only ever called the first time the solr instance is started?). In
any case, since the only time a commit would occur is when batch updates,
re-indexing and re-optimizing occurs (once a day off-peak perhaps) I
*think* I would always want to perform the same static warming rather
than attempting to auto-warm from the old cache - does this make sense?

Thanks again!
 Aaron

On Thu, May 24, 2012 at 7:38 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 Interesting problem,

 w/o making any changes to Solr, you could probably get this behavior by:
  a) sizing your cache large enough.
  b) using a firstSearcher that generates your N queries on startup
  c) configure autowarming of 100%
  d) ensure every query you send uses cache=false


 The tricky part being d.

 But if you don't mind writing a little java, i think this should actually
 be fairly trivial to do w/o needing d at all...

 1) subclass the existing SolrCache class of your choice.
 2) in your subclass, make put be a No-Op if getState()==LIVE, else
 super.put(...)

 ...so during any warming phase (either static from
 firstSearcher/newSearcher, or because of autowarming) the cache will
 accept new objects, but once warming is done it will ignore requests to
 add new items (so it will never evict anything)
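
(A minimal sketch of steps 1 and 2 against 3.6 - the package, class name, and the choice of LRUCache as the base class are all assumptions:)

---snip---
package com.example.solr;

import org.apache.solr.search.LRUCache;
import org.apache.solr.search.SolrCache;

// Behaves exactly like LRUCache while warming, then silently drops puts once
// the searcher goes LIVE, so nothing can be added to (or evicted from) the
// statically warmed cache afterwards.
public class NoReplacementLRUCache extends LRUCache {
    @Override
    public Object put(Object key, Object value) {
        if (getState() == SolrCache.State.LIVE) {
            return null; // ignore post-warming inserts
        }
        return super.put(key, value);
    }
}
---snip---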

 Then all you need is a firstSearcher event listener that feeds you your N
 queries (model it after QuerySenderListener but have it read from
 whatever source you want instead of the solrconfig.xml)
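
(For reference, the stock QuerySenderListener that such a listener would be modeled on is configured like this in solrconfig.xml - the queries shown are placeholders:)

---snip---
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">popular query one</str><str name="rows">10</str></lst>
    <lst><str name="q">popular query two</str><str name="rows">10</str></lst>
  </arr>
</listener>
---snip---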

 : The reason for this somewhat different approach to caching is that we may
 : get any number of odd queries throughout the day for which performance
 : isn't important, and we don't want any of these being added to the cache
 or
 : evicting other entries from the cache. We need to ensure high performance
 : for this pre-determined list of queries only (while still handling other
 : arbitrary queries, if not as quickly)

 FWIW: my de facto way of dealing with this in the past was to siloize my
 slave machines by use case.  For example, in one index: i had 1 master,
 which replicated to 2*N slaves, as well as a repeater.  The 2*N slaves
 were behind 2 diff load balancers (N even numbered machines and N odd
 numbered machines), and the two sets of slaves had diff static cache
 warming configs - even numbered machines warmed queries common to
 browsing categories, odd numbered machines warmed top-searches.  If the
 front end was doing an arbitrary search, it was routed to the load balancer
 for the odd-numbered slaves.  if the front end was doing a category
 browse, the query was routed to the even-numbered slaves.  Meanwhile: the
 repeater was replicating out to a bunch of smaller one-off boxes with
 cache configs by use case, i.e.: the data-warehouse and analytics team had
 their own slave they could run their own complex queries against.  the
 tools team had a dedicated slave that various internal tools would query
 via ajax to get metadata, etc...

 -Hoss



Tips on creating a custom QueryCache?

2012-05-23 Thread Aaron Daubman
Greetings,

I'm looking for pointers on where to start when creating a
custom QueryCache.
Our usage patterns are possibly a bit unique, so let me explain the desired
use case:

Our Solr index is read-only except for dedicated periods where it is
updated and re-optimized.

On startup, I would like to create a specific QueryCache that would cache
the top ~20,000 (arbitrary but large) queries. This cache should never
evict entries and, after the warming process that populates it, should never
be added to either.

The warming process would be to run through the (externally determined)
list of anticipated top X (say 20,000) queries and cache these results.

This cache would then be used for the duration of the solr run-time (until
the period, perhaps daily, where the index is updated and re-optimized, at
which point the cache would be re-created)

Where should I begin looking to implement such a cache?

The reason for this somewhat different approach to caching is that we may
get any number of odd queries throughout the day for which performance
isn't important, and we don't want any of these being added to the cache or
evicting other entries from the cache. We need to ensure high performance
for this pre-determined list of queries only (while still handling other
arbitrary queries, if not as quickly)

Thanks,
  Aaron