PostingsSolrHighlighter not working on Multivalue field

2013-06-18 Thread Floyd Wu
In my test case, it seems this new highlighter is not working.

When a field is set to multiValued=true, the stored text in that field cannot be
highlighted.

Am I missing something, or is this a current limitation? I have had no luck finding
any documentation that mentions this.

Floyd


Re: yet another optimize question

2013-06-18 Thread Walter Underwood
Your query cache is far too small. Most of the default caches are too small.

We run with 10K entries and get a hit rate around 0.30 across four servers. 
This rate goes up with more queries, down with fewer, but try a bigger cache, 
especially if you are updating the index infrequently, like once per day.

At Netflix, we had a 0.12 hit rate on the query cache, even with an HTTP cache 
in front of it. The HTTP cache had an 80% hit rate.

I'd increase your document cache, too. I usually see about 0.75 or better on 
that.

wunder

On Jun 18, 2013, at 10:22 AM, Petersen, Robert wrote:

> Hi Otis, 
> 
> Yes, the query results cache is just about worthless.  I guess we have too 
> diverse a set of user queries.  The business unit has decided to let bots 
> crawl our search pages too so that doesn't help either.  I turned it way down 
> but decided to keep it because my understanding was that it would still help 
> for users going from page 1 to page 2 in a search.  Is that true?
> 
> Thanks
> Robi
> 
> -Original Message-
> From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
> Sent: Monday, June 17, 2013 6:39 PM
> To: solr-user@lucene.apache.org
> Subject: Re: yet another optimize question
> 
> Hi Robi,
> 
> This goes against the original problem of getting OOMEs, but it looks like 
> each of your Solr caches could be a little bigger if you want to eliminate 
> evictions, with the query results one possibly not being worth keeping if you 
> can't get the hit % up enough.
> 
> Otis
> --
> Solr & ElasticSearch Support -- http://sematext.com/
> 
> 
> On Mon, Jun 17, 2013 at 2:21 PM, Petersen, Robert 
>  wrote:
>> Hi Otis,
>> 
>> Right I didn't restart the JVMs except on the one slave where I was 
>> experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I 
>> made all our caches small enough to keep us from getting OOMs while still 
>> having a good hit rate.Our index has about 50 fields which are mostly 
>> int IDs and there are some dynamic fields also.  These dynamic fields can be 
>> used for custom faceting.  We have some standard facets we always facet on 
>> and other dynamic facets which are only used if the query is filtering on a 
>> particular category.  There are hundreds of these fields but since they are 
>> only for a small subset of the overall index they are very sparsely 
>> populated with regard to the overall index.  With CMS GC we get a sawtooth 
>> on the old generation (I guess every replication and commit causes its 
>> usage to drop down to 10GB or so) and it seems to be the old generation 
>> which is the main space consumer.  With the G1GC, the memory map looked 
>> totally different!  I was a little lost looking at memory consumption with 
>> that GC.  Maybe I'll try it again now that the index is a bit smaller than 
>> it was last time I tried it.  After four days without running an optimize 
>> now it is 21GB.  BTW our indexing speed is mostly bound by the DB so 
>> reducing the segments might be ok...
>> 
>> Here is a quick snapshot of one slaves memory map as reported by PSI-Probe, 
>> but unfortunately I guess I can't send the history graphics to the solr-user 
>> list to show their changes over time:
>> Name                 Used       Committed   Max         Initial     Group
>> Par Survivor Space   20.02 MB   108.13 MB   108.13 MB   108.13 MB   HEAP
>> CMS Perm Gen         42.29 MB   70.66 MB    82.00 MB    20.75 MB    NON_HEAP
>> Code Cache           9.73 MB    9.88 MB     48.00 MB    2.44 MB     NON_HEAP
>> CMS Old Gen          20.22 GB   30.94 GB    30.94 GB    30.94 GB    HEAP
>> Par Eden Space       42.20 MB   865.31 MB   865.31 MB   865.31 MB   HEAP
>> Total                20.33 GB   31.97 GB    32.02 GB    31.92 GB    TOTAL
>> 
>> And here's our current cache stats from a random slave:
>> 
>> name:queryResultCache
>> class:   org.apache.solr.search.LRUCache
>> version: 1.0
>> description: LRU Cache(maxSize=488, initialSize=6, autowarmCount=6, 
>> regenerator=org.apache.solr.search.SolrIndexSearcher$3@461ff4c3)
>> stats:  lookups : 619
>> hits : 36
>> hitratio : 0.05
>> inserts : 592
>> evictions : 101
>> size : 488
>> warmupTime : 2949
>> cumulative_lookups : 681225
>> cumulative_hits : 73126
>> cumulative_hitratio : 0.10
>> cumulative_inserts : 602396
>> cumulative_evictions : 428868
>> 
>> 
>> name:   fieldCache
>> class:   org.apache.solr.search.SolrFieldCacheMBean
>> version: 1.0
>> description: Provides introspection of the Lucene FieldCache, this is 
>> **NOT** a cache that is managed by Solr.
>> stats:  entries_count : 359
>> 
>> 
>> name:documentCache
>> class:   org.apache.solr.search.LRUCache
>> version: 1.0
>> description: LRU Cache(maxSize=2048, initialSize=512, autowarmCount=10, 
>> regenerator=null)
>> s

Re: Merge tool based on mergefactor

2013-06-18 Thread Otis Gospodnetic
Hi,

You could call the optimize command directly on slaves, but specify
the target number of segments, e.g.
/solr/update?optimize=true&maxSegments=10

Not sure I recommend doing this on slaves, but you could - maybe you
have spare capacity.  You may also want to consider not doing it on
all your slaves at the same time...

Otis
--
Solr & ElasticSearch Support
http://sematext.com/
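
For reference, the same partial optimize is a one-liner from SolrJ. A minimal
sketch, assuming a plain HttpSolrServer pointed at one slave (the URL and the
target segment count are placeholders):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class PartialOptimize {
    public static void main(String[] args) throws Exception {
        // Hypothetical slave URL -- substitute your own host/core.
        SolrServer slave = new HttpSolrServer("http://localhost:8983/solr");
        // optimize(waitFlush, waitSearcher, maxSegments):
        // merge down to at most 10 segments instead of a full single-segment optimize.
        slave.optimize(true, true, 10);
        slave.shutdown();
    }
}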





On Tue, Jun 18, 2013 at 8:19 PM, Learner  wrote:
> We have a SOLR master, primarily for indexing, and a SOLR slave, primarily for
> searching. I see that the merge factor plays a key role in indexing as
> well as searching. I would like to have a high merge factor for my master
> instance and a low merge factor for the slave.
>
> As of now, since I just replicate the data from the master, the number of
> segments remains the same as on the master. Maybe it's already there ... but
> I feel it would be great if there were a tool with which we could merge the
> segments on the slave based on a merge factor specified in the slave config after
> replication. Is there any workaround to change the number of segments
> without doing complete reindexing?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Merge-tool-based-on-mergefactor-tp4071506.html
> Sent from the Solr - User mailing list archive at Nabble.com.


RE: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-06-18 Thread Bryan Loofbourrow
Also, in your position, I would be very curious what would happen to
highlighting performance, if I just took the EdgeNGramFilter out of the
analysis chain and reindexed. That would immediately tell you that the
problem lives there (or not).

-- Bryan

> -Original Message-
> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> Sent: Tuesday, June 18, 2013 5:16 PM
> To: 'solr-user@lucene.apache.org'
> Subject: RE: Slow Highlighter Performance Even Using
FastVectorHighlighter
>
> Andy,
>
> OK, I get what you're doing. As far as alternate paths, you could index
> normally and use WildcardQuery, but that wouldn't get you the boost on
> exact word matches. That makes me wonder whether there's a way to use
> edismax to combine the results of a wildcard search and a non-wildcard
> search against the same field, boosting the latter. I haven't looked
into
> it, but it seems possible that it might be done.
>
> I am perplexed at this point by the poor highlight performance you're
> seeing, but we do have your profiling data that suggests that you have a
> very large number of matches to contend with, so that's interesting.
>
> At this point, faced with your issue, I would step my way through the
> FastVectorHighlighter code. About the first thing it does for each field
> is walk the terms in the document, and retain only those that matched
some
> terms in the query. It may be interesting to see this set of terms it
ends
> up with -- is it excessively large for some reason?
>
> -- Bryan
>
> > -Original Message-
> > From: Andy Brown [mailto:andy_br...@rhoworld.com]
> > Sent: Friday, June 14, 2013 1:52 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Slow Highlighter Performance Even Using
> FastVectorHighlighter
> >
> > Bryan,
> >
> > For specifics, I'll refer you back to my original email where I
> > specified all the fields/field types/handlers I use. Here's a general
> > overview.
> >
> > I really only have 3 fields that I index and search against: "name",
> > "description", and "content". All of which are just general text
> > (string) fields. I have a catch-all field called "text" that is only
> > used for querying. It's indexed but not stored. The "name",
> > "description", and "content" fields are copied into the "text" field.
> >
> > For partial word matching, I have 4 more fields: "name_par",
> > "description_par", "content_par", and "text_par". The "text_par" field
> > has the same relationship to the "*_par" fields as "text" does to the
> > others (only used for querying). Those partial word matching fields
are
> > of type "text_general_partial" which I created. That field type is
> > analyzed different than the regular text field in that it goes through
> > an EdgeNGramFilterFactory with the minGramSize="2" and maxGramSize="7"
> > at index time.
> >
> > I query against both "text" and "text_par" fields using edismax
deftype
> > with my qf set to "text^2 text_par^1" to give full word matches a
higher
> > score. This part returns back very fast as previously stated. It's
when
> > I turn on highlighting that I take the huge performance hit.
> >
> > Again, I'm using the FastVectorHighlighting. The hl.fl is set to "name
> > name_par description description_par content content_par" so that it
> > returns highlights for full and partial word matches. All of those
> > fields have indexed, stored, termPositions, termVectors, and
termOffsets
> > set to "true".
> >
> > It all seems redundant just to allow for partial word
> > matching/highlighting but I didn't know of a better way. Does anything
> > stand out to you that could be the culprit? Let me know if you need
any
> > more clarification.
> >
> > Thanks!
> >
> > - Andy
> >
> > -Original Message-
> > From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> > Sent: Wednesday, May 29, 2013 5:44 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Slow Highlighter Performance Even Using
> > FastVectorHighlighter
> >
> > Andy,
> >
> > > I don't understand why it's taking 7 secs to return highlights. The
> > size
> > > of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set
> > to
> > > 1024 for this verification purpose and that should be more than
> > enough.
> > > The processor is plenty powerful enough as well.
> > >
> > > Running VisualVM shows all my CPU time being taken by mainly these 3
> > > methods:
> > >
> > >
> > > org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> > > org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> > > org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()
> >
> > That is a strange and interesting set of things to be spending most of
> > your CPU time on. The implication, I think, is that the number of term
> > matches in the document for terms in your query (or, at least, terms
> > matching exact words or the beginning of phrases in

Merge tool based on mergefactor

2013-06-18 Thread Learner
We have a SOLR master, primarily for indexing, and a SOLR slave, primarily for
searching. I see that the merge factor plays a key role in indexing as
well as searching. I would like to have a high merge factor for my master
instance and a low merge factor for the slave.

As of now, since I just replicate the data from the master, the number of
segments remains the same as on the master. Maybe it's already there ... but
I feel it would be great if there were a tool with which we could merge the
segments on the slave based on a merge factor specified in the slave config after
replication. Is there any workaround to change the number of segments
without doing complete reindexing?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Merge-tool-based-on-mergefactor-tp4071506.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-06-18 Thread Bryan Loofbourrow
Andy,

OK, I get what you're doing. As far as alternate paths, you could index
normally and use WildcardQuery, but that wouldn't get you the boost on
exact word matches. That makes me wonder whether there's a way to use
edismax to combine the results of a wildcard search and a non-wildcard
search against the same field, boosting the latter. I haven't looked into
it, but it seems possible that it might be done.

I am perplexed at this point by the poor highlight performance you're
seeing, but we do have your profiling data that suggests that you have a
very large number of matches to contend with, so that's interesting.

At this point, faced with your issue, I would step my way through the
FastVectorHighlighter code. About the first thing it does for each field
is walk the terms in the document, and retain only those that matched some
terms in the query. It may be interesting to see this set of terms it ends
up with -- is it excessively large for some reason?

-- Bryan
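
Andy's setup, as quoted below, boils down to a request along these lines. A
SolrJ sketch in which the field names and boosts come from his description and
everything else (URL, query text) is a placeholder:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PartialMatchHighlightQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("some search terms");
        q.set("defType", "edismax");
        q.set("qf", "text^2 text_par^1");  // full-word matches score higher than partials
        q.setHighlight(true);
        q.set("hl.useFastVectorHighlighter", "true");
        q.set("hl.fl", "name name_par description description_par content content_par");
        QueryResponse rsp = solr.query(q);
        System.out.println("QTime=" + rsp.getQTime() + " ms, hits=" + rsp.getResults().getNumFound());
    }
}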

> -Original Message-
> From: Andy Brown [mailto:andy_br...@rhoworld.com]
> Sent: Friday, June 14, 2013 1:52 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Slow Highlighter Performance Even Using
FastVectorHighlighter
>
> Bryan,
>
> For specifics, I'll refer you back to my original email where I
> specified all the fields/field types/handlers I use. Here's a general
> overview.
>
> I really only have 3 fields that I index and search against: "name",
> "description", and "content". All of which are just general text
> (string) fields. I have a catch-all field called "text" that is only
> used for querying. It's indexed but not stored. The "name",
> "description", and "content" fields are copied into the "text" field.
>
> For partial word matching, I have 4 more fields: "name_par",
> "description_par", "content_par", and "text_par". The "text_par" field
> has the same relationship to the "*_par" fields as "text" does to the
> others (only used for querying). Those partial word matching fields are
> of type "text_general_partial" which I created. That field type is
> analyzed different than the regular text field in that it goes through
> an EdgeNGramFilterFactory with the minGramSize="2" and maxGramSize="7"
> at index time.
>
> I query against both "text" and "text_par" fields using edismax deftype
> with my qf set to "text^2 text_par^1" to give full word matches a higher
> score. This part returns back very fast as previously stated. It's when
> I turn on highlighting that I take the huge performance hit.
>
> Again, I'm using the FastVectorHighlighting. The hl.fl is set to "name
> name_par description description_par content content_par" so that it
> returns highlights for full and partial word matches. All of those
> fields have indexed, stored, termPositions, termVectors, and termOffsets
> set to "true".
>
> It all seems redundant just to allow for partial word
> matching/highlighting but I didn't know of a better way. Does anything
> stand out to you that could be the culprit? Let me know if you need any
> more clarification.
>
> Thanks!
>
> - Andy
>
> -Original Message-
> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> Sent: Wednesday, May 29, 2013 5:44 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Slow Highlighter Performance Even Using
> FastVectorHighlighter
>
> Andy,
>
> > I don't understand why it's taking 7 secs to return highlights. The
> size
> > of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set
> to
> > 1024 for this verification purpose and that should be more than
> enough.
> > The processor is plenty powerful enough as well.
> >
> > Running VisualVM shows all my CPU time being taken by mainly these 3
> > methods:
> >
> > org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> > org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> > org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()
>
> That is a strange and interesting set of things to be spending most of
> your CPU time on. The implication, I think, is that the number of term
> matches in the document for terms in your query (or, at least, terms
> matching exact words or the beginning of phrases in your query) is
> extremely high. Perhaps that's coming from this "partial word match" you
> mention -- how does that work?
>
> -- Bryan
>
> > My guess is that this has something to do with how I'm handling
> partial
> > word matches/highlighting. I have setup another request handler that
> > only searches the whole word fields and it returns in 850 ms with
> > highlighting.
> >
> > Any ideas?
> >
> > - Andy
> >
> >
> > -Original Message-
> > From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> > Sent: Monday, May 20, 2013 1:39 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Slow Highlighter Performance Even Using
> > FastVectorHighlighter
> >
> > My guess is tha

Re: preserve special characters

2013-06-18 Thread Mingfeng Yang
Hi Jack,

That seems like the solution I am looking for. Thanks so much!

//Can't find this "types" for WDF anywhere.

Ming-


On Tue, Jun 18, 2013 at 4:52 PM, Jack Krupansky wrote:

> The WDF has a "types" attribute which can specify one or more character
> type mapping files. You could create a file like:
>
> @ => ALPHA
> _ => ALPHA
>
> For example (from the book!):
>
> Example - Treat at-sign and underscores as text
>
>  <fieldType name="text_at_under" class="solr.TextField"
>   positionIncrementGap="100" autoGeneratePhraseQueries="true">
>    <analyzer>
>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>      <filter class="solr.WordDelimiterFilterFactory"
>        types="at-under-alpha.txt"/>
>    </analyzer>
>  </fieldType>
>
> The file +at-under-alpha.txt+ would contain:
>
>  @ => ALPHA
>  _ => ALPHA
>
> The analysis results:
>
>Source: Hello @World_bar, r@end.
>Tokens: 1: Hello 2: @World_bar 3: r@end
>
>
> -- Jack Krupansky
>
> -Original Message- From: Mingfeng Yang
> Sent: Tuesday, June 18, 2013 6:58 PM
> To: solr-user@lucene.apache.org
> Subject: preserve special characters
>
>
> We need to index and search lots of tweets which can look like "@solr: solr is
> great" or "@solr_lucene, good combination".
>
> And we want to search with "@solr" or "@solr_lucene".  How can we preserve
> "@" and "_" in the index?
>
> If using WhitespaceTokenizer followed by WordDelimiterFilter, @solr_lucene
> will be broken down into "solr" and "lucene", which makes the search results
> contain lots of non-relevant docs.
>
> If using StandardTokenizer, the "@" symbol is stripped.
>
> Thanks,
> Ming-
>


Re: preserve special characters

2013-06-18 Thread Jack Krupansky
The WDF has a "types" attribute which can specify one or more character type 
mapping files. You could create a file like:


@ => ALPHA
_ => ALPHA

For example (from the book!):

Example - Treat at-sign and underscores as text

 <fieldType name="text_at_under" class="solr.TextField"
   positionIncrementGap="100" autoGeneratePhraseQueries="true">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>
   </analyzer>
 </fieldType>

The file +at-under-alpha.txt+ would contain:

 @ => ALPHA
 _ => ALPHA

The analysis results:

   Source: Hello @World_bar, r@end.
   Tokens: 1: Hello 2: @World_bar 3: r@end


-- Jack Krupansky

-Original Message- 
From: Mingfeng Yang

Sent: Tuesday, June 18, 2013 6:58 PM
To: solr-user@lucene.apache.org
Subject: preserve special characters

We need to index and search lots of tweets which can look like "@solr: solr is
great" or "@solr_lucene, good combination".

And we want to search with "@solr" or "@solr_lucene".  How can we preserve
"@" and "_" in the index?

If using WhitespaceTokenizer followed by WordDelimiterFilter, @solr_lucene
will be broken down into "solr" and "lucene", which makes the search results
contain lots of non-relevant docs.

If using StandardTokenizer, the "@" symbol is stripped.

Thanks,
Ming- 



Re: preserve special characters

2013-06-18 Thread Learner
You can use keyword tokenizer..

Creates org.apache.lucene.analysis.core.KeywordTokenizer.

Treats the entire field as a single token, regardless of its content.

Example: "http://example.com/I-am+example?Text=-Hello"; ==>
"http://example.com/I-am+example?Text=-Hello";



--
View this message in context: 
http://lucene.472066.n3.nabble.com/preserve-special-characters-tp4071488p4071496.html
Sent from the Solr - User mailing list archive at Nabble.com.


preserve special characters

2013-06-18 Thread Mingfeng Yang
We need to index and search lots of tweets which can look like "@solr: solr is
great" or "@solr_lucene, good combination".

And we want to search with "@solr" or "@solr_lucene".  How can we preserve
"@" and "_" in the index?

If using WhitespaceTokenizer followed by WordDelimiterFilter, @solr_lucene
will be broken down into "solr" and "lucene", which makes the search results
contain lots of non-relevant docs.

If using StandardTokenizer, the "@" symbol is stripped.

Thanks,
Ming-


TieredMergePolicy reclaimDeletesWeight

2013-06-18 Thread Petersen, Robert
Hi

In continuing a previous conversation, I am attempting to avoid having to do 
optimizes on our continuously updated index in Solr 3.6.1, and I came across the 
mention of the reclaimDeletesWeight setting in this blog: 
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

We do a *lot* of deletes in our index, so I want to make the merges more 
aggressive about reclaiming deletes, but I am having trouble finding much out 
about this setting.  Does anyone have experience with it?  Would the 
below accomplish what I want, i.e. make it go after deletes more aggressively 
than normal?  I got the impression 10.0 was the default from looking at this 
code but I could be wrong:
https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessfulBuild/clover-report/org/apache/lucene/index/TieredMergePolicy.html?id=3085


<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">20</int>
  <int name="segmentsPerTier">8</int>
  <double name="reclaimDeletesWeight">20.0</double>
</mergePolicy>

Thanks

Robert (Robi) Petersen
Senior Software Engineer
Search Department
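
For reference, a sketch of the same knobs through Lucene's TieredMergePolicy
setters. It only illustrates which setters the element names in the snippet
above are assumed to map to; the values are Robert's, not recommendations:

import org.apache.lucene.index.TieredMergePolicy;

public class MergePolicyTuning {
    public static void main(String[] args) {
        TieredMergePolicy mp = new TieredMergePolicy();
        mp.setMaxMergeAtOnce(20);
        mp.setSegmentsPerTier(8);
        // Higher values make merges that reclaim deleted docs more attractive.
        mp.setReclaimDeletesWeight(20.0);
        System.out.println(mp);
    }
}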



Re: Solr cloud: zkHost in solr.xml gets wiped out

2013-06-18 Thread Al Wold
I just finished a test with the patch, and it looks like all is working well.

On Jun 18, 2013, at 12:19 PM, Al Wold wrote:

> For the CREATE call, I'm doing it manually per the instructions here:
> 
> http://wiki.apache.org/solr/SolrCloud
> 
> Here's the exact URL I'm using:
> 
> http://asu-solr-cloud.elasticbeanstalk.com/admin/collections?action=CREATE&name=directory&numShards=2&replicationFactor=2&maxShardsPerNode=2
> 
> I'm testing out your patch now, and I'll let you know how it goes.
> 
> Thanks for all the help!
> 
> -Al
> 
> On Jun 18, 2013, at 6:47 AM, Erick Erickson wrote:
> 
>> OK, I think I see what's happening. If you do
>> NOT specify an instanceDir on the create
>> (and I'm doing this via the core admin
>> interface, not SolrJ) then the default is
>> used, but not persisted. If you _do_
>> specify the instance dir, it will be persisted.
>> 
>> I've put up another quick patch (tested
>> only in my test case, running full suite
>> now). Can you give it a whirl? You'll have
>> to apply the patch over top of the current
>> 4x, een though the patch is for trunk it
>> applied to 4x cleanly for me and the tests ran.
>> 
>> Thanks,
>> Erick
>> 
>> On Tue, Jun 18, 2013 at 9:02 AM, Erick Erickson  
>> wrote:
>>> OK, I put up a very preliminary patch attached to the bug
>>> if you want to try it out that addresses the extra junk being
>>> put in the  tag. Doesn't address the instanceDir issue
>>> since I haven't reproduced it yet.
>>> 
>>> Erick
>>> 
>>> On Tue, Jun 18, 2013 at 8:46 AM, Erick Erickson  
>>> wrote:
 Whoa! What's this junk?
 qt="/admin/cores" wt="javabin" version="2
 
 That shouldn't be being preserved, and the instancedir should be!
 
 So I'm guessing you're using SolrJ to create the core, but I just
 reproduced the problem (at least the 'wt="json" ') bit from the
 browser and even from one of my internal tests when I added
 extra parameters.
 
 That said, instanceDir is being preserved in my test, so I'm not
 seeing everything you're seeing, could you cut/paste your
 create code? I'll see if I can set up a test case for SolrJ to catch
 this too.
 
 See SOLR-4935
 
 Thanks for reporting!
 
 On Mon, Jun 17, 2013 at 5:39 PM, Al Wold  wrote:
> Hi Erick,
> I tried out your changes from the branch_4x branch. It looks good in 
> terms of preserving the zkHost, but I'm running into an exception because 
> it isn't persisting the instanceDir attribute on the  element.
> 
> I've got a few other things I need to take care of, but as soon as I have 
> time I'll dig in and see if I can figure out what's going on, and see 
> what changed to make this not work.
> 
> Here are details on what the files looked like before/after CREATE call:
> 
> original solr.xml:
> 
> 
> 
> 
>  hostContext="/"/>
> 
> 
> here's what was produced with 4.3 branch + a quick mod to preserve zkHost:
> 
> 
> 
>  hostContext="/">
>    instanceDir="directory_shard1_replica1/" transient="false" 
> name="directory_shard1_replica1" collection="directory"/>
>    instanceDir="directory_shard2_replica1/" transient="false" 
> name="directory_shard2_replica1" collection="directory"/>
> 
> 
> 
> here's what was produced with branch_4x 4.4-SNAPSHOT:
> 
> 
> 
>  distribUpdateSoTimeout="0" distribUpdateConnTimeout="0" hostPort="8080" 
> hostContext="/">
>    collection="directory" qt="/admin/cores" wt="javabin" version="2"/>
>    collection="directory" qt="/admin/cores" wt="javabin" version="2"/>
> 
> 
> 
> and here's the error from solr.log after restarting after the CREATE:
> 
> 2013-06-17 21:37:07,083 1874 [pool-2-thread-1] ERROR 
> org.apache.solr.core.CoreContainer  - 
> null:java.lang.NullPointerException: Missing required 'instanceDir'
>   at 
> org.apache.solr.core.CoreDescriptor.doInit(CoreDescriptor.java:133)
>   at 
> org.apache.solr.core.CoreDescriptor.(CoreDescriptor.java:87)
>   at org.apache.solr.core.CoreContainer.load(CoreContainer.java:365)
>   at org.apache.solr.core.CoreContainer.load(CoreContainer.java:221)
>   at 
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:190)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:124)
>   at 
> org.apache.catalina.core.ApplicationFilterConfig.initFilter(ApplicationFilterConfig.java:277)
>   at 
> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:258)
>   at 
> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:382)
>   at 
> org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:103)
>   at 
> org.apache.cata

Need help in constructing a view with selective columns of two table

2013-06-18 Thread Jenny Huang
 Hi,

I really need your input on my problem in constructing a view with
selective columns of table GENE and table TAXON.

I am importing data from database tables GENE and table TAXON into solr.
 The two tables are
connected through 'taxon' column in table GENE and 'taxon_oid' column in
table TAXON.  In other words, 'gene.taxon = taxon.taxon_oid'.

I want the view in solr to include the following columns:
 from table GENE: gene_oid, gene_display_name, locus_tag, scaffold, taxon
from table TAXON: domain

In other words, I want the 'domain' column from table TAXON to be
displayed in parallel with other columns in table GENE.  Unfortunately, the
'domain' column did not show up at all after the import.  See below for
related markup in data-config.xml and schema.xml.

data-config.xml:


..
















schema.xml:

   
   
   
   
   
   








The import goes smoothly.  However, when I run the query in the admin
browser, the 'domain' column never shows up.

http://localhost:8983/solr/imgdb/select?q=*%3A*&wt=xml&indent=true


  0
  7
  
true
*:*
1371500131137
xml
  
  
637000454
hypothetical protein
637000261
SBO_2569
637808244  



I want the <doc> element to be something like:

  
637000454
hypothetical protein
637000261
SBO_2569
637808244
bacteria 


Could anyone let me know what went wrong?


Thanks ahead,


Re: mm (Minimum 'Should' Match)

2013-06-18 Thread Chris Hostetter

: Thanks Chris.  That worked.. just one correction instead of *df -> qf *

if you're using multiple fields (with optional boosts) then yes, you need 
qf ... but in your example you knew exactly which (one) field you wanted, 
and df should work fine for that -- because qf defaults to df.


-Hoss


Re: mm (Minimum 'Should' Match)

2013-06-18 Thread anand_solr
Thanks Chris.  That worked.. just one correction instead of *df -> qf *


On Tue, Jun 18, 2013 at 2:05 PM, Chris Hostetter-3 [via Lucene] <
ml-node+s472066n4071423...@n3.nabble.com> wrote:

>
> : query something like
> :
> :
> http://localhost:8983/solr/select?q=(category:lcd+OR+category:led+OR+category:plasma)+AND+(manufacture:sony+OR+manufacture:samsung+OR+manufacture:apple)&facet.field=category&facet.field=manufacture&fl=id&mm=2
>
> Here's an example of something similar using the Solr 4.3 example data...
>
> http://localhost:8983/solr/collection1/select?debugQuery=true&q=%2B{!dismax+df%3Dfeatures+mm%3D2+v%3D%24features}+%2B{!dismax+df%3Dcat+mm%3D2+v%3D%24cat}+&wt=xml&indent=true&cat=electronics+connector+&features=power+ipod+
>
>
> the important (un url encoded) params here are...
>
>   q = +{!dismax df=features mm=2 v=$features} +{!dismax df=cat mm=2
> v=$cat}
>   features = power ipod 
>   cat = electronics connector 
>
> This takes advantage of the fact that the (default) "lucene" parser in
> solr can understand the "{!parser}" syntax to create subclauses using
> other parsers -- but one thing to watch out for is that if your entire
> query starts with "{!dismax}..." then it's going to try and use the dismax
> parser for the whole thing, so this won't do what you expect...
>
>   q = {!dismax df=features mm=2 v=$features} AND {!dismax df=cat mm=2
> v=$cat}
>
> ...but this will...
>
>   q = ({!dismax df=features mm=2 v=$features} AND {!dismax df=cat mm=2
> v=$cat})
>
> This is a relatively new feature of Solr; in older versions you would
> need to use the _query_ parsing trick...
>
>   q = _query_:"{!dismax df=features mm=2 v=$features}" AND
> _query_:"{!dismax df=cat mm=2 v=$cat}"
>
>
> The important thing to remember about all of this though is that it really
> doesn't matter unless you truly care about getting scoring from this
> resulting BooleanQuery based on the two sub-queries ... if all you really
> care about is *filtering* a set of documents based on these two criteria,
> it's much simpler (and typically more efficient) to use filter queries...
>
>   q = *:*
>  fq = {!dismax df=features mm=2}power ipod 
>  fq = {!dismax df=cat mm=2}electronics connector 
>
>
>
> -Hoss
>
>




--
View this message in context: 
http://lucene.472066.n3.nabble.com/mm-Minimum-Should-Match-tp4071197p4071465.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: SOLR Cloud - Disable Transaction Logs

2013-06-18 Thread Rishi Easwaran

Erick,

We at AOL Mail have been using SOLR for quite a while; our system is pretty 
write heavy and disk I/O is one of our bottlenecks. At present we use regular 
SOLR in the lotsOfCores configuration and I am in the process of benchmarking 
SOLR Cloud for our use case. I don't have concrete data that tLogs are placing a 
lot of load on the system, but for a large scale system like ours even minimal 
load gets magnified. 


From the Cloud design, for a properly set up cluster, you usually have 
replicas in different availability zones. The probability of losing more than one 
availability zone at any given time should be pretty low. Why have tLogs if 
all replicas on an update get the request anyway? In theory, one replica must be 
able to commit eventually.

NRT is an optional feature and probably not tied to Cloud, correct?


Thanks,

Rishi.



 

 

-Original Message-
From: Erick Erickson 
To: solr-user 
Sent: Tue, Jun 18, 2013 4:07 pm
Subject: Re: SOLR Cloud - Disable Transaction Logs


bq: the replica can take over and maintain a durable
state of my index

This is not true. On an update, all the nodes in a slice
have already written the data to the tlog, not just the
leader. So if a leader goes down, the replicas have
enough local info to insure that data is not lost. Without
tlogs this would not be true since documents are not
durably saved until a hard commit.

tlogs save data between hard commits. As Yonik
explained to me once, "soft commits are about
visibility, hard commits are about durability" and
tlogs fill up the gap between hard commits.

So to reinforce Shalin's comment yes, you can disable tlogs
if
1> you don't want any of SolrCloud's HA/DR capabilities
2> NRT is unimportant

IOW if you're using 4.x just like you would 3.x in terms
of replication, HA/DR, etc. This is perfectly reasonable,
but don't get hung up on disabling tlogs.

And you haven't told us _why_ you want to do this. They
don't consume much memory or disk space unless you
have configured your hard commits (with openSearcher
true or false) to be quite long. Do you have any proof at
all that the tlogs are placing enough load on the system
to go down this road?

Best
Erick

On Tue, Jun 18, 2013 at 10:49 AM, Rishi Easwaran  wrote:
> SolrJ already has access to zookeeper cluster state. Network I/O bottleneck 
can be avoided by parallel requests.
> You are only as slow as your slowest responding server, which could be your 
single leader with the current set up.
>
> Wouldn't this lessen the burden of the leader, as he does not have to 
> maintain 
transaction logs or distribute to replicas?
>
>
>
>
>
>
>
> -Original Message-
> From: Shalin Shekhar Mangar 
> To: solr-user 
> Sent: Tue, Jun 18, 2013 2:05 am
> Subject: Re: SOLR Cloud - Disable Transaction Logs
>
>
> Yes, but at what cost? You are thinking of replacing disk IO with even more
> slower network IO. The transaction log is a append-only log -- it is not
> pretty cheap especially so if you compare it with the indexing process.
> Plus your write request/sec will drop a lot once you start doing
> synchronous replication.
>
>
> On Tue, Jun 18, 2013 at 2:18 AM, Rishi Easwaran wrote:
>
>> Shalin,
>>
>> Just some thoughts.
>>
>> Near-real-time replication - don't we use SolrCmdDistributor, which sends
>> requests immediately to replicas with a cloned request? As an option, can't
>> we achieve something similar from CloudSolrServer in SolrJ instead of the
>> leader doing it? As long as 2 nodes receive writes and acknowledge,
>> durability should be high.
>> Peer-Sync and Recovery - Can we achieve that merging indexes from leader
>> as needed, instead of replaying the transaction logs?
>>
>> Rishi.
>>
>>
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Shalin Shekhar Mangar 
>> To: solr-user 
>> Sent: Mon, Jun 17, 2013 3:43 pm
>> Subject: Re: SOLR Cloud - Disable Transaction Logs
>>
>>
>> It is also necessary for near real-time replication, peer sync and
>> recovery.
>>
>>
>> On Tue, Jun 18, 2013 at 1:04 AM, Rishi Easwaran > >wrote:
>>
>> > Hi,
>> >
>> > Is there a way to disable transaction logs in SOLR cloud. As far as I can
>> > tell no.
>> > Just curious why do we need transaction logs, seems like an I/O intensive
>> > operation.
>> > As long as I have replicatonFactor >1, if a node (leader) goes down, the
>> > replica can take over and maintain a durable state of my index.
>> >
>> > I understand from the previous discussions, that it was intended for
>> > update durability and realtime get.
>> > But, unless I am missing something an ability to disable it in SOLR cloud
>> > if not needed would be good.
>> >
>> > Thanks,
>> >
>> > Rishi.
>> >
>> >
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>>
>>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
>

 



Re: SOLR Cloud - Disable Transaction Logs

2013-06-18 Thread Erick Erickson
bq: the replica can take over and maintain a durable
state of my index

This is not true. On an update, all the nodes in a slice
have already written the data to the tlog, not just the
leader. So if a leader goes down, the replicas have
enough local info to ensure that data is not lost. Without
tlogs this would not be true since documents are not
durably saved until a hard commit.

tlogs save data between hard commits. As Yonik
explained to me once, "soft commits are about
visibility, hard commits are about durability" and
tlogs fill up the gap between hard commits.

So to reinforce Shalin's comment yes, you can disable tlogs
if
1> you don't want any of SolrCloud's HA/DR capabilities
2> NRT is unimportant

IOW if you're using 4.x just like you would 3.x in terms
of replication, HA/DR, etc. This is perfectly reasonable,
but don't get hung up on disabling tlogs.

And you haven't told us _why_ you want to do this. They
don't consume much memory or disk space unless you
have configured your hard commits (with openSearcher
true or false) to be quite long. Do you have any proof at
all that the tlogs are placing enough load on the system
to go down this road?

Best
Erick

On Tue, Jun 18, 2013 at 10:49 AM, Rishi Easwaran  wrote:
> SolrJ already has access to zookeeper cluster state. Network I/O bottleneck 
> can be avoided by parallel requests.
> You are only as slow as your slowest responding server, which could be your 
> single leader with the current set up.
>
> Wouldn't this lessen the burden of the leader, as he does not have to 
> maintain transaction logs or distribute to replicas?
>
>
>
>
>
>
>
> -Original Message-
> From: Shalin Shekhar Mangar 
> To: solr-user 
> Sent: Tue, Jun 18, 2013 2:05 am
> Subject: Re: SOLR Cloud - Disable Transaction Logs
>
>
> Yes, but at what cost? You are thinking of replacing disk IO with even
> slower network IO. The transaction log is an append-only log -- it is not
> pretty cheap especially so if you compare it with the indexing process.
> Plus your write request/sec will drop a lot once you start doing
> synchronous replication.
>
>
> On Tue, Jun 18, 2013 at 2:18 AM, Rishi Easwaran wrote:
>
>> Shalin,
>>
>> Just some thoughts.
>>
>> Near-real-time replication - don't we use SolrCmdDistributor, which sends
>> requests immediately to replicas with a cloned request? As an option, can't
>> we achieve something similar from CloudSolrServer in SolrJ instead of the
>> leader doing it? As long as 2 nodes receive writes and acknowledge,
>> durability should be high.
>> Peer-Sync and Recovery - Can we achieve that merging indexes from leader
>> as needed, instead of replaying the transaction logs?
>>
>> Rishi.
>>
>>
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Shalin Shekhar Mangar 
>> To: solr-user 
>> Sent: Mon, Jun 17, 2013 3:43 pm
>> Subject: Re: SOLR Cloud - Disable Transaction Logs
>>
>>
>> It is also necessary for near real-time replication, peer sync and
>> recovery.
>>
>>
>> On Tue, Jun 18, 2013 at 1:04 AM, Rishi Easwaran > >wrote:
>>
>> > Hi,
>> >
>> > Is there a way to disable transaction logs in SOLR cloud. As far as I can
>> > tell no.
>> > Just curious why do we need transaction logs, seems like an I/O intensive
>> > operation.
>> > As long as I have replicatonFactor >1, if a node (leader) goes down, the
>> > replica can take over and maintain a durable state of my index.
>> >
>> > I understand from the previous discussions, that it was intended for
>> > update durability and realtime get.
>> > But, unless I am missing something an ability to disable it in SOLR cloud
>> > if not needed would be good.
>> >
>> > Thanks,
>> >
>> > Rishi.
>> >
>> >
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>>
>>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
>


Re: Solr large boolean filter

2013-06-18 Thread Erick Erickson
Not necessarily. If the auth tokens are available on some
other system (DB, LDAP, whatever), one could get them
in the PostFilter and cache them somewhere since,
presumably, they wouldn't be changing all that often. Or
use a UserCache and get notified whenever a new searcher
was opened and regenerate or purge the cache.

Of course you're right if the post filter does NOT have
access to the source of truth for the user's privileges.

FWIW,
Erick
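
For illustration, a bare-bones sketch of the kind of post filter Erick
describes, written against the Solr 4.x PostFilter/DelegatingCollector API. The
class name and the precomputed set of allowed internal doc IDs are hypothetical
stand-ins for a real ACL lookup, and equals/hashCode/createWeight are omitted
for brevity:

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class AclPostFilter extends ExtendedQueryBase implements PostFilter {

    // Stand-in for whatever per-document check you really do (cached DB/LDAP
    // lookup, FieldCache value check, etc.).
    private final Set<Integer> allowedDocs;

    public AclPostFilter(Set<Integer> allowedDocs) {
        this.allowedDocs = allowedDocs;
        setCache(false); // post filters must not be cached
        setCost(200);    // cost >= 100 tells Solr to run this after all other filters
    }

    @Override
    public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
        return new DelegatingCollector() {
            private int docBase;

            @Override
            public void setNextReader(AtomicReaderContext context) throws IOException {
                docBase = context.docBase;
                super.setNextReader(context);
            }

            @Override
            public void collect(int doc) throws IOException {
                // Only pass documents the current user may see on to the delegate.
                if (allowedDocs.contains(docBase + doc)) {
                    super.collect(doc);
                }
            }
        };
    }
}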

On Tue, Jun 18, 2013 at 8:54 AM, Otis Gospodnetic
 wrote:
> Hi,
>
> The unfortunate thing about this is what you still have to *pass* that
> filter from the client to the server every time you want to use that
> filter.  If that filter is big/long, passing that in all the time has
> some price that could be eliminated by using "server-side named
> filters".
>
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
>
>
>
>
>
> On Tue, Jun 18, 2013 at 8:16 AM, Erick Erickson  
> wrote:
>> You might consider "post filters". The idea
>> is to write a custom filter that gets applied
>> after all other filters etc. One use-case
>> here is exactly ACL lists, and can be quite
>> helpful if you're not doing *:* type queries.
>>
>> Best
>> Erick
>>
>> On Mon, Jun 17, 2013 at 5:12 PM, Otis Gospodnetic
>>  wrote:
>>> Btw. ElasticSearch has a nice feature here.  Not sure what it's
>>> called, but I call it "named filter".
>>>
>>> http://www.elasticsearch.org/blog/terms-filter-lookup/
>>>
>>> Maybe that's what OP was after?
>>>
>>> Otis
>>> --
>>> Solr & ElasticSearch Support
>>> http://sematext.com/
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Jun 17, 2013 at 4:59 PM, Alexandre Rafalovitch
>>>  wrote:
 On Mon, Jun 17, 2013 at 12:35 PM, Igor Kustov  wrote:
> So I'm using query like
> http://127.0.0.1:8080/solr/select?q=*:*&fq={!mqparser}id:%281%202%203%29

 If the IDs are purely numeric, I wonder if the better way is to send a
 bitset. So, bit 1 is on if ID:1 is included, bit 2000 is on if ID:2000
 is included. Even using URL-encoding rules, you can fit at least 65
 sequential ID flags per character and I am sure there are more
 efficient encoding schemes for long empty sequences.

 Regards,
Alex.



 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all
 at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)
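
A rough sketch of the bitset idea Alexandre floats, purely as illustration: the
ID values and the Base64 transport encoding are made up, and the Solr side
would still need a custom query parser to decode the parameter again:

import java.util.BitSet;
import javax.xml.bind.DatatypeConverter;

public class IdBitsetParam {
    public static void main(String[] args) {
        int[] allowedIds = {1, 3, 2000, 50000}; // hypothetical numeric IDs
        BitSet bits = new BitSet();
        for (int id : allowedIds) {
            bits.set(id);
        }
        // Base64-encode the underlying bytes so the whole set travels as one
        // compact request parameter instead of a huge boolean query.
        String param = DatatypeConverter.printBase64Binary(bits.toByteArray());
        System.out.println("ids_bitset=" + param + " (" + param.length() + " chars)");
    }
}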


Re: Solr cloud: zkHost in solr.xml gets wiped out

2013-06-18 Thread Al Wold
For the CREATE call, I'm doing it manually per the instructions here:

http://wiki.apache.org/solr/SolrCloud

Here's the exact URL I'm using:

http://asu-solr-cloud.elasticbeanstalk.com/admin/collections?action=CREATE&name=directory&numShards=2&replicationFactor=2&maxShardsPerNode=2

I'm testing out your patch now, and I'll let you know how it goes.

Thanks for all the help!

-Al

On Jun 18, 2013, at 6:47 AM, Erick Erickson wrote:

> OK, I think I see what's happening. If you do
> NOT specify an instanceDir on the create
> (and I'm doing this via the core admin
> interface, not SolrJ) then the default is
> used, but not persisted. If you _do_
> specify the instance dir, it will be persisted.
> 
> I've put up another quick patch (tested
> only in my test case, running full suite
> now). Can you give it a whirl? You'll have
> to apply the patch over top of the current
> 4x, een though the patch is for trunk it
> applied to 4x cleanly for me and the tests ran.
> 
> Thanks,
> Erick
> 
> On Tue, Jun 18, 2013 at 9:02 AM, Erick Erickson  
> wrote:
>> OK, I put up a very preliminary patch attached to the bug
>> if you want to try it out that addresses the extra junk being
>> put in the  tag. Doesn't address the instanceDir issue
>> since I haven't reproduced it yet.
>> 
>> Erick
>> 
>> On Tue, Jun 18, 2013 at 8:46 AM, Erick Erickson  
>> wrote:
>>> Whoa! What's this junk?
>>> qt="/admin/cores" wt="javabin" version="2
>>> 
>>> That shouldn't be being preserved, and the instancedir should be!
>>> 
>>> So I'm guessing you're using SolrJ to create the core, but I just
>>> reproduced the problem (at least the 'wt="json" ') bit from the
>>> browser and even from one of my internal tests when I added
>>> extra parameters.
>>> 
>>> That said, instanceDir is being preserved in my test, so I'm not
>>> seeing everything you're seeing, could you cut/paste your
>>> create code? I'll see if I can set up a test case for SolrJ to catch
>>> this too.
>>> 
>>> See SOLR-4935
>>> 
>>> Thanks for reporting!
>>> 
>>> On Mon, Jun 17, 2013 at 5:39 PM, Al Wold  wrote:
 Hi Erick,
 I tried out your changes from the branch_4x branch. It looks good in terms 
 of preserving the zkHost, but I'm running into an exception because it 
 isn't persisting the instanceDir attribute on the  element.
 
 I've got a few other things I need to take care of, but as soon as I have 
 time I'll dig in and see if I can figure out what's going on, and see what 
 changed to make this not work.
 
 Here are details on what the files looked like before/after CREATE call:
 
 original solr.xml:
 
 
 
  
  >>> hostContext="/"/>
 
 
 here's what was produced with 4.3 branch + a quick mod to preserve zkHost:
 
 
 
  >>> hostContext="/">
>>> instanceDir="directory_shard1_replica1/" transient="false" 
 name="directory_shard1_replica1" collection="directory"/>
>>> instanceDir="directory_shard2_replica1/" transient="false" 
 name="directory_shard2_replica1" collection="directory"/>
  
 
 
 here's what was produced with branch_4x 4.4-SNAPSHOT:
 
 
 
  >>> distribUpdateSoTimeout="0" distribUpdateConnTimeout="0" hostPort="8080" 
 hostContext="/">
>>> collection="directory" qt="/admin/cores" wt="javabin" version="2"/>
>>> collection="directory" qt="/admin/cores" wt="javabin" version="2"/>
  
 
 
 and here's the error from solr.log after restarting after the CREATE:
 
 2013-06-17 21:37:07,083 1874 [pool-2-thread-1] ERROR 
 org.apache.solr.core.CoreContainer  - null:java.lang.NullPointerException: 
 Missing required 'instanceDir'
at 
 org.apache.solr.core.CoreDescriptor.doInit(CoreDescriptor.java:133)
at 
 org.apache.solr.core.CoreDescriptor.(CoreDescriptor.java:87)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:365)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:221)
at 
 org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:190)
at 
 org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:124)
at 
 org.apache.catalina.core.ApplicationFilterConfig.initFilter(ApplicationFilterConfig.java:277)
at 
 org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:258)
at 
 org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:382)
at 
 org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:103)
at 
 org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4638)
at 
 org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5294)
at 
 org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)

RE: yet another optimize question

2013-06-18 Thread Petersen, Robert
Hi Andre,

Wow that is astonishing!  I will definitely also try that out!  Just set the 
facet method on a per field basis for the less used sparse facet fields eh?  
Thanks for the tip.

Thanks
Robi
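
A quick SolrJ sketch of the per-field override being discussed; the facet field
names are made-up examples of sparse dynamic fields:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class SparseFacetQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetField("category", "attr_color_s", "attr_size_s");
        // Keep the dense standard facet on the default fc method, but switch the
        // sparsely populated dynamic fields to enum to avoid big per-field arrays.
        q.set("f.attr_color_s.facet.method", "enum");
        q.set("f.attr_size_s.facet.method", "enum");
        System.out.println(solr.query(q).getFacetFields());
    }
}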

-Original Message-
From: Andre Bois-Crettez [mailto:andre.b...@kelkoo.com] 
Sent: Tuesday, June 18, 2013 3:03 AM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

Recently we had steadily increasing memory usage and OOM due to facets on 
dynamic fields.
The default facet.method=fc needs to build a large array of maxDoc ints for 
each field (a fieldCache or fieldValueCache entry), whether it is sparsely 
populated or not.

Once you have reduced your number of maxDocs with the merge policy, it can be 
interesting to try facet.method=enum for all the sparsely populated dynamic 
fields.
Despite what is said in the wiki, in our case the performance was similar to 
facet.method=fc, however the JVM heap usage went down from about 20GB to 4GB.

André

On 06/17/2013 08:21 PM, Petersen, Robert wrote:
> Also some time ago I made all our caches small enough to keep us from getting 
> OOMs while still having a good hit rate.Our index has about 50 fields 
> which are mostly int IDs and there are some dynamic fields also.  These 
> dynamic fields can be used for custom faceting.  We have some standard facets 
> we always facet on and other dynamic facets which are only used if the query 
> is filtering on a particular category.  There are hundreds of these fields 
> but since they are only for a small subset of the overall index they are very 
> sparsely populated with regard to the overall index.
--
André Bois-Crettez

Search technology, Kelkoo
http://www.kelkoo.com/


Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended solely for their 
addressees. If you are not the intended recipient of this message, please destroy 
it and notify the sender.



Re: ConcurrentUpdateSolrserver - Queue size not working

2013-06-18 Thread Shawn Heisey

On 6/18/2013 11:06 AM, Learner wrote:

My issue is that I see the documents getting added to the server even
before the queue size is reached.  Am I doing anything wrong? Or is
queuesize not implemented yet?

Also I dont see a very big performance improvements when I increase /
decrease the number of threads. Can someone let me know the best way to
improve indexing performance when using ConcurrentUpdateSolrserver

FYI... I am running this program on 4 core machine..

Sample snippet:

ConcurrentUpdateSolrServer server = new 
ConcurrentUpdateSolrServer(
solrServer, 3, 4);
try {
while ((line = bReader.readLine()) != null) {
inputDocument = line.split("\t");
   Do some processing
server.add(doc);

}}


It looks like the Javadocs for this class are misleading.  The queuesize 
value doesn't cause documents to buffer until that many are present, it 
is used to limit the size of the internal queue.  The object will always 
begin indexing immediately.


I notice that you have your add() call within a try block.  You should 
know that CUSS will *never* throw an exception for things like your 
server being down or the update being badly formed.  It always returns 
immediately with a success.  I did put information about this in the 
javadocs for its predecessor StreamingUpdateSolrServer, but it looks 
like we need an update for CUSS.


If you aren't seeing performance increases from increasing threads, then 
it is very likely that Solr is keeping up with your application with no 
problems and that your application, or the place your application gets 
its data, is the bottleneck.


Thanks,
Shawn
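
If silently swallowed failures are a concern, one hedge is to override
handleError so errors at least get surfaced. A sketch against the 4.x
ConcurrentUpdateSolrServer API; the URL, queue size, and thread count are
placeholders:

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;

public class NoisyUpdateServer {
    public static void main(String[] args) {
        ConcurrentUpdateSolrServer server =
                new ConcurrentUpdateSolrServer("http://localhost:8983/solr/collection1", 10000, 4) {
                    @Override
                    public void handleError(Throwable ex) {
                        // The stock implementation only logs; count or alert here instead.
                        System.err.println("Async update failed: " + ex);
                    }
                };
        // ... add documents as usual; failures now show up on stderr.
        server.shutdown();
    }
}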



RE: ConcurrentUpdateSolrserver - Queue size not working

2013-06-18 Thread James Thomas
Looks like the javadoc  on this parameter could use a little tweaking.
From looking at the 4.3 source code (hoping I get this right :-), it appears 
the ConcurrentUpdateSolrServer will begin sending documents (on a single 
thread) as soon as the first document is added.
New threads (up to threadCount) are created only when a document is added and 
the queue is more than half full.
Kind of makes sense... why wait until the queue is full to send documents.  And 
if one thread can keep up with your ETL (adds), there's really no need to 
create new threads.

You might want to create your own buffer (e.g. ArrayList) of the 
SolrInputDocument objects and then use the "add" API that accepts the 
collection.
Calling "add" after creating 30,000 SolrInputDocument objects seems a bit much. 
 Something smaller (like 1,000) might work better.  You'll have to experiment 
to see what works best for your environment.

-- James
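
A small sketch of the buffering James describes, assuming the SolrJ 4.x API;
the batch size, field mapping, and URL are arbitrary. A fresh list is created
per batch rather than clearing and reusing the old one, because the add is
handed off to CUSS's background threads asynchronously:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedIndexer {
    public static void main(String[] args) throws Exception {
        ConcurrentUpdateSolrServer server =
                new ConcurrentUpdateSolrServer("http://localhost:8983/solr/collection1", 10000, 4);
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(1000);
        BufferedReader reader = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] cols = line.split("\t");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", cols[0]);    // hypothetical field mapping
            doc.addField("name", cols[1]);
            batch.add(doc);
            if (batch.size() >= 1000) {     // hand Solr ~1,000 docs at a time
                server.add(batch);
                batch = new ArrayList<SolrInputDocument>(1000);
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        reader.close();
        server.commit();
        server.shutdown();
    }
}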

-Original Message-
From: Learner [mailto:bbar...@gmail.com] 
Sent: Tuesday, June 18, 2013 1:07 PM
To: solr-user@lucene.apache.org
Subject: ConcurrentUpdateSolrserver - Queue size not working

I am using ConcurrentUpdateSolrserver to create 4 threads (threadCount=4) with 
queueSize of 3.

Indexing works fine as expected.

My issue is that I see the documents getting added to the server even 
before the queue size is reached.  Am I doing anything wrong? Or is queuesize 
not implemented yet?

Also I dont see a very big performance improvements when I increase / decrease 
the number of threads. Can someone let me know the best way to improve indexing 
performance when using ConcurrentUpdateSolrserver

FYI... I am running this program on 4 core machine.. 

Sample snippet:

ConcurrentUpdateSolrServer server = new 
ConcurrentUpdateSolrServer(
solrServer, 3, 4);
try {
while ((line = bReader.readLine()) != null) {
inputDocument = line.split("\t");
  Do some processing
server.add(doc);

}}




--
View this message in context: 
http://lucene.472066.n3.nabble.com/ConcurrentUpdateSolrserver-Queue-size-not-working-tp4071408.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: mm (Minimum 'Should' Match)

2013-06-18 Thread Chris Hostetter

: query something like
: 
: 
http://localhost:8983/solr/select?q=(category:lcd+OR+category:led+OR+category:plasma)+AND+(manufacture:sony+OR+manufacture:samsung+OR+manufacture:apple)&facet.field=category&facet.field=manufacture&fl=id&mm=2

Here's an example of something similar using the Solr 4.3 example data...

http://localhost:8983/solr/collection1/select?debugQuery=true&q=%2B{!dismax+df%3Dfeatures+mm%3D2+v%3D%24features}+%2B{!dismax+df%3Dcat+mm%3D2+v%3D%24cat}+&wt=xml&indent=true&cat=electronics+connector+&features=power+ipod+

the important (un url encoded) params here are...

  q = +{!dismax df=features mm=2 v=$features} +{!dismax df=cat mm=2 v=$cat} 
  features = power ipod 
  cat = electronics connector 

This takes advantage of the fact that the (default) "lucene" parser in 
solr can understand the "{!parser}" syntax to create subclauses using 
other parsers -- but one thing to watch out for is that if your entire 
query starts with "{!dismax}..." then it's going to try and use the dismax 
parser for the whole thing, so this won't do what you expect...

  q = {!dismax df=features mm=2 v=$features} AND {!dismax df=cat mm=2 v=$cat}

...but this will...

  q = ({!dismax df=features mm=2 v=$features} AND {!dismax df=cat mm=2 v=$cat})
  
This is a relatively new feature of Solr; in older versions you would 
need to use the _query_ parsing trick...

  q = _query_:"{!dismax df=features mm=2 v=$features}" AND _query_:"{!dismax 
df=cat mm=2 v=$cat}"


The important thing to remember about all of this though is that it really 
doesn't matter unless you truly care about getting scoring from this 
resulting BooleanQuery based on the two sub-queries ... if all you really 
care about is *filtering* a set of documents based on these two criteria, 
it's much simpler (and typically more efficient) to use filter queries...

  q = *:*
 fq = {!dismax df=features mm=2}power ipod 
 fq = {!dismax df=cat mm=2}electronics connector 



-Hoss
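
For anyone doing this from SolrJ, a small sketch of the filter-query variant
Hoss ends with; the core URL is a placeholder, and qf is used rather than df,
per the follow-up earlier in this thread:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MinShouldMatchFilters {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        // Each fq gets its own dismax sub-parser with its own mm value.
        q.addFilterQuery("{!dismax qf=features mm=2}power ipod");
        q.addFilterQuery("{!dismax qf=cat mm=2}electronics connector");
        QueryResponse rsp = solr.query(q);
        System.out.println("matches: " + rsp.getResults().getNumFound());
    }
}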


RE: yet another optimize question

2013-06-18 Thread Petersen, Robert
In reading the newer solrconfig in the example conf folder, it seems to be 
saying that the setting '<mergeFactor>10</mergeFactor>' is shorthand for putting in 
the below, and that both are the defaults.  It says 'The default since 
Solr/Lucene 3.3 is TieredMergePolicy.'  So isn't this setting already in effect 
for me?

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
</mergePolicy>
Thanks
Robi

-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: Monday, June 17, 2013 6:36 PM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

Yes, in one of the example solrconfig.xml files this is right above the merge 
factor definition.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/





On Mon, Jun 17, 2013 at 8:00 PM, Petersen, Robert 
 wrote:
> Hi Upayavira,
>
> You might have gotten it.  Yes we noticed maxdocs was way bigger than 
> numdocs.  There were a lot of files ending in '.del' in the index folder 
> also.  We started on 1.3 also.   I don't currently have any solr config 
> settings for MergePolicy at all.  Am I going to want to put something like 
> this into my index defaults section?
>
> <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>   <int name="maxMergeAtOnce">10</int>
>   <int name="segmentsPerTier">10</int>
> </mergePolicy>
> Thanks
> Robi
>
> -Original Message-
> From: Upayavira [mailto:u...@odoko.co.uk]
> Sent: Monday, June 17, 2013 12:29 PM
> To: solr-user@lucene.apache.org
> Subject: Re: yet another optimize question
>
> The key figures are numdocs vs maxdocs. Maxdocs-numdocs is the number of 
> deleted docs in your index.
>
> This is a 3.6 system you say. But has it been upgraded? I've seen folks 
> who've upgraded from 1.4 or 3.0/3.1 over time, keeping the old config.
> The consequence of this is that they don't get the right config for the 
> TieredMergePolicy, and therefore don't get to use it, seeing the old 
> behaviour which does require periodic optimise.
>
> Upayavira
>
> On Mon, Jun 17, 2013, at 07:21 PM, Petersen, Robert wrote:
>> Hi Otis,
>>
>> Right I didn't restart the JVMs except on the one slave where I was
>> experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I
>> made all our caches small enough to keep us from getting OOMs while still
>> having a good hit rate.Our index has about 50 fields which are mostly
>> int IDs and there are some dynamic fields also.  These dynamic fields 
>> can be used for custom faceting.  We have some standard facets we 
>> always facet on and other dynamic facets which are only used if the 
>> query is filtering on a particular category.  There are hundreds of 
>> these fields but since they are only for a small subset of the 
>> overall index they are very sparsely populated with regard to the 
>> overall index.  With CMS GC we get a sawtooth on the old generation 
>> (I guess every replication and commit causes it's usage to drop down 
>> to 10GB or
>> so) and it seems to be the old generation which is the main space 
>> consumer.  With the G1GC, the memory map looked totally different!  I 
>> was a little lost looking at memory consumption with that GC.  Maybe 
>> I'll try it again now that the index is a bit smaller than it was 
>> last time I tried it.  After four days without running an optimize 
>> now it is 21GB.  BTW our indexing speed is mostly bound by the DB so 
>> reducing the segments might be ok...
>>
>> Here is a quick snapshot of one slaves memory map as reported by 
>> PSI-Probe, but unfortunately I guess I can't send the history 
>> graphics to the solr-user list to show their changes over time:
>>   Name                 Used       Committed   Max         Initial     Group
>>   Par Survivor Space   20.02 MB   108.13 MB   108.13 MB   108.13 MB   HEAP
>>   CMS Perm Gen         42.29 MB   70.66 MB    82.00 MB    20.75 MB    NON_HEAP
>>   Code Cache           9.73 MB    9.88 MB     48.00 MB    2.44 MB     NON_HEAP
>>   CMS Old Gen          20.22 GB   30.94 GB    30.94 GB    30.94 GB    HEAP
>>   Par Eden Space       42.20 MB   865.31 MB   865.31 MB   865.31 MB   HEAP
>>   Total                20.33 GB   31.97 GB    32.02 GB    31.92 GB    TOTAL
>>
>> And here's our current cache stats from a random slave:
>>
>> name:queryResultCache
>> class:   org.apache.solr.search.LRUCache
>> version: 1.0
>> description: LRU Cache(maxSize=488, initialSize=6, autowarmCount=6,
>> regenerator=org.apache.solr.search.SolrIndexSearcher$3@461ff4c3)
>> stats:  lookups : 619
>> hits : 36
>> hitratio : 0.05
>> inserts : 592
>> evictions : 101
>> size : 488
>> warmupTime : 2949
>> cumulative_lookups : 681225
>> cumulative_hits : 73126
>> cumulative_hitratio : 0.10
>> cumulative_inserts : 602396
>> cumulative_evictions : 428868
>>
>>
>>  name:fieldCache
>> class:   org.apache.solr.search.SolrFieldCacheMBean
>> version: 1.0
>> description: Provides introspection of the Lucene FieldCache, this is
>> **NOT** a cache that is managed by Solr.
>> stat

RE: yet another optimize question

2013-06-18 Thread Petersen, Robert
Hi Otis, 

Yes the query results cache is just about worthless.   I guess we have too 
diverse of a set of user queries.  The business unit has decided to let bots 
crawl our search pages too so that doesn't help either.  I turned it way down 
but decided to keep it because my understanding was that it would still help 
for users going from page 1 to page 2 in a search.  Is that true?

Thanks
Robi

-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: Monday, June 17, 2013 6:39 PM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

Hi Robi,

This goes against the original problem of getting OOMEs, but it looks like each 
of your Solr caches could be a little bigger if you want to eliminate 
evictions, with the query results one possibly not being worth keeping if you 
can't get the hit % up enough.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/





On Mon, Jun 17, 2013 at 2:21 PM, Petersen, Robert 
 wrote:
> Hi Otis,
>
> Right I didn't restart the JVMs except on the one slave where I was 
> experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I 
> made all our caches small enough to keep us from getting OOMs while still 
> having a good hit rate.Our index has about 50 fields which are mostly int 
> IDs and there are some dynamic fields also.  These dynamic fields can be used 
> for custom faceting.  We have some standard facets we always facet on and 
> other dynamic facets which are only used if the query is filtering on a 
> particular category.  There are hundreds of these fields but since they are 
> only for a small subset of the overall index they are very sparsely populated 
> with regard to the overall index.  With CMS GC we get a sawtooth on the old 
> generation (I guess every replication and commit causes it's usage to drop 
> down to 10GB or so) and it seems to be the old generation which is the main 
> space consumer.  With the G1GC, the memory map looked totally different!  I 
> was a little lost looking at memory consumption with that GC.  Maybe I'll try 
> it again now that the index is a bit smaller than it was last time I tried 
> it.  After four days without running an optimize now it is 21GB.  BTW our 
> indexing speed is mostly bound by the DB so reducing the segments might be 
> ok...
>
> Here is a quick snapshot of one slaves memory map as reported by PSI-Probe, 
> but unfortunately I guess I can't send the history graphics to the solr-user 
> list to show their changes over time:
> Name                 Used       Committed   Max         Initial     Group
> Par Survivor Space   20.02 MB   108.13 MB   108.13 MB   108.13 MB   HEAP
> CMS Perm Gen         42.29 MB   70.66 MB    82.00 MB    20.75 MB    NON_HEAP
> Code Cache           9.73 MB    9.88 MB     48.00 MB    2.44 MB     NON_HEAP
> CMS Old Gen          20.22 GB   30.94 GB    30.94 GB    30.94 GB    HEAP
> Par Eden Space       42.20 MB   865.31 MB   865.31 MB   865.31 MB   HEAP
> Total                20.33 GB   31.97 GB    32.02 GB    31.92 GB    TOTAL
>
> And here's our current cache stats from a random slave:
>
> name:queryResultCache
> class:   org.apache.solr.search.LRUCache
> version: 1.0
> description: LRU Cache(maxSize=488, initialSize=6, autowarmCount=6, 
> regenerator=org.apache.solr.search.SolrIndexSearcher$3@461ff4c3)
> stats:  lookups : 619
> hits : 36
> hitratio : 0.05
> inserts : 592
> evictions : 101
> size : 488
> warmupTime : 2949
> cumulative_lookups : 681225
> cumulative_hits : 73126
> cumulative_hitratio : 0.10
> cumulative_inserts : 602396
> cumulative_evictions : 428868
>
>
>  name:   fieldCache
> class:   org.apache.solr.search.SolrFieldCacheMBean
> version: 1.0
> description: Provides introspection of the Lucene FieldCache, this is 
> **NOT** a cache that is managed by Solr.
> stats:  entries_count : 359
>
>
> name:documentCache
> class:   org.apache.solr.search.LRUCache
> version: 1.0
> description: LRU Cache(maxSize=2048, initialSize=512, autowarmCount=10, 
> regenerator=null)
> stats:  lookups : 12710
> hits : 7160
> hitratio : 0.56
> inserts : 5636
> evictions : 3588
> size : 2048
> warmupTime : 0
> cumulative_lookups : 10590054
> cumulative_hits : 6166913
> cumulative_hitratio : 0.58
> cumulative_inserts : 4423141
> cumulative_evictions : 3714653
>
>
> name:fieldValueCache
> class:   org.apache.solr.search.FastLRUCache
> version: 1.0
> description: Concurrent LRU Cache(maxSize=280, initialSize=280, 
> minSize=252, acceptableSize=266, cleanupThread=false, autowarmCount=6, 
> regenerator=org.apache.solr.search.SolrIndexSearcher$1@143eb77a)
> stats:  lookups : 1725
> hits : 1481
> hitratio : 0.85
> inserts : 122
> evictions : 0
> size : 128
> warmupTime : 4426
> cumulative

Re: New operator.

2013-06-18 Thread Yanis Kakamaikis
Thanks, Roman.  I'm going to do some digging...


On Mon, Jun 17, 2013 at 9:53 PM, Roman Chyla  wrote:

> Hello Yanis,
>
> We are probably using something similar - eg. 'functional operators' - eg.
> edismax() to treat everything inside the bracket as an argument for
> edismax, or pos() to search for authors based on their position. And
> invenio() which is exactly what you describe, to get results from external
> engine. Depending on the level of complexity, you may need any/all of the
> following
>
> 1. query parser that understands the operator syntax and can build some
> 'external search' query object
> 2. the 'query object' that knows to contact the external service and return
> lucene docids - so you will need some translation
> externalIds<->luceneDocIds - you can for example index the same primary key
> in both solr and the ext engine, and then use a cache for the mapping
>
> To solve the 1, you could use the
> https://issues.apache.org/jira/browse/LUCENE-5014 - sorry for the
> shameless
> plug :) - but this is what we use and what i am familiar with, you can see
> a grammar that gives you the 'functional operator' here - if you dig
> deeper, you will see how it is building different query objects for
> different operators:
>
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/grammars/ADS.g
>
> and here an example how to ask the external engine for results and return
> lucene docids:
>
> https://github.com/romanchyla/montysolr/blob/master/contrib/invenio/src/java/org/apache/lucene/search/InvenioWeight.java
>
> it is a bit messy and you should probably ignore how we are getting the
> results, just look at nextDoc()
>
> HTH,
>
>   roman
>
>
> On Mon, Jun 17, 2013 at 2:34 PM, Yanis Kakamaikis <
> yanis.kakamai...@gmail.com> wrote:
>
> > Hi all,   thanks for your reply.
> > I want to be able to ask a combined query,  a normal solr querym but one
> of
> > the query fields should get it's answer not from within the solr engine,
> > but from an external engine.
> > the rest should work normaly with the ability to do more tasks on the
> > answer like faceting for example.
> > The external engine will use the same objects ids like solr, so the
> boolean
> > query that uses this engine answer be executed correctly.
> > For example, let say I want to find a person by his name, age, address,
> and
> > also by his picture. I have a picture indexing engine, I want to create a
> > combined query that will call this engine like other query field.   I
> hope
> > it's more clear now...
> >
> >
> > On Sun, Jun 16, 2013 at 4:02 PM, Jack Krupansky  > >wrote:
> >
> > > It all depends on what you mean by an "operator".
> > >
> > > Start by describing in more detail what problem you are trying to
> solve.
> > >
> > > And how do you expect your users or applications to use this
> "operator".
> > > Give some examples.
> > >
> > > Solr and Lucene do not have "operators" per say, except in query parser
> > > syntax, but that is hard-wired into the individual query parsers.
> > >
> > > -- Jack Krupansky
> > >
> > > -Original Message- From: Yanis Kakamaikis
> > > Sent: Sunday, June 16, 2013 2:01 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: New operator.
> > >
> > >
> > > Hi all,I want to add a new operator to my solr.   I need that
> > operator
> > > to call my proprietary engine and build an answer vector to solr, in a
> > way
> > > that this vector will be part of the boolean query at the next step.
> > How
> > > do I do that?
> > > Thanks
> > >
> >
>


ConcurrentUpdateSolrserver - Queue size not working

2013-06-18 Thread Learner
I am using ConcurrentUpdateSolrServer with 4 threads (threadCount=4)
and a queueSize of 3.

Indexing works fine as expected.

My issue is that I see the documents getting added to the server even
before the queue size is reached.  Am I doing anything wrong? Or is
queueSize not implemented yet?

Also I don't see a very big performance improvement when I increase /
decrease the number of threads. Can someone let me know the best way to
improve indexing performance when using ConcurrentUpdateSolrServer?

FYI... I am running this program on 4 core machine.. 

Sample snippet:

// org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer
// queueSize = 3, threadCount = 4
ConcurrentUpdateSolrServer server =
        new ConcurrentUpdateSolrServer(solrServer, 3, 4);
try {
    String line;
    while ((line = bReader.readLine()) != null) {
        String[] inputDocument = line.split("\t");
        SolrInputDocument doc = new SolrInputDocument();
        // ... do some processing: populate 'doc' from inputDocument ...
        server.add(doc);
    }
} finally {
    server.blockUntilFinished(); // wait for queued docs to be sent
    server.shutdown();
}




--
View this message in context: 
http://lucene.472066.n3.nabble.com/ConcurrentUpdateSolrserver-Queue-size-not-working-tp4071408.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Running solr cloud

2013-06-18 Thread Utkarsh Sengar
Looks like zk does not contain the configuration called: collection1.
You can use zkCli.sh to see what's inside "configs" zk node. You can
manually push config via zkCli's upconfig (not very sure how it works).

Try adding this arg: " -Dbootstrap_conf=true" in place of
"-Dbootstrap_confdir=./solr/collection1/conf" and start solr. This might
push the config to zk.

bootstrap_conf uploads the index configuration files for all the cores to
zk.
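
For reference, pushing a config by hand with the zkcli script that ships in the 
4.x example (example/cloud-scripts/zkcli.sh, or zkcli.bat on Windows) would look 
roughly like this -- the embedded ZooKeeper port and the paths below are 
assumptions based on the stock example layout:

  cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig \
      -confdir ./solr/collection1/conf -confname myconf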

Thanks,
-Utkarsh


On Tue, Jun 18, 2013 at 4:49 AM, Daniel Mosesson
wrote:

> I cannot seem to get the default cloud setup to work properly.
>
> What I did:
> Downloaded the binaries, extracted.
> Made the pwd example
> Ran: java -Dbootstrap_confdir=./solr/collection1/conf
> -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
> And got the error message:  Caused by:
> org.apache.solr.common.cloud.ZooKeeperException: Specified config does not
> exist in ZooKeeper:collection1
> Which caused follow up messages, etc.
>
> What am I doing wrong here?
> Windows 7 pro
>
>



-- 
Thanks,
-Utkarsh


Re: Shard splitting and document routing

2013-06-18 Thread Otis Gospodnetic
Beautiful.  Thanks!

Otis
--
Solr & ElasticSearch Support --  http://sematext.com/
Performance Monitoring - http://sematext.com/spm/index.html





On Tue, Jun 18, 2013 at 12:34 PM, Mark Miller  wrote:
> No, the hash ranges are split and new docs go to both new shards.
>
> - Mark
>
> On Jun 18, 2013, at 12:25 PM, Otis Gospodnetic  
> wrote:
>
>> Hi,
>>
>> Imagine a (common) situation where you use document routing and you
>> end up with 1 large shard (e.g. 1 large user with lots of docs).
>> Shard splitting will help here, because we can break up that 1 shard
>> in 2 smaller shards (and maybe do that "recursively" to make shards
>> sufficiently small).
>>
>> But what happens with document routing after a big shard is split?
>> I assume new docs keep going to just one of the 2 new shards, right?
>>
>> If so, does that mean that after a while one of the new shards will
>> balloon again and the above procedure will need to be repeated?
>>
>> Thanks,
>> Otis
>> --
>> Solr & ElasticSearch Support --  http://sematext.com/
>> Performance Monitoring - http://sematext.com/spm/index.html
>


Re: Shard splitting and document routing

2013-06-18 Thread Mark Miller
No, the hash ranges are split and new docs go to both new shards.
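
The split itself is driven by the Collections API, e.g. something along these 
lines (the port, collection and shard names here are just the stock example 
values):

  http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1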

- Mark

On Jun 18, 2013, at 12:25 PM, Otis Gospodnetic  
wrote:

> Hi,
> 
> Imagine a (common) situation where you use document routing and you
> end up with 1 large shard (e.g. 1 large user with lots of docs).
> Shard splitting will help here, because we can break up that 1 shard
> in 2 smaller shards (and maybe do that "recursively" to make shards
> sufficiently small).
> 
> But what happens with document routing after a big shard is split?
> I assume new docs keep going to just one of the 2 new shards, right?
> 
> If so, does that mean that after a while one of the new shards will
> balloon again and the above procedure will need to be repeated?
> 
> Thanks,
> Otis
> --
> Solr & ElasticSearch Support --  http://sematext.com/
> Performance Monitoring - http://sematext.com/spm/index.html



Shard splitting and document routing

2013-06-18 Thread Otis Gospodnetic
Hi,

Imagine a (common) situation where you use document routing and you
end up with 1 large shard (e.g. 1 large user with lots of docs).
Shard splitting will help here, because we can break up that 1 shard
in 2 smaller shards (and maybe do that "recursively" to make shards
sufficiently small).

But what happens with document routing after a big shard is split?
I assume new docs keep going to just one of the 2 new shards, right?

If so, does that mean that after a while one of the new shards will
balloon again and the above procedure will need to be repeated?

Thanks,
Otis
--
Solr & ElasticSearch Support --  http://sematext.com/
Performance Monitoring - http://sematext.com/spm/index.html


Looking for Search Engineers

2013-06-18 Thread Jagdish Nomula
Hello,

SimplyHired.com, a job search engine with the biggest job index in the
world, is looking for engineers to help us with our core search and auction
systems.

Some of the problems you will be working on are,
a) Scaling to millions of requests
b) Working with millions of jobs
c) Maximizing the revenue and relevance for search queries(aka
multi-objective maximization problem)
d) Helping people find jobs

If you are interested in helping us out, please send your resume across.

Thanks,

*Jagdish Nomula*
Sr. Manager Search
Simply Hired, Inc.
370 San Aleso Ave., Ste 200
Sunnyvale, CA 94085

office - 408.400.4700
cell - 408.431.2916
email - jagd...@simplyhired.com 

www.simplyhired.com


[ANNOUNCE] Apache Solr 4.3.1 released

2013-06-18 Thread Shalin Shekhar Mangar
June 2013, Apache Solr™ 4.3.1 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.3.1

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.3.1 is available for immediate download at:
http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.3.1 includes 24 bug fixes. The list includes a lot of SolrCloud
bug fixes around Shard Splitting as well as some fixes in other areas.

See the CHANGES.txt file included with the release for a full list of
changes and further details. Please note that the fix for SOLR-4791 is
*NOT* part of this release even though the CHANGES.txt mentions it.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.

Happy searching,
Lucene/Solr developers


Re: Different scores for exact and non-exact matching

2013-06-18 Thread Otis Gospodnetic
Hi,

I think you are after indexing tokens with begin/end markers.
e.g.
"This is a sample string" becomes:
_This$
_is$
_a$
_sample$
_string$
+ (edge) ngrams of the above tokens

Then a query for /This string/ could become:

_This$^100 _string$^100 this string
(or something along those lines)

So this will match documents with "praying mantis" or "thistle" if
you've ngrammed the tokens during indexing, but docs with the proper
words "this" and "string" will be stronger matches because of the ^100
part.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/





On Mon, Jun 17, 2013 at 3:46 AM, Daniel Mosesson
 wrote:
> What I am looking to do is take a field (called "name", for example) that 
> contains a string like:
>
> "This is a sample string"
>
> and then query by that field so that a search for "This" gets x points (exact 
> match), "sam" gets y points (partial match).
>
> I attempted to do this via the sort and query parameters, like so:
>
> sort=if(name==This,100,50) but this gives me an error:
> sort param could not be parsed as a query, and is not a field that exists in 
> the index: if(name==This,100,50)
>
> Full URL:
> http://localhost:8983/solr/db/select?q=name%3A*&sort=if(name%3D%3D%22This%22%2C100%2C50)+asc&fl=price%2Cname&wt=xml&indent=true
>
> Is there a way to do this?
>
> Note: I believe that I can at least get the documents that need to be sorted 
> via (name:This AND name:This*) but then I do not know where to go from there 
> (as I can't seem to get sort working for any functions).
>
> Can anyone provide some examples for how to do this kind of thing?
>
> Thank you.
>


Re: How to define my data in schema.xml

2013-06-18 Thread Jack Krupansky
You can in fact have multiple collections in Solr and do a limited amount of 
joining, and Solr has multivalued fields as well, but none of those 
techniques should be used to avoid the process of flattening and 
denormalizing a relational data model. It is hard work, but yes, it is 
required to use Solr effectively.


Again, start with the queries - what problem are you trying to solve. Nobody 
stores data just for the sake of storing it - how will the data be used?


-- Jack Krupansky

-Original Message- 
From: Mysurf Mail

Sent: Tuesday, June 18, 2013 9:58 AM
To: solr-user@lucene.apache.org
Subject: Re: How to define my data in schema.xml

Hi Jack,
Thanks for your kind comment.

I am truly in the beginning of data modeling my schema over an existing
working DB.
I have used the school-teachers-student db as an example scenario.
(a: I wrote that as a disclaimer in my first post; b: I really do not
know anyone who has 300 hobbies either.)

In real life my db is obviously much different,
I just used this as an example of potential pitfalls that will occur if I
use my old db data modeling notions.
obviously, the old relational modeling idioms do not apply here.

Now, my question was referring to the fact that I would really like to
avoid a flat table/join/view for the reasons listed above.
So, my scenario is answering a plain user-generated text search over an
MS SQL DB that contains a few 1:n relations (and a few 1:n:n relationships).

So, I come here for tips. Should I use one combined index (treating it as a
NoSQL source), or separate indices, or something else? Are there other ways to
model relational data?
Thanks.



On Tue, Jun 18, 2013 at 4:30 PM, Jack Krupansky 
wrote:



It sounds like you still have a lot of work to do on your data model. No
matter how you slice it, 8 billion rows/fields/whatever is still way too
much for any engine to search on a single server. If you have 8 billion of
anything, a heavily sharded SolrCloud cluster is probably warranted. Don't
plan ahead to put more than 100 million rows on a single node; plan on a
proof of concept implementation to determine that number.

When we in Solr land say "flattened" or "denormalized", we mean in an
intelligent, "smart", thoughtful sense, not a mindless, mechanical
flattening. It is an opportunity for you to reconsider your data models,
both old and new.

Maybe data modeling is beyond your skill set. If so, have a chat with your
boss and ask for some assistance, training, whatever.

Actually, I am suspicious of your 8 billion number - change each of those
300's to realistic, average numbers. Each teacher teaches 300 courses?
Right. Each Student has 300 hobbies? If you say so, but...

Don't worry about schema.xml until you get your data model under control.

For an initial focus, try envisioning the use cases for user queries. That
will guide you in thinking about how the data would need to be organized to
satisfy those user queries.

-- Jack Krupansky

-Original Message- From: Mysurf Mail
Sent: Tuesday, June 18, 2013 2:20 AM
To: solr-user@lucene.apache.org
Subject: Re: How to define my data in schema.xml


Thanks for your reply.
I have tried the simplest approach and it works absolutely fantastic.
Huge table - 0s to result.

Two problems, as I described earlier, and that is what I try to solve:
1. I create a flat table just for Solr. This requires maintenance and
development. Can I run Solr over my regular tables?
   This is my simplest approach: working over my relational tables.
2. When you query a flat table by school name, as I described, if the
school has 300 students, 300 teachers, with 300 teacherCourses, 300
studentHobbies,
   you get 8.1 billion rows (300*300*300*300). As I am sure this will work
great on Solr - searching for the school name will retrieve 8.1 B rows.
3. Let's say all my searches are user-generated free text searches
searching the name and comments columns.
Thanks.


On Tue, Jun 18, 2013 at 7:32 AM, Gora Mohanty  wrote:

 On 18 June 2013 01:10, Mysurf Mail  wrote:

> Thanks for your quick reply. Here are some notes:
>
> 1. Consider that all tables in my example have two columns: Name &
> Description which I would like to index and search.
> 2. I have no other reason to create a flat table other than for Solr. So I
> would like to see if I can avoid it.
> 3. If in my example I will have a flat table then obviously it will hold a
> lot of rows for a single school.
> By searching the exact school name I will likely receive a lot of rows.
> (my flat table has its own pk)

Yes, all of this is definitely the case, but in practice
it does not matter. Solr can efficiently search through
millions of rows. To start with, just try the simplest
approach, and only complicate things as and when
needed.

> That is something I would like to avoid, and I thought I can avoid this
> by defining teachers and students as multiple value or something like this,
> and then teacherCourses and studentHobbies as 1:n respectively.

Re: SOLR Cloud - Disable Transaction Logs

2013-06-18 Thread Rishi Easwaran
SolrJ already has access to zookeeper cluster state. Network I/O bottleneck can 
be avoided by parallel requests. 
You are only as slow as your slowest responding server, which could be your 
single leader with the current set up.

Wouldn't this lessen the burden on the leader, as it does not have to maintain 
transaction logs or distribute to replicas? 
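
(For context, the transaction log being discussed is the one enabled by the 
updateLog element in the stock 4.x example solrconfig.xml, roughly:

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
  </updateHandler>
)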

 

 

 

-Original Message-
From: Shalin Shekhar Mangar 
To: solr-user 
Sent: Tue, Jun 18, 2013 2:05 am
Subject: Re: SOLR Cloud - Disable Transaction Logs


Yes, but at what cost? You are thinking of replacing disk IO with even
slower network IO. The transaction log is an append-only log -- it is
pretty cheap, especially if you compare it with the indexing process.
Plus your write requests/sec will drop a lot once you start doing
synchronous replication.


On Tue, Jun 18, 2013 at 2:18 AM, Rishi Easwaran wrote:

> Shalin,
>
> Just some thoughts.
>
> Near real-time replication - don't we use SolrCmdDistributor, which sends
> requests immediately to replicas with a cloned request? As an option, can't
> we achieve something similar from CloudSolrServer in SolrJ instead of the
> leader doing it? As long as 2 nodes receive writes and acknowledge,
> durability should be high.
> Peer sync and recovery - can we achieve that by merging indexes from the leader
> as needed, instead of replaying the transaction logs?
>
> Rishi.
>
>
>
>
>
>
>
> -Original Message-
> From: Shalin Shekhar Mangar 
> To: solr-user 
> Sent: Mon, Jun 17, 2013 3:43 pm
> Subject: Re: SOLR Cloud - Disable Transaction Logs
>
>
> It is also necessary for near real-time replication, peer sync and
> recovery.
>
>
> On Tue, Jun 18, 2013 at 1:04 AM, Rishi Easwaran  >wrote:
>
> > Hi,
> >
> > Is there a way to disable transaction logs in SOLR cloud. As far as I can
> > tell no.
> > Just curious why do we need transaction logs, seems like an I/O intensive
> > operation.
> > As long as I have replicatonFactor >1, if a node (leader) goes down, the
> > replica can take over and maintain a durable state of my index.
> >
> > I understand from the previous discussions, that it was intended for
> > update durability and realtime get.
> > But, unless I am missing something an ability to disable it in SOLR cloud
> > if not needed would be good.
> >
> > Thanks,
> >
> > Rishi.
> >
> >
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
>
>


-- 
Regards,
Shalin Shekhar Mangar.

 


Re: Solr Cloud Hangs consistently .

2013-06-18 Thread Rishi Easwaran
Mark,

All I am doing is inserts; afaik search-side deadlocks should not be an issue.

I am using JMeter, the standard test driver we use for most of our benchmarks and 
stats collection.
My jmeter.jmx file: http://apaste.info/79IS -- maybe I overlooked something.

Is there a benchmark script that the Solr community uses (preferably with JMeter)? 
We are write-heavy, so at the moment we are focusing on inserts only.

Thanks,

Rishi.

 

 

-Original Message-
From: Yago Riveiro 
To: solr-user 
Sent: Mon, Jun 17, 2013 6:19 pm
Subject: Re: Solr Cloud Hangs consistently .


I do all the indexing through HTTP POST; with replicationFactor=1 there is no problem, 
but if it is higher, deadlock problems can appear.

A stack trace like this 
http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067862
is what I get.

-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Monday, June 17, 2013 at 11:03 PM, Mark Miller wrote:

> If it actually happens with replicationFactor=1, it doesn't likely have 
anything to do with the update handler issue I'm referring to. In some cases 
like these, people have better luck with Jetty than Tomcat - we test it much 
more. For instance, it's setup to help avoid search side distributed deadlocks.
> 
> In any case, there is something special about it - I do and have seen a lot 
> of 
heavy indexing to SolrCloud by me and others without running into this. Both 
with replicationFacotor=1 and greater. So there is something specific in how 
the 
load is being done or what features/methods are being used that likely causes 
it 
or makes it easier to cause.
> 
> But again, the issue I know about involves threads that are not even created 
in the replicationFactor = 1 case, so that could be a first report afaik.
> 
> - Mark
> 
> On Jun 17, 2013, at 5:52 PM, Rishi Easwaran mailto:rishi.easwa...@aol.com)> wrote:
> 
> > Update!!
> > 
> > This happens with replicationFactor=1
> > Just for kicks I created a collection with a 24 shards, replicationfactor=1 
cluster on my exisiting benchmark env.
> > Same behaviour, SOLR cloud just hangs. Nothing in the logs, top/heap/cpu 
most metrics looks fine.
> > Only indication seems to be netstat showing incoming request not being read 
in.
> > 
> > Yago,
> > 
> > I saw your previous post 
> > (http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067631)
> > Following it, Last week, I upgraded to SOLR 4.3, to see if the issue gets 
fixed, but no luck.
> > Looks like this is a dominant and easily reproducible issue on SOLR cloud.
> > 
> > 
> > Thanks,
> > 
> > Rishi. 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > -Original Message-
> > From: Yago Riveiro mailto:yago.rive...@gmail.com)>
> > To: solr-user  > (mailto:solr-user@lucene.apache.org)>
> > Sent: Mon, Jun 17, 2013 5:15 pm
> > Subject: Re: Solr Cloud Hangs consistently .
> > 
> > 
> > I can confirm that the deadlock happen with only 2 replicas by shard. I 
> > need 

> > shutdown one node that host a replica of the shard to recover the 
> > indexation 

> > capability.
> > 
> > -- 
> > Yago Riveiro
> > Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> > 
> > 
> > On Monday, June 17, 2013 at 6:44 PM, Rishi Easwaran wrote:
> > 
> > > 
> > > 
> > > Hi All,
> > > 
> > > I am trying to benchmark SOLR Cloud and it consistently hangs. 
> > > Nothing in the logs, no stack trace, no errors, no warnings, just seems 
stuck.
> > > 
> > > A little bit about my set up. 
> > > I have 3 benchmark hosts, each with 96GB RAM, 24 CPU's and 1TB SSD. Each 
host 
> > > 
> > 
> > is configured to have 8 SOLR cloud nodes running at 4GB each.
> > > JVM configs: http://apaste.info/57Ai
> > > 
> > > My cluster has 12 shards with replication factor 2- 
> > > http://apaste.info/09sA
> > > 
> > > I originally stated with SOLR 4.2., tomcat 5 and jdk 6, as we are already 
> > running this configuration in production in Non-Cloud form. 
> > > It got stuck repeatedly.
> > > 
> > > I decided to upgrade to the latest and greatest of everything, SOLR 4.3, 
JDK7 
> > and tomcat7. 
> > > It still shows same behaviour and hangs through the test.
> > > 
> > > My test schema and config.
> > > Schema.xml - http://apaste.info/imah
> > > SolrConfig.xml - http://apaste.info/ku4F
> > > 
> > > The test is pretty simple. its a jmeter test with update command via SOAP 
rpc 
> > (round robin request across every node), adding in 5 fields from a csv file 
- 
> > id, guid, subject, body, compositeID (guid!id).
> > > number of jmeter threads = 150. loop count = 20, num of messages to 
add/per 
> > 
> > guid = 3; total 150*3*20 = 9000 documents. 
> > > 
> > > When cloud gets stuck, i don't get anything in the logs, but when i run 
> > netstat i see the following.
> > > Sample netstat on a stuck run. http://apaste.info/hr0O 
> > > hycl-d20 is my jmeter host. ssd-d01/2/3 are my cloud hosts.
> > > 
> > > At the moment my benchmarking efforts are at a stand still.
> > >

RE: How spell checker used if indexed document is containing misspelled words

2013-06-18 Thread Dyer, James
There are two newer parameters that work better than "onlyMorePopular":

spellcheck.alternativeTermCount
- This is the # of suggestions you want for terms that exist in the index.  You 
can set it the same as "spellcheck.count", or less if you don't want as many 
suggestions for these.
http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.alternativeTermCount

spellcheck.maxResultsForSuggest
- This lets you give a "did-you-mean" suggestion if the query only gets a few 
hits.  Useful if the user enters a misspelled terms that is in the index but 
could have gotten a lot more results if they had spelled it correctly.
http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.maxResultsForSuggest
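
A request combining them might look roughly like this (handler, query and the 
counts are just placeholder values, assuming the spellcheck component is wired 
into the handler):

  /select?q=delll ultra sharp&spellcheck=true&spellcheck.count=10
      &spellcheck.alternativeTermCount=5&spellcheck.maxResultsForSuggest=5
      &spellcheck.collate=true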

James Dyer
Ingram Content Group
(615) 213-4311

-Original Message-
From: Shreejay [mailto:shreej...@gmail.com] 
Sent: Friday, June 14, 2013 8:38 AM
To: solr-user@lucene.apache.org
Subject: Re: How spell checker used if indexed document is containing 
misspelled words

Hi,  

Have you tried this? 
http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular

Of course this is assuming that your corpus has correct words occurring more 
frequently than incorrect ones!  

-- 
Shreejay


On Friday, June 14, 2013 at 2:49, venkatesham.gu...@igate.com wrote:

> My data is picked from social media sites and misspelled words are very
> frequent in social text because of the informal mode of
> communication.Spellchecker does not work here because misspelled words are
> present in the text corpus and not in the search query. Finding documents
> with all the different misspelled forms of a given word is not possible
> using the spellchecker; how should I go ahead with this?
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-spell-checker-used-if-indexed-document-is-containing-misspelled-words-tp4070463.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> 



Re: what does a zero score mean?

2013-06-18 Thread Upayavira
debugQuery=true adds an extra block of XML to the bottom that will give
you extra info.

Alternatively, add fl=*,[explain] to your URL. That'll give you an extra
field in your output. Then, view the source to see it structured
properly.
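
For example, against the stock example core (core name and port are the 
defaults, not necessarily yours):

  http://localhost:8983/solr/collection1/select?q=apple&fl=*,score,[explain]&debugQuery=true&wt=xml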

Upayavira

On Tue, Jun 18, 2013, at 02:52 PM, Joe Zhang wrote:
> I did include "debugQuery=on" in the query, but nothing extra showed up
> in
> the response.
> 
> 
> On Mon, Jun 17, 2013 at 10:29 PM, Gora Mohanty 
> wrote:
> 
> > On 18 June 2013 10:49, Joe Zhang  wrote:
> > > I issued a simple query ("apple") to my collection and got 201 documents
> > > back, all of which are scored 0. What does this mean? --- The documents
> > do
> > > contain the query words.
> >
> > My guess is that the float-valued score is getting
> > converted to an integer. You could also try your
> > query with the parameter &debugQuery=on
> > to get an explanation of the scoring:
> > http://wiki.apache.org/solr/CommonQueryParameters#debugQuery
> >
> > Regards,
> > Gora
> >


Re: [ANN] Lux XML search engine

2013-06-18 Thread Michael Sokolov

On 06/18/2013 09:20 AM, Alexandre Rafalovitch wrote:
> On Tue, Jun 18, 2013 at 7:44 AM, Michael Sokolov wrote:
>> I'm pleased to announce the first public release of Lux (version 0.9.1), an
>> XML search engine embedding Saxon 9 and Lucene/Solr 4.
>
> Congratulations, this looks very interesting. I am guessing, this
> is/will be replacing MarkLogic that Safari Books Online used before.
>
> Regards,
>    Alex.
Alex, it can do some of what MarkLogic can do, yes, and I definitely 
drew some inspiration from their work, but we are continuing to use 
MarkLogic for many of our applications, and there's no plan to replace it.


--
Michael Sokolov
Senior Architect
Safari Books Online



Re: How to define my data in schema.xml

2013-06-18 Thread Mysurf Mail
Hi Jack,
Thanks for your kind comment.

I am truly in the beginning of data modeling my schema over an existing
working DB.
I have used the school-teachers-student db as an example scenario.
(a: I wrote that as a disclaimer in my first post; b: I really do not
know anyone who has 300 hobbies either.)

In real life my db is obviously much different,
I just used this as an example of potential pitfalls that will occur if I
use my old db data modeling notions.
obviously, the old relational modeling idioms do not apply here.

Now, my question was referring to the fact that I would really like to
avoid a flat table/join/view for the reasons listed above.
So, my scenario is answering a plain user-generated text search over an
MS SQL DB that contains a few 1:n relations (and a few 1:n:n relationships).

So, I come here for tips. Should I use one combined index (treating it as a
NoSQL source), or separate indices, or something else? Are there other ways to
model relational data?
Thanks.



On Tue, Jun 18, 2013 at 4:30 PM, Jack Krupansky wrote:

> It sounds like you still have a lot of work to do on your data model. No
> matter how you slice it, 8 billion rows/fields/whatever is still way too
> much for any engine to search on a single server. If you have 8 billion of
> anything, a heavily sharded SolrCloud cluster is probably warranted. Don't
> plan ahead to put more than 100 million rows on a single node; plan on a
> proof of concept implementation to determine that number.
>
> When we in Solr land say "flattened" or "denormalized", we mean in an
> intelligent, "smart", thoughtful sense, not a mindless, mechanical
> flattening. It is an opportunity for you to reconsider your data models,
> both old and new.
>
> Maybe data modeling is beyond your skill set. If so, have a chat with your
> boss and ask for some assistance, training, whatever.
>
> Actually, I am suspicious of your 8 billion number - change each of those
> 300's to realistic, average numbers. Each teacher teaches 300 courses?
> Right. Each Student has 300 hobbies? If you say so, but...
>
> Don't worry about schema.xml until you get your data model under control.
>
> For an initial focus, try envisioning the use cases for user queries. That
> will guide you in thinking about how the data would need to be organized to
> satisfy those user queries.
>
> -- Jack Krupansky
>
> -Original Message- From: Mysurf Mail
> Sent: Tuesday, June 18, 2013 2:20 AM
> To: solr-user@lucene.apache.org
> Subject: Re: How to define my data in schema.xml
>
>
> Thanks for your reply.
> I have tried the simplest approach and it works absolutely fantastic.
> Huge table - 0s to result.
>
> Two problems, as I described earlier, and that is what I try to solve:
> 1. I create a flat table just for Solr. This requires maintenance and
> development. Can I run Solr over my regular tables?
>    This is my simplest approach: working over my relational tables.
> 2. When you query a flat table by school name, as I described, if the
> school has 300 students, 300 teachers, with 300 teacherCourses, 300
> studentHobbies,
>    you get 8.1 billion rows (300*300*300*300). As I am sure this will work
> great on Solr - searching for the school name will retrieve 8.1 B rows.
> 3. Let's say all my searches are user-generated free text searches
> searching the name and comments columns.
> Thanks.
>
>
> On Tue, Jun 18, 2013 at 7:32 AM, Gora Mohanty  wrote:
>
>  On 18 June 2013 01:10, Mysurf Mail  wrote:
>> > Thanks for your quick reply. Here are some notes:
>> >
>> > 1. Consider that all tables in my example have two columns: Name &
>> > Description which I would like to index and search.
>> > 2. I have no other reason to create flat table other than for solar. So
>> > I
>> > would like to see if I can avoid it.
>> > 3. If in my example I will have a flat table then obviously it will hold
>> a
>> > lot of rows for a single school.
>> > By searching the exact school name I will likely receive a lot of
>> rows.
>> > (my flat table has its own pk)
>>
>> Yes, all of this is definitely the case, but in practice
>> it does not matter. Solr can efficiently search through
>> millions of rows. To start with, just try the simplest
>> approach, and only complicate things as and when
>> needed.
>>
>> > That is something I would like to avoid and I thought I can avoid
>> this
>> > by defining teachers and students as multiple value or something like
>> this
>> > and than teacherCourses and studentHobbies  as 1:n respectively.
>> > This is quite similiar to my real life demand, so I came here to get
>> > some tips as a solr noob.
>>
>> You have still not described what are the searches that
>> you would want to do. Again, I would suggest starting
>> with the most straightforward approach.
>>
>> Regards,
>> Gora
>>
>>
>


Re: what does a zero score mean?

2013-06-18 Thread Joe Zhang
I did include "debugQuery=on" in the query, but nothing extra showed up in
the response.


On Mon, Jun 17, 2013 at 10:29 PM, Gora Mohanty  wrote:

> On 18 June 2013 10:49, Joe Zhang  wrote:
> > I issued a simple query ("apple") to my collection and got 201 documents
> > back, all of which are scored 0. What does this mean? --- The documents
> do
> > contain the query words.
>
> My guess is that the float-valued score is getting
> converted to an integer. You could also try your
> query with the parameter &debugQuery=on
> to get an explanation of the scoring:
> http://wiki.apache.org/solr/CommonQueryParameters#debugQuery
>
> Regards,
> Gora
>


Re: Solr cloud: zkHost in solr.xml gets wiped out

2013-06-18 Thread Erick Erickson
OK, I think I see what's happening. If you do
NOT specify an instanceDir on the create
(and I'm doing this via the core admin
interface, not SolrJ) then the default is
used, but not persisted. If you _do_
specify the instance dir, it will be persisted.
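
(i.e. a create along these lines, with instanceDir passed explicitly -- core, 
collection and shard names here are just placeholders:

  http://localhost:8983/solr/admin/cores?action=CREATE&name=mycore&collection=mycollection&shard=shard1&instanceDir=mycore
)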

I've put up another quick patch (tested
only in my test case, running full suite
now). Can you give it a whirl? You'll have
to apply the patch over top of the current
4x; even though the patch is for trunk, it
applied to 4x cleanly for me and the tests ran.

Thanks,
Erick

On Tue, Jun 18, 2013 at 9:02 AM, Erick Erickson  wrote:
> OK, I put up a very preliminary patch attached to the bug
> if you want to try it out that addresses the extra junk being
> put in the  tag. Doesn't address the instanceDir issue
> since I haven't reproduced it yet.
>
> Erick
>
> On Tue, Jun 18, 2013 at 8:46 AM, Erick Erickson  
> wrote:
>> Whoa! What's this junk?
>> qt="/admin/cores" wt="javabin" version="2
>>
>> That shouldn't be being preserved, and the instancedir should be!
>>
>> So I'm guessing you're using SolrJ to create the core, but I just
>> reproduced the problem (at least the 'wt="json" ') bit from the
>> browser and even from one of my internal tests when I added
>> extra parameters.
>>
>> That said, instanceDir is being preserved in my test, so I'm not
>> seeing everything you're seeing, could you cut/paste your
>> create code? I'll see if I can set up a test case for SolrJ to catch
>> this too.
>>
>> See SOLR-4935
>>
>> Thanks for reporting!
>>
>> On Mon, Jun 17, 2013 at 5:39 PM, Al Wold  wrote:
>>> Hi Erick,
>>> I tried out your changes from the branch_4x branch. It looks good in terms 
>>> of preserving the zkHost, but I'm running into an exception because it 
>>> isn't persisting the instanceDir attribute on the  element.
>>>
>>> I've got a few other things I need to take care of, but as soon as I have 
>>> time I'll dig in and see if I can figure out what's going on, and see what 
>>> changed to make this not work.
>>>
>>> Here are details on what the files looked like before/after CREATE call:
>>>
>>> original solr.xml:
>>>
>>> 
>>> 
>>>   
>>>   >> hostContext="/"/>
>>> 
>>>
>>> here's what was produced with 4.3 branch + a quick mod to preserve zkHost:
>>>
>>> 
>>> 
>>>   >> hostContext="/">
>>> >> instanceDir="directory_shard1_replica1/" transient="false" 
>>> name="directory_shard1_replica1" collection="directory"/>
>>> >> instanceDir="directory_shard2_replica1/" transient="false" 
>>> name="directory_shard2_replica1" collection="directory"/>
>>>   
>>> 
>>>
>>> here's what was produced with branch_4x 4.4-SNAPSHOT:
>>>
>>> 
>>> 
>>>   >> distribUpdateSoTimeout="0" distribUpdateConnTimeout="0" hostPort="8080" 
>>> hostContext="/">
>>> >> collection="directory" qt="/admin/cores" wt="javabin" version="2"/>
>>> >> collection="directory" qt="/admin/cores" wt="javabin" version="2"/>
>>>   
>>> 
>>>
>>> and here's the error from solr.log after restarting after the CREATE:
>>>
>>> 2013-06-17 21:37:07,083 1874 [pool-2-thread-1] ERROR 
>>> org.apache.solr.core.CoreContainer  - null:java.lang.NullPointerException: 
>>> Missing required 'instanceDir'
>>> at 
>>> org.apache.solr.core.CoreDescriptor.doInit(CoreDescriptor.java:133)
>>> at 
>>> org.apache.solr.core.CoreDescriptor.(CoreDescriptor.java:87)
>>> at org.apache.solr.core.CoreContainer.load(CoreContainer.java:365)
>>> at org.apache.solr.core.CoreContainer.load(CoreContainer.java:221)
>>> at 
>>> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:190)
>>> at 
>>> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:124)
>>> at 
>>> org.apache.catalina.core.ApplicationFilterConfig.initFilter(ApplicationFilterConfig.java:277)
>>> at 
>>> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:258)
>>> at 
>>> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:382)
>>> at 
>>> org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:103)
>>> at 
>>> org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4638)
>>> at 
>>> org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5294)
>>> at 
>>> org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
>>> at 
>>> org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:895)
>>> at 
>>> org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:871)
>>> at 
>>> org.apache.catalina.core.StandardHost.addChild(StandardHost.java:615)
>>> at 
>>> org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1099)
>>> at 
>>> org.apache.catalina.startup.HostConfig$DeployDirectory.run(HostConfig.java:1621)
>>> at 
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>  

Re: How to define my data in schema.xml

2013-06-18 Thread Jack Krupansky
It sounds like you still have a lot of work to do on your data model. No 
matter how you slice it, 8 billion rows/fields/whatever is still way too 
much for any engine to search on a single server. If you have 8 billion of 
anything, a heavily sharded SolrCloud cluster is probably warranted. Don't 
plan ahead to put more than 100 million rows on a single node; plan on a 
proof of concept implementation to determine that number.


When we in Solr land say "flattened" or "denormalized", we mean in an 
intelligent, "smart", thoughtful sense, not a mindless, mechanical 
flattening. It is an opportunity for you to reconsider your data models, 
both old and new.


Maybe data modeling is beyond your skill set. If so, have a chat with your 
boss and ask for some assistance, training, whatever.


Actually, I am suspicious of your 8 billion number - change each of those 
300's to realistic, average numbers. Each teacher teaches 300 courses? 
Right. Each Student has 300 hobbies? If you say so, but...


Don't worry about schema.xml until you get your data model under control.

For an initial focus, try envisioning the use cases for user queries. That 
will guide you in thinking about how the data would need to be organized to 
satisfy those user queries.


-- Jack Krupansky

-Original Message- 
From: Mysurf Mail

Sent: Tuesday, June 18, 2013 2:20 AM
To: solr-user@lucene.apache.org
Subject: Re: How to define my data in schema.xml

Thanks for your reply.
I have tried the simplest approach and it works absolutely fantastic.
Huge table - 0s to result.

Two problems, as I described earlier, and that is what I try to solve:
1. I create a flat table just for Solr. This requires maintenance and
development. Can I run Solr over my regular tables?
   This is my simplest approach: working over my relational tables.
2. When you query a flat table by school name, as I described, if the
school has 300 students, 300 teachers, with 300 teacherCourses, 300
studentHobbies,
   you get 8.1 billion rows (300*300*300*300). As I am sure this will work
great on Solr - searching for the school name will retrieve 8.1 B rows.
3. Let's say all my searches are user-generated free text searches
searching the name and comments columns.
Thanks.


On Tue, Jun 18, 2013 at 7:32 AM, Gora Mohanty  wrote:


On 18 June 2013 01:10, Mysurf Mail  wrote:
> Thanks for your quick reply. Here are some notes:
>
> 1. Consider that all tables in my example have two columns: Name &
> Description which I would like to index and search.
> 2. I have no other reason to create flat table other than for solar. So 
> I

> would like to see if I can avoid it.
> 3. If in my example I will have a flat table then obviously it will hold
a
> lot of rows for a single school.
> By searching the exact school name I will likely receive a lot of
rows.
> (my flat table has its own pk)

Yes, all of this is definitely the case, but in practice
it does not matter. Solr can efficiently search through
millions of rows. To start with, just try the simplest
approach, and only complicate things as and when
needed.

> That is something I would like to avoid and I thought I can avoid
this
> by defining teachers and students as multiple value or something like
this
> and than teacherCourses and studentHobbies  as 1:n respectively.
> This is quite similiar to my real life demand, so I came here to get
> some tips as a solr noob.

You have still not described what are the searches that
you would want to do. Again, I would suggest starting
with the most straightforward approach.

Regards,
Gora





Re: [ANN] Lux XML search engine

2013-06-18 Thread Alexandre Rafalovitch
On Tue, Jun 18, 2013 at 7:44 AM, Michael Sokolov
 wrote:
> I'm pleased to announce the first public release of Lux (version 0.9.1), an
> XML search engine embedding Saxon 9 and Lucene/Solr 4.

Congratulations, this looks very interesting. I am guessing, this
is/will be replacing MarkLogic that Safari Books Online used before.

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


Re: Is there a way to encrypt username and pass in the solr config file

2013-06-18 Thread Mysurf Mail
@Gora: yes.
User name and pass.


On Tue, Jun 18, 2013 at 2:57 PM, Gora Mohanty  wrote:

> On 18 June 2013 17:16, Erick Erickson  wrote:
> > What do you mean "encrypt"? The stored value?
> > the indexed value? Over the wire?
> [...]
>
> My understanding was that he wanted to encrypt the
> username/password in the DIH configuration file.
> "Mysurf Mail", could you please clarify?
>
> Regards,
> Gora
>


Re: Need assistance in defining solr to process user generated query text

2013-06-18 Thread Mysurf Mail
great tip :-)


On Tue, Jun 18, 2013 at 2:36 PM, Erick Erickson wrote:

> if the _solr_ type is "string", then you aren't getting any
> tokenization, so "my dog has fleas" is indexed as
> "my dog has fleas", a single token. To search
> for individual words you need to use, say, the
> "text_general" type, which would index
> "my" "dog" "has" "fleas"
>
> Best
> Erick
>
> On Mon, Jun 17, 2013 at 11:26 AM, Mysurf Mail 
> wrote:
> > I have one fact table with a lot of string columns and a few GUIDs just
> for
> > retreival (Not for search)
> >
> >
> >
> > On Mon, Jun 17, 2013 at 6:01 PM, Jack Krupansky  >wrote:
> >
> >> It sounds like you have your text indexed in a "string" field (why the
> >> wildcards are needed), or that maybe you are using the "keyword"
> tokenizer
> >> rather than the standard tokenizer.
> >>
> >> What is your default or query fields for dismax/edismax? And what are
> the
> >> field types for those fields?
> >>
> >> -- Jack Krupansky
> >>
> >> -Original Message- From: Mysurf Mail
> >> Sent: Monday, June 17, 2013 10:51 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Need assistance in defining solr to process user generated
> query
> >> text
> >>
> >>
> >> Hi,
> >> I have been reading solr wiki pages and configured solr successfully
> over
> >> my flat table.
> >> I have a few question though regarding the querying and parsing of user
> >> generated text.
> >>
> >> 1. I have understood through this  >page
> >> that
> >>
> >> I want to use dismax.
> >>Through this  http://wiki.apache.org/solr/LocalParams>>page
> >> I can do it
> >>
> >> using localparams
> >>
> >>But I think the best way is to define this in my xml files.
> >>Can I do this?
> >>
> >> 2.in this  http://lucene.apache.org/solr/4_3_0/tutorial.html>
> >> >**tutorial
> >>
> >> (solr) the following query appears
> >>
> >>http://localhost:8983/solr/#/**collection1/query?q=video<
> http://localhost:8983/solr/#/collection1/query?q=video>
> >>
> >>When I want to query my fact table  I have to query using *video*.
> >>just video retrieves nothing.
> >>How can I query it using video only?
> >> 3. In this  http://wiki.apache.org/solr/ExtendedDisMax#Configuration>
> >> >**page
> >>
> >> it says that
> >> "Extended DisMax is already configured in the example configuration,
> with
> >> the name edismax"
> >> But I see it only in the /browse requestHandler
> >> as follows:
> >>
> >>
> >> 
> >> 
> >>   explicit
> >>...
> >>
> >>   edismax
> >>
> >> Do I use it also when I use select in my url ?
> >>
> >> 4. In general, I want to transfer a user generated text to my url
> request
> >> using the most standard rules (translate "",+,- signs to the q parameter
> >> value).
> >> What is the best way to
> >>
> >>
> >>
> >> Thanks.
> >>
>


Re: implementing identity authentication in SOLR

2013-06-18 Thread Mysurf Mail
Just to make sure.
In my previous question I was referring to the user/pass that queries the
db.

Now I was referring to the user/pass that i want for the solr http request.
Think of it as if my user sends a request where he filter documents created
by another user.
I want to restrict that.

I currently work in a .NET environment where we have identity provider that
provides trusted claims to the http request.
In similar situations I take the user name property from a trusted claim
and not from a parameter in the url .

I want to know how solr can restrict his http request/responses.
Thank you.


On Tue, Jun 18, 2013 at 10:56 AM, Gora Mohanty  wrote:

> On 18 June 2013 13:10, Mysurf Mail  wrote:
> > Hi,
> > In order to add solr to my prod environmnet I have to implement some
> > security restriction.
> > Is there a way to add user/pass to the requests and to keep them
> > *encrypted*in a file.
>
> As mentioned earlier, no there is no built-in way of doing that
> if you are using the Solr DataImportHandler.
>
> Probably the easiest way would be to implement your own
> indexing using a library like SolrJ. Then, you can handle encryption
> as you wish.
>
> Regards,
> Gora
>


Re: Solr cloud: zkHost in solr.xml gets wiped out

2013-06-18 Thread Erick Erickson
OK, I put up a very preliminary patch attached to the bug
if you want to try it out that addresses the extra junk being
put in the <core> tag. Doesn't address the instanceDir issue
since I haven't reproduced it yet.

Erick

On Tue, Jun 18, 2013 at 8:46 AM, Erick Erickson  wrote:
> Whoa! What's this junk?
> qt="/admin/cores" wt="javabin" version="2
>
> That shouldn't be being preserved, and the instancedir should be!
>
> So I'm guessing you're using SolrJ to create the core, but I just
> reproduced the problem (at least the 'wt="json" ') bit from the
> browser and even from one of my internal tests when I added
> extra parameters.
>
> That said, instanceDir is being preserved in my test, so I'm not
> seeing everything you're seeing, could you cut/paste your
> create code? I'll see if I can set up a test case for SolrJ to catch
> this too.
>
> See SOLR-4935
>
> Thanks for reporting!
>
> On Mon, Jun 17, 2013 at 5:39 PM, Al Wold  wrote:
>> Hi Erick,
>> I tried out your changes from the branch_4x branch. It looks good in terms 
>> of preserving the zkHost, but I'm running into an exception because it isn't 
>> persisting the instanceDir attribute on the <core> element.
>>
>> I've got a few other things I need to take care of, but as soon as I have 
>> time I'll dig in and see if I can figure out what's going on, and see what 
>> changed to make this not work.
>>
>> Here are details on what the files looked like before/after CREATE call:
>>
>> original solr.xml:
>>
>> 
>> 
>>   
>>   > hostContext="/"/>
>> 
>>
>> here's what was produced with 4.3 branch + a quick mod to preserve zkHost:
>>
>> 
>> 
>>   > hostContext="/">
>> > instanceDir="directory_shard1_replica1/" transient="false" 
>> name="directory_shard1_replica1" collection="directory"/>
>> > instanceDir="directory_shard2_replica1/" transient="false" 
>> name="directory_shard2_replica1" collection="directory"/>
>>   
>> 
>>
>> here's what was produced with branch_4x 4.4-SNAPSHOT:
>>
>> 
>> 
>>   > distribUpdateSoTimeout="0" distribUpdateConnTimeout="0" hostPort="8080" 
>> hostContext="/">
>> > collection="directory" qt="/admin/cores" wt="javabin" version="2"/>
>> > collection="directory" qt="/admin/cores" wt="javabin" version="2"/>
>>   
>> 
>>
>> and here's the error from solr.log after restarting after the CREATE:
>>
>> 2013-06-17 21:37:07,083 1874 [pool-2-thread-1] ERROR 
>> org.apache.solr.core.CoreContainer  - null:java.lang.NullPointerException: 
>> Missing required 'instanceDir'
>> at 
>> org.apache.solr.core.CoreDescriptor.doInit(CoreDescriptor.java:133)
>> at org.apache.solr.core.CoreDescriptor.<init>(CoreDescriptor.java:87)
>> at org.apache.solr.core.CoreContainer.load(CoreContainer.java:365)
>> at org.apache.solr.core.CoreContainer.load(CoreContainer.java:221)
>> at 
>> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:190)
>> at 
>> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:124)
>> at 
>> org.apache.catalina.core.ApplicationFilterConfig.initFilter(ApplicationFilterConfig.java:277)
>> at 
>> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:258)
>> at 
>> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:382)
>> at 
>> org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:103)
>> at 
>> org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4638)
>> at 
>> org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5294)
>> at 
>> org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
>> at 
>> org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:895)
>> at 
>> org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:871)
>> at 
>> org.apache.catalina.core.StandardHost.addChild(StandardHost.java:615)
>> at 
>> org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1099)
>> at 
>> org.apache.catalina.startup.HostConfig$DeployDirectory.run(HostConfig.java:1621)
>> at 
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>> at java.lang.Thread.run(Thread.java:679)
>>
>>
>> On Jun 16, 2013, at 5:38 AM, Erick Erickson wrote:
>>
>>> Al:
>>>
>>> As it happens, I hope sometime today to put up a patch for SOLR-4910
>>> that should harden up many things in persisting solr.xml, I'll be sure
>>> to include this. It's kind of a pain to create an automated test for
>>> this, so I'll give

Re: Solr large boolean filter

2013-06-18 Thread Otis Gospodnetic
Hi,

The unfortunate thing about this is that you still have to *pass* that
filter from the client to the server every time you want to use that
filter.  If that filter is big/long, passing it in every time has
a cost that could be eliminated by using "server-side named
filters".

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Tue, Jun 18, 2013 at 8:16 AM, Erick Erickson  wrote:
> You might consider "post filters". The idea
> is to write a custom filter that gets applied
> after all other filters etc. One use-case
> here is exactly ACL lists, and can be quite
> helpful if you're not doing *:* type queries.
>
> Best
> Erick
>
> On Mon, Jun 17, 2013 at 5:12 PM, Otis Gospodnetic
>  wrote:
>> Btw. ElasticSearch has a nice feature here.  Not sure what it's
>> called, but I call it "named filter".
>>
>> http://www.elasticsearch.org/blog/terms-filter-lookup/
>>
>> Maybe that's what OP was after?
>>
>> Otis
>> --
>> Solr & ElasticSearch Support
>> http://sematext.com/
>>
>>
>>
>>
>>
>> On Mon, Jun 17, 2013 at 4:59 PM, Alexandre Rafalovitch
>>  wrote:
>>> On Mon, Jun 17, 2013 at 12:35 PM, Igor Kustov  wrote:
 So I'm using query like
 http://127.0.0.1:8080/solr/select?q=*:*&fq={!mqparser}id:%281%202%203%29
>>>
>>> If the IDs are purely numeric, I wonder if the better way is to send a
>>> bitset. So, bit 1 is on if ID:1 is included, bit 2000 is on if ID:2000
>>> is included. Even using URL-encoding rules, you can fit at least 65
>>> sequential ID flags per character and I am sure there are more
>>> efficient encoding schemes for long empty sequences.
>>>
>>> Regards,
>>>Alex.
>>>
>>>
>>>
>>> Personal website: http://www.outerthoughts.com/
>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>> - Time is the quality of nature that keeps events from happening all
>>> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
>>> book)


Re: Solr cloud: zkHost in solr.xml gets wiped out

2013-06-18 Thread Erick Erickson
Whoa! What's this junk?
qt="/admin/cores" wt="javabin" version="2

That shouldn't be being preserved, and the instancedir should be!

So I'm guessing you're using SolrJ to create the core, but I just
reproduced the problem (at least the 'wt="json" ') bit from the
browser and even from one of my internal tests when I added
extra parameters.

That said, instanceDir is being preserved in my test, so I'm not
seeing everything you're seeing, could you cut/paste your
create code? I'll see if I can set up a test case for SolrJ to catch
this too.
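
For reference, a bare-bones SolrJ core CREATE is sketched below (untested,
with made-up host/core/collection names); if your code sets any extra
request parameters on the create call, those may be exactly what ends up
persisted into solr.xml:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class CreateCoreExample {
    public static void main(String[] args) throws Exception {
        // Core admin requests go to the container URL, not to a particular core
        HttpSolrServer server = new HttpSolrServer("http://localhost:8080/solr");

        CoreAdminRequest.Create create = new CoreAdminRequest.Create();
        create.setCoreName("directory_shard1_replica1");   // hypothetical names
        create.setInstanceDir("directory_shard1_replica1");
        create.setCollection("directory");
        create.process(server);                             // issues the CREATE call

        server.shutdown();
    }
}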

See SOLR-4935

Thanks for reporting!

On Mon, Jun 17, 2013 at 5:39 PM, Al Wold  wrote:
> Hi Erick,
> I tried out your changes from the branch_4x branch. It looks good in terms of 
> preserving the zkHost, but I'm running into an exception because it isn't 
> persisting the instanceDir attribute on the <core> element.
>
> I've got a few other things I need to take care of, but as soon as I have 
> time I'll dig in and see if I can figure out what's going on, and see what 
> changed to make this not work.
>
> Here are details on what the files looked like before/after CREATE call:
>
> original solr.xml:
>
> 
> 
>   
>hostContext="/"/>
> 
>
> here's what was produced with 4.3 branch + a quick mod to preserve zkHost:
>
> 
> 
>hostContext="/">
>  instanceDir="directory_shard1_replica1/" transient="false" 
> name="directory_shard1_replica1" collection="directory"/>
>  instanceDir="directory_shard2_replica1/" transient="false" 
> name="directory_shard2_replica1" collection="directory"/>
>   
> 
>
> here's what was produced with branch_4x 4.4-SNAPSHOT:
>
> 
> 
>distribUpdateSoTimeout="0" distribUpdateConnTimeout="0" hostPort="8080" 
> hostContext="/">
>  collection="directory" qt="/admin/cores" wt="javabin" version="2"/>
>  collection="directory" qt="/admin/cores" wt="javabin" version="2"/>
>   
> 
>
> and here's the error from solr.log after restarting after the CREATE:
>
> 2013-06-17 21:37:07,083 1874 [pool-2-thread-1] ERROR 
> org.apache.solr.core.CoreContainer  - null:java.lang.NullPointerException: 
> Missing required 'instanceDir'
> at org.apache.solr.core.CoreDescriptor.doInit(CoreDescriptor.java:133)
> at org.apache.solr.core.CoreDescriptor.<init>(CoreDescriptor.java:87)
> at org.apache.solr.core.CoreContainer.load(CoreContainer.java:365)
> at org.apache.solr.core.CoreContainer.load(CoreContainer.java:221)
> at 
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:190)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:124)
> at 
> org.apache.catalina.core.ApplicationFilterConfig.initFilter(ApplicationFilterConfig.java:277)
> at 
> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:258)
> at 
> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:382)
> at 
> org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:103)
> at 
> org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4638)
> at 
> org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5294)
> at 
> org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
> at 
> org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:895)
> at 
> org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:871)
> at 
> org.apache.catalina.core.StandardHost.addChild(StandardHost.java:615)
> at 
> org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1099)
> at 
> org.apache.catalina.startup.HostConfig$DeployDirectory.run(HostConfig.java:1621)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:679)
>
>
> On Jun 16, 2013, at 5:38 AM, Erick Erickson wrote:
>
>> Al:
>>
>> As it happens, I hope sometime today to put up a patch for SOLR-4910
>> that should harden up many things in persisting solr.xml, I'll be sure
>> to include this. It's kind of a pain to create an automated test for
>> this, so I'll give it a whirl manually.
>>
>> As you say, most of this is going away in 5.0, but it needs to work for 4.x.
>>
>> And when I get the patch up, if you could give it a "real world" try
>> it'd be great!
>>
>> Thanks,
>> Erick
>>
>> On Fri, Jun 14, 2013 at 6:15 PM, Al Wold  wrote:
>>> Hi,
>>> I'm working on setting up a solr cloud test environment, and the target 
>>> environment I need to put it in has multiple webapps

Re: Solr large boolean filter

2013-06-18 Thread Erick Erickson
You might consider "post filters". The idea
is to write a custom filter that gets applied
after all other filters etc. One use-case
here is exactly ACL lists, and can be quite
helpful if you're not doing *:* type queries.
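
A rough, untested sketch of such a post filter against the Solr 4.x APIs is
below; the numeric "id" field and the ACL set are assumptions, and you would
expose it to queries through a small QParserPlugin of your own:

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

/** Sketch of an ACL post filter: only passes docs whose numeric id is in the allowed set. */
public class AclPostFilter extends ExtendedQueryBase implements PostFilter {

    private final Set<Long> allowedIds;

    public AclPostFilter(Set<Long> allowedIds) {
        this.allowedIds = allowedIds;
        setCache(false);   // post filters must not be cached
        setCost(200);      // cost >= 100 is what makes Solr run this after the other filters
    }

    @Override
    public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
        return new DelegatingCollector() {
            private FieldCache.Longs ids;

            @Override
            public void setNextReader(AtomicReaderContext context) throws IOException {
                super.setNextReader(context);
                // assumes a single-valued numeric "id" field
                ids = FieldCache.DEFAULT.getLongs(context.reader(), "id", false);
            }

            @Override
            public void collect(int doc) throws IOException {
                if (allowedIds.contains(ids.get(doc))) {
                    super.collect(doc);   // pass the doc on to the wrapped collector
                }
            }
        };
    }
}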

Best
Erick

On Mon, Jun 17, 2013 at 5:12 PM, Otis Gospodnetic
 wrote:
> Btw. ElasticSearch has a nice feature here.  Not sure what it's
> called, but I call it "named filter".
>
> http://www.elasticsearch.org/blog/terms-filter-lookup/
>
> Maybe that's what OP was after?
>
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
>
>
>
>
>
> On Mon, Jun 17, 2013 at 4:59 PM, Alexandre Rafalovitch
>  wrote:
>> On Mon, Jun 17, 2013 at 12:35 PM, Igor Kustov  wrote:
>>> So I'm using query like
>>> http://127.0.0.1:8080/solr/select?q=*:*&fq={!mqparser}id:%281%202%203%29
>>
>> If the IDs are purely numeric, I wonder if the better way is to send a
>> bitset. So, bit 1 is on if ID:1 is included, bit 2000 is on if ID:2000
>> is included. Even using URL-encoding rules, you can fit at least 65
>> sequential ID flags per character and I am sure there are more
>> efficient encoding schemes for long empty sequences.
>>
>> Regards,
>>Alex.
>>
>>
>>
>> Personal website: http://www.outerthoughts.com/
>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>> - Time is the quality of nature that keeps events from happening all
>> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
>> book)


Re: Is there a way to encrypt username and pass in the solr config file

2013-06-18 Thread Gora Mohanty
On 18 June 2013 17:16, Erick Erickson  wrote:
> What do you mean "encrypt"? The stored value?
> the indexed value? Over the wire?
[...]

My understanding was that he wanted to encrypt the
username/password in the DIH configuration file.
"Mysurf Mail", could you please clarify?

Regards,
Gora


Running solr cloud

2013-06-18 Thread Daniel Mosesson
I cannot seem to get the default cloud setup to work properly.

What I did:
Downloaded the binaries, extracted.
Made the example directory my pwd (working directory).
Ran: java -Dbootstrap_confdir=./solr/collection1/conf 
-Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
And got the error message:  Caused by: 
org.apache.solr.common.cloud.ZooKeeperException: Specified config does not 
exist in ZooKeeper:collection1
Which caused follow-up messages, etc.

What am I doing wrong here?
Windows 7 pro





Re: Is there a way to encrypt username and pass in the solr config file

2013-06-18 Thread Erick Erickson
What do you mean "encrypt"? The stored value?
the indexed value? Over the wire?

Here's the problem with encrypting indexed terms...
you can't search them reliably. Any decent
encryption algorithm isn't going to let you, for
instance, search wildcards, since the encrypted
value for "awesome" had better produce different
leading bytes than "awe", so trying to search
on "aw*" will fail. If it doesn't fail, you might
as well not encrypt anything.

I have seen custom update processors that do
an encryption of the _stored_ data only and index
the raw terms. That at least removes some of
the context, but the raw terms are still in the index.

So in addition to Gora's comment, because you
absolutely _have_ to keep your server behind
firewalls etc, you can use a secure communication
channel and use an encrypting filesystem to keep
the media secure.

But if you're just talking about storing the values
in the index and you _don't_ need to search on them,
just encrypt them before you send them to the index
and store (but do not index) the data.
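
As a minimal illustration of that last approach, the SolrJ-side sketch below
encrypts a value on the client and puts it into a stored-only field; the
"ssn_enc" field name, the demo key handling and the plain AES mode are all
assumptions, not recommendations:

import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import org.apache.commons.codec.binary.Base64;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class EncryptBeforeIndexing {

    // Encrypt a value with AES and return it Base64-encoded so it survives as a plain string
    static String encrypt(String plain, byte[] key) throws Exception {
        Cipher cipher = Cipher.getInstance("AES");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
        return Base64.encodeBase64String(cipher.doFinal(plain.getBytes("UTF-8")));
    }

    public static void main(String[] args) throws Exception {
        byte[] key = "0123456789abcdef".getBytes("UTF-8"); // 128-bit demo key; load yours securely

        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("title", "some searchable text");          // indexed as usual
        doc.addField("ssn_enc", encrypt("123-45-6789", key));   // stored="true" indexed="false" field
        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}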

Best
Erick

On Mon, Jun 17, 2013 at 2:10 PM, Gora Mohanty  wrote:
> On 17 June 2013 21:41, Mysurf Mail  wrote:
>> Hi,
>> I want to encrypt (rsa maybe?) my user name/pass in solr .
>> Cant leave a simple plain text on the server.
>> What is the recomended way?
>
> I don't think that there is a way to encrypt this information
> at the moment.
>
> The recommended way would be to never expose your
> Solr server to the external world. The way to do that
> depends on your OS, and possibly the container in
> which you are running Solr.
>
> Regards,
> Gora


[ANN] Lux XML search engine

2013-06-18 Thread Michael Sokolov
I'm pleased to announce the first public release of Lux (version 0.9.1), 
an XML search engine embedding Saxon 9 and Lucene/Solr 4. Lux offers 
many features found in XML databases: persistent XML storage, 
index-optimized querying, an interactive query window, and some 
application support features - it is possible to build applications 
written exclusively in XQuery and XSLT using Lux. I call it a search 
engine, though, to indicate this is not a replacement for a 
full-featured database. I hope existing Solr/Lucene users dealing with 
XML documents will find Lux adds a unique and compelling query capability.


Lux grew out of the need to provide an XML-aware query capability for 
documents stored in and indexed by Solr/Lucene; it is currently in use 
at my workplace, both as an ad hoc query tool for developers and as part 
of our content ingestion process. We believe that others may find it 
useful as well, and are providing it under an open source license in the 
hope that wider exposure will lead to a longer, healthier life.


Lux is built on a production quality code base.  It has been rigorously 
tested, and all known critical defects have been resolved. However, it 
is in a fairly early stage of development (it's about a year old).  This 
means there are features users will miss.  Some of them we already know 
about (see the Plans page on the web site or the issue tracker), but the 
relatively narrow distribution so far means there are certainly more to 
be found.  I encourage you to report these on the mailing list (see 
below): this is a unique opportunity to influence the direction Lux 
takes in the future.


Lots more information, including downloads and setup instructions, is 
available at http://luxdb.org, source code is at 
http://github.com/msokolov/lux, issues are tracked at 
http://issues.luxdb.org/, and please let us know what you think using 
the (brand new) mailing list at lu...@luxdb.org, archived at 
https://groups.google.com/forum/?fromgroups#!topic/luxdb. (Please note 
the mailing list is hosted as a google group, which makes it a bit 
tricky to sign up with a non-gmail address, but you can.  To do that, 
just tack your email address on to this link: 
http://groups.google.com/group/luxdb/boxsubscribe?email=)


-Mike Sokolov

PS: apologies for cross-posting; there is some overlap among the 
readership of these groups, I know -- but they are not identical, and 
we'll keep any followup to a single list -- or on the luxdb list.


Re: Need assistance in defining solr to process user generated query text

2013-06-18 Thread Erick Erickson
if the _solr_ type is "string", then you aren't getting any
tokenization, so "my dog has fleas" is indexed as
"my dog has fleas", a single token. To search
for individual words you need to use, say, the
"text_general" type, which would index
"my" "dog" "has" "fleas"

Best
Erick

On Mon, Jun 17, 2013 at 11:26 AM, Mysurf Mail  wrote:
> I have one fact table with a lot of string columns and a few GUIDs just for
> retreival (Not for search)
>
>
>
> On Mon, Jun 17, 2013 at 6:01 PM, Jack Krupansky 
> wrote:
>
>> It sounds like you have your text indexed in a "string" field (why the
>> wildcards are needed), or that maybe you are using the "keyword" tokenizer
>> rather than the standard tokenizer.
>>
>> What is your default or query fields for dismax/edismax? And what are the
>> field types for those fields?
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Mysurf Mail
>> Sent: Monday, June 17, 2013 10:51 AM
>> To: solr-user@lucene.apache.org
>> Subject: Need assistance in defining solr to process user generated query
>> text
>>
>>
>> Hi,
>> I have been reading solr wiki pages and configured solr successfully over
>> my flat table.
>> I have a few question though regarding the querying and parsing of user
>> generated text.
>>
>> 1. I have understood through this page
>> that
>>
>> I want to use dismax.
>>    Through this http://wiki.apache.org/solr/LocalParams page
>> I can do it
>>
>> using localparams
>>
>>But I think the best way is to define this in my xml files.
>>Can I do this?
>>
>> 2. In this http://lucene.apache.org/solr/4_3_0/tutorial.html tutorial
>>
>> (solr) the following query appears
>>
>>    http://localhost:8983/solr/#/collection1/query?q=video
>>
>>When I want to query my fact table  I have to query using *video*.
>>just video retrieves nothing.
>>How can I query it using video only?
>> 3. In this http://wiki.apache.org/solr/ExtendedDisMax#Configuration page
>>
>> it says that
>> "Extended DisMax is already configured in the example configuration, with
>> the name edismax"
>> But I see it only in the /browse requestHandler
>> as follows:
>>
>>
>> <requestHandler name="/browse" class="solr.SearchHandler">
>>   <lst name="defaults">
>>     <str name="echoParams">explicit</str>
>>     ...
>>     <str name="defType">edismax</str>
>>
>> Do I use it also when I use select in my url ?
>>
>> 4. In general, I want to transfer a user generated text to my url request
>> using the most standard rules (translate "",+,- signs to the q parameter
>> value).
>> What is the best way to
>>
>>
>>
>> Thanks.
>>


Re: Returning both partial and complete match results in solr

2013-06-18 Thread Toke Eskildsen
On Tue, 2013-06-18 at 12:17 +0200, Prathik Puthran wrote:
> The 2nd query returns the complete matches as well. So I will have to
> filter out the complete matches from the partial match results.

Without testing:
(Brad OR Pitt) NOT (Brad AND Pitt)

Although that does require you to parse the query from the user and
re-write it.
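
For what it's worth, a rough SolrJ sketch of the two-query route could look
like this; the core URL and the "text" field are assumptions, and the query
strings would be built from the user's input:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

public class CompleteAndPartialMatches {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // 1st query: complete matches (both terms required)
        QueryResponse complete = solr.query(new SolrQuery("text:(Brad AND Pitt)"));

        // 2nd query: partial matches only, with the complete matches excluded
        QueryResponse partial = solr.query(new SolrQuery("text:((Brad OR Pitt) NOT (Brad AND Pitt))"));

        SolrDocumentList completeDocs = complete.getResults();
        SolrDocumentList partialDocs = partial.getResults();
        // build the {"CompleteMatch": [...], "PartialMatch": [...]} response from the two lists
        System.out.println("complete=" + completeDocs.getNumFound() + ", partial=" + partialDocs.getNumFound());

        solr.shutdown();
    }
}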



Re: Returning both partial and complete match results in solr

2013-06-18 Thread Prathik Puthran
The 2nd query returns the complete matches as well. So I will have to
filter out the complete matches from the partial match results.


On Tue, Jun 18, 2013 at 3:31 PM, Upayavira  wrote:

> With two queries.
>
> I'm not sure there's another way to do it. Unless you were prepared to
> get coding, and implement another SearchComponent, but given that you
> can achieve it with two queries, that seems overkill to me.
>
> Upayavira
>
> On Tue, Jun 18, 2013, at 10:59 AM, Prathik Puthran wrote:
> > Hi,
> >
> > I wanted to know if it is possible to tweak solr to return the results of
> > both complete and partial query matches.
> >
> > For eg:
> > If the search query is "Brad Pitt" and if the query parser is "AND" Solr
> > returns all documents indexed against the term "Brad Pitt".
> > If the query parser is "OR" Solr returns all the documents indexed
> > against
> > the term "Brad Pitt", "Brad", "Pitt".
> >
> > I want to the Solr to return the data in a way such that all the results
> > matched by the "AND" parser (i.e. Complete match) should be in a seperate
> > key- value pair in JSON response.
> > i.e. "CompleteMatch :[doc1, doc2, doc3...]"
> > and all the partial matches which are not part of complete match should
> > be
> > a seperate key-value pair in JSON response i.e.
> > "PartialMatch : [doc4, doc5, doc6].
> >
> > How can I achieve this?
> >
> > Thanks,
> > Prathik
>


Re: yet another optimize question

2013-06-18 Thread Andre Bois-Crettez

Recently we had steadily increasing memory usage and OOM due to facets
on dynamic fields.
The default facet.method=fc needs to build a large array of maxDocs ints
for each field (a fieldCache or fieldValueCache entry), whether it is
sparsely populated or not.

Once you have reduced your number of maxDocs with the merge policy, it
can be interesting to try facet.method=enum for all the sparsely
populated dynamic fields.
Despite what is said in the wiki, in our case the performance was
similar to facet.method=fc, however the JVM heap usage went down from
about 20GB to 4GB.
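
For anyone wanting to try the same thing, the method can be overridden per
field right on the request; a small SolrJ sketch with made-up field names:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class EnumFacetExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetField("category", "attr_color_s");     // one dense field, one sparse dynamic field
        q.set("f.attr_color_s.facet.method", "enum");    // per-field override: enum only for the sparse field

        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getFacetFields());
        solr.shutdown();
    }
}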

André

On 06/17/2013 08:21 PM, Petersen, Robert wrote:

Also some time ago I made all our caches small enough to keep us from getting 
OOMs while still having a good hit rate.Our index has about 50 fields which 
are mostly int IDs and there are some dynamic fields also.  These dynamic 
fields can be used for custom faceting.  We have some standard facets we always 
facet on and other dynamic facets which are only used if the query is filtering 
on a particular category.  There are hundreds of these fields but since they 
are only for a small subset of the overall index they are very sparsely 
populated with regard to the overall index.

--
André Bois-Crettez

Search technology, Kelkoo
http://www.kelkoo.com/


Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.


Re: Returning both partial and complete match results in solr

2013-06-18 Thread Upayavira
With two queries. 

I'm not sure there's another way to do it. Unless you were prepared to
get coding, and implement another SearchComponent, but given that you
can achieve it with two queries, that seems overkill to me.

Upayavira

On Tue, Jun 18, 2013, at 10:59 AM, Prathik Puthran wrote:
> Hi,
> 
> I wanted to know if it is possible to tweak solr to return the results of
> both complete and partial query matches.
> 
> For eg:
> If the search query is "Brad Pitt" and if the query parser is "AND" Solr
> returns all documents indexed against the term "Brad Pitt".
> If the query parser is "OR" Solr returns all the documents indexed
> against
> the term "Brad Pitt", "Brad", "Pitt".
> 
> I want to the Solr to return the data in a way such that all the results
> matched by the "AND" parser (i.e. Complete match) should be in a seperate
> key- value pair in JSON response.
> i.e. "CompleteMatch :[doc1, doc2, doc3...]"
> and all the partial matches which are not part of complete match should
> be
> a seperate key-value pair in JSON response i.e.
> "PartialMatch : [doc4, doc5, doc6].
> 
> How can I achieve this?
> 
> Thanks,
> Prathik


Returning both partial and complete match results in solr

2013-06-18 Thread Prathik Puthran
Hi,

I wanted to know if it is possible to tweak solr to return the results of
both complete and partial query matches.

For eg:
If the search query is "Brad Pitt" and if the query parser is "AND" Solr
returns all documents indexed against the term "Brad Pitt".
If the query parser is "OR" Solr returns all the documents indexed against
the term "Brad Pitt", "Brad", "Pitt".

I want Solr to return the data in a way such that all the results
matched by the "AND" parser (i.e. complete matches) are in a separate
key-value pair in the JSON response,
i.e. "CompleteMatch": [doc1, doc2, doc3...]
and all the partial matches which are not part of the complete match are in
a separate key-value pair in the JSON response, i.e.
"PartialMatch": [doc4, doc5, doc6].

How can I achieve this?

Thanks,
Prathik


Re: Shard identification

2013-06-18 Thread Upayavira
What version of Solr? I had something like this on 4.2.1. Upgrading to
4.3 sorted it.

Upayavira

On Tue, Jun 18, 2013, at 09:37 AM, Ophir Michaeli wrote:
> Hi, 
> 
> I built a 2 shards and 2 replicas system that works ok on a local
> machine, 1
> zookeeper on shard 1. 
> It appears ok on the solar monitor page, cloud tab
> (http://localhost:8983/solr/#/~cloud).
> When I move to using different machines, each shard/replica on a
> different
> machine I get a wrong cloud-graph on the Solr monitoring page.
> The machine that has Shard 2 appears on the graph on shard 1, and the
> replicas are also mixed, shard 2 appears as 1 and shard 1 appears as 2.
> 
> Any ideas why this happens?
> 
> Thanks,
> Ophir


Re: How to get SolrJ-serialization / binary-size statistics ?

2013-06-18 Thread Ralf Heyde

Hello,

just for information: the solution might look like this (1st approach):

I take the source code of the BinaryResponseWriter and surround the
serialization with some tracking methods.
Then I create a custom QueryResponseWriter that builds on the binary
response writer, and voila, I get my statistics.
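
A rough sketch of that wrapper on Solr 4.x might look like the following
(extending the stock writer rather than copying its source, and assuming
commons-io is available for the byte counting); it would be registered in
solrconfig.xml in place of the stock "javabin" queryResponseWriter:

import java.io.IOException;
import java.io.OutputStream;

import org.apache.commons.io.output.CountingOutputStream;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.BinaryResponseWriter;
import org.apache.solr.response.SolrQueryResponse;

/** Wraps the stock javabin writer and reports how many bytes each response serializes to. */
public class SizeTrackingResponseWriter extends BinaryResponseWriter {

    @Override
    public void write(OutputStream out, SolrQueryRequest req, SolrQueryResponse resp) throws IOException {
        CountingOutputStream counting = new CountingOutputStream(out);
        long start = System.nanoTime();
        super.write(counting, req, resp);
        long micros = (System.nanoTime() - start) / 1000;
        // replace with whatever metrics collection you use (JMX, statsd, plain logging, ...)
        System.out.println("javabin response: " + counting.getByteCount() + " bytes in " + micros + " us");
    }
}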


I'll keep you updated.


Regards,
Ralf


Shard identification

2013-06-18 Thread Ophir Michaeli
Hi, 

I built a 2 shards and 2 replicas system that works ok on a local machine, 1
zookeeper on shard 1. 
It appears ok on the Solr monitoring page, cloud tab
(http://localhost:8983/solr/#/~cloud).
When I move to using different machines, each shard/replica on a different
machine I get a wrong cloud-graph on the Solr monitoring page.
The machine that has Shard 2 appears on the graph on shard 1, and the
replicas are also mixed, shard 2 appears as 1 and shard 1 appears as 2.

Any ideas why this happens?

Thanks,
Ophir


Re: implementing identity authentication in SOLR

2013-06-18 Thread Gora Mohanty
On 18 June 2013 13:10, Mysurf Mail  wrote:
> Hi,
> In order to add solr to my prod environmnet I have to implement some
> security restriction.
> Is there a way to add user/pass to the requests and to keep them
> *encrypted*in a file.

As mentioned earlier, no there is no built-in way of doing that
if you are using the Solr DataImportHandler.

Probably the easiest way would be to implement your own
indexing using a library like SolrJ. Then, you can handle encryption
as you wish.
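
A minimal sketch of that route is below, pulling rows over JDBC and pushing
them with SolrJ; the table, columns, URLs and the decrypt() helper are all
hypothetical, the point is only that the database password never has to sit
in a Solr config file:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class EncryptedCredentialIndexer {

    // Hypothetical helper: decrypt the credentials however your environment requires
    static String decrypt(String encrypted) {
        return encrypted; // placeholder
    }

    public static void main(String[] args) throws Exception {
        String dbUser = decrypt(System.getenv("DB_USER_ENC"));
        String dbPass = decrypt(System.getenv("DB_PASS_ENC"));

        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        try (Connection con = DriverManager.getConnection("jdbc:mysql://dbhost/mydb", dbUser, dbPass);
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, title FROM documents")) {
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getString("id"));
                doc.addField("title", rs.getString("title"));
                solr.add(doc);
            }
        }
        solr.commit();
        solr.shutdown();
    }
}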

Regards,
Gora


implementing identity authentication in SOLR

2013-06-18 Thread Mysurf Mail
Hi,
In order to add Solr to my prod environment I have to implement some
security restrictions.
Is there a way to add user/pass to the requests and to keep them
*encrypted* in a file?
Thanks.