RE: Returned number of result rows as a function of maxScore or numFound.

2016-06-09 Thread Prasanna Josium
Thanks Erick & Binoy,
I will try out the two-query technique. Guess this will work for the 
numFound-related issue.

Guess I was not very clear in stating my problem. The problem I'm dealing with 
is mostly maxScore.
I have a collection (~500K docs) where I look for matches to the query.
Because of the nature of the data in the collection, some documents get a very 
high score which quickly fades to a very low score for the others (roughly 5 down 
to 0.5). For some queries, even within the first 10 docs, 8 have scores between 
5 and 3.8, and from the 9th onwards the score falls to 0.4, 0.3, and so on into 
a long tail.

The business guys think that docs with a very low score compared to the 
high-scoring ones should not be part of the result set, and must be cut off 
below a threshold defined as a percent of maxScore. Any thoughts about how to 
work with maxScore?
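
What I have in mind is something like the following on the client side (a SolrJ 
sketch only; it assumes the score is requested back via fl=*,score and that 
"solr" is an already-built client):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;

    SolrQuery q = new SolrQuery("foo");
    q.setFields("*", "score");                // ask Solr to return the score per doc
    q.setRows(100);
    QueryResponse rsp = solr.query(q);
    SolrDocumentList results = rsp.getResults();

    float maxScore = results.getMaxScore();   // e.g. 4.5
    float threshold = 0.9f * maxScore;        // keep docs within 10% of maxScore

    List<SolrDocument> keep = new ArrayList<>();
    for (SolrDocument doc : results) {
        if ((Float) doc.getFieldValue("score") >= threshold) {
            keep.add(doc);
        }
    }

but ideally Solr would apply the cut-off itself so we don't have to over-fetch rows.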

Thanks 
Prasanna Josium




-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 09 June 2016 22:43
To: solr-user
Subject: Re: Returned number of result rows as a function of maxScore or 
numFound.

Why do this at all? I have a hard time understanding what benefit this is to 
the _user_.

And even returning 5% is risky. I mean what happens for a query of *:*? For a 
corpus of 100M docs that's still 5M documents, which would hurt.

Sure, you say, well I'll cap it at XXX docs. The principle still holds though.
Users usually don't want to deal with very many docs at a time.

If you must do this for some kind of reporting or something, just fire two 
queries. The first has a rows of 0 and the second has a rows=5% of what was 
returned the first time.

Under the covers, you really can't do this without writing some sort of custom 
collector. Solr (Well, Lucene) uses the rows parameter as the dimension of the 
list where the most relevant docs are stored, and replaced as "better" docs 
come along. You can't know how many docs are going to be found before you score 
them all.
So how would you know what 5% was when you start? You'd have to write something 
that would keep 20X whatever your max was set to and then grow it as 
necessary but by that time you _might_ have already thrown away docs that 
should be in the expanded list... Or you'd have to keep _all_ the results 
which would be very expensive usually.

All in all, I think a 2-query solution is much simpler than hacking into your 
own collector, not to mention far more efficient in the general case.

Best,
Erick

On Wed, Jun 8, 2016 at 10:26 PM, Binoy Dalal  wrote:
> I don't think you can do such a thing ootb with solr but this is 
> pretty easy to achieve using a custom search component.
>
> Just write some custom code which will limit your resultset and plug 
> it into your request handler as the last component.
>
> On Thu, 9 Jun 2016, 08:53 Prasanna Josium, 
> 
> wrote:
>
>> Hi,
>> I use a DSE stack which has Solr 4.10.
>> I want to control the number of rows from result set as a percent of 
>> the max hit 'numFound' or  'maxScore' for a query.
>> e.g.,
>> 1) for a query 'foo', if I get 100 hits and I want to get the top 
>> 5% (say rows=5%), then I get only 5 rows.
>> for a query 'bar', if I get 1000 hits, I want to get the top 5% 
>> (rows=5%).Then I get top 50 rows.
>>
>> 2) for a query 'foo', if the maxScore is 4.5, I want to get, say, all 
>> records within 10% of maxScore, i.e. all records whose score is between
>> 4.5 and 4.0 (this could be any number of records)
>>
>> in  other words, the returned set is a percent of hits, instead of a 
>> static row count.
>> Is there a way to do this readily or via some custom implementation?
>>
>> Thanks
>> Cheers
>> Prasanna Josium
>>
> --
> Regards,
> Binoy Dalal


Bypassing ExtractingRequestHandler

2016-06-09 Thread Justin Lee
Has anybody had any experience bypassing ExtractingRequestHandler and
simply managing Tika manually?  I want to make a small modification to Tika
to get and save additional data from my PDFs, but I have been
procrastinating in no small part due to the unpleasant prospect of setting
up a development environment where I could compile and debug modifications
that might run through PDFBox, Tika, and ExtractingRequestHandler.  It
occurs to me that it would be much easier if the two were separate, so I
could have direct control over Tika and just submit the text to Solr after
extraction.  Am I going to regret this approach?  I'm not sure what
ExtractingRequestHandler really does for me that Tika doesn't already do.
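
Roughly, the approach I have in mind is this (just a sketch; the core name and 
field names are placeholders, and the client class would be whatever fits the 
SolrJ version we end up on):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaThenSolr {
      public static void main(String[] args) throws Exception {
        Path pdf = Paths.get(args[0]);

        // 1. Run Tika (or a patched Tika/PDFBox) directly, keeping full control
        //    over what gets extracted.
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler text = new BodyContentHandler(-1);  // -1 = no write limit
        Metadata meta = new Metadata();
        try (InputStream in = Files.newInputStream(pdf)) {
          parser.parse(in, text, meta);
        }

        // 2. Submit plain fields to Solr; no ExtractingRequestHandler involved.
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/documents");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", pdf.getFileName().toString());
        doc.addField("content", text.toString());
        solr.add(doc);
        solr.commit();
        solr.close();
      }
    }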

Also, I was reading this stackoverflow entry and someone offhandedly mentioned that
ExtractingRequestHandler might be separated in the future anyway. Is there
a public roadmap for the project, or does one have to keep up with the
developer's mailing list and hunt through JIRA entries to keep up with the
pulse of the project?

Thanks,
Justin


Re: Question about multiple fq parameters

2016-06-09 Thread Ahmet Arslan
Hi Mikhail,

Can you please explain what this mysterious op parameter is?
How is it related to range queries issued on date fields?

Thanks,
Ahmet


On Thursday, June 9, 2016 11:43 AM, Mikhail Khludnev 
 wrote:
Shawn,
I found "op" at
org.apache.solr.schema.DateRangeField.parseSpatialArgs(QParser, String).


On Thu, Jun 9, 2016 at 1:46 AM, Shawn Heisey  wrote:

> On 6/8/2016 2:28 PM, Steven White wrote:
> > ?q=*=OR={!field+f=DateA+op=Intersects}[2020-01-01+TO+2030-01-01]
>
> Looking at this and checking the code for the Field query parser, I
> cannot see how what you have used above is any different than:
>
> fq=DateA:[2020-01-01 TO 2030-01-01]
>
> The "op=Intersects" parameter that you have included appears to be
> ignored by the parser code that I examined.
>
> If my understanding of the documentation and the code is correct, then
> you should be able to use this:
>
> fq=DateB:[2000-01-01 TO 2020-01-01] OR DateA:[2020-01-01 TO 2030-01-01]
>
> In my examples I have changed the URL encoded "+" character back to a
> regular space.
>
> Thanks,
> Shawn
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Scoring changes between 4.10 and 5.5

2016-06-09 Thread Ahmet Arslan
Hi,

I wondered the same before and failed to decipher TFIDFSimilarity.
Scoring looks like tf*idf*idf to me.

I appreciate someone who will shed some light on this.
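
For reference, the "practical scoring function" in the TFIDFSimilarity javadoc is 
(roughly, from memory -- check the javadoc of your exact version):

    score(q,d) = coord(q,d) * queryNorm(q)
                 * sum over terms t in q of [ tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) ]

    queryNorm(q) = 1 / sqrt( sum over terms t in q of (idf(t) * t.getBoost())^2 )

so idf really does appear twice, once in the query weight and once in the field 
weight. With the default queryNorm a single-term query cancels one idf again 
(queryNorm = 1/idf), but when queryNorm is reported as 1.0, as in the 5.5 explain 
below, the product works out to tf * idf^2.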

Thanks,
Ahmet



On Friday, June 10, 2016 12:37 AM, Upayavira  wrote:
I've just done a very simple, single term query against a 4.10 system
and a 5.5 system, each with much the same data.

The score for the 4.10 system was essentially made up of the field
weight, which is:
   score = tf * idf 

Whereas, in the 5.5 system, there is an additional "query weight", which
is idf * query norm. If query norm is 1, then the final score is now:
  score = query_weight * field_weight
  = ( idf * 1 ) * (tf * idf)
  = tf * idf^2

Can anyone explain why this new "query weight" element has appeared in
our scores somewhere between 4.10 and 5.5?

Thanks!

Upayavira

4.10 score 
  "2937439": {
"match": true,
"value": 5.5993805,
"description": "weight(description:obama in 394012)
[DefaultSimilarity], result of:",
"details": [
  {
"match": true,
"value": 5.5993805,
"description": "fieldWeight in 394012, product of:",
"details": [
  {
"match": true,
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
  {
"match": true,
"value": 1,
"description": "termFreq=1.0"
  }
]
  },
  {
"match": true,
"value": 5.5993805,
"description": "idf(docFreq=56010, maxDocs=5568765)"
  },
  {
"match": true,
"value": 1,
"description": "fieldNorm(doc=394012)"
  }
]
  }
]
5.5 score 
  "2502281":{
"match":true,
"value":28.51136,
"description":"weight(description:obama in 43472) [], result
of:",
"details":[{
"match":true,
"value":28.51136,
"description":"score(doc=43472,freq=1.0), product of:",
"details":[{
"match":true,
"value":5.339603,
"description":"queryWeight, product of:",
"details":[{
"match":true,
"value":5.339603,
"description":"idf(docFreq=31905,
maxDocs=2446459)"},
  {
"match":true,
"value":1.0,
"description":"queryNorm"}]},
  {
"match":true,
"value":5.339603,
"description":"fieldWeight in 43472, product of:",
"details":[{
"match":true,
"value":1.0,
"description":"tf(freq=1.0), with freq of:",
"details":[{
"match":true,
"value":1.0,
"description":"termFreq=1.0"}]},
  {
"match":true,
"value":5.339603,
"description":"idf(docFreq=31905,
maxDocs=2446459)"},
  {
"match":true,
"value":1.0,
"description":"fieldNorm(doc=43472)"}]}]}]},


Re: Solutions for Multi-word Synonyms

2016-06-09 Thread MaryJo Sminkey
Thanks, added my vote (which threw an error but looks like it did get
added).

MJ



On Thu, Jun 9, 2016 at 5:41 PM, Upayavira  wrote:

> Here's a recently created ticket that covers this issue:
>
> https://issues.apache.org/jira/browse/SOLR-9185
>
> Let's hope we see some traction on it soon, as many people suffer from
> this issue.
>
> Upayavira
>
> On Thu, 9 Jun 2016, at 09:10 PM, MaryJo Sminkey wrote:
> > On Thu, Jun 9, 2016 at 1:50 PM, Joe Lawson <
> > jlaw...@opensourceconnections.com> wrote:
> >
> > > The auto-phrasing-token (APT) filter is a two-pronged solution that
> > > requires index and query time processes versus hon-lucene-synonyms (HLS)
> > > which is strictly a query time implementation. The primary takeaway from
> > > that is, APT requires reindexing your data when you update the autophrases
> > > and synonyms while HLS does not.
> > >
> >
> >
> > Yup, understood about the indexing, that is not a big issue for us as we
> > rarely change the synonym list and re-index frequently.
> >
> > MJ
> >
> >
>


Re: Solutions for Multi-word Synonyms

2016-06-09 Thread Upayavira
Here's a recently created ticket that covers this issue:

https://issues.apache.org/jira/browse/SOLR-9185

Let's hope we see some traction on it soon, as many people suffer from
this issue.

Upayavira

On Thu, 9 Jun 2016, at 09:10 PM, MaryJo Sminkey wrote:
> On Thu, Jun 9, 2016 at 1:50 PM, Joe Lawson <
> jlaw...@opensourceconnections.com> wrote:
> 
> > The auto-phrasing-token (APT) filter is a two-pronged solution that
> > requires index and query time processes versus hon-lucene-synonyms (HLS)
> > which is strictly a query time implementation. The primary takeaway from
> > that is, APT requires reindexing your data when you update the autophrases
> > and synonyms while HLS does not.
> >
> 
> 
> Yup, understood about the indexing, that is not a big issue for us as we
> rarely change the synonym list and re-index frequently.
> 
> MJ
> 
> 


Scoring changes between 4.10 and 5.5

2016-06-09 Thread Upayavira
I've just done a very simple, single term query against a 4.10 system
and a 5.5 system, each with much the same data.

The score for the 4.10 system was essentially made up of the field
weight, which is:
   score = tf * idf 

Whereas, in the 5.5 system, there is an additional "query weight", which
is idf * query norm. If query norm is 1, then the final score is now:
  score = query_weight * field_weight
  = ( idf * 1 ) * (tf * idf)
  = tf * idf^2

Can anyone explain why this new "query weight" element has appeared in
our scores somewhere between 4.10 and 5.5?

Thanks!

Upayavira

4.10 score 
  "2937439": {
"match": true,
"value": 5.5993805,
"description": "weight(description:obama in 394012)
[DefaultSimilarity], result of:",
"details": [
  {
"match": true,
"value": 5.5993805,
"description": "fieldWeight in 394012, product of:",
"details": [
  {
"match": true,
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
  {
"match": true,
"value": 1,
"description": "termFreq=1.0"
  }
]
  },
  {
"match": true,
"value": 5.5993805,
"description": "idf(docFreq=56010, maxDocs=5568765)"
  },
  {
"match": true,
"value": 1,
"description": "fieldNorm(doc=394012)"
  }
]
  }
]
5.5 score 
  "2502281":{
"match":true,
"value":28.51136,
"description":"weight(description:obama in 43472) [], result
of:",
"details":[{
"match":true,
"value":28.51136,
"description":"score(doc=43472,freq=1.0), product of:",
"details":[{
"match":true,
"value":5.339603,
"description":"queryWeight, product of:",
"details":[{
"match":true,
"value":5.339603,
"description":"idf(docFreq=31905,
maxDocs=2446459)"},
  {
"match":true,
"value":1.0,
"description":"queryNorm"}]},
  {
"match":true,
"value":5.339603,
"description":"fieldWeight in 43472, product of:",
"details":[{
"match":true,
"value":1.0,
"description":"tf(freq=1.0), with freq of:",
"details":[{
"match":true,
"value":1.0,
"description":"termFreq=1.0"}]},
  {
"match":true,
"value":5.339603,
"description":"idf(docFreq=31905,
maxDocs=2446459)"},
  {
"match":true,
"value":1.0,
"description":"fieldNorm(doc=43472)"}]}]}]},


suggester stack overflow

2016-06-09 Thread Rick Leir
I know how to debug this, but am hoping someone can give me a tip before
I dive in! 

Solr 6.0.0, I just started the server, hoping to build the suggester.

from the log:
3625 INFO  (searcherExecutor-7-thread-1-processing-x:blinkmon) [
x:blinkmon] o.a.s.s.s.SolrSuggester SolrSuggester.build(mySuggester)
3714 ERROR (searcherExecutor-7-thread-1-processing-x:blinkmon) [
x:blinkmon] o.a.s.c.SolrCore null:java.lang.StackOverflowError
at
org.apache.lucene.util.automaton.Automaton.getNumTransitions(Automaton.java:350)
at
org.apache.lucene.util.automaton.Automaton.initTransition(Automaton.java:487)
at
org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1306)
at
org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1311)
at
org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1311)

...repeats... the console shows "SolrCore null:java.lang.StackOverflowError"

from solrconfig.xml:

  <searchComponent name="suggest" class="solr.SuggestComponent">
    <lst name="suggester">
      <str name="name">mySuggester</str>
      <str name="lookupImpl">FuzzyLookupFactory</str>
      <str name="dictionaryImpl">DocumentDictionaryFactory</str>
      <str name="field">autocomplete</str>
      <str name="suggestAnalyzerFieldType">string</str>
      <str name="buildOnStartup">true</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>

  <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <str name="suggest">true</str>
      <str name="suggest.dictionary">mySuggester</str>
      <str name="suggest.count">10</str>
    </lst>
    <arr name="components">
      <str>suggest</str>
    </arr>
  </requestHandler>



Re: Question about multiple fq parameters

2016-06-09 Thread Mikhail Khludnev
Steve,
It's hard to debug queries in this way. Try to experiment with
debugQuery=true, pulling fq to q, just for explanation, etc.

On Thu, Jun 9, 2016 at 5:08 PM, Steven White  wrote:

> Erick, Mikhail, and Shawn, thank you all for your help.
>
>
>
> Just a quick re-cap of what I’m trying to achieve: my need is to combine 2
> or more “fq” queries to be treated as OR.
>
>
>
> Erick, Mikhail, I have the syntax you provided but I cannot get them to
> work properly, in fact I’m seeing odd behavior that I cannot explain so I
> hope you can shed some light on them.
>
>
>
> The following give me hits as expected:
>
>
>
> 1269 hits:
>
> http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=
>
> {!field+f=DateA+v=$a}+{!field+f=DateB+v=$b}+
>
> =[2000-01-01+TO+2030-01-01]=[2000-01-01+TO+2030-01-01]
>
>
>
> 1269 hits:
>
> http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=
>
> {!field+f=DateA+v=$a}+
>
> =[2000-01-01+TO+2030-01-01]
>
>
>
> 905 hits:
>
> http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=
>
> {!field+f=DateB+v=$b}+
>
> =[2000-01-01+TO+2030-01-01]
>
>
>
> The following don’t give me a hit as expected:
>
>
>
> 0 hits:
>
> http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=
>
> {!field+f=DateA+v=$a}+
>
> =[2020-01-01+TO+2030-01-01]
>
>
>
> 0 hits:
>
> http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=
>
> {!field+f=DateB+v=$b}+
>
> =[2020-01-01+TO+2030-01-01]
>
>
>
> The next 3 syntax are odd behavior that I cannot explain:
>
>
>
> A) 1269 hits (expected):
>
> http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=
>
> {!field+f=DateA+v=$a}+{!field+f=DateB+v=$b}+
>
> =[2000-01-01+TO+2030-01-01]=[2020-01-01+TO+2030-01-01]
>
>
>
> B) 905 hits (odd result):
>
> http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=
>
> {!field+f=DateB+v=$a}+{!field+f=DateA+v=$b}+
>
> =[2000-01-01+TO+2030-01-01]=[2020-01-01+TO+2030-01-01]
>
>
>
> C) 0 hits (but why?!):
>
> http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=
>
> {!field+f=DateA+v=$a}+{!field+f=DateB+v=$b}+
>
> =[2020-01-01+TO+2030-01-01]=[2000-01-01+TO+2030-01-01]
>
>
>
> D) 0 hits (but why?!):
>
> http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=
>
> {!field+f=DateB+v=$a}+{!field+f=DateA+v=$b}+
>
> =[2020-01-01+TO+2030-01-01]=[2000-01-01+TO+2030-01-01]
>
>
>
> Since my goal here is to have fq apply OR on the two date searches, test B
> clearly shows that’s not the case and test C & D shows that fq is ignoring
> the second part in the query.
>
>
>
> I also tried this syntax:
>
>
>
>
> http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=
>
> filter({!field+f=DateA+op=Intersects}[2000-01-01+TO+2020-01-01])+
>
> filter({!field+f=DateB+op=Intersects}[2000-01-01+TO+2030-01-01])
>
>
>
> But Solr is reporting an error: “no field name specified in query and no
> default specified via 'df' param”.
>
>
>
> Shawn, using the syntax that you suggested everything works (including my
> mix date range tests of the above):
>
>
>
>
> http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=
>
> DateA:[2000-01-01+TO+2030-01-01]+OR+DateB:[2000-01-01 TO 2030-01-01]
>
>
>
> My motivation to use “{!field}[]” in fq was what I read somewhere (I cannot
> find it now, even after many Google’s on it) is far faster and efficient
> than the tradition :[value]
>
>
>
> Steve
>
> On Wed, Jun 8, 2016 at 6:46 PM, Shawn Heisey  wrote:
>
> > On 6/8/2016 2:28 PM, Steven White wrote:
> > >
> ?q=*=OR={!field+f=DateA+op=Intersects}[2020-01-01+TO+2030-01-01]
> >
> > Looking at this and checking the code for the Field query parser, I
> > cannot see how what you have used above is any different than:
> >
> > fq=DateA:[2020-01-01 TO 2030-01-01]
> >
> > The "op=Intersects" parameter that you have included appears to be
> > ignored by the parser code that I examined.
> >
> > If my understanding of the documentation and the code is correct, then
> > you should be able to use this:
> >
> > fq=DateB:[2000-01-01 TO 2020-01-01] OR DateA:[2020-01-01 TO 2030-01-01]
> >
> > In my examples I have changed the URL encoded "+" character back to a
> > regular space.
> >
> > Thanks,
> > Shawn
> >
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Solutions for Multi-word Synonyms

2016-06-09 Thread MaryJo Sminkey
On Thu, Jun 9, 2016 at 1:50 PM, Joe Lawson <
jlaw...@opensourceconnections.com> wrote:

> The auto-phrasing-token (APT) filter is a two-pronged solution that
> requires index and query time processes versus hon-lucene-synonyms (HLS)
> which is strictly a query time implementation. The primary takeaway from
> that is, APT requires reindexing your data when you update the autophrases
> and synonyms while HLS does not.
>


Yup, understood about the indexing, that is not a big issue for us as we
rarely change the synonym list and re-index frequently.

MJ





Re: Nested vs Flattened Indexes

2016-06-09 Thread Rick Leir
Can you use Tika?
https://tika.apache.org/0.9/formats.html

On Wed, 2016-06-08 at 10:06 -0400, Aniruddh Sharma wrote:

> Hi
> 
> I am new to use Solr.
> 
> I am running Solr 4.10.3 on CDH 5.5.
> 
> My use case is , I have real time data ingestion in Hadoop on which I want
> to implement search.
> 
> My input data format is XML and it has nested child nodes. So my question
> is about schema creation for solr.
> 
> Technically I notice in JSON format , it is possible to handle nested data.
> 
> a) Although technically JSON can handle nested child data, is it also
> doable in XML format? If not, are there any guidelines to change XML
> data to JSON, or what is the best way to deal with this?
> 
> b) Even if it could technically be done, from a functional point of
> view, when does it make sense to store data in Solr as nested vs flattened?
> What is the functional use case which drives this?
> 
> 
> Thanks and Regards
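
On (a): nested documents aren't tied to JSON on the client side. With SolrJ (4.x 
API here, since you are on 4.10.3) you build the parent/child structure yourself 
after parsing the XML -- a sketch, with made-up field names and URL:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycollection");

    SolrInputDocument parent = new SolrInputDocument();
    parent.addField("id", "order-1");
    parent.addField("type_s", "order");

    SolrInputDocument line = new SolrInputDocument();
    line.addField("id", "order-1-line-1");
    line.addField("type_s", "orderline");
    parent.addChildDocument(line);   // nested/block indexing

    solr.add(parent);
    solr.commit();
    solr.shutdown();

Querying across the relationship then uses the block-join parsers ({!parent ...} 
and {!child ...}); if you flatten instead, you copy the child fields onto the 
parent and search a single document.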




Re: Checking performance of plugins, queryParser, edismax, etc

2016-06-09 Thread Rick Leir
On Wed, 2016-06-08 at 11:56 +0800, Zheng Lin Edwin Yeo wrote:

> Hi,
> 
> Would like to find out, is there a way to check the performance of
> the queryParser and things like edismax in Solr?
> 
> I have tried on the debug=true, but it only show general information like
> the time taken for query, highlight, etc.
> 
> "process":{
> "time":6397.0,
> "query":{
>   "time":5938.0},
> "facet":{
>   "time":0.0},
> "facet_module":{
>   "time":39.0},
> "mlt":{
>   "time":0.0},
> "highlight":{
>   "time":386.0},
> "stats":{
>   "time":0.0},
> "expand":{
>   "time":0.0},
> "debug":{
>   "time":32.0}
> 
> I'm trying to find out what is causing the query to slow down. I have
> included things like SynonymExpandingExtendedDismaxQParserPlugin, and would
> like to find out the time it takes to process the plugin and other things
> like edismax?
> 
> I'm using Solr 6.0.1.

This is a big and complex topic, as you will see as you explore the
results of 
https://www.google.ca/search?q=solr+performance



Re: Sorl 4.3.1 - Does not load the new data using the Java application

2016-06-09 Thread Shawn Heisey
On 6/9/2016 4:13 AM, SRINI SOLR wrote:
> *Now the issue is  - *
> *If I index the new data in Solr - the same data is not getting loaded
> through Java application until and un-less if I again load the Core
> Container using **embeddedSolrServer.getCoreContainer().load().*

This sounds like you are not doing a commit.  Newly indexed data will
not be visible until it is committed. Reloading/restarting will also
effectively do a commit.

You can do this explicitly with the commit() method on the server
object.  You can include a commitWithin parameter on your indexing
requests.  You can also configure autoSoftCommit in solrconfig.xml.
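
For example (a sketch against the SolrJ 4.x API, where "doc" is whatever
SolrInputDocument you are adding):

    embeddedSolrServer.add(doc);
    embeddedSolrServer.commit();        // new data becomes visible to searchers

    // or let Solr commit on its own within N ms of the add:
    embeddedSolrServer.add(doc, 10000); // commitWithin = 10 seconds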

Thanks,
Shawn



Re: Solutions for Multi-word Synonyms

2016-06-09 Thread Joe Lawson
>
> I'm wondering if anyone has experience using the autophrasing solution on
> the Lucidworks blog:
>
>
> https://lucidworks.com/blog/2014/07/12/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
>
>
The auto-phrasing-token (APT) filter is a two-pronged solution that
requires index and query time processes versus hon-lucene-synonyms (HLS)
which is strictly a query time implementation. The primary takeaway from
that is, APT requires reindexing your data when you update the autophrases
and synonyms while HLS does not.

APT is more precise while HLS is more flexible.

-Joe


Re: Solutions for Multi-word Synonyms

2016-06-09 Thread MaryJo Sminkey
On Thu, Jun 9, 2016 at 11:06 AM, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Honestly half the time I run into this problem, I end up creating a
> QParserPlugin because I need to do something specific. With a QParserPlugin
> I can run whatever analysis, slicing and dicing of the query string to
> manually construct whatever I need to
>
>
> http://www.supermind.org/blog/1134/custom-solr-queryparsers-for-fun-and-profit
>
> One thing I often do is repeat the functionality of Elasticsearch's match
> query. Elasticsearch's match query does the following:
>


Thanks Doug... I was surprised at the lack of response on this as it seems
like it would be a much more common issue. Looking over that page though, I
am not sure I would be able to figure out how to do that kind of custom
query parser on my own, without something fairly similar in respect to
adding synonym support to work from. I'm just a lowly self-taught web
developer after all, not a java programmer or someone with a lot of
experience writing source code, etc.

We did consider switching to ElasticSearch due to its support out of the
box for multi-term synonyms, but that would be a lot of work, and I'm not
sure it can support everything else we are doing, like all the nested
facets and grouping, etc. and it would take a fair amount of work to
convert everything we have to the point of finding that out.

I'm wondering if anyone has experience using the autophrasing solution on
the Lucidworks blog:

https://lucidworks.com/blog/2014/07/12/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/

I know I tried this one as well some months ago and couldn't seem to get it
to work but it's probably the one I'll be trying next and hopefully can
figure it out this time. Since it works as a filter, it should work better
for us in terms of being able to apply it selectively only to certain
fields.





Re: Questions regarding re-index when using Solr as a data source

2016-06-09 Thread Walter Underwood
In the HowToReindex page, under “Using Solr as a Data Store”, it says this: 
"Don't do this unless you have no other option. Solr is not really designed for 
this role.” So don’t start by planning to do this.

Using a second copy of Solr is still using Solr as a repository. That doesn’t 
satisfy any sort of requirements for disaster recovery. How do you know that 
data is good? How do you make a third copy? How do you roll back to a previous 
version? How do you deal with a security breach that affects all your systems? 
Are the systems in the same data center? How do you deal with ransomware (U. of 
Calgary paid $20K yesterday)?

If a consultant suggested this to me, I’d probably just give up and get a 
different consultant.

Here is what we do for batch loading.

1. For each Solr collection, we define a JSONL feed format, with a JSON Schema.
2. The owners of the data write an extractor to pull the data out of wherever 
it is, then generate the JSON feed.
3. We validate the JSON feed against the JSON schema.
4. If the feed is valid, we save it to Amazon S3 along with a manifest which 
lists the version of the JSON Schema.
5. Then a multi-threaded loader reads the feed and sends it to Solr.

Reloading is safe and easy, because all the feeds in S3 are valid.
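
Step 5 in stripped-down form (a sketch only: single-threaded, schema validation 
omitted, and the collection URL, class name, and batch size are illustrative):

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class JsonlLoader {
      public static void main(String[] args) throws Exception {
        ObjectMapper json = new ObjectMapper();
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/products");
        try (BufferedReader feed = Files.newBufferedReader(Paths.get(args[0]))) {
          List<SolrInputDocument> batch = new ArrayList<>();
          String line;
          while ((line = feed.readLine()) != null) {
            Map<String, Object> record = json.readValue(line, Map.class); // one JSON object per line
            SolrInputDocument doc = new SolrInputDocument();
            record.forEach(doc::addField);
            batch.add(doc);
            if (batch.size() == 1000) {   // send in chunks
              solr.add(batch);
              batch.clear();
            }
          }
          if (!batch.isEmpty()) solr.add(batch);
          solr.commit();
        }
        solr.close();
      }
    }

The real loader just wraps a thread pool around the solr.add() calls.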

Storing backups in S3 instead of running a second Solr is massively cheaper, 
easier, and safer.

We also have a clear contract between the content owners and the search team. 
That contract is enforced by the JSON Schema on every single batch.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 9, 2016, at 9:51 AM, Hui Liu  wrote:
> 
> Hi Walter,
> 
> Thank you for the reply, sorry I need to clarify what I mean by 'migrate 
> tables' from Oracle to Solr, we are not literally move existing records from 
> Oracle to Solr, instead, we are building a new application directly feed data 
> into Solr as document and fields, in parallel of another existing application 
> which feeds the same data into Oracle tables/columns, of course, the Solr 
> schema will be somewhat different than Oracle; also we only keep those data 
> for 90 days for user to search on, we hope once we run both system in 
> parallel for some time (> 90 days), we will build up enough new data in Solr 
> and we no longer need any old data in Oracle, by then we will be able to use 
> Solr as our only data store.
> 
> It sounds to me that we may need to consider save the data into either file 
> system, or another database, in case we need to rebuild the indexes; and the 
> reason I mentioned to save data into another Solr system is by reading this 
> info from https://wiki.apache.org/solr/HowToReindex : so just trying to get a 
> feedback on if there is any update on this approach? And any better way to do 
> this to minimize the downtime caused by the schema change and re-index? For 
> example, in Oracle, we are able to add a new column or new index online 
> without any impact of existing queries as existing indexes are intact.
> 
> Alternatives when a traditional reindex isn't possible
> 
> Sometimes the option of "do your indexing again" is difficult. Perhaps the 
> original data is very slow to access, or it may be difficult to get in the 
> first place.
> 
> Here's where we go against our own advice that we just gave you. Above we 
> said "don't use Solr itself as a datasource" ... but one way to deal with 
> data availability problems is to set up a completely separate Solr instance 
> (not distributed, which for SolrCloud means numShards=1) whose only job is to 
> store the data, then use the SolrEntityProcessor in the DataImportHandler to 
> index from that instance to your real Solr install. If you need to reindex, 
> just run the import again on your real installation. Your schema for the 
> intermediate Solr install would have stored="true" and indexed="false" for 
> all fields, and would only use basic types like int, long, and string. It 
> would not have any copyFields.
> 
> This is the approach used by the Smithsonian for their Solr installation, 
> because getting access to the source databases for the individual entities 
> within the organization is very difficult. This way they can reindex the 
> online Solr at any time without having to get special permission from all 
> those entities. When they index new content, it goes into a copy of Solr 
> configured for storage only, not in-depth searching. Their main Solr instance 
> uses SolrEntityProcessor to import from the intermediate Solr servers, so 
> they can always reindex.
> 
> Regards,
> Hui
> 
> -Original Message-
> From: Walter Underwood [mailto:wun...@wunderwood.org]
> Sent: Thursday, June 09, 2016 12:19 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Questions regarding re-index when using Solr as a data source
> 
> First, using Solr as a repository is pretty risky. I would keep the official 
> copy of the data in a database, not in Solr.

Re: Returned number of result rows as a function of maxScore or numFound.

2016-06-09 Thread Erick Erickson
Why do this at all? I have a hard time understanding what benefit this
is to the _user_.

And even returning 5% is risky. I mean what happens for a query of
*:*? For a corpus of 100M docs that's still 5M documents, which
would hurt.

Sure, you say, well I'll cap it at XXX docs. The principle still holds though.
Users usually don't want to deal with very many docs at a time.

If you must do this for some kind of reporting or something, just fire
two queries. The first has a rows of 0 and the second has a rows=5%
of what was returned the first time.
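
In SolrJ terms, something like this (sketch; "solr" is your client):

    SolrQuery probe = new SolrQuery("foo");
    probe.setRows(0);                                  // first query: just get numFound
    long numFound = solr.query(probe).getResults().getNumFound();

    SolrQuery real = new SolrQuery("foo");
    real.setRows((int) Math.ceil(numFound * 0.05));    // second query: top 5%
    QueryResponse top = solr.query(real);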

Under the covers, you really can't do this without writing some sort
of custom collector. Solr (Well, Lucene) uses the
rows parameter as the dimension of the list where the most relevant
docs are stored, and replaced as "better" docs come along. You can't
know how many docs are going to be found before you score them all.
So how would you know what 5% was when you start? You'd have to
write something that would keep 20X whatever your max was set
to and then grow it as necessary but by that time you _might_ have
already thrown away docs that should be in the expanded list... Or
you'd have to keep _all_ the results which would be very expensive usually.

All in all, I think a 2-query solution is much simpler than hacking into
your own collector, not to mention far more efficient in the general case.

Best,
Erick

On Wed, Jun 8, 2016 at 10:26 PM, Binoy Dalal  wrote:
> I don't think you can do such a thing ootb with solr but this is pretty
> easy to achieve using a custom search component.
>
> Just write some custom code which will limit your resultset and plug it
> into your request handler as the last component.
>
> On Thu, 9 Jun 2016, 08:53 Prasanna Josium, 
> wrote:
>
>> Hi,
>> I use a DSE stack which has Solr 4.10.
>> I want to control the number of rows from result set as a percent of the
>> max hit 'numFound' or  'maxScore' for a query.
>> e.g.,
>> 1) for a query 'foo', if I get 100 hits and I want to get the top 5%
>> (say rows=5%), then I get only 5 rows.
>> for a query 'bar', if I get 1000 hits, I want to get the top 5%
>> (rows=5%).Then I get top 50 rows.
>>
>> 2) for a query 'foo', if the maxScore is 4.5, I want to get, say, all records
>> within 10% of maxScore, i.e. all records whose score is between
>> 4.5 and 4.0 (this could be any number of records)
>>
>> in  other words, the returned set is a percent of hits, instead of a
>> static row count.
>> Is there a way to do this readily or via some custom implementation?
>>
>> Thanks
>> Cheers
>> Prasanna Josium
>>
> --
> Regards,
> Binoy Dalal


RE: Questions regarding re-index when using Solr as a data source

2016-06-09 Thread Hui Liu
Hi Walter,

Thank you for the reply, sorry I need to clarify what I mean by 'migrate 
tables' from Oracle to Solr, we are not literally moving existing records from 
Oracle to Solr, instead, we are building a new application directly feed data 
into Solr as document and fields, in parallel of another existing application 
which feeds the same data into Oracle tables/columns, of course, the Solr 
schema will be somewhat different than Oracle; also we only keep those data for 
90 days for user to search on, we hope once we run both system in parallel for 
some time (> 90 days), we will build up enough new data in Solr and we no 
longer need any old data in Oracle, by then we will be able to use Solr as our 
only data store.

It sounds to me that we may need to consider saving the data into either a file 
system or another database, in case we need to rebuild the indexes; and the 
reason I mentioned to save data into another Solr system is by reading this 
info from https://wiki.apache.org/solr/HowToReindex : so just trying to get a 
feedback on if there is any update on this approach? And any better way to do 
this to minimize the downtime caused by the schema change and re-index? For 
example, in Oracle, we are able to add a new column or new index online without 
any impact on existing queries, as existing indexes are intact.

Alternatives when a traditional reindex isn't possible

Sometimes the option of "do your indexing again" is difficult. Perhaps the 
original data is very slow to access, or it may be difficult to get in the 
first place.

Here's where we go against our own advice that we just gave you. Above we said 
"don't use Solr itself as a datasource" ... but one way to deal with data 
availability problems is to set up a completely separate Solr instance (not 
distributed, which for SolrCloud means numShards=1) whose only job is to store 
the data, then use the SolrEntityProcessor in the DataImportHandler to index 
from that instance to your real Solr install. If you need to reindex, just run 
the import again on your real installation. Your schema for the intermediate 
Solr install would have stored="true" and indexed="false" for all fields, and 
would only use basic types like int, long, and string. It would not have any 
copyFields.

This is the approach used by the Smithsonian for their Solr installation, 
because getting access to the source databases for the individual entities 
within the organization is very difficult. This way they can reindex the online 
Solr at any time without having to get special permission from all those 
entities. When they index new content, it goes into a copy of Solr configured 
for storage only, not in-depth searching. Their main Solr instance uses 
SolrEntityProcessor to import from the intermediate Solr servers, so they can 
always reindex.

Regards,
Hui

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Thursday, June 09, 2016 12:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Questions regarding re-index when using Solr as a data source

First, using Solr as a repository is pretty risky. I would keep the official 
copy of the data in a database, not in Solr.

Second, you can’t “migrate tables” because Solr doesn’t have tables. You need 
to turn the tables into documents, then index the documents. It can take a lot 
of joins to flatten a relational schema into Solr documents.

Solr does not support schema migration, so yes, you will need to save off all 
the documents, then reload them. I would save them to files. It makes no sense 
to put them in another copy of Solr.

Changing the schema will be difficult and time-consuming, but you’ll probably 
run into much worse problems trying to use Solr as a repository.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 9, 2016, at 8:50 AM, Hui Liu 
> > wrote:
>
> Hi,
>
>  We are porting an application currently hosted in Oracle 11g to 
> Solr Cloud 6.x, i.e we plan to migrate all tables in Oracle as collections in 
> Solr, index them, and build search tools on top of this; the goal is we won't 
> be using Oracle at all after this has been implemented; every fields in Solr 
> will have 'stored=true' and selectively a subset of searchable fields will 
> have 'indexed=true'; the question is what steps we should follow if we need 
> to re-index a collection after making some schema changes - mostly we only 
> add new fields to store, or make a non-indexed field as indexed, we normally 
> do not delete or rename any existing fields; according to this url: 
> https://wiki.apache.org/solr/HowToReindex it seems we need to setup a 
> 'intermediate' Solr1 to only store the data themselves without any indexing, 
> then have another Solr2 setup to store the indexed data, and in case of 
> re-index, just delete all the documents in Solr2 for the collection 

RE: [E] Re: Question about Data Import Handler

2016-06-09 Thread Jamal, Sarfaraz
I am on SOLR6 =)

Thanks,

Sas

-Original Message-
From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com] 
Sent: Thursday, June 9, 2016 12:42 PM
To: solr-user 
Subject: [E] Re: Question about Data Import Handler

which version of Solr do you run?

On Thu, Jun 9, 2016 at 6:23 PM, Jamal, Sarfaraz < 
sarfaraz.ja...@verizonwireless.com.invalid> wrote:

> Hi Guys,
>
> I have a question about the data import handler and its configuration 
> file
>
> This is what a part of my data-config looks like:
>
>
> 
> 
>
> 
> 
>   
> ===
>
> I would like it so that when its indexed, it returns in xml the 
> following when on that doc.
>
> -
> This Is my name
> This is my description 
>
> The best I have gotten it to do so far is to add to the values in name 
> and description, which are fields on the doc.
>
> Thanks for any help -
>
> P.S. I shall be replying to the other threads as well, I Just took a 
> break from it to come work on another part of SOLR.
>
> Sas
>



--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Question about Data Import Handler

2016-06-09 Thread Mikhail Khludnev
which version of Solr do you run?

On Thu, Jun 9, 2016 at 6:23 PM, Jamal, Sarfaraz <
sarfaraz.ja...@verizonwireless.com.invalid> wrote:

> Hi Guys,
>
> I have a question about the data import handler and its configuration file
>
> This is what a part of my data-config looks like:
>
>
> 
> 
>
> 
> 
> 
> 
> ===
>
> I would like it so that when its indexed, it returns in xml the following
> when on that doc.
>
> -
> This Is my name
> This is my description
> 
>
> The best I have gotten it to do so far is to add to the values in name and
> description, which are fields on the doc.
>
> Thanks for any help -
>
> P.S. I shall be replying to the other threads as well, I Just took a break
> from it to come work on another part of SOLR.
>
> Sas
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Questions regarding re-index when using Solr as a data source

2016-06-09 Thread Walter Underwood
First, using Solr as a repository is pretty risky. I would keep the official 
copy of the data in a database, not in Solr.

Second, you can’t “migrate tables” because Solr doesn’t have tables. You need 
to turn the tables into documents, then index the documents. It can take a lot 
of joins to flatten a relational schema into Solr documents.

Solr does not support schema migration, so yes, you will need to save off all 
the documents, then reload them. I would save them to files. It makes no sense 
to put them in another copy of Solr.

Changing the schema will be difficult and time-consuming, but you’ll probably 
run into much worse problems trying to use Solr as a repository.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 9, 2016, at 8:50 AM, Hui Liu  wrote:
> 
> Hi,
> 
>  We are porting an application currently hosted in Oracle 11g to 
> Solr Cloud 6.x, i.e we plan to migrate all tables in Oracle as collections in 
> Solr, index them, and build search tools on top of this; the goal is we won't 
> be using Oracle at all after this has been implemented; every fields in Solr 
> will have 'stored=true' and selectively a subset of searchable fields will 
> have 'indexed=true'; the question is what steps we should follow if we need 
> to re-index a collection after making some schema changes - mostly we only 
> add new fields to store, or make a non-indexed field as indexed, we normally 
> do not delete or rename any existing fields; according to this url: 
> https://wiki.apache.org/solr/HowToReindex it seems we need to setup a 
> 'intermediate' Solr1 to only store the data themselves without any indexing, 
> then have another Solr2 setup to store the indexed data, and in case of 
> re-index, just delete all the documents in Solr2 for the collection and 
> re-import data from Solr1 into Solr2 using SolrEntityProcessor (from 
> dataimport handler)? Is this still the recommended approach? I can see the 
> downside of this approach is if we have tremendous amount of data for a 
> collection (some of our collection could have several billions of documents), 
> re-import it from Solr1 to Solr2 may take a few hours or even days, and 
> during this time, users cannot query the data, is there any better way to do 
> this and avoid this type of down time? Any feedback is appreciated!
> 
> Regards,
> Hui Liu
> Opentext, Inc.



Questions regarding re-index when using Solr as a data source

2016-06-09 Thread Hui Liu
Hi,

  We are porting an application currently hosted in Oracle 11g to 
Solr Cloud 6.x, i.e we plan to migrate all tables in Oracle as collections in 
Solr, index them, and build search tools on top of this; the goal is we won't 
be using Oracle at all after this has been implemented; every field in Solr 
will have 'stored=true' and selectively a subset of searchable fields will have 
'indexed=true'; the question is what steps we should follow if we need to 
re-index a collection after making some schema changes - mostly we only add new 
fields to store, or make a non-indexed field as indexed, we normally do not 
delete or rename any existing fields; according to this url: 
https://wiki.apache.org/solr/HowToReindex it seems we need to setup a 
'intermediate' Solr1 to only store the data themselves without any indexing, 
then have another Solr2 setup to store the indexed data, and in case of 
re-index, just delete all the documents in Solr2 for the collection and 
re-import data from Solr1 into Solr2 using SolrEntityProcessor (from dataimport 
handler)? Is this still the recommended approach? I can see the downside of 
this approach is if we have tremendous amount of data for a collection (some of 
our collection could have several billions of documents), re-import it from 
Solr1 to Solr2 may take a few hours or even days, and during this time, users 
cannot query the data, is there any better way to do this and avoid this type 
of down time? Any feedback is appreciated!

Regards,
Hui Liu
Opentext, Inc.


Question about Data Import Handler

2016-06-09 Thread Jamal, Sarfaraz
Hi Guys,

I have a question about the data import handler and its configuration file

This is what a part of my data-config looks like:









===

I would like it so that when it's indexed, it returns in XML the following 
on that doc.

-
This Is my name
This is my description


The best I have gotten it to do so far is to add to the values in name and 
description, which are fields on the doc.

Thanks for any help -

P.S. I shall be replying to the other threads as well, I Just took a break from 
it to come work on another part of SOLR.

Sas


Re: Solutions for Multi-word Synonyms

2016-06-09 Thread Doug Turnbull
Mary Jo,

Honestly half the time I run into this problem, I end up creating a
QParserPlugin because I need to do something specific. With a QParserPlugin
I can run whatever analysis, slicing and dicing of the query string to
manually construct whatever I need to

http://www.supermind.org/blog/1134/custom-solr-queryparsers-for-fun-and-profit

One thing I often do is repeat the functionality of Elasticsearch's match
query. Elasticsearch's match query does the following:

- Analyze the query string using the field's query-time analyzer
- Create an OR query with the tokens that come out of the analysis

You can look at the field query parser as something of a starting point for
this.
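
A bare-bones version of that looks roughly like this (untested sketch against a 
5.x/6.x-era API -- error handling, null checks and the usual edge cases omitted; 
"matchish" and the "f" local param are just names I'm using here):

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.solr.common.params.SolrParams;
    import org.apache.solr.common.util.NamedList;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.search.QParser;
    import org.apache.solr.search.QParserPlugin;
    import org.apache.solr.search.SyntaxError;

    public class MatchishQParserPlugin extends QParserPlugin {

      @Override
      public void init(NamedList args) {
        // no configuration needed for this sketch
      }

      @Override
      public QParser createParser(String qstr, SolrParams localParams,
                                  SolrParams params, SolrQueryRequest req) {
        return new QParser(qstr, localParams, params, req) {
          @Override
          public Query parse() throws SyntaxError {
            String field = localParams.get("f");
            // run the raw query string through the field's query-time analyzer
            Analyzer analyzer = req.getSchema().getFieldType(field).getQueryAnalyzer();
            BooleanQuery.Builder bq = new BooleanQuery.Builder();
            try (TokenStream ts = analyzer.tokenStream(field, qstr)) {
              CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
              ts.reset();
              while (ts.incrementToken()) {
                // OR together every token the analysis chain produced
                bq.add(new TermQuery(new Term(field, term.toString())),
                       BooleanClause.Occur.SHOULD);
              }
              ts.end();
            } catch (IOException e) {
              throw new SyntaxError("analysis failed: " + e.getMessage());
            }
            return bq.build();
          }
        };
      }
    }

You'd register it in solrconfig.xml with something like
<queryParser name="matchish" class="com.example.MatchishQParserPlugin"/> and call
it as q={!matchish f=title}some user text (or, as I said, inside a boost query).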

I usually do this in the context of a boost query, not as the main edismax
query.

If I have time, this is something I've been meaning to open source.

Best
-Doug

On Tue, Jun 7, 2016 at 2:51 PM Joe Lawson 
wrote:

> I'm sorry I wasn't more specific, I meant we were hijacking the thread with
> the question, "Anyone used a different method of
> handling multi-term synonyms that isn't as global?" as the original thread
> was about getting synonym_edismax running.
>
> On Tue, Jun 7, 2016 at 2:24 PM, MaryJo Sminkey 
> wrote:
>
> > > MaryJo you might want to start a new thread, I think we kinda hijacked
> > this
> > > one. Also if you are interested in tuning queries check out
> > > http://splainer.io/ and https://www.quepid.com which are interactive
> > tools
> > > (both of which my company makes) to tune for search relevancy.
> > >
> >
> >
> > Okay I changed the subject. But I don't need a tuning tool, I already
> know
> > WHY I'm not getting the results I need, the problem is how to fix it or
> get
> > around what the plugin is doing. Which is why I was inquiring if people
> > have had success with something other than this particularly plugin for
> > more advanced queries that it messes around with. It seems to do a good
> job
> > if you aren't doing anything particularly complicated with your search
> > logic, but I don't see a good way to solve the issue I'm having, and a
> > tuning tool isn't really going to help with that. We were pretty happy
> with
> > our search relevancy for the most part *other* than the problem with the
> > multi-term synonyms not working reliably but I definitely can't lose
> > relevancy that we had just to get those working.
> >
> > In reviewing your tools previously, the problem as I recall is that they
> > rely on querying Solr directly, while our searches go through multiple
> > levels of an application which includes a lot of additional logic in
> terms
> > of what the data that gets sent to Solr are, so they just aren't going to
> > be much use for us. It was easier for me to just write my own tool that
> > essentially does the same kind of thing, but with my application logic
> > built in.
> >
> > Mary Jo
> >
>


Re: Question about multiple fq parameters

2016-06-09 Thread Steven White
Erick, Mikhail, and Shawn, thank you all for your help.



Just a quick re-cap of what I’m trying to achieve: my need is to combine 2
or more “fq” queries to be treated as OR.



Erick, Mikhail, I have tried the syntaxes you provided but I cannot get them to
work properly; in fact I'm seeing odd behavior that I cannot explain, so I
hope you can shed some light on it.



The following give me hits as expected:



1269 hits:
http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=

{!field+f=DateA+v=$a}+{!field+f=DateB+v=$b}+

=[2000-01-01+TO+2030-01-01]=[2000-01-01+TO+2030-01-01]



1269 hits:
http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=

{!field+f=DateA+v=$a}+

=[2000-01-01+TO+2030-01-01]



905 hits:
http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=

{!field+f=DateB+v=$b}+

=[2000-01-01+TO+2030-01-01]



The following don’t give me a hit as expected:



0 hits:
http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=

{!field+f=DateA+v=$a}+

=[2020-01-01+TO+2030-01-01]



0 hits:
http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=

{!field+f=DateB+v=$b}+

=[2020-01-01+TO+2030-01-01]



The next 3 syntax are odd behavior that I cannot explain:



A) 1269 hits (expected):
http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=

{!field+f=DateA+v=$a}+{!field+f=DateB+v=$b}+

=[2000-01-01+TO+2030-01-01]=[2020-01-01+TO+2030-01-01]



B) 905 hits (odd result):
http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=

{!field+f=DateB+v=$a}+{!field+f=DateA+v=$b}+

=[2000-01-01+TO+2030-01-01]=[2020-01-01+TO+2030-01-01]



C) 0 hits (but why?!):
http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=

{!field+f=DateA+v=$a}+{!field+f=DateB+v=$b}+

=[2020-01-01+TO+2030-01-01]=[2000-01-01+TO+2030-01-01]



D) 0 hits (but why?!):
http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=

{!field+f=DateB+v=$a}+{!field+f=DateA+v=$b}+

=[2020-01-01+TO+2030-01-01]=[2000-01-01+TO+2030-01-01]



Since my goal here is to have fq apply OR on the two date searches, test B
clearly shows that’s not the case and test C & D shows that fq is ignoring
the second part in the query.
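
My current suspicion is that the leading {!field ...} switches the parser for the
whole fq string, and because v=$a is given the rest of the string is simply
ignored, which would explain B, C and D. One variant I have not tried yet is to
let the default lucene parser do the OR and push each {!field} clause into a
nested query (sketch only):

    fq=_query_:"{!field f=DateA v=$a}" OR _query_:"{!field f=DateB v=$b}"
    &a=[2020-01-01 TO 2030-01-01]&b=[2000-01-01 TO 2030-01-01]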



I also tried this syntax:



http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=

filter({!field+f=DateA+op=Intersects}[2000-01-01+TO+2020-01-01])+

filter({!field+f=DateB+op=Intersects}[2000-01-01+TO+2030-01-01])



But Solr is reporting an error: “no field name specified in query and no
default specified via 'df' param”.



Shawn, using the syntax that you suggested everything works (including my
mix date range tests of the above):



http://localhost:8983/solr/openpages/select_openpages_3?start=0=*=AND?=*=

DateA:[2000-01-01+TO+2030-01-01]+OR+DateB:[2000-01-01 TO 2030-01-01]



My motivation to use "{!field}[]" in fq was that I read somewhere (I cannot
find it now, even after many Googles on it) that it is far faster and more
efficient than the traditional field:[value] syntax.



Steve

On Wed, Jun 8, 2016 at 6:46 PM, Shawn Heisey  wrote:

> On 6/8/2016 2:28 PM, Steven White wrote:
> > ?q=*=OR={!field+f=DateA+op=Intersects}[2020-01-01+TO+2030-01-01]
>
> Looking at this and checking the code for the Field query parser, I
> cannot see how what you have used above is any different than:
>
> fq=DateA:[2020-01-01 TO 2030-01-01]
>
> The "op=Intersects" parameter that you have included appears to be
> ignored by the parser code that I examined.
>
> If my understanding of the documentation and the code is correct, then
> you should be able to use this:
>
> fq=DateB:[2000-01-01 TO 2020-01-01] OR DateA:[2020-01-01 TO 2030-01-01]
>
> In my examples I have changed the URL encoded "+" character back to a
> regular space.
>
> Thanks,
> Shawn
>
>


Re: Sorl 4.3.1 - Does not load the new data using the Java application

2016-06-09 Thread Upayavira
Firstly, I'm not sure why you are using embeddedSolrServer. You would be
much better off running a standalone Solr server, and connecting to it
with a SolrClient, in Java. Then you can do client.commit(); to execute
a commit.

EmbeddedSolrServer behaves slightly differently from normal Solr, and
will get you into trouble (e.g. like this), so I'd suggest you just
start up a Solr as described in all of the tutorials, and use it the
normal way.
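
For 4.3.1 that looks roughly like this (SolrJ 4.x API; the URL and core name are
placeholders):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "1");
    doc.addField("title_t", "hello");

    solr.add(doc);
    solr.commit();    // newly added data is now visible to queries,
                      // no core container reload needed
    solr.shutdown();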

Upayavira

On Thu, 9 Jun 2016, at 01:36 PM, SRINI SOLR wrote:
> Hi Upayavira / Team -
> Can you please explain in-detail - how to do the commit...?
> 
> if we do the commit - Will the new data will be available to Java
> Application with-out calling *embeddedSolrServer.*
> *getCoreContainer().load()*. again. ...?
> 
> Please help me here ...
> 
> Thanks in Advance.
> 
> 
> 
> 
> 
> 
> 
> 
> On Thu, Jun 9, 2016 at 4:08 PM, Upayavira  wrote:
> 
> > Are you executing a commit?
> >
> > You must commit before your content becomes visible.
> >
> > Upayavira
> >
> > On Thu, 9 Jun 2016, at 11:13 AM, SRINI SOLR wrote:
> > > Hi Team -
> > > Can you please help me out on the below issue ...
> > >
> > > We are using the Solr 4.3.1 version.
> > >
> > > Integrated Solr 4.3.1 with Java application using EmbeddedSolrServer.
> > >
> > > Using this EmbeddedSolrServer in java -  loading the core container as
> > > below ...
> > > *embeddedSolrServer.getCoreContainer().load();*
> > >
> > > We are loading the container at the time of initiating the
> > > ApplicationContext. And now Java application is able to access the
> > > indexed
> > > data.
> > >
> > > *Now the issue is  - *
> > > *If I index the new data in Solr - the same data is not getting loaded
> > > through Java application until and un-less if I again load the Core
> > > Container using **embeddedSolrServer.getCoreContainer().load().*
> > >
> > > Can you please help me out to on how to access the new data (which is
> > > indexed on Solr) using java application with out calling every-time
> > > *embeddedSolrServer.getCoreContainer().load().*
> > >
> > > *??? *
> > >
> > > *Please help me out ... I am stuck and not able to proceed further ... It
> > > is leading to critical issue ...*
> > >
> > > *Thanks In Advance.*
> >


Re: Question about content indexing with Alfresco

2016-06-09 Thread Rick Leir
Is there some reason you are using version 1.4?

In the Solr admin dashboard you can load your core and do queries against it. 

On June 9, 2016 5:06:33 AM EDT, OTEC Jordi Florit  wrote:
>Hi,
>
>I'm using Alfresco 4.2.6 and SOLR 1.4, and I want to verify if my
>content is indexing on SOLR or not. I add
>alfresco.index.transformContent=false on my solcore.properties, but I
>want to verify if all is doing correctly.
>
>There are some place (on SOLR url https://localhost:8443/solr or
>something) to verify that the contents aren't being indexed really?

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Per-query boosts in MLT

2016-06-09 Thread Marc Burt

Hi,

Is it possible to assign boosts to the MLT similarity fields instead of 
the defaults set in the config when making a MLT query?
I'm currently using a query parser and attempting /select?q={!mlt 
qf=foo^10,bar^20,upc^50}id etc. but it's taking the boost to be part of 
the field name.


--

Kind Regards,

Marc



Re: Sorl 4.3.1 - Does not load the new data using the Java application

2016-06-09 Thread SRINI SOLR
Hi Upayavira / Team -
Can you please explain in detail how to do the commit...?

If we do the commit - will the new data be available to the Java
application without calling *embeddedSolrServer.*
*getCoreContainer().load()* again...?

Please help me here ...

Thanks in Advance.








On Thu, Jun 9, 2016 at 4:08 PM, Upayavira  wrote:

> Are you executing a commit?
>
> You must commit before your content becomes visible.
>
> Upayavira
>
> On Thu, 9 Jun 2016, at 11:13 AM, SRINI SOLR wrote:
> > Hi Team -
> > Can you please help me out on the below issue ...
> >
> > We are using the Solr 4.3.1 version.
> >
> > Integrated Solr 4.3.1 with Java application using EmbeddedSolrServer.
> >
> > Using this EmbeddedSolrServer in java -  loading the core container as
> > below ...
> > *embeddedSolrServer.getCoreContainer().load();*
> >
> > We are loading the container at the time of initiating the
> > ApplicationContext. And now Java application is able to access the
> > indexed
> > data.
> >
> > *Now the issue is  - *
> > *If I index the new data in Solr - the same data is not getting loaded
> > through Java application until and un-less if I again load the Core
> > Container using **embeddedSolrServer.getCoreContainer().load().*
> >
> > Can you please help me out to on how to access the new data (which is
> > indexed on Solr) using java application with out calling every-time
> > *embeddedSolrServer.getCoreContainer().load().*
> >
> > *??? *
> >
> > *Please help me out ... I am stuck and not able to proceed further ... It
> > is leading to critical issue ...*
> >
> > *Thanks In Advance.*
>


Re: Sorl 4.3.1 - Does not load the new data using the Java application

2016-06-09 Thread Upayavira
Are you executing a commit?

You must commit before your content becomes visible.

Upayavira

On Thu, 9 Jun 2016, at 11:13 AM, SRINI SOLR wrote:
> Hi Team -
> Can you please help me out on the below issue ...
> 
> We are using the Solr 4.3.1 version.
> 
> Integrated Solr 4.3.1 with Java application using EmbeddedSolrServer.
> 
> Using this EmbeddedSolrServer in java -  loading the core container as
> below ...
> *embeddedSolrServer.getCoreContainer().load();*
> 
> We are loading the container at the time of initiating the
> ApplicationContext. And now Java application is able to access the
> indexed
> data.
> 
> *Now the issue is  - *
> *If I index the new data in Solr - the same data is not getting loaded
> through Java application until and un-less if I again load the Core
> Container using **embeddedSolrServer.getCoreContainer().load().*
> 
> Can you please help me out to on how to access the new data (which is
> indexed on Solr) using java application with out calling every-time
> *embeddedSolrServer.getCoreContainer().load().*
> 
> *??? *
> 
> *Please help me out ... I am stuck and not able to proceed further ... It
> is leading to critical issue ...*
> 
> *Thanks In Advance.*


Sorl 4.3.1 - Does not load the new data using the Java application

2016-06-09 Thread SRINI SOLR
Hi Team -
Can you please help me out with the issue below?

We are using Solr version 4.3.1.

We integrated Solr 4.3.1 with our Java application using EmbeddedSolrServer.

Using this EmbeddedSolrServer in Java, we load the core container as
below:
*embeddedSolrServer.getCoreContainer().load();*

We load the container when the ApplicationContext is initiated, and the
Java application is then able to access the indexed data.

*Now the issue is: *
*If I index new data in Solr, the same data does not get loaded
through the Java application unless I load the core container again
using **embeddedSolrServer.getCoreContainer().load().*

Can you please help me with how to access the new data (which is
indexed in Solr) from the Java application without calling
*embeddedSolrServer.getCoreContainer().load()* every time?

*Please help me out ... I am stuck and not able to proceed further ... It
is leading to a critical issue ...*

*Thanks In Advance.*


Question about content indexing with Alfresco

2016-06-09 Thread OTEC Jordi Florit
Hi,

I'm using Alfresco 4.2.6 and Solr 1.4, and I want to verify whether my content is 
being indexed in Solr or not. I added alfresco.index.transformContent=false to my 
solrcore.properties, but I want to verify that everything is working correctly.

Is there some place (on the Solr URL https://localhost:8443/solr or similar) where 
I can verify that the contents really aren't being indexed?

Thanks!

Best regards,

Jordi



SolrInputDocument required id in solr5.4 but the same program run on solr5.0 without any id

2016-06-09 Thread pratika.sarda
Hi,

SolrInputDocument requires an id in Solr 5.4, but the same program ran on Solr 5.0
without any id when adding a doc.

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://localhost:8983/solr/CampaignCore: [doc=null] missing
required field: campaignId
at
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
at 
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:106)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:71)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:85)
at com.yash.campaign.app.CampaignJob.main(CampaignJob.java:15)

If I add an id to the child doc then the document gets added, but the Solr query
I had written for Solr 5.0 won't run on Solr 5.4.
Please suggest an alternative so that my child doc will not require an id in
Solr 5.4 or later.
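
As a hedged illustration of the workaround mentioned above (giving each child 
document its own value for the required field, which lets the add succeed on 5.4 
but does not remove the requirement), a SolrJ sketch with placeholder values, 
assuming campaignId is the required/unique key field:

    import org.apache.solr.common.SolrInputDocument;

    public class ChildDocIds {
        static SolrInputDocument buildParentWithChildren() {
            SolrInputDocument parent = new SolrInputDocument();
            parent.addField("campaignId", "campaign-42");                   // placeholder parent key
            for (int i = 0; i < 3; i++) {
                SolrInputDocument child = new SolrInputDocument();
                // each child gets its own value for the required field,
                // derived here from the parent key plus an index
                child.addField("campaignId", "campaign-42_child_" + i);
                parent.addChildDocument(child);
            }
            return parent;
        }
    }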


Regards,
Pratika






--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrInputDocument-required-id-in-solr5-4-but-the-same-program-run-on-solr5-0-without-any-id-tp4281387.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Question about multiple fq parameters

2016-06-09 Thread Mikhail Khludnev
Shawn,
I found "op" at
org.apache.solr.schema.DateRangeField.parseSpatialArgs(QParser, String).

On Thu, Jun 9, 2016 at 1:46 AM, Shawn Heisey  wrote:

> On 6/8/2016 2:28 PM, Steven White wrote:
> > ?q=*=OR={!field+f=DateA+op=Intersects}[2020-01-01+TO+2030-01-01]
>
> Looking at this and checking the code for the Field query parser, I
> cannot see how what you have used above is any different than:
>
> fq=DateA:[2020-01-01 TO 2030-01-01]
>
> The "op=Intersects" parameter that you have included appears to be
> ignored by the parser code that I examined.
>
> If my understanding of the documentation and the code is correct, then
> you should be able to use this:
>
> fq=DateB:[2000-01-01 TO 2020-01-01] OR DateA:[2020-01-01 TO 2030-01-01]
>
> In my examples I have changed the URL encoded "+" character back to a
> regular space.
>
> Thanks,
> Shawn
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Question about CloudSolrServer

2016-06-09 Thread Naveen Pajjuri
Thanks *Shawn*.
I was using an older version of SolrJ; upgrading it to a newer version worked.

Thank you.

On Thu, Jun 9, 2016 at 11:41 AM, Shawn Heisey  wrote:

> On 6/8/2016 11:44 PM, Naveen Pajjuri wrote:
> > Trying to migrate from HttpSolrServer to CloudSolrServer. getting the
> > following exception while adding docs using CloudSolrServer.
> >
> >
> > org.apache.solr.common.SolrException: Unknown document router
> > '{name=compositeId}'
> >
> > at org.apache.solr.common.cloud.DocRouter.getDocRouter(DocRouter.java:46)
> >
> > whereas my clusterstate json says --
> >
> >   "maxShardsPerNode":"1",
> > "router":{"name":"compositeId"},
> > "replicationFactor":"1".
>
> I am guessing that you are using a much older version of SolrJ than the
> Solr version it is talking to.  The '{"name":"compositeId"}' structure
> appears to be the way that newer versions of Solr record the router in
> zookeeper, which is something that the older versions of SolrJ will not
> know how to handle.
>
> Mixing different versions of Solr and SolrJ will work very well, as long
> as you're not using the cloud client.  That client is so tightly coupled
> to SolrCloud internals that it does not work well with a large version
> difference, especially if the client is older than the server.
>
> Most likely you'll need to upgrade your SolrJ version.  At the same
> time, switching to CloudSolrClient is probably a good idea -- the class
> names that end in Server are deprecated in 5.x and gone in 6.x.
>
> Thanks,
> Shawn
>
>


Solr6 CDCR issue with a 3 cloud design

2016-06-09 Thread dmitry.medvedev
I've set up a 3-cloud CDCR environment (Source1 => Target1-Source2 => Target2), and 
the replication process works perfectly, but:

When I shut down the Target1-Source2 cloud (the mediator, to test resilience), 
index/push some docs to the Source1 cloud, and bring the Target1-Source2 cloud back 
online after several minutes, only part of the docs are replicated to the two 
Target clouds (7 of 10 docs in my test).

Does anyone have an idea what the reason for this behavior is?

Configurations attached.

Thanks in advance,
Dmitry Medvedev.






  

  
[Three solrconfig.xml attachments follow here in the original message (one per 
cloud), but the mail archive stripped the XML tags, leaving only element values. 
The settings that survive include: luceneMatchVersion 6.0.0; a "cdcr-proc-chain" 
update request processor chain and a CDCR buffer state of "disabled" in the 
target-side configs; CDCR request handler settings with zkHost values 
10.36.75.4:9983 and 10.88.52.219:9983 and source/target collection "demo"; 
replicator settings threadPoolSize 2, schedule 10, batchSize 128; and an 
updateLogSynchronizer schedule of 1000.]




Re: Solr 6.1.x Release Date ??

2016-06-09 Thread Ramesh Shankar
Hi,

I found the [subquery] transformer working in the solr-6.1.0-79 nightly builds.

Regards
Ramesh

On Tue, Jun 7, 2016 at 11:08 AM, Ramesh Shankar  wrote:

> Hi,
>
> Any idea of Solr 6.1.X Release Date ??
>
> I am interested in the [subquery] transformer and like to know the release
> date since its available only in 6.1.x
>
> Thanks & Regards
> Ramesh
>


RE: Using Solr to index zip files

2016-06-09 Thread anupama . gangadhar
Hi,

The nesting level is fixed. The outer zip has many inner zip files (i.e. 1.zip 
contains many zip files).
Currently the outer zip path and inner zip name are stored in a Hive table for 
reference.
I use a Hive query to find the zip for me.

I intend to index the outer zip file and store all the inner zip names as 
fields (search criteria) in this index.
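
A minimal sketch of that mapping, assuming one level of nesting, a local file 
rather than HDFS, and placeholder field names (outerZipPath, innerZipName) and 
core URL:

    import java.io.FileInputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ZipIndexer {
        public static void main(String[] args) throws Exception {
            String outerZipPath = "/data/1.zip";                       // placeholder path
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", outerZipPath);
            doc.addField("outerZipPath", outerZipPath);                 // placeholder field names
            try (ZipInputStream zin = new ZipInputStream(new FileInputStream(outerZipPath))) {
                ZipEntry entry;
                while ((entry = zin.getNextEntry()) != null) {
                    if (!entry.isDirectory() && entry.getName().endsWith(".zip")) {
                        doc.addField("innerZipName", entry.getName()); // multivalued search field
                    }
                }
            }
            try (SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/zips")) {
                solr.add(doc);
                solr.commit();
            }
        }
    }

Reading the outer zip from HDFS instead would mean swapping the FileInputStream 
for an HDFS input stream, but the document mapping stays the same.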

Thank you,
Regards,
Anupama

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Tuesday, June 07, 2016 7:44 PM
To: solr-user
Subject: Re: Using Solr to index zip files

I _think_ DataImportHandler could handle zip files with fixed level of nesting, 
but not read from HDFS.

I don't think anything else in Solr will. So, doing it outside of Solr is 
probably best. Especially, since you would need to decide how you actually want 
to map these files (e.g. do you keep the path for zip within zip, etc).

Regards,
Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 7 June 2016 at 12:57,   wrote:
> Hi,
>
> I have an use case where I need to search zip files quickly in HDFS. I intend 
> to use Solr but not finding any relevant information about whether it can be 
> done for zip files.
> These are nested zip files i.e. zips within a zip file. Any help/information 
> is much appreciated.
>
> Thank you,
> Regards,
> Anupama
>
>
>




Re: Question about CloudSolrServer

2016-06-09 Thread Shawn Heisey
On 6/8/2016 11:44 PM, Naveen Pajjuri wrote:
> Trying to migrate from HttpSolrServer to CloudSolrServer. getting the
> following exception while adding docs using CloudSolrServer.
>
>
> org.apache.solr.common.SolrException: Unknown document router
> '{name=compositeId}'
>
> at org.apache.solr.common.cloud.DocRouter.getDocRouter(DocRouter.java:46)
>
> whereas my clusterstate json says --
>
>   "maxShardsPerNode":"1",
> "router":{"name":"compositeId"},
> "replicationFactor":"1".

I am guessing that you are using a much older version of SolrJ than the
Solr version it is talking to.  The '{"name":"compositeId"}' structure
appears to be the way that newer versions of Solr record the router in
zookeeper, which is something that the older versions of SolrJ will not
know how to handle.

Mixing different versions of Solr and SolrJ will work very well, as long
as you're not using the cloud client.  That client is so tightly coupled
to SolrCloud internals that it does not work well with a large version
difference, especially if the client is older than the server.

Most likely you'll need to upgrade your SolrJ version.  At the same
time, switching to CloudSolrClient is probably a good idea -- the class
names that end in Server are deprecated in 5.x and gone in 6.x.

Thanks,
Shawn
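
For reference, a minimal SolrJ sketch of the CloudSolrClient replacement, using 
the single-argument constructor available in SolrJ 5.x/6.x (newer releases prefer 
the Builder); the ZooKeeper address and collection name below are placeholders:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudClientExample {
        public static void main(String[] args) throws Exception {
            // placeholder ZooKeeper ensemble; use the addresses of the real SolrCloud cluster
            try (CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
                client.setDefaultCollection("mycollection");    // placeholder collection name
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-1");
                client.add(doc);
                client.commit();
            }
        }
    }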