Commit required after delete?

2017-01-05 Thread Dorian Hoxha
Hello friends,

Based on what I've read, I think "commit" isn't needed to make deletes
active (as it is with index/update), right?

Since a delete just marks an in-memory deleted-id bitmap, right?
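
For concreteness, a minimal SolrJ sketch of the sequence in question (the
core name and document id are made up):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class DeleteExample {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");
        client.deleteById("doc-1"); // queue the delete
        client.commit();            // the line in question: is it needed to make the delete visible?
        client.close();
    }
}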

Thank You


Error while running my code: java.lang.VerifyError: Bad type on operand stack

2017-01-05 Thread gayathri...@tcs.com
Hi,

I'm using Solr 5.4.0. While running my code I get the error below; please
suggest what has to be done.

import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

// Wrapper class added so the snippet compiles as posted.
public class SolrTest {
    public static void main(String[] args) throws SolrServerException, IOException {
        String urlString = "http://localhost:8983/solr/";
        SolrClient client = new HttpSolrClient(urlString);
    }
}

Error :

java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
   
org/apache/http/impl/client/DefaultHttpClient.setDefaultHttpParams(Lorg/apache/http/params/HttpParams;)V
@4: invokestatic
  Reason:
Type 'org/apache/http/HttpVersion' (current frame, stack[1]) is not
assignable to 'org/apache/http/ProtocolVersion'

Please suggest what has to be done.





Need help for this scenario

2017-01-05 Thread capgemini_india . shashi
Hello Team,

I am looking for your valuable suggestions/solutions for the below scenario:

>  Scenario:
When a user sends a request with the name of a filename.zip, they want to 
receive that "filename.zip" zip file.

>  Description:
* The data is a collection.zip, which consists of many inner 
zip files (the filename.zip files).
* Here we have a Hive table; its first column holds the HAR 
location, which points to the collection.zip file path, and the 
next column holds the filename.zip.
hive> select * from hvtb_xdlogfileanalysis_logfilewithworkshopinfo_ext limit 1;
OK
har://hdfs-DBDPInnovationLab/org/itpgm/XDLogFileAnalysis/archived/data/AfterSalesAndService_AS/APP-16151/XentryInDia-XentryDiagnostics/Central_Y/XDlogfiles/2016_05_30/2016_05_30.har/10_52_33/INDIA_archive_LTA1_1_20160330-153020236.zip
  H_8658E990C2F3_20160330_110145.zip  WDC1660041A050241   M/GLE (166)   
  8658E990C2F32   2016-03-30 09:01:45 2016-03-30 12:50:34  MPC212   
   12/15   104172.0km  5131265 CENTRETOILE SA  4500
avenue de l'Industrie 24Huy 513 126531  2016-04-19
Time taken: 0.609 seconds, Fetched: 1 row(s)

>  Tasks carried out:
* We need to use a REST API to send a request for a given filename.zip.
* Then we need to query the Hive table to get the HAR location, and from 
there we get the collection.zip.
* Unzip the collection.zip and compare the list of inner zip files with 
the requested filename.zip.
* After comparison, we need to put the filename.zip in a particular 
location.
* Provide the webhdfs path of the filename.zip via the REST API.

Now the problem I am facing is:
This is the first time I am using Solr, and I am not sure how to 
point to the inner zip file (which is inside the collection.zip).
Does Solr support this? If yes, can you explain it to me briefly?
How do I point to the inner zip files?
Or if there is any workaround, please let me know.

All suggestions/solutions are welcome.
Thank you in advance.

Emailed : kshashi...@gmail.com

Thanks,
ShashiKumar





Re: Subqueries

2017-01-05 Thread Mikhail Khludnev
Peter,
The subquery should also log its request. Can't you find it in the log?

On Fri, Jan 6, 2017 at 1:19 AM, Peter Matthew Eichman 
wrote:

> Hello Mikhail,
>
> I put pcdm_members into the fl, and it is definitely stored. I tried adding
> the logParamsList, but all I see in the log is
> 183866104 [qtp1778535015-14] INFO  org.apache.solr.core.SolrCore  –
> [fedora4] webapp=/solr path=/select params={q=id:"https://
> fcrepolocal/fcrepo/rest/pcdm/19/31/3c/1a/19313c1a-6ab4-
> 4305-93ec-12dfdf01ba74"&members.logParamsList=q,fl,
> rows,row.pcdm_members&indent=true&fl=members:[subquery]&
> members.fl=id,title&members.q={!terms+f%3Did+v%3D$row.pcdm_
> members}&wt=json&_=1483654385162} hits=1 status=0 QTime=0
>
> Still getting no members key in the output:
>
> {
>   "responseHeader": {
> "status": 0,
> "QTime": 1,
> "params": {
>   "q": "id:\"https://fcrepolocal/fcrepo/rest/pcdm/19/31/3c/1a/
> 19313c1a-6ab4-4305-93ec-12dfdf01ba74\"",
>   "members.logParamsList": "q,fl,rows,row.pcdm_members",
>   "indent": "true",
>   "fl": "pcdm_members,members:[subquery]",
>   "members.fl": "id,title",
>   "members.q": "{!terms f=id v=$row.pcdm_members}",
>   "wt": "json",
>   "_": "1483654538166"
> }
>   },
>   "response": {
> "numFound": 1,
> "start": 0,
> "docs": [
>   {
> "pcdm_members": [
>   "https://fcrepolocal/fcrepo/rest/pcdm/28/2e/5b/f5/
> 282e5bf5-74c8-4148-9c1a-4ebead6435cb",
>   "https://fcrepolocal/fcrepo/rest/pcdm/6e/7c/36/2f/
> 6e7c362f-d239-4534-abd7-28caa24a134c",
>   "https://fcrepolocal/fcrepo/rest/pcdm/6e/e3/a6/33/
> 6ee3a633-998e-4f36-b80f-d76bcbe0d352",
>   "https://fcrepolocal/fcrepo/rest/pcdm/8a/d9/c7/62/
> 8ad9c762-4391-428d-b1ad-be5ac3e06c42"
> ]
>   }
> ]
>   }
> }
>
> Is $row.pcdm_members the right way to refer to the pcdm_members field
> of the current document in the subquery? Is the multivalued nature of
> the field a problem? I have tried adding separator=' ' to both the
> [subquery] and {!terms}, but to no avail.
>
> Thanks,
> -Peter
>
> On Thu, Jan 5, 2017 at 4:38 PM, Mikhail Khludnev  wrote:
>
> > Hello,
> >
> > Can you add pcdm_members into fl to make sure it's stored?
> > Also please add the following param
> > members.logParamsList=q,fl,rows,row.pcdm_members,
> > and check logs then.
> >
> > On Thu, Jan 5, 2017 at 9:46 PM, Peter Matthew Eichman 
> > wrote:
> >
> > > Hello all,
> > >
> > > I am attempting to use a subquery to enrich a query with the titles of
> > > related objects. Each document in my index may have 1 or more
> > pcdm_members
> > > and pcdm_related_objects fields, whose values are ids of other
> documents
> > in
> > > the index. Those documents in turn have reciprocal pcdm_member_of and
> > > pcdm_related_object_of fields.
> > >
> > > In the Blacklight app I am working on, we want to enrich the display
> of a
> > > document with the titles of its members and related objects using a
> > > subquery. However, this is our first foray into subqueries and things
> > > aren't working as expected.
> > >
> > > I expected the following query to return a "members" key with a
> document
> > > list of documents with "id" and "title" keys, but I am getting nothing:
> > >
> > > {
> > >   "responseHeader": {
> > > "status": 0,
> > > "QTime": 1,
> > > "params": {
> > >   "q": "id:\"https://fcrepolocal/fcrepo/rest/pcdm/19/31/3c/1a/
> > > 19313c1a-6ab4-4305-93ec-12dfdf01ba74\"",
> > >   "indent": "true",
> > >   "fl": "members:[subquery]",
> > >   "members.fl": "id,title",
> > >   "members.q": "{!terms f=id v=$row.pcdm_members}",
> > >   "wt": "json",
> > >   "_": "1483641932207"
> > > }
> > >   },
> > >   "response": {
> > > "numFound": 1,
> > > "start": 0,
> > > "docs": [
> > >   {}
> > > ]
> > >   }
> > > }
> > >
> > > Any pointers on what I am missing? Are there any configuration settings
> > in
> > > solrconfig.xml that I need to be aware of for subqueries to work?
> > >
> > > Thanks,
> > > -Peter
> > >
> > > --
> > > Peter Eichman
> > > Senior Software Developer
> > > University of Maryland Libraries
> > > peich...@umd.edu
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> >
>
>
>
> --
> Peter Eichman
> Senior Software Developer
> University of Maryland Libraries
> peich...@umd.edu
>



-- 
Sincerely yours
Mikhail Khludnev


Re: How to train the model using user clicks when use ltr(learning to rank) module?

2017-01-05 Thread Will Martin
> In the "Assemble training data" part: the third column indicates the relative
> importance or relevance of that doc.
> Could you please give more info about how to assign a score based on what the
> user clicks?

Hi Jeffery,

Give your questions more detail and there may be more feedback; just a 
suggestion.
About above,

Some examples of assigning "relative" weighting to training data, from
user click info gathered (all assumed, but similar to Omniture monitoring):
- position in the result list
- above/below the fold
- result page number
As an information engineer, you might see 2 attributes here: a) user 
perseverance, b) effort to find the result.

From there, the attributes have a correlation that is neither linear nor 
directly proportional, I think: easy-to-find outweighs user perseverance 
every time because it reduces the need for such extensive perseverance. 
Page #3, for example, doesn't mitigate effort; it drives effort towards 
lower user-perseverance values.
Ok. That is damn confusing. But it's what I would want to do: use the pair 
in a manner that reranks a document as if the perseverance and effort were 
balanced and positioned ... "relative" to the other training data. What that 
equation is will take some more effort.
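
As a toy sketch only (the cut-offs and the 10-results-per-page assumption
are invented here, not taken from the LTR docs), turning such click
observations into graded labels could look like this:

// Maps one click observation to a graded relevance label (0-3) for training data.
static int relevanceGrade(int positionInPage, int pageNumber) {
    int absolutePosition = (pageNumber - 1) * 10 + positionInPage; // assumes 10 results per page
    if (absolutePosition <= 3)  return 3; // found with almost no effort: strong signal
    if (absolutePosition <= 10) return 2; // first page, but below the fold
    if (absolutePosition <= 30) return 1; // user persevered to later pages
    return 0;                             // weak signal
}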

I'm not sure this response is helpful at all, but I'm going to go with it 
because I recognize all of it from AOL, Microsoft and Comcast work. Before 
the days of ML in search.

On 1/5/2017 3:33 PM, Jeffery Yuan wrote:

Thanks, Will Martin.

I checked the PDF; it's great, but it seems not very useful for my question:
how to train the model using user clicks with the LTR (learning to rank)
module.

I know the concepts after reading these papers, but I am still not sure how
to code them.








Re: SolrCloud and LVM

2017-01-05 Thread Shawn Heisey
On 1/5/2017 3:12 PM, Chris Ulicny wrote:
> Is there any known significant performance impact of running solrcloud with
> lvm on linux?
>
> While migrating to solrcloud we don't have the storage capacity for our
> expected final size, so we are planning on setting up the solrcloud
> instances on a logical volume that we can grow when hardware becomes
> available.

Nothing specific.  Whatever the general performance impact of LVM is, that
is what Solr would encounter when it reads and writes data to/from the disk.

If your system has enough memory for good performance, then disk reads
will be rare, so the performance of the storage volume wouldn't matter
much.  If you don't have enough memory, then the disk performance would
matter ...although Solr's performance at that point would probably be
bad enough that you'd be looking for ways to improve it.

Here's some information:

https://wiki.apache.org/solr/SolrPerformanceProblems

Exactly how much memory is enough depends on enough factors that there's
no good general advice.  The only thing we can say in general is to
recommend the ideal setup -- where you have enough spare memory that
your OS can cache the ENTIRE index.  The ideal setup is usually not
required for good performance.

Thanks,
Shawn



Re: Subqueries

2017-01-05 Thread Peter Matthew Eichman
Hello Mikhail,

I put pcdm_members into the fl, and it is definitely stored. I tried adding
the logParamsList, but all I see in the log is
183866104 [qtp1778535015-14] INFO  org.apache.solr.core.SolrCore  –
[fedora4] webapp=/solr path=/select params={q=id:"https://
fcrepolocal/fcrepo/rest/pcdm/19/31/3c/1a/19313c1a-6ab4-
4305-93ec-12dfdf01ba74"&members.logParamsList=q,fl,
rows,row.pcdm_members&indent=true&fl=members:[subquery]&
members.fl=id,title&members.q={!terms+f%3Did+v%3D$row.pcdm_
members}&wt=json&_=1483654385162} hits=1 status=0 QTime=0

Still getting no members key in the output:

{
  "responseHeader": {
"status": 0,
"QTime": 1,
"params": {
  "q": 
"id:\"https://fcrepolocal/fcrepo/rest/pcdm/19/31/3c/1a/19313c1a-6ab4-4305-93ec-12dfdf01ba74\"";,
  "members.logParamsList": "q,fl,rows,row.pcdm_members",
  "indent": "true",
  "fl": "pcdm_members,members:[subquery]",
  "members.fl": "id,title",
  "members.q": "{!terms f=id v=$row.pcdm_members}",
  "wt": "json",
  "_": "1483654538166"
}
  },
  "response": {
"numFound": 1,
"start": 0,
"docs": [
  {
"pcdm_members": [
  
"https://fcrepolocal/fcrepo/rest/pcdm/28/2e/5b/f5/282e5bf5-74c8-4148-9c1a-4ebead6435cb";,
  
"https://fcrepolocal/fcrepo/rest/pcdm/6e/7c/36/2f/6e7c362f-d239-4534-abd7-28caa24a134c";,
  
"https://fcrepolocal/fcrepo/rest/pcdm/6e/e3/a6/33/6ee3a633-998e-4f36-b80f-d76bcbe0d352";,
  
"https://fcrepolocal/fcrepo/rest/pcdm/8a/d9/c7/62/8ad9c762-4391-428d-b1ad-be5ac3e06c42";
]
  }
]
  }
}

Is $row.pcdm_members the right way to refer to the pcdm_members field
of the current document in the subquery? Is the multivalued nature of
the field a problem? I have tried adding separator=' ' to both the
[subquery] and {!terms}, but to no avail.

Thanks,
-Peter

On Thu, Jan 5, 2017 at 4:38 PM, Mikhail Khludnev  wrote:

> Hello,
>
> Can you add pcdm_members into fl to make sure it's stored?
> Also please add the following param
> members.logParamsList=q,fl,rows,row.pcdm_members,
> and check logs then.
>
> On Thu, Jan 5, 2017 at 9:46 PM, Peter Matthew Eichman 
> wrote:
>
> > Hello all,
> >
> > I am attempting to use a subquery to enrich a query with the titles of
> > related objects. Each document in my index may have 1 or more
> pcdm_members
> > and pcdm_related_objects fields, whose values are ids of other documents
> in
> > the index. Those documents in turn have reciprocal pcdm_member_of and
> > pcdm_related_object_of fields.
> >
> > In the Blacklight app I am working on, we want to enrich the display of a
> > document with the titles of its members and related objects using a
> > subquery. However, this is our first foray into subqueries and things
> > aren't working as expected.
> >
> > I expected the following query to return a "members" key with a document
> > list of documents with "id" and "title" keys, but I am getting nothing:
> >
> > {
> >   "responseHeader": {
> > "status": 0,
> > "QTime": 1,
> > "params": {
> >   "q": "id:\"https://fcrepolocal/fcrepo/rest/pcdm/19/31/3c/1a/
> > 19313c1a-6ab4-4305-93ec-12dfdf01ba74\"",
> >   "indent": "true",
> >   "fl": "members:[subquery]",
> >   "members.fl": "id,title",
> >   "members.q": "{!terms f=id v=$row.pcdm_members}",
> >   "wt": "json",
> >   "_": "1483641932207"
> > }
> >   },
> >   "response": {
> > "numFound": 1,
> > "start": 0,
> > "docs": [
> >   {}
> > ]
> >   }
> > }
> >
> > Any pointers on what I am missing? Are there any configuration settings
> in
> > solrconfig.xml that I need to be aware of for subqueries to work?
> >
> > Thanks,
> > -Peter
> >
> > --
> > Peter Eichman
> > Senior Software Developer
> > University of Maryland Libraries
> > peich...@umd.edu
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>



-- 
Peter Eichman
Senior Software Developer
University of Maryland Libraries
peich...@umd.edu


SolrCloud and LVM

2017-01-05 Thread Chris Ulicny
Is there any known significant performance impact of running solrcloud with
lvm on linux?

While migrating to solrcloud we don't have the storage capacity for our
expected final size, so we are planning on setting up the solrcloud
instances on a logical volume that we can grow when hardware becomes
available.

Thanks,
Chris


Re: Subqueries

2017-01-05 Thread Mikhail Khludnev
Hello,

Can you add pcdm_members into fl to make sure it's stored?
Also please add the following param
members.logParamsList=q,fl,rows,row.pcdm_members,
and check logs then.

On Thu, Jan 5, 2017 at 9:46 PM, Peter Matthew Eichman 
wrote:

> Hello all,
>
> I am attempting to use a subquery to enrich a query with the titles of
> related objects. Each document in my index may have 1 or more pcdm_members
> and pcdm_related_objects fields, whose values are ids of other documents in
> the index. Those documents in turn have reciprocal pcdm_member_of and
> pcdm_related_object_of fields.
>
> In the Blacklight app I am working on, we want to enrich the display of a
> document with the titles of its members and related objects using a
> subquery. However, this is our first foray into subqueries and things
> aren't working as expected.
>
> I expected the following query to return a "members" key with a document
> list of documents with "id" and "title" keys, but I am getting nothing:
>
> {
>   "responseHeader": {
> "status": 0,
> "QTime": 1,
> "params": {
>   "q": "id:\"https://fcrepolocal/fcrepo/rest/pcdm/19/31/3c/1a/
> 19313c1a-6ab4-4305-93ec-12dfdf01ba74\"",
>   "indent": "true",
>   "fl": "members:[subquery]",
>   "members.fl": "id,title",
>   "members.q": "{!terms f=id v=$row.pcdm_members}",
>   "wt": "json",
>   "_": "1483641932207"
> }
>   },
>   "response": {
> "numFound": 1,
> "start": 0,
> "docs": [
>   {}
> ]
>   }
> }
>
> Any pointers on what I am missing? Are there any configuration settings in
> solrconfig.xml that I need to be aware of for subqueries to work?
>
> Thanks,
> -Peter
>
> --
> Peter Eichman
> Senior Software Developer
> University of Maryland Libraries
> peich...@umd.edu
>



-- 
Sincerely yours
Mikhail Khludnev


Re: reuse a org.apache.lucene.search.Query in Solrj?

2017-01-05 Thread Mikhail Khludnev
If I've got you right, it's not possible. It's a known problem that you
can't pass a Lucene Query through the SolrJ API.

On Thu, Jan 5, 2017 at 8:32 PM, xavier jmlucjav  wrote:

> Hi,
>
> I have a lucene Query (Boolean query with a bunch of possibly complex
> spatial queries, even polygon etc) that I am building for some MemoryIndex
> stuff.
>
> Now I need to add that same query to a Solr query (adding it to a bunch of
> other fq I am using). Is there some way to piggyback the Lucene query
> this way? It would be extremely handy in my situation.
>
> thanks
> xavier
>



-- 
Sincerely yours
Mikhail Khludnev


Re: How to train the model using user clicks when use ltr(learning to rank) module?

2017-01-05 Thread Jeffery Yuan
Thanks, Will Martin.

I checked the PDF; it's great, but it seems not very useful for my question:
how to train the model using user clicks with the LTR (learning to rank)
module.

I know the concepts after reading these papers, but I am still not sure how
to code them.





Re: Is there Solr limitation on size for document retrieval?

2017-01-05 Thread Erick Erickson
The problem is probably somewhere in the max allowed packet size you
have configured between your client and server. Solr has no a-priori
limit here (well, I think > 2B won't return).

What is your symptom? Often the browser will sit there blank because
it's taking forever to render. Try submitting the URL with curl and
piping the output to a file. If that succeeds then it's a browser
problem.

hl.fragsize, see:
https://cwiki.apache.org/confluence/display/solr/Standard+Highlighter
won't return the entire field. maxAnalyzedChars will restrict
highlighting to the beginning of the doc.
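
For example, a hedged SolrJ sketch (the field name "content" and the sizes
are placeholders, not from this thread):

import org.apache.solr.client.solrj.SolrQuery;

SolrQuery q = new SolrQuery("content:foo");
q.setHighlight(true);
q.addHighlightField("content");
q.setHighlightFragsize(100);          // return ~100-char snippets, not the whole field
q.set("hl.maxAnalyzedChars", 51200);  // only analyze the start of very large docs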

Best,
Erick

On Thu, Jan 5, 2017 at 12:15 PM, Kaushik  wrote:
> Hello,
>
> Is there a limit on the size of a document that can be indexed and rendered
> by Solr? We use Solr 5.3.1 and while we are able to index a document of 40
> MB size without any issue, we are unable to retrieve the indexed
> SolrDocument. Is there any configuration that we can use to spit out the
> entire document?
>
> Also, the only reason why we need the whole document is because of the
> highlighting feature. It would be great if we can just get a snippet of the
> text, instead of the entire content field for highlighting.
>
> Thanks,
> Kaushik


Is there Solr limitation on size for document retrieval?

2017-01-05 Thread Kaushik
Hello,

Is there a limit on the size of a document that can be indexed and rendered
by Solr? We use Solr 5.3.1 and while we are able to index a document of 40
MB size without any issue, we are unable to retrieve the indexed
SolrDocument. Is there any configuration that we can use to spit out the
entire document?

Also, the only reason why we need the whole document is because of the
highlighting feature. It would be great if we can just get a snippet of the
text, instead of the entire content field for highlighting.

Thanks,
Kaushik


Facet date - autogap

2017-01-05 Thread sn00py

Is it possible to make an "autogap" for a date range?

I would like to send a query where, depending on the date range of the
results, the gap should be:
1 year
1 month
1 day

The only possibility I see at the moment is to make a query to get the  
first and last date and then send the query a second time ... but I would  
like to get it all in one query.
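
A minimal SolrJ sketch of that two-pass approach (the field name "mydate"
and the gap thresholds are made up; assumes an existing SolrClient "client"):

import java.util.Date;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.FieldStatsInfo;

// Pass 1: use the stats component to get min/max of the date field.
SolrQuery stats = new SolrQuery("*:*");
stats.setRows(0);
stats.setGetFieldStatistics("mydate");
FieldStatsInfo info = client.query(stats).getFieldStatsInfo().get("mydate");
Date min = (Date) info.getMin();
Date max = (Date) info.getMax();

// Pick a gap from the spread, then run pass 2 with range faceting.
long days = (max.getTime() - min.getTime()) / 86_400_000L;
String gap = days > 730 ? "+1YEAR" : days > 60 ? "+1MONTH" : "+1DAY";
SolrQuery q = new SolrQuery("*:*");
q.set("facet", "true");
q.set("facet.range", "mydate");
q.set("facet.range.start", min.toInstant().toString());
q.set("facet.range.end", max.toInstant().toString());
q.set("facet.range.gap", gap);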


Some ideas on it?





Re: Regarding /sql -- WHERE <> IS NULL and IS NOT NULL

2017-01-05 Thread Joel Bernstein
The IS NULL and IS NOT NULL predicates are not currently supported.
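
For what it's worth, outside /sql the same checks can be expressed in
standard query syntax, e.g.:

  fq=name:[* TO *]    documents where name has a value ("IS NOT NULL")
  fq=-name:[* TO *]   documents where name is missing ("IS NULL")

That is a workaround, though, not /sql support.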

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Jan 5, 2017 at 2:05 PM, radha krishnan 
wrote:

> Hi,
>
> solr version : 6.3
>
> will WHERE <> IS NULL / IS NOT NULL work with the /sql handler?
>
> " select   name from gettingstarted where name is not null "
>
> the above query is not returning any documents in the response even if
> there are documents with "name" defined
>
>
> Thanks,
> Radhakrishnan D
>


Regarding /sql -- WHERE <> IS NULL and IS NOT NULL

2017-01-05 Thread radha krishnan
Hi,

solr version : 6.3

will WHERE <> IS NULL / IS NOT NULL work with the /sql handler?

" select   name from gettingstarted where name is not null "

the above query is not returning any documents in the response even if
there are documents with "name" defined


Thanks,
Radhakrishnan D


Re: Search for ISBN-like identifiers

2017-01-05 Thread Josh Lincoln
Sebastian,
You may want to try adding autoGeneratePhraseQueries="true" to the
fieldtype.
With that setting, a query for 978-3-8052-5094-8 will behave just like "978
3 8052 5094 8" (with the quotes)

A few notes about autoGeneratePhraseQueries
a) it used to be set to true by default, but that was changed several years
ago
b) does NOT require a reindex, so very easy to test
c) apparently not recommended for non-whitespace delimited languages (CJK,
etc), but maybe that's not an issue in your use case.
d) I'm unsure how it'll impact wildcard queries on that field. E.g. will
978-3-8052* match 978-3-8052-5094-8? At the very least, partial ISBNs (e.g.
978-3-8052) would match the full ISBN without needing to use the wildcard. I'm
just not sure what happens if the user includes the wildcard.
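
For reference, a sketch of where the attribute would go (on the fieldType
declaration itself, here shown on a stock text_general type):

<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  ...
</fieldType>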

Josh

On Thu, Jan 5, 2017 at 1:41 PM Sebastian Riemer  wrote:

> Thank you very much for taking the time to help me!
>
> I'll definitely have a look at the link you've posted.
>
> @ShawnHeisey Thanks too for shedding light on the wildcard behaviour!
>
> Allow me one further question:
> - Assuming that I define a separate field for storing the ISBNs, using the
> awesome analyzer provided by Mr. Bill Dueber: how do I get that field
> copied into my general text field, which is used by my QuickSearch input?
> Won't that field be processed again by the analyser defined on the text
> field?
> - Should I alternatively add more fields to the q parameter? As for now, I
> always have set q=text: but I guess one
> could try something like
> q=text:+isbnspeciallookupfield:
>
> I don't really know about that last idea though, since the searches are
> probably OR-combined, which is not what I'd like to have.
>
> The third option would be to pre-process, in my application, the decision
> of where to look in Solr. I.e. everything matching a regex of only numbers
> and hyphens with length 13 -> don't query on field text, instead use field
> isbnspeciallookupfield.
>
>
> Many thanks again, and have a nice day!
> Sebastian
>
>
> -----Original Message-----
> From: Erik Hatcher [mailto:erik.hatc...@gmail.com]
> Sent: Thursday, 5 January 2017 19:10
> To: solr-user@lucene.apache.org
> Subject: Re: Search for ISBN-like identifiers
>
> Sebastian -
>
> There’s some precedent out there for ISBNs.  Bill Dueber and the
> UMICH/code4lib folks have done amazing work, check it out here -
>
> https://github.com/mlibrary/umich_solr_library_filters <
> https://github.com/mlibrary/umich_solr_library_filters>
>
>   - Erik
>
>
> > On Jan 5, 2017, at 5:08 AM, Sebastian Riemer 
> wrote:
> >
> > Hi folks,
> >
> >
> > TL;DR: Is there an easy way to copy ISBNs with hyphens to the general
> text field, respectively configure the analyser on that field, so that a
> search for the hyphenated ISBN returns exactly the matching document?
> >
> > Long version:
> > I've defined a field "text" of type "text_general", where I copy all
> > my other fields to, to be able to do a "quick search" where I set
> > q=text
> >
> > The definition of the type text_general is like this:
> >
> >
> >
> > <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
> >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > I now face the problem, that searching for a book with
> > text:978-3-8052-5094-8* does not return the single result I expect.
> > However searching for text:9783805250948* instead returns a result.
> > Note, that I am adding a wildcard at the end automatically, to further
> > broaden the resultset. Note also, that it does not seem to matter
> > whether I put backslashes in front of the hyphen or not (to be exact,
> > when sending via SolrJ from my application, I put in the backslashes,
> > but I don't see a difference when using SolrAdmin as I guess SolrAdmin
> > automatically inserts backslashes if needed?)
> >
> > When storing ISBNs, I do store them twice, once with hyphens
> (978-3-8052-5094-8) and once without (9783805250948). A pure phrase search
> on both those values return also the single document.
> >
> > I learned that the StandardTokenizer splits up values from fields at
> index time, and I've also learned that I can use the solrAdmin analysis and
> the debugQuery to help understand what is going on. From the analysis
> screen I see, that given the value 9783805250948 at index-time and
> 9783805250948* query-time both leads to an unchanged value 9783805250948 at
> the end.
> > When given the value 978-3-8052-5094-8 for "Field Value (Index)" and
> 978-3-8052-5094-8* for "Field Value (Query)"  I can see how the ISBN is
> tokenized into 5 parts. Again, the values match on both sides (Index and
> Query).
> >
> > How does the left side correlate with the right side? My guess: The left
> side means, "Values stored in fiel

Subqueries

2017-01-05 Thread Peter Matthew Eichman
Hello all,

I am attempting to use a subquery to enrich a query with the titles of
related objects. Each document in my index may have 1 or more pcdm_members
and pcdm_related_objects fields, whose values are ids of other documents in
the index. Those documents in turn have reciprocal pcdm_member_of and
pcdm_related_object_of fields.

In the Blacklight app I am working on, we want to enrich the display of a
document with the titles of its members and related objects using a
subquery. However, this is our first foray into subqueries and things
aren't working as expected.

I expected the following query to return a "members" key with a document
list of documents with "id" and "title" keys, but I am getting nothing:

{
  "responseHeader": {
"status": 0,
"QTime": 1,
"params": {
  "q": 
"id:\"https://fcrepolocal/fcrepo/rest/pcdm/19/31/3c/1a/19313c1a-6ab4-4305-93ec-12dfdf01ba74\"";,
  "indent": "true",
  "fl": "members:[subquery]",
  "members.fl": "id,title",
  "members.q": "{!terms f=id v=$row.pcdm_members}",
  "wt": "json",
  "_": "1483641932207"
}
  },
  "response": {
"numFound": 1,
"start": 0,
"docs": [
  {}
]
  }
}

Any pointers on what I am missing? Are there any configuration settings in
solrconfig.xml that I need to be aware of for subqueries to work?

Thanks,
-Peter

-- 
Peter Eichman
Senior Software Developer
University of Maryland Libraries
peich...@umd.edu


AW: Search for ISBN-like identifiers

2017-01-05 Thread Sebastian Riemer
Thank you very much for taking the time to help me!

I'll definitely have a look at the link you've posted.

@ShawnHeisey Thanks too for shedding light on the wildcard behaviour!

Allow me one further question:
- Assuming that I define a separate field for storing the ISBNs, using the 
awesome analyzer provided by Mr. Bill Dueber: how do I get that field copied 
into my general text field, which is used by my QuickSearch input? Won't that 
field be processed again by the analyser defined on the text field?
- Should I alternatively add more fields to the q parameter? As for now, I 
always have set q=text: but I guess one could 
try something like 
q=text:+isbnspeciallookupfield:

I don't really know about that last idea though, since the searches are 
probably OR-combined, which is not what I'd like to have.

The third option would be to pre-process, in my application, the decision of 
where to look in Solr. I.e. everything matching a regex of only numbers and 
hyphens with length 13 -> don't query on field text, instead use field 
isbnspeciallookupfield.


Many thanks again, and have a nice day!
Sebastian


-----Original Message-----
From: Erik Hatcher [mailto:erik.hatc...@gmail.com] 
Sent: Thursday, 5 January 2017 19:10
To: solr-user@lucene.apache.org
Subject: Re: Search for ISBN-like identifiers

Sebastian -

There’s some precedent out there for ISBNs.  Bill Dueber and the 
UMICH/code4lib folks have done amazing work, check it out here -

https://github.com/mlibrary/umich_solr_library_filters 


  - Erik


> On Jan 5, 2017, at 5:08 AM, Sebastian Riemer  wrote:
> 
> Hi folks,
> 
> 
> TL;DR: Is there an easy way to copy ISBNs with hyphens to the general text 
> field, respectively configure the analyser on that field, so that a search 
> for the hyphenated ISBN returns exactly the matching document?
> 
> Long version:
> I've defined a field "text" of type "text_general", where I copy all 
> my other fields to, to be able to do a "quick search" where I set 
> q=text
> 
> The definition of the type text_general is like this:
> 
> 
> 
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> 
> I now face the problem, that searching for a book with 
> text:978-3-8052-5094-8* does not return the single result I expect. 
> However searching for text:9783805250948* instead returns a result. 
> Note, that I am adding a wildcard at the end automatically, to further 
> broaden the resultset. Note also, that it does not seem to matter 
> whether I put backslashes in front of the hyphen or not (to be exact, 
> when sending via SolrJ from my application, I put in the backslashes, 
> but I don't see a difference when using SolrAdmin as I guess SolrAdmin 
> automatically inserts backslashes if needed?)
> 
> When storing ISBNs, I do store them twice, once with hyphens 
> (978-3-8052-5094-8) and once without (9783805250948). A pure phrase search on 
> both those values return also the single document.
> 
> I learned that the StandardTokenizer splits up values from fields at index 
> time, and I've also learned that I can use the solrAdmin analysis and the 
> debugQuery to help understand what is going on. From the analysis screen I 
> see, that given the value 9783805250948 at index-time and 9783805250948* 
> query-time both leads to an unchanged value 9783805250948 at the end.
> When given the value 978-3-8052-5094-8 for "Field Value (Index)" and 
> 978-3-8052-5094-8* for "Field Value (Query)"  I can see how the ISBN is 
> tokenized into 5 parts. Again, the values match on both sides (Index and 
> Query).
> 
> How does the left side correlate with the right side? My guess: The left side 
> means, "Values stored in field text will be tokenized while indexing as show 
> here on the left". The right side means, "When querying on the field text, 
> I'll tokenize the entered value like this, and see if I find something on the 
> index" Is this correct?
> 
> Another question: when querying and investigating the single document in 
> solrAdmin, the contents I see In the column text represents the _stored_ 
> value of the field text, right?
> And am I correct that this actually has nothing to do, with what is actually 
> stored in  the index for searching?
> 
> When storing the value 978-3-8052-5094-8, are only the tokenized values 
> stored for search, or is the "whole word" also stored? Is there a way to 
> actually see all the values which are stored for search?
> When searching text:" 978-3-8052-5094-8" I get the single result, so I guess 
> the value as a whole must also be stored in the index for searching?
> 
> One more thing which confuses me:
> Searching for text: 978-3-8052-5094-8 gives me 72 results,

Re: Search for ISBN-like identifiers

2017-01-05 Thread Erik Hatcher
Sebastian -

There’s some precedent out there for ISBNs.  Bill Dueber and the 
UMICH/code4lib folks have done amazing work, check it out here -

https://github.com/mlibrary/umich_solr_library_filters 


  - Erik


> On Jan 5, 2017, at 5:08 AM, Sebastian Riemer  wrote:
> 
> Hi folks,
> 
> 
> TL;DR: Is there an easy way to copy ISBNs with hyphens to the general text 
> field, respectively configure the analyser on that field, so that a search 
> for the hyphenated ISBN returns exactly the matching document?
> 
> Long version:
> I've defined a field "text" of type "text_general", where I copy all my other 
> fields to, to be able to do a "quick search" where I set q=text
> 
> The definition of the type text_general is like this:
> 
> 
> 
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> 
> 
> I now face the problem, that searching for a book with 
> text:978-3-8052-5094-8* does not return the single result I expect. However 
> searching for text:9783805250948* instead returns a result. Note, that I am 
> adding a wildcard at the end automatically, to further broaden the resultset. 
> Note also, that it does not seem to matter whether I put backslashes in front 
> of the hyphen or not (to be exact, when sending via SolrJ from my 
> application, I put in the backslashes, but I don't see a difference when 
> using SolrAdmin as I guess SolrAdmin automatically inserts backslashes if 
> needed?)
> 
> When storing ISBNs, I do store them twice, once with hyphens 
> (978-3-8052-5094-8) and once without (9783805250948). A pure phrase search on 
> both those values return also the single document.
> 
> I learned that the StandardTokenizer splits up values from fields at index 
> time, and I've also learned that I can use the solrAdmin analysis and the 
> debugQuery to help understand what is going on. From the analysis screen I 
> see, that given the value 9783805250948 at index-time and 9783805250948* 
> query-time both leads to an unchanged value 9783805250948 at the end.
> When given the value 978-3-8052-5094-8 for "Field Value (Index)" and 
> 978-3-8052-5094-8* for "Field Value (Query)"  I can see how the ISBN is 
> tokenized into 5 parts. Again, the values match on both sides (Index and 
> Query).
> 
> How does the left side correlate with the right side? My guess: The left side 
> means, "Values stored in field text will be tokenized while indexing as show 
> here on the left". The right side means, "When querying on the field text, 
> I'll tokenize the entered value like this, and see if I find something on the 
> index" Is this correct?
> 
> Another question: when querying and investigating the single document in 
> solrAdmin, the contents I see In the column text represents the _stored_ 
> value of the field text, right?
> And am I correct that this actually has nothing to do, with what is actually 
> stored in  the index for searching?
> 
> When storing the value 978-3-8052-5094-8, are only the tokenized values 
> stored for search, or is the "whole word" also stored? Is there a way to 
> actually see all the values which are stored for search?
> When searching text:" 978-3-8052-5094-8" I get the single result, so I guess 
> the value as a whole must also be stored in the index for searching?
> 
> One more thing which confuses me:
> Searching for text: 978-3-8052-5094-8 gives me 72 results, because it leads 
> to searching for "parsedquery_toString":"text:978 text:3 text:8052 text:5094 
> text:8",
> but searching for text: 978-3-8052-5094-8* gives me 0 results, this leads to 
> "parsedquery_toString":"text:978-3-8052-5094-8*",
> 
> Why is the appended wildcard changing the behaviour so radically? I'd rather 
> expect to get something like "parsedquery_toString":"text:978 text:3 
> text:8052 text:5094 text:8*",  and thus even more results.
> 
> Btw. I've found and read an interesting blog about storing ISBNs and alikes 
> here: 
> http://robotlibrarian.billdueber.com/2012/03/solr-field-type-for-numericish-ids/
>  However, I already store my ISBN also in a separate field, of type string, 
> which works fine when I use this field for searching.
> 
> Best regards, sorry for the enormously long question and thank you for 
> listening.
> 
> Sebastian



Re: Search for ISBN-like identifiers

2017-01-05 Thread Shawn Heisey
On 1/5/2017 3:08 AM, Sebastian Riemer wrote:
> I now face the problem, that searching for a book with
> text:978-3-8052-5094-8* does not return the single result I expect.
> However searching for text:9783805250948* instead returns a result.
> Note, that I am adding a wildcard at the end automatically, to further
> broaden the resultset. Note also, that it does not seem to matter
> whether I put backslashes in front of the hyphen or not (to be exact,
> when sending via SolrJ from my application, I put in the backslashes,
> but I don't see a difference when using SolrAdmin as I guess SolrAdmin
> automatically inserts backslashes if needed?) 

As soon as you use a wildcard, the query is no longer run through the
analysis chain, which means that it keeps all those hyphens.  That will
never match anything in the index, because the StandardTokenizer has
removed all the hyphens in the tokens that it puts into the index.  The
fact that wildcards skip analysis is a source of major confusion.  I
assume that the analysis skip is required for correct operation,
although I have never delved that deeply into the internals.

A hyphen is only a special character if it's the first character in a
word.  It's generally a good idea to escape the special characters
anyway, but in this case it doesn't matter, which is why you can send it
unescaped.

If you want to use wildcards, you're going to have to use them on an
untokenized (normally "string") field, or the results will probably not
be what you expect.

Thanks,
Shawn



Re: Search for ISBN-like identifiers

2017-01-05 Thread Erick Erickson
bq: How does the left side correlate with the right side?...

You've got it right, the left is the indexed and the right is the query

bq: the contents I see In the column text represents the _stored_
value of the field text, right...

Correct

bq: ...are only the tokenized values stored for search

I'll be a bit pedantic here since "stored" is overloaded ;)...

The _indexed_ tokens, i.e. the tokens you search against are all
that's searchable. For instance let's say you have "running" in your
text and are stemming. "run" is all that gets into the searchable
portion of your index.

there's no really convenient way to find the tokens associated with a
doc; the inverted index structure doesn't lend itself well to
reconstructing a doc that way. Luke _can_ do this. It's a lossy
process as you'll see. It can also be quite lengthy.

bq: One more thing which confuses me:

Oh boy. All I can offer here is that it's less confusing than it was in
"the bad old days". Wildcards are tricky to handle. Here's a writeup:
https://lucidworks.com/blog/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/

The short form is that wildcards are handled "specially" and much of
the analysis chain will be skipped; it depends on the particular
class. Your trailing wildcard example makes sense to a human, but it
turns out to be hard to generalize.

Two possibilities for you to consider, especially since ISBNs are regular:
1> WordDelimiterFilterFactory is designed for this kind of thing. You
can do things like "catenateNumbers" so what'd be searchable would be
both "978-3-8052-5094-8" and 9783805250948

2> do the above yourself in the ETL process. Then just use a
multiValued String field.
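
An untested sketch of option 1> (the field name and attribute choices are
just illustrative):

<fieldType name="isbn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateNumberParts="1" catenateNumbers="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With catenateNumbers="1", 978-3-8052-5094-8 should index both the number
parts and the catenated 9783805250948.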

Best,
Erick

On Thu, Jan 5, 2017 at 2:08 AM, Sebastian Riemer  wrote:
> Hi folks,
>
>
> TL;DR: Is there an easy way to copy ISBNs with hyphens to the general text 
> field, respectively configure the analyser on that field, so that a search 
> for the hyphenated ISBN returns exactly the matching document?
>
> Long version:
> I've defined a field "text" of type "text_general", where I copy all my other 
> fields to, to be able to do a "quick search" where I set q=text
>
> The definition of the type text_general is like this:
>
>
>
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
>
> I now face the problem, that searching for a book with 
> text:978-3-8052-5094-8* does not return the single result I expect. However 
> searching for text:9783805250948* instead returns a result. Note, that I am 
> adding a wildcard at the end automatically, to further broaden the resultset. 
> Note also, that it does not seem to matter whether I put backslashes in front 
> of the hyphen or not (to be exact, when sending via SolrJ from my 
> application, I put in the backslashes, but I don't see a difference when 
> using SolrAdmin as I guess SolrAdmin automatically inserts backslashes if 
> needed?)
>
> When storing ISBNs, I do store them twice, once with hyphens 
> (978-3-8052-5094-8) and once without (9783805250948). A pure phrase search on 
> both those values return also the single document.
>
> I learned that the StandardTokenizer splits up values from fields at index 
> time, and I've also learned that I can use the solrAdmin analysis and the 
> debugQuery to help understand what is going on. From the analysis screen I 
> see, that given the value 9783805250948 at index-time and 9783805250948* 
> query-time both leads to an unchanged value 9783805250948 at the end.
> When given the value 978-3-8052-5094-8 for "Field Value (Index)" and 
> 978-3-8052-5094-8* for "Field Value (Query)"  I can see how the ISBN is 
> tokenized into 5 parts. Again, the values match on both sides (Index and 
> Query).
>
> How does the left side correlate with the right side? My guess: The left side 
> means, "Values stored in field text will be tokenized while indexing as show 
> here on the left". The right side means, "When querying on the field text, 
> I'll tokenize the entered value like this, and see if I find something on the 
> index" Is this correct?
>
> Another question: when querying and investigating the single document in 
> solrAdmin, the contents I see In the column text represents the _stored_ 
> value of the field text, right?
> And am I correct that this actually has nothing to do, with what is actually 
> stored in  the index for searching?
>
> When storing the value 978-3-8052-5094-8, are only the tokenized values 
> stored for search, or is the "whole word" also stored? Is there a way to 
> actually see all the values which are stored for search?
> When searching text:" 978-3-8052-5094-8" I get the single result, so I guess 
> the value as a whole must also be stored in the index for searching?
>
> One more thing which confuses me:
> Searching f

reuse a org.apache.lucene.search.Query in Solrj?

2017-01-05 Thread xavier jmlucjav
Hi,

I have a lucene Query (Boolean query with a bunch of possibly complex
spatial queries, even polygon etc) that I am building for some MemoryIndex
stuff.

Now I need to add that same query to a Solr query (adding it to a bunch of
other fq I am using). Is there some way to piggyback the Lucene query
this way? It would be extremely handy in my situation.

thanks
xavier


Re: Howto reload "all" cores?

2017-01-05 Thread Shawn Heisey
On 1/5/2017 6:16 AM, Clemens Wyss DEV wrote:
> does http://localhost:8983/solr/admin/cores?action=RELOAD reload all
> cores?

No.  It would complain that you didn't give it a core name.

If you want to reload all cores, restart Solr ... or ask Solr for a list
of cores, and reload each of them.
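
A hedged SolrJ sketch of that list-then-reload loop (standalone Solr; the
URL is assumed):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.client.solrj.response.CoreAdminResponse;

public class ReloadAllCores {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr");
        // getStatus with a null core name lists all cores.
        CoreAdminResponse status = CoreAdminRequest.getStatus(null, client);
        for (int i = 0; i < status.getCoreStatus().size(); i++) {
            CoreAdminRequest.reloadCore(status.getCoreStatus().getName(i), client);
        }
        client.close();
    }
}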

Solr is not capable of restarting itself, you would have to do that
externally.

Thanks,
Shawn



AW: Re: update/extract override ExtractTyp

2017-01-05 Thread sn00py


I am using the extract URL and renamed the file to test.txt, but it is still 
parsed with the XML parser. Can I force the txt parser for all .txt files?


Sent from my Samsung device.

-------- Original Message --------
From: Shawn Heisey 
Date: 04.01.17  17:10  (GMT+01:00) 
To: solr-user@lucene.apache.org 
Subject: Re: update/extract override ExtractTyp 

On 1/4/2017 8:12 AM, sn0...@ulysses-erp.com wrote:
> Is it possible to override the ExtractClass for a specific document?
> I would like to upload an XML document, but this XML is not conformant XML.
>
> I need this XML because it is part of a project where a corrupt XML is
> needed, for testing purposes.
>
>
> The update/extract process fails every time with a 500 error.
>
> I tried to override the Content-Type with "text/plain" but still get
> the XML parse error.

If you send something to the /update handler, and don't tell Solr that
it is another format that it knows like CSV, JSON, or Javabin, then Solr
assumes that it is XML -- and that it is the *specific* XML format that
Solr uses.  "text/plain" is not one of the formats that the update
handler knows how to handle, so it will assume XML.

If you send some other arbitrary XML content, even if that XML is
otherwise correctly formed (which apparently yours isn't), Solr will
throw an error, because it is not the type of XML that Solr is looking
for.  On this page are some examples of what Solr is expecting when you
send XML:

https://wiki.apache.org/solr/UpdateXmlMessages
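
For reference, the general shape the /update handler expects (the field
names here are illustrative):

<add>
  <doc>
    <field name="id">test-1</field>
    <field name="body">arbitrary XML must go in here XML-escaped, as a field value</field>
  </doc>
</add>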

If you want to parse arbitrary XML into fields, you probably need to
send it using DIH and the XPathEntityProcessor.  If you want the XML to
go into a field completely as-is, then you need to encode the XML into
one of the update formats that Solr knows (XML, JSON, etc) and set it as
the value of one of the fields.

Thanks,
Shawn



Re: SolrCloud different score for same document on different replicas.

2017-01-05 Thread Charlie Hull

On 05/01/2017 13:30, Morten Bøgeskov wrote:



Hi.

We've got a SolrCloud which is sharded and has a replication factor of
2.

The 2 replicas of a shard may look like this:

Num Docs:5401023
Max Doc:6388614
Deleted Docs:987591


Num Docs:5401023
Max Doc:5948122
Deleted Docs:547099

We've seen >10% difference in Max Doc at times with same Num Docs.
Our use case is a few documents that are searched and many small ones that
are filtered against (often updated multiple times a day), so the
difference in deleted docs isn't surprising.

This results in a different score for a document depending on which
replica it comes from. As I see it: it has to do with the different
maxDoc value when calculating idf.

This in turn alters a specific document's position in the search
result over reloads. This is quite confusing (duplicates in pagination).

What is the trick to get a homogeneous score from different replicas?
We've tried using ExactStatsCache & ExactSharedStatsCache, but that
didn't seem to make any difference.

Any hints to this will be greatly appreciated.



This was one of the things we looked at during our recent Lucene London 
Hackday (see item 3): https://github.com/flaxsearch/london-hackday-2016


I'm not sure there is a way to get a homogeneous score - this patch tries 
to keep you connected to the same replica during a session so you don't 
see results jumping over pagination.


Cheers

Charlie


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


LineEntityProcessor | Separator --- /update/csv | OnError

2017-01-05 Thread Moenieb Davids
Hi,

Just wanted to know if anybody can assist with the following scenario:
I have pipe-delimited mainframe file(s) that sometimes miss certain fields 
in a row, which obviously causes issues when I try the /update/csv handler.

Scenario 1:
The CSV handler is quite fast; however, when it picks up a line that does not 
have all the fields due to a missing delimiter, the entire import fails.
So, is there a way to do an OnError-skip type of scenario?
I have checked the 6.3 ref guide and the web, but no luck.

Scenario 2:
I tried to use my own DIH and then configure my schema accordingly; however, I 
am trying to use the separator parameter, but it seems to not be working.
It looks like the data always just goes to rawLine, which then means that the 
separator effectively means nothing?

I am trying not to go too custom, so does anybody know of a "standard" way 
of getting the data in?

Regards
Moenieb












StringIndexOutOfBoundsException "in" SpellCheckCollator.getCollation

2017-01-05 Thread Clemens Wyss DEV
I am seeing many exceptions like this in my Solr [5.4.1] log:
null:java.lang.StringIndexOutOfBoundsException: String index out of range: -2
at 
java.lang.AbstractStringBuilder.replace(AbstractStringBuilder.java:824)
at java.lang.StringBuilder.replace(StringBuilder.java:262)
at 
org.apache.solr.spelling.SpellCheckCollator.getCollation(SpellCheckCollator.java:236)
at 
org.apache.solr.spelling.SpellCheckCollator.collate(SpellCheckCollator.java:93)
at 
org.apache.solr.handler.component.SpellCheckComponent.addCollationsToResponse(SpellCheckComponent.java:238)
at 
org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:203)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:273)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2073)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:658)
...
at java.lang.Thread.run(Thread.java:745)

What am I potentially facing here?

Thx
Clemens


RE: SolrCloud different score for same document on different replicas.

2017-01-05 Thread Markus Jelsma
Hello - you need a custom similarity that uses docCount as the divisor instead 
of maxDoc when calculating IDF. I believe this was fixed in some version but 
I'm not sure.
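
An untested sketch of such a similarity (Lucene 6.x API; whether your
version's ClassicSimilarity already does this is worth checking first):

import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.ClassicSimilarity;

public class DocCountSimilarity extends ClassicSimilarity {
    @Override
    public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
        long df = termStats.docFreq();
        // Use docCount (docs that have the field) rather than maxDoc as the divisor.
        long docCount = collectionStats.docCount() == -1
                ? collectionStats.maxDoc() : collectionStats.docCount();
        float idf = idf(df, docCount);
        return Explanation.match(idf, "idf(docFreq=" + df + ", docCount=" + docCount + ")");
    }
}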

Markus
 
-Original message-
> From:Morten Bøgeskov 
> Sent: Thursday 5th January 2017 14:33
> To: solr-user@lucene.apache.org
> Subject: SolrCloud different score for same document on different replicas.
> 
> 
> 
> Hi.
> 
> We've got a SolrCloud which is sharded and has a replication factor of
> 2.
> 
> The 2 replicas of a shard may look like this:
> 
> Num Docs:5401023
> Max Doc:6388614
> Deleted Docs:987591
> 
> 
> Num Docs:5401023
> Max Doc:5948122
> Deleted Docs:547099
> 
> We've seen >10% difference in Max Doc at times with same Num Docs.
> Our use case is a few documents that are searched and many small ones that
> are filtered against (often updated multiple times a day), so the
> difference in deleted docs isn't surprising.
> 
> This results in a different score for a document depending on which
> replica it comes from. As I see it: it has to do with the different
> maxDoc value when calculating idf.
> 
> This in turn alters a specific document's position in the search
> result over reloads. This is quite confusing (duplicates in pagination).
> 
> What is the trick to get a homogeneous score from different replicas?
> We've tried using ExactStatsCache & ExactSharedStatsCache, but that
> didn't seem to make any difference.
> 
> Any hints to this will be greatly appreciated.
> 
> -- 
>  Morten Bøgeskov 
> 
> 


SolrCloud different score for same document on different replicas.

2017-01-05 Thread Morten Bøgeskov


Hi.

We've got a SolrCloud which is sharded and has a replication factor of
2.

The 2 replicas of a shard may look like this:

Num Docs:5401023
Max Doc:6388614
Deleted Docs:987591


Num Docs:5401023
Max Doc:5948122
Deleted Docs:547099

We've seen >10% difference in Max Doc at times with same Num Docs.
Our use case is a few documents that are searched and many small ones that
are filtered against (often updated multiple times a day), so the
difference in deleted docs isn't surprising.

This results in a different score for a document depending on which
replica it comes from. As I see it: it has to do with the different
maxDoc value when calculating idf.

This in turn alters a specific document's position in the search
result over reloads. This is quite confusing (duplicates in pagination).

What is the trick to get a homogeneous score from different replicas?
We've tried using ExactStatsCache & ExactSharedStatsCache, but that
didn't seem to make any difference.

Any hints to this will be greatly appreciated.

-- 
 Morten Bøgeskov 



Howto reload "all" cores?

2017-01-05 Thread Clemens Wyss DEV
does
http://localhost:8983/solr/admin/cores?action=RELOAD
reload all cores?

Thx
Clemens


Re: ClusterStateMutator

2017-01-05 Thread Hendrik Haddorp
The UI warning was quite easy to resolve. I'm currently testing Solr 
with HDFS but for some reason the core ended up on the local storage of 
the node. After a delete and restart the problem was gone.


On 05.01.2017 12:42, Hendrik Haddorp wrote:
Right, I had to do that multiple times already when I restarted nodes 
during collection creation. In such cases I was left with data in the 
clusterstate.json, which at least on 6.2.1, blocked further collection 
creations. Once manually deleted or set to {} collection creation 
worked again.


Setting legacyCloud=false looks good. I don't get anything in 
clusterstate.json anymore and no old collections show up after a node 
restarts. I could also confirm what Shalin said, that state format 2 
is used by default. Only if I explicitly set state format to 1 I see 
data in clusterstate.json during the collection creation. Just the 
Solr admin UI is now showing "SolrCore Initialization Failures" 
pointing to non-existing replicas. I assume that happens when Solr 
starts up and finds data for a core that does not exist in ZK anymore. 
How would one clean up this issue? Besides that, some replicas can 
still end up broken if the node restarts at the wrong time. I 
currently have one replica marked as down and one as gone. So far I 
was however always able to manually replace these replicas to resolve 
this state. So in general this looks quite good now. Guess I will 
still need to find a way to make sure that I don't restart a node 
during collection creation :-(


On 05.01.2017 02:33, Erick Erickson wrote:

Let us know how it goes. You'll probably want to remove the _contents_
of clusterstate.json and just leave it as a pair of brackets , i.e. {}
if for no other reason than it's confusing.

Times past the node needed to be there even if empty. Although I just
tried removing it completely on 6x and I was able to start Solr, part
of the startup process recreates it as an empty node, just a pair of
braces.

Best,
Erick

On Wed, Jan 4, 2017 at 1:22 PM, Hendrik Haddorp 
 wrote:

Hi Erik,

I have actually also seen that behavior already. So will check what
happens when I set that property.
I still believe I'm getting the clusterstate.json set already before 
the

node comes up again. But I will try to verify that further tomorrow.

thanks,
Hendrik

On 04/01/17 22:10, Erick Erickson wrote:

Hendrik:

Historically in 4.x, there was code that would reconstruct the
clusterstate.json code. So you would see "deleted" collections come
back. One scenario was:

- Have a Solr node offline that had a replica for a collection.
- Delete that collection
- Bring the node back
- It would register itself in clusterstate.json.

So my guess is that something like this is going on and you're getting
a clusterstate.json that's reconstructed (and possibly not complete).

You can avoid this by specifying legacyCloud=false clusterprop

Kind of a shot in the dark...

Erick

On Wed, Jan 4, 2017 at 11:12 AM, Hendrik Haddorp
 wrote:
You are right, the code looks like it. But why did I then see 
collection

data in the clusterstate.json file? If version 1 is not used I would
assume that no data ends up in there. When explicitly setting the 
state
format 2 the system seemed to behave differently. And if the code 
always
uses version 2 shouldn't the default in that line be changed 
accordingly?


On 04/01/17 16:41, Shalin Shekhar Mangar wrote:

Actually the state format defaults to 2 since many releases (all of
6.x at least). This default is enforced in CollectionsHandler much
before the code in ClusterStateMutator is executed.

On Wed, Jan 4, 2017 at 6:16 PM, Hendrik Haddorp 
 wrote:

Hi,

in
solr-6.3.0/solr/core/src/java/org/apache/solr/cloud/overseer/ClusterStateMutator.java 


there is the following code starting line 107:

//TODO default to 2; but need to debug why 
BasicDistributedZk2Test fails

early on
 String znode = message.getInt(DocCollection.STATE_FORMAT, 
1) == 1 ? null

 : ZkStateReader.getCollectionPath(cName);

Any if that will be changed to default to version 2 anytime soon?

thanks,
Hendrik






Re: ClusterStateMutator

2017-01-05 Thread Hendrik Haddorp
Right, I had to do that multiple times already when I restarted nodes 
during collection creation. In such cases I was left with data in the 
clusterstate.json, which at least on 6.2.1, blocked further collection 
creations. Once manually deleted or set to {} collection creation worked 
again.


Setting legacyCloud=false looks good. I don't get anything in 
clusterstate.json anymore and no old collections show up after a node 
restarts. I could also confirm what Shalin said, that state format 2 is 
used by default. Only if I explicitly set the state format to 1 do I see 
data in clusterstate.json during collection creation. Just the Solr admin 
UI is now showing "SolrCore Initialization Failures" pointing to 
non-existent replicas. I assume that happens when Solr starts up and finds 
data for a core that does not exist in ZK anymore. How would one clean up 
this issue? Besides that, some replicas can still end up broken if the 
node restarts at the wrong time. I currently have one replica marked as 
down and one as gone. So far I was, however, always able to manually 
replace these replicas to resolve this state. So in general this looks 
quite good now. Guess I will still need to find a way to make sure that 
I don't restart a node during collection creation :-(
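
For reference, the legacyCloud property can be set via the Collections API
(a sketch assuming the default host and port):

http://localhost:8983/solr/admin/collections?action=CLUSTERPROP&name=legacyCloud&val=false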


On 05.01.2017 02:33, Erick Erickson wrote:

Let us know how it goes. You'll probably want to remove the _contents_
of clusterstate.json and just leave it as a pair of brackets, i.e. {},
if for no other reason than it's confusing.

Times past the node needed to be there even if empty. Although I just
tried removing it completely on 6x and I was able to start Solr, part
of the startup process recreates it as an empty node, just a pair of
braces.

Best,
Erick

On Wed, Jan 4, 2017 at 1:22 PM, Hendrik Haddorp  wrote:

Hi Erik,

I have actually also seen that behavior already. So will check what
happens when I set that property.
I still believe I'm getting the clusterstate.json set already before the
node comes up again. But I will try to verify that further tomorrow.

thanks,
Hendrik

On 04/01/17 22:10, Erick Erickson wrote:

Hendrik:

Historically in 4.x, there was code that would reconstruct the
clusterstate.json code. So you would see "deleted" collections come
back. One scenario was:

- Have a Solr node offline that had a replica for a collection.
- Delete that collection
- Bring the node back
- It would register itself in clusterstate.json.

So my guess is that something like this is going on and you're getting
a clusterstate.json that's reconstructed (and possibly not complete).

You can avoid this by specifying legacyCloud=false clusterprop

Kind of a shot in the dark...

Erick

On Wed, Jan 4, 2017 at 11:12 AM, Hendrik Haddorp
 wrote:

You are right, the code looks like it. But why did I then see collection
data in the clusterstate.json file? If version 1 is not used, I would
assume that no data ends up in there. When explicitly setting the state
format to 2, the system seemed to behave differently. And if the code always
uses version 2, shouldn't the default in that line be changed accordingly?

On 04/01/17 16:41, Shalin Shekhar Mangar wrote:

Actually the state format defaults to 2 since many releases (all of
6.x at least). This default is enforced in CollectionsHandler much
before the code in ClusterStateMutator is executed.

On Wed, Jan 4, 2017 at 6:16 PM, Hendrik Haddorp  wrote:

Hi,

in
solr-6.3.0/solr/core/src/java/org/apache/solr/cloud/overseer/ClusterStateMutator.java
there is the following code starting line 107:

//TODO default to 2; but need to debug why BasicDistributedZk2Test fails
early on
 String znode = message.getInt(DocCollection.STATE_FORMAT, 1) == 1 ? null
 : ZkStateReader.getCollectionPath(cName);

Any idea if that will be changed to default to version 2 anytime soon?

thanks,
Hendrik




Solr json facet api

2017-01-05 Thread kshitij tyagi
Hi,

We were earlier using Solr 4.0 and have now moved to Solr 5.2.

I am debugging queries and seeing that most of the query time is taken
by facet queries.

I have read about the Solr JSON Facet API in Solr 5 onwards; can anyone
help me understand the difference between the two?

Will there be a significant gain in query performance and response time
if I manage to use the Solr JSON Facet API?
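
For comparison, a terms facet written in both syntaxes - a sketch assuming
a field named category:

classic:  q=*:*&facet=true&facet.field=category&facet.limit=10
JSON API: q=*:*&json.facet={categories:{type:terms,field:category,limit:10}}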

Kindly help me out here as I am trying to reduce my query response time.

Regards,
Kshitij


Search for ISBN-like identifiers

2017-01-05 Thread Sebastian Riemer
Hi folks,


TL;DR: Is there an easy way to copy ISBNs with hyphens to the general text 
field, or to configure the analyser on that field, so that a search for the 
hyphenated ISBN returns exactly the matching document?

Long version:
I've defined a field "text" of type "text_general", to which I copy all my other 
fields, to be able to do a "quick search" where I set q=text

The definition of the type text_general is like this:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I now face the problem that searching for a book with text:978-3-8052-5094-8* 
does not return the single result I expect, while searching for 
text:9783805250948* does return it. Note that I am adding a wildcard at the end 
automatically, to further broaden the result set. Note also that it does not 
seem to matter whether I put backslashes in front of the hyphens or not (to be 
exact, when sending via SolrJ from my application, I put in the backslashes, 
but I don't see a difference when using SolrAdmin, as I guess SolrAdmin 
automatically inserts backslashes if needed?)

When storing ISBNs, I store them twice, once with hyphens 
(978-3-8052-5094-8) and once without (9783805250948). A pure phrase search on 
either of those values also returns the single document.

I learned that the StandardTokenizer splits up values from fields at index 
time, and I've also learned that I can use the SolrAdmin analysis screen and 
debugQuery to help understand what is going on. From the analysis screen I see 
that the value 9783805250948 at index time and 9783805250948* at query time 
both lead to an unchanged value 9783805250948 at the end.
When given the value 978-3-8052-5094-8 for "Field Value (Index)" and 
978-3-8052-5094-8* for "Field Value (Query)" I can see how the ISBN is 
tokenized into 5 parts. Again, the values match on both sides (Index and Query).

How does the left side correlate with the right side? My guess: the left side 
means "values stored in the field text will be tokenized while indexing as 
shown here on the left", and the right side means "when querying on the field 
text, I'll tokenize the entered value like this, and see if I find something 
in the index". Is this correct?

Another question: when querying and investigating the single document in 
SolrAdmin, the contents I see in the column text represent the _stored_ value 
of the field text, right?
And am I correct that this actually has nothing to do with what is actually 
stored in the index for searching?

When storing the value 978-3-8052-5094-8, are only the tokenized values stored 
for search, or is the "whole word" also stored? Is there a way to actually see 
all the values which are stored for search?
When searching text:"978-3-8052-5094-8" I get the single result, so I guess 
the value as a whole must also be stored in the index for searching?
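
One way to actually see the indexed terms of a field is the TermsComponent
(a sketch assuming the default /terms handler is configured and the default
port):

http://localhost:8983/solr/<core>/terms?terms.fl=text&terms.prefix=978&terms.limit=50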

One more thing which confuses me:
Searching for text: 978-3-8052-5094-8 gives me 72 results, because it leads to 
searching for "parsedquery_toString":"text:978 text:3 text:8052 text:5094 
text:8",
but searching for text: 978-3-8052-5094-8* gives me 0 results, this leads to 
"parsedquery_toString":"text:978-3-8052-5094-8*",

Why is the appended wildcard changing the behaviour so radically? I'd rather 
expect to get something like "parsedquery_toString":"text:978 text:3 text:8052 
text:5094 text:8*",  and thus even more results.

Btw. I've found and read an interesting blog post about storing ISBNs and the 
like here: 
http://robotlibrarian.billdueber.com/2012/03/solr-field-type-for-numericish-ids/
However, I already store my ISBN in a separate field of type string, 
which works fine when I use this field for searching.
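
Along the lines of that blog post, a sketch of a hyphen-insensitive copy
target (field and type names are made up; the tokenizer and filters are
stock Solr). The KeywordTokenizer keeps the ISBN as a single token and the
pattern filter strips the hyphens at both index and query time, so both
spellings match:

<fieldType name="isbn" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="-" replacement="" replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="isbn_search" type="isbn" indexed="true" stored="false"/>
<copyField source="isbn" dest="isbn_search"/>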

Best regards, sorry for the enormously long question and thank you for 
listening.

Sebastian


Re: SolrJ doesn't work with Json facet api

2017-01-05 Thread Sandeep Khanzode
For me, these variants have worked ...

solrQuery.add("json.facet", "...");

solrQuery.setParam("json.facet", "...");
 
You get ...
QueryResponse.getResponse().get("facets");
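
Put together, a minimal end-to-end sketch (collection name and facet field
are placeholders; assumes SolrJ 5.x, where HttpSolrClient still offers the
plain String constructor):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.NamedList;

public class JsonFacetExample {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0); // only the facet counts are needed, not the documents
    q.add("json.facet", "{cats:{type:terms,field:cat}}");
    QueryResponse rsp = client.query(q);
    // JSON facet results come back under the "facets" key of the raw response
    NamedList<?> facets = (NamedList<?>) rsp.getResponse().get("facets");
    System.out.println(facets);
    client.close();
  }
}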

SRK 

On Thursday, January 5, 2017 1:19 PM, Jeffery Yuan wrote:

Thanks for your response.
We definitely use solrQuery.set("json.facet", "the json query here");

Btw we are using Solr 5.2.1.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrJ-doesn-t-work-with-Json-facet-api-tp4299867p4312459.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr query *:* timeout

2017-01-05 Thread sn00py

Hmmm, I have to check something -
it seems that it's not an error.

There are some zip files which are indexed, and on the admin page
all fields are fetched, including the contents ... and the zip
document has really big content :O
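
If it is just the huge stored content being dragged back, limiting the
returned fields usually makes the admin query usable again - a sketch
(core and field names assumed):

http://localhost:8983/solr/<core>/select?q=*:*&rows=10&fl=id,title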











This message was sent using IMP, the Internet Messaging Program.



Solr query *:* timeout

2017-01-05 Thread sn00py
Hello - an hour ago, Solr worked fine; I had about 2 documents in  
the index.
I had run an update/extract process from the batch, and saw that one  
document had blocked the batch.


I waited for about 2 minutes, then I killed the update batch process.

After a restart of the server, I started Solr.

But now (and also before the restart):

if I query *.* in the admin interface, it hangs and I get no result;

if I query for uid:1 or make a query for q=test, I get the document;
a query for q=*a* hangs too.

How can I repair it?


This message was sent using IMP, the Internet Messaging Program.



Re: How to train the model using user clicks when use ltr(learning to rank) module?

2017-01-05 Thread Will Martin
http://www.dcc.fc.up.pt/~pribeiro/aulas/na1516/slides/na1516-slides-ir.pdf

  See the relevant sections for good info.





How to train the model using user clicks when use ltr(learning to rank) module?

2017-01-05 Thread Jeffery Yuan
Thanks very much for integrating machine learning into Solr.
https://github.com/apache/lucene-solr/blob/f62874e47a0c790b9e396f58ef6f14ea04e2280b/solr/contrib/ltr/README.md

In the "Assemble training data" part, the third column indicates the relative
importance or relevance of that doc.
Could you please give more info about how to derive that score from what
users click?

I have read
https://static.aminer.org/pdf/PDF/000/472/865/optimizing_search_engines_using_clickthrough_data.pdf
http://www.cs.cornell.edu/people/tj/publications/joachims_etal_05a.pdf
http://alexbenedetti.blogspot.com/2016/07/solr-is-learning-to-rank-better-part-1.html

But I still have no clue how to translate the partial pairwise feedback into
the importance or relevance of that doc.

From a user's perspective, the steps such as setting up the features and model
in Solr are simple, but collecting the feedback data and training/updating the
model is much more complex.

It would be great if Solr could provide some detailed instructions or sample
code about how to translate the partial pairwise feedback and use it to train
and update the model.
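
As a starting point, one common heuristic (my assumption, not something the
LTR module prescribes) is to aggregate clicks per query/document pair and
bucket the click-through rate into graded relevance - keeping in mind that
raw clicks are position-biased, which is exactly why the Joachims papers
above prefer pairwise preferences:

// Hypothetical helper: bucket click-through rate into a 0-4 relevance grade.
// The thresholds are made up and would need tuning against editorial judgments.
public class ClickGrader {
    public static int gradeFromCtr(long clicks, long impressions) {
        if (impressions == 0) return 0;      // never shown: no evidence
        double ctr = (double) clicks / impressions;
        if (ctr >= 0.30) return 4;           // very frequently clicked
        if (ctr >= 0.15) return 3;
        if (ctr >= 0.05) return 2;
        if (ctr >= 0.01) return 1;
        return 0;                            // shown but (almost) never clicked
    }
}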

Thanks again for your help.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-train-the-model-using-user-clicks-when-use-ltr-learning-to-rank-module-tp4312462.html
Sent from the Solr - User mailing list archive at Nabble.com.