Re: Issues with Authentication / Role based authorization

2016-05-10 Thread shamik
Ok, I'm really struggling to figure out the right approach here. I wanted to
make it simple and started fresh. Removed the existing nodes (node1 and
node2), started the server in Cloud mode and uploaded the following
security.json.

{
  "authentication": {
    "blockUnknown": true,
    "class": "solr.BasicAuthPlugin",
    "credentials": {
      "solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="
    }
  },
  "authorization": {
    "class": "solr.RuleBasedAuthorizationPlugin",
    "permissions": [
      {
        "name": "security-edit",
        "role": "admin"
      },
      {
        "name": "all",
        "role": "all"
      },
      {
        "name": "browse",
        "collection": "gettingstarted",
        "path": "/browse",
        "role": "browseRole"
      },
      {
        "name": "select",
        "collection": "gettingstarted",
        "path": "/select/*",
        "role": "selectRole"
      }
    ],
    "user-role": {
      "solr": ["admin"]
    }
  }
}

When I try to log in using solr/SolrRocks, I get the following exception:

INFO  - 2016-05-11 05:55:48.830; [   ]
org.apache.solr.security.RuleBasedAuthorizationPlugin; This resource is
configured to have a permission
org.apache.solr.security.RuleBasedAuthorizationPlugin$Permission@167ffde1,
The principal [principal: solr] does not have the right role 
INFO  - 2016-05-11 05:55:48.834; [   ] org.apache.solr.servlet.HttpSolrCall;
USER_REQUIRED auth header Basic c29scjpTb2xyUm9ja3M= context : [FAILED
toString()] 

Now, I removed the node, started all over again and uploaded a bare-bones
security.json.

{
  "authentication": {
    "blockUnknown": true,
    "class": "solr.BasicAuthPlugin",
    "credentials": {
      "solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="
    }
  },
  "authorization": {
    "class": "solr.RuleBasedAuthorizationPlugin",
    "user-role": {
      "solr": "admin"
    },
    "permissions": [{
      "name": "security-edit",
      "role": "admin"
    }]
  }
}

I was able to access the Solr admin and request handlers without any issue.
The entire admin functionality, including creating/modifying collections, was
accessible.

Is it safe to assume that the default security.json can accept only one role?

Now, I added a couple of users through curl: {"set-user": {"superuser":
"Password1", "beehive": "Password1"}}.

Then, assigned "superuser" to admin role.
{"set-user-role":{"superuser":"admin"}}

I'm able to access both admin and request handlers. So far so good.

I added a couple of new permissions:

{"set-permission" : {"name":"select", "collection": "gettingstarted", 
"path": "/select/*", "role": "selectRole"}}
{"set-permission" : {"name":"browse", "collection": "gettingstarted", 
"path": "/browse", "role": "browseRole"}}

Then assigned user "beehive" to these roles.

{"set-user-role":{"beehive":["browseRole","selectRole"]}}

Logged in as "beehive" and accessed /browse. The page came up, but threw the
following exception:

[c:gettingstarted s:shard2 r:core_node2 x:gettingstarted_shard2_replica1]
org.apache.solr.common.SolrException;
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at
http://192.168.1.100:7574/solr/gettingstarted_shard1_replica2: Expected mime
type application/octet-stream but got text/html. 


Error 401 Unauthorized request, Response code: 401

HTTP ERROR 401
Problem accessing /solr/gettingstarted_shard1_replica2/browse.
Reason: Unauthorized request, Response code: 401 (Powered by Jetty)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:544)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
at org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:372)
at org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:325)
at org.apache.solr.handler.component.HttpShardHandlerFactory.makeLoadBalancedRequest(HttpShardHandlerFactory.java:246)
at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:201)
at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:163)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

How to search string

2016-05-10 Thread kishor
I want to search for a product whose name is "Garmin Class A", so I expect the
results to match the whole string "Garmin Class A", but it searches each word
separately and I don't know why. Please guide me on how to search for a string
in only one field, not in other fields.

"debug": {
  "rawquerystring": "Garmin Class A",
  "querystring": "Garmin Class A",
  "parsedquery": "(+(DisjunctionMaxQuery((product_name:Garmin)) DisjunctionMaxQuery((product_name:Class)) DisjunctionMaxQuery((product_name:A))) ())/no_coord",
  "parsedquery_toString": "+((product_name:Garmin) (product_name:Class) (product_name:A)) ()",
  "explain": {},



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-search-string-tp4276052.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: [scottchu] What kind of configuration to use for this size of news data?

2016-05-10 Thread scott.chu

A further question: Can master-slave and SolrCloud exist simultaneously in one 
Solr server? If yes, how can I do it?

scott.chu,scott@udngroup.com
2016/5/11 (週三)
- Original Message - 
From: scott(自己) 
To: solr-user 
CC: 
Date: 2016/5/11 (週三) 11:11
Subject: [scottchu] What kind of configuration to use for this size of news 
data?



I want to build a Solr engine for over 60-year news articles. My requests are 
(I use Solr 5.4.1):

1> Currently over 10M no. of docs.
2> Currently over 60GB total data size.
3> The no. of docs and data size will keep growing at the rate of 1000 no. of 
docs(or 8MB size) per day.
4> There are totally 5-6 different newspaper types.

My questions are:
1> Is it workable enough just to use the master-slave model? Or should I turn to 
SolrCloud? (I ask this because our system management group has never managed a 
distributed system before and they also have no knowledge of Zookeeper, shards, 
etc. Also they don't know how to backup/restore distributed data.)
2> Say I choose SolrCloud anyway. I wish to keep one shard owning one specific 
year of data. Can it be done? What configuration should I use? (AFAIK, SolrCloud 
distributes data based on some intrinsic routing algorithm; see the sketch after 
these questions.)
3> If I wish to create another Solr engine with one or two particular paper types, 
is it possible to copy their data directly from the big central Solr engine? Or do 
I have to rebuild the index from the raw article data? (Our business may have 
this need.)
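On question 2, one approach worth checking is compositeId routing: documents
whose uniqueKey shares a route prefix hash to the same shard, though one shard
may still hold several years unless the shard count lines up. A sketch (the
zkHost string, collection and field names are assumptions):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class RoutedIndexing {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
    client.setDefaultCollection("news");
    SolrInputDocument doc = new SolrInputDocument();
    // compositeId routing: every id with the "1995!" prefix hashes to
    // the same shard, keeping one year's articles together.
    doc.addField("id", "1995!article-0001");
    doc.addField("title", "Example headline");
    client.add(doc);
    client.commit();
    client.close();
  }
}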

I'd like to hear your suggestions and experiences.

Thanks in advance and best regards.
scott.chu,scott@udngroup.com
2016/5/11 (週三)


Search score showing in exponential format

2016-05-10 Thread Zheng Lin Edwin Yeo
Hi,

I found that in my search results, there are some results which have a score
which looks like the following:

"score":6.705859E-6}]


This is a figure with a very small value, and it may occur in queries which
find a large number of records. Is there a way to standardise the figure
format, so that all scores display as plain decimals, like 0.0670, instead of
in exponential format?

I'm using Solr 5.4.0
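Solr returns the score as a raw float, so one option is to reformat it on the
client side; a sketch in Java:

import java.math.BigDecimal;

public class ScoreFormat {
  public static void main(String[] args) {
    float score = 6.705859E-6f;
    // Expand the float's exponential form into plain decimal notation.
    String plain = new BigDecimal(Float.toString(score)).toPlainString();
    System.out.println(plain); // prints 0.000006705859
  }
}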

Regards,
Edwin


Sorting for MLT results

2016-05-10 Thread Zheng Lin Edwin Yeo
Hi,

I would like to check: is there a function to do the sorting for MLT results
in Solr? I understand that there is a sort parameter, but that only works
for the main query results. It does not do any sorting for the MLT results.

I'm using Solr 5.4.0.

Regards,
Edwin


[scottchu] What kind of configuration to use for this size of news data?

2016-05-10 Thread scott.chu
Fix some typos, add some words and resend same question => 

I want to build a Solr engine for over 60-year news articles. My requests are 
(I use Solr 5.4.1):
 
1> Currently over 10M no. of docs.
2> Currently over 60GB total data size.
3> The no. of docs and data size will keep growing at the rate of 1000 no. of 
docs(or 8MB size) per day.
4> There are totally 5-6 different newspaper types.
 
My questions are:
1> Is it workable enough just to use the master-slave model? Or should I turn to 
SolrCloud? (I ask this because our system management group has never managed a 
distributed system before and they also have no knowledge of Zookeeper, shards, 
etc. Also they don't know how to backup/restore distributed data.)
2> Say I choose SolrCloud anyway. I wish to keep one shard owning one specific 
year of data. Can it be done? What configuration should I use? (AFAIK, SolrCloud 
distributes data based on some intrinsic routing algorithm.)
3> If I wish to create another Solr engine with one or two particular paper types, 
is it possible to copy their index data directly from the big central Solr 
engine? Or do I have to rebuild the index from the raw article data? (Our 
business may have this need.)

I'd like to hear your suggestions and experiences.
 
Thanks in advance and best regards.

Scott Chu @ 2016/5/11  11:26 GMT+8


[scottchu] What kind of configuration to use for this size of news data?

2016-05-10 Thread scott.chu

I want to build a Solr engine for over 60-year news articles. My requests are 
(I use Solr 5.4.1):

1> Currently over 10M no. of docs.
2> Currently over 60GB total data size.
3> The no. of docs and data size will keep growing at the rate of 1000 no. of 
docs(or 8MB size) per day.
4> There are totally 5-6 different newspaper types.

My questions are:
1> Is it workable enough just to use the master-slave model? Or should I turn to 
SolrCloud? (I ask this because our system management group has never managed a 
distributed system before and they also have no knowledge of Zookeeper, shards, 
etc. Also they don't know how to backup/restore distributed data.)
2> Say I choose SolrCloud anyway. I wish to keep one shard owning one specific 
year of data. Can it be done? What configuration should I use? (AFAIK, SolrCloud 
distributes data based on some intrinsic routing algorithm.)
3> If I wish to create another Solr engine with one or two particular paper types, 
is it possible to copy their data directly from the big central Solr engine? Or 
do I have to rebuild the index from the raw article data? (Our business may have 
this need.)

I'd like to hear your suggestions and experiences.

Thanks in advance and best regards.

scott.chu,scott@udngroup.com
2016/5/11 (週三)


Re: what scene using carrot2 cluster

2016-05-10 Thread Zheng Lin Edwin Yeo
I'm using carrot2 clustering with Solr, with the Lingo3GClusteringAlgorithm,
but that requires a licence. Otherwise, you can use the default
LingoClusteringAlgorithm.

Regards,
Edwin

On 10 May 2016 at 22:43, xiangliumi <852262...@qq.com> wrote:

> hi, all
>
> Has anyone used carrot2 with solr? Please give me a scenario description
> of when to use carrot2, and ideally some links about deploying solr5.x
> with carrot2. Thanks for your help!
>
>
> thanks
> Max Mi
>
> Sent using CloudMagic Email [
> https://cloudmagic.com/k/d/mailapp?ct=pa&cv=8.4.52&pv=4.4.2&source=email_footer_2
> ]


Re:Re: solrcloud performance problem

2016-05-10 Thread lltvw
Hi Shawn,


Thanks for your help.


The args used to start solr are as follows, and I uploaded my screenshot to
http://www.yupoo.com/photos/qzone3927066199/96064170/; please take a look,
thanks.

-DSTOP.PORT=7989

-DSTOP.KEY=

-DzkHost=node1:2181,node2:2181,node3:2181/solr

-Dsolr.solr.home=solr

-Dbootstrap_conf=true

-Xmx10240M

-Xms4196M

-XX:MaxPermSize=512M

-XX:PermSize=256M

-Dcom.sun.management.jmxremote.authenticate=false

-Dcom.sun.management.jmxremote.ssl=false

-Dcom.sun.management.jmxremote.port=3000

-Dcom.sun.management.jmxremote









At 2016-05-10 23:25:53, "Shawn Heisey"  wrote:
>On 5/9/2016 11:42 PM, lltvw wrote:
>> By using jps command double check the parms used to start solr, i found that 
>> the max  heap size already set to 10G. So I made a big mistake yesterday.
>>
>> But by using solr admin UI, I select the collection with performance 
>> problem, in the overview page I find that the heap memory is about 8M. What 
>> is wrong.
>>
>> Every time I search different characters, QTime from the response header is always 
>> greater than 300ms. If I search again, because I can hit the cache, the response 
>> time drops to about 30ms.
>
>When my queries hit the cache, they only take a few milliseconds.  30
>milliseconds for a cached query seems VERY slow.
>
>Can you open the dashboard in the admin UI, make it large enough to see
>everything, take a screenshot of the whole page, and included a URL
>where that screenshot can be viewed?  I do not need to see the whole
>browser window, just the whole dashboard.  Here's an example of what I
>am looking for:
>
>https://www.dropbox.com/s/ixu8dr954mst0c4/dashboard-just-page.png?dl=0
>
>In my example, you can't see all of the JVM Args in the screenshot --
>there are a lot more of them, and they wouldn't fit in the window even
>when maximized.  So if your screenshot doesn't include all of them, you
>probably should copy those as text and include them in your reply --
>like this:
>
>-DSTOP.KEY=solrrocks
>-DSTOP.PORT=7982
>-Dcom.sun.management.jmxremote
>-Dcom.sun.management.jmxremote.authenticate=false
>-Dcom.sun.management.jmxremote.local.only=false
>-Dcom.sun.management.jmxremote.port=18982
>-Dcom.sun.management.jmxremote.rmi.port=18982
>-Dcom.sun.management.jmxremote.ssl=false
>-Djetty.home=/opt/solr5/server
>-Djetty.port=8982
>-Dlog4j.configuration=file:/index/solr5/log4j.properties
>-Dsolr.install.dir=/opt/solr5
>-Dsolr.solr.home=/index/solr5/data
>-Duser.timezone=UTC
>-XX:+CMSParallelRemarkEnabled
>-XX:+CMSScavengeBeforeRemark
>-XX:+ParallelRefProcEnabled
>-XX:+PrintGCApplicationStoppedTime
>-XX:+PrintGCDateStamps
>-XX:+PrintGCDetails
>-XX:+PrintGCTimeStamps
>-XX:+PrintHeapAtGC
>-XX:+PrintTenuringDistribution
>-XX:+UseCMSInitiatingOccupancyOnly
>-XX:+UseConcMarkSweepGC
>-XX:+UseParNewGC
>-XX:CMSInitiatingOccupancyFraction=70
>-XX:CMSMaxAbortablePrecleanTime=2000
>-XX:MaxTenuringThreshold=8
>-XX:NewRatio=3
>-XX:OnOutOfMemoryError=/opt/solr5/bin/oom_solr.sh 8982 /index/solr5/logs
>-XX:PretenureSizeThreshold=64m
>-XX:SurvivorRatio=4
>-XX:TargetSurvivorRatio=90
>-Xloggc:/index/solr5/logs/solr_gc.log
>-Xms22g
>-Xmx22g
>-verbose:gc
>
>How are you starting Solr?  With Solr 4.x, there are limitless numbers
>of ways to install and start Solr, because it is released as a webapp
>.war file.  When 5.0 was released, that was reduced to only a few
>supported options.
>
>Thanks,
>Shawn
>


Re: How to search in solr for words like %rek Dr%

2016-05-10 Thread Thrinadh Kuppili
Thank you. Yes, I am aware that surrounding with quotes will result in a match
for the space, but I am trying to match words based on input which can't be
controlled. I need to search Solr for %rek Dr% and return all results which
contain "rek Dr", without quotes.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-search-in-solr-for-words-like-rek-Dr-tp4275854p4276027.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Transforming SolrDocument to SolrInputDocument in Solr 6.0

2016-05-10 Thread Alexandre Rafalovitch
Not sure if that's useful, but the samples that ship with Solr show how to
transform Solr XML output into Solr Update XML format using XSLT
post-processing.

Regards,
   Alex.


Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/

On 11 May 2016 at 01:36, Stephan Schubert  wrote:

> In Solr 6.0 the method ClientUtils.toSolrInputDocument() was removed
> (deprecated since 5.5.1, see
> https://issues.apache.org/jira/browse/SOLR-8339). What is the best way
> now to transform a SolrDocument into a SolrInputDocument?
>
> Mit freundlichen Grüßen / Best regards
>
> Stephan Schubert
> Senior Web Application Engineer  |   IT Engineering Information Oriented
> Applications
>
>
>
> SICK AG  |  Erwin-Sick-Str. 1  |  79183 Waldkirch  |  Germany
> Phone +49 7681 202-3751  |  stephan.schub...@sick.de  |
> http://www.sick.de
> 
> __
>
> SICK AG  |   Sitz: Waldkirch i. Br.  |   Handelsregister: Freiburg i. Br.
> HRB 280355
> Vorstand: Dr. Robert Bauer (Vorsitzender)  |  Reinhard Bösl  |  Dr. Mats
> Gökstorp  |  Dr. Martin Krämer  |  Markus Vatter
> Aufsichtsrat: Gisela Sick (Ehrenvorsitzende)  |  Klaus M. Bukenberger
> (Vorsitzender)


How do we generate SHA256 password for Authentication

2016-05-10 Thread Shamik Bandopadhyay
Hi,

  I'm trying to set up Authentication and Role-based authorization in Solr
5.5. Besides the "solr" user from the example, I've created another user
"dev". I've used the following website to generate a sha256-encoded password:

http://www.lorem-ipsum.co.uk/hasher.php

I've used "password" as the password.

Here's my security.json

{
  "authentication": {
    "blockUnknown": false,
    "class": "solr.BasicAuthPlugin",
    "credentials": {
      "solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c=",
      "dev": "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8"
    }
  },
  "authorization": {
    "class": "solr.RuleBasedAuthorizationPlugin",
    "permissions": [
  {
"name": "security-edit",
"role": "admin"
  },
  {
"name": "schema-edit",
"role": "admin"
  },
  {
"name": "config-edit",
"role": "admin"
  },
  {
"name": "collection-admin-edit",
"role": "admin"
  },
  {
"name": "all-admin",
"collection": null,
"path": "/*",
"role": "adminAllRole"
  },
  {
"name": "all-core-handlers",
"path": "/*",
"role": "adminAllHandler"
  },
  {
"name": "update",
"role": "updateRole"
  },
  {
"name": "read",
"role": "readRole"
  },
  {
"name": "browse",
"collection": "gettingstarted",
"path": "/browse",
"role": "browseRole"
  },
  {
"name": "select",
"collection": "gettingstarted",
"path": "/select/*",
"role": "selectRole"
  }
],
"user-role": {
  "solr": [
"admin",
"adminAllRole",
"adminAllHandler",
"updateRole"
  ],
  "dev": [
"readRole"
  ]
}
  }
}

Here's what I'm doing.
1. I started Solr in Cloud mode "solr start -e cloud -noprompt"
2. zkcli.bat -zkhost localhost:9983 -cmd putfile /security.json
security.json
3. tried http://localhost:8983/solr/gettingstarted/browse , provided
dev/password but I'm getting the following exception:

[c:gettingstarted s:shard2 r:core_node3 x:gettingstarted_shard2_replica2]
org.apache.solr.servlet.HttpSolrCall; USER_REQUIRED auth header Basic
c29scjpTb2xyUm9ja3M= context : userPrincipal: [[principal: solr]] type:
[UNKNOWN], collections: [gettingstarted,], Path: [/browse] path : /browse
params :

Looks like I'm using the wrong way of generating the password.
solr/SolrRocks works as expected.
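For what it's worth, the stored value is not a plain sha256(password) hex
digest; as far as I can tell, Solr stores base64(sha256(sha256(salt +
password))) followed by a space and base64(salt). A sketch under that
assumption:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Base64;

public class SolrCredentials {
  public static void main(String[] args) throws Exception {
    String password = "password";
    byte[] salt = new byte[32];
    new SecureRandom().nextBytes(salt);

    MessageDigest md = MessageDigest.getInstance("SHA-256");
    md.update(salt);                       // the hash is salted...
    byte[] h = md.digest(password.getBytes(StandardCharsets.UTF_8));
    md.reset();
    h = md.digest(h);                      // ...and applied twice

    // security.json expects "base64(hash) base64(salt)" as the value.
    System.out.println(Base64.getEncoder().encodeToString(h)
        + " " + Base64.getEncoder().encodeToString(salt));
  }
}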

Also, not sure what's wrong with the "readRole". It doesn't seem to work when
I try with user "solr".

Any pointers will be appreciated.

-Thanks,
Shamik


Re: How to search in solr for words like %rek Dr%

2016-05-10 Thread Walter Underwood
That is going to be a very slow search in Solr.

But if you want to match space separated words, that is very easy and fast in 
Solr. Surround the phrase in quotes: “N Derek”.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On May 10, 2016, at 3:53 PM, Thrinadh Kuppili  wrote:
> 
> Thanks Nick, will look into it.
> 
> My main goal is to be able to search like %xxx xxx%, similar to a database
> contains search.
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-to-search-in-solr-for-words-like-rek-Dr-tp4275854p4275970.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to search in solr for words like %rek Dr%

2016-05-10 Thread Thrinadh Kuppili
Thanks Nick, will look into it.

My main goal is to be able to search like %xxx xxx%, similar to a database
contains search.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-search-in-solr-for-words-like-rek-Dr-tp4275854p4275970.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Simulate doc linking via post filter cache check

2016-05-10 Thread tedsolr
Mikhail, that's an interesting idea. If a terms list could stand in for a
cache that may be helpful. What I don't fully see is how the search would
work. Building an explicit negative terms query with returned IDs doesn't
seem possible as that list would be in the millions. To drastically speed my
process up I need to stop updating the data docs and only update the marker
(linked) docs.

Starting with 0 terms indexed for field "doclist" the very first search is
easy:
- put all result IDs in the doclist
Second search must exclude results that are already represented in the
doclist field. How is that possible?

I should mention I do an explicit hard commit after running each saved
search, to prevent consecutive searches from overlapping. That is probably
costing me. I didn't know it was possible to do an explicit soft commit. How
do you do that with SolrJ (not by setting maxDocs=1 in the config I hope)?
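An explicit soft commit is available through SolrClient.commit's
three-argument form, so no maxDocs config change is needed; a sketch (the URL
is an assumption):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class SoftCommit {
  public static void main(String[] args) throws Exception {
    SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");
    // commit(waitFlush, waitSearcher, softCommit) - the last flag
    // requests a soft commit instead of a hard one.
    client.commit(true, true, true);
    client.close();
  }
}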




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Simulate-doc-linking-via-post-filter-cache-check-tp4275842p4275929.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Simulate doc linking via post filter cache check

2016-05-10 Thread Mikhail Khludnev
The problem description is really long, you know.
I'd attack the statement:

> Since it's not possible to do a RDBMS like search joining the 2
> doc types, I need to run the saved search: find docs where name=Johnson,
> then drop the docs that are not in a doclist.
>

And also, if you remove all markers or starting from empty collection, and
do softCommit after every add, you can use /terms (TermsComponent) as a
"cache of" inserted doclist_ids.

For me it seems more like transient cache for ETL process, this state makes
sense only for single load operation; and not a search engine concern,
really.

Also, you can think from the opposite side:
after you search for the first request: q=name:Johnson and add it result to
markers, the second request might be q=name:Jacobson -name:Johnson etc,
until you exceed maxBooleanClauses limit, that can be leveraged by another
meanings.

and also every request can append list of responded ids into the growing
list of negative terms query:
q=name:Jacobson -{terms f=ids v=$alreadyseen}&alreadyseen=2,4,6,8,...

or they might be joined from markers, if you can afford frequent softCommits.
There are plenty of approaches to keep your hair.
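A sketch of that growing negative terms query in SolrJ (field name and ids are
placeholders; the classic _query_ hook is used to nest the {!terms} parser
inside a lucene-syntax query):

import org.apache.solr.client.solrj.SolrQuery;

public class ExcludeSeen {
  public static void main(String[] args) {
    // Exclude everything already collected by earlier saved searches.
    SolrQuery q = new SolrQuery(
        "name:Jacobson -_query_:\"{!terms f=id v=$alreadyseen}\"");
    q.set("alreadyseen", "123_5677899,123_5677898");
    System.out.println(q);
  }
}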


On Tue, May 10, 2016 at 6:44 PM, tedsolr  wrote:

> I'm pulling my hair out on this one - and there's not much of that to begin
> with. The problem I have is that updating 10M denormalized docs in a
> collection takes about 5 hours. Soon there will be collections with 100M
> docs and a 50 hour update cycle will not be acceptable. The process
> involves
> cleaning (deleting) the marker fields, querying the collection with user
> defined saved searches, then updating the marker fields in every matched
> doc. If I can normalize based on the searches the processing time should go
> way down: delete marker docs, query the collection with user defined saved
> searches, then insert marker docs. The time savings comes from 1) deleting
> and inserting docs is faster than updating docs, 2) the number of saved
> searches is at least 1000X less than the number of docs.
>
> A doc may have a couple hundred fields, but looks sorta like this:
> {"id":123_5677899","searchid":"34","name":"Johnson", ...}
>
> To normalize I would remove the searchid into a new doc:
> {"id":"S234","searchid":"34","doclist":["123_5677899","123_5677898",...]}
>
> The "link" is established by the doclist field which is multivalued and
> contains the ids from the real docs. All this is doable, the problem is
> that
> when users create saved searches they must only match docs that have not
> already been matched by another search. That's why there's only one doc
> "type" now - every matched doc has a marker (searchid) which makes the Solr
> search work. Since it's not possible to do a RDBMS like search joining the
> 2
> doc types, I need to run the saved search: find docs where name=Johnson,
> then drop the docs that are not in a doclist.
>
> So, maybe if I manage a custom cache of matched doc ids, I can check each
> returned id against the cache and drop the docs that are not in it. I think
> this could be done in a post filter. There will be a big memory hit to
> maintain this cache, but does this seem like a performant solution to my
> problem?
>
> Thanks!
> v5.2.1
> All collections are one shard with replication factor 2
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Simulate-doc-linking-via-post-filter-cache-check-tp4275842.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Facet ignoring repeated word

2016-05-10 Thread Toke Eskildsen
G, Rajesh  wrote:
> Thanks Toke. The issue I have is I cannot look for a specific word, e.g. ddr
> in termfreq('name', 'ddr'). I have to find the count of all words
> and their sum

Is that really the case? As your field is a comment field, your word cloud 
could easily contain tens or hundreds of thousands of words. That is pretty 
hard to display. Normally a word cloud consists of a small number of words, 
just as seen in the example you link to. The point of using facet + stats is 
that facets give you a rough list and stats gives you the real count.

If a usable word cloud consists of 50 words, you could use something like 
facet.limit=200 and feed those to your stats request, then only use the top 50 
from there. I know that it does not guarantee that the words are the correct 
ones, but you can experiment with the facet.limit until you get a proper 
speed/accuracy trade-off.
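Step one of that trade-off, sketched with SolrJ (the field name is a
placeholder); the top terms returned here would then be re-checked with
termfreq():

import org.apache.solr.client.solrj.SolrQuery;

public class WordCloudFacet {
  public static void main(String[] args) {
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);                 // only the facet counts are needed
    q.setFacet(true);
    q.addFacetField("comment");   // the word-cloud source field
    q.setFacetLimit(200);         // rough candidate list
    q.setFacetMinCount(1);
    System.out.println(q);
  }
}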

- Toke Eskildsen


Re: Re: Transforming SolrDocument to SolrInputDocument in Solr 6.0

2016-05-10 Thread Erick Erickson
NP, Having to dive into the patch is kind of arcane...

On Tue, May 10, 2016 at 8:54 AM, Stephan Schubert 
wrote:

> Ouch... thanks a lot ;)
>
>
> Mit freundlichen Grüßen / Best regards
>
> Stephan Schubert
> Senior Web Application Engineer  |   IT Engineering Information Oriented
> Applications
>
>
>
> SICK AG  |  Erwin-Sick-Str. 1  |  79183 Waldkirch  |  Germany
> Phone +49 7681 202-3751  |  stephan.schub...@sick.de  |
> http://www.sick.de
> 
> __
>
> SICK AG  |   Sitz: Waldkirch i. Br.  |   Handelsregister: Freiburg i. Br.
> HRB 280355
> Vorstand: Dr. Robert Bauer (Vorsitzender)  |  Reinhard Bösl  |  Dr. Mats
> Gökstorp  |  Dr. Martin Krämer  |  Markus Vatter
> Aufsichtsrat: Gisela Sick (Ehrenvorsitzende)  |  Klaus M. Bukenberger
> (Vorsitzender)


Re: How to search in solr for words like %rek Dr%

2016-05-10 Thread Nick D
Don't really get what 'Q= {!dismax qf=address} "rek Dr*" - It is not allowed
since prefix in quotes is not allowed' means; why can't you use exact phrase
matching? Do you have some limitation on quoting? As you are specifically
looking for an exact phrase, I don't see why you wouldn't want exact matching.


Anyways

You can look into using another type of tokenizer; my guess is you are
probably using the standard tokenizer or possibly the whitespace tokenizer.
You may want to try a different one and see what results you get. Also, you
probably won't need to use the wildcards if you set up your gram sizes the way
you want.

The shingle factory can do stuff like (now my memory is a bit fuzzy on this
but I play with it in the admin page).

This is a sentence
shingle = 4
this_is_a_sentence

Combine that with your ngram factory and you can do something like
(minGramSize=4, maxGramSize=50):
this
this_i
this_is

this_is_a_sentence

his_i
his_is

his_is_a_sentence

etc.


Then apply the shingle factory at query time to turn something like

his is -> his_is

and you will get that phrase back.

My personal favorite is just using edgengram, with a field set up something
like the following, but the concept is the same with regular old ngram:

2001 N Drive Derek Fullerton

token      start  end  position
2          0      1    1
20         0      2    1
200        0      3    1
2001       0      4    1
n          5      6    2
d          7      8    3
dr         7      9    3
dri        7      10   3
driv       7      11   3
drive      7      12   3
d          13     14   4
de         13     15   4
der        13     16   4
dere       13     17   4
derek      13     18   4
f          19     20   5
fu         19     21   5
ful        19     22   5
full       19     23   5
fulle      19     24   5
fuller     19     25   5
fullert    19     26   5
fullerto   19     27   5
fullerton  19     28   5

Works great for a quick type-ahead field type.

Oh, and by the way, your ngram size is too small for _rek_ to be split out
from _derek_.


Setting up a few different field types and playing with the analyzer in the
admin page can give you a good idea of what both index- and query-time results
will be; with your tiny data set, it is the best way I can think of to see
instant results with your new field types.

Nick

On Tue, May 10, 2016 at 10:01 AM, Thrinadh Kuppili 
wrote:

> I have tried with an ngram filter (maxGramSize="12") and searched using the
> Extended Dismax
>
> Q= {!dismax qf=address} rek Dr* - It did not work as expected, since I am
> getting all the records which have rek, Dr.
>
> Q= {!dismax qf=address} "rek Dr*" - It is not allowed, since a prefix in
> quotes is not allowed.
>
> Q= {!complexphrase inOrder=true}address:"rek dr*" - It did not work, since
> it is searching for words that start with rek.
>
> I am not aware of the shingle factory as of now; I will try it and find out
> how I can use it.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-search-in-solr-for-words-like-rek-Dr-tp4275854p4275859.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


RE: Solr 5.4.1 Mergeindexes duplicate rows

2016-05-10 Thread Kalpana
As per Shawn's advice I deleted the index data using 
http://localhost:8983/solr/Sitecore_SharePoint/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E&commit=true

and then stopped and started Solr and the duplicates were gone.

Will keep a watch!

Thanks much!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-5-4-1-Mergeindexes-duplicate-rows-tp4275153p4275869.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Re-indexing in SolrCloud while keeping the collection online -- Best practice?

2016-05-10 Thread Horváth Péter Gergely
Hi Erick,

Most of the time we have to do a full re-index: I do love your second idea,
I will take a look at the details of that. Thank you! :)

Cheers,
Peter
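Erick's second idea, quoted below, boils down to the Collections API
CREATEALIAS call; a SolrJ sketch using the 5.x/6.0-style request object (the
zkHost string is an assumption; the alias and collection names are the ones
from his example):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class SwapAlias {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
    // Repoint the "hot" alias from col1 to the freshly built col2;
    // queries against "hot" switch over atomically.
    CollectionAdminRequest.CreateAlias alias = new CollectionAdminRequest.CreateAlias();
    alias.setAliasName("hot");
    alias.setAliasedCollections("col2");
    alias.process(client);
    client.close();
  }
}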

2016-05-10 17:10 GMT+02:00 Erick Erickson :

> Peter:
>
> Yeah, that would work, but there are a couple of alternatives:
> 1> If there's any way to know what the subset of docs that's
>  changed, just re-index _them_. The problem here is
>  picking up deletes. In the RDBMS case this is often done
>  by creating a trigger for deletes and then the last step
>  in your update is to remove the docs since the last time
>  you indexed using the deleted_docs table (or whatever).
>  This falls down if a> you require an instantaneous switch
>  from _all_ the old data to the new or b> you can't get a
>  list of deleted docs.
>
> 2> Use collection aliasing. The pattern is this: you have your
>  "Hot" collection (col1) serving queries that is pointed to
>  by alias "hot". You create a new collection (col2) and index
>  to it in the background. When done, use CREATEALIAS
>  to point "hot" to "col2". Now you can delete col1. There are
>  no restrictions on where these collections live, so this
>  allows you to move your collections around as you want. Plus
>  this keeps a better separation of old and new data...
>
> Best,
> Erick
>
> On Tue, May 10, 2016 at 4:32 AM, Horváth Péter Gergely
>  wrote:
> > Hi Everyone,
> >
> > I am wondering if there is any best practice regarding re-indexing
> > documents in SolrCloud 6.0.0 without making the data (or the underlying
> > collection) temporarily unavailable. Wiping all documents in a collection
> > and performing a full re-indexing is not a viable alternative for us.
> >
> > Say we had a massive Solr Cloud cluster with a number of separate nodes
> > that are used to host *multiple hundreds* of collections, with document
> > counts ranging from a couple of thousands to multiple (say up to 20)
> > millions of documents, each with 200-300 fields and a background batch
> > loader job that fetches data from a variety of source systems.
> >
> > We have to retain the cluster and ALL collections online all the time
> (365
> > x 24): We cannot allow queries to be blocked while data in a collection
> is
> > being updated and we cannot load everything in a single-shot jumbo commit
> > (the replication could overload the cluster).
> >
> > One solution I could imagine is storing an additional field "load
> > time-stamp" in all documents and the client (interactive query)
> application
> > extending all queries with an additional restriction, which requires
> > documents "load time-stamp" to be the latest known completed "load
> > time-stamp".
> >
> > This concept would work according to the following:
> > 1.) The batch job would simply start loading new documents, with the new
> > "load time-stamp". Existing documents would not be touched.
> > 2.) The client (interactive query) application would still use the old
> data
> > from the previous load (since all queries are restricted with the old
> "load
> > time-stamp")
> > 3.) The batch job would store the new "load time-stamp" as the one to be
> > used (e.g. in a separate collection etc.) -- after this, all queries
> would
> > return the most up-to-data documents
> > 4.) The batch job would purge all documents from the collection, where
> > the "load time-stamp" is not the same as the last one.
> >
> > This approach seems to be implementable, however, I definitely want to
> > avoid reinventing the wheel myself and wondering if there is any better
> > solution or built-in Solr Cloud feature to achieve the same or something
> > similar.
> >
> > Thanks,
> > Peter
>


Re: How to search in solr for words like %rek Dr%

2016-05-10 Thread Thrinadh Kuppili
I have tried with an ngram filter and searched using the Extended Dismax.

Q= {!dismax qf=address} rek Dr* - It did not work as expected, since I am
getting all the records which have rek, Dr.

Q= {!dismax qf=address} "rek Dr*" - It is not allowed, since a prefix in
quotes is not allowed.

Q= {!complexphrase inOrder=true}address:"rek dr*" - It did not work, since it
is searching for words that start with rek.

I am not aware of the shingle factory as of now; I will try it and find out
how I can use it.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-search-in-solr-for-words-like-rek-Dr-tp4275854p4275859.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: auto purge for embedded zookeeper

2016-05-10 Thread tedsolr
That makes perfect sense Shawn. I will clean up the old log data the old
fashioned way.

thanks, Ted



--
View this message in context: 
http://lucene.472066.n3.nabble.com/auto-purge-for-embedded-zookeeper-tp4275561p4275857.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Solr 5.4.1 Mergeindexes duplicate rows

2016-05-10 Thread Kalpana
Thanks for your reply!

Some questions:

Is Solr in cloud mode or running standalone?

Standalone

If you look at the core overview in the admin UI for these three cores,
can you tell me what Num Docs, Max Doc, and the index size is for all
three indexes?

SharePoint_All
Num Docs: 6211
Max Doc= 6211
Index: 29.82 MB

Sitecore_web_index
Num Docs: 5268
Max Doc= 5268
Index: 3.47 MB

Sitecore_SharePoint
Num Docs: 22958
Max Doc= 22958
Index: 78.84 MB



Are the schemas in these three indexes all using the same field name for
uniqueKey?

Yes
_uniqueid


Are you sure that you have only run the merge once?  Alternately, before
each merge attempt, you could entirely delete
$SOLR_HOME/Sitecore_Sharepoint/data and reload the core or restart Solr.

I am manually typing the URL and performing the merge. I stopped Solr, deleted 
the index files in the file system, then started Solr and ran the merge URL, 
and still saw duplicates. I can try what you have recommended.

Thanks so much!




From: Shawn Heisey-2 [via Lucene] 
[mailto:ml-node+s472066n4275813...@n3.nabble.com]
Sent: Tuesday, May 10, 2016 10:38 AM
To: Kalpana Sivanandan 
Subject: Re: Solr 5.4.1 Mergeindexes duplicate rows

On 5/9/2016 7:55 AM, Kalpana wrote:

> Can anyone help me with a merge. Currently I have the two cores already
> pulling data from SQL Table based on the query I set up.
>
> Solr is running
>
> I also have a third core set up with schema similar to the first two. and
> then I wrote this in the url and hit enter
> http://localhost:8983/solr/admin/cores?action=mergeindexes&core=Sitecore_SharePoint&srcCore=sitecore_web_index&srcCore=SharePoint_All
>
> I stop and start Solr and I see data with duplicates.
>
> Am I doing this right?

Some questions:

Is Solr in cloud mode or running standalone?

If you look at the core overview in the admin UI for these three cores,
can you tell me what Num Docs, Max Doc, and the index size is for all
three indexes?

Are the schemas in these three indexes all using the same field name for
uniqueKey?

Are you sure that you have only run the merge once?  Alternately, before
each merge attempt, you could entirely delete
$SOLR_HOME/Sitecore_Sharepoint/data and reload the core or restart Solr.

Thanks,
Shawn







--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-5-4-1-Mergeindexes-duplicate-rows-tp4275153p4275820.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to search in solr for words like %rek Dr%

2016-05-10 Thread Nick D
You can use a combination of ngram or edgengram fields, and possibly the
shingle factory if you want to combine words. You also might want to have it
as exact text with no query slop if the two words, even as partial text, need
to be right next to each other. Edge is great for left-to-right; ngram is
great just to split up by a size. There are a number of tokenizers you can try
out.

Nick
On May 10, 2016 9:22 AM, "Thrinadh Kuppili"  wrote:

> I am trying to search a field named Address which has a space in it.
> Example :
> Address has the below values in it.
> 1. 2000 North Derek Dr Fullerton
> 2. 2011 N Derek Drive Fullerton
> 3. 2108 N Derek Drive Fullerton
> 4. 2100 N Derek Drive Fullerton
> 5. 2001 N Drive Derek Fullerton
>
> Search Query:- Derek Drive or rek Dr
> Expectation is it should return all  2,3,4 and it should not return 1 & 5 .
>
> Finally i am trying to find a word which can search similar to database
> search of %N Derek%
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-search-in-solr-for-words-like-rek-Dr-tp4275854.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


How to search in solr for words like %rek Dr%

2016-05-10 Thread Thrinadh Kuppili
I am trying to search a field named Address which has a space in it.
Example :
Address has the below values in it.
1. 2000 North Derek Dr Fullerton
2. 2011 N Derek Drive Fullerton 
3. 2108 N Derek Drive Fullerton
4. 2100 N Derek Drive Fullerton
5. 2001 N Drive Derek Fullerton

Search Query:- Derek Drive or rek Dr 
Expectation is it should return all  2,3,4 and it should not return 1 & 5 .

Finally i am trying to find a word which can search similar to database
search of %N Derek% 

 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-search-in-solr-for-words-like-rek-Dr-tp4275854.html
Sent from the Solr - User mailing list archive at Nabble.com.


Antwort: Re: Transforming SolrDocument to SolrInputDocument in Solr 6.0

2016-05-10 Thread Stephan Schubert
Ouch... thanks a lot ;)

Mit freundlichen Grüßen / Best regards

Stephan Schubert
Senior Web Application Engineer | IT Engineering
Information Oriented Applications

SICK AG | Erwin-Sick-Str. 1 | 79183 Waldkirch | Germany
Phone  +49 7681 202-3751 | Fax  | mailto:stephan.schub...@sick.de | 
http://www.sick.de
 

SICK AG  |  Sitz: Waldkirch i. Br.  |  Handelsregister: Freiburg i. Br. HRB 
280355 
Vorstand: Dr. Robert Bauer (Vorsitzender)  |  Reinhard Bösl  |  Dr. Mats 
Gökstorp  |  Dr. Martin Krämer  |  Markus Vatter
Aufsichtsrat: Gisela Sick (Ehrenvorsitzende)  |  Klaus M. Bukenberger 
(Vorsitzender)


Re: Transforming SolrDocument to SolrInputDocument in Solr 6.0

2016-05-10 Thread Erick Erickson
Hmm, looking at the patch I see:

DocumentObjectBinder binder = new DocumentObjectBinder();
.
.
.

SolrInputDocument solrInputDoc = binder.toSolrInputDocument(in);

But I confess I didn't actually try it.
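If the binder route doesn't pan out, a manual field copy is only a few lines;
a sketch (skipping _version_ is a choice here, to avoid optimistic-concurrency
conflicts when re-adding the document):

import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class DocConverter {
  public static SolrInputDocument toInputDocument(SolrDocument in) {
    SolrInputDocument out = new SolrInputDocument();
    for (String name : in.getFieldNames()) {
      if ("_version_".equals(name)) continue; // let Solr assign a new version
      out.addField(name, in.getFieldValue(name));
    }
    return out;
  }
}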

On Tue, May 10, 2016 at 8:41 AM, Stephan Schubert
 wrote:
> In Solr 6.0 the method ClientUtils.toSolrInputDocument() was removed
> (deprecated since 5.5.1, see
> https://issues.apache.org/jira/browse/SOLR-8339). What is the best way now
> to transform a SolrDocument into a SolrInputDocument?
>
> Mit freundlichen Grüßen / Best regards
>
> Stephan Schubert
> Senior Web Application Engineer  |   IT Engineering Information Oriented
> Applications
>
>
>
> SICK AG  |  Erwin-Sick-Str. 1  |  79183 Waldkirch  |  Germany
> Phone +49 7681 202-3751  |  stephan.schub...@sick.de  |  http://www.sick.de
> __
>
> SICK AG  |   Sitz: Waldkirch i. Br.  |   Handelsregister: Freiburg i. Br.
> HRB 280355
> Vorstand: Dr. Robert Bauer (Vorsitzender)  |  Reinhard Bösl  |  Dr. Mats
> Gökstorp  |  Dr. Martin Krämer  |  Markus Vatter
> Aufsichtsrat: Gisela Sick (Ehrenvorsitzende)  |  Klaus M. Bukenberger
> (Vorsitzender)


RE: Solr edismax field boosting

2016-05-10 Thread Megha Bhandari
Hi Nick

We found the issue.

We had set the type of some of the fields to "string". After changing the
fields to "text_general", boosting started working.
You were right: Solr was not finding the search term in those fields, as
"string" only supports exact match and doesn't tokenise content.

Thanks

-Original Message-
From: Nick D [mailto:ndrake0...@gmail.com] 
Sent: Tuesday, May 10, 2016 9:05 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr edismax field boosting

Megha,

What are the field types for the fields you are trying to search through?
Grab a copy of the schema.xml and paste the relevant fields.

My guess is you have _text_ as a copy field for everything else and have it
stored=false, correct? I am not seeing that field in the output above. Also,
in your first post you show the /elevate requestHandler definition; is that
your default request handler, or did you paste in the incorrect handler?

The simple reason the boosting isn't working is that Solr isn't finding a
match in the query fields you are applying a boost to; it is only finding the
values in the _text_ field.

Also you probably should read up on BM25Similarity as this is the default
in the version of solr you are using.


Nick




On Tue, May 10, 2016 at 12:27 AM, Megha Bhandari 
wrote:

> Thanks Nick, got the response formatted. We are using Solr 5.5.
> Not able to understand why it is ignoring the boosts completely. What
> configuration is being missed? As you correctly pointed out it is only
> calculating based on the _text_ field.
>
> Query:
>
> http://10.203.101.42:8983/solr/uhc/select?defType=edismax&indent=on&mm=1&q=upendra&qf=h1
> ^9.0%20_text_^1.0&wt=ruby&debug=true
>
> Response with debug on:
> {
>   'responseHeader'=>{
> 'status'=>0,
> 'QTime'=>6,
> 'params'=>{
>   'mm'=>'1',
>   'q'=>'upendra',
>   'defType'=>'edismax',
>   'debug'=>'true',
>   'indent'=>'on',
>   'qf'=>'h1^9.0 _text_^1.0',
>   'wt'=>'ruby'}},
>   'response'=>{'numFound'=>6,'start'=>0,'maxScore'=>0.14641379,'docs'=>[
>   {
> 'h2'=>['Looks like your browser is a little out-of-date.'],
> 'h3'=>['Already a member?'],
> 'strtitle'=>['I m increasiing the the page title content Upendra
> Custon'],
> 'id'=>'http://localhost:4503/baseurl/upendra-custon.html',
> 'tstamp'=>'2016-05-10T05:50:22.316Z',
> 'metataghideininternalsearch'=>false,
> 'metatagtopresultthumbnailalt'=>',',
> 'segment'=>[20160510112017],
> 'digest'=>['fb988351afceb26a835fba68e2bcc33f'],
> 'boost'=>[1.4142135],
> 'lang'=>'en',
> 'metatagkeywords'=>[','],
> '_version_'=>1533919301006786560,
> 'host'=>'localhost',
> 'url'=>'http://localhost:4503/baseurl/upendra-custon.html',
> 'score'=>0.14641379},
>   {
> 'metatagdescription'=>['test'],
> 'h1'=>['Upendra'],
> 'h2'=>['Looks like your browser is a little out-of-date.'],
> 'h3'=>['Already a member?'],
> 'strtitle'=>['health care body content'],
> 'id'=>'
> http://localhost:4503/baseurl/upendra-custon/care-body-content.html',
> 'tstamp'=>'2016-05-10T05:50:22.269Z',
> 'metataghideininternalsearch'=>false,
> 'metatagtopresultthumbnailalt'=>',',
> 'segment'=>[20160510112017],
> 'digest'=>['dd4ef8879be2d4d3f28e24928e9b84c5'],
> 'boost'=>[1.4142135],
> 'lang'=>'en',
> 'metatagkeywords'=>[','],
> '_version_'=>1533919301071798272,
> 'host'=>'localhost',
> 'url'=>'
> http://localhost:4503/baseurl/upendra-custon/care-body-content.html',
> 'score'=>0.13738367},
>   {
> 'metatagdescription'=>['test'],
> 'h1'=>['health care keyword'],
> 'h2'=>['Looks like your browser is a little out-of-date.'],
> 'h3'=>['Already a member?'],
> 'strtitle'=>['health care keyword'],
> 'id'=>'
> http://localhost:4503/baseurl/upendra-custon/care-keyword.html',
> 'tstamp'=>'2016-05-10T05:50:22.300Z',
> 'metataghideininternalsearch'=>false,
> 'metatagtopresultthumbnailalt'=>',',
> 'segment'=>[20160510112017],
> 'digest'=>['4af11065d604bcec7aa4cbc1cf0fca59'],
> 'boost'=>[1.4142135],
> 'lang'=>'en',
> 'metatagkeywords'=>['upendra,upendra'],
> '_version_'=>1533919301088575488,
> 'host'=>'localhost',
> 'url'=>'
> http://localhost:4503/baseurl/upendra-custon/care-keyword.html',
> 'score'=>0.13738367},
>   {
> 'metatagdescription'=>['test'],
> 'h1'=>['Health care'],
> 'h2'=>['Looks like your browser is a little out-of-date.'],
> 'h3'=>['Already a member?'],
> 'strtitle'=>['This is the page Title Upendra, lets do the
> testing'],
> 'id'=>'http://localhost:4503/baseurl/upendra-custon/care.html',
> 'tstamp'=>'2016-05-10T05:50:22.518Z',
> 'metataghide

Simulate doc linking via post filter cache check

2016-05-10 Thread tedsolr
I'm pulling my hair out on this one - and there's not much of that to begin
with. The problem I have is that updating 10M denormalized docs in a
collection takes about 5 hours. Soon there will be collections with 100M
docs and a 50 hour update cycle will not be acceptable. The process involves
cleaning (deleting) the marker fields, querying the collection with user
defined saved searches, then updating the marker fields in every matched
doc. If I can normalize based on the searches the processing time should go
way down: delete marker docs, query the collection with user defined saved
searches, then insert marker docs. The time savings comes from 1) deleting
and inserting docs is faster than updating docs, 2) the number of saved
searches is at least 1000X less than the number of docs.

A doc may have a couple hundred fields, but looks sorta like this:
{"id":123_5677899","searchid":"34","name":"Johnson", ...}

To normalize I would remove the searchid into a new doc:
{"id":"S234","searchid":"34","doclist":["123_5677899","123_5677898",...]}

The "link" is established by the doclist field which is multivalued and
contains the ids from the real docs. All this is doable, the problem is that
when users create saved searches they must only match docs that have not
already been matched by another search. That's why there's only one doc
"type" now - every matched doc has a marker (searchid) which makes the Solr
search work. Since it's not possible to do a RDBMS like search joining the 2
doc types, I need to run the saved search: find docs where name=Johnson,
then drop the docs that are not in a doclist.

So, maybe if I manage a custom cache of matched doc ids, I can check each
returned id against the cache and drop the docs that are not in it. I think
this could be done in a post filter. There will be a big memory hit to
maintain this cache, but does this seem like a performant solution to my
problem?

Thanks!
v5.2.1
All collections are one shard with replication factor 2



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Simulate-doc-linking-via-post-filter-cache-check-tp4275842.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 5.x bug with Service installation script?

2016-05-10 Thread A Laxmi
Hi Shawn -

You brought up a good point. This might be a possible reason. I'll test it
out. Thanks! My index (4.5g) usually takes about 15-20 secs to load.

One other observation - even though it says write.lock file in a specific
data directory path, when I look up the directory, I don't see any
write.lock file in there. It is really confusing.

AL

On Tue, May 10, 2016 at 10:37 AM, Shawn Heisey  wrote:

> On 5/9/2016 11:30 AM, A Laxmi wrote:
> > yes, I always shutdown both source and destination Solr before copying
> the
> > index over from one to another. Somehow the write.lock only happens when
> > Solr restarts from the service script. It loads just fine when started
> manually.
>
> One possible problem:
>
> The bin/solr script (which is used by the init script) only waits for 5
> seconds for Solr to stop gracefully before killingit forcibly.  This can
> leave write.lock files behind.
>
> I thought it had increased to 30 seconds in a recent version and that it
> was possibly even configurable in solr.in.sh, but I just checked the
> 6.0.0 download.  It's still only 5 seconds, and the value is hard-coded
> in the script.  This is only enough time if you have a very small number
> of very small indexes.
>
> Thanks,
> Shawn
>
>


Transforming SolrDocument to SolrInputDocument in Solr 6.0

2016-05-10 Thread Stephan Schubert
In Solr 6.0 the method ClientUtils.toSolrInputDocument() was removed 
(deprecated since 5.5.1, see 
https://issues.apache.org/jira/browse/SOLR-8339). What is the best way now 
to transform a SolrDocument into a SolrInputDocument?
Mit freundlichen Grüßen / Best regards

Stephan Schubert
Senior Web Application Engineer | IT Engineering
Information Oriented Applications

SICK AG | Erwin-Sick-Str. 1 | 79183 Waldkirch | Germany
Phone  +49 7681 202-3751 | Fax  | mailto:stephan.schub...@sick.de | 
http://www.sick.de
 

SICK AG  |  Sitz: Waldkirch i. Br.  |  Handelsregister: Freiburg i. Br. HRB 
280355 
Vorstand: Dr. Robert Bauer (Vorsitzender)  |  Reinhard Bösl  |  Dr. Mats 
Gökstorp  |  Dr. Martin Krämer  |  Markus Vatter
Aufsichtsrat: Gisela Sick (Ehrenvorsitzende)  |  Klaus M. Bukenberger 
(Vorsitzender)


Re: Solr 5.x bug with Service installation script?

2016-05-10 Thread A Laxmi
Hi Erick - I used "sudo service solr stop" to shut it down.

On Tue, May 10, 2016 at 12:26 AM, Erick Erickson 
wrote:

> How do you shut down your Solrs? Any kind of un-graceful
> stopping (kill -9 is a favorite) may leave the lock file around.
>
> It can't be coming from nowhere, so my guess is that
> it's present in the source or destination before
> you do your copy...
>
> Best,
> Erick
>
> On Mon, May 9, 2016 at 10:30 AM, A Laxmi  wrote:
> > yes, I always shutdown both source and destination Solr before copying
> the
> > index over from one to another. Somehow the write.lock only happens when
> > Solr restarts from the service script. It loads just fine when started
> manually.
> >
> > On Mon, May 9, 2016 at 1:20 PM, Abdel Belkasri 
> wrote:
> >
> >> Did you copy the core while solr is running? If yes, first shut down the source
> >> and destination solr, copy the index to the other solr, then restart the solr
> nodes.
> >> Lock files get written to the core while solr is running and doing
> indexing
> >> or searching, etc.
> >>
> >> On Mon, May 9, 2016 at 12:38 PM, A Laxmi 
> wrote:
> >>
> >> > Hi,
> >> >
> >> > I have installed Solr 5.3.1 using the Service Installation Script. I
> was
> >> > able to successfully start and stop Solr using service solr start/stop
> >> > commands and Solr loads up just fine.
> >> >
> >> > However, when I stop Solr service and copy an index of a core from one
> >> > server to another with same exact version of Solr and its
> corresponding
> >> > conf and restart the service, it complains about write.lock file when
> >> none
> >> > exists under the path that it specifies in the log.
> >> >
> >> > To validate whether the issue is with the data that is being copied or
> >> the
> >> > service script itself, I copied the collection directory with new
> index
> >> > into example-DIH directory and restarted Solr manually bin/solr start
> -e
> >> > dih -m 2g, it worked without any error. So, atleast this validates
> that
> >> > collection data is just fine and service script is creating a lock
> >> > everytime a new index is copied from another server though it has the
> >> same
> >> > exact Solr version.
> >> >
> >> > Did anyone experience the same? Any thoughts if this is a bug?
> >> >
> >> > Thanks!
> >> > AL
> >> >
> >>
> >>
> >>
> >> --
> >> Abdel K. Belkasri, PhD
> >>
>



Re: Unable to achieve boosting into solr 5.5

2016-05-10 Thread Erick Erickson
Please review: http://wiki.apache.org/solr/UsingMailingLists

You haven't shown us what you _do_ get, what you expect, why
you think there's an error. Adding &debug=query will show you
the parsed query and may give you a clue.

Best,
Erick

On Mon, May 9, 2016 at 11:02 PM, Upendra Kumar Baliyan
 wrote:
> Hi,
>
> We are using solr 5.5, but could not achieve field boosting. We are not 
> getting the result as per the below configuration.
>
>
>
> Below is the configuration in solrconfig.xml for the request handler:
>
> <lst name="defaults">
>   <str name="defType">edismax</str>
>   <str name="qf">metatag.keywords^10.0 metatag.description^9.0 h1^7.0 h2^6.0 h3^5.0
>        h4^4.0 _text_^1.0 id^0.5</str>
>   <str name="mm">100%</str>
>   <str name="q.alt">*:*</str>
>   <int name="rows">10</int>
>   <str name="fl">*,score</str>
>   <str name="echoParams">explicit</str>
> </lst>
>
>
>
> Any help ?
>
>
>
> Regards
>
> Upendra Kumar Baliyan
>


Re: Solr edismax field boosting

2016-05-10 Thread Nick D
Megha,

What are the field types for the fields you are trying to search through?
Grab a copy of the schema.xml and paste the relevant fields.

My guess is you have _text_ as a copy field for everything else and have it
stored=false, correct? I am not seeing that field in the output above. Also,
in your first post you show the /elevate requestHandler definition; is that
your default request handler, or did you paste in the incorrect handler?

The simple reason the boosting isn't working is that Solr isn't finding a
match in the query fields you are applying a boost to; it is only finding the
values in the _text_ field.

Also you probably should read up on BM25Similarity as this is the default
in the version of solr you are using.


Nick




On Tue, May 10, 2016 at 12:27 AM, Megha Bhandari 
wrote:

> Thanks Nick, got the response formatted. We are using Solr 5.5.
> Not able to understand why it is ignoring the boosts completely. What
> configuration is being missed? As you correctly pointed out it is only
> calculating based on the _text_ field.
>
> Query:
>
> http://10.203.101.42:8983/solr/uhc/select?defType=edismax&indent=on&mm=1&q=upendra&qf=h1
> ^9.0%20_text_^1.0&wt=ruby&debug=true
>
> Response with debug on:
> {
>   'responseHeader'=>{
> 'status'=>0,
> 'QTime'=>6,
> 'params'=>{
>   'mm'=>'1',
>   'q'=>'upendra',
>   'defType'=>'edismax',
>   'debug'=>'true',
>   'indent'=>'on',
>   'qf'=>'h1^9.0 _text_^1.0',
>   'wt'=>'ruby'}},
>   'response'=>{'numFound'=>6,'start'=>0,'maxScore'=>0.14641379,'docs'=>[
>   {
> 'h2'=>['Looks like your browser is a little out-of-date.'],
> 'h3'=>['Already a member?'],
> 'strtitle'=>['I m increasiing the the page title content Upendra
> Custon'],
> 'id'=>'http://localhost:4503/baseurl/upendra-custon.html',
> 'tstamp'=>'2016-05-10T05:50:22.316Z',
> 'metataghideininternalsearch'=>false,
> 'metatagtopresultthumbnailalt'=>',',
> 'segment'=>[20160510112017],
> 'digest'=>['fb988351afceb26a835fba68e2bcc33f'],
> 'boost'=>[1.4142135],
> 'lang'=>'en',
> 'metatagkeywords'=>[','],
> '_version_'=>1533919301006786560,
> 'host'=>'localhost',
> 'url'=>'http://localhost:4503/baseurl/upendra-custon.html',
> 'score'=>0.14641379},
>   {
> 'metatagdescription'=>['test'],
> 'h1'=>['Upendra'],
> 'h2'=>['Looks like your browser is a little out-of-date.'],
> 'h3'=>['Already a member?'],
> 'strtitle'=>['health care body content'],
> 'id'=>'
> http://localhost:4503/baseurl/upendra-custon/care-body-content.html',
> 'tstamp'=>'2016-05-10T05:50:22.269Z',
> 'metataghideininternalsearch'=>false,
> 'metatagtopresultthumbnailalt'=>',',
> 'segment'=>[20160510112017],
> 'digest'=>['dd4ef8879be2d4d3f28e24928e9b84c5'],
> 'boost'=>[1.4142135],
> 'lang'=>'en',
> 'metatagkeywords'=>[','],
> '_version_'=>1533919301071798272,
> 'host'=>'localhost',
> 'url'=>'
> http://localhost:4503/baseurl/upendra-custon/care-body-content.html',
> 'score'=>0.13738367},
>   {
> 'metatagdescription'=>['test'],
> 'h1'=>['health care keyword'],
> 'h2'=>['Looks like your browser is a little out-of-date.'],
> 'h3'=>['Already a member?'],
> 'strtitle'=>['health care keyword'],
> 'id'=>'
> http://localhost:4503/baseurl/upendra-custon/care-keyword.html',
> 'tstamp'=>'2016-05-10T05:50:22.300Z',
> 'metataghideininternalsearch'=>false,
> 'metatagtopresultthumbnailalt'=>',',
> 'segment'=>[20160510112017],
> 'digest'=>['4af11065d604bcec7aa4cbc1cf0fca59'],
> 'boost'=>[1.4142135],
> 'lang'=>'en',
> 'metatagkeywords'=>['upendra,upendra'],
> '_version_'=>1533919301088575488,
> 'host'=>'localhost',
> 'url'=>'
> http://localhost:4503/baseurl/upendra-custon/care-keyword.html',
> 'score'=>0.13738367},
>   {
> 'metatagdescription'=>['test'],
> 'h1'=>['Health care'],
> 'h2'=>['Looks like your browser is a little out-of-date.'],
> 'h3'=>['Already a member?'],
> 'strtitle'=>['This is the page Title Upendra, lets do the
> testing'],
> 'id'=>'http://localhost:4503/baseurl/upendra-custon/care.html',
> 'tstamp'=>'2016-05-10T05:50:22.518Z',
> 'metataghideininternalsearch'=>false,
> 'metatagtopresultthumbnailalt'=>',,,',
> 'segment'=>[20160510112017],
> 'digest'=>['711a059f2a05a6c03e59d490cd7008ff'],
> 'boost'=>[1.4142135],
> 'lang'=>'en',
> 'metatagkeywords'=>[',,,'],
> '_version_'=>1533919301088575489,
> 'host'=>'localhost',
> 'url'=>'http://localhost:4503/baseurl/upendra-custon/care.html',
> 'score'=>0.13286635},
>   {
> 'metatagdescription

Re: Filter queries & caching

2016-05-10 Thread Erick Erickson
No. Please re-read and use the admin plugins/stats page to examine for yourself.

1)  fq=filter(fromfield:[* TO NOW/DAY+1DAY]&& tofield:[NOW/DAY-7DAY TO *])
&& fq=type:abc

&& is totally unnecessary when using fq clauses; there is already an
implicit AND.
I'm not even sure what the above does; I don't know off the top of my head
how that would be parsed.

fq=filter() is unnecessary and in fact (apparently) uses extra
filterCache entries
to no purpose.

I'm guessing you're thinking of something like this

q=*:*&fq=(fromfield:[* TO NOW/DAY+1DAY] && tofield:[NOW/DAY-7DAY TO
*])&fq=type:abc

This would use two filterCache entries,

or maybe this: (notice this is "q=" not "fq=")

q=filter(fromfield:[* TO NOW/DAY+1DAY] && tofield:[NOW/DAY-7DAY TO *])
&& filter(type:abc)

would use two filterCache entries as well. Same thing essentially.

2) fq= fromfield:[* TO NOW/DAY+1DAY]&& fq=tofield:[NOW/DAY-7DAY TO *]) &&
fq=type:abc

This is syntactically incorrect; I assume you meant (I added a left paren,
and again, the && is unnecessary):

q=*:*&fq=(fromfield:[* TO NOW/DAY+1DAY] && fq=tofield:[NOW/DAY-7DAY TO *])&
fq=type:abc

As above the rewritten form would use two filterCache entries.

Best,
Erick

On Mon, May 9, 2016 at 11:03 PM, Jay Potharaju  wrote:
> Thanks for the explanation Eric.
>
> So that I understand this clearly
>
>
> 1)  fq=filter(fromfield:[* TO NOW/DAY+1DAY]&& tofield:[NOW/DAY-7DAY TO *])
> && fq=type:abc
> 2) fq= fromfield:[* TO NOW/DAY+1DAY]&& fq=tofield:[NOW/DAY-7DAY TO *]) &&
> fq=type:abc
>
> Using 1) would benefit from having 2 separate filter caches instead of 3
> slots in the cache. But in general both would be using the filter cache.
> And secondly, it would be more useful to use filter() in a scenario like
> the one above (mentioned in your email).
> Thanks
>
>
>
>
> On Mon, May 9, 2016 at 9:43 PM, Erick Erickson 
> wrote:
>
>> You're confusing a query clause with fq when thinking about filter() I
>> think.
>>
>> Essentially they don't need to be used together, i.e.
>>
>> q=myclause AND filter(field:value)
>>
>> is identical to
>>
>> q=myclause&fq=field:value
>>
>> both in docs returned and filterCache usage.
>>
>> q=myclause&filter(fq=field:value)
>>
>> actually uses two filterCache entries, so is probably not what you want to
>> use.
>>
>> the filter() syntax attached to a q clause (not an fq clause) is meant
>> to allow you to get speedups when you want to use compound clauses
>> without having every combination be a separate filterCache entry.
>>
>> Consider the following:
>> fq=A OR B
>> fq=A AND B
>> fq=A
>> fq=B
>>
>> These would require 4 filterCache entries.
>>
>> q=filter(A) OR filter(B)
>> q=filter(A) AND filter(B)
>> q=filter(A)
>> q=filter(B)
>>
>> would only require two. Yet all of them would be satisfied only by
>> looking at the filterCache.
>>
>> Aside from the example immediately above, which one you use is largely
>> a matter of taste.
>>
>> Best,
>> Erick
>>
>> On Mon, May 9, 2016 at 12:47 PM, Jay Potharaju 
>> wrote:
>> > Thanks Ahmet...but I am not still clear how is adding filter() option
>> > better or is it the same as filtercache?
>> >
>> > My question is below.
>> >
>> > "As mentioned above adding filter() will add the filter query to the
>> cache.
>> > This would mean that results are fetched from cache instead of running n
>> > number of filter queries  in parallel.
>> > Is it necessary to use the filter() option? I was under the impression
>> that
>> > all filter queries will get added to the "filtercache". What is the
>> > advantage of using filter()?"
>> >
>> > Thanks
>> >
>> > On Sun, May 8, 2016 at 6:30 PM, Ahmet Arslan 
>> > wrote:
>> >
>> >> Hi,
>> >>
>> >> As I understand it useful incase you use an OR operator between two
>> >> restricting clauses.
>> >> Recall that multiple fq means implicit AND.
>> >>
>> >> ahmet
>> >>
>> >>
>> >>
>> >> On Monday, May 9, 2016 4:02 AM, Jay Potharaju 
>> >> wrote:
>> >> As mentioned above adding filter() will add the filter query to the
>> cache.
>> >> This would mean that results are fetched from cache instead of running n
>> >> number of filter queries  in parallel.
>> >> Is it necessary to use the filter() option? I was under the impression
>> that
>> >> all filter queries will get added to the "filtercache". What is the
>> >> advantage of using filter()?
>> >>
>> >> *From
>> >> doc:
>> >>
>> https://cwiki.apache.org/confluence/display/solr/Query+Settings+in+SolrConfig
>> >> <
>> >>
>> https://cwiki.apache.org/confluence/display/solr/Query+Settings+in+SolrConfig
>> >> >*
>> >> This cache is used by SolrIndexSearcher for filters (DocSets) for
>> unordered
>> >> sets of all documents that match a query. The numeric attributes control
>> >> the number of entries in the cache.
>> >> Solr uses the filterCache to cache results of queries that use the fq
>> >> search parameter. Subsequent queries using the same parameter setting
>> >> result in cache hits and rapid returns of results. See Searching for a
>> >> detailed discussion of the fq par

Re: solrcloud performance problem

2016-05-10 Thread Shawn Heisey
On 5/9/2016 11:42 PM, lltvw wrote:
> By using the jps command to double-check the params used to start Solr, I
> found that the max heap size was already set to 10G. So I made a big mistake
> yesterday.
>
> But using the Solr admin UI, when I select the collection with the
> performance problem, the overview page shows that the heap memory is about
> 8M. What is wrong?
>
> Every time I search different characters, QTime in the response header is
> always greater than 300ms. If I search again, because it can hit the cache,
> the response time drops to about 30ms.

When my queries hit the cache, they only take a few milliseconds.  30
milliseconds for a cached query seems VERY slow.

Can you open the dashboard in the admin UI, make it large enough to see
everything, take a screenshot of the whole page, and include a URL
where that screenshot can be viewed?  I do not need to see the whole
browser window, just the whole dashboard.  Here's an example of what I
am looking for:

https://www.dropbox.com/s/ixu8dr954mst0c4/dashboard-just-page.png?dl=0

In my example, you can't see all of the JVM Args in the screenshot --
there are a lot more of them, and they wouldn't fit in the window even
when maximized.  So if your screenshot doesn't include all of them, you
probably should copy those as text and include them in your reply --
like this:

-DSTOP.KEY=solrrocks
-DSTOP.PORT=7982
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.port=18982
-Dcom.sun.management.jmxremote.rmi.port=18982
-Dcom.sun.management.jmxremote.ssl=false
-Djetty.home=/opt/solr5/server
-Djetty.port=8982
-Dlog4j.configuration=file:/index/solr5/log4j.properties
-Dsolr.install.dir=/opt/solr5
-Dsolr.solr.home=/index/solr5/data
-Duser.timezone=UTC
-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark
-XX:+ParallelRefProcEnabled
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=70
-XX:CMSMaxAbortablePrecleanTime=2000
-XX:MaxTenuringThreshold=8
-XX:NewRatio=3
-XX:OnOutOfMemoryError=/opt/solr5/bin/oom_solr.sh 8982 /index/solr5/logs
-XX:PretenureSizeThreshold=64m
-XX:SurvivorRatio=4
-XX:TargetSurvivorRatio=90
-Xloggc:/index/solr5/logs/solr_gc.log
-Xms22g
-Xmx22g
-verbose:gc

How are you starting Solr?  With Solr 4.x, there are limitless numbers
of ways to install and start Solr, because it is released as a webapp
.war file.  When 5.0 was released, that was reduced to only a few
supported options.

Thanks,
Shawn



Re: query action with wrong result size zero

2016-05-10 Thread Mikhail Khludnev
Usually such issues are troubleshooted with: Solr admin: schema browser and
analysis. Also, you might need to check debugQuery=true output and perhaps
use explainOther param.
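For example (core name and the document-selecting query are placeholders):

http://localhost:8983/solr/collection1/select?q=brand:amd&debugQuery=true&explainOther=id:some_known_id

debugQuery=true shows how the main query was parsed, and explainOther
additionally explains scoring (or non-matching) for the documents matched by
that extra query.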
On 05 May 2016 at 18:58, "mixiangliu" <852262...@qq.com> wrote:


I found a strange thing with a Solr query: when I set the value of the query
field like "brand:amd", the size of the query result is zero, but the real
data is not zero. Can somebody tell me why? Thank you very much!
My English is not very good; I hope somebody understands my words!


Re: Re-indexing in SolRCloud while keeping the collection online -- Best practice?

2016-05-10 Thread Erick Erickson
Peter:

Yeah, that would work, but there are a couple of alternatives:
1> If there's any way to know which subset of docs has
 changed, just re-index _them_. The problem here is
 picking up deletes. In the RDBMS case this is often done
 by creating a trigger for deletes and then the last step
 in your update is to remove the docs since the last time
 you indexed using the deleted_docs table (or whatever).
 This falls down if a> you require an instantaneous switch
 from _all_ the old data to the new or b> you can't get a
 list of deleted docs.

2> Use collection aliasing. The pattern is this: you have your
 "Hot" collection (col1) serving queries that is pointed to
 by alias "hot". You create a new collection (col2) and index
 to it in the background. When done, use CREATEALIAS
 to point "hot" to "col2". Now you can delete col1. There are
 no restrictions on where these collections live, so this
 allows you to move your collections around as you want. Plus
 this keeps a better separation of old and new data...
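For illustration, the Collections API calls for that pattern might look
like the following (host, collection, alias, and config names are
placeholders):

http://host:8983/solr/admin/collections?action=CREATE&name=col2&numShards=2&replicationFactor=2&collection.configName=myconf
(index into col2 in the background, then:)
http://host:8983/solr/admin/collections?action=CREATEALIAS&name=hot&collections=col2
http://host:8983/solr/admin/collections?action=DELETE&name=col1

Queries sent to the "hot" alias switch atomically to col2 once CREATEALIAS
completes.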

Best,
Erick

On Tue, May 10, 2016 at 4:32 AM, Horváth Péter Gergely
 wrote:
> Hi Everyone,
>
> I am wondering if there is any best practice regarding re-indexing
> documents in SolrCloud 6.0.0 without making the data (or the underlying
> collection) temporarily unavailable. Wiping all documents in a collection
> and performing a full re-indexing is not a viable alternative for us.
>
> Say we had a massive Solr Cloud cluster with a number of separate nodes
> that are used to host *multiple hundreds* of collections, with document
> counts ranging from a couple of thousands to multiple (say up to 20)
> millions of documents, each with 200-300 fields and a background batch
> loader job that fetches data from a variety of source systems.
>
> We have to retain the cluster and ALL collections online all the time (365
> x 24): We cannot allow queries to be blocked while data in a collection is
> being updated and we cannot load everything in a single-shot jumbo commit
> (the replication could overload the cluster).
>
> One solution I could imagine is storing an additional field "load
> time-stamp" in all documents and the client (interactive query) application
> extending all queries with an additional restriction, which requires
> documents "load time-stamp" to be the latest known completed "load
> time-stamp".
>
> This concept would work according to the following:
> 1.) The batch job would simply start loading new documents, with the new
> "load time-stamp". Existing documents would not be touched.
> 2.) The client (interactive query) application would still use the old data
> from the previous load (since all queries are restricted with the old "load
> time-stamp")
> 3.) The batch job would store the new "load time-stamp" as the one to be
> used (e.g. in a separate collection etc.) -- after this, all queries would
> return the most up-to-date documents
> 4.) The batch job would purge all documents from the collection, where
> the "load time-stamp" is not the same as the last one.
>
> This approach seems to be implementable, however, I definitely want to
> avoid reinventing the wheel myself and wondering if there is any better
> solution or built-in Solr Cloud feature to achieve the same or something
> similar.
>
> Thanks,
> Peter


How to restrict outside IP access in Solr with internal jetty server

2016-05-10 Thread Mugeesh Husain

I am using Solr 5.3 with the inbuilt Jetty server.

I am looking for a proxy kind of setup with which I could prevent outside
users from accessing all of the URLs; I would only expose the select URL of
each core, and nothing else should be open.

Please give me some suggestion.

Thanks
 Mugeesh




--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-restrict-outside-IP-access-in-Solr-with-internal-jetty-server-tp4275822.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Replicate Between sites

2016-05-10 Thread Erick Erickson
bq: Why not use classic replication between one node in the cluster and another
node in the other cluster.

First of all, in SolrCloud I'm pretty sure you can't do that, where "that"
is having classic replication operate with one SolrCloud cluster as the
source and another SolrCloud cluster as the destination. There's all the
replication logic between leaders and followers that you'd be interfering
with.

Step back for a minute though. Even if you set that up you'd be
replicating your index across your admittedly slow DC/DC
connection. The merging process creates new segments from various
subsets of current segments, and the new segment would be
copied to the backup DC. In some cases the entire index will be merged
into a single segment (admittedly rarely). It seems far more bandwidth-efficient
to just index the raw docs to each DC from the client.
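To illustrate that last option, a rough SolrJ sketch of a client that
indexes to both DCs (ZK addresses and the collection name are placeholders;
in 4.9 the class is CloudSolrServer rather than CloudSolrClient):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DualDcIndexer {
    public static void main(String[] args) throws Exception {
        // One client per data center, each pointed at its own ZK ensemble.
        CloudSolrClient primary = new CloudSolrClient("zk-dc1:2181/solr");
        CloudSolrClient backup = new CloudSolrClient("zk-dc2:2181/solr");
        primary.setDefaultCollection("mycollection");
        backup.setDefaultCollection("mycollection");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("title", "example");

        // Send the raw document to both DCs instead of shipping segments.
        primary.add(doc);
        backup.add(doc);
        primary.commit();
        backup.commit();

        primary.close();
        backup.close();
    }
}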

Best,
Erick

On Tue, May 10, 2016 at 6:52 AM, Abdel Belkasri  wrote:
> Erick,
>
> That's not what I was going for. No code porting. I was thinking this:
> Why not use classic replication between one node in the cluster and another
> node in the other cluster?
> something along this line.
>
> Thanks,
> --Abdel.
>
> On Tue, May 10, 2016 at 12:21 AM, Erick Erickson 
> wrote:
>
>> bq: How similar thing could be done in 4.9.1?
>>
>> That's not going to happen. More precisely,
>> there is zero chance that anyone will take on that
>> work unless it's a custom one-off that you
>> hire done or develop internally. And even
>> if someone took this on, it'd never be officially
>> released.
>>
>> IOW, if you want to try backporting it on your own,
>> have at it but that'll be completely unsupported.
>>
>> One thing people have done is create two
>> independent clusters, complete to separate ZK
>> ensembles and have the indexing client send
>> updates to both DCs. At that point it also makes
>> sense to have them both serve queries.
>>
>> Another choice is to have your system-of-record
>> replicated to both DCs, and have the indexing
>> process run in both DCs from the local copy of
>> the system-of-record to the local Solr
>> clusters independently of each other.
>>
>> Best,
>> Erick
>>
>> On Mon, May 9, 2016 at 12:31 PM, Abdel Belkasri 
>> wrote:
>> > Hi Alex,
>> >
> > just started reading about CDCR, looks very promising. Is this only in
>> > 6.0? our PROD server are running 4.9.1 and we cannot upgrade just yet.
>> How
>> > similar thing could be done in 4.9.1?
>> >
>> > Thanks,
>> > --Abdel
>> >
>> > On Mon, May 9, 2016 at 2:59 PM, Alexandre Rafalovitch <
>> arafa...@gmail.com>
>> > wrote:
>> >
>> >> Have you looked at Cross Data Center replication that's the new big
>> >> feature in Solr 6.0?
>> >>
>> >> Regards,
>> >>Alex.
>> >> 
>> >> Newsletter and resources for Solr beginners and intermediates:
>> >> http://www.solr-start.com/
>> >>
>> >>
>> >> On 10 May 2016 at 02:13, Abdel Belkasri  wrote:
>> >> > Hi there,
>> >> >
>> >> > we have the main site setup as follows:
>> >> > solrCould:
>> >> > App --> smart Client (solrj) --> ensemble of zookeeper --> SolrCloud
>> Noes
>> >> > (with slice/shard/recplica)
>> >> > Works fine.
>> >> >
>> >> > On the DR site we have a mirror setup, how can we keep the two site in
>> >> > sync, so that if something happened we point the app to DR and get
>> back
>> >> up
>> >> > and running?
>> >> >
>> >> > Note: making zookeeper span the two sites is not an option because of
>> >> > network latency.
>> >> >
>> >> > We are looking for replication (kind of master-slave that exists in
>> Solr
>> >> > classic)...how that is achieved in SolrCloud?
>> >> >
>> >> > Thanks,
>> >> > --Abdel.
>> >>
>> >
>> >
>> >
>> > --
>> > Abdel K. Belkasri, PhD
>>
>
>
>
> --
> Abdel K. Belkasri, PhD


Using Ping Request Handler in SolrCloud within a load balancer

2016-05-10 Thread Sandy Foley
A couple of questions ...

We've upconfig'd the ping request handler to ZooKeeper within the
solrconfig.xml. SolrCloud and ZooKeeper are working fine.

I understand that the /solr/admin/ping command is for a ping on its local
server only (not from a remote machine). This is working. I also understand
that /solr/[core]/admin/ping can be used from a load balancer to ping a
particular core on a server. This is working also.

Question #1: Is there a SINGLE command that can be issued to each server from
a load balancer to check the ping status of each server?

Question #2: When running /solr/admin/ping from the load balancer to each
Solr node, one of the three nodes returns a status ok. It's the same node
every time; it's the first node that we set up of the 3 (which is not always
the leader). The zkcli upconfig command has always been issued from this
first node. Out of curiosity, if this command is for local ping only, why
does this return status ok on one node (issued from the load balancer) and
not the other nodes?

Configuration:
Windows
Tomcat 8.0
SolrCloud 4.10.3 (3 nodes)
External ZooKeeper ensemble 3.4.6 - 3 servers

Thank you.

Re-indexing in SolRCloud while keeping the collection online -- Best practice?

2016-05-10 Thread Horváth Péter Gergely
Hi Everyone,

I am wondering if there is any best practice regarding re-indexing
documents in SolrCloud 6.0.0 without making the data (or the underlying
collection) temporarily unavailable. Wiping all documents in a collection
and performing a full re-indexing is not a viable alternative for us.

Say we had a massive Solr Cloud cluster with a number of separate nodes
that are used to host *multiple hundreds* of collections, with document
counts ranging from a couple of thousands to multiple (say up to 20)
millions of documents, each with 200-300 fields and a background batch
loader job that fetches data from a variety of source systems.

We have to retain the cluster and ALL collections online all the time (365
x 24): We cannot allow queries to be blocked while data in a collection is
being updated and we cannot load everything in a single-shot jumbo commit
(the replication could overload the cluster).

One solution I could imagine is storing an additional field "load
time-stamp" in all documents and the client (interactive query) application
extending all queries with an additional restriction, which requires
documents "load time-stamp" to be the latest known completed "load
time-stamp".

This concept would work according to the following:
1.) The batch job would simply start loading new documents, with the new
"load time-stamp". Existing documents would not be touched.
2.) The client (interactive query) application would still use the old data
from the previous load (since all queries are restricted with the old "load
time-stamp")
3.) The batch job would store the new "load time-stamp" as the one to be
used (e.g. in a separate collection etc.) -- after this, all queries would
return the most up-to-date documents
4.) The batch job would purge all documents from the collection, where
the "load time-stamp" is not the same as the last one.

This approach seems to be implementable, however, I definitely want to
avoid reinventing the wheel myself and wondering if there is any better
solution or built-in Solr Cloud feature to achieve the same or something
similar.

Thanks,
Peter


Re: auto purge for embedded zookeeper

2016-05-10 Thread Shawn Heisey
On 5/9/2016 1:11 PM, tedsolr wrote:
> I have a development environment that is using an embedded zookeeper, and the
> zoo_data folder continues to grow. It's filled with snapshot files that are
> not getting purged. zoo.cfg has properties
> autopurge.snapRetainCount=10
> autopurge.purgeInterval=1
> Perhaps it's not in the correct location so its not getting read? Or maybe
> these props don't apply for embedded instances?
>
> Anyone know? Thanks!
> v5.2.1

Reading the source for the SolrZkServer class, it appears that only a
limited set of properties in that config file is parsed by the embedded
zookeeper.  Only these properties are used to configure the server, all
others are ignored:

server.*
group.*
weight.*
dataDir
dataLogDir
clientPort
tickTime
initLimit
syncLimit
electionAlg
maxClientCnxns

The reason that it ignores everything else is that this code is copied
from Zookeeper 3.2 -- which is over six years old.  Zookeeper did not
have snapshot purging functionality back then.

Although this is something we can fix by copying some of the code from
the latest Zookeeper into Solr and making some changes, the way Solr
implements the embedded zookeeper functionality will be susceptible to
similar problems in the future unless we upgrade zookeeper to 3.5.x and
change the embedded zookeeper implementation. ZK 3.5 is only available
as an alpha version, and may not be available in a stable version for a
few months.

You have mentioned that it's a dev environment.  A production
environment configured according to recommendations (no embedded
zookeeper) would not have this problem.

I would recommend scripting something yourself to clean up the zookeeper
data directory ... because even if we do fix this problem, the fix won't
likely be available in a regular Solr release for several weeks, and
will only be available in a new 6.x version, not anything in 4.x or 5.x.
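If you script it, ZooKeeper's own purge class can do the actual work; a
sketch, assuming the zookeeper jar (and its logging dependencies) are on the
classpath and that zoo_data holds both the snapshots and the transaction
logs (the paths are placeholders):

java -cp /path/to/zookeeper-3.4.6.jar:/path/to/logging-libs/* \
  org.apache.zookeeper.server.PurgeTxnLog /path/to/zoo_data /path/to/zoo_data -n 10

Run it from cron; -n is the number of snapshots to retain and must be at
least 3.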

Thanks,
Shawn



what scene using carrot2 cluster

2016-05-10 Thread xiangliumi
Hi all,

Has anyone used Carrot2 with Solr? Please give me a description of a scenario
where Carrot2 is useful, and ideally some links about deploying Solr 5.x with
Carrot2. Thanks for your help!


thanks
Max Mi

Sent using CloudMagic Email 
[https://cloudmagic.com/k/d/mailapp?ct=pa&cv=8.4.52&pv=4.4.2&source=email_footer_2]

Re: query action with wrong result size zero

2016-05-10 Thread xiangliumi
Hi,Erick

Thank you very much; I will experiment more, following your advice.


thanks
Max Mi

Sent using CloudMagic Email 
[https://cloudmagic.com/k/d/mailapp?ct=pa&cv=8.4.52&pv=4.4.2&source=email_footer_2]
 On Fri, May 06, 2016 at 11:39 PM, Erick < erickerick...@gmail.com 
[erickerick...@gmail.com] > wrote:
bq: does this means that different kinds of docs can not be put into
the same solr core

You can certainly put different kinds of docs in the same core,
you just have to search them appropriately, something like
q=field1:value OR field2:value

Say doc1 had "value" in field1 (but did not have field2)
and doc2 had "value" in field2 (but did not have field1)

Then the above query would return both docs.

However, this may have surprising results since presumably
the different "types" of docs represent very different things.
Let's say you have "people" and "places" docs. Ogden is a
surname, but there is also a city in Utah called "Ogden".
A search like above might return both and if the user expected
to be searching places they'd be surprised to see a person.

So, to sum up there's no restriction on having different types
of docs with different fields in Solr, you just have to search
them appropriately (and so the users get what they expect).

Very often, people will put a "type" field in the doc and restrict
what kinds of docs are returned with an fq clause (fq=type:people
in the above example for instance) when appropriate.

Best,
Erick

On Thu, May 5, 2016 at 10:58 PM, 梦在远方  wrote:
> Thank you, Jay Potharaju
>
>
> I made a discovery: in the same Solr core, I put two kinds of docs, which
> means they do not have the same fields. Does this mean that different
> kinds of docs cannot be put into the same Solr core?
>
>
> thanks!
> 
> max mi
>
>
>
>
> -- Original message --
> From: "Erick Erickson";
> Sent: Friday, May 6, 2016, 12:14 PM
> To: "solr-user";
>
> Subject: Re: query action with wrong result size zero
>
>
>
> Please show us:
> 1> a sample doc that you expect to be returned
> 2> the results of adding '&debug=query' to the URL
> 3> the schema definition for the field you're querying against.
>
> It is likely that your query isn't quite what you think it is, is going
> against a different field than you think or your schema isn't
> quite doing what you think...
>
> On Thu, May 5, 2016 at 9:40 AM, Jay Potharaju  wrote:
>> Can you check if the field you are searching on is case sensitive? You can
>> quickly test it by copying the exact contents of the brand field into your
>> query and comparing it against the query you have posted above.
>>
>> On Thu, May 5, 2016 at 8:57 AM, mixiangliu <852262...@qq.com> wrote:
>>
>>>
>>> I found a strange thing with a Solr query: when I set the value of the
>>> query field like "brand:amd", the size of the query result is zero, but
>>> the real data is not zero. Can somebody tell me why? Thank you very much!
>>> My English is not very good; I hope somebody understands my words!
>>>
>>
>>
>>
>> --
>> Thanks
>> Jay Potharaju

Re: Solr 5.4.1 Mergeindexes duplicate rows

2016-05-10 Thread Shawn Heisey
On 5/9/2016 7:55 AM, Kalpana wrote:
> Can anyone help me with a merge. Currently I have the two cores already
> pulling data from SQL Table based on the query I set up.
>
> Solr is running
>
> I also have a third core set up with schema similar to the first two. and
> then I wrote this in the url and hit enter 
> http://localhost:8983/solr/admin/cores?action=mergeindexes&core=Sitecore_SharePoint&srcCore=sitecore_web_index&srcCore=SharePoint_All
>
> I stop and start Solr and I see data with duplicates.
>
> Am I doing this right? 

Some questions:

Is Solr in cloud mode or running standalone?

If you look at the core overview in the admin UI for these three cores,
can you tell me what Num Docs, Max Doc, and the index size is for all
three indexes?

Are the schemas in these three indexes all using the same field name for
uniqueKey?

Are you sure that you have only run the merge once?  Alternately, before
each merge attempt, you could entirely delete
$SOLR_HOME/Sitecore_Sharepoint/data and reload the core or restart Solr.

Thanks,
Shawn



Re: Solr 5.x bug with Service installation script?

2016-05-10 Thread Shawn Heisey
On 5/9/2016 11:30 AM, A Laxmi wrote:
> yes, I always shutdown both source and destination Solr before copying the
> index over from one to another. Somehow the write.lock only happens when
> Solr restarts from the service script. It loads just fine when started manually.

One possible problem:

The bin/solr script (which is used by the init script) only waits for 5
seconds for Solr to stop gracefully before killing it forcibly.  This can
leave write.lock files behind.

I thought it had increased to 30 seconds in a recent version and that it
was possibly even configurable in solr.in.sh, but I just checked the
6.0.0 download.  It's still only 5 seconds, and the value is hard-coded
in the script.  This is only enough time if you have a very small number
of very small indexes.

Thanks,
Shawn



Re: Streaming expressions join operations

2016-05-10 Thread Joel Bernstein
The block of code the NPE is coming from is where the collection nodes are
being gathered for the query. So this points to some issue with the cloud
setup or the query.
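For anyone setting up a reproduction, a minimal sketch of the kind of field
definitions the join's sort relies on: single-valued, sortable fields on
both sides (these definitions are an assumption based on this thread, not a
confirmed fix for the NPE):

<field name="personId" type="long" indexed="true" stored="true" docValues="true"/>
<field name="ownerId" type="long" indexed="true" stored="true" docValues="true"/>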

Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, May 10, 2016 at 9:52 AM, Joel Bernstein  wrote:

> Can you post the entire stack trace? I'd like to see what line the NPE is
> coming from. The line you pasted in is coming from the wrapper exception I
> believe.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, May 10, 2016 at 12:30 AM, Ryan Cutter 
> wrote:
>
>> Yes, the people collection has the personId and pets has ownerId, as
>> described.
>> On May 9, 2016 8:55 PM, "Joel Bernstein"  wrote:
>>
>> > The example is using two collections: people and pets. So these
>> collections
>> > would need to be present for the join expression to work.
>> >
>> > Joel Bernstein
>> > http://joelsolr.blogspot.com/
>> >
>> > On Mon, May 9, 2016 at 10:43 PM, Ryan Cutter 
>> wrote:
>> >
>> > > Thanks Joel, I added the personId and ownerId fields before ingesting a
>> > > little data.  I made them to be stored=true/multiValue=false/longs
>> (and
>> > > strings, later).  Is additional schema required?
>> > >
>> > > On Mon, May 9, 2016 at 6:45 PM, Joel Bernstein 
>> > wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > The example in the cwiki would require setting up the people and
>> pets
>> > > > collections. Unless I'm mistaken this won't work with the out of the
>> > box
>> > > > schemas. So you'll need to setup some test schemas to get started.
>> > > Although
>> > > > having out of the box streaming schemas is a great idea.
>> > > >
>> > > > Joel Bernstein
>> > > > http://joelsolr.blogspot.com/
>> > > >
>> > > > On Mon, May 9, 2016 at 9:22 PM, Ryan Cutter 
>> > > wrote:
>> > > >
>> > > > > Hello, I'm checking out the cool stream join operations in Solr
>> 6.0
>> > but
>> > > > > can't seem to the example listed on the wiki to work:
>> > > > >
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions#StreamingExpressions-innerJoin
>> > > > >
>> > > > > innerJoin(
>> > > > >   search(people, q=*:*, fl="personId,name", sort="personId asc"),
>> > > > >   search(pets, q=type:cat, fl="ownerId,petName", sort="ownerId
>> asc"),
>> > > > >   on="personId=ownerId"
>> > > > > )
>> > > > >
>> > > > > ERROR - 2016-05-09 21:42:43.497; [c:pets s:shard1 r:core_node1
>> > > > > x:pets_shard1_replica1] org.apache.solr.common.SolrException;
>> > > > > java.io.IOException: java.lang.NullPointerException
>> > > > >
>> > > > > at
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> org.apache.solr.client.solrj.io.stream.CloudSolrStream.constructStreams(CloudSolrStream.java:339)
>> > > > >
>> > > > > 1. Joel Bernstein pointed me at SOLR-9058.  Is this the likely
>> bug?
>> > > > > 2. What kind of field should personId and ownerId be?  long,
>> string,
>> > > > > something else?
>> > > > > 3. Does someone have an example schema or dataset that show off
>> these
>> > > > > joins?  If not, it's something I could work on for future souls.
>> > > > >
>> > > > > Thanks! Ryan
>> > > > >
>> > > >
>> > >
>> >
>>
>
>


Re: Streaming expressions join operations

2016-05-10 Thread Joel Bernstein
Can you post the entire stack trace? I'd like to see what line the NPE is
coming from. The line you pasted in is coming from the wrapper exception I
believe.

Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, May 10, 2016 at 12:30 AM, Ryan Cutter  wrote:

> Yes, the people collection has the personId and pets has ownerId, as
> described.
> On May 9, 2016 8:55 PM, "Joel Bernstein"  wrote:
>
> > The example is using two collections: people and pets. So these
> collections
> > would need to be present for the join expression to work.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Mon, May 9, 2016 at 10:43 PM, Ryan Cutter 
> wrote:
> >
> > > Thanks Joel, I added the personId and ownerId fields before ingesting a
> > > little data.  I made them to be stored=true/multiValue=false/longs (and
> > > strings, later).  Is additional schema required?
> > >
> > > On Mon, May 9, 2016 at 6:45 PM, Joel Bernstein 
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > The example in the cwiki would require setting up the people and pets
> > > > collections. Unless I'm mistaken this won't work with the out of the
> > box
> > > > schemas. So you'll need to setup some test schemas to get started.
> > > Although
> > > > having out of the box streaming schemas is a great idea.
> > > >
> > > > Joel Bernstein
> > > > http://joelsolr.blogspot.com/
> > > >
> > > > On Mon, May 9, 2016 at 9:22 PM, Ryan Cutter 
> > > wrote:
> > > >
> > > > > Hello, I'm checking out the cool stream join operations in Solr 6.0
> > but
> > > > > can't seem to the example listed on the wiki to work:
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions#StreamingExpressions-innerJoin
> > > > >
> > > > > innerJoin(
> > > > >   search(people, q=*:*, fl="personId,name", sort="personId asc"),
> > > > >   search(pets, q=type:cat, fl="ownerId,petName", sort="ownerId
> asc"),
> > > > >   on="personId=ownerId"
> > > > > )
> > > > >
> > > > > ERROR - 2016-05-09 21:42:43.497; [c:pets s:shard1 r:core_node1
> > > > > x:pets_shard1_replica1] org.apache.solr.common.SolrException;
> > > > > java.io.IOException: java.lang.NullPointerException
> > > > >
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.client.solrj.io.stream.CloudSolrStream.constructStreams(CloudSolrStream.java:339)
> > > > >
> > > > > 1. Joel Bernstein pointed me at SOLR-9058.  Is this the likely bug?
> > > > > 2. What kind of field should personId and ownerId be?  long,
> string,
> > > > > something else?
> > > > > 3. Does someone have an example schema or dataset that show off
> these
> > > > > joins?  If not, it's something I could work on for future souls.
> > > > >
> > > > > Thanks! Ryan
> > > > >
> > > >
> > >
> >
>


Re: Replicate Between sites

2016-05-10 Thread Abdel Belkasri
Erick,

That's not what I was going for. No code porting. I was thinking this:
why not use classic replication between one node in one cluster and another
node in the other cluster?
Something along this line.

Thanks,
--Abdel.

On Tue, May 10, 2016 at 12:21 AM, Erick Erickson 
wrote:

> bq: How similar thing could be done in 4.9.1?
>
> That's not going to happen. More precisely,
> there is zero chance that anyone will take on that
> work unless it's a custom one-off that you
> hire done or develop internally. And even
> if someone took this on, it'd never be officially
> released.
>
> IOW, if you want to try backporting it on your own,
> have at it but that'll be completely unsupported.
>
> One thing people have done is create two
> independent clusters, complete to separate ZK
> ensembles and have the indexing client send
> updates to both DCs. At that point it also makes
> sense to have them both serve queries.
>
> Another choice is to have your system-of-record
> replicated to both DCs, and have the indexing
> process run in both DCs from the local copy of
> the system-of-record to the local Solr
> clusters independently of each other.
>
> Best,
> Erick
>
> On Mon, May 9, 2016 at 12:31 PM, Abdel Belkasri 
> wrote:
> > Hi Alex,
> >
> > just started reading about CDCR, looks very promising. Is this only in
> > 6.0? our PROD server are running 4.9.1 and we cannot upgrade just yet.
> How
> > similar thing could be done in 4.9.1?
> >
> > Thanks,
> > --Abdel
> >
> > On Mon, May 9, 2016 at 2:59 PM, Alexandre Rafalovitch <
> arafa...@gmail.com>
> > wrote:
> >
> >> Have you looked at Cross Data Center replication that's the new big
> >> feature in Solr 6.0?
> >>
> >> Regards,
> >>Alex.
> >> 
> >> Newsletter and resources for Solr beginners and intermediates:
> >> http://www.solr-start.com/
> >>
> >>
> >> On 10 May 2016 at 02:13, Abdel Belkasri  wrote:
> >> > Hi there,
> >> >
> >> > we have the main site setup as follows:
> >> > solrCould:
> >> > App --> smart Client (solrj) --> ensemble of zookeeper --> SolrCloud
> Noes
> >> > (with slice/shard/recplica)
> >> > Works fine.
> >> >
> >> > On the DR site we have a mirror setup, how can we keep the two site in
> >> > sync, so that if something happened we point the app to DR and get
> back
> >> up
> >> > and running?
> >> >
> >> > Note: making zookeeper span the two sites is not an option because of
> >> > network latency.
> >> >
> >> > We are looking for replication (kind of master-slave that exists in
> Solr
> >> > classic)...how that is achieved in SolrCloud?
> >> >
> >> > Thanks,
> >> > --Abdel.
> >>
> >
> >
> >
> > --
> > Abdel K. Belkasri, PhD
>



-- 
Abdel K. Belkasri, PhD


Re:Re: Re:Re: solrcloud performance problem

2016-05-10 Thread lltvw
Hi Toke,

The version I am using is 4.10. I do not know why, after setting the log
level to ALL and then restarting Solr, I still could not get detailed log
info. What is wrong?


Would the debug info from the Solr admin UI be useful?

--
Sent from my NetEase Mail mobile client


On 2016-05-10 16:25:34, "Toke Eskildsen" wrote:
>On Tue, 2016-05-10 at 15:33 +0800, lltvw wrote:
>> Which log do you mean, the console log or something else? The version I am
>> using is 4.10.
>
>There should be a solr.log somewhere. If you have not changed the
>default log levels, it should log all queries.
>
>
>- Toke Eskildsen, State and University Library, Denmark
>
>


Re: Nodes appear twice in state.json

2016-05-10 Thread solr2020
I am able to delete the down/unused cores (entries that are not actually
backed by a core but still appear in state.json) using the DELETEREPLICA API.

/admin/collections?action=DELETEREPLICA&collection=collection
name&shard=shardname&replica=(dead/unused core name listed in  state.json)

eg:

/admin/collections?action=DELETEREPLICA&collection=collection
name&shard=shardname&replica=core_node3



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nodes-appear-twice-in-state-json-tp4274504p4275797.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Nodes appear twice in state.json

2016-05-10 Thread solr2020
Hi Shalin,

How do we edit state.json? Do we have any utility to edit state.json as we
have for clusterstate.json? 

Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nodes-appear-twice-in-state-json-tp4274504p4275791.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Facet ignoring repeated word

2016-05-10 Thread G, Rajesh
Thanks Toke. The issue I have is that I cannot look for a specific word, e.g.
ddr in termfreq('name', 'ddr'). I have to find the count of all words and
their sum. I might have 1000+ comments, and each might have different words.
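In that case the facet-then-stats combo can be automated: fetch the facet
terms first, then build one stats.field per term. A rough SolrJ sketch,
assuming a 'comments' field and placeholder host/collection names:

import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;

public class WordCloudCounts {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
            new HttpSolrClient("http://localhost:8983/solr/mycollection");

        // Step 1: facet to discover the candidate words.
        SolrQuery facetQ = new SolrQuery("*:*");
        facetQ.setRows(0);
        facetQ.setFacet(true);
        facetQ.addFacetField("comments");
        facetQ.setFacetLimit(100);
        List<FacetField.Count> terms =
            client.query(facetQ).getFacetField("comments").getValues();

        // Step 2: one termfreq sum per discovered term.
        SolrQuery statsQ = new SolrQuery("*:*");
        statsQ.setRows(0);
        statsQ.set("stats", "true");
        for (FacetField.Count term : terms) {
            // Naive quoting: terms containing quotes would need escaping.
            statsQ.add("stats.field",
                "{!sum=true func}termfreq('comments','" + term.getName() + "')");
        }
        // Sums come back keyed by the function string in the stats section.
        System.out.println(client.query(statsQ).getFieldStatsInfo());

        client.close();
    }
}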




-Original Message-
From: G, Rajesh [mailto:r...@cebglobal.com]
Sent: Tuesday, May 10, 2016 6:22 PM
To: solr-user@lucene.apache.org; t...@statsbiblioteket.dk
Subject: RE: Facet ignoring repeated word

Thanks Toke. The issue I have is that I cannot look for a specific word, e.g.
ddr in termfreq('name', 'ddr'). I have to find the count of all words and
their sum.




-Original Message-
From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
Sent: Tuesday, May 10, 2016 1:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Facet ignoring repeated word

On Fri, 2016-04-29 at 08:55 +, G, Rajesh wrote:
> I am trying to implement a word cloud using Solr. The problem I have is
> that the Solr facet query ignores repeated words in a document, e.g.

Use a combination of faceting and stats:

1) Resolve candidate words with faceting, just as you have already done.

2) Create a stats-request with the same q as you used for faceting, with a 
termfreq-function for each term in your facet result.


Working example from the techproducts-demo that comes with Solr:

http://localhost:8983/solr/techproducts/select
?q=name%3Addr%0A
&fl=name&wt=json&indent=true
&stats=true
&stats.field={!sum=true%20func}termfreq(%27name%27,%20%27ddr%27)
&stats.field={!sum=true%20func}termfreq(%27name%27,%20%271GB%27)

where 'name' is the field ('comments' in your setup) and 'ddr' and '1GB'
are two terms ('absorbed', 'am', 'believe' etc. in your setup).


The result will be something like

"response": {
"numFound": 3,
...
"stats": {
"stats_fields": {
  "termfreq('name', 'ddr')": {
"sum": 6
  },
  "termfreq('name', '1GB')": {
"sum": 3
  }
}
  }


- Toke Eskildsen, State and University Library, Denmark




RE: Facet ignoring repeated word

2016-05-10 Thread G, Rajesh
Thanks Toke. The issue I have is that I cannot look for a specific word, e.g.
ddr in termfreq('name', 'ddr'). I have to find the count of all words and
their sum.




-Original Message-
From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
Sent: Tuesday, May 10, 2016 1:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Facet ignoring repeated word

On Fri, 2016-04-29 at 08:55 +, G, Rajesh wrote:
> I am trying to implement a word cloud using Solr. The problem I have is
> that the Solr facet query ignores repeated words in a document, e.g.

Use a combination of faceting and stats:

1) Resolve candidate words with faceting, just as you have already done.

2) Create a stats-request with the same q as you used for faceting, with a 
termfreq-function for each term in your facet result.


Working example from the techproducts-demo that comes with Solr:

http://localhost:8983/solr/techproducts/select
?q=name%3Addr%0A
&fl=name&wt=json&indent=true
&stats=true
&stats.field={!sum=true%20func}termfreq(%27name%27,%20%27ddr%27)
&stats.field={!sum=true%20func}termfreq(%27name%27,%20%271GB%27)

where 'name' is the field ('comments' in your setup) and 'ddr' and '1GB'
are two terms ('absorbed', 'am', 'believe' etc. in your setup).


The result will be something like

"response": {
"numFound": 3,
...
"stats": {
"stats_fields": {
  "termfreq('name', 'ddr')": {
"sum": 6
  },
  "termfreq('name', '1GB')": {
"sum": 3
  }
}
  }


- Toke Eskildsen, State and University Library, Denmark




Re: Facet ignoring repeated word

2016-05-10 Thread Ahmet Arslan
+1 to Toke's facet and stats combo!



On Tuesday, May 10, 2016 11:21 AM, Toke Eskildsen  
wrote:
On Fri, 2016-04-29 at 08:55 +, G, Rajesh wrote:

> I am trying to implement a word cloud using Solr. The problem I have is
> that the Solr facet query ignores repeated words in a document, e.g.

Use a combination of faceting and stats:

1) Resolve candidate words with faceting, just as you have already done.

2) Create a stats-request with the same q as you used for faceting, with
a termfreq-function for each term in your facet result.


Working example from the techproducts-demo that comes with Solr:

http://localhost:8983/solr/techproducts/select
?q=name%3Addr%0A
&fl=name&wt=json&indent=true
&stats=true
&stats.field={!sum=true%20func}termfreq(%27name%27,%20%27ddr%27)
&stats.field={!sum=true%20func}termfreq(%27name%27,%20%271GB%27)

where 'name' is the field ('comments' in your setup) and 'ddr' and '1GB'
are two terms ('absorbed', 'am', 'believe' etc. in your setup).


The result will be something like

"response": {
"numFound": 3,
...
"stats": {
"stats_fields": {
  "termfreq('name', 'ddr')": {
"sum": 6
  },
  "termfreq('name', '1GB')": {
"sum": 3
  }
}
  }


- Toke Eskildsen, State and University Library, Denmark


Re: how to find out how many times a word appears in a collection of documents?

2016-05-10 Thread Ahmet Arslan
Hi,

The fl parameter accepts multiple fields; please try fl=title,link

ahmet



On Tuesday, May 10, 2016 2:26 PM, "liviuchrist...@yahoo.com.INVALID" 
 wrote:
Hi Ahmet,

Thank you very much. There would be another question: I can't make it
provide results from more than one field:

http://localhost:8983/solr/cuvinte/admin/luke?fl=_text_&?fl=title&?fl=link&numTerms=100

Is my query syntax wrong? I need to get results from more than one field... for
example the words from the following fields: _text_&link&title&category

[Luke response omitted: the XML markup was stripped by the mail archive,
leaving only index statistics and per-term counts.]

Christian Fotache Tel: 0728.297.207 Fax: 0351.411.570

  From: Ahmet Arslan 

To: "solr-user@lucene.apache.org" ; 
"liviuchrist...@yahoo.com"  
Sent: Tuesday, May 10, 2016 1:42 PM
Subject: Re: how to find out how many times a word appears in a collection of 
documents?
  
Hi Christian,

Collection wide term statistics can be accessed via TermsComponent or 
LukeRequestHandler.

Ahmet



On Tuesday, May 10, 2016 1:26 PM, "liviuchrist...@yahoo.com.INVALID" 
 wrote:
Hi everyone,
I need to "read" the Solr/Lucene index and see how many times words appear
in all documents. For example: I have a collection of 1 mil documents and I
want to see a list like this:

the - 10 times
bread - 1000 times
spoon - 10 times
fork - 5 times

etc.
How do I do that?
Kind regards, Christian


Re: how to find out how many times a word appears in a collection of documents?

2016-05-10 Thread liviuchristian
Hi Ahmet,

Thank you very much. There would be another question: I can't make it
provide results from more than one field:

http://localhost:8983/solr/cuvinte/admin/luke?fl=_text_&?fl=title&?fl=link&numTerms=100

Is my query syntax wrong? I need to get results from more than one field... for
example the words from the following fields: _text_&link&title&category

[Luke response omitted: the XML markup was stripped by the mail archive,
leaving only index statistics and per-term counts.]

Christian Fotache Tel: 0728.297.207 Fax: 0351.411.570

  From: Ahmet Arslan 
 To: "solr-user@lucene.apache.org" ; 
"liviuchrist...@yahoo.com"  
 Sent: Tuesday, May 10, 2016 1:42 PM
 Subject: Re: how to find out how many times a word appears in a collection of 
documents?
   
Hi Christian,

Collection wide term statistics can be accessed via TermsComponent or 
LukeRequestHandler.

Ahmet



On Tuesday, May 10, 2016 1:26 PM, "liviuchrist...@yahoo.com.INVALID" 
 wrote:
Hi everyone,
I need to "read" the Solr/Lucene index and see how many times words appear
in all documents. For example: I have a collection of 1 mil documents and I
want to see a list like this:

the - 10 times
bread - 1000 times
spoon - 10 times
fork - 5 times

etc.
How do I do that?
Kind regards, Christian


  

Re: how to find out how many times a word appears in a collection of documents?

2016-05-10 Thread Ahmet Arslan
Hi Christian,

Collection wide term statistics can be accessed via TermsComponent or 
LukeRequestHandler.
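For the TermsComponent route, a quick sketch (this assumes a /terms handler
is registered in solrconfig.xml, as in the default configsets; core and
field names follow the ones used elsewhere in this thread):

http://localhost:8983/solr/cuvinte/terms?terms.fl=_text_&terms.fl=title&terms.limit=100&terms.sort=count&wt=json

terms.fl can be repeated for multiple fields. Note that each term is
returned with its document frequency, i.e. the number of documents
containing it, not the total number of occurrences.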

Ahmet



On Tuesday, May 10, 2016 1:26 PM, "liviuchrist...@yahoo.com.INVALID" 
 wrote:
Hi everyone,
I need to "read" the Solr/Lucene index and see how many times words appear
in all documents. For example: I have a collection of 1 mil documents and I
want to see a list like this:

the - 10 times
bread - 1000 times
spoon - 10 times
fork - 5 times

etc.
How do I do that?
Kind regards, Christian


how to find out how many times a word appears in a collection of documents?

2016-05-10 Thread liviuchristian
Hi everyone,
I need to "read" the Solr/Lucene index and see how many times words appear
in all documents. For example: I have a collection of 1 mil documents and I
want to see a list like this:

the - 10 times
bread - 1000 times
spoon - 10 times
fork - 5 times

etc.
How do I do that?
Kind regards, Christian

AW: How to find out if index contains orphaned child documents

2016-05-10 Thread Sebastian Riemer
Sorry for the double post. Formatting got lost too :(

Whenever I mention the field "type" I actually mean "type_s".


-Ursprüngliche Nachricht-
Von: Sebastian Riemer [mailto:s.rie...@littera.eu] 
Gesendet: Dienstag, 10. Mai 2016 11:47
An: solr-user@lucene.apache.org
Betreff: How to find out if index contains orphaned child documents

Hi all,



I have the suspicion that my index might contain orphaned child documents,
because a query restricting on a child-document field returns two parent
documents where I only expect one document to match the query. As I cannot
figure out any obvious reason why the second document is returned, I suspect
something is going wrong elsewhere. (See the query link and the result in
very small font at the end of the mail.)



Therefore I would like to know whether there is a simple way to find out if my 
index contains orphaned child documents?



In my index I have parent documents which are marked through the field 
"type_s:wemi", and I have child documents (amongst others) marked through the 
field "type:cat_title". They share the same ID in a field called "wemiId".



So I guess I would have to phrase a query like “are there any documents with a 
type_s other than wemi for which there are no documents with type wemi having 
the same wemiId?”



If you need further information I am happy to provide, thanks for your help!



Sebastian





Query in multiple formats:



http://localhost:8983/solr/wemi/select?q=*:*&fq=client_id:1&fq=cat_db_id:4294967297&fq=m_id_l:[*
 TO *]&fq=(type_s:wemi AND {!parent which='type_s:wemi'v='(((type_s:cat_title 
AND titles_name_t_ns:("Neuland unter den 
Sandalen"'})&start=0&rows=15&wt=json&indent=true



http://localhost:8983/solr/wemi/select?q=*%3A*&fq=client_id%3A1&fq=cat_db_id%3A4294967297&fq=m_id_l%3A%5B*+TO+*%5D&fq=(type_s%3Awemi+AND+%7B!parent+which%3D%27type_s%3Awemi%27v%3D%27(((type_s%3Acat_title+AND+titles_name_t_ns%3A(%22Neuland+unter+den+Sandalen%22%27%7D)&start=0&rows=15&wt=json&indent=true



start=0

&rows=15

&fq=client_id:1

&fq=cat_db_id:4294967297

&fq=m_id_l:[* TO *]

&fq=(type_s:wemi AND {!parent which='type_s:wemi'v='(((type_s:cat_title AND 
titles_name_t_ns:("Neuland unter den Sandalen"'})

&q=*:*

&facet=true

&facet.missing=true

&facet.mincount=1

&group=true

&group.facet=true

&group.ngroups=true

&group.field=m_id_l

&sort=m_id_l desc

&facet.field={!ex=m_mt_0 key=m_mt_0}m_mediaType_lang_2_s



Result of the query:

(to verify that the result is strange, look for the text “Neuland unter den 
Sandalen”, which seems to only occur in one of the two documents)



{

  "responseHeader":{

"status":0,

"QTime":15,

"params":{

  "q":"*:*",

  "indent":"true",

  "start":"0",

  "fq":["client_id:1",

"cat_db_id:4294967297",

"m_id_l:[* TO *]",

"(type_s:wemi AND {!parent which='type_s:wemi'v='(((type_s:cat_title 
AND titles_name_t_ns:(\"Neuland unter den Sandalen\"'})"],

  "rows":"15",

  "wt":"json"}},

  "response":{"numFound":2,"start":0,"docs":[

  {

"type_s":"wemi",

"text":["wemi",

  "4294985955",

  "Work",

  "Werk",

  "Opera",

  "",

  "",

  "Neuland unter den Sandalen ; Müller, Christoph",

  "Müller, Christoph",

  "Neuland unter den Sandalen",

  "4294984086",

  "Neuland unter den Sandalen",

  "Expression",

  "Expression",

  "Espressione",

  "",

  "",

  "Neuland unter den Sandalen",

  "German",

  "Deutsch",

  "Tedesco",

  "German",

  "German",

  "TEXT",

  "4294985990",

  "Neuland unter den Sandalen ; Müller, Christoph",

  "Neuland unter den Sandalen",

  "Book",

  "Buch",

  "Libro",

  "",

  "",

  "Müller, Christoph",

  "Verlagsangaben Angaben aus der Verlagsmeldung \n\n \n\n  Bete, 
arbeite und brich auf! : Ein Benediktiner auf dem Jakobsweg / von Christoph 
Müller \n\n \nWas ein Ordensmann auf dem Jakobsweg erlebt: \nZum \"Ora et 
Labora\" gesellt sich bei Benediktinerpater Christoph das Pilgern hinzu. 
Zunächst per Fahrrad, später auf Schusters Rappen, erlebt er Freud- und 
Leidvolles bis Santiago. Gute Beobachtungsgabe, Sinn für Situationskomik und 
die benediktinische Spiritualität, die immer wieder durchscheint, machen diesen 
Pilgerbericht zu einem niveauvollen Leseerlebnis.",

  "1",

  "UNSPECIFIED",

  "Christoph Müller",

  "UNMEDIATED",

  "Ill., Kt.",

  "German",

  "Deutsch",

  "Tedesco",

  "German",

  "German",

  "205 S.",

  "4294985812",

  "4294985990",

  "4294967297",

  "2016-05-10T00:00:00Z",

  "Mü",

  "18449",

  "false",

  "1",

 

How to find out if index contains orphaned child documents

2016-05-10 Thread Sebastian Riemer
Hi all,



I have the suspicion that my index might contain orphaned child documents,
because a query restricting on a child-document field returns two parent
documents where I expect only one to match. As I cannot figure out any obvious
reason why the second document is returned, I suspect something is going wrong
elsewhere. (See the query link and the result at the end of this mail.)



Therefore I would like to know: is there a simple way to find out whether my
index contains orphaned child documents?



In my index I have parent documents, marked by the field "type_s:wemi", and
child documents (amongst others), marked by the field "type:cat_title". They
share the same ID in a field called "wemiId".



So I guess I would have to phrase a query like: "are there any documents with a
type_s other than wemi for which there is no document with type_s wemi having
the same wemiId?"
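
A sketch of such a check using the join query parser (field names taken from
the description above; this assumes wemiId is indexed and that the query runs
against a single, non-distributed core): the {!join} clause matches every
document whose wemiId also occurs on a type_s:wemi document, so excluding the
parents themselves and negating the join leaves exactly the orphans:

http://localhost:8983/solr/wemi/select?q=*:*&rows=0
  &fq=-type_s:wemi
  &fq=-_query_:"{!join from=wemiId to=wemiId}type_s:wemi"

If numFound is greater than zero, orphaned child documents exist. (The _query_
magic field is used because a {!...} parser switch is only recognised at the
very start of a parameter, not behind the leading minus; the URL is shown
unencoded for readability.)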



If you need further information I am happy to provide it. Thanks for your help!



Sebastian





Query in multiple formats:



http://localhost:8983/solr/wemi/select?q=*:*&fq=client_id:1&fq=cat_db_id:4294967297&fq=m_id_l:[*
 TO *]&fq=(type_s:wemi AND {!parent which='type_s:wemi'v='(((type_s:cat_title 
AND titles_name_t_ns:("Neuland unter den 
Sandalen"'})&start=0&rows=15&wt=json&indent=true



http://localhost:8983/solr/wemi/select?q=*%3A*&fq=client_id%3A1&fq=cat_db_id%3A4294967297&fq=m_id_l%3A%5B*+TO+*%5D&fq=(type_s%3Awemi+AND+%7B!parent+which%3D%27type_s%3Awemi%27v%3D%27(((type_s%3Acat_title+AND+titles_name_t_ns%3A(%22Neuland+unter+den+Sandalen%22%27%7D)&start=0&rows=15&wt=json&indent=true



start=0
&rows=15
&fq=client_id:1
&fq=cat_db_id:4294967297
&fq=m_id_l:[* TO *]
&fq=(type_s:wemi AND {!parent which='type_s:wemi'v='(((type_s:cat_title AND titles_name_t_ns:("Neuland unter den Sandalen"'})
&q=*:*
&facet=true
&facet.missing=true
&facet.mincount=1
&group=true
&group.facet=true
&group.ngroups=true
&group.field=m_id_l
&sort=m_id_l desc
&facet.field={!ex=m_mt_0 key=m_mt_0}m_mediaType_lang_2_s



Result of the query:

(To verify that the result is strange, look for the text "Neuland unter den
Sandalen", which seems to occur in only one of the two documents.)



{
  "responseHeader":{
    "status":0,
    "QTime":15,
    "params":{
      "q":"*:*",
      "indent":"true",
      "start":"0",
      "fq":["client_id:1",
        "cat_db_id:4294967297",
        "m_id_l:[* TO *]",
        "(type_s:wemi AND {!parent which='type_s:wemi'v='(((type_s:cat_title AND titles_name_t_ns:(\"Neuland unter den Sandalen\"'})"],
      "rows":"15",
      "wt":"json"}},
  "response":{"numFound":2,"start":0,"docs":[
      {
        "type_s":"wemi",
        "text":["wemi",
          "4294985955",
          "Work",
          "Werk",
          "Opera",
          "",
          "",
          "Neuland unter den Sandalen ; Müller, Christoph",
          "Müller, Christoph",
          "Neuland unter den Sandalen",
          "4294984086",
          "Neuland unter den Sandalen",
          "Expression",
          "Expression",
          "Espressione",
          "",
          "",
          "Neuland unter den Sandalen",
          "German",
          "Deutsch",
          "Tedesco",
          "German",
          "German",
          "TEXT",
          "4294985990",
          "Neuland unter den Sandalen ; Müller, Christoph",
          "Neuland unter den Sandalen",
          "Book",
          "Buch",
          "Libro",
          "",
          "",
          "Müller, Christoph",
          "Verlagsangaben Angaben aus der Verlagsmeldung \n\n \n\n  Bete,
          arbeite und brich auf! : Ein Benediktiner auf dem Jakobsweg / von Christoph
          Müller \n\n \nWas ein Ordensmann auf dem Jakobsweg erlebt: \nZum \"Ora et
          Labora\" gesellt sich bei Benediktinerpater Christoph das Pilgern hinzu.
          Zunächst per Fahrrad, später auf Schusters Rappen, erlebt er Freud- und
          Leidvolles bis Santiago. Gute Beobachtungsgabe, Sinn für Situationskomik und
          die benediktinische Spiritualität, die immer wieder durchscheint, machen diesen
          Pilgerbericht zu einem niveauvollen Leseerlebnis.",
          "1",
          "UNSPECIFIED",
          "Christoph Müller",
          "UNMEDIATED",
          "Ill., Kt.",
          "German",
          "Deutsch",
          "Tedesco",
          "German",
          "German",
          "205 S.",
          "4294985812",
          "4294985990",
          "4294967297",
          "2016-05-10T00:00:00Z",
          "Mü",
          "18449",
          "false",
          "1",
          "Available",
          "Verfügbar",
          "Disponibile",
          "",
          "",
          "true",
          "http://"],
        "wemiId":"4294985955429498408642949859904294985812",
        "id":"4294985955429498408642949859904294985812",
        "w_id_l":4294985955,
        "w_mediaType_lang_1_s":"Work",
        "w_mediaT


Re: Re:Re: solrcloud performance problem

2016-05-10 Thread Toke Eskildsen
On Tue, 2016-05-10 at 15:33 +0800, lltvw wrote:
> Which log do you mean, the console log or something else? The version I am
> using is 4.10.

There should be a solr.log somewhere. If you have not changed the
default log levels, it should log all queries.
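
A query entry in a 4.x solr.log looks roughly like this (the values below are
made up for illustration):

INFO  - 2016-05-10 15:33:12.345; org.apache.solr.core.SolrCore; [collection1] webapp=/solr path=/select params={q=*:*&rows=10} hits=9000 status=0 QTime=287

The params={...} part shows the full request, and QTime is the time spent on
that core in milliseconds.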


- Toke Eskildsen, State and University Library, Denmark




Re: Facet ignoring repeated word

2016-05-10 Thread Toke Eskildsen
On Fri, 2016-04-29 at 08:55 +, G, Rajesh wrote:
> I am trying to implement a word cloud using Solr. The problem I have is that
> the Solr facet query ignores repeated words in a document, e.g.

Use a combination of faceting and stats:

1) Resolve candidate words with faceting, just as you have already done.

2) Create a stats-request with the same q as you used for faceting, with
a termfreq-function for each term in your facet result.


Working example from the techproducts-demo that comes with Solr:

http://localhost:8983/solr/techproducts/select
?q=name%3Addr
&fl=name&wt=json&indent=true
&stats=true
&stats.field={!sum=true%20func}termfreq(%27name%27,%20%27ddr%27)
&stats.field={!sum=true%20func}termfreq(%27name%27,%20%271GB%27)

where 'name' is the field ('comments' in your setup) and 'ddr' and '1GB'
are two terms ('absorbed', 'am', 'believe' etc. in your setup).


The result will be something like

"response": {
"numFound": 3,
...
"stats": {
"stats_fields": {
  "termfreq('name', 'ddr')": {
"sum": 6
  },
  "termfreq('name', '1GB')": {
"sum": 3
  }
}
  }


- Toke Eskildsen, State and University Library, Denmark




Re:Re: solrcloud performance problem

2016-05-10 Thread lltvw
Hi Toke,

Which log do you mean, the console log or something else? The version I am
using is 4.10.




--
Sent from my NetEase Mail mobile client


On 2016-05-10 14:42:34, "Toke Eskildsen"  wrote:
>On Tue, 2016-05-10 at 00:41 +0800, lltvw wrote:
>> Recently we set up a 4.10 solrcloud env with about 9000 docs indexed
>> in it. This solrcloud has 12 shards, each shard on a separate
>> machine, but when we try to search for some info on solrcloud, the
>> response time is about 300ms.
>
>Could you provide us with a sample request? Preferably taken from the
>log of one of the shards, so that we also get timing. There will
>probably be 2 entries in the log for each request you issue. This will
>make it easier for us to check if you have some of the typical problems,
>such as very high rows or facet.limit.
>
>- Toke Eskildsen, State and University Library, Denmark
>
>


RE: Solr edismax field boosting

2016-05-10 Thread Megha Bhandari
Thanks Nick, I got the response formatted. We are using Solr 5.5.
I am not able to understand why it is ignoring the boosts completely. What
configuration are we missing? As you correctly pointed out, it is only
calculating scores based on the _text_ field.
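
One way to check whether the boosts reach the query parser at all is to add
debugQuery=true to the request shown below and inspect the parsedquery entry
in the debug section (a sketch, reusing the collection and qf from that query):

http://10.203.101.42:8983/solr/uhc/select?defType=edismax&mm=1&q=upendra&qf=h1^9.0%20_text_^1.0&wt=ruby&debugQuery=true

With edismax, parsedquery should contain a DisjunctionMaxQuery spanning both
h1^9.0 and _text_; if h1 does not appear there, the field is likely not indexed
or not populated, which would explain scoring falling back to _text_ alone.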

Query:
http://10.203.101.42:8983/solr/uhc/select?defType=edismax&indent=on&mm=1&q=upendra&qf=h1^9.0%20_text_^1.0&wt=ruby&debug=true
 

Response with debug on:
{
  'responseHeader'=>{
'status'=>0,
'QTime'=>6,
'params'=>{
  'mm'=>'1',
  'q'=>'upendra',
  'defType'=>'edismax',
  'debug'=>'true',
  'indent'=>'on',
  'qf'=>'h1^9.0 _text_^1.0',
  'wt'=>'ruby'}},
  'response'=>{'numFound'=>6,'start'=>0,'maxScore'=>0.14641379,'docs'=>[
  {
'h2'=>['Looks like your browser is a little out-of-date.'],
'h3'=>['Already a member?'],
'strtitle'=>['I m increasiing the the page title content Upendra 
Custon'],
'id'=>'http://localhost:4503/baseurl/upendra-custon.html',
'tstamp'=>'2016-05-10T05:50:22.316Z',
'metataghideininternalsearch'=>false,
'metatagtopresultthumbnailalt'=>',',
'segment'=>[20160510112017],
'digest'=>['fb988351afceb26a835fba68e2bcc33f'],
'boost'=>[1.4142135],
'lang'=>'en',
'metatagkeywords'=>[','],
'_version_'=>1533919301006786560,
'host'=>'localhost',
'url'=>'http://localhost:4503/baseurl/upendra-custon.html',
'score'=>0.14641379},
  {
'metatagdescription'=>['test'],
'h1'=>['Upendra'],
'h2'=>['Looks like your browser is a little out-of-date.'],
'h3'=>['Already a member?'],
'strtitle'=>['health care body content'],

'id'=>'http://localhost:4503/baseurl/upendra-custon/care-body-content.html',
'tstamp'=>'2016-05-10T05:50:22.269Z',
'metataghideininternalsearch'=>false,
'metatagtopresultthumbnailalt'=>',',
'segment'=>[20160510112017],
'digest'=>['dd4ef8879be2d4d3f28e24928e9b84c5'],
'boost'=>[1.4142135],
'lang'=>'en',
'metatagkeywords'=>[','],
'_version_'=>1533919301071798272,
'host'=>'localhost',

'url'=>'http://localhost:4503/baseurl/upendra-custon/care-body-content.html',
'score'=>0.13738367},
  {
'metatagdescription'=>['test'],
'h1'=>['health care keyword'],
'h2'=>['Looks like your browser is a little out-of-date.'],
'h3'=>['Already a member?'],
'strtitle'=>['health care keyword'],
'id'=>'http://localhost:4503/baseurl/upendra-custon/care-keyword.html',
'tstamp'=>'2016-05-10T05:50:22.300Z',
'metataghideininternalsearch'=>false,
'metatagtopresultthumbnailalt'=>',',
'segment'=>[20160510112017],
'digest'=>['4af11065d604bcec7aa4cbc1cf0fca59'],
'boost'=>[1.4142135],
'lang'=>'en',
'metatagkeywords'=>['upendra,upendra'],
'_version_'=>1533919301088575488,
'host'=>'localhost',
'url'=>'http://localhost:4503/baseurl/upendra-custon/care-keyword.html',
'score'=>0.13738367},
  {
'metatagdescription'=>['test'],
'h1'=>['Health care'],
'h2'=>['Looks like your browser is a little out-of-date.'],
'h3'=>['Already a member?'],
'strtitle'=>['This is the page Title Upendra, lets do the testing'],
'id'=>'http://localhost:4503/baseurl/upendra-custon/care.html',
'tstamp'=>'2016-05-10T05:50:22.518Z',
'metataghideininternalsearch'=>false,
'metatagtopresultthumbnailalt'=>',,,',
'segment'=>[20160510112017],
'digest'=>['711a059f2a05a6c03e59d490cd7008ff'],
'boost'=>[1.4142135],
'lang'=>'en',
'metatagkeywords'=>[',,,'],
'_version_'=>1533919301088575489,
'host'=>'localhost',
'url'=>'http://localhost:4503/baseurl/upendra-custon/care.html',
'score'=>0.13286635},
  {
'metatagdescription'=>['Upendra decription testing'],
'h1'=>['care description'],
'h2'=>['Looks like your browser is a little out-of-date.'],
'h3'=>['Already a member?'],
'strtitle'=>['care description'],

'id'=>'http://localhost:4503/baseurl/upendra-custon/care-description.html',
'tstamp'=>'2016-05-10T05:50:22.362Z',
'metataghideininternalsearch'=>false,
'metatagtopresultthumbnailalt'=>',',
'segment'=>[20160510112017],
'digest'=>['6262795db6aed05a5de7cc3cbe496401'],
'boost'=>[1.4142135],
'lang'=>'en',
'metatagkeywords'=>[','],
'_version_'=>1533919301088575490,
'host'=>'localhost',

'url'=>'http://localhost:4503/baseurl/upendra-custon/care-description.html',
'score'=>0.13053702},
  {
'metatagdescription'=>['test'],
'h1'=>['care without'],
'h2'=>['Looks like your browser is a little out-of-date.'],
'h3'=>['Already a member?'],
'strtitle'=>['care w