Re: Solrj API with Basic Authentication

2016-05-11 Thread shamik
Ok, I found another way of doing it which preserves the QueryResponse
object. I've used DefaultHttpClient, set the credentials, and finally passed
it as a constructor argument to the CloudSolrClient.

DefaultHttpClient httpclient = new DefaultHttpClient();
UsernamePasswordCredentials defaultcreds =
    new UsernamePasswordCredentials(USER, PASSWORD);
httpclient.getCredentialsProvider().setCredentials(AuthScope.ANY, defaultcreds);
SolrClient client = new CloudSolrClient("127.0.0.1:9983", httpclient);
((CloudSolrClient) client).setDefaultCollection("gettingstarted");
ModifiableSolrParams param = getSearchSolrQuery();
try {
  QueryResponse res = client.query(param);
  // facets
  List<FacetField> fieldFacets = res.getFacetFields();
  // results
  SolrDocumentList docs = res.getResults();
  // spelling
  SpellCheckResponse spellCheckResponse = res.getSpellCheckResponse();
} catch (Exception ex) {
  ex.printStackTrace();
} finally {
  try {
    client.close();
  } catch (IOException e) {
    e.printStackTrace();
  }
}

Just wanted to know if this is recommended?
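
For comparison, a sketch of the same idea using SolrJ's own HttpClientUtil.
The PROP_BASIC_AUTH_USER / PROP_BASIC_AUTH_PASS parameter names are an
assumption about this SolrJ version, so worth verifying before relying on it:

ModifiableSolrParams authParams = new ModifiableSolrParams();
authParams.set(HttpClientUtil.PROP_BASIC_AUTH_USER, USER);
authParams.set(HttpClientUtil.PROP_BASIC_AUTH_PASS, PASSWORD);
// HttpClientUtil builds an HttpClient with the credentials already wired in
HttpClient httpclient = HttpClientUtil.createClient(authParams);
SolrClient client = new CloudSolrClient("127.0.0.1:9983", httpclient);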





Re: Error

2016-05-11 Thread Midas A
Thanks for replying.

PERFORMANCE WARNING: Overlapping onDeckSearchers=2
One more warning is coming; please suggest a fix for this as well.

On Wed, May 11, 2016 at 7:53 PM, Ahmet Arslan 
wrote:

> Hi Midas,
>
> It looks like you are committing too frequently, cache warming cannot
> catchup.
> Either lower your commit rate, or disable cache auto warm
> (autowarmCount=0).
> You can also remove queries registered at newSearcher event if you have
> defined some.
>
> Ahmet
>
>
>
> On Wednesday, May 11, 2016 2:51 PM, Midas A  wrote:
> Hi i am getting following error
>
> org.apache.solr.common.SolrException: Error opening new searcher.
> exceeded limit of maxWarmingSearchers=2, try again later.
>
>
>
> What should I do to remove it?
>
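
For reference, a minimal sketch of Ahmet's second suggestion in
solrconfig.xml; the cache classes and sizes here are illustrative defaults,
not Midas's actual settings:

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>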


Re: How to self join a collection with SOLR and have another condition

2016-05-11 Thread Mikhail Khludnev
..&fq=aaa:1 bbb:2&..
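
Spelled out against the query quoted below, the full request would be (this
relies on the default OR operator; make it explicit if q.op is set to AND):

/solr/collection1/select?fl=xxx,yyy&q={!join from=inner_id to=outer_id}zzz:vvv&fq=aaa:1 OR bbb:2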

On Wed, May 11, 2016 at 11:34 PM, baggadonuts  wrote:

> Refer to the following documentation: https://wiki.apache.org/solr/Join
>
> According to the documentation the SOLR equivalent of this SQL query:
>
> SELECT xxx, yyy
> FROM collection1
> WHERE outer_id IN (SELECT inner_id FROM collection1 where zzz = "vvv")
>
> is this:
>
> /solr/collection1/select ? fl=xxx,yyy & q={!join from=inner_id
> to=outer_id}zzz:vvv
>
> Basically the SQL equivalent of what I'd like to do is:
>
> SELECT xxx, yyy
> FROM collection1
> WHERE (aaa = "1" OR bbb = "2")
> AND outer_id IN (SELECT inner_id FROM collection1 where zzz =
> "vvv")
>
> Is it possible to do this query in SOLR?
>
>
>
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Solrj API with Basic Authentication

2016-05-11 Thread Shamik Bandopadhyay
Hi,

  I'm looking into the option of adding basic authentication using Solrj
API. Currently, I'm using the following code for querying Solr.

SolrClient client = new CloudSolrClient("127.0.0.1:9983");
((CloudSolrClient) client).setDefaultCollection("gettingstarted");
ModifiableSolrParams param = getSearchSolrQuery();
try {
  QueryResponse res = client.query(param);
  // facets
  List<FacetField> fieldFacets = res.getFacetFields();
  // results
  SolrDocumentList docs = res.getResults();
  // spelling
  SpellCheckResponse spellCheckResponse = res.getSpellCheckResponse();
} catch (Exception ex) {
  ex.printStackTrace();
} finally {
  try {
    client.close();
  } catch (IOException e) {
    e.printStackTrace();
  }
}

The QueryResponse object is well-constructed and provides clean APIs to
parse the result.

Now, to use the basic authentication, we need to use a SolrRequest object
instead.

SolrClient client = new CloudSolrClient("127.0.0.1:9983");
((CloudSolrClient) client).setDefaultCollection("gettingstarted");
ModifiableSolrParams param = getSearchSolrQuery();
SolrRequest solrRequest = new QueryRequest(param);
solrRequest.setBasicAuthCredentials(USER, PASSWORD);
try {
  NamedList<Object> results = client.request(solrRequest);
  for (int i = 0; i < results.size(); i++) {
    System.out.println("RESULTS: " + i + " " + results.getName(i) + " : "
        + results.getVal(i));
  }
} catch (Exception ex) {
  ex.printStackTrace();
} finally {
  try {
    client.close();
  } catch (IOException e) {
    e.printStackTrace();
  }
}

Since my existing APIs use QueryResponse res = client.query(param), moving
to NamedList<Object> results = client.request(solrRequest) translates to a
bunch of code changes. Moreover, by using SolrRequest, I lose the delete and
getById convenience methods, which don't accept a SolrRequest object.

Just wondering if there's another way to use Basic Authentication, perhaps
set at the ModifiableSolrParams level. Ideally, I would like to retain the
QueryResponse or UpdateResponse objects instead.
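
One candidate I haven't verified yet, assuming SolrRequest.process() returns
the typed response in this SolrJ version, would be to keep the typed request
and call process() on it instead of client.request():

QueryRequest req = new QueryRequest(param);
req.setBasicAuthCredentials(USER, PASSWORD);
// process() runs the request through the client and parses the reply into
// a typed QueryResponse, so the existing parsing code can stay unchanged
QueryResponse res = req.process(client);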

Any pointers will be appreciated.

-Thanks,
Shamik


Re: Nested grouping or equivalent.

2016-05-11 Thread Erick Erickson
A couple of ideas. If this is 5x consider Streaming Aggregation.
The idea here is that you stream the docs back to a SolrJ client and
slice and dice them there. SA is designed to export 400K docs/sec,
but the returned values must be DocValues (i.e. no text types, strings
are OK).

Have you seen the CollapsingQParserPlugin? That might help.
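
For example, a sketch of collapsing variants on a shared item field (the
field name item_id here is an assumption about your schema):

q=*:*&fq={!collapse field=item_id}&sort=some_field asc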

Or push back at the product manager and say "why are we wasting
time supporting something nobody uses?" ;)

Best,
Erick

On Wed, May 11, 2016 at 1:45 AM, Callum Lamb  wrote:
> We have a horrible Solr query that groups by a field and then sorts by
> another. My understanding is that for this to happen it has to sort by the
> grouping field, group it and then sort the resulting result set. It's not a
> fast query.
>
> Unfortunately our documents now need to be grouped as well (product
> variants into items) and that grouping query needs to work on that grouping
> instead. As far as I'm aware you can't do nested grouping in Solr.
>
> In summary we want to have product variants that get grouped into Items and
> then they get grouped by field and then sorted by another.
>
> The solution doesn't need to be fast; it's a rarely used legacy part
> of our application, and we just need it to work.
> Our dataset isn't huge so it doesn't matter if Solr has to scan the entire
> index (I think the query does this atm anyway). But downloading the entire
> document set and doing the operations in ETL isn't something we really want
> to dedicate time to unless it's impossible to represent this in Solr
> queries.
>
> Any ideas?
>
> Cheers,
>
> Callum.
>
>


Re: Shard data using lat long range

2016-05-11 Thread Erick Erickson
Personally I'd just let it do the default "hash  modulo #shards".

I don't see how you could shard based on location and I don't know
why you'd want to. Let's say you have some kind of restriction like
"we'll never return a doc from any state except the one our location is in".
So you'd have your
Michigan shard, your Ohio shard, your California shard etc. When
you run your query, you'd be concentrating _all_ of the computations on
exactly one shard rather than having N shards do the computation.

Best,
Erick

On Wed, May 11, 2016 at 12:32 AM, chandan khatri
 wrote:
> Hi All,
>
> I've an application that has location based data. The data is expected to
> grow rapidly and the search is also based on the location i.e the search is
> done using the geospatial distance range.
>
> I am wondering what is the best possible way to shard the index. Any
> pointer/input is highly appreciated.
>
> Thanks,
> Chandan


Re: backups of analyzingInfixSuggesterIndexDir

2016-05-11 Thread Erick Erickson
Well, it can always be rebuilt from the backed-up index. That suggester
reads the _stored_ fields from the docs to build up the suggester
index. With a lot of documents that could take a very long time though.

If you desperately need it, AFAIK you'll have to back it up whenever
you build it, I'm afraid.

Best,
Erick

On Wed, May 11, 2016 at 8:30 AM, Oakley, Craig (NIH/NLM/NCBI) [C]
 wrote:
> I have a client whose Solr installation creates an
> analyzingInfixSuggesterIndexDir directory besides index and tlog. I notice
> that this analyzingInfixSuggesterIndexDir is not included in backups (created 
> by replication?command=backup). Is there a way to include this? Or does it 
> not need to be backed-up?
>
> I haven't needed this yet, but wanted to ask before I find that I might need 
> it.


Re: Edismax field boosting behavior for null values

2016-05-11 Thread Erick Erickson
Fields that don't match for a particular document just don't contribute to the
score. The boost is multiplied into the score calculated for that field and
term. So if for doc1 the calculated score is 5 and you boost by 2, the result is
10. If doc2 has a calculated score of 20 and you boost by 1, its score is
higher.

For all the messy details, try adding
&debug=all&debug.explain.structured=true

Best,
Erick

On Wed, May 11, 2016 at 10:47 AM, Megha Bhandari  wrote:
> Correcting typo in original post and making it a little clearer
>
> Hi
>
> Can someone help us understand how null values affect boosting.
>
> Say we have field_1 (with boost ^10.1) and field_2 (with boost ^9.1).
> We search for foo.
> Document A: field_1 does not exist; field_2 matches the search term.
> Document B: field_1 matches the search term; field_2 is an empty string.
> As per our understanding the result should be Document B, Document A.
> However what we are getting is Document A, Document B.
>
> Below is a detailed description of the above problem with our business use 
> case and configurations.
>
> Use case: Promote documents as per the following priority of fields, i.e.
> Keywords > meta description > Title > H1 > H2 > H3 > body content
>
> For this we have indexed the above fields as
> <field name="metatag.keywords" ... indexed="true" stored="true"/>
> <field name="metatag.description" ... indexed="true" stored="true"/>
> <field name="title" ... indexed="true" stored="true"/>
> <field name="h1" ... indexed="true" stored="true"/>
> <field name="h2" ... indexed="true" stored="true"/>
> <field name="h3" ... indexed="true" stored="true"/>
>
> and used the eDisMax query parser and set boosting as
> <str name="defType">edismax</str>
> <str name="qf">metatag.keywords^100.1 metatag.description^50.1 title^20.1
> h1^4.7 h2^3.6 h3^2.5 h4^1.4 id^0.01 _text_^0.001</str>
>
> The above is working fine for documents that have an entry for all fields,
> e.g. all pages have keywords, meta description and so on, even though the
> entry might just be an empty string. So if the search matches only pages,
> the results come back as expected.
>
> However, for documents that don't have keywords, e.g. PDFs only have meta
> description, title and _text_, the results are skewed. PDFs are coming right
> at the top even though we have a page with the search term in the keyword field.
>
> To fix this anomaly we came up with the following boosting (notice the very
> large boost values):
>
> <str name="defType">edismax</str>
> <str name="qf">metatag.keywords^10.1 metatag.description^7500.1 title^500.1
> h1^40.7 h2^25.6 h3^15.1 h4^5.4 h5^1.3 h6^1.2 _text_^1.0</str>
>
> I can provide the query debug results for both configurations if required.
>
> Thanks for any help in understanding this.
>
>
> -Original Message-
> From: Megha Bhandari [mailto:mbhanda...@sapient.com]
> Sent: Wednesday, May 11, 2016 11:10 PM
> To: solr-user@lucene.apache.org
> Subject: Edismax field boosting behavior for null values
>
> Hi
>
> Can someone help us understand how null values affect boosting.
>
> Say we have field_1 (with boost ^10.1)  and field_2 (with boost ^9.1).
> We search for foo. Document A has field_1(foo match) and field_2(empty) and 
> Document B has field_2(foo match)  but no field_1.
> As per our understanding the result should be Document A,Document B.
> However what we are getting is Document B,Document A.


Re: How to self join a collection with SOLR and have another condition

2016-05-11 Thread Dennis Gove
If you're able to use Solr 6 then you can use Streaming Expressions to
solve this. The docs for Streaming Expressions in Solr 6 can be found at
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61330338.

One option would be to use an intersect to find documents in both sets.

intersect(
  search(collection1, q="aaa:1 OR bbb:2", fl="xxx,yyy,outer_id",
sort="outer_id asc", qt="/export"),
  search(collection1, q="zzz:vvv", fl="inner_id", sort="inner_id asc",
qt="/export"),
  on="outer_id=inner_id"
)

This will give you all documents from the first query where there is a
matching document in the second query (where matching is defined as
outer_id=inner_id).

- Dennis

On Wed, May 11, 2016 at 4:34 PM, baggadonuts  wrote:

> Refer to the following documentation: https://wiki.apache.org/solr/Join
>
> According to the documentation the SOLR equivalent of this SQL query:
>
> SELECT xxx, yyy
> FROM collection1
> WHERE outer_id IN (SELECT inner_id FROM collection1 where zzz = "vvv")
>
> is this:
>
> /solr/collection1/select ? fl=xxx,yyy & q={!join from=inner_id
> to=outer_id}zzz:vvv
>
> Basically the SQL equivalent of what I'd like to do is:
>
> SELECT xxx, yyy
> FROM collection1
> WHERE (aaa = "1" OR bbb = "2")
> AND outer_id IN (SELECT inner_id FROM collection1 where zzz =
> "vvv")
>
> Is it possible to do this query in SOLR?
>
>
>
>


Re: Cannot comment on Jira anymore

2016-05-11 Thread Erick Erickson
I just added you to the contributors group, you should be able to post now.

On Wed, May 11, 2016 at 4:22 PM, Chris Hostetter
 wrote:
>
> If you re-load the jira you should see at the top this message...
>
> ---
> Jira is in Temporary Lockdown mode as a spam countermeasure. Only
> logged-in users with active roles (committer, contributor, PMC, etc.) will
> be able to create issues or comments during this time. Lockdown period
> from 11 May 2300 UTC to estimated 12 May 2300 UTC.
> ---
>
>
> : Date: Wed, 11 May 2016 23:37:01 +0100
> : From: Arcadius Ahouansou 
> : Reply-To: solr-user@lucene.apache.org
> : To: solr-user 
> : Subject: Cannot comment on Jira anymore
> :
> : Hello.
> :
> : Somehow, I am no longer able to comment on Solr Jira tickets.
> :
> : When I go to https://issues.apache.org/jira/browse/SOLR-7963
> : I am logged in... I can edit the ticket, but there is no comment box or
> : comment button visible.
> :
> : Any help would be very appreciated.
> :
> : Thank you very much.
> :
> : --
> : Arcadius Ahouansou
> : Menelic Ltd | Applied Knowledge Is Power
> : M: 07908761999
> : W: www.menelic.com
> : ---
> :
>
> -Hoss
> http://www.lucidworks.com/


Re(2): [scottchu] What kind of configuration to use for this size of news data?

2016-05-11 Thread scott.chu

Hi Shawn,
Thanks for the suggestion about Zookeeper. As for the 'buyout', I think the
misunderstanding is my fault since my description was kinda vague. Actually, the
'buyout' is requested by special customers. Their budget can only buy a special
service that "must be" (not just "be able to be") installed as a whole system,
including s/w and h/w, on their site. The license is not about Solr. It's about
this whole service. Our biz manager wants to win that budget and it's been
approved by the boss, so he asked us to provide such a service.
scott.chu,scott@udngroup.com
2016/5/12 (Thu)
- Original Message - 
From: Shawn Heisey 
To: solr-user 
CC: 
Date: 2016/5/12 (Thu) 07:03
Subject: Re: [scottchu] What kind of configuration to use for this size 
of news data?


On 5/11/2016 3:55 AM, scott.chu wrote: 
> ** 
> If I use SolrCloud, I know I have to setup Zookeeper. I know there're 
> something called 'quorum' or 'ensemble' in Zookeeper terminologies. I 
> also know there is a need for (2n+1) Zookeeper nodes per n SolrCloud 
> nodes. Is your case running one SolrCloud node per one machine 
> (Whether PM or VM). According to your experiences, how many nodes , 
> including SolrCloud's and Zookeeper's, do I need to setup? Is 
> Replication in SolrCloud easy to setup as that in old version? (I 
> setup replication solrconfig.xml and use solrcore.properties file to 
> setup/switch roles in Solr node, rather than defining role directly in 
> solrconfig.xml) 
> ** 

No, you do not need that many Zookeeper nodes. You need three zookeeper 
nodes. The only reason to add more zookeeper nodes is to handle more ZK 
failures. I cannot think of any reason for anybody to install more than 
five zookeeper nodes in a single ZK ensemble. Five nodes will sustain a 
failure of two nodes, which means that you can take a node down for 
maintenance, and *still* survive if another machine dies during maintenance. 

Master-Slave replication is incompatible with SolrCloud, because 
SolrCloud uses replication for disaster recovery operations. 

> ** 
> We have a special biz case called 'buyout newspaper search service'. 
> Customers buy intranet license to use search service for articles of some 
> newspaper types and some range of publish dates, e.g. paper type 'A' for 
> 2010-2012 and paper type 'B' for 2015. The buyout means we have to install 
> who search service at customer site and customer can only use search service 
> within their enterprise intranet environment. So you know, I have to build a 
> special Solr server for each of such customers. Your idea of filtering is 
> very much like ElasticSearch's multitenancy, which both are not fit in our 
> buyout biz model. Do you have any suggestion for building Solr server in such 
> condition? 
> ** 

Solr does not include any kind of licensing system. It will always 
work, and will usually allow somebody to search the entire index. To do 
otherwise would be contrary to the general goal of open source projects. 

If you need the service to expire or be limited in some way according to 
license terms, you will need to write software to handle that ... but be 
aware that if the users know anything about Solr, and they can get 
access to the machine's filesystem, they will be able to get around most 
such restrictions simply by copying the index config and data to another 
machine. You would need to embed the restrictions into custom Lucene 
code (perhaps encrypting the index and obfuscating search terms in some 
way) to make it more difficult to defeat your licensing system. 

Thanks, 
Shawn 





Re(2): [scottchu] What kind of configuration to use for this size of news data?

2016-05-11 Thread scott.chu

Originally, I wanted to experiment with both master-slave and SolrCloud on my
PC but wanted to save the time of installing another Solr server. If I have to
do that, I think I have to change the default port for the 2nd Solr server, right?

However, after reading the mails from Toke and Charlie, I've decided to delve
into SolrCloud. I have no practical experience with SolrCloud, and Zookeeper
also looks like unknown territory for me, but it really looks like a good
choice for my job.

scott.chu,scott@udngroup.com
2016/5/12 (Thu)
- Original Message - 
From: Shawn Heisey 
To: solr-user 
CC: 
Date: 2016/5/11 (Wed) 22:38
Subject: Re: [scottchu] What kind of configuration to use for this size of news data?


On 5/10/2016 10:34 PM, scott.chu wrote: 
> A further question: Can master-slave and SolrCloud exist simultaneously in 
> one Solr server? If yes, how can I do it? 

No. SolrCloud uses replication internally for automated recovery on an 
as-needed basis. SolrCloud completely manages multiple replicas of an 
index and user-configured replication is not necessary. 

I do not know what you intend with that combination, but you may want to 
look into Cross-Data-Center-Replication (CDCR) in Solr 6.0. 

Thanks, 
Shawn 





Re(2): [scottchu] What kind of configuration to use for this size of news data?

2016-05-11 Thread scott.chu

Hi Toke and Charlie,

Thanks for sharing your cases and your suggestions. After reading through your
mails, I'll delve into SolrCloud. One thing I'd like to share with everyone on
the mailing list: a Chinese corpus can produce a dramatically large index
depending on which tokenization method is used. The CJK tokenizer falls into
this trap, and synonyms become very hard to establish (for every Chinese word
composed of n Chinese characters, I have to set C(n,2) synonym items, which is
impossible for me to do). Unfortunately, my old Solr 3.5 server uses it.

This time I have two choices for Chinese tokenizing:

1> Algorithmic way: use the standard tokenizer and query with the Chinese word
quoted as a phrase (with double quotes); a fieldType sketch follows after this
list. This keeps the index size reasonable and gives an effect similar to
Google's search. It also makes synonyms practically feasible.
2> Dictionary-oriented way: install mmseg4j (the theory is from Taiwan but the
implementation is from China). The problem is how to maintain an up-to-date
dictionary, especially for "news": news may coin never-before-seen nouns as
time goes by.
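
A minimal fieldType sketch of the algorithmic way (the filter choices here
are assumptions, not my actual schema):

<fieldType name="text_cjk_std" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- StandardTokenizer emits each CJK character as its own token -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>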

You both have experience running a newspaper search site. I'd like to know
which tokenizer you use, and how you maintain an up-to-date dictionary if you
use the 2nd way. I know most Solr tokenizers in China use the 2nd way, so I'm
curious how they keep their dictionaries current. If someone has experience
running a Chinese-corpus Solr server, I'd appreciate it if you're willing to
share your case.

Thanks again and best regards. You guys saved a lot of time on my job ^_^

scott.chu,scott@udngroup.com
2016/5/12 (Thu)
- Original Message - 
From: Toke Eskildsen 
To: solr-user ; scott(自己) 
CC: 
Date: 2016/5/11 (Wed) 18:55
Subject: Re: [scottchu] What kind of configuration to use for this size of news data?


On Wed, 2016-05-11 at 11:27 +0800, scott.chu wrote: 
> I want to build a Solr engine for over 60-year news articles. My 
> requests are (I use Solr 5.4.1): 

Charlie Hull has given you an fine answer, which I agree with fully, so 
I'll just add a bit from our experience. 

We are running a similar service for Danish newspapers. We have 16M 
OCR'ed pages, split into 250M+ articles, for 1.4TB total index size. 
Everything in a single shard on a 64GB machine with SSDs. 

We do faceting, range faceting and grouping as part of basic search. 
That works okay (sub-second response times) for the bulk of our 
requests, but when the hitCount gets above 10M, performance gets poor. 
For the real heavy hitters, basically matching everything, we encounter 
20 second response times. 

This is not acceptable, so we will be switching to SolrCloud and 
multiple shards (on the same machine, as our bottleneck is single 
CPU-core performance). However, you have a smaller corpus and the growth 
rate does not look alarming. 


Putting all this together, I would advice you to try and put everything 
in a single shard to avoid the overhead of distributed search. If that 
performs well enough for single queries, then add replicas with 
SolrCloud to get redundancy and scale throughput. Should you need to 
shard at a later time, this will be easy with SolrCloud. 

- Toke Eskildsen, State and University Library, Denmark 






Re: Dynamically change solr suggest field

2016-05-11 Thread Lasitha Wattaladeniya
Hi Nick,

Thanks for the reply. Given my requirement, I can use only option one. I
thought about that solution, but I was a bit lazy to implement it since I have
many modules and Solr cores. Configuring request handlers for each drop-down
value in each component seems like a lot of work. Anyway, this seems like the
only way forward.

I can't use option two because the combo box selects the field, not a value
specific to a single field.

Best regards,
Lasitha

Lasitha Wattaladeniya
Software Engineer

Mobile : +6593896893
Blog : techreadme.blogspot.com

On Wed, May 11, 2016 at 11:41 PM, Nick D  wrote:

> There are only two ways I can think of to accomplish this, and neither of
> them dynamically sets the suggester field: as far as I can tell from the
> doc (which does sometimes have lacking info, so I might be wrong), you
> cannot set something like *suggest.fl=combo_box_field* at query time. But
> maybe they can help you get started.
>
> 1. Multiple suggester request handlers for each option in combo box. This
> way you just change the request handler in the query you submit based on
> the context.
>
> 2. Use copy fields to put all possible suggestions into same field name, so
> no more dynamic field settings, with another field defining whatever the
> option would be for that document out of the combo box and use context
> filters which can be passed at query time to limit the suggestions to those
> filtered by whats in the combo box.
>
> https://cwiki.apache.org/confluence/display/solr/Suggester#Suggester-ContextFiltering
>
> Hope this helps a bit
>
> Nick
>
> On Wed, May 11, 2016 at 7:05 AM, Lasitha Wattaladeniya 
> wrote:
>
> > Hello devs,
> >
> > I'm trying to implement auto complete text suggestions using solr. I
> have a
> > text box and next to that there's a combo box. So the auto complete
> should
> > suggest based on the value selected in the combo box.
> >
> > Basically I should be able to change the suggest field based on the value
> > selected in the combo box. I was trying to solve this problem whole day
> but
> > not much luck. Can anybody tell me is there a way of doing this ?
> >
> > Regards,
> > Lasitha.
> >
> > Lasitha Wattaladeniya
> > Software Engineer
> >
> > Mobile : +6593896893
> > Blog : techreadme.blogspot.com
> >
>
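
A sketch of what the query side of option 2 could look like; the handler
path, dictionary name, and context value here are all assumptions:

/solr/core1/suggest?suggest=true&suggest.dictionary=mySuggester&suggest.q=lapt&suggest.cfq=comboBoxValue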


How to self join a collection with SOLR and have another condition

2016-05-11 Thread baggadonuts
Refer to the following documentation: https://wiki.apache.org/solr/Join

According to the documentation the SOLR equivalent of this SQL query:

SELECT xxx, yyy
FROM collection1
WHERE outer_id IN (SELECT inner_id FROM collection1 where zzz = "vvv")

is this:

/solr/collection1/select ? fl=xxx,yyy & q={!join from=inner_id
to=outer_id}zzz:vvv

Basically the SQL equivalent of what I'd like to do is:

SELECT xxx, yyy
FROM collection1
WHERE (aaa = "1" OR bbb = "2")
AND outer_id IN (SELECT inner_id FROM collection1 where zzz = "vvv")

Is it possible to do this query in SOLR?





Need Help with Solr 6.0 Cross Data Center Replication

2016-05-11 Thread Satvinder Singh
Hi,

I am trying to configure Cross Data Center Replication using solr 6.0.
I am having issues configuring solrconfig.xml on both the target and source
side. I keep getting the error
"org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Solr instance is not configured with the cdcr update log"


This is my config on the Source:

<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="buffer">
    <str name="defaultState">disabled</str>
  </lst>
</requestHandler>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">cdcr-proc-chain</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="cdcr-proc-chain">
  <processor class="solr.CdcrUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog class="solr.CdcrUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
    <int name="numRecordsToKeep">500</int>
    <int name="maxNumLogsToKeep">20</int>
    <int name="numVersionBuckets">65536</int>
  </updateLog>
</updateHandler>


This is the config on the Target side:

<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="buffer">
    <str name="defaultState">disabled</str>
  </lst>
</requestHandler>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">cdcr-proc-chain</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="cdcr-proc-chain">
  <processor class="solr.CdcrUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog class="solr.CdcrUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
    <int name="numRecordsToKeep">500</int>
    <int name="maxNumLogsToKeep">20</int>
    <int name="numVersionBuckets">65536</int>
  </updateLog>
</updateHandler>

Any help would be great.

Thanks
Satvinder Singh
Security Systems Engineer
satvinder.si...@nc4.com
703.682.6000 x276 direct
703.989.8030 cell
www.NC4.com



Re: Cannot comment on Jira anymore

2016-05-11 Thread Chris Hostetter

If you re-load the jira you should see at the top this message...

---
Jira is in Temporary Lockdown mode as a spam countermeasure. Only 
logged-in users with active roles (committer, contributor, PMC, etc.) will 
be able to create issues or comments during this time. Lockdown period 
from 11 May 2300 UTC to estimated 12 May 2300 UTC. 
---


: Date: Wed, 11 May 2016 23:37:01 +0100
: From: Arcadius Ahouansou 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user 
: Subject: Cannot comment on Jira anymore
: 
: Hello.
: 
: Somehow, I am no longer able to comment on Solr Jira tickets.
: 
: When I go to https://issues.apache.org/jira/browse/SOLR-7963
: I am logged in... I can edit the ticket, but there is no comment box or
: comment button visible.
: 
: Any help would be very appreciated.
: 
: Thank you very much.
: 
: -- 
: Arcadius Ahouansou
: Menelic Ltd | Applied Knowledge Is Power
: M: 07908761999
: W: www.menelic.com
: ---
: 

-Hoss
http://www.lucidworks.com/


Re: Complexity of a document?

2016-05-11 Thread Shawn Heisey
On 5/11/2016 1:32 PM, A Laxmi wrote:
> Is it possible to determine how complex a document is using Solr?
> Complexity in terms of whether document is readable by a 7th grade vs. PHD
> Grad?

Out of the box?  No.  You can of course embed any custom component
you're willing to find or write.

In general, I would say that this is not a job for Solr.  You can run
the analysis offline and update your data source, or you can include the
analysis engine in your indexing pipeline.

Thanks,
Shawn



Re: Planning and benchmarking Solr: resource consumption (RAM, disk, CPU, number of nodes)

2016-05-11 Thread Shawn Heisey
On 5/11/2016 6:06 AM, Horváth Péter Gergely wrote:
> If there is no such research document available, I would be much obliged if
> you could give some hints on what and how to measure in Solr / Solr cloud
> world. (E.g. what the optimal resource utilization of a Solr instance is,
> how to recognize if an instance is trashing etc.)

I don't know if you've seen this:

https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

There quite simply is no general answer.  Scalability rarely follows a
predictable curve based on the amount of hardware you use ... and what
I've frequently found is that a given Solr install will perform *great*
until some magic unknown threshold is reached, and then suddenly it's
like somebody installed an analog modem in place of your network card. 
If you Google "performance curve knee" you will find some information on
this phenomenon.

The only way to know exactly how Solr will behave under a given workload
is to set up the system and see what happens.  After somebody gets
enough experience with Solr, they can take a look at details for a
specific install and *maybe* predict whether it will handle the load or
not ... but I've frequently been wrong (in both directions) when trying
to make that assessment.

Thanks,
Shawn



Re: [scottchu] What kind of configuration to use for this size of news data?

2016-05-11 Thread Shawn Heisey
On 5/11/2016 3:55 AM, scott.chu wrote:
> **
> If I use SolrCloud, I know I have to setup Zookeeper. I know there're
> something called 'quorum' or 'ensemble' in Zookeeper terminologies. I
> also know there is a need for (2n+1) Zookeeper nodes per n SolrCloud
> nodes. Is your case running one SolrCloud node per one machine
> (Whether PM or VM). According to your experiences, how many nodes ,
> including SolrCloud's and Zookeeper's, do I need to setup? Is
> Replication in SolrCloud easy to setup as that in old version? (I
> setup replication solrconfig.xml and use solrcore.properties file to
> setup/switch roles in Solr node, rather than defining role directly in
> solrconfig.xml)
> **

No, you do not need that many Zookeeper nodes.  You need three zookeeper
nodes.  The only reason to add more zookeeper nodes is to handle more ZK
failures.  I cannot think of any reason for anybody to install more than
five zookeeper nodes in a single ZK ensemble.  Five nodes will sustain a
failure of two nodes, which means that you can take a node down for
maintenance, and *still* survive if another machine dies during maintenance.

Master-Slave replication is incompatible with SolrCloud, because
SolrCloud uses replication for disaster recovery operations.

> **
> We have a special biz case called 'buyout newspaper search service'. 
> Customers buy intranet license to use search service for articles of some 
> newspaper types and some range of  publish dates, e.g. paper type 'A' for 
> 2010-2012 and paper type 'B' for 2015. The buyout means we have to install 
> who search service at customer site and customer can only use search service 
> within their enterprise intranet environment. So you know, I have to build a 
> special Solr server for each of such customers. Your idea of filtering is 
> very much like ElasticSearch's multitenancy, which both are not fit in our 
> buyout biz model. Do you have any suggestion for building Solr server in such 
> condition?
> **

Solr does not include any kind of licensing system.  It will always
work, and will usually allow somebody to search the entire index.  To do
otherwise would be contrary to the general goal of open source projects.

If you need the service to expire or be limited in some way according to
license terms, you will need to write software to handle that ... but be
aware that if the users know anything about Solr, and they can get
access to the machine's filesystem, they will be able to get around most
such restrictions simply by copying the index config and data to another
machine.  You would need to embed the restrictions into custom Lucene
code (perhaps encrypting the index and obfuscating search terms in some
way) to make it more difficult to defeat your licensing system.

Thanks,
Shawn



Re: Using Ping Request Handler in SolrCloud within a load balancer

2016-05-11 Thread Shawn Heisey
On 5/9/2016 10:56 PM, Sandy Foley wrote:
> Question #1: Is there a SINGLE command that can be issued to each server from
> a load balancer to check the ping status of each server?

I am not aware of a single request that will test every collection.

The way I have things set up, each load balancer front end (runs on a
port number separate from the others) is only responsible for a single
index.  This is typically a collection in the SolrCloud world.

> Question #2: When running /solr/admin/ping from the load balancer to each Solr
> node, one of the three nodes returns a status ok. It's the same node every
> time; it's the first node that we set up of the 3 (which is not always the
> leader). The zkcli upconfig command has always been issued from this first
> node. Out of curiosity, if this command is for local ping only, why does this
> return status ok on one node (issued from the load balancer) and not the 
> other nodes?

I have never seen a global /solr/admin/ping handler.

If a collection has the ping handler defined, you should be able to
request /solr/COLLECTION_NAME/admin/ping on any server in the cloud,
even a server that does not contain replicas for that collection.  I
believe that as long as the ping handler doesn't set distrib to false,
the configured ping query will be executed across the entire collection
-- effectively testing the whole thing.  You could target specific shard
replicas, but the collection is probably better.  If the collection is
whole (at least one replica of every shard is functional) within the
cloud, the ping handler should return success.  The fact that a 2xx
response is returned verifies that the server itself is up.

With SolrCloud, I am not sure what happens if you disable the ping
handler's health check file.  I have not used a load balancer with
SolrCloud -- our client code is Java, so the small SolrCloud deployment
we have doesn't need one.  My load balancer sits in front of Solr
servers that are *not* running SolrCloud.

Do you need to track the availability of each collection independently? 
If you do, you can set up a backend for each collection with appropriate
cloud servers in it.  Each backend would use the ping URL for that
collection.  This is probably extreme overkill, though ... see the next
paragraph for the solution that probably makes more sense.

If you don't need to track the availability of each collection
independently, then you really only need one back end.  You can simply
pick a collection that you expect to always exist in the cloud (let's
say that collection is named foo) and use /solr/foo/admin/ping as the
health check URL.  As long as you have the appropriate number of
replicas in every collection to survive any expected failure scenario,
SolrCloud itself will take care of making sure the queries work on all
your collections, no matter which server receives the request.

Thanks,
Shawn



Re: How to restrict outside IP access in Solr with internal jetty server

2016-05-11 Thread Shawn Heisey
On 5/10/2016 9:02 AM, Mugeesh Husain wrote:
> I am using solr 5.3 version with inbuilt jetty server.
>
> I am looking for a proxy kind of thing which i could prevent outside User
> access for all of the link, I would give only access select and select core
> url accessibility other than this should be not open.
>
> Please give me some suggestion.

There are appliances and software implementing proxy/load balancer
functionality (software examples: haproxy, nginx, apache httpd) that can
do this.  Making it reasonably secure would not be a trivial config, but
it is doable.

For maximum safety, you should not expose *any* part of your Solr
install to end users at all, especially the Internet.  Not even the
/select endpoint should be exposed.  You should have website application
code that sits between your users and Solr -- which could be homegrown
software, or an existing software package like Wordpress.

Thanks,
Shawn



Cannot comment on Jira anymore

2016-05-11 Thread Arcadius Ahouansou
Hello.

Somehow, I am no longer able to comment on Solr Jira tickets.

When I go to https://issues.apache.org/jira/browse/SOLR-7963
I am logged in... I can edit the ticket, but there is no comment box or
comment button visible.

Any help would be very appreciated.

Thank you very much.

-- 
Arcadius Ahouansou
Menelic Ltd | Applied Knowledge Is Power
M: 07908761999
W: www.menelic.com
---


Re: issues using BlendedInfixLookupFactory in solr5.5

2016-05-11 Thread Arcadius Ahouansou
Hi Xavi.

The blenderType=linear value stopped working because of a change introduced in
https://issues.apache.org/jira/browse/LUCENE-6939

"linear" has been refactored to "position_linear"

I would be grateful if a committer could help update the wiki with the
comments at


https://issues.apache.org/jira/browse/LUCENE-6939?focusedCommentId=15068054#comment-15068054


About your question:
"does SolrCloud totally support suggesters?"
Yes, SolrCloud supports the BlendedInfixSuggester to some extent.
What worked for us was buildOnCommit=true.

We used 2 collections: one is live, the other one is in stand-by mode.
We update the stand-by one in batches and we commit at the end,
triggering the suggester rebuild.
Then we swap the stand-by to become the live collection using aliases.


Arcadius


On 31 March 2016 at 18:04, xavi jmlucjav  wrote:

> Hi,
>
> I have been working with
> AnalyzingInfixLookupFactory/BlendedInfixLookupFactory in 5.5.0, and I have
> a number of questions/comments, hopefully I get some insight into this:
>
> - Doc not complete/up-to-date:
> - blenderType param does not accept 'linear' value, it did in 5.3. I
> commented it out as it's the default.
> - it should be mentioned contextField must be a stored field
> - if the field used is whitespace tokenized, and you search for 'one t',
> the suggestions are sorted by weight, not score. So if you give a constant
> score to all docs, you might get this:
> 1. one four two
> 2. one two four
>   Would taking the score into account (something not done yet but could be
> done according to something I saw in code/jira) return 2,1 instead of 1,2?
> My guess is it would, correct?
> - what would we need to return the score too? Could it be done easily?
> along with the payload or something.
> - would it be possible to make BlendedInfixLookupFactory allow for some
> fuzziness a la FuzzyLookupFactory?
> - when building a big suggester, it can take a long time, you just send a
> request with suggest.build=true and wait. Is there any possible way to
> monitor the progress of this? I did not find one.
> - for weightExpression, one typical use case would be to provide the users'
> lat/lon to weight the suggestions by proximity, is this somehow feasible?
> What would be needed?
> - does SolrCloud totally support suggesters? If so does each shard build
> its own suggester and it works just like a normal distributed search ?
> - I filled SOLR-8928 suggest.cfq does not work with
> DocumentExpressionDictionaryFactory/weightExpression as I found that combo
> not working.
>
> regards
> xavi
>



-- 
Arcadius Ahouansou
Menelic Ltd | Applied Knowledge Is Power
M: 07908761999
W: www.menelic.com
---


Re: Complexity of a document?

2016-05-11 Thread Walter Underwood
There are many different “readability scores”. The most common is 
Flesch-Kincaid, which uses the number of words, number of sentences, and number 
of syllables. Solr has the word count, but not the other two.

https://en.wikipedia.org/wiki/Readability_test 

https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests 


I think Solr is the wrong tool for calculating readability scores.

The scores are fairly easy to calculate once you have the whole document. But 
the information stored in a Solr index is the wrong information for that 
calculation.
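
For reference, the grade-level variant sketched in Java with the published
Flesch-Kincaid coefficients; the three counts have to come from your own text
processing, not from the Solr index:

// Flesch-Kincaid grade level:
//   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
static double fleschKincaidGradeLevel(int words, int sentences, int syllables) {
    return 0.39 * ((double) words / sentences)
            + 11.8 * ((double) syllables / words)
            - 15.59;
}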

Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On May 11, 2016, at 1:11 PM, A Laxmi  wrote:
> 
> 
>> What I mean is that a technical paper will have a different type of
>> complexity from let's say a Shakespearean play, because the former will
>> have technical jargon, while the latter will have really high level
>> vocabulary.
> 
> Good point. But I am thinking a 7th grader might find both of them complex
> to understand - one because of technical jargon and the other because of
> high-level vocabulary?
> 
> If it helps, I am looking at a set of user manuals of various products. I
> am trying to determine which of those user manuals are easier to read and
> which are more complex in comparison.
> 
> 
> 
> On Wed, May 11, 2016 at 3:58 PM, Binoy Dalal  wrote:
> 
>> Please correct me if I'm wrong, but I think what Joel means is the variety
>> of words in a document.
>> 
>> One more aspect that will come into play here, I think, is the different
>> types of complexity.
>> What I mean is that a technical paper will have a different type of
>> complexity from let's say a Shakespearean play, because the former will
>> have technical jargon, while the latter will have really high level
>> vocabulary.
>> 
>> On Thu, 12 May 2016, 01:17 A Laxmi,  wrote:
>> 
>>> Yes, length of the words would be one way but was wondering if there are
>>> any other ways to identify the complexity.
>>> 
>>> On Wed, May 11, 2016 at 3:46 PM, A Laxmi  wrote:
>>> 
 Yes, length of the words would be one way but was wondering if there
>> are
 any ways to identify the complexity.
 
 On Wed, May 11, 2016 at 3:36 PM, Joel Bernstein 
 wrote:
 
> I'm wondering if the size of the vocabulary used would be enough for
>>> this?
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/
> 
> On Wed, May 11, 2016 at 3:32 PM, A Laxmi 
>>> wrote:
> 
>> Hi,
>> 
>> Is it possible to determine how complex a document is using Solr?
>> Complexity in terms of whether document is readable by a 7th grade
>> vs.
> PHD
>> Grad?
>> 
>> Thanks!
>> AL
>> 
> 
 
 
>>> 
>> --
>> Regards,
>> Binoy Dalal
>> 



Re: changing web context and port for SolrCloud Zookeeper

2016-05-11 Thread John Bickerstaff
Excellent! That file gave me fits at first.  It lives in two locations, but
the one that counts for booting SOLR is the /etc/default one.
On May 11, 2016 12:53 PM, "Tom Gullo"  wrote:

That helps. I ended up updating the solr.in.sh file in /etc/default and
that was getting picked up. Thanks

> On May 11, 2016, at 2:05 PM, Tom Gullo  wrote:
>
> My Solr installation is running on Tomcat on port 8080 with a  web
context name that is different than /solr.   We want to move to a basic
jetty setup with all the defaults.  I haven’t found a clean way to do
this.  A lot of the values like baseurl and /leader/elect/shard1 have
values that need to be updated.  If I try shutting down the servers, change
the zookeeper settings and then restart Solr in Jetty I get issues - like
Solr thinks they are replicas.   So I’m looking to see if anyone knows what
is the cleanest way to move from a Tomcat/8080 install to a Jetty/8983 one.
>
> Thanks
>
>> On May 11, 2016, at 1:59 PM, John Bickerstaff 
wrote:
>>
>> I may be answering the wrong question - but SolrCloud goes in by default
on
>> 8983, yes?  Is yours currently on 8080?
>>
>> I don't recall where, but I think I saw a config file setting for the
port
>> number (In Solr I mean)
>>
>> Am I on the right track or are you asking something other than how to get
>> Solr on host:8983/solr ?
>>
>> On Wed, May 11, 2016 at 11:56 AM, Tom Gullo  wrote:
>>
>>> I need to change the web context and the port for a SolrCloud
installation.
>>>
>>> Example, change:
>>>
>>> host:8080/some-api-here/
>>>
>>> to this:
>>>
>>> host:8983/solr/
>>>
>>> Does anyone know how to do this with SolrCloud?  There are values stored
>>> in clusterstate.json and /leader/elect and I could change
them
>>> but that seems a little messy.
>>>
>>> Thanks
>


Re: Complexity of a document?

2016-05-11 Thread A Laxmi
> What I mean is that a technical paper will have a different type of
> complexity from let's say a Shakespearean play, because the former will
> have technical jargon, while the latter will have really high level
> vocabulary.

Good point. But I am thinking a 7th grader might find both of them complex to
understand - one because of technical jargon and the other because of
high-level vocabulary?

If it helps, I am looking at a set of user manuals of various products. I
am trying to determine which of those user manuals are easier to read and
which are more complex in comparison.



On Wed, May 11, 2016 at 3:58 PM, Binoy Dalal  wrote:

> Please correct me if I'm wrong, but I think what Joel means is the variety
> of words in a document.
>
> One more aspect that will come into play here, I think, is the different
> types of complexity.
> What I mean is that a technical paper will have a different type of
> complexity from let's say a Shakespearean play, because the former will
> have technical jargon, while the latter will have really high level
> vocabulary.
>
> On Thu, 12 May 2016, 01:17 A Laxmi,  wrote:
>
> > Yes, length of the words would be one way but was wondering if there are
> > any other ways to identify the complexity.
> >
> > On Wed, May 11, 2016 at 3:46 PM, A Laxmi  wrote:
> >
> > > Yes, length of the words would be one way but was wondering if there
> are
> > > any ways to identify the complexity.
> > >
> > > On Wed, May 11, 2016 at 3:36 PM, Joel Bernstein 
> > > wrote:
> > >
> > >> I'm wondering if the size of the vocabulary used would be enough for
> > this?
> > >>
> > >> Joel Bernstein
> > >> http://joelsolr.blogspot.com/
> > >>
> > >> On Wed, May 11, 2016 at 3:32 PM, A Laxmi 
> > wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > Is it possible to determine how complex a document is using Solr?
> > >> > Complexity in terms of whether document is readable by a 7th grade
> vs.
> > >> PHD
> > >> > Grad?
> > >> >
> > >> > Thanks!
> > >> > AL
> > >> >
> > >>
> > >
> > >
> >
> --
> Regards,
> Binoy Dalal
>


Re: Complexity of a document?

2016-05-11 Thread Binoy Dalal
Please correct me if I'm wrong, but I think what Joel means is the variety
of words in a document.

One more aspect that will come into play here, I think, is the different
types of complexity.
What I mean is that a technical paper will have a different type of
complexity from let's say a Shakespearean play, because the former will
have technical jargon, while the latter will have really high level
vocabulary.

On Thu, 12 May 2016, 01:17 A Laxmi,  wrote:

> Yes, length of the words would be one way but was wondering if there are
> any other ways to identify the complexity.
>
> On Wed, May 11, 2016 at 3:46 PM, A Laxmi  wrote:
>
> > Yes, length of the words would be one way but was wondering if there are
> > any ways to identify the complexity.
> >
> > On Wed, May 11, 2016 at 3:36 PM, Joel Bernstein 
> > wrote:
> >
> >> I'm wondering if the size of the vocabulary used would be enough for
> this?
> >>
> >> Joel Bernstein
> >> http://joelsolr.blogspot.com/
> >>
> >> On Wed, May 11, 2016 at 3:32 PM, A Laxmi 
> wrote:
> >>
> >> > Hi,
> >> >
> >> > Is it possible to determine how complex a document is using Solr?
> >> > Complexity in terms of whether document is readable by a 7th grade vs.
> >> PHD
> >> > Grad?
> >> >
> >> > Thanks!
> >> > AL
> >> >
> >>
> >
> >
>
-- 
Regards,
Binoy Dalal


Re: Complexity of a document?

2016-05-11 Thread A Laxmi
Yes, length of the words would be one way but was wondering if there are
any other ways to identify the complexity.

On Wed, May 11, 2016 at 3:46 PM, A Laxmi  wrote:

> Yes, length of the words would be one way but was wondering if there are
> any ways to identify the complexity.
>
> On Wed, May 11, 2016 at 3:36 PM, Joel Bernstein 
> wrote:
>
>> I'm wondering if the size of the vocabulary used would be enough for this?
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Wed, May 11, 2016 at 3:32 PM, A Laxmi  wrote:
>>
>> > Hi,
>> >
>> > Is it possible to determine how complex a document is using Solr?
>> > Complexity in terms of whether document is readable by a 7th grade vs.
>> PHD
>> > Grad?
>> >
>> > Thanks!
>> > AL
>> >
>>
>
>


Re: Complexity of a document?

2016-05-11 Thread A Laxmi
Yes, length of the words would be one way but was wondering if there are
any ways to identify the complexity.

On Wed, May 11, 2016 at 3:36 PM, Joel Bernstein  wrote:

> I'm wondering if the size of the vocabulary used would be enough for this?
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, May 11, 2016 at 3:32 PM, A Laxmi  wrote:
>
> > Hi,
> >
> > Is it possible to determine how complex a document is using Solr?
> > Complexity in terms of whether document is readable by a 7th grade vs.
> PHD
> > Grad?
> >
> > Thanks!
> > AL
> >
>


Cross Data Center Replication - ERROR

2016-05-11 Thread Abdel Belkasri
Hi there,



I am trying to configure Cross Data Center Replication using solr 6.0.

I am having issues creating collections or reloading old collections with
the new solrconfig.xml on both the target and source side. I keep getting
the error
“org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Solr instance is not configured with the cdcr update log”





This is my config on the Source:

<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="buffer">
    <str name="defaultState">disabled</str>
  </lst>
</requestHandler>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">cdcr-proc-chain</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="cdcr-proc-chain">
  <processor class="solr.CdcrUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog class="solr.CdcrUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
    <int name="numRecordsToKeep">500</int>
    <int name="maxNumLogsToKeep">20</int>
    <int name="numVersionBuckets">65536</int>
  </updateLog>
</updateHandler>





This is the config on the Target side:

<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="buffer">
    <str name="defaultState">disabled</str>
  </lst>
</requestHandler>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">cdcr-proc-chain</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="cdcr-proc-chain">
  <processor class="solr.CdcrUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog class="solr.CdcrUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
    <int name="numRecordsToKeep">500</int>
    <int name="maxNumLogsToKeep">20</int>
    <int name="numVersionBuckets">65536</int>
  </updateLog>
</updateHandler>





HOW SOLR IS RUNNING:

The ZK_HOST parameter is set in the solr.in.sh file under /etc/default; when
you start the Solr service it starts in cloud mode.





Any help would be great.



Thanks


--Abdel.


Re: Complexity of a document?

2016-05-11 Thread Joel Bernstein
I'm wondering if the size of the vocabulary used would be enough for this?

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, May 11, 2016 at 3:32 PM, A Laxmi  wrote:

> Hi,
>
> Is it possible to determine how complex a document is using Solr?
> Complexity in terms of whether document is readable by a 7th grade vs. PHD
> Grad?
>
> Thanks!
> AL
>


Complexity of a document?

2016-05-11 Thread A Laxmi
Hi,

Is it possible to determine how complex a document is using Solr?
Complexity in terms of whether a document is readable by a 7th grader vs. a
PhD grad?

Thanks!
AL


Re: Re-indexing in SolRCloud while keeping the collection online -- Best practice?

2016-05-11 Thread Nick Vasilyev
Aliasing works great, I implemented it after upgrading to Solr 5 and it
allows us to do this exact thing. The only thing you have to watch out for
is indexing new items (if they overwrite old ones) while you are
re-indexing.

I took it a step further for another collection that stores a lot of time
based data from logs. I have two aliases for that collection logs and
logs_indexing, every month a new collection gets created called logs_201605
or something like that and both aliases get updated. logs_indexing now only
points to the newest collection, thats where all the indexing is going, the
logs alias gets updated to include the new collection as well (since
aliases can point to multiple collections).

Here is the link to the documentation.
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api4
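
The swap itself is a single Collections API call; a sketch using the alias
and collection names from Erick's example quoted below:

http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=hot&collections=col2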

On Tue, May 10, 2016 at 12:55 PM, Horváth Péter Gergely <
peter.gergely.horv...@gmail.com> wrote:

> Hi Erick,
>
> Most of the time we have to do a full re-index: I do love your second idea,
> I will take a look at the details of that. Thank you! :)
>
> Cheers,
> Peter
>
> 2016-05-10 17:10 GMT+02:00 Erick Erickson :
>
> > Peter:
> >
> > Yeah, that would work, but there are a couple of alternatives:
> > 1> If there's any way to know what the subset of docs that's
> >  changed, just re-index _them_. The problem here is
> >  picking up deletes. In the RDBMS case this is often done
> >  by creating a trigger for deletes and then the last step
> >  in your update is to remove the docs since the last time
> >  you indexed using the deleted_docs table (or whatever).
> >  This falls down if a> you require an instantaneous switch
> >  from _all_ the old data to the new or b> you can't get a
> >  list of deleted docs.
> >
> > 2> Use collection aliasing. The pattern is this: you have your
> >  "Hot" collection (col1) serving queries that is pointed to
> >  by alias "hot". You create a new collection (col2) and index
> >  to it in the background. When done, use CREATEALIAS
> >  to point "hot" to "col2". Now you can delete col1. There are
> >  no restrictions on where these collections live, so this
> >  allows you to move your collections around as you want. Plus
> >  this keeps a better separation of old and new data...
> >
> > Best,
> > Erick
> >
> > On Tue, May 10, 2016 at 4:32 AM, Horváth Péter Gergely
> >  wrote:
> > > Hi Everyone,
> > >
> > > I am wondering if there is any best practice regarding re-indexing
> > > documents in SolrCloud 6.0.0 without making the data (or the underlying
> > > collection) temporarily unavailable. Wiping all documents in a
> collection
> > > and performing a full re-indexing is not a viable alternative for us.
> > >
> > > Say we had a massive Solr Cloud cluster with a number of separate nodes
> > > that are used to host *multiple hundreds* of collections, with document
> > > counts ranging from a couple of thousands to multiple (say up to 20)
> > > millions of documents, each with 200-300 fields and a background batch
> > > loader job that fetches data from a variety of source systems.
> > >
> > > We have to retain the cluster and ALL collections online all the time
> > (365
> > > x 24): We cannot allow queries to be blocked while data in a collection
> > is
> > > being updated and we cannot load everything in a single-shot jumbo
> commit
> > > (the replication could overload the cluster).
> > >
> > > One solution I could imagine is storing an additional field "load
> > > time-stamp" in all documents and the client (interactive query)
> > application
> > > extending all queries with an additional restriction, which requires
> > > documents "load time-stamp" to be the latest known completed "load
> > > time-stamp".
> > >
> > > This concept would work according to the following:
> > > 1.) The batch job would simply start loading new documents, with the
> new
> > > "load time-stamp". Existing documents would not be touched.
> > > 2.) The client (interactive query) application would still use the old
> > data
> > > from the previous load (since all queries are restricted with the old
> > "load
> > > time-stamp")
> > > 3.) The batch job would store the new "load time-stamp" as the one to
> be
> > > used (e.g. in a separate collection etc.) -- after this, all queries
> > would
> > > return the most up-to-data documents
> > > 4.) The batch job would purge all documents from the collection, where
> > > the "load time-stamp" is not the same as the last one.
> > >
> > > This approach seems to be implementable, however, I definitely want to
> > > avoid reinventing the wheel myself and wondering if there is any better
> > > solution or built-in Solr Cloud feature to achieve the same or
> something
> > > similar.
> > >
> > > Thanks,
> > > Peter
> >
>


Re: changing web context and port for SolrCloud Zookeeper

2016-05-11 Thread Tom Gullo
That helps.  I ended up updating the solr.in.sh file in /etc/default and that
was getting picked up.  Thanks

> On May 11, 2016, at 2:05 PM, Tom Gullo  wrote:
> 
> My Solr installation is running on Tomcat on port 8080 with a  web context 
> name that is different than /solr.   We want to move to a basic jetty setup 
> with all the defaults.  I haven’t found a clean way to do this.  A lot of the 
> values like baseurl and /leader/elect/shard1 have values that need to be 
> updated.  If I try shutting down the servers, change the zookeeper settings 
> and then restart Solr in Jetty I get issues - like Solr thinks they are 
> replicas.   So I’m looking to see if anyone knows what is the cleanest way to 
> move from a Tomcat/8080 install to a Jetty/8983 one.
> 
> Thanks
> 
>> On May 11, 2016, at 1:59 PM, John Bickerstaff  
>> wrote:
>> 
>> I may be answering the wrong question - but SolrCloud goes in by default on
>> 8983, yes?  Is yours currently on 8080?
>> 
>> I don't recall where, but I think I saw a config file setting for the port
>> number (In Solr I mean)
>> 
>> Am I on the right track or are you asking something other than how to get
>> Solr on host:8983/solr ?
>> 
>> On Wed, May 11, 2016 at 11:56 AM, Tom Gullo  wrote:
>> 
>>> I need to change the web context and the port for a SolrCloud installation.
>>> 
>>> Example, change:
>>> 
>>> host:8080/some-api-here/
>>> 
>>> to this:
>>> 
>>> host:8983/solr/
>>> 
>>> Does anyone know how to do this with SolrCloud?  There are values stored
>>> in clusterstate.json and /leader/elect and I could change them
>>> but that seems a little messy.
>>> 
>>> Thanks
> 



Re: Issues with Authentication / Role based authorization

2016-05-11 Thread shamik
Brian,

  Thanks for your reply. My first post was a bit convoluted; I tried to explain
the issue in the subsequent post. Here's the security JSON. I've given solr and
beehive the admin role, which allows them access to "update"
and "read". This works as expected. I then added a new role "browseRole" in
order to restrict certain users to only have access to browse on the
gettingstarted collection.

  "authorization.enabled": true,
  "authorization": {
"class": "solr.RuleBasedAuthorizationPlugin",
"user-role": {
  "solr": "admin",
  "beehive": [
"admin"
  ],
  "dev": [
"browseRole"
  ]
},
"permissions": [
  {
"name": "update",
"role": "admin"
  },
  {
"name": "read",
"role": "admin"
  },
  {
"name": "browse",
"collection": "gettingstarted",
"path": "/browse",
"role": "browseRole"
  }
],
"": {
  "v": 6
}
  }
}

But when I log in as "dev", I seem to have the same access as "solr" and
"beehive": "dev" can add/delete data, create collections, etc. Does the order
of the permissions matter here even though "dev" is assigned to a specific
role?





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Issues-with-Authentication-Role-based-authorization-tp4276024p4276203.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: changing web context and port for SolrCloud Zookeeper

2016-05-11 Thread John Bickerstaff
Oh, I see -

Hmmm... I just did a disaster recovery work up for my IT guys and basically
I recommended they build SOLR from scratch and reindex rather than try to
recover (same for changing versions)

However, we've got a small-ish data set and that may not work for everyone.

Any chance you can just rebuild (with the default Jetty) and re-index?

On Wed, May 11, 2016 at 12:05 PM, Tom Gullo  wrote:

> My Solr installation is running on Tomcat on port 8080 with a  web context
> name that is different than /solr.   We want to move to a basic jetty setup
> with all the defaults.  I haven’t found a clean way to do this.  A lot of
> the values like baseurl and /leader/elect/shard1 have values that need to
> be updated.  If I try shutting down the servers, change the zookeeper
> settings and then restart Solr in Jetty I get issues - like Solr thinks
> they are replicas.   So I’m looking to see if anyone knows what is the
> cleanest way to move from a Tomcat/8080 install to a Jetty/8983 one.
>
> Thanks
>
> > On May 11, 2016, at 1:59 PM, John Bickerstaff 
> wrote:
> >
> > I may be answering the wrong question - but SolrCloud goes in by default
> on
> > 8983, yes?  Is yours currently on 8080?
> >
> > I don't recall where, but I think I saw a config file setting for the
> port
> > number (In Solr I mean)
> >
> > Am I on the right track or are you asking something other than how to get
> > Solr on host:8983/solr ?
> >
> > On Wed, May 11, 2016 at 11:56 AM, Tom Gullo  wrote:
> >
> >> I need to change the web context and the port for a SolrCloud
> installation.
> >>
> >> Example, change:
> >>
> >> host:8080/some-api-here/
> >>
> >> to this:
> >>
> >> host:8983/solr/
> >>
> >> Does anyone know how to do this with SolrCloud?  There are values stored
> >> in clusterstate.json and /leader/elect and I could change
> them
> >> but that seems a little messy.
> >>
> >> Thanks
>
>


Re: changing web context and port for SolrCloud Zookeeper

2016-05-11 Thread John Bickerstaff
Yup - bottom of solr.in.sh - if you used the "install for production"
script.

/etc/default/solr.in.sh (on linux which is all I do these days)

Hope that helps...  Ping back if not.

SOLR_PID_DIR="/var/solr"
SOLR_HOME="/var/solr/data"
LOG4J_PROPS="/var/solr/log4j.properties"
SOLR_LOGS_DIR="/var/solr/logs"
SOLR_PORT="8983"

On Wed, May 11, 2016 at 11:59 AM, John Bickerstaff  wrote:

> I may be answering the wrong question - but SolrCloud goes in by default
> on 8983, yes?  Is yours currently on 8080?
>
> I don't recall where, but I think I saw a config file setting for the port
> number (In Solr I mean)
>
> Am I on the right track or are you asking something other than how to get
> Solr on host:8983/solr ?
>
> On Wed, May 11, 2016 at 11:56 AM, Tom Gullo  wrote:
>
>> I need to change the web context and the port for a SolrCloud
>> installation.
>>
>> Example, change:
>>
>> host:8080/some-api-here/
>>
>> to this:
>>
>> host:8983/solr/
>>
>> Does anyone know how to do this with SolrCloud?  There are values stored
>> in clusterstate.json and /leader/elect and I could change them
>> but that seems a little messy.
>>
>> Thanks
>
>
>


Re: changing web context and port for SolrCloud Zookeeper

2016-05-11 Thread Tom Gullo
My Solr installation is running on Tomcat on port 8080 with a  web context name 
that is different than /solr.   We want to move to a basic jetty setup with all 
the defaults.  I haven’t found a clean way to do this.  A lot of the values 
like baseurl and /leader/elect/shard1 have values that need to be updated.  If 
I try shutting down the servers, change the zookeeper settings and then restart 
Solr in Jetty I get issues - like Solr thinks they are replicas.   So I’m 
looking to see if anyone knows what is the cleanest way to move from a 
Tomcat/8080 install to a Jetty/8983 one.

Thanks

> On May 11, 2016, at 1:59 PM, John Bickerstaff  
> wrote:
> 
> I may be answering the wrong question - but SolrCloud goes in by default on
> 8983, yes?  Is yours currently on 8080?
> 
> I don't recall where, but I think I saw a config file setting for the port
> number (In Solr I mean)
> 
> Am I on the right track or are you asking something other than how to get
> Solr on host:8983/solr ?
> 
> On Wed, May 11, 2016 at 11:56 AM, Tom Gullo  wrote:
> 
>> I need to change the web context and the port for a SolrCloud installation.
>> 
>> Example, change:
>> 
>> host:8080/some-api-here/
>> 
>> to this:
>> 
>> host:8983/solr/
>> 
>> Does anyone know how to do this with SolrCloud?  There are values stored
>> in clusterstate.json and /leader/elect and I could change them
>> but that seems a little messy.
>> 
>> Thanks



Re: changing web context and port for SolrCloud Zookeeper

2016-05-11 Thread John Bickerstaff
I may be answering the wrong question - but SolrCloud goes in by default on
8983, yes?  Is yours currently on 8080?

I don't recall where, but I think I saw a config file setting for the port
number (In Solr I mean)

Am I on the right track or are you asking something other than how to get
Solr on host:8983/solr ?

On Wed, May 11, 2016 at 11:56 AM, Tom Gullo  wrote:

> I need to change the web context and the port for a SolrCloud installation.
>
> Example, change:
>
> host:8080/some-api-here/
>
> to this:
>
> host:8983/solr/
>
> Does anyone know how to do this with SolrCloud?  There are values stored
> in clusterstate.json and /leader/elect and I could change them
> but that seems a little messy.
>
> Thanks


changing web context and port for SolrCloud Zookeeper

2016-05-11 Thread Tom Gullo
I need to change the web context and the port for a SolrCloud installation.

Example, change:

host:8080/some-api-here/

to this:

host:8983/solr/

Does anyone know how to do this with SolrCloud?  There are values stored in 
clusterstate.json and /leader/elect and I could change them but 
that seems a little messy.

Thanks

RE: Edismax field boosting behavior for null values

2016-05-11 Thread Megha Bhandari
Correcting typo in original post and making it a little clearer

Hi

Can someone help us understand how null values affect boosting.

Say we have field_1 (with boost ^10.1) and field_2 (with boost ^9.1).
We search for foo.
Document A: field_1 does not exist; field_2 matches the search term.
Document B: field_1 matches the search term; field_2 is an empty string.
As per our understanding the result order should be Document B, Document A.
However, what we are getting is Document A, Document B.

Below is a detailed description of the above problem with our business use case 
and configurations.

Use case : Promote documents as per following priority of fields ie. Keywords > 
meta description > Title > H1 > H2 >H3 > body content

For this we have indexed the above fields as







and used the eDisMax query parser and set boosting as

<str name="defType">edismax</str>
<str name="qf">metatag.keywords^100.1 metatag.description^50.1 title^20.1 h1^4.7
h2^3.6 h3^2.5 h4^1.4 id^0.01 _text_^0.001</str>

The above is working fine for documents that have an entry for all fields, e.g.
all pages have keywords, meta description and so on, even if the entry is just
an empty string. So if the search matches pages only, the results come back as
expected.

However, for documents that don't have keywords, e.g. PDFs, which only have
meta description, title and _text_, results are skewed: PDFs come right at the
top even though we have a page with the search term in the keyword field.

To fix this anomaly we came up with the following boosting (notice the very
large boost values):

<str name="defType">edismax</str>
<str name="qf">metatag.keywords^10.1 metatag.description^7500.1 title^500.1 h1^40.7
h2^25.6 h3^15.1 h4^5.4 h5^1.3 h6^1.2 _text_^1.0</str>

I can provide the query debug results for both configurations if required.

Thanks for any help in understanding this.


-Original Message-
From: Megha Bhandari [mailto:mbhanda...@sapient.com] 
Sent: Wednesday, May 11, 2016 11:10 PM
To: solr-user@lucene.apache.org
Subject: Edismax field boosting behavior for null values

Hi

Can someone help us understand how null values affect boosting.

Say we have field_1 (with boost ^10.1)  and field_2 (with boost ^9.1).
We search for foo. Document A has field_1(foo match) and field_2(empty) and 
Document B has field_2(foo match)  but no field_1.
As per our understanding the result should be Document A,Document B.
However what we are getting is Document B,Document A.

Below is a detailed description of the above problem with our business use case 
and configurations.

Use case : Promote documents as per following priority of fields ie. Keywords > 
meta description > Title > H1 > H2 >H3 > body content

For this we have indexed the above fields as







and used the eDisMax query parser and set boosting as

<str name="defType">edismax</str>
<str name="qf">metatag.keywords^100.1 metatag.description^50.1 title^20.1 h1^4.7
h2^3.6 h3^2.5 h4^1.4 id^0.01 _text_^0.001</str>

The above is working fine for documents that have an entry for all fields, e.g.
all pages have keywords, meta description and so on, even if the entry is just
an empty string. So if the search matches pages only, the results come back as
expected.

However, for documents that don't have keywords, e.g. PDFs, which only have
meta description, title and _text_, results are skewed: PDFs come right at the
top even though we have a page with the search term in the keyword field.

To fix this anomaly we came up with the following boosting (notice the very
large boost values):

<str name="defType">edismax</str>
<str name="qf">metatag.keywords^10.1 metatag.description^7500.1 title^500.1 h1^40.7
h2^25.6 h3^15.1 h4^5.4 h5^1.3 h6^1.2 _text_^1.0</str>

I can provide the query debug results for both configurations if required.

Thanks for any help in understanding this.



Edismax field boosting behavior for null values

2016-05-11 Thread Megha Bhandari
Hi

Can someone help us understand how null values affect boosting.

Say we have field_1 (with boost ^10.1)  and field_2 (with boost ^9.1).
We search for foo. Document A has field_1(foo match) and field_2(empty) and 
Document B has field_2(foo match)  but no field_1.
As per our understanding the result should be Document A,Document B.
However what we are getting is Document B,Document A.

Below is a detailed description of the above problem with our business use case 
and configurations.

Use case : Promote documents as per following priority of fields ie. Keywords > 
meta description > Title > H1 > H2 >H3 > body content

For this we have indexed the above fields as







and used the eDisMax query parser and set boosting as

<str name="defType">edismax</str>
<str name="qf">metatag.keywords^100.1 metatag.description^50.1 title^20.1 h1^4.7
h2^3.6 h3^2.5 h4^1.4 id^0.01 _text_^0.001</str>

The above is working fine for documents that have an entry for all fields, e.g.
all pages have keywords, meta description and so on, even if the entry is just
an empty string. So if the search matches pages only, the results come back as
expected.

However, for documents that don't have keywords, e.g. PDFs, which only have
meta description, title and _text_, results are skewed: PDFs come right at the
top even though we have a page with the search term in the keyword field.

To fix this anomaly we came up with the following boosting (notice the very
large boost values):

<str name="defType">edismax</str>
<str name="qf">metatag.keywords^10.1 metatag.description^7500.1 title^500.1 h1^40.7
h2^25.6 h3^15.1 h4^5.4 h5^1.3 h6^1.2 _text_^1.0</str>

I can provide the query debug results for both configurations if required.

Thanks for any help in understanding this.



Re: Issues with Authentication / Role based authorization

2016-05-11 Thread Brian J. Vanecek
I can't say I followed your entire example, but I think you're running 
into a couple of issues:

1) Users don't get any roles by default. So, when your initial setup
includes this:

{
"name": "all",
"role": "all"
  }

but nobody has the "all" role, it doesn't surprise me that it rejected 
your request.

2) Roles are not hierarchical. Again looking at your initial configuration 
file, giving the "solr" user the "admin" role only gives it access to the 
security-edit functionality. It won't have access to anything else. Even 
though "admin" might imply access to everything or all roles, it doesn't 
actually mean anything. It is just a name. The applies to the "all" role 
as well.

3) Rules are checked in order, and the first matching rule is utilized. In 
that first example again, the "all" rule is going to match any request, so 
basically it is like the rules underneath it don't exist. Solr will never 
even consider them, as a request would match the "all" rule first. You 
need to order rules where you put the most specific rules first and the 
most general ones last.
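For example, a sketch of the reordering using the rule names from this thread
(most specific first):

"permissions": [
  { "name": "browse", "collection": "gettingstarted", "path": "/browse", "role": "browseRole" },
  { "name": "update", "role": "admin" },
  { "name": "read", "role": "admin" }
]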

- Brian Vanecek



How to write these kind of functions in solr data config.xml

2016-05-11 Thread kavurupavan
CREATE OR REPLACE FUNCTION page(IN i_appname character varying, IN
i_photo_id bigint, IN i_page integer, IN i_member_id bigint, OUT
o_similar_page_name character varying, OUT o_similar_page_id bigint, OUT
o_similar_photo_id bigint[])

DECLARE

v_limit      INTEGER := 4;

v_offset     INTEGER;

v_start_time TIMESTAMP;

BEGIN

  SET SEARCH_PATH = '';

  v_start_time := DAYTIME();

  i_appname := UPPER(i_appname);

  IF i_appname <> 'DD' THEN

    RAISE EXCEPTION 'Enter Valid Application Name';

  END IF;

  IF i_page = 1 THEN

    v_offset := 0;

  ELSE

    v_offset := i_page * v_limit - v_limit;

  END IF;

Please help me. Thanks in advance.

Regards,
Pavan.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-write-these-kind-of-functions-in-solr-data-config-xml-tp4276060.html
Sent from the Solr - User mailing list archive at Nabble.com.


Any suggestions -- o.a.s.c.SolrCore <> Too many close [count:-1] on org.apache.solr.core.SolrCore

2016-05-11 Thread Gupta, Vipul
ERROR 0-thread-7 o.a.s.c.SolrCore <> Too many close [count:-1] on 
org.apache.solr.core.SolrCore@3d6f8ad3. Please report this exception to 
solr-user@lucene.apache.org


Re: Dynamically change solr suggest field

2016-05-11 Thread Nick D
There are only two ways I can think of to accomplish this and neither of
them are dynamically setting the suggester field as is looks according to
the Doc (which does sometimes have lacking info so I might be wrong) you
cannot set something like *suggest.fl=combo_box_field* at query time. But
maybe they can help you get started.

1. Multiple suggester request handlers for each option in combo box. This
way you just change the request handler in the query you submit based on
the context.

2. Use copy fields to put all possible suggestions into same field name, so
no more dynamic field settings, with another field defining whatever the
option would be for that document out of the combo box and use context
filters which can be passed at query time to limit the suggestions to those
filtered by whats in the combo box.
https://cwiki.apache.org/confluence/display/solr/Suggester#Suggester-ContextFiltering
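A rough sketch of option 2 — the field and context names here are made up,
adapt them to your schema:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">comboSuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">suggest_all</str>            <!-- copyField target -->
    <str name="contextField">combo_option</str>    <!-- value from the combo box -->
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>

and then pass the combo box value as the context filter at query time:

/suggest?suggest=true&suggest.dictionary=comboSuggester&suggest.q=lap&suggest.cfq=optionA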

Hope this helps a bit

Nick

On Wed, May 11, 2016 at 7:05 AM, Lasitha Wattaladeniya 
wrote:

> Hello devs,
>
> I'm trying to implement auto complete text suggestions using solr. I have a
> text box and next to that there's a combo box. So the auto complete should
> suggest based on the value selected in the combo box.
>
> Basically I should be able to change the suggest field based on the value
> selected in the combo box. I was trying to solve this problem whole day but
> not much luck. Can anybody tell me is there a way of doing this ?
>
> Regards,
> Lasitha.
>
> Lasitha Wattaladeniya
> Software Engineer
>
> Mobile : +6593896893
> Blog : techreadme.blogspot.com
>


Re:Re: solrcloud performance problem

2016-05-11 Thread lltvw
Hi Shawn,

Thanks for your input and help.

What you just guessed is right: we run Solr in Jetty using start.jar, and the
params are what I sent you in my last mail.

about GC, I will check it carefully, thanks.






On 2016-05-11 21:32:33, "Shawn Heisey"  wrote:
>On 5/10/2016 7:46 PM, lltvw wrote:
>> the args used to start solr are as following, and upload my screen shot to 
>> http://www.yupoo.com/photos/qzone3927066199/96064170/, please help to take a 
>> look, thanks.
>>
>> -DSTOP.PORT=7989
>> -DSTOP.KEY=
>> -DzkHost=node1:2181,node2:2181,node3:2181/solr
>> -Dsolr.solr.home=solr
>> -Dbootstrap_conf=true
>> -Xmx10240M
>> -Xms4196M
>> -XX:MaxPermSize=512M
>> -XX:PermSize=256M
>> -Dcom.sun.management.jmxremote.authenticate=false
>> -Dcom.sun.management.jmxremote.ssl=false
>> -Dcom.sun.management.jmxremote.port=3000
>> -Dcom.sun.management.jmxremote
>
>You still didn't say how you're starting Solr.  I don't see anything in
>the arguments that I would expect to see if running in tomcat or another
>third-party container.  You have a stop port and a stop key, which
>suggest that you might be running with the start.jar for the jetty
>included with Solr.
>
>The "bootstrap_conf" system property is not something you should set
>every time you start SolrCloud.  It is designed to be used once, to
>convert a non-cloud install to a cloud install.  In my opinion, it
>shouldn't even be used once.
>
>I don't have any way to confirm this with the information you've sent so
>far, but one problem you *might* be having is garbage collection
>pauses.  GC tuning is extremely important for Solr.  Here's some
>information I have written on the topic:
>
>https://wiki.apache.org/solr/ShawnHeisey#GC_Tuning_for_Solr
>
>Thanks,
>Shawn
>


Re: Issues with Authentication / Role based authorization

2016-05-11 Thread shamik
Anyone ?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Issues-with-Authentication-Role-based-authorization-tp4276024p4276153.html
Sent from the Solr - User mailing list archive at Nabble.com.


backups of analyzingInfixSuggesterIndexDir

2016-05-11 Thread Oakley, Craig (NIH/NLM/NCBI) [C]
I have a client whose Solr installation creates an
analyzingInfixSuggesterIndexDir directory alongside index and tlog. I notice
that this analyzingInfixSuggesterIndexDir is not included in backups (created
by replication?command=backup). Is there a way to include it? Or does it not
need to be backed up?

I haven't needed this yet, but wanted to ask before I find that I might need it.


Re: [scottchu] What kind of configuration to use for this size of news data?

2016-05-11 Thread Shawn Heisey
On 5/10/2016 10:34 PM, scott.chu wrote:
> A further question: Can master-slave and SolrCloud exist simultaneously in 
> one Solr server? If yes, how can I do it?

No.  SolrCloud uses replication internally for automated recovery on an
as-needed basis.  SolrCloud completely manages multiple replicas of an
index and user-configured replication is not necessary.

I do not know what you intend with that combination, but you may want to
look into Cross-Data-Center-Replication (CDCR) in Solr 6.0.

Thanks,
Shawn



Re: Error

2016-05-11 Thread Ahmet Arslan
Hi Midas,

It looks like you are committing too frequently; cache warming cannot catch up.
Either lower your commit rate, or disable cache auto-warming (autowarmCount=0).
You can also remove queries registered at newSearcher event if you have defined 
some.
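For illustration, the relevant solrconfig.xml knobs look roughly like this
(the times and sizes are placeholders, not recommendations):

<autoCommit>
  <maxTime>60000</maxTime>              <!-- hard commit, no new searcher -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>10000</maxTime>              <!-- visibility every 10s, not per update -->
</autoSoftCommit>

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>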

Ahmet



On Wednesday, May 11, 2016 2:51 PM, Midas A  wrote:
Hi, I am getting the following error:

org.apache.solr.common.SolrException: Error opening new searcher.
exceeded limit of maxWarmingSearchers=2, try again later.



What should I do to remove it?


Re: How to search string

2016-05-11 Thread Lasitha Wattaladeniya
Hi Kishor,

You can try escaping the search phrase "Garmin Class A" > Garmin\ Class\ A
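e.g., assuming the field is product_name:

q=product_name:Garmin\ Class\ A

or, simpler, a phrase query: q=product_name:"Garmin Class A"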

Lasitha Wattaladeniya
Software Engineer

Mobile : +6593896893
Blog : techreadme.blogspot.com

On Wed, May 11, 2016 at 6:12 PM, Ahmet Arslan 
wrote:

> Hi,
>
> You can be explicit about the field that you want to search on. e.g.
> q=product_name:(Garmin Class A)
>
> Or you can use the lucene query parser with the default field (df) parameter,
> e.g. q={!lucene df=product_name}Garmin Class A
>
> Its all about query parsers.
>
> Ahmet
>
>
> On Wednesday, May 11, 2016 9:12 AM, kishor  wrote:
> I want to search for a product whose name is "Garmin Class A", so I expect
> results whose product name matches the string "Garmin Class A", but it
> searches each term separately and I don't know why or how that happens.
> Please guide me how to search a string in one field only, not in other
> fields.
>
> "debug": {
>   "rawquerystring": "Garmin Class A",
>   "querystring": "Garmin Class A",
>   "parsedquery": "(+(DisjunctionMaxQuery((product_name:Garmin)) DisjunctionMaxQuery((product_name:Class)) DisjunctionMaxQuery((product_name:A))) ())/no_coord",
>   "parsedquery_toString": "+((product_name:Garmin) (product_name:Class) (product_name:A)) ()",
>   "explain": {},
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-search-string-tp4276052.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: URL parameters combined with text param

2016-05-11 Thread Ahmet Arslan
Hi Bastien,

Please use magic _query_ field, q=hospital AND _query_:"{!q.op=AND v=$a}"
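e.g., combined with the parameters from your examples, the full request would
look something like:

http://localhost:8983/solr/my_core/select?q=hospital AND _query_:"{!q.op=AND v=$a}"&a=hospital Leapfrog&fl=abstract,title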

ahmet


On Wednesday, May 11, 2016 2:35 PM, Latard - MDPI AG  
wrote:
Hi Everybody,

Is there a way to pass only some of the data by reference and some 
others in the q param?

e.g.:

q1.   http://localhost:8983/solr/my_core/select?{!q.op=OR 
v=$a}&fl=abstract,title&a=hospital Leapfrog&debug=true

q1a.  http://localhost:8983/solr/my_core/select?q=hospital AND 
Leapfrog&fl=abstract,title

q2.  http://localhost:8983/solr/my_core/select?q=hospital AND 
({!q.op=AND v=$a})&fl=abstract,title&a=hospital Leapfrog

q1 & q1a  are returning the same results, but q2 is somehow not 
analyzing the $a parameter properly...

Am I missing anything?

Kind regards,
Bastien Latard
Web engineer
-- 
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/


Dynamically change solr suggest field

2016-05-11 Thread Lasitha Wattaladeniya
Hello devs,

I'm trying to implement auto complete text suggestions using solr. I have a
text box and next to that there's a combo box. So the auto complete should
suggest based on the value selected in the combo box.

Basically I should be able to change the suggest field based on the value
selected in the combo box. I was trying to solve this problem whole day but
not much luck. Can anybody tell me is there a way of doing this ?

Regards,
Lasitha.

Lasitha Wattaladeniya
Software Engineer

Mobile : +6593896893
Blog : techreadme.blogspot.com


Re: solrcloud performance problem

2016-05-11 Thread Shawn Heisey
On 5/10/2016 7:46 PM, lltvw wrote:
> the args used to start solr are as following, and upload my screen shot to 
> http://www.yupoo.com/photos/qzone3927066199/96064170/, please help to take a 
> look, thanks.
>
> -DSTOP.PORT=7989
> -DSTOP.KEY=
> -DzkHost=node1:2181,node2:2181,node3:2181/solr
> -Dsolr.solr.home=solr
> -Dbootstrap_conf=true
> -Xmx10240M
> -Xms4196M
> -XX:MaxPermSize=512M
> -XX:PermSize=256M
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.port=3000
> -Dcom.sun.management.jmxremote

You still didn't say how you're starting Solr.  I don't see anything in
the arguments that I would expect to see if running in tomcat or another
third-party container.  You have a stop port and a stop key, which
suggest that you might be running with the start.jar for the jetty
included with Solr.

The "bootstrap_conf" system property is not something you should set
every time you start SolrCloud.  It is designed to be used once, to
convert a non-cloud install to a cloud install.  In my opinion, it
shouldn't even be used once.
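If you need to push config changes to ZooKeeper, do it explicitly with the
zkcli script instead (paths and names illustrative):

server/scripts/cloud-scripts/zkcli.sh -zkhost node1:2181,node2:2181,node3:2181/solr \
  -cmd upconfig -confdir /path/to/myconf -confname myconf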

I don't have any way to confirm this with the information you've sent so
far, but one problem you *might* be having is garbage collection
pauses.  GC tuning is extremely important for Solr.  Here's some
information I have written on the topic:

https://wiki.apache.org/solr/ShawnHeisey#GC_Tuning_for_Solr
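As one starting point, a CMS-based GC_TUNE block of the kind found in
solr.in.sh looks roughly like this (a sketch — tune against your own heap
size and GC logs):

GC_TUNE="-XX:NewRatio=3 -XX:SurvivorRatio=4 \
  -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8 \
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled"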

Thanks,
Shawn



Re: [scottchu] What is the host name for? can we change 'solr' in request url to some other name?

2016-05-11 Thread Shawn Heisey
On 5/11/2016 3:08 AM, scott.chu wrote:
> I see there's a -h option for bin\solr start command. What's that for?
> When we create a core, say 'abc', the request url is something like
> http://<host>/solr/abc. I'd like to change 'solr' to some other name; how can
> I do it?

The "host" is what SolrCloud will publish in zookeeper as the hostname
for this server, which is used by the distributed search functionality
inherent in SolrCloud.  If you are not using SolrCloud, this does
nothing.  Some people assume that this controls which network interface
will be used, but that is not what it's for.
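For example (hostname and ZooKeeper addresses illustrative):

bin/solr start -c -p 8983 -h solr1.example.com -z zk1:2181,zk2:2181,zk3:2181

The -h value is what other nodes and clients will see in the cluster state.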

The /solr part of the URL is set by Jetty.  It is called the context
path.  Note that if you change this path (which you CAN do), the admin
UI in 6.0 will stop working.  So far, this is not considered to be a bug
-- the person who wrote our new UI specifically designed it around a
static /solr path.  Since 5.0, the only supported way to run Solr is the
included scripts that start the included Jetty, where we maintain
control of the configuration.  See the SOLR-9054 issue in Jira for more
detailed information.

Thanks,
Shawn



Planning and benchmarking Solr: resource consumption (RAM, disk, CPU, number of nodes)

2016-05-11 Thread Horváth Péter Gergely
Hi All,

I am wondering if there is any recommendation or convention regarding
planning and benchmarking a Solr node / Solr Cloud cluster infrastructure.
I am looking for a somewhat more structured approach than trying with our
forecast data volumes and keep adding more resources (CPU, RAM, disk etc.)
or additional nodes until the performance is acceptable.

Is there any way to analyze, benchmark and forecast Solr resource
requirements according to the planned future load? E.g. if we have 2 cloud
nodes today, and our data volume doubles, can we simply add two
more nodes and expect roughly the same performance?

What if data volume grows 10x or 100x? How can we plan resources in
advance? I _know_ that Solr / Solr Cloud scales _very well_ , but is there
any supporting document, metrics etc available? I only heard about reports
that Solr Cloud is widely used with great success, however in a company
environment you normally have to show some numbers, proving that the
investment into the hardware will return and that you will not require a
5-fold emergency increase in infrastructure funding just to keep the system
up and running.

If there is no such research document available, I would be much obliged if
you could give some hints on what and how to measure in the Solr / Solr Cloud
world. (E.g. what the optimal resource utilization of a Solr instance is,
how to recognize if an instance is thrashing, etc.)

Thanks,
Peter


Re: [scottchu] What kind of configuration to use for this size ofnews data?

2016-05-11 Thread Charlie Hull

On 11/05/2016 10:55, scott.chu wrote:


I just found the mailing list doesn't seem to accept colorful fonts (because I
received my own letter from the list and saw the blue colors were gone!). I am
using rows of asterisks to highlight my questions and sending this again.


Answers inline below.

C




- Original Message - From: scott(自己) To: solr-user To: Date:
2016/5/11 (週三) 17:34 Subject: Re: [scottchu] What kind of
configuration to use for this size ofnews data?


Hi, Charlie,

Thanks first for your concrete answer. I have further questions as
written in blue color below.

scott.chu,scott@udngroup.com 2016/5/11 (週三) - Original
Message - From: Charlie Hull To: solr-user@lucene.apache.org CC:
Date: 2016/5/11 (週三) 16:21 Subject: Re: [scottchu] What kind of
configuration to use for this size ofnews data?


On 11/05/2016 04:27, scott.chu wrote:

Fix some typos, add some words and resend same question =>

I want to build a Solr engine for over 60-year news articles. My
requests are (I use Solr 5.4.1):


Hi Scott,

We've actually done something very similar for the our client NLA
Media Access in the UK, who handle licensing of most UK newspaper
content. They have over 45m docs going back to 2006.


1> Currently over 10M no. of docs. 2> Currently over 60GB total
data size. 3> The no. of docs and data size will keep growing at
the rate of 1000 no. of docs(or 8MB size) per day. 4> There are
totally 5-6 different newspaper types.

My questions are: 1> Is it wokable enough just to use master-slave
model? Or should I turn to SolrCloud? (I ask this due to our
system management group never manage a distributed system before
and they also have no knowedge of Zookeeper, shards, etc. Also they
don't know how to backup/restore distributed data.)


Workable yes, advisable no. You should get much better reliability &
performance with SolrCloud once it's set up. Also, if you have
replication set up correctly the need for backup/restore will be
significantly reduced and may be unnecessary.

We used master-slave for News UK's Solr setup (articles from The
Times and other papers) but this was before SolrCloud had properly
arrived. We'd only use master-slave rarely now.


*


If I use SolrCloud, I know I have to setup Zookeeper. I know there're 
something called 'quorum' or 'ensemble' in Zookeeper terminologies. I 
also know there is a need for (2n+1) Zookeeper nodes per n SolrCloud 
nodes.  Is your case running one SolrCloud node per one machine (Whether 
PM or VM).  According to your experiences, how many nodes , including 
SolrCloud's and Zookeeper's, do I need to setup? Is Replication in 
SolrCloud easy to setup as that in old version? (I setup replication 
solrconfig.xml and use solrcore.properties file to setup/switch roles in 
Solr node, rather than defining role directly in solrconfig.xml)

*



You need at least 3 ZK nodes to form a quorum. How many SolrClouds you 
need will depend on how you decide to shard and replicate your data. 
There is no single answer to this - it depends on various factors 
including query load, query complexity, source data size, indexing 
strategy...you should read this page. 
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/ 
You can run more than one Solr node per machine, but if that machine 
dies then your failover setup must be able to cope.


The *only* sensible way to figure out how many nodes you need is to try 
out a prototype system. I would guesstimate it will be less than 10 
nodes but don't hold me to that! Doing this will also teach you a lot 
about ZK and SolrCloud - you're not going to be able to avoid some 
learning here. Don't avoid looking at SolrCloud just because it involves 
ZK, the advantages outweigh the learning curve IMO.



3> If I wish to create another Solr engine with one or two
particular paper types. Is it possible to copy their index data
directly from the big central Solr engine? Or I have to rebuild
index from raw articles data? (Our business has this possibility
of needs.)


Yes, I guess so, but why copy it when you could just search it with
a filter for the paper types?

*
We have a special biz case called 'buyout newspaper search service'.
Customers buy intranet license to use search service for articles of
some newspaper types and some range of publish dates, e.g. paper
type 'A' for 2010-2012 and paper type 'B' for 2015. The buyout means
we have to install the whole search service at the customer site and the
customer can only use the search service within their enterprise intranet
environment. So you know, I have to build a special Solr server for each of
such customers. Your idea of filtering is very much like Elasticsearch's
multitenancy, and neither fits our buyout biz model. Do you have any
suggestion for building a Solr server in such a condition?
*

Error

2016-05-11 Thread Midas A
Hi, I am getting the following error:

org.apache.solr.common.SolrException: Error opening new searcher.
exceeded limit of maxWarmingSearchers=2, try again later.



What should I do to remove it?


URL parameters combined with text param

2016-05-11 Thread Bastien Latard - MDPI AG

Hi Everybody,

Is there a way to pass only some of the data by reference and some 
others in the q param?


e.g.:

q1.   http://localhost:8983/solr/my_core/select?{!q.op=OR 
v=$a}&fl=abstract,title&a=hospital Leapfrog&debug=true


q1a.  http://localhost:8983/solr/my_core/select?q=hospital AND 
Leapfrog&fl=abstract,title


q2.  http://localhost:8983/solr/my_core/select?q=hospital AND 
({!q.op=AND v=$a})&fl=abstract,title&a=hospital Leapfrog


q1 & q1a  are returning the same results, but q2 is somehow not 
analyzing the $a parameter properly...


Am I missing anything?

Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



Re: [scottchu] What kind of configuration to use for this size of news data?

2016-05-11 Thread Toke Eskildsen
On Wed, 2016-05-11 at 11:27 +0800, scott.chu wrote:
> I want to build a Solr engine for over 60-year news articles. My
> requests are (I use Solr 5.4.1):

Charlie Hull has given you a fine answer, which I agree with fully, so
I'll just add a bit from our experience.

We are running a similar service for Danish newspapers. We have 16M
OCR'ed pages, split into 250M+ articles, for 1.4TB total index size.
Everything in a single shard on a 64GB machine with SSDs.

We do faceting, range faceting and grouping as part of basic search.
That works okay (sub-second response times) for the bulk of our
requests, but when the hitCount gets above 10M, performance gets poor.
For the real heavy hitters, basically matching everything, we encounter
20 second response times.

This is not acceptable, so we will be switching to SolrCloud and
multiple shards (on the same machine, as our bottleneck is single
CPU-core performance). However, you have a smaller corpus and the growth
rate does not look alarming.


Putting all this together, I would advice you to try and put everything
in a single shard to avoid the overhead of distributed search. If that
performs well enough for single queries, then add replicas with
SolrCloud to get redundancy and scale throughput. Should you need to
shard at a later time, this will be easy with SolrCloud.

- Toke Eskildsen, State and University Library, Denmark




Re: How to search string

2016-05-11 Thread Ahmet Arslan
Hi,

You can be explicit about the field that you want to search on. e.g. 
q=product_name:(Garmin Class A)

Or you can use the lucene query parser with the default field (df) parameter,
e.g. q={!lucene df=product_name}Garmin Class A

It's all about query parsers.
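For instance, to require the whole phrase in that one field:

q=product_name:"Garmin Class A"

or, with edismax pinned to a single field:

q="Garmin Class A"&defType=edismax&qf=product_name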

Ahmet


On Wednesday, May 11, 2016 9:12 AM, kishor  wrote:
I want to search for a product whose name is "Garmin Class A", so I expect
results whose product name matches the string "Garmin Class A", but it
searches each term separately and I don't know why or how that happens.
Please guide me how to search a string in one field only, not in other
fields.

"debug": {
  "rawquerystring": "Garmin Class A",
  "querystring": "Garmin Class A",
  "parsedquery": "(+(DisjunctionMaxQuery((product_name:Garmin)) DisjunctionMaxQuery((product_name:Class)) DisjunctionMaxQuery((product_name:A))) ())/no_coord",
  "parsedquery_toString": "+((product_name:Garmin) (product_name:Class) (product_name:A)) ()",
  "explain": {},



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-search-string-tp4276052.html
Sent from the Solr - User mailing list archive at Nabble.com. 


Re: How to search in solr for words like %rek Dr%

2016-05-11 Thread Ahmet Arslan
Hi Thrinadh,

Why don't you use a plain wildcard search? There are two operators, star and
question mark, for this purpose.

Ahmet


On Wednesday, May 11, 2016 4:31 AM, Thrinadh Kuppili  
wrote:
Thank you. Yes, I am aware that surrounding with quotes will result in a match
for the space, but I am trying to match words based on input which can't be
controlled. I need to search Solr for %rek Dr% and return all results which
contain "rek Dr" without quotes.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-search-in-solr-for-words-like-rek-Dr-tp4275854p4276027.html

Sent from the Solr - User mailing list archive at Nabble.com.


Reply: Re: Transforming SolrDocument to SolrInputDocument in Solr 6.0

2016-05-11 Thread Stephan Schubert
I just tried the method. It throws an exception right after passing a
SolrDocument from a Solr response to it.

My source code:

SolrDocument currentDoc = DocumentList.get(f);
DocumentObjectBinder binder = new DocumentObjectBinder();
SolrInputDocument inputDoc = binder.toSolrInputDocument(currentDoc);

Exception:

Exception in thread "main"
org.apache.solr.client.solrj.beans.BindingException: class: class
org.apache.solr.common.SolrDocument does not define any fields.
at
org.apache.solr.client.solrj.beans.DocumentObjectBinder.toSolrInputDocument(DocumentObjectBinder.java:78)


I don't know why this error is coming up, as the fields are definitely
filled (the response came directly from Solr and I checked the fields too).

I'm now just using the old code:

public static SolrInputDocument toSolrInputDocument(SolrDocument d) {
    SolrInputDocument doc = new SolrInputDocument();
    for (String name : d.getFieldNames()) {
        doc.addField(name, d.getFieldValue(name), 1.0f);
    }
    return doc;
}

Seems that the binder is throwing the error here:

public SolrInputDocument toSolrInputDocument(Object obj) {
  List<DocField> fields = getDocFields(obj.getClass());
  if (fields.isEmpty()) {
    // getDocFields() collects @Field-annotated members of the bean class;
    // a plain SolrDocument carries no such annotations, hence the exception.
    throw new BindingException("class: " + obj.getClass() + " does not define any fields.");
  }
  ...



From:    Erick Erickson
To:      solr-user
Date:    10.05.2016 17:49
Subject: Re: Transforming SolrDocument to SolrInputDocument in Solr 6.0



Hmm, looking at the patch I see:

DocumentObjectBinder binder = new DocumentObjectBinder();
.
.
.

SolrInputDocument solrInputDoc = binder.toSolrInputDocument(in);

But I confess I didn't actually try it.

On Tue, May 10, 2016 at 8:41 AM, Stephan Schubert
 wrote:
> In Solr 6.0 the method ClientUtils.toSolrInputDocument() was removed
> (deprecated since 5.5.1, see
> https://issues.apache.org/jira/browse/SOLR-8339). What is the best way 
now
> to transform a SolrDocument into a SolrInputDocument?
>
> Mit freundlichen Grüßen / Best regards
>
> Stephan Schubert
> Senior Web Application Engineer  |   IT Engineering Information Oriented
> Applications
>
>
>
> SICK AG  |  Erwin-Sick-Str. 1  |  79183 Waldkirch  |  Germany
> Phone +49 7681 202-3751  |  stephan.schub...@sick.de  |  
http://www.sick.de



Mit freundlichen Grüßen / Best regards

Stephan Schubert
Senior Web Application Engineer | IT Engineering
Information Oriented Applications

SICK AG | Erwin-Sick-Str. 1 | 79183 Waldkirch | Germany
Phone  +49 7681 202-3751 | Fax  | mailto:stephan.schub...@sick.de | 
http://www.sick.de
 



Re: [scottchu] What kind of configuration to use for this size ofnews data?

2016-05-11 Thread scott.chu

I just found the mailing list doesn't seem to accept colorful fonts (because I
received my own letter from the list and saw the blue colors were gone!). I am
using rows of asterisks to highlight my questions and sending this again.



- Original Message - 
From: scott(自己) 
To: solr-user 
To: 
Date: 2016/5/11 (週三) 17:34
Subject: Re: [scottchu] What kind of configuration to use for this size ofnews 
data?


Hi, Charlie,

Thanks first for your concrete answer. I have further questions as written 
in blue color below.

scott.chu,scott@udngroup.com
2016/5/11 (週三)
- Original Message - 
From: Charlie Hull 
To: solr-user@lucene.apache.org 
CC: 
Date: 2016/5/11 (週三) 16:21
Subject: Re: [scottchu] What kind of configuration to use for this size ofnews 
data?


On 11/05/2016 04:27, scott.chu wrote: 
> Fix some typos, add some words and resend same question => 
> 
> I want to build a Solr engine for over 60-year news articles. My 
> requests are (I use Solr 5.4.1): 

Hi Scott, 

We've actually done something very similar for the our client NLA Media 
Access in the UK, who handle licensing of most UK newspaper content. 
They have over 45m docs going back to 2006. 
> 
> 1> Currently over 10M no. of docs. 2> Currently over 60GB total data 
> size. 3> The no. of docs and data size will keep growing at the rate 
> of 1000 no. of docs(or 8MB size) per day. 4> There are totally 5-6 
> different newspaper types. 
> 
> My questions are: 1> Is it wokable enough just to use master-slave 
> model? Or should I turn to SolrCloud? (I ask this due to our system 
> management group never manage a distributed system before and they 
> also have no knowedge of Zookeeper, shards, etc. Also they don't know 
> how to backup/restore distributed data.) 

Workable yes, advisable no. You should get much better reliability & 
performance with SolrCloud once it's set up. Also, if you have 
replication set up correctly the need for backup/restore will be 
significantly reduced and may be unnecessary. 

We used master-slave for News UK's Solr setup (articles from The Times 
and other papers) but this was before SolrCloud had properly arrived. 
We'd only use master-slave rarely now. 


*
If I use SolrCloud, I know I have to setup Zookeeper. I know there're something 
called 'quorum' or 'ensemble' in Zookeeper terminologies. I also know there is 
a need for (2n+1) Zookeeper nodes per n SolrCloud nodes.  Is your case running 
one SolrCloud node per one machine (Whether PM or VM).  According to your 
experiences, how many nodes , including SolrCloud's and Zookeeper's, do I need 
to setup? Is Replication in SolrCloud easy to setup as that in old version? (I 
setup replication solrconfig.xml and use solrcore.properties file to 
setup/switch roles in Solr node, rather than defining role directly in 
solrconfig.xml)
*

> 2> Say if I choose Solrcloud anyway. I wish to keep one shard owning 
> one specific year of data. Can it be done? 

Yes it can, but it may not be a good idea. If a large proportion of your 
queries hit recent news you may find one shard dealing with more queries 
than the others and becoming overloaded. Here's a blog post we wrote a 
long time ago about this - ignore the name Xapian, this applies to Solr 
as well: 
http://www.flax.co.uk/blog/2009/04/25/distributed-search-and-partition-functions/
 

What configuration should 
> I do? (AFAIK, SolrCloud distributes data based on some intrinsic 
> routing algorithm.) 

You can choose how to route data at indexing time: 
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
 


*
You are right. I do neglect this condition.  I'll think twice and could drop 
out my idea. Thanks for sharing the blog article. I'll take a good look at it.
*

>3> If I wish to create another Solr engine with 
> one or two particular paper types. Is it possible to copy their index 
> data directly from the big central Solr engine? Or I have to rebuild 
> index from raw articles data? (Our business has this possibility of 
> needs.) 

Yes, I guess so, but why copy it when you could just search it with a 
filter for the paper types? 

*
We have a special biz case called 'buyout newspaper search service'. Customers
buy intranet license to use search service for articles of some newspaper types
and some range of publish dates, e.g. paper type 'A' for 2010-2012 and paper
type 'B' for 2015. The buyout means we have to install the whole search service
at the customer site and the customer can only use the search service within
their enterprise intranet environment. So you know, I have to build a special
Solr server for each of such customers. Your idea of filtering is very much
like Elasticsearch's multitenancy, and neither fits our buyout biz model. Do
you have any suggestion for building a Solr server in such a condition?
*

Re: [scottchu] What kind of configuration to use for this size ofnews data?

2016-05-11 Thread scott.chu
Hi, Charlie,

Thanks first for your concrete answer. I have further questions as written 
in blue color below.

scott.chu,scott@udngroup.com
2016/5/11 (週三)
- Original Message - 
From: Charlie Hull 
To: solr-user@lucene.apache.org 
CC: 
Date: 2016/5/11 (週三) 16:21
Subject: Re: [scottchu] What kind of configuration to use for this size ofnews 
data?


On 11/05/2016 04:27, scott.chu wrote: 
> Fix some typos, add some words and resend same question => 
> 
> I want to build a Solr engine for over 60-year news articles. My 
> requests are (I use Solr 5.4.1): 

Hi Scott, 

We've actually done something very similar for the our client NLA Media 
Access in the UK, who handle licensing of most UK newspaper content. 
They have over 45m docs going back to 2006. 
> 
> 1> Currently over 10M no. of docs. 2> Currently over 60GB total data 
> size. 3> The no. of docs and data size will keep growing at the rate 
> of 1000 no. of docs(or 8MB size) per day. 4> There are totally 5-6 
> different newspaper types. 
> 
> My questions are: 1> Is it wokable enough just to use master-slave 
> model? Or should I turn to SolrCloud? (I ask this due to our system 
> management group never manage a distributed system before and they 
> also have no knowedge of Zookeeper, shards, etc. Also they don't know 
> how to backup/restore distributed data.) 

Workable yes, advisable no. You should get much better reliability & 
performance with SolrCloud once it's set up. Also, if you have 
replication set up correctly the need for backup/restore will be 
significantly reduced and may be unnecessary. 

We used master-slave for News UK's Solr setup (articles from The Times 
and other papers) but this was before SolrCloud had properly arrived. 
We'd only use master-slave rarely now. 

If I use SolrCloud, I know I have to setup Zookeeper. I know there're something 
called 'quorum' or 'ensemble' in Zookeeper terminologies. I also know there is 
a need for (2n+1) Zookeeper nodes per n SolrCloud nodes.  Is your case running 
one SolrCloud node per one machine (Whether PM or VM).  According to your 
experiences, how many nodes , including SolrCloud's and Zookeeper's, do I need 
to setup? Is Replication in SolrCloud easy to setup as that in old version? (I 
setup replication solrconfig.xml and use solrcore.properties file to 
setup/switch roles in Solr node, rather than defining role directly in 
solrconfig.xml)

> 2> Say if I choose Solrcloud anyway. I wish to keep one shard owning 
> one specific year of data. Can it be done? 

Yes it can, but it may not be a good idea. If a large proportion of your 
queries hit recent news you may find one shard dealing with more queries 
than the others and becoming overloaded. Here's a blog post we wrote a 
long time ago about this - ignore the name Xapian, this applies to Solr 
as well: 
http://www.flax.co.uk/blog/2009/04/25/distributed-search-and-partition-functions/
 

What configuration should 
> I do? (AFAIK, SolrCloud distributes data based on some intrinsic 
> routing algorithm.) 

You can choose how to route data at indexing time: 
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
 

You are right. I do neglect this condition.  I'll think twice and could drop 
out my idea. Thanks for sharing the blog article. I'll take a good look at it.

>3> If I wish to create another Solr engine with 
> one or two particular paper types. Is it possible to copy their index 
> data directly from the big central Solr engine? Or I have to rebuild 
> index from raw articles data? (Our business has this possibility of 
> needs.) 

Yes, I guess so, but why copy it when you could just search it with a 
filter for the paper types? 

We have a special biz case called 'buyout newspaper search service'. Customers
buy intranet license to use search service for articles of some newspaper types
and some range of publish dates, e.g. paper type 'A' for 2010-2012 and paper
type 'B' for 2015. The buyout means we have to install the whole search service
at the customer site and the customer can only use the search service within
their enterprise intranet environment. So you know, I have to build a special
Solr server for each of such customers. Your idea of filtering is very much
like Elasticsearch's multitenancy, and neither fits our buyout biz model.
Do you have any suggestion for building a Solr server in such a condition?

> 
> I'd like to hear and use some well suggestion and experiences. 
> 
> Thanks in advance and best regards. 
> 
> Scott Chu @ 2016/5/11 11:26 GMT+8 
> 

Hope this helps! 

Cheers 

Charlie 

-- 
Charlie Hull 
Flax - Open Source Enterprise Search 

tel/fax: +44 (0)8700 118334 
mobile: +44 (0)7767 825828 
web: www.flax.co.uk 




[scottchu] What is the host name for? can we change 'solr' in request url to some other name?

2016-05-11 Thread scott.chu

I see there's a -h option for the bin\solr start command. What's that for? When
we create a core, say 'abc', the request url is something like
http://<host>/solr/abc. I'd like to change 'solr' to some other name; how can I
do it?


Nested grouping or equivalent.

2016-05-11 Thread Callum Lamb
We have a horrible Solr query that groups by a field and then sorts by
another. My understanding is that for this to happen it has to sort by the
grouping field, group it and then sort the resulting result set. It's not a
fast query.

Unfortunately our documents now need to be grouped as well (product
variants into items) and that grouping query needs to work on that grouping
instead. As far as I'm aware you can't do nested grouping in Solr.

In summary we want to have product variants that get grouped into Items and
then they get grouped by field and then sorted by another.

The solution doesn't need to be fast; it's a rarely used legacy part
of our application and we just need it to work.
Our dataset isn't huge, so it doesn't matter if Solr has to scan the entire
index (I think the query does this at the moment anyway). But downloading the
entire document set and doing the operations in ETL isn't something we really
want to dedicate time to unless it's impossible to represent this in Solr
queries.

Any ideas?

Cheers,

Callum.




Re: [scottchu] What kind of configuration to use for this size of news data?

2016-05-11 Thread Charlie Hull

On 11/05/2016 04:27, scott.chu wrote:

Fix some typos, add some words and resend same question =>

I want to build a Solr engine for over 60 years of news articles. My
requirements are (I use Solr 5.4.1):


Hi Scott,

We've actually done something very similar for the our client NLA Media
Access in the UK, who handle licensing of most UK newspaper content.
They have over 45m docs going back to 2006.


1> Currently over 10M docs.
2> Currently over 60GB total data size.
3> The number of docs and data size will keep growing at a rate of about
1000 docs (or 8MB) per day.
4> There are 5-6 different newspaper types in total.

My questions are:
1> Is it workable enough just to use the master-slave model? Or should I
turn to SolrCloud? (I ask this because our system management group has
never managed a distributed system before and has no knowledge of
Zookeeper, shards, etc. They also don't know how to backup/restore
distributed data.)


Workable yes, advisable no. You should get much better reliability & 
performance with SolrCloud once it's set up. Also, if you have 
replication set up correctly the need for backup/restore will be 
significantly reduced and may be unnecessary.


We used master-slave for News UK's Solr setup (articles from The Times 
and other papers) but this was before SolrCloud had properly arrived. 
We'd only use master-slave rarely now.



2> Say I choose SolrCloud anyway. I wish to keep one shard owning
one specific year of data. Can it be done?


Yes it can, but it may not be a good idea. If a large proportion of your 
queries hit recent news you may find one shard dealing with more queries 
than the others and becoming overloaded. Here's a blog post we wrote a 
long time ago about this - ignore the name Xapian, this applies to Solr 
as well: 
http://www.flax.co.uk/blog/2009/04/25/distributed-search-and-partition-functions/


What configuration should

I do? (AFAIK, SolrCloud distributes data based on some intrinsic
routing algorithm.)


You can choose how to route data at indexing time:
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
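
For illustration, routing with the default compositeId router looks like
this in SolrJ - a minimal sketch; the collection and field names are
illustrative only. Everything before the "!" in the id is hashed to choose
the shard, so all documents sharing a prefix (e.g. a publish year) are
co-located, though the hash decides which shard that is:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class RoutedIndexer {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient("127.0.0.1:9983")) {
      client.setDefaultCollection("news");
      SolrInputDocument doc = new SolrInputDocument();
      // "2010!" is the shard key: every id with this prefix hashes to
      // the same shard under the compositeId router.
      doc.addField("id", "2010!article-12345");
      doc.addField("title", "Example headline");
      client.add(doc);
      client.commit();
    }
  }
}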


3> If I wish to create another Solr engine with
one or two particular paper types. Is it possible to copy their index
data directly from the big central Solr engine? Or I have to rebuild
index from raw articles data? (Our business has this possibility of
needs.)


Yes, I guess so, but why copy it when you could just search it with a 
filter for the paper types?
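
For example, a licensed subset can be carved out at query time with filter
queries - a minimal SolrJ sketch, where paper_type, publish_date and
headline are assumed field names:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class PaperTypeFilter {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient("127.0.0.1:9983")) {
      client.setDefaultCollection("news");
      SolrQuery q = new SolrQuery("headline:election");
      // Restrict to the licensed paper type and date range; filter
      // queries are cached, so the repeated restriction stays cheap.
      q.addFilterQuery("paper_type:A");
      q.addFilterQuery("publish_date:[2010-01-01T00:00:00Z TO 2012-12-31T23:59:59Z]");
      System.out.println(client.query(q).getResults().getNumFound());
    }
  }
}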


I'd like to hear some good suggestions and experiences.

Thanks in advance and best regards.

Scott Chu @ 2016/5/11  11:26 GMT+8



Hope this helps!

Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Advice to add additional non-related fields to a collection or create a subset of it?

2016-05-11 Thread Mikhail Khludnev
On Wed, May 11, 2016 at 10:16 AM, Derek Poh  wrote:

> Hi Erick
>
> Yes we have identified and fixed the page slow loading.
>

Derek,
Can you elaborate more? What did you fix?


>
> I was wondering if there are any best practices when it comes to deciding
> to create a single collection that stores all information in it or create
> multiple sub-collections. I understand now it depends on the use-case.
> My apologies for not giving it much thought before asking the questions.
> Thank you for your patience.
>
> - Derek
>
>
> On 5/10/2016 12:10 PM, Erick Erickson wrote:
>
>> Not quite sure where you are at with this. It sounds
>> like your slow loading is fixed and was a coding
>> issue on your part, that happens to us all.
>>
>> bq: Is it advisable to have as few queries to Solr in a page as possible?
>>
>> Of course it is advisable to have as few Solr queries
>> executed to display a page as possible. Every one
>> costs you at least _some_ turnaround time. You can
>> mitigate this (assuming your Solr server isn't running
>> flat out) by issuing the subsequent queries in parallel
>> threads.
>>
>> But it's not really a question to me of advisability, it's a
>> question of what your application needs to deliver. The
>> use-case drives all. You can do some tricks like display
>> partial pages and fill in the rest behind the scenes to
>> display when your user clicks something and the like.
>>
>> bq: In my case, by denormalizing, that means putting the
>> product and supplier information into one collection?
>> The supplier information is stored but not indexed in the collection.
>>
>> It Depends(tm). If all you want to do is provide supplier
>> information when people do product searches then stored-only
>> is fine.
>>
>> If you want to perform queries like "show me all the products
>> supplied by supplier X", then you need to index at least
>> some values too.
>>
>> Best,
>> Erick
>>
>> On Sun, May 8, 2016 at 10:36 PM, Derek Poh 
>> wrote:
>>
>>> Hi Erick
>>>
>>> In my case, by denormalizing, that means putting the product and supplier
>>> information into one collection?
>>> The supplier information is stored but not indexed in the collection.
>>>
>>> We have identified it was a combination of a loop and bad source data that
>>> caused an endless loop under a certain scenario.
>>>
>>> Is it advisable to have as few queries to Solr in a page as possible?
>>>
>>>
>>> On 5/6/2016 11:17 PM, Erick Erickson wrote:
>>>
 Denormalizing the data is usually the first thing to try. That's
 certainly the preferred option if it doesn't bloat the index
 unacceptably.

 But my real question is what have you done to try to figure out _why_
 it's slow? Do you have some loop
 like
 for (each found document)
  extract all the supplier IDs and query Solr for them)

 ? That's a fundamental design decision that will be expensive.

 Have you examined the time each query takes to see if Solr is really
 the bottleneck or whether it's "something else"? Mind you, I have no
 clue what "something else" is here

 Do you ever return lots of rows (i.e. thousands)?

 Solr serves queries very quickly, so I'd concentrate on identifying what
 is slow before jumping to a solution

 Best,
 Erick

 On Wed, May 4, 2016 at 10:28 PM, Derek Poh 
 wrote:

> Hi
>
> We have a "product" collection and a "supplier" collection.
> The "product" collection contains products information and "supplier"
> collection contains the product's suppliers information.
> We have a subsidiary page that queries the "product" collection for the
> search.
> The displayed results include product and supplier information.
> This page will query the "product" collection to get the matching product
> records.
> From this query a list of the matching products' supplier ids is
> extracted and used in a filter query against the "supplier" collection to
> get the necessary supplier information.
>
> The loading of this page is very slow; it leads to timeouts at times as
> well.
> Besides looking at tweaking the code of the page, we are also looking at
> what tweaking can be done on the Solr side. Reducing the number of
> queries generated by this page was one of the options to try.
>
> The main "product" collection is also used by our site's main search page
> and other subsidiary pages as well. So the query load on it is substantial.
> It has about 6.5 million documents and index size of 38-39 GB.
> It is set up as 1 shard with 5 replicas. Each replica is on its own
> server.
> Total of 5 servers.
> There are other smaller collections with a similar 1-shard 5-replica
> setup residing on these servers as well.
>
> I am thinking of either
> 1. Index supplier information into the "product" collection.
> 2. Create another similar

Shard data using lat long range

2016-05-11 Thread chandan khatri
Hi All,

I have an application with location-based data. The data is expected to
grow rapidly, and search is also based on location, i.e. the search is done
using a geospatial distance range.

I am wondering what is the best possible way to shard the index. Any
pointer/input is highly appreciated.

Thanks,
Chandan
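
One scheme worth considering, if co-locating nearby documents matters, is
to derive a coarse grid cell from the coordinates and use it as a
compositeId shard key. A minimal SolrJ sketch - the collection and field
names are assumptions, and note that hot regions can make shards uneven:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class GeoRoutedIndexer {

  // Coarse 10x10-degree grid cell used as the compositeId shard key,
  // so documents near each other tend to land on the same shard.
  static String gridCell(double lat, double lon) {
    return (int) Math.floor(lat / 10.0) + "_" + (int) Math.floor(lon / 10.0);
  }

  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient("127.0.0.1:9983")) {
      client.setDefaultCollection("places");
      double lat = 51.51, lon = -0.13;
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", gridCell(lat, lon) + "!place-42");
      doc.addField("location", lat + "," + lon);
      client.add(doc);
      client.commit();
    }
  }
}

Distance-range queries still fan out to all shards by default; the gain
comes from optionally targeting the relevant cell(s) with the _route_
parameter at query time, while accepting that queries near a cell boundary
will span shards.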


Re: Advice to add additional non-related fields to a collection or create a subset of it?

2016-05-11 Thread Derek Poh

Hi Erick

Yes we have identified and fixed the page slow loading.

I was wondering if there are any best practices when it comes to 
deciding to create a single collection that stores all information in it 
or create multiple sub-collections. I understand now it depends on the 
use-case.

My apologies for not giving it much thought before asking the questions.
Thank you for your patience.

- Derek

On 5/10/2016 12:10 PM, Erick Erickson wrote:

Not quite sure where you are at with this. It sounds
like your slow loading is fixed and was a coding
issue on your part, that happens to us all.

bq: Is it advisable to have as few queries to Solr in a page as possible?

Of course it is advisable to have as few Solr queries
executed to display a page as possible. Every one
costs you at least _some_ turnaround time. You can
mitigate this (assuming your Solr server isn't running
flat out) by issuing the subsequent queries in parallel
threads.
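
A minimal sketch of that parallel pattern in SolrJ, assuming Java 8; the
collection and field names are illustrative:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ParallelQueries {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try (CloudSolrClient client = new CloudSolrClient("127.0.0.1:9983")) {
      client.setDefaultCollection("product");
      // Fire both lookups concurrently so page latency is roughly
      // max(t1, t2) instead of t1 + t2.
      Future<QueryResponse> products =
          pool.submit(() -> client.query(new SolrQuery("name:widget")));
      Future<QueryResponse> suppliers =
          pool.submit(() -> client.query("supplier", new SolrQuery("supplier_id:(S1 OR S2)")));
      System.out.println(products.get().getResults().getNumFound());
      System.out.println(suppliers.get().getResults().getNumFound());
    } finally {
      pool.shutdown();
    }
  }
}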

But it's not really a question to me of advisability, it's a
question of what your application needs to deliver. The
use-case drives all. You can do some tricks like display
partial pages and fill in the rest behind the scenes to
display when your user clicks something and the like.

bq: In my case, by denormalizing, that means putting the
product and supplier information into one collection?
The supplier information is stored but not indexed in the collection.

It Depends(tm). If all you want to do is provide supplier
information when people do product searches then stored-only
is fine.

If you want to perform queries like "show me all the products
supplied by supplier X", then you need to index at least
some values too.
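
A quick illustration of that distinction - supplier_id here is an assumed
field name that must be indexed for the query to match:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class ProductsBySupplier {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient("127.0.0.1:9983")) {
      client.setDefaultCollection("product");
      // Matches only if supplier_id is indexed; a stored-only field
      // can be returned with results but never queried against.
      SolrQuery q = new SolrQuery("supplier_id:S1");
      System.out.println(client.query(q).getResults().getNumFound());
    }
  }
}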

Best,
Erick

On Sun, May 8, 2016 at 10:36 PM, Derek Poh  wrote:

Hi Erick

In my case, by denormalizing, that means putting the product and supplier
information into one collection?
The supplier information is stored but not indexed in the collection.

We have identified it was a combination of a loop and bad source data that
caused an endless loop under a certain scenario.

Is it advisable to have as few queries to Solr in a page as possible?


On 5/6/2016 11:17 PM, Erick Erickson wrote:

Denormalizing the data is usually the first thing to try. That's
certainly the preferred option if it doesn't bloat the index
unacceptably.

But my real question is what have you done to try to figure out _why_
it's slow? Do you have some loop
like
for (each found document)
 extract all the supplier IDs and query Solr for them)

? That's a fundamental design decision that will be expensive.

Have you examined the time each query takes to see if Solr is really
the bottleneck or whether it's "something else"? Mind you, I have no
clue what "something else" is here

Do you ever return lots of rows (i.e. thousands)?

Solr serves queries very quickly, so I'd concentrate on identifying what
is slow before jumping to a solution

Best,
Erick

On Wed, May 4, 2016 at 10:28 PM, Derek Poh  wrote:

Hi

We have a "product" collection and a "supplier" collection.
The "product" collection contains products information and "supplier"
collection contains the product's suppliers information.
We have a subsidiary page that queries the "product" collection for the
search.
The displayed results include product and supplier information.
This page will query the "product" collection to get the matching product
records.
  From this query a list of the matching products' supplier ids is
extracted and used in a filter query against the "supplier" collection to
get the necessary supplier information.

The loading of this page is very slow; it leads to timeouts at times as
well.
Besides looking at tweaking the code of the page, we are also looking at
what tweaking can be done on the Solr side. Reducing the number of queries
generated by this page was one of the options to try.

The main "product" collection is also used by our site's main search page
and other subsidiary pages as well. So the query load on it is substantial.
It has about 6.5 million documents and index size of 38-39 GB.
It is set up as 1 shard with 5 replicas. Each replica is on its own
server.
Total of 5 servers.
There are other smaller collections with a similar 1-shard 5-replica setup
residing on these servers as well.

I am thinking of either
1. Index supplier information into the "product" collection.
2. Create another similar "product" collection for this page to use. This
collection will have fewer product fields and will include the required
supplier fields. But the number of documents in it will be the same as the
main "product" collection. The index size will be smaller though.

With either of the 2 options we do not need to query the "supplier"
collection. So there is one less query, and hopefully it will improve the
performance of this page.
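
Option 1 would look roughly like this at index time - a sketch with assumed
field names, copying the needed supplier fields onto each product document:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DenormalizedProduct {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient("127.0.0.1:9983")) {
      client.setDefaultCollection("product");
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "product-1001");
      doc.addField("name", "Widget");
      // Supplier fields denormalized onto the product document, so the
      // page needs one query instead of a second supplier lookup.
      doc.addField("supplier_id", "S1");
      doc.addField("supplier_name", "Acme Trading Co");
      client.add(doc);
      client.commit();
    }
  }
}

If the supplier fields are stored-only they can be displayed but not
searched; index them too if supplier-based queries are needed.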

What is your advice between the 2 options?
Any other advice or options?

Derek
