Indicating missing query terms in response

2020-11-08 Thread adfel70
Since a Solr query result set may contain documents that do not include all of
the search terms, we were wondering whether it is possible to get, as part of
the response, an indication of which terms were missing.

For example, if our index has the following indexed doc:

{ 
"title": "hello"
}

(assuming 'title' is a text_general field)

The following query *q=hello world&qf=title&defType=edismax&mm=1* will retrieve
the doc even though the search term 'world' is missing. Is there a built-in
capability to indicate this to the user, so she could refine the query
afterward, say with +world?
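
A possible client-side workaround, sketched here with this example's field and
terms (not a built-in indicator): request a termfreq() pseudo-field per query
term, so the client can see which terms are absent from each returned doc. The
collection URL is a placeholder.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class MissingTermsCheck {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();

        SolrQuery q = new SolrQuery("hello world");
        q.set("defType", "edismax");
        q.set("qf", "title");
        q.set("mm", "1");
        // one pseudo-field per query term; termfreq() is 0 when the term is absent
        q.setFields("id", "tf_hello:termfreq(title,'hello')", "tf_world:termfreq(title,'world')");

        QueryResponse rsp = client.query(q);
        for (SolrDocument doc : rsp.getResults()) {
            Number tfWorld = (Number) doc.getFieldValue("tf_world");
            if (tfWorld != null && tfWorld.intValue() == 0) {
                System.out.println(doc.getFieldValue("id") + " is missing 'world'");
            }
        }
        client.close();
    }
}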





Re: Potential authorization bug when making HTTP requests

2019-05-15 Thread adfel70
Opened SOLR-13472   





Re: Potential authorization bug when making HTTP requests

2019-05-04 Thread adfel70
Hi Jan,
Thanks for the reply.
I am not sure it is exactly the same issue; also, we are testing with Solr
7.7.1 and the issue still occurs.





Potential authorization bug when making HTTP requests

2019-05-02 Thread adfel70
Authorization bug (?) when making HTTP requests

We are experiencing a problem when making HTTP requests to a cluster with
the authorization plugin enabled.
Permissions are configured in security.json as follows:

{
 ... authentication_settings ...
  "authorization":{
  "class":"solr.RuleBasedAuthorizationPlugin",  
  "permissions":[
{
  "name": "read",
  "role": "*"
},
{
  "name": "update",
  "role": ["indexer", "admin"]
},
{
  "name": "all",
  "role": "admin"
}
  ],
  "user-role": {
"admin_user": "admin",
"indexer_app": "indexer"
  }
}

Our goal is to give all users read-only access to the data, read+write
permissions to indexer_app user and full permissions to admin_user.

While testing the ACLs with different users we encountered unclear results,
where in some cases a privileged user got an HTTP 403 (forbidden) response:
- only in some calls could the authenticated reader query the data.
- only in some calls could the 'indexer_app' user query or update the data.
- 'admin_user' worked as expected.
In addition, the problems were only relevant to plain HTTP requests - SolrJ
requests worked perfectly...

After further investigation we realized what seems to be the problem: HTTP
requests work correctly only when the collection has a core on the server
that got the initial request. For example, say we have a cloud made of 2
servers - 'host1' and 'host2' - and a collection 'test' with one shard, whose
core is on host1:

curl -u reader_user: "http://host1:8983/solr/test/select?q=*:*"
--> code 200, as expected
curl -u reader_user: "http://host2:8983/solr/test/select?q=*:*"
--> code 403 (error - should return results)

curl -u indexer_app: "http://host1:8983/solr/test/select?q=*:*"
--> code 200, as expected
curl -u indexer_app: "http://host1:8983/solr/test/update?commit=true"
--> code 200, as expected
curl -u indexer_app: "http://host2:8983/solr/test/select?q=*:*"
--> code 403 (error - should return results)
curl -u indexer_app: "http://host2:8983/solr/test/update?commit=true"
--> code 403 (error - should accept the update)
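
For reference, the SolrJ path that works looks roughly like this (a sketch:
the ZooKeeper addresses and credentials are placeholders; the client goes
through ZooKeeper rather than a fixed node, which may itself be why it never
hits the bad case):

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SolrjAuthCheck {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build();

        QueryRequest req = new QueryRequest(new SolrQuery("*:*"));
        // same credentials as the curl examples above
        req.setBasicAuthCredentials("reader_user", "reader_password");

        QueryResponse rsp = req.process(client, "test");
        System.out.println("numFound=" + rsp.getResults().getNumFound());
        client.close();
    }
}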

We guess 'admin_user' does not encounter any errors because of the special
'all' permission.
We tested with both the Basic and Kerberos authentication plugins and got the
same behaviour.
Should this be opened as a bug?
Thanks.





Solr failed to start after configuring Kerberos authentication

2018-05-24 Thread adfel70
Hi,
We are trying to configure Kerberos authentication for Solr 6.5.1.
We followed the steps described in Solr's reference guide, but after
restarting we are getting the following error:

org.apache.zookeeper.client.ZookeeperSaslClient; An error:
(java.security.PrivilegedActionException: javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided
(Mechanism level: Server not found in Kerberos database (7))] occurred when
evaluating Zookeeper Quorum Member’s received SASL token. Zookeeper Client
will go to AUTH_FAILED state.

We tested both of our keytab files (Zookeeper’s and Solr’s) using kinit and
everything looks fine.

Our ZooKeeper is not configured with Kerberos yet, and the 'ruok' command
responds with 'imok' as expected.

When examining ZooKeeper's logs we see the following:
Successfully logged in.
TGT refresh thread started.
TGT valid starting at:  Thu May 21:39:10 ...
TGT expires:   Fri May 25 07:39:44 ...
TGT refresh sleeping until: Fri May 25 05:55:44 ...

Any idea what we can do?
Thanks.





Do streaming expressions support range facets?

2017-04-03 Thread adfel70
Specifically, date ranges?

I would like to perform some kind of OLAP cube over the data in Solr, and I
am looking at streaming expressions for this.






Streaming expressions - Any plans to add one to many fetches to the fetch decorator?

2017-03-27 Thread adfel70
Any ideas how to work around this with the current streaming capabilities?







Streaming expressions and result transformers

2017-03-26 Thread adfel70
Hi,
do streaming expressions support doc transformers?
To be more specific, I have a nested-documents data model.
I want to use streaming expressions and get the results with
ChildDocTransformerFactory. Is that possible?





Re: Simple sql query with where clause doesn't work

2017-03-12 Thread adfel70
It seems like this only happens when the value is not a number.


curl --data-urlencode 'stmt=select fieldA from collection where field='123''
http://host:port/solr/collection/sql?aggregationMode=facet

works, while this one doesn't:

curl --data-urlencode 'stmt=select fieldA from collection where field='abc''
http://host:port/solr/collection/sql?aggregationMode=facet

Again, I get the same message: "no field name specified in query and no
default specified via df param".

I tried this on multiple field types.
Example field settings: type=string, indexed=true, stored=true,
omitNorms=true, multiValued=false, docValues=true.

Note that this collection was indexed as nested documents, but while trying
the SQL I'm not using anything related to the nested format (except that
the data itself was indexed this way).
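
One thing worth double-checking, as an aside: in the curl above the inner
quotes around 'abc' are consumed by the shell, so Solr most likely receives
where field=abc with no quotes at all - a number survives that, a string
doesn't. A way to send the statement without fighting shell quoting is the
Solr JDBC driver; a sketch, with the zkhost being a placeholder:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SolrSqlCheck {
    public static void main(String[] args) throws Exception {
        // register the Solr JDBC driver shipped with SolrJ
        Class.forName("org.apache.solr.client.solrj.io.sql.DriverImpl");
        // collection and aggregationMode match the curl example above
        String url = "jdbc:solr://zkhost:2181?collection=collection&aggregationMode=facet";
        try (Connection con = DriverManager.getConnection(url);
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("select fieldA from collection where field = 'abc'")) {
            while (rs.next()) {
                System.out.println(rs.getString("fieldA"));
            }
        }
    }
}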





Simple sql query with where clause doesn't work

2017-03-12 Thread adfel70
Hi,
I'm trying to play with the /sql feature, working with Solr 6.4.2.

Running

curl --data-urlencode 'stmt=select fieldA from collection'
http://host:port/solr/collection/sql?aggregationMode=facet

works fine.

Running

curl --data-urlencode 'stmt=select fieldA from collection where
fieldB='value'' http://host:port/solr/collection/sql?aggregationMode=facet

doesn't work. It throws:
undefined field _text_

I don't have a _text_ field in the schema, but I also don't query on it, so I'm
wondering what the problem is...

Thanks.





Re: reindexing a solr collection of nested documents

2016-11-29 Thread adfel70
Does anyone have a clue?





reindexing a solr collection of nested documents

2016-11-27 Thread adfel70
Hi
I have a Solr collection of nested documents.
I would like to reindex this collection into a new collection, without running
the original process that created it.

If this were not a collection of nested documents, I would use the /export
handler to export all the documents and then reindex them.

Can the /export handler be used to retrieve the original nested structure
that was indexed?

Is there another way to do this?

Note that I want to make some schema modifications, so just copying the
index won't do.


Thanks.






Executing Collector's Collect method on more than one thread

2016-01-31 Thread adfel70
I am using RankQuery to implement my applicative scorer, which returns a score
based on the value of a specific field (let's call it 'score_field') that is
stored for every document.
The RankQuery creates a collector, and for every collected docId I retrieve
the value of score_field, calculate the score and add the doc id into a
priority queue:

public class MyScorerRankQuery extends RankQuery {
    ...

    @Override
    public TopDocsCollector getTopDocsCollector(int i,
            SolrIndexSearcher.QueryCommand cmd, IndexSearcher searcher) {
        ...
        return new MyCollector(...)
    }
}

public class MyCollector extends TopDocsCollector {
    MyScorer scorer;
    SortedDocValues scoreFieldValues;

    @Override
    public void collect(int id) {
        int docID = docBase + id;
        // 1. get the score_field value for the doc using DocValues and
        //    calculate the score using my scorer
        String value = scoreFieldValues.get(docID).utf8ToString();
        scorer.calcScore(value);
        // 2. add the docId and score (a ScoreDoc object) into the PriorityQueue
    }
}

The problem is that calcScore may take ~20 ms per call, so if a query returns
100,000 docs, which is not unusual, query execution time becomes more than
half an hour. Is there a way to parallelize the collector's logic, so that
more than one thread calls calcScore simultaneously?
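
One direction that may help, sketched below independently of the Lucene APIs:
keep collect() cheap by only buffering the docID and the field value, and
score the buffered entries in parallel once collection is done. This assumes
calcScore() is safe to call from multiple threads; MyScorer and Entry here are
stand-ins for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelScoringSketch {

    // stands in for the applicative scorer; assumed safe to call concurrently
    interface MyScorer {
        float calcScore(String fieldValue);
    }

    // what collect(int id) would buffer: the docID and the score_field value
    static class Entry {
        final int docId;
        final String fieldValue;
        volatile float score;
        Entry(int docId, String fieldValue) { this.docId = docId; this.fieldValue = fieldValue; }
    }

    // after collection is finished, score all buffered entries in parallel;
    // filling the priority queue afterwards stays single-threaded
    static void scoreAll(List<Entry> buffer, MyScorer scorer, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (Entry e : buffer) {
            pool.submit(() -> { e.score = scorer.calcScore(e.fieldValue); });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    public static void main(String[] args) throws InterruptedException {
        List<Entry> buffer = new ArrayList<>();
        buffer.add(new Entry(1, "a"));
        buffer.add(new Entry(2, "bb"));
        scoreAll(buffer, value -> value.length(), 4);
        for (Entry e : buffer) {
            System.out.println(e.docId + " -> " + e.score);
        }
    }
}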





Re: Read time out exception - exactly 10 minutes after starting committing

2016-01-27 Thread adfel70
I don't have any custom ShardHandler

Regarding the cache, I reduced it to zero, and I am checking performance now.

Thanks,





Re: Read time out exception - exactly 10 minutes after starting committing

2016-01-23 Thread adfel70
Thanks Shawn,

1. I am getting the "read time out" from the Solr server itself -
not from my client, but from the server-side client when it tries to reach
other instances while committing.

2. I reduced the filterCache autowarmCount to 512, and that seems to fix the
problem. It now takes only several seconds to commit!

Thanks a lot,





Read time out exception - exactly 10 minutes after starting committing

2016-01-21 Thread adfel70
I am running a soft commit on 100 Solr docs (the index itself has 3 billion
docs).
After EXACTLY 10 minutes (for example, committing starts at 15:52:55.932 and
the exception appears at 16:02:55.976) I am getting several exceptions of the sort:
org.apache.solr.client.solrj.SolrServerException: Timeout occured while
waiting response from server at: [any instance of solr] ...
Caused by: java.net.SocketTimeoutException: Read timed out

All the exceptions are from the same instance to other instances in the
cluster.

I have checked jetty.xml, the Java params and the Solr config - and
didn't find any place where a timeout of 10 minutes is configured.
I have tuned the filterCache, queryResultCache and documentCache to:
size="2048", initialSize="1024", autowarmCount="1024"
I created queries in newSearcher for warm-up.

I have 2 questions:
1. Can someone point me in the right direction for the timeout?
2. Why could the commit take so long?!

Thanks,





Re: CloudSolrCloud - Commit returns but not all data is visible (occasionally)

2015-11-24 Thread adfel70
After several days of running some use cases with 2 configurations, I can tell
you that the "PERFORMANCE WARNING: Overlapping onDeckSearchers" message appears
only with the maxWarmingSearchers=2 config and not with the
maxWarmingSearchers=5 config.
Unfortunately, the root problem still occurs!

I have reduced the filterCache autowarmCount to 512 and the problem seems
to be solved.

Will update if I find more relevant insights...

Thanks,





Re: CloudSolrCloud - Commit returns but not all data is visible (occasionally)

2015-11-17 Thread adfel70
Thanks Eric,
I'll try to play with the autowarm config.

But I have a more direct question - why does the commit return without
waiting until the searchers are fully refreshed?

Could it be that the parameter waitSearcher=true doesn't really work?
Or maybe I don't understand something here...

Thanks,






CloudSolrCloud - Commit returns but not all data is visible (occasionally)

2015-11-16 Thread adfel70
Hi,
I am using Solr 5.2.1 with the SolrJ client 5.2.1 (I know CloudSolrServer is
deprecated).

I am running the command:
cloudSolrServer.commit(false, true, true)
The parameters are: waitFlush (false), waitSearcher (true), softCommit (true).

The problem is that the client returns as if committing has already finished and
the searchers are refreshed, when actually not all the data is visible to
users (it takes several minutes more).

The thing is that I have to wait until the searchers are up and the new
data is visible to users before I finish my process (and I don't want to
put a 'sleep' in my code :-))

I have tried increasing the maxWarmingSearchers to 5 - it helped but the
problem still occasionally happens.

What could I configure more?
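
A possible workaround, for reference (a sketch: the client is assumed to
already point at the collection, and the expected count has to come from the
indexer's own bookkeeping - it still polls, but on an observable condition
rather than a fixed sleep):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;

public class WaitForVisibility {
    // Block until the collection reports at least expectedDocs documents,
    // or the timeout expires. Returns true if the count was reached.
    public static boolean waitUntilVisible(SolrClient client, long expectedDocs,
                                           long timeoutMs) throws Exception {
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);  // only numFound is needed
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            long numFound = client.query(q).getResults().getNumFound();
            if (numFound >= expectedDocs) {
                return true;
            }
            Thread.sleep(2000);  // poll interval, arbitrary
        }
        return false;
    }
}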


Thanks a lot,





Re: Solr facets implementation question

2015-09-17 Thread adfel70
Toke Eskildsen wrote:
> adfel70 wrote:
>> I am trying to understand why faceting on a field with lots of unique
>> values
>> has a great impact on query performance.
> 
> Faceting in Solr is performed in different ways. String faceting different
> from Numerics faceting, DocValued fields different from non-DocValued, fc
> different from enum. Let's look at String faceting with facet.method=fc
> and DocValues.
> 
> Strings (aka Terms) are represented in the faceting code with an ordinal,
> which is really just a number. The first term has number 0, the next
> number 1 and so forth. When doing a faceting call with the above premise,
> what happens is
> 
> 1) An counter of int[unique_values] is allocated.
> This is fairly fast, but still with a noticeable impact when the number of
> unique value creeps into the millions. On our machine it takes several
> hundred milliseconds for 100M values. Also relevant is the overall strain
> it puts on the garbage collector.
> 
> 2) For each hit in the result set, the corresponding ordinals are resolved
> and counter[ordinal]++ is triggered.
> This scales with the result set. Small sets are very fast, quite
> independent of the size of the counter-structure. Large result sets are
> (naturally) equally slow.
> 
> 3) The counter-structure is iterated and top-X are determined.
> This scales with the size of the counter-structure, (nearly) independent
> of the result set size.
> 
> 4) The Terms for the top-X ordinals are resolved from the index.
> This scales with X.
> 
> 
> Some of these parts have some non-intuitive penalties: Even very tiny
> result sets have a constant overhead from allocation and iteration. Asking
> for top-1M hits means that the underlying priority queue will probably no
> longer fit in the CPU cache and will slow things down. Resolving Terms
> from ordinals relies of fast IO and a large number of unique Terms might
> mean that the disk cache is not large enough.
> 
> 
> Blatant plug: I have spent a fair amount of time trying to make some of
> this faster http://tokee.github.io/lucene-solr/
> 
> - Toke Eskildsen

Hi Toke, thank you for the detailed explanation, that's exactly what I was
looking for, except this algorithm fits a single index only. Could you please
elaborate on what adjustments are needed for a distributed index?
The naive solution would count all terms on each shard, but the initial
shard (the one that executed the request) must have ALL results for correct
aggregation (it's easy to find an example which shows that the top K results
from every shard are not good enough).
Is that correct? I tried to verify this behaviour, but I didn't see that the
process that got the request from the user used more memory than the other
shards.





Solr facets implementation question

2015-09-08 Thread adfel70
I am trying to understand why faceting on a field with lots of unique values
has a great impact on query performance. Since Googling for the Solr facet
algorithm did not yield anything, I looked at how facets are implemented in
Lucene. I found out that there are 2 methods - taxonomy-based and
SortedSetDocValues-based. Are Solr's facet capabilities based on one of
those methods? If so, I still can't understand why the number of unique values
impacts query performance...





Re: serious data loss bug in correlation with "too much data after closed"

2015-08-13 Thread adfel70
Update:
Modifying jetty.xml as described here:
http://lucene.472066.n3.nabble.com/Too-much-data-after-closed-for-HttpChannelOverHttp-td4170459.html

solved the problem of these warnings and the data loss.

Future searches for this problem should take into account that this warning
may imply data loss.





Re: serious data loss bug in correlation with "too much data after closed"

2015-08-09 Thread adfel70
By now I'm pretty much sure that this is either a bug in solr or in
http-client.

I again reproduced the problem:
1. During massive indexing we see some WARNINGs from HttpParser:
"badMessage: java.lang.IllegalStateException: too much data after closed for
HttpChannelOverHttp"

Checking the httpcore code, it seems that this happens when the connection
closes abruptly.

2. Only for some instances of this warning do we see a related
NoHttpResponseException from a Solr node.

3. After indexing we perform a full data validation and see that around 200
docs for which our client got an HTTP 200 status are not present in Solr.

4. Checking when these docs were sent to Solr, we get to the same times at
which we had the log messages from 1 and 2 (the HttpChannelOverHttp warning
and the NoHttpResponseException).

5. These 200 docs are divided into around 8 bulks that were sent at various
times, and all had these warn/error messages around them.


Would be glad to have some inputs from the community on this.

Thanks.





Re: serious data loss bug in correlation with "too much data after closed"

2015-08-06 Thread adfel70
Are you sure that this parameter concerns /update requests?
On the one hand, it says that it "specifies the max size of form data
(application/x-www-form-urlencoded) sent via POST. You can use POST to pass
request parameters not fitting into URL";

on the other hand, I see that my bulks are as big as 7 MB in some cases
and I don't get any error for those.

I now log the bulk size in KB of each bulk and the time it took
for the Solr client to get a status reply.
I tried to correlate spikes in these metrics with these warnings -
but it seems there is no relation.

I'm indexing again right now to try to reproduce the problem.
I'm seeing these warnings but need to wait for the end of the indexing to
see if there is any data loss and whether the docs that got lost correlate by
time with these warnings.

Would be glad to hear more ideas about this warning and this problem.


Shawn Heisey-2 wrote
> On 8/4/2015 8:06 AM, adfel70 wrote:
>> I saw this post:
>> http://lucene.472066.n3.nabble.com/Too-much-data-after-closed-for-HttpChannelOverHttp-td4170459.html
>> 
>> I tried reducing the bulk size from 1000 to 200 as the post suggests
>> (didn't
>> go to runing each doc in a seperate .add call yet), with no success. In
>> this
>> try I'm getting the same WARNING, but now I also have regular errors such
>> as
>> NoHttpResponseExcpeption which is fine because the client also gets an
>> error
>> and I can handle this.
> 
> How big are those docs?  Is there any chance that the index request from
> those 200 docs would exceed 2 megabytes?  That's the default size limit
> on a POST request.
> 
> In your solrconfig.xml, there is a requestDispatcher section, which has
> a requestParsers tag inside it.  In the requestParsers tag you can
> change the formdataUploadLimitInKB parameter, which defaults to 2048.
> In 3.x, this would have been controlled by the container config, but
> since about 4.1, Solr can configure it directly.  I have changed mine to
> 32768, so I can be SURE it's big enough for anything I might throw at
> it, but not so big that incredibly massive requests can be sent.
> 
> Thanks,
> Shawn







Re: serious data loss bug in correlation with "too much data after closed"

2015-08-06 Thread adfel70
I have some docs that I know I've overwritten, but this is fine because it is
caused by some duplicate docs with the same data and the same id.

I know of data loss because I know that a certain doc with a certain id should
be in the index but it isn't.



Upayavira wrote
> Are you adding all new documents? If you are not updating documents at
> all, take a look at your maxDocs vs numDocs, if they are not the same,
> then you have overwritten some documents.
> 
> You may also be right that the exception you've seen could be the cause
> of it, just thought the above is worth checking.
> 
> Upayavira
> 
> On Tue, Aug 4, 2015, at 03:06 PM, adfel70 wrote:
>> Hello,
>> I'm using solr 5.2.1
>> I'm running indexing of a collection with 20 shards.
>> around 1.7 billion docs should be indexed.
>> the indexer is a mapreduce job that runs on yarn, running 60  concurrent
>> containers.
>> I index with bulks of 1000 docs and write logs for each bulk that was
>> indexed.
>> each such log message has all the ids of the solr docs that were in the
>> bulk.
>> 
>> Such and indexing process finished without any errors, not in the indexer
>> nor in solr.
>> I have a data validation process that validates that solr has the correct
>> number of docs as it should.
>> I ran this process and got that some docs are missing.
>> I figure out which docs are missing and went back to my logs and saw that
>> these docs appeared in log messages of succeeded bulks.
>> So I have the following scenario:
>> 1. At a certain time during the indexing, a client used solrj to send a
>> bulk
>> of 1000 docs
>> 2. the client got success for this operation
>> 3. solr had no errors.
>> 4. not all the data was indexed.
>> 
>> Further investigation of solr logs broguht me to a conclution that at all
>> times that I had a bulk that had missing docs, solr had the following
>> WARNING log:
>> badMessage: java.lang.IllegalStateException: too much data after closed
>> for
>> HttpChannelOverHttp@5432494a
>> 
>> I saw this post:
>> http://lucene.472066.n3.nabble.com/Too-much-data-after-closed-for-HttpChannelOverHttp-td4170459.html
>> 
>> I tried reducing the bulk size from 1000 to 200 as the post suggests
>> (didn't
>> go to runing each doc in a seperate .add call yet), with no success. In
>> this
>> try I'm getting the same WARNING, but now I also have regular errors such
>> as
>> NoHttpResponseExcpeption which is fine because the client also gets an
>> error
>> and I can handle this.
>> 
>> 
>> Any inputs of this WARNING and the dataloss issue?
>> 
>> thanks.
>> 
>> 
>> 
>> 
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/serious-data-loss-bug-in-correlation-with-too-much-data-after-closed-tp4220723.html
>> Sent from the Solr - User mailing list archive at Nabble.com.







serious data loss bug in correlation with "too much data after closed"

2015-08-04 Thread adfel70
Hello,
I'm using Solr 5.2.1.
I'm running indexing of a collection with 20 shards;
around 1.7 billion docs should be indexed.
The indexer is a mapreduce job that runs on YARN, running 60 concurrent
containers.
I index with bulks of 1000 docs and write logs for each bulk that was
indexed; each such log message has all the ids of the Solr docs that were in
the bulk.

Such an indexing process finished without any errors, neither in the indexer
nor in Solr.
I have a data validation process that validates that Solr has the correct
number of docs.
I ran this process and found that some docs are missing.
I figured out which docs are missing and went back to my logs, and saw that
these docs appeared in log messages of succeeded bulks.
So I have the following scenario:
1. At a certain time during the indexing, a client used SolrJ to send a bulk
of 1000 docs.
2. The client got a success response for this operation.
3. Solr had no errors.
4. Not all the data was indexed.

Further investigation of the Solr logs brought me to the conclusion that at
all times that I had a bulk with missing docs, Solr had the following
WARNING log:
badMessage: java.lang.IllegalStateException: too much data after closed for
HttpChannelOverHttp@5432494a

I saw this post:
http://lucene.472066.n3.nabble.com/Too-much-data-after-closed-for-HttpChannelOverHttp-td4170459.html

I tried reducing the bulk size from 1000 to 200 as the post suggests (I didn't
go as far as running each doc in a separate .add call yet), with no success.
In this try I'm getting the same WARNING, but now I also have regular errors
such as NoHttpResponseException, which is fine because the client also gets an
error and I can handle it.


Any input on this WARNING and the data loss issue?

thanks.






Re: mapreduce job using solrj 5

2015-06-17 Thread adfel70
We cannot downgrade httpclient in solrj 5 because it uses new features, and
we don't want to start altering Solr code. We also thought about upgrading
httpclient in hadoop, but as Erick said, that sounds like more work than just
putting the jar on the data nodes.

About that flag - we tried it; hadoop even has an environment variable,
HADOOP_USER_CLASSPATH_FIRST, but all our tests with that flag failed.

We thought this is an issue that Solr users are more likely to encounter than
Cloudera users, so we would be glad for a more elegant solution or workaround
than replacing the httpclient jar on the data nodes.

Thank you all for your responses





mapreduce job using solrj 5

2015-06-16 Thread adfel70
Hi,

We recently started testing Solr 5. Our indexer creates a mapreduce job that
uses solrj 5 to index documents into our SolrCloud. Until now, we used Solr
4.10.3 with solrj 4.8.0. Our hadoop distribution is Cloudera 5.

The problem is that solrj 5 uses httpclient-4.3.1 while hadoop is installed
with httpclient-4.2.5, and that causes jar hell for us, because the hadoop
jars are loaded first and solrj uses the CloseableHttpClient class, which is
in 4.3.1 but not in 4.2.5.

Has anyone encountered this? And does anyone have a solution, or a workaround?

Right now we are replacing the jar physically on each data node.






DocValues memory consumption thoughts

2015-06-11 Thread adfel70
I am using DocValues and I am wondering how to configure the Java heap size of
Solr's processes: do DocValues use the system cache (off-heap memory) or heap
memory? Should I take DocValues into consideration when I calculate the heap
parameters (Xmx, Xmn, Xms...)?





Re: Adding applicative cache to SolrSearcher

2015-06-11 Thread adfel70
Works great, thanks guys!
Missed the leafReader because I looked at IndexSearcher instead of
SolrIndexSearcher...





Re: How to tell when Collector finishes collect loop?

2015-06-10 Thread adfel70
I need to execute close() because the scorer is opened in the context of
a query and caches some data in that scope - of the specific query. The way
to clear this cache, which is only relevant for that query, is to call
close(). I think this API is not so good, but I assume that the scorer's
code will not change soon...





Adding applicative cache to SolrSearcher

2015-06-10 Thread adfel70

I am using RankQuery to implement my applicative scorer, which returns a score
based on the value of a specific field (let's call it 'score_field') that is
stored for every document.
The RankQuery creates a collector, and for every collected docId I retrieve
the value of score_field, calculate the score and add the doc id into a
priority queue:

public class MyScorerRankQuery extends RankQuery {
    ...

    @Override
    public TopDocsCollector getTopDocsCollector(int i,
            SolrIndexSearcher.QueryCommand cmd, IndexSearcher searcher) {
        ...
        return new MyCollector(...)
    }
}

public class MyCollector extends TopDocsCollector {
    MyScorer scorer;
    SortedDocValues scoreFieldValues;   // initialized in the constructor

    public MyCollector() {
        scorer = new MyScorer();
        // the scorer's API needs start() before every query and close() at
        // the end of the query
        scorer.start();
        AtomicReader r =
            SlowCompositeReaderWrapper.wrap(searcher.getIndexReader());
        // THIS CALL IS TIME CONSUMING!
        scoreFieldValues = DocValues.getSorted(r, "score_field");
    }

    @Override
    public void collect(int id) {
        int docID = docBase + id;
        // 1. get the score_field value for the doc using DocValues and
        //    calculate the score using my scorer
        String value = scoreFieldValues.get(docID).utf8ToString();
        scorer.calcScore(value);
        // 2. add the docId and score (a ScoreDoc object) into the PriorityQueue
    }
}


I used DocValues to get the value of score_field. Currently it is being
instantiated in the collector's constructor - which is a performance killer,
because it is done for EVERY query, even if the index is static (no
commits). I want to make the DocValues.getSorted() call only when it is
really necessary, but I don't know where to put that code. Is there a place
to plug that code in, so that when a new searcher is opened I can add this
applicative cache?
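
A sketch of the direction the follow-up above reports working - taking the
already-wrapped leaf reader from SolrIndexSearcher instead of wrapping per
query. It assumes the searcher handed to getTopDocsCollector() is a
SolrIndexSearcher and that this Solr line exposes getLeafReader():

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.SolrIndexSearcher;

public final class ScoreFieldDocValues {
    private ScoreFieldDocValues() {}

    // Called from MyCollector's constructor instead of wrapping the reader ourselves.
    // SolrIndexSearcher keeps one wrapped leaf reader per searcher, so the expensive
    // wrap is not rebuilt for every query (only when a new searcher is opened).
    public static SortedDocValues open(IndexSearcher searcher, String field)
            throws java.io.IOException {
        LeafReader leafReader = ((SolrIndexSearcher) searcher).getLeafReader();
        return DocValues.getSorted(leafReader, field);
    }
}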





How to tell when Collector finishes collect loop?

2015-06-03 Thread adfel70
Hi guys, I need your help (again):
I have a search handler which needs to override Solr's scoring. I chose to
implement it with the RankQuery API, so when getTopDocsCollector() gets called
it instantiates my TopDocsCollector instance, and every docId gets its own
score:

public class MyScorerRankQuery extends RankQuery {
    ...

    @Override
    public TopDocsCollector getTopDocsCollector(int i,
            SolrIndexSearcher.QueryCommand cmd, IndexSearcher searcher) {
        ...
        return new MyCollector(...)
    }
}

public class MyCollector extends TopDocsCollector {
    // initialized in the constructor
    MyScorer scorer;

    public MyCollector() {
        scorer = new MyScorer();
        // the scorer's API needs start() before every query and close() at
        // the end of the query
        scorer.start();
    }

    @Override
    public void collect(int id) {
        // 1. get the score_field value for the doc using DocValues and
        //    calculate the score using my scorer
        // 2. add the docId and score (a ScoreDoc object) into the PriorityQueue
    }
}

My problem is that I can't find a place to call scorer.close(), which needs to
be executed when the query ends (after we have calculated a score for each
docID). I saw that DelegatingCollector has a finish() method which is called
after the collector is done, but I cannot extend both TopDocsCollector and
DelegatingCollector...
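
One workaround being considered, as a sketch only - it rests on the assumption
that Solr asks the collector for its topDocs(...) once, after collection has
finished, which is observed behaviour rather than a documented guarantee:

// Inside MyCollector; topDocs(start, howMany) appears to be the variant the
// other topDocs() overloads delegate to, so the scorer is closed once the
// results are pulled from the collector.
@Override
public TopDocs topDocs(int start, int howMany) {
    try {
        return super.topDocs(start, howMany);
    } finally {
        scorer.close();   // release the per-query cache held by the scorer
    }
}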







Re: Native library of plugin is loaded for every core

2015-05-28 Thread adfel70
Works as expected :)
Thanks guys!





Re: Native library of plugin is loaded for every core

2015-05-27 Thread adfel70
Hi Alan, thanks for the reply.
I am not sure what you meant. Currently it is loaded from solrconfig.xml.

Is there any other way?






Native library of plugin is loaded for every core

2015-05-27 Thread adfel70
Hi guys, I need your help:
I added custom plugins to Solr to support my applicative needs (one index
handler and 2 search components); all of them access a native library using
JNI. The native library wrapper class loads the library using the usual
pattern:

public class YWrapper{
static{
System.loadLibrary("YJNI");
}
...
}


Basically things are working great, but when I try to create another
collection, an exception is being thrown:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:Error
CREATing SolrCore 'anotherColl_shard1_replica1': Unable to create core
[anotherColl_shard1_replica1] caused by: Native Library
/...path_to_library/LibY.so already loaded in another classloader

I guess that this happens because every core has its own class loader. Is
that right? Is there any way to define my plugin (my jar file) as a shared
library, so it would only be loaded once when the process starts, and not on
every core instantiation?





"I was asked to wait on state recovering for shard.... but I still do not see the request state"

2015-05-07 Thread adfel70
Hi
I have a cluster of 16 shards, 3 replicas.

I keep getting situations where a whole shard breaks.
The leader is in the down state and says:
I was asked to wait on state recovering for shard but I still do not see
the requested state. I see state: recovering live:true leader from
ZK:http://...

The replicas are in the recovering state, keep failing on recovery, and log
the same exception.

Any idea?

I use Solr 4.10.3.

Thanks.






Re: getting frequent CorruptIndexException and inconsistent data though core is active

2015-05-07 Thread adfel70
Does anyone have any input on this?





Re: severe problems with soft and hard commits in a large index

2015-05-06 Thread adfel70
Thank you for the detailed answer.
How can I decrease the impact of opening a searcher in such a large index,
especially the impact of the heap usage that causes OOM?

Regarding GC tuning - I am doing that.
Here are the params I use:
AggressiveOpts
UseLargePages
ParallelRefProcEnabled
CMSParallelRemarkEnabled
CMSMaxAbortablePrecleanTime=6000
CMSTriggerPermRatio=80
CMSInitiatingOccupancyFraction=70
UseCMSInitiatingOccupancyOnly
CMSFullGCsBeforeCompaction=1
PretenureSizeThreshold=64m
CMSScavengeBeforeRemark
UseConcMarkSweepGC
MaxTenuringThreshold=8
TargetSurvivorRatio=90
SurvivorRatio=4
NewRatio=2
Xms16gb
Xmx28gb

Any input on this?

How many documents per shard are recommended?
Note that I use nested documents. The total collection size is 3 billion docs;
the number of parent docs is 600 million and the rest are children.



Shawn Heisey-2 wrote
> On 5/6/2015 1:58 AM, adfel70 wrote:
>> I have a cluster of 16 shards, 3 replicas. the cluster indexed nested
>> documents.
>> it currently has 3 billion documents overall (parent and children).
>> each shard has around 200 million docs. size of each shard is 250GB.
>> this runs on 12 machines. each machine has 4 SSD disks and 4 solr
>> processes.
>> each process has 28GB heap.  each machine has 196GB RAM.
>> 
>> I perform periodic indexing throughout the day. each indexing cycle adds
>> around 1.5 million docs. I keep the indexing load light - 2 processes
>> with
>> bulks of 20 docs.
>> 
>> My use case demands that each indexing cycle will be visible only when
>> the
>> whole cycle finishes.
>> 
>> I tried various methods of using soft and hard commits:
> 
> I personally would configure autoCommit on a five minute (maxTime of
> 30) interval with openSearcher=false.  The use case you have
> outlined (not seeing changed until the indexing is done) demands that
> you do NOT turn on autoSoftCommit, that you do one manual commit at the
> end of indexing, which could be either a soft commit or a hard commit.
> I would recommend a soft commit.
> 
> Because it is the openSearcher part of a commit that's very expensive,
> you can successfully do autoCommit with openSearcher=false on an
> interval like 10 or 15 seconds and not see much in the way of immediate
> performance loss.  That commit is still not free, not only in terms of
> resources, but in terms of java heap garbage generated.
> 
> The general advice with commits is to do them as infrequently as you
> can, which applies to ANY commit, not just those that make changes
> visible.
> 
>> with all methods I encounter pretty much the same problem:
>> 1. heavy GCs when soft commit is performed (methods 1,2) or when
>> hardcommit
>> opensearcher=true is performed. these GCs cause heavy latency (average
>> latency is 3 secs. latency during the problem is 80secs)
>> 2. if indexing cycles come too often, which causes softcommits or
>> hardcommits(opensearcher=true) occur with a small interval one after
>> another
>> (around 5-10minutes), I start getting many OOM exceptions.
> 
> If you're getting OOM, then either you need to change things so Solr
> requires less heap memory, or you need to increase the heap size.
> Changing things might be either the config or how you use Solr.
> 
> Are you tuning your garbage collection?  With a 28GB heap, tuning is not
> optional.  It's so important that the startup scripts in 5.0 and 5.1
> include it, even though the default max heap is 512MB.
> 
> Let's do some quick math on your memory.  You have four instances of
> Solr on each machine, each with a 28GB heap.  That's 112GB of memory
> allocated to Java.  With 196GB total, you have approximately 84GB of RAM
> left over for caching your index.
> 
> A 16-shard index with three replicas means 48 cores.  Divide that by 12
> machines and that's 4 replicas on each server, presumably one in each
> Solr instance.  You say that the size of each shard is 250GB, so you've
> got about a terabyte of index on each server, but only 84GB of RAM for
> caching.
> 
> Even with SSD, that's not going to be anywhere near enough cache memory
> for good Solr performance.
> 
> All these memory issues, including GC tuning, are discussed on this wiki
> page:
> 
> http://wiki.apache.org/solr/SolrPerformanceProblems
> 
> One additional note: By my calculations, each filterCache entry will be
> at least 23MB in size.  This means that if you are using the filterCache
> and the G1 collector, you will not be able to avoid humongous
> allocations, which is any allocation larger than half the G1 region
> size.  The max configurable G1 region size is 32MB.  You should use the
> CMS collector for your GC tuning, not G1.  If you can reduce the number
> of documents in each shard, G1 might work well.
> 
> Thanks,
> Shawn







getting frequent CorruptIndexException and inconsistent data though core is active

2015-05-06 Thread adfel70
Hi
I'm getting org.apache.lucene.index.CorruptIndexException 
liveDocs.count()=2000699 info.docCount()=2047904 info.getDelCount()=47207
(filename=_ney_1g.del).

This just happened for the 4th time in 2 weeks.
Each time it happens in another core, usually when a replica tries to
recover: it reports that it succeeded, and then the CorruptIndexException
is thrown while trying to open a searcher.

This core is marked as active, so queries can get redirected there, and
this causes data inconsistency for users.
This occurs with Solr 4.10.3; it should be noted that I use nested docs.

ANOTHER problem is that replicas can end up with inconsistent numbers of docs
with no exception being reported.
This usually occurs when one of the replicas goes down during indexing. What
I end up with is the leader being on an older version than the replicas,
or having fewer docs than the replicas. Switching leaders (stopping the
leader so that another replica becomes the leader) does not fix the problem.

This occurs both in Solr 4.10.3 and in Solr 4.8.







Re: severe problems with soft and hard commits in a large index

2015-05-06 Thread adfel70
I don't see any of these.
I've seen them before in other clusters and uses of Solr, but I don't see any
of these messages here.



Dmitry Kan-2 wrote
> Do you seen any (a lot?) of the warming searchers on deck, i.e. value for
> N:
> 
> PERFORMANCE WARNING: Overlapping onDeckSearchers=N
> 
> On Wed, May 6, 2015 at 10:58 AM, adfel70 wrote:
> 
>> Hello
>> I have a cluster of 16 shards, 3 replicas. the cluster indexed nested
>> documents.
>> it currently has 3 billion documents overall (parent and children).
>> each shard has around 200 million docs. size of each shard is 250GB.
>> this runs on 12 machines. each machine has 4 SSD disks and 4 solr
>> processes.
>> each process has 28GB heap.  each machine has 196GB RAM.
>>
>> I perform periodic indexing throughout the day. each indexing cycle adds
>> around 1.5 million docs. I keep the indexing load light - 2 processes
>> with
>> bulks of 20 docs.
>>
>> My use case demands that each indexing cycle will be visible only when
>> the
>> whole cycle finishes.
>>
>> I tried various methods of using soft and hard commits:
>>
>> 1. using auto hard commit with time=10secs (opensearcher=false) and an
>> explicit soft commit when the indexing finishes.
>> 2. using auto soft commit with time=10/30/60secs during the indexing.
>> 3. not using soft commit at all, just using auto hard commit with
>> time=10secs during the indexing (opensearcher=false) and an explicit hard
>> commit with opensearcher=true when the cycle finishes.
>>
>>
>> with all methods I encounter pretty much the same problem:
>> 1. heavy GCs when soft commit is performed (methods 1,2) or when
>> hardcommit
>> opensearcher=true is performed. these GCs cause heavy latency (average
>> latency is 3 secs. latency during the problem is 80secs)
>> 2. if indexing cycles come too often, which causes softcommits or
>> hardcommits(opensearcher=true) occur with a small interval one after
>> another
>> (around 5-10minutes), I start getting many OOM exceptions.
>>
>>
>> Thank you.
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/severe-problems-with-soft-and-hard-commits-in-a-large-index-tp4204068.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
> 
> 
> 
> -- 
> Dmitry Kan
> Luke Toolbox: http://github.com/DmitryKey/luke
> Blog: http://dmitrykan.blogspot.com
> Twitter: http://twitter.com/dmitrykan
> SemanticAnalyzer: www.semanticanalyzer.info







Re: severe problems with soft and hard commits in a large index

2015-05-06 Thread adfel70
1. Yes, I'm sure that the pauses are due to GCs. I monitor the cluster and
continuously receive metrics from the system and from the Java process.
I see clearly that when a soft commit is triggered, major GCs start occurring
(sometimes recurring on the same process) and latency rises.
I use the CMS GC and JDK 1.7.0_75.

2. My previous post was about another use case, but nevertheless I have
configured docValues on the faceted fields.


Toke Eskildsen wrote
> On Wed, 2015-05-06 at 00:58 -0700, adfel70 wrote:
>> each shard has around 200 million docs. size of each shard is 250GB.
>> this runs on 12 machines. each machine has 4 SSD disks and 4 solr
>> processes.
>> each process has 28GB heap.  each machine has 196GB RAM.
> 
> [...]
> 
>> 1. heavy GCs when soft commit is performed (methods 1,2) or when
>> hardcommit
>> opensearcher=true is performed. these GCs cause heavy latency (average
>> latency is 3 secs. latency during the problem is 80secs)
> 
> Sanity check: Are you sure the pauses are due to garbage collection?
> 
> You have a fairly large heap and judging from your previous post
> "problem with facets  - out of memory exception", you are doing
> non-trivial faceting. Are you using DocValues, as Marc suggested?
> 
> 
> - Toke Eskildsen, State and University Library, Denmark







severe problems with soft and hard commits in a large index

2015-05-06 Thread adfel70
Hello
I have a cluster of 16 shards, 3 replicas. The cluster indexes nested
documents.
It currently has 3 billion documents overall (parents and children).
Each shard has around 200 million docs. The size of each shard is 250GB.
This runs on 12 machines. Each machine has 4 SSD disks and 4 Solr processes.
Each process has a 28GB heap. Each machine has 196GB RAM.

I perform periodic indexing throughout the day. each indexing cycle adds
around 1.5 million docs. I keep the indexing load light - 2 processes with
bulks of 20 docs.

My use case demands that each indexing cycle will be visible only when the
whole cycle finishes.

I tried various methods of using soft and hard commits:

1. using auto hard commit with time=10secs (opensearcher=false) and an
explicit soft commit when the indexing finishes.
2. using auto soft commit with time=10/30/60secs during the indexing.
3. not using soft commit at all, just using auto hard commit with
time=10secs during the indexing (opensearcher=false) and an explicit hard
commit with opensearcher=true when the cycle finishes.


With all methods I encounter pretty much the same problem:
1. Heavy GCs when a soft commit is performed (methods 1, 2) or when a hard
commit with openSearcher=true is performed. These GCs cause heavy latency
(average latency is 3 secs; latency during the problem is 80 secs).
2. If indexing cycles come too often, which causes soft commits or hard
commits (openSearcher=true) to occur at a small interval one after another
(around 5-10 minutes), I start getting many OOM exceptions.


Thank you.





Re: CLUSTERSTATE timeout

2015-04-13 Thread adfel70
I'm having the same issue with 4.10.3.

I'm performing various tasks through the CLUSTERSTATE API and getting random
timeouts throughout the day.





Re: Unexplained leader initiated recovery after updates

2015-02-06 Thread adfel70
Any input on this?
I'm facing the same problem...





Re: CLUSTERSTATUS timeout

2014-12-17 Thread adfel70
Hi Jonathan,
We are having the exact same problem with Solr 4.8.0.
Did you manage to resolve this one?
Thanks.





Re: Getting the position of a word via Solr API

2014-12-02 Thread adfel70
Small update:
I have managed to make the Term Vector component work and I am getting all the
words of the text field.

The problem is that it doesn't work with several words combined; I can't
find the offset at which the needed expression starts...

Any ideas, anyone?

Thanks!





Getting the position of a word via Solr API

2014-12-01 Thread adfel70
Hi,
I am trying to retrieve, via the Solr API, the position of a word in a text
field that was indexed but not stored.

I am storing the text field in an external repository and trying to do what
Solr's built-in snippet function does by myself, outside Solr.

Basically, all I need is to get from Solr the actual offset of the word in
the text.

I have tried 2 things:
1. Using the TermVectorComponent - got NPE errors.
2. Using the highlighting mechanism - didn't find the needed option in the API.

If one of these actually works, please tell me how :-)

Is there another way for me to get the location/position/offset of the
searched word from a text that is not stored in Solr?
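
For completeness, this is roughly the request being attempted, as a SolrJ
sketch. It assumes the handler /tvrh is wired to the TermVectorComponent and
that the field is indexed with termVectors="true" termPositions="true"
termOffsets="true" - if the field lacks term vectors, that alone could explain
the NPE. Host, collection, doc id and field name are placeholders.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.NamedList;

public class TermOffsetsLookup {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("id:doc1");
        q.setRequestHandler("/tvrh");     // a handler with the TermVectorComponent enabled
        q.set("tv", true);
        q.set("tv.fl", "text");           // the indexed-but-not-stored text field
        q.set("tv.positions", true);
        q.set("tv.offsets", true);

        QueryResponse rsp = solr.query(q);
        // The term vector section is a NamedList keyed by doc key, then field, then term,
        // holding positions and start/end offsets per term.
        NamedList<?> termVectors = (NamedList<?>) rsp.getResponse().get("termVectors");
        System.out.println(termVectors);
        solr.shutdown();
    }
}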

Thank you very much!!






Is it possible to facet on date fields and aggregate by day/month/year?

2014-11-16 Thread adfel70
Hi,

If my data includes:
doc1: date_f: 2014-05-01T00:00:00Z
doc2: date_f: 2014-05-02T00:00:00Z
doc3: date_f: 2014-06-01T00:00:00Z
doc4: date_f: 2014-07-01T00:00:00Z

then I can facet on month(date_f) and get
05(2)
06(1)
07(1)
or facet on year(date_f) and get 
2014(4)


Is it supported?
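
For reference, a built-in that gets close is plain date range faceting with a
+1MONTH gap; it gives one bucket per calendar month rather than aggregating a
month across years. A SolrJ sketch using the field above, with the host URL
and the date window as placeholders:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.RangeFacet;

public class MonthFacetSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        Date start = fmt.parse("2014-01-01T00:00:00Z");
        Date end = fmt.parse("2015-01-01T00:00:00Z");

        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);
        // one facet bucket per calendar month between start and end
        q.addDateRangeFacet("date_f", start, end, "+1MONTH");

        QueryResponse rsp = solr.query(q);
        for (RangeFacet<?, ?> rf : rsp.getFacetRanges()) {
            for (RangeFacet.Count c : rf.getCounts()) {
                System.out.println(c.getValue() + " -> " + c.getCount());
            }
        }
        solr.shutdown();
    }
}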






out of memory when trying to sort by id in a 1.5 billion index

2014-11-07 Thread adfel70
Hi,
I have 11 machines in my cluster.
Each machine has 128GB of memory and 2 Solr JVMs with a 12GB heap each.
The cluster has 7 shards, 3 replicas;
1.5 billion docs total.
Most user queries are pretty simple for now, sorting by date fields and by
another field that has around 1000 unique values.

I have a use case for using cursorMark paging, and when I tried to check this,
I got an OutOfMemory error just for sorting by id.
I read in old posts that I should add heap memory, and I can do that, but I
would rather not.
All my other use cases run with a stable 8GB heap.
Any other way to handle this in Solr 4.8?
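
For context, this is the kind of cursorMark loop intended - a sketch with
placeholder ZooKeeper hosts and collection name. The sort has to end on the
uniqueKey, and enabling docValues on the id field should keep most of the sort
memory off-heap, though that requires reindexing.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorPageSketch {
    public static void main(String[] args) throws Exception {
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("collection1");

        SolrQuery q = new SolrQuery("*:*");
        q.setRows(1000);
        q.setSort("id", SolrQuery.ORDER.asc);   // cursorMark requires a sort ending on the uniqueKey

        String cursor = CursorMarkParams.CURSOR_MARK_START;
        while (true) {
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
            QueryResponse rsp = solr.query(q);
            // ... process rsp.getResults() ...
            String next = rsp.getNextCursorMark();
            if (next.equals(cursor)) {
                break;   // cursor did not advance: no more results
            }
            cursor = next;
        }
        solr.shutdown();
    }
}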





Facets on Nested documents

2014-07-07 Thread adfel70
Hi,

I indexed different types (with different fields) of child docs for every
parent.
I want to facet on a field in one type of child doc and then do another facet
on a different type of child doc. It doesn't work...

Any idea how I can do something like that?

Thanks.





Re: OOM during indexing nested docs

2014-06-25 Thread adfel70
I made two tests, one with MaxRamBuffer=128 and the second with
MaxRamBuffer=256.
In both I got OOM.

I also made two tests on autocommit:
one with a commit every 5 min, and the second with a commit every 100,000 docs
(soft commit disabled).
In both I got OOM.

Merge policy - Tiered (max segment size of 5000, max merged at once = 2,
merge factor = 12).

Any idea for more tests?





OOM during indexing nested docs

2014-06-24 Thread adfel70
Hi, 

I am getting an OOM while indexing 400 million docs (nested, 7-20 children each).
The memory usage gets higher while indexing until it reaches 24GB.
Also, after the OOM and after stopping indexing, the memory stays at 24GB - *seems like a
leak.*


*Solr & Collection Info: *
Solr 4.8, 6 shards, 1 replica per shard, 24GB for the JVM

Thanks 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/OOM-during-indexing-nested-docs-tp4143722.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Replica as a "leader"

2014-05-18 Thread adfel70
*One of the most important requirements in my system is not to lose docs and
not to retrieve partial data at query time.*

I expect the replica to wait until the real leader starts, or 
at least that the real leader will be synced with the docs indexed into the replica after
it starts, and that the replica will be synced with the docs that were indexed into the
leader. 

Is there a way that solr can recover without losing docs in this scenario?

Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Replica-as-a-leader-tp4135614p4136729.html
Sent from the Solr - User mailing list archive at Nabble.com.


Replica as a "leader"

2014-05-15 Thread adfel70
/Solr &Collection Info:/
Solr 4.8 , 4 shards, 3 replicas per shard, 30-40 million docs per shard.

/Process:/
1. Indexing 100-200 docs per second.
2. Doing Pkill -9 java to 2 replicas (not the leader) in shard 3 (while
indexing).
3. Indexing for 10-20 minutes and doing hard commit. 
4. Doing Pkill -9 java to the leader and then starting one replica in shard
3 (while indexing).
5. After 20 minutes starting another replica in shard 3 ,while indexing (not
the leader in step 1). 
6. After 10 minutes starting the rep that was the leader in step 1. 

/Results:/
2. Only the leader is active in shard 3.
3. Thousands of docs were added to the leader in shard 3.
4. After starting the replica, its state was down and after 10 minutes it
became the leader in cluster state (and still down). No servers hosting
shards for index and search requests.
*5. After starting another replica, its state was recovering for 2-3
minutes and then it became active (not leader in cluster state).
   Index, commit and search requests are handled in the other replica
(active status, not leader!!!). 
   The search results do not include docs that were indexed to the leader
in step 3.  *
6. syncing with the active rep. 

/Expected:/
*5. To stay in down status.
   Not to handle index, commit and search requests - no servers hosting
shards!*
6. Become the leader.

Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Replica-as-a-leader-tp4135078.html
Sent from the Solr - User mailing list archive at Nabble.com.


solr 4.8 Leader Problem

2014-05-12 Thread adfel70
*Solr &Collection Info:*
Solr 4.8 , 4 shards, 3 replicas per shard, 30-40 million docs per shard. 

Process:
1. Indexing 100-200 docs per second. 
2. Doing Pkill -9 java to 2 replicas (not the leader) in shard 3 (while
indexing). 
3. Indexing for 10-20 minutes and doing hard commit. 
4. Doing Pkill -9 java to the leader and then starting one replica in shard
3 (while indexing). 
5. After 20 minutes starting another replica in shard 3 ,while indexing (not
the leader in step 1). 
6. After 10 minutes starting the rep that was the leader in step 1. 

*Results:*
2. Only the leader is active in shard 3. 
3. Thousands of docs were added to the leader in shard 3. 
4. After starting the replica, its state was down and after 10 minutes it
became the leader in cluster state (and still down). No servers hosting
shards for index and search requests. 
5. After starting another replica, its state was recovering for 2-3 minutes
and then it became active (not leader in cluster state). 
   Index, commit and search requests are handled in the other replica
(active status, not leader!!!). 
   The search results do not include docs that were indexed to the leader
in step 3.  
6. syncing with the active rep. 

*Expected:*
5. To stay in down status. 
   Not to handle index, commit and search requests - no servers hosting
shards!
6. Become the leader. 

Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-4-8-Leader-Problem-tp4135306.html
Sent from the Solr - User mailing list archive at Nabble.com.


Replica as a "leader"

2014-05-11 Thread adfel70
Solr & Collection Info:
Solr 4.8, 4 shards, 3 replicas per shard, 30-40 million docs per shard.

Process:
1. Indexing 100-200 docs per second.
2. Doing Pkill -9 java to 2 replicas (not the leader) in shard 3 (while
indexing).
3. Indexing for 10-20 minutes and doing hard commit. 
4. Doing Pkill -9 java to the leader and then starting one replica in shard
3 (while indexing).
5. After 20 minutes starting another replica in shard 3 ,while indexing (not
the leader in step 1). 

Results:
2. Only the leader is active in shard 3.
3. Thousands of docs were added to the leader in shard 3.
4. After starting the replica, its state was down and after 10 minutes it
became the leader in cluster state (and still down). No servers hosting
shards for index and search requests.
5. After starting another replica, its state was recovering for 2-3 minutes
and then it became active (not leader in cluster state).
6. Index, commit and search requests are handled in the other replica
(*active status, not leader!!!*).


Expected:
5. To stay in down status.
*6. Not to handle index, commit and search requests - no servers hosting
shards!*

Thanks!




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Replica-as-a-leader-tp4135077.html
Sent from the Solr - User mailing list archive at Nabble.com.


Suspicious Object.wait in UnInvertedField.getUnInvertedField

2014-04-02 Thread adfel70
While debugging a problem where 400 threads were waiting for a single lock we
traced the issue to the getUnInvertedField method. 

public static UnInvertedField getUnInvertedField(String field, SolrIndexSearcher searcher) throws IOException {
  SolrCache cache = searcher.getFieldValueCache();
  if (cache == null) {
    return new UnInvertedField(field, searcher);
  }

  UnInvertedField uif = null;
  Boolean doWait = false;
  synchronized (cache) {
    uif = cache.get(field);
    if (uif == null) {
      cache.put(field, uifPlaceholder); // This thread will load this field, don't let other threads try.
    } else {
      if (uif.isPlaceholder == false) {
        return uif;
      }
      doWait = true; // Someone else has put the placeholder in, wait for that to complete.
    }
  }

  while (doWait) {
    try {
      synchronized (cache) {
        uif = cache.get(field); // Should at least return the placeholder, NPE if not is OK.
        if (uif.isPlaceholder == false) { // OK, another thread put this in the cache, we should be good.
          return uif;
        }
        *cache.wait();*
      }
    } catch (InterruptedException e) {
      throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
          "Thread interrupted in getUninvertedField.");
    }
  }

  uif = new UnInvertedField(field, searcher);
  synchronized (cache) {
    cache.put(field, uif); // Note, this cleverly replaces the placeholder.
    *cache.notifyAll();*
  }

  return uif;
}

It seems that the code is waiting on the same object it is synchronized on,
thus the notifyAll call may never happen since it requires re-obtaining the
lock...

Am I missing something here, or is this a real bug?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Suspicious-Object-wait-in-UnInvertedField-getUnInvertedField-tp4128555.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: bulk indexing - EofExceptions and big latencies after soft-commit

2014-03-18 Thread adfel70
I disabled softCommit and tried to run another indexing process.
Now I see no Jetty EofExceptions and no latency peaks.

I also noticed that when I had softCommit every 10 minutes, I also saw
spikes in the major GC (I use CMS) to around 9-10k.

Any idea?



Shawn Heisey-4 wrote
> On 3/17/2014 7:07 AM, adfel70 wrote:
>> we currently have arround 200gb in a server.
>> I'm aware of the RAM issue, but it somehow doesnt seems related.
>> I would expect search latency problems. not strange eofexceptions.
>> 
>> regarding the http.timeout - I didn't change anything concerning this.
>> Do I need to explicitly set something different than the solr
>> out-of-the-box
>> comes with?
>> 
>> I'm also monitoring garbage collector metrics and I don't see anything
>> unsual..
> 
> Indexing puts extra strain on a Solr server, especially when that server
> does not have enough RAM to cache the entire index.  The basic symptom
> is that *everything* takes longer than it normally does, including
> queries.
> 
> A server that is indexing uses extra heap memory and does a LOT of I/O,
> both reading and writing.  When you don't have enough RAM to cache the
> index effectively, the commit operation will compete with queries for
> space in the OS disk cache, making both operations slow.  Your
> information says that GC is not a problem, but if it were a problem,
> indexing will make it many times worse.
> 
> The specific EofException problem is because of the *client* ... not
> Solr.  Whatever is talking to Solr is disconnecting before Solr is done,
> probably after 30 or 60 seconds.  Solr and SolrJ do not configure
> timeouts by default, and neither does the Jetty server that is included
> with Solr.  If you are using a load balancer, it's probably
> disconnecting there.
> 
> When you mention "http.timeout" ... I actually have no idea what that
> is.  I don't recall seeing a setting like that for Solr itself.  It
> sounds like a setting for some other piece of software, perhaps a
> client, load balancer, or servlet container.
> 
> Thanks,
> Shawn





--
View this message in context: 
http://lucene.472066.n3.nabble.com/bulk-indexing-EofExceptions-and-big-latencies-after-soft-commit-tp4124574p4125214.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: bulk indexing - EofExceptions and big latencies after soft-commit

2014-03-17 Thread adfel70
We currently have around 200GB on a server.
I'm aware of the RAM issue, but it somehow doesn't seem related.
I would expect search latency problems, not strange EofExceptions.

Regarding the http.timeout - I didn't change anything concerning this.
Do I need to explicitly set something different from what Solr comes with
out of the box?

I'm also monitoring garbage collector metrics and I don't see anything
unusual.





Shawn Heisey-4 wrote
> On 3/16/2014 10:34 AM, adfel70 wrote:
>> I have a 12-node solr 4.6.1 cluster. each node has 2 solr procceses,
>> running
>> on 8gb heap jvms. each node has total of 64gb memory.
>> My current collection (7 shards, 3 replicas) has around 500 million docs. 
>> I'm performing bulk indexing into the collection. I set softCommit to 10
>> minutes and hardCommit openSearcher=false to 15 minutes.
> 
> How much index data does each server have on it?  This would be the sum
> total of the index directories of all your cores.
> 
>> I recently started seeing the following problems while indexing - every
>> 10
>> minutes ( and I assume that this is the 10minutes soft-commit cycles) I
>> get
>> the following errors:
>> 1. EofExcpetion from jetty in HttpOutput.write send from
>> SolrDispatchFilter
>> 2. queries to all cores start getting high latencies (more the 10
>> seconds)
> 
> EofException errors happen when your client disconnects before the
> request is complete.  I would strongly recommend that you *NOT*
> configure hard timeouts for your client connections, or that you make
> them really long, five minutes or so.  For SolrJ, this is the SO_TIMEOUT.
> 
> These problems sound like one of two things.  It could be either or both:
> 
> 1) You don't have enough RAM to cache your index effectively.  With 64GB
> of RAM and 16GB heap, you have approximately 48GB of RAM left over for
> other software and the OS disk cache.  If the total index size on each
> machine is in the neighborhood of 60GB (or larger), this might be a
> problem.  If you have software other than Solr running on the machine,
> you must subtract it's direct and indirect memory requirements from the
> available OS disk cache.
> 
> 2) Indexing results in a LOT of object creation, most of which exist for
> a relatively short time.  This can result in severe problems with
> garbage collection pauses.
> 
> Both problems listed above (and a few others) are discussed at the wiki
> page linked below.  As you will read, there are two major causes of GC
> symptoms - a heap that's too small and incorrect (or nonexistent) GC
> tuning.  With a very large index like yours, either or both of these GC
> symptoms could be happening.
> 
> http://wiki.apache.org/solr/SolrPerformanceProblemshttp://wiki.apache.org/solr/SolrPerformanceProblems
> 
> Side note: You should only be running one Solr process per machine.
> Running multiple processes creates additional memory overhead.  Any hard
> limits that you might have run into with a single Solr process can be
> overcome with configuration options for Jetty, Solr, or the operating
> system.
> 
> Thanks,
> Shawn





--
View this message in context: 
http://lucene.472066.n3.nabble.com/bulk-indexing-EofExceptions-and-big-latencies-after-soft-commit-tp4124574p4124783.html
Sent from the Solr - User mailing list archive at Nabble.com.


bulk indexing - EofExceptions and big latencies after soft-commit

2014-03-16 Thread adfel70
Hi
I have a 12-node Solr 4.6.1 cluster. Each node has 2 Solr processes, running
on 8GB heap JVMs. Each node has a total of 64GB memory.
My current collection (7 shards, 3 replicas) has around 500 million docs. 
I'm performing bulk indexing into the collection. I set softCommit to 10
minutes and hardCommit with openSearcher=false to 15 minutes.
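
For reference, the commit settings in solrconfig.xml look roughly like this (a
sketch of my configuration, not a verbatim copy):

<autoCommit>
  <maxTime>900000</maxTime>        <!-- hard commit every 15 minutes -->
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>600000</maxTime>        <!-- soft commit (new searcher) every 10 minutes -->
</autoSoftCommit>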

I recently started seeing the following problems while indexing - every 10
minutes (and I assume this matches the 10-minute soft-commit cycle) I get
the following errors:
1. EofException from Jetty in HttpOutput.write, sent from SolrDispatchFilter
2. Queries to all cores start getting high latencies (more than 10 seconds)

Any idea?

thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/bulk-indexing-EofExceptions-and-big-latencies-after-soft-commit-tp4124574.html
Sent from the Solr - User mailing list archive at Nabble.com.


need help in understating solr cloud stats data

2014-02-02 Thread adfel70
I'm sending all Solr stats data to Graphite.
I have some questions:
1. query_handler/select requestTime - 
if I'm looking at some metric, let's say 75thPcRequestTime - I see that each
core in a single collection has different values.
Is the value of each core the time that specific core spent on a
request?
So to get an idea of the total request time, should I sum up the values
of all the cores?


2. update_handler/commits - does this include auto commits? Because I'm
pretty sure I'm not doing any manual commits and yet I see a number there.

3. update_handler/docs pending - what does this mean? Pending for what? For
flushing to disk?

Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/need-help-in-understating-solr-cloud-stats-data-tp4114992.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: monitoring solr logs

2013-12-30 Thread adfel70
And are you using any tool like Kibana as a dashboard for the logs?



Tim Potter wrote
> We're (LucidWorks) are actively developing on logstash4solr so if you have
> issues, let us know. So far, so good for me but I upgraded to logstash
> 1.3.2 even though the logstash4solr version includes 1.2.2 you can use the
> newer one. I'm not quite in production with my logstash4solr <- rabbit-mq
> <- log4j <- Solr solution yet though ;-)
> 
> Yeah, 50GB is too much logging for only 150K docs. Maybe start by
> filtering by log level (WARN and more severe). If a server crashes, you're
> likely to see some errors in the logstash side but sometimes you may have
> to SSH to the specific box and look at the local log (so definitely append
> all messages to the local Solr log too), I'm using something like the
> following for local logging:
> 
> log4j.rootLogger=INFO, file
> log4j.appender.file=org.apache.log4j.RollingFileAppender
> log4j.appender.file.MaxFileSize=50MB
> log4j.appender.file.MaxBackupIndex=10
> log4j.appender.file.File=logs/solr.log
> log4j.appender.file.layout=org.apache.log4j.PatternLayout
> log4j.appender.file.layout.ConversionPattern=%d{ISO8601} [%t] %-5p %c{3}
> %x - %m%n
> 
> 
> Timothy Potter
> Sr. Software Engineer, LucidWorks
> www.lucidworks.com
> 
> 
> From: adfel70 <

> adfel70@

> >
> Sent: Monday, December 30, 2013 9:34 AM
> To: 

> solr-user@.apache

> Subject: RE: monitoring solr logs
> 
> Actually I was considering using logstash4solr, but it didn't seem mature
> enough.
> does it work fine? any known bugs?
> 
> are you collecting the logs in the same solr cluster you use for the
> production systems?
> if so, what will you do if for some reason solr is down and you would like
> to analyze the logs to see what happend?
> 
> btw, i started a new solr cluster with 7 shards, replicationfactor=3 and
> run
> indexing job of 400K docs,
> it got stuck on 150K because I used Socketappender directly to write to
> logstash and logstash disk got full.
> 
> that's why I moved to using AsyncAppender, and I plan on moving to using
> rabbit.
> but this is also why I wanted to filter some of the logs. indexing 150K
> docs
> prodcued 50GB of logs.
> this seemed too much.
> 
> 
> 
> 
> Tim Potter wrote
>> I'm using logstash4solr (http://logstash4solr.org) for something similar
>> ...
>>
>> I setup my Solr to use Log4J by passing the following on the command-line
>> when starting Solr:
>> -Dlog4j.configuration=file:///$SCRIPT_DIR/log4j.properties
>>
>> Then I use a custom Log4J appender that writes to RabbitMQ:
>>
>> https://github.com/plant42/rabbitmq-log4j-appender
>>
>> You can then configure a RabbitMQ input for logstash -
>> http://logstash.net/docs/1.3.2/inputs/rabbitmq
>>
>> This decouples the log writes from log indexing in logstash4solr, which
>> scales better for active Solr installations.
>>
>> Btw ... I just log everything from Solr using this approach but you can
>> use standard Log4J configuration settings to limit which classes / log
>> levels to send to the RabbitMQ appender.
>>
>> Cheers,
>>
>> Timothy Potter
>> Sr. Software Engineer, LucidWorks
>> www.lucidworks.com
>>
>> 
>> From: adfel70 <
> 
>> adfel70@
> 
>> >
>> Sent: Monday, December 30, 2013 8:15 AM
>> To:
> 
>> solr-user@.apache
> 
>> Subject: monitoring solr logs
>>
>> hi
>> i'm trying to figure out which solr and zookeeper logs i should monitor
>> and
>> collect.
>> All the logs will be written to a file but I want to collect some of them
>> with logstash in order to be able to analyze them efficiently.
>> any inputs on logs of which classes i should collect?
>>
>> thanks.
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/monitoring-solr-logs-tp4108721.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> 
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/monitoring-solr-logs-tp4108721p4108737.html
> Sent from the Solr - User mailing list archive at Nabble.com.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/monitoring-solr-logs-tp4108721p4108744.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: monitoring solr logs

2013-12-30 Thread adfel70
Actually I was considering using logstash4solr, but it didn't seem mature
enough.
Does it work fine? Any known bugs?

Are you collecting the logs in the same Solr cluster you use for the
production systems?
If so, what will you do if for some reason Solr is down and you would like
to analyze the logs to see what happened? 

BTW, I started a new Solr cluster with 7 shards, replicationFactor=3 and ran an
indexing job of 400K docs.
It got stuck at 150K because I used SocketAppender directly to write to
logstash and the logstash disk got full.

That's why I moved to using AsyncAppender, and I plan on moving to using
RabbitMQ.
But this is also why I wanted to filter some of the logs. Indexing 150K docs
produced 50GB of logs.
This seemed like too much.




Tim Potter wrote
> I'm using logstash4solr (http://logstash4solr.org) for something similar
> ...
> 
> I setup my Solr to use Log4J by passing the following on the command-line
> when starting Solr: 
> -Dlog4j.configuration=file:///$SCRIPT_DIR/log4j.properties
> 
> Then I use a custom Log4J appender that writes to RabbitMQ: 
> 
> https://github.com/plant42/rabbitmq-log4j-appender
> 
> You can then configure a RabbitMQ input for logstash -
> http://logstash.net/docs/1.3.2/inputs/rabbitmq
> 
> This decouples the log writes from log indexing in logstash4solr, which
> scales better for active Solr installations.
> 
> Btw ... I just log everything from Solr using this approach but you can
> use standard Log4J configuration settings to limit which classes / log
> levels to send to the RabbitMQ appender.
> 
> Cheers,
> 
> Timothy Potter
> Sr. Software Engineer, LucidWorks
> www.lucidworks.com
> 
> 
> From: adfel70 <

> adfel70@

> >
> Sent: Monday, December 30, 2013 8:15 AM
> To: 

> solr-user@.apache

> Subject: monitoring solr logs
> 
> hi
> i'm trying to figure out which solr and zookeeper logs i should monitor
> and
> collect.
> All the logs will be written to a file but I want to collect some of them
> with logstash in order to be able to analyze them efficiently.
> any inputs on logs of which classes i should collect?
> 
> thanks.
> 
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/monitoring-solr-logs-tp4108721.html
> Sent from the Solr - User mailing list archive at Nabble.com.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/monitoring-solr-logs-tp4108721p4108737.html
Sent from the Solr - User mailing list archive at Nabble.com.


monitoring solr logs

2013-12-30 Thread adfel70
Hi
I'm trying to figure out which Solr and ZooKeeper logs I should monitor and
collect.
All the logs will be written to a file, but I want to collect some of them
with logstash in order to be able to analyze them efficiently.
Any input on which classes' logs I should collect?

thanks.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/monitoring-solr-logs-tp4108721.html
Sent from the Solr - User mailing list archive at Nabble.com.


problem with facets - out of memory exception

2013-12-19 Thread adfel70
Hi
I have a cluster of 14 nodes (7 shards, 2 replicas), each node with a 6GB JVM.
Solr 4.3.0.
I have 400 million docs in the cluster, each node with around 60GB of index.
I index new docs each night, around a million a night.

As the index started to grow, I started having OutOfMemory problems
when querying with facets.
The exception occurs in one of the nodes when querying a specific facet
field. When I restart this node and query again it doesn't happen, until I
perform some more indexing, and then it might happen again with another facet
field.

The fields that cause the failure have less than 20 unique values.

Any idea why this happens?
Why does restarting the node (without adding more memory) solve the problem
temporarily?
What does Solr do behind the scenes when asking for facets?

Thanks.






--
View this message in context: 
http://lucene.472066.n3.nabble.com/problem-with-facets-out-of-memory-exception-tp4107390.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr cloud - deleting and adding the same doc

2013-12-17 Thread adfel70
Can you elaborate on the bulk or streaming APIs?



Mark Miller-3 wrote
> As long as you are not using the bulk or streaming API’s. Solrj does not
> currently respect delete/add ordering in those cases, though each of the
> two types are ordered. For the standard update per request, as long as
> it’s the same client, this is a guarantee.
> 
> - Mark
> 
> On Dec 17, 2013, at 9:54 AM, adfel70 <

> adfel70@

> > wrote:
> 
>> Hi
>> in SolrCloud, if I send 2 different requests to solr - one with delete
>> action of doc with id X and another with add action of doc with the same
>> id
>> - is it guaranteed that the delete action will occur before the add
>> action?
>> 
>> Is it guaranteed that after all actions are done, the index will have doc
>> X
>> with its most updated state?
>> 
>> thanks.
>> 
>> 
>> 
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/solr-cloud-deleting-and-adding-the-same-doc-tp4107111.html
>> Sent from the Solr - User mailing list archive at Nabble.com.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-cloud-deleting-and-adding-the-same-doc-tp4107111p4107214.html
Sent from the Solr - User mailing list archive at Nabble.com.


solr cloud - deleting and adding the same doc

2013-12-17 Thread adfel70
Hi
in SolrCloud, if I send 2 different requests to solr - one with delete
action of doc with id X and another with add action of doc with the same id
- is it guaranteed that the delete action will occur before the add action?

Is it guaranteed that after all actions are done, the index will have doc X
with its most updated state?

thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-cloud-deleting-and-adding-the-same-doc-tp4107111.html
Sent from the Solr - User mailing list archive at Nabble.com.


Upgrading Solr cluster without downtime

2013-12-01 Thread adfel70
I was wondering if there is a way to upgrade the Solr version without downtime.
Theoretically it seems possible when every shard in the cluster has at least
2 replicas - but Jetty does not refresh the web container until we delete the
solr-webapp folder's content.
Can someone please share their experience?
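
For reference, the per-node sequence I have in mind looks roughly like this (a
sketch; paths assume the stock Jetty layout, and the exact start/stop commands
depend on how the nodes are run):

# one node at a time, while the remaining replicas of each shard keep serving:
# 1. stop the Solr process on this node
# 2. drop in the new war and clear the extracted webapp so Jetty re-extracts it
cp solr-4.x.y.war /opt/solr/node1/webapps/solr.war
rm -rf /opt/solr/node1/solr-webapp/*
# 3. start the node again and wait until all its cores show as active in
#    clusterstate.json before moving on to the next node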



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Upgrading-Solr-cluster-without-downtime-tp4104223.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr as a service for multiple projects in the same environment

2013-11-30 Thread adfel70
The risk is that if you by mistake mess up a cluster while doing maintenance on
one of the systems, you can affect the other system.
It's a pretty amorphous risk.
Aside from having multiple systems share the same hardware resources, I
don't see any other real risk.

Do your collections share the same topology in terms of shards and
replicas?
Do you manually configure the nodes on which each collection is created, so
that you'll still have some level of separation between the systems?




michael.boom wrote
> Hi,
> 
> There's nothing unusual in what you are trying to do, this scenario is
> very common.
> 
> To answer your questions:
>> 1. as I understand I can separate the configs of each collection in
>> zookeeper. is it correct? 
> Yes, that's correct. You'll have to upload your configs to ZK and use the
> CollectionAPI to create your collections.
> 
>>2.are there any solr operations that can be performed on collection A and
somehow affect collection B? 
> No, I can't think of any cross-collection operation. Here you can find a
> list of collection related operations:
> https://cwiki.apache.org/confluence/display/solr/Collections+API
> 
>>3. is the solr cache separated for each collection? 
> Yes, separate and configurable in solrconfig.xml for each collection.
> 
>>4. I assume that I'll encounter a problem with the os cache, when the
different indices will compete on the same memory, right? how severe is this
issue? 
> Hardware can be a bottleneck. If all your collection will face the same
> load you should try to give solr a RAM amount equal to the index size (all
> indexes)
> 
>>5. any other advice on building such an architecture? does the maintenance
overhead of maintaining multiple clusters in production really overwhelm the
problems and risks of using the same cluster for multiple systems? 
> I was in the same situation as you, and putting everything in multiple
> collections in just one cluster made sense for me : it's easier to manage
> and has no obvious downside. As for "risks of using the same cluster for
> multiple systems" they are pretty much the same  in both scenarios. Only
> that with multiple clusters you'll have much more machines to manage.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-as-a-service-for-multiple-projects-in-the-same-environment-tp4103523p4104206.html
Sent from the Solr - User mailing list archive at Nabble.com.


solr as a service for multiple projects in the same environment

2013-11-27 Thread adfel70
Hi
I have various Solr-related projects in a single environment.
These projects are not related to one another.

I'm thinking of building a Solr architecture so that all the projects will
use different Solr collections in the same cluster, as opposed to having a
Solr cluster for each project.

1. As I understand it, I can separate the configs of each collection in
ZooKeeper. Is that correct? (See the sketch after these questions.)
2. Are there any Solr operations that can be performed on collection A and
somehow affect collection B?
3. Is the Solr cache separated for each collection? 
4. I assume that I'll encounter a problem with the OS cache, when the
different indices compete for the same memory, right? How severe is this
issue?
5. Any other advice on building such an architecture? Does the maintenance
overhead of maintaining multiple clusters in production really overwhelm the
problems and risks of using the same cluster for multiple systems?
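
To make question 1 concrete, the flow I have in mind looks roughly like this (a
sketch; hosts, ports, paths and config names are made up):

# upload a separate config set per project
./zkcli.sh -zkhost zk1:2181 -cmd upconfig -confdir /path/to/projectA/conf -confname projectA_conf
./zkcli.sh -zkhost zk1:2181 -cmd upconfig -confdir /path/to/projectB/conf -confname projectB_conf

# create one collection per project, each bound to its own config
# (createNodeSet could pin a collection to specific nodes for some separation)
curl "http://solr1:8983/solr/admin/collections?action=CREATE&name=projectA&numShards=4&replicationFactor=2&collection.configName=projectA_conf"
curl "http://solr1:8983/solr/admin/collections?action=CREATE&name=projectB&numShards=2&replicationFactor=2&collection.configName=projectB_conf"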

thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-as-a-service-for-multiple-projects-in-the-same-environment-tp4103523.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: syncronization between replicas

2013-11-27 Thread adfel70
I'm sorry, I forgot to write the problem.


adfel70 wrote
> 1. take one of the replicas of shard1 down(it doesn't matter which one)
> 2. continue indexing documents(that's important for this scenario)
> 3. take down the second replica of shard1(now the shard is down and we
> cannot index anymore)
> 4. take the replica from step 1 up(that's important that this replica will
> go up first)
> 5. take the replica from step 3 up

After the second replica is up, it has data that the first replica doesn't
have (step 2: we continued to index while the first replica was down). I need
to know if there is a way for the second replica to tell the first one that it
has data to sync with it...




--
View this message in context: 
http://lucene.472066.n3.nabble.com/syncronization-between-replicas-tp4103046p4103477.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: syncronization between replicas

2013-11-26 Thread adfel70
anyone?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/syncronization-between-replicas-tp4103046p4103455.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Setting solr.data.dir for SolrCloud instance

2013-11-26 Thread adfel70
The problem we had was that we tried to run: 
java -Dsolr.data.dir=/opt/solr/data -Dsolr.solr.home=/opt/solr/home -jar start.jar
and got different behavior for how Solr handles these 2 params.

We created 2 collections, which created 2 cores. 
Then we got 2 home dirs for the cores, as expected:
/opt/solr/home/collection1_shard1_replica1
/opt/solr/home/collection2_shard1_replica1

but instead of creating 2 data dirs like:
/opt/solr/data/collection1_shard1_replica1
/opt/solr/data/collection2_shard1_replica1
 
Solr had both cores' data dirs pointing to the same directory -
/opt/solr/data

When we tried putting a relative path in -Dsolr.data.dir, it worked as
expected.

I don't know if this is a bug, but we thought of 2 solutions in our case:
1. Point -Dsolr.data.dir to a relative path and symlink that path to the
absolute path we wanted in the first place.
2. Don't provide -Dsolr.data.dir at all, and then Solr puts the data dir
inside the home dir, which, as said, works with relative paths.

We chose the first option for now.
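
For reference, a rough sketch of how I understand option 1 on disk (paths are
examples, and the exact place the relative path resolves to is an assumption
based on our setup, where it ends up under each core's instance dir):

# point each core's relative "data" dir at the absolute location we actually want
ln -s /opt/solr/data/collection1_shard1_replica1 /opt/solr/home/collection1_shard1_replica1/data
ln -s /opt/solr/data/collection2_shard1_replica1 /opt/solr/home/collection2_shard1_replica1/data

java -Dsolr.data.dir=data -Dsolr.solr.home=/opt/solr/home -jar start.jar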





Erick Erickson wrote
> The data _is_ separated from the code. It's all relative
> to solr_home which need not have any relation to where
> the code is executing from.
> 
> For instance, I can start Solr like
> java -Dsolr.solr.home=/Users/Erick/testdir/solr -jar start.jar
> 
> and have my war in a completely different place.
> 
> Best,
> Erick
> 
> 
> On Tue, Nov 26, 2013 at 1:08 AM, adfel70 <

> adfel70@

> > wrote:
> 
>> Thanks for the reply, Erick.
>> Actually, I didnt not think this through. I just thought it would be a
>> good
>> idea to separate the data from the application code.
>> I guess I'll leave it without setting the datadir parameter and add a
>> symlink.
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Setting-solr-data-dir-for-SolrCloud-instance-tp4103052p4103228.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Setting-solr-data-dir-for-SolrCloud-instance-tp4103052p4103334.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Setting solr.data.dir for SolrCloud instance

2013-11-25 Thread adfel70
Thanks for the reply, Erick.
Actually, I did not think this through. I just thought it would be a good
idea to separate the data from the application code.
I guess I'll leave it without setting the dataDir parameter and add a
symlink.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Setting-solr-data-dir-for-SolrCloud-instance-tp4103052p4103228.html
Sent from the Solr - User mailing list archive at Nabble.com.


Setting solr.data.dir for SolrCloud instance

2013-11-25 Thread adfel70
I found something strange while trying to create more than one collection in
SolrCloud:
I am running every instance with -Dsolr.data.dir=/data
If I look at the Core Admin section, I can see that I have one core and its
dataDir is set to this fixed location. The problem is, if I create a new
collection, another core is created - but with this fixed index location
again.
I was expecting that the path I passed would serve as the BASE path for all
cores that the node hosts. The current behaviour seems like a bug to me, because
obviously one collection will see data that was not indexed into it.
Is there a way to overcome this? I mean, change the default data dir
location, but still be able to create more than one collection correctly?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Setting-solr-data-dir-for-SolrCloud-instance-tp4103052.html
Sent from the Solr - User mailing list archive at Nabble.com.


syncronization between replicas

2013-11-25 Thread adfel70
Hi,

We are currently running tests on Solr to find as many problems as possible in our
environment, so we can be ready for these kinds of problems in production.
Anyway, we found an edge case and have a few questions about it. 

We have one collection with two shards, each shard with replication factor 2.
We are sending docs to the index and everything is okay. Now the scenario:
1. Take one of the replicas of shard1 down (it doesn't matter which one)
2. Continue indexing documents (that's important for this scenario)
3. Take down the second replica of shard1 (now the shard is down and we
cannot index anymore)
4. Take the replica from step 1 up (it's important that this replica
goes up first)
5. Take the replica from step 3 up

The regular synchronization flow is that the leader synchronizes the other
replica, but I'm pretty sure this is a known issue. Is there a way to do a
two-way synchronization, or do you have any other solution for me?

thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/syncronization-between-replicas-tp4103046.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Commit behaviour in SolrCloud

2013-11-24 Thread adfel70
Just to clarify how these two phrases come together:
1. "you will know when an update is rejected - it just might not be easy to
know which in the batch / stream"

2. "Documents that come in batches are added as they come / are processed -
not in some atomic unit."


If I send a batch of documents in one update request, and some of the docs
fail - will the other docs still remain in the system?
What if a soft commit occurs after some of the docs but before all of the
docs get processed, and then some of the remaining docs fail during
processing?
I assume that the client will get an error for the whole batch (because of
the current error reporting strategy), but which docs will remain in the
system? Only those which got processed before the failure, or none of the docs in
this batch?




Mark Miller-3 wrote
> If you want this promise and complete control, you pretty much need to do
> a doc per request and many parallel requests for speed.
> 
> The bulk and streaming methods of adding documents do not have a good fine
> grained error reporting strategy yet. It’s okay for certain use cases and
> and especially batch loading, and you will know when an update is rejected
> - it just might not be easy to know which in the batch / stream.
> 
> Documents that come in batches are added as they come / are processed -
> not in some atomic unit.
> 
> What controls how soon you will see documents or whether you will see them
> as they are still loading is simply when you soft commit and how many docs
> have been indexed when the soft commit happens.
> 
> - Mark
> 
> On Nov 25, 2013, at 1:03 AM, adfel70 <

> adfel70@

> > wrote:
> 
>> Hi Mark, Thanks for the answer.
>> 
>> One more question though: You say that if I get a success from the
>> update,
>> it’s in the system, commit or not. But when exactly do I get this
>> feedback -
>> Is it one feedback per the whole request, or per one add inside the
>> request?
>> I will give an example clarify my question: Say I have new empty index,
>> and
>> I repeatedly send indexing requests - every request adds 500 new
>> documents
>> to the index. Is it possible that in some point during this process, to
>> query the index and get a total of 1,030 docs total? (Lets assume there
>> were
>> no indexing errors got from Solr)
>> 
>> Thanks again.
>> 
>> 
>> 
>> 
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Commit-behaviour-in-SolrCloud-tp4102879p4102996.html
>> Sent from the Solr - User mailing list archive at Nabble.com.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Commit-behaviour-in-SolrCloud-tp4102879p4102999.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Commit behaviour in SolrCloud

2013-11-24 Thread adfel70
Hi Mark, Thanks for the answer.

One more question though: you say that if I get a success from the update,
it’s in the system, commit or not. But when exactly do I get this feedback -
is it one feedback per whole request, or per single add inside the request?
I will give an example to clarify my question: say I have a new empty index, and
I repeatedly send indexing requests - every request adds 500 new documents
to the index. Is it possible, at some point during this process, to
query the index and get a total of 1,030 docs? (Let's assume there were
no indexing errors returned from Solr)

Thanks again.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Commit-behaviour-in-SolrCloud-tp4102879p4102996.html
Sent from the Solr - User mailing list archive at Nabble.com.


Commit behaviour in SolrCloud

2013-11-24 Thread adfel70
Hi everyone,

I am wondering how the commit operation works in SolrCloud:
Say I have 2 parallel indexing processes. What if one process sends a big
update request (an add command with a lot of docs), and the other one just
happens to send a commit command while the update request is being
processed? 
Is it possible that only part of the documents will be committed? 
What will happen with the other docs? Is Solr transactional, promising that
there will be no partial results?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Commit-behaviour-in-SolrCloud-tp4102879.html
Sent from the Solr - User mailing list archive at Nabble.com.


Question regarding possibility of data loss

2013-11-19 Thread adfel70
Hi, we plan to establish an ensemble of Solr with ZooKeeper. 
We are going to have 6 Solr servers with 2 instances on each server; we'll also
have 6 shards with replication factor 2, and in addition we'll have 3
ZooKeepers. 

Our concern is that we will send documents to index and Solr won't index
them, but won't return any error message, and we will suffer data loss.

1. Is there any situation that can cause this kind of problem? 
2. Can it happen if some of the ZKs are down? Or some of the Solr instances? 
3. How can we monitor them? Can we do something to prevent these kinds of
errors? 

Thanks in advance 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Question-regarding-possibility-of-data-loss-tp4101915.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solrcloud shards backup/restoration

2013-11-07 Thread adfel70
Did you solve this eventually?


Aditya Sakhuja wrote
> How does one recover from an index corruption ? That's what I am trying to
> eventually tackle here.
> 
> Thanks
> Aditya
> 
> On Thursday, September 19, 2013, Aditya Sakhuja wrote:
> 
>> Hi,
>>
>> Sorry for the late followup on this. Let me put in more details here.
>>
>> *The problem:*
>>
>> Cannot successfully restore back the index backed up with
>> '/replication?command=backup'. The backup was generated as *
>> snapshot.mmdd*
>>
>> *My setup and steps:*
>> *
>> *
>> 6 solrcloud instances
>> 7 zookeepers instances
>>
>> Steps:
>>
>> 1.> Take snapshot using
>> *http://host1:8893/solr/replication?command=backup
>> *, on one host only. move *snapshot.mmdd *to some reliable storage.
>>
>> 2.> Stop all 6 solr instances, all 7 zk instances.
>>
>> 3.> Delete ../collectionname/data/* on all solrcloud nodes. ie. deleting
>> the index data completely.
>>
>> 4.> Delete zookeeper/data/version*/* on all zookeeper nodes.
>>
>> 5.> Copy back index from backup to one of the nodes.
>>  \> cp *snapshot.mmdd/*  *../collectionname/data/index/*
>>
>> 6.> Restart all zk instances. Restart all solrcloud instances.
>>
>>
>> *Outcome:*
>> *
>> *
>> All solr instances are up. However, *num of docs = 0 *for all nodes.
>> Looking at the node where the index was restored, there is a new
>> index.yymmddhhmmss directory being created and index.properties pointing
>> to
>> it. That explains why no documents are reported.
>>
>>
>> How do I have solrcloud pickup data from the index directory on a restart
>> ?
>>
>> Thanks in advance,
>> Aditya
>>
>>
>>
>> On Fri, Sep 6, 2013 at 3:41 PM, Aditya Sakhuja <

> aditya.sakhuja@

> >wrote:
>>
>> Thanks Shalin and Mark for your responses. I am on the same page about
>> the
>> conventions for taking the backup. However, I am less sure about the
>> restoration of the index. Lets say we have 3 shards across 3 solrcloud
>> servers.
>>
>> 1.> I am assuming we should take a backup from each of the shard leaders
>> to get a complete collection. do you think that will get the complete
>> index
>> ( not worrying about what is not hard committed at the time of backup ).
>> ?
>>
>> 2.> How do we go about restoring the index in a fresh solrcloud cluster ?
>> From the structure of the snapshot I took, I did not see any
>> replication.properties or index.properties  which I see normally on a
>> healthy solrcloud cluster nodes.
>> if I have the snapshot named snapshot.20130905 does the
>> snapshot.20130905/* go into data/index ?
>>
>> Thanks
>> Aditya
>>
>>
>>
>> On Fri, Sep 6, 2013 at 7:28 AM, Mark Miller <

> markrmiller@

> > wrote:
>>
>> Phone typing. The end should not say "don't hard commit" - it should say
>> "do a hard commit and take a snapshot".
>>
>> Mark
>>
>> Sent from my iPhone
>>
>> On Sep 6, 2013, at 7:26 AM, Mark Miller <

> markrmiller@

> > wrote:
>>
>> > I don't know that it's too bad though - its always been the case that
>> if
>> you do a backup while indexing, it's just going to get up to the last
>> hard
>> commit. With SolrCloud that will still be the case. So just make sure you
>> do a hard commit right before taking the backup - yes, it might miss a
>> few
>> docs in the tran log, but if you are taking a back up while indexing, you
>> don't have great precision in any case - you will roughly get a snapshot
>> for around that time - even without SolrCloud, if you are worried about
>> precision and getting every update into that backup, you want to stop
>> indexing and commit first. But if you just want a rough snapshot for
>> around
>> that time, in both cases you can still just don't hard commit and take a
>> snapshot.
>> >
>> > Mark
>> >
>> > Sent from my iPhone
>> >
>> > On Sep 6, 2013, at 1:13 AM, Shalin Shekhar Mangar <
>> 

> shalinmangar@

>> wrote:
>> >
>> >> The replication handler's backup command was built for pre-SolrCloud.
>> >> It takes a snapshot of the index but it is unaware of the transaction
>> >> log which is a key component in SolrCloud. Hence unless you stop
>> >> updates, commit your changes and then take a backup, you will likely
>> >> miss some updates.
>> >>
>> >> That being said, I'm curious to see how peer sync behaves when you try
>> >> to restore from a snapshot. When you say that you haven't been
>> >> successful in restoring, what exactly is the behaviour you observed?
>> >>
>> >> On Fri, Sep 6, 2013 at 5:14 AM, Aditya Sakhuja <
>> 

> aditya.sakhuja@

>> wrote:
>> >>> Hello,
>> >>>
>> >>> I was looking for a good backup / recovery solution for the solrcloud
>> >>> indexes. I am more looking for restoring the indexes from the index
>> >>> snapshot, which can be taken using the replicationHandler's backup
>> command.
>> >>>
>> >>> I am looking for something that works with solrcloud 4.3 eventually,
>> but
>> >>> still relevant if you tested with a previous version.
>> >>>
>> >>> I haven't been successful in have the restored index replicate across
>> the
>> >>> new replicas, after 

Re: Soft commit and flush

2013-10-07 Thread adfel70
Sorry, by "OOE" I meant an out-of-memory exception...



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Soft-commit-and-flush-tp4091726p4093902.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Soft commit and flush

2013-10-06 Thread adfel70
I understand the bottom line that soft commits are about visibility and hard
commits are about durability. I am just trying to gain a better understanding
of what happens under the hood...
2 more related questions you made me think of:
1. Is the NRTCachingDirectoryFactory relevant for both types of commit, or
just for hard commit?
2. If soft commit does not flush - does all the data stay in RAM until we call hard
commit? If so, could using soft commit without calling hard commit cause an OOE
... ?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Soft-commit-and-flush-tp4091726p4093834.html
Sent from the Solr - User mailing list archive at Nabble.com.


solr cpu usage

2013-10-01 Thread adfel70
Hi
We're building a spec for a machine to purchase.
We're going to buy 10 machines.
We aren't sure yet how many processes we will run per machine.
The question is - should we buy a faster CPU with fewer cores or a slower CPU
with more cores?
In any case we will have 2 CPUs in each machine.
Should we buy a 2.6GHz CPU with 8 cores or a 3.5GHz CPU with 4 cores?

What will we gain by having many cores?

What kinds of usage would make the CPU the bottleneck?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-cpu-usage-tp4092938.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Maximum solr processes per machine

2013-09-29 Thread adfel70
Bram Van Dam wrote
> On 09/29/2013 04:03 PM, adfel70 wrote:
> If you're doing real time on a 5TB index then you'll probably want to 
> throw your money at the fastest storage you can afford (SSDs vs spinning 
> rust made a huge difference in our benchmarks) and the fastest CPUs you 
> can get your hands on. Memory is important too, but in our benchmarks 
> that didn't have as much impact as the other factors. Keeping a 5TB 
> index in memory is going to be tricky, so in my opinion you'd be better 
> off investing in faster disks instead.

Can you please elaborate on your benchmarks? What was the cluster size,
which hardware (CPUs, RAM size, disk type...), and so on?
This info might really help us. 
Also, if I understand you correctly, from a certain index size on, the impact of
RAM size is less important than disk performance?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Maximum-solr-processes-per-machine-tp4092568p4092651.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Maximum solr processes per machine

2013-09-29 Thread adfel70
How can I configure the disk storage so that disk access is optimized?
I'm considering RAID-10,
and I think I'll have around 4-8 disks per machine.
Should I point each Solr JVM at a data dir on a different disk, or is
there some other way to optimize this?



Erick Erickson wrote
> bq: is there an upper limit of amount of solr processes per machine,
> 
> No, assuming they're all in separate JVMs. I've see reports, though,
> that increasing the number of JVMs past the number of CPU
> cores gets into "iffy" territory.
> 
> And, depending on your disk storage they may all be contending for
> disk access.
> 
> FWIW,
> Erick
> 
> On Sun, Sep 29, 2013 at 9:21 AM, adfel70 <

> adfel70@

> > wrote:
>> Hi,
>> I'm thinking of solr cluster architecture before purchasing machines.
>>
>>
>> My total index size is around 5TB. I want to have replication factor of
>> 3.
>> total 15TB.
>> I've understood that I should  have 50-100% of the index size as ram, for
>> OS
>> cache. Lets say we're talking about around 10TB of memory.
>> Now I need to split this memory to multiple servers and get the machine
>> spec
>> I want to buy.
>> I'm thinking of running multiple solr processes per machine.
>> is there an upper limit of amount of solr processes per machine, assuming
>> that I make sure that the total size of indexes of all nodes in the
>> machine
>> is within the RAM percentage i've defined?
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Maximum-solr-processes-per-machine-tp4092568.html
>> Sent from the Solr - User mailing list archive at Nabble.com.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Maximum-solr-processes-per-machine-tp4092568p4092574.html
Sent from the Solr - User mailing list archive at Nabble.com.


Maximum solr processes per machine

2013-09-29 Thread adfel70
Hi,
I'm thinking about the Solr cluster architecture before purchasing machines.


My total index size is around 5TB. I want to have a replication factor of 3,
15TB in total.
I've understood that I should have 50-100% of the index size as RAM, for the OS
cache. Let's say we're talking about around 10TB of memory.
Now I need to split this memory across multiple servers and work out the machine spec
I want to buy.
I'm thinking of running multiple Solr processes per machine.
Is there an upper limit on the number of Solr processes per machine, assuming
I make sure that the total size of the indexes of all nodes on the machine
is within the RAM percentage I've defined?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Maximum-solr-processes-per-machine-tp4092568.html
Sent from the Solr - User mailing list archive at Nabble.com.


Soft commit and flush

2013-09-24 Thread adfel70
I am struggling to get a deep understanding of soft commit.
I have read Erick's post, which helped me a lot with when and why we should
call each type of commit.
But still, I can't understand what exactly happens when we call soft commit:
I mean, is the new data flushed, fsynced, or held in RAM... ?
I tried to test it myself and I got 2 different behaviours: 
a. If I just had 1 document that was added to the index, soft commit did not
cause the index files to change.
b. If I had a big change (addition of about 100,000 docs, ~5MB tlog file),
calling the soft commit DID change the index files - so I guess that soft
commit caused an fsync.

My conclusion is that soft commit always flushes the data, but because of
the implementation of NRTCachingDirectoryFactory, the data is only written
to disk when it gets too big. 

Can someone please correct me? 
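
For reference, the directory factory I am referring to is configured in
solrconfig.xml roughly like this (the thresholds below, and the exact way to
pass them, are my assumption - I have not verified these values):

<directoryFactory name="DirectoryFactory"
                  class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}">
  <!-- small, newly flushed segments are cached in RAM and only hit the disk
       once they grow past these thresholds or an fsync (hard commit) happens -->
  <double name="maxMergeSizeMB">4</double>
  <double name="maxCachedMB">48</double>
</directoryFactory>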



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Soft-commit-and-flush-tp4091726.html
Sent from the Solr - User mailing list archive at Nabble.com.


using tika inside SOLR vs using nutch

2013-09-10 Thread adfel70
Hi

What are the pros and cons of the two use cases?
1. Use Nutch to crawl the file system + parse files + perform other data
manipulation and eventually index to Solr.
2. Use Solr DataImportHandlers and plugins in order to perform this task.


Note that I have tens of millions of docs which I need to handle the first
time, and then delta imports of around 100K docs per day.
Each doc may be up to 100MB.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/using-tika-inside-SOLR-vs-using-nutch-tp4089120.html
Sent from the Solr - User mailing list archive at Nabble.com.


Question about SOLR-5017 - Allow sharding based on the value of a field

2013-08-28 Thread adfel70
Hi
I'm looking into allowing query-time joins in SolrCloud.
This has the limitation of having to index all the documents that are
joinable together to the same shard.
I'm wondering if SOLR-5017 would give me the
ability to do so without implementing my own routing mechanism?

If I add a field named "parent_id" and give that field the same value in all
the documents that I want to join, it seems, theoretically, that it will be
enough.

Am I correct?
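
The setup I have in mind, sketched with made-up names (host, collection, shard
counts and field values are examples):

# route documents by the value of parent_id instead of by the uniqueKey
curl "http://solr1:8983/solr/admin/collections?action=CREATE&name=joined&numShards=4&replicationFactor=2&router.name=compositeId&router.field=parent_id"

# the idea is that with all docs sharing a parent_id co-located on one shard,
# a join like this could be answered locally on each shard
curl "http://solr1:8983/solr/joined/select?q={!join+from=parent_id+to=parent_id}doc_type:child"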

Thanks.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Question-about-SOLR-5017-Allow-sharding-based-on-the-value-of-a-field-tp4087050.html
Sent from the Solr - User mailing list archive at Nabble.com.


What do you use for solr's logging analysis?

2013-08-11 Thread adfel70
Hi
I'm looking for a tool that could help me perform Solr log analysis.
I use SolrCloud on multiple servers, so the tool should be able to collect
logs from multiple servers.

Any tool you use and can advise on?

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/What-do-you-use-for-solr-s-logging-analysis-tp4083809.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr qtime suddenly increased in production env

2013-08-05 Thread adfel70
Thanks for your detailed answer.
Some follow-up questions:

1. Are there any tests I can run to determine 100% that this is a "not
enough RAM" scenario?

2. It sounds like I always need to have as much RAM as the size of the index.
Is this really a MUST for getting good search performance?


Shawn Heisey-4 wrote
> On 8/5/2013 10:17 AM, adfel70 wrote:
>> I have a solr cluster of 7 shards, replicationFactor 2, running on 7
>> physical
>> machines.
>> Machine spec:
>> cpu: 16
>> memory: 32gb
>> storage is on local disks
>>
>> Each machine runs 2 solr processes, each process with 6gb memory to jvm.
>>
>> The cluster currently has 330 million documents, each process around 30gb
>> of
>> data.
>>
>> Until recently performance was fine, but after a recent indexing which
>> added
>> arround 25 million docs, the search performance degraded dramatically.
>> I'm now getting qtime of 30 second and sometimes even 60 seconds, for
>> simple
>> queries (fieldA:value AND fieldB:value + facets + highlighting)
>>
>> Any idea how can I check where the problem is?
> 
> Sounds like a "not enough RAM" scenario.  It's likely that you were 
> sitting at a threshold for a performance problem, and the 25 million 
> additional documents pushed your installation over that threshold.  I 
> think there are two possibilities:
> 
> 1) Not enough Java heap, resulting in major GC pauses as it works to 
> free up memory for basic operation.  If this is the problem, increasing 
> your 6GB heap and/or using facet.method=enum will help.  Note that 
> facet.method=enum will make facet performance much more dependent on the 
> OS disk cache being big enough, which leads into the other problem:
> 
> 2) Not enough OS disk cache for the size of your index.  You have two 
> processes each eating up 6GB of your 32GB RAM.  If Solr is the only 
> thing running on these servers, then you have slightly less than 20GB of 
> memory for your OS disk cache.  If other things are running on the 
> hardware, then you have even less available.
> 
> With 60GB of data (two shard replicas at 30GB each) on each server, you 
> want between 30GB and 60GB of RAM available for your OS disk cache, 
> making 64GB an ideal RAM size for your servers.  The alternative is to 
> add servers that each have 32GB and make a new index with a larger 
> numShards.
> 
> http://wiki.apache.org/solr/SolrPerformanceProblems
> 
> The first thing I'd try is running only one Solr process per machine. 
> You might need an 8GB heap instead of a 6GB heap, but that would give 
> you 4GB more per machine for the OS disk cache.  There's no need to have 
> two complete containers running Solr on every machine - SolrCloud's 
> Collections API has a maxShardsPerNode parameter that lets it run 
> multiple indexes on one instance.
> 
> For any change other than just adding RAM to the hardware, it's likely 
> that you'll need to start over and rebuild your collection from scratch.
> 
> Thanks,
> Shawn
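
For reference, a create call along the lines you describe (one Solr process
per machine, up to two replicas per node via maxShardsPerNode) might look
roughly like this - the names and counts are placeholders:

curl "http://host1:8983/solr/admin/collections?action=CREATE&name=collection2&numShards=7&replicationFactor=2&maxShardsPerNode=2&collection.configName=myconf"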





--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-qtime-suddenly-increased-in-production-env-tp4082605p4082616.html
Sent from the Solr - User mailing list archive at Nabble.com.


solr qtime suddenly increased in production env

2013-08-05 Thread adfel70
I have a solr cluster of 7 shards, replicationFactor 2, running on 7 physical
machines.
Machine spec:
cpu: 16
memory: 32gb
storage is on local disks

Each machine runs 2 solr processes, each process with 6gb memory to jvm.

The cluster currently has 330 million documents, each process around 30gb of
data.

Until recently performance was fine, but after a recent indexing run which added
around 25 million docs, the search performance degraded dramatically.
I'm now getting qtimes of 30 seconds and sometimes even 60 seconds for simple
queries (fieldA:value AND fieldB:value + facets + highlighting).

Any idea how can I check where the problem is?

Could it be that the recent indexing caused the cluster to choke, with each
shard reaching its limit, and that I now need to add more shards? (either by
splitting, or by reindexing into a new collection with more shards)


Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-qtime-suddenly-increased-in-production-env-tp4082605.html
Sent from the Solr - User mailing list archive at Nabble.com.


Need advice on performing 300 queries per second on solr index

2013-07-16 Thread adfel70
Hi
I need to create a Solr cluster that contains geospatial information and can
handle a few hundred queries per second, where each query should retrieve
around 100k results.
The data is around 100k documents, around 300GB in total.

I started with a 2-shard cluster (replicationFactor=1) and a portion of the
data - 20GB.

I ran some load tests and saw that when 100 requests are sent in one second,
the average qTime is around 4 seconds, but the average total response time
(measured from sending the request to Solr until getting a response) reaches
20-25 seconds, which is very bad.

Currently I load-balance myself between the 2 Solr servers (each request is
sent to the other server in turn).

Any advice on which resources I need and what my Solr cluster should look
like?
More shards? More replicas? Another web server?

Thanks.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Need-advice-on-performing-300-queries-per-second-on-solr-index-tp4078353.html
Sent from the Solr - User mailing list archive at Nabble.com.


Every collection.reload makes zookeeper think shards are down

2013-07-08 Thread adfel70
Hi

Each time I reload a collection via the Collections API, ZooKeeper thinks that
all the shards in the collection are down.
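
For reference, the reload is issued along these lines (the collection name is a
placeholder):

curl "http://host1:8983/solr/admin/collections?action=RELOAD&name=collection1"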

It marks them as down and I can't send requests.

Why "thinks"? Because if I manually edit the clusterstate.json file and set the
'state' value to 'active', the shards come back up and requests work fine.

Any idea how can I fix this?
Is it possible I did something wrong when configuring Solr with ZooKeeper?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Every-collection-reload-makes-zookeeper-think-shards-are-down-tp4076260.html
Sent from the Solr - User mailing list archive at Nabble.com.


Is it possible to facet on existence of a field?

2013-07-08 Thread adfel70
I have a field that's only indexed in some of the documents.
Can I create a boolean facet on whether this field exists?
for instance:
yes(124)
no(479)

Note that the field's values themselves are not facetable, because they are
unique most of the time.
I just want to facet on whether this information (the field) is available or
not.
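
For example, I assume something along these lines might work - a pair of facet
queries on the field's existence (field and collection names are placeholders):

curl -g "http://localhost:8983/solr/collection1/select?q=*:*&rows=0&facet=true&facet.query=myfield:[*+TO+*]&facet.query=(*:*+-myfield:[*+TO+*])"

The count of the first facet.query would be the "yes" bucket and the second the
"no" bucket.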


Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-it-possible-to-facet-on-existence-of-a-field-tp4076226.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Why shouldn't lang-id component work at query-time?

2013-07-07 Thread adfel70
Well, yes, the problem is indeed simple...

Regarding the approach you're suggesting - if I query on multiple fields, each
field for a different language, why should it matter whether I use edismax
searching or the default Lucene searching?



Jack Krupansky-2 wrote
> The problem at query time is simple: a typical query has too few terms to 
> reliably identify the language using statistical techniques, especially
> for 
> a language like English which is famous for "borrowing" words from other 
> languages. I mean, is "raison d'être" REALLY French anymore? Or, are 
> "sombrero" or "poncho" or "mañana" really strictly Spanish anymore?
> 
> Multi-lingual support is an art/craft; don't expect cookbook answers that 
> will apply to all apps in all environments.
> 
> That said, Edismax searching of multiple field, one for each language is 
> probably the best you're going to do without doing something 
> super-sophisticated.
> 
> -- Jack Krupansky
> 
> -Original Message- 
> From: adfel70
> Sent: Sunday, July 07, 2013 1:32 PM
> To: 

> solr-user@.apache

> Subject: Why shouldn't lang-id component work at query-time?
> 
> Hi,
> I'm trying to integrate solr's lang-id component in my solr environment.
> In my scenario, I have documents in many different languages. I want to
> index them in the same solr collection, to different fields and apply
> language-specific analyzers on each field by its language.
> 
> So far lang-id component does exactly what I need.
> 
> The problem is that in all recepies that I've read, eventually at
> query-time
> I have to indicate which language I'm querying.
> Either by specifying the field I want to search:
> /solr/collection/select?q=text_it:abc abc
> Or by creating a language-specific request handler which I would have to
> use
> like this:
> /solr/collection/selectIT?q=text:abc abc
> 
> Either way, I must tell solr the language, which in my case - a web
> client+many different languages, it's quite problematic.
> 
> I was wondering why shouldn't lang-id component provide a full ability to
> index and query on multi-languages when both in indexing and in querying
> the
> language is transparent to the client.
> This could be achieved by applying the same language-detection tool at
> query
> time.
> 
> Any insights?
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Why-shouldn-t-lang-id-component-work-at-query-time-tp4076057.html
> Sent from the Solr - User mailing list archive at Nabble.com.
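
For reference, the multi-field edismax search mentioned above would look
roughly like this on my side (the per-language field names are placeholders):

curl "http://localhost:8983/solr/collection1/select?q=abc+abc&defType=edismax&qf=text_en+text_fr+text_it"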





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Why-shouldn-t-lang-id-component-work-at-query-time-tp4076057p4076062.html
Sent from the Solr - User mailing list archive at Nabble.com.


Why shouldn't lang-id component work at query-time?

2013-07-07 Thread adfel70
Hi,
I'm trying to integrate solr's lang-id component in my solr environment.
In my scenario, I have documents in many different languages. I want to
index them in the same solr collection, to different fields and apply
language-specific analyzers on each field by its language.

So far lang-id component does exactly what I need.
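
Indexing-side, I assume a setup roughly along these lines - a "langid" update
chain configured in solrconfig.xml with langid.fl, langid.langField and
langid.map set so that the text is copied into a per-language field, and
documents posted through that chain (names here are just placeholders):

curl "http://localhost:8983/solr/collection1/update?update.chain=langid&commit=true" \
  -H "Content-Type: application/json" \
  -d '[{"id":"1","text":"hello world"}]'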

The problem is that in all the recipes I've read, at query time I eventually
have to indicate which language I'm querying.
Either by specifying the field I want to search:
/solr/collection/select?q=text_it:abc abc
Or by creating a language-specific request handler which I would have to use
like this:
/solr/collection/selectIT?q=text:abc abc

Either way, I must tell Solr the language, which in my case - a web client and
many different languages - is quite problematic.

I was wondering why the lang-id component shouldn't provide the full ability to
index and query multiple languages, with the language transparent to the client
both at indexing and at query time.
This could be achieved by applying the same language-detection tool at query
time.

Any insights?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Why-shouldn-t-lang-id-component-work-at-query-time-tp4076057.html
Sent from the Solr - User mailing list archive at Nabble.com.

