Re: prefix length in fuzzy search solr 4.10.1

2014-11-01 Thread elisabeth benoit
ok, thanks for the answer.

best regards,
Elisabeth

2014-10-31 22:04 GMT+01:00 Jack Krupansky j...@basetechnology.com:

 No, but it is a reasonable request, as a global default, a
 collection-specific default, a request-specific default, and on an
 individual fuzzy term.

 -- Jack Krupansky

 -Original Message- From: elisabeth benoit
 Sent: Thursday, October 30, 2014 6:07 AM
 To: solr-user@lucene.apache.org
 Subject: prefix length in fuzzy search solr 4.10.1


 Hello all,

Is there a parameter in the Solr 4.10.1 API that allows the user to set the
prefix length in fuzzy search?

 Best regards,
 Elisabeth
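(For context on what a prefix length buys you: at the Lucene level the knob is the prefixLength argument of FuzzyQuery, which requires the first N characters to match exactly before any edit-distance work is done. The sketch below is illustrative Python, not Solr/Lucene code, and the term list is made up.)

```python
def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fuzzy_match(query, term, max_edits=2, prefix_length=0):
    """Like Lucene's FuzzyQuery: the first prefix_length characters must
    match exactly before we pay for an edit-distance computation."""
    if term[:prefix_length] != query[:prefix_length]:
        return False
    return edit_distance(query, term) <= max_edits

terms = ["elisabeth", "elizabeth", "alisabeth", "benoit"]
# With prefix_length=2, "alisabeth" is pruned without any distance computation.
print([t for t in terms if fuzzy_match("elisabeth", t, 2, 2)])
```

The pruning is the point: with a non-zero prefix, the term enumeration only ever visits terms sharing that prefix, which is why a configurable default would matter for performance.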



Re: Consul instead of ZooKeeper anyone?

2014-11-01 Thread Jürgen Wagner (DVT)
Hello Greg,
  Consul and Zookeeper are quite similar in their offering with respect
to what SolrCloud needs. Service discovery, watches on distributed
cluster state, and updates of configuration could all be handled through
Consul. Plus, Consul offers built-in capabilities for
multi-datacenter scenarios and encryption. Also, the capability to
query Consul via DNS, i.e., without any client-side library
requirements, is quite compelling. One could integrate Java, C/C++,
C#/.NET, Python, Ruby, and other types of clients without much effort.

The largest benefit, however, I see for the zoo of services around
Solr. In my experience at least, SolrCloud for serious applications is
never deployed by itself. There will be numerous services for data
collection, semantic processing, log management, monitoring,
administration, reporting and user front-ends around the core SolrCloud.
This zoo is hard to manage, and the coordination of configuration and
cluster consistency is especially difficult. Consul could help here, as
it comes from the more operations-oriented world of managing an elastic
set of services in data centers.

So, after singing the praises, why have I not started using Consul then? :-)

First and foremost: Zookeeper from the Hadoop/Apache ecosystem is
already integrated with SolrCloud. Ripping it out and replacing it with
something similar but not quite the same would require significant
effort, especially for testing this thoroughly. My clients are not
willing to pay for basic groundwork.

Second: Consul looks nice, but its documentation leaves many questions
open. Once you start setting it up, there will be questions where you
have to dive into the code for answers. Consul does not give me the same
mature impression as Zookeeper. So, I am still using our own service
management framework for the zoo of services in typical search clouds.
Consul is young, however, and may evolve. The version is 0.4.1, and I
don't use anything with a zero in front to manage a serious customer
infrastructure. Would you trust a customer's 50-100 TB of source
data to a set of SolrClouds based on a 0.x Consul? ;-)

Third: Consul lacks decent integration with log management. In any
distributed environment, you don't just want to keep a snapshot of the
moment, but rather a possibly long history of state changes and
statistics, so there is a chance not just to monitor, but also to act.
In that respect, we would need more cloud-management recipes
integrated, without having to pull in the entire Puppet or Chef stack
that will come with its own view of the world. That again is a topic of
maturity and being fit for real-life requirements. I would love to see
Consul evolve into that type of lightweight cloud management with basic
services integrated. But: some way to go still.

There are other issues, but these are the major ones from my perspective.

So, the concept is nice, Hashimoto et al. are known to be creative
heads, and therefore I will keep watching what's happening there, but I
won't use Consul for any real customer projects yet - not even the part
that is not SolrCloud-dependent.

Best regards,
--Jürgen



On 01.11.2014 00:08, Greg Solovyev wrote:
 I am investigating a project to make SolrCloud run on Consul instead of
 ZooKeeper. So far, my research has revealed no such efforts, but I wanted
 to check with this list to make sure I am not going to be reinventing the
 wheel. Has anyone attempted using Consul instead of ZK to coordinate
 SolrCloud nodes?

 Thanks, 
 Greg 



-- 

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
уважением
*i.A. Jürgen Wagner*
Head of Competence Center Intelligence
 Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com
mailto:juergen.wag...@devoteam.com, URL: www.devoteam.de
http://www.devoteam.de/


Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071




Re: Sharding configuration

2014-11-01 Thread Ramkumar R. Aiyengar
On 30 Oct 2014 23:46, Erick Erickson erickerick...@gmail.com wrote:

 This configuration deals with all
 the replication, NRT processing, self-repair when nodes go up and
 down and all that, but since there's no second trip to get the docs
 from shards your query performance won't be affected.

More or less. I vaguely recall that you would still need to add a
shortCircuit parameter to the URL in such a case to avoid a second trip. I
might be wrong here, but I do recall wondering why that wasn't the default.


 And using SolrCloud with a single shard will essentially scale linearly
 as you add nodes for queries.

 Best,
 Erick


 On Thu, Oct 30, 2014 at 8:29 AM, Anca Kopetz anca.kop...@kelkoo.com
wrote:
  Hi,
 
  You are right, it is a mistake in my phrasing: for the tests with 4
  shards / 4 instances, the latency was worse (therefore *bigger*) than
  for the tests with one shard.
 
  In our case, the query rate is high.
 
  Thanks,
  Anca
 
 
  On 10/30/2014 03:48 PM, Shawn Heisey wrote:
 
  On 10/30/2014 4:32 AM, Anca Kopetz wrote:
 
  We did some tests with 4 shards / 4 different tomcat instances on the
  same server, and the average latency was smaller than the one when
  having only one shard.
  We also tested shards on different servers, and the performance results
  were also worse.

  It seems that sharding does not make any difference for our index in
  terms of latency gains.
 
  That statement is confusing, because if latency goes down, that's good,
  not worse.
 
  If you're going to put multiple shards on one server, it should be done
  with one solr/tomcat instance, not multiple.  One instance is perfectly
  capable of dealing with many shards, and has a lot less overhead.  The
  SolrCloud collection create command would need the maxShardsPerNode
  parameter.
 
  In order to see a gain in performance from multiple shards per server,
  the server must have a lot of CPUs and the query rate must be fairly
  low.  If the query rate is high, then all the CPUs will be busy just
  handling simultaneous queries, so putting multiple shards per server
  will probably slow things down.  When query rate is low, multiple CPUs
  can handle each shard query simultaneously, speeding up the overall
query.
 
  Thanks,
  Shawn
 
 
  Kelkoo SAS
  Société par Actions Simplifiée
  Au capital de € 4.168.964,30
  Siège social : 8, rue du Sentier 75002 Paris
  425 093 069 RCS Paris
 
  This message and its attachments are confidential and intended solely
  for their addressees. If you are not the intended recipient of this
  message, please delete it and notify the sender.
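(The Collections API call Shawn refers to takes maxShardsPerNode at collection-create time. A sketch of building such a request with Python's stdlib; the host, port, and collection name are hypothetical, while the parameter names are the SolrCloud Collections API ones he mentions.)

```python
from urllib.parse import urlencode

# Hypothetical host and collection; numShards / maxShardsPerNode are the
# Collections API parameters from the discussion above.
params = {
    "action": "CREATE",
    "name": "products",
    "numShards": 4,
    "replicationFactor": 1,
    "maxShardsPerNode": 4,  # allow all 4 shards on a single Solr instance
}
url = "http://localhost:8983/solr/admin/collections?" + urlencode(params)
print(url)
```

Hitting that URL against a running SolrCloud would create one 4-shard collection on a single node, rather than running 4 separate tomcat instances.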


Re: Sharding configuration

2014-11-01 Thread Ramkumar R. Aiyengar
On 30 Oct 2014 14:49, Shawn Heisey apa...@elyograg.org wrote:

 In order to see a gain in performance from multiple shards per server,
 the server must have a lot of CPUs and the query rate must be fairly
 low.  If the query rate is high, then all the CPUs will be busy just
 handling simultaneous queries, so putting multiple shards per server
 will probably slow things down.  When query rate is low, multiple CPUs
 can handle each shard query simultaneously, speeding up the overall query.

Except that your query latency isn't always CPU bound; there's a
significant IO-bound portion as well. I wouldn't go so far as to say that
with large query volumes you shouldn't use multiple shards -- it finally
comes down to how many shards a machine can handle under peak load, and it
could depend on CPU/IO/GC pressure. We have multiple shards on a machine
under heavy query load, for example. The only real way is to benchmark
this and see.

 Thanks,
 Shawn



Re: How to update SOLR schema from continuous integration environment

2014-11-01 Thread Jack Krupansky
In all honesty, incrementally updating resources of a production server is a 
rather frightening proposition. Parallel testing is always a better way to 
go - bring up any changes in a parallel system for testing, then do an 
atomic swap - redirection of requests from the old server to the new 
server - and then retire the old server only after the new server has had 
enough time to burn in and get past any infant mortality problems.


That's production. Testing and dev? Who needs the hassle; just tear the old 
server down and bring up the new server from scratch with all resources 
updated from the get-go.


Oh, and the starting point would be keeping your full set of config and 
resource files under source control so that you can carefully review changes 
before they are pushed, can compare different revisions, and can easily 
back out a revision with confidence rather than winging it.


That said, a lot of production systems these days are not designed for 
parallel operation and swapping out parallel systems, especially for cloud 
and cluster systems. In these cases the reality is more of a rolling 
update, where one node at a time is taken down, updated, brought up, 
tested, brought back into production, tested some more, and only after 
enough burn in time do you move to the next node.


This rolling update may also force you to sequence or stage your changes so 
that old and new nodes are at least relatively compatible. So, the first 
stage would update all nodes, one at a time, to the intermediate compatible 
change, and only when that rolling update of all nodes is complete would you 
move up to the next stage of the update to replace the intermediate update 
with the final update. And maybe more than one intermediate stage is 
required for more complex updates.
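Mechanically, the staged rolling update described above is a nested loop: finish each stage on every node before starting the next stage. A sketch under stated assumptions - the node names and the apply/health-check callables are placeholders, not any real deployment tool:

```python
import time

def rolling_update(nodes, stages, apply_stage, is_healthy, burn_in_secs=0):
    """Apply each stage to every node, one node at a time, verifying
    health before moving on.  A stage only begins once the previous
    stage has completed on ALL nodes, which keeps old and new nodes
    mutually compatible."""
    for stage in stages:
        for node in nodes:
            apply_stage(node, stage)          # take down, update, bring up
            if not is_healthy(node):
                raise RuntimeError(f"{node} unhealthy after {stage}, aborting roll")
            time.sleep(burn_in_secs)          # burn-in before the next node

# Example with stub callables:
log = []
rolling_update(
    nodes=["solr1", "solr2"],
    stages=["intermediate", "final"],
    apply_stage=lambda n, s: log.append((n, s)),
    is_healthy=lambda n: True,
)
# → [('solr1', 'intermediate'), ('solr2', 'intermediate'),
#    ('solr1', 'final'), ('solr2', 'final')]
print(log)
```

The ordering in the output is the whole point: no node ever reaches "final" while another node is still pre-"intermediate".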


Some changes might involve upgrading Java jars as well, in a way that might 
cause nodes to give incompatible results, in which case you may need to stage 
or sequence your Java changes as well, so that you don't make the final code 
change until you have verified that all nodes have compatible intermediate 
code that is compatible with both old nodes and new nodes.


Of course, it all depends on the nature of the update. For example, adding 
more synonyms may or may not be harmless with respect to whether existing 
index data becomes invalidated and each node needs to be completely 
reindexed, or if query-time synonyms are incompatible with index-time 
synonyms. Ditto for just about any analysis chain changes - they may be 
harmless, they may require full reindexing, they may simply not work for new 
data (i.e., a synonym is added in response to late-breaking news or an 
addition to a taxonomy) until nodes are updated, or maybe some queries 
become slightly or somewhat inaccurate until the update/reindex is complete.


So, you might want to have two stages of test system - one to just do a raw 
functional test of the changes, like whether your new synonyms work as 
expected or not, and then the pre-production stage which would be updated 
using exactly the same process as the production system, such as a rolling 
update or staged rolling update as required. The closer that pre-production 
system is run to the actual production, the greater the odds that you can 
have confidence that the update won't compromise the production system.


The pre-production test system might have, say, 10% of the production data 
and be only 10% the size of the production system.


In short, for smaller clusters having parallel systems with an atomic 
swap/redirection is probably simplest, while for larger clusters an 
incremental rolling update with thorough testing on a pre-production test 
cluster is the way to go.


-- Jack Krupansky

-Original Message- 
From: Faisal Mansoor

Sent: Saturday, November 1, 2014 12:10 AM
To: solr-user@lucene.apache.org
Subject: How to update SOLR schema from continuous integration environment

Hi,

How do people usually update Solr configuration files from a continuous
integration environment like TeamCity or Jenkins?

We have multiple development and testing environments and use WebDeploy- and
AwsDeploy-type tools to remotely deploy code multiple times a day. To
update Solr, I wrote a simple node server which accepts a conf folder over
HTTP, updates the specified core's conf folder, and restarts the Solr service.

Does there exist a standard tool for this use case? I know about the schema
REST API, but I want to update all the files in the conf folder rather
than just updating a single file or adding or removing synonyms piecemeal.

Here is the link for the node server I mentioned if anyone is interested.
https://github.com/faisalmansoor/UpdateSolrConfig


Thanks,
Faisal 
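(The "push a conf folder over HTTP" approach can be sketched with the stdlib alone. Only the packaging side is shown; the endpoint Faisal's linked node server exposes is its own, so the POST target is omitted rather than guessed.)

```python
import io
import tarfile

def pack_conf(files):
    """Bundle an in-memory conf/ folder into a gzipped tar archive,
    suitable for POSTing to a config-deployment endpoint (not shown)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, content in files.items():
            data = content.encode("utf-8")
            info = tarfile.TarInfo(name="conf/" + name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

blob = pack_conf({"schema.xml": "<schema/>", "synonyms.txt": "tv, television"})
print(len(blob) > 0)
```

Shipping the whole folder as one archive, then reloading or restarting the core server-side, sidesteps the file-by-file limitation of the schema REST API that the question raises.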



RE: How to update SOLR schema from continuous integration environment

2014-11-01 Thread Will Martin
http://www.thoughtworks.com/insights/blog/enabling-continuous-delivery-enterprises-testing


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Saturday, November 01, 2014 9:46 AM
To: solr-user@lucene.apache.org
Subject: Re: How to update SOLR schema from continuous integration environment

In all honesty, incrementally updating resources of a production server is a 
rather frightening proposition. Parallel testing is always a better way to go - 
bring up any changes in a parallel system for testing and then do an atomic 
swap - redirection of requests from the old server to the new server and then 
retire the old server only after the new server has had enough time to burn in 
and get past any infant mortality problems.

That's production. Testing and dev? Who needs the hassle; just tear the old 
server down and bring up the new server from scratch with all resources updated 
from the get-go.

Oh, and the starting point would be keeping your full set of config and 
resource files under source control so that you can carefully review changes 
before they are pushed, can compare different revisions, and can easily back 
out a revision with confidence rather than winging it.

That said, a lot of production systems these days are not designed for parallel 
operation and swapping out parallel systems, especially for cloud and cluster 
systems. In these cases the reality is more of a rolling update, where one 
node at a time is taken down, updated, brought up, tested, brought back into 
production, tested some more, and only after enough burn in time do you move to 
the next node.

This rolling update may also force you to sequence or stage your changes so 
that old and new nodes are at least relatively compatible. So, the first stage 
would update all nodes, one at a time, to the intermediate compatible change, 
and only when that rolling update of all nodes is complete would you move up to 
the next stage of the update to replace the intermediate update with the final 
update. And maybe more than one intermediate stage is required for more complex 
updates.

Some changes might involve upgrading Java jars as well, in a way that might 
cause nodes to give incompatible results, in which case you may need to stage or 
sequence your Java changes as well, so that you don't make the final code 
change until you have verified that all nodes have compatible intermediate code 
that is compatible with both old nodes and new nodes.

Of course, it all depends on the nature of the update. For example, adding more 
synonyms may or may not be harmless with respect to whether existing index data 
becomes invalidated and each node needs to be completely reindexed, or if 
query-time synonyms are incompatible with index-time synonyms. Ditto for just 
about any analysis chain changes - they may be harmless, they may require full 
reindexing, they may simply not work for new data (i.e., a synonym is added in 
response to late-breaking news or an addition to a taxonomy) until nodes are 
updated, or maybe some queries become slightly or somewhat inaccurate until the 
update/reindex is complete.

So, you might want to have two stages of test system - one to just do a raw 
functional test of the changes, like whether your new synonyms work as expected 
or not, and then the pre-production stage which would be updated using exactly 
the same process as the production system, such as a rolling update or staged 
rolling update as required. The closer that pre-production system is run to the 
actual production, the greater the odds that you can have confidence that the 
update won't compromise the production system.

The pre-production test system might have, say, 10% of the production data and 
be only 10% the size of the production system.

In short, for smaller clusters having parallel systems with an atomic 
swap/redirection is probably simplest, while for larger clusters an incremental 
rolling update with thorough testing on a pre-production test cluster is the 
way to go.

-- Jack Krupansky

-Original Message-
From: Faisal Mansoor
Sent: Saturday, November 1, 2014 12:10 AM
To: solr-user@lucene.apache.org
Subject: How to update SOLR schema from continuous integration environment

Hi,

How do people usually update Solr configuration files from a continuous 
integration environment like TeamCity or Jenkins?

We have multiple development and testing environments and use WebDeploy- and 
AwsDeploy-type tools to remotely deploy code multiple times a day. To update 
Solr, I wrote a simple node server which accepts a conf folder over HTTP, 
updates the specified core's conf folder, and restarts the Solr service.

Does there exist a standard tool for this use case? I know about the schema 
REST API, but I want to update all the files in the conf folder rather than just 
updating a single file or adding or removing synonyms piecemeal.

Here is the link for the node server I mentioned if anyone is interested.

Re: Ideas for debugging poor SolrCloud scalability

2014-11-01 Thread Ian Rose
Erick,

Just to make sure I am thinking about this right: batching will certainly
make a big difference in performance, but it should be more or less a
constant factor no matter how many Solr nodes you are using, right?  Right
now in my load tests, I'm not actually that concerned about the absolute
performance numbers; instead I'm just trying to figure out why relative
performance (no matter how bad it is since I am not batching) does not go
up with more Solr nodes.  Once I get that part figured out and we are
seeing more writes per sec when we add nodes, then I'll turn on batching in
the client to see what kind of additional performance gain that gets us.

Cheers,
Ian
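(The batching that Erick recommends, and that Ian plans to turn on, is just chunking the document stream before each update request. A stdlib sketch; the "documents" and the send step are stand-ins, since a real client would POST each whole list to /update in one request.)

```python
def batches(docs, size=1000):
    """Yield successive lists of at most `size` docs from any iterable."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# Stand-in for the send step: record batch sizes instead of POSTing.
sent = []
for batch in batches(({"id": str(i)} for i in range(2500)), size=1000):
    sent.append(len(batch))
print(sent)  # → [1000, 1000, 500]
```

This is why batching is roughly a constant factor: it amortizes per-request overhead (HTTP round trip, routing) over 1,000 documents instead of paying it per document, regardless of cluster size.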


On Fri, Oct 31, 2014 at 3:43 PM, Peter Keegan peterlkee...@gmail.com
wrote:

 Yes, I was inadvertently sending them to a replica. When I sent them to the
 leader, the leader reported (1000 adds) and the replica reported only 1 add
 per document. So, it looks like the leader forwards the batched jobs
 individually to the replicas.

 On Fri, Oct 31, 2014 at 3:26 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  Internally, the docs are batched up into smaller buckets (10 as I
  remember) and forwarded to the correct shard leader. I suspect that's
  what you're seeing.
 
  Erick
 
  On Fri, Oct 31, 2014 at 12:20 PM, Peter Keegan peterlkee...@gmail.com
  wrote:
   Regarding batch indexing:
   When I send batches of 1000 docs to a standalone Solr server, the log
   file reports (1000 adds) in LogUpdateProcessor. But when I send them to
   the leader of a replicated index, the leader log file reports much
   smaller numbers, usually (12 adds). Why do the batches appear to be
   broken up?
  
   Peter
  
   On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson
   erickerick...@gmail.com wrote:
  
   NP, just making sure.
  
   I suspect you'll get lots more bang for the buck, and
   results much more closely matching your expectations if
  
   1 you batch up a bunch of docs at once rather than
   sending them one at a time. That's probably the easiest
   thing to try. Sending docs one at a time is something of
   an anti-pattern. I usually start with batches of 1,000.
  
   And just to check: you're not issuing any commits from the
   client, right? Performance will be terrible if you issue commits
   after every doc; that's totally an anti-pattern. Doubly so for
   optimizes. Since you showed us your solrconfig and autocommit
   settings I'm assuming not, but want to be sure.
  
   2 use a leader-aware client. I'm totally unfamiliar with Go,
   so I have no suggestions whatsoever to offer there But you'll
   want to batch in this case too.
  
   On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose ianr...@fullstory.com
  wrote:
 Hi Erick -

 Thanks for the detailed response and apologies for my confusing
 terminology.  I should have said WPS (writes per second) instead of QPS,
 but I didn't want to introduce a weird new acronym since QPS is well
 known.  Clearly a bad decision on my part.  To clarify: I am doing
 *only* writes (document adds).  Whenever I wrote QPS I was referring to
 writes.
   
 It seems clear at this point that I should wrap up the code to do smart
 routing rather than choose Solr nodes randomly, and then see if that
 changes things.  I must admit that although I understand that random node
 selection will impose a performance hit, theoretically it seems to me that
 the system should still scale up as you add more nodes (albeit at a lower
 absolute level of performance than if you used a smart router).
 Nonetheless, I'm just theorycrafting here, so the better thing to do is
 just try it experimentally.  I hope to have that working today - will
 report back on my findings.
   
Cheers,
- Ian
   
 p.s. To clarify why we are rolling our own smart router code: we use Go
 over here rather than Java.  Although if we still get bad performance with
 our custom Go router, I may try a pure Java load client using
 CloudSolrServer to eliminate the possibility of bugs in our
 implementation.
   
   
 On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson
 erickerick...@gmail.com wrote:

  I'm really confused:

  bq: I am not issuing any queries, only writes (document inserts)

  bq: It's clear that once the load test client has ~40 simulated users

  bq: A cluster of 3 shards over 3 Solr nodes *should* support
  a higher QPS than 2 shards over 2 Solr nodes, right

  QPS is usually used to mean Queries Per Second, which is different from
  the statement that I am not issuing any queries. And what do the
  number of users have to do with inserting documents?

  You also state: In many cases, CPU on the solr servers is quite low as
  well

  So let's talk about indexing first. Indexing should scale nearly
  linearly as long as
  1 you are routing your docs to the correct leader, which 

Re: How to update SOLR schema from continuous integration environment

2014-11-01 Thread Walter Underwood
Nice pictures, but that preso does not even begin to answer the question.

With master/slave replication, I do schema migration in two ways, depending on 
whether a field is added or removed.

Adding a field:

1. Update the schema on the slaves. A defined field with no data is not a 
problem.
2. Update the master.
3. Reindex to populate the field and wait for replication.
4. Update the request handlers or clients to use the new field.

Removing a field is the opposite. I haven’t tried lately, but Solr used to have 
problems with a field that was in the index but not in the schema.

1. Update the request handlers and clients to stop using the field.
2. Reindex without any data for the field that will be removed, wait for 
replication.
3. Update the schema on the master and slaves.

I have not tried to automate this for continuous deployment. It isn’t a big 
deal for a single server test environment. It is the prod deployment that is 
tricky.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/


On Nov 1, 2014, at 7:29 AM, Will Martin wmartin...@gmail.com wrote:

 http://www.thoughtworks.com/insights/blog/enabling-continuous-delivery-enterprises-testing
 
 
 -Original Message-
 From: Jack Krupansky [mailto:j...@basetechnology.com] 
 Sent: Saturday, November 01, 2014 9:46 AM
 To: solr-user@lucene.apache.org
 Subject: Re: How to update SOLR schema from continuous integration environment
 
 In all honesty, incrementally updating resources of a production server is a 
 rather frightening proposition. Parallel testing is always a better way to go 
 - bring up any changes in a parallel system for testing and then do an atomic 
 swap - redirection of requests from the old server to the new server and 
 then retire the old server only after the new server has had enough time to 
 burn in and get past any infant mortality problems.
 
 That's production. Testing and dev? Who needs the hassle; just tear the old 
 server down and bring up the new server from scratch with all resources 
 updated from the get-go.
 
 Oh, and the starting point would be keeping your full set of config and 
 resource files under source control so that you can carefully review changes 
 before they are pushed, can compare different revisions, and can easily 
 back out a revision with confidence rather than winging it.
 
 That said, a lot of production systems these days are not designed for 
 parallel operation and swapping out parallel systems, especially for cloud 
 and cluster systems. In these cases the reality is more of a rolling 
 update, where one node at a time is taken down, updated, brought up, tested, 
 brought back into production, tested some more, and only after enough burn in 
 time do you move to the next node.
 
 This rolling update may also force you to sequence or stage your changes so 
 that old and new nodes are at least relatively compatible. So, the first 
 stage would update all nodes, one at a time, to the intermediate compatible 
 change, and only when that rolling update of all nodes is complete would you 
 move up to the next stage of the update to replace the intermediate update 
 with the final update. And maybe more than one intermediate stage is required 
 for more complex updates.
 
 Some changes might involve upgrading Java jars as well, in a way that might 
 cause nodes to give incompatible results, in which case you may need to stage or 
 sequence your Java changes as well, so that you don't make the final code 
 change until you have verified that all nodes have compatible intermediate 
 code that is compatible with both old nodes and new nodes.
 
 Of course, it all depends on the nature of the update. For example, adding 
 more synonyms may or may not be harmless with respect to whether existing 
 index data becomes invalidated and each node needs to be completely 
 reindexed, or if query-time synonyms are incompatible with index-time 
 synonyms. Ditto for just about any analysis chain changes - they may be 
 harmless, they may require full reindexing, they may simply not work for new 
 data (i.e., a synonym is added in response to late-breaking news or an 
 addition to a taxonomy) until nodes are updated, or maybe some queries become 
 slightly or somewhat inaccurate until the update/reindex is complete.
 
 So, you might want to have two stages of test system - one to just do a raw 
 functional test of the changes, like whether your new synonyms work as 
 expected or not, and then the pre-production stage which would be updated 
 using exactly the same process as the production system, such as a rolling 
 update or staged rolling update as required. The closer that pre-production 
 system is run to the actual production, the greater the odds that you can 
 have confidence that the update won't compromise the production system.
 
 The pre-production test system might have, say, 10% of the production data 
 and be only 10% the size of the production system.
 
 In short, 

Missing log entries with log4j log rotation

2014-11-01 Thread Shawn Heisey
There appear to be large blocks of time missing in my solr logfiles
created with slf4j-log4j and rotated using the log4j config:

End of solr.log.1: INFO  - 2014-10-31 12:52:25.073;
Start of solr.log: INFO  - 2014-11-01 02:27:27.404;

End of solr.log.2: INFO  - 2014-10-29 06:30:32.661;
Start of solr.log.1: INFO  - 2014-10-30 07:01:34.241;

Queries happen at a fairly constant low level and updates happen once a
minute, so I know for sure that there is activity during the missing
blocks of time.  I need to investigate a problem that occurred during
the time that is not logged, which means I have nothing to investigate.

This is the log4j configuration that I'm using:

http://apaste.info/9vC

These are the logging jars that I have in jetty's lib/ext:

-rw-r--r-- 1 ncindex ncindex  16515 Apr 11  2014 jcl-over-slf4j-1.7.6.jar
-rw-r--r-- 1 ncindex ncindex   4959 Apr 11  2014 jul-to-slf4j-1.7.6.jar
-rw-r--r-- 1 ncindex ncindex 489883 Apr 11  2014 log4j-1.2.17.jar
-rw-r--r-- 1 ncindex ncindex  28688 Apr 11  2014 slf4j-api-1.7.6.jar
-rw-r--r-- 1 ncindex ncindex   8869 Apr 11  2014 slf4j-log4j12-1.7.6.jar

Is this a bug, or have I done something wrong in my config?  Should I be
putting this on the log4j mailing list instead of here?  My best guess
about how this is happening is that an entire logfile is getting deleted
during rotation.

Thanks,
Shawn
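(For comparison, a minimal log4j 1.2 sketch using the size-based RollingFileAppender, which keeps a fixed window of numbered backups rather than renaming by date; the file path, size, and pattern below are placeholders, not Shawn's actual config.)

```properties
log4j.rootLogger=INFO, file

log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=logs/solr.log
log4j.appender.file.MaxFileSize=50MB
log4j.appender.file.MaxBackupIndex=9
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%-5p - %d{yyyy-MM-dd HH:mm:ss.SSS}; %C; %m%n
```

With this appender, rotation is a chain of renames (solr.log.8 → solr.log.9, etc.) followed by reopening solr.log, which makes a silently deleted file easier to spot than date-based rolling.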



Re: Missing log entries with log4j log rotation

2014-11-01 Thread Shawn Heisey
On 11/1/2014 11:45 AM, Shawn Heisey wrote:
 Is this a bug, or have I done something wrong in my config?  Should I be
 putting this on the log4j mailing list instead of here?  My best guess
 about how this is happening is that an entire logfile is getting deleted
 during rotation.

I did find this blog post describing a similar problem with a different
Appender:

http://vivekagarwal.wordpress.com/2008/02/09/missing-log4j-log-files-with-dailyrollingfileappender-when-they-should-roll-over/

I'm not running on Windows, I'm on Linux, which normally does not have
problems with renaming files even when they are open.

My logfiles where I redirect stdout and stderr from Jetty don't show
anything related, and I don't see anything like the error mentioned in
any of the surviving logfiles from log4j.

Thanks,
Shawn



Re: Ideas for debugging poor SolrCloud scalability

2014-11-01 Thread Erick Erickson
bq: but it should be more or less a constant factor no matter how many
Solr nodes you are using, right?

Not really. You've stated that you're not driving Solr very hard in
your tests. Therefore you're waiting on I/O. Therefore your tests just
aren't going to scale linearly with the number of shards. This is a
simplification, but

Your network utilization is pretty much irrelevant. I send a packet
somewhere; "somewhere" does some stuff and sends me back an
acknowledgement. While I'm waiting, the network is getting no traffic,
so... If the network traffic were in the 90% range that would be
different, so it's a good thing to monitor.

Really, use a leader aware client and rack enough clients together
that you're driving Solr hard. Then double the number of shards. Then
rack enough _more_ clients to drive Solr at the same level. In this
case I'll go out on a limb and predict near 2x throughput increases.

One additional note, though. When you add _replicas_ to shards expect
to see a drop in throughput that may be quite significant, 20-40%
anecdotally...

Best,
Erick
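A client-side batching sketch in Python (stdlib only; hypothetical, not
anyone's actual client): documents are grouped into fixed-size batches so
each /update request carries many documents instead of one. The actual HTTP
send is left as a comment since it needs a live Solr; the URL and commit
policy there are assumptions.

```python
import json

def make_batches(docs, batch_size):
    """Group documents into fixed-size batches, one /update request each."""
    return [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]

def batch_payload(batch):
    """Solr's JSON update format accepts a plain array of documents."""
    return json.dumps(batch).encode("utf-8")

# Sending one batch (sketch only; requires a running Solr):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8983/solr/collection1/update?commitWithin=10000",
#     data=batch_payload(batch),
#     headers={"Content-Type": "application/json"})
# urllib.request.urlopen(req)
```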

On Sat, Nov 1, 2014 at 9:23 AM, Shawn Heisey apa...@elyograg.org wrote:
 On 11/1/2014 9:52 AM, Ian Rose wrote:
 Just to make sure I am thinking about this right: batching will certainly
 make a big difference in performance, but it should be more or less a
 constant factor no matter how many Solr nodes you are using, right?  Right
 now in my load tests, I'm not actually that concerned about the absolute
 performance numbers; instead I'm just trying to figure out why relative
 performance (no matter how bad it is since I am not batching) does not go
 up with more Solr nodes.  Once I get that part figured out and we are
 seeing more writes per sec when we add nodes, then I'll turn on batching in
 the client to see what kind of additional performance gain that gets us.

 The basic problem I see with your methodology is that you are sending an
 update request and waiting for it to complete before sending another.
 No matter how big the batches are, this is an inefficient use of resources.

 If you send many such requests at the same time, then they will be
 handled in parallel.  Lucene (and by extension, Solr) has the thread
 synchronization required to keep multiple simultaneous update requests
 from stomping on each other and corrupting the index.

 If you have enough CPU cores, such handling will *truly* be in parallel,
 otherwise the operating system will just take turns giving each thread
 CPU time.  This results in a pretty good facsimile of parallel
 operation, but because it splits the available CPU resources, isn't as
 fast as true parallel operation.

 Thanks,
 Shawn
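Shawn's point can be sketched with a toy simulation (Python, stdlib only):
a fake update that just sleeps, standing in for an I/O-bound indexing
request. The 50 ms latency and 8-way concurrency are arbitrary assumptions
for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_update(_doc):
    # Stand-in for an update request that mostly waits on I/O.
    time.sleep(0.05)

docs = list(range(8))

# One request at a time: total time is roughly 8 x latency.
start = time.monotonic()
for d in docs:
    fake_update(d)
serial_time = time.monotonic() - start

# Eight requests in flight at once: total time is roughly 1 x latency,
# because each thread spends its time waiting, not computing.
start = time.monotonic()
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(fake_update, docs))
parallel_time = time.monotonic() - start
```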



RE: How to update SOLR schema from continuous integration environment

2014-11-01 Thread Will Martin
Well yes. But since there haven't been any devops approaches yet, we really
aren't talking about Continuous Delivery. Continually delivering builds into
production is old hat, and Jack nailed the canonical manners in which it has
been done. It really depends on whether an org is investing in the full
Agile lifecycle. A piece at a time is common.

One possible devop approach:

Once you get near full test automation:
: Jenkins builds the target
: chef does due diligence on dependencies
: chef pulls the build over
: chef configures the build once it is installed
: chef takes the machine out of the load-balancer's rotation
: chef puts the machine back in once it is launched and sanity tested (by
chef)

or puppet, or any others I'm not familiar with.


If you substitute Jack's plan, you get pretty much the same thing; except
that by using devops tools you introduce a little thing called idempotency.
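Idempotency here just means a step can run twice without doing harm. A
minimal Python sketch of that property (hypothetical paths; a chef/puppet
resource does the same thing declaratively): a config push that is a no-op
when the deployed copy already matches.

```python
import filecmp
import os
import shutil

def deploy_config(src, dst):
    """Copy src to dst only when dst is absent or differs from src.

    Running this twice in a row does nothing the second time -- that is
    the idempotency that devops tools buy you over ad-hoc copy scripts.
    """
    if os.path.exists(dst) and filecmp.cmp(src, dst, shallow=False):
        return "unchanged"
    shutil.copyfile(src, dst)
    return "deployed"
```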



-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Saturday, November 01, 2014 12:25 PM
To: solr-user@lucene.apache.org
Subject: Re: How to update SOLR schema from continuous integration
environment

Nice pictures, but that preso does not even begin to answer the question.

With master/slave replication, I do schema migration in two ways, depending
on whether a field is added or removed.

Adding a field:

1. Update the schema on the slaves. A defined field with no data is not a
problem.
2. Update the master.
3. Reindex to populate the field and wait for replication.
4. Update the request handlers or clients to use the new field.
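Step 1 boils down to a plain declaration in schema.xml, which is harmless on
a slave before any data exists for it. The field name and type here are
hypothetical:

```xml
<!-- Hypothetical new field: declared on slaves and master first,
     populated by a reindex on the master later. A defined field with
     no indexed data is safe. -->
<field name="popularity" type="int" indexed="true" stored="true"/>
```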

Removing a field is the opposite. I haven't tried lately, but Solr used to
have problems with a field that was in the index but not in the schema.

1. Update the request handlers and clients to stop using the field.
2. Reindex without any data for the field that will be removed, wait for
replication.
3. Update the schema on the master and slaves.

I have not tried to automate this for continuous deployment. It isn't a big
deal for a single server test environment. It is the prod deployment that is
tricky.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/


On Nov 1, 2014, at 7:29 AM, Will Martin wmartin...@gmail.com wrote:


http://www.thoughtworks.com/insights/blog/enabling-continuous-delivery-enterprises-testing
 
 
 -Original Message-
 From: Jack Krupansky [mailto:j...@basetechnology.com] 
 Sent: Saturday, November 01, 2014 9:46 AM
 To: solr-user@lucene.apache.org
 Subject: Re: How to update SOLR schema from continuous integration
environment
 
 In all honesty, incrementally updating resources of a production server is
a rather frightening proposition. Parallel testing is always a better way to
go - bring up any changes in a parallel system for testing and then do an
atomic swap - redirection of requests from the old server to the new
server and then retire the old server only after the new server has had
enough time to burn in and get past any infant mortality problems.
 
 That's production. Testing and dev? Who needs the hassle; just tear the
old server down and bring up the new server from scratch with all resources
updated from the get-go.
 
 Oh, and the starting point would be keeping your full set of config and
resource files under source control so that you can carefully review changes
before they are pushed, can compare different revisions, and can easily
back out a revision with confidence rather than winging it.
 
 That said, a lot of production systems these days are not designed for
parallel operation and swapping out parallel systems, especially for cloud
and cluster systems. In these cases the reality is more of a rolling
update, where one node at a time is taken down, updated, brought up,
tested, brought back into production, tested some more, and only after
enough burn in time do you move to the next node.
 
 This rolling update may also force you to sequence or stage your changes
so that old and new nodes are at least relatively compatible. So, the first
stage would update all nodes, one at a time, to the intermediate compatible
change, and only when that rolling update of all nodes is complete would you
move up to the next stage of the update to replace the intermediate update
with the final update. And maybe more than one intermediate stage is
required for more complex updates.
 
 Some changes might involve upgrading Java jars as well, in a way that
might cause nodes to give incompatible results, in which case you may need to
stage or sequence your Java changes as well, so that you don't make the
final code change until you have verified that all nodes have compatible
intermediate code that is compatible with both old nodes and new nodes.
 
 Of course, it all depends on the nature of the update. For example, adding
more synonyms may or may not be harmless with respect to whether existing
index data becomes invalidated and each node needs to be completely
reindexed, or if query-time synonyms are 

Re: How to update SOLR schema from continuous integration environment

2014-11-01 Thread Walter Underwood
You do that with schema changes and I’ll watch your site crash.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/


On Nov 1, 2014, at 8:31 PM, Will Martin wmartin...@gmail.com wrote:

 Well yes. But since there haven't been any devops approaches yet, we really
 aren't talking about Continuous Delivery. Continually delivering builds into
 production is old hat, and Jack nailed the canonical manners in which it has
 been done. It really depends on whether an org is investing in the full
 Agile lifecycle. A piece at a time is common.
 
 One possible devop approach:
 
 Once you get near full test automation:
 : Jenkins builds the target
 : chef does due diligence on dependencies
 : chef pulls the build over
 : chef configures the build once it is installed
 : chef takes the machine out of the load-balancer's rotation
 : chef puts the machine back in once it is launched and sanity tested (by
 chef)
 
 or puppet, or any others I'm not familiar with.
 
 
 If you substitute Jack's plan, you get pretty much the same thing; except
 that by using devops tools you introduce a little thing called idempotency.
 
 
 
 -Original Message-
 From: Walter Underwood [mailto:wun...@wunderwood.org] 
 Sent: Saturday, November 01, 2014 12:25 PM
 To: solr-user@lucene.apache.org
 Subject: Re: How to update SOLR schema from continuous integration
 environment
 
 Nice pictures, but that preso does not even begin to answer the question.
 
 With master/slave replication, I do schema migration in two ways, depending
 on whether a field is added or removed.
 
 Adding a field:
 
 1. Update the schema on the slaves. A defined field with no data is not a
 problem.
 2. Update the master.
 3. Reindex to populate the field and wait for replication.
 4. Update the request handlers or clients to use the new field.
 
 Removing a field is the opposite. I haven't tried lately, but Solr used to
 have problems with a field that was in the index but not in the schema.
 
 1. Update the request handlers and clients to stop using the field.
 2. Reindex without any data for the field that will be removed, wait for
 replication.
 3. Update the schema on the master and slaves.
 
 I have not tried to automate this for continuous deployment. It isn't a big
 deal for a single server test environment. It is the prod deployment that is
 tricky.
 
 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/
 
 
 On Nov 1, 2014, at 7:29 AM, Will Martin wmartin...@gmail.com wrote:
 
 
 http://www.thoughtworks.com/insights/blog/enabling-continuous-delivery-enterprises-testing
 
 
 -Original Message-
 From: Jack Krupansky [mailto:j...@basetechnology.com] 
 Sent: Saturday, November 01, 2014 9:46 AM
 To: solr-user@lucene.apache.org
 Subject: Re: How to update SOLR schema from continuous integration
 environment
 
 In all honesty, incrementally updating resources of a production server is
 a rather frightening proposition. Parallel testing is always a better way to
 go - bring up any changes in a parallel system for testing and then do an
 atomic swap - redirection of requests from the old server to the new
 server and then retire the old server only after the new server has had
 enough time to burn in and get past any infant mortality problems.
 
 That's production. Testing and dev? Who needs the hassle; just tear the
 old server down and bring up the new server from scratch with all resources
 updated from the get-go.
 
 Oh, and the starting point would be keeping your full set of config and
 resource files under source control so that you can carefully review changes
 before they are pushed, can compare different revisions, and can easily
 back out a revision with confidence rather than winging it.
 
 That said, a lot of production systems these days are not designed for
 parallel operation and swapping out parallel systems, especially for cloud
 and cluster systems. In these cases the reality is more of a rolling
 update, where one node at a time is taken down, updated, brought up,
 tested, brought back into production, tested some more, and only after
 enough burn in time do you move to the next node.
 
 This rolling update may also force you to sequence or stage your changes
 so that old and new nodes are at least relatively compatible. So, the first
 stage would update all nodes, one at a time, to the intermediate compatible
 change, and only when that rolling update of all nodes is complete would you
 move up to the next stage of the update to replace the intermediate update
 with the final update. And maybe more than one intermediate stage is
 required for more complex updates.
 
 Some changes might involve upgrading Java jars as well, in a way that
 might cause nodes to give incompatible results, in which case you may need to
 stage or sequence your Java changes as well, so that you don't make the
 final code change until you have verified that all nodes have compatible
 intermediate code that is