Everything below is my humble opinion and input - I DON'T MEAN TO OFFEND ANYONE

Radim Kolar wrote:

what you should do:
* stuff i do
I like people with confidence, but it is a balance :-) Every decent developer in the world believes that he is the best in the world. Chances are that he is not. Be humble.

* ant -> maven
Maven is a step forward, but it is still crap. I believe the original creator of Ant has apologized in public for basing it on XML. Maven is also based on XML, besides being way too complex in its infrastructure - goals, phases, environments, strange plugins with executions mapping to phases etc. XML is good for static data/config stuff, but a build process is not static data/config - it is a process. Go Gradle!
I don't have either; if I decide to go with SOLR instead of EC, I will fork it. It will save me a lot of time.
We are basically maintaining our own version of Solr at my organization, because it is so hard to get contributions in - SOLR-3173, SOLR-3178, SOLR-3382, SOLR-3428, SOLR-3383 etc - and lately SOLR-4114 and SOLR-4120. It is really hard keeping up with the latest versions of Apache Solr, because it is a huge job to merge new stuff into our Solr. We are considering taking the consequence and forking our own public (to let others benefit and contribute) "variant" of Solr.

I understand that no committers are really assigned to focus on committing other people's stuff, but it is a shame. I would really, really not like Solr to end up in a situation where many organizations run their own little fork. Instead we should all collaborate on improving "the one and only Solr"! Maybe we should try to find a sponsor to pay for a full-time Solr committer whose main focus is verifying and committing contributions from the "outside".
* svn -> git (way better tools)
I think we had this discussion already and it seems that lots of folks are positive, yet there are still some infrastructure barriers along the way.

Don't blame infrastructure, other Apache projects are using it.
Git is the way forward. It will also make committing outside contributions easier (especially if the commit is to be performed after the branch has developed a lot since the pull request was made). Merging among branches will also become easier. Why? Basically, since a pull request (a request to merge) is an operation handled/known by Git, it allows Git to maintain more information about where merged code fits into the code-base with respect to revisions etc. That information can be used to ease future or late merges.

* split code into small manageable maven modules
see above - we have a fully functional maven build but ant is our primary build.
I don't see a pom.xml in your source tree.
Have a look at the templates in dev-tools/maven. Do an "ant -Dversion=$VERSION get-maven-poms" to get your Maven stuff generated in the folder "maven-build". The Maven build does not work 100% out of the box (at least on the lucene_solr_4_0 branch), but it is very close.

* use github to track patches
wait, why is github good for patches?
you can track patch revisions and apply/browse/comment on them easily. Also it's way easier to upload a patch and do a pull request than to attach it to a ticket in JIRA.
See my comments under "git" above.

Besides that, I have some additional input, now that we are talking:

Basically, the code is a mess. Not blaming anyone in particular. It's probably, to some extent, the nature of open source. If someone honestly believes that the code-base is beautiful, they should find something else to do. Some of the major problems are:
* Bad "separation of concerns"
** Very long classes/methods dealing with a lot of different concerns
*** Example: DistributedUpdateProcessor - dealing with cloud/standalone modes, phases, optimistic locking, calculating values for document fields (for add/inc/set requests), routing etc. This should all be separated into different classes, each dealing with a different concern (see the sketch after this list).
** Code dealing with a particular concern is spread all over the code - it makes it very hard to "change strategy" for this concern
*** Example: An obvious "separate concern" is routing (the decision about which shard under a collection a particular document belongs to (should be indexed and found in), and where a particular request needs to go - leaders, replicas, all shards under the collection?). This concern is dealt with in a lot of places - DistributedUpdateProcessor, CloudSolrServer, RealTimeGetComponent, SearchHandler etc.
** In my patch for SOLR-3178 I have made a "separate concern" called UpdateSemantics. It deals with decisions on stuff related to how updates should be performed, depending on which update-semantics you have chosen (classic, consistency or classic-consistency-hybrid). This class UpdateSemantics is used from the actual updating component DirectUpdateHandler2 - instead of having a lot of if-else-if-else statements in DirectUpdateHandler2 itself.
* Copied code
** A lot of code is clearly just copied from another place in the code. It does not only make the code-base very big (we can live with that), but again it really makes it hard to "change strategy" on the stuff the code deals with
*** Example (not taken from the Solr code): If you have code, even just a single line, implementing a calculation of some value - e.g. "int myImportantValue = param1 * param2 + (param3 % param4);" - DON'T copy that code. It will make it impossible, in the future, when you realize that the calculation should actually be (because you discover an error or because you change strategy) e.g. "int myImportantValue = param1 * param2 + (param4 % param3) / param5;". The same of course goes for a sequence of code-lines handling a specific task. A sketch of the "share instead of copy" alternative follows this list.
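
To illustrate the "separation of concerns" point above, here is a minimal sketch of moving a concern into its own class hierarchy so the component just delegates. All names here (UpdateSemantics, ClassicSemantics, ConsistencySemantics, UpdateHandler, UpdateCommand) are hypothetical illustrations, not the actual SOLR-3178 code:

    // The concern "how should updates behave" gets its own abstraction.
    interface UpdateSemantics {
        boolean requiresVersionCheck(UpdateCommand cmd);
    }

    // One class per variant of the concern - no branching inside the handler.
    class ClassicSemantics implements UpdateSemantics {
        public boolean requiresVersionCheck(UpdateCommand cmd) {
            return false; // classic semantics: last write wins, no version check
        }
    }

    class ConsistencySemantics implements UpdateSemantics {
        public boolean requiresVersionCheck(UpdateCommand cmd) {
            return true; // consistency semantics: optimistic locking on every update
        }
    }

    // The update component delegates to the abstraction, instead of
    // branching if-else-if-else over the chosen semantics in its own logic.
    class UpdateHandler {
        private final UpdateSemantics semantics;

        UpdateHandler(UpdateSemantics semantics) {
            this.semantics = semantics;
        }

        void processAdd(UpdateCommand cmd) {
            if (semantics.requiresVersionCheck(cmd)) {
                // ...verify the document version before applying the update...
            }
            // ...apply the update...
        }
    }

    class UpdateCommand { /* carries the document and request parameters */ }

"Changing strategy" for the concern now means adding or modifying one small class, not hunting down scattered if-statements.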
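And to illustrate the "copied code" point, a minimal sketch of the "share instead of copy" alternative, using the hypothetical calculation from the example above:

    class Calculations {
        // The one and only place the calculation lives. When the strategy
        // changes, it changes here - and every caller picks up the fix.
        static int importantValue(int param1, int param2, int param3, int param4) {
            return param1 * param2 + (param3 % param4);
        }
    }

    class IndexingPath {
        int compute(int a, int b, int c, int d) {
            return Calculations.importantValue(a, b, c, d); // shared, not copied
        }
    }

    class SearchPath {
        int compute(int a, int b, int c, int d) {
            return Calculations.importantValue(a, b, c, d); // shared, not copied
        }
    }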

As I said, we cannot blame anyone for the code being a mess, but we can look to the future and encourage people to do stuff that step-by-step will/can reduce the problems mentioned above:
* Refactor whenever you have the chance
** To clean up mistakes made in the past related to e.g. "separation of concerns" or "copied code" or ...
** To make sure you don't make more of those mistakes - a few rules of thumb:
*** Every developer should have a bell ringing in their head whenever they are about to do ctrl-c plus ctrl-v. This bell should remind you to think about whether or not you really want to do the copy, instead of making a method (or something) containing the code you are about to copy, and using this method both from the place where the original code was and from the place where you were about to copy it to. If you want a little bit of difference in the code among the two (or more) places where it is used, still make sure that all the common stuff is shared - e.g. implement the difference in the shared method by letting it take a parameter deciding on the variant of the semantics. There are also more advanced ways to share code in object-oriented languages.
*** Whenever you are about to make a change to the code of a certain size, start by considering what kinds of concerns it is dealing with, and separate dealing with those concerns into different classes (class hierarchies). If you are going to deal with a concern already dealt with in other places in the code, take the opportunity to refactor and isolate the existing code in a separate class dealing with the concern, add your additions to the concern and use the class from both places. Keep in mind that Java is an object-oriented language and that REAL object-orientation is better than just advanced procedural coding - know and use the advantages you get from an object-oriented language.
* Cover your changes by tests - even consider doing TDD, by implementing your tests before you implement the actual change in the "real" code (see the test sketch after this list)
** Being able to "trust your test-suite" is the key to being able to do the following with a fair amount of confidence that you do not ruin existing stuff:
*** Taking in and committing other people's contributions
*** Daring to make major refactorings (which are greatly needed in the Solr code-base)
** Frankly, if you break existing stuff with your commit, it is not your fault, unless you cheated and modified/disabled existing tests. It is the fault of the original implementor of the stuff you ruined - he did not create a good enough test-suite on top of his code to prevent others from accidentally ruining it in the future.
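
As promised above, a minimal TDD-style test sketch (JUnit 4), on top of the hypothetical Calculations.importantValue() from the earlier sketch. The tests pin down the intended behaviour before the implementation changes, so a later "change of strategy" that breaks callers is caught immediately:

    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    public class CalculationsTest {

        @Test
        public void importantValueFollowsTheAgreedFormula() {
            // 3 * 4 + (10 % 7) = 12 + 3 = 15
            assertEquals(15, Calculations.importantValue(3, 4, 10, 7));
        }

        @Test
        public void importantValueWithZeroRemainder() {
            // 2 * 5 + (8 % 4) = 10 + 0 = 10
            assertEquals(10, Calculations.importantValue(2, 5, 8, 4));
        }
    }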

Do real performance testing. Solr is a project where an important property is "big data". You cannot claim anything about performance without having tested it. In my world performance covers:
* Endurance - the system can run under load for a very long time without performing worse over time (except due to stuff like "amounts of data increasing") - i.e. no degradation because of memory leaks, thread leaks, synchronization problems, congestion etc.
* Response-time - how do response-times develop as the data store fills up - indexing times and search times (response-times always become worse at some point, but this "point" needs to be "far out")
* Capacity - how high a load can the system handle per time-unit - probably a function of several things like RAM/CPU/OS on the involved machines, number of involved machines, number of shards the collections are split up into, etc.
* Scalability - e.g. "will your capacity double if you double the amount of hardware", "does it hold both when going from 1 to 2 units of hardware (e.g. machines) and when going from 1000 to 2000 units of hardware" etc.
In my organization we have created completely automated performance tests. We have many "big machines" with Xen-server installed. The tests take a description of the "environment/setup" you want to run the test against - environment/setup being:
** The number of machines running Solrs nodes, ZKs etc.
** The amount RAM/CPU of each machine
** The load to put on the cluster - indexing and search
** etc.
The test automatically sets up virtual machines on the Xen-servers according to the environment/setup description, installs the versions of Solr/ZK under test, starts everything up, creates collections with shard-distribution according to the configs, starts a test-driver generating the indexing/search load, and measures numerous metrics (CPU load, memory usage, IO throughput, disk-space usage, indexing response-time, search response-time, indexing capacity etc.) during the test run. We sometimes run the test for months. You will be amazed to see how all of the metrics develop as the test advances and the collections are filled with data. Because of the complete automation and being based on virtualized servers, we can (and have) run the test on numerous setups with combinations of:
** 4/6/8 GB RAM on each machine
** 1/4/8 shards per collection per Solr node
** Indexing across 1/2/many collections
** Modified mergers inside Solr/Lucene (merging of segments turns out to become a problem when you have huge amounts of data in your collections/shards)
The Solr project should do stuff like that on its own, to be able to say (with confidence) that it performs with respect to endurance, response-time, capacity and scalability - and in what way it performs (limitations, recommended hardware as a function of load (indexing and search), amounts of data, etc.).
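
Just to give an idea of the shape of such a test-driver, a bare-bones sketch, where indexOne() and searchOne() are hypothetical placeholders for whatever client calls your setup uses (e.g. SolrJ). The point is only the structure: drive load for a very long time, record every latency, and report percentiles periodically, so you can watch the response-times develop as the collections fill up:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class LoadDriver {

        static final int REPORT_EVERY = 10000; // report every 10000 requests

        public static void main(String[] args) {
            List<Long> searchLatenciesMs = new ArrayList<Long>();
            for (long i = 1; ; i++) {  // endurance: runs until you stop it
                indexOne(i);
                long start = System.nanoTime();
                searchOne();
                searchLatenciesMs.add((System.nanoTime() - start) / 1000000);
                if (i % REPORT_EVERY == 0) {
                    report(searchLatenciesMs);
                    searchLatenciesMs.clear();
                }
            }
        }

        static void report(List<Long> latencies) {
            List<Long> sorted = new ArrayList<Long>(latencies);
            Collections.sort(sorted);
            System.out.printf("median=%dms p99=%dms max=%dms%n",
                    sorted.get(sorted.size() / 2),
                    sorted.get((int) (sorted.size() * 0.99)),
                    sorted.get(sorted.size() - 1));
        }

        // Hypothetical placeholders - wire these up to your client of choice.
        static void indexOne(long i) { /* add one document to the collection */ }
        static void searchOne()      { /* run one representative query */ }
    }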

About the randomized tests promoted by Dawid Weiss:
I see your point about "bringing up bugs nobody thought to cover manually", but it also has cons - e.g. violating the principle that tests should be (easily) repeatable (you will/can end up with tests that sometimes fail and sometimes succeed, and you have to dig out the random values of the failing tests in order to be able to repeat/reconstruct the failure).
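
That said, the repeatability con can be mitigated by always choosing the random seed up front, printing it on every run, and allowing it to be forced on a re-run. A minimal sketch - the property name "test.seed" here is just an illustration, the Lucene/Solr test framework has its own mechanisms along these lines:

    import java.util.Random;

    public class RandomizedTestBase {

        protected final Random random;

        public RandomizedTestBase() {
            // Use the forced seed if given, otherwise pick a fresh one.
            long seed = Long.getLong("test.seed", System.nanoTime());
            System.out.println("Running with random seed " + seed
                    + " - rerun with -Dtest.seed=" + seed + " to reproduce");
            this.random = new Random(seed);
        }
    }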

One last little remark - use enums instead of just static string constants. Enums are not just dumb, simple replacements for static constants. You can have logic associated with each option in the enum, etc. See the UpdateSemantics class of my SOLR-3178 patch.
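
For illustration, a minimal sketch of an enum carrying logic per option - hypothetical names, loosely inspired by the UpdateSemantics idea, not the actual SOLR-3178 code:

    enum UpdateSemanticsMode {
        CLASSIC {
            @Override boolean requiresVersionCheck() { return false; }
        },
        CONSISTENCY {
            @Override boolean requiresVersionCheck() { return true; }
        },
        CLASSIC_CONSISTENCY_HYBRID {
            @Override boolean requiresVersionCheck() { return true; }
        };

        // Each constant supplies its own behaviour - callers never need an
        // if-else-if-else chain over string constants.
        abstract boolean requiresVersionCheck();

        static UpdateSemanticsMode fromConfig(String value) {
            return valueOf(value.toUpperCase().replace('-', '_'));
        }
    }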

Regards, Per Steffensen


