Everything below is my humble opinion and input - DON'T MEAN TO OFFEND ANYONE
Radim Kolar wrote:
what you should do:
* stuff i do
Like people with confidence, but it is a balance :-) Every decent
developer in the world believes that he is the best in the world. Chances
are that he is not. Be humble.
* ant -> maven
Maven is a step forward, but it is still crap. I believe the original
creator of Ant has apologized in public for basing it on XML. Maven is
also based on XML, besides being way too complex in infrastructure -
goals, phases, environments, strange plugins with executions mapping to
phases etc. XML is good for static data/config stuff, but a build process
is not static data/config - it is a process. Go Gradle!
I don't have either; if I decide to go with SOLR instead of EC, I will
fork it. It will save me a lot of time.
We are basically handling our own version of Solr at my organization,
because it is so hard to get contributions in - SOLR-3173, SOLR-3178,
SOLR-3382, SOLR-3428, SOLR-3383 etc - and lately SOLR-4114 and
SOLR-4120. It is really hard keeping up with the latest versions of
Apache Solr, because it is a huge job to merge new stuff into our Solr.
We are considering taking the consequence and forking our own public (to
let others benefit and contribute) "variant" of Solr.
I understand that no committers are really assigned to focus on
committing other people's stuff, but it is a shame. I would really,
really not like Solr to end up in a situation, where many organizations
run their own little fork. Instead we should all collaborate on
improving "the one and only Solr"! Maybe we should try to find a sponsor
to pay for a full-time Solr committer with the main focus on verifying
and committing contributions from the "outside".
* svn -> git (way better tools)
I think we had this discussion already and it seems that lots of
folks are positive, yet there is still some barrier infrastructure-wise
along the lines.
Don't blame infrastructure - other Apache projects are using it.
Git is the way forward. It will also make committing outside
contributions easier (especially if the commit is to be performed after
the branch has developed a lot since the pull request was made). Merging
among branches will also become easier. Why? Basically, since a pull
request (a request to merge) is an operation handled/known by Git, it
allows Git to maintain more information about where merged code fits
into the code-base considering revisions etc. That information can be
used to ease future or late merges.
* split code into small manageable maven modules
see above - we have a fully functional maven build but ant is our
primary build.
I don't see pom.xml in your source tree.
Have a look at the templates in dev-tools/maven. Do an "ant
-Dversion=$VERSION get-maven-poms" to get your maven stuff generated in
the folder "maven-build". The Maven build does not work 100% out of the
box (at least on the lucene_solr_4_0 branch), but it is very close.
* use github to track patches
wait why is github good for patches?
you can track patch revisions and apply/browse/comment on them easily.
Also it is way easier to upload a patch and do a pull request than to
attach it to a ticket in JIRA.
See comments under "git" above
Besides that I have some additional input, now that we are talking.
Basically the code is a mess. Not blaming anyone in particular. It is
probably to some extent the nature of open source. If someone honestly
believes that the code-base is beautiful, they should find something
else to do. Some of the major problems are
* Bad "separation of concerns"
** Very long classes/methods dealing with a lot of different concerns
*** Example: DistributedUpdateProcessor - dealing with
cloud/standalone-, phases-, optimistic-locking, calculating values for
document-fields (for add/inc/set requests), routing etc. This should all
be separated into different classes each dealing with the a different
concern
** Code dealing with a particular concern is spread all over the code -
it makes it very hard to "change strategy" for this concern
*** Example: An obvious "separate concern" is routing (the decision
about which shard under a collection a particular document belongs to
(should be indexed and found in) and where a particular request needs to
go - leaders, replicas, all shards under the collection?). This concern
is dealt with in a lot of places - DistributedUpdateProcessor,
CloudSolrServer, RealTimeGetComponent, SearchHandler etc. (a sketch of
how such a concern could be isolated follows after this list)
** In my patch for SOLR-3178 I have made a "separate concern" called
UpdateSemantics. It deals with decisions on stuff related to how updates
should be performed, depending on which update-semantics you have chosen
(classic, consistency or classic-consistency-hybrid). This class
UpdateSemantics is used from the actual updating component
DirectUpdateHandler2 - instead of having a lot of if-else-if-else
statements in DirectUpdateHandler2 itself
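To illustrate the kind of isolation I am talking about, here is a
minimal sketch of the routing concern hidden behind one interface (all
names are made up for illustration - this is not the actual Solr API):

    import java.util.ArrayList;
    import java.util.List;

    // Made-up sketch: ONE place owns the routing decision. Components
    // like update processors and search handlers would depend on this
    // interface instead of each containing a copy of the routing logic.
    interface RoutingStrategy {
      // which shard under the collection a particular document belongs to
      String shardForDocument(String collection, String docId);
      // which shards a particular request needs to go to
      List<String> shardsForRequest(String collection);
    }

    class HashRoutingStrategy implements RoutingStrategy {
      private final int numShards;
      HashRoutingStrategy(int numShards) { this.numShards = numShards; }

      public String shardForDocument(String collection, String docId) {
        // change the routing function here, and ONLY here
        return "shard" + ((docId.hashCode() & 0x7fffffff) % numShards);
      }

      public List<String> shardsForRequest(String collection) {
        // a request without a route hint has to fan out to all shards
        List<String> shards = new ArrayList<String>();
        for (int i = 0; i < numShards; i++) shards.add("shard" + i);
        return shards;
      }
    }

Changing routing strategy then means writing one new implementation of
RoutingStrategy, instead of hunting down routing code in
DistributedUpdateProcessor, CloudSolrServer, RealTimeGetComponent,
SearchHandler etc.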
* Copied code
** A lot of code is clearly just copied from another place in the code.
It not only makes the code-base very big (we can live with that), but it
also really makes it hard to "change strategy" on the stuff the code
deals with
*** Example (not taken from the Solr code): If you have code, even just
a single line, implementing a calculation of some value - e.g. "int
myImportantValue = param1 * param2 + (param3 % param4);" - DON'T copy
that code. It will make it impossible, in the future, when you realize
(because you spot an error or because you change strategy) that the
calculation should actually be e.g. "int myImportantValue = param1 *
param2 + (param4 % param3) / param5;". The same of course goes for a
sequence of code-lines handling a specific task. See the sketch below.
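A minimal sketch of the alternative (made-up names): the calculation
lives in exactly one method, and every "call site" uses that method, so
a future change to the formula happens in one place only:

    class ImportantValueCalculator {
      // The ONE place where this calculation lives. When we realize the
      // formula should actually be something else, we change it here only.
      static int importantValue(int param1, int param2, int param3, int param4) {
        return param1 * param2 + (param3 % param4);
      }

      public static void main(String[] args) {
        // two "call sites" sharing the one implementation instead of copies
        int valueForIndexing = importantValue(2, 3, 7, 4);
        int valueForSearch = importantValue(5, 6, 9, 2);
        System.out.println(valueForIndexing + " " + valueForSearch);
      }
    }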
As I said, we cannot blame anyone for the code being a mess, but we can
look at the future and encourage people to do stuff that, step by step,
will/can reduce the problems mentioned above.
* Refactor whenever you have the chance
** To clean up mistakes made in the past related to e.g. "separation of
concerns" or "copied code" or ...
** To make sure you don't make more of those mistakes - a few rules of thumb:
*** Every developer should have a bell ringing in their head whenever
they are about to do ctrl-c plus ctrl-v. This bell should remind you to
think about whether or not you really want to do the copy, instead of
making a method (or something) containing the code you are about to
copy, and using this method both from the place where the original code
was and from the place where you were about to copy it to. If you want a
little bit of difference in the code among the two (or more) places
where it is used, still make sure that all the common stuff is shared -
e.g. implement the difference in the shared method by letting it take a
parameter deciding on the variant of semantics. There are also more
advanced ways to share code in object-oriented languages.
*** Whenever you are about to make a change to the code of a certain
size, start by considering what kinds of concerns it is dealing with and
separate dealing with those concerns into different classes (class
hierarchies). If you are going to deal with a concern already dealt with
in other places in the code, take the opportunity to refactor and
isolate the existing code in a separate class dealing with the concern,
add your additions to the concern and use the class from both places.
Keep in mind that Java is an object-oriented language and that REAL
object-orientation is better than just advanced procedural coding - know
and use the advantages you get from an object-oriented language (see the
sketch below).
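A minimal sketch (made-up names) of the "more advanced" object-oriented
way to share code: the common skeleton is written once, and variants
only override the single step that actually differs (the "template
method" style):

    // Made-up sketch: the shared skeleton lives in one place; variants
    // only override the one step that differs.
    abstract class DocumentWriter {
      final void write(String doc) {
        validate(doc);           // common code, shared by all variants
        store(transform(doc));   // transform() is the variant-specific part
      }
      private void validate(String doc) {
        if (doc == null || doc.isEmpty())
          throw new IllegalArgumentException("empty doc");
      }
      private void store(String doc) {
        System.out.println("stored: " + doc);
      }
      abstract String transform(String doc); // implemented once per variant

      public static void main(String[] args) {
        new PlainWriter().write("hello");     // stored: hello
        new UpperCaseWriter().write("hello"); // stored: HELLO
      }
    }

    class PlainWriter extends DocumentWriter {
      String transform(String doc) { return doc; }
    }

    class UpperCaseWriter extends DocumentWriter {
      String transform(String doc) { return doc.toUpperCase(); }
    }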
* Cover your changes by tests - even consider doing TDD, by implementing
your tests before you implement the actual change in the "real" code (a
tiny sketch follows after this list)
** Being able to "trust your test-suite" is the key to being able to do
the following with a fair amount of confidence that you do not ruin
existing stuff
*** Taking in and committing other people's contributions
*** Daring to make major refactorings (which are greatly needed in the
Solr code-base)
** Frankly, if you break existing stuff by your commit, it is not your
fault, unless you cheated and modified/disabled existing tests. It is
the fault of the original implementor of the stuff you ruined - he did
not create a good enough test-suite on top of his code to prevent others
from accidentally ruining it in the future.
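A tiny sketch of what I mean (JUnit 4 style, made-up names, reusing the
calculator example from above): the test pins down the intended
behaviour up front, so whoever refactors or commits later inherits the
safety net:

    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    public class ImportantValueCalculatorTest {
      // Written up front - it documents the intended behaviour and
      // fails loudly if a later refactoring or commit ruins it.
      @Test
      public void calculatesImportantValue() {
        assertEquals(2 * 3 + (7 % 4),
            ImportantValueCalculator.importantValue(2, 3, 7, 4));
      }
    }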
Do real performance testing. Solr is a project where an important
property is "big data". You cannot claim anything about performance
without having tested it. In my world performance covers
* Endurance - that the system can run under load for a very long time
without performing worse over time (except due to stuff like "amounts of
data increasing") - e.g. because of memory leaks, thread leaks,
synchronization problems, congestion etc.
* Response time - how do response times develop as the data store fills
up - indexing times and search times (response times always become worse
at some point, but this "point" needs to be "far out")
* Capacity - how high a load can the system handle per time-unit -
probably a function of several things like RAM/CPU/OS on the involved
machines, number of involved machines, number of shards the collections
are split up into, etc.
* Scalability - e.g. "will your capacity double if you double the amount
of hardware", "does it hold both when going from 1 to 2 units of
hardware (e.g. machines) and when going from 1000 to 2000 units of
hardware" etc.
In my organization we have created completely automated performance
tests. We have many "big machines" with Xen-server installed. The tests
take a description of the "environment/setup" you want to run the test
against (a rough sketch of such a description follows after this
section) - environment/setup being:
** The number of machines running Solr nodes, ZKs etc.
** The amount of RAM/CPU of each machine
** The load to put on the cluster - indexing and search
** etc.
The test automatically sets up virtual machines on the Xen-servers
according to the environment/setup description, installs the versions of
Solr/ZK under test, starts everything up, creates collections with
shard-distribution according to the configs, starts a test-driver
generating the indexing/search load, and measures numerous metrics (CPU
load, memory usage, IO throughput, disk-space usage, indexing response
time, search response time, indexing capacity etc etc) during the test
run. We sometimes run the test for months. You will be amazed to see how
all of the metrics develop as the test advances and the collections are
filled with data. Because of the complete automation, and because it is
based on virtualized servers, we can run (and have run) the test on
numerous setups with combinations of
** 4/6/8 GB RAM on each machine
** 1/4/8 shards per collection per Solr node
** Indexing across 1/2/many collections
** Modified mergers inside Solr/Lucene (merging of segments turns out
to be a problem when you have huge amounts of data in your
collections/shards)
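A rough sketch, as code, of the kind of knobs such an environment/setup
description contains (purely illustrative - made-up names, not our
actual description format):

    // Purely illustrative - made-up names, not our actual format.
    class TestEnvironmentDescription {
      int solrNodeMachines;      // number of machines running Solr nodes
      int zkMachines;            // number of machines running ZKs
      int ramGbPerMachine;       // e.g. 4, 6 or 8 GB RAM
      int shardsPerCollection;   // e.g. 1, 4 or 8 shards
      int numCollections;        // indexing across 1, 2 or many collections
      int indexDocsPerSecond;    // indexing load for the test-driver
      int searchesPerSecond;     // search load for the test-driver
    }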
The Solr project should do stuff like that on its own to be able to say
(with confidence) that it performs with respect to endurance, response
time, capacity and scalability - and in what way it performs
(limitations, recommended hardware as a function of load (indexing and
search), amounts of data, etc.)
About the randomized tests promoted by Dawid Weiss:
I see your point about "bringing up bugs nobody thought to cover
manually", but it also has cons - e.g. violating the principle that
tests should be (easily) repeatable (you will/can end up with tests that
sometimes fail and sometimes succeed, and you have to dig out the random
values of the failing tests in order to be able to repeat/reconstruct
the failure)
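The usual way to mitigate this (a sketch of the principle - not how the
Lucene/Solr test framework actually implements it) is to pick one seed
up front, log it, and derive all random values from it, so a failing run
can be reproduced exactly:

    import java.util.Random;

    public class RepeatableRandomizedTest {
      public static void main(String[] args) {
        // "test.seed" is a made-up property name. Take the seed from a
        // system property when reproducing a failure, otherwise pick
        // one and print it so that it CAN be passed back in.
        long seed = Long.getLong("test.seed", System.nanoTime());
        System.out.println("seed=" + seed
            + " (re-run with -Dtest.seed=" + seed + " to reproduce)");
        Random random = new Random(seed);
        // ... all random test input derived from this one Random ...
        int numDocs = 1 + random.nextInt(1000);
        System.out.println("testing with " + numDocs + " random docs");
      }
    }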
A little last remark - use enums instead of just static string constants.
Enums are not just dumb simple replacements for static constants. You
can have logic associated with each option in the enum etc. See the
UpdateSemantics class of my SOLR-3178 patch, or the generic sketch below.
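A generic illustration (made-up names and logic - not the actual
UpdateSemantics class) of an enum carrying logic with each option:

    // Each constant answers for itself, so callers just ask the enum
    // instead of branching on a string constant at every call site.
    enum UpdateSemanticsVariant {
      CLASSIC {
        boolean requiresVersionCheck() { return false; }
      },
      CONSISTENCY {
        boolean requiresVersionCheck() { return true; }
      },
      CLASSIC_CONSISTENCY_HYBRID {
        boolean requiresVersionCheck() { return true; }
      };

      abstract boolean requiresVersionCheck();
    }

Call sites then just read semantics.requiresVersionCheck() - no
if-else-if-else chains repeated all over the code.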
Regards, Per Steffensen