Everything below is my humble opinion and input - DON'T MEAN TO OFFEND ANYONE
Radim Kolar wrote:
what you should do:
* stuff i do
Like people with confidence, but it is a balance :-) Every decent
developer in the world believes that he is the best in the world. Chances
are that he is not. Be humble.
* ant -> maven
Maven is a step forward, but it is still crap. I believe the original
creator of Ant has apologized in public for basing it on XML. Maven is
also based on XML, besides being way too complex in infrastructure -
goals, phases, environments, strange plugins with executions mapping to
phases etc. XML is good for static data/config stuff, but a build process
is not static data/config - it is a process. Go Gradle!
I don't have either; if I decide to go with SOLR instead of EC, I will
fork it. It will save me a lot of time.
We are basically handling our own version of Solr at my organization,
because it is so hard to get contributions in - SOLR-3173, SOLR-3178,
SOLR-3382, SOLR-3428, SOLR-3383 etc - and lately SOLR-4114 and
SOLR-4120. It is really hard keeping up with the latest versions of
Apache Solr, because it is a huge job to merge new stuff into our Solr.
We are considering taking the consequence and forking our own public (to
let others benefit and contribute) "variant" of Solr.
I understand that no committers are really assigned to focus on
committing other people's stuff, but it is a shame. I would really,
really not like Solr to end up in a situation, where many organizations
run their own little fork. Instead we should all collaborate on
improving "the one and only Solr"! Maybe we should try to find a sponsor
to pay for a full-time Solr committer with the main focus on verifying
and committing contributions from the "outside".
* svn -> git (way better tools)
I think we had this discussion already and it seems that lots of
folks are positive, yet there is still some barrier infrastructure-wise
along the lines.
Don't blame infrastructure - other Apache projects are using it.
Git is the way forward. It will also make committing outside
contributions easier (especially if the commit is to be performed after
the branch has developed a lot since the pull request was made). Merging
among branches will also become easier. Why? Basically, since a pull
request (a request to merge) is an operation handled/known by Git, it
allows Git to maintain more information about where merged code fits
into the code-base considering revisions etc. That information can be
used to ease future or late merges.
* split code into small manageable maven modules
see above - we have a fully functional maven build but ant is our
primary build.
I don't see pom.xml in your source tree.
Have a look at the templates in dev-tools/maven. Do an "ant
-Dversion=$VERSION get-maven-poms" to get your maven stuff generated in
the folder "maven-build". The Maven build does not work 100% out of the
box (at least on the lucene_solr_4_0 branch), but it is very close.
* use github to track patches
wait why is github good for patches?
you can track patch revisions and apply/browse/comment on them easily.
Also it is way easier to upload a patch and do a pull request than to
attach it to a ticket in JIRA.
See comments under "git" above
Besides that I have some additional input, now that we are talking.
Basically the code is a mess. Not blaming anyone in particular. It is
probably to some extent the nature of open source. If someone honestly
believes that the code-base is beautiful, they should find something
else to do. Some of the major problems are
* Bad "separation of concerns"
** Very long classes/methods dealing with a lot of different concerns
*** Example: DistributedUpdateProcessor - dealing with
cloud/standalone-, phases-, optimistic-locking, calculating values for
document-fields (for add/inc/set requests), routing etc. This should all
be separated into different classes each dealing with the a different
concern
** Code dealing with a particular concern is spread all over the code -
it makes it very hard to "change strategy" for this concern
*** Example: An obvious "separate concern" is routing (the decision
about which shard under a collection a particular document belongs to
(should be indexed and found in) and where a particular request needs to
go - leaders, replicas, all shards under the collection?). This concern
is dealt with in a lot of places - DistributedUpdateProcessor,
CloudSolrServer, RealTimeGetComponent, SearchHandler etc. (a sketch of
how such a concern could be isolated follows after this list)
** In my patch for SOLR-3178 I have made a "separate concern" called
UpdateSemantics. It deals with decisions on stuff related to how updates
should be performed, depending on which update-semantics you have chosen
(classic, consistency or classic-consistency-hybrid). This class
UpdateSemantics is used from the actual updating component
DirectUpdateHandler2 - instead of having a lot of if-else-if-else
statements in DirectUpdateHandler2 itself
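To illustrate the kind of isolation I am talking about, here is a
minimal sketch of the routing concern hidden behind one interface (all
names are made up for illustration - this is not the actual Solr API):

    import java.util.ArrayList;
    import java.util.List;

    // Made-up sketch: ONE place owns the routing decision. Components
    // like update processors and search handlers would depend on this
    // interface instead of each containing a copy of the routing logic.
    interface RoutingStrategy {
      // which shard under the collection a particular document belongs to
      String shardForDocument(String collection, String docId);
      // which shards a particular request needs to go to
      List<String> shardsForRequest(String collection);
    }

    class HashRoutingStrategy implements RoutingStrategy {
      private final int numShards;
      HashRoutingStrategy(int numShards) { this.numShards = numShards; }

      public String shardForDocument(String collection, String docId) {
        // change the routing function here, and ONLY here
        return "shard" + ((docId.hashCode() & 0x7fffffff) % numShards);
      }

      public List<String> shardsForRequest(String collection) {
        // a request without a route hint has to fan out to all shards
        List<String> shards = new ArrayList<String>();
        for (int i = 0; i < numShards; i++) shards.add("shard" + i);
        return shards;
      }
    }

Changing routing strategy then means writing one new implementation of
RoutingStrategy, instead of hunting down routing code in
DistributedUpdateProcessor, CloudSolrServer, RealTimeGetComponent,
SearchHandler etc.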
* Copied code
** A lot of code is clearly just copied from another place in the code.
It not only makes the code-base very big (we can live with that), but it
also really makes it hard to "change strategy" on the stuff the code
deals with
*** Example (not taken from the Solr code): If you have code, even just
a single line, implementing a calculation of some value - e.g. "int
myImportantValue = param1 * param2 + (param3 % param4);" - DON'T copy
that code. It will make it impossible, in the future, when you realize
(because you spot an error or because you change strategy) that the
calculation should actually be e.g. "int myImportantValue = param1 *
param2 + (param4 % param3) / param5;". The same of course goes for a
sequence of code-lines handling a specific task. See the sketch below.
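A minimal sketch of the alternative (made-up names): the calculation
lives in exactly one method, and every "call site" uses that method, so
a future change to the formula happens in one place only:

    class ImportantValueCalculator {
      // The ONE place where this calculation lives. When we realize the
      // formula should actually be something else, we change it here only.
      static int importantValue(int param1, int param2, int param3, int param4) {
        return param1 * param2 + (param3 % param4);
      }

      public static void main(String[] args) {
        // two "call sites" sharing the one implementation instead of copies
        int valueForIndexing = importantValue(2, 3, 7, 4);
        int valueForSearch = importantValue(5, 6, 9, 2);
        System.out.println(valueForIndexing + " " + valueForSearch);
      }
    }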
As I said, we cannot blame anyone for the code being a mess, but we can
look at the future and encourage people to do stuff that, step by step,
will/can reduce the problems mentioned above.
* Refactor whenever you have the chance
** To clean up mistakes made in the past related to e.g. "separation of
concerns" or "copied code" or ...
** To make sure you don't make more of those mistakes - a few rules of thumb:
*** Every developer should have a bell ringing in their head whenever
they are about to do ctrl-c plus ctrl-v. This bell should remind you to
think about whether or not you really want to do the copy, instead of
making a method (or something) containing the code you are about to
copy, and using this method both from the place where the original code
was and from the place where you were about to copy it to. If you want a
little bit of difference in the code among the two (or more) places
where it is used, still make sure that all the common stuff is shared -
e.g. implement the difference in the shared method by letting it take a
parameter deciding on the variant of semantics. There are also more
advanced ways to share code in object-oriented languages.
*** Whenever you are about to make a change to the code of a certain
size, start by considering what kinds of concerns it is dealing with and
separate dealing with those concerns into different classes (class
hierarchies). If you are going to deal with a concern already dealt with
in other places in the code, take the opportunity to refactor and
isolate the existing code in a separate class dealing with the concern,
add your additions to the concern and use the class from both places.
Keep in mind that Java is an object-oriented language and that REAL
object-orientation is better than just advanced procedural coding - know
and use the advantages you get from an object-oriented language (see the
sketch below).
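A minimal sketch (made-up names) of the "more advanced" object-oriented
way to share code: the common skeleton is written once, and variants
only override the single step that actually differs (the "template
method" style):

    // Made-up sketch: the shared skeleton lives in one place; variants
    // only override the one step that differs.
    abstract class DocumentWriter {
      final void write(String doc) {
        validate(doc);           // common code, shared by all variants
        store(transform(doc));   // transform() is the variant-specific part
      }
      private void validate(String doc) {
        if (doc == null || doc.isEmpty())
          throw new IllegalArgumentException("empty doc");
      }
      private void store(String doc) {
        System.out.println("stored: " + doc);
      }
      abstract String transform(String doc); // implemented once per variant

      public static void main(String[] args) {
        new PlainWriter().write("hello");     // stored: hello
        new UpperCaseWriter().write("hello"); // stored: HELLO
      }
    }

    class PlainWriter extends DocumentWriter {
      String transform(String doc) { return doc; }
    }

    class UpperCaseWriter extends DocumentWriter {
      String transform(String doc) { return doc.toUpperCase(); }
    }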
* Cover your changes by tests - even consider doing TDD, by implementing
your tests before you implement the actual change in the "real" code (a
tiny sketch follows after this list)
** Being able to "trust your test-suite" is the key to being able to do
the following with a fair amount of confidence that you do not ruin
existing stuff
*** Taking in and committing other people's contributions
*** Daring to make major refactorings (which are greatly needed in the
Solr code-base)
** Frankly, if you break existing stuff by your commit, it is not your
fault, unless you cheated and modified/disabled existing tests. It is
the fault of the original implementor of the stuff you ruined - he did
not create a good enough test-suite on top of his code to prevent others
from accidentally ruining it in the future.
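A tiny sketch of what I mean (JUnit 4 style, made-up names, reusing the
calculator example from above): the test pins down the intended
behaviour up front, so whoever refactors or commits later inherits the
safety net:

    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    public class ImportantValueCalculatorTest {
      // Written up front - it documents the intended behaviour and
      // fails loudly if a later refactoring or commit ruins it.
      @Test
      public void calculatesImportantValue() {
        assertEquals(2 * 3 + (7 % 4),
            ImportantValueCalculator.importantValue(2, 3, 7, 4));
      }
    }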
Do real performance testing. Solr is a project where an important
property is "big data". You cannot claim anything about performance
without having tested it. In my world performance covers
* Endurance - that the system can run under load for a very long time
without performing worse over time (except due to stuff like "amounts of
data increasing") - e.g. because of memory leaks, thread leaks,
synchronization problems, congestion etc.
* Response time - how do response times develop as the data store fills
up - indexing times and search times (response times always become worse
at some point, but this "point" needs to be "far out")
* Capacity - how high a load can the system handle per time-unit -
probably a function of several things like RAM/CPU/OS on the involved
machines, number of involved machines, number of shards the collections
are split up into, etc.
* Scalability - e.g. "will your capacity double if you double the amount
of hardware", "does it hold both when going from 1 to 2 units of
hardware (e.g. machines) and when going from 1000 to 2000 units of
hardware" etc.
In my organization we have created completely automated performance
tests. We have many "big machines" with Xen-server installed. The tests
take a description of the "environment/setup" you want to run the test
against (a rough sketch of such a description follows after this
section) - environment/setup being:
** The number of machines running Solr nodes, ZKs etc.
** The amount of RAM/CPU of each machine
** The load to put on the cluster - indexing and search
** etc.
The test automatically sets up virtual machines on the Xen-servers
according to the environment/setup description, installs the versions of
Solr/ZK under test, starts everything up, creates collections with
shard-distribution according to the configs, starts a test-driver
generating the indexing/search load, and measures numerous metrics (CPU
load, memory usage, IO throughput, disk-space usage, indexing response
time, search response time, indexing capacity etc etc) during the test
run. We sometimes run the test for months. You will be amazed to see how
all of the metrics develop as the test advances and the collections are
filled with data. Because of the complete automation, and because it is
based on virtualized servers, we can run (and have run) the test on
numerous setups with combinations of
** 4/6/8 GB RAM on each machine
** 1/4/8 shards per collection per Solr node
** Indexing across 1/2/many collections
** Modified mergers inside Solr/Lucene (merging of segments turns out
to be a problem when you have huge amounts of data in your
collections/shards)
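A rough sketch, as code, of the kind of knobs such an environment/setup
description contains (purely illustrative - made-up names, not our
actual description format):

    // Purely illustrative - made-up names, not our actual format.
    class TestEnvironmentDescription {
      int solrNodeMachines;      // number of machines running Solr nodes
      int zkMachines;            // number of machines running ZKs
      int ramGbPerMachine;       // e.g. 4, 6 or 8 GB RAM
      int shardsPerCollection;   // e.g. 1, 4 or 8 shards
      int numCollections;        // indexing across 1, 2 or many collections
      int indexDocsPerSecond;    // indexing load for the test-driver
      int searchesPerSecond;     // search load for the test-driver
    }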
The Solr project should do stuff like that on its own to be able to say
(with confidence) that it performs with respect to endurance, response
time, capacity and scalability - and in what way it performs
(limitations, recommended hardware as a function of load (indexing and
search), amounts of data, etc.)
About the randomized tests promoted by Dawid Weiss:
I see your point about "bringing up bugs nobody thought to cover
manually", but it also has cons - e.g. violating the principle that
tests should be (easily) repeatable (you will/can end up with tests that
sometimes fail and sometimes succeed, and you have to dig out the random
values of the failing tests in order to be able to repeat/reconstruct
the failure)
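The usual way to mitigate this (a sketch of the principle - not how the
Lucene/Solr test framework actually implements it) is to pick one seed
up front, log it, and derive all random values from it, so a failing run
can be reproduced exactly:

    import java.util.Random;

    public class RepeatableRandomizedTest {
      public static void main(String[] args) {
        // "test.seed" is a made-up property name. Take the seed from a
        // system property when reproducing a failure, otherwise pick
        // one and print it so that it CAN be passed back in.
        long seed = Long.getLong("test.seed", System.nanoTime());
        System.out.println("seed=" + seed
            + " (re-run with -Dtest.seed=" + seed + " to reproduce)");
        Random random = new Random(seed);
        // ... all random test input derived from this one Random ...
        int numDocs = 1 + random.nextInt(1000);
        System.out.println("testing with " + numDocs + " random docs");
      }
    }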
A little last remark - use enums instead of just static string constants.
Enums are not just dumb simple replacements for static constants. You
can have logic associated with each option in the enum etc. See the
UpdateSemantics class of my SOLR-3178 patch, or the generic sketch below.
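A generic illustration (made-up names and logic - not the actual
UpdateSemantics class) of an enum carrying logic with each option:

    // Each constant answers for itself, so callers just ask the enum
    // instead of branching on a string constant at every call site.
    enum UpdateSemanticsVariant {
      CLASSIC {
        boolean requiresVersionCheck() { return false; }
      },
      CONSISTENCY {
        boolean requiresVersionCheck() { return true; }
      },
      CLASSIC_CONSISTENCY_HYBRID {
        boolean requiresVersionCheck() { return true; }
      };

      abstract boolean requiresVersionCheck();
    }

Call sites then just read semantics.requiresVersionCheck() - no
if-else-if-else chains repeated all over the code.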
Regards, Per Steffensen