> On Sept. 30, 2014, 4:25 p.m., Vishesh Handa wrote:
> > Hey Aaron. Thanks for putting this up for review. Could you please split 
> > this patch into multiple reviews? I see a mix of this patch, xapian->sqlite 
> > changes, new API, coding styles changes, and some improvements in the 
> > FileMapping class.
> > 
> > The indexing progress seems to be now tracked in sqlite instead of Xapian. 
> > You mentioned it is more performant. Do you know why that is? Also, I'd 
> > love to know how these benchmarks were measured. I don't see any tests for 
> > benchmarks. Perhaps you forgot to add them to this patch?
> > 
> > And finally, from what I understand this removed the need for the whole 
> > binary search when a file fails to index. Could you please add a test in 
> > order to make sure the extractor crashing is handled properly? Some tests 
> > for the file indexing queue would also be needed.
> > 
> > How do you handle when a file takes too long to index? The FileIndexingJob 
> > had a timer for 5 minutes.
> 
> Aaron J. Seigo wrote:
>     Sorry I haven't responded sooner but I was away for a week and have been 
> in a job transition since then, and that's kept me occupied the last while. 
> I'll hopefully have some time again to hack on personal / fun things opening 
> up by week's end.
>     
>     "Could you please split this patch into multiple reviews? I see a mix of 
> this patch, xapian->sqlite changes, new API, coding styles changes, and some 
> improvements in the FileMapping class."
>     
>     The code style fixes are limited two small places. The only one that is 
> entirely separate from functional changes is in indexingqueue.cpp. If you 
> really want those bits reverted, I can do so, but I'm not going to submit a 
> separate review for them on their own. Realistically, it's 6 curly braces (by 
> my count) ... 
>     
>     The rest of the changes belong together. The xapian->sqlite changes are 
> necessary for the cross-process queue syncronization, the new API is the 
> cross-process communication and the other changes all support those changes. 
> Example: the sqlite db schema versioning is needed to perform the upgrade to 
> the schema; without that ther would be code there to check if there are 
> specific columns or not, and that would then get replaced by a more sane 
> versioning (such as what is in this patch). There's really not much to break 
> up further.
>     
>     "The indexing progress seems to be now tracked in sqlite instead of 
> Xapian."
>     
>     Correct. The booleans stored in the xapian db as Z1 and Z2 were pretty 
> obscure (what does 'Z1' mean?) and the sqlite db already has a listing of the 
> files being indexed so is a natural place to track progress asyncronously 
> between processes without requiring the xapian db to commit continuously.
>     
>     "You mentioned it is more performant. Do you know why that is?"
>     
>     Yes (see above)
>     
>     "Also, I'd love to know how these benchmarks were measured."
>     
>     The schedulertest. Handy as both an integration test and a small 
> benchmark. Extending it to create more than 20 files per run is pretty simple.
>     
>     "Could you please add a test in order to make sure the extractor crashing 
> is handled properly?"
>     
>     Sure. (Figuring out how to simulate it crashing, since it really 
> shouldn't ever do that, will be fun and interesting ... but I'm sure it can 
> be done)
>     
>     "Some tests for the file indexing queue would also be needed."
>     
>     What needs testing beyond what schedulertest already tests?
>     
>     "How do you handle when a file takes too long to index?"
>     
>     It doesn't currently; I'll add that.

Of course, creating a new review for the coding style changes would be too 
much. But it should go in a separate commit. I would not like your entire 
branch to be merged, rather one commit at a time.

Could you please explain "xapian->sqlite changes are necessary for 
cross-process queue synchronization". I'm afraid I really don't follow. Also, I 
don't follow why they are not functionally seperate. 

Regarding the booleans. Z was a random letter chosen to represent the indexing 
level. One could have chosen a longer string, but that would occupy more space. 
Is there a problem with having to commit changes in Xapian? We are indexing the 
file and making changes over there, is not more sensible to just club them 
together? Unless, there are provable speed and io improvments by storing them 
in sqlite.

Regarding the Scheduler Test: Each change should ideally be tested 
independently so that we know what parts are actually improving the system and 
by how much. Also, the scheduler test isn't ideal since it contains artificial 
delays in the queues. We have them because indexing everything too fast is a 
problem.

Regarding Test Crashing - The previous FileIndexingQueue test already had code 
for that.

Test coverage. At minimum, the following tests would be required -
1. Extractor Client - Handling crashing and timeouts. Maybe tests for the 
FileIQ would also be required since it is now updating the
2. Benchmarks for the FileIndexingQueue (both io and cpu)
3. Benchmarks for the Scheduler (both io and cpu)
4. The BasicIndexingJob test should still be there - I understand that you've 
moved some part of the code to the BasicIQ, please add relevant tests.

I would be very interested in seeing both the IO and CPU benchmarks. More in 
the IO ones since the speed benefits while nice to have, do not amount to much 
if we have artifical delays.


> On Sept. 30, 2014, 4:25 p.m., Vishesh Handa wrote:
> > .gitignore, line 1
> > <https://git.reviewboard.kde.org/r/120438/diff/1/?file=315778#file315778line1>
> >
> >     Is this required?
> 
> Aaron J. Seigo wrote:
>     no; doesn't have to be there. happy to leave it out.

Please do. These kind of things belong in a global gitignore file.


> On Sept. 30, 2014, 4:25 p.m., Vishesh Handa wrote:
> > src/file/lib/extractorclient.h, line 30
> > <https://git.reviewboard.kde.org/r/120438/diff/1/?file=315808#file315808line30>
> >
> >     This add new public API.
> 
> Aaron J. Seigo wrote:
>     Which makes sense if it is used by the Baloo widgets repository, unless 
> they can share code privately. Otherwise, yes, it's fine to make this private 
> if nothing else is going to use it.

Baloo Widgets no longer uses the extractor. This can be made private.


- Vishesh


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/120438/#review67687
-----------------------------------------------------------


On Sept. 30, 2014, 12:47 p.m., Aaron J. Seigo wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://git.reviewboard.kde.org/r/120438/
> -----------------------------------------------------------
> 
> (Updated Sept. 30, 2014, 12:47 p.m.)
> 
> 
> Review request for Baloo.
> 
> 
> Repository: baloo
> 
> 
> Description
> -------
> 
> Current Baloo's file indexer takes batches of files by its command line and 
> processes them async. There are several iterations over and creation of 
> various collection classes. This patch set does away with all of this.
> 
> Instead, the worker process now is directed over stdin and responds over 
> stdout (old school UNIX!). This is done using a very simple protocol which is 
> fully documented in docs/ and is encapsulated in a convenience class so as to 
> make future maintenance relatively painless in the advent that the protocol 
> is adjusted or an entirely new system is put inplace. IOW, the next me should 
> have an easier time improving on what is there. This allows the worker 
> process to be long(er)-lived and moves more of the relevant bookkeeping to 
> the same place (the worker process). The resulting code is about the same 
> size but cleaner and much more performant.
> 
> The tracking of file progress is also moved to the (less expensive, easier to 
> update incrementally) sqlite db where file data is already kept. It also 
> versions the sqlite schema so that runtime upgrades can be made (as required 
> here) both now and in the future without user intervention.
> 
> Finally, a tiny (93 LOC, including header) executable that wraps the protocol 
> encapsulation class is provided for 100% backwards compatibility of *all* 
> command line options that currently exist.
> 
> This makes it a 100% drop-in replacement.
> 
> Performance of the affected unit tests is ~2.5x better on my laptop: 7.6 v 
> 3.0 seconds.
> 
> Known weaknesses:
> 
> * still isn't hardened against multiple processes indexing simultaneously. 
> the added sqlite columns are a step in this direction, howeer
> * the DBus signal that was being emitted was removed and has not been 
> replaced with anything yet. That is because the affected class is in 
> baloo-widgets (which happens to be the only user of the relevant *public* 
> class in the baloo file library) and I have not yet worked on that, pending 
> this patch set being approved in principle. I am actually of the suspicion 
> that keeping that data synchronized at all times is not needed .. but if it 
> is, there are much better ways than DBus. Note that the DBus overhead was 
> applied to *every single file indexed* even if Dolphin (the only user of this 
> feature) wasn't running or had the metadata frame open.
> 
> The base idea behind this patch is to make file indexing as blazing fast as 
> possible, which means using as few resources, CPU & memory, as possible. That 
> ought to be the #1 priority for Baloo when it comes to indexing files, both 
> to keep it away from user's notice on desktop class hardware as well as 
> extend its relevance to smaller devices. The code in this patch is at least 
> as clear (imho more so, actually) than what it replaces and was written for 
> maintainability.
> 
> 
> Diffs
> -----
> 
>   src/file/extractor/app.cpp PRE-CREATION 
>   src/file/extractor/baloo_file_extractor.h 
> a8cff51ae763c36fb2934c80c04cbf0f710a01c7 
>   src/file/extractor/baloo_file_extractor.cpp 
> b2e7f97f2f0537807c314e2aca008d67711a69ce 
>   src/file/extractor/extractorworker.cpp 
> e710908dc8b02b7edc9f40e44b9d5d5883986846 
>   src/file/extractor/main.cpp 606a1eb9d42b7079fee497a3fdffbe2327512c13 
>   src/file/fileindexingjob.h PRE-CREATION 
>   src/file/fileindexingjob.cpp PRE-CREATION 
>   src/file/fileindexingqueue.h 5b1c612feb84046acf36b3dec3bf890da2cbf650 
>   src/file/fileindexingqueue.cpp c6479cd25308bee9bb6fe6e6791391e6ca83bddf 
>   src/file/indexingqueue.cpp c8ae4d9d1436ebe222a4758bd5ce551e091ec714 
>   src/file/lib/CMakeLists.txt fb158a5f00f07ab48de963bf72924c83fc1c6673 
>   src/file/lib/extractorclient.h 1e2ae09c7bd0068ca185b814c5c4996c0b563856 
>   src/file/lib/extractorclient.cpp 221b90abe1899c27c03cf77e279fb71afce80b4d 
>   src/file/lib/filemapping.cpp 5769fe23576c341917e8f37f7dd2743dd446a6e1 
>   src/file/main.cpp 41855215c7b550c403d5ef523d29f9b73db58cb0 
>   src/file/metadatamover.cpp 5117687c1247275235c3830b9d2234bfff353748 
>   src/file/util.h a4cc2423fd46cda7146b02c572cc3f4ccd069e3e 
>   src/file/util.cpp 2f7ac6de0dddc7d71c2b36bb4fe88677bfe52336 
>   src/file/autotest/fileindexingjob/fileindexingjobtest.h PRE-CREATION 
>   src/file/autotest/fileindexingjob/fileindexingjobtest.cpp PRE-CREATION 
>   src/file/autotest/scheduler/schedulertest.cpp 
> b91a481138382b609068b4737c654ac5182c37ec 
>   src/file/basicindexingjob.h 3d2fb339e38be862332ddca45b4926f8000fa0c1 
>   src/file/basicindexingjob.cpp 8d70fa2b1ca737eaa8d2e717a0eb2dbf430fac84 
>   src/file/basicindexingqueue.cpp 8484c9b6868890cae6c2e19b29ea92d500aac03e 
>   src/file/commitqueue.cpp b26f73cfdbc390cceef01aa6abd2b5af6f9689b2 
>   src/file/database.h 6548de7dbce003112dad8c58c39c624dd806a261 
>   src/file/database.cpp 5c9ccdc442d230bdc5dc9edbe5bc595537d08a8f 
>   src/file/extractor/CMakeLists.txt 7cf3478f547f7fbcdf8908a3471b50fab09b5220 
>   src/file/extractor/app.h 8059a16e8ec2e2a84d1a3b7eb1dd8cfc2a3136de 
>   .gitignore 1377554ebea6f98a2c748183bc5a96852af12ac2 
>   docs/file_extractor.md 3d830e2313bd35994f9fd1d6b52df422d0556fd8 
>   src/file/CMakeLists.txt 1a9e6bf8f2c70b096fbb2c6c9eb78feec9d06622 
>   src/file/autotest/CMakeLists.txt ccb8575fa8c07f2b136d259ed63d43a09ad75a12 
>   src/file/autotest/basicindexingjobtest.h PRE-CREATION 
>   src/file/autotest/basicindexingjobtest.cpp PRE-CREATION 
>   src/file/autotest/fileindexingjob/CMakeLists.txt PRE-CREATION 
>   src/file/autotest/fileindexingjob/extractor.cpp PRE-CREATION 
> 
> Diff: https://git.reviewboard.kde.org/r/120438/diff/
> 
> 
> Testing
> -------
> 
> Indexed files, ran testsuite with 0 failures.
> 
> 
> Thanks,
> 
> Aaron J. Seigo
> 
>

>> Visit http://mail.kde.org/mailman/listinfo/kde-devel#unsub to unsubscribe <<

Reply via email to