Hi,

> The problem isn't so much that distributed files are inefficient but
> probably that the algorithm you are using is not optimal. What is the
> SELECT statement you are using?

These are regular SELECT statements, e.g. SELECT <distributed_file> WITH FIELD1 LIKE sth...

I will focus on one specific file only, but the others are very similar. Our distribution algorithms are always uncomplicated, and in the case discussed here they distribute the data very well. The algorithm is very simple: it uses part of the date (the day) contained in the key to distribute records, so we have 32 partfiles. IDs that do not match a certain pattern (say 1A5N) are put into partfile 32. For IDs that do match the pattern there is only one invocation of ICONV and OCONV: the day is obtained from the date and returned as the partfile number. The procedure (a few lines) cannot be simpler, I think :)

The performance problem arises when you ask for data with selection criteria. jBASE then calls the distribution subroutine thousands of times, which introduces enormous overhead. We usually do not need to run queries like that, but for some (CSHD) investigations we are forced to.

> If you create a list then read through
> that list, you will read in the list order, which may not be optimal for
> that distribution. Note that many moons ago I modified the distributed
> files so that you could change the key on the fly to guarantee that the
> distribution is good (not that anyone has ever used except the people I
> wrote it for) Otherwise the key order you get is not necessarily the
> order that is best to read through the part files and you will create
> millions of random reads instead of lots of sequential reads.

I think that the select takes keys "in natural order" from the partfiles, but I can confirm that tomorrow. We are using jBASE 4.1.5.17. The main difference is that jBASE runs the distribution routine for these "full scan" selects, and I cannot understand why it needs to. I guess that the SELECT / READNEXT operations of the jEDI driver implemented for distributed files handle the distribution virtually (so the SELECT program is not aware of the partfiles), and that the program simply performs SELECT / READNEXT plus a READ of each record.
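
To make the distribution rule above concrete, here is a minimal Python model of it (not the actual jBC subroutine). The key layout is an assumption: I read 1A5N as one alphabetic character followed by five numeric characters, and treat those five digits as an internal date (days since 31 Dec 1967). The names partfile_for and routine_calls_for_scan are hypothetical and exist only for this illustration.

```python
import re
from collections import Counter
from datetime import date, timedelta

# Assumption: keys look like one letter followed by a 5-digit internal date
# (days since 31 Dec 1967). The real key layout is not given in this thread.
KEY_PATTERN = re.compile(r"[A-Za-z][0-9]{5}")  # rough analogue of MATCHES "1A5N"
DAY_ZERO = date(1967, 12, 31)                  # day 0 of the internal date
FALLBACK_PARTFILE = 32                         # partfile for non-matching IDs


def partfile_for(key: str) -> int:
    """Return the partfile number (1..32) for a record key."""
    if KEY_PATTERN.fullmatch(key) is None:
        return FALLBACK_PARTFILE
    internal_date = int(key[1:])
    # Day of month (1..31) of the date embedded in the key.
    return (DAY_ZERO + timedelta(days=internal_date)).day


def routine_calls_for_scan(keys) -> Counter:
    """Model a scan that READs every record through the distributed file:
    one call to the distribution routine per key. Returns hits per partfile."""
    return Counter(partfile_for(key) for key in keys)


hits = routine_calls_for_scan(["A10000", "A10001", "B00001", "BADKEY"])
print(sum(hits.values()), "routine calls for 4 keys")
```

Each call is cheap on its own. The point of the model is only that the number of calls grows with the number of records read through the distributed file, even though a scan of a partfile already knows which partfile each record came from.
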
This is inefficient, because the READ introduces unnecessary overhead by calling the distribution routine. Results can be obtained much faster by doing (direct) SELECTs on the partfiles and combining the output. This is, however, an optimization for the jBASE team, not for us, I believe. We already raised it, but I noticed "resistance" to accepting this ticket :(

> A select will read the ID, then it will read the record -
> non-distributed files use some neat tricks for bulk reads, whereas
> distributed files probably cannot. Have you considered SELECTing each
> part file individually, then merging the results? I doubt that the
> distributed files can be generically optimized, but by changing the key
> on read and write (computationally of course) you can probably get much
> better performance.
>
> Of course, once you can use my new file system, you won't need
> distributed files and won't have this problem :-)

I need to read your earlier post. Do you know whether jBASE 5 would help us eliminate the problem described? I think that calling the distribution routine is not needed when you do a full scan of the file. I guess that many people could benefit from such an optimization.

Kind regards
Pawel
