> It sounded like your standard app tends to load a bunch of data and then deliver those dbs for read-only access. Again this is a minor tweak but you may see some improvement if you can sort the data by primary key for insert into the base table. This will give you a one-time clustering that would improve index scans that require index to base table look up. Thinking about this, I wonder if we should add an option to compress table to do this automatically? It would be very easy to do (i.e. a one-time cluster - continual clustering would be a lot more work). All that would have to happen is that at compress time all the rows get thrown into the sorter and the result is inserted on the other side.

[Arindam] This is a great suggestion. Our DB is dynamic; however, only one modification owner can write to it, and the rest see a copy. We can simply compress the master DB before we replicate it to the copies. By the way, I tested on MySQL - it processes 1000 ids in the IN clause against the same DB in 16 millis. I have increased the query page size to 16 MB. I will send the execution plans as I get them.
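The compress-before-replication step is easy to script. A minimal sketch, not code from this thread: it assumes an open embedded `Connection` to the master database and the default `APP` schema, and uses Derby's `SYSCS_UTIL.SYSCS_COMPRESS_TABLE` system procedure (the `compressCall` helper is plain string building):

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;

public class CompressBeforeReplication {
    // Builds the Derby system-procedure call that compresses a table and,
    // as a side effect, refreshes its index cardinality statistics.
    static String compressCall(String schema, String table) {
        return "CALL SYSCS_UTIL.SYSCS_COMPRESS_TABLE('"
                + schema + "', '" + table + "', 1)";
    }

    // Compress every listed table on the master DB before its files are
    // replicated to the read-only copies.
    static void compressAll(Connection conn, String schema, String... tables)
            throws SQLException {
        for (String table : tables) {
            try (CallableStatement cs = conn.prepareCall(compressCall(schema, table))) {
                cs.execute();
            }
        }
    }
}
```

Called once per replication cycle, e.g. `compressAll(conn, "APP", "OBJECT_MASTER", "OBJECT_CATEGORY_MAPPING", "CATEGORY_MASTER")`, this gives the copies the compact, statistics-fresh layout that the thread found so much faster.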
On Wed, Apr 8, 2009 at 10:31 PM, Mike Matrigali <[email protected]>wrote: > Arindam Bhattacharjee wrote: > >> Thanks for the excellent summary Mike. The IN clause will have an unique >> list of object ids. If we can return 100 object ids in good time, we are >> okay. The users will be given an option to increase this to MAX 1000 - >> however, they will be warned that the performance would be slower. The real >> power of Derby according to us is the flexibility to manage the life cycle >> of the database files without support from DB admins. We want to make the DB >> management completely hidden from our customers. However, SolidDB and >> TimesTen are good options which we can investigate but Derby has the >> advantage of being used within our organization already. In general we liked >> its memory footprint - except for one case of SQL parsing, it was mostly >> well behaved. >> >> Question: will changing page size for the tables improve query speed? We >> will always run one query through multiple threads on Derby and we will not >> use it for any other purpose. We are okay to be a little slow on the >> insertions but retrieval should be as quick and fast as possible. The >> database will be created and replicated. The replicated databases will be >> read-only. >> > I am not sure about page size but seems reasonable to try upping them all > to 32k. I usually leave page size to the very end of tuning. Best is just > to run the experiment. You definitely have enough rows so > you won't be wasting any space going to bigger pages. The trade off is > that i/o will be better but processing bigger pages requires more cpu > for things like binary searches on the btree pages. But bigger pages may > also shrink the height of the btree so again it is a tradeoff. 
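To run the 32k experiment Mike suggests, the page size has to be in effect before the tables and indexes are created, since each conglomerate keeps the page size it was created with. A sketch, assuming the database is then rebuilt; the property names are Derby's documented `derby.storage.pageSize` and `derby.storage.pageCacheSize`:

```java
public class PageSizeExperiment {
    public static void main(String[] args) {
        // Must be set before CREATE TABLE / CREATE INDEX runs; existing
        // conglomerates keep whatever page size they were created with.
        System.setProperty("derby.storage.pageSize", "32768");

        // The cache is sized in pages (default 1000), so 1000 x 32k pages
        // is ~32 MB of heap versus ~4 MB with the default 4k pages.
        System.setProperty("derby.storage.pageCacheSize", "1000");

        // ... boot the embedded driver and recreate the tables here ...
    }
}
```

With a 256 MB heap cap, shrinking `derby.storage.pageCacheSize` while growing the page size keeps the memory footprint roughly constant, which matches the tradeoff described above.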
> Another downside is that memory footprint of derby will increase as Derby > currently sizes its cache by number of pages so where you may > have 1000 (default cache size is 1000 pages) 4k pages cached you now might > get 1000 32k pages. > > It sounded like your standard app tends to load a bunch of data and then > deliver those dbs for read only access. Again this is a minor tweak but > you may see some improvement if you can sort the data by primary key for > insert into the base table. This will give you a one time clustering that > would improve index scans that require index to base table look up. > Thinking about this, I wonder if we should add an option to compress table > to do this automatically? It would be very easy to do (i.e. a one time > cluster - continual cluster would be a lot more work). All that would have > to happen is at compress time throw all the rows into the sorter and then > insert the result on the other side. > > >> Best regards, >> >> Arindam. >> >> >> On Wed, Apr 8, 2009 at 6:23 PM, <[email protected]> >> wrote: >> >> >> Arindam, >> >> >> Ok, >> >> >> Just a few things… >> >> >> 1) Derby is a free, open source product. Warts and all, it’s >> not going to be your best choice for a commercial rdbms product. >> (Sorry Derby fans, you get what you pay for.) >> >> 2) You want speed, you’re going to have to consider IBM’s >> SolidDB which is an in-memory database along with Oracle’s TimesTen >> database. This will give you the speed that you want. >> >> 3) You’re running on a laptop where your disk drive could be a >> 5400 rpm IDE drive. Laptop drive == low energy consumption and low >> performance. Disk based solutions will be much slower on a laptop >> than on a ‘comparable’ desktop and/or server. >> >> 4) I’m not sure why you had to load and then compress your >> tables. I could understand that if you created your index prior to >> loading the data that you could have to update the statistics. 
>> >> 5) Since we don’t know what you’re actually trying to do, >> there could be a better or more efficient design. Having an IN >> clause with 1000’s of entries is definitely not a good design for >> any database. >> >> 6) You never did answer the question about the object ids that >> were in the IN clause. Were they unique or were there duplicates? >> >> >> >> HTH >> >> >> -Mike >> >> >> ------------------------------------------------------------------------ >> >> *From:* Arindam Bhattacharjee >> [mailto:[email protected]] >> *Sent:* Wednesday, April 08, 2009 2:14 AM >> >> *To:* Derby Discussion; [email protected] >> >> *Subject:* Re: URGENT!!! JDBC SQL query taking long time for large >> IN clause >> >> >> A few inputs/observations: >> >> a) This is not a school project. This is a project which makes money >> for us today. We have a functioning product running today, developed >> by us, which doesn't use any RDBMSs to achieve what I am trying to >> achieve with the tables. For a 200,000-record object master we get 45 - 60 >> millis performance. The reason we are trying to use an RDBMS is to >> ensure that we can scale even higher and updates are easier and more >> maintainable. Hence all these questions. >> >> b) I tried removing the "preserve rows" from the declare temp table >> clause - and that didn't help me either. >> >> c) The database which I have created is just a sample database >> which vaguely represents the load which we will see if we implement >> the existing feature of ours using Derby RDBMS. It doesn't contain >> REAL data. >> >> d) The entire database was created in one shot, as I have stated >> earlier, and records are populated in one go. That was the actual >> problem - thanks to Knut for indicating that as a possibility. When >> I compress the tables - the queries start performing a LOT faster! 
1 >> million in object master, 20 million in object category mapping and >> 10 K in category master and 1000 in IN clause gives me 1100 millis >> speed now - down from 30 seconds. That probably explains why the >> object master index scans were taking nearly 30 seconds in the >> runtimestats which I posted. Which is slow, but still manageable >> since the 100 and 500s are much faster. However, that is still a lot >> slower than other databases we are testing on. Since Derby is >> embedded we expected it to be much faster since no IPC is required >> between the client driver and the network server. >> >> e) IN doesn't function properly beyond 1000 object ids, and we used >> a regular table CREATE TABLE DUMMY (ID INTEGER PRIMARY KEY). >> Insertion of 5000 records into this happened in 100 millis (after >> compression of the tables) and the query took 4 seconds. With >> executeUpdate() or execute() (as suggested) insertion was at least >> 200% slower. So the execute batch works better for the inserts into >> the table for the payloads required by us. The temp table query took >> 10 MINS (this time I waited patiently for it to finish - there was >> nothing wrong there - the query just takes too much time and I >> didn't have patience to wait for it to finish the first time :) ) to >> execute while the regular table with primary key took just 4 seconds. >> >> f) The stack overflow exception for "select ... values...union" was >> much less of a problem in 10.4 compared to 10.1 since the memory >> usage is tapering off to the -mx<Size In Megs> for the JVM. But, >> since it cannot be parameterized it isn't of much use to us. IN >> clause based queries are getting parsed with much lower memory >> footprints. >> >> Best regards, >> >> Arindam. 
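The fast path described in (e) - batch-inserting the externally supplied ids into `DUMMY` and joining against it - looks roughly like this. A sketch, not Arindam's actual code; the `DUMMY (ID INTEGER PRIMARY KEY)` table is from the thread, while the batch size of 1000 and the `batchCount` helper are illustrative assumptions:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchIdLoad {
    // Number of executeBatch() calls needed for n ids at the given batch size.
    static int batchCount(int n, int batchSize) {
        return (n + batchSize - 1) / batchSize;
    }

    // Load the externally supplied ids with addBatch/executeBatch, which the
    // thread measured as clearly faster than per-row executeUpdate().
    static void loadIds(Connection conn, int[] ids, int batchSize) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // one commit for the whole load
        try (PreparedStatement ps =
                conn.prepareStatement("INSERT INTO DUMMY (ID) VALUES (?)")) {
            for (int i = 0; i < ids.length; i++) {
                ps.setInt(1, ids[i]);
                ps.addBatch();
                if ((i + 1) % batchSize == 0) {
                    ps.executeBatch(); // flush a full batch
                }
            }
            ps.executeBatch(); // flush the final partial batch
            conn.commit();
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```

The category query can then join `DUMMY` instead of carrying a 1000-entry IN list, which sidesteps both the parse cost and the 1000-value limit mentioned in (e).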
>> >> On Wed, Apr 8, 2009 at 2:39 AM, <[email protected]> wrote: >> >> >> > -----Original Message----- >> > From: Mike Matrigali [mailto:[email protected]] >> > Sent: Tuesday, April 07, 2009 2:05 PM >> > To: Derby Discussion >> >> > Subject: Re: URGENT!!! JDBC SQL query taking long time for large >> IN clause >> > >> >> > It is impossible to say what the performance of the query can be >> without >> > knowing the exact values in the IN LIST. But it is >> > possible to get some idea assuming some worst case behavior, and >> from >> > that I am going to guess you will never come close to 100ms with an >> > uncached database, on hardware using some sort of standard disk >> based >> > hard drive. >> > >> > I do think the query may go faster with index and query tweaking, >> but >> > 100ms to an uncached db and non-clustered unique values in that IN >> list >> > is never going to go that fast. Adding up just what is posted it >> looks >> > like this is a 1.2 gig db. >> > >> >> Drop the unnecessary indexes and you'll see the database size shrink >> fast. >> Also note that he's running this on a Windows XP laptop. Depending >> on the >> model of the laptop, you will have not only CPU issues but also >> disk i/o >> issues as well. (5400 rpm IDE as an example....) >> >> However, it is possible for the OP to get better performance, if not >> realistically 100ms performance. (BTW where did 100ms come from? I'm >> sorry >> but this really sounds like a class project...) >> >> >> > You posted the space for the tables and indexes. The interesting >> ones >> > are the big ones. You have 5 tables or indexes over 1000 pages >> big. 
If >> > in the worst case your 1000 value IN list happens to be on 1000 >> > different pages then Derby is going to need to do at least 1000 >> i/o's to >> > get to them - I usually use back of envelope of max 100 i/o's per >> second >> > (even if your disk has specs that say higher rate this I/O is not >> > going to >> > get streamed as fast as possible by this query, it is going to ask >> for >> > page, process it, do some join work then later ask for another >> page, ...) >> > : >> > > CATEGORY_MASTER 0 103 0 0 4096 0 >> > > SQL090406091302600 1 55 0 0 4096 0 >> > > SQL090406091302601 1 160 0 1 4096 0 >> > > SQL090406091302730 1 1 0 1 4096 0 >> > > OBJECT_MASTER 0 10497 0 0 4096 0 >> > > SQL090406091302760 1 5340 0 1 4096 0 >> > > SQL090406091302761 1 16708 0 410 4096 0 >> > > OBJECT_CATEGORY_MAPPING 0 150794 0 0 4096 0 >> > > OBJECT_CATEGORY_MAPPING_INDEX 1 112177 0 57 4096 >> 0 >> > >> >> Mike, >> >> I think that a lot of this information is a bit skewed. Outside of the >> primary index, the indexes he created included the varchar field. >> Not sure >> why he did this except under the impression that he'd only have to >> hit the >> index and not the underlying table. While there is some potential >> merit to >> this, I think that there are things that he can do to improve >> performance. >> (Hence my post about reworking the query itself and using a temp >> table.) >> Drop those indexes and you'll see a big change in database size. >> >> >> > There was work done in 10.3 on IN-LISTS, making them perform more >> like >> > unions, See DERBY-47. So if you have a choice of releases I would >> > suggest you move to 10.4 and post query plan and results against >> that. >> > The basic idea of that change was to allow the >> > system to do 1 probe into an index for each value in the IN-LIST, >> before >> > this change DERBY could only sort the values in the IN list and then >> > limit a index scan to the lowest and biggest values in the in list. 
>> > So for instance for OBJECT_CATEGORY_MAPPING_INDEX, worst case it >> might >> have to scan 112177 pages to find the 1000 rows, where worst case >> for >> probing would be 1000 pages (plus btree parent index pages, but those >> > are much more likely cached). The problem is that there is >> definitely >> overhead for probing one at a time, scans go much faster - so there >> is >> a crossover point - i.e. I would guess it would likely be better to >> scan all >> 112177 pages than do 100,000 probes. >> > >> >> I believe that it was already recommended that he do just that. >> There are two ways he could use the temp table. As a sub-select >> statement, >> or as part of the table join. >> >> I think this would bypass the whole use of the IN list. I'm still >> not 100% >> sure why there are 100+ values coming from an outside source. Based on >> his >> query below it looks like the object_ids in the IN clause are not >> unique... >> >> It's kind of hard trying to help someone when you don't know the whole >> problem.... >> >> -Mike >> >> >> > arindam.bhattacharjee wrote: >> > > Hello Knut, >> > > >> > > Thanks for your quick response. This is a sample database which >> I have >> > > created just for testing out the performance and has been written >> to >> > only >> > > once in one go. I tried temp tables but that is just too slow. >> The IN >> > clause >> > > has values which come from another source and I can't modify >> that. >> > > >> > > However, I will try out what you state below. But still, I >> wanted to get >> > > your pulse about whether Derby can respond in sub 100 millisec >> time with >> > the >> > > table sizes you see above? 
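Mike's back-of-envelope arithmetic above can be written down explicitly: worst case, each of the 1000 IN-list values costs one uncached page read, at his assumed ceiling of 100 random I/Os per second. Pure arithmetic on the figures from this thread, nothing measured:

```java
public class ProbeCostEstimate {
    // Worst-case seconds for 'ios' random page reads at 'iosPerSecond'.
    static double worstCaseSeconds(int ios, int iosPerSecond) {
        return (double) ios / iosPerSecond;
    }

    public static void main(String[] args) {
        // 1000 IN-list values -> up to 1000 uncached leaf-page probes.
        double secs = worstCaseSeconds(1000, 100);
        System.out.println(secs + " s worst case"); // far above the 100 ms target
    }
}
```

At 10 seconds worst case for the uncached 1000-value query, this explains why the sub-100 ms goal is only realistic once the working set is in the page cache.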
>> > > >> > > I find that: >> > > >> > > select category_master.category_name, >> > count(category_master.category_name) >> > > as category_count >> > > from >> > > ( >> > > select internal.object_id >> > > from >> > > ( >> > > values(1001) union all >> > > values(1001) union all >> > > values(1001) union all >> > > values(1001) union all >> > > values(1002) union all >> > > values(1001) union all >> > > values(1001) union all >> > > values(1001) union all >> > > values(1001) union all >> > > values(1001) union all >> > > values(1001) union all >> > > values(1001) union all ....... >> > > values(9999) >> > > ) as internal(object_id) >> > > >> > > ) as external_ids, >> > > object_master, >> > > category_master, >> > > object_category_mapping >> > > where >> > > external_ids.object_id = object_master.object_id and >> > > external_ids.object_id = object_category_mapping.object_id and >> > > object_master.object_id = object_category_mapping.object_id >> and >> > > category_master.category_id = >> object_category_mapping.category_id >> > > group by >> > > category_master.category_name >> > > order by >> > > category_count desc >> > > >> > > is much faster unfortunately connection.prepareStatement() is >> taking way >> > too >> > > much memory (both stack and heap - I have a constraint of 256 MB >> MAX >> > memory >> > > for my JVM) which goes beyond my applications resources. Is >> there a way >> > I >> > > can precompile some SQLs which are very expensive to parse during >> > execution. >> > > >> > > Best regards, >> > > >> > > Arindam. >> > > >> > > >> > > Knut Anders Hatlen wrote: >> > >> "arindam.bhattacharjee" <[email protected] >> <mailto:[email protected]>> writes: >> > >> >> > >>> Hello, >> > >>> >> > >>> I would like my query below to return within 100 millisecs. >> Please >> > help >> > >>> me, >> > >>> and the values for the IN clause comes from outside hence cannot >> > really >> > >>> change the IN clause to a join on an existing table. 
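Since the VALUES ... UNION ALL form above cannot be parameterized and is expensive to parse, a common workaround is to generate one `?` marker per IN-list entry and reuse the prepared statement for each list size. A sketch under assumptions: `buildQuery` is hypothetical, and the query is a simplified version of the one in the thread (it counts categories from the mapping table alone):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class InListQuery {
    // Builds the category-count query with n parameter markers in the IN
    // list; preparing it once per distinct n amortizes the parse cost.
    static String buildQuery(int n) {
        StringBuilder in = new StringBuilder();
        for (int i = 0; i < n; i++) {
            in.append(i == 0 ? "?" : ", ?");
        }
        return "SELECT cm.category_name, COUNT(cm.category_name) AS category_count"
                + " FROM object_category_mapping ocm"
                + " JOIN category_master cm ON cm.category_id = ocm.category_id"
                + " WHERE ocm.object_id IN (" + in + ")"
                + " GROUP BY cm.category_name"
                + " ORDER BY category_count DESC";
    }

    // Prepares the statement and binds the externally supplied ids.
    static PreparedStatement prepare(Connection conn, int[] ids) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(buildQuery(ids.length));
        for (int i = 0; i < ids.length; i++) {
            ps.setInt(i + 1, ids[i]);
        }
        return ps;
    }
}
```

Padding the list up to a few fixed sizes (say 100, 500, 1000, repeating the last id) keeps the number of distinct statements small so Derby's statement cache can reuse the compiled plans.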
>> > >> Hi Arindam, >> > >> >> > >> Does the query run faster if you compress all the tables >> involved, or >> > if >> > >> you drop and recreate all the indexes? If so, it is likely that >> the >> > >> index cardinality statistics are out of date, which may make the >> > >> optimizer pick a bad execution plan. Currently, index cardinality >> > >> statistics are only updated at index creation time, when tables >> are >> > >> compressed, and when columns are dropped. A more automatic >> solution is >> > >> being worked on. For more details, see: >> > >> >> > >> https://issues.apache.org/jira/browse/DERBY-269 >> > >> https://issues.apache.org/jira/browse/DERBY-3788 >> > >> http://db.apache.org/derby/docs/10.4/tuning/ctunstats18908.html >> > >> >> > >> You may be experiencing some other problem, but this is a >> problem that >> > >> keeps coming up, so I think it's worth checking. >> > >> >> > >> Hope this helps, >> > >> >> > >> -- >> > >> Knut Anders >> > >> >> > >> >> > > >> >> >> >> >> >
