[ https://issues.apache.org/jira/browse/TRAFODION-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088293#comment-15088293 ]
ASF GitHub Bot commented on TRAFODION-1740:
-------------------------------------------

Github user DaveBirdsall commented on a diff in the pull request:

https://github.com/apache/incubator-trafodion/pull/253#discussion_r49136585

--- Diff: core/sql/ustat/hs_globals.cpp ---
@@ -3762,6 +3765,68 @@ Lng32 HSSample::make(NABoolean rowCountIsEstimate, // input
   // For Hive tables the sample table used is a Trafodion table
   if (hs_globals->isHbaseTable || hs_globals->isHiveTable)
     {
+     // The optimal degree of parallelism for the LOAD or UPSERT is
+     // the number of partitions of the original table. Force that.
+     // Note that when the default for AGGRESSIVE_ESP_ALLOCATION_PER_CORE
+     // is permanently changed to 'ON', we may be able to remove this CQD.
+     if (hs_globals->objDef->getNumPartitions() > 1)
+       {
+         char temp[40]; // way more space than needed, but it's safe
+         sprintf(temp,"'%d'",hs_globals->objDef->getNumPartitions());
+         NAString EspsCQD = "CONTROL QUERY DEFAULT PARALLEL_NUM_ESPS ";
+         EspsCQD += temp;
+         HSFuncExecQuery(EspsCQD);
+         EspCQDUsed = TRUE; // remember to reset later
+       }
+
+     // If the table is very large, we risk HBase time-outs because the
+     // sample scan doesn't return rows fast enough. In this case, we
+     // want to reduce the HBase row cache size to a smaller number to
+     // force more frequent returns. Experience shows that a value of
+     // '10' worked well with a 17.7 billion row table with 128 regions
+     // on six nodes (one million row sample). We'll assume a workable
+     // HBase cache size value scales linearly with the sampling ratio.
+     // That is, we'll assume the model:
+     //
+     //   workable value = (sample row count / actual row count) * c,
+     //   where c is chosen so that we get 10 when the sample row count
+     //   is 1,000,000 and the actual row count is 17.7 billion.
+     //
+     // Solving for c, we get c = 10 * (17.7 billion / 1 million).
+     //
+     // Note that the Generator does a similar calculation in
+     // Generator::setHBaseNumCacheRows. The calculation here is more
+     // conservative because we care more about getting UPDATE STATISTICS
+     // done without a timeout, trading off possible speed improvements
+     // by using a smaller cache size.
+     //
+     // Note that when we move to HBase 1.1, with its heartbeat protocol,
+     // this time-out problem goes away and we can remove these CQDs.
+     if (hs_globals->isHbaseTable)
+       {
+         double sampleRatio = (double)(sampleRowCnt) / hs_globals->actualRowCount;
+         double calibrationFactor = 10 * (17700000000/1000000);
+         Int64 workableCacheSize = (Int64)(sampleRatio * calibrationFactor);
+         if (workableCacheSize < 1)
+           workableCacheSize = 1; // can't go below 1 unfortunately
+
+         Int32 max = getDefaultAsLong(HBASE_NUM_CACHE_ROWS_MAX);
+         if ((workableCacheSize < 10000) && // don't bother if 10000 works
+             (max == 10000)) // don't do it if user has already set this CQD
+           {
+             char temp1[40]; // way more space than needed, but it's safe
+             Lng32 wcs = (Lng32)workableCacheSize;
+             sprintf(temp1,"'%d'",wcs);
+             NAString minCQD = "CONTROL QUERY HBASE_NUM_CACHE_ROWS_MIN ";
--- End diff --

Bug: Need to add the DEFAULT keyword after CONTROL QUERY.

> Use CQDs to prevent time-outs and speed sampling in UPDATE STATS
> ----------------------------------------------------------------
>
>              Key: TRAFODION-1740
>              URL: https://issues.apache.org/jira/browse/TRAFODION-1740
>          Project: Apache Trafodion
>       Issue Type: Bug
>       Components: sql-cmp
> Affects Versions: 2.0-incubating, 1.3-incubating
>      Environment: All; however this is particularly noticeable in clusters
>         Reporter: David Wayne Birdsall
>         Assignee: David Wayne Birdsall
>
> When doing UPDATE STATISTICS on large tables (one example had 17.7 billion
> rows), we sometimes time out with an error like the following:
>
> SQL>update statistics for table table1 on every column sample;
>
> *** ERROR[9200] UPDATE STATISTICS for table TRAFODION.SEABASE.TABLE1 encountered
> an error (8448) from statement Process_Query. [2015-12-15 12:18:45]
>
> *** ERROR[8448] Unable to access Hbase interface. Call to
> ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-706). Cause:
> java.util.concurrent.ExecutionException:
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=36, exceptions:
> Tue Dec 15 17:18:19 GMT 2015, null, java.net.SocketTimeoutException:
> callTimeout=60000, callDuration=60303: row '?. [2015-12-15 12:18:45]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
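For reference, the cache-size model quoted in the diff above can be checked with a small standalone sketch. This is not the actual Trafodion code: std::string and printf stand in for NAString and HSFuncExecQuery, the row counts are the example figures from the comment (a one-million-row sample of a 17.7-billion-row table), and the CQD text includes the DEFAULT keyword that the review comment says the patch is missing.

// Standalone sketch of the cache-size model described in the diff; not the
// actual Trafodion code. std::string replaces NAString, and the CQD text is
// printed instead of being passed to HSFuncExecQuery.
#include <cstdint>
#include <cstdio>
#include <string>

int main()
{
  // Example figures from the comment: one-million-row sample, 17.7 billion rows.
  const double sampleRowCnt   = 1000000.0;
  const double actualRowCount = 17700000000.0;

  // workable value = (sample row count / actual row count) * c,
  // where c = 10 * (17.7 billion / 1 million), so this example lands near 10.
  const double sampleRatio       = sampleRowCnt / actualRowCount;
  const double calibrationFactor = 10.0 * (17700000000.0 / 1000000.0);

  // Truncate to an integer, as the patch does, and clamp at 1.
  int64_t workableCacheSize =
      static_cast<int64_t>(sampleRatio * calibrationFactor);
  if (workableCacheSize < 1)
    workableCacheSize = 1;  // the row cache size cannot go below 1

  // Build the CQD statement; note the DEFAULT keyword after CONTROL QUERY.
  char buf[80];
  std::snprintf(buf, sizeof(buf),
                "CONTROL QUERY DEFAULT HBASE_NUM_CACHE_ROWS_MIN '%lld'",
                static_cast<long long>(workableCacheSize));
  std::string minCQD(buf);

  std::printf("%s\n", minCQD.c_str());
  return 0;
}

With those figures the model yields a value of about 10, the cache size the comment reports working well on the 17.7-billion-row table.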