Is it worth raising a bug in Hive?

On Thu, Mar 24, 2016 at 3:37 PM, Sandeep Khurana <[email protected]> wrote:
> Hello
>
> Hive provides a table sampling approach based on a number of rows. The
> documentation is at
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling#LanguageManualSampling-BlockSampling
>
> It states
>
> "For example, the following query will take the first 10 rows from each
> input split.
> SELECT * FROM source TABLESAMPLE(10 ROWS);
> "
>
> But when I look at the code, FetchOperator.java at
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
>
> I see the method below; check the lines marked with asterisks. It looks like
> it exits the sampling as soon as the required number of records (size) is
> obtained from the splits, i.e. if the first input split provides the needed
> data, then it won't go over the rest of the splits, and records from the
> first split alone will be returned. This contradicts what the documentation
> states.
>
> When I run a TABLESAMPLE query with a number of rows, I also get the rows
> from the same split. I validated this by selecting "INPUT__FILE__NAME" as
> well (my data on HDFS has thousands of files).
>
> Am I missing something or is it a bug?
>
> private FetchInputFormatSplit[] splitSampling(SplitSample splitSample,
>     FetchInputFormatSplit[] splits) {
>   long totalSize = 0;
>   for (FetchInputFormatSplit split: splits) {
>     totalSize += split.getLength();
>   }
>   List<FetchInputFormatSplit> result =
>       new ArrayList<FetchInputFormatSplit>(splits.length);
>   *long targetSize = splitSample.getTargetSize(totalSize);*
>   int startIndex = splitSample.getSeedNum() % splits.length;
>   long size = 0;
>   for (int i = 0; i < splits.length; i++) {
>     FetchInputFormatSplit split = splits[(startIndex + i) % splits.length];
>     result.add(split);
>     long splitgLength = split.getLength();
>     if (size + splitgLength >= targetSize) {
>       *if (size + splitgLength > targetSize) {*
>         *split.shrinkedLength = targetSize - size;*
>       *}*
>       *break;*
>     *}*
>     size += splitgLength;
>   }
>   return result.toArray(new FetchInputFormatSplit[result.size()]);
> }
>
> The Hive bug for this is: https://issues.apache.org/jira/browse/HIVE-3401
>
>
> --
> Thanks and regards
> Sandeep Khurana
>

--
Thanks and regards
Sandeep Khurana
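
To make the early exit easier to see, here is a minimal standalone sketch of the same loop. It is not the Hive code: the class name SplitSamplingSketch and the method selectSplits are hypothetical, splits are reduced to an array of lengths, and the shrinkedLength adjustment is left out. It only shows which splits the loop would visit for a given target size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the loop in splitSampling; not part of Hive.
class SplitSamplingSketch {

    // Returns the indices of the splits the loop keeps, in visiting order.
    public static List<Integer> selectSplits(long[] splitLengths, long targetSize, int seedNum) {
        List<Integer> selected = new ArrayList<>();
        int startIndex = seedNum % splitLengths.length;
        long size = 0;
        for (int i = 0; i < splitLengths.length; i++) {
            int index = (startIndex + i) % splitLengths.length;
            selected.add(index);
            if (size + splitLengths[index] >= targetSize) {
                // Target reached: the remaining splits are never visited.
                break;
            }
            size += splitLengths[index];
        }
        return selected;
    }

    public static void main(String[] args) {
        long[] lengths = {100, 100, 100, 100};
        // A target smaller than the first split selects only that split.
        System.out.println(selectSplits(lengths, 50, 0));  // prints [0]
        // A larger target walks forward until the running size covers it.
        System.out.println(selectSplits(lengths, 250, 0)); // prints [0, 1, 2]
    }
}
```

Under these assumptions, any target that fits inside the starting split returns only that split, which matches the behaviour described in the message above.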
