Based on the comment on AllocateScanRange(), it seems that in some cases the partition id is ignored, so a dummy value can safely be passed in. I'm honestly not sure why it was optional in those cases - it doesn't seem like it would be hard to plumb through the valid partition ID.
So the change seems reasonable to me.

On Tue, Aug 8, 2017 at 9:10 AM, Gabor Kaszab <gaborkas...@cloudera.com> wrote:
> Hey,
>
> I'm currently working on IMPALA-5412
> <https://issues.apache.org/jira/browse/IMPALA-5412>, where I enhanced
> FileDescMap to have a (partition_id, filename) pair as its key instead of
> a single filename. This means I have to extend HdfsScanNode::GetFileDesc
> to take the partition_id as well. One of the functions calling
> GetFileDesc, HdfsScanNode::AllocateScanRange, already has a partition_id
> as an input parameter, so it seemed obvious to pass that along to
> GetFileDesc. Unfortunately, after some debugging of an Impala crash
> during the "Loading Kudu TPCH" phase of testing, I observed that
> AllocateScanRange is sometimes given '-1' as a partition_id. (On one
> occasion it is given scalar_reader->col_idx() as the partition_id.) As a
> result, GetFileDesc finds no matching file descriptor and a DCHECK fails.
>
> I naively modified how AllocateScanRange is called so that it receives
> valid partition IDs instead of '-1' and col_idx(). This fixes the issues
> I encountered above, but I'm not sure this change doesn't break some
> other logic. (Core tests are being run, but even if they pass I'm not
> sure this change is fine.)
>
> Was it intentional to pass those invalid partition_ids to
> AllocateScanRange? Do you see any risk in passing valid IDs instead?
>
> Cheers,
> Gabor