[ https://issues.apache.org/jira/browse/HIVE-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16028257#comment-16028257 ]
Marta Kuczora edited comment on HIVE-16784 at 5/29/17 12:02 PM:
----------------------------------------------------------------

In the LineageState.setLineage method we get the file sink operator for the path:
{noformat}
  public void setLineage(Path dir, DataContainer dc, List<FieldSchema> cols) {
    // First lookup the file sink operator from the load work.
    Operator<?> op = dirToFop.get(dir);

    // Go over the associated fields and look up the dependencies
    // by position in the row schema of the filesink operator.
    if (op == null) {
      return;
    }

    List<ColumnInfo> signature = op.getSchema().getSignature();
    int i = 0;
    for (FieldSchema fs : cols) {
      linfo.putDependency(dc, fs, index.getDependency(op, signature.get(i++)));
    }
  }
{noformat}

The lineage information is missing from the out file because the dirToFop map doesn't contain the given path, so setLineage returns early without recording any dependency.
This map is populated in the SemanticAnalyzer.genFileSinkPlan method:
{noformat}
    if (ltd != null && SessionState.get() != null) {
      SessionState.get().getLineageState()
          .mapDirToFop(ltd.getSourcePath(), (FileSinkOperator) output);
    }
{noformat}

The path used here doesn't match the path used in the LineageState.setLineage method. The difference is in the file name: the map contains the path for the file "-ext-10000", but the path in the LineageState points to the "-ext-10002" file.


was (Author: kuczoram):
In the LineageState.setLineage method we get the FileSinkOperator for the path:
{noformat}
  public void setLineage(Path dir, DataContainer dc, List<FieldSchema> cols) {
    // First lookup the file sink operator from the load work.
    FileSinkOperator fop = dirToFop.get(dir);

    // Go over the associated fields and look up the dependencies
    // by position in the row schema of the filesink operator.
    if (fop == null) {
      return;
    }

    List<ColumnInfo> signature = fop.getSchema().getSignature();
    int i = 0;
    for (FieldSchema fs : cols) {
      linfo.putDependency(dc, fs, index.getDependency(fop, signature.get(i++)));
    }
  }
{noformat}

The reason why the lineage information is missing from the out file is that the dirToFop map doesn't contain the given path.
This map is created in the SemanticAnalyzer.genFileSinkPlan method:
{noformat}
    if (ltd != null && SessionState.get() != null) {
      SessionState.get().getLineageState()
          .mapDirToFop(ltd.getSourcePath(), (FileSinkOperator) output);
    }
{noformat}

The path used here doesn't match the path used in the LineageState.setLineage method. The difference is in the file name: the map contains the path for the file "-ext-10000", but the path in the LineageState points to the "-ext-10002" file.


> Missing lineage information when hive.blobstore.optimizations.enabled is true
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-16784
>                 URL: https://issues.apache.org/jira/browse/HIVE-16784
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Marta Kuczora
>
> Running the commands of the add_part_multiple.q test on S3 with hive.blobstore.optimizations.enabled=true fails because of missing lineage information.
> Running the command on HDFS
> {noformat}
> from src TABLESAMPLE (1 ROWS)
> insert into table add_part_test PARTITION (ds='2010-01-01') select 100,100
> insert into table add_part_test PARTITION (ds='2010-02-01') select 200,200
> insert into table add_part_test PARTITION (ds='2010-03-01') select 400,300
> insert into table add_part_test PARTITION (ds='2010-04-01') select 500,400;
> {noformat}
> results in the following posthook output
> {noformat}
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-01-01).key EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-01-01).value EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-02-01).key EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-02-01).value EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-03-01).key EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-03-01).value EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-04-01).key EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-04-01).value EXPRESSION []
> {noformat}
> These lines are not printed when running the command on the table located in S3.
> If hive.blobstore.optimizations.enabled=false, the lineage information is printed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
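
The lookup miss described in the comment can be reproduced in isolation. The following is only a minimal sketch, not Hive code: the class name DirToFopMismatch, the String stand-ins for Path and FileSinkOperator, the operator label "FS_1", and the staging directory prefix are all assumptions made for illustration. It shows how registering the operator under one directory ("-ext-10000") and looking it up under another ("-ext-10002") makes a setLineage-style method return early.

```java
import java.util.HashMap;
import java.util.Map;

public class DirToFopMismatch {
    // Stand-in for LineageState.dirToFop. Plain Strings replace Hadoop's Path
    // and the FileSinkOperator so the sketch runs without Hive on the classpath.
    static final Map<String, String> dirToFop = new HashMap<>();

    // Mirrors the registration done from genFileSinkPlan (simplified).
    static void mapDirToFop(String dir, String fop) {
        dirToFop.put(dir, fop);
    }

    // Mirrors the early return in setLineage: returns false when no operator
    // is registered for the directory, i.e. when no lineage would be recorded.
    static boolean setLineage(String dir) {
        String op = dirToFop.get(dir);
        if (op == null) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical staging prefix; only the "-ext-1000x" suffixes come from the comment.
        String staging = "s3a://example-bucket/.hive-staging";

        mapDirToFop(staging + "/-ext-10000", "FS_1");

        // Lookup with the directory that was never registered: lineage is skipped.
        System.out.println(setLineage(staging + "/-ext-10002"));
        // Lookup with the registered directory: lineage would be recorded.
        System.out.println(setLineage(staging + "/-ext-10000"));
    }
}
```

Because the miss is silent (a null check followed by return), the only visible symptom is the absence of the POSTHOOK lineage lines, which matches the behavior reported in the issue.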