[ 
https://issues.apache.org/jira/browse/HIVE-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16028257#comment-16028257
 ] 

Marta Kuczora edited comment on HIVE-16784 at 5/29/17 12:02 PM:
----------------------------------------------------------------

In the LineageState.setLineage method we get the file sink operator for the 
path:
{noformat}
  public void setLineage(Path dir, DataContainer dc,
      List<FieldSchema> cols) {
    // First lookup the file sink operator from the load work.
    Operator<?> op = dirToFop.get(dir);

    // Go over the associated fields and look up the dependencies
    // by position in the row schema of the filesink operator.
    if (op == null) {
      return;
    }

    List<ColumnInfo> signature = op.getSchema().getSignature();
    int i = 0;
    for (FieldSchema fs : cols) {
      linfo.putDependency(dc, fs, index.getDependency(op, signature.get(i++)));
    }
  }
{noformat}
The lineage information is missing from the out file because the dirToFop map 
doesn't contain the given path, so setLineage returns early.
This map is populated in the SemanticAnalyzer.genFileSinkPlan method:
{noformat}
    if (ltd != null && SessionState.get() != null) {
      SessionState.get().getLineageState()
          .mapDirToFop(ltd.getSourcePath(), (FileSinkOperator) output);
    }
{noformat}
The path used here doesn't match the path used in the 
LineageState.setLineage method. The difference is in the file name: the map 
contains the path for the file "-ext-10000", but the path passed to 
LineageState.setLineage points to the "-ext-10002" file.
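
To illustrate the mismatch, here is a minimal, self-contained sketch. It is not 
the Hive code: the class name, the String-keyed map (Hive uses Path keys and 
FileSinkOperator values), the operator name "FS_1", and the staging directory 
prefix are all assumptions made for the example. Only the "-ext-10000" and 
"-ext-10002" suffixes and the early return on a missing entry come from the 
observations above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for LineageState; not the real Hive class.
public class DirToFopMismatchSketch {
    // Simplified dirToFop: directory path (as a String) -> operator name.
    private final Map<String, String> dirToFop = new HashMap<>();

    // Simplified counterpart of LineageState.mapDirToFop.
    public void mapDirToFop(String dir, String fileSinkOperatorName) {
        dirToFop.put(dir, fileSinkOperatorName);
    }

    // Simplified counterpart of LineageState.setLineage: returns false when
    // no operator is registered for dir, mirroring the early return.
    public boolean setLineage(String dir) {
        String op = dirToFop.get(dir);
        if (op == null) {
            return false; // lineage is silently skipped
        }
        return true; // the real method would record the dependencies here
    }

    public static void main(String[] args) {
        // Hypothetical staging directory prefix, for illustration only.
        String staging = "/warehouse/add_part_test/.hive-staging";

        DirToFopMismatchSketch state = new DirToFopMismatchSketch();
        // The map is filled with the "-ext-10000" path ...
        state.mapDirToFop(staging + "/-ext-10000", "FS_1");

        // ... but the lookup is done with the "-ext-10002" path.
        System.out.println("lookup -ext-10002 recorded lineage: "
            + state.setLineage(staging + "/-ext-10002"));
        System.out.println("lookup -ext-10000 recorded lineage: "
            + state.setLineage(staging + "/-ext-10000"));
    }
}
```

In this sketch the lookup only succeeds when the same directory is used for 
both the registration and the lookup, which is the condition that does not 
hold in the failing case.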


was (Author: kuczoram):
In the LineageState.setLineage method we get the FileSinkOperator for the path:
{noformat}
  public void setLineage(Path dir, DataContainer dc,
      List<FieldSchema> cols) {
    // First lookup the file sink operator from the load work.
    FileSinkOperator fop = dirToFop.get(dir);

    // Go over the associated fields and look up the dependencies
    // by position in the row schema of the filesink operator.
    if (fop == null) {
      return;
    }

    List<ColumnInfo> signature = fop.getSchema().getSignature();
    int i = 0;
    for (FieldSchema fs : cols) {
      linfo.putDependency(dc, fs, index.getDependency(fop, signature.get(i++)));
    }
  }
{noformat}
The reason why the lineage information is missing from the out file is that the 
dirToFop map doesn't contain the given path.
This map is created in the SemanticAnalyzer.genFileSinkPlan method:
{noformat}
    if (ltd != null && SessionState.get() != null) {
      SessionState.get().getLineageState()
          .mapDirToFop(ltd.getSourcePath(), (FileSinkOperator) output);
    }
{noformat}
The path used here doesn't match with the path used in the 
LineageState.setLineage method. The difference is in the file name, the map 
contains the path for the file "-ext-10000", but the path in the LineageState 
points to the "-ext-10002" file.

> Missing lineage information when hive.blobstore.optimizations.enabled is true
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-16784
>                 URL: https://issues.apache.org/jira/browse/HIVE-16784
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Marta Kuczora
>
> Running the commands of the add_part_multiple.q test on S3 with 
> hive.blobstore.optimizations.enabled=true fails because of missing lineage 
> information.
> Running the command on HDFS
> {noformat}
> from src TABLESAMPLE (1 ROWS)
> insert into table add_part_test PARTITION (ds='2010-01-01') select 100,100
> insert into table add_part_test PARTITION (ds='2010-02-01') select 200,200
> insert into table add_part_test PARTITION (ds='2010-03-01') select 400,300
> insert into table add_part_test PARTITION (ds='2010-04-01') select 500,400;
> {noformat}
> results in the following posthook output:
> {noformat}
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-01-01).key EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-01-01).value EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-02-01).key EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-02-01).value EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-03-01).key EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-03-01).value EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-04-01).key EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-04-01).value EXPRESSION []
> {noformat}
> These lines are not printed when running the command on the table located in 
> S3.
> If hive.blobstore.optimizations.enabled=false, the lineage information is 
> printed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
