[jira] [Resolved] (IMPALA-9132) Explain statements should not cause NPE in LogLineageRecord()

2019-11-06 Thread Anurag Mantripragada (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anurag Mantripragada resolved IMPALA-9132.
--
Fix Version/s: Impala 3.4.0
   Resolution: Fixed

> Explain statements should not cause NPE in LogLineageRecord()
> -
>
> Key: IMPALA-9132
> URL: https://issues.apache.org/jira/browse/IMPALA-9132
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Anurag Mantripragada
>Assignee: Anurag Mantripragada
>Priority: Blocker
>  Labels: crash
> Fix For: Impala 3.4.0
>
>
> For DDLs, LogLineageRecord() adds certain fields to the lineageGraph in the 
> backend. However, explain statements do not have a catalogOpExecutor, 
> causing an NPE.
> We should, in general, avoid creating lineage records for Explain as Atlas 
> currently does not use them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-9132) Explain statements should not cause NPE in LogLineageRecord()

2019-11-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968787#comment-16968787
 ] 

ASF subversion and git services commented on IMPALA-9132:
-

Commit f49f8d8a32d12128eafb4c76632ca2908d22fa28 in impala's branch 
refs/heads/master from Anurag Mantripragada
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=f49f8d8 ]

IMPALA-9132: Explain statements should not cause nullptr in
LogLineageRecord()

For DDLs LogLineageRecord() adds certain fields in the backend before
flushing the lineage. It uses ddl_exec_response() to get these fields.
However, explain is a special kind of DDL which does not have an
associated catalog_op_executor_. This causes explain statements to
throw an NPE when ddl_exec_response() is called.

Currently, tools like Atlas do not track lineages for explain
statements. This change skips lineage logging for explain statements
and, more generally, adds a nullptr check for catalog_op_executor_.

Testing:
Added a test to verify lineage is not created for explain statements.

Change-Id: Iccc20fd5a80841c820ebeb4edffccebea30df76e
Reviewed-on: http://gerrit.cloudera.org:8080/14646
Reviewed-by: Tim Armstrong 
Tested-by: Impala Public Jenkins 
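The guard described in this commit can be sketched as follows. This is an illustrative Python model, not Impala's C++ code: `QueryState`, `log_lineage_record`, and the field names are hypothetical stand-ins for the ClientRequestState/CatalogOpExecutor machinery.

```python
from typing import Optional

class QueryState:
    """Hypothetical stand-in for Impala's query execution state."""
    def __init__(self, stmt_type: str,
                 catalog_op_executor: Optional[dict] = None):
        self.stmt_type = stmt_type
        self.catalog_op_executor = catalog_op_executor
        self.lineage_log = []

def log_lineage_record(state: QueryState) -> bool:
    """Sketch of the fixed DDL path; returns True if lineage was logged."""
    # Skip EXPLAIN statements outright: Atlas does not consume their lineage.
    if state.stmt_type == "EXPLAIN":
        return False
    # Defensive check, mirroring the nullptr check on catalog_op_executor_:
    # some DDL-like statements have no catalog op executor.
    if state.catalog_op_executor is None:
        return False
    # For real DDLs, fold fields from the DDL exec response into the lineage.
    state.lineage_log.append(state.catalog_op_executor)
    return True
```

The key property is that neither early return dereferences the missing executor, which is exactly the path that previously crashed.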


> Explain statements should not cause NPE in LogLineageRecord()
> -
>
> Key: IMPALA-9132
> URL: https://issues.apache.org/jira/browse/IMPALA-9132
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Anurag Mantripragada
>Assignee: Anurag Mantripragada
>Priority: Blocker
>  Labels: crash
>
> For DDLs, LogLineageRecord() adds certain fields to the lineageGraph in the 
> backend. However, explain statements do not have a catalogOpExecutor, 
> causing an NPE.
> We should, in general, avoid creating lineage records for Explain as Atlas 
> currently does not use them.






[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables

2019-11-06 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968698#comment-16968698
 ] 

Vinoth Chandar commented on IMPALA-8778:


> I think you need logic in Impala that understands slices and only uses the 
> latest slice when querying a partition.

+1. In Hive/Spark/Presto, we have the query call HoodieInputFormat to do this.

> Support read/write Apache Hudi tables
> -
>
> Key: IMPALA-8778
> URL: https://issues.apache.org/jira/browse/IMPALA-8778
> Project: IMPALA
>  Issue Type: New Feature
>Reporter: Yuanbin Cheng
>Assignee: Yanjia Gary Li
>Priority: Major
>
> Apache Impala currently does not support Apache Hudi; it cannot even pull 
> the table metadata from Hive.
> Related issues: 
> [https://github.com/apache/incubator-hudi/issues/179] 
> [https://issues.apache.org/jira/projects/HUDI/issues/HUDI-146|https://issues.apache.org/jira/projects/HUDI/issues/HUDI-146?filter=allopenissues]
>  






[jira] [Commented] (IMPALA-9127) Clean up probe-side state machine in hash join

2019-11-06 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968594#comment-16968594
 ] 

Tim Armstrong commented on IMPALA-9127:
---

Maybe the states would be something like:
* PROBING_HASH_PARTITIONS_NO_BATCH -> hash_partitions are valid and we are 
processing probe batches. We do not have a current probe batch.
* PROBING_HASH_PARTITIONS_IN_BATCH -> hash_partitions are valid and we are 
processing probe batches. We have a current probe batch.
* OUTPUTTING_UNMATCHED_ROWS -> we have some output rows to emit
* CURRENT_PROBE_COMPLETE -> finished processing the current input probe rows 
and any subsequent actions. There may be more spilled partitions.
* OUTPUTTING_NULL_PROBE_ROWS -> processing null_probe_rows_  for output
* OUTPUTTING_NULL_AWARE_PARTITION -> processing null_aware partition for output
* END -> finished processing all probe input, spilled partitions and any 
follow-up work.
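Made explicit, the proposal above might look like the following Python sketch. The transition table is illustrative only (a guess at plausible edges, not a claim about the final design); in the C++ implementation the assert would replace today's implicit condition checks.

```python
from enum import Enum, auto

class ProbeState(Enum):
    # State names follow the proposal above.
    PROBING_HASH_PARTITIONS_NO_BATCH = auto()
    PROBING_HASH_PARTITIONS_IN_BATCH = auto()
    OUTPUTTING_UNMATCHED_ROWS = auto()
    CURRENT_PROBE_COMPLETE = auto()
    OUTPUTTING_NULL_PROBE_ROWS = auto()
    OUTPUTTING_NULL_AWARE_PARTITION = auto()
    END = auto()

# Hypothetical legal transitions for the explicit state machine.
TRANSITIONS = {
    ProbeState.PROBING_HASH_PARTITIONS_NO_BATCH: {
        ProbeState.PROBING_HASH_PARTITIONS_IN_BATCH,
        ProbeState.CURRENT_PROBE_COMPLETE},
    ProbeState.PROBING_HASH_PARTITIONS_IN_BATCH: {
        ProbeState.PROBING_HASH_PARTITIONS_NO_BATCH,
        ProbeState.OUTPUTTING_UNMATCHED_ROWS},
    ProbeState.OUTPUTTING_UNMATCHED_ROWS: {
        ProbeState.CURRENT_PROBE_COMPLETE},
    ProbeState.CURRENT_PROBE_COMPLETE: {
        ProbeState.PROBING_HASH_PARTITIONS_NO_BATCH,  # next spilled partition
        ProbeState.OUTPUTTING_NULL_PROBE_ROWS,
        ProbeState.OUTPUTTING_NULL_AWARE_PARTITION,
        ProbeState.END},
    ProbeState.OUTPUTTING_NULL_PROBE_ROWS: {ProbeState.CURRENT_PROBE_COMPLETE},
    ProbeState.OUTPUTTING_NULL_AWARE_PARTITION: {
        ProbeState.CURRENT_PROBE_COMPLETE},
    ProbeState.END: set(),
}

def transition(cur: ProbeState, nxt: ProbeState) -> ProbeState:
    # Centralized check: every state change is validated and documented.
    assert nxt in TRANSITIONS[cur], f"illegal transition {cur} -> {nxt}"
    return nxt
```

The benefit is that each of the implicit conditions listed in the issue maps onto exactly one named state, so GetNext() can dispatch on the state instead of re-deriving it.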

> Clean up probe-side state machine in hash join
> --
>
> Key: IMPALA-9127
> URL: https://issues.apache.org/jira/browse/IMPALA-9127
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
>
> There's an implicit state machine in the main loop in  
> PartitionedHashJoinNode::GetNext() 
> https://github.com/apache/impala/blob/eea617b/be/src/exec/partitioned-hash-join-node.cc#L510
> The state is implicitly defined based on the following conditions:
> * !output_build_partitions_.empty() -> "outputting build rows after probing"
> * builder_->null_aware_partition() == NULL -> "eos, because the 
> null-aware partition is processed after all other partitions"
> * null_probe_output_idx_ >= 0 -> "null probe rows being processed"
> * output_null_aware_probe_rows_running_ -> "null-aware partition being 
> processed"
> * probe_batch_pos_ != -1 -> "processing probe batch"
> * builder_->num_hash_partitions() != 0 -> "have active hash partitions that 
> are being probed"
> * spilled_partitions_.empty() -> "no more spilled partitions"
> I think this would be a lot easier to follow if the state machine was 
> explicit and documented, and would make separating out the build side of a 
> spilling hash join easier to get right.






[jira] [Assigned] (IMPALA-9127) Clean up probe-side state machine in hash join

2019-11-06 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-9127:
-

Assignee: Tim Armstrong

> Clean up probe-side state machine in hash join
> --
>
> Key: IMPALA-9127
> URL: https://issues.apache.org/jira/browse/IMPALA-9127
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
>
> There's an implicit state machine in the main loop in  
> PartitionedHashJoinNode::GetNext() 
> https://github.com/apache/impala/blob/eea617b/be/src/exec/partitioned-hash-join-node.cc#L510
> The state is implicitly defined based on the following conditions:
> * !output_build_partitions_.empty() -> "outputting build rows after probing"
> * builder_->null_aware_partition() == NULL -> "eos, because the 
> null-aware partition is processed after all other partitions"
> * null_probe_output_idx_ >= 0 -> "null probe rows being processed"
> * output_null_aware_probe_rows_running_ -> "null-aware partition being 
> processed"
> * probe_batch_pos_ != -1 -> "processing probe batch"
> * builder_->num_hash_partitions() != 0 -> "have active hash partitions that 
> are being probed"
> * spilled_partitions_.empty() -> "no more spilled partitions"
> I think this would be a lot easier to follow if the state machine was 
> explicit and documented, and would make separating out the build side of a 
> spilling hash join easier to get right.






[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables

2019-11-06 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968581#comment-16968581
 ] 

Tim Armstrong commented on IMPALA-8778:
---

I don't see how you could implement reading from a Hudi table without changing 
Impala (or Hive for that matter). With the original Hive table layout, the 
contents of a partition are determined by listing a directory, and it looks 
like if you list the directory of a Hudi partition, you will get back 
duplicated data from multiple slices. I.e. I think you need logic in Impala 
that understands slices and only uses the latest slice when querying a 
partition.

The only way to add or remove an individual file to a classic Hive table 
(Impala/Hive tables are the same thing) is to add or remove it from the 
partition directory. 
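The slice-awareness described above could in principle look like the following sketch. The `fileid_committime.parquet` naming is a simplification of Hudi's actual file-naming scheme, used here only to show the idea of keeping the latest slice per file group.

```python
from typing import List

def latest_file_slices(partition_files: List[str]) -> List[str]:
    """Given a raw directory listing of a (hypothetical) Hudi partition,
    keep only the newest slice per file group, discarding older versions
    that a plain Hive-style listing would return as duplicates."""
    latest = {}
    for name in partition_files:
        # Hypothetical convention: <file_id>_<commit_time>.parquet
        file_id, commit_time = name.rsplit(".", 1)[0].split("_")
        if file_id not in latest or commit_time > latest[file_id][0]:
            latest[file_id] = (commit_time, name)
    return sorted(entry[1] for entry in latest.values())
```

This is the kind of filtering HoodieInputFormat performs for Hive/Spark/Presto; without an equivalent in Impala's scan-range planning, a query would read every slice.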

> Support read/write Apache Hudi tables
> -
>
> Key: IMPALA-8778
> URL: https://issues.apache.org/jira/browse/IMPALA-8778
> Project: IMPALA
>  Issue Type: New Feature
>Reporter: Yuanbin Cheng
>Assignee: Yanjia Gary Li
>Priority: Major
>
> Apache Impala currently does not support Apache Hudi; it cannot even pull 
> the table metadata from Hive.
> Related issues: 
> [https://github.com/apache/incubator-hudi/issues/179] 
> [https://issues.apache.org/jira/projects/HUDI/issues/HUDI-146|https://issues.apache.org/jira/projects/HUDI/issues/HUDI-146?filter=allopenissues]
>  






[jira] [Assigned] (IMPALA-8778) Support read/write Apache Hudi tables

2019-11-06 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-8778:
-

Assignee: Yanjia Gary Li  (was: Yuanbin Cheng)

> Support read/write Apache Hudi tables
> -
>
> Key: IMPALA-8778
> URL: https://issues.apache.org/jira/browse/IMPALA-8778
> Project: IMPALA
>  Issue Type: New Feature
>Reporter: Yuanbin Cheng
>Assignee: Yanjia Gary Li
>Priority: Major
>
> Apache Impala currently does not support Apache Hudi; it cannot even pull 
> the table metadata from Hive.
> Related issues: 
> [https://github.com/apache/incubator-hudi/issues/179] 
> [https://issues.apache.org/jira/projects/HUDI/issues/HUDI-146|https://issues.apache.org/jira/projects/HUDI/issues/HUDI-146?filter=allopenissues]
>  






[jira] [Updated] (IMPALA-8561) ScanRanges with mtime=-1 can lead to inconsistent reads when using the file handle cache

2019-11-06 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-8561:
--
Description: 
The file handle cache relies on the mtime to distinguish between different 
versions of a file. For example, if file X exists with mtime=1 and is then 
overwritten, with the metadata updated so that it is now at mtime=2, the file 
handle cache treats the two versions as completely different things and can 
never use a single file handle to serve both. However, some codepaths 
generate ScanRanges with an mtime of -1. This removes the ability to 
distinguish these two versions of a file and can lead to consistency problems.

A specific example is the code that reads the parquet footer 
[HdfsParquetScanner::ProcessFooter()|https://github.com/apache/impala/blob/832c9de7810b47b5f782bccb761e07264e7548e5/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1354].
 We don't know ahead of time how big the Parquet footer is. So, we read 100KB 
(determined by 
[FOOTER_SIZE|https://github.com/apache/impala/blob/449fe73d2145bd22f0f857623c3652a097f06d73/be/src/exec/hdfs-scanner.h#L331]).
 If the footer size encoded in the last few bytes of the file indicates that 
the footer is larger than that [code 
here|https://github.com/apache/impala/blob/832c9de7810b47b5f782bccb761e07264e7548e5/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1414],
 then we issue a separate read for the actual size of the footer. That separate 
read does not inherit the mtime of the original read and instead uses an mtime 
of -1. I verified this by adding tracing and issuing a select against 
functional_parquet.widetable_1000_cols.

A failure scenario associated with this is that we read the last 100KB using a 
ScanRange with mtime=2, then we find that the footer is larger than 100KB and 
issue a ScanRange with mtime=-1. This uses a file handle that is from a 
previous version of the file equivalent to mtime=1. The data it is reading may 
not come from the end of the file, or it may be at the end of the file but the 
footer has a different length. (There is no validation on the new read to check 
the magic value or metadata size reported by the new buffer.) Either would 
result in a failure to deserialize the thrift for the footer. For example, a 
problem case could produce an error message like:

 
{noformat}
File hdfs://test-warehouse/example_file.parq of length 1048576 bytes has 
invalid file metadata at file offset 462017. Error = couldn't deserialize 
thrift msg:
TProtocolException: Invalid data
.{noformat}
To fix this, we should examine all locations that can result in ScanRanges with 
mtime=-1 and eliminate any that we can. For example, the 
HdfsParquetScanner::ProcessFooter() code should create a ScanRange that 
inherits the mtime from the original footer ScanRange. Also, the file handle 
cache should refuse to cache file handles with mtime=-1.

The code in HdfsParquetScanner::ProcessFooter() should add validation for the 
magic value and metadata size when reading a footer larger than 100KB to verify 
that we are reading something valid. The thrift deserialize failure gives some 
information, but catching this case more specifically would provide a better 
error message.
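The extra validation suggested above could be sketched as follows (Python for illustration; the real check would live in the C++ scanner). It relies only on the standard Parquet trailer layout: a 4-byte little-endian metadata length followed by the "PAR1" magic.

```python
PARQUET_MAGIC = b"PAR1"

def validate_footer_buffer(buf: bytes, expected_metadata_len: int) -> bool:
    """Check the trailer of a re-read footer buffer before attempting
    thrift deserialization: the magic must match and the encoded metadata
    length must equal what the first 100KB read reported. A mismatch
    suggests a stale file handle served a different file version."""
    if len(buf) < 8 or buf[-4:] != PARQUET_MAGIC:
        return False
    encoded_len = int.from_bytes(buf[-8:-4], "little")
    return encoded_len == expected_metadata_len
```

A failed check could then produce a targeted "stale file handle / file changed" error instead of the generic TProtocolException shown above.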

h2. Workarounds
* This is most often caused by overwriting files in-place (e.g. INSERT 
OVERWRITE from Hive) without refreshing the metadata. You can avoid the issue 
by avoiding these in-place rewrites or by consistently running REFRESH in 
Impala after the modifications.
* Setting --max_cached_file_handles=0 in the impalad startup options can work 
around the issue, at the cost of performance.
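The proposed cache behavior reduces to keying handles on (path, mtime) and refusing mtime=-1 entirely. A minimal sketch, with hypothetical method names (the real cache also manages handle lifetimes and eviction):

```python
class FileHandleCache:
    """Sketch of the proposed fix: file handles are keyed by
    (path, mtime), and handles with an unknown mtime (-1) are
    never cached or served from the cache."""

    def __init__(self):
        self._cache = {}

    def put(self, path: str, mtime: int, handle: object) -> bool:
        if mtime == -1:
            return False  # refuse to cache unversioned handles
        self._cache[(path, mtime)] = handle
        return True

    def get(self, path: str, mtime: int):
        if mtime == -1:
            return None  # unknown version: never serve a cached handle
        return self._cache.get((path, mtime))
```

With this invariant, an mtime=-1 ScanRange always opens a fresh handle, so it can never observe a stale version of the file through the cache.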



 

  was:
The file handle cache relies on the mtime to distinguish between different 
versions of a file. For example, if file X exists with mtime=1 and is then 
overwritten, with the metadata updated so that it is now at mtime=2, the file 
handle cache treats the two versions as completely different things and can 
never use a single file handle to serve both. However, some codepaths 
generate ScanRanges with an mtime of -1. This removes the ability to 
distinguish these two versions of a file and can lead to consistency problems.

A specific example is the code that reads the parquet footer 
[HdfsParquetScanner::ProcessFooter()|https://github.com/apache/impala/blob/832c9de7810b47b5f782bccb761e07264e7548e5/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1354].
 We don't know ahead of time how big the Parquet footer is. So, we read 100KB 
(determined by 
[FOOTER_SIZE|https://github.com/apache/impala/blob/449fe73d2145bd22f0f857623c3652a097f06d73/be/src/exec/hdfs-scanner.h#L331]).
 If the footer size encoded in the last few bytes of the file indicates that 
the footer is larger than that [code 
here|https://github.com/apache/impala/blob/832c9de7810b47b5f782bccb761e07264e7548e5/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1414],
 then we 

[jira] [Updated] (IMPALA-8561) ScanRanges with mtime=-1 can lead to inconsistent reads when using the file handle cache

2019-11-06 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-8561:
--
Description: 
The file handle cache relies on the mtime to distinguish between different 
versions of a file. For example, if file X exists with mtime=1 and is then 
overwritten, with the metadata updated so that it is now at mtime=2, the file 
handle cache treats the two versions as completely different things and can 
never use a single file handle to serve both. However, some codepaths 
generate ScanRanges with an mtime of -1. This removes the ability to 
distinguish these two versions of a file and can lead to consistency problems.

A specific example is the code that reads the parquet footer 
[HdfsParquetScanner::ProcessFooter()|https://github.com/apache/impala/blob/832c9de7810b47b5f782bccb761e07264e7548e5/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1354].
 We don't know ahead of time how big the Parquet footer is. So, we read 100KB 
(determined by 
[FOOTER_SIZE|https://github.com/apache/impala/blob/449fe73d2145bd22f0f857623c3652a097f06d73/be/src/exec/hdfs-scanner.h#L331]).
 If the footer size encoded in the last few bytes of the file indicates that 
the footer is larger than that [code 
here|https://github.com/apache/impala/blob/832c9de7810b47b5f782bccb761e07264e7548e5/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1414],
 then we issue a separate read for the actual size of the footer. That separate 
read does not inherit the mtime of the original read and instead uses an mtime 
of -1. I verified this by adding tracing and issuing a select against 
functional_parquet.widetable_1000_cols.

A failure scenario associated with this is that we read the last 100KB using a 
ScanRange with mtime=2, then we find that the footer is larger than 100KB and 
issue a ScanRange with mtime=-1. This uses a file handle that is from a 
previous version of the file equivalent to mtime=1. The data it is reading may 
not come from the end of the file, or it may be at the end of the file but the 
footer has a different length. (There is no validation on the new read to check 
the magic value or metadata size reported by the new buffer.) Either would 
result in a failure to deserialize the thrift for the footer. For example, a 
problem case could produce an error message like:

 
{noformat}
File hdfs://test-warehouse/example_file.parq of length 1048576 bytes has 
invalid file metadata at file offset 462017. Error = couldn't deserialize 
thrift msg:
TProtocolException: Invalid data
.{noformat}
To fix this, we should examine all locations that can result in ScanRanges with 
mtime=-1 and eliminate any that we can. For example, the 
HdfsParquetScanner::ProcessFooter() code should create a ScanRange that 
inherits the mtime from the original footer ScanRange. Also, the file handle 
cache should refuse to cache file handles with mtime=-1.

The code in HdfsParquetScanner::ProcessFooter() should add validation for the 
magic value and metadata size when reading a footer larger than 100KB to verify 
that we are reading something valid. The thrift deserialize failure gives some 
information, but catching this case more specifically would provide a better 
error message.



 

  was:
The file handle cache relies on the mtime to distinguish between different 
versions of a file. For example, if file X exists with mtime=1 and is then 
overwritten, with the metadata updated so that it is now at mtime=2, the file 
handle cache treats the two versions as completely different things and can 
never use a single file handle to serve both. However, some codepaths 
generate ScanRanges with an mtime of -1. This removes the ability to 
distinguish these two versions of a file and can lead to consistency problems.

A specific example is the code that reads the parquet footer 
[HdfsParquetScanner::ProcessFooter()|https://github.com/apache/impala/blob/832c9de7810b47b5f782bccb761e07264e7548e5/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1354].
 We don't know ahead of time how big the Parquet footer is. So, we read 100KB 
(determined by 
[FOOTER_SIZE|https://github.com/apache/impala/blob/449fe73d2145bd22f0f857623c3652a097f06d73/be/src/exec/hdfs-scanner.h#L331]).
 If the footer size encoded in the last few bytes of the file indicates that 
the footer is larger than that [code 
here|https://github.com/apache/impala/blob/832c9de7810b47b5f782bccb761e07264e7548e5/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1414],
 then we issue a separate read for the actual size of the footer. That separate 
read does not inherit the mtime of the original read and instead uses an mtime 
of -1. I verified this by adding tracing and issuing a select against 
functional_parquet.widetable_1000_cols.

A failure scenario associated with this is that we read the last 100KB using a 
ScanRange with mtime=2, then we find that the footer is 

[jira] [Updated] (IMPALA-9132) Explain statements should not cause NPE in LogLineageRecord()

2019-11-06 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-9132:
--
Priority: Blocker  (was: Critical)

> Explain statements should not cause NPE in LogLineageRecord()
> -
>
> Key: IMPALA-9132
> URL: https://issues.apache.org/jira/browse/IMPALA-9132
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Anurag Mantripragada
>Assignee: Anurag Mantripragada
>Priority: Blocker
>
> For DDLs, LogLineageRecord() adds certain fields to the lineageGraph in the 
> backend. However, explain statements do not have a catalogOpExecutor, 
> causing an NPE.
> We should, in general, avoid creating lineage records for Explain as Atlas 
> currently does not use them.






[jira] [Updated] (IMPALA-9132) Explain statements should not cause NPE in LogLineageRecord()

2019-11-06 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-9132:
--
Labels: crash  (was: )

> Explain statements should not cause NPE in LogLineageRecord()
> -
>
> Key: IMPALA-9132
> URL: https://issues.apache.org/jira/browse/IMPALA-9132
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Anurag Mantripragada
>Assignee: Anurag Mantripragada
>Priority: Blocker
>  Labels: crash
>
> For DDLs, LogLineageRecord() adds certain fields to the lineageGraph in the 
> backend. However, explain statements do not have a catalogOpExecutor, 
> causing an NPE.
> We should, in general, avoid creating lineage records for Explain as Atlas 
> currently does not use them.






[jira] [Updated] (IMPALA-9132) Explain statements should not cause NPE in LogLineageRecord()

2019-11-06 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-9132:
--
Component/s: Backend

> Explain statements should not cause NPE in LogLineageRecord()
> -
>
> Key: IMPALA-9132
> URL: https://issues.apache.org/jira/browse/IMPALA-9132
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Anurag Mantripragada
>Assignee: Anurag Mantripragada
>Priority: Blocker
>  Labels: crash
>
> For DDLs, LogLineageRecord() adds certain fields to the lineageGraph in the 
> backend. However, explain statements do not have a catalogOpExecutor, 
> causing an NPE.
> We should, in general, avoid creating lineage records for Explain as Atlas 
> currently does not use them.






[jira] [Resolved] (IMPALA-7860) Tests use partition name that isn't supported on ABFS

2019-11-06 Thread Sahil Takiar (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-7860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Takiar resolved IMPALA-7860.
--
Fix Version/s: Impala 3.4.0
   Resolution: Fixed

Closing this as Fixed. HADOOP-15860 was done a while ago, and Impala-on-ABFS 
now cannot write files / directories that end with a period (which is 
expected). There was one bug introduced to Impala due to this change: 
IMPALA-8557 - but that has been fixed now as well. In IMPALA-9117 I created a 
new skip flag for ABFS tests for the "cannot write trailing periods" 
behavior, and added it to the affected tests.

> Tests use partition name that isn't supported on ABFS
> -
>
> Key: IMPALA-7860
> URL: https://issues.apache.org/jira/browse/IMPALA-7860
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Reporter: Sean Mackrory
>Priority: Major
> Fix For: Impala 3.4.0
>
>
> IMPALA-7681 introduced support for the ADLS Gen2 service / ABFS client. As 
> mentioned in the code review for that 
> (https://gerrit.cloudera.org/#/c/11630/) a couple of tests were failing 
> because they use a partition name that ends with a period. If the tests are 
> modified to end with anything other than a period, they work just fine.
> In HADOOP-15860, that's sounding like it's just a known limitation of the 
> blob storage that shares infrastructure with ADLS Gen2 that won't be changing 
> any time soon. I propose we modify the tests to just use a slightly different 
> partition name.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Resolved] (IMPALA-7726) Drop with purge tests fail against ABFS due to trash misbehavior

2019-11-06 Thread Sahil Takiar (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-7726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Takiar resolved IMPALA-7726.
--
Fix Version/s: Impala 3.4.0
   Resolution: Fixed

Closing as Fixed. I re-enabled the tests, looped them overnight, and didn't hit 
any failures. So it is likely whatever bug was causing these issues has been 
resolved.

> Drop with purge tests fail against ABFS due to trash misbehavior
> 
>
> Key: IMPALA-7726
> URL: https://issues.apache.org/jira/browse/IMPALA-7726
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Sean Mackrory
>Assignee: Sean Mackrory
>Priority: Major
>  Labels: flaky
> Fix For: Impala 3.4.0
>
>
> In testing IMPALA-7681, I've seen test_drop_partition_with_purge and 
> test_drop_table_with_purge fail because files are not found in the trash 
> after a drop without purge. I've traced that functionality through Hive, which uses 
> Hadoop's Trash API, and traced through a bunch of scenarios in that API with 
> ABFS and I can't see it misbehaving in any way. It also should be pretty 
> FS-agnostic. I also suspected a bug in abfs_utils.py's exists() function, but 
> have not been able to find one.






[jira] [Resolved] (IMPALA-8557) Impala on ABFS failed with error "IllegalArgumentException: ABFS does not allow files or directories to end with a dot."

2019-11-06 Thread Sahil Takiar (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-8557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Takiar resolved IMPALA-8557.
--
Fix Version/s: Impala 3.4.0
   Resolution: Fixed

> Impala on ABFS failed with error "IllegalArgumentException: ABFS does not 
> allow files or directories to end with a dot."
> 
>
> Key: IMPALA-8557
> URL: https://issues.apache.org/jira/browse/IMPALA-8557
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 3.2.0
>Reporter: Eric Lin
>Assignee: Sahil Takiar
>Priority: Major
> Fix For: Impala 3.4.0
>
>
> Hadoop introduced the feature below to stop users from creating files that 
> end with "." on ABFS:
> https://issues.apache.org/jira/browse/HADOOP-15860
> As a result of this change, Impala writes to ABFS now fail with this error.
> I can see that it generates the temp file using the format "$0.$1.$2":
> https://github.com/cloudera/Impala/blob/cdh6.2.0/be/src/exec/hdfs-table-sink.cc#L329
> $2 is the file extension and will be empty for the TEXT file format:
> https://github.com/cloudera/Impala/blob/cdh6.2.0/be/src/exec/hdfs-text-table-writer.cc#L65
> Since HADOOP-15860 was backported into CDH 6.2, this currently affects only 
> 6.2 and works in older versions.
> There is no way to override the empty file extension, so no workaround is 
> possible unless the user chooses another file format.
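The failure mode above can be reproduced in a few lines. The template below mirrors the "$0.$1.$2" format string referenced in the issue; the guard on the last line is one hypothetical fix for illustration, not the actual change that was made.

```python
def hdfs_temp_file_name(prefix: str, query_id: str, extension: str) -> str:
    """Mirror of the "$0.$1.$2" temp-file template. With an empty
    extension (text file format) the raw name ends in a dot, which
    ABFS rejects (HADOOP-15860)."""
    raw = f"{prefix}.{query_id}.{extension}"
    # Hypothetical guard: drop the trailing separator when there is no
    # extension, so the name remains valid on ABFS.
    return raw[:-1] if extension == "" else raw
```

This makes the root cause concrete: only formats with an empty extension (text) hit the trailing-dot rejection, which is why switching file formats was the only workaround.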






[jira] [Commented] (IMPALA-9117) test_lineage.py and test_mt_dop.py are failing on ABFS

2019-11-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968499#comment-16968499
 ] 

ASF subversion and git services commented on IMPALA-9117:
-

Commit e8fda1f224d3ad237183a53e238eee90188d82e2 in impala's branch 
refs/heads/master from Sahil Takiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=e8fda1f ]

IMPALA-9117, IMPALA-7726: Fixed a few unit tests for ABFS

This change makes the following changes / fixes when running Impala tests
on ABFS:
* Skips some tests in test_lineage.py that don't work on ABFS / ADLS
(they were already skipped for S3)
* Skips some tests in test_mt_dop.py; the test creates a directory that
ends with a period (and ABFS does not support writing files or
directories that end with a period)
* Removes the ABFS skip flag SkipIfABFS.trash ("IMPALA-7726: Drop with
purge tests fail against ABFS due to trash misbehavior"); I removed
these flags and looped the tests overnight with no failures, so it is
likely whatever bug was causing this has now been fixed
* Now that HADOOP-15860 has been resolved, and the agreed upon behavior
for ABFS is that it will fail if a client tries to write a file /
directory that ends with a period, I added a new entry to the SkipIfABFS
class called file_or_folder_name_ends_with_period and applied it where
necessary
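The skip-flag pattern described above can be modeled as a simple predicate keyed off the target filesystem. This is an illustrative sketch only: the environment-variable name and class shape are assumptions, and Impala's real test infrastructure attaches such predicates to tests via pytest skip markers.

```python
import os

class SkipIfABFS:
    """Illustrative model of a filesystem-specific skip flag; not the
    actual Impala test-infrastructure class."""

    @staticmethod
    def file_or_folder_name_ends_with_period() -> bool:
        # Skip when the target filesystem is ABFS, which rejects names
        # ending in a period (HADOOP-15860).
        return os.environ.get("TARGET_FILESYSTEM", "hdfs") == "abfs"
```

Centralizing the predicate this way means new ABFS limitations get one named flag that can be applied to every affected test, rather than ad hoc skips.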

Testing:
* Ran core tests on ABFS

Change-Id: I18ae5b0f7de6aa7628a1efd780ff30a0cc3c5285
Reviewed-on: http://gerrit.cloudera.org:8080/14636
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> test_lineage.py and test_mt_dop.py are failing on ABFS
> --
>
> Key: IMPALA-9117
> URL: https://issues.apache.org/jira/browse/IMPALA-9117
> Project: IMPALA
>  Issue Type: Test
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
>
> Both failures are known issues.
> {{TestLineage::test_lineage_output}} is failing because the test requires 
> HBase to run (this test is already disabled for S3).
> {{TestMtDopFlags::test_mt_dop_all}} is failing because it runs 
> {{QueryTest/insert}} which includes a query that writes a folder that ends in 
> a dot. ABFS does not allow files or directories to end in a dot - IMPALA-7860 
> / IMPALA-7681 / HADOOP-15860.






[jira] [Commented] (IMPALA-7726) Drop with purge tests fail against ABFS due to trash misbehavior

2019-11-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-7726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968501#comment-16968501
 ] 

ASF subversion and git services commented on IMPALA-7726:
-

Commit e8fda1f224d3ad237183a53e238eee90188d82e2 in impala's branch 
refs/heads/master from Sahil Takiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=e8fda1f ]

IMPALA-9117, IMPALA-7726: Fixed a few unit tests for ABFS

This patch makes the following changes / fixes when running Impala tests
on ABFS:
* Skips some tests in test_lineage.py that don't work on ABFS / ADLS
(they were already skipped for S3)
* Skips some tests in test_mt_dop.py; the test creates a directory that
ends with a period (and ABFS does not support writing files or
directories that end with a period)
* Removes the ABFS skip flag SkipIfABFS.trash ("IMPALA-7726: Drop with
purge tests fail against ABFS due to trash misbehavior"); I removed
these flags and looped the tests overnight with no failures, so it is
likely whatever bug was causing this has now been fixed
* Now that HADOOP-15860 has been resolved, and the agreed upon behavior
for ABFS is that it will fail if a client tries to write a file /
directory that ends with a period, I added a new entry to the SkipIfABFS
class called file_or_folder_name_ends_with_period and applied it where
necessary

Testing:
* Ran core tests on ABFS

Change-Id: I18ae5b0f7de6aa7628a1efd780ff30a0cc3c5285
Reviewed-on: http://gerrit.cloudera.org:8080/14636
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Drop with purge tests fail against ABFS due to trash misbehavior
> 
>
> Key: IMPALA-7726
> URL: https://issues.apache.org/jira/browse/IMPALA-7726
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Sean Mackrory
>Assignee: Sean Mackrory
>Priority: Major
>  Labels: flaky
>
> In testing IMPALA-7681, I've seen test_drop_partition_with_purge and 
> test_drop_table_with_purge fail because files are not found in the trash 
> after a drop without purge. I've traced that functionality through Hive, which uses 
> Hadoop's Trash API, and traced through a bunch of scenarios in that API with 
> ABFS and I can't see it misbehaving in any way. It also should be pretty 
> FS-agnostic. I also suspected a bug in abfs_utils.py's exists() function, but 
> have not been able to find one.






[jira] [Commented] (IMPALA-8557) Impala on ABFS failed with error "IllegalArgumentException: ABFS does not allow files or directories to end with a dot."

2019-11-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-8557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968502#comment-16968502
 ] 

ASF subversion and git services commented on IMPALA-8557:
-

Commit 8b8a49e617818e9bcf99b784b63587c95cebd622 in impala's branch 
refs/heads/master from Sahil Takiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=8b8a49e ]

IMPALA-8557: Add '.txt' to text files, remove '.' at end of filenames

Writes to text tables on ABFS are failing because HADOOP-15860 recently
changed the ABFS behavior when writing files / folders that end with a
'.'. ABFS explicitly does not allow files / folders that end with a dot.
From the ABFS docs: "Avoid blob names that end with a dot (.), a forward
slash (/), or a sequence or combination of the two."

The behavior prior to HADOOP-15860 was to simply drop any trailing dots
when writing files or folders, but that can lead to various issues
because clients may try to read back a file that should exist on ABFS,
but doesn't. HADOOP-15860 changed the behavior so that any attempt to
write a file or folder with a trailing dot fails on ABFS.

Impala writes all text files with a trailing dot due to some odd
behavior in hdfs-table-sink.cc. The table sink writes files with
a "file extension" which is dependent on the file type. For example,
Parquet files have a file extension of ".parq". For some reason, text
files had no file extension, so Impala would try to write text files of
the following form:
"244c5ee8ece6f759-8b1a1e3b_45513034_data.0.".
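
The "$0.$1.$2" naming scheme described above can be sketched as follows. This is a Python illustration of the bug and the fix, not the actual C++ code in hdfs-table-sink.cc; the function names are invented:

```python
def filename_naive(prefix, seq, ext):
    # Mirrors the "$0.$1.$2" substitution: the separator dots are always
    # emitted, so an empty extension leaves a trailing '.'.
    return "%s.%s.%s" % (prefix, seq, ext)

def filename_fixed(prefix, seq, ext):
    # Only append the extension separator when an extension exists.
    name = "%s.%s" % (prefix, seq)
    return name + "." + ext if ext else name

print(filename_naive("244c5ee8ece6f759-8b1a1e3b_45513034_data", 0, ""))
# -> 244c5ee8ece6f759-8b1a1e3b_45513034_data.0.   (rejected by ABFS)
print(filename_fixed("244c5ee8ece6f759-8b1a1e3b_45513034_data", 0, "txt"))
# -> 244c5ee8ece6f759-8b1a1e3b_45513034_data.0.txt
```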

Several tables created during dataload, such as alltypes, already use
the '.txt' extension for their files. These tables are not created via
Impala's INSERT code path, they are copied into the table. However,
there are several tables created during dataload, such as
alltypesinsert, that are created via Impala. This patch will change
the files in these tables so that they end in '.txt'.

This patch adds the ".txt" extension to all written text files and
modifies the hdfs-table-sink.cc so that it doesn't add a trailing dot to
a filename if there is no file extension.

Testing:
* Ran core tests
* Re-ran affected ABFS tests
* Added test to validate that the correct file extension is used for
Parquet and text tables
* Manually validated that without the addition of the '.txt' file
extension, files are not written with a trailing dot

Change-Id: I2a9adacd45855cde86724e10f8a131e17ebf46f8
Reviewed-on: http://gerrit.cloudera.org:8080/14621
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Impala on ABFS failed with error "IllegalArgumentException: ABFS does not 
> allow files or directories to end with a dot."
> 
>
> Key: IMPALA-8557
> URL: https://issues.apache.org/jira/browse/IMPALA-8557
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 3.2.0
>Reporter: Eric Lin
>Assignee: Sahil Takiar
>Priority: Major
>
> HDFS introduced below feature to stop users from creating a file that ends 
> with "." on ABFS:
> https://issues.apache.org/jira/browse/HADOOP-15860
> As a result of this change, Impala writes to ABFS now fail with this error.
> I can see that it generates temp file using this format "$0.$1.$2":
> https://github.com/cloudera/Impala/blob/cdh6.2.0/be/src/exec/hdfs-table-sink.cc#L329
> $2 is the file extension and will be empty if it is TEXT file format:
> https://github.com/cloudera/Impala/blob/cdh6.2.0/be/src/exec/hdfs-text-table-writer.cc#L65
> Since HADOOP-15860 was backported into CDH6.2, it currently affects only 
> 6.2; older versions work.
> There is no way to override this empty file extension, so no workaround is 
> possible unless the user chooses another file format.






[jira] [Commented] (IMPALA-7726) Drop with purge tests fail against ABFS due to trash misbehavior

2019-11-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-7726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968500#comment-16968500
 ] 

ASF subversion and git services commented on IMPALA-7726:
-

Commit e8fda1f224d3ad237183a53e238eee90188d82e2 in impala's branch 
refs/heads/master from Sahil Takiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=e8fda1f ]

IMPALA-9117, IMPALA-7726: Fixed a few unit tests for ABFS

This patch makes the following changes / fixes when running Impala tests
on ABFS:
* Skips some tests in test_lineage.py that don't work on ABFS / ADLS
(they were already skipped for S3)
* Skips some tests in test_mt_dop.py; the test creates a directory that
ends with a period (and ABFS does not support writing files or
directories that end with a period)
* Removes the ABFS skip flag SkipIfABFS.trash ("IMPALA-7726: Drop with
purge tests fail against ABFS due to trash misbehavior"); I removed
these flags and looped the tests overnight with no failures, so it is
likely whatever bug was causing this has now been fixed
* Now that HADOOP-15860 has been resolved, and the agreed upon behavior
for ABFS is that it will fail if a client tries to write a file /
directory that ends with a period, I added a new entry to the SkipIfABFS
class called file_or_folder_name_ends_with_period and applied it where
necessary

Testing:
* Ran core tests on ABFS

Change-Id: I18ae5b0f7de6aa7628a1efd780ff30a0cc3c5285
Reviewed-on: http://gerrit.cloudera.org:8080/14636
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Drop with purge tests fail against ABFS due to trash misbehavior
> 
>
> Key: IMPALA-7726
> URL: https://issues.apache.org/jira/browse/IMPALA-7726
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Sean Mackrory
>Assignee: Sean Mackrory
>Priority: Major
>  Labels: flaky
>
> In testing IMPALA-7681, I've seen test_drop_partition_with_purge and 
> test_drop_table_with_purge fail because files are not found in the trash 
> after a drop without purge. I've traced that functionality through Hive, which uses 
> Hadoop's Trash API, and traced through a bunch of scenarios in that API with 
> ABFS and I can't see it misbehaving in any way. It also should be pretty 
> FS-agnostic. I also suspected a bug in abfs_utils.py's exists() function, but 
> have not been able to find one.






[jira] [Created] (IMPALA-9132) Explain statements should not cause NPE in LogLineageRecord()

2019-11-06 Thread Anurag Mantripragada (Jira)
Anurag Mantripragada created IMPALA-9132:


 Summary: Explain statements should not cause NPE in 
LogLineageRecord()
 Key: IMPALA-9132
 URL: https://issues.apache.org/jira/browse/IMPALA-9132
 Project: IMPALA
  Issue Type: Bug
Reporter: Anurag Mantripragada
Assignee: Anurag Mantripragada


For DDLs, LogLineageRecord() adds certain fields to the lineageGraph in the 
backend. However, explain statements do not have a catalogOpExecutor, causing an 
NPE.

We should, in general, avoid creating lineage records for Explain as Atlas 
currently does not use them.






[jira] [Created] (IMPALA-9131) Use single quotes when printing out FORMAT clause within CAST.

2019-11-06 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-9131:


 Summary: Use single quotes when printing out FORMAT clause within 
CAST.
 Key: IMPALA-9131
 URL: https://issues.apache.org/jira/browse/IMPALA-9131
 Project: IMPALA
  Issue Type: Bug
  Components: Frontend
Affects Versions: Impala 3.3.0
Reporter: Gabor Kaszab
Assignee: Gabor Kaszab


Here the content of the FORMAT clause is surrounded by double quotes. 
{code:java}
select cast('2016/10/10' as date format 'YYYY/MM/DD');
+------------------------------------------------+
| cast('2016/10/10' as date format "yyyy/mm/dd") |
+------------------------------------------------+
| 2016-10-10                                     |
+------------------------------------------------+
{code}

In order to follow the SQL standard, this should be surrounded by single quotes 
regardless of how the user quoted the FORMAT clause.
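
The intended rendering can be sketched as a tiny helper that always emits single quotes. This is a hypothetical illustration; the real fix belongs in Impala's frontend toSql() code, and `format_clause_to_sql` is an invented name:

```python
def format_clause_to_sql(fmt):
    """Render a FORMAT clause using single quotes, per the SQL standard,
    no matter which quote style the user originally typed."""
    # Escape any embedded single quotes by doubling them.
    return "FORMAT '%s'" % fmt.replace("'", "''")

print(format_clause_to_sql("YYYY/MM/DD"))  # FORMAT 'YYYY/MM/DD'
```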






[jira] [Commented] (IMPALA-9130) Upgrade external non-ACID table to ACID from Impala

2019-11-06 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968324#comment-16968324
 ] 

Gabor Kaszab commented on IMPALA-9130:
--

[~csringhofer]

> Upgrade external non-ACID table to ACID from Impala
> ---
>
> Key: IMPALA-9130
> URL: https://issues.apache.org/jira/browse/IMPALA-9130
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog, Frontend
>Affects Versions: Impala 3.3.0
>Reporter: Gabor Kaszab
>Priority: Major
>  Labels: impala-acid
>
> If you have an external, non-ACID table and try to upgrade it to become an 
> ACID table, you get an error message that an external table is not allowed to 
> be promoted to ACID. This is fine; however, if in the very same statement you 
> set 'EXTERNAL'='FALSE' in the table properties, you still get the same error, 
> while Hive is able to execute it.
> Steps to repro:
> 1) Create a non-ACID external table (or a plain non-ACID table if you use a 
> Hive that contains HIVE-22158)
> 2) Upgrade the table
> {code:java}
> alter table tbl set tblproperties ('transactional'='true', 
> 'transactional_properties'='insert_only', 'EXTERNAL'='FALSE');
> {code}
> Step 2) fails in Impala but succeeds in Hive






[jira] [Created] (IMPALA-9130) Upgrade external non-ACID table to ACID from Impala

2019-11-06 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-9130:


 Summary: Upgrade external non-ACID table to ACID from Impala
 Key: IMPALA-9130
 URL: https://issues.apache.org/jira/browse/IMPALA-9130
 Project: IMPALA
  Issue Type: Bug
  Components: Catalog, Frontend
Affects Versions: Impala 3.3.0
Reporter: Gabor Kaszab


If you have an external, non-ACID table and try to upgrade it to become an ACID 
table, you get an error message that an external table is not allowed to be 
promoted to ACID. This is fine; however, if in the very same statement you set 
'EXTERNAL'='FALSE' in the table properties, you still get the same error, while 
Hive is able to execute it.

Steps to repro:
1) Create a non-ACID external table (or a plain non-ACID table if you use a 
Hive that contains HIVE-22158)
2) Upgrade the table
{code:java}
alter table tbl set tblproperties ('transactional'='true', 
'transactional_properties'='insert_only', 'EXTERNAL'='FALSE');
{code}

Step 2) fails in Impala but succeeds in Hive




