[jira] [Created] (HIVE-24316) Upgrade ORC from 1.5.6 to 1.5.8 in branch-3.1

2020-10-27 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created HIVE-24316:


 Summary: Upgrade ORC from 1.5.6 to 1.5.8 in branch-3.1
 Key: HIVE-24316
 URL: https://issues.apache.org/jira/browse/HIVE-24316
 Project: Hive
  Issue Type: Bug
  Components: ORC
Affects Versions: 3.1.3
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24315) Improve validation and semantic analysis in HPL/SQL

2020-10-27 Thread Attila Magyar (Jira)
Attila Magyar created HIVE-24315:


 Summary: Improve validation and semantic analysis in HPL/SQL 
 Key: HIVE-24315
 URL: https://issues.apache.org/jira/browse/HIVE-24315
 Project: Hive
  Issue Type: Improvement
  Components: hpl/sql
Reporter: Attila Magyar
Assignee: Attila Magyar


There are some known issues that need to be fixed. For example, the arity of a 
function is not checked when it is called, and the same is true for parameter 
types. Calling an undefined function evaluates to null, and in some cases 
incorrect syntax is silently ignored.

In cases like these a helpful error message would be expected, though we should 
also consider how PL/SQL behaves and maintain compatibility.
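The kind of call-site validation described above could look roughly like the following sketch. All names here (CallValidator, declaredArity, validateCall) are invented for illustration and are not HPL/SQL's actual internals; the point is that an undefined function or a wrong argument count should raise a clear error instead of silently evaluating to null.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of call-site validation during semantic analysis:
// reject calls to undefined functions and calls with the wrong arity,
// instead of evaluating them to null.
public class CallValidator {
    // function name -> declared parameter count
    private final Map<String, Integer> declaredArity;

    public CallValidator(Map<String, Integer> declaredArity) {
        this.declaredArity = declaredArity;
    }

    public void validateCall(String name, List<String> args) {
        Integer arity = declaredArity.get(name);
        if (arity == null) {
            // Today this case silently yields null; fail loudly instead.
            throw new IllegalArgumentException("Undefined function: " + name);
        }
        if (arity != args.size()) {
            throw new IllegalArgumentException("Function " + name
                + " expects " + arity + " argument(s), got " + args.size());
        }
    }

    public static void main(String[] args) {
        CallValidator v = new CallValidator(Map.of("add", 2));
        v.validateCall("add", List.of("1", "2")); // passes
        try {
            v.validateCall("add", List.of("1")); // wrong arity
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Parameter-type checking would follow the same pattern, comparing declared and actual types per position.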





Re: Hive SQL extension

2020-10-27 Thread Jesus Camacho Rodriguez
Hi Peter,

Thanks for bringing this up.

Why are we targeting the 'partition by spec' syntax? Is it for convenience?
Was it already introduced by Iceberg?

I did not understand the reasoning for not introducing the new syntax in
Hive. As it was already mentioned by Stamatis, there is some advantage to
doing syntax validation through the usual flow.
If the question is whether we could make it useful beyond Iceberg format
and there is no attachment to the syntax above, could this be generalized
to introducing generated virtual column declaration in Hive (many RDBMSs
support these) + using the current partitioning declaration? For instance,
for your DDL declaration above:

create table iceberg_test(
level string,
event_time timestamp,
message string,
register_time date,
telephone array 
)
partitioned by (
v_level [GENERATED ALWAYS] AS level,
v_event_time [GENERATED ALWAYS] AS event_time,
v_event_time_hour [GENERATED ALWAYS] AS hour(event_time),
v_register_time [GENERATED ALWAYS] AS day(register_time)
)
stored as iceberg;

This would assume that the underlying storage format supports partitioning
by virtual columns. I believe this syntax would allow us to take some of
these ideas even further, e.g., introduce 'stored' derived columns or
custom partitioning specs (although without the underlying storage format
support, probably they would need to be 'stored' instead of 'virtual').

Even if you introduce this syntax, you could still do the transformation
that you described above internally, i.e., storage handler resolution and
table properties generation. Thus, the internal handling would be the same.

Thanks,
Jesús

On Mon, Oct 26, 2020 at 2:27 AM Stamatis Zampetakis 
wrote:

> I do like extensions and things that simplify our life when
> writing queries.
>
> Regarding the partitioning syntax for Iceberg, there may be better
> alternatives.
> I was also leaning towards a syntax like the one proposed by Jesus (in
> another thread) based on virtual columns, which is also part of SQL
> standard.
>
> Regarding the other use cases mentioned (temporal queries, time travel
> etc.) there are things that are part of SQL standard so we could start from
> there and then introduce extensions if needed.
>
> Syntactic sugar is powerful but in terms of design I find it more
> appropriate to perform "desugaring" after having an AST; either AST to AST
> transformations or afterwards.
> The syntax (sugared or not) is part and responsibility of the parser so an
> architecture with sub-parser hooks seems a bit brittle, especially if we
> start using it extensively.
> Having said that you have thought of this much more than I did so maybe
> the hook's approach is a better idea after all :)
>
> Best,
> Stamatis
>
> On Fri, Oct 23, 2020 at 2:26 PM Pau Tallada  wrote:
>
>> Hi all,
>>
>> I do not know if that may be of interest to you, but there are other
>> projects that could benefit from this.
>> For instance, ADQL (Astronomical Data Query Language) is a SQL-like
>> language that defines some higher-level functions that enable powerful
>> geospatial queries. Projects like queryparser are able to translate from
>> ADQL to vendor-SQL for MySQL or PostgreSQL. In this case, the syntactic
>> sugar is implemented as an external layer on top, but could very well be
>> implemented in a rewrite hook if available.
>>
>> Cheers,
>>
>> Pau.
>>
>> Message from Peter Vary on Thu, 22 Oct 2020
>> at 16:21:
>>
>>>
>>> Let's assume that this feature would be useful for Iceberg tables, but
>>> useless and even problematic/forbidden for other tables. :)
>>>
>>> My thinking is that it could make Hive much more user friendly if we
>>> allowed for extensions to the language.
>>>
>>> With Iceberg integration we plan to do several extensions which might
>>> not be useful for other tables. Some examples:
>>>
>>>- When creating tables we want to send additional information to the
>>>storage layer, and pushing everything in properties is a pain (not really
>>>user friendly)
>>>- We would like to allow querying table history for iceberg tables
>>>(previous snapshotId-s, timestamps, etc)
>>>- We would like to allow time travel for iceberg tables based on the
>>>data queried above
>>>- We would like to allow the user to see / manage / remove old
>>>snapshots
>>>
>>>
>>> These are all very Iceberg-specific features, and most probably will not
>>> work or be useful for any other table type, so I think adding them to the
>>> Hive parser would be a stretch.
>>>
>>> On the other hand, if we do not provide a SQL interface for accessing
>>> these features, users will turn to Spark/Impala/Presto to work with
>>> Iceberg tables.
>>>
>>> As for your specific question for handling syntax errors (I 

[jira] [Created] (HIVE-24314) compactor.Cleaner should not set state "mark cleaned" if it didn't remove any files

2020-10-27 Thread Karen Coppage (Jira)
Karen Coppage created HIVE-24314:


 Summary: compactor.Cleaner should not set state "mark cleaned" if 
it didn't remove any files
 Key: HIVE-24314
 URL: https://issues.apache.org/jira/browse/HIVE-24314
 Project: Hive
  Issue Type: Bug
Reporter: Karen Coppage
Assignee: Karen Coppage








[jira] [Created] (HIVE-24313) Optimise stats collection for file sizes on cloud storage

2020-10-27 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-24313:
---

 Summary: Optimise stats collection for file sizes on cloud storage
 Key: HIVE-24313
 URL: https://issues.apache.org/jira/browse/HIVE-24313
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Reporter: Rajesh Balamohan


When stats information is not present (e.g., external tables), RelOptHiveTable 
computes basic stats at runtime.

The following is the code path:

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L598]
{code:java}
Statistics stats = StatsUtils.collectStatistics(hiveConf, partitionList,
    hiveTblMetadata, hiveNonPartitionCols, nonPartColNamesThatRqrStats,
    colStatsCached, nonPartColNamesThatRqrStats, true);
{code}
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L322]
{code:java}
for (Partition p : partList.getNotDeniedPartns()) {
  BasicStats basicStats = basicStatsFactory.build(Partish.buildFor(table, p));
  partStats.add(basicStats);
}
{code}
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStats.java#L205]

 
{code:java}
try {
  ds = getFileSizeForPath(path);
} catch (IOException e) {
  ds = 0L;
}
{code}
 

For a table and query with a large number of partitions, this takes a long time 
to compute statistics and increases compilation time. It would be good to fix 
it with a "ForkJoinPool", e.g. 
partList.getNotDeniedPartns().parallelStream().forEach(p -> ...).
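The suggested change could be sketched as below. This is a hedged illustration, not Hive's actual code: fileSizeForPath here is a stub standing in for the real BasicStats#getFileSizeForPath, which performs the remote filesystem I/O, and the class and method names are invented.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: replace the sequential per-partition loop with parallelStream(),
// so the file-size lookups (remote listing calls on cloud storage) overlap
// on the common ForkJoinPool instead of running one after another.
public class ParallelPartitionStats {

    // Stub for the per-partition I/O call; the real method lists the path
    // on the filesystem and sums file lengths.
    static long fileSizeForPath(String partitionPath) {
        return partitionPath.length();
    }

    static Map<String, Long> collectBasicStats(List<String> partitionPaths) {
        // toConcurrentMap is safe under parallel execution; no shared
        // mutable list as in the original sequential loop.
        return partitionPaths.parallelStream()
                .collect(Collectors.toConcurrentMap(
                        p -> p,
                        ParallelPartitionStats::fileSizeForPath));
    }

    public static void main(String[] args) {
        Map<String, Long> stats =
                collectBasicStats(List.of("part=a", "part=bb", "part=ccc"));
        System.out.println(stats.size() + " partitions measured");
    }
}
```

Since the work is I/O-bound, a dedicated ForkJoinPool (or executor) with a higher thread count than the common pool's default may be worth considering so compilation threads are not starved.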

 

 


