[jira] [Updated] (HIVE-21458) ACID: Optimize AcidUtils$MetaDataFile.isRawFormat

2019-03-15 Thread Vaibhav Gumashta (JIRA)


[ https://issues.apache.org/jira/browse/HIVE-21458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vaibhav Gumashta updated HIVE-21458:

Labels: Transactions-Performance  (was: )

> ACID: Optimize AcidUtils$MetaDataFile.isRawFormat 
> --
>
> Key: HIVE-21458
> URL: https://issues.apache.org/jira/browse/HIVE-21458
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.1.1
>Reporter: Vaibhav Gumashta
>Priority: Major
>  Labels: Transactions-Performance
> Attachments: async-prof-pid-1-cpu-1.svg
>
>
> In several places in the transactional subsystem we check whether a data file 
> has ROW__ID fields. Every time we do so (even within the same query), we open 
> a Reader for that file/split. We could optimize this by checking once and 
> caching the result for later use. We may also not need to perform this check 
> for every split. An example call stack:
> {code}
> OrcFile.createReader(Path, OrcFile$ReaderOptions) line: 105
> AcidUtils$MetaDataFile.isRawFormatFile(Path, FileSystem) line: 2026
> AcidUtils$MetaDataFile.isRawFormat(Path, FileSystem) line: 2022
> AcidUtils.parsedDelta(Path, String, FileSystem) line: 1007
> OrcRawRecordMerger$TransactionMetaData.findWriteIDForSynthetcRowIDs(Path, Path, Configuration) line: 1231
> OrcRawRecordMerger.discoverOriginalKeyBounds(Reader, int, Reader$Options, Configuration, OrcRawRecordMerger$Options) line: 722
> OrcRawRecordMerger.<init>(Configuration, boolean, Reader, boolean, int, ValidWriteIdList, Reader$Options, Path[], OrcRawRecordMerger$Options) line: 1022
> OrcInputFormat.getReader(InputSplit, Options) line: 2108
> OrcInputFormat.getRecordReader(InputSplit, JobConf, Reporter) line: 2006
> FetchOperator$FetchInputFormatSplit.getRecordReader(JobConf) line: 776
> FetchOperator.getRecordReader() line: 344
> FetchOperator.getNextRow() line: 540
> FetchOperator.pushRow() line: 509
> FetchTask.fetch(List) line: 146
> {code}
> Here, for each split we'll make that check.
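
A minimal sketch of the caching idea in the description above, assuming a query-scoped map keyed by file path. The class, method, and field names here are hypothetical, not existing Hive API; the actual per-file probe would delegate to the existing AcidUtils$MetaDataFile.isRawFormatFile(Path, FileSystem) check.
{code}
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical query-scoped cache: the expensive raw-format probe runs at most
 * once per file instead of once per split. Not the actual Hive implementation.
 */
public final class RawFormatCache {
  private final Map<Path, Boolean> cache = new ConcurrentHashMap<>();

  /** Returns the cached answer, probing the file only on the first call. */
  public boolean isRawFormat(Path file, FileSystem fs) throws IOException {
    Boolean cached = cache.get(file);
    if (cached != null) {
      return cached;
    }
    // Expensive part: opens an ORC Reader to read the file footer, e.g. by
    // delegating to the existing AcidUtils$MetaDataFile.isRawFormatFile check.
    boolean raw = probeFile(file, fs);
    cache.putIfAbsent(file, raw);
    return raw;
  }

  /** Placeholder for the per-file check; kept separate so it is easy to stub. */
  private boolean probeFile(Path file, FileSystem fs) throws IOException {
    return false; // stub
  }
}
{code}
Keying per file (rather than per split) already covers the "do we need this for every split" concern; keying per delta directory could coarsen it further.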





[jira] [Updated] (HIVE-21458) ACID: Optimize AcidUtils$MetaDataFile.isRawFormat

2019-03-15 Thread Prasanth Jayachandran (JIRA)


[ https://issues.apache.org/jira/browse/HIVE-21458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth Jayachandran updated HIVE-21458:
-
Attachment: async-prof-pid-1-cpu-1.svg

> ACID: Optimize AcidUtils$MetaDataFile.isRawFormat 
> --
>
> Key: HIVE-21458
> URL: https://issues.apache.org/jira/browse/HIVE-21458
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.1.1
>Reporter: Vaibhav Gumashta
>Assignee: Prasanth Jayachandran
>Priority: Major
> Attachments: async-prof-pid-1-cpu-1.svg
>





[jira] [Updated] (HIVE-21458) ACID: Optimize AcidUtils$MetaDataFile.isRawFormat

2019-03-15 Thread Vaibhav Gumashta (JIRA)


[ https://issues.apache.org/jira/browse/HIVE-21458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vaibhav Gumashta updated HIVE-21458:

Description: 
In several places in the transactional subsystem we check whether a data file 
has ROW__ID fields. Every time we do so (even within the same query), we open a 
Reader for that file/split. We could optimize this by checking once and caching 
the result for later use. We may also not need to perform this check for every 
split. An example call stack:
{code}
OrcFile.createReader(Path, OrcFile$ReaderOptions) line: 105
AcidUtils$MetaDataFile.isRawFormatFile(Path, FileSystem) line: 2026
AcidUtils$MetaDataFile.isRawFormat(Path, FileSystem) line: 2022
AcidUtils.parsedDelta(Path, String, FileSystem) line: 1007
OrcRawRecordMerger$TransactionMetaData.findWriteIDForSynthetcRowIDs(Path, Path, Configuration) line: 1231
OrcRawRecordMerger.discoverOriginalKeyBounds(Reader, int, Reader$Options, Configuration, OrcRawRecordMerger$Options) line: 722
OrcRawRecordMerger.<init>(Configuration, boolean, Reader, boolean, int, ValidWriteIdList, Reader$Options, Path[], OrcRawRecordMerger$Options) line: 1022
OrcInputFormat.getReader(InputSplit, Options) line: 2108
OrcInputFormat.getRecordReader(InputSplit, JobConf, Reporter) line: 2006
FetchOperator$FetchInputFormatSplit.getRecordReader(JobConf) line: 776
FetchOperator.getRecordReader() line: 344
FetchOperator.getNextRow() line: 540
FetchOperator.pushRow() line: 509
FetchTask.fetch(List) line: 146
{code}

Here, for each split we'll make that check.

  was:
In several places in the transactional subsystem we check whether a data file 
has ROW__ID fields. Every time we do so (even within the same query), we open a 
Reader for that file/split. We could optimize this by caching. We may also not 
need to perform this check for every split. An example call stack:
{code}
OrcFile.createReader(Path, OrcFile$ReaderOptions) line: 105
AcidUtils$MetaDataFile.isRawFormatFile(Path, FileSystem) line: 2026
AcidUtils$MetaDataFile.isRawFormat(Path, FileSystem) line: 2022
AcidUtils.parsedDelta(Path, String, FileSystem) line: 1007
OrcRawRecordMerger$TransactionMetaData.findWriteIDForSynthetcRowIDs(Path, Path, Configuration) line: 1231
OrcRawRecordMerger.discoverOriginalKeyBounds(Reader, int, Reader$Options, Configuration, OrcRawRecordMerger$Options) line: 722
OrcRawRecordMerger.<init>(Configuration, boolean, Reader, boolean, int, ValidWriteIdList, Reader$Options, Path[], OrcRawRecordMerger$Options) line: 1022
OrcInputFormat.getReader(InputSplit, Options) line: 2108
OrcInputFormat.getRecordReader(InputSplit, JobConf, Reporter) line: 2006
FetchOperator$FetchInputFormatSplit.getRecordReader(JobConf) line: 776
FetchOperator.getRecordReader() line: 344
FetchOperator.getNextRow() line: 540
FetchOperator.pushRow() line: 509
FetchTask.fetch(List) line: 146
{code}

Here, for each split we'll make that check.
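
For context, a rough illustration of what the per-file probe described above has to do: open an ORC reader (which parses the file footer) and inspect the top-level schema for the ACID wrapper columns. This is a sketch, not the Hive source; the class name and the exact schema test are assumptions, and resource cleanup is omitted.
{code}
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.TypeDescription;

/** Illustrative probe: "raw format" here means the file lacks the ACID wrapper columns. */
public final class RawFormatProbe {
  // Standard ACID row layout; treated as an assumption for this sketch.
  private static final List<String> ACID_COLUMNS = Arrays.asList(
      "operation", "originalTransaction", "bucket",
      "rowId", "currentTransaction", "row");

  public static boolean isRawFormat(Path file, Configuration conf) throws IOException {
    // Opening the reader parses the ORC footer -- this is the cost the issue
    // wants to pay once per file rather than once per split.
    Reader reader = OrcFile.createReader(file, OrcFile.readerOptions(conf));
    TypeDescription schema = reader.getSchema();
    return !schema.getFieldNames().containsAll(ACID_COLUMNS);
  }
}
{code}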


> ACID: Optimize AcidUtils$MetaDataFile.isRawFormat 
> --
>
> Key: HIVE-21458
> URL: https://issues.apache.org/jira/browse/HIVE-21458
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.1.1
>Reporter: Vaibhav Gumashta
>Priority: Major
>

[jira] [Updated] (HIVE-21458) ACID: Optimize AcidUtils$MetaDataFile.isRawFormat

2019-03-15 Thread Vaibhav Gumashta (JIRA)


[ https://issues.apache.org/jira/browse/HIVE-21458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vaibhav Gumashta updated HIVE-21458:

Summary: ACID: Optimize AcidUtils$MetaDataFile.isRawFormat   (was: ACID: 
Optimize AcidUtils$MetaDataFile.isRawFormat check by caching the split reader)

> ACID: Optimize AcidUtils$MetaDataFile.isRawFormat 
> --
>
> Key: HIVE-21458
> URL: https://issues.apache.org/jira/browse/HIVE-21458
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.1.1
>Reporter: Vaibhav Gumashta
>Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)