[ https://issues.apache.org/jira/browse/RANGER-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259521#comment-16259521 ]

Don Bosco Durai edited comment on RANGER-1837 at 11/20/17 5:37 PM:
-------------------------------------------------------------------

bq. current implementation uses existing AuditQueue for the data pipe into 
destination. If this has to be avoided we need to have a one more major 
refactoring on the audit framework
I'm not sure I understood your concern. What are we currently using to write to 
HDFS in batches? Is it AuditFileCacheProvider + ???

The audit framework consists of Queues and Destinations. Ideally, we could just 
clone AuditBatchQueue and simplify it. E.g. the new class AuditFileQueue could 
receive events via log(AuditEventBase event) and append them to a local file, 
and at fixed intervals write that local file to the Destination (HDFS/S3). 
Similar to AuditFileSpool, we might have to keep track of the local files that 
still need to be written to the destination. 
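To make that concrete, here is a rough sketch of the idea. The class and method 
names below are assumptions for illustration only, not the actual Ranger API; in 
the real framework this would sit behind the existing queue/handler interfaces.

{code:java}
// Illustrative sketch only: log() appends each audit record to a local spool
// file, and roll() is called at a fixed interval so the completed file can be
// handed to the destination (HDFS/S3). All names here are hypothetical.
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class AuditFileQueueSketch {
    private final File spoolDir;
    private File currentFile;
    private PrintWriter writer;

    public AuditFileQueueSketch(File spoolDir) throws IOException {
        this.spoolDir = spoolDir;
        roll();
    }

    /** Stand-in for log(AuditEventBase event): append one JSON line locally. */
    public synchronized void log(String eventJson) {
        writer.println(eventJson);
    }

    /**
     * Called from a timer at the configured interval: finish the current file
     * and start a new one. The caller pushes the returned file to the
     * destination and, like AuditFileSpool, tracks files not yet delivered.
     */
    public synchronized File roll() throws IOException {
        File completed = currentFile;
        if (writer != null) {
            writer.close();
        }
        currentFile = new File(spoolDir, "audit-" + System.currentTimeMillis() + ".log");
        writer = new PrintWriter(new FileWriter(currentFile, true));
        return completed; // null on the very first call
    }
}
{code}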

And I think BaseAuditHandler should have another method, logFile(File 
localFile), which HDFSDestination can implement to just copy the file over, or 
convert it to ORC and write it to HDFS. The default implementation could read 
the file line by line and call logJSON().
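A minimal sketch of that default implementation, assuming it lives in 
BaseAuditHandler where the existing logJSON(String) entry point is available; 
logFile() itself is the proposed addition, not existing API:

{code:java}
// Proposed default behaviour (sketch): stream the local spool file line by
// line and push each JSON record through the existing logJSON() path.
// Destinations such as HDFSDestination would override this to copy the file
// as-is or convert it to ORC before writing to HDFS.
// (java.io.BufferedReader / FileReader imports assumed at class level.)
public boolean logFile(File localFile) {
    try (BufferedReader reader = new BufferedReader(new FileReader(localFile))) {
        String line;
        while ((line = reader.readLine()) != null) {
            if (!line.trim().isEmpty()) {
                logJSON(line); // one audit event per line, as written by the file queue
            }
        }
        return true;
    } catch (IOException e) {
        // A real implementation would use the framework's error handling/metrics.
        return false;
    }
}
{code}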


bq. Current framework provides 3 buffer size 
If possible, I would suggest using one of the current buffers. Looking at the 
current code, it seems AuditFileCacheProvider replaces AuditAsyncQueue at the 
top level, which could be a problem because we are now spooling to the file up 
front. The original design was that every destination has a queue backing it, 
so the queue can help regulate the output flow to the destination; the queue 
manages the buffer, not the destination. Even with the current implementation, 
I would suggest introducing something like AuditFileQueue and setting the file 
size and rollover time on it. This would let users pick a larger file size, 
e.g. a 1-hour interval, and write to the destination only after that hour.
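For illustration, the layering I have in mind would look roughly like this. 
AuditAsyncQueue and HDFSAuditDestination exist today; AuditFileQueue and its 
setters are hypothetical names for the proposed piece, and the constructor and 
setter signatures below are assumptions, not the final API.

{code:java}
// Hypothetical wiring, keeping the existing queue -> destination layering.
// The async queue stays at the top so the plugin thread never blocks; the
// file queue buffers to local disk and pushes to HDFS on its own schedule.
HDFSAuditDestination hdfsDestination = new HDFSAuditDestination();

AuditFileQueue fileQueue = new AuditFileQueue(hdfsDestination);  // proposed
fileQueue.setRolloverIntervalSeconds(3600);  // e.g. write one file per hour
fileQueue.setMaxFileSizeMB(256);             // or roll earlier on size

AuditAsyncQueue asyncQueue = new AuditAsyncQueue(fileQueue);
{code}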

bq. I see this article 
https://community.hortonworks.com/articles/75501/orc-creation-best-practices.html 
which has some details.
This is good information. It seems advanced users can reload the data using Hive 
at regular intervals to get a good distribution, the right sort order on audit 
dates, etc., and they can also analyze it for better runtime performance. We can 
add these to a best practices or recommendations section for this feature.

bq. I tested it for 1 hr data for hdfs plugin and having all three 10000 was 
fine. It didn't create multiple files for the amount, but this depends on the 
amount of hdfs activities. I need to check with KAFKA plugin
I am worried that if we do this with an in-memory buffer, we risk affecting the 
native component by hogging its memory.



> Enhance Ranger Audit to HDFS to support ORC file format
> -------------------------------------------------------
>
>                 Key: RANGER-1837
>                 URL: https://issues.apache.org/jira/browse/RANGER-1837
>             Project: Ranger
>          Issue Type: Improvement
>          Components: audit
>            Reporter: Kevin Risden
>            Assignee: Ramesh Mani
>         Attachments: 
> 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support-.patch, 
> 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support_001.patch
>
>
> My team has done some research and found that Ranger HDFS audits are:
> * Stored as JSON objects (one per line)
> * Not compressed
> This is currently very verbose and would benefit from compression since this 
> data is not frequently accessed. 
> From Bosco on the mailing list:
> {quote}You are right, currently one of the options is saving the audits in 
> HDFS itself as JSON files in one folder per day. I have loaded these JSON 
> files from the folder into Hive as compressed ORC format. The compressed 
> files in ORC were less than 10% of the original size. So, it was a significant 
> decrease in size. Also, it is easier to run analytics on the Hive tables.
>  
> So, there are a couple of ways of doing it:
>  
> Write an Oozie job which runs every night and loads the previous day's worth 
> of audit logs into ORC or another format.
> Write an AuditDestination which can write in the format you want.
>  
> Regardless of which approach you take, this would be a good feature for 
> Ranger.{quote}
> http://mail-archives.apache.org/mod_mbox/ranger-user/201710.mbox/%3CCAJU9nmiYzzUUX1uDEysLAcMti4iLmX7RE%3DmN2%3DdoLaaQf87njQ%40mail.gmail.com%3E



