[ https://issues.apache.org/jira/browse/RANGER-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257622#comment-16257622 ]

Ramesh Mani commented on RANGER-1837:
-------------------------------------

[~bosco][~risdenk] I have attached patch #2 for this after addressing the review 
comments; sorry for the delay, as I was busy with other commitments. 
[~bosco]
Regarding the review question about not having the buffer: the current 
implementation uses the existing AuditQueue as the data pipe into the 
destination. Avoiding it would require one more major refactoring of the audit 
framework:
1) A new Ranger Audit Pipeline that has no buffers/queues and supports multiple 
destinations. It should be able to handle the batches received from the 
sources.
2) This new Ranger Audit Pipeline should support a variable data flow rate per 
destination, i.e. audit to the Solr destination should be immediate (basically 
no store-and-forward), whereas audit to HDFS can run at a different rate based 
on the batch size / format etc. A rough interface sketch follows below.
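
To make the idea concrete, here is a purely hypothetical sketch of what such a 
bufferless pipeline could look like; nothing like this exists in Ranger today, 
and the RangerAuditPipeline name is illustrative only (it reuses the existing 
AuditDestination and AuditEventBase types):

{code:java}
import java.util.Collection;

import org.apache.ranger.audit.destination.AuditDestination;
import org.apache.ranger.audit.model.AuditEventBase;

// Hypothetical sketch only: a pipeline that hands batches from the source
// straight to each destination, with no intermediate queue or file spool.
public interface RangerAuditPipeline {
    // Register a destination; each one consumes at its own rate, e.g. Solr
    // flushes immediately while HDFS/ORC accumulates until its batch size
    // or rollover condition is reached.
    void addDestination(AuditDestination destination);

    // Deliver a batch received from the source synchronously to all
    // registered destinations; returns false if any destination fails.
    boolean logBatch(Collection<AuditEventBase> events);
}
{code}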

The current framework provides 3 buffer sizes:
1) xasecure.audit.provider.filecache.filespool.buffer.size=
    This determines the batch size to be read from the local file and sent to 
the Audit Queue. The default is 1000 lines, which we need to increase for the 
ORC file format. This is the batch size of the ORC file to be created, so it 
has to be configured according to the file spool size, which is determined by 
the file spool rollover time.
2) xasecure.audit.destination.hdfs.batch.batch.size=
   This is the audit queue batch size in this pipeline; this many records will 
be read from the queue and sent to the destination. The default is 1000 lines.
3) xasecure.audit.destination.hdfs.orc.buffersize=
  This is the ORCWriter buffer size, which holds the data before it is 
written. It changes dynamically based on the audit batch size coming from the 
source (see the sketch below).
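
For context, this is roughly how such a buffer size maps onto the ORC core 
writer API, where a VectorizedRowBatch accumulates rows in memory until it is 
flushed. This is a generic illustration of the ORC Java API, not the actual 
patch code; the path, schema, and row count are made up:

{code:java}
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcBufferExample {
    public static void main(String[] args) throws Exception {
        TypeDescription schema = TypeDescription.fromString("struct<event:string>");
        Writer writer = OrcFile.createWriter(new Path("/tmp/audit-example.orc"),
                OrcFile.writerOptions(new Configuration()).setSchema(schema));

        // The row batch buffers up to 10000 rows in memory before they are
        // written out, analogous to the orc.buffersize setting above.
        VectorizedRowBatch batch = schema.createRowBatch(10000);
        BytesColumnVector event = (BytesColumnVector) batch.cols[0];

        for (int i = 0; i < 25000; i++) {
            int row = batch.size++;
            event.setVal(row, ("audit event " + i).getBytes(StandardCharsets.UTF_8));
            if (batch.size == batch.getMaxSize()) {
                writer.addRowBatch(batch);  // flush a full buffer to the file
                batch.reset();
            }
        }
        if (batch.size > 0) {
            writer.addRowBatch(batch);      // flush the remaining rows
        }
        writer.close();
    }
}
{code}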
I tested it with 1 hour of data from the HDFS plugin, and setting all three to 
10000 was fine. It didn't create multiple files for that volume, but this 
depends on the amount of HDFS activity. I still need to check with the Kafka 
plugin. An example configuration is shown below.
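
For reference, the settings used in that test would look like this (values 
from the 1-hour run above; tune them to your file spool rollover time and 
traffic):

{code}
xasecure.audit.provider.filecache.filespool.buffer.size=10000
xasecure.audit.destination.hdfs.batch.batch.size=10000
xasecure.audit.destination.hdfs.orc.buffersize=10000
{code}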
Please let me know.

> Enhance Ranger Audit to HDFS to support ORC file format
> -------------------------------------------------------
>
>                 Key: RANGER-1837
>                 URL: https://issues.apache.org/jira/browse/RANGER-1837
>             Project: Ranger
>          Issue Type: Improvement
>          Components: audit
>            Reporter: Kevin Risden
>         Attachments: 
> 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support-.patch, 
> 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support_001.patch
>
>
> My team has done some research and found that Ranger HDFS audits are:
> * Stored as JSON objects (one per line)
> * Not compressed
> This is currently very verbose and would benefit from compression since this 
> data is not frequently accessed. 
> From Bosco on the mailing list:
> {quote}You are right, currently one of the options is saving the audits in 
> HDFS itself as JSON files in one folder per day. I have loaded these JSON 
> files from the folder into Hive as compressed ORC format. The compressed 
> files in ORC were less than 10% of the original size. So, it was a 
> significant decrease in size. Also, it is easier to run analytics on the 
> Hive tables.
>  
> So, there are a couple of ways of doing it:
>  
> Write an Oozie job which runs every night and loads the previous day's worth 
> of audit logs into ORC or another format.
> Write an AuditDestination which can write into the format you want.
>  
> Regardless of which approach you take, this would be a good feature for 
> Ranger.{quote}
> http://mail-archives.apache.org/mod_mbox/ranger-user/201710.mbox/%3CCAJU9nmiYzzUUX1uDEysLAcMti4iLmX7RE%3DmN2%3DdoLaaQf87njQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
