[ 
https://issues.apache.org/jira/browse/RANGER-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261466#comment-16261466
 ] 

Ramesh Mani edited comment on RANGER-1837 at 11/21/17 8:54 PM:
---------------------------------------------------------------

!AuditDataFlow.png|thumbnail!


was (Author: rmani):
[~bosco]
The current implementation is:
# AuditFileCacheProvider: receives the audit logs and stores them on the 
local filesystem via AuditFileCacheProviderSpooler.
# AuditFileCacheProviderSpooler: has a thread that reads the local audit 
files in chunks (configured via the parameter 
"xasecure.audit.provider.filecache.filespool.buffer.size") and sends them to 
AsyncAuditQueue. This chunk becomes the batch size of the data that goes to 
the next point in this flow, in this case AsyncAuditQueue.
# AsyncAuditQueue: this is the existing queue, which uses an AuditBatchQueue 
for each configured destination. Here the queue size is one buffer, which can 
be configured via "xasecure.audit.destination.<destination>.batch.batch.size" 
(<destination> = hdfs/solr/etc.). I used the existing AsyncAuditQueue so that, 
in case of failures at the destination, it can fall back on its own spooling 
and forwarding mechanism.
# Finally, HDFSAuditDestination has a writer, which can write JSON or ORC 
files. When the writer is the ORC writer, it has a buffer size that determines 
the batch size of each ORC file that is created in HDFS or another destination.

So configuring these buffers determines the batch size when ORC files are 
created.
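
As a rough illustration of how the buffer sizes in the flow above could interact, here is a minimal Java sketch. It is not Ranger code: the class and method names are hypothetical, the numeric values are stand-ins for the configured properties, and it assumes that each stage simply re-batches whatever chunk it receives from the previous stage.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names are Ranger classes.
class AuditBatchSketch {

    // Splits events into consecutive batches of at most batchSize entries.
    static <T> List<List<T>> toBatches(List<T> events, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < events.size(); start += batchSize) {
            int end = Math.min(start + batchSize, events.size());
            batches.add(new ArrayList<>(events.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> spooled = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            spooled.add("audit-" + i);
        }
        // Stand-in values; real ones would come from the properties named above.
        int fileSpoolBufferSize = 4;   // assumed spooler chunk size
        int destinationBatchSize = 3;  // assumed per-destination batch size

        for (List<String> chunk : toBatches(spooled, fileSpoolBufferSize)) {
            // Assumption: the per-destination queue re-batches each spooler chunk.
            System.out.println("chunk " + chunk + " -> "
                    + toBatches(chunk, destinationBatchSize));
        }
    }
}
```

Under this assumption, a downstream batch can never be larger than the chunk handed over by the spooler, which is why the buffers need to be tuned together.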

I believe you want to eliminate AsyncAuditQueue from this flow and send 
directly to HDFSDestination / SOLR destination via an AuditFileQueue. If that 
is what you are proposing, then it is what I meant when I mentioned 
refactoring / introducing a new pipeline to handle this scenario. Please 
correct me if I am wrong about this.

I have one more request, related to the data flow rate to different 
destinations. Currently, if we store the data locally and then forward it, 
all destinations get the data at the same rate. For example, if the 
AuditFileCacheProvider file rollover time is 1 hr, each destination gets the 
data after 1 hr. Some users may want the SOLR destination to have the data 
more quickly than HDFS / S3. In that case we need to keep the existing 
pipeline for one or more destinations and use store-and-forward for the 
others. So this also needs refactoring to introduce a mechanism to pick 
queues for each destination or group of destinations. Please let me know 
about this.
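
One way to picture the per-destination queue selection requested above is the following Java sketch. It is not an existing Ranger API: `DestinationQueueSelector`, `QueueType`, and the destination names are all hypothetical and only show the idea of a default queue with per-destination overrides.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of Ranger.
class DestinationQueueSelector {

    // ASYNC_BATCH stands for the existing pipeline, STORE_AND_FORWARD for
    // the local file cache followed by forwarding.
    enum QueueType { ASYNC_BATCH, STORE_AND_FORWARD }

    private final Map<String, QueueType> overrides = new HashMap<>();
    private final QueueType defaultType;

    DestinationQueueSelector(QueueType defaultType) {
        this.defaultType = defaultType;
    }

    // Registers a queue type for one destination and returns this for chaining.
    DestinationQueueSelector route(String destination, QueueType type) {
        overrides.put(destination.toLowerCase(Locale.ROOT), type);
        return this;
    }

    // Returns the override for the destination, or the default if none is set.
    QueueType queueFor(String destination) {
        return overrides.getOrDefault(destination.toLowerCase(Locale.ROOT), defaultType);
    }

    public static void main(String[] args) {
        DestinationQueueSelector selector =
                new DestinationQueueSelector(QueueType.STORE_AND_FORWARD)
                        .route("solr", QueueType.ASYNC_BATCH);
        System.out.println("solr -> " + selector.queueFor("solr"));
        System.out.println("hdfs -> " + selector.queueFor("hdfs"));
    }
}
```

With a selector like this, SOLR could stay on the faster existing pipeline while HDFS / S3 go through store-and-forward; grouping destinations would only require registering the same queue type for each member of the group.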

> Enhance Ranger Audit to HDFS to support ORC file format
> -------------------------------------------------------
>
>                 Key: RANGER-1837
>                 URL: https://issues.apache.org/jira/browse/RANGER-1837
>             Project: Ranger
>          Issue Type: Improvement
>          Components: audit
>            Reporter: Kevin Risden
>            Assignee: Ramesh Mani
>         Attachments: 
> 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support-.patch, 
> 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support_001.patch, 
> AuditDataFlow.png
>
>
> My team has done some research and found that Ranger HDFS audits are:
> * Stored as JSON objects (one per line)
> * Not compressed
> This is currently very verbose and would benefit from compression since this 
> data is not frequently accessed. 
> From Bosco on the mailing list:
> {quote}You are right, currently one of the options is saving the audits in 
> HDFS itself as JSON files in one folder per day. I have loaded these JSON 
> files from the folder into Hive as compressed ORC format. The compressed 
> files in ORC were less than 10% of the original size. So, it was significant 
> decrease in size. Also, it is easier to run analytics on the Hive tables.
>  
> So, there are couple of ways of doing it.
>  
> * Write an Oozie job which runs every night and loads the previous day worth 
> audit logs into ORC or other format
> * Write a AuditDestination which can write into the format you want to.
>  
> Regardless which approach you take, this would be a good feature for 
> Ranger.{quote}
> http://mail-archives.apache.org/mod_mbox/ranger-user/201710.mbox/%3CCAJU9nmiYzzUUX1uDEysLAcMti4iLmX7RE%3DmN2%3DdoLaaQf87njQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
