this is a very common pattern, yes.

note that in Netflix's case, they're currently pushing all of their logs to
a Fronting Kafka + Samza Router which can route to S3 (or HDFS),
Elasticsearch, and/or another Kafka topic for further consumption by
internal apps using other technologies like Spark Streaming (instead of
Samza).
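
as a rough sketch of what one of those internal apps can look like on the
Spark Streaming side, here's a minimal direct-Kafka consumer (Spark 1.6-era
API; the broker address and topic name are placeholders i made up):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object LogStreamConsumer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("log-stream-consumer")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // placeholder broker list and topic -- substitute your own
    val kafkaParams = Map("metadata.broker.list" -> "kafka-broker:9092")
    val topics      = Set("app-logs")

    // receiver-less (direct) stream of (key, value) pairs from Kafka
    val logs = KafkaUtils
      .createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topics)
      .map(_._2)

    // real parsing/aggregation would go here; this just counts per batch
    logs.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}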

this Fronting Kafka + Samza Router also helps to differentiate between
high-priority events (Errors or High Latencies) and normal-priority events
(normal User Play or Stop events).
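
the routing itself mostly comes down to picking a Kafka topic per event.
here's a minimal sketch of that idea with the plain Kafka producer API (the
topic names and the isHighPriority check are my own assumptions, not
Netflix's actual scheme):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object PriorityRouter {
  // assumed priority check -- your event schema will differ
  def isHighPriority(event: String): Boolean =
    event.contains("ERROR") || event.contains("HIGH_LATENCY")

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka-broker:9092")
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // high-priority events get their own topic so they can be consumed
    // with less lag and more resources than the normal-priority path
    def route(event: String): Unit = {
      val topic =
        if (isHighPriority(event)) "logs-high-priority"
        else "logs-normal-priority"
      producer.send(new ProducerRecord[String, String](topic, event))
    }

    route("""{"level":"ERROR","msg":"playback failed"}""")
    route("""{"level":"INFO","msg":"user pressed play"}""")
    producer.close()
  }
}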

here's a recent presentation i did which details this configuration
starting at slide 104:
http://www.slideshare.net/cfregly/dc-spark-users-group-march-15-2016-spark-and-netflix-recommendations

btw, Confluent's distribution of Kafka does have a direct HTTP/REST API
(the Kafka REST Proxy) which is not recommended for production use, but has
worked well for me in the past.
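
for reference, producing through the REST Proxy is just an HTTP POST.
here's a minimal sketch (the proxy address, topic name, and payload are
made up, and the content type assumes the v1 JSON embedded format):

import java.io.OutputStreamWriter
import java.net.{HttpURLConnection, URL}
import scala.io.Source

object RestProxyPost {
  def main(args: Array[String]): Unit = {
    // placeholder REST Proxy host/port and topic name
    val url  = new URL("http://localhost:8082/topics/app-logs")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    // v1 embedded-format content type for JSON-valued records
    conn.setRequestProperty("Content-Type",
      "application/vnd.kafka.json.v1+json")

    val body =
      """{"records":[{"value":{"level":"INFO","msg":"user pressed play"}}]}"""
    val out = new OutputStreamWriter(conn.getOutputStream)
    out.write(body)
    out.close()

    // the proxy replies with per-record partitions/offsets (or an error)
    println(s"HTTP ${conn.getResponseCode}")
    println(Source.fromInputStream(conn.getInputStream).mkString)
  }
}

for anything high-throughput you'd still want a native Kafka client, which
is why i wouldn't lean on the proxy in production.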

these are some additional options to think about, anyway.


On Thu, Mar 31, 2016 at 4:26 AM, Steve Loughran <ste...@hortonworks.com>
wrote:

>
> On 31 Mar 2016, at 09:37, ashish rawat <dceash...@gmail.com> wrote:
>
> Hi,
>
> I have been evaluating Spark for analysing Application and Server Logs. I
> believe there are some downsides of doing this:
>
> 1. No direct mechanism for collecting logs, so we need to introduce other
> tools like Flume into the pipeline.
>
>
> you need something to collect logs no matter what you run. Flume isn't so
> bad; if you bring it up on the same host as the app then you can even
> collect logs while the network is playing up.
>
> Or you can just copy log4j files to HDFS and process them later
>
> 2. Need to write lots of code for parsing different patterns from logs,
> while some of the log analysis tools like logstash or loggly provide this
> out of the box.
>
>
>
> Log parsing is essentially an ETL problem, especially if you don't try to
> lock down the log event format.
>
> You can also configure Log4J to save stuff in an easy-to-parse format
> and/or forward directly to your application.
>
> There's a log4j-to-Flume connector that will do that for you:
>
>
> http://www.thecloudavenue.com/2013/11/using-log4jflume-to-log-application.html
>
> or you can output in, say, JSON (
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/log/Log4Json.java
>  )
>
> I'd go with Flume unless you have a need to save the logs locally and copy
> them to HDFS later.
>
>
>
> On the benefits side, I believe Spark might be more performant (although I
> am yet to benchmark it) and, being a generic processing engine, might work
> with complex use cases where the out-of-the-box functionality of log
> analysis tools is not sufficient (although I don't have any such use case
> right now).
>
> One option I was considering was to use logstash for collection and basic
> processing and then sink the processed logs to both Elasticsearch and
> Kafka, so that Spark Streaming can pick up data from Kafka for the complex
> use cases, while logstash filters can be used for the simpler use cases.
>
> I was wondering if someone has already done this evaluation and could
> provide me some pointers on how/if to create this pipeline with Spark.
>
> Regards,
> Ashish
>


-- 

*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
