What role do native queries from sources play in the Dataflow architecture, and to what degree? I'm guessing the answer really depends on the situation, but in a micro-batch (or hybrid) scenario, will as much of the heavy lifting as possible still be delegated to the initial query/source ingest?
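As an illustration of that pushdown: Beam's IO connectors generally accept a native query so that only matching records ever enter the pipeline, and anything the source can't express is then handled by Beam transforms. A minimal sketch using Beam's ElasticsearchIO connector (the host, index name, and query below are illustrative assumptions, not values from this thread):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class SourceQueryPushdown {
      public static void main(String[] args) {
        Pipeline pipeline =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // The native Elasticsearch query does the heavy lifting at the
        // source: only documents matching it ever enter the pipeline.
        PCollection<String> errorDocs = pipeline.apply("ReadErrors",
            ElasticsearchIO.read()
                .withConnectionConfiguration(
                    ElasticsearchIO.ConnectionConfiguration.create(
                        new String[] {"http://localhost:9200"}, "logs", "_doc"))
                .withQuery("{\"query\": {\"match\": {\"level\": \"ERROR\"}}}"));

        // Windowing, joins, and any custom logic the source cannot
        // express would then be applied as Beam transforms on errorDocs.
        pipeline.run().waitUntilFinish();
      }
    }
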
> On 26 May 2016, at 9:53 AM, Stephan Buys <[email protected]> wrote:
>
> Thanks so much for the feedback (Max and JB) so far, as well as the
> references to the projects; my reading list keeps growing.
>
> Continuing with my bad habit of just asking before I'm really familiar
> with a subject...
>
> The more I look at the examples and read about the kinds of problems
> Dataflow/Beam attempts to solve, the more I run into a perceived chasm
> between stacks such as ELK (Elasticsearch/Logstash, etc.) or Splunk and
> projects such as Apache Beam. I guess that even though the problems being
> solved are the same in a strict sense, Splunk/ELK/etc. are more suited to
> querying/searching/investigation, whereas projects such as Beam are well
> suited to being a pipeline feeding those systems, a pipeline integrating
> with those systems for realtime metrics/reporting, as well as a pipeline
> for alerting/training.
>
> In my mind a proper streaming system keeps looping back into, and
> originating from, a data store such as Elasticsearch/HDFS. Am I on the
> right track? Is there a 'grand unified' vision for these kinds of systems
> that I can delve into a bit?
>
> Regards,
> Stephan
>
>> On 25 May 2016, at 4:14 PM, Jean-Baptiste Onofré <[email protected]> wrote:
>>
>> Hi Stephan,
>>
>> I created Karaf Decanter as an alternative to Logstash/Elasticsearch.
>>
>> What you describe looks like a DSL to me (as discussed a bit here):
>>
>> - Technical Vision
>> - http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/
>>
>> I'm working on a PoC to mix Decanter with Beam, which could result in a
>> DSL ;)
>>
>> Regards
>> JB
>>
>> On 05/25/2016 01:43 PM, Stephan Buys wrote:
>>> Hi all,
>>>
>>> Hope I'm in the right forum. I'm someone with about a decade's worth of
>>> log management/event analytics experience - for the last 2 years, though,
>>> we've been building our own solutions based on a variety of open source
>>> technologies. As hopefully some of you might appreciate, whenever you
>>> want to do something interesting, or at scale, with timeseries/event
>>> data, a lot of the tools are lacking.
>>>
>>> I started off working in Splunk and it sort of spoiled me with
>>> end-user/administrator functionality from the get-go (even if it is
>>> prohibitively expensive and slow). In Splunk the 'sandpit' that you play
>>> in has all the toys a non-developer can ask for: built-in map/reduce +
>>> streaming, and manipulation of results/streams through a simple DSL
>>> familiar to anyone with a bit of Unix CLI/Bash experience (i.e. search
>>> something | filter | map | eval | visualise; see
>>> http://docs.splunk.com/Documentation/Splunk/latest/Search/Aboutsearchlanguagesyntax).
>>>
>>> At the moment we spend our days in Logstash + Elasticsearch (and sundry).
>>>
>>> I looked into Beam and Flink a bit, and from a technical perspective it
>>> seems like the ideal direction to go, combining many sources of data
>>> (such as Elasticsearch, InfluxDB, RethinkDB, etc.) and many analytics
>>> use-cases. The only gotcha seems to be that, from what I can see, the
>>> target audience is almost always developers. This isn't a problem for me
>>> personally, but ideally I would want to bolt a simple DSL (submittable
>>> via simple interfaces, such as a CLI) onto my datasets while keeping all
>>> of the stream/batch processing capabilities that projects like Flink
>>> allow.
>>>
>>> Is anyone aware of projects/efforts along these lines? Ideas on how we
>>> could get there from a project such as Apache Beam? (Am I being naive?)
>>>
>>> Your input/perspectives are most welcome!
>>>
>>> Kind regards,
>>> Stephan Buys
>>
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
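On the DSL question raised in the thread: each stage of a pipe-style expression like the Splunk one quoted above maps fairly naturally onto a named Beam transform, so a thin parser could in principle translate "search ... | filter ... | eval ..." into a chain of applies. A rough sketch with hard-coded stages (the file paths and predicates are placeholders; this is not an existing parser or project):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Filter;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class PipeStyleSketch {
      public static void main(String[] args) {
        Pipeline p =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Roughly: search /var/log/app.log | filter ERROR | eval uppercase
        p.apply("search", TextIO.read().from("/var/log/app.log"))
            .apply("filter", Filter.by((String line) -> line.contains("ERROR")))
            .apply("eval", MapElements
                .into(TypeDescriptors.strings())
                .via((String line) -> line.toUpperCase()))
            // The final stage could just as well write back into
            // Elasticsearch/HDFS, closing the loop described above.
            .apply("store", TextIO.write().to("/tmp/errors"));

        p.run().waitUntilFinish();
      }
    }

The appeal of this shape is that each pipe stage becomes a named transform, so the same DSL string could be submitted via a CLI and executed as either a batch or streaming pipeline on whatever runner (Flink, etc.) sits underneath.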
