What role do native queries from sources play in the Dataflow architecture, and to what degree? I'm guessing the answer really depends on the situation, but in a micro-batch (or hybrid) scenario, will as much of the heavy lifting as possible still be delegated to the initial query/source ingest?
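As an illustration of that pushdown: Beam's IO connectors generally accept a native query so that only matching records ever enter the pipeline, and anything the source can't express is then handled by Beam transforms. A minimal sketch using Beam's ElasticsearchIO connector (the host, index name, and query below are illustrative assumptions, not values from this thread):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class SourceQueryPushdown {
      public static void main(String[] args) {
        Pipeline pipeline =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // The native Elasticsearch query does the heavy lifting at the
        // source: only documents matching it ever enter the pipeline.
        PCollection<String> errorDocs = pipeline.apply("ReadErrors",
            ElasticsearchIO.read()
                .withConnectionConfiguration(
                    ElasticsearchIO.ConnectionConfiguration.create(
                        new String[] {"http://localhost:9200"}, "logs", "_doc"))
                .withQuery("{\"query\": {\"match\": {\"level\": \"ERROR\"}}}"));

        // Windowing, joins, and any custom logic the source cannot
        // express would then be applied as Beam transforms on errorDocs.
        pipeline.run().waitUntilFinish();
      }
    }
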
> On 26 May 2016, at 9:53 AM, Stephan Buys <[email protected]> wrote:
>
> Thanks so much for the feedback (Max and JB) so far, as well as the
> references to the projects; my reading list keeps growing.
>
> Continuing with my bad habit of just asking before I'm really familiar
> with a subject...
>
> The more I look at the examples and read about the kinds of problems
> Dataflow/Beam attempts to solve, the more I run into a perceived chasm
> between stacks such as ELK (Elasticsearch/Logstash, etc.) or Splunk and
> projects such as Apache Beam. I guess that even though the problems being
> solved are the same in a strict sense, Splunk/ELK/etc. are more suited to
> querying/searching/investigation, whereas projects such as Beam are well
> suited to being a pipeline feeding those systems, a pipeline integrating
> with those systems for realtime metrics/reporting, as well as a pipeline
> for alerting/training.
>
> In my mind a proper streaming system keeps looping back into, and
> originating from, a data store such as Elasticsearch/HDFS. Am I on the
> right track? Is there a 'grand unified' vision for these kinds of systems
> that I can delve into a bit?
>
> Regards,
> Stephan
>
>> On 25 May 2016, at 4:14 PM, Jean-Baptiste Onofré <[email protected]> wrote:
>>
>> Hi Stephan,
>>
>> I created Karaf Decanter as an alternative to Logstash/Elasticsearch.
>>
>> What you describe looks like a DSL to me (as discussed a bit here):
>>
>> - Technical Vision
>> - http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/
>>
>> I'm working on a PoC to mix Decanter with Beam, which could result in a
>> DSL ;)
>>
>> Regards
>> JB
>>
>> On 05/25/2016 01:43 PM, Stephan Buys wrote:
>>> Hi all,
>>>
>>> Hope I'm in the right forum. I'm someone with about a decade's worth of
>>> log management/event analytics experience - for the last 2 years, though,
>>> we've been building our own solutions based on a variety of open source
>>> technologies. As hopefully some of you might appreciate, whenever you
>>> want to do something interesting, or at scale, with timeseries/event
>>> data, a lot of the tools are lacking.
>>>
>>> I started off working in Splunk and it sort of spoiled me with
>>> end-user/administrator functionality from the get-go (even if it is
>>> prohibitively expensive and slow). In Splunk the 'sandpit' that you play
>>> in has all the toys a non-developer can ask for: built-in map/reduce +
>>> streaming, and manipulation of results/streams through a simple DSL
>>> familiar to anyone with a bit of Unix CLI/Bash experience (i.e. search
>>> something | filter | map | eval | visualise; see
>>> http://docs.splunk.com/Documentation/Splunk/latest/Search/Aboutsearchlanguagesyntax).
>>>
>>> At the moment we spend our days in Logstash + Elasticsearch (and sundry).
>>>
>>> I looked into Beam and Flink a bit, and from a technical perspective it
>>> seems like the ideal direction to go, combining many sources of data
>>> (such as Elasticsearch, InfluxDB, RethinkDB, etc.) and many analytics
>>> use-cases. The only gotcha seems to be that, from what I can see, the
>>> target audience is almost always developers. This isn't a problem for me
>>> personally, but ideally I would want to bolt a simple DSL (submittable
>>> via simple interfaces, such as a CLI) onto my datasets while keeping all
>>> of the stream/batch processing capabilities that projects like Flink
>>> allow.
>>>
>>> Is anyone aware of projects/efforts along these lines? Ideas on how we
>>> could get there from a project such as Apache Beam? (Am I being naive?)
>>>
>>> Your input/perspectives are most welcome!
>>>
>>> Kind regards,
>>> Stephan Buys
>>
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
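On the DSL question raised in the thread: each stage of a pipe-style expression like the Splunk one quoted above maps fairly naturally onto a named Beam transform, so a thin parser could in principle translate "search ... | filter ... | eval ..." into a chain of applies. A rough sketch with hard-coded stages (the file paths and predicates are placeholders; this is not an existing parser or project):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Filter;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class PipeStyleSketch {
      public static void main(String[] args) {
        Pipeline p =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Roughly: search /var/log/app.log | filter ERROR | eval uppercase
        p.apply("search", TextIO.read().from("/var/log/app.log"))
            .apply("filter", Filter.by((String line) -> line.contains("ERROR")))
            .apply("eval", MapElements
                .into(TypeDescriptors.strings())
                .via((String line) -> line.toUpperCase()))
            // The final stage could just as well write back into
            // Elasticsearch/HDFS, closing the loop described above.
            .apply("store", TextIO.write().to("/tmp/errors"));

        p.run().waitUntilFinish();
      }
    }

The appeal of this shape is that each pipe stage becomes a named transform, so the same DSL string could be submitted via a CLI and executed as either a batch or streaming pipeline on whatever runner (Flink, etc.) sits underneath.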
