So this is basically why the flume suggestion has come up. Flume natively acts as a syslog listener and will write files to basically anything (HDFS, Hive, HBase, S3).
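For reference, a Flume agent doing exactly that is a short properties file. The sketch below is illustrative only (agent, channel, and path names are made up), not a tested config; the source/sink types and property keys are the standard Flume ones:

```properties
# Illustrative Flume agent: syslog in over UDP, files out to HDFS.
agent.sources = syslog-src
agent.channels = mem-ch
agent.sinks = hdfs-sink

agent.sources.syslog-src.type = syslogudp
agent.sources.syslog-src.host = 0.0.0.0
agent.sources.syslog-src.port = 5140
agent.sources.syslog-src.channels = mem-ch

agent.channels.mem-ch.type = memory
agent.channels.mem-ch.capacity = 10000

agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.channel = mem-ch
agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/syslog/%Y-%m-%d
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
```

Swapping the sink type is how the "write to basically anything" part works; the source config stays the same.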
On Thu, Apr 20, 2017 at 8:15 AM, Michael Ridley <[email protected]> wrote:

When we say ingest from Kafka, what does that mean? I understand we can read from Kafka to ingest into the cluster, but how will the data get to Kafka, and what data are we talking about? My understanding is that right now the primary data sources would be Netflow and Syslog, neither of which writes to Kafka natively, so we would need something like StreamSets in the middle. Certainly StreamSets UDP source -> Kafka would work.

Michael

On Wed, Apr 19, 2017 at 7:05 PM, kant kodali <[email protected]> wrote:

Sure, I guess Kafka has something called Kafka Connect, but it may not be as mature as Flume since I heard about this recently.

On Wed, Apr 19, 2017 at 3:39 PM, Austin Leahy <[email protected]> wrote:

The advantage of Flume, or a Flume/Kafka hybrid, is that the team doesn't have to build sinks for any new source types added to the project; just create configs pointing to the landing pad.

On Wed, Apr 19, 2017 at 3:31 PM, kant kodali <[email protected]> wrote:

What kind of benchmarks are we looking for? Just throughput? I am assuming this is for ingestion. I haven't seen anything faster than Kafka, and that is because of its simplicity: after all, the publisher appends messages to a file (the so-called partition in Kafka) and clients just do sequential reads from that file, so it's a matter of disk throughput. The benchmark numbers I have for Kafka are at the very least 75K messages/sec, where each message is 1KB, on m4.xlarge, which by default has EBS storage (EBS is network-attached SSD disk). The network-attached disk has a max throughput of 125 MB/s (m4.xlarge has 1 Gigabit), but if we were to deploy on ephemeral storage (local SSD) and on a 10 Gigabit network we would easily get 5-10X more.
No idea about Flume.

Finally, I am not trying to pitch for Kafka; however, it is the fastest I have seen, but if someone has better numbers for Flume then we should use that. Also, I would suspect there are benchmarks for Kafka vs Flume available online already, or we can try it with our own datasets.

Thanks!

On Wed, Apr 19, 2017 at 3:09 PM, Austin Leahy <[email protected]> wrote:

I am happy to create and test a flume source... #intelteam would need to create the benchmark by deploying it and pointing a data source at it... since I don't have a good enough volume of source data handy.

On Wed, Apr 19, 2017 at 3:04 PM, Ross, Alan D <[email protected]> wrote:

We discussed this in our staff meeting a bit today. I would like to see some benchmarking of different approaches (Kafka, Flume, etc.) to see what the numbers look like. Is anyone in the community willing to volunteer on this work?

-----Original Message-----
From: Austin Leahy [mailto:[email protected]]
Sent: Wednesday, April 19, 2017 1:05 PM
To: [email protected]
Subject: Re: [Discuss] - Future plans for Spot-ingest

I think Kafka is probably a red herring. It's an industry go-to in the application world because of redundancy, but the type and volumes of network telemetry that we are talking about here will bog Kafka down unless you dedicate really serious hardware to just the Kafka implementation. It's essentially the next level of the problem that the team was already running into when RabbitMQ was queueing in data.
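The arithmetic behind the EBS ceiling quoted above can be checked directly. This uses only the figures from the message (no new measurements):

```python
# Sanity-check the Kafka-on-EBS numbers quoted in the thread:
# 75K messages/sec at 1 KB/message on an m4.xlarge with a 125 MB/s EBS cap.
msgs_per_sec = 75_000        # claimed lower bound
msg_size_bytes = 1_024       # 1 KB per message

throughput_mb_s = msgs_per_sec * msg_size_bytes / 1_000_000
print(f"{throughput_mb_s:.1f} MB/s")   # 76.8 MB/s

ebs_cap_mb_s = 125.0
headroom = ebs_cap_mb_s / throughput_mb_s
print(f"{headroom:.2f}x headroom before saturating EBS")
```

So the quoted benchmark already sits at roughly 60% of the EBS cap, which is why moving to local SSD and a 10 Gigabit network is where the claimed 5-10X would have to come from.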
On Wed, Apr 19, 2017 at 12:33 PM, Mark Grover <[email protected]> wrote:

On Wed, Apr 19, 2017 at 10:19 AM, Smith, Nathanael P <[email protected]> wrote:

Mark,

Just digesting the below. Backing up in my thought process, I was thinking that the ingest master (first point of entry into the system) would want to put the data into a standard serializable format. I was thinking that libraries (such as pyarrow in this case) could help by writing the data in parquet format early in the process. You are probably correct that at this point in time it might not be worth the time, and it can be kept in the backlog.

That being said, I still think the master should produce data in a standard format. What, in your opinion (and I open this up of course to others), would be the most logical format? The most basic would be to just keep it as a .csv.

The master will likely write data to a staging directory in HDFS where the spark streaming job will pick it up for normalization/writing to parquet in the correct block sizes and partitions.

Hi Nate,

Avro is usually preferred for such a standard format, because it asserts a schema (types, etc.), which CSV doesn't, and it allows for schema evolution, which, depending on the type of evolution, CSV may or may not support. That's something I have seen being done very commonly.
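A quick stdlib illustration of the schema point above: CSV silently flattens every value to a string, so the types have to be re-agreed by every consumer. This is a minimal sketch of the problem (invented field values), not Avro itself:

```python
import csv
import io

# Write a typed record (int timestamp, string IP, int port) through CSV.
buf = io.StringIO()
csv.writer(buf).writerow([1493200500, "10.0.0.1", 443])
buf.seek(0)
row = next(csv.reader(buf))

# Every field comes back as a string; the types were never recorded.
print(row)   # ['1493200500', '10.0.0.1', '443']
assert all(isinstance(field, str) for field in row)

# A schema'd format (Avro, Parquet) records int/str/int in the file itself,
# and can evolve that schema (e.g. add a field with a default) safely.
```

This is exactly what "asserts a schema" buys: the reader does not need a per-pipeline convention to know that the first column is an integer.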
Now, if the data were in Kafka before it gets to the master, one could argue that the master could just send metadata to the workers (topic name, partition number, offset start and end) and the workers could read from Kafka directly. I do understand that'd be a much different architecture than the current one, but if you think it's a good idea too, we could document that, say in a JIRA, and (de-)prioritize it (and in line with the rest of the discussion on this thread, it's not the top-most priority).

Thoughts?

- Nathanael

On Apr 17, 2017, at 1:12 PM, Mark Grover <[email protected]> wrote:

Thanks all for your opinion.

I think it's good to consider two things:
1. What do (we think) users care about?
2. What's the cost of changing things?

About #1, I think users care more about what format the data is written in than how the data is written. I'd argue that whether it uses Hive, MR, or a custom Parquet writer is not as important to them, as long as we maintain data/format compatibility.

About #2, having worked on several projects, I find that it's rather difficult to keep up with Parquet.
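The master-sends-metadata idea above amounts to handing each worker a (topic, partition, offset range) work unit. A minimal sketch of such a descriptor and how a master might shard it across workers (all names are illustrative; no Kafka client is involved):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KafkaWorkUnit:
    """Span of one Kafka partition for a worker to read (illustrative)."""
    topic: str
    partition: int
    start_offset: int   # inclusive
    end_offset: int     # exclusive

    def count(self) -> int:
        return self.end_offset - self.start_offset

def shard(unit: KafkaWorkUnit, n_workers: int) -> list:
    """Split one offset range into up to n_workers contiguous chunks."""
    step = -(-unit.count() // n_workers)  # ceiling division
    return [
        KafkaWorkUnit(unit.topic, unit.partition,
                      s, min(s + step, unit.end_offset))
        for s in range(unit.start_offset, unit.end_offset, step)
    ]

# The master ships these small descriptors; workers pull the bytes themselves.
units = shard(KafkaWorkUnit("netflow", 0, 0, 10_000), 3)
print([u.count() for u in units])   # [3334, 3334, 3332]
```

The appeal of the design is visible here: the master never touches the payload, only a few integers per worker.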
Even in Spark, there are a few different ways to write to Parquet - there's a regular mode, and a legacy mode
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L44>
which continues to cause confusion <https://issues.apache.org/jira/browse/SPARK-20297> to date. Parquet itself is pretty dependent on Hadoop
<https://github.com/Parquet/parquet-mr/search?l=Maven+POM&q=hadoop&type=&utf8=%E2%9C%93>
and just integrating it with systems with a lot of developers (like Spark
<https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=spark+parquet+jiras>)
is still a lot of work.

I personally think we should leverage higher-level tools like Hive or Spark to write data in widespread formats (Parquet being a very good example), but I personally wouldn't encourage us to manage the writers ourselves.

Thoughts?
Mark

On Mon, Apr 17, 2017 at 11:44 AM, Michael Ridley <[email protected]> wrote:

Without having given it too terribly much thought, that seems like an OK approach.
Michael

On Mon, Apr 17, 2017 at 2:33 PM, Nathanael Smith <[email protected]> wrote:

I think the question is whether we can write the data generically to HDFS as parquet without the use of hive/impala.

Today we write parquet data using the hive/mapreduce method. As part of the redesign I'd like to use libraries for this as opposed to a hadoop dependency. I think it would be preferred to use the python master to write the data into the format we want, then do normalization of the data in spark streaming. Any thoughts?

- Nathanael

On Apr 17, 2017, at 11:08 AM, Michael Ridley <[email protected]> wrote:

I had thought that the plan was to write the data in Parquet in HDFS ultimately.

Michael

On Sun, Apr 16, 2017 at 11:55 AM, kant kodali <[email protected]> wrote:

Hi Mark,

Thank you so much for hearing my argument. And I definitely understand that you guys have a bunch of things to do.
My only concern is that I hope it doesn't take too long to support other backends. For example, @Kenneth had given the example of the LAMP stack not having moved away from MySQL yet, which essentially means it's probably a decade? I see that in the current architecture the results from either python multiprocessing or Spark Streaming are written back to HDFS; if so, can we write them in parquet format, such that users would be able to plug in any query engine? But again, I am not pushing you guys to do this right away or anything, just seeing if there is a way for me to get started in parallel; if it's not feasible, that's fine. I just wanted to share my 2 cents, and I am glad my argument is heard!

Thanks much!

On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <[email protected]> wrote:

Hi Kant,
Just wanted to make sure you don't feel like we are ignoring your comment :-) I hear you about pluggability.
The design can and should be pluggable, but the project has one stack it ships out of the box with, one stack that's the default in the sense that it's the most tested and so on. And, for us, that's our current stack. If we were to take Apache Hive as an example, it shipped (and ships) with MapReduce as the default execution engine. At some point, Apache Tez came along and wanted Hive to run on Tez, so they made a bunch of things pluggable to run Hive on Tez (instead of the only option up until then: Hive-on-MR), and then Apache Spark came and re-used some of that pluggability and even added some more so Hive-on-Spark could become a reality. In the same way, I don't think anyone here disagrees that pluggability is a good thing, but it's hard to do pluggability right, and at the right level, unless one has a clear use-case in mind.

As a project, we have many things to do, and I personally think the biggest bang for the buck for us in making Spot a really solid, best-in-class cyber security solution isn't pluggability but the things we are working on - a better user interface, a common/unified approach to storing and modeling data, etc.

Having said that, we are open; if it's important to you or someone else, we'd be happy to receive and review those patches.

Thanks!
Mark

On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <[email protected]> wrote:

Thanks Ross! And yes, option C sounds good to me as well; however, I just think the distributed SQL query engine and the resource manager should be pluggable.

On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <[email protected]> wrote:

Option C is to use python on the front end of the ingest pipeline and spark/scala on the back end.
Option A uses python workers on the backend.

Option B uses all scala.

-----Original Message-----
From: kant kodali [mailto:[email protected]]
Sent: Friday, April 14, 2017 9:53 AM
To: [email protected]
Subject: Re: [Discuss] - Future plans for Spot-ingest

What is option C? Am I missing an email or something?

On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <[email protected]> wrote:

+1 for Python 3.x

On 4/14/2017 11:59 AM, Austin Leahy wrote:

I think that C is the strong solution; getting the ingest really strong is going to lower barriers to adoption. Doing it in Python will open up the ingest portion of the project to include many more developers.

Before it comes up, I would like to throw the following on the pile... Major python projects (django/flask and others) are dropping 2.x support in releases scheduled in the next 6 to 8 months. Hadoop projects in general tend to lag in modern python support; let's please build this in 3.5 so that we don't have to immediately expect a rebuild in the pipeline.

-Vote C

Thanks Nate

Austin

On Fri, Apr 14, 2017 at 8:52 AM, Alan Ross <[email protected]> wrote:

I really like option C because it gives a lot of flexibility for ingest (python vs scala) but still has the robust spark streaming backend for performance.

Thanks for putting this together Nate.

Alan

On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <[email protected]> wrote:

I agree. We should continue making the existing stack more mature at this point. Maybe if we have enough community support we can add additional datastores.

Chokha.
On 4/14/17 11:10 AM, [email protected] wrote:

Hi Kant,

YARN is the standard scheduler in Hadoop. If you're using Hive+Spark, then sure, you'll have YARN.

Haven't seen any Hive on Mesos so far. As said, Spot is based on a quite standard Hadoop stack, and I wouldn't switch too many pieces yet.

In most open-source projects you start by relying on a well-known stack, and then you begin to support other DB backends once it's quite mature. Think of the loads of LAMP apps which haven't been ported away from MySQL yet.

In any case, you'll need high-performance SQL + massive storage + machine learning + massive ingestion, and... at the moment, that can only be provided by Hadoop.

Regards!

Kenneth

On 2017-04-14 12:56, kant kodali wrote:

Hi Kenneth,

Thanks for the response. I think you made a case for HDFS; however, users may want to use S3 or some other FS, in which case they can use Alluxio (hoping that there are no changes needed within Spot, in which case I can agree to that). For example, Netflix stores all their data in S3.

The distributed SQL query engine, I would say, should be pluggable with whatever the user may want to use, and there are a bunch of them out there. Sure, Impala is better than Hive, but what if users are already using something else like Drill or Presto?
Personally, I would not assume that users are willing to deploy all of that and make their existing stack more complicated; at the very least I would say it is an uphill battle. Things have been changing rapidly in the Big Data space, so whatever we think is standard won't be standard anymore, but more importantly, there shouldn't be any reason why we shouldn't be flexible, right?

Also, I am not sure why only YARN? Why not make that more flexible as well, so users can pick Mesos or standalone?

I think flexibility is the key to wide adoption, rather than a tightly coupled architecture.

Thanks!

On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <[email protected]> wrote:

PS: you need a big data platform to be able to collect all those netflows and logs.

Spot isn't intended for SMBs, that's clear; you need loads of data to get ML working properly, and somewhere to run those algorithms. That is Hadoop.

Regards!

Kenneth

Sent from my Mi phone

On kant kodali <[email protected]>, Apr 14, 2017 4:04 AM wrote:

Hi,

Thanks for starting this thread. Here is my feedback.
I somehow think the architecture is too complicated for wide adoption, since it requires installing the following:

HDFS
HIVE
IMPALA
KAFKA
SPARK (YARN)
YARN
Zookeeper

Currently there are way too many dependencies, which discourages a lot of users from using it because they have to go through deployment of all that required software. I think for wide adoption we should minimize the dependencies and have a more pluggable architecture. For example, I am not sure why both HIVE and IMPALA are required. Why not just use Spark SQL, since Spark is already a dependency? Or users may want to use their own distributed query engine, such as Apache Drill or something else; we should be flexible enough to provide that option.

Also, I see that HDFS is used such that collectors can receive file paths through Kafka and be able to read a file. How big are these files? Do we really need HDFS for this? Why not provide more ways to send data, such as sending data directly through Kafka, or just leaving it up to the user to specify the file location as an argument to the collector process?

Finally, I learnt that to generate Netflow data one would require specific hardware. This really means Apache Spot is not meant for everyone. I thought Apache Spot could be used to analyze the network traffic of any machine, but if it requires specific hardware then I think it is targeted at a specific group of people.
> > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> The real strength of Apache Spot should > mainly > > be > > > > > > > > >>>>>>>>>>>>>>> just > > > > > > > > >>>>>> analyzing > > > > > > > > >>>>>>>>>>>>>>> network traffic through ML. > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> Thanks! > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, > > > Nathan > > > > L > > > > > > > > >>>>>>>>>>>>>>> < [email protected]> wrote: > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> Thanks, Nate, > > > > > > > > >>>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>>> Nate. 
> > > > > > -----Original Message-----
> > > > > > From: Nate Smith [mailto:[email protected]]
> > > > > > Sent: Thursday, April 13, 2017 4:26 PM
> > > > > > To: [email protected]
> > > > > > Cc: [email protected]; [email protected]
> > > > > > Subject: Re: [Discuss] - Future plans for Spot-ingest
> > > > > >
> > > > > > I was really hoping it came through ok. Oh well :) Here's an image
> > > > > > form: http://imgur.com/a/DUDsD
> > > > > >
> > > > > > > On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L
> > > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > The diagram became garbled in the text format. Could you resend
> > > > > > > it as a pdf?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Nate
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Nathanael Smith [mailto:[email protected]]
> > > > > > > Sent: Thursday, April 13, 2017 4:01 PM
> > > > > > > To: [email protected]; [email protected];
> > > > > > > [email protected]
> > > > > > > Subject: [Discuss] - Future plans for Spot-ingest
> > > > > > >
> > > > > > > How would you like to see Spot-ingest change?
> > > > > > >
> > > > > > > A. Continue development on the Python Master/Worker with focus on
> > > > > > > performance / error handling / logging
> > > > > > > B. Develop Scala-based ingest to be in line with the code base
> > > > > > > from ingest and ml to OA (UI to continue being ipython/JS)
> > > > > > > C. Python ingest Worker with Scala-based Spark code for
> > > > > > > normalization and input into the DB
> > > > > > >
> > > > > > > Including the high-level diagram:
> > > > > > >
> > > > > > > [ASCII diagram garbled by quoting; image form at
> > > > > > > http://imgur.com/a/DUDsD. Roughly: a Master process (A. Python /
> > > > > > > B. Scala / C. Python) feeds Worker processes (A. Python /
> > > > > > > B. Scala / C. Scala), with the Spark Streaming workers running
> > > > > > > on worker nodes in the Hadoop cluster; Workers write to the
> > > > > > > local FS (binary/text log files) and to HDFS (Parquet, queried
> > > > > > > via Hive / Impala).]
> > > > > > >
> > > > > > > Please let me know your thoughts,
> > > > > > >
> > > > > > > - Nathanael
>
> --
> Michael Ridley <[email protected]>
> office: (650) 352-1337
> mobile: (571) 438-2420
> Senior Solutions Architect
> Cloudera, Inc.
