Re: [Discuss] - Future plans for Spot-ingest

Mark Grover Wed, 19 Apr 2017 12:27:06 -0700

On Wed, Apr 19, 2017 at 11:50 AM, Austin Leahy <[email protected]>
wrote:


> I think there are some technical decisions that need to be made but I think
> there are some important product and community issues to balance here and
> it's important to get close to the same page.
>
> 1. Who is our goal technical constituency?
>
> 2. Who is the strongest technical constituency that will give us the
> momentum to exit the incubator and keep the project alive?
>
> 3. If the difference between our goal technical constituency and our
> strongest technical constituency is significant how do we build abstraction
> into the project so that we can serve our goal constituents in the long
> run?
>
> Answering these questions makes answering some of the core technical
> questions easier.
>
> For example in my last cluster our data volumes were such that if data
> analysis and storage for spot was built on passing uncompressed csv back
> and forth we wouldn't ever be able to give serious thought to an
> implementation.
>
> I understand that there are many ways to deploy and utilize spark. I have
> used several of them. But until we have a straightforward deployable
> product with several public implementations I think that we should agree on
> a single supported architecture and punt discussions of interchangeable
> storage engines and finer points like "how to support mesos" till after our
> first major release.
>
I completely agree.

>
> On Wed, Apr 19, 2017 at 10:19 AM Smith, Nathanael P <
> [email protected]> wrote:
>
> > Mark,
> >
> > just digesting the below.
> >
> > Backing up in my thought process, I was thinking that the ingest master
> > (first point of entry into the system) would want to put the data into a
> > standard serializable format. I was thinking that libraries (such as
> > pyarrow in this case) could help by writing the data in parquet format
> > early in the process. You are probably correct that at this point in time
> > it might not be worth the time and can be kept in the backlog.
> > That being said, I still think the master should produce data in a
> > standard format. What, in your opinion (and I open this up of course to
> > others), would be the most logical format?
> > The most basic would be to just keep it as a .csv.
> >
> > The master will likely write data to a staging directory in HDFS where the
> > spark streaming job will pick it up for normalization/writing to parquet in
> > the correct block sizes and partitions.
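
A minimal sketch of the staging hand-off described above, assuming the master
emits plain CSV (the "most basic" format mentioned) and that the staging
location behaves like a local POSIX directory rather than HDFS. The names
stage_batch and visible_batches, the column names, and the file naming are
hypothetical and not part of Spot. It illustrates publishing each batch by
writing to a temporary dot-file and then renaming it, so a job watching the
directory should only ever see complete files. As far as I know, Spark
Streaming's file sources expect files to appear through such an atomic move or
rename.

```python
import csv
import os
import tempfile


def stage_batch(rows, header, staging_dir, name):
    """Write one batch of records as CSV and publish it atomically.

    The batch is written to a hidden temporary file first and renamed
    into place only after it is fully written, so a job that lists
    `staging_dir` never observes a half-written file.
    """
    os.makedirs(staging_dir, exist_ok=True)
    final_path = os.path.join(staging_dir, name + ".csv")

    # The temporary file lives in the same directory so the rename stays
    # on one filesystem; the leading dot marks it as not yet published.
    fd, tmp_path = tempfile.mkstemp(prefix=".", suffix=".tmp", dir=staging_dir)
    try:
        with os.fdopen(fd, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)
        os.replace(tmp_path, final_path)  # atomic rename on POSIX filesystems
    except BaseException:
        # Remove the partial file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


def visible_batches(staging_dir):
    """List published batches the way a consumer would: skip dot-files."""
    return sorted(
        entry for entry in os.listdir(staging_dir) if not entry.startswith(".")
    )
```

On a real cluster the same write-then-rename idea would go through an HDFS
client instead of os.replace; that part is deliberately left out here.
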
> >
> > - Nathanael
> >
> >
> >
> > > On Apr 17, 2017, at 1:12 PM, Mark Grover <[email protected]> wrote:
> > >
> > > Thanks all for your opinions.
> > >
> > > I think it's good to consider two things:
> > > 1. What do (we think) users care about?
> > > 2. What's the cost of changing things?
> > >
> > > About #1, I think users care more about what format data is written than
> > > how the data is written. I'd argue whether that uses Hive, MR, or a custom
> > > Parquet writer is not as important to them as long as we maintain
> > > data/format compatibility.
> > > About #2, having worked on several projects, I find that it's rather
> > > difficult to keep up with Parquet. Even in Spark, there are a few different
> > > ways to write to Parquet - there's a regular mode, and a legacy mode
> > > <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L44>
> > > which continues to cause confusion
> > > <https://issues.apache.org/jira/browse/SPARK-20297> to date. Parquet
> > > itself is pretty dependent on Hadoop
> > > <https://github.com/Parquet/parquet-mr/search?l=Maven+POM&q=hadoop&type=&utf8=%E2%9C%93>
> > > and just integrating it with systems with a lot of developers (like Spark
> > > <https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=spark+parquet+jiras>)
> > > is still a lot of work.
> > >
> > > I personally think we should leverage higher level tools like Hive, or
> > > Spark to write data in widespread formats (Parquet, being a very good
> > > example) but I personally wouldn't encourage us to manage the writers
> > > ourselves.
> > >
> > > Thoughts?
> > > Mark
> > >
> > > On Mon, Apr 17, 2017 at 11:44 AM, Michael Ridley <[email protected]>
> > > wrote:
> > >
> > >> Without having given it too terribly much thought, that seems like an OK
> > >> approach.
> > >>
> > >> Michael
> > >>
> > >> On Mon, Apr 17, 2017 at 2:33 PM, Nathanael Smith <[email protected]>
> > >> wrote:
> > >>
> > >>> I think the question is rather whether we can write the data generically
> > >>> to HDFS as parquet without the use of hive/impala?
> > >>>
> > >>> Today we write parquet data using the hive/mapreduce method.
> > >>> As part of the redesign I'd like to use libraries for this as opposed to
> > >>> a hadoop dependency.
> > >>> I think it would be preferred to use the python master to write the data
> > >>> into the format we want, then do normalization of the data in spark
> > >>> streaming.
> > >>> Any thoughts?
> > >>>
> > >>> - Nathanael
> > >>>
> > >>>
> > >>>
> > >>>> On Apr 17, 2017, at 11:08 AM, Michael Ridley <[email protected]>
> > >>>> wrote:
> > >>>>
> > >>>> I had thought that the plan was to write the data in Parquet in HDFS
> > >>>> ultimately.
> > >>>>
> > >>>> Michael
> > >>>>
> > >>>> On Sun, Apr 16, 2017 at 11:55 AM, kant kodali <[email protected]>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi Mark,
> > >>>>>
> > >>>>> Thank you so much for hearing my argument. And I definitely understand
> > >>>>> that you guys have a bunch of things to do. My only concern is that I
> > >>>>> hope it doesn't take too long to support other backends. For example,
> > >>>>> @Kenneth had given the example of the LAMP stack not having moved away
> > >>>>> from MySQL yet, which essentially means it's probably a decade? I see
> > >>>>> that in the current architecture the results from Python
> > >>>>> multiprocessing or Spark Streaming are written back to HDFS. If so, can
> > >>>>> we write them in parquet format, such that users should be able to plug
> > >>>>> in any query engine? But again, I am not pushing you guys to do this
> > >>>>> right away or anything, just seeing if there is a way for me to get
> > >>>>> started in parallel, and if it's not feasible, it's fine. I just wanted
> > >>>>> to share my 2 cents and I am glad my argument is heard!
> > >>>>>
> > >>>>> Thanks much!
> > >>>>>
> > >>>>> On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <[email protected]>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hi Kant,
> > >>>>>> Just wanted to make sure you don't feel like we are ignoring your
> > >>>>>> comment :-) I hear you about pluggability.
> > >>>>>>
> > >>>>>> The design can and should be pluggable but the project has one stack it
> > >>>>>> ships out of the box with, one stack that's the default stack in the
> > >>>>>> sense that it's the most tested and so on. And, for us, that's our
> > >>>>>> current stack. If we were to take Apache Hive as an example, it shipped
> > >>>>>> (and ships) with MapReduce as the default execution engine. At some
> > >>>>>> point, Apache Tez came along and wanted Hive to run on Tez, so they made
> > >>>>>> a bunch of things pluggable to run Hive on Tez (instead of the only
> > >>>>>> option up until then: Hive-on-MR) and then Apache Spark came and re-used
> > >>>>>> some of that pluggability and even added some more so Hive-on-Spark
> > >>>>>> could become a reality. In the same way, I don't think anyone disagrees
> > >>>>>> here that pluggability is a good thing but it's hard to do pluggability
> > >>>>>> right, and at the right level, unless one has a clear use-case in mind.
> > >>>>>>
> > >>>>>> As a project, we have many things to do and I personally think the
> > >>>>>> biggest bang for the buck for us in making Spot a really solid and the
> > >>>>>> best cyber security solution isn't pluggability but the things we are
> > >>>>>> working on - a better user interface, a common/unified approach to
> > >>>>>> storing and modeling data, etc.
> > >>>>>>
> > >>>>>> Having said that, we are open, if it's important to you or someone
> > >>>>>> else, we'd be happy to receive and review those patches.
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>> Mark
> > >>>>>>
> > >>>>>> On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <[email protected]>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Thanks Ross! And yes, option C sounds good to me as well; however, I
> > >>>>>>> just think the distributed SQL query engine and the resource manager
> > >>>>>>> should be pluggable.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <[email protected]>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Option C is to use python on the front end of ingest pipeline and
> > >>>>>>>> spark/scala on the back end.
> > >>>>>>>>
> > >>>>>>>> Option A uses python workers on the backend
> > >>>>>>>>
> > >>>>>>>> Option B uses all scala.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> -----Original Message-----
> > >>>>>>>> From: kant kodali [mailto:[email protected]]
> > >>>>>>>> Sent: Friday, April 14, 2017 9:53 AM
> > >>>>>>>> To: [email protected]
> > >>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
> > >>>>>>>>
> > >>>>>>>> What is option C? Am I missing an email or something?
> > >>>>>>>>
> > >>>>>>>> On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <
> > >>>>>>>> [email protected]> wrote:
> > >>>>>>>>
> > >>>>>>>>> +1 for Python 3.x
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On 4/14/2017 11:59 AM, Austin Leahy wrote:
> > >>>>>>>>>
> > >>>>>>>>>> I think that C is the strong solution; getting the ingest really
> > >>>>>>>>>> strong is going to lower barriers to adoption. Doing it in Python
> > >>>>>>>>>> will open up the ingest portion of the project to include many more
> > >>>>>>>>>> developers.
> > >>>>>>>>>>
> > >>>>>>>>>> Before it comes up I would like to throw the following on the
> > >>>>>>>>>> pile... Major python projects (django/flask, others) are dropping 2.x
> > >>>>>>>>>> support in releases scheduled in the next 6 to 8 months. Hadoop
> > >>>>>>>>>> projects in general tend to lag in modern python support; let's
> > >>>>>>>>>> please build this in 3.5 so that we don't have to immediately expect
> > >>>>>>>>>> a rebuild in the pipeline.
> > >>>>>>>>>>
> > >>>>>>>>>> -Vote C
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks Nate
> > >>>>>>>>>>
> > >>>>>>>>>> Austin
> > >>>>>>>>>>
> > >>>>>>>>>> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <[email protected]> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> I really like option C because it gives a lot of flexibility for
> > >>>>>>>>>>> ingest (python vs scala) but still has the robust spark streaming
> > >>>>>>>>>>> backend for performance.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks for putting this together Nate.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Alan
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <
> > >>>>>>>>>>> [email protected]> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> I agree. We should continue making the existing stack more mature at
> > >>>>>>>>>>>> this point. Maybe if we have enough community support we can add
> > >>>>>>>>>>>> additional datastores.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Chokha.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On 4/14/17 11:10 AM, [email protected] wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi Kant,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> YARN is the standard scheduler in Hadoop. If you're using
> > >>>>>>>>>>>>> Hive+Spark, then sure you'll have YARN.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Haven't seen any Hive on Mesos so far. As said, Spot is based on a
> > >>>>>>>>>>>>> quite standard Hadoop stack and I wouldn't switch too many pieces
> > >>>>>>>>>>>>> yet.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> In most open source projects you start by relying on a well-known
> > >>>>>>>>>>>>> stack and then you begin to support other DB backends once it's
> > >>>>>>>>>>>>> quite mature. Think of the loads of LAMP apps which haven't been
> > >>>>>>>>>>>>> ported away from MySQL yet.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> In any case, you'll need a high performance SQL + Massive Storage
> > >>>>>>>>>>>>> + Machine Learning + Massive Ingestion, and... ATM, that can be
> > >>>>>>>>>>>>> only provided by Hadoop.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Regards!
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Kenneth
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 2017-04-14 12:56, kant kodali wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Hi Kenneth,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks for the response. I think you made a case for HDFS;
> > >>>>>>>>>>>>>> however, users may want to use S3 or some other FS, in which case
> > >>>>>>>>>>>>>> they can use Alluxio (hoping that there are no changes needed
> > >>>>>>>>>>>>>> within Spot, in which case I can agree to that). For example,
> > >>>>>>>>>>>>>> Netflix stores all their data in S3.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The distributed SQL query engine, I would say, should be pluggable
> > >>>>>>>>>>>>>> with whatever users may want to use, and there are a bunch of them
> > >>>>>>>>>>>>>> out there. Sure, Impala is better than Hive, but what if users are
> > >>>>>>>>>>>>>> already using something else like Drill or Presto?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Personally, I would not assume that users are willing to deploy
> > >>>>>>>>>>>>>> all of that and make their existing stack more complicated; at the
> > >>>>>>>>>>>>>> very least I would say it is an uphill battle. Things have been
> > >>>>>>>>>>>>>> changing rapidly in the big data space, so whatever we think is
> > >>>>>>>>>>>>>> standard won't be standard anymore, but importantly there
> > >>>>>>>>>>>>>> shouldn't be any reason why we shouldn't be flexible, right?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Also, I am not sure why only YARN. Why not make that more
> > >>>>>>>>>>>>>> flexible too, so users can pick Mesos or standalone?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I think flexibility is key for wide adoption, rather than the
> > >>>>>>>>>>>>>> tightly coupled architecture.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks!
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza
> > >>>>>>>>>>>>>> <[email protected]>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> PS: you need a big data platform to be able to collect all those
> > >>>>>>>>>>>>>>> netflows and logs.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Spot isn't intended for SMBs, that's clear, then you need loads
> > >>>>>>>>>>>>>>> of data to get ML working properly, and somewhere to run those
> > >>>>>>>>>>>>>>> algorithms. That is Hadoop.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Regards!
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Kenneth
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Sent from my Mi phone
> > >>>>>>>>>>>>>>> On Apr 14, 2017 4:04 AM, kant kodali <[email protected]> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Thanks for starting this thread. Here is my feedback.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I somehow think the architecture is too complicated for wide
> > >>>>>>>>>>>>>>> adoption since it requires installing the following:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> HDFS.
> > >>>>>>>>>>>>>>> HIVE.
> > >>>>>>>>>>>>>>> IMPALA.
> > >>>>>>>>>>>>>>> KAFKA.
> > >>>>>>>>>>>>>>> SPARK (YARN).
> > >>>>>>>>>>>>>>> YARN.
> > >>>>>>>>>>>>>>> Zookeeper.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Currently there are way too many dependencies, which discourages
> > >>>>>>>>>>>>>>> a lot of users from using it because they have to go through
> > >>>>>>>>>>>>>>> deployment of all that required software. I think for wide
> > >>>>>>>>>>>>>>> adoption we should minimize the dependencies and have a more
> > >>>>>>>>>>>>>>> pluggable architecture. For example, I am not sure why Hive and
> > >>>>>>>>>>>>>>> Impala are both required. Why not just use Spark SQL, since it's
> > >>>>>>>>>>>>>>> already a dependency? Or, say, users may want to use their own
> > >>>>>>>>>>>>>>> distributed query engine they like, such as Apache Drill or
> > >>>>>>>>>>>>>>> something else. We should be flexible enough to provide that
> > >>>>>>>>>>>>>>> option.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Also, I see that HDFS is used such that collectors can receive
> > >>>>>>>>>>>>>>> file paths through Kafka and be able to read a file. How big
> > >>>>>>>>>>>>>>> are these files? Do we really need HDFS for this? Why not provide
> > >>>>>>>>>>>>>>> more ways to send data, such as sending data directly through
> > >>>>>>>>>>>>>>> Kafka or, say, just leaving it up to the user to specify the file
> > >>>>>>>>>>>>>>> location as an argument to the collector process?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Finally, I learnt that to generate NetFlow data one would
> > >>>>>>>>>>>>>>> require specific hardware. This really means Apache Spot is
> > >>>>>>>>>>>>>>> not meant for everyone. I thought Apache Spot could be used to
> > >>>>>>>>>>>>>>> analyze the network traffic of any machine, but if it requires
> > >>>>>>>>>>>>>>> specific hardware then I think it is targeted at a specific group
> > >>>>>>>>>>>>>>> of people.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> The real strength of Apache Spot should mainly be just analyzing
> > >>>>>>>>>>>>>>> network traffic through ML.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Thanks!
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <
> > >>>>>>>>>>>>>>> [email protected]> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Thanks, Nate,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Nate.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> -----Original Message-----
> > >>>>>>>>>>>>>>>> From: Nate Smith [mailto:[email protected]]
> > >>>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
> > >>>>>>>>>>>>>>>> To: [email protected]
> > >>>>>>>>>>>>>>>> Cc: [email protected]; [email protected]
> > >>>>>>>>>>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> I was really hoping it came through ok, Oh well :) Here’s an
> > >>>>>>>>>>>>>>>> image form:
> > >>>>>>>>>>>>>>>> http://imgur.com/a/DUDsD
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L
> > >>>>>>>>>>>>>>>> <[email protected]> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> The diagram became garbled in the text format.
> > >>>>>>>>>>>>>>>>> Could you resend it as a pdf?
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>> Nate
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> -----Original Message-----
> > >>>>>>>>>>>>>>>>> From: Nathanael Smith [mailto:[email protected]]
> > >>>>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
> > >>>>>>>>>>>>>>>>> To: [email protected]; [email protected]; [email protected]
> > >>>>>>>>>>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> How would you like to see Spot-ingest change?
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> A. Continue development on the Python Master/Worker with focus on
> > >>>>>>>>>>>>>>>>>    performance / error handling / logging
> > >>>>>>>>>>>>>>>>> B. Develop Scala based ingest to be inline with code base from
> > >>>>>>>>>>>>>>>>>    ingest, ml, to OA (UI to continue being ipython/JS)
> > >>>>>>>>>>>>>>>>> C. Python ingest Worker with Scala based Spark code for
> > >>>>>>>>>>>>>>>>>    normalization and input into DB
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Including the high level diagram:
> > >>>>>>>>>>>>>>>>> +------------------------------------------------------------------------------------------+
> > >>>>>>>>>>>>>>>>> | +--------------------------+                                  +-----------------+        |
> > >>>>>>>>>>>>>>>>> | | Master                   |  A. B. C.                        | Worker          |        |
> > >>>>>>>>>>>>>>>>> | |    A. Python             +---------------+      A.          |   A. Python     |        |
> > >>>>>>>>>>>>>>>>> | |    B. Scala              |               |    +------------->                 +----+   |
> > >>>>>>>>>>>>>>>>> | |    C. Python             |               |    |             |                 |    |   |
> > >>>>>>>>>>>>>>>>> | +---^------+---------------+               |    |             +-----------------+    |   |
> > >>>>>>>>>>>>>>>>> |     |      |                               |    |                                    |   |
> > >>>>>>>>>>>>>>>>> |     |      |                               |    |                                    |   |
> > >>>>>>>>>>>>>>>>> |     |     +Note--------------+             |    |             +-----------------+    |   |
> > >>>>>>>>>>>>>>>>> |     |     |Running on a      |             |    |             | Spark Streaming |    |   |
> > >>>>>>>>>>>>>>>>> |     |     |worker node in    |             |    |  B. C.      | B. Scala        |    |   |
> > >>>>>>>>>>>>>>>>> |     |     |the Hadoop cluster|             |    |    +--------> C. Scala        +-+  |   |
> > >>>>>>>>>>>>>>>>> |     |     +------------------+             |    |    |        |                 | |  |   |
> > >>>>>>>>>>>>>>>>> |   A.|                                      |    |    |        +-----------------+ |  |   |
> > >>>>>>>>>>>>>>>>> |   B.|                                      |    |    |                            |  |   |
> > >>>>>>>>>>>>>>>>> |   C.|                                      |    |    |                            |  |   |
> > >>>>>>>>>>>>>>>>> | +----------------------+          +-v------+----+----+-+           +--------------v--v-+ |
> > >>>>>>>>>>>>>>>>> | |                      |          |                    |           |                   | |
> > >>>>>>>>>>>>>>>>> | |   Local FS:          |          |    hdfs            |           |  Hive / Impala    | |
> > >>>>>>>>>>>>>>>>> | |  - Binary/Text       |          |                    |           |  - Parquet -      | |
> > >>>>>>>>>>>>>>>>> | |    Log files -       |          |                    |           |                   | |
> > >>>>>>>>>>>>>>>>> | |                      |          |                    |           |                   | |
> > >>>>>>>>>>>>>>>>> | +----------------------+          +--------------------+           +-------------------+ |
> > >>>>>>>>>>>>>>>>> +------------------------------------------------------------------------------------------+
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Please let me know your thoughts,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> - Nathanael
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Michael Ridley <[email protected]>
> > >>>> office: (650) 352-1337
> > >>>> mobile: (571) 438-2420
> > >>>> Senior Solutions Architect
> > >>>> Cloudera, Inc.
> > >>>
> > >>>
> > >>
> > >>
> > >> --
> > >> Michael Ridley <[email protected]>
> > >> office: (650) 352-1337
> > >> mobile: (571) 438-2420
> > >> Senior Solutions Architect
> > >> Cloudera, Inc.
> > >>
> >
> >
>
