On Wed, Apr 19, 2017 at 11:50 AM, Austin Leahy <[email protected]> wrote:
> I think there are some technical decisions that need to be made, but there
> are also some important product and community issues to balance here, and
> it's important to get close to the same page.
>
> 1. Who is our goal technical constituency?
>
> 2. Who is the strongest technical constituency that will give us the
> momentum to exit the incubator and keep the project alive?
>
> 3. If the difference between our goal technical constituency and our
> strongest technical constituency is significant, how do we build abstraction
> into the project so that we can serve our goal constituents in the long
> run?
>
> Answering these questions makes answering some of the core technical
> questions easier.
>
> For example, in my last cluster our data volumes were such that if data
> analysis and storage for Spot was built on passing uncompressed CSV back
> and forth, we wouldn't ever be able to give serious thought to an
> implementation.
>
> I understand that there are many ways to deploy and utilize Spark. I have
> used several of them. But until we have a straightforward deployable
> product with several public implementations, I think that we should agree on
> a single supported architecture and punt discussions of interchangeable
> storage engines and finer points like "how to support Mesos" till after our
> first major release.
>

I completely agree.

>
> On Wed, Apr 19, 2017 at 10:19 AM Smith, Nathanael P <[email protected]> wrote:
>
> > Mark,
> >
> > just digesting the below.
> >
> > Backing up in my thought process, I was thinking that the ingest master
> > (first point of entry into the system) would want to put the data into a
> > standard serializable format. I was thinking that libraries (such as
> > pyarrow in this case) could help by writing the data in Parquet format
> > early in the process. You are probably correct that at this point in time
> > it might not be worth the time and can be kept in the backlog.
> > That being said, I still think the master should produce data in a
> > standard format. What, in your opinion (and I open this up of course to
> > others), would be the most logical format?
> > The most basic would be to just keep it as a .csv.
> >
> > The master will likely write data to a staging directory in HDFS where the
> > Spark Streaming job will pick it up for normalization/writing to Parquet in
> > the correct block sizes and partitions.
> >
> > - Nathanael
> >
> > > On Apr 17, 2017, at 1:12 PM, Mark Grover <[email protected]> wrote:
> > >
> > > Thanks all for your opinions.
> > >
> > > I think it's good to consider two things:
> > > 1. What do (we think) users care about?
> > > 2. What's the cost of changing things?
> > >
> > > About #1, I think users care more about what format data is written in
> > > than how the data is written. I'd argue that whether that uses Hive, MR,
> > > or a custom Parquet writer is not as important to them as long as we
> > > maintain data/format compatibility.
> > > About #2, having worked on several projects, I find that it's rather
> > > difficult to keep up with Parquet. Even in Spark, there are a few
> > > different ways to write to Parquet - there's a regular mode, and a
> > > legacy mode
> > > <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L44>
> > > which continues to cause confusion
> > > <https://issues.apache.org/jira/browse/SPARK-20297> to date. Parquet
> > > itself is pretty dependent on Hadoop
> > > <https://github.com/Parquet/parquet-mr/search?l=Maven+POM&q=hadoop&type=&utf8=%E2%9C%93>
> > > and just integrating it with systems with a lot of developers (like Spark
> > > <https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=spark+parquet+jiras>)
> > > is still a lot of work.
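
A rough sketch of the staging hand-off Nathanael describes above, in case it helps the discussion. It assumes a local directory standing in for the HDFS staging directory and gzip-compressed CSV as the interim format; `stage_records`, the file naming, and the dot-prefixed temporary name are all made up for illustration and are not Spot code. The only idea shown is that the master writes under a temporary name and renames once the file is complete, so a downstream job never reads a half-written file and the Parquet writing can stay in Spark.

```python
import csv
import gzip
import os
import tempfile


def stage_records(rows, header, staging_dir, name):
    """Write rows as gzip-compressed CSV, then publish the file atomically.

    Hypothetical helper: a local directory stands in for the HDFS staging area.
    """
    os.makedirs(staging_dir, exist_ok=True)
    final_path = os.path.join(staging_dir, name + ".csv.gz")

    # Temporary file in the same directory, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir, prefix=".", suffix=".tmp")
    os.close(fd)
    try:
        with gzip.open(tmp_path, "wt", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(header)
            writer.writerows(rows)
        # Rename only after the file is fully written and closed.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Do not leave partial files behind in the staging directory.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

Whether a given consumer skips dot-prefixed files should be checked for that consumer, and on HDFS the write and rename would go through an HDFS client rather than `os`.
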
> > >
> > > I personally think we should leverage higher-level tools like Hive or
> > > Spark to write data in widespread formats (Parquet being a very good
> > > example), but I personally wouldn't encourage us to manage the writers
> > > ourselves.
> > >
> > > Thoughts?
> > > Mark
> > >
> > > On Mon, Apr 17, 2017 at 11:44 AM, Michael Ridley <[email protected]> wrote:
> > >
> > >> Without having given it too terribly much thought, that seems like an OK
> > >> approach.
> > >>
> > >> Michael
> > >>
> > >> On Mon, Apr 17, 2017 at 2:33 PM, Nathanael Smith <[email protected]> wrote:
> > >>
> > >>> I think the question is rather whether we can write the data generically
> > >>> to HDFS as Parquet without the use of Hive/Impala.
> > >>>
> > >>> Today we write Parquet data using the Hive/MapReduce method.
> > >>> As part of the redesign I'd like to use libraries for this as opposed to
> > >>> a Hadoop dependency.
> > >>> I think it would be preferred to use the Python master to write the data
> > >>> into the format we want, then do normalization of the data in Spark
> > >>> Streaming.
> > >>> Any thoughts?
> > >>>
> > >>> - Nathanael
> > >>>
> > >>>> On Apr 17, 2017, at 11:08 AM, Michael Ridley <[email protected]> wrote:
> > >>>>
> > >>>> I had thought that the plan was to write the data in Parquet in HDFS
> > >>>> ultimately.
> > >>>>
> > >>>> Michael
> > >>>>
> > >>>> On Sun, Apr 16, 2017 at 11:55 AM, kant kodali <[email protected]> wrote:
> > >>>>
> > >>>>> Hi Mark,
> > >>>>>
> > >>>>> Thank you so much for hearing my argument. And I definitely understand
> > >>>>> that you guys have a bunch of things to do. My only concern is that I
> > >>>>> hope it doesn't take too long to support other backends. For example,
> > >>>>> @Kenneth gave the example of LAMP-stack apps that have not moved away
> > >>>>> from MySQL yet, which essentially means it's probably a decade? I see
> > >>>>> that in the current architecture the results from Python
> > >>>>> multiprocessing or Spark Streaming are written back to HDFS. If so, can
> > >>>>> we write them in Parquet format, such that users should be able to plug
> > >>>>> in any query engine? But again, I am not pushing you guys to do this
> > >>>>> right away or anything, just seeing if there is a way for me to get
> > >>>>> started in parallel, and if it's not feasible, it's fine. I just wanted
> > >>>>> to share my 2 cents and I am glad my argument is heard!
> > >>>>>
> > >>>>> Thanks much!
> > >>>>>
> > >>>>> On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <[email protected]> wrote:
> > >>>>>
> > >>>>>> Hi Kant,
> > >>>>>> Just wanted to make sure you don't feel like we are ignoring your
> > >>>>>> comment :-) I hear you about pluggability.
> > >>>>>>
> > >>>>>> The design can and should be pluggable, but the project has one stack
> > >>>>>> it ships out of the box with, one stack that's the default stack in
> > >>>>>> the sense that it's the most tested and so on. And, for us, that's our
> > >>>>>> current stack.
> > >>>>>> If we were to take Apache Hive as an example, it shipped (and ships)
> > >>>>>> with MapReduce as the default execution engine. At some point, Apache
> > >>>>>> Tez came along and wanted Hive to run on Tez, so they made a bunch of
> > >>>>>> things pluggable to run Hive on Tez (instead of the only option up
> > >>>>>> until then: Hive-on-MR), and then Apache Spark came and re-used some
> > >>>>>> of that pluggability and even added some more so Hive-on-Spark could
> > >>>>>> become a reality. In the same way, I don't think anyone disagrees here
> > >>>>>> that pluggability is a good thing, but it's hard to do pluggability
> > >>>>>> right, and at the right level, unless one has a clear use-case in mind.
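
To make the "one default stack, other engines pluggable" idea concrete, here is a minimal sketch of a registry with a default. Everything in it is hypothetical: the `Engine` interface, the engine names, and `DEFAULT_ENGINE` are illustrative assumptions, and the `submit` methods only tag the query string rather than talk to Impala or Spark.

```python
from typing import Dict, Optional, Type


class Engine:
    """Minimal interface that every query backend would implement."""

    name = "base"

    def submit(self, sql: str) -> str:
        raise NotImplementedError


# Registry of known engines, keyed by name (illustrative only).
_ENGINES: Dict[str, Type[Engine]] = {}

# Assumption: the shipped, most-tested stack is the default.
DEFAULT_ENGINE = "impala"


def register(cls: Type[Engine]) -> Type[Engine]:
    """Class decorator that adds an engine to the registry."""
    _ENGINES[cls.name] = cls
    return cls


@register
class ImpalaEngine(Engine):
    name = "impala"

    def submit(self, sql: str) -> str:
        return f"[impala] {sql}"  # placeholder, no real client call


@register
class SparkSqlEngine(Engine):
    name = "spark-sql"

    def submit(self, sql: str) -> str:
        return f"[spark-sql] {sql}"  # placeholder, no real client call


def get_engine(name: Optional[str] = None) -> Engine:
    """Return the requested engine, or the default when none is named."""
    key = name or DEFAULT_ENGINE
    if key not in _ENGINES:
        raise ValueError(f"unknown engine {key!r}; known: {sorted(_ENGINES)}")
    return _ENGINES[key]()
```

A third-party backend would then be a patch that adds one registered class, without touching the default path.
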
> > >>>>>>
> > >>>>>> As a project, we have many things to do and I personally think the
> > >>>>>> biggest bang for the buck for us in making Spot a really solid and the
> > >>>>>> best cyber security solution isn't pluggability but the things we are
> > >>>>>> working on - a better user interface, a common/unified approach to
> > >>>>>> storing and modeling data, etc.
> > >>>>>>
> > >>>>>> Having said that, we are open; if it's important to you or someone
> > >>>>>> else, we'd be happy to receive and review those patches.
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>> Mark
> > >>>>>>
> > >>>>>> On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <[email protected]> wrote:
> > >>>>>>
> > >>>>>>> Thanks Ross! And yes, option C sounds good to me as well; however, I
> > >>>>>>> just think the distributed SQL query engine and the resource manager
> > >>>>>>> should be pluggable.
> > >>>>>>>
> > >>>>>>> On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <[email protected]> wrote:
> > >>>>>>>
> > >>>>>>>> Option C is to use Python on the front end of the ingest pipeline
> > >>>>>>>> and Spark/Scala on the back end.
> > >>>>>>>>
> > >>>>>>>> Option A uses Python workers on the back end.
> > >>>>>>>>
> > >>>>>>>> Option B uses all Scala.
> > >>>>>>>>
> > >>>>>>>> -----Original Message-----
> > >>>>>>>> From: kant kodali [mailto:[email protected]]
> > >>>>>>>> Sent: Friday, April 14, 2017 9:53 AM
> > >>>>>>>> To: [email protected]
> > >>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
> > >>>>>>>>
> > >>>>>>>> What is option C? Am I missing an email or something?
> > >>>>>>>>
> > >>>>>>>> On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <[email protected]> wrote:
> > >>>>>>>>
> > >>>>>>>>> +1 for Python 3.x
> > >>>>>>>>>
> > >>>>>>>>> On 4/14/2017 11:59 AM, Austin Leahy wrote:
> > >>>>>>>>>
> > >>>>>>>>>> I think that C is the strong solution; getting the ingest really
> > >>>>>>>>>> strong is going to lower barriers to adoption. Doing it in Python
> > >>>>>>>>>> will open up the ingest portion of the project to include many
> > >>>>>>>>>> more developers.
> > >>>>>>>>>>
> > >>>>>>>>>> Before it comes up I would like to throw the following on the
> > >>>>>>>>>> pile... Major Python projects (Django, Flask, and others) are
> > >>>>>>>>>> dropping 2.x support in releases scheduled in the next 6 to 8
> > >>>>>>>>>> months. Hadoop projects in general tend to lag in modern Python
> > >>>>>>>>>> support; let's please build this in 3.5 so that we don't have to
> > >>>>>>>>>> immediately expect a rebuild in the pipeline.
> > >>>>>>>>>>
> > >>>>>>>>>> -Vote C
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks Nate
> > >>>>>>>>>>
> > >>>>>>>>>> Austin
> > >>>>>>>>>>
> > >>>>>>>>>> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <[email protected]> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> I really like option C because it gives a lot of flexibility for
> > >>>>>>>>>>> ingest (Python vs Scala) but still has the robust Spark Streaming
> > >>>>>>>>>>> backend for performance.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks for putting this together, Nate.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Alan
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <[email protected]> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> I agree. We should continue making the existing stack more mature
> > >>>>>>>>>>>> at this point. Maybe if we have enough community support we can
> > >>>>>>>>>>>> add additional datastores.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Chokha.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On 4/14/17 11:10 AM, [email protected] wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi Kant,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> YARN is the standard scheduler in Hadoop. If you're using
> > >>>>>>>>>>>>> Hive+Spark, then sure you'll have YARN.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Haven't seen any Hive on Mesos so far. As said, Spot is based on
> > >>>>>>>>>>>>> a quite standard Hadoop stack and I wouldn't switch too many
> > >>>>>>>>>>>>> pieces yet.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> In most open-source projects you start by relying on a
> > >>>>>>>>>>>>> well-known stack and then you begin to support other DB backends
> > >>>>>>>>>>>>> once it's quite mature. Think of the loads of LAMP apps which
> > >>>>>>>>>>>>> haven't been ported away from MySQL yet.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> In any case, you'll need high-performance SQL + Massive Storage
> > >>>>>>>>>>>>> + Machine Learning + Massive Ingestion, and... ATM, that can
> > >>>>>>>>>>>>> only be provided by Hadoop.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Regards!
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Kenneth
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 2017-04-14 12:56, kant kodali wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Hi Kenneth,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks for the response. I think you made a case for HDFS;
> > >>>>>>>>>>>>>> however, users may want to use S3 or some other FS, in which
> > >>>>>>>>>>>>>> case they can use Auxilio (hoping that there are no changes
> > >>>>>>>>>>>>>> needed within Spot, in which case I can agree to that). For
> > >>>>>>>>>>>>>> example, Netflix stores all their data in S3.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The distributed SQL query engine, I would say, should be
> > >>>>>>>>>>>>>> pluggable with whatever the user may want to use, and there are
> > >>>>>>>>>>>>>> a bunch of them out there. Sure, Impala is better than Hive,
> > >>>>>>>>>>>>>> but what if users are already using something else like Drill
> > >>>>>>>>>>>>>> or Presto?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I personally would not assume that users are willing to deploy
> > >>>>>>>>>>>>>> all of that and make their existing stack more complicated; at
> > >>>>>>>>>>>>>> the very least I would say it is an uphill battle. Things have
> > >>>>>>>>>>>>>> been changing rapidly in the big data space, so whatever we
> > >>>>>>>>>>>>>> think is standard won't be standard anymore, but importantly
> > >>>>>>>>>>>>>> there shouldn't be any reason why we shouldn't be flexible,
> > >>>>>>>>>>>>>> right?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Also, I am not sure why only YARN? Why not make that also more
> > >>>>>>>>>>>>>> flexible so users can pick Mesos or standalone?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I think flexibility is key for wide adoption, rather than a
> > >>>>>>>>>>>>>> tightly coupled architecture.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks!
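
On the YARN/Mesos/standalone point: at the `spark-submit` level the cluster manager is selected through the `--master` URL, so one way to keep it configurable is to treat it as a setting. A small sketch, assuming hypothetical helper names (`spark_master_url`, `submit_command`) and the usual default ports (5050 for Mesos, 7077 for a standalone master); it only builds the command line and says nothing about whether the rest of the stack (Hive, Impala) would run there.

```python
from typing import List


def spark_master_url(manager: str, host: str = "localhost") -> str:
    """Map a cluster-manager name to a Spark master URL (illustrative helper)."""
    urls = {
        "yarn": "yarn",
        "mesos": f"mesos://{host}:5050",
        "standalone": f"spark://{host}:7077",
        "local": "local[*]",
    }
    if manager not in urls:
        raise ValueError(f"unsupported cluster manager: {manager!r}")
    return urls[manager]


def submit_command(manager: str, app: str, host: str = "localhost") -> List[str]:
    """Build the spark-submit argument list for the chosen cluster manager."""
    return ["spark-submit", "--master", spark_master_url(manager, host), app]
```
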
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <[email protected]> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> PS: you need a big data platform to be able to collect all
> > >>>>>>>>>>>>>>> those netflows and logs.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Spot isn't intended for SMBs, that's clear; then you need
> > >>>>>>>>>>>>>>> loads of data to get ML working properly, and somewhere to
> > >>>>>>>>>>>>>>> run those algorithms. That is Hadoop.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Regards!
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Kenneth
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Sent from my Mi phone
> > >>>>>>>>>>>>>>> On Apr 14, 2017 4:04 AM, kant kodali <[email protected]> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Thanks for starting this thread. Here is my feedback.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I somehow think the architecture is too complicated for wide
> > >>>>>>>>>>>>>>> adoption since it requires installing the following:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> HDFS.
> > >>>>>>>>>>>>>>> HIVE.
> > >>>>>>>>>>>>>>> IMPALA.
> > >>>>>>>>>>>>>>> KAFKA.
> > >>>>>>>>>>>>>>> SPARK (YARN).
> > >>>>>>>>>>>>>>> YARN.
> > >>>>>>>>>>>>>>> Zookeeper.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Currently there are way too many dependencies, which
> > >>>>>>>>>>>>>>> discourages a lot of users from using it because they have to
> > >>>>>>>>>>>>>>> go through deployment of all that required software. I think
> > >>>>>>>>>>>>>>> for wide adoption we should minimize the dependencies and have
> > >>>>>>>>>>>>>>> a more pluggable architecture. For example, I am not sure why
> > >>>>>>>>>>>>>>> Hive and Impala are both required. Why not just use Spark SQL,
> > >>>>>>>>>>>>>>> since it's already a dependency? Or, say, users may want to
> > >>>>>>>>>>>>>>> use their own distributed query engine, such as Apache Drill
> > >>>>>>>>>>>>>>> or something else. We should be flexible enough to provide
> > >>>>>>>>>>>>>>> that option.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Also, I see that HDFS is used such that collectors can receive
> > >>>>>>>>>>>>>>> file paths through Kafka and be able to read a file. How big
> > >>>>>>>>>>>>>>> are these files? Do we really need HDFS for this? Why not
> > >>>>>>>>>>>>>>> provide more ways to send data, such as sending data directly
> > >>>>>>>>>>>>>>> through Kafka or, say, just leaving it up to the user to
> > >>>>>>>>>>>>>>> specify the file location as an argument to the collector
> > >>>>>>>>>>>>>>> process?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Finally, I learnt that to generate NetFlow data one would
> > >>>>>>>>>>>>>>> require specific hardware. This really means Apache Spot is
> > >>>>>>>>>>>>>>> not meant for everyone. I thought Apache Spot could be used to
> > >>>>>>>>>>>>>>> analyze the network traffic of any machine, but if it requires
> > >>>>>>>>>>>>>>> specific hardware then I think it is targeted at a specific
> > >>>>>>>>>>>>>>> group of people.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> The real strength of Apache Spot should mainly be just
> > >>>>>>>>>>>>>>> analyzing network traffic through ML.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Thanks!
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <[email protected]> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Thanks, Nate,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Nate.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> -----Original Message-----
> > >>>>>>>>>>>>>>>> From: Nate Smith [mailto:[email protected]]
> > >>>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
> > >>>>>>>>>>>>>>>> To: [email protected]
> > >>>>>>>>>>>>>>>> Cc: [email protected]; [email protected]
> > >>>>>>>>>>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> I was really hoping it came through OK. Oh well :) Here's an
> > >>>>>>>>>>>>>>>> image form:
> > >>>>>>>>>>>>>>>> http://imgur.com/a/DUDsD
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <[email protected]> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> The diagram became garbled in the text format.
> > >>>>>>>>>>>>>>>>> Could you resend it as a pdf?
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>> Nate
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> -----Original Message-----
> > >>>>>>>>>>>>>>>>> From: Nathanael Smith [mailto:[email protected]]
> > >>>>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
> > >>>>>>>>>>>>>>>>> To: [email protected]; [email protected]; [email protected]
> > >>>>>>>>>>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> How would you like to see Spot-ingest change?
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> A. Continue development on the Python Master/Worker with
> > >>>>>>>>>>>>>>>>>    focus on performance / error handling / logging
> > >>>>>>>>>>>>>>>>> B. Develop Scala-based ingest to be in line with the code
> > >>>>>>>>>>>>>>>>>    base from ingest, ML, to OA (UI to continue being
> > >>>>>>>>>>>>>>>>>    IPython/JS)
> > >>>>>>>>>>>>>>>>> C. Python ingest Worker with Scala-based Spark code for
> > >>>>>>>>>>>>>>>>>    normalization and input into DB
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Including the high level diagram:
> > >>>>>>>>>>>>>>>>> [ASCII diagram garbled in transmission. Recoverable labels:
> > >>>>>>>>>>>>>>>>> Master (A. Python, B. Scala, C. Python); Worker (A. Python);
> > >>>>>>>>>>>>>>>>> Spark Streaming (B. Scala, C. Scala); a note reading "Running
> > >>>>>>>>>>>>>>>>> on a worker node in the Hadoop cluster"; storage boxes
> > >>>>>>>>>>>>>>>>> Local FS (binary/text log files), hdfs, and Hive / Impala
> > >>>>>>>>>>>>>>>>> (Parquet).]
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Please let me know your thoughts,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> - Nathanael
> > >>>>
> > >>>> --
> > >>>> Michael Ridley <[email protected]>
> > >>>> office: (650) 352-1337
> > >>>> mobile: (571) 438-2420
> > >>>> Senior Solutions Architect
> > >>>> Cloudera, Inc.
