So this is basically why the flume suggestion has come up. Flume natively acts as a syslog listener and will write files to basically anything (HDFS, Hive, HBase, S3).
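For reference, a Flume agent doing exactly that is a short properties file. The sketch below is illustrative only (agent, channel, and path names are made up), not a tested config; the source/sink types and property keys are the standard Flume ones:

```properties
# Illustrative Flume agent: syslog in over UDP, files out to HDFS.
agent.sources = syslog-src
agent.channels = mem-ch
agent.sinks = hdfs-sink

agent.sources.syslog-src.type = syslogudp
agent.sources.syslog-src.host = 0.0.0.0
agent.sources.syslog-src.port = 5140
agent.sources.syslog-src.channels = mem-ch

agent.channels.mem-ch.type = memory
agent.channels.mem-ch.capacity = 10000

agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.channel = mem-ch
agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/syslog/%Y-%m-%d
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
```

Swapping the sink type is how the "write to basically anything" part works; the source config stays the same.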
On Thu, Apr 20, 2017 at 8:15 AM, Michael Ridley <[email protected]> wrote:

When we say ingest from Kafka, what does that mean? I understand we can read from Kafka to ingest into the cluster, but how will the data get to Kafka, and what data are we talking about? My understanding is that right now the primary data sources would be Netflow and Syslog, neither of which writes to Kafka natively, so we would need something like StreamSets in the middle. Certainly StreamSets UDP source -> Kafka would work.

Michael

On Wed, Apr 19, 2017 at 7:05 PM, kant kodali <[email protected]> wrote:

Sure, I guess Kafka has something called Kafka Connect, but it may not be as mature as Flume since I heard about this recently.

On Wed, Apr 19, 2017 at 3:39 PM, Austin Leahy <[email protected]> wrote:

The advantage of Flume, or a Flume/Kafka hybrid, is that the team doesn't have to build sinks for any new source types added to the project; just create configs pointing to the landing pad.

On Wed, Apr 19, 2017 at 3:31 PM, kant kodali <[email protected]> wrote:

What kind of benchmarks are we looking for? Just throughput? I am assuming this is for ingestion. I haven't seen anything faster than Kafka, and that is because of its simplicity: after all, the publisher appends messages to a file (the so-called partition in Kafka) and clients just do sequential reads from that file, so it's a matter of disk throughput. The benchmark numbers I have for Kafka are at the very least 75K messages/sec, where each message is 1KB, on m4.xlarge, which by default has EBS storage (EBS is network-attached SSD disk). The network-attached disk has a max throughput of 125 MB/s (m4.xlarge has 1 Gigabit), but if we were to deploy on ephemeral storage (local SSD) and on a 10 Gigabit network we would easily get 5-10X more.
No idea about Flume.

Finally, I am not trying to pitch for Kafka; however, it is the fastest I have seen, but if someone has better numbers for Flume then we should use that. Also, I would suspect there are benchmarks for Kafka vs Flume available online already, or we can try it with our own datasets.

Thanks!

On Wed, Apr 19, 2017 at 3:09 PM, Austin Leahy <[email protected]> wrote:

I am happy to create and test a flume source... #intelteam would need to create the benchmark by deploying it and pointing a data source at it... since I don't have a good enough volume of source data handy.

On Wed, Apr 19, 2017 at 3:04 PM, Ross, Alan D <[email protected]> wrote:

We discussed this in our staff meeting a bit today. I would like to see some benchmarking of different approaches (Kafka, Flume, etc.) to see what the numbers look like. Is anyone in the community willing to volunteer on this work?

-----Original Message-----
From: Austin Leahy [mailto:[email protected]]
Sent: Wednesday, April 19, 2017 1:05 PM
To: [email protected]
Subject: Re: [Discuss] - Future plans for Spot-ingest

I think Kafka is probably a red herring. It's an industry go-to in the application world because of redundancy, but the type and volumes of network telemetry that we are talking about here will bog Kafka down unless you dedicate really serious hardware to just the Kafka implementation. It's essentially the next level of the problem that the team was already running into when RabbitMQ was queueing in data.
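The arithmetic behind the EBS ceiling quoted above can be checked directly. This uses only the figures from the message (no new measurements):

```python
# Sanity-check the Kafka-on-EBS numbers quoted in the thread:
# 75K messages/sec at 1 KB/message on an m4.xlarge with a 125 MB/s EBS cap.
msgs_per_sec = 75_000        # claimed lower bound
msg_size_bytes = 1_024       # 1 KB per message

throughput_mb_s = msgs_per_sec * msg_size_bytes / 1_000_000
print(f"{throughput_mb_s:.1f} MB/s")   # 76.8 MB/s

ebs_cap_mb_s = 125.0
headroom = ebs_cap_mb_s / throughput_mb_s
print(f"{headroom:.2f}x headroom before saturating EBS")
```

So the quoted benchmark already sits at roughly 60% of the EBS cap, which is why moving to local SSD and a 10 Gigabit network is where the claimed 5-10X would have to come from.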
On Wed, Apr 19, 2017 at 12:33 PM, Mark Grover <[email protected]> wrote:

On Wed, Apr 19, 2017 at 10:19 AM, Smith, Nathanael P <[email protected]> wrote:

Mark,

Just digesting the below. Backing up in my thought process, I was thinking that the ingest master (first point of entry into the system) would want to put the data into a standard serializable format. I was thinking that libraries (such as pyarrow in this case) could help by writing the data in parquet format early in the process. You are probably correct that at this point in time it might not be worth the time, and it can be kept in the backlog.

That being said, I still think the master should produce data in a standard format. What, in your opinion (and I open this up of course to others), would be the most logical format? The most basic would be to just keep it as a .csv.

The master will likely write data to a staging directory in HDFS where the spark streaming job will pick it up for normalization/writing to parquet in the correct block sizes and partitions.

Hi Nate,

Avro is usually preferred for such a standard format, because it asserts a schema (types, etc.), which CSV doesn't, and it allows for schema evolution, which, depending on the type of evolution, CSV may or may not support. That's something I have seen being done very commonly.
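A quick stdlib illustration of the schema point above: CSV silently flattens every value to a string, so the types have to be re-agreed by every consumer. This is a minimal sketch of the problem (invented field values), not Avro itself:

```python
import csv
import io

# Write a typed record (int timestamp, string IP, int port) through CSV.
buf = io.StringIO()
csv.writer(buf).writerow([1493200500, "10.0.0.1", 443])
buf.seek(0)
row = next(csv.reader(buf))

# Every field comes back as a string; the types were never recorded.
print(row)   # ['1493200500', '10.0.0.1', '443']
assert all(isinstance(field, str) for field in row)

# A schema'd format (Avro, Parquet) records int/str/int in the file itself,
# and can evolve that schema (e.g. add a field with a default) safely.
```

This is exactly what "asserts a schema" buys: the reader does not need a per-pipeline convention to know that the first column is an integer.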
Now, if the data were in Kafka before it gets to the master, one could argue that the master could just send metadata to the workers (topic name, partition number, offset start and end) and the workers could read from Kafka directly. I do understand that'd be a much different architecture than the current one, but if you think it's a good idea too, we could document that, say in a JIRA, and (de-)prioritize it (and in line with the rest of the discussion on this thread, it's not the top-most priority).

Thoughts?

- Nathanael

On Apr 17, 2017, at 1:12 PM, Mark Grover <[email protected]> wrote:

Thanks all for your opinion.

I think it's good to consider two things:
1. What do (we think) users care about?
2. What's the cost of changing things?

About #1, I think users care more about what format the data is written in than how the data is written. I'd argue that whether it uses Hive, MR, or a custom Parquet writer is not as important to them, as long as we maintain data/format compatibility.

About #2, having worked on several projects, I find that it's rather difficult to keep up with Parquet.
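The master-sends-metadata idea above amounts to handing each worker a (topic, partition, offset range) work unit. A minimal sketch of such a descriptor and how a master might shard it across workers (all names are illustrative; no Kafka client is involved):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KafkaWorkUnit:
    """Span of one Kafka partition for a worker to read (illustrative)."""
    topic: str
    partition: int
    start_offset: int   # inclusive
    end_offset: int     # exclusive

    def count(self) -> int:
        return self.end_offset - self.start_offset

def shard(unit: KafkaWorkUnit, n_workers: int) -> list:
    """Split one offset range into up to n_workers contiguous chunks."""
    step = -(-unit.count() // n_workers)  # ceiling division
    return [
        KafkaWorkUnit(unit.topic, unit.partition,
                      s, min(s + step, unit.end_offset))
        for s in range(unit.start_offset, unit.end_offset, step)
    ]

# The master ships these small descriptors; workers pull the bytes themselves.
units = shard(KafkaWorkUnit("netflow", 0, 0, 10_000), 3)
print([u.count() for u in units])   # [3334, 3334, 3332]
```

The appeal of the design is visible here: the master never touches the payload, only a few integers per worker.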
Even in Spark, there are a few different ways to write to Parquet - there's a regular mode, and a legacy mode
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L44>
which continues to cause confusion <https://issues.apache.org/jira/browse/SPARK-20297> to date. Parquet itself is pretty dependent on Hadoop
<https://github.com/Parquet/parquet-mr/search?l=Maven+POM&q=hadoop&type=&utf8=%E2%9C%93>
and just integrating it with systems with a lot of developers (like Spark
<https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=spark+parquet+jiras>)
is still a lot of work.

I personally think we should leverage higher-level tools like Hive or Spark to write data in widespread formats (Parquet being a very good example), but I personally wouldn't encourage us to manage the writers ourselves.

Thoughts?
Mark

On Mon, Apr 17, 2017 at 11:44 AM, Michael Ridley <[email protected]> wrote:

Without having given it too terribly much thought, that seems like an OK approach.
Michael

On Mon, Apr 17, 2017 at 2:33 PM, Nathanael Smith <[email protected]> wrote:

I think the question is whether we can write the data generically to HDFS as parquet without the use of hive/impala.

Today we write parquet data using the hive/mapreduce method. As part of the redesign I'd like to use libraries for this as opposed to a hadoop dependency. I think it would be preferred to use the python master to write the data into the format we want, then do normalization of the data in spark streaming. Any thoughts?

- Nathanael

On Apr 17, 2017, at 11:08 AM, Michael Ridley <[email protected]> wrote:

I had thought that the plan was to write the data in Parquet in HDFS ultimately.

Michael

On Sun, Apr 16, 2017 at 11:55 AM, kant kodali <[email protected]> wrote:

Hi Mark,

Thank you so much for hearing my argument. And I definitely understand that you guys have a bunch of things to do.
My only concern is that I hope it doesn't take too long to support other backends. For example, @Kenneth had given the example of the LAMP stack not having moved away from MySQL yet, which essentially means it's probably a decade? I see that in the current architecture the results from either python multiprocessing or Spark Streaming are written back to HDFS; if so, can we write them in parquet format, such that users would be able to plug in any query engine? But again, I am not pushing you guys to do this right away or anything, just seeing if there is a way for me to get started in parallel; if it's not feasible, that's fine. I just wanted to share my 2 cents, and I am glad my argument is heard!

Thanks much!

On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <[email protected]> wrote:

Hi Kant,
Just wanted to make sure you don't feel like we are ignoring your comment :-) I hear you about pluggability.
The design can and should be pluggable, but the project has one stack it ships out of the box with, one stack that's the default in the sense that it's the most tested and so on. And, for us, that's our current stack. If we were to take Apache Hive as an example, it shipped (and ships) with MapReduce as the default execution engine. At some point, Apache Tez came along and wanted Hive to run on Tez, so they made a bunch of things pluggable to run Hive on Tez (instead of the only option up until then: Hive-on-MR), and then Apache Spark came and re-used some of that pluggability and even added some more so Hive-on-Spark could become a reality. In the same way, I don't think anyone here disagrees that pluggability is a good thing, but it's hard to do pluggability right, and at the right level, unless one has a clear use-case in mind.

As a project, we have many things to do, and I personally think the biggest bang for the buck for us in making Spot a really solid, best-in-class cyber security solution isn't pluggability but the things we are working on - a better user interface, a common/unified approach to storing and modeling data, etc.

Having said that, we are open; if it's important to you or someone else, we'd be happy to receive and review those patches.

Thanks!
Mark

On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <[email protected]> wrote:

Thanks Ross! And yes, option C sounds good to me as well; however, I just think the distributed SQL query engine and the resource manager should be pluggable.

On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <[email protected]> wrote:

Option C is to use python on the front end of the ingest pipeline and spark/scala on the back end.
Option A uses python workers on the backend.

Option B uses all scala.

-----Original Message-----
From: kant kodali [mailto:[email protected]]
Sent: Friday, April 14, 2017 9:53 AM
To: [email protected]
Subject: Re: [Discuss] - Future plans for Spot-ingest

What is option C? Am I missing an email or something?

On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <[email protected]> wrote:

+1 for Python 3.x

On 4/14/2017 11:59 AM, Austin Leahy wrote:

I think that C is the strong solution; getting the ingest really strong is going to lower barriers to adoption. Doing it in Python will open up the ingest portion of the project to include many more developers.

Before it comes up, I would like to throw the following on the pile... Major python projects (django/flask and others) are dropping 2.x support in releases scheduled in the next 6 to 8 months. Hadoop projects in general tend to lag in modern python support; let's please build this in 3.5 so that we don't have to immediately expect a rebuild in the pipeline.

-Vote C

Thanks Nate

Austin

On Fri, Apr 14, 2017 at 8:52 AM, Alan Ross <[email protected]> wrote:

I really like option C because it gives a lot of flexibility for ingest (python vs scala) but still has the robust spark streaming backend for performance.

Thanks for putting this together Nate.

Alan

On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <[email protected]> wrote:

I agree. We should continue making the existing stack more mature at this point. Maybe if we have enough community support we can add additional datastores.

Chokha.
On 4/14/17 11:10 AM, [email protected] wrote:

Hi Kant,

YARN is the standard scheduler in Hadoop. If you're using Hive+Spark, then sure, you'll have YARN.

Haven't seen any Hive on Mesos so far. As said, Spot is based on a quite standard Hadoop stack, and I wouldn't switch too many pieces yet.

In most open-source projects you start by relying on a well-known stack, and then you begin to support other DB backends once it's quite mature. Think of the loads of LAMP apps which haven't been ported away from MySQL yet.

In any case, you'll need high-performance SQL + massive storage + machine learning + massive ingestion, and... at the moment, that can only be provided by Hadoop.

Regards!

Kenneth

On 2017-04-14 12:56, kant kodali wrote:

Hi Kenneth,

Thanks for the response. I think you made a case for HDFS; however, users may want to use S3 or some other FS, in which case they can use Alluxio (hoping that there are no changes needed within Spot, in which case I can agree to that). For example, Netflix stores all their data in S3.

The distributed SQL query engine, I would say, should be pluggable with whatever the user may want to use, and there are a bunch of them out there. Sure, Impala is better than Hive, but what if users are already using something else like Drill or Presto?
Personally, I would not assume that users are willing to deploy all of that and make their existing stack more complicated; at the very least I would say it is an uphill battle. Things have been changing rapidly in the Big Data space, so whatever we think is standard won't be standard anymore, but more importantly, there shouldn't be any reason why we shouldn't be flexible, right?

Also, I am not sure why only YARN? Why not make that more flexible as well, so users can pick Mesos or standalone?

I think flexibility is the key to wide adoption, rather than a tightly coupled architecture.

Thanks!

On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <[email protected]> wrote:

PS: you need a big data platform to be able to collect all those netflows and logs.

Spot isn't intended for SMBs, that's clear; you need loads of data to get ML working properly, and somewhere to run those algorithms. That is Hadoop.

Regards!

Kenneth

Sent from my Mi phone

On kant kodali <[email protected]>, Apr 14, 2017 4:04 AM wrote:

Hi,

Thanks for starting this thread. Here is my feedback.
I somehow think the architecture is too complicated for wide adoption, since it requires installing the following:

HDFS
HIVE
IMPALA
KAFKA
SPARK (YARN)
YARN
Zookeeper

Currently there are way too many dependencies, which discourages a lot of users from using it because they have to go through deployment of all that required software. I think for wide adoption we should minimize the dependencies and have a more pluggable architecture. For example, I am not sure why both HIVE and IMPALA are required. Why not just use Spark SQL, since Spark is already a dependency? Or users may want to use their own distributed query engine, such as Apache Drill or something else; we should be flexible enough to provide that option.

Also, I see that HDFS is used such that collectors can receive file paths through Kafka and be able to read a file. How big are these files? Do we really need HDFS for this? Why not provide more ways to send data, such as sending data directly through Kafka, or just leaving it up to the user to specify the file location as an argument to the collector process?

Finally, I learnt that to generate Netflow data one would require specific hardware. This really means Apache Spot is not meant for everyone. I thought Apache Spot could be used to analyze the network traffic of any machine, but if it requires specific hardware then I think it is targeted at a specific group of people.
> > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> The real strength of Apache Spot should > mainly > > be > > > > > > > > >>>>>>>>>>>>>>> just > > > > > > > > >>>>>> analyzing > > > > > > > > >>>>>>>>>>>>>>> network traffic through ML. > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> Thanks! > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, > > > Nathan > > > > L > > > > > > > > >>>>>>>>>>>>>>> < [email protected]> wrote: > > > > > > > > >>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>> Thanks, Nate, > > > > > > > > >>>>>>>>>>>>>>>> > > > > > > > > >>>>>>>>>>>>>>>> Nate. 
> > > > > > -----Original Message-----
> > > > > > From: Nate Smith [mailto:[email protected]]
> > > > > > Sent: Thursday, April 13, 2017 4:26 PM
> > > > > > To: [email protected]
> > > > > > Cc: [email protected]; [email protected]
> > > > > > Subject: Re: [Discuss] - Future plans for Spot-ingest
> > > > > >
> > > > > > I was really hoping it came through ok. Oh well :) Here's an image
> > > > > > form: http://imgur.com/a/DUDsD
> > > > > >
> > > > > > > On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L
> > > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > The diagram became garbled in the text format. Could you resend
> > > > > > > it as a pdf?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Nate
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Nathanael Smith [mailto:[email protected]]
> > > > > > > Sent: Thursday, April 13, 2017 4:01 PM
> > > > > > > To: [email protected]; [email protected];
> > > > > > > [email protected]
> > > > > > > Subject: [Discuss] - Future plans for Spot-ingest
> > > > > > >
> > > > > > > How would you like to see Spot-ingest change?
> > > > > > >
> > > > > > > A. Continue development on the Python Master/Worker with focus on
> > > > > > > performance / error handling / logging
> > > > > > > B. Develop Scala-based ingest to be in line with the code base
> > > > > > > from ingest and ml to OA (UI to continue being ipython/JS)
> > > > > > > C. Python ingest Worker with Scala-based Spark code for
> > > > > > > normalization and input into the DB
> > > > > > >
> > > > > > > Including the high-level diagram:
> > > > > > >
> > > > > > > [ASCII diagram garbled by quoting; image form at
> > > > > > > http://imgur.com/a/DUDsD. Roughly: a Master process (A. Python /
> > > > > > > B. Scala / C. Python) feeds Worker processes (A. Python /
> > > > > > > B. Scala / C. Scala), with the Spark Streaming workers running
> > > > > > > on worker nodes in the Hadoop cluster; Workers write to the
> > > > > > > local FS (binary/text log files) and to HDFS (Parquet, queried
> > > > > > > via Hive / Impala).]
> > > > > > >
> > > > > > > Please let me know your thoughts,
> > > > > > >
> > > > > > > - Nathanael
>
> --
> Michael Ridley <[email protected]>
> office: (650) 352-1337
> mobile: (571) 438-2420
> Senior Solutions Architect
> Cloudera, Inc.
