Not sure if you are aware of these:
1) edX/Berkeley/Databricks have three Spark-related certifications. Might be a
good start.
2) A fair understanding of Scala and distributed collection patterns helps you
better appreciate the internals of Spark. Coursera has three Scala courses. I
know there are other language bindings. The edX course goes into great detail
on those.
3) The book "Advanced Analytics with Spark".
--sachin
> On Dec 8, 2016, at 11:38 AM, Peter Figliozzi wrote:
>
> Keeping in mind Spark is a parallel computing engine, Spark does not change
> your data infrastructure/data architecture. These days it's relatively
> convenient to read data from a variety of sources (S3, HDFS, Cassandra, ...)
> and ditto on the output side.
>
> For example, for one of my use-cases, I store tens of gigabytes of time-series
> data in Cassandra. It just so happens I like to analyze all of it at once
> using Spark, which writes a very nice, small text file table of results I
> look at using Python/Pandas, in a Jupyter notebook, on a laptop.
>
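To make the workflow above concrete, here is a minimal plain-Python sketch (not
Spark, and not the actual job described above) of the split/process/combine
pattern that Spark parallelizes. The sample records, the function names, and the
choice of a per-series mean are all assumptions made up for illustration.

```python
import csv
import io
from collections import defaultdict


def summarize_partition(rows):
    """One worker: per-series [count, total] for its share of the data."""
    partial = defaultdict(lambda: [0, 0.0])
    for series_id, value in rows:
        partial[series_id][0] += 1
        partial[series_id][1] += value
    return partial


def combine(partials):
    """Merge the partial results and reduce them to a mean per series."""
    merged = defaultdict(lambda: [0, 0.0])
    for partial in partials:
        for series_id, (count, total) in partial.items():
            merged[series_id][0] += count
            merged[series_id][1] += total
    return {sid: total / count for sid, (count, total) in merged.items()}


def to_text_table(means):
    """Render the result as a small CSV text table for use on a laptop."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["series_id", "mean"])
    for sid in sorted(means):
        writer.writerow([sid, means[sid]])
    return out.getvalue()


# Hypothetical sample data standing in for the stored time series.
records = [("a", 1.0), ("b", 10.0), ("a", 3.0), ("b", 20.0), ("a", 5.0)]
# Split the records into two partitions, one per worker.
partitions = [records[0::2], records[1::2]]
means = combine(summarize_partition(p) for p in partitions)
table = to_text_table(means)
```

The input side (where `records` come from) and the output side (the small text
table) stay the same whether the middle step runs in Spark or anywhere else;
only the middle step is swapped out.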
> If we didn't have Spark, I'd still be doing the input side (Cassandra) and
> output side (small text file, ingestible by a laptop) the same way. The only
> difference would be, instead of importing and processing in Spark, my
> fictional group of 5,000 assistants would each download a portion of the data
> into their Excel spreadsheet, then have a big meeting to produce my small
> text file.
>
> So my view is the nature of your data and specific objectives determine your
> infrastructure and architecture, not the presence or absence of Spark.
>
>> On Sat, Dec 3, 2016 at 10:59 AM, Vasu Gourabathina wrote:
>> Hi,
>>
>> I know this is a broad question. If this is not the right forum, I'd
>> appreciate it if you could point me to other sites/areas that may be helpful.
>>
>> Before posing this question, I did use our friend Google, but filtering the
>> search results down to what I need hasn't been easy.
>>
>> Who I am:
>> - Have done data processing and analytics, but relatively new to the Spark
>> world
>>
>> What I am looking for:
>> - Architecture/Design of a ML system using Spark
>> - In particular, looking for best practices that can support/bridge both
>> Engineering and Data Science teams
>>
>> Engineering:
>> - Build a system that has typical engineering needs, data processing,
>> scalability, reliability, availability, fault-tolerance etc.
>> - System monitoring etc.
>>
>> Data Science:
>> - Build a system for Data Science team to do data exploration activities
>> - Develop models using supervised learning and tweak models
>>
>> Data:
>> - Batch and incremental updates - mostly structured or semi-structured
>> (some data from transaction systems, weblogs, click stream etc.)
>> - Streaming in the near term, but not to begin with
>>
>> Data Storage:
>> - Data is expected to grow on a daily basis...so, system should be able to
>> support and handle big data
>> - After further analysis, there may be a need to archive some of the
>> data; it all depends on how the ML models are built and how results are
>> stored/used in the future
>>
>> Data Analysis:
>> - Obvious data related aspects, such as data cleansing, data
>> transformation, data partitioning etc
>> - Possibly run models on windows of data, for example the last 1 year,
>> 2 years, etc.
>>
>> ML models:
>> - Ability to store model versions and previous results
>> - Compare results of different variants of models
>>
>> Consumers:
>> - RESTful webservice clients to look at the results
>>
>> So, the questions I have are:
>> 1) Are there architectural and design patterns that I can use, based on
>> industry best practices? In particular:
>> - data ingestion
>> - data storage (e.g. go with HDFS or not)
>> - data partitioning, especially in the Spark world
>> - running parallel ML models and combining results etc.
>> - consumption of final results by clients (e.g. by pushing results
>> to Cassandra, NoSQL DBs etc.)
>>
>> Again, I know this is a broad question. Pointers to some best practices in
>> some of the areas, if not all, would be highly appreciated. I'm open to
>> purchasing any books that may have relevant information.
>>
>> Thanks much folks,
>> Vasu.
>>
>