Not sure if you are aware of these:

1) edX/Berkeley/Databricks has three Spark-related certifications. Might be a 
good start. 

2) A fair understanding of Scala and distributed collection patterns helps you 
better appreciate the internals of Spark. Coursera has three Scala courses. I 
know there are other language bindings. The edX course goes into great detail 
on those. 

3) The book "Advanced Analytics with Spark". 

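For item 2, here is a rough sketch of what I mean by distributed collection patterns. It is plain Python, not Spark, so it runs anywhere; the records, the two-way split into partitions, and the helper names sum_by_key and merge are all made up for illustration. The shape is the same as a filter/map followed by a reduce by key in Spark: each partition is processed on its own, then the partial results are merged.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical click-stream records: (user, bytes_sent).
records = [("ann", 120), ("bob", 300), ("ann", 80), ("cat", 50), ("bob", 20)]

# Pretend the records live on two machines (two partitions).
partitions = [records[:3], records[3:]]


def sum_by_key(pairs):
    """Local 'reduce by key' over the pairs of a single partition."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def merge(left, right):
    """Combine two partial results by adding values of matching keys."""
    merged = dict(left)
    for key, value in right.items():
        merged[key] = merged.get(key, 0) + value
    return merged


# Filter and aggregate each partition independently of the others.
partials = [
    sum_by_key((user, size) for user, size in part if size >= 50)
    for part in partitions
]

# Merge the per-partition results into one final answer.
totals = reduce(merge, partials, {})
print(totals)
```

The merge step is associative and commutative, so partial results can be combined in any order. That property is what lets an engine like Spark spread the work across machines.
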
--sachin

> On Dec 8, 2016, at 11:38 AM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:
> 
> Keep in mind that Spark is a parallel computing engine; it does not change 
> your data infrastructure/data architecture.  These days it's relatively 
> convenient to read data from a variety of sources (S3, HDFS, Cassandra, ...) 
> and to write to them on the output side.  
> 
> For example, for one of my use cases, I store tens of gigabytes of 
> time-series data in Cassandra.  I like to analyze all of it at once using 
> Spark, which writes a nice, small text-file table of results that I look at 
> using Python/Pandas, in a Jupyter notebook, on a laptop. 
> 
> If we didn't have Spark, I'd still be doing the input side (Cassandra) and 
> output side (small text file, ingestible by a laptop) the same way.  The only 
> difference would be, instead of importing and processing in Spark, my 
> fictional group of 5,000 assistants would each download a portion of the data 
> into their Excel spreadsheet, then have a big meeting to produce my small 
> text file.
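
A rough stand-alone sketch of the "assistants" picture above, in plain Python rather than Spark. The in-memory rows stand in for the Cassandra data, and split, summarize, and combine are hypothetical helper names, not part of any library. Each worker summarizes only its own portion, and the combine step plays the role of the big meeting that produces the small results table.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

# Hypothetical time-series rows: (series_id, value). In the setup described
# above these would come from Cassandra; here they are generated in memory.
rows = [("s%d" % (i % 3), float(i)) for i in range(12)]


def split(data, n):
    """Deal the rows out to n 'assistants' round-robin."""
    return [data[i::n] for i in range(n)]


def summarize(chunk):
    """One assistant's job: per-series (count, sum) for its own chunk."""
    out = {}
    for series_id, value in chunk:
        count, total = out.get(series_id, (0, 0.0))
        out[series_id] = (count + 1, total + value)
    return out


def combine(partials):
    """The 'big meeting': merge every assistant's partial summary."""
    merged = {}
    for partial in partials:
        for series_id, (count, total) in partial.items():
            c, t = merged.get(series_id, (0, 0.0))
            merged[series_id] = (c + count, t + total)
    return merged


# Four workers each summarize one portion of the data.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(summarize, split(rows, 4)))

summary = combine(partials)

# Write the small results table as CSV text.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["series_id", "count", "mean"])
for series_id in sorted(summary):
    count, total = summary[series_id]
    writer.writerow([series_id, count, total / count])
results_text = buffer.getvalue()
print(results_text)
```

The resulting text is small enough to load on a laptop with pandas.read_csv, as in the workflow described above. The input and output sides stay the same whether the middle step is Spark, threads, or people.
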
> 
> So my view is the nature of your data and specific objectives determine your 
> infrastructure and architecture, not the presence or absence of Spark.
> 
>> On Sat, Dec 3, 2016 at 10:59 AM, Vasu Gourabathina <vgour...@gmail.com> 
>> wrote:
>> Hi,
>> 
>> I know this is a broad question. If this is not the right forum, I would 
>> appreciate it if you could point me to other sites/areas that may be helpful.
>> 
>> Before posing this question, I did use our friend Google, but filtering the 
>> results down to what I actually need hasn't been easy.
>> 
>> Who I am: 
>>    - Have done data processing and analytics, but relatively new to the 
>> Spark world
>> 
>> What I am looking for:
>>   - Architecture/Design of an ML system using Spark
>>   - In particular, looking for best practices that can support/bridge both 
>> Engineering and Data Science teams
>> 
>> Engineering:
>>    - Build a system that meets typical engineering needs: data processing, 
>> scalability, reliability, availability, fault tolerance, etc.
>>    - System monitoring etc.
>> Data Science:
>>    - Build a system for Data Science team to do data exploration activities
>>    - Develop models using supervised learning and tweak models
>> 
>> Data:
>>   - Batch and incremental updates - mostly structured or semi-structured 
>> (some data from transaction systems, weblogs, click stream etc.)
>>   - Streaming, in the near term, but not to begin with
>> 
>> Data Storage:
>>   - Data is expected to grow daily, so the system should be able to 
>> support and handle big data
>>   - After further analysis, there may be a need to archive some of the 
>> data; it all depends on how the ML models are built and how results are 
>> stored and used later
>> 
>> Data Analysis:
>>   - Obvious data-related aspects, such as data cleansing, data 
>> transformation, data partitioning, etc.
>>   - Possibly run models on windows of data, for example the last 1 year, 
>> 2 years, etc.
>> 
>> ML models:
>>   - Ability to store model versions and previous results
>>   - Compare results of different variants of models
>>  
>> Consumers:
>>   - RESTful webservice clients to look at the results
>> 
>> So, the questions I have are:
>> 1) Are there architectural and design patterns that I can use based on 
>> industry best practices? In particular:
>>       - data ingestion
>>       - data storage (e.g., whether to go with HDFS or not)
>>       - data partitioning, especially in the Spark world
>>       - running ML models in parallel and combining results, etc.
>>       - consumption of final results by clients (e.g., by pushing results 
>> to Cassandra or other NoSQL databases)
>> 
>> Again, I know this is a broad question. Pointers to best practices in 
>> some of the areas, if not all, would be highly appreciated. I am open to 
>> purchasing any books that may have relevant information.
>> 
>> Thanks much folks,
>> Vasu.
>> 
> 
