Not sure if you are aware of these...

1) edX/Berkeley/Databricks has three Spark-related certifications. Might be a good start.
2) A fair understanding of Scala and distributed collection patterns, to better appreciate the internals of Spark. Coursera has three Scala courses. I know there are other language bindings; the edX course goes into great detail on those.
3) The "Advanced Analytics with Spark" book.

--sachin

Sent from my iPhone

> On Dec 8, 2016, at 11:38 AM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:
>
> Keeping in mind Spark is a parallel computing engine, Spark does not change
> your data infrastructure/data architecture. These days it's relatively
> convenient to read data from a variety of sources (S3, HDFS, Cassandra, ...)
> and ditto on the output side.
>
> For example, for one of my use-cases, I store 10's of gigs of time-series
> data in Cassandra. It just so happens I like to analyze all of it at once
> using Spark, which writes a very nice, small text file table of results I
> look at using Python/Pandas, in a Jupyter notebook, on a laptop.
>
> If we didn't have Spark, I'd still be doing the input side (Cassandra) and
> output side (small text file, ingestible by a laptop) the same way. The only
> difference would be, instead of importing and processing in Spark, my
> fictional group of 5,000 assistants would each download a portion of the data
> into their Excel spreadsheet, then have a big meeting to produce my small
> text file.
>
> So my view is the nature of your data and specific objectives determine your
> infrastructure and architecture, not the presence or absence of Spark.
>
>> On Sat, Dec 3, 2016 at 10:59 AM, Vasu Gourabathina <vgour...@gmail.com>
>> wrote:
>> Hi,
>>
>> I know this is a broad question. If this is not the right forum, appreciate
>> if you can point to other sites/areas that may be helpful.
>>
>> Before posing this question, I did use our friend Google, but sanitizing the
>> query results from my need angle hasn't been easy.
>>
>> Who I am:
>> - Have done data processing and analytics, but relatively new to Spark
>>   world
>>
>> What I am looking for:
>> - Architecture/Design of a ML system using Spark
>> - In particular, looking for best practices that can support/bridge both
>>   Engineering and Data Science teams
>>
>> Engineering:
>> - Build a system that has typical engineering needs, data processing,
>>   scalability, reliability, availability, fault-tolerance etc.
>> - System monitoring etc.
>>
>> Data Science:
>> - Build a system for Data Science team to do data exploration activities
>> - Develop models using supervised learning and tweak models
>>
>> Data:
>> - Batch and incremental updates - mostly structured or semi-structured
>>   (some data from transaction systems, weblogs, click stream etc.)
>> - Streaming, in near term, but not to begin with
>>
>> Data Storage:
>> - Data is expected to grow on a daily basis...so, system should be able to
>>   support and handle big data
>> - Maybe, after further analysis, there might be a possibility/need to
>>   archive some of the data...it all depends on how the ML models were built
>>   and results were stored/used for future usage
>>
>> Data Analysis:
>> - Obvious data related aspects, such as data cleansing, data
>>   transformation, data partitioning etc
>> - Maybe run models on windows of data. For example: last 1-year, 2-years
>>   etc.
>>
>> ML models:
>> - Ability to store model versions and previous results
>> - Compare results of different variants of models
>>
>> Consumers:
>> - RESTful webservice clients to look at the results
>>
>> So, the questions I have are:
>> 1) Are there architectural and design patterns that I can use based on
>>    industry best-practices? In particular:
>>    - data ingestion
>>    - data storage (for eg. go with HDFS or not)
>>    - data partitioning, especially in Spark world
>>    - running parallel ML models and combining results etc.
>>    - consumption of final results by clients (for eg. by pushing results
>>      to Cassandra, NoSQL dbs etc.)
>>
>> Again, I know this is a broad question....Pointers to some best-practices in
>> some of the areas, if not all, would be highly appreciated. Open to purchase
>> any books that may have relevant information.
>>
>> Thanks much folks,
>> Vasu.
>>
>
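
For anyone reading this thread later: Peter's picture of many assistants each taking a portion of the data and then meeting to produce one small text file is the split/process/combine pattern that Spark automates. Below is a rough sketch of that shape in plain Python. It is not Spark code and not Peter's actual job; the record layout (sensor, timestamp, value), the function names, and the per-sensor mean are all made up for illustration.

```python
import csv
import io
from collections import defaultdict


def split(records, n_parts):
    """Deal records into n_parts portions, one per hypothetical 'assistant'."""
    return [records[i::n_parts] for i in range(n_parts)]


def summarize(portion):
    """One assistant's work: reduce a portion to per-sensor [count, total]."""
    partial = defaultdict(lambda: [0, 0.0])
    for sensor, _timestamp, value in portion:
        partial[sensor][0] += 1
        partial[sensor][1] += value
    return partial


def combine(partials):
    """The 'big meeting': merge partial results into per-sensor (count, mean)."""
    merged = defaultdict(lambda: [0, 0.0])
    for partial in partials:
        for sensor, (count, total) in partial.items():
            merged[sensor][0] += count
            merged[sensor][1] += total
    return {s: (c, t / c) for s, (c, t) in sorted(merged.items())}


def to_text_table(summary):
    """Render the combined result as a small CSV text table."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["sensor", "count", "mean"])
    for sensor, (count, mean) in summary.items():
        writer.writerow([sensor, count, mean])
    return out.getvalue()


# Made-up time-series rows: (sensor, timestamp, value)
records = [
    ("a", 1, 1.0), ("b", 1, 10.0),
    ("a", 2, 3.0), ("b", 2, 20.0),
    ("a", 3, 5.0),
]
summary = combine(summarize(portion) for portion in split(records, 3))
table = to_text_table(summary)
print(table)
```

The input side (where the records come from) and the output side (a small text table a laptop can read) stay the same no matter what does the middle step, which is the point Peter makes about Spark not dictating the architecture.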