[kylin] branch document updated: add post use-python-for-data-science-with-apache-kylin

shaofengshi Wed, 26 Jun 2019 02:13:16 -0700

This is an automated email from the ASF dual-hosted git repository.

shaofengshi pushed a commit to branch document
in repository https://gitbox.apache.org/repos/asf/kylin.git



The following commit(s) were added to refs/heads/document by this push:
     new e1298e0  add post use-python-for-data-science-with-apache-kylin
e1298e0 is described below

commit e1298e038ce32e9ffed6665cc3dfa5039f68da30
Author: shaofengshi <shaofeng...@apache.org>
AuthorDate: Wed Jun 26 17:12:35 2019 +0800

    add post use-python-for-data-science-with-apache-kylin
---
 ...se-python-for-data-science-with-apache-kylin.md | 230 +++++++++++++++++++++
 website/images/blog/python-data-science/chart1.png | Bin 0 -> 38014 bytes
 website/images/blog/python-data-science/chart2.png | Bin 0 -> 45090 bytes
 website/images/blog/python-data-science/chart3.png | Bin 0 -> 87658 bytes
 website/images/blog/python-data-science/chart4.png | Bin 0 -> 186083 bytes
 website/images/blog/python-data-science/chart5.png | Bin 0 -> 188626 bytes
 .../images/blog/python-data-science/diagram1.png   | Bin 0 -> 211995 bytes
 .../images/blog/python-data-science/diagram2.png   | Bin 0 -> 143992 bytes
 8 files changed, 230 insertions(+)

diff --git 
a/website/_posts/blog/2019-06-26-use-python-for-data-science-with-apache-kylin.md
 
b/website/_posts/blog/2019-06-26-use-python-for-data-science-with-apache-kylin.md
new file mode 100644
index 0000000..d2f0949
--- /dev/null
+++ 
b/website/_posts/blog/2019-06-26-use-python-for-data-science-with-apache-kylin.md
@@ -0,0 +1,230 @@
+---
+layout: post-blog
+title:  Use Python for Data Science with Apache Kylin
+date:   2019-06-26 14:30:00
+author: Nikhil Jain
+categories: blog
+---
+Original from [Kyligence tech 
blog](https://kyligence.io/blog/use-python-for-data-science-with-apache-kylin/)
+
+In today’s world, Big Data, data science, and machine learning analytics and 
are not only hot topics, they’re also an essential part of our society. Data is 
everywhere, and the amount of digital data that exists is growing at a rapid 
rate. According to 
[Forbes](https://www.forbes.com/sites/tomcoughlin/2018/11/27/175-zettabytes-by-2025/#622d803d5459),
 around 175 Zettabytes of data will be generated annually by 2025. 
+
+The economy, healthcare, agriculture, energy, media, education and all other 
critical human activities rely more and more on the advanced processing and 
analysis of large quantities of collected data. However, these massive datasets 
pose a real challenge to data analytics, data mining, machine learning and data 
science. 
+
+Data Scientists and analysts have often expressed frustration while trying to 
work with Big Data. The good news is that there is a solution: Apache Kylin. 
Kylin solves this Big Data dilemma by integrating with Python to help analysts 
& data scientists finally gain unfettered access to their large-scale (terabyte 
and petabyte) datasets.
+
+## Machine Learning Challenges
+
+One of the main challenges machine learning (ML) engineers and data scientists 
encounter when running computations with Big Data comes from the principle that 
higher volume or scale equates to greater computational complexity. 
+
+Consequently, as datasets scale up, even trivial operations can become costly. 
Moreover, as data volume rises, algorithm performance becomes increasingly 
dependent on the architecture used to store and move data. Parallel data 
structures, data partitioning and placement, and data reuse become more 
important as the amount of data one is working with grows.
+
+## What Apache Kylin Is and How It Helps
+
+Apache Kylin is an open source distributed Big Data analytics engine designed 
to provide a SQL interface for multi-dimensional analysis (MOLAP) on Hadoop. It 
allows enterprises to rapidly analyze their massive datasets in a fraction of 
the time it would take using other approaches or Big Data analytics tools.
+
+With Apache Kylin, data teams are able to dramatically cut down on analytics 
processing time and associated IT and ops costs. It’s able to do this by 
pre-computing large datasets into one (or another very small amount) of OLAP 
cubes and storing them in a columnar database. This allows ML Engineers, data 
scientists, and analysts to quickly access the data and perform data mining 
activities to uncover hidden trends easily. 
+
+The Following diagram illustrates how machine learning and data science 
activities on big data become much easier when Apache Kylin is introduced.
+
+![diagram1](/images/blog/python-data-science/diagram1.png)
+
+
+
+## How to Integrate Python with Apache Kylin
+
+Python has quickly risen in prominence to take its spot as one of the leading 
programming languages in the data analytics field (as well as outside the 
field). With its ease of use and extensive collection of libraries, Python has 
become well-positioned to take on Big Data. 
+
+Python also provides plenty of data mining tools to assist in the handling of 
data, offering up a variety of applications already adopted by the machine 
learning and data science communities. Simply put, if you’re working with Big 
Data, there’s probably a way Python can make your job easier.
+
+Apache Kylin can be easily integrated with Python with support from 
[Kylinpy](https://github.com/Kyligence/kylinpy). Kylinpy is a python library 
that provides a SQLAlchemy Dialect implementation. Thus, any application that 
uses SQLAlchemy can now query Kylin OLAP cubes. Additionally, it also allows 
users to access data via Pandas data frames.
+
+**Sample code to access data via Pandas:**
+
+```
+$ python
+
+ >>> import sqlalchemy as sa
+ >>> import pandas as pd
+ >>> kylin_engine = 
sa.create_engine('kylin://<username>:<password>@<IP>:<PORT>/<project_name>',
+     connect_args={'is_ssl': True, 'timeout': 60})
+ >>> sql = 'select * from kylin_sales limit 10'
+ >>> dataframe = pd.read_sql(sql, kylin_engine)
+ >>> print(dataframe)
+```
+
+
+**Benefits of using Apache Kylin as Data Source:**
+
+● **Easy Access to Massive Datasets:** Interactively work with large amounts 
(TB/PB) of data. 
+
+● **Blazing Fast Performance:** Get sub-second response times to your queries 
on Big Data.
+
+● **High Scalability:** With Kylin’s linear scalability, scale up your data 
without worrying about performance.
+
+● **Web Scale Concurrency:** Deploy to thousands of concurrent users.
+
+● **Minimal Data Engineering:** Invest time in discovering insights and leave 
the data engineering to Apache  Kylin.
+
+## A Use Case: Data Science with Apache Kylin
+
+**<u>Dataset</u>**
+
+We imported an IMDB movie dataset (**Source:** 
[Movielens](https://grouplens.org/datasets/movielens/)) into our Kylin OLAP 
cube and used Python to read the data and perform exploratory analysis in order 
to find trends in movie ratings for different genres over a given period of 
time.
+
+**<u>Motivation</u>**
+
+▪       Identify top rated movies.
+
+▪       Compare Male vs Female preference for different movie genres.
+
+▪       Find correlation between Occupation & Genre. 
+
+▪       Analyzing trends in average movie ratings for different genres across 
the weeks.
+
+▪       Compare Men & Women average ratings.
+
+**<u>Data Lifecycle</u>**
+
+In order to analyze the data via Python, the Kylinpy library was used and 
SQL(s) were written to ingest relevant data for the analysis in question. The 
dataset(s) returned via SQL(s) were stored as Pandas data frame(s) and then 
data manipulation was done on the data frames to bring the data into a shape 
suitable for our analysis. We have leveraged the Matplotlib and Seaborn 
libraries for visualizing the data. The diagram below illustrates the data 
lifecycle through each of its stages.
+
+![diagram2](/images/blog/python-data-science/diagram2.png)
+
+**<u>Analysis</u>**
+
+Let us first visualize the top-rated movies. **It can be seen that from the 
top 15 movies, apart from top 2, 13 movies have been rated by an almost equal 
number of viewers.** This information is a starting point for correlational 
discovery and can be further drilled down into to find the correlation between 
the closely rated movies.
+
+```
+import sqlalchemy as sa
+import pandas as pd
+import matplotlib.pyplot as plt
+
+
+kylin_engine = 
sa.create_engine('kylin://<username>:<password>@1<IP>:<PORT>/<project_name>', 
connect_args={'timeout': 60})
+sql = 'select movieid,count(distinct userid) as COUNT_USERS from userratings 
group by movieid order by count(distinct userid) desc limit 15'
+
+moviecount = pd.read_sql(sql, kylin_engine)
+
+df=moviecount.sort_values(by='COUNT_USERS', ascending=False, 
na_position='first')
+
+ax = df.plot(kind='bar', x='MOVIEID', 
y='COUNT_USERS',figsize=(15,10),legend=False, color='blue', fontsize=18)
+ax.set_xlabel("Movie ID",fontsize=18)
+ax.set_ylabel("Users Count",fontsize=18)
+plt.title('Top Rated Movies',fontweight="bold",fontsize=22,y=1.05)
+
+plt.show()
+```
+
+
+![chart1](/images/blog/python-data-science/chart1.png)
+
+
+Similarly, plot below displays the comparison of Males vs. Females count per 
Genre. This describes a **gender-based inclination across various movie 
genres**.
+
+```
+import sqlalchemy as sa
+import pandas as pd
+import numpy as np
+import matplotlib.pyplot as plt
+
+
+kylin_engine = 
sa.create_engine('kylin://<username>:<password>@1<IP>:<PORT>/<project_name>', 
connect_args={'timeout': 60})
+sql = 'SELECT movies.genre as genre,users.gender as gender, 
count(userrat.userid) as counts from jainnik.userratings as userrat \
+ inner join jainnik.movies as movies on movies.movieid =  userrat.movieid \
+inner join jainnik.viewer as users on userrat.userid = users.userid \
+group by genre,gender'
+
+df2 = pd.read_sql(sql, kylin_engine)
+
+df2.columns = ['GENRE', 'GENDER', 'COUNTS']
+
+df3= df2[df2['GENDER'] == 'M']
+
+df3.columns = ['GENRE', 'GENDER_M', 'COUNT_M']
+
+df4= df2[df2['GENDER'] == 'F']
+
+df4.columns = ['GENRE', 'GENDER_F', 'COUNT_F']
+
+df_con = df4.merge(df3, left_on='GENRE', right_on='GENRE', how='inner')
+
+df_con.columns = ['GENRE', 'GENDER_F', 'Female', 'GENDER_M', 'Male']
+
+df_con = df_con.reindex(columns=['GENRE', 'GENDER_F', 'Female', 'GENDER_M', 
'Male'])
+
+df_con1 = df_con[:10]
+
+ax = df_con1.plot(kind='barh', x='GENRE',figsize=(15,10),legend=True, 
fontsize=12)
+ax.set_ylabel("Genre",fontsize=12)
+ax.set_xlabel("Users Count",fontsize=12)
+plt.title('Males vs Females',fontweight="bold",fontsize=22,y=1.05)
+
+plt.show()
+
+```
+
+
+![chart2](/images/blog/python-data-science/chart2.png)
+
+From the below correlation matrix (Heat map), we can state the relationship 
between Occupation and Genres of Movies that an individual prefers. For 
example: **Farmers do not prefer to watch Mystery based movies and College 
Students prefer Film-Noir or Documentaries.**
+
+```
+import sqlalchemy as sa
+import pandas as pd
+import matplotlib.pyplot as plt
+import numpy as np
+import calendar
+import seaborn as sns
+
+kylin_engine = 
sa.create_engine('kylin://<username>:<password>@1<IP>:<PORT>/<project_name>', 
connect_args={'timeout': 60})
+
+sql1 = 'select genre,occupation,AVG(cast(rating as decimal(1,6))) as RATING 
from USERRATINGS \
+inner join movies on USERRATINGS.movieid = movies.movieid \
+inner join viewer on viewer.userid = userratings.userid inner join occupation 
on viewer.occupationid=occupation.occupationid \
+group by genre, occupation'
+
+df1 = pd.read_sql(sql1, kylin_engine)
+
+df10 = df1.pivot_table(values='RATING', index=['GENRE'],columns='OCCUPATION')
+#print(df10.head(100))
+df10=df10.sort_values("college/grad student",ascending=False)
+
+ax = sns.heatmap(df10.head(15),cmap="BuPu")
+plt.xticks(fontsize=12)
+plt.yticks(fontsize=12)
+plt.xlabel("Occupation",fontsize=15)
+plt.ylabel("Genres",fontsize=15)
+
+for x in ax.get_xticklabels():
+    x.set_rotation(90)
+for x in ax.get_yticklabels():
+    x.set_rotation(0)
+plt.tight_layout()
+plt.show()
+```
+
+
+![chart3](/images/blog/python-data-science/chart3.png)
+
+The next figure shows the trends of the average ratings by users for different 
genres across different weeks for a given year. From the chart it can be seen 
that **Documentary and Crime movies are amongst people’s favorites while 
children’s movies always had the lowest average rating.**
+
+![chart4](/images/blog/python-data-science/chart4.png)
+
+
+The two scatter plots below are used for a side by side comparison to infer 
correlation between the ratings of Men and Women. 
+
+**Left Plot:** The scatter plot shows that the average rating of Men and Women 
(all movies) has a linearly increasing trend and the highly concentrated part 
of the plot is equally distributed on both sides of the reference line, which 
depicts that apart from a few movie ratings, **Men and Women tend to think 
alike**.
+
+**Right Plot:** The scatter plot was produced by segregating only those movies 
which have been rated more than 400 times. In this case as well we can see that 
**Men and Women have similar ratings**, suggesting that our **initial inference 
was accurate**.
+
+
+![chart5](/images/blog/python-data-science/chart5.png)
+
+
+## **Get Started with Python on Apache Kylin**
+
+We discussed how Python easily integrates with Apache Kylin’s OLAP technology 
using the Kylinpy library, which in turn was used to run advanced analytics on 
our example movie dataset. We also used Pandas, Matplotlib and Seaborn 
libraries to manipulate and visualize the data residing in our Apache Kylin 
cubes. 
+
+Such analysis gave us insight into how people’s liking of different movie 
genres changes over time. It also told us about the strength of association 
between trends in different movie genres. Insights like these could be useful 
for movie critics.
+
+If you or your team are facing issues in fully accessing your massive datasets 
and want to leverage Kylin’s OLAP on Big Data approach for your machine 
learning or data science activities, Apache Kylin has you covered. Visit the 
[download](http://kylin.apache.org/download/) page to try Apache Kylin now.  
For more information about Apache Kylin's OLAP analytics solutions visit 
[Apache Kylin](http://kylin.apache.org/) website.
\ No newline at end of file
diff --git a/website/images/blog/python-data-science/chart1.png 
b/website/images/blog/python-data-science/chart1.png
new file mode 100644
index 0000000..fa26f26
Binary files /dev/null and b/website/images/blog/python-data-science/chart1.png 
differ
diff --git a/website/images/blog/python-data-science/chart2.png 
b/website/images/blog/python-data-science/chart2.png
new file mode 100644
index 0000000..72c84a1
Binary files /dev/null and b/website/images/blog/python-data-science/chart2.png 
differ
diff --git a/website/images/blog/python-data-science/chart3.png 
b/website/images/blog/python-data-science/chart3.png
new file mode 100644
index 0000000..c007c10
Binary files /dev/null and b/website/images/blog/python-data-science/chart3.png 
differ
diff --git a/website/images/blog/python-data-science/chart4.png 
b/website/images/blog/python-data-science/chart4.png
new file mode 100644
index 0000000..3776dab
Binary files /dev/null and b/website/images/blog/python-data-science/chart4.png 
differ
diff --git a/website/images/blog/python-data-science/chart5.png 
b/website/images/blog/python-data-science/chart5.png
new file mode 100644
index 0000000..f0a1824
Binary files /dev/null and b/website/images/blog/python-data-science/chart5.png 
differ
diff --git a/website/images/blog/python-data-science/diagram1.png 
b/website/images/blog/python-data-science/diagram1.png
new file mode 100644
index 0000000..e81c737
Binary files /dev/null and 
b/website/images/blog/python-data-science/diagram1.png differ
diff --git a/website/images/blog/python-data-science/diagram2.png 
b/website/images/blog/python-data-science/diagram2.png
new file mode 100644
index 0000000..2303628
Binary files /dev/null and 
b/website/images/blog/python-data-science/diagram2.png differ

[kylin] branch document updated: add post use-python-for-data-science-with-apache-kylin

Reply via email to