techdocsmith commented on code in PR #14501: URL: https://github.com/apache/druid/pull/14501#discussion_r1251311729
########## examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/XX-example-flightdata-events.ipynb: ##########
@@ -0,0 +1,807 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "e79d7d48-b403-4b9e-8cc6-0f0accecac1f", + "metadata": {}, + "source": [ + "# Data modeling and ingestion principles - creating Events from Druid's sample flight data\n", + "\n", + "Druid's data loader allows you to quickly ingest sample carrier data into a `TABLE`, giving you an easy way to learn about the SQL functions that are available. It's also a great place to start understanding how data modeling for event analytics in a real-time database differs from modeling you'd apply in other databases, as well as being small enough to see - and try out - different data layout designs safely.\n",

Review Comment:
Why are we using the data loader for a notebook? That is more suited to a walkthrough tutorial. I think for this case, it might be OK to include a couple of sample rows. If a screen capture of the data loader would help, you could do that.

########## examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/XX-example-flightdata-events.ipynb: ##########
+ "In this notebook, you'll walk through creating a table of events out of the sample data set, applying data modeling principles as you go. At the end you'll have a `TABLE` called \"flight-events\" that you can then use as you continue your learning in Apache Druid.\n", + "\n", + "## Prerequisites\n", + "\n", + "In order to use this notebook, you'll need access to a small Druid deployment.\n", + "\n", + "It's a good idea to test ingesting the data \"as is\" on that cluster to make sure it's operational before you get going.\n", + "\n", + "## Getting started\n", + "\n", + "Run the following to set up the druid api. Remember to change the `druid-host` to the appropriate endpoint to submit your SQL.\n", + "\n", + "**NOTE** that this notebook calls the `sql_client.wait_until_ready` method. This will pause the Python kernel until ingestion has completed, and subsequent cells will not run until the ingestion is finished." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ffc13d62-d1fc-45bc-855a-8c7687d4c720", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import druidapi\n", + "\n", + "# druid_host is the hostname and port for your Druid deployment.\n", + "# In the Docker Compose tutorial environment, this is the Router\n", + "# service running at \"http://router:8888\".\n", + "\n", + "# If you are not using the Docker Compose environment, edit the `druid_host`.\n", + "\n", + "druid_host = \"http://router:8888\"\n", + "druid_host\n", + "\n", + "druid = druidapi.jupyter_client(druid_host)\n", + "display = druid.display\n", + "sql_client = druid.sql" + ] + }, + { + "cell_type": "markdown", + "id": "596eed1f-c47f-48cc-a537-5703c7eefc38", + "metadata": {}, + "source": [ + "## Apply modeling principles\n", + "\n", + "### Principle 1 - create the right `TABLE` for the right query\n", + "\n", + "#### Finding the Events\n", + "\n", + "Let's take a look at the data we have. Using the Druid Console you can preview the data you want to load.\n", + "\n", + "1. Open the console\n",

Review Comment:
OR maybe MSQ can run a select on an EXTERN. (Of course I can get this to work in the console, but not using the REST API client.)

########## examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/XX-example-flightdata-events.ipynb: ##########
+ "It's a good idea to test ingesting the data \"as is\" on that cluster to make sure it's operational before you get going.\n",

Review Comment:
Not sure what this is getting at. "That cluster"? Meaning the "small Druid deployment"? Maybe: verify that ingestion works on your cluster before continuing with the notebook? (It seems a little strange. Like you would expect it to work, right?)

########## examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/XX-example-flightdata-events.ipynb: ##########
+ "In order to use this notebook, you'll need access to a small Druid deployment.\n",

Review Comment:
We need to encourage folks to run in the Docker Compose environment, with the exception of ARM/M1, where they need to load Druid using the Local Quickstart.

########## examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/XX-example-flightdata-events.ipynb: ##########
+ "## Apply modeling principles\n",

Review Comment:
It's not great to have back-to-back headers with no content in between. Could we just skip directly to "## Create a table optimized for your queries"? What do we get from numbering the principles? If we do want to number them (and include them in a TOC in the intro), maybe "## Principle 1: Create ..."

########## examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/XX-example-flightdata-events.ipynb: ##########
+ "## Prerequisites\n",

Review Comment:
Prerequisites should look mostly like https://github.com/petermarshallio/druid/blob/202306-notebook-flightDataModeling/examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/01-streaming-from-kafka.ipynb, only pared back.

########## examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/XX-example-flightdata-events.ipynb: ##########
+ "2. Select Load data --> Batch - SQL (multi-stage query)\n", + "3. Click Example data and select \"FlightCarrierOnTime (1 month)\"\n", + "4. Click Use Example\n", + "\n", + "You can read more about what each field means in the [dataset explainer](https://dax-cdn.cdn.appdomain.cloud/dax-airline/1.0.1/data-preview/index.html).\n", + "\n", + "Notice how each row is **not** an event. Instead, each row represents a single flight - a \"session\" - containing information about the entire flight. We need to reverse engineer this data into events, as well as ingesting this session data.\n", + "\n", + "Notice that there are several event types aggregated into each row:\n", + "\n", + "* Data about the departure\n", + "* Data about takeoff (wheels off)\n", + "* Data about landing (wheels on)\n", + "* Data about arrival\n", + "\n", + "There is also Taxi In and Taxi Out data, as well as information about cancellation. An opportunity here for event analytics is to ask for more data about these events, either as part of this row or as individual events themselves (making our job easier!).\n", + "\n", + "Far to the right is data about diversions, but no easy way to calculate _when_ these happened so that we can create events.\n", + "\n", + "There are also a set of dimensions that have been calculated ahead of time for us. This is possibly because the databases used on top of this data can't calculate them on-the-fly in the way that Druid can; as with any data set, we need to decide whether to keep or remove them.\n",

Review Comment:
This feels vague. Example?
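The "reverse engineer this data into events" idea in the notebook text above can be sketched with a made-up session row. This is an illustrative Python sketch only: the column names loosely follow the carrier data set (`DepTime`, `WheelsOff`, `WheelsOn`, `ArrTime`), but the values and the exact schema are assumptions, not the real sample data.

```python
import pandas as pd

# One hypothetical "session" row, loosely modeled on the sample
# carrier data. Column names and values are illustrative only.
session = pd.DataFrame([{
    "FlightDate": "2005-11-01",
    "Reporting_Airline": "AA",
    "Flight_Number_Reporting_Airline": 101,
    "DepTime": "0905",     # departure event
    "WheelsOff": "0918",   # takeoff event
    "WheelsOn": "1204",    # landing event
    "ArrTime": "1212",     # arrival event
}])

# Unpivot the four per-flight timestamps into one row per event:
# each session row becomes four event rows with an event_type column.
events = session.melt(
    id_vars=["FlightDate", "Reporting_Airline",
             "Flight_Number_Reporting_Airline"],
    value_vars=["DepTime", "WheelsOff", "WheelsOn", "ArrTime"],
    var_name="event_type",
    value_name="event_time",
)

print(events)
```

One session row yields four event rows, which is the shape the "flight-events" table is after; the same unpivot could equally be expressed in SQL at ingestion time.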
########## examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/XX-example-flightdata-events.ipynb: ##########
+ "## Getting started\n",

Review Comment:
Suggest "Load Python libraries and set up global variables" or "Set up stage". "Getting started" is a little overloaded.
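Earlier in this review there's a suggestion that MSQ could run a SELECT on an EXTERN to preview the data without switching to the console. A hedged sketch of what that query could look like: the HTTP URI and the column signature below are placeholders, not the real location or schema of the FlightCarrierOnTime sample data.

```python
# Sketch: preview a few rows with MSQ's EXTERN table function instead
# of the console data loader. Substitute the real URI and row signature
# of the sample flight data before running against a cluster.
preview_sql = """
SELECT *
FROM TABLE(
  EXTERN(
    '{"type": "http", "uris": ["https://example.com/flight-carrier-on-time.csv.gz"]}',
    '{"type": "csv", "findColumnsFromHeader": true}',
    '[{"name": "FlightDate", "type": "string"}, {"name": "DepTime", "type": "string"}]'
  )
)
LIMIT 2
"""

# With the druidapi helper from the setup cell, something like
# display.sql(preview_sql) would render the rows - though, as the
# review notes, this may not work through the REST API client.
print(preview_sql)
```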
########## examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/XX-example-flightdata-events.ipynb: ##########
+ "1. Open the console\n",

Review Comment:
Rather than making a procedure for learning how to use the console, what about including a sample row here? You suggest we examine the data, so let's see the data here without having to switch to the console.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
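The final comment above asks for a sample row inline instead of console steps. A hedged sketch of what that could look like in the notebook - the column names roughly follow the carrier data set, but every value here is made up for illustration:

```python
import pandas as pd

# Two made-up rows in roughly the shape of the sample carrier data;
# column names and values are illustrative, not the real sample data.
sample_rows = pd.DataFrame([
    {"FlightDate": "2005-11-01", "Reporting_Airline": "AA", "Origin": "JFK",
     "Dest": "LAX", "DepTime": "0905", "ArrTime": "1212", "Cancelled": 0},
    {"FlightDate": "2005-11-01", "Reporting_Airline": "UA", "Origin": "ORD",
     "Dest": "SFO", "DepTime": "1130", "ArrTime": "1342", "Cancelled": 1},
])

# Render the rows as a plain table, as a cell in the notebook might.
print(sample_rows.to_string(index=False))
```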