[GitHub] rhunwicks opened a new issue #3302: Create a PandasDatasource

git Wed, 16 Aug 2017 03:02:35 -0700

rhunwicks opened a new issue #3302: Create a PandasDatasource
URL: https://github.com/apache/incubator-superset/issues/3302
 
 
   - [X] I have checked the issue tracker for the same issue and I haven't 
found one similar
   
   ### Superset version
   
   0.19.0
   
   ### Expected results
   
   There are a large number of Issues asking about adding new Datasources / 
Connectors:
   
   1. #381
   1. #2790 
   1. #2468
   1. #945 
   1. #241 
   1. #600 
   1. #245 
   
   Unfortunately, I can't find any examples of a working third-party datasource 
/ connector on GitHub. I think this is possibly because of the complexity 
and level of effort required to implement a BaseDatasource subclass with all 
the required methods. In particular, it needs to be able to report the schema 
and do filtering, grouping and aggregating.
   
   Pandas has great import code, and I have seen Pandas proposed as a method 
for implementing a CSV connector (see #381): read the CSV using Pandas, 
output it to sqlite, and then connect to sqlite using the SQLA Datasource to 
create the slices.
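   
   A minimal sketch of that CSV-to-sqlite workaround, assuming an in-memory 
CSV and an in-memory database in place of real files. The table name `sales` 
and the sample data are hypothetical, and the final SQL query only stands in 
for what the SQLA Datasource would issue:

```python
import io
import sqlite3

import pandas as pd

# Hypothetical sample data standing in for a real CSV file or URL.
csv_source = io.StringIO("region,sales\nnorth,10\nsouth,20\nnorth,5\n")

# Read the CSV with Pandas; column types are inferred from the data.
df = pd.read_csv(csv_source)

# Write the frame to SQLite (in-memory here; a file path in practice).
conn = sqlite3.connect(":memory:")
df.to_sql("sales", conn, index=False, if_exists="replace")

# The table can now be queried with SQL, as a SQL-based datasource would do.
rows = conn.execute(
    "SELECT region, SUM(sales) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)
```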
   
   This approach could be extended to other data formats that Pandas can read, 
e.g. Excel, HDF5, etc.
   
   However, it is not ideal because the sqlite file will potentially be out 
of date as soon as it is loaded.
   
   I'd like to propose an alternative: a PandasDatasource that allows the user 
to specify the import method (`read_csv`, `read_table`, `read_hdf`, etc.) and a 
URL, and which then reads the URL using that method to create a DataFrame. It 
reports the columns available and their types based on the dtypes of the 
DataFrame. By default, it allows grouping, filtering and aggregating using 
Pandas' built-in functionality.
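   
   To make the idea concrete, here is a rough sketch of what such a class 
might look like. It is standalone and does not implement Superset's 
BaseDatasource interface; the method names (`load`, `columns`, `query`) and 
the equality-only filter format are assumptions made for illustration:

```python
import io

import pandas as pd


class PandasDatasource:
    """Hypothetical sketch; not Superset's BaseDatasource API."""

    def __init__(self, read_method, source, **read_kwargs):
        self.read_method = read_method  # e.g. "read_csv", "read_table"
        self.source = source            # URL, path, or file-like object
        self.read_kwargs = read_kwargs

    def load(self):
        # Look up the Pandas reader by name and call it on the source.
        reader = getattr(pd, self.read_method)
        return reader(self.source, **self.read_kwargs)

    def columns(self, df):
        # Report column names and types from the DataFrame's dtypes.
        return {name: str(dtype) for name, dtype in df.dtypes.items()}

    def query(self, df, filters=None, groupby=None, aggregates=None):
        # filters is assumed to be {column: value} equality conditions.
        for column, value in (filters or {}).items():
            df = df[df[column] == value]
        if groupby and aggregates:
            df = df.groupby(groupby, as_index=False).agg(aggregates)
        return df


# Usage with in-memory sample data instead of a real URL.
source = io.StringIO("region,sales\nnorth,10\nsouth,20\nnorth,5\n")
datasource = PandasDatasource("read_csv", source)
frame = datasource.load()
schema = datasource.columns(frame)
result = datasource.query(
    frame, groupby=["region"], aggregates={"sales": "sum"}
)
print(schema)
print(result)
```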
   
   I realize that this approach won't work for very large datasets that could 
overwhelm the memory of the server, but it would work for my use case and 
probably for many others. The results of the read, filter, group and aggregate 
would be cached anyway, so the large memory usage is potentially only temporary.
   
   This would also make it much easier for people working with larger 
datasets to create a custom connector to suit their purposes. For example, 
someone wanting to use BigQuery (see #945) could extend the PandasDatasource to 
use `read_gbq` and to pass the filter options through to BigQuery but still 
rely on Pandas to do grouping and aggregating. Given that starting point, 
someone else might come along later and add the additional code necessary to 
pass some group options through to BigQuery.
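   
   A sketch of how that extension might look, under several assumptions: the 
base class here is a simplified stand-in, the reader function is injected (in 
practice it could be a `read_gbq` function) so the example runs without 
BigQuery credentials, and the SQL building is for illustration only. The 
class names, `build_sql` and the table name are all hypothetical:

```python
import pandas as pd


class FrameDatasource:
    """Hypothetical minimal base: load a frame, then filter/group in Pandas."""

    def load(self, filters):
        raise NotImplementedError

    def query(self, filters, groupby, aggregates):
        df = self.load(filters)
        # Apply filters in Pandas; harmless if the backend already applied them.
        for column, value in filters.items():
            df = df[df[column] == value]
        return df.groupby(groupby, as_index=False).agg(aggregates)


class BigQueryDatasource(FrameDatasource):
    """Pushes equality filters to the backend; grouping stays in Pandas."""

    def __init__(self, table, reader):
        self.table = table
        self.reader = reader  # callable taking a SQL string, returning a DataFrame

    def build_sql(self, filters):
        # Illustration only: real code should use parameterized queries.
        sql = f"SELECT * FROM {self.table}"
        if filters:
            clauses = [f"{col} = {value!r}" for col, value in filters.items()]
            sql += " WHERE " + " AND ".join(clauses)
        return sql

    def load(self, filters):
        return self.reader(self.build_sql(filters))


# Usage with a fake reader that records the SQL it was given.
seen = []


def fake_reader(sql):
    seen.append(sql)
    return pd.DataFrame({"region": ["north", "north"], "sales": [10, 5]})


ds = BigQueryDatasource("my_dataset.sales", fake_reader)
result = ds.query({"region": "north"}, ["region"], {"sales": "sum"})
print(seen[0])
print(result)
```

   Only `load` is overridden here; pushing grouping down to the backend later 
would mean overriding `query` as well.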
   
   The point is that instead of having to write an entire Datasource and 
implement all methods, you could extend an existing one to scratch your 
particular itch, and over time as more itches get scratched we would end up 
with a much broader selection of datasources for Superset.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]



With regards,
Apache Git Services
