More and more people use big data to optimize their algorithms, train their
models, and deploy those models as services and inference images. It is a big
challenge to store, manage, and analyze large amounts of structured and
unstructured data, especially unstructured data such as images, video, and
audio.

Many users use Python to build their projects for these scenarios. Apache
CarbonData is an indexed columnar data store solution for fast analytics on
big data platforms. Apache CarbonData has many great features and high
performance for storing, managing, and analyzing big data. Apache CarbonData
not only supports the String, Int, Double, Boolean, Char, Date, and TimeStamp
data types, but also supports Binary (CARBONDATA-3336), which avoids the
small-binary-files problem and can speed up S3 access performance by dozens
or even hundreds of times; it can also decrease the cost of accessing OBS by
reducing the number of S3 API calls. But it is not easy for these users to
use CarbonData through Java/Scala/C++, so it is better to provide a Python
interface so they can use CarbonData from Python code.

We have already been working on these features for several months in
https://github.com/xubo245/pycarbon

*Goals:
1. Apache CarbonData should provide a Python interface to support writing
and reading structured and unstructured data in CarbonData, such as String,
int, and binary data (image/voice/video). It should not depend on Apache
Spark.
2. Apache CarbonData should provide a Python interface to support deep
learning frameworks reading and writing data from/to CarbonData, such as
TensorFlow, MXNet, PyTorch, and so on. It should not depend on Apache Spark.
3. Apache CarbonData should provide a Python interface to manage and analyze
data based on Apache Spark. Apache CarbonData should support the DDL, DML,
and DataMap features in Python.*

*Details:*
*1. Apache CarbonData should provide a Python interface to support writing
and reading structured and unstructured data in CarbonData, such as String,
int, and binary data (image/voice/video). It should not depend on Apache
Spark.*
Apache CarbonData already provides Java/Scala/C++ interfaces for users, and
more and more people use Python to manage and analyze big data, so it is
better to provide a Python interface to support writing and reading
structured and unstructured data in CarbonData, such as String, int, and
binary data (image/voice/video). It should not depend on Apache Spark. We
call it PYSDK.

PYSDK is based on the CarbonData Java SDK and uses pyjnius to call Java code
from Python. Although Apache Spark uses py4j in PySpark to call Java code
from Python, py4j shows low performance when reading big data in CarbonData
format from Python code; py4j also reports low performance for large data
transfers in its own documentation:
https://www.py4j.org/advanced_topics.html#performance. JPype is another
popular tool for calling Java code from Python, but it stopped being updated
several years ago, so we cannot use it. In our tests, pyjnius showed high
performance when reading big data by calling Java code from Python, so it is
a good choice for us.

We have already been working on these features for several months in
https://github.com/xubo245/pycarbon
Goals:

1). PYSDK should provide an interface to read data
2). PYSDK should provide an interface to write data
3). PYSDK should support basic data types
4). PYSDK should support projection
5). PYSDK should support filters

*2. Apache CarbonData should provide a Python interface to support deep
learning frameworks reading and writing data from/to CarbonData, such as
TensorFlow, MXNet, PyTorch, and so on. It should not depend on Apache
Spark.*

Goals:
1). CarbonData provides a Python interface to support TensorFlow reading
data from CarbonData for training models
2). CarbonData provides a Python interface to support MXNet reading data
from CarbonData for training models
3). CarbonData provides a Python interface to support PyTorch reading data
from CarbonData for training models
4). CarbonData should support the epoch function
5). CarbonData should support caching to speed up performance
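
The epoch function in goal 4) essentially means iterating the dataset a fixed number of times (optionally reshuffled each pass) so a training framework can consume it directly. A minimal, framework-agnostic sketch; the function name make_epochs and its parameters are hypothetical, and a real PyCarbon reader would stream batches from CarbonData files instead of holding the rows in memory:

```python
import random

def make_epochs(rows, num_epochs, shuffle=True, seed=0):
    """Yield the dataset num_epochs times, reshuffling each pass.

    Hypothetical helper to illustrate the epoch semantics; not the
    actual PyCarbon API.
    """
    rng = random.Random(seed)
    for _ in range(num_epochs):
        order = list(rows)
        if shuffle:
            rng.shuffle(order)   # a new permutation per epoch
        for row in order:
            yield row

# Usage: 3 epochs over a toy dataset yields 3 * len(rows) samples.
samples = list(make_epochs([1, 2, 3], num_epochs=3))
print(len(samples))  # 9
```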

*3. Apache CarbonData should provide a Python interface to manage and
analyze data based on Apache Spark. Apache CarbonData should support the
DDL, DML, and DataMap features in Python.*

Goals:
1). PyCarbon supports reading data from local/HDFS/S3 in Python code via
PySpark DataFrame
2). PyCarbon supports writing data from Python code to local/HDFS/S3 via
PySpark DataFrame
3). PyCarbon supports DDL in Python with SQL syntax
4). PyCarbon supports DML in Python with SQL syntax
5). PyCarbon supports DataMap in Python with SQL syntax
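
From the user's side, goals 3)–5) could look like the following. The statements are modeled on CarbonData's existing Spark SQL syntax; the FakeSession class is only a stand-in recorder so the sketch stays self-contained without PySpark installed — in real PyCarbon these strings would go to SparkSession.sql(), and the exact DataMap properties would depend on the DataMap type chosen.

```python
class FakeSession:
    """Stand-in for a SparkSession so this sketch runs without PySpark."""
    def __init__(self):
        self.statements = []
    def sql(self, statement):
        # A real session would execute this against Spark + CarbonData.
        self.statements.append(statement)

spark = FakeSession()

# DDL: create a CarbonData table.
spark.sql("CREATE TABLE images (name STRING, image BINARY) "
          "STORED AS carbondata")
# DML: load data into the table.
spark.sql("INSERT INTO images SELECT name, image FROM staging")
# DataMap: illustrative DataMap DDL (properties vary by DataMap type).
spark.sql("CREATE DATAMAP img_dm ON TABLE images USING 'bloomfilter' "
          "DMPROPERTIES('INDEX_COLUMNS'='name')")

print(len(spark.statements))  # 3
```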


The JIRA is: 

https://issues.apache.org/jira/browse/CARBONDATA-3254

