[jira] [Comment Edited] (SPARK-40502) Support dataframe API use jdbc data source in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607523#comment-17607523 ]

CaoYu edited comment on SPARK-40502 at 9/21/22 6:07 AM:

I am a teacher. I recently designed an introductory Python course with a big-data track, and PySpark is one of its practical case studies. The course uses only simple RDD code to do the basic data-processing work, and reading from a JDBC data source is part of the curriculum.

Because the course is very basic, simple RDD code works well as an example. Using DataFrames would require explaining much more material, which is not friendly to novice students. DataFrames (Spark SQL) will be covered in a future advanced course.

So I hope that JDBC data extraction can be done through the RDD API.

was (Author: javacaoyu):

I am a teacher. I recently designed an introductory Python course with a big-data track, and PySpark is one of its practical case studies. The course uses only simple RDD code to do the basic data-processing work, and reading from a JDBC data source is part of the curriculum. DataFrames (Spark SQL) will be covered in a future advanced course.

So I hope the RDD API gains JDBC data source support.
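For context, here is a minimal sketch of the DataFrame route that already exists in PySpark, plus the df.rdd bridge that can keep a lesson in plain RDD code after a single DataFrame read. The driver JAR path, connection URL, table name, and credentials below are hypothetical placeholders.

{code:python}
from pyspark.sql import SparkSession

# The MySQL JDBC driver JAR must be visible to the driver and executors;
# this path is a placeholder.
spark = (
    SparkSession.builder
    .appName("jdbc-dataframe-example")
    .config("spark.jars", "/path/to/mysql-connector-j.jar")
    .getOrCreate()
)

# Read a table through the built-in DataFrame JDBC source.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/testdb")
    .option("dbtable", "students")
    .option("user", "root")
    .option("password", "secret")
    .load()
)

# For an RDD-centric exercise, drop down to the underlying RDD of Row
# objects, so the rest of the lesson stays in RDD code.
rdd = df.rdd.map(lambda row: (row["id"], row["name"]))
print(rdd.take(5))
{code}

This confines the DataFrame machinery to one read call, though it still requires introducing SparkSession, which is exactly the extra material the introductory course is trying to avoid.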
[jira] [Commented] (SPARK-40502) Support dataframe API use jdbc data source in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607524#comment-17607524 ]

CaoYu commented on SPARK-40502:
---

When I designed a Python Flink course, I found that PyFlink lacked the sum/min/minBy/max/maxBy operators, so I submitted PRs to the Flink community with the Python implementations of those operators (FLINK-26609, FLINK-26728).

So, again, if a JDBC data source is what PySpark needs, I'd love to implement it and have the time to do so.
[jira] [Commented] (SPARK-40502) Support dataframe API use jdbc data source in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607523#comment-17607523 ]

CaoYu commented on SPARK-40502:
---

I am a teacher. I recently designed an introductory Python course with a big-data track, and PySpark is one of its practical case studies. The course uses only simple RDD code to do the basic data-processing work, and reading from a JDBC data source is part of the curriculum. DataFrames (Spark SQL) will be covered in a future advanced course.

So I hope the RDD API gains JDBC data source support.
[jira] [Commented] (SPARK-40491) Remove too old TODO for JdbcRDD
[ https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607045#comment-17607045 ]

CaoYu commented on SPARK-40491:
---

Maybe we should not remove these just yet. I have already created https://issues.apache.org/jira/browse/SPARK-40502, please take a look. I want to try to implement a JDBC data source in PySpark, and I'm also interested in the Scala side of this task. If possible, please assign this task to me; I'd like to try to get it done.

> Remove too old TODO for JdbcRDD
> -------------------------------
>
> Key: SPARK-40491
> URL: https://issues.apache.org/jira/browse/SPARK-40491
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: jiaan.geng
> Priority: Major
>
> According to the legacy documentation of JdbcRDD, we need to expose a jdbcRDD function in SparkContext. In fact, this is a very old TODO, and we need to revisit whether it is still necessary. Since Spark SQL is now the core, I'm not sure anyone is interested in a new API for creating a JDBC RDD.
[jira] [Created] (SPARK-40502) Support dataframe API use jdbc data source in PySpark
CaoYu created SPARK-40502:
-------------------------------------

Summary: Support dataframe API use jdbc data source in PySpark
Key: SPARK-40502
URL: https://issues.apache.org/jira/browse/SPARK-40502
Project: Spark
Issue Type: New Feature
Components: PySpark
Affects Versions: 3.3.0
Reporter: CaoYu

When I use PySpark, I want to get data from a MySQL database, so I want to use JdbcRDD the way Java/Scala code can. But that is not supported in PySpark.

For some reasons I can't use the DataFrame API and can only use the RDD API, even though I know the DataFrame API can read from a JDBC source fairly well.

So I want to implement functionality that lets an RDD read data from a JDBC source in PySpark.

*But I don't know whether this is necessary for PySpark, so we can discuss it.*

*If it is necessary for PySpark, I want to contribute it to Spark. I hope this Jira task can be assigned to me so I can start working on implementing it.*

*If not, please close this Jira task.*

*Thanks a lot.*
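Until such an API exists, the range-partitioning scheme that the JVM JdbcRDD uses (lowerBound, upperBound, numPartitions, and a query with two bound placeholders) can be approximated from Python. This is a minimal sketch, not the proposed implementation: it assumes an existing SparkContext sc, the pymysql package installed on every executor, and a hypothetical students(id, name) table; all connection settings are placeholders.

{code:python}
import pymysql  # assumed to be installed on the executors

def read_range(bounds):
    """Read one key range [lower, upper) with a single query."""
    lower, upper = bounds
    conn = pymysql.connect(host="localhost", user="root",
                           password="secret", database="testdb")
    try:
        with conn.cursor() as cur:
            # Two bound parameters, mirroring JdbcRDD's two '?' placeholders.
            cur.execute(
                "SELECT id, name FROM students WHERE id >= %s AND id < %s",
                (lower, upper),
            )
            for row in cur.fetchall():
                yield row
    finally:
        conn.close()

# Split the key space [0, 1000) into 4 ranges, one query per partition,
# the same partitioning scheme JdbcRDD uses on the JVM side.
num_partitions, lower_bound, upper_bound = 4, 0, 1000
step = (upper_bound - lower_bound) // num_partitions
bounds = [(lower_bound + i * step, lower_bound + (i + 1) * step)
          for i in range(num_partitions)]
rows = sc.parallelize(bounds, num_partitions).flatMap(read_range)
print(rows.count())
{code}

A built-in version would presumably wrap this pattern behind a proper API and handle connection management and credentials more carefully; the sketch only shows the partitioning idea.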