[jira] [Comment Edited] (SPARK-40502) Support dataframe API use jdbc data source in PySpark

2022-09-20 Thread CaoYu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607523#comment-17607523
 ] 

CaoYu edited comment on SPARK-40502 at 9/21/22 6:07 AM:


I am a teacher. I recently designed a basic Python course with a big-data focus.

PySpark is one of the practical case studies, but it only uses simple RDD code 
to do basic data-processing work, and using a JDBC data source is part of the 
course.

 

Because the course is very basic, simple RDD code is suitable as an example. 
Using DataFrames would require explaining much more material, which is not 
friendly to novice students.

DataFrames (Spark SQL) will be used in the advanced courses I design later.

So I hope that JDBC data extraction can be done through the RDD API.
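To make the request concrete, here is a rough sketch of the pattern an RDD-style JDBC source follows: every partition runs the same bounded query over its own slice of the key space. This sketch uses Python's built-in sqlite3 as a stand-in for a real JDBC connection, and the table, column, and file names are invented for the example:

```python
import sqlite3

def read_partition(db_path, table, key_col, lower, upper):
    """Fetch one partition's rows: lower <= key < upper.

    This mirrors what each task would do in an RDD-backed JDBC
    source: open its own connection, run a range-bounded query,
    and return only its slice of the table.
    """
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            f"SELECT * FROM {table} "
            f"WHERE {key_col} >= ? AND {key_col} < ? ORDER BY {key_col}",
            (lower, upper),
        )
        return cur.fetchall()
    finally:
        conn.close()

# Tiny demo database standing in for the MySQL table in the issue.
conn = sqlite3.connect("demo.db")
conn.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER, v TEXT)")
conn.execute("DELETE FROM t")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(10)])
conn.commit()
conn.close()

# Two "partitions" covering ids 0-4 and 5-9.
part0 = read_partition("demo.db", "t", "id", 0, 5)
part1 = read_partition("demo.db", "t", "id", 5, 10)
```

In Spark itself the two `read_partition` calls would run as separate tasks on executors rather than sequentially in the driver.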

 

 

 


was (Author: javacaoyu):
I am a teacher. I recently designed a basic Python course with a big-data focus.

PySpark is one of the practical case studies, but it only uses simple RDD code 
to do basic data-processing work, and using a JDBC data source is part of the 
course.

DataFrames (Spark SQL) will be used in the advanced courses I design later.
So I hope the RDD API gains JDBC data source capability.

 

 

> Support dataframe API use jdbc data source in PySpark
> -
>
> Key: SPARK-40502
> URL: https://issues.apache.org/jira/browse/SPARK-40502
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: CaoYu
>Priority: Major
>
> When I use PySpark, I want to get data from a MySQL database, so I want to 
> use JdbcRDD as in Java/Scala.
> But that is not supported in PySpark.
>  
> For some reasons I can't use the DataFrame API and can only use the RDD API, 
> even though I know DataFrames can read from a JDBC source fairly well.
>  
> So I want to implement functionality that lets an RDD read from a JDBC 
> source in PySpark.
>  
> *But I don't know whether that is necessary for PySpark, so we can discuss it.*
>  
> *If it is necessary for PySpark, I want to contribute it to Spark. I hope 
> this Jira task can be assigned to me so I can start working on it.*
>  
> *If not, please close this Jira task.*
>  
> *Thanks a lot.*
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40502) Support dataframe API use jdbc data source in PySpark

2022-09-20 Thread CaoYu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607524#comment-17607524
 ] 

CaoYu commented on SPARK-40502:
---

When I designed the Python Flink course, I found that PyFlink lacked the 
sum/min/minBy/max/maxBy operators, so I submitted PRs to the Flink community 
with Python implementations of those operators (FLINK-26609, FLINK-26728).

So, again, if a JDBC data source is what PySpark needs, I'd be glad to 
implement it, and I have the time.







[jira] [Commented] (SPARK-40502) Support dataframe API use jdbc data source in PySpark

2022-09-20 Thread CaoYu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607523#comment-17607523
 ] 

CaoYu commented on SPARK-40502:
---

I am a teacher. I recently designed a basic Python course with a big-data focus.

PySpark is one of the practical case studies, but it only uses simple RDD code 
to do basic data-processing work, and using a JDBC data source is part of the 
course.

DataFrames (Spark SQL) will be used in the advanced courses I design later.
So I hope the RDD API gains JDBC data source capability.

 

 







[jira] [Commented] (SPARK-40491) Remove too old TODO for JdbcRDD

2022-09-20 Thread CaoYu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607045#comment-17607045
 ] 

CaoYu commented on SPARK-40491:
---

Maybe we should just not remove these.

I have already created https://issues.apache.org/jira/browse/SPARK-40502; 
please take a look. I want to try to implement a JDBC data source for PySpark.

I'm also interested in this task on the Scala side.

If possible, please assign this task to me; I want to try to get it done.
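For reference, the partitioning that Scala's JdbcRDD performs is easy to restate: it splits the closed range [lowerBound, upperBound] into numPartitions contiguous slices, one per task. The sketch below is my own restatement of that scheme in Python, not code from the Spark source tree:

```python
def jdbc_partitions(lower_bound, upper_bound, num_partitions):
    """Split the inclusive range [lower_bound, upper_bound] into
    num_partitions contiguous (start, end) slices, the way JdbcRDD
    carves up its key range for the per-partition WHERE clauses."""
    length = upper_bound - lower_bound + 1
    parts = []
    for i in range(num_partitions):
        start = lower_bound + (i * length) // num_partitions
        end = lower_bound + ((i + 1) * length) // num_partitions - 1
        parts.append((start, end))
    return parts

# Ten keys across three partitions; slice sizes differ by at most one.
print(jdbc_partitions(1, 10, 3))  # → [(1, 3), (4, 6), (7, 10)]
```

Each (start, end) pair then parameterizes one bounded query, e.g. `WHERE id >= start AND id <= end`, run by its own task.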

> Remove too old TODO for JdbcRDD
> ---
>
> Key: SPARK-40491
> URL: https://issues.apache.org/jira/browse/SPARK-40491
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> According to the legacy documentation of JdbcRDD, we need to expose a jdbcRDD 
> function in SparkContext.
> In fact, this is a very old TODO, and we need to revisit whether it is still 
> necessary. Since Spark SQL is the new core, I'm not sure anyone is 
> interested in a new API for creating a JDBC RDD.






[jira] [Created] (SPARK-40502) Support dataframe API use jdbc data source in PySpark

2022-09-20 Thread CaoYu (Jira)
CaoYu created SPARK-40502:
-

 Summary: Support dataframe API use jdbc data source in PySpark
 Key: SPARK-40502
 URL: https://issues.apache.org/jira/browse/SPARK-40502
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 3.3.0
Reporter: CaoYu


When I use PySpark, I want to get data from a MySQL database, so I want to use 
JdbcRDD as in Java/Scala.

But that is not supported in PySpark.

For some reasons I can't use the DataFrame API and can only use the RDD API, 
even though I know DataFrames can read from a JDBC source fairly well.
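For completeness, the DataFrame-level JDBC path already works in PySpark: connection details are passed as reader options. The sketch below builds that option map as a plain function so its shape is visible; the MySQL host, database, table, and credentials are placeholders, and the `spark.read` call appears only in a comment because it needs a live SparkSession and a MySQL driver on the classpath:

```python
def mysql_jdbc_options(host, port, db, table, user, password):
    """Build the option map for PySpark's DataFrame JDBC reader.

    The driver class is MySQL Connector/J 8's; all other values
    here are placeholders for the example.
    """
    return {
        "url": f"jdbc:mysql://{host}:{port}/{db}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "com.mysql.cj.jdbc.Driver",
    }

opts = mysql_jdbc_options("localhost", 3306, "shop", "orders", "u", "p")

# With a live SparkSession this would be:
#   df = spark.read.format("jdbc").options(**opts).load()
```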

 
So I want to implement functionality that lets an RDD read from a JDBC source 
in PySpark.

*But I don't know whether that is necessary for PySpark, so we can discuss it.*

*If it is necessary for PySpark, I want to contribute it to Spark. I hope this 
Jira task can be assigned to me so I can start working on it.*

*If not, please close this Jira task.*

*Thanks a lot.*


