[jira] [Commented] (SPARK-9983) Local physical operators for query execution

2019-06-11 Thread Lai Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860976#comment-16860976
 ] 

Lai Zhou commented on SPARK-9983:
-

[~rxin], we have used Calcite to build a high-performance Hive SQL engine, 
which is now released as open source.

See [https://github.com/51nb/marble]

It works well for real-time ML scenarios in our financial business, but I 
don't think it is the best solution.


I think adding a single-node version of DataFrame to Spark may be the best 
solution, because Spark SQL is naturally compatible with Hive SQL, and people 
could benefit from its excellent optimizer, vectorized execution, code 
generation, etc.

> Local physical operators for query execution
> 
>
> Key: SPARK-9983
> URL: https://issues.apache.org/jira/browse/SPARK-9983
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> In distributed query execution, there are two kinds of operators:
> (1) operators that exchange data between different executors or threads: 
> examples include broadcast, shuffle.
> (2) operators that process data in a single thread: examples include project, 
> filter, group by, etc.
> This ticket proposes clearly differentiating them and creating local 
> operators in Spark. This leads to a lot of benefits: easier to test, easier 
> to optimize data exchange, better design (single responsibility), and 
> potentially even having a hyper-optimized single-node version of DataFrame.
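
The distinction drawn in the ticket description can be sketched as follows. 
This is a minimal illustration in Python, not Spark's implementation: the 
class names LocalScan, LocalFilter and LocalProject and the sample rows are 
hypothetical. Each operator pulls rows from its child in a single thread, 
with no exchange (shuffle or broadcast) involved.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# A row is modelled as a plain dict; this is an assumption for illustration.
Row = Dict[str, object]


class LocalScan:
    """Leaf operator: yields rows from an in-memory collection."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self.rows = rows

    def __iter__(self) -> Iterator[Row]:
        return iter(self.rows)


class LocalFilter:
    """Keeps only the child rows for which the predicate returns True."""

    def __init__(self, child: Iterable[Row],
                 predicate: Callable[[Row], bool]) -> None:
        self.child = child
        self.predicate = predicate

    def __iter__(self) -> Iterator[Row]:
        return (row for row in self.child if self.predicate(row))


class LocalProject:
    """Keeps only the named columns of each child row."""

    def __init__(self, child: Iterable[Row], columns: List[str]) -> None:
        self.child = child
        self.columns = columns

    def __iter__(self) -> Iterator[Row]:
        return ({c: row[c] for c in self.columns} for row in self.child)


def run_example() -> List[Row]:
    # Hypothetical sample data.
    rows = [
        {"user": "a", "age": 31, "score": 0.7},
        {"user": "b", "age": 17, "score": 0.4},
        {"user": "c", "age": 45, "score": 0.9},
    ]
    # Roughly: SELECT user, age FROM rows WHERE age >= 18
    plan = LocalProject(
        LocalFilter(LocalScan(rows), lambda r: r["age"] >= 18),
        ["user", "age"],
    )
    return list(plan)


if __name__ == "__main__":
    print(run_example())
```

Because such operators depend only on their child iterator, each one can be 
unit-tested against an in-memory list without a cluster, which is the 
testability benefit the ticket mentions.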



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9983) Local physical operators for query execution

2019-06-11 Thread Lai Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860959#comment-16860959
 ] 

Lai Zhou commented on SPARK-9983:
-

[~rxin], regarding `a hyper-optimized single-node version of DataFrame`: do 
you have a roadmap for it?

In practice, we use Spark SQL to handle our ETL jobs on Hive. We extract a 
lot of user variables with complex SQL queries, and these variables become 
the input for machine-learning models.

But when we want to migrate these jobs to a real-time system, we always need 
to translate the SQL queries into another programming language, which 
requires a lot of work.

Currently, the local mode of Spark SQL is not a direct, high-performance 
execution mode, so I think it would make great sense to have a 
hyper-optimized single-node version.

 



