Chris Sanjiv Xavier created SPARK-17366:
-------------------------------------------

             Summary: Temp tables cached in spark - Joins performance
                 Key: SPARK-17366
                 URL: https://issues.apache.org/jira/browse/SPARK-17366
             Project: Spark
          Issue Type: Brainstorming
          Components: SQL
         Environment: Amazon S3
            Reporter: Chris Sanjiv Xavier


Hi,

I have a use case in which Spark runs on an Amazon EC2 instance. We are
pulling data from an S3 bucket into DataFrames and then caching the
resulting tables.
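
For illustration, the setup described above might look like the following
minimal sketch. It is not the reporter's actual code: it assumes Spark 2.x
with a SparkSession passed in as `spark`, Parquet input, and a hypothetical
helper name (`load_and_cache`), path, and table name, none of which are
stated in this report.

```python
def load_and_cache(spark, path, table_name):
    """Read a dataset, register it as a temp view, and cache it.

    Assumptions (not from the report): `spark` is a Spark 2.x
    SparkSession, the data is stored as Parquet, and `path` and
    `table_name` are hypothetical placeholders.
    """
    df = spark.read.parquet(path)            # load the S3 data into a DataFrame
    df.createOrReplaceTempView(table_name)   # expose it to the SQL interface
    spark.catalog.cacheTable(table_name)     # mark the table for in-memory caching
    cached = spark.table(table_name)
    cached.count()                           # run an action so the cache is populated
    return cached
```

In a Zeppelin `%pyspark` paragraph, where a session named `spark` is
normally predefined, this could be called as
`load_and_cache(spark, "s3a://example-bucket/table_a/", "tableA")`. The
`count()` call is there because `cacheTable` by itself does not read any
data; without an action, the first join would also pay the cost of loading
from S3.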

We see serious performance problems when we join two tables that have
both been cached: the join runs very slowly.

Example of the issue:

Table A in memory: 1000 MB
Table B in memory: 1000 MB

We pull the data using the SQL interface in a Zeppelin notebook on Amazon.

SELECT * FROM tableA A INNER JOIN tableB B ON A.column1 = B.column1
WHERE B.column2 = 'SPARK';

The above query returns results extremely slowly.

This is a Spark cluster with 6 nodes and close to 250 GB of memory in total.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

