[ 
https://issues.apache.org/jira/browse/SPARK-17366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15456694#comment-15456694
 ] 

Herman van Hovell commented on SPARK-17366:
-------------------------------------------

[~chris_sanjiv] Is this a question or a bug report?

There are too many unknowns for us to be able to help you. What version of 
Spark are you using? What does your query plan look like? What do you mean by 
too slow? Is the data you are joining skewed?
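
To make the skew question concrete, here is a minimal sketch in plain Python 
(not the Spark API). It assumes you can pull a sample of the join column's 
values, for example from a GROUP BY on that column; the function name 
skew_report and the sample data are hypothetical and only for illustration. 
It reports how much of the sample the most frequent key accounts for.

```python
from collections import Counter
from statistics import median


def skew_report(join_keys):
    """Summarise how unevenly rows are spread across join-key values."""
    counts = Counter(join_keys)
    if not counts:
        raise ValueError("join_keys must not be empty")
    total = sum(counts.values())
    top_key, top_rows = counts.most_common(1)[0]
    return {
        "distinct_keys": len(counts),
        "top_key": top_key,
        # Fraction of all sampled rows that carry the most frequent key.
        "top_share": top_rows / total,
        # How many times larger the biggest key is than a typical key.
        "max_to_median": top_rows / median(counts.values()),
    }


# Hypothetical sample of join-key values; real values would come from the table.
sample = ["a"] * 90 + ["b"] * 5 + ["c"] * 5
report = skew_report(sample)
print(report)
```

If one key holds a large share of the rows, the tasks handling that key do 
most of the work in the join, which is one common reason a join over cached 
tables can still run slowly.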

> Temp tables cached in spark - Joins performance
> -----------------------------------------------
>
>                 Key: SPARK-17366
>                 URL: https://issues.apache.org/jira/browse/SPARK-17366
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: SQL
>         Environment: Amazon S3
>            Reporter: Chris Sanjiv Xavier
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Hi,
> I have a use case in which we have Spark running on an EC2 instance from 
> Amazon. We are pulling data from an S3 bucket. We pull it into DataFrames 
> and then cache the tables. 
> We face a lot of performance issues when we try to join the two tables that 
> have been cached. The join runs really slowly. 
> Example of the issue:
> Table A in memory 1000MB 
> Table B in memory 1000MB
> Pulling data using SQL interface on Zeppelin UI notebook on Amazon.
> Select * from table A inner join table B on A.column 1 = B.column 1 where 
> B.column 2 = 'SPARK' ; 
> The above query returns results extremely slowly. 
> This is a spark cluster with 6 nodes holding close to 250 GB memory in total.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

