Hi Yana,
I notice that GC is happening in every executor, taking around 400 ms on average. Do you think this has a major impact on the overall query time?

Regarding the memory for a single worker, I have tried distributing the memory by increasing the number of workers per node and dividing the memory among them accordingly. In my trials I configured 96G and 48G per worker, but didn't see any difference in the query time.

Do you have any comments regarding the GC time? As I mentioned in the earlier mails, I have varied the memory fractions, but that did not make much difference either.
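
For reference, the settings involved would look roughly like this if set programmatically (a sketch only: the property names are the standard Spark ones, the GC-logging flags are just a way to surface GC time in the executor logs, and in a Shark deployment these would normally go through spark-env.sh / shark-env.sh rather than code):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: memory fractions and GC logging, set programmatically.
val conf = new SparkConf()
  .setAppName("shark-query-tuning")
  // fraction of the executor heap reserved for cached blocks
  .set("spark.storage.memoryFraction", "0.5")
  // fraction of the executor heap used for shuffle aggregation buffers
  .set("spark.shuffle.memoryFraction", "0.3")
  // print GC pauses in the executor logs to see whether the ~400 ms
  // collections add up to a significant share of the query time
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
// Workers per node and memory per worker (e.g. 96G vs 48G) are set through
// SPARK_WORKER_INSTANCES and SPARK_WORKER_MEMORY in spark-env.sh, not here.
val sc = new SparkContext(conf)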
 
Thanks and regards
Vinay Kashyap

 
________________________________________________

From:"Yana Kadiyska" <yana.kadiy...@gmail.com>

Sent:"vinay.kashyap" <vinay.kash...@socialinfra.net>

Date:Sat, August 9, 2014 6:56 am

Subject:Re: Low Performance of Shark over Spark.







Can you see where your time is spent? I have noticed quite a bit of variance in query time in cases where GC occurs in the middle of a computation. I'm guessing you're talking about time averaged over multiple runs, but I thought I'd mention this as a possible thing to check. I might be mistaken, but 96G in a single worker (if I'm reading your mail correctly) seems on the high side (although I cannot find any recommendation now on what it should be).

On Fri, Aug 8, 2014 at 4:45 AM, vinay.kashyap <vinay.kash...@socialinfra.net> wrote:

                
Hi Mayur,

I cannot use Spark SQL in this case because many of the aggregations are not supported yet. Hence I migrated back to using Shark, as all those aggregation functions are supported there.

                        

                        
http://apache-spark-user-list.1001560.n3.nabble.com/Support-for-Percentile-and-Variance-Aggregation-functions-in-Spark-with-HiveContext-td10658.html

                        

I forgot to mention in the earlier thread that the raw_table which I am using is actually a Parquet table.
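
For reference, the kind of aggregation I mean looks roughly like the following (a sketch only: the SharkContext entry point is assumed from Shark's programmatic API, and metric_value / event_date are placeholder column names):

import shark.{SharkContext, SharkEnv}

// Sketch only: a Hive-style aggregation of the kind Spark SQL does not
// support yet but Shark/Hive does. Column names are placeholders, and the
// initWithSharkContext call is assumed from Shark's programmatic API.
val sc: SharkContext = SharkEnv.initWithSharkContext("percentile-example")
val rows = sc.sql(
  "SELECT percentile_approx(metric_value, 0.95), variance(metric_value) " +
  "FROM raw_table WHERE event_date >= '2014-01-01'")
rows.foreach(println)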

                        
                                

>> 2. cache data at a partition level from Hive & operate on those instead.

Do you mean that I need to cache the table created by querying the data for a few months, and then issue the ad hoc queries on that table?
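
If that is what you mean, I suppose it would look roughly like this (a sketch only: the _cached table-name convention is what I understand Shark uses to keep a table in memory, the SharkContext entry point is assumed from Shark's programmatic API, and event_date is a placeholder column):

import shark.{SharkContext, SharkEnv}

// Sketch only: materialise a few months of the Parquet-backed raw_table
// as an in-memory Shark table, then run the ad hoc queries against it.
// Tables whose names end in "_cached" are kept in memory by Shark.
val sc: SharkContext = SharkEnv.initWithSharkContext("cache-subset")
sc.sql(
  "CREATE TABLE recent_months_cached AS " +
  "SELECT * FROM raw_table " +
  "WHERE event_date BETWEEN '2014-05-01' AND '2014-07-31'")
// subsequent ad hoc queries hit the cached subset instead of raw_table
val result = sc.sql("SELECT count(*) FROM recent_months_cached")
result.foreach(println)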

                        
                                

                                

                                

Thanks and regards
Vinay Kashyap
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Low-Performance-of-Shark-over-Spark-tp11649p11776.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.