Hi Yana,
I notice GC happening in every executor, around 400 ms on average. Do you
think that has a major impact on the overall query time?
Regarding the memory for a single worker, I have tried distributing the
memory by increasing the number of workers per node and dividing the memory
among the workers accordingly. In my trials I configured 96G and 48G per
worker, but did not see any difference in the query time.
Do you have any comments regarding the GC time? As I mentioned in the
earlier mails, I have varied the memory fractions but did not see much
difference.
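For reference, the settings I have been varying are roughly along these
lines (a minimal Spark 1.x-style sketch; the values are just examples, not
recommendations):

    import org.apache.spark.SparkConf

    // Sketch only: example values, not recommendations. These are the kinds of
    // Spark 1.x-era knobs referred to above: per-executor heap size plus the
    // storage and shuffle memory fractions that were varied.
    val conf = new SparkConf()
      .set("spark.executor.memory", "48g")           // tried 96g and 48g per worker
      .set("spark.storage.memoryFraction", "0.5")    // heap fraction for cached tables/RDDs
      .set("spark.shuffle.memoryFraction", "0.3")    // heap fraction for shuffle buffers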
Thanks and regards
Vinay Kashyap
________________________________________________
From:"Yana Kadiyska" <yana.kadiy...@gmail.com>
Sent:"vinay.kashyap" <vinay.kash...@socialinfra.net>
Date:Sat, August 9, 2014 6:56 am
Subject:Re: Low Performance of Shark over Spark.
Can you see where your time is spent? I have noticed quite a bit of
variance in query time in the case where GC occurs in the middle of a
computation. I'm guessing you're talking about time averaged over multiple
runs, but thought I'd mention this as a possible thing to check. I might be
mistaken, but 96G in a single worker (if I'm reading your mail correctly)
seems on the high side (although I cannot find any recommendation now on
what it should be).
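One way to check is to turn on GC logging on the executors and compare the
logged pauses against the per-task GC time shown in the web UI. A sketch,
assuming a Spark version that supports spark.executor.extraJavaOptions
(older deployments set the same flags through SPARK_JAVA_OPTS):

    import org.apache.spark.SparkConf

    // Sketch: enable GC logging in each executor JVM so pause times can be
    // compared against the GC time reported per task in the web UI.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")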
On Fri, Aug 8, 2014 at 4:45 AM, vinay.kashyap
<vinay.kash...@socialinfra.net> wrote:
Hi Mayur,
I cannot use Spark SQL in this case because many of the aggregations are
not supported yet. Hence I migrated back to Shark, as all those aggregation
functions are supported there.
http://apache-spark-user-list.1001560.n3.nabble.com/Support-for-Percentile-and-Variance-Aggregation-functions-in-Spark-with-HiveContext-td10658.html
I forgot to mention in the earlier thread that the raw_table which I am
using is actually a Parquet table.
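For example, the queries are along these lines (an illustration only; the
table and column names are made up, just to show the kind of aggregations
involved):

    // Illustration only: the kind of Hive UDAF-style aggregations (percentile,
    // variance) referred to above. `runSql` stands in for whatever Shark entry
    // point is used (CLI or a SharkContext); column names are made up.
    def exampleAggregation(runSql: String => Unit): Unit = {
      runSql(
        """SELECT percentile(response_time, 0.95) AS p95,
          |       variance(response_time)        AS var_rt
          |FROM raw_table
          |WHERE event_date >= '2014-06-01'""".stripMargin)
    }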
>> 2. cache data at a partition level from Hive & operate on those instead.
Do you mean that I should cache a table created by querying the data for a
set of a few months, and then issue the ad-hoc queries on that table?
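In other words, something roughly like the following? (A sketch only,
assuming Shark's "shark.cache" table property, or the _cached table-name
convention, is available in the version in use; all names are illustrative.)

    // Rough sketch: materialize a few months of the Parquet raw_table into
    // Shark's in-memory store, then run ad-hoc queries against that copy.
    // `runSql` stands in for the Shark entry point; names are illustrative.
    def cacheRecentMonths(runSql: String => Unit): Unit = {
      runSql(
        """CREATE TABLE raw_recent_cached
          |TBLPROPERTIES ("shark.cache" = "true")
          |AS SELECT * FROM raw_table
          |WHERE event_month IN ('2014-06', '2014-07', '2014-08')""".stripMargin)

      // Ad-hoc queries then scan the cached subset instead of the full table.
      runSql("SELECT count(*) FROM raw_recent_cached")
    }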
Thanks and regards
Vinay Kashyap