From: Reynold Xin [mailto:r...@databricks.com]
Sent: Tuesday, August 25, 2015 6:05 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Dataframe aggregation with Tungsten unsafe
There is a lot of GC activity due to the non-code-gen path being sloppy about
garbage creation. This is not actually what
Thank you for the explanation. The size of the 100M-row data set is ~1.4GB in
memory and each worker has 32GB of memory. There seems to be a lot of free
memory available. I wonder how Spark can hit GC with such a setup?
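One way to check whether GC really is the bottleneck is the per-task "GC Time" column in the Spark UI, or GC logging on the executors. A minimal sketch, using the standard `spark.executor.extraJavaOptions` setting and standard HotSpot flags (the exact flag set is a suggestion, not from the thread):

    // Sketch: turn on executor GC logging via standard HotSpot options.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")

The resulting GC lines appear in each executor's stdout/stderr log.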
Reynold Xin <r...@databricks.com>
On Fri, Aug 21, 2015 at 11:07 AM, Ulanov, Alexander alexander.ula...@hp.com
wrote:
It seems that there is a nice improvement with Tungsten enabled given that the
data is persisted in memory: 2x and 3x. However, the improvement is not as
large for Parquet: 1.5x. What's interesting, with
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Thursday, August 20, 2015 9:24 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Dataframe aggregation with Tungsten unsafe
Not sure what's going on or how you measure the time, but the difference here
is pretty big
Hi Reynold,
Thank you for suggestion. This code takes around 30 sec on my setup (5 workers
with 32GB). My issue is that I don't see the change in time if I unset the
unsafe flags. Could you explain why it might happen?
On Aug 20, 2015, at 15:32, Reynold Xin wrote:
I think you might need to turn codegen on also in order for the unsafe
stuff to work.
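For reference, enabling both settings together might look like the sketch below. The config keys are assumptions based on Spark 1.4-era SQLConf names ("spark.sql.codegen", "spark.sql.unsafe.enabled"); verify them against the build you are running, since 1.5 consolidated them under Tungsten:

    // Sketch: enable code generation together with the unsafe (Tungsten)
    // aggregation path. Config keys assumed from Spark 1.4-era SQLConf.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("tungsten-agg-bench"))
    val sqlContext = new SQLContext(sc)

    sqlContext.setConf("spark.sql.codegen", "true")        // codegen on
    sqlContext.setConf("spark.sql.unsafe.enabled", "true") // unsafe path on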
BTW one other thing -- don't use the count() to do benchmark, since the
optimizer is smart enough to figure out that you don't actually need to run
the sum.
For the purpose of benchmarking, you can use
df.foreach(i => ()) // i.e. do nothing, just force every row to be computed
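The point above can be sketched as follows: a no-op foreach materializes every output row, whereas count() may be answered without computing the aggregated values. The timing helper is illustrative, not from the thread:

    // Sketch: timing a DataFrame aggregation without letting the optimizer
    // short-circuit the work. `df` is the aggregated DataFrame under test.
    def time[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
      result
    }

    // Bad: count() can skip computing the aggregate values themselves.
    time("count")   { df.count() }

    // Better: a no-op foreach forces every output row to be produced.
    time("foreach") { df.foreach(_ => ()) }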
*From:* Reynold Xin [mailto:r...@databricks.com]
*Sent:* Thursday, August 20, 2015 4:22 PM
*To:* Ulanov, Alexander
*Cc:* dev@spark.apache.org
*Subject:* Re: Dataframe aggregation with Tungsten unsafe
I think you might need to turn codegen on also in order for the unsafe
stuff to work
*Sent:* Thursday, August 20, 2015 5:26 PM
*To:* Ulanov, Alexander
*Cc:* dev@spark.apache.org
*Subject:* Re: Dataframe aggregation with Tungsten unsafe
Yes - DataFrame and SQL are the same thing.
Which version are you running? Spark 1.4 doesn't run Janino --- but you
have a Janino exception?
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Dataframe aggregation with Tungsten unsafe
Please git pull :)
On Thu, Aug 20, 2015 at 5:35 PM, Ulanov, Alexander
alexander.ula...@hp.com wrote:
I am using Spark 1.5 cloned from master on June 12. (The aggregate unsafe
Dear Spark developers,
I am trying to benchmark the new Dataframe aggregation implemented under the
project Tungsten and released with Spark 1.4 (I am using the latest Spark from
the repo, i.e. 1.5):
https://github.com/apache/spark/pull/5725
It says that the aggregation should be faster due to
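The exact query from this message is truncated in the archive; an aggregation of the kind being benchmarked might look like the sketch below. The column names, key cardinality, and row count are hypothetical reconstructions, not the original code:

    // Hypothetical reconstruction of the benchmark query; the original is
    // truncated, so names and sizes here are illustrative only.
    import org.apache.spark.sql.functions._

    val size = 100000000L // "100M" rows, per the later messages
    val df = sqlContext.range(0, size)
      .select((col("id") % 1000).as("key"), rand().as("value"))

    val aggregated = df.groupBy("key").sum("value")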
How did you run this? I couldn't run your query with 4G of RAM in 1.4, but
in 1.5 it ran.
Also I recommend just dumping the data to parquet on disk to evaluate,
rather than using the in-memory cache, which is super slow and we are
thinking of removing/replacing with something else.
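That suggestion might be sketched as follows; the paths and column names are placeholders:

    // Sketch: materialize the input to Parquet once, then benchmark against
    // the on-disk copy instead of the in-memory cache. Paths are placeholders.
    df.write.parquet("/tmp/bench/input.parquet")

    val fromDisk = sqlContext.read.parquet("/tmp/bench/input.parquet")
    fromDisk.groupBy("key").sum("value").foreach(_ => ())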
val size =
I didn't wait long enough earlier. Actually it did finish when I raised
memory to 8g.
In 1.5 with Tungsten (which should be the same as 1.4 with your unsafe
flags), the query took 40s with 4G of mem.
In 1.4, it took 195s with 8G of mem.
This is not a scientific benchmark and I only ran it