[...] need to make a call whether you want to take the upfront cost of a
shuffle, or you want to live with a large number of tasks.
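The reply above is truncated, but the trade-off it describes, paying for a shuffle up front versus living with many small tasks, corresponds to repartition() versus coalesce() (or doing nothing) on the scanned data. A minimal sketch of the options (mine, not code from the thread); the table name, partition count, and output paths are placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionTradeoff {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("PartitionTradeoff")
                .getOrCreate();

        // Placeholder table; the thread is about small TPCDS tables in ORC format.
        Dataset<Row> df = spark.table("table_name");

        // Live with however many tasks the source produces: no shuffle at all.
        df.write().mode("overwrite").orc("/tmp/out_as_is");

        // Pay the upfront cost of a shuffle to end up with fewer, balanced tasks.
        df.repartition(8).write().mode("overwrite").orc("/tmp/out_repartitioned");

        // Or merge partitions without a shuffle, at the risk of uneven task sizes.
        df.coalesce(8).write().mode("overwrite").orc("/tmp/out_coalesced");

        spark.stop();
    }
}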
From: Tin Vu
Date: Thursday, March 29, 2018 at 10:47 AM
To: "Lalwani, Jayesh"
Cc: "user@spark.apache.org"
Subject: Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very
low when compared to Drill or Presto
From: Tin Vu
Date: Wednesday, March 28, 2018 at 8:04 PM
To: "user@spark.apache.org"
Subject: [SparkSQL] SparkSQL performance on small TPCDS tables is very low
when compared to Drill or Presto

Hi,

I am executing a benchmark to compare the performance of SparkSQL, Apache
Drill and Presto. My experimental setup [...] with ORC format. I executed a
very simple SQL query: "SELECT * from table_name".
The issue is that for some small tables (even tables with a few dozen
records), SparkSQL still required about 7-8 seconds to finish, while Drill
and Presto needed less than 1 second.
For other large tables with billions of records, SparkSQL performance was
reasonable: it required 20-30 seconds to scan the whole table.
Do you have any idea or a reasonable explanation for this issue?

Thanks,
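For reference, a minimal sketch (not Tin Vu's actual benchmark harness) of how such a scan might be issued and timed from SparkSQL; the ORC path and table name are placeholders. For very small tables the wall-clock time is typically dominated by fixed per-query overhead (planning and task scheduling) rather than the scan itself:

import org.apache.spark.sql.SparkSession;

public class ScanTiming {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ScanTiming")
                .getOrCreate();

        // Placeholder path: register the ORC data as a temporary view.
        spark.read().orc("/path/to/tpcds/table_name")
                .createOrReplaceTempView("table_name");

        long start = System.nanoTime();
        // count() forces the scan; a real benchmark would fetch the rows the
        // same way in every engine being compared.
        long rows = spark.sql("SELECT * FROM table_name").count();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println(rows + " rows in " + elapsedMs + " ms");
        spark.stop();
    }
}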
Hammad,
The recommended way to implement this logic would be to:

1. Create a SparkSession.
2. Create a StreamingContext using the SparkContext embedded in the SparkSession.
3. Use the single SparkSession instance for the SQL operations within foreachRDD (see the sketch below).

It's important to note that Spark operations [...]
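A minimal sketch of those three steps (mine, not code from this thread), assuming Spark 2.x; the socket source and the RecordBean class are hypothetical and exist only so the example compiles and has a schema to work with:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingWithSingleSession {
    public static void main(String[] args) throws InterruptedException {
        // 1. One SparkSession for the whole application (no allowMultipleContexts).
        SparkSession spark = SparkSession.builder()
                .master("local[2]")
                .appName("TransformerStreamPOC")
                .getOrCreate();

        // 2. Build the StreamingContext from the SparkContext embedded in the session.
        JavaStreamingContext jssc = new JavaStreamingContext(
                new JavaSparkContext(spark.sparkContext()), Durations.seconds(60));

        // Hypothetical input; replace with the real source DStream.
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // 3. Reuse the same SparkSession for SQL work inside foreachRDD (runs on the driver).
        lines.foreachRDD(rdd -> {
            Dataset<Row> df = spark.createDataFrame(rdd.map(RecordBean::new), RecordBean.class);
            df.createOrReplaceTempView("records");
            spark.sql("SELECT COUNT(*) FROM records").show();
        });

        jssc.start();
        jssc.awaitTermination();
    }

    // Hypothetical bean that only exists to give the RDD a schema in this sketch.
    public static class RecordBean implements java.io.Serializable {
        private String value;
        public RecordBean() { }
        public RecordBean(String value) { this.value = value; }
        public String getValue() { return value; }
        public void setValue(String value) { this.value = value; }
    }
}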
Hello,

Background:
I have a Spark Streaming context:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf()
        .setMaster("local[2]")
        .setAppName("TransformerStreamPOC");
conf.set("spark.driver.allowMultipleContexts", "true");  // <== this
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(60));
https://github.com/databricks/spark-avro
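As a rough illustration (not code from this thread), here is how that spark-avro package could be used from Java with the Spark 1.3-era SQLContext API; the file path, table name, and predicate column are made up:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class AvroLoadExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setMaster("local[2]").setAppName("AvroLoadExample"));
        SQLContext sqlContext = new SQLContext(sc);

        // Load an Avro file through the external data source API provided by spark-avro.
        DataFrame events = sqlContext.load("/data/events.avro", "com.databricks.spark.avro");

        events.registerTempTable("events");
        sqlContext.sql("SELECT * FROM events WHERE attribute1 < 5").show();
    }
}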
On Tue, Apr 21, 2015 at 3:09 PM, Renato Marroquín Mogrovejo <
renatoj.marroq...@gmail.com> wrote:
Thanks Michael!
I have tried applying my schema programmatically but I didn't get any
improvement in performance :(
Could you point me to some code examples using Avro please?
Many thanks again!
Renato M.
2015-04-21 20:45 GMT+02:00 Michael Armbrust :
Here is an example using rows directly:
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema
Avro or parquet input would likely give you the best performance.
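A minimal sketch along the lines of that linked section (my own code, not the example from the docs), specifying the schema programmatically and building the DataFrame from Rows with Spark 1.3-era APIs; the column names echo the thread's query but are otherwise placeholders:

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class RowSchemaExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[2]", "RowSchemaExample");
        SQLContext sqlContext = new SQLContext(sc);

        // Declare the schema explicitly instead of relying on JavaBean reflection.
        StructType schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("attribute1", DataTypes.IntegerType, false),
                DataTypes.createStructField("name", DataTypes.StringType, true)));

        // Build Rows directly; in practice these would come from the existing input data.
        JavaRDD<Row> rows = sc.parallelize(Arrays.asList(
                RowFactory.create(1, "a"),
                RowFactory.create(7, "b")));

        DataFrame tableX = sqlContext.createDataFrame(rows, schema);
        tableX.registerTempTable("tableX");
        sqlContext.sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5").show();
    }
}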
On Tue, Apr 21, 2015 at 4:28 AM, Renato Marroquín Mogrovejo <
renatoj.marroq...@gmail.com
Thanks for the hints guys! much appreciated!
Even if I just do something like:
"Select * from tableX where attribute1 < 5"
I see similar behaviour.
@Michael
Could you point me to any sample code that uses Spark's Rows? We are at a
phase where we can actually change our JavaBeans for something
There is a cost to converting from JavaBeans to Rows and this code path has
not been optimized. That is likely what you are seeing.
On Mon, Apr 20, 2015 at 3:55 PM, ayan guha wrote:
SparkSQL optimizes better by column pruning and predicate pushdown,
primarily. Here you are not taking advantage of either.
I am curious to know what goes in your filter function, as you are not
using a filter on the SQL side.
Best
Ayan
On 21 Apr 2015 08:05, "Renato Marroquín Mogrovejo" <
renatoj.mar
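To make Ayan's point concrete (my sketch, not his code): a query shape that can actually benefit from column pruning and predicate pushdown projects only the needed columns and expresses the filter in SQL, ideally over a columnar source such as Parquet rather than a plain Kryo file. Table, column, and path names are placeholders:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class PruningPushdownExample {
    // Assumes a SQLContext whose "tableX" view is backed by a Parquet file, e.g.
    // sqlContext.parquetFile("/data/tableX.parquet").registerTempTable("tableX").
    static DataFrame prunedAndPushed(SQLContext sqlContext) {
        // Only two columns are read, and the predicate can be pushed down to the
        // Parquet reader instead of being applied after a full scan of every column.
        return sqlContext.sql(
                "SELECT attribute1, name FROM tableX WHERE attribute1 BETWEEN 0 AND 5");
    }
}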
Does anybody have an idea? a clue? a hint?
Thanks!
Renato M.
2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo <
renatoj.marroq...@gmail.com>:
Hi all,
I have a simple query "Select * from tableX where attribute1 between 0 and
5" that I run over a Kryo file with four partitions that ends up being
around 3.5 million rows in our case.
If I run this query by doing a simple map().filter() it takes around 9.6
seconds, but when I apply a schema [...]
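For concreteness, a minimal sketch (not Renato's code) of the two paths being compared: a plain RDD filter versus applying a schema and running the same predicate through SparkSQL. The Record bean and the inline data stand in for the Kryo-backed input described above:

import java.io.Serializable;
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class RddVsSqlExample {
    public static class Record implements Serializable {
        private int attribute1;
        public Record() { }
        public Record(int attribute1) { this.attribute1 = attribute1; }
        public int getAttribute1() { return attribute1; }
        public void setAttribute1(int attribute1) { this.attribute1 = attribute1; }
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[4]", "RddVsSqlExample");
        SQLContext sqlContext = new SQLContext(sc);

        JavaRDD<Record> records = sc.parallelize(
                Arrays.asList(new Record(1), new Record(4), new Record(9)), 4);

        // Path 1: plain RDD transformation, no SQL involved.
        long rddCount = records
                .filter(r -> r.getAttribute1() >= 0 && r.getAttribute1() <= 5)
                .count();

        // Path 2: apply a schema (here via JavaBean reflection, the conversion
        // Michael mentions as unoptimized) and run the same predicate in SQL.
        DataFrame df = sqlContext.createDataFrame(records, Record.class);
        df.registerTempTable("tableX");
        long sqlCount = sqlContext
                .sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5")
                .count();

        System.out.println(rddCount + " vs " + sqlCount);
    }
}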
On 1/27/15 11:38 AM, Manoj Samel wrote:
Spark 1.2, no Hive; prefer not to use HiveContext, to avoid metastore_db.
The use case is a Spark Yarn app that will start and serve as a query server
for multiple users, i.e. always up and running. At startup, there is an
option to cache data and also pre-compute some result sets, hash maps, etc.
that would be [...]
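A minimal sketch of that startup step (mine, not Manoj's app), written against the Spark 1.3-style SQLContext/DataFrame API for brevity rather than 1.2's SchemaRDD; paths and table names are placeholders. A plain SQLContext keeps Hive and metastore_db out of the picture:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class QueryServerStartup {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("yarn-client", "QueryServer");
        SQLContext sqlContext = new SQLContext(sc);  // no HiveContext, no metastore_db

        // Register the data once at startup...
        DataFrame facts = sqlContext.parquetFile("/data/facts.parquet");
        facts.registerTempTable("facts");

        // ...mark it cached (materialized on first use) so user queries hit memory...
        sqlContext.cacheTable("facts");

        // ...and pre-compute a result set that will be queried repeatedly.
        DataFrame byKey = sqlContext.sql("SELECT key, COUNT(*) AS cnt FROM facts GROUP BY key");
        byKey.registerTempTable("facts_by_key");
        sqlContext.cacheTable("facts_by_key");

        // The always-on app would now serve user queries against these cached tables.
    }
}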
> Ideally it’s critical for the user community to be informed of all the
> in-depth tuning tricks of all products. However, realistically, there is a
> big gap in terms of documentation. Hope the Spark folks will make a
> difference. :-)
>
> Du
From: Soumya Simanta <soumya.sima...@gmail.com>
Date: Friday, October 31, 2014 at 4:04 PM
To: "user@spark.apache.org"
Subject: SparkSQL performance
I was really surprised to see the results here, esp. SparkSQL "not
completing"
http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
I was under the impression that SparkSQL performs really well because it
can optimize the RDD operations and load only the columns that are
required.