Performance Issue with Column Addition in Spark 3.4.x: Time Doubling with Increased Columns

2023-07-04 Thread KO Dukhyun
Dear Spark users, I'm experiencing an unusual issue with Spark 3.4.x. When creating a new column as the sum of several existing columns, the time taken almost doubles as the number of columns increases. This operation doesn't require many resources, so I suspect there might be a problem with
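A minimal sketch of the operation being described, under assumptions of my own (a synthetic DataFrame with columns c0..c49 stands in for the real data; this is not the poster's code): the sum column is built as one folded Column expression and added with a single withColumn call. One thing worth checking in cases like this is whether each extra column adds another chained withColumn call, since every such call re-analyzes a larger logical plan.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object SumColumnsSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("sum-columns").master("local[*]").getOrCreate()

        // Hypothetical input: a DataFrame with numeric columns c0..c49
        val df = spark.range(0, 100000)
          .select((0 until 50).map(i => (col("id") + i).as(s"c$i")): _*)

        // Fold all columns into one expression and add it with a single withColumn call,
        // rather than chaining one withColumn per input column
        val total = (0 until 50).map(i => col(s"c$i")).reduce(_ + _)
        df.withColumn("total", total).show(5)

        spark.stop()
      }
    }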

Spark 2.4 and Hive 2.3 - Performance issue with concurrent hive DDL queries

2020-01-30 Thread Nirav Patel
Hi, I am trying to do 1000s of update-parquet-partition operations on different Hive tables in parallel from my client application. I am using Spark SQL in local mode with Hive enabled in my application to submit the Hive queries. Spark is being used in local mode because all the operations we do are
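A rough sketch of the pattern described, with hypothetical database, table, and partition names (this is an illustration, not the original application): a local-mode SparkSession with Hive support issues each DDL statement through spark.sql, fanned out with Scala Futures.

    import org.apache.spark.sql.SparkSession
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    object ParallelHiveDdlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("parallel-hive-ddl")
          .master("local[*]")
          .enableHiveSupport()   // requires a reachable Hive metastore
          .getOrCreate()

        // Hypothetical partitions to register on a hypothetical table
        val dates = Seq("2020-01-01", "2020-01-02", "2020-01-03")

        val work = dates.map { dt =>
          Future {
            spark.sql(s"ALTER TABLE mydb.events ADD IF NOT EXISTS PARTITION (dt='$dt')")
          }
        }
        Await.result(Future.sequence(work), 10.minutes)
        spark.stop()
      }
    }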

Re: Performance Issue

2019-01-13 Thread Arnaud LARROQUE
> ... '2018-12-28') JOIN csv_file as g ON g.device_id = re.id and g.advertiser_id = re.advertiser_id LEFT JOIN campaigns as c ON c.campaign_id = re.campaign_id GROUP by 1, 2, 3,

Re: Performance Issue

2019-01-13 Thread Tzahi File
> GROUP by 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21 > Looking forward to any insights. > Thanks. > On Wed, Jan 9,

Re: Performance Issue

2019-01-13 Thread Gourav Sengupta
ign_id > GROUP by 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21 > Looking forward to any insights. > Thanks. > On Wed, Jan 9, 2019 at 8:21 AM Gourav Sengupta

Re: Performance Issue

2019-01-13 Thread Tzahi File
; > Thanks. > On Wed, Jan 9, 2019 at 8:21 AM Gourav Sengupta wrote: > Hi, can you please let us know the Spark version, and the query, and whether the data is in Parquet format or not, and where it is stored?

Re: Performance Issue

2019-01-10 Thread Gourav Sengupta
> the data is in Parquet format or not, and where is it stored? > Regards, Gourav Sengupta > On Wed, Jan 9, 2019 at 1:53 AM 大啊 wrote: > What is your performance issue?

Re: Performance Issue

2019-01-10 Thread Tzahi File
av Sengupta > On Wed, Jan 9, 2019 at 1:53 AM 大啊 wrote: > What is your performance issue? > At 2019-01-08 22:09:24, "Tzahi File" wrote: > Hello, I have some performance issue ru

Re: Performance Issue

2019-01-08 Thread Gourav Sengupta
Hi, Can you please let us know the SPARK version, and the query, and whether the data is in parquet format or not, and where is it stored? Regards, Gourav Sengupta On Wed, Jan 9, 2019 at 1:53 AM 大啊 wrote: > What is your performance issue? > At 2019-01-08

Performance Issue

2019-01-08 Thread Tzahi File
Hello, I have a performance issue running a SQL query on Spark. The query involves one Parquet table partitioned by date, where each partition is about 200 GB, and a simple table with about 100 records. The Spark cluster is of type m5.2xlarge - 8 cores. I'm using the Qubole interface
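The full query is not shown here, but with a ~100-row table joined against a large date-partitioned Parquet table, one common shape of the fix is sketched below under assumed paths and column names (facts, campaigns, and campaign_id are placeholders): broadcast the tiny table and restrict the scan to the needed date partitions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    object BroadcastJoinSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("broadcast-join").getOrCreate()

        // Hypothetical tables: a date-partitioned fact table and a tiny dimension table
        val facts = spark.read.parquet("s3://bucket/events")              // partitioned by `date`
          .where("date >= '2018-12-01' AND date <= '2018-12-28'")         // prunes partitions
        val campaigns = spark.read.parquet("s3://bucket/campaigns")       // ~100 rows

        // broadcast() ships the small table to every executor, avoiding a shuffle of the large side
        val joined = facts.join(broadcast(campaigns), Seq("campaign_id"))

        joined.groupBy("campaign_id").count().show()
        spark.stop()
      }
    }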

Re: Performance issue when running Spark-1.6.1 in yarn-client mode with Hadoop 2.6.0

2017-06-08 Thread Satish John Bosco
7, at 18:02, satishjohn <satish.johnbo...@gmail.com> wrote: > Performance issue / time taken to complete a Spark job on YARN is 4x slower compared to Spark standalone mode. However, in Spark standalone mode jobs often fail with an executor-lost issue.

Re: Performance issue when running Spark-1.6.1 in yarn-client mode with Hadoop 2.6.0

2017-06-06 Thread Jörn Franke
akes sense to have a special scheduling configuration. > On 6. Jun 2017, at 18:02, satishjohn <satish.johnbo...@gmail.com> wrote: > Performance issue / time taken to complete a Spark job on YARN is 4x slower compared to Spark standalone mode. However, in Spark standa

Performance issue when running Spark-1.6.1 in yarn-client mode with Hadoop 2.6.0

2017-06-06 Thread satishjohn
Performance issue: the time taken to complete a Spark job on YARN is 4x slower compared to Spark standalone mode. However, in Spark standalone mode jobs often fail with an executor-lost issue. Hardware configuration: 32 GB RAM, 8 cores (16), and 1 TB HDD per node, 3 nodes (1 master and 2 workers). Spark
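One thing to rule out for a YARN-vs-standalone gap of this size is executor sizing: on older Spark releases, YARN allocates only a small number of small executors unless told otherwise. A hedged sketch with purely illustrative numbers (they would need tuning for the 32 GB / 8-core nodes described above):

    import org.apache.spark.{SparkConf, SparkContext}

    object YarnSizingSketch {
      def main(args: Array[String]): Unit = {
        // Illustrative values only, not a recommendation for this cluster
        val conf = new SparkConf()
          .setAppName("yarn-sizing")
          .set("spark.executor.instances", "4")                 // total executors across the cluster
          .set("spark.executor.cores", "4")                     // cores per executor
          .set("spark.executor.memory", "8g")                   // heap per executor
          .set("spark.yarn.executor.memoryOverhead", "1024")    // off-heap headroom (MB) for YARN

        val sc = new SparkContext(conf)
        // ... job code ...
        sc.stop()
      }
    }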

Re: GroupBy and Spark Performance issue

2017-01-17 Thread Andy Dang
Repartition wouldn't save you from skewed data, unfortunately. The way Spark works now is that it pulls data of the same key to one single partition, and Spark, AFAIK, retains the mapping from key to data in memory. You can use aggregateByKey() or combineByKey() or reduceByKey() to avoid this
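A small sketch of that suggestion, applied to the max-per-group use case from the question below (the (key, value) data is made up): reduceByKey computes partial maxima on each partition before the shuffle, so no single key's values ever have to be gathered into one huge group the way groupByKey would require.

    import org.apache.spark.{SparkConf, SparkContext}

    object MaxPerKeySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("max-per-key").setMaster("local[*]"))

        // Hypothetical (key, value) data
        val data = sc.parallelize(Seq(("a", 3), ("a", 9), ("b", 1), ("b", 7), ("a", 5)))

        // groupByKey would pull every value of a key into one partition first;
        // reduceByKey merges partial maxima map-side and only shuffles the partial results
        val maxPerKey = data.reduceByKey((x, y) => math.max(x, y))

        maxPerKey.collect().foreach(println)   // (a,9), (b,7)
        sc.stop()
      }
    }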

GroupBy and Spark Performance issue

2017-01-16 Thread KhajaAsmath Mohammed
Hi, I am trying to group data in Spark and find the maximum value for each group. I have to use group by as I need to transpose based on the values. I tried repartitioning the data by increasing the number from 1 to 1. The job runs until the stage below and then takes a long time to move ahead. I was

Re: Performance issue with spark ml model to make single predictions on server side

2016-06-24 Thread Nick Pentreath
it in last resort. > Does anyone have some hints to increase performance? > Philippe

Performance issue with spark ml model to make single predictions on server side

2016-06-23 Thread philippe v

Parquet partitioning performance issue

2015-09-13 Thread sonal sharma
Hi Team, We have scheduled jobs that read new records from a MySQL database every hour and write (append) them to Parquet. For each append operation, Spark creates 10 new partitions in the Parquet file. Some of these partitions are fairly small in size (20-40 KB), leading to a high number of smaller

Re: Parquet partitioning performance issue

2015-09-13 Thread Dean Wampler
One general technique is to perform a second pass later over the files, for example the next day or once a week, to concatenate smaller files into larger ones. This can be done for all file types and allows you to make recent data available to analysis tools, while avoiding a large build-up of small
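A minimal sketch of that second compaction pass, with a hypothetical path and an arbitrary target file count: read a day's worth of small Parquet files back, coalesce to a few partitions, and rewrite them (to a sibling directory here, so the directories can be swapped afterwards).

    import org.apache.spark.sql.SparkSession

    object CompactParquetSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("compact-parquet").getOrCreate()

        // Hypothetical layout: one directory per day, full of small append files
        val dayPath = "hdfs:///warehouse/events/date=2015-09-12"

        spark.read.parquet(dayPath)
          .coalesce(4)                       // rewrite into a handful of larger files
          .write
          .mode("overwrite")
          .parquet(dayPath + "_compacted")   // swap this in for the original directory afterwards

        spark.stop()
      }
    }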

Re: [streaming] DStream with window performance issue

2015-09-09 Thread Cody Koeninger
It looked like from your graphs that you had a 10 second batch time, but that your processing time was consistently 11 seconds. If that's correct, then yes your delay is going to keep growing. You'd need to either increase your batch time, or get your processing time down (either by adding more
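For reference, the batch interval is fixed when the StreamingContext is constructed, so "increase your batch time" means changing that constructor argument; a tiny sketch with an illustrative 20-second interval:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object BatchIntervalSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("batch-interval").setMaster("local[2]")
        // If processing reliably takes ~11s, a 10s interval falls behind;
        // either speed up the work or give each batch more time, e.g. 20 seconds:
        val ssc = new StreamingContext(conf, Seconds(20))
        // ... create DStreams and output operations here, then ssc.start() / ssc.awaitTermination()
        ssc.stop()
      }
    }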

Re: [streaming] DStream with window performance issue

2015-09-09 Thread Понькин Алексей
That's correct, I have a 10-second batch. The problem is actually the processing time: it is increasing constantly, no matter how small or large my window duration is. I am trying to prepare some example code to clarify my use case.

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Понькин Алексей
Thank you very much for the great answer! 08.09.2015, 23:53, "Cody Koeninger": > Yeah, that's the general idea. > When you say hard-code the topic name, do you mean Set(topicA, topicB, topicC)? >

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Понькин Алексей
Oh my, I implemented one directStream instead of a union of three, but it is still growing exponentially with the window method. 08.09.2015, 23:53, "Cody Koeninger": > Yeah, that's the general idea. >

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Cody Koeninger
Yeah, that's the general idea. When you say hard-code the topic name, do you mean Set(topicA, topicB, topicC)? You should be able to use a variable for that - read it from a config file, whatever. If you're talking about the match statement, yeah, you'd need to hardcode your cases. On Tue, Sep

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Alexey Ponkin
Ok. Spark 1.4.1 on YARN. Here is my application: I have 4 different Kafka topics (different object streams). type Edge = (String,String); val a = KafkaUtils.createDirectStream[...](sc,"A",params).filter( nonEmpty ).map( toEdge ); val b = KafkaUtils.createDirectStream[...](sc,"B",params).filter(

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Понькин Алексей
The thing is that these topics contain absolutely different Avro objects (Array[Byte]) that I need to deserialize to different Java (Scala) objects, filter, and then map to a tuple (String, String). So I have 3 streams with different Avro objects in them. I need to cast them (using some business

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Cody Koeninger
I'm not 100% sure what's going on there, but why are you doing a union in the first place? If you want multiple topics in a stream, just pass them all in the set of topics to one call to createDirectStream On Tue, Sep 8, 2015 at 10:52 AM, Alexey Ponkin wrote: > Ok. > Spark
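A sketch of that suggestion against the Kafka 0.8 direct API used in Spark 1.4 (broker address and topic names are placeholders): a single createDirectStream call subscribes to all topics, replacing the union of three streams.

    import kafka.serializer.{DefaultDecoder, StringDecoder}
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object MultiTopicDirectStreamSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("multi-topic").setMaster("local[2]")  // local[2] for testing
        val ssc = new StreamingContext(conf, Seconds(10))
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")             // placeholder broker

        // One direct stream over all topics instead of union-ing three separate streams
        val stream = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](
          ssc, kafkaParams, Set("A", "B", "C"))

        stream.count().print()
        ssc.start()
        ssc.awaitTermination()
      }
    }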

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Cody Koeninger
That doesn't really matter. With the direct stream you'll get all objects for a given topic-partition in the same Spark partition. You know what topic it's from via HasOffsetRanges. Then you can deserialize appropriately based on the topic. On Tue, Sep 8, 2015 at 11:16 AM, Понькин Алексей
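Continuing the sketch above, one hedged way to "deserialize based on topic" (decodeA/decodeB/decodeC are made-up stand-ins for the real Avro deserializers): with the direct stream, partition i of each batch RDD corresponds to offsetRanges(i), so the topic is known inside mapPartitionsWithIndex.

    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.streaming.kafka.HasOffsetRanges

    object PerTopicDecodeSketch {
      type Edge = (String, String)

      // Placeholder stand-ins for the real per-topic Avro deserializers (hypothetical)
      def decodeA(bytes: Array[Byte]): Edge = ("a", new String(bytes, "UTF-8"))
      def decodeB(bytes: Array[Byte]): Edge = ("b", new String(bytes, "UTF-8"))
      def decodeC(bytes: Array[Byte]): Edge = ("c", new String(bytes, "UTF-8"))

      def decodeFor(topic: String, bytes: Array[Byte]): Edge = topic match {
        case "A" => decodeA(bytes)
        case "B" => decodeB(bytes)
        case _   => decodeC(bytes)
      }

      // `stream` is the single multi-topic direct stream from the previous sketch
      def toEdges(stream: DStream[(String, Array[Byte])]): DStream[Edge] =
        stream.transform { rdd =>
          val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          rdd.mapPartitionsWithIndex { (i, iter) =>
            val topic = ranges(i).topic        // each RDD partition maps to offsetRanges(i)
            iter.map { case (_, bytes) => decodeFor(topic, bytes) }
          }
        }
    }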

[streaming] DStream with window performance issue

2015-09-08 Thread Alexey Ponkin
Hi, I have an application with 2 streams, which are joined together. Stream1 is a simple DStream (relatively small batch chunks). Stream2 is a windowed DStream (with a duration of, for example, 60 seconds). Stream1 and Stream2 are Kafka direct streams. The problem is that, according to the logs, window

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Cody Koeninger
Can you provide more info (what version of Spark, code example)? On Tue, Sep 8, 2015 at 8:18 AM, Alexey Ponkin wrote: > Hi, > I have an application with 2 streams, which are joined together. > Stream1 is a simple DStream (relatively small batch chunks) > Stream2 is a

Performance issue with Spark join

2015-08-26 Thread lucap
much as well. 6] Tables aren't particularly big; the bigger one should be a few GBs. Regards, Luca

Re: Performance issue with Spark join

2015-08-26 Thread Hemant Bhanawat
, the bigger one should be a few GBs. Regards, Luca

Re: Performance issue with Spark's foreachpartition method

2015-07-27 Thread diplomatic Guru
issue writing to Oracle? In particular, how many commits are you making? If you are issuing a lot of commits, that would be a performance problem. Robin. On 22 Jul 2015, at 19:11, diplomatic Guru diplomaticg...@gmail.com wrote: Hello all, We are having a major performance issue with Spark

Re: Performance issue with Spark's foreachpartition method

2015-07-24 Thread Bagavath
ask is have you determined whether you have a performance issue writing to Oracle? In particular how many commits are you making? If you are issuing a lot of commits that would be a performance problem. Robin On 22 Jul 2015, at 19:11, diplomatic Guru diplomaticg...@gmail.com wrote: Hello

Performance issue with Spark's foreachpartition method

2015-07-22 Thread diplomatic Guru
Hello all, We are having a major performance issue with Spark, which is holding us back from going live. We have a job that carries out computation on log files and writes the results into an Oracle DB. The reducer 'reduceByKey' has been set to a parallelism of 4 as we don't want to establish too

Re: Performance issue with Spark's foreachpartition method

2015-07-22 Thread Robin East
The first question I would ask is: have you determined whether you have a performance issue writing to Oracle? In particular, how many commits are you making? If you are issuing a lot of commits, that would be a performance problem. Robin. On 22 Jul 2015, at 19:11, diplomatic Guru diplomaticg
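To make the commit question concrete, here is a hedged sketch (the JDBC URL, credentials, and table are placeholders, and it assumes the Oracle JDBC driver is on the classpath) of writing from foreachPartition with one batched insert and a single commit per partition, rather than a commit per row.

    import java.sql.DriverManager
    import org.apache.spark.{SparkConf, SparkContext}

    object OracleForeachPartitionSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("oracle-write").setMaster("local[*]"))
        val results = sc.parallelize(Seq(("k1", 10L), ("k2", 20L)))   // stand-in for the reduceByKey output

        results.foreachPartition { rows =>
          // One connection per partition; auto-commit off so we commit once at the end
          val conn = DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/SVC", "user", "pass")
          conn.setAutoCommit(false)
          val stmt = conn.prepareStatement("INSERT INTO results (key, total) VALUES (?, ?)")
          try {
            rows.foreach { case (k, v) =>
              stmt.setString(1, k)
              stmt.setLong(2, v)
              stmt.addBatch()
            }
            stmt.executeBatch()
            conn.commit()        // a single commit per partition, not per row
          } finally {
            stmt.close()
            conn.close()
          }
        }
        sc.stop()
      }
    }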

Re: Performance issue with Spark's foreachpartition method

2015-07-22 Thread diplomatic Guru
, Raj On 22 July 2015 at 20:20, Robin East robin.e...@xense.co.uk wrote: The first question I would ask is have you determined whether you have a performance issue writing to Oracle? In particular how many commits are you making? If you are issuing a lot of commits that would

Re: Spark performance issue

2015-07-03 Thread Silvio Fiorito
Date: Friday, July 3, 2015 at 8:58 AM To: user@spark.apache.org Subject: Spark performance issue. Hello guys, I'm after some advice on Spark performance. I have a MapReduce job that reads inputs, carries out a simple calculation, and writes the results into HDFS. I've

Spark performance issue

2015-07-03 Thread diplomatic Guru
Hello guys, I'm after some advice on Spark performance. I have a MapReduce job that reads inputs, carries out a simple calculation, and writes the results into HDFS. I've implemented the same logic in a Spark job. When I tried both jobs on the same datasets, I got different execution times, which is

Re: is there any significant performance issue converting between rdd and dataframes in pyspark?

2015-07-02 Thread Davies Liu
On Mon, Jun 29, 2015 at 1:27 PM, Axel Dahl a...@whisperstream.com wrote: In pyspark, when I convert from RDDs to DataFrames it looks like the RDD is being materialized/collected/repartitioned before it's converted to a DataFrame. That's not true. When converting an RDD to a DataFrame, it only take

is there any significant performance issue converting between rdd and dataframes in pyspark?

2015-06-29 Thread Axel Dahl
In pyspark, when I convert from RDDs to DataFrames, it looks like the RDD is being materialized/collected/repartitioned before it's converted to a DataFrame. Just wondering if there are any guidelines for doing this conversion and whether it's best to do it early to get the performance benefits of

spark sql insert into table performance issue

2015-06-11 Thread Wangfei (X)
From: Wangfei (X) Sent: June 11, 2015, 17:33 To: user@spark.apache.org Subject: Hi, all. We use Spark SQL to insert data from a text table into a partitioned table and found that if we give more cores to executors the insert performance would be worse.

Re: group by and distinct performance issue

2015-05-19 Thread Akhil Das
Hi Peer, If you open the driver UI (running on port 4040) you can see the stages and the tasks happening inside them. The best way to identify the bottleneck for a stage is to see if there's any time spent on GC, and how many tasks there are per stage (it should be at least the total # of cores to achieve

group by and distinct performance issue

2015-05-19 Thread Peer, Oded
I am running Spark over Cassandra to process a single table. My task reads a single day's worth of data from the table and performs 50 group-by and distinct operations, counting distinct userIds by different grouping keys. My code looks like this: JavaRDD<Row> rdd =
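A sketch of one way to express the distinct-user counts with the DataFrame API rather than the original JavaRDD code (a Parquet source and the column names here are hypothetical stand-ins for the Cassandra table): each grouping key gets one aggregation with countDistinct.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.countDistinct

    object DistinctUsersSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("distinct-users").getOrCreate()

        // Hypothetical events source with a userId column and several candidate grouping keys
        val events = spark.read.parquet("hdfs:///events/2015-05-18")
        val groupingKeys = Seq("country", "deviceType", "campaign")

        // One aggregation per grouping key, each counting distinct userIds
        groupingKeys.foreach { key =>
          events.groupBy(key)
            .agg(countDistinct("userId").as("distinct_users"))
            .show()
        }
        spark.stop()
      }
    }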

Re: group by and distinct performance issue

2015-05-19 Thread Todd Nist
You may want to look at this tooling for helping identify performance issues and bottlenecks: https://github.com/kayousterhout/trace-analysis I believe this is slated to become part of the web ui in the 1.4 release, in fact based on the status of the JIRA,

Re: Spark SQL performance issue.

2015-04-23 Thread ayan guha
Quick questions: why are you caching both the RDD and the table? Which stage of the job is slow? On 23 Apr 2015 17:12, Nikolay Tikhonov tikhonovnico...@gmail.com wrote: Hi, I have a Spark SQL performance issue. My code contains a simple JavaBean: public class Person implements Externalizable

Re: Spark SQL performance issue.

2015-04-23 Thread Nikolay Tikhonov
...@gmail.com: Quick questions: why are you caching both the RDD and the table? Which stage of the job is slow? On 23 Apr 2015 17:12, Nikolay Tikhonov tikhonovnico...@gmail.com wrote: Hi, I have a Spark SQL performance issue. My code contains a simple JavaBean: public class Person implements Externalizable

Re: Spark SQL performance issue.

2015-04-23 Thread Arush Kharbanda
tikhonovnico...@gmail.com wrote: Hi, I have a Spark SQL performance issue. My code contains a simple JavaBean: public class Person implements Externalizable { private int id; private String name; private double salary; } Apply

Re: Performance issue

2015-01-17 Thread TJ Klein
I suspect that putting a function into a shared variable incurs additional overhead? Any suggestions on how to avoid that?

Performance issue

2015-01-16 Thread TJ Klein
Hi, I observed a weird performance issue using Spark in combination with Theano, and I have no real explanation for it. To exemplify the issue I am using the pi.py example of Spark that computes pi. When I modify the function from the example: # unmodified code def f(_): x = random

Re: hdfs read performance issue

2014-08-20 Thread Gurvinder Singh
I got some time to look into it. It appears that Spark (latest git) is doing this operation much more often compared to the Aug 1 version. Here is the log from the operation I am referring to: 14/08/19 12:37:26 INFO spark.CacheManager: Partition rdd_8_414 not found, computing it 14/08/19 12:37:26 INFO

read performance issue

2014-08-14 Thread Gurvinder Singh
Hi, I am running Spark from git directly. I recently compiled the newer Aug 13 version and it has a performance drop of 2-3x in reads from HDFS compared to the git version of Aug 1. So I am wondering which commit could have caused such an issue in read performance. The performance is almost

Spark Performance issue

2014-07-15 Thread Malligarjunan S
Hello all, I am a newbie to Spark, just analyzing the product. I am facing a performance problem with Hive and trying to analyse whether Spark will solve it or not, but it seems that Spark is also taking a lot of time. Let me know if I missed anything. shark> select count(time) from table2; OK 6050 Time

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-11 Thread Zongheng Yang
Hey Jerry, When you ran these queries using different methods, did you see any discrepancy in the returned results (i.e. the counts)? On Thu, Jul 10, 2014 at 5:55 PM, Michael Armbrust mich...@databricks.com wrote: Yeah, sorry. I think you are seeing some weirdness with partitioned tables that

Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Spark users and developers, I'm doing some simple benchmarks with my team and we found a potential performance issue using Hive via Spark SQL. It is very bothersome, so your help in understanding why it is terribly slow is very, very important. First, we have some text files in HDFS which

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
By the way, I also tried hql("select * from m").count. It is terribly slow too. On Thu, Jul 10, 2014 at 5:08 PM, Jerry Lam chiling...@gmail.com wrote: Hi Spark users and developers, I'm doing some simple benchmarks with my team and we found a potential performance issue using Hive via

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Spark users, Also, to put the performance issue into perspective, we also ran the query on Hive. It took about 5 minutes to run. Best Regards, Jerry On Thu, Jul 10, 2014 at 5:10 PM, Jerry Lam chiling...@gmail.com wrote: By the way, I also try hql(select * from m).count. It is terribly

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Michael Armbrust
On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam chiling...@gmail.com wrote: For the curious mind, the dataset is about 200-300 GB and we are using 10 machines for this benchmark. Given the env is equal between the two experiments, why is pure Spark faster than SparkSQL? There is going to be some

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Michael, Yes, the table is partitioned on 1 column. There are 11 columns in the table and they are all String type. I understand that SerDes contribute some overhead, but using pure Hive we could run the query about 5 times faster than SparkSQL. Given that Hive also has the same SerDes

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Michael Armbrust
Yeah, sorry. I think you are seeing some weirdness with partitioned tables that I have also seen elsewhere. I've created a JIRA and assigned someone at databricks to investigate. https://issues.apache.org/jira/browse/SPARK-2443 On Thu, Jul 10, 2014 at 5:33 PM, Jerry Lam chiling...@gmail.com