Dear Spark users,
I'm experiencing an unusual issue with Spark 3.4.x.
When creating a new column as the sum of several existing columns, the time
taken almost doubles as the number of columns increases. This operation
shouldn't require many resources, so I suspect there might be a problem with
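For concreteness, a minimal sketch of the kind of column sum described, assuming a hypothetical DataFrame df and column names:

import org.apache.spark.sql.functions.col

// Build a single expression that adds N existing columns together;
// the cost that grows with N is planning/execution of this one expression.
val colsToSum = (1 to 200).map(i => s"c$i") // hypothetical column names
val total = colsToSum.map(col).reduce(_ + _)
val out = df.withColumn("total", total)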
Hi,
I am trying to run thousands of update-parquet-partition operations on
different Hive tables in parallel from my client application. I am using
Spark SQL in local mode, with Hive enabled, to submit the Hive queries.
Spark is being used in local mode because all the operations we do are
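Roughly, the setup looks like this; a hedged sketch, with hypothetical database, table, and partition names:

import org.apache.spark.sql.SparkSession

// Spark SQL in local mode with Hive support, as described above.
val spark = SparkSession.builder()
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// One of the many per-partition update operations, submitted as a Hive query.
spark.sql("INSERT OVERWRITE TABLE my_db.my_table PARTITION (dt = '2019-01-08') " +
  "SELECT col1, col2 FROM my_db.staging WHERE dt = '2019-01-08'")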
'2018-12-28')
JOIN csv_file as g
ON g.device_id = re.id and g.advertiser_id = re.advertiser_id
LEFT JOIN campaigns as c
ON c.campaign_id = re.campaign_id
GROUP by 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21
Looking forward to any insights.

Thanks.
Hi,
Can you please let us know the Spark version, the query, whether the data
is in Parquet format or not, and where it is stored?
Regards,
Gourav Sengupta
On Wed, Jan 9, 2019 at 1:53 AM 大啊 wrote:
> What is your performance issue?
>
> At 2019-01-08 22:09:24, "Tzahi File" wrote:
Hello,
I have some performance issue running a SQL query on Spark.
The query contains one Parquet table partitioned by date, where each
partition is about 200 GB, and a simple table with about 100 records. The
Spark cluster is of type m5.2xlarge (8 cores). I'm using the Qubole interface
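Not suggested in the thread itself, but for a join of this shape (200 GB partitions against a ~100-row table) one common approach is to broadcast the tiny table; a hedged sketch with hypothetical names:

import org.apache.spark.sql.functions.broadcast

// Ship the ~100-row table to every executor so the large partitioned
// table is never shuffled for the join.
val result = bigPartitioned.join(broadcast(tiny), Seq("campaign_id"))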
makes sense to have a special scheduling configuration.
> On 6. Jun 2017, at 18:02, satishjohn <satish.johnbo...@gmail.com> wrote:
Performance issue: time taken to complete a Spark job on YARN is 4x slower
than in Spark standalone mode. However, in standalone mode jobs often fail
with executor-lost errors.
Hardware configuration
32 GB RAM, 8 cores (16), and 1 TB HDD x 3 (1 Master and 2 Workers)
Spark
Repartitioning wouldn't save you from skewed data, unfortunately. The way
Spark works now is that it pulls data with the same key into one single
partition, and Spark, AFAIK, retains the mapping from key to data in memory.
You can use aggregateByKey(), combineByKey(), or reduceByKey() to avoid
this.
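A minimal sketch of that advice, assuming a hypothetical RDD[(String, Int)] named pairs:

// groupByKey would pull every value of a key into one partition:
//   pairs.groupByKey().mapValues(_.max)
// reduceByKey combines map-side first, so hot keys move far less data:
val maxPerKey = pairs.reduceByKey((a, b) => math.max(a, b))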
Hi,
I am trying to group data in Spark and find the maximum value per group.
I have to use group by, as I need to transpose based on the values.
I tried repartitioning the data by increasing the number from 1 to 1. The job
runs until the stage below and then takes a long time to move ahead. I was
it in last resort.
> Does anyone have any hints to increase performance?
>
> Philippe
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Performance-issue-with-spark-ml-model-to-make-single-predictions-on-server-side-tp27217.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
Hi Team,
We have scheduled jobs that read new records from a MySQL database every hour
and write (append) them to Parquet. Each append operation creates 10 new
partitions in the Parquet file.
Some of these partitions are fairly small (20-40 KB), leading to a high
number of smaller files
One general technique is to perform a second pass later over the files, for
example the next day or once a week, to concatenate the smaller files into
larger ones. This can be done for all file types, and it lets you make recent
data available to analysis tools while avoiding a large build-up of small
files.
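A hedged sketch of such a second pass for the Parquet case above; paths and the target file count are hypothetical:

// Read one day of small files and rewrite them as a few larger ones.
val day = spark.read.parquet("hdfs:///warehouse/events/date=2019-01-08")
day.coalesce(4) // pick a count that yields comfortably large files
  .write.mode("overwrite")
  .parquet("hdfs:///warehouse/events_compacted/date=2019-01-08")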
It looked like from your graphs that you had a 10-second batch time, but
that your processing time was consistently 11 seconds. If that's correct,
then yes, your delay is going to keep growing. You'd need to either
increase your batch time or get your processing time down (either by
adding more
That's correct, I have 10-second batches.
The problem is actually in the processing time: it is increasing constantly,
no matter how small or large my window duration is.
I am trying to prepare some example code to clarify my use case.
Thank you very much for the great answer!
08.09.2015, 23:53, "Cody Koeninger":
> Yeah, that's the general idea.
Oh my. I implemented one directStream instead of a union of three, but it is
still growing exponentially with the window method.
Yeah, that's the general idea.
When you say hard code topic name, do you mean Set(topicA, topicB, topicC)?
You should be able to use a variable for that - read it from a config
file, whatever.
If you're talking about the match statement, yeah, you'd need to hardcode
your cases.
On Tue, Sep
Ok.
Spark 1.4.1 on YARN.
Here is my application:
I have 4 different Kafka topics (different object streams)

type Edge = (String, String)
val a = KafkaUtils.createDirectStream[...](sc, "A", params).filter( nonEmpty ).map( toEdge )
val b = KafkaUtils.createDirectStream[...](sc, "B", params).filter( nonEmpty ).map( toEdge )
The thing is that these topics contain absolutely different Avro
objects (Array[Byte]) that I need to deserialize into different Java (Scala)
objects, filter, and then map to the tuple (String, String). So I have 3
streams with different Avro objects in them. I need to cast them (using some business
I'm not 100% sure what's going on there, but why are you doing a union in
the first place?
If you want multiple topics in a stream, just pass them all in the set of
topics to one call to createDirectStream
On Tue, Sep 8, 2015 at 10:52 AM, Alexey Ponkin wrote:
> Ok.
> Spark
That doesn't really matter. With the direct stream you'll get all objects
for a given topic-partition in the same Spark partition. You know which
topic it's from via hasOffsetRanges, and then you can deserialize
appropriately based on the topic.
On Tue, Sep 8, 2015 at 11:16 AM, Понькин Алексей
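A hedged sketch of that pattern against the Spark 1.4-era Kafka direct stream API; ssc, kafkaParams, and the decodeToEdge helper are assumptions, not from the thread:

import kafka.serializer.DefaultDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// One direct stream over all topics instead of a union of three.
val stream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte],
  DefaultDecoder, DefaultDecoder](ssc, kafkaParams, Set("A", "B", "C"))

val edges = stream.transform { rdd =>
  // Spark partition i of this RDD corresponds to offsetRanges(i).
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.mapPartitionsWithIndex { (i, iter) =>
    val topic = ranges(i).topic
    // decodeToEdge: hypothetical per-topic Avro decode to (String, String).
    iter.map { case (_, bytes) => decodeToEdge(topic, bytes) }
  }
}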
Hi,
I have an application with 2 streams, which are joined together.
Stream1 is a simple DStream (with relatively small batch chunks).
Stream2 is a windowed DStream (with a duration of, for example, 60 seconds).
Stream1 and Stream2 are Kafka direct streams.
The problem is that, according to the logs, the window
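For reference, a minimal sketch of that topology; the stream names and duration are illustrative, and both streams are assumed to be pair DStreams keyed the same way:

import org.apache.spark.streaming.Seconds

// Join each small batch of stream1 against the last 60 seconds of stream2.
val joined = stream1.join(stream2.window(Seconds(60)))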
Can you provide more info (what version of spark, code example)?
On Tue, Sep 8, 2015 at 8:18 AM, Alexey Ponkin wrote:
much as well.
6] Tables aren't particularly big; the bigger one should be a few GBs.
Regards,
Luca
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Performance-issue-with-Spark-join-tp24458.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
Hello all,
We are having a major performance issue with Spark, which is holding us
back from going live.
We have a job that carries out computation on log files and writes the
results into an Oracle DB.
The reducer 'reduceByKey' has been set to a parallelism of 4, as we don't
want to establish too
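Not from the thread itself, but a hedged sketch of the commit pattern Robin asks about in the reply below: batch inside foreachPartition and commit once per partition (the URL, credentials, and SQL are hypothetical, and rdd is assumed to be an RDD[(String, Long)]):

// One connection, one batch, and one commit per partition, rather than a
// commit per record, to keep the number of Oracle commits low.
rdd.foreachPartition { rows =>
  val conn = java.sql.DriverManager.getConnection(jdbcUrl, user, pass)
  conn.setAutoCommit(false)
  val stmt = conn.prepareStatement("INSERT INTO results (k, v) VALUES (?, ?)")
  rows.foreach { case (k, v) =>
    stmt.setString(1, k)
    stmt.setLong(2, v)
    stmt.addBatch()
  }
  stmt.executeBatch()
  conn.commit()
  conn.close()
}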
The first question I would ask is: have you determined whether you have a
performance issue writing to Oracle? In particular, how many commits are you
making? If you are issuing a lot of commits, that would be a performance
problem.
Robin
On 22 Jul 2015, at 19:11, diplomatic Guru diplomaticg...@gmail.com wrote:
Raj
On 22 July 2015 at 20:20, Robin East robin.e...@xense.co.uk wrote:
Date: Friday, July 3, 2015 at 8:58 AM
To: user@spark.apache.org
Subject: Spark performance issue
Hello guys,
I'm after some advice on Spark performance.
I have a MapReduce job that reads inputs, carries out a simple calculation,
and writes the results into HDFS. I've implemented the same logic in a Spark
job. When I tried both jobs on the same datasets, I got different execution
times, which is
On Mon, Jun 29, 2015 at 1:27 PM, Axel Dahl a...@whisperstream.com wrote:
> In pyspark, when I convert from rdds to dataframes it looks like the rdd
> is being materialized/collected/repartitioned before it's converted to a
> dataframe.
It's not true. When converting an RDD to a DataFrame, it only takes
In PySpark, when I convert from RDDs to DataFrames, it looks like the RDD is
being materialized/collected/repartitioned before it's converted to a
DataFrame.
Just wondering if there are any guidelines for doing this conversion, and
whether it's best to do it early to get the performance benefits of
From: Wangfei (X)
Sent: 11 June 2015, 17:33
To: user@spark.apache.org
Subject:
Hi, all
We use Spark SQL to insert data from a text table into a partitioned table,
and found that if we give more cores to the executors, the insert
performance becomes worse.
Hi Peer,
If you open the driver UI (running on port 4040) you can see the stages and
the tasks happening inside them. The best way to identify the bottleneck for
a stage is to see if there's any time being spent on GC, and how many tasks
there are per stage (it should be a number ~ the total # of cores to achieve
I am running Spark over Cassandra to process a single table.
My task reads a single day's worth of data from the table and performs 50
group by and distinct operations, counting distinct userIds by different
grouping keys.
My code looks like this:
JavaRDD<Row> rdd =
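A hedged sketch of one of those ~50 aggregations, assuming a Scala RDD[Row] and hypothetical column positions:

// Count distinct userIds per grouping key without groupByKey:
// de-duplicate (groupingKey, userId) pairs first, then count per key.
val distinctUsersPerKey = rdd
  .map(row => (row.getString(0), row.getString(1))) // (groupingKey, userId)
  .distinct()
  .mapValues(_ => 1L)
  .reduceByKey(_ + _)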
You may want to look at this tooling for helping identify performance
issues and bottlenecks:
https://github.com/kayousterhout/trace-analysis
I believe this is slated to become part of the web UI in the 1.4 release;
in fact, based on the status of the JIRA,
Quick questions: why are you caching both the RDD and the table?
Which stage of the job is slow?
On 23 Apr 2015 17:12, Nikolay Tikhonov tikhonovnico...@gmail.com wrote:
Hi,
I have a Spark SQL performance issue. My code contains a simple JavaBean:
public class Person implements Externalizable {
    private int id;
    private String name;
    private double salary;
}
Apply
I suspect that putting a function into a shared variable incurs additional
overhead. Any suggestions on how to avoid that?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Performance-issue-tp21194p21210.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
Hi,
I observed some weird performance issue using Spark in combination with
Theano, and I have no real explanation for it. To exemplify the issue I am
using the pi.py example of Spark that computes pi.
When I modify the function from the example:

# unmodified code
def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0
I got some time to look into it. It appears that Spark (latest git) is
doing this operation much more often compared to the Aug 1 version. Here
is the log from the operation I am referring to:
14/08/19 12:37:26 INFO spark.CacheManager: Partition rdd_8_414 not
found, computing it
14/08/19 12:37:26 INFO
Hi,
I am running Spark from git directly. I recently compiled the newer
Aug 13 version, and it has a performance drop of 2-3x in reads from
HDFS compared to the Aug 1 git version. So I am wondering which commit
could have caused such an issue in read performance. The performance is
almost
Hello all,
I am a newbie to Spark, just analyzing the product. I am facing a
performance problem with Hive and am trying to analyse whether Spark will
solve it or not, but it seems that Spark is also taking a lot of time. Let
me know if I miss anything.
shark> select count(time) from table2;
OK
6050
Time
Hey Jerry,
When you ran these queries using different methods, did you see any
discrepancy in the returned results (i.e. the counts)?
On Thu, Jul 10, 2014 at 5:55 PM, Michael Armbrust
mich...@databricks.com wrote:
Yeah, sorry. I think you are seeing some weirdness with partitioned tables
that
Hi Spark users and developers,
I'm doing some simple benchmarks with my team and we found a potential
performance issue using Hive via Spark SQL. It is very bothersome, so your
help in understanding why it is so terribly slow is very important.
First, we have some text files in HDFS which
By the way, I also tried hql("select * from m").count. It is terribly slow
too.
On Thu, Jul 10, 2014 at 5:08 PM, Jerry Lam chiling...@gmail.com wrote:
> Hi Spark users and developers,
> I'm doing some simple benchmarks with my team and we found out a potential
> performance issue using Hive via
Hi Spark users,
Also, to put the performance issue into perspective, we also ran the query
on Hive. It took about 5 minutes to run.
Best Regards,
Jerry
On Thu, Jul 10, 2014 at 5:10 PM, Jerry Lam chiling...@gmail.com wrote:
> By the way, I also tried hql("select * from m").count. It is terribly
On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam chiling...@gmail.com wrote:
> For the curious mind, the dataset is about 200-300 GB and we are using 10
> machines for this benchmark. Given that the env is equal between the two
> experiments, why is pure Spark faster than SparkSQL?
There is going to be some
Hi Michael,
Yes, the table is partitioned on 1 column. There are 11 columns in the
table, and they are all String type.
I understand that SerDes contribute some overhead, but using pure Hive we
could run the query about 5 times faster than SparkSQL. Given that Hive
also has the same SerDes
Yeah, sorry. I think you are seeing some weirdness with partitioned tables
that I have also seen elsewhere. I've created a JIRA and assigned someone
at Databricks to investigate.
https://issues.apache.org/jira/browse/SPARK-2443
On Thu, Jul 10, 2014 at 5:33 PM, Jerry Lam chiling...@gmail.com