Thanks to all for the quick replies, they helped a lot. To answer a few of
the follow-up questions ...
> 1. How did you fix this performance issue, which I gather you did programmatically?
The main problem in my original code was that an 'if' check was not being executed when it should have been, resulting in steady performance degradation. After investigation, I found that the check logic was not correct, causing a new combined data frame to build on the lineage of the old data frame RDD that it is replacing. The Spark physical plan of many queries was becoming larger and larger because of 'filter' and 'union' transformations being added to the same data frame. I am not yet very familiar with Spark query plans.
>>> (network IO increases while the query is running). The strange part of this situation is that the number shown in
>>> the Spark UI's shuffle section is very small.
>>> How can I find out the root cause of this problem and solve that?
>>> stackoverflow.com link:
>>> https://stackoverflow.com/questions/66612906/problem-with-data-locality-when-running-spark-query-with-local-nature-on-apache
Hi Ayan,
Thanks for the explanation; I am aware of compression codecs.
How does the locality level get set? Is it done by Spark or YARN?
Please let me know,
Thanks,
Yesh
On Nov 22, 2016 5:13 PM, "ayan guha" wrote:
Hi
RACK_LOCAL = task running on the same rack but not on the same node where the data is.
NODE_LOCAL = task and data are co-located. Probably you were looking for this one?
GZIP - the read goes through the GZIP codec, but because it is non-splittable, you can have at most 1 task reading a gzip file.
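For reference, locality levels are assigned by Spark's task scheduler (not YARN) when it matches tasks to executor slots, and how long it waits for a better level before falling back is configurable. A sketch of the relevant spark-defaults.conf settings (3s is the default, to the best of my knowledge):

```
# How long the scheduler waits for a slot at a given locality level
# before falling back (PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL -> ANY).
spark.locality.wait        3s
spark.locality.wait.node   3s
spark.locality.wait.rack   3s
```

Setting these higher makes Spark wait longer for a node-local slot; setting them to 0 makes it schedule immediately wherever capacity exists.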
Hi Ayan,
we have default rack topology.
-Yeshwanth
Can you Imagine what I would do if I could do all I can - Art of War
On Tue, Nov 22, 2016 at 6:37 AM, ayan guha wrote:
Because snappy is not splittable, a single task makes sense.
Are you sure about the rack topology? I.e., is 225 in a different rack than 227 or 228? What does your topology file say?
On 22 Nov 2016 10:14, "yeshwanth kumar" wrote:
Thanks for your reply,
I can definitely change the underlying compression format,
but I am trying to understand the locality level:
why did the executor run on a different node, where the blocks are not present,
when the locality level is RACK_LOCAL?
Can you shed some light on this?
Thanks,
Yesh
Use ORC, Parquet, or Avro as the format, because they support any compression type
with parallel processing. Alternatively, split your file into several smaller
ones. Another alternative would be bzip2 (but slower in general) or LZO
(usually not included by default in many distributions).
Try changing compression to bzip2 or lzo. For reference -
http://comphadoop.weebly.com
Thanks,
Aniket
On Mon, Nov 21, 2016, 10:18 PM yeshwanth kumar
wrote:
Hi,
we are running Hive on Spark. We have an external table over a snappy-compressed CSV file of size 917.4 MB.
The HDFS block size is set to 256 MB.
As per my understanding, if I run a query over that external table, it should launch 4 tasks, one for each block,
but I am seeing one executor and one task.
Can you print out the SQL execution plan? My guess is about broadcast join.

On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> What is happening here, and is there a way we can optimize the queries in SPARK without the
> obvious hack in Query2?
>
> ---
> ENVIRONMENT:
> ---
>
> Table A has 533 columns x 24 million rows and Table B has 2 columns x 3
> million rows. Both files are single gzipped csv files.
> Both table A and B are external tables in AWS S3, created in HIVE and
> accessed through SPARK using HiveContext.
> EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
> allowMaximumResource allocation and node types are c3.4xlarge).
>
> --
> QUERY1:
> --
> select A.PK, B.FK
> from A
> left outer join B on (A.PK = B.FK)
> where B.FK is not null;
>
> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>
> --
> QUERY 2:
> --
> select A.PK, B.FK
> from (select PK from A) A
> left outer join B on (A.PK = B.FK)
> where B.FK is not null;
>
> This query takes 4.5 mins in SPARK
>
> Regards,
> Gourav Sengupta
Seems like you have "hive.server2.enable.doAs" enabled; you can either
disable it, or configure hs2 so that the user running the service
("hadoop" in your case) can impersonate others.
See:
https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/Superusers.html
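For reference, the impersonation route would look roughly like this in core-site.xml. The user name "hadoop" is taken from the message above; the wildcard values are illustrative and should be narrowed in production:

```xml
<property>
  <name>hadoop.proxyuser.hadoop.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.groups</name>
  <value>*</value>
</property>
```

After changing these, the NameNode (and ResourceManager, for YARN) needs to pick up the new configuration before impersonation takes effect.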
On Fri, Sep 25, 2015, Garry Chen wrote:
Hi All,
I am following
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started?
to set up Hive on Spark. After setup/configuration everything starts up and I am
able to show tables, but when executing a SQL statement within beeline I get an
error ("... exited with code 1"). Please help.
-Original Message-
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: Friday, September 25, 2015 1:12 PM
To: Garry Chen <g...@cornell.edu>
Cc: Jimmy Xiang <jxi...@cloudera.com>; user@spark.apache.org
Subject: Re: hive on spark query error
On Fri, Sep 25, 2015, Garry Chen wrote:
> Error: Master must start with yarn, spark, mesos, or local
What's your setting for spark.master?
On Fri, Sep 25, 2015 at 9:56 AM, Garry Chen
<g...@cornell.edu<mailto:g...@cornell.edu>> wrote:
On Fri, Sep 25, 2015 at 10:05 AM, Garry Chen wrote:
> In spark-defaults.conf the spark.master is spark://hostname:7077. From
> hive-site.xml:
>   spark.master = hostname
>
That's not a valid value for spark.master (as the error indicates).
You should set it to a full master URL, such as the spark://hostname:7077 value from your spark-defaults.conf.
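For illustration, a hive-site.xml entry with a full master URL ("hostname" is a placeholder for your master host; the standalone URL mirrors the spark-defaults.conf value quoted above):

```xml
<property>
  <name>spark.master</name>
  <value>spark://hostname:7077</value>
</property>
```

Values starting with yarn, mesos, or local (e.g. yarn-cluster on that Spark version) pass the same validation.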
On 07-09 12:02:44, Ravisankar Mani <rrav...@gmail.com> wrote:
Hi everyone,
I can't get 'day of year' when using a Spark query. Can you help with any way to
achieve day of year?
Regards,
Ravi