Limit Query Performance Suggestion

2017-01-12 Thread sujith chacko
When a limit is the terminal operator of the physical plan, there is a risk
of a memory bottleneck: if the limit value is very large, the system tries to
gather the limited rows from every partition into a single partition.
Description:
Eg:
create table src_temp as select * from src limit n;
== Physical Plan ==
ExecutedCommand
   +- CreateHiveTableAsSelectCommand [Database:spark}, TableName: t2, InsertIntoHiveTable]
         +- GlobalLimit 2
            +- LocalLimit 2
               +- Project [imei#101, age#102, task#103L, num#104, level#105, productdate#106, name#107, point#108]
                  +- SubqueryAlias hive
                     +- Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108] csv

As the plan above shows, when the limit is the terminal operator there can be
two types of performance bottleneck:
scenario 1: the partition count is very high and the limit value is small
scenario 2: the limit value is very large

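  protected override def doExecute(): RDD[InternalRow] = {
    // Step 1: apply the limit inside every partition.
    val locallyLimited =
      child.execute().mapPartitionsInternal(_.take(limit))
    // Step 2: shuffle all locally limited rows into a single partition.
    val shuffled = new ShuffledRowRDD(
      ShuffleExchange.prepareShuffleDependency(
        locallyLimited, child.output, SinglePartition, serializer))
    // Step 3: apply the limit once more on that single partition.
    shuffled.mapPartitionsInternal(_.take(limit))
  }
}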

As I understand it, the current algorithm first creates a MapPartitionsRDD by
applying the limit within each partition. A ShuffledRowRDD is then created by
gathering the data from all partitions into a single partition. This creates
overhead because every partition can return up to n rows, so the single
partition may receive up to (number of partitions * n) rows, which can be
huge. In both scenarios mentioned above this logic can become a bottleneck.
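
To make the cost concrete, here is a minimal sketch using plain Scala collections (no Spark dependency) that mimics the local-limit-then-single-partition step and counts how many rows would cross the shuffle. The names LimitShuffleCost, shuffledRowCount and globalLimit are hypothetical and only illustrate the arithmetic; they are not Spark APIs, and the partition sizes are made-up numbers.

```scala
object LimitShuffleCost {
  // Rows each partition contributes after the local limit: min(partition size, limit).
  def shuffledRowCount(partitionSizes: Seq[Long], limit: Long): Long =
    partitionSizes.map(size => math.min(size, limit)).sum

  // Mimics the two-step plan: take(limit) per partition, concatenate
  // everything into one "partition", then take(limit) again.
  def globalLimit[T](partitions: Seq[Seq[T]], limit: Int): Seq[T] = {
    val locallyLimited = partitions.map(_.take(limit))
    val singlePartition = locallyLimited.flatten
    singlePartition.take(limit)
  }

  def main(args: Array[String]): Unit = {
    // Scenario 1: many partitions, small limit (assumed sizes).
    val manyPartitions = Seq.fill(10000)(1000000L)
    println(shuffledRowCount(manyPartitions, 10L)) // 100000 rows shuffled to return 10

    // Scenario 2: fewer partitions, very large limit (assumed sizes).
    val fewPartitions = Seq.fill(100)(1000000L)
    println(shuffledRowCount(fewPartitions, 500000L)) // 50000000 rows shuffled
  }
}
```

In both cases the single partition receives far more rows than the final result needs, which is the overhead described above.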

My suggestion for scenario 1 (a large number of partitions and a small limit
value): the driver creates an accumulator and sends it to all partitions, and
each executor updates the accumulator with the number of rows it has fetched.
E.g.: number of partitions = 100, number of cores = 10.
Tasks are launched in groups of 10 (10 groups * 10 tasks = 100). Once the
first group finishes, the driver checks whether the accumulator has reached
the limit value.
If it has, no further tasks are launched on the executors and the result is
returned.
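
The proposal can be sketched as a small simulation in plain Scala. This is only an illustration under stated assumptions: WaveLimit and run are hypothetical names, an AtomicLong stands in for the Spark accumulator, and each "task" is just a local take(limit) on an in-memory sequence. It does not use or describe Spark's scheduler.

```scala
import java.util.concurrent.atomic.AtomicLong

object WaveLimit {
  // Runs one "task" per partition in waves of `cores` tasks. After each wave
  // the driver-side counter is checked; once it has reached `limit`, no
  // further waves are launched. Returns (rows collected, tasks launched).
  def run[T](partitions: Seq[Seq[T]], cores: Int, limit: Int): (Seq[T], Int) = {
    val fetched = new AtomicLong(0L) // stands in for the accumulator
    val collected = Seq.newBuilder[T]
    var launched = 0
    val waves = partitions.grouped(cores)
    while (fetched.get() < limit && waves.hasNext) {
      val wave = waves.next()
      wave.foreach { partition =>
        val rows = partition.take(limit) // local limit inside the task
        fetched.addAndGet(rows.size.toLong)
        collected ++= rows
      }
      launched += wave.size
    }
    // Final trim, since one wave may fetch more rows than the limit.
    (collected.result().take(limit), launched)
  }

  def main(args: Array[String]): Unit = {
    // 100 partitions of 5 rows each, 10 cores, limit 20 (made-up sizes).
    val partitions = Seq.tabulate(100)(p => Seq.tabulate(5)(i => p * 5 + i))
    val (rows, launched) = run(partitions, cores = 10, limit = 20)
    println(s"rows=${rows.size}, tasks launched=$launched of ${partitions.size}")
  }
}
```

With these numbers the first wave already fetches 50 rows, so only 10 of the 100 tasks are launched before the limit of 20 is satisfied.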

Let me know if you have any further suggestions or solutions.

Thanks in advance,
Sujith


Unsubscribe

2017-01-12 Thread anuj ojha
Unsubscribe


Re: [PYSPARK] Python tests organization

2017-01-12 Thread Maciej Szymkiewicz
Thanks Holden. If you have some spare time would you take a look at
https://github.com/apache/spark/pull/16534?

It is somewhat related to
https://issues.apache.org/jira/browse/SPARK-18777 (Return UDF objects
when registering from Python).


On 01/12/2017 07:34 PM, Holden Karau wrote:
> I'd be happy to help with reviewing Python test improvements. Maybe
> make an umbrella JIRA and do one sub-component at a time?
>
> On Thu, Jan 12, 2017 at 12:20 PM Saikat Kanjilal wrote:
>
> Following up, any thoughts on next steps for this?
>
> *From:* Maciej Szymkiewicz
> *Sent:* Wednesday, January 11, 2017 10:14 AM
> *To:* Saikat Kanjilal
> *Subject:* Re: [PYSPARK] Python tests organization
>
> Not yet, I want to see if there is any consensus about it. It is a
> lot of tedious work and it would be a shame if someone started
> working on this just to get it dropped.
>
> On 01/11/2017 06:44 PM, Saikat Kanjilal wrote:
>> Hello Maciej,
>>
>> If there's a jira available for this I'd like to help get this
>> moving, let me know next steps.
>>
>> Thanks in advance.
>>
>> *From:* Maciej Szymkiewicz
>> *Sent:* Wednesday, January 11, 2017 4:18 AM
>> *To:* dev@spark.apache.org
>> *Subject:* [PYSPARK] Python tests organization
>>
>> Hi,
>>
>> I can't help but wonder if there is any practical reason for keeping
>> monolithic test modules. These things are already pretty large
>> (1500 - 2200 LOCs) and can only grow. Development aside, I assume that
>> many users use tests the same way as me, to check the intended
>> behavior, and largish loosely coupled modules make it harder than it
>> should be.
>>
>> If there's no rationale for that it could be a good time to start
>> thinking about moving tests to packages and separating into modules
>> reflecting project structure.
>>
>> --
>> Best,
>> Maciej
>>
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
> Maciej Szymkiewicz

-- 
Maciej Szymkiewicz



Re: [PYSPARK] Python tests organization

2017-01-12 Thread Maciej Szymkiewicz
Sounds good, but it looks like JIRA is still down.


Personally I can look at sql.tests and see what can be done there.
Depending on the resolution of
https://issues.apache.org/jira/browse/SPARK-19160 I may have to adjust
some tests anyway.


On 01/12/2017 07:36 PM, Saikat Kanjilal wrote:
>
> Maciej? LGTM, what do you think?  I can create a JIRA and drive this.
>
> *From:* Holden Karau
> *Sent:* Thursday, January 12, 2017 10:34 AM
> *To:* Saikat Kanjilal; dev@spark.apache.org
> *Subject:* Re: [PYSPARK] Python tests organization
>
> I'd be happy to help with reviewing Python test improvements. Maybe
> make an umbrella JIRA and do one sub-component at a time?
>
> On Thu, Jan 12, 2017 at 12:20 PM Saikat Kanjilal wrote:
>
> Following up, any thoughts on next steps for this?
>
> *From:* Maciej Szymkiewicz
> *Sent:* Wednesday, January 11, 2017 10:14 AM
> *To:* Saikat Kanjilal
> *Subject:* Re: [PYSPARK] Python tests organization
>
> Not yet, I want to see if there is any consensus about it. It is a
> lot of tedious work and it would be a shame if someone started
> working on this just to get it dropped.
>
> On 01/11/2017 06:44 PM, Saikat Kanjilal wrote:
>> Hello Maciej,
>>
>> If there's a jira available for this I'd like to help get this
>> moving, let me know next steps.
>>
>> Thanks in advance.
>>
>> *From:* Maciej Szymkiewicz
>> *Sent:* Wednesday, January 11, 2017 4:18 AM
>> *To:* dev@spark.apache.org
>> *Subject:* [PYSPARK] Python tests organization
>>
>> Hi,
>>
>> I can't help but wonder if there is any practical reason for keeping
>> monolithic test modules. These things are already pretty large
>> (1500 - 2200 LOCs) and can only grow. Development aside, I assume that
>> many users use tests the same way as me, to check the intended
>> behavior, and largish loosely coupled modules make it harder than it
>> should be.
>>
>> If there's no rationale for that it could be a good time to start
>> thinking about moving tests to packages and separating into modules
>> reflecting project structure.
>>
>> --
>> Best,
>> Maciej
>
> --
> Maciej Szymkiewicz

-- 
Maciej Szymkiewicz



Re: [PYSPARK] Python tests organization

2017-01-12 Thread Saikat Kanjilal
Maciej? LGTM, what do you think?  I can create a JIRA and drive this.



From: Holden Karau
Sent: Thursday, January 12, 2017 10:34 AM
To: Saikat Kanjilal; dev@spark.apache.org
Subject: Re: [PYSPARK] Python tests organization

I'd be happy to help with reviewing Python test improvements. Maybe make an
umbrella JIRA and do one sub-component at a time?

On Thu, Jan 12, 2017 at 12:20 PM Saikat Kanjilal wrote:

Following up, any thoughts on next steps for this?

From: Maciej Szymkiewicz
Sent: Wednesday, January 11, 2017 10:14 AM
To: Saikat Kanjilal
Subject: Re: [PYSPARK] Python tests organization

Not yet, I want to see if there is any consensus about it. It is a lot of
tedious work and it would be a shame if someone started working on this just
to get it dropped.

On 01/11/2017 06:44 PM, Saikat Kanjilal wrote:

Hello Maciej,

If there's a jira available for this I'd like to help get this moving, let me
know next steps.

Thanks in advance.

From: Maciej Szymkiewicz
Sent: Wednesday, January 11, 2017 4:18 AM
To: dev@spark.apache.org
Subject: [PYSPARK] Python tests organization

Hi,

I can't help but wonder if there is any practical reason for keeping
monolithic test modules. These things are already pretty large (1500 -
2200 LOCs) and can only grow. Development aside, I assume that many
users use tests the same way as me, to check the intended behavior, and
largish loosely coupled modules make it harder than it should be.

If there's no rationale for that it could be a good time to start thinking
about moving tests to packages and separating into modules reflecting
project structure.

--
Best,
Maciej

--
Maciej Szymkiewicz




Re: [PYSPARK] Python tests organization

2017-01-12 Thread Holden Karau
I'd be happy to help with reviewing Python test improvements. Maybe make an
umbrella JIRA and do one sub-component at a time?

On Thu, Jan 12, 2017 at 12:20 PM Saikat Kanjilal wrote:

> Following up, any thoughts on next steps for this?
>
> *From:* Maciej Szymkiewicz
> *Sent:* Wednesday, January 11, 2017 10:14 AM
> *To:* Saikat Kanjilal
> *Subject:* Re: [PYSPARK] Python tests organization
>
> Not yet, I want to see if there is any consensus about it. It is a lot of
> tedious work and it would be a shame if someone started working on this
> just to get it dropped.
>
> On 01/11/2017 06:44 PM, Saikat Kanjilal wrote:
>
> Hello Maciej,
>
> If there's a jira available for this I'd like to help get this moving, let
> me know next steps.
>
> Thanks in advance.
>
> *From:* Maciej Szymkiewicz
> *Sent:* Wednesday, January 11, 2017 4:18 AM
> *To:* dev@spark.apache.org
> *Subject:* [PYSPARK] Python tests organization
>
> Hi,
>
> I can't help but wonder if there is any practical reason for keeping
> monolithic test modules. These things are already pretty large (1500 -
> 2200 LOCs) and can only grow. Development aside, I assume that many
> users use tests the same way as me, to check the intended behavior, and
> largish loosely coupled modules make it harder than it should be.
>
> If there's no rationale for that it could be a good time to start thinking
> about moving tests to packages and separating into modules reflecting
> project structure.
>
> --
> Best,
> Maciej
>
> --
> Maciej Szymkiewicz


Re: [PYSPARK] Python tests organization

2017-01-12 Thread Saikat Kanjilal
Following up, any thoughts on next steps for this?



From: Maciej Szymkiewicz 
Sent: Wednesday, January 11, 2017 10:14 AM
To: Saikat Kanjilal
Subject: Re: [PYSPARK] Python tests organization


Not yet, I want to see if there is any consensus about it. It is a lot of
tedious work and it would be a shame if someone started working on this just
to get it dropped.

On 01/11/2017 06:44 PM, Saikat Kanjilal wrote:

Hello Maciej,

If there's a jira available for this I'd like to help get this moving, let me 
know next steps.

Thanks in advance.



From: Maciej Szymkiewicz 
Sent: Wednesday, January 11, 2017 4:18 AM
To: dev@spark.apache.org
Subject: [PYSPARK] Python tests organization

Hi,

I can't help but wonder if there is any practical reason for keeping
monolithic test modules. These things are already pretty large (1500 -
2200 LOCs) and can only grow. Development aside, I assume that many
users use tests the same way as me, to check the intended behavior, and
largish loosely coupled modules make it harder than it should be.

If there's no rationale for that it could be a good time to start thinking
about moving tests to packages and separating into modules reflecting
project structure.

--
Best,
Maciej





--
Maciej Szymkiewicz


Unsubscribe

2017-01-12 Thread williamtellme123
Please unsubscribe me

 

From: Varanasi, Venkata [mailto:venkata.varan...@bankofamerica.com] 
Sent: Thursday, April 28, 2016 1:35 PM
To: dev@spark.apache.org
Subject: Unsubscribe

 

 




FOSDEM 2017 Open Source Conference - Brussels

2017-01-12 Thread Sharan F

Hello Everyone

This email is to tell you about ASF participation at FOSDEM. The event
will be held in Brussels on 4th & 5th February 2017 and we are hoping
that many people from our ASF projects will be there.


https://fosdem.org/2017/

Attending FOSDEM is completely free and the ASF will again be running a
booth there. Our main focus will be on talking to people about the ASF, our
projects and communities.


*_Why Attend FOSDEM?_*
Some reasons for attending FOSDEM are:

1. Promoting your project: FOSDEM has 4,000 to 5,000 attendees so is a
   great place to spread the word about your project
2. Learning, participating and meeting up: FOSDEM is a developers
   conference so includes presentations covering a range of
   technologies and includes lots of topic specific devrooms

_*FOSDEM Wiki*_
A page on the Community Development wiki has been created with the main
details about our involvement at the conference, so please take a look


https://cwiki.apache.org/confluence/display/COMDEV/FOSDEM+2017

If you would like to spend some time on the ASF booth promoting your 
project then please sign up on the FOSDEM wiki page. Initially we would 
like to split this into slots of 3-4 hours but this will depend on the 
number of projects that are represented.


We are also looking for volunteers to help out on the booth over the 2 
days of the conference, so if you are going to be there and are willing 
to help then please add your name to the volunteer list.


_*Project Stickers*_
If you are going to be at FOSDEM and do not have any project stickers to 
give away then we may (budget permitting) be able to help you get some 
printed. Please contact me with your requirements.


_*Social Event*_
Some people have asked about organising an ASF social event / meetup
during the conference. This is possible but we will need to know how many
people are interested and which date works best. The FOSDEM wiki page
also contains an 'Arrival / Departure' section so please add your
details if you would like to participate.


I hope this helps people see some of the advantages of attending FOSDEM 
and we are looking forward to seeing lots of people there from our ASF 
communities.


Thanks
Sharan

Apache Community Development
http://community.apache.org/