Re: Welcoming some new committers and PMC members

2019-09-09 Thread sujith chacko
Congratulations all.

On Tue, 10 Sep 2019 at 7:27 AM, Haibo  wrote:

> congratulations~
>
>
>
> On 2019-09-10 at 09:30, Joseph Torres
>  wrote:
>
> congratulations!
>
> On Mon, Sep 9, 2019 at 6:27 PM 王 斐  wrote:
>
>> congratulations!
>>
>> Get Outlook for iOS 
>>
>> --
>> *From:* Ye Xianjin 
>> *Sent:* Tuesday, September 10, 2019 09:26
>> *To:* Jeff Zhang
>> *Cc:* Saisai Shao; dev
>> *Subject:* Re: Welcoming some new committers and PMC members
>>
>> Congratulations!
>>
>> Sent from my iPhone
>>
>> On Sep 10, 2019, at 9:19 AM, Jeff Zhang  wrote:
>>
>> Congratulations!
>>
>> On Tue, Sep 10, 2019 at 9:16 AM, Saisai Shao  wrote:
>>
>>> Congratulations!
>>>
>>> On Mon, Sep 9, 2019 at 6:11 PM, Jungtaek Lim  wrote:
>>>
 Congratulations! Well deserved!

 On Tue, Sep 10, 2019 at 9:51 AM John Zhuge  wrote:

> Congratulations!
>
> On Mon, Sep 9, 2019 at 5:45 PM Shane Knapp 
> wrote:
>
>> congrats everyone!  :)
>>
>> On Mon, Sep 9, 2019 at 5:32 PM Matei Zaharia 
>> wrote:
>> >
>> > Hi all,
>> >
>> > The Spark PMC recently voted to add several new committers and one
>> PMC member. Join me in welcoming them to their new roles!
>> >
>> > New PMC member: Dongjoon Hyun
>> >
>> > New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming
>> Wang, Weichen Xu, Ruifeng Zheng
>> >
>> > The new committers cover lots of important areas including ML, SQL,
>> and data sources, so it’s great to have them here. All the best,
>> >
>> > Matei and the Spark PMC
>> >
>> >
>> >
>> >
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>>
>>
>
> --
> John Zhuge
>


 --
 Name : Jungtaek Lim
 Blog : http://medium.com/@heartsavior
 Twitter : http://twitter.com/heartsavior
 LinkedIn : http://www.linkedin.com/in/heartsavior

>>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>>


Re: Support SqlStreaming in spark

2019-02-10 Thread sujith chacko
Hi All,

 A few more updates have been added to the design document since the last
version, which a few folks reviewed and provided input on. I request all
experts to review the design document and help us baseline the design for
the SPIP 'Support SQL streaming' in Spark Structured Streaming. A few more
sections have been added in order to handle the scenarios below:

1) Passing the stream-level configurations to the SQL command instead of
setting them at the session/application level (see the sketch after this
list).

2) Supporting multiple streams in a single application, etc.
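
For illustration, a minimal sketch of the difference in configuration scope.
The session-level call is today's public API; the per-statement SQL form in
the comment is purely illustrative of the direction discussed in the SPIP,
and its syntax and the table names (kafka_source, kafka_sink) are
assumptions, not the final design:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sqlstreaming-sketch").getOrCreate()

    // Today: stream-level settings are set on the session, so they apply
    // to every stream started in this application.
    spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/checkpoints")

    // Direction sketched in the SPIP (illustrative syntax only): carry such
    // settings in the SQL statement itself, so two streams in one
    // application can use different values, e.g. something like:
    //   INSERT INTO kafka_sink
    //   SELECT ... FROM kafka_source
    //   OPTIONS('checkpointLocation' = '/tmp/cp1')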

Link to the design document

https://docs.google.com/document/d/19degwnIIcuMSELv6BQ_1VQI5AIVcvGeqOm5xE2-aRA0/edit#


A few questions have already been clarified by Jacky; please see the link below:

https://docs.google.com/document/d/19degwnIIcuMSELv6BQ_1VQI5AIVcvGeqOm5xE2-aRA0/edit#heading=h.t96f9l205fk1


Regards,
Sujith

On Thu, Dec 27, 2018 at 6:39 PM JackyLee  wrote:

> Hi, Wenchen
>
> Thank you for your recognition of streaming on SQL. I have written the
> SQLStreaming design document:
>
> https://docs.google.com/document/d/19degwnIIcuMSELv6BQ_1VQI5AIVcvGeqOm5xE2-aRA0/edit#
>
> Your Questions are answered in here:
>
> https://docs.google.com/document/d/19degwnIIcuMSELv6BQ_1VQI5AIVcvGeqOm5xE2-aRA0/edit#heading=h.t96f9l205fk1
>
> There may be some details that I have not considered; we can discuss them
> in more depth.
>
> Thanks
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
>
>


Re: scheduler braindump: architecture, gotchas, etc.

2019-02-04 Thread sujith chacko
Thanks, Li and Imran, for providing us with an overview of one of the most
complex modules in Spark. Excellent sharing.

Regards
Sujith.

On Mon, 4 Feb 2019 at 10:54 PM, Xiao Li  wrote:

> Thank you, Imran!
>
> Also, I attached the slides of "Deep Dive: Scheduler of Apache Spark".
>
> Cheers,
>
> Xiao
>
>
>
> On Mon, Feb 4, 2019 at 8:59 AM, John Zhuge  wrote:
>
>> Thanks Imran!
>>
>> On Mon, Feb 4, 2019 at 8:42 AM Imran Rashid 
>> wrote:
>>
>>> The scheduler has been pretty error-prone and hard to work on, and I
>>> feel like there may be a dwindling core of active experts.  I'm sure it's
>>> very discouraging to folks trying to make what seem like simple changes,
>>> and then find they are in a rat's nest of complex issues they weren't
>>> expecting.  But for those who are still trying, THANK YOU!  More
>>> involvement and more folks becoming experts is definitely needed.
>>>
>>> I put together a doc going over the architecture of the scheduler, and
>>> things I've seen us get bitten by in the past.  It's sort of a brain dump,
>>> but I'm hopeful it'll help orient new folks to the scheduler.  I also hope
>>> more experts will chime in -- there are places in the doc I know I've
>>> missed things, and called that out, but there are probably even more that
>>> should be discussed, and mistakes I've made.  All input welcome.
>>>
>>>
>>> https://docs.google.com/document/d/1oiE21t-8gXLXk5evo-t-BXpO5Hdcob5D-Ps40hogsp8/edit?usp=sharing
>>>
>>
>>
>> --
>> John Zhuge
>>
>


Re: welcome a new batch of committers

2018-10-03 Thread sujith chacko
Great news! Congrats to all for achieving the feat!

On Wed, 3 Oct 2018 at 2:29 PM, Reynold Xin  wrote:

> Hi all,
>
> The Apache Spark PMC has recently voted to add several new committers to
> the project, for their contributions:
>
> - Shane Knapp (contributor to infra)
> - Dongjoon Hyun (contributor to ORC support and other parts of Spark)
> - Kazuaki Ishizaki (contributor to Spark SQL)
> - Xingbo Jiang (contributor to Spark Core and SQL)
> - Yinan Li (contributor to Spark on Kubernetes)
> - Takeshi Yamamuro (contributor to Spark SQL)
>
> Please join me in welcoming them!
>
>


Re: Welcome Zhenhua Wang as a Spark committer

2018-04-01 Thread sujith chacko
Congratulations, Zhenhua, on this great achievement.

On Mon, 2 Apr 2018 at 11:05 AM, Denny Lee  wrote:

> Awesome - congrats Zhenhua!
>
> On Sun, Apr 1, 2018 at 10:33 PM 叶先进  wrote:
>
>> Big congrats.
>>
>> > On Apr 2, 2018, at 1:28 PM, Wenchen Fan  wrote:
>> >
>> > Hi all,
>> >
>> > The Spark PMC recently added Zhenhua Wang as a committer on the
>> project. Zhenhua is the major contributor of the CBO project, and has been
>> contributing across several areas of Spark for a while, focusing especially
>> on the analyzer and optimizer in Spark SQL. Please join me in welcoming Zhenhua!
>> >
>> > Wenchen
>>
>>
>>
>>


Re: Regarding NimbusDS JOSE JWT jar 3.9 security vulnerability

2018-02-14 Thread sujith chacko
Hi Steve,

 While building Spark 2.1, this particular JWT jar is pulled in as a
transitive dependency of the hadoop-auth 2.7.2 module. I discussed this
with one of the Hadoop PMC members, who will analyse the impact of this
particular issue on Hadoop. Once I get more information, I will update
you about this.
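
In the meantime, a minimal sketch of how one could exclude and override the
transitive artifact in an sbt build. The coordinates
(org.apache.hadoop:hadoop-auth, com.nimbusds:nimbus-jose-jwt) are the usual
ones, but the replacement version below is illustrative only -- verify it
against the actual advisory:

    // build.sbt sketch: drop the JWT jar pulled in via hadoop-auth,
    // then pin a newer version explicitly.
    libraryDependencies ++= Seq(
      "org.apache.hadoop" % "hadoop-auth" % "2.7.2"
        exclude("com.nimbusds", "nimbus-jose-jwt"),
      "com.nimbusds" % "nimbus-jose-jwt" % "4.41.1" // illustrative version
    )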

Thanks,
Sujith

On Wed, 14 Feb 2018 at 07 PM, Steve Loughran  wrote:

> might be coming in transitively
>
> https://issues.apache.org/jira/browse/HADOOP-14799
>
> On 13 Feb 2018, at 18:18, PJ Fanning  wrote:
>
> Hi Sujith,
> I didn't find the nimbusds dependency in any spark 2.2 jars. Maybe I missed
> something. Could you tell us which spark jar has the nimbusds dependency?
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> 
>
>


Re: sessionState could not be accessed in spark-shell command line

2017-09-07 Thread sujith chacko
Hi,
may I know which version of Spark you are using? In 2.2 I tried the
below query in spark-shell for viewing the logical plan and it's working
fine:

spark.sql("explain extended select * from table1")

You can use the above query to see the logical plan.
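
If you want the plan objects themselves rather than text output, here is a
minimal sketch using public Dataset APIs, which are accessible from
spark-shell (unlike the private sessionState); 'table1' is a placeholder
table name:

    val df = spark.sql("select * from table1")

    // Prints parsed, analyzed, optimized, and physical plans to the console.
    df.explain(true)

    // queryExecution is accessible on Dataset from spark-shell; each of
    // these returns a plan object you can inspect programmatically.
    val parsed    = df.queryExecution.logical
    val analyzed  = df.queryExecution.analyzed
    val optimized = df.queryExecution.optimizedPlan
    val physical  = df.queryExecution.executedPlan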

Thanks,
Sujith

On Thu, 7 Sep 2017 at 11:03 AM, ChenJun Zou  wrote:

> Hi,
>
> when I use spark-shell to get the logical plan of  sql, an error occurs
>
> scala> spark.sessionState
> :30: error: lazy value sessionState in class SparkSession cannot
> be accessed in org.apache.spark.sql.SparkSession
>spark.sessionState
>  ^
>
> But if I use spark-submit to access the "sessionState" variable, it's OK.
>
> Is there a way to access it in spark-shell?
>


Limit Query Performance Suggestion

2017-01-12 Thread sujith chacko
When a limit is added at the terminal of the physical plan, there is a
possibility of a memory bottleneck if the limit value is too large, since
the system will try to aggregate all the per-partition limited results
into a single partition.
Description:
E.g.:

create table src_temp as select * from src limit n;

== Physical Plan ==
ExecutedCommand
+- CreateHiveTableAsSelectCommand [Database: spark, TableName: t2, InsertIntoHiveTable]
   +- GlobalLimit 2
      +- LocalLimit 2
         +- Project [imei#101, age#102, task#103L, num#104, level#105, productdate#106, name#107, point#108]
            +- SubqueryAlias hive
               +- Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108] csv

As shown in the above plan, when the limit comes at the terminal there can
be two types of performance bottlenecks:

Scenario 1: the partition count is very high and the limit value is small.
Scenario 2: the limit value is very large.

 protected override def doExecute(): RDD[InternalRow] = {
   // Apply the limit locally within each partition first.
   val locallyLimited =
     child.execute().mapPartitionsInternal(_.take(limit))
   // Shuffle every locally-limited partition into a single partition,
   // then apply the limit once more on the merged data.
   val shuffled = new ShuffledRowRDD(
     ShuffleExchange.prepareShuffleDependency(
       locallyLimited, child.output, SinglePartition, serializer))
   shuffled.mapPartitionsInternal(_.take(limit))
 }

As per my understanding, the current algorithm first creates a
MapPartitionsRDD by applying the limit within each partition; then a
ShuffledRowRDD is created by grouping the data from all partitions into a
single partition. This creates overhead, since every partition returns up
to n rows, so the grouping step may have to pull N partitions * n rows,
which can be very large. In both scenarios mentioned above, this logic can
be a bottleneck.

My suggestion for handling scenario 1, where the number of partitions is
large and the limit value is small: the driver can create an accumulator
and share it with all partitions, and each executor updates the
accumulator based on the data it has fetched.
E.g.: number of partitions = 100, number of cores = 10.
Tasks will be launched in groups of 10 (10 * 10 = 100). Once the first
group finishes its tasks, the driver checks whether the accumulator value
has reached the limit value; if it has, no further tasks are launched to
the executors and the result is returned.
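
For illustration, a minimal driver-side sketch of this early-stop idea on a
plain RDD, using a hypothetical helper named incrementalTake (the wave size
plays the role of the core count above). It scans partitions in waves and
stops launching tasks once the limit is satisfied, similar in spirit to how
RDD.take scans partitions incrementally:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Hypothetical sketch: gather up to `limit` rows by scanning partitions
    // in waves of `wave` tasks, stopping as soon as enough rows have been
    // collected, instead of shuffling every partition's local limit into a
    // single partition.
    def incrementalTake[T: ClassTag](rdd: RDD[T], limit: Int, wave: Int): Array[T] = {
      val sc = rdd.sparkContext
      val buf = scala.collection.mutable.ArrayBuffer.empty[T]
      var start = 0
      while (buf.size < limit && start < rdd.getNumPartitions) {
        val parts = start until math.min(start + wave, rdd.getNumPartitions)
        val remaining = limit - buf.size  // rows still needed in this wave
        // Launch one wave of tasks; each returns at most `remaining` rows.
        val results = sc.runJob(rdd, (it: Iterator[T]) => it.take(remaining).toArray, parts)
        results.foreach(buf ++= _)
        start += wave
      }
      buf.take(limit).toArray
    }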

Let me know if you have any further suggestions or solutions.

Thanks in advance,
Sujith