Re: Migration of Hadoop labelled nodes to new dedicated Master

2020-04-29 Thread Gavin McDonald
Hi All,

Following on from the email below, which I sent *11 DAYS ago now*, so far we have
had *one reply* from Mahout cc'd to me (thank you Trevor), and *ONE
PERSON* has signed up to the new hadoop-migrati...@infra.apache.org -
that is out of a total of *OVER 7000* people signed up to the 13 mailing
lists emailed.

To recap what I asked for:

"...What I would like from each community, is to decide who is going to
help with their project in performing these migrations - ideally 2 or 3
folks who use the current builds.a.o regularly. Those folks should then
subscribe to the new dedicated hadoop-migrati...@infra.apache.org mailing
lists as soon as possible so we can get started..."

This will be the last email I send to your dev list directly. I am now building
a new Jenkins Master, and as soon as it is ready I will start to migrate
the Jenkins Nodes/Agents over to the new system.
And when I am done, the existing builds.apache.org *WILL BE TURNED OFF*.

I am now going to continue all conversations on the
hadoop-migrati...@infra.apache.org list *only.*

Thanks

Gavin McDonald (ASF Infra)


On Sat, Apr 18, 2020 at 4:21 PM Gavin McDonald  wrote:

> Hi All,
>
> A couple of months ago, I wrote to a few projects' private lists mentioning
> the need to migrate Hadoop labelled nodes (H0-H21) over to a new dedicated
> Jenkins Master [1] (a Cloudbees Client Master).
>
> I'd like to revisit this now that I have more time to dedicate to getting
> this done. However, keeping track of separate conversations that spring up
> across multiple mailing lists and in various other places is cumbersome and
> not realistic. To that end, I have created a new mailing list specifically
> dedicated to migrating these nodes, and the projects that use them, over to
> the new system.
>
> The mailing list 'hadoop-migrati...@infra.apache.org' is up and running
> now (and this will be the first post to it). Previous discussions were on
> the private PMC lists (there was some debate about that, but I wanted the
> PMCs to be aware of the change initially); this new list is public and
> archived.
>
> This email is BCC'd to 13 projects' dev lists [2], determined by the
> https://hadoop.apache.org list of related projects, minus Cassandra, who
> already have their own dedicated client master [3], and I added Yetus as I
> think they cross-collaborate with many Hadoop-based projects. If anyone
> thinks a project is missing, or should not be on the list, let me know.
>
> What I would like from each community is to decide who is going to help
> with their project in performing these migrations - ideally 2 or 3 folks
> who use the current builds.a.o regularly. Those folks should then subscribe
> to the new dedicated hadoop-migrati...@infra.apache.org mailing list as
> soon as possible so we can get started.
>
> About the current setup - and I hope this answers previously asked
> questions on private lists - the new dedicated master is a Cloudbees Client
> Master 2.204.3.7-rolling. It is not the same setup as the current Jenkins
> master on builds.a.o - it is not intended to be. It is more or less a
> 'clean install' in that I have not installed the 500-plus plugins that are
> on builds.a.o; I would rather we install plugins as we find we need
> them. So yes, there may be some features missing - the point of having
> people sign up to the new list is to find out what those are, get them
> installed, and get your builds to at least the same state they are in
> currently.
>
> We have 2 nodes on there currently for testing; as things progress we can
> transfer over a couple more, and projects can start to migrate their jobs
> over whenever they are happy, until done. We also need to test auth - the
> master and its nodes will be restricted to just Hadoop and related projects
> (which is why it is important that this list of related projects is
> correct). No longer will other projects be able to hop onto Hadoop nodes,
> and no longer will Hadoop-related projects be able to hop onto other
> folks' nodes. This is a good thing, and may encourage some providers to
> donate a few more VMs for dedicated use.
>
> For now then, decide who will help with this process, sign up to the
> new mailing list, and let's get started!
>
> Note I am NOT subscribed to any of your dev lists, so please cc the new
> list on replies, and I will await your presence there to get started.
>
> Thanks all.
>
> Gavin McDonald (ASF Infra)
>
> [1] - https://ci-hadoop.apache.org
> [2] -
> hadoop,chukwa,avro,ambari,hbase,hive,mahout,pig,spark,submarine,tez,zookeeper,yetus
> [3] - https://ci-cassandra.apache.org
>


Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-29 Thread Hyukjin Kwon
I am not seeing an explicit objection here; rather, people seem to agree
with the proposal in general.
I would like to step forward rather than leaving it as a deadlock - the
worst choice here is to postpone and abandon this discussion with this
inconsistency.

I don't currently plan to document this, as the cases are rather rare, and
we haven't really documented the JavaRDD <> RDD vs DataFrame case either.
Let's keep monitoring and see whether this discussion thread clarifies things
enough in the cases I mentioned.

Let me know if you guys think differently.


On Tue, Apr 28, 2020 at 5:03 PM, Hyukjin Kwon wrote:

> Spark has targeted to have a unified API set rather than having separate
> Java classes to reduce the maintenance cost,
> e.g.) JavaRDD <> RDD vs DataFrame. These JavaXXX are more about the legacy.
>
> I think it's best to stick to approach 4 in general cases.
> Other options might have to be considered based upon a specific context.
> For example, if we *must* add a bunch of Java-specifics
> into a specific class for an inevitable reason somewhere, I would consider
> having a Java-specific class.
>
>
>
> On Tue, Apr 28, 2020 at 4:38 PM, ZHANG Wei wrote:
>
>> To be frank, I also love having a pure Java type in the Java API and a Scala
>> type in the Scala API. :-)
>>
>> If we don't treat Java as a "FRIEND" of Scala, just as with Python, maybe we
>> can adopt the approach of option 1, the specific Java classes. (But I don't
>> like the `Java` prefix, which is redundant when I'm coding a Java app,
>> such as JavaRDD; why not distinguish it by package namespace...) The specific
>> Java API can also leverage some native Java language features in new
>> versions.
>>
>> And since Scala and Java have a friendly relationship, a Java
>> user can call the Scala API with the help of `.asScala` or `.asJava` if the
>> Java API is not ready, then switch to the Java API when it's well cooked.
>>
>> The con is more effort to maintain.
>>
>> My 2 cents.
>>
>> --
>> Cheers,
>> -z
>>
>> On Tue, 28 Apr 2020 12:07:36 +0900
>> Hyukjin Kwon  wrote:
>>
>> > The problem is that calling Scala instances from the Java side is
>> > discouraged in general, to the best of my knowledge.
>> > A Java user won't likely know asJava in Scala, but a Scala user will
>> > likely know both asScala and asJava.
>> >
>> >
>> > On Tue, Apr 28, 2020 at 11:35 AM, ZHANG Wei wrote:
>> >
>> > > How about making a small change to option 4:
>> > >   keep the Scala API returning a Scala type instance, while providing an
>> > >   `asJava` method to return a Java type instance.
>> > >
>> > > Scala 2.13 provides CollectionConverters [1][2][3], so in the upcoming
>> > > Spark dependency upgrade this can be supported natively. For the
>> > > current Scala 2.12 version, we can wrap `ImplicitConversionsToJava` [4]
>> > > as Scala 2.13 does and add implicit conversions.
>> > >
>> > > Just my 2 cents.
>> > >
>> > > --
>> > > Cheers,
>> > > -z
>> > >
>> > > [1]
>> > >
>> https://docs.scala-lang.org/overviews/collections-2.13/conversions-between-java-and-scala-collections.html
>> > > [2]
>> > >
>> https://www.scala-lang.org/api/2.13.0/scala/jdk/javaapi/CollectionConverters$.html
>> > > [3]
>> > >
>> https://www.scala-lang.org/api/2.13.0/scala/jdk/CollectionConverters$.html
>> > > [4]
>> > >
>> https://www.scala-lang.org/api/2.12.11/scala/collection/convert/ImplicitConversionsToJava$.html
>> > >
>> > >
>> > > On Tue, 28 Apr 2020 08:52:57 +0900
>> > > Hyukjin Kwon  wrote:
>> > >
>> > > > I would like to make it clear that I am open to other options that can be
>> > > > considered situationally and based on the context.
>> > > > It's okay, and I don't intend to restrict this here. For example, for
>> DSv2, I
>> > > > understand it's written in Java because Java
>> > > > interfaces arguably bring better performance. That's why vectorized
>> > > > readers are written in Java too.
>> > > >
>> > > > Maybe the "general" wasn't explicit in my previous email. Adding
>> APIs to
>> > > > return a Java instance is still
>> > > > rat

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-29 Thread Tom Graves
 Sorry, I'm not sure what your last email means. Does it mean you are putting it
up for a vote or just waiting to get more feedback? I disagree with saying
option 4 is the rule, but agree having a general rule makes sense. I think we
need a lot more input to make the rule, as it affects the APIs.
Tom 
On Wednesday, April 29, 2020, 09:53:22 AM CDT, Hyukjin Kwon 
 wrote:  
 
 I think I am not seeing explicit objection here but rather see people tend to 
agree with the proposal in general.
I would like to step forward rather than leaving it as a deadlock - the worst 
choice here is to postpone and abandon this discussion with this inconsistency.

I don't currently target to document this as the cases are rather rare, and we 
haven't really documented JavaRDD <> RDD vs DataFrame case as well.
Let's keep monitoring and see if this discussion thread clarifies things enough 
in such cases I mentioned.

Let me know if you guys think differently.


On Tue, Apr 28, 2020 at 5:03 PM, Hyukjin Kwon wrote:

Spark has targeted to have a unified API set rather than having separate Java 
classes to reduce the maintenance cost,
e.g.) JavaRDD <> RDD vs DataFrame. These JavaXXX are more about the legacy.

I think it's best to stick to the approach 4. in general cases.
Other options might have to be considered based upon a specific context. For 
example, if we must to add a bunch of Java-specifics
into a specific class for an inevitable reason somewhere, I would consider to 
have a Java-specific class.
 
On Tue, Apr 28, 2020 at 4:38 PM, ZHANG Wei wrote:

Be frankly, I also love the pure Java type in Java API and Scala type in
Scala API. :-)

If we don't treat Java as a "FRIEND" of Scala, just as Python, maybe we
can adopt the status of option 1, the specific Java classes. (But I don't
like the `Java` prefix, which is redundant when I'm coding Java app,
such as JavaRDD, why not distinct it by package namespace...) The specific
Java API can also leverage some native Java language features with new
versions.

And just since the friendly relationship between Scala and Java, the Java
user can call Scala API with `.asScala` or `.asJava`'s help if Java API
is not ready. Then switch to Java API when it's well cooked.

The cons is more efforts to maintain. 

My 2 cents.

-- 
Cheers,
-z

On Tue, 28 Apr 2020 12:07:36 +0900
Hyukjin Kwon  wrote:

> The problem is that calling Scala instances in Java side is discouraged in
> general up to my best knowledge.
> A Java user won't likely know asJava in Scala but a Scala user will likely
> know both asScala and asJava.
> 
> 
> On Tue, Apr 28, 2020 at 11:35 AM, ZHANG Wei wrote:
> 
> > How about making a small change on option 4:
> >   Keep Scala API returning Scala type instance with providing a
> >   `asJava` method to return a Java type instance.
> >
> > Scala 2.13 has provided CollectionConverter [1][2][3], in the following
> > Spark dependences upgrade, which can be supported by nature. For
> > current Scala 2.12 version, we can wrap `ImplicitConversionsToJava`[4]
> > as what Scala 2.13 does and add implicit conversions.
> >
> > Just my 2 cents.
> >
> > --
> > Cheers,
> > -z
> >
> > [1]
> > https://docs.scala-lang.org/overviews/collections-2.13/conversions-between-java-and-scala-collections.html
> > [2]
> > https://www.scala-lang.org/api/2.13.0/scala/jdk/javaapi/CollectionConverters$.html
> > [3]
> > https://www.scala-lang.org/api/2.13.0/scala/jdk/CollectionConverters$.html
> > [4]
> > https://www.scala-lang.org/api/2.12.11/scala/collection/convert/ImplicitConversionsToJava$.html
> >
> >
> > On Tue, 28 Apr 2020 08:52:57 +0900
> > Hyukjin Kwon  wrote:
> >
> > > I would like to make sure I am open for other options that can be
> > > considered situationally and based on the context.
> > > It's okay, and I don't target to restrict this here. For example, DSv2, I
> > > understand it's written in Java because Java
> > > interfaces arguably brings better performance. That's why vectorized
> > > readers are written in Java 

Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2020-04-29 Thread Nicholas Chammas
Not sure what you mean. The native integration will auto-link from a Jira
ticket to the PRs that mention that ticket. I don't think it will update
the ticket's status, though.

Would you like me to file a ticket with Infra and see what they say?

On Tue, Apr 28, 2020 at 12:21 AM Hyukjin Kwon  wrote:

> Maybe it's time to switch. Do you know if we can still link the JIRA
> against Github?
> The script used to change the status of the JIRA too, but it has not been
> working for a long time - I suspect this isn't a big deal.
>
>> On Sat, Apr 25, 2020 at 10:31 AM, Nicholas Chammas wrote:
>
>> Have we asked Infra recently about enabling the native Jira-GitHub
>> integration
>> ?
>> Maybe we can deprecate the part of this script that updates Jira tickets
>> with links to the PR and rely on the native integration instead. We use it
>> at my day job, for example.
>>
>> On Fri, Apr 24, 2020 at 12:39 AM Hyukjin Kwon 
>> wrote:
>>
>>> Hi all,
>>>
>>> Seems like this github_jira_sync.py
>>> script
>>> has stopped working completely now.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-31532 <>
>>> https://github.com/apache/spark/pull/28316
>>> https://issues.apache.org/jira/browse/SPARK-31529 <>
>>> https://github.com/apache/spark/pull/28315
>>> https://issues.apache.org/jira/browse/SPARK-31528 <>
>>> https://github.com/apache/spark/pull/28313
>>>
>>> Josh, would you mind taking a look, please, when you find some time?
>>> There are a bunch of JIRAs now, and it is very confusing which JIRAs are in
>>> progress with a PR and which are not.
>>>
>>>
>>> On Fri, Jul 26, 2019 at 1:20 PM, Hyukjin Kwon wrote:
>>>
 Just FYI, I had to come up with a better JQL to filter out the JIRAs
 that already have linked PRs.
 In case it helps someone, I use this JQL now to look through the open
 JIRAs:

 project = SPARK AND
 status = Open AND
 NOT issueFunction in linkedIssuesOfRemote("Github Pull Request *")
 ORDER BY created DESC, priority DESC, updated DESC




 On Fri, Jul 19, 2019 at 4:54 PM, Hyukjin Kwon wrote:

> That's a great explanation. Thanks, I didn't know that.
>
> Josh, do you know who I should ping on this?
>
> On Fri, 19 Jul 2019, 16:52 Dongjoon Hyun, 
> wrote:
>
>> Hi, Hyukjin.
>>
>> In short, there are two bots. The current situation happens when
>> only the bot using `dev/github_jira_sync.py` works.
>>
>> And `dev/github_jira_sync.py` is irrelevant to the JIRA status
>> change because it only uses the `add_remote_link` and `add_comment` APIs.
>> I only know about this bot (in the Apache Spark repository).
>>
>> AFAIK, `dev/github_jira_sync.py`'s activity is done under the JIRA ID
>> `githubbot` (Name: `ASF GitHub Bot`).
>> The other bot's activity is done under the JIRA ID `apachespark`
>> (Name: `Apache Spark`).
>> The other bot is the one which Josh mentioned before (in the
>> `databricks/spark-pr-dashboard` repo).
>>
>> The root cause will be the same: the API key used by the bot is
>> rejected by Apache JIRA and redirected to a CAPTCHA.
>>
>> Bests,
>> Dongjoon.
>>
>> On Thu, Jul 18, 2019 at 8:24 PM Hyukjin Kwon 
>> wrote:
>>
>>> Hi all,
>>>
>>> Seems this issue is happening again. The PR link is
>>> properly created in the corresponding JIRA, but it doesn't change the
>>> JIRA's status from OPEN to IN-PROGRESS.
>>>
>>> See, for instance,
>>>
>>> https://issues.apache.org/jira/browse/SPARK-28443
>>> https://issues.apache.org/jira/browse/SPARK-28440
>>> https://issues.apache.org/jira/browse/SPARK-28436
>>> https://issues.apache.org/jira/browse/SPARK-28434
>>> https://issues.apache.org/jira/browse/SPARK-28433
>>> https://issues.apache.org/jira/browse/SPARK-28431
>>>
>>> Josh and Dongjoon, do you guys maybe have any idea?
>>>
>>> On Thu, Apr 25, 2019 at 3:09 PM, Hyukjin Kwon wrote:
>>>
 Thank you so much Josh .. !!

 On Thu, Apr 25, 2019 at 3:04 PM, Josh Rosen wrote:

> The code for this runs in http://spark-prs.appspot.com (see
> https://github.com/databricks/spark-pr-dashboard/blob/1e799c9e510fa8cdc9a6c084a777436bebeabe10/sparkprs/controllers/tasks.py#L137
> )
>
> I checked the AppEngine logs and it looks like we're getting error
> responses, possibly due to a credentials issue:
>
> Exception when starting progress on JIRA issue SPARK-27355 (
>> /base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/controllers/tasks.py:142
>> 

[OSS DIGEST] The major changes of Apache Spark from Mar 25 to Apr 7

2020-04-29 Thread Xiao Li
Hi all,

This is the bi-weekly Apache Spark digest from the Databricks OSS team.
For each API/configuration/behavior change, an *[API]* tag is added to the
title.

CORE
[3.0][SPARK-30623][CORE]
Spark external shuffle allow disable of separate event loop group (+66, -33)
>


PR#22173  introduced a perf
regression in shuffle, even if we disable the feature flag
spark.shuffle.server.chunkFetchHandlerThreadsPercent. To fix the perf
regression, this PR refactors the related code to completely disable this
feature by default.
[3.0][SPARK-31314][CORE]
Revert SPARK-29285 to fix shuffle regression caused by creating temporary
file eagerly (+10, -71)>


PR#25962  introduced a perf
regression in shuffle, which may create empty files unnecessarily. This PR
reverts it.
[API][3.1][SPARK-29154][CORE]
Update Spark scheduler for stage level scheduling (+704, -218)>


This PR updates the DAG scheduler to schedule tasks to match the resource
profile. It's for the stage level scheduling.
[API][3.1][SPARK-29153][CORE] Add ability to merge resource profiles within
a stage with Stage Level Scheduling (+304, -15)>


Add the ability to optionally merge resource profiles if they are
specified on multiple RDDs within a stage. The feature is part of Stage
Level Scheduling. There is a config,
spark.scheduler.resource.profileMergeConflicts, to enable this feature; the
config is off by default.

spark.scheduler.resource.profileMergeConflicts (Default: false)

   - If set to true, Spark will merge ResourceProfiles when different
   profiles are specified in RDDs that get combined into a single stage. When
   they are merged, Spark chooses the maximum of each resource and creates a
   new ResourceProfile. The default of false results in Spark throwing an
   exception if multiple different ResourceProfiles are found in RDDs going
   into the same stage.
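
A minimal, illustrative sketch of enabling this (not from the PR itself; the
application name is made up, and only the config key above comes from the digest):

  import org.apache.spark.{SparkConf, SparkContext}

  // Hypothetical app; turn on merging of ResourceProfiles within a stage.
  val conf = new SparkConf()
    .setAppName("profile-merge-demo")
    .set("spark.scheduler.resource.profileMergeConflicts", "true")
  val sc = new SparkContext(conf)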

[API][3.1][SPARK-31208][CORE]
Add an experimental API: cleanShuffleDependencies (+158, -71)>


Add a new experimental developer API, RDD.cleanShuffleDependencies(blocking:
Boolean), to allow explicitly cleaning up shuffle files. This could help
dynamic scaling of the K8s backend, since the backend only recycles executors
without shuffle files.

  /**
   * :: Experimental ::
   * Removes an RDD's shuffles and its non-persisted ancestors.
   * When running without a shuffle service, cleaning up shuffle files enables downscaling.
   * If you use the RDD after this call, you should checkpoint and materialize it first.
   * If you are uncertain of what you are doing, please do not use this feature.
   * Additional techniques for mitigating orphaned shuffle files:
   *   * Tuning the driver GC to be more aggressive, so the regular context cleaner is triggered
   *   * Setting an appropriate TTL for shuffle files to be auto cleaned
   */
  @Experimental
  @DeveloperApi
  @Since("3.1.0")
  def cleanShuffleDependencies(blocking: Boolean = false): Unit
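
A hypothetical usage sketch (the paths and RDD are made up; only the
cleanShuffleDependencies call itself comes from the PR), assuming an existing
SparkContext sc:

  // Per the doc comment above: checkpoint and materialize before cleaning.
  sc.setCheckpointDir("hdfs:///tmp/checkpoints")
  val counts = sc.textFile("hdfs:///logs/events.txt")
    .map(line => (line.split(" ")(0), 1))
    .reduceByKey(_ + _)
  counts.checkpoint()   // mark the RDD for checkpointing
  counts.count()        // action that materializes the shuffle and the checkpoint
  counts.cleanShuffleDependencies(blocking = true)  // then drop the shuffle files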

[3.1][SPARK-31179]
Fast fail the connection while last connection failed in fast fail time
window (+68, -12)>


In TransportFactory, if a connection to the destination address fails, new
connection requests [that are created within a time window] fail fast
to avoid too many retries. This time window is set to 95% of the
IO retry wait time (spark.shuffle.io.retryWait, whose default is 5 seconds).

SQL
[API][3.0][SP

Re: [OSS DIGEST] The major changes of Apache Spark from Mar 25 to Apr 7

2020-04-29 Thread Xingbo Jiang
Thank you so much for doing this, Xiao!

On Wed, Apr 29, 2020 at 11:09 AM Xiao Li  wrote:

> Hi all,
>
> This is the bi-weekly Apache Spark digest from the Databricks OSS team.
> For each API/configuration/behavior change, an *[API] *tag is added in
> the title.
>
> CORE
> [3.0][SPARK-30623][CORE]
> Spark external shuffle allow disable of separate event loop group (+66, -33)
> >
> 
>
> PR#22173  introduced a perf
> regression in shuffle, even if we disable the feature flag
> spark.shuffle.server.chunkFetchHandlerThreadsPercent. To fix the perf
> regression, this PR refactors the related code to completely disable this
> feature by default.
>
> [3.0][SPARK-31314][CORE]
> Revert SPARK-29285 to fix shuffle regression caused by creating temporary
> file eagerly (+10, -71)>
> 
>
> PR#25962  introduced a perf
> regression in shuffle, which may create empty files unnecessarily. This PR
> reverts it.
>
> [API][3.1][SPARK-29154][CORE]
> Update Spark scheduler for stage level scheduling (+704, -218)>
> 
>
> This PR updates the DAG scheduler to schedule tasks to match the resource
> profile. It's for the stage level scheduling.
> [API][3.1][SPARK-29153][CORE] Add ability to merge resource profiles
> within a stage with Stage Level Scheduling (+304, -15)>
> 
>
> Add the ability to optionally merged resource profiles if they are
> specified on multiple RDDs within a Stage. The feature is part of Stage
> Level Scheduling. There is a config
> spark.scheduler.resourceProfile.mergeConflicts to enable this feature,
> the config if off by default.
>
> spark.scheduler.resource.profileMergeConflicts (Default: false)
>
>- If set to true, Spark will merge ResourceProfiles when different
>profiles are specified in RDDs that get combined into a single stage. When
>they are merged, Spark chooses the maximum of each resource and creates a
>new ResourceProfile. The default of false results in Spark throwing an
>exception if multiple different ResourceProfiles are found in RDDs going
>into the same stage.
>
>
> [API][3.1][SPARK-31208][CORE]
> Add an experimental API: cleanShuffleDependencies (+158, -71)>
> 
>
> Add a new experimental developer API RDD.cleanShuffleDependencies(blocking:
> Boolean) to allow explicitly clean up shuffle files. This could help
> dynamic scaling of K8s backend since the backend only recycles executors
> without shuffle files.
>
>   /**
>    * :: Experimental ::
>    * Removes an RDD's shuffles and its non-persisted ancestors.
>    * When running without a shuffle service, cleaning up shuffle files enables downscaling.
>    * If you use the RDD after this call, you should checkpoint and materialize it first.
>    * If you are uncertain of what you are doing, please do not use this feature.
>    * Additional techniques for mitigating orphaned shuffle files:
>    *   * Tuning the driver GC to be more aggressive, so the regular context cleaner is triggered
>    *   * Setting an appropriate TTL for shuffle files to be auto cleaned
>    */
>   @Experimental
>   @DeveloperApi
>   @Since("3.1.0")
>   def cleanShuffleDependencies(blocking: Boolean = false): Unit
>
>
> [3.1][SPARK-31179]
> Fast fail the connection while last connection failed in fast fail time
> window (+68, -12)>
> 
>
> In TransportFactory, if a connection to the destination address fails, the
> new connection requests [that are created within a time window] fail fast
> for avoiding too many retries. This time window size is set to 95% of the
> IO retry wait time (spark.io.shuffle.retryWait whose default is 5 seconds).
>
> 

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-29 Thread Hyukjin Kwon
Hm, I thought you meant you prefer 3 over 4 but don't mind particularly.
I don't mean to wait for more feedback. It looks like just a deadlock,
which would be the worst case.
I was suggesting to pick one way first and stick to it. If we find out
something later, we can discuss changing it then.

Having a separate Java-specific API (4. way)
  - causes maintenance cost
  - makes users search for which API to use for Java every time
  - this looks like the opposite of the unified API set Spark has targeted
so far.

I don't completely buy the argument about Scala/Java friendliness, because
using a Java instance is already documented in the official Scala
documentation.
Users still need to check whether we have Java-specific methods for *some*
APIs.
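
To make the trade-off concrete, a rough sketch (the class and method names
here are hypothetical, not actual Spark APIs): a unified Scala-typed method
that Java callers convert at the call site, versus a separate Java-specific
method that has to be kept in sync.

  import scala.collection.JavaConverters._

  // Unified API: one method with a Scala return type; Java callers convert themselves.
  class ListenerManager {
    def listenerNames: Seq[String] = Seq("a", "b")
  }

  // Separate Java-specific API: an extra method to document and maintain.
  class ListenerManagerWithJavaApi extends ListenerManager {
    def getListenerNamesAsJava: java.util.List[String] = listenerNames.asJava
  }

Either way a Java user has to know where to look, which is the search cost
mentioned above.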



On Thu, 30 Apr 2020, 00:06 Tom Graves,  wrote:

> Sorry I'm not sure what your last email means. Does it mean you are
> putting it up for a vote or just waiting to get more feedback?  I disagree
> with saying option 4 is the rule but agree having a general rule makes
> sense.  I think we need a lot more input to make the rule as it affects the
> api's.
>
> Tom
>
> On Wednesday, April 29, 2020, 09:53:22 AM CDT, Hyukjin Kwon <
> gurwls...@gmail.com> wrote:
>
>
> I think I am not seeing explicit objection here but rather see people tend
> to agree with the proposal in general.
> I would like to step forward rather than leaving it as a deadlock - the
> worst choice here is to postpone and abandon this discussion with this
> inconsistency.
>
> I don't currently target to document this as the cases are rather
> rare, and we haven't really documented JavaRDD <> RDD vs DataFrame case as
> well.
> Let's keep monitoring and see if this discussion thread clarifies things
> enough in such cases I mentioned.
>
> Let me know if you guys think differently.
>
>
> On Tue, Apr 28, 2020 at 5:03 PM, Hyukjin Kwon wrote:
>
> Spark has targeted to have a unified API set rather than having separate
> Java classes to reduce the maintenance cost,
> e.g.) JavaRDD <> RDD vs DataFrame. These JavaXXX are more about the legacy.
>
> I think it's best to stick to the approach 4. in general cases.
> Other options might have to be considered based upon a specific context.
> For example, if we *must* to add a bunch of Java-specifics
> into a specific class for an inevitable reason somewhere, I would consider
> to have a Java-specific class.
>
>
>
> On Tue, Apr 28, 2020 at 4:38 PM, ZHANG Wei wrote:
>
> Be frankly, I also love the pure Java type in Java API and Scala type in
> Scala API. :-)
>
> If we don't treat Java as a "FRIEND" of Scala, just as Python, maybe we
> can adopt the status of option 1, the specific Java classes. (But I don't
> like the `Java` prefix, which is redundant when I'm coding Java app,
> such as JavaRDD, why not distinct it by package namespace...) The specific
> Java API can also leverage some native Java language features with new
> versions.
>
> And just since the friendly relationship between Scala and Java, the Java
> user can call Scala API with `.asScala` or `.asJava`'s help if Java API
> is not ready. Then switch to Java API when it's well cooked.
>
> The cons is more efforts to maintain.
>
> My 2 cents.
>
> --
> Cheers,
> -z
>
> On Tue, 28 Apr 2020 12:07:36 +0900
> Hyukjin Kwon  wrote:
>
> > The problem is that calling Scala instances in Java side is discouraged
> in
> > general up to my best knowledge.
> > A Java user won't likely know asJava in Scala but a Scala user will
> likely
> > know both asScala and asJava.
> >
> >
> > On Tue, Apr 28, 2020 at 11:35 AM, ZHANG Wei wrote:
> >
> > > How about making a small change on option 4:
> > >   Keep Scala API returning Scala type instance with providing a
> > >   `asJava` method to return a Java type instance.
> > >
> > > Scala 2.13 has provided CollectionConverter [1][2][3], in the following
> > > Spark dependences upgrade, which can be supported by nature. For
> > > current Scala 2.12 version, we can wrap `ImplicitConversionsToJava`[4]
> > > as what Scala 2.13 does and add implicit conversions.
> > >
> > > Just my 2 cents.
> > >
> > > --
> > > Cheers,
> > > -z
> > >
> > > [1]
> > >
> https://docs.scala-lang.org/overviews/collections-2.13/conversions-between-java-and-scala-collections.html
> > > [2]
> > >
> https://www.scala-lang.org/api/2.13.0/scala/jdk/javaapi/CollectionConverters$.html
> > > [3]
> > >
> https://www.scala-lang.org/api/2.13.0/scala/jdk/CollectionConverters$.html

Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2020-04-29 Thread Hyukjin Kwon
WDYT @Josh Rosen ?
Seems like
https://github.com/databricks/spark-pr-dashboard/blob/1e799c9e510fa8cdc9a6c084a777436bebeabe10/sparkprs/controllers/tasks.py#L131-L142
isn't working anymore.
Does it make sense to move it to the native Jira-GitHub integration?
It won't change the JIRA status as we used to do, but it might be better from a
cursory look. However, maybe I missed some context.


On Thu, Apr 30, 2020 at 2:46 AM, Nicholas Chammas wrote:

> Not sure what you mean. The native integration will auto-link from a Jira
> ticket to the PRs that mention that ticket. I don't think it will update
> the ticket's status, though.
>
> Would you like me to file a ticket with Infra and see what they say?
>
> On Tue, Apr 28, 2020 at 12:21 AM Hyukjin Kwon  wrote:
>
>> Maybe it's time to switch. Do you know if we can still link the JIRA
>> against Github?
>> The script used to change the status of JIRA too but it stopped working
>> for a long time - I suspect this isn't a big deal.
>>
>> On Sat, Apr 25, 2020 at 10:31 AM, Nicholas Chammas wrote:
>>
>>> Have we asked Infra recently about enabling the native Jira-GitHub
>>> integration
>>> ?
>>> Maybe we can deprecate the part of this script that updates Jira tickets
>>> with links to the PR and rely on the native integration instead. We use it
>>> at my day job, for example.
>>>
>>> On Fri, Apr 24, 2020 at 12:39 AM Hyukjin Kwon 
>>> wrote:
>>>
 Hi all,

 Seems like this github_jira_sync.py
  
 script
 seems stopped working completely now.

 https://issues.apache.org/jira/browse/SPARK-31532 <>
 https://github.com/apache/spark/pull/28316
 https://issues.apache.org/jira/browse/SPARK-31529 <>
 https://github.com/apache/spark/pull/28315
 https://issues.apache.org/jira/browse/SPARK-31528 <>
 https://github.com/apache/spark/pull/28313

 Josh, would you mind taking a look please when you find some time?
 There is a bunch of JIRAs now, and it is very confusing which JIRA is
 in progress with a PR or not.


 On Fri, Jul 26, 2019 at 1:20 PM, Hyukjin Kwon wrote:

> Just FYI, I had to come up with a better JQL to filter out the JIRAs
> that already have linked PRs.
> In case it helps someone, I use this JQL now to look through the open
> JIRAs:
>
> project = SPARK AND
> status = Open AND
> NOT issueFunction in linkedIssuesOfRemote("Github Pull Request *")
> ORDER BY created DESC, priority DESC, updated DESC
>
>
>
>
> On Fri, Jul 19, 2019 at 4:54 PM, Hyukjin Kwon wrote:
>
>> That's a great explanation. Thanks I didn't know that.
>>
>> Josh, do you know who I should ping on this?
>>
>> On Fri, 19 Jul 2019, 16:52 Dongjoon Hyun, 
>> wrote:
>>
>>> Hi, Hyukjin.
>>>
>>> In short, there are two bots. And, the current situation happens
>>> when only one bot with `dev/github_jira_sync.py` works.
>>>
>>> And, `dev/github_jira_sync.py` is irrelevant to the JIRA status
>>> change because it only use `add_remote_link` and `add_comment` API.
>>> I know only this bot (in Apache Spark repository repo)
>>>
>>> AFAIK, `deb/github_jira_sync.py`'s activity is done under JIRA ID
>>> `githubbot` (Name: `ASF GitHub Bot`).
>>> And, the other bot's activity is done under JIRA ID `apachespark`
>>> (Name: `Apache Spark`).
>>> The other bot is the one which Josh mentioned before. (in
>>> `databricks/spark-pr-dashboard` repo).
>>>
>>> The root cause will be the same. The API key used by the bot is
>>> rejected by Apache JIRA and forwarded to CAPCHAR.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Thu, Jul 18, 2019 at 8:24 PM Hyukjin Kwon 
>>> wrote:
>>>
 Hi all,

 Seems this issue is re-happening again. Seems the PR link is
 properly created in the corresponding JIRA but it doesn't change the 
 JIRA's
 status from OPEN to IN-PROGRESS.

 See, for instance,

 https://issues.apache.org/jira/browse/SPARK-28443
 https://issues.apache.org/jira/browse/SPARK-28440
 https://issues.apache.org/jira/browse/SPARK-28436
 https://issues.apache.org/jira/browse/SPARK-28434
 https://issues.apache.org/jira/browse/SPARK-28433
 https://issues.apache.org/jira/browse/SPARK-28431

 Josh and Dongjoon, do you guys maybe have any idea?

 On Thu, Apr 25, 2019 at 3:09 PM, Hyukjin Kwon wrote:

> Thank you so much Josh .. !!
>
> On Thu, Apr 25, 2019 at 3:04 PM, Josh Rosen wrote:
>
>> The code for this runs in http://spark-prs.appspot.com (see
>> https://github.com/databricks

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-29 Thread Hyukjin Kwon
There was a typo in the previous email. I am re-sending:

Hm, I thought you meant you prefer 3 over 4 but don't mind particularly.
I don't mean to wait for more feedback. It looks like just a deadlock,
which would be the worst case.
I was suggesting to pick one way first and stick to it. If we find out
something later, we can discuss changing it then.

Having a separate Java-specific API (3. way)
  - causes maintenance cost
  - makes users search for which API to use for Java every time
  - this looks like the opposite of the unified API set Spark has targeted
so far.

I don't completely buy the argument about Scala/Java friendliness, because
using a Java instance is already documented in the official Scala
documentation.
Users still need to check whether we have Java-specific methods for *some*
APIs.

On Thu, Apr 30, 2020 at 8:58 AM, Hyukjin Kwon wrote:

> Hm, I thought you meant you prefer 3. over 4 but don't mind particularly.
> I don't mean to wait for more feedback. It looks likely just a deadlock
> which will be the worst case.
> I was suggesting to pick one way first, and stick to it. If we find out
> something later, we can discuss
> more about changing it later.
>
> Having separate Java specific API (4. way)
>   - causes maintenance cost
>   - makes users to search which API for Java every time
>   - this looks the opposite why against the unified API set Spark targeted
> so far.
>
> I don't completely buy the argument about Scala/Java friendly because
> using Java instance is already documented in the official Scala
> documentation.
> Users still need to search if we have Java specific methods for *some*
> APIs.
>
>
>
> On Thu, 30 Apr 2020, 00:06 Tom Graves,  wrote:
>
>> Sorry I'm not sure what your last email means. Does it mean you are
>> putting it up for a vote or just waiting to get more feedback?  I disagree
>> with saying option 4 is the rule but agree having a general rule makes
>> sense.  I think we need a lot more input to make the rule as it affects the
>> api's.
>>
>> Tom
>>
>> On Wednesday, April 29, 2020, 09:53:22 AM CDT, Hyukjin Kwon <
>> gurwls...@gmail.com> wrote:
>>
>>
>> I think I am not seeing explicit objection here but rather see people
>> tend to agree with the proposal in general.
>> I would like to step forward rather than leaving it as a deadlock - the
>> worst choice here is to postpone and abandon this discussion with this
>> inconsistency.
>>
>> I don't currently target to document this as the cases are rather
>> rare, and we haven't really documented JavaRDD <> RDD vs DataFrame case as
>> well.
>> Let's keep monitoring and see if this discussion thread clarifies things
>> enough in such cases I mentioned.
>>
>> Let me know if you guys think differently.
>>
>>
>> On Tue, Apr 28, 2020 at 5:03 PM, Hyukjin Kwon wrote:
>>
>> Spark has targeted to have a unified API set rather than having separate
>> Java classes to reduce the maintenance cost,
>> e.g.) JavaRDD <> RDD vs DataFrame. These JavaXXX are more about the
>> legacy.
>>
>> I think it's best to stick to the approach 4. in general cases.
>> Other options might have to be considered based upon a specific context.
>> For example, if we *must* to add a bunch of Java-specifics
>> into a specific class for an inevitable reason somewhere, I would
>> consider to have a Java-specific class.
>>
>>
>>
>> On Tue, Apr 28, 2020 at 4:38 PM, ZHANG Wei wrote:
>>
>> Be frankly, I also love the pure Java type in Java API and Scala type in
>> Scala API. :-)
>>
>> If we don't treat Java as a "FRIEND" of Scala, just as Python, maybe we
>> can adopt the status of option 1, the specific Java classes. (But I don't
>> like the `Java` prefix, which is redundant when I'm coding Java app,
>> such as JavaRDD, why not distinct it by package namespace...) The specific
>> Java API can also leverage some native Java language features with new
>> versions.
>>
>> And just since the friendly relationship between Scala and Java, the Java
>> user can call Scala API with `.asScala` or `.asJava`'s help if Java API
>> is not ready. Then switch to Java API when it's well cooked.
>>
>> The cons is more efforts to maintain.
>>
>> My 2 cents.
>>
>> --
>> Cheers,
>> -z
>>
>> On Tue, 28 Apr 2020 12:07:36 +0900
>> Hyukjin Kwon  wrote:
>>
>> > The problem is that calling Scala instances in Java side is discouraged
>> in
>> > general up to my best knowledge.
>> > A Java user won't likely know asJava in Scala but a Scala user will
>> likely
>> > know both asScala and asJava.
>> >
>> >
>> > On Tue, Apr 28, 2020 at 11:35 AM, ZHANG Wei wrote:
>> >
>> > > How about making a small change on option 4:
>> > >   Keep Scala API returning Scala type instance with providing a
>> > >   `asJava` method to return a Java type instance.
>> > >
>> > > Scala 2.13 has provided CollectionConverter [1][2][3], in the
>> following
>> > > Spark dependences upgrade, which can be supported by nature. For
>> > > current Scala 2.12 version, we can wrap `ImplicitConversionsToJava`[4]
>> > > as what Scala 2.13 does and add implici

Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2020-04-29 Thread Hyukjin Kwon
Actually, let me just take a look myself and bring some updates soon.

On Thu, Apr 30, 2020 at 9:13 AM, Hyukjin Kwon wrote:

> WDYT @Josh Rosen ?
> Seems
> https://github.com/databricks/spark-pr-dashboard/blob/1e799c9e510fa8cdc9a6c084a777436bebeabe10/sparkprs/controllers/tasks.py#L131-L142
>  this
> isn't working anymore.
> Does it make sense to move it to native Jira-GitHub integration
> 
> ?
> It won't change JIRA status as we used to do but it might be better from a
> cursory look. However, maybe I missed some context.
>
>
> On Thu, Apr 30, 2020 at 2:46 AM, Nicholas Chammas wrote:
>
>> Not sure what you mean. The native integration will auto-link from a Jira
>> ticket to the PRs that mention that ticket. I don't think it will update
>> the ticket's status, though.
>>
>> Would you like me to file a ticket with Infra and see what they say?
>>
>> On Tue, Apr 28, 2020 at 12:21 AM Hyukjin Kwon 
>> wrote:
>>
>>> Maybe it's time to switch. Do you know if we can still link the JIRA
>>> against Github?
>>> The script used to change the status of JIRA too but it stopped working
>>> for a long time - I suspect this isn't a big deal.
>>>
>>> On Sat, Apr 25, 2020 at 10:31 AM, Nicholas Chammas wrote:
>>>
 Have we asked Infra recently about enabling the native Jira-GitHub
 integration
 ?
 Maybe we can deprecate the part of this script that updates Jira tickets
 with links to the PR and rely on the native integration instead. We use it
 at my day job, for example.

 On Fri, Apr 24, 2020 at 12:39 AM Hyukjin Kwon 
 wrote:

> Hi all,
>
> Seems like this github_jira_sync.py
>  
> script
> seems stopped working completely now.
>
> https://issues.apache.org/jira/browse/SPARK-31532 <>
> https://github.com/apache/spark/pull/28316
> https://issues.apache.org/jira/browse/SPARK-31529 <>
> https://github.com/apache/spark/pull/28315
> https://issues.apache.org/jira/browse/SPARK-31528 <>
> https://github.com/apache/spark/pull/28313
>
> Josh, would you mind taking a look please when you find some time?
> There is a bunch of JIRAs now, and it is very confusing which JIRA is
> in progress with a PR or not.
>
>
> On Fri, Jul 26, 2019 at 1:20 PM, Hyukjin Kwon wrote:
>
>> Just FYI, I had to come up with a better JQL to filter out the JIRAs
>> that already have linked PRs.
>> In case it helps someone, I use this JQL now to look through the open
>> JIRAs:
>>
>> project = SPARK AND
>> status = Open AND
>> NOT issueFunction in linkedIssuesOfRemote("Github Pull Request *")
>> ORDER BY created DESC, priority DESC, updated DESC
>>
>>
>>
>>
>> On Fri, Jul 19, 2019 at 4:54 PM, Hyukjin Kwon wrote:
>>
>>> That's a great explanation. Thanks I didn't know that.
>>>
>>> Josh, do you know who I should ping on this?
>>>
>>> On Fri, 19 Jul 2019, 16:52 Dongjoon Hyun, 
>>> wrote:
>>>
 Hi, Hyukjin.

 In short, there are two bots. And, the current situation happens
 when only one bot with `dev/github_jira_sync.py` works.

 And, `dev/github_jira_sync.py` is irrelevant to the JIRA status
 change because it only use `add_remote_link` and `add_comment` API.
 I know only this bot (in Apache Spark repository repo)

 AFAIK, `deb/github_jira_sync.py`'s activity is done under JIRA ID
 `githubbot` (Name: `ASF GitHub Bot`).
 And, the other bot's activity is done under JIRA ID `apachespark`
 (Name: `Apache Spark`).
 The other bot is the one which Josh mentioned before. (in
 `databricks/spark-pr-dashboard` repo).

 The root cause will be the same. The API key used by the bot is
 rejected by Apache JIRA and forwarded to CAPCHAR.

 Bests,
 Dongjoon.

 On Thu, Jul 18, 2019 at 8:24 PM Hyukjin Kwon 
 wrote:

> Hi all,
>
> Seems this issue is re-happening again. Seems the PR link is
> properly created in the corresponding JIRA but it doesn't change the 
> JIRA's
> status from OPEN to IN-PROGRESS.
>
> See, for instance,
>
> https://issues.apache.org/jira/browse/SPARK-28443
> https://issues.apache.org/jira/browse/SPARK-28440
> https://issues.apache.org/jira/browse/SPARK-28436
> https://issues.apache.org/jira/browse/SPARK-28434
> https://issues.apache.org/jira/browse/SPARK-28433
> https://issues.apache.org/jira/browse/SPARK-28431
>
> Josh and Dongjoon, do you guys maybe have any idea?
>
> 2019년 4월 25일 (목

Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2020-04-29 Thread Josh Rosen
(Catching up on a backlog of emails, hence my belated reply)

I just checked the spark-prs app engine logs and it appears that our JIRA
API calls are failing due to a CAPTCHA check (same issue as before). Based
on https://jira.atlassian.com/browse/JRASERVER-40362, it sounds like this
is a fairly common problem.

I manually logged into the 'apachespark' JIRA account and completed the
CAPTCHA, so *hopefully* things should be temporarily unbroken.

To permanently fix this issue, we might need to use OAuth tokens for
connecting to JIRA (instead of basic username + password auth). It looks
like the Python JIRA library supports this (
https://jira.readthedocs.io/en/master/examples.html#oauth) and I found some
promising-looking instructions on how to generate the OAuth tokens:
https://www.redradishtech.com/display/KB/How+to+write+a+Python+script+authenticating+with+Jira+via+OAuth
.
However, it looks like you need to be a JIRA administrator in order to
configure the applink so I can't fix this by myself.

It would be great if the native GitHub <-> JIRA integration could meet our
needs; this probably either didn't exist or wasn't configurable by us when
we first wrote our own integration / sync script.

On Wed, Apr 29, 2020 at 6:21 PM Hyukjin Kwon  wrote:

> Let actually me just take a look by myself and bring some updates soon.
>
> On Thu, Apr 30, 2020 at 9:13 AM, Hyukjin Kwon wrote:
>
>> WDYT @Josh Rosen ?
>> Seems
>> https://github.com/databricks/spark-pr-dashboard/blob/1e799c9e510fa8cdc9a6c084a777436bebeabe10/sparkprs/controllers/tasks.py#L131-L142
>>  this
>> isn't working anymore.
>> Does it make sense to move it to native Jira-GitHub integration
>> 
>> ?
>> It won't change JIRA status as we used to do but it might be better from
>> a cursory look. However, maybe I missed some context.
>>
>>
>> On Thu, Apr 30, 2020 at 2:46 AM, Nicholas Chammas wrote:
>>
>>> Not sure what you mean. The native integration will auto-link from a
>>> Jira ticket to the PRs that mention that ticket. I don't think it will
>>> update the ticket's status, though.
>>>
>>> Would you like me to file a ticket with Infra and see what they say?
>>>
>>> On Tue, Apr 28, 2020 at 12:21 AM Hyukjin Kwon 
>>> wrote:
>>>
 Maybe it's time to switch. Do you know if we can still link the JIRA
 against Github?
 The script used to change the status of JIRA too but it stopped working
 for a long time - I suspect this isn't a big deal.

 On Sat, Apr 25, 2020 at 10:31 AM, Nicholas Chammas wrote:

> Have we asked Infra recently about enabling the native Jira-GitHub
> integration
> ?
> Maybe we can deprecate the part of this script that updates Jira tickets
> with links to the PR and rely on the native integration instead. We use it
> at my day job, for example.
>
> On Fri, Apr 24, 2020 at 12:39 AM Hyukjin Kwon 
> wrote:
>
>> Hi all,
>>
>> Seems like this github_jira_sync.py
>>  
>> script
>> seems stopped working completely now.
>>
>> https://issues.apache.org/jira/browse/SPARK-31532 <>
>> https://github.com/apache/spark/pull/28316
>> https://issues.apache.org/jira/browse/SPARK-31529 <>
>> https://github.com/apache/spark/pull/28315
>> https://issues.apache.org/jira/browse/SPARK-31528 <>
>> https://github.com/apache/spark/pull/28313
>>
>> Josh, would you mind taking a look please when you find some time?
>> There is a bunch of JIRAs now, and it is very confusing which JIRA is
>> in progress with a PR or not.
>>
>>
>> On Fri, Jul 26, 2019 at 1:20 PM, Hyukjin Kwon wrote:
>>
>>> Just FYI, I had to come up with a better JQL to filter out the JIRAs
>>> that already have linked PRs.
>>> In case it helps someone, I use this JQL now to look through the
>>> open JIRAs:
>>>
>>> project = SPARK AND
>>> status = Open AND
>>> NOT issueFunction in linkedIssuesOfRemote("Github Pull Request *")
>>> ORDER BY created DESC, priority DESC, updated DESC
>>>
>>>
>>>
>>>
>>> On Fri, Jul 19, 2019 at 4:54 PM, Hyukjin Kwon wrote:
>>>
 That's a great explanation. Thanks I didn't know that.

 Josh, do you know who I should ping on this?

 On Fri, 19 Jul 2019, 16:52 Dongjoon Hyun, 
 wrote:

> Hi, Hyukjin.
>
> In short, there are two bots. And, the current situation happens
> when only one bot with `dev/github_jira_sync.py` works.
>
> And, `dev/github_jira_sync.py` is irrelevant to the JIRA status
> change because it only use `add_remote_link` and `add_comment` API.
> I know only this bot (in Apache Spark repository repo)
>
>>