PySpark documentation main page

2020-08-01 Thread Hyukjin Kwon
Hi all,

I am trying to write up the main page of the PySpark documentation at
https://github.com/apache/spark/pull/29320.

While I think the current proposal might be good enough, I would like
to collect more feedback on the contents, structure, and image, since
this is the entry page of the PySpark documentation.

For example, sharing a reference site is also very welcome. Let me know
if any of you have a good idea to share. I plan to leave it open for a
few more days.

PS: Thanks to @Liang-Chi Hsieh and @Sean Owen for taking a look at it
quickly.


Re: Spark-submit --files option help

2020-08-01 Thread Russell Spitzer
You can use SparkFiles.get(path)

Example here
https://github.com/datastax/spark-cassandra-connector/blob/master/connector/src/main/scala/com/datastax/spark/connector/cql/CassandraConnectionFactory.scala#L152
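
For PySpark users, the same lookup can be sketched as below. This is a
minimal sketch and not code from this thread: the file name app.conf, its
INI-style format, and both helper names are assumptions for illustration.
Note that SparkFiles.get takes the bare file name as shipped with --files,
not the original local path.

```python
import configparser


def resolve_shipped_file(file_name):
    """Return the local absolute path of a file shipped with --files."""
    # Imported lazily so read_settings() below can be used without Spark.
    # Call this once a SparkContext exists, or from inside a task.
    from pyspark import SparkFiles
    return SparkFiles.get(file_name)


def read_settings(path):
    """Parse an INI-style file into {section: {key: value}} (assumed format)."""
    parser = configparser.ConfigParser()
    with open(path) as handle:
        parser.read_file(handle)
    return {name: dict(parser[name]) for name in parser.sections()}


# Hypothetical usage, after: spark-submit --files app.conf ...
#   settings = read_settings(resolve_shipped_file("app.conf"))
```

Keeping the path lookup separate from the parsing means the parsing part
can be exercised locally without a cluster.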

Also, this is probably a better question for the user list than the dev
list.

On Sat, Aug 1, 2020, 8:34 AM rahul c wrote:

> Hi all,
>
> I am trying to pass some configuration files via the spark-submit command
> in cluster mode.
> From the logs I can see the files are transferred to each executor.
> But how do I build the absolute path of the file in the code?
>
> Can anyone please guide me on this with some references?
>
> Appreciate your help on this.
>
> Thanks and regards
> Rahul


Spark-submit --files option help

2020-08-01 Thread rahul c
Hi all,

I am trying to pass some configuration files via the spark-submit command
in cluster mode.
From the logs I can see the files are transferred to each executor.
But how do I build the absolute path of the file in the code?

Can anyone please guide me on this with some references?

Appreciate your help on this.

Thanks and regards
Rahul


Re: Contributing to JIRA Maintenance

2020-08-01 Thread Hyukjin Kwon
Thank you!

On Sat, 1 Aug 2020, 19:31 Takeshi Yamamuro wrote:

> Great work, and thanks for your JIRA maintenance and this heads-up (sorry
> for my late reply...).
> Yeah, I noticed that I haven't spent much time on the JIRA side recently,
> so I will take more care of it from now on to help the community.
>
> On Wed, Jul 29, 2020 at 10:52 AM Hyukjin Kwon wrote:
>
>> Yeah, in my experience, contributing to JIRA maintenance does not
>> require writing a lot of code.
>>
>> Just to share my own story:
>> Four years ago, when I was a contributor, I was looking for other ways
>> to contribute to Spark. I noticed Sean was making exceptional efforts in
>> JIRA maintenance - he monitored JIRAs basically 24/7. I started to make
>> sustained efforts and contributions there when he asked for some help on
>> the dev mailing list. I also did some code work, but my JIRA maintenance
>> contribution was also one of my important community activities.
>> This was appropriately considered and recognised by the PMC members.
>>
>> As for the commit bit: the ideal case is probably to have contributions
>> balanced across many aspects. But if somebody makes a lot of sustained
>> efforts and contributions to one aspect, that can also be a case we take
>> into account. Yeah, I think Shane is a good example.
>>
>>
>> On Wed, Jul 29, 2020 at 2:57 AM, Rohit Mishra wrote:
>>
>>> Thanks Sean for your elaborate and valuable explanation. I will start
>>> looking into it tomorrow and will reach out if required.
>>>
>>> Have a good day.
>>>
>>> Regards,
>>> Rohit Mishra
>>>
>>> On Tue, 28 Jul 2020 at 11:20 PM, Sean Owen wrote:
>>>
>>>> To help with JIRA, I don't think you need to know a lot about the code
>>>> structure. I think we're talking about more basic triage, like, is it
>>>> a question that should go to the mailing list instead? is there enough
>>>> detail to understand it at all? is it tagged with a few appropriate
>>>> components, does its affected version make sense?  Finding duplicate
>>>> issues is hard but quite valuable if you can identify related issues
>>>> and mark them.
>>>>
>>>> I can also tell you about using the JIRA Client to search for issues
>>>> that don't make much sense, like, open and targeting a released
>>>> version.
>>>>
>>>> Actually I think anyone can modify issues in JIRA, so you don't need
>>>> special permission. You could consult with me or Hyukjin or dev@ after
>>>> making a few changes to check if they're on the right track.
>>>>
>>>> iss...@spark.apache.org (IIRC) gets a copy of all the JIRA emails
>>>> about changes. I don't know if it's that useful to subscribe to.
>>>>
>>>> Documenting the code structure - might be kind of hard in any detail,
>>>> but if you put together a doc that is useful and doesn't require a lot
>>>> of maintenance, that gives a good overview, we could consider adding
>>>> that to the developer docs.
>>>>
>>>>
>>>>
>>>> On Tue, Jul 28, 2020 at 12:16 PM Rohit Mishra wrote:
>>>> >
>>>> > Hello All,
>>>> >
>>>> > I have recently joined the Dev mailing list to help the community.
>>>> > Since I am in my attempt to understand the code base before
>>>> > contributing, I think looking into Jira maintenance will be a good
>>>> > way to help. I will start looking into it. Do I need anyone’s
>>>> > approval?
>>>> >
>>>> > In case I need any help in the beginning can I mail here or there is
>>>> > a separate mailing id related to Jira maintenance?
>>>> >
>>>> > Just a trivial question- Do we have any document to give an overview
>>>> > of the code structure for newbie like me, I can create one if there
>>>> > isn’t any.
>>>> >
>>>> > Thanks,
>>>> > Rohit Mishra
>>>> >
>>>> > On Tue, 28 Jul 2020 at 6:46 PM, Sean Owen wrote:
>>>> >>
>>>> >> Thanks for doing this - and I will say this is a great way for anyone
>>>> >> out there to contribute directly to the project. Issue trackers need
>>>> >> maintenance too. It's not that hard to spot basic problems with JIRAs
>>>> >> and request fixes, as a way to engage the reporter usefully.
>>>> >>
>>>> >> I triage PRs but rarely look at JIRAs anymore, just because the volume
>>>> >> and noise level is larger. But it is important.
>>>> >>
>>>> >> On Mon, Jul 27, 2020 at 10:12 PM Hyukjin Kwon wrote:
>>>> >> >
>>>> >> > Hi all,
>>>> >> >
>>>> >> > I would like to ask for some help about JIRA maintenance
>>>> >> > contributions in Apache Spark.
>>>> >> > I tend to see less and less people active in JIRA maintenance
>>>> >> > contributions.
>>>> >> >
>>>> >> > I have regularly checked all JIRAs and monitored them continuously
>>>> >> > for the last 4 years.
>>>> >> > For the last week, I didn't have time to take a look, and I felt
>>>> >> > frustrated that there are many JIRAs that look clearly needing
>>>> >> > action. Here are the examples only from the last week:
>>>> >> >
>>>> >> > Exact duplication:
>>>> >> > Resolve one and link another one as a duplicate.
>>>> >> > - https://issues.apache.org/jira/browse/SPARK-32370

Re: Contributing to JIRA Maintenance

2020-08-01 Thread Takeshi Yamamuro
Great work, and thanks for your JIRA maintenance and this heads-up (sorry
for my late reply...).
Yeah, I noticed that I haven't spent much time on the JIRA side recently,
so I will take more care of it from now on to help the community.

On Wed, Jul 29, 2020 at 10:52 AM Hyukjin Kwon wrote:

> Yeah, in my experience, contributing to JIRA maintenance does not
> require writing a lot of code.
>
> Just to share my own story:
> Four years ago, when I was a contributor, I was looking for other ways
> to contribute to Spark. I noticed Sean was making exceptional efforts in
> JIRA maintenance - he monitored JIRAs basically 24/7. I started to make
> sustained efforts and contributions there when he asked for some help on
> the dev mailing list. I also did some code work, but my JIRA maintenance
> contribution was also one of my important community activities.
> This was appropriately considered and recognised by the PMC members.
>
> As for the commit bit: the ideal case is probably to have contributions
> balanced across many aspects. But if somebody makes a lot of sustained
> efforts and contributions to one aspect, that can also be a case we take
> into account. Yeah, I think Shane is a good example.
>
>
> On Wed, Jul 29, 2020 at 2:57 AM, Rohit Mishra wrote:
>
>> Thanks Sean for your elaborate and valuable explanation. I will start
>> looking into it tomorrow and will reach out if required.
>>
>> Have a good day.
>>
>> Regards,
>> Rohit Mishra
>>
>> On Tue, 28 Jul 2020 at 11:20 PM, Sean Owen wrote:
>>
>>> To help with JIRA, I don't think you need to know a lot about the code
>>> structure. I think we're talking about more basic triage, like, is it
>>> a question that should go to the mailing list instead? is there enough
>>> detail to understand it at all? is it tagged with a few appropriate
>>> components, does its affected version make sense?  Finding duplicate
>>> issues is hard but quite valuable if you can identify related issues
>>> and mark them.
>>>
>>> I can also tell you about using the JIRA Client to search for issues
>>> that don't make much sense, like, open and targeting a released
>>> version.
>>>
>>> Actually I think anyone can modify issues in JIRA, so you don't need
>>> special permission. You could consult with me or Hyukjin or dev@ after
>>> making a few changes to check if they're on the right track.
>>>
>>> iss...@spark.apache.org (IIRC) gets a copy of all the JIRA emails
>>> about changes. I don't know if it's that useful to subscribe to.
>>>
>>> Documenting the code structure - might be kind of hard in any detail,
>>> but if you put together a doc that is useful and doesn't require a lot
>>> of maintenance, that gives a good overview, we could consider adding
>>> that to the developer docs.
>>>
>>>
>>>
>>> On Tue, Jul 28, 2020 at 12:16 PM Rohit Mishra wrote:
>>> >
>>> > Hello All,
>>> >
>>> > I have recently joined the Dev mailing list to help the community.
>>> > Since I am in my attempt to understand the code base before
>>> > contributing, I think looking into Jira maintenance will be a good
>>> > way to help. I will start looking into it. Do I need anyone’s
>>> > approval?
>>> >
>>> > In case I need any help in the beginning can I mail here or there is
>>> > a separate mailing id related to Jira maintenance?
>>> >
>>> > Just a trivial question- Do we have any document to give an overview
>>> > of the code structure for newbie like me, I can create one if there
>>> > isn’t any.
>>> >
>>> > Thanks,
>>> > Rohit Mishra
>>> >
>>> > On Tue, 28 Jul 2020 at 6:46 PM, Sean Owen wrote:
>>> >>
>>> >> Thanks for doing this - and I will say this is a great way for anyone
>>> >> out there to contribute directly to the project. Issue trackers need
>>> >> maintenance too. It's not that hard to spot basic problems with JIRAs
>>> >> and request fixes, as a way to engage the reporter usefully.
>>> >>
>>> >> I triage PRs but rarely look at JIRAs anymore, just because the volume
>>> >> and noise level is larger. But it is important.
>>> >>
>>> >> On Mon, Jul 27, 2020 at 10:12 PM Hyukjin Kwon wrote:
>>> >> >
>>> >> > Hi all,
>>> >> >
>>> >> > I would like to ask for some help about JIRA maintenance
>>> >> > contributions in Apache Spark.
>>> >> > I tend to see less and less people active in JIRA maintenance
>>> >> > contributions.
>>> >> >
>>> >> > I have regularly checked all JIRAs and monitored them continuously
>>> >> > for the last 4 years.
>>> >> > For the last week, I didn't have time to take a look, and I felt
>>> >> > frustrated that there are many JIRAs that look clearly needing
>>> >> > action. Here are the examples only from the last week:
>>> >> >
>>> >> > Exact duplication:
>>> >> > Resolve one and link another one as a duplicate.
>>> >> > - https://issues.apache.org/jira/browse/SPARK-32370
>>> >> > - https://issues.apache.org/jira/browse/SPARK-32369
>>> >> >
>>> >> > Different languages:
>>> >> > Ask English translations which dev people use to communicate.
>>> >> > If the