RE: Beginner searching for guidance with Jira and issues

2018-03-21 Thread Efim Poberezkin
Thanks a lot Jose,

I’ll look into the issue you’ve recommended then. Will comment on Jira to 
indicate I’m working on it and ask further questions there if needed.

BR,
Efim

From: Joseph Torres [mailto:joseph.tor...@databricks.com]
Sent: Tuesday, March 20, 2018 8:41 PM
To: Efim Poberezkin 
Cc: dev@spark.apache.org
Subject: Re: Beginner searching for guidance with Jira and issues

Hi!

I can't speak for the other tasks, but SPARK-23444 I'd expect to be pretty 
complicated. It's not obvious what the right strategy is, and there's a bunch 
of minor stuff that needs to be cleaned up (e.g. tasks shouldn't print 
cancellation warnings when cancellation is expected).

If you're interested in working on continuous processing, 
https://issues.apache.org/jira/browse/SPARK-23503 could be a good newbie task. 
It's a pretty localized change to the EpochCoordinator class; basically, it 
needs to wait to call query.commit(n + 1) until after query.commit(n). I'm not 
sure how well I've managed to document the existing implementation, but I'd be 
happy to answer any questions about it.
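
To give a flavor of the ordering involved, here's a minimal sketch (names and
structure are illustrative only, not the actual EpochCoordinator code):

    import scala.collection.mutable

    // Illustrative sketch: buffer out-of-order epoch completions so that
    // commit(n + 1) is never issued before commit(n).
    class CommitSequencer(commit: Long => Unit, firstEpoch: Long = 0L) {
      private var nextEpoch = firstEpoch
      private val ready = mutable.Set.empty[Long]

      // Called when an epoch reports ready to commit, possibly out of order.
      def epochReady(epoch: Long): Unit = synchronized {
        ready += epoch
        // Drain in strict sequence: commit n, then n + 1, and so on.
        while (ready.contains(nextEpoch)) {
          commit(nextEpoch) // stands in for query.commit(nextEpoch)
          ready -= nextEpoch
          nextEpoch += 1
        }
      }
    }

With something like this, epochReady(1) followed by epochReady(0) still issues
commit(0) before commit(1).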

Jose

On Tue, Mar 20, 2018 at 9:01 AM, Efim Poberezkin 
<efim_poberez...@epam.com> wrote:
Good day,

I’d like to contribute to Spark development but find it difficult to get into 
the process. I’m somewhat overwhelmed by Spark’s Jira, as it’s hard for me to 
gauge the complexity of tasks and choose an appropriate one.
I’ve browsed Jira for a while and have selected a few issues I think I could 
try to solve:

https://issues.apache.org/jira/browse/SPARK-23444
https://issues.apache.org/jira/browse/SPARK-23693
https://issues.apache.org/jira/browse/SPARK-23673 - although for this one, 
according to the comment, it’s uncertain whether it’s needed at all

I also think it would be interesting to work on Continuous Processing if there 
were some newbie tasks, but I wasn’t able to find any.
If you could give me some direction on any of the issues I’ve linked, or just 
point me to some tasks suitable for a beginner, that would help me a lot. I’d 
appreciate any advice.

Best regards,
Efim



Toward an "API" for spark images used by the Kubernetes back-end

2018-03-21 Thread Erik Erlandson
During the review of the recent PR to remove use of the init_container from
kube pods as created by the Kubernetes back-end, the topic of documenting
the "API" for these container images also came up. What information does
the back-end provide to these containers? In what form? What assumptions
does the back-end make about the structure of these containers?  This
information is important in a scenario where a user wants to create custom
images, particularly if these are not based on the reference dockerfiles.

A related topic is deciding what such an API should look like.  For
example, early incarnations were based more purely on environment
variables, which could have advantages in terms of an API that is easy to
describe in a document.  If we document the current API, should we annotate
it as Experimental?  If not, does that effectively freeze the API?
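
As a sketch of why that is attractive: an env-variable contract can be written
down as a short table and checked mechanically when the container starts.
Something like the following, where the variable names are invented for
illustration and are not the back-end's actual contract:

    // Hypothetical sketch only: validate a documented env-variable contract
    // at startup. The names below are made up, not the real interface.
    object EntrypointContract {
      private val required = Seq("SPARK_ROLE", "SPARK_DRIVER_URL", "SPARK_CLASSPATH")

      def main(args: Array[String]): Unit = {
        val missing = required.filterNot(sys.env.contains)
        if (missing.nonEmpty) {
          System.err.println(s"Image contract violated; missing: ${missing.mkString(", ")}")
          sys.exit(1)
        }
        // ... hand off to the real entrypoint logic here ...
      }
    }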

We are interested in community input about possible customization use cases
and opinions on possible API designs!
Cheers,
Erik


Re: Toward an "API" for spark images used by the Kubernetes back-end

2018-03-21 Thread Holden Karau
I’m glad this discussion is happening on dev@ :)

Personally, I like customizing with shell env variables while rolling my own
image, but documenting the expectations/usage of those variables is definitely
needed before we can really call it an API.

On the related question, I suspect two of the more “common” likely
customizations are adding additional jars for bootstrapping, fetched from a
DFS, and similarly complicated Python dependencies (although given that Python
support isn’t merged yet, it’s hard to say exactly what this would look like).

I could also see some vendors wanting to add some bootstrap/setup scripts
to fetch keys or other things.

What other ways do folks foresee customizing their Spark docker containers?

On Wed, Mar 21, 2018 at 5:04 PM Erik Erlandson  wrote:

> During the review of the recent PR to remove use of the init_container
> from kube pods as created by the Kubernetes back-end, the topic of
> documenting the "API" for these container images also came up. What
> information does the back-end provide to these containers? In what form?
> What assumptions does the back-end make about the structure of these
> containers?  This information is important in a scenario where a user wants
> to create custom images, particularly if these are not based on the
> reference dockerfiles.
>
> A related topic is deciding what such an API should look like.  For
> example, early incarnations were based more purely on environment
> variables, which could have advantages in terms of an API that is easy to
> describe in a document.  If we document the current API, should we annotate
> it as Experimental?  If not, does that effectively freeze the API?
>
> We are interested in community input about possible customization use
> cases and opinions on possible API designs!
> Cheers,
> Erik
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: Toward an "API" for spark images used by the Kubernetes back-end

2018-03-21 Thread Felix Cheung
I like being able to customize the docker image itself, but I realize this 
thread is more about an “API” for the stock image.

Environment variables are nice. We probably also need a way to set custom 
Spark config (as a file?)



From: Holden Karau 
Sent: Wednesday, March 21, 2018 10:44:20 PM
To: Erik Erlandson
Cc: dev
Subject: Re: Toward an "API" for spark images used by the Kubernetes back-end

I’m glad this discussion is happening on dev@ :)

Personally, I like customizing with shell env variables while rolling my own 
image, but documenting the expectations/usage of those variables is definitely 
needed before we can really call it an API.

On the related question, I suspect two of the more “common” likely 
customizations are adding additional jars for bootstrapping, fetched from a 
DFS, and similarly complicated Python dependencies (although given that Python 
support isn’t merged yet, it’s hard to say exactly what this would look like).

I could also see some vendors wanting to add some bootstrap/setup scripts to 
fetch keys or other things.

What other ways do folks foresee customizing their Spark docker containers?

On Wed, Mar 21, 2018 at 5:04 PM Erik Erlandson 
<eerla...@redhat.com> wrote:
During the review of the recent PR to remove use of the init_container from 
kube pods as created by the Kubernetes back-end, the topic of documenting the 
"API" for these container images also came up. What information does the 
back-end provide to these containers? In what form? What assumptions does the 
back-end make about the structure of these containers?  This information is 
important in a scenario where a user wants to create custom images, 
particularly if these are not based on the reference dockerfiles.

A related topic is deciding what such an API should look like.  For example, 
early incarnations were based more purely on environment variables, which could 
have advantages in terms of an API that is easy to describe in a document.  If 
we document the current API, should we annotate it as Experimental?  If not, 
does that effectively freeze the API?

We are interested in community input about possible customization use cases and 
opinions on possible API designs!
Cheers,
Erik
--
Twitter: https://twitter.com/holdenkarau