[rtc-linux] Urgent Need – SRE/Devops Engineer – 100% Remote – Long Term Contract

Manohar Indsoft Wed, 10 Nov 2021 07:03:53 -0800

Hi Professionals,

This is Manohar from IndSoft, Inc.

Please find the requirement below.



*Must have 10+ years of experience; please do not send resumes with fewer than 10 years.*

*Any visa is fine, but no transfers.*



Position: SRE/Devops Engineer

Location: 100% Remote

Duration: Long Term Contract



Must have:

10 years experience

SRE work

Infrastructure and Application

Any Programming language

Good Admin capabilities as well





Staff Site Reliability Engineer





Job Summary:



As a member of the Site Reliability Engineering team, you will work with
other developers and DevOps practitioners to produce mission-critical
infrastructure, tools, and processes that will ensure the highest levels of
availability and reliability of all our websites. As a senior member of the
team, you will be expected to work with management, peers, and customers to
define and implement the technical vision of the team.





You are right for the job if you are comfortable with deep technical Linux,
networking topics, and distributed architectures. You will work
cross-functionally amongst a variety of teams and be a core contributor in
every significant engineering service or solution that we deliver to our
stakeholders. You will excel if you have enthusiasm for digging deep, and a
flair for sharp technical communication, prioritization, and organization.
You will work directly with our Software Engineering teams to build our
next generation “always up” cloud-based e-commerce/Retail and Enterprise
platform.





Site Reliability Engineers are hybrid systems and software engineers who
take ownership of reliability, scalability, automation, and other issues
related to uptime and availability of Walmart’s e-commerce/Retail and
Enterprise platform. Our goal is to build, scale, and guard the systems
that delight our customers. To do so, you will need strong skills in the
following areas:



Design, write and build tools to improve the reliability, latency,
availability and scalability of Walmart e-commerce/Retail and Enterprise
products.

Engender reliability and availability starting with metrics and
measurements.

Enable scaling by providing tools, developing training and/or augmenting
processes.

Build tools and automation to prevent recurrence of problems in
mission-critical products/services.

Augment existing instrumentation to build a cohesive picture of the
characteristics of our systems with special attention to points of failure.

Participate in capacity planning, demand forecasting, software performance
analysis and system tuning.

Develop a deep understanding of the numerous services and applications that
come together to deliver Walmart e-commerce/Retail and Enterprise products.

Design new monitoring tools and smart alerts that help discover
failures/issues in a timely fashion, and work with engineers to identify
root causes and fix issues.

Influence, design and create new architectures, standards, and methods for
large-scale enterprise systems.

Perform root-cause analysis of complex problems involving multiple parties,
networks, hardware, and software that relate to scaling and performance.

Participate in on-call rotation.

Secure the system from issues, be they real, perceived, or notional.

High focus on collecting and inferring metrics.

Experience with containerization and container platforms. (e.g., Docker,
Kubernetes, Docker EE, OpenShift, Mesosphere)

Experience with configuration management tools such as Ansible, SaltStack,
Chef, and Puppet.

Build and drive the automation systems that maintain system health.

Eliminate single points of failure and test disaster recovery and HA
regularly.





Additional responsibilities may include:



Drives standardization and service-focused instrumentation. Provides
subject matter expertise. Resolves break/fix scenarios, engaging broader
teams as necessary, and partners with or leads others to achieve continuous
improvement. Contributes to command-and-control activities focused on the
rapid restoration of complex outages. Participates in a 24/7 on-call
rotation. May work independently or as part of a team on more complex
projects. Provides mentoring and guidance to more junior team members.

Creates systems engineering and architectural documentation to be used by
others to build and maintain systems.

Scripting and development responsibilities: Develops software in several
modern languages. Develops large/complex database-backed systems and
understands DB schema and query performance. Utilizes professional best
practices in day-to-day work, such as revision control and unit testing.
Applies statistical data analysis techniques.

Networking responsibilities: Understands and performs packet captures with
tcpdump, snoop, and other network sniffers. Understands and applies
knowledge of most protocols (TCP/IP, HTTP, UDP, etc.).

Application technologies: Provides recommendations and advice to the team
and/or department in the areas of web services, OS, and storage, including
being an active liaison to Development, QA, and the Business.

Analyzes systems and makes recommendations to prevent potential problems.
Takes lead on issue resolution activities using knowledge of complex and
company-wide systems.

Leads end-to-end audits of monitors and alarms based on subsystem knowledge.

Utilizes time management and project management skills to lead the
resolution of issues in a timely and organized manner, effectively
communicating necessary information. May consult directly with developers
or third-party vendors; provides subject matter expertise.

Consistent exercise of independent judgment and discretion in matters of
significance.

Other duties and responsibilities as assigned.





Qualifications:



10+ years in a software development, DevOps, or SRE role.

Experience in designing, investigating, analyzing, and troubleshooting
large-scale enterprise systems.

Methodical and systematic problem-solving approach, combined with a solid
awareness of ownership, initiative, and drive.

Fluency with running services at scale; in-depth understanding of Unix
systems internals and networking.

Networking knowledge and in-depth understanding of network concepts, such
as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP
packets, DNS, OSI layers, and load balancing.

Understanding of Unix/Linux systems from kernel to shell and beyond, taking
in system libraries, file systems, and client-server protocols along the
way. Experience administering Linux systems in a production environment.

Programming experience in one or more of the following languages: Go, Java,
Python, Ruby, Shell.

Bachelor's Degree in Computer Science or a related field, or relevant work
experience.

Experience with distributed version control such as Git or similar.

Experience with IaaS and PaaS providers such as AWS, Azure, OpenStack, and GCP.

Experience with containerization and container platforms. (e.g., Docker,
Kubernetes, Docker EE, OpenShift, Mesosphere).

Experience with enterprise monitoring solutions such as AppDynamics, New
Relic, Prometheus, Graphite, Grafana, Nagios, Sensu, and Splunk.

Familiarity with continuous integration/deployment processes and tools such
as Jenkins, Maven, Nexus, etc.

-- 
You received this message because you are subscribed to "rtc-linux".
Membership options at http://groups.google.com/group/rtc-linux .
Please read http://groups.google.com/group/rtc-linux/web/checklist
before submitting a driver.
--- 
You received this message because you are subscribed to the Google Groups 
"rtc-linux" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/rtc-linux/CAPY1naHx6hc5ck-t2kTRDNJdLA8r3h1VWJZ6NVin05UrpSJZiA%40mail.gmail.com.