Re: Spark production scenario

2018-03-08 Thread yncxcw
hi, Passion

I don't know of an exact solution. But yes, the port each executor chooses to
communicate with the driver is random. I am wondering whether you could give each
node two ethernet cards, configure one card for an intranet used by Spark and the
other for the WAN, and then connect the remaining nodes over the intranet.

Also, I don't think you should use the WAN for Spark data transfer, since the
amount of data moved during a shuffle is huge. You really want a high-speed switch
for your cluster.
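
Another thought: if the concern is just the random ports, Spark's own ports can be
pinned to a known range so that only that range has to be opened on the firewall.
A rough sketch of the relevant settings (the port values are only examples, and this
covers Spark's driver/executor ports, not the port YARN picks for the application
master):

// Illustrative only: pin Spark's ports so the firewall can open a small range.
// The port numbers below are assumptions, not recommendations; the same keys can
// go into spark-defaults.conf or be passed with --conf.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.port", "40000")               // driver RPC endpoint
  .set("spark.driver.blockManager.port", "40010")  // driver-side block manager
  .set("spark.blockManager.port", "40020")         // executor block managers
  .set("spark.port.maxRetries", "32")              // tries port, port+1, ..., port+32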

Hope this answer helps!


Wei






DataSet save to parquet partition problem

2018-03-08 Thread Junfeng Chen
I am trying to save a Dataset to a Parquet file via

> df.write().partitionBy("...").parquet(path)


while this dataset contains the following struct:
time: struct
-dayOfMonth
-monthOfYear
...

Can I use a child field such as time.monthOfYear (as above) as the partition column?
If yes, how?
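
For context, the workaround I have been considering (untested) is to promote the
nested field to a top-level column first, since partitionBy seems to expect
top-level column names. The column names below follow the struct above — is this
the right approach, or is there a more direct way?

import org.apache.spark.sql.functions.col

// Promote the nested field to its own column, then partition by it.
val withMonth = df.withColumn("monthOfYear", col("time.monthOfYear"))
withMonth.write.partitionBy("monthOfYear").parquet(path)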

Thanks!

Regards,
Junfeng Chen


Spark production scenario

2018-03-08 Thread रविशंकर नायर
Hi all,

We are going to move to production with an 8-node Spark cluster, and would
appreciate some help with the question below.

We are running on the YARN cluster manager, which means YARN is installed with
SSH configured between the nodes. When we run a standalone Spark program with
spark-submit, YARN starts a resource manager, followed by an application
master per application, which is allocated randomly on an arbitrary port.
So, would we have to open all ports between the nodes in a production
implementation?

Best,
Passion


Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-08 Thread Tathagata Das
This doc is unrelated to the stream-stream join we added in Structured
Streaming. :)

That said, we added append mode first because it is easier to reason about the
semantics of append mode, especially in the context of outer joins. You
output a row only when you know it won't ever be changed. The semantics of
update mode in outer joins are trickier to reason about and to expose through
the APIs. Consider a left outer join. As soon as we get a left-side record
with a key K that does not have a match, do we output (K, leftValue, null)?
And if we do so, and then later get 2 matches from the right side, we have to
output (K, leftValue, rightValue1) and (K, leftValue, rightValue2). But
how do we convey that rightValue1 and rightValue2 together replace the
earlier null, rather than rightValue2 replacing rightValue1 replacing
null?

We will figure these out in future releases. For now, we have released
append mode, which allows quite a large range of use cases, including
multiple cascading joins.
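
For concreteness, an append-mode stream-stream left outer join in 2.3 looks roughly
like the sketch below; the topic names, column names, watermarks and the join
interval are made up for illustration:

import org.apache.spark.sql.functions.expr

// Two Kafka-backed streams with event-time watermarks (names are illustrative).
val impressions = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "impressions").load()
  .selectExpr("CAST(key AS STRING) AS adId", "timestamp AS impressionTime")
  .withWatermark("impressionTime", "2 hours")

val clicks = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "clicks").load()
  .selectExpr("CAST(key AS STRING) AS clickAdId", "timestamp AS clickTime")
  .withWatermark("clickTime", "3 hours")

// In append mode, an unmatched left row (K, leftValue, null) is emitted only once
// the watermark guarantees no matching right-side row can still arrive.
val joined = impressions.join(
  clicks,
  expr("clickAdId = adId AND " +
       "clickTime >= impressionTime AND clickTime <= impressionTime + interval 1 hour"),
  "leftOuter")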

TD



On Thu, Mar 8, 2018 at 9:18 AM, Gourav Sengupta 
wrote:

> super interesting.
>
> On Wed, Mar 7, 2018 at 11:44 AM, kant kodali  wrote:
>
>> It looks to me that the StateStore described in this doc
>> 
>>  Actually
>> has full outer join and every other join is a filter of that. Also the doc
>> talks about update mode but looks like Spark 2.3 ended up with append mode?
>> Anyways the moment it is in master I am ready to test so JIRA tickets on
>> this would help to keep track. please let me know.
>>
>> Thanks!
>>
>> On Tue, Mar 6, 2018 at 9:16 PM, kant kodali  wrote:
>>
>>> Sorry I meant Spark 2.4 in my previous email
>>>
>>> On Tue, Mar 6, 2018 at 9:15 PM, kant kodali  wrote:
>>>
 Hi TD,

 I agree I think we are better off either with a full fix or no fix. I
 am ok with the complete fix being available in master or some branch. I
 guess the solution for me is to just build from the source.

 On a similar note, I am not finding any JIRA tickets related to full
 outer joins and update mode for, say, Spark 2.3. I wonder how hard it is
 to implement both of these? It turns out that update mode and full outer
 join are very useful and required in my case, therefore I'm just asking.

 Thanks!

 On Tue, Mar 6, 2018 at 6:25 PM, Tathagata Das <
 tathagata.das1...@gmail.com> wrote:

> I thought about it.
> I am not 100% sure whether this fix should go into 2.3.1.
>
> There are two parts to this bug fix to enable self-joins.
>
> 1. Enabling deduping of leaf logical nodes by extending
> MultiInstanceRelation
>   - This is safe to be backported into the 2.3 branch as it does not
> touch production code paths.
>
> 2. Fixing attribute rewriting in MicroBatchExecution, when the
> micro-batch plan is spliced into the streaming plan.
>   - This touches core production code paths and therefore may not be
> safe to backport.
>
> Part 1 enables self-joins in all but a small fraction of self-join
> queries. That small fraction can produce incorrect results, and part 2
> avoids that.
>
> So for 2.3.1, we can enable self-joins by merging only part 1, but it
> can give wrong results in some cases. I think that is strictly worse than
> no fix.
>
> TD
>
>
>
> On Thu, Feb 22, 2018 at 2:32 PM, kant kodali 
> wrote:
>
>> Hi TD,
>>
>> I pulled your commit that is listed on this ticket
>> https://issues.apache.org/jira/browse/SPARK-23406 specifically I did
>> the following steps and self joins work after I cherry-pick your commit!
>> Good Job! I was hoping it will be part of 2.3.0 but looks like it is
>> targeted for 2.3.1 :(
>>
>> git clone https://github.com/apache/spark.git
>> cd spark
>> git fetch
>> git checkout branch-2.3
>> git cherry-pick 658d9d9d785a30857bf35d164e6cbbd9799d6959
>> export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
>> ./build/mvn -DskipTests compile
>> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr 
>> -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn
>>
>>
>> On Thu, Feb 22, 2018 at 11:25 AM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Hey,
>>>
>>> Thanks for testing out stream-stream joins and reporting this issue.
>>> I am going to take a look at this.
>>>
>>> TD
>>>
>>>
>>>
>>> On Tue, Feb 20, 2018 at 8:20 PM, kant kodali 
>>> wrote:
>>>
 if I change it to the below code it works. However, I don't believe
 it is the solution I am looking for. I want to be able to do it in raw
 SQL and moreover, If a user gives a big 

Re: handling Remote dependencies for spark-submit in spark 2.3 with kubernetes

2018-03-08 Thread Yinan Li
One thing to note is that you may need to have the S3 credentials in the
init-container unless you use a publicly accessible URL. If that is the
case, you can either create a Kubernetes secret and use the Spark config
option for mounting secrets (secrets will be mounted into the
init-container as well as into the main container), or create a custom
init-container with the credentials baked in.
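
Roughly, the config keys involved look like the sketch below. The image name,
secret name and mount path are placeholders, and in practice you would pass these
to spark-submit as --conf flags rather than set them in code:

import org.apache.spark.SparkConf

// Spark-on-Kubernetes settings (Spark 2.3); all values are placeholders.
val conf = new SparkConf()
  .set("spark.kubernetes.container.image", "myrepo/spark:2.3.0")
  // Mount the Kubernetes secret holding the S3 credentials into the driver and
  // executor pods (and, per the note above, into the init-container as well).
  .set("spark.kubernetes.driver.secrets.aws-creds", "/etc/secrets")
  .set("spark.kubernetes.executor.secrets.aws-creds", "/etc/secrets")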

Yinan

On Thu, Mar 8, 2018 at 12:05 PM, Anirudh Ramanathan <
ramanath...@google.com.invalid> wrote:

> You don't need to create the init-container. It's an implementation detail.
> If you provide a remote uri, and specify 
> spark.kubernetes.container.image=,
> Spark *internally* will add the init container to the pod spec for you.
> *If *for some reason, you want to customize the init container image, you
> can choose to do that using the specific options, but I don't think this is
> necessary in most scenarios. The init container image, driver and executor
> images can be identical by default.
>
>
> On Thu, Mar 8, 2018 at 6:52 AM purna pradeep 
> wrote:
>
>> Im trying to run spark-submit to kubernetes cluster with spark 2.3 docker
>> container image
>>
>> The challenge im facing is application have a mainapplication.jar and
>> other dependency files & jars which are located in Remote location like AWS
>> s3 ,but as per spark 2.3 documentation there is something called kubernetes
>> init-container to download remote dependencies but in this case im not
>> creating any Podspec to include init-containers in kubernetes, as per
>> documentation Spark 2.3 spark/kubernetes internally creates Pods
>> (driver,executor) So not sure how can i use init-container for spark-submit
>> when there are remote dependencies.
>>
>> https://spark.apache.org/docs/latest/running-on-kubernetes.
>> html#using-remote-dependencies
>>
>> Please suggest
>>
>
>
> --
> Anirudh Ramanathan
>


Upgrades of streaming jobs

2018-03-08 Thread Georg Heiler
Hi

What is the state of spark structured streaming jobs and upgrades?

Can checkpoints of version 1 be read by version 2 of a job? Is downtime
required to upgrade the job?

Thanks


Re: handling Remote dependencies for spark-submit in spark 2.3 with kubernetes

2018-03-08 Thread Anirudh Ramanathan
You don't need to create the init-container. It's an implementation detail.
If you provide a remote URI, and
specify spark.kubernetes.container.image=, Spark internally
will add the init container to the pod spec for you.
If, for some reason, you want to customize the init container image, you
can choose to do that using the specific options, but I don't think this is
necessary in most scenarios. The init container image, driver and executor
images can be identical by default.


On Thu, Mar 8, 2018 at 6:52 AM purna pradeep 
wrote:

> Im trying to run spark-submit to kubernetes cluster with spark 2.3 docker
> container image
>
> The challenge im facing is application have a mainapplication.jar and
> other dependency files & jars which are located in Remote location like AWS
> s3 ,but as per spark 2.3 documentation there is something called kubernetes
> init-container to download remote dependencies but in this case im not
> creating any Podspec to include init-containers in kubernetes, as per
> documentation Spark 2.3 spark/kubernetes internally creates Pods
> (driver,executor) So not sure how can i use init-container for spark-submit
> when there are remote dependencies.
>
>
> https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-remote-dependencies
>
> Please suggest
>


-- 
Anirudh Ramanathan


Incompatibility in LZ4 dependencies

2018-03-08 Thread Lalwani, Jayesh
There is an incompatibility in the LZ4 dependencies being pulled in by Spark 2.3.0.

org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 imports
org.apache.kafka:kafka-clients:0.11.0.0, which imports net.jpountz.lz4:lz4:1.3.0.
On the other hand, org.apache.spark:spark-core_2.11:2.3.0 imports org.lz4:lz4-java:1.4.0.

These jars contain the same classes. Which is the right one to use? I see that the
jars folder has lz4-java:1.4.0. Is that the right dependency?
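
My current working assumption (untested) is that org.lz4:lz4-java is the renamed
continuation of net.jpountz.lz4:lz4 and still ships the net.jpountz.lz4 classes,
so I would keep the lz4-java 1.4.0 that spark-core brings in and exclude the older
artifact from the Kafka connector, e.g. in sbt:

// build.sbt sketch: keep org.lz4:lz4-java (via spark-core) and exclude the
// older net.jpountz.lz4:lz4 that kafka-clients drags in.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.0",
  ("org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0")
    .exclude("net.jpountz.lz4", "lz4")
)

Is that the right way to resolve it?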




Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-08 Thread Gourav Sengupta
super interesting.

On Wed, Mar 7, 2018 at 11:44 AM, kant kodali  wrote:

> It looks to me that the StateStore described in this doc
> 
>  Actually
> has full outer join and every other join is a filter of that. Also the doc
> talks about update mode but looks like Spark 2.3 ended up with append mode?
> Anyways the moment it is in master I am ready to test so JIRA tickets on
> this would help to keep track. please let me know.
>
> Thanks!
>
> On Tue, Mar 6, 2018 at 9:16 PM, kant kodali  wrote:
>
>> Sorry I meant Spark 2.4 in my previous email
>>
>> On Tue, Mar 6, 2018 at 9:15 PM, kant kodali  wrote:
>>
>>> Hi TD,
>>>
>>> I agree I think we are better off either with a full fix or no fix. I am
>>> ok with the complete fix being available in master or some branch. I guess
>>> the solution for me is to just build from the source.
>>>
>>> On a similar note, I am not finding any JIRA tickets related to full
>>> outer joins and update mode for, say, Spark 2.3. I wonder how hard it is
>>> to implement both of these? It turns out that update mode and full outer
>>> join are very useful and required in my case, therefore I'm just asking.
>>>
>>> Thanks!
>>>
>>> On Tue, Mar 6, 2018 at 6:25 PM, Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
 I thought about it.
 I am not 100% sure whether this fix should go into 2.3.1.

 There are two parts to this bug fix to enable self-joins.

 1. Enabling deduping of leaf logical nodes by extending
 MultiInstanceRelation
   - This is safe to be backported into the 2.3 branch as it does not
 touch production code paths.

 2. Fixing attribute rewriting in MicroBatchExecution, when the
 micro-batch plan is spliced into the streaming plan.
   - This touches core production code paths and therefore may not be safe
 to backport.

 Part 1 enables self-joins in all but a small fraction of self-join
 queries. That small fraction can produce incorrect results, and part 2
 avoids that.

 So for 2.3.1, we can enable self-joins by merging only part 1, but it
 can give wrong results in some cases. I think that is strictly worse than
 no fix.

 TD



 On Thu, Feb 22, 2018 at 2:32 PM, kant kodali 
 wrote:

> Hi TD,
>
> I pulled your commit that is listed on this ticket
> https://issues.apache.org/jira/browse/SPARK-23406 specifically I did
> the following steps and self joins work after I cherry-pick your commit!
> Good Job! I was hoping it will be part of 2.3.0 but looks like it is
> targeted for 2.3.1 :(
>
> git clone https://github.com/apache/spark.git
> cd spark
> git fetch
> git checkout branch-2.3
> git cherry-pick 658d9d9d785a30857bf35d164e6cbbd9799d6959
> export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
> ./build/mvn -DskipTests compile
> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr 
> -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn
>
>
> On Thu, Feb 22, 2018 at 11:25 AM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> Hey,
>>
>> Thanks for testing out stream-stream joins and reporting this issue.
>> I am going to take a look at this.
>>
>> TD
>>
>>
>>
>> On Tue, Feb 20, 2018 at 8:20 PM, kant kodali 
>> wrote:
>>
>>> if I change it to the below code it works. However, I don't believe
>>> it is the solution I am looking for. I want to be able to do it in raw
>>> SQL and moreover, If a user gives a big chained raw spark SQL join 
>>> query I
>>> am not even sure how to make copies of the dataframe to achieve the
>>> self-join. Is there any other way here?
>>>
>>>
>>>
>>> import org.apache.spark.sql.streaming.Trigger
>>>
>>> val jdf = 
>>> spark.readStream.format("kafka").option("kafka.bootstrap.servers", 
>>> "localhost:9092").option("subscribe", 
>>> "join_test").option("startingOffsets", "earliest").load();
>>> val jdf1 = 
>>> spark.readStream.format("kafka").option("kafka.bootstrap.servers", 
>>> "localhost:9092").option("subscribe", 
>>> "join_test").option("startingOffsets", "earliest").load();
>>>
>>> jdf.createOrReplaceTempView("table")
>>> jdf1.createOrReplaceTempView("table")
>>>
>>> val resultdf = spark.sql("select * from table inner join table1 on 
>>> table.offset=table1.offset")
>>>
>>> resultdf.writeStream.outputMode("append").format("console").option("truncate",
>>>  false).trigger(Trigger.ProcessingTime(1000)).start()
>>>
>>>
>>> On Tue, Feb 20, 2018 at 8:16 PM, kant kodali 
>>> wrote:
>>>
 If I change it to this


Re: Spark & S3 - Introducing random values into key names

2018-03-08 Thread Subhash Sriram
Thanks, Vadim! That helps and makes sense. I don't think we have a number of 
keys so large that we have to worry about it. If we do, I think I would go with 
an approach similar to what you suggested.

Thanks again,
Subhash 

Sent from my iPhone

> On Mar 8, 2018, at 11:56 AM, Vadim Semenov  wrote:
> 
> You need to put randomness into the beginning of the key, if you put it other 
> than into the beginning, it's not guaranteed that you're going to have good 
> performance.
> 
> The way we achieved this is by writing to HDFS first, and then having a 
> custom DistCp implemented using Spark that copies parquet files using random 
> keys,
> and then saves the list of resulting keys to S3, and when we want to use 
> those parquet files, we just need to load the listing file, and then take 
> keys from it and pass them into the loader.
> 
> You only need to do this when you have way too many files, if the number of 
> keys you operate is reasonably small (let's say, in thousands), you won't get 
> any benefits.
> 
> Also the S3 buckets have internal optimizations, and overtime it adjusts to 
> the workload, i.e. some additional underlying partitions are getting added, 
> some splits happen, etc.
> If you want to have good performance from start, you would need to use 
> randomization, yes.
> Or alternatively, you can contact AWS and tell them about the naming schema 
> that you're going to have (but it must be set in stone), and then they can 
> try to pre-optimize the bucket for you.
> 
>> On Thu, Mar 8, 2018 at 11:42 AM, Subhash Sriram  
>> wrote:
>> Hey Spark user community,
>> 
>> I am writing Parquet files from Spark to S3 using S3a. I was reading this 
>> article about improving S3 bucket performance, specifically about how it can 
>> help to introduce randomness to your key names so that data is written to 
>> different partitions.
>> 
>> https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/
>> 
>> Is there a straight forward way to accomplish this randomness in Spark via 
>> the DataSet API? The only thing that I could think of would be to actually 
>> split the large set into multiple sets (based on row boundaries), and then 
>> write each one with the random key name.
>> 
>> Is there an easier way that I am missing?
>> 
>> Thanks in advance!
>> Subhash
>> 
>> 
> 


Re: Spark & S3 - Introducing random values into key names

2018-03-08 Thread Vadim Semenov
You need to put the randomness at the beginning of the key; if you put it
anywhere other than the beginning, you are not guaranteed to get good
performance.

The way we achieved this is by writing to HDFS first, and then having a
custom DistCp implemented in Spark that copies the parquet files to
random keys
and saves the list of resulting keys to S3; when we want to use
those parquet files, we just load the listing file, take the
keys from it, and pass them into the loader.

You only need to do this when you have way too many files; if the number of
keys you operate on is reasonably small (say, in the thousands), you won't
get any benefits.

Also, S3 buckets have internal optimizations, and over time a bucket adjusts to
the workload, i.e. some additional underlying partitions get added,
some splits happen, etc.
If you want good performance from the start, you do need to use
randomization, yes.
Alternatively, you can contact AWS and tell them about the naming schema
that you're going to have (but it must be set in stone), and then they can
try to pre-optimize the bucket for you.
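
As a small addendum, a minimal, untested sketch of salting the beginning of the key
when writing directly from Spark; the bucket and path names are made up, df stands
for the DataFrame being written, and as noted above this only pays off when the
number of keys is large:

import java.util.UUID

// Salt the start of the S3 key so writes spread across S3's index partitions.
val salt = UUID.randomUUID().toString.take(4)   // e.g. "3fa8"
df.write
  .mode("append")
  .parquet(s"s3a://some-bucket/$salt/events/dt=2018-03-08/")

// Readers then need a listing of the salted prefixes (for example, a small
// manifest file written alongside) to know which paths to load.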

On Thu, Mar 8, 2018 at 11:42 AM, Subhash Sriram 
wrote:

> Hey Spark user community,
>
> I am writing Parquet files from Spark to S3 using S3a. I was reading this
> article about improving S3 bucket performance, specifically about how it
> can help to introduce randomness to your key names so that data is written
> to different partitions.
>
> https://aws.amazon.com/premiumsupport/knowledge-
> center/s3-bucket-performance-improve/
>
> Is there a straight forward way to accomplish this randomness in Spark via
> the DataSet API? The only thing that I could think of would be to actually
> split the large set into multiple sets (based on row boundaries), and then
> write each one with the random key name.
>
> Is there an easier way that I am missing?
>
> Thanks in advance!
> Subhash
>
>
>


Spark & S3 - Introducing random values into key names

2018-03-08 Thread Subhash Sriram
Hey Spark user community,

I am writing Parquet files from Spark to S3 using S3a. I was reading this
article about improving S3 bucket performance, specifically about how it
can help to introduce randomness to your key names so that data is written
to different partitions.

https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/

Is there a straight forward way to accomplish this randomness in Spark via
the DataSet API? The only thing that I could think of would be to actually
split the large set into multiple sets (based on row boundaries), and then
write each one with the random key name.

Is there an easier way that I am missing?

Thanks in advance!
Subhash


Re: Issues with large schema tables

2018-03-08 Thread Gourav Sengupta
Hi Ballas,

in data science terms you have 4500 variables that are uncorrelated, or
independent of each other. In data modelling terms you have an entity
with 4500 properties. I have worked on hair-splittingly complex financial
products, and even those do not have more than 800 properties throughout
their lifecycle.

I think that the best way to approach your problem is not to treat it as a
data engineering problem, but as a data architecture problem. Please apply
dimensionality reduction, data modelling and MDM to the data before
processing it.
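
For completeness: if the immediate goal is just to get the job past the
StackOverflowError while the data model is reworked, two mitigations are commonly
tried when the error is thrown during analysis/optimization of a very wide or deep
plan. Neither is verified against your job; the names below follow the snippet
quoted further down:

import org.apache.spark.sql.functions.broadcast

// 1) Truncate the lineage so Catalyst has a much smaller plan to analyze.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")   // illustrative path
val df1Cut = dataframe1.checkpoint()
val dataframe_result = df1Cut.join(broadcast(dataframe2), Seq(listOfUniqueIds: _*))
  .repartition(100).cache()

// 2) Give the driver (and executors) a larger thread stack, e.g. by submitting with
//    --conf spark.driver.extraJavaOptions=-Xss16m
//    --conf spark.executor.extraJavaOptions=-Xss16m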


Regards,
Gourav

On Wed, Mar 7, 2018 at 6:34 PM, Ballas, Ryan W 
wrote:

> Hello All,
>
>
>
> Our team is having a lot of issues with the Spark API particularly with
> large schema tables. We currently have a program written in Scala that
> utilizes the Apache spark API to create two tables from raw files. We have
> one particularly very large raw data file that contains around ~4700
> columns and ~200,000 rows. Every week we get a new file that shows the
> updates, inserts and deletes that happened in the last week. Our program
> will create two files – a master file and a history file. The master file
> will be the most up to date version of this table while the history table
> shows all changes inserts and updates that happened to this table and
> showing what changed. For example, if we have the following schema where A
> and B are unique:
>
>
>
> Week 1:             Week 2:
>
> A  B  C             A  B  C
> 1  2  3             1  2  4
>
> Then the master table will now be
>
> A  B  C
> 1  2  4
>
> and the history table will be
>
> A  B  change_column  change_type  old_value  new_value
> 1  2  C              Update       3          4
>
>
>
> This process is working flawlessly for shorter schema tables. We have a
> table that has 300 columns but over 100,000,000 rows and this code still
> runs. The process above for the larger schema table runs for around 15
> hours, and then crashes with the following error:
>
>
>
> Exception in thread "main" java.lang.StackOverflowError
>
> at scala.collection.generic.Growable$class.loop$1(Growable.scala:52)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:57)
> at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
> at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
> at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.immutable.List.flatMap(List.scala:344)
>
> …
>
>
>
> Some other notes… This is running on a very large MAPR cluster. We have
> tried running the job with upwards of ½ a TB of RAM and this still happens.
> All of our other smaller schema tables run except for this one.
>
>
>
> Here is a code example that takes around 4 hours to run for this larger
> table, but runs in 20 seconds for other tables:
>
>
>
> var dataframe_result = dataframe1.join(broadcast(dataframe2), Seq(listOfUniqueIds:_*)).repartition(100).cache()
>
>
>
> We have tried all of the following with no success:
>
>    - Using hash broadcast joins (dataframe2 is smaller, dataframe1 is huge)
>    - Repartitioning on different numbers, as well as not repartitioning at all
>    - Caching the result of the dataframe (we originally did not do this).
>
>
>
> What is causing this error and how do we go about fixing it? This code
> just takes in 1 parameter (the table to run) so it’s the exact same code
> for every table. It runs flawlessly for every other table except for this
> one. The only thing different between this table and all the other ones is
> the number of columns. This has the most columns at 4700 where the second
> most is 800.
>
>
>
> If anyone has any ideas on how to fix this it would be greatly
> appreciated. Thank you in advance for the help!!
>
>
>
>

handling Remote dependencies for spark-submit in spark 2.3 with kubernetes

2018-03-08 Thread purna pradeep
I'm trying to run spark-submit against a Kubernetes cluster with the Spark 2.3
Docker container image.

The challenge I'm facing is that the application has a mainapplication.jar and
other dependency files & jars which are located in a remote location like AWS S3,
but as per the Spark 2.3 documentation there is something called a Kubernetes
init-container to download remote dependencies. In this case I'm not
creating any pod spec to include init-containers in Kubernetes; as per the
documentation, Spark 2.3 on Kubernetes internally creates the pods
(driver, executor). So I'm not sure how I can use the init-container for
spark-submit when there are remote dependencies.

https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-remote-dependencies

Please suggest


Re: Properly stop applications or jobs within the application

2018-03-08 Thread bsikander
I am running in Spark standalone mode. No YARN.

Anyway, yarn application -kill is a manual process; I do not want that. I
want to properly kill the driver/application programmatically.
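
For reference, what I would like to end up with is something like the following
untested sketch (assuming a DStream-based job; ssc is the StreamingContext and the
health check is just a placeholder). For Structured Streaming the equivalent would
presumably be stopping the StreamingQuery and then the SparkSession:

// Stop the streaming application programmatically when an external dependency
// (e.g. Kafka) is detected to be unhealthy. The check itself is application-specific.
def kafkaIsHealthy(): Boolean = ???   // placeholder

if (!kafkaIsHealthy()) {
  // Finish in-flight batches, then tear down the SparkContext so the driver
  // process can exit cleanly instead of lingering as a zombie.
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}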






Re: Properly stop applications or jobs within the application

2018-03-08 Thread sagar grover
I am assuming you are running in YARN cluster mode. Have you tried yarn
application -kill application_id?

With regards,
Sagar Grover
Phone - 7022175584

On Thu, Mar 8, 2018 at 4:03 PM, bsikander  wrote:

> I have scenarios for both.
> So, I want to kill both batch and streaming midway, if required.
>
> Usecase:
> Normally, if everything is okay we don't kill the application but sometimes
> while accessing external resources (like Kafka) something can go wrong. In
> that case, the application can become useless because it is not doing
> anything useful, so we want to kill it (midway). In such a case, when we
> kill it, sometimes the application becomes a zombie and doesn't get killed
> programmatically (atleast, this is what we found). A kill through Master UI
> or manual using kill -9 is required to clean up the zombies.
>
>
>
>
>


Re: Properly stop applications or jobs within the application

2018-03-08 Thread bsikander
I have scenarios for both.
So, I want to kill both batch and streaming midway, if required.

Use case:
Normally, if everything is okay we don't kill the application, but sometimes
while accessing external resources (like Kafka) something can go wrong. In
that case, the application can become useless because it is not doing
anything useful, so we want to kill it (midway). In such a case, when we
kill it, sometimes the application becomes a zombie and doesn't get killed
programmatically (at least, this is what we found). A kill through the Master UI
or a manual kill -9 is required to clean up the zombies.






Re: Properly stop applications or jobs within the application

2018-03-08 Thread sagar grover
What do you mean by stopping applications?
Do you want to kill a batch application midway, or are you running
streaming jobs that you want to kill?

With regards,
Sagar Grover

On Thu, Mar 8, 2018 at 1:45 PM, bsikander  wrote:

> Any help would be much appreciated. This seems to be a common problem.
>
>
>
>
>


Re: Properly stop applications or jobs within the application

2018-03-08 Thread bsikander
Any help would be much appreciated. This seems to be a common problem.


