Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-12 Thread Serega Sheypak
ng like this
> import org.apache.spark.TaskContext
> ds.map(r => {
>   val taskContext = TaskContext.get()
>   if (taskContext.partitionId == 1000) {
>     throw new RuntimeException
>   }
>   r
> })
>
> On Mon, Feb 11, 2019 at 8:41 AM Serega Sheypak wrote:
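
For reference, a minimal, self-contained version of the approach quoted above; the dataset, partition count and partition id are made up for illustration, and the failing task will simply be retried up to spark.task.maxFailures times before the whole job fails.

import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

object CrashOneTask {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("crash-one-task").getOrCreate()
    import spark.implicits._

    // 8 partitions so the chosen partition id definitely exists.
    val ds = spark.range(0, 1000000, 1, 8).map(_.toString)

    val result = ds.map { r =>
      val ctx = TaskContext.get()
      if (ctx.partitionId() == 3) { // illustrative partition id
        throw new RuntimeException(s"deliberate failure in partition ${ctx.partitionId()}")
      }
      r
    }

    result.count() // triggers the job and the injected task failure
    spark.stop()
  }
}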

Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-11 Thread Serega Sheypak
I need to crash a task which does repartition. Mon, 11 Feb 2019 at 10:37, Gabor Somogyi: > What blocks you from putting if conditions inside the mentioned map function? > > On Mon, Feb 11, 2019 at 10:31 AM Serega Sheypak wrote: >> Yeah, but I don't need to crash entir

Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-11 Thread Serega Sheypak
DS.map(_ / 0).writeStream.format("console").start() > > G > > > On Sun, Feb 10, 2019 at 9:36 PM Serega Sheypak > wrote: > >> Hi BR, >> thanks for your reply. I want to mimic the issue and kill tasks at a >> certain stage. Killing executor is also an

Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-10 Thread Serega Sheypak
> G > > > On Sun, Feb 10, 2019 at 4:19 PM Jörn Franke wrote: > >> yarn application -kill applicationid ? >> >> > Am 10.02.2019 um 13:30 schrieb Serega Sheypak > >: >> > >> > Hi there! >> > I have weird issue that appears only when

Spark on YARN, HowTo kill executor or individual task?

2019-02-10 Thread Serega Sheypak
Hi there! I have a weird issue that appears only when tasks fail at a specific stage. I would like to imitate the failure on my own. The plan is to run the problematic app and then kill an entire executor or some tasks when execution reaches a certain stage. Is it doable?
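
The replies above cover crashing an individual task from inside the code. For killing a whole executor from the driver, one hedged sketch is the developer API SparkContext.killExecutor; the input path and executor id below are hypothetical, and the call only works on cluster managers that support executor allocation (such as YARN).

import org.apache.spark.{SparkConf, SparkContext}

object KillOneExecutor {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kill-one-executor"))

    val data = sc.textFile("hdfs:///tmp/input").map(_.length) // hypothetical input
    data.count() // first stage runs on the full set of executors

    // Executor ids ("1", "2", ...) are visible on the Spark UI's Executors tab.
    // killExecutor is a @DeveloperApi; whether a replacement container is requested
    // depends on dynamic allocation settings.
    sc.killExecutor("1")

    data.map(_ * 2).count() // the next stage now runs without that executor
    sc.stop()
  }
}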

Spark 2.x duplicates output when task fails at "repartition" stage. Checkpointing is enabled before repartition.

2019-02-05 Thread Serega Sheypak
Hi, I have a Spark job that produces duplicates when one or more tasks from the repartition stage fail. Here is the simplified code:
sparkContext.setCheckpointDir("hdfs://path-to-checkpoint-dir")
val inputRDDs: List[RDD[String]] = List.empty // an RDD per input dir
val updatedRDDs = inputRDDs.map {
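
A sketch of the checkpoint-before-repartition setup the post describes, with hypothetical paths and placeholder update logic; it only illustrates the pattern being discussed, not a fix for the duplication itself.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CheckpointBeforeRepartition {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-before-repartition"))
    sc.setCheckpointDir("hdfs://path-to-checkpoint-dir")

    val inputDirs = List("hdfs:///tmp/in1", "hdfs:///tmp/in2") // hypothetical input dirs
    val inputRDDs: List[RDD[String]] = inputDirs.map(sc.textFile(_))

    val updatedRDDs = inputRDDs.map(_.map(_.trim)) // placeholder for the real update logic
    val merged = updatedRDDs.reduce(_ union _)

    // Checkpoint and materialize the lineage before the shuffle, so that a failed
    // repartition task is recomputed from the checkpointed data.
    merged.checkpoint()
    merged.count()

    merged.repartition(10).saveAsTextFile("hdfs:///tmp/out") // hypothetical output dir
    sc.stop()
  }
}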

Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-23 Thread Serega Sheypak
ng that issue) > > On Tue, Jan 22, 2019 at 6:09 AM Jörn Franke wrote: > >> You can try with Yarn node labels: >> >> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeLabel.html >> >> Then you can whitelist nodes. >> >> A
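
A hedged sketch of what the node-label suggestion quoted above could look like from the Spark side: the two nodeLabelExpression settings are standard Spark-on-YARN options, but the label name is hypothetical and the labels themselves must first be defined and attached to nodes in YARN.

import org.apache.spark.sql.SparkSession

object RunOnLabelledNodes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("run-on-labelled-nodes")
      // "healthy" is a hypothetical YARN node label; only nodes carrying it will
      // host the AM and the executors, which effectively whitelists them.
      .config("spark.yarn.am.nodeLabelExpression", "healthy")
      .config("spark.yarn.executor.nodeLabelExpression", "healthy")
      .getOrCreate()

    spark.range(100).count()
    spark.stop()
  }
}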

Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-21 Thread Serega Sheypak
Hi Apiros, thanks for your reply. Is it this one: https://github.com/apache/spark/pull/23223 ? Can I try to reach you through the Cloudera Support portal? Mon, 21 Jan 2019 at 20:06, attilapiros: > Hello, I was working on this area last year (I have developed the YarnAllocatorBlacklistTracker)

Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-20 Thread Serega Sheypak
pulate such a blacklist. > > If you can change yarn config, the equivalent is node label: > https://hadoop.apache.org/docs/r2.7.4/hadoop-yarn/hadoop-yarn-site/NodeLabel.html > > -- > From: Li Gao > Sent: Saturday, January 19, 2019 8:43 AM > T

Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-18 Thread Serega Sheypak
Hi, is there any possibility to tell the scheduler to blacklist specific nodes in advance?

Kill spark executor when spark runs specific stage

2018-07-04 Thread Serega Sheypak
Hi, I'm running Spark on YARN. My code is very simple. I want to kill one executor when "data.repartition(10)" is executed. How can I do it in an easy way?
val data = sc.sequenceFile[NullWritable, BytesWritable](inputPath)
  .map { case (key, value) => Data.fromBytes(value) }
process =
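
One possible way, as a sketch only (not from the thread): piggy-back on the stage that follows the repartition and hard-kill the executor JVM from inside one of its tasks. The input path, partition id and exit code are arbitrary; note that retries of the same task will keep killing executors until spark.task.maxFailures is reached.

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object KillExecutorDuringRepartition {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kill-executor-during-repartition"))

    val data = sc.sequenceFile[NullWritable, BytesWritable]("hdfs:///tmp/input") // hypothetical path
      .map { case (_, value) => value.getLength }
      .repartition(10)
      .mapPartitions { iter =>
        // This runs in the stage right after the repartition shuffle; exiting the
        // JVM here takes the whole executor down, not just this task.
        if (TaskContext.get().partitionId() == 0) {
          System.exit(137) // arbitrary non-zero exit code
        }
        iter
      }

    data.count()
    sc.stop()
  }
}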

Re: how "hour" function in Spark SQL is supposed to work?

2018-03-20 Thread Serega Sheypak
Ok, this one works: .withColumn("hour", hour(from_unixtime(typedDataset.col("ts") / 1000))) 2018-03-20 22:43 GMT+01:00 Serega Sheypak <serega.shey...@gmail.com>: > Hi, any updates? Looks like some API inconsistency or bug..? > > 2018-03-17 13:09 GM
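
A runnable version of the working expression above, with a made-up two-row dataset; ts is epoch milliseconds and from_unixtime expects seconds, hence the division by 1000. (The earlier attempt further down the thread wrapped the value in to_date, which drops the time-of-day part and would explain hour() always returning 0.)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_unixtime, hour}

object HourFromEpochMillis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hour-from-epoch-millis").getOrCreate()
    import spark.implicits._

    val df = Seq(1521551100123L, 1521504000456L).toDF("ts") // epoch milliseconds

    // Divide by 1000 because from_unixtime expects seconds, not milliseconds.
    df.withColumn("hour", hour(from_unixtime($"ts" / 1000)))
      .show(false)

    spark.stop()
  }
}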

Re: how "hour" function in Spark SQL is supposed to work?

2018-03-20 Thread Serega Sheypak
Hi, any updates? Looks like some API inconsistency or bug..? 2018-03-17 13:09 GMT+01:00 Serega Sheypak <serega.shey...@gmail.com>: > > Not sure why you are dividing by 1000. from_unixtime expects a long type > It expects seconds, I have milliseconds. > > > > 2018-03-1

Re: Run spark 2.2 on yarn as usual java application

2018-03-19 Thread Serega Sheypak
2018-03-19 13:41 GMT+01:00 Jörn Franke <jornfra...@gmail.com>: > Maybe you should rather run it in yarn cluster mode. Yarn client would start the driver on the Oozie server. > > On 19. Mar 2018, at 12:58, Serega Sheypak <serega.shey...@gmail.com> wrote: > > I'm

Re: Run spark 2.2 on yarn as usual java application

2018-03-19 Thread Serega Sheypak
> Jacek > > On 19 Mar 2018 00:20, "Serega Sheypak" <serega.shey...@gmail.com> wrote: >> Hi, is it even possible to run Spark on YARN as a usual Java application? >> I've built a jar using Maven with the spark-yarn dependency and I manually >> populate S

Run spark 2.2 on yarn as usual java application

2018-03-18 Thread Serega Sheypak
Hi, is it even possible to run Spark on YARN as a usual Java application? I've built a jar using Maven with the spark-yarn dependency, and I manually populate SparkConf with all Hadoop properties. SparkContext fails to start with this exception: Caused by: java.lang.IllegalStateException: Library
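
It is possible in principle; below is a hedged sketch of the setup, with hypothetical paths and queue name. The Hadoop/YARN configuration must be visible to the JVM (HADOOP_CONF_DIR/YARN_CONF_DIR or the config files on the classpath), and spark.yarn.jars usually has to point at the Spark jars; a missing spark.yarn.jars/spark.yarn.archive is a common cause of the "Library directory ... does not exist" IllegalStateException mentioned above.

import org.apache.spark.{SparkConf, SparkContext}

object SparkOnYarnWithoutSparkSubmit {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("spark-on-yarn-without-spark-submit")
      .setMaster("yarn")
      .set("spark.submit.deployMode", "client") // the driver is this JVM
      .set("spark.yarn.queue", "default") // hypothetical queue
      // Point the YARN client at the Spark jars so it does not look for a local
      // Spark distribution (spark.yarn.archive is the zip/tar alternative).
      .set("spark.yarn.jars", "hdfs:///apps/spark/jars/*.jar") // hypothetical location

    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())
    sc.stop()
  }
}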

Re: Append more files to existing partitioned data

2018-03-18 Thread Serega Sheypak
ion > > > I guess it's managed by > > job.getConfiguration.set(DATASOURCE_WRITEJOBUUID, uniqueWriteJobId.toString) > > > On 17 March 2018 at 20:46, Serega Sheypak <serega.shey...@gmail.com> > wrote: > >> Hi Denis, great to see you here :) >> It works, thanks! >&

Re: Append more files to existing partitioned data

2018-03-17 Thread Serega Sheypak
com>: > Hello Serega, > > https://spark.apache.org/docs/latest/sql-programming-guide.html > > Please try the SaveMode.Append option. Does it work for you? > > Sat, 17 Mar 2018, 15:19 Serega Sheypak <serega.shey...@gmail.com>: >> Hi, I'm using spark-sql t

Append more files to existing partitioned data

2018-03-17 Thread Serega Sheypak
Hi, I'm using spark-sql to process my data and store the result as parquet partitioned by several columns:
ds.write
  .partitionBy("year", "month", "day", "hour", "workflowId")
  .parquet("/here/is/my/dir")
I want to run more jobs that will produce new partitions or add more files to existing
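
The solution that worked later in the thread was SaveMode.Append; a small sketch follows (the input path is hypothetical, the output layout matches the snippet above).

import org.apache.spark.sql.{SaveMode, SparkSession}

object AppendMorePartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("append-more-partitions").getOrCreate()

    val ds = spark.read.json("hdfs:///tmp/new-batch") // hypothetical new input

    // Append writes new files (and new partition directories) under the same root
    // without touching data that is already there.
    ds.write
      .mode(SaveMode.Append)
      .partitionBy("year", "month", "day", "hour", "workflowId")
      .parquet("/here/is/my/dir")

    spark.stop()
  }
}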

Re: how "hour" function in Spark SQL is supposed to work?

2018-03-17 Thread Serega Sheypak
> Not sure why you are dividing by 1000. from_unixtime expects a long type It expects seconds, I have milliseconds. 2018-03-12 6:16 GMT+01:00 vermanurag : > Not sure why you are dividing by 1000. from_unixtime expects a long type > which is time in milliseconds

how "hour" function in Spark SQL is supposed to work?

2018-03-11 Thread Serega Sheypak
Hi, desperately trying to extract hour from unix seconds. The year, month, dayofmonth functions work as expected; the hour function always returns 0.
val ds = dataset
  .withColumn("year", year(to_date(from_unixtime(dataset.col("ts") / 1000))))
  .withColumn("month",

Implement Dataset reader from SEQ file with protobuf to Dataset

2017-10-08 Thread Serega Sheypak
Hi, did anyone try to implement Spark SQL dataset reader from SEQ file with protobuf inside to Dataset? Imagine I have protobuf def Person - name: String - lastName: String - phones: List[String] and generated scala case class: case class Person(name:String, lastName: String, phones:
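
One way to approach it, as a sketch: read the SequenceFile as (NullWritable, BytesWritable), parse each value with the protobuf-generated Java class, map it onto the Scala case class, and let the implicit product encoder build the Dataset. PersonProtos.Person and the input path are hypothetical (the class name depends on how the .proto is compiled).

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._

case class Person(name: String, lastName: String, phones: List[String])

object SeqFileProtoToDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("seqfile-proto-to-dataset").getOrCreate()
    import spark.implicits._

    val people = spark.sparkContext
      .sequenceFile[NullWritable, BytesWritable]("hdfs:///tmp/people.seq") // hypothetical path
      .map { case (_, bytes) =>
        // PersonProtos.Person is the (hypothetical) protobuf-generated Java class.
        val p = PersonProtos.Person.parseFrom(bytes.copyBytes())
        Person(p.getName, p.getLastName, p.getPhonesList.asScala.toList)
      }
      .toDS()

    people.show(false)
    spark.stop()
  }
}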

Re: Managed memory leak detected.SPARK-11293 ?

2016-05-18 Thread Serega Sheypak
nappy-vs-lzf-vs-zlib-a-comparison-of > > performance of snappy and lzf were on-par to each other. > > Maybe lzf has lower memory requirement. > > On Wed, May 18, 2016 at 7:22 AM, Serega Sheypak <serega.shey...@gmail.com> > wrote: > >> Switching from snappy to l

Re: Managed memory leak detected.SPARK-11293 ?

2016-05-18 Thread Serega Sheypak
Switching from snappy to lzf helped me: spark.io.compression.codec=lzf. Do you know why? :) I can't find an exact explanation... 2016-05-18 15:41 GMT+02:00 Ted Yu <yuzhih...@gmail.com>: > Please increase the number of partitions. > > Cheers > > On Wed, May 18, 2016 at 4:
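
For completeness, the codec can be set programmatically as well as on the command line; lz4, lzf and snappy are the usual accepted values. This sketch only shows where the setting goes, it does not explain the memory-leak warning itself.

import org.apache.spark.{SparkConf, SparkContext}

object LzfCodecJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("lzf-codec-job")
      .set("spark.io.compression.codec", "lzf") // codec used for shuffle, broadcast and RDD blocks
    val sc = new SparkContext(conf)

    sc.parallelize(1 to 1000000).map(_.toString).repartition(8).count()
    sc.stop()
  }
}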

Managed memory leak detected.SPARK-11293 ?

2016-05-18 Thread Serega Sheypak
Hi, please have a look at log snippet: 16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://mapoutputtrac...@xxx.xxx.xxx.xxx:38128) 16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Got the output locations 16/05/18 03:27:16 INFO

Re: Why can't Spark 1.6.0 use jar files stored on HDFS

2016-05-17 Thread Serega Sheypak
sname \ //location of your main classname > --master yarn \ > --deploy-mode cluster \ > /home/hadoop/SparkSampleProgram.jar //location of your jar file > > Thanks > Raj > > > > Sent from Yahoo Mail. Get the app <https://yho.com/148vdq> > > > On Tuesday,

Re: Why can't Spark 1.6.0 use jar files stored on HDFS

2016-05-17 Thread Serega Sheypak
spark-submit --conf "spark.driver.userClassPathFirst=true" --class com.MyClass --master yarn --deploy-mode client --jars hdfs:///my-lib.jar,hdfs:///my-second-lib.jar jar-with-com-MyClass.jar job_params 2016-05-17 15:41 GMT+02:00 Serega Sheypak <serega.shey...@gmail.c

Re: Why can't Spark 1.6.0 use jar files stored on HDFS

2016-05-17 Thread Serega Sheypak
https://issues.apache.org/jira/browse/SPARK-10643 Looks like it's the reason... 2016-05-17 15:31 GMT+02:00 Serega Sheypak <serega.shey...@gmail.com>: > No, and it looks like a problem. > > 2.2. --master yarn --deploy-mode client > means: > 1. submit spark as yarn

Re: Why can't Spark 1.6.0 use jar files stored on HDFS

2016-05-17 Thread Serega Sheypak
> On Tue, May 17, 2016 at 8:33 PM, Serega Sheypak <serega.shey...@gmail.com> wrote: > >> hi, I'm trying to: >> 1. upload my app jar files to HDFS >> 2. run spark-submit with: >> 2.1. --master yarn --deploy-mode cluster >> or >> 2.2. --master yarn -

Why can't Spark 1.6.0 use jar files stored on HDFS

2016-05-17 Thread Serega Sheypak
Hi, I'm trying to:
1. upload my app jar files to HDFS
2. run spark-submit with:
2.1. --master yarn --deploy-mode cluster
or
2.2. --master yarn --deploy-mode client
specifying --jars hdfs:///my/home/commons.jar,hdfs:///my/home/super.jar
When the spark job is submitted, the SparkSubmit client outputs:

Re: spark sql and cassandra. spark generate 769 tasks to read 3 lines from cassandra table

2015-06-17 Thread Serega Sheypak
and this particular table is causing issues or are you trying to figure out the right way to do a read). What version of Spark and Cassandra-connector are you using? Also, what do you get for select count(*) from foo -- is that just as bad? On Wed, Jun 17, 2015 at 4:37 AM, Serega Sheypak serega.shey

Re: spark sql and cassandra. spark generate 769 tasks to read 3 lines from cassandra table

2015-06-17 Thread Serega Sheypak
Hi, can somebody suggest a way to reduce the number of tasks? 2015-06-15 18:26 GMT+02:00 Serega Sheypak serega.shey...@gmail.com: Hi, I'm running spark sql against a Cassandra table. I have 3 C* nodes, each of them has a spark worker. The problem is that spark runs 869 tasks to read 3 lines

spark-sql estimates Cassandra table with 3 rows as 8 TB of data

2015-06-17 Thread Serega Sheypak
Hi, spark-sql estimated the input for a Cassandra table with 3 rows as 8 TB. Sometimes it's estimated as -167B. I run it on a laptop; I don't have 8 TB of space for the data.

Re: spark sql and cassandra. spark generate 769 tasks to read 3 lines from cassandra table

2015-06-17 Thread Serega Sheypak
, 2015 at 4:37 AM, Serega Sheypak serega.shey...@gmail.com wrote: Hi, can somebody suggest me the way to reduce quantity of task? 2015-06-15 18:26 GMT+02:00 Serega Sheypak serega.shey...@gmail.com: Hi, I'm running spark sql against Cassandra table. I have 3 C* nodes, Each of them has spark

spark sql and cassandra. spark generate 769 tasks to read 3 lines from cassandra table

2015-06-15 Thread Serega Sheypak
Hi, I'm running spark sql against a Cassandra table. I have 3 C* nodes, each of them has a spark worker. The problem is that spark runs 869 tasks to read 3 lines: select bar from foo. I've tried these properties: # try to avoid 769 tasks per dummy "select foo from bar" query
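
A hedged sketch of the tuning usually tried for this symptom with the DataStax spark-cassandra-connector: raising the input split size so the connector creates fewer read partitions. The property name (spark.cassandra.input.split.size_in_mb) and the DataFrame source name vary between connector versions, and the host, keyspace and table below are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object FewerCassandraTasks {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("fewer-cassandra-tasks")
      .set("spark.cassandra.connection.host", "127.0.0.1") // hypothetical contact point
      // Bigger input splits mean fewer read tasks; the connector sizes partitions
      // from its own (sometimes badly wrong) table-size estimate.
      .set("spark.cassandra.input.split.size_in_mb", "512")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "foo")) // hypothetical keyspace/table
      .load()
    println(df.count())

    sc.stop()
  }
}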

Re: Driver memory leak?

2015-04-29 Thread Serega Sheypak
The memory leak could be related to this defect, https://issues.apache.org/jira/browse/SPARK-5967, which was resolved in Spark 1.2.2 and 1.3.0. @Sean Will it be backported to CDH? I didn't find that bug in the CDH 5.4 release notes. 2015-04-29 14:51 GMT+02:00 Conor Fennell conor.fenn...@altocloud.com:

Re: history-server does't read logs which are on FS

2015-04-20 Thread Serega Sheypak
(), it doesn't know that those applications have been stopped. Note that in spark 1.3, the history server can also display running applications (including completed applications, but that it thinks are still running), which improves things a little bit. On Fri, Apr 17, 2015 at 10:13 AM, Serega

Spark 1.2, trying to run spark-history as a service, spark-defaults.conf are ignored

2015-04-14 Thread Serega Sheypak
Here is a related problem: http://apache-spark-user-list.1001560.n3.nabble.com/Launching-history-server-problem-td12574.html but no answer.
What I'm trying to do: wrap spark-history with an /etc/init.d script.
Problems I have: can't make it read spark-defaults.conf. I've put this file here: