Re: Dynamic metric names

2019-05-06 Thread Saisai Shao
I remember there was a PR doing a similar thing (https://github.com/apache/spark/pull/18406). From my understanding this is a quite specific requirement; it may require code changes to support your needs. Thanks, Saisai

[ANNOUNCE] Announcing Apache Spark 2.3.2

2018-09-26 Thread Saisai Shao
We are happy to announce the availability of Spark 2.3.2! Apache Spark 2.3.2 is a maintenance release, based on the branch-2.3 maintenance branch of Spark. We strongly recommend that all 2.3.x users upgrade to this stable release. To download Spark 2.3.2, head over to the download page: ...

Re: Spark YARN job submission error (code 13)

2018-06-08 Thread Saisai Shao
In Spark on YARN, error code 13 means the SparkContext did not initialize in time. You can check the YARN application log to get more information. BTW, did you write a plain Python script without creating a SparkContext/SparkSession?
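
A minimal sketch of an entry point that creates its own session (Scala shown; a plain PySpark script likewise needs a SparkSession.builder call at the top, otherwise the ApplicationMaster gives up waiting and the job exits with code 13):

    import org.apache.spark.sql.SparkSession

    object Main {
      def main(args: Array[String]): Unit = {
        // Without this, the YARN ApplicationMaster times out waiting for
        // the SparkContext and the application exits with code 13.
        val spark = SparkSession.builder().appName("example").getOrCreate()
        try {
          spark.range(10).count() // illustrative work
        } finally {
          spark.stop()
        }
      }
    }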

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread Saisai Shao
...if it is delayed), which will lead to unexpected results. >> You need to read the code; this is an undocumented configuration. > I'm on it right now, but Spark is a big piece of software.

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread Saisai Shao
>> spark.streaming.concurrentJobs is a driver-side internal configuration: it controls how many streaming jobs can be submitted concurrently in one batch. Usually this should not be configured by the user, unless you're familiar ...

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread Saisai Shao
spark.streaming.concurrentJobs is a driver-side internal configuration: it controls how many streaming jobs can be submitted concurrently in one batch. Usually it should not be configured by the user unless you're familiar with Spark Streaming internals and know the implications of this ...
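
For illustration only, where such a setting would live if you did choose to change it (an undocumented internal knob; a sketch, not a recommendation):

    # spark-defaults.conf (value illustrative; the default is 1)
    spark.streaming.concurrentJobs  2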

Re: [Spark Streaming]: Does DStream workload run over Spark SQL engine?

2018-05-02 Thread Saisai Shao
No. A DStream is backed by RDDs, so it will not leverage any Spark SQL-related features. I think you should use Structured Streaming instead, which is built on Spark SQL.
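
A minimal Structured Streaming sketch (source path and console sink are illustrative) whose query does run through the Spark SQL engine:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("structured").getOrCreate()
    // A streaming DataFrame: planned and optimized by the Spark SQL engine.
    val lines  = spark.readStream.text("/tmp/input")
    val counts = lines.groupBy("value").count()
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
    query.awaitTermination()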

Re: How to submit some code segment to existing SparkContext

2018-04-11 Thread Saisai Shao
Maybe you can try Livy (http://livy.incubator.apache.org/). Thanks, Jerry > Is there any way to submit a code segment to an existing SparkContext? Just like a web backend: send some user code to Spark to run, but the initial ...
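
A sketch of how Livy serves this case (assumes a Livy server on localhost:8998; session id 0 is illustrative):

    # Create a long-lived interactive session backed by one SparkContext.
    curl -s -X POST -H 'Content-Type: application/json' \
         -d '{"kind": "spark"}' http://localhost:8998/sessions
    # Submit a code snippet to that already-running context.
    curl -s -X POST -H 'Content-Type: application/json' \
         -d '{"code": "sc.parallelize(1 to 100).sum()"}' \
         http://localhost:8998/sessions/0/statements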

Re: spark application running in yarn client mode is slower than in local mode.

2018-04-09 Thread Saisai Shao
> In yarn mode, only two executors are assigned to process the task; since one executor can process only one task, they need 6 min in total. This is not true. You should set --executor-cores/--num-executors to increase the task parallelism of the executors. To be fair, the Spark application should ...
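
For illustration, a submission that raises parallelism explicitly (all numbers illustrative):

    spark-submit \
      --master yarn --deploy-mode client \
      --num-executors 4 \
      --executor-cores 3 \
      --executor-memory 4g \
      app.jar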

Re: Spark and Accumulo Delegation tokens

2018-03-23 Thread Saisai Shao
...an example app there would be great. Thanks, Jorge Machado >> I think you can build your own Accumulo credential provider, similar to HadoopDelegationTokenProvider, outside of Spark ...

Re: Spark and Accumulo Delegation tokens

2018-03-23 Thread Saisai Shao
I think you can build your own Accumulo credential provider, similar to HadoopDelegationTokenProvider, outside of Spark. Spark already provides the "ServiceCredentialProvider" interface for users to plug in a customized credential provider. Thanks, Jerry
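
A sketch of such a plug-in against the Spark 2.x YARN interface (signatures recalled from the Spark 2.x codebase; check them against your Spark version). It would be registered via a META-INF/services/org.apache.spark.deploy.yarn.security.ServiceCredentialProvider file:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.Credentials
    import org.apache.spark.SparkConf
    import org.apache.spark.deploy.yarn.security.ServiceCredentialProvider

    class AccumuloCredentialProvider extends ServiceCredentialProvider {
      override def serviceName: String = "accumulo"

      override def credentialsRequired(hadoopConf: Configuration): Boolean = true

      override def obtainCredentials(
          hadoopConf: Configuration,
          sparkConf: SparkConf,
          creds: Credentials): Option[Long] = {
        // Fetch an Accumulo delegation token and add it to `creds` here;
        // return the next renewal time in ms, or None if not renewable.
        None
      }
    }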

Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-07 Thread Saisai Shao
AFAIK, there has been no large-scale test with Hadoop 3.0 in the community, so it is not clear whether it is supported or not (or has some issues). I think on the download page, "Pre-Built for Apache Hadoop 2.7 and later" mostly means it supports Hadoop 2.7+ (2.8, ...), but not 3.0 (IIUC). Thanks, Jerry

Re: Multiple vcores per container when running Spark applications in Yarn cluster mode

2017-09-10 Thread Saisai Shao
I guess you're using the Capacity Scheduler with DefaultResourceCalculator, which doesn't count CPU cores in its resource calculation; the "1" you saw is actually meaningless. If you also want CPU to be calculated as a resource, you should choose DominantResourceCalculator. Thanks, Jerry
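
For illustration, the scheduler switch lives in capacity-scheduler.xml:

    <property>
      <name>yarn.scheduler.capacity.resource-calculator</name>
      <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
    </property>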

Re: Port to open for submitting Spark on Yarn application

2017-09-03 Thread Saisai Shao
I think spark.yarn.am.port is not used any more, so you don't need to consider it. If you're running Spark on YARN, the YARN RM port for submitting applications should also be reachable through the firewall, as well as the HDFS port for uploading resources. Also, on the Spark side, executors will be ...

Re: Livy with Spark package

2017-08-23 Thread Saisai Shao
You could set "spark.jars.packages" in the `conf` field of the session POST API (https://github.com/apache/incubator-livy/blob/master/docs/rest-api.md#post-sessions). This is equal to --packages in spark-submit. BTW, you'd better ask Livy questions on u...@livy.incubator.apache.org. Thanks, Jerry
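
A sketch of a session request carrying that conf (server address and package coordinates illustrative):

    curl -X POST -H 'Content-Type: application/json' -d '{
      "kind": "spark",
      "conf": {"spark.jars.packages": "org.apache.kafka:kafka-clients:0.10.2.1"}
    }' http://localhost:8998/sessions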

Re: Spark Web UI SSL Encryption

2017-08-21 Thread Saisai Shao
Can you please post the specific problem you met? Thanks, Jerry > I have recently installed Spark 2.2.0 and am trying to use it for some big data processing. Spark is installed on a server that I ...

Re: Kafka 0.10 with PySpark

2017-07-05 Thread Saisai Shao
Please see the reason in this thread (https://github.com/apache/spark/pull/14340). It would be better to use Structured Streaming instead. > So I would like to -1 this patch. I think it's been a mistake to support DStreams in Python; yes, it satisfies a checkbox and Spark could claim there's ...

Re: Question about standalone Spark cluster reading from Kerberosed hadoop

2017-06-23 Thread Saisai Shao
Spark running with the standalone cluster manager currently doesn't support accessing secured Hadoop. Basically, the problem is that standalone-mode Spark has no facility to distribute delegation tokens. Currently only Spark on YARN and local mode support secured Hadoop. Thanks, Jerry

Re: Kerberos impersonation of a Spark Context at runtime

2017-05-04 Thread Saisai Shao
Current Spark doesn't support impersonating different users at runtime. Spark's proxy-user support is application level: when set through --proxy-user, the whole application runs as that user.
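
For illustration, the application-level knob in question:

    # The whole application runs as "alice"; this cannot change at runtime.
    spark-submit --master yarn --proxy-user alice app.jar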

Re: Off heap memory settings and Tungsten

2017-04-24 Thread Saisai Shao
AFAIK, the off-heap memory settings are not enabled automatically; two configurations control Tungsten off-heap memory usage: 1. spark.memory.offHeap.enabled, 2. spark.memory.offHeap.size.
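
For illustration (size illustrative):

    # spark-defaults.conf: both must be set; off-heap use is disabled by default.
    spark.memory.offHeap.enabled  true
    spark.memory.offHeap.size     2g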

Re: Spark API authentication

2017-04-14 Thread Saisai Shao
> ...the filter is not supported. Is it a bug or expected behavior? >> AFAIK, for the first one a custom filter should work, but for the latter it is not supported.

Re: Spark API authentication

2017-04-14 Thread Saisai Shao
> ...requests like http://master:6066/v1/submissions/status/driver-20170414025324- return a successful result, but if I open the Spark master web UI it requests a username and password.

Re: Spark API authentication

2017-04-14 Thread Saisai Shao
Hi, what specifically are you referring to by "Spark API endpoint"? Filters only work with the Spark live and history web UIs. > I've added my own spark.ui.filters to enable basic authentication for access to ...

Re: spark kafka consumer with kerberos

2017-03-31 Thread Saisai Shao
> Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user >> Hi Bill, the exception is from the executor side. From the gist you provided ...

Re: spark kafka consumer with kerberos

2017-03-31 Thread Saisai Shao
Hi Bill, the exception is from the executor side. From the gist you provided, it looks like you only configured the Java options on the driver side; I think you should also configure them on the executor side. You could refer to here ( ...
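
For illustration, passing a JAAS config to both sides (paths illustrative):

    spark-submit \
      --files /etc/kafka/kafka_jaas.conf \
      --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/etc/kafka/kafka_jaas.conf" \
      --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafka_jaas.conf" \
      app.jar
    # On YARN, files shipped with --files land in each container's working
    # directory, hence the relative ./kafka_jaas.conf on the executor side.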

Re: spark-submit config via file

2017-03-27 Thread Saisai Shao
It's quite obvious your HDFS URL is not complete; please look at the exception: your HDFS URI has no host or port. Normally that would be OK if HDFS were your default FS. I think the problem is that you're running on HDI, where the default FS is wasb, so a short name without host:port will lead to ...

Re: question on Write Ahead Log (Spark Streaming )

2017-03-08 Thread Saisai Shao
IIUC, your scenario is quite like what ReliableKafkaReceiver currently does. You should only send an ack to the upstream source after the WAL is persisted; otherwise, because data processing and data receiving are asynchronous, there's still a chance data could be lost if you send the ack before the WAL write.

Re: How to use ManualClock with Spark streaming

2017-02-28 Thread Saisai Shao
I don't think using ManualClock is the right way to fix your problem in Spark Streaming. ManualClock in Spark is mainly used for unit tests; one has to advance the time manually to make a unit test work. That usage looks different from the scenario you mentioned. Thanks, Jerry

Re: spark.speculation setting support on standalone mode?

2017-02-27 Thread Saisai Shao
I think it should be. These configurations don't depend on which cluster manager one chooses. > Are spark.speculation and related settings supported in standalone mode?
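
For illustration, the settings in question (values illustrative):

    # spark-defaults.conf: re-launch straggler tasks, on any cluster manager.
    spark.speculation             true
    spark.speculation.multiplier  1.5
    spark.speculation.quantile    0.75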

Re: Why spark history server does not show RDD even if it is persisted?

2017-02-22 Thread Saisai Shao
> Thanks a lot for the information! Is there any reason why EventLoggingListener ignores this event? Thanks, Parag

Re: Why spark history server does not show RDD even if it is persisted?

2017-02-22 Thread Saisai Shao
AFAIK, Spark's EventLoggingListener ignores block-update events, so they are not written into the event log; I think that's why you cannot see such info in the history server.

Re: Remove dependence on HDFS

2017-02-13 Thread Saisai Shao
IIUC, Spark isn't strongly bound to HDFS; it uses a common FileSystem layer that supports different FS implementations, and HDFS is just one option. You could also use S3 as a backend FS; from Spark's point of view, the different FS implementations are transparent.

Re: Livy with Spark

2016-12-07 Thread Saisai Shao
Hi Mich, 1. Each user can create a Livy session (batch or interactive); one session is backed by one Spark application, and the resource quota is the same as for a normal Spark application (configured by spark.executor.cores/memory, etc.), which will be passed to YARN if running on YARN. This is ...

Re: spark.yarn.executor.memoryOverhead

2016-11-23 Thread Saisai Shao
From my understanding, this memory overhead should include "spark.memory.offHeap.size", which means the off-heap memory size should not be larger than the overhead memory size when running on YARN.

Re: dataframe data visualization

2016-11-20 Thread Saisai Shao
You might take a look at this project (https://github.com/vegas-viz/Vegas); it has Spark integration. Thanks, Saisai > Is there any easy way for me to do data visualization in Spark ...

Re: spark pi example fail on yarn

2016-10-20 Thread Saisai Shao
> Which log file should I check? >> Looks like the ApplicationMaster was killed by SIGTERM. 16/10/20 18: ...

Re: spark pi example fail on yarn

2016-10-20 Thread Saisai Shao
Looks like the ApplicationMaster was killed by SIGTERM: 16/10/20 18:12:04 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM 16/10/20 18:12:04 INFO yarn.ApplicationMaster: Final app status: ... This container may have been killed by the YARN NodeManager or another process; you'd better check the YARN log to dig out ...

Re: NoClassDefFoundError: org/apache/spark/Logging in SparkSession.getOrCreate

2016-10-17 Thread Saisai Shao
Not sure why your code searches for the Logging class under org/apache/spark; it should be "org/apache/spark/internal/Logging", and it was moved there quite a while ago.

Re: spark with kerberos

2016-10-13 Thread Saisai Shao
I think security has nothing to do with which API you use, Spark SQL or the RDD API. Assuming you're running on a YARN cluster (the only cluster manager that currently supports Kerberos): first you need to get a Kerberos TGT in your local spark-submit process; after being authenticated by Kerberos, ...

Re: Spark metrics when running with YARN?

2016-09-17 Thread Saisai Shao
> ...standalone? Why are there two ways to get information, the REST API and this Sink? Best regards, Vladimir.

Re: Spark metrics when running with YARN?

2016-09-12 Thread Saisai Shao
Here is the YARN RM REST API for reference (http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html). You can use these APIs to query applications running on YARN.

Re: Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread Saisai Shao

Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread Saisai Shao
> ...data locality, and there is also the additional overhead of network I/O and replication of HDFS files. >> Spark shuffle uses the Java File API to create local dirs and read/write data, so it can only work with OS-supported ...

Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread Saisai Shao
Spark shuffle uses the Java File API to create local dirs and read/write data, so it can only work with OS-supported filesystems. It doesn't leverage the Hadoop FileSystem API, so writing to a Hadoop-compatible FS does not work. Also, it is not suitable to write temporary shuffle data into a distributed FS; this ...

Re: dynamic allocation in Spark 2.0

2016-08-24 Thread Saisai Shao
This looks like the Spark application has run into an abnormal state. From the stack trace, the driver could not send requests to the AM; can you please check whether the AM is reachable, and whether there are any other exceptions besides this one? From my past tests, Spark's dynamic allocation may run into some corner ...

Re: Apache Spark toDebugString producing different output for python and scala repl

2016-08-15 Thread Saisai Shao
The implementations behind the Python and Scala RDD APIs differ slightly, so the difference in the printed RDD lineage is expected.

Re: submitting spark job with kerberized Hadoop issue

2016-08-07 Thread Saisai Shao
1. Standalone mode doesn't support accessing Kerberized Hadoop, simply because it lacks a mechanism to distribute delegation tokens via the cluster manager. 2. For the HBase token-fetching failure, I think you have to run kinit to generate a TGT before starting the Spark application ( ...

Re: spark 2.0.0 - how to build an uber-jar?

2016-08-03 Thread Saisai Shao
I guess you're asking about the Spark assembly uber jar. In Spark 2.0 there's no uber jar; instead there's a jars folder containing all the jars required at runtime. For the end user it is transparent; the way to submit a Spark application is still the same.

Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Saisai Shao
Using the dominant resource calculator instead of the default resource calculator will give the number of vcores you expect. Basically, by default YARN does not honor CPU cores as a resource, so you will always see vcores = 1 no matter how many cores you set in Spark.

Re: Getting error, when I do df.show()

2016-08-01 Thread Saisai Shao
> java.lang.NoClassDefFoundError: spray/json/JsonReader at com.memsql.spark.pushdown.MemSQLPhysicalRDD$.fromAbstractQueryTree(MemSQLPhysicalRDD.scala:95) at com.memsql.spark.pushdown.MemSQLPushdownStrategy.apply(MemSQLPushdownStrategy.scala:49) ...

Re: yarn.exceptions.ApplicationAttemptNotFoundException when trying to shut down spark applicaiton via yarn applicaiton --kill

2016-07-26 Thread Saisai Shao
Some useful information can be found here (https://issues.apache.org/jira/browse/YARN-1842), though personally I haven't met this problem before. Thanks, Saisai

Re: How to submit app in cluster mode? port 7077 or 6066

2016-07-21 Thread Saisai Shao
I think both 6066 and 7077 work: 6066 uses the REST way to submit an application, while 7077 is the legacy way. From the user's perspective it should be transparent, with no need to worry about the difference. - URL: spark://hw12100.local:7077 - REST URL: spark://hw12100.local:6066
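
For illustration, either endpoint accepts a cluster-mode submission:

    spark-submit --master spark://hw12100.local:7077 --deploy-mode cluster app.jar  # legacy
    spark-submit --master spark://hw12100.local:6066 --deploy-mode cluster app.jar  # REST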

Re: scala.MatchError on stand-alone cluster mode

2016-07-15 Thread Saisai Shao
The error stack is thrown from your code: Caused by: scala.MatchError: [Ljava.lang.String;@68d279ec (of class [Ljava.lang.String;) at com.jd.deeplog.LogAggregator$.main(LogAggregator.scala:29) at com.jd.deeplog.LogAggregator.main(LogAggregator.scala). I think you should debug ...
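
For illustration (not the poster's code), a MatchError of this shape usually means an Array[String] was matched against non-array patterns; matching on the array's contents avoids it:

    val fields: Array[String] = "a,b".split(",")
    fields match {
      case Array(first, second) => println(s"$first / $second")
      case _                    => println("unexpected shape") // avoids scala.MatchError
    }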

Re: It seemed JavaDStream.print() did not work when launching via yarn on a single node

2016-07-06 Thread Saisai Shao
DStream.print() collects some of the data to the driver and displays it; please see the implementation of DStream.print(). RDD.take() also collects some of the data to the driver. Normally the behavior should be consistent between cluster and local mode; please dig out the root cause of this problem, ...

Re: spark local dir to HDFS ?

2016-07-05 Thread Saisai Shao
It does not work to configure local dirs on HDFS. Local dirs are mainly used for data spill and shuffle data persistence, and HDFS is not suitable for that. If you hit a capacity problem, you could configure multiple dirs located on different mounted disks.
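
For illustration (paths illustrative):

    # spark-defaults.conf: comma-separated list, ideally one dir per physical disk.
    spark.local.dir  /mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark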

Re: deploy-mode flag in spark-sql cli

2016-06-29 Thread Saisai Shao
I think you cannot use the SQL client in cluster mode; the same holds for spark-shell/pyspark, which have a REPL. All such applications can only be started with the client deploy mode.

Re: problem running spark with yarn-client not using spark-submit

2016-06-26 Thread Saisai Shao
It means several jars are missing from the YARN container environment. If you want to submit your application some way other than spark-submit, you have to take care of all the environment details yourself. Since we don't know your implementation of the Java web service, it is hard to provide ...

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread Saisai Shao
spark.yarn.jar (none) The location of the Spark jar file, in case overriding the default location is desired. By default, Spark on YARN will use a Spark jar installed locally, but the Spark jar can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it
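
For illustration (path and version illustrative):

    # spark-defaults.conf: point at a world-readable copy of the assembly on HDFS
    # so YARN can cache it on the nodes instead of uploading it on every submit.
    spark.yarn.jar  hdfs:///apps/spark/spark-assembly-1.6.2-hadoop2.6.0.jar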

YARN Application Timeline service with Spark 2.0.0 issue

2016-06-17 Thread Saisai Shao
Hi community, in Spark 2.0.0 we upgraded to jersey 2 (https://issues.apache.org/jira/browse/SPARK-12154) instead of jersey 1.9, while the whole of Hadoop still sticks to the old version. This will bring in some issues when the YARN timeline service is enabled ( ...

Re: Map tuple to case class in Dataset

2016-05-31 Thread Saisai Shao
It works fine in my local test; I'm using the latest master, so maybe this bug is already fixed. > Version of Spark? What is the exception?

Re: duplicate jar problem in yarn-cluster mode

2016-05-17 Thread Saisai Shao
I think it is already fixed, if your problem is exactly the same as the one mentioned in this JIRA (https://issues.apache.org/jira/browse/SPARK-14423). Thanks, Jerry

Re: Re: How to change output mode to Update

2016-05-17 Thread Saisai Shao
> .mode(SaveMode.Overwrite) From my understanding, mode is not supported in a continuous query: def mode(saveMode: SaveMode): DataFrameWriter = { // mode() is used for non-continuous queries // outputMode() is used for continuous queries assertNotStreaming("mode() can only be called on ...

Re: How to use Kafka as data source for Structured Streaming

2016-05-17 Thread Saisai Shao
It is not supported now; currently only the file stream source is supported. Thanks, Jerry > I am wondering whether Structured Streaming supports Kafka as a data source. I briefly read the source code (mainly related to DataSourceRegister ...

Re: Re: spark uploading resource error

2016-05-10 Thread Saisai Shao
> It was a product sold by Huawei, named FusionInsight; it says Spark 1.3 with Hadoop 2.7.1. Where can I find the code or config file which defines the files to be uploaded?

Re: spark uploading resource error

2016-05-10 Thread Saisai Shao
What version of Spark are you using? From my understanding, there's no code in yarn#client that uploads "__hadoop_conf__" into the distributed cache. > I found a problem using Spark. When I use spark-submit to ...

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Saisai Shao
...the same.) Ideally it will be distributed evenly across the executors; this is also a target for tuning. Normally it depends on several conditions, like receiver distribution and partition distribution. > The issue arises if the amount of streaming data does not fit into these 4 caches ...

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Saisai Shao
> Hi, so if I have 10 GB of streaming data coming in, does it require 10 GB of memory on each node? Also, in that case, why do we need dstream.cache()? Thanks

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Saisai Shao
> ...batch calculation? >> For window-related operators, Spark Streaming will cache the data in memory within the window; in your case the window size is up to 24 hours, which ...

Re: How big the spark stream window could be ?

2016-05-09 Thread Saisai Shao
> For window-related operators, Spark Streaming will cache the data ...

Re: How big the spark stream window could be ?

2016-05-09 Thread Saisai Shao
For window-related operators, Spark Streaming will cache the data in memory within the window; in your case the window size is up to 24 hours, which means data has to sit in executor memory for more than one day. This may introduce several problems when memory is not enough.
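
For illustration (an assumed DStream named lines; durations illustrative):

    import org.apache.spark.streaming.Minutes

    // A 24-hour window sliding every 10 minutes: the executors must keep a
    // full day of received data cached to serve each window computation.
    val windowed = lines.window(Minutes(24 * 60), Minutes(10))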

Re: Fw: Significant performance difference for same spark job in scala vs pyspark

2016-05-05 Thread Saisai Shao
Writing RDD-based applications with PySpark brings in additional overheads: Spark runs on the JVM whereas your Python code runs in the Python runtime, so data has to be communicated between the JVM world and the Python world, which requires additional serialization/deserialization and IPC. Also ...

Re: kafka direct streaming python API fromOffsets

2016-05-03 Thread Saisai Shao
I guess the problem is that py4j automatically translates a Python int into a Java int or long according to the value: if the value is small it translates it into a Java int, otherwise into a Java long. But in the Java code the parameter must be of long type, so that's the ...

Re: Detecting application restart when running in supervised cluster mode

2016-04-05 Thread Saisai Shao
Hi Deepak, I don't think supervise works with YARN; it is a standalone- and Mesos-specific feature. Thanks, Saisai

Re: --packages configuration equivalent item name?

2016-04-04 Thread Saisai Shao
spark.jars.ivy, spark.jars.packages, and spark.jars.excludes are the configurations you can use. Thanks, Saisai
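
For illustration (coordinates and path illustrative):

    # spark-defaults.conf equivalents of --packages / --exclude-packages;
    # spark.jars.ivy sets the local ivy cache path.
    spark.jars.packages  com.databricks:spark-csv_2.11:1.5.0
    spark.jars.excludes  commons-logging:commons-logging
    spark.jars.ivy       /tmp/.ivy2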

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Saisai Shao
> ...eliminate this. >> Hi Michael, shuffle data (mapper output) has to be materialized to disk in the end, no matter how much memory you have; that is a design decision of Spark. In your scenario ...

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Saisai Shao
Hi Michael, shuffle data (mapper output) has to be materialized to disk in the end, no matter how much memory you have; it is a design decision of Spark. In your scenario, since you have a lot of memory, shuffle spill should not happen frequently; most of the disk IO you see is probably the final shuffle ...

Re: Spark Metrics : Why is the Sink class declared private[spark] ?

2016-04-01 Thread Saisai Shao
There's a JIRA (https://issues.apache.org/jira/browse/SPARK-14151) about it; please take a look. Thanks, Saisai > I looked into the Spark code at how Spark reports metrics using the MetricsSystem class. I've seen ...

Re: Re: Is there a way to submit spark job to your by YARN REST API?

2016-03-22 Thread Saisai Shao

Re: Is there a way to submit spark job to your by YARN REST API?

2016-03-22 Thread Saisai Shao
I'm afraid submitting applications through the YARN REST API is currently not supported by Spark. However, YARN's AMRMClient is functionally equivalent to the REST API; I'm not sure which specific features you are referring to. Thanks, Saisai

Re: Issues facing while Running Spark Streaming Job in YARN cluster mode

2016-03-22 Thread Saisai Shao
I guess in local mode you're using the local FS instead of HDFS; here the exception is mainly thrown from HDFS when running on YARN. I think it would be better to check the status and configuration of HDFS to see whether it is normal. Thanks, Saisai

Re: Enabling spark_shuffle service without restarting YARN Node Manager

2016-03-16 Thread Saisai Shao
If you want to avoid failures of existing jobs while restarting the NM, you could enable work-preserving restart for the NM; in that case restarting the NM will not affect running containers (they keep running). That alleviates the NM-restart problem. Thanks, Saisai
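
For illustration, the yarn-site.xml switches for NM work-preserving restart (recovery dir illustrative):

    <property>
      <name>yarn.nodemanager.recovery.enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>yarn.nodemanager.recovery.dir</name>
      <value>/var/lib/hadoop-yarn/nm-recovery</value>
    </property>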

Re: Job failed while submitting python to yarn programatically

2016-03-15 Thread Saisai Shao
You cannot directly invoke a Spark application by using yarn#client as you described; that is deprecated and not supported. You have to use spark-submit to submit a Spark application to YARN. Also, the specific problem here is that you're invoking yarn#client to run the Spark app in yarn-client ...

Re: Spark streaming - update configuration while retaining write ahead log data?

2016-03-15 Thread Saisai Shao
Currently the configuration is part of the checkpoint data, and when recovering from a failure, Spark Streaming fetches the configuration from the checkpoint; so even if you change the configuration file, the recovered Spark Streaming application will not use it. So from my understanding, currently there's ...

Re: Dynamic allocation doesn't work on YARN

2016-03-09 Thread Saisai Shao
> ...cachedExecutorIdleTimeout=60s; "--conf" was lost when I copied it into the mail.

Re: Dynamic allocation doesn't work on YARN

2016-03-09 Thread Saisai Shao
Would you please send out your dynamic allocation configuration so we can understand the problem better? > I'm trying dynamic allocation in Spark on YARN. I have followed the configuration steps and started the ...
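
For illustration, a typical dynamic-allocation setup on YARN (values illustrative; requires the external shuffle service on each NodeManager):

    spark.dynamicAllocation.enabled              true
    spark.shuffle.service.enabled                true
    spark.dynamicAllocation.minExecutors         1
    spark.dynamicAllocation.maxExecutors         20
    spark.dynamicAllocation.executorIdleTimeout  60s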

Re: How to compile Spark with private build of Hadoop

2016-03-08 Thread Saisai Shao
I think the first step is to publish your in-house Hadoop jars to your local Maven or Ivy repo, and then change the Spark build profiles, e.g. -Phadoop-2.x (you could use 2.7, or you may have to change the pom file if you hit jar conflicts) plus -Dhadoop.version=3.0.0-SNAPSHOT, to build ...
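
For illustration (profile and version illustrative):

    # Build against an in-house Hadoop snapshot already published to the local repo.
    ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=3.0.0-SNAPSHOT -DskipTests clean package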

Re: Spark executor killed without apparent reason

2016-03-03 Thread Saisai Shao
If it is due to a heartbeat problem and the driver explicitly killed the executors, there should be driver logs mentioning it, so you could check the driver log. Container (executor) logs are also useful: if the container was killed, there will be some signal-related logs, like ...

Re: Spark streaming: StorageLevel.MEMORY_AND_DISK_SER setting for KafkaUtils.createDirectStream

2016-03-02 Thread Saisai Shao
You don't have to specify the storage level for the direct Kafka API, since it doesn't need to store the input data ahead of time. Only the receiver-based approach can specify a storage level. Thanks, Saisai

Re: Kafka streaming receiver approach - new topic not read from beginning

2016-02-22 Thread Saisai Shao
You could set the configuration "auto.offset.reset" through the "kafkaParams" parameter, which is accepted by some other overloaded createStream APIs. By default Kafka picks data from the latest offset unless you explicitly set this; that is the behavior of Kafka, not Spark. Thanks, Saisai
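
A sketch of the receiver-based overload that takes kafkaParams (Spark 1.x Kafka 0.8 integration; an assumed StreamingContext ssc, with ZK address and topic name illustrative):

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "zookeeper.connect" -> "zk1:2181",
      "group.id"          -> "my-group",
      // 0.8 consumer: "smallest" = from the beginning, applied when the
      // consumer group has no committed offsets for the topic yet.
      "auto.offset.reset" -> "smallest"
    )
    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("new-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)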

Re: Yarn client mode: Setting environment variables

2016-02-17 Thread Saisai Shao
IIUC, if for example you want to set the environment variable FOO=bar on the executor side, you could put "spark.executorEnv.FOO=bar" in the conf file; the AM will pick this configuration up and set the environment variable when launching containers. Just list all the envs you want to set on the executor side like ...

Re: IllegalStateException : When use --executor-cores option in YARN

2016-02-14 Thread Saisai Shao
Hi Divya, would you please provide the full exception stack? From my understanding --executor-cores should work; we could tell more from the full stack trace. Performance depends on many different aspects; I'd recommend checking the Spark web UI to understand the application's ...

Re: Programmatically launching spark on yarn-client mode no longer works in spark 1.5.2

2016-01-28 Thread Saisai Shao
> ...saying creating a SparkContext manually in your application still works, then I'll investigate more on my side. Before I dig more, I wanted to know whether it was still supported. Nir

Re: Programmatically launching spark on yarn-client mode no longer works in spark 1.5.2

2016-01-28 Thread Saisai Shao
I think I have met this problem before; it might be due to some race condition in the exit period. The way you mentioned is still valid; this problem only occurs when stopping the application. Thanks, Saisai

Re: How data locality is honored when spark is running on yarn

2016-01-27 Thread Saisai Shao
Hi Todd, there are two levels of locality-based scheduling when you run Spark on YARN with dynamic allocation enabled: 1. Container allocation is based on the locality ratio of pending tasks; this is YARN-specific and only works with dynamic allocation enabled. 2. Task scheduling is locality- ...

Re: streaming textFileStream problem - got only ONE line

2016-01-26 Thread Saisai Shao
Is there any possibility that this file was still being written by another application, so that what Spark Streaming processed was an incomplete file? > Did you move the file into "hdfs://helmhdfs/user/patcharee/cerdata/", or write ...

Re: OOM on yarn-cluster mode

2016-01-19 Thread Saisai Shao
You could try increasing the driver memory with "--driver-memory"; the OOM looks like it came from the driver side, so the simple solution is to give the driver more memory. > I'm having trouble when uploading Spark ...

Re: Problem About Worker System.out

2015-12-28 Thread Saisai Shao
Stdout will not be sent back to the driver, no matter whether you use Scala or Java. You must be doing something wrong that makes you think this is expected behavior.

Re: Opening Dynamic Scaling Executors on Yarn

2015-12-27 Thread Saisai Shao
> ...Spark 1.6.0 on one YARN cluster?

Re: Opening Dynamic Scaling Executors on Yarn

2015-12-27 Thread Saisai Shao
Replacing all the shuffle jars and restarting the NodeManagers is enough; there's no need to restart the NameNode. > See http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation

Re: Job Error:Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@130.1.10.108:23600/)

2015-12-25 Thread Saisai Shao
I think SparkContext is thread-safe; you can concurrently submit jobs from different threads, so the problem you hit might not be related to that. Can you reproduce this issue every time you concurrently submit jobs, or does it happen occasionally? BTW, I guess you're using an old version of ...
