singular value decomposition in Spark ML

2016-08-04 Thread Sandy Ryza
Hi, Is SVD or PCA in Spark ML (i.e. spark.ml parity with the mllib RowMatrix.computeSVD API) slated for any upcoming release? Many thanks for any guidance! -Sandy

Re: Content based window operation on Time-series data

2015-12-17 Thread Sandy Ryza
Hi Arun, A Java API was actually recently added to the library. It will be available in the next release. -Sandy On Thu, Dec 10, 2015 at 12:16 AM, Arun Verma wrote: > Thank you for your reply. It is a Scala and Python library. Does a similar > library exist for Java? >

Re: PySpark Lost Executors

2015-11-19 Thread Sandy Ryza
Hi Ross, This is most likely occurring because YARN is killing containers for exceeding physical memory limits. You can make this less likely to happen by bumping spark.yarn.executor.memoryOverhead to something higher than 10% of your spark.executor.memory. -Sandy On Thu, Nov 19, 2015 at 8:14
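
The suggested change can be sketched as a spark-submit invocation; the 4g heap and 1024 MB overhead values are illustrative, and the application script name is a placeholder:

```shell
# Bump the off-heap overhead that YARN accounts for per executor,
# from the ~10% default to an explicit larger value, so the container
# is less likely to exceed its physical memory limit and be killed.
spark-submit \
  --master yarn \
  --conf spark.executor.memory=4g \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  my_app.py
```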

Re: SequenceFile and object reuse

2015-11-18 Thread Sandy Ryza
Hi Jeff, Many access patterns simply take the result of hadoopFile and use it to create some other object, and thus have no need for each input record to refer to a different object. In those cases, the current API is more performant than an alternative that would create an object for each

Re: Is the resources specified in configuration shared by all jobs?

2015-11-04 Thread Sandy Ryza
Hi Nisrina, The resources you specify are shared by all jobs that run inside the application. -Sandy On Wed, Nov 4, 2015 at 9:24 AM, Nisrina Luthfiyati < nisrina.luthfiy...@gmail.com> wrote: > Hi all, > > I'm running some spark jobs in java on top of YARN by submitting one > application jar

Re: Spark tuning increase number of active tasks

2015-10-31 Thread Sandy Ryza
Hi Xiaochuan, The most likely cause of the "Lost container" issue is that YARN is killing containers for exceeding memory limits. If this is the case, you should be able to find instances of "exceeding memory limits" in the application logs.

Re: Spark 1.5 on CDH 5.4.0

2015-10-22 Thread Sandy Ryza
Hi Deenar, The version of Spark you have may not be compiled with YARN support. If you inspect the contents of the assembly jar, does org.apache.spark.deploy.yarn.ExecutorLauncher exist? If not, you'll need to find a version that does have the YARN classes. You can also build your own using
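
A sketch of the check and the rebuild; the assembly jar path is a placeholder, and the Hadoop profile/version values are examples to adjust for your cluster:

```shell
# Check whether the Spark assembly was compiled with YARN support.
jar tf spark-assembly.jar | grep org/apache/spark/deploy/yarn/ExecutorLauncher

# If the class is missing, build an assembly that includes the YARN
# classes (Spark 1.x style build; pick the profile matching your Hadoop).
mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package
```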

Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

2015-09-23 Thread Sandy Ryza
Hi Anfernee, That's correct that each InputSplit will map to exactly one Spark partition. On YARN, each Spark executor maps to a single YARN container. Each executor can run multiple tasks over its lifetime, both in parallel and sequentially. If you enable dynamic allocation, after the stage

Re: Spark on Yarn vs Standalone

2015-09-21 Thread Sandy Ryza
le:/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_56/__app__.jar >> 1> >> /var/log/hadoop-yarn/containers/application_1442869100946_0001/container_1442869100946_0001_01_56/stdout >> 2> >> /var/lo

Re: Spark on Yarn vs Standalone

2015-09-10 Thread Sandy Ryza
can be killed by YARN (because executor might be unresponsive because > of GC or it might occupy more memory than Yarn allows) > > > > On Tue, Sep 8, 2015 at 3:02 PM, Sandy Ryza <sandy.r...@cloudera.com> > wrote: > >> Those settings seem reasonable to me. >> >>

Driver OOM after upgrading to 1.5

2015-09-09 Thread Sandy Ryza
I just upgraded the spark-timeseries project to run on top of 1.5, and I'm noticing that tests are failing with OOMEs. I ran a jmap -histo on the process and discovered the top heap items to be: 1:163428 22236064 2:

Re: Driver OOM after upgrading to 1.5

2015-09-09 Thread Sandy Ryza
Java 7. FWIW I was just able to get it to work by increasing MaxPermSize to 256m. -Sandy On Wed, Sep 9, 2015 at 11:37 AM, Reynold Xin <r...@databricks.com> wrote: > Java 7 / 8? > > On Wed, Sep 9, 2015 at 10:10 AM, Sandy Ryza <sandy.r...@cloudera.com> > wrote: > &

Re: Spark on Yarn vs Standalone

2015-09-08 Thread Sandy Ryza
m = slave_count * 16 > > Does it look good for you? (we run single heavy job on cluster) > > Alex > > On Mon, Sep 7, 2015 at 11:03 AM, Sandy Ryza <sandy.r...@cloudera.com> > wrote: > >> Hi Alex, >> >> If they're both configured correctly, there'

Re: Spark on Yarn vs Standalone

2015-09-07 Thread Sandy Ryza
Hi Alex, If they're both configured correctly, there's no reason that Spark Standalone should provide performance or memory improvement over Spark on YARN. -Sandy On Fri, Sep 4, 2015 at 1:24 PM, Alexander Pivovarov wrote: > Hi Everyone > > We are trying the latest aws

Re: Spark Effects of Driver Memory, Executor Memory, Driver Memory Overhead and Executor Memory Overhead on success of job runs

2015-08-31 Thread Sandy Ryza
Hi Timothy, For your first question, you would need to look in the logs and provide additional information about why your job is failing. The SparkContext shutting down could happen for a variety of reasons. In the situation where you give more memory, but less memory overhead, and the job

Re: How to avoid executor time out on yarn spark while dealing with large shuffle skewed data?

2015-08-20 Thread Sandy Ryza
What version of Spark are you using? Have you set any shuffle configs? On Wed, Aug 19, 2015 at 11:46 AM, unk1102 umesh.ka...@gmail.com wrote: I have one Spark job which seems to run fine but after one hour or so executor start getting lost because of time out something like the following

Re: How to avoid executor time out on yarn spark while dealing with large shuffle skewed data?

2015-08-20 Thread Sandy Ryza
lost things are messing. On Aug 20, 2015 7:59 PM, Sandy Ryza sandy.r...@cloudera.com wrote: What sounds most likely is that you're hitting heavy garbage collection. Did you hit issues when the shuffle memory fraction was at its default of 0.2? A potential danger with setting the shuffle

Re: How to avoid executor time out on yarn spark while dealing with large shuffle skewed data?

2015-08-20 Thread Sandy Ryza
GC error there. Please guide. Thanks much. On Thu, Aug 20, 2015 at 8:14 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Moving this back onto user@ Regarding GC, can you look in the web UI and see whether the GC time metric dominates the amount of time spent on each task (or at least

Re: Executors on multiple nodes

2015-08-16 Thread Sandy Ryza
Hi Mohit, It depends on whether dynamic allocation is turned on. If not, the number of executors is specified by the user with the --num-executors option. If dynamic allocation is turned on, refer to the doc for details:
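
The two modes can be sketched as spark-submit invocations; the executor count and jar name are placeholders:

```shell
# Static sizing: the user fixes the executor count up front.
spark-submit --master yarn --num-executors 10 my_app.jar

# Dynamic allocation: the executor count scales with load; this
# requires the external shuffle service on the NodeManagers.
spark-submit --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  my_app.jar
```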

Re: Boosting spark.yarn.executor.memoryOverhead

2015-08-11 Thread Sandy Ryza
Hi Eric, This is likely because you are putting the parameter after the primary resource (latest_msmtdt_by_gridid_and_source.py), which makes it a parameter to your application instead of a parameter to Spark. -Sandy On Wed, Aug 12, 2015 at 4:40 AM, Eric Bless eric.bl...@yahoo.com.invalid
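
The ordering rule can be sketched as follows; the overhead value is illustrative:

```shell
# Wrong: everything after the primary resource is passed to the
# application itself, so this --conf never reaches Spark.
spark-submit latest_msmtdt_by_gridid_and_source.py \
  --conf spark.yarn.executor.memoryOverhead=1024

# Right: Spark options go before the primary resource.
spark-submit --conf spark.yarn.executor.memoryOverhead=1024 \
  latest_msmtdt_by_gridid_and_source.py
```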

Re: Spark on YARN

2015-08-08 Thread Sandy Ryza
Hi Jem, Do they fail with any particular exception? Does YARN just never end up giving them resources? Does an application master start? If so, what are in its logs? If not, anything suspicious in the YARN ResourceManager logs? -Sandy On Fri, Aug 7, 2015 at 1:48 AM, Jem Tucker

Re: [General Question] [Hadoop + Spark at scale] Spark Rack Awareness ?

2015-07-19 Thread Sandy Ryza
Hi Mike, Spark is rack-aware in its task scheduling. Currently Spark doesn't honor any locality preferences when scheduling executors, but this is being addressed in SPARK-4352, after which executor-scheduling will be rack-aware as well. -Sandy On Sat, Jul 18, 2015 at 6:25 PM, Mike Frampton

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-17 Thread Sandy Ryza
Can you try setting the spark.yarn.jar property to make sure it points to the jar you're thinking of? -Sandy On Fri, Jul 17, 2015 at 11:32 AM, Arun Ahuja aahuj...@gmail.com wrote: Yes, it's a YARN cluster and using spark-submit to run. I have SPARK_HOME set to the directory above and using

Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Sandy Ryza
Hi Jonathan, This is a problem that has come up for us as well, because we'd like dynamic allocation to be turned on by default in some setups, but not break existing users with these properties. I'm hoping to figure out a way to reconcile these by Spark 1.5. -Sandy On Wed, Jul 15, 2015 at

Re: How to restrict disk space for spark caches on yarn?

2015-07-13 Thread Sandy Ryza
To clear one thing up: the space taken up by data that Spark caches on disk is not related to YARN's local resource / application cache concept. The latter is a way that YARN provides for distributing bits to worker nodes. The former is just usage of disk by Spark, which happens to be in a local

Re: Pyspark not working on yarn-cluster mode

2015-07-10 Thread Sandy Ryza
To add to this, conceptually, it makes no sense to launch something in yarn-cluster mode by creating a SparkContext on the client - the whole point of yarn-cluster mode is that the SparkContext runs on the cluster, not on the client. On Thu, Jul 9, 2015 at 2:35 PM, Marcelo Vanzin

Re: Remote spark-submit not working with YARN

2015-07-08 Thread Sandy Ryza
, i checked it in the WEB UI page of my cluster Also, i'm able to submit the same script in any of the nodes of the cluster. That's why i don't understand whats happening. Thanks JG On Wed, Jul 8, 2015 at 5:26 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi JG, One way this can occur

Re: Executors requested are way less than what i actually got

2015-06-26 Thread Sandy Ryza
maxResultSize=200G On Thu, Jun 25, 2015 at 4:52 PM, Sandy Ryza sandy.r...@cloudera.com wrote: How many nodes do you have, how much space is allocated to each node for YARN, how big are the executors you're requesting, and what else is running on the cluster? On Thu, Jun 25, 2015 at 3:57

Re: Executors requested are way less than what i actually got

2015-06-25 Thread Sandy Ryza
How many nodes do you have, how much space is allocated to each node for YARN, how big are the executors you're requesting, and what else is running on the cluster? On Thu, Jun 25, 2015 at 3:57 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I run Spark App on Spark 1.3.1 over YARN. When i

Re: Spark launching without all of the requested YARN resources

2015-06-24 Thread Sandy Ryza
Hi Arun, You can achieve this by setting spark.scheduler.maxRegisteredResourcesWaitingTime to some really high number and spark.scheduler.minRegisteredResourcesRatio to 1.0. -Sandy On Wed, Jun 24, 2015 at 2:21 AM, Steve Loughran ste...@hortonworks.com wrote: On 24 Jun 2015, at 05:55, canan
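
A sketch of the two settings together; the one-hour timeout and jar name are placeholders:

```shell
# Wait (up to a long timeout) until every requested executor has
# registered before the job starts scheduling tasks.
spark-submit --master yarn \
  --conf spark.scheduler.maxRegisteredResourcesWaitingTime=3600s \
  --conf spark.scheduler.minRegisteredResourcesRatio=1.0 \
  my_app.jar
```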

Re: When to use underlying data management layer versus standalone Spark?

2015-06-24 Thread Sandy Ryza
Hi Michael, Spark itself is an execution engine, not a storage system. While it has facilities for caching data in memory, think about these the way you would think about a process on a single machine leveraging memory - the source data needs to be stored somewhere, and you need to be able to

Re: Velox Model Server

2015-06-20 Thread Sandy Ryza
Hi Debasish, The Oryx project (https://github.com/cloudera/oryx), which is Apache 2 licensed, contains a model server that can serve models built with MLlib. -Sandy On Sat, Jun 20, 2015 at 8:00 AM, Charles Earl charles.ce...@gmail.com wrote: Is velox NOT open source? On Saturday, June 20,

Re: Velox Model Server

2015-06-20 Thread Sandy Ryza
Oops, that link was for Oryx 1. Here's the repo for Oryx 2: https://github.com/OryxProject/oryx On Sat, Jun 20, 2015 at 10:20 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Debasish, The Oryx project (https://github.com/cloudera/oryx), which is Apache 2 licensed, contains a model server

Re: deployment options for Spark and YARN w/ many app jar library dependencies

2015-06-17 Thread Sandy Ryza
Hi Matt, If you place your jars on HDFS in a public location, YARN will cache them on each node after the first download. You can also use the spark.executor.extraClassPath config to point to them. -Sandy On Wed, Jun 17, 2015 at 4:47 PM, Sweeney, Matt mswee...@fourv.com wrote: Hi folks,
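
A minimal sketch of the HDFS-staging pattern; the /libs path, jar names, and permissions are assumptions:

```shell
# Stage the dependency jars once in a world-readable HDFS location.
hdfs dfs -mkdir -p /libs
hdfs dfs -put mydeps/*.jar /libs/
hdfs dfs -chmod -R a+r /libs

# Reference them by HDFS URI; YARN localizes the files and caches
# them on each node after the first download.
spark-submit --master yarn \
  --jars hdfs:///libs/mydep.jar \
  my_app.jar
```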

Re: [SparkScore] Performance portal for Apache Spark

2015-06-17 Thread Sandy Ryza
This looks really awesome. On Tue, Jun 16, 2015 at 10:27 AM, Huang, Jie jie.hu...@intel.com wrote: Hi All We are happy to announce Performance portal for Apache Spark http://01org.github.io/sparkscore/ ! The Performance Portal for Apache Spark provides performance data on the Spark

Re: Dynamic allocator requests -1 executors

2015-06-13 Thread Sandy Ryza
Hi Patrick, I'm noticing that you're using Spark 1.3.1. We fixed a bug in dynamic allocation in 1.4 that permitted requesting negative numbers of executors. Any chance you'd be able to try with the newer version and see if the problem persists? -Sandy On Fri, Jun 12, 2015 at 7:42 PM, Patrick

Re: Determining number of executors within RDD

2015-06-10 Thread Sandy Ryza
On YARN, there is no concept of a Spark Worker. Multiple executors will be run per node without any effort required by the user, as long as all the executors fit within each node's resource limits. -Sandy On Wed, Jun 10, 2015 at 3:24 PM, Evo Eftimov evo.efti...@isecc.com wrote: Yes i think

Re: How to run spark streaming application on YARN?

2015-06-04 Thread Sandy Ryza
not. I run it with sbt «sbt run-main Branchmark». I thought it was the same thing since I am passing all the configurations through the application code. Is that the problem? On Thu, Jun 4, 2015 at 6:26 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Saiph, Are you launching using spark-submit

Re: data localisation in spark

2015-06-03 Thread Sandy Ryza
to stages it calculates executors required and then acquire executors/worker nodes ? On Tue, Jun 2, 2015 at 11:06 PM, Sandy Ryza sandy.r...@cloudera.com wrote: It is not possible with JavaSparkContext either. The API mentioned below currently does not have any effect (we should document

Re: data localisation in spark

2015-06-02 Thread Sandy Ryza
It is not possible with JavaSparkContext either. The API mentioned below currently does not have any effect (we should document this). The primary difference between MR and Spark here is that MR runs each task in its own YARN container, while Spark runs multiple tasks within an executor, which

Re: data localisation in spark

2015-05-31 Thread Sandy Ryza
Hi Shushant, Spark currently makes no effort to request executors based on data locality (although it does try to schedule tasks within executors based on data locality). We're working on adding this capability at SPARK-4352 https://issues.apache.org/jira/browse/SPARK-4352. -Sandy On Sun, May

Re: yarn-cluster spark-submit process not dying

2015-05-28 Thread Sandy Ryza
Hi Corey, As of this PR https://github.com/apache/spark/pull/5297/files, this can be controlled with spark.yarn.submit.waitAppCompletion. -Sandy On Thu, May 28, 2015 at 11:48 AM, Corey Nolet cjno...@gmail.com wrote: I am submitting jobs to my yarn cluster via the yarn-cluster mode and I'm
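
A sketch of the fire-and-forget submission the property enables; the jar name is a placeholder:

```shell
# With waitAppCompletion=false, the spark-submit client process exits
# once YARN accepts the application, instead of polling until the
# application finishes.
spark-submit --master yarn-cluster \
  --conf spark.yarn.submit.waitAppCompletion=false \
  my_app.jar
```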

Re: number of executors

2015-05-18 Thread Sandy Ryza
*All On Mon, May 18, 2015 at 9:07 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Xiaohe, The all Spark options must go before the jar or they won't take effect. -Sandy On Sun, May 17, 2015 at 8:59 AM, xiaohe lan zombiexco...@gmail.com wrote: Sorry, them both are assigned task

Re: number of executors

2015-05-18 Thread Sandy Ryza
Hi Xiaohe, The all Spark options must go before the jar or they won't take effect. -Sandy On Sun, May 17, 2015 at 8:59 AM, xiaohe lan zombiexco...@gmail.com wrote: Sorry, them both are assigned task actually. Aggregated Metrics by Executor Executor IDAddressTask TimeTotal TasksFailed

Re: number of executors

2015-05-18 Thread Sandy Ryza
target/scala-2.10/simple-project_2.10-1.0.jar --class scala.SimpleApp is working awesomely. Is there any documentations pointing to this ? Thanks, Xiaohe On Tue, May 19, 2015 at 12:07 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Xiaohe, The all Spark options must go before the jar

Re: Expert advise needed. (POC is at crossroads)

2015-04-30 Thread Sandy Ryza
Hi Deepak, I wrote a couple posts with a bunch of different information about how to tune Spark jobs. The second one might be helpful with how to think about tuning the number of partitions and resources. What kind of OOMEs are you hitting?

Re: Question about Memory Used and VCores Used

2015-04-29 Thread Sandy Ryza
Hi, Good question. The extra memory comes from spark.yarn.executor.memoryOverhead, the space used for the application master, and the way YARN rounds requests up. This explains it in a little more detail: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
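
The accounting can be sketched numerically; the ~10% overhead default (with a 384 MB floor) and the 1024 MB minimum allocation are assumptions for illustration:

```shell
EXEC_MB=4096                       # spark.executor.memory in MB
OVERHEAD=$(( EXEC_MB / 10 ))       # assumed ~10% default memoryOverhead
if [ "$OVERHEAD" -lt 384 ]; then OVERHEAD=384; fi

REQUEST=$(( EXEC_MB + OVERHEAD ))  # what Spark asks YARN for per executor

MIN_ALLOC=1024                     # yarn.scheduler.minimum-allocation-mb
# YARN rounds each container request up to a multiple of the minimum
# allocation, which is where the last chunk of "extra" memory comes from.
GRANTED=$(( (REQUEST + MIN_ALLOC - 1) / MIN_ALLOC * MIN_ALLOC ))
echo "$GRANTED"   # noticeably more than the 4096 MB configured
```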

Re: Running beyond physical memory limits

2015-04-15 Thread Sandy Ryza
The setting to increase is spark.yarn.executor.memoryOverhead On Wed, Apr 15, 2015 at 6:35 AM, Brahma Reddy Battula brahmareddy.batt...@huawei.com wrote: Hello Sean Owen, Thanks for your reply..Ill increase overhead memory and check it.. Bytheway ,Any difference between 1.1 and 1.2 makes,

Re: Spark: Using node-local files within functions?

2015-04-14 Thread Sandy Ryza
Hi Tobias, It should be possible to get an InputStream from an HDFS file. However, if your libraries only work directly on files, then maybe that wouldn't work? If that's the case and different tasks need different files, your way is probably the best way. If all tasks need the same file, a

Re: Rack locality

2015-04-13 Thread Sandy Ryza
Hi Riya, As far as I know, that is correct, unless Mesos fine-grained mode handles this in some mysterious way. -Sandy On Mon, Apr 13, 2015 at 2:09 PM, rcharaya riya.char...@gmail.com wrote: I want to use Rack locality feature of Apache Spark in my application. Is YARN the only resource

Re: Spark Job Run Resource Estimation ?

2015-04-09 Thread Sandy Ryza
Hi Deepak, I'm going to shamelessly plug my blog post on tuning Spark: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ It talks about tuning executor size as well as how the number of tasks for a stage is calculated. -Sandy On Thu, Apr 9, 2015 at 9:21 AM,

Re: Strategy regarding maximum number of executor failures for long-running jobs / spark streaming jobs

2015-04-06 Thread Sandy Ryza
to have minimum n number of executors within x period of time, then we should fail the application. Adding time factor here, will allow some window for spark to get more executors allocated if some of them fails. Thoughts please. Thanks, Twinkle On Wed, Apr 1, 2015 at 10:19 PM, Sandy

Re: Data locality across jobs

2015-04-02 Thread Sandy Ryza
This isn't currently a capability that Spark has, though it has definitely been discussed: https://issues.apache.org/jira/browse/SPARK-1061. The primary obstacle at this point is that Hadoop's FileInputFormat doesn't guarantee that each file corresponds to a single split, so the records

Re: Strategy regarding maximum number of executor failures for long-running jobs / spark streaming jobs

2015-04-01 Thread Sandy Ryza
That's a good question, Twinkle. One solution could be to allow a maximum number of failures within any given time span. E.g. a max failures per hour property. -Sandy On Tue, Mar 31, 2015 at 11:52 PM, twinkle sachdeva twinkle.sachd...@gmail.com wrote: Hi, In spark over YARN, there is a

Re: Cross-compatibility of YARN shuffle service

2015-03-26 Thread Sandy Ryza
Hi Matt, I'm not sure whether we have documented compatibility guidelines here. However, a strong goal is to keep the external shuffle service compatible so that many versions of Spark can run against the same shuffle service. -Sandy On Wed, Mar 25, 2015 at 6:44 PM, Matt Cheah

Re: What is best way to run spark job in yarn-cluster mode from java program(servlet container) and NOT using spark-submit command.

2015-03-26 Thread Sandy Ryza
Creating a SparkContext and setting master as yarn-cluster unfortunately will not work. SPARK-4924 added APIs for doing this in Spark, but won't be included until 1.4. -Sandy On Tue, Mar 17, 2015 at 3:19 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Create SparkContext set master as

Re: issue while submitting Spark Job as --master yarn-cluster

2015-03-25 Thread Sandy Ryza
Hi Sachin, It appears that the application master is failing. To figure out what's wrong you need to get the logs for the application master. -Sandy On Wed, Mar 25, 2015 at 7:05 AM, Sachin Singh sachin.sha...@gmail.com wrote: OS I am using Linux, when I will run simply as master yarn, its

Re: How to avoid being killed by YARN node manager ?

2015-03-24 Thread Sandy Ryza
Hi Yuichiro, The way to avoid this is to boost spark.yarn.executor.memoryOverhead until the executors have enough off-heap memory to avoid going over their limits. -Sandy On Tue, Mar 24, 2015 at 11:49 AM, Yuichiro Sakamoto ks...@muc.biglobe.ne.jp wrote: Hello. We use ALS(Collaborative

Re: Why doesn't the --conf parameter work in yarn-cluster mode (but works in yarn-client and local)?

2015-03-24 Thread Sandy Ryza
but not in yarn-cluster mode). I'm surprised why I can't use it on the cluster while I can use it while local development and testing. Kind regards, Emre Sevinç http://www.bigindustries.be/ On Mon, Mar 23, 2015 at 6:15 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Emre, The --conf property

Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04

2015-03-24 Thread Sandy Ryza
Steve, that's correct, but the problem only shows up when different versions of the YARN jars are included on the classpath. -Sandy On Tue, Mar 24, 2015 at 6:29 AM, Steve Loughran ste...@hortonworks.com wrote: On 24 Mar 2015, at 02:10, Marcelo Vanzin van...@cloudera.com wrote: This

Re: Is yarn-standalone mode deprecated?

2015-03-24 Thread Sandy Ryza
that? On Mon, Mar 23, 2015 at 1:13 PM, Sandy Ryza sandy.r...@cloudera.com wrote: The former is deprecated. However, the latter is functionally equivalent to it. Both launch an app in what is now called yarn-cluster mode. Oozie now also has a native Spark action, though I'm not familiar

Re: Is yarn-standalone mode deprecated?

2015-03-23 Thread Sandy Ryza
The mode is not deprecated, but the name yarn-standalone is now deprecated. It's now referred to as yarn-cluster. -Sandy On Mon, Mar 23, 2015 at 11:49 AM, nitinkak001 nitinkak...@gmail.com wrote: Is yarn-standalone mode deprecated in Spark now. The reason I am asking is because while I can

Re: Is yarn-standalone mode deprecated?

2015-03-23 Thread Sandy Ryza
\ --queue thequeue \ lib/spark-examples*.jar I didnt see example of ./bin/spark-class in 1.2.0 documentation, so am wondering if that is deprecated. On Mon, Mar 23, 2015 at 12:11 PM, Sandy Ryza sandy.r...@cloudera.com wrote: The mode is not deprecated, but the name yarn-standalone

Re: Why doesn't the --conf parameter work in yarn-cluster mode (but works in yarn-client and local)?

2015-03-23 Thread Sandy Ryza
Hi Emre, The --conf property is meant to work with yarn-cluster mode. System.getProperty(key) isn't guaranteed, but new SparkConf().get(key) should work. Does it not? -Sandy On Mon, Mar 23, 2015 at 8:39 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, According to Spark Documentation at
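
A sketch of the pattern; the spark.myapp.input.dir key and path are hypothetical names for illustration:

```shell
# Keys starting with "spark." passed via --conf are forwarded to the
# driver even when it runs on the cluster in yarn-cluster mode.
spark-submit --master yarn-cluster \
  --conf spark.myapp.input.dir=/data/in \
  my_app.jar
# Inside the driver, read the value with new SparkConf().get("spark.myapp.input.dir")
# rather than System.getProperty, which is not propagated to the cluster.
```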

Re: Shuffle Spill Memory and Shuffle Spill Disk

2015-03-23 Thread Sandy Ryza
Hi Bijay, The Shuffle Spill (Disk) is the total number of bytes written to disk by records spilled during the shuffle. The Shuffle Spill (Memory) is the amount of space the spilled records occupied in memory before they were spilled. These differ because the serialized format is more compact,

Re: No executors allocated on yarn with latest master branch

2015-03-09 Thread Sandy Ryza
, 2015 at 12:05 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Are you using the capacity scheduler or fifo scheduler without multi resource scheduling by any chance? On Thu, Feb 12, 2015 at 1:51 PM, Anders Arpteg arp...@spotify.com wrote: The nm logs only seems to contain similar

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
above, then one implication from them is: (spark.executor.memory + spark.yarn.executor.memoryOverhead) * number of executors per machine should be configured smaller than a single machine physical memory Right? Again, thanks! Kelvin On Fri, Feb 20, 2015 at 11:50 AM, Sandy Ryza sandy.r

Re: No executors allocated on yarn with latest master branch

2015-02-20 Thread Sandy Ryza
to absent application application_1422406067005_0053 On Thu, Feb 12, 2015 at 10:38 PM, Sandy Ryza sandy.r...@cloudera.com wrote: It seems unlikely to me that it would be a 2.2 issue, though not entirely impossible. Are you able to find any of the container logs? Is the NodeManager launching

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
Are you specifying the executor memory, cores, or number of executors anywhere? If not, you won't be taking advantage of the full resources on the cluster. -Sandy On Fri, Feb 20, 2015 at 2:41 AM, Sean Owen so...@cloudera.com wrote: None of this really points to the problem. These indicate

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
If that's the error you're hitting, the fix is to boost spark.yarn.executor.memoryOverhead, which will put some extra room in between the executor heap sizes and the amount of memory requested for them from YARN. -Sandy On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote: A

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
AM, Sandy Ryza sandy.r...@cloudera.com wrote: If that's the error you're hitting, the fix is to boost spark.yarn.executor.memoryOverhead, which will put some extra room in between the executor heap sizes and the amount of memory requested for them from YARN. -Sandy On Fri, Feb 20, 2015 at 9

Re: build spark for cdh5

2015-02-18 Thread Sandy Ryza
Hi Koert, You should be using -Phadoop-2.3 instead of -Phadoop2.3. -Sandy On Wed, Feb 18, 2015 at 10:51 AM, Koert Kuipers ko...@tresata.com wrote: does anyone have the right maven invocation for cdh5 with yarn? i tried: $ mvn -Phadoop2.3 -Dhadoop.version=2.5.0-cdh5.2.3 -Pyarn -DskipTests
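
With the profile name fixed, the invocation from the thread becomes:

```shell
# Note the hyphen in -Phadoop-2.3 (the original attempt used -Phadoop2.3).
mvn -Phadoop-2.3 -Dhadoop.version=2.5.0-cdh5.2.3 -Pyarn -DskipTests clean package
```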

Re: Why can't Spark find the classes in this Jar?

2015-02-12 Thread Sandy Ryza
What version of Java are you using? Core NLP dropped support for Java 7 in its 3.5.0 release. Also, the correct command line option is --jars, not --addJars. On Thu, Feb 12, 2015 at 12:03 PM, Deborah Siegel deborah.sie...@gmail.com wrote: Hi Abe, I'm new to Spark as well, so someone else

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Sandy Ryza
) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:178) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:99) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) /Anders On Thu, Feb 12, 2015 at 1:33 AM, Sandy Ryza sandy.r

Re: No executors allocated on yarn with latest master branch

2015-02-11 Thread Sandy Ryza
Hi Anders, I just tried this out and was able to successfully acquire executors. Any strange log messages or additional color you can provide on your setup? Does yarn-client mode work? -Sandy On Wed, Feb 11, 2015 at 1:28 PM, Anders Arpteg arp...@spotify.com wrote: Hi, Compiled the latest

feeding DataFrames into predictive algorithms

2015-02-11 Thread Sandy Ryza
Hey All, I've been playing around with the new DataFrame and ML pipelines APIs and am having trouble accomplishing what seems like should be a fairly basic task. I have a DataFrame where each column is a Double. I'd like to turn this into a DataFrame with a features column and a label column

Re: Resource allocation in yarn-cluster mode

2015-02-10 Thread Sandy Ryza
Hi Zsolt, spark.executor.memory, spark.executor.cores, and spark.executor.instances are only honored when launching through spark-submit. Marcelo is working on a Spark launcher (SPARK-4924) that will enable using these programmatically. That's correct that the error comes up when

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Sandy Ryza
YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory On Fri, Feb 6, 2015 at 3:24 PM, Sandy Ryza sandy.r...@cloudera.com wrote: You can call collect() to pull in the contents of an RDD into the driver

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Sandy Ryza
StreamingContext(sparkConf, Seconds(bucketSecs)) val sc = new SparkContext() On Tue, Feb 10, 2015 at 1:02 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Is the SparkContext you're using the same one that the StreamingContext wraps? If not, I don't think using two is supported. -Sandy On Tue

Re: Open file limit settings for Spark on Yarn job

2015-02-10 Thread Sandy Ryza
Hi Arun, The limit for the YARN user on the cluster nodes should be all that matters. What version of Spark are you using? If you can turn on sort-based shuffle it should solve this problem. -Sandy On Tue, Feb 10, 2015 at 1:16 PM, Arun Luthra arun.lut...@gmail.com wrote: Hi, I'm running
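
A sketch of the two things to check; the jar name is a placeholder:

```shell
# Check the open-file limit for the user the YARN containers run as.
ulimit -n

# Sort-based shuffle opens far fewer files at once than hash-based
# shuffle; it is the default from Spark 1.2 onward, and on 1.1 it can
# be enabled explicitly:
spark-submit --conf spark.shuffle.manager=sort my_app.jar
```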

Re: getting error when submit spark with master as yarn

2015-02-07 Thread Sandy Ryza
Hi Sachin, In your YARN configuration, either yarn.nodemanager.resource.memory-mb is 1024 on your nodes or yarn.scheduler.maximum-allocation-mb is set to 1024. If you have more than 1024 MB on each node, you should bump these properties. Otherwise, you should request fewer resources by setting

Re: Spark impersonation

2015-02-07 Thread Sandy Ryza
https://issues.apache.org/jira/browse/SPARK-5493 currently tracks this. -Sandy On Mon, Feb 2, 2015 at 9:37 PM, Zhan Zhang zzh...@hortonworks.com wrote: I think you can configure hadoop/hive to do impersonation. There is no difference between secure or insecure hadoop cluster by using kinit.

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-06 Thread Sandy Ryza
and then constantly joining I think will be too slow for a streaming job. On Thu, Feb 5, 2015 at 8:06 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Jon, You'll need to put the file on HDFS (or whatever distributed filesystem you're running on) and load it from there. -Sandy On Thu, Feb 5, 2015

Re: Problems with GC and time to execute with different number of executors.

2015-02-06 Thread Sandy Ryza
transformations like... [0..3] an integer, [4...20] a String, [21..27] another String and so on. It's just test code; I'd like to understand what is happening. 2015-02-04 18:57 GMT+01:00 Sandy Ryza sandy.r...@cloudera.com: Hi Guillermo, What exactly do you mean by each iteration

Re: Problems with GC and time to execute with different number of executors.

2015-02-06 Thread Sandy Ryza
.bin parameters This is what I executed with different values in num-executors and executor-memory. What do you think there are too many executors for those HDDs? Could it be the reason because of each executor takes more time? 2015-02-06 9:36 GMT+01:00 Sandy Ryza sandy.r...@cloudera.com

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-05 Thread Sandy Ryza
Hi Jon, You'll need to put the file on HDFS (or whatever distributed filesystem you're running on) and load it from there. -Sandy On Thu, Feb 5, 2015 at 3:18 PM, YaoPau jonrgr...@gmail.com wrote: I have a file badFullIPs.csv of bad IP addresses used for filtering. In yarn-client mode, I

Re: advice on diagnosing Spark stall for 1.5hr out of 3.5hr job?

2015-02-04 Thread Sandy Ryza
Also, do you see any lines in the YARN NodeManager logs where it says that it's killing a container? -Sandy On Wed, Feb 4, 2015 at 8:56 AM, Imran Rashid iras...@cloudera.com wrote: Hi Michael, judging from the logs, it seems that those tasks are just working a really long time. If you have

Re: running 2 spark applications in parallel on yarn

2015-02-01 Thread Sandy Ryza
Hi Tomer, Are you able to look in your NodeManager logs to see if the NodeManagers are killing any executors for exceeding memory limits? If you observe this, you can solve the problem by bumping up spark.yarn.executor.memoryOverhead. -Sandy On Sun, Feb 1, 2015 at 5:28 AM, Tomer Benyamini
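Several of these threads come down to the same sizing rule: the YARN container an executor requests is the executor heap plus the memory overhead. A minimal sketch of that arithmetic, assuming the commonly cited Spark 1.x default of max(384 MB, 10% of spark.executor.memory) — the exact factor varies by release, so treat the numbers as illustrative:

```python
# Hedged sketch: estimate the YARN container size an executor will request.
# Assumes the max(384 MB, 10%) default overhead often cited for Spark 1.x on
# YARN; check your release's documentation for the actual factor.

def container_memory_mb(executor_memory_mb, overhead_mb=None):
    """Return (overhead, total container request) in MB."""
    if overhead_mb is None:
        overhead_mb = max(384, int(executor_memory_mb * 0.10))
    return overhead_mb, executor_memory_mb + overhead_mb

# With the default overhead, an 8 GB executor asks YARN for roughly 8.8 GB:
print(container_memory_mb(8192))   # (819, 9011)

# Bumping spark.yarn.executor.memoryOverhead raises the request accordingly:
print(container_memory_mb(8192, overhead_mb=2048))   # (2048, 10240)
```

If YARN's limit per container is below the total, YARN kills the container, which is why bumping spark.yarn.executor.memoryOverhead (rather than the heap) is the usual fix for "running beyond physical memory limits" errors.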

Re: HW imbalance

2015-01-30 Thread Sandy Ryza
you could oversubscribe a node in terms of CPU cores if you have memory available. YMMV. HTH -Mike On Jan 30, 2015, at 7:10 AM, Sandy Ryza sandy.r...@cloudera.com wrote: My answer was based off the specs that Antony mentioned: different amounts of memory, but 10 cores on all the boxes

Re: Duplicate key when sorting BytesWritable with Kryo?

2015-01-30 Thread Sandy Ryza
Hi Andrew, Here's a note from the doc for sequenceFile: * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each * record, directly caching the returned RDD will create many references to the same object. * If you plan to directly cache Hadoop

Re: Duplicate key when sorting BytesWritable with Kryo?

2015-01-30 Thread Sandy Ryza
record rather than holding many in memory at once). The documentation should be updated. On Fri, Jan 30, 2015 at 11:27 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Andrew, Here's a note from the doc for sequenceFile: * '''Note:''' Because Hadoop's RecordReader class re-uses
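The reuse hazard the note describes is language-agnostic: a reader that recycles one mutable buffer per record (as Hadoop's RecordReader does with Writable objects) must be copied before the records are cached. A small illustrative sketch in Python — the `Record` class and reader are hypothetical stand-ins for the Writable machinery:

```python
import copy

class Record:
    """Stand-in for a mutable Hadoop Writable."""
    def __init__(self):
        self.value = None

def read_records(values):
    rec = Record()              # one object reused for every record
    for v in values:
        rec.value = v
        yield rec

# Caching the yielded references directly keeps many pointers to ONE object,
# so every cached "record" ends up showing the last value read:
cached = list(read_records([1, 2, 3]))
print([r.value for r in cached])    # [3, 3, 3]

# Copying each record as it is read preserves the distinct values:
safe = [copy.copy(r) for r in read_records([1, 2, 3])]
print([r.value for r in safe])      # [1, 2, 3]
```

In Spark the equivalent fix is a map that copies each Writable (e.g. something like `rdd.map { case (k, v) => (copyOf(k), copyOf(v)) }`) before caching or collecting, rather than caching the RDD returned by sequenceFile directly.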

Re: HW imbalance

2015-01-29 Thread Sandy Ryza
cluster than spark. -Mike On Jan 26, 2015, at 5:02 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Antony, Unfortunately, all executors for any single Spark application must have the same amount of memory. It's possible to configure YARN with different amounts of memory for each host

Re: RDD caching, memory network input

2015-01-28 Thread Sandy Ryza
Hi Fanilo, How many cores are you using per executor? Are you aware that you can combat the "container is running beyond physical memory limits" error by bumping the spark.yarn.executor.memoryOverhead property? Also, are you caching the parsed version or the text? -Sandy On Wed, Jan 28, 2015 at

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-27 Thread Sandy Ryza
Hi Antony, If you look in the YARN NodeManager logs, do you see that it's killing the executors? Or are they crashing for a different reason? -Sandy On Tue, Jan 27, 2015 at 12:43 PM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, I am using spark.yarn.executor.memoryOverhead=8192 yet

Re: HW imbalance

2015-01-26 Thread Sandy Ryza
Hi Antony, Unfortunately, all executors for any single Spark application must have the same amount of memory. It's possible to configure YARN with different amounts of memory for each host (using yarn.nodemanager.resource.memory-mb), so other apps might be able to take advantage of the extra
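The per-host setting mentioned here lives in each NodeManager's yarn-site.xml. A hedged sketch for a host with extra RAM — the property name is real, the value is purely illustrative:

```xml
<!-- yarn-site.xml on the larger host; 48 GB here is an example value -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>49152</value>
</property>
```

Because a single Spark app requests uniform executor containers, the extra capacity on the larger host benefits other YARN applications rather than giving that host bigger Spark executors.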

Re: Large number of pyspark.daemon processes

2015-01-23 Thread Sandy Ryza
Hi Sven, What version of Spark are you running? Recent versions have a change that allows PySpark to share a pool of processes instead of starting a new one for each task. -Sandy On Fri, Jan 23, 2015 at 9:36 AM, Sven Krasser kras...@gmail.com wrote: Hey all, I am running into a problem

Re: RangePartitioner

2015-01-21 Thread Sandy Ryza
Hi Rishi, If you look in the Spark UI, have any executors registered? Are you able to collect a jstack of the driver process? -Sandy On Tue, Jan 20, 2015 at 9:07 PM, Rishi Yadav ri...@infoobjects.com wrote: I am joining two tables as below, the program stalls at below log line and never

Re: Trouble with large Yarn job

2015-01-11 Thread Sandy Ryza
Hi Anders, Have you checked your NodeManager logs to make sure YARN isn't killing executors for exceeding memory limits? -Sandy On Tue, Jan 6, 2015 at 8:20 AM, Anders Arpteg arp...@spotify.com wrote: Hey, I have a job that keeps failing if too much data is processed, and I can't see how to

Re: SPARKonYARN failing on CDH 5.3.0 : container cannot be fetched because of NumberFormatException

2015-01-08 Thread Sandy Ryza
Hi Mukesh, Those line numbers in ConverterUtils in the stack trace don't appear to line up with CDH 5.3: https://github.com/cloudera/hadoop-common/blob/cdh5-2.5.0_5.3.0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ConverterUtils.java Is it possible

Re: Can spark supports task level resource management?

2015-01-07 Thread Sandy Ryza
Hi Xuelin, Spark 1.2 includes a dynamic allocation feature that allows Spark on YARN to modulate its YARN resource consumption as the demands of the application grow and shrink. This is somewhat coarser than what you call task-level resource management. Elasticity comes through allocating and
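The dynamic allocation feature referred to is enabled through configuration rather than code. A minimal sketch of the relevant spark-defaults.conf entries, assuming the Spark 1.2-era property names (values are illustrative; the external shuffle service is required so executors can be released safely):

```properties
# spark-defaults.conf -- illustrative values
spark.dynamicAllocation.enabled        true
spark.dynamicAllocation.minExecutors   2
spark.dynamicAllocation.maxExecutors   50
spark.shuffle.service.enabled          true
```

With these set, Spark on YARN grows and shrinks its executor count with the pending task backlog, which is the application-level (not task-level) elasticity described above.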
