Re: Unable to build spark from source

2015-01-03 Thread Simon Elliston Ball
You can use the same build commands, but it's well worth setting up a zinc server if you're doing a lot of builds. That will allow incremental scala builds, which speeds up the process significantly. SPARK-4501 might be of interest too. Simon > On 3 Jan 2015, at 17:27, Manoj Kumar wrote: > >
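For the archives, a minimal sketch of the zinc workflow being suggested. The profiles and the way zinc is installed are assumptions (they vary by platform and Spark version), not details from the thread:

```shell
# Assumes zinc is already installed (e.g. `brew install zinc` on a Mac).
# Start the incremental-compile server once; it keeps compiled state warm.
zinc -start

# Subsequent Maven builds pick up the running zinc server, so only changed
# Scala sources are recompiled. Profiles shown are illustrative.
mvn -Pyarn -Phadoop-2.4 -DskipTests clean package
```

Later Spark releases ship a `build/mvn` wrapper that downloads and starts zinc automatically, so the manual start step may not be needed.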

Re: HW imbalance

2015-01-28 Thread simon elliston ball
You shouldn’t have any issues with differing nodes on the latest Ambari and Hortonworks. It works fine for mixed hardware and spark on yarn. Simon > On Jan 26, 2015, at 4:34 PM, Michael Segel wrote: > > If you’re running YARN, then you should be able to mix and match where YARN is > managing t

Re: InferredSchema Example in Spark-SQL

2015-05-17 Thread Simon Elliston Ball
You mean toDF() not toRD(). It stands for data frame, if that makes it easier to remember. Simon > On 18 May 2015, at 01:07, Rajdeep Dua wrote: > > Hi All, > Was trying the Inferred Schema spark example > http://spark.apache.org/docs/latest/sql-programming-guide.html#overview > > I am getting
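A minimal sketch of the inferred-schema flow the guide describes, showing where toDF() comes in. The file path, case class, and column layout are illustrative, assuming the Spark 1.3+ API:

```scala
// Schema is inferred from the case class fields.
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
import sqlContext.implicits._         // brings toDF() into scope

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()                             // note: toDF(), not toRD()

people.registerTempTable("people")
```

Without the `sqlContext.implicits._` import, toDF() won't resolve, which is a common source of errors with this example.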

Re: HDFS not supported by databricks cloud :-(

2015-06-16 Thread Simon Elliston Ball
You could consider using Zeppelin and spark on yarn as an alternative. http://zeppelin.incubator.apache.org/ Simon > On 16 Jun 2015, at 17:58, Sanjay Subramanian > wrote: > > hey guys > > After day one at the spark-summit SFO, I realized sadly that (indeed) HDFS is > not supported by Databr

Re: How to avoid using some nodes while running a spark program on yarn

2015-03-14 Thread Simon Elliston Ball
You won’t be able to use YARN labels on 2.2.0. However, you only need the labels if you want to map containers on specific hardware. In your scenario, the capacity scheduler in YARN might be the best bet. You can set up separate queues for the streaming and other jobs to protect a percentage of c
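A sketch of what that capacity-scheduler setup might look like. The queue names and percentages here are assumptions for illustration, not values from the thread:

```xml
<!-- capacity-scheduler.xml fragment: two queues splitting cluster capacity -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>streaming,batch</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.streaming.capacity</name>
  <value>40</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.batch.capacity</name>
  <value>60</value>
</property>
```

Jobs submitted to the `streaming` queue are then guaranteed roughly 40% of cluster resources regardless of what the `batch` queue is running.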

Re: distributeBy using advantage of HDFS or RDD partitioning

2016-01-13 Thread Simon Elliston Ball
If you load data using ORC or parquet, the RDD will have a partition per file, so in fact your data frame will not directly match the partitioning of the table. If you want to process by partition and guarantee that partitioning is preserved, then mapPartitions etc. will be useful. Note that if you perform any
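A small sketch of the mapPartitions pattern mentioned above. The data frame `df` and what is done per partition are illustrative assumptions:

```scala
// Process each partition as a unit; the function runs once per partition,
// so per-partition setup (connections, buffers) is paid once, not per row.
val perPartitionCounts = df.rdd.mapPartitions { iter =>
  Iterator(iter.size)   // e.g. count the rows in this partition
}
```

Because the function receives the whole partition's iterator, no shuffle is introduced and the existing partitioning is preserved.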

Re: Spark and Speech Recognition

2015-07-30 Thread Simon Elliston Ball
You might also want to consider broadcasting the models to ensure you get one instance shared across cores in each machine, otherwise the model will be serialised to each task and you'll get a copy per task (roughly one per core in this instance) Simon Sent from my iPhone > On 30 Jul 2015, at 10
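A sketch of the broadcast pattern being suggested. `loadModel`, `recognize`, and `audioRdd` are hypothetical names for illustration; only `sc.broadcast` and `.value` are the actual Spark API:

```scala
// Load the model once on the driver, then broadcast it so each executor
// JVM holds a single deserialised copy shared by all its tasks.
val model = loadModel("speech-model.bin")   // hypothetical loader
val modelBC = sc.broadcast(model)

val results = audioRdd.map { clip =>
  modelBC.value.recognize(clip)   // tasks on the same executor share one instance
}
```

Without the broadcast, `model` would be captured in the map closure and serialised into every task, multiplying memory use by the number of cores.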

Re: How to connect to spark remotely from java

2015-08-10 Thread Simon Elliston Ball
You don't connect to spark exactly. The spark client (running on your remote machine) submits jobs to the YARN cluster running on HDP. What you probably need is yarn-cluster or yarn-client with the yarn client configs as downloaded from the Ambari actions menu. Simon > On 10 Aug 2015, at 12:44
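A sketch of what that remote submission looks like. The paths and application class are assumptions for illustration:

```shell
# Point the Spark client at the YARN client configs downloaded from Ambari
# (Actions > Download Client Configs); the path is illustrative.
export HADOOP_CONF_DIR=/etc/hadoop/conf

spark-submit \
  --master yarn-client \            # or yarn-cluster for driver-on-cluster
  --class com.example.MyApp \       # hypothetical application class
  my-app.jar
```

The spark-submit client then negotiates with the YARN ResourceManager named in those configs; nothing "connects to Spark" directly.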

Re: Running 2 spark application in parallel

2015-10-22 Thread Simon Elliston Ball
If YARN has capacity to run both simultaneously, it will. You should ensure you are not allocating too many executors for the first app, leaving some space for the second. You may want to run the applications on different YARN queues to control resource allocation. If you run as a different user
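A sketch of submitting the two applications to separate queues. Queue names and executor counts are assumptions; the queues themselves must already exist in the YARN scheduler config:

```shell
# First app on its own queue, sized to leave headroom for the second.
spark-submit --master yarn-client --queue streaming --num-executors 4 app1.jar

# Second app on a different queue, so YARN arbitrates resources between them.
spark-submit --master yarn-client --queue batch --num-executors 4 app2.jar
```

If the first app grabs the whole cluster (e.g. via `--num-executors` set too high), the second will sit in ACCEPTED state until resources free up.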