Re: distributeBy using advantage of HDFS or RDD partitioning

2016-01-13 Thread Simon Elliston Ball
If you load data using ORC or Parquet, the RDD will have a partition per file, so in fact your data frame will not directly match the partitioning of the table. If you want to process by partition and guarantee that the partitioning is preserved, then mapPartitions etc. will be useful. Note that if you perform any
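
A minimal PySpark sketch of the per-partition processing mentioned above. The function names and the Parquet path argument are hypothetical, and `SparkSession` assumes Spark 2.0 or later, which is newer than this thread. The Spark import sits inside the driver function so the per-partition helper can be tried on plain Python iterables.

```python
from typing import Iterable, Iterator, List


def count_rows_in_partition(rows: Iterable) -> Iterator[int]:
    """Consume the rows of one partition and emit a single count for it."""
    count = 0
    for _ in rows:
        count += 1
    yield count


def partition_sizes(path: str) -> List[int]:
    """Hypothetical driver: needs pyspark installed and a readable Parquet path."""
    # Imported lazily so the helper above has no Spark dependency.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-sizes-sketch").getOrCreate()
    df = spark.read.parquet(path)
    # mapPartitions calls the function once per partition and does not shuffle,
    # so the partitioning the data was loaded with is left as it is.
    return df.rdd.mapPartitions(count_rows_in_partition).collect()
```

Calling `partition_sizes` on a table directory returns one row count per partition, which is a quick way to see how the loaded partitions line up with the files on disk.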

Re: Running 2 spark application in parallel

2015-10-22 Thread Simon Elliston Ball
If YARN has capacity to run both simultaneously, it will. You should ensure you are not allocating too many executors for the first app, and leave some space for the second. You may want to run the applications on different YARN queues to control resource allocation. If you run as a different
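
As a rough illustration of capping executors and choosing a queue per application, the sketch below builds a `spark-submit` argument list. `--master`, `--queue`, `--num-executors` and `--executor-memory` are standard `spark-submit` options for YARN; the helper name, the queue names and the application paths are hypothetical.

```python
from typing import List


def spark_submit_args(app: str, queue: str, num_executors: int,
                      executor_memory: str = "2g") -> List[str]:
    """Build a spark-submit command that pins an app to one YARN queue."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--queue", queue,                       # YARN queue the app is submitted to
        "--num-executors", str(num_executors),  # keep this low enough to leave room for the other app
        "--executor-memory", executor_memory,
        app,
    ]


# Two hypothetical applications, each on its own queue with a bounded executor count.
first = spark_submit_args("first_app.py", "queue_a", 4)
second = spark_submit_args("second_app.py", "queue_b", 4)
```

The resulting lists can be passed to `subprocess.run` on a machine that has the Spark client and the YARN configuration installed.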

Re: How to connect to spark remotely from java

2015-08-10 Thread Simon Elliston Ball
You don't connect to spark exactly. The spark client (running on your remote machine) submits jobs to the YARN cluster running on HDP. What you probably need is yarn-cluster or yarn-client with the yarn client configs as downloaded from the Ambari actions menu. Simon On 10 Aug 2015, at

Re: Spark and Speech Recognition

2015-07-30 Thread Simon Elliston Ball
You might also want to consider broadcasting the models to ensure you get one instance shared across cores in each machine; otherwise the model will be serialised to each task and you'll get a copy per executor (roughly one per core in this instance). Simon Sent from my iPhone On 30 Jul 2015, at
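
A small PySpark sketch of the broadcast pattern suggested above, under a few assumptions: the model is a picklable callable, `sc` is an existing `SparkContext`, and the function names are hypothetical. How far a broadcast value is shared in memory depends on the runtime (JVM executors and Python workers behave differently), so treat this as an outline of the API rather than a statement about memory use.

```python
from typing import Callable, Iterable, Iterator


def score_partition(rows: Iterable, model: Callable) -> Iterator:
    """Apply the model to every row of one partition."""
    for row in rows:
        yield model(row)


def score_rdd(sc, rdd, model: Callable):
    """Hypothetical wiring: broadcast the model once, then score each partition."""
    shared = sc.broadcast(model)  # SparkContext.broadcast ships the value to the executors once
    # Only the small broadcast handle is captured by the task closure;
    # shared.value is resolved on the executor when the partition is processed.
    return rdd.mapPartitions(lambda rows: score_partition(rows, shared.value))
```

Without the broadcast, a model referenced directly in the closure would be serialised along with every task.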

Re: HDFS not supported by databricks cloud :-(

2015-06-16 Thread Simon Elliston Ball
You could consider using Zeppelin and spark on yarn as an alternative. http://zeppelin.incubator.apache.org/ Simon On 16 Jun 2015, at 17:58, Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID wrote: hey guys After day one at the spark-summit SFO, I realized sadly that (indeed) HDFS

Re: InferredSchema Example in Spark-SQL

2015-05-17 Thread Simon Elliston Ball
You mean toDF() not toRD(). It stands for data frame, if that makes it easier to remember. Simon On 18 May 2015, at 01:07, Rajdeep Dua rajdeep@gmail.com wrote: Hi All, Was trying the Inferred Schema Spark example http://spark.apache.org/docs/latest/sql-programming-guide.html#overview
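
For reference, a minimal PySpark sketch of `toDF()` on an RDD. It assumes input lines of the form "name, age", loosely modelled on the programming guide's people example; the function names are hypothetical, and `spark` is assumed to be an existing `SparkSession` (Spark 2.0 or later).

```python
from typing import Iterable, Tuple


def parse_person(line: str) -> Tuple[str, int]:
    """Split a 'name, age' line into a (name, age) tuple."""
    name, age = line.split(",")
    return name.strip(), int(age.strip())


def people_dataframe(spark, lines: Iterable[str]):
    """Hypothetical wiring: build an RDD of tuples and convert it with toDF()."""
    rdd = spark.sparkContext.parallelize([parse_person(line) for line in lines])
    # toDF() turns the RDD into a DataFrame; the list supplies the column names.
    return rdd.toDF(["name", "age"])
```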

Re: How to avoid using some nodes while running a spark program on yarn

2015-03-14 Thread Simon Elliston Ball
You won’t be able to use YARN labels on 2.2.0. However, you only need the labels if you want to map containers to specific hardware. In your scenario, the capacity scheduler in YARN might be the best bet. You can set up separate queues for the streaming and other jobs to protect a percentage of

Re: HW imbalance

2015-01-28 Thread simon elliston ball
You shouldn’t have any issues with differing nodes on the latest Ambari and Hortonworks. It works fine for mixed hardware and Spark on YARN. Simon On Jan 26, 2015, at 4:34 PM, Michael Segel msegel_had...@hotmail.com wrote: If you’re running YARN, then you should be able to mix and match

Re: Unable to build spark from source

2015-01-03 Thread Simon Elliston Ball
You can use the same build commands, but it's well worth setting up a zinc server if you're doing a lot of builds. That will allow incremental scala builds, which speeds up the process significantly. SPARK-4501 might be of interest too. Simon On 3 Jan 2015, at 17:27, Manoj Kumar