Re: distributeBy using advantage of HDFS or RDD partitioning

2016-01-13 Thread Simon Elliston Ball
If you load data using ORC or Parquet, the RDD will have a partition per file, 
so your DataFrame will not directly match the partitioning of the table. 

If you want to process by partition and guarantee that partitioning is 
preserved, then mapPartitions etc. will be useful. 

Note that if you perform any DataFrame operations which shuffle, you will end 
up implicitly re-partitioning to spark.sql.shuffle.partitions (default 200).
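
Roughly, something like this (an untested sketch on the 1.x APIs; the path, schema and per-partition logic below are made up):

val df = sqlContext.read.orc("/data/mytable")    // roughly one partition per ORC/Parquet file

// mapPartitions works within the existing partitions, so no shuffle happens here
val processed = df.rdd.mapPartitions { rows =>
  rows.map(row => row.getString(0).toUpperCase)  // stand-in for your per-partition processing
}

// anything that shuffles will repartition to spark.sql.shuffle.partitions;
// tune it if 200 output partitions is not what you want
sqlContext.setConf("spark.sql.shuffle.partitions", "64")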

Simon

> On 13 Jan 2016, at 10:09, Deenar Toraskar  wrote:
> 
> Hi
> 
> I have data in HDFS partitioned by a logical key and would like to preserve 
> the partitioning when creating a dataframe for the same. Is it possible to 
> create a dataframe that preserves partitioning from HDFS or the underlying 
> RDD?
> 
> Regards
> Deenar





Re: Running 2 spark application in parallel

2015-10-22 Thread Simon Elliston Ball
If YARN has capacity to run both simultaneously, it will. You should ensure you 
are not allocating too many executors to the first app, leaving some space 
for the second.

You may want to run the applications on different YARN queues to control 
resource allocation. If you run them as different users within the same queue you 
should also get an even split between the applications; however, you may need to 
enable preemption to ensure the first doesn't just hog the queue. 
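
As a rough sketch (the queue name and sizes below are only examples, not recommendations), you can cap the first application and point it at a queue when you build its SparkConf:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("app-one")
  .set("spark.yarn.queue", "batch")        // submit to a specific YARN queue
  .set("spark.executor.instances", "4")    // don't grab every container in the cluster
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "2")

val sc = new SparkContext(conf)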

Simon 

> On 22 Oct 2015, at 19:20, Suman Somasundar  
> wrote:
> 
> Hi all,
>  
> Is there a way to run 2 spark applications in parallel under Yarn in the same 
> cluster?
>  
> Currently, if I submit 2 applications, one of them waits till the other one 
> is completed.
>  
> I want both of them to start and run at the same time.
>  
> Thanks,
> Suman.


Re: How to connect to spark remotely from java

2015-08-10 Thread Simon Elliston Ball
You don't connect to Spark directly. The Spark client (running on your remote 
machine) submits jobs to the YARN cluster running on HDP. What you probably 
need is yarn-cluster or yarn-client mode, with the YARN client configs downloaded 
from the Ambari actions menu.
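
A minimal sketch of a yarn-client submission from a remote machine (the paths and app name are illustrative; the downloaded configs just need to be visible via HADOOP_CONF_DIR/YARN_CONF_DIR):

// export HADOOP_CONF_DIR=/path/to/downloaded/yarn-client-configs   (set before launching the JVM)
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("remote-app")
  .setMaster("yarn-client")          // driver runs on your machine, executors on the HDP cluster

val sc = new SparkContext(conf)
sc.parallelize(1 to 100).count()     // quick smoke test that the cluster is reachable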

Simon

 On 10 Aug 2015, at 12:44, Zsombor Egyed egye...@starschema.net wrote:
 
 Hi!
 
 I want to know how I can connect to Hortonworks Spark from another machine. 
 
 There is an HDP 2.2 cluster and I want to connect to it remotely via the Java 
 API. 
 Do you have any suggestions? 
 
 Thanks!
 Regards,
 
 -- 
 
 
 Egyed Zsombor 
 Junior Big Data Engineer
 
 
 Mobile: +36 70 320 65 81 | Twitter:@starschemaltd
 Email: egye...@starschema.net | Web: www.starschema.net


Re: Spark and Speech Recognition

2015-07-30 Thread Simon Elliston Ball
You might also want to consider broadcasting the models to ensure you get one 
instance shared across the cores in each machine; otherwise the model will be 
serialised with each task and you'll end up with a copy per executor (roughly 
one per core in this setup).
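
A sketch of what I mean (loadModel and recognize are placeholders for your own code, and urls is the RDD of recording URLs):

val modelBroadcast = sc.broadcast(loadModel())   // load the ~1GB models once on the driver

val transcripts = urls.mapPartitions { iter =>
  val model = modelBroadcast.value               // deserialised at most once per executor JVM
  iter.map(url => recognize(model, url))
}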

Simon 

Sent from my iPhone

 On 30 Jul 2015, at 10:14, Akhil Das ak...@sigmoidanalytics.com wrote:
 
 Like this?
 
 sc.textFile("/sigmoid/audio/data/", 24).foreachPartition(urls => speechRecognizer(urls))
 
 Let 24 be the total number of cores that you have on all the workers.
 
 Thanks
 Best Regards
 
 On Wed, Jul 29, 2015 at 6:50 AM, Peter Wolf opus...@gmail.com wrote:
 Hello, I am writing a Spark application to use speech recognition to 
 transcribe a very large number of recordings.
 
 I need some help configuring Spark.
 
 My app is basically a transformation with no side effects: recording URL -> 
 transcript.  The input is a huge file with one URL per line, and the output 
 is a huge file of transcripts.  
 
 The speech recognizer is written in Java (Sphinx4), so it can be packaged as 
 a JAR.
 
 The recognizer is very processor-intensive, so you can't run too many on one 
 machine -- perhaps one recognizer per core.  The recognizer is also big -- 
 maybe 1 GB.  But most of the recognizer consists of immutable acoustic and 
 language models that can be shared with other instances of the recognizer.
 
 So I want to run about one recognizer per core on each machine in my 
 cluster.  I want all recognizers on one machine to run within the same JVM 
 and share the same models.
 
 How does one configure Spark for this sort of application?  How does one 
 control how Spark deploys the stages of the process?  Can someone point me 
 to an appropriate doc, or keywords I should Google?
 
 Thanks
 Peter 
 


Re: HDFS not supported by databricks cloud :-(

2015-06-16 Thread Simon Elliston Ball
You could consider using Zeppelin and spark on yarn as an alternative. 
http://zeppelin.incubator.apache.org/

Simon

 On 16 Jun 2015, at 17:58, Sanjay Subramanian 
 sanjaysubraman...@yahoo.com.INVALID wrote:
 
 hey guys
 
 After day one at the spark-summit SFO, I realized sadly that (indeed) HDFS is 
 not supported by Databricks cloud.
 My speed bottleneck is to transfer ~1TB of snapshot HDFS data (250+ external 
 hive tables) to S3 :-( 
 
 I want to use Databricks Cloud, but for me this is a blocker from the start.
 The hard road for me will be (I believe EVERYTHING is possible; the 
 impossible just takes longer): 
 - transfer all HDFS data to S3
 - our org does not permit AWS server-side encryption, so I have to figure out whether 
 AWS KMS-encrypted S3 files can be read by Hive/Impala/Spark  
 - modify all table locations in the metastore to point to S3
 - modify all scripts to point and write to S3 instead of HDFS
 Any ideas / thoughts will be helpful.
 
 Till I can get the above figured out, I am going ahead and working hard to 
 make Spark SQL the main workhorse for creating datasets (right now it's Hive and 
 Impala)
 
 
 thanks
 regards
 
 sanjay
  
 


Re: InferredSchema Example in Spark-SQL

2015-05-17 Thread Simon Elliston Ball
You mean toDF(), not toRD(). It stands for DataFrame, if that makes it easier to 
remember.
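
For reference, a sketch of that example with toDF() (Spark 1.3+; note the implicits import, without which toDF() won't resolve):

import sqlContext.implicits._   // brings toDF() into scope for RDDs of case classes

case class Person(name: String, age: Int)

val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()                       // not toRD()

people.registerTempTable("people")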

Simon

 On 18 May 2015, at 01:07, Rajdeep Dua rajdeep@gmail.com wrote:
 
 Hi All,
 Was trying the Inferred Schema spart example
 http://spark.apache.org/docs/latest/sql-programming-guide.html#overview
 
 I am getting the following compilation error on the function toRD()
 
 value toRD is not a member of org.apache.spark.rdd.RDD[Person]
 [error] val people = 
 sc.textFile("/home/ubuntu/work/spark-src/spark/examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toRD()
 [error]  
 
 Thanks
 Rajdeep   
 
 
 


Re: How to avoid using some nodes while running a spark program on yarn

2015-03-14 Thread Simon Elliston Ball
You won’t be able to use YARN labels on 2.2.0. However, you only need the 
labels if you want to map containers onto specific hardware. In your scenario, 
the capacity scheduler in YARN might be the best bet. You can set up separate 
queues for the streaming and other jobs to protect a percentage of cluster 
resources. You can then spread all jobs across the cluster while protecting the 
streaming jobs’ capacity (if your resource container sizes are granular 
enough).
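
As a rough sketch (the queue name is an example; the queues themselves and their capacities are defined in capacity-scheduler.xml, e.g. via yarn.scheduler.capacity.root.queues and yarn.scheduler.capacity.root.<queue>.capacity), the streaming job would simply be submitted to its own queue:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-job")
  .set("spark.yarn.queue", "streaming")   // a dedicated queue with a protected share of the cluster

val ssc = new StreamingContext(conf, Seconds(10))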

Simon


 On Mar 14, 2015, at 9:57 AM, James alcaid1...@gmail.com wrote:
 
 My hadoop version is 2.2.0, and my spark version is 1.2.0
 
 2015-03-14 17:22 GMT+08:00 Ted Yu yuzhih...@gmail.com 
 mailto:yuzhih...@gmail.com:
 Which release of hadoop are you using ?
 
 Can you utilize node labels feature ?
 See YARN-2492 and YARN-796
 
 Cheers
 
 On Sat, Mar 14, 2015 at 1:49 AM, James alcaid1...@gmail.com 
 mailto:alcaid1...@gmail.com wrote:
 Hello, 
 
 I have a cluster with Spark on YARN. Currently some of its nodes are running 
 a Spark Streaming program, so their local space is not enough to support 
 other applications. So I wonder: is it possible to use a blacklist to avoid 
 using these nodes when running a new Spark program? 
 
 Alcaid
 
 



Re: HW imbalance

2015-01-28 Thread simon elliston ball
You shouldn’t have any issues with differing nodes on the latest Ambari and 
Hortonworks. It works fine for mixed hardware and spark on yarn. 

Simon

 On Jan 26, 2015, at 4:34 PM, Michael Segel msegel_had...@hotmail.com wrote:
 
 If you’re running YARN, then you should be able to mix and match, where YARN is 
 managing the resources available on the node. 
 
 Having said that… it depends on which version of Hadoop/YARN. 
 
 If you’re running Hortonworks and Ambari, then setting up multiple profiles 
 may not be straight forward. (I haven’t seen the latest version of Ambari) 
 
 So in theory, one profile would be for your smaller 36GB RAM machines, and one 
 profile for your 128GB machines. 
 Then as you request resources for your Spark job, it should schedule the 
 jobs based on the cluster’s available resources. 
 (At least in theory.  I haven’t tried this so YMMV) 
 
 HTH
 
 -Mike
 
 On Jan 26, 2015, at 4:25 PM, Antony Mayi antonym...@yahoo.com.INVALID 
 mailto:antonym...@yahoo.com.INVALID wrote:
 
 I should have said I am running as yarn-client. All I can see is specifying 
 the generic executor memory that is then used in all containers.
 
 
 On Monday, 26 January 2015, 16:48, Charles Feduke charles.fed...@gmail.com 
 mailto:charles.fed...@gmail.com wrote:
 
 
 You should look at using Mesos. This should abstract away the individual 
 hosts into a pool of resources and make the different physical 
 specifications manageable.
 
 I haven't tried configuring Spark Standalone mode to have different specs on 
 different machines but based on spark-env.sh.template:
 
 # - SPARK_WORKER_CORES, to set the number of cores to use on this machine
 # - SPARK_WORKER_MEMORY, to set how much total memory workers have to give 
 executors (e.g. 1000m, 2g)
 # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. 
 -Dx=y)
 It looks like you should be able to mix. (It's not clear to me whether 
 SPARK_WORKER_MEMORY is uniform across the cluster or applies to the machine where 
 the config file resides.)
 
 On Mon Jan 26 2015 at 8:07:51 AM Antony Mayi antonym...@yahoo.com.invalid 
 mailto:antonym...@yahoo.com.invalid wrote:
 Hi,
 
 Is it possible to mix hosts with (significantly) different specs within a 
 cluster (without wasting the extra resources)? For example, having 10 nodes 
 with 36GB RAM/10 CPUs and now trying to add 3 hosts with 128GB/10 CPUs -- is there 
 a way for the Spark executors to utilize the extra memory (as my understanding is 
 that all Spark executors must have the same memory)?
 
 thanks,
 Antony.
 
 
 



Re: Unable to build spark from source

2015-01-03 Thread Simon Elliston Ball
You can use the same build commands, but it's well worth setting up a Zinc 
server if you're doing a lot of builds. That will allow incremental Scala 
builds, which speeds up the process significantly.

SPARK-4501 might be of interest too.

Simon

 On 3 Jan 2015, at 17:27, Manoj Kumar manojkumarsivaraj...@gmail.com wrote:
 
 My question was: once I make changes to a file in the source code, can I 
 rebuild using another command so that it picks up only the changes (because a 
 full rebuild takes a lot of time)? 
 
 On Sat, Jan 3, 2015 at 10:40 PM, Manoj Kumar 
 manojkumarsivaraj...@gmail.com wrote:
 Yes, I've built spark successfully, using the same command
 
 mvn -DskipTests clean package
 
 but it built because now I do not work behind a proxy.
 
 Thanks.
 
 
 
 -- 
 Godspeed,
 Manoj Kumar,
 Intern, Telecom ParisTech
 Mech Undergrad
 http://manojbits.wordpress.com