Re: distributeBy using advantage of HDFS or RDD partitioning
If you load data using ORC or Parquet, the RDD will have a partition per file, so your DataFrame will not directly match the partitioning of the table. If you want to process partition-by-partition and guarantee that partitioning is preserved, mapPartitions etc. will be useful. Note that if you perform any DataFrame operations which shuffle, you will end up implicitly re-partitioning to spark.sql.shuffle.partitions (default 200).

Simon

> On 13 Jan 2016, at 10:09, Deenar Toraskar wrote:
>
> Hi
>
> I have data in HDFS partitioned by a logical key and would like to preserve
> the partitioning when creating a DataFrame for the same. Is it possible to
> create a DataFrame that preserves partitioning from HDFS or the underlying
> RDD?
>
> Regards
> Deenar
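A minimal sketch of the mapPartitions approach described above; the DataFrame `df` and the per-row work are hypothetical placeholders:

```scala
// Sketch: process rows partition-by-partition without triggering a shuffle.
// `df` is assumed to be a DataFrame loaded from the partitioned ORC/Parquet table.
val processed = df.rdd.mapPartitions({ rows =>
  // This closure sees only one partition's rows; no re-partitioning occurs.
  rows.map(row => row.getString(0).toLowerCase)
}, preservesPartitioning = true)
```

If a shuffle is unavoidable, the implicit re-partitioning width can be tuned, e.g. `sqlContext.setConf("spark.sql.shuffle.partitions", "400")`.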
Re: Running 2 spark application in parallel
If YARN has capacity to run both simultaneously, it will. You should ensure you are not allocating too many executors to the first app, leaving some room for the second. You may want to run the applications on different YARN queues to control resource allocation. If you run as a different user within the same queue you should also get an even split between the applications, though you may need to enable preemption to ensure the first doesn't just hog the queue.

Simon

> On 22 Oct 2015, at 19:20, Suman Somasundar wrote:
>
> Hi all,
>
> Is there a way to run 2 Spark applications in parallel under YARN in the same
> cluster?
>
> Currently, if I submit 2 applications, one of them waits till the other one
> is completed.
>
> I want both of them to start and run at the same time.
>
> Thanks,
> Suman.
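Placing the two applications on separate YARN queues, as suggested, is done at submit time; the queue names, class names, and jar names below are hypothetical:

```shell
# Submit each application to its own capacity-scheduler queue so neither
# can starve the other of cluster resources.
spark-submit --master yarn-cluster --queue queue_a \
  --num-executors 4 --class com.example.AppA app-a.jar

spark-submit --master yarn-cluster --queue queue_b \
  --num-executors 4 --class com.example.AppB app-b.jar
```

Capping --num-executors on the first app is the simplest way to leave room for the second within a single queue.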
Re: How to connect to spark remotely from java
You don't connect to Spark exactly. The Spark client (running on your remote machine) submits jobs to the YARN cluster running on HDP. What you probably need is yarn-cluster or yarn-client mode, with the YARN client configs as downloaded from the Ambari actions menu.

Simon

> On 10 Aug 2015, at 12:44, Zsombor Egyed wrote:
>
> Hi!
>
> I want to know how I can connect to Hortonworks Spark from another machine.
> There is an HDP 2.2 cluster, and I want to connect to it remotely via the
> Java API. Do you have any suggestions?
>
> Thanks!
>
> Regards,
> Egyed Zsombor
> Junior Big Data Engineer
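The remote-submission setup above can be sketched as follows; the config path, class, and jar names are placeholders:

```shell
# On the remote machine: point the Spark client at the YARN/HDFS client
# configs downloaded from the Ambari actions menu, then submit.
export HADOOP_CONF_DIR=/path/to/downloaded/hdp-client-configs

# yarn-cluster: the driver runs inside the cluster (good for fire-and-forget)
spark-submit --master yarn-cluster --class com.example.MyJob my-job.jar

# yarn-client: the driver runs locally (good for interactive use)
spark-submit --master yarn-client --class com.example.MyJob my-job.jar
```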
Re: Spark and Speech Recognition
You might also want to consider broadcasting the models to ensure you get one instance shared across cores on each machine; otherwise the model will be serialised to each task and you'll get a copy per executor (roughly per core in this instance).

Simon

Sent from my iPhone

> On 30 Jul 2015, at 10:14, Akhil Das wrote:
>
> Like this?
>
> val data = sc.textFile("/sigmoid/audio/data/", 24).foreachPartition(urls => speechRecognizer(urls))
>
> Let 24 be the total number of cores that you have on all the workers.
>
> Thanks
> Best Regards
>
>> On Wed, Jul 29, 2015 at 6:50 AM, Peter Wolf wrote:
>>
>> Hello, I am writing a Spark application that uses speech recognition to
>> transcribe a very large number of recordings. I need some help configuring
>> Spark.
>>
>> My app is basically a transformation with no side effects: recording URL
>> -> transcript. The input is a huge file with one URL per line, and the
>> output is a huge file of transcripts.
>>
>> The speech recognizer is written in Java (Sphinx4), so it can be packaged
>> as a JAR. The recognizer is very processor intensive, so you can't run too
>> many on one machine -- perhaps one recognizer per core. The recognizer is
>> also big -- maybe 1 GB. But most of the recognizer is an immutable set of
>> acoustic and language models that can be shared with other instances of
>> the recognizer.
>>
>> So I want to run about one recognizer per core on each machine in my
>> cluster. I want all recognizers on one machine to run within the same JVM
>> and share the same models.
>>
>> How does one configure Spark for this sort of application? How does one
>> control how Spark deploys the stages of the process? Can someone point me
>> to an appropriate doc or keywords I should Google?
>>
>> Thanks
>> Peter
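A hedged sketch of the broadcast approach; the model type, loader, and `transcribe` function are hypothetical stand-ins for the Sphinx4 pieces:

```scala
// Broadcast the large, immutable models once per executor JVM rather than
// serialising them with every task. All cores in that JVM share one copy.
val models = loadAcousticAndLanguageModels()   // hypothetical loader, ~1 GB
val broadcastModels = sc.broadcast(models)

val transcripts = sc.textFile("hdfs:///recordings/urls.txt")
  .mapPartitions { urls =>
    val m = broadcastModels.value              // one shared instance per JVM
    urls.map(url => transcribe(m, url))        // hypothetical recognizer call
  }
```

Running one executor per machine with as many cores as recognizers you want (e.g. --executor-cores equal to the physical core count) keeps all recognizers in one JVM, as the question asks.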
Re: HDFS not supported by databricks cloud :-(
You could consider using Zeppelin and Spark on YARN as an alternative.

http://zeppelin.incubator.apache.org/

Simon

> On 16 Jun 2015, at 17:58, Sanjay Subramanian wrote:
>
> hey guys
>
> After day one at the Spark Summit SFO, I realized sadly that (indeed) HDFS
> is not supported by Databricks cloud. My speed bottleneck is to transfer
> ~1TB of snapshot HDFS data (250+ external Hive tables) to S3 :-( I want to
> use Databricks cloud but this to me is a starting disabler.
>
> The hard road for me will be (as I believe EVERYTHING is possible; the
> impossible just takes longer):
> - transfer all HDFS data to S3
> - our org does not permit AWS server-side encryption, so I have to figure
>   out if AWS KMS-encrypted S3 files can be read by Hive/Impala/Spark
> - modify all table locations in the metadata to S3
> - modify all scripts to point and write to S3 instead of HDFS
>
> Any ideas / thoughts will be helpful. Till I can get the above figured out,
> I am going ahead and working hard to make spark-sql the main workhorse for
> creating datasets (now it's Hive and Impala).
>
> thanks
> regards
> sanjay
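The data-transfer and table-relocation steps in the quoted message can be sketched per table; the database, table, and bucket names here are hypothetical:

```shell
# Copy one table's HDFS data to S3 with distcp, then repoint the Hive
# metastore entry at the new location.
hadoop distcp \
  hdfs:///user/hive/warehouse/mydb.db/mytable \
  s3n://my-bucket/warehouse/mytable

hive -e "ALTER TABLE mydb.mytable SET LOCATION 's3n://my-bucket/warehouse/mytable'"
```

Scripting this over all 250+ external tables (e.g. by listing them from the metastore) avoids editing each one by hand.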
Re: InferredSchema Example in Spark-SQL
You mean toDF(), not toRD(). It stands for "data frame", if that makes it easier to remember.

Simon

> On 18 May 2015, at 01:07, Rajdeep Dua wrote:
>
> Hi All,
>
> I was trying the inferred-schema Spark SQL example:
> http://spark.apache.org/docs/latest/sql-programming-guide.html#overview
>
> I am getting the following compilation error on the function toRD():
>
> value toRD is not a member of org.apache.spark.rdd.RDD[Person]
> [error] val people = sc.textFile("/home/ubuntu/work/spark-src/spark/examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
>
> Thanks
> Rajdeep
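The corrected inferred-schema example then looks like this; note that toDF() also requires the implicits import to be in scope (path shortened here for readability):

```scala
// Spark 1.3+ inferred-schema example using toDF() (not toRD())
case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._   // brings the toDF() conversion into scope

val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
```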
Re: How to avoid using some nodes while running a spark program on yarn
You won’t be able to use YARN labels on 2.2.0. However, you only need labels if you want to map containers onto specific hardware. In your scenario, the capacity scheduler in YARN might be the best bet. You can set up separate queues for the streaming and other jobs to protect a percentage of cluster resources. You can then spread all jobs across the cluster while protecting the streaming jobs’ capacity (if your resource container sizes are granular enough).

Simon

> On Mar 14, 2015, at 9:57 AM, James wrote:
>
> My Hadoop version is 2.2.0, and my Spark version is 1.2.0.
>
> 2015-03-14 17:22 GMT+08:00 Ted Yu:
>
>> Which release of Hadoop are you using? Can you utilize the node labels
>> feature? See YARN-2492 and YARN-796.
>>
>> Cheers
>>
>>> On Sat, Mar 14, 2015 at 1:49 AM, James wrote:
>>>
>>> Hello,
>>>
>>> I have got a cluster with Spark on YARN. Currently some of its nodes are
>>> running a Spark Streaming program, so their local disk space is not
>>> enough to support other applications. So I wonder, is it possible to use
>>> a blacklist to avoid using these nodes when running a new Spark program?
>>>
>>> Alcaid
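A sketch of the capacity-scheduler setup suggested above; the queue names and percentages are illustrative, not a recommendation:

```xml
<!-- capacity-scheduler.xml: protect a share of the cluster for streaming -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>streaming,batch</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.streaming.capacity</name>
  <value>40</value>  <!-- streaming jobs are guaranteed 40% of the cluster -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.batch.capacity</name>
  <value>60</value>  <!-- everything else shares the remaining 60% -->
</property>
```

Jobs are then steered to a queue at submit time (e.g. spark-submit --queue streaming).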
Re: HW imbalance
You shouldn’t have any issues with differing nodes on the latest Ambari and Hortonworks. It works fine for mixed hardware and Spark on YARN.

Simon

> On Jan 26, 2015, at 4:34 PM, Michael Segel wrote:
>
> If you’re running YARN, then you should be able to mix and match, where
> YARN is managing the resources available on each node.
>
> Having said that… it depends on which version of Hadoop/YARN. If you’re
> running Hortonworks and Ambari, then setting up multiple profiles may not
> be straightforward. (I haven’t seen the latest version of Ambari.) So in
> theory, one profile would be for your smaller 36GB machines, and one
> profile for your 128GB machines. Then as you request resources for your
> Spark job, it should schedule the jobs based on the cluster’s available
> resources. (At least in theory. I haven’t tried this, so YMMV.)
>
> HTH
> -Mike
>
>> On Jan 26, 2015, at 4:25 PM, Antony Mayi wrote:
>>
>> Should have said I am running as yarn-client. All I can see is specifying
>> the generic executor memory that is then used in all containers.
>>
>> On Monday, 26 January 2015, 16:48, Charles Feduke wrote:
>>
>>> You should look at using Mesos. This should abstract away the individual
>>> hosts into a pool of resources and make the different physical
>>> specifications manageable.
>>>
>>> I haven't tried configuring Spark Standalone mode to have different specs
>>> on different machines, but based on spark-env.sh.template:
>>>
>>> # - SPARK_WORKER_CORES, to set the number of cores to use on this machine
>>> # - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
>>> # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. -Dx=y)
>>>
>>> it looks like you should be able to mix. (It's not clear to me whether
>>> SPARK_WORKER_MEMORY is uniform across the cluster or applies only to the
>>> machine where the config file resides.)
>>>
>>>> On Mon Jan 26 2015 at 8:07:51 AM Antony Mayi wrote:
>>>>
>>>> Hi, is it possible to mix hosts with (significantly) different specs
>>>> within a cluster (without wasting the extra resources)? For example,
>>>> having 10 nodes with 36GB RAM/10 CPUs and now trying to add 3 hosts
>>>> with 128GB/10 CPUs: is there a way to utilize the extra memory via
>>>> Spark executors (as my understanding is that all Spark executors must
>>>> have the same memory)?
>>>>
>>>> Thanks,
>>>> Antony.
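Under YARN, mixed hardware works because each NodeManager advertises its own resources in its local yarn-site.xml, so the bigger nodes can simply declare more memory; the values below are illustrative:

```xml
<!-- yarn-site.xml on one of the 36GB nodes -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>30720</value>  <!-- leave headroom for the OS and Hadoop daemons -->
</property>

<!-- on the 128GB nodes the same property would be set much higher,
     e.g. 114688, letting YARN pack more executor containers onto them -->
```

Executors stay a uniform size; the larger nodes just host more of them.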
Re: Unable to build spark from source
You can use the same build commands, but it's well worth setting up a zinc server if you're doing a lot of builds. That will allow incremental Scala builds, which speeds up the process significantly. SPARK-4501 might be of interest too.

Simon

> On 3 Jan 2015, at 17:27, Manoj Kumar wrote:
>
> My question was: once I make changes to a file in the source code, do I
> rebuild it using any other command, such that it takes in only the changes
> (because a full build takes a lot of time)?
>
> On Sat, Jan 3, 2015 at 10:40 PM, Manoj Kumar wrote:
>
>> Yes, I've built Spark successfully using the same command
>>
>> mvn -DskipTests clean package
>>
>> but it built because now I do not work behind a proxy. Thanks.
>>
>> --
>> Godspeed,
>> Manoj Kumar,
>> Intern, Telecom ParisTech
>> Mech Undergrad
>> http://manojbits.wordpress.com
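A sketch of the incremental workflow; the zinc port and module name are illustrative, and recent Spark checkouts can start zinc automatically via build/mvn:

```shell
# Start a long-running zinc server once; it keeps compiler state between
# builds so only changed Scala sources are recompiled.
zinc -start -port 3030

# Rebuild without "clean" so previous compilation output is reused
mvn -DskipTests package

# Or rebuild only the module you touched, e.g. core
mvn -pl core -DskipTests package
```

The key change from the original command is dropping "clean": cleaning discards exactly the state that makes the incremental rebuild fast.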