Re: Web Service + Spark

2015-01-11 Thread Raghavendra Pandey
You can take a look at http://zeppelin.incubator.apache.org. It is a notebook and graphical visual designer. On Sun, Jan 11, 2015, 01:45 Cui Lin cui@hds.com wrote: Thanks, Gaurav and Corey. Probably I didn't make myself clear. I am looking for best Spark practice similar to Shiny for R,

Re: How to set UI port #?

2015-01-11 Thread jeanlyn92
Hi YaoPau: You can set `spark.ui.port` to 0 and the program will pick an available port at random. 2015-01-11 16:38 GMT+08:00 YaoPau jonrgr...@gmail.com: I have multiple Spark Streaming jobs running all day, and so when I run my hourly batch job, I always get a java.net.BindException: Address
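A minimal sketch of why port 0 works, using a plain Python socket rather than Spark: binding to port 0 asks the OS to assign any free ephemeral port, which is the same mechanism `spark.ui.port=0` relies on.

```python
import socket

# Binding to port 0 lets the OS pick a free ephemeral port,
# instead of probing fixed ports 4040, 4041, 4042, ... in sequence.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("127.0.0.1", 0))
port = s.getsockname()[1]
print(port)  # an OS-assigned free port
s.close()
```

Because the OS hands back a port it knows is free, there is no retry loop and no BindException, which is why the batch job starts faster.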

Re: ALS.trainImplicit running out of mem when using higher rank

2015-01-11 Thread Antony Mayi
The question really is whether it is expected that the memory requirements grow rapidly with the rank, as I would expect memory to be rather an O(1) problem, dependent only on the size of the input data. If this is expected, is there any rough formula to determine the required memory based on ALS

How to set UI port #?

2015-01-11 Thread YaoPau
I have multiple Spark Streaming jobs running all day, and so when I run my hourly batch job, I always get a java.net.BindException: Address already in use which starts at 4040 then goes to 4041, 4042, 4043 before settling at 4044. That slows down my hourly job, and time is critical. Is there a

Re: Does DecisionTree model in MLlib deal with missing values?

2015-01-11 Thread Sean Owen
I do not recall seeing support for missing values. Categorical values are encoded as 0.0, 1.0, 2.0, ... When training the model you indicate which are interpreted as categorical with the categoricalFeaturesInfo parameter, which maps feature offset to count of distinct categorical values for the
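The encoding Sean describes can be sketched in plain Python: categories become 0.0, 1.0, 2.0, ... and `categoricalFeaturesInfo` maps each categorical feature's index to its number of distinct values. The feature names and values below are hypothetical, not from the thread.

```python
# Hypothetical categorical features at offsets 0 and 2; offset 1 is numeric.
colors = ["red", "green", "blue"]
shapes = ["circle", "square"]

# Encode each string category as 0.0, 1.0, 2.0, ...
color_to_idx = {v: float(i) for i, v in enumerate(colors)}
shape_to_idx = {v: float(i) for i, v in enumerate(shapes)}

# Map feature offset -> count of distinct categorical values,
# the shape MLlib's DecisionTree expects for categoricalFeaturesInfo.
categorical_features_info = {0: len(colors), 2: len(shapes)}

row = ["green", 3.14, "square"]  # raw record: categorical, numeric, categorical
encoded = [color_to_idx[row[0]], row[1], shape_to_idx[row[2]]]
print(encoded)                    # [1.0, 3.14, 1.0]
print(categorical_features_info)  # {0: 3, 2: 2}
```

The model then treats features 0 and 2 as unordered categories rather than continuous values when choosing splits.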

Re: ALS.trainImplicit running out of mem when using higher rank

2015-01-11 Thread Sean Owen
I would expect the size of the user/item feature RDDs to grow linearly with the rank, of course. They are cached, so that would drive cache memory usage on the cluster. This wouldn't cause executors to fail for running out of memory though. In fact, your error does not show the task failing for
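A rough back-of-the-envelope for the linear growth Sean describes, under the simplifying assumption that each cached factor vector is `rank` doubles at 8 bytes each; this ignores JVM object overhead, so treat it as a lower bound, and the user/item counts below are illustrative.

```python
def als_factor_bytes(num_users, num_items, rank, bytes_per_double=8):
    """Lower-bound size of the cached user and item feature RDDs:
    one rank-length double vector per user and per item."""
    return (num_users + num_items) * rank * bytes_per_double

# Doubling the rank doubles the (lower-bound) cache footprint:
for rank in (10, 50, 100):
    gb = als_factor_bytes(10_000_000, 1_000_000, rank) / 1e9
    print(f"rank={rank}: ~{gb:.1f} GB")
```

For 10M users and 1M items this gives roughly 0.9 GB of raw doubles at rank 10 and 8.8 GB at rank 100, before JVM overhead, which is consistent with cache usage growing linearly in the rank.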

Re: train many decision tress with a single spark job

2015-01-11 Thread Sean Owen
You just mean you want to divide the data set into N subsets, and do that dividing by user, not make one model per user right? I suppose you could filter the source RDD N times, and build a model for each resulting subset. This can be parallelized on the driver. For example let's say you divide
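The dividing step Sean describes, sketched in plain Python rather than on RDDs: hash each record's user id into one of N buckets, then fit one model per bucket. On an RDD this corresponds to N `filter()` passes over the source data; the records and the `train` function here are hypothetical stand-ins.

```python
import zlib

N = 4
records = [("alice", 1.0), ("bob", 2.0), ("carol", 3.0), ("dave", 4.0)]

def bucket(user_id, n=N):
    # Stable hash so the same user always lands in the same subset
    # (Python's built-in hash() is salted per process, so avoid it here).
    return zlib.crc32(user_id.encode()) % n

# N filter passes over the source data, one subset per bucket.
subsets = {i: [r for r in records if bucket(r[0]) == i] for i in range(N)}

def train(subset):
    # Hypothetical stand-in for whatever model fitting you use.
    return {"n": len(subset)}

models = {i: train(s) for i, s in subsets.items()}
print(sum(m["n"] for m in models.values()))  # 4 -- every record in exactly one subset
```

The per-bucket `train` calls are independent, so they can be launched in parallel from the driver as Sean suggests.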

Re: ScalaReflectionException when using saveAsParquetFile in sbt

2015-01-11 Thread Shing Hing Man
I have the same exception when I run the following example from the Spark SQL Programming Guide - Spark 1.2.0 Documentation:

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-11 Thread Cheng Lian
On 1/11/15 1:40 PM, Nathan McCarthy wrote: Thanks Cheng Michael! Makes sense. Appreciate the tips! Idiomatic scala isn't performant. I’ll definitely start using while loops or tail recursive methods. I have noticed this in the spark code base. I might try turning off columnar compression

Re: Job priority

2015-01-11 Thread Cody Koeninger
If you set up a number of pools equal to the number of different priority levels you want, make the relative weights of those pools very different, and submit each job to the pool representing its priority, I think you'll get behavior equivalent to a priority queue. Try it and see. If I'm
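A sketch of what that setup could look like in Spark's fair scheduler allocations file (pointed to by `spark.scheduler.allocation.file`), with three pools whose weights differ by orders of magnitude to approximate three priority levels; the pool names and exact weights are illustrative.

```xml
<?xml version="1.0"?>
<allocations>
  <pool name="low">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
  <pool name="normal">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1000</weight>
    <minShare>0</minShare>
  </pool>
  <pool name="high">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1000000</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
```

A job then selects its pool before submitting work, e.g. `sc.setLocalProperty("spark.scheduler.pool", "high")`, so the scheduler gives its tasks roughly a million times the share of a "low" job whenever both are runnable.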

Re: Spark SQL: Storing AVRO Schema in Parquet

2015-01-11 Thread Raghavendra Pandey
I think AvroWriteSupport class already saves avro schema as part of parquet meta data. You can think of using parquet-mr https://github.com/Parquet/parquet-mr directly. Raghavendra On Fri, Jan 9, 2015 at 10:32 PM, Jerry Lam chiling...@gmail.com wrote: Hi Raghavendra, This makes a lot of

Re: Removing JARs from spark-jobserver

2015-01-11 Thread Sasi
Thank you Abhishek. That works.

Re: Job priority

2015-01-11 Thread Alessandro Baretta
Cody, while I might be able to improve the scheduling of my jobs by using a few different pools with weights of, say, 1, 1e3 and 1e6, effectively getting a small handful of priority classes, this is still not quite what I am describing. This is why my original post was on the dev

RE: Submit Spark applications from a machine that doesn't have Java installed

2015-01-11 Thread Nate D'Amico
Can't speak to the internals of SparkSubmit and how to reproduce it sans JVM; I guess it would depend on whether you want/need to support various deployment environments (stand-alone, mesos, yarn, etc). If you just need YARN, or are looking for a starting point, you might want to look at the capabilities of the YARN API:

RE: Does DecisionTree model in MLlib deal with missing values?

2015-01-11 Thread Christopher Thom
Is there any plan to extend the data types that would be accepted by the Tree models in Spark? e.g. Many models that we build contain a large number of string-based categorical factors. Currently the only strategy is to map these string values to integers, and store the mapping so the data can
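The mapping strategy Christopher describes can be kept alongside its inverse, so scored results can be decoded back to the original strings; a minimal sketch with a hypothetical categorical column.

```python
values = ["NSW", "VIC", "QLD", "NSW", "VIC"]  # hypothetical string-based factor

# Build the string -> index mapping once and store it with the model...
mapping = {v: float(i) for i, v in enumerate(sorted(set(values)))}
encoded = [mapping[v] for v in values]

# ...and keep the inverse so encoded data can be mapped back later.
inverse = {i: v for v, i in mapping.items()}
decoded = [inverse[i] for i in encoded]

print(mapping)  # {'NSW': 0.0, 'QLD': 1.0, 'VIC': 2.0}
print(encoded)  # [0.0, 2.0, 1.0, 0.0, 2.0]
print(decoded == values)  # True
```

Sorting the distinct values makes the encoding deterministic across runs, which matters when the stored mapping has to match data encoded later.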

Re: propogating edges

2015-01-11 Thread Anwar Rizal
It looks to be similar (simpler) to the connected components implementation in GraphX. Have you checked that? I have a question though: in your example, the graph is a tree. What is the behavior if it is a more general graph? Cheers, Anwar Rizal. On Mon, Jan 12, 2015 at 1:02 AM, dizzy5112

Re: Issue writing to Cassandra from Spark

2015-01-11 Thread Ankur Srivastava
Hi Akhil, thank you for your response. Actually we are first reading from Cassandra and then writing back after doing some processing. All the reader stages succeed with no errors and many writer stages also succeed, but many fail as well. Thanks Ankur On Sat, Jan 10, 2015 at 10:15 PM, Akhil Das

propogating edges

2015-01-11 Thread dizzy5112
Hi all, looking for some help in propagating some values in edges. What I want to achieve (see diagram) is, for each connected part of the graph, to assign an incrementing value to each of the out-links from the root node. This value will restart again for the next part of the graph, i.e. node 1 has out
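The numbering described above can be sketched in plain Python over an adjacency list (GraphX's Pregel API would be the distributed analogue): find each component's root, number the root's out-links 1, 2, 3, ..., propagate each number down its subtree, and restart the counter for the next component. The forest below is hypothetical and tree-shaped, matching the example in the thread.

```python
# Hypothetical forest: two components, rooted at 1 and 10.
edges = {1: [2, 3], 2: [4], 3: [], 4: [],
         10: [11, 12], 11: [], 12: []}

# Roots are nodes with no in-links.
targets = {t for outs in edges.values() for t in outs}
roots = [n for n in edges if n not in targets]

labels = {}
for root in roots:
    # Counter restarts per component: the root's i-th out-link gets
    # value i, propagated to everything reachable through it.
    for i, child in enumerate(edges[root], start=1):
        stack = [child]
        while stack:
            node = stack.pop()
            labels[node] = i
            stack.extend(edges.get(node, []))

print(labels)  # {2: 1, 4: 1, 3: 2, 11: 1, 12: 2}
```

As Anwar notes elsewhere in the digest, this is only well-defined on trees; on a general graph a node reachable from two of the root's out-links would need a tie-breaking rule.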

Submit Spark applications from a machine that doesn't have Java installed

2015-01-11 Thread Nick Chammas
Is it possible to submit a Spark application to a cluster from a machine that does not have Java installed? My impression is that many, many more computers come with Python installed by default than do with Java. I want to write a command-line utility

How to recovery application running records when I restart Spark master?

2015-01-11 Thread ChongTang
Hi all, Due to some reasons, I restarted the Spark master node. Before I restarted it, there were some application run records at the bottom of the master web page, but they are gone after restarting the master node. The records include application name, running time, status, and so on. I am sure

Re: Data locality running Spark on Mesos

2015-01-11 Thread Michael V Le
I tried two Spark stand-alone configurations: SPARK_WORKER_CORES=1 SPARK_WORKER_MEMORY=1g SPARK_WORKER_INSTANCES=6 spark.driver.memory 1g spark.executor.memory 1g spark.storage.memoryFraction 0.9 --total-executor-cores 60 In the second configuration (same as first, but): SPARK_WORKER_CORES=6

Re: ALS.trainImplicit running out of mem when using higher rank

2015-01-11 Thread Antony Mayi
This seems to have sorted it, awesome, thanks for the great help. Antony. On Sunday, 11 January 2015, 13:02, Sean Owen so...@cloudera.com wrote: I would expect the size of the user/item feature RDDs to grow linearly with the rank, of course. They are cached, so that would drive cache

Re: Trouble with large Yarn job

2015-01-11 Thread Sandy Ryza
Hi Anders, Have you checked your NodeManager logs to make sure YARN isn't killing executors for exceeding memory limits? -Sandy On Tue, Jan 6, 2015 at 8:20 AM, Anders Arpteg arp...@spotify.com wrote: Hey, I have a job that keeps failing if too much data is processed, and I can't see how to

Support for SQL on unions of tables (merge tables?)

2015-01-11 Thread Paul Wais
Dear List, What are common approaches for querying across a union of tables / RDDs? E.g. suppose I have a collection of log files in HDFS, one log file per day, and I want to compute the sum of some field over a date range in SQL. Using the log schema, I can read each as a distinct SchemaRDD, but I
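The union-then-aggregate pattern being asked about, sketched in plain Python over in-memory "daily tables"; in Spark SQL this would correspond to unioning the per-day SchemaRDDs, registering the result as one table, and running `SELECT SUM(...) ... WHERE date BETWEEN ...`. Field names and values are hypothetical.

```python
# One "table" per day, all sharing the same schema.
day1 = [{"date": "2015-01-10", "bytes": 100}, {"date": "2015-01-10", "bytes": 50}]
day2 = [{"date": "2015-01-11", "bytes": 200}]
day3 = [{"date": "2015-01-12", "bytes": 75}]

# Union the daily tables, then aggregate over a date range --
# the same shape as union + SELECT SUM(bytes) WHERE date BETWEEN ... AND ...
logs = day1 + day2 + day3
total = sum(r["bytes"] for r in logs
            if "2015-01-10" <= r["date"] <= "2015-01-11")
print(total)  # 350
```

Because all the daily tables share one schema, the union is well-defined and the date predicate simply excludes the days outside the range.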

what is the roadmap for Spark SQL dialect in the coming releases?

2015-01-11 Thread Niranda Perera
Hi, I found out that Spark SQL currently supports only a relatively small subset of the SQL dialect. I would like to know the roadmap for the coming releases. Also, are you focusing more on popularizing the 'Hive on Spark' SQL dialect or the Spark SQL dialect? Rgds -- Niranda

Re: Issue writing to Cassandra from Spark

2015-01-11 Thread Akhil Das
I see, can you paste the piece of code? It's probably because you are exceeding the number of connections specified in the property rpc_max_threads. Make sure you close all the connections properly. Thanks Best Regards On Mon, Jan 12, 2015 at 7:45 AM, Ankur Srivastava