Re: MLLib decision tree: Weights

2014-09-03 Thread Xiangrui Meng
This is not supported in MLlib. Hopefully, we will add support for weighted examples in v1.2. If you want to train on weighted instances with the current tree implementation, please try importance sampling first to adjust the weights. For instance, an example with weight 0.3 is sampled with
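To make the hint concrete, here is a minimal sketch of weight-proportional sampling, assuming weights have been scaled so the largest is 1.0; the (weight, example) pairing is hypothetical, not an MLlib API:

  import scala.util.Random
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.rdd.RDD

  // Keep each example with probability equal to its weight, so an example
  // with weight 0.3 survives in roughly 30% of draws.
  def importanceSample(weighted: RDD[(Double, LabeledPoint)]): RDD[LabeledPoint] =
    weighted.flatMap { case (w, p) =>
      if (Random.nextDouble() < w) Some(p) else None
    }

Repeating the draw for several trees and averaging would reduce the variance this sampling introduces.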

Re: New features (Discretization) for v1.x in xiangrui.pdf

2014-09-03 Thread Xiangrui Meng
We have a pending PR (https://github.com/apache/spark/pull/216) for discretization, but it has performance issues. We will try to spend more time improving it. -Xiangrui

Re: New features (Discretization) for v1.x in xiangrui.pdf

2014-09-03 Thread filipus
How do I install it? Just clone the code with git clone https://github.com/apache/spark/pull/216 and then sbt package? Is it the same as https://github.com/LIDIAgroup/SparkFeatureSelection, or something different? filip

Re: New features (Discretization) for v1.x in xiangrui.pdf

2014-09-03 Thread Xiangrui Meng
I think they are the same. If you have hub (https://hub.github.com/) installed, you can run `hub checkout https://github.com/apache/spark/pull/216` and then `sbt/sbt assembly`. -Xiangrui

Re: Number of elements in ArrayBuffer

2014-09-03 Thread Sean Owen
You really should show your Spark code then. I think you are misunderstanding one of the Spark APIs and are processing a collection of 1 ArrayBuffer at some point, not an ArrayBuffer.
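For illustration, a hypothetical reconstruction of the mistake Sean describes, assuming the shell's sc: a whole collection becomes a single RDD element.

  import scala.collection.mutable.ArrayBuffer

  val buf = ArrayBuffer(1, 2, 3, 4)

  val wrong = sc.parallelize(Seq(buf)) // RDD[ArrayBuffer[Int]] with ONE element
  wrong.count()                        // 1

  val right = sc.parallelize(buf)      // RDD[Int] with four elements
  right.count()                        // 4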

Memcached error when using during map

2014-09-03 Thread gavin zhang
I finished a distributed project in Hadoop Streaming and it worked fine using memcached storage during mapping. Actually, it's a Python project. Now I want to move it to Spark. But when I called the memcached library, two errors were found during computation. (Both) 1. File memcache.py, line 414,

.sparkrc for Spark shell?

2014-09-03 Thread Jianshi Huang
To make my shell experience merrier, I need to import several packages, and define implicit sparkContext and sqlContext. Is there a startup file (e.g. ~/.sparkrc) that Spark shell will load when it's started? Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog:

RDDs

2014-09-03 Thread rapelly kartheek
Hi, can someone tell me what kind of operations can be performed on a replicated RDD? What are the use cases of a replicated RDD? One basic doubt that has been bothering me for a long time: what is the difference between an application and a job in Spark parlance? I am confused because of Hadoop

SparkSQL TPC-H query 3 joining multiple tables

2014-09-03 Thread Samay
Hi, I am trying to run query 3 from the TPC-H benchmark using SparkSQL. But I am running into errors which I believe are because the parser does not accept the JOIN syntax I am trying. Below are the syntax variants I tried and the error messages I am seeing. Exception in thread main

Exchanging data between pyspark and scala

2014-09-03 Thread Dominik Hübner
Hey, I am about to implement a Spark app which will require using both pyspark and Spark on Scala. Data should be read from AWS S3 (compressed CSV files) and must be pre-processed by an existing Python codebase. However, our final goal is to make those datasets available for Spark apps

Re: Invalid Class Exception

2014-09-03 Thread niranda
Hi, I'm getting the same error while manually setting up a Spark cluster. Has there been any update on this error? Rgds Niranda

Support R in Spark

2014-09-03 Thread oppokui
Does the Spark ML team have a plan to support R scripts natively? There is a SparkR project, but it is not from the Spark team. Spark ML uses netlib-java to talk to native Fortran routines or uses NumPy; why not try to use R in some sense? R has a lot of useful packages. If the Spark ML team can include R support,

How to list all registered tables in a sql context?

2014-09-03 Thread Jianshi Huang
Hi, How can I list all registered tables in a sql context? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: .sparkrc for Spark shell?

2014-09-03 Thread Prashant Sharma
Hey, you can use `spark-shell -i sparkrc` to do this. Prashant Sharma
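A sketch of what such an init file might contain for Jianshi's use case; the file name and the implicit val are illustrative, and sc is the SparkContext the shell already provides:

  // sparkrc: evaluated by `spark-shell -i sparkrc` at startup.
  import org.apache.spark.SparkContext._
  import org.apache.spark.sql.SQLContext

  implicit val sqlContext = new SQLContext(sc)
  import sqlContext._  // brings in createSchemaRDD and friends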

parsing json in spark streaming

2014-09-03 Thread godraude
Hello everyone. I'm trying to receive a DStream structured as JSON from a Kafka topic, and I want to parse the content of each JSON message. The JSON I receive is something like this:
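Since the message is cut off here, a sketch of one common approach, assuming json4s is on the classpath and a hypothetical "user" field in each record:

  import org.json4s._
  import org.json4s.jackson.JsonMethods.parse
  import org.apache.spark.streaming.dstream.DStream

  case class Event(user: String)

  def parseEvents(lines: DStream[String]): DStream[Event] =
    lines.map { line =>
      // Formats is created inside the closure so nothing
      // non-serializable is captured from the driver.
      implicit val formats = DefaultFormats
      parse(line).extract[Event]
    }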

How to clear broadcast variable from driver memory?

2014-09-03 Thread Kevin Jung
Hi, I tried Broadcast.unpersist() on Spark 1.0.1 but the MemoryStore (driver memory) still holds it. // LOGS // Block broadcast_0 stored as values to memory (estimated size 380.1 MB, free 5.7 GB) The free memory size was the same after calling unpersist. Can I clear this?

pyspark on yarn hdp hortonworks

2014-09-03 Thread Oleg Ruchovets
Hi all. I have been trying to run pyspark on YARN for a couple of days already: http://hortonworks.com/kb/spark-1-0-1-technical-preview-hdp-2-1-3/ I posted the exception in previous posts. It looks like I didn't do the configuration correctly. I googled quite a lot and I can't find the steps that should be done to

Message Passing among workers

2014-09-03 Thread laxmanvemula
Hi, I would like to implement an asynchronous distributed optimization algorithm where workers communicate among one another. It is similar to belief propagation, where each worker is a vertex in the graph. Can someone let me know if this is possible using Spark? Thanks, Laxman

Re: Message Passing among workers

2014-09-03 Thread Evan R. Sparks
Asynchrony is not supported directly; Spark's programming model is naturally BSP. I have seen cases where people have instantiated actors with Akka on worker nodes to enable message passing, or even used Spark's own ActorSystem to do this. But I do not recommend this, since you lose a bunch of

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError GC overhead limit exceeded

2014-09-03 Thread Yifan LI
Hi Ankur, Thanks so much for your advice. But it failed when I tried to set the storage level while constructing a graph. val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions = numPartitions).partitionBy(PartitionStrategy.EdgePartition2D).persist(StorageLevel.MEMORY_AND_DISK)

Re: SparkSQL TPC-H query 3 joining multiple tables

2014-09-03 Thread Michael Armbrust
Are you using SQLContext or HiveContext? The default SQL dialect in HiveContext (HiveQL) is a little more complete and might be a better place to start.
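If the HiveContext route helps, here is a sketch of an abbreviated Q3 in HiveQL on Spark 1.0.x, where hql was the method name; the TPC-H tables are assumed to be registered, and the date predicates of the full query are omitted:

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)
  val q3 = hiveContext.hql("""
    SELECT l_orderkey, SUM(l_extendedprice * (1 - l_discount)) AS revenue
    FROM customer JOIN orders ON c_custkey = o_custkey
                  JOIN lineitem ON l_orderkey = o_orderkey
    WHERE c_mktsegment = 'BUILDING'
    GROUP BY l_orderkey
  """)
  q3.collect().foreach(println)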

RE: MLLib decision tree: Weights

2014-09-03 Thread Sameer Tilak
Dear Xiangrui, Thanks for your reply. We will use sampling for now. However, just to let you know, we believe that it is not the best fit for our problems for two reasons: (1) high dimensionality of the data (600 features) and (2) a highly skewed distribution. Do you have any idea when MLlib v1.2

Accessing neighboring elements in an RDD

2014-09-03 Thread Daniel, Ronald (ELS-SDG)
Hi all, Assume I have read the lines of a text file into an RDD: textFile = sc.textFile("SomeArticle.txt") Also assume that the sentence breaks in SomeArticle.txt were done by machine and have some errors, such as the break at "Fig." in the sample text below. Index Text N...as shown

Re: Low Level Kafka Consumer for Spark

2014-09-03 Thread Dibyendu Bhattacharya
Hi, Sorry for the little delay. As discussed in this thread, I have modified the Kafka-Spark-Consumer (https://github.com/dibbhatt/kafka-spark-consumer) code to have a dedicated receiver for every topic partition. You can see an example of how to create a union of these receivers in

Re: Accessing neighboring elements in an RDD

2014-09-03 Thread Victor Tso-Guillen
Interestingly, an almost identical question was posed on Aug 22 by cjwang. Here's the link to the archive: http://apache-spark-user-list.1001560.n3.nabble.com/Finding-previous-and-next-element-in-a-sorted-RDD-td12621.html#a12664

Re: Accessing neighboring elements in an RDD

2014-09-03 Thread Chris Gore
There is support for Spark in ElasticSearch’s Hadoop integration package. http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html Maybe you could split and insert all of your documents from Spark and then query for “MoreLikeThis” on the ElasticSearch index. I haven’t

How can I start history-server with kerberos HDFS ?

2014-09-03 Thread Zhanfeng Huo
Hi, I have set properties in conf/spark-defaults.conf and started with the command ./sbin/start-history-server.sh /tmp/spark-events. It gets errors, and it seems that the properties in the spark-defaults.conf file don't take effect. How can I solve this problem (enable properties in spark-defaults.conf

Re: How can I start history-server with kerberos HDFS ?

2014-09-03 Thread Marcelo Vanzin
The history server (and other Spark daemons) do not read spark-defaults.conf. There's a bug open to implement that (SPARK-2098), and an open PR to fix it, but it's still not in Spark.

Re: mllib performance on cluster

2014-09-03 Thread Evan R. Sparks
I spoke with SK offline about this, it looks like the difference in timings came from the fact that he was training 100 models for 100 iterations and taking the total time (vs. my example which trains a single model for 100 iterations). I'm posting my response here, though, because I think it's

RE: Accessing neighboring elements in an RDD

2014-09-03 Thread Daniel, Ronald (ELS-SDG)
Thanks for the pointer to that thread. Looks like there is some demand for this capability, but not a lot yet. It also doesn't look like there is an easy answer right now. Thanks, Ron

spark history server trying to hit port 8021

2014-09-03 Thread Greg Hill
My Spark history server won't start because it's trying to hit the namenode on 8021, but the namenode is on 8020 (the default). How can I configure the history server to use the right port? I can't find any relevant setting in the docs:

Web UI

2014-09-03 Thread Ruebenacker, Oliver A
Hello, What is included in the Spark web UI? What are the available URLs? Can the information be obtained in a machine-readable way (e.g. JSON, XML, etc.)? Thanks! Best, Oliver

Re: spark history server trying to hit port 8021

2014-09-03 Thread Greg Hill
Never mind, PEBKAC. I had put the wrong port in the $LOG_DIR environment variable. Greg

Re: Web UI

2014-09-03 Thread Wonha Ryu
Hi Oliver, The Spark standalone master and worker support a '/json' endpoint in the web UI, which returns some of the information in JSON format. I wasn't able to find relevant documentation, though. - Wonha

RE: Web UI

2014-09-03 Thread Ruebenacker, Oliver A
Hello, Thanks for the help! But I tried starting with "--master local[4]", and when I load http://localhost:4040/json I just get forwarded to http://localhost:4040/stages/, and it's all human-readable HTML, no JSON. Best, Oliver

Re: Accessing neighboring elements in an RDD

2014-09-03 Thread Xiangrui Meng
There is a sliding method implemented in MLlib (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala), which is used in computing Area Under Curve:
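A sketch of applying it to the sentence-break problem upthread, assuming Spark 1.1 where the (developer API) sliding method is available:

  import org.apache.spark.mllib.rdd.RDDFunctions._

  val lines = sc.textFile("SomeArticle.txt")
  // Each element pairs a line with its successor, so a bad break
  // (e.g. after "Fig.") can be detected and repaired locally.
  val pairs = lines.sliding(2).map { case Array(a, b) => (a, b) }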

Re: Web UI

2014-09-03 Thread Wonha Ryu
Hey Oliver, IIRC there's no JSON endpoint for the application web UI. They only exist for the cluster master and worker. - Wonha

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError GC overhead limit exceeded

2014-09-03 Thread Ankur Dave
At 2014-09-03 17:58:09 +0200, Yifan LI iamyifa...@gmail.com wrote: val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions = numPartitions).partitionBy(PartitionStrategy.EdgePartition2D).persist(StorageLevel.MEMORY_AND_DISK) Error: java.lang.UnsupportedOperationException: Cannot
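For what it's worth, Spark 1.1's loader appears to accept storage levels directly, which avoids calling persist() on RDDs that already have a level assigned; a sketch assuming that API:

  import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}
  import org.apache.spark.storage.StorageLevel

  // Pass the target levels at load time instead of persist()-ing afterwards.
  val graph = GraphLoader.edgeListFile(sc, edgesFile,
      minEdgePartitions = numPartitions,
      edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
      vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
    .partitionBy(PartitionStrategy.EdgePartition2D)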

RE: Web UI

2014-09-03 Thread Ruebenacker, Oliver A
Hello, Interestingly, http://localhost:4040/metrics/json/ gives some numbers, but only a few which never seem to change during the application’s lifetime. Either the web UI has some very strange limitations, or there are some URLs yet to be discovered that do something interesting.

Spark Streaming into HBase

2014-09-03 Thread kpeng1
I have been trying to understand how Spark Streaming and HBase connect, but have not been successful. What I am trying to do is, given a Spark stream, process that stream and store the results in an HBase table. So far this is what I have: import org.apache.spark.SparkConf import

How do you debug with the logs ?

2014-09-03 Thread Yan Fang
Hi guys, I'm curious how you deal with the logs. I have difficulty debugging with the logs: we run Spark Streaming in our YARN cluster using client mode, so there are two logs: the YARN log and the local log (for the client). Whenever I have a problem, the log is too big to read with gedit and grep. (e.g. after

Re: spark history server trying to hit port 8021

2014-09-03 Thread Andrew Or
Hi Greg, For future reference, you can set spark.history.ui.port in SPARK_HISTORY_OPTS. By default this should point to 18080. This information is actually in the link that you provided :) (as well as in the most up-to-date docs here: http://spark.apache.org/docs/latest/monitoring.html) -Andrew

Re: pyspark on yarn hdp hortonworks

2014-09-03 Thread Andrew Or
Hi Oleg, There isn't much you need to do to set up a YARN cluster to run PySpark. You need to make sure all machines have Python installed, and... that's about it. Your assembly jar will be shipped to all containers along with all the pyspark and py4j files needed. One caveat, however, is that the

Re: How to clear broadcast variable from driver memory?

2014-09-03 Thread Andrew Or
Hi Kevin, there is currently no way to do this... Broadcast.unpersist() only unpersists it from the executors, but not from the driver. However, this is not that bad, because Spark automatically cleans up broadcasts that are no longer used, even on the driver. So as long as there is no memory
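A small sketch of that behavior; sizes and names are arbitrary:

  val data = sc.broadcast(Array.fill(1 << 20)(1.0))
  sc.parallelize(1 to 10).map(_ => data.value.length).count()

  data.unpersist()  // frees the executor copies only
  // Once `data` is no longer referenced anywhere, Spark's automatic
  // cleanup can reclaim the driver-side copy as well.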

Re: Spark Streaming into HBase

2014-09-03 Thread Ted Yu
Adding back user@. I am not familiar with the NotSerializableException. Can you show the full stack trace? See SPARK-1297 for changes you need to make so that Spark works with HBase 0.98. Cheers

Re: Spark Streaming into HBase

2014-09-03 Thread Sean Owen
This doesn't seem to have to do with HBase per se. Some function is getting the StreamingContext into the closure, and that won't work. Is this exactly the code? It doesn't reference a StreamingContext, but is there maybe a different version in reality that tries to use the StreamingContext

RE: Accessing neighboring elements in an RDD

2014-09-03 Thread Daniel, Ronald (ELS-SDG)
Thanks Xiangrui, that looks very helpful. Best regards, Ron

Re: Spark Streaming into HBase

2014-09-03 Thread kpeng1
Sean, I create a streaming context near the bottom of the code (ssc) and basically apply a foreachRDD on the resulting DStream so that I can get access to the underlying RDD, on which I in turn apply a foreach, passing in my function which applies the storing logic. Is there a different

If master is local, where are master and workers?

2014-09-03 Thread Ruebenacker, Oliver A
Hello, If launched with local as master, where are the master and workers? Do they each have a web UI? How can they be monitored? Thanks! Best, Oliver

Re: If master is local, where are master and workers?

2014-09-03 Thread Marcelo Vanzin
local means everything runs in the same process; that means there is no need for master and worker daemons to start processes.

RE: If master is local, where are master and workers?

2014-09-03 Thread Ruebenacker, Oliver A
How can that single process be monitored? Thanks!

Re: If master is local, where are master and workers?

2014-09-03 Thread Marcelo Vanzin
The only monitoring available is the driver's web UI, which will generally be available on port 4040.

Re: Spark Streaming into HBase

2014-09-03 Thread Kevin Peng
Ted, Here is the full stack trace coming from spark-shell: 14/09/03 16:21:03 ERROR scheduler.JobScheduler: Error running job streaming job 1409786463000 ms.0 org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException:

Re: Running Wordcount on large file stucks and throws OOM exception

2014-09-03 Thread Zhan Zhang
In word count, you don't need much driver memory, unless you do collect, but that is not recommended. val file = sc.textFile("hdfs://sandbox.hortonworks.com:8020/tmp/data") val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

Re: How can I start history-server with kerberos HDFS ?

2014-09-03 Thread Andrew Or
Hi Zhanfeng, You will need to set these through SPARK_HISTORY_OPTS in conf/spark-env.sh. This is documented here: http://spark.apache.org/docs/latest/monitoring.html. Let me know if you have it working. -Andrew
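A sketch of what that could look like in conf/spark-env.sh; the principal and keytab path are placeholders, and the property names are as documented on that page:

  export SPARK_HISTORY_OPTS="-Dspark.history.kerberos.enabled=true \
    -Dspark.history.kerberos.principal=spark/_HOST@EXAMPLE.COM \
    -Dspark.history.kerberos.keytab=/etc/security/keytabs/spark.keytab"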

[MLib] How do you normalize features?

2014-09-03 Thread Yana Kadiyska
It seems like the next release will add a nice org.apache.spark.mllib.feature package, but what is the recommended way to normalize features in the current release (1.0.2)? I'm hoping for a general pointer here. At the moment I have an RDD[LabeledPoint] and I can get a

Re: RDDs

2014-09-03 Thread Tobias Pfeiffer
Hello, On Wed, Sep 3, 2014 at 6:02 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Can someone tell me what kind of operations can be performed on a replicated rdd?? What are the use-cases of a replicated rdd. I suggest you read

Re: What is the appropriate privileges needed for writting files into checkpoint directory?

2014-09-03 Thread Tao Xiao
I found the answer. The file system for the checkpoint should be a fault-tolerant file system like HDFS, so we should set it to an HDFS path, not a local file system path.
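A minimal sketch; the URI is a placeholder:

  // Checkpoints must live on a fault-tolerant filesystem such as HDFS.
  ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoints")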

Multi-tenancy for Spark (Streaming) Applications

2014-09-03 Thread Tobias Pfeiffer
Hi, I am not sure if multi-tenancy is the right word, but I am thinking about a Spark application where multiple users can, say, log into some web interface and specify a data processing pipeline with streaming source, processing steps, and output. Now as far as I know, there can be only one

Re: Multi-tenancy for Spark (Streaming) Applications

2014-09-03 Thread Tathagata Das
In the current state of Spark Streaming, creating separate Java processes, each having a streaming context, is probably the best approach to dynamically adding and removing input sources. All of these should be able to use a YARN cluster for resource allocation.

Re: [MLib] How do you normalize features?

2014-09-03 Thread Xiangrui Meng
Maybe copy the implementation of StandardScaler from 1.1 and use it in v1.0.x. -Xiangrui
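If copying the class over is inconvenient, hand-rolling it on 1.0.x is also possible with the column statistics from RowMatrix; a sketch, with the zero-variance guard simplified:

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix
  import org.apache.spark.mllib.regression.LabeledPoint

  // points: RDD[LabeledPoint], as in Yana's setup.
  val summary = new RowMatrix(points.map(_.features)).computeColumnSummaryStatistics()
  val mean = summary.mean.toArray
  val std = summary.variance.toArray.map(math.sqrt)

  val scaled = points.map { p =>
    val f = p.features.toArray.zipWithIndex.map { case (x, i) =>
      if (std(i) == 0.0) 0.0 else (x - mean(i)) / std(i)
    }
    LabeledPoint(p.label, Vectors.dense(f))
  }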

RE: RDDs

2014-09-03 Thread Liu, Raymond
Not sure what you were referring to when saying replicated RDD. If you actually mean RDD, then yes, read the API doc and paper as Tobias mentioned. If you actually focus on the word replicated, then that is for fault tolerance, and probably mostly used in the streaming case for receiver-created
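For reference, that kind of replication is requested through a replicated storage level when persisting; a minimal sketch:

  import org.apache.spark.storage.StorageLevel

  // The _2 levels keep each cached partition on two nodes, so losing an
  // executor does not force recomputation from lineage.
  val rdd = sc.textFile("hdfs:///data/input").persist(StorageLevel.MEMORY_ONLY_2)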

Why spark on yarn applicationmaster cannot get a proper resourcemanager address from yarnconfiguration?

2014-09-03 Thread 남윤민
Hello, I tried to submit a Spark job to a YARN cluster, and an error occurred with these messages: [root@saturn00 bin]# ./spark-submit --class SparkHiveJoin --master yarn-cluster --num-executors 10 --executor-memory 12g --executor-cores 1 spark.jar Spark assembly has been built with Hive,

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-03 Thread Jiusheng Chen
Hi Xiangrui, A side question about MLlib. It looks like the current LBFGS in MLlib (version 1.0.2 and even v1.1) only supports L2 regularization; the doc explains it: The L1 regularization by using L1Updater
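For context, L1 at this point is only wired up for the SGD-based optimizers; a sketch of that route, with an arbitrary regParam:

  import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
  import org.apache.spark.mllib.optimization.L1Updater

  val lr = new LogisticRegressionWithSGD()
  lr.optimizer
    .setNumIterations(100)
    .setRegParam(0.1)          // arbitrary; tune for your data
    .setUpdater(new L1Updater) // soft-thresholding L1 step, SGD only
  val model = lr.run(training) // training: RDD[LabeledPoint]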

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-03 Thread Xiangrui Meng
+DB, David (they implemented OWLQN on Spark today.)

Re: Why spark on yarn applicationmaster cannot get a proper resourcemanager address from yarnconfiguration?

2014-09-03 Thread Guodong Wang
Did you follow the exact steps on this page: https://spark.apache.org/docs/1.0.2/running-on-yarn.html? Please be sure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster. Guodong

Re: Spark Streaming into HBase

2014-09-03 Thread Tathagata Das
This is an issue with how Scala computes closures. Here, because of the function blah, it is trying to serialize the whole scope that this code is part of. Can you define the function blah outside the main function? In fact, you can try putting the function in a serializable object.
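A sketch of that pattern; the names are illustrative and the HBase write is elided:

  import org.apache.spark.streaming.dstream.DStream

  // A top-level object captures no StreamingContext, so the closure
  // shipped by foreachRDD stays serializable.
  object HBaseSink extends Serializable {
    def store(record: String): Unit = {
      // ... obtain an HTable and write the record (elided) ...
    }
  }

  def wire(stream: DStream[String]): Unit =
    stream.foreachRDD(rdd => rdd.foreach(HBaseSink.store))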

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-03 Thread DB Tsai
With David's help today, we were able to implement elastic net GLM in Spark. It's surprisingly easy, and with just some modifications to breeze's OWLQN code, it just works without further investigation. We did a benchmark, and the coefficients are within 0.5% difference compared with R's glmnet

resize memory size for caching RDD

2014-09-03 Thread 牛兆捷
Dear all: Spark uses memory to cache RDDs, and the memory size is specified by spark.storage.memoryFraction. Once the executor starts, does Spark support adjusting/resizing the memory size of this part dynamically? Thanks. Regards, Zhaojie

How to use memcached with spark

2014-09-03 Thread gavin zhang
I tried to connect to memcached in a map with the xmemcached lib and failed: net.rubyeye.xmemcached.exception.MemcachedException: There is no available connection at this moment. Has anybody succeeded in using memcached?
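A common cause is building (or capturing) the client on the driver; the usual fix is one client per partition on the executors. A sketch assuming the xmemcached builder API, with host, port, and the rdd of keys as placeholders:

  import net.rubyeye.xmemcached.XMemcachedClientBuilder
  import net.rubyeye.xmemcached.utils.AddrUtil

  // rdd: RDD[String] of keys; the client is created on the executor and
  // never shipped through a closure.
  val results = rdd.mapPartitions { iter =>
    val client = new XMemcachedClientBuilder(
      AddrUtil.getAddresses("memcached-host:11211")).build()
    // Materialize before shutting the client down.
    val out = iter.map(key => (key, client.get[String](key))).toList
    client.shutdown()
    out.iterator
  }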

Re: Support R in Spark

2014-09-03 Thread oppokui
Thanks, Shivaram. No specific use case yet. We try to use R in our project as the data scientists all know R. We had a concern about how R handles mass data. Spark does better work in the big data area, and Spark ML is focusing on the predictive analytics area. So we are thinking about whether we

RE: resize memory size for caching RDD

2014-09-03 Thread Liu, Raymond
AFAIK, no. Best Regards, Raymond Liu

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-03 Thread Jiusheng Chen
Thanks DB and Xiangrui. Glad to know you guys are actively working on it. Another thing: did we evaluate the loss from using Float to store values? Currently it is Double. Using fewer bits has the benefit of memory footprint reduction. According to Google, they even use 16 bits (a special encoding

Re: Re: How can I start history-server with kerberos HDFS ?

2014-09-03 Thread Zhanfeng Huo
Thanks for your help. It works after setting SPARK_HISTORY_OPTS. Zhanfeng Huo

RE: RDDs

2014-09-03 Thread Kartheek.R
Thank you Raymond and Tobias. Yeah, I am very clear about what I was asking. I was talking about a replicated RDD only. Now that I've got my understanding of job and application validated, I wanted to know if we can replicate an RDD and run two jobs (that need the same RDD) of an application in

Re: resize memory size for caching RDD

2014-09-03 Thread Patrick Wendell
Changing this is not supported; it is immutable, similar to other Spark configuration settings.

Starting Thriftserver via hostname on Spark 1.1 RC4?

2014-09-03 Thread Denny Lee
When I start the thrift server (on Spark 1.1 RC4) via: ./sbin/start-thriftserver.sh --master spark://hostname:7077 --driver-class-path $CLASSPATH it appears that the Thrift server is starting off of localhost as opposed to the hostname. I have set spark-env.sh to use the hostname and modified the