Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread canan chen
How does dynamic allocation work? I mean, is it related to the parallelism of my RDD, and how does the driver know how many executors it needs? On Wed, May 27, 2015 at 2:49 PM, Saisai Shao wrote: > It depends on how you use Spark, if you use Spark with Yarn and enable > dynamic allocation, the n

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-27 Thread Akhil Das
After submitting the job, if you do a ps aux | grep spark-submit you can see all the JVM params. Are you using the high-level consumer (receiver based) for receiving data from Kafka? In that case, if your throughput is high and the processing delay exceeds the batch interval, you will hit this memor

Re: How to give multiple directories as input ?

2015-05-27 Thread Akhil Das
How about creating two RDDs and unioning them [ sc.union(first, second) ]? Thanks Best Regards On Wed, May 27, 2015 at 11:51 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I have this piece > > sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, > AvroKeyInputFormat[GenericRecord]]( > "/a/b/c/d/exptsession/2015/05/2
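
For more than two directories, sc.union also accepts a sequence of RDDs. A minimal sketch (paths hypothetical):

    // Read each directory into its own RDD, then combine them into one.
    val dirs = Seq("/data/2015/05/26", "/data/2015/05/27")
    val combined = sc.union(dirs.map(dir => sc.textFile(dir)))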

Re: DataFrame. Conditional aggregation

2015-05-27 Thread Masf
Yes, I think that your SQL solution will work, but I was looking for a solution with the DataFrame API. I had thought to use a UDF such as: val myFunc = udf { (input: Int) => if (input > 100) 1 else 0 } Although I'd like to know if it's possible to do it directly in the aggregation inserting a lambda fu
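
A minimal sketch of the UDF-in-aggregation idea (column names hypothetical), summing a 0/1 flag to count rows over the threshold:

    import org.apache.spark.sql.functions.{sum, udf}

    // Flag each row with the UDF, then aggregate the flags per group.
    val overThreshold = udf { (input: Int) => if (input > 100) 1 else 0 }
    df.groupBy("key")
      .agg(sum(overThreshold(df("amount"))).as("countOver100"))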

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread Saisai Shao
The driver has a heuristic mechanism to decide the number of executors at runtime according to the pending tasks. You can enable it with configuration; refer to the Spark documentation for the details. 2015-05-27 15:00 GMT+08:00 canan chen : > How does dynamic allocation work? I mean d
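
For reference, a sketch of the configuration being referred to (values hypothetical; on YARN the external shuffle service must also be enabled):

    spark-submit \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.dynamicAllocation.minExecutors=2 \
      --conf spark.dynamicAllocation.maxExecutors=20 \
      ...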

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread DB Tsai
If with Mesos, how do we control the number of executors? In our cluster, each node only has one executor with a very big JVM. Sometimes, if the executor dies, all the concurrently running tasks are gone. We would like to have multiple executors on one node but cannot figure out a way to do it in

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread DB Tsai
Typo. We cannot figure out a way to increase the number of executors on one node in Mesos. On Wednesday, May 27, 2015, DB Tsai wrote: > If with Mesos, how do we control the number of executors? In our cluster, > each node only has one executor with a very big JVM. Sometimes, if the > executor dies, al

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-27 Thread Ji ZHANG
Hi Akhil, Thanks for your reply. According to the Streaming tab of the Web UI, the Processing Time is around 400ms, and there's no Scheduling Delay, so I suppose it's not the Kafka messages that eat up the off-heap memory. Or maybe it is, but how to tell? I googled about how to check the off-heap memo

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread Saisai Shao
I'm not sure about Mesos, maybe someone who has Mesos experience can help answer this. Thanks Jerry 2015-05-27 15:21 GMT+08:00 DB Tsai : > Typo. We cannot figure out a way to increase the number of executors on one > node in Mesos. > > > On Wednesday, May 27, 2015, DB Tsai wrote: > >> If with Mesos

Re: How does spark manage the memory of executor with multiple tasks

2015-05-27 Thread canan chen
Can anyone answer my question? I am curious to know: if there are multiple reducer tasks in one executor, how is memory allocated between these reducer tasks, since each shuffle will consume a lot of memory? On Tue, May 26, 2015 at 7:27 PM, Evo Eftimov wrote: > the link you sent says multipl

Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0

2015-05-27 Thread Justin Yip
Hello, I am trying out 1.4.0 and notice there are some differences in behavior with Timestamp between 1.3.1 and 1.4.0. In 1.3.1, I can compare a Timestamp with a string. scala> val df = sqlContext.createDataFrame(Seq((1, Timestamp.valueOf("2015-01-01 00:00:00")), (2, Timestamp.valueOf("2014-01-01 0
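
One way to make the comparison explicit in either version is to cast the string literal to a timestamp yourself. A hedged sketch (column name hypothetical):

    import org.apache.spark.sql.functions.lit

    // Compare against an explicitly casted timestamp literal rather than a raw string.
    df.filter(df("ts") > lit("2015-01-01 00:00:00").cast("timestamp"))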

Re: How to give multiple directories as input ?

2015-05-27 Thread ๏̯͡๏
I am doing that now. Is there no other way ? On Wed, May 27, 2015 at 12:40 PM, Akhil Das wrote: > How about creating two and union [ sc.union(first, second) ] them? > > Thanks > Best Regards > > On Wed, May 27, 2015 at 11:51 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) > wrote: > >> I have this piece >> >> sc.newAPIHadoo

Re: How many executors can I acquire in standalone mode ?

2015-05-27 Thread canan chen
Thanks Arush. My scenario is that in standalone mode, if I have one worker, when I start spark-shell, there will be one executor launched. But if I have 2 workers, there will be 2 executors launched, so I am wondering about the mechanism of executor allocation. Is it possible to specify how many executor

Re: How to give multiple directories as input ?

2015-05-27 Thread ayan guha
What about /blah/*/blah/out*.avro? On 27 May 2015 18:08, "ÐΞ€ρ@Ҝ (๏̯͡๏)" wrote: > I am doing that now. > Is there no other way ? > > On Wed, May 27, 2015 at 12:40 PM, Akhil Das > wrote: > >> How about creating two and union [ sc.union(first, second) ] them? >> >> Thanks >> Best Regards >> >> On

Re: How to give multiple directories as input ?

2015-05-27 Thread Eugen Cepoi
You can do that using FileInputFormat.addInputPath 2015-05-27 10:41 GMT+02:00 ayan guha : > What about /blah/*/blah/out*.avro? > On 27 May 2015 18:08, "ÐΞ€ρ@Ҝ (๏̯͡๏)" wrote: > >> I am doing that now. >> Is there no other way ? >> >> On Wed, May 27, 2015 at 12:40 PM, Akhil Das >> wrote: >> >>> H

Re: How to give multiple directories as input ?

2015-05-27 Thread ๏̯͡๏
def readGenericRecords(sc: SparkContext, inputDir: String, startDate: Date, endDate: Date) = {
  val path = getInputPaths(inputDir, startDate, endDate)
  sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]]("/A/B/C/D/D/2015/05/22/out-r-*.avro")
}
T

Avro CombineInputFormat ?

2015-05-27 Thread ๏̯͡๏
Can someone share some code to use CombineInputFormat to read Avro data? Today I use:
def readGenericRecords(sc: SparkContext, inputDir: String, startDate: Date, endDate: Date) = {
  val path = getInputPaths(inputDir, startDate, endDate)
  sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWr

Re: How to give multiple directories as input ?

2015-05-27 Thread Eugen Cepoi
Try something like that:
def readGenericRecords(sc: SparkContext, inputDir: String, startDate: Date, endDate: Date) = {
  // assuming a list of paths
  val paths: Seq[String] = getInputPaths(inputDir, startDate, endDate)
  val job = Job.getInstance(new Configuration(sc.hadoopConfiguration)
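
For completeness, a sketch (untested) of how that approach might look once finished; getInputPaths is the helper from the earlier posts, everything else is standard Hadoop/Avro API:

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

    def readGenericRecords(sc: SparkContext, inputDir: String, startDate: Date, endDate: Date) = {
      val paths: Seq[String] = getInputPaths(inputDir, startDate, endDate)
      val job = Job.getInstance(new Configuration(sc.hadoopConfiguration))
      // Register every directory as an input path on the one job config.
      paths.foreach(p => FileInputFormat.addInputPath(job, new Path(p)))
      sc.newAPIHadoopRDD(job.getConfiguration,
        classOf[AvroKeyInputFormat[GenericRecord]],
        classOf[AvroKey[GenericRecord]],
        classOf[NullWritable])
    }

Note that newAPIHadoopFile also accepts a comma-separated list of paths, so paths.mkString(",") may work as well.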

Re: How many executors can I acquire in standalone mode ?

2015-05-27 Thread ayan guha
You can request the number of cores and the amount of memory for each executor. On 27 May 2015 18:25, "canan chen" wrote: > Thanks Arush. > My scenario is that in standalone mode, if I have one worker, when I start > spark-shell, there will be one executor launched. But if I have 2 workers, > there will
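
Concretely, a sketch of the standalone-mode knobs (values hypothetical):

    # spark.cores.max caps the total cores the application takes across the
    # cluster; --executor-memory sets the memory per executor.
    spark-submit --executor-memory 4g --conf spark.cores.max=8 ...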

Re: DataFrame. Conditional aggregation

2015-05-27 Thread ayan guha
Till 1.3, you have to prepare the DF appropriately:
def setupCondition(t):
    if t[1] > 100:
        v = 1
    else:
        v = 0
    return Row(col1=t[0],col2=t[1],col3=t[2],col4=v)
d1=[[1001,100,50],[1001,200,100],[1002,100,99]]
d1RDD = sc.parallelize(d1).map(setupCondition)
d1DF = s

Re: Re: Re: Re: Re: how to run a bash shell distributed in spark

2015-05-27 Thread luohui20001
Hi Akhil and all, My previous code has some problems: all the executors loop and run the same command. That's not what I am expecting. Previous code:
val shellcompare = List("run","sort.sh")
val shellcompareRDD = sc.makeRDD(shellcompare)
val result = List("aggregate","r
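
One way to run a shell script over the data itself, rather than in a loop on every executor, is RDD.pipe. A hedged sketch (script path hypothetical):

    // Each element of the RDD is written to the script's stdin, one per line;
    // the script's stdout lines become the elements of the resulting RDD.
    val piped = sc.parallelize(1 to 100, 4).map(_.toString).pipe("/path/to/sort.sh")
    piped.collect().foreach(println)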

[POWERED BY] Please add our organization

2015-05-27 Thread Antonio Giambanco
Name: Objectway.com URL: www.objectway.com Description: We're building a Big Data solution on Spark. We use Apache Flume for the parallel message queuing infrastructure and Apache Spark Streaming for near-real-time datastream processing, combined with a rule engine for complex event catching.

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
No one using the History Server? :) Am I the only one who needs to see all users' logs? Jianshi On Thu, May 21, 2015 at 1:29 PM, Jianshi Huang wrote: > Hi, > > I'm using Spark 1.4.0-rc1 and I'm using default settings for history > server. > > But I can only see my own logs. Is it possible to view all u

RE: How does spark manage the memory of executor with multiple tasks

2015-05-27 Thread java8964
Same as you, there are lots of people coming from the MapReduce world trying to understand the internals of Spark. Hope the following can help you in some way. For end users, they only have the concept of a Job: I want to run a word count job on this one big file, that is the job I want to run. How many stage

Where does partitioning and data loading happen?

2015-05-27 Thread Stephen Carman
A colleague and I were having a discussion and we were disagreeing about something in Spark/Mesos that perhaps someone can shed some light on. We have a mesos cluster that runs spark via a sparkHome, rather than downloading an executable and such. My colleague says that, say we have parquet fi

Spark and logging

2015-05-27 Thread dgoldenberg
I'm wondering how logging works in Spark. I see that there's the log4j.properties.template file in the conf directory. Is it safe to assume Spark is using log4j 1? What's the approach if we're using log4j 2? I've got a log4j2.xml file in the job jar which seems to be working for my log statements but
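
Spark 1.x does bundle log4j 1.2. A commonly used way (a sketch, paths hypothetical) to point the driver and executors at a custom log4j config is via the extraJavaOptions settings:

    spark-submit \
      --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/path/to/log4j.properties" \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/path/to/log4j.properties" \
      ...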

Re: Where does partitioning and data loading happen?

2015-05-27 Thread ayan guha
I hate to say this, but your friend is right. Spark slaves (executors) really do pull the data. In fact, it is standard practice in the distributed world, e.g. Hadoop. It is not practical to pass large amounts of data through the master, nor would that give a way to read the data in parallel. You can either use spark's

Decision Trees / Random Forest Multiple Labels

2015-05-27 Thread cfusting
Hi, I'd like to use a DT or RF to describe data with multiple dimensions (lat/long). I'm new to Spark but this doesn't seem possible with a LabeledPoint. Am I mistaken? Will it be in the future? Cheers, _Chris -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.

Adding columns to DataFrame

2015-05-27 Thread Masf
Hi. I have a DataFrame with 16 columns (df1) and another with 26 columns (df2). I want to do a UnionAll, so I want to add 10 columns to df1 in order to have the same number of columns in both dataframes. Is there some alternative to "withColumn"? Thanks -- Regards. Miguel Ángel

Re: Spark and logging

2015-05-27 Thread Imran Rashid
Only an answer to one of your questions: > What about log statements in the partition processing functions? Will their log statements get logged into a file residing on a given 'slave' machine, or will Spark capture this log output and divert it into the log file of the driver's machine? >
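
As a hedged illustration of where such statements end up (executor-side logs, not the driver's):

    import org.apache.log4j.Logger

    rdd.mapPartitions { iter =>
      // Runs on an executor: these messages land in that worker's executor
      // logs (stderr/stdout on the slave machine), not in the driver's log.
      val log = Logger.getLogger("partition-work")
      log.info("processing one partition")
      iter
    }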

hive external metastore connection timeout

2015-05-27 Thread jamborta
Hi all, I am setting up a spark standalone server with an external hive metastore (using mysql). There is an issue where, after 5 minutes of inactivity, if I try to reconnect to the metastore (i.e. by executing a new query), it hangs for about 10 mins then times out. My guess is that datanucleus does n

How to get the best performance with LogisticRegressionWithSGD?

2015-05-27 Thread mélanie gallois
I'm new to Spark and I'm getting bad performance with classification methods on Spark MLlib (worse than R in terms of AUC). I am trying to put my own parameters rather than the default parameters. Here is the method I want to use : train(RDD

How to get the best performance with LogisticRegressionWithSGD?

2015-05-27 Thread SparknewUser
I'm new to Spark and I'm getting bad performance with classification methods on Spark MLlib (worse than R in terms of AUC). I am trying to put my own parameters rather than the default parameters. Here is the method I want to use : train(RDD input, int numIterations, doub

Re: Question about Serialization in Storage Level

2015-05-27 Thread Imran Rashid
Hi Zhipeng, yes, your understanding is correct. The "SER" portion just refers to how it's stored in memory. On disk, the data always has to be serialized. On Fri, May 22, 2015 at 10:40 PM, Jiang, Zhipeng wrote: > Hi Todd, Howard, > > > > Thanks for your reply, I might not present my question

Multilabel classification using logistic regression

2015-05-27 Thread peterg
Hi all, I believe I have created a multi-label classifier using LogisticRegression, but there is one snag. No matter what features I use to get the prediction, it will always return the label. I feel like I need to set a threshold but can't seem to figure out how to do that. I attached the code belo
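
On the threshold point, MLlib's LogisticRegressionModel does expose this. A sketch (features is a hypothetical Vector):

    // With a threshold set, predict() returns a 0/1 class label;
    // after clearThreshold() it returns the raw score instead.
    model.setThreshold(0.5)
    val label = model.predict(features)
    model.clearThreshold()
    val score = model.predict(features)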

Re: View all user's application logs in history server

2015-05-27 Thread Marcelo Vanzin
You may be the only one not seeing all the logs. Are you sure all the users are writing to the same log directory? The HS can only read from a single log directory. On Wed, May 27, 2015 at 5:33 AM, Jianshi Huang wrote: > No one using History server? :) > > Am I the only one need to see all user'

Re: Adding columns to DataFrame

2015-05-27 Thread Masf
Hi. I think that it's possible to do: df.select($"*", lit(null).as("col17"), lit(null).as("col18"), lit(null).as("col19"), ..., lit(null).as("col26")) Any other advice? Miguel. On Wed, May 27, 2015 at 5:02 PM, Masf wrote: > Hi. > > I have a DataFrame with 16 columns (df1) and another with
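
To avoid spelling out ten literals, the padding columns can be generated. A sketch (column names hypothetical; assumes the column order and types then line up for unionAll, and that the nulls need an explicit cast):

    import org.apache.spark.sql.functions.{col, lit}

    // Append col17..col26 as typed null columns, then union with df2.
    val padCols = (17 to 26).map(i => lit(null).cast("string").as(s"col$i"))
    val df1Padded = df1.select(col("*") +: padCols: _*)
    val combined = df1Padded.unionAll(df2)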

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
Yes, all written to the same directory on HDFS. Jianshi On Wed, May 27, 2015 at 11:57 PM, Marcelo Vanzin wrote: > You may be the only one not seeing all the logs. Are you sure all the > users are writing to the same log directory? The HS can only read from a > single log directory. > > On Wed,

Re: View all user's application logs in history server

2015-05-27 Thread Marcelo Vanzin
Then: - Are all files readable by the user running the history server? - Did all applications call sc.stop() correctly (i.e. files do not have the ".inprogress" suffix)? Other than that, always look at the logs first, looking for any errors that may be thrown. On Wed, May 27, 2015 at 9:10 AM, Ji

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-27 Thread Akhil Das
Are you using the createStream or createDirectStream api? If it's the former, you can try setting the StorageLevel to MEMORY_AND_DISK (it might slow things down though). Another way would be to try the latter. Thanks Best Regards On Wed, May 27, 2015 at 1:00 PM, Ji ZHANG wrote: > Hi Akhil, >
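
A sketch of both variants (given a StreamingContext ssc; topic names and connection strings hypothetical):

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Receiver-based stream with an explicit storage level:
    val receiverStream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group",
      Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)

    // Receiver-less direct stream (Spark 1.3+):
    val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "broker:9092"), Set("my-topic"))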

Re: hive external metastore connection timeout

2015-05-27 Thread Yana Kadiyska
I have not run into this particular issue, but I'm not using the latest bits in production. However, testing your theory should be easy -- MySQL is just a database, so you should be able to use a regular mysql client and see how many connections are active. You can then compare to the maximum allowed co

Re: Spark partition issue with Stanford NLP

2015-05-27 Thread vishalvibhandik
I am facing a similar issue. When the job runs with more partitions than cores, sometimes the executors get lost and the job doesn't complete. Is there any additional logging that can be turned on to see the exact cause of this issue? -- View this message in context: http://apache-spa

Re: Spark partition issue with Stanford NLP

2015-05-27 Thread mathewvinoj
Hi there, I am trying to process millions of records with spark/scala integrated with Stanford NLP (3.4.1). Since I am using social media data, I have to use NLP for theme generation (pos tagging) and sentiment calculation. I have to deal with Twitter data and non-Twitter data separately, so I

Re: Spark and Stanford CoreNLP

2015-05-27 Thread mathewvinoj
Evan, could you please look into this post? Below is the link. Any thoughts or suggestions are really appreciated. http://apache-spark-user-list.1001560.n3.nabble.com/Spark-partition-issue-with-Stanford-NLP-td23048.html -- View this message in context: http://apache-spark-user-list.1001560.n3.nabb

Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-27 Thread Stephen Boesch
Thanks Yana. My current experience here is that after running some small spark-submit based tests, the Master once again stopped being reachable. No change in the test setup. I restarted the Master/Worker and it is still not reachable. What might be the variables here in which association with the Master/Wo

Invoking Hive UDF programmatically

2015-05-27 Thread Punyashloka Biswal
Dear Spark users, Given a DataFrame df with a column named foo bar, I can call a Spark SQL built-in function on it like so: df.select(functions.max(df("foo bar"))) However, if I want to apply a Hive UDF named myCustomFunction, I need to write df.selectExpr("myCustomFunction(`foo bar`)") which
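
In Spark 1.4 there is a functions.callUdf helper that may cover this (a hedged sketch; assumes the 1.4 API and that myCustomFunction is visible to the HiveContext's function registry):

    import org.apache.spark.sql.functions.callUdf

    // Builds the same call programmatically, without string-building a selectExpr.
    df.select(callUdf("myCustomFunction", df("foo bar")))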

debug jsonRDD problem?

2015-05-27 Thread Michael Stone
Can anyone provide some suggestions on how to debug this? Using spark 1.3.1. The json itself seems to be valid (other programs can parse it) and the problem seems to lie in jsonRDD trying to describe & use a schema. scala> sqlContext.jsonRDD(rdd).count() java.util.NoSuchElementException: None.
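
One way to narrow such failures down is to infer the schema on progressively larger samples until the failure reproduces, then inspect the offending records. A hedged sketch:

    // If a 1% sample infers a schema cleanly, grow the fraction until the
    // NoSuchElementException reproduces, then diff the inferred schemas.
    val sample = sqlContext.jsonRDD(rdd.sample(withReplacement = false, 0.01))
    sample.printSchema()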

RDD boundaries and triggering processing using tags in the data

2015-05-27 Thread David Webber
Hi All, I'm new to Spark and I'd like some help understanding if a particular use case would be a good fit for Spark Streaming. I have an imaginary stream of sensor data consisting of integers 1-10. Every time the sensor reads "10" I'd like to average all the numbers that were received since the

Re: Multilabel classification using logistic regression

2015-05-27 Thread Joseph Bradley
It looks like you are training each model i (for label i) by only using data with label i. You need to use all of your data to train each model so the models can compare each label i with the other labels (roughly speaking). However, what you're doing is multiclass (not multilabel) classification

Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-27 Thread Stephen Boesch
Here is an example after git clone-ing the latest 1.4.0-SNAPSHOT. The first 3 runs (FINISHED) were successful and connected quickly. The fourth run (ALIVE) is failing on connection/association.
URL: spark://mellyrn.local:7077
REST URL: spark://mellyrn.local:6066 (cluster mode)
Workers: 1
Cores: 8 Total, 0

Re: debug jsonRDD problem?

2015-05-27 Thread Ted Yu
Can you tell us a bit more about (the schema of) your JSON? You can find sample JSON in sql/core/src/test/scala/org/apache/spark/sql/json/TestJsonData.scala Cheers On Wed, May 27, 2015 at 12:33 PM, Michael Stone wrote: > Can anyone provide some suggestions on how to debug this? Using spark > 1.3

Re: debug jsonRDD problem?

2015-05-27 Thread Michael Stone
On Wed, May 27, 2015 at 01:13:43PM -0700, Ted Yu wrote: Can you tell us a bit more about (schema of) your JSON ? It's fairly simple, consisting of 22 fields with values that are mostly strings or integers, except that some of the fields are objects with http header/value pairs. I'd guess it's

Re: debug jsonRDD problem?

2015-05-27 Thread Ted Yu
Looks like the exception was caused by resolved.get(prefix ++ a) returning None: a => StructField(a.head, resolved.get(prefix ++ a).get, nullable = true) There are three occurrences of resolved.get() in createSchema() - None should be handled better in these places. My two cents. On Wed

Re: How to get the best performance with LogisticRegressionWithSGD?

2015-05-27 Thread Joseph Bradley
The model is learned using an iterative convex optimization algorithm. "numIterations," "stepSize" and "miniBatchFraction" are for those; you can see details here: http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer http://spark.apache.org/docs/latest/mllib-optim
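
For reference, the Scala overload being discussed looks like this (values hypothetical; training is assumed to be an RDD[LabeledPoint]):

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

    // train(input, numIterations, stepSize, miniBatchFraction)
    val model = LogisticRegressionWithSGD.train(training, 200, 0.1, 1.0)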

SF / East Bay Area Stream Processing Meetup next Thursday (6/4)

2015-05-27 Thread Siva Jagadeesan
http://www.meetup.com/Bay-Area-Stream-Processing/events/219086133/
Thursday, June 4, 2015, 6:45 PM
TubeMogul, 1250 53rd St #1, Emeryville, CA
6:45PM to 7:00PM - Socializing
7:00PM to 8:00PM - Talks
8:00PM to 8

Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread dgoldenberg
Hi, With the no receivers approach to streaming from Kafka, is there a way to set something like spark.streaming.receiver.maxRate so as not to overwhelm the Spark consumers? What would be some of the ways to throttle the streamed messages so that the consumers don't run out of memory? -- Vie

Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-27 Thread dgoldenberg
Hi, I'm trying to understand if there are design patterns for autoscaling Spark (add/remove slave machines to the cluster) based on the throughput. Assuming we can throttle Spark consumers, the respective Kafka topics we stream data from would start growing. What are some of the ways to generate

Re: Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread Ted Yu
Have you seen http://stackoverflow.com/questions/29051579/pausing-throttling-spark-spark-streaming-application ? Cheers On Wed, May 27, 2015 at 4:11 PM, dgoldenberg wrote: > Hi, > > With the no receivers approach to streaming from Kafka, is there a way to > set something like spark.streaming.re

Re: Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread Tathagata Das
You can throttle the no-receiver direct Kafka stream using spark.streaming.kafka.maxRatePerPartition On Wed, May 27, 2015 at 4:34 PM, Ted Yu wrote: > Have you seen > http://stackoverflow.com/questions/29051579/pausing-thro
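
For example (value hypothetical; the limit is in messages per second per Kafka partition):

    spark-submit \
      --conf spark.streaming.kafka.maxRatePerPartition=1000 \
      ...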

Re: Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread Dmitry Goldenberg
Got it, thank you, Tathagata and Ted. Could you comment on my other question as well? Basically, I'm trying to get a handle on a good approa

Re: Multilabel classification using logistic regression

2015-05-27 Thread Peter Garbers
Hi Joseph, I looked at that, but it seems that LogisticRegressionWithLBFGS's run method takes RDD[LabeledPoint] objects, so I'm not sure it's exactly how one would use it in the way I think you're describing. On Wed, May 27, 2015 at 4:04 PM, Joseph Bradley wrote: > It looks like you are training ea
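
For what it's worth, multiclass training with LBFGS looks roughly like this (k=10 hypothetical; training is one RDD[LabeledPoint] with labels 0 to k-1):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

    // setNumClasses trains one multinomial model over all labels,
    // rather than k separately-trained one-vs-rest models.
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(10)
      .run(training)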

Re: How does spark manage the memory of executor with multiple tasks

2015-05-27 Thread canan chen
Thanks Yong, this is very helpful. I also found ShuffleMemoryManager, which is used to allocate memory across tasks in one executor. >> These 2 tasks have to share the 2G heap memory. I don't think specifying the memory per task is a good idea, as tasks run at the thread level, and memory only a

Value for SPARK_EXECUTOR_CORES

2015-05-27 Thread Mulugeta Mammo
My executor has the following spec (lscpu):
CPU(s): 16
Core(s) per socket: 4
Socket(s): 2
Thread(s) per core: 2
The CPU count is obviously 4*2*2 = 16. My question is what value Spark is expecting in SPARK_EXECUTOR_CORES: the CPU count (16) or the total # of physical cores (4 * 2 = 8)? Thanks

Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-27 Thread Yana Kadiyska
What does your master log say -- normally the master should NEVER shut down... you should be able to spark-submit to infinity with no issues... So the question about high variance on startup is one issue, but the other thing that's puzzling to me is why your master is ever down to begin with... (assum

Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-27 Thread Sanjay Subramanian
hey guys, On the Hive/Hadoop ecosystem we are using the Cloudera distribution CDH 5.2.x; there are about 300+ hive tables. The data is stored as text (moving slowly to Parquet) on HDFS. I want to use SparkSQL, point to the Hive metadata, and be able to define JOINS etc using a programming structure

Re: SparkR Jobs Hanging in collectPartitions

2015-05-27 Thread Shivaram Venkataraman
Could you try to see which phase is causing the hang ? i.e. If you do a count() after flatMap does that work correctly ? My guess is that the hang is somehow related to data not fitting in the R process memory but its hard to say without more diagnostic information. Thanks Shivaram On Tue, May 26

RE: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-27 Thread Rajesh_Kalluri
Did you check https://drive.google.com/file/d/0B7tmGAdbfMI2OXl6azYySk5iTGM/edit and http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation Not sure if the spark kafka receiver emits metrics on the lag, check this link out http://

RE: Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-27 Thread Cheng, Hao
Yes, but be sure you put the hive-site.xml under your classpath. Did you run into any problems? Cheng Hao From: Sanjay Subramanian [mailto:sanjaysubraman...@yahoo.com.INVALID] Sent: Thursday, May 28, 2015 8:53 AM To: user Subject: Pointing SparkSQL to existing Hive Metadata with data file locations in
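
A minimal sketch of that setup (table names hypothetical), once hive-site.xml is on the classpath:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // Tables registered in the Hive metastore are queryable directly, joins included:
    val joined = hiveContext.sql(
      "SELECT a.id, b.name FROM db.table_a a JOIN db.table_b b ON a.id = b.id")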

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-27 Thread Dmitry Goldenberg
Thanks, Rajesh. I think that acquiring/relinquishing executors is important, but I feel like there are at least two layers of resource allocation and autoscaling. It seems that acquiring and relinquishing executors is a way to optimize resource utilization within a pre-set Spark cluster of machine

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
BTW, is there an option to set file permission for spark event logs? Jianshi On Thu, May 28, 2015 at 11:25 AM, Jianshi Huang wrote: > Hmm...all files under the event log folder has permission 770 but > strangely my account cannot read other user's files. Permission denied. > > I'll sort it out

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
Hmm... all files under the event log folder have permission 770, but strangely my account cannot read other users' files. Permission denied. I'll sort it out with our Hadoop admin. Thanks for the help! Jianshi On Thu, May 28, 2015 at 12:13 AM, Marcelo Vanzin wrote: > Then: > - Are all files read

Re: Spark on Mesos vs Yarn

2015-05-27 Thread Bharath Ravi Kumar
A follow-up: considering that spark on mesos is indeed important to databricks, its partners and the community, fundamental issues like SPARK-6284 shouldn't be languishing for this long. A mesos cluster hosting diverse (i.e. multi-tenant) workloads is a common scenario in production for serious us

Re: Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-27 Thread ayan guha
Yes, you are on the right path. The only thing to remember is placing the hive-site.xml in the correct path so spark can talk to the hive metastore. Best, Ayan On 28 May 2015 10:53, "Sanjay Subramanian" wrote: > hey guys > > On the Hive/Hadoop ecosystem we are using the Cloudera distribution CDH 5.2.x; > there are abo

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-27 Thread Ted Yu
bq. detect the presence of a new node and start utilizing it My understanding is that Spark is concerned with managing executors. Whether request for an executor is fulfilled on an existing node or a new node is up to the underlying cluster manager (YARN e.g.). Assuming the cluster is single tenan

Fwd: Model weights of linear regression becomes abnormal values

2015-05-27 Thread Maheshakya Wijewardena
Hi, I'm trying to use Spark's LinearRegressionWithSGD in PySpark with the attached dataset. The code is attached. When I check the model weights vector after training, it contains nan values: [nan,nan,nan,nan,nan,nan,nan,nan] But for some data sets, this problem does not occur. What might be

Re: Model weights of linear regression becomes abnormal values

2015-05-27 Thread DB Tsai
LinearRegressionWithSGD requires tuning the step size and # of iterations very carefully. Please try the Linear Regression with elastic net implementation in Spark 1.4 in the ML framework, which uses a quasi-Newton method and determines the step size automatically. That implementation also matches the re
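
In the 1.4 ML API that looks roughly like this (parameter values hypothetical; trainingDF is assumed to be a DataFrame with "label" and "features" columns):

    import org.apache.spark.ml.regression.LinearRegression

    val lr = new LinearRegression()
      .setMaxIter(100)
      .setRegParam(0.01)
      .setElasticNetParam(0.5) // 0 = pure L2 (ridge), 1 = pure L1 (lasso)
    val model = lr.fit(trainingDF)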

Re: Model weights of linear regression becomes abnormal values

2015-05-27 Thread Maheshakya Wijewardena
Thanks for the information. I'll try that out with Spark 1.4. On Thu, May 28, 2015 at 9:54 AM, DB Tsai wrote: > LinearRegressionWithSGD requires to tune the step size and # of > iteration very carefully. Please try Linear Regression with elastic > net implementation in Spark 1.4 in ML framework,

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-27 Thread Ji ZHANG
Hi, Yes, I'm using createStream, but the storageLevel param is MEMORY_AND_DISK_SER_2 by default. Besides, the driver's memory is also growing. I don't think Kafka messages would be cached in the driver. On Thu, May 28, 2015 at 12:24 AM, Akhil Das wrote: > Are you using the createStream or createDir

Re: Model weights of linear regression becomes abnormal values

2015-05-27 Thread 吴明瑜
Sorry, I meant the parameter step. 2015-05-28 12:21 GMT+08:00 Maheshakya Wijewardena : > What is the parameter for the learning rate alpha? LinearRegressionWithSGD > has only the following parameters. > > > @param data: The training data. >> @param iterations: The number of iterati

How to use Eclipse on Windows to build Spark environment?

2015-05-27 Thread Nan Xiao
Hi all, I want to use Eclipse on Windows to set up a Spark build environment, but I find the reference page (https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-IDESetup) doesn't contain any guidance about Eclipse. Could anyone give tutorials or links about how to use