Re: [Spark] RDDs are not persisting in memory

2016-10-11 Thread diplomatic Guru
= 1224.6 MB. Storage limit = 1397.3 MB. Therefore, I repartitioned the RDDs for better memory utilisation, which resolved the issue. Kind regards, Guru On 11 October 2016 at 11:23, diplomatic Guru <diplomaticg...@gmail.com> wrote: > @Song, I have called an action but it did not
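
For reference, a minimal sketch of the fix described in this thread, using the Spark 1.6 Java API; the input path and the partition count of 48 are hypothetical:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;

    JavaSparkContext sc = new JavaSparkContext("local[2]", "cache-demo");
    // Smaller partitions are more likely to fit under the per-executor storage limit.
    JavaRDD<String> data = sc.textFile("/data/input").repartition(48);
    data.persist(StorageLevel.MEMORY_ONLY());
    data.count(); // persist() is lazy: an action must run before the Storage tab shows anything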

Re: [Spark] RDDs are not persisting in memory

2016-10-11 Thread diplomatic Guru
t; > Regards, > Chin Wei > > On Tue, Oct 11, 2016 at 6:14 AM, diplomatic Guru <diplomaticg...@gmail.com > > wrote: > >> Hello team, >> >> Spark version: 1.6.0 >> >> I'm trying to persist done data into memory for reusing them. However, >

[Spark] RDDs are not persisting in memory

2016-10-10 Thread diplomatic Guru
Hello team, Spark version: 1.6.0 I'm trying to persist some data in memory for reuse. However, when I call rdd.cache() or rdd.persist(StorageLevel.MEMORY_ONLY()) it does not store the data, as I cannot see any RDD information under the WebUI (Storage tab). Therefore I tried

[Spark + MLlib] how to update offline model with the online model

2016-06-22 Thread diplomatic Guru
Hello all, I have built a Spark batch model using MLlib and a streaming online model. Now I would like to load the offline model in the streaming job and then apply and update the model. Could you please advise me how to do it? Is there an example to look at? The streaming model does not allow saving or
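
A sketch of one way to seed the online model with the offline one, assuming both are linear regression models; `jssc`, `labeledStream`, and the model path are hypothetical:

    import org.apache.spark.mllib.regression.LinearRegressionModel;
    import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD;

    // Load the offline (batch) model and use its weights as the starting point.
    LinearRegressionModel offline =
        LinearRegressionModel.load(jssc.sparkContext().sc(), "/models/offline");
    StreamingLinearRegressionWithSGD online = new StreamingLinearRegressionWithSGD()
        .setInitialWeights(offline.weights());
    online.trainOn(labeledStream); // JavaDStream<LabeledPoint>; updates the weights every batch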

Fwd: [Spark + MLlib] How to prevent negative values in Linear regression?

2016-06-21 Thread diplomatic Guru
but wanted to find out. Thanks. On 21 June 2016 at 13:55, Sean Owen <so...@cloudera.com> wrote: > There's nothing inherently wrong with a regression predicting a > negative value. What is the issue, more specifically? > > On Tue, Jun 21, 2016 at 1:38 PM, diplomatic Guru > <

[Spark + MLlib] How to prevent negative values in Linear regression?

2016-06-21 Thread diplomatic Guru
Hello all, I have a job for forecasting using linear regression, but sometimes I'm getting a negative prediction. How do I prevent this? Thanks.
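
A common workaround, sketched in Java; `model` and `features` are assumed to exist. Clamping keeps the model untouched; alternatively, train on log1p(label) and apply Math.expm1 to predictions so outputs can never go negative:

    // Clamp the raw prediction at zero.
    double prediction = Math.max(0.0, model.predict(features));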

StreamingLinearRegression Java example

2016-05-09 Thread diplomatic Guru
Hello, I'm trying to find an example of using StreamingLinearRegression in Java, but couldn't find any. There are examples for Scala but not for Java. Has anyone got an example that I can take a look at? Thanks.
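
In case it helps others searching the archive, a minimal Java sketch, assuming Spark 1.6 and Java 8; paths, batch interval, and feature count are hypothetical:

    import org.apache.spark.SparkConf;
    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.mllib.regression.LabeledPoint;
    import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    SparkConf conf = new SparkConf().setAppName("StreamingLRJava");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

    // Parse MLlib's text format, e.g. "(3.5,[1.0,2.0,3.0])", into LabeledPoints.
    JavaDStream<LabeledPoint> training =
        jssc.textFileStream("/stream/train").map(LabeledPoint::parse);

    StreamingLinearRegressionWithSGD model = new StreamingLinearRegressionWithSGD()
        .setInitialWeights(Vectors.zeros(3)); // 3 features, hypothetically
    model.trainOn(training);
    model.predictOn(training.map(LabeledPoint::features)).print();

    jssc.start();
    jssc.awaitTermination();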

Could we use Sparkling Water Lib with Spark Streaming

2016-05-05 Thread diplomatic Guru
Hello all, I was wondering if it is possible to use H2O with Spark Streaming for online prediction?

Re: [Streaming + MLlib] Is only linear regression supported by online learning?

2016-03-09 Thread diplomatic Guru
Could someone verify this for me? On 8 March 2016 at 14:06, diplomatic Guru <diplomaticg...@gmail.com> wrote: > Hello all, > > I'm using Random Forest for my machine learning (batch), I would like to > use online prediction using Streaming job. However, the document on

[Streaming + MLlib] Is only linear regression supported by online learning?

2016-03-08 Thread diplomatic Guru
Hello all, I'm using Random Forest for my machine learning (batch) job, and I would like to do online prediction using a Streaming job. However, the documentation only mentions a linear algorithm for regression. Could we not use other algorithms?

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread diplomatic Guru
Losses.scala > > When passing the Loss, you should be able to do something like: > > Losses.fromString("leastSquaresError") > > On Mon, Feb 29, 2016 at 10:03 AM, diplomatic Guru < > diplomaticg...@gmail.com> wrote: > >> It's strange as you are co
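
Putting the suggestion together as a Java sketch; "leastAbsoluteError" is the string Losses.scala maps to AbsoluteError, and `trainingData` is a hypothetical JavaRDD<LabeledPoint>:

    import org.apache.spark.mllib.tree.GradientBoostedTrees;
    import org.apache.spark.mllib.tree.configuration.BoostingStrategy;
    import org.apache.spark.mllib.tree.loss.Losses;
    import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel;

    BoostingStrategy strategy = BoostingStrategy.defaultParams("Regression");
    // AbsoluteError instead of the default SquaredError.
    strategy.setLoss(Losses.fromString("leastAbsoluteError"));
    GradientBoostedTreesModel model = GradientBoostedTrees.train(trainingData, strategy);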

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread diplomatic Guru
n Mellott <kevin.r.mell...@gmail.com> wrote: > Looks like it should be present in 1.3 at > org.apache.spark.mllib.tree.loss.AbsoluteError > > > spark.apache.org/docs/1.3.0/api/java/org/apache/spark/mllib/tree/loss/AbsoluteError.html > > On Mon, Feb 29, 2016 at 9:46 AM, d

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread diplomatic Guru
bject, since that object implements the Loss interface. > For example. > > val loss = new AbsoluteError() > boostingStrategy.setLoss(loss) > > On Mon, Feb 29, 2016 at 9:33 AM, diplomatic Guru <diplomaticg...@gmail.com > > wrote: > >> Hi Kevin, >> >> Y

[MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread diplomatic Guru
Hello guys, I think the default Loss algorithm is Squared Error for regression, but how do I change that to Absolute Error in Java? Could you please show me an example?

Re: [MLlib] What is the best way to forecast the next month page visit?

2016-02-18 Thread diplomatic Guru
> You should have at the end for January and PageA something like: > > LabeledPoint (label , (0,0,1,0,0,01,1.0,2.0,3.0)) > > Pass the LabeledPoint to the ML model. > > Test it. > > PS: label is what you want to predict. > > On 02/02/2016, at 20:44, diplomatic Guru <diplomatic
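
The quoted reply, spelled out as a Java sketch; the 12 leading slots one-hot encode the month and the trailing numbers stand in for the page's recent counts, all values hypothetical:

    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.mllib.regression.LabeledPoint;

    // One-hot month (March here) followed by the page's feature values.
    double[] features = {0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.0, 2.0, 3.0};
    // The label is what you want to predict: next month's view count.
    LabeledPoint point = new LabeledPoint(1250.0, Vectors.dense(features));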

Re: [MLlib] What is the best way to forecast the next month page visit?

2016-02-02 Thread diplomatic Guru
t me know what I'm doing wrong? PS: My cluster is running Spark 1.3.0, which doesn't support StringIndexer or OneHotEncoder, but for testing this I've installed 1.6.0 on my local machine. Cheers. On 2 February 2016 at 10:25, Jorge Machado <jom...@me.com> wrote: > Hi Guru, > >

Re: [MLlib] What is the best way to forecast the next month page visit?

2016-02-01 Thread diplomatic Guru
Any suggestions please? On 29 January 2016 at 22:31, diplomatic Guru <diplomaticg...@gmail.com> wrote: > Hello guys, > > I'm trying to understand how I could predict the next month's page views based > on the previous access pattern. > > For example, I've collected statistic

[MLlib] What is the best way to forecast the next month page visit?

2016-01-29 Thread diplomatic Guru
Hello guys, I'm trying to understand how I could predict the next month's page views based on the previous access pattern. For example, I've collected statistics on page views, e.g. Page,UniqueView - pageA, 1 pageB, 999 ... pageZ, 200 I aggregate the statistics monthly.

[Spark] Reading avro file in Spark 1.3.0

2016-01-25 Thread diplomatic Guru
Hello guys, I've been trying to read an Avro file using Spark's DataFrame API, but it's throwing this error: java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.read()Lorg/apache/spark/sql/DataFrameReader; This is what I've done so far: I've added the dependency to pom.xml:
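
For later readers: SQLContext.read() only appeared in Spark 1.4, so this NoSuchMethodError usually means the job was compiled against a newer Spark than the 1.3.0 the cluster runs. On 1.3 the equivalent is load(); a sketch, assuming the spark-avro data source (com.databricks:spark-avro_2.10:1.0.0 was the 1.3-era artifact) and a hypothetical path:

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    SQLContext sqlContext = new SQLContext(jsc); // jsc: an existing JavaSparkContext
    // Spark 1.3 predates read(); load() with an explicit source does the same job.
    DataFrame events = sqlContext.load("/data/events.avro", "com.databricks.spark.avro");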

Obtaining metrics of an individual Spark job

2015-12-07 Thread diplomatic Guru
Hello team, I need to present the Spark job performance to my management. I could get the execution time by measuring the start and finish times of the job (including overhead). However, I am not sure how to get the other metrics, e.g. CPU, I/O, memory, etc. I want to measure the individual job,
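
One option, assuming Spark 1.4 or later: the driver UI exposes a REST API with per-stage task metrics (run time, GC time, shuffle and I/O bytes); OS-level CPU would still need external monitoring. The host and app-id below are placeholders:

    curl http://driver-host:4040/api/v1/applications                     # list apps and their IDs
    curl http://driver-host:4040/api/v1/applications/<app-id>/stages     # per-stage task metrics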

[Spark Streaming] How to clear old data from Stream State?

2015-11-25 Thread diplomatic Guru
Hello, I know how I could clear the old state depending on the input value: if some condition determines that the state is old, then returning null will invalidate the record. But this is only feasible if a new record arrives that matches the old key. What if no new data arrives
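
One answer to "what if no new data arrives": updateStateByKey calls the update function for every existing key on every batch, with an empty values list for idle keys, so a timestamp kept inside the state lets you expire them. A Java sketch, assuming Spark 1.x (Guava Optional) and a hypothetical state shape and TTL:

    import com.google.common.base.Optional;
    import java.util.List;
    import org.apache.spark.api.java.function.Function2;
    import scala.Tuple2;

    final long TTL_MS = 30 * 60 * 1000; // hypothetical time-to-live
    // State = (runningCount, lastUpdatedMs).
    Function2<List<Long>, Optional<Tuple2<Long, Long>>, Optional<Tuple2<Long, Long>>> update =
        (values, state) -> {
          long now = System.currentTimeMillis();
          if (values.isEmpty()) {
            // Called even for idle keys, so stale state can be dropped here.
            if (state.isPresent() && now - state.get()._2() > TTL_MS) {
              return Optional.absent(); // removes the key from the state
            }
            return state;
          }
          long count = state.isPresent() ? state.get()._1() : 0L;
          for (long v : values) count += v;
          return Optional.of(new Tuple2<>(count, now));
        };
    // usage: JavaPairDStream<String, Tuple2<Long, Long>> states = pairs.updateStateByKey(update);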

[SPARK STREAMING] multiple hosts and multiple ports for a Streaming job

2015-11-19 Thread diplomatic Guru
Hello team, I was wondering whether it is a good idea to have multiple hosts and multiple ports for a Spark job. Let's say that there are two hosts and each has 2 ports; is this a good idea? If this is not an issue, then what is the best way to do it? Currently, we pass it as an argument comma
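
A sketch of the usual pattern: one receiver per host:port, unioned into a single stream. `jssc` and the comma-separated argument format are hypothetical; note that each receiver occupies one executor core:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.streaming.api.java.JavaDStream;

    // e.g. args[0] = "host1:9999,host1:9998,host2:9999,host2:9998"
    List<JavaDStream<String>> streams = new ArrayList<>();
    for (String hostPort : args[0].split(",")) {
      String[] hp = hostPort.split(":");
      streams.add(jssc.socketTextStream(hp[0], Integer.parseInt(hp[1])));
    }
    JavaDStream<String> unified = jssc.union(streams.get(0), streams.subList(1, streams.size()));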

How to enable debug in Spark Streaming?

2015-11-03 Thread diplomatic Guru
I have an issue with a Spark Streaming job that appears to be running but not producing any results. Therefore, I would like to enable debugging mode to get as much logging as possible.
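
For the archive, the standard knob: Spark 1.x logs through log4j, configured from conf/log4j.properties (copy the shipped template first). A minimal example; the narrowed logger is optional but keeps the volume sane:

    # cp conf/log4j.properties.template conf/log4j.properties, then:
    log4j.rootCategory=DEBUG, console
    # or, less noisily, only the streaming internals:
    log4j.logger.org.apache.spark.streaming=DEBUG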

[Spark Streaming] Why are some uncached RDDs growing?

2015-10-27 Thread diplomatic Guru
Hello All, When I checked my running Streaming job on the WebUI, I can see that some RDDs are listed that were not requested to be cached. What's more, they are growing! I've not asked for them to be cached. What are they? Are they the state (updateStateByKey)? Only the rows in white are being

Re: [Spark Streaming] Connect to Database only once at the start of Streaming job

2015-10-27 Thread diplomatic Guru
I know it uses a lazy model, which is why I was wondering. On 27 October 2015 at 19:02, Uthayan Suthakar wrote: > Hello all, > > What I wanted to do is configure the Spark Streaming job to read the > database using JdbcRDD and cache the results. This should occur only
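
A sketch of the read-once-and-cache idea, assuming the Java API of Spark 1.3+ (JdbcRDD.create); connection details, query, and bounds are all hypothetical:

    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.rdd.JdbcRDD;

    JavaRDD<String> lookup = JdbcRDD.create(
        jsc, // an existing JavaSparkContext
        new JdbcRDD.ConnectionFactory() {
          public java.sql.Connection getConnection() throws Exception {
            return DriverManager.getConnection("jdbc:oracle:thin:@db-host:1521:ORCL", "user", "pass");
          }
        },
        "SELECT value FROM lookup WHERE id >= ? AND id <= ?",
        1L, 100000L, 2, // lower bound, upper bound, partitions
        new Function<ResultSet, String>() {
          public String call(ResultSet rs) throws Exception { return rs.getString(1); }
        });
    lookup.cache();
    lookup.count(); // force the single read before the streaming context starts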

How to check whether the RDD is empty or not

2015-10-21 Thread diplomatic Guru
Hello All, I have a Spark Streaming job that should do some action only if the RDD is not empty. This can be done easily with a Spark batch RDD, as I could .take(1) and check whether it is empty or not. But this cannot be done with a Spark Streaming DStream JavaPairInputDStream

Re: How to check whether the RDD is empty or not

2015-10-21 Thread diplomatic Guru
r 2015 at 18:00, diplomatic Guru <diplomaticg...@gmail.com> wrote: > > Hello All, > > I have a Spark Streaming job that should do some action only if the RDD > is not empty. This can be done easily with a Spark batch RDD, as I could > .take(1) and check whether it is empty

Re: How to check whether the RDD is empty or not

2015-10-21 Thread diplomatic Guru
<t...@databricks.com> wrote: > What do you mean by checking when a "DStream is empty"? DStream represents > an endless stream of data, and checking at a point in time whether it is > empty or not does not make sense. > > FYI, there is RDD.isEmpty() > > > > On Wed
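
The suggested pattern as a Java sketch: check each batch's RDD inside foreachRDD. `stream` and the output path are hypothetical, and the lambda form assumes Spark 1.6+, where foreachRDD accepts a VoidFunction:

    stream.foreachRDD(rdd -> {
      // RDD.isEmpty() (Spark 1.3+) answers the question per batch.
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile("/out/batch-" + System.currentTimeMillis());
      }
    });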

Re: How to calculate average from multiple values

2015-09-17 Thread diplomatic Guru
> > --- > Robin East > *Spark GraphX in Action* Michael Malak and Robin East > Manning Publications Co. > http://www.manning.com/books/spark-graphx-in-action > > > > > > On 16 Sep 2015, at 15:46

How to calculate average from multiple values

2015-09-16 Thread diplomatic Guru
have a mapper that emits key/value pairs (composite keys and composite values separated by commas), e.g. *key:* a,b,c,d *Value:* 1,2,3,4,5 *key:* a1,b1,c1,d1 *Value:* 5,4,3,2,1 ... ... *key:* a,b,c,d *Value:* 5,4,3,2,1 I could easily SUM these values using reduceByKey, e.g. reduceByKey(new
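
The usual trick: carry a count alongside the sums so reduceByKey stays associative, then divide at the end. A Java sketch where `pairs` is a hypothetical JavaPairRDD<String, double[]> parsed from the composite keys/values above:

    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    // Attach a count of 1 to every value vector.
    JavaPairRDD<String, Tuple2<double[], Long>> withOne =
        pairs.mapValues(v -> new Tuple2<>(v, 1L));
    // Sum the vectors element-wise and the counts together.
    JavaPairRDD<String, Tuple2<double[], Long>> sums = withOne.reduceByKey((a, b) -> {
      double[] s = new double[a._1().length];
      for (int i = 0; i < s.length; i++) s[i] = a._1()[i] + b._1()[i];
      return new Tuple2<>(s, a._2() + b._2());
    });
    // Divide each summed component by the count to get the averages.
    JavaPairRDD<String, double[]> averages = sums.mapValues(t -> {
      double[] avg = new double[t._1().length];
      for (int i = 0; i < avg.length; i++) avg[i] = t._1()[i] / t._2();
      return avg;
    });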

Re: Is this a BUG?: Why is the Spark Flume Streaming job not deploying the Receiver to the specified host?

2015-08-18 Thread diplomatic Guru
stream or the older stream? Such problems of binding used to occur in the older push-based approach, hence we built the polling stream (pull-based). On Tue, Aug 18, 2015 at 4:45 AM, diplomatic Guru diplomaticg...@gmail.com wrote: I'm testing the Flume + Spark integration example (flume count
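
The pull-based stream mentioned in the reply, as a Java sketch; Flume must run org.apache.spark.streaming.flume.sink.SparkSink on the agent, and the host/port here are hypothetical:

    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.flume.FlumeUtils;
    import org.apache.spark.streaming.flume.SparkFlumeEvent;

    // Spark polls the Flume agent, so nothing has to bind on the Spark side.
    JavaReceiverInputDStream<SparkFlumeEvent> events =
        FlumeUtils.createPollingStream(jssc, "flume-agent-host", 9988);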

Re: Performance issue with Spark's foreachPartition method

2015-07-27 Thread diplomatic Guru
append to do bulk inserts to Oracle. On Thu, Jul 23, 2015 at 1:12 AM, diplomatic Guru diplomaticg...@gmail.com wrote: Thanks Robin for your reply. I'm pretty sure that writing to Oracle is taking longer, as writing to HDFS only takes ~5 minutes. The job is writing about ~5 million

Performance issue with Spark's foreachPartition method

2015-07-22 Thread diplomatic Guru
Hello all, We are having a major performance issue with Spark, which is holding us back from going live. We have a job that carries out computation on log files and writes the results into an Oracle DB. The reducer 'reduceByKey' has been set to parallelize by 4, as we don't want to establish too
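
For context, the batched-insert shape this thread converges on: one connection per partition, statements batched rather than executed row by row. A Java 8 sketch with hypothetical table, columns, and connection details; `results` is assumed to be a JavaPairRDD<String, Long>:

    results.foreachPartition(rows -> {
      java.sql.Connection conn = java.sql.DriverManager.getConnection(
          "jdbc:oracle:thin:@db-host:1521:ORCL", "user", "pass");
      conn.setAutoCommit(false);
      java.sql.PreparedStatement ps =
          conn.prepareStatement("INSERT INTO results (k, v) VALUES (?, ?)");
      int n = 0;
      while (rows.hasNext()) {
        scala.Tuple2<String, Long> row = rows.next();
        ps.setString(1, row._1());
        ps.setLong(2, row._2());
        ps.addBatch();
        if (++n % 1000 == 0) ps.executeBatch(); // flush every 1000 rows
      }
      ps.executeBatch();
      conn.commit();
      ps.close();
      conn.close();
    });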

Re: Performance issue with Spark's foreachPartition method

2015-07-22 Thread diplomatic Guru
be a performance problem. Robin On 22 Jul 2015, at 19:11, diplomatic Guru diplomaticg...@gmail.com wrote: Hello all, We are having a major performance issue with Spark, which is holding us back from going live. We have a job that carries out computation on log files and writes the results into Oracle

Re: query on Spark + Flume integration using push model

2015-07-10 Thread diplomatic Guru
...@sigmoidanalytics.com wrote: Here's an example https://github.com/przemek1990/spark-streaming Thanks Best Regards On Thu, Jul 9, 2015 at 4:35 PM, diplomatic Guru diplomaticg...@gmail.com wrote: Hello all, I'm trying to configure the flume to push data into a sink so that my stream job could pick

query on Spark + Flume integration using push model

2015-07-09 Thread diplomatic Guru
Hello all, I'm trying to configure Flume to push data into a sink so that my stream job can pick up the data. My events are in JSON format, but the Spark + Flume integration [1] document only refers to an Avro sink. [1] https://spark.apache.org/docs/latest/streaming-flume-integration.html I
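
On the Flume side this is just an Avro sink pointed at the Spark receiver; Avro is only the transport, and the JSON payload rides along as the event body (decode it in the job from the event's body bytes). A hypothetical flume.conf fragment:

    agent.sinks.sparkSink.type = avro
    agent.sinks.sparkSink.hostname = spark-receiver-host
    agent.sinks.sparkSink.port = 9988
    agent.sinks.sparkSink.channel = memoryChannel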

Spark performance issue

2015-07-03 Thread diplomatic Guru
Hello guys, I'm after some advice on Spark performance. I have a MapReduce job that reads inputs, carries out a simple calculation, and writes the results into HDFS. I've implemented the same logic in a Spark job. When I tried both jobs on the same datasets, I'm getting different execution times, which is

load Java properties file in Spark

2015-06-29 Thread diplomatic Guru
I want to store the Spark application arguments, such as the input file and output file, in a Java properties file and pass that file to the Spark driver. I'm using spark-submit for submitting the job but couldn't find a parameter to pass the properties file in. Have you got any suggestions?
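
One simple route: spark-submit passes everything after the application jar straight to main(), so hand over the properties file path as args[0] and load it with java.util.Properties. A sketch with hypothetical keys:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.Properties;

    // spark-submit --class MyApp my-app.jar /path/app.properties
    Properties props = new Properties();
    try (InputStream in = new FileInputStream(args[0])) {
      props.load(in);
    }
    String inputPath = props.getProperty("input.path");
    String outputPath = props.getProperty("output.path");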

Spark job throwing “java.lang.OutOfMemoryError: GC overhead limit exceeded”

2015-06-15 Thread diplomatic Guru
Hello All, I have a Spark job that throws "java.lang.OutOfMemoryError: GC overhead limit exceeded". The job is trying to process a file of size 4.5 GB. I've tried the following Spark configuration: --num-executors 6 --executor-memory 6G --executor-cores 6 --driver-memory 3G I tried increasing more
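
Not a definitive fix, but the usual first lever for GC-overhead errors is more, smaller partitions, so each task holds less at once; the count of 200 below is purely hypothetical:

    // Ask for more input partitions up front instead of relying on the HDFS block count.
    JavaRDD<String> input = sc.textFile("/data/input-4.5g", 200);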

Could Spark batch processing live within Spark Streaming?

2015-06-11 Thread diplomatic Guru
Hello all, I was wondering if it is possible to have a high-latency batch processing job coexist within a Spark Streaming job? If that's possible, then could we share the state of the batch job with the Spark Streaming job? For example, when the long-running batch computation is complete, could we

Re: How do I access the SPARK SQL

2014-04-24 Thread diplomatic Guru
...@databricks.com wrote: Yeah, you'll need to run `sbt publish-local` to push the jars to your local maven repository (~/.m2) and then depend on version 1.0.0-SNAPSHOT. On Thu, Apr 24, 2014 at 9:58 AM, diplomatic Guru diplomaticg...@gmail.com wrote: It's a simple application based on the People
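
For a Maven project, the equivalent of depending on the locally published build would be something like this (2014-era coordinates, assuming Scala 2.10):

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>1.0.0-SNAPSHOT</version>
    </dependency>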

Re: How do I access the SPARK SQL

2014-04-24 Thread diplomatic Guru
It worked!! Many thanks for your brilliant support. On 24 April 2014 18:20, diplomatic Guru diplomaticg...@gmail.com wrote: Many thanks for your prompt reply. I'll try your suggestions and will get back to you. On 24 April 2014 18:17, Michael Armbrust mich...@databricks.com wrote: Oh

How do I access the SPARK SQL

2014-04-23 Thread diplomatic Guru
Hello Team, I'm new to SPARK and just came across SPARK SQL, which appears to be interesting, but I'm not sure how I could get it. I know it's an Alpha version but am not sure if it's available to the community yet. Many thanks. Raj.