Re: Spark Streaming withWatermark

2018-02-06 Thread Vishnu Viswanath
Quoting the previous message: "…06 11:59:00 <-- this message should not be counted, right? However, in my test this one is still counted." Quoting an earlier reply from Vishnu Viswanath: "Yes, that is correct." …

Re: Spark Streaming withWatermark

2018-02-06 Thread Vishnu Viswanath
of "timestamp" > field of the input and never moves down, is that correct understanding? > > > On Tue, Feb 6, 2018 at 11:47 AM, Vishnu Viswanath < > vishnu.viswanat...@gmail.com> wrote: > >> Hi >> >> 20 second corresponds to when the window stat

Re: Spark Streaming withWatermark

2018-02-06 Thread Vishnu Viswanath
Hi, the 20 seconds corresponds to when the window state should be cleared. For a late message to be dropped, it should come in after you receive a message with event time >= window end time + 20 seconds. I wrote a post on this recently: …
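For reference, a minimal sketch of the behavior described above, using the Spark 2.x Structured Streaming API and the built-in rate source (which already carries an event-time "timestamp" column); the app name, rates, and window size are illustrative, not from the thread:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.window

    val spark = SparkSession.builder.appName("WatermarkSketch").master("local[2]").getOrCreate()
    import spark.implicits._

    // The rate source emits rows with an event-time column named "timestamp".
    val events = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

    val counts = events
      .withWatermark("timestamp", "20 seconds")    // keep window state 20s past the max event time seen
      .groupBy(window($"timestamp", "10 seconds")) // 10-second tumbling windows
      .count()

    // A late row is dropped only after some event arrives with
    // event time >= that row's window end + 20 seconds.
    val query = counts.writeStream.outputMode("update").format("console").start()
    query.awaitTermination()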

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-31 Thread Vishnu Viswanath
Hi Mans, Watermark in Spark is used to decide when to clear the state, so if the event is delayed past the point when the state is cleared by Spark, it will be ignored. I recently wrote a blog post on this: http://vishnuviswanath.com/spark_structured_streaming.html#watermark Yes, this state is…

Handling skewed data

2017-04-17 Thread Vishnu Viswanath
Hello All, Does anyone know if the skew-handling code mentioned in this talk, https://www.youtube.com/watch?v=bhYV0JOPd9Y, was added to Spark? If so, where can I look for more info: a JIRA, or a pull request? Thanks in advance. Regards, Vishnu Viswanath.

Re: Installing Spark on Mac

2016-03-04 Thread Vishnu Viswanath
Installing Spark on a Mac is similar to installing it on Linux. I use a Mac and have written a blog post on how to install Spark; here is the link: http://vishnuviswanath.com/spark_start.html Hope this helps. Quoting Simon Hafner: "I'd try…"

Re: Question about MEMORY_AND_DISK persistence

2016-02-28 Thread Vishnu Viswanath
Thank you Ashwin. Quoting Ashwin Giridharan: "Hi Vishnu, a partition will either be in memory or on disk. -Ashwin" …
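A minimal sketch illustrating that answer, assuming a spark-shell session where sc is the SparkContext (the RDD size and partition count are made up):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000000, 2)   // an RDD with 2 partitions
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count()                                 // an action materializes the cache

    // If only partition 1 fits in memory, partition 2 is written to disk in
    // full; a single partition is never split between memory and disk.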

Question about MEMORY_AND_DISK persistence

2016-02-28 Thread Vishnu Viswanath
Hi All, I have a question regarding persistence (MEMORY_AND_DISK). Suppose I am trying to persist an RDD which has 2 partitions; only partition 1 fits in memory completely, but some part of partition 2 could also fit. Will Spark keep that portion of partition 2 in memory and the rest on disk, …

Question on RDD caching

2016-02-04 Thread Vishnu Viswanath
…will store the partition in memory. 4. Therefore, each node can have partitions of different RDDs in its cache. Can someone please tell me if I am correct? Thanks and Regards, Vishnu Viswanath
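A small sketch of that last point, again assuming a spark-shell session where sc is the SparkContext (the RDDs themselves are throwaway examples):

    val rddA = sc.parallelize(1 to 100, 4).cache()
    val rddB = sc.parallelize(Seq("a", "b", "c"), 2).cache()
    rddA.count(); rddB.count()   // actions trigger the caching

    // Each executor's block manager may now hold partitions of both RDDs.
    // Inspect what is currently persisted (RDD id -> RDD):
    sc.getPersistentRDDs.foreach { case (id, r) => println(s"$id -> $r") }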

Re: Spark ML Random Forest output.

2015-12-04 Thread Vishnu Viswanath
Hi, As per my understanding, the probability matrix gives the probability that a particular item belongs to each class, so the one with the highest probability is your predicted class. Since you have converted your label to an indexed label, according to the model the classes are 0.0 to 9.0, and I…
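A short sketch of reading that off, assuming the Spark 2.x spark.ml API with an already fitted classification model and a testData DataFrame (both names are placeholders):

    import org.apache.spark.ml.linalg.Vector

    val predictions = model.transform(testData)

    predictions.select("probability", "prediction").take(5).foreach { row =>
      val probs = row.getAs[Vector]("probability")
      // "prediction" is simply the index of the largest probability, i.e. the
      // indexed label (0.0 .. numClasses - 1) produced by StringIndexer.
      println(s"argmax = ${probs.argmax}, prediction = ${row.getDouble(1)}")
    }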

Re: General question on using StringIndexer in SparkML

2015-12-02 Thread Vishnu Viswanath
Thank you Yanbo. It looks like this is available only in version 1.6. Can you tell me how/when I can download version 1.6? Thanks and Regards, Vishnu Viswanath. Quoting Yanbo Liang: "You can set "handleInvalid" to "s…"

Re: General question on using StringIndexer in SparkML

2015-12-02 Thread Vishnu Viswanath
Thank you. Quoting Yanbo Liang: "You can get 1.6.0-RC1 from http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-rc1-bin/ currently, but it's not the last release version." …

Re: General question on using StringIndexer in SparkML

2015-12-01 Thread Vishnu Viswanath
…is: for the column on which I am doing StringIndexing, the test data has values that were not present in the train data. Since fit() is done only on the train data, the indexing is failing. Can you suggest what can be done in this situation? Thanks, …

Re: General question on using StringIndexer in SparkML

2015-11-29 Thread Vishnu Viswanath
…and do the prediction. Thanks and Regards, Vishnu Viswanath. Quoting Yanbo Liang: "Hi Vishnu, the string and index map is generated at the model training step and used at the model prediction step. It means that the s…"

Re: General question on using StringIndexer in SparkML

2015-11-29 Thread Vishnu Viswanath
…this section of the Spark ML doc: http://spark.apache.org/docs/latest/ml-guide.html#how-it-works Quoting Vishnu Viswanath: "Thanks for the reply Yanbo. I understand that the mod…"

General question on using StringIndexer in SparkML

2015-11-28 Thread Vishnu Viswanath
…and A. So the StringIndexer will assign the indexes C -> 0.0, B -> 1.0, A -> 2.0. These indexes are different from what we used for modeling, so won't this give me a wrong prediction if I use StringIndexer? -- Thanks and Regards, Vishnu Viswanath, www.vishnuviswanath.com
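The pattern that avoids this, sketched below, is to fit the StringIndexer once on the training data and reuse the fitted model on the test data so the label-to-index mapping stays consistent; the column and DataFrame names are assumptions, and setHandleInvalid("skip") (available from Spark 1.6, as mentioned elsewhere in the thread) drops test rows whose label was never seen in training:

    import org.apache.spark.ml.feature.StringIndexer

    // trainDF and testDF are assumed DataFrames with a string "label" column.
    val indexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("labelIndex")
      .setHandleInvalid("skip")   // Spark 1.6+: skip rows with unseen labels

    val indexerModel = indexer.fit(trainDF)             // mapping built from trainDF only
    val trainIndexed = indexerModel.transform(trainDF)
    val testIndexed  = indexerModel.transform(testDF)   // the same mapping is reused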

Adding new column to Dataframe

2015-11-25 Thread Vishnu Viswanath
n("line2",df2("line")) org.apache.spark.sql.AnalysisException: resolved attribute(s) line#2330 missing from line#2326 in operator !Project [line#2326,line#2330 AS line2#2331]; ​ Thanks and Regards, Vishnu Viswanath *www.vishnuviswanath.com <http://www.vishnuviswanath.com>*

Re: Adding new column to Dataframe

2015-11-25 Thread Vishnu Viswanath
Yes, that's why I thought of adding a row number to both DataFrames and joining them on the row number. Is there any better way of doing this? Both DataFrames will always have the same number of rows, but are not related by any column to join on. Thanks and Regards, Vishnu Viswanath …

Re: Adding new column to Dataframe

2015-11-25 Thread Vishnu Viswanath
Quoting the suggested snippet: "…Window.partitionBy(df2.A).orderBy(df2.B); df3 = df2.select("client", "date", rowNumber().over(ws).alias("rn")).filter("rn < 0") Cheers" …
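A sketch of the row-number-and-join idea from this thread, in Scala against a recent Spark API (row_number is the current name of the 1.x rowNumber function; df1 and df2 are placeholder DataFrames assumed to have the same number of rows in the same order):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{lit, row_number}

    // With no partition key the window runs in a single partition, so this
    // is only suitable for modest amounts of data.
    val w = Window.orderBy(lit(1))
    val left   = df1.withColumn("rn", row_number().over(w))
    val right  = df2.withColumn("rn", row_number().over(w))
    val joined = left.join(right, "rn").drop("rn")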

how to use DataFrame.na.fill based on condition

2015-11-23 Thread Vishnu Viswanath
Hi, Can someone tell me if there is a way to use the fill method in DataFrameNaFunctions based on some condition? e.g., df.na.fill("value1", "column1", "condition1"); df.na.fill("value2", "column1", "condition2"). I want to fill nulls in column1 with either value1 or value2, …
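na.fill takes no condition argument, but the same effect can be sketched with when/otherwise; the "type" column below is a hypothetical stand-in for whatever boolean tests condition1 and condition2 are:

    import org.apache.spark.sql.functions.{col, when}

    val filled = df.withColumn("column1",
      when(col("column1").isNull && col("type") === "A", "value1")
        .when(col("column1").isNull && col("type") === "B", "value2")
        .otherwise(col("column1")))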

Re: how to use DataFrame.na.fill based on condition

2015-11-23 Thread Vishnu Viswanath
Quoting the suggestion: "DataFrame.replace(to_replace, value, subset=None), see http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace" Quoting the original question: "Hi, Can someone…"