Re: Spark Streaming withWatermark

2018-02-06 Thread Vishnu Viswanath
Could it be that these messages were processed in the same micro-batch? In
that case, the watermark is updated only after the batch finishes, so it has
no effect on the late data within that same batch.

On Tue, Feb 6, 2018 at 4:18 PM Jiewen Shao <fifistorm...@gmail.com> wrote:

> Ok, Thanks for confirmation.
>
> So based on my code, I have messages with following timestamps (converted
> to more readable format) in the following order:
>
> 2018-02-06 12:00:00
> 2018-02-06 12:00:01
> 2018-02-06 12:00:02
> 2018-02-06 12:00:03
> 2018-02-06 11:59:00  <-- this message should not be counted, right?
> however in my test, this one is still counted
>
>
>
> On Tue, Feb 6, 2018 at 2:05 PM, Vishnu Viswanath <
> vishnu.viswanat...@gmail.com> wrote:
>
>> Yes, that is correct.
>>
>> On Tue, Feb 6, 2018 at 4:56 PM, Jiewen Shao <fifistorm...@gmail.com>
>> wrote:
>>
>>> Vishnu, thanks for the reply
>>> so "event time" and "window end time" have nothing to do with current
>>> system timestamp, watermark moves with the higher value of "timestamp"
>>> field of the input and never moves down, is that correct understanding?
>>>
>>>
>>> On Tue, Feb 6, 2018 at 11:47 AM, Vishnu Viswanath <
>>> vishnu.viswanat...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> 20 second corresponds to when the window state should be cleared. For
>>>> the late message to be dropped, it should come in after you receive a
>>>> message with event time >= window end time + 20 seconds.
>>>>
>>>> I wrote a post on this recently:
>>>> http://vishnuviswanath.com/spark_structured_streaming.html#watermark
>>>>
>>>> Thanks,
>>>> Vishnu
>>>>
>>>> On Tue, Feb 6, 2018 at 12:11 PM Jiewen Shao <fifistorm...@gmail.com>
>>>> wrote:
>>>>
>>>>> sample code:
>>>>>
>>>>> Let's say Xyz is POJO with a field called timestamp,
>>>>>
>>>>> regarding code withWatermark("timestamp", "20 seconds")
>>>>>
>>>>> I expect the msg with timestamp 20 seconds or older will be dropped,
>>>>> what does 20 seconds compare to? based on my test nothing was dropped no
>>>>> matter how old the timestamp is, what did i miss?
>>>>>
>>>>> Dataset xyz = lines
>>>>> .as(Encoders.STRING())
>>>>> .map((MapFunction<String, Xyz>) value -> mapper.readValue(value, 
>>>>> Xyz.class), Encoders.bean(Xyz.class));
>>>>>
>>>>> Dataset aggregated = xyz.withWatermark("timestamp", "20 seconds")
>>>>> .groupBy(functions.window(xyz.col("timestamp"), "5 seconds"), 
>>>>> xyz.col("x") //tumbling window of size 5 seconds (timestamp)
>>>>> ).count();
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>
>>
>


Re: Spark Streaming withWatermark

2018-02-06 Thread Vishnu Viswanath
Yes, that is correct.

On Tue, Feb 6, 2018 at 4:56 PM, Jiewen Shao <fifistorm...@gmail.com> wrote:

> Vishnu, thanks for the reply
> so "event time" and "window end time" have nothing to do with current
> system timestamp, watermark moves with the higher value of "timestamp"
> field of the input and never moves down, is that correct understanding?
>
>
> On Tue, Feb 6, 2018 at 11:47 AM, Vishnu Viswanath <
> vishnu.viswanat...@gmail.com> wrote:
>
>> Hi
>>
>> 20 second corresponds to when the window state should be cleared. For the
>> late message to be dropped, it should come in after you receive a message
>> with event time >= window end time + 20 seconds.
>>
>> I wrote a post on this recently:
>> http://vishnuviswanath.com/spark_structured_streaming.html#watermark
>>
>> Thanks,
>> Vishnu
>>
>> On Tue, Feb 6, 2018 at 12:11 PM Jiewen Shao <fifistorm...@gmail.com>
>> wrote:
>>
>>> sample code:
>>>
>>> Let's say Xyz is POJO with a field called timestamp,
>>>
>>> regarding code withWatermark("timestamp", "20 seconds")
>>>
>>> I expect the msg with timestamp 20 seconds or older will be dropped,
>>> what does 20 seconds compare to? based on my test nothing was dropped no
>>> matter how old the timestamp is, what did i miss?
>>>
>>> Dataset xyz = lines
>>> .as(Encoders.STRING())
>>> .map((MapFunction<String, Xyz>) value -> mapper.readValue(value, 
>>> Xyz.class), Encoders.bean(Xyz.class));
>>>
>>> Dataset aggregated = xyz.withWatermark("timestamp", "20 seconds")
>>> .groupBy(functions.window(xyz.col("timestamp"), "5 seconds"), 
>>> xyz.col("x") //tumbling window of size 5 seconds (timestamp)
>>> ).count();
>>>
>>> Thanks
>>>
>>>
>


Re: Spark Streaming withWatermark

2018-02-06 Thread Vishnu Viswanath
Hi

The 20 seconds corresponds to how long the window state is kept before it is
cleared. For a late message to be dropped, it has to arrive after you have
received a message with event time >= window end time + 20 seconds.

I wrote a post on this recently:
http://vishnuviswanath.com/spark_structured_streaming.html#watermark
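
In code, the setup under discussion looks roughly like the sketch below
(Scala; "events" is a hypothetical streaming Dataset with an event-time
column "timestamp" - an illustration, not your actual job):

import org.apache.spark.sql.functions.{col, window}

val counts = events
  .withWatermark("timestamp", "20 seconds")
  .groupBy(window(col("timestamp"), "5 seconds"), col("x"))
  .count()

// State for the window [12:00:00, 12:00:05) is kept until the max event time
// seen exceeds 12:00:05 + 20 seconds; only events arriving after that point
// are dropped.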

Thanks,
Vishnu

On Tue, Feb 6, 2018 at 12:11 PM Jiewen Shao <fifistorm...@gmail.com> wrote:

> sample code:
>
> Let's say Xyz is POJO with a field called timestamp,
>
> regarding code withWatermark("timestamp", "20 seconds")
>
> I expect the msg with timestamp 20 seconds or older will be dropped, what
> does 20 seconds compare to? based on my test nothing was dropped no matter
> how old the timestamp is, what did i miss?
>
> Dataset xyz = lines
> .as(Encoders.STRING())
> .map((MapFunction<String, Xyz>) value -> mapper.readValue(value, 
> Xyz.class), Encoders.bean(Xyz.class));
>
> Dataset aggregated = xyz.withWatermark("timestamp", "20 seconds")
> .groupBy(functions.window(xyz.col("timestamp"), "5 seconds"), 
> xyz.col("x") //tumbling window of size 5 seconds (timestamp)
> ).count();
>
> Thanks
>
>


Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-31 Thread Vishnu Viswanath
Hi Mans,

The watermark in Spark is used to decide when to clear the state, so if an
event is delayed beyond the point at which Spark clears that state, it will
be ignored.
I recently wrote a blog post on this :
http://vishnuviswanath.com/spark_structured_streaming.html#watermark

Yes, this state is applicable to aggregations only. If you have only a map
function and don't want to process late records, you could do a filter based
on the event-time field, but I guess you will have to compare it against the
processing time, since there is no user-facing API to access the watermark.
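
A sketch of that filter-based approach (no aggregation, hence no watermark
state; "events" and its "eventTime" column are hypothetical):

import org.apache.spark.sql.functions.expr

val filtered = events.where(
  expr("eventTime >= current_timestamp() - interval 10 seconds"))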

-Vishnu

On Fri, Jan 26, 2018 at 1:14 PM, M Singh <mans2si...@yahoo.com.invalid>
wrote:

> Hi:
>
> I am trying to filter out records which are lagging behind (based on event
> time) by a certain amount of time.
>
> Is the watermark api applicable to this scenario (ie, filtering lagging
> records) or it is only applicable with aggregation ?  I could not get a
> clear understanding from the documentation which only refers to it's usage
> with aggregation.
>
> Thanks
>
> Mans
>


Handling skewed data

2017-04-17 Thread Vishnu Viswanath
Hello All,

Does anyone know if the skew handling code mentioned in this talk
https://www.youtube.com/watch?v=bhYV0JOPd9Y was added to spark?

If so, can someone point me to where to look for more info - a JIRA or pull request?

Thanks in advance.
Regards,
Vishnu Viswanath.


Printing MLpipeline model in Python.

2016-03-14 Thread VISHNU SUBRAMANIAN
HI All,

I am using Spark 1.6 and Pyspark.

I am trying to build a RandomForest classifier model using the ML pipeline
API in Python.

When I try to print the model, I get the value below.

RandomForestClassificationModel (uid=rfc_be9d4f681b92) with 10 trees

When I use the MLlib RandomForest model's toDebugString, I get all the rules
used for building the model.


  Tree 0:
If (feature 53 <= 0.0)
 If (feature 49 <= 0.0)
  If (feature 3 <= 1741.0)
   If (feature 47 <= 0.0)
Predict: 0.0
   Else (feature 47 > 0.0)
Predict: 0.0
  Else (feature 3 > 1741.0)
   If (feature 47 <= 0.0)
Predict: 1.0
   Else (feature 47 > 0.0)

How can I achieve the same thing using the ML pipelines model?
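
In Scala, the kind of access I am after would look roughly like the sketch
below (assuming the forest ends up as the last stage of the fitted
PipelineModel); I am looking for the PySpark equivalent:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.RandomForestClassificationModel

// "pipelineModel" is a hypothetical fitted PipelineModel whose last stage is the forest
val rfModel = pipelineModel.stages.last.asInstanceOf[RandomForestClassificationModel]
println(rfModel.toDebugString)  // full tree rules, like the MLlib output above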

Thanks in Advance.

Vishnu


Re: Installing Spark on Mac

2016-03-04 Thread Vishnu Viswanath
Installing Spark on a Mac is similar to how you install it on Linux.

I use a Mac and have written a blog post on how to install Spark; here is the
link: http://vishnuviswanath.com/spark_start.html

Hope this helps.

On Fri, Mar 4, 2016 at 2:29 PM, Simon Hafner  wrote:

> I'd try `brew install spark` or `apache-spark` and see where that gets
> you. https://github.com/Homebrew/homebrew
>
> 2016-03-04 21:18 GMT+01:00 Aida :
> > Hi all,
> >
> > I am a complete novice and was wondering whether anyone would be willing
> to
> > provide me with a step by step guide on how to install Spark on a Mac; on
> > standalone mode btw.
> >
> > I downloaded a prebuilt version, the second version from the top.
> However, I
> > have not installed Hadoop and am not planning to at this stage.
> >
> > I also downloaded Scala from the Scala website, do I need to download
> > anything else?
> >
> > I am very eager to learn more about Spark but am unsure about the best
> way
> > to do it.
> >
> > I would be happy for any suggestions or ideas
> >
> > Many thanks,
> >
> > Aida
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Installing-Spark-on-Mac-tp26397.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Question about MEOMORY_AND_DISK persistence

2016-02-28 Thread Vishnu Viswanath
Thank you Ashwin.

On Sun, Feb 28, 2016 at 7:19 PM, Ashwin Giridharan <ashwin.fo...@gmail.com>
wrote:

> Hi Vishnu,
>
> A partition will either be in memory or in disk.
>
> -Ashwin
> On Feb 28, 2016 15:09, "Vishnu Viswanath" <vishnu.viswanat...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I have a question regarding Persistence (MEMORY_AND_DISK)
>>
>> Suppose I am trying to persist an RDD which has 2 partitions and only 1
>> partition can be fit in memory completely but some part of partition 2 can
>> also be fit, will spark keep the portion of partition 2 in memory and rest
>> in disk, or will the whole 2nd partition be kept in disk.
>>
>> Regards,
>> Vishnu
>>
>


-- 
Thanks and Regards,
Vishnu Viswanath,
*www.vishnuviswanath.com <http://www.vishnuviswanath.com>*


Question about MEOMORY_AND_DISK persistence

2016-02-28 Thread Vishnu Viswanath
Hi All,

I have a question regarding Persistence (MEMORY_AND_DISK)

Suppose I am trying to persist an RDD that has 2 partitions, and only
partition 1 fits completely in memory, but part of partition 2 would also
fit. Will Spark keep that portion of partition 2 in memory and the rest on
disk, or will the whole second partition be kept on disk?
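
For reference, the persistence call I am talking about is just this (a
minimal sketch with a hypothetical RDD):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000, 2)   // hypothetical RDD with 2 partitions
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()                                 // materializes (and stores) the partitions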

Regards,
Vishnu


Question on RDD caching

2016-02-04 Thread Vishnu Viswanath
Hello,

When we call cache() or persist(MEMORY_ONLY), how does the request flow to
the nodes?
I am assuming this will happen:

1. The driver knows which nodes hold the partitions of the given
RDD (where is this info stored?)
2. It sends a cache request to each node's executor
3. The executor stores the partition in memory
4. Therefore, each node can have partitions of different RDDs in its cache.

Can someone please tell me if I am correct.
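
For concreteness, the usage I have in mind is just this (a minimal sketch;
note that caching is lazy, so partitions are only stored when the first
action computes them):

val rdd = sc.textFile("hdfs:///some/path")   // hypothetical input
rdd.cache()    // marks the RDD for caching; nothing is stored yet
rdd.count()    // first action: partitions are computed and cached on executors
rdd.count()    // subsequent actions read the cached partitions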

Thanks and Regards,
Vishnu Viswanath,


Re: how to covert millisecond time to SQL timeStamp

2016-02-01 Thread VISHNU SUBRAMANIAN
HI ,

If you need a data frame specific solution , you can try the below

import org.apache.spark.sql.functions.{col, from_unixtime}

df.select(from_unixtime(col("max(utcTimestamp)")/1000))

On Tue, 2 Feb 2016 at 09:44 Ted Yu  wrote:

> See related thread on using Joda DateTime:
> http://search-hadoop.com/m/q3RTtSfi342nveex1=RE+NPE+when+using+Joda+DateTime
>
> On Mon, Feb 1, 2016 at 7:44 PM, Kevin Mellott 
> wrote:
>
>> I've had pretty good success using Joda-Time
>>  for date/time manipulations
>> within Spark applications. You may be able to use the *DateTIme* constructor
>> below, if you are starting with milliseconds.
>>
>> DateTime
>>
>> public DateTime(long instant)
>>
>> Constructs an instance set to the milliseconds from 1970-01-01T00:00:00Z
>> using ISOChronology in the default time zone.
>> Parameters:instant - the milliseconds from 1970-01-01T00:00:00Z
>>
>> On Mon, Feb 1, 2016 at 5:51 PM, Andy Davidson <
>> a...@santacruzintegration.com> wrote:
>>
>>> What little I know about working with timestamps is based on
>>> https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html
>>>
>>> Using the example of dates formatted into human friend strings ->
>>> timeStamps I was able to figure out how to convert Epoch times to
>>> timestamps. The same trick did not work for millisecond times.
>>>
>>> Any suggestions would be greatly appreciated.
>>>
>>>
>>> Andy
>>>
>>> Working with epoch times
>>> 
>>>
>>> ref: http://www.epochconverter.com/
>>>
>>> Epoch timestamp: 1456050620
>>>
>>> Timestamp in milliseconds: 145605062
>>>
>>> Human time (GMT): Sun, 21 Feb 2016 10:30:20 GMT
>>>
>>> Human time (your time zone): 2/21/2016, 2:30:20 AM
>>>
>>>
>>> # Epoch time stamp example
>>>
>>> data = [
>>>
>>>   ("1456050620", "1456050621", 1),
>>>
>>>   ("1456050622", "14560506203", 2),
>>>
>>>   ("14560506204", "14560506205", 3)]
>>>
>>> df = sqlContext.createDataFrame(data, ["start_time", "end_time", "id"])
>>>
>>> ​
>>>
>>> # convert epoch time strings in to spark timestamps
>>>
>>> df = df.select(
>>>
>>>   df.start_time.cast("long").alias("start_time"),
>>>
>>>   df.end_time.cast("long").alias("end_time"),
>>>
>>>   df.id)
>>>
>>> df.printSchema()
>>>
>>> df.show(truncate=False)
>>>
>>> ​
>>>
>>> # convert longs to timestamps
>>>
>>> df = df.select(
>>>
>>>   df.start_time.cast("timestamp").alias("start_time"),
>>>
>>>   df.end_time.cast("timestamp").alias("end_time"),
>>>
>>>   df.id)
>>>
>>> df.printSchema()
>>>
>>> df.show(truncate=False)
>>>
>>> ​
>>>
>>> root
>>>  |-- start_time: long (nullable = true)
>>>  |-- end_time: long (nullable = true)
>>>  |-- id: long (nullable = true)
>>>
>>> +---+---+---+
>>> |start_time |end_time   |id |
>>> +---+---+---+
>>> |1456050620 |1456050621 |1  |
>>> |1456050622 |14560506203|2  |
>>> |14560506204|14560506205|3  |
>>> +---+---+---+
>>>
>>> root
>>>  |-- start_time: timestamp (nullable = true)
>>>  |-- end_time: timestamp (nullable = true)
>>>  |-- id: long (nullable = true)
>>>
>>> +-+-+---+
>>> |start_time   |end_time |id |
>>> +-+-+---+
>>> |2016-02-21 02:30:20.0|2016-02-21 02:30:21.0|1  |
>>> |2016-02-21 02:30:22.0|2431-05-28 02:03:23.0|2  |
>>> |2431-05-28 02:03:24.0|2431-05-28 02:03:25.0|3  |
>>> +-+-+---+
>>>
>>>
>>> In [21]:
>>>
>>> # working with millisecond times
>>>
>>> data = [
>>>
>>>   ("145605062", "145605062", 1)]
>>>
>>>   df = sqlContext.createDataFrame(data, ["start_time", "end_time", "id"])
>>>
>>> ​
>>>
>>> # convert epoch time strings in to spark timestamps
>>>
>>> df = df.select(
>>>
>>>   df.start_time.cast("long").alias("start_time"),
>>>
>>>   df.end_time.cast("long").alias("end_time"),
>>>
>>>   df.id)
>>>
>>> df.printSchema()
>>>
>>> df.show(truncate=False)
>>>
>>> ​
>>>
>>> # convert longs to timestamps
>>>
>>> df = df.select(
>>>
>>>   df.start_time.cast("timestamp").alias("start_time"),
>>>
>>>   df.end_time.cast("timestamp").alias("end_time"),
>>>
>>>   df.id)
>>>
>>> df.printSchema()
>>>
>>> df.show(truncate=False)
>>>
>>> root
>>>  |-- start_time: long (nullable = true)
>>>  |-- end_time: long (nullable = true)
>>>  |-- id: long (nullable = true)
>>>
>>> +-+-+---+
>>> |start_time   |end_time |id |
>>> +-+-+---+
>>> |145605062|145605062|1  |
>>> +-+-+---+
>>>
>>> root
>>>  |-- start_time: timestamp (nullable = true)
>>>  |-- end_time: timestamp (nullable = true)
>>>  |-- id: long (nullable = true)
>>>
>>> +--+--+---+
>>> |start_time|end_time  |id |
>>> +--+--+---+
>>> 

Re: How to accelerate reading json file?

2016-01-05 Thread VISHNU SUBRAMANIAN
HI ,

You can try this

sqlContext.read.format("json").option("samplingRatio","0.1").load("path")

If it still takes time, feel free to experiment with the samplingRatio.
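
Also, since you mention you already know the schema, you could supply it
explicitly so the inference pass is skipped entirely. A rough sketch (the
field names below are placeholders, not your actual schema):

import org.apache.spark.sql.types.{StructType, StructField, StringType}

val schema = StructType(Seq(
  StructField("id", StringType, true),
  StructField("payload", StringType, true)))

val people = sqlContext.read.schema(schema).json(path)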

Thanks,
Vishnu

On Wed, Jan 6, 2016 at 12:43 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:

> I am trying to read json files following the example:
>
> val path = "examples/src/main/resources/jsonfile"
> val people = sqlContext.read.json(path)
>
> I have 1 Tb size files in the path.  It took 1.2 hours to finish the reading 
> to infer the schema.
>
> But I already know the schema. Could I make this process short?
>
> Thanks a lot.
>
>
>
>


Re: custom schema in spark throwing error

2015-12-21 Thread VISHNU SUBRAMANIAN
Try this

val customSchema = StructType(Array(
StructField("year", IntegerType, true),
StructField("make", StringType, true),
StructField("model", StringType, true)
))
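
The schema can then be handed straight to a reader. For example, with the
spark-csv package it would look roughly like this (a sketch; sqlContext, the
format, the header option, and the path are assumptions, not taken from your
session):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(customSchema)
  .load("cars.csv")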

On Mon, Dec 21, 2015 at 8:26 AM, Divya Gehlot 
wrote:

>
> scala> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.hive.HiveContext
>
> scala> import org.apache.spark.sql.hive.orc._
> import org.apache.spark.sql.hive.orc._
>
> scala> import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
> import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
>
> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> 15/12/21 02:06:24 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
> 15/12/21 02:06:24 INFO HiveContext: Initializing execution hive, version 0.13.1
> hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@74cba4b
>
> scala> val customSchema = StructType(Seq(StructField("year", IntegerType, true), StructField("make", StringType, true), StructField("model", StringType, true), StructField("comment", StringType, true), StructField("blank", StringType, true)))
> customSchema: org.apache.spark.sql.types.StructType = StructType(StructField(year,IntegerType,true), StructField(make,StringType,true), StructField(model,StringType,true), StructField(comment,StringType,true), StructField(blank,StringType,true))
>
> scala> val customSchema = (new StructType).add("year", IntegerType, true).add("make", StringType, true).add("model", StringType, true).add("comment", StringType, true).add("blank", StringType, true)
> :24: error: not enough arguments for constructor StructType: (fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType.
> Unspecified value parameter fields.
> val customSchema = (new StructType).add("year", IntegerType, true).add("make", StringType, true).add("model", StringType, true).add("comment", StringType, true).add("blank", StringType, true)
>
>


Re: Spark ML Random Forest output.

2015-12-04 Thread Vishnu Viswanath
Hi,

As per my understanding, the probability matrix gives the probability that a
particular item belongs to each class, so the one with the highest
probability is your predicted class.

Since you have converted your label to an indexed label, according to the
model the classes are 0.0 to 9.0, and I see you are getting a prediction that
is one of [0.0, 1.0, ..., 9.0] - which is correct.

So what you want is a reverse map that can convert your predicted class back
to the String. I don't know if StringIndexer has such an option; maybe you
can create your own map and reverse map of (label to index) and (index to
label) and use this to get back your original label.
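
Something like this sketch, assuming you kept the fitted StringIndexerModel
around as "indexerModel":

val labels: Array[String] = indexerModel.labels   // ordered by index, so it is the reverse map
val predictedLabel = labels(prediction.toInt)     // e.g. index 7 -> "4.0" in the metadata above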

Maybe there is a better way to do this.

Regards,
Vishnu

On Fri, Dec 4, 2015 at 4:56 PM, Eugene Morozov <evgeny.a.moro...@gmail.com>
wrote:

> Hello,
>
> I've got an input dataset of handwritten digits and working java code that
> uses random forest classification algorithm to determine the numbers. My
> test set is just some lines from the same input dataset - just to be sure
> I'm doing the right thing. My understanding is that having correct
> classifier in this case would give me the correct prediction.
> At the moment overfitting is not an issue.
>
> After applying StringIndexer to my input DataFrame I've applied an ugly
> trick and got "indexedLabel" metadata:
>
> {"ml_attr":{"vals":["1.0","7.0","3.0","9.0","2.0","6.0","0.0","4.0","8.0","5.0"],"type":"nominal","name":"indexedLabel"}}
>
>
> So, my algorithm gives me the following result. The question is I'm not
> sure I understand the meaning of the "prediction" here in the output. It
> looks like it's just an index of the highest probability value in the
> "prob" array. Shouldn't "prediction" be the actual class, i.e. one of the
> "0.0", "1.0", ..., "9.0"? If the prediction is just an ordinal number, then
> I have to manually correspond it to my classes, but for that I have to
> either specify classes manually to know their order or somehow be able to
> get them out of the classifier. Both of these way seem to be are not
> accessible.
>
> (4.0 -> prediction=7.0,
> prob=[0.004708283878223195,0.08478236104777455,0.03594642191080532,0.19286506771018885,0.038304389235523435,0.02841307797386,0.003334431932056404,0.5685242322346109,0.018564705500837587,0.024557028569980155]
> (9.0 -> prediction=3.0,
> prob=[0.018432404716456248,0.16837195846781422,0.05995559403934031,0.32282148259583565,0.018374168600855455,0.04792285114398864,0.018226352623526704,0.1611650363085499,0.11703073969440755,0.06769941180922535]
> (2.0 -> prediction=4.0,
> prob=[0.017918245251872154,0.029243677407669404,0.06228050320552064,0.03633295481094746,0.45707974962418885,0.09675606366289394,0.03921437851648226,0.043917057390743426,0.14132883075087405,0.0759285393788078]
>
> So, what is the prediction here? How can I specify classes manually or get
> the valid access to them?
> --
> Be well!
> Jean Morozov
>


Re: General question on using StringIndexer in SparkML

2015-12-02 Thread Vishnu Viswanath
Thank you Yanbo,

It looks like this is available only in version 1.6.
Can you tell me how/when I can download version 1.6?
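
If I understand it correctly, the 1.6 usage would look roughly like this
sketch (column names are just placeholders):

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("skip")   // rows with labels unseen during fit() are skipped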

Thanks and Regards,
Vishnu Viswanath,

On Wed, Dec 2, 2015 at 4:37 AM, Yanbo Liang <yblia...@gmail.com> wrote:

> You can set "handleInvalid" to "skip", which helps you skip labels that do
> not exist in the training dataset.
>
> 2015-12-02 14:31 GMT+08:00 Vishnu Viswanath <vishnu.viswanat...@gmail.com>
> :
>
>> Hi Jeff,
>>
>> I went through the link you provided and I could understand how the fit()
>> and transform() work.
>> I tried to use the pipeline in my code and I am getting exception  Caused
>> by: org.apache.spark.SparkException: Unseen label:
>>
>> The reason for this error as per my understanding is:
>> For the column on which I am doing StringIndexing, the test data is
>> having values which was not there in train data.
>> Since fit() is done only on the train data, the indexing is failing.
>>
>> Can you suggest me what can be done in this situation.
>>
>> Thanks,
>>
>> On Mon, Nov 30, 2015 at 12:32 AM, Vishnu Viswanath <
>> vishnu.viswanat...@gmail.com> wrote:
>>
>> Thank you Jeff.
>>>
>>> On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>
>>>> StringIndexer is an estimator which would train a model to be used both
>>>> in training & prediction. So it is consistent between training & 
>>>> prediction.
>>>>
>>>> You may want to read this section of spark ml doc
>>>> http://spark.apache.org/docs/latest/ml-guide.html#how-it-works
>>>>
>>>>
>>>>
>>>> On Mon, Nov 30, 2015 at 12:52 AM, Vishnu Viswanath <
>>>> vishnu.viswanat...@gmail.com> wrote:
>>>>
>>>>> Thanks for the reply Yanbo.
>>>>>
>>>>> I understand that the model will be trained using the indexer map
>>>>> created during the training stage.
>>>>>
>>>>> But since I am getting a new set of data during prediction, and I have
>>>>> to do StringIndexing on the new data also,
>>>>> Right now I am using a new StringIndexer for this purpose, or is there
>>>>> any way that I can reuse the Indexer used for training stage.
>>>>>
>>>>> Note: I am having a pipeline with StringIndexer in it, and I am
>>>>> fitting my train data in it and building the model. Then later when i get
>>>>> the new data for prediction, I am using the same pipeline to fit the data
>>>>> again and do the prediction.
>>>>>
>>>>> Thanks and Regards,
>>>>> Vishnu Viswanath
>>>>>
>>>>>
>>>>> On Sun, Nov 29, 2015 at 8:14 AM, Yanbo Liang <yblia...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Vishnu,
>>>>>>
>>>>>> The string and indexer map is generated at model training step and
>>>>>> used at model prediction step.
>>>>>> It means that the string and indexer map will not changed when
>>>>>> prediction. You will use the original trained model when you do
>>>>>> prediction.
>>>>>>
>>>>>> 2015-11-29 4:33 GMT+08:00 Vishnu Viswanath <
>>>>>> vishnu.viswanat...@gmail.com>:
>>>>>> > Hi All,
>>>>>> >
>>>>>> > I have a general question on using StringIndexer.
>>>>>> > StringIndexer gives an index to each label in the feature starting
>>>>>> from 0 (
>>>>>> > 0 for least frequent word).
>>>>>> >
>>>>>> > Suppose I am building a model, and I use StringIndexer for
>>>>>> transforming on
>>>>>> > of my column.
>>>>>> > e.g., suppose A was most frequent word followed by B and C.
>>>>>> >
>>>>>> > So the StringIndexer will generate
>>>>>> >
>>>>>> > A  0.0
>>>>>> > B  1.0
>>>>>> > C  2.0
>>>>>> >
>>>>>> > After building the model, I am going to do some prediction using
>>>>>> this model,
>>>>>> > So I do the same transformation on my new data which I need to
>>>>>> predict. And
>>>>>> > suppose the new dataset has C as the most frequent word, followed
>>>>>> by B and
>>>>>> > A. So the StringIndexer will assign index as
>>>>>> >
>>>>>> > C 0.0
>>>>>> > B 1.0
>>>>>> > A 2.0
>>>>>> >
>>>>>> > These indexes are different from what we used for modeling. So
>>>>>> won’t this
>>>>>> > give me a wrong prediction if I use StringIndexer?
>>>>>> >
>>>>>> >
>>>>>>
>>>>>
>>>>>
>>>> --
>>>> Best Regards
>>>>
>>>> Jeff Zhang
>>>>
>>>
>>>
>>>
>>> ​
>>
>
>


Re: General question on using StringIndexer in SparkML

2015-12-02 Thread Vishnu Viswanath
Thank you.

On Wed, Dec 2, 2015 at 8:12 PM, Yanbo Liang <yblia...@gmail.com> wrote:

> You can get 1.6.0-RC1 from
> http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-rc1-bin/
> currently, but it's not the last release version.
>
> 2015-12-02 23:57 GMT+08:00 Vishnu Viswanath <vishnu.viswanat...@gmail.com>
> :
>
>> Thank you Yanbo,
>>
>> It looks like this is available in 1.6 version only.
>> Can you tell me how/when can I download version 1.6?
>>
>> Thanks and Regards,
>> Vishnu Viswanath,
>>
>> On Wed, Dec 2, 2015 at 4:37 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>>
>>> You can set "handleInvalid" to "skip" which help you skip the labels
>>> which not exist in training dataset.
>>>
>>> 2015-12-02 14:31 GMT+08:00 Vishnu Viswanath <
>>> vishnu.viswanat...@gmail.com>:
>>>
>>>> Hi Jeff,
>>>>
>>>> I went through the link you provided and I could understand how the
>>>> fit() and transform() work.
>>>> I tried to use the pipeline in my code and I am getting exception  Caused
>>>> by: org.apache.spark.SparkException: Unseen label:
>>>>
>>>> The reason for this error as per my understanding is:
>>>> For the column on which I am doing StringIndexing, the test data is
>>>> having values which was not there in train data.
>>>> Since fit() is done only on the train data, the indexing is failing.
>>>>
>>>> Can you suggest me what can be done in this situation.
>>>>
>>>> Thanks,
>>>>
>>>> On Mon, Nov 30, 2015 at 12:32 AM, Vishnu Viswanath <
>>>> vishnu.viswanat...@gmail.com> wrote:
>>>>
>>>> Thank you Jeff.
>>>>>
>>>>> On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>
>>>>>> StringIndexer is an estimator which would train a model to be used
>>>>>> both in training & prediction. So it is consistent between training &
>>>>>> prediction.
>>>>>>
>>>>>> You may want to read this section of spark ml doc
>>>>>> http://spark.apache.org/docs/latest/ml-guide.html#how-it-works
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 30, 2015 at 12:52 AM, Vishnu Viswanath <
>>>>>> vishnu.viswanat...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for the reply Yanbo.
>>>>>>>
>>>>>>> I understand that the model will be trained using the indexer map
>>>>>>> created during the training stage.
>>>>>>>
>>>>>>> But since I am getting a new set of data during prediction, and I
>>>>>>> have to do StringIndexing on the new data also,
>>>>>>> Right now I am using a new StringIndexer for this purpose, or is
>>>>>>> there any way that I can reuse the Indexer used for training stage.
>>>>>>>
>>>>>>> Note: I am having a pipeline with StringIndexer in it, and I am
>>>>>>> fitting my train data in it and building the model. Then later when i 
>>>>>>> get
>>>>>>> the new data for prediction, I am using the same pipeline to fit the 
>>>>>>> data
>>>>>>> again and do the prediction.
>>>>>>>
>>>>>>> Thanks and Regards,
>>>>>>> Vishnu Viswanath
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Nov 29, 2015 at 8:14 AM, Yanbo Liang <yblia...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Vishnu,
>>>>>>>>
>>>>>>>> The string and indexer map is generated at model training step and
>>>>>>>> used at model prediction step.
>>>>>>>> It means that the string and indexer map will not changed when
>>>>>>>> prediction. You will use the original trained model when you do
>>>>>>>> prediction.
>>>>>>>>
>>>>>>>> 2015-11-29 4:33 GMT+08:00 Vishnu Viswanath <
>>>>>>>> vishnu.viswanat...@gmail.com>:
>>>>>>>> > Hi All,
>>>>>>>> >
>>>>>>>> > I have a general question on using StringIndexer.
>>>>>>>> > StringIndexer gives an index to each label in the feature
>>>>>>>> starting from 0 (
>>>>>>>> > 0 for least frequent word).
>>>>>>>> >
>>>>>>>> > Suppose I am building a model, and I use StringIndexer for
>>>>>>>> transforming on
>>>>>>>> > of my column.
>>>>>>>> > e.g., suppose A was most frequent word followed by B and C.
>>>>>>>> >
>>>>>>>> > So the StringIndexer will generate
>>>>>>>> >
>>>>>>>> > A  0.0
>>>>>>>> > B  1.0
>>>>>>>> > C  2.0
>>>>>>>> >
>>>>>>>> > After building the model, I am going to do some prediction using
>>>>>>>> this model,
>>>>>>>> > So I do the same transformation on my new data which I need to
>>>>>>>> predict. And
>>>>>>>> > suppose the new dataset has C as the most frequent word, followed
>>>>>>>> by B and
>>>>>>>> > A. So the StringIndexer will assign index as
>>>>>>>> >
>>>>>>>> > C 0.0
>>>>>>>> > B 1.0
>>>>>>>> > A 2.0
>>>>>>>> >
>>>>>>>> > These indexes are different from what we used for modeling. So
>>>>>>>> won’t this
>>>>>>>> > give me a wrong prediction if I use StringIndexer?
>>>>>>>> >
>>>>>>>> >
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> --
>>>>>> Best Regards
>>>>>>
>>>>>> Jeff Zhang
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ​
>>>>
>>>
>>>
>>
>


-- 
Thanks and Regards,
Vishnu Viswanath,
*www.vishnuviswanath.com <http://www.vishnuviswanath.com>*


Re: General question on using StringIndexer in SparkML

2015-12-01 Thread Vishnu Viswanath
Hi Jeff,

I went through the link you provided and I could understand how fit()
and transform() work.
I tried to use the pipeline in my code and I am getting the exception  Caused
by: org.apache.spark.SparkException: Unseen label:

The reason for this error, as per my understanding, is:
for the column on which I am doing StringIndexing, the test data has values
that were not present in the training data.
Since fit() is done only on the train data, the indexing fails.

Can you suggest what can be done in this situation?

Thanks,

On Mon, Nov 30, 2015 at 12:32 AM, Vishnu Viswanath <
vishnu.viswanat...@gmail.com> wrote:

Thank you Jeff.
>
> On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> StringIndexer is an estimator which would train a model to be used both
>> in training & prediction. So it is consistent between training & prediction.
>>
>> You may want to read this section of spark ml doc
>> http://spark.apache.org/docs/latest/ml-guide.html#how-it-works
>>
>>
>>
>> On Mon, Nov 30, 2015 at 12:52 AM, Vishnu Viswanath <
>> vishnu.viswanat...@gmail.com> wrote:
>>
>>> Thanks for the reply Yanbo.
>>>
>>> I understand that the model will be trained using the indexer map
>>> created during the training stage.
>>>
>>> But since I am getting a new set of data during prediction, and I have
>>> to do StringIndexing on the new data also,
>>> Right now I am using a new StringIndexer for this purpose, or is there
>>> any way that I can reuse the Indexer used for training stage.
>>>
>>> Note: I am having a pipeline with StringIndexer in it, and I am fitting
>>> my train data in it and building the model. Then later when i get the new
>>> data for prediction, I am using the same pipeline to fit the data again and
>>> do the prediction.
>>>
>>> Thanks and Regards,
>>> Vishnu Viswanath
>>>
>>>
>>> On Sun, Nov 29, 2015 at 8:14 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>>>
>>>> Hi Vishnu,
>>>>
>>>> The string and indexer map is generated at model training step and
>>>> used at model prediction step.
>>>> It means that the string and indexer map will not changed when
>>>> prediction. You will use the original trained model when you do
>>>> prediction.
>>>>
>>>> 2015-11-29 4:33 GMT+08:00 Vishnu Viswanath <
>>>> vishnu.viswanat...@gmail.com>:
>>>> > Hi All,
>>>> >
>>>> > I have a general question on using StringIndexer.
>>>> > StringIndexer gives an index to each label in the feature starting
>>>> from 0 (
>>>> > 0 for least frequent word).
>>>> >
>>>> > Suppose I am building a model, and I use StringIndexer for
>>>> transforming on
>>>> > of my column.
>>>> > e.g., suppose A was most frequent word followed by B and C.
>>>> >
>>>> > So the StringIndexer will generate
>>>> >
>>>> > A  0.0
>>>> > B  1.0
>>>> > C  2.0
>>>> >
>>>> > After building the model, I am going to do some prediction using this
>>>> model,
>>>> > So I do the same transformation on my new data which I need to
>>>> predict. And
>>>> > suppose the new dataset has C as the most frequent word, followed by
>>>> B and
>>>> > A. So the StringIndexer will assign index as
>>>> >
>>>> > C 0.0
>>>> > B 1.0
>>>> > A 2.0
>>>> >
>>>> > These indexes are different from what we used for modeling. So won’t
>>>> this
>>>> > give me a wrong prediction if I use StringIndexer?
>>>> >
>>>> >
>>>>
>>>
>>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> ​


Re: General question on using StringIndexer in SparkML

2015-11-29 Thread Vishnu Viswanath
Thanks for the reply Yanbo.

I understand that the model will be trained using the indexer map created
during the training stage.

But since I am getting a new set of data during prediction, and I have to
do StringIndexing on the new data as well, right now I am using a new
StringIndexer for this purpose. Is there any way I can reuse the indexer used
for the training stage?

Note: I have a pipeline with a StringIndexer in it, and I am fitting my
train data in it and building the model. Then later, when I get the new data
for prediction, I am using the same pipeline to fit the data again and do
the prediction.

Thanks and Regards,
Vishnu Viswanath


On Sun, Nov 29, 2015 at 8:14 AM, Yanbo Liang <yblia...@gmail.com> wrote:

> Hi Vishnu,
>
> The string and indexer map is generated at model training step and
> used at model prediction step.
> It means that the string and indexer map will not changed when
> prediction. You will use the original trained model when you do
> prediction.
>
> 2015-11-29 4:33 GMT+08:00 Vishnu Viswanath <vishnu.viswanat...@gmail.com>:
> > Hi All,
> >
> > I have a general question on using StringIndexer.
> > StringIndexer gives an index to each label in the feature starting from
> 0 (
> > 0 for least frequent word).
> >
> > Suppose I am building a model, and I use StringIndexer for transforming
> on
> > of my column.
> > e.g., suppose A was most frequent word followed by B and C.
> >
> > So the StringIndexer will generate
> >
> > A  0.0
> > B  1.0
> > C  2.0
> >
> > After building the model, I am going to do some prediction using this
> model,
> > So I do the same transformation on my new data which I need to predict.
> And
> > suppose the new dataset has C as the most frequent word, followed by B
> and
> > A. So the StringIndexer will assign index as
> >
> > C 0.0
> > B 1.0
> > A 2.0
> >
> > These indexes are different from what we used for modeling. So won’t this
> > give me a wrong prediction if I use StringIndexer?
> >
> > --
> > Thanks and Regards,
> > Vishnu Viswanath,
> > www.vishnuviswanath.com
>



-- 
Thanks and Regards,
Vishnu Viswanath,
*www.vishnuviswanath.com <http://www.vishnuviswanath.com>*


Re: General question on using StringIndexer in SparkML

2015-11-29 Thread Vishnu Viswanath
Thank you Jeff.
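
To make sure I have it right, a minimal sketch of fitting once and reusing
the fitted model for prediction (column names are placeholders; trainDF and
newDF are hypothetical DataFrames):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
val pipeline = new Pipeline().setStages(Array(indexer /* , other stages */))

val model = pipeline.fit(trainDF)         // fit once: the label -> index map is learned here
val predictions = model.transform(newDF)  // reuse the same fitted model; no second fit()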

On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> StringIndexer is an estimator which would train a model to be used both in
> training & prediction. So it is consistent between training & prediction.
>
> You may want to read this section of spark ml doc
> http://spark.apache.org/docs/latest/ml-guide.html#how-it-works
>
>
>
> On Mon, Nov 30, 2015 at 12:52 AM, Vishnu Viswanath <
> vishnu.viswanat...@gmail.com> wrote:
>
>> Thanks for the reply Yanbo.
>>
>> I understand that the model will be trained using the indexer map created
>> during the training stage.
>>
>> But since I am getting a new set of data during prediction, and I have to
>> do StringIndexing on the new data also,
>> Right now I am using a new StringIndexer for this purpose, or is there
>> any way that I can reuse the Indexer used for training stage.
>>
>> Note: I am having a pipeline with StringIndexer in it, and I am fitting
>> my train data in it and building the model. Then later when i get the new
>> data for prediction, I am using the same pipeline to fit the data again and
>> do the prediction.
>>
>> Thanks and Regards,
>> Vishnu Viswanath
>>
>>
>> On Sun, Nov 29, 2015 at 8:14 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>>
>>> Hi Vishnu,
>>>
>>> The string and indexer map is generated at model training step and
>>> used at model prediction step.
>>> It means that the string and indexer map will not changed when
>>> prediction. You will use the original trained model when you do
>>> prediction.
>>>
>>> 2015-11-29 4:33 GMT+08:00 Vishnu Viswanath <vishnu.viswanat...@gmail.com
>>> >:
>>> > Hi All,
>>> >
>>> > I have a general question on using StringIndexer.
>>> > StringIndexer gives an index to each label in the feature starting
>>> from 0 (
>>> > 0 for least frequent word).
>>> >
>>> > Suppose I am building a model, and I use StringIndexer for
>>> transforming on
>>> > of my column.
>>> > e.g., suppose A was most frequent word followed by B and C.
>>> >
>>> > So the StringIndexer will generate
>>> >
>>> > A  0.0
>>> > B  1.0
>>> > C  2.0
>>> >
>>> > After building the model, I am going to do some prediction using this
>>> model,
>>> > So I do the same transformation on my new data which I need to
>>> predict. And
>>> > suppose the new dataset has C as the most frequent word, followed by B
>>> and
>>> > A. So the StringIndexer will assign index as
>>> >
>>> > C 0.0
>>> > B 1.0
>>> > A 2.0
>>> >
>>> > These indexes are different from what we used for modeling. So won’t
>>> this
>>> > give me a wrong prediction if I use StringIndexer?
>>> >
>>> > --
>>> > Thanks and Regards,
>>> > Vishnu Viswanath,
>>> > www.vishnuviswanath.com
>>>
>>
>>
>>
>> --
>> Thanks and Regards,
>> Vishnu Viswanath,
>> *www.vishnuviswanath.com <http://www.vishnuviswanath.com>*
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


General question on using StringIndexer in SparkML

2015-11-28 Thread Vishnu Viswanath
Hi All,

I have a general question on using StringIndexer.
StringIndexer gives an index to each label in the feature, starting from 0
(0 for the most frequent label).

Suppose I am building a model, and I use StringIndexer for transforming one
of my columns.
e.g., suppose A was most frequent word followed by B and C.

So the StringIndexer will generate

A  0.0
B  1.0
C  2.0

After building the model, I am going to do some prediction using this
model, So I do the same transformation on my new data which I need to
predict. And suppose the new dataset has C as the most frequent word,
followed by B and A. So the StringIndexer will assign index as

C 0.0
B 1.0
A 2.0

These indexes are different from what we used for modeling. So won’t this
give me a wrong prediction if I use StringIndexer?
​
-- 
Thanks and Regards,
Vishnu Viswanath,
*www.vishnuviswanath.com <http://www.vishnuviswanath.com>*


Adding new column to Dataframe

2015-11-25 Thread Vishnu Viswanath
Hi,

I am trying to add the row number to a spark dataframe.
This is my dataframe:

scala> df.printSchema
root
|-- line: string (nullable = true)

I tried to use df.withColumn but I am getting below exception.

scala> df.withColumn("row",rowNumber)
org.apache.spark.sql.AnalysisException: unresolved operator 'Project
[line#2326,'row_number() AS row#2327];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)

Also, is it possible to add a column from one dataframe to another?
something like

scala> df.withColumn("line2",df2("line"))

org.apache.spark.sql.AnalysisException: resolved attribute(s)
line#2330 missing from line#2326 in operator !Project
[line#2326,line#2330 AS line2#2331];

​

Thanks and Regards,
Vishnu Viswanath
*www.vishnuviswanath.com <http://www.vishnuviswanath.com>*


Re: Adding new column to Dataframe

2015-11-25 Thread Vishnu Viswanath
Thanks Jeff,

rowNumber is a function in org.apache.spark.sql.functions link
<https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.functions$>

I will try to use monotonicallyIncreasingId and see if it works.

You’d better use join to correlate 2 data frames: Yes, that's why I thought
of adding a row number to both DataFrames and joining them on that row
number. Is there any better way of doing this? Both DataFrames will always
have the same number of rows, but they are not related by any column to
join on.
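
To make the plan concrete, a sketch of what I am considering (df1 and df2 are
hypothetical DataFrames with the same row count; zipWithIndex gives a
contiguous index, unlike monotonicallyIncreasingId):

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

def withRowIndex(df: DataFrame): DataFrame = {
  val rows = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  df.sqlContext.createDataFrame(rows,
    StructType(df.schema.fields :+ StructField("row_idx", LongType, false)))
}

val joined = withRowIndex(df1)
  .join(withRowIndex(df2).withColumnRenamed("line", "line2"), "row_idx")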

Thanks and Regards,
Vishnu Viswanath
​

On Wed, Nov 25, 2015 at 6:43 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> >>> I tried to use df.withColumn but I am getting below exception.
>
> What is rowNumber here ? UDF ?  You can use monotonicallyIncreasingId
> for generating id
>
> >>> Also, is it possible to add a column from one dataframe to another?
>
> You can't, because how can you add one dataframe to another if they have
> different number of rows. You'd better to use join to correlate 2 data
> frames.
>
> On Thu, Nov 26, 2015 at 6:39 AM, Vishnu Viswanath <
> vishnu.viswanat...@gmail.com> wrote:
>
>> Hi,
>>
>> I am trying to add the row number to a spark dataframe.
>> This is my dataframe:
>>
>> scala> df.printSchema
>> root
>> |-- line: string (nullable = true)
>>
>> I tried to use df.withColumn but I am getting below exception.
>>
>> scala> df.withColumn("row",rowNumber)
>> org.apache.spark.sql.AnalysisException: unresolved operator 'Project 
>> [line#2326,'row_number() AS row#2327];
>> at 
>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>> at 
>> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>> at 
>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
>> at 
>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>>
>> Also, is it possible to add a column from one dataframe to another?
>> something like
>>
>> scala> df.withColumn("line2",df2("line"))
>>
>> org.apache.spark.sql.AnalysisException: resolved attribute(s) line#2330 
>> missing from line#2326 in operator !Project [line#2326,line#2330 AS 
>> line2#2331];
>>
>> ​
>>
>> Thanks and Regards,
>> Vishnu Viswanath
>> *www.vishnuviswanath.com <http://www.vishnuviswanath.com>*
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 
Thanks and Regards,
Vishnu Viswanath
+1 309 550 2311
*www.vishnuviswanath.com <http://www.vishnuviswanath.com>*


Re: Adding new column to Dataframe

2015-11-25 Thread Vishnu Viswanath
Thanks Ted,

It looks like I cannot use row_number then. I tried to run a sample window
function and got below error
org.apache.spark.sql.AnalysisException: Could not resolve window function
'avg'. Note that, using window functions currently requires a HiveContext;

On Wed, Nov 25, 2015 at 8:28 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Vishnu:
> rowNumber (deprecated, replaced with row_number) is a window function.
>
>* Window function: returns a sequential number starting at 1 within a
> window partition.
>*
>* @group window_funcs
>* @since 1.6.0
>*/
>   def row_number(): Column = withExpr {
> UnresolvedWindowFunction("row_number", Nil) }
>
> Sample usage:
>
> df =  sqlContext.range(1<<20)
> df2 = df.select((df.id % 1000).alias("A"), (df.id / 1000).alias('B'))
> ws = Window.partitionBy(df2.A).orderBy(df2.B)
> df3 = df2.select("client", "date",
> rowNumber().over(ws).alias("rn")).filter("rn < 0")
>
> Cheers
>
> On Wed, Nov 25, 2015 at 5:08 PM, Vishnu Viswanath <
> vishnu.viswanat...@gmail.com> wrote:
>
>> Thanks Jeff,
>>
>> rowNumber is a function in org.apache.spark.sql.functions link
>> <https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.functions$>
>>
>> I will try to use monotonicallyIncreasingId and see if it works.
>>
>> You’d better to use join to correlate 2 data frames : Yes, thats why I
>> thought of adding row number in both the DataFrames and join them based on
>> row number. Is there any better way of doing this? Both DataFrames will
>> have same number of rows always, but are not related by any column to do
>> join.
>>
>> Thanks and Regards,
>> Vishnu Viswanath
>> ​
>>
>> On Wed, Nov 25, 2015 at 6:43 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> >>> I tried to use df.withColumn but I am getting below exception.
>>>
>>> What is rowNumber here ? UDF ?  You can use monotonicallyIncreasingId
>>> for generating id
>>>
>>> >>> Also, is it possible to add a column from one dataframe to another?
>>>
>>> You can't, because how can you add one dataframe to another if they have
>>> different number of rows. You'd better to use join to correlate 2 data
>>> frames.
>>>
>>> On Thu, Nov 26, 2015 at 6:39 AM, Vishnu Viswanath <
>>> vishnu.viswanat...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am trying to add the row number to a spark dataframe.
>>>> This is my dataframe:
>>>>
>>>> scala> df.printSchema
>>>> root
>>>> |-- line: string (nullable = true)
>>>>
>>>> I tried to use df.withColumn but I am getting below exception.
>>>>
>>>> scala> df.withColumn("row",rowNumber)
>>>> org.apache.spark.sql.AnalysisException: unresolved operator 'Project 
>>>> [line#2326,'row_number() AS row#2327];
>>>> at 
>>>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>>>> at 
>>>> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>>>> at 
>>>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
>>>> at 
>>>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>>>>
>>>> Also, is it possible to add a column from one dataframe to another?
>>>> something like
>>>>
>>>> scala> df.withColumn("line2",df2("line"))
>>>>
>>>> org.apache.spark.sql.AnalysisException: resolved attribute(s) line#2330 
>>>> missing from line#2326 in operator !Project [line#2326,line#2330 AS 
>>>> line2#2331];
>>>>
>>>> ​
>>>>
>>>> Thanks and Regards,
>>>> Vishnu Viswanath
>>>> *www.vishnuviswanath.com <http://www.vishnuviswanath.com>*
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>>
>> --
>> Thanks and Regards,
>> Vishnu Viswanath
>> +1 309 550 2311
>> *www.vishnuviswanath.com <http://www.vishnuviswanath.com>*
>>
>
> ​
-- 
Thanks and Regards,
Vishnu Viswanath,
*www.vishnuviswanath.com <http://www.vishnuviswanath.com>*


how to us DataFrame.na.fill based on condition

2015-11-23 Thread Vishnu Viswanath
Hi

Can someone tell me if there is a way I can use the fill method
in DataFrameNaFunctions based on some condition.

e.g., df.na.fill("value1","column1","condition1")
df.na.fill("value2","column1","condition2")

I want to fill nulls in column1 with either value1 or value2, based on some
condition.

Thanks,


Re: how to us DataFrame.na.fill based on condition

2015-11-23 Thread Vishnu Viswanath
Thanks for the reply Davies

I think replace replaces one value with another value, but what I want to do
is fill in the null values of a column (I don't have a to_replace here).
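
What I am experimenting with instead is when/otherwise, something like this
sketch (the "cond" column and the literal values are placeholders):

import org.apache.spark.sql.functions.{col, when}

val filled = df.withColumn("column1",
  when(col("column1").isNull && col("cond") === "condition1", "value1")
    .when(col("column1").isNull && col("cond") === "condition2", "value2")
    .otherwise(col("column1")))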

Regards,
Vishnu

On Mon, Nov 23, 2015 at 1:37 PM, Davies Liu <dav...@databricks.com> wrote:

> DataFrame.replace(to_replace, value, subset=None)
>
>
> http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
>
> On Mon, Nov 23, 2015 at 11:05 AM, Vishnu Viswanath
> <vishnu.viswanat...@gmail.com> wrote:
> > Hi
> >
> > Can someone tell me if there is a way I can use the fill method in
> > DataFrameNaFunctions based on some condition.
> >
> > e.g., df.na.fill("value1","column1","condition1")
> > df.na.fill("value2","column1","condition2")
> >
> > i want to fill nulls in column1 with values - either value 1 or value 2,
> > based on some condition.
> >
> > Thanks,
>



-- 
Thanks and Regards,
Vishnu Viswanath
+1 309 550 2311
*www.vishnuviswanath.com <http://www.vishnuviswanath.com>*


How VectorIndexer works in Spark ML pipelines

2015-10-15 Thread VISHNU SUBRAMANIAN
HI All,

I am trying to use the VectorIndexer (FeatureExtraction) technique
available from the Spark ML Pipelines.

I ran the example in the documentation.

val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)


And then I wanted to see what output it generates.

After performing transform on the data set , the output looks like below.

scala> predictions.select("indexedFeatures").take(1).foreach(println)

[(692,[124,125,126,127,151,152,153,154,155,179,180,181,182,183,208,209,210,211,235,236,237,238,239,263,264,265,266,267,268,292,293,294,295,296,321,322,323,324,349,350,351,352,377,378,379,380,405,406,407,408,433,434,435,436,461,462,463,464,489,490,491,492,493,517,518,519,520,521,545,546,547,548,549,574,575,576,577,578,602,603,604,605,606,630,631,632,633,634,658,659,660,661,662],[145.0,255.0,211.0,31.0,32.0,237.0,253.0,252.0,71.0,11.0,175.0,253.0,252.0,71.0,144.0,253.0,252.0,71.0,16.0,191.0,253.0,252.0,71.0,26.0,221.0,253.0,252.0,124.0,31.0,125.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,170.0,253.0,252.0,252.0,252.0,42.0,149.0,252.0,252.0,252.0,144.0,109.0,252.0,252.0,252.0,144.0,218.0,253.0,253.0,255.0,35.0,175.0,252.0,252.0,253.0,35.0,73.0,252.0,252.0,253.0,35.0,31.0,211.0,252.0,253.0,35.0])]


scala> predictions.select("features").take(1).foreach(println)

[(692,[124,125,126,127,151,152,153,154,155,179,180,181,182,183,208,209,210,211,235,236,237,238,239,263,264,265,266,267,268,292,293,294,295,296,321,322,323,324,349,350,351,352,377,378,379,380,405,406,407,408,433,434,435,436,461,462,463,464,489,490,491,492,493,517,518,519,520,521,545,546,547,548,549,574,575,576,577,578,602,603,604,605,606,630,631,632,633,634,658,659,660,661,662],[145.0,255.0,211.0,31.0,32.0,237.0,253.0,252.0,71.0,11.0,175.0,253.0,252.0,71.0,144.0,253.0,252.0,71.0,16.0,191.0,253.0,252.0,71.0,26.0,221.0,253.0,252.0,124.0,31.0,125.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,170.0,253.0,252.0,252.0,252.0,42.0,149.0,252.0,252.0,252.0,144.0,109.0,252.0,252.0,252.0,144.0,218.0,253.0,253.0,255.0,35.0,175.0,252.0,252.0,253.0,35.0,73.0,252.0,252.0,253.0,35.0,31.0,211.0,252.0,253.0,35.0])]

I can't understand what is happening. I tried with simple data sets also,
but got a similar result.

Please help.

Thanks,

Vishnu


Re: UDF in spark

2015-07-08 Thread VISHNU SUBRAMANIAN
HI Vinod,

Yes. If you want to use a Scala or Python function, you need to run that
block of code each time.

Only Hive UDFs are available permanently.

Thanks,
Vishnu

On Wed, Jul 8, 2015 at 5:17 PM, vinod kumar vinodsachin...@gmail.com
wrote:

 Thanks Vishnu,

 When I restart the service, the UDF is not accessible by my query. I need to
 run the mentioned block again to use the UDF.
 Is there any way to maintain a UDF in sqlContext permanently?

 Thanks,
 Vinod

 On Wed, Jul 8, 2015 at 7:16 AM, VISHNU SUBRAMANIAN 
 johnfedrickena...@gmail.com wrote:

 Hi,

  sqlContext.udf.register("udfName", functionName _)

 example:

 def square(x:Int):Int = { x * x}

 register udf as below

  sqlContext.udf.register("square", square _)

 Thanks,
 Vishnu

 On Wed, Jul 8, 2015 at 2:23 PM, vinod kumar vinodsachin...@gmail.com
 wrote:

 Hi Everyone,

  I am new to Spark. May I know how to define and use a User Defined Function
  in Spark SQL?

  I want to use the defined UDF in SQL queries.

 My Environment
 Windows 8
 spark 1.3.1

 Warm Regards,
 Vinod







Re: UDF in spark

2015-07-08 Thread VISHNU SUBRAMANIAN
Hi,

sqlContext.udf.register("udfName", functionName _)

example:

def square(x:Int):Int = { x * x}

register udf as below

sqlContext.udf.register("square", square _)
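
Once registered, the UDF can be called from SQL, for example (a sketch;
"numbers" is a hypothetical registered temp table):

sqlContext.sql("SELECT value, square(value) FROM numbers").show()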

Thanks,
Vishnu

On Wed, Jul 8, 2015 at 2:23 PM, vinod kumar vinodsachin...@gmail.com
wrote:

 Hi Everyone,

 I am new to Spark. May I know how to define and use a User Defined Function
 in Spark SQL?

 I want to use the defined UDF in SQL queries.

 My Environment
 Windows 8
 spark 1.3.1

 Warm Regards,
 Vinod





Re: used cores are less then total no. of core

2015-02-24 Thread VISHNU SUBRAMANIAN
Try adding --total-executor-cores 5, where 5 is the number of cores.

Thanks,
Vishnu

On Wed, Feb 25, 2015 at 11:52 AM, Somnath Pandeya 
somnath_pand...@infosys.com wrote:

  Hi All,



  I am running a simple word count example on Spark (standalone cluster).
  In the UI it shows that each worker has 32 cores available, but while
  running the jobs only 5 cores are being used.

  What should I do to increase the number of cores used, or is this selected
  based on the jobs?



 Thanks

 Somnaht





Re: Running Example Spark Program

2015-02-22 Thread VISHNU SUBRAMANIAN
Try restarting your Spark cluster:
./sbin/stop-all.sh
./sbin/start-all.sh

Thanks,
Vishnu

On Sun, Feb 22, 2015 at 7:30 PM, Surendran Duraisamy 
2013ht12...@wilp.bits-pilani.ac.in wrote:

  Hello All,

 I am new to Apache Spark, I am trying to run JavaKMeans.java from Spark
 Examples in my Ubuntu System.

 I downloaded spark-1.2.1-bin-hadoop2.4.tgz
 http://www.apache.org/dyn/closer.cgi/spark/spark-1.2.1/spark-1.2.1-bin-hadoop2.4.tgz
 and started sbin/start-master.sh

 After starting Spark and access http://localhost:8080/ to look at the
 status of my Spark Instance, and it shows as follows.


- *URL:* spark://vm:7077
- *Workers:* 0
- *Cores:* 0 Total, 0 Used
- *Memory:* 0.0 B Total, 0.0 B Used
- *Applications:* 0 Running, 4 Completed
- *Drivers:* 0 Running, 0 Completed
- *Status:* ALIVE

 Number of Cores is 0 and Memory is 0.0B. I think because of this I am
 getting following error when I try to run JavaKMeans.java

 Initial job has not accepted any resources; check your cluster UI to
 ensure that workers are registered and have sufficient memory

 Am I missing any configuration before running sbin/start-master.sh?
  Regards,
 Surendran



Re: Hive/Hbase for low latency

2015-02-11 Thread VISHNU SUBRAMANIAN
Hi Siddarth,

It depends on what exactly you are trying to solve, but the connectivity
between Cassandra and Spark is good.

Thanks,
Vishnu

On Wed, Feb 11, 2015 at 7:47 PM, Siddharth Ubale 
siddharth.ub...@syncoms.com wrote:

  Hi ,



 I am new to Spark. We have recently moved from Apache Storm to Apache
 Spark to build our OLAP tool.

 Earlier, we were using HBase and Phoenix.

 We need to rethink what to use in the case of Spark.

 Should we go ahead with HBase, Hive, or Cassandra for query processing
 with Spark SQL?



 Please share your views.



 Thanks,

 Siddharth Ubale,







Re: Question related to Spark SQL

2015-02-11 Thread VISHNU SUBRAMANIAN
I didn't mean that. When you try the above approach, only one client will have
access to the cached data.

But when you expose your data through a Thrift server the case is quite
different.

In the case of the Thrift server, all the requests go to the Thrift server and
Spark will be able to take advantage of caching.

That is, the Thrift server would be your sole client to the Spark cluster.

check this link
http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#running-the-thrift-jdbc-server

Your applications can connect to your Spark cluster through the JDBC driver. It
works similarly to the Hive Thrift server.
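
A rough sketch of such a JDBC connection (host, port and table name are
illustrative; the Thrift server speaks the HiveServer2 protocol, so the Hive
JDBC driver is used):

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT count(*) FROM some_cached_table")
while (rs.next()) println(rs.getLong(1))
conn.close()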

Thanks,
Vishnu

On Wed, Feb 11, 2015 at 10:31 PM, Ashish Mukherjee 
ashish.mukher...@gmail.com wrote:

 Thanks for your reply, Vishnu.

 I assume you are suggesting I build Hive tables and cache them in memory
 and query on top of that for fast, real-time querying.

 Perhaps, I should write a generic piece of code like this and submit this
 as a Spark job with the SQL clause as an argument based on user selections
 on the Web interface -

 String sqlClause = args[0];
 ...
 JavaHiveContext sqlContext = new org.apache.spark.sql.hive.api.java.JavaHiveContext(sc);
 // Queries are expressed in HiveQL.
 Row[] results = sqlContext.sql(sqlClause).collect();


 Is my understanding right?

 Regards,
 Ashish

 On Wed, Feb 11, 2015 at 4:42 PM, VISHNU SUBRAMANIAN 
 johnfedrickena...@gmail.com wrote:

 Hi Ashish,

 In order to answer your question, I assume that you are planning to
 process data and cache it in memory. If you are using the Thrift server
 that comes with Spark then you can query on top of it, and multiple
 applications can use the cached data, as internally all the requests go to
 the Thrift server.

 Spark exposes the Hive query language and allows you to access its data
 through Spark, so you can consider using HiveQL for querying.
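
 A minimal sketch of that pattern (the table name, path and query are made up;
 CACHE TABLE is the SQL way of pinning the data in memory):

 sqlContext.sql(
   """CREATE TEMPORARY TABLE events
     |USING org.apache.spark.sql.parquet
     |OPTIONS (path "s3n://my-bucket/events")""".stripMargin)
 sqlContext.sql("CACHE TABLE events")
 val counts = sqlContext.sql(
   "SELECT country, count(*) FROM events GROUP BY country").collect()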

 Thanks,
 Vishnu

 On Wed, Feb 11, 2015 at 4:12 PM, Ashish Mukherjee 
 ashish.mukher...@gmail.com wrote:

 Hi,

 I am planning to use Spark for a Web-based adhoc reporting tool on
 massive date-sets on S3. Real-time queries with filters, aggregations and
 joins could be constructed from UI selections.

 Online documentation seems to suggest that SharkQL is deprecated and
 users should move away from it. I understand Hive is generally not used
 for real-time querying, and for Spark SQL to work with other data stores,
 tables need to be registered explicitly in code. Also, this would not
 be suitable for a dynamic query construction scenario.

 For a real-time , dynamic querying scenario like mine what is the proper
 tool to be used with Spark SQL?

 Regards,
 Ashish






Re: Re: How can I read this avro file using spark scala?

2015-02-11 Thread VISHNU SUBRAMANIAN
Check this link.
https://github.com/databricks/spark-avro

Home page for Spark-avro project.
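
A short sketch of how the package is typically used (the file path is made up,
and the exact API depends on the spark-avro version you pull in):

import com.databricks.spark.avro._

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Scala API: avroFile comes from the spark-avro import, not from Spark itself
val episodes = sqlContext.avroFile("/tmp/episodes.avro")
episodes.registerTempTable("episodes")

// or, equivalently, register the Avro file purely through SQL
sqlContext.sql(
  """CREATE TEMPORARY TABLE episodes_sql
    |USING com.databricks.spark.avro
    |OPTIONS (path "/tmp/episodes.avro")""".stripMargin)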

Thanks,
Vishnu

On Wed, Feb 11, 2015 at 10:19 PM, Todd bit1...@163.com wrote:

 Databricks provides sample code on its website... but I can't find it for
 now.






 At 2015-02-12 00:43:07, captainfranz captainfr...@gmail.com wrote:
 I am confused as to whether avro support was merged into Spark 1.2 or it is
 still an independent library.
 I see some people writing sqlContext.avroFile similarly to jsonFile but this
 does not work for me, nor do I see this in the Scala docs.
 
 
 




Re: getting the cluster elements from kmeans run

2015-02-11 Thread VISHNU SUBRAMANIAN
You can use model.predict(point); it returns the index of the cluster the
point belongs to (and hence its center), so you can map each point to its
cluster:

rdd.map(x => (x, model.predict(x)))
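
To actually collect the members of each cluster, a rough sketch in Scala
(assumes points is an RDD[Vector] and model is the trained KMeansModel; the
Java API is analogous via mapToPair and groupByKey):

val membersByCluster = points.map(p => (model.predict(p), p)).groupByKey()

membersByCluster.collect().foreach { case (clusterId, members) =>
  println(s"cluster $clusterId has ${members.size} points")
}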

Thanks,
Vishnu

On Wed, Feb 11, 2015 at 11:06 PM, Harini Srinivasan har...@us.ibm.com
wrote:

 Hi,

 Is there a way to get the elements of each cluster after running kmeans
 clustering? I am using the Java version.


 
 thanks




Re: NaiveBayes classifier causes ShuffleDependency class cast exception

2015-02-06 Thread VISHNU SUBRAMANIAN
Can you try creating just a single SparkContext and then try your code? If
you want to use it for streaming, pass the same SparkContext object to the
StreamingContext constructor instead of the conf.
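
A minimal sketch of the single-context approach in Scala (names are made up;
the point is that the StreamingContext is built from the existing SparkContext
rather than from a second conf):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streamer").setMaster("local[2]")
val sc = new SparkContext(conf)                 // the one and only SparkContext
val ssc = new StreamingContext(sc, Seconds(1))  // reuse sc instead of building a second context

// use sc for MLlib calls such as NaiveBayes.train(...),
// and ssc for the DStream pipeline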

Note: Instead of just replying to me, try to use reply-all so that the
post is visible to the community. That way you can expect immediate
responses.

On Fri, Feb 6, 2015 at 6:09 AM, aanilpala aanilp...@gmail.com wrote:

 I have the following code:


 SparkConf conf = new SparkConf().setAppName("streamer").setMaster("local[2]");
 conf.set("spark.driver.allowMultipleContexts", "true");
 JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(batch_interval));
 ssc.checkpoint("/tmp/spark/checkpoint");

 SparkConf conf2 = new SparkConf().setAppName("classifier").setMaster("local[1]");
 conf2.set("spark.driver.allowMultipleContexts", "true");
 JavaSparkContext sc = new JavaSparkContext(conf);

 JavaReceiverInputDStream<String> stream = ssc.socketTextStream("localhost", );

 // String to Tuple3 conversion
 JavaDStream<Tuple3<Long, String, String>> tuple_stream =
     stream.map(new Function<String, Tuple3<Long, String, String>>() { ... });

 JavaPairDStream<Integer, DictionaryEntry> raw_dictionary_stream =
     tuple_stream.filter(new Function<Tuple3<Long, String, String>, Boolean>() {

         @Override
         public Boolean call(Tuple3<Long, String, String> tuple) throws Exception {
             if ((tuple._1() / Time.scaling_factor % training_interval) < training_dur)
                 NaiveBayes.train(sc.parallelize(training_set).rdd());

             return true;
         }
     });

 I am working on a text mining project and I want to use
 NaiveBayesClassifier
 of MLlib to classify some stream items. So, I have two Spark contexts one
 of
 which is a streaming context. The call to NaiveBayes.train causes the
 following exception.

 Any ideas?


  Exception in thread main org.apache.spark.SparkException: Job aborted
 due
 to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure:
 Lost task 0.0 in stage 0.0 (TID 0, localhost):
 java.lang.ClassCastException:
 org.apache.spark.SparkContext$$anonfun$runJob$4 cannot be cast to
 org.apache.spark.ShuffleDependency
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:60)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

 Driver stacktrace:
 at
 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
 at

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
 at scala.Option.foreach(Option.scala:236)
 at

 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
 at

 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at

 org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
 at akka.dispatch.Mailbox.run(Mailbox.scala:220)
 at

 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at

 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at

 

Re: Shuffle Dependency Casting error

2015-02-05 Thread VISHNU SUBRAMANIAN
Hi,

Could you share the code snippet?

Thanks,
Vishnu

On Thu, Feb 5, 2015 at 11:22 PM, aanilpala aanilp...@gmail.com wrote:

 Hi, I am working on a text mining project and I want to use
 NaiveBayesClassifier of MLlib to classify some stream items. So, I have two
 Spark contexts, one of which is a streaming context. Everything works fine if
 I comment out the train and predict methods, although it obviously doesn't
 do what I want. The exception (and its trace) I am getting is
 below.

 Any ideas?

 Exception in thread main org.apache.spark.SparkException: Job aborted due
 to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure:
 Lost task 0.0 in stage 0.0 (TID 0, localhost):
 java.lang.ClassCastException:
 org.apache.spark.SparkContext$$anonfun$runJob$4 cannot be cast to
 org.apache.spark.ShuffleDependency
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:60)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

 Driver stacktrace:
 at
 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
 at

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
 at scala.Option.foreach(Option.scala:236)
 at

 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
 at

 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at

 org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
 at akka.dispatch.Mailbox.run(Mailbox.scala:220)
 at

 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at

 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at

 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)







Re: Java Kafka Word Count Issue

2015-02-02 Thread VISHNU SUBRAMANIAN
You can use updateStateByKey() to perform the above operation.
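
A rough sketch in Scala of keeping a running count with updateStateByKey
(assumes words is a DStream[String] built from the Kafka input; the checkpoint
directory is illustrative):

ssc.checkpoint("/tmp/spark-checkpoint")  // stateful operations need a checkpoint directory

// merge the counts of the current batch into the state carried across batches
val updateCounts = (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))

val runningCounts = words.map(w => (w, 1)).updateStateByKey[Int](updateCounts)
runningCounts.print()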

On Mon, Feb 2, 2015 at 4:29 PM, Jadhav Shweta jadhav.shw...@tcs.com wrote:


 Hi Sean,

 Kafka Producer is working fine.
 This is related to Spark.

 How can I configure Spark so that it will remember the count from
 the beginning?

 If my log.text file has

 spark
 apache
 kafka
 spark

 My Spark program gives correct output as

 spark 2
 apache 1
 kafka 1

 but when I append spark to my log.text file

 Spark program gives output as

 spark 1

 which should be spark 3.

 So how do I handle this in the Spark code?

 Thanks and regards
 Shweta Jadhav



 -Sean Owen so...@cloudera.com wrote: -
 To: Jadhav Shweta jadhav.shw...@tcs.com
 From: Sean Owen so...@cloudera.com
 Date: 02/02/2015 04:13PM
 Subject: Re: Java Kafka Word Count Issue

 This is a question about the Kafka producer right? Not Spark
 On Feb 2, 2015 10:34 AM, Jadhav Shweta jadhav.shw...@tcs.com wrote:


 Hi All,

 I am trying to run the Kafka Word Count program.
 Please find below the link for the same:

 https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java

 I have set spark master to setMaster(local[*])

 and I have started Kafka Producer which reads the file.

 If my file already has a few words,
 then after running the Spark Java program I get the proper output.

 But when I append new words to the same file, it starts the word count again
 from 1.

 If I need to do a word count over both the already present and the newly
 appended words, exactly what changes do I need to make in the code for that?

 P.S. I am using Spark spark-1.2.0-bin-hadoop2.3

 Thanks and regards
 Shweta Jadhav





Re: Failed to save RDD as text file to local file system

2015-01-08 Thread VISHNU SUBRAMANIAN
It looks like it is trying to save the file to HDFS.

Check if you have any Hadoop configuration (for example HADOOP_CONF_DIR or
fs.defaultFS) set on your system that points the default filesystem at HDFS.
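
One quick way to check from spark-shell (a sketch, not from the original reply;
fs.default.name is the older spelling of the same property):

// which filesystem does Spark's Hadoop configuration resolve to by default?
sc.hadoopConfiguration.get("fs.defaultFS")
sc.hadoopConfiguration.get("fs.default.name")

// force the local filesystem for this context if needed (an explicit file://
// URI in saveAsTextFile should normally be enough on its own)
sc.hadoopConfiguration.set("fs.defaultFS", "file:///")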

On Fri, Jan 9, 2015 at 12:14 PM, Raghavendra Pandey 
raghavendra.pan...@gmail.com wrote:

 Can you check permissions etc., as I am able to run
 r.saveAsTextFile("file:///home/cloudera/tmp/out1") successfully on my
 machine.

 On Fri, Jan 9, 2015 at 10:25 AM, NingjunWang ningjun.w...@lexisnexis.com
 wrote:

 I tried to save an RDD as a text file to the local file system (Linux) but it
 does not work.

 Launch spark-shell and run the following

 val r = sc.parallelize(Array("a", "b", "c"))
 r.saveAsTextFile("file:///home/cloudera/tmp/out1")


 IOException: Mkdirs failed to create

 file:/home/cloudera/tmp/out1/_temporary/0/_temporary/attempt_201501082027_0003_m_00_47
 (exists=false, cwd=file:/var/run/spark/work/app-20150108201046-0021/0)
 at

 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
 at

 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
 at

 org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
 at
 org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
 at

 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1056)
 at

 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)


 I also tried with 4 slashes but still get the same error:
 r.saveAsTextFile("file:////home/cloudera/tmp/out1")

 Please advise

 Ningjun









how to do incremental model updates using spark streaming and mllib

2014-12-25 Thread vishnu
Hi,

Say I have created a clustering model using KMeans for 100 million
transactions at time t1. I am using streaming, and say every 1 hour I
need to update my existing model. How do I do it? Should it include all
the data every time, or can it be updated incrementally?

If it can be updated incrementally, how do I do it?

Thanks,
Vishnu


