Re: Time window on Processing Time

2017-08-30 Thread madhu phatak
Hi,
That's great. Thanks a lot.
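For anyone finding this thread later, here is a minimal end-to-end sketch of the suggested approach; the socket source, host/port, and column names are just placeholders for illustration:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ProcessingTimeWindowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("processing-time-window")
      .master("local[*]")
      .getOrCreate()

    // Any streaming source works; a socket source is used here only as a placeholder.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // Stamp each row with the wall-clock time at which it is processed, then
    // window on that column exactly as if it were event time.
    val counts = lines
      .withColumn("processingTime", current_timestamp())
      .groupBy(window(col("processingTime"), "1 minute"))
      .count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```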

On Wed, Aug 30, 2017 at 10:44 AM, Tathagata Das <tathagata.das1...@gmail.com
> wrote:

> Yes, it can be! There is a SQL function called current_timestamp() which
> is self-explanatory. So I believe you should be able to do something like
>
> import org.apache.spark.sql.functions._
>
> ds.withColumn("processingTime", current_timestamp())
>   .groupBy(window(col("processingTime"), "1 minute"))
>   .count()
>
>
> On Mon, Aug 28, 2017 at 5:46 AM, madhu phatak <phatak@gmail.com>
> wrote:
>
>> Hi,
>> As I am playing with structured streaming, I observed that the window
>> function always requires a time column in the input data, so that means it's
>> event time.
>>
>> Is it possible to do old Spark Streaming style window functions based on
>> processing time? I don't see any documentation on the same.
>>
>> --
>> Regards,
>> Madhukara Phatak
>> http://datamantra.io/
>>
>
>


-- 
Regards,
Madhukara Phatak
http://datamantra.io/


Time window on Processing Time

2017-08-28 Thread madhu phatak
Hi,
As I am playing with structured streaming, I observed that the window function
always requires a time column in the input data, so that means it's event time.

Is it possible to do old Spark Streaming style window functions based on
processing time? I don't see any documentation on the same.

-- 
Regards,
Madhukara Phatak
http://datamantra.io/


Review of ML PR

2017-08-14 Thread madhu phatak
Hi,

I opened a PR around 2 months back to improve the performance of decision
trees by allowing a flexible, user-provided storage level for intermediate
data. I have posted a few questions about handling backward compatibility,
but there have been no answers for a long time.

Can anybody help me move this forward? Below is the link to the PR:

https://github.com/apache/spark/pull/17972
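To make the idea concrete, here is a tiny sketch of the shape of the change: a string-valued, user-supplied storage level with a backward-compatible default. The object and parameter handling below are purely illustrative, not the PR's actual code or parameter name:

```
import org.apache.spark.storage.StorageLevel

object IntermediateStorageLevelSketch {
  def main(args: Array[String]): Unit = {
    // Backward compatibility: default to the level that is hard-coded today,
    // so existing callers see no behaviour change.
    val requested = args.headOption.getOrElse("MEMORY_AND_DISK")

    // StorageLevel.fromString is part of Spark core and rejects unknown names,
    // which gives basic validation of the user-supplied value.
    val level = StorageLevel.fromString(requested)
    println(s"Intermediate tree data would be persisted with: $level")
  }
}
```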

-- 
Regards,
Madhukara Phatak
http://datamantra.io/


Re: RandomForest caching

2017-05-12 Thread madhu phatak
Hi,
I opened a JIRA:

https://issues.apache.org/jira/browse/SPARK-20723

Can someone have a look?

On Fri, Apr 28, 2017 at 1:34 PM, madhu phatak <phatak@gmail.com> wrote:

> Hi,
>
> I am testing RandomForestClassification with 50 GB of data which is cached
> in memory. I have 64 GB of RAM, of which 28 GB is used for caching the
> original dataset.
>
> When I run random forest, it caches around 300 GB of intermediate data,
> which evicts the cached original dataset. This caching is triggered by the
> code below in RandomForest.scala:
>
> ```
> val baggedInput = BaggedPoint
>   .convertToBaggedRDD(treeInput, strategy.subsamplingRate,
>     numTrees, withReplacement, seed)
>   .persist(StorageLevel.MEMORY_AND_DISK)
> ```
>
> As I don't have control over the storage level, I cannot make sure the
> original dataset stays in memory for other interactive tasks while random
> forest is running.
>
> Is it a good idea to make this storage level a user parameter? If so, I can
> open a JIRA issue and submit a PR for the same.
>
> --
> Regards,
> Madhukara Phatak
> http://datamantra.io/
>



-- 
Regards,
Madhukara Phatak
http://datamantra.io/


RandomForest caching

2017-04-28 Thread madhu phatak
Hi,

I am testing RandomForestClassification with 50 GB of data which is cached
in memory. I have 64 GB of RAM, of which 28 GB is used for caching the
original dataset.

When I run random forest, it caches around 300 GB of intermediate data,
which evicts the cached original dataset. This caching is triggered by the
code below in RandomForest.scala:

```
val baggedInput = BaggedPoint
  .convertToBaggedRDD(treeInput, strategy.subsamplingRate,
    numTrees, withReplacement, seed)
  .persist(StorageLevel.MEMORY_AND_DISK)
```

As I don't have control over the storage level, I cannot make sure the
original dataset stays in memory for other interactive tasks while random
forest is running.

Is it a good idea to make this storage level a user parameter? If so, I can
open a JIRA issue and submit a PR for the same.
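To sketch what such a parameter could enable, here is a minimal runnable illustration; the dataset sizes, names, and the DISK_ONLY choice below are placeholders, and the real change would live in RandomForest.scala and its Strategy:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Illustration of the idea: the original dataset stays cached in memory while
// the intermediate ("bagged") data is persisted with a caller-chosen level such
// as DISK_ONLY instead of the hard-coded MEMORY_AND_DISK.
object UserChosenStorageLevelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("user-chosen-intermediate-storage-level")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for the original training dataset, cached in memory.
    val original = sc.parallelize(1 to 1000000).cache()
    original.count()

    // Stand-in for the intermediate data; the storage level is a parameter here.
    val intermediateLevel = StorageLevel.DISK_ONLY
    val bagged = original.map(x => (x % 100, x)).persist(intermediateLevel)
    bagged.count()

    println(s"original: ${original.getStorageLevel}, intermediate: ${bagged.getStorageLevel}")
    spark.stop()
  }
}
```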

-- 
Regards,
Madhukara Phatak
http://datamantra.io/


Re: Contributing Documentation Changes

2015-04-24 Thread madhu phatak
Hi,
I understand that. The following page,
http://spark.apache.org/documentation.html, has an external tutorials/blogs
section which points to other blog pages. I wanted to add mine there.




Regards,
Madhukara Phatak
http://datamantra.io/

On Fri, Apr 24, 2015 at 5:17 PM, Sean Owen so...@cloudera.com wrote:

 I think that your own tutorials and such should live on your blog. The
 goal isn't to pull in a bunch of external docs to the site.

 On Fri, Apr 24, 2015 at 12:57 AM, madhu phatak phatak@gmail.com
 wrote:
  Hi,
   As I was reading the contributing to Spark wiki, it was mentioned that we
  can contribute external links to Spark tutorials. I have written many of
  them on my blog: http://blog.madhukaraphatak.com/categories/spark/. It
  would be great if someone could add them to the Spark website.
 
 
 
  Regards,
  Madhukara Phatak
  http://datamantra.io/



Contributing Documentation Changes

2015-04-23 Thread madhu phatak
Hi,
 As I was reading the contributing to Spark wiki, it was mentioned that we can
contribute external links to Spark tutorials. I have written many of them on
my blog: http://blog.madhukaraphatak.com/categories/spark/. It would be great
if someone could add them to the Spark website.



Regards,
Madhukara Phatak
http://datamantra.io/


Help needed to publish SizeEstimator as separate library

2014-11-19 Thread madhu phatak
Hi,
 As I was going through the Spark source code, SizeEstimator
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/SizeEstimator.scala)
caught my eye. It's a very useful tool for doing size estimations on the JVM,
which helps in use cases like a memory-bounded cache.
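For example, here is a rough sketch of that memory-bounded cache use case, assuming the estimate call is reachable from user code (in Spark it lives in org.apache.spark.util; making it reachable without pulling in all of Spark is part of the motivation for a separate library). The cache class and its eviction policy are only illustrative:

```
import scala.collection.mutable

import org.apache.spark.util.SizeEstimator

// Rough sketch of a memory-bounded cache: evict the oldest entries once the
// estimated heap size of the cached values crosses a byte budget.
class BoundedCache[K, V <: AnyRef](maxBytes: Long) {
  private val entries = mutable.LinkedHashMap.empty[K, V]
  private var currentBytes = 0L

  def put(key: K, value: V): Unit = {
    // If the key is already present, subtract its old contribution first.
    entries.remove(key).foreach(old => currentBytes -= SizeEstimator.estimate(old))
    entries.put(key, value)
    currentBytes += SizeEstimator.estimate(value)
    // Evict in insertion order until we are back under budget, keeping at
    // least the entry we just inserted.
    while (currentBytes > maxBytes && entries.size > 1) {
      val (oldestKey, oldestValue) = entries.head
      entries.remove(oldestKey)
      currentBytes -= SizeEstimator.estimate(oldestValue)
    }
  }

  def get(key: K): Option[V] = entries.get(key)
}
```

With the estimator available as a small standalone artifact, a utility like this would not need a dependency on all of spark-core.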

It would be useful to have this as a separate library which can be used in
other projects too. There was a discussion about this long back
(https://spark-project.atlassian.net/browse/SPARK-383), but I don't see any
updates on it.

I have extracted the code and packaged it as a separate project on GitHub:
https://github.com/phatak-dev/java-sizeof. I have simplified the code to
remove the dependencies on google-guava and OpenHashSet, which leads to a
small compromise in accuracy for big arrays. But at the same time, it greatly
simplifies the code base and the dependency graph. I want to publish it to
Maven Central so it can be added as a dependency.

Though I have published the code under my package com.madhu while keeping the
license information, I am not sure if that is the right way to do it. So it
would be great if someone could guide me on package naming and attribution.

-- 
Regards,
Madhukara Phatak
http://www.madhukaraphatak.com