Re: Time window on Processing Time
Hi,

That's great. Thanks a lot.

On Wed, Aug 30, 2017 at 10:44 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

> Yes, it can be! There is a SQL function called current_timestamp(), which
> is self-explanatory. So I believe you should be able to do something like:
>
> import org.apache.spark.sql.functions._
>
> ds.withColumn("processingTime", current_timestamp())
>   .groupBy(window(col("processingTime"), "1 minute"))
>   .count()
>
> On Mon, Aug 28, 2017 at 5:46 AM, madhu phatak <phatak@gmail.com> wrote:
>
>> Hi,
>>
>> As I am playing with structured streaming, I observed that the window
>> function always requires a time column in the input data, which means it
>> works on event time.
>>
>> Is it possible to do old Spark Streaming style windowing based on
>> processing time? I don't see any documentation on this.
>>
>> --
>> Regards,
>> Madhukara Phatak
>> http://datamantra.io/

--
Regards,
Madhukara Phatak
http://datamantra.io/
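For readers outside a Spark shell, the tumbling-window grouping that `window(...)` performs on the processing timestamp can be sketched in plain Python. This is only an illustration of the bucketing idea (floor each timestamp to its window start, then count per bucket), not Spark's implementation:

```python
from collections import Counter
from datetime import datetime, timezone

def window_start(ts: datetime, minutes: int = 1) -> datetime:
    """Floor a timestamp to the start of its tumbling window."""
    epoch = int(ts.timestamp())
    width = minutes * 60
    return datetime.fromtimestamp(epoch - epoch % width, tz=timezone.utc)

def count_per_window(timestamps, minutes=1):
    """Group records by the processing-time window they fall into."""
    return Counter(window_start(ts, minutes) for ts in timestamps)

# three records arriving within minute 10:44, one within 10:45
ts = [datetime(2017, 8, 30, 10, 44, s, tzinfo=timezone.utc) for s in (5, 20, 59)]
ts.append(datetime(2017, 8, 30, 10, 45, 1, tzinfo=timezone.utc))
counts = count_per_window(ts)
```

The key point the reply makes is that by stamping each row with `current_timestamp()` at ingestion, the ordinary event-time window machinery behaves as a processing-time window.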
Time window on Processing Time
Hi,

As I am playing with structured streaming, I observed that the window function always requires a time column in the input data, which means it works on event time.

Is it possible to do old Spark Streaming style windowing based on processing time? I don't see any documentation on this.

--
Regards,
Madhukara Phatak
http://datamantra.io/
Review of ML PR
Hi,

About two months back I opened a PR to improve the performance of decision trees by allowing a flexible, user-provided storage level for intermediate data. I posted a few questions there about handling backward compatibility, but there have been no answers for a long time. Can anybody help me move this forward?

Below is the link to the PR: https://github.com/apache/spark/pull/17972

--
Regards,
Madhukara Phatak
http://datamantra.io/
Re: RandomForest caching
Hi,

I opened a JIRA: https://issues.apache.org/jira/browse/SPARK-20723

Can someone have a look?

On Fri, Apr 28, 2017 at 1:34 PM, madhu phatak <phatak@gmail.com> wrote:

> Hi,
>
> I am testing RandomForestClassification with 50 GB of data, which is
> cached in memory. I have 64 GB of RAM, of which 28 GB is used for caching
> the original dataset.
>
> When I run random forest, it caches around 300 GB of intermediate data,
> which evicts the original dataset from the cache. This caching is
> triggered by the code below in RandomForest.scala:
>
> ```
> val baggedInput = BaggedPoint
>   .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees,
>     withReplacement, seed)
>   .persist(StorageLevel.MEMORY_AND_DISK)
> ```
>
> As I don't have control over the storage level, I cannot ensure the
> original dataset stays in memory for other interactive tasks while random
> forest is running.
>
> Is it a good idea to make this storage level a user parameter? If so, I
> can open a JIRA issue and submit a PR for it.
>
> --
> Regards,
> Madhukara Phatak
> http://datamantra.io/

--
Regards,
Madhukara Phatak
http://datamantra.io/
RandomForest caching
Hi,

I am testing RandomForestClassification with 50 GB of data, which is cached in memory. I have 64 GB of RAM, of which 28 GB is used for caching the original dataset.

When I run random forest, it caches around 300 GB of intermediate data, which evicts the original dataset from the cache. This caching is triggered by the code below in RandomForest.scala:

```
val baggedInput = BaggedPoint
  .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees,
    withReplacement, seed)
  .persist(StorageLevel.MEMORY_AND_DISK)
```

As I don't have control over the storage level, I cannot ensure the original dataset stays in memory for other interactive tasks while random forest is running.

Is it a good idea to make this storage level a user parameter? If so, I can open a JIRA issue and submit a PR for it.

--
Regards,
Madhukara Phatak
http://datamantra.io/
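The convertToBaggedRDD step that triggers the large intermediate cache performs per-tree bootstrap sampling: every data point carries one sample count per tree. A rough, hypothetical Python sketch of that sampling (not Spark's actual code) shows why the bagged representation can be several times the size of the input:

```python
import random

def bag_counts(num_points, num_trees, subsampling_rate=1.0, seed=42):
    """For each data point, draw a with-replacement sample count per tree,
    as in bootstrap aggregation (bagging)."""
    rng = random.Random(seed)
    draws = int(num_points * subsampling_rate)
    # counts[i][t] = how many times point i was sampled for tree t
    counts = [[0] * num_trees for _ in range(num_points)]
    for tree in range(num_trees):
        for _ in range(draws):
            counts[rng.randrange(num_points)][tree] += 1
    return counts

counts = bag_counts(num_points=100, num_trees=10)
```

Because every point stores num_trees extra counts and the whole structure is persisted with MEMORY_AND_DISK, it competes with the user's own cached data, which is the motivation for making the storage level configurable.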
Re: Contributing Documentation Changes
Hi,

I understand that. The following page, http://spark.apache.org/documentation.html, has an external tutorials/blogs section which points to other blog pages. I wanted to add mine there.

Regards,
Madhukara Phatak
http://datamantra.io/

On Fri, Apr 24, 2015 at 5:17 PM, Sean Owen <so...@cloudera.com> wrote:

> I think that your own tutorials and such should live on your blog. The
> goal isn't to pull in a bunch of external docs to the site.
>
> On Fri, Apr 24, 2015 at 12:57 AM, madhu phatak <phatak@gmail.com> wrote:
>
>> Hi,
>>
>> As I was reading the Contributing to Spark wiki, it was mentioned that
>> we can contribute external links to Spark tutorials. I have written many
>> of them on my blog: http://blog.madhukaraphatak.com/categories/spark/
>>
>> It would be great if someone could add them to the Spark website.
>>
>> Regards,
>> Madhukara Phatak
>> http://datamantra.io/
Contributing Documentation Changes
Hi,

As I was reading the Contributing to Spark wiki, it was mentioned that we can contribute external links to Spark tutorials. I have written many of them on my blog: http://blog.madhukaraphatak.com/categories/spark/

It would be great if someone could add them to the Spark website.

Regards,
Madhukara Phatak
http://datamantra.io/
Help needed to publish SizeEstimator as separate library
Hi,

As I was going through the Spark source code, SizeEstimator (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/SizeEstimator.scala) caught my eye. It's a very useful tool for estimating object sizes on the JVM, which helps in use cases like a memory-bounded cache. It would be useful to have this as a separate library that other projects can use too. There was a discussion about this long back (https://spark-project.atlassian.net/browse/SPARK-383), but I don't see any updates on it.

I have extracted the code and packaged it as a separate project on GitHub: https://github.com/phatak-dev/java-sizeof. I have simplified the code to remove the dependencies on google-guava and OpenHashSet, which leads to a small compromise in accuracy for large arrays, but at the same time greatly simplifies the code base and dependency graph.

I want to publish it to Maven Central so it can be added as a dependency. Though I have published the code under my package, com.madhu, keeping the license information, I am not sure this is the right way to do it. It would be great if someone could guide me on package naming and attribution.

--
Regards,
Madhukara Phatak
http://www.madhukaraphatak.com
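The recursive object-graph traversal that SizeEstimator performs can be sketched in Python with sys.getsizeof. This is only a rough illustration of the idea behind such a tool (the real class walks JVM object layouts, alignment, and pointer sizes, which Python does not expose):

```python
import sys

def deep_size(obj, _seen=None):
    """Estimate the deep size of an object graph, counting each object once."""
    if _seen is None:
        _seen = set()
    if id(obj) in _seen:        # avoid double-counting shared/cyclic references
        return 0
    _seen.add(id(obj))
    size = sys.getsizeof(obj)   # shallow size of this object
    if isinstance(obj, dict):
        size += sum(deep_size(k, _seen) + deep_size(v, _seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_size(item, _seen) for item in obj)
    return size

nested = {"a": [1, 2, 3], "b": ("x", "y")}
# deep size includes the keys and container contents, so it exceeds
# the shallow size of the outer dict alone
```

A memory-bounded cache would call such an estimator on each inserted value and evict entries once the running total crosses its budget.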