Re: Time window on Processing Time

2017-08-30 Thread madhu phatak
import org.apache.spark.sql.functions._

ds.withColumn("processingTime", current_timestamp())
  .groupBy(window(col("processingTime"), "1 minute"))
  .count()

On Mon, Aug 28, 2017 at 5:46 AM, madhu phatak <phatak@gmail.com> wrote:
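For completeness, a minimal end-to-end sketch of the suggestion above. The rate source and console sink are assumptions purely so the snippet runs on its own; any source and sink work the same way:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("processing-time-window").getOrCreate()

    // hypothetical input stream; replace with your real source
    val ds = spark.readStream.format("rate").load()

    // stamp each row with the wall-clock time it was seen, then window on that column
    val counts = ds
      .withColumn("processingTime", current_timestamp())
      .groupBy(window(col("processingTime"), "1 minute"))
      .count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()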

Time window on Processing Time

2017-08-28 Thread madhu phatak
Hi, As I am playing with structured streaming, I observed that the window function always requires a time column in the input data, which means it is event time. Is it possible to do old Spark Streaming style window functions based on processing time? I don't see any documentation on the same. -- Regards,

Re: How to create SparkSession using SparkConf?

2017-04-28 Thread madhu phatak
SparkSession.builder.config() takes a SparkConf as a parameter. You can use that to pass your SparkConf as it is. https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/SparkSession.Builder.html#config(org.apache.spark.SparkConf) On Fri, Apr 28, 2017 at 11:40 AM, Yanbo Liang
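A short sketch of what that looks like; the app name, master, and config key are made-up values for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      .setAppName("my-app")
      .setMaster("local[*]")
      .set("spark.sql.shuffle.partitions", "8")

    // builder.config(conf) copies every entry from the SparkConf into the session's config
    val spark = SparkSession.builder().config(conf).getOrCreate()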

Re: Spark structured streaming is Micro batch?

2016-05-07 Thread madhu phatak
querying the stream without having to store or transform. I have not used it yet but it seems it will be like starting to stream data from the source as soon as you define it. Thanks Deepak On Fri, May 6, 2016 at 1:37 PM, madhu pha

Spark structured streaming is Micro batch?

2016-05-06 Thread madhu phatak
Hi, As I was playing with the new structured streaming API, I noticed that Spark starts processing as and when the data appears. It no longer seems like micro-batch processing. Will Spark structured streaming be event-based processing? -- Regards, Madhukara Phatak http://datamantra.io/

Talk on Deep dive in Spark Dataframe API

2015-08-06 Thread madhu phatak
Hi, Recently I gave a talk on a deep dive into the DataFrame API and SQL Catalyst. Video of the same is available on YouTube with slides and code. Please have a look if you are interested. http://blog.madhukaraphatak.com/anatomy-of-spark-dataframe-api/

Running mllib from R in Spark 1.4

2015-07-15 Thread madhu phatak
Hi, I have been playing with the Spark R API that was introduced in the Spark 1.4 version. Can we use any MLlib functionality from R as of now? From the documentation it looks like we can only use SQL/DataFrame functionality at the moment. I know there is a separate SparkR project but it does not

Talk on Deep dive into Spark Data source API

2015-06-30 Thread madhu phatak
Hi, Recently I gave a talk on how to create Spark data sources from scratch. A screencast of the same is available on YouTube with slides and code. Please have a look if you are interested. http://blog.madhukaraphatak.com/anatomy-of-spark-datasource-api/ -- Regards, Madhukara Phatak

Re: Re: how to distributed run a bash shell in spark

2015-05-25 Thread madhu phatak
Hi, You can use the pipe operator if you are running a shell/Perl script on some data. More information is on my blog: http://blog.madhukaraphatak.com/pipe-in-spark/. Regards, Madhukara Phatak http://datamantra.io/ On Mon, May 25, 2015 at 8:02 AM, luohui20...@sina.com wrote: Thanks Akhil,
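A minimal sketch of the pipe operator; the script path is an assumption and the script has to exist on every worker node:

    // each partition's elements are written to the script's stdin, one per line,
    // and the script's stdout lines come back as an RDD[String]
    val data = sc.parallelize(Seq("line one", "line two", "line three"))
    val piped = data.pipe("/path/to/count_words.sh")   // hypothetical script
    piped.collect().foreach(println)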

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
On Tue, May 19, 2015 at 8:04 PM, madhu phatak phatak@gmail.com wrote: Hi, I am trying to run a Spark SQL aggregation on a file with 26k columns. The number of rows is very small. I am running into an issue where Spark is taking a huge amount of time to parse the SQL and create a logical plan. Even if I have

Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, I am trying to run a Spark SQL aggregation on a file with 26k columns. The number of rows is very small. I am running into an issue where Spark is taking a huge amount of time to parse the SQL and create a logical plan. Even if I have just one row, it's taking more than 1 hour just to get past the parsing. Any

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
, 2015 at 3:59 PM, ayan guha guha.a...@gmail.com wrote: Can you kindly share your code? On Tue, May 19, 2015 at 8:04 PM, madhu phatak phatak@gmail.com wrote: Hi, I am trying to run a Spark SQL aggregation on a file with 26k columns. The number of rows is very small. I am running into an issue where Spark

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, An additional piece of information: the table is backed by a CSV file which is read using spark-csv from Databricks. Regards, Madhukara Phatak http://datamantra.io/ On Tue, May 19, 2015 at 4:05 PM, madhu phatak phatak@gmail.com wrote: Hi, I have fields from field_0 to field_26000. The query
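For reference, a sketch of how such a table might be loaded with spark-csv through the 1.3-style API; the path and options here are assumptions:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // com.databricks:spark-csv data source, loaded via SQLContext.load (Spark 1.3 style)
    val df = sqlContext.load(
      "com.databricks.spark.csv",
      Map("path" -> "hdfs:///data/wide_file.csv", "header" -> "true"))
    df.registerTempTable("wide_table")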

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, Tested calculating values for 300 columns. The analyser takes around 4 minutes to generate the plan. Is this normal? Regards, Madhukara Phatak http://datamantra.io/ On Tue, May 19, 2015 at 4:35 PM, madhu phatak phatak@gmail.com wrote: Hi, I am using spark 1.3.1 Regards

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
, Madhukara Phatak http://datamantra.io/ On Tue, May 19, 2015 at 6:23 PM, madhu phatak phatak@gmail.com wrote: Hi, Tested with HiveContext also. It also takes a similar amount of time. To make things clear, the following is the select clause for a given column: aggregateStats( $columnName , max

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
(*) ). aggregateStats is a UDF generating a case class to hold the values. Regards, Madhukara Phatak http://datamantra.io/ On Tue, May 19, 2015 at 5:57 PM, madhu phatak phatak@gmail.com wrote: Hi, Tested calculating values for 300 columns. The analyser takes around 4 minutes to generate
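The original aggregateStats is not shown in full in the thread; a rough sketch of the general pattern it describes (a UDF packing several aggregates into a case class), with made-up fields and assuming numeric columns, might look like this:

    case class Stats(min: Double, max: Double, mean: Double)

    // hypothetical stand-in for the real aggregateStats UDF
    sqlContext.udf.register("aggregateStats",
      (mn: Double, mx: Double, mean: Double) => Stats(mn, mx, mean))

    // one such expression per column; the thread generates thousands of these
    val stats = df.selectExpr(
      "aggregateStats(min(field_0), max(field_0), avg(field_0)) as field_0_stats")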

Spark JDBC data source API issue with mysql

2015-04-27 Thread madhu phatak
Hi, I have been trying out the Spark data source API with JDBC. The following is the code to get a DataFrame: Try(hc.load("org.apache.spark.sql.jdbc", Map("url" -> dbUrl, "dbtable" -> s"($query)"))) By looking at the test cases, I found that the query has to be inside brackets, otherwise it's treated as a table name.
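Spelled out as a sketch (hc being the same context as above); the MySQL URL and query are assumptions, and for MySQL the bracketed query also needs an alias, since the value of dbtable ends up as a derived table in "select * from <dbtable>":

    import scala.util.Try

    val dbUrl = "jdbc:mysql://localhost:3306/test?user=root&password=secret"  // assumed
    val query = "select id, name from users where active = 1"                 // assumed

    // wrap the query in brackets and alias it so it is treated as a derived table
    val df = Try(hc.load(
      "org.apache.spark.sql.jdbc",
      Map("url" -> dbUrl, "dbtable" -> s"($query) as t")))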

Re: Hive table creation - possible bug in Spark 1.3?

2015-04-23 Thread madhu phatak
Hi, Hive table creation needs an extra step from 1.3. You can follow the following template:

df.registerTempTable(tableName)
hc.sql(s"create table $tableName as select * from $tableName")

This will save the table in Hive with the given tableName. Regards, Madhukara Phatak

Re: Hive table creation - possible bug in Spark 1.3?

2015-04-23 Thread madhu phatak
Hi Michael, Here https://issues.apache.org/jira/browse/SPARK-7084 is the jira issue and PR https://github.com/apache/spark/pull/5654 for the same. Please have a look. Regards, Madhukara Phatak http://datamantra.io/ On Thu, Apr 23, 2015 at 1:22 PM, madhu phatak phatak@gmail.com wrote: Hi

Re: Is the Spark-1.3.1 support build with scala 2.8 ?

2015-04-23 Thread madhu phatak
Hi, AFAIK it's only built with 2.10 and 2.11. You should integrate kafka_2.10.0-0.8.0 to make it work. Regards, Madhukara Phatak http://datamantra.io/ On Fri, Apr 24, 2015 at 9:22 AM, guoqing0...@yahoo.com.hk guoqing0...@yahoo.com.hk wrote: Is the Spark-1.3.1 support build with scala 2.8

Anatomy of RDD : Deep dive into RDD data structure

2015-03-31 Thread madhu phatak
Hi, Recently I gave a talk on the RDD data structure which gives an in-depth understanding of Spark internals. You can watch it on YouTube https://www.youtube.com/watch?v=WVdyuVwWcBc. Also slides are on SlideShare http://www.slideshare.net/datamantra/anatomy-of-rdd and code is on GitHub

Re: MappedStream vs Transform API

2015-03-17 Thread madhu phatak
DStreams. TD On Mon, Mar 16, 2015 at 3:37 AM, madhu phatak phatak@gmail.com wrote: Hi, Thanks for the response. I understand that part. But I am asking why the internal implementation uses a subclass when it can use an existing API? Unless there is a real difference, it feels like code

Re: why generateJob is a private API?

2015-03-17 Thread madhu phatak
DStream.foreachRDD On Sun, Mar 15, 2015 at 11:14 PM, madhu phatak phatak@gmail.com wrote: Hi, I am trying to create a simple subclass of DStream. If I understand correctly, I should override compute for lazy operations and generateJob for actions. But when I try to override generateJob

Re: MappedStream vs Transform API

2015-03-17 Thread madhu phatak
not super essential then ok. If you are interested in contributing to Spark Streaming, I can point you to a number of issues where your contributions will be more valuable. Yes please. TD On Tue, Mar 17, 2015 at 1:56 AM, madhu phatak phatak@gmail.com wrote: Hi, Thank you

Re: MappedStream vs Transform API

2015-03-17 Thread madhu phatak
not super essential then ok. If you are interested in contributing to Spark Streaming, I can point you to a number of issues where your contributions will be more valuable. That will be great. TD On Tue, Mar 17, 2015 at 1:56 AM, madhu phatak phatak@gmail.com wrote: Hi, Thank

Re: MappedStream vs Transform API

2015-03-16 Thread madhu phatak
straightforward and easy to understand. Thanks Jerry From: madhu phatak [mailto:phatak@gmail.com] Sent: Monday, March 16, 2015 4:32 PM To: user@spark.apache.org Subject: MappedStream vs Transform API Hi, The current implementation of the map function in Spark Streaming looks as below

why generateJob is a private API?

2015-03-16 Thread madhu phatak
Hi, I am trying to create a simple subclass of DStream. If I understand correctly, I should override compute for lazy operations and generateJob for actions. But when I try to override generateJob, it gives an error saying the method is private to the streaming package. Is my approach correct or am

Re: Need Advice about reading lots of text files

2015-03-16 Thread madhu phatak
Hi, Internally Spark uses the HDFS API to handle file data. Have a look at HAR and the SequenceFile input format. More information is on this Cloudera blog: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/. Regards, Madhukara Phatak http://datamantra.io/ On Sun, Mar 15, 2015 at 9:59 PM, Pat
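One concrete way to consolidate lots of small files, sketched here with assumed paths, is to pack them into a single SequenceFile keyed by file name and read that back in later jobs:

    // (fileName, fileContent) pairs for every small file under the directory
    val files = sc.wholeTextFiles("hdfs:///input/small-files")

    // write them out once as a SequenceFile ...
    files.saveAsSequenceFile("hdfs:///input/packed")

    // ... and read the packed form back in subsequent jobs
    val packed = sc.sequenceFile[String, String]("hdfs:///input/packed")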

MappedStream vs Transform API

2015-03-16 Thread madhu phatak
Hi, The current implementation of the map function in Spark Streaming looks as below.

def map[U: ClassTag](mapFunc: T => U): DStream[U] = {
  new MappedDStream(this, context.sparkContext.clean(mapFunc))
}

It creates an instance of MappedDStream, which is a subclass of DStream. The same function can
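For context, the behaviour the question is pointing at can be written against the existing transform API; a small sketch (the helper name is made up):

    import scala.reflect.ClassTag
    import org.apache.spark.streaming.dstream.DStream

    // the same result as map, expressed through the public transform API
    def mapViaTransform[T, U: ClassTag](stream: DStream[T], mapFunc: T => U): DStream[U] =
      stream.transform(rdd => rdd.map(mapFunc))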

Re: Streaming: getting data from Cassandra based on input stream values

2015-01-23 Thread madhu phatak
it doesn't make sense to me how I can put some values from the streamRDD into the Cassandra query (into the where method). Greg On 1/23/15 1:11 AM, madhu phatak wrote: Hi, Seems like you want to get a username for a given user id. You can use transform on the Kafka stream to join two RDDs. The pseudo
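A rough sketch of that idea with made-up names and types: the stream carries user ids, and an already-loaded (userId, userName) RDD is joined in on every batch:

    // usersRdd: RDD[(Int, String)] of (userId, userName), e.g. fetched from Cassandra beforehand
    // idStream: DStream[Int] of user ids arriving from Kafka (both names are assumptions)
    val userNames = idStream
      .map(id => (id, ()))                      // key the stream by user id
      .transform(rdd => rdd.join(usersRdd))     // per batch: RDD[(Int, ((), String))]
      .map { case (id, (_, name)) => (id, name) }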

Re: save a histogram to a file

2015-01-23 Thread madhu phatak
Hi, The histogram method returns plain Scala types, not an RDD, so you will not have saveAsTextFile. You can use the makeRDD method to make an RDD out of the data and save it with saveAsObjectFile. Note that histogram returns the bucket boundaries and the counts as two arrays, so zip them before making the RDD:

val (buckets, counts) = a.histogram(10)
val histRDD = sc.makeRDD(buckets.zip(counts))
histRDD.saveAsObjectFile(path)

On Fri, Jan 23, 2015 at 5:37 AM, SK

Re: DAG info

2015-01-03 Thread madhu phatak
Hi, You can turn off these messages using log4j.properties. On Fri, Jan 2, 2015 at 1:51 PM, Robineast robin.e...@xense.co.uk wrote: Do you have some example code of what you are trying to do? Robin
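If editing log4j.properties is not convenient, the same effect can also be had programmatically from the driver; a sketch (WARN is just an example level):

    import org.apache.log4j.{Level, Logger}

    // keep Spark's and Akka's own logging to warnings and above
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("akka").setLevel(Level.WARN)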

Re: broadcasting object issue

2014-12-22 Thread madhu phatak
Hi, Just ran your code on spark-shell. If you replace val bcA = sc.broadcast(a) with val bcA = sc.broadcast(new B().getA) it seems to work. Not sure why. On Tue, Dec 23, 2014 at 9:12 AM, Henry Hung ythu...@winbond.com wrote: Hi All, I have a problem with broadcasting a serialize

Re: Joins in Spark

2014-12-22 Thread madhu phatak
Hi, You can map your vertices RDD as follows: val pairVertices = verticesRDD.map(vertice => (vertice, null)) The above gives you a pair RDD. After the join, make sure that you remove the superfluous null value. On Tue, Dec 23, 2014 at 10:36 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have two
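Sketched out with assumed RDD names and types, the suggestion looks like this:

    // verticesRDD: RDD[Long]; propertiesRDD: RDD[(Long, String)] (both are assumptions)
    val pairVertices = verticesRDD.map(vertice => (vertice, null))

    // join on the vertex id, then drop the superfluous null from the left side
    val joined = pairVertices.join(propertiesRDD)
      .map { case (id, (_, value)) => (id, value) }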

Re: reading files recursively using spark

2014-12-19 Thread madhu phatak
Hi, You can use the FileInputFormat API of Hadoop and newAPIHadoopFile of Spark to get recursion. For more on the topic you can refer here: http://stackoverflow.com/questions/8114579/using-fileinputformat-addinputpaths-to-recursively-add-hdfs-path On Fri, Dec 19, 2014 at 4:50 PM, Sean Owen
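A sketch of the newAPIHadoopFile route with the recursive flag; the input path is an assumption and the property name is the Hadoop 2 (new API) one:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // tell the new-API FileInputFormat to descend into sub-directories
    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

    val lines = sc
      .newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///logs")
      .map { case (_, text) => text.toString }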

Re: SchemaRDD.sample problem

2014-12-18 Thread madhu phatak
Hi, Can you clean up the code a little bit, it's hard to read what's going on. You can use pastebin or gist to put the code. On Wed, Dec 17, 2014 at 3:58 PM, Hao Ren inv...@gmail.com wrote: Hi, I am using SparkSQL on the 1.2.1 branch. The problem comes from the following 4-line code: val

Re: When will spark 1.2 released?

2014-12-18 Thread madhu phatak
It’s on Maven Central already: http://search.maven.org/#browse%7C717101892 On Fri, Dec 19, 2014 at 11:17 AM, vboylin1...@gmail.com vboylin1...@gmail.com wrote: Hi, Does anyone know when spark 1.2 will be released? 1.2 has many great features that we can't wait for now ;-) Sincerely, Lin wukang