Re: Time window on Processing Time

2017-08-30 Thread madhu phatak
import org.apache.spark.sql.functions._

ds.withColumn("processingTime", current_timestamp())
  .groupBy(window("processingTime", "1 minute"))
  .count()

On Mon, Aug 28, 2017 at 5:46 AM, madhu phatak <phatak@gmail.com> wrote:
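A fuller, self-contained sketch of that pattern (the socket source and console sink are placeholders; the key point is stamping each row with current_timestamp() and windowing on that column):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._

  val spark = SparkSession.builder().appName("processing-time-window").getOrCreate()

  // Any streaming source works; a socket source is used here only for illustration.
  val lines = spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()

  // Stamp rows with the wall-clock time at processing and window on that column.
  val counts = lines
    .withColumn("processingTime", current_timestamp())
    .groupBy(window(col("processingTime"), "1 minute"))
    .count()

  counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
    .awaitTermination()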

Time window on Processing Time

2017-08-28 Thread madhu phatak
Hi, As I am playing with structured streaming, I observed that the window function always requires a time column in the input data, so that means it's event time. Is it possible to use the old Spark Streaming style of window function based on processing time? I don't see any documentation on the same. -- Regards,

Re: How to create SparkSession using SparkConf?

2017-04-28 Thread madhu phatak
SparkSession.builder.config() takes a SparkConf as a parameter. You can use that to pass the SparkConf as-is. https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/SparkSession.Builder.html#config(org.apache.spark.SparkConf) On Fri, Apr 28, 2017 at 11:40 AM, Yanbo Liang
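A minimal sketch of that usage (the conf settings shown are placeholders):

  import org.apache.spark.SparkConf
  import org.apache.spark.sql.SparkSession

  // Build the SparkConf as before and hand it to the builder unchanged.
  val conf = new SparkConf()
    .setAppName("conf-to-session")
    .setMaster("local[2]")
    .set("spark.sql.shuffle.partitions", "4")

  val spark = SparkSession.builder()
    .config(conf)
    .getOrCreate()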

Re: Spark structured streaming is Micro batch?

2016-05-07 Thread madhu phatak
querying the stream without having to store or transform. I have not used it yet but it seems it will start streaming data from the source as soon as you define it. Thanks Deepak On Fri, May 6, 2016 at 1:37 PM, madhu pha

Spark structured streaming is Micro batch?

2016-05-06 Thread madhu phatak
Hi, As I was playing with the new structured streaming API, I noticed that Spark starts processing as and when the data appears. It no longer seems like micro-batch processing. Will Spark structured streaming be event-based processing? -- Regards, Madhukara Phatak http://datamantra.io/

Talk on Deep dive in Spark Dataframe API

2015-08-06 Thread madhu phatak
Hi, Recently I gave a talk on a deep dive into the DataFrame API and SQL Catalyst. Video of the same is available on YouTube with slides and code. Please have a look if you are interested. http://blog.madhukaraphatak.com/anatomy-of-spark-dataframe-api/

Spark Mesos Dispatcher

2015-07-19 Thread Jahagirdar, Madhu
All, Can we run different versions of Spark using the same Mesos Dispatcher? For example, can we run drivers with Spark 1.3 and Spark 1.4 at the same time? Regards, Madhu Jahagirdar

RE: Spark Mesos Dispatcher

2015-07-19 Thread Jahagirdar, Madhu
, Madhu Cc: user; d...@spark.apache.org Subject: Re: Spark Mesos Dispatcher Yes. Sent from my iPhone On 19 Jul, 2015, at 10:52 pm, Jahagirdar, Madhu madhu.jahagir...@philips.com wrote: All, Can we run different version of Spark using the same Mesos Dispatcher

Running mllib from R in Spark 1.4

2015-07-15 Thread madhu phatak
Hi, I have been playing with the Spark R API that is introduced in Spark 1.4. Can we use any MLlib functionality from R as of now? From the documentation it looks like we can only use SQL/DataFrame functionality for now. I know there is a separate SparkR project but it does not

Talk on Deep dive into Spark Data source API

2015-06-30 Thread madhu phatak
Hi, Recently I gave a talk on how to create spark data sources from scratch. Screencast of the same is available on Youtube with slides and code. Please have a look if you are interested. http://blog.madhukaraphatak.com/anatomy-of-spark-datasource-api/ -- Regards, Madhukara Phatak

Re: Re: how to distributed run a bash shell in spark

2015-05-25 Thread madhu phatak
Hi, You can use the pipe operator if you are running a shell/Perl script on some data. More information on my blog: http://blog.madhukaraphatak.com/pipe-in-spark/. Regards, Madhukara Phatak http://datamantra.io/ On Mon, May 25, 2015 at 8:02 AM, luohui20...@sina.com wrote: Thanks Akhil,
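A minimal sketch of the pipe operator (the script path is a placeholder; it must exist on every worker, or be shipped with --files):

  // Each record is written to the script's stdin as a line; each line of the
  // script's stdout comes back as a record of the resulting RDD.
  val data = sc.parallelize(Seq("a", "b", "c"))
  val piped = data.pipe("/path/to/script.sh")
  piped.collect().foreach(println)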

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
On Tue, May 19, 2015 at 8:04 PM, madhu phatak phatak@gmail.com wrote: Hi, I am trying to run a Spark SQL aggregation on a file with 26k columns. The number of rows is very small. I am running into an issue where Spark takes a huge amount of time to parse the SQL and create a logical plan. Even if I have

Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, I am trying to run a Spark SQL aggregation on a file with 26k columns. The number of rows is very small. I am running into an issue where Spark takes a huge amount of time to parse the SQL and create a logical plan. Even if I have just one row, it's taking more than 1 hour just to get past the parsing. Any

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
, 2015 at 3:59 PM, ayan guha guha.a...@gmail.com wrote: Can you kindly share your code? On Tue, May 19, 2015 at 8:04 PM, madhu phatak phatak@gmail.com wrote: Hi, I am trying to run a Spark SQL aggregation on a file with 26k columns. The number of rows is very small. I am running into an issue where Spark

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, Additional information: the table is backed by a CSV file which is read using spark-csv from Databricks. Regards, Madhukara Phatak http://datamantra.io/ On Tue, May 19, 2015 at 4:05 PM, madhu phatak phatak@gmail.com wrote: Hi, I have fields from field_0 to field_26000. The query

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, Tested calculating values for 300 columns. The analyser takes around 4 minutes to generate the plan. Is this normal? Regards, Madhukara Phatak http://datamantra.io/ On Tue, May 19, 2015 at 4:35 PM, madhu phatak phatak@gmail.com wrote: Hi, I am using Spark 1.3.1 Regards

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
, Madhukara Phatak http://datamantra.io/ On Tue, May 19, 2015 at 6:23 PM, madhu phatak phatak@gmail.com wrote: Hi, Tested with HiveContext also. It also takes a similar amount of time. To make things clear, the following is the select clause for a given column: aggregateStats( $columnName , max

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
(*) ). aggregateStats is a UDF generating a case class to hold the values. Regards, Madhukara Phatak http://datamantra.io/ On Tue, May 19, 2015 at 5:57 PM, madhu phatak phatak@gmail.com wrote: Hi, Tested for calculating values for 300 columns. Analyser takes around 4 minutes to generate
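A rough sketch of the shape of such a query, built programmatically instead of as one huge SQL string (the built-in max/min stand in here for the poster's aggregateStats UDF, and df is assumed to be the DataFrame loaded from the CSV):

  import org.apache.spark.sql.functions._

  // One pair of aggregate expressions per column; for 26k columns this list is
  // very large, and the analyser has to resolve every expression in it.
  val columns = (0 until 26000).map(i => s"field_$i")
  val aggs = columns.flatMap { c =>
    Seq(max(col(c)).as(s"${c}_max"), min(col(c)).as(s"${c}_min"))
  }
  val result = df.groupBy().agg(aggs.head, aggs.tail: _*)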

Spark JDBC data source API issue with mysql

2015-04-27 Thread madhu phatak
Hi, I have been trying out the Spark data source API with JDBC. The following is the code to get a DataFrame: Try(hc.load("org.apache.spark.sql.jdbc", Map("url" -> dbUrl, "dbtable" -> s"($query)"))) By looking at test cases, I found that the query has to be inside brackets, otherwise it's treated as a table name.
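A sketch of that pattern with an explicit alias, which MySQL typically requires on a derived table (dbUrl and query are assumed to be defined as above):

  import scala.util.Try

  // The JDBC source treats the value of "dbtable" as something it can place in a
  // FROM clause, so a query must be wrapped in parentheses and aliased.
  val df = Try(
    hc.load(
      "org.apache.spark.sql.jdbc",
      Map("url" -> dbUrl, "dbtable" -> s"($query) as t")
    )
  )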

Re: Hive table creation - possible bug in Spark 1.3?

2015-04-23 Thread madhu phatak
Hi, Hive table creation needs an extra step from 1.3. You can follow this template: df.registerTempTable(tableName) hc.sql(s"create table $tableName as select * from $tableName") This will save the table in Hive with the given tableName. Regards, Madhukara Phatak

Re: Hive table creation - possible bug in Spark 1.3?

2015-04-23 Thread madhu phatak
Hi Michael, Here https://issues.apache.org/jira/browse/SPARK-7084 is the jira issue and PR https://github.com/apache/spark/pull/5654 for the same. Please have a look. Regards, Madhukara Phatak http://datamantra.io/ On Thu, Apr 23, 2015 at 1:22 PM, madhu phatak phatak@gmail.com wrote: Hi

Re: Is the Spark-1.3.1 support build with scala 2.8 ?

2015-04-23 Thread madhu phatak
Hi, AFAIK it's only built with 2.10 and 2.11. You should integrate kafka_2.10.0-0.8.0 to make it work. Regards, Madhukara Phatak http://datamantra.io/ On Fri, Apr 24, 2015 at 9:22 AM, guoqing0...@yahoo.com.hk guoqing0...@yahoo.com.hk wrote: Is the Spark-1.3.1 support build with scala 2.8

Anatomy of RDD : Deep dive into RDD data structure

2015-03-31 Thread madhu phatak
Hi, Recently I gave a talk on the RDD data structure which gives an in-depth understanding of Spark internals. You can watch it on YouTube https://www.youtube.com/watch?v=WVdyuVwWcBc. Also slides are on SlideShare http://www.slideshare.net/datamantra/anatomy-of-rdd and code is on github

Re: MappedStream vs Transform API

2015-03-17 Thread madhu phatak
DStreams. TD On Mon, Mar 16, 2015 at 3:37 AM, madhu phatak phatak@gmail.com wrote: Hi, Thanks for the response. I understand that part. But I am asking why the internal implementation uses a subclass when it can use an existing API? Unless there is a real difference, it feels like code

Re: why generateJob is a private API?

2015-03-17 Thread madhu phatak
DStream.foreachRDD On Sun, Mar 15, 2015 at 11:14 PM, madhu phatak phatak@gmail.com wrote: Hi, I am trying to create a simple subclass of DStream. If I understand correctly, I should override *compute* for lazy operations and *generateJob* for actions. But when I try to override generateJob

Re: MappedStream vs Transform API

2015-03-17 Thread madhu phatak
not super essential then ok. If you are interested in contributing to Spark Streaming, I can point you to a number of issues where your contributions will be more valuable. Yes please. TD On Tue, Mar 17, 2015 at 1:56 AM, madhu phatak phatak@gmail.com wrote: Hi, Thank you

Re: MappedStream vs Transform API

2015-03-17 Thread madhu phatak
not super essential then ok. If you are interested in contributing to Spark Streaming, I can point you to a number of issues where your contributions will be more valuable. That will be great. TD On Tue, Mar 17, 2015 at 1:56 AM, madhu phatak phatak@gmail.com wrote: Hi, Thank

Re: MappedStream vs Transform API

2015-03-16 Thread madhu phatak
straightforward and easy to understand. Thanks Jerry *From:* madhu phatak [mailto:phatak@gmail.com] *Sent:* Monday, March 16, 2015 4:32 PM *To:* user@spark.apache.org *Subject:* MappedStream vs Transform API Hi, Current implementation of map function in spark streaming looks as below

why generateJob is a private API?

2015-03-16 Thread madhu phatak
Hi, I am trying to create a simple subclass of DStream. If I understand correctly, I should override *compute* for lazy operations and *generateJob* for actions. But when I try to override generateJob, it gives an error saying the method is private to the streaming package. Is my approach correct or am

Re: Need Advice about reading lots of text files

2015-03-16 Thread madhu phatak
Hi, Internally Spark uses the HDFS API to handle file data. Have a look at HAR and the SequenceFile input format. More information on this Cloudera blog: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/. Regards, Madhukara Phatak http://datamantra.io/ On Sun, Mar 15, 2015 at 9:59 PM, Pat
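One simple alternative worth naming plainly, since it is not the HAR/SequenceFile route mentioned above: sc.wholeTextFiles reads each small file into a single (path, content) pair (the input path is a placeholder):

  // Each element is (filePath, fileContent); useful when files are small enough
  // to fit comfortably in memory as single records.
  val files = sc.wholeTextFiles("hdfs:///data/small-files/*")
  val lineCounts = files.mapValues(_.split("\n").length)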

MappedStream vs Transform API

2015-03-16 Thread madhu phatak
Hi, The current implementation of the map function in Spark Streaming looks as below:

def map[U: ClassTag](mapFunc: T => U): DStream[U] = {
  new MappedDStream(this, context.sparkContext.clean(mapFunc))
}

It creates an instance of MappedDStream which is a subclass of DStream. The same function can
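Roughly what the question is getting at: the same operation expressed through the public transform API rather than a dedicated subclass (a sketch meant to be read in the same position as the snippet above, i.e. inside DStream):

  // transform already cleans and applies the closure to each batch RDD,
  // so no MappedDStream subclass is needed.
  def map[U: ClassTag](mapFunc: T => U): DStream[U] =
    transform(rdd => rdd.map(mapFunc))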

Spark Drill 1.2.1 - error

2015-02-26 Thread Jahagirdar, Madhu
at org.apache.spark.rdd.JdbcRDD.compute(JdbcRDD.scala:73)
at org.apache.spark.rdd.JdbcRDD.compute(JdbcRDD.scala:53)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.scheduler.
Regards, Madhu Jahagirdar

Re: Streaming: getting data from Cassandra based on input stream values

2015-01-23 Thread madhu phatak
at same time }) On Sat, Jan 24, 2015 at 3:32 AM, Greg Temchenko s...@dicefield.com wrote: Hi Madhu, Thanks for your response! But as I understand, in this case you select all the data from the Cassandra table. I don't want to do that as it can be huge. I just want to look up some ids in the table. So

Re: save a histogram to a file

2015-01-23 Thread madhu phatak
Hi, The histogram method returns normal Scala types, not an RDD, so you will not have saveAsTextFile. You can use the makeRDD method to make an RDD out of the data and saveAsObjectFile: val hist = a.histogram(10) val histRDD = sc.makeRDD(hist) histRDD.saveAsObjectFile(path) On Fri, Jan 23, 2015 at 5:37 AM, SK
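Note that histogram(10) actually returns a pair of plain arrays (bucket boundaries and counts), so a corrected sketch of the idea zips them before building the RDD (a and path are assumed to be defined as above):

  // buckets has 11 boundaries and counts has 10 values; zip pairs them up.
  val (buckets, counts) = a.histogram(10)
  val histRDD = sc.makeRDD(buckets.zip(counts))
  histRDD.saveAsTextFile(path)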

Re: DAG info

2015-01-03 Thread madhu phatak
Hi, You can turn off these messages using log4j.properties. On Fri, Jan 2, 2015 at 1:51 PM, Robineast robin.e...@xense.co.uk wrote: Do you have some example code of what you are trying to do? Robin
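An alternative to editing conf/log4j.properties is to raise the log level programmatically at the start of the job (the loggers and levels shown are illustrative):

  import org.apache.log4j.{Level, Logger}

  // Silence Spark's INFO chatter; application logging is unaffected.
  Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
  Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)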

Re: Can we say 1 RDD is generated every batch interval?

2014-12-30 Thread Jahagirdar, Madhu
Foreach iterates through the partitions in the RDD and executes the operations for each partition, I guess. On 29-Dec-2014, at 10:19 pm, SamyaMaiti samya.maiti2...@gmail.com wrote: Hi All, Please clarify. Can we say 1 RDD is generated every batch interval? If the above is true. Then, is

Re: broadcasting object issue

2014-12-22 Thread madhu phatak
Hi, Just ran your code on spark-shell. If you replace val bcA = sc.broadcast(a) with val bcA = sc.broadcast(new B().getA) it seems to work. Not sure why. On Tue, Dec 23, 2014 at 9:12 AM, Henry Hung ythu...@winbond.com wrote: Hi All, I have a problem with broadcasting a serialize

Re: Joins in Spark

2014-12-22 Thread madhu phatak
Hi, You can map your vertices RDD as follows: val pairVertices = verticesRDD.map(vertice => (vertice, null)) The above gives you a pair RDD. After the join, make sure you remove the superfluous null value. On Tue, Dec 23, 2014 at 10:36 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have two
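A small sketch of the whole suggestion (otherRDD is an assumed name for the pair RDD being joined against):

  // Key the plain vertices by themselves, join, then drop the placeholder null.
  val pairVertices = verticesRDD.map(vertice => (vertice, null))
  val joined = pairVertices.join(otherRDD)                 // (vertice, (null, value))
  val cleaned = joined.map { case (v, (_, value)) => (v, value) }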

Re: reading files recursively using spark

2014-12-19 Thread madhu phatak
Hi, You can use Hadoop's FileInputFormat API and Spark's newAPIHadoopFile to get recursion. More on the topic here: http://stackoverflow.com/questions/8114579/using-fileinputformat-addinputpaths-to-recursively-add-hdfs-path On Fri, Dec 19, 2014 at 4:50 PM, Sean Owen
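A hedged sketch of one way to do this on Hadoop 2.x; the recursive property name is the assumption here, and the input path is a placeholder:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  // Copy the existing Hadoop conf and ask FileInputFormat to recurse into subdirectories.
  val conf = new Configuration(sc.hadoopConfiguration)
  conf.setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", true)

  val lines = sc.newAPIHadoopFile(
      "hdfs:///data/root-dir",
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      conf)
    .map(_._2.toString)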

Re: SchemaRDD.sample problem

2014-12-18 Thread madhu phatak
Hi, Can you clean up the code a little bit? It's hard to read what's going on. You can use pastebin or gist to put the code. On Wed, Dec 17, 2014 at 3:58 PM, Hao Ren inv...@gmail.com wrote: Hi, I am using SparkSQL on the 1.2.1 branch. The problem comes from the following 4-line code: *val

Re: When will spark 1.2 released?

2014-12-18 Thread madhu phatak
It’s on Maven Central already http://search.maven.org/#browse%7C717101892 On Fri, Dec 19, 2014 at 11:17 AM, vboylin1...@gmail.com vboylin1...@gmail.com wrote: Hi, Does anyone know when spark 1.2 will be released? 1.2 has many great features that we can't wait for ,-) Sincerely Lin wukang

RE: CheckPoint Issue with JsonRDD

2014-11-07 Thread Jahagirdar, Madhu
Michael, any idea on this? From: Jahagirdar, Madhu Sent: Thursday, November 06, 2014 2:36 PM To: mich...@databricks.com; user Subject: CheckPoint Issue with JsonRDD When we enable checkpoint and use JsonRDD we get the following error: Is this a bug

CheckPoint Issue with JsonRDD

2014-11-06 Thread Jahagirdar, Madhu
val schRdd = hivecontext.jsonRDD(rdd, schema)
logInfo("inserting into table: " + TEMP_TABLE_NAME)
schRdd.insertInto(TEMP_TABLE_NAME)
} })
jssc.checkpoint(CHECKPOINT_DIR)
jssc } }
case class Person(name: String, age: String) extends Serializable
Regards, Madhu Jahagirdar

Dynamically InferSchema From Hive and Create parquet file

2014-11-05 Thread Jahagirdar, Madhu
()) .registerTempTable(TEMP_TABLE_NAME); Is it possible to dynamically infer the schema from Hive using the hive context and the table name, then give that schema? Regards, Madhu Jahagirdar

Issue with Spark Twitter Streaming

2014-10-13 Thread Jahagirdar, Madhu
All, We are using Spark Streaming to receive data from the Twitter stream. This is running behind a proxy. We have done the following configuration inside Spark Streaming for twitter4j to work behind the proxy. def main(args: Array[String]) { val filters = Array("Modi")
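A hedged sketch of one way to route twitter4j through a proxy: build the twitter4j configuration explicitly and pass the resulting authorization to TwitterUtils.createStream (the proxy host/port and OAuth values are placeholders assumed to be defined elsewhere; ssc and filters are as in the original code):

  import org.apache.spark.streaming.twitter.TwitterUtils
  import twitter4j.auth.OAuthAuthorization
  import twitter4j.conf.ConfigurationBuilder

  // Proxy and OAuth settings go into the twitter4j configuration itself,
  // so they travel with the receiver.
  val cb = new ConfigurationBuilder()
    .setHttpProxyHost("proxy.example.com")
    .setHttpProxyPort(8080)
    .setOAuthConsumerKey(consumerKey)
    .setOAuthConsumerSecret(consumerSecret)
    .setOAuthAccessToken(accessToken)
    .setOAuthAccessTokenSecret(accessTokenSecret)

  val stream = TwitterUtils.createStream(ssc, Some(new OAuthAuthorization(cb.build())), filters)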

RE: Dstream Transformations

2014-10-06 Thread Jahagirdar, Madhu
Given that I have multiple worker nodes, when Spark schedules the job again on the worker nodes that are alive, does it then again store the data in Elasticsearch and then Flume, or does it only run the functions that store to Flume? Regards, Madhu Jahagirdar

RE: Dstream Transformations

2014-10-06 Thread Jahagirdar, Madhu
To: Jahagirdar, Madhu Cc: Akhil Das; user Subject: Re: Dstream Transformations From the Spark Streaming Programming Guide (http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node): ...output operations (like foreachRDD) have at-least once semantics

hdfs short circuit

2014-07-03 Thread Jahagirdar, Madhu
Can I enable Spark to use the dfs.client.read.shortcircuit property to improve performance and read natively on local nodes instead of going through the HDFS API?
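For reference, a rough sketch of the client-side properties involved (the socket path is an example value, and this only helps if the DataNodes themselves are also configured for short-circuit reads with the same dfs.domain.socket.path):

  // Normally these live in hdfs-site.xml on the client's classpath; setting them on
  // the SparkContext's Hadoop configuration covers the client side only.
  sc.hadoopConfiguration.setBoolean("dfs.client.read.shortcircuit", true)
  sc.hadoopConfiguration.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket")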

Re: best practice: write and debug Spark application in scala-ide and maven

2014-06-07 Thread Madhu
that should be sufficient for your example. - Madhu https://www.linkedin.com/in/msiddalingaiah

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Madhu
Can you identify a specific file that fails? There might be a real bug here, but I have found gzip to be reliable. Every time I have run into a bad header error with gzip, I had a non-gzip file with the wrong extension for whatever reason. - Madhu https://www.linkedin.com/in/msiddalingaiah

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-20 Thread Madhu
I have read gzip files from S3 successfully. It sounds like a file is corrupt or not a valid gzip file. Does it work with fewer gzip files? How are you reading the files? - Madhu https://www.linkedin.com/in/msiddalingaiah

Re: Configuring Spark for reduceByKey on on massive data sets

2014-05-17 Thread Madhu
Daniel, How many partitions do you have? Are they more or less uniformly distributed? We have similar data volume currently running well on Hadoop MapReduce with roughly 30 nodes. I was planning to test it with Spark. I'm very interested in your findings. - Madhu https

Re: 1.0.0 Release Date?

2014-05-15 Thread Madhu
Spark 1.0.0 rc5 is available and open for voting. Give it a try and vote on it on the dev list. - Madhu https://www.linkedin.com/in/msiddalingaiah

Re: instalation de spark

2014-05-14 Thread Madhu
have Java installed, I have Java 7 You can install Scala 2.10.x for Scala development. I have Python 2.7.6? For pySpark I use ScalaIDE Eclipse plugin. Let me know how it works out. - Madhu https://www.linkedin.com/in/msiddalingaiah

Re: java.lang.NoSuchMethodError on Java API

2014-05-12 Thread Madhu
in that's referenced in that method. There's a lot of stuff going on in that method, so it's not easy for me to follow. I would break it down to more manageable pieces and build it up one step at a time. Sorry I couldn't find the problem. - Madhu https://www.linkedin.com/in/msiddalingaiah

Re: why is Spark 0.9.1 (context creation?) so slow on my OSX laptop?

2014-05-11 Thread Madhu
Svend, I built it on my iMac and it was about the same speed as Windows 7, RHEL 6 VM on Windows 7, and Linux on EC2. Spark is pleasantly easy to build on all of these platforms, which is wonderful. How long does it take to start spark-shell? Maybe it's a JVM memory setting problem on your