Re: Spark/Parquet/Statistics question

2017-11-21 Thread Rabin Banerjee
Spark is not adding any STAT metadata to Parquet files in version 1.6.x, so it scans all files when filtering. (1 to 30).map(i => (i, i.toString)).toDF("a", "b").sort("a").coalesce(1).write.format("parquet").saveAsTable("metrics") ./parquet-meta /user/hive/warehouse/metrics/*.parquet file: file:/user/h
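A minimal sketch of how to check whether pushdown is in play (spark-shell style, reusing the "metrics" table from the thread; the exact plan output varies by Spark version):

  sqlContext.setConf("spark.sql.parquet.filterPushdown", "true") // on by default in 1.6; set explicitly here
  val metrics = sqlContext.read.parquet("/user/hive/warehouse/metrics")
  metrics.filter("a > 15").explain(true) // look for PushedFilters in the physical plan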

Parquet Filter pushdown not working and statistics are not generating for any column with Spark 1.6 CDH 5.7

2017-11-21 Thread Rabin Banerjee
Hi All, I am using CDH 5.7, which comes with Spark version 1.6.0. I am saving my data set as Parquet and then querying it. The query executes fine, but when I checked the files generated by Spark I found that statistics (min/max) are missing for all the columns, and hence filters are not pushed

Re: DataSet creation not working Spark 1.6.0 , populating wrong data CDH 5.7.1

2017-08-03 Thread Rabin Banerjee
; things better. Regards, Gourav. On Thu, Aug 3, 2017 at 11:09 AM, Rabin Banerjee <dev.rabin.baner...@gmail.com> wrote: > Hi All, I am trying to create a DataSet from a DataFrame, where the DataFrame has been created successfu

DataSet creation not working Spark 1.6.0 , populating wrong data CDH 5.7.1

2017-08-03 Thread Rabin Banerjee
Hi All, I am trying to create a DataSet from a DataFrame, where the DataFrame has been created successfully, and using the same bean I am trying to create the Dataset. When I run it, the DataFrame is created as expected and I am able to print its content as well, but not the Dataset's. The DataSet is hav
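A minimal sketch of the conversion in question (Spark 1.6 API, spark-shell style; the Record case class is hypothetical, and its fields must line up with the DataFrame schema, column order included, which is the usual suspect when wrong data shows up):

  import sqlContext.implicits._ // brings in encoders for case classes

  case class Record(a: Int, b: String) // hypothetical; must match the DataFrame schema
  val df = Seq((1, "x"), (2, "y")).toDF("a", "b")
  val ds = df.as[Record] // Dataset[Record]
  ds.show()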

Assign Custom receiver to a scheduler pool

2017-06-14 Thread Rabin Banerjee

Fwd: Assign Custom receiver to a scheduler pool

2017-06-13 Thread Rabin Banerjee
running on the default pool. Setting the receiver using: ssc.receiverStream(MyCustomReceiver()). Any help? Regards, Rabin Banerjee

Assign Custom receiver to a scheduler pool

2017-06-13 Thread Rabin Banerjee
running on the default pool. Setting the receiver using: ssc.receiverStream(MyCustomReceiver()). Any help? Regards, Rabin Banerjee
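For reference, a sketch of one way to attempt the pool assignment (pool name hypothetical; pools are defined in fairscheduler.xml, and setLocalProperty only affects jobs submitted from the current thread, so whether long-running receiver tasks honor it depends on the Spark version):

  import org.apache.spark.SparkConf
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.receiver.Receiver

  class MyCustomReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
    def onStart(): Unit = () // stub standing in for the thread's receiver
    def onStop(): Unit = ()
  }

  val conf = new SparkConf().setAppName("receiver-pools").set("spark.scheduler.mode", "FAIR")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.sparkContext.setLocalProperty("spark.scheduler.pool", "receiverPool") // hypothetical pool
  val stream = ssc.receiverStream(new MyCustomReceiver())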

Re: How to set NameSpace while storing from Spark to HBase using saveAsNewAPIHadoopDataSet

2016-12-19 Thread Rabin Banerjee
Thanks, it worked!! On Mon, Dec 19, 2016 at 5:55 PM, Dhaval Modi wrote: > Replace the table name with "namespace:tablename". > Regards, > Dhaval Modi. On 19 December 2016 at 13:10, Rabin Banerjee wrote: > HI All, I am trying to save data from Spark i

How to set NameSpace while storing from Spark to HBase using saveAsNewAPIHadoopDataSet

2016-12-19 Thread Rabin Banerjee
HI All, I am trying to save data from Spark into HBase using the saveAsNewAPIHadoopDataset API. Please refer to the code below. The code is working fine, but the table is getting stored in the default namespace. How do I set the namespace in the code below? wordCounts.foreachRDD ( rdd => { val conf = HBaseCon
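A sketch of the fix the reply above points to: HBase addresses tables as "namespace:table", so the namespace rides along in the table name set on the job configuration (names hypothetical):

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.mapreduce.TableOutputFormat

  val conf = HBaseConfiguration.create()
  conf.set(TableOutputFormat.OUTPUT_TABLE, "myNamespace:wordcounts") // namespace:table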

Re: Will spark cache table once even if I call read/cache on the same table multiple times

2016-11-18 Thread Rabin Banerjee
u don't change the StorageLevel. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L166 Yong. From: Rabin Banerjee. Sent: Friday, November 18, 2016 10:36 AM

Re: How to load only the data of the last partition

2016-11-18 Thread Rabin Banerjee
HI, in order to do that you can write code to list an HDFS directory first, then list its sub-directories. In this way, using custom logic, first identify the latest year/month/version, then read the Avro in that directory into a DataFrame, then add year/month/version to that DataFrame using withColumn. Regard
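A sketch of that listing logic (spark-shell style; the /data layout, the year= prefix, and the spark-avro package are assumptions, and lexicographic max only picks the latest if names are zero-padded):

  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.sql.functions.lit

  val fs = FileSystem.get(sc.hadoopConfiguration)
  val latestYear = fs.listStatus(new Path("/data")) // e.g. year=2016
    .filter(_.isDirectory).map(_.getPath.getName).max
  val df = sqlContext.read.format("com.databricks.spark.avro")
    .load(s"/data/$latestYear") // repeat the listing one level down for month/version
    .withColumn("year", lit(latestYear.stripPrefix("year=")))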

Re: sort descending with multiple columns

2016-11-18 Thread Rabin Banerjee
++Stuart: val colList = df.columns can be used. On Fri, Nov 18, 2016 at 8:03 PM, Stuart White wrote: > Is this what you're looking for? > val df = Seq((1, "A"), (1, "B"), (1, "C"), (2, "D"), (3, "E")).toDF("foo", "bar") > val colList = Seq("foo", "bar") > df.sort(colL
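Combining the two replies, a sketch for descending order on every column:

  import org.apache.spark.sql.functions.col

  val colList = df.columns.toSeq
  df.sort(colList.map(c => col(c).desc): _*)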

Will spark cache table once even if I call read/cache on the same table multiple times

2016-11-18 Thread Rabin Banerjee
g ?? Shall I maintain a global hashmap to handle that? Something like Map[String, DataFrame]? Regards, Rabin Banerjee
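A sketch of the hashmap idea from the question (spark-shell style; cache() is idempotent per DataFrame instance, so the map's job is to hand back the same instance for repeated reads of the same table):

  import scala.collection.mutable
  import org.apache.spark.sql.DataFrame

  val cached = mutable.Map.empty[String, DataFrame]
  def cachedTable(name: String): DataFrame =
    cached.getOrElseUpdate(name, sqlContext.table(name).cache())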

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Rabin Banerjee
is a good option. Can anyone share the best way to implement this in Spark? Regards, Rabin Banerjee. On Thu, Nov 3, 2016 at 6:59 PM, Koert Kuipers wrote: > Just realized you only want to keep the first element. You can do this without sorting by doing something similar to a min or max op
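For comparison with the aggregation idea in the reply, a sketch of the Window alternative (key and ts column names hypothetical):

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{col, row_number}

  val w = Window.partitionBy("key").orderBy(col("ts").desc)
  df.withColumn("rn", row_number().over(w))
    .filter(col("rn") === 1) // keep only the first row per group
    .drop("rn")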

Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Rabin Banerjee
claims it will work except in Spark 1.5.1 and 1.5.2.* *I need a bit of elaboration on how Spark handles it internally. Also, is it more efficient than using a Window function?* *Thanks in advance,* *Rabin Banerjee*

Re: Spark_JDBC_Partitions

2016-09-13 Thread Rabin Banerjee
Trust me, the only thing that can help you in your situation is the Sqoop Oracle direct connector, known as OraOop. Spark cannot do everything; you need an Oozie workflow which will trigger a Sqoop job with the Oracle direct connector to pull the data, then a Spark batch to process it. Hope it helps!! On

Re: SparkSQL DAG generation , DAG optimization , DAG execution

2016-09-10 Thread Rabin Banerjee

SparkSQL DAG generation , DAG optimization , DAG execution

2016-09-09 Thread Rabin Banerjee
HI All, I am writing and executing a Spark batch program which only uses Spark SQL, but it is taking a lot of time and finally giving a GC overhead error. Here is the program: 1. Read 3 files, one medium-sized and 2 small, and register them as DFs. 2. Fire SQL with complex aggregation and windo

Re: Spark Job trigger in production

2016-07-20 Thread Rabin Banerjee
++ crontab :) On Wed, Jul 20, 2016 at 9:07 AM, Andrew Ehrlich wrote: > Another option is Oozie with the Spark action: https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html > On Jul 18, 2016, at 12:15 AM, Jagat Singh wrote: > You can use the following options: * spark-submit from

Re: Storm HDFS bolt equivalent in Spark Streaming.

2016-07-20 Thread Rabin Banerjee
++Deepak, there is also an option to use saveAsHadoopFile & saveAsNewAPIHadoopFile, in which you can customize (the filename and many other things...) the way you want to save it. :) Happy Sparking! Regards, Rabin Banerjee. On Wed, Jul 20, 2016 at 10:01 AM, Deepak Sharma wrote: > In spark st
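A sketch of the saveAsNewAPIHadoopFile route (output path and record shape hypothetical; the output format class is where the customization hooks in):

  import org.apache.hadoop.io.{NullWritable, Text}
  import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

  val rdd = sc.parallelize(Seq("a", "b")) // hypothetical data
  rdd.map(x => (NullWritable.get(), new Text(x)))
    .saveAsNewAPIHadoopFile("/out/path", classOf[NullWritable], classOf[Text],
      classOf[TextOutputFormat[NullWritable, Text]])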

Re: Running multiple Spark Jobs on Yarn( Client mode)

2016-07-20 Thread Rabin Banerjee
Hi Vaibhav, please check your YARN configuration and make sure you have resources available. Please try creating multiple queues, and submit jobs to those queues: --queue thequeue. Regards, Rabin Banerjee. On Wed, Jul 20, 2016 at 12:05 PM, vaibhavrtk wrote: > I have a silly question: > Do

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-20 Thread Rabin Banerjee
basis. Regards, Rabin Banerjee. On Wed, Jul 20, 2016 at 12:06 PM, Yu Wei wrote: > I need to write all data received from MQTT into HBase for further processing. > It's not the final result. I also need to read the data from HBase for analysis. > Is it good

Re: write and call UDF in spark dataframe

2016-07-20 Thread Rabin Banerjee
Hi Divya, try: val df = sqlContext.sql("select from_unixtime(ts,'yyyy-MM-dd') as `ts` from mr"). Regards, Rabin. On Wed, Jul 20, 2016 at 12:44 PM, Divya Gehlot wrote: > Hi, could somebody share an example of writing and calling a UDF which converts a Unix timestamp to a datetime? > Thanks,
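And a sketch of the write-your-own-UDF route the question asks about (table and column names taken from the reply; treating ts as epoch seconds is an assumption):

  import java.text.SimpleDateFormat
  import java.util.Date

  sqlContext.udf.register("tsToDate", (ts: Long) =>
    new SimpleDateFormat("yyyy-MM-dd").format(new Date(ts * 1000L)))
  val out = sqlContext.sql("select tsToDate(ts) as ts from mr")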

Re: XLConnect in SparkR

2016-07-20 Thread Rabin Banerjee
to an MS-Access file to access via JDBC. Regards, Rabin Banerjee. On Wed, Jul 20, 2016 at 2:12 PM, Yogesh Vyas wrote: > Hi, I am trying to load and read Excel sheets from HDFS in SparkR using the XLConnect package. > Can anyone help me find out how to read xls files from H

Re: run spark apps in linux crontab

2016-07-20 Thread Rabin Banerjee
HI, please check your deploy mode and master. For example, if you want to deploy in YARN cluster mode you should use --master yarn-cluster; if you want to run in YARN client mode you should use --master yarn-client. Please note that for your case deploying with yarn-cluster will be better, as cluster mode

Re: Latest 200 messages per topic

2016-07-20 Thread Rabin Banerjee
the latest 4 ts data on get(key). Spark Streaming will get the ID from Kafka, then read the data from HBase using get(ID). This will eliminate the use of windowing in Spark Streaming. Is it good to use? Regards, Rabin Banerjee. On Tue, Jul 19, 2016 at 8:44 PM, Cody Koeninger wrote: > Unless

Re: Execute function once on each node

2016-07-19 Thread Rabin Banerjee
" I am working on a spark application that requires the ability to run a function on each node in the cluster " -- Use Apache Ignite instead of Spark. Trust me it's awesome for this use case. Regards, Rabin Banerjee On Jul 19, 2016 3:27 AM, "joshuata" wrote:

Re: Latest 200 messages per topic

2016-07-16 Thread Rabin Banerjee
Just to add, I want to read the MAX_OFFSET of a topic, then read from MAX_OFFSET-200, every time. Also I want to know: if I want to fetch a specific offset range for batch processing, is there any option for doing that? On Sat, Jul 16, 2016 at 9:08 PM, Rabin Banerjee < dev.rabin.ba
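For the batch case, a sketch using the 0.8 direct API's explicit offset ranges (broker, topic, and the maxOffset value are hypothetical; in practice maxOffset comes from Kafka's offset lookup API):

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

  val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
  val maxOffset = 10000L // hypothetical latest offset for partition 0
  val ranges = Array(OffsetRange("myTopic", 0, maxOffset - 200, maxOffset))
  val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
    sc, kafkaParams, ranges)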

Latest 200 messages per topic

2016-07-16 Thread Rabin Banerjee
HI All, I have 1000 Kafka topics, each storing messages for different devices. I want to use the direct approach for connecting to Kafka from Spark, in which I am only interested in the latest 200 messages in Kafka. How do I do that? Thanks.

Re: Is that possible to launch spark streaming application on yarn with only one machine?

2016-07-08 Thread Rabin Banerjee
Ya, I mean dump it into HDFS as a file, via yarn-cluster mode. On Jul 8, 2016 3:10 PM, "Yu Wei" wrote: > How could I dump data into a text file? Writing to HDFS or another approach? > Thanks, > Jared. From: Rabin Banerjee

Re: Is that possible to launch spark streaming application on yarn with only one machine?

2016-07-07 Thread Rabin Banerjee
// scalastyle:off println println("---") println("Time: " + time) println("---") firstNum.take(num).foreach(println) if (firstNum.length

Re: Spark streaming Kafka Direct API + Multiple consumers

2016-07-07 Thread Rabin Banerjee
It's not required. *Simplified Parallelism:* No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, and they will all read data from Kafka in parallel. So there is a one-to-one map
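The directStream call being described, sketched with the 0.8 API (broker and topic hypothetical; ssc is an existing StreamingContext):

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils

  val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("myTopic")) // one RDD partition per Kafka partition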

Re: Is that possible to launch spark streaming application on yarn with only one machine?

2016-07-06 Thread Rabin Banerjee
In yarn-cluster mode, the driver runs in the AM, so you can find the logs in that AM's log: open the ResourceManager UI and check the job and its logs, or use yarn logs -applicationId <application_id>. In yarn-client mode, the driver is the same JVM from which you are launching, so you get it in the local log. On

Re: Difference between DataFrame.write.jdbc and DataFrame.write.format("jdbc")

2016-07-06 Thread Rabin Banerjee
HI Buddy, I used both, but DataFrame.write.jdbc is old and will work if you provide a table name; it won't work if you provide custom queries. Whereas DataFrame.write.format is more generic, working perfectly with not only a table name but also custom queries. Hence I recommend using the
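A sketch contrasting the two forms (URL, table, and credentials hypothetical; note that on some 1.x releases the generic form supported reads only, with write support arriving later):

  val props = new java.util.Properties()
  props.setProperty("user", "me")
  props.setProperty("password", "secret")

  // Older, table-name-only form:
  df.write.jdbc("jdbc:mysql://host/db", "mytable", props)

  // Generic form; for reads, dbtable also accepts a "(select ...) t" subquery:
  df.write.format("jdbc")
    .option("url", "jdbc:mysql://host/db")
    .option("dbtable", "mytable")
    .option("user", "me").option("password", "secret")
    .save()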

Re: Spark Left outer Join issue using programmatic sql joins

2016-07-06 Thread Rabin Banerjee
Checked in spark-shell with Spark 1.5.0: scala> val emmpdat = sc.textFile("empfile"); emmpdat: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[23] at textFile at <console>:21 scala> case class EMP (id:Int , name : String , deptId: Int) defined class EMP scala> val empdf = emmpdat.map((f) => {val ff=f.
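To round out the truncated shell session, a hedged sketch of the DataFrame-level left outer join (the DEPT side is hypothetical):

  import sqlContext.implicits._

  case class EMP(id: Int, name: String, deptId: Int)
  case class DEPT(dId: Int, deptName: String)

  val empdf = sc.parallelize(Seq(EMP(1, "a", 10), EMP(2, "b", 99))).toDF()
  val deptdf = sc.parallelize(Seq(DEPT(10, "sales"))).toDF()
  empdf.join(deptdf, empdf("deptId") === deptdf("dId"), "left_outer").show()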

Re: It seemed JavaDStream.print() did not work when launching via yarn on a single node

2016-07-06 Thread Rabin Banerjee
You'd have to collect the RDD and then println, really. Also see DStream.print(). On Wed, Jul 6, 2016 at 1:07 PM, Rabin Banerjee wrote: > It's not working because you haven't collected the data. > Try something like > DS

Re: It seemed JavaDStream.print() did not work when launching via yarn on a single node

2016-07-06 Thread Rabin Banerjee
It's not working because you haven't collected the data. Try something like dstream.foreachRDD((rdd) => { rdd.foreach(println) }) Thanks, Rabin. On Wed, Jul 6, 2016 at 5:05 PM, Yu Wei wrote: > Hi guys, it seemed that when launching the application via yarn on a single node, JavaDStream.print
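One caveat worth adding: inside foreachRDD, println runs on the executors, so on YARN the output lands in the executor logs; collecting a sample first prints on the driver (dstream stands in for the stream from the thread):

  dstream.foreachRDD { rdd =>
    rdd.take(10).foreach(println) // take() returns results to the driver, so this prints locally
  }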

Spark parallelism with mapToPair

2015-12-14 Thread Rabin Banerjee
Hi Team, I am new to Spark Streaming. I am trying to write a Spark Streaming application where the calculation on the incoming data will be performed in "R" within the micro-batch. But I want to make wordCounts.mapToPair parallel, where wordCounts is the output of groupByKey. How can I ensure th