Re: Spark/Parquet/Statistics question

2017-11-21 Thread Rabin Banerjee
Spark is not adding any statistics metadata to Parquet files in version 1.6.x, so every file is scanned for the filter. (1 to 30).map(i => (i, i.toString)).toDF("a", "b").sort("a").coalesce(1).write.format("parquet").saveAsTable("metrics") ./parquet-meta /user/hive/warehouse/metrics/*.parquet file:

Parquet filter pushdown not working and statistics not generated for any column with Spark 1.6 CDH 5.7

2017-11-21 Thread Rabin Banerjee
Hi All, I am using CDH 5.7, which comes with Spark version 1.6.0. I am saving my data set as Parquet and then querying it. The query executes fine, but when I checked the files generated by Spark, I found that statistics (min/max) are missing for all the columns, and hence filters are not
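
A minimal sketch of the scenario described above, assuming a Spark 1.6 spark-shell (sc and sqlContext available) and a throwaway table name: it writes a sorted DataFrame as Parquet and reads it back with a filter, with Parquet filter pushdown explicitly enabled, so that the missing min/max statistics are the only remaining obstacle to skipping data.

    import sqlContext.implicits._

    // pushdown is the default in recent 1.x releases; set it explicitly for clarity
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

    // write a small, sorted data set as Parquet
    (1 to 30).map(i => (i, i.toString)).toDF("a", "b")
      .sort("a").coalesce(1)
      .write.format("parquet").saveAsTable("metrics")

    // read it back with a filter; with min/max statistics in the footers,
    // whole row groups could be skipped instead of scanning every file
    sqlContext.table("metrics").filter("a > 25").explain(true)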

Re: DataSet creation not working in Spark 1.6.0, populating wrong data (CDH 5.7.1)

2017-08-03 Thread Rabin Banerjee
inimum translation loss makes > things better. > > > Regards, > Gourav > > On Thu, Aug 3, 2017 at 11:09 AM, Rabin Banerjee < > dev.rabin.baner...@gmail.com> wrote: > >> Hi All, >> >> I am trying to create a DataSet from DataFrame, where dataframe

DataSet creation not working in Spark 1.6.0, populating wrong data (CDH 5.7.1)

2017-08-03 Thread Rabin Banerjee
Hi All, I am trying to create a DataSet from a DataFrame. The DataFrame has been created successfully, and using the same bean I am trying to create the Dataset. When I run it, the DataFrame is created as expected and I am able to print its content as well, but not the Dataset. The DataSet is
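
A minimal sketch of the DataFrame-to-Dataset conversion in Spark 1.6, assuming the spark-shell sqlContext; the Person case class and its fields are illustrative placeholders, not the poster's bean. A common cause of an empty or wrongly populated Dataset is a mismatch between the DataFrame's column names/types and the target class's fields.

    import sqlContext.implicits._

    case class Person(name: String, age: Long)

    // DataFrame built from the same shape of data
    val df = Seq(("alice", 30L), ("bob", 25L)).toDF("name", "age")
    df.show()            // prints as expected

    // strongly typed Dataset; field names and types must line up exactly
    val ds = df.as[Person]
    ds.show()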

Assign Custom receiver to a scheduler pool

2017-06-14 Thread Rabin Banerjee

Fwd: Assign Custom receiver to a scheduler pool

2017-06-13 Thread Rabin Banerjee
running on the default pool. Setting the receiver using: ssc.receiverStream(MyCustomReceiver()). Any help? Regards, Rabin Banerjee

Assign Custom receiver to a scheduler pool

2017-06-13 Thread Rabin Banerjee
running on the default pool. Setting the receiver using: ssc.receiverStream(MyCustomReceiver()). Any help? Regards, Rabin Banerjee
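
For reference, a sketch of the usual pool-assignment mechanism; it assumes fair scheduling is enabled (spark.scheduler.mode=FAIR), that a pool named "receiverPool" is defined in the allocation file, and that MyCustomReceiver is the poster's class. Whether the receiver's long-running job actually honours this thread-local property is exactly the open question in this thread.

    // set the pool for jobs submitted from this thread before creating the stream
    sc.setLocalProperty("spark.scheduler.pool", "receiverPool")
    val stream = ssc.receiverStream(MyCustomReceiver())
    stream.print()
    ssc.start()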

Re: How to set NameSpace while storing from Spark to HBase using saveAsNewAPIHadoopDataSet

2016-12-19 Thread Rabin Banerjee
Thanks, it worked!! On Mon, Dec 19, 2016 at 5:55 PM, Dhaval Modi <dhavalmod...@gmail.com> wrote: > > Replace with ":" > > Regards, > Dhaval Modi > > On 19 December 2016 at 13:10, Rabin Banerjee <dev.rabin.baner...@gmail.com > > wrote: > >&

How to set NameSpace while storing from Spark to HBase using saveAsNewAPIHadoopDataSet

2016-12-19 Thread Rabin Banerjee
Hi All, I am trying to save data from Spark into HBase using the saveAsNewAPIHadoopDataSet API. Please refer to the code below. The code is working fine, but the table is getting stored in the default namespace. How do I set the namespace in the code below? wordCounts.foreachRDD ( rdd => { val conf =
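
Per the reply above ("Replace with ':'"), the namespace is selected by prefixing it to the table name with a colon. A minimal sketch of the relevant configuration (quorum, namespace and table names are placeholders):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat

    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "zkhost:2181")
    // "<namespace>:<table>" instead of just "<table>"
    conf.set(TableOutputFormat.OUTPUT_TABLE, "my_namespace:word_counts")
    // the rest of the saveAsNewAPIHadoopDataset call stays as in the original code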

Re: Will spark cache table once even if I call read/cache on the same table multiple times

2016-11-18 Thread Rabin Banerjee
ect, as long as you don't change the StorageLevel. > > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L166 > > > > Yong > > -- > *From:* Rabin Banerjee <dev.rabin.baner...@gmail.
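
In other words, repeated reads of the same cached table reuse the single cached copy, so a global Map[String, DataFrame] should not be needed. A tiny sketch, assuming the Spark 1.6 spark-shell sqlContext and an already registered table "t":

    sqlContext.cacheTable("t")      // mark the table for caching (materialised lazily, once)
    val a = sqlContext.table("t")
    val b = sqlContext.table("t")
    a.count()                       // first action materialises the cache
    b.count()                       // served from the same cached data, no second copy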

Re: How to load only the data of the last partition

2016-11-18 Thread Rabin Banerjee
Hi, in order to do that you can write code to read/list an HDFS directory first, then list its sub-directories. In this way, using custom logic, first identify the latest year/month/version, then read the Avro files in that directory into a DataFrame, then add year/month/version to that DataFrame using withColumn.
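
A sketch of that listing approach, assuming a hypothetical layout like /data/<year>/<month>/<version>/*.avro, a Spark 1.6 shell (sc and sqlContext), zero-padded directory names, and the spark-avro package on the classpath:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.functions.lit

    val fs = FileSystem.get(sc.hadoopConfiguration)

    // pick the lexicographically largest sub-directory name
    def latest(dir: String): String =
      fs.listStatus(new Path(dir)).filter(_.isDirectory).map(_.getPath.getName).max

    val year    = latest("/data")
    val month   = latest(s"/data/$year")
    val version = latest(s"/data/$year/$month")

    val df = sqlContext.read.format("com.databricks.spark.avro")
      .load(s"/data/$year/$month/$version")
      .withColumn("year", lit(year))
      .withColumn("month", lit(month))
      .withColumn("version", lit(version))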

Re: sort descending with multiple columns

2016-11-18 Thread Rabin Banerjee
++Stuart. val colList = df.columns can be used. On Fri, Nov 18, 2016 at 8:03 PM, Stuart White wrote: > Is this what you're looking for? > > val df = Seq( > (1, "A"), > (1, "B"), > (1, "C"), > (2, "D"), > (3, "E") > ).toDF("foo", "bar") > > val colList =
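
Continuing that example, a small sketch of sorting by every column in descending order once the column list is in hand (assumes the df from the quoted snippet and org.apache.spark.sql.functions.col):

    import org.apache.spark.sql.functions.col

    val colList = df.columns
    // turn each column name into a descending sort expression
    val sorted = df.sort(colList.map(c => col(c).desc): _*)
    sorted.show()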

Will spark cache table once even if I call read/cache on the same table multiple times

2016-11-18 Thread Rabin Banerjee
g ?? Shall I maintain a global hashmap to handle that? Something like Map[String, DataFrame]?? Regards, Rabin Banerjee

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Rabin Banerjee
good option. Can anyone share the best way to implement this in Spark? Regards, Rabin Banerjee On Thu, Nov 3, 2016 at 6:59 PM, Koert Kuipers <ko...@tresata.com> wrote: > Just realized you only want to keep the first element. You can do this without > sorting by doing something simil

Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Rabin Banerjee
claims it will work except in Spark 1.5.1 and 1.5.2. I need a bit of elaboration on how Spark handles it internally. Also, is it more efficient than using a Window function? Thanks in advance, Rabin Banerjee
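
For comparison, a sketch of the Window-function alternative mentioned in the question; df, the grouping column "k" and the ordering column "ts" are hypothetical, and on Spark 1.x window functions need a HiveContext:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    // rank rows within each group by descending ts and keep only the first one
    val w = Window.partitionBy("k").orderBy(col("ts").desc)
    val firstPerGroup = df
      .withColumn("rn", row_number().over(w))
      .where(col("rn") === 1)
      .drop("rn")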

Re: Spark_JDBC_Partitions

2016-09-13 Thread Rabin Banerjee
Trust me, the only thing that can help you in your situation is the Sqoop Oracle direct connector, which is known as OraOop. Spark cannot do everything; you need an Oozie workflow which will trigger a Sqoop job with the Oracle direct connector to pull the data, then a Spark batch to process it. Hope it helps!! On

Re: SparkSQL DAG generation , DAG optimization , DAG execution

2016-09-10 Thread Rabin Banerjee

SparkSQL DAG generation , DAG optimization , DAG execution

2016-09-09 Thread Rabin Banerjee
Hi All, I am writing and executing a Spark batch program which only uses Spark SQL, but it is taking a lot of time and finally giving a GC overhead error. Here is the program: 1. Read 3 files, one medium-sized and 2 small files, and register them as DataFrames. 2. Fire SQL with complex aggregation and

Re: Spark Job trigger in production

2016-07-20 Thread Rabin Banerjee
++ crontab :) On Wed, Jul 20, 2016 at 9:07 AM, Andrew Ehrlich wrote: > Another option is Oozie with the spark action: > https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html > > On Jul 18, 2016, at 12:15 AM, Jagat Singh wrote: > > You can

Re: Storm HDFS bolt equivalent in Spark Streaming.

2016-07-20 Thread Rabin Banerjee
++Deepak, there is also an option to use saveAsHadoopFile & saveAsNewAPIHadoopFile, in which you can customize (the filename and many other things ...) the way you want to save it. :) Happy Sparking. Regards, Rabin Banerjee On Wed, Jul 20, 2016 at 10:01 AM, Deepak Sharma <deepakmc...@gmail.com
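
A sketch of that approach, where wordCounts is a hypothetical DStream[(String, Long)] and the output path and key/value classes are placeholders:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

    wordCounts.foreachRDD { (rdd, time) =>
      rdd.map { case (k, v) => (new Text(k), new Text(v.toString)) }
         .saveAsNewAPIHadoopFile(
           s"/output/wordcounts-${time.milliseconds}",   // custom per-batch path
           classOf[Text], classOf[Text],
           classOf[TextOutputFormat[Text, Text]])
    }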

Re: Running multiple Spark Jobs on Yarn( Client mode)

2016-07-20 Thread Rabin Banerjee
Hi Vaibhav, please check your YARN configuration and make sure you have available resources. Please try creating multiple queues, and submit jobs to those queues: --queue thequeue. Regards, Rabin Banerjee On Wed, Jul 20, 2016 at 12:05 PM, vaibhavrtk <vaibhav...@gmail.com> wrote: > I hav

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-20 Thread Rabin Banerjee
basis. Regards, Rabin Banerjee On Wed, Jul 20, 2016 at 12:06 PM, Yu Wei <yu20...@hotmail.com> wrote: > I need to write all data received from MQTT data into hbase for further > processing. > > They're not final result. I also need to read the data from hbase for > analysis. > &g

Re: write and call UDF in spark dataframe

2016-07-20 Thread Rabin Banerjee
Hi Divya, try: val df = sqlContext.sql("select from_unixtime(ts,'yyyy-MM-dd') as `ts` from mr") Regards, Rabin On Wed, Jul 20, 2016 at 12:44 PM, Divya Gehlot wrote: > Hi, > Could somebody share an example of writing and calling a udf which converts > a unix time stamp to
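
Since the subject asks about writing and calling a UDF, a hedged sketch of the UDF variant of the same conversion (assuming the Spark 1.6 sqlContext, that ts holds seconds since the epoch, and that toDateStr is an illustrative name):

    sqlContext.udf.register("toDateStr", (ts: Long) =>
      new java.text.SimpleDateFormat("yyyy-MM-dd")
        .format(new java.util.Date(ts * 1000L)))        // seconds -> milliseconds

    val df = sqlContext.sql("select toDateStr(ts) as ts from mr")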

Re: XLConnect in SparkR

2016-07-20 Thread Rabin Banerjee
to a MS-Access file to access via JDBC , Regards, Rabin Banerjee On Wed, Jul 20, 2016 at 2:12 PM, Yogesh Vyas <informy...@gmail.com> wrote: > Hi, > > I am trying to load and read excel sheets from HDFS in sparkR using > XLConnect package. > Can anyone help me in finding out

Re: run spark apps in linux crontab

2016-07-20 Thread Rabin Banerjee
Hi, please check your deploy mode and master. For example, if you want to deploy in YARN cluster mode you should use --master yarn-cluster; if you want to do it in YARN client mode you should use --master yarn-client. Please note that for your case deploying yarn-cluster will be better, as cluster

Re: Latest 200 messages per topic

2016-07-20 Thread Rabin Banerjee
get the latest 4 ts data on get(key). Spark Streaming will get the ID from Kafka, then read the data from HBase using get(ID). This will eliminate the use of windowing in Spark Streaming. Is it good to use? Regards, Rabin Banerjee On Tue, Jul 19, 2016 at 8:44 PM, Cody Koeninger <c...@koenin

Re: Execute function once on each node

2016-07-19 Thread Rabin Banerjee
" I am working on a spark application that requires the ability to run a function on each node in the cluster " -- Use Apache Ignite instead of Spark. Trust me it's awesome for this use case. Regards, Rabin Banerjee On Jul 19, 2016 3:27 AM, "joshuata" <joshaspl...@gm

Re: Latest 200 messages per topic

2016-07-16 Thread Rabin Banerjee
Just to add, I want to read the MAX_OFFSET of a topic, then read from MAX_OFFSET-200, every time. Also I want to know, if I want to fetch a specific offset range for batch processing, is there any option for doing that? On Sat, Jul 16, 2016 at 9:08 PM, Rabin Banerjee < dev.rabin.ba
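
For the batch part of the question, a sketch of reading an explicit offset range with the direct API (spark-streaming-kafka for the Kafka 0.8-era API; broker address, topic, partition and offsets are placeholders, and the real max offset would have to be fetched from Kafka):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // e.g. the last 200 messages of partition 0 once the max offset is known
    val maxOffset = 10000L
    val ranges = Array(OffsetRange("my-topic", 0, maxOffset - 200, maxOffset))

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, ranges)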

Latest 200 messages per topic

2016-07-16 Thread Rabin Banerjee
Hi All, I have 1000 Kafka topics, each storing messages for different devices. I want to use the direct approach for connecting to Kafka from Spark, in which I am only interested in the latest 200 messages in Kafka. How do I do that? Thanks.

Re: Is that possible to launch spark streaming application on yarn with only one machine?

2016-07-08 Thread Rabin Banerjee
Ya, I mean dump it in HDFS as a file, via yarn-cluster mode. On Jul 8, 2016 3:10 PM, "Yu Wei" <yu20...@hotmail.com> wrote: > How could I dump data into a text file? Writing to HDFS or another approach? > > > Thanks, > > Jared > -- >
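
A minimal sketch of dumping each micro-batch to HDFS as text, where msgs is a hypothetical DStream[String] and the prefix/suffix are placeholders:

    // writes one directory per batch interval: hdfs:///data/mqtt-<timestamp>.txt/part-NNNNN
    msgs.saveAsTextFiles("hdfs:///data/mqtt", "txt")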

Re: Is that possible to launch spark streaming application on yarn with only one machine?

2016-07-07 Thread Rabin Banerjee
if (firstNum.length > num) println("...") > println() > // scalastyle:on println > } > } > > Thanks, > > Jared > > > -- > *From:* Rabin Banerjee <dev.rabin.baner...@gmail.com> > *Se

Re: Spark streaming Kafka Direct API + Multiple consumers

2016-07-07 Thread Rabin Banerjee
It's not required. *Simplified Parallelism:* No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one
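
A sketch of the single direct stream being described (Kafka 0.8-style API; ssc, the broker list and the topic names are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("topicA", "topicB")

    // one stream is enough: each Kafka partition becomes one RDD partition
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)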

Re: Is that possible to launch spark streaming application on yarn with only one machine?

2016-07-06 Thread Rabin Banerjee
In yarn-cluster mode, the driver is running in the AM, so you can find the logs in that AM's log. Open the ResourceManager UI and check for the job and its logs, or use yarn logs -applicationId. In yarn-client mode, the driver is the same JVM from which you are launching, so you are getting it in the local log. On

Re: Difference between DataFrame.write.jdbc and DataFrame.write.format("jdbc")

2016-07-06 Thread Rabin Banerjee
Hi Buddy, I used both, but DataFrame.write.jdbc is old and will work if you provide a table name; it won't work if you provide custom queries. Whereas DataFrame.write.format is more generic, and works perfectly with not only a table name but also custom queries. Hence I recommend using the
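
The table-name versus custom-query distinction is easiest to see on the read side; a sketch with placeholder URL, credentials and query (not the poster's code):

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "dbuser")
    props.setProperty("password", "secret")

    // plain table name
    val t = sqlContext.read.jdbc("jdbc:postgresql://dbhost:5432/mydb", "schema.my_table", props)

    // custom query, passed as a derived table via dbtable
    val q = sqlContext.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "(select id, name from schema.my_table where active = 1) t")
      .option("user", "dbuser")
      .option("password", "secret")
      .load()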

Re: Spark Left outer Join issue using programmatic sql joins

2016-07-06 Thread Rabin Banerjee
Checked in spark-shell with Spark 1.5.0: scala> val emmpdat = sc.textFile("empfile"); emmpdat: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[23] at textFile at <console>:21 scala> case class EMP (id:Int , name : String , deptId: Int) defined class EMP scala> val empdf = emmpdat.map((f) => {val
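
A self-contained sketch of the same kind of spark-shell check; the department data, the join column and the expected output are assumptions layered on top of the EMP class above:

    import sqlContext.implicits._

    case class EMP(id: Int, name: String, deptId: Int)
    case class DEPT(deptId: Int, deptName: String)

    val empdf  = sc.parallelize(Seq(EMP(1, "a", 10), EMP(2, "b", 20), EMP(3, "c", 99))).toDF()
    val deptdf = sc.parallelize(Seq(DEPT(10, "sales"), DEPT(20, "hr"))).toDF()

    // programmatic left outer join on the department id
    val joined = empdf.join(deptdf, empdf("deptId") === deptdf("deptId"), "left_outer")
    joined.show()    // employee 3 keeps null department columns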

Re: It seemed JavaDStream.print() did not work when launching via yarn on a single node

2016-07-06 Thread Rabin Banerjee
rs, not the driver. You'd have to collect the RDD and then > println, really. Also see DStream.print() > > On Wed, Jul 6, 2016 at 1:07 PM, Rabin Banerjee > <dev.rabin.baner...@gmail.com> wrote: > > It's not working because , you haven't collected the data. > > > &
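
A sketch of the correction being made here: bring a bounded amount of data back to the driver before printing, instead of calling println inside rdd.foreach, which runs on the executors (dstream is a hypothetical DStream):

    dstream.foreachRDD { rdd =>
      // take() ships at most 10 records to the driver, where the println output is visible
      rdd.take(10).foreach(println)
    }
    // or simply: dstream.print()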

Re: It seemed JavaDStream.print() did not work when launching via yarn on a single node

2016-07-06 Thread Rabin Banerjee
It's not working because you haven't collected the data. Try something like dstream.foreachRDD((rdd) => { rdd.foreach(println) }). Thanks, Rabin On Wed, Jul 6, 2016 at 5:05 PM, Yu Wei wrote: > Hi guys, > > > It seemed that when launching application via yarn on single

Spark parallelism with mapToPair

2015-12-14 Thread Rabin Banerjee
Hi Team, I am new to Spark Streaming. I am trying to write a Spark Streaming application where the calculation on the incoming data will be performed in "R" in the micro-batching. But I want to make wordCounts.mapToPair parallel, where wordCounts is the output of groupByKey. How can I ensure