Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Chetan
Thanks for the detailed explanation. Regards, Chetan On Tue, Aug 29, 2023, 4:50 PM Mich Talebzadeh wrote: > OK, let us take a deeper look here > > ANALYSE TABLE mytable COMPUTE STATISTICS FOR COLUMNS *(c1, c2), c3* > > In above, we are *explicitly grouping columns c1 and
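
For reference, a minimal sketch of the column-statistics commands being discussed, issued through the SQL interface (table and column names are illustrative; the standard Spark SQL keyword is ANALYZE):

// Sketch only: collect table- and column-level statistics so the CBO can use them.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("analyze-stats").enableHiveSupport().getOrCreate()

// Table-level statistics (row count, size in bytes).
spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS")

// Column-level statistics for selected columns.
spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS c1, c2, c3")

// Inspect the collected column statistics.
spark.sql("DESCRIBE EXTENDED mytable c1").show(truncate = false)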

Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Chetan
has been raised for the same. > Currently, Spark invalidates the stats after data-changing commands, which would make the CBO non-functional. To update these stats, the user either needs to run the `ANALYZE TABLE` command or turn on `spark.sql.statistics.size.autoUpdate.enabled`. Both of these ways have their own drawbacks: executing the `ANALYZE TABLE` command triggers a full table scan, while the other one only updates table and partition stats and can be costly in certain cases. > The goal of this proposal is to collect stats incrementally while executing data-changing commands by utilizing the framework introduced in SPARK-21669 <https://issues.apache.org/jira/browse/SPARK-21669>. > The SPIP document has been attached along with the JIRA: https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing > Hive also supports automatic collection of statistics to keep the stats consistent. I can find multiple Spark JIRAs asking for the same: https://issues.apache.org/jira/browse/SPARK-28872 https://issues.apache.org/jira/browse/SPARK-33825 > Regards, Rakesh -- Regards, Chetan +353899475147 +919665562626
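
The two existing mechanisms described in the proposal can be exercised as follows; a minimal sketch, assuming an existing SparkSession named spark:

// Option 1: full recomputation after a data-changing command (triggers a full table scan).
spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS")

// Option 2: automatic size-only updates; keeps only table/partition size stats current.
spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", "true")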

Re: Suggestion on Join Approach with Spark

2019-05-15 Thread Chetan Khatri
(i.e. this list) is for discussions about the development of > Spark itself. > > On Wed, May 15, 2019 at 1:50 PM Chetan Khatri > wrote: > >> Any one help me, I am confused. :( >> >> On Wed, May 15, 2019 at 7:28 PM Chetan Khatri < >> chetan.opensou...@gmail.com>

Re: Suggestion on Join Approach with Spark

2019-05-15 Thread Chetan Khatri
Can anyone help me? I am confused. :( On Wed, May 15, 2019 at 7:28 PM Chetan Khatri wrote: > Hello Spark Developers, > > I have a question on Spark Join I am doing. > > I have a full load data from RDBMS and storing at HDFS let's say, > > val historyDF = spark.read.parque

Suggestion on Join Approach with Spark

2019-05-15 Thread Chetan Khatri
Hello Spark Developers, I have a question on a Spark join I am doing. I have a full load of data from an RDBMS stored at HDFS, let's say, val historyDF = spark.read.parquet(*"/home/test/transaction-line-item"*) and I am getting changed data at a separate HDFS path, let's say; val deltaDF = spark.read
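
One common sketch for reconciling the full load with the delta (the delta path and the join key "transaction_id" are assumptions, not from the thread): keep every delta row, plus only those history rows that have no newer version.

// Assumes an existing SparkSession named spark.
val historyDF = spark.read.parquet("/home/test/transaction-line-item")
val deltaDF   = spark.read.parquet("/home/test/transaction-line-item-delta")   // assumed path

// History rows whose key does not appear in the delta...
val unchangedDF = historyDF.join(deltaDF, Seq("transaction_id"), "left_anti")
// ...plus all delta rows gives the merged current view (unionByName needs Spark 2.3+).
val mergedDF = unchangedDF.unionByName(deltaDF)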

Re: Need help for Delta.io

2019-05-10 Thread Chetan Khatri
Any thoughts.. Please On Fri, May 10, 2019 at 2:22 AM Chetan Khatri wrote: > Hello All, > > I need your help / suggestions, > > I am using Spark 2.3.1 with HDP 2.6.1 Distribution, I will tell my use > case so you get it where people are trying to use Delta. > My use case

How to parallelize JDBC Read in Spark

2018-09-06 Thread Chetan Khatri
Hello Dev Users, I am struggling to parallelize a JDBC read in Spark: it is using only 1-2 tasks to read the data and taking a long time. Ex. val invoiceLineItemDF = ((spark.read.jdbc(url = t360jdbcURL, table = invoiceLineItemQuery, columnName = "INVOICE_LINE_ITEM_ID", lowerBound =
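
For context, the partitioned-read overload being used looks roughly like this; the bounds, partition count and credentials are illustrative, and t360jdbcURL / invoiceLineItemQuery are assumed to be defined as in the original mail. Spark splits the [lowerBound, upperBound] range of the column into numPartitions parallel queries.

import java.util.Properties

val connProps = new Properties()
connProps.setProperty("user", "dbuser")            // illustrative credentials
connProps.setProperty("password", "dbpass")
connProps.setProperty("fetchsize", "10000")        // a larger fetch size often helps JDBC throughput

val invoiceLineItemDF = spark.read.jdbc(
  url = t360jdbcURL,
  table = invoiceLineItemQuery,                    // a subquery must be aliased, e.g. "(SELECT ...) t"
  columnName = "INVOICE_LINE_ITEM_ID",             // numeric column used to split the range
  lowerBound = 1L,                                 // illustrative minimum of the split column
  upperBound = 100000000L,                         // illustrative maximum of the split column
  numPartitions = 32,                              // number of concurrent JDBC reads
  connectionProperties = connProps
)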

Re: Select top (100) percent equivalent in spark

2018-09-05 Thread Chetan Khatri
Sean, thank you. Do you think tempDF.orderBy($"invoice_id".desc).limit(100) would give the same result? I think so. Thanks On Wed, Sep 5, 2018 at 12:58 AM Sean Owen wrote: > Sort and take head(n)? > > On Tue, Sep 4, 2018 at 12:07 PM Chetan Khatri > wrote: >
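
The two variants under discussion, side by side (a sketch; tempDF and invoice_id are taken from the thread):

import org.apache.spark.sql.functions.col

// Stays a distributed DataFrame of at most 100 rows.
val top100DF = tempDF.orderBy(col("invoice_id").desc).limit(100)

// Sort and take head(n): collects at most 100 rows to the driver as Array[Row].
val top100Rows = tempDF.orderBy(col("invoice_id").desc).head(100)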

Re: Select top (100) percent equivalent in spark

2018-09-04 Thread Chetan Khatri
ink doing a order and limit would be equivalent after > optimizations. > > On Tue, Sep 4, 2018 at 2:28 PM Sean Owen wrote: > >> Sort and take head(n)? >> >> On Tue, Sep 4, 2018 at 12:07 PM Chetan Khatri < >> chetan.opensou...@gmail.com> wrote: >> >>> Dear Spark dev, anything equivalent in spark ? >>> >>

Select top (100) percent equivalent in spark

2018-09-04 Thread Chetan Khatri
Dear Spark dev, is there anything equivalent in Spark?

Re: Reading 20 GB of log files from Directory - Out of Memory Error

2018-08-25 Thread Chetan Khatri
ap { x => x.replaceAll("""\n""", " ")} mappedRDD.collect() 2. val textlogRDD = sc.textFile("file:///usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-hduser-org.apache.spark.deploy.master.Master-1-chetan-ThinkPad-E460.out", 200) val textMappedRDD

Reading 20 GB of log files from Directory - Out of Memory Error

2018-08-25 Thread Chetan Khatri
mappedRDD = logRDD.flatMap { x => x._2.split("[^A-Za-z']+") }.map { x => x.replaceAll("""\n""", " ")} *2. Individual files can be processed with below approach* val textlogRDD = sc.textFile("file:///usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/lo
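
A sketch of the same pipeline that keeps the result out of driver memory (the collect() shown in the reply above is the usual cause of driver OOMs at this scale); the glob and output path are illustrative:

// Assumes an existing SparkContext named sc.
val logsRDD = sc.textFile(
  "file:///usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*.out",   // illustrative glob over the log directory
  minPartitions = 200
)

val tokensRDD = logsRDD
  .flatMap(_.split("[^A-Za-z']+"))          // tokenize, as in the snippet above
  .map(_.replaceAll("""\n""", " "))

// Write results to storage instead of collect()-ing them to the driver.
tokensRDD.saveAsTextFile("hdfs:///tmp/log-tokens")                   // illustrative output path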

Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-15 Thread Chetan Khatri
Hello Jayant, Thanks for great OSS Contribution :) On Thu, Jul 12, 2018 at 1:36 PM, Jayant Shekhar wrote: > Hello Chetan, > > Sorry missed replying earlier. You can find some sample code here : > > http://sparkflows.readthedocs.io/en/latest/user-guide/ > python/pipe-pytho

Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-09 Thread Chetan Khatri
Shekhar wrote: > Hello Chetan, > > We have currently done it with .pipe(.py) as Prem suggested. > > That passes the RDD as CSV strings to the python script. The python script > can either process it line by line, create the result and return it back. > Or create things like

Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-05 Thread Chetan Khatri
Prem, sure. Thanks for the suggestion. On Wed, Jul 4, 2018 at 8:38 PM, Prem Sure wrote: > try .pipe(.py) on RDD > > Thanks, > Prem > > On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri > wrote: > >> Can someone please suggest me , thanks >> >> On Tue 3 J

Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-04 Thread Chetan Khatri
Can someone please suggest me , thanks On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri, wrote: > Hello Dear Spark User / Dev, > > I would like to pass Python user defined function to Spark Job developed > using Scala and return value of that function would be returned to DF / > Datas

Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-03 Thread Chetan Khatri
Hello Dear Spark User / Dev, I would like to pass a Python user-defined function to a Spark job developed using Scala, and the return value of that function would be returned to the DF / Dataset API. Can someone please guide me on which would be the best approach to do this? The Python function would be mostly transfor
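
A sketch of the .pipe() route suggested in the replies above: serialize each row, hand it to an external Python script over stdin/stdout, and read the result back. The script path, the use of JSON as the exchange format, and the input DataFrame df are assumptions for illustration.

// Assumes an existing SparkSession named spark and an input DataFrame df.
import spark.implicits._

val inputRDD = df.toJSON.rdd                                        // one JSON string per row
val pipedRDD = inputRDD.pipe("python /opt/jobs/transform.py")       // script must exist on every executor

// Parse the transformed JSON lines back into a DataFrame (needs Spark 2.2+ for json(Dataset[String])).
val resultDF = spark.read.json(pipedRDD.toDS())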

Re: Spark Writing to parquet directory : java.io.IOException: Disk quota exceeded

2017-11-22 Thread Chetan Khatri
Anybody reply on this ? On Tue, Nov 21, 2017 at 3:36 PM, Chetan Khatri wrote: > > Hello Spark Users, > > I am getting below error, when i am trying to write dataset to parquet > location. I have enough disk space available. Last time i was facing same > kind of error whic

Divide Spark Dataframe to parts by timestamp

2017-11-12 Thread Chetan Khatri
Hello All, I have a Spark Dataframe with timestamps from 2015-10-07 19:36:59 to 2017-01-01 18:53:23. If I want to split this Dataframe into 3 parts, I wrote the below code to split it. Can anyone please confirm whether this is the correct approach? val finalDF1 = sampleDF.where(sampleDF.col("timestamp_col").
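
One sketch of the three-way split, using half-open boundaries so no row lands in two parts (the boundary timestamps are illustrative, not computed from the data):

import org.apache.spark.sql.functions.col

val part1DF = sampleDF.where(col("timestamp_col") <  "2016-01-01 00:00:00")
val part2DF = sampleDF.where(col("timestamp_col") >= "2016-01-01 00:00:00" &&
                             col("timestamp_col") <  "2016-07-01 00:00:00")
val part3DF = sampleDF.where(col("timestamp_col") >= "2016-07-01 00:00:00")

// Sanity check: the three parts should add back up to the original row count.
assert(part1DF.count() + part2DF.count() + part3DF.count() == sampleDF.count())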

Re: Joining 3 tables with 17 billions records

2017-11-02 Thread Chetan Khatri
Is this just a one time thing or something regular? > If it is a one time thing then I would tend more towards putting each > table in HDFS (parquet or ORC) and then join them. > What is the Hive and Spark version? > > Best regards > > > On 2. Nov 2017, at 20:57, Chetan Khatr

Joining 3 tables with 17 billions records

2017-11-02 Thread Chetan Khatri
file creation on already repartitioned DF. 10. Finally store to an external Hive table partitioned by skey. Please share any suggestions or resources you come across to optimize this. Thanks Chetan

Apache Spark Streaming / Spark SQL Job logs

2017-08-30 Thread Chetan Khatri
Hey Spark Dev, Can anyone suggest sample Spark Streaming / Spark SQL job logs to download? I want to play with log analytics. Thanks

Re: Repartitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-03 Thread Chetan Khatri
stly most people > find this number for their job "experimentally" (e.g. they try a few > different things). > > On Wed, Aug 2, 2017 at 1:52 PM, Chetan Khatri > wrote: > >> Ryan, >> Thank you for reply. >> >> For 2 TB of Data what should be the value of

Re: Repartitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Chetan Khatri
=2048 spark.shuffle.io.preferDirectBufs=false On Wed, Aug 2, 2017 at 10:43 PM, Ryan Blue wrote: > Chetan, > > When you're writing to a partitioned table, you want to use a shuffle to > avoid the situation where each task has to write to every partition. You > can do t
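
One concrete form of the shuffle Ryan describes, as a sketch (the table name, partition column and input DataFrame are illustrative):

import org.apache.spark.sql.functions.col

// Shuffle so each task holds rows for only a few Hive partitions, keeping files per partition small.
val repartitionedDF = inputDF.repartition(col("date_col"))

repartitionedDF.write
  .mode("overwrite")
  .partitionBy("date_col")
  .saveAsTable("db.target_table")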

Re: Repartitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Chetan Khatri
Can anyone please guide me with the above issue. On Wed, Aug 2, 2017 at 6:28 PM, Chetan Khatri wrote: > Hello Spark Users, > > I have Hbase table reading and writing to Hive managed table where i > applied partitioning by date column which worked fine but it has generate > more num

Re: Support Dynamic Partition Inserts params with SET command in Spark 2.0.1

2017-07-28 Thread Chetan Khatri
I think it will be the same, but let me try that. FYR - https://issues.apache.org/jira/browse/SPARK-19881 On Fri, Jul 28, 2017 at 4:44 PM, ayan guha wrote: > Try running spark.sql("set yourconf=val") > > On Fri, 28 Jul 2017 at 8:51 pm, Chetan Khatri > wrote: > >> Jo

Re: Support Dynamic Partition Inserts params with SET command in Spark 2.0.1

2017-07-28 Thread Chetan Khatri
Jörn, both are the same. On Fri, Jul 28, 2017 at 4:18 PM, Jörn Franke wrote: > Try sparksession.conf().set > > On 28. Jul 2017, at 12:19, Chetan Khatri > wrote: > > Hey Dev / User, > > I am working with Spark 2.0.1 and with dynamic partitioning with H

Support Dynamic Partition Inserts params with SET command in Spark 2.0.1

2017-07-28 Thread Chetan Khatri
Hey Dev / User, I am working with Spark 2.0.1 with dynamic partitioning into Hive and facing the below issue: org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions created is 1344, which is more than 1000. To solve this try to set hive.exec.max.dynamic.partitions to at least 1
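
The per-session settings under discussion can be attempted like this (a sketch; the limit values are illustrative and must exceed the 1344 partitions reported above, and whether Spark 2.0.1 actually forwards them to the Hive client is exactly what SPARK-19881 tracks):

// Assumes an existing Hive-enabled SparkSession named spark.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("SET hive.exec.max.dynamic.partitions=2000")
spark.sql("SET hive.exec.max.dynamic.partitions.pernode=2000")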

Flatten JSON to multiple columns in Spark

2017-07-17 Thread Chetan Khatri
Hello Spark Devs, Can you please guide me on how to flatten JSON to multiple columns in Spark? *Example:* Sr No Title ISBN Info 1 Calculus Theory 1234567890 [{"cert":[{ "authSbmtr":"009415da-c8cd-418d-869e-0a19601d79fa", 009415da-c8cd-418d-869e-0a19601d79fa "certUUID":"03ea5a1a-5530-4fa3-8871-9d1
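
One common sketch for flattening, assuming the JSON sits in the string column Info: parse it with from_json (Spark 2.1+) against a schema, then promote the struct fields to top-level columns. The schema below is a simplified illustration, not the actual document structure; the cert array in the real data would additionally need explode.

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// Illustrative, simplified schema for the JSON held in the Info column.
val infoSchema = new StructType()
  .add("authSbmtr", StringType)
  .add("certUUID", StringType)

// Assumes a DataFrame booksDF with columns "Sr No", "Title", "ISBN", "Info".
val flattenedDF = booksDF
  .withColumn("info_struct", from_json(col("Info"), infoSchema))
  .select(
    col("Sr No"),
    col("Title"),
    col("ISBN"),
    col("info_struct.authSbmtr").alias("authSbmtr"),    // each nested field becomes its own column
    col("info_struct.certUUID").alias("certUUID")
  )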

Re: Issues: Generate JSON with null values in Spark 2.0.x

2017-03-20 Thread Chetan Khatri
Exactly. On Sat, Mar 11, 2017 at 1:35 PM, Dongjin Lee wrote: > Hello Chetan, > > Could you post some code? If I understood correctly, you are trying to > save JSON like: > > { > "first_name": "Dongjin", > "last_name: null > } > > n

Issues: Generate JSON with null values in Spark 2.0.x

2017-03-07 Thread Chetan Khatri
Hello Dev / Users, I am working on PySpark code migration to Scala. With Python, iterating Spark with a dictionary and generating JSON with null is possible with json.dumps(), which will be converted to SparkSQL[Row], but in Scala how can we generate JSON with null values as a Dataframe? Thanks.
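
For reference, a sketch of the behavior in question, assuming an existing DataFrame peopleDF with nullable columns: in Spark 2.0.x, toJSON / to_json drop null fields, and the option shown below for keeping them comes from later Spark releases (3.0+), so treat it as an assumption rather than something available on 2.0.x.

import org.apache.spark.sql.functions.{col, struct, to_json}

// Default behavior: null fields are omitted from the generated JSON.
val defaultJsonDF = peopleDF.select(to_json(struct(peopleDF.columns.map(col): _*)).as("json"))

// Later Spark versions expose a generator option to keep null fields (assumed option name; not on 2.0.x).
val jsonWithNullsDF = peopleDF.select(
  to_json(struct(peopleDF.columns.map(col): _*), Map("ignoreNullFields" -> "false")).as("json")
)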

Re: Spark Job Performance monitoring approaches

2017-02-15 Thread Chetan Khatri
> github.com/SparkMonitor/varOne https://github.com/groupon/sparklint > > Chetan Khatri schrieb am Do., 16. Feb. 2017 > um 06:15 Uhr: > >> Hello All, >> >> What would be the best approches to monitor Spark Performance, is there >> any tools for Spark Job Performance monitoring ? >> >> Thanks. >> >

Spark Job Performance monitoring approaches

2017-02-15 Thread Chetan Khatri
Hello All, What would be the best approaches to monitor Spark performance? Are there any tools for Spark job performance monitoring? Thanks.

Re: Update Public Documentation - SparkSession instead of SparkContext

2017-02-15 Thread Chetan Khatri
d, Feb 15, 2017, 06:44 Chetan Khatri > wrote: > >> Hello Spark Dev Team, >> >> I was working with my team having most of the confusion that why your >> public documentation is not updated with SparkSession if SparkSession is >> the ongoing extension and best practice instead of creating sparkcontext. >> >> Thanks. >> >

Update Public Documentation - SparkSession instead of SparkContext

2017-02-14 Thread Chetan Khatri
Hello Spark Dev Team, I was working with my team, and much of the confusion was about why the public documentation is not updated to SparkSession, given that SparkSession is the ongoing extension and best practice instead of creating a SparkContext. Thanks.

Re: Error Saving Dataframe to Hive with Spark 2.0.0

2017-01-29 Thread Chetan Khatri
> since. > > Jacek > > > On 29 Jan 2017 9:24 a.m., "Chetan Khatri" > wrote: > > Hello Spark Users, > > I am getting error while saving Spark Dataframe to Hive Table: > Hive 1.2.1 > Spark 2.0.0 > Local environment. > Note: Job is getting execut

Re: HBaseContext with Spark

2017-01-27 Thread Chetan Khatri
TotalOrderPartitioner (sorts data, producing a large number of region files) Import HFiles into HBase HBase can merge files if necessary On Sat, Jan 28, 2017 at 11:32 AM, Chetan Khatri wrote: > @Ted, I dont think so. > > On Thu, Jan 26, 2017 at 6:35 AM, Ted Yu wrote: > >> Does t

Re: HBaseContext with Spark

2017-01-27 Thread Chetan Khatri
@Ted, I don't think so. On Thu, Jan 26, 2017 at 6:35 AM, Ted Yu wrote: > Does the storage handler provide bulk load capability ? > > Cheers > > On Jan 25, 2017, at 3:39 AM, Amrit Jangid > wrote: > > Hi chetan, > > If you just need HBase Data into Hive, You can

Re: HBaseContext with Spark

2017-01-25 Thread Chetan Khatri
Yu wrote: > Though no hbase release has the hbase-spark module, you can find the > backport patch on HBASE-14160 (for Spark 1.6) > > You can build the hbase-spark module yourself. > > Cheers > > On Wed, Jan 25, 2017 at 3:32 AM, Chetan Khatri < > chetan.opensou...@gmai

HBaseContext with Spark

2017-01-25 Thread Chetan Khatri
Hello Spark Community Folks, Currently I am using HBase 1.2.4 and Hive 1.2.1, and I am looking for bulk load from HBase to Hive. I have seen a couple of good examples at the HBase GitHub repo: https://github.com/apache/hbase/tree/master/hbase-spark If I would like to use HBaseContext with HBase 1.2.4, how

Re: Weird experience Hive with Spark Transformations

2017-01-17 Thread Chetan Khatri
Hive jobs hive.downloaded.resources.dir $HIVE_HOME/iotmp Temporary local directory for added resources in the remote file system. On Tue, Jan 17, 2017 at 10:01 PM, Dongjoon Hyun wrote: > Hi, Chetan. > > Did you copy your `hive-site.xml` into Spark conf directory? For example, > > cp /usr/local/hive/conf

Weird experience Hive with Spark Transformations

2017-01-16 Thread Chetan Khatri
Hello, I have the following services configured and installed successfully: Hadoop 2.7.x Spark 2.0.x HBase 1.2.4 Hive 1.2.1 *Installation Directories:* /usr/local/hadoop /usr/local/spark /usr/local/hbase *Hive Environment variables:* #HIVE VARIABLES START export HIVE_HOME=/usr/local/hive expo

Re: About saving DataFrame to Hive 1.2.1 with Spark 2.0.1

2017-01-16 Thread Chetan Khatri
chema.struct); stdDf: org.apache.spark.sql.DataFrame = [stid: string, name: string ... 3 more fields] Thanks. On Tue, Jan 17, 2017 at 12:48 AM, Chetan Khatri wrote: > Hello Community, > > I am struggling to save Dataframe to Hive Table, > > Versions: > > Hive 1.2.

About saving DataFrame to Hive 1.2.1 with Spark 2.0.1

2017-01-16 Thread Chetan Khatri
Hello Community, I am struggling to save a Dataframe to a Hive table. Versions: Hive 1.2.1 Spark 2.0.1 *Working code:* /* @Author: Chetan Khatri Description: This Scala script has been written for the HBase to Hive module, which reads a table from HBase and dumps it out to Hive

Re: Error at starting Phoenix shell with HBase

2017-01-15 Thread Chetan Khatri
h. > > I would check the RegionServer logs -- I'm guessing that it never started > correctly or failed. The error message is saying that certain regions in > the system were never assigned to a RegionServer which only happens in > exceptional cases. > > Chetan Khatri wrote

Re: Approach: Incremental data load from HBASE

2017-01-06 Thread Chetan Khatri
Ayan, thanks. Correct, I am not thinking in RDBMS terms; I am wearing NoSQL glasses! On Fri, Jan 6, 2017 at 3:23 PM, ayan guha wrote: > IMHO you should not "think" HBase in RDBMS terms, but you can use > ColumnFilters to filter out new records > > On Fri, Jan 6, 2017 at

Re: Approach: Incremental data load from HBASE

2017-01-06 Thread Chetan Khatri
for me, or other alternative approaches can be done through reading Hbase tables in RDD and saving RDD to Hive. Thanks. On Thu, Jan 5, 2017 at 2:02 AM, ayan guha wrote: > Hi Chetan > > What do you mean by incremental load from HBase? There is a timestamp > marker for each cell, but no

Re: Approach: Incremental data load from HBASE

2017-01-04 Thread Chetan Khatri
using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the > data into hbase. > > For your use case, the producer needs to find rows where the flag is 0 or > 1. > After such rows are obtained, it is up to you how the result of processing > is delivered to hbase. > > Cheers > > On Wed, De

Re: Dependency Injection and Microservice development with Spark

2017-01-04 Thread Chetan Khatri
tlS, https://freebusy.io/la...@mapflat.com > > > On Fri, Dec 23, 2016 at 11:56 AM, Chetan Khatri > wrote: > > Hello Community, > > > > Current approach I am using for Spark Job Development with Scala + SBT > and > > Uber Jar with yml properties file to pass config

Re: Apache Hive with Spark Configuration

2017-01-04 Thread Chetan Khatri
is hive 1.2.1 . Thanks. On Wed, Jan 4, 2017 at 2:02 AM, Ryan Blue wrote: > Chetan, > > Spark is currently using Hive 1.2.1 to interact with the Metastore. Using > that version for Hive is going to be the most reliable, but the metastore > API doesn't change very often a

Re: Error: at sqlContext.createDataFrame with RDD and Schema

2016-12-28 Thread Chetan Khatri
, unable to check with error that what exactly is. Thanks., On Wed, Dec 28, 2016 at 9:00 PM, Chetan Khatri wrote: > Hello Spark Community, > > I am reading HBase table from Spark and getting RDD but now i wants to > convert RDD of Spark Rows and want to convert to DF. >

Error: at sqlContext.createDataFrame with RDD and Schema

2016-12-28 Thread Chetan Khatri
k.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) stdDf: org.apache.spark.sql.DataFrame = [Rowid: string, maths: string ... 4 more fields] What would be the resolution? Thanks, Chetan
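
For reference, a minimal working sketch of the createDataFrame(RDD[Row], schema) path being used; the usual gotcha is that every Row must match the declared schema exactly (arity, order, and types), and mismatches only surface when an action runs. Column names are taken from the output above; the data is illustrative.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumes an existing SparkSession named spark.
val schema = StructType(Seq(
  StructField("Rowid", StringType, nullable = true),
  StructField("maths", StringType, nullable = true)
))

// Each Row must line up with the schema: same number of fields, compatible types, same order.
val rowRDD = spark.sparkContext.parallelize(Seq(Row("r1", "88"), Row("r2", "92")))

val stdDf = spark.createDataFrame(rowRDD, schema)
stdDf.show()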

Apache Hive with Spark Configuration

2016-12-28 Thread Chetan Khatri
Hello Users / Developers, I am using Hive 2.0.1 with MySQL as a Metastore. Can you tell me which version is more compatible with Spark 2.0.2? Thanks

Re: Negative number of active tasks

2016-12-23 Thread Chetan Khatri
Could you share pseudocode for the same? Cheers! C Khatri. On Fri, Dec 23, 2016 at 4:33 PM, Andy Dang wrote: > Hi all, > > Today I hit a weird bug in Spark 2.0.2 (vanilla Spark) - the executor tab > shows negative number of active tasks. > > I have about 25 jobs, each with 20k tasks so the nu

Re: Approach: Incremental data load from HBASE

2016-12-23 Thread Chetan Khatri
> After such rows are obtained, it is up to you how the result of processing > is delivered to hbase. > > Cheers > > On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri < > chetan.opensou...@gmail.com> wrote: > >> Ok, Sure will ask. >> >> But what would be

Re: Best Practice for Spark Job Jar Generation

2016-12-23 Thread Chetan Khatri
dy > > On Fri, Dec 23, 2016 at 6:00 PM, Chetan Khatri < > chetan.opensou...@gmail.com> wrote: > >> Andy, Thanks for reply. >> >> If we download all the dependencies at separate location and link with >> spark job jar on spark cluster, is it best way to execute

Re: Best Practice for Spark Job Jar Generation

2016-12-23 Thread Chetan Khatri
us). > > --- > Regards, > Andy > > On Fri, Dec 23, 2016 at 6:44 AM, Chetan Khatri < > chetan.opensou...@gmail.com> wrote: > >> Hello Spark Community, >> >> For Spark Job Creation I use SBT Assembly to build Uber("Super") Jar and >>

Dependency Injection and Microservice development with Spark

2016-12-23 Thread Chetan Khatri
standard approach. Thanks Chetan

Best Practice for Spark Job Jar Generation

2016-12-22 Thread Chetan Khatri
Hello Spark Community, For Spark Job Creation I use SBT Assembly to build Uber("Super") Jar and then submit to spark-submit. Example, bin/spark-submit --class hbase.spark.chetan.com.SparkHbaseJob /home/chetan/hbase-spark/SparkMSAPoc-assembly-1.0.jar But other folks has debate wit
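
The usual build setup behind this workflow, as a sketch (sbt-assembly; versions and merge rules are illustrative). Marking Spark itself as "provided" keeps the uber jar small, since the cluster already supplies Spark at runtime.

// build.sbt (sketch); project/plugins.sbt would add the sbt-assembly plugin, e.g.
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")     // version illustrative
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",    // version illustrative
  "org.apache.spark" %% "spark-sql"  % "2.0.2" % "provided"
)

// Resolve the duplicate-file conflicts that typically break assembly of Spark jobs.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}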

Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Chetan Khatri
> > > On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri < > chetan.opensou...@gmail.com> wrote: > >> Hello Guys, >> >> I would like to understand different approaches for distributed incremental >> load from HBase. Is there any *tool / incubator tool* which

Approach: Incremental data load from HBASE

2016-12-21 Thread Chetan Khatri
batch where flag is 0 or 1. I am looking for a best practice approach with any distributed tool. Thanks. - Chetan Khatri