Fwd: iceberg queries

2023-06-15 Thread Gaurav Agarwal
Hi Team,

Sample Merge query:

df.createOrReplaceTempView("source")

MERGE INTO iceberg_hive_cat.iceberg_poc_db.iceberg_tab target
USING (SELECT * FROM source)
ON target.col1 = source.col1  -- col1 is my bucket column
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

The source dataset is a temporary view that currently contains 1.5 million
records and may grow to 20 million rows in the future; its id values fall into
16 buckets. The target Iceberg table also has 16 buckets. The source rows should
update the target where the id matches and be inserted where it does not.

I have 1700 columns in my table.

The Spark dataset is using default partitioning. Do we need to repartition
(bucket) the Spark dataset on the bucket column as well?

Let me know if you need any further details.

Currently the MERGE fails with an executor OOME.
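A minimal sketch (not a definitive fix) of pre-shuffling the source on the join
key before running the MERGE; the column name col1 and the count of 16 come from
this thread, and note that Spark's hash partitioning is not identical to
Iceberg's bucket transform, so this only reduces shuffle and per-task memory
pressure rather than aligning writes exactly:

import org.apache.spark.sql.functions.col

// Repartition the source on the join key before exposing it to MERGE INTO.
// 16 matches the target table's bucket count here; tune the number as needed.
val prepared = df.repartition(16, col("col1"))
prepared.createOrReplaceTempView("source")

spark.sql("""
  MERGE INTO iceberg_hive_cat.iceberg_poc_db.iceberg_tab AS target
  USING source
  ON target.col1 = source.col1
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")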

Regards
Gaurav



Spark using iceberg

2023-06-15 Thread Gaurav Agarwal
HI

I am using Spark with Iceberg to update a table that has 1700 columns.
We are loading 0.6 million rows from Parquet files (in the future it will be 16
million rows) and trying to update the data in a table that has 16 buckets.
We are using Spark's default partitioner and do not do any repartitioning of the
dataset on the bucketing column.
One of the executors fails with an OOME, recovers, and then fails again when we
use Iceberg's MERGE INTO strategy:

MERGE INTO target USING (SELECT * FROM source) AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

However, a blind append works fine.

Questions:

How do we find out what the issue is? We are running Spark on an EKS cluster,
and when an executor hits the OOME it dies and its logs are gone, so we are
unable to see them.

Do we need to partition the dataset on that column, and if so, at load time or
after the data is loaded?

Any help understanding this would be appreciated.
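A hedged sketch of settings that can help keep diagnostics around when executors
die on Kubernetes/EKS. The config keys are standard Spark settings; the values
(event-log path, overhead size) are illustrative assumptions, not a tuned
recommendation for this workload:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("iceberg-merge-debug")
  // Persist the event log so the History Server still shows the failed stages
  // after the executor pod is gone.
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "s3a://my-bucket/spark-events") // placeholder path
  // Keep terminated executor pods around so `kubectl logs <pod> --previous`
  // can still return their output.
  .config("spark.kubernetes.executor.deleteOnTermination", "false")
  // Off-heap headroom often matters for very wide rows (1700 columns).
  .config("spark.executor.memoryOverhead", "4g")
  .getOrCreate()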


Re: OFFICIAL USA REPORT TODAY India Most Dangerous : USA Religious Freedom Report out TODAY

2020-04-29 Thread Gaurav Agarwal
Spark moderators, please suppress this user. This is unnecessary spam, or has
the Apache Spark account been hacked?

On Wed, Apr 29, 2020, 11:56 AM Zahid Amin  wrote:

> How can it be rumours   ?
> Of course you want  to suppress me.
> Suppress USA official Report out TODAY .
>
> > Sent: Wednesday, April 29, 2020 at 8:17 AM
> > From: "Deepak Sharma" 
> > To: "Zahid Amin" 
> > Cc: user@spark.apache.org
> > Subject: Re: India Most Dangerous : USA Religious Freedom Report
> >
> > Can someone block this email ?
> > He is spreading rumours and spamming.
> >
> > On Wed, 29 Apr 2020 at 11:46 AM, Zahid Amin  wrote:
> >
> > > USA report states that India is now the most dangerous country for
> Ethnic
> > > Minorities.
> > >
> > > Remember Martin Luther King.
> > >
> > >
> > >
> https://www.mail.com/int/news/us/9880960-religious-freedom-watchdog-pitches-adding-india-to.html#.1258-stage-set1-3
> > >
> > > It began with Kasmir and still in locked down Since August 2019.
> > >
> > > The Hindutwa  want to eradicate all minorities .
> > > The Apache foundation is infested with these Hindutwa purists and their
> > > sympathisers.
> > > Making Sure all Muslims are kept away from IT industry. Using you to
> help
> > > them.
> > >
> > > Those people in IT you deal with are purists yet you are not welcome
> India.
> > >
> > > The recognition of  Hindutwa led to the creation of Pakistan in 1947.
> > >
> > > Evil propers when good men do nothing.
> > > The genocide is not coming . It is Here.
> > > I ask you please think and act.
> > > Protect the Muslims from Indian Continent.
> > >
> > > -
> > > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> > >
> > > --
> > Thanks
> > Deepak
> > www.bigdatabig.com
> > www.keosha.net
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Enrichment with static tables

2017-02-16 Thread Gaurav Agarwal
Thanks, that worked for me. Previously I was using the wrong join; that is the
reason it did not work for me.

Thanks

On Feb 16, 2017 01:20, "Sam Elamin" <hussam.ela...@gmail.com> wrote:

> You can do a join or a union to combine all the dataframes to one fat
> dataframe
>
> or do a select on the columns you want to produce your transformed
> dataframe
>
> Not sure if I understand the question, though. If the goal is just an
> end-state transformed dataframe, that can easily be done.
>
>
> Regards
> Sam
>
> On Wed, Feb 15, 2017 at 6:34 PM, Gaurav Agarwal <gaurav130...@gmail.com>
> wrote:
>
>> Hello
>>
>> We want to enrich our Spark RDD, which is loaded with multiple columns and
>> rows, with 3 different static tables that I have loaded into 3 different
>> Spark dataframes. Can we write some logic in Spark so that I can enrich my
>> Spark RDD with these static tables?
>>
>> Thanks
>>
>>
>
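Following up on the join/union suggestion above, a minimal sketch of enriching a
main dataframe with three small static lookup tables via broadcast joins; all of
the dataframe and column names below are hypothetical placeholders:

import org.apache.spark.sql.functions.broadcast

// Broadcasting the small static tables ships them to every executor,
// avoiding a shuffle of the large `events` dataframe.
val enriched = events
  .join(broadcast(countries),  Seq("country_code"),  "left")
  .join(broadcast(currencies), Seq("currency_code"), "left")
  .join(broadcast(products),   Seq("product_id"),    "left")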


Enrichment with static tables

2017-02-15 Thread Gaurav Agarwal
Hello

We want to enrich our Spark RDD, which is loaded with multiple columns and rows,
with 3 different static tables that I have loaded into 3 different Spark
dataframes. Can we write some logic in Spark so that I can enrich my Spark RDD
with these static tables?

Thanks


Regarding transformation with dataframe

2017-02-15 Thread Gaurav Agarwal
Hello

I have loaded 3 dataframes from 3 different static tables. I then received a CSV
file, loaded it into a dataframe with Spark, and registered it as a temporary
table named "Employee".
Now I need to enrich the columns in the Employee DF by querying any of the 3
static tables with some business logic, e.g.:

if (employee.firstName != null) {
    String empFirstName = employee.firstName;
    // SELECT mgr.Name FROM Mgr WHERE mgr.firstName = empFirstName
}

Once I retrieve Mgr.Name from the Mgr table, I will set the value in the
corresponding column of the Employee table/DF.
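A sketch of expressing this lookup as a left join instead of a per-row query;
employeeDF, mgrDF, and the join key (Mgr.firstName matching Employee.FirstName)
are assumptions taken from the pseudocode above:

import org.apache.spark.sql.functions.col

// Narrow the Mgr table to the lookup key and the value we want to copy over.
val mgrLookup = mgrDF.select(col("firstName").as("mgrFirstName"),
                             col("Name").as("MgrName"))

// A left join keeps every employee row; MgrName is null when no manager matches.
val withMgr = employeeDF.join(mgrLookup,
  employeeDF("FirstName") === mgrLookup("mgrFirstName"), "left")

withMgr.createOrReplaceTempView("EmployeeEnriched")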

Thanks


Re: Spark on Windows platform

2016-02-29 Thread Gaurav Agarwal
> Hi
> I am running Spark on Windows, but as a standalone setup.
>
> Use this code:
>
> SparkConf conf = new SparkConf().setMaster("local[1]")
>     .setAppName("spark")
>     .setSparkHome("c:/spark/bin/spark-submit.cmd");
>
> where sparkHome is the path where you extracted your Spark binaries, up to
> bin/*.cmd.
>
> You will get spark context or streaming context
>
> Thanks
>
> On Feb 29, 2016 7:10 PM, "gaurav pathak" 
wrote:
>>
>> Thanks Jorn.
>>
>> Any guidance on how to get started with getting SPARK on Windows, is
highly appreciated.
>>
>> Thanks & Regards
>>
>> Gaurav Pathak
>>
>> ~ sent from handheld device
>>
>> On Feb 29, 2016 5:34 AM, "Jörn Franke"  wrote:
>>>
>>> I think Hortonworks has a Windows Spark distribution. Maybe Bigtop as
well?
>>>
>>> > On 29 Feb 2016, at 14:27, gaurav pathak 
wrote:
>>> >
>>> > Can someone guide me through the steps and information regarding
installation of SPARK on Windows 7/8.1/10, as well as on Windows Server?
Also, it would be great to read about your experiences using SPARK on the
Windows platform.
>>> >
>>> >
>>> > Thanks & Regards,
>>> > Gaurav Pathak


Re: Stored proc with spark

2016-02-16 Thread Gaurav Agarwal
Thanks, I will try those options.
On Feb 16, 2016 9:15 PM, "Mich Talebzadeh" <m...@peridale.co.uk> wrote:

> You can use JDBC to Oracle to get that data from a given table. What does the
> Oracle stored procedure do anyway? How many tables are involved?
>
> JDBC is pretty neat. In the example below I use JDBC to load two
> dimension tables from Oracle in the Spark shell and read the FACT table of 100
> million rows from Hive.
>
> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
> println ("\nStarted at"); HiveContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
> ").collect.foreach(println)
>
> //
> var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb"
> var _username : String = "sh"
> var _password : String = "xx"
> //
>
> // Get the FACT table from Hive
> //
> var s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM
> oraclehadoop.sales")
>
> //Get Oracle tables via JDBC
>
> val c = HiveContext.load("jdbc",
> Map("url" -> _ORACLEserver,
> "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM
> sh.channels)",
> "user" -> _username,
> "password" -> _password))
>
> val t = HiveContext.load("jdbc",
> Map("url" -> _ORACLEserver,
> "dbtable" -> "(SELECT TIME_ID AS TIME_ID, CALENDAR_MONTH_DESC FROM
> sh.times)",
> "user" -> _username,
> "password" -> _password))
>
> // Register the three data frames as temporary tables using the
> // registerTempTable() call
>
> s.registerTempTable("t_s")
> c.registerTempTable("t_c")
> t.registerTempTable("t_t")
> //
> var sqltext : String = ""
> sqltext = """
> SELECT rs.Month, rs.SalesChannel, round(TotalSales,2)
> FROM
> (
> SELECT t_t.CALENDAR_MONTH_DESC AS Month, t_c.CHANNEL_DESC AS SalesChannel,
> SUM(t_s.AMOUNT_SOLD) AS TotalSales
> FROM t_s, t_t, t_c
> WHERE t_s.TIME_ID = t_t.TIME_ID
> AND   t_s.CHANNEL_ID = t_c.CHANNEL_ID
> GROUP BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
> ORDER by t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
> ) rs
> LIMIT 10
> """
> HiveContext.sql(sqltext).collect.foreach(println)
> println ("\nFinished at"); HiveContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
> ").collect.foreach(println)
>
> sys.exit()
>
>
>
> HTH
>
> --
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Cloud Technology
> Partners Ltd, its subsidiaries or their employees, unless expressly so
> stated. It is the responsibility of the recipient to ensure that this email
> is virus free, therefore neither Cloud Technology partners Ltd, its
> subsidiaries nor their employees accept any responsibility.
>
> On 16/02/2016 09:04, Gaurav Agarwal wrote:
>
> Hi
> Can I load the data into spark from oracle storedproc
>
> Thanks
>
>
>
>
>
>
>


Stored proc with spark

2016-02-16 Thread Gaurav Agarwal
Hi
Can I load data into Spark from an Oracle stored procedure?

Thanks


Dataframes

2016-02-11 Thread Gaurav Agarwal
Hi

Can we load 5 dataframes for 5 tables in one Spark context?
I am asking because we have to provide options like:

Map<String, String> options = new HashMap<>();
options.put("driver", "");
options.put("url", "");
options.put("dbtable", "");

I can give only one table or query at a time in the dbtable option.
How will I register multiple queries and dataframes?
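A minimal sketch of loading several tables through one SQLContext, each with its
own dbtable option, and registering each as a temp table; the table names, URL,
and driver class below are placeholders:

// One context can hold any number of dataframes; each load just gets its own options.
val tables = Seq("employee", "department", "manager", "project", "region")

val frames = tables.map { t =>
  val df = sqlContext.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@host:1521:mydb")
    .option("driver", "oracle.jdbc.OracleDriver")
    .option("dbtable", t)
    .load()
  df.registerTempTable(t)   // afterwards queryable via sqlContext.sql(s"select * from $t")
  t -> df
}.toMap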

Thanks


Re: cache DataFrame

2016-02-11 Thread Gaurav Agarwal
Thanks for the info below.
I have one more question. I have my own framework where the SQL query is already
built, so instead of using dataframe filter criteria I am thinking I could use:

DataFrame d = sqlContext.sql("<prebuilt query here>");
d.printSchema();
List<Row> rows = d.collectAsList();

My understanding is that when I call d.collectAsList() the query is executed
against the database at that point, and nothing is cached there. Please confirm.
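A small sketch of the lazy behaviour in question, assuming the employee query
used elsewhere in this thread; nothing runs until the first action, and caching
only happens if you explicitly ask for it:

// Building the DataFrame only creates a query plan; no data is read yet.
val d = sqlContext.sql("SELECT * FROM employee WHERE wmpid = '!'")

d.cache()                      // optional: marks d for caching, still lazy
val rows = d.collectAsList()   // first action: the query actually executes here
val n = d.count()              // with cache(), this reuses the cached rows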

Thanks



On Feb 12, 2016 12:01 AM, "Rishabh Wadhawan" <rishabh...@gmail.com> wrote:

> Hi Gaurav. Spark will not load the tables into memory at either of those
> points, as DataFrames are just abstractions of something that might happen in
> the future, when you actually trigger an ACTION such as df.collectAsList or df.show.
> When you run DataFrame df = sContext.load("jdbc", "(select * from employee) as
> employee"); all Spark does is generate a query execution
> plan. That plan gets executed when you trigger an ACTION statement.
> Take this example.
>
> DataFrame df = sContext.load("jdbc", "(select * from employee) as
> employee"); // Spark makes a query execution tree
> df.filter(df.col("wmpid").equalTo("!"));  // Spark adds this to the query
> execution tree
> System.out.println(df.queryExecution()); // Print out the query execution
> plan, with physical and logical plans.
>
> df.show(); /* This is when Spark starts loading data into memory and
> executes the optimized execution plan, according to the query execution
> tree. This is the point when the data gets materialized. */
>
> > On Feb 11, 2016, at 11:20 AM, Gaurav Agarwal <gaurav130...@gmail.com>
> wrote:
> >
> > Hi
> >
> > When will the DataFrame load the table into memory when it reads from
> > Hive/Phoenix or any other database?
> > I need to know at which of the two points below the tables get loaded
> > into memory or cached, point 1 or point 2:
> >
> > 1. DataFrame df = sContext.load("jdbc", "(select * from employee) as employee");
> >
> > 2. sContext.sql("select * from employee where wmpid = '!'");
> >
> >
> >
> >
> >
> >
>
>


cache DataFrame

2016-02-11 Thread Gaurav Agarwal
Hi

When will the DataFrame load the table into memory when it reads from
Hive/Phoenix or any other database?
I need to know at which of the two points below the tables get loaded into
memory or cached, point 1 or point 2:

1. DataFrame df = sContext.load("jdbc", "(select * from employee) as employee");

2. sContext.sql("select * from employee where wmpid = '!'");


Re: Problem using limit clause in spark sql

2015-12-23 Thread Gaurav Agarwal
If I have the above scenario without using the limit clause, will it then work
across all the partitions?
On Dec 24, 2015 9:26 AM, "汪洋"  wrote:

> I see.
>
> Thanks.
>
>
> On Dec 24, 2015, at 11:44 AM, Zhan Zhang wrote:
>
> There has to be a central point for collaboratively collecting exactly
> 10,000 records; currently the approach uses one single partition, which
> is easy to implement.
> Otherwise, the driver would have to count the number of records in each
> partition and then decide how many records to materialize in each partition,
> because some partitions may not have enough records; sometimes a partition is
> even empty.
>
> I didn't see any straightforward workaround for this.
>
> Thanks.
>
> Zhan Zhang
>
>
>
> On Dec 23, 2015, at 5:32 PM, 汪洋  wrote:
>
> It is an application running as an http server. So I collect the data as
> the response.
>
> On Dec 24, 2015, at 8:22 AM, Hudong Wang wrote:
>
> When you call collect() it will bring all the data to the driver. Do you
> mean to call persist() instead?
>
> --
> From: tiandiwo...@icloud.com
> Subject: Problem using limit clause in spark sql
> Date: Wed, 23 Dec 2015 21:26:51 +0800
> To: user@spark.apache.org
>
> Hi,
> I am using spark sql in a way like this:
>
> sqlContext.sql("select * from table limit 10000").map(...).collect()
>
> The problem is that the limit clause will collect all the 10,000 records
> into a single partition, resulting in the map afterwards running in only one
> partition and being really slow. I tried to use repartition, but it is
> kind of a waste to collect all those records into one partition and then
> shuffle them around and then collect them again.
>
> Is there a way to work around this?
> BTW, there is no order by clause and I do not care which 10,000 records I
> get as long as the total number is less than or equal to 10,000.
>
>
>
>
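For reference, a sketch of the repartition workaround mentioned in the thread
(admittedly wasteful, as noted above); the table name, the 10,000 limit, and the
partition count of 32 are placeholders:

// LIMIT gathers the rows into a single partition; repartition spreads them back
// out so the expensive map can run in parallel.
val limited = sqlContext.sql("SELECT * FROM table LIMIT 10000")
val spread  = limited.repartition(32)
val result  = spread.rdd.map(row => row.mkString(",")).collect()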
>


Spark data frame

2015-12-22 Thread Gaurav Agarwal
We are able to retrieve a dataframe by filtering the RDD object. I need to
convert that dataframe into a Java POJO. Any idea how to do that?
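A sketch of mapping DataFrame rows into plain objects, shown here with a Scala
case class; the field names and types are assumptions. The same row.getAs calls
also work from df.javaRDD() with a Java bean:

// Hypothetical shape of the target object.
case class Employee(id: Long, firstName: String, salary: Double)

// df.rdd gives an RDD[Row]; pull each field out by name.
val employees = df.rdd.map { row =>
  Employee(
    row.getAs[Long]("id"),
    row.getAs[String]("first_name"),
    row.getAs[Double]("salary"))
}

employees.take(5).foreach(println)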


Regarding spark in memory

2015-12-22 Thread Gaurav Agarwal
Suppose I have a cluster with 3 or more nodes running Spark, I load the records
from Phoenix into a Spark RDD, and I fetch the records from Spark through a
dataframe.

Now I want to confirm that Spark is distributed: if I fetch the records from any
one node, will records be retrieved that are present on any node in the Spark
RDD?


sparkStreaming how to work with partitions, how to create partitions

2015-08-22 Thread Gaurav Agarwal
1. How do I work with partitions in Spark Streaming from Kafka?
2. How do I create partitions in Spark Streaming from Kafka?



When I send messages from a Kafka topic that has three partitions, Spark will
listen for the messages when I call KafkaUtils.createStream or createDirectStream
with local[4].
Now I want to see whether Spark creates partitions when it receives messages
from Kafka using a DStream: how and where does this happen, and which method of
the Spark API do I have to look at to find out?
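A small sketch of checking this with the direct Kafka API (Spark 1.x,
spark-streaming-kafka 0.8); with the direct approach each Kafka partition maps
1:1 to an RDD partition, so a 3-partition topic yields 3-partition batch RDDs
regardless of local[4]. The topic name and broker address are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setMaster("local[4]").setAppName("kafka-partitions")
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = Set("my-topic")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

// Print the partition count of each batch RDD to see the 1:1 mapping.
stream.foreachRDD { rdd =>
  println(s"partitions in this batch: ${rdd.partitions.size}")
}

ssc.start()
ssc.awaitTermination()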


Re: spark kafka partitioning

2015-08-21 Thread Gaurav Agarwal
When I send messages from a Kafka topic that has three partitions, Spark will
listen for the messages when I call KafkaUtils.createStream or createDirectStream
with local[4].
Now I want to see whether Spark creates partitions when it receives messages
from Kafka using a DStream: how and where does this happen, and which method of
the Spark API do I have to look at to find out?

On 8/21/15, Gaurav Agarwal gaurav130...@gmail.com wrote:
 Hello

 Regarding Spark Streaming and Kafka Partitioning

 When I send messages on a Kafka topic with 3 partitions and listen with a
 Kafka receiver using local[4], how will I come to know in Spark Streaming
 that different DStreams are created according to the partitions of the Kafka
 messages?

 Thanks


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark kafka partitioning

2015-08-20 Thread Gaurav Agarwal
Hello

Regarding Spark Streaming and Kafka Partitioning

When I send messages on a Kafka topic with 3 partitions and listen with a Kafka
receiver using local[4], how will I come to know in Spark Streaming that
different DStreams are created according to the partitions of the Kafka
messages?

Thanks