Fwd: iceberg queries
Hi Team,

Sample merge query:

df.createOrReplaceTempView("source")

MERGE INTO iceberg_hive_cat.iceberg_poc_db.iceberg_tab target
USING (SELECT * FROM source) ON target.col1 = source.col1 -- col1 is my bucket column
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

The source dataset is a temporary view that currently contains 1.5 million records (in the future it may grow to 20 million rows), keyed by an id that is hashed into 16 buckets. The target Iceberg table also has 16 buckets. The source dataset only updates rows that match on that id and inserts the rest, and the table has 1700 columns. The Spark dataset uses the default partitioning; do we need to bucket (repartition) the Spark dataset on the bucket column as well? The job fails with an OOME. Let me know if you need any further details.

Regards
Gaurav
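A hedged sketch of the repartitioning being asked about, assuming Spark 3.x with the Iceberg Spark runtime on the classpath; the input path is a placeholder. Spark's hash repartitioning is not the same hash as Iceberg's bucket transform, but clustering the source on the join key before the MERGE keeps each task's working set of 1700-column rows small; alternatively, the Iceberg table property write.distribution-mode=hash (where the Iceberg version supports it) lets the writer do the clustering.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()

// cluster the source on the join/bucket column; 16 matches the target table's bucket count
val source = spark.read.parquet("s3://my-bucket/input/")   // placeholder path
  .repartition(16, col("col1"))
source.createOrReplaceTempView("source")

spark.sql("""
  MERGE INTO iceberg_hive_cat.iceberg_poc_db.iceberg_tab target
  USING (SELECT * FROM source) ON target.col1 = source.col1
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")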
Re: Spark using iceberg
> Hi,
>
> I am using Spark with Iceberg, updating a table with 1700 columns.
> We are loading 0.6 million rows from Parquet files (in the future it will be
> 16 million rows) and trying to update the data in a table that has 16 buckets.
> We use Spark's default partitioner and do not repartition the dataset on the
> bucketing column.
> One of the executors fails with an OOME, recovers, and then fails again when
> we use Iceberg's MERGE INTO strategy:
>
> MERGE INTO target USING (SELECT * FROM source) ON target.id = source.id
> WHEN MATCHED THEN UPDATE SET *
> WHEN NOT MATCHED THEN INSERT *
>
> But when we do a blind append, this works.
>
> Questions:
>
> 1. How do we find out what the issue is? We are running Spark on an EKS
> cluster; when an executor hits the OOME it dies and its logs are gone, so we
> are unable to see them.
> 2. Do we need to partition the dataset on the bucket column, either at load
> time or once the data is loaded?
>
> I need help understanding this.
Spark using iceberg
Hi,

I am using Spark with Iceberg, updating a table with 1700 columns. We are loading 0.6 million rows from Parquet files (in the future it will be 16 million rows) and trying to update the data in a table that has 16 buckets. We use Spark's default partitioner and do not repartition the dataset on the bucketing column. One of the executors fails with an OOME, recovers, and then fails again when we use Iceberg's MERGE INTO strategy:

MERGE INTO target USING (SELECT * FROM source) ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

But when we do a blind append, this works.

Questions:

1. How do we find out what the issue is? We are running Spark on an EKS cluster; when an executor hits the OOME it dies and its logs are gone, so we are unable to see them.
2. Do we need to partition the dataset on the bucket column, either at load time or once the data is loaded?

I need help understanding this.
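A sketch of settings that are often the first things to try for the two questions above, assuming Spark 3.x; all values are placeholders, and on Kubernetes/EKS these are normally passed via spark-submit --conf rather than set in code.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // keep failed executor pods around so `kubectl logs` still shows the OOME stack trace
  .config("spark.kubernetes.executor.deleteOnTermination", "false")
  // more headroom per executor; rows with 1700 columns make individual tasks heavy
  .config("spark.executor.memory", "8g")
  .config("spark.executor.memoryOverhead", "2g")
  // more shuffle partitions means less data per task during the MERGE shuffle
  .config("spark.sql.shuffle.partitions", "64")
  .getOrCreate()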
Re: Enrichment with static tables
Thanks, that worked for me. Previously I was using the wrong join; that is the reason it did not work for me. Thanks.

On Feb 16, 2017 01:20, "Sam Elamin" <hussam.ela...@gmail.com> wrote:

> You can do a join or a union to combine all the dataframes into one fat
> dataframe,
>
> or do a select on the columns you want to produce your transformed
> dataframe.
>
> Not sure if I understand the question though. If the goal is just an end
> state transformed dataframe, that can easily be done.
>
> Regards
> Sam
>
> On Wed, Feb 15, 2017 at 6:34 PM, Gaurav Agarwal <gaurav130...@gmail.com> wrote:
>
>> Hello
>>
>> We want to enrich our Spark RDD, loaded with multiple columns and multiple
>> rows, with 3 different tables that I loaded into 3 different Spark
>> DataFrames. Can we write some logic in Spark so I can enrich my Spark RDD
>> with these different static tables?
>>
>> Thanks
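For reference, a minimal sketch of the join approach described above; the DataFrame and key names are made up for illustration. Small static tables can be broadcast so the enrichment avoids a full shuffle.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

// mainDF is the dataset to enrich; the three static lookup tables are small enough to broadcast
def enrich(mainDF: DataFrame, dept: DataFrame, region: DataFrame, grade: DataFrame): DataFrame =
  mainDF
    .join(broadcast(dept),   Seq("deptId"),   "left")
    .join(broadcast(region), Seq("regionId"), "left")
    .join(broadcast(grade),  Seq("gradeId"),  "left")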
Enrichment with static tables
Hello,

We want to enrich a Spark RDD that is loaded with multiple columns and multiple rows. It needs to be enriched from 3 different tables that I loaded into 3 different Spark DataFrames. Can we write some logic in Spark so I can enrich my Spark RDD with these different static tables?

Thanks
Regarding transformation with dataframe
Hello,

I have loaded 3 DataFrames from 3 different static tables. I then take a CSV file, load it into a DataFrame with Spark, and register it as a temporary table named "Employee". Now I need to enrich the columns of the Employee DataFrame by querying any of the 3 static tables with some business logic, e.g.:

if (employee.firstName != null) {
    String empFirstName = employee.firstName;
    // SELECT mgr.Name FROM Mgr WHERE mgr.firstName = empFirstName
}

Once I retrieve Mgr.Name from the Mgr table, I set the value in the corresponding column of the Employee table/DataFrame.

Thanks
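A sketch of the same lookup expressed as a join rather than per-row conditionals; it assumes a spark-shell-style session where "employee" and "Mgr" are registered temp tables, and it uses the column names from the message above.

val withMgrName = sqlContext.sql("""
  SELECT e.*, m.Name AS mgrName
  FROM employee e
  LEFT JOIN Mgr m ON e.firstName = m.firstName
""")
// rows whose firstName is null simply get a null mgrName, which mirrors the if-not-null check above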
Re: Spark on Windows platform
> Hi,
> I am running Spark on Windows, but a standalone one.
>
> Use this code:
>
> SparkConf conf = new SparkConf().setMaster("local[1]").setAppName("spark").setSparkHome("c:/spark/bin/spark-submit.cmd");
>
> where the Spark home is the path where you extracted your Spark binaries, down to bin/*.cmd.
>
> You will get a Spark context or streaming context.
>
> Thanks
>
> On Feb 29, 2016 7:10 PM, "gaurav pathak" wrote:
>>
>> Thanks Jorn.
>>
>> Any guidance on how to get started with getting Spark on Windows is highly appreciated.
>>
>> Thanks & Regards
>>
>> Gaurav Pathak
>>
>> ~ sent from handheld device
>>
>> On Feb 29, 2016 5:34 AM, "Jörn Franke" wrote:
>>>
>>> I think Hortonworks has a Windows Spark distribution. Maybe Bigtop as well?
>>>
>>> > On 29 Feb 2016, at 14:27, gaurav pathak wrote:
>>> >
>>> > Can someone guide me through the steps and information regarding installation of Spark on Windows 7/8.1/10, as well as on Windows Server? Also, it would be great to read about your experiences using Spark on the Windows platform.
>>> >
>>> > Thanks & Regards,
>>> > Gaurav Pathak
Re: Stored proc with spark
Thanks, I will try with those options.

On Feb 16, 2016 9:15 PM, "Mich Talebzadeh" <m...@peridale.co.uk> wrote:

> You can use JDBC to Oracle to get that data from a given table. What does
> the Oracle stored procedure do anyway? How many tables are involved?
>
> JDBC is pretty neat. In the example below I use JDBC to load two dimension
> tables from Oracle in the Spark shell and read the FACT table of 100
> million rows from Hive.
>
> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
> println ("\nStarted at"); HiveContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')
> ").collect.foreach(println)
>
> //
> var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb"
> var _username : String = "sh"
> var _password : String = "xx"
> //
>
> // Get the FACT table from Hive
> //
> var s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM
> oraclehadoop.sales")
>
> // Get Oracle tables via JDBC
>
> val c = HiveContext.load("jdbc",
> Map("url" -> _ORACLEserver,
> "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM
> sh.channels)",
> "user" -> _username,
> "password" -> _password))
>
> val t = HiveContext.load("jdbc",
> Map("url" -> _ORACLEserver,
> "dbtable" -> "(SELECT TIME_ID AS TIME_ID, CALENDAR_MONTH_DESC FROM
> sh.times)",
> "user" -> _username,
> "password" -> _password))
>
> // Register the three data frames as temporary tables using the
> registerTempTable() call
>
> s.registerTempTable("t_s")
> c.registerTempTable("t_c")
> t.registerTempTable("t_t")
> //
> var sqltext : String = ""
> sqltext = """
> SELECT rs.Month, rs.SalesChannel, round(TotalSales,2)
> FROM
> (
> SELECT t_t.CALENDAR_MONTH_DESC AS Month, t_c.CHANNEL_DESC AS SalesChannel,
> SUM(t_s.AMOUNT_SOLD) AS TotalSales
> FROM t_s, t_t, t_c
> WHERE t_s.TIME_ID = t_t.TIME_ID
> AND t_s.CHANNEL_ID = t_c.CHANNEL_ID
> GROUP BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
> ORDER by t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
> ) rs
> LIMIT 10
> """
> HiveContext.sql(sqltext).collect.foreach(println)
> println ("\nFinished at"); HiveContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')
> ").collect.foreach(println)
>
> sys.exit()
>
> HTH
>
> --
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Cloud Technology
> Partners Ltd, its subsidiaries or their employees, unless expressly so
> stated. It is the responsibility of the recipient to ensure that this email
> is virus free, therefore neither Cloud Technology partners Ltd, its
> subsidiaries nor their employees accept any responsibility.
>
> On 16/02/2016 09:04, Gaurav Agarwal wrote:
>
> Hi
> Can I load data into Spark from an Oracle stored procedure?
>
> Thanks
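One hedged addition to the example above: a plain stored procedure has no result set that Spark's JDBC source can scan, but if the procedure's logic is exposed as an Oracle pipelined table function (the sh.get_sales() name below is made up), it can be wrapped in the same dbtable subquery pattern.

val p = HiveContext.load("jdbc",
  Map("url" -> _ORACLEserver,
      "dbtable" -> "(SELECT * FROM TABLE(sh.get_sales()))",  // hypothetical pipelined table function
      "user" -> _username,
      "password" -> _password))
p.registerTempTable("t_p")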
Stored proc with spark
Hi,

Can I load data into Spark from an Oracle stored procedure?

Thanks
Dataframes
Hi,

Can we load 5 DataFrames for 5 tables in one Spark context? I am asking because we have to pass options like:

Map<String, String> options = new HashMap<>();
options.put("driver", "");
options.put("url", "");
options.put("dbtable", "");

and I can give only one table query at a time in the dbtable option. How will I register multiple queries and DataFrames, one for each table?

Thanks
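A sketch of one way to do this (Scala, Spark 1.4+ read API equivalent to the options map above; driver, URL, table, and column names are placeholders): build the JDBC options per table and register each resulting DataFrame as its own temp table, all on the same SQLContext.

val tables = Seq("employee", "department", "salary", "region", "grade")  // placeholder table names
tables.foreach { t =>
  val df = sqlContext.read.format("jdbc")
    .option("driver", "oracle.jdbc.OracleDriver")     // placeholder driver
    .option("url", "jdbc:oracle:thin:@host:1521:db")  // placeholder URL
    .option("dbtable", t)
    .load()
  df.registerTempTable(t)                             // one temp table per source table
}
// afterwards the five tables can be joined or queried together:
val joined = sqlContext.sql("SELECT e.*, d.name FROM employee e JOIN department d ON e.deptId = d.id")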
Re: cache DataFrame
Thanks for the info below. I have one more question. I have my own framework where the SQL query is already built, so instead of using DataFrame filter criteria I am thinking of using:

DataFrame d = sqlContext.sql(queryString); // the pre-built query
d.printSchema();
List<Row> rows = d.collectAsList();

Here, when I call d.collectAsList(), the query will go to the database and execute there, and nothing will be cached. Please confirm my understanding.

Thanks

On Feb 12, 2016 12:01 AM, "Rishabh Wadhawan" <rishabh...@gmail.com> wrote:

> Hi Gaurav. Spark will not load the tables into memory at either of the two
> points, as DataFrames are just abstractions of something that might happen
> in the future, when you actually invoke an ACTION such as df.collectAsList
> or df.show. When you run DataFrame df = sContext.load("jdbc", "(select *
> from employee) as employee"); all Spark does is generate a queryExecution
> plan. That plan gets executed when you invoke an ACTION statement.
> Take this example:
>
> DataFrame df = sContext.load("jdbc", "(select * from employee) as employee"); // Spark makes a query execution tree
> df.filter(df.col("wmpid").equalTo("!")); // Spark adds this to the query execution tree
> System.out.println(df.queryExecution()); // Print out the query execution plan, with physical and logical plans.
> df.show(); // This is when Spark starts loading data into memory and executes the optimized
>            // execution plan according to the query execution tree. This is the point when data gets materialized.
>
> On Feb 11, 2016, at 11:20 AM, Gaurav Agarwal <gaurav130...@gmail.com> wrote:
>>
>> Hi
>>
>> When does the DataFrame load the table into memory when it reads from
>> Hive/Phoenix or from any other database?
>> I need to know at which of the two points below the table will be loaded
>> into memory or cached:
>>
>> 1. DataFrame df = sContext.load("jdbc", "(select * from employee) as employee");
>>
>> 2. sContext.sql("select * from employee where wmpid = '!'");
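A small sketch confirming the lazy-evaluation point discussed above (query and table names are placeholders): each action re-runs the query at the source unless you explicitly cache.

val d = sqlContext.sql("SELECT * FROM employee WHERE wmpid = '!'")
d.printSchema()                // schema only -- no job has run yet

val rows = d.collectAsList()   // the query executes at the database now; nothing is kept in memory afterwards

d.cache()                      // only if you opt in...
d.count()                      // ...does the next action materialize the rows into Spark's cache
val again = d.collectAsList()  // now served from the cached data instead of re-querying the database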
cache DataFrame
Hi,

When does a DataFrame load the table into memory when it reads from Hive/Phoenix or from any other database? I need to know at which of the two points below the table will be loaded into memory or cached:

1. DataFrame df = sContext.load("jdbc", "(select * from employee) as employee");

2. sContext.sql("select * from employee where wmpid = '!'");
Re: Problem using limit clause in spark sql
If I have the above scenario without using the limit clause, will the work be spread across all the partitions?

On Dec 24, 2015 9:26 AM, "汪洋" wrote:

> I see.
>
> Thanks.
>
> On Dec 24, 2015, at 11:44 AM, Zhan Zhang wrote:
>
> There has to be a central point that collaboratively collects exactly
> 10,000 records; currently the approach is to use one single partition,
> which is easy to implement.
> Otherwise, the driver has to count the number of records in each partition
> and then decide how many records to materialize in each partition, because
> some partitions may not have enough records and are sometimes even empty.
>
> I didn't see any straightforward workaround for this.
>
> Thanks.
>
> Zhan Zhang
>
> On Dec 23, 2015, at 5:32 PM, 汪洋 wrote:
>
> It is an application running as an HTTP server, so I collect the data as
> the response.
>
> On Dec 24, 2015, at 8:22 AM, Hudong Wang wrote:
>
> When you call collect() it will bring all the data to the driver. Do you
> mean to call persist() instead?
>
> --
> From: tiandiwo...@icloud.com
> Subject: Problem using limit clause in spark sql
> Date: Wed, 23 Dec 2015 21:26:51 +0800
> To: user@spark.apache.org
>
> Hi,
> I am using Spark SQL in a way like this:
>
> sqlContext.sql("select * from table limit 10000").map(...).collect()
>
> The problem is that the limit clause will collect all the 10,000 records
> into a single partition, so the map afterwards runs in only one partition
> and is really slow. I tried to use repartition, but it is kind of a waste
> to collect all those records into one partition, then shuffle them around,
> and then collect them again.
>
> Is there a way to work around this?
> BTW, there is no ORDER BY clause and I do not care which 10,000 records I
> get as long as the total number is less than or equal to 10,000.
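A hedged sketch of one workaround consistent with the thread: drop the LIMIT so the scan and the map stay distributed, and use take(), which only pulls as many partitions as it needs; transform() is a placeholder for the real per-row work.

import org.apache.spark.sql.Row

def transform(row: Row): String = row.toString          // placeholder for the real mapping
val mapped = sqlContext.sql("select * from table").map(transform)
val firstTenThousand = mapped.take(10000)               // at most 10,000 rows, no single-partition shuffle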
Spark data frame
We are able to retrieve a DataFrame by filtering the RDD object. I need to convert that DataFrame into Java POJOs. Any idea how to do that?
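A sketch of one way to do it, assuming Spark 1.6+ so the Dataset API is available; the Employee fields are placeholders that must match the DataFrame's schema. In the Java API the equivalent would be df.as(Encoders.bean(Employee.class)) for a bean-style POJO.

case class Employee(id: Long, firstName: String, deptId: Int)   // placeholder fields

import sqlContext.implicits._
// filteredDF is the DataFrame produced by the filtering step described above
val typed  = filteredDF.as[Employee]           // same query plan, now with a typed row class
val pojos: Array[Employee] = typed.collect()   // plain objects on the driver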
Regarding spark in memory
If I have a 3-node cluster with Spark running on it, and I load the records from Phoenix into a Spark RDD and fetch them through a DataFrame: I want to confirm that Spark is distributed, i.e. if I fetch the records from any one node, will records held on any other node in the Spark RDD also be retrieved?
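A quick sketch for checking the distribution yourself (the query is a placeholder): the partition count and per-partition row counts show the data spread across the executors, while any action that returns rows gathers them from whichever nodes hold them.

val df = sqlContext.sql("SELECT * FROM employee")   // placeholder query over the Phoenix-backed table

println(s"partitions: ${df.rdd.partitions.length}")

// partition index -> number of rows in that partition (computed on the executors holding the data)
df.rdd.mapPartitionsWithIndex { (idx, rows) => Iterator((idx, rows.size)) }
  .collect()
  .foreach(println)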
Spark Streaming: how to work with partitions, how to create partitions
1. How do I work with partitions in Spark Streaming from Kafka?
2. How do I create partitions in Spark Streaming from Kafka?

When I send messages to a Kafka topic that has three partitions, Spark listens for the messages when I call KafkaUtils.createStream or createDirectStream with master local[4]. Now I want to see whether Spark creates partitions when it receives messages from Kafka via a DStream: how, where, and which method of the Spark API do I have to look at to find out?
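A sketch using the Kafka 0.8 direct stream (the spark-streaming-kafka artifact is assumed on the classpath; broker and topic names are placeholders). With createDirectStream each Kafka partition maps 1:1 to an RDD partition, so a 3-partition topic produces 3-partition batches, which can be verified with rdd.partitions.length inside foreachRDD.

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(5))
val kafkaParams = Map("metadata.broker.list" -> "broker:9092")   // placeholder broker
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))                             // placeholder topic

stream.foreachRDD { rdd =>
  println(s"partitions in this batch: ${rdd.partitions.length}") // expect 3 for a 3-partition topic
}

ssc.start()
ssc.awaitTermination()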
Re: spark kafka partitioning
When I send messages to a Kafka topic that has three partitions, Spark listens for the messages when I call KafkaUtils.createStream or createDirectStream with master local[4]. Now I want to see whether Spark creates partitions when it receives messages from Kafka via a DStream: how, where, and which method of the Spark API do I have to look at to find out?

On 8/21/15, Gaurav Agarwal <gaurav130...@gmail.com> wrote:

> Hello
>
> Regarding Spark Streaming and Kafka partitioning: when I send messages to
> a Kafka topic with 3 partitions and listen on a Kafka receiver with master
> local[4], how will I come to know in Spark Streaming whether different
> DStreams are created according to the partitions of the Kafka messages?
>
> Thanks
spark kafka partitioning
Hello,

Regarding Spark Streaming and Kafka partitioning: when I send messages to a Kafka topic with 3 partitions and listen on a Kafka receiver with master local[4], how will I come to know in Spark Streaming whether different DStreams are created according to the partitions of the Kafka messages?

Thanks