Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Joris Billen
+1 On 18 Mar 2024, at 21:53, Mich Talebzadeh wrote: Well, as long as it works. Please all check this link from Databricks and let us know your thoughts. Will something similar work for us? Of course Databricks have much deeper pockets than our ASF community. Will it require moderation in

Re: Topics for Spark online classes & webinars

2023-03-14 Thread Joris Billen
This is a very good idea; I would love to read such a Confluence page. Adding a section “common mistakes/misconceptions” might be useful for many of these sections. It would describe the undesired behaviour/errors one would get when not following some best practices. On 13 Mar 2023, at 17:20,

[Spark/deeplyR] how come spark is caching tables read through jdbc connection from oracle, even when memory=false is chosen

2023-01-31 Thread Joris Billen
This question is related to using Spark and deeplyR. We load a lot of data from Oracle into dataframes through a JDBC connection: dfX <- spark_read_jdbc(spConn, "myconnection", options = list( url = urlDEVdb, driver = "oracle.jdbc.OracleDriver",

[pyspark/sparksql]: How to overcome redundant/repetitive code? Is a for loop over an sql statement with a variable a bad idea?

2023-01-06 Thread Joris Billen
Hello Community, I am working in pyspark with sparksql and have a very similar, very complex list of dataframes that I'll have to execute several times for all the “models” I have. Suppose the code is exactly the same for all models, only the table it reads from and some values in the where
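
One way to avoid repeating the same statement per model might look like the sketch below. The table names, column names, filter values, and the `build_query` helper are hypothetical and do not come from the original post. The loop only builds the SQL text; with a SparkSession named `spark`, each string would then be passed to `spark.sql(...)`.

```python
from string import Template

# Hypothetical per-model settings: only the source table and a filter value differ.
MODELS = {
    "model_a": {"table": "db.table_a", "min_value": 10},
    "model_b": {"table": "db.table_b", "min_value": 25},
}

# One shared statement; $table and $min_value are filled in per model.
QUERY = Template("SELECT id, value FROM $table WHERE value >= $min_value")


def build_query(params):
    """Fill the shared template with one model's parameters."""
    return QUERY.substitute(params)


queries = {name: build_query(params) for name, params in MODELS.items()}

for name, sql in queries.items():
    # In a PySpark job this line would be: df = spark.sql(sql)
    print(name, "->", sql)
```

Because Spark evaluates lazily, a Python loop like this only defines one query plan per model; the work happens when each resulting dataframe is written or collected.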

[sparklyR] broadcast table for temporary table -> can you compute statistics for temporary table?

2022-11-23 Thread Joris Billen
Hi, a question about using the R API for Spark: we load some files from Oracle (through JDBC) and register them in a temporary table in Spark. I see a lot of shuffling, but we have joins between large and small tables. So I probably need to broadcast the small tables. Normally autobroadcasting

should one every make a spark streaming job in pyspark

2022-11-02 Thread Joris Billen
Dear community, I had a general question about the use of Scala vs. pyspark for Spark streaming. I believe Spark streaming will work most efficiently when written in Scala. I believe, however, that things can be implemented in pyspark. My question: 1) is it completely dumb to make a streaming job

external table with parquet files: problem querying in sparksql since data is stored as integer while hive schema expects a timestamp

2022-07-20 Thread Joris Billen
Hi, the below sounds like something that someone will have experienced... I have external tables of parquet files with a hive table defined on top of the data. I don't manage/know the details of how the data lands. For some tables there are no issues when querying through spark. But for others there is an

Re: very simple UI on webpage to display x/y plots+histogram of data stored in hive

2022-07-18 Thread Joris Billen
Thank you - looks like it COULD do it. Have to try if I can have a simple UI, user selects one out of 100 options, and receives the correct x/y plot and correct histogram of data stored in hive and retrieved with spark into pandas… Many thanks for your suggestion! On 18 Jul 2022, at 15:08,

very simple UI on webpage to display x/y plots+histogram of data stored in hive

2022-07-18 Thread Joris Billen
Hi, I am making a very short demo and would like to make the most rudimentary UI (without knowing anything about front end) that would show an x/y plot of data stored in Hive (that I typically query with Spark) together with a histogram (something one would typically create in a jupyter

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Joris Billen
ed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Wed, 6 Apr 2022 at 16:41, Joris Billen <joris.bil...@bigindustries.be> wrote: Hi, thanks for your reply. I believe I have found the issue: the job writes dat

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Joris Billen
u are running, why in loops, and whether you are caching in any data or not, and whether you are referencing the variables to create them like in the following expression we are referencing x to create x, x = x + 1 Thanks and Regards, Gourav Sengupta On Mon, Apr 4, 2022 at 10:51 AM Joris Bi

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-06 Thread Joris Billen
ferencing the variables to create them like in the following expression we are referencing x to create x, x = x + 1 Thanks and Regards, Gourav Sengupta On Mon, Apr 4, 2022 at 10:51 AM Joris Billen <joris.bil...@bigindustries.be> wrote: Clear, probably not a good idea. But a previo

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-04 Thread Joris Billen
, so I doubt caching helps anything here. On Fri, Apr 1, 2022 at 2:49 AM Joris Billen <joris.bil...@bigindustries.be> wrote: Hi, as said, thanks for the little discussion over mail. I understand that the action is triggered in the end at the write and then all of a sudden everything

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-01 Thread Joris Billen
, 2022 at 3:30 AM Joris Billen <joris.bil...@bigindustries.be> wrote: Thanks for the reply :-) I am using pyspark. Basically my code (simplified) is: df=spark.read.csv(hdfs://somehdfslocation) df1=spark.sql (complex statement using df) ... dfx=spark.sql(complex statement using df x-

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-03-31 Thread Joris Billen
Wed, 30 Mar 2022 at 17:41, Joris Billen <joris.bil...@bigindustries.be> wrote: Thanks for the answer, much appreciated! This forum is very useful :-) I didn't know the SparkContext stays alive. I guess this is eating up memory. The eviction means that it knows that it should clear some

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-03-30 Thread Joris Billen
, 2022, 10:16 AM Joris Billen <joris.bil...@bigindustries.be> wrote: Hi, I have a pyspark job submitted through spark-submit that does some heavy processing for 1 day of data. It runs with no errors. I have to loop over many days, so I run this spark job in a loop. I notice after a couple

loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-03-30 Thread Joris Billen
Hi, I have a pyspark job submitted through spark-submit that does some heavy processing for 1 day of data. It runs with no errors. I have to loop over many days, so I run this spark job in a loop. I notice after a couple of executions the memory is increasing on all worker nodes and eventually this

[spark executor error] container running from bad node-> exit code 134

2021-11-19 Thread Joris Billen
Hi, we are seeing this error: Job aborted due to stage failure: Task 0 in stage 1.0 failed 8...Reason: Container from a bad node: container_xxx on host: dev-yyy Exit status: 134 This post suggests this has to do with blacklisted nodes:

[spark streaming] how to connect to rabbitmq with spark streaming.

2021-10-04 Thread Joris Billen
Hi, I am looking for someone who has made a Spark streaming job that connects to RabbitMQ. There is a lot of documentation on how to make a connection with the Java API (like here: https://www.rabbitmq.com/api-guide.html#connecting), but I am looking for a recent working example for Spark streaming

Re: Why are in 1 stage most of my executors idle: are tasks within a stage dependent of each other?

2021-09-10 Thread Joris Billen
are > much larger than the other 79,9993 partitions. Spark completes the 73 > tasks while those 7 are running. I would check the size of the partitions. If > the 7 are much larger, I would try to use salting to rebalance the partitions. > > On 9/10/21, 10:22 AM, "Joris
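
The salting idea mentioned in this reply can be illustrated without a cluster. The sketch below is an assumption-laden toy: the row counts, the key names, `NUM_SALTS`, and the CRC-based `partition_of` function are made up for illustration and are not Spark's actual partitioner. It only shows why appending a salt to a hot key spreads its rows over several partitions.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical number of shuffle partitions
NUM_SALTS = 8       # hypothetical number of salt values per key


def partition_of(key: str) -> int:
    """Toy hash partitioner: a deterministic stand-in for a shuffle hash."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


# Skewed input: one hot key with 1000 rows, 20 other keys with 10 rows each.
rows = ["hot"] * 1000 + [f"key{i}" for i in range(20) for _ in range(10)]

# Without salting, every row of the hot key lands in the same partition.
plain = Counter(partition_of(key) for key in rows)

# With salting, a suffix splits each key into NUM_SALTS sub-keys.
salted = Counter(
    partition_of(f"{key}_{i % NUM_SALTS}") for i, key in enumerate(rows)
)

largest_plain = max(plain.values())
largest_salted = max(salted.values())
print("largest partition without salt:", largest_plain)
print("largest partition with salt:", largest_salted)
```

In a real join or aggregation the salt has to be undone afterwards, for example by aggregating per salted key first and then combining the partial results per original key.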

Why are in 1 stage most of my executors idle: are tasks within a stage dependent of each other?

2021-09-10 Thread Joris Billen
Dear community, I have a job that runs quite well for most stages: resources are consumed quite optimally (not much memory/vcores left idle). My cluster is managed and works well. I end up with 27 executors and have 2 cores for each, so can run 54 tasks. For many stages I see I have a high number of
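
The slot arithmetic in this message (27 executors with 2 cores each, so 54 concurrent tasks) can be written out as a small sketch. The `stage_waves` helper and the task count of 80 are hypothetical, and the calculation assumes one core per task, which is Spark's default for `spark.task.cpus`.

```python
import math


def stage_waves(num_executors, cores_per_executor, num_tasks):
    """Return (task slots, scheduling waves, idle slots in the last wave)."""
    slots = num_executors * cores_per_executor
    waves = math.ceil(num_tasks / slots)
    last_wave_tasks = num_tasks - (waves - 1) * slots
    idle_in_last_wave = slots - last_wave_tasks
    return slots, waves, idle_in_last_wave


# 27 executors x 2 cores from the message; 80 tasks is an assumed example.
slots, waves, idle = stage_waves(27, 2, 80)
print(f"{slots} slots, {waves} waves, {idle} slots idle in the last wave")
```

A stage whose task count is not a multiple of the slot count always leaves some cores idle in its last wave, and a few long-running tasks make that idle period longer.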