Re: Validate spark sql

2023-12-26 Thread Gourav Sengupta
Dear friend, thanks a ton; I was looking for linting for SQL for a long time, and it looks like https://sqlfluff.com/ is something that can be used :) Thank you so much, and I wish you all a wonderful new year. Regards, Gourav On Tue, Dec 26, 2023 at 4:42 AM Bjørn Jørgensen wrote: > You can try sqlfluff
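As a minimal sketch, sqlfluff can also be called from Python via its simple API; the "sparksql" dialect name and the exact keys in each violation dict are assumptions based on the sqlfluff docs, so check against your installed version:

```python
# Sketch: lint a Spark SQL string with sqlfluff's simple Python API.
# Assumes `pip install sqlfluff`; dialect name "sparksql" is an assumption.
import sqlfluff

violations = sqlfluff.lint("SELECT a , b FROM tbl ", dialect="sparksql")
for v in violations:
    # Each violation is a dict describing the rule hit and its position.
    print(v)
```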

Re: Contributing to Spark MLLib

2023-07-17 Thread Gourav Sengupta
Hi, Holden Karau has some fantastic videos on her channel which will be quite helpful. Thanks, Gourav On Sun, 16 Jul 2023, 19:15 Brian Huynh wrote: > Good morning Dipayan, > > Happy to see another contributor! > > Please go through this document for contributors. Please note the >

Re: [Spark Core] [Advanced] [How-to] How to map any external field to job ids spawned by Spark.

2022-12-28 Thread Gourav Sengupta
Hi Khalid, just out of curiosity, does the API help us in setting job IDs, or just job descriptions? Regards, Gourav Sengupta On Wed, Dec 28, 2022 at 10:58 AM Khalid Mammadov wrote: > There is a feature in SparkContext to set localProperties > (setLocalProperty) where you can set your R
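For context, a hedged sketch of the SparkContext hooks under discussion: setJobGroup/setJobDescription label jobs in the Spark UI (Spark still assigns its own numeric job IDs), and the custom localProperty key here is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Label all jobs scheduled from this thread in the Spark UI.
sc.setJobGroup("nightly-ingest", "Nightly ingest run 2022-12-28")
sc.setLocalProperty("externalRunId", "run-42")  # hypothetical custom key

spark.range(1000).count()  # this job appears under the group above
```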

Re: Profiling data quality with Spark

2022-12-27 Thread Gourav Sengupta
makes sense :) Regards, Gourav Sengupta On Wed, Dec 28, 2022 at 4:13 AM Sean Owen wrote: > I think this is kind of mixed up. Data warehouses are simple SQL > creatures; Spark is (also) a distributed compute framework. Kind of like > comparing maybe a web server to Java. > Are you think

Re: Profiling data quality with Spark

2022-12-27 Thread Gourav Sengupta
…ong, SPARK used to be great in 2016-2017, but there are superb alternatives now, and the industry, in this recession, should focus on getting more value for every single dollar it spends. Best of luck. Regards, Gourav Sengupta On Tue, Dec 27, 2022 at 7:30 PM Mich Talebzadeh wrote: > W

Re: Help with Shuffle Read performance

2022-09-29 Thread Gourav Sengupta
…standalone deployment (even > when run on the same k8s cluster) > > Sincerely, > > Leszek Reimus > > > On Thu, Sep 29, 2022 at 7:06 PM Gourav Sengupta > wrote: > >> Hi, >> >> don't containers finally run on systems, and the only advantage of >> contain

Re: Help with Shuffle Read performance

2022-09-29 Thread Gourav Sengupta
on containers as well, and in EMR running on EC2 nodes you can put all your binaries in containers and use those for running your jobs. Regards, Gourav Sengupta On Thu, Sep 29, 2022 at 7:46 PM Vladimir Prus wrote: > Igor, > > what exact instance types do you use? Unless you use local instance

Re: Help with Shuffle Read performance

2022-09-29 Thread Gourav Sengupta
Hi, why not use EMR or Dataproc? Kubernetes does not provide any benefit at all for such a scale of work. It is a classic case of over-engineering and over-complication just for the heck of it. Also, I think that in case you are in AWS, Redshift Spectrum or Athena for 90% of use cases are way

Re: Spark SQL

2022-09-15 Thread Gourav Sengupta
Okay, so fitting the problem to the solution: that is powerful. On Thu, 15 Sept 2022, 14:48 Mayur Benodekar wrote: > Hi Gourav, > > It’s the way the framework is > > Sent from my iPhone > > On Sep 15, 2022, at 02:02, Gourav Sengupta > wrote: > > Hi, >

Re: Spark SQL

2022-09-15 Thread Gourav Sengupta
Hi, why Spark and why Scala? Regards, Gourav On Wed, 7 Sept 2022, 21:42 Mayur Benodekar wrote: > I am new to Scala and Spark. > > I have code in Scala which executes queries in a while loop, one after the > other. > > What we need to do is, if a particular query takes more than a certain

Re: Pipelined execution in Spark (???)

2022-09-11 Thread Gourav Sengupta
Hi, for some tasks, such as repartitionByRange, it is indeed quite annoying sometimes to wait for the maps to complete before the reduce starts. @Sean Owen do you have any comments? Regards, Gourav Sengupta On Thu, Sep 8, 2022 at 12:10 AM Russell Jurney wrote: > I could be wrong, but… just st

Re: Profiling PySpark Pandas UDF

2022-08-29 Thread Gourav Sengupta
if people using PySpark and Python > UDFs find this proposed improvement useful. > > I see the proposed additional instrumentation as complementary to the > Python/Pandas UDF Profiler introduced in Spark 3.3. > > > > Best, > > Luca > > > > *From:* Abdeali Kothari

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Gourav Sengupta
Hi, maybe I am jumping to conclusions and making stupid guesses, but have you tried Koalas now that it is natively integrated with PySpark? Regards, Gourav On Thu, 25 Aug 2022, 11:07 Subash Prabanantham wrote: > Hi All, > > I was wondering if we have any best practices on using pandas UDFs?
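For reference, a minimal pandas UDF sketch in the Spark 3.x type-hint style; the column name and values are made up:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("double")
def f_to_c(f: pd.Series) -> pd.Series:
    # Executed on Arrow batches, so Python overhead is per batch, not per row.
    return (f - 32) * 5.0 / 9.0

df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
df.select(f_to_c("temp_f").alias("temp_c")).show()
```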

Re: Spark streaming

2022-08-20 Thread Gourav Sengupta
…please be aware. Are you in AWS? Please try DMS. If you are, then that might be the best solution, depending on what you are looking for, of course. If you are not in AWS, please let me know your environment, and I can help you out. Regards, Gourav Sengupta On Fri, Aug 19, 2022 at 1:13 PM sandra

Re: Spark with GPU

2022-08-13 Thread Gourav Sengupta
…or Redshift, or Snowflake, they get a lot more done with less overhead and fewer heartaches. I particularly like how native integration between ML systems like SageMaker works via Redshift queries, and Aurora Postgres - that is true unified data analytics at work. Regards, Gourav Sengupta

Re: log transfering into hadoop/spark

2022-08-02 Thread Gourav Sengupta
hi, I do it with simple bash scripts to transfer to S3. It takes less than 1 minute to write one, and another minute to include it in the bootstrap scripts. I never saw the need for so much hype for such simple tasks. Regards, Gourav Sengupta On Tue, Aug 2, 2022 at 2:16 PM ayan guha wrote: > ELK or Spl

Re: Use case idea

2022-08-01 Thread Gourav Sengupta
…defends the lack of support and direction in this matter largely, which is a joke. Thanks and Regards, Gourav Sengupta On Mon, Aug 1, 2022 at 4:54 AM pengyh wrote: > > I don't think so. We were using Spark integrated with Kafka for > streaming computing and realtime reports. That j

Re: Use case idea

2022-07-31 Thread Gourav Sengupta
…on. Thanks and Regards, Gourav Sengupta On Mon, Aug 1, 2022 at 1:58 AM pengyh wrote: > > I am afraid most SQL functions Spark has, the other BI tools also have. > > Spark is used for high-performance computing, not for SQL function > comparison. > > Thanks. >

Re: PySpark cores

2022-07-29 Thread Gourav Sengupta
Hi, I agree with the above response, but in case you are using Arrow and transferring data from the JVM to Python and back, then please try to check how things are getting executed in Python. Please let me know what processing you are trying to do while using Arrow. Regards, Gourav Sengupta
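As a sketch of the Arrow path referred to above (the config key is the documented Spark 3.x one; a running session is assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable Arrow so toPandas()/createDataFrame move columnar batches between
# the JVM and Python instead of pickling row by row.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pdf = spark.range(1_000_000).toPandas()  # transferred as Arrow batches
```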

Re: external table with parquet files: problem querying in sparksql since data is stored as integer while hive schema expects a timestamp

2022-07-24 Thread Gourav Sengupta
Hi, please try to query the table directly by loading the hive metastore (we can do that quite easily in AWS EMR, but we can do things quite easily with everything in AWS), rather than querying the s3 location directly. Regards, Gourav On Wed, Jul 20, 2022 at 9:51 PM Joris Billen wrote: >

Re: reading each JSON file from dataframe...

2022-07-13 Thread Gourav Sengupta
the SPARK system to read it as a string first or use 100% scanning of the files to have a full schema. Regards, Gourav Sengupta On Wed, Jul 13, 2022 at 12:41 AM Muthu Jayakumar wrote: > Hello Ayan, > > Thank you for the suggestion. But, I would lose correlation of the JSON > file wi
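A hedged sketch of the "read as a string first" approach described above, which also keeps the JSON-file correlation the poster asked about; the path and schema are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, from_json
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()
schema = StructType([StructField("id", StringType()),
                     StructField("name", StringType())])

raw = spark.read.text("s3://bucket/json/")          # hypothetical location
parsed = (raw.withColumn("source_file", input_file_name())
             .withColumn("data", from_json("value", schema)))
```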

Re: about cpu cores

2022-07-11 Thread Gourav Sengupta
Hi, please see Sean's answer, and please read about parallelism in Spark. Regards, Gourav Sengupta On Mon, Jul 11, 2022 at 10:12 AM Tufan Rakshit wrote: > so on average, for every 4 cores, you get back 3.6 cores in YARN, but you > can use only 3. > In Kubernetes you get back 3.6 and als

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-05 Thread Gourav Sengupta
Hi, SPARK is just one of the technologies out there now; there are several other technologies far outperforming SPARK, or at least as good as SPARK. Regards, Gourav On Sat, Jul 2, 2022 at 7:42 PM Sid wrote: > So as per the discussion, shuffle stage output is also stored on disk and > not in

Re: Need help with the configuration for AWS glue jobs

2022-06-23 Thread Gourav Sengupta
Please use EMR; Glue is not made for heavy processing jobs. On Thu, Jun 23, 2022 at 6:36 AM Sid wrote: > Hi Team, > > Could anyone help me with the below problem: > > https://stackoverflow.com/questions/72724999/how-to-calculate-number-of-g-1-workers-in-aws-glue-for-processing-1tb-data >

Re: input file size

2022-06-19 Thread Gourav Sengupta
Hi, just so that we understand the intention: why do you need to know the file size? Are you not using a splittable file format? If you use Spark Streaming to read the files, triggered just once, then you will be able to get the metadata of the files, I believe. Regards, Gourav Sengupta On Sun, Jun
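One hedged way to get at file metadata such as sizes from PySpark is via the JVM Hadoop FileSystem API; the path is hypothetical, and this leans on the internal `_jvm` gateway, so treat it as a sketch rather than a stable API:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._jvm

# List input files with their sizes through the Hadoop FileSystem API.
fs = jvm.org.apache.hadoop.fs.FileSystem.get(
    spark.sparkContext._jsc.hadoopConfiguration())
for status in fs.listStatus(jvm.org.apache.hadoop.fs.Path("/data/input")):
    print(status.getPath().getName(), status.getLen())
```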

Re: Redesign approach for hitting the APIs using PySpark

2022-06-13 Thread Gourav Sengupta
…tch_count) batch_id FROM test).repartitionByRange("batch_id").createOrReplaceTempView("test_batch") The above code should then be able to be run with a UDF, as long as we are able to control the parallelism with the help of the executor count and task CPU configuration. But once ag
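A hedged reconstruction of the truncated snippet above, assuming an active SparkSession `spark` and a registered table `test` with an `id` column (table, column, and batch count are hypothetical): number the rows, take a modulus to assign a batch_id, then range-partition on it so each task works through one batch:

```python
batch_count = 10  # hypothetical

batched = spark.sql(f"""
    SELECT *,
           MOD(ROW_NUMBER() OVER (ORDER BY id), {batch_count}) AS batch_id
    FROM test
""")
batched.repartitionByRange("batch_id").createOrReplaceTempView("test_batch")
```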

Re: Redesign approach for hitting the APIs using PySpark

2022-06-13 Thread Gourav Sengupta
…a simple Python program works quite well. Regards, Gourav On Mon, Jun 13, 2022 at 9:28 AM Sid wrote: > Hi Gourav, > > Do you have any examples or links, please? That would help me to > understand. > > Thanks, > Sid > > On Mon, Jun 13, 2022 at 1:42 PM Gourav Sengupta

Re: Redesign approach for hitting the APIs using PySpark

2022-06-13 Thread Gourav Sengupta
Hi, I think that serialising data using Spark is overkill; why not use normal Python? Also, have you tried repartitioning by range? That way you can use the modulus operator to batch things up. Regards, Gourav On Mon, Jun 13, 2022 at 8:37 AM Sid wrote: > Hi Team, > > I am trying to hit the POST

Re: Job migrated from EMR to Dataproc takes 20 hours instead of 90 minutes

2022-05-31 Thread Gourav Sengupta
Hi, just to elaborate on what Ranadip has correctly pointed out here: gzip files are read by only one executor, whereas a bzip2 file can be read by multiple executors, so their reading will be parallelised and faster. Try to use bzip2 for Kafka Connect. Regards, Gourav Sengupta On Mon
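The splittability point is easy to check empirically; a sketch with hypothetical file names and an assumed active session `spark`:

```python
# gzip is not splittable: one .gz file maps to a single partition/task.
# bzip2 is splittable: one large .bz2 file can fan out to many tasks.
gz = spark.read.text("events.json.gz")
bz = spark.read.text("events.json.bz2")
print(gz.rdd.getNumPartitions())  # typically 1 per gzip file
print(bz.rdd.getNumPartitions())  # can be > 1 for one large bzip2 file
```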

Re: Complexity with the data

2022-05-26 Thread Gourav Sengupta
Hi, can you please give us a simple map of what the input is and what the output should be like? From your description it is a bit difficult to figure out how exactly you want the records parsed. Regards, Gourav Sengupta On Wed, May 25, 2022 at 9:08 PM Sid wrote

Re: Problem with implementing the Datasource V2 API for Salesforce

2022-05-24 Thread Gourav Sengupta
Hi, in the spirit of not fitting the solution to the problem, would it not be better to first create a producer for your job and use a broker like Kafka or Kinesis or Pulsar? Regards, Gourav Sengupta On Sat, May 21, 2022 at 3:46 PM Rohit Pant wrote: > Hi all, > > I am trying to

Re: Spark error with jupyter

2022-05-04 Thread Gourav Sengupta
Hi, it looks like the Spark listener is not working. Is your session still running? Check the Spark UI to find out whether the session is still active or not. Regards, Gourav On Tue, May 3, 2022 at 7:37 PM Bjørn Jørgensen wrote: > I use jupyterlab and spark and I have not seen this before. > >

Re: structured streaming - checkpoint metadata growing indefinitely

2022-04-29 Thread Gourav Sengupta
Hi, this may not solve the problem, but have you tried stopping the job gracefully, and then restarting without much delay by pointing to a new checkpoint location? The approach will have certain uncertainties for scenarios where the source system can lose data, or we do not expect duplicates to be
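A sketch of the stop-and-restart pattern being suggested, with a stand-in rate source and hypothetical paths; as cautioned above, this is only safe when the source can replay and duplicates are handled downstream:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.readStream.format("rate").load()   # stand-in streaming source

query = (df.writeStream
           .format("parquet")
           .option("path", "/data/out")
           .option("checkpointLocation", "/chk/v2")  # fresh location
           .start())

# Later, from a control path: stop gracefully, then restart the job
# pointing at the new checkpointLocation above.
query.stop()
```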

Re: Dealing with large number of small files

2022-04-27 Thread Gourav Sengupta
Hi, did that result in valid JSON in the output file? Regards, Gourav Sengupta On Tue, Apr 26, 2022 at 8:18 PM Sid wrote: > I have .txt files with JSON inside it. It is generated by some API calls > by the Client. > > On Wed, Apr 27, 2022 at 12:39 AM Bjørn Jørgensen > w

Re: Dealing with large number of small files

2022-04-26 Thread Gourav Sengupta
Hi, what version of Spark are you using? And where is the data stored? I am not quite sure that just using a bash script will help: does concatenating all the files into a single file create valid JSON? Regards, Gourav On Tue, Apr 26, 2022 at 3:44 PM Sid wrote: > Hello, > > Can

Re: Streaming write to orc problem

2022-04-23 Thread Gourav Sengupta
…to the output location? Thanks and Regards, Gourav Sengupta On Fri, Apr 22, 2022 at 3:57 PM hsy...@gmail.com wrote: > Hello all, > > I’m just trying to build a pipeline reading data from a streaming source > and writing to an ORC file. But I don’t see any file that is written to the

Re: Question about bucketing and custom partitioners

2022-04-11 Thread Gourav Sengupta
Hi, have you checked the skew settings in SPARK 3.2? I am also not quite sure why you need a custom partitioner. While RDDs still remain a valid option, you must try to explore the recent ways of thinking and framing better solutions using SPARK. Regards, Gourav Sengupta On Mon, Apr 11, 2022 at 4:47
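For reference, the Spark 3.2 skew-join knobs alluded to above (all documented AQE settings; an active session `spark` is assumed):

```python
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Optional thresholds controlling what counts as a skewed partition:
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set(
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```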

Re: Spark 3.0.1 and spark 3.2 compatibility

2022-04-08 Thread Gourav Sengupta
Hi, absolutely agree with Sean; besides that, please see the release notes for the SPARK versions as well, as they do mention any issues around compatibility. Regards, Gourav On Thu, Apr 7, 2022 at 6:32 PM Sean Owen wrote: > (Don't cross post please) > Generally you definitely want to compile and

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-06 Thread Gourav Sengupta
Hi, super duper. Please try to see if you can write out the data to S3, and then write a load script to load that data from S3 to HBase. Regards, Gourav Sengupta On Wed, Apr 6, 2022 at 4:39 PM Joris Billen wrote: > HI, > thanks for your reply. > > > I believe I have found the

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-05 Thread Gourav Sengupta
+ 1 Thanks and Regards, Gourav Sengupta On Mon, Apr 4, 2022 at 10:51 AM Joris Billen wrote: > Clear-probably not a good idea. > > But a previous comment said “you are doing everything in the end in one > go”. > So this made me wonder: in case your only action is a write in the e

Re: [Spark SQL] Structured Streaming in python can connect to cassandra?

2022-03-25 Thread Gourav Sengupta
and run. Regards, Gourav Sengupta On Fri, Mar 25, 2022 at 1:19 PM Alex Ott wrote: > You don't need to use foreachBatch to write to Cassandra. You just need to > use Spark Cassandra Connector version 2.5.0 or higher - it supports native > writing of stream data into Cassandra.
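A hedged sketch of the native streaming write Alex describes, assuming the Spark Cassandra Connector (>= 2.5.0) is on the classpath, a streaming DataFrame `df`, and hypothetical keyspace, table, and checkpoint path:

```python
(df.writeStream
   .format("org.apache.spark.sql.cassandra")
   .option("checkpointLocation", "/chk/cassandra")
   .option("keyspace", "ks")
   .option("table", "events")
   .start())
```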

Re: Continuous ML model training in stream mode

2022-03-17 Thread Gourav Sengupta
that set me into data science and its applications. Thanks Sean! :) Regards, Gourav Sengupta On Tue, Mar 15, 2022 at 9:39 PM Artemis User wrote: > Thanks Sean! Well, it looks like we have to abandon our structured > streaming model to use DStream for this, or do you see possibility

Re: Question on List to DF

2022-03-16 Thread Gourav Sengupta
Hi Jayesh, thanks, I found your email quite interesting :) Regards, Gourav On Wed, Mar 16, 2022 at 8:02 AM Bitfox wrote: > Thank you, that makes sense. > > On Wed, Mar 16, 2022 at 2:03 PM Lalwani, Jayesh > wrote: > >> The toDF function in scala uses a bit of Scala magic that allows you to >>

Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

2022-03-07 Thread Gourav Sengupta
…ere) == count(partition_column), but this may not work for complex queries. > > Regards > Saurabh > -- > From: Gourav Sengupta > Sent: 05 March 2022 11:06 > To: Saurabh Gulati > Cc: Mich Talebzadeh; Kidong Lee < > mykid..

Re: {EXT} Re: Spark Parquet write OOM

2022-03-05 Thread Gourav Sengupta
…And the number-of-records-per-file configuration should be mentioned in the following link, as maxRecordsPerFile or something like that: https://spark.apache.org/docs/latest/configuration.html#runtime-sql-configuration . Regards, Gourav Sengupta On Sat, Mar 5, 2022 at 5:09 PM Anil Dasari wrote
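The setting referred to is spark.sql.files.maxRecordsPerFile; it can also be passed per write. A sketch, assuming a DataFrame `df` and a hypothetical output path:

```python
(df.write
   .option("maxRecordsPerFile", 1_000_000)  # cap rows per output file
   .parquet("/data/out"))
```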

Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

2022-03-05 Thread Gourav Sengupta
…into it, and kindly let me know if there is something blocking me; I will be sincerely obliged. Regards, Gourav Sengupta On Tue, Feb 22, 2022 at 3:58 PM Saurabh Gulati wrote: > Hey Mich, > We use spark 3.2 now. We are using BQ but migrating away because: > >- It's not reflective of

Re: {EXT} Re: Spark Parquet write OOM

2022-03-05 Thread Gourav Sengupta
…Regards > > From: Gourav Sengupta > Date: Thursday, March 3, 2022 at 2:24 AM > To: Anil Dasari > Cc: Yang,Jie(INF), user@spark.apache.org < > user@spark.apache.org> > Subject: Re: {EXT} Re: Spark Parquet write OOM > > Hi, >

Re: {EXT} Re: Spark Parquet write OOM

2022-03-03 Thread Gourav Sengupta
…suggested). Let me know how things are going on your end. Regards, Gourav Sengupta On Thu, Mar 3, 2022 at 8:37 AM Anil Dasari wrote: > Answers in the context. Thanks. > > From: Gourav Sengupta > Date: Thursday, March 3, 2022 at 12:13 AM > To: Anil Dasari > C

Re: {EXT} Re: Spark Parquet write OOM

2022-03-03 Thread Gourav Sengupta
Sengupta On Wed, Mar 2, 2022 at 11:25 PM Anil Dasari wrote: > 2nd attempt.. > > > > Any suggestions to troubleshoot and fix the problem ? thanks in advance. > > > > Regards, > > Anil > > > > *From: *Anil Dasari > *Date: *Wednesday, March 2, 2022 a

Re: Spark Parquet write OOM

2022-03-02 Thread Gourav Sengupta
…Is your pipeline going to change or evolve soon, or are the data volumes going to vary, or particularly increase, over time? 4. What is the memory that you have in your executors and drivers? 5. Can you show the list of transformations that you are running? Regards, Gourav Sengupta On Wed

Re: can dataframe API deal with subquery

2022-03-01 Thread Gourav Sengupta
Hi, why would you want to do that? Regards, Gourav On Sat, Feb 26, 2022 at 8:00 AM wrote: > such as this table definition: > > > desc people; > +---+---+--+ > | col_name | data_type | comment | >

Re: StructuredStreaming error - pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist

2022-02-28 Thread Gourav Sengupta
…RocksDB: it was introduced by Tathagata Das a few years ago in the Databricks version, and it has now been made available in the open-source version; it really works well. Let me know how things go, and what your final solution was. Regards, Gourav Sengupta On Mon, Feb 28, 2022 at 6:02 AM karan
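A sketch of enabling the RocksDB state store mentioned above (available in open-source Spark from 3.2; an active session `spark` is assumed):

```python
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state."
    "RocksDBStateStoreProvider",
)
```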

Re: StructuredStreaming error - pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist

2022-02-26 Thread Gourav Sengupta
Hi, maybe the purpose of the article is different, but instead of: sources (trail files) --> kafka --> flume --> write to cloud storage -->> SSS, a much simpler solution is: sources (trail files) --> write to cloud storage -->> SSS. Putting additional components and hops in just does sound a bit

Re: StructuredStreaming error - pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist

2022-02-26 Thread Gourav Sengupta
Hi, can you please let us know: 1. the SPARK version, and the kind of streaming query that you are running? 2. whether you are using at-least-once, at-most-once, or exactly-once semantics? 3. any additional details that you can provide regarding the storage duration in Kafka, etc.? 4. are you running

Re: How to gracefully shutdown Spark Structured Streaming

2022-02-26 Thread Gourav Sengupta
Dear Mich, a super duper note of thanks, I had to spend around two weeks to figure this out :) Regards, Gourav Sengupta On Sat, Feb 26, 2022 at 10:43 AM Mich Talebzadeh wrote: > > > On Mon, 26 Apr 2021 at 10:21, Mich Talebzadeh > wrote: > >> >> Spark Structured

Re: Non-Partition based Workload Distribution

2022-02-25 Thread Gourav Sengupta
Hi, not quite sure here, but can you please share your code? Regards, Gourav Sengupta On Thu, Feb 24, 2022 at 8:25 PM Artemis User wrote: > We got a Spark program that iterates through a while loop on the same > input DataFrame and produces different results per iteration. I see >

Re: Structured Streaming + UDF - logic based on checking if a column is present in the Dataframe

2022-02-25 Thread Gourav Sengupta
Hi, can you please let us know the following: 1. the spark version 2. a few samples of input data 3. a few samples of what is the expected output that you want Regards, Gourav Sengupta On Wed, Feb 23, 2022 at 8:43 PM karan alang wrote: > Hello All, > > I'm using StructuredStreamin

Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-24 Thread Gourav Sengupta
…opinion should be fine, I think. Just as, in spite of having pandas UDFs, we went for Koalas, similarly SPARK-native integrations which are lightweight and easy to use and extend to deep learning frameworks perhaps make sense, according to me. Regards, Gourav Sengupta On Thu

Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-24 Thread Gourav Sengupta
…then what do we do? Because creating professional-quality data loaders is a very big job, these solutions try to occupy that space as an entry point. Regards, Gourav Sengupta On Thu, Feb 24, 2022 at 1:21 PM Bitfox wrote: > I have been using tensorflow for a long time, it's not h

Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-24 Thread Gourav Sengupta
…Regards, Gourav Sengupta On Wed, Feb 23, 2022 at 4:42 PM Dennis Suhari wrote: > Currently we are trying AnalyticsZoo and Ray > > Sent from my iPhone > > On 23.02.2022 at 04:53, Bitfox wrote: > > tensorflow itself can implement the distributed computing via a

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Gourav Sengupta
Hi, this looks like a very specific and exact problem in its scope. Do you think that you can load the data into a pandas dataframe and load it back into SPARK using a pandas UDF? Koalas is now natively integrated with SPARK; try to see if you can use those features. Regards, Gourav On Wed, Feb 23,
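A minimal sketch of the pandas round trip being suggested; it assumes openpyxl is installed for .xlsx support, and the file name is hypothetical:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pdf = pd.read_excel("report.xlsx")   # pandas does the Excel parsing
sdf = spark.createDataFrame(pdf)     # hand the result back to Spark
```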

Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-23 Thread Gourav Sengupta
between Ray and SPARK. Regards, Gourav Sengupta On Wed, Feb 23, 2022 at 12:35 PM Sean Owen wrote: > Spark does do distributed ML, but not Tensorflow. Barrier execution mode > is an element that things like Horovod uses. Not sure what you are getting > at? > Ray is not Spark. > As I

Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-23 Thread Gourav Sengupta
so, and achieve that :) I would sincerely request the open source SPARK community to prioritise building the SPARK capabilities to scale ML applications. Thanks and Regards, Gourav Sengupta On Wed, Feb 23, 2022 at 3:53 AM Bitfox wrote: > tensorflow itself can implement the distribu

Re: Spark Explain Plan and Joins

2022-02-21 Thread Gourav Sengupta
…triggering the action of query execution, whether you are using SPARK DataFrames or SPARK SQL, the settings in SPARK (look at the settings for SPARK 3.x), and a few other aspects, you will see that the plan is quite cryptic and difficult to read sometimes. Regards, Gourav Sengupta On Sun, Feb 20

Re: Spark Explain Plan and Joins

2022-02-20 Thread Gourav Sengupta
automate things. Reading how to understand the plans may be good depending on what you are trying to do. Regards, Gourav Sengupta On Sat, Feb 19, 2022 at 10:00 AM Sid Kal wrote: > I wrote a query like below and I am trying to understand its query > execution plan. > > >>&

Re: Cast int to string not possible?

2022-02-18 Thread Gourav Sengupta
Hi Rico, using SQL saves a lot of time, effort, and budget over the long term. But I guess that there are certain joys in solving self-induced complexities. Thanks for sharing your findings. Regards, Gourav Sengupta On Fri, Feb 18, 2022 at 7:26 AM Rico Bergmann wrote: > I found the rea

Re: StructuredStreaming - foreach/foreachBatch

2022-02-17 Thread Gourav Sengupta
. Regards, Gourav Sengupta On Wed, Feb 9, 2022 at 8:51 PM karan alang wrote: > Thanks, Mich .. will check it out > > regds, > Karan Alang > > On Tue, Feb 8, 2022 at 3:06 PM Mich Talebzadeh > wrote: > >> BTW you can check this Linkedin article of mine on Processing Cha

Re: Cast int to string not possible?

2022-02-17 Thread Gourav Sengupta
Hi, can you please post a screenshot of the exact CAST statement that you are using? Did you use the SQL method mentioned by me earlier? Regards, Gourav Sengupta On Thu, Feb 17, 2022 at 12:17 PM Rico Bergmann wrote: > hi! > > Casting another int column that is not a partition col

Re: Cast int to string not possible?

2022-02-17 Thread Gourav Sengupta
Hi, this appears interesting; casting INT to STRING has never been an issue for me. Can you just help us with the output of df.printSchema()? I prefer to use SQL, and the method I use for casting is: CAST(<> AS STRING) <>. Regards, Gourav On Thu, Feb 17, 2022 at 6:02 AM Rico Bergmann
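The SQL casting pattern described, sketched end to end (column names hypothetical, active session `spark` assumed):

```python
df = spark.range(5).withColumnRenamed("id", "num")
df.createOrReplaceTempView("t")
# Prints a schema where num_str comes back as string.
spark.sql("SELECT CAST(num AS STRING) AS num_str FROM t").printSchema()
```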

Re: Implementing circuit breaker pattern in Spark

2022-02-16 Thread Gourav Sengupta
…d them to run economically, with security, costs, and other implications, for at least 3 to 4 years. There is an old saying: do not fit the solution to the problem. Maybe I do not understand the problem, and am therefore saying all the wrong things :) Regards, Gourav Sengupta On Wed, Feb 16, 2022 at 3:31 P

Re: Which manufacturers' GPUs support Spark?

2022-02-16 Thread Gourav Sengupta
…the GPUs work fantastically well. Regards, Gourav Sengupta On Wed, Feb 16, 2022 at 1:09 PM Sean Owen wrote: > Spark itself does not use GPUs, and is agnostic to what GPUs exist on a > cluster, scheduled by the resource manager, and used by an application. > In practice, virtually all GP

Re: Implementing circuit breaker pattern in Spark

2022-02-16 Thread Gourav Sengupta
Hi, once again, just trying to understand the problem first: why are we using SPARK to place calls to microservices? There are several reasons why this should never happen, including costs, security, and scalability concerns, etc. Is there a way that you can create a producer and put the data into

Re: Deploying Spark on Google Kubernetes (GKE) autopilot, preliminary findings

2022-02-14 Thread Gourav Sengupta
Hi, sorry in case it appeared otherwise: Mich's takes are super interesting. It is just that when applying solutions in commercial undertakings, things are quite different from research/development scenarios. Regards, Gourav Sengupta On Mon, Feb 14, 2022 at 5:02 PM ashok34

Re: Deploying Spark on Google Kubernetes (GKE) autopilot, preliminary findings

2022-02-14 Thread Gourav Sengupta
Hi, I would still not build any custom solution, and if in GCP, I would use serverless Dataproc. I think that it is always better to be hands-on with AWS Glue before commenting on it. Regards, Gourav Sengupta On Mon, Feb 14, 2022 at 11:18 AM Mich Talebzadeh wrote: > Good question. However, we ou

Re: Deploying Spark on Google Kubernetes (GKE) autopilot, preliminary findings

2022-02-13 Thread Gourav Sengupta
use cloud - to reduce operational costs. Sorry, just trying to understand what is the scope of this work. Regards, Gourav Sengupta On Fri, Feb 11, 2022 at 8:35 PM Mich Talebzadeh wrote: > The equivalent of Google GKE autopilot > <https://cloud.google.com/kubernetes-engine/docs/concepts/

Re: Unable to access Google buckets using spark-submit

2022-02-12 Thread Gourav Sengupta
Hi, I agree with Holden; I have faced quite a few issues with FUSE. Also trying to understand "spark-submit from local": are you submitting your SPARK jobs from a local laptop, or in local mode from a GCP Dataproc system? If you are submitting the job from your local laptop, there will be

Re: Unable to force small partitions in streaming job without repartitioning

2022-02-12 Thread Gourav Sengupta
hi, Did you try sorting while writing out the data? All of this engineering may not be required in that case. Regards, Gourav Sengupta On Sat, Feb 12, 2022 at 8:42 PM Chris Coutinho wrote: > Setting the option in the cluster configuration solved the issue, and now > we'r

Re: Unable to force small partitions in streaming job without repartitioning

2022-02-11 Thread Gourav Sengupta
…reading its settings. Regards, Gourav Sengupta On Fri, Feb 11, 2022 at 6:00 PM Adam Binford wrote: > Writing to Delta might not support the write.option method. We set > spark.hadoop.parquet.block.size in our spark config for writing to Delta. > > Adam > > On Fri, Feb 11, 2022

Re: Using Avro file format with SparkSQL

2022-02-11 Thread Gourav Sengupta
Hi Anna, Avro libraries should be built into SPARK, in case I am not wrong. Any particular reason why you are using a deprecated or soon-to-be-deprecated version of SPARK? SPARK 3.2.1 is fantastic. Please do let us know about your setup if possible. Regards, Gourav Sengupta On Thu, Feb 10

Re: data size exceeds the total ram

2022-02-11 Thread Gourav Sengupta
, and there are different ways to manage that depending on the SPARK version. Thanks and Regards, Gourav Sengupta On Fri, Feb 11, 2022 at 11:09 AM frakass wrote: > Hello list > > I have imported the data into spark and I found there is disk IO in > every node. The memory didn't

Re: data size exceeds the total ram

2022-02-11 Thread Gourav Sengupta
Hi, just so that we understand the problem first: what is the source data (is it JSON, CSV, Parquet, etc.)? Where are you reading it from (JDBC, file, etc.)? What is the compression format (GZ, BZIP, etc.)? What is the SPARK version that you are using? Thanks and Regards, Gourav Sengupta On Fri

Re: add an auto_increment column

2022-02-08 Thread Gourav Sengupta
Hi, so do you want to rank apple and tomato both as 2? I am not quite clear on the use case here, though. Regards, Gourav Sengupta On Tue, Feb 8, 2022 at 7:10 AM wrote: > > Hello Gourav > > > As you see here, orderBy has already given the solution for "equal

Re: add an auto_increment column

2022-02-07 Thread Gourav Sengupta
are trying to achieve by the rankings? Regards, Gourav Sengupta On Tue, Feb 8, 2022 at 4:22 AM ayan guha wrote: > For this req you can rank or dense rank. > > On Tue, 8 Feb 2022 at 1:12 pm, wrote: > >> Hello, >> >> For this query: >> >> >>&
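A sketch of the rank/dense_rank suggestion: with dense_rank, ties share a rank and no numbers are skipped (sample data hypothetical):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import dense_rank

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("orange", 1), ("apple", 2), ("tomato", 2), ("banana", 3)],
    ["fruit", "amount"])

w = Window.orderBy("amount")
df.withColumn("rnk", dense_rank().over(w)).show()
# apple and tomato both get rnk = 2; banana gets 3, not 4
```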

Re: add an auto_increment column

2022-02-07 Thread Gourav Sengupta
…records multiple times in a table, and still have different values? I think that without knowing the requirements, all the above responses, like everything else where solutions are reached before understanding the problem, have high chances of being wrong. Regards, Gourav Sengupta On Mon, Feb 7, 2022

Re: A Persisted Spark DataFrame is computed twice

2022-02-01 Thread Gourav Sengupta
in the data of the filters first. Regards, Gourav Sengupta On Mon, Jan 31, 2022 at 8:00 AM Benjamin Du wrote: > I don't think coalesce (by repartitioning I assume you mean coalesce) > itself and deserialising takes that much time. To add a little bit more > context, the computation of the

Re: [Spark UDF]: Where does UDF stores temporary Arrays/Sets

2022-01-30 Thread Gourav Sengupta
are not actually solving the problem and just addressing the issue. Regards, Gourav Sengupta On Wed, Jan 26, 2022 at 4:07 PM Sean Owen wrote: > Really depends on what your UDF is doing. You could read 2GB of XML into > much more than that as a DOM representation in memory. > Remember 15GB of

Re: A Persisted Spark DataFrame is computed twice

2022-01-30 Thread Gourav Sengupta
to read the difference between repartition and coalesce before making any kind of assumptions. Regards, Gourav Sengupta On Sun, Jan 30, 2022 at 8:52 AM Sebastian Piu wrote: > It's probably the repartitioning and deserialising the df that you are > seeing take time. Try doing this >
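The difference in one sketch (active session `spark` assumed): repartition always shuffles to the requested count, while coalesce only merges existing partitions without a shuffle and so cannot increase the count:

```python
df = spark.range(1_000_000)
print(df.repartition(200).rdd.getNumPartitions())  # 200, full shuffle
print(df.coalesce(4).rdd.getNumPartitions())       # at most 4, no shuffle
```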

Re: How to delete the record

2022-01-30 Thread Gourav Sengupta
? There is a third option, which is akin to the second option that Mich was mentioning, and that is basically a database transaction log, which gets very large, very expensive to store and query over a period of time. Are you creating a database transaction log? Thanks and Regards, Gourav Sengupta On Thu, Jan 27

Re: how can I remove the warning message

2022-01-30 Thread Gourav Sengupta
warnings in spark-shell using the Logger.getLogger("akka").setLevel(Level.OFF) in case I have not completely forgotten. Other details are mentioned here: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.setLogLevel.html Regards, Gourav Sengupta On Fri, Ja
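Both approaches mentioned above, sketched for PySpark; the log4j call goes through the internal `_jvm` gateway, so treat it as a sketch:

```python
spark.sparkContext.setLogLevel("ERROR")  # documented API

# JVM-side log4j equivalent, reachable from PySpark:
log4j = spark.sparkContext._jvm.org.apache.log4j
log4j.LogManager.getLogger("akka").setLevel(log4j.Level.OFF)
```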

Re: Kafka to spark streaming

2022-01-30 Thread Gourav Sengupta
Hi Amit, before answering your question, I am just trying to understand it. I am not exactly clear on how the Akka application, Kafka, and the SPARK Streaming application sit together, and what you are exactly trying to achieve. Can you please elaborate? Regards, Gourav On Fri, Jan 28, 2022 at

Re: Small optimization questions

2022-01-28 Thread Gourav Sengupta
…tasks to take care of memory. We do not have any other data regarding your clusters or environments; therefore it is difficult to imagine things and provide more information. Regards, Gourav Sengupta On Thu, Jan 27, 2022 at 12:58 PM Aki Riisiö wrote: > Ah, sorry for spamming, I found the ans

Re: [Spark ML Pipeline]: Error Loading Pipeline Model with Custom Transformer

2022-01-12 Thread Gourav Sengupta
Hi, maybe I have less time, but can you please add some inline comments in your code to explain what you are trying to do? Regards, Gourav Sengupta On Tue, Jan 11, 2022 at 5:29 PM Alana Young wrote: > I am experimenting with creating and persisting ML pipelines using custom > transf

Re: pyspark loop optimization

2022-01-11 Thread Gourav Sengupta
…of the dataframe in each iteration to understand the effect of your loops on the explain plan; that should give some details. Regards, Gourav Sengupta On Mon, Jan 10, 2022 at 10:49 PM Ramesh Natarajan wrote: > I want to compute cume_dist on a bunch of columns in a spark dataframe, > but want to
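A sketch of printing the plan on each iteration to watch it grow; the DataFrame `df` and the column list are hypothetical, and cume_dist matches the thread's use case:

```python
from pyspark.sql import Window
import pyspark.sql.functions as F

for i, c in enumerate(["a", "b"]):  # hypothetical column list
    df = df.withColumn(c + "_cume", F.cume_dist().over(Window.orderBy(c)))
    print(f"--- iteration {i} ---")
    df.explain(mode="formatted")    # Spark 3.x formatted plan output
```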

Re: How to add a row number column with out reordering my data frame

2022-01-11 Thread Gourav Sengupta
…start = i * numRows > > end = start + numRows > > print("\ni:{} start:{} end:{}".format(i, start, end)) > > df = trainDF.iloc[start:end] > > There does not seem to be an easy way to do this. > > https://spark.apache.org/docs/lates

Re: How to add a row number column with out reordering my data frame

2022-01-10 Thread Gourav Sengupta
Hi, I am a bit confused here; it is not entirely clear to me why you are creating the row numbers, and how creating the row numbers helps you with the joins. Can you please explain with some sample data? Regards, Gourav On Fri, Jan 7, 2022 at 1:14 AM Andrew Davidson wrote: > Hi > > > > I am

Re: hive table with large column data size

2022-01-10 Thread Gourav Sengupta
-ref-datatypes.html. Parquet is definitely a columnar format, and if I am not entirely wrong, it definitely supports columnar reading of data by default in SPARK. Regards, Gourav Sengupta On Sun, Jan 9, 2022 at 2:34 PM weoccc wrote: > Hi , > > I want to store binary data (such as images)

Re: pyspark

2022-01-06 Thread Gourav Sengupta
Hi, I am not sure at all that we need to use SQLContext and HiveContext anymore. Can you please check your JAVA_HOME and SPARK_HOME? I use the findspark library to enable all the spark-related environment variables for me, or use conda to install pyspark from conda-forge. Regards, Gourav Sengupta
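The findspark approach mentioned, in brief:

```python
import findspark
findspark.init()  # locates SPARK_HOME and wires pyspark onto sys.path

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
```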

Re: How to make batch filter

2022-01-02 Thread Gourav Sengupta
….rdd.getNumPartitions() 10 Please do refer to the following page for adaptive SQL execution in SPARK 3; it will be of massive help, particularly in case you are handling skewed joins: https://spark.apache.org/docs/latest
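Alongside the skew-join handling covered on that page, a sketch of the partition-coalescing AQE settings that pair with the getNumPartitions() check above (active session `spark` assumed):

```python
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# After a shuffle, AQE can merge small partitions; compare
# df.rdd.getNumPartitions() before and after enabling these.
```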

Re: Pyspark debugging best practices

2021-12-28 Thread Gourav Sengupta
Hi Andrew, any chance you might give Databricks a try on GCP? The above transformations look complicated to me; why are you adding dataframes to a list? Regards, Gourav Sengupta On Sun, Dec 26, 2021 at 7:00 PM Andrew Davidson wrote: > Hi > > > > I am having trouble debu

Re: Dataframe's storage size

2021-12-24 Thread Gourav Sengupta
> On Fri, Dec 24, 2021, 4:54 AM Gourav Sengupta > wrote: > >> Hi, >> >> This question, once again like the last one, does not make much sense at >> all. Where are you trying to store the data frame, and how? >> >> Are you just trying to write a blog, as
