Re: Reading too many files

2022-10-03 Thread Sid
Are you trying to run on cloud? On Mon, 3 Oct 2022, 21:55 Sachit Murarka, wrote: > Hello, > > I am reading too many files in Spark 3.2 (Parquet). It is not giving any > error in the logs. But after spark.read.parquet, it is not able to proceed > further. > Can anyone please suggest if there
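A hedged sketch of one common mitigation when spark.read.parquet stalls on a very large number of files: supply the schema up front so Spark can skip footer-based schema inference (the file-listing step can still be slow, and the path and schema below are illustrative, not from the thread):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# Illustrative schema; replace with the real one for the dataset.
schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
])

# With an explicit schema, Spark does not need to open Parquet footers to infer one,
# which removes one source of delay when the input consists of very many files.
df = spark.read.schema(schema).parquet("s3://some-bucket/many-files/")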

Re: Splittable or not?

2022-09-19 Thread Sid
if your data), in case those files are big. > > Enrico > > > On 14.09.22 at 21:57, Sid wrote: > > Okay so you mean to say that parquet compresses the denormalized data > using snappy so it won't affect the performance. > > Only using snappy will affect the performa

Re: Splittable or not?

2022-09-14 Thread Sid
Okay so you mean to say that parquet compresses the denormalized data using snappy so it won't affect the performance. Only using snappy will affect the performance. Am I correct? On Thu, 15 Sep 2022, 01:08 Amit Joshi, wrote: > Hi Sid, > > Snappy itself is not splittable. But t

Re: Long running task in spark

2022-09-14 Thread Sid
Try spark.driver.maxResultSize=0 On Mon, 12 Sep 2022, 09:46 rajat kumar, wrote: > Hello Users, > > My 2 tasks are running forever. One of them gave a java heap space error. > I have 10 joins, all tables are big. I understand this is data skewness. > Apart from changes at the code level, any
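For the skew itself (spark.driver.maxResultSize only governs results collected back to the driver), one hedged option on Spark 3.x is to let Adaptive Query Execution split oversized join partitions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# AQE can detect skewed shuffle partitions and split them at join time (Spark 3.x).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")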

Splittable or not?

2022-09-14 Thread Sid
part*.snappy.parquet. So, when I read this data, will it affect my performance? Please help me if there is any understanding gap. Thanks, Sid

Joins internally

2022-08-11 Thread Sid
that Table A has two partitions A & B where data is hashed based on hash value and sorted within partitions. So my question is: how does it come to know which partition from Table A has to be joined or searched with which partition from Table B? TIA, Sid

Re: Salting technique doubt

2022-08-04 Thread Sid
Hi Everyone, Thanks a lot for your answers. It helped me a lot to clear the concept :) Best, Sid On Mon, Aug 1, 2022 at 12:17 AM Vinod KC wrote: > Hi Sid, > This example code with output will add some more clarity > > spark-shell --conf spark.sql.shuffle.partiti

Re: Salting technique doubt

2022-07-31 Thread Sid
are segregated in different partitions based on unique keys, they are not matching because x_1/x_2 != x_8/x_9. How do you ensure that the results are matched? Best, Sid On Sun, Jul 31, 2022 at 1:34 AM Amit Joshi wrote: > Hi Sid, > > Salting is normally a technique to add random characters to
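A minimal PySpark sketch of the salting idea being discussed; the data and column names are illustrative, not from the thread. The results still match because every salted variant of a key on the big side has a replicated counterpart on the small side:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
N = 10  # number of salt buckets, tuned to the degree of skew

big_df = spark.createDataFrame([("k1", 1), ("k1", 2), ("k2", 3)], ["join_key", "value"])
small_df = spark.createDataFrame([("k1", "x"), ("k2", "y")], ["join_key", "attr"])

# Skewed (big) side: spread each hot key across N salted variants.
big_salted = big_df.withColumn("salt", (F.rand() * N).cast("int"))

# Other side: replicate each row once per salt value so every variant finds its match.
small_salted = small_df.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(N)])))

joined = big_salted.join(small_salted, on=["join_key", "salt"], how="inner")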

Salting technique doubt

2022-07-30 Thread Sid
understanding gap? Could anyone help me in layman's terms? TIA, Sid

How use pattern matching in spark

2022-07-13 Thread Sid
on RDDs. Just the validation part. Therefore, I wanted to understand: is there any better way to achieve this? Thanks, Sid

Re: How reading works?

2022-07-13 Thread Sid
Yeah, I understood that now. Thanks for the explanation, Bjorn. Sid On Wed, Jul 6, 2022 at 1:46 AM Bjørn Jørgensen wrote: > Ehh.. What is "*duplicate column*" ? I don't think Spark supports that. > > duplicate column = duplicate rows > > > tir. 5. jul. 2022 kl.

Re: How reading works?

2022-07-05 Thread Sid
Hi Team, I still need help in understanding how reading works exactly? Thanks, Sid On Mon, Jun 20, 2022 at 2:23 PM Sid wrote: > Hi Team, > > Can somebody help? > > Thanks, > Sid > > On Sun, Jun 19, 2022 at 3:51 PM Sid wrote: > >> Hi, >> >> I alrea

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread Sid
in >> reality spark does store intermediate results on disk, just in less places >> than MR >> >> --- Original Message --- >> On Saturday, July 2nd, 2022 at 5:27 PM, Sid >> wrote: >> >> I have explained the same thing in a very layman's t

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread Sid
I have explained the same thing in very layman's terms. Go through it once. On Sat, 2 Jul 2022, 19:45 krexos, wrote: > > I think I understand where Spark saves IO. > > in MR we have map -> reduce -> map -> reduce -> map -> reduce ... > > which writes results to disk at the end of each such

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread Sid
enough memory to store intermediate results) on average involves 3 times less disk I/O, i.e., only while reading the data and writing the final data to the disk. Thanks, Sid On Sat, 2 Jul 2022, 17:58 krexos, wrote: > Hello, > > One of the main "selling points" of Spark is tha

Re: Glue is serverless? how?

2022-06-28 Thread Sid
> https://en.m.wikipedia.org/wiki/Serverless_computing > > > > > > > > Sun, 26 Jun 2022, 10:26, Sid wrote: > > > > > Hi Team, > > > > > > I am developing a Spark job in Glue and have read that Glue is > serverless. > > > I know that u

Understanding about joins in spark

2022-06-27 Thread Sid
here would the join operation be performed? On another worker node, or is it performed on the driver side? Somebody, please help me to understand this by correcting me w.r.t. my points or just adding an explanation to it. TIA, Sid

Glue is serverless? how?

2022-06-26 Thread Sid
is managed by Glue and we don't have to worry about the underlying infrastructure? Please help me to understand in layman's terms. Thanks, Sid

Re: Spark Doubts

2022-06-25 Thread Sid
is called it is submitted to the DAG scheduler and so on... Thanks, Sid On Sat, Jun 25, 2022 at 12:34 PM Tufan Rakshit wrote: > Please find the answers inline please . > 1) Can I apply predicate pushdown filters if I have data stored in S3 or > it should be used only while reading from DBs?
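A small sketch of checking predicate pushdown against files in S3 (the path and column are illustrative); the point is that pushdown is a property of the file-format reader, not of a database:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://some-bucket/some-table/")   # illustrative path
filtered = df.filter(F.col("country") == "IN")            # illustrative column

# The physical plan should list the predicate under PushedFilters, meaning the
# Parquet reader can skip row groups using column statistics even on S3.
filtered.explain(True)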

Spark Doubts

2022-06-25 Thread Sid
opies the data in the memory and then transfers to another partition". Is it correct? If not, please correct me. Please help me to understand these things in layman's terms if my assumptions are not correct. Thanks, Sid

Re: Need help with the configuration for AWS glue jobs

2022-06-23 Thread Sid
Where can I find information on the size of the datasets supported by AWS Glue? I didn't see it in the documentation. Also, if I want to process TBs of data, e.g. 1 TB, what should be the ideal EMR cluster configuration? Could you please guide me on this? Thanks, Sid. On Thu, 23 Jun 2022, 23:44

Need help with the configuration for AWS glue jobs

2022-06-22 Thread Sid
Hi Team, Could anyone help me in the below problem: https://stackoverflow.com/questions/72724999/how-to-calculate-number-of-g-1-workers-in-aws-glue-for-processing-1tb-data Thanks, Sid

Re: Will it lead to OOM error?

2022-06-22 Thread Sid
Thanks all for your answers. Much appreciated. On Thu, Jun 23, 2022 at 6:07 AM Yong Walt wrote: > We have many cases like this. it won't cause OOM. > > Thanks > > On Wed, Jun 22, 2022 at 8:28 PM Sid wrote: > >> I have a 150TB CSV file. >> >> I have a total o

Re: Will it lead to OOM error?

2022-06-22 Thread Sid
Hi Enrico, Thanks for the insights. Could you please help me to understand with one example of compressed files where the file wouldn't be split in partitions and will put load on a single partition and might lead to OOM error? Thanks, Sid On Wed, Jun 22, 2022 at 6:40 PM Enrico Minack wrote

Will it lead to OOM error?

2022-06-22 Thread Sid
I have a 150 TB CSV file. I have a total of 100 TB RAM and 100 TB disk. So if I do something like this: spark.read.option("header","true").csv(filepath).show(false) Will it lead to an OOM error since it doesn't have enough memory? Or will it spill data onto the disk and process it? Thanks, Sid
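For what it's worth, a PySpark sketch of the same read (where show(false) becomes show(truncate=False)); show() only pulls the first handful of rows, so the full 150 TB is never materialised in memory for this particular action:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
filepath = "s3://some-bucket/huge.csv"   # illustrative path

df = spark.read.option("header", "true").csv(filepath)
df.show(truncate=False)                  # reads only enough partitions to produce 20 rows
print(df.rdd.getNumPartitions())         # the file is processed as many independent splits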

Re: Spark Doubts

2022-06-22 Thread Sid
Hi, Thanks for your answers. Much appreciated. I know that we can cache the data frame in memory or on disk, but I want to understand: when the data frame is loaded initially, where does it reside by default? Thanks, Sid On Wed, Jun 22, 2022 at 6:10 AM Yong Walt wrote: > These are the ba

Spark Doubts

2022-06-21 Thread Sid
Hi Team, I have a few doubts about the below questions: 1) data frame will reside where? memory? disk? memory allocation about data frame? 2) How do you configure each partition? 3) Is there any way to calculate the exact partitions needed to load a specific file? Thanks, Sid
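On questions 2 and 3, a hedged sketch: for file sources the input partition size is driven mainly by spark.sql.files.maxPartitionBytes (128 MB by default), so the partition count is roughly the total file size divided by that value (the path is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Roughly: number of input partitions ~ total file size / maxPartitionBytes.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))

df = spark.read.parquet("s3://some-bucket/some-table/")   # illustrative path
print(df.rdd.getNumPartitions())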

Re: How reading works?

2022-06-20 Thread Sid
Hi Team, Can somebody help? Thanks, Sid On Sun, Jun 19, 2022 at 3:51 PM Sid wrote: > Hi, > > I already have a partitioned JSON dataset in s3 like the below: > > edl_timestamp=2022090800 > > Now, the problem is, in the earlier 10 days of data collection there was

How reading works?

2022-06-19 Thread Sid
how Spark reads the data. Does it read the full dataset and filter on the basis of the last saved timestamp, or does it filter only what is required? If the second case is true, then it should have read the data, since the latest data is correct. So I am just trying to understand. Could anyone help here? Thanks, Sid
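Assuming the dataset is laid out as edl_timestamp=... directories (as in the earlier message), a sketch of how to confirm that Spark prunes partitions instead of reading everything; the path is illustrative:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.json("s3://some-bucket/partitioned-dataset/")   # illustrative path
recent = df.filter(F.col("edl_timestamp") >= 2022090800)

# A filter on the partition column shows up under PartitionFilters in the physical
# plan, which means only the matching directories are listed and scanned.
recent.explain(True)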

Re: Redesign approach for hitting the APIs using PySpark

2022-06-13 Thread Sid
t;test_batch") > > > the above code should be able to then be run with a udf as long as we are > able to control the parallelism with the help of executor count and task > cpi configuration. > > But once again, this is just an unnecessary overkill. > > > Regards,

Re: Redesign approach for hitting the APIs using PySpark

2022-06-13 Thread Sid
d on the number of executors that you put in the job. > > But once again, this is kind of an overkill; for fetching data from an API, > creating a simple Python program works quite well. > > > Regards, > Gourav > > On Mon, Jun 13, 2022 at 9:28 AM Sid wrote: > >>

Re: Redesign approach for hitting the APIs using PySpark

2022-06-13 Thread Sid
Hi Gourav, Do you have any examples or links, please? That would help me to understand. Thanks, Sid On Mon, Jun 13, 2022 at 1:42 PM Gourav Sengupta wrote: > Hi, > I think that serialising data using spark is an overkill, why not use > normal python. > > Also have you tr

Redesign approach for hitting the APIs using PySpark

2022-06-13 Thread Sid
# print(json.dumps(custRequestBody)) response = call_to_cust_bulk_api(policyUrl, custRequestBody) print(response) finalDFStatus = finalDF.withColumn("edl_timestamp", to_timestamp(lit(F.TimeNow()))).withColumn("status_for_each_batch", lit(str(response))) print("Max Value:::") print(maxValue) print("Next I:::") i = rangeNum + 1 print(i) This is my very first approach to hitting the APIs with Spark. So, could anyone please help me to redesign the approach, or share some links or references I can use to go deeper into this and correct myself? How can I scale this? Any help is much appreciated. TIA, Sid

Re: API Problem

2022-06-11 Thread Sid
l_to_cust_bulk_api(policyUrl, custRequestBody) print(response) finalDFStatus = finalDF.withColumn("edl_timestamp", to_timestamp(lit(F.TimeNow()))).withColumn("status_for_each_batch", lit(str(response)))

Re: API Problem

2022-06-10 Thread Sid
Hi Enrico, Thanks for your time. Much appreciated. I am expecting the payload to be a JSON string per record, like below: {"A":"some_value","B":"some_value"} where A and B are the columns in my dataset. On Fri, Jun 10, 2022 at 6:09 PM Enrico Mi
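One hedged way to build that payload, without hitting the "column not iterable" error mentioned elsewhere in the thread, is to construct the JSON string as a column with to_json over a struct; the sample data is illustrative:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("a1", "b1"), ("a2", "b2")], ["A", "B"])

# Each row becomes a JSON string such as {"A":"a1","B":"b1"}.
payload_df = df.withColumn("payload", F.to_json(F.struct("A", "B")))
payload_df.show(truncate=False)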

Re: API Problem

2022-06-10 Thread Sid
Hi Stelios, Thank you so much for your help. If I use lit, it gives a "column not iterable" error. Can you suggest a simple way of achieving my use case? I need to send the entire column record by record to the API in JSON format. TIA, Sid On Fri, Jun 10, 2022 at 2:51 PM Stelios Philippou

Re: API Problem

2022-06-10 Thread Sid
rtition(finalDF.rdd.getNumPartitions()).withColumn("status_for_batch >> >> To >> >> finalDF.repartition(finalDF.rdd.getNumPartitions()).withColumn(col("status_for_batch") >> >> >> >> >> On Thu, 9 Jun 2022, 22:32 Sid, wrote:

API Problem

2022-06-09 Thread Sid
Hi Experts, I am facing one problem while passing a column to the method. The problem is described in detail here: https://stackoverflow.com/questions/72565095/how-to-pass-columns-as-a-json-record-to-the-api-method-using-pyspark TIA, Sid

Re: How the data is distributed

2022-06-07 Thread Sid
n Mon, Jun 6, 2022 at 4:07 PM Sid wrote: > >> Hi experts, >> >> >> When we load any file, I know that based on the information in the spark >> session about the executors location, status and etc , the data is >> distributed among the worker nodes and executors.

How the data is distributed

2022-06-06 Thread Sid
or is it directly distributed amongst the workers? Thanks, Sid

Re: Unable to format timestamp values in pyspark

2022-05-30 Thread Sid
Yeah, Stelios. It worked. Could you please post it as an answer so that I can accept it on the post and can be of help to people? Thanks, Sid On Mon, May 30, 2022 at 4:42 PM Stelios Philippou wrote: > Sid, > > According to the error that i am seeing there, this is the Date Forma

Unable to format timestamp values in pyspark

2022-05-30 Thread Sid
Best Regards, Sid

Unable to convert double values

2022-05-29 Thread Sid
Hi Team, I need help with the below problem: https://stackoverflow.com/questions/72422872/unable-to-format-double-values-in-pyspark?noredirect=1#comment127940175_72422872 What am I doing wrong? Thanks, Siddhesh

Re: Complexity with the data

2022-05-26 Thread Sid
orrupt record? Or is there any way to process such records? Thanks, Sid On Thu, May 26, 2022 at 10:54 PM Gourav Sengupta wrote: > Hi, > can you please give us a simple map of what the input is and what the > output should be like? From your description it looks a bit difficult

Re: Complexity with the data

2022-05-26 Thread Sid
e/spark/blob/8f610d1b4ce532705c528f3c085b0289b2b17a94/python/pyspark/pandas/namespace.py#L216 > probably have to be updated with the default options. This is so that > pandas API on spark will be like pandas. > > Thu, 26 May 2022 at 17:38, Sid wrote: > >> I was passing the wrong escape cha

Re: Complexity with the data

2022-05-26 Thread Sid
this approach :) Fingers crossed. Thanks, Sid On Thu, May 26, 2022 at 8:43 PM Apostolos N. Papadopoulos < papad...@csd.auth.gr> wrote: > Since you cannot create the DF directly, you may try to first create an > RDD of tuples from the file > > and then convert the RDD to a DF

Re: Complexity with the data

2022-05-26 Thread Sid
Thanks for opening the issue, Bjorn. However, could you help me address the problem for now with some kind of alternative? I have actually been stuck on this since yesterday. Thanks, Sid On Thu, 26 May 2022, 18:48 Bjørn Jørgensen, wrote: > Yes, it looks like a bug that we also have in pandas

Re: Complexity with the data

2022-05-26 Thread Sid
Hello Everyone, I have posted a question finally with the dataset and the column names. PFB link: https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark Thanks, Sid On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen wrote: > Sid, dump one of yours files. > &

Re: Complexity with the data

2022-05-25 Thread Sid
I have 10 columns with me, but in the dataset I observed that some records have 11 columns of data (for the additional column it is marked as null). But how do I handle this? Thanks, Sid On Thu, May 26, 2022 at 2:22 AM Sid wrote: > How can I do that? Any examples or links, please.

Re: Complexity with the data

2022-05-25 Thread Sid
, was wondering if spark could handle this in a much better way. Thanks, Sid On Thu, May 26, 2022 at 2:19 AM Gavin Ray wrote: > Forgot to reply-all last message, whoops. Not very good at email. > > You need to normalize the CSV with a parser that can escape commas inside > of string

Re: Complexity with the data

2022-05-25 Thread Sid
ot;, "true").option("multiline", "true").option("inferSchema", "true").option("quote", '"').option( "delimiter", ",").csv("path") What else I can do? Thanks, Sid On T

Complexity with the data

2022-05-25 Thread Sid
in total. These are the sample records. Should I load it as an RDD and then maybe use a regex to eliminate the new lines? Or how should it be done? With ". /n"? Any suggestions? Thanks, Sid

Count() action leading to errors | Pyspark

2022-05-06 Thread Sid
are not matching? Also what could be the possible reason for that simple count error? Environment: AWS GLUE 1.X 10 workers Spark 2.4.3 Thanks, Sid

Re: Dealing with large number of small files

2022-04-27 Thread Sid
Yes, it created a list of records separated by ',' and it was created faster as well. On Wed, 27 Apr 2022, 13:42 Gourav Sengupta, wrote: > Hi, > did that result in valid JSON in the output file? > > Regards, > Gourav Sengupta > > On Tue, Apr 26, 2022 at 8:18 PM Sid wr

Re: Dealing with large number of small files

2022-04-26 Thread Sid
I have .txt files with JSON inside it. It is generated by some API calls by the Client. On Wed, Apr 27, 2022 at 12:39 AM Bjørn Jørgensen wrote: > What is that you have? Is it txt files or json files? > Or do you have txt files with JSON inside? > > > > tir. 26. apr. 2022 k

Re: Dealing with large number of small files

2022-04-26 Thread Sid
Thanks for your time, everyone :) Much appreciated. I solved it with the jq utility, since I was dealing with JSON, using the below script: find . -name '*.txt' -exec cat '{}' + | jq -s '.' > output.txt Thanks, Sid On Tue, Apr 26, 2022 at 9:37 PM Bjørn Jørgensen wr
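For completeness, a hedged PySpark equivalent of the same consolidation (paths are illustrative); spark.read.json does not care that the files end in .txt as long as they contain JSON:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read every small JSON-in-.txt file in one pass and rewrite as fewer, larger files.
df = spark.read.json("s3://some-bucket/small-files/*.txt")
df.coalesce(8).write.mode("overwrite").json("s3://some-bucket/consolidated/")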

Dealing with large number of small files

2022-04-26 Thread Sid
Hello, Can somebody help me with the below problem? https://stackoverflow.com/questions/72015557/dealing-with-large-number-of-small-json-files-using-pyspark Thanks, Sid

When should we cache / persist ? After or Before Actions?

2022-04-21 Thread Sid
df1 = df.count() 3) df.persist() 4) df.repartition().write.mode().parquet("") So please help me understand how it should be exactly, and why, if I am not correct. Thanks, Sid
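A short sketch of the ordering being asked about, under the usual assumption that persist() is lazy: it only marks the DataFrame, and the cache is filled by the first action and reused by the ones that follow (paths are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://some-bucket/input/")   # illustrative path
df.persist()                                         # mark before the actions
row_count = df.count()                               # first action fills the cache
df.repartition(200).write.mode("overwrite").parquet("s3://some-bucket/output/")  # reuses cache
df.unpersist()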

Cannot compare columns directly in IF...ELSE statement

2022-03-25 Thread Sid
Hi Team, I need help with the below problem: https://stackoverflow.com/questions/71613292/how-to-use-columns-in-if-else-condition-in-pyspark Thanks, Sid

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sid
the intended comparison is then; these are not equivalent > ways of doing the same thing, or does not seem so as far as I can see. > > On Sun, Feb 27, 2022 at 12:30 PM Sid wrote: > >> My bad. >> >> Aggregation Query: >> >> # Write your MySQL query statem

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sid
Hi Enrico, Thanks for your time :) Consider a huge data volume scenario: if I don't use any keywords like distinct, which one would be faster? Window with partitionBy or normal SQL aggregation methods? And how does df.groupBy().reduceByGroups() work internally? Thanks, Sid On Mon, Feb 28

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sid
k from Department d join Employee e on e.departmentId=d.id ) a where rnk<=3 Time Taken: 790 ms Thanks, Sid On Sun, Feb 27, 2022 at 11:35 PM Sean Owen wrote: > Those two queries are identical? > > On Sun, Feb 27, 2022 at 11:30 AM Sid wrote: > >> Hi Team, >> >> I am awa
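For comparison with the subquery version quoted above, the window form of the same top-3-per-department query as a PySpark sketch with made-up data:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

employee = spark.createDataFrame(
    [(1, 10, 900), (2, 10, 800), (3, 10, 700), (4, 10, 600), (5, 20, 500)],
    ["id", "departmentId", "salary"])

w = Window.partitionBy("departmentId").orderBy(F.col("salary").desc())
top3 = employee.withColumn("rnk", F.dense_rank().over(w)).filter(F.col("rnk") <= 3)
top3.show()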

Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sid
Please correct me if I am wrong. TIA, Sid

Re: Spark Explain Plan and Joins

2022-02-23 Thread Sid
out. Like for string columns. In this case, are the string columns sorted? My question might be a duplicate, but I want to understand what happens when there is non-numeric data as the joining key. Thanks, Sid On Thu, 24 Feb 2022, 02:45 Mich Talebzadeh, wrote: > Yes correct because s

Re: Spark Explain Plan and Joins

2022-02-23 Thread Sid
On Wed, 23 Feb 2022 at 20:07, Sid wrote: > >> Hi Mich, >> >> Thanks for the link. I will go through it. I have two doubts reg

Re: Spark Explain Plan and Joins

2022-02-23 Thread Sid
then in this case what can I do? Suppose I have a "Department" column and need to join with the other table based on "Department". So, can I sort the string as well? What exactly is meant by non-sortable keys? Thanks, Sid On Wed, Feb 23, 2022 at 11:46 PM Mich Talebzadeh wrote:

Re: Unable to display JSON records with null values

2022-02-23 Thread Sid
Okay. So what should I do if I get such data? On Wed, Feb 23, 2022 at 11:59 PM Sean Owen wrote: > There is no record "345" here it seems, right? it's not that it exists and > has null fields; it's invalid w.r.t. the schema that the rest suggests. > > On Wed, Feb 23, 2022

Re: Spark Explain Plan and Joins

2022-02-23 Thread Sid
Hi, Can you help me with my doubts? Any links would also be helpful. Thanks, Sid On Wed, Feb 23, 2022 at 1:22 AM Sid Kal wrote: > Hi Mich / Gourav, > > Thanks for your time :) Much appreciated. I went through the article > shared by Mich about the query execution plan. I

Unable to display JSON records with null values

2022-02-23 Thread Sid
}, "Party3": { "FIRSTNAMEBEN": "ABCDDE", "ALIASBEN": "", "RELATIONSHIPTYPE": "ABC, FGHIJK LMN", "DATEOFBIRTH": "7/Oct/1969" } }, "GeneratedTime": "2022-01-30 03:09:26" }, { "345": { }, "GeneratedTime": "2022-01-30 03:09:26" } ] However, when I try to display this JSON using below code, it doesn't show the blank records. In my case I don't get any records for 345 since it is null but I want to display it in the final flattened dataset. val df = spark.read.option("multiline", true).json("/home/siddhesh/Documents/nested_json.json") Spark version:3.1.1 Thanks, Sid

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Sid
23, 2022 at 8:28 PM Bjørn Jørgensen wrote: > You will. Pandas API on spark that `imported with from pyspark import > pandas as ps` is not pandas but an API that is using pyspark underneath. > > Wed, 23 Feb 2022 at 15:54, Sid wrote: > >> Hi Bjørn, >> >> Thanks for yo
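A minimal sketch of the suggestion (Spark 3.2+); reading .xlsx this way still needs openpyxl available on the cluster, the path is illustrative, and whether it distributes well for very large workbooks is the open question in this thread:

from pyspark import pandas as ps

psdf = ps.read_excel("s3://some-bucket/test_excel.xlsx")   # illustrative path
sdf = psdf.to_spark()       # convert to a regular Spark DataFrame when needed
sdf.printSchema()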

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Sid
Hi Bjørn, Thanks for your reply. This doesn't help while loading huge datasets. I won't be able to get Spark's functionality of loading the file in a distributed manner. Thanks, Sid On Wed, Feb 23, 2022 at 7:38 PM Bjørn Jørgensen wrote: > from pyspark import pandas as ps > > > p

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Sid
Hi Gourav, Thanks for your time. I am worried about the distribution of data in case of a huge dataset file. Is Koalas still a better option to go ahead with? If yes, how can I use it with Glue ETL jobs? Do I have to pass some kind of external jars for it? Thanks, Sid On Wed, Feb 23, 2022 at 7

Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Sid
me/.../Documents/test_excel.xlsx") It is giving me the below error message: java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager I tried several Jars for this error but no luck. Also, what would be the efficient way to load it? Thanks, Sid

Re: Spark Explain Plan and Joins

2022-02-22 Thread Sid Kal
hether you are using SPARK Dataframes or SPARK SQL, > and the settings in SPARK (look at the settings for SPARK 3.x) and a few > other aspects you will see that the plan is quite cryptic and difficult to > read sometimes. > > Regards, > Gourav Sengupta > > On Sun, Feb 20, 2022 at

Re: Spark Explain Plan and Joins

2022-02-20 Thread Sid
Thank you so much for your reply, Mich. I will go through it. However, I want to understand how to read this plan if I face any errors, or if I want to see how Spark is cost optimizing; how should we approach it? Could you help me in layman's terms? Thanks, Sid On Sun, 20 Feb 2022, 17:50 Mich

Spark Explain Plan and Joins

2022-02-19 Thread Sid Kal
y part? Also, internally Spark uses sort-merge join by default, as I can see from the plan, but when does it opt for Sort-Merge Join and when does it opt for Shuffle-Hash Join? Thanks, Sid

Re: How to delete the record

2022-01-27 Thread Sid Kal
On Thu, 27 Jan 2022 at 18:46, Sid Kal wro

Re: How to delete the record

2022-01-27 Thread Sid Kal
Okay, sounds good. So, the below two options would help me to capture CDC changes: 1) Delta Lake 2) Maintaining a snapshot of records with some indicators and a timestamp. Correct me if I'm wrong. Thanks, Sid On Thu, 27 Jan 2022, 23:59 Mich Talebzadeh, wrote: > There are two ways of do
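A rough sketch of option 2, assuming the DMS CDC files carry the usual Op column (I/U/D); the column and path names are assumptions, not details from the thread:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

cdc = spark.read.parquet("s3://dms-target/some-table/")   # illustrative path

# Keep every change record, flag deletes instead of dropping them, and stamp the load time.
snapshot = (cdc
            .withColumn("is_deleted", F.col("Op") == "D")
            .withColumn("edl_timestamp", F.current_timestamp()))

snapshot.write.mode("append").parquet("s3://lake/some-table-snapshot/")   # illustrative path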

Re: How to delete the record

2022-01-27 Thread Sid Kal
On Thu, 27 Jan 2022

Re: How to delete the record

2022-01-27 Thread Sid Kal
Hi Mich, Thanks for your time. Data is stored in S3 via DMS and is read by the Spark jobs. How can I mark a record as a soft delete? Any small snippet / link / example would help. Thanks, Sid On Thu, 27 Jan 2022, 22:26 Mich Talebzadeh, wrote: > Where ETL data is sto

How to delete the record

2022-01-27 Thread Sid Kal
, Sid

Re: Python code crashing on ReduceByKey if I return custom class object

2014-10-27 Thread sid
I have implemented the advice and now the issue is resolved. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-code-crashing-on-ReduceByKey-if-I-return-custom-class-object-tp17317p17426.html Sent from the Apache Spark User List mailing list archive