Re: What is the best way to organize a join within a foreach?

2023-04-27 Thread Amit Joshi
Hi Marco, I am not sure you will get access to the data frame inside the foreach, as the Spark context is not serializable, if I remember correctly. One thing you can do: use the cogroup operation on both datasets. This will give you (key, iter(v1), iter(v2)). Then use foreachPartition.
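
A minimal sketch of that cogroup-then-foreachPartition shape (dataset contents and types here are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("cogroup-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical datasets keyed by an Int id
    val left  = Seq((1, "a"), (1, "b"), (2, "c")).toDS()
    val right = Seq((1, "x"), (2, "y")).toDS()

    // cogroup yields (key, iterator of left values, iterator of right values)
    val joined = left.groupByKey(_._1).cogroup(right.groupByKey(_._1)) {
      (key, leftIter, rightIter) =>
        val rs = rightIter.toSeq
        leftIter.flatMap(l => rs.map(r => (key, l._2, r._2)))
    }

    // The per-key work runs on executors; no driver-side DataFrame is needed
    joined.foreachPartition((it: Iterator[(Int, String, String)]) => it.foreach(println))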

Re: Splittable or not?

2022-09-14 Thread Amit Joshi
Hi Sid, Snappy itself is not splittable. But a format that contains the actual data, like Parquet (which is divided into row groups), can be compressed using Snappy and still be splittable. This works because the blocks (pages of the Parquet format) inside the file are compressed independently with Snappy.
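
For example, a sketch of writing Snappy-compressed Parquet (path and data hypothetical); the file stays splittable because each row group's pages are compressed independently:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("snappy-parquet").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // Snappy compresses the pages inside each row group, so the resulting
    // file can still be split across tasks at row-group boundaries.
    df.write.option("compression", "snappy").parquet("/tmp/out-snappy")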

Re: Salting technique doubt

2022-07-31 Thread Amit Joshi
How do you ensure that the results are matched? > > Best, > Sid > > On Sun, Jul 31, 2022 at 1:34 AM Amit Joshi > wrote: > >> Hi Sid, >> >> Salting is normally a technique to add random characters to existing >> values. >> In big data we can use salting t

Re: Salting technique doubt

2022-07-30 Thread Amit Joshi
values of the salt: join_col x1_1 x1_2 x2_1 x2_2. And then join them like tableA.join(tableB, tableA.new_join_col == tableB.join_col). Let me know if you have any questions. Regards Amit Joshi On Sat, Jul 30, 2022 at 7:16 PM Sid wrote: > Hi Team, > > I was trying to understand th
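
A sketch of the technique under discussion (column names, salt count, and data are hypothetical): salt the skewed side randomly, and explode the other side with every possible salt so each salted key still finds its match:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("salting-sketch").getOrCreate()
    import spark.implicits._

    val numSalts = 2
    val tableA = Seq(("x1", 10), ("x1", 20), ("x2", 30)).toDF("join_col", "a_val")
    val tableB = Seq(("x1", "p"), ("x2", "q")).toDF("join_col", "b_val")

    // Skewed side: append a random salt, e.g. x1 -> x1_0 or x1_1
    val saltedA = tableA.withColumn("new_join_col",
      concat($"join_col", lit("_"), (rand() * numSalts).cast("int").cast("string")))

    // Other side: replicate every key with every salt, x1 -> x1_0 and x1_1;
    // this is what guarantees the results still match
    val saltedB = tableB
      .withColumn("salt", explode(array((0 until numSalts).map(lit): _*)))
      .withColumn("new_join_col", concat($"join_col", lit("_"), $"salt".cast("string")))

    val joined = saltedA.join(saltedB, "new_join_col")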

Re: [Spark] Optimize spark join on different keys for same data frame

2021-10-04 Thread Amit Joshi
Hi Spark users, can anyone please share their views on the topic? Regards Amit Joshi On Sunday, October 3, 2021, Amit Joshi wrote: > Hi Spark-Users, > > Hope you are doing good. > > I have been working on cases where a dataframe is joined with more than > one dat

[Spark] Optimize spark join on different keys for same data frame

2021-10-03 Thread Amit Joshi
uot;key2") I was thinking of bucketing as a solution to speed up the joins. But if I bucket df1 on the key1,then join2 may not benefit, and vice versa (if bucket on key2 for df1). or Should we bucket df1 twice, one with key1 and another with key2? Is there a strategy to make both the joins faster for both the joins? Regards Amit Joshi

Re: Does Rollups work with spark structured streaming with state.

2021-06-17 Thread Amit Joshi
Hi Mich, Thanks for your email. I have tried it in batch mode; still looking to try it in streaming mode. Will update you accordingly. Regards Amit Joshi On Thu, Jun 17, 2021 at 1:07 PM Mich Talebzadeh wrote: > OK let us start with the basic cube > > create a DF first > > scal
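
For the batch mode mentioned above, a minimal rollup sketch (columns and data hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("rollup-sketch").getOrCreate()
    import spark.implicits._

    val sales = Seq(("EMEA", "UK", 100), ("EMEA", "DE", 50), ("APAC", "IN", 75))
      .toDF("region", "country", "amount")

    // rollup computes hierarchical subtotals left to right:
    // (region, country), (region), and the grand total
    sales.rollup($"region", $"country")
      .agg(sum($"amount").as("total"))
      .show()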

Re: Does Rollups work with spark structured streaming with state.

2021-06-17 Thread Amit Joshi
s as well? I hope I was able to make my point clear. Regards Amit Joshi On Wed, Jun 16, 2021 at 11:36 PM Mich Talebzadeh wrote: > > > Hi, > > Just to clarify > > Are we talking about* rollup* as a subset of a cube that computes > hierarchical subtotals from left to ri

Fwd: Does Rollups work with spark structured streaming with state.

2021-06-16 Thread Amit Joshi
I would appreciate it if someone could give some pointers on the question below. -- Forwarded message - From: Amit Joshi Date: Tue, Jun 15, 2021 at 12:19 PM Subject: [Spark] Does Rollups work with spark structured streaming with state. To: spark-user Hi Spark-Users, Hope you are all

Does Rollups work with spark structured streaming with state.

2021-06-15 Thread Amit Joshi
are saved. If rollups are not supported, then what is the standard way to handle this? Regards Amit Joshi

Re: multiple query with structured streaming in spark does not work

2021-05-21 Thread Amit Joshi
, but provide enough memory in the Spark cluster to run both. Regards Amit Joshi On Sat, May 22, 2021 at 5:41 AM wrote: > Hi Amit; > > > > Thank you for your prompt reply and kind help. Wonder how to set the > scheduler to FAIR mode in Python. The following code does not seem to work &

Re: multiple query with structured streaming in spark does not work

2021-05-21 Thread Amit Joshi
Hi Jian, You have to use the same Spark session to run all the queries, and use the following to wait for termination: q1 = writestream1.start q2 = writestream2.start spark.streams.awaitAnyTermination Also set the scheduler in the Spark config to the FAIR scheduler. Regards Amit Joshi
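
A sketch of that setup in Scala (sources, sinks, and paths hypothetical); the same session and scheduler settings apply from PySpark:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("two-queries")
      .config("spark.scheduler.mode", "FAIR") // let both queries share resources
      .getOrCreate()

    val df = spark.readStream.format("rate").load()

    // Both queries must be started from the same SparkSession
    val q1 = df.writeStream.format("console")
      .option("checkpointLocation", "/tmp/chk1").start()
    val q2 = df.writeStream.format("console")
      .option("checkpointLocation", "/tmp/chk2").start()

    // Blocks until any query terminates, keeping both running until then
    spark.streams.awaitAnyTermination()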

Re: [EXTERNAL] Urgent Help - Py Spark submit error

2021-05-14 Thread Amit Joshi
of the worker nodes. Use this command to pass it to the driver: --files /appl/common/ftp/conf.json --conf spark.driver.extraJavaOptions="-Dconfig.file=conf.json" And make sure you are able to access the file location from the worker nodes. Regards Amit Joshi On Sat, May 15, 2021 at 5:14 AM KhajaAsmath Moh
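
Once shipped with --files, the file can be opened through SparkFiles (a sketch, assuming the conf.json name from this thread):

    import org.apache.spark.SparkFiles
    import scala.io.Source

    // A file shipped with --files is downloaded for the driver and executors;
    // SparkFiles.get resolves its local path
    val confPath = SparkFiles.get("conf.json")
    val confJson = Source.fromFile(confPath).mkString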

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-07 Thread Amit Joshi
Hope this helps. Regards Amit Joshi On Wed, Apr 7, 2021 at 8:14 PM Mich Talebzadeh wrote: > > Did some tests. The concern is SSS job running under YARN > > > Scenario 1) use spark-sql-kafka-0-10_2.12-3.1.0.jar > >- Removed spark-sql-kafka-0-10_2.12-3.1.0.jar from any

Re: [Spark Structured Streaming] Processing the data path coming from kafka.

2021-01-18 Thread Amit Joshi
spark session > from foreach as code is not running on the driver. > > Please say if it makes sense or did I miss anything. > > > > Boris > > > > From: Amit Joshi > Sent: Monday, 18 January 2021 17:10 > To: Boris Litvak > Cc: spark-use

Re: [Spark Structured Streaming] Processing the data path coming from kafka.

2021-01-18 Thread Amit Joshi
just map()/mapXXX() the kafkaDf with the mapping function > that reads the paths? > > Also, do you really have to read the json into an additional dataframe? > > > > Thanks, Boris > > > > From: Amit Joshi > Sent: Monday, 18 January 2021 15:04 > To: spark-user

[Spark Structured Streaming] Processing the data path coming from kafka.

2021-01-18 Thread Amit Joshi
this approach is fine? Specifically, if there is some problem with creating the dataframe after calling collect. If there is any better approach, please let me know. Regards Amit Joshi
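
A sketch of the approach under discussion (broker, topic, and paths hypothetical): foreachBatch runs on the driver, so collecting the paths there and building a DataFrame from them is safe:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder.appName("paths-from-kafka").getOrCreate()
    import spark.implicits._

    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "paths-topic")
      .load()

    // Runs on the driver once per micro-batch, so collect() and spark.read are safe
    def processBatch(batch: DataFrame, batchId: Long): Unit = {
      val paths = batch.select("path").as[String].collect()
      if (paths.nonEmpty) {
        spark.read.json(paths: _*).write.mode("append").parquet("/tmp/out")
      }
    }

    kafkaDf.selectExpr("CAST(value AS STRING) AS path")
      .writeStream
      .foreachBatch(processBatch _)
      .option("checkpointLocation", "/tmp/chk-paths")
      .start()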

Re: Missing required configuration "partition.assignment.strategy" [ Kafka + Spark Structured Streaming ]

2020-12-08 Thread Amit Joshi
Hi All, Can someone please help with this. Thanks On Tuesday, December 8, 2020, Amit Joshi wrote: > Hi Gabor, > > Please find the logs attached. These are truncated logs. > > Command used: > spark-submit --verbose --packages org.apache.spark:spark-sql- > kafka-0-10_2.12:3.0.

Re: Missing required configuration "partition.assignment.strategy" [ Kafka + Spark Structured Streaming ]

2020-12-07 Thread Amit Joshi
of the consumer > (this is printed also in the same log) > > G > > > On Mon, Dec 7, 2020 at 5:07 PM Amit Joshi > wrote: > >> Hi Gabor, >> >> The code is very simple Kafka consumption of data. >> I guess, it may be the cluster. >> Can you please point out

Re: Missing required configuration "partition.assignment.strategy" [ Kafka + Spark Structured Streaming ]

2020-12-07 Thread Amit Joshi
super interesting because that field has a default value: >> org.apache.kafka.clients.consumer.RangeAssignor >> >> On Mon, 7 Dec 2020, 10:51 Amit Joshi, wrote: >> >>> Hi, >>> >>> Thanks for the reply. >>> I did try removing the client version. >>&

Re: Missing required configuration "partition.assignment.strategy" [ Kafka + Spark Structured Streaming ]

2020-12-07 Thread Amit Joshi
The issue is that you are overriding the kafka-clients that comes >> with spark-sql-kafka-0-10_2.12. >> >> >> I'd try removing the kafka-clients and see if it works. >> >> >> On Sun, 6 Dec 2020 at 08:01, Amit Joshi >> wrote: >> >>> Hi All, >
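
In sbt terms, that suggestion looks roughly like this (the version is a placeholder; match it to your Spark build):

    // build.sbt: keep only the connector and let it pull its own kafka-clients,
    // i.e. remove any explicit "org.apache.kafka" % "kafka-clients" override
    val sparkVersion = "3.0.1" // placeholder; match your Spark version
    libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion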

Missing required configuration "partition.assignment.strategy" [ Kafka + Spark Structured Streaming ]

2020-12-05 Thread Amit Joshi
no default value. at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:124) I have tried setting "partition.assignment.strategy", but it is still not working. Please help. Regards Amit Joshi

Re: [Spark SQL] does pyspark udf support spark.sql inside def

2020-09-30 Thread Amit Joshi
Can you please post the schemas of both tables. On Wednesday, September 30, 2020, Lakshmi Nivedita wrote: > Thank you for the clarification. I would like to know how I can proceed for > this kind of scenario in pyspark > > I have a scenario subtracting the total number of days with the number of >

Re: Query around Spark Checkpoints

2020-09-27 Thread Amit Joshi
Hi, As far as I know, it depends on whether you are using Spark Streaming or Structured Streaming. In Spark Streaming you can write your own code to checkpoint. But in the case of Structured Streaming it should be a file location. But the main question is why you want to checkpoint in NoSQL, as it's
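
A minimal sketch of the Structured Streaming case (paths hypothetical): the checkpoint is just a per-query, HDFS-compatible file location:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("chk-sketch").getOrCreate()

    val q = spark.readStream.format("rate").load()
      .writeStream
      .format("console")
      // Must be an HDFS-compatible path, unique per query
      .option("checkpointLocation", "/tmp/checkpoints/my-query")
      .start()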

Re: [pyspark 2.4] broadcasting DataFrame throws error

2020-09-18 Thread Amit Joshi
but not in earlier versions? > > On Thu, Sep 17, 2020 at 11:35 PM Amit Joshi > wrote: > >> Hi, >> >> I think the problem lies with driver memory. Broadcast in Spark works by >> collecting all the data to the driver and then the driver broadcasting it to all the >> executors.

Re: [pyspark 2.4] broadcasting DataFrame throws error

2020-09-17 Thread Amit Joshi
Hi, I think the problem lies with driver memory. Broadcast in Spark works by collecting all the data to the driver and then the driver broadcasting it to all the executors. A different strategy could be employed for the transfer, like BitTorrent, though. Please try increasing the driver memory and see if it works.
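
A sketch of the pattern (data and sizes hypothetical); the broadcast side is collected to the driver first, which is why driver memory is the lever here:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    // Submit with more driver memory if the broadcast side is large,
    // e.g. spark-submit --driver-memory 8g ...
    val spark = SparkSession.builder.appName("broadcast-sketch").getOrCreate()
    import spark.implicits._

    val big   = Seq((1, "a"), (2, "b")).toDF("key", "v1")
    val small = Seq((1, "x")).toDF("key", "v2")

    // Hint Spark to collect `small` to the driver and ship it to every executor
    val joined = big.join(broadcast(small), "key")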

Re: Submitting Spark Job thru REST API?

2020-09-02 Thread Amit Joshi
Hi, There are other options, like Apache Livy, which lets you submit the job using a REST API. Another option is using AWS Data Pipeline to configure your job as an EMR activity. To activate the pipeline, you need the console or a program. Regards Amit On Thursday, September 3, 2020, Eric Beabes wrote: >

Re: [Spark Kafka Structured Streaming] Adding partition and topic to the kafka dynamically

2020-08-28 Thread Amit Joshi
added topic and partition, if your subscription covers the topic > (like subscribe pattern). Please try it out. > > Hope this helps. > > Thanks, > Jungtaek Lim (HeartSaVioR) > > On Fri, Aug 28, 2020 at 1:56 PM Amit Joshi > wrote: > >> Any pointers will be appreciated

Re: [Spark Kafka Structured Streaming] Adding partition and topic to the kafka dynamically

2020-08-27 Thread Amit Joshi
Any pointers will be appreciated. On Thursday, August 27, 2020, Amit Joshi wrote: > Hi All, > > I am trying to understand the effect of adding topics, and partitions to a > topic, in Kafka, which is being consumed by Spark Structured Streaming > applications. > > Do we have

[Spark Kafka Structured Streaming] Adding partition and topic to the kafka dynamically

2020-08-27 Thread Amit Joshi
the Spark Structured Streaming application to read from a newly added partition of a topic? Kafka consumers have a metadata refresh property that works without restarting. Thanks in advance. Regards Amit Joshi
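
A sketch of a pattern subscription (broker and pattern hypothetical), which also covers topics created after the query starts:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("subscribe-pattern").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      // Topics created later that match the pattern are picked up as well
      .option("subscribePattern", "events.*")
      .load()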

[Spark-Kafka-Streaming] Verifying the approach for multiple queries

2020-08-09 Thread Amit Joshi
Hi, I have a scenario where a Kafka topic is being written with different types of JSON records. I have to regroup the records based on the type, then fetch the schema, parse them, and write them as Parquet. I have tried Structured Streaming, but the dynamic schema is a constraint. So I have used

[SPARK-STRUCTURED-STREAMING] IllegalStateException: Race while writing batch 4

2020-08-07 Thread Amit Joshi
sink doesn't support multiple writers. It assumes there is only one writer writing to the path. Each query needs to use its own output directory. Is there a way to write the output to the same path from both queries, as I need the output at the same path? Regards Amit Joshi
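
A sketch of the constraint (paths and sources hypothetical): give each query its own output and checkpoint directories, and merge at read time if one logical path is needed:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("two-sinks").getOrCreate()
    val df1 = spark.readStream.format("rate").load()
    val df2 = spark.readStream.format("rate").load()

    // Each file-sink query needs its own output and checkpoint directories;
    // two writers racing on one path trigger the IllegalStateException above.
    val q1 = df1.writeStream.format("parquet")
      .option("path", "/data/out/query1")
      .option("checkpointLocation", "/data/chk/query1")
      .start()
    val q2 = df2.writeStream.format("parquet")
      .option("path", "/data/out/query2")
      .option("checkpointLocation", "/data/chk/query2")
      .start()

    // Readers can still see one dataset:
    // spark.read.parquet("/data/out/query1", "/data/out/query2")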

[SPARK-SQL] How to return GenericInternalRow from spark udf

2020-08-06 Thread Amit Joshi
"empl" => emplSchema } getGenericInternalRow(schema) } val data = udf(getData) Spark Version : 2.4.5 Please Help. Regards Amit Joshi