Reading too many files

2022-10-03 Thread Sachit Murarka

Re: Query regarding Proleptic Gregorian Calendar Spark3

2022-09-20 Thread Sachit Murarka
Reposting once. Kind Regards, Sachit Murarka On Tue, Sep 20, 2022 at 6:56 PM Sachit Murarka wrote: > Hi All, > > I am getting below error , I read the document and understood that we need > to set 2 properties > spark.conf.set("spark.sql.parquet.int96RebaseMo

Query regarding Proleptic Gregorian Calendar Spark3

2022-09-20 Thread Sachit Murarka
es w.r.t. the calendar difference during writing, to get maximum interoperability. Or set spark.sql.parquet.int96RebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gr
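The error quoted above names the two rebase-mode settings explicitly. A minimal sketch of the relevant configuration (names and values taken from the error message; choose LEGACY instead of CORRECTED if Spark 2.x, Hive, or Impala must read the same files):

```
# spark-defaults.conf, --conf flags, or spark.conf.set at runtime
# (these are SQL confs, so they can be changed on a live session)
spark.sql.parquet.int96RebaseModeInRead=CORRECTED
spark.sql.parquet.int96RebaseModeInWrite=CORRECTED
```

As the quoted message says, CORRECTED writes the datetime values as-is and is only safe when the files will be read exclusively by Spark 3.0+ or other systems using the proleptic Gregorian calendar.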

Re: EXT: Network time out property is not getting set in Spark

2022-09-13 Thread Sachit Murarka
On Tue, Sep 13, 2022, 21:23 Sachit Murarka wrote: > Hi Vibhor, > > Thanks for your response! > > There are some properties which can be set without changing this flag > "spark.sql.legacy.setCommandRejectsSparkCoreConfs" > post creation of spark session , like shuf

Network time out property is not getting set in Spark

2022-09-13 Thread Sachit Murarka
ionErrors.scala:2322) at org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:157) at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:41) Kind Regards, Sachit Murarka
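The stack trace above ends in requireNonStaticConf: spark.network.timeout is a Spark core setting, not a SQL conf, so spark.conf.set cannot change it on a live session; the spark.sql.legacy.setCommandRejectsSparkCoreConfs flag only controls whether the SET command is rejected, not whether the value takes effect. A sketch of supplying it before the session exists instead (script name hypothetical):

```
spark-submit \
  --conf spark.network.timeout=600s \
  my_app.py
```

Equivalently, `.config("spark.network.timeout", "600s")` on SparkSession.builder works, provided it runs before the session is first created.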

Re: Issue while creating spark app

2022-02-26 Thread Sachit Murarka
Hello, thanks for replying. I installed the Scala plugin in IntelliJ first, but it still gives the same error: Cannot find project Scala library 2.12.12 for module SparkSimpleApp. Thanks Rajat On Sun, Feb 27, 2022, 00:52 Bitfox wrote: > You need to install scala first, the current version for

Collecting list of errors across executors

2021-08-03 Thread Sachit Murarka
to be shared among various executors, I thought of using Accumulator, but the accumulator uses only Integral values. Can someone please suggest how do I collect all errors in a list which are coming from all records of RDD. Thanks, Sachit Murarka
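Accumulators are not limited to integral values: PySpark lets you register a custom AccumulatorParam, so a list-valued accumulator can gather error strings from all tasks. A hedged sketch of the merge logic such a param must implement, with the driver-side wiring shown only in comments (names hypothetical):

```python
# With a real SparkContext you would subclass
# pyspark.accumulators.AccumulatorParam; the two methods below are the
# whole contract it asks you to implement.
class ListParam:
    def zero(self, initial):
        # Fresh empty list for each executor-side copy of the accumulator.
        return []

    def addInPlace(self, acc, other):
        # Merge the errors collected by one task into the running list.
        acc.extend(other)
        return acc

# Driver-side sketch (hypothetical names, not from the thread):
# errors = sc.accumulator([], ListParam())
# def parse(record):
#     try:
#         return int(record)
#     except ValueError:
#         errors.add(["bad record: %s" % record])
# rdd.foreach(parse)
# print(errors.value)   # all errors, merged across executors
```

The accumulator value is only reliable on the driver after an action has run, and task retries can add duplicates, so treat the list as diagnostic rather than exact.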

Re: Usage of DropDuplicate in Spark

2021-06-22 Thread Sachit Murarka
Hi Chetan, You can subtract the data frames or use the except operation. The first DF contains all rows; the second DF contains the unique rows (after removing duplicates). Subtract the second DF from the first. Hope this helps. Thanks Sachit On Tue, Jun 22, 2021, 22:23 Chetan Khatri wrote: > Hi Spark Users, > > I want
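To make the subtract/except recipe concrete: df.exceptAll(df.dropDuplicates()) (available in Spark 2.4+) leaves exactly the duplicate copies beyond each row's first occurrence. A pure-Python sketch of that multiset arithmetic (function name hypothetical):

```python
from collections import Counter

def extra_duplicate_copies(rows):
    """Rows that would survive df.exceptAll(df.dropDuplicates()):
    every copy of a row beyond its first occurrence."""
    counts = Counter(rows)
    extras = []
    for row, n in counts.items():
        extras.extend([row] * (n - 1))
    return extras

# ("a",) appears three times -> two extra copies; ("b",) appears once -> none.
```

Note that plain `except` (DISTINCT semantics) would return an empty result here; `exceptAll` is the multiset variant that preserves the duplicate counts.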

Small file problem

2021-06-16 Thread Sachit Murarka
Hello Spark Users, We are receiving too many small files, about 3 million. Reading them using spark.read alone takes a long time and the job does not proceed further. Is there any way to speed this up? Regards Sachit Murarka
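One mitigation, assuming the files share a schema, is to avoid handing millions of paths to a single spark.read call: batch the listing, read each batch, union the results, and coalesce before writing the consolidated output. A hedged sketch; the helper and the Spark wiring in comments are illustrative, not from the thread:

```python
def batch_paths(paths, batch_size):
    """Split a long file list into chunks so each spark.read call
    only has to plan a manageable number of input paths."""
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]

# Hypothetical Spark-side use (not from the thread):
# from functools import reduce
# dfs = [spark.read.json(chunk) for chunk in batch_paths(all_paths, 10000)]
# combined = reduce(lambda a, b: a.unionByName(b), dfs)
# combined.coalesce(200).write.parquet("/consolidated/out")
```

Compacting the small files once into a few large Parquet files, then running the real job against the compacted copy, is usually cheaper than re-listing millions of objects on every run.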

How to handle auto-restart in Kubernetes Spark application

2021-05-02 Thread Sachit Murarka
Hi All, I am using Spark on Kubernetes. Can anyone please tell me how I can handle restarting failed Spark jobs? I have used the following property, but it is not working: restartPolicy: type: OnFailure Kind Regards, Sachit Murarka
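With the spark-on-k8s-operator, restartPolicy is part of the SparkApplication spec, and OnFailure typically needs retry counts and intervals alongside the type; a sketch (field names from the operator's CRD, so verify them against the operator version in use):

```yaml
spec:
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10          # seconds between retries
    onSubmissionFailureRetries: 3
    onSubmissionFailureRetryInterval: 20
```

If the block is at the wrong indentation level (e.g. under the driver pod rather than directly under spec), the operator may silently ignore it, which matches the "not working" symptom above.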

How to gracefully shutdown spark job on kubernetes

2021-03-29 Thread Sachit Murarka
(Thread.java:834) Kind Regards, Sachit Murarka

Re: Issue while consuming message in kafka using structured streaming

2021-03-23 Thread Sachit Murarka
topic":{"1":1499,"0":1410}}} Kind Regards, Sachit Murarka On Fri, Mar 12, 2021 at 5:44 PM Gabor Somogyi wrote: > Please see that driver side for example resolved in 3.1.0... > > G > > > On Fri, Mar 12, 2021 at 1:03 PM Sachit Murarka > wrote: > >> Hi Gabor, &

Re: Issue while consuming message in kafka using structured streaming

2021-03-12 Thread Sachit Murarka
Hi Gabor, Thanks a lot for the response. I am using Spark 3.0.1 and this is spark structured streaming. Kind Regards, Sachit Murarka On Fri, Mar 12, 2021 at 5:30 PM Gabor Somogyi wrote: > Since you've not provided any version I guess you're using 2.x and you're > hitting this issue:

Issue while consuming message in kafka using structured streaming

2021-03-12 Thread Sachit Murarka
ic-1 could be determined Current Committed Offsets: {KafkaV2[Subscribe[my-topic]]: {"my-topic":{"1":1498,"0":1410}}} Current Available Offsets: {KafkaV2[Subscribe[my-topic]]: {"my-topic":{"1":1499,"0":1410}}} Kind Regards, Sachit Murarka

Re: Single executor processing all tasks in spark structured streaming kafka

2021-03-11 Thread Sachit Murarka
from > partitions, you can choose to repartition the batch so it is processed by > multiple tasks. > > On Mon, Mar 8, 2021 at 10:57 PM Sachit Murarka > wrote: > >> Hi All, >> >> I am using Spark 3.0.1 Structuring streaming with Pyspark. >> >> The

Single executor processing all tasks in spark structured streaming kafka

2021-03-08 Thread Sachit Murarka
load() .selectExpr("CAST(value AS STRING)") query = df.writeStream.foreach(process_events).option("checkpointLocation", "/opt/checkpoint").trigger(processingTime="30 seconds").start() Kind Regards, Sachit Murarka

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device\n\t

2021-03-08 Thread Sachit Murarka
Thanks Sean. Kind Regards, Sachit Murarka On Mon, Mar 8, 2021 at 6:23 PM Sean Owen wrote: > It's there in the error: No space left on device > You ran out of disk space (local disk) on one of your machines. > > On Mon, Mar 8, 2021 at 2:02 AM Sachit Murarka > wrote: > >

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device\n\t

2021-03-08 Thread Sachit Murarka
ge before > asking people to go through it. Also I am pretty sure that the error is > mentioned in the first line itself. > > Any ideas regarding the SPARK version, and environment that you are using? > > > Thanks and Regards, > Gourav Sengupta > > On Mon, Mar 8, 2021 at 8

com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device\n\t

2021-03-08 Thread Sachit Murarka
more\n\n"} Kind Regards, Sachit Murarka

Re: Structured Streaming With Kafka - processing each event

2021-03-02 Thread Sachit Murarka
Hi Mich, Thanks for reply. Will checkout this. Kind Regards, Sachit Murarka On Fri, Feb 26, 2021 at 2:14 AM Mich Talebzadeh wrote: > Hi Sachit, > > I managed to make mine work using the *foreachBatch function *in > writeStream. > > "foreach" performs cu

Re: Spark job crashing - Spark Structured Streaming with Kafka

2021-03-02 Thread Sachit Murarka
reaming.sources.ForeachWriterTable$$anon$1$$anon$2@30f2abbb +- Project [cast(value#8 as string) AS value#21] +- StreamingDataSourceV2Relation [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13], org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaScan@433a9c3b, Kafk

Spark job crashing - Spark Structured Streaming with Kafka

2021-03-02 Thread Sachit Murarka
ermination return self._jsq.awaitTermination() File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__ return_value = get_return_value( File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 134, in deco raise_from(co

Structured Streaming With Kafka - processing each event

2021-02-24 Thread Sachit Murarka
afka and the above code will run multiple times for each single message. If I change it to foreachBatch, will it optimize it? Kind Regards, Sachit Murarka
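foreachBatch invokes the sink function once per micro-batch rather than once per row, so per-row setup cost (connections, serialization) is paid once per batch; it is usually the cheaper choice when writes are expensive. A sketch with a pure stand-in transformation (names hypothetical), with the Structured Streaming wiring shown in comments:

```python
def enrich(events):
    """Per-batch logic. Here a pure stand-in that uppercases values;
    in the real job this would be DataFrame transformations plus a write."""
    return [e.upper() for e in events]

# Structured Streaming wiring (sketch, not from the thread):
# def process_batch(batch_df, batch_id):
#     batch_df.selectExpr("upper(value) AS value") \
#             .write.mode("append").parquet("/out/path")
#
# query = (df.writeStream
#            .foreachBatch(process_batch)
#            .option("checkpointLocation", "/opt/checkpoint")
#            .trigger(processingTime="30 seconds")
#            .start())
```

The batch_id argument lets process_batch deduplicate work if a micro-batch is re-delivered after a failure.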

Re: EOF Exception Spark Structured Streams - Kubernetes

2021-02-01 Thread Sachit Murarka
: > Hi Sachit, > > The fix version on that JIRA says 3.0.2, so this fix is not yet released. > Soon, there will be a 3.1.1 release, in the meantime you can try out the > 3.1.1-rc which also has the fix and let us know your findings. > > Thanks, > > > On Mon, Feb 1, 2

Re: Spark SQL query

2021-02-01 Thread Sachit Murarka
Application-wise it won't show as such. You can try to correlate it with the explain plan output using some filters or attributes. Or, if you do not have too many queries in history, just take the queries, find their plans, and match them with what is shown in the UI. I know that's a tedious task. But

Re: Spark SQL query

2021-02-01 Thread Sachit Murarka
Hi Arpan, In spark-shell, when you type :history, is it still not showing? Thanks Sachit On Mon, 1 Feb 2021, 21:13 Arpan Bhandari, wrote: > Hey Sachit, > > It shows the query plan, which is difficult to diagnose out and depict the > actual query. > > > Thanks, > Arpan Bhandari > > > > -- >

Re: EOF Exception Spark Structured Streams - Kubernetes

2021-01-31 Thread Sachit Murarka
Following is the related JIRA; can someone please check: https://issues.apache.org/jira/browse/SPARK-24266 I am using 3.0.1; it says fixed in 3.0.0 and 3.1.0. Could you please suggest what can be done to avoid this? Kind Regards, Sachit Murarka On Sun, Jan 31, 2021 at 6:38 PM Sachit Murarka

EOF Exception Spark Structured Streams - Kubernetes

2021-01-31 Thread Sachit Murarka
Regards, Sachit Murarka

Re: Spark SQL query

2021-01-31 Thread Sachit Murarka
ery. Hope this helps! Kind Regards, Sachit Murarka On Fri, Jan 29, 2021 at 9:33 PM Arpan Bhandari wrote: > Hi Sachit, > > Yes it was executed using spark shell, history is already enabled. already > checked sql tab but it is not showing the query. My spark version is 2.4.5 > &

Re: Spark SQL query

2021-01-29 Thread Sachit Murarka
Hi Arpan, Was it executed using spark-shell? If yes, type :history. Do you have the history server enabled? If yes, go to the SQL tab in the History UI. Thanks Sachit On Fri, 29 Jan 2021, 19:19 Arpan Bhandari, wrote: > Hi , > > Is there a way to track back spark sql after it has

Query on entrypoint.sh Kubernetes spark

2021-01-21 Thread Sachit Murarka
ent Could you pls suggest Why deploy-mode client is mentioned in entrypoint.sh ? I am running spark submit using deploy mode cluster but inside entrypoint.sh which it is mentioned like that. Kind Regards, Sachit Murarka

Re: Issue with executer

2021-01-20 Thread Sachit Murarka
Hi Vikas, 1. Are you running in local mode? The master is local[*]. 2. Please mask the IP or confidential info while sharing logs. Thanks Sachit On Wed, 20 Jan 2021, 17:35 Vikas Garg, wrote: > Hi, > > I am facing issue with spark executor. I am struggling with this issue > since last many days and

Re: Spark 3.0.1 giving warning while running with Java 11

2021-01-15 Thread Sachit Murarka
Sure Sean. Thanks for confirmation. On Fri, 15 Jan 2021, 10:57 Sean Owen, wrote: > You can ignore that. Spark 3.x works with Java 11 but it will generate > some warnings that are safe to disregard. > > On Thu, Jan 14, 2021 at 11:26 PM Sachit Murarka > wrote: > >> Hi Al

Spark 3.0.1 giving warning while running with Java 11

2021-01-14 Thread Sachit Murarka
,int) WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release Kind Regards, Sachit

Re: HPA - Kubernetes for Spark

2021-01-10 Thread Sachit Murarka
Hi, Yes, I know that by enabling the shuffle tracking property we can use DRA. But it is marked as experimental; is it advisable to use? Also, regarding HPA: we do not have a separate HPA as such for Spark, right? Kind Regards, Sachit Murarka On Mon, Jan 11, 2021 at 2:17 AM Sandish Kumar HN

HPA - Kubernetes for Spark

2021-01-10 Thread Sachit Murarka
how can I proceed with achieving pod scaling in Spark? Please note: I am using Kubernetes with the Spark operator. Kind Regards, Sachit Murarka

Re: Suggestion on Spark 2.4.7 vs Spark 3 for Kubernetes

2021-01-05 Thread Sachit Murarka
including it will be transitioning from experimental to GA in > this release. > > See: https://issues.apache.org/jira/browse/SPARK-33005 > > Thanks, > > On Tue, Jan 5, 2021 at 12:41 AM Sachit Murarka > wrote: > >> Hi Users, >> >> Could you please tell which Spark version h

Suggestion on Spark 2.4.7 vs Spark 3 for Kubernetes

2021-01-04 Thread Sachit Murarka
Hi Users, Could you please tell me which Spark version you have used in production for Kubernetes? Which version is recommended for production, given that both the streaming and core APIs have to be used with PySpark? Thanks! Kind Regards, Sachit Murarka

Re: Error while running Spark on K8s

2021-01-04 Thread Sachit Murarka
spark-test --conf spark.executor.instances=5 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa --conf spark.kubernetes.container.image=sparkpy local:///opt/spark/da/main.py Kind Regards, Sachit Murarka On Mon, Jan 4, 2021 at 5:46 PM Prashant Sharma wrote: > Hi Sachit

Error while running Spark on K8s

2021-01-04 Thread Sachit Murarka
parameter mentioned in JIRA too(spark.kubernetes.driverEnv.HTTP2_DISABLE=true), that also did not work. Can anyone suggest what can be done? Kind Regards, Sachit Murarka

Re: Issue while installing dependencies Python Spark

2020-12-18 Thread Sachit Murarka
ct. Please carefully study the documentation linked above for further help. Original error was: No module named 'numpy.core._multiarray_umath' Kind Regards, Sachit Murarka On Thu, Dec 17, 2020 at 9:24 PM Patrick McCarthy wrote: > I'm not very familiar with the environments on cloud cluster

Issue while installing dependencies Python Spark

2020-12-17 Thread Sachit Murarka
Regards, Sachit Murarka

Running spark code using wheel file

2020-12-16 Thread Sachit Murarka
Hi All, I have created a wheel file and I am using the following command to run the spark job: spark-submit --py-files application.whl main_flow.py My application is unable to reference the modules. Do I need to do the pip install of the wheel first? Kind Regards, Sachit Murarka
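--py-files ships the wheel to executors but does not pip-install it, so packages inside a wheel (especially ones with compiled extensions) may still fail to import. A hedged sketch of the two usual approaches (file names taken from the message above):

```
# Option 1: install the wheel into the Python environment on every node,
# then submit only the entry script
pip install application.whl
spark-submit main_flow.py

# Option 2: keep shipping code with --py-files, but as a plain .zip of
# the package (spark-submit documents .py/.zip/.egg for --py-files)
spark-submit --py-files application.zip main_flow.py
```

On a cluster, "every node" means baking the pip install into the image or environment rather than running it by hand.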

Streaming job taking all executors

2020-12-13 Thread Sachit Murarka
Hi All, I am using standalone Spark with dynamic allocation. Despite setting max, min, and initial executors, my streaming job is taking all executors available in the cluster. Could anyone please suggest what could be wrong here? Please note the source is Kafka. I
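A hedged sketch of the settings involved: on standalone, dynamic allocation also requires the external shuffle service, and without it the executor bounds may not behave as expected (conf names from the Spark configuration docs; values here are illustrative):

```
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.initialExecutors=2
spark.dynamicAllocation.maxExecutors=4
spark.dynamicAllocation.executorIdleTimeout=60s
```

If maxExecutors is left at its default (unbounded), a backlogged streaming job will legitimately scale up to everything the cluster offers, which matches the symptom described above.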

Re: Regexp_extract not giving correct output

2020-12-02 Thread Sachit Murarka
and as I mentioned when I am using 2 backslashes it is giving an exception as follows: : java.util.regex.PatternSyntaxException: Unknown inline modifier near index 21 (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*) Kind Regards, Sachit Murarka On Wed, Dec 2,
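The PatternSyntaxException arises because java.util.regex, which backs regexp_extract, does not support conditional groups like (?(1)...) — those are a PCRE/.NET feature. The usual workaround is to write the branches as a plain alternation and coalesce the two capture groups. A simplified, hypothetical sketch of that rewrite (exercised here with Python's re, whose alternation semantics match Java's for this pattern):

```python
import re

# Java's engine rejects (?(1)...); spell out the two layouts instead.
# Simplified, hypothetical log layouts:
#   "[OrderID: x] [UniqueID: abc123] ..."   or   "[...] [abc123] ..."
PATTERN = r"\[UniqueID:\s*([a-zA-Z0-9]+)\]|\[.*?\]\s\[([a-zA-Z0-9]+)\]"

def extract_unique_id(line):
    """Return the ID from whichever branch matched, or None."""
    m = re.search(PATTERN, line)
    if not m:
        return None
    # Exactly one of the two groups is populated per match.
    return m.group(1) or m.group(2)
```

In Spark, the same coalescing can be done with two regexp_extract calls plus coalesce(); also note that inside SQL string literals each backslash must be doubled, while the DataFrame API with Python raw strings needs only single backslashes.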

Regexp_extract not giving correct output

2020-12-02 Thread Sachit Murarka
Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*) Can you please help here? Kind Regards, Sachit Murarka

Need Unit test complete reference for Pyspark

2020-11-18 Thread Sachit Murarka
Hi Users, I have to write unit test cases for PySpark. I think pytest-spark and "spark-testing-base" are good test libraries. Can anyone please provide a full reference for writing the test cases in Python using these? Kind Regards, Sachit Murarka

Need help on Calling Pyspark code using Wheel

2020-10-23 Thread Sachit Murarka
Thanks Sachit Kind Regards, Sachit Murarka

Re: Multiple applications being spawned

2020-10-13 Thread Sachit Murarka
(although 25 > processes on a single node seem too high) > > > > *From: *Sachit Murarka > *Date: *Tuesday, October 13, 2020 at 8:15 AM > *To: *spark users > *Subject: *RE: [EXTERNAL] Multiple applications being spawned > > > > *CAUTION*: This email originated fr

Re: Multiple applications being spawned

2020-10-13 Thread Sachit Murarka
(PythonRunner.scala:346) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945) at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:195) Kind Regards, Sachit Murarka On Tue, Oct 13, 2020 at 4:02 PM Sachit Murarka wrote: > Hi Users, >

Multiple applications being spawned

2020-10-13 Thread Sachit Murarka
, then converting it back to dataframe and then applying 2 actions(Count & Write). Please note : This was working fine till the previous week, it has started giving this issue since yesterday. Could you please tell what can be the reason for this behavior? Kind Regards, Sachit Murarka

Re: Job is not able to perform Broadcast Join

2020-10-06 Thread Sachit Murarka
data set since it has 2 cols only. Thanks Sachit On Wed, 7 Oct 2020, 01:04 Eve Liao, wrote: > Try to avoid broadcast. Thought this: > https://towardsdatascience.com/adding-sequential-ids-to-a-spark-dataframe-fa0df5566ff6 > could be helpful. > > On Tue, Oct 6, 2020 at 12:18 PM

Re: Job is not able to perform Broadcast Join

2020-10-06 Thread Sachit Murarka
ven hundreds of thousands of rows is a broadcast candidate. > Your broadcast variable is probably too large. > > On Tue, Oct 6, 2020 at 11:37 AM Sachit Murarka > wrote: > >> Hello Users, >> >> I am facing an issue in spark job where I am doing row number() without

Job is not able to perform Broadcast Join

2020-10-06 Thread Sachit Murarka
it. Kind Regards, Sachit Murarka

Executor Lost Spark Issue

2020-10-06 Thread Sachit Murarka
please suggest something? I have sufficient memory in executors and the driver as well. Kind Regards, Sachit Murarka

Re: PGP Encrypt using spark Scala

2019-08-26 Thread Sachit Murarka
hit > PGP Encrypt is something that is not inbuilt with spark. > I would suggest writing a shell script that would do pgp encrypt and use > it in spark scala program , which would run from driver. > > Thanks > Deepak > > On Mon, Aug 26, 2019 at 8:10 PM Sachit Murarka > wr

PGP Encrypt using spark Scala

2019-08-26 Thread Sachit Murarka
Hi All, I want to encrypt my files available at an HDFS location using PGP encryption. How can I do it in Spark? I saw Apache Camel, but it seems Camel is used when source files are in a local location rather than HDFS. Kind Regards, Sachit Murarka
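PGP is not built into Spark, so the usual approach (as the reply above suggests) is to shell out to gpg from the driver after staging the HDFS files locally. A hedged sketch: gpg must be installed and the recipient's public key imported into its keyring, and the paths and recipient below are hypothetical:

```python
import subprocess

def gpg_encrypt_cmd(src, dst, recipient):
    """Build the gpg command line used to encrypt one local file."""
    return [
        "gpg", "--batch", "--yes",
        "--recipient", recipient,
        "--output", dst,
        "--encrypt", src,
    ]

# Driver-side sketch (not from the thread): pull each file from HDFS to
# local disk (e.g. `hdfs dfs -get`), encrypt it, and push the .gpg file
# back to HDFS:
# subprocess.run(
#     gpg_encrypt_cmd("/tmp/part-0000", "/tmp/part-0000.gpg",
#                     "ops@example.com"),
#     check=True)
```

Building the argument list (rather than a shell string) avoids quoting issues with odd file names, and check=True surfaces gpg failures as exceptions.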

Re: Implementing Upsert logic Through Streaming

2019-06-30 Thread Sachit Murarka
>>> Target is Oracle Database. >>> >>> My Goal is to maintain latest record for a key in Oracle. Could you >>> please suggest how this can be implemented efficiently? >>> >>> Kind Regards, >>> Sachit Murarka >>> >>

Implementing Upsert logic Through Streaming

2019-06-25 Thread Sachit Murarka
Hi All, I will get records continuously in text file form (streaming). They will also have a timestamp field. The target is an Oracle database. My goal is to maintain the latest record for each key in Oracle. Could you please suggest how this can be implemented efficiently? Kind Regards, Sachit Murarka
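A common recipe is to reduce each batch to the newest record per key before upserting into Oracle (e.g. a MERGE statement issued from foreachBatch); in Spark this is row_number() over Window.partitionBy(key).orderBy(desc(ts)) filtered to 1. A pure-Python sketch of that reduction (names hypothetical):

```python
def latest_per_key(records):
    """Keep the newest record per key, mirroring the Spark recipe:
    row_number() over Window.partitionBy("key").orderBy(desc("ts")) == 1.
    records: iterable of (key, ts, value) tuples."""
    best = {}
    for key, ts, value in records:
        if key not in best or ts > best[key][0]:
            best[key] = (ts, value)
    return {k: v for k, (ts, v) in best.items()}
```

Deduplicating inside Spark first keeps the upsert to one row per key per batch, so the Oracle MERGE does far less work than row-by-row updates.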

1 task per executor

2019-05-28 Thread Sachit Murarka
Hi All, I am using spark 2.2 I have enabled spark dynamic allocation with executor cores 4, driver cores 4 and executor memory 12GB driver memory 10GB. In Spark UI, I see only 1 task is launched per executor. Could anyone please help on this? Kind Regards, Sachit Murarka

NoClassDefFoundError

2019-05-21 Thread Sachit Murarka
Hi All, I have simply added exception handling in my code in Scala. I am getting NoClassDefFoundError . Any leads would be appreciated. Thanks Kind Regards, Sachit Murarka