S3 committer for dynamic partitioning

2024-03-05 Thread Nikhil Goyal
in S3? Thanks Nikhil

Re: Architecture of Spark Connect

2023-12-14 Thread Nikhil Goyal
If multiple applications are running, would we need multiple Spark Connect servers? If so, is the user responsible for creating these servers, or are they created on the fly when the user requests a new Spark session? On Thu, Dec 14, 2023 at 10:28 AM Nikhil Goyal wrote: > Hi folks, >

Architecture of Spark Connect

2023-12-14 Thread Nikhil Goyal
Hi folks, I am trying to understand one question. Does Spark Connect create a new driver in the backend for every user, or are there a fixed number of drivers running to which requests are sent? Thanks Nikhil
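For context, a minimal sketch of how a client attaches to a Spark Connect server with the Scala client (Spark 3.5-era API; the server is assumed to have been started separately with ./sbin/start-connect-server.sh, and the endpoint/port are assumptions for a default local setup):

```scala
import org.apache.spark.sql.SparkSession

// Connect to an already-running Spark Connect server; the gRPC endpoint
// below is the default for a local start-connect-server.sh launch.
val spark = SparkSession.builder()
  .remote("sc://localhost:15002")
  .getOrCreate()

spark.range(5).show() // executes on the server-side driver, not in this JVM
```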

Shuffle data on pods which get decomissioned

2023-06-20 Thread Nikhil Goyal
Hi folks, When running Spark on K8s, what happens to shuffle data if an executor is terminated or lost? Since there is no shuffle service, does all the work done by that executor get recomputed? Thanks Nikhil
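For reference, Spark 3.1+ can migrate shuffle and cached blocks off executors that are being decommissioned instead of recomputing everything from lineage; a sketch of the relevant settings, to be verified against your Spark version:

```
spark.decommission.enabled                        true
spark.storage.decommission.enabled                true
spark.storage.decommission.shuffleBlocks.enabled  true
# Optional: copy blocks to durable storage instead of peer executors
# (bucket path is a placeholder).
spark.storage.decommission.fallbackStorage.path   s3a://bucket/spark-fallback/
```

Without these (or an external shuffle service), shuffle output lost with an executor is recomputed from lineage.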

Viewing UI for spark jobs running on K8s

2023-05-31 Thread Nikhil Goyal
Hi folks, Is there an equivalent of the YARN RM page for Spark on Kubernetes? We can port-forward the UI from the driver pod for each job, but this process is tedious given we have multiple jobs running. Is there a clever solution to exposing all driver UIs in a centralized place? Thanks Nikhil
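One common pattern, sketched under the assumption that all jobs can write event logs to shared storage: run a single Spark History Server over a common log directory, giving one centralized UI for every application (paths are placeholders):

```
# Per-job settings (spark-defaults.conf or --conf flags)
spark.eventLog.enabled          true
spark.eventLog.dir              s3a://bucket/spark-events/

# History server settings
spark.history.fs.logDirectory   s3a://bucket/spark-events/
```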

Re: Partition by on dataframe causing a Sort

2023-04-20 Thread Nikhil Goyal
Is it possible to use MultipleOutputs and define a custom OutputFormat, then use `saveAsHadoopFile` to achieve this? On Thu, Apr 20, 2023 at 1:29 PM Nikhil Goyal wrote: > Hi folks, > > We are writing a dataframe and doing a partitionBy() on it. > df.write.part

Partition by on dataframe causing a Sort

2023-04-20 Thread Nikhil Goyal
source but unable to see if we can really control this behavior in the sink. If anyone has any suggestions please let me know. Thanks Nikhil
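A hedged sketch of one workaround: shuffle by the output partition column up front so each write task receives one partition's rows contiguously and the writer has less sorting to do (column name and path are placeholders):

```scala
import org.apache.spark.sql.functions.col

// Cluster rows by the partition column before the partitioned write.
df.repartition(col("date"))
  .write
  .partitionBy("date")
  .parquet("s3a://bucket/out/")
```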

Understanding executor memory behavior

2023-03-16 Thread Nikhil Goyal
Hi folks, I am trying to understand the difference between running 8G/1-core executors and 40G/5-core executors. I see that on YARN it can cause bin-packing issues, but other than that, are there any pros and cons to using either? Thanks Nikhil
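For reference, the two shapes expressed as submit-time settings (a sketch; beyond YARN bin-packing, fewer large executors share one copy of each broadcast and shuffle more in-process, at the cost of longer GC pauses on the bigger heap):

```
# Many small executors
--conf spark.executor.memory=8g  --conf spark.executor.cores=1

# Fewer large executors
--conf spark.executor.memory=40g --conf spark.executor.cores=5
```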

Increasing Spark history resources

2022-12-08 Thread Nikhil Goyal
Hi folks, We are experiencing slowness in the Spark history server and are trying to find which config properties we can tune to fix the issue. I found that SPARK_DAEMON_MEMORY is used to control memory; similarly, is there a config property to increase the number of threads? Thanks Nikhil
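A sketch of the two knobs usually suggested here (the thread-count property applies to event-log replay; verify it exists in your Spark version):

```
# spark-env.sh: heap for the history server daemon
SPARK_DAEMON_MEMORY=4g

# spark-defaults.conf: parallelism for replaying event logs
spark.history.fs.numReplayThreads   16
```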

Driver takes long time to finish once job ends

2022-11-22 Thread Nikhil Goyal
Hi folks, We are running a job on our on-prem K8s cluster but writing the output to S3. We noticed that all the executors finish in < 1h but the driver takes another 5h to finish. Logs: 22/11/22 02:08:29 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 10.42.145.11:39001 in memory (size:
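A frequent cause of this pattern is the default rename-based output committer, which makes the driver "rename" (copy) every output file during job commit; on S3 that is slow. A hedged sketch of switching to the S3A magic committer (requires the spark-hadoop-cloud module; class and property names should be verified against your Spark/Hadoop versions):

```
spark.hadoop.fs.s3a.committer.name           magic
spark.hadoop.fs.s3a.committer.magic.enabled  true
spark.sql.sources.commitProtocolClass        org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class     org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```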

Dynamic allocation on K8

2022-10-25 Thread Nikhil Goyal
<https://spark.apache.org/docs/latest/running-on-kubernetes.html#future-work> says that the shuffle service is not yet available. Thanks Nikhil
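For reference, since Spark 3.0 dynamic allocation can run on K8s without an external shuffle service by tracking which executors still hold live shuffle data (such executors are not removed until their shuffle output is no longer needed); a sketch:

```
spark.dynamicAllocation.enabled                  true
spark.dynamicAllocation.shuffleTracking.enabled  true
```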

partitionBy creating lot of small files

2022-06-04 Thread Nikhil Goyal
Hi all, Is there a way to use dataframe.partitionBy("col") and control the number of output files without doing a full repartition? The thing is some partitions have more data while some have less. Doing a .repartition is a costly operation. We want to control the size of the output files. Is it
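A sketch of one lever that caps output file size without a full repartition, available since Spark 2.2 (column name, threshold, and path are placeholders):

```scala
df.write
  .option("maxRecordsPerFile", 1000000) // split any task's output at 1M rows
  .partitionBy("col")
  .parquet("s3a://bucket/out/")
```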

Re: PartitionBy and SortWithinPartitions

2022-06-03 Thread Nikhil Goyal
which does the sorting but that complicates the code. So is there a clever way to sort records after they have been partitioned? Thanks Nikhil On Fri, Jun 3, 2022 at 9:38 AM Enrico Minack wrote: > Nikhil, > > What are you trying to achieve with this in the first place? What are your >

PartitionBy and SortWithinPartitions

2022-06-03 Thread Nikhil Goyal
partitionBy or before? Basically, would Spark first partition by col2 and then sort by col1, or sort first and then partition? Thanks Nikhil
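A hedged sketch of the usual DataFrame-API ordering, with placeholder column names: shuffle first so rows with the same col2 are co-located, then sort locally inside each partition, then write:

```scala
import org.apache.spark.sql.functions.col

df.repartition(col("col2"))       // shuffle: cluster rows by col2
  .sortWithinPartitions("col1")   // local sort within each partition
  .write
  .partitionBy("col2")
  .parquet("s3a://bucket/out/")   // placeholder path
```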

Serialization error when using scala kernel with Jupyter

2020-02-21 Thread Nikhil Goyal
Exception: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of org.apache.spark.rdd.MapPartitionsRDD. I was wondering if anyone has seen this before. Thanks Nikhil

Understanding deploy mode config

2019-10-02 Thread Nikhil Goyal
.setMaster("yarn") .set("spark.submit.deployMode", "cluster") sc = SparkContext(conf) Is the spark context being created on application master or on the machine where this python process is being run? Thanks Nikhil

unsubcribe

2019-05-24 Thread Nikhil R Patil
unsubcribe "Confidentiality Warning: This message and any attachments are intended only for the use of the intended recipient(s). are confidential and may be privileged. If you are not the intended recipient. you are hereby notified that any review. re-transmission. conversion to hard copy.

Re: K8s-Spark client mode : Executor image not able to download application jar from driver

2019-04-28 Thread Nikhil Chinnapa
Thanks for explaining in such detail and pointing to the source code. Yes, it's helpful and cleared up a lot of confusion.

Re: K8s-Spark client mode : Executor image not able to download application jar from driver

2019-04-27 Thread Nikhil Chinnapa
Hi Stavros, Thanks a lot for pointing me in the right direction. I got stuck in a release, so I didn't get time earlier. The mistake was "LINUX_APP_RESOURCE": I was using "local" where it should be "file". I got there thanks to your email. What I understood: Driver image: $SPARK_HOME/bin

K8s-Spark client mode : Executor image not able to download application jar from driver

2019-04-16 Thread Nikhil Chinnapa
Environment: Spark: 2.4.0 Kubernetes: 1.14 Query: Does the application jar need to be part of both Driver and Executor images? Invocation point (from Java code): sparkLaunch = new SparkLauncher() .setMaster(LINUX_MASTER)

Need help with SparkSQL Query

2018-12-17 Thread Nikhil Goyal
dataframe to get the right record (all the metrics). Or I can create a single column with all the records and then implement a UDAF in Scala and use it in PySpark. Neither solution seems straightforward. Is there a simpler one? Thanks Nikhil
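The full question is truncated here, but for the common "keep the best record per key" shape, a window function avoids both a self-join and a custom UDAF; a sketch with placeholder column names:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val w = Window.partitionBy("id").orderBy(col("metric").desc)

val best = df
  .withColumn("rn", row_number().over(w)) // rank rows within each id
  .filter(col("rn") === 1)                // keep the top row per id
  .drop("rn")
```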

Re: Writing to vertica from spark

2018-10-22 Thread Nikhil Goyal
Fixed this by setting fileformat -> "parquet" On Mon, Oct 22, 2018 at 11:48 AM Nikhil Goyal wrote: > Hi guys, > > My code is failing with this error > > java.lang.Exception: S2V: FATAL ERROR for job S2V_job9197956021769393773. > Job status information is

Writing to vertica from spark

2018-10-22 Thread Nikhil Goyal
format("com.vertica.spark.datasource.DefaultSource") .options(connectionProperties) .mode(SaveMode.Append) .save() Does anybody have any idea how to fix this? Thanks Nikhil

Re: [External Sender] Writing dataframe to vertica

2018-10-22 Thread Nikhil Goyal
t 16, 2018 at 7:24 PM Nikhil Goyal wrote: > >> Hi guys, >> >> I am trying to write dataframe to vertica using spark. It seems like >> spark is creating a temp table under public schema. I don't have access to >> public schema hence the job is failing. Is there a

Writing dataframe to vertica

2018-10-16 Thread Nikhil Goyal
to create job status table public.S2V_JOB_STATUS_USER_NGOYAL java.lang.Exception: S2V: FATAL ERROR for job S2V_job8087339107009511230. Unable to create status table for tracking this job:public.S2V_JOB_STATUS_USER_NGOYAL Thanks Nikhil

Driver OOM when using writing parquet

2018-08-06 Thread Nikhil Goyal
be the reason? Thanks Nikhil

Zstd codec for writing dataframes

2018-06-18 Thread Nikhil Goyal
Hi guys, I was wondering if there is a way to compress files using zstd. It seems zstd compression can be used for shuffle data; is there a way to use it for output data as well? Thanks Nikhil
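A sketch for recent Spark versions (Spark 3.2+ bundles zstd support for Parquet; older versions need the codec on the classpath, so verify before relying on this):

```scala
// Per write:
df.write.option("compression", "zstd").parquet("s3a://bucket/out/")

// Or globally:
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
```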

unsubscribe

2018-04-05 Thread Nikhil Kalbande
unsubscribe

Re: Class cast exception while using Data Frames

2018-03-27 Thread Nikhil Goyal
.0) (keyTuple, sum / count.toDouble) }.toMap }) instDF.withColumn("customMap", avgMapValueUDF(col("metricMap"), lit(1))).show On Mon, Mar 26, 2018 at 11:51 PM, Shmuel Blitz <shmuel.bl...@similarweb.com> wrote: > Hi Nikhil,

Re: Class cast exception while using Data Frames

2018-03-26 Thread Nikhil Goyal
uth...@dataroots.io> wrote: > Can you give the output of “printSchema” ? > > > On 26 Mar 2018, at 22:39, Nikhil Goyal <nownik...@gmail.com> wrote: > > Hi guys, > > I have a Map[(String, String), Double] as one of my columns. Using > > input.getAs[Map[(String, String),

Class cast exception while using Data Frames

2018-03-26 Thread Nikhil Goyal
says that the key is of type struct of (string, string). Any idea why this is happening? Thanks Nikhil
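A hedged guess at the mechanics: Spark SQL encodes a tuple key as a struct, which comes back as a Row rather than a Scala tuple, so getAs must ask for Map[Row, Double] and the tuple must be rebuilt by hand (column name assumed):

```scala
import org.apache.spark.sql.Row

// The (String, String) key surfaces as a Row struct at read time.
val raw = row.getAs[Map[Row, Double]]("metricMap")
val typed: Map[(String, String), Double] =
  raw.map { case (k, v) => ((k.getString(0), k.getString(1)), v) }
```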

Using Thrift with Dataframe

2018-02-28 Thread Nikhil Goyal
Hi guys, I have an RDD of thrift structs. I want to convert it into a dataframe. Can someone suggest how I can do this? Thanks Nikhil
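Without knowing the struct, a common approach (sketched with hypothetical getId/getName accessors) is to map each Thrift object to a case class and let the implicit Encoder build the schema:

```scala
// Plain case class mirroring the Thrift fields we need.
case class Person(id: Long, name: String)

import spark.implicits._
val df = thriftRdd
  .map(t => Person(t.getId, t.getName)) // extract fields from the Thrift struct
  .toDF()
```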

Re: Job never finishing

2018-02-21 Thread Nikhil Goyal
ify-and-re-schedule-slow-running-tasks/ > > Sent from my iPhone > > On Feb 20, 2018, at 5:52 PM, Nikhil Goyal <nownik...@gmail.com> wrote: > > Hi guys, > > I have a job which gets stuck if a couple of tasks get killed due to OOM > exception. Spark doesn't kill the job an

Job never finishing

2018-02-20 Thread Nikhil Goyal
this? Thanks Nikhil
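The reply above points at speculative execution, which re-launches straggler tasks on other executors; a sketch of the relevant settings (values shown are the usual defaults, verify for your version):

```
spark.speculation             true
spark.speculation.multiplier  1.5    # how much slower than the median counts as slow
spark.speculation.quantile    0.75   # fraction of tasks that must finish first
```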

GC issues with spark job

2018-02-18 Thread Nikhil Goyal
Hi, I have a job which is spending approximately 30% of its time in GC. When I looked at the logs, it seems GC is triggering before the spill happens. I wanted to know if there is a config setting I can use to force Spark to spill early, maybe when memory is 60-70% full. Thanks Nikhil
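There is no direct "spill at N% full" switch that I know of, but shrinking the unified memory region makes execution spill to disk sooner and leaves more heap headroom for user objects and GC; a sketch:

```
spark.memory.fraction         0.5   # default 0.6; lower => earlier spills
spark.memory.storageFraction  0.5   # portion of the above protected from eviction
```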

Question about DStreamCheckpointData

2017-01-25 Thread Nikhil Goyal
Hi, I am using DStreamCheckpointData and it seems that spark checkpoints data even if the rdd processing fails. It seems to checkpoint at the moment it creates the rdd rather than waiting till its completion. Anybody knows how to make it wait till completion? Thanks Nikhil

ALS.trainImplicit block sizes

2016-10-21 Thread Nikhil Mishra
Hi, I have a question about the block size to be specified in ALS.trainImplicit() in pyspark (Spark 1.6.1). There is only one block size parameter to be specified. I want to know whether that would result in partitioning both the user and the item axes. For example, I am using the following
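For reference, a sketch in the RDD-based Scala API, where the builder form exposes the two axes separately (the single blocks argument of ALS.trainImplicit is used for both; method availability should be verified for Spark 1.6):

```scala
import org.apache.spark.mllib.recommendation.ALS

val model = new ALS()
  .setImplicitPrefs(true)
  .setRank(10)
  .setIterations(10)
  .setUserBlocks(40)    // partitioning along the user axis
  .setProductBlocks(80) // partitioning along the item axis
  .run(ratings)         // ratings: RDD[Rating]
```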

Streaming and Batch code sharing

2016-06-25 Thread Nikhil Goyal
Hi, Does anyone have a good example where realtime and batch are able to share the same code? (Other than this one: https://github.com/databricks/reference-apps/blob/master/logs_analyzer/chapter1/reuse.md ) Thanks Nikhil
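A minimal sketch of the usual pattern: keep the transformation as a function over RDDs and invoke it from both the batch job and the DStream (names and paths are placeholders):

```scala
import org.apache.spark.rdd.RDD

// Shared logic, oblivious to whether its input came from batch or streaming.
def errorCounts(lines: RDD[String]): RDD[(String, Long)] =
  lines.filter(_.contains("ERROR"))
       .map(line => (line.take(10), 1L)) // key by date prefix (assumption)
       .reduceByKey(_ + _)

// Batch:     errorCounts(sc.textFile("hdfs:///logs/2016-06-25"))
// Streaming: stream.foreachRDD(rdd => errorCounts(rdd).saveAsTextFile("..."))
```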

Re: Protobuf class not found exception

2016-05-31 Thread Nikhil Goyal
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-find-proto-buffer-class-error-with-RDD-lt-protobuf-gt-td14529.html But has this been solved? On Tue, May 31, 2016 at 3:26 PM, Nikhil Goyal <nownik...@gmail.com> wrote: > I am getting this error when I am trying to c

Protobuf class not found exception

2016-05-31 Thread Nikhil Goyal
) at java.lang.Class.forName(Class.java:190) at com.google.protobuf.GeneratedMessageLite$SerializedForm.readResolve(GeneratedMessageLite.java:768) ... 28 more The class has been packaged into the jar, and doing *.toString* also works fine. Does anyone have any idea about this? Thanks Nikhil

Re: Timed aggregation in Spark

2016-05-23 Thread Nikhil Goyal
in state. On Mon, May 23, 2016 at 1:33 PM, Ofir Kerker <ofir.ker...@gmail.com> wrote: > Yes, check out mapWithState: > https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.html
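A sketch of the mapWithState pattern suggested in the reply: keep a running aggregate per key and emit updates downstream (types and the input DStream are placeholders):

```scala
import org.apache.spark.streaming.{State, StateSpec}

// Accumulate a running sum per key; emit the updated value each batch.
def track(key: String, value: Option[Long], state: State[Long]): (String, Long) = {
  val sum = state.getOption.getOrElse(0L) + value.getOrElse(0L)
  state.update(sum)
  (key, sum)
}

// pairs: DStream[(String, Long)]
val updates = pairs.mapWithState(StateSpec.function(track _))
```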

Timed aggregation in Spark

2016-05-23 Thread Nikhil Goyal
issue and how did they handle it. Thanks Nikhil

Re: Spark session dies in about 2 days: HDFS_DELEGATION_TOKEN token can'tbe found

2016-03-14 Thread Nikhil Gs
Mine is the same scenario. I get the HDFS_DELEGATION_TOKEN issue exactly 7 days after the spark job started, and the job then gets killed. I'm also looking for a solution. Regards, Nik. On Fri, Mar 11, 2016 at 8:10 PM, Ruslan Dautkhanov wrote:

Re: Spark Token Expired Exception

2016-01-06 Thread Nikhil Gs
; On Wed, Jan 6, 2016 at 12:16 PM, Nikhil Gs <gsnikhil1432...@gmail.com> > wrote: > >> Hello Team, >> >> >> Thank you for your time in advance. >> >> >> Below are the log lines of my spark job which is used for consuming the >> messages fro

Spark Token Expired Exception

2016-01-06 Thread Nikhil Gs
Hello Team, Thank you for your time in advance. Below are the log lines of my spark job, which is used for consuming messages from a Kafka instance and loading them into HBase. I have noticed the Warn lines below, and later they resulted in errors. But I noticed that, exactly after 7 days, the token

Re: Classification model method not found

2015-12-22 Thread Nikhil Joshi
Hi Ted, Thanks. That fixed the issue :). Nikhil On Tue, Dec 22, 2015 at 1:14 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Looks like you should define ctor for ExtendedLR which accepts String > (the uid). > > Cheers > > On Tue, Dec 22, 2015 at 1:04 PM, njoshi <nikh

Re: Unable to import SharedSparkContext

2015-11-18 Thread Nikhil Joshi
e: > http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/ > > On Wed, Nov 18, 2015 at 2:25 PM, Sourigna Phetsarath < > gna.phetsar...@teamaol.com> wrote: > >> Nikhil, >> >> Please take a look at: https://github.com/holdenk/spark-te

Re: Spark Job is getting killed after certain hours

2015-11-17 Thread Nikhil Gs
e Loughran <ste...@hortonworks.com> wrote: > > On 17 Nov 2015, at 02:00, Nikhil Gs <gsnikhil1432...@gmail.com> wrote: > > Hello Team, > > Below is the error which we are facing in our cluster after 14 hours of > starting the spark submit job. Not able to understa

Re: Spark LogisticRegression returns scaled coefficients

2015-11-17 Thread Nikhil Joshi
Hi, Wonderful. I was sampling the output, but with a bug. Your comment brought the realization :). I was indeed victimized by the complete separability issue :). Thanks a lot. with regards, Nikhil On Tue, Nov 17, 2015 at 5:26 PM, DB Tsai <dbt...@dbtsai.com> wrote: > How do yo

Spark Job is getting killed after certain hours

2015-11-16 Thread Nikhil Gs
Hello Team, Below is the error we are facing in our cluster after 14 hours of starting the spark submit job. We are not able to understand the issue and why it occurs after a certain time. If any of you have faced the same scenario or have any idea, then please guide us. To

Kafka and Spark combination

2015-10-09 Thread Nikhil Gs
Has anyone worked with Kafka in a scenario where streaming data from the Kafka consumer is picked up by Spark (Java) and placed directly in HBase? Regards, Gs.

Re: Kafka streaming "at least once" semantics

2015-10-09 Thread Nikhil Gs
Hello Everyone, Has anyone worked with Kafka in a scenario where streaming data from the Kafka consumer is picked up by Spark (Java) and placed directly in HBase? Please let us know; we are completely new to this scenario, and any pointers would be very helpful. Regards, Nik.

PySpark Unknown Opcode Error

2015-05-26 Thread Nikhil Muralidhar
Hello, I am trying to run a Spark job (which runs fine on the master node of the cluster) on an HDFS Hadoop cluster using YARN. When I run the job, which has an rdd.saveAsTextFile() line in it, I get the following error: *SystemError: unknown opcode* The entire stacktrace has been appended to

Re: PySpark Job throwing IOError

2015-05-19 Thread Muralidhar, Nikhil
dictionaries that I have in shared memory without explicitly doing a broadcast. Can anyone help me understand what is going on? I have appended my python file and the stack trace to this email. Thanks, Nikhil from pyspark.mllib.linalg import SparseVector from pyspark import SparkContext import glob
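The usual answer is an explicit broadcast, which ships one read-only copy of the lookup to each executor rather than one per task; a sketch (in Scala, for consistency with the rest of this digest):

```scala
// One copy per executor, read-only on the workers.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

val mapped = rdd.map { key =>
  lookup.value.getOrElse(key, 0) // dereference on the executor side
}
```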

Re: Query data in Spark RRD

2015-02-23 Thread Nikhil Bafna
Tathagata - Yes, I'm thinking along that line. The problem is how to send the query to the backend. Bundle an HTTP server into a spark streaming job that will accept the parameters? -- Nikhil Bafna On Mon, Feb 23, 2015 at 2:04 PM, Tathagata Das t...@databricks.com wrote: You will have

Re: Query data in Spark RRD

2015-02-21 Thread Nikhil Bafna
Yes. As I understand it, it would allow me to write SQL to query a spark context. But the query needs to be specified within a deployed job. What I want is to be able to run multiple dynamic queries specified at runtime from a dashboard. -- Nikhil Bafna On Sat, Feb 21, 2015 at 8:37 PM

Query data in Spark RRD

2015-02-21 Thread Nikhil Bafna
, which will need re-aggregation from the already computed job. My query is, how can I run dynamic queries over data in schema RDDs? -- Nikhil Bafna
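A hedged sketch of the era-appropriate pattern (Spark 1.x): register the computed data as a temp table once inside the running job, then execute whatever SQL string arrives at runtime; the HTTP layer that receives queryFromDashboard is left out and its name is hypothetical:

```scala
// Inside the long-running job: register once.
schemaRDD.registerTempTable("metrics")

// For each query string arriving from the dashboard:
def run(queryFromDashboard: String): Array[org.apache.spark.sql.Row] =
  sqlContext.sql(queryFromDashboard).collect()
```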

Re: How to Integrate openNLP with Spark

2014-12-04 Thread Nikhil
Did anyone get a chance to look at this? Please provide some help. Thanks Nikhil

How to Integrate openNLP with Spark

2014-12-01 Thread Nikhil
I am not sure how to do so. Though Philip Ogren gave a very nice presentation at Spark Summit, I am still confused. Can someone please provide an end-to-end example of this? I am new to Spark and UIMAFit and recently started working with them. Thanks Nikhil Jain
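A hedged end-to-end sketch of the usual way to use a non-serializable NLP model in Spark: load it once per partition with mapPartitions (model file distribution, e.g. via --files, is assumed; shown with OpenNLP's tokenizer rather than UIMAFit):

```scala
import java.io.FileInputStream
import opennlp.tools.tokenize.{TokenizerME, TokenizerModel}

// lines: RDD[String]; load the model once per partition, not per record.
val tokens = lines.mapPartitions { iter =>
  val in = new FileInputStream("en-token.bin") // assumed local on each executor
  val tokenizer = new TokenizerME(new TokenizerModel(in))
  iter.map(line => tokenizer.tokenize(line).toSeq)
}
```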