in S3?
Thanks
Nikhil
If multiple applications are running, would we need multiple Spark Connect
servers? If so, is the user responsible for creating these servers, or are
they created on the fly when the user requests a new Spark session?
On Thu, Dec 14, 2023 at 10:28 AM Nikhil Goyal wrote:
> Hi folks,
>
Hi folks,
I am trying to understand one question: does Spark Connect create a new
driver in the backend for every user, or are there a fixed number of drivers
running to which requests are sent?
Thanks
Nikhil
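For illustration, a minimal sketch of opening a session against an
already-running Spark Connect server, assuming the Spark 3.4+ Scala client
(the server address is a placeholder):

    import org.apache.spark.sql.SparkSession

    // Each remote session is multiplexed onto the shared Connect server;
    // the server process (and its driver) must already be running.
    val spark = SparkSession.builder()
      .remote("sc://connect-server.example.com:15002")
      .getOrCreate()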
Hi folks,
When running Spark on K8s, what would happen to shuffle data if an executor
is terminated or lost? Since there is no shuffle service, does all the work
done by that executor get recomputed?
Thanks
Nikhil
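For what it's worth, a hedged sketch of the decommissioning settings that
exist in Spark 3.1+, which can migrate shuffle blocks off an executor before
it is removed (whether they apply depends on your version and setup):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.decommission.enabled", "true")
      // Migrate shuffle and cached blocks to surviving executors on decommission.
      .set("spark.storage.decommission.enabled", "true")
      .set("spark.storage.decommission.shuffleBlocks.enabled", "true")
      .set("spark.storage.decommission.rddBlocks.enabled", "true")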
Hi folks,
Is there an equivalent of the YARN RM page for Spark on Kubernetes? We can
port-forward the UI from the driver pod for each job, but this process is
tedious given we have multiple jobs running. Is there a clever solution for
exposing all driver UIs in a centralized place?
Thanks
Nikhil
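One hedged option (an assumption, not from this thread): have every job write
event logs to shared storage and run a single Spark History Server over them,
which gives one centralized UI for all jobs. The bucket name is a placeholder:

    import org.apache.spark.SparkConf

    // Per-job settings; the History Server is then pointed at the same
    // path via spark.history.fs.logDirectory.
    val conf = new SparkConf()
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "s3a://my-bucket/spark-events")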
Is it possible to use MultipleOutputs, define a custom OutputFormat, and
then use `saveAsHadoopFile` to achieve this?
On Thu, Apr 20, 2023 at 1:29 PM Nikhil Goyal wrote:
> Hi folks,
>
> We are writing a dataframe and doing a partitionby() on it.
> df.write.part
source, but I am unable to see if we can really control this behavior in the
sink. If anyone has any suggestions, please let me know.
Thanks
Nikhil
Hi folks,
I am trying to understand the difference between running 8G 1-core
executors vs 40G 5-core executors. I see that on YARN this can cause
bin-packing issues, but other than that, are there any pros and cons to
using either?
Thanks
Nikhil
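For concreteness, the two shapes expressed as configs (values straight from
the question):

    import org.apache.spark.SparkConf

    // Many small executors: one task per JVM, less GC pressure per heap,
    // but broadcasts are not shared across tasks in one process.
    val small = new SparkConf()
      .set("spark.executor.memory", "8g")
      .set("spark.executor.cores", "1")

    // Fewer large executors: tasks share broadcasts and fixed overheads,
    // but losing a single executor costs more work.
    val large = new SparkConf()
      .set("spark.executor.memory", "40g")
      .set("spark.executor.cores", "5")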
Hi folks,
We are experiencing slowness in the Spark History Server, so we are trying to
find which config properties we can tune to fix the issue. I found that
SPARK_DAEMON_MEMORY controls memory; similarly, is there a config property to
increase the number of threads?
Thanks
Nikhil
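A hedged pointer: as far as I know the History Server reads spark.history.*
properties from its own configuration, and spark.history.fs.numReplayThreads
controls the event-log replay thread pool, e.g. in the daemon's environment:

    SPARK_DAEMON_MEMORY=8g
    SPARK_HISTORY_OPTS="-Dspark.history.fs.numReplayThreads=48"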
Hi folks,
We are running a job on our on-prem K8s cluster but writing the output to
S3. We noticed that all the executors finish in < 1h, but the driver takes
another 5h to finish. Logs:
22/11/22 02:08:29 INFO BlockManagerInfo: Removed broadcast_3_piece0 on
10.42.145.11:39001 in memory (size:
The docs at
<https://spark.apache.org/docs/latest/running-on-kubernetes.html#future-work>
say that a shuffle service is not yet available.
Thanks
Nikhil
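A hedged guess at the usual culprit: with the default file output committer,
the driver serially renames every task's output at commit time, which is very
slow on S3. Settings along these lines exist; whether they help depends on
your Hadoop/Spark combination:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // v2 commits task output directly, avoiding the serial driver-side rename.
      .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      // With a recent hadoop-aws, the S3A "magic" committer avoids renames entirely.
      .set("spark.hadoop.fs.s3a.committer.name", "magic")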
Hi all,
Is there a way to use dataframe.partitionBy("col") and control the number
of output files without doing a full repartition? The thing is, some
partitions have more data while some have less, and doing a .repartition is
a costly operation. We want to control the size of the output files. Is it
which
does the sorting, but that complicates the code. So is there a clever way to
sort records after they have been partitioned?
Thanks
Nikhil
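One hedged trick (an assumption about your layout, not from the thread):
repartition by the partition column first, so each Hive-style partition is
written by a single task and the file count stays bounded. The column name
and path are placeholders:

    import org.apache.spark.sql.functions.col

    df.repartition(col("date"))
      .write
      .partitionBy("date")
      .parquet("s3a://bucket/output")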
On Fri, Jun 3, 2022 at 9:38 AM Enrico Minack wrote:
> Nikhil,
>
> What are you trying to achieve with this in the first place? What are your
>
artitionBy or before? Basically, would Spark first partition by col2 and
then sort by col1, or sort first and then partition?
Thanks
Nikhil
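A hedged sketch of the workaround I would try: make the ordering explicit
instead of relying on partitionBy, using sortWithinPartitions after
distributing by the partition key (column names are from the question):

    import org.apache.spark.sql.functions.col

    // Co-locate each col2 value in one partition, then sort by col1 inside
    // it, so the order no longer depends on when partitionBy runs.
    val sorted = df.repartition(col("col2")).sortWithinPartitions("col1")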
ion:
java.lang.ClassCastException: cannot assign instance of
java.lang.invoke.SerializedLambda to field
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in
instance of org.apache.spark.rdd.MapPartitionsRDD
I was wondering if anyone has seen this before.
Thanks
Nikhil
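For what it's worth, this ClassCastException is often reported when the
application jar isn't on the executors' classpath, so the lambda deserializes
against a mismatched class. A hedged sketch of one thing to check (the path
is a placeholder):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setJars(Seq("/path/to/application.jar")) // ship the job's classes to executors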
conf = SparkConf() \
    .setMaster("yarn") \
    .set("spark.submit.deployMode", "cluster")
sc = SparkContext(conf=conf)
Is the SparkContext being created on the application master or on the
machine where this Python process is run? (Note: conf must be passed as a
keyword argument; positionally it would bind to the master parameter.)
Thanks
Nikhil
"Confidentiality Warning: This message and any attachments are intended only
for the use of the intended recipient(s).
are confidential and may be privileged. If you are not the intended recipient.
you are hereby notified that any
review. re-transmission. conversion to hard copy.
Thanks for explaining in such detail and pointing to the source code.
Yes, it's helpful and cleared up a lot of confusion.
Hi Stavros,
Thanks a lot for pointing me in the right direction. I got stuck in a
release, so I didn't get time earlier.
The mistake was "LINUX_APP_RESOURCE": I was using "local" when it should
have been "file". I got there thanks to your email.
What I understood:
Driver image : $SPARK_HOME/bin
Environment:
Spark: 2.4.0
Kubernetes:1.14
Query: does the application jar need to be part of both the driver and
executor images?
Invocation point (from Java code):
sparkLaunch = new SparkLauncher()
.setMaster(LINUX_MASTER)
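For completeness, a hedged fuller sketch of the SparkLauncher call with the
fix described above (all values are placeholders):

    import org.apache.spark.launcher.SparkLauncher

    val launcher = new SparkLauncher()
      .setMaster("k8s://https://kubernetes.example.com:6443")
      .setDeployMode("cluster")
      // "file:", not "local:", when the jar lives on the submitting machine.
      .setAppResource("file:///opt/app/app.jar")
      .setMainClass("com.example.Main")
    val handle = launcher.startApplication()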
dataframe to get the right record (all the
metrics).
Or I can create a single column with all the records and then implement a
UDAF in Scala and use it in PySpark.
Neither solution seems straightforward. Is there a simpler solution to this?
Thanks
Nikhil
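If the UDAF route is taken, a hedged sketch (Spark 3.x API, simplified to a
plain sum) of a Scala Aggregator registered for SQL, which PySpark can then
call through spark.sql or F.expr:

    import org.apache.spark.sql.expressions.Aggregator
    import org.apache.spark.sql.{Encoder, Encoders, functions}
    import spark.implicits._

    object SumAgg extends Aggregator[Double, Double, Double] {
      def zero: Double = 0.0
      def reduce(b: Double, a: Double): Double = b + a
      def merge(b1: Double, b2: Double): Double = b1 + b2
      def finish(r: Double): Double = r
      def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
      def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }

    // Callable from PySpark as: spark.sql("SELECT sum_agg(x) FROM t")
    spark.udf.register("sum_agg", functions.udaf(SumAgg))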
Fixed this by setting fileformat -> "parquet"
On Mon, Oct 22, 2018 at 11:48 AM Nikhil Goyal wrote:
> Hi guys,
>
> My code is failing with this error
>
> java.lang.Exception: S2V: FATAL ERROR for job S2V_job9197956021769393773.
> Job status information is
format("com.vertica.spark.datasource.DefaultSource")
.options(connectionProperties)
.mode(SaveMode.Append)
.save()
Does anybody have any idea how to fix this?
Thanks
Nikhil
t 16, 2018 at 7:24 PM Nikhil Goyal wrote:
>
>> Hi guys,
>>
>> I am trying to write dataframe to vertica using spark. It seems like
>> spark is creating a temp table under public schema. I don't have access to
>> public schema hence the job is failing. Is there a
to create job status table
public.S2V_JOB_STATUS_USER_NGOYAL
java.lang.Exception: S2V: FATAL ERROR for job S2V_job8087339107009511230.
Unable to create status table for tracking this job:
public.S2V_JOB_STATUS_USER_NGOYAL
Thanks
Nikhil
be the reason?
Thanks
Nikhil
Hi guys,
I was wondering if there is a way to compress output files using zstd. It
seems zstd compression can be used for shuffle data; is there a way to use
it for output data as well?
Thanks
Nikhil
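A hedged sketch, assuming a Spark/Parquet build where the zstd codec is
wired in (support landed at different times across versions); the path is a
placeholder:

    // Per-write:
    df.write.option("compression", "zstd").parquet("s3a://bucket/out")

    // Or globally, for all Parquet output:
    spark.conf.set("spark.sql.parquet.compression.codec", "zstd")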
.0)
(keyTuple, sum / count.toDouble)
}.toMap
})
instDF.withColumn("customMap", avgMapValueUDF(col("metricMap"), lit(1))).show
On Mon, Mar 26, 2018 at 11:51 PM, Shmuel Blitz <shmuel.bl...@similarweb.com>
wrote:
> Hi Nikhil,
uth...@dataroots.io>
wrote:
> Can you give the output of “printSchema” ?
>
>
> On 26 Mar 2018, at 22:39, Nikhil Goyal <nownik...@gmail.com> wrote:
>
> Hi guys,
>
> I have a Map[(String, String), Double] as one of my columns. Using
>
> input.getAs[Map[(String, String), Double]]
The schema says that the key is of type struct of (string, string).
Any idea why this is happening?
Thanks
Nikhil
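A hedged workaround sketch: since the tuple key surfaces as a struct, reading
the key as a Row and rebuilding the tuple by hand may work (column name from
the thread):

    import org.apache.spark.sql.Row

    val raw = row.getAs[Map[Row, Double]]("metricMap")
    val typed = raw.map { case (k, v) => ((k.getString(0), k.getString(1)), v) }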
Hi guys,
I have an RDD of thrift structs. I want to convert it into a dataframe. Can
someone suggest how I can do this?
Thanks
Nikhil
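One hedged route (nothing thrift-specific about it): map each struct to a
plain case class and use toDF; the field names and getters here are
placeholders:

    case class Metric(id: Long, name: String)

    import spark.implicits._
    val df = rdd.map(t => Metric(t.getId, t.getName)).toDF()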
ify-and-re-schedule-slow-running-tasks/
>
> Sent from my iPhone
>
> On Feb 20, 2018, at 5:52 PM, Nikhil Goyal <nownik...@gmail.com> wrote:
>
> Hi guys,
>
> I have a job which gets stuck if a couple of tasks get killed due to OOM
> exception. Spark doesn't kill the job an
this?
Thanks
Nikhil
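For reference, a hedged sketch of the speculation settings (real config keys;
the values are illustrative) that re-launch slow tasks, which is what the
linked post appears to describe:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.speculation", "true")
      .set("spark.speculation.multiplier", "2")   // "slow" = 2x the median runtime
      .set("spark.speculation.quantile", "0.75")  // after 75% of tasks finish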
Hi,
I have a job which is spending approx 30% of its time in GC. When I looked
at the logs, it seems GC is triggered before the spill happens. I wanted to
know if there is a config setting I can use to force Spark to spill early,
maybe when memory is 60-70% full.
Thanks
Nikhil
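I'm not aware of a direct "spill at N% heap" switch, but a hedged sketch of
the knobs (on Spark 1.6+'s unified memory manager) that make Spark spill
sooner by shrinking execution/storage memory:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.memory.fraction", "0.4")        // default 0.6; the rest is left to the JVM
      .set("spark.memory.storageFraction", "0.3") // default 0.5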
Hi,
I am using DStreamCheckpointData, and it seems that Spark checkpoints data
even if the RDD processing fails. It seems to checkpoint at the moment it
creates the RDD rather than waiting for its completion. Does anybody know
how to make it wait until completion?
Thanks
Nikhil
Hi,
I have a question about the block size to be specified in
ALS.trainImplicit() in pyspark (Spark 1.6.1). There is only one block-size
parameter to be specified. I want to know if it would partition both the
user axis and the item axis.
For example, I am using the following
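(The original snippet is elided above.) Purely as an illustration, the Scala
ml API exposes the two axes separately, which suggests the single pyspark
blocks parameter is applied to both; the values here are placeholders:

    import org.apache.spark.ml.recommendation.ALS

    val als = new ALS()
      .setImplicitPrefs(true)
      .setRank(10)
      .setNumUserBlocks(20)  // partitioning of the user axis
      .setNumItemBlocks(20)  // partitioning of the item axis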
Hi,
Does anyone have a good example where realtime and batch are able to share
the same code?
(Other than this one
https://github.com/databricks/reference-apps/blob/master/logs_analyzer/chapter1/reuse.md
)
Thanks
Nikhil
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-find-proto-buffer-class-error-with-RDD-lt-protobuf-gt-td14529.html
But has this been solved?
On Tue, May 31, 2016 at 3:26 PM, Nikhil Goyal <nownik...@gmail.com> wrote:
> I am getting this error when I am trying to c
)
at java.lang.Class.forName(Class.java:190)
at
com.google.protobuf.GeneratedMessageLite$SerializedForm.readResolve(GeneratedMessageLite.java:768)
... 28 more
The class has been packaged into the jar, and *.toString* works
fine.
Does anyone have any idea about this?
Thanks
Nikhil
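A hedged note: the commonly suggested remedy for protobuf classes inside
RDDs is Kryo with a registrator that binds a protobuf-aware serializer (the
registrator class here is a placeholder you would implement yourself):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.ProtobufRegistrator")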
in
state.
On Mon, May 23, 2016 at 1:33 PM, Ofir Kerker <ofir.ker...@gmail.com> wrote:
> Yes, check out mapWithState:
>
> https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.html
>
issue and
how did they handle it.
Thanks
Nikhil
Mine is the same scenario: I get the HDFS_DELEGATION_TOKEN issue exactly 7
days after the Spark job starts, and the job then gets killed.
I'm also looking for a solution.
Regards,
Nik.
On Fri, Mar 11, 2016 at 8:10 PM, Ruslan Dautkhanov
wrote:
> On Wed, Jan 6, 2016 at 12:16 PM, Nikhil Gs <gsnikhil1432...@gmail.com>
> wrote:
>
>> Hello Team,
>>
>>
>> Thank you for your time in advance.
>>
>>
>> Below are the log lines of my spark job which is used for consuming the
>> messages fro
Hello Team,
Thank you for your time in advance.
Below are the log lines of my Spark job, which consumes messages from a
Kafka instance and loads them into HBase. I noticed the Warn lines below,
and they later resulted in errors. But I noticed that,
exactly after 7 days, the token
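A hedged pointer for the 7-day pattern: on kerberized clusters the default
delegation-token lifetime is 7 days, and the usual fix on these Spark
versions was to hand Spark a keytab so it can re-obtain tokens itself
(principal and path are placeholders):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.yarn.principal", "user@EXAMPLE.COM")
      .set("spark.yarn.keytab", "/etc/security/keytabs/user.keytab")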
Hi Ted,
Thanks. That fixed the issue :).
Nikhil
On Tue, Dec 22, 2015 at 1:14 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Looks like you should define ctor for ExtendedLR which accepts String
> (the uid).
>
> Cheers
>
> On Tue, Dec 22, 2015 at 1:04 PM, njoshi <nikh
e:
> http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/
>
> On Wed, Nov 18, 2015 at 2:25 PM, Sourigna Phetsarath <
> gna.phetsar...@teamaol.com> wrote:
>
>> Nikhil,
>>
>> Please take a look at: https://github.com/holdenk/spark-te
e Loughran <ste...@hortonworks.com>
wrote:
>
> On 17 Nov 2015, at 02:00, Nikhil Gs <gsnikhil1432...@gmail.com> wrote:
>
> Hello Team,
>
> Below is the error which we are facing in our cluster after 14 hours of
> starting the spark submit job. Not able to understa
Hi,
Wonderful. I was sampling the output, but with a bug. Your comment brought
the realization :). I was indeed victimized by the complete separability
issue :).
Thanks a lot.
with regards,
Nikhil
On Tue, Nov 17, 2015 at 5:26 PM, DB Tsai <dbt...@dbtsai.com> wrote:
> How do yo
Hello Team,
Below is the error we are facing in our cluster after 14 hours of
starting the spark submit job. We are not able to understand the issue and
why it hits the below error after a certain time.
If any of you have faced the same scenario or if you have any idea then
please guide us. To
Hello Everyone,
Has anyone worked with Kafka in a scenario where streaming data from the
Kafka consumer is picked up by Spark (Java) functionality and placed
directly in HBase?
Please let us know; we are completely new to this scenario. That would be
very helpful.
Regards,
GS.
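For anyone in the same spot, a hedged end-to-end sketch using the APIs of
that era (spark-streaming-kafka's 0.8 direct stream and the HBase 1.x
client); brokers, topic, table, and column family are all placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    object KafkaToHBase {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("kafka-to-hbase"), Seconds(10))
        val stream = KafkaUtils
          .createDirectStream[String, String, StringDecoder, StringDecoder](
            ssc, Map("metadata.broker.list" -> "broker:9092"), Set("events"))

        stream.foreachRDD { rdd =>
          rdd.foreachPartition { records =>
            // One HBase connection per partition, not per record.
            val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
            val table = conn.getTable(TableName.valueOf("events"))
            records.foreach { case (key, value) =>
              val put = new Put(Bytes.toBytes(key)) // assumes non-null Kafka keys
              put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(value))
              table.put(put)
            }
            table.close(); conn.close()
          }
        }
        ssc.start()
        ssc.awaitTermination()
      }
    }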
Hello,
I am trying to run a Spark job (which runs fine on the master node of the
cluster) on an HDFS Hadoop cluster using YARN. When I run the job, which has
an rdd.saveAsTextFile() line in it, I get the following error:
*SystemError: unknown opcode*
The entire stacktrace has been appended to
dictionaries that I
have in shared memory without explicitly doing a broadcast.
Can anyone help me understand what is going on?
I have appended my python file and the stack trace to this email.
Thanks,
Nikhil
from pyspark.mllib.linalg import SparseVector
from pyspark import SparkContext
import glob
Tathagata - yes, I'm thinking along those lines.
The problem is how to send the query to the backend. Bundle an HTTP server
into a Spark Streaming job that will accept the parameters?
--
Nikhil Bafna
On Mon, Feb 23, 2015 at 2:04 PM, Tathagata Das t...@databricks.com wrote:
You will have
Yes. As I understand it, that would allow me to write SQL to query a Spark
context, but the query needs to be specified within a deployed job.
What I want is to be able to run multiple dynamic queries specified at
runtime from a dashboard.
--
Nikhil Bafna
On Sat, Feb 21, 2015 at 8:37 PM
, which will need re-aggregation from the already computed job.
My query is, how can I run dynamic queries over data in schema RDDs?
--
Nikhil Bafna
Did anyone get a chance to look at this?
Please provide some help.
Thanks
Nikhil
am not sure how to do so. Though Philip Ogren gave a very nice presentation
at Spark Summit, I am still confused.
Can someone please provide an end-to-end example of this? I am new to Spark
and UIMAFit and recently started working with them.
Thanks
Nikhil Jain