Hello,
I'm trying to run multiple Spark jobs on a cluster running on YARN.
The master is a 24 GB server, with 6 slaves of 12 GB each.
fairscheduler.xml settings are:

    <pool name="...">
      <schedulingMode>FAIR</schedulingMode>
      <weight>10</weight>
      <minShare>2</minShare>
    </pool>
I am running 8 jobs simultaneously. The jobs do run in parallel, but not
all of them: at any given time only 7 of them run simultaneously.
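In case it helps to compare setups, here is a minimal sketch of how jobs end
up in a fair-scheduler pool; the pool name, file path, and job bodies are
placeholders, not from this message:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: 8 concurrent jobs submitted from separate threads into
    // a fair-scheduler pool. Pool name and file path are placeholders.
    val spark = SparkSession.builder()
      .appName("fair-pools")
      .config("spark.scheduler.mode", "FAIR")
      .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
      .getOrCreate()

    val threads = (1 to 8).map { i =>
      new Thread(new Runnable {
        override def run(): Unit = {
          // jobs inherit the pool set on their launching thread
          spark.sparkContext.setLocalProperty("spark.scheduler.pool", "mypool")
          spark.range(0, 1000000L * i).count()  // stand-in for a real job
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())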
I have a Spark application which creates multiple sessions. Each of these
sessions can run jobs in parallel. I want to log some details about the
execution of these jobs, but want to tag them with the session they were
called from.
I tried creating a listener from within each session
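One workaround, sketched below under the assumption that all the sessions
share one SparkContext (newSession() keeps the context): register a single
listener and tag each job through a thread-local property. The property name
"sessionId" is an invention for illustration.

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("tagged-jobs").getOrCreate()

    // Listeners are per-SparkContext, not per-session, so register one and
    // recover the session tag from the job's local properties.
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
        val sessionId = Option(jobStart.properties)
          .flatMap(p => Option(p.getProperty("sessionId")))
          .getOrElse("unknown")
        println(s"job ${jobStart.jobId} started by session $sessionId")
      }
    })

    // Each session tags the thread it launches jobs from.
    val session = spark.newSession()
    session.sparkContext.setLocalProperty("sessionId", "session-1")
    session.range(0, 100).count()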
Hello,
Yes, sure, they can return complex types. For example, the functions
collect_list and collect_set return an ArrayType.
On 10 February 2018 at 14:28, kant kodali wrote:
> Hi All,
>
> Can UDAFs return complex types? Like, say, a Map with key as an Integer and
> the value
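To make that concrete, here is a small sketch of a UDAF whose result is a
MapType; the per-key counting logic is an arbitrary example, not something
from this thread:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    // Example UDAF returning MapType(IntegerType, LongType): counts rows per key.
    class CountByKey extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("key", IntegerType) :: Nil)
      def bufferSchema: StructType =
        StructType(StructField("counts", MapType(IntegerType, LongType)) :: Nil)
      def dataType: DataType = MapType(IntegerType, LongType)
      def deterministic: Boolean = true

      def initialize(buffer: MutableAggregationBuffer): Unit =
        buffer(0) = Map.empty[Int, Long]

      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) {
          val m = buffer.getMap[Int, Long](0)
          val k = input.getInt(0)
          buffer(0) = m.toMap + (k -> (m.getOrElse(k, 0L) + 1L))
        }

      def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
        val m1 = b1.getMap[Int, Long](0).toMap
        val m2 = b2.getMap[Int, Long](0)
        b1(0) = m2.foldLeft(m1) { case (acc, (k, v)) =>
          acc + (k -> (acc.getOrElse(k, 0L) + v))
        }
      }

      def evaluate(buffer: Row): Any = buffer.getMap[Int, Long](0)
    }

    // usage: spark.udf.register("count_by_key", new CountByKey)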
Hi All,
I am using the latest version of EMR to overwrite Parquet files to an S3 bucket
encrypted with a KMS key. I am seeing the attached error whenever I overwrite a
Parquet file. For example, the code below produces the attached error and
stacktrace:
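(The code and stacktrace referenced here were attached and are not shown.
Purely as a stand-in, a write of the kind described might look like the
following; the bucket name and path are placeholders, and a SparkSession
`spark` is assumed to be in scope:)

    // Stand-in only: overwrite a Parquet dataset on a KMS-encrypted bucket.
    val df = spark.range(0, 1000).toDF("id")
    df.write
      .mode("overwrite")
      .parquet("s3://my-kms-encrypted-bucket/path/to/table")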
I use Spark 2.2.1 with streaming, and when I open the Spark Streaming
UI I can see input metadata for each of my batches. In my case I
stream from Kafka, and in the metadata section I find useful
information about my topic, partitions, and offsets.
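For what it's worth, the same per-batch metadata the UI renders is also
reachable programmatically through a StreamingListener (assuming a
StreamingContext `ssc` is in scope):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // Each input stream reports a StreamInputInfo per batch; its metadata
    // description is what the UI shows (topics, partitions, offsets for Kafka).
    ssc.addStreamingListener(new StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        batch.batchInfo.streamIdToInputInfo.foreach { case (streamId, info) =>
          println(s"stream $streamId: ${info.metadataDescription.getOrElse("")}")
        }
      }
    })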
Assume the URL for this batch looks like
Hi Priyank
I have a similar structure, although I am reading from Kafka and sinking to
multiple MySQL tables. My input stream has multiple message types and each
is headed for a different MySQL table.
I've looked for a solution for a few months, and have only come up with two
alternatives:
1.
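(The list of alternatives is cut off above. Purely for illustration, and not
necessarily one of the alternatives alluded to, a common pattern for this kind
of fan-out is one filtered query per message type; the "msgType" column, the
broker, and the JDBC details are all placeholders:)

    import org.apache.spark.sql.{DataFrame, ForeachWriter, Row, SparkSession}
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("fanout").getOrCreate()

    // One Kafka source, one filtered streaming query per message type.
    val input: DataFrame = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS msgType", "CAST(value AS STRING) AS payload")

    def startQueryFor(msgType: String, table: String) =
      input.filter(col("msgType") === msgType)
        .writeStream
        .foreach(new ForeachWriter[Row] {
          def open(partitionId: Long, version: Long): Boolean = true  // open a JDBC connection
          def process(row: Row): Unit = ()                            // INSERT INTO `table`
          def close(errorOrNull: Throwable): Unit = ()                // close the connection
        })
        .start()

    val qA = startQueryFor("typeA", "table_a")
    val qB = startQueryFor("typeB", "table_b")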
Hi Mina
I believe this is different for Structured Streaming from Kafka,
specifically. I'm assuming you are using structured streaming based on the
name of the dependency ("spark-streaming-kafka"). There is a note in the
docs here:
Hi Spark users!
I noticed that Spark doesn't allow Python apps to run in cluster mode on a
Spark standalone cluster. Does anyone know the reason? I checked JIRA but
couldn't find anything relevant.
Thanks,
Ashwin
I had a similar issue, and I think that's where the structured streaming
design falls short.
Seems like Question #2 in your email is a viable workaround for you.
In my case, I have a custom Sink backed by an efficient in-memory column
store suited for fast ingestion.
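For reference, here is a skeleton of what such a custom Sink looks like
against Spark 2.x's Sink API (an internal, pre-DataSourceV2 interface); the
actual ingestion into the column store is elided:

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.execution.streaming.Sink
    import org.apache.spark.sql.sources.StreamSinkProvider
    import org.apache.spark.sql.streaming.OutputMode

    // Skeleton only: batches can be replayed after a failure, so addBatch
    // should be idempotent with respect to batchId.
    class ColumnStoreSink(parameters: Map[String, String]) extends Sink {
      override def addBatch(batchId: Long, data: DataFrame): Unit = {
        data.collect().foreach { row =>
          // ingest `row` into the in-memory column store (details elided)
        }
      }
    }

    class ColumnStoreSinkProvider extends StreamSinkProvider {
      override def createSink(
          sqlContext: SQLContext,
          parameters: Map[String, String],
          partitionColumns: Seq[String],
          outputMode: OutputMode): Sink = new ColumnStoreSink(parameters)
    }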
I have a Kafka stream coming from
In 2.3, stream-to-stream joins (both inner and outer) are implemented using
the symmetric hash join (SHJ) algorithm, and that is a good choice,
and I am sure you compared it with other families of algorithms like XJoin
and non-blocking sort-based algorithms like progressive merge join (PMJ
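For anyone following along, a minimal sketch of what a 2.3 stream-stream
inner join looks like at the API level; the rate source and the column names
are stand-ins for real Kafka inputs:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.expr

    val spark = SparkSession.builder().appName("shj-sketch").getOrCreate()

    val impressions = spark.readStream.format("rate").load()
      .withColumnRenamed("value", "adId")
      .withWatermark("timestamp", "10 minutes")

    val clicks = spark.readStream.format("rate").load()
      .withColumnRenamed("value", "clickAdId")
      .withColumnRenamed("timestamp", "clickTime")
      .withWatermark("clickTime", "20 minutes")

    // The watermarks plus the time bound let the join prune per-key state on
    // both sides, which is what keeps the symmetric hash join bounded.
    val joined = impressions.join(
      clicks,
      expr("clickAdId = adId AND " +
           "clickTime BETWEEN timestamp AND timestamp + INTERVAL 1 HOUR"))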
Hello - I'm writing a Scala unit test for my Spark project
which checks the git information, and somehow it is not working from the
unit test.
Added in pom.xml:
    <plugin>
      <groupId>pl.project13.maven</groupId>
      <artifactId>git-commit-id-plugin</artifactId>
      <version>2.2.4</version>
    </plugin>
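One thing worth checking (an assumption about the setup, not a confirmed
diagnosis): git-commit-id-plugin only writes a git.properties file when
generateGitPropertiesFile is enabled in its configuration, and the file then
has to be on the test classpath for a unit test to see it. A test could load
it like this:

    // Hedged sketch: load the generated git.properties from the test classpath.
    // The resource name and the "git.commit.id" key are the plugin's defaults.
    val in = getClass.getResourceAsStream("/git.properties")
    require(in != null, "git.properties not found on the test classpath")
    val props = new java.util.Properties()
    try props.load(in) finally in.close()
    println(props.getProperty("git.commit.id"))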
Folks,
I have a time series table with each record having 350 columns.
The primary key is ((date, bucket), objectid, timestamp).
The objective is to read 1 day's worth of data, which comes to around 12k
partitions; each partition has around 25 MB of data.
I see only 1 task active during the read.
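The ((date, bucket), ...) key looks like Cassandra, so assuming the
spark-cassandra-connector is in play (an assumption; the message doesn't say),
read parallelism is driven by how the token range is split into input tasks,
and shrinking the split size is one lever; the property name and its default
vary slightly across connector versions, and the keyspace/table names below
are placeholders:

    import org.apache.spark.sql.SparkSession

    // Hedged sketch, assuming spark-cassandra-connector 2.x.
    val spark = SparkSession.builder()
      .appName("timeseries-read")
      // smaller splits => more input tasks
      .config("spark.cassandra.input.split.size_in_mb", "16")
      .getOrCreate()

    val day = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "timeseries"))  // placeholders
      .load()
      .filter("date = '2018-02-10'")  // placeholder predicate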
Hi,
I just wanted to note that on the API doc page for the Pregel operator
(GraphX API for Spark 2.2.1):
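(The rest of the message is cut off. For anyone searching the archives, the
canonical Pregel usage from the GraphX programming guide is reproduced below
for context; it assumes a SparkContext `sc` is in scope:)

    import org.apache.spark.graphx._
    import org.apache.spark.graphx.util.GraphGenerators

    // Single-source shortest paths via Pregel, per the GraphX guide.
    val graph: Graph[Long, Double] =
      GraphGenerators.logNormalGraph(sc, numVertices = 100)
        .mapEdges(e => e.attr.toDouble)
    val sourceId: VertexId = 42L
    val initialGraph = graph.mapVertices((id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity)

    val sssp = initialGraph.pregel(Double.PositiveInfinity)(
      (id, dist, newDist) => math.min(dist, newDist),  // vertex program
      triplet =>                                        // send message
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        else Iterator.empty,
      (a, b) => math.min(a, b)                          // merge messages
    )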