Folks,
I have a time series table with 350 columns per record.
The primary key is ((date, bucket), objectid, timestamp).
The objective is to read one day's worth of data, which comes to around 12k
partitions, with each partition holding around 25MB of data.
I see only 1 task active during the read.
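For context, a read of this shape would look roughly like the sketch below; I'm assuming the table lives in Cassandra and is read through the DataStax spark-cassandra-connector, and the keyspace, table, and column names are hypothetical. The split-size setting is the connector knob that controls how many read tasks get created.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("one-day-read")
  // Smaller input splits generally yield more (and smaller) read tasks.
  .config("spark.cassandra.input.split.sizeInMB", "64")
  .getOrCreate()

val oneDay = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ts", "table" -> "events"))
  .load()
  .filter(col("date") === "2018-02-10")  // hypothetical date column/value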
Hello - I'm writing a Scala unit test for my Spark project which checks the
git information, and somehow it is not working from the unit test.
Added in pom.xml:

<plugin>
  <groupId>pl.project13.maven</groupId>
  <artifactId>git-commit-id-plugin</artifactId>
  <version>2.2.4</version>
</plugin>
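The check itself is a minimal sketch along these lines, assuming the plugin is configured with generateGitPropertiesFile=true so that a git.properties file lands on the test classpath; "git.commit.id" is one of the keys the plugin writes, and the suite name is hypothetical:

import java.util.Properties
import org.scalatest.FunSuite

class GitInfoSuite extends FunSuite {
  test("git.properties is present and has a commit id") {
    // The plugin writes git.properties into the build output when enabled.
    val in = getClass.getResourceAsStream("/git.properties")
    assert(in != null, "git.properties not found on the test classpath")
    val props = new Properties()
    try props.load(in) finally in.close()
    assert(props.getProperty("git.commit.id") != null)
  }
}

If the file is missing at test time, one common cause is that generateGitPropertiesFile was never enabled, so the plugin runs but writes nothing to the classpath.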
Hi,
I just wanted to note that on the API doc page for the Pregel operator
(GraphX API for Spark 2.2.1):
In 2.3, stream-to-stream joins (both inner and outer) are implemented using
the symmetric hash join (SHJ) algorithm, which is a good choice, and I am
sure you compared it with other families of algorithms such as XJoin and
non-blocking sort-based algorithms like progressive merge join (PMJ).
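For reference, the user-facing shape of those 2.3 stream-stream joins is the watermark-plus-time-bound join below; this is a minimal sketch modeled on the docs, with sources, column names, and time bounds all hypothetical, and spark an existing SparkSession.

import org.apache.spark.sql.functions.expr

// Two rate-source streams standing in for real inputs.
val impressions = spark.readStream.format("rate").load()
  .withColumnRenamed("value", "adId")
  .withColumnRenamed("timestamp", "impressionTime")
  .withWatermark("impressionTime", "2 hours")

val clicks = spark.readStream.format("rate").load()
  .withColumnRenamed("value", "clickAdId")
  .withColumnRenamed("timestamp", "clickTime")
  .withWatermark("clickTime", "3 hours")

// Inner join with an explicit event-time bound so join state can be pruned.
val joined = impressions.join(
  clicks,
  expr("clickAdId = adId AND clickTime >= impressionTime AND clickTime <= impressionTime + interval 1 hour"))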
Hi Spark users!
I noticed that Spark doesn't allow Python apps to run in cluster mode on a
Spark standalone cluster. Does anyone know the reason? I checked JIRA but
couldn't find anything relevant.
Thanks,
Ashwin
I had a similar issue, and I think that's where the Structured Streaming
design falls short.
Question #2 in your email seems like a viable workaround for you.
In my case, I have a custom Sink backed by an efficient in-memory column
store suited for fast ingestion.
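To give the idea concretely, a sink like that boils down to a minimal sketch along these lines; note that Sink is an internal Spark API in 2.x, and ColumnStore here is a hypothetical client for the in-memory store.

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.execution.streaming.Sink

// Hypothetical client for the in-memory column store.
trait ColumnStore {
  def ingest(batchId: Long, rows: Array[Row]): Unit
}

class ColumnStoreSink(store: ColumnStore) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Collecting to the driver only suits small micro-batches; a production
    // sink would write from the executors, partition by partition.
    store.ingest(batchId, data.collect())
  }
}

Hooking it into writeStream.format(...) additionally needs a StreamSinkProvider registered as a data source, which I've left out here.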
I have a Kafka stream coming from
Hi Mina,
I believe this is different for Structured Streaming from Kafka
specifically. I'm assuming you are using Structured Streaming based on the
name of the dependency ("spark-streaming-kafka"). There is a note in the
docs here:
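For completeness, the Structured Streaming read itself looks like the sketch below (the spark-sql-kafka-0-10 source; broker and topic names are hypothetical, and spark is an existing SparkSession). Kafka consumer properties are passed through with the "kafka." option prefix.

val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")  // hypothetical broker
  .option("subscribe", "events")                      // hypothetical topic
  .option("startingOffsets", "latest")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")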
Hi All,
I am using the latest version of EMR to overwrite Parquet files to an S3
bucket encrypted with a KMS key. I am seeing the attached error whenever I
overwrite a Parquet file. For example, the code below produces the attached
error and stack trace:
Hi Priyank,
I have a similar structure, although I am reading from Kafka and sinking to
multiple MySQL tables. My input stream has multiple message types, and each
is headed for a different MySQL table (a sketch of that fan-out is below).
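A minimal sketch of that fan-out, with one filtered branch (and hence one streaming query) per message type, all off the same Kafka source; broker, topic, field, and table names are hypothetical:

import org.apache.spark.sql.functions.{col, get_json_object}

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .withColumn("message_type", get_json_object(col("json"), "$.type"))

val orders = input.filter(col("message_type") === "order")
val users  = input.filter(col("message_type") === "user")
// Each branch then gets its own writeStream, e.g. a ForeachWriter into MySQL.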
I've looked for a solution for a few months, and have only come up with two
alternatives:
1.
I use Spark 2.2.1 with streaming, and when I open the Spark Streaming
UI I can see input metadata for each of my batches. In my case I
stream from Kafka, and in the metadata section I find useful
information about my topic, partitions, and offsets.
Assume the URL for this batch looks like
Hello,
I'm trying to run multiple Spark jobs on a cluster running on YARN.
The master is a 24GB server with 6 slaves of 12GB each.
The fairscheduler.xml settings are:

<schedulingMode>FAIR</schedulingMode>
<weight>10</weight>
<minShare>2</minShare>
I am running 8 jobs simultaneously. The jobs run in parallel, but not all
of them: only 7 of them run at a time.
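For reference, fair-scheduler pools attach to jobs roughly like the sketch below; the allocation-file path and pool name here are hypothetical.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
  .getOrCreate()

// Local properties are per thread: set the pool before submitting a job
// from that thread so the job lands in the intended pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "myPool")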
Hello,
yes, sure they can return complex types. For example, the functions
collect_list and collect_set return an ArrayType.
On 10 February 2018 at 14:28, kant kodali wrote:
> Hi All,
>
> Can UDAFs return complex types? Like, say, a Map with an Integer key and
> the value
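To make that concrete, here is a minimal sketch of a UDAF whose return type is a MapType, using the 2.x UserDefinedAggregateFunction API; the class name and column semantics are hypothetical (it folds (key, value) rows into a Map[Int, String]).

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class CollectToMap extends UserDefinedAggregateFunction {
  override def inputSchema: StructType =
    StructType(StructField("key", IntegerType) :: StructField("value", StringType) :: Nil)

  override def bufferSchema: StructType =
    StructType(StructField("acc", MapType(IntegerType, StringType)) :: Nil)

  // The return type may be any DataType, including complex ones.
  override def dataType: DataType = MapType(IntegerType, StringType)

  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Map.empty[Int, String]

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getMap[Int, String](0) + (input.getInt(0) -> input.getString(1))
    }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getMap[Int, String](0) ++ buffer2.getMap[Int, String](0)

  override def evaluate(buffer: Row): Any = buffer.getMap[Int, String](0)
}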
I have a Spark application which creates multiple sessions. Each of these
sessions can run jobs in parallel. I want to log some details about the
execution of these jobs, but want to tag them with the session they were
called from.
I tried creating a listener from within each session
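The shape I had in mind is roughly the sketch below: tag jobs per session with a thread-local property and read it back in a listener (the property key and tag value are hypothetical; spark is the session in question). Since all sessions share one SparkContext, a single listener sees every job, and the local property set on a session's thread travels with the jobs submitted from it.

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

val tagKey = "session.tag"  // hypothetical property key
spark.sparkContext.setLocalProperty(tagKey, "session-42")

spark.sparkContext.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // The job inherits the submitting thread's local properties.
    val tag = Option(jobStart.properties.getProperty(tagKey)).getOrElse("untagged")
    println(s"job ${jobStart.jobId} started from $tag")
  }
})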