It may not be as easy as you think. The REST call will happen in the driver,
but the reads will be in the executors.
On Sat, 19 Aug 2017 at 11:42 am, Imtiaz Ahmed wrote:
> Hi All,
>
> I am building a Spark library which developers will use when writing their
> Spark jobs to
-- Forwarded message --
From: Kaepke, Marc
Date: Fri, Aug 18, 2017 at 10:51 AM
Subject: PageRank - 4x slower than Spark?!
To: "u...@flink.apache.org"
Hi everyone,
I compared Flink and Spark by using PageRank. I guessed Flink
Hi All,
I am building a Spark library which developers will use when writing their
Spark jobs to get access to data on Azure Data Lake. But the authentication
will depend on the dataset they ask for. I need to call a REST API from
within the Spark job to get credentials and authenticate to read data
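A minimal sketch of that pattern, not a definitive implementation: the REST
call runs once on the driver, and the executors pick up the resulting Hadoop
configuration when they read. The fetchCredentials helper, endpoint URL,
account name and path are all made up; the dfs.adls.oauth2.* keys are the
Hadoop Azure Data Lake (Gen1) settings.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("adls-read").getOrCreate()

    // Hypothetical helper: calls the credential REST endpoint on the driver.
    val creds = fetchCredentials("https://creds.example.com/api/token")

    // Push credentials into the Hadoop configuration before any read.
    val hconf = spark.sparkContext.hadoopConfiguration
    hconf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
    hconf.set("dfs.adls.oauth2.client.id", creds.clientId)
    hconf.set("dfs.adls.oauth2.credential", creds.secret)
    hconf.set("dfs.adls.oauth2.refresh.url", creds.refreshUrl)

    // The reads themselves execute on the executors.
    val df = spark.read.parquet("adl://myaccount.azuredatalakestore.net/data/events")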
Hi Jacek,
The way the memory sink is architected at the moment is that it either
appends a row (append/update mode) or replaces all rows (complete mode).
When a user specifies a checkpoint location, the guarantee Structured
Streaming provides is that output sinks will not lose data and will be
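A small, hedged sketch of those two behaviours (the source and query name
are stand-ins): in append/update mode rows are added to the in-memory table,
while complete mode replaces the whole table on every trigger.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("memory-sink").getOrCreate()

    // Any streaming source would do; the rate source is just a stand-in.
    val stream = spark.readStream.format("rate").load()

    val q = stream.writeStream.
      format("memory").
      queryName("events").     // name of the queryable in-memory table
      outputMode("append").    // "complete" would replace all rows per trigger
      start()

    // The sink is then queryable as a regular table.
    spark.sql("SELECT * FROM events").show()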
Hi,
This is what I could find in Spark's source code about the
`recoverFromCheckpointLocation` flag (which led me to explore the
complete output mode for the dropDuplicates operator).
The `recoverFromCheckpointLocation` flag is enabled by default and varies
per sink (memory, console and others).
*
My assumption is it would be similar though: with the memory sink, storing
all of your records would quickly overwhelm your cluster, but with an
aggregation it could be reasonable. But there might be additional reasons on
top of that.
On Fri, Aug 18, 2017 at 11:44 AM Holden Karau wrote:
>
Ah yes, I'm not sure about the workings of the memory sink.
On Fri, Aug 18, 2017 at 11:36 AM, Jacek Laskowski wrote:
> Hi Holden,
>
> Thanks a lot for a bit more light on the topic. That, however, does not
> explain why the memory sink requires Complete for a checkpoint location to
>
Hi Holden,
Thanks a lot for a bit more light on the topic. That, however, does not
explain why the memory sink requires Complete for a checkpoint location to
work. The only reason I used Complete output mode was to meet the
requirements of the memory sink, and that got me thinking why would the
Hello!
I have some weird problems with Spark running on top of YARN (Spark 2.2
on Cloudera CDH 5.12).
There are a lot of "java.net.SocketException: Network is unreachable"
errors in the executors; part of a log file is at
http://support.l3s.de/~zab/spark-errors.txt, and the jobs also fail at
rather random
Hello,
I have recently installed Spark 2.2.0 and am trying to use it for some big
data processing. Spark is installed on a server that I access from a remote
computer. I need to set up SSL encryption for the Spark web UI, but following
some threads online I'm still not able to set it up.
Can
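For reference, a hedged sketch of the spark-defaults.conf entries usually
involved (keystore paths and passwords are placeholders; per-component
overrides such as spark.ssl.ui.* are also possible):

    spark.ssl.enabled             true
    spark.ssl.keyStore            /path/to/keystore.jks
    spark.ssl.keyStorePassword    changeit
    spark.ssl.keyPassword         changeit
    spark.ssl.trustStore          /path/to/truststore.jks
    spark.ssl.trustStorePassword  changeit
    spark.ssl.protocol            TLSv1.2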
So performing complete output without an aggregation would require building
up a table of the entire input to write out at each micro-batch. This would
get prohibitively expensive quickly. With an aggregation, we just need to
keep track of the aggregates and update them every batch, so the memory
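To make the aggregation case concrete (source and column names are
assumptions): only the running counts are carried between batches, not the
input rows.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("agg-complete").getOrCreate()
    import spark.implicits._

    val stream = spark.readStream.format("rate").load()

    // Keep just one count per key; complete mode rewrites this small table.
    val counts = stream.groupBy($"value" % 10 as "key").count()

    val q = counts.writeStream.
      format("memory").
      queryName("counts").
      outputMode("complete").
      start()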
Hi Pat,
I am using dynamic scheduling with an executor memory of 8 GB. I will try
static scheduling by specifying the number of executors and cores.
Thanks,
Asmath
Sent from my iPhone
> On Aug 18, 2017, at 10:39 AM, Patrick Alwell wrote:
>
> +1 what is the executor
+1 what is the executor memory? You may need to adjust executor memory and
cores. For the sake of simplicity, each executor can handle 5 concurrent tasks
and should have 5 cores. So if your cluster has 100 cores, you'd have 20
executors. And if your cluster memory is 500 GB, each executor would
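Working that rule through: 100 cores / 5 cores per executor = 20 executors,
and 500 GB / 20 executors = 25 GB per executor before leaving headroom for
overhead. A sketch of the corresponding submit command (the jar name is a
placeholder, and 22G is just one reasonable choice after overhead):

    spark-submit \
      --num-executors 20 \
      --executor-cores 5 \
      --executor-memory 22G \
      my-job.jar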
Hi,
I am using persist before inserting dataframe data back into Hive. This step
is adding 8 minutes to my total execution time. Is there a way to reduce
the total time without running into out-of-memory issues? Here is my code.
val datapoint_df: Dataset[Row] =
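One hedged variant, not the original code: a serialized, spill-to-disk
storage level can be cheaper than the default for an input this size, and
unpersisting after the write releases the memory (the target table name is
made up):

    import org.apache.spark.storage.StorageLevel

    datapoint_df.persist(StorageLevel.MEMORY_AND_DISK_SER)
    datapoint_df.write.insertInto("mydb.datapoint")  // made-up table name
    datapoint_df.unpersist()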
Hi,
Why is there a requirement for a streaming aggregation in a streaming
query? What would happen if Spark allowed Complete without a single
aggregation? This is the latest master.
scala> val q = ids.
| writeStream.
| format("memory").
| queryName("dups").
|
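For context, a self-contained sketch of the shape of that query (the input
source, column name and checkpoint path are assumptions); with Complete
output mode this is exactly the case that hits the streaming-aggregation
requirement, since dropDuplicates is not an aggregation:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("dups").getOrCreate()
    import spark.implicits._

    // Stand-in streaming input with a single "id" column.
    val ids = spark.readStream.format("rate").load().select($"value" as "id")

    val q = ids.
      dropDuplicates("id").
      writeStream.
      format("memory").
      queryName("dups").
      outputMode("complete").
      option("checkpointLocation", "/tmp/dups-checkpoint").
      start()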
It is just SQL on a Hive table, with a transformation adding 10 more columns
calculated for currency. The input size for this query is 2 months, which is
around 450 GB of data.
I added persist but it didn't help. Also, the executor memory is 8 GB. Any
suggestions please?
Sent from my iPhone
> On
You have forgotten a y:
It must be MM/dd/yyyy
> On 17. Aug 2017, at 21:30, Aakash Basu wrote:
>
> Hi Palwell,
>
> Tried doing that, but it's becoming null for all the dates after the
> transformation with functions.
>
> df2 =
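A minimal sketch of the fix, assuming a string column named date_string (all
names are made up); in Spark 2.2, to_date accepts an explicit format:

    import org.apache.spark.sql.functions.{col, to_date}

    // Use the full pattern; a wrong pattern yields null instead of an error.
    val df2 = df.withColumn("date_parsed",
      to_date(col("date_string"), "MM/dd/yyyy"))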