The watermark is just a user-provided indicator that tells Spark it is OK to
drop internal state after some period of time. The watermark interval doesn't
directly dictate whether hot rows are sent to a sink; think of a hot row as
data younger than the watermark. However, the watermark will determine when a
window can be finalized and, in append output mode, when its result is emitted
to the sink.
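For example (a minimal sketch; events is an assumed streaming DataFrame with
an event-time column named time):

import org.apache.spark.sql.functions._

val windowed = events
  .withWatermark("time", "120 seconds")
  .groupBy(window(col("time"), "60 seconds"))
  .agg(sum("bytes").as("totalBytes"))

// In append output mode, a window's row only reaches the sink once the
// watermark passes the window's end; until then it stays as hot state.
windowed.writeStream
  .outputMode("append")
  .format("console")
  .start()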
I have the following query:
import org.apache.spark.sql.functions._
import spark.implicits._

val ds = dataFrame
  .filter(!$"requri".endsWith(".m3u8"))
  .filter(!$"bserver".contains("trimmer"))
  .withWatermark("time", "120 seconds")
  // Reference the watermarked column with col("time"); using
  // dataFrame.col("time") points at the pre-watermark plan.
  .groupBy(window(col("time"), "60 seconds"), col("channelName"))
  .agg(sum("bytes") / 100)
Unfortunately the other Kubernetes cluster I was using was rebuilt from
scratch yesterday, and the resource staging server I have up today has pretty
uninteresting logs.
[root@nid6 ~]# kubectl logs
default-spark-resource-staging-server-7669dd57d7-xkvp6
++ id -u
+ myuid=0
++ id -g
+ mygid=0
++ getent passwd
Hello Jenna,
Are there any logs from the resource staging server pod? They might show
something interesting.
Unfortunately, we haven’t been maintaining the resource staging server because
we’ve moved all of our effort to the main repository instead of the fork. When
we consider the
I added some overly generous timeouts to the OkHttpClient.Builder() in
RetrofitClientFactory.scala, and I no longer seem to be timing out.
val okHttpClientBuilder = new OkHttpClient.Builder()
.dispatcher(dispatcher)
.proxy(resolvedProxy)
.connectTimeout(120, TimeUnit.SECONDS)
Hi Spark Users,
I am trying to achieve the 'IN' functionality of SQL using the isin
function in pyspark
Eg: select count(*) from tableA
where (col1, col2) in ((1, 100),(2, 200), (3,300))
A single-column isin works fine, e.g.:
df.filter(df[0].isin(1,2,3)).count()
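For the multi-column case, one common workaround (sketched here in the Scala
API; the same join works from pyspark) is a left-semi join against a small
table of the allowed tuples. Note that df and the column names below are
assumptions:

import org.apache.spark.sql.functions.broadcast
import spark.implicits._

// The allowed (col1, col2) pairs as a small lookup table.
val keys = Seq((1, 100), (2, 200), (3, 300)).toDF("col1", "col2")

// left_semi keeps only the rows of df whose (col1, col2) appears in keys.
val count = df.join(broadcast(keys), Seq("col1", "col2"), "left_semi").count()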
I am running into a weird issue in Spark 1.6, and I was wondering if anyone
has encountered it before. I am running a simple select query from Spark
using a JDBC connection to Postgres:

val POSTGRES_DRIVER: String = "org.postgresql.Driver"
val srcSql = """select total_action_value, last_updated
It depends on how you have loaded the data. Ideally, if you have only dozens
of records, your input data should have them in one partition. If the input
has one partition and the data is small enough, Spark will keep it in one
partition (as far as possible).
If you cannot control your data, try putting a broadcast hint on the join,
as shown here:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-hint-framework.html#sql-hints
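A minimal sketch of that hint (largeDf and smallDf are assumed names):

import org.apache.spark.sql.functions.broadcast

// Broadcasting the small side lets Spark do a map-side join
// instead of shuffling the large table.
val joined = largeDf.join(broadcast(smallDf), Seq("id"))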
From: Vitaliy Pisarev
Date: Thursday, March 29, 2018 at 8:42 AM
To: "user@spark.apache.org"
Subject:
You are right. Too many tasks were created. How can we reduce the number of
tasks?
On Thu, Mar 29, 2018, 7:44 AM Lalwani, Jayesh
wrote:
> Without knowing too many details, I can only guess. It could be that Spark
> is creating a lot of tasks even though
Without knowing too many details, I can only guess. It could be that Spark is
creating a lot of tasks even though there are few records. Creation and
distribution of tasks has a noticeable overhead on smaller datasets.
You might want to look at the driver logs, or the Spark application detail UI.
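If excess tasks do turn out to be the problem, two knobs usually help
(a sketch; ds is an assumed Dataset):

// Fewer shuffle partitions means fewer tasks per shuffle stage;
// the default of 200 is overkill for dozens of records.
spark.conf.set("spark.sql.shuffle.partitions", "4")

// coalesce merges input partitions without a full shuffle.
val compact = ds.coalesce(1)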
Hey,
I have a Spark Streaming application processing some events.
Sometimes, I want to stop the application if I get a specific event. I collect
the executors' results in the driver and, based on those results, I kill the
StreamingContext using StreamingContext.stop(stopSparkContext = true).
When I
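A sketch of that pattern (ssc, stream, and isStopEvent are assumed names):

stream.foreachRDD { rdd =>
  // Check on the driver whether the stop event showed up in this batch.
  if (rdd.filter(isStopEvent).count() > 0) {
    // In practice it is safer to trigger this from a separate thread.
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}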
There is a lot of talk that in order to really benefit from fast queries
over Parquet and HDFS, we need to make sure that the data is stored in a
manner that is friendly to compression.
Unfortunately, I did not find any specific guidelines or tips online that
describe the dos and don'ts of designing
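One widely cited tactic is sorting so that similar values sit next to each
other, which helps Parquet's run-length and dictionary encodings (a sketch;
the column names and output path are assumptions):

import org.apache.spark.sql.functions.col

df.repartition(col("date"))
  .sortWithinPartitions("userId")
  .write
  .partitionBy("date")
  .parquet("/path/to/output")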
I am looking at the physical plan for the following query:
SELECT f1,f2,f3,...
FROM T1
LEFT ANTI JOIN T2 ON T1.id = T2.id
WHERE f1 = 'bla'
AND f2 = 'bla2'
AND some_date >= date_sub(current_date(), 1)
LIMIT 100
An important detail: the table 'T1' can be very large (hundreds of
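For reference, the plan itself can be printed with explain (a sketch; query
is assumed to hold the SQL above):

// explain(true) prints the parsed, analyzed, optimized and physical plans.
spark.sql(query).explain(true)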
Hi,
I started my Spark Streaming journey from Structured Streaming using Spark
2.3, where I can easily do Spark SQL transformations on streaming data.
But I want to know: how can I do columnar transformations (like running
aggregations or casts) using the older DStream API? Is that possible?
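The usual bridge is foreachRDD, converting each micro-batch to a DataFrame
(a sketch; the schema and names are assumptions):

import org.apache.spark.sql.functions.sum

stream.foreachRDD { rdd =>
  import spark.implicits._
  val df = rdd.toDF("channelName", "bytes") // hypothetical columns
  df.groupBy("channelName").agg(sum("bytes")).show()
}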
Hi
I have a custom streaming sink that internally uses
org.apache.spark.sql.kafka010.KafkaSink. This was working in 2.2. When I
upgraded to 2.3, I get this exception.
Does spark-sql-kafka-0-10 work on Spark 2.3?
84705281f4b]] DEBUG com.capitalone.sdp.spark.source.SdpSink - Writing batch to