Re: Writing record once after the watermarking interval in Spark Structured Streaming

2018-03-29 Thread Bowden, Chris
The watermark is just a user-provided indicator to Spark that it's OK to drop internal state after some period of time. The watermark "interval" doesn't directly dictate whether hot rows are sent to a sink. Think of a hot row as data younger than the watermark. However, the watermark will

Writing record once after the watermarking interval in Spark Structured Streaming

2018-03-29 Thread karthikjay
I have the following query: val ds = dataFrame .filter(! $"requri".endsWith(".m3u8")) .filter(! $"bserver".contains("trimmer")) .withWatermark("time", "120 seconds") .groupBy(window(dataFrame.col("time"),"60 seconds"),col("channelName")) .agg(sum("bytes")/100
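
A minimal sketch of the pattern under discussion, using a synthetic source in place of the thread's dataFrame (the rate source, the synthesized channelName and bytes columns, and the totalBytes alias are illustrative, not from the thread). It shows the point made in the reply above: with append output mode, each (window, channelName) aggregate reaches the sink exactly once, only after the watermark passes the end of that window.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, lit, sum, window}

    val spark = SparkSession.builder().appName("watermark-sketch").getOrCreate()

    // Stand-in for the thread's streaming input: the rate source supplies an
    // event-time column; channelName and bytes are synthesized for the sketch.
    val dataFrame = spark.readStream
      .format("rate")
      .load()
      .withColumnRenamed("timestamp", "time")
      .withColumn("channelName", lit("channel-1"))
      .withColumn("bytes", col("value") * 10)

    val aggregated = dataFrame
      .withWatermark("time", "120 seconds")
      .groupBy(window(col("time"), "60 seconds"), col("channelName"))
      .agg(sum("bytes").as("totalBytes"))

    // In append mode each windowed row is emitted once, after the watermark
    // moves past the window's end; update/complete modes re-emit rows as they change.
    val query = aggregated.writeStream
      .outputMode("append")
      .format("console")
      .start()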

Re: Spark on K8s resource staging server timeout

2018-03-29 Thread Jenna Hoole
Unfortunately the other Kubernetes cluster I was using was rebuilt from scratch yesterday, but the RSS I have up today has pretty uninteresting logs. [root@nid6 ~]# kubectl logs default-spark-resource-staging-server-7669dd57d7-xkvp6 ++ id -u + myuid=0 ++ id -g + mygid=0 ++ getent passwd

Re: Spark on K8s resource staging server timeout

2018-03-29 Thread Matt Cheah
Hello Jenna, Are there any logs from the resource staging server pod? They might show something interesting. Unfortunately, we haven’t been maintaining the resource staging server because we’ve moved all of our effort to the main repository instead of the fork. When we consider the

Re: Spark on K8s resource staging server timeout

2018-03-29 Thread Jenna Hoole
I added deliberately high timeouts to the OkHttpClient.Builder() in RetrofitClientFactory.scala, and I don't seem to be timing out anymore. val okHttpClientBuilder = new OkHttpClient.Builder() .dispatcher(dispatcher) .proxy(resolvedProxy) .connectTimeout(120, TimeUnit.SECONDS)
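
For reference, a self-contained sketch of the builder with explicit connect, read, and write timeouts (the 120-second values mirror the snippet above; the default Dispatcher and the omitted proxy are simplifications, since the real ones come from RetrofitClientFactory):

    import java.util.concurrent.TimeUnit
    import okhttp3.{Dispatcher, OkHttpClient}

    // Sketch only: RetrofitClientFactory supplies its own dispatcher and proxy;
    // a default Dispatcher is used here so the snippet stands alone.
    val dispatcher = new Dispatcher()
    val client = new OkHttpClient.Builder()
      .dispatcher(dispatcher)
      .connectTimeout(120, TimeUnit.SECONDS)
      .readTimeout(120, TimeUnit.SECONDS)
      .writeTimeout(120, TimeUnit.SECONDS)
      .build()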

Multiple columns using 'isin' command in pyspark

2018-03-29 Thread Shuporno Choudhury
Hi Spark Users, I am trying to achieve the 'IN' functionality of SQL using the isin function in pyspark, e.g.: select count(*) from tableA where (col1, col2) in ((1, 100), (2, 200), (3, 300)). We can very well have single-column isin statements like: df.filter(df[0].isin(1,2,3)).count()
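
One way to get a multi-column IN, sketched here in Scala (the thread uses pyspark, but the translation is direct; the sample data is hypothetical), is a left-semi join against a small DataFrame holding the allowed tuples:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("multi-column-in").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for tableA from the question above.
    val df = Seq((1, 100, "a"), (2, 999, "b"), (3, 300, "c"))
      .toDF("col1", "col2", "payload")

    // The allowed (col1, col2) pairs from the IN list.
    val allowed = Seq((1, 100), (2, 200), (3, 300)).toDF("col1", "col2")

    // A left-semi join keeps only rows of df whose (col1, col2) appear in
    // `allowed`, matching the semantics of the SQL tuple IN.
    val count = df.join(allowed, Seq("col1", "col2"), "left_semi").count()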

spark jdbc postgres query results don't match those of postgres query

2018-03-29 Thread Kevin Peng
I am running into a weird issue in Spark 1.6 and was wondering if anyone has encountered it before. I am running a simple select query from Spark using a JDBC connection to Postgres: val POSTGRES_DRIVER: String = "org.postgresql.Driver" val srcSql = """select total_action_value, last_updated
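
For context, a sketch of how a raw query is usually pushed through the Spark 1.6 JDBC source, by wrapping it as an aliased subquery in the dbtable option (the URL, credentials, and the completed query text below are placeholders, not details from the thread):

    // Assumes a SQLContext named sqlContext is in scope, as in the Spark 1.6 shell.
    // The table and connection details are placeholders.
    val srcSql = "select total_action_value, last_updated from some_table"

    val jdbcDF = sqlContext.read
      .format("jdbc")
      .option("driver", "org.postgresql.Driver")
      .option("url", "jdbc:postgresql://dbhost:5432/dbname")
      .option("dbtable", s"($srcSql) as src")
      .option("user", "dbuser")
      .option("password", "dbpassword")
      .load()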

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Lalwani, Jayesh
It depends on how you have loaded the data. Ideally, if you have dozens of records, your input data should have them in one partition. If the input has 1 partition and the data is small enough, Spark will keep it in one partition (as far as possible). If you cannot control your data, you need to

Re: Why doesn't spark use broadcast join?

2018-03-29 Thread Lalwani, Jayesh
Try putting a Broadcast hint as shown here: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-hint-framework.html#sql-hints
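
A sketch of the same idea through the DataFrame API, using the table and column names from the query in the original post further down in this digest (spark.table assumes T1 and T2 are registered tables; broadcast is org.apache.spark.sql.functions.broadcast):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("broadcast-hint").getOrCreate()

    val t1 = spark.table("T1")
    val t2 = spark.table("T2")

    // Wrapping the small side in broadcast(...) asks the planner to use a
    // broadcast join even when it would not choose one on its own.
    val result = t1.join(broadcast(t2), Seq("id"), "left_anti")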

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Tin Vu
You are right. Too many tasks were created. How can we reduce the number of tasks? On Thu, Mar 29, 2018, 7:44 AM Lalwani, Jayesh wrote: > Without knowing too many details, I can only guess. It could be that Spark > is creating a lot of tasks even though
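
Two common knobs for this, sketched below on the assumption that the small TPCDS tables are read into DataFrames (the path and view name are placeholders): coalesce the tiny input down to a handful of partitions, and lower spark.sql.shuffle.partitions so joins and aggregations do not fan out into the default 200 tasks.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("small-tpcds-tables").getOrCreate()

    // Fewer shuffle partitions means fewer tasks for joins and aggregations;
    // the default of 200 is far too many for tables with dozens of rows.
    spark.conf.set("spark.sql.shuffle.partitions", "8")

    // Collapse a small input to a single partition so scans and downstream
    // stages do not launch one task per tiny file.
    val smallTable = spark.read.parquet("/path/to/small_tpcds_table").coalesce(1)
    smallTable.createOrReplaceTempView("small_table")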

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Lalwani, Jayesh
Without knowing too many details, I can only guess. It could be that Spark is creating a lot of tasks even though there are few records. Creation and distribution of tasks has a noticeable overhead on smaller datasets. You might want to look at the driver logs, or the Spark Application Detail

Stopping StreamingContext

2018-03-29 Thread Sidney Feiner
Hey, I have a Spark Streaming application processing some events. Sometimes, I want to stop the application if I get a specific event. I collect the executors' results in the driver and, based on those results, kill the StreamingContext using StreamingContext.stop(stopSparkContext=true). When I
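
A minimal sketch of the stop-on-event pattern described above (the socket source and the "STOP" marker are placeholders; stopping from a separate thread avoids blocking the batch that triggers the shutdown):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("stop-on-event")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical source; any DStream works the same way.
    val events = ssc.socketTextStream("localhost", 9999)

    events.foreachRDD { rdd =>
      // Collect the batch's results on the driver and decide whether to stop.
      val sawStopEvent = rdd.collect().exists(_.contains("STOP"))
      if (sawStopEvent) {
        // Stop from another thread so the in-flight batch is not waiting on
        // itself; stopSparkContext also tears down the underlying SparkContext.
        new Thread(new Runnable {
          override def run(): Unit =
            ssc.stop(stopSparkContext = true, stopGracefully = true)
        }).start()
      }
    }

    ssc.start()
    ssc.awaitTermination()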

Best practices for optimizing the structure of parquet schema

2018-03-29 Thread Vitaliy Pisarev
There is a lot of talk that in order to really benefit from fast queries over Parquet and HDFS, we need to make sure that the data is stored in a manner that is friendly to compression. Unfortunately, I did not find any specific guidelines or tips online that describe the dos and don'ts of designing
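
In the absence of an official checklist, one commonly cited sketch is to partition the output on a low-cardinality column that queries filter on, and to sort within partitions on a frequently filtered high-cardinality column, so Parquet's min/max statistics and dictionary encoding can prune and compress well (column names and paths below are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parquet-layout").getOrCreate()

    // Hypothetical input with an event_date and a user_id column.
    val events = spark.read.json("/path/to/raw/events")

    events
      // Repartitioning by event_date routes all rows for a date to one task,
      // avoiding many small files per date directory.
      .repartition(events("event_date"))
      // Clustering similar user_id values helps compression and lets Parquet
      // row-group min/max statistics skip data when queries filter on user_id.
      .sortWithinPartitions("user_id")
      .write
      // Directory-level partitioning: queries filtering on event_date skip files.
      .partitionBy("event_date")
      .parquet("/path/to/curated/events")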

Why doesn't spark use broadcast join?

2018-03-29 Thread Vitaliy Pisarev
I am looking at the physical plan for the following query: SELECT f1,f2,f3,... FROM T1 LEFT ANTI JOIN T2 ON T1.id = T2.id WHERE f1 = 'bla' AND f2 = 'bla2' AND some_date >= date_sub(current_date(), 1) LIMIT 100 An important detail: the table 'T1' can be very large (hundreds of

[Query] Columnar transformation without Structured Streaming

2018-03-29 Thread Aakash Basu
Hi, I started my Spark Streaming journey with Structured Streaming on Spark 2.3, where I can easily do Spark SQL transformations on streaming data. But I want to know: how can I do columnar transformations (like running aggregations or casts) using the older DStream API? Is
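
A common sketch for getting column-style operations on a DStream is to convert each micro-batch RDD to a DataFrame inside foreachRDD (or transform); the socket source and the "channel,bytes" line format here are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, sum}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val spark = SparkSession.builder().appName("dstream-columnar").getOrCreate()
    import spark.implicits._

    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

    // Hypothetical source emitting lines of the form "channel,bytes".
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      // Convert the micro-batch RDD to a DataFrame, then apply columnar
      // operations (casts, aggregations) exactly as in Spark SQL.
      val df = rdd
        .map(_.split(","))
        .map(parts => (parts(0), parts(1)))
        .toDF("channel", "bytes")

      df.withColumn("bytes", col("bytes").cast("long"))
        .groupBy("channel")
        .agg(sum("bytes").as("totalBytes"))
        .show()
    }

    ssc.start()
    ssc.awaitTermination()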

[Structured Streaming] Kafka Sink in Spark 2.3

2018-03-29 Thread Lalwani, Jayesh
Hi, I have a custom streaming sink that internally uses org.apache.spark.sql.kafka010.KafkaSink. This was working in 2.2. When I upgraded to 2.3, I got this exception. Does spark-sql-Kafka010 work on Spark 2.3? 84705281f4b]] DEBUG com.capitalone.sdp.spark.source.SdpSink - Writing batch to
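
For comparison, a sketch of writing a stream to Kafka in 2.3 through the public writeStream path rather than the internal KafkaSink class (broker, topic, and checkpoint location are placeholders; whether the internal class still works in 2.3 is exactly what the thread is asking):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("kafka-sink-sketch").getOrCreate()

    // Hypothetical streaming input; the Kafka sink expects a string or binary
    // "value" column (and optionally "key" and "topic").
    val input = spark.readStream.format("rate").load()
    val toKafka = input.select(col("value").cast("string").as("value"))

    // Public Kafka sink from the spark-sql-kafka-0-10 module.
    val query = toKafka.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("topic", "output-topic")
      .option("checkpointLocation", "/path/to/checkpoint")
      .start()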