Re: [Spark Structured Streaming] Measure metrics from CsvSink for Rate source

2018-06-28 Thread Dhruv Kumar
Hi, can someone please take a look at the message below? Any help is deeply appreciated. -- Dhruv Kumar PhD Candidate Department of Computer Science and Engineering University of Minnesota www.dhruvkumar.me > On Jun 22, 2018, at 13:12, Dhruv Kumar wrote: >

Re: spark 2.3.1 with kafka spark-streaming-kafka-0-10 (java.lang.AbstractMethodError)

2018-06-28 Thread Peter Liu
Hello there, I just upgraded to Spark 2.3.1 from Spark 2.2.1, ran my streaming workload and got an error (java.lang.AbstractMethodError) I had never seen before; check the error stack attached in (a) below. Does anyone know if Spark 2.3.1 works well with kafka spark-streaming-kafka-0-10? this link
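A java.lang.AbstractMethodError in this context is typically a binary-compatibility mismatch, e.g. a connector built against a different Spark version. A minimal build.sbt sketch (not from the thread; sbt and Scala 2.11 are assumptions) that keeps the connector aligned with Spark 2.3.1:

// build.sbt sketch -- the key point is keeping the Kafka connector version in lockstep with Spark
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "2.3.1" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.1"
)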

Re: Lag and queued up batches info in Structured Streaming UI

2018-06-28 Thread Tathagata Das
The SQL plan of each micro-batch in the Spark UI (SQL tab) has links to the actual Spark jobs that ran in the micro-batch. From there you can drill down into the stage information. I agree that it's not presented as a nice per-stream table as with the Streaming tab, but all the information is present if

Re: How to reduceByKeyAndWindow in Structured Streaming?

2018-06-28 Thread Tathagata Das
The fundamental conceptual difference between the windowing in DStreams vs Structured Streaming is that DStreams use the arrival time of the record in Spark (aka processing time) while Structured Streaming uses event time. If you want to exactly replicate DStream's processing-time windows in
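One way to approximate processing-time windows in Structured Streaming (a sketch, not from the thread; the stream and its key/value columns are assumptions) is to stamp each record with current_timestamp() on arrival and window on that column:

import org.apache.spark.sql.functions.{current_timestamp, sum, window}
import spark.implicits._  // spark is the active SparkSession

// stream is a hypothetical streaming DataFrame with key and value columns
val withProcTime = stream.withColumn("procTime", current_timestamp())
val totals = withProcTime
  .groupBy($"key", window($"procTime", "5 minutes"))
  .agg(sum($"value") as "total")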

Caching when you perform one action and have a dataframe used more than once.

2018-06-28 Thread mxmn
Hi, let's say I have the following code (it's an example): df_a = spark.read.json() df_b = df_a.sample(False, 0.5, 10) df_c = df_a.sample(False, 0.5, 10) df_d = df_b.union(df_c) df_d.count() Do we have to cache df_a since it is used by both df_b and df_c, or will Spark notice that df_a is used twice in
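For reference, a sketch of the explicit-caching variant in Scala (the JSON path is a placeholder); Spark does not cache shared lineage automatically, so without cache() the source is scanned once per branch of the union:

val dfA = spark.read.json("events.json").cache()  // "events.json" is a placeholder path

val dfB = dfA.sample(false, 0.5, 10)
val dfC = dfA.sample(false, 0.5, 10)
val dfD = dfB.union(dfC)

dfD.count()      // the first action populates the cache, so dfA is not recomputed for each branch
dfA.unpersist()  // release the cached data when no longer needed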

Re: How to handle java.sql.Date inside Maps with to_json / from_json

2018-06-28 Thread Patrick McGloin
Hi all, I tested this with a Date outside a map and it works fine, so I think the issue only occurs for Dates inside Maps. I will create a Jira for this unless there are objections. Best regards, Patrick On Thu, 28 Jun 2018, 11:53 Patrick McGloin, wrote: > Consider the following test, which will

How to handle java.sql.Date inside Maps with to_json / from_json

2018-06-28 Thread Patrick McGloin
Consider the following test, which will fail on the final show: case class UnitTestCaseClassWithDateInsideMap(map: Map[Date, Int]) test("Test a Date as key in a Map") { val map = UnitTestCaseClassWithDateInsideMap(Map(Date.valueOf("2018-06-28") -> 1)) val options =
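The preview cuts off mid-test; a minimal sketch of the kind of round trip being described (everything beyond the case class and the Date literal is an assumption) would be:

import java.sql.Date
import org.apache.spark.sql.functions.to_json
import spark.implicits._  // spark is the active SparkSession

case class UnitTestCaseClassWithDateInsideMap(map: Map[Date, Int])

val df = Seq(UnitTestCaseClassWithDateInsideMap(Map(Date.valueOf("2018-06-28") -> 1))).toDF()
df.select(to_json($"map")).show(false)  // reportedly fails when the map key is a java.sql.Date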

Re: How to reduceByKeyAndWindow in Structured Streaming?

2018-06-28 Thread Gerard Maas
Hi, In Structured Streaming lingo, "ReduceByKeyAndWindow" would be a window aggregation with a composite key. Something like: stream.groupBy($"key", window($"timestamp", "5 minutes")) .agg(sum($"value") as "total") The aggregate could be any supported SQL function. Is this what you
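Expanded into a runnable sketch (the rate source and the derived key column are assumptions, just to have key/value/timestamp columns to aggregate):

import org.apache.spark.sql.functions.{sum, window}
import spark.implicits._  // spark is the active SparkSession

val stream = spark.readStream.format("rate").load()  // rate source emits value and timestamp columns
  .selectExpr("value % 10 as key", "value", "timestamp")

val totals = stream
  .groupBy($"key", window($"timestamp", "5 minutes"))
  .agg(sum($"value") as "total")

totals.writeStream.outputMode("update").format("console").start()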

Using newApiHadoopRDD for reading from HBase

2018-06-28 Thread Biplob Biswas
Hi, I had a few questions regarding the way newApiHadoopRDD accesses data from HBase. 1. Does it load all the data from a scan operation directly in memory? 2. According to my understanding, the data is loaded from different regions to different executors, is that assumption/understanding
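For context, the usual way this RDD is constructed against HBase (the table name and scan settings below are placeholders, not from the thread):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")    // placeholder table name
conf.set(TableInputFormat.SCAN_COLUMN_FAMILY, "cf")   // optional: restrict the scan

val hbaseRdd = sc.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

hbaseRdd.count()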

Re: [ClusterMode] -Dspark.master with missing secondary master IP

2018-06-28 Thread bsikander
I did some further investigation. If I launch a driver in cluster mode with master IPs like spark://:7077,:7077, then the driver is launched with both IPs and the -Dspark.master property has both IPs. But within the logs I see the following; it causes a 20-second delay while launching each driver

How to reduceByKeyAndWindow in Structured Streaming?

2018-06-28 Thread oripwk
In Structured Streaming, there's the notion of event-time windowing: However, this is not quite the same as DStream's windowing operations: in Structured Streaming, windowing groups the data by fixed time windows, and every event in a time window is associated with its group: And in DStreams it

Re: [Spark Streaming] Spark Streaming with S3 vs Kinesis

2018-06-28 Thread Farshid Zavareh
Thanks. A workaround I can think of is to rename/move the objects which have been processed to a different prefix (which is not monitored). But with the StreamingContext.textFileStream method there doesn't seem to be a way to know where each record is coming from. Is there another way to do this?
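Not from the thread, but one possible direction: with the Structured Streaming file source the originating object key is available via input_file_name() (bucket and prefix below are placeholders):

import org.apache.spark.sql.functions.input_file_name
import spark.implicits._  // spark is the active SparkSession

val lines = spark.readStream.textFile("s3a://my-bucket/incoming/")
  .withColumn("source_file", input_file_name())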