Re: custom rdd - do I need a hadoop input format?

2019-09-17 Thread Arun Mahadevan
You can do it with a custom RDD implementation. You will mainly implement "getPartitions" - the logic to split your input into partitions - and "compute" - to compute and return the values from the executors. On Tue, 17 Sep 2019 at 08:47, Marcelo Valle wrote: > Just to be more clear about my
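
A minimal sketch of such an RDD (the class names and the per-partition input are assumptions for illustration, not from the thread):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type carrying one chunk of the input.
class ChunkPartition(val index: Int, val chunk: Seq[String]) extends Partition

class ChunkRDD(sc: SparkContext, chunks: Seq[Seq[String]])
    extends RDD[String](sc, Nil) {

  // "getPartitions": split the input into one partition per chunk.
  override protected def getPartitions: Array[Partition] =
    chunks.zipWithIndex.map { case (c, i) => new ChunkPartition(i, c) }.toArray

  // "compute": runs on the executors and returns the values for one partition.
  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    split.asInstanceOf[ChunkPartition].chunk.iterator
}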

Re: how to get spark-sql lineage

2019-05-16 Thread Arun Mahadevan
You can check out https://github.com/hortonworks-spark/spark-atlas-connector/ On Wed, 15 May 2019 at 19:44, lk_spark wrote: > hi, all: > When I use Spark, if I run some SQL to do ETL, how can I get > lineage info. I found that CDH Spark has some config about lineage: >

Re: JvmPauseMonitor

2019-04-15 Thread Arun Mahadevan
Spark TaskMetrics[1] has a "jvmGCTime" metric that captures the amount of time spent in GC. This is also available via the listener I guess. Thanks, Arun [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala#L89 On Mon, 15 Apr 2019 at
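
A small sketch of reading that metric from a listener (the listener class and the println are assumptions for illustration):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class GcTimeListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      // jvmGCTime: elapsed JVM GC time (in ms) while the task was running.
      println(s"task ${taskEnd.taskInfo.taskId} spent ${metrics.jvmGCTime} ms in GC")
    }
  }
}

// Register it with: spark.sparkContext.addSparkListener(new GcTimeListener)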

Re: Structured Streaming & Query Planning

2019-03-18 Thread Arun Mahadevan
I don't think it's feasible with the current logic. Typically the query planning time should be a tiny fraction unless you are processing tiny micro-batches very frequently. You might want to consider adjusting the trigger interval to process more data per micro-batch and see if it helps. The
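
For example, a longer processing-time trigger (the interval and the console sink below are illustrative assumptions):

import org.apache.spark.sql.streaming.Trigger

val query = df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("30 seconds")) // fewer, larger micro-batches per planning cycle
  .start()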

Re: use rocksdb for spark structured streaming (SSS)

2019-03-10 Thread Arun Mahadevan
Read the link carefully. This solution is available (*only*) in Databricks Runtime. You can enable RocksDB-based state management by setting the following configuration in the SparkSession before starting the streaming query. spark.conf.set( "spark.sql.streaming.stateStore.providerClass",
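
Spelled out, the setting looks roughly like this (the full provider class name is an assumption based on the Databricks documentation, and it works only on Databricks Runtime, not open-source Spark):

// Set before starting the streaming query, on Databricks Runtime only.
spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "com.databricks.sql.streaming.state.RocksDBStateStoreProvider")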

Re: Question about RDD pipe

2019-01-17 Thread Arun Mahadevan
Yes, the script should be present on all the executor nodes. You can pass your script via spark-submit (e.g. --files script.sh) and then you should be able to refer to it (e.g. "./script.sh") in rdd.pipe. - Arun On Thu, 17 Jan 2019 at 14:18, Mkal wrote: > Hi, im trying to run an external
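
A minimal sketch (the script name and sample data are assumptions):

// Ship the script with the job:
//   spark-submit --files script.sh your-app.jar
val piped = sc.parallelize(Seq("a", "b", "c"))
  .pipe("./script.sh") // each element is fed to the script's stdin, one per line
piped.collect().foreach(println)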

Re: Error - Dropping SparkListenerEvent because no remaining room in event queue

2018-10-24 Thread Arun Mahadevan
Maybe you have Spark listeners that are not processing the events fast enough? Do you have Spark event logging enabled? You might have to profile the built-in and your custom listeners to see what's going on. - Arun On Wed, 24 Oct 2018 at 16:08, karan alang wrote: > > Pls note - Spark version
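
A quick way to check the event-logging part from a Scala shell (the default value passed to conf.get is just illustrative):

// The event-log writer is one of the listener-bus consumers, so it is worth checking:
spark.conf.get("spark.eventLog.enabled", "false")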

Re: Kafka backlog - spark structured streaming

2018-07-30 Thread Arun Mahadevan
Here's a proposal to add this - https://github.com/apache/spark/pull/21819 It's always good to set "maxOffsetsPerTrigger" unless you want Spark to process till the end of the stream in each micro-batch. Even without "maxOffsetsPerTrigger" the lag can be non-zero by the time the micro-batch completes.
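
For example (the bootstrap servers, topic and cap are assumed values):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "topic")
  .option("maxOffsetsPerTrigger", "10000") // upper bound on offsets read per micro-batch
  .load()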

Re: Question of spark streaming

2018-07-27 Thread Arun Mahadevan
“activityQuery.awaitTermination()” is a blocking call. You can just skip this line and run other commands in the same shell to query the stream. Running the query from a different shell won’t help since the memory sink where the results are stored is not shared between the two shells.
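
A sketch of the pattern being discussed (the DataFrame, query and table names are assumptions):

val activityQuery = counts.writeStream
  .queryName("activity_counts") // name of the in-memory table backed by the memory sink
  .format("memory")
  .outputMode("complete")
  .start()

// In the *same* shell/session, instead of blocking on awaitTermination():
spark.sql("SELECT * FROM activity_counts").show()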

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-12 Thread Arun Mahadevan
d basis (after deciding a record belongs to which particular sink), whereas in the current implementation all data under a RDD partition gets committed to the sink atomically in one go. Please correct me if I am wrong here. Regards, Chandan On Thu, Jul 12, 2018 at 10:53 PM Arun Mahadevan wrot

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-12 Thread Arun Mahadevan
Yes, ForeachWriter [1] could be an option if you want to write to different sinks. You can put your custom logic there to split the data into different sinks. The drawback here is that you cannot plug in existing sinks like Kafka, and you need to write the custom logic yourself, and you cannot scale the
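
A rough sketch of such a writer (the routing field and the sink calls are placeholders, not from the thread):

import org.apache.spark.sql.{ForeachWriter, Row}

val splitWriter = new ForeachWriter[Row] {
  def open(partitionId: Long, epochId: Long): Boolean = {
    // open connections to the downstream sinks here
    true
  }
  def process(row: Row): Unit = {
    // custom logic deciding which sink this record belongs to
    if (row.getAs[String]("type") == "a") { /* write to sink A */ }
    else { /* write to sink B */ }
  }
  def close(errorOrNull: Throwable): Unit = {
    // close the connections
  }
}

df.writeStream.foreach(splitWriter).start()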

Re: question on collect_list or say aggregations in general in structured streaming 2.3.0

2018-05-03 Thread Arun Mahadevan
I think you need to group by a window (tumbling) and define watermarks (put a very low watermark or even 0) to discard the state. Here the window duration becomes your logical batch. - Arun From: kant kodali Date: Thursday, May 3, 2018 at 1:52 AM To: "user @spark"
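
Roughly (column names, window duration and watermark delay are assumptions):

import org.apache.spark.sql.functions.{col, collect_list, window}

val agg = events
  .withWatermark("eventTime", "0 seconds")        // very low watermark so old state is discarded
  .groupBy(window(col("eventTime"), "1 minute"))  // tumbling window acting as the "logical batch"
  .agg(collect_list(col("value")))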

Re: [Structured Streaming] Restarting streaming query on exception/termination

2018-04-24 Thread Arun Mahadevan
I guess you can wait for the termination, catch exception and then restart the query in a loop. Something like… while (true) { try { val query = df.writeStream(). … .start() query.awaitTermination() } catch { case e:
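
A completed version of that loop (the sink, options and logging are assumptions):

while (true) {
  try {
    val query = df.writeStream
      .format("console")
      .start()
    query.awaitTermination()
  } catch {
    case e: Exception =>
      // log the failure and fall through so the query is started again
      println(s"query terminated with: ${e.getMessage}, restarting")
  }
}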

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread Arun Mahadevan
ruct($"AMOUNT", $"*")).as("data")) .select($"data.*") .writeStream .format("console") .trigger(Trigger.ProcessingTime("1 seconds")) .outputMode(OutputMode.Update()) .start() It still have a minor issue: the column "AMOUNT" is showi

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread Arun Mahadevan
Output:
id | amount | my_timestamp
---|--------|--------------------------
 1 | 10     | 2018-04-01T01:10:00.000Z
 2 | 40     | 2018-04-01T01:30:00.000Z
Looking for a streaming solution using either raw sql like sparkSession.sql("sql query") or similar to raw sql but not s

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread Arun Mahadevan
Can't the “max” function be used here? Something like.. stream.groupBy($"id").max("amount").writeStream.outputMode(“complete”/“update")…. Unless the “stream” is already a grouped stream, in which case the above would not work since the support for multiple aggregate operations is not there yet.
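
Spelled out, the suggestion looks roughly like this (the console sink and output mode choice are assumptions):

import spark.implicits._ // for the $"..." column syntax; assumes a `spark` SparkSession in scope

val query = stream
  .groupBy($"id")
  .max("amount")
  .writeStream
  .outputMode("complete") // or "update"
  .format("console")
  .start()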