Re: Using Spark Accumulators with Structured Streaming

2020-06-08 Thread Something Something
*Honestly, I don't know how to do this in Scala.* I tried something like this... *.mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())( new StateUpdater(myAcc))* StateUpdater is similar to what Zhang has provided, but it does NOT compile because I need to return a 'Dataset'. Here's
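For readers following along, here is a minimal sketch of passing a plain function (rather than a separate class instance) to mapGroupsWithState while incrementing an accumulator inside it. The Event/SessionState/SessionOutput types and the accumulator name are illustrative assumptions, not the poster's actual code:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Illustrative types; the poster's actual schema is not shown in the thread.
case class Event(id: String, value: Long)
case class SessionState(count: Long)
case class SessionOutput(id: String, count: Long)

object AccumulatorWithState {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("acc-example").getOrCreate()
    import spark.implicits._

    // Accumulator registered on the driver; executors send their updates back to it.
    val myAcc = spark.sparkContext.longAccumulator("myAcc")

    val events: Dataset[Event] = spark.readStream
      .format("rate").load()
      .selectExpr("CAST(value AS STRING) AS id", "value")
      .as[Event]

    // mapGroupsWithState takes a function, so the accumulator can simply be
    // captured in the closure instead of being passed to a separate class.
    val updated: Dataset[SessionOutput] = events
      .groupByKey(_.id)
      .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout()) {
        (id: String, values: Iterator[Event], state: GroupState[SessionState]) =>
          val prev = state.getOption.getOrElse(SessionState(0L))
          val next = SessionState(prev.count + values.size)
          state.update(next)
          myAcc.add(1) // one state update per key per micro-batch
          SessionOutput(id, next.count)
      }

    updated.writeStream.outputMode("update").format("console").start().awaitTermination()
  }
}
```

Whether the driver actually sees these updates in cluster mode is exactly the open question in this thread; the listener-based approach quoted later reads the value on the driver after each micro-batch.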

Re: Using Spark Accumulators with Structured Streaming

2020-06-08 Thread Srinivas V
Ya, I had asked this question before. No one responded. By the way, what's your actual name, "Something Something", if you don't mind me asking? On Tue, Jun 9, 2020 at 12:27 AM Something Something <mailinglist...@gmail.com> wrote: > What is scary is this interface is marked as "experimental"

Re: Using Spark Accumulators with Structured Streaming

2020-06-08 Thread Something Something
What is scary is this interface is marked as "experimental": @Experimental @InterfaceStability.Evolving public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable { R call(K key, Iterator<V> values, GroupState<S> state) throws Exception; } On Mon, Jun 8, 2020 at 11:54 AM Something Something

Re: Using Spark Accumulators with Structured Streaming

2020-06-08 Thread Something Something
Right, this is exactly how I have it right now. The problem is that in cluster mode 'myAcc' does NOT get distributed. Try it out in cluster mode & you will see what I mean. I think the way Zhang is using it will work. Will try & revert. On Mon, Jun 8, 2020 at 10:58 AM Srinivas V wrote: > > You don’t

Re: Using Spark Accumulators with Structured Streaming

2020-06-08 Thread Srinivas V
You don’t need to have a separate class. I created one as it has a lot of code and logic in my case. For you to quickly test, you can use Zhang’s Scala code in this chain. Pasting it below for your quick reference: ```scala spark.streams.addListener(new StreamingQueryListener { override
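The quoted snippet is cut off above. A minimal sketch of that listener-based pattern, assuming a LongAccumulator named myAcc on the SparkContext (a reconstruction of the idea, not necessarily Zhang's exact code):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

val spark = SparkSession.builder.appName("listener-example").getOrCreate()
val myAcc = spark.sparkContext.longAccumulator("myAcc")

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // Listener callbacks run on the driver, so the merged accumulator
    // value can be read here after every micro-batch.
    println(s"batch=${event.progress.batchId} myAcc=${myAcc.value}")
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})
```

The listener only reads the value; the accumulator itself is still updated inside the stateful function running on the executors.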

Re: we control spark file names before we write them - should we opensource it?

2020-06-08 Thread Panos Bletsos
May I ask how you handle multiple partitions? Can't two files end up with the same name with this approach, or am I missing something? BR, Panos

Re: Add python library

2020-06-08 Thread Patrick McCarthy
I've found Anaconda encapsulates modules and dependencies nicely, and you can deploy all the needed .so files by shipping a whole conda environment. I've used this method with success:

RE: we control spark file names before we write them - should we opensource it?

2020-06-08 Thread Stefan Panayotov
Yes, I think so. Stefan Panayotov, PhD

we control spark file names before we write them - should we opensource it?

2020-06-08 Thread ilaimalka
Hi, as part of our work we needed more control over the names of the files written out by Spark, e.g. instead of "part-...csv.gz" we want something like "15988891_1748330679_20200507124153.tsv.gz", where the first number is hardcoded, the second one is the value from partitionBy, and the third
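For context on what such control usually involves: the DataFrame writer does not expose a simple option for the final file name, so one common approach is to write normally and then rename the part files through the Hadoop FileSystem API. A rough sketch of that idea, not necessarily the poster's implementation (the prefix, index, and timestamp pieces here are purely illustrative):

```scala
import java.text.SimpleDateFormat
import java.util.Date

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Rough sketch: write as usual, then rename each part file. The poster's
// scheme uses a partitionBy value as the middle token; a plain index is
// used here only to keep the example short.
def writeWithCustomNames(spark: SparkSession, df: DataFrame,
                         outDir: String, prefix: String): Unit = {
  df.write
    .option("sep", "\t")
    .option("compression", "gzip")
    .csv(outDir)

  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val ts = new SimpleDateFormat("yyyyMMddHHmmss").format(new Date())

  fs.listStatus(new Path(outDir))
    .filter(_.getPath.getName.startsWith("part-"))
    .zipWithIndex
    .foreach { case (status, i) =>
      fs.rename(status.getPath, new Path(outDir, s"${prefix}_${i}_$ts.tsv.gz"))
    }
}
```

Renaming after the commit keeps the write atomic from Spark's point of view; the trade-off is an extra listing and rename pass on the output directory.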