Re: Metrics Problem

2020-06-26 Thread Srinivas V
One option is to build your main jar with the metrics jar included, like a fat jar. On Sat, Jun 27, 2020 at 8:04 AM Bryan Jeffrey wrote: > Srinivas, > > Thanks for the insight. I had not considered a dependency issue as the > metrics jar works well applied on the driver. Perhaps my main jar >

Re: [Structured spark streaming] How does the Cassandra connector readstream deal with deleted records

2020-06-26 Thread Russell Spitzer
The connector uses Java driver CQL requests under the hood, which means it responds to the changing database like a normal application would. This means retries may result in a different set of data than the original request if the underlying database changed. On Fri, Jun 26, 2020, 9:42 PM Jungtaek
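
A toy sketch of the point Russell is making, in plain Python standing in for the Cassandra table (the `store` dict and `read_all` helper are illustrative stand-ins, not connector APIs): every read is a snapshot of the store's current state, so a retried read after a delete returns different rows.

```python
# Toy stand-in for a mutable table: each "read" sees the store's current
# state, so a retry after a delete returns a different result set.
store = {1: "alice", 2: "bob", 3: "carol"}

def read_all(table):
    """Simulate a full-table CQL read: a snapshot of whatever exists now."""
    return sorted(table.items())

first = read_all(store)      # initial request
del store[2]                 # a delete lands between the two reads
retry = read_all(store)      # the retried request sees the new state

print(first)  # [(1, 'alice'), (2, 'bob'), (3, 'carol')]
print(retry)  # [(1, 'alice'), (3, 'carol')]
```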

Re: [Structured spark streaming] How does the Cassandra connector readstream deal with deleted records

2020-06-26 Thread Jungtaek Lim
I'm not sure how it is implemented, but in general I wouldn't expect such behavior from connectors that read from non-streaming storage. The query result may depend on "when" the records are fetched. If you need to reflect the changes in your query you'll probably want to find a way

Re: Metrics Problem

2020-06-26 Thread Bryan Jeffrey
Srinivas, Thanks for the insight. I had not considered a dependency issue, as the metrics jar works well applied on the driver. Perhaps my main jar includes the Hadoop dependencies but the metrics jar does not? I am confused, as the only Hadoop dependency also exists for the built-in metrics

Re: Metrics Problem

2020-06-26 Thread Srinivas V
It should work when you give an HDFS path, as long as your jar exists at that path. Your error is more likely a security issue (Kerberos) or missing Hadoop dependencies, I think; your error says: org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation On Fri, Jun 26, 2020 at 8:44 PM

Re: Spark Structured Streaming: “earliest” as “startingOffsets” is not working

2020-06-26 Thread Srinivas V
Cool. Are you not using a watermark? Also, is it possible to start listening to offsets from a specific date/time? Regards Srini On Sat, Jun 27, 2020 at 6:12 AM Eric Beabes wrote: > My apologies... After I set the 'maxOffsetsPerTrigger' to a value such as > '20' it started working. Hopefully

Re: Spark Structured Streaming: “earliest” as “startingOffsets” is not working

2020-06-26 Thread Eric Beabes
My apologies... After I set the 'maxOffsetsPerTrigger' to a value such as '20' it started working. Hopefully this will help someone. Thanks. On Fri, Jun 26, 2020 at 2:12 PM Something Something < mailinglist...@gmail.com> wrote: > My Spark Structured Streaming job works fine when I set
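The fix Eric describes, capping `maxOffsetsPerTrigger`, amounts to chunking the backlog: starting from the earliest offset, each trigger pulls at most N offsets instead of the whole retained history in one batch. A rough sketch of that batching in plain Python (the `batches` helper is illustrative, not a Spark API):

```python
# Plain-Python sketch of what maxOffsetsPerTrigger does: instead of one huge
# batch covering the entire backlog from "earliest", each trigger processes
# at most a fixed number of offsets.
def batches(earliest, latest, max_offsets_per_trigger):
    """Yield (start, end) offset ranges, one per micro-batch."""
    start = earliest
    while start < latest:
        end = min(start + max_offsets_per_trigger, latest)
        yield (start, end)
        start = end

print(list(batches(0, 50, 20)))  # [(0, 20), (20, 40), (40, 50)]
```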

Spark Structured Streaming: “earliest” as “startingOffsets” is not working

2020-06-26 Thread Something Something
My Spark Structured Streaming job works fine when I set "startingOffsets" to "latest". When I simply change it to "earliest" & specify a new "checkpoint directory", the job doesn't work. The states don't get timed out after 10 minutes. While debugging I noticed that my 'state' logic is indeed

Re: apache-spark mongodb dataframe issue

2020-06-26 Thread Mannat Singh
Hi Jeff, Thanks for confirming the same. I have also thought about reading every MongoDB document separately along with its schema and then comparing them to the schemas of all the documents in the collection. For our huge database this is a horrible, horrible approach, as you have already

Data Explosion and repartition before group bys

2020-06-26 Thread lsn24
Hi, We have a use case where one record needs to be in two different aggregations. Say for example a credit card transaction "A", which belongs to the transaction categories ATM and cross-border. If I need to take the count of ATM transactions, I need to consider transaction A. For the count of
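
The "explosion" in the subject comes from duplicating each record once per category it belongs to, so every aggregation sees it. A minimal plain-Python sketch of that explode-then-group step (the sample `transactions` data is hypothetical):

```python
from collections import Counter

# One transaction can belong to several aggregation categories, so it is
# "exploded" into one (category, txn) row per category before grouping --
# the data explosion the post describes.
transactions = [
    {"id": "A", "categories": ["ATM", "cross-border"]},
    {"id": "B", "categories": ["ATM"]},
]

exploded = [(cat, txn["id"]) for txn in transactions for cat in txn["categories"]]
counts = Counter(cat for cat, _ in exploded)

print(counts["ATM"])           # 2 -- transaction A is counted here...
print(counts["cross-border"])  # 1 -- ...and here as well
```

In Spark terms this maps to an `explode` over the category array followed by a `groupBy` on the exploded column.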

Re: Metrics Problem

2020-06-26 Thread Bryan Jeffrey
It may be helpful to note that I'm running in YARN cluster mode. My goal is to avoid having to manually distribute the JAR to all of the various nodes, as this makes versioning deployments difficult. On Thu, Jun 25, 2020 at 5:32 PM Bryan Jeffrey wrote: > Hello. > > I am running Spark 2.4.4. I

Re: Spark 3 pod template for the driver

2020-06-26 Thread Michel Sumbul
Hi Jorge, If I set that in the spark-submit command it works, but I want it only in the pod template file. Best regards, Michel On Fri, Jun 26, 2020 at 14:01, Jorge Machado wrote: > Try to set spark.kubernetes.container.image > > On 26. Jun 2020, at 14:58, Michel Sumbul wrote: > > Hi guys, >

Re: Spark 3 pod template for the driver

2020-06-26 Thread Jorge Machado
Try to set spark.kubernetes.container.image > On 26. Jun 2020, at 14:58, Michel Sumbul wrote: > > Hi guys, > > I'm trying to use Spark 3 on top of Kubernetes and to specify a pod template for > the driver. > > Here is my pod manifest for the driver, and when I do a spark-submit with the > option:

Spark 3 pod template for the driver

2020-06-26 Thread Michel Sumbul
Hi guys, I'm trying to use Spark 3 on top of Kubernetes and to specify a pod template for the driver. Here is my pod manifest for the driver, and when I do a spark-submit with the option: --conf spark.kubernetes.driver.podTemplateFile=/data/k8s/podtemplate_driver3.yaml I got the error message that I
