Re: What are your experiences using google cloud platform

2022-01-23 Thread German Schiavon
Hi, Changing cloud providers won't help if your job is slow, has skew, etc... I think first you have to see why "big jobs" are not completing. On Sun, 23 Jan 2022 at 22:18, Andrew Davidson wrote: > Hi recently started using GCP dataproc spark. > > > > Seem to have trouble getting big jobs to

Re: [Spark SQL]: Aggregate Push Down / Spark 3.2

2021-11-03 Thread German Schiavon
Hi, Rohit, can you share how it looks using DSv2? Thanks! On Wed, 3 Nov 2021 at 19:35, huaxin gao wrote: > Great to hear. Thanks for testing this! > > On Wed, Nov 3, 2021 at 4:03 AM Kapoor, Rohit > wrote: > >> Thanks for your guidance Huaxin. I have been able to test the push down >>

Re: Apply window function on data consumed from Kafka topic

2021-06-15 Thread German Schiavon
Hi, If you want help I'd suggest copying in the full code; you only shared the config part. On the other hand, if you are doing a project *now* I'd also suggest using Structured Streaming; I'm sure you would get

Re: Spark History Server log files questions

2021-03-23 Thread German Schiavon
Hey! I don't think you can do selective removals; I've never heard of it, but who knows. You can refer here to see all the available options -> https://spark.apache.org/docs/latest/monitoring.html. In my experience having 4 days' worth of logs is enough; usually if something fails you check it
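For reference, the relevant cleaner settings from that page look roughly like this in spark-defaults.conf (the 4d value just mirrors the retention mentioned above):

spark.history.fs.cleaner.enabled    true
spark.history.fs.cleaner.maxAge     4d
spark.history.fs.cleaner.interval   1d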

Re: Bucketing 3.1.1

2021-03-22 Thread German Schiavon
Ohh! That is why! I missed that rename. Thanks a lot Bartosz! On Mon, 22 Mar 2021 at 09:55, Bartosz Konieczny wrote: > Hi German Schiavon, > > The property is supported in shuffle hash join strategy too and it was > renamed here https://github.com/apache/spark/pull/2907

Bucketing 3.1.1

2021-03-22 Thread German Schiavon
Hi all! In the 3.1.1 release a new bucketing property was added in this PR. I'm trying to check this new behaviour but I'm not getting the same physical plan as the one given in the example. I'm executing the same code snippet from the PR in a 3.1.1
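For anyone trying to reproduce this kind of check, a minimal sketch (the table names and bucket counts are made up; this is not the PR's exact snippet):

// Two tables bucketed on the join key with different bucket counts
spark.range(100).write.bucketBy(8, "id").sortBy("id").saveAsTable("t1")
spark.range(100).write.bucketBy(4, "id").sortBy("id").saveAsTable("t2")

// Inspect the physical plan of the join
spark.table("t1").join(spark.table("t2"), "id").explain()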

Re: How default partitioning in spark is deployed

2021-03-16 Thread German Schiavon
Hi all, I guess you could do something like this too: [image: screenshot 2021-03-16 at 14.35.46.png] On Tue, 16 Mar 2021 at 13:31, Renganathan Mutthiah wrote: > Hi Attila, > > Thanks for looking into this! > > I actually found the issue and it turned out to be that the print >

Re: Thread spilling sort issue with single task

2021-01-26 Thread German Schiavon
be avoided. Could > you please suggest? > > I am on Spark 2.4 > > Thanks > Rajat > > On Tue, Jan 26, 2021 at 3:58 PM German Schiavon > wrote: > >> Hi, >> >> One word: SKEW >> >> It seems like the classic skew problem, you would have to apply

Re: Thread spilling sort issue with single task

2021-01-26 Thread German Schiavon
Hi, One word: SKEW. It seems like the classic skew problem; you would have to apply skew-handling techniques to repartition your data properly, or if you are on Spark 3.0+, try the skew join optimization. On Tue, 26 Jan 2021 at 11:20, rajat kumar wrote: > Hi Everyone, > > I am running a spark application
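As a sketch, the Spark 3.0+ skew-join optimization lives under adaptive query execution and can be switched on like this:

// Enable adaptive query execution and its skew-join handling (Spark 3.0+)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")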

Re: Structured Streaming Spark 3.0.1

2021-01-20 Thread German Schiavon
Hi, I couldn't reproduce this error :/ I wonder if there is something else underneath causing it...

*Input*
➜ kafka_2.12-2.5.0 ./bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test1
{"name": "pedro", "age": 50}
>{"name": "pedro", "age": 50}
>{"name": "pedro", "age": 50}

Re: Spark job stuck after read and not starting next stage

2021-01-20 Thread German Schiavon
Hi, not sure if it is your case, but if the source data is heavy and deeply nested, I'd recommend explicitly providing the schema when reading the JSON: df = spark.read.schema(schema).json(updated_dataset) On Thu, 21 Jan 2021 at 04:15, srinivasarao daruna wrote: > Hi, > I am running a spark
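A minimal sketch of what that could look like (the field names and path are made-up examples):

import org.apache.spark.sql.types._

// Define the schema up front so Spark skips the expensive inference pass
val schema = new StructType()
  .add("id", LongType)
  .add("payload", new StructType()
    .add("name", StringType)
    .add("tags", ArrayType(StringType)))

val df = spark.read.schema(schema).json("s3://bucket/updated_dataset/")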

Re: Data source v2 streaming sinks does not support Update mode

2021-01-18 Thread German Schiavon
Hi, This is the JIRA, and regarding the repo, I believe you can just commit it to your personal repo and that should be it. Regards On Mon, 18 Jan 2021 at 15:46, Eric Beabes wrote: > Sorry. Can you please tell me where to create the JIRA? Also is

Re: Spark DF does not rename the column

2021-01-04 Thread German Schiavon
Hi, I think you have a typo:

root
 |-- ceated: string (nullable = true)

and then: withColumnRenamed("created", "Date Calculated"). On Mon, 4 Jan 2021 at 19:12, Lalwani, Jayesh wrote: > You don’t have a column named “created”. The column name is “ceated”, > without the “r”

Re: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread German Schiavon
We've all been there! No reason to be ashamed :) On Wed, 16 Dec 2020 at 18:14, Loic DESCOTTE < loic.desco...@kaizen-solutions.net> wrote: > Oh thank you, you're right!! I feel shameful > > -- > *From:* German Schiavon > *Sent:* Wednesday 16 Dec

Re: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread German Schiavon
Hi, it seems that you have a typo, no?

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hfds

data.write.mode("overwrite").format("text").save("hfds://hdfs-namenode/user/loic/result.txt")

On Wed, 16 Dec 2020 at 17:02, Loic DESCOTTE < loic.desco...@kaizen-solutions.net>
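The corrected call simply swaps the scheme:

data.write.mode("overwrite").format("text").save("hdfs://hdfs-namenode/user/loic/result.txt")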

Re: Missing required configuration "partition.assignment.strategy" [ Kafka + Spark Structured Streaming ]

2020-12-07 Thread German Schiavon
Hi, I think the issue is that you are overriding the kafka-clients that comes with spark-sql-kafka-0-10_2.12. I'd try removing the explicit kafka-clients dependency and see if it works. On Sun, 6 Dec 2020 at 08:01, Amit Joshi wrote: > Hi All, > > I am running Spark Structured Streaming along with Kafka. >
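In build.sbt terms, a sketch of the idea (the version number is an assumption; match it to your Spark version):

// Depend only on the connector and let it pull in a compatible kafka-clients transitively
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.0.1"
// ...and drop any explicit "org.apache.kafka" % "kafka-clients" % "x.y.z" line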

Re: Kafka structured straming - how to read headers

2020-12-03 Thread German Schiavon
Hello, see if this works, from the documentation:

// Subscribe to 1 topic, with headers
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .option("includeHeaders", "true")
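The snippet in the docs continues by loading the stream and projecting the headers column, roughly (reproduced from memory from the structured-streaming Kafka integration guide):

  .load()

// The Kafka record headers then show up as a "headers" column
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")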

Re: Structured Streaming Checkpoint Error

2020-12-03 Thread German Schiavon
the same checkpoint and mess up. > With checkpoint in HDFS, the rename is atomic and only one succeeds even in > parallel, and the other query that lost the write to the checkpoint file simply > fails. That's a caveat you may want to keep in mind. > > On Wed, Dec 2, 2020 at 11:35 PM German Schiav

Re: Structured Streaming Checkpoint Error

2020-12-02 Thread German Schiavon
Hello! @Gabor Somogyi I wonder whether, now that S3 is *strongly consistent*, it would work fine. Regards! https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/ On Thu, 17 Sep 2020 at 11:55, German Schiavon wrote: > Hi Gabor, > > Makes sense, tha

Re: Stream-static join : Refreshing subset of static data / Connection pooling

2020-11-26 Thread German Schiavon
So that's it, no? You can push down the *in* filter in the query with the IDs and only retrieve those rows. On Thu, 26 Nov 2020 at 16:49, Geervan Hayatnagarkar wrote: > Yes, we used collect to get all IDs in forEachBatch > > No, the static table is huge and is updated frequently by other systems

Re: Stream-static join : Refreshing subset of static data / Connection pooling

2020-11-26 Thread German Schiavon
Hi! I guess you could use *foreachBatch* and do something like:

foreachBatch {
  ids = getIds
  spark.read.jdbc(query where id is in $ids)
  join
  write
}

The only thing is that in order to get the IDs you would have to do a collect, no? Or how are you retrieving them right now? On Thu, 26 Nov 2020 at
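Fleshed out, a sketch could look like this (streamingDF, the JDBC URL, credentials, and the table/column names are all assumptions):

import org.apache.spark.sql.DataFrame

def joinWithStatic(batch: DataFrame, batchId: Long): Unit = {
  // Collect this micro-batch's join keys to the driver
  val ids = batch.select("id").distinct().collect().map(_.getLong(0))
  if (ids.nonEmpty) {
    // Read only the matching rows from the static table
    val static = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://host:3306/db")
      .option("user", "myuser")
      .option("password", "mypassword")
      .option("query", s"SELECT * FROM static_table WHERE id IN (${ids.mkString(",")})")
      .load()
    batch.join(static, "id").write.mode("append").parquet("/output/path")
  }
}

streamingDF.writeStream.foreachBatch(joinWithStatic _).start()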

Re: Spark Dataset withColumn issue

2020-11-12 Thread German Schiavon
ds.select("Col1", "Col2", "Col3") On Thu, 12 Nov 2020 at 15:28, Vikas Garg wrote: > In Spark Datase, if we add additional column using > withColumn > then the column is added in the last. > > e.g. > val ds1 = ds.select("Col1", "Col3").withColumn("Col2", lit("sample")) > > the the order of

Re: Writing to mysql from pyspark spark structured streaming

2020-10-16 Thread German Schiavon
Hi, can you share the code? You have to call *writeStream* on a streaming Dataset/DataFrame. On Fri, 16 Oct 2020 at 02:14, Krishnanand Khambadkone wrote: > Hi, I am trying to write to mysql from a spark structured streaming kafka > source. Using spark 2.4.0. I get this exception, > >
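For what it's worth, a sketch (in Scala) of writing a streaming DataFrame to MySQL via foreachBatch, since there is no built-in streaming JDBC sink (streamingDF, the URL, table, and credentials are placeholders):

import java.util.Properties
import org.apache.spark.sql.DataFrame

val props = new Properties()
props.setProperty("user", "myuser")
props.setProperty("password", "mypassword")

def writeToMySQL(batch: DataFrame, batchId: Long): Unit = {
  // Each micro-batch is a static DataFrame, so the batch JDBC writer works here
  batch.write.mode("append").jdbc("jdbc:mysql://host:3306/db", "target_table", props)
}

streamingDF.writeStream.foreachBatch(writeToMySQL _).start().awaitTermination()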

ForeachBatch Structured Streaming

2020-10-14 Thread German Schiavon
Hi! In the documentation it says: - By default, foreachBatch provides only at-least-once write guarantees. However, you can use the batchId provided to the function as a way to deduplicate the output and get an exactly-once guarantee. Taking the example snippet:
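One way to read that advice, as a sketch (streamingDF, the output path, and the format are assumptions): tag every row with the batchId and overwrite that batch's partition, so a replayed batch replaces its own earlier output instead of duplicating it.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

def writeBatch(batch: DataFrame, batchId: Long): Unit = {
  batch.withColumn("batch_id", lit(batchId))
    .write
    .mode("overwrite")
    // Only overwrite the partitions present in this write, not the whole table
    .option("partitionOverwriteMode", "dynamic")
    .partitionBy("batch_id")
    .parquet("/output/path")
}

streamingDF.writeStream.foreachBatch(writeBatch _).start()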

Re: Spark watermarked aggregation query and append output mode

2020-09-23 Thread German Schiavon
Hi, try this:

dataset.printSchema(); // see the output below

Dataset ds1 = dataset
    .withWatermark("timestamp", "1 second")
    .groupBy(
        functions.window(*col("timestamp")*, "1 second", "1 second"),

Re: Structured Streaming Checkpoint Error

2020-09-17 Thread German Schiavon
work like a charm. > > BR, > G > > > On Wed, Sep 16, 2020 at 4:12 PM German Schiavon > wrote: > >> Hi! >> >> I have a Structured Streaming application that reads from Kafka, >> performs some aggregations and writes to S3 in Parquet format. >>

Structured Streaming Checkpoint Error

2020-09-16 Thread German Schiavon
Hi! I have a Structured Streaming application that reads from Kafka, performs some aggregations, and writes to S3 in Parquet format. Everything seems to work great, except that from time to time I get a checkpoint error; at the beginning I thought it was a random error, but it happened more than 3

Re: Spark Streaming with Kafka and Python

2020-08-12 Thread German Schiavon
Hey, Maybe I'm missing some restriction with EMR, but have you tried to use Structured Streaming instead of Spark Streaming? https://spark.apache.org/docs/2.4.5/structured-streaming-kafka-integration.html Regards On Wed, 12 Aug 2020 at 14:12, Hamish Whittal wrote: > Hi folks, > > Thought I
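For reference, the Structured Streaming equivalent of a Kafka consumer is only a few lines; a minimal sketch in Scala (the broker and topic are placeholders):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "mytopic")
  .load()

// Print the decoded payload to the console as micro-batches arrive
df.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("console")
  .start()
  .awaitTermination()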

Re: S3 read/write from PySpark

2020-08-06 Thread German Schiavon
Hey, I think *BasicAWSCredentialsProvider* is no longer supported by Hadoop. I couldn't find it in the master branch, but I could in the 2.8 branch; maybe that's why it works with Hadoop 2.7. I use *TemporaryAWSCredentialsProvider*. Hope it helps. On Thu, 6 Aug 2020 at 03:16, Daniel Stojanov wrote: >
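As a sketch, wiring that provider up looks something like this (the key values are placeholders; the fs.s3a property names come from the hadoop-aws docs):

val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoopConf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoopConf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
hadoopConf.set("fs.s3a.session.token", "YOUR_SESSION_TOKEN")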

Re: Tab delimited csv import and empty columns

2020-07-30 Thread German Schiavon Matteo
Hey, I understand that the empty values in your CSV are ""; if so, try this option: *.option("emptyValue", "\"\"")* Hope it helps. On Thu, 30 Jul 2020 at 08:49, Stephen Coy wrote: > Hi there, > > I’m trying to import a tab delimited file with: > > Dataset catalogData = sparkSession >
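In context, a sketch of the full tab-delimited read (the path and header option are assumptions):

val catalogData = sparkSession.read
  .option("sep", "\t")            // tab-delimited input
  .option("header", "true")
  .option("emptyValue", "\"\"")   // treat "" as the marker for empty fields
  .csv("/path/to/catalog.tsv")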