Re: Spark 2.4.3 - Structured Streaming - high on Storage Memory

2019-06-15 Thread puneetloya
Some more information on my original post:

I have been seeing a lot of these log messages:

1) The state for version 15109(other numbers too) doesn't exist in
loadedMaps. Reading snapshot file and delta files if needed...Note that this
is normal for the first batch of starting query.

2) KafkaConsumer cache hitting max capacity of 64, removing consumer for
CacheKey
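
For reference, both messages correspond to settings that exist in Spark 2.4. The Kafka consumer cache size is read from `spark.sql.kafkaConsumerCache.capacity` (default 64), and the number of state versions kept in loadedMaps is governed by `spark.sql.streaming.maxBatchesToRetainInMemory` (default 2). Below is a minimal sketch of setting them when the session is built. The values and the helper name `apply_streaming_conf` are illustrative assumptions, not recommendations, and whether either setting explains the memory usage here is untested.

```python
# Illustrative values only; tune against your own workload.
STREAMING_CONF = {
    # Soft limit on cached Kafka consumers per executor JVM (Spark 2.4 default: 64).
    "spark.sql.kafkaConsumerCache.capacity": "128",
    # State store versions retained in memory per provider (Spark 2.4 default: 2).
    "spark.sql.streaming.maxBatchesToRetainInMemory": "2",
}


def apply_streaming_conf(builder, conf=STREAMING_CONF):
    """Apply each key/value to a SparkSession.Builder-like object and return it."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Usage (requires pyspark, so left commented out; the app name is hypothetical):
# from pyspark.sql import SparkSession
# spark = apply_streaming_conf(SparkSession.builder.appName("load-test")).getOrCreate()
```

The settings are applied at session build time because the consumer cache capacity is read from the Spark configuration when executors start, so changing it on a running session is unlikely to have any effect.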



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark 2.4.3 - Structured Streaming - high on Storage Memory

2019-06-15 Thread puneetloya
Hi,

We just upgraded Spark from 2.2.3 to 2.4.3.

We ran a load test with a week's worth of messages in Kafka and are seeing
odd behavior: why is the storage memory so high? We have run similar
workloads with Spark 2.2.3 and never saw this. Has something basic about
Spark changed?

Our main changes for 2.4.3:
1) We started using the Cassandra sink supported in Spark 2.4.
2) We moved from Hadoop 2.7.3 to Hadoop 3.1.1, mainly because we use S3
checkpointing and the AWS SDK used with Hadoop 2.7.3 does not have a fix
for connection retries to S3 storage.

Thanks,
Puneet






unsubscribe

2019-06-15 Thread Humberto Marchezi
-- 
Humberto C Marchezi
-


Creating Spark buckets that Presto / Athena / Hive can leverage

2019-06-15 Thread Daniel Mateus Pires
Hi there!

I am trying to optimize joins on data created by Spark, so I'd like to
bucket the data to avoid shuffling.

Every day I write to immutable partitions by writing the data to a local
HDFS and then copying it to S3. Is there a combination of bucketBy
options and DDL that I can use so that Presto/Athena JOINs leverage the
special layout of the data?

e.g.
CREATE EXTERNAL TABLE ...(on Presto/Athena)
df.write.bucketBy(...).partitionBy(...). (in spark)
then copy this data to S3 with s3-dist-cp
then MSCK REPAIR TABLE (on Presto/Athena)

Daniel
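
A minimal PySpark-style sketch of the Spark side of the steps above. Two caveats, as far as I know for Spark 2.4: `bucketBy` only works together with `saveAsTable` (a plain path-based `save` rejects it), and Spark's bucketing uses a different hash function from Hive's, so engines that expect Hive-style buckets may not be able to use this layout for joins. The helper name `write_bucketed`, the table and column names, and the bucket count are all hypothetical.

```python
def write_bucketed(df, table, partition_cols, bucket_cols, num_buckets, mode="append"):
    """Write df as a partitioned, bucketed Parquet table in the Spark catalog."""
    (
        df.write
        .partitionBy(*partition_cols)         # one directory per partition value
        .bucketBy(num_buckets, *bucket_cols)  # hash rows into a fixed number of buckets
        .sortBy(*bucket_cols)                 # sort rows within each bucket
        .format("parquet")
        .mode(mode)
        .saveAsTable(table)                   # bucketBy requires saveAsTable
    )


# Usage (hypothetical names; df is an existing DataFrame):
# write_bucketed(df, "events_bucketed", ["event_date"], ["user_id"], 32)
```

Whether Presto or Athena can exploit the resulting files depends on how those engines interpret the bucket metadata and file naming, so it is worth checking the join plan on a small table before relying on it.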