Re: S3 token times out during data frame "write.csv"

2018-01-27 Thread Gourav Sengupta
Hi, There is definitely a parameter, set when creating temporary security credentials, that specifies the number of minutes those credentials will be active. There is an upper limit, of course, which is around 3 days if I remember correctly, and the default, as you can see, is 30 mins. Can you let me

Custom Catalyst Optimizer Strategy for DataFrame Writes?

2018-01-27 Thread CCInCharge
I've been working with Datastax's spark-cassandra-connector, and have noticed that, when creating batches of DataFrame Rows to write to database, write throughput is increased substantially and overall task completion time is decreased if the user sorts the DataFrame on Cassandra partition key

Semi-supervised learning in MLlib

2018-01-27 Thread Franco Victorio
Hi, I'm working on the implementation of a semi-supervised algorithm in Spark and I want it to implement the interfaces provided by MLlib, so that it can use things like model selection. My problem is that, as far as I can tell, the provided interfaces are meant for supervised algorithms (for

Spark Streaming Cluster queries

2018-01-27 Thread puneetloya
Hi All, A cluster of one Spark driver and multiple executors (5) is set up with Redis for storing Spark-processed data, and S3 is used for checkpointing. I have a couple of queries about this setup. 1) How to analyze what part of code executes on Spark Driver and what part of code executes on the

Optimize sort merge join

2018-01-27 Thread Antoine Bonnin
Hi all, I'm relatively new to Spark, and something is bothering me about optimizing a sort merge join from Parquet. My work consists of getting stats on purchases for a retail company. For example, I have to calculate the mean purchase over a period, for a segment of products and a segment of clients.