Re: Recommendation of using StreamSinkProvider for a custom KairosDB Sink

2018-06-25 Thread Girish Subramanian
Thanks, Tathagata Das. Currently in my use case I am planning to collect() all the data in the driver and publish it into KairosDB, something like this. I am not worried about the size of the data. If I am doing something this simple, do I still need to worry about the instability of the API?
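
For context, a minimal sketch of the "collect in the driver, then publish" idea described above, written as the addBatch body of a custom sink (the publishToKairos helper is a hypothetical stand-in for a real KairosDB write):

    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.execution.streaming.Sink

    class CollectingKairosSink extends Sink {
      override def addBatch(batchId: Long, data: DataFrame): Unit = {
        val rows: Array[Row] = data.collect() // pulls the whole micro-batch into the driver
        rows.foreach(publishToKairos)         // hypothetical publish call
      }
      private def publishToKairos(row: Row): Unit = () // stand-in for a real KairosDB write
    }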

[Spark Streaming] Spark Streaming with S3 vs Kinesis

2018-06-25 Thread Farshid Zavareh
I'm writing a Spark Streaming application where the input data is put into an S3 bucket in small batches (using the AWS Database Migration Service, DMS). The Spark application is the only consumer. I'm considering two possible architectures: Have Spark Streaming watch an S3 prefix and pick up new
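
A minimal sketch of the first architecture, using Structured Streaming's file source to watch an S3 prefix (the bucket path and schema here are placeholder assumptions):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{LongType, StringType, StructType}

    val spark = SparkSession.builder.appName("S3FileStream").getOrCreate()

    // file sources require the schema up front
    val schema = new StructType().add("id", LongType).add("payload", StringType)

    val input = spark.readStream
      .schema(schema)
      .csv("s3a://my-bucket/dms-output/") // hypothetical prefix that DMS writes into

    val query = input.writeStream.format("console").start()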

Re: Recommendation of using StreamSinkProvider for a custom KairosDB Sink

2018-06-25 Thread Tathagata Das
This interface is actually unstable. The v2 of the DataSource API is being designed right now and will be public and stable in a release or two. So unfortunately there is no stable interface right now that I can officially recommend. That said, you could always use the ForeachWriter interface
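
A minimal sketch of the ForeachWriter approach mentioned above (KairosClient and its endpoint are hypothetical stand-ins for a real KairosDB client):

    import org.apache.spark.sql.{ForeachWriter, Row}

    // hypothetical stand-in for a real KairosDB client
    class KairosClient(url: String) {
      def publish(row: Row): Unit = () // a real implementation would send the datapoint
      def close(): Unit = ()
    }

    class KairosForeachWriter extends ForeachWriter[Row] {
      var client: KairosClient = _

      def open(partitionId: Long, version: Long): Boolean = {
        client = new KairosClient("http://kairos-host:8080") // hypothetical endpoint
        true // returning true means this partition should be processed
      }

      def process(row: Row): Unit = client.publish(row)

      def close(errorOrNull: Throwable): Unit = if (client != null) client.close()
    }

    // usage: df.writeStream.foreach(new KairosForeachWriter).start()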

Recommendation of using StreamSinkProvider for a custom KairosDB Sink

2018-06-25 Thread subramgr
We are using Spark 2.3 and want to know whether it is recommended to create a custom KairosDB sink by implementing StreamSinkProvider. The interface is marked experimental and unstable? -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
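
For reference, a minimal sketch of the StreamSinkProvider shape being asked about; in Spark 2.3 these types live in internal/experimental packages, and all names other than the Spark types are hypothetical:

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.execution.streaming.Sink
    import org.apache.spark.sql.sources.StreamSinkProvider
    import org.apache.spark.sql.streaming.OutputMode

    class KairosSinkProvider extends StreamSinkProvider {
      def createSink(
          sqlContext: SQLContext,
          parameters: Map[String, String],
          partitionColumns: Seq[String],
          outputMode: OutputMode): Sink = new KairosSink(parameters)
    }

    class KairosSink(parameters: Map[String, String]) extends Sink {
      def addBatch(batchId: Long, data: DataFrame): Unit = {
        // write the micro-batch to KairosDB here
      }
    }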

Re: Can we get the partition index in a UDF

2018-06-25 Thread Vadim Semenov
Try using `TaskContext`:

    import org.apache.spark.TaskContext
    val partitionId = TaskContext.getPartitionId()

On Mon, Jun 25, 2018 at 11:17 AM Lalwani, Jayesh wrote:
>
> We are trying to add a column to a Dataframe with some data that is seeded by
> some random data. We want to be able to
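
Expanded into a self-contained sketch of Vadim's suggestion (the column names are hypothetical), the same call can be used from inside a UDF:

    import org.apache.spark.TaskContext
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder.appName("PartitionIdUdf").getOrCreate()
    import spark.implicits._

    // zero-argument UDF that reports the partition of the row being processed
    val partitionIdUdf = udf(() => TaskContext.getPartitionId())

    val df = (1 to 10).toDF("id").withColumn("pid", partitionIdUdf())
    df.show()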

Can we get the partition index in a UDF

2018-06-25 Thread Lalwani, Jayesh
We are trying to add a column to a DataFrame with some data that is seeded by some random data. We want to be able to control the seed, so that multiple runs of the same transformation generate the same output. We also want to generate different random numbers for each partition. This is easy to do
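
One common way to get this combination, sketched under the assumption that a per-partition generator is acceptable: seed one Random per partition via mapPartitionsWithIndex, so reruns reproduce the output while rows within a partition still get different values:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("SeededPerPartition").getOrCreate()
    import spark.implicits._

    val baseSeed = 42L
    val df = (1 to 100).toDF("id")

    val withRandom = df.rdd
      .mapPartitionsWithIndex { (partitionIndex, rows) =>
        // one deterministic generator per partition, so reruns reproduce the output
        val rng = new scala.util.Random(baseSeed + partitionIndex)
        rows.map(row => (row.getInt(0), rng.nextDouble()))
      }
      .toDF("id", "rnd")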

PySpark is not picking up correct Python version on Azure HDInsight

2018-06-25 Thread amit kumar singh
Hi guys, PySpark is not picking up the correct Python version on Azure HDInsight. The properties set up in spark2-env:

    PYSPARK_PYTHON=${PYSPARK3_PYTHON:-/usr/bin/anaconda/envs/py35/bin/python3}
    export PYSPARK_DRIVER_PYTHON=${PYSPARK3_PYTHON:-/usr/bin/anaconda/envs/py35/bin/python3}

Thanks

Error when joining on two bucketed tables

2018-06-25 Thread Vitaliy Pisarev
What I did: I have two datasets I need to join. One of the datasets does not change, so I bucket it once and save it in a table. It looks something like:

    spark.table("profiles").write.bucketBy(500, "uid").saveAsTable("profiles_bkt")

Now I have another dataset that I bucket "online":
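
A minimal sketch of the overall pattern described (the "events" dataset and the repartitioning step are assumptions, since the message is cut off):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("BucketedJoin").enableHiveSupport().getOrCreate()
    import spark.implicits._

    // one-time: bucket the static dataset and save it as a table
    spark.table("profiles").write.bucketBy(500, "uid").saveAsTable("profiles_bkt")

    // per-run: bring the changing dataset to a matching layout, then join on the key
    val online = spark.table("events").repartition(500, $"uid") // "events" is hypothetical
    val joined = spark.table("profiles_bkt").join(online, "uid")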

[Spark SQL] Was it correct that only one executor was used to shuffle the data for the reduce task?

2018-06-25 Thread des...@163.com
Hi all, I'm using Spark 2.0.0 on HDP 2.5.0 to build a Spark SQL app. Below is the spark-submit configuration:

    spark-submit \
      --class "FASTMDTFlow" \
      --master yarn \
      --deploy-mode client \
      --driver-memory 12g \
      --num-executors 110 \
      --executor-memory 8g \
      --executor-cores 3 \
      --conf
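
Not stated in the message, but worth noting for the question in the subject: reduce-side parallelism for Spark SQL shuffles is controlled by spark.sql.shuffle.partitions (default 200). A sketch of setting it explicitly:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("FASTMDTFlow")
      .config("spark.sql.shuffle.partitions", "330") // e.g. num-executors * executor-cores
      .getOrCreate()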

Re: Broadcast Variables

2018-06-25 Thread mrsanketh
Issue: Not able to broadcast or place files locally on the Spark worker nodes from a Spark application in cluster deploy mode. The Spark job always throws FileNotFoundException. Issue description: We are trying to access a Kafka cluster which is configured with SSL for encryption from Spark
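
Since the message is truncated, the following is only an assumed direction: one common way to get SSL files such as truststores onto every node is to ship them with --files and resolve the local copy through SparkFiles:

    import org.apache.spark.SparkFiles

    // after submitting with:
    //   spark-submit --deploy-mode cluster --files /local/path/kafka.truststore.jks ...
    // each node can resolve its local copy of the shipped file:
    val truststorePath = SparkFiles.get("kafka.truststore.jks")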