Re: Pyspark Structured Streaming Error

2018-07-12 Thread Arbab Khalil
Remove the kafka-clients package and add the starting offset to the options:

    df = spark.readStream\
        .format("kafka")\
        .option("zookeeper.connect", "localhost:2181")\
        .option("kafka.bootstrap.servers", "localhost:9092")\
        .option("subscribe", "ingest")\
        .option("failOnDataLoss", "false")\
        .option("startingOffsets", "earliest")\
        .load()

Re: How to avoid duplicate column names after join with multiple conditions

2018-07-12 Thread Prem Sure
Yes Nirav, we can probably request dev for a config param that takes care of this automatically (internally) - until then, additional care is required from users while specifying column names and joining. Thanks, Prem On Thu, Jul 12, 2018 at 10:53 PM Nirav Patel wrote: > Hi Prem, dropping column,

Re: [Structured Streaming] Custom StateStoreProvider

2018-07-12 Thread Jungtaek Lim
Girish, I think reading through the implementation of HDFSBackedStateStoreProvider, as well as the relevant traits, should give you an idea of how to implement a custom one. HDFSBackedStateStoreProvider is not that complicated to read and understand. You just need to deal with your underlying storage
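(For reference, a minimal sketch of how a custom provider is plugged in once written - the class name below is an invented placeholder; spark.sql.streaming.stateStore.providerClass is the SQL conf that selects the provider:)

    import org.apache.spark.sql.SparkSession

    // Point Structured Streaming at the custom StateStoreProvider implementation.
    val spark = SparkSession.builder()
      .appName("custom-state-store")
      .config("spark.sql.streaming.stateStore.providerClass",
              "com.example.state.MyStateStoreProvider")  // hypothetical class
      .getOrCreate()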

Reading multiple files in Spark / which pattern to use

2018-07-12 Thread Marco Mistroni
hi all, i have multiple files stored in S3 following the pattern YYYY-MM-DD-securities.txt. I want to read multiple files at the same time. I am attempting to use this pattern, for example 2016-01*securities.txt,2016-02*securities.txt,2016-03*securities.txt, but it does not seem to work
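(A hedged sketch of two glob styles that Hadoop's path matching usually accepts - the bucket name is a placeholder:)

    // brace alternation in a single glob
    val df = spark.read.text("s3a://my-bucket/2016-0{1,2,3}*securities.txt")

    // or pass each month's glob as a separate path argument
    val df2 = spark.read.text(
      "s3a://my-bucket/2016-01*securities.txt",
      "s3a://my-bucket/2016-02*securities.txt",
      "s3a://my-bucket/2016-03*securities.txt")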

Re: How to validate orc vectorization is working within spark application?

2018-07-12 Thread umargeek
Hello Jorn, I am unable to post the entire code due to some data-sharing related issues. Use Case: I am performing aggregations after reading data from an HDFS file every minute, and would like to understand how to run them with vectorisation enabled and what the prerequisites are to enable it successfully
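(Not from the thread, but a hedged sketch of the ORC settings usually involved in Spark 2.3+ and one way to inspect whether the vectorized reader is picked - the HDFS path is a placeholder:)

    spark.conf.set("spark.sql.orc.impl", "native")
    spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

    val df = spark.read.orc("hdfs:///data/events")
    df.groupBy("key").count().explain()  // check the plan for the native/columnar ORC scan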

Upgrading spark history server, no logs showing.

2018-07-12 Thread bbarks
Hi, We have multiple installations of spark installed on our clusters. They reside in different directories which the jobs point to when they run. For a couple of years now, we've run our history server off spark 2.0.2. We have 2.1.2, 2.2.1 and 2.3.0 installed as well. I've tried upgrading to
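(A guess at the usual culprit rather than a diagnosis: the upgraded history server has to point at the same event-log directory the jobs write to. A hedged example of the relevant properties, with placeholder paths, in the conf/spark-defaults.conf read by sbin/start-history-server.sh:)

    spark.history.fs.logDirectory  hdfs:///shared/spark-history
    spark.eventLog.enabled         true
    spark.eventLog.dir             hdfs:///shared/spark-history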

Pyspark Structured Streaming Error

2018-07-12 Thread umargeek
Hi All, I am trying to test structured streaming using pyspark. The spark-submit command and packages used are: pyspark2 --master=yarn --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 --packages org.apache.kafka:kafka-clients:0.10.0.1 but I am getting the following error,
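(A hedged side note: per the reply above, the kafka-clients coordinate can simply be dropped, and if several packages are ever needed, --packages takes one comma-separated list rather than being repeated, e.g.:)

    pyspark2 --master=yarn --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0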

Re: Spark ML online serving

2018-07-12 Thread Maximiliano Felice
Hi! I know I'm late, but just to point out some highlights of our use case. We currently: - use Spark as an ETL tool, followed by - a Python (numpy/pandas based) pipeline to preprocess information, and - use Tensorflow for training our Neural Networks. What we'd love to, and why we don't:

Re: Interest in adding ability to request GPU's to the spark client?

2018-07-12 Thread Mich Talebzadeh
I agree. Adding GPU capability to Spark is in my opinion a must for Advanced Analytics. HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Interest in adding ability to request GPU's to the spark client?

2018-07-12 Thread Maximiliano Felice
Hi, I've been meaning to reply to this email for a while now, sorry for taking so much time. I personally think that adding GPU resource management will allow us to boost some ETL performance a lot. For the last year, I've worked on transforming some Machine Learning pipelines from Python in

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-12 Thread Arun Mahadevan
What I meant was that the number of partitions cannot be varied per sink with a ForeachWriter, versus writing to each sink using independent queries. Maybe this is obvious. I am not sure about the difference you highlight about the performance part. The commit happens once per micro batch and

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-12 Thread chandan prakash
Thanks a lot Arun for your response. I got your point that existing sink plugins like Kafka, etc. cannot be used. However I could not get the part: "you cannot scale the partitions for the sinks independently". Can you please rephrase that part? Also, I guess: using foreachwriter for

Re: How to avoid duplicate column names after join with multiple conditions

2018-07-12 Thread Nirav Patel
Hi Prem, dropping the column and renaming the column are both working for me as a workaround. I just thought it would be nice to have a generic API that handles that for me, or some intelligence that, since both columns are the same, it shouldn't complain in the subsequent Select clause that it doesn't know whether I mean a#12 or a#81.

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-12 Thread Arun Mahadevan
Yes, ForeachWriter [1] could be an option if you want to write to different sinks. You can put in your custom logic to split the data into different sinks. The drawback here is that you cannot plug in existing sinks like Kafka and you need to write the custom logic yourself, and you cannot scale the

Re: How to avoid duplicate column names after join with multiple conditions

2018-07-12 Thread Prem Sure
Hi Nirav, did you try .drop(df1("a")) after the join? Thanks, Prem On Thu, Jul 12, 2018 at 9:50 PM Nirav Patel wrote: > Hi Vamshi, > > That api is very restricted and not generic enough. It imposes that all > conditions of joins has to have same column on both side and it also has to > be equijoin. It
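(A small Scala illustration of the drop-after-join workaround being discussed - the column names are just examples:)

    // join on columns that exist in both frames, then drop the duplicates from one side
    val joined = df1.join(df2, df1("a") === df2("a") && df1("b") === df2("b"))
      .drop(df2("a"))
      .drop(df2("b"))
    // only df1's "a" and "b" remain, so a later select("a") is no longer ambiguous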

Running Spark on Kubernetes behind a HTTP proxy

2018-07-12 Thread Lalwani, Jayesh
We are trying to run a Spark job on Kubernetes cluster. The Spark job needs to talk to some services external to the Kubernetes cluster through a proxy server. We are setting the proxy by setting the extraJavaOptions like this --conf spark.executor.extraJavaOptions=" -Dhttps.proxyHost=myhost
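(A hedged sketch of the full flag set usually involved - host and port are placeholders, and the proxy normally has to be set for both the driver and the executors:)

    spark-submit \
      --conf spark.driver.extraJavaOptions="-Dhttps.proxyHost=myhost -Dhttps.proxyPort=8080" \
      --conf spark.executor.extraJavaOptions="-Dhttps.proxyHost=myhost -Dhttps.proxyPort=8080" \
      ...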

Re: How to avoid duplicate column names after join with multiple conditions

2018-07-12 Thread Nirav Patel
Hi Vamshi, That api is very restricted and not generic enough. It imposes that all join conditions have the same column on both sides and that the join is an equijoin. It doesn't serve my use case, where some join predicates don't have the same column names. Thanks On Sun, Jul 8, 2018 at 7:39 PM,

Re: how to specify external jars in program with SparkConf

2018-07-12 Thread Prem Sure
I think the JVM is already initiated with the available classpath by the time your conf code executes... I faced this earlier with Spark 1.6 and ended up moving to spark-submit with --jars, having found that it was not part of the runtime config changes.. May I know the advantage you are trying to get programmatically? On
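(For reference, the --jars form being referred to - the jar path and main class are placeholders:)

    spark-submit --jars /home/hadoop/libs/new-version.jar --class com.example.Main app.jar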

streaming from mongo

2018-07-12 Thread Chethan
Hi Dev, I am receiving data from MongoDB through Spark Streaming. It gives a DStream of org.bson.Document. How do I convert DStream[Document] to a DataFrame? All my other operations are in DataFrames. Thanks, Chethan.
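(One hedged way to do it, not from the thread: assuming dstream is the DStream[Document] from the Mongo receiver, convert each Document to JSON inside foreachRDD and let spark.read.json infer the schema:)

    import org.bson.Document
    import org.apache.spark.sql.SparkSession

    dstream.foreachRDD { rdd =>
      val spark = SparkSession.builder().getOrCreate()
      import spark.implicits._
      val df = spark.read.json(rdd.map((doc: Document) => doc.toJson).toDS())
      // ... run the existing DataFrame operations here
    }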

how to specify external jars in program with SparkConf

2018-07-12 Thread mytramesh
Context: On EMR the classpath has an old version of a jar, and I want to refer to a new version of the jar in my code. Through a bootstrap action, while spinning up new nodes, I copied the necessary jars from S3 to a local folder. In the spark-submit command, by using the extra classpath parameter, my code is able to refer to the new version of the jar, which is
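(A hedged sketch of the programmatic equivalent being asked about - whether it takes effect depends on when the JVM classpath is assembled, as the reply above notes, and the paths are invented:)

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.driver.extraClassPath", "/home/hadoop/libs/new-version.jar")
      .set("spark.executor.extraClassPath", "/home/hadoop/libs/new-version.jar")
      .set("spark.jars", "/home/hadoop/libs/new-version.jar")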

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-12 Thread chandan prakash
Hi, Has anyone thought about writing a custom foreach sink writer which can decide which record should go to which sink (based on some marker in the record, which we can possibly annotate during transformation) and then write to the specific sink accordingly? This would mean that: 1. every custom
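(A rough Scala sketch of the routing ForeachWriter being proposed - the "marker" column and the sink clients are assumptions, not anything agreed in the thread:)

    import org.apache.spark.sql.{ForeachWriter, Row}

    class RoutingWriter extends ForeachWriter[Row] {
      override def open(partitionId: Long, version: Long): Boolean = {
        // open connections to the downstream sinks here
        true
      }
      override def process(row: Row): Unit = row.getAs[String]("marker") match {
        case "audit" => () // write the record to sink A
        case _       => () // write the record to sink B
      }
      override def close(errorOrNull: Throwable): Unit = {
        // flush and close the connections
      }
    }

    // df.writeStream.foreach(new RoutingWriter).start()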

Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-12 Thread Jayant Shekhar
Hello Chetan, Sorry, I missed replying earlier. You can find some sample code here: http://sparkflows.readthedocs.io/en/latest/user-guide/python/pipe-python.html We will continue adding more there. Feel free to ping me directly in case of questions. Thanks, Jayant On Mon, Jul 9, 2018 at 9:56
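(For context, a generic sketch of piping records through an external Python script from Scala - the script name and the tab-separated record format are assumptions, and the linked sparkflows page describes its own approach:)

    // each serialized row is written to the script's stdin; its stdout lines come back as an RDD[String]
    val piped = df.rdd
      .map(_.mkString("\t"))
      .pipe("python process_records.py")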

How to register custom structured streaming source

2018-07-12 Thread Farshid Zavareh
Hello. I need to create a custom streaming source by extending *FileStreamSource*. The idea is to override *commit*, so that processed files (S3 objects in my case) are renamed to have a certain prefix. However, I don't know how to use this custom source. Obviously I don't want to compile Spark
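(A hedged sketch of how a custom source is usually wired up once compiled into its own jar, without rebuilding Spark - the provider class and bucket are invented; the provider would typically implement StreamSourceProvider, plus DataSourceRegister if a short name is wanted, and be shipped with --jars or --packages:)

    val stream = spark.readStream
      .format("com.example.sources.RenamingFileSourceProvider")  // fully qualified provider class, or its registered short name
      .option("path", "s3a://my-bucket/incoming")
      .load()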