Read Avro Data using Spark Streaming

2018-11-02 Thread Divya Narayan
Hi, I produced Avro data to a Kafka topic using the schema registry, and now I want to use Spark Streaming to read that data and do some computation in real time. Can someone please give sample code for doing that? I couldn't find any working code online. I am using Spark version 2.2.0 and
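Not from the original thread, but a minimal PySpark sketch of one way to do this with Structured Streaming: read the raw Kafka bytes and decode the Confluent-framed Avro payload in a UDF using fastavro. The topic name, bootstrap servers, schema, and field names are placeholders, and with Spark 2.2.0 the spark-sql-kafka-0-10 package must be on the classpath.

```python
import io
from fastavro import parse_schema, schemaless_reader
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("kafka-avro-demo").getOrCreate()

# Placeholder writer schema; in practice this matches the schema in the registry.
schema = parse_schema({
    "type": "record", "name": "Event",
    "fields": [{"name": "id", "type": "string"}],
})

def decode(value):
    # Confluent wire format: 1 magic byte + 4-byte schema id, then Avro binary.
    return schemaless_reader(io.BytesIO(value[5:]), schema)["id"]

decode_udf = udf(decode, StringType())

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "avro-topic")
          .load()
          .select(decode_udf(col("value")).alias("id")))

query = events.writeStream.format("console").start()
query.awaitTermination()
```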

Is it possible to customize Spark TF-IDF implementation

2018-11-02 Thread Soheil Pourbafrani
Hi, I want to know whether it is possible to customize the logic of TF-IDF in Apache Spark. In typical TF-IDF the TF is computed for each word per document. For example, the TF of word "A" can differ between documents D1 and D2, but I want to see the TF as the term frequency across the whole
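As far as I know there is no built-in switch for this in Spark ML's HashingTF/IDF, but a corpus-level TF can be computed by hand. A rough sketch, assuming a DataFrame df with a doc_id column and a tokenized array column (both names are assumptions):

```python
from pyspark.sql import functions as F

# One row per (document, term) occurrence.
tokens = df.select("doc_id", F.explode("tokenized").alias("term"))

# Term frequency over the whole corpus instead of per document.
global_tf = tokens.groupBy("term").agg(F.count("*").alias("global_tf"))

# Attach the corpus-wide TF back to each document/term pair.
weighted = tokens.join(global_tf, on="term")
```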

Re: Spark Structured Streaming handles compressed files

2018-11-02 Thread Lian Jiang
Any clue? Thanks. On Wed, Oct 31, 2018 at 8:29 PM Lian Jiang wrote: > We have jsonl files, each of which is compressed as a gz file. Is it possible > to make SSS handle such files? Appreciate any help! >
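For what it's worth, Spark's file sources (including the streaming file source) decompress files by extension, so .gz-compressed JSON lines files should be readable directly. A minimal sketch, assuming an existing SparkSession named spark; the schema and path are placeholders:

```python
from pyspark.sql.types import StructType, StructField, StringType

# Streaming file sources require an explicit schema; this one is a placeholder.
json_schema = StructType([StructField("id", StringType())])

stream = (spark.readStream
          .schema(json_schema)
          .json("/data/incoming/"))   # picks up *.jsonl.gz files as they arrive

query = stream.writeStream.format("console").start()
```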

Multiply Matrix by its transpose gives undesired output

2018-11-02 Thread Soheil Pourbafrani
Hi, I want to compute the cosine similarities of vectors using Apache Spark. In a simple example, I created a vector from each document using the built-in TF-IDF. Here is the code: hashingTF = HashingTF(inputCol="tokenized", outputCol="tf") tf = hashingTF.transform(df) idf = IDF(inputCol="tf",
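One way to get cosine similarities without writing the multiplication by hand is to L2-normalize the TF-IDF vectors (so dot products are cosines) and let BlockMatrix do the A * A^T product. A sketch, assuming a DataFrame tfidf_df with a numeric doc_id column and a tfidf vector column (names are assumptions):

```python
from pyspark.ml.feature import Normalizer
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# Unit-norm vectors make dot products equal to cosine similarities.
normalizer = Normalizer(inputCol="tfidf", outputCol="norm", p=2.0)
normed = normalizer.transform(tfidf_df)

rows = normed.rdd.map(lambda r: IndexedRow(r["doc_id"], r["norm"].toArray()))
mat = IndexedRowMatrix(rows).toBlockMatrix()

# similarities block matrix: entry (i, j) is the cosine between documents i and j
similarities = mat.multiply(mat.transpose())
```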

Re: Pyspark create RDD of dictionary

2018-11-02 Thread Soheil Pourbafrani
Got it, thanks! On Fri, Nov 2, 2018 at 7:18 PM Eike von Seggern wrote: > Hi, > > Soheil Pourbafrani wrote on Fri, 2 Nov 2018 > at 15:43: > >> Hi, I have an RDD of the form (((a), (b), (c), (d)), (e)) and I want to >> transform every row into a dictionary of the form a:(b, c, d, e) >> >>

Re: Pyspark create RDD of dictionary

2018-11-02 Thread Eike von Seggern
Hi, Soheil Pourbafrani wrote on Fri, 2 Nov 2018 at 15:43: > Hi, I have an RDD of the form (((a), (b), (c), (d)), (e)) and I want to > transform every row into a dictionary of the form a:(b, c, d, e) > > Here is my code, but it has an error! > > map(lambda row : {row[0][0] : (row[1],

Pyspark create RDD of dictionary

2018-11-02 Thread Soheil Pourbafrani
Hi, I have an RDD of the form (((a), (b), (c), (d)), (e)) and I want to transform every row into a dictionary of the form a:(b, c, d, e). Here is my code, but it has an error! map(lambda row : {row[0][0] : (row[1], row[0][1], row[0][2], row[0][3])) Is it possible to do such a transformation?
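For reference, the snippet above is missing the closing brace of the dictionary literal, and map must be called on the RDD itself. A corrected sketch, assuming the RDD is called rdd:

```python
# Each row becomes a one-entry dict: {a: (e, b, c, d)}
result = rdd.map(
    lambda row: {row[0][0]: (row[1], row[0][1], row[0][2], row[0][3])}
)
```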

Spark Listeners for getting dataset partition information in streaming application

2018-11-02 Thread Kuttaiah Robin
Hello, Is there a way a Spark Streaming application can be notified at the start and end of reading data from a dataset partition? I want to create a partition-specific cache at the start and delete it once the partition has been read completely. Thanks for your help in advance. regards, Robin
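I am not aware of a listener that fires per dataset partition, but one common workaround is to scope the cache inside mapPartitions, so it is created when the partition read starts and dropped when it finishes. A rough sketch, not the listener API:

```python
def process_partition(rows):
    cache = {}                    # built when Spark starts reading this partition
    for row in rows:
        # ... look things up in / add things to the partition-local cache ...
        yield row
    cache.clear()                 # cleaned up once the partition is fully read

result = rdd.mapPartitions(process_partition)
```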

Re: how to use cluster sparkSession like localSession

2018-11-02 Thread Gabriel Wang
Agree. Spark is not designed for embedding in business applications (traditional J2EE apps) for real-time interaction. Thanks, Gabriel On Fri, Nov 2, 2018 at 2:36 PM 张万新 wrote: > I think you should investigate Apache Zeppelin and Livy 崔苗 (Data and AI Product Development Department) <0049003...@znv.com> wrote on Nov 2, 2018

Re: how to use cluster sparkSession like localSession

2018-11-02 Thread 张万新
I think you should investigate Apache Zeppelin and Livy 崔苗 (Data and AI Product Development Department) <0049003...@znv.com> wrote on Fri, Nov 2, 2018 at 11:01: > > Hi, > we want to execute Spark code without submitting an application.jar, like this > code: > > public static void main(String args[]) throws Exception{ > SparkSession spark =
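To illustrate the Livy suggestion: a web application can submit code to a running Spark cluster over Livy's REST API instead of embedding a SparkSession. A minimal sketch; the Livy URL is a placeholder, and in practice you would poll until the session and statement reach the available state:

```python
import requests

livy = "http://livy-server:8998"

# Create an interactive Spark session on the cluster.
session = requests.post(f"{livy}/sessions", json={"kind": "spark"}).json()

# Run a statement in that session; the result comes back as JSON.
statement = requests.post(
    f"{livy}/sessions/{session['id']}/statements",
    json={"code": "spark.range(10).count()"},
).json()
```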

Re: how to use cluster sparkSession like localSession

2018-11-02 Thread 数据与人工智能产品开发部 (Data and AI Product Development Department)
We use Spark in a web server, with no application.jar and no spark-submit to the cluster.