[Error :] RDD TO Dataframe Spark Streaming

2018-01-31 Thread Divya Gehlot
Hi, I am getting the below error when creating a DataFrame from a Twitter streaming RDD: val sparkSession: SparkSession = SparkSession.builder.appName("twittertest2").master("local[*]").enableHiveSupport()
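The error itself is truncated above, so for reference only, a minimal Scala sketch of the usual pattern for turning each streaming RDD into a DataFrame; the socket source stands in for the Twitter stream and all names are illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val spark = SparkSession.builder()
  .appName("twittertest2")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
// stand-in for the Twitter DStream; any DStream[String] converts the same way
val lines = ssc.socketTextStream("localhost", 9999)

lines.foreachRDD { rdd =>
  val df = rdd.toDF("text")             // RDD[String] -> DataFrame via spark.implicits._
  df.createOrReplaceTempView("tweets")
  spark.sql("SELECT count(*) AS n FROM tweets").show()
}

ssc.start()
ssc.awaitTermination()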

Re: ML:One vs Rest with crossValidator for multinomial in logistic regression

2018-01-31 Thread Nicolas Paris
Hey, I am also interested in how to get those parameters. For example, the demo code spark-2.2.1-bin-hadoop2.7/examples/src/main/python/ml/estimator_transformer_param_example.py returns empty parameters when printing "lr.extractParamMap()". That's weird. Thanks. On 30 Jan 2018 at 23:10, Bryan
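For reference, a hedged Scala sketch of where the tuned parameters can be read after cross-validation with OneVsRest over LogisticRegression; `training` is a hypothetical DataFrame with label/features columns, and the grid values are illustrative:

import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel, OneVsRest, OneVsRestModel}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}

val lr = new LogisticRegression()
val ovr = new OneVsRest().setClassifier(lr)

val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(ovr)
  .setEvaluator(new MulticlassClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

// `training` is a hypothetical DataFrame with "label" and "features" columns
val cvModel: CrossValidatorModel = cv.fit(training)

// the tuned parameters live on the fitted models, not on the original estimator,
// which is why extractParamMap() on the estimator can look uninformative
val bestOvr = cvModel.bestModel.asInstanceOf[OneVsRestModel]
bestOvr.models.foreach { m =>
  println(m.asInstanceOf[LogisticRegressionModel].extractParamMap())
}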

Re: Spark Structured Streaming for Twitter Streaming data

2018-01-31 Thread Divya Gehlot
Got it. Thanks for the clarification, TD! On Thu, 1 Feb 2018 at 11:36 AM, Tathagata Das wrote: > The code uses the format "socket", which is only for text sent over a > simple socket, which is completely different from how the Twitter APIs work. > So this won't work at

Re: Spark Structured Streaming for Twitter Streaming data

2018-01-31 Thread Tathagata Das
The code uses the format "socket", which is only for text sent over a simple socket, which is completely different from how the Twitter APIs work. So this won't work at all. Fundamentally, for Structured Streaming, we have focused only on those streaming sources that have the capabilities of record-level

FOSDEM mini-office hour?

2018-01-31 Thread Holden Karau
Hi Spark Friends, If any folks are around for FOSDEM this year, I was planning on doing a coffee office hour on the last day after my talks. Maybe around 6pm? I'm also going to see if any BEAM folks are around and interested :) Cheers,

Re: Spark Structured Streaming for Twitter Streaming data

2018-01-31 Thread Divya Gehlot
Hi, I see. Does that mean Spark Structured Streaming doesn't work with Twitter streams? I could see that people used Kafka or other streaming tools and then used Spark to process the data with Structured Streaming. So the below doesn't work directly with a Twitter stream unless I set up Kafka? > import

Re: Spark Structured Streaming for Twitter Streaming data

2018-01-31 Thread Tathagata Das
Hello Divya, To add further clarification, Apache Bahir does not have any Structured Streaming support for Twitter. It only has support for Twitter + DStreams. TD On Wed, Jan 31, 2018 at 2:44 AM, vermanurag wrote: > Twitter functionality is not part of Core

Re: Max number of streams supported ?

2018-01-31 Thread Yogesh Mahajan
Thanks Michael, TD, for the quick reply. It was helpful. I will let you know the numbers (limit) based on my experiments. On Wed, Jan 31, 2018 at 3:10 PM, Tathagata Das wrote: > Just to clarify a subtle difference between DStreams and Structured > Streaming. Multiple

Re: mapGroupsWithState in Python

2018-01-31 Thread ayan guha
Thanks a lot, TD, exactly what I was looking for. And I have seen most of your talks; really great stuff you guys are doing :) On Thu, Feb 1, 2018 at 10:38 AM, Tathagata Das wrote: > Hello Ayan, > > From what I understand, mapGroupsWithState (probably the more

Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-01-31 Thread Tathagata Das
Could you give the full stack trace of the exception? Also, can you do `dataframe2.explain(true)` and show us the plan output? On Wed, Jan 31, 2018 at 3:35 PM, M Singh wrote: > Hi Folks: > > I have to add a column to a structured *streaming* dataframe but when I

Re: mapGroupsWithState in Python

2018-01-31 Thread Tathagata Das
Hello Ayan, From what I understand, mapGroupsWithState (probably the more general flatMapGroupsWithState) is the best way forward (not available in Python). However, you need to figure out your desired semantics of when you want to output the deduplicated data from the streaming query. For
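A minimal Scala sketch of the approach described here, assuming deduplication by an `id` field; the `Event` type and the socket source are illustrative only, and no state timeout is set:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// illustrative event type; the key to deduplicate on is `id`
case class Event(id: String, payload: String)

val spark = SparkSession.builder().appName("dedup-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val events = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999).load()
  .as[String]
  .map { line => val parts = line.split(","); Event(parts(0), parts(1)) }

// keep one Boolean of state per key: emit an event only the first time its id is seen
val deduped = events
  .groupByKey(_.id)
  .flatMapGroupsWithState[Boolean, Event](OutputMode.Append, GroupStateTimeout.NoTimeout) {
    (_: String, rows: Iterator[Event], state: GroupState[Boolean]) =>
      if (state.exists) Iterator.empty
      else { state.update(true); rows.take(1) }
  }

deduped.writeStream.outputMode("append").format("console").start().awaitTermination()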

Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-01-31 Thread M Singh
Hi Folks: I have to add a column to a structured streaming DataFrame, but when I do that (using select or withColumn) I get an exception. I can add a column to a non-streaming structured DataFrame. I could not find any documentation on how to do this in the following doc
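Since the actual exception is not shown, here is only the general shape that adding columns to a streaming DataFrame normally takes, using the built-in rate source so the sketch is self-contained:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{current_timestamp, lit}

val spark = SparkSession.builder().appName("add-column-sketch").master("local[*]").getOrCreate()

// the built-in rate source is handy for a self-contained streaming test
val stream = spark.readStream.format("rate").option("rowsPerSecond", "1").load()

// withColumn / select with ordinary expressions are supported on streaming DataFrames
val withExtra = stream
  .withColumn("source", lit("rate"))
  .withColumn("ingest_ts", current_timestamp())

withExtra.writeStream.format("console").outputMode("append").start().awaitTermination()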

Re: Prefer Structured Streaming over Spark Streaming (DStreams)?

2018-01-31 Thread Michael Armbrust
At this point I recommend that new applications be built using Structured Streaming. The engine was GA-ed as of Spark 2.2 and I know of several very large (trillions of records) production jobs that are running in Structured Streaming. All of our production pipelines at Databricks are written

Re: Max number of streams supported ?

2018-01-31 Thread Tathagata Das
Just to clarify a subtle difference between DStreams and Structured Streaming: multiple input streams in a DStreamGraph likely mean they are all being processed/computed in the same way, as there can be only one streaming query/context active in the StreamingContext. However, in the case of
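A small sketch of the Structured Streaming side of that distinction: several independent queries started from one SparkSession, each with its own lifecycle (the rate sources are stand-ins for real inputs):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-query-sketch").master("local[*]").getOrCreate()

// two independent sources; each started query is its own streaming computation
val s1 = spark.readStream.format("rate").option("rowsPerSecond", "1").load()
val s2 = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

val q1 = s1.writeStream.queryName("query1").format("console").start()
val q2 = s2.writeStream.queryName("query2").format("console").start()

// each query is coordinated by its own thread on the driver
spark.streams.awaitAnyTermination()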

Re: Max number of streams supported ?

2018-01-31 Thread Michael Armbrust
-dev +user > Similarly for Structured Streaming, would there be any limit on the number > of streaming sources I can have? > There is no fundamental limit, but each stream will have a thread on the driver that is doing coordination of execution. We comfortably run 20+ streams on a single

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-31 Thread Vishnu Viswanath
Hi Mans, The watermark in Spark is used to decide when to clear the state, so if the event is delayed beyond the point when the state is cleared by Spark, it will be ignored. I recently wrote a blog post on this: http://vishnuviswanath.com/spark_structured_streaming.html#watermark Yes, this state is
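A minimal sketch of the watermark behaviour described above, using the rate source's `timestamp` column as event time; the window and watermark durations are illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().appName("watermark-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// the rate source emits an event-time column called "timestamp"
val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

// state for a 5-minute window is dropped once the watermark
// (max event time seen - 10 minutes) passes the end of that window;
// events arriving later than that are ignored
val counts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

counts.writeStream.outputMode("update").format("console").start().awaitTermination()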

Data of ArrayType field getting truncated when saving to parquet

2018-01-31 Thread HARSH TAKKAR
Hi, I have a DataFrame with a field of type array which is quite large. When I try to save the data to a Parquet file and read it again, the array field comes back as an empty array. Please help. Harsh
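Since no code is shown, a minimal round-trip check one could run to narrow this down; the path and the data are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("array-parquet-check").master("local[*]").getOrCreate()
import spark.implicits._

// small round-trip check for an array column; the path is only illustrative
val df = Seq((1, Seq.fill(10000)("x")), (2, Seq("a", "b"))).toDF("id", "items")
df.write.mode("overwrite").parquet("/tmp/array_check")

val back = spark.read.parquet("/tmp/array_check")
back.printSchema()
// compare array lengths instead of eyeballing the (display-truncated) contents
back.selectExpr("id", "size(items) AS n_items").show()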

Singular Value Decomposition (SVD) in Spark Java

2018-01-31 Thread Donni Khan
Hi, I would like to use *Singular Value Decomposition* (SVD) to extract the important concepts from a collection of text documents. I applied the whole preprocessing pipeline (Tokenizer, IDFModel, Matrix, ...), then I applied SVD: SingularValueDecomposition svd =
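The original snippet is Java and truncated; a short Scala sketch of the equivalent RowMatrix.computeSVD call, with a tiny stand-in matrix in place of the real tf-idf vectors:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("svd-sketch").master("local[*]").getOrCreate()

// the tf-idf vectors would normally come out of the Tokenizer/IDF pipeline;
// a tiny stand-in term-document matrix is used here
val rows = spark.sparkContext.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.0, 3.0, 4.0),
  Vectors.dense(5.0, 6.0, 0.0)
))

val mat = new RowMatrix(rows)
// k leading singular values/vectors; computeU = true also materializes U
val svd = mat.computeSVD(2, computeU = true)

println(svd.s)                       // singular values (strength of each concept)
println(svd.V)                       // right singular vectors (terms vs. concepts)
svd.U.rows.take(3).foreach(println)  // documents projected into concept space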

Re: Prefer Structured Streaming over Spark Streaming (DStreams)?

2018-01-31 Thread vijay.bvp
Here are my two cents; experts, please correct me if I'm wrong. It's important to understand why you would pick one over the other, and for what kind of use case. There may come a time when the low-level APIs are abstracted away and become legacy, but for now the RDD API is Spark's core, low-level API; all higher

Re: Spark Streaming Cluster queries

2018-01-31 Thread vijay.bvp
Assuming you are talking about Spark Streaming: 1) How to analyze what part of the code executes on the Spark driver and what part executes on the executors? RDDs can be understood as a set of data transformations or a set of jobs. Your understanding deepens as you do more programming with Spark.
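A tiny sketch of the usual rule of thumb: code outside the functions passed to RDD operations runs on the driver, while the functions themselves run on the executors:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("driver-vs-executor").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// this line runs on the driver and only describes the data set
val data = sc.parallelize(1 to 100, 4)

// the function passed to map is serialized and executed on the executors
val squared = data.map(x => x * x)

// collect() ships the results back; everything after it runs on the driver again
val result = squared.collect()
println(result.sum)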

Re: Spark Structured Streaming for Twitter Streaming data

2018-01-31 Thread vermanurag
Twitter functionality is not part of core Spark. We have successfully used the following package from Maven Central in the past: org.apache.bahir:spark-streaming-twitter_2.11:2.2.0. Earlier there used to be a Twitter package under Spark, but I find that it has not been updated beyond Spark 1.6
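A hedged sketch of how that Bahir package is typically used with DStreams; the OAuth values are placeholders and the filter term is illustrative:

// launched e.g. with: spark-shell --packages org.apache.bahir:spark-streaming-twitter_2.11:2.2.0
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val conf = new SparkConf().setAppName("twitter-dstream-sketch").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(10))

// with None for the auth argument, the receiver reads the twitter4j.oauth.*
// system properties; real credentials go in place of the placeholders
System.setProperty("twitter4j.oauth.consumerKey", "...")
System.setProperty("twitter4j.oauth.consumerSecret", "...")
System.setProperty("twitter4j.oauth.accessToken", "...")
System.setProperty("twitter4j.oauth.accessTokenSecret", "...")

val tweets = TwitterUtils.createStream(ssc, None, filters = Seq("spark"))
tweets.map(_.getText).print()

ssc.start()
ssc.awaitTermination()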

Prefer Structured Streaming over Spark Streaming (DStreams)?

2018-01-31 Thread Biplob Biswas
Hi, I read an article which recommended using DataFrames instead of RDD primitives. Now I have read about the differences between DStreams and Structured Streaming, and Structured Streaming adds a lot of improvements like checkpointing, windowing, sessionization, fault tolerance, etc. What I am

Re: [Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-01-31 Thread vijay.bvp
Summarizing: 1) The static data set, read from Parquet files in HDFS as a DataFrame, has an initial parallelism of 90 (based on the number of input files). 2) The static DataFrame is converted to an RDD, and that RDD has a parallelism of 18, which was not expected. dataframe.rdd is lazily evaluated, so there must be some operation
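A small sketch of how one might check (and, if needed, override) the partition count at each step; the path is illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parallelism-check").master("local[*]").getOrCreate()

// path is only illustrative; point it at the static Parquet data set
val df = spark.read.parquet("/path/to/static/data")
println(df.rdd.getNumPartitions)      // parallelism of the DataFrame converted to an RDD

// if a downstream stage needs more parallelism, repartition explicitly
val widened = df.repartition(90)
println(widened.rdd.getNumPartitions)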

Re: Type Casting Error in Spark Data Frame

2018-01-31 Thread vijay.bvp
Assuming the MessageHelper.sqlMapping schema is correctly mapped to the input JSON (it would help if the schema and a sample JSON were shared), here is the explode function with DataFrames; similar functionality is available in SQL. import sparkSession.implicits._ import
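A self-contained sketch of explode with DataFrames, using an illustrative schema in place of MessageHelper.sqlMapping:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().appName("explode-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// illustrative nested record: one row per order, with an array of items
val df = Seq(
  ("order-1", Seq("apple", "pear")),
  ("order-2", Seq("banana"))
).toDF("order_id", "items")

// explode produces one output row per array element
val flat = df.select($"order_id", explode($"items").as("item"))
flat.show()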
