Re: Should I use Dstream or Structured Stream to transfer data from source to sink and then back from sink to source?

2017-09-14 Thread kant kodali
Thanks T.D! And sorry for the typo. It's very helpful to know that whatever I was achieving with DStreams I can also achieve with Structured Streaming. It turned out there was some other error in my code, which I fixed, and it seems to be working fine now! Thanks again! On Thu, Sep 14,

Re: [SS] Any way to optimize memory consumption of SS?

2017-09-14 Thread 张万新
There is expected to be about 5 million UUIDs in a day. I need to use this field to drop duplicate records and count the number of events. If I simply count records without using dropDuplicates it occupies less than 1 GB of memory. I believe most of the memory is occupied by the state store for keeping the
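A minimal sketch (the Kafka source and the "uuid"/"eventTime" column names are assumptions, not taken from the original post) of bounding the deduplication state with a watermark, so old keys can be evicted instead of accumulating all day:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()
import spark.implicits._

// Assumed source and column names; the real job reads its own input.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS uuid", "timestamp AS eventTime")

// With a watermark on the event-time column, state for UUIDs older than the
// threshold can be dropped instead of being kept for the whole day.
val deduped = events
  .withWatermark("eventTime", "1 day")
  .dropDuplicates("uuid", "eventTime")

val counts = deduped.groupBy(window($"eventTime", "1 hour")).count()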

PLs assist: trying to FlatMap a DataSet / partially OT

2017-09-14 Thread Marco Mistroni
Hi all, could anyone assist please? I am trying to flatMap a DataSet[(String, String)] and I am getting errors in Eclipse. The errors are more Scala-related than Spark-related, but I was wondering if someone came across a similar situation. Here's what I've got: a DS of (String, String), out of which I
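A self-contained sketch (the sample data and the comma-splitting are made up for illustration) of flatMapping a Dataset[(String, String)]; a common stumbling block in an IDE is the missing implicit Encoder, which import spark.implicits._ provides:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("flatmap-sketch").master("local[*]").getOrCreate()
import spark.implicits._   // brings the tuple Encoders into scope

val ds = Seq(("k1", "a,b,c"), ("k2", "d,e")).toDS()

// Flatten each (key, csv) pair into one (key, element) pair per element.
// Note: a pattern-matching lambda like { case (k, v) => ... } can trip over the
// overloaded Java/Scala flatMap signatures, so a plain lambda is used here.
val flattened = ds.flatMap(kv => kv._2.split(",").toSeq.map(e => (kv._1, e)))
flattened.show()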

Re: [SS] Any way to optimize memory consumption of SS?

2017-09-14 Thread Michael Armbrust
How many UUIDs do you expect to have in a day? That is likely where all the memory is being used. Does it work without that? On Tue, Sep 12, 2017 at 8:42 PM, 张万新 wrote: > Yes, my code is shown below (I also posted my code in another mail) > /** > * input > */ >

Re: [SS]How to add a column with custom system time?

2017-09-14 Thread Michael Armbrust
Can you show the explain() for the version that doesn't work? You might just be hitting a bug. On Tue, Sep 12, 2017 at 9:03 PM, 张万新 wrote: > It seems current_timestamp() cannot be used directly in window function? > because after attempts I found that using > >
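For reference, a minimal sketch (inputDf and the column name are assumptions) of producing the plan output being asked for, for a query that windows on a current_timestamp() column:

import org.apache.spark.sql.functions.{col, current_timestamp, window}

// inputDf stands in for the streaming DataFrame from the original thread.
val withTs = inputDf.withColumn("procTime", current_timestamp())
val counts = withTs.groupBy(window(col("procTime"), "1 minute")).count()

counts.explain(true)   // prints the parsed, analyzed, optimized and physical plans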

Re: cannot cast to double from spark row

2017-09-14 Thread Ram Sriharsha
try df.select($"col".cast(DoubleType)) import org.apache.spark.sql.types._ val df = spark.sparkContext.parallelize(Seq(("1.04"))).toDF("c") df.select($"c".cast(DoubleType)) On Thu, Sep 14, 2017 at 9:20 PM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Hi, > > I am getting below
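Expanded into a standalone snippet (assuming a spark-shell-style SparkSession named spark; the values are illustrative), the suggestion above looks like:

import org.apache.spark.sql.types.DoubleType
import spark.implicits._

val df = Seq("1.04", "2.50").toDF("c")
val casted = df.select($"c".cast(DoubleType).alias("c"))
casted.printSchema()   // c: double (nullable = true)
casted.show()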

Re: [Structured Streaming] Multiple sources best practice/recommendation

2017-09-14 Thread Michael Armbrust
I would probably suggest that you partition by format (though you can get the file name from the built-in function input_file_name()). You can load multiple streams from different directories and union them together as long as the schema is the same after parsing. Otherwise you can just run
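A minimal sketch (the directories, formats and eventSchema are assumptions) of loading two differently formatted directories as separate streams and unioning them once both parse to the same schema; input_file_name() records which file each row came from:

import org.apache.spark.sql.functions.input_file_name

// eventSchema is an assumed, pre-defined StructType shared by both sources.
val jsonStream = spark.readStream.schema(eventSchema).json("/data/in/json")
  .withColumn("source_file", input_file_name())

val csvStream = spark.readStream.schema(eventSchema).csv("/data/in/csv")
  .withColumn("source_file", input_file_name())

// union requires the two sides to line up column-for-column after parsing.
val combined = jsonStream.union(csvStream)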

cannot cast to double from spark row

2017-09-14 Thread KhajaAsmath Mohammed
Hi, I am getting the below error when trying to cast a column value from a Spark dataframe to double. Any ideas? I tried many solutions but none of them worked. java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double 1. row.getAs[Double](Constants.Datapoint.Latitude) 2.
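Two hedged workarounds, sketched under the assumption that the underlying column is actually stored as a String (the "latitude" column name is an assumption):

// Option 1: read the value with its real type and convert it in Scala.
val lat: Double = row.getAs[String](Constants.Datapoint.Latitude).toDouble

// Option 2: cast once at the DataFrame level, after which getAs[Double] is safe.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType
val fixed = df.withColumn("latitude", col("latitude").cast(DoubleType))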

Re: RDD order preservation through transformations

2017-09-14 Thread Suzen, Mehmet
On 14 September 2017 at 10:42, wrote: > val noTs = myData.map(dropTimestamp) > > val scaled = scaler.transform(noTs) > > val projected = (new RowMatrix(scaled)).multiply(principalComponents).rows > > val clusters = myModel.predict(projected) > > val result =

Re: RDD order preservation through transformations

2017-09-14 Thread Georg Heiler
Usually Spark ML models specify the columns they use for training, i.e. you would only select your feature columns (X) for model training, but metadata, i.e. target labels or your date column (y), would still be present for each row. wrote on Thu, 14 Sep 2017 at 10:42:
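A minimal spark.ml sketch (column names and k are assumptions) of this point: only the assembled feature columns are used for fitting, while the timestamp column simply rides along on each row, so predictions stay aligned with it:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2", "x3"))   // feature columns only; "ts" is deliberately left out
  .setOutputCol("features")

val assembled = assembler.transform(rawDf)    // rawDf keeps its "ts" column untouched
val model = new KMeans().setK(10).setFeaturesCol("features").fit(assembled)
val predictions = model.transform(assembled)  // each row still carries "ts" plus "prediction"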

RE: RDD order preservation through transformations

2017-09-14 Thread johan.grande.ext
In several situations I would like to zip RDDs knowing that their order matches. In particular I'm using an MLlib KMeansModel on an RDD of Vectors, so I would like to do: myData.zip(myModel.predict(myData)) Also the first column in my RDD is a timestamp which I don't want to be a part of the
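One way to avoid depending on row order, sketched under the assumption that a per-row transform can be applied element-wise (prepareVector is hypothetical, standing in for dropping the timestamp and scaling/projecting each row): carry an explicit index and join on it instead of zipping.

// Pair every row with a stable index before any transformation.
val indexed = myData.zipWithIndex().map(_.swap)                // (idx, originalRow)

// prepareVector is hypothetical: drop the timestamp and apply the same
// scaling/projection as in the original pipeline, one row at a time.
val features = indexed.mapValues(prepareVector)                // (idx, featureVector)
val clusters = features.mapValues(v => myModel.predict(v))     // (idx, clusterId)

// Join back on the index: no assumption about row order is needed.
val result = indexed.join(clusters)                            // (idx, (originalRow, clusterId))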

RE: RDD order preservation through transformations

2017-09-14 Thread johan.grande.ext
(Sorry Mehmet, I'm seeing just now your first reply with the link to SO; it had first gone to my spam folder :-/ ) On 2017-09-14 10:02 CEST, GRANDE Johan Ext DTSI/DSI wrote: Well if the order cannot be guaranteed in case of a failure (or at all since failure can happen transparently), what

RE: RDD order preservation through transformations

2017-09-14 Thread johan.grande.ext
Well if the order cannot be guaranteed in case of a failure (or at all, since failure can happen transparently), what does it mean to sort an RDD (method sortBy)? On 2017-09-14 03:36 CEST mehmet.su...@gmail.com wrote: I think it is one of the conceptual differences in Spark compared to other

Re: Should I use Dstream or Structured Stream to transfer data from source to sink and then back from sink to source?

2017-09-14 Thread Tathagata Das
Are you sure the code is correct? A Dataset does not have a method "trigger". Rather, I believe the correct code should be StreamingQuery query = resultDataSet.writeStream.trigger(ProcessingTime(1000)).format("kafka").start(); You can do all the same things you can do with Structured Streaming
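In Scala, a minimal sketch of the corrected call (the Kafka options and checkpoint path are assumptions; Trigger.ProcessingTime is the Spark 2.2+ spelling):

import org.apache.spark.sql.streaming.Trigger

// resultDataSet is assumed to expose a string or binary "value" column for the Kafka sink.
val query = resultDataSet.writeStream
  .trigger(Trigger.ProcessingTime("1 second"))
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "output")
  .option("checkpointLocation", "/tmp/checkpoints")
  .start()

query.awaitTermination()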

spark.streaming.receiver.maxRate

2017-09-14 Thread Margus Roo
Hi, Using Spark 2.1.1.2.6-1.0-129 (from the Hortonworks distro), Scala 2.11.8 and Java 1.8.0_60. I have a NiFi flow that produces more records than the Spark stream can process within the batch interval. To avoid overflowing the Spark queue I wanted to try Spark Streaming backpressure (it did not work for me), so back to the more
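For reference, the configuration knobs being discussed, in a minimal sketch (the numeric value is illustrative, not a recommendation):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")   // let Spark adapt the receiver rate dynamically
  .set("spark.streaming.receiver.maxRate", "1000")       // hard cap in records per second, per receiver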