Re: how to construct parameter for model.transform() from datafile

2017-03-14 Thread Yuhao Yang
Hi Jinhong, Based on the error message, your second collection of vectors has a dimension of 804202, while the dimension of your training vectors was 144109. So please make sure your test dataset is of the same dimension as the training data. From the test dataset you posted, the vector
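A minimal sketch of the kind of pre-flight check that catches this mismatch before transform() (the "features" column name and the helper itself are assumptions, not from the thread):

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.DataFrame

    // Verify every test vector matches the dimension the model was trained
    // on (e.g. 144109 here) before calling model.transform(testDf).
    def assertDimension(testDf: DataFrame, trainedDim: Int): Unit = {
      val dims = testDf.select("features").rdd
        .map(_.getAs[Vector](0).size)
        .distinct()
        .collect()
      require(dims.sameElements(Array(trainedDim)),
        s"expected dimension $trainedDim, found ${dims.mkString(", ")}")
    }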

Scaling Kafka Direct Streaming application

2017-03-14 Thread Pranav Shukla
How can we scale or possibly auto-scale a Spark Streaming application consuming from Kafka using direct streams? We are using Spark 1.6.3 and cannot move to 2.x unless there is a strong reason. Scenario: a Kafka topic with 10 partitions, a standalone cluster running on Kubernetes with 1 master and
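For reference, a sketch of the configuration knobs usually involved in sizing such a job (values are placeholders, not recommendations). Note that the direct stream creates one Spark partition per Kafka partition, so a 10-partition topic caps read parallelism at 10 unless the data is repartitioned after ingest:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kafka-direct-stream")
      .set("spark.cores.max", "20")                             // total cores on the standalone cluster
      .set("spark.executor.cores", "2")                         // cores per executor
      .set("spark.streaming.backpressure.enabled", "true")      // throttle ingest to the processing rate
      .set("spark.streaming.kafka.maxRatePerPartition", "1000") // cap records/sec per Kafka partition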

Re: spark streaming with kafka source, how many concurrent jobs?

2017-03-14 Thread Tathagata Das
This setting allows multiple Spark jobs generated through multiple foreachRDD calls to run concurrently, even if they are across batches. So output op2 from batch X can run concurrently with op1 of batch X+1. This is not safe because it breaks the checkpointing logic in subtle ways. Note that this was
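A minimal sketch of the pattern under discussion (the queue stream stands in for a real Kafka source); spark.streaming.concurrentJobs is an internal, undocumented setting, and as noted above values greater than 1 can break checkpoint-based recovery:

    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("concurrent-output-ops")
      .setMaster("local[4]")
      .set("spark.streaming.concurrentJobs", "2") // default is 1
    val ssc = new StreamingContext(conf, Seconds(10))

    val queue = new mutable.Queue[RDD[Int]]() // stand-in for a real Kafka stream
    val stream = ssc.queueStream(queue)
    stream.foreachRDD(rdd => println(s"op1: ${rdd.count()}")) // output op1
    stream.foreachRDD(rdd => println(s"op2: ${rdd.count()}")) // op2 of batch X may overlap op1 of batch X+1
    // ssc.start(); ssc.awaitTermination()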

Re: Setting Optimal Number of Spark Executor Instances

2017-03-14 Thread mohini kalamkar
Hi, try using this parameter: --conf spark.sql.shuffle.partitions=1000 Thanks, Mohini On Tue, Mar 14, 2017 at 3:30 PM, kpeng1 wrote: > Hi All, > > I am currently on Spark 1.6 and I was doing a SQL join on two tables that > are over 100 million rows each and I noticed that it
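The same setting applied from code on Spark 1.6, as a sketch (sqlContext, df1, df2, and the join key are assumed from the job). It controls the number of reduce-side partitions for shuffle joins and aggregations (default 200); it does not change the map-side task count:

    sqlContext.setConf("spark.sql.shuffle.partitions", "1000")
    val joined = df1.join(df2, df1("id") === df2("id")) // "id" is a placeholder key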

Setting Optimal Number of Spark Executor Instances

2017-03-14 Thread kpeng1
Hi All, I am currently on Spark 1.6 and I was doing a SQL join on two tables that are over 100 million rows each, and I noticed that it was spawning 3+ tasks (this is the progress meter that we are seeing). We tried coalesce, repartition, and shuffle partitions to drop the number of

Re: spark streaming with kafka source, how many concurrent jobs?

2017-03-14 Thread shyla deshpande
Thanks, TD, for the response. Could you please provide more explanation? I have multiple streams in the Spark Streaming application (Spark 2.0.2 using DStreams). I know many people using this setting, so your explanation will help a lot of people. Thanks On Fri, Mar 10, 2017 at 6:24 PM,

OffsetOutOfRangeException

2017-03-14 Thread Mohammad Kargar
To work around an out-of-space issue in a Direct Kafka Streaming application, we create topics with a low retention policy (retention.ms=30), which works fine from the Kafka perspective. However, this results in an OffsetOutOfRangeException in the Spark job (red line below). Is there any configuration in
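A sketch of where "auto.offset.reset" fits in the Spark 1.6 / Kafka 0.8 direct API (broker and topic names are placeholders). It only governs where a fresh stream starts when no offsets are stored; once the job is running, offsets deleted by retention still raise OffsetOutOfRangeException, so retention must stay longer than the batch and recovery window:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("direct-stream"), Seconds(30))
    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092", // placeholder broker
      "auto.offset.reset" -> "smallest")        // or "largest"; applies only on a fresh start
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))         // placeholder topic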

[MLlib] Multiple estimators for cross validation

2017-03-14 Thread David Leifker
I am hoping to open a discussion around cross-validation in mllib. I found that I often wanted to evaluate multiple estimators/pipelines (with different algorithms), or the same estimator with different parameter grids. The CrossValidator and TrainValidationSplit only allow a single estimator
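One workaround that fits the current API, sketched here as an assumption rather than a confirmed recommendation: wrap the candidate algorithms in a Pipeline and put the stages param itself on the grid, so a single CrossValidator compares different estimators in one run:

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.classification.{LogisticRegression, RandomForestClassifier}
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val lr = new LogisticRegression()
    val rf = new RandomForestClassifier()
    val pipeline = new Pipeline()

    // Each grid entry swaps in a different algorithm as the pipeline's stages.
    val grid = new ParamGridBuilder()
      .addGrid(pipeline.stages, Array(
        Array[PipelineStage](lr),
        Array[PipelineStage](rf)))
      .build()

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEstimatorParamMaps(grid)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setNumFolds(3)
    // val model = cv.fit(trainingDf) // assuming a labeled training DataFrame

Per-algorithm hyperparameter grids get awkward with this trick, which is part of the motivation for first-class support.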

Re: DataFrameWriter - Where to find list of Options applicable to particular format(datasource)

2017-03-14 Thread Nirav Patel
Thanks, Kwon. The goal is to preserve whitespace: not to alter data in general, or to do so only with user-provided options. It's causing our downstream jobs to fail. On Mon, Mar 13, 2017 at 7:23 PM, Hyukjin Kwon wrote: > Hi, all the options are documented in https://spark.apache.org/ >
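For reference, the general pattern, with df, the path, and the option values as placeholders: source-specific options are passed through .option() on the writer. For CSV, whitespace trimming is controlled by the ignore*WhiteSpace options, but whether the writer honors them depends on the Spark version, so verify against the docs for your release:

    df.write
      .format("csv")
      .option("ignoreLeadingWhiteSpace", "false")  // keep leading whitespace
      .option("ignoreTrailingWhiteSpace", "false") // keep trailing whitespace
      .save("/tmp/out") // placeholder path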

Re: [MLlib] kmeans random initialization, same seed every time

2017-03-14 Thread Julian Keppel
I'm sorry, I missed some important information. I use Spark version 2.0.2 with Scala 2.11.8. 2017-03-14 13:44 GMT+01:00 Julian Keppel : > Hi everybody, > > I am running some experiments with the Spark kmeans implementation of the new > DataFrame API. I compare clustering

[MLlib] kmeans random initialization, same seed every time

2017-03-14 Thread Julian Keppel
Hi everybody, I am running some experiments with the Spark kmeans implementation of the new DataFrame API. I compare clustering results of different runs with different parameters. I noticed that for the random initialization mode, the seed value is the same every time. How is it calculated? In my
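On the question of how the default is calculated: in ml, the shared seed param defaults to a hash of the implementing class's name, which is why it is identical across runs. A sketch of forcing a fresh initialization per run (the k value, fit call, and input column are assumptions):

    import scala.util.Random
    import org.apache.spark.ml.clustering.KMeans

    val kmeans = new KMeans()
      .setK(10)                   // placeholder k
      .setInitMode("random")
      .setSeed(Random.nextLong()) // different random initialization each run
    // val model = kmeans.fit(featuresDf) // assuming a DataFrame with a "features" column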

Re: Spark and continuous integration

2017-03-14 Thread Sam Elamin
Thank you both. Steve, that's a very interesting point. I have to admit I had never thought of doing analysis over time on the tests, but it makes sense, as the failures over time tell you quite a bit about your data platform. Thanks for highlighting! We are using PySpark for now, so I hope some

Re: Spark and continuous integration

2017-03-14 Thread Jörn Franke
I agree the reporting is an important aspect. Sonarqube (or a similar tool) can report over time, but it does not support Scala (well, indirectly via JaCoCo). In the end, you will need to think about a dashboard that displays results over time. > On 14 Mar 2017, at 12:44, Steve Loughran

Re: Spark and continuous integration

2017-03-14 Thread Steve Loughran
On 13 Mar 2017, at 13:24, Sam Elamin wrote: Hi Jörn, thanks for the prompt reply. Really we have 2 main concerns with CD: ensuring tests pass and linting the code. I'd add "providing diagnostics when tests fail", which is a

Re: Structured Streaming - Can I start using it?

2017-03-14 Thread Adline Dsilva
On 14 Mar 2017 4:19 p.m., Gaurav Pandya wrote: Thanks a lot, Michal & Ofir, for your insights. To Ofir - I have not yet finalized my Spark Streaming code; it is still a work in progress. Now that we have Structured Streaming available, I thought to rewrite it to gain maximum

Re: Structured Streaming - Can I start using it?

2017-03-14 Thread Gaurav Pandya
Thanks a lot, Michal & Ofir, for your insights. To Ofir - I have not yet finalized my Spark Streaming code; it is still a work in progress. Now that we have Structured Streaming available, I thought to rewrite it to gain maximum benefit in the future. As of now, there are no specific functional or

Re: FPGrowth Model is taking too long to generate frequent item sets

2017-03-14 Thread Raju Bairishetti
Hi Yuhao, I have tried numPartitions values of (numExecutors * numExecutorCores), 1000, 2000, and 1. I did not see much improvement. Having more partitions solved some perf issues, but I did not see any improvement when I set a lower minSupport. It is generating 260 million frequent itemsets with

Re: Structured Streaming - Can I start using it?

2017-03-14 Thread Ofir Manor
To add to what Michael said, my experience was that Structured Streaming in 2.0 was half-baked / alpha, but in 2.1 it is significantly more robust. Also, a lot of its "missing functionality" was not available in Spark Streaming either way. HOWEVER, you mentioned that you are thinking about rewriting your

Re: FPGrowth Model is taking too long to generate frequent item sets

2017-03-14 Thread Yuhao Yang
Hi Raju, Have you tried setNumPartitions with a larger number? 2017-03-07 0:30 GMT-08:00 Eli Super : > Hi > > It's an area of knowledge; you will need to read online for several hours about > it > > What is your programming language? > > Try searching online: "machine learning
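A sketch of the two knobs in question on the mllib FPGrowth API (the transactions input and the values are placeholders). minSupport prunes the search space, while numPartitions only spreads the conditional FP-trees; with a very low minSupport the itemset count explodes combinatorially, as the 260 million sets above show, and no partitioning setting makes that cheap:

    import org.apache.spark.mllib.fpm.FPGrowth
    import org.apache.spark.rdd.RDD

    def mine(transactions: RDD[Array[String]]): Unit = {
      val model = new FPGrowth()
        .setMinSupport(0.01)     // raise this before adding partitions
        .setNumPartitions(1000)  // roughly executors * cores, or a few times that
        .run(transactions)
      println(model.freqItemsets.count()) // itemset count grows fast as minSupport drops
    }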