Re: How to use ManualClock with Spark streaming

2017-03-20 Thread ??????????
Hi Hemalatha, you can use time windows; it looks like df.groupBy(window('timestamp', '20 seconds', '10 seconds')). ---Original--- From: "Saisai Shao" Date: 2017/3/1 09:39:58 To: "Hemalatha A"; Cc: "spark
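
A minimal sketch of this suggestion, assuming a DataFrame df with a "timestamp" column (the DataFrame and column names are placeholders, not from the original thread):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.window

    val spark = SparkSession.builder().appName("time-windows").getOrCreate()
    import spark.implicits._

    // Count events per 20-second window, sliding every 10 seconds.
    val windowed = df
      .groupBy(window($"timestamp", "20 seconds", "10 seconds"))
      .count()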

Re: How does preprocessing fit into Spark MLlib pipeline

2017-03-20 Thread Yan Facai
SQLTransformer is a good solution if all operations can be expressed in SQL. By the way, if you'd like to get your hands dirty, writing a Transformer in Scala is not hard, and multiple output columns are valid in that case. On Fri, Mar 17, 2017 at 9:10 PM, Yanbo Liang wrote: > Hi
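
For reference, a short SQLTransformer sketch along these lines (the column names v1 and v2 are placeholders; the statement pattern follows the ML feature docs, where __THIS__ stands for the input dataset):

    import org.apache.spark.ml.feature.SQLTransformer

    // A single SQL statement can emit multiple output columns.
    val sqlTrans = new SQLTransformer().setStatement(
      "SELECT *, v1 + v2 AS sum, v1 * v2 AS product FROM __THIS__")

    val transformed = sqlTrans.transform(dataset) // dataset: an assumed input DataFrame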

Re: how to retain part of the features in LogisticRegressionModel (spark2.0)

2017-03-20 Thread Yan Facai
Hi, jinhong. Do you use `setRegParam`, which is 0.0 by default? Both elasticNetParam and regParam are required if regularization is needed: val regParamL1 = $(elasticNetParam) * $(regParam); val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam). On Mon, Mar 20, 2017 at 6:31 PM, Yanbo Liang
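
A sketch of setting both parameters together; the numeric values are examples only:

    import org.apache.spark.ml.classification.LogisticRegression

    // regParam defaults to 0.0, so elasticNetParam alone has no effect:
    //   L1 strength = elasticNetParam * regParam
    //   L2 strength = (1 - elasticNetParam) * regParam
    val lr = new LogisticRegression()
      .setRegParam(0.1)        // overall regularization strength
      .setElasticNetParam(1.0) // 1.0 = pure L1, 0.0 = pure L2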

worker connected to standalone cluster are continuously crashing

2017-03-20 Thread Diego Fanesi
Hello everybody, I configured a simple standalone cluster with a few machines and I am trying to submit a very simple job just to test the cluster. My laptop is the client and one of the workers; my server contains the master and the second worker. If I submit my job just executing the Scala code

Re: Contributing to Spark

2017-03-20 Thread cht liu
Hi Sam, A great way to contribute to Spark is to help answer user questions on the user@spark.apache.org mailing list or on StackOverflow. 2017-03-20 11:50 GMT+08:00 Nick Pentreath : > If you have experience and interest in Python then PySpark is a good area > to look

Re: spark streaming exectors memory increasing and executor killed by yarn

2017-03-20 Thread darin
This issue on StackOverflow may help: https://stackoverflow.com/questions/42641573/why-does-memory-usage-of-spark-worker-increases-with-time/42642233#42642233

Re: Spark 2.0.2 Dataset union() slowness vs RDD union?

2017-03-20 Thread Everett Anderson
Closing the loop on this -- It appears we were just hitting some other problem related to S3A/S3, likely that the temporary directory used by the S3A Hadoop file system implementation for buffering data during upload either was full or had the wrong permissions. On Thu, Mar 16, 2017 at 6:03
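
If the buffer directory turns out to be the culprit for someone else, one hedged workaround is to point S3A's local upload buffer at a volume with enough space and the right permissions; the path below is an example only:

    import org.apache.spark.sql.SparkSession

    // fs.s3a.buffer.dir controls where S3A buffers uploads on local disk.
    val spark = SparkSession.builder()
      .config("spark.hadoop.fs.s3a.buffer.dir", "/mnt/large-volume/s3a")
      .getOrCreate()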

Recombining output files in parallel

2017-03-20 Thread Matt Deaver
I have a Spark job that processes incremental data and partitions it by customer id. Some customers have very little data, and I have another job that takes a previous period's data and combines it. However, the job runs serially and I'd basically like to run the function on every partition
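
One possible sketch, not the poster's actual job: coalesce each customer's partition into a single file, and run customers in parallel on the driver with a parallel collection (paths, ids, and format are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val customerIds = Seq("c001", "c002", "c003") // assumed list of partition values

    customerIds.par.foreach { id =>
      spark.read.parquet(s"s3://bucket/incremental/customer_id=$id")
        .coalesce(1) // combine the small files for this customer
        .write.mode("overwrite")
        .parquet(s"s3://bucket/combined/customer_id=$id")
    }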

Re: Spark Streaming from Kafka, deal with initial heavy load.

2017-03-20 Thread Cody Koeninger
You want spark.streaming.kafka.maxRatePerPartition for the direct stream. On Sat, Mar 18, 2017 at 3:37 PM, Mal Edwin wrote: > > Hi, > You can enable backpressure to handle this. > > spark.streaming.backpressure.enabled > spark.streaming.receiver.maxRate > > Thanks, >
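
A sketch of both settings on a SparkConf; the rate value is an example:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Direct stream: cap records consumed per Kafka partition per second,
      // which bounds the size of the initial batches.
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")
      // Optionally let Spark tune the rate after the first batches complete.
      .set("spark.streaming.backpressure.enabled", "true")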

Re: how to retain part of the features in LogisticRegressionModel (spark2.0)

2017-03-20 Thread Yanbo Liang
Do you want to get a sparse model where most of the coefficients are zero? If yes, using L1 regularization leads to sparsity. But the LogisticRegressionModel coefficients vector's size is still equal to the number of features; you can get the non-zero elements manually. Actually, it would be a
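
A sketch of picking out the non-zero coefficients, where model is assumed to be a fitted LogisticRegressionModel:

    // The coefficients vector keeps one slot per feature; with L1 most
    // entries are zero, and the active ones can be selected by index.
    val nonZero = model.coefficients.toArray.zipWithIndex
      .filter { case (w, _) => w != 0.0 }
      .map { case (w, i) => (i, w) } // (feature index, weight)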

Re: Foreachpartition in spark streaming

2017-03-20 Thread Ryan
foreachPartition is an action but runs on each worker, which means you won't see anything on the driver. mapPartitions is a transformation, which is lazy and won't do anything until an action is called. Which is better depends on the specific use case. To output something (like a print on a single machine) you could
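
A sketch illustrating the difference, where rdd is an assumed RDD:

    // foreachPartition: an action; side effects run on the executors,
    // so println output appears in executor logs, not on the driver.
    rdd.foreachPartition { iter =>
      iter.foreach(x => println(x))
    }

    // mapPartitions: a lazy transformation; nothing runs until an action.
    val mapped = rdd.mapPartitions { iter =>
      iter.map(x => x.toString)
    }
    mapped.count() // this action triggers the computation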

Re: Issues: Generate JSON with null values in Spark 2.0.x

2017-03-20 Thread Chetan Khatri
Exactly. On Sat, Mar 11, 2017 at 1:35 PM, Dongjin Lee wrote: > Hello Chetan, > > Could you post some code? If I understood correctly, you are trying to > save JSON like: > > { > "first_name": "Dongjin", > "last_name": null > } > > not in omitted form, like: > > { >
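
A quick way to observe the omission being discussed, assuming a SparkSession with implicits in scope:

    import spark.implicits._

    val df = Seq(("Dongjin", None: Option[String])).toDF("first_name", "last_name")
    // Spark's JSON output drops null-valued fields rather than writing them:
    df.toJSON.show(false)
    // {"first_name":"Dongjin"}   <- last_name is omitted, not written as null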

Foreachpartition in spark streaming

2017-03-20 Thread Diwakar Dhanuskodi
Just wanted to clarify!!! Is foreachPartition in Spark an output operation? Which one is better to use, mapPartitions or foreachPartition? Regards Diwakar