Re: How can we control CPU and Memory per Spark job operation..

2016-07-16 Thread Pedro Rodriguez
You could call map on an RDD which has “many” partitions, then call repartition/coalesce to drastically reduce the number of partitions so that your second map job has fewer things running. — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data Scientist
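A minimal Scala sketch of this pattern (the input path and the two map functions, parseRecord and heavyWork, are hypothetical placeholders):

  // map1 runs with the input's full parallelism
  val mapped = sc.textFile("hdfs:///path/to/input").map(line => parseRecord(line))
  // coalesce shrinks to 8 partitions without a shuffle, so map2 runs with far fewer concurrent tasks
  val result = mapped.coalesce(8).map(rec => heavyWork(rec))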

Re: How can we control CPU and Memory per Spark job operation..

2016-07-16 Thread Jacek Laskowski
Hi, My understanding is that these two map functions will end up as a job with one stage (as if you wrote the two maps as a single map), so you really need as many vcores and as much memory as possible for map1 and map2. I initially thought about dynamic allocation of executors that may or may not help
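If dynamic allocation is what ends up being tried, a minimal SparkConf sketch (the executor counts are placeholders, and the external shuffle service must be running on the workers):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.dynamicAllocation.minExecutors", "2")
    .set("spark.dynamicAllocation.maxExecutors", "20")
    .set("spark.shuffle.service.enabled", "true")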

Re: Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-16 Thread Tobi Bosede
Hi Yanbo, Appreciate the response. I might not have phrased this correctly, but I really wanted to know how to convert the pipeline rdd into a data frame. I have seen the example you posted. However, I need to transform all my data, not just 1 line. So I did successfully use map to use the chisq
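For the RDD-to-DataFrame step itself, a minimal Scala sketch (filteredRdd is a hypothetical RDD[LabeledPoint] produced by that map step; the same toDF/createDataFrame idea applies in PySpark):

  import org.apache.spark.mllib.regression.LabeledPoint
  import sqlContext.implicits._

  val df = filteredRdd.toDF("label", "features")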

Fwd: File to read sharded (2 levels) parquet files

2016-07-16 Thread Pei Sun
Hi Spark experts, spark version: 2.0.0-preview, hadoop version: 2.4, 2.7 (all tried, none works) The data is in parquet format and stored in hdfs: /root/file/partition1/file-xxx.parquet /root/file/partition2/file-xxx.parquet Then I did:

Re: standalone mode only supports FIFO scheduler across applications ? still in spark 2.0 time ?

2016-07-16 Thread Teng Qiu
Hi Mark, thanks, we just want to keep our system as simple as possible. Using YARN means we need to maintain a full-size Hadoop cluster; we are using S3 as the storage layer, so HDFS is not needed and a Hadoop cluster is a bit overkill. Mesos is an option, but still, it brings extra operation

High availability with Spark

2016-07-16 Thread KhajaAsmath Mohammed
Hi, could you please share your thoughts if anyone has ideas on the topics below. - How to achieve high availability with a Spark cluster? I have referred to the link https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/exercises/spark-exercise-standalone-master-ha.html
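For standalone-master HA specifically, the documented approach is ZooKeeper-based recovery, configured on each master in spark-env.sh; a minimal sketch with placeholder ZooKeeper hosts:

  export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181 -Dspark.deploy.zookeeper.dir=/spark"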

Re: Latest 200 messages per topic

2016-07-16 Thread Rabin Banerjee
Just to add, I want to read the MAX_OFFSET of a topic, then read MAX_OFFSET-200, every time. Also I want to know, if I want to fetch a specific offset range for batch processing, is there any option for doing that? On Sat, Jul 16, 2016 at 9:08 PM, Rabin Banerjee <
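For the batch part of the question, the direct API does let you read an explicit offset range with KafkaUtils.createRDD; a minimal Scala sketch (the broker, topic, partition, and an already-fetched maxOffset value are placeholders):

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
  // read exactly the last 200 offsets of partition 0 of "my-topic"
  val ranges = Array(OffsetRange("my-topic", 0, maxOffset - 200, maxOffset))
  val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](sc, kafkaParams, ranges)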

Latest 200 messages per topic

2016-07-16 Thread Rabin Banerjee
Hi All, I have 1000 Kafka topics, each storing messages for different devices. I want to use the direct approach for connecting to Kafka from Spark, in which I am only interested in the latest 200 messages in Kafka. How do I do that? Thanks.

Re: Feature importance IN random forest

2016-07-16 Thread Yanbo Liang
Spark 1.5 only supports getting feature importance for RandomForestClassificationModel and RandomForestRegressionModel via Scala. This feature is supported in PySpark as of 2.0.0. It's very straightforward with a few lines of code. rf = RandomForestClassifier(numTrees=3, maxDepth=2,
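For the Scala path mentioned above, a minimal sketch (trainingDf is a hypothetical DataFrame with "label" and "features" columns):

  import org.apache.spark.ml.classification.RandomForestClassifier

  val rf = new RandomForestClassifier().setNumTrees(3).setMaxDepth(2)
  val model = rf.fit(trainingDf)
  println(model.featureImportances)  // a Vector of per-feature importances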

Re: bisecting kmeans model tree

2016-07-16 Thread Yanbo Liang
Currently we do not expose the APIs to get the Bisecting KMeans tree structure; they are private within the ml.clustering package scope. But I think we should make a plan to expose these APIs, like what we did for Decision Tree. Thanks Yanbo 2016-07-12 11:45 GMT-07:00 roni : >

Re: Dense Vectors outputs in feature engineering

2016-07-16 Thread Yanbo Liang
Since you use two steps (StringIndexer and OneHotEncoder) to encode categories to a Vector, I guess you want to decode the eventual vector back into its original categories. Suppose you have a DataFrame with only one column named "name", with three categories: "b", "a", "c" (ranked by frequency).
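A minimal Scala sketch of the encode/decode round trip (column names are hypothetical; IndexToString reverses the StringIndexer step, which is usually the easiest way back to the original category):

  import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder, IndexToString}

  val indexer = new StringIndexer().setInputCol("name").setOutputCol("nameIndex").fit(df)
  val indexed = indexer.transform(df)
  val encoded = new OneHotEncoder().setInputCol("nameIndex").setOutputCol("nameVec").transform(indexed)
  // map the numeric index back to the original string category
  val decoded = new IndexToString().setInputCol("nameIndex").setOutputCol("nameDecoded")
    .setLabels(indexer.labels).transform(encoded)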

Re: Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-16 Thread Yanbo Liang
Hi Tobi, The MLlib RDD-based API does support applying the transformation on both a Vector and an RDD, but you did not use it the appropriate way. Suppose you have an RDD with a LabeledPoint in each line; you can refer to the following code snippet to train a ChiSqSelectorModel and do the transformation:
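The referenced snippet is cut off in this digest; a minimal sketch of that kind of usage, assuming data is an RDD[LabeledPoint] and keeping the top 50 features:

  import org.apache.spark.mllib.feature.ChiSqSelector
  import org.apache.spark.mllib.regression.LabeledPoint

  val model = new ChiSqSelector(50).fit(data)
  // apply the fitted selector to every row, not just one line
  val filtered = data.map(lp => LabeledPoint(lp.label, model.transform(lp.features)))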

Re: QuantileDiscretizer not working properly with big dataframes

2016-07-16 Thread Yanbo Liang
Could you tell us the Spark version you used? We have fixed this bug in Spark 1.6.2 and Spark 2.0; please upgrade to one of these versions and retry. If this issue still exists, please let us know. Thanks Yanbo 2016-07-12 11:03 GMT-07:00 Pasquinell Urbani < pasquinell.urb...@exalitica.com>: > In the

Re: Spark streaming takes longer time to read json into dataframes

2016-07-16 Thread Martin Eden
Hi, I would just do a repartition on the initial direct DStream, since otherwise each RDD in the stream has exactly as many partitions as you have partitions in the Kafka topic (in your case 1). That way, receiving is still done in only 1 thread, but at least the processing further down is done in
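A minimal Scala sketch of that repartition step (ssc, kafkaParams, and the topic name are assumed to already exist):

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("my-topic"))
  // one Kafka partition means one receiving task; repartition spreads the downstream work
  val repartitioned = stream.repartition(8)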

Re: How to convert from DataFrame to Dataset[Row]?

2016-07-16 Thread Sun Rui
For Spark 1.6.x, a DataFrame can't be directly converted to a Dataset[Row], but it can be done indirectly as follows: import org.apache.spark.sql.catalyst.encoders.RowEncoder // assume df is a DataFrame implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema) val ds = df.as[Row] However,
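Laid out on separate lines, with the imports it needs, the approach in the message looks roughly like this (df is an existing DataFrame):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}

  implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)
  val ds = df.as[Row]  // Dataset[Row]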

Re: Spark Streaming - Best Practices to handle multiple datapoints arriving at different time interval

2016-07-16 Thread Daniel Haviv
Or use mapWithState. Thank you. Daniel > On 16 Jul 2016, at 03:02, RK Aduri wrote: > > You can probably define sliding windows and set larger batch intervals.
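A minimal sketch of the mapWithState suggestion (key/value types and the update logic are hypothetical; note that mapWithState requires checkpointing to be enabled on the StreamingContext):

  import org.apache.spark.streaming.{State, StateSpec}

  // keep the latest reading per device key as datapoints arrive at different times
  def updateReading(key: String, value: Option[Double], state: State[Double]): Option[(String, Double)] = {
    value.foreach(v => state.update(v))
    Some((key, state.getOption.getOrElse(Double.NaN)))
  }

  // keyedStream is an assumed DStream[(String, Double)]
  val stateful = keyedStream.mapWithState(StateSpec.function(updateReading _))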