Re: Python vs. Scala

2017-09-05 Thread ayan guha
And I have just the opposite experience, i.e. I know Python but I see Scala demanded more in the market :) I think there are a few fair points on both sides, and where Scala wins: 1. Feature parity: definitely Scala wins, not only for new Spark features, but also if you intend to use 3rd-party connectors (such as Azure services).

Python vs. Scala

2017-09-05 Thread Adaryl Wakefield
Is there any performance difference in writing your application in Python vs. Scala? I’ve resisted learning Python because it’s an interpreted scripting language, but the market seems to be demanding Python skills. Adaryl "Bob" Wakefield, MBA Principal Mass Street Analytics, LLC 913.938.6685

Re: Spark 2.0.0 and Hive metastore

2017-09-05 Thread Dylan Wan
You can put the hive-site.xml in the $SPARK_HOME/conf directory. The spark.sql.warehouse.dir property controls where the data are located; for example, /home/myuser/spark-2.2.0/spark-warehouse as the location of the warehouse directory. ~Dylan On Tue, Aug 29, 2017 at 1:53 PM, Andrés Ivaldi
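As a sketch of what Dylan describes (the property name, value, and description come from the message; the XML wrapping is an assumption about how such a hive-site.xml would typically be laid out):

```xml
<!-- $SPARK_HOME/conf/hive-site.xml -->
<configuration>
  <property>
    <name>spark.sql.warehouse.dir</name>
    <value>/home/myuser/spark-2.2.0/spark-warehouse</value>
    <description>location of the warehouse directory</description>
  </property>
</configuration>
```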

Re: Spark 2.2 structured streaming with mapGroupsWithState + window functions

2017-09-05 Thread kant kodali
Hi Daniel, I am thinking you could use groupByKey & mapGroupsWithState to emit whatever updates ("updated state") you want and then use .groupBy(window). Will that work as expected? Thanks, Kant On Mon, Aug 28, 2017 at 7:06 AM, daniel williams wrote: > Hi all, > >
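A rough sketch of the shape Kant is suggesting, against Spark 2.2's structured streaming API. All class and field names (Event, DeviceState, deviceId, etc.) are made up for illustration, and whether a windowed aggregation can actually be chained after mapGroupsWithState is exactly the open question in the thread:

```scala
import java.sql.Timestamp
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(deviceId: String, ts: Timestamp, value: Long)
case class DeviceState(deviceId: String, ts: Timestamp, runningTotal: Long)

def updatedStates(events: Dataset[Event]): Dataset[DeviceState] = {
  import events.sparkSession.implicits._
  // Keep per-key state with mapGroupsWithState, emitting an
  // "updated state" record for each group on every trigger.
  events
    .groupByKey(_.deviceId)
    .mapGroupsWithState[DeviceState, DeviceState](GroupStateTimeout.NoTimeout) {
      (id: String, rows: Iterator[Event], state: GroupState[DeviceState]) =>
        val batch = rows.toSeq
        val prevTotal = state.getOption.map(_.runningTotal).getOrElse(0L)
        val next = DeviceState(id, batch.map(_.ts).max(Ordering.by[Timestamp, Long](_.getTime)),
          prevTotal + batch.map(_.value).sum)
        state.update(next)
        next
    }
}

// Then, per the suggestion, aggregate the emitted updates by window:
//   updatedStates(events).groupBy(window($"ts", "10 minutes"), $"deviceId").count()
// Note: Spark restricts which operations may follow mapGroupsWithState,
// so this composition may not be accepted in all output modes.
```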

Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithms

2017-09-05 Thread Bryan Cutler
Hi Prem, Spark actually does somewhat support different algorithms in CrossValidator, but it's not really obvious. You basically need to make a Pipeline and build a ParamGrid with different algorithms as stages. Here is a simple example: val dt = new DecisionTreeClassifier()
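A sketch of the trick Bryan describes: an empty Pipeline whose stages are supplied through the param grid, so each grid point swaps in a different algorithm. This extends the `val dt = ...` fragment in the message; the evaluator choice and fold count are assumptions:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.{DecisionTreeClassifier, LogisticRegression}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val dt = new DecisionTreeClassifier()
val lr = new LogisticRegression()

// The Pipeline's stages param itself becomes a grid axis,
// so cross-validation compares whole algorithms, not just hyperparameters.
val pipeline = new Pipeline()
val paramGrid = new ParamGridBuilder()
  .addGrid(pipeline.stages, Array(
    Array[PipelineStage](dt),
    Array[PipelineStage](lr)))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
// val best = cv.fit(trainingData)  // trainingData: a DataFrame with label/features columns
```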

Spark 2.1.1 with Kinesis Receivers is failing to launch 50 active receivers with an oversized cluster on EMR YARN

2017-09-05 Thread Mikhailau, Alex
Guys, I have a Spark 2.1.1 job with Kinesis where it is failing to launch 50 active receivers with an oversized cluster on EMR YARN. It sometimes registers 16, sometimes 32, other times 48 receivers, but never all 50. Any help would be greatly appreciated. Kinesis stream shards = 500 YARN EMR

RE: Problem with CSV line break data in PySpark 2.1.0

2017-09-05 Thread JG Perrin
Have you tried the built-in parser instead of the Databricks one (which is not really used anymore)? What does your original CSV look like? What does your code look like? There are quite a few options to read a CSV… From: Aakash Basu [mailto:aakash.spark@gmail.com] Sent: Sunday, September 03,
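A sketch of reading a CSV whose quoted fields contain embedded line breaks with the built-in parser. The path is hypothetical, and note the multiLine option exists in Spark 2.2+, while the thread is about 2.1.0, so upgrading (or pre-cleaning the file) may be necessary:

```scala
// Records may span physical lines when a quoted field contains "\n".
val df = spark.read
  .option("header", "true")
  .option("quote", "\"")
  .option("escape", "\"")        // doubled quotes inside fields
  .option("multiLine", "true")   // Spark 2.2+: allow records spanning lines
  .csv("/path/to/data.csv")      // hypothetical path
```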

Re: spark-jdbc impala with kerberos using yarn-client

2017-09-05 Thread morfious902002
I was able to query data from the Impala table. Here is my git repo for anyone who would like to check it out: https://github.com/morfious902002/impala-spark-jdbc-kerberos -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithms

2017-09-05 Thread Yanbo Liang
You are right, native Spark MLlib CrossValidator can't run *different* algorithms in parallel. Thanks Yanbo On Tue, Sep 5, 2017 at 10:56 PM, Timsina, Prem wrote: > Hi Yanbo, > > Thank you, I very much appreciate your help. > > For the current use case, the data can fit

Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithms

2017-09-05 Thread Timsina, Prem
Hi Yanbo, Thank you, I very much appreciate your help. For the current use case, the data can fit into a single node, so spark-sklearn seems to be a good choice. I have one question regarding this: “If no, Spark MLlib provides CrossValidator, which can run multiple machine learning algorithms

Re: SparkR 3rd-party library

2017-09-05 Thread Yanbo Liang
I guess you didn't install the R package `genalg` on all worker nodes. It is not a built-in package for base R, so you need to install it on all worker nodes manually or run `install.packages` inside your SparkR UDF. Regarding how to download third-party packages and install them inside of

Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithms

2017-09-05 Thread Yanbo Liang
Hi Prem, How large is your dataset? Can it fit in a single node? If not, Spark MLlib provides CrossValidator, which can run multiple machine learning algorithms in parallel on a distributed dataset and do parameter search. FYI: https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation
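A minimal sketch of the cross-validation Yanbo links to: one algorithm, a grid over its hyperparameters, evaluated over k folds on a distributed dataset. The particular estimator, grid values, and fold count here are illustrative assumptions:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()

// Grid over hyperparameters of a single algorithm; CrossValidator
// fits every grid point on each fold and picks the best by the evaluator.
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
// val model = cv.fit(trainingData)  // trainingData: DataFrame with label/features
```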

Re: Inconsistent results with combineByKey API

2017-09-05 Thread Swapnil Shinde
Ping. Can someone please confirm whether this is an issue or not? - Swapnil On Thu, Aug 31, 2017 at 12:27 PM, Swapnil Shinde wrote: > Hello All > > I am observing some strange results with the aggregateByKey API, which is > implemented with combineByKey. Not sure if

Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithms

2017-09-05 Thread Patrick McCarthy
You might benefit from watching this JIRA issue - https://issues.apache.org/jira/browse/SPARK-19071 On Sun, Sep 3, 2017 at 5:50 PM, Timsina, Prem wrote: > Is there a way to parallelize multiple ML algorithms in Spark. My use case > is something like this: > > A) Run

How to serialize or deserialize the SparkPlan (PhysicalPlan)?

2017-09-05 Thread aliwumi
Hi, I want to submit a SparkPlan (physical plan) to Spark for direct execution, so I want to know how to serialize or deserialize it, or how the SparkPlan is serialized and deserialized on the slaves in the Spark cluster? -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

How to serialize and deserialize the SparkPlan (Physical Plan)?

2017-09-05 Thread debugcool
Hi, I want to submit a SparkPlan (physical plan) to a Spark cluster for direct execution. How can it be serialized and deserialized? Or, how is the SparkPlan serialized and deserialized on the cluster slaves? -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

unsubscribe

2017-09-05 Thread Patrik Medvedev
unsubscribe

how to get Cache size from storage

2017-09-05 Thread Selvam Raman
Hi All, I have 100 GB of data (for my use case) that I am caching with MEMORY_AND_DISK. Is there any log available to find how much data is stored in memory and on disk for a running or completed application? I can see the cache in the UI under the Storage tab. So it should be available even after the job, where
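For a still-running application, one way to read the same numbers the Storage tab shows is from inside the driver. A sketch (getRDDStorageInfo is marked DeveloperApi; once the application has finished, the event logs / history server are the place to look instead):

```scala
// Print memory and disk usage of every cached RDD in the running app.
val sc = spark.sparkContext
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions} cached partitions, " +
    s"memory=${info.memSize} bytes, disk=${info.diskSize} bytes")
}
```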