[Help] Converting Python NumPy code into Spark using RDDs

2018-01-21 Thread Aakash Basu
Hi, How can I convert this Python NumPy code into Spark RDDs so that the operations leverage the Spark distributed architecture for Big Data? The code is as follows:

    def gini(array):
        """Calculate the Gini coefficient of a numpy array."""
        array = array.flatten()  # all values are treated
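
A minimal sketch of one way to express this on an RDD (not from the thread; it assumes non-negative values and uses the sorted-rank form of the Gini coefficient, G = 2*sum(i*x_i)/(n*sum(x)) - (n+1)/n):

    from pyspark import SparkContext

    sc = SparkContext(appName="gini")

    def gini_rdd(rdd):
        # rank each value via a global sort; zipWithIndex yields 0-based ranks
        indexed = rdd.sortBy(lambda x: x).zipWithIndex()  # (value, index)
        n = rdd.count()
        total = rdd.sum()
        # sum of rank * value, with 1-based ranks
        weighted = indexed.map(lambda vi: (vi[1] + 1) * vi[0]).sum()
        return (2.0 * weighted) / (n * total) - (n + 1.0) / n

    values = sc.parallelize([1.0, 2.0, 3.0, 4.0, 5.0])
    print(gini_rdd(values))  # ~0.267 for this sample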

Re: run spark job in yarn cluster mode as specified user

2018-01-21 Thread Margusja
Hi, One way to get it is to use the YARN configuration parameter yarn.nodemanager.container-executor.class. By default it is org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor; org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor gives you the user who executes the script. Br
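
A sketch of the corresponding yarn-site.xml entries (the group value is site-specific, and the LinuxContainerExecutor also requires the container-executor binary and its permissions to be set up on each node):

    <property>
      <name>yarn.nodemanager.container-executor.class</name>
      <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
    </property>
    <property>
      <!-- "hadoop" is just an example group -->
      <name>yarn.nodemanager.linux-container-executor.group</name>
      <value>hadoop</value>
    </property>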

Re: Has there been any explanation on the performance degradation between spark.ml and MLlib?

2018-01-21 Thread Nick Pentreath
At least one of their comparisons is flawed. The Spark ML version of linear regression (*note* they use linear regression and not logistic regression; it is not clear why) uses L-BFGS as the solver, not SGD (as MLlib uses). Hence it is typically going to be slower. However, it should in most
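
For reference, a small sketch of the two APIs being compared (parameter values are illustrative):

    # spark.ml DataFrame API: solver can be "l-bfgs", "normal", or "auto"
    from pyspark.ml.regression import LinearRegression
    lr = LinearRegression(maxIter=100, solver="l-bfgs")
    # model = lr.fit(training_df)  # training_df has "features" and "label" columns

    # spark.mllib RDD API: gradient-descent solver
    from pyspark.mllib.regression import LinearRegressionWithSGD
    # model = LinearRegressionWithSGD.train(training_rdd, iterations=100)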

run spark job in yarn cluster mode as specified user

2018-01-21 Thread sd wang
Hi Advisers, When submitting a Spark job in YARN cluster mode, the job is executed by the "yarn" user. Are there any parameters that can change the user? I tried setting HADOOP_USER_NAME but it did not work. I'm using Spark 2.2. Thanks for any help!

Is there any Spark ML or MLlib API for Gini for model evaluation? Please help! [EOM]

2018-01-21 Thread Aakash Basu
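One possible approach (a sketch, not a built-in metric): for a binary classifier, the Gini coefficient can be derived from the ROC AUC via the identity Gini = 2*AUC - 1:

    # Assumes predictions_df is a DataFrame with "rawPrediction" and "label" columns
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
    auc = evaluator.evaluate(predictions_df)
    gini = 2.0 * auc - 1.0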

Has there been any explanation on the performance degradation between spark.ml and MLlib?

2018-01-21 Thread Stephen Boesch
While MLlib performed favorably vs Flink, it *also* performed favorably vs spark.ml, and by an *order of magnitude*. The following is one of the tables; it is for Logistic Regression. At that time spark.ml did not yet support SVM. From: https://bdataanalytics.biomedcentral.com/articles/10.

Re: Processing huge amount of data from paged API

2018-01-21 Thread anonymous
The devices and device messages are retrieved using the APIs provided by company X (not the company's real name), which owns the IoT network. There is the option of setting HTTP POST callbacks for device messages, but we want to be able to run analytics on messages of ALL the devices of the

Re: Processing huge amount of data from paged API

2018-01-21 Thread Jörn Franke
Which device provides messages as thousands of HTTP pages? This is obviously inefficient, and it will not help much to run them in parallel. Furthermore, with paging you risk that messages get lost or that you get duplicate messages. I still do not get why nowadays applications download a lot of data

Processing huge amount of data from paged API

2018-01-21 Thread anonymous
Hello, I'm at an IoT company, and I have a use case for which I would like to know if Apache Spark could be helpful. It's a very broad question, and sorry if it's long-winded. We have HTTP GET APIs to get two kinds of information: 1) The Device Messages API returns data about device messages (in
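
A rough sketch of one way to fan paged GET requests out across a cluster (the endpoint, paging parameter, response shape, and page count are all hypothetical, and real code would need auth, retries, and rate limiting):

    import requests  # assumed to be installed on the executors
    from pyspark import SparkContext

    sc = SparkContext(appName="paged-fetch")
    NUM_PAGES = 1000  # hypothetical; would come from the API's paging metadata

    def fetch_page(page):
        # hypothetical endpoint and paging parameter
        resp = requests.get("https://api.example.com/device-messages",
                            params={"page": page})
        resp.raise_for_status()
        return resp.json()["messages"]  # hypothetical response shape

    # each partition fetches a slice of the pages in parallel
    messages = sc.parallelize(range(NUM_PAGES), numSlices=100).flatMap(fetch_page)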

Gracefully shut down a Spark Streaming application

2018-01-21 Thread KhajaAsmath Mohammed
Hi, Could anyone please share your thoughts on how to stop a Spark Streaming application gracefully? I followed the links http://why-not-learn-something.blogspot.in/2016/05/apache-spark-streaming-how-to-do.html and https://github.com/lanjiang/streamingstopgraceful . I played around with having either
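
For reference, a sketch of the two mechanisms those links build on (the marker-file or HTTP trigger around them is what the posts add):

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    # Option 1: drain in-flight batches when the JVM receives a shutdown signal
    conf = SparkConf().set("spark.streaming.stopGracefullyOnShutdown", "true")
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, batchDuration=10)

    # Option 2: stop explicitly from your own trigger (e.g. a marker-file check)
    # ssc.stop(stopSparkContext=True, stopGraceFully=True)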

Re: external shuffle service in mesos

2018-01-21 Thread igor.berman
Hi Susan, In general I can get what I need without Marathon, by configuring the external shuffle service with Puppet/Ansible/Chef plus maybe some alerts for checks. I mean, in companies that don't have strong DevOps teams and that want to install services as simply as possible, just by config - Marathon
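
For context, a sketch of the pieces involved (values are illustrative): the shuffle service runs on every Mesos agent, and jobs point at it through spark-defaults.conf:

    # started on each Mesos agent (the script ships in the Spark distribution)
    $SPARK_HOME/sbin/start-mesos-shuffle-service.sh

    # spark-defaults.conf for jobs that rely on the service
    spark.shuffle.service.enabled    true
    spark.dynamicAllocation.enabled  true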