[Help] Converting a Python Numpy code into Spark using RDD

2018-01-21 Thread Aakash Basu
Hi,

How can I convert this Python NumPy code into Spark RDD operations so that the
computation leverages Spark's distributed architecture for big data?

Code is as follows -

import numpy as np

def gini(array):
    """Calculate the Gini coefficient of a numpy array."""
    array = array.flatten()  # all values are treated equally, arrays must be 1d
    if np.amin(array) < 0:
        array -= np.amin(array)  # values cannot be negative
    array += 0.001  # values cannot be 0
    array = np.sort(array)  # values must be sorted
    index = np.arange(1, array.shape[0] + 1)  # index per array element
    n = array.shape[0]  # number of array elements
    return (np.sum((2 * index - n - 1) * array)) / (n * np.sum(array))  # Gini coefficient
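What I have in mind is something along these lines - an untested sketch, assuming
a SparkContext and an RDD of numeric values, using sortBy + zipWithIndex to
reproduce the sorted index (the toy data and app name are just placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="gini-rdd-sketch")
values = sc.parallelize([1.0, 2.0, 3.0, 4.0, 5.0])  # toy data for illustration

min_val = values.min()
if min_val < 0:
    values = values.map(lambda x: x - min_val)  # values cannot be negative
values = values.map(lambda x: x + 0.001)        # values cannot be 0

n = values.count()
total = values.sum()

# sortBy gives a globally sorted RDD; zipWithIndex then attaches the 0-based
# rank, which we shift to 1-based to mirror np.arange(1, n + 1)
ranked = values.sortBy(lambda x: x).zipWithIndex() \
               .map(lambda vi: (vi[1] + 1, vi[0]))  # (index, value)

numerator = ranked.map(lambda iv: (2 * iv[0] - n - 1) * iv[1]).sum()
gini = numerator / (n * total)
print(gini)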




Thanks in adv,
Aakash.


Re: run spark job in yarn cluster mode as specified user

2018-01-21 Thread Margusja
Hi

One way to get this is the YARN configuration parameter
yarn.nodemanager.container-executor.class.
By default it is
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.

org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor runs the
containers as the user who submitted the job.
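
For example, in yarn-site.xml (a sketch only; LinuxContainerExecutor also needs
the container-executor binary and container-executor.cfg set up on every
NodeManager, and the group value below is an assumption):

<!-- switch the NodeManager to LinuxContainerExecutor -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>
<!-- on a non-Kerberized cluster, containers otherwise run as the "nobody" user -->
<property>
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users</name>
  <value>false</value>
</property>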

Br
Margus



> On 22 Jan 2018, at 09:28, sd wang  wrote:
> 
> Hi Advisers,
> When submitting a Spark job in yarn cluster mode, the job is executed by the
> "yarn" user. Is there any parameter to change that user? I tried setting
> HADOOP_USER_NAME but it did not work. I'm using Spark 2.2.
> Thanks for any help!



Re: Has there been any explanation on the performance degradation between spark.ml and Mllib?

2018-01-21 Thread Nick Pentreath
At least one of their comparisons is flawed.

The Spark ML version of linear regression (*note* they use linear
regression and not logistic regression, it is not clear why) uses L-BFGS as
the solver, not SGD (as MLLIB uses). Hence it is typically going to be
slower. However, it should in most cases converge to a better solution.
MLLIB doesn't offer an L-BFGS version for linear regression, but it does
for logistic regression.

In my view a more sensible comparison would be LogReg with L-BFGS in ML vs in
MLLIB. These should be close to identical, since the MLLIB version now actually
wraps the ML version.
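
For instance, in PySpark the two L-BFGS-based versions can be run side by side
roughly like this (a sketch only; sample_libsvm_data.txt is the small example
file shipped with the Spark distribution, and the app name is made up):

from pyspark.sql import SparkSession
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.util import MLUtils
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("lr-lbfgs-compare").getOrCreate()

# RDD-based API (spark.mllib), L-BFGS solver
rdd = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mllib/sample_libsvm_data.txt")
mllib_model = LogisticRegressionWithLBFGS.train(rdd, iterations=100)

# DataFrame-based API (spark.ml), which also uses L-BFGS/OWL-QN under the hood
df = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
ml_model = LogisticRegression(maxIter=100).fit(df)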

They also don't show any results for algorithm performance (accuracy, AUC
etc). The better comparison to make is the run-time to achieve the same AUC
(for example). SGD may be fast, but it may result in a significantly poorer
solution relative to say L-BFGS.

Note that the "withSGD" algorithms are deprecated in MLLIB partly to move
users to ML, but also partly because their performance in terms of accuracy
is relatively poor and the amount of tuning required (e.g. learning rates)
is high.

They say:

The time difference between Spark MLlib and Spark ML can be explained by
internally transforming the dataset from DataFrame to RDD in order to use
the same implementation of the algorithm present in MLlib.

but this is not true for the LR example.

For the feature selection example, it is probably mostly due to the
conversion, but even then the difference seems larger than what I would
expect. It would be worth investigating their implementation to see if
there are other potential underlying causes.


On Sun, 21 Jan 2018 at 23:49 Stephen Boesch  wrote:

> While MLLib performed favorably vs Flink, it *also* performed favorably vs
> spark.ml - and by an *order of magnitude*. The following is one of the
> tables; it is for Logistic Regression. At that time spark.ML did not yet
> support SVM.
>
> From:
> https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-016-0020-2
>
>
>
> Table 3: LR learning time in seconds
>
> Dataset        Spark MLlib   Spark ML   Flink
> ECBDL14-10               3         26     181
> ECBDL14-30               5         63     815
> ECBDL14-50               6        173    1314
> ECBDL14-75               8        260    1878
> ECBDL14-100             12        415    2566
>
> The DataFrame-based API (spark.ml) is even slower relative to the RDD-based
> API (mllib) than had been anticipated - yet the latter has been in maintenance
> mode for several Spark versions already. What is the thought process behind
> that decision? *Performance matters!* Is there visibility into a meaningful
> narrowing of that gap?
>


run spark job in yarn cluster mode as specified user

2018-01-21 Thread sd wang
Hi Advisers,
When submitting a Spark job in yarn cluster mode, the job is executed by the
"yarn" user. Is there any parameter to change that user? I tried setting
HADOOP_USER_NAME but it did not work. I'm using Spark 2.2.
Thanks for any help!


Is there any Spark ML or MLLib API for GINI for Model Evaluation? Please help! [EOM]

2018-01-21 Thread Aakash Basu



Has there been any explanation on the performance degradation between spark.ml and Mllib?

2018-01-21 Thread Stephen Boesch
While MLLib performed favorably vs Flink, it *also* performed favorably vs
spark.ml - and by an *order of magnitude*. The following is one of the
tables; it is for Logistic Regression. At that time spark.ML did not yet
support SVM.

From: https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-016-0020-2



Table 3: LR learning time in seconds

Dataset        Spark MLlib   Spark ML   Flink
ECBDL14-10               3         26     181
ECBDL14-30               5         63     815
ECBDL14-50               6        173    1314
ECBDL14-75               8        260    1878
ECBDL14-100             12        415    2566

The DataFrame-based API (spark.ml) is even slower relative to the RDD-based
API (mllib) than had been anticipated - yet the latter has been in maintenance
mode for several Spark versions already. What is the thought process behind
that decision? *Performance matters!* Is there visibility into a meaningful
narrowing of that gap?


Re: Processing huge amount of data from paged API

2018-01-21 Thread anonymous
The devices and device messages are retrieved using the APIs provided by
company X (not the company's real name), which owns the IoT network.
There is the option of setting HTTP POST callbacks for device messages, but
we want to be able to run analytics on messages of ALL the devices of the
network. Since the devices are owned by clients and we can't force all
clients to set callbacks on their devices, our only option remains to use
this GET API.
In fact, there are other APIs by company X and many of them are paged. Of
course, this is a major issue that should be addressed soon.
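
To make the paging less painful, we are considering fanning the page requests
out across executors, roughly like this (a rough sketch only; the URL, page
count, page size and response shape are all made up):

import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("paged-api-fetch").getOrCreate()
sc = spark.sparkContext

NUM_PAGES = 5000                                      # assumed to be known up front
BASE_URL = "https://api.example.com/device-messages"  # placeholder URL

def fetch_page(page):
    # each task fetches one page over HTTP GET and returns its records
    resp = requests.get(BASE_URL, params={"page": page, "limit": 100}, timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]

# distribute the page numbers, fetch in parallel, flatten into one RDD of records
messages = sc.parallelize(range(NUM_PAGES), numSlices=200).flatMap(fetch_page)
print(messages.count())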






Re: Processing huge amount of data from paged API

2018-01-21 Thread Jörn Franke
Which device provides messages as thousands of HTTP pages? This is obviously
inefficient, and it will not help much to run the requests in parallel.
Furthermore, with paging you risk that messages get lost or that you get
duplicate messages. I still don't get why applications nowadays download large
amounts of data through services that provide a paging mechanism - it has
failed in the past, it fails today, and it will fail in the future.

Can't the devices push data onto a bus, e.g. Kafka? Maybe via STOMP or similar?
If in doubt, the device could prepare a file with all the measurements and make
the file available through HTTP (with resumable downloads, of course).

> On 21. Jan 2018, at 21:33, anonymous  wrote:
> 
> Hello,
> 
> I'm in an IoT company, and I have a use case for which I would like to know
> if Apache Spark could be helpful. It's a very broad question, and sorry if
> it's long winded.
> 
> We have HTTP GET APIs to get two kinds of information:
> 1) The Device Messages API returns data about device messages (in JSON).
> 2) The Devices API returns information about devices (in JSON) -- for
> example, device name, device owner, etc. Each Device Message has a Device ID
> field, which points to the device which sent it.
> To make it clearer, we have devices, and each device can send many device
> messages.
> Our goal is to fetch device messages and send them to an ElasticSearch index.
> 
> The two major problems are:
> 1) We need data to be denormalized. That is, we don't want to have one index
> for device messages, and a separate index for device information -- we want
> each device message to have the corresponding device's information attached
> to it. That is because ElasticSearch works best with denormalized data. So,
> we would like a solution that can join (as in an SQL join) the Device
> Message data with the Device data and apply some transformations to them
> before sending it to ElasticSearch. We can potentially have millions of
> devices and device messages, so this solution needs to be scalable.
> 2) Both the Device Messages API and the Devices API are paged, and can
> potentially have thousands of pages. We can potentially have millions of
> devices and device messages. Making HTTP requests for thousands of pages can
> become inefficient. So, it would be good to have a way to parallelize this
> process.
> 
> So, to be short, we would like a solution that can help with:
> 1) Joining and transforming large amounts of data (from a paged API) before
> sending it to ElasticSearch.
> 2) Making the process of sifting through all the pages in the paged APIs
> more efficient.
> 
> Can Apache Spark help with all this?
> 
> Thank you in advance.
> 
> 
> 
> 




Processing huge amount of data from paged API

2018-01-21 Thread anonymous
Hello,

I'm in an IoT company, and I have a use case for which I would like to know
if Apache Spark could be helpful. It's a very broad question, and sorry if
it's long winded.

We have HTTP GET APIs to get two kinds of information:
1) The Device Messages API returns data about device messages (in JSON).
2) The Devices API returns information about devices (in JSON) -- for
example, device name, device owner, etc. Each Device Message has a Device ID
field, which points to the device which sent it.
To make it clearer, we have devices, and each device can send many device
messages.
Our goal is to fetch device messages and send them to an ElasticSearch index.

The two major problems are:
1) We need data to be denormalized. That is, we don't want to have one index
for device messages, and a separate index for device information -- we want
each device message to have the corresponding device's information attached
to it. That is because ElasticSearch works best with denormalized data. So,
we would like a solution that can join (as in an SQL join) the Device
Message data with the Device data and apply some transformations to them
before sending it to ElasticSearch. We can potentially have millions of
devices and device messages, so this solution needs to be scalable.
2) Both the Device Messages API and the Devices API are paged, and can
potentially have thousands of pages. We can potentially have millions of
devices and device messages. Making HTTP requests for thousands of pages can
become inefficient. So, it would be good to have a way to parallelize this
process.

So, to be short, we would like a solution that can help with:
1) Joining and transforming large amounts of data (from a paged API) before
sending it to ElasticSearch.
2) Making the process of sifting through all the pages in the paged APIs
more efficient.

Can Apache Spark help with all this?
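
For concreteness, the join-and-write step we have in mind would look roughly
like this (a sketch only; the file paths, column names, index name and the
elasticsearch-hadoop connector settings are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("denormalize-device-messages").getOrCreate()

# assume the paged APIs have already been pulled down into two JSON datasets
devices = spark.read.json("devices.json")            # device_id, device_name, device_owner, ...
messages = spark.read.json("device_messages.json")   # message_id, device_id, payload, ...

# attach each message's device information (SQL-style join) before indexing
denormalized = messages.join(devices, on="device_id", how="left")

# write the denormalized records to Elasticsearch via the elasticsearch-hadoop connector
(denormalized.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost:9200")
    .mode("append")
    .save("device_messages/doc"))   # index/type, hypothetical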

Thank you in advance.






Gracefully shutdown spark streaming application

2018-01-21 Thread KhajaAsmath Mohammed
Hi,

Could anyone please share your thoughts on how to stop a Spark Streaming
application gracefully?

I followed
http://why-not-learn-something.blogspot.in/2016/05/apache-spark-streaming-how-to-do.html
and
https://github.com/lanjiang/streamingstopgraceful

I played around with both the property-based approach and the marker-file
approach from the GitHub link. The marker file works sometimes, but it also
gets stuck sometimes.

Is there a reliable way to stop the application?
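
For what it's worth, this is roughly the marker-file pattern I am using (a
sketch only; the marker path, batch interval and queue-stream source are
placeholders for the real job):

import os
import time
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="graceful-shutdown-demo")
ssc = StreamingContext(sc, batchDuration=10)

# placeholder source so the sketch runs; the real job defines its own DStreams
stream = ssc.queueStream([sc.parallelize([1, 2, 3])])
stream.count().pprint()

MARKER = "/tmp/stop_streaming_job"   # hypothetical marker file (local FS here)

ssc.start()
while True:
    if os.path.exists(MARKER):
        # stopGraceFully=True waits for received/queued batches to finish
        ssc.stop(stopSparkContext=True, stopGraceFully=True)
        break
    time.sleep(30)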

Thanks,
Asmath


Re: external shuffle service in mesos

2018-01-21 Thread igor.berman
Hi Susan

In general I can get what I need without Marathon, by configuring the
external shuffle service with puppet/ansible/chef plus maybe some alerts for
health checks.

I mean, in companies that don't have strong DevOps teams and want to install
services as simply as possible, just by config, Marathon might be useful.
However, if a company already has solid puppet/ansible/chef infrastructure,
adding and managing Marathon as an extra component is less clearly worthwhile.

WDYT?


