[Help] Converting a Python Numpy code into Spark using RDD
Hi,

How can I convert this Python NumPy code into a Spark RDD so that the operations leverage Spark's distributed architecture for big data? The code is as follows:

    def gini(array):
        """Calculate the Gini coefficient of a numpy array."""
        array = array.flatten()  # all values are treated equally, arrays must be 1d
        if np.amin(array) < 0:
            array -= np.amin(array)  # values cannot be negative
        array += 0.001  # values cannot be 0
        array = np.sort(array)  # values must be sorted
        index = np.arange(1, array.shape[0] + 1)  # index per array element
        n = array.shape[0]  # number of array elements
        return (np.sum((2 * index - n - 1) * array)) / (n * np.sum(array))  # Gini coefficient

Thanks in advance,
Aakash.
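[Editor's note: not a posted answer — a sketch of how the same formula can be phrased so each step maps onto one RDD operation. It is written in plain Python so it runs without a cluster; the comments note the RDD equivalent of each line.]

```python
def gini(values):
    """Gini coefficient of a list of numbers, written so each step
    maps onto an RDD operation (noted in the comments)."""
    lo = min(values)                       # rdd.min()
    if lo < 0:
        values = [v - lo for v in values]  # rdd.map(...): values cannot be negative
    values = [v + 0.001 for v in values]   # rdd.map(...): values cannot be 0
    xs = sorted(values)                    # rdd.sortBy(lambda v: v)
    n = len(xs)                            # rdd.count()
    total = sum(xs)                        # rdd.sum()
    # enumerate is the stand-in for rdd.zipWithIndex() (0-based, hence i + 1)
    weighted = sum((2 * (i + 1) - n - 1) * x for i, x in enumerate(xs))
    return weighted / (n * total)
```

With an actual RDD the pipeline is roughly `rdd.sortBy(lambda v: v).zipWithIndex()` followed by a map over the (value, index) pairs and two sums; in Spark, `zipWithIndex` after a `sortBy` yields indices consistent with the global sort order, which is what the formula needs.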
Re: run spark job in yarn cluster mode as specified user
Hi,

One way to get it is to use the YARN configuration parameter yarn.nodemanager.container-executor.class. By default it is org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor; switching it to org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor gives you the user who executes the script.

Br,
Margus

> On 22 Jan 2018, at 09:28, sd wang wrote:
>
> Hi Advisers,
> When submitting a Spark job in yarn cluster mode, the job is executed by the
> "yarn" user. Are there any parameters that can change the user? I tried setting
> HADOOP_USER_NAME but it did not work. I'm using Spark 2.2.
> Thanks for any help!
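[Editor's note: for reference, the switch Margus describes is a yarn-site.xml setting. A sketch only — the group value is a placeholder for your environment, and LinuxContainerExecutor additionally needs OS-level setup (a setuid container-executor binary and container-executor.cfg) that this fragment does not cover.]

```xml
<!-- yarn-site.xml: run containers as the submitting user -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value> <!-- placeholder: the group the NodeManager runs under -->
</property>
```

A lighter-weight alternative, if the submitting account is configured as a Hadoop proxy user, is `spark-submit --proxy-user <name> ...`.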
Re: Has there been any explanation on the performance degradation between spark.ml and Mllib?
At least one of their comparisons is flawed. The Spark ML version of linear regression (*note* they use linear regression and not logistic regression; it is not clear why) uses L-BFGS as the solver, not SGD (as MLlib uses). Hence it is typically going to be slower. However, it should in most cases converge to a better solution. MLlib doesn't offer an L-BFGS version for linear regression, but it does for logistic regression. In my view a more sensible comparison would be LogReg with L-BFGS between ML and MLlib. These should be close to identical, since the MLlib version now actually wraps the ML version.

They also don't show any results for algorithm performance (accuracy, AUC, etc.). The better comparison to make is the run time to achieve the same AUC (for example). SGD may be fast, but it may result in a significantly poorer solution relative to, say, L-BFGS. Note that the "withSGD" algorithms are deprecated in MLlib partly to move users to ML, but also partly because their performance in terms of accuracy is relatively poor and the amount of tuning required (e.g. learning rates) is high.

They say:

    The time difference between Spark MLlib and Spark ML can be explained by internally transforming the dataset from DataFrame to RDD in order to use the same implementation of the algorithm present in MLlib.

but this is not true for the LR example. For the feature selection example, it is probably mostly due to the conversion, but even then the difference seems larger than what I would expect. It would be worth investigating their implementation to see if there are other potential underlying causes.

On Sun, 21 Jan 2018 at 23:49 Stephen Boesch wrote:

> While MLlib performed favorably vs Flink it *also* performed favorably vs
> spark.ml .. and by an *order of magnitude*. The following is one of the
> tables - it is for Logistic Regression.
> At that time spark.ML did not yet support SVM.
>
> From:
> https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-016-0020-2
>
> Table 3: LR learning time in seconds
>
> Dataset        Spark MLlib   Spark ML   Flink
> ECBDL14-10       3             26         181
> ECBDL14-30       5             63         815
> ECBDL14-50       6            173        1314
> ECBDL14-75       8            260        1878
> ECBDL14-100     12            415        2566
>
> The DataFrame-based API (spark.ml) is even slower vs the RDD API (mllib) than
> had been anticipated - yet the latter has been shut down for several
> versions of Spark already. What is the thought process behind that
> decision: *performance matters!* Is there visibility into a meaningful
> narrowing of that gap?
run spark job in yarn cluster mode as specified user
Hi Advisers,

When submitting a Spark job in yarn cluster mode, the job is executed by the "yarn" user. Are there any parameters that can change the user? I tried setting HADOOP_USER_NAME but it did not work. I'm using Spark 2.2.

Thanks for any help!
Is there any Spark ML or MLLib API for GINI for Model Evaluation? Please help! [EOM]
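[Editor's note: spark.ml's BinaryClassificationEvaluator and MLlib's BinaryClassificationMetrics expose area under the ROC curve rather than Gini directly, and the normalized Gini commonly used for model evaluation can be derived from it as Gini = 2*AUC - 1. A small self-contained illustration of that relationship, in pure Python so it runs without Spark:]

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via the Mann-Whitney statistic:
    the fraction of (positive, negative) score pairs ranked correctly,
    with ties counted as half a win."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def gini_from_auc(a):
    """Normalized Gini commonly reported for classifiers: 2*AUC - 1."""
    return 2 * a - 1
```

So with Spark you would take `areaUnderROC` from the evaluator and apply the `2 * auc - 1` transform yourself; a perfect ranker (AUC = 1.0) gives Gini 1.0 and a random one (AUC = 0.5) gives Gini 0.0.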
Has there been any explanation on the performance degradation between spark.ml and Mllib?
While MLlib performed favorably vs Flink it *also* performed favorably vs spark.ml .. and by an *order of magnitude*. The following is one of the tables - it is for Logistic Regression. At that time spark.ML did not yet support SVM.

From: https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-016-0020-2

Table 3: LR learning time in seconds

Dataset        Spark MLlib   Spark ML   Flink
ECBDL14-10       3             26         181
ECBDL14-30       5             63         815
ECBDL14-50       6            173        1314
ECBDL14-75       8            260        1878
ECBDL14-100     12            415        2566

The DataFrame-based API (spark.ml) is even slower vs the RDD API (mllib) than had been anticipated - yet the latter has been shut down for several versions of Spark already. What is the thought process behind that decision: *performance matters!* Is there visibility into a meaningful narrowing of that gap?
Re: Processing huge amount of data from paged API
The devices and device messages are retrieved using the APIs provided by company X (not the company's real name), which owns the IoT network. There is the option of setting HTTP POST callbacks for device messages, but we want to be able to run analytics on messages of ALL the devices of the network. Since the devices are owned by clients and we can't force all clients to set callbacks on their devices, our only option remains to use this GET API. In fact, there are other APIs by company X and many of them are paged. Of course, this is a major issue that should be addressed soon. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Processing huge amount of data from paged API
Which device provides messages as thousands of HTTP pages? This is obviously inefficient, and it will not help much to run them in parallel. Furthermore, with paging you risk that messages get lost or that you get duplicate messages. I still don't get why nowadays applications download a lot of data through services that provide a paging mechanism - it has failed in the past, it fails today, and it will fail in the future. Can't the device push data onto a bus, e.g. Kafka? Maybe via STOMP or similar? If in doubt, the device could prepare a file with all the measurements and make the file available through HTTP (this would of course be with resumable downloads).

> On 21. Jan 2018, at 21:33, anonymous wrote:
>
> Hello,
>
> I'm in an IoT company, and I have a use case for which I would like to know
> if Apache Spark could be helpful. It's a very broad question, and sorry if
> it's long winded.
>
> We have HTTP GET APIs to get two kinds of information:
> 1) The Device Messages API returns data about device messages (in JSON).
> 2) The Devices API returns information about devices (in JSON) -- for
> example, device name, device owner, etc. Each Device Message has a Device ID
> field, which points to the device which sent it.
> To make it clearer, we have devices, and each device can send many device
> messages.
> Our goal is to collect device messages and send them to an ElasticSearch index.
>
> The two major problems are:
> 1) We need data to be denormalized. That is, we don't want to have one index
> for device messages, and a separate index for device information -- we want
> each device message to have the corresponding device's information attached
> to it. That is because ElasticSearch works best with denormalized data. So,
> we would like a solution that can join (as in an SQL join) the Device
> Message data with the Device data and apply some transformations to them
> before sending it to ElasticSearch. We can potentially have millions of
> devices and device messages, so this solution needs to be scalable.
> 2) Both the Device Messages API and the Devices API are paged, and can
> potentially have thousands of pages. We can potentially have millions of
> devices and device messages. Making HTTP requests for thousands of pages can
> become inefficient. So, it would be good to have a way to parallelize this
> process.
>
> So, to be short, we would like a solution that can help with:
> 1) Joining and transforming large amounts of data (from a paged API) before
> sending it to ElasticSearch.
> 2) Making the process of sifting through all the pages in the paged APIs
> more efficient.
>
> Can Apache Spark help with all this?
>
> Thank you in advance.
Processing huge amount of data from paged API
Hello,

I'm in an IoT company, and I have a use case for which I would like to know if Apache Spark could be helpful. It's a very broad question, and sorry if it's long winded.

We have HTTP GET APIs to get two kinds of information:
1) The Device Messages API returns data about device messages (in JSON).
2) The Devices API returns information about devices (in JSON) -- for example, device name, device owner, etc. Each Device Message has a Device ID field, which points to the device which sent it.

To make it clearer, we have devices, and each device can send many device messages. Our goal is to collect device messages and send them to an ElasticSearch index.

The two major problems are:
1) We need data to be denormalized. That is, we don't want to have one index for device messages, and a separate index for device information -- we want each device message to have the corresponding device's information attached to it. That is because ElasticSearch works best with denormalized data. So, we would like a solution that can join (as in an SQL join) the Device Message data with the Device data and apply some transformations to them before sending it to ElasticSearch. We can potentially have millions of devices and device messages, so this solution needs to be scalable.
2) Both the Device Messages API and the Devices API are paged, and can potentially have thousands of pages. We can potentially have millions of devices and device messages. Making HTTP requests for thousands of pages can become inefficient. So, it would be good to have a way to parallelize this process.

So, to be short, we would like a solution that can help with:
1) Joining and transforming large amounts of data (from a paged API) before sending it to ElasticSearch.
2) Making the process of sifting through all the pages in the paged APIs more efficient.

Can Apache Spark help with all this?

Thank you in advance.
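[Editor's note: the usual way to attack point 2 is to distribute the page numbers themselves and fetch in parallel. A sketch of the shape of it using plain Python's concurrent.futures — `fetch_page` is a hypothetical stand-in for the vendor's paged GET call, not a real API. With Spark the equivalent would be roughly `sc.parallelize(range(1, num_pages + 1)).flatMap(fetch_page)`.]

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page):
    """Hypothetical stand-in for one paged GET request.
    Real code would call the vendor API here and return its JSON records."""
    return [{"page": page, "record": i} for i in range(2)]  # fake payload

def fetch_all(num_pages, workers=8):
    # Fan the page numbers out over a thread pool. HTTP calls are I/O-bound,
    # so threads (or Spark tasks) overlap the per-request latency.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = pool.map(fetch_page, range(1, num_pages + 1))  # order preserved
    return [rec for page in pages for rec in page]  # flatten page lists
```

The denormalization in point 1 then becomes an ordinary join on the Device ID between the two fetched datasets before writing to ElasticSearch.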
Gracefully shutdown spark streaming application
Hi,

Could anyone please share your thoughts on how to kill a Spark Streaming application gracefully? I followed these links:
http://why-not-learn-something.blogspot.in/2016/05/apache-spark-streaming-how-to-do.html
https://github.com/lanjiang/streamingstopgraceful

I played around with either setting the property or adding a marker file as in the GitHub link. The marker approach sometimes works successfully, and sometimes it gets stuck. Is there any reliable way to kill the application?

Thanks,
Asmath
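[Editor's note: the property referred to is Spark's built-in switch that makes a SIGTERM trigger a graceful stop of the streaming context. A config sketch; everything after the --conf line is your existing submit arguments, unchanged.]

```shell
spark-submit \
  --conf spark.streaming.stopGracefullyOnShutdown=true \
  ...   # your existing master/deploy-mode/jar arguments
```

With this set, stopping the driver process with a plain SIGTERM lets in-flight batches finish before exit. In the marker-file approach from the linked repo, the equivalent explicit call in PySpark is `ssc.stop(stopSparkContext=True, stopGraceFully=True)`; note that a graceful stop can appear to "hang" for up to the remaining batch-processing time, which may explain the stuck behavior observed.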
Re: external shuffle service in mesos
Hi Susan,

In general I can get what I need without Marathon, by configuring the external shuffle service with puppet/ansible/chef plus maybe some alerts for checks. I mean, in companies that don't have strong DevOps teams and want to install services as simply as possible just by config, Marathon might be useful. However, if a company already has a strong puppet/ansible/chef (or whatever) infra, the benefit of adding and managing Marathon as an additional component is less clear.

WDYT?