[Spark] Expose Spark to Untrusted Users?

2017-10-02 Thread Jack Leadford
Hello, I would like to expose Apache Spark to untrusted users (through Livy, and with a direct JDBC connection). However, there appear to be a variety of avenues wherein one of these untrusted users can execute arbitrary code (by design): PySpark, SparkR, Jar uploads, various UDFs, etc. I

Re: Example of GBTClassifier

2017-10-02 Thread Weichen Xu
It should be an Eclipse issue. The method is there, in the superclass `Predictor`. On Mon, Oct 2, 2017 at 11:51 PM, mckunkel wrote: > Greetings, > I am trying to run the example in the example directory for the > GBTClassifier. But when I view this code in eclipse, I get an

Quick one... AWS SDK version?

2017-10-02 Thread JG Perrin
Hey Sparkians, What version of the AWS Java SDK do you use with Spark 2.2? Do you stick with the Hadoop 2.7.3 libs? Thanks! jg

Re: PySpark - Expand rows into dataframes via function

2017-10-02 Thread Sathish Kumaran Vairavelu
It's possible with the array function combined with the struct constructor. Below is a SQL example: select Array(struct(ip1,hashkey), struct(ip2,hashkey)) from (select substr(col1,1,2) as ip1, substr(col1,3,3) as ip2, etc, hashkey from object) a. If you want dynamic IP ranges, you need to dynamically
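
A minimal sketch of the array/struct approach described above, run through Spark SQL from PySpark. The table and column names (object, col1, hashkey) follow the example; the final explode back into one row per IP is an assumption about the intended result.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Build an array of (ip, hashkey) structs per input row, then explode it
    # so each struct becomes its own row.
    spark.sql("""
        SELECT explode(ips) AS ip_struct
        FROM (
            SELECT array(struct(ip1, hashkey), struct(ip2, hashkey)) AS ips
            FROM (
                SELECT substr(col1, 1, 2) AS ip1,
                       substr(col1, 3, 3) AS ip2,
                       hashkey
                FROM object
            ) a
        ) b
    """).show()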

Re: HDFS or NFS as a cache?

2017-10-02 Thread Miguel Morales
See: https://github.com/rdblue/s3committer and https://www.youtube.com/watch?v=8F2Jqw5_OnI On Mon, Oct 2, 2017 at 11:31 AM, Marcelo Vanzin wrote: > You don't need to collect data in the driver to save it. The code in > the original question doesn't use

Re: HDFS or NFS as a cache?

2017-10-02 Thread Marcelo Vanzin
You don't need to collect data in the driver to save it. The code in the original question doesn't use "collect()", so it's actually doing a distributed write. On Mon, Oct 2, 2017 at 11:26 AM, JG Perrin wrote: > Steve, > > > > If I refer to the collect() API, it says
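
A minimal sketch of the point above, with hypothetical HDFS paths: the write below runs on the executors, each task persisting its own partitions, so no collect() and no driver-side accumulation is involved.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.parquet("hdfs:///data/input")              # hypothetical input path
    df.write.mode("overwrite").parquet("hdfs:///data/output")  # distributed write, no collect()

    # By contrast, collect() pulls every row into the driver's memory and
    # can crash it with OutOfMemoryError on a large dataset:
    # rows = df.collect()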

RE: HDFS or NFS as a cache?

2017-10-02 Thread JG Perrin
Steve, If I refer to the collect() API, it says “Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.” So why would you need a distributed FS? jg From: Steve Loughran

RE: Error - Spark reading from HDFS via dataframes - Java

2017-10-02 Thread JG Perrin
@Anastasios: just a word of caution, this is the Spark 1.x CSV parser; there are a few (minor) changes for Spark 2.x. You can have a look at http://jgp.net/2017/10/01/loading-csv-in-spark/. From: Anastasios Zouzias [mailto:zouz...@gmail.com] Sent: Sunday, October 01, 2017 2:05 AM To: Kanagha Kumar
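
For reference, a minimal sketch of the difference (the file name is hypothetical): Spark 2.x reads CSV natively, while Spark 1.x went through the external spark-csv package.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Spark 2.x: CSV support is built into the DataFrameReader
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("books.csv"))                 # hypothetical file

    # Spark 1.x equivalent used the com.databricks.spark.csv package:
    # sqlContext.read.format("com.databricks.spark.csv") \
    #     .option("header", "true").load("books.csv")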

Example of GBTClassifier

2017-10-02 Thread mckunkel
Greetings, I am trying to run the example in the example directory for the GBTClassifier. But when I view this code in Eclipse, I get an error: "The method setLabelCol(String) is undefined for the type GBTClassifier" for the line GBTClassifier gbt = new

PySpark - Expand rows into dataframes via function

2017-10-02 Thread Patrick McCarthy
Hello, I'm trying to map ARIN registry files into more explicit IP ranges. They provide a number of IPs in the range (here it's 8192) and a starting IP, and I'm trying to map it into all the included /24 subnets. For example, Input: array(['arin', 'US', 'ipv4', '23.239.160.0', 8192, 20131104.0,
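
A minimal sketch of the expansion itself (plain Python, using the standard ipaddress module), assuming the address count is always a multiple of 256; in PySpark this could then be wrapped in a UDF and exploded into one row per /24.

    import ipaddress

    def to_slash_24s(start_ip, count):
        # Expand a starting IPv4 address and an address count into the
        # covered /24 subnets.
        start = int(ipaddress.IPv4Address(start_ip))
        return [str(ipaddress.IPv4Address(start + i * 256)) + "/24"
                for i in range(count // 256)]

    # '23.239.160.0' with 8192 addresses -> 32 consecutive /24 subnets
    print(to_slash_24s('23.239.160.0', 8192)[:3])
    # ['23.239.160.0/24', '23.239.161.0/24', '23.239.162.0/24']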

Re: Saving dataframes with partitionBy: append partitions, overwrite within each

2017-10-02 Thread Pavel Knoblokh
If your processing task inherently processes input data by month, you may want to "manually" partition the output data by month as well as by day, that is, to save it to a path that includes the given month, e.g. "dataset.parquet/month=01". Then you will be able to use the overwrite mode with
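
A minimal PySpark sketch of that manual approach, with hypothetical paths and column names: each month's slice is written to its own month=MM directory in overwrite mode, leaving the other months untouched.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.parquet("hdfs:///data/events")       # hypothetical input
    month = "01"

    (df.filter(F.month("event_date") == int(month))
       .write.mode("overwrite")
       .partitionBy("day")                               # keep day partitions inside the month
       .parquet("dataset.parquet/month=%s" % month))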

Re: Should Flume integration be behind a profile?

2017-10-02 Thread Sean Owen
CCing user@. Yeah, good point about perhaps moving the examples into the module itself. Actually removing it would be a long way off, no matter what. On Mon, Oct 2, 2017 at 8:35 AM Nick Pentreath wrote: > I'd agree with #1 or #2. Deprecation now seems fine. > > Perhaps