[Spark] Expose Spark to Untrusted Users?

2017-10-02 Thread Jack Leadford
Hello, I would like to expose Apache Spark to untrusted users (through Livy, and with a direct JDBC connection). However, there appear to be a variety of avenues wherein one of these untrusted users can execute arbitrary code (by design): PySpark, SparkR, Jar uploads, various UDFs, etc. I

Re: Example of GBTClassifier

2017-10-02 Thread Weichen Xu
It should be an Eclipse issue. The method is there, in the superclass `Predictor`. On Mon, Oct 2, 2017 at 11:51 PM, mckunkel wrote: > Greetings, > I am trying to run the example in the example directory for the > GBTClassifier. But when I view this code in eclipse, I get an

Quick one... AWS SDK version?

2017-10-02 Thread JG Perrin
Hey Sparkians, What version of the AWS Java SDK do you use with Spark 2.2? Do you stick with the Hadoop 2.7.3 libs? Thanks! jg

Re: PySpark - Expand rows into dataframes via function

2017-10-02 Thread Sathish Kumaran Vairavelu
It's possible with the array function combined with the struct constructor. Below is a SQL example: select Array(struct(ip1,hashkey), struct(ip2,hashkey)) from (select substr(col1,1,2) as ip1, substr(col1,3,3) as ip2, etc, hashkey from object) a. If you want dynamic IP ranges, you need to dynamically
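
A minimal sketch of the array/struct approach described above, run through Spark SQL from PySpark. The table and column names (object, col1, hashkey) follow the example; the final explode back into one row per IP is an assumption about the intended result.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Build an array of (ip, hashkey) structs per input row, then explode it
    # so each struct becomes its own row.
    spark.sql("""
        SELECT explode(ips) AS ip_struct
        FROM (
            SELECT array(struct(ip1, hashkey), struct(ip2, hashkey)) AS ips
            FROM (
                SELECT substr(col1, 1, 2) AS ip1,
                       substr(col1, 3, 3) AS ip2,
                       hashkey
                FROM object
            ) a
        ) b
    """).show()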

Re: HDFS or NFS as a cache?

2017-10-02 Thread Miguel Morales
See: https://github.com/rdblue/s3committer and https://www.youtube.com/watch?v=8F2Jqw5_OnI On Mon, Oct 2, 2017 at 11:31 AM, Marcelo Vanzin wrote: > You don't need to collect data in the driver to save it. The code in > the original question doesn't use

Re: HDFS or NFS as a cache?

2017-10-02 Thread Marcelo Vanzin
You don't need to collect data in the driver to save it. The code in the original question doesn't use "collect()", so it's actually doing a distributed write. On Mon, Oct 2, 2017 at 11:26 AM, JG Perrin wrote: > Steve, > > > > If I refer to the collect() API, it says
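
A minimal sketch of the point above, with hypothetical HDFS paths: the write below runs on the executors, each task persisting its own partitions, so no collect() and no driver-side accumulation is involved.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.parquet("hdfs:///data/input")              # hypothetical input path
    df.write.mode("overwrite").parquet("hdfs:///data/output")  # distributed write, no collect()

    # By contrast, collect() pulls every row into the driver's memory and
    # can crash it with OutOfMemoryError on a large dataset:
    # rows = df.collect()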

RE: HDFS or NFS as a cache?

2017-10-02 Thread JG Perrin
Steve, If I refer to the collect() API, it says “Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.” So why would you need a distributed FS? jg From: Steve Loughran

RE: Error - Spark reading from HDFS via dataframes - Java

2017-10-02 Thread JG Perrin
@Anastasios: just a word of caution, this is the Spark 1.x CSV parser; there are a few (minor) changes for Spark 2.x. You can have a look at http://jgp.net/2017/10/01/loading-csv-in-spark/. From: Anastasios Zouzias [mailto:zouz...@gmail.com] Sent: Sunday, October 01, 2017 2:05 AM To: Kanagha Kumar
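
For reference, a minimal sketch of the difference (the file name is hypothetical): Spark 2.x reads CSV natively, while Spark 1.x went through the external spark-csv package.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Spark 2.x: CSV support is built into the DataFrameReader
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("books.csv"))                 # hypothetical file

    # Spark 1.x equivalent used the com.databricks.spark.csv package:
    # sqlContext.read.format("com.databricks.spark.csv") \
    #     .option("header", "true").load("books.csv")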

Example of GBTClassifier

2017-10-02 Thread mckunkel
Greetings, I am trying to run the example in the example directory for the GBTClassifier. But when I view this code in Eclipse, I get an error: "The method setLabelCol(String) is undefined for the type GBTClassifier" for the line GBTClassifier gbt = new

PySpark - Expand rows into dataframes via function

2017-10-02 Thread Patrick McCarthy
Hello, I'm trying to map ARIN registry files into more explicit IP ranges. They provide a number of IPs in the range (here it's 8192) and a starting IP, and I'm trying to map it into all the included /24 subnets. For example, Input: array(['arin', 'US', 'ipv4', '23.239.160.0', 8192, 20131104.0,
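
A minimal sketch of the expansion itself (plain Python, using the standard ipaddress module), assuming the address count is always a multiple of 256; in PySpark this could then be wrapped in a UDF and exploded into one row per /24.

    import ipaddress

    def to_slash_24s(start_ip, count):
        # Expand a starting IPv4 address and an address count into the
        # covered /24 subnets.
        start = int(ipaddress.IPv4Address(start_ip))
        return [str(ipaddress.IPv4Address(start + i * 256)) + "/24"
                for i in range(count // 256)]

    # '23.239.160.0' with 8192 addresses -> 32 consecutive /24 subnets
    print(to_slash_24s('23.239.160.0', 8192)[:3])
    # ['23.239.160.0/24', '23.239.161.0/24', '23.239.162.0/24']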

Re: Saving dataframes with partitionBy: append partitions, overwrite within each

2017-10-02 Thread Pavel Knoblokh
If your processing task inherently processes input data by month, you may want to "manually" partition the output data by month as well as by day, that is, to save it to a path that includes the given month, e.g. "dataset.parquet/month=01". Then you will be able to use the overwrite mode with
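
A minimal PySpark sketch of that manual approach, with hypothetical paths and column names: each month's slice is written to its own month=MM directory in overwrite mode, leaving the other months untouched.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.parquet("hdfs:///data/events")       # hypothetical input
    month = "01"

    (df.filter(F.month("event_date") == int(month))
       .write.mode("overwrite")
       .partitionBy("day")                               # keep day partitions inside the month
       .parquet("dataset.parquet/month=%s" % month))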

Re: Should Flume integration be behind a profile?

2017-10-02 Thread Sean Owen
CCing user@. Yeah, good point about perhaps moving the examples into the module itself. Actually removing it would be a long way off, no matter what. On Mon, Oct 2, 2017 at 8:35 AM Nick Pentreath wrote: > I'd agree with #1 or #2. Deprecation now seems fine. > > Perhaps