Re: Should Flume integration be behind a profile?

2017-10-02 Thread Sean Owen
CCing user@. Yeah, good point about perhaps moving the examples into the module itself. Actually removing it would be a long way off, no matter what. On Mon, Oct 2, 2017 at 8:35 AM Nick Pentreath wrote: > I'd agree with #1 or #2. Deprecation now seems fine. > > Perhaps this should be raised on the

Re: Saving dataframes with partitionBy: append partitions, overwrite within each

2017-10-02 Thread Pavel Knoblokh
If your processing task inherently processes input data by month, you may want to "manually" partition the output data by month as well as by day, that is, to save it under a directory name that includes the given month, e.g. "dataset.parquet/month=01". Then you will be able to use the overwrite mode with each
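A minimal PySpark sketch of this suggestion, assuming a `month` column; the dataset and path names are hypothetical. Writing each month's slice under its own Hive-style directory means "overwrite" only replaces that month:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input; dataset and column names are illustrative.
events = spark.read.parquet("dataset_raw.parquet")

month = "01"  # the month currently being (re)processed
(events.filter(events.month == month)
       .drop("month")      # the value lives in the directory name instead
       .write
       .mode("overwrite")  # only this month's directory is replaced
       .parquet("dataset.parquet/month=%s" % month))
```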

PySpark - Expand rows into dataframes via function

2017-10-02 Thread Patrick McCarthy
Hello, I'm trying to map ARIN registry files into more explicit IP ranges. They provide a number of IPs in the range (here it's 8192) and a starting IP, and I'm trying to map it into all the included /24 subnets. For example, Input: array(['arin', 'US', 'ipv4', '23.239.160.0', 8192, 20131104.0,
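For reference, one way to do the expansion the question asks about, sketched in PySpark with the standard-library ipaddress module. The row mirrors the sample input, and `to_slash24s` assumes the range is /24-aligned, as ARIN IPv4 delegations typically are:

```python
import ipaddress

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def to_slash24s(start_ip, count):
    """Expand a (starting IP, address count) pair into the /24s it covers.
    Assumes the range is /24-aligned."""
    base = int(ipaddress.IPv4Address(start_ip))
    return ["%s/24" % ipaddress.IPv4Address(base + i * 256)
            for i in range(count // 256)]

# Hypothetical row shaped like the sample input in the question.
rows = [("arin", "US", "ipv4", "23.239.160.0", 8192)]

subnets = (spark.sparkContext
           .parallelize(rows)
           .flatMap(lambda r: [(r[0], r[1], net)
                               for net in to_slash24s(r[3], r[4])]))
# 8192 addresses starting at 23.239.160.0 expand to 32 /24 subnets.
print(subnets.count())
```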

Example of GBTClassifier

2017-10-02 Thread mckunkel
Greetings, I am trying to run the example in the example directory for the GBTClassifier. But when I view this code in Eclipse, I get an error stating "The method setLabelCol(String) is undefined for the type GBTClassifier" for the line GBTClassifier gbt = new GBTClassifier().se

RE: Error - Spark reading from HDFS via dataframes - Java

2017-10-02 Thread JG Perrin
@Anastasios: just a word of caution, this is the Spark 1.x CSV parser; there are a few (minor) changes for Spark 2.x. You can have a look at http://jgp.net/2017/10/01/loading-csv-in-spark/. From: Anastasios Zouzias [mailto:zouz...@gmail.com] Sent: Sunday, October 01, 2017 2:05 AM To: Kanagha Kumar Cc:
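The thread's code is Java, but for comparison, a minimal Spark 2.x sketch (here in PySpark; the path and options are illustrative) showing that CSV is now a built-in source, so no external spark-csv package is needed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark 2.x ships a native CSV reader; options shown are the common ones.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/books.csv"))
df.printSchema()
```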

RE: HDFS or NFS as a cache?

2017-10-02 Thread JG Perrin
Steve, If I refer to the collect() API, it says “Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.” So why would you need a distributed FS? jg From: Steve Loughran [mailt

Re: HDFS or NFS as a cache?

2017-10-02 Thread Marcelo Vanzin
You don't need to collect data in the driver to save it. The code in the original question doesn't use "collect()", so it's actually doing a distributed write. On Mon, Oct 2, 2017 at 11:26 AM, JG Perrin wrote: > Steve, > > > > If I refer to the collect() API, it says “Running collect requires mo
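To make the distinction concrete, a small sketch (the dataset and output path are placeholders): the write goes executor-to-filesystem in parallel, which is why a shared or distributed FS matters even though no data ever reaches the driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000)  # placeholder dataset

# Each executor writes its own partitions straight to the shared filesystem;
# no row ever passes through the driver.
df.write.mode("overwrite").parquet("hdfs:///tmp/out.parquet")

# collect(), by contrast, funnels every row into the driver process -- that
# is what the OutOfMemoryError warning in the docs refers to.
# rows = df.collect()  # avoid on large datasets
```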

Re: HDFS or NFS as a cache?

2017-10-02 Thread Miguel Morales
See: https://github.com/rdblue/s3committer and https://www.youtube.com/watch?v=8F2Jqw5_OnI&feature=youtu.be On Mon, Oct 2, 2017 at 11:31 AM, Marcelo Vanzin wrote: > You don't need to collect data in the driver to save it. The code in > the original question doesn't use "collect()", so it's actu

Re: PySpark - Expand rows into dataframes via function

2017-10-02 Thread Sathish Kumaran Vairavelu
It's possible with the array function combined with the struct construct. Below is a SQL example: select Array(struct(ip1,hashkey), struct(ip2,hashkey)) from (select substr(col1,1,2) as ip1, substr(col1,3,3) as ip2, etc, hashkey from object) a If you want dynamic IP ranges, you need to dynamically constru
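A DataFrame-API sketch of the same idea, with a hypothetical one-row table standing in for `object` in the SQL above; explode() then flattens the array of structs into one (ip, hashkey) row per element:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import array, struct, substring, explode

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the `object` table.
df = spark.createDataFrame([("23239", "k1")], ["col1", "hashkey"])

pairs = df.select(array(
    struct(substring("col1", 1, 2).alias("ip"), df.hashkey),
    struct(substring("col1", 3, 3).alias("ip"), df.hashkey),
).alias("ips"))

# explode() turns each element of the array into its own row.
pairs.select(explode("ips").alias("entry")).show()
```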

Quick one... AWS SDK version?

2017-10-02 Thread JG Perrin
Hey Sparkians, What version of AWS Java SDK do you use with Spark 2.2? Do you stick with the Hadoop 2.7.3 libs? Thanks! jg

Re: Example of GBTClassifier

2017-10-02 Thread Weichen Xu
It should be an Eclipse issue. The method is there, in the superclass `Predictor`. On Mon, Oct 2, 2017 at 11:51 PM, mckunkel wrote: > Greetings, > I am trying to run the example in the example directory for the > GBTClassifier. But when I view this code in eclipse, I get an error such > that > "The
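The thread's snippet is Java; the PySpark sketch below shows the same inherited-setter pattern (column names are illustrative):

```python
from pyspark.ml.classification import GBTClassifier

# setLabelCol is inherited rather than declared on GBTClassifier itself
# (on the Java side it comes from the Predictor superclass), so an IDE
# with a stale or incomplete classpath can flag it as undefined even
# though it compiles and runs fine.
gbt = (GBTClassifier()
       .setLabelCol("indexedLabel")
       .setFeaturesCol("indexedFeatures")
       .setMaxIter(10))
print(gbt.getLabelCol())
```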

[Spark] Expose Spark to Untrusted Users?

2017-10-02 Thread Jack Leadford
Hello, I would like to expose Apache Spark to untrusted users (through Livy, and with a direct JDBC connection). However, there appear to be a variety of avenues wherein one of these untrusted users can execute arbitrary code (by design): PySpark, SparkR, Jar uploads, various UDFs, etc. I wou

Re: Quick one... AWS SDK version?

2017-10-02 Thread Yash Sharma
Hi JG, Here are my cluster configs, if it helps. Cheers. EMR: emr-5.8.0 Hadoop distribution: Amazon 2.7.3 AWS SDK: /usr/share/aws/aws-java-sdk/aws-java-sdk-1.11.160.jar Applications: Hive 2.3.0, Spark 2.2.0, Tez 0.8.4 On Tue, 3 Oct 2017 at 12:29 JG Perrin wrote: > Hey Sparkians, > > > > What ve
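For a non-EMR setup the question also touches on, one hedged sketch: hadoop-aws 2.7.3 was built against aws-java-sdk 1.7.4, so letting the dependency resolver pull the hadoop-aws coordinate keeps the pair consistent rather than pinning the two versions independently. On EMR the jars listed above are already on the classpath, so none of this is needed.

```python
from pyspark.sql import SparkSession

# Resolves hadoop-aws 2.7.3 plus its matching aws-java-sdk transitively.
spark = (SparkSession.builder
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
         .getOrCreate())
```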