Free Column Reference with $

2018-05-04 Thread Christopher Piggott
How does $"something" actually work (from a Scala perspective) as a free column reference?
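The interpolator comes from the implicit class StringToColumn inside org.apache.spark.sql.SQLImplicits, pulled into scope by import spark.implicits._. A simplified sketch of the mechanism (paraphrased from the 2.x source, so treat the details as approximate; in Spark the class lives inside SQLImplicits rather than at the top level):

    import org.apache.spark.sql.ColumnName

    // Scala desugars $"name" into new StringContext("name").$(),
    // so an implicit class on StringContext can supply the $ method.
    implicit class StringToColumn(val sc: StringContext) {
      def $(args: Any*): ColumnName = new ColumnName(sc.s(args: _*))
    }

Since ColumnName extends Column, $"age" can be used anywhere a Column is expected, e.g. df.select($"age").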

Running apps over a VPN

2018-05-02 Thread Christopher Piggott
My setup is that I have a Spark master (using the Spark scheduler) and 32 workers registered with it, but they are on a private network. I can connect to that private network via OpenVPN. I would like to be able to run Spark applications from a local IntelliJ (on my desktop) but have them use the ...
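For a driver running outside the cluster, a sketch of the settings that usually matter (the VPN address, ports, and master URL here are assumptions for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("vpn-driver")
      .master("spark://master.internal:7077")    // assumed master URL
      .config("spark.driver.host", "10.8.0.6")   // VPN address the workers can reach back on
      .config("spark.driver.port", "7078")       // pin ports so the VPN/firewall can pass them
      .config("spark.blockManager.port", "7079")
      .getOrCreate()

The workers open connections back to the driver, so spark.driver.host must be an address routable from the private network, not the desktop's LAN address.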

Re: Stream writing parquet files

2018-04-19 Thread Christopher Piggott
As a follow-up question, what happened to org.apache.spark.sql.parquet.RowWriteSupport? It seems like it would help me. ...
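As far as I can tell, RowWriteSupport belonged to the old Spark 1.x parquet code path; in 2.x the write-support classes were refactored and made internal, leaving DataFrameWriter as the supported route. A sketch, with the partition count an arbitrary illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val df = spark.range(0, 1000000).toDF("id")   // stand-in data

    // Repartition to bound how much each task buffers before flushing a file.
    df.repartition(8)
      .write
      .option("compression", "snappy")
      .parquet("hdfs://1.2.3.4/output")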

Stream writing parquet files

2018-04-19 Thread Christopher Piggott
I am trying to write some parquet files and running out of memory. I'm giving my workers 16GB each, and the data is 102 columns * 65536 rows - not really all that much. The content of each row is a short string. I am trying to create the file by dynamically allocating a StructType of StructField ...
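Building the schema dynamically is straightforward; a self-contained sketch with placeholder data (the column and row counts are from the post):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val spark = SparkSession.builder().getOrCreate()

    // 102 string columns, built programmatically.
    val schema = StructType(
      (1 to 102).map(i => StructField(s"c$i", StringType, nullable = false)))

    // Stand-in rows for the real short-string content.
    val rows = spark.sparkContext.parallelize(
      (1 to 65536).map(r => Row.fromSeq((1 to 102).map(c => s"r${r}c$c"))))

    spark.createDataFrame(rows, schema).write.parquet("hdfs://1.2.3.4/out")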

Custom metrics sink

2018-03-16 Thread Christopher Piggott
Just for fun, I want to make a stupid program that makes different-frequency chimes as each worker becomes active. That way you can 'hear' what the cluster is doing and how it's distributing work. To do this, I thought I would make a custom Sink, but the Sink and everything else in ...
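In 2.x the Sink trait is private[spark], so a custom sink seems to need to live in the org.apache.spark.metrics.sink package, where Spark instantiates it reflectively. A hypothetical sketch; the constructor signature below is what the 2.x MetricsSystem looks for, as far as I can tell:

    package org.apache.spark.metrics.sink   // Sink is private[spark]

    import java.util.Properties
    import com.codahale.metrics.MetricRegistry
    import org.apache.spark.SecurityManager

    // Hypothetical "chime" sink, constructed reflectively by Spark with
    // exactly this (Properties, MetricRegistry, SecurityManager) shape.
    class ChimeSink(props: Properties,
                    registry: MetricRegistry,
                    securityMgr: SecurityManager) extends Sink {
      override def start(): Unit = ()   // e.g. schedule a poller here
      override def stop(): Unit = ()
      override def report(): Unit = {
        // Map metric values to tone frequencies here; println is a placeholder.
        println(s"metrics snapshot: ${registry.getNames}")
      }
    }

It would then be registered in conf/metrics.properties, e.g. *.sink.chime.class=org.apache.spark.metrics.sink.ChimeSink.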

Spark MakeRDD preferred workers

2018-01-08 Thread Christopher Piggott
Hi, def makeRDD[T](seq: Seq[(T, Seq[String])])(implicit arg0: ClassTag[T]): RDD[T] takes a "list of tuples of data and location preferences (hostnames of Spark nodes)". Is that a list of acceptable choices from which it will pick one, or is it an ordered list? I'm trying to ascertain how ...
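From what I can tell of the scheduler, the inner Seq is an unordered set of preferred hosts rather than an ordered fallback list: any of them satisfies locality, and the task can still run elsewhere if none has a free slot. A minimal example (hostnames are assumptions for illustration):

    import org.apache.spark.sql.SparkSession

    val sc = SparkSession.builder().getOrCreate().sparkContext

    // One partition per element; the Seq[String] is that partition's
    // preferred hosts.
    val rdd = sc.makeRDD(Seq(
      ("block-a", Seq("worker01", "worker02")),
      ("block-b", Seq("worker03"))
    ))

    rdd.partitions.foreach(p => println(rdd.preferredLocations(p)))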

binaryFiles() on directory full of directories

2018-01-08 Thread Christopher Piggott
I have a top-level directory in HDFS that contains nothing but subdirectories (no actual files). Each of those subdirs contains a combination of files and other subdirs: /topdir/dir1/(lots of files) /topdir/dir2/(lots of files) /topdir/dir2/subdir/(lots of files) I ...
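One commonly suggested approach, untested here, is the recursive-listing flag that FileInputFormat honors when listing input directories; binaryFiles() goes through a FileInputFormat subclass, so it should pick the flag up:

    import org.apache.spark.sql.SparkSession

    val sc = SparkSession.builder().getOrCreate().sparkContext

    // Ask the Hadoop input layer to descend into subdirectories.
    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.input.dir.recursive", "true")

    val files = sc.binaryFiles("hdfs://1.2.3.4/topdir")
    files.keys.take(10).foreach(println)   // (path, PortableDataStream) pairs

An alternative is globbing each depth explicitly, e.g. binaryFiles("hdfs://1.2.3.4/topdir/*/*"), at the cost of hard-coding the nesting.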

Converting binary files

2017-12-30 Thread Christopher Piggott
I have been searching for examples but not finding exactly what I need. I am looking for the paradigm for using Spark 2.2 to convert a bunch of binary files into a bunch of different binary files. I'm starting with: val files = spark.sparkContext.binaryFiles("hdfs://1.2.3.4/input") then ...
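One pattern: read with binaryFiles(), transform the bytes, and write each result through the Hadoop FileSystem API from inside the tasks. A sketch in which the byte-flip and the output path are stand-ins for the real conversion:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val files = spark.sparkContext.binaryFiles("hdfs://1.2.3.4/input")

    files.foreach { case (name, stream) =>
      val in  = stream.toArray()           // whole file into memory
      val out = in.map(b => (~b).toByte)   // placeholder "conversion"
      val fs  = FileSystem.get(new URI("hdfs://1.2.3.4/"), new Configuration())
      val target = new Path("/converted/" + new Path(name).getName)
      val os = fs.create(target, true)     // overwrite if present
      try os.write(out) finally os.close()
    }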

Spark 2.2.1 worker invocation

2017-12-26 Thread Christopher Piggott
I need to set java.library.path to get access to some native code. Following directions, I made a spark-env.sh: #!/usr/bin/env bash export LD_LIBRARY_PATH="/usr/local/lib/libcdfNativeLibrary.so:/usr/local/lib/libcdf.so:${LD_LIBRARY_PATH}" export ...
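Worth noting that LD_LIBRARY_PATH conventionally lists directories, not individual .so files, so /usr/local/lib alone is probably what's wanted here. Spark also exposes this as configuration; a minimal sketch:

    import org.apache.spark.sql.SparkSession

    // Point the library *directory* (not the .so files themselves)
    // at the JVMs on both the driver and the executors.
    val spark = SparkSession.builder()
      .config("spark.driver.extraLibraryPath", "/usr/local/lib")
      .config("spark.executor.extraLibraryPath", "/usr/local/lib")
      .getOrCreate()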

NASA CDF files in Spark

2017-12-15 Thread Christopher Piggott
I'm looking to run a job that involves a zillion files in a format called CDF, a NASA standard. There are a number of libraries out there that can read CDFs, but most of them are not high quality compared to the official NASA one, which has Java bindings (via JNI). It's a little clumsy, but I have ...
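A common pattern for JNI-backed readers: make the native library visible on every executor via spark.executor.extraLibraryPath, and, since JNI code usually wants a local file rather than an HDFS stream, spill each file to a temp path before handing it over. Sketch only; openCdf, the install directory, and the HDFS path are stand-ins, not the NASA library's actual API:

    import java.nio.file.Files
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.executor.extraLibraryPath", "/usr/local/cdf/lib")  // assumed install dir
      .getOrCreate()

    // Hypothetical wrapper around the JNI bindings.
    val openCdf = (localPath: String) => Seq.empty[String]   // placeholder

    val summaries = spark.sparkContext.binaryFiles("hdfs://1.2.3.4/cdf-files")
      .map { case (name, stream) =>
        // JNI readers generally need a local path, so spill to a temp file.
        val tmp = java.io.File.createTempFile("cdf-", ".cdf")
        Files.write(tmp.toPath, stream.toArray())
        try (name, openCdf(tmp.getAbsolutePath)) finally tmp.delete()
      }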