Trouble with PySpark UDFs and SPARK_HOME only on EMR

2017-06-22 Thread Nick Chammas
I’m seeing a strange issue on EMR which I posted about here. In brief, when I try to import a UDF I’ve defined, Python somehow fails to find Spark. This exact code works for me locally and works on our on-premises CDH

Reading ORC file - fine on 1.6; GC timeout on 2+

2017-05-05 Thread Nick Chammas
I have this ORC file that was generated by a Spark 1.6 program. It opens fine in Spark 1.6 with 6GB of driver memory, and probably less. However, when I try to open the same file in Spark 2.0 or 2.1, I get GC timeout exceptions. And this is with 6, 8, and even 10GB of driver memory. This is

Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Nick Chammas
I'm having trouble understanding the difference between Spark fair scheduler pools and YARN queues. Do they conflict? Does one

Re: spark-ec2 vs. EMR

2015-12-01 Thread Nick Chammas
Pinging this thread in case anyone has thoughts on the matter they want to share. On Sat, Nov 21, 2015 at 11:32 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > Spark has come bundled with spark-ec2 for many years. At the same

[PSA] Use Stack Overflow!

2015-05-02 Thread Nick Chammas
This mailing list sees a lot of traffic every day. With such a volume of mail, you may find it hard to find discussions you are interested in, and if you are the one starting discussions you may sometimes feel your mail is going into a black hole. We can't change the nature of this mailing list

Submit Spark applications from a machine that doesn't have Java installed

2015-01-11 Thread Nick Chammas
Is it possible to submit a Spark application to a cluster from a machine that does not have Java installed? My impression is that many, many more computers come with Python installed by default than do with Java. I want to write a command-line utility

Discourse: A proposed alternative to the Spark User list

2014-12-24 Thread Nick Chammas
When people have questions about Spark, there are 2 main places (as far as I can tell) where they ask them: - Stack Overflow, under the apache-spark tag http://stackoverflow.com/questions/tagged/apache-spark - This mailing list The mailing list is valuable as an independent place for

Multipart uploads to Amazon S3 from Apache Spark

2014-10-13 Thread Nick Chammas
Cross posting an interesting question on Stack Overflow http://stackoverflow.com/questions/26321947/multipart-uploads-to-amazon-s3-from-apache-spark . Nick -- View this message in context:

Write 1 RDD to multiple output paths in one go

2014-09-13 Thread Nick Chammas
Howdy doody Spark Users, I’d like to somehow write out a single RDD to multiple paths in one go. Here’s an example. I have an RDD of (key, value) pairs like this: a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0]) a.collect() [('N', 'Nick'), ('N', 'Nancy'),
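The usual answer to this thread involved saveAsHadoopFile with a custom MultipleTextOutputFormat; as a local illustration of the idea only, here is a plain-Python sketch that groups the same (key, value) pairs from the example and writes one output file per key. The function name and file layout are illustrative, not any Spark API:

```python
import os
import tempfile
from collections import defaultdict

# The (key, value) pairs from the example: names keyed by first letter.
pairs = [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]

def write_by_key(pairs, base_dir):
    """Group (key, value) pairs and write one file per key under base_dir."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    for key, values in groups.items():
        with open(os.path.join(base_dir, f'{key}.txt'), 'w') as f:
            f.write('\n'.join(values))
    return sorted(groups)

out = tempfile.mkdtemp()
print(write_by_key(pairs, out))  # → ['B', 'F', 'N']
```

On a real cluster the grouping happens in the shuffle rather than in a local dict, but the output shape (one path per key, all of a key's values together) is the same.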

[PySpark] large # of partitions causes OOM

2014-08-29 Thread Nick Chammas
Here’s a repro for PySpark: a = sc.parallelize(['Nick', 'John', 'Bob']) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) When I try this on an EC2 cluster with 1.1.0-rc2 and Python 2.7, this is what I get: a = sc.parallelize(['Nick', 'John', 'Bob']) a =
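For readers puzzling over what the repro actually computes (the OOM comes from the 24,000 partitions, not the data): it keys each name by its length and concatenates names that share a length. A plain-Python equivalent of keyBy plus reduceByKey, with no cluster involved:

```python
names = ['Nick', 'John', 'Bob']

# keyBy(lambda x: len(x)) → (length, name) pairs
pairs = [(len(x), x) for x in names]

def reduce_by_key(pairs, fn):
    """Local stand-in for reduceByKey: fold values that share a key."""
    result = {}
    for k, v in pairs:
        result[k] = fn(result[k], v) if k in result else v
    return result

print(reduce_by_key(pairs, lambda x, y: x + y))  # → {4: 'NickJohn', 3: 'Bob'}
```

In Spark the combine order within a key is not guaranteed, so string concatenation like this can produce different orderings across runs; the local version shows only the shape of the result.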

Spark Screencast doesn't show in Chrome on OS X

2014-08-25 Thread Nick Chammas
https://spark.apache.org/screencasts/1-first-steps-with-spark.html The embedded YouTube video shows up in Safari on OS X but not in Chrome. How come? Nick

How do you debug a PythonException?

2014-07-29 Thread Nick Chammas
I’m in the PySpark shell and I’m trying to do this: a = sc.textFile('s3n://path-to-handful-of-very-large-files-totalling-1tb/*.json', minPartitions=sc.defaultParallelism * 3).cache() a.map(lambda x: len(x)).max() My job dies with the following: 14/07/30 01:46:28 WARN TaskSetManager: Loss was

Spark SQL throws ClassCastException on first try; works on second

2014-07-15 Thread Nick Chammas
I’m running this query against RDD[Tweet], where Tweet is a simple case class with 4 fields. sqlContext.sql("SELECT user, COUNT(*) as num_tweets FROM tweets GROUP BY user ORDER BY num_tweets DESC, user ASC").take(5) The first time I run this, it throws the following:
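The query itself is ordinary SQL, so its semantics can be sanity-checked outside Spark. A sketch using the stdlib sqlite3 module with a made-up tweets table (the users and rows here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE tweets (user TEXT, text TEXT)')
conn.executemany('INSERT INTO tweets VALUES (?, ?)',
                 [('alice', 'hi'), ('bob', 'yo'), ('alice', 'again')])

# Same GROUP BY / ORDER BY shape as the Spark SQL query in the thread.
rows = conn.execute(
    'SELECT user, COUNT(*) AS num_tweets FROM tweets '
    'GROUP BY user ORDER BY num_tweets DESC, user ASC'
).fetchall()
print(rows)  # → [('alice', 2), ('bob', 1)]
```

The ClassCastException in the thread was a Spark SQL issue, not a problem with the query, which this local run confirms is well-formed.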

Stopping StreamingContext does not kill receiver

2014-07-12 Thread Nick Chammas
From the interactive shell I’ve created a StreamingContext. I call ssc.start() and take a look at http://master_url:4040/streaming/ and see that I have an active Twitter receiver. Then I call ssc.stop(stopSparkContext = false, stopGracefully = true) and wait a bit, but the receiver seems to stay

Supported SQL syntax in Spark SQL

2014-07-12 Thread Nick Chammas
Is there a place where we can find an up-to-date list of supported SQL syntax in Spark SQL? Nick

How to RDD.take(middle 10 elements)

2014-07-10 Thread Nick Chammas
Interesting question on Stack Overflow: http://stackoverflow.com/q/24677180/877069 Basically, is there a way to take() elements of an RDD at an arbitrary index? Nick
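The standard answer that emerged for this is zipWithIndex followed by a filter on the index. A pure-Python sketch of that pattern (the function name is illustrative; on an RDD the same steps are rdd.zipWithIndex().filter(...).values()):

```python
def take_range(elements, start, count):
    """Grab elements [start, start + count) by position,
    mimicking rdd.zipWithIndex().filter(...).values()."""
    indexed = list(enumerate(elements))  # like zipWithIndex: (index, element)
    return [x for i, x in indexed if start <= i < start + count]

data = list(range(100))
print(take_range(data, 45, 10))  # the "middle 10" of 100 elements
```

Note that zipWithIndex on an RDD triggers a job to count the partitions, so it is not free the way enumerate is locally.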

What version of twitter4j should I use with Spark Streaming?

2014-07-10 Thread Nick Chammas
Looks like twitter4j http://twitter4j.org/archive/ 2.2.6 is what works, but I don’t believe it’s documented anywhere. Using 3.0.6 works for a while, but then causes the following error: 14/07/10 18:34:13 WARN ReceiverTracker: Error reported by receiver for stream 0: Error in block pushing thread

How should I add a jar?

2014-07-09 Thread Nick Chammas
I’m just starting to use the Scala version of Spark’s shell, and I’d like to add in a jar I believe I need to access Twitter data live, twitter4j http://twitter4j.org/en/index.html. I’m confused over where and how to add this jar in. SPARK-1089 https://issues.apache.org/jira/browse/SPARK-1089

Restarting a Streaming Context

2014-07-09 Thread Nick Chammas
So I do this from the Spark shell: // set things up // snipped ssc.start() // let things happen for a few minutes ssc.stop(stopSparkContext = false, stopGracefully = true) Then I want to restart the Streaming Context: ssc.start() // still in the shell; Spark Context is still alive Which

wholeTextFiles and gzip

2014-06-25 Thread Nick Chammas
Interesting question on Stack Overflow: http://stackoverflow.com/questions/24402737/how-to-read-gz-files-in-spark-using-wholetextfiles Is it possible to read gzipped files using wholeTextFiles()? Alternately, is it possible to read the source file names using textFile()?
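For intuition about what wholeTextFiles() is being asked to do here, a local stdlib sketch: write a small gzipped file, then read it back as a (filename, contents) pair, which is the shape wholeTextFiles() returns. Paths and contents are made up:

```python
import gzip
import os
import tempfile

# Create a small gzipped text file to read back.
d = tempfile.mkdtemp()
path = os.path.join(d, 'sample.gz')
with gzip.open(path, 'wt') as f:
    f.write('line one\nline two\n')

def read_gz_whole(path):
    """Read a gzipped file as one (name, contents) pair,
    mirroring the shape of wholeTextFiles() output."""
    with gzip.open(path, 'rt') as f:
        return (os.path.basename(path), f.read())

name, contents = read_gz_whole(path)
print(name, contents.count('\n'))  # → sample.gz 2
```

Whether Spark applies this decompression automatically depends on the codec support wired into the input format, which is what the linked question is really about.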

Spark is now available via Homebrew

2014-06-18 Thread Nick Chammas
OS X / Homebrew users, It looks like you can now download Spark simply by doing: brew install apache-spark I’m new to Homebrew, so I’m not too sure how people are intended to use this. I’m guessing this would just be a convenient way to get the latest release onto your workstation, and from

Patterns for making multiple aggregations in one pass

2014-06-18 Thread Nick Chammas
The following is a simplified example of what I am trying to accomplish. Say I have an RDD of objects like this: {"country": "USA", "name": "Franklin", "age": 24, "hits": 224} {"country": "USA", "name": "Bob", "age": 55, "hits": 108} {"country": "France", "name": "Remi", "age":
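One common answer to this pattern is a single fold with a compound accumulator, so several aggregates come out of one pass over the data. A plain-Python sketch using the two complete records from the example (the France record's numbers below are invented, since the preview is truncated):

```python
from functools import reduce

records = [
    {'country': 'USA',    'name': 'Franklin', 'age': 24, 'hits': 224},
    {'country': 'USA',    'name': 'Bob',      'age': 55, 'hits': 108},
    {'country': 'France', 'name': 'Remi',     'age': 33, 'hits': 72},  # made-up values
]

def fold_stats(acc, rec):
    """Accumulate (count, age_sum, hits_sum) per country in one pass."""
    c, a, h = acc.get(rec['country'], (0, 0, 0))
    acc[rec['country']] = (c + 1, a + rec['age'], h + rec['hits'])
    return acc

stats = reduce(fold_stats, records, {})
print(stats['USA'])  # → (2, 79, 332): count, summed ages, summed hits
```

On an RDD the same idea is aggregateByKey (or combineByKey) with a tuple accumulator, so the data is scanned once no matter how many aggregates you want.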

Showing key cluster stats in the Web UI

2014-06-06 Thread Nick Chammas
Someone correct me if this is wrong, but I believe 2 very important things to know about your cluster are: 1. How many cores does your cluster have available. 2. How much memory does your cluster have available. (Perhaps this could be divided into total/in-use/free or something.) Is

stage kill link is awfully close to the stage name

2014-06-06 Thread Nick Chammas
Minor point, but does anyone else find the new (and super helpful!) kill link awfully close to the stage detail link in the 1.0.0 Web UI? I think it would be better to have the kill link flush right, leaving a large amount of space between it the stage detail link. Nick

Subscribing to news releases

2014-05-30 Thread Nick Chammas
Is there a way to subscribe to news releases http://spark.apache.org/news/index.html? That would be swell. Nick

Why Scala?

2014-05-29 Thread Nick Chammas
I recently discovered Hacker News and started reading through older posts about Scala https://hn.algolia.com/?q=scala#!/story/forever/0/scala. It looks like the language is fairly controversial on there, and it got me thinking. Scala appears to be the preferred language to work with in Spark, and

Using Spark to analyze complex JSON

2014-05-20 Thread Nick Chammas
The Apache Drill http://incubator.apache.org/drill/ home page has an interesting heading: Liberate Nested Data. Is there any current or planned functionality in Spark SQL or Shark to enable SQL-like querying of complex JSON? Nick

How to Unsubscribe from the Spark user list

2014-05-20 Thread Nick Chammas
Send an email to this address to unsubscribe from the Spark user list: user-unsubscr...@spark.apache.org Sending an email to the Spark user list itself (i.e. this list) *does not do anything*, even if you put unsubscribe as the subject. We will all just see your email. Nick

count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-16 Thread Nick Chammas
I’m trying to do a simple count() on a large number of GZipped files in S3. My job is failing with the following message: 14/05/15 19:12:37 WARN scheduler.TaskSetManager: Loss was due to java.io.IOException java.io.IOException: incorrect header check at
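The usual diagnosis in this thread's replies was that "incorrect header check" from zlib means a file with a .gz name is not actually gzip data. A quick local way to check is to look for the two-byte gzip magic number at the start of the file; everything below (paths, contents) is illustrative:

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b'\x1f\x8b'  # every valid gzip stream begins with these two bytes

def looks_like_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, 'rb') as f:
        return f.read(2) == GZIP_MAGIC

d = tempfile.mkdtemp()
good = os.path.join(d, 'good.gz')
with gzip.open(good, 'wb') as f:
    f.write(b'payload')

bad = os.path.join(d, 'bad.gz')  # .gz extension, but plain text inside
with open(bad, 'wb') as f:
    f.write(b'not really gzip')

print(looks_like_gzip(good), looks_like_gzip(bad))  # → True False
```

Against S3 you would spot-check a few objects the same way after downloading their first bytes; one mislabeled file in a large glob is enough to fail the whole count().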