Re: Cassandra row count grouped by multiple columns

2015-09-11 Thread Eric Walker
Hi Chirag, Maybe something like this? import org.apache.spark.sql._ import org.apache.spark.sql.types._ val rdd = sc.parallelize(Seq( Row("A1", "B1", "C1"), Row("A2", "B2", "C2"), Row("A3", "B3", "C2"), Row("A1", "B1", "C1") )) val schema = StructType(Seq("a", "b", "c").map(c =>
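The snippet above is cut off mid-expression, but the underlying operation is a count grouped by all three columns (in Spark SQL this would be `df.groupBy("a", "b", "c").count()` on a DataFrame built from the rows shown). The same grouping logic can be sketched in plain Python with `collections.Counter`:

```python
from collections import Counter

# The rows from the example above, as (a, b, c) tuples.
rows = [
    ("A1", "B1", "C1"),
    ("A2", "B2", "C2"),
    ("A3", "B3", "C2"),
    ("A1", "B1", "C1"),
]

# Group by all three columns and count, mirroring what
# df.groupBy("a", "b", "c").count() would do in Spark SQL.
counts = Counter(rows)
for key, n in sorted(counts.items()):
    print(key, n)  # ("A1", "B1", "C1") appears twice, the others once
```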

Re: spark.shuffle.spill=false ignored?

2015-09-09 Thread Eric Walker
n a tiny > disk and kept running out of disk on shuffles even though we also don't > spill. You may have already considered or ruled this out though. > > On Thu, Sep 3, 2015 at 12:56 PM, Eric Walker <eric.wal...@gmail.com> > wrote: > >> Hi, >> >> I am using S

spark.shuffle.spill=false ignored?

2015-09-03 Thread Eric Walker
Hi, I am using Spark 1.3.1 on EMR with lots of memory. I have attempted to run a large pyspark job several times, specifying `spark.shuffle.spill=false` in different ways. It seems that the setting is ignored, at least partially, and some of the tasks start spilling large amounts of data to
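The message is truncated, but the "different ways" of supplying the setting would typically be one of the following; note that the sort-based shuffle manager (the default in Spark 1.3.x) may not honor `spark.shuffle.spill`, which would be consistent with the behavior described. A sketch of the usual places the property can be set:

```
# spark-defaults.conf
spark.shuffle.spill    false

# or on the command line:
#   spark-submit --conf spark.shuffle.spill=false ...

# or in code, before the SparkContext is created:
#   conf = SparkConf().set("spark.shuffle.spill", "false")
```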

Re: cached data between jobs

2015-09-02 Thread Eric Walker
le to see the skipped stage in the spark job ui. > > > > On Wed, Sep 2, 2015 at 12:53 AM, Eric Walker <eric.wal...@gmail.com> > wrote: > >> Hi, >> >> I'm noticing that a 30 minute job that was initially IO-bound may not b

cached data between jobs

2015-09-01 Thread Eric Walker
Hi, I'm noticing that a 30-minute job that was initially IO-bound may not be IO-bound during subsequent runs. Is there some kind of between-job caching that happens in Spark or in Linux that outlives jobs and that might be making subsequent runs faster? If so, is there a way to avoid the caching in
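One likely culprit here is the Linux page cache, which caches file blocks in RAM across process lifetimes, so a rerun that reads the same input can skip most of its disk IO. A minimal sketch of the effect (a truly cold read would require dropping the cache first, e.g. `echo 3 > /proc/sys/vm/drop_caches` as root):

```python
import os
import tempfile
import time

# Write a scratch file, then read it twice. On Linux the second read is
# typically served from the OS page cache rather than disk, which is the
# kind of between-job caching that can make a rerun faster.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * (8 * 1024 * 1024))  # 8 MiB scratch file
    path = f.name

def timed_read(p):
    start = time.perf_counter()
    with open(p, "rb") as fh:
        data = fh.read()
    return data, time.perf_counter() - start

cold, t_cold = timed_read(path)   # may still hit disk
warm, t_warm = timed_read(path)   # usually served from the page cache
assert cold == warm               # same bytes either way
os.unlink(path)
```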

Re: bulk upload to Elasticsearch and shuffle behavior

2015-08-31 Thread Eric Walker
nd from its response to changes I subsequently made that the actual code that was running was the code doing the HBase lookups. I suspect the actual shuffle, once it occurred, required on the same order of network IO as the upload to Elasticsearch that followed. Eric On Mon, Aug 31, 2015 at

bulk upload to Elasticsearch and shuffle behavior

2015-08-31 Thread Eric Walker
Hi, I am working on a pipeline that carries out a number of stages, the last of which is to build some large JSON objects from information in the preceding stages. The JSON objects are then uploaded to Elasticsearch in bulk. If I carry out a shuffle via a `repartition` call after the JSON
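Bulk uploads to Elasticsearch are typically sent as bounded batches per partition rather than one giant request. A minimal batching helper, sketched in plain Python (the function name and batch size are illustrative, not from the original thread):

```python
from itertools import islice

def batches(docs, size):
    """Yield successive lists of at most `size` docs for a bulk request."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# 10 documents in batches of 4 -> batch sizes [4, 4, 2]
docs = ({"id": i} for i in range(10))
sizes = [len(b) for b in batches(docs, 4)]
print(sizes)  # [4, 4, 2]
```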

registering an empty RDD as a temp table in a PySpark SQL context

2015-08-17 Thread Eric Walker
I have an RDD queried from a scan of a data source. Sometimes the RDD has rows and at other times it has none. I would like to register this RDD as a temporary table in a SQL context. I suspect this will work in Scala, but in PySpark some code assumes that the RDD has rows in it, which are used
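The usual workaround in PySpark is to pass an explicit schema (a `StructType`) to `createDataFrame` instead of letting it sample rows, since sampling has nothing to work with when the RDD is empty. The failure mode can be illustrated in plain Python with a naive first-row type sampler (a sketch; the function is hypothetical, not PySpark's actual inference code):

```python
def infer_schema(rows):
    """Infer column types from the first row, as naive samplers do.

    This mirrors why registering an *empty* RDD can fail in PySpark:
    there is no first row to sample, so inference has nothing to go on.
    """
    if not rows:
        raise ValueError("cannot infer schema from an empty dataset")
    return [type(v).__name__ for v in rows[0]]

# With data present, inference works:
print(infer_schema([("A1", 1)]))   # ['str', 'int']

# Empty input needs an explicit schema instead:
explicit = ["str", "int"]
rows = []
schema = explicit if not rows else infer_schema(rows)
print(schema)                      # ['str', 'int']
```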

adding a custom Scala RDD for use in PySpark

2015-08-11 Thread Eric Walker
Hi, I'm new to Scala, Spark and PySpark and have a question about what approach to take in the problem I'm trying to solve. I have noticed that working with HBase tables read in using `newAPIHadoopRDD` can be quite slow with large data sets when one is interested in only a small subset of the
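The slowness described usually comes from scanning the whole table and filtering on the client side; pushing the predicate into the source (e.g. a server-side HBase scan filter, implemented in Scala and exposed to PySpark) avoids shipping rows that will be discarded. The idea, sketched in plain Python with generators (names are illustrative):

```python
def full_scan(n):
    """Simulate reading every row from a data source."""
    for i in range(n):
        yield {"key": i, "value": i * 2}

def pushed_down_scan(n, predicate):
    """Simulate a source that applies the filter while scanning,
    so unwanted rows are never shipped to the client."""
    for row in full_scan(n):
        if predicate(row):
            yield row

# Only 10 of 1000 rows survive the filter, and with pushdown only
# those 10 ever leave the (simulated) data source.
wanted = list(pushed_down_scan(1000, lambda r: r["key"] % 100 == 0))
print(len(wanted))  # 10
```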