Hi Chirag,
Maybe something like this?
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(
Row("A1", "B1", "C1"),
Row("A2", "B2", "C2"),
Row("A3", "B3", "C2"),
Row("A1", "B1", "C1")
))
val schema = StructType(Seq("a", "b", "c").map(c => StructField(c, StringType)))
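Spelled out end to end, here is a sketch of what I'd expect to work (Spark 1.3-era API; the local master, app name, and table name are just placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val sc = new SparkContext(
  new SparkConf().setAppName("demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// An RDD of Rows plus an explicit schema...
val rdd = sc.parallelize(Seq(
  Row("A1", "B1", "C1"), Row("A2", "B2", "C2"),
  Row("A3", "B3", "C2"), Row("A1", "B1", "C1")))
val schema = StructType(Seq("a", "b", "c").map(c => StructField(c, StringType)))

// ...becomes a DataFrame, which can then be queried through SQL.
val df = sqlContext.createDataFrame(rdd, schema)
df.registerTempTable("letters")
```

From there `sqlContext.sql("SELECT DISTINCT a, b, c FROM letters")` should drop the duplicate row.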
> kept running out of disk on shuffles even though we also don't
> spill. You may have already considered or ruled this out though.
>
> On Thu, Sep 3, 2015 at 12:56 PM, Eric Walker wrote:
>
Hi,
I am using Spark 1.3.1 on EMR with lots of memory. I have attempted to run
a large pyspark job several times, specifying `spark.shuffle.spill=false`
in different ways. It seems that the setting is ignored, at least
partially, and some of the tasks start spilling large amounts of data to
disk.
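For what it's worth, here is a sketch of the programmatic way of supplying the setting (placeholder app name; Spark 1.3-era API); the command-line form `--conf spark.shuffle.spill=false` and an entry in spark-defaults.conf should be equivalent:

```scala
import org.apache.spark.SparkConf

// Set the property on the SparkConf before the SparkContext is created;
// properties set after context creation are not picked up.
val conf = new SparkConf()
  .setAppName("no-spill-test") // placeholder app name
  .set("spark.shuffle.spill", "false")
```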
skipped stage in the Spark job UI.
>
>
>
> [image: Inline image 1]
>
> On Wed, Sep 2, 2015 at 12:53 AM, Eric Walker wrote:
Hi,
I'm noticing that a 30-minute job that was initially IO-bound may no longer
be so during subsequent runs. Is there some kind of between-job caching that
happens in Spark or in Linux that outlives jobs and that might be making
subsequent runs faster? If so, is there a way to avoid the caching in
order
nd
from its response to changes I subsequently made that the actual code that
was running was the code doing the HBase lookups. I suspect the actual
shuffle, once it occurred, required network IO on the same order as the
upload to Elasticsearch that followed.
Eric
On Mon, Aug 31, 2015 at
Hi,
I am working on a pipeline that carries out a number of stages, the last of
which is to build some large JSON objects from information in the preceding
stages. The JSON objects are then uploaded to Elasticsearch in bulk.
If I carry out a shuffle via a `repartition` call after the JSON docume
I have an RDD queried from a scan of a data source. Sometimes the RDD has
rows and at other times it has none. I would like to register this RDD as
a temporary table in a SQL context. I suspect this will work in Scala, but
in PySpark some code assumes that the RDD has rows in it, which are used
Hi,
I'm new to Scala, Spark and PySpark and have a question about what approach
to take in the problem I'm trying to solve.
I have noticed that working with HBase tables read in using
`newAPIHadoopRDD` can be quite slow with large data sets when one is
interested in only a small subset of the key
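One approach worth trying is to push the key range into the scan configuration itself, so the filtering happens on the region servers rather than in Spark. A sketch, assuming `TableInputFormat` from the HBase mapreduce package and placeholder table/key names:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Narrow the scan to the interesting key range before Spark ever sees
// the table; newAPIHadoopRDD will then only read that slice.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")    // placeholder table name
hbaseConf.set(TableInputFormat.SCAN_ROW_START, "row_0000") // placeholder start key (inclusive)
hbaseConf.set(TableInputFormat.SCAN_ROW_STOP, "row_0100")  // placeholder stop key (exclusive)
```

The resulting configuration would then be handed to `sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])` as before.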