I have a cluster with 3 nodes (each with 8 cores) using Spark 1.0.1.
I have an RDD<String> which I've repartitioned so it has 100 partitions (hoping
to increase the parallelism).
When I do a transformation (such as filter) on this RDD, I can't seem to get
more than 24 tasks (my total number of cores) running at any one point, despite
what the documentation states. What would I want that value to be, based on my
configuration below? Or should I leave that alone?
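For illustration, a minimal sketch of the knob in question (values are assumptions for a 3-node, 8-core-per-node cluster; spark.default.parallelism is the documented property):

    import org.apache.spark.SparkConf;

    // Concurrent tasks are capped by total cores (3 nodes x 8 cores = 24);
    // extra partitions beyond that simply queue. This setting controls the
    // default partition count (e.g., for shuffles), not concurrency. The
    // tuning guide suggests 2-3 tasks per core, hence 48-72 for this cluster.
    SparkConf conf = new SparkConf()
        .setAppName("parallelism-example") // hypothetical app name
        .set("spark.default.parallelism", "48");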
From: Daniel Siegmann daniel.siegm...@velos.io
To: user@spark.apache.org; Darin McBeath ddmcbe...@yahoo.com
Sent: Wednesday, July 30, 2014
=.08 -z us-east-1e --worker-instances=2
my-cluster
From: Daniel Siegmann daniel.siegm...@velos.io
To: Darin McBeath ddmcbe...@yahoo.com
Cc: Daniel Siegmann daniel.siegm...@velos.io; user@spark.apache.org
user@spark.apache.org
Sent: Thursday, July 31, 2014 10:04
I've been playing around with Spark off and on for the past month and have
developed some XML helper utilities that enable me to filter an XML dataset as
well as transform an XML dataset (we have a lot of XML content). I'm posting
this email to see if there would be any interest in this effort
I started up a cluster on EC2 (using the provided scripts) and specified a
different instance type for the master and the worker nodes. The cluster
started fine, but when I looked at the cluster (via port 8080), it showed that
the amount of memory available to the worker nodes did not
I've seen a couple of issues posted about this, but I never saw a resolution.
When I'm using Spark 1.0.2 (and the spark-submit script to submit my jobs) and
AWS SDK 1.8.7, I get the stack trace below. However, if I drop back to AWS SDK
1.3.26 (or anything from the AWS SDK 1.4.* family) then
Can't seem to figure this out. I've tried several different approaches without
success. For example, I've tried setting spark.executor.extraJavaOptions in
spark-defaults.conf (prior to starting the spark-shell), but this seems to have
no effect.
Outside of spark-shell (within a java
For weeks, I've been using the following trick to successfully disable log4j in
the spark-shell when running a cluster on ec2 that was started by the Spark
provided ec2 scripts.
cp ./conf/log4j.properties.template ./conf/log4j.properties
I then change log4j.rootCategory=INFO to
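For reference, the edited line in conf/log4j.properties typically ends up as follows (WARN is an assumption; ERROR is the other common choice):

    log4j.rootCategory=WARN, console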
I have a PairRDD of type <String, String> which I persist to S3 (using the
following code).
JavaPairRDD<Text, Text> aRDDWritable = aRDD.mapToPair(new ConvertToWritableTypes());
aRDDWritable.saveAsHadoopFile(outputFile, Text.class, Text.class,
    SequenceFileOutputFormat.class);
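The snippet cuts off at the class definition. A minimal sketch of what ConvertToWritableTypes plausibly looks like (the body is an assumption; only the class name comes from the post):

    import org.apache.hadoop.io.Text;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    class ConvertToWritableTypes implements PairFunction<Tuple2<String, String>, Text, Text> {
        public Tuple2<Text, Text> call(Tuple2<String, String> record) {
            // wrap both the key and the value in Hadoop Text writables
            return new Tuple2<>(new Text(record._1), new Text(record._2));
        }
    }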
I have some code that only needs to be executed once per executor in my Spark
application. My current approach is to do something like the following:
scala> xmlKeyPair.foreachPartition(i => XPathProcessor.init(ats, "Namespaces/NamespaceContext"))
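A common idiom for the once-per-executor requirement, sketched in Java (class and method names here are hypothetical, not from the post):

    // Guard the expensive setup with a static flag so it runs at most once per
    // JVM (i.e., once per executor), no matter how many partitions invoke it.
    public class ExecutorInit {
        private static volatile boolean initialized = false;

        public static synchronized void ensureInit(String namespaceContext) {
            if (!initialized) {
                // one-time, per-executor setup (e.g., an XPathProcessor.init call)
                initialized = true;
            }
        }
    }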
So, if I understand correctly, the
I developed the spark-xml-utils library because we have a large amount of XML
in big datasets and I felt this data could be better served by providing some
helpful xml utilities. This includes the ability to filter documents based on
an xpath/xquery expression, return specific nodes for an
I have a SchemaRDD with 100 records in 1 partition. We'll call this baseline.
I have a SchemaRDD with 11 records in 1 partition. We'll call this daily.
After a fairly basic join of these two tables
JavaSchemaRDD results = sqlContext.sql("SELECT id, action, daily.epoch, daily.version FROM
OK. After reading some documentation, it would appear the issue is the default
number of partitions for a join (200).
After doing something like the following, I was able to change the value.
From: Darin McBeath ddmcbe...@yahoo.com.INVALID
To: User user@spark.apache.org
Sent: Wednesday
Sorry, hit the send key a bit too early.
Anyway, this is the code I set.
sqlContext.sql("set spark.sql.shuffle.partitions=10");
From: Darin McBeath ddmcbe...@yahoo.com
To: Darin McBeath ddmcbe...@yahoo.com; User user@spark.apache.org
Sent: Wednesday, October 29, 2014 2:47 PM
Subject: Re
Let me know if you are interested in participating in a meet up in Cincinnati,
OH to discuss Apache Spark.
We currently have 4-5 different companies expressing interest but would like a
few more.
Darin.
I have the following code where I'm using RDD 'union' and 'subtractByKey' to
create a new baseline RDD. All of my RDDs are key/value pairs where the 'key' is
a String and the 'value' is a String (an XML document).
// Merge the daily
Assume the following, where updatePairRDD and deletePairRDD are both
hash-partitioned. Before the union, each of these has 512 partitions. The newly
created updateDeletePairRDD has 1024 partitions. Is this the general/expected
behavior for a union (the number of partitions doubles)?
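A small sketch of the behavior being asked about (updatePairRDD and deletePairRDD are the post's variables; the rest is illustrative):

    // union concatenates its inputs' partitions, so the counts add up; Spark
    // can only keep the original count when it detects an identical partitioner
    // on both sides, so 512 + 512 -> 1024 is the usual outcome.
    JavaPairRDD<String, String> updateDeletePairRDD = updatePairRDD.union(deletePairRDD);
    System.out.println(updateDeletePairRDD.partitions().size()); // 1024
    // to return to 512 partitions, repartition explicitly:
    JavaPairRDD<String, String> merged =
        updateDeletePairRDD.partitionBy(new HashPartitioner(512));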
I'm using spark 1.1 and the provided ec2 scripts to start my cluster
(r3.8xlarge machines). From the spark-shell, I can verify that the environment
variables are set
scala> System.getenv("SPARK_LOCAL_DIRS")
res0: String = /mnt/spark,/mnt2/spark
However, when I look on the workers, the directories
For one of my Spark jobs, my workers/executors are dying and leaving the
cluster.
On the master, I see something like the following in the log file. I'm
surprised to see the '60' seconds in the master log below because I explicitly
set it to '600' (or so I thought) in my Spark job (see
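Assuming the 60 seconds comes from spark.worker.timeout (whose default is 60), note that the standalone master reads it at startup, so setting it inside a job has no effect. A sketch of where it would go:

    # conf/spark-env.sh on the master node (restart the master afterwards)
    export SPARK_MASTER_OPTS="-Dspark.worker.timeout=600"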
Take a look at the O'Reilly Learning Spark (Early Release) book. I've found
this very useful.
Darin.
From: Saurabh Agrawal saurabh.agra...@markit.com
To: user@spark.apache.org user@spark.apache.org
Sent: Thursday, November 20, 2014 9:04 AM
Subject: Please help me get started on Apache
I've tried various ideas, but I'm really just shooting in the dark.
I have an 8-node cluster of r3.8xlarge machines. The RDD (with 1024 partitions)
I'm trying to save off to S3 is approximately 1TB in size (with the partitions
pretty evenly distributed in size).
I just tried a test to dial
From: Sven Krasser kras...@gmail.com
To: Darin McBeath ddmcbe...@yahoo.com
Cc: User user@spark.apache.org
Sent: Friday, January 23, 2015 5:12 PM
Subject: Re: Problems saving a large RDD (1 TB) to S3 as a sequence file
Hey Darin,
Are you running
I have the following code in a Spark Job.
// Get the baseline input file(s)
JavaPairRDD<Text, Text> hsfBaselinePairRDDReadable = sc.hadoopFile(baselineInputBucketFile,
    SequenceFileInputFormat.class, Text.class, Text.class);
JavaPairRDD<String, String> hsfBaselinePairRDD =
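The snippet cuts off at the conversion. Presumably it continues along these lines (the lambda body is an assumption, mirroring the Text-to-String mapping the names suggest):

    JavaPairRDD<String, String> hsfBaselinePairRDD = hsfBaselinePairRDDReadable
        .mapToPair(t -> new Tuple2<>(t._1.toString(), t._2.toString()));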
Thanks Imran. That's exactly what I needed to know.
Darin.
From: Imran Rashid iras...@cloudera.com
To: Darin McBeath ddmcbe...@yahoo.com
Cc: User user@spark.apache.org
Sent: Tuesday, February 17, 2015 8:35 PM
Subject: Re: How do you get the partitioner
In the following code, I read in a large sequence file from S3 (1TB) spread
across 1024 partitions. When I look at the job/stage summary, I see about
400GB of shuffle writes which seems to make sense as I'm doing a hash partition
on this file.
// Get the baseline input file
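Given the thread subject, a sketch of the distinction likely at play (variable names are illustrative):

    // partitionBy shuffles the full key space once (hence the ~400GB write):
    JavaPairRDD<String, String> partitioned =
        baselinePairRDD.partitionBy(new HashPartitioner(1024));
    // mapValues preserves the partitioner, so later joins on the same keys need
    // no further shuffle; a plain map() would discard the partitioner instead.
    JavaPairRDD<String, Integer> sizes = partitioned.mapValues(String::length);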
really want to do in the first place.
Thanks again for your insights.
Darin.
From: Imran Rashid iras...@cloudera.com
To: Darin McBeath ddmcbe...@yahoo.com
Cc: User user@spark.apache.org
Sent: Tuesday, February 17, 2015 3:29 PM
Subject: Re: MapValues and Shuffle
In an 'early release' of the Learning Spark book, there is the following
reference:
In Scala and Java, you can determine how an RDD is partitioned using its
partitioner property (or partitioner() method in Java)
However, I don't see the mentioned 'partitioner()' method in Spark 1.2 or a way
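The usual workaround in the Java API is to drop down to the underlying Scala RDD, which does expose the partitioner (a sketch; pairRDD is a hypothetical JavaPairRDD):

    import org.apache.spark.Partitioner;
    import scala.Option;

    Option<Partitioner> p = pairRDD.rdd().partitioner();
    if (p.isDefined()) {
        System.out.println("Partitioner: " + p.get().getClass().getName());
    }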
Consider the following left outer join
potentialDailyModificationsRDD = reducedDailyPairRDD
    .leftOuterJoin(baselinePairRDD)
    .partitionBy(new HashPartitioner(1024))
    .persist(StorageLevel.MEMORY_AND_DISK_SER());
Below are the record counts for the RDDs involved
Number of records for
I've downloaded spark 1.2.0 to my laptop. In the lib directory, it includes
spark-assembly-1.2.0-hadoop2.4.0.jar
When I spin up a cluster using the ec2 scripts with 1.2.0 (and set
--hadoop-major-version=2) I notice that in the lib directory for the
master/slaves the assembly is for
I am using repartitionAndSortWithinPartitions to partition my content and then
sort within each partition. I've also created a custom partitioner that I use
with repartitionAndSortWithinPartitions. I created a custom partitioner as my
key consists of something like 'groupid|timestamp' and I
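A minimal sketch of such a partitioner (the class name and the split on '|' are assumptions based on the description):

    import org.apache.spark.Partitioner;

    public class GroupIdPartitioner extends Partitioner {
        private final int numParts;

        public GroupIdPartitioner(int numParts) { this.numParts = numParts; }

        @Override
        public int numPartitions() { return numParts; }

        @Override
        public int getPartition(Object key) {
            // partition on the groupid prefix only, so every record for a group
            // lands in one partition while the full 'groupid|timestamp' key
            // still drives the sort within that partition
            String groupId = key.toString().split("\\|")[0];
            return (groupId.hashCode() & Integer.MAX_VALUE) % numParts;
        }
    }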
Thanks for your quick reply. Yes, that would be fine. I would rather wait/use
the optimal approach as opposed to hacking some one-off solution.
Darin.
From: Kostas Sakellis kos...@cloudera.com
To: Darin McBeath ddmcbe...@yahoo.com
Cc: User user
I have a fairly large Spark job where I'm essentially creating quite a few
RDDs, doing several types of joins using these RDDs, and ending up with a final
RDD which I write back to S3.
Along the way, I would like to capture record counts for some of these RDDs. My
initial approach was to use the count
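One common alternative to a count() per RDD is an accumulator bumped inside a transformation that runs anyway (a sketch; names are illustrative):

    Accumulator<Integer> baselineCount = sc.accumulator(0);
    JavaPairRDD<String, String> counted = baselinePairRDD.mapValues(v -> {
        baselineCount.add(1); // counted as a side effect of the existing pipeline
        return v;
    });
    // read baselineCount.value() only after an action has materialized 'counted';
    // beware that recomputation of the lineage can inflate the total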
Just to close the loop in case anyone runs into the same problem I had.
By setting --hadoop-major-version=2 when using the ec2 scripts, everything
worked fine.
Darin.
- Original Message -
From: Darin McBeath ddmcbe...@yahoo.com.INVALID
To: Mingyu Kim m...@palantir.com; Aaron Davidson
Aaron. Thanks for the class. Since I'm currently writing Java based Spark
applications, I tried converting your class to Java (it seemed pretty
straightforward).
I set up the use of the class as follows:
SparkConf conf = new SparkConf()
    .set("spark.hadoop.mapred.output.committer.class",
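The value is cut off above; given the thread, it presumably names the DirectOutputCommitter class under discussion, along these lines:

    SparkConf conf = new SparkConf()
        .set("spark.hadoop.mapred.output.committer.class",
             DirectOutputCommitter.class.getName());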
it and post a response.
- Original Message -
From: Mingyu Kim m...@palantir.com
To: Darin McBeath ddmcbe...@yahoo.com; Aaron Davidson ilike...@gmail.com
Cc: user@spark.apache.org user@spark.apache.org
Sent: Monday, February 23, 2015 3:06 PM
Subject: Re: Which OutputCommitter to use for S3?
Cool
I'm using Spark 1.2, stand-alone cluster on EC2. I have a cluster of 8
r3.8xlarge machines but limit the job to only 128 cores. I have also tried
other things, such as setting 4 workers per r3.8xlarge with 67GB each, but this
made no difference.
The job frequently fails at the end in this step
I decided to play around with DataFrames this morning but I'm running into
quite a few issues. I'm assuming that I must be doing something wrong so would
appreciate some advice.
First, I create my Data Frame.
import sqlContext.implicits._
case class Entity(InternalId: Long, EntityId: Long,
http://www.meetup.com/Cincinnati-Apache-Spark-Meetup/
Thanks.
Darin.
Another option might be to leverage spark-xml-utils
(https://github.com/dmcbeath/spark-xml-utils)
This is a collection of xml utilities that I've recently revamped that make it
relatively easy to use xpath, xslt, or xquery within the context of a Spark
application (or at least I think so). My
the new getInstance function and some more information on
the various features.
Darin.
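For anyone curious, usage looks roughly like this (class, package, and method names recalled from the README of that era; treat the exact signatures as approximate):

    import com.elsevier.spark_xml_utils.xpath.XPathProcessor;

    // build one processor per partition and reuse it across records
    XPathProcessor proc = XPathProcessor.getInstance("/doc[@type='article']"); // hypothetical expression
    boolean keep = proc.filterString(xmlString); // xmlString is a hypothetical record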
From: Darin McBeath <ddmcbe...@yahoo.com.INVALID>
To: "user@spark.apache.org" <user@spark.apache.org>
Sent: Tuesday, December 1, 2015 11:51 AM
Subject: Re: Turning off DTD validation
The problem isn't really with DTD validation (by default validation is
disabled). The underlying problem is that the DTD can't be found (which is
indicated in your stack trace below). The underlying parser will try and
retrieve the DTD (regardless of validation) because things such as
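Assuming a SAX parser underneath, the standard way to stop the DTD fetch entirely is the Xerces load-external-dtd feature (a sketch; setFeature declares checked exceptions to handle or rethrow):

    import javax.xml.parsers.SAXParserFactory;

    SAXParserFactory factory = SAXParserFactory.newInstance();
    // skip retrieval of the external DTD even though validation is off
    factory.setFeature(
        "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);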
I'm looking for some suggestions based on others' experiences.
I currently have a job that I need to run periodically where I need to read on
the order of 1+ million files from an S3 bucket. It is not the entire bucket
(nor does it match a pattern). Instead, I have a list of random keys that
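One pattern that fits this shape is to parallelize the key list and fetch per partition, one S3 client per partition (a sketch; the bucket name, key list, and partition count are assumptions):

    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.S3Object;
    import org.apache.commons.io.IOUtils;
    import java.util.ArrayList;
    import java.util.List;

    JavaRDD<String> keys = sc.parallelize(keyList, 256);
    JavaRDD<String> docs = keys.mapPartitions(it -> {
        AmazonS3Client s3 = new AmazonS3Client(); // one client per partition
        List<String> out = new ArrayList<>();
        while (it.hasNext()) {
            S3Object obj = s3.getObject("my-bucket", it.next());
            out.add(IOUtils.toString(obj.getObjectContent(), "UTF-8"));
        }
        return out; // Spark 1.x mapPartitions expects an Iterable from the lambda
    });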
I tried using Spark 1.6 in a stand-alone cluster this morning.
I submitted 2 jobs (and they both executed fine). In fact, they are the exact
same jobs with just some different parameters.
I was able to view the application history for the first job.
However, when I tried to view the second
From: Don Drake <dondr...@gmail.com>
To: Darin McBeath <ddmcbe...@yahoo.com>
Cc: User <user@spark.apache.org>
Sent: Wednesday, January 13, 2016 10:10 AM
Subject: Re: Spark 1.6 and Application History not working correctly
I noticed a similar problem going from 1.5.x to 1.6.0 on
I've run into a situation where it would appear that foreachPartition is only
running on one of my executors.
I have a small cluster (2 executors with 8 cores each).
When I run a job with a small file (with 16 partitions) I can see that the 16
partitions are initialized but they all appear to
dies.
Darin.
From: Jacek Laskowski <ja...@japila.pl>
To: Darin McBeath <ddmcbe...@yahoo.com>
Cc: user <user@spark.apache.org>
Sent: Friday, March 11, 2016 1:24 PM
Subject: Re: spark 1.6 foreachPartition only appears to be running on one
executor
repartition should work or if this is a bug. Thanks Jacek for starting to dig
into this.
Darin.
- Original Message -
From: Darin McBeath <ddmcbe...@yahoo.com.INVALID>
To: Jacek Laskowski <ja...@japila.pl>
Cc: user <user@spark.apache.org>
Sent: Friday, March 11, 2016 1
rviceInit call");
log.info("SimpleStorageServiceInit call arg1: "+ arg1);
log.info("SimpleStorageServiceInit call arg2:"+ arg2);
log.info("SimpleStorageServiceInit call arg3: "+ arg3);
SimpleStorageService.init(this.arg1, this.arg2, this.arg3);
}
}
I started playing around with Datasets on Spark 2.0 this morning and I'm
surprised by the significant performance difference I'm seeing between an RDD
and a Dataset for a very basic example.
I've defined a simple case class called AnnotationText that has a handful of
fields.
I create a
Another option would be to look at spark-xml-utils. We use this extensively in
the manipulation of our XML content.
https://github.com/elsevierlabs-os/spark-xml-utils
There are quite a few examples. Depending on your preference (and what you
want to do), you could use xpath, xquery, or
is that it
returns a string. So, you have to be a little creative when returning multiple
values (such as delimiting the values with a special character and then
splitting on this delimiter).
Darin.
From: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>
To: Darin McBeath <ddmcbe...@yahoo.com>
How do you find the partitioner for a Dataset?
I have a Dataset (om) which I created and repartitioned using one of the fields
(docId). Reading the documentation, I would assume the om Dataset should be
hash partitioned. But, how can I verify this?
When I do om.rdd.partitioner I get
I've been trying to understand the performance of Datasets (and filters) in
Spark 2.0.
I have a Dataset which I've read from a parquet file and cached into memory
(deser). This is spread across 8 partitions and consumes a total of 826MB of
memory on my cluster. I verified that the dataset
I have a Dataset (om) which I created and repartitioned (and cached) using
one of the fields (docId). Reading the Spark documentation, I would assume
the om Dataset should be hash partitioned. But, how can I verify this?
When I do om.rdd.partitioner I get
Option[org.apache.spark.Partitioner]
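A likely explanation: Dataset repartitioning happens inside the Catalyst plan rather than as an RDD Partitioner, so om.rdd.partitioner comes back empty (None); the physical plan shows the partitioning instead (a sketch):

    om.explain(); // or om.explain in the shell
    // look for 'Exchange hashpartitioning(docId#..., <n>)' in the printed plan;
    // that exchange is where the repartition on docId took effect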
I'm not familiar with EDI, but perhaps one option might be spark-xml-utils
(https://github.com/elsevierlabs-os/spark-xml-utils). You could transform the
XML to the XML format required by the xml-to-json function and then return the
json. Spark-xml-utils wraps the open source Saxon project