Downgrade to hadoop-*:3.3.x. Hadoop 3.4.x is based on the AWS SDK v2 and
should probably be considered breaking for tools built against < 3.4.0 while
using AWS.
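If you need to stay on the SDK v1 line, a rough sbt sketch of pinning the
Hadoop artifacts (3.3.6 is an assumed version; use whichever 3.3.x matches
your cluster):

// build.sbt: keep the AWS-SDK-v1-based Hadoop 3.3.x line
dependencyOverrides ++= Seq(
  "org.apache.hadoop" % "hadoop-client" % "3.3.6",
  "org.apache.hadoop" % "hadoop-aws"    % "3.3.6"
)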
From: Oxlade, Dan
Sent: Wednesday, April 3, 2024 2:41:11 PM
To: user@spark.apache.org
Subject:
Hi Mich,
Thanks a lot for the insight, it was very helpful.
Aaron
On Thu, 2023-01-05 at 23:44 +, Mich Talebzadeh wrote:
Hi Aaron,
Thanks for the details.
It is a general practice when running Spark on premises to use Hadoop
clusters.<https://spark.apache.org/faq.html#:~:text=How%20d
re running provide localization benefits when Spark reads
from HBase, or are localization benefits negligible and it's a better idea to
put Spark in a standalone cluster?
Thanks for your time,
Aaron
On Thu, 2023-01-05 at 19:00 +, Mich Talebzadeh wrote:
Few questions
* As I understand you al
a benefit when using
hbase-connectors
(https://github.com/apache/hbase-connectors/tree/master/spark)? Is there a
mechanism in the connector to "pass through" a short circuit read to Spark, or
would data always bounce from HDFS -> RegionServer -> Spark?
Thanks in advance,
Aaron
I'm assuming some things here, but hopefully I understand. So, basically
you have a big table of data distributed across a bunch of executors. And,
you want an efficient way to call a native method for each row.
It sounds similar to a dataframe writer to me. Except, instead of writing
to disk or
I want to calculate quantiles on two different columns. I know that I can
calculate them with two separate operations. However, for performance
reasons, I'd like to calculate both with one operation.
Is this possible to do this with the Dataset API? I'm assuming that it
isn't. But, if it isn't,
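For what it's worth, a sketch of one way to get both in a single pass on a
DataFrame df (assuming Spark 2.2+, where approxQuantile accepts multiple
columns; the column names are made up):

// One action computes approximate quantiles for both columns at once.
val quantiles: Array[Array[Double]] = df.stat.approxQuantile(
  Array("colA", "colB"),   // hypothetical column names
  Array(0.25, 0.5, 0.75),  // requested quantile probabilities
  0.01)                    // relative error tolerance
// quantiles(0) holds colA's quartiles, quantiles(1) holds colB's.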
That setting defines the total number of tasks that an executor can run in
parallel.
Each node is partitioned into executors, each with identical heap and
cores. So, it can be a balancing act to optimally set these values,
particularly if the goal is to maximize CPU usage with memory and other
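As a sketch, the knobs involved look like this (the numbers are illustrative,
not recommendations):

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.cores", "5")      // tasks one executor runs in parallel
  .set("spark.executor.memory", "16g")   // heap per executor
  .set("spark.executor.instances", "8")  // number of executors (YARN)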
Please unsubscribe
aaron.t.hoganc...@leidos.com from this
mailing list.
Thanks,
Aaron Hogancamp
Data Scientist
(615) 431-3229 (desk)
(615) 617-7160 (mobile)
' shows null. I can do
what I want to do by using an RDD, but I was hoping to avoid bypassing
Tungsten.
It almost feels like it's optimizing the field based on the join. But I
tested other fields as well and they also came back with values from base.
Very odd.
Any thoughts?
Aaron
Unsubscribe.
Thanks,
Aaron Hogancamp
Data Scientist
Your error stems from spark.ml, and in your pom mllib is the only dependency
that is 2.10. Is there a reason for this? I.e., you tell maven mllib 2.10 is
provided at runtime. Is 2.10 on the machine, or is 2.11?
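In sbt terms, the fix amounts to keeping one Scala suffix across every Spark
artifact, which %% handles automatically; a sketch with illustrative versions:

// build.sbt: %% appends the project's Scala binary version (e.g. _2.11)
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.0.0" % "provided"
)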
-Aaron
From: VG <vlin...@gmail.com>
Date: Friday, July 22, 2016 at 1:49 PM
To
What version of Spark/Scala are you running?
-Aaron
Hi,
I have a cluster with 15 nodes of which 5 are HDFS nodes. I kick off a job
that creates some 120 stages. Eventually, the active and pending stages
reduce down to a small bottleneck and it never fails... the tasks
associated with the 10 (or so) running tasks are always allocated to the
same
> then resume execution. This works, but it ends up costing me a lot of extra
> memory (i.e. a few TiB when I have a lot of executors).
>
> What I'd like to do is use the broadcast mechanism to load the data structure
> once, per node. But, I can't serialize the data structure from the driver.
>
> Any ideas?
>
> Thanks!
>
> Aaron
>
ase executor memory
to ~385 across all executors?
(Note: I'm running on Yarn, which may affect this.)
Thanks,
Aaron
On Wed, Jun 29, 2016 at 12:09 PM Sean Owen <so...@cloudera.com> wrote:
> If you have one executor per machine, which is the right default thing
> to do, and this is a singl
g me a
lot of extra memory (i.e. a few TiB when I have a lot of executors).
What I'd like to do is use the broadcast mechanism to load the data
structure once, per node. But, I can't serialize the data structure from
the driver.
Any ideas?
Thanks!
Aaron
elp,
AARON ILOVICI
Software Engineer
Marketing Engineering
WAYFAIR
4 Copley Place
Boston, MA 02116
(617) 532-6100 x1231
ailov...@wayfair.com
From: Reynold Xin <r...@databricks.com>
Date: Thursday, May 26, 2016 at 6:11
My environment: Spark 1.6, Vertica Driver 7.2.2, Java 1.7
I would be happy to create a Jira and submit a pull request with the
VerticaDialect once I figure this out.
Thank you for any insight on this,
AARON ILOVICI
Software Engineer
Marketing Engineering
I'm using the spark 1.6.1 (hadoop-2.6) and I'm trying to load a file that's
in s3. I've done this previously with spark 1.5 with no issue. Attempting
to load and count a single file as follows:
dataFrame = sqlContext.read.text('s3a://bucket/path-to-file.csv')
dataFrame.count()
But when it
I think the SparkListener is about as close as it gets. That way I can
start up the instance (aws, open-stack, vmware, etc) and simply wait until
the SparkListener indicates that the executors are online before starting.
Thanks for the advice.
Aaron
On Fri, Mar 25, 2016 at 10:54 AM, Jacek
specific case, I may be growing the cluster size by a hundred nodes and if
I fail to wait for that initialization to complete the job will not have
enough memory to run my jobs.
Aaron
On Thu, Mar 24, 2016 at 3:07 AM, Takeshi Yamamuro <linguin@gmail.com>
wrote:
> Hi,
>
> There i
the SQL path. The problem is the loss
of type and the possibility for SQL injection. No biggie, just means that
where parameterized queries are in-play, we'll have to write it out in-code
rather than in SQL.
Thanks,
Aaron
On Sun, Dec 27, 2015 at 8:06 PM, Michael Armbrust <mich...@databricks.
tname/ip in mesos configuration - see Nikolaos answer
>
Cheers,
Aaron
On Mon, Nov 16, 2015 at 9:37 PM, Jo Voordeckers
<jo.voordeck...@gmail.com> wrote:
> I've seen this issue when the mesos cluster couldn't figure out my IP
> address correctly, have you tried setting the ENV va
Greetings,
I am processing a "batch" of files and have structured an iterative process
around them. Each batch is processed by first loading the data with
spark-csv, performing some minor transformations and then writing back out
as parquet. Absolutely no caching or shuffle should occur with
1560n24847...@n3.nabble.com>>
Date: Monday, September 28, 2015 at 1:35 PM
To: Aaron Dossett <aaron.doss...@target.com>
Subject: Re: Python script runs fine in local mode, errors in other modes
Was there any eventual so
ConnectionManager has been deprecated and is no longer used by default
(NettyBlockTransferService is the replacement). Hopefully you would no
longer see these messages unless you have explicitly flipped it back on.
On Tue, Aug 4, 2015 at 6:14 PM, Jim Green openkbi...@gmail.com wrote:
And also
attempting to do the sqlContext.read.jdbc() assignment, not trying to
perform an operation on the RDD.
Cheers,
Aaron
Note that if you use multi-part upload, each part becomes 1 block, which
allows for multiple concurrent readers. One would typically use fixed-size
block sizes which align with Spark's default HDFS block size (64 MB, I
think) to ensure the reads are aligned.
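A sketch of the alignment, assuming the s3a connector (fs.s3a.multipart.size
takes bytes; the 64 MB value just mirrors the block size mentioned above):

// Make each uploaded part one 64 MB block so parts line up with read splits.
sc.hadoopConfiguration.set("fs.s3a.multipart.size", (64 * 1024 * 1024).toString)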
On Sat, Jul 11, 2015 at 11:14 AM,
Are you seeing this after the app has already been running for some time,
or just at the beginning? Generally, registration should only occur once
initially, and a timeout would be due to the master not being accessible.
Try telneting to the master IP/port from the machine on which the driver
will
I think 2.6 failed to abruptly close streams that weren't fully read, which
we observed as a huge performance hit. We had to backport the 2.7
improvements before being able to use it.
Yep! That was it. Using the
<parquet.version>1.6.0rc3</parquet.version>
that comes with spark, rather than using the 1.5.0-cdh5.4.2 version.
Thanks for the help!
Cheers,
Aaron
On Thu, Jun 25, 2015 at 8:24 AM, Sean Owen so...@cloudera.com wrote:
Hm that looks like a Parquet version
error?
You're compiling vs Hive 1.1 here and that is the problem. It is nothing
to do with CDH.
On Wed, Jun 24, 2015, 10:15 PM Aaron aarongm...@gmail.com wrote:
I was curious if any one was able to get CDH 5.4.1 or 5.4.2 compiling
with the v1.4.0 tag out of git? SparkSQL keeps dying on me
with CDH's JARs... just not sure
what. And doing a mvn -X didn't lead me anywhere... thoughts? Help? URLs
to read?
Thanks in advance.
Cheers,
Aaron
Be careful shoving arbitrary binary data into a string; invalid UTF
characters can cause significant computational overhead, in my experience.
On Jun 11, 2015 10:09 AM, Mark Tse mark@d2l.com wrote:
Makes sense – I suspect what you suggested should work.
However, I think the overhead
Note that speculation is off by default to avoid these kinds of unexpected
issues.
On Sat, Mar 28, 2015 at 6:21 AM, Steve Loughran ste...@hortonworks.com
wrote:
It's worth adding that there's no guarantee that re-evaluated work would
be on the same host as before, and in the case of node
Actually, this is the more relevant JIRA (which is resolved):
https://issues.apache.org/jira/browse/SPARK-3595
6352 is about saveAsParquetFile, which is not in use here.
Here is a DirectOutputCommitter implementation:
https://gist.github.com/aarondav/c513916e72101bbe14ec
and it can be
.
Darin.
- Original Message -
From: Darin McBeath ddmcbe...@yahoo.com.INVALID
To: Mingyu Kim m...@palantir.com; Aaron Davidson ilike...@gmail.com
Cc: user@spark.apache.org user@spark.apache.org
Sent: Monday, February 23, 2015 3:16 PM
Subject: Re: Which OutputCommitter to use for S3
could monitor each stage or task’s shuffle and GC status also
system status to identify the problem.
Thanks
Jerry
*From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com]
*Sent:* Thursday, March 5, 2015 2:32 PM
*To:* Aaron Davidson
*Cc:* user
*Subject:* Re: Having lots
Failed to connect implies that the executor at that host died, please
check its logs as well.
On Tue, Mar 3, 2015 at 11:03 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Sorry that I forgot the subject.
And in the driver, I got many FetchFailedException. The error messages are
15/03/03
)
Jianshi
On Wed, Mar 4, 2015 at 3:25 AM, Aaron Davidson ilike...@gmail.com wrote:
Failed to connect implies that the executor at that host died, please
check its logs as well.
On Tue, Mar 3, 2015 at 11:03 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Sorry that I forgot the subject
All stated symptoms are consistent with GC pressure (other nodes timeout
trying to connect because of a long stop-the-world), quite possibly due to
groupByKey. groupByKey is a very expensive operation as it may bring all
the data for a particular partition into memory (in particular, it cannot
Note that the parallelism (i.e., number of partitions) is just an upper
bound on how much of the work can be done in parallel. If you have 200
partitions, then you can divide the work among between 1 and 200 cores and
all resources will remain utilized. If you have more than 200 cores,
though,
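A sketch of the arithmetic, assuming an existing rdd (numbers illustrative):
with more cores than partitions, the surplus idles, so repartitioning upward
helps.

val totalCores = 400                       // hypothetical cluster capacity
val busy = rdd.repartition(2 * totalCores) // >= total cores keeps every core busy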
Here is the class: https://gist.github.com/aarondav/c513916e72101bbe14ec
You can use it by setting mapred.output.committer.class in the Hadoop
configuration (or spark.hadoop.mapred.output.committer.class in the Spark
configuration). Note that this only works for the old Hadoop APIs, I
believe the
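A sketch of wiring that in (the class name and package are assumed from the
gist above):

val conf = new org.apache.spark.SparkConf()
  .set("spark.hadoop.mapred.output.committer.class",
       "org.apache.spark.DirectOutputCommitter") // assumed FQCN; old Hadoop API only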
RangePartitioner does not actually provide a guarantee that all partitions
will be equal sized (that is hard), and instead uses sampling to
approximate equal buckets. Thus, it is possible that a bucket is left empty.
If you want the specified behavior, you should define your own partitioner.
It
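A minimal sketch of such a partitioner, assuming integer keys that can be
mapped to buckets directly:

import org.apache.spark.Partitioner

// Sends key k to bucket k / bucketWidth; no bucket is left empty as long as
// every bucket's key range actually occurs in the data.
class FixedRangePartitioner(partitions: Int, bucketWidth: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Int]
    math.min(k / bucketWidth, partitions - 1)
  }
}
// Usage: rdd.partitionBy(new FixedRangePartitioner(10, 100))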
I think Xuefeng Wu's suggestion is likely correct. This difference is more
likely explained by the compression library changing versions than sort vs
hash shuffle (which should not affect output size significantly). Others
have reported that switching to lz4 fixed their issue.
We should document
Did the problem go away when you switched to lz4? There was a change from
the default compression codec from 1.0 to 1.1, where we went from LZF to
Snappy. I don't think there was any such change from 1.1 to 1.2, though.
On Fri, Feb 6, 2015 at 12:17 AM, Praveen Garg praveen.g...@guavus.com
wrote:
The latter would be faster. With S3, you want to maximize number of
concurrent readers until you hit your network throughput limits.
On Wed, Feb 4, 2015 at 6:20 AM, Peter Rudenko petro.rude...@gmail.com
wrote:
Hi if i have a 10GB file on s3 and set 10 partitions, would it be
download whole
To be clear, there is no distinction between partitions and blocks for RDD
caching (each RDD partition corresponds to 1 cache block). The distinction
is important for shuffling, where by definition N partitions are shuffled
into M partitions, creating N*M intermediate blocks. Each of these blocks
Ah, this is in particular an issue due to sort-based shuffle (it was not
the case for hash-based shuffle, which would immediately serialize each
record rather than holding many in memory at once). The documentation
should be updated.
On Fri, Jan 30, 2015 at 11:27 AM, Sandy Ryza
, but the job sits there as the moving of files happens.
On Tue, Jan 27, 2015 at 7:24 PM, Aaron Davidson ilike...@gmail.com
wrote:
This renaming from _temporary to the final location is actually done by
executors, in parallel, for saveAsTextFile. It should be performed by each
task
This renaming from _temporary to the final location is actually done by
executors, in parallel, for saveAsTextFile. It should be performed by each
task individually before it returns.
I have seen an issue similar to what you mention dealing with Hive code
which did the renaming serially on the
It looks like something weird is going on with your object serialization,
perhaps a funny form of self-reference which is not detected by
ObjectOutputStream's typical loop avoidance. That, or you have some data
structure like a linked list with a parent pointer and you have many
thousand elements.
Please take a look at the executor logs (on both sides of the IOException)
to see if there are other exceptions (e.g., OOM) which precede this one.
Generally, the connections should not fail spontaneously.
On Sun, Jan 25, 2015 at 10:35 PM, octavian.ganea octavian.ga...@inf.ethz.ch
wrote:
Hi,
This was a regression caused by Netty Block Transfer Service. The fix for
this just barely missed the 1.2 release, and you can see the associated
JIRA here: https://issues.apache.org/jira/browse/SPARK-4837
Current master has the fix, and the Spark 1.2.1 release will have it
included. If you don't
Spark's network-common package depends on guava as a provided dependency
in order to avoid conflicting with other libraries (e.g., Hadoop) that
depend on specific versions. com/google/common/base/Preconditions has been
present in Guava since version 2, so this is likely a dependency not
found
Scala for-loops are implemented as closures using anonymous inner classes
which are instantiated once and invoked many times. This means, though,
that the code inside the loop is actually sitting inside a class, which
confuses Spark's Closure Cleaner, whose job is to remove unused references
from
What version are you running? I think spark.shuffle.use.netty was a valid
option only in Spark 1.1, where the Netty stuff was strictly experimental.
Spark 1.2 contains an officially supported and much more thoroughly tested
version under the property spark.shuffle.blockTransferService, which is
As Jerry said, this is not related to shuffle file consolidation.
The unique thing about this problem is that it's failing to find a file
while trying to _write_ to it, in append mode. The simplest explanation for
this would be that the file is deleted in between some check for existence
and
Do note that this problem may be fixed in Spark 1.2, as we changed the
default transfer service to use a Netty-based one rather than the
ConnectionManager.
On Thu, Jan 8, 2015 at 7:05 AM, Spidy yoni...@gmail.com wrote:
Hi,
Can you please explain which settings did you changed?
?
Maybe allow the Remoting Service to bind to the internal IP, but advertise
it differently?
On Mon, Jan 5, 2015 at 9:02 AM, Aaron aarongm...@gmail.com wrote:
Thanks for the link! However, from reviewing the thread, it appears you
cannot have a NAT/firewall between the cluster and the
spark
Found the issue in JIRA:
https://issues.apache.org/jira/browse/SPARK-4389?jql=project%20%3D%20SPARK%20AND%20text%20~%20NAT
On Tue, Jan 6, 2015 at 10:45 AM, Aaron aarongm...@gmail.com wrote:
From what I can tell, this isn't a firewall issue per se; it's how the
Remoting Service binds to an IP
,
Aaron
On Mon, Jan 5, 2015 at 8:28 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You can have a look at this discussion
http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-Unix-cluster-from-dev-environment-Windows-td16989.html
Thanks
Best Regards
On Mon, Jan 5
...but it still gets angry.
Any thoughts/suggestions?
Currently our workaround is to use a VPNC connection from inside the vagrant VMs
or Openstack instances...but, that doesn't seem like a long term plan.
Thanks in advance!
Cheers,
Aaron
The ContextCleaner uncaches RDDs that have gone out of scope on the driver.
So it's possible that the given RDD is no longer reachable in your
program's control flow, or else it'd be a bug in the ContextCleaner.
On Wed, Dec 10, 2014 at 5:34 PM, ankits ankitso...@gmail.com wrote:
I'm using spark
You can actually submit multiple jobs to a single SparkContext in different
threads. In the case you mentioned with 2 stages having a common parent,
both will wait for the parent stage to complete and then the two will
execute in parallel, sharing the cluster resources.
Solutions that submit
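A sketch of that pattern with Scala futures (rddA and rddB are assumed to
exist; the actions are illustrative):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Both jobs share one SparkContext; the common parent stage runs once,
// then the two child stages execute concurrently.
val a = Future { rddA.count() }
val b = Future { rddB.count() }
val (countA, countB) = (Await.result(a, Duration.Inf), Await.result(b, Duration.Inf))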
Because this was a maintenance release, we should not have introduced any
binary backwards or forwards incompatibilities. Therefore, applications
that were written and compiled against 1.1.0 should still work against a
1.1.1 cluster, and vice versa.
On Wed, Dec 3, 2014 at 1:30 PM, Andrew Or
be interested in the new s3a filesystem in Hadoop 2.6.0 [1].
1.
https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-10400
On Nov 26, 2014 12:24 PM, Aaron Davidson ilike...@gmail.com wrote:
Spark has a known problem where it will do a pass of metadata on a large
number of small
Spark has a known problem where it will do a pass of metadata on a large
number of small files serially, in order to find the partition information
prior to starting the job. This will probably not be repaired by switching
the FS impl.
However, you can change the FS being used like so (prior to
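A sketch, assuming the goal is the new s3a implementation mentioned above
(the class ships with hadoop-aws 2.6+; this must run before anything reads
the path, and the path itself is hypothetical):

sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
val lines = sc.textFile("s3a://some-bucket/some-prefix/")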
As Mohit said, making Main extend Serializable should fix this example. In
general, it's not a bad idea to mark the fields you don't want to serialize
(e.g., sc and conf in this case) as @transient as well, though this is not
the issue in this case.
Note that this problem would not have arisen in
In the situation you show, Spark will pipeline each filter together, and
will apply each filter one at a time to each row, effectively constructing
an statement. You would only see a performance difference if the
filter code itself is somewhat expensive, then you would want to only
execute it on
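Concretely, a sketch of the shape being discussed, assuming an RDD[Int]
(the predicates are made up):

// Both predicates run over each element in a single pipelined pass,
// equivalent to one filter over the conjunction of the two conditions.
val out = rdd.filter(_ > 10).filter(_ % 2 == 0)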
though in the times I mentioned - the list I gave
(3.1 min, 2 seconds, ... 8 min) were not different runs with different
cache %s, they were iterations within a single run with 100% caching.
-Nathan
On Thu, Nov 13, 2014 at 1:45 AM, Aaron Davidson ilike...@gmail.com
wrote
The fact that the caching percentage went down is highly suspicious. It
should generally not decrease unless other cached data took its place, or
unless executors were dying. Do you know if either of these were the
case?
On Tue, Nov 11, 2014 at 8:58 AM, Nathan Kronenfeld
down within a run, with the same instance.
I meant I'd run the whole app, and one time, it would cache 100%, and the
next run, it might cache only 83%
Within a run, it doesn't change.
On Wed, Nov 12, 2014 at 11:31 PM, Aaron Davidson ilike...@gmail.com
wrote:
The fact that the caching
This may be due in part to Scala allocating an anonymous inner class in
order to execute the for loop. I would expect if you change it to a while
loop like
var i = 0
while (i < 10) {
  sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
  i += 1
}
then the problem may go away. I am not
coalesce() is a streaming operation if used without the second parameter;
it does not put all the data in RAM. If used with the second parameter
(shuffle = true), then it performs a shuffle, but still does not put all
the data in RAM.
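A sketch of the two forms, assuming an existing rdd:

val narrowed   = rdd.coalesce(10)                  // merges partitions, no shuffle
val reshuffled = rdd.coalesce(10, shuffle = true)  // full shuffle, still streamed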
On Sat, Nov 1, 2014 at 12:09 PM, jan.zi...@centrum.cz wrote:
You may be running into this issue:
https://issues.apache.org/jira/browse/SPARK-4019
You could check by having 2000 or fewer reduce partitions.
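A sketch of the check, with an illustrative operation (the workaround is just
capping the shuffle partition count):

// SPARK-4019 bites above 2000 shuffle partitions; try staying at or below it.
val counts = pairs.reduceByKey(_ + _, 2000)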
On Wed, Oct 22, 2014 at 1:48 PM, DB Tsai dbt...@dbtsai.com wrote:
PS, sorry for spamming the mailing list. Based on my knowledge, both
Another wild guess, if your data is stored in S3, you might be running into
an issue where the default jets3t properties limits the number of parallel
S3 connections to 4. Consider increasing the max-thread-counts from here:
http://www.jets3t.org/toolkit/configuration.html.
On Tue, Oct 21, 2014
The minPartitions argument of textFile/hadoopFile cannot decrease the
number of splits past the physical number of blocks/files. So if you have 3
HDFS blocks, asking for 2 minPartitions will still give you 3 partitions
(hence the min). It can, however, convert a file with fewer HDFS blocks
into
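A sketch with a hypothetical 3-block file:

// With 3 HDFS blocks, minPartitions = 2 still yields 3 partitions;
// minPartitions = 12 would split the same blocks further.
val lines = sc.textFile("hdfs:///data/file.txt", minPartitions = 2)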
More of a Scala question than Spark, but apply here can be written with
just parentheses like this:
val array = Array.fill[Byte](10)(0)
if (array(index) == 0) {
  array(index) = 1
}
The second expression, array(index) = 1, is actually not calling apply,
but update. It's a Scala-ism that's usually
Are you doing this from the spark-shell? You're probably running into
https://issues.apache.org/jira/browse/SPARK-1199 which should be fixed in
1.1.
On Sat, Sep 6, 2014 at 3:03 AM, Dhimant dhimant84.jays...@gmail.com wrote:
I am using Spark version 1.0.2
Looks like that's BlockManagerWorker.syncPutBlock(), which is in an if
check, perhaps obscuring its existence.
On Fri, Sep 5, 2014 at 2:19 AM, rapelly kartheek kartheek.m...@gmail.com
wrote:
Hi,
var cachedPeers: Seq[BlockManagerId] = null
private def replicate(blockId: String, data:
Pretty easy to do in Scala:
rdd.elementClassTag.runtimeClass
You can access this method from Python as well by using the internal _jrdd.
It would look something like this (warning, I have not tested it):
rdd._jrdd.classTag().runtimeClass()
(The method name is classTag for JavaRDDLike, and
If someone doesn't have the access to do that, is there any easy way to
specify a different properties file to be used?
Patrick Wendell wrote
If you want to customize the logging behavior - the simplest way is to
copy
conf/log4j.properties.template to conf/log4j.properties. Then you can go
and
This is likely due to a bug in shuffle file consolidation (which you have
enabled) which was hopefully fixed in 1.1 with this patch:
https://github.com/apache/spark/commit/78f2af582286b81e6dc9fa9d455ed2b369d933bd
Until 1.0.3 or 1.1 are released, the simplest solution is to disable
I just
now have access to a Hadoop cluster with Spark installed, so hopefully I'm
running into some simple issues that I never had to deal with when testing
in my own sandbox in purely local mode before. Any help would be
appreciated, thanks! -Aaron
Sure thing, this is the stacktrace from pyspark. It's repeated a few times,
but I think this is the unique stuff.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File
"/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/spark/python/pyspark/rdd.py",
line 583, in collect
These three lines of python code cause the error for me:
sc = SparkContext(appName="foo")
input = sc.textFile("hdfs://[valid hdfs path]")
mappedToLines = input.map(lambda myline: myline.split(","))
The file I'm loading is a simple CSV.
The driver must initially compute the partitions and their preferred
locations for each part of the file, which results in a serial
getFileBlockLocations() on each part. However, I would expect this to take
several seconds, not minutes, to perform on 1000 parts. Is your driver
inside or outside of
Yes, good point, I believe the masterLock is now unnecessary altogether.
The reason for its initial existence was that changeMaster() originally
could be called out-of-band of the actor, and so we needed to make sure the
master reference did not change out from under us. Now it appears that all
rdd.toLocalIterator will do almost what you want, but requires that each
individual partition fits in memory (rather than each individual line).
Hopefully that's sufficient, though.
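A sketch of the Scala usage, assuming an existing rdd:

// Streams one partition at a time to the driver, unlike collect()'s all-at-once.
rdd.toLocalIterator.foreach(println)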
On Fri, Aug 1, 2014 at 1:38 AM, Andrei faithlessfri...@gmail.com wrote:
Is there a way to get iterator from RDD?
:
Thanks, Aaron, it should be fine with partitions (I can repartition it
anyway, right?).
But rdd.toLocalIterator is purely Java/Scala method. Is there Python
interface to it?
I can get Java iterator though rdd._jrdd, but it isn't converted to Python
iterator automatically. E.g.:
rdd
I see. There should not be a significant algorithmic difference between
those two cases, as far as I can think, but there is a good bit of
local-mode-only logic in Spark.
One typical problem we see on large-heap, many-core JVMs, though, is much
more time spent in garbage collection. I'm not sure
information here about executors but is
ambiguous about whether there are single executors or multiple executors
on
each machine.
This message from Aaron Davidson implies that the executor memory
should be set to total available memory on the machine divided by the
number of cores:
*http
In particular, take a look at the TorrentBroadcast, which should be much
more efficient than HttpBroadcast (which was the default in 1.0) for large
files.
If you find that TorrentBroadcast doesn't work for you, then another way to
solve this problem is to place the data on all nodes' local disks,
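A sketch of opting in explicitly on the 1.x line (this factory class later
became the default):

val conf = new org.apache.spark.SparkConf()
  .set("spark.broadcast.factory", "org.apache.spark.broadcast.TorrentBroadcastFactory")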
What's the exception you're seeing? Is it an OOM?
On Mon, Jul 21, 2014 at 11:20 AM, chutium teng@gmail.com wrote:
Hi,
unfortunately it is not so straightforward
xxx_parquet.db
is a folder of managed database created by hive/impala, so, every sub
element in it is a table in
Hm, this is not a public API, but you should theoretically be able to use
TestBlockId if you like. Internally, we just use the BlockId's natural
hashing and equality to do lookups and puts, so it should work fine.
However, since it is in no way public API, it may change even in
maintenance
for running union? Could that create larger task sizes?
Kyle
On Sat, Jul 12, 2014 at 7:50 PM, Aaron Davidson ilike...@gmail.com
wrote:
I also did a quick glance through the code and couldn't find anything
worrying that should be included in the task closures. The only possibly
unsanitary part
at 9:13 AM, Guanhua Yan gh...@lanl.gov wrote:
Thanks, Aaron. I replaced groupByKey with reduceByKey along with some list
concatenation operations, and found that the performance becomes even
worse. So groupByKey is not that bad in my code.
Best regards,
- Guanhua
From: Aaron Davidson ilike
Yes, groupByKey() does partition by the hash of the key unless you specify
a custom Partitioner.
(1) If you were to use groupByKey() when the data was already partitioned
correctly, the data would indeed not be shuffled. Here is the associated
code, you'll see that it simply checks that the
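A sketch of the pre-partitioned case, assuming a pair RDD named pairs (the
partition count is illustrative):

import org.apache.spark.HashPartitioner

val partitioned = pairs.partitionBy(new HashPartitioner(64)).cache()
// Same partitioner, so groupByKey reuses the layout and shuffles nothing.
val grouped = partitioned.groupByKey(new HashPartitioner(64))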
The "netlib.BLAS: Failed to load implementation" warning only means that
the BLAS implementation may be slower than using a native one. The reason
why it only shows up at the end is that the library is only used for the
finalization step of the KMeans algorithm, so your job should've been
wrapping