Hi,
After a Spark program completes, three temporary directories remain in
the temp directory.
The file names are like this: spark-2e389487-40cc-4a82-a5c7-353c0feefbb7
And when the Spark program runs on Windows, a snappy DLL file also remains in
the temp directory.
The file name is like
On Thu, May 7, 2015 at 6:19 AM, Todd Nist tsind...@gmail.com wrote:
Have you tried to set the following?
spark.worker.cleanup.enabled=true
spark.worker.cleanup.appDataTtl=<seconds>
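For reference, these are properties of the Spark standalone worker; a sketch of how they might be placed in conf/spark-defaults.conf (the spark.worker.cleanup.interval line is my addition from the standalone docs, not from this thread):

```
spark.worker.cleanup.enabled     true
# Sweep interval in seconds (how often old application dirs are checked)
spark.worker.cleanup.interval    1800
# TTL in seconds for per-application work directories
spark.worker.cleanup.appDataTtl  604800
```

Note that these clean the worker's work/ directory in standalone mode, so they may not remove the spark-* directories left in the system temp directory described above.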
On Thu, May 7, 2015 at 2:39 AM, Taeyun Kim taeyun@innowireless.com wrote:
Hi,
After a spark program
the
spark.yarn.scheduler.heartbeat.interval-ms.
I hope that the additional overhead it incurs would be negligible.
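For anyone wanting to experiment, the setting from this thread could be passed at submit time like so (a sketch; the class name and jar are placeholders, and the property name is taken as-is from the discussion):

```
spark-submit \
  --master yarn-client \
  --conf spark.yarn.scheduler.heartbeat.interval-ms=1000 \
  --class com.example.MyApp myapp.jar
```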
From: Zoltán Zvara [mailto:zoltan.zv...@gmail.com]
Sent: Thursday, May 07, 2015 10:05 PM
To: Taeyun Kim; user@spark.apache.org
Subject: Re: YARN mode startup takes too long (10+ secs)
Without
Hi,
I'm running a Spark application in yarn-client or yarn-cluster mode.
But it seems to take too long to start up.
It takes 10+ seconds to initialize the Spark context.
Is this normal? Or can it be optimized?
The environment is as follows:
- Hadoop: Hortonworks HDP 2.2 (Hadoop 2.6)
Hi,
I used CombineTextInputFormat to read many small files.
The Java code is as follows (I've written it as a utility function):
public static JavaRDD<String> combineTextFile(JavaSparkContext sc,
String path, long maxSplitSize, boolean recursive)
{
Configuration conf =
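The snippet above is cut off; a complete helper along those lines might look like this (a sketch only, assuming the new-API org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat and standard Hadoop property names, not necessarily the poster's original code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public static JavaRDD<String> combineTextFile(JavaSparkContext sc,
        String path, long maxSplitSize, boolean recursive)
{
    Configuration conf = new Configuration();
    // Upper bound on the byte size of each combined split
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize", maxSplitSize);
    if (recursive) {
        conf.setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", true);
    }
    // Many small files are packed into a few large splits, one partition each
    return sc.newAPIHadoopFile(path, CombineTextInputFormat.class,
            LongWritable.class, Text.class, conf)
        .map(pair -> pair._2().toString());
}
```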
spark.executor.extraClassPath is especially useful when the output is
written to HBase, since the data nodes on the cluster have HBase library
jars.
-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Friday, February 27, 2015 5:22 PM
To: Kannan Rajah
Cc: Marcelo
Hi,
When my Spark program calls JavaSparkContext.stop(), the following errors
occur.
14/12/11 16:24:19 INFO Main: sc.stop {
14/12/11 16:24:20 ERROR ConnectionManager: Corresponding
SendingConnection to ConnectionManagerId(cluster02,38918) not found
(Sorry if this mail is a duplicate, but it seems that my previous mail could
not reach the mailing list.)
Hi,
I'm trying to open the Spark source code with IntelliJ IDEA.
I opened pom.xml on the Spark source code root directory.
Project tree is displayed in the Project tool window.
But when I open a source file, say
org.apache.spark.deploy.yarn.ClientBase.scala, a lot of red marks show on
the
Hi,
Some information about the error:
On File | Project Structure window, the following error message is displayed
with pink background:
Library 'Maven: org.scala-lang:scala-compiler-bundle:2.10.4' is not used
Can it be a hint?
From: Taeyun Kim [mailto:taeyun@innowireless.com
), but it’s Ok.
From: Sandy Ryza [mailto:sandy.r...@cloudera.com]
Sent: Thursday, November 20, 2014 2:44 PM
To: innowireless TaeYun Kim
Cc: user
Subject: Re: How to view log on yarn-client mode?
While the app is running, you can find logs from the YARN web UI by navigating
to containers
Hi,
How can I view logs in yarn-client mode?
When I insert the following line in the mapToPair function, for example:
System.out.println("TEST TEST");
In local mode, it is displayed on the console.
But in yarn-client mode, it is nowhere to be found.
When I use the YARN resource manager web UI, the size
Hi,
I'm confused by saveAsNewAPIHadoopFile and saveAsNewAPIHadoopDataset.
What's the difference between the two?
What are the individual use cases of the two APIs?
Could you briefly describe the internal flows of the two APIs?
I've used Spark several months, but I have no experience on
Hi,
Is there a way to bulk-load to HBase from RDD?
HBase offers HFileOutputFormat class for bulk loading by MapReduce job, but
I cannot figure out how to use it with saveAsHadoopDataset.
Thanks.
right?
If so, is there another method to bulk-load to HBase from RDD?
Thanks.
From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr]
Sent: Friday, September 19, 2014 7:17 PM
To: user@spark.apache.org
Subject: Bulk-load to HBase
Hi,
Is there a way to bulk-load to HBase
Hi,
Sorry, I just found saveAsNewAPIHadoopDataset.
Then, can I use HFileOutputFormat with saveAsNewAPIHadoopDataset? Is there
any example code for that?
Thanks.
From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr]
Sent: Friday, September 19, 2014 8:18 PM
To: user
it bypasses the write
path.
Thanks.
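For the HFile route discussed above, a rough sketch (assuming the HBase 0.98-era HFileOutputFormat API; the RDD must be a pair RDD of (ImmutableBytesWritable, KeyValue) sorted by row key, and hbaseConf, outputDir, table, and sortedPairRdd are placeholders):

```java
// Configure an incremental-load job against the target table.
Job job = Job.getInstance(hbaseConf);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(KeyValue.class);
FileOutputFormat.setOutputPath(job, new Path(outputDir));
HFileOutputFormat.configureIncrementalLoad(job, table); // sorts/partitions per region

// Write HFiles directly, bypassing the normal write path.
sortedPairRdd.saveAsNewAPIHadoopDataset(job.getConfiguration());

// Finally, hand the HFiles to HBase:
//   hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <outputDir> <tableName>
```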
From: Aniket Bhatnagar [mailto:aniket.bhatna...@gmail.com]
Sent: Friday, September 19, 2014 9:01 PM
To: innowireless TaeYun Kim
Cc: user
Subject: Re: Bulk-load to HBase
I have been using saveAsNewAPIHadoopDataset but I use TableOutputFormat instead
From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr]
Sent: Friday, September 19, 2014 9:20 PM
To: user@spark.apache.org
Subject: RE: Bulk-load to HBase
Thank you for the example code.
Currently I use foreachPartition() + Put(), but your example code can be used
to clean up my code
Hi,
On Spark Configuration document, spark.executor.extraClassPath is regarded
as a backwards-compatibility option. It also says that users typically
should not need to set this option.
Now, I must add a classpath to the executor environment (as well as to the
driver in the future, but for
Hi,
I'm trying to split one large multi-field text file into many single-field
text files.
My code is like this: (somewhat simplified)
final Broadcast<ColSchema> bcSchema = sc.broadcast(schema);
final String outputPathName = env.outputPathName;
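The code is cut off above; the general shape of such a per-column split (a sketch with hypothetical names like lines and numColumns(), not the poster's actual code) might be:

```java
// For each column in the broadcast schema, project that field and write it out.
for (int i = 0; i < bcSchema.value().numColumns(); i++) {
    final int col = i;  // effectively-final copy for the lambda
    lines.map(line -> line.split(",")[col])
         .saveAsTextFile(outputPathName + "/col" + col);
}
```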
this was already fixed last week in SPARK-2414:
https://github.com/apache/spark/commit/7c23c0dc3ed721c95690fc49f435d9de6952523c
On Fri, Jul 25, 2014 at 1:34 PM, innowireless TaeYun Kim
taeyun@innowireless.co.kr wrote:
Hi,
I'm using Spark 1.0.0.
On filter() - map() - coalesce() - saveAsText() sequence
Thanks.
Really. Now that I compare the stage data of the two jobs, ‘core7-exec3’ spends about
12.5 minutes more than ‘core2-exec12’ on GC.
From: Nishkam Ravi [mailto:nr...@cloudera.com]
Sent: Wednesday, July 16, 2014 5:28 PM
To: user@spark.apache.org
Subject: Re: executor-cores vs.
Hi,
When running in yarn-client mode, the following options can be specified:
- --executor-cores
- --num-executors
If we have the following machines:
- 3 data nodes
- 8 cores on each node
Which is better?
1. --executor-cores 7 --num-executors 3 (more cores for each executor,
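Just to make the arithmetic behind the two options concrete, here is a tiny helper (plain Java, nothing Spark-specific; the convention of reserving one core per node for OS/Hadoop daemons is an assumption, not from the thread):

```java
public class ExecutorMath {
    // Total cores actually used across the cluster for a given layout.
    static int totalCores(int numExecutors, int coresPerExecutor) {
        return numExecutors * coresPerExecutor;
    }

    // Cores left per node after reserving one for OS/daemon overhead.
    static int usableCoresPerNode(int coresPerNode) {
        return coresPerNode - 1;
    }

    public static void main(String[] args) {
        // 3 data nodes, 8 cores each: 24 physical cores, 21 usable under the convention.
        System.out.println(totalCores(3, 7));   // 3 fat executors -> 21
        System.out.println(totalCores(12, 2));  // 12 thin executors -> 24
    }
}
```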
Thank you for your response.
Maybe that applies to my case.
In my test case, the types of almost all of the data are either primitive
types, Joda DateTime, or String.
But I'm somewhat disappointed with the speed.
At least it should not be slower than the default Java serializer...
-Original
Hi,
For my test case, using the Kryo serializer does not help.
It is slower than the default Java serializer, and the size saving is minimal.
I've registered almost all classes with the Kryo registrator.
What is happening in my test case?
Has anyone experienced a case like this?
Hi,
I'm trying to understand the relationship between the number of cores and the
number of executors when running a Spark job on YARN.
The test environment is as follows:
- # of data nodes: 3
- Data node machine spec:
- CPU: Core i7-4790 (# of cores: 4, # of threads: 8)
- RAM: 32GB (8GB
For your information, I've attached the Ganglia monitoring screen capture on
the Stack Overflow question.
Please see:
http://stackoverflow.com/questions/24622108/apache-spark-the-number-of-cores
-vs-the-number-of-executors
From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr
Hi,
I need help with implementation best practices.
The operating environment is as follows:
- Log data files arrive irregularly.
- The size of a log data file ranges from 3.9KB to 8.5MB. The average is about
1MB.
- The number of records in a data file ranges from 13 lines to 22000
Hi,
When running a Spark job, the following warning message is displayed and the job
seems to be no longer progressing.
(The detailed log messages are at the bottom of this message.)
---
14/07/02 17:00:14 WARN AbstractNioSelector: Unexpected exception in the
selector loop.
.
-Original Message-
From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr]
Sent: Wednesday, July 02, 2014 5:58 PM
To: user@spark.apache.org
Subject: Help: WARN AbstractNioSelector: Unexpected exception in the
selector loop. java.lang.OutOfMemoryError: Java heap space
Hi
of the driver program constantly grows?
-Original Message-
From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr]
Sent: Wednesday, July 02, 2014 6:05 PM
To: user@spark.apache.org
Subject: RE: Help: WARN AbstractNioSelector: Unexpected exception in the
selector loop
Hi,
Maybe this is a newbie question: how do I read a snappy-compressed text file?
The OS is Windows 7.
Currently, I've done the following steps:
1. Built Hadoop 2.4.0 with snappy option.
'hadoop checknative' command displays the following line:
snappy: true
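In case it helps: once 'hadoop checknative' reports snappy: true, sc.textFile() on a *.snappy file should decompress transparently via the Hadoop codec, as long as the native DLLs are visible to the JVM. A sketch for Windows (paths are placeholders; on Windows the native libraries normally sit in %HADOOP_HOME%\bin):

```
set HADOOP_HOME=C:\hadoop-2.4.0
spark-shell --driver-library-path %HADOOP_HOME%\bin
```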
Hi,
How do I use SequenceFileRDDFunctions.saveAsSequenceFile() in Java?
A simple example will be a great help.
Thanks in advance.
Kim
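Not the thread's answer, but one common workaround: SequenceFileRDDFunctions is Scala-oriented, and from Java a pair RDD can be written as a sequence file via saveAsNewAPIHadoopFile instead (a sketch; the key/value types and output path are examples):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// pairRdd: JavaPairRDD<Text, IntWritable>
pairRdd.saveAsNewAPIHadoopFile("/out/seq", Text.class, IntWritable.class,
        SequenceFileOutputFormat.class);
```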
, unpersist, materialization
FYI: Here is a related discussion
http://apache-spark-user-list.1001560.n3.nabble.com/Persist-and-unpersist-td6437.html
about this.
On Thu, Jun 12, 2014 at 8:10 PM, innowireless TaeYun Kim
taeyun@innowireless.co.kr wrote:
Maybe it would be nice if unpersist
(I've clarified the statement (1) of my previous mail. See below.)
From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr]
Sent: Friday, June 13, 2014 10:05 AM
To: user@spark.apache.org
Subject: RE: Question about RDD cache, unpersist, materialization
Currently I use
Hi,
What I seem to know about the RDD persisting API is as follows:
- cache() and persist() are not actions. They only do a marking.
- unpersist() is also not an action. It only removes a marking. But if the
rdd is already in memory, it is unloaded.
And there seems to be no API to forcefully
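The marking-only behavior described above means materialization has to be forced indirectly; a common idiom (a sketch, assuming the Java API) is to follow cache() with an action:

```java
rdd.cache();      // only marks the RDD as to-be-cached
rdd.count();      // an action: actually computes and populates the cache
// ... reuse rdd here; it is now served from memory ...
rdd.unpersist();  // removes the marking and unloads the cached blocks
```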
BTW, it is possible that rdd.first() does not compute the whole partitions.
So, first() cannot be used for the situation below.
-Original Message-
From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr]
Sent: Wednesday, June 11, 2014 11:40 AM
To: user@spark.apache.org
Without (C), what is the best practice to implement the following scenario?
1. rdd = sc.textFile(FileA)
2. rdd = rdd.map(...) // actually modifying the rdd
3. rdd.saveAsTextFile(FileA)
Since the rdd transformation is 'lazy', rdd will not materialize until
saveAsTextFile(), so FileA must still
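One way to handle the scenario above (a sketch, not an answer from the thread): materialize to a temporary path first, then swap it in with Hadoop's FileSystem API. Here transform() is a hypothetical helper standing in for the map logic:

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

JavaRDD<String> rdd = sc.textFile("FileA").map(line -> transform(line));
rdd.saveAsTextFile("FileA_tmp");      // materializes while FileA still exists

FileSystem fs = FileSystem.get(sc.hadoopConfiguration());
fs.delete(new Path("FileA"), true);   // input no longer needed
fs.rename(new Path("FileA_tmp"), new Path("FileA"));
```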
Hi,
How can I dispose of an Accumulator?
It has no method like 'unpersist()' which Broadcast provides.
Thanks.
I'm trying to run spark-shell on Hadoop YARN.
Specifically, the environment is as follows:
- Client
- OS: Windows 7
- Spark version: 1.0.0-SNAPSHOT (git cloned 2014.5.8)
- Server
- Platform: hortonworks sandbox 2.1
I modified the spark code to apply