[
https://issues.apache.org/jira/browse/SPARK-25171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Haopu Wang updated SPARK-25171:
---
Comment: was deleted
(was: [~jerryshao], do you have any)
> After restart, StreamingContext is replaying the last successful micro-batch right before the stop
[
https://issues.apache.org/jira/browse/SPARK-25171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589803#comment-16589803
]
Haopu Wang commented on SPARK-25171:
[~jerryshao], do you have any
> After restart, StreamingContext is replaying the last successful micro-batch right before the stop
Haopu Wang created SPARK-25171:
--
Summary: After restart, StreamingContext is replaying the last
successful micro-batch right before the stop
Key: SPARK-25171
URL: https://issues.apache.org/jira/browse/SPARK-25171
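For context, below is a minimal sketch of the restart path this issue is about; the checkpoint directory, application name and batch interval are placeholders, not taken from the ticket, and it assumes a DStream-based application.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-app")  // hypothetical app name
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs:///tmp/app-checkpoint")               // hypothetical checkpoint dir
  // define the DStreams and output operations here
  ssc
}

// On restart, getOrCreate rebuilds the context from the checkpoint instead of calling createContext(),
// which is where the replay of the last completed micro-batch is observed.
val ssc = StreamingContext.getOrCreate("hdfs:///tmp/app-checkpoint", createContext _)
ssc.start()
ssc.awaitTermination()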
[
https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390777#comment-16390777
]
Haopu Wang commented on SPARK-17498:
After this fix, the reverse transformation "IndexToString"
Regards,
Haopu
Haopu Wang created SPARK-21396:
--
Summary: Spark Hive Thriftserver doesn't return UDT field
Key: SPARK-21396
URL: https://issues.apache.org/jira/browse/SPARK-21396
Project: Spark
Issue Type
On 11/9/16 3:11 PM, Haopu Wang wrote:
> Hi, do you mean that the new matched topics should be consumed
> after the regex subscription has been established? Thanks!
>
> -Original Message- From: Guozhang Wang
> [mailto:wangg...@gmail.com] Sent: November 10, 2016 3:41 To:
> u
Hi, do you mean that the new matched topics should be consumed after the regex
subscription has been established? Thanks!
-Original Message-
From: Guozhang Wang [mailto:wangg...@gmail.com]
Sent: November 10, 2016 3:41
To: users@kafka.apache.org
Subject: Re: Is it possible to resubcribe
I'm using Kafka direct stream (auto.offset.reset = earliest) and enable
Spark streaming's checkpoint.
The application starts and consumes messages correctly. Then I stop the
application and clean the checkpoint folder.
I restart the application and expect it to consume old messages. But
It turns out to be a bug in application code. Thank you!
From: Haopu Wang
Sent: November 4, 2016 17:23
To: user@spark.apache.org; Cody Koeninger
Subject: InvalidClassException when load KafkaDirectStream from checkpoint
(Spark 2.0.0)
When I load spark
Cody, thanks for the response. Do you think it's a Spark issue or a Kafka issue?
Can you please let me know the JIRA ticket number?
-Original Message-
From: Cody Koeninger [mailto:c...@koeninger.org]
Sent: November 4, 2016 22:35
To: Haopu Wang
Cc: user@spark.apache.org
Subject: Re: expected
When I load spark checkpoint, I get below error. Do you have any idea?
Much thanks!
*
2016-11-04 17:12:19,582 INFO
[org.apache.spark.streaming.CheckpointReader] (main;) Checkpoint files
found:
I'm using Kafka010 integration API to create a DStream using
SubscribePattern ConsumerStrategy.
The specified topic doesn't exist when I start the application.
Then I create the topic and publish some test messages. I can see them
in the console subscriber.
But the spark application doesn't
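For reference, a minimal sketch of a pattern-based subscription with the 0-10 integration is below; it assumes a StreamingContext named ssc already exists, and the broker address, group id and topic pattern are placeholders.

import java.util.regex.Pattern
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",
  "auto.offset.reset" -> "earliest")

// Topics matching the pattern are subscribed; whether topics created later are picked up
// depends on the consumer's metadata refresh (metadata.max.age.ms).
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("topic.*"), kafkaParams))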
pic2 0 2 after polling for 512
==
-Original Message-
From: Cody Koeninger [mailto:c...@koeninger.org]
Sent: October 13, 2016 9:31
To: Haopu Wang
Cc: user@spark.apache.org
Subject: Re: Kafka integration: get existing Kafka messages?
Look at the presentation and blog post linked from
https://github.co
Cody, thanks for the response.
So Kafka direct stream actually has consumer on both the driver and executor?
Can you please provide more details? Thank you very much!
From: Cody Koeninger [mailto:c...@koeninger.org]
Sent: October 12, 2016 20:10
To: Haopu Wang
Hi,
I want to read the existing Kafka messages and then subscribe new stream
messages.
But I find "auto.offset.reset" property is always set to "none" in
KafkaUtils. Does that mean I cannot specify "earliest" property value
when create direct stream?
Thank you!
Hi, in the official integration guide, it says "Spark Streaming 2.0.0 is
compatible with Kafka 0.8.2.1."
However, in maven repository, I can get
"spark-streaming-kafka-0-10_2.11" which depends on Kafka 0.10.0.0
Is this artifact stable enough? Thank you!
I see below stack trace when trying to run beeline command. I'm using
JDK 7.
Anything wrong? Much thanks!
==
D:\spark\download\spark-1.6.1-bin-hadoop2.4>bin\beeline
Beeline version 1.6.1 by Apache Hive
Exception in thread "main" java.lang.NoSuchMethodError:
Pozdrawiam,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
On Thu, Jun 16, 2016 at 11:36 AM, Haopu Wang <hw...@qilinsoft.com> wrote:
> Hi,
>
>
>
>
Hi,
Suppose I have a Spark application which is doing several ETL-type
things.
I understand Spark can analyze and generate several jobs to execute.
The question is: is it possible to control the dependencies between these
jobs?
Thanks!
Can someone look at my questions? Thanks again!
From: Haopu Wang
Sent: June 12, 2016 16:40
To: user@spark.apache.org
Subject: Should I avoid "state" in a Spark application?
I have a Spark application whose structure is below:
var ts:
I have a Spark application whose structure is below:
var ts: Long = 0L
dstream1.foreachRDD { (x, time) =>
  ts = time            // update the shared state from inside the output operation
  x.do_something()...  // placeholder for the per-batch processing
}
..
process_data(dstream2, ts, ..)
I assume foreachRDD function call can
Hi,
I have a Spark 1.5.2 standalone cluster running.
I want to get all of the running applications and Cores/Memory in use.
Besides the Master UI, are there any other ways to do that?
I tried to send an HTTP request using a URL like this:
"http://node1:6066/v1/applications"
The
My question is not directly related: about the exactly-once semantics,
the document (copied below) says Spark Streaming gives exactly-once
semantics, but from my test results, with checkpointing enabled,
the application always re-processes the files in the last batch after a
graceful restart.
it?
Much thanks!
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Tuesday, June 16, 2015 3:26 PM
To: Haopu Wang
Cc: user
Subject: Re: If not stop StreamingContext gracefully, will checkpoint
data be consistent?
Good question, with fileStream
Can someone help? Thank you!
From: Haopu Wang
Sent: Monday, June 15, 2015 3:36 PM
To: user; dev@spark.apache.org
Subject: [SparkStreaming] NPE in DStreamCheckPointData.scala:125
I use the attached program to test checkpoint. It's quite simple.
When I run
15, 2015 3:48 PM
To: Haopu Wang
Cc: user
Subject: Re: If not stop StreamingContext gracefully, will checkpoint
data be consistent?
I think it should be fine, that's the whole point of check-pointing (in
case of driver failure etc).
Thanks
Best Regards
On Mon, Jun 15, 2015 at 6:54 AM
I use the attached program to test checkpoint. It's quite simple.
When I run the program a second time, it loads the checkpoint data as
expected; however, I see an NPE in the driver log.
Do you have any idea about the issue? I'm on Spark 1.4.0, thank you very
much!
== logs ==
Hi, can someone help to confirm the behavior? Thank you!
-Original Message-
From: Haopu Wang
Sent: Friday, June 12, 2015 4:57 PM
To: user
Subject: If not stop StreamingContext gracefully, will checkpoint data
be consistent?
This is a quick question about Checkpoint. The question
This is a quick question about Checkpoint. The question is: if the
StreamingContext is not stopped gracefully, will the checkpoint be
consistent?
Or should I always shut down the application gracefully in order to
use the checkpoint?
Thank you very much!
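For reference, the graceful shutdown being asked about is the variant below (a sketch, assuming the context is named ssc); it waits for already-received data to be processed before stopping.

// stopGracefully = true: finish processing the data already received before shutting down.
ssc.stop(stopSparkContext = true, stopGracefully = true)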
.
-Original Message-
From: Shao, Saisai [mailto:saisai.s...@intel.com]
Sent: Tuesday, June 09, 2015 4:33 PM
To: Haopu Wang; user
Subject: RE: [SparkStreaming 1.3.0] Broadcast failure after setting
spark.cleaner.ttl
From the stack I think this problem may be due to the deletion of
broadcast
Cheng,
yes, it works. I set the property in SparkConf before creating the
SparkContext.
The property name is spark.hadoop.dfs.replication
Thanks for the help!
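As a minimal sketch of that setting (the application name and replication value are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Properties prefixed with "spark.hadoop." are copied into the Hadoop Configuration used by Spark,
// so files written by the job (e.g. the Parquet output) pick up this replication factor.
val conf = new SparkConf()
  .setAppName("parquet-replication-example")   // hypothetical app name
  .set("spark.hadoop.dfs.replication", "2")    // desired HDFS replication factor
val sc = new SparkContext(conf)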
-Original Message-
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Monday, June 08, 2015 6:41 PM
To: Haopu Wang; user
When I ran a spark streaming application longer, I noticed the local
directory's size was kept increasing.
I set spark.cleaner.ttl to 1800 seconds in order to clean the metadata.
The spark streaming batch duration is 10 seconds and checkpoint duration
is 10 minutes.
The setting took effect but
: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Sunday, June 07, 2015 10:17 PM
To: Haopu Wang; user
Subject: Re: SparkSQL: How to specify replication factor on the
persisted parquet files?
Were you using HiveContext.setConf()?
dfs.replication is a Hadoop configuration, but setConf() is only
Hi,
I'm trying to save SparkSQL DataFrame to a persistent Hive table using
the default parquet data source.
I don't know how to change the replication factor of the generated
parquet files on HDFS.
I tried to set dfs.replication on HiveContext but that didn't work.
Any suggestions are
When I start the Spark master process, the old records are not shown in
the monitoring UI.
How can I show the old records? Thank you very much!
[
https://issues.apache.org/jira/browse/SPARK-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14564042#comment-14564042
]
Haopu Wang commented on SPARK-6950:
---
I hit this issue on 1.3.0 and 1.3.1.
It can
[
https://issues.apache.org/jira/browse/SPARK-7696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Haopu Wang updated SPARK-7696:
--
Description:
In SparkSQL, the aggregate function's result currently is always nullable.
It will make
Haopu Wang created SPARK-7696:
-
Summary: Aggregate function's result should be nullable only if
the input expression is nullable
Key: SPARK-7696
URL: https://issues.apache.org/jira/browse/SPARK-7696
: Monday, May 18, 2015 12:39 AM
To: 'Akhil Das'; Haopu Wang
Cc: 'user'
Subject: RE: [SparkStreaming] Is it possible to delay the start of some
DStream in the application?
You can make ANY standard receiver sleep by implementing a custom
Message Deserializer class with sleep method inside
In my application, I want to start a DStream computation only after a
special event has happened (for example, I want to start the receiver
only after the reference data has been properly initialized).
My question is: it looks like the DStream will be started right after
the StreamingContext has
Thank you, should I open a JIRA for this issue?
From: Olivier Girardot [mailto:ssab...@gmail.com]
Sent: Tuesday, May 12, 2015 5:12 AM
To: Reynold Xin
Cc: Haopu Wang; user
Subject: Re: [SparkSQL 1.4.0] groupBy columns are always nullable?
I'll look
Hi TD, regarding the performance of updateStateByKey, do you have a
JIRA for that so we can watch it? Thank you!
From: Tathagata Das [mailto:t...@databricks.com]
Sent: Wednesday, April 15, 2015 8:09 AM
To: Krzysztof Zarzycki
Cc: user
Subject: Re: Is it
I try to get the result schema of aggregate functions using DataFrame
API.
However, I find the result fields of the groupBy columns are always nullable
even when the source field is not nullable.
I want to know if this is by design, thank you! Below is the simple code
to show the issue.
==
import
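The code sample is cut off in this archive; a small reproduction sketch in the same spirit (not the original code; it assumes an existing SparkContext sc and SQLContext sqlContext, with made-up column names) could look like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("k", StringType, nullable = false),
  StructField("v", IntegerType, nullable = false)))
val rdd = sc.parallelize(Seq(Row("a", 1), Row("a", 2), Row("b", 3)))
val df = sqlContext.createDataFrame(rdd, schema)

df.printSchema()                        // k and v are reported as non-nullable
df.groupBy("k").sum("v").printSchema()  // the grouping column comes back as nullable, which is the behavior in question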
, 2015 2:41 AM
To: Haopu Wang
Cc: user; d...@spark.apache.org
Subject: Re: [SparkSQL] cannot filter by a DateType column
What version of Spark are you using? It appears that at least in master
we are doing the conversion correctly, but it's possible older versions
of applySchema do not. If you can
I want to filter a DataFrame based on a Date column.
If the DataFrame object is constructed from a Scala case class, it works
(comparing either as String or as Date). But if the DataFrame is
generated by applying a schema to an RDD, it doesn't work. Below are
the exception and test code.
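The attached test code is not included in this archive; a minimal sketch of the schema-on-RDD case (made-up column names and dates, assuming an existing SparkContext sc and SQLContext sqlContext) might look like this:

import java.sql.Date
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DateType, IntegerType, StructField, StructType}

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("d", DateType, nullable = false)))
val rdd = sc.parallelize(Seq(Row(1, Date.valueOf("2015-01-01")), Row(2, Date.valueOf("2015-02-01"))))
val df = sqlContext.createDataFrame(rdd, schema)

// This is the kind of filter that reportedly fails when the DataFrame is built this way.
df.filter(df("d") >= "2015-01-15").show()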
I think the temporary folders are used to store blocks and shuffles.
That doesn't depend on the cluster manager.
Ideally they should be removed after the application has been
terminated.
Can you check if there are contents under those folders?
From: Taeyun
[mailto:hao.ch...@intel.com]
Sent: Monday, March 02, 2015 9:05 PM
To: Haopu Wang; user
Subject: RE: Is SQLContext thread-safe?
Yes it is thread safe, at least it's supposed to be.
-Original Message-
From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Monday, March 2, 2015 4:43 PM
To: user
Michael, thanks for the response and looking forward to try 1.3.1
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, April 03, 2015 6:52 AM
To: Haopu Wang
Cc: user
Subject: Re: [SparkSQL 1.3.0] Cannot resolve column name SUM('p.q)
among (k
Hi, I want to rename an aggregation field using DataFrame API. The
aggregation is done on a nested field. But I got below exception.
Do you see the same issue and any workaround? Thank you very much!
==
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Cannot resolve
Great! Thank you!
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Thursday, April 02, 2015 8:11 AM
To: Haopu Wang
Cc: user; d...@spark.apache.org
Subject: Re: Can I call aggregate UDF in DataFrame?
You totally can.
https://github.com/apache/spark
Specifically there are only 5 aggregate functions in class
org.apache.spark.sql.GroupedData: sum/max/min/mean/count.
Can I plug in a function to calculate stddev?
Thank you!
Hi,
I have a DataFrame object and I want to do various types of aggregations like
count, sum, variance, stddev, etc.
DataFrame has DSL to do simple aggregations like count and sum.
How about variance and stddev?
Thank you for any suggestions!
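One workaround sketch (not a built-in API at the time) is to derive the population standard deviation from count, sum and sum of squares; it assumes a DataFrame df with a grouping column "k" and a numeric column "v", and helper availability (e.g. sqrt) depends on the Spark version.

import org.apache.spark.sql.functions._

val agg = df.groupBy("k").agg(
  count("v").as("n"),
  sum("v").as("s"),
  sum(col("v") * col("v")).as("ss"))

// population variance = E[v^2] - E[v]^2; take the square root for stddev
val withStddev = agg.withColumn("stddev",
  sqrt(col("ss") / col("n") - (col("s") / col("n")) * (col("s") / col("n"))))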
)
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Wednesday, March 11, 2015 8:25 AM
To: Haopu Wang; user; d...@spark.apache.org
Subject: RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?
I am not so sure if Hive supports changing the metastore after it is
initialized, I guess
I'm using Spark 1.3.0 RC3 build with Hive support.
In Spark Shell, I want to reuse the HiveContext instance with different
warehouse locations. Below are the steps for my test (assume I have
loaded a file into table src).
==
15/03/10 18:22:59 INFO SparkILoop: Created sql context (with
Thanks, it's an active project.
Will it be released with Spark 1.3.0?
From: 鹰 [mailto:980548...@qq.com]
Sent: Thursday, March 05, 2015 11:19 AM
To: Haopu Wang; user
Subject: Re: Where can I find more information about the R interface for Spark?
you can
Hi, in the roadmap of Spark in 2015 (link:
http://files.meetup.com/3138542/Spark%20in%202015%20Talk%20-%20Wendell.pptx), I saw SchemaRDD is designed to be the basis of BOTH Spark
Streaming and Spark SQL.
My question is: what's the typical usage of SchemaRDD in a Spark
Streaming application?
Thanks for the response.
Then I have another question: when would we want to create multiple
SQLContext instances from the same SparkContext? What's the benefit?
-Original Message-
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Monday, March 02, 2015 9:05 PM
To: Haopu Wang; user
Hao, thank you so much for the reply!
Do you already have some JIRA for the discussion?
-Original Message-
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Tuesday, March 03, 2015 8:23 AM
To: Haopu Wang; user
Subject: RE: Is SQLContext thread-safe?
Currently, each SQLContext has its
Hi, is it safe to use the same SQLContext to do Select operations in
different threads at the same time? Thank you very much!
To: Michael Armbrust
Cc: Haopu Wang; dev@spark.apache.org
Subject: Re: HiveContext cannot be serialized
I submitted a patch
https://github.com/apache/spark/pull/4628
On Mon, Feb 16, 2015 at 10:59 AM, Michael Armbrust
mich...@databricks.com wrote:
I was suggesting you mark the variable
While investigating this issue (at the end of this email), I took a
look at HiveContext's code and found this change
(https://github.com/apache/spark/commit/64945f868443fbc59cb34b34c16d782d
da0fb63d#diff-ff50aea397a607b79df9bec6f2a841db):
- @transient protected[hive] lazy val hiveconf = new
I have a streaming application which registers a temp table on a
HiveContext for each batch duration.
The application runs well in Spark 1.1.0. But I get below error from
1.1.1.
Do you have any suggestions to resolve it? Thank you!
java.io.NotSerializableException:
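A commonly used workaround, sketched below in line with the "mark the variable @transient" suggestion elsewhere in this thread (not necessarily the exact fix adopted): keep the HiveContext out of the closure and create it lazily, once per JVM. It assumes a DStream named dstream.

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

object HiveContextSingleton {
  @transient private var instance: HiveContext = _
  def getInstance(sc: SparkContext): HiveContext = synchronized {
    if (instance == null) instance = new HiveContext(sc)
    instance
  }
}

dstream.foreachRDD { rdd =>
  val hc = HiveContextSingleton.getInstance(rdd.sparkContext)
  // convert the RDD and register the per-batch temp table here
}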
Hi, I think a modeling tool may be helpful because sometimes it's
hard/tricky to program Spark. I don't know if there is already such a
tool.
Thanks!
I’m using Spark 1.1.0 built for HDFS 2.4.
My application enables check-point (to HDFS 2.5.1) and it can build. But when I
run it, I get below error:
Exception in thread "main" org.apache.hadoop.ipc.RemoteException: Server IPC
version 9 cannot communicate with client version 4
at
the issue? Thanks for any suggestions.
From: Raghavendra Pandey [mailto:raghavendra.pan...@gmail.com]
Sent: Saturday, December 20, 2014 12:08 AM
To: Sean Owen; Haopu Wang
Cc: user@spark.apache.org
Subject: Re: Can Spark 1.1.0 save checkpoint to HDFS 2.5.1
(MissingRequirementError.scala:17)
……
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Saturday, December 20, 2014 8:12 AM
To: Haopu Wang
Cc: user@spark.apache.org; Raghavendra Pandey
Subject: RE: Can Spark 1.1.0 save checkpoint to HDFS 2.5.1?
That's exactly
!
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: October 23, 2014 14:00
To: Haopu Wang
Cc: user
Subject: Re: About Memory usage in the Spark UI
It shows the amount of memory used to store RDD blocks, which are created when
you run .cache()/.persist() on an RDD.
On Wed
: October 24, 2014 8:07
To: Haopu Wang
Cc: Patrick Wendell; user
Subject: Re: About Memory usage in the Spark UI
The memory usage of blocks of data received through Spark Streaming is not
reflected in the Spark UI. It only shows the memory usage due to cached RDDs.
I didn't find a JIRA for this, so I
Hi, please take a look at the attached screenshot. I wonder what the
Memory Used column means.
I give 2GB memory to the driver process and 12GB memory to the executor
process.
Thank you!
I have a Spark application which is running Spark Streaming and Spark
SQL.
I observed that the size of the shuffle files under the spark.local.dir folder
keeps increasing and never decreases. Eventually it will run into an
out-of-disk-space error.
The question is: when will Spark delete these shuffle files?
In the
Liquan, yes, for full outer join, one hash table on both sides is more
efficient.
For the left/right outer join, it looks like one hash table should be enough.
From: Liquan Pei [mailto:liquan...@gmail.com]
Sent: September 30, 2014 18:34
To: Haopu Wang
Cc: dev
!
From: Liquan Pei [mailto:liquan...@gmail.com]
Sent: September 30, 2014 12:31
To: Haopu Wang
Cc: dev@spark.apache.org; user
Subject: Re: Spark SQL question: why build hashtable for both sides in
HashOuterJoin?
Hi Haopu,
My understanding is that the hashtable on both left and right side
I took a look at HashOuterJoin and it builds a hash table for both
sides.
This consumes quite a lot of memory when the partition is big. And it
doesn't reduce the iteration over the streamed relation, right?
Thanks!
[mailto:lian.cs@gmail.com]
Sent: September 26, 2014 21:24
To: Haopu Wang; user@spark.apache.org
Subject: Re: Spark SQL question: is cached SchemaRDD storage controlled by
spark.storage.memoryFraction?
Yes it is. The in-memory storage used with SchemaRDD also uses RDD.cache()
under the hood.
On 9/26/14 4
Hi, I'm querying a big table using Spark SQL. I see very long GC time in
some stages. I wonder if I can improve it by tuning the storage
parameter.
The question is: the schemaRDD has been cached with cacheTable()
function. So is the cached schemaRDD part of memory storage controlled
by the
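As a brief sketch of the two knobs in question (the fraction value and names are placeholders, and spark.storage.memoryFraction only applies to the legacy memory manager of that era):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("cached-table-example")           // hypothetical app name
  .set("spark.storage.memoryFraction", "0.4")   // size of the storage region
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// cacheTable stores the columnar data through the block manager, so it is
// accounted under the storage fraction configured above.
sqlContext.cacheTable("big_table")              // hypothetical, already-registered table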
Forwarding to the dev mailing list for help.
From: Haopu Wang
Sent: September 22, 2014 16:35
To: u...@spark.apache.org
Subject: Spark SQL 1.1.0: NPE when join two cached table
I have two data sets and want to join them on their first fields. Sample data are
below:
data set
I see the binary packages include hadoop 1, 2.3 and 2.4.
Does Spark 1.1.0 support hadoop 2.5.0 at below address?
http://hadoop.apache.org/releases.html#11+August%2C+2014%3A+Release+2.5.0+available
-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Friday,
.”
Did you try to read a hadoop 2.5.0 file using Spark 1.1 with hadoop 2.4?
Thanks!
From: Denny Lee [mailto:denny.g@gmail.com]
Sent: Friday, September 12, 2014 10:00 AM
To: Patrick Wendell; Haopu Wang; d...@spark.apache.org; user@spark.apache.org
Got it, thank you, Denny!
From: Denny Lee [mailto:denny.g@gmail.com]
Sent: Friday, September 12, 2014 11:04 AM
To: user@spark.apache.org; Haopu Wang; d...@spark.apache.org; Patrick Wendell
Subject: RE: Announcing Spark 1.1.0!
Yes, at least for my query
I have a DStream receiving data from a socket. I'm using local mode.
I set spark.streaming.unpersist to false and leave
spark.cleaner.ttl to be infinite.
I can see files for input and shuffle blocks under the spark.local.dir
folder and the size of the folder keeps increasing, although the JVM's memory
usage
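For reference, a minimal sketch of the two settings being discussed (values are placeholders; spark.cleaner.ttl was removed in later Spark releases):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.unpersist", "true")  // let Spark Streaming unpersist generated RDDs automatically
  .set("spark.cleaner.ttl", "3600")          // periodic cleanup of old metadata/blocks, in seconds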
.
Thanks
Jerry
-Original Message-
From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Wednesday, July 23, 2014 3:00 PM
To: user@spark.apache.org
Subject: spark.streaming.unpersist and spark.cleaner.ttl
I have a DStream receiving data from a socket. I'm using local mode.
I set
Hi, I'm using local mode and read a text file as an RDD using the
JavaSparkContext.textFile() API.
And then I call the cache() method on the resulting RDD.
I look at the Storage information and find the RDD has 3 partitions but
only 2 of them have been cached.
Is this normal behavior? I assume all of
job.
The total number of executors is specified by the user.
-Sandy
On Fri, Jul 18, 2014 at 2:00 AM, Haopu Wang hw...@qilinsoft.com wrote:
Sandy,
Do you mean the “preferred location” is working for standalone cluster also?
Because I check the code of SparkContext and see comments
I have a standalone spark cluster and a HDFS cluster which share some of nodes.
When reading an HDFS file, how does Spark assign tasks to nodes? Will it ask HDFS
for the location of each file block in order to pick the right worker node?
How about a spark cluster on Yarn?
Thank you very much!
On Fri, Jul 18, 2014 at 12:35 AM, Haopu Wang hw...@qilinsoft.com wrote:
I have a standalone spark cluster and a HDFS cluster which share some of nodes.
When reading HDFS file, how does spark assign tasks to nodes? Will it ask HDFS
the location for each file block in order to get
By looking at the code of JobScheduler, I find a parameter of below:
private val numConcurrentJobs =
  ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
private val jobExecutor = Executors.newFixedThreadPool(numConcurrentJobs)
Does that mean each App can have only one active stage?
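For reference, that (undocumented) setting can be raised so that more than one streaming job is active at a time; a one-line sketch with a placeholder value:

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.streaming.concurrentJobs", "2")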
)
at
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
at
org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
From: Haopu