[jira] [Issue Comment Deleted] (SPARK-25171) After restart, StreamingContext is replaying the last successful micro-batch right before the stop

2018-08-23 Thread Haopu Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-25171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haopu Wang updated SPARK-25171: --- Comment: was deleted (was: [~jerryshao], do you have any) > After restart, StreamingCont

[jira] [Commented] (SPARK-25171) After restart, StreamingContext is replaying the last successful micro-batch right before the stop

2018-08-23 Thread Haopu Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-25171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589803#comment-16589803 ] Haopu Wang commented on SPARK-25171: [~jerryshao], do you have any > After rest

[jira] [Created] (SPARK-25171) After restart, StreamingContext is replaying the last successful micro-batch right before the stop

2018-08-21 Thread Haopu Wang (JIRA)
Haopu Wang created SPARK-25171: -- Summary: After restart, StreamingContext is replaying the last successful micro-batch right before the stop Key: SPARK-25171 URL: https://issues.apache.org/jira/browse/SPARK-25171

[jira] [Commented] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'

2018-03-07 Thread Haopu Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390777#comment-16390777 ] Haopu Wang commented on SPARK-17498: After this fix, the reverse transformation "IndexToString"

unsubscribe

2018-02-21 Thread Haopu Wang
Regards, Haopu

[jira] [Created] (SPARK-21396) Spark Hive Thriftserver doesn't return UDT field

2017-07-12 Thread Haopu Wang (JIRA)
Haopu Wang created SPARK-21396: -- Summary: Spark Hive Thriftserver doesn't return UDT field Key: SPARK-21396 URL: https://issues.apache.org/jira/browse/SPARK-21396 Project: Spark Issue Type

RE: Is it possible to resubscribe KafkaStreams in runtime to different set of topics?

2016-11-09 Thread Haopu Wang
ias On 11/9/16 3:11 PM, Haopu Wang wrote: > Hi, do you mean that the new matched topics should be consumed > after the regex subscription has been established? Thanks! > > -Original Message- From: Guozhang Wang > [mailto:wangg...@gmail.com] Sent: 2016年11月10日 3:41 To: > u

RE: Is it possible to resubscribe KafkaStreams in runtime to different set of topics?

2016-11-09 Thread Haopu Wang
Hi, do you mean that the new matched topics should be consumed after the regex subscription has been established? Thanks! -Original Message- From: Guozhang Wang [mailto:wangg...@gmail.com] Sent: 2016年11月10日 3:41 To: users@kafka.apache.org Subject: Re: Is it possible to resubscribe

Kafka stream offset management question

2016-11-08 Thread Haopu Wang
I'm using the Kafka direct stream (auto.offset.reset = earliest) with Spark Streaming's checkpointing enabled. The application starts and consumes messages correctly. Then I stop the application and clean the checkpoint folder. I restart the application and expect it to consume old messages. But
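
A minimal sketch of re-reading a topic from the beginning with the 0-10 direct stream (broker, topic, group and app names below are placeholders, not the original application's values). One thing worth checking in this situation: auto.offset.reset is only consulted when the consumer group has no committed offsets in Kafka, so cleaning the Spark checkpoint folder alone may not be enough; a fresh group.id (or disabled auto-commit) is usually also needed to replay old messages.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._

    val conf = new SparkConf().setAppName("replay-test").setMaster("local[2]")  // local master for testing only
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",                     // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> s"replay-${System.currentTimeMillis}",       // fresh group => no committed offsets
      "auto.offset.reset" -> "earliest",                         // only used when the group has no offsets
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("test-topic"), kafkaParams))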

RE: InvalidClassException when load KafkaDirectStream from checkpoint (Spark 2.0.0)

2016-11-08 Thread Haopu Wang
It turns out to be a bug in application code. Thank you! From: Haopu Wang Sent: 2016年11月4日 17:23 To: user@spark.apache.org; Cody Koeninger Subject: InvalidClassException when load KafkaDirectStream from checkpoint (Spark 2.0.0) When I load spark

RE: expected behavior of Kafka dynamic topic subscription

2016-11-06 Thread Haopu Wang
Cody, thanks for the response. Do you think it's a Spark issue or Kafka issue? Can you please let me know the jira ticket number? -Original Message- From: Cody Koeninger [mailto:c...@koeninger.org] Sent: 2016年11月4日 22:35 To: Haopu Wang Cc: user@spark.apache.org Subject: Re: expected

InvalidClassException when load KafkaDirectStream from checkpoint (Spark 2.0.0)

2016-11-04 Thread Haopu Wang
When I load spark checkpoint, I get below error. Do you have any idea? Much thanks! * 2016-11-04 17:12:19,582 INFO [org.apache.spark.streaming.CheckpointReader] (main;) Checkpoint files found:

expected behavior of Kafka dynamic topic subscription

2016-11-03 Thread Haopu Wang
I'm using the Kafka010 integration API to create a DStream using the SubscribePattern ConsumerStrategy. The specified topic doesn't exist when I start the application. Then I create the topic and publish some test messages. I can see them in the console consumer. But the spark application doesn't
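
A minimal sketch of the pattern-based subscription described above, against the spark-streaming-kafka-0-10 API (broker, pattern and group names are placeholders). Whether topics created after start-up are picked up also depends on the consumer's metadata refresh interval (metadata.max.age.ms) and on the Spark/Kafka versions involved, which is what this thread is about.

    import java.util.regex.Pattern
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._

    val ssc = new StreamingContext(new SparkConf().setAppName("pattern-sub").setMaster("local[2]"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",          // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "pattern-sub-group",
      "auto.offset.reset" -> "earliest",
      "metadata.max.age.ms" -> "30000"                // refresh metadata every 30s so new topics can be discovered
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("mytopic.*"), kafkaParams))

    stream.map(_.value).print()
    ssc.start()
    ssc.awaitTermination()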

RE: Kafka integration: get existing Kafka messages?

2016-10-14 Thread Haopu Wang
pic2 0 2 after polling for 512 == -Original Message- From: Cody Koeninger [mailto:c...@koeninger.org] Sent: 2016年10月13日 9:31 To: Haopu Wang Cc: user@spark.apache.org Subject: Re: Kafka integration: get existing Kafka messages? Look at the presentation and blog post linked from https://github.co

RE: Kafka integration: get existing Kafka messages?

2016-10-12 Thread Haopu Wang
Cody, thanks for the response. So Kafka direct stream actually has consumer on both the driver and executor? Can you please provide more details? Thank you very much! From: Cody Koeninger [mailto:c...@koeninger.org] Sent: 2016年10月12日 20:10 To: Haopu Wang

Kafka integration: get existing Kafka messages?

2016-10-12 Thread Haopu Wang
Hi, I want to read the existing Kafka messages and then subscribe to new stream messages. But I find the "auto.offset.reset" property is always set to "none" in KafkaUtils. Does that mean I cannot specify the "earliest" property value when creating the direct stream? Thank you!

Can Spark Streaming 2.0 work with Kafka 0.10?

2016-09-26 Thread Haopu Wang
Hi, in the official integration guide, it says "Spark Streaming 2.0.0 is compatible with Kafka 0.8.2.1." However, in the maven repository, I can get "spark-streaming-kafka-0-10_2.11" which depends on Kafka 0.10.0.0. Is this artifact stable enough? Thank you!

[Spark 1.6.1] Beeline cannot start on Windows7

2016-06-27 Thread Haopu Wang
I see below stack trace when trying to run beeline command. I'm using JDK 7. Anything wrong? Much thanks! == D:\spark\download\spark-1.6.1-bin-hadoop2.4>bin\beeline Beeline version 1.6.1 by Apache Hive Exception in thread "main" java.lang.NoSuchMethodError:

RE: Can I control the execution of Spark jobs?

2016-06-16 Thread Haopu Wang
" be? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Thu, Jun 16, 2016 at 11:36 AM, Haopu Wang <hw...@qilinsoft.com> wrote: > Hi, > > > >

Can I control the execution of Spark jobs?

2016-06-16 Thread Haopu Wang
Hi, Suppose I have a spark application which is doing several ETL types of things. I understand Spark can analyze and generate several jobs to execute. The question is: is it possible to control the dependency between these jobs? Thanks!

RE: Should I avoid "state" in a Spark application?

2016-06-12 Thread Haopu Wang
Can someone look at my questions? Thanks again! From: Haopu Wang Sent: 2016年6月12日 16:40 To: user@spark.apache.org Subject: Should I avoid "state" in a Spark application? I have a Spark application whose structure is below: var ts:

RE: Should I avoid "state" in a Spark application?

2016-06-12 Thread Haopu Wang
Can someone look at my questions? Thanks again! From: Haopu Wang Sent: 2016年6月12日 16:40 To: u...@spark.apache.org Subject: Should I avoid "state" in a Spark application? I have a Spark application whose structure is below: var ts:

Should I avoid "state" in a Spark application?

2016-06-12 Thread Haopu Wang
I have a Spark application whose structure is below: var ts: Long = 0L; dstream1.foreachRDD { (x, time) => { ts = time; x.do_something() ... } }; ...; process_data(dstream2, ts, ...) I assume foreachRDD function call can
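
A sketch of the structure being asked about, with placeholder names (dstream1, dstream2, doSomething and processData stand in for the application's own streams and functions). The body passed to foreachRDD runs on the driver once per batch; only closures passed to RDD operations inside it are shipped to executors, so assigning to a driver-local var there is visible to other driver-side code. With the default spark.streaming.concurrentJobs = 1, the output operations of a batch also run in the order they were registered.

    var ts: Long = 0L

    dstream1.foreachRDD { (rdd, time) =>
      ts = time.milliseconds                       // driver-side update, once per batch
      rdd.foreach(record => doSomething(record))   // only this inner closure runs on executors
    }

    dstream2.foreachRDD { (rdd, time) =>
      processData(rdd, ts)                         // driver-side as well; sees the value assigned above
    }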

How to get the list of running applications and Cores/Memory in use?

2015-12-06 Thread Haopu Wang
Hi, I have a Spark 1.5.2 standalone cluster running. I want to get all of the running applications and the Cores/Memory in use. Besides the Master UI, are there any other ways to do that? I tried to send an HTTP request using a URL like this: "http://node1:6066/v1/applications". The
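
Besides the web pages, the standalone Master also serves its status as JSON under /json on the UI port (8080 by default), which lists the running applications with their cores and memory; the exact fields vary by version, and the host below is a placeholder.

    import scala.io.Source

    object MasterStatus {
      def main(args: Array[String]): Unit = {
        // Fetch the master's status page in JSON form and print it; parse it with
        // any JSON library to pull out the "activeapps" entries.
        val json = Source.fromURL("http://node1:8080/json").mkString
        println(json)
      }
    }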

RE: RE: Spark or Storm

2015-06-19 Thread Haopu Wang
My question is not directly related: about the exactly-once semantics, the document (copied below) says Spark Streaming gives exactly-once semantics, but actually from my test results, with checkpointing enabled, the application always re-processes the files in the last batch after a graceful restart.

RE: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-18 Thread Haopu Wang
it? Much thanks! From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Tuesday, June 16, 2015 3:26 PM To: Haopu Wang Cc: user Subject: Re: If not stop StreamingContext gracefully, will checkpoint data be consistent? Good question, with fileStream

RE: [SparkStreaming] NPE in DStreamCheckPointData.scala:125

2015-06-17 Thread Haopu Wang
Can someone help? Thank you! From: Haopu Wang Sent: Monday, June 15, 2015 3:36 PM To: user; dev@spark.apache.org Subject: [SparkStreaming] NPE in DStreamCheckPointData.scala:125 I use the attached program to test checkpoint. It's quite simple. When I run

RE: [SparkStreaming] NPE in DStreamCheckPointData.scala:125

2015-06-17 Thread Haopu Wang
Can someone help? Thank you! From: Haopu Wang Sent: Monday, June 15, 2015 3:36 PM To: user; d...@spark.apache.org Subject: [SparkStreaming] NPE in DStreamCheckPointData.scala:125 I use the attached program to test checkpoint. It's quite simple. When I run

RE: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-15 Thread Haopu Wang
15, 2015 3:48 PM To: Haopu Wang Cc: user Subject: Re: If not stop StreamingContext gracefully, will checkpoint data be consistent? I think it should be fine, that's the whole point of check-pointing (in case of driver failure etc). Thanks Best Regards On Mon, Jun 15, 2015 at 6:54 AM

[SparkStreaming] NPE in DStreamCheckPointData.scala:125

2015-06-15 Thread Haopu Wang
I use the attached program to test checkpoint. It's quite simple. When I run the program a second time, it loads checkpoint data as expected; however, I see an NPE in the driver log. Do you have any idea about the issue? I'm on Spark 1.4.0, thank you very much! == logs ==

[SparkStreaming] NPE in DStreamCheckPointData.scala:125

2015-06-15 Thread Haopu Wang
I use the attached program to test checkpoint. It's quite simple. When I run the program a second time, it loads checkpoint data as expected; however, I see an NPE in the driver log. Do you have any idea about the issue? I'm on Spark 1.4.0, thank you very much! == logs ==

RE: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-14 Thread Haopu Wang
Hi, can someone help to confirm the behavior? Thank you! -Original Message- From: Haopu Wang Sent: Friday, June 12, 2015 4:57 PM To: user Subject: If not stop StreamingContext gracefully, will checkpoint data be consistent? This is a quick question about Checkpoint. The question

If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-12 Thread Haopu Wang
This is a quick question about Checkpoint. The question is: if the StreamingContext is not stopped gracefully, will the checkpoint be consistent? Or should I always gracefully shut down the application in order to use the checkpoint? Thank you very much!

RE: [SparkStreaming 1.3.0] Broadcast failure after setting spark.cleaner.ttl

2015-06-09 Thread Haopu Wang
. -Original Message- From: Shao, Saisai [mailto:saisai.s...@intel.com] Sent: Tuesday, June 09, 2015 4:33 PM To: Haopu Wang; user Subject: RE: [SparkStreaming 1.3.0] Broadcast failure after setting spark.cleaner.ttl From the stack I think this problem may be due to the deletion of broadcast

RE: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-09 Thread Haopu Wang
Cheng, yes, it works; I set the property in SparkConf before initiating the SparkContext. The property name is spark.hadoop.dfs.replication. Thanks for the help! -Original Message- From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Monday, June 08, 2015 6:41 PM To: Haopu Wang; user
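
A minimal sketch of that approach (app name and replication factor are just examples): any "spark.hadoop.*" entry in SparkConf is copied into the Hadoop Configuration used by the job, which is how dfs.replication reaches the Parquet writer.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("parquet-replication")               // placeholder app name
      .set("spark.hadoop.dfs.replication", "2")        // desired HDFS replication factor
    val sc = new SparkContext(conf)                    // must be set before the context is created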

[SparkStreaming 1.3.0] Broadcast failure after setting spark.cleaner.ttl

2015-06-09 Thread Haopu Wang
When I ran a Spark Streaming application for longer, I noticed the local directory's size kept increasing. I set spark.cleaner.ttl to 1800 seconds in order to clean the metadata. The Spark Streaming batch duration is 10 seconds and the checkpoint duration is 10 minutes. The setting took effect but

RE: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-08 Thread Haopu Wang
: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Sunday, June 07, 2015 10:17 PM To: Haopu Wang; user Subject: Re: SparkSQL: How to specify replication factor on the persisted parquet files? Were you using HiveContext.setConf()? dfs.replication is a Hadoop configuration, but setConf() is only

SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-02 Thread Haopu Wang
Hi, I'm trying to save SparkSQL DataFrame to a persistent Hive table using the default parquet data source. I don't know how to change the replication factor of the generated parquet files on HDFS. I tried to set dfs.replication on HiveContext but that didn't work. Any suggestions are

Spark 1.3.0: how to let Spark history load old records?

2015-06-01 Thread Haopu Wang
When I start the Spark master process, the old records are not shown in the monitoring UI. How can I show the old records? Thank you very much!

[jira] [Commented] (SPARK-6950) Spark master UI believes some applications are in progress when they are actually completed

2015-05-28 Thread Haopu Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14564042#comment-14564042 ] Haopu Wang commented on SPARK-6950: --- I hit this issue on 1.3.0 and 1.3.1. It can

[jira] [Updated] (SPARK-7696) Aggregate function's result should be nullable only if the input expression is nullable

2015-05-17 Thread Haopu Wang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haopu Wang updated SPARK-7696: -- Description: In SparkSQL, the aggregate function's result currently is always nullable. It will make

[jira] [Created] (SPARK-7696) Aggregate function's result should be nullable only if the input expression is nullable

2015-05-17 Thread Haopu Wang (JIRA)
Haopu Wang created SPARK-7696: - Summary: Aggregate function's result should be nullable only if the input expression is nullable Key: SPARK-7696 URL: https://issues.apache.org/jira/browse/SPARK-7696

RE: [SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-17 Thread Haopu Wang
: Monday, May 18, 2015 12:39 AM To: 'Akhil Das'; Haopu Wang Cc: 'user' Subject: RE: [SparkStreaming] Is it possible to delay the start of some DStream in the application? You can make ANY standard receiver sleep by implementing a custom Message Deserializer class with sleep method inside

[SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-14 Thread Haopu Wang
In my application, I want to start a DStream computation only after a special event has happened (for example, I want to start the receiver only after the reference data has been properly initialized). My question is: it looks like the DStream will be started right after the StreamingContext has

RE: [SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-14 Thread Haopu Wang
Thank you, should I open a JIRA for this issue? From: Olivier Girardot [mailto:ssab...@gmail.com] Sent: Tuesday, May 12, 2015 5:12 AM To: Reynold Xin Cc: Haopu Wang; user Subject: Re: [SparkSQL 1.4.0] groupBy columns are always nullable? I'll look

RE: Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

2015-05-14 Thread Haopu Wang
Hi TD, regarding to the performance of updateStateByKey, do you have a JIRA for that so we can watch it? Thank you! From: Tathagata Das [mailto:t...@databricks.com] Sent: Wednesday, April 15, 2015 8:09 AM To: Krzysztof Zarzycki Cc: user Subject: Re: Is it

[SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-11 Thread Haopu Wang
I try to get the result schema of aggregate functions using the DataFrame API. However, I find the result fields of groupBy columns are always nullable even when the source field is not nullable. I want to know if this is by design, thank you! Below is the simple code to show the issue. == import
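
A small reproduction sketch in the spirit of the (truncated) code above, assuming an existing SparkContext sc; the record type and column names are made up.

    import org.apache.spark.sql.SQLContext

    case class Rec(k: Int, v: Int)                     // both fields non-nullable in the derived schema

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(Rec(1, 10), Rec(1, 20), Rec(2, 30))).toDF()
    df.printSchema()                                   // k: integer (nullable = false)
    df.groupBy("k").sum("v").printSchema()             // compare the nullability of k in the result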

RE: [SparkSQL] cannot filter by a DateType column

2015-05-10 Thread Haopu Wang
, 2015 2:41 AM To: Haopu Wang Cc: user; d...@spark.apache.org Subject: Re: [SparkSQL] cannot filter by a DateType column What version of Spark are you using? It appears that at least in master we are doing the conversion correctly, but its possible older versions of applySchema do not. If you can

[SparkSQL] cannot filter by a DateType column

2015-05-08 Thread Haopu Wang
I want to filter a DataFrame based on a Date column. If the DataFrame object is constructed from a scala case class, it's working (either compare as String or Date). But if the DataFrame is generated by specifying a Schema to an RDD, it doesn't work. Below is the exception and test code.

[SparkSQL] cannot filter by a DateType column

2015-05-08 Thread Haopu Wang
I want to filter a DataFrame based on a Date column. If the DataFrame object is constructed from a scala case class, it's working (either compare as String or Date). But if the DataFrame is generated by specifying a Schema to an RDD, it doesn't work. Below is the exception and test code.

RE: Spark does not delete temporary directories

2015-05-07 Thread Haopu Wang
I think the temporary folders are used to store blocks and shuffles. That doesn't depend on the cluster manager. Ideally they should be removed after the application has been terminated. Can you check if there are contents under those folders? From: Taeyun

RE: Is SQLContext thread-safe?

2015-04-30 Thread Haopu Wang
[mailto:hao.ch...@intel.com] Sent: Monday, March 02, 2015 9:05 PM To: Haopu Wang; user Subject: RE: Is SQLContext thread-safe? Yes it is thread safe, at least it's supposed to be. -Original Message- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Monday, March 2, 2015 4:43 PM To: user

RE: Is SQLContext thread-safe?

2015-04-30 Thread Haopu Wang
[mailto:hao.ch...@intel.com] Sent: Monday, March 02, 2015 9:05 PM To: Haopu Wang; user Subject: RE: Is SQLContext thread-safe? Yes it is thread safe, at least it's supposed to be. -Original Message- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Monday, March 2, 2015 4:43 PM To: user

RE: [SparkSQL 1.3.0] Cannot resolve column name SUM('p.q) among (k, SUM('p.q));

2015-04-02 Thread Haopu Wang
Michael, thanks for the response and looking forward to try 1.3.1 From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, April 03, 2015 6:52 AM To: Haopu Wang Cc: user Subject: Re: [SparkSQL 1.3.0] Cannot resolve column name SUM('p.q) among (k

[SparkSQL 1.3.0] Cannot resolve column name SUM('p.q) among (k, SUM('p.q));

2015-04-02 Thread Haopu Wang
Hi, I want to rename an aggregation field using DataFrame API. The aggregation is done on a nested field. But I got below exception. Do you see the same issue and any workaround? Thank you very much! == Exception in thread main org.apache.spark.sql.AnalysisException: Cannot resolve

RE: Can I call aggregate UDF in DataFrame?

2015-04-01 Thread Haopu Wang
Great! Thank you! From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, April 02, 2015 8:11 AM To: Haopu Wang Cc: user; d...@spark.apache.org Subject: Re: Can I call aggregate UDF in DataFrame? You totally can. https://github.com/apache/spark

RE: Can I call aggregate UDF in DataFrame?

2015-04-01 Thread Haopu Wang
Great! Thank you! From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, April 02, 2015 8:11 AM To: Haopu Wang Cc: user; dev@spark.apache.org Subject: Re: Can I call aggregate UDF in DataFrame? You totally can. https://github.com/apache/spark

Can I call aggregate UDF in DataFrame?

2015-03-26 Thread Haopu Wang
Specifically there are only 5 aggregate functions in class org.apache.spark.sql.GroupedData: sum/max/min/mean/count. Can I plug in a function to calculate stddev? Thank you!
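
Before a built-in stddev existed in the DataFrame DSL, one workaround was to derive it from avg(x) and avg(x*x) using the existing aggregates; a sketch (the DataFrame and column name are whatever the application has at hand):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    def stddevOf(df: DataFrame, column: String): Double = {
      val c = col(column)
      val row = df.agg(avg(c), avg(c * c)).first()
      val mean = row.getDouble(0)
      val meanOfSquares = row.getDouble(1)
      math.sqrt(meanOfSquares - mean * mean)           // population standard deviation
    }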

[SparkSQL] How to calculate stddev on a DataFrame?

2015-03-25 Thread Haopu Wang
Hi, I have a DataFrame object and I want to do several types of aggregations like count, sum, variance, stddev, etc. DataFrame has a DSL to do simple aggregations like count and sum. How about variance and stddev? Thank you for any suggestions!

RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-11 Thread Haopu Wang
) From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Wednesday, March 11, 2015 8:25 AM To: Haopu Wang; user; d...@spark.apache.org Subject: RE: [SparkSQL] Reuse HiveContext to different Hive warehouse? I am not so sure if Hive supports change the metastore after initialized, I guess

[SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-10 Thread Haopu Wang
I'm using the Spark 1.3.0 RC3 build with Hive support. In Spark Shell, I want to reuse the HiveContext instance with different warehouse locations. Below are the steps for my test (assume I have loaded a file into table src). == 15/03/10 18:22:59 INFO SparkILoop: Created sql context (with

[SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-10 Thread Haopu Wang
I'm using the Spark 1.3.0 RC3 build with Hive support. In Spark Shell, I want to reuse the HiveContext instance with different warehouse locations. Below are the steps for my test (assume I have loaded a file into table src). == 15/03/10 18:22:59 INFO SparkILoop: Created sql context (with

RE: Where can I find more information about the R interface forSpark?

2015-03-04 Thread Haopu Wang
Thanks, it's an active project. Will it be released with Spark 1.3.0? From: 鹰 [mailto:980548...@qq.com] Sent: Thursday, March 05, 2015 11:19 AM To: Haopu Wang; user Subject: Re: Where can I find more information about the R interface forSpark? you can

Spark Streaming and SchemaRDD usage

2015-03-04 Thread Haopu Wang
Hi, in the roadmap of Spark in 2015 (link: http://files.meetup.com/3138542/Spark%20in%202015%20Talk%20-%20Wendell.pptx), I saw SchemaRDD is designed to be the basis of BOTH Spark Streaming and Spark SQL. My question is: what's the typical usage of SchemaRDD in a Spark Streaming application?

Spark Streaming and SchemaRDD usage

2015-03-04 Thread Haopu Wang
Hi, in the roadmap of Spark in 2015 (link: http://files.meetup.com/3138542/Spark%20in%202015%20Talk%20-%20Wendell.pptx), I saw SchemaRDD is designed to be the basis of BOTH Spark Streaming and Spark SQL. My question is: what's the typical usage of SchemaRDD in a Spark Streaming application?

RE: Is SQLContext thread-safe?

2015-03-02 Thread Haopu Wang
Thanks for the response. Then I have another question: when will we want to create multiple SQLContext instances from the same SparkContext? What's the benefit? -Original Message- From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Monday, March 02, 2015 9:05 PM To: Haopu Wang; user

RE: Is SQLContext thread-safe?

2015-03-02 Thread Haopu Wang
Hao, thank you so much for the reply! Do you already have some JIRA for the discussion? -Original Message- From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Tuesday, March 03, 2015 8:23 AM To: Haopu Wang; user Subject: RE: Is SQLContext thread-safe? Currently, each SQLContext has its

Is SQLContext thread-safe?

2015-03-02 Thread Haopu Wang
Hi, is it safe to use the same SQLContext to do Select operations in different threads at the same time? Thank you very much!
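
Assuming the thread-safety described in the replies above, a small sketch that issues the same query from several threads against one SQLContext (sqlContext and the registered table "src" are assumed to exist already):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    val queries = (1 to 4).map { _ =>
      Future(sqlContext.sql("SELECT COUNT(*) FROM src").collect())   // concurrent SELECTs on one context
    }
    queries.foreach(f => println(Await.result(f, 5.minutes).mkString(",")))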

RE: HiveContext cannot be serialized

2015-02-16 Thread Haopu Wang
To: Michael Armbrust Cc: Haopu Wang; dev@spark.apache.org Subject: Re: HiveContext cannot be serialized I submitted a patch https://github.com/apache/spark/pull/4628 On Mon, Feb 16, 2015 at 10:59 AM, Michael Armbrust mich...@databricks.com wrote: I was suggesting you mark the variable

HiveContext cannot be serialized

2015-02-16 Thread Haopu Wang
When I'm investigating this issue (in the end of this email), I take a look at HiveContext's code and find this change (https://github.com/apache/spark/commit/64945f868443fbc59cb34b34c16d782dda0fb63d#diff-ff50aea397a607b79df9bec6f2a841db): - @transient protected[hive] lazy val hiveconf = new
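
A sketch of the workaround discussed in this thread: keep the HiveContext out of serialized closures by marking the reference @transient on the enclosing class (class, field and table names here are placeholders, not the original application's code).

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext

    class BatchHandler(@transient val sc: SparkContext) extends Serializable {
      // Recreated lazily and never serialized with the handler
      @transient lazy val hiveContext = new HiveContext(sc)

      def register(query: String): Unit = {
        // Runs on the driver only; executors never need the HiveContext itself
        hiveContext.sql(query).registerTempTable("batch_result")
      }
    }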

Spark Streaming and SQL checkpoint error: (java.io.NotSerializableException: org.apache.hadoop.hive.conf.HiveConf)

2015-02-16 Thread Haopu Wang
I have a streaming application which registered temp table on a HiveContext for each batch duration. The application runs well in Spark 1.1.0. But I get below error from 1.1.1. Do you have any suggestions to resolve it? Thank you! java.io.NotSerializableException:

Do you know any Spark modeling tool?

2014-12-25 Thread Haopu Wang
Hi, I think a modeling tool may be helpful because sometimes it's hard/tricky to program Spark. I don't know if there is already such a tool. Thanks!

Do you know any Spark modeling tool?

2014-12-25 Thread Haopu Wang
Hi, I think a modeling tool may be helpful because sometimes it's hard/tricky to program Spark. I don't know if there is already such a tool. Thanks!

Can Spark 1.1.0 save checkpoint to HDFS 2.5.1?

2014-12-19 Thread Haopu Wang
I’m using Spark 1.1.0 built for HDFS 2.4. My application enables check-point (to HDFS 2.5.1) and it can build. But when I run it, I get below error: Exception in thread main org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4 at

RE: Can Spark 1.1.0 save checkpoint to HDFS 2.5.1?

2014-12-19 Thread Haopu Wang
the issue? Thanks for any suggestions. From: Raghavendra Pandey [mailto:raghavendra.pan...@gmail.com] Sent: Saturday, December 20, 2014 12:08 AM To: Sean Owen; Haopu Wang Cc: user@spark.apache.org Subject: Re: Can Spark 1.1.0 save checkpoint to HDFS 2.5.1

RE: Can Spark 1.1.0 save checkpoint to HDFS 2.5.1?

2014-12-19 Thread Haopu Wang
(MissingRequirementError.scala:17) …… From: Sean Owen [mailto:so...@cloudera.com] Sent: Saturday, December 20, 2014 8:12 AM To: Haopu Wang Cc: user@spark.apache.org; Raghavendra Pandey Subject: RE: Can Spark 1.1.0 save checkpoint to HDFS 2.5.1? That's exactly

RE: About Memory usage in the Spark UI

2014-10-23 Thread Haopu Wang
! From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: 2014年10月23日 14:00 To: Haopu Wang Cc: user Subject: Re: About Memory usage in the Spark UI It shows the amount of memory used to store RDD blocks, which are created when you run .cache()/.persist() on an RDD. On Wed

RE: About Memory usage in the Spark UI

2014-10-23 Thread Haopu Wang
: 2014年10月24日 8:07 To: Haopu Wang Cc: Patrick Wendell; user Subject: Re: About Memory usage in the Spark UI The memory usage of blocks of data received through Spark Streaming is not reflected in the Spark UI. It only shows the memory usage due to cached RDDs. I didn't find a JIRA for this, so I

About Memory usage in the Spark UI

2014-10-22 Thread Haopu Wang
Hi, please take a look at the attached screen-shot. I wonder what the Memory Used column means. I give 2GB memory to the driver process and 12GB memory to the executor process. Thank you!

Spark's shuffle file size keep increasing

2014-10-15 Thread Haopu Wang
I have a Spark application which is running Spark Streaming and Spark SQL. I observed that the size of the shuffle files under the spark.local.dir folder keeps increasing and never decreases. Eventually it will run into an out-of-disk-space error. The question is: when will Spark delete these shuffle files? In the

RE: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Haopu Wang
Liquan, yes, for a full outer join, one hash table on both sides is more efficient. For the left/right outer join, it looks like one hash table should be enough. From: Liquan Pei [mailto:liquan...@gmail.com] Sent: 2014年9月30日 18:34 To: Haopu Wang Cc: dev

RE: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Haopu Wang
Liquan, yes, for a full outer join, one hash table on both sides is more efficient. For the left/right outer join, it looks like one hash table should be enough. From: Liquan Pei [mailto:liquan...@gmail.com] Sent: 2014年9月30日 18:34 To: Haopu Wang Cc: d

RE: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-30 Thread Haopu Wang
! From: Liquan Pei [mailto:liquan...@gmail.com] Sent: 2014年9月30日 12:31 To: Haopu Wang Cc: dev@spark.apache.org; user Subject: Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin? Hi Haopu, My understanding is that the hashtable on both left and right side

RE: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-30 Thread Haopu Wang
! From: Liquan Pei [mailto:liquan...@gmail.com] Sent: 2014年9月30日 12:31 To: Haopu Wang Cc: d...@spark.apache.org; user Subject: Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin? Hi Haopu, My understanding is that the hashtable on both left and right side

Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Haopu Wang
I take a look at HashOuterJoin and it's building a Hashtable for both sides. This consumes quite a lot of memory when the partition is big. And it doesn't reduce the iteration on the streamed relation, right? Thanks!

Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Haopu Wang
I take a look at HashOuterJoin and it's building a Hashtable for both sides. This consumes quite a lot of memory when the partition is big. And it doesn't reduce the iteration on the streamed relation, right? Thanks!

Spark SQL question: how to control the storage level of cached SchemaRDD?

2014-09-28 Thread Haopu Wang
[mailto:lian.cs@gmail.com] Sent: 2014年9月26日 21:24 To: Haopu Wang; user@spark.apache.org Subject: Re: Spark SQL question: is cached SchemaRDD storage controlled by spark.storage.memoryFraction? Yes it is. The in-memory storage used with SchemaRDD also uses RDD.cache() under the hood. On 9/26/14 4

Spark SQL question: is cached SchemaRDD storage controlled by spark.storage.memoryFraction?

2014-09-26 Thread Haopu Wang
Hi, I'm querying a big table using Spark SQL. I see very long GC time in some stages. I wonder if I can improve it by tuning the storage parameter. The question is: the schemaRDD has been cached with cacheTable() function. So is the cached schemaRDD part of memory storage controlled by the
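
For reference, a small sketch of the knob in question with made-up values: per the reply quoted above, cacheTable() stores columnar data through the same block manager as RDD.cache(), so it is bounded by spark.storage.memoryFraction (a pre-1.6 setting).

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val conf = new SparkConf()
      .setAppName("cache-tuning")
      .set("spark.storage.memoryFraction", "0.4")      // shrink the cache region, leaving more heap for execution
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // ... register a temp table "big_table" as in the application being discussed ...
    sqlContext.cacheTable("big_table")                 // cached columnar data counts against the storage fraction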

FW: Spark SQL 1.1.0: NPE when join two cached table

2014-09-22 Thread Haopu Wang
FWD to the dev mail list for help From: Haopu Wang Sent: 2014年9月22日 16:35 To: u...@spark.apache.org Subject: Spark SQL 1.1.0: NPE when join two cached table I have two data sets and want to join them on their first fields. Sample data are below: data set

RE: Announcing Spark 1.1.0!

2014-09-11 Thread Haopu Wang
I see the binary packages include hadoop 1, 2.3 and 2.4. Does Spark 1.1.0 support hadoop 2.5.0 at below address? http://hadoop.apache.org/releases.html#11+August%2C+2014%3A+Release+2.5.0+available -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Friday,

RE: Announcing Spark 1.1.0!

2014-09-11 Thread Haopu Wang
.” Did you try to read a hadoop 2.5.0 file using Spark 1.1 with hadoop 2.4? Thanks! From: Denny Lee [mailto:denny.g@gmail.com] Sent: Friday, September 12, 2014 10:00 AM To: Patrick Wendell; Haopu Wang; d...@spark.apache.org; user@spark.apache.org

RE: Announcing Spark 1.1.0!

2014-09-11 Thread Haopu Wang
Got it, thank you, Denny! From: Denny Lee [mailto:denny.g@gmail.com] Sent: Friday, September 12, 2014 11:04 AM To: user@spark.apache.org; Haopu Wang; d...@spark.apache.org; Patrick Wendell Subject: RE: Announcing Spark 1.1.0! Yes, atleast for my query

spark.streaming.unpersist and spark.cleaner.ttl

2014-07-23 Thread Haopu Wang
I have a DStream receiving data from a socket. I'm using local mode. I set spark.streaming.unpersist to false and leave spark.cleaner.ttl as infinite. I can see files for input and shuffle blocks under the spark.local.dir folder and the size of the folder keeps increasing, although the JVM's memory usage
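
A minimal sketch of the two settings under discussion (values are examples only): spark.streaming.unpersist lets Spark Streaming unpersist the RDDs it generates, and spark.cleaner.ttl (in seconds) forces periodic cleanup of old metadata and blocks.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("cleanup-test")
      .set("spark.streaming.unpersist", "true")        // let streaming clean up generated RDDs
      .set("spark.cleaner.ttl", "3600")                // forget anything older than an hour
    val ssc = new StreamingContext(conf, Seconds(10))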

RE: spark.streaming.unpersist and spark.cleaner.ttl

2014-07-23 Thread Haopu Wang
. Thanks Jerry -Original Message- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Wednesday, July 23, 2014 3:00 PM To: user@spark.apache.org Subject: spark.streaming.unpersist and spark.cleaner.ttl I have a DStream receiving data from a socket. I'm using local mode. I set

number of Cached Partitions v.s. Total Partitions

2014-07-22 Thread Haopu Wang
Hi, I'm using local mode and read a text file as RDD using JavaSparkContext.textFile() API. And then call cache() method on the result RDD. I look at the Storage information and find the RDD has 3 partitions but 2 of them have been cached. Is this a normal behavior? I assume all of

RE: data locality

2014-07-21 Thread Haopu Wang
job. The total number of executors is specified by the user. -Sandy On Fri, Jul 18, 2014 at 2:00 AM, Haopu Wang hw...@qilinsoft.com wrote: Sandy, Do you mean the “preferred location” is working for standalone cluster also? Because I check the code of SparkContext and see comments

data locality

2014-07-18 Thread Haopu Wang
I have a standalone spark cluster and a HDFS cluster which share some of nodes. When reading HDFS file, how does spark assign tasks to nodes? Will it ask HDFS the location for each file block in order to get a right worker node? How about a spark cluster on Yarn? Thank you very much!

RE: data locality

2014-07-18 Thread Haopu Wang
On Fri, Jul 18, 2014 at 12:35 AM, Haopu Wang hw...@qilinsoft.com wrote: I have a standalone spark cluster and a HDFS cluster which share some of nodes. When reading HDFS file, how does spark assign tasks to nodes? Will it ask HDFS the location for each file block in order to get

concurrent jobs

2014-07-18 Thread Haopu Wang
By looking at the code of JobScheduler, I find the parameter below: private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1) private val jobExecutor = Executors.newFixedThreadPool(numConcurrentJobs) Does that mean each App can have only one active stage?
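
With the default of 1, the jobs generated for each batch run one at a time. A sketch of raising the (undocumented) spark.streaming.concurrentJobs setting read by the JobScheduler code above, with example values:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("concurrent-jobs")
      .set("spark.streaming.concurrentJobs", "2")      // allow two streaming jobs to run concurrently
    val ssc = new StreamingContext(conf, Seconds(10))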

RE: All of the tasks have been completed but the Stage is still shown as Active?

2014-07-11 Thread Haopu Wang
) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46) From: Haopu
