Following is a method that retrieves the list of executors registered to a
spark context. It worked perfectly with spark-submit in standalone mode for my
project.
/**
 * A simplified method that just returns the current active/registered executors
 * excluding the driver.
 * @param sc
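The snippet above is cut off; a minimal sketch of what such a method can look
like (assuming Spark 1.x, where getExecutorMemoryStatus is keyed by
"host:port" strings and also lists the driver):

def currentActiveExecutors(sc: SparkContext): Seq[String] = {
  // every block manager, driver included
  val allExecutors = sc.getExecutorMemoryStatus.map(_._1).toSeq
  val driverHost = sc.getConf.get("spark.driver.host")
  // keep only entries whose host part differs from the driver's host
  allExecutors.filter(!_.split(":")(0).equals(driverHost))
}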
I got the same problem with rdd.repartition() in my streaming app, which
generated a few huge partitions and many tiny partitions. The resulting data
skew made the processing time of a batch unpredictable, often exceeding the
batch interval. I eventually solved the problem by using
repartition() means coalesce(shuffle=false)
On Thursday, June 18, 2015 4:07 PM, Corey Nolet cjno...@gmail.com wrote:
Doesn't repartition call coalesce(shuffle=true)? On Jun 18, 2015 6:53 PM, Du
Li l...@yahoo-inc.com.invalid wrote:
I got the same problem with rdd.repartition() in my
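For reference, in the Spark 1.x source (RDD.scala) repartition is literally
coalesce with the shuffle flag turned on:

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] =
  coalesce(numPartitions, shuffle = true)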
Event log is enabled in my spark streaming app. My code runs in standalone mode
and the spark version is 1.3.1. I periodically stop and restart the streaming
context by calling ssc.stop(). However, from the web UI, when clicking on a
past job, it says the job is still in progress and does not
Spark master). Things move really fast
between releases. 1.1.1 feels really old to me :P
TD
On Wed, May 13, 2015 at 1:25 PM, Du Li l...@yahoo-inc.com wrote:
I do rdd.countApprox() and rdd.sparkContext.setJobGroup() inside
dstream.foreachRDD{...}. After calling cancelJobGroup(), the spark context
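A rough sketch of that pattern (the group name and timeout here are made up;
setJobGroup and cancelJobGroup are standard SparkContext methods):

dstream.foreachRDD { rdd =>
  val sc = rdd.sparkContext
  sc.setJobGroup("approx-count", "time-bounded count", interruptOnCancel = true)
  val approx = rdd.countApprox(5000).initialValue  // returns by the timeout
  sc.cancelJobGroup("approx-count")  // kill any tasks still running
  println(s"approximate count: ${approx.mean}")
}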
PM, Tathagata Das t...@databricks.com
wrote:
That is not supposed to happen :/ That is probably a bug. If you have the
log4j logs, it would be good to file a JIRA. This may be worth debugging.
On Wed, May 13, 2015 at 12:13 PM, Du Li l...@yahoo-inc.com wrote:
Actually I tried that before asking
received by rdd.count()
On Tue, May 12, 2015 at 5:03 PM, Du Li l...@yahoo-inc.com.invalid wrote:
Hi,
I tested the following in my streaming app, hoping to get an approximate
count within 5 seconds. However, rdd.countApprox(5000).getFinalValue() seemed
to return only after the job had finished completely
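If I read the PartialResult API correctly (worth double-checking), that is
expected: getFinalValue() blocks until the job completes, while initialValue
blocks only until the timeout:

val partial = rdd.countApprox(timeout = 5000, confidence = 0.90)
val bounded = partial.initialValue  // BoundedDouble, available within ~5s
// partial.getFinalValue()          // would block until full completion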
Hi TD,
Do you know how to cancel the rdd.countApprox(5000) tasks after the timeout?
Otherwise it keeps running until completion, producing results that are never
used while still consuming resources.
Thanks, Du
On Wednesday, May 13, 2015 10:33 AM, Du Li l...@yahoo-inc.com.INVALID
wrote:
Hi TD
On Wednesday, May 6, 2015 7:55 AM, Du Li l...@yahoo-inc.com.INVALID
wrote:
I have to count RDDs in a spark streaming app. When the data grows large,
count() becomes expensive. Has anybody had experience using countApprox()? How
accurate/reliable is it?
The documentation is pretty modest. Suppose the timeout parameter is in
milliseconds. Can I retrieve the count value by
very similar number of
records.
Thanks.
Zhan Zhang
On Mar 4, 2015, at 3:47 PM, Du Li l...@yahoo-inc.com.INVALID wrote:
Hi,
My RDDs are created from a Kafka stream. After receiving an RDD, I want to
coalesce/repartition it so that the data will be processed on a set of
machines in parallel
Hi Spark community,
I searched for a way to configure a heterogeneous cluster because the need
recently emerged in my project. I didn't find any solution out there. Now I
have worked out a solution and thought it might be useful to many other
people with similar needs. Following is a blog post
Is it possible to extend this PR further (or create another PR) to allow for
per-node configuration of workers?
There are many discussions about heterogeneous spark clusters. Currently the
configuration on the master overrides those on the workers. Many spark users
have the need for having machines
Is it being merged in the next release? It's indeed a critical patch!
Du
On Wednesday, January 21, 2015 3:59 PM, Nan Zhu zhunanmcg...@gmail.com
wrote:
…not sure when it will be reviewed…
but for now you can work around by allowing multiple worker instances on a
single machine
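For example, that workaround can be sketched in conf/spark-env.sh on each
machine (the values here are placeholders):

# run two workers on this box, each with its own resource slice
SPARK_WORKER_INSTANCES=2
SPARK_WORKER_CORES=4
SPARK_WORKER_MEMORY=8g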
mailing list for future reference to the community? Might be a good idea
to post both methods with pros and cons, as different users may have different
constraints. :) Thanks :)
TD
On Fri, Mar 6, 2015 at 4:07 PM, Du Li l...@yahoo-inc.com wrote:
Yes but the caveat may not exist if we do this when
Hi,
I have a set of machines (say 5) and want to evenly launch a number (say 8) of
kafka receivers on those machines. In my code I did something like the
following, as suggested in the spark docs:
val streams = (1 to numReceivers).map(_ => ssc.receiverStream(new MyKafkaReceiver()))
Hi,
My RDDs are created from a Kafka stream. After receiving an RDD, I want to
coalesce/repartition it so that the data will be processed on a set of
machines in parallel, as evenly as possible. The number of processing nodes
is larger than the number of receiving nodes.
My question is how the
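A sketch of the shape being described (the partition count and processing
function are placeholders):

dstream.foreachRDD { rdd =>
  // rebalance received data across all processing nodes before the heavy
  // work; repartition shuffles, unlike plain coalesce
  val balanced = rdd.repartition(numProcessingCores)
  balanced.foreachPartition(processPartition)
}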
Figured it out: I need to override the preferredLocation() method in the
MyReceiver class.
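A minimal sketch of that fix (the host parameter and the Kafka internals are
assumptions):

class MyKafkaReceiver(host: String)
    extends org.apache.spark.streaming.receiver.Receiver[String](
      org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_2) {
  // pin this receiver to a specific node so receivers spread evenly
  override def preferredLocation: Option[String] = Some(host)
  def onStart(): Unit = { /* connect to Kafka and call store(...) */ }
  def onStop(): Unit = { }
}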
On Wednesday, March 4, 2015 3:35 PM, Du Li l...@yahoo-inc.com.INVALID
wrote:
Hi,
I have a set of machines (say 5) and want to evenly launch a number (say 8) of
kafka receivers on those machines
streaming context to let all the executors register; then the receivers can
be distributed across the nodes more evenly. Also, setting the locality is
another way, as you mentioned. Thanks, Jerry
From: Du Li
[mailto:l...@yahoo-inc.com.INVALID]
Sent: Thursday, March 5, 2015 1:50 PM
To: User
Subject: Re
Add "file://" in front of your path.
On 11/26/14, 10:15 AM, firemonk9 dhiraj.peech...@gmail.com wrote:
When I am running spark locally, RDD saveAsObjectFile writes the file to the
local file system (e.g. path /data/temp.txt),
and
when I am running spark on a YARN cluster, RDD saveAsObjectFile
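For instance (paths made up), an explicit scheme removes the ambiguity:

rdd.saveAsObjectFile("file:///data/temp.txt")          // local filesystem
rdd.saveAsObjectFile("hdfs:///user/me/data/temp.txt")  // HDFS under YARN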
We have seen all kinds of results published that often contradict each other.
My take is that the authors often know more tricks for tuning their
own/familiar products than the others. So the product in focus is tuned for
ideal performance while the competitors are not. The authors are
)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:697)
... 74 more
From: Cheng Lian lian.cs@gmail.com
Date: Tuesday, October 28, 2014 at 2:50 AM
To: Du Li l...@yahoo
(TSaslServerTransport.java:216)
... 4 more
From: Cheng Lian lian.cs@gmail.com
Date: Tuesday, October 28, 2014 at 2:50 AM
To: Du Li l...@yahoo-inc.com.invalid
Cc: user@spark.apache.org
To clarify, this error was thrown from the thrift server when beeline was
started to establish the connection, as follows:
$ beeline -u jdbc:hive2://`hostname`:4080 -n username
From: Du Li l...@yahoo-inc.com.INVALID
Date: Tuesday, October 28, 2014 at 11:35 AM
Hi,
I was trying to set up Spark SQL on a private cluster. I configured a
hive-site.xml under spark/conf that uses a local metastore, with the warehouse
and default FS name set to HDFS on one of my corporate clusters. Then I
started the spark master, worker, and thrift server. However, when creating a
Thanks for your explanation.
From: Cheng Lian lian.cs@gmail.com
Date: Thursday, October 2, 2014 at 8:01 PM
To: Du Li l...@yahoo-inc.com.INVALID,
d...@spark.apache.org
Hi,
In Spark 1.1 HiveContext, I ran a create partitioned table command followed by
a cache table command and got a java.sql.SQLSyntaxErrorException: Table/View
'PARTITIONS' does not exist. But cache table worked fine if the table was not
partitioned.
Can anybody confirm that cache of
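A sketch of the failing sequence as I understand it (table and column names
are made up), using the Spark 1.1 HiveContext API:

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.hql("CREATE TABLE logs (line STRING) PARTITIONED BY (dt STRING)")
hc.cacheTable("logs")  // this step threw the SQLSyntaxErrorException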
, Du Li wrote:
Hi,
I was loading data into a partitioned table on Spark 1.1.0
beeline-thriftserver. The table has complex data types such as
map<string,string> and array<map<string,string>>. The query is like "insert
overwrite table a partition (…) select …" and the select clause worked if run
Can anybody confirm whether or not views are currently supported in spark? I
found "create view translate" in the blacklist of HiveCompatibilitySuite.scala,
and the following scenario also threw a NullPointerException on
beeline/thriftserver (1.1.0). Any plan to support it soon?
create table
...@databricks.com
Date: Sunday, September 28, 2014 at 12:13 PM
To: Du Li l...@yahoo-inc.com.invalid
Cc: d...@spark.apache.org, user@spark.apache.org
Hi,
I was loading data into a partitioned table on Spark 1.1.0
beeline-thriftserver. The table has complex data types such as
map<string,string> and array<map<string,string>>. The query is like "insert
overwrite table a partition (…) select …" and the select clause worked if run
separately. However,
It might be a problem when inserting into a partitioned table. It worked
fine when the target table was unpartitioned.
Can you confirm this?
Thanks,
Du
On 9/26/14, 4:48 PM, Du Li l...@yahoo-inc.com.INVALID wrote:
Hi,
I was loading data into a partitioned table on Spark 1.1.0
beeline
Thanks, Yanbo and Nicholas. Now it makes more sense — query optimization is the
answer. /Du
From: Nicholas Chammas nicholas.cham...@gmail.com
Date: Thursday, September 25, 2014 at 6:43 AM
To: Yanbo Liang yanboha...@gmail.com
Cc: Du Li
Hi,
The following query does not work in Shark nor in the new Spark SQLContext or
HiveContext.
SELECT key, value, concat(key, value) as combined from src where combined like
'11%';
The following tweak of syntax works fine although a bit ugly.
SELECT key, value, concat(key, value) as combined
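The snippet is cut off above, but one workaround of this shape (a guess, not
necessarily the exact query from the thread) is to push the alias into a
subquery so the outer WHERE clause can see it:

SELECT key, value, combined
FROM (SELECT key, value, concat(key, value) AS combined FROM src) t
WHERE combined LIKE '11%';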
Hi,
After executing sql() in SQLContext or HiveContext, is there a way to tell
whether the query/command succeeded or failed? Method sql() returns SchemaRDD
which either is empty or contains some Rows of results. However, some queries
and commands do not return results by nature; being empty
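One way to approximate a status check (an assumption on my part, since sql()
itself returns no flag) is to force evaluation and catch exceptions:

import scala.util.{Try, Success, Failure}
Try(hc.sql("INSERT OVERWRITE TABLE t SELECT * FROM src").collect()) match {
  case Success(_) => println("statement succeeded")
  case Failure(e) => println(s"statement failed: ${e.getMessage}")
}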
, 2014 at 7:17 AM
To: Du Li l...@yahoo-inc.com.INVALID
Cc: Michael Armbrust mich...@databricks.com, Cheng, Hao hao.ch...@intel.com,
user@spark.apache.org
Hi,
I wonder if anybody has had a similar experience or has any suggestions here.
I have an akka Actor that processes database requests as high-level messages.
Inside this Actor, it creates a HiveContext object that does the actual db
work. The main thread creates the needed SparkContext and passes it in to the
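Roughly, the structure being described looks like this (the message protocol
and names are simplified placeholders):

class DbActor(sc: SparkContext) extends akka.actor.Actor {
  // one HiveContext per actor, built lazily from the shared SparkContext
  lazy val hc = new org.apache.spark.sql.hive.HiveContext(sc)
  def receive = {
    case query: String => sender ! hc.hql(query).collect()
  }
}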
...@intel.com
Cc: Du Li l...@yahoo-inc.com.invalid, user@spark.apache.org
Subject: Re: problem with HiveContext inside Actor
- dev
Is it possible that you are constructing more than one
("./test_data")
val rdd2 = sc.sequenceFile("./test_data", classOf[NullWritable],
classOf[Text])
assert(rdd.first == rdd2.first._2.toString)
}
}
From: Matei Zaharia matei.zaha...@gmail.com
Date: Monday, September 15, 2014 at 10:52 PM
To: Du Li l...@yahoo
There is a parameter spark.speculation that is turned off by default. Look at
the configuration doc: http://spark.apache.org/docs/latest/configuration.html
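For example (the values here are illustrative, not recommendations):

val conf = new org.apache.spark.SparkConf()
  .set("spark.speculation", "true")            // re-launch slow tasks
  .set("spark.speculation.multiplier", "1.5")  // "slow" = 1.5x the median runtime
  .set("spark.speculation.quantile", "0.75")   // check after 75% of tasks finish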
From: Pramod Biligiri pramodbilig...@gmail.com
Date: Monday, September 15, 2014 at 3:30 PM
To:
To: Du Li l...@yahoo-inc.com.invalid, user@spark.apache.org,
d...@spark.apache.org
Subject: Re: NullWritable
Hi,
I was trying the following on spark-shell (built with apache master and hadoop
2.4.0). Both calling rdd2.collect and calling rdd3.collect threw
java.io.NotSerializableException: org.apache.hadoop.io.NullWritable.
I got the same problem in similar code of my app which uses the newly
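A common workaround sketch (reusing the rdd2 from the snippet above): convert
the Writables to plain Scala types before anything that needs to serialize
them, such as collect:

val rdd3 = rdd2.map { case (_, text) => text.toString }  // Text -> String
rdd3.collect()  // only serializable Strings cross the wire now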
Hi,
I have the following code snippet. It works fine in spark-shell, but in a
standalone app it reports "No TypeTag available for MySchema" at compile time
when calling hc.createSchemaRDD(rdd). Does anybody know what might be missing?
Thanks,
Du
--
import org.apache.spark.sql.hive.HiveContext
Solved it.
The problem occurred because the case class was defined within a test case in
FunSuite. Moving the case class definition out of the test fixed the problem.
From: Du Li l...@yahoo-inc.com.INVALID
Date: Thursday, September 11, 2014 at 11:25 AM
To: user
The implementation of SparkSQL is currently incomplete. You may try it out
with HiveContext instead of SQLContext.
On 9/11/14, 1:21 PM, jamborta jambo...@gmail.com wrote:
Hi,
I am trying to create a new table from a select query as follows:
CREATE TABLE IF NOT EXISTS new_table ROW FORMAT
Just moving it out of the test is not enough. The case class definition must
be moved to the top level. Otherwise it reports a runtime "task not
serializable" error when executing collect().
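A sketch of the working layout (suite and field names are made up):

// top level of the file: gives the compiler a TypeTag and keeps the
// closure serializable
case class MySchema(id: Int, name: String)

class MySuite extends org.scalatest.FunSuite {
  test("createSchemaRDD compiles and runs") {
    val hc = new org.apache.spark.sql.hive.HiveContext(sc)  // sc: an existing SparkContext
    val schemaRdd = hc.createSchemaRDD(sc.parallelize(Seq(MySchema(1, "a"))))
    schemaRdd.registerTempTable("my_schema")
  }
}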
From: Du Li l...@yahoo-inc.com.INVALID
Date: Thursday, September 11
SchemaRDD has a method insertInto(table). When the table is partitioned, it
would be more sensible and convenient to extend it with a list of partition
keys and values.
From: Denny Lee denny.g@gmail.com
Date: Thursday, September 11, 2014 at 6:39 PM
To: Du Li l
Hi Denny,
There is a related question by the way.
I have a program that reads in a stream of RDDs, each of which is to be
loaded into a hive table as one partition. Currently I do this by first
writing the RDDs to HDFS and then loading them into hive, which requires
multiple passes of HDFS I/O
Your tables were registered in the SQLContext, whereas the thrift server
works with HiveContext. They seem to be in two different worlds today.
On 9/9/14, 5:16 PM, alexandria1101 alexandria.shea...@gmail.com wrote:
Hi,
I want to use the sparksql thrift server in my application and make sure
As suggested in the error messages, double-check your class path.
From: CharlieLin chury...@gmail.com
Date: Tuesday, August 26, 2014 at 8:29 PM
To: user@spark.apache.org
Subject: Execute
Hi, Michael.
I used HiveContext to create a table with a field of type Array. However, in
the hql results, this field was returned as type ArrayBuffer, which is
mutable. Would it make more sense for it to be an Array?
The Spark version of my test is 1.0.2. I haven't tested it on SQLContext or
newer
mich...@databricks.com
Date: Wednesday, August 27, 2014 at 5:21 PM
To: Du Li l...@yahoo-inc.com
Cc: user@spark.apache.org
Subject: Re: SparkSQL returns ArrayBuffer
Hi,
I created an instance of LocalHiveContext and attempted to create a database.
However, it failed with the message
org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution
Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
java.lang.RuntimeException: Unable to
be the
source of your problem.)
On Thu, Aug 21, 2014 at 4:23 PM, Du Li l...@yahoo-inc.com.invalid
wrote:
Hi,
This guava dependency conflict problem should have been fixed as of
yesterday according to
https://issues.apache.org/jira/browse/SPARK-2420
However, I just got
Hi,
This guava dependency conflict problem should have been fixed as of yesterday
according to https://issues.apache.org/jira/browse/SPARK-2420
However, I just got java.lang.NoSuchMethodError:
com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
by the following code