Following is a method that retrieves the list of executors registered to a
spark context. It worked perfectly with spark-submit in standalone mode for my
project.
/**
 * A simplified method that just returns the current active/registered executors
 * excluding the driver.
 * @param sc
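The snippet above is cut off; a minimal sketch of what such a method can look
like (assuming Spark 1.x, where getExecutorMemoryStatus is keyed by
"host:port" strings and also lists the driver):

def currentActiveExecutors(sc: SparkContext): Seq[String] = {
  // every block manager, driver included
  val allExecutors = sc.getExecutorMemoryStatus.map(_._1).toSeq
  val driverHost = sc.getConf.get("spark.driver.host")
  // keep only entries whose host part differs from the driver's host
  allExecutors.filter(!_.split(":")(0).equals(driverHost))
}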
I got the same problem with rdd.repartition() in my streaming app, which
generated a few huge partitions and many tiny partitions. The resulting data
skew made the processing time of a batch unpredictable, often exceeding the
batch interval. I eventually solved the problem by using
repartition() means coalesce(shuffle=false)
On Thursday, June 18, 2015 4:07 PM, Corey Nolet cjno...@gmail.com wrote:
Doesn't repartition call coalesce(shuffle=true)? On Jun 18, 2015 6:53 PM, Du
Li l...@yahoo-inc.com.invalid wrote:
I got the same problem with rdd.repartition() in my
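For reference, in the Spark 1.x source (RDD.scala) repartition is literally
coalesce with the shuffle flag turned on:

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] =
  coalesce(numPartitions, shuffle = true)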
Event log is enabled in my spark streaming app. My code runs in standalone mode
and the spark version is 1.3.1. I periodically stop and restart the streaming
context by calling ssc.stop(). However, from the web UI, when clicking on a
past job, it says the job is still in progress and does not
Spark master). Things move really fast
between releases. 1.1.1 feels really old to me :P
TD
On Wed, May 13, 2015 at 1:25 PM, Du Li l...@yahoo-inc.com wrote:
I do rdd.countApprox() and rdd.sparkContext.setJobGroup() inside
dstream.foreachRDD{...}. After calling cancelJobGroup(), the spark context
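A rough sketch of that pattern (the group name and timeout here are made up;
setJobGroup and cancelJobGroup are standard SparkContext methods):

dstream.foreachRDD { rdd =>
  val sc = rdd.sparkContext
  sc.setJobGroup("approx-count", "time-bounded count", interruptOnCancel = true)
  val approx = rdd.countApprox(5000).initialValue  // returns by the timeout
  sc.cancelJobGroup("approx-count")  // kill any tasks still running
  println(s"approximate count: ${approx.mean}")
}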
PM, Tathagata Das t...@databricks.com
wrote:
That is not supposed to happen :/ That is probably a bug. If you have the
log4j logs, it would be good to file a JIRA. This may be worth debugging.
On Wed, May 13, 2015 at 12:13 PM, Du Li l...@yahoo-inc.com wrote:
Actually I tried that before asking
received by rdd.count()
On Tue, May 12, 2015 at 5:03 PM, Du Li l...@yahoo-inc.com.invalid wrote:
Hi,
I tested the following in my streaming app, hoping to get an approximate
count within 5 seconds. However, rdd.countApprox(5000).getFinalValue() seemed
to return only after the job had finished completely
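If I read the PartialResult API correctly (worth double-checking), that is
expected: getFinalValue() blocks until the job completes, while initialValue
blocks only until the timeout:

val partial = rdd.countApprox(timeout = 5000, confidence = 0.90)
val bounded = partial.initialValue  // BoundedDouble, available within ~5s
// partial.getFinalValue()          // would block until full completion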
Hi TD,
Do you know how to cancel the rdd.countApprox(5000) tasks after the timeout?
Otherwise it keeps running until completion, producing results that are never
used while still consuming resources.
Thanks, Du
On Wednesday, May 13, 2015 10:33 AM, Du Li l...@yahoo-inc.com.INVALID
wrote:
Hi TD
On Wednesday, May 6, 2015 7:55 AM, Du Li l...@yahoo-inc.com.INVALID
wrote:
I have to count RDDs in a spark streaming app. When the data grows large,
count() becomes expensive. Has anybody had experience using countApprox()? How
accurate/reliable is it?
The documentation is pretty modest. Suppose the timeout parameter is in
milliseconds. Can I retrieve the count value by
very similar number of
records.
Thanks.
Zhan Zhang
On Mar 4, 2015, at 3:47 PM, Du Li l...@yahoo-inc.com.INVALID wrote:
Hi,
My RDDs are created from a Kafka stream. After receiving an RDD, I want to
coalesce/repartition it so that the data will be processed on a set of
machines in parallel
Hi Spark community,
I searched for a way to configure a heterogeneous cluster because the need
recently emerged in my project. I didn't find any solution out there. Now I
have worked out a solution and thought it might be useful to many other
people with similar needs. Following is a blog post
Is it possible to extend this PR further (or create another PR) to allow for
per-node configuration of workers?
There are many discussions about heterogeneous spark clusters. Currently the
configuration on the master overrides those on the workers. Many spark users
have the need for having machines
Is it being merged in the next release? It's indeed a critical patch!
Du
On Wednesday, January 21, 2015 3:59 PM, Nan Zhu zhunanmcg...@gmail.com
wrote:
…not sure when it will be reviewed…
but for now you can work around by allowing multiple worker instances on a
single machine
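For example, that workaround can be sketched in conf/spark-env.sh on each
machine (the values here are placeholders):

# run two workers on this box, each with its own resource slice
SPARK_WORKER_INSTANCES=2
SPARK_WORKER_CORES=4
SPARK_WORKER_MEMORY=8g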
mailing list for future reference to the community? Might be a good idea
to post both methods with pros and cons, as different users may have different
constraints. :) Thanks :)
TD
On Fri, Mar 6, 2015 at 4:07 PM, Du Li l...@yahoo-inc.com wrote:
Yes but the caveat may not exist if we do this when
Hi,
I have a set of machines (say 5) and want to evenly launch a number (say 8) of
kafka receivers on those machines. In my code I did something like the
following, as suggested in the spark docs:
val streams = (1 to numReceivers).map(_ => ssc.receiverStream(new MyKafkaReceiver()))
Hi,
My RDDs are created from a Kafka stream. After receiving an RDD, I want to
coalesce/repartition it so that the data will be processed on a set of
machines in parallel, as evenly as possible. The number of processing nodes
is larger than the number of receiving nodes.
My question is how the
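A sketch of the shape being described (the partition count and processing
function are placeholders):

dstream.foreachRDD { rdd =>
  // rebalance received data across all processing nodes before the heavy
  // work; repartition shuffles, unlike plain coalesce
  val balanced = rdd.repartition(numProcessingCores)
  balanced.foreachPartition(processPartition)
}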
Figured it out: I need to override the preferredLocation() method in the
MyReceiver class.
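A minimal sketch of that fix (the host parameter and the Kafka internals are
assumptions):

class MyKafkaReceiver(host: String)
    extends org.apache.spark.streaming.receiver.Receiver[String](
      org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_2) {
  // pin this receiver to a specific node so receivers spread evenly
  override def preferredLocation: Option[String] = Some(host)
  def onStart(): Unit = { /* connect to Kafka and call store(...) */ }
  def onStop(): Unit = { }
}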
On Wednesday, March 4, 2015 3:35 PM, Du Li l...@yahoo-inc.com.INVALID
wrote:
Hi,
I have a set of machines (say 5) and want to evenly launch a number (say 8) of
kafka receivers on those machines
streaming context to let all the executors register; then the receivers can
be distributed across the nodes more evenly. Also, setting the locality is
another way, as you mentioned. Thanks, Jerry
From: Du Li
[mailto:l...@yahoo-inc.com.INVALID]
Sent: Thursday, March 5, 2015 1:50 PM
To: User
Subject: Re
Add "file://" in front of your path.
On 11/26/14, 10:15 AM, firemonk9 dhiraj.peech...@gmail.com wrote:
When I am running spark locally, RDD saveAsObjectFile writes the file to the
local file system (e.g. path /data/temp.txt),
and
when I am running spark on a YARN cluster, RDD saveAsObjectFile
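For instance (paths made up), an explicit scheme removes the ambiguity:

rdd.saveAsObjectFile("file:///data/temp.txt")          // local filesystem
rdd.saveAsObjectFile("hdfs:///user/me/data/temp.txt")  // HDFS under YARN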
We have seen all kinds of results published that often contradict each other.
My take is that the authors often know more tricks for tuning their
own/familiar products than the others. So the product in focus is tuned for
ideal performance while the competitors are not. The authors are
)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:697)
... 74 more
From: Cheng Lian lian.cs@gmail.com
Date: Tuesday, October 28, 2014 at 2:50 AM
To: Du Li l...@yahoo
(TSaslServerTransport.java:216)
... 4 more
From: Cheng Lian lian.cs@gmail.com
Date: Tuesday, October 28, 2014 at 2:50 AM
To: Du Li l...@yahoo-inc.com.invalid
Cc: user@spark.apache.org
To clarify, this error was thrown from the thrift server when beeline was
started to establish the connection, as follows:
$ beeline -u jdbc:hive2://`hostname`:4080 -n username
From: Du Li l...@yahoo-inc.com.INVALID
Date: Tuesday, October 28, 2014 at 11:35 AM
Hi,
I was trying to set up Spark SQL on a private cluster. I configured a
hive-site.xml under spark/conf that uses a local metastore, with the warehouse
and default FS name set to HDFS on one of my corporate clusters. Then I
started the spark master, worker, and thrift server. However, when creating a
Thanks for your explanation.
From: Cheng Lian lian.cs@gmail.com
Date: Thursday, October 2, 2014 at 8:01 PM
To: Du Li l...@yahoo-inc.com.INVALID,
d...@spark.apache.org
Hi,
In Spark 1.1 HiveContext, I ran a create partitioned table command followed by
a cache table command and got a java.sql.SQLSyntaxErrorException: Table/View
'PARTITIONS' does not exist. But cache table worked fine if the table was not
partitioned.
Can anybody confirm that cache of
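A sketch of the failing sequence as I understand it (table and column names
are made up), using the Spark 1.1 HiveContext API:

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.hql("CREATE TABLE logs (line STRING) PARTITIONED BY (dt STRING)")
hc.cacheTable("logs")  // this step threw the SQLSyntaxErrorException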
, Du Li wrote:
Hi,
I was loading data into a partitioned table on Spark 1.1.0
beeline-thriftserver. The table has complex data types such as
map<string,string> and array<map<string,string>>. The query is like "insert
overwrite table a partition (…) select …" and the select clause worked if run
Can anybody confirm whether or not views are currently supported in spark? I
found "create view translate" in the blacklist of HiveCompatibilitySuite.scala,
and the following scenario also threw a NullPointerException on
beeline/thriftserver (1.1.0). Any plan to support it soon?
create table
...@databricks.com
Date: Sunday, September 28, 2014 at 12:13 PM
To: Du Li l...@yahoo-inc.com.invalid
Cc: d...@spark.apache.org, user@spark.apache.org
Hi,
I was loading data into a partitioned table on Spark 1.1.0
beeline-thriftserver. The table has complex data types such as
map<string,string> and array<map<string,string>>. The query is like "insert
overwrite table a partition (…) select …" and the select clause worked if run
separately. However,
It might be a problem when inserting into a partitioned table. It worked
fine when the target table was unpartitioned.
Can you confirm this?
Thanks,
Du
On 9/26/14, 4:48 PM, Du Li l...@yahoo-inc.com.INVALID wrote:
Hi,
I was loading data into a partitioned table on Spark 1.1.0
beeline
Thanks, Yanbo and Nicholas. Now it makes more sense — query optimization is the
answer. /Du
From: Nicholas Chammas nicholas.cham...@gmail.com
Date: Thursday, September 25, 2014 at 6:43 AM
To: Yanbo Liang yanboha...@gmail.com
Cc: Du Li
Hi,
The following query does not work in Shark nor in the new Spark SQLContext or
HiveContext.
SELECT key, value, concat(key, value) as combined from src where combined like
'11%';
The following tweak of syntax works fine although a bit ugly.
SELECT key, value, concat(key, value) as combined
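The snippet is cut off above, but one workaround of this shape (a guess, not
necessarily the exact query from the thread) is to push the alias into a
subquery so the outer WHERE clause can see it:

SELECT key, value, combined
FROM (SELECT key, value, concat(key, value) AS combined FROM src) t
WHERE combined LIKE '11%';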
Hi,
After executing sql() in SQLContext or HiveContext, is there a way to tell
whether the query/command succeeded or failed? Method sql() returns SchemaRDD
which either is empty or contains some Rows of results. However, some queries
and commands do not return results by nature; being empty
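One way to approximate a status check (an assumption on my part, since sql()
itself returns no flag) is to force evaluation and catch exceptions:

import scala.util.{Try, Success, Failure}
Try(hc.sql("INSERT OVERWRITE TABLE t SELECT * FROM src").collect()) match {
  case Success(_) => println("statement succeeded")
  case Failure(e) => println(s"statement failed: ${e.getMessage}")
}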
, 2014 at 7:17 AM
To: Du Li l...@yahoo-inc.com.INVALID
Cc: Michael Armbrust mich...@databricks.com, Cheng, Hao hao.ch...@intel.com,
user@spark.apache.org
Hi,
I wonder if anybody has had a similar experience or has any suggestions here.
I have an akka Actor that processes database requests as high-level messages.
Inside this Actor, it creates a HiveContext object that does the actual db
work. The main thread creates the needed SparkContext and passes it in to the
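Roughly, the structure being described looks like this (the message protocol
and names are simplified placeholders):

class DbActor(sc: SparkContext) extends akka.actor.Actor {
  // one HiveContext per actor, built lazily from the shared SparkContext
  lazy val hc = new org.apache.spark.sql.hive.HiveContext(sc)
  def receive = {
    case query: String => sender ! hc.hql(query).collect()
  }
}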
...@intel.com
Cc: Du Li l...@yahoo-inc.com.invalid, user@spark.apache.org
Subject: Re: problem with HiveContext inside Actor
- dev
Is it possible that you are constructing more than one
("./test_data")
val rdd2 = sc.sequenceFile("./test_data", classOf[NullWritable],
classOf[Text])
assert(rdd.first == rdd2.first._2.toString)
}
}
From: Matei Zaharia matei.zaha...@gmail.com
Date: Monday, September 15, 2014 at 10:52 PM
To: Du Li l...@yahoo
There is a parameter spark.speculation that is turned off by default. Look at
the configuration doc: http://spark.apache.org/docs/latest/configuration.html
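For example (the values here are illustrative, not recommendations):

val conf = new org.apache.spark.SparkConf()
  .set("spark.speculation", "true")            // re-launch slow tasks
  .set("spark.speculation.multiplier", "1.5")  // "slow" = 1.5x the median runtime
  .set("spark.speculation.quantile", "0.75")   // check after 75% of tasks finish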
From: Pramod Biligiri pramodbilig...@gmail.com
Date: Monday, September 15, 2014 at 3:30 PM
To:
To: Du Li l...@yahoo-inc.com.invalid, user@spark.apache.org,
d...@spark.apache.org
Subject: Re: NullWritable
Hi,
I was trying the following on spark-shell (built with apache master and hadoop
2.4.0). Both calling rdd2.collect and calling rdd3.collect threw
java.io.NotSerializableException: org.apache.hadoop.io.NullWritable.
I got the same problem in similar code of my app which uses the newly
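A common workaround sketch (reusing the rdd2 from the snippet above): convert
the Writables to plain Scala types before anything that needs to serialize
them, such as collect:

val rdd3 = rdd2.map { case (_, text) => text.toString }  // Text -> String
rdd3.collect()  // only serializable Strings cross the wire now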
Hi,
I have the following code snippet. It works fine in spark-shell, but in a
standalone app it reports "No TypeTag available for MySchema" at compile time
when calling hc.createSchemaRDD(rdd). Does anybody know what might be missing?
Thanks,
Du
--
import org.apache.spark.sql.hive.HiveContext
Solved it.
The problem occurred because the case class was defined within a test case in
FunSuite. Moving the case class definition out of the test fixed the problem.
From: Du Li l...@yahoo-inc.com.INVALID
Date: Thursday, September 11, 2014 at 11:25 AM
To: user
The implementation of SparkSQL is currently incomplete. You may try it out
with HiveContext instead of SQLContext.
On 9/11/14, 1:21 PM, jamborta jambo...@gmail.com wrote:
Hi,
I am trying to create a new table from a select query as follows:
CREATE TABLE IF NOT EXISTS new_table ROW FORMAT
Just moving it out of the test is not enough. The case class definition must
be moved to the top level. Otherwise it reports a runtime "task not
serializable" error when executing collect().
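A sketch of the working layout (suite and field names are made up):

// top level of the file: gives the compiler a TypeTag and keeps the
// closure serializable
case class MySchema(id: Int, name: String)

class MySuite extends org.scalatest.FunSuite {
  test("createSchemaRDD compiles and runs") {
    val hc = new org.apache.spark.sql.hive.HiveContext(sc)  // sc: an existing SparkContext
    val schemaRdd = hc.createSchemaRDD(sc.parallelize(Seq(MySchema(1, "a"))))
    schemaRdd.registerTempTable("my_schema")
  }
}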
From: Du Li l...@yahoo-inc.com.INVALID
Date: Thursday, September 11
SchemaRDD has a method insertInto(table). When the table is partitioned, it
would be more sensible and convenient to extend it with a list of partition
keys and values.
From: Denny Lee denny.g@gmail.com
Date: Thursday, September 11, 2014 at 6:39 PM
To: Du Li l
Hi Denny,
There is a related question by the way.
I have a program that reads in a stream of RDDs, each of which is to be
loaded into a hive table as one partition. Currently I do this by first
writing the RDDs to HDFS and then loading them into hive, which requires
multiple passes of HDFS I/O
Your tables were registered in the SQLContext, whereas the thrift server
works with HiveContext. They seem to be in two different worlds today.
On 9/9/14, 5:16 PM, alexandria1101 alexandria.shea...@gmail.com wrote:
Hi,
I want to use the sparksql thrift server in my application and make sure
As suggested in the error messages, double-check your class path.
From: CharlieLin chury...@gmail.com
Date: Tuesday, August 26, 2014 at 8:29 PM
To: user@spark.apache.org
Subject: Execute
Hi, Michael.
I used HiveContext to create a table with a field of type Array. However, in
the hql results, this field was returned as type ArrayBuffer, which is
mutable. Would it make more sense for it to be an Array?
The Spark version of my test is 1.0.2. I haven't tested it on SQLContext or
newer
mich...@databricks.com
Date: Wednesday, August 27, 2014 at 5:21 PM
To: Du Li l...@yahoo-inc.com
Cc: user@spark.apache.org
Subject: Re: SparkSQL returns ArrayBuffer
Hi,
I created an instance of LocalHiveContext and attempted to create a database.
However, it failed with the message
org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution
Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
java.lang.RuntimeException: Unable to
be the
source of your problem.)
On Thu, Aug 21, 2014 at 4:23 PM, Du Li l...@yahoo-inc.com.invalid
wrote:
Hi,
This guava dependency conflict problem should have been fixed as of
yesterday according to
https://issues.apache.org/jira/browse/SPARK-2420
However, I just got
Hi,
This guava dependency conflict problem should have been fixed as of yesterday
according to https://issues.apache.org/jira/browse/SPARK-2420
However, I just got java.lang.NoSuchMethodError:
com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
by the following code