Why does the Spark master consume 100% CPU when we kill a Spark Streaming app?

2015-03-10 Thread Xuelin Cao
Hey, recently we found in our cluster that when we kill a Spark Streaming app, the whole cluster cannot respond for 10 minutes. We investigated the master node and found that the master process consumes 100% CPU when we kill the Spark Streaming app. How could this happen? Did
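A hedged mitigation sketch (not from the thread itself): stop the streaming app from inside rather than killing the process, so executors deregister cleanly and the master has less cleanup to do at once. The app setup here is hypothetical.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("stream-app")
    val ssc = new StreamingContext(conf, Seconds(10))
    // ... define the input streams and output operations, then:
    ssc.start()

    // Graceful shutdown: finish in-flight batches, then stop everything.
    ssc.stop(stopSparkContext = true, stopGracefully = true)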

In HA master mode, how do we identify the alive master?

2015-03-04 Thread Xuelin Cao
Hi, in our project we use standalone dual masters + ZooKeeper for Spark master HA. Now the problem is: how do we know which master is the currently alive one? We tried to read the info that the master stores in ZooKeeper, but we found there is no information to
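One sketch of a workaround, under the assumption (worth verifying on your build) that the standalone master web UI serves a JSON summary at http://&lt;master&gt;:8080/json/ whose status field reads ALIVE on the active master and STANDBY on the rest. Hostnames are placeholders.

    import scala.io.Source

    val masters = Seq("master1:8080", "master2:8080")
    val alive = masters.find { m =>
      try Source.fromURL(s"http://$m/json/").mkString.contains("ALIVE")
      catch { case _: Exception => false } // unreachable master counts as not alive
    }
    println(alive.getOrElse("no alive master reachable"))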

IF statement doesn't work in Spark SQL?

2015-01-20 Thread Xuelin Cao
Hi, I'm trying to migrate some Hive scripts to Spark SQL. However, I found some statements are incompatible in Spark SQL. Here is my SQL, and the same SQL works fine in the Hive environment: SELECT if(ad_user_id > 1000, 1000, ad_user_id) AS user_id FROM
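Two hedged workarounds, assuming an existing SparkContext sc; the table name is hypothetical, since the FROM clause is truncated above. IF(...) is a Hive builtin, so it resolves under a HiveContext, while the plain SQLContext parser of that era understood CASE WHEN:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SELECT if(ad_user_id > 1000, 1000, ad_user_id) AS user_id FROM ad_table")

    val sqlContext = new SQLContext(sc)
    sqlContext.sql("SELECT CASE WHEN ad_user_id > 1000 THEN 1000 ELSE ad_user_id END AS user_id FROM ad_table")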

Re: IF statement doesn't work in Spark SQL?

2015-01-20 Thread Xuelin Cao
, DEVAN M.S. msdeva...@gmail.com wrote: Which context are you using, HiveContext or SQLContext? Can you try with HiveContext? Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA VIDYAPEETHAM | Amritapuri | Cell +919946535290 | On Tue, Jan 20, 2015 at 3:49 PM, Xuelin Cao

Re: IF statement doesn't work in Spark SQL?

2015-01-20 Thread Xuelin Cao
Hi, I'm using Spark 1.2. On Tue, Jan 20, 2015 at 5:59 PM, Wang, Daoyuan daoyuan.w...@intel.com wrote: Hi Xuelin, What version of Spark are you using? Thanks, Daoyuan From: Xuelin Cao [mailto:xuelincao2...@gmail.com] Sent: Tuesday, January 20, 2015 5:22 PM To: User

Re: How to create an empty RDD with a given type?

2015-01-12 Thread Xuelin Cao
12, 2015 at 9:50 PM, Xuelin Cao xuelincao2...@gmail.com wrote: Hi, I'd like to create a transform function that converts RDD[String] to RDD[Int]. Occasionally, the input RDD could be an empty RDD. I just want to directly create an empty RDD[Int] if the input RDD is empty. And, I
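A minimal sketch of the usual answer: SparkContext.emptyRDD gives a typed, zero-partition RDD, so the transform can short-circuit on empty input. take(1) runs a small job just to test emptiness (RDD.isEmpty did not exist yet in Spark 1.2).

    import org.apache.spark.rdd.RDD

    def toInts(in: RDD[String]): RDD[Int] =
      if (in.take(1).isEmpty) in.context.emptyRDD[Int]
      else in.map(_.toInt)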

Re: Did anyone try overcommitting CPU cores?

2015-01-09 Thread Xuelin Cao
.). Why not increase the tasks per core? Best regards On 9 Jan 2015 06:46, Xuelin Cao xuelincao2...@gmail.com wrote: Hi, I'm wondering whether it is a good idea to overcommit CPU cores on the Spark cluster. For example, in our testing cluster, each worker machine has 24

Did anyone try overcommitting CPU cores?

2015-01-08 Thread Xuelin Cao
Hi, I'm wondering whether it is a good idea to overcommit CPU cores on the Spark cluster. For example, in our testing cluster, each worker machine has 24 physical CPU cores. However, we are allowed to set the CPU core number to 48 or more in the Spark configuration file. As a result,
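A sketch of the standalone knobs involved, as I understand them: a worker advertises whatever core count it is configured with, not what the hardware has, so export SPARK_WORKER_CORES=48 in conf/spark-env.sh on a 24-core box is the 2x overcommit described above. On the application side nothing changes; whether it helps depends on how I/O-bound the tasks are.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("overcommit-test")
      .set("spark.cores.max", "48") // logical cores, possibly 2x the physical ones
    val sc = new SparkContext(conf)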

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Xuelin Cao
of the input data for each task is also 1212.5MB On Thu, Jan 8, 2015 at 6:40 PM, Cheng Lian lian.cs@gmail.com wrote: Hey Xuelin, which data item in the Web UI did you check? On 1/7/15 5:37 PM, Xuelin Cao wrote: Hi, Curiouser and curiouser. I'm puzzled by the Spark SQL cached table

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Xuelin Cao
multiple times to generate a larger file. Cheng On 1/8/15 7:43 PM, Xuelin Cao wrote: Hi Cheng, I checked the Input data for each stage. For example, in my attached screen snapshot, the input data is 1212.5MB, which is the total amount of the whole table

Spark SQL: The cached columnar table is not columnar?

2015-01-07 Thread Xuelin Cao
Hi, curiouser and curiouser: I'm puzzled by the Spark SQL cached table. Theoretically, the cached table should be a columnar table, and a query should scan only the columns included in my SQL. However, in my test, I always see the whole table scanned even though I only select one column in
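A sketch of the setup under discussion, assuming a SQLContext sqlContext and a hypothetical table name: cacheTable builds compressed, columnar in-memory batches on the first scan, after which a single-column query should read only that column's batches.

    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
    sqlContext.cacheTable("adTable")
    sqlContext.sql("SELECT ad_user_id FROM adTable").count() // first scan materializes the cache
    sqlContext.sql("SELECT ad_user_id FROM adTable").count() // should now be column-pruned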

Re: Why doesn't Parquet predicate pushdown work?

2015-01-07 Thread Xuelin Cao
. https://issues.apache.org/jira/browse/SPARK-4258 You can turn it on if you want: http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration Daniel On 7 Jan 2015, at 08:18, Xuelin Cao xuelin...@yahoo.com.INVALID wrote: Hi, I'm testing the Parquet file format

Can Spark support task-level resource management?

2015-01-07 Thread Xuelin Cao
Hi, currently we are building up a mid-scale Spark cluster (100 nodes) in our company. One thing bothering us is how Spark manages resources (CPU, memory). I know there are 3 resource management modes: standalone, Mesos, and YARN. In standalone mode, the cluster
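A hedged sketch of the closest thing to task-level management in that era: Mesos fine-grained mode, where CPU cores are granted per task and returned as tasks finish (executor memory stays allocated). The ZooKeeper URL is a placeholder.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("fine-grained-demo")
      .setMaster("mesos://zk://zk1:2181,zk2:2181/mesos")
      .set("spark.mesos.coarse", "false") // fine-grained: per-task core grants
    val sc = new SparkContext(conf)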

Re: Can Spark support task-level resource management?

2015-01-07 Thread Xuelin Cao
applications you want to be running besides Spark in the same cluster and also your use cases, to see what resource management fits your need. Tim On Wed, Jan 7, 2015 at 10:55 PM, Xuelin Cao xuelincao2...@gmail.com wrote: Hi, currently we are building up a mid-scale Spark cluster (100

Re: Can Spark support task-level resource management?

2015-01-07 Thread Xuelin Cao
, 2015 at 11:19 PM, Xuelin Cao xuelincao2...@gmail.com wrote: Hi, thanks for the information. One more thing I want to clarify: when does Mesos or YARN allocate and release the resources? That is, what is the resource lifetime? For example, in standalone mode, the resource

Why doesn't Parquet predicate pushdown work?

2015-01-06 Thread Xuelin Cao
Hi, I'm testing the Parquet file format, and predicate pushdown is a very useful feature for us. However, it looks like predicate pushdown doesn't work after I set sqlContext.sql("SET spark.sql.parquet.filterPushdown=true"). Here is my SQL:
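A sketch of turning the flag on, per the JIRA and configuration links in the reply above; it was off by default at the time. Assumes a SQLContext sqlContext; either form should take effect.

    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
    // or, equivalently, through SQL:
    sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")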

When will Spark SQL support building DB indexes natively?

2014-12-17 Thread Xuelin Cao
Hi, in the Spark SQL help document, it says: "Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model. Others are slotted for future releases of Spark SQL - block level bitmap indexes and virtual columns (used to build indexes)." For our

Re: When will Spark SQL support building DB indexes natively?

2014-12-17 Thread Xuelin Cao
looked at partitioned table support? That would only scan data where the predicate matches the partition. Depending on the cardinality of the customerId column, that could be a good option for you. On Wed, Dec 17, 2014 at 2:25 AM, Xuelin Cao xuelin...@yahoo.com.invalid wrote: Hi, in the Spark
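A sketch of that suggestion, assuming a SparkContext sc and hypothetical table/column names (customerId is the column mentioned in the reply): a predicate on a partition column prunes whole partition directories, with no index needed.

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)
    hc.sql("""CREATE TABLE IF NOT EXISTS orders (amount DOUBLE, ts BIGINT)
              PARTITIONED BY (customerId INT)""")
    hc.sql("SELECT count(*) FROM orders WHERE customerId = 42") // scans only that partition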

Why can't my SQL UDF be registered?

2014-12-15 Thread Xuelin Cao
Hi, I tried to create a function to convert a Unix timestamp to the hour of the day. It works if the code is like this: sqlContext.registerFunction("toHour", (x: Long) => new java.util.Date(x * 1000).getHours). But if I do it like this, it doesn't work: def toHour
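A hedged guess at the failure mode, assuming a SQLContext sqlContext: registerFunction takes a function value, and its many overloads can defeat automatic eta-expansion of a bare method reference. Passing a lambda, or the method with an explicit trailing underscore, gives the compiler a function value.

    def toHour(x: Long): Int = new java.util.Date(x * 1000).getHours

    sqlContext.registerFunction("toHour", toHour _) // explicit eta-expansion
    sqlContext.registerFunction("toHourLambda", (x: Long) => new java.util.Date(x * 1000).getHours)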

Is there an efficient way to append new data to a registered Spark SQL Table?

2014-12-08 Thread Xuelin Cao
Hi, I'm wondering whether there is an efficient way to continuously append new data to a registered Spark SQL table. This is what I want: I want to build an ad-hoc query service over a JSON-formatted system log. Naturally, the system log is continuously generated. I will use
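A minimal sketch, assuming a SQLContext sqlContext and hypothetical table/path names: a temp table is just a named SchemaRDD, so an "append" can be a unionAll with the new batch followed by re-registering the same name (and re-caching, if the table was cached).

    val existing = sqlContext.table("syslog")
    val newBatch = sqlContext.jsonFile("/logs/latest.json")
    existing.unionAll(newBatch).registerTempTable("syslog")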

Spark SQL: How to get hierarchical elements with SQL?

2014-12-07 Thread Xuelin Cao
Hi, I'm generating a Spark SQL table from an offline JSON file. The difficulty is that the original JSON file has a hierarchical structure, and as a result this is what I get: scala> tb.printSchema root |-- budget: double (nullable = true) |-- filterIp: array (nullable = true) |
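A sketch of querying that nested schema, assuming a SQLContext sqlContext; the file path is a placeholder, and the bracket-index syntax may require the Hive dialect depending on the Spark version. Dot syntax reaches into structs, and [n] indexes arrays such as filterIp.

    val tb = sqlContext.jsonFile("/data/offline.json")
    tb.registerTempTable("tb")
    sqlContext.sql("SELECT budget, filterIp[0] FROM tb").collect().foreach(println)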

Is it possible to just change the value of the items in an RDD without making a full copy?

2014-12-02 Thread Xuelin Cao
Hi, I'd like to perform an operation on an RDD that ONLY changes the value of some items, without making a full copy or a full scan of the data. This is useful when I need to handle a large RDD where each time I only need to change a small fraction of the data and keep the other data
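A hedged note in code form, assuming a SparkContext sc and a hypothetical record type: RDDs are immutable, so there is no true in-place update, but a map that returns the original object for untouched rows copies only references, and nothing is materialized unless the result is cached.

    case class Record(id: Int, score: Double)

    val records = sc.parallelize(Seq(Record(1, 0.5), Record(2, 0.9)))
    val updated = records.map { r =>
      if (r.id == 2) r.copy(score = 0.0) else r // untouched rows pass through by reference
    }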

Is there a way to turn on the Spark eventLog on the worker node?

2014-11-24 Thread Xuelin Cao
Hi, I'm going to debug some Spark applications on our testing platform, and it would be helpful if we could see the eventLog on the worker node. I've tried to turn on spark.eventLog.enabled and set the spark.eventLog.dir parameter on the worker node. However, it doesn't work. I do
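A hedged explanation in code form: spark.eventLog.* is read by the driver when the application starts, not by the workers, which would explain why setting it on worker nodes has no effect. Set it where the application is submitted; the HDFS path is a placeholder, and the directory must already exist.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("debug-run")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///spark-events")
    val sc = new SparkContext(conf)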

Is there a way to turn on the Spark eventLog on the worker node?

2014-11-21 Thread Xuelin Cao
Hi, I'm going to debug some Spark applications on our testing platform, and it would be helpful if we could see the eventLog on the worker node. I've tried to turn on spark.eventLog.enabled and set the spark.eventLog.dir parameter on the worker node. However, it doesn't work. I